内容简介:This is a guest post by Mark Dawson:As Denis alluded to in his articleTake, for instance, a multithreaded application designed to react to favorable trading signals at a financial exchange (e.g., CME or NYSE) by executing an order (e.g., BUY or SELL): a na
This is a guest post by Mark Dawson:
Mark E. Dawson, Jr. is a Sr. Performance Engineer with more than 10 years of experience in systems architecture and low-latency development in the FinTech industry. He enjoys boxing in his spare time at the famed Windy City Boxing Gym and is an avid fan of the Wu Tang Clan.
As Denis alluded to in his article “Performance Analysis of Multithreaded Applications” , profiling a multithreaded application is not as straightforward as doing so with a single-threaded one – its single flow of execution through the CPU microarchitecture can be traced intuitively and have its chokepoints identified clearly, lending itself well to a Top Down Mental Model. However, for multithreaded software where multiple flows execute concurrently, the choke point(s) in any given thread’s flow of execution may not have much bearing at all on the critical path of the application as a whole, presenting a blind spot for a sampling profiler.
Take, for instance, a multithreaded application designed to react to favorable trading signals at a financial exchange (e.g., CME or NYSE) by executing an order (e.g., BUY or SELL): a naive implementation might dedicate one thread for busy-polling the NIC to grab UDP multicast packets containing exchange market data messages which it then inserts into a software queue from which a 2nd thread pulls to process those messages looking for interesting signals, while a 3rd thread waits to be triggered to send a BUY or SELL order. The busiest, most commonly “on-CPU” instructions will derive from the 1st thread since it spins in a tight loop grabbing network packets from the NIC – as a result, a sampling profiler (e.g., Linux perf or Intel VTune or AMD uProf) will most certainly point you in that direction. However, the real aim is to ensure that the application can send an order as quickly as possible after observing a favorable signal (referred to as “tick-to-trade latency”). That code path is executed relatively infrequently, rendering sampling profilers largely ineffective. That’s where COZ comes in.
COZ is a new kind of profiler that fills the gaps left behind by traditional software profilers. It uses a novel technique called “causal profiling”, whereby experiments are conducted during the runtime of an application by virtually speeding up segments of code to predict the effect of certain optimizations. It accomplishes these “virtual speedups” by inserting pauses that slow down all other concurrently running code. It’s actually quite clever. You can read all about it here. Let’s look at an example.
I’ll use the C-Ray benchmark , which is a simple ray tracer that spawns 16 threads per core (the number of recognized cores is configurable via the NUM_CPU_CORES
environment variable) to generate a 1600 x 1200 image. The target machine comprises a dual-socket Skylake Gold 6154 CPU with 36 cores running CentOS 7.3.1611. I used the OS-bundled GCCv4.85 and added the flags -g -fno-omit-frame-pointer -ldl
to the existing compilation flags ( -O3
) of the install script, install.sh
. I also set NUM_CPU_CORES
to ‘3’ to force a longer runtime that would give COZ sufficient experimentation time. Running install.sh compiles the program and creates a Bash driver script named c-ray. I executed it as follows:
mdawson@labmach01:~$ ./c-ray
Running this command 10 times gives an average duration of 31.07 seconds.
Next, I downloaded and built COZ according to the instructions specified here . I located the function provided to each thread inside of pthread_create
– it is called thread_func
. Within that function, I found the loop around the workhorse function, render_scanline
. Since I want to determine what optimization is necessary to execute the render_scanline
function as quickly as possible, I placed the macro COZ_PROGRESS
on the line directly after where render_scanline
is called:
656 void *thread_func(void *tdata) { 657 int i; 658 struct thread_data *td = (struct thread_data*)tdata; 659 660 pthread_mutex_lock(&start_mutex); 661 while(!start) { 662 pthread_cond_wait(&start_cond, &start_mutex); 663 } 664 pthread_mutex_unlock(&start_mutex); 665 666 for(i=0; i<td->sl_count; i++) { 667 render_scanline(xres, yres, i + td->sl_start, td->pixels, rays_per_pixel); 668 + COZ_PROGRESS 669 } 670 671 return 0;
Then, in the c-ray Bash driver script, I prepended line 4 with coz run ---
so that the full line reads:
mdawson@labmach01:~$ coz run --- ./c-ray-mt -t $RT_THREADS -s 3840x2160 -r 16 -i sphfract -o output.ppm > /tmp/c-ray.out 2>&1
Upon completion of a subsequent run of the benchmark, COZ created a file named profile.coz
in my current working directory – this file that contains all of COZ’s performance estimations and the source code line numbers associated with them. Here’s the output after plotting profile.coz
:
It points to line 540
in c-ray-mt.c
as the prime area of focus for overall application performance. According to the graph, if we improve the performance of that line by 20%, COZ expects a corresponding increase in application performance of C-Ray benchmark overall of about 17% – once we reach ~45% improvement of that line, the impact on the application begins to level off by COZ’s estimation. Here is the line 540
of c-ray-mt.c
:
525 /* Calculate ray-sphere intersection, and return {1, 0} to signify hit or no hit. 526 * Also the surface point parameters like position, normal, etc are returned through 527 * the sp pointer if it is not NULL. 528 */ 529 int ray_sphere(const struct sphere *sph, struct ray ray, struct spoint *sp) { 530 double a, b, c, d, sqrt_d, t1, t2; 531 532 a = SQ(ray.dir.x) + SQ(ray.dir.y) + SQ(ray.dir.z); 533 b = 2.0 * ray.dir.x * (ray.orig.x - sph->pos.x) + 534 2.0 * ray.dir.y * (ray.orig.y - sph->pos.y) + 535 2.0 * ray.dir.z * (ray.orig.z - sph->pos.z); 536 c = SQ(sph->pos.x) + SQ(sph->pos.y) + SQ(sph->pos.z) + 537 SQ(ray.orig.x) + SQ(ray.orig.y) + SQ(ray.orig.z) + 538 2.0 * (-sph->pos.x * ray.orig.x - sph->pos.y * ray.orig.y - sph->pos.z * ray.orig.z) - SQ(sph->rad); 539 540 if((d = SQ(b) - 4.0 * a * c) < 0.0) return 0;
It’s computing a square, performing some multiplications, and a subtraction on a few doubles. What does “perf annotate” show as the generated assembly for that line?
7.01 : 401717: mulsd %xmm14,%xmm1 11.15 : 40171c: subsd %xmm1,%xmm7 9.10 : 401720: ucomisd %xmm7,%xmm12 4.68 : 401725: ja 401788 <shade+0x298>
There’s a subtraction instruction that is dependent on the preceding multiplication due to the shared register, %xmm1
. This code block is screaming to be enhanced by a Fused Multiply Add (FMA) instruction – well, in this case, it would be a Fused Multiply Subtract.
I struggled to find a way to get the compiler to generate FMA only for line 540
, from using a generic STDC FP_CONTRACT
pragma to using a Clang-specific pragma – neither one worked for me. So, I had to add -mfma
to the compiler flags, which not only replaced the assembly for line 540
with FMA but also a few other preceding lines in the same file. This makes it a bit difficult to check the predictive value of COZ’s profile.coz
graph plot. Now let’s see what the generated assembly is once we compile with -mfma
option:
12.82 : 4016db: vfmsub231sd %xmm1,%xmm1,%xmm0 12.10 : 4016e0: vucomisd %xmm0,%xmm10 5.55 : 4016e4: ja 401741 <shade+0x1f1>
The number of perf samples attributed to the block of assembly employing FMA is 641,916
while that of the unoptimized block amounts to 955,280
– a 33% improvement. According to the COZ graph plot, this should translate to somewhere between 20 – 30% improvement in overall application performance. When I ran with the new FMA-enabled binary, I measured a runtime duration of 23.43 seconds (vs. the original 31.07 seconds), representing a 25% improvement and a relatively accurate COZ assessment.
In this example, both COZ and Linux perf agree on the source of the bottleneck since the workload type of this benchmark lends itself well to sampling. Next time, we’ll look at a workload where perf points to code that is simply on-CPU most often yet has very little to do with the critical path performance of the application. The whitepaper “COZ: Finding Code that Counts with Causal Profiling” provides a few examples of popular opensource applications where COZ revealed critical-path bottleneck areas that were downplayed by sampling profilers.
以上所述就是小编给大家介绍的《Guest post: COZ vs Sampling Profilers.》,希望对大家有所帮助,如果大家有任何疑问请给我留言,小编会及时回复大家的。在此也非常感谢大家对 码农网 的支持!
猜你喜欢:本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们。
马化腾自述-我的互联网思维
赵黎 / 石油工业出版社 / 2014-8-1 / 35
马化腾自述:我的互联网思维》讲述了些人说移动互联网就是加了“移动”两个字,互联网十几年了,移动互联网应该是个延伸。我的感受是,移动互联网远远不只是一个延伸,甚至是一个颠覆。互联网是一个开放交融、瞬息万变的大生态,企业作为互联网生态里面的物种,需要像自然界的生物一样,各个方面都具有与生态系统汇接、和谐、共生的特性。开放和分享并不是一个宣传口号,也不是一个简单的概念。开放很多时候被看作一种姿态,但是我......一起来看看 《马化腾自述-我的互联网思维》 这本书的介绍吧!