Guest post: COZ vs Sampling Profilers.

Category: IT Technology · Published: 6 years ago


This is a guest post by Mark Dawson:

Mark E. Dawson, Jr. is a Sr. Performance Engineer with more than 10 years of experience in systems architecture and low-latency development in the FinTech industry. He enjoys boxing in his spare time at the famed Windy City Boxing Gym and is an avid fan of the Wu Tang Clan.

As Denis alluded to in his article “Performance Analysis of Multithreaded Applications”, profiling a multithreaded application is not as straightforward as profiling a single-threaded one. A single-threaded program has one flow of execution through the CPU microarchitecture, so it can be traced intuitively and its chokepoints identified clearly, which lends itself well to a Top Down Mental Model. In multithreaded software, where multiple flows execute concurrently, the chokepoint(s) in any given thread’s flow of execution may have little bearing on the critical path of the application as a whole. That is a blind spot for a sampling profiler.

Take, for instance, a multithreaded application designed to react to favorable trading signals at a financial exchange (e.g., CME or NYSE) by executing an order (e.g., BUY or SELL). A naive implementation might dedicate one thread to busy-polling the NIC for UDP multicast packets containing exchange market data messages, which it inserts into a software queue. A 2nd thread pulls messages from that queue and processes them looking for interesting signals, while a 3rd thread waits to be triggered to send a BUY or SELL order. The busiest, most commonly “on-CPU” instructions will come from the 1st thread, since it spins in a tight loop grabbing network packets from the NIC. As a result, a sampling profiler (e.g., Linux perf, Intel VTune, or AMD uProf) will almost certainly point you in that direction. However, the real aim is to send an order as quickly as possible after observing a favorable signal (the “tick-to-trade latency”). That code path executes relatively infrequently, so it collects few samples, and sampling profilers are largely ineffective at analyzing it. That’s where COZ comes in.

COZ is a new kind of profiler that fills the gaps left behind by traditional software profilers. It uses a novel technique called “causal profiling”, whereby experiments are conducted during the runtime of an application by virtually speeding up segments of code to predict the effect of certain optimizations. It accomplishes these “virtual speedups” by inserting pauses that slow down all other concurrently running code. It’s actually quite clever. You can read all about it here. Let’s look at an example.
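The equivalence behind a virtual speedup can be checked with a little arithmetic. The sketch below is a toy model, not COZ's implementation: it assumes just two threads, A and B, that run side by side for a and b seconds, with the program finishing when the slower one does. The function names are made up for illustration.

```c
/* Toy model of a "virtual speedup" (illustrative only, not COZ's code).
 * Two threads run concurrently: A works for `a` seconds, B for `b` seconds.
 * The program's runtime is the longer of the two. */

static double max2(double x, double y) { return x > y ? x : y; }

/* Runtime if thread A's work really shrank by the fraction s (0..1). */
double real_speedup_runtime(double a, double b, double s)
{
    return max2(a * (1.0 - s), b);
}

/* Runtime estimated the "virtual" way: A is left untouched, B is paused
 * for s seconds per second of A's work, and the total inserted pause is
 * subtracted from the measured runtime afterwards. */
double virtual_speedup_runtime(double a, double b, double s)
{
    double delay = s * a;                  /* total pause added to B */
    double measured = max2(a, b + delay);  /* what the clock would show */
    return measured - delay;               /* remove the artificial delay */
}
```

With a = 6, b = 4 and s = 0.5, both functions return 4: slowing B down and subtracting the pause predicts the same runtime as genuinely halving A. Swapping the roles (a = 4, b = 6) leaves the runtime at 6 in both cases, which is the situation where speeding up a busy but non-critical thread buys nothing.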

I’ll use the C-Ray benchmark, which is a simple ray tracer that spawns 16 threads per core (the number of recognized cores is configurable via the NUM_CPU_CORES environment variable) to generate a 1600 x 1200 image. The target machine is a dual-socket Skylake Gold 6154 system with 36 cores running CentOS 7.3.1611. I used the OS-bundled GCC v4.8.5 and added the flags -g -fno-omit-frame-pointer -ldl to the existing compilation flags (-O3) of the install script, install.sh. I also set NUM_CPU_CORES to ‘3’ to force a longer runtime that would give COZ sufficient experimentation time. Running install.sh compiles the program and creates a Bash driver script named c-ray. I executed it as follows:

mdawson@labmach01:~$ ./c-ray

Running this command 10 times gives an average duration of 31.07 seconds.

Next, I downloaded and built COZ according to the instructions specified here. I located the function passed to pthread_create as each thread’s entry point; it is called thread_func. Within that function, I found the loop around the workhorse function, render_scanline. Since I want to determine which optimization would make render_scanline finish as quickly as possible, I placed the COZ_PROGRESS macro on the line directly after the call to render_scanline (the added line is marked with +):

656 void *thread_func(void *tdata) {
657         int i;
658         struct thread_data *td = (struct thread_data*)tdata;
659 
660         pthread_mutex_lock(&start_mutex);
661         while(!start) {
662                 pthread_cond_wait(&start_cond, &start_mutex);
663         }
664         pthread_mutex_unlock(&start_mutex);
665 
666         for(i=0; i<td->sl_count; i++) {
667                 render_scanline(xres, yres, i + td->sl_start, td->pixels, rays_per_pixel);
668 +               COZ_PROGRESS
669         }
670 
671         return 0;
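For readers who want to try a progress point outside of C-Ray, here is a minimal sketch. COZ_PROGRESS is defined in COZ's coz.h header, so the source file has to include it. Everything else here is hypothetical: process_item stands in for render_scanline, run_worker stands in for the body of thread_func, and the no-op fallback exists only so the sketch still builds on a machine without COZ installed.

```c
/* Sketch only: the worker below is hypothetical, not C-Ray's code. */
#if defined(__has_include)
#  if __has_include(<coz.h>)
#    include <coz.h>              /* the real COZ header, if installed */
#  endif
#endif
#ifndef COZ_PROGRESS
#  define COZ_PROGRESS ((void)0)  /* no-op stand-in so the sketch builds without COZ */
#endif

/* Hypothetical unit of work standing in for render_scanline(). */
static long process_item(int i)
{
    return (long)i * i;
}

/* Processes `count` items and marks one progress point per finished item,
 * mirroring the placement used in thread_func above. */
long run_worker(int count)
{
    long sum = 0;
    int i;

    for (i = 0; i < count; i++) {
        sum += process_item(i);
        COZ_PROGRESS;             /* one unit of useful work completed */
    }
    return sum;
}
```

The progress point sits after the unit of work whose completion rate matters, so COZ measures how often the loop finishes an item rather than how busy any one line is.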

Then, in the c-ray Bash driver script, I prepended line 4 with coz run --- so that the full line reads:

mdawson@labmach01:~$ coz run --- ./c-ray-mt -t $RT_THREADS -s 3840x2160 -r 16 -i sphfract -o output.ppm > /tmp/c-ray.out 2>&1

Upon completion of a subsequent run of the benchmark, COZ created a file named profile.coz in my current working directory. This file contains all of COZ’s performance estimations and the source code line numbers associated with them. Here’s the output after plotting profile.coz:

[Figure: COZ plot of profile.coz]

It points to line 540 in c-ray-mt.c as the prime area of focus for overall application performance. According to the graph, if we improve the performance of that line by 20%, COZ expects the C-Ray benchmark overall to improve by about 17%. Once we reach ~45% improvement of that line, the impact on the application begins to level off by COZ’s estimation. Here is line 540 of c-ray-mt.c, along with the lines leading up to it:

525 /* Calculate ray-sphere intersection, and return {1, 0} to signify hit or no hit.
526  * Also the surface point parameters like position, normal, etc are returned through
527  * the sp pointer if it is not NULL.
528  */
529 int ray_sphere(const struct sphere *sph, struct ray ray, struct spoint *sp) {
530         double a, b, c, d, sqrt_d, t1, t2;
531 
532         a = SQ(ray.dir.x) + SQ(ray.dir.y) + SQ(ray.dir.z);
533         b = 2.0 * ray.dir.x * (ray.orig.x - sph->pos.x) +
534             2.0 * ray.dir.y * (ray.orig.y - sph->pos.y) +
535             2.0 * ray.dir.z * (ray.orig.z - sph->pos.z);
536         c = SQ(sph->pos.x) + SQ(sph->pos.y) + SQ(sph->pos.z) +
537             SQ(ray.orig.x) + SQ(ray.orig.y) + SQ(ray.orig.z) +
538             2.0 * (-sph->pos.x * ray.orig.x - sph->pos.y * ray.orig.y - sph->pos.z * ray.orig.z) - SQ(sph->rad);
539 
540         if((d = SQ(b) - 4.0 * a * c) < 0.0) return 0;

It computes a square, a few multiplications, and a subtraction on doubles. What does “perf annotate” show as the generated assembly for that line?

    7.01 :        401717:       mulsd  %xmm14,%xmm1
   11.15 :        40171c:       subsd  %xmm1,%xmm7
    9.10 :        401720:       ucomisd %xmm7,%xmm12
    4.68 :        401725:       ja     401788 <shade+0x298>

There’s a subtraction instruction that is dependent on the preceding multiplication due to the shared register, %xmm1. This code block is screaming to be enhanced by a Fused Multiply Add (FMA) instruction. Well, in this case, it would be a Fused Multiply Subtract.
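To make the fused form concrete, here is a small sketch of the line-540 discriminant written both ways. It is an illustration under assumptions, not the benchmark's code: discriminant and may_hit are hypothetical names, and the fused path is taken only when the compiler predefines __FMA__, which GCC and Clang do when FMA code generation is enabled (for example with -mfma). Otherwise it falls back to the plain expression.

```c
/* Discriminant b*b - 4*a*c, the quantity tested on line 540 of c-ray-mt.c.
 * Hypothetical helper written for illustration. */
double discriminant(double a, double b, double c)
{
#ifdef __FMA__
    /* FMA code generation is enabled: b*b minus (4*a*c) as a single
     * fused multiply-subtract, rounded once. */
    return __builtin_fma(b, b, -(4.0 * a * c));
#else
    /* No FMA: separate multiplies followed by a dependent subtract. */
    return b * b - 4.0 * a * c;
#endif
}

/* Hypothetical helper: returns 0 for the early-out "no hit" case
 * (negative discriminant) and 1 otherwise. */
int may_hit(double a, double b, double c)
{
    return discriminant(a, b, c) < 0.0 ? 0 : 1;
}
```

A negative discriminant means the quadratic for the ray-sphere intersection has no real roots, so the ray misses the sphere and the function can return early.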

I struggled to find a way to get the compiler to generate FMA only for line 540. I tried both the generic STDC FP_CONTRACT pragma and a Clang-specific pragma, and neither worked for me. So, I had to add -mfma to the compiler flags, which replaced the assembly with FMA not only for line 540 but also for a few preceding lines in the same file. This makes it a bit difficult to check the predictive value of COZ’s profile.coz graph plot. Now let’s see the generated assembly once we compile with the -mfma option:

   12.82 :        4016db:       vfmsub231sd %xmm1,%xmm1,%xmm0
   12.10 :        4016e0:       vucomisd %xmm0,%xmm10
    5.55 :        4016e4:       ja     401741 <shade+0x1f1>

The number of perf samples attributed to the block of assembly employing FMA is 641,916, while that of the unoptimized block amounts to 955,280: a 33% improvement. According to the COZ graph plot, this should translate to somewhere between 20% and 30% improvement in overall application performance. When I ran the new FMA-enabled binary, I measured a runtime of 23.43 seconds (vs. the original 31.07 seconds), a 25% improvement and a relatively accurate COZ assessment.
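Both percentages above are plain relative reductions, which can be re-derived from the quoted figures. The helper name percent_reduction is made up for this sketch.

```c
/* Percentage by which `after` is lower than `before`.
 * Hypothetical helper for re-deriving the figures quoted above. */
double percent_reduction(double before, double after)
{
    return 100.0 * (before - after) / before;
}
```

Calling percent_reduction(955280.0, 641916.0) gives roughly 32.8, which rounds to the 33% drop in perf samples, and percent_reduction(31.07, 23.43) gives roughly 24.6, which rounds to the 25% drop in runtime.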

In this example, both COZ and Linux perf agree on the source of the bottleneck, since the workload type of this benchmark lends itself well to sampling. Next time, we’ll look at a workload where perf points to code that is simply on-CPU most often yet has very little to do with the critical-path performance of the application. The whitepaper “COZ: Finding Code that Counts with Causal Profiling” provides a few examples of popular open-source applications where COZ revealed critical-path bottleneck areas that were downplayed by sampling profilers.

