Too much locality... for stores to forward


Apologies for the failed Cake reference.

I’ve been responsible for Backtrace.io’s crash analytics database for a couple of months now. I have focused my recent efforts on improving query times for in-memory grouped aggregations, i.e., the archetypal MapReduce use-case where we generate key-value pairs and fold over the values for each key in some (semi)group. We have a cute cache-efficient data structure for this type of workload; the inner loop simply inserts into a small hash table with Robin Hood linear probing, in order to guarantee that entries in the table are ordered by hash value. This ordering lets us easily dump the entries in sorted order, and block the merge loop for an arbitrary number of sorted arrays into a unified, larger, ordered hash table (which we can, again, dump to a sorted array).
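
Concretely, the insertion routine looks something like the following C sketch (hypothetical names and a simplified element layout; the real code differs, and the __int128 trick is a GCC/Clang extension). The point is that an order-preserving hash-to-slot map plus Robin Hood swaps keeps occupied slots sorted by hash, so a sorted dump is a left-to-right scan.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical element layout; hash == 0 marks an empty slot in this sketch. */
struct element {
        uint64_t hash;
        uint64_t key;
        uint64_t value;
};

/* Order-preserving map from hash to ideal slot (fixed-point scaling). */
static size_t
slot_of(uint64_t hash, size_t capacity)
{
        return (size_t)(((unsigned __int128)hash * capacity) >> 64);
}

/*
 * Robin Hood insertion with linear probing.  Because the hash-to-slot map
 * preserves order and we swap whenever the resident's hash is greater than
 * the incoming element's, occupied slots stay sorted by hash value.
 */
static void
rh_insert(struct element *table, size_t capacity, struct element elt)
{
        for (size_t i = slot_of(elt.hash, capacity); i < capacity; i++) {
                if (table[i].hash == 0) {       /* empty slot: claim it */
                        table[i] = elt;
                        return;
                }

                if (table[i].hash == elt.hash && table[i].key == elt.key) {
                        table[i].value += elt.value;  /* fold in the semigroup */
                        return;
                }

                if (table[i].hash > elt.hash) { /* swap, keep pushing right */
                        struct element displaced = table[i];

                        table[i] = elt;
                        elt = displaced;
                }
        }

        /* A real implementation would grow the table or spill here. */
}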

Observation

As I updated more operators to use this data structure, I noticed that we were spending a lot of time in its inner loop. In fact, perf showed that the query server as a whole was spending 4% of its CPU time on one instruction in that loop:

 2.17 |       movdqu     (%rbx),%xmm0
39.63 |       lea        0x1(%r8),%r14  # that's 40% of the annotated function
      |       mov        0x20(%rbx),%rax
 0.15 |       movaps     %xmm0,0xa0(%rsp)

The first thing to note is that instruction-level profiling tends to put the blame on the instruction following the one that triggered a sampling interrupt. It’s not the lea (which computes r14 <- r8 + 1) that’s slow, but the movdqu just before. So, what is that movdqu loading into xmm0? Maybe it’s just a normal cache miss, something inherent to the workload.

I turned on source locations (hit s in perf report), and observed that this instruction was simply copying to the stack an argument that was passed by address. The source clearly showed that the argument should be hot in cache: the inner loop was essentially

A1. Generate a new key-value pair
B1. Mangle that kv pair just a bit to turn it into a hash element
C1. Insert the new hash element
A2.
B2.
C2.

and the movdqu happens in step C, to copy the element that step B just constructed.
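
In code, the loop body looked roughly like this sketch (hypothetical names; only the shape matters): each iteration builds an element on the stack in step B, and step C immediately reads it back through a pointer.

#include <stddef.h>
#include <stdint.h>

struct kv { uint64_t key; uint64_t value; };
struct element { uint64_t hash; uint64_t key; uint64_t value; };

/* Stand-ins for the real operator code. */
struct kv generate_pair(size_t i);                           /* step A */
struct element make_element(const struct kv *pair);          /* step B */
void hash_insert(struct element *table, size_t capacity,
    const struct element *elt);                              /* step C */

void
aggregate(struct element *table, size_t capacity, size_t n)
{
        for (size_t i = 0; i < n; i++) {
                struct kv pair = generate_pair(i);           /* A */
                struct element elt = make_element(&pair);    /* B */

                /*
                 * C copies the element B just wrote to the stack; the
                 * movdqu in the profile is that copy, waiting on B's stores.
                 */
                hash_insert(table, capacity, &elt);
        }
}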

At this point, an important question suggests itself: does it matter? We could simply increase the size of the base case and speed up the rest of the bottom-up recursion… eventually, the latency for the random accesses in the initial hash table will dominate the inner loop.

When I look into the performance of these deep inner loops, my goal isn’t only to do the same thing better. The big wins, in my experience, come from the additional design freedom that we get from being able to find new uses for the same code. Improved latency, throughput, or memory footprint really shines when the increased optionality from multiple such improvements compounds and lets us consider a much larger design space for the project as a whole. That’s why I wanted to make sure this hash table insertion loop worked on as wide a set of parameters as possible: that will give future me the ability to combine versatile tools.

Hypothesis

Back to the original question. Why do we spend so many cycles loading data we just wrote to cache?

The answer is in the question and in the title of this post: too little time elapses between the instructions that write data to the cache and the ones that read the same data. A modern out-of-order machine (e.g., most amd64 chips) can execute multiple instructions at the same time, and will start executing instructions as soon as their operands are ready, even when earlier instructions in program order are still waiting for theirs. Machine code is essentially a messy way to encode a dataflow graph, which means our job as micro-optimisers is, at a high level, to avoid long dependency chains and make the dataflow graph as wide as possible. When that’s too hard, we should distribute as much scheduling slack as possible between nodes in a chain, in order to absorb the knock-on effects of cache misses and other latency spikes. If we fail, the chip will often find itself with no instruction ready to execute; stalling the pipeline like that is like slowing down by a factor of 10.

The initial inner loop simply executes steps A, B, and C in order, where step C depends on the result of step B, and step B on that of step A. In theory, a chip with a wide enough instruction reordering window could pipeline multiple loop iterations. In practice, real hardware can only plan on the order of 100-200 instructions ahead, and that mechanism depends on branches being predicted correctly. We have to explicitly insert slack in our dataflow schedule, and we must distribute it well enough for instruction reordering to see the gaps.

As it is, the dataflow graph for each loop iteration is a pure chain:

A1
|
v
B1
|
v
C1
       A2
       |
       v
       B2
       |
       v
       C2

How does one add slack? With bounded queues!

Experiment

My first fix was to add a one-element buffer between steps B and C. The inner loop became

A1. Generate a new key-value pair
C0. Insert the hash element from the previous iteration
B1. Mangle the kv pair and stash that in the buffer
A2.
C1.
B2.
etc.

which yields a dataflow graph like

|     A1
v     |
C0    |
      |
      v
      B1
      |
      |     A2
      v     |
      C1    |
            |
            v
            B2
            |

We’ve introduced slack between steps A and B (there’s now step C from the previous iteration between them), and between steps B and C (we shifted step A from the next iteration between them). There isn’t such a long delay between the definition of a value and its use that the data is likely to be evicted from L1. However, there is more than enough work available between them to keep the pipeline busy while C waits for B’s result, or B for A’s. That was a nice single-digit percentage improvement in query latency for my internal benchmark, just by permuting a loop.
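
In C, the permuted loop is just the same steps with a one-element buffer carried across iterations (again, hypothetical names; a sketch, not the real code):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct kv { uint64_t key; uint64_t value; };
struct element { uint64_t hash; uint64_t key; uint64_t value; };

struct kv generate_pair(size_t i);                           /* step A */
struct element make_element(const struct kv *pair);          /* step B */
void hash_insert(struct element *table, size_t capacity,
    const struct element *elt);                              /* step C */

void
aggregate_pipelined(struct element *table, size_t capacity, size_t n)
{
        struct element buffered;    /* one-element buffer between B and C */
        bool have_buffered = false;

        for (size_t i = 0; i < n; i++) {
                struct kv pair = generate_pair(i);           /* A(i) */

                if (have_buffered)                           /* C(i - 1) */
                        hash_insert(table, capacity, &buffered);

                buffered = make_element(&pair);              /* B(i) */
                have_buffered = true;
        }

        if (have_buffered)                                   /* drain C(n - 1) */
                hash_insert(table, capacity, &buffered);
}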

If a one-element buffer helps, we should clearly experiment with the buffer size, and that’s where I found a more impactful speed-up. Once we have an array of elements to insert in a hash table, we can focus on a bulk insert of maybe 8 or 10 elements: instead of trying to improve the latency for individual writes, we can focus on the throughput for multiple inserts at once. That’s good because throughput is an easier problem than latency. In the current case, passing the whole buffer to the hash table code made it easier to pipeline the insert loop in software: we can compute hashes ahead of time, and accelerate random accesses to the hash table with software prefetching. The profile for the new inner loop is flatter, and the hottest part is as follows:

      |       mov        0x8(%rsp),%rdx
 9.91 |       lea        (%r12,%r12,4),%rax
 0.64 |       prefetcht0 (%rdx,%rax,8)
17.04 |       cmp        %rcx,0x28(%rsp)

Again, the blame for a “slow” instruction hits the following instruction, so it’s not the lea (multiplying by 5) or the cmp that’s slow; it’s the load from the stack and the prefetch. The good news is that these instructions do not have any dependents: the load only feeds the prefetch, and the prefetch is only executed for its side effects. Moreover, they come from a block of code that was pipelined in software and executes one full iteration ahead of where its side effects might be useful. It doesn’t really matter if these instructions are slow: they’re still far from being on the critical path! This last restructuring yielded a 20% speed-up on a few slow queries.
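
A rough sketch of the bulk version (not the actual Backtrace code, and simplified to two passes rather than a full software pipeline): the caller hands the hash table a small buffer of elements; a first pass issues a prefetch for each target slot, and a second pass performs the inserts, so the prefetches stay off the critical path. __builtin_prefetch is the GCC/Clang intrinsic, which typically lowers to something like prefetcht0 on x86.

#include <stddef.h>
#include <stdint.h>

struct element { uint64_t hash; uint64_t key; uint64_t value; };

void hash_insert(struct element *table, size_t capacity,
    const struct element *elt);

/* Order-preserving map from hash to ideal slot (same assumption as above). */
static size_t
slot_of(uint64_t hash, size_t capacity)
{
        return (size_t)(((unsigned __int128)hash * capacity) >> 64);
}

void
hash_insert_bulk(struct element *table, size_t capacity,
    const struct element *buf, size_t count)
{
        /* Pass 1: warm up the cache lines we are about to touch. */
        for (size_t i = 0; i < count; i++)
                __builtin_prefetch(&table[slot_of(buf[i].hash, capacity)]);

        /* Pass 2: the actual inserts, hopefully hitting warm lines. */
        for (size_t i = 0; i < count; i++)
                hash_insert(table, capacity, &buf[i]);
}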

I described two tools that I use regularly when optimising code for contemporary hardware. Finding ways to scatter scheduling slack around is always useful, both in software and in real-life planning. However, I think the more powerful one is using buffering to expose bulk operations, which tends to open up more opportunities than just doing the same thing in a loop. In the case above, we found a 20% speed-up which, for someone who visits their Backtrace dashboard a couple of times a day, can add up to an hour or two by the end of the year.

TL;DR: When a function is hot enough to look into, it’s worth asking why it’s called so often, in order to focus on higher level bulk operations.

