内容简介:Faster addition and subtraction on modern CPUsDo you remember how to do long addition on paper?Starting from the “ones” position, we add 6 + 6 = 12, write down a 2 and carry a 1. We proceed to the left, one position at a time, until there are no more digit
Faster addition and subtraction on modern CPUs
Do you remember how to do long addition on paper?
¹¹ ¹ 6876 + 3406 ------ 10282
Starting from the “ones” position, we add 6 + 6 = 12, write down a 2 and carry a 1. We proceed to the left, one position at a time, until there are no more digits to add.
When implementing addition for large integers (e.g. 2 64 and above), it’s common to write code that looks quite similar to this algorithm. Interestingly, there’s a straightforward trick that can speed up this process enormously on modern CPUs.
But first, a question: why do we start long addition with the “ones”? Why not start on the left?
The answer, of course, is the carries. We can’t figure out for sure what a given digit of the answer will be until we’ve completed all of the additions to the right of that digit.
Imagine if we tried to add left-to-right instead:
6 + 3 = 9. So the first digit is 9.
8 + 4 = 12. OK, the second digit is 2… but carry a 1, so the first digit was actually 9 + 1 = 10… now carry back that 1…
For mental math, this isn’t too bad (and some people actually prefer it when working with small enough numbers). As an algorithm, however, this approach has some fundamental limitations that become clear when working with larger numbers. Most importantly, because the later parts of the computation rely on information from the earlier parts of the computation, it’s hard to split up and parallelize the work.
What about computers?
Computers don’t work in base 10, of course. Instead, modern desktop and server CPUs expose an interface for operating on (for the most part) 64-bit integers.
; Add the 64-bit value in B to the 64-bit value in A add A, B ; Note: I'll use letters instead of real register names to keep things simple
As long as our numbers fit within a single 64-bit value, things are easy.
But what if we want to add, say, two 256-bit integers, x
and y
?
The obvious solution would be to break up each 256-bit number into four 64-bit
pieces (commonly referred to as “limbs”).
Place the highest 64 bits of x
into register A,
the next 64 bits into register B,
and so on for registers C and D.
Do the same for y
with registers E, F, G, H.
Now we can add x
and y
by adding the corresponding parts:
; Equivalent to x += y add A, E add B, F add C, G add D, H
But wait, this might give us the wrong result! If one of the last three additions overflow, then we need to “carry” that extra 1 up to the next 64-bit piece. Oh hey, does that sound familiar?
Fortunately, x86 has a dedicated instruction for this called “add with carry”. adc
will automatically check if the previous operation overflowed, adding 1
if needed.
Here’s how the proper code would look:
add D, H adc C, G ; include carry from previous op adc B, F ; include carry from previous op adc A, E ; include carry from previous op
Just like with long addition in base 10, we start with the least-significant “digits” (D and H) and work our way up to the most-significant “digits” (A and E), carrying 1s as needed along the way.
But now it’s slow
Interestingly, our fixed code is slower than the original (incorrect) code. Much slower. Why is this?
The first reason is that adc
is just slower to execute than a normal add
on
most popular x86 CPUs.
Since adc
has a third input (the carry flag),
it’s a more complex instruction than add
.
It’s also used less often than add
,
so there is less incentive for CPU designers to spend chip area on optimizing adc
performance.
The second reason is more interesting. Let’s look at the Intel Haswell microarchitecture as an example.
On a Haswell CPU, a single add
instruction takes 1 cycle to execute.
However, in ideal conditions, Haswell CPUs can execute up to 4 add
instructions in a single cycle.
How is this possible? Parallelism.
Modern processors look ahead at what instructions are coming up and try to
schedule them so that they can be executed in parallel whenever possible.
Since Haswell CPUs have 8 execution ports, and 4 of those ports can execute an
integer add
instruction, a Haswell processor can execute up to 4 add
instructions at once.
In our original adding code, all 4 add
instructions were independent of one
another, so it was straightforward for the processor to run them in parallel.
Now, with adc
, each instruction depends on an output from the previous
instruction.
The processor has no choice but to execute the instructions serially, one after
the other, instead of in parallel.
The performance difference is even more dramatic if we use SIMD (Single
Instruction, Multiple Data) instructions.
For example, a single vpaddq
(Vector Packed Add Quadword) instruction does
four 64-bit adds simultaneously.
Combine that with the fact that Haswell processors can execute two vpaddq
s
per cycle, and you can see that we’re taking a serious performance hit
in order to handle carries properly.
Eliminating carries, part 1: on paper
Back to base 10 for a minute. How can we eliminate the need for carries?
Let’s make some changes to how the number system works. First, we’ll extend the range of digits available. Instead of 0-9, we will use 0-9, A-Z, and *:
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ*
(Yeah, I needed an extra character to make the numbers work out nicely. Bear with me.)
Although we have 37 digits, we are not using base 37. Numbers will still have “ones”, “tens”, and “hundreds” positions, just like a normal base 10 system. 29 still means 29, and 29 + 1 is still 30. The only difference is that digits happen to be capable of counting past 9: 29 + 1 could also be written as 2A, 1K, or even U.
With this new, more flexible number system, we can add without needing any carries!
672415 + 736606 -------- DA8A1B
This trick won’t work for all numbers in our number system (e.g. 9 + W will need a carry), but it will work if the numbers we are adding are normalized , i.e. all of their digits are 9 or below. In fact, we can add up to four normalized numbers in this notation before any carries are possible:
999 <-- largest possible normalized 3-digit number 999 999 + 999 ----- *** <-- valid 3-digit result, no carries (recall that * is the highest digit)
So, with some clever tweaks to the number system, we’ve cheated our way out of some carries. Of course, at some point, we will need to convert from the 37-digit base 10 system back to normal base 10. We can do that by normalizing a number such that each of its digits is between 0 and 9:
¹¹ ¹ ¹ DA8A1B = 1409021 note: D = 10 + 3 A = 10 + 0 B = 10 + 1
We normalize a number starting at the right, determining how many “tens” are in each digit, subtracting those “tens”, and carrying them to the next digit. 672415 and 736606 do in fact sum to 1409021, so the system works!
The key insight here is that we can use this technique to delay carry propagation until the end. We can’t avoid carry propagation altogether, but we can avoid it temporarily. If we save up the carries that occur during the intermediate additions, we can propagate them all in one go at the end.
Eliminating carries, part 2: computers
Carry propagation was at the heart of the performance problems we encountered earlier. As you’ve probably anticipated by now, we can use this technique to help speed up big number arithmetic!
Previously, we split a 256-bit number into four 64-bit pieces, since x86_64 processors operate on 64-bit integers. One way to understand this is to view the pieces as “digits” in base 2 64 , since each digit has a value between 0 and 2 64 - 1 (inclusive).
In base 10, we kept the same base, but extended the range of digits that were allowed in order to prevent carries from occurring. Unfortunately, we can’t do that here – a 64-bit integer only has so many possible values, and we can’t change the hardware. Instead, we can get the same effect by reducing the size of the base.
Instead of splitting 256 bits into four base 2 64 digits, we’ll split 256 bits into five base 2 51 digits. Each digit can still range from 0 to 2 64 - 1, but the smaller base gives us the flexibility needed to prevent digits from needing a carry. This technique is generally referred to as “radix 2 51 representation” in the cryptography literature.
Here’s how it will look when we split 256 bits across five limbs (i.e. digits):
| [--------------------- 52 bits --------------------]| | [-------------------- 51 bits --------------------]| | [-------------------- 51 bits --------------------]| | [-------------------- 51 bits --------------------]| | [-------------------- 51 bits --------------------]|
Each limb has 51 (or 52) bits of the original 256-bit number. The remaining 12 or 13 bits give us the extra “digits” we need for preventing carries. Effectively, the highest bits of each limb are reserved as storage for any carries that occur during the computation.
In our base 10 example, 37 digits allowed us to add up to four normalized numbers before needing to propagate carries. In radix 2 51 representation, 2 64 digits allow us to add up to 2 13 normalized numbers before we need to worry about the high 13 bits overflowing.
Aside: Why 13 bits instead of 12? For our purposes, we’re going to ignore the carries in the most significant limb, allowing numbers to wrap when they overflow past 2 256 - 1 (just like how unsigned addition works in C with normal size integer types). As a result, we can assign 52 bits to the most significant limb and ignore the fact that it will run out of room for carries before the other limbs do.
With this new representation, our addition code now looks like:
; Assume x is split across A, B, C, D, E (A = most significant) ; and assume y is split across F, G, H, I, J (F = most significant) add A, F add B, G add C, H add D, I add E, J ; Parallel goodness, yay!
Despite the fact that we now need 5 add
s instead of 4,
addition is much faster due to the lack of carries.
Of course, we also need code to normalize a number by propagating carries.
; Assume x is split across A, B, C, D, E (A = most significant) ; Register T is for temporary storage mov T, E ; Copy E into T shr T, 51 ; Shift out everything except the carries add D, T ; Add carries from E into D and E, 0x0007FFFFFFFFFFFF ; Zero out the carries in E mov T, D ; Copy D into T shr T, 51 ; Shift out everything except the carries add C, T ; Add carries from D into C and D, 0x0007FFFFFFFFFFFF ; Zero the carries in D mov T, C ; Copy C into T shr T, 51 ; Shift out everything except the carries add B, T ; Add carries from C into B and C, 0x0007FFFFFFFFFFFF ; Zero the carries in C mov T, B ; Copy B into T shr T, 51 ; Shift out everything except the carries add A, T ; Add carries from B into A and B, 0x0007FFFFFFFFFFFF ; Zero the carries in B and A, 0x000FFFFFFFFFFFFF ; Zero the carries in A
Amazingly, some quick and dirty benchmarks show that radix 2 51 addition already outperforms radix 2 64 addition on my Haswell CPU for as few as three additions – and that’s including the cost of converting to and from radix 2 51 representation . The performance savings scale up appropriately as the number of additions increases.
Subtraction
So far we’ve only looked at addition. It’s straightforward though to extend this technique to subtraction. The main difference between addition and subtraction is that subtraction has negative carries.
Previously, we treated all limbs (and their carries) as unsigned integers. To support subtraction, we can treat limbs as signed integers, allowing individual digits to be either positive or negative. With this change, each limb can store either a positive or negative carry.
A side effect of this is that the most significant bit of each limb is now reserved as a sign bit. This lowers the number of operations we can perform between normalizations from 2 13 to 2 12 – a small sacrifice in most cases.
I find this technique rather fascinating because of how counterintuitive it is: by spreading data across more registers and using more operations, performance is actually improved. I hope you found it as interesting as I did!
以上所述就是小编给大家介绍的《The radix 2^51 trick》,希望对大家有所帮助,如果大家有任何疑问请给我留言,小编会及时回复大家的。在此也非常感谢大家对 码农网 的支持!
猜你喜欢:本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们。
The Photoshop Anthology
Corrie Haffly / SitePoint Pty. Ltd. / 2006 / USD 39.95
The Photoshop Anthology is full-color, question-and-answer book for Web Designers who want to use Photoshop to build Websites and create better looking web graphics more effectively. The book covers: ......一起来看看 《The Photoshop Anthology》 这本书的介绍吧!