Check for borrows in bitwise operations

栏目: IT技术 · 发布时间: 4年前

内容简介:In theLooking at the bit patterns,the OP’s solution withIn the context of a sorting library’s partition loop,

In the fourth installment of his series on sorting with AVX2 , @damageboy has a short aside where he tries to detect partitioning (pivot) patterns where elements less than and greater than or equal to the pivot are already in the correct order: in that case, the partitioning routine does not need to permute the block of values. The practical details are irrelevant for this post; what matters is that we wish to quickly identify whether a byte value matches any of the follow nine cases:

0b11111111
0b11111110
0b11111100
0b11111000
0b11110000
0b11100000
0b11000000
0b10000000
0b00000000

Looking at the bit patterns,the OP’s solution with popcount and bitscan is pretty natural. These instructions are somewhat complex (latency closer to 3 cycles than 1, and often port restricted), and it seems like the sort of problem that would have had efficient solutions before SSE4 finally graced x86 with a population count instruction.

In the context of a sorting library’s partition loop, popcnt and bsf is probably more than good enough: the post shows that the real issue is branch mispredictions being slower than permuting unconditionally. This is just a fun challenge to think about (:

Warm-up: is_power_of_two

Detecting whether a machine integer is a power of two (or zero) is another task that has a straightforward solution in terms of popcount or bitscan. There’s also a simpler classic solution to this problem:

x == 0 || is_power_of_two(x) <==> (x & (x - 1)) == 0

How does that expression work? Say x is a power of two. Its binary representation is 0b0...010...0 : any number of leading zeros,a single “1” bit, and trailing zeros (maybe none). Let’s see what happens when we subtract 1 from x :

x           = 0b00...0010...0
     x - 1  = 0b00...0001...1
x & (x - 1) = 0b00...0000...0

The subtraction triggered a chain of borrows throughout the trailing zeros, until we finally hit that 1 bit. In decimal, subtracting one from 10...0 yields 09...9 ; in binary we instead find 01...1 . If you ever studied the circuit depth (latency) of carry chains (for me, that was for circuit complexity theory), you know that this is difficult to do well. Luckily for us, chip makers work hard to pull it off , and we can just use carries as a data-controlled primitive to efficiently flip ranges of bits.

When x is a power of two, x and x - 1 have no “1” bit in common, so taking the bitwise and yields zero. That’s also true when x is 0, since and ing anything with 0 yields zero. Let’s see what happens for non-zero, non-power-of-two values x = 0bxx...xx10..0 , i.e., where x consists of an arbitrary non-zero sequence of bits xx..xx followed by the least set bit (there’s at least one, since x is neither zero nor a power of two), and the trailing zeros:

x           = 0bxx...xx10...0
     x - 1  = 0bxx...xx01...1
x & (x - 1) = 0bxx...xx000000

The leading not-all-zero 0bxx...xx is unaffected by the subtraction, so it passes through the bitwise and unscathed ( and ing any bit with itself yields that same bit), and we know there’s at least one non-zero bit in there; our test correctly rejects it!

Stretching: decoding varints

When decoding variable length integers in ULEB format, e.g., for protocol buffers , it quickly becomes clear that, in order to avoid byte-at-a-time logic, we must rapidly segment ( lex or tokenize , in a way) our byte stream to determine where each ULEB ends. Let’s focus on the fast path, when the encoded ULEB fits in a machine register.

We have uleb = 0bnnnnnnnnmmmmmmmm...1zzzzzzz0yyyyyyy0... : a sequence of byteswith the topmost bit equal to 0, followed by a byte with the top bit set to 1, and, finally, arbitrary nuisance bytes ( m...m , n...n , etc.) we wish to ignore. Ideally, we’d extract data = 0b0000000000000000...?zzzzzzz?yyyyyyy?... from uleb : we want to clear the nuisance bytes, and are fine with arbitrary values in the ULEB’s control bits.

It’s pretty clear that the first thing to do is to clear out everything but potential ULEB control bits (the high bit of each byte), with c = uleb & (128 * (WORD_MAX / 255)) , i.e., compute the bitwise and of uleb with a bitmask of the high bit in each byte.

uleb = 0bnnnnnnnnmmmmmmmm...1zzzzzzz0yyyyyyy0...
      c = 0bn0000000m0000000...10000000000000000...

We could now bitscan to find the index of the first 1 (marking the last ULEB byte), and then generate a mask. However, it seems wasteful to generate an index with a scan, only to convert it back into bitmap space with a shift. We’ll probably still want that index to know how far to advance the decoder’s cursor, but we can hopefully update the cursor in parallel with decoding the current ULEB value.

When we were trying to detect powers of two, we subtracted 1 from x , a value kind of like c , in order to generate a new value that differed from x in all the bits up to and including the first set (equal to 1 ) bit of x , and identical in the remaining bits. We then used the fact that and ing a bit with itself yields that same bit to detect whether there was any non-zero bit in the remainder.

Here, we wish to do something else with the remaining untouched bits, we wish to set them all to zero. Another bitwise operator does what we want: xor ing a bit with itself always yields zero, while xor ing bits that differ yields 1 . That’s the plan for ULEB. We’ll subtract 1 from c and xor that back with c .

uleb   = 0bnnnnnnnnmmmmmmmm...1zzzzzzz0yyyyyyy0...
c           = 0bn0000000m0000000...10000000000000000...
     c - 1  = 0bn0000000m0000000...01111111111111111...
c ^ (c - 1) = 0b0000000000000000...11111111111111111...

We now just have to bitwise and uleb with c ^ (c - 1) to obtain the bits of the first ULEB value in uleb , while overwriting everything else with 0. Once we have that, we can either extract data bits with PEXT , or, if one is stuck with older or AMD chips, pull out interesting stunts for SWAR shifts by variable amounts.

Now for @damageboy ’s problem

Let’s first repeat the question that motivated this post. We want to detect when a byte p is one of the following nine values:

0b11111111
0b11111110
0b11111100
0b11111000
0b11110000
0b11100000
0b11000000
0b10000000
0b00000000

These bit patterns feel similar to those for power of two bytes: if we complement the bits, these values are all 1 less than a power of two (or -1, one less than zero). We already know how to detect when a value x is zero or a power of two ( x & (x - 1) == 0 ), so it’s easy to instead determine whether ~p is one less than zero or a power of two: (~p + 1) & ~p == 0 .

This is already pretty good: bitwise not the byte p , and check if it’s one less than zero or a power of two (three simple instructions on the critical path). We can do better.

There’s another name for ~p + 1 , i.e., for bitwise complementing a value and adding one: that’s simply -p , the additive inverse of p in two’s complement! We can use -p & ~p == 0 . That’s one fewer instruction on the critical path of our dependency graph (down to two, since we can test whether and ing yields zero ), and still only uses simple instructions that are unlikely to be port constrained.

Let’s check our logic by enumerating all byte-sized values.

CL-USER> (dotimes (p 256)
           (when (zerop (logand (- p) (lognot p) 255))
             (format t "0b~2,8,'0r~%" p)))
0b00000000
0b10000000
0b11000000
0b11100000
0b11110000
0b11111000
0b11111100
0b11111110
0b11111111

These are the bytes we’re looking for (in ascending rather than descending order)!

Remember the power of borrows

I hope the examples above communicated a pattern I often observe when mangling bits: operations that are annoying (not hard, just a bit more complex than we’d like) in the bitmap domain can be simpler in two’s complement arithmetic. Arithmetic operations are powerful mutators for bitmaps, but they’re often hard to control. Subtracting or adding 1 are the main exceptions: it’s easy to describe their impact in terms of the low bits of the bitmap. In fact, we can extend that trick to subtracting or adding powers of two: it’s the same carry/borrow chain effect as for 1, except that bits smaller than the power of two pass straight through… which might be useful when we expect a known tag followed by a ULEB value that must be decoded.

If you find yourself wishing for a way to flip ranges of bits in a data-dependent fashion, it’s always worth considering the two’s complement representation of the problem for a couple minutes. Adding or subtracting powers of two doesn’t always work, but the payoff is pretty good when it does.

P.S., Wojciech Muła offers a different 3-operation sequence with -p to solve damageboy’s problem. That’s another nice primitive to generate bitmasks dynamically.


以上就是本文的全部内容,希望对大家的学习有所帮助,也希望大家多多支持 码农网

查看所有标签

猜你喜欢:

本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们

Flash ActionScript 3.0 动画高级教程

Flash ActionScript 3.0 动画高级教程

Keith Peters / 苏金国、荆涛 / 人民邮电出版社 / 2010-1 / 65.00元

《Flash ActionScript 3.0 动画高级教程》是介绍Flash 10 ActionScript动画高级技术的经典之作,是作者在这一领域中多年实践经验的结晶。书中不仅涵盖了3D、最新绘图API以及Pixel Bender等Flash 10 ActionScript特性,深入介绍了碰撞检测、转向、寻路等Flash游戏开发技术,还通过实例具体讲解了等角投影和数值积分的基本理论和应用。 ......一起来看看 《Flash ActionScript 3.0 动画高级教程》 这本书的介绍吧!

JS 压缩/解压工具
JS 压缩/解压工具

在线压缩/解压 JS 代码

CSS 压缩/解压工具
CSS 压缩/解压工具

在线压缩/解压 CSS 代码

UNIX 时间戳转换
UNIX 时间戳转换

UNIX 时间戳转换