The Hunt for the Fastest Zero

栏目: IT技术 · 发布时间: 5年前

内容简介：Let’s say I ask you to fill aIf this were C, we would probably reach forYou might come up with something like:

Let’s say I ask you to fill a char array of size n with zeros. I don’t know why, exactly, but please play along for now.

If this were C, we would probably reach for memset , but let’s pretend we are trying to write idiomatic C++ instead.

You might come up with something like:

void fill1(char *p, size_t n) {
    std::fill(p, p + n, 0);
}

I’d give this solution full marks. In fact, I’d call it more or less the canonical modern C++ solution to the problem.

What if told you there was a solution that was up to about 29 times faster? It doesn’t even requiring sacrificing any goats to the C++ gods, either: just adding three characters:

void fill2(char *p, size_t n) {
    std::fill(p, p + n, '\0');
}

Yes, switching 0 to '\0' speeds this up by nearly a factor of thirty on my SKL box, at least with my default compiler (gcc) and optimization level (-O2):

Function	Bytes / Cycle
fill1	1.0
fill2	29.1

The why is becomes obvious if you look at the assembly :

fill1:

fill1(char*, unsigned long):
        add     rsi, rdi
        cmp     rsi, rdi
        je      .L1
.L3:
        mov     BYTE PTR [rdi], 0  ; store 0 into memory at [rdi]             
        add     rdi, 1             ; increment rdi
        cmp     rsi, rdi           ; compare rdi to size    
        jne     .L3                ; keep going if rdi < size
.L1:
        ret

This version is using a byte-by-byte copy loop, which I’ve annotated – it is more or less a 1:1 translation of how you’d imagine std::fill is written. The result of 1 cycle per byte is exactly what we’d expect usingspeed limit analysis: it is simultaneously limited by two different bottlenecks: 1 taken branch per cycle, and 1 store per cycle.

The fill2 version doesn’t have a loop at all:

fill2:

fill2(char*, unsigned long):
        test    rsi, rsi
        jne     .L8                ; skip the memcpy call if size == 0
        ret
.L8:
        mov     rdx, rsi
        xor     esi, esi
        jmp     memset             ; tailcall to memset

Rather, it simply defers immediately to memset . We aren’t going to dig into the assembly for memset here, but the fastest possible memset would run at 32 bytes/cycle, limited by 1 store/cycle and maximum vector the width of 32 bytes on my machine, so the measured value of 29 bytes/cycle indicates it’s using an implementation something along those lines.

So that’s the why , but what’s the why of the why (second order why)?

I thought this had something to do with the optimizer. After all, at -O3 even the fill1 version using the plain 0 constant calls memset .

I was wrong, however. The answer actually lies in the implementation of the C++ standard library (there are various, gcc is using libstdc++ in this case). Let’s take a look at the implementation of std::fill (I’ve reformatted the code for clarity and removed some compile-time concept checks):

/*
   *  ...
   *
   *  This function fills a range with copies of the same value.  For char
   *  types filling contiguous areas of memory, this becomes an inline call
   *  to @c memset or @c wmemset.
  */
  template<typename _ForwardIterator, typename _Tp>
  inline void fill(_ForwardIterator __first, _ForwardIterator __last, const _Tp& __value)
  {
    std::__fill_a(std::__niter_base(__first), std::__niter_base(__last), __value);
  }

The included part of the comment already hints at what is to come: the implementor of std::fill has apparently considered specifically optimizing the call to a memset in some scenarios. So we keep following the trail, which brings us to the helper method std::__fill_a . There are two overloads that are relevant here, the general method and an overload which handles the special case:

template<typename _ForwardIterator, typename _Tp>
  inline typename
  __gnu_cxx::__enable_if<!__is_scalar<_Tp>::__value, void>::__type
  __fill_a(_ForwardIterator __first, _ForwardIterator __last, const _Tp& __value)
  {
    for (; __first != __last; ++__first)
      *__first = __value;
  }
    
  // Specialization: for char types we can use memset.
  template<typename _Tp>
  inline typename
  __gnu_cxx::__enable_if<__is_byte<_Tp>::__value, void>::__type
  __fill_a(_Tp* __first, _Tp* __last, const _Tp& __c)
  {
    const _Tp __tmp = __c;
    if (const size_t __len = __last - __first)
      __builtin_memset(__first, static_cast<unsigned char>(__tmp), __len);
  }

Now we see how the memset appears. It is called explicitly by the second implementation shown above, selected by enable_if when the SFINAE condition __is_byte<_Tp> is true. Note, however, that unlike the general function, this variant has a single template argument: template<typename _Tp> , and the function signature is:

__fill_a(_Tp* __first, _Tp* __last, const _Tp& __c)

Hence, it will only be considered when the __first and __last pointers which delimit the range have the exact same type as the value being filled . When when you write std::fill(p, p + n, 0) where p is char * , you rely on template type deduction for the parameters, which ends up deducing char * and int for the iterator type and value-to-fill type, because 0 is an integer constant .

That is, it is if you had written:

std::fill<char *, int>(p, p + n, 0);

This prevents the clever memset optimization from taking place: the overload that does it is never called because the iterator value type is different than the value-to-fill type.

This suggests a fix: we can simply force the template argument types rather than rely on type deduction:

void fill3(char *p, size_t n) {
    std::fill<char *, char>(p, p + n, 0);
}

This way, we get the memset version .

Finally, why does fill2 using '\0' get the fast version, without forcing the template arguments? Well '\0' is a char constant, so the value-to-assign type is char . You could achieve the same effect with a cast, e.g., static_cast<char>(0) – and for buffers which have types like unsigned char this is necessary because '\0' does not have the same type as unsigned char (at least on gcc ).

One might reasonably ask if this could be fixed in the standard library. I think so.

One idea would be to keying off of only the type of the first and last pointers, like this:

template<typename _Tp, typename _Tvalue>
  inline typename
  __gnu_cxx::__enable_if<__is_byte<_Tp>::__value, void>::__type
  __fill_a(_Tp* __first, _Tp* __last, const _Tvalue& __c)
  {
    const _Tvalue __tmp = __c;
    if (const size_t __len = __last - __first)
      __builtin_memset(__first, static_cast<unsigned char>(__tmp), __len);
  }

This says: who cares about the type of the value, it is going to get converted during assignment to the value type of the pointer anyways, so just look at the pointer type. E.g., if the type of the value-to-assign _Tvalue is int , but _Tp is char then this expands to this version, which is totally equivalent:

__fill_a(char* __first, char* __last, const int& __c)
  {
    const int __tmp = __c;
    if (const size_t __len = __last - __first)
      __builtin_memset(__first, static_cast<unsigned char>(__tmp), __len);
  }

This works … for simple types like int . Where it fails is if the value to fill has a tricky, non-primitive type, like this:

struct conv_counting_int {
    int v_;
    mutable size_t count_ = 0;

    operator char() const {
        count_++;
        return (char)v_;
    }
};

size_t fill5(char *p, size_t n) {
    conv_counting_int zero{0};
    std::fill(p, p + n, zero);
    return zero.count_;
}

Here, the pointer type passed to std::fill is char , but you cannot safely apply the memset optimization above, since the conv_counting_int counts the number of types it is converted to char , and this value will be wrong (in particular, it will be 1 , not n ) if you perform the above optimization.

This can be fixed. You could limit the optimization to the case where the pointer type is char-like and the value-to-assign type is “simple” in the sense that it won’t notice how many times it has been converted. A sufficient check would be that the type is scalar, i.e. std::is_scalar<T> – although there is probably a less conservative check possible. So something like this for the SNIFAE check:

template<typename _Tpointer, typename _Tp>
    inline typename
    __gnu_cxx::__enable_if<__is_byte<_Tpointer>::__value && __is_scalar<_Tp>::__value, void>::__type
    __fill_a( _Tpointer* __first,  _Tpointer* __last, const _Tp& __value) {
      ...

Here’s an example of how that would work. It’s not fully fleshed out but shows the idea.

Finally, one might ask why memset is used when gcc is run at -O3 or when clang is used ( like this ). The answer is the optimizer. Even if the compile-time semantics of the language select what appears to a byte-by-byte copy loop, the compiler itself can transform that into memset , or something else like a vectorized loop, if it can prove it is as-if equivalent. That recognition happens at -O3 for gcc but at -O2 for clang.

Source

The source for my benchmark is available on GitHub .

Thanks

Thanks to Matt Godbolt for creatingGodbolt, without which this type of investigation would be much more painful and often wouldn’t happen at all.

tc for finding a typo.

Discuss

I am still working on my comments system (no, I don’t want Disqus), but in the meantime you can discuss this post on Hacker News .

If you liked this post, check out thehomepage for others you might enjoy.

以上就是本文的全部内容，希望对大家的学习有所帮助，也希望大家多多支持码农网

查看所有标签

猜你喜欢:

The Hunt for the Fastest Zero

本站部分资源来源于网络，本站转载出于传递更多信息之目的，版权归原作者或者来源机构所有，如转载稿涉及版权问题，请联系我们。

码农书籍

软件人才管理的艺术

Michael Lopp / 罗小平 / 人民邮电出版社 / 201008 / 35.00元

本书作者具有15年的硅谷人才管理经验，他在博客上发表了大量探讨软件人才的管理之道的文章，深受读者欢迎。本书素材取自他的博客文章，用深入浅出的语言，讲述发人深思的道理，具有很强的现实操作性。本书分为三大部分：“管理的箭袋”、“过程就是产品”、“你的其他版本”。前两部分分别讲述了人员与产品的管理，第三部分除了讨论管理之外，还讲述了如何有针对性地准备简历和电话面试，来提高自己面试成功的几率。书中......一起来看看《软件人才管理的艺术》这本书的介绍吧!

码农工具