AVX-512 Auto-Vectorization in MSVC

Rui

February 27th, 2020

In Visual Studio 2019 version 16.3 we added AVX-512 support to the auto-vectorizer of the MSVC compiler. This post will show some examples and help you enable it in your projects.

What is the auto vectorizer?

The compiler’s auto vectorizer analyzes loops in the user’s source code and generates vectorized code for a vectorization target where feasible and beneficial.

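#include <stdint.h>  // uint32_t
#include <stdlib.h>  // _countof (MSVC-specific)

static const int length = 1024 * 8;
static float a[length];

// Averages all elements of a; this loop is what the auto vectorizer targets.
float scalarAverage() {
    float sum = 0.0f;
    for (uint32_t j = 0; j < _countof(a); ++j) {
        sum += a[j];
    }

    return sum / _countof(a);
}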

For example, if I build the code above using cl.exe /O2 /fp:fast /arch:AVX2, I get the following assembly. Lines 11-15 of the listing (offsets 00000020 through 00000038) are the vectorized loop using ymm registers. Lines 16-21 (offsets 0000003A through 00000050) calculate the scalar value sum from the vector values coming out of the vector loop. Each ymm register holds eight floats, so every vaddps performs eight additions at once; because the loop is also unrolled by two, it advances 16 elements per iteration. Running far fewer iterations than the scalar loop usually translates to improved performance.

?scalarAverage@@YAMXZ (float __cdecl scalarAverage(void)):
00000000: push ebp
00000001: mov ebp,esp
00000003: and esp,0FFFFFFF0h
00000006: sub esp,10h
00000009: xor eax,eax
0000000B: vxorps xmm1,xmm1,xmm1
0000000F: vxorps xmm2,xmm2,xmm2
00000013: nop dword ptr [eax]
00000017: nop word ptr [eax+eax]
00000020: vaddps ymm1,ymm1,ymmword ptr ?a@@3PAMA[eax]
00000028: vaddps ymm2,ymm2,ymmword ptr ?a@@3PAMA[eax+20h]
00000030: add eax,40h
00000033: cmp eax,8000h
00000038: jb 00000020
0000003A: vaddps ymm0,ymm2,ymm1
0000003E: vhaddps ymm0,ymm0,ymm0
00000042: vhaddps ymm1,ymm0,ymm0
00000046: vextractf128 xmm0,ymm1,1
0000004C: vaddps xmm0,xmm1,xmm0
00000050: vmovaps xmmword ptr [esp],xmm0
00000055: fld dword ptr [esp]
00000058: fmul dword ptr [__real@39000000]
0000005E: vzeroupper
00000061: mov esp,ebp
00000063: pop ebp
00000064: ret
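
The shape of this vector loop can be mimicked in portable C++. The following is a minimal sketch for illustration, not what the compiler emits: the names blockedAverage and kLanes are hypothetical, and plain arrays stand in for the ymm registers. It keeps two independent 8-lane accumulators (like ymm1 and ymm2), advances 16 floats per iteration, and only combines the lanes after the loop. This changes the order in which the floats are added, so the result can differ slightly from a strict left-to-right sum; that kind of reordering is what /fp:fast permits.

```cpp
#include <cstddef>

// Number of float lanes in one 256-bit ymm register.
constexpr std::size_t kLanes = 8;

// Hypothetical helper: averages n floats the way the vector loop does.
// Assumes n is a nonzero multiple of 2 * kLanes, as 8192 is in the example.
float blockedAverage(const float* data, std::size_t n) {
    float acc1[kLanes] = {};  // stands in for ymm1
    float acc2[kLanes] = {};  // stands in for ymm2

    // One trip of this loop corresponds to one trip of the vector loop:
    // two 8-wide adds, then the index advances by 16 floats (40h bytes).
    for (std::size_t i = 0; i < n; i += 2 * kLanes) {
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            acc1[lane] += data[i + lane];
            acc2[lane] += data[i + kLanes + lane];
        }
    }

    // After the loop, fold the 16 partial sums into one scalar.
    float sum = 0.0f;
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
        sum += acc1[lane] + acc2[lane];
    }
    return sum / static_cast<float>(n);
}
```

For the 8192-element array in the example, the outer loop of this sketch runs 512 times, which matches the byte counter in the listing going from 0 to 8000h in steps of 40h.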

What is AVX-512?

AVX-512 is a family of processor extensions introduced by Intel which enhance vectorization by extending vectors to 512 bits, doubling the number of vector registers, and introducing element-wise operation masking. You can detect support for AVX-512 using the __isa_available variable, which will be 6 or greater if AVX-512 support is found. This indicates support for the F (Foundation) instructions, as well as instructions from the VL, BW, DQ, and CD extensions, which add additional integer vector operations, 128-bit and 256-bit operations with the additional AVX-512 registers and masking, and instructions to detect address conflicts with scatter stores. These are the same instructions that are enabled by /arch:AVX512 as described below. These extensions are available on all processors with AVX-512 that Windows officially supports. More information about AVX-512 can be found in the blog posts that we published before.
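
As a rough sketch of the runtime check described above, the snippet below compares __isa_available against 6. Several things here are assumptions rather than anything the post specifies: the helper name hasAvx512 and the constant kIsaAvailableAvx512 are made up, the extern declaration assumes the CRT exposes the variable with C linkage, and the GCC/Clang branch is only a fallback so the sketch builds with other compilers.

```cpp
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
// Assumption: the MSVC CRT exposes this variable with C linkage. Treat the
// declaration as illustrative rather than as a documented public API.
extern "C" int __isa_available;
#endif

// Value quoted in the text: 6 or greater means AVX-512 (F, VL, BW, DQ, CD).
constexpr int kIsaAvailableAvx512 = 6;

// Hypothetical helper: true if AVX-512 is reported as usable at run time.
bool hasAvx512() {
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    return __isa_available >= kIsaAvailableAvx512;
#elif (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    // Non-MSVC fallback, not from the post: query the same five extensions.
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx512f") &&
           __builtin_cpu_supports("avx512vl") &&
           __builtin_cpu_supports("avx512bw") &&
           __builtin_cpu_supports("avx512dq") &&
           __builtin_cpu_supports("avx512cd");
#else
    return false;  // Unknown compiler or non-x86 target.
#endif
}
```

A program that must also run on older hardware could call a helper like this once at startup and choose between an AVX-512 code path and a fallback accordingly.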

How to enable AVX-512 vectorization?

/arch:AVX512 is the compiler switch to enable AVX-512 support including auto vectorization. With this switch, the auto vectorizer may vectorize a loop using instructions from the F, VL, BW, DQ, and CD extensions in AVX-512.

To build your application with AVX-512 vectorization enabled:

  • In the Visual Studio IDE, you can either add the flag /arch:AVX512 to the project Property Pages > C/C++ > Command Line > Additional Options text box, or choose Project Properties > Configuration Properties > C/C++ > Code Generation > Enable Enhanced Instruction Set > Advanced Vector Extension 512 (/arch:AVX512). The second approach is available in Visual Studio 2019 version 16.4.
  • If you compile from the command line using cl.exe, add the flag /arch:AVX512 before any /link options.
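
One way to confirm that the switch actually reached the compiler is to look at the predefined macros. MSVC documents __AVX512F__, __AVX512CD__, __AVX512BW__, __AVX512DQ__, and __AVX512VL__ as being defined when /arch:AVX512 is specified. The helper name compiledIsa below is hypothetical, and this is only a sketch of the idea:

```cpp
// Hypothetical helper: names the widest instruction set this translation
// unit was compiled for, based on the compiler's predefined macros.
const char* compiledIsa() {
#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512BW__) && \
    defined(__AVX512DQ__) && defined(__AVX512CD__)
    return "AVX-512";   // e.g. built with /arch:AVX512
#elif defined(__AVX2__)
    return "AVX2";      // e.g. built with /arch:AVX2
#elif defined(__AVX__)
    return "AVX";       // e.g. built with /arch:AVX
#else
    return "baseline";  // no AVX-level /arch switch in effect
#endif
}
```

Note that this reports what the code was compiled for, not what the machine running it supports.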

If I build the prior example again using cl.exe /O2 /fp:fast /arch:AVX512, I’ll get the following assembly targeting AVX-512. Similarly, lines 7-11 of the listing (offsets 00000010 through 0000002C) are the vectorized loop. Note that the loop is vectorized with zmm registers instead of ymm registers. Because zmm registers are twice as wide, the AVX-512 vector loop needs only half as many iterations as its AVX2 version.

?scalarAverage@@YAMXZ (float __cdecl scalarAverage(void)):
00000000: push ecx
00000001: vxorps xmm0,xmm0,xmm0
00000005: vxorps xmm1,xmm1,xmm1
00000009: xor eax,eax
0000000B: nop dword ptr [eax+eax]
00000010: vaddps zmm0,zmm0,zmmword ptr ?a@@3PAMA[eax]
0000001A: vaddps zmm1,zmm1,zmmword ptr ?a@@3PAMA[eax+40h]
00000024: sub eax,0FFFFFF80h
00000027: cmp eax,8000h
0000002C: jb 00000010
0000002E: vaddps zmm1,zmm0,zmm1
00000034: vextractf32x8 ymm0,zmm1,1
0000003B: vaddps ymm1,ymm0,ymm1
0000003F: vextractf32x4 xmm0,ymm1,1
00000046: vaddps xmm1,xmm0,xmm1
0000004A: vpsrldq xmm0,xmm1,8
0000004F: vaddps xmm1,xmm0,xmm1
00000053: vpsrldq xmm0,xmm1,4
00000058: vaddss xmm0,xmm0,xmm1
0000005C: vmovss dword ptr [esp],xmm0
00000061: fld dword ptr [esp]
00000064: fmul dword ptr [__real@39000000]
0000006A: vzeroupper
0000006D: pop ecx
0000006E: ret
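
After the loop, the listing first adds the two zmm accumulators together and then folds the 16 lanes of the result down to one value by repeatedly adding the upper half onto the lower half: 16 to 8 (vextractf32x8), 8 to 4 (vextractf32x4), 4 to 2 (vpsrldq by 8 bytes), and 2 to 1 (vpsrldq by 4 bytes). The portable C++ below is a sketch of that halving pattern only; the name reduceLanes is hypothetical and a plain array stands in for the zmm register.

```cpp
#include <cstddef>

// Hypothetical helper: sums 16 partial sums by repeatedly adding the upper
// half of the active lanes onto the lower half (16 -> 8 -> 4 -> 2 -> 1).
float reduceLanes(const float (&lanes)[16]) {
    float work[16];
    for (std::size_t i = 0; i < 16; ++i) {
        work[i] = lanes[i];
    }

    // Each pass halves the number of active lanes, like one extract-or-shift
    // plus one add in the listing.
    for (std::size_t width = 8; width >= 1; width /= 2) {
        for (std::size_t i = 0; i < width; ++i) {
            work[i] += work[i + width];
        }
    }
    return work[0];
}
```

The halving approach takes four add steps for 16 lanes instead of 15 sequential additions.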

Closing remarks

For this release, we aimed to achieve parity with /arch:AVX2 in terms of vectorization capability. There are still many things that we plan to improve in future releases. For example, our next AVX-512 improvement will take advantage of the new masked instructions. Subsequent updates will support embedded broadcast, scatter, and 128-bit and 256-bit AVX-512 instructions in the VL extension.

As always, we’d like to hear your feedback and encourage you to download Visual Studio 2019 to try it out. If you encounter any issue or have any suggestion for us, please let us know through Help > Send Feedback > Report A Problem / Suggest a Feature in the Visual Studio IDE, via Developer Community, or on Twitter @visualc.

