I main(){I w=960,h=540,s=16;V e(-22,5,25),g=!(V(-3,4,0)+e*-1),l=!V(g.z, 0,-g.x)*(1./w),u(g.y*l.z-g.z*l.y,g.z*l.x-g.x*l.z,g.x*l.y-g.y*l.x);printf("P\ 6 %d %d 255 ",w,h);for(I y=h;y--;)for(I x=w;x--;){V c;for(I p=s;p--;)c=c+T(e ,!(g+l*(x-w/2+U())+u*(y-h/2+U())));c=c*(1./s)+14./241;V o=c+1;c=V(c.x/o.x,c. y/o.y,c.z/o.z)*255;printf("%c%c%c",(I)c.x,(I)c.y,(I)c.z);}}
// Andrew Kensler
Usual tricks, Vector class, Utils, Database, Ray marching, Sampling, main.
With a bit of renaming and clang-tidy, a.cpp gives a much clearer picture.
Establishing the baseline
To make testing easier, I modified the code to take the number of samples per pixel (spp) as a parameter. The one-minute budget was completely blown when establishing the baseline (with all optimizations disabled). Even with spp lowered to one, it took 5m27s to generate a very noisy image.
$ clang -O0 -lm -o a a.cpp
$ time ./a 1 > /dev/null
real 5m27.311s
user 5m27.094s
sys 0m0.078s
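Passing the sample count on the command line could be sketched as follows. This is only an illustration: the helper name `parse_spp`, the upper bound, and the fallback behaviour are assumptions, not the code in a.cpp. The default of 16 mirrors the `s=16` in the original postcard.

```cpp
#include <cstdlib>

// Hypothetical helper (not taken from a.cpp): reads the samples-per-pixel
// count from argv[1] and falls back to a default when the argument is
// missing, not a number, or out of a sane range.
int parse_spp(int argc, char **argv, int fallback = 16) {
    if (argc < 2) return fallback;
    char *end = nullptr;
    long value = std::strtol(argv[1], &end, 10);
    // Reject empty input, trailing garbage, and non-positive or huge counts.
    if (end == argv[1] || *end != '\0' || value <= 0 || value > 1000000)
        return fallback;
    return static_cast<int>(value);
}
```

Calling `parse_spp(argc, argv)` at the top of `main` and using the result in place of the hard-coded `s` is enough to run `./a 1`, `./a 6`, and so on.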
Compiler optimizations
The first step is to enable compiler optimizations. Some people would even argue that there is nothing below -O3 and that it should have been the baseline. Performance improved roughly 30x, allowing 6spp within the budget.
$ clang -lm -O3 -o a a.cpp
$ time ./a 6 > /dev/null
real 0m58.100s
user 0m58.266s
sys 0m0.031s
-ffast-math
There is a compilation flag, -ffast-math, which allows the compiler to relax IEEE 754 compliance in favor of performance. It is automatically enabled when using -Ofast and shows another 2.6x performance improvement allowing 16spp.
$ clang -lm -Ofast -o a a.cpp
$ time ./a 16 > /dev/null
real 0m56.304s
user 0m56.266s
sys 0m0.031s
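One of the liberties -ffast-math grants is reassociating floating-point sums, which is not value-preserving under IEEE 754. The two hypothetical functions below (names are mine, not from a.cpp) add the same three numbers with different grouping and illustrate why the strict rules forbid that rewrite.

```cpp
// Both functions compute a + b + c; only the grouping differs.
// Under strict IEEE 754 the compiler must keep the grouping as written.
// With -ffast-math it is allowed to treat the two as interchangeable.
float sum_left(float a, float b, float c) {
    return (a + b) + c;
}

float sum_right(float a, float b, float c) {
    return a + (b + c);
}
```

Compiled without -ffast-math, `sum_left(1e20f, -1e20f, 1.0f)` yields 1 while `sum_right(1e20f, -1e20f, 1.0f)` yields 0, because the 1 is absorbed when added to -1e20 first. A path tracer averaging noisy samples tolerates this kind of error well, which is presumably why the flag is a safe win here.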
Going SIMD
Going SIMD is something I also did when I revisited the business card raytracer. Or at least I thought I did. I was mistaken when after opening the binary and seeing XMM registers in use I concluded that SIMD instructions were used.
Just because the compiler is outputting SIMD instructions does not mean it's taken care of. In the asm you can see that pretty much all mul/add instructions have an ss or sd suffix, meaning they operate on a single data element. What you want is to have mulps/mulpd instructions.
Although Clang is able to generate SIMD instructions via auto-vectorization, I found the feature capricious to trigger.
struct Vec{
  float p[3];
  // NOT SIMD!!
  Vec operator+(Vec o){
    Vec v;
    for(int i=0 ; i<3 ; i++){
      v.p[i] = p[i] + o.p[i];
    }
    return v;
  }
};

struct Vec{
  float p[4];
  // NOT SIMD!!
  Vec operator+(Vec o){
    Vec v;
    for(int i=0 ; i<3 ; i++){
      v.p[i] = p[i] + o.p[i];
    }
    return v;
  }
};

struct Vec{
  float p[4];
  // SIMD!!
  Vec operator+(Vec o){
    Vec v;
    for(int i=0 ; i<4 ; i++){
      v.p[i] = p[i] + o.p[i];
    }
    return v;
  }
};
In a professional environment I would probably not have felt confident relying on the compiler's good will. Using intrinsics via "nmmintrin.h", __m128, _mm_set_ps, _mm_add_ps, and _mm_div_ps would have been safer. For research purposes, however, it was fine.
Modifying the Vector struct to operate on four items (b.cpp) secured the packed SIMD instructions mulps and addps.
$ clang -lm -Ofast -o b b.cpp
$ time ./b 17 > /dev/null
real 0m59.722s
user 0m59.719s
sys 0m0.000s
The speed increase was marginal (17spp) and the visual difference imperceptible.
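For comparison, the intrinsics route mentioned above could look roughly like the sketch below. It is not what b.cpp does: the `Vec4` name, the zero-padded fourth lane, and the `lane` accessor are assumptions made for illustration, and the code only builds on x86 targets with SSE.

```cpp
#include <xmmintrin.h> // SSE: __m128, _mm_set_ps, _mm_add_ps, _mm_mul_ps

// Hypothetical 4-lane vector. Lanes 0..2 hold x, y, z; lane 3 is padding
// and is kept at zero so it never produces NaN or Inf in add/mul.
struct Vec4 {
    __m128 v;

    // _mm_set_ps takes its arguments from the highest lane to the lowest,
    // so x ends up in lane 0.
    Vec4(float x, float y, float z) : v(_mm_set_ps(0.0f, z, y, x)) {}
    explicit Vec4(__m128 raw) : v(raw) {}

    // One packed add (addps) instead of three scalar adds (addss).
    Vec4 operator+(Vec4 o) const { return Vec4(_mm_add_ps(v, o.v)); }

    // Broadcast the scalar to all four lanes, then one packed multiply (mulps).
    Vec4 operator*(float s) const { return Vec4(_mm_mul_ps(v, _mm_set1_ps(s))); }

    // Copies the register to memory to read a single lane (for inspection only).
    float lane(int i) const {
        alignas(16) float out[4];
        _mm_store_ps(out, v);
        return out[i];
    }
};
```

With this layout, `Vec4(1, 2, 3) + Vec4(10, 20, 30)` holds 11, 22, 33 in its first three lanes. The explicit intrinsics guarantee packed instructions regardless of whether the auto-vectorizer decides to cooperate.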
Using all cores
At this point, it was time to use the 11 other hardware threads on the Ryzen 5 2600 (six cores, twelve threads). The task was considerably facilitated by the code from the previous exercise. A few calls to pthread_create and pthread_join resulted in c.cpp, which spawns 540 threads, each rendering one line of 960 pixels.
$ clang -lm -lpthread -Ofast -o c c.cpp
$ time ./c 128 > /dev/null
real 0m55.203s
user 10m52.188s
sys 0m0.063s
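The one-thread-per-row layout can be sketched as below. Note the substitutions: this uses std::thread rather than the pthread calls in c.cpp, and `render_row` is a hypothetical stand-in that writes a dummy value where the real code would trace rays.

```cpp
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for the per-row work. Each call only touches the
// pixels of its own row, so the threads never write to the same element.
void render_row(std::vector<float> &image, int width, int y) {
    for (int x = 0; x < width; ++x)
        image[static_cast<std::size_t>(y) * width + x] = static_cast<float>(x + y);
}

// Spawns one thread per image row, then waits for all of them to finish.
std::vector<float> render_parallel(int width, int height) {
    std::vector<float> image(static_cast<std::size_t>(width) * height, -1.0f);
    std::vector<std::thread> workers;
    workers.reserve(static_cast<std::size_t>(height));
    for (int y = 0; y < height; ++y)
        workers.emplace_back(render_row, std::ref(image), width, y);
    for (std::thread &t : workers)
        t.join();
    return image;
}
```

Spawning 540 threads on twelve hardware threads is wasteful in principle, but with rows of uneven cost it lets the OS scheduler balance the load without any extra code.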
As expected, the performance gain scaled with the number of cores. Twelve hardware threads allowed me to push sampling to 128spp. However, the visual result had too much personality for my liking.
Nope. LCG is fine. The problem was that each thread's pseudo-random number generator was seeded with the same value, which means the same "random" series was used on every line. Changing the seed to use the thread id fixed the glitch (d.cpp).
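The seeding bug is easy to reproduce in isolation. The sketch below is an assumption-laden stand-in: the generator uses the well-known Numerical Recipes LCG constants rather than whatever d.cpp uses, and the base seed 12345 and both factory functions are hypothetical.

```cpp
#include <cstdint>

// Minimal 32-bit LCG (Numerical Recipes constants; an assumption, not
// necessarily the generator used in the original code).
struct Lcg {
    std::uint32_t state;
    explicit Lcg(std::uint32_t seed) : state(seed) {}

    std::uint32_t next() {
        state = state * 1664525u + 1013904223u; // wraps modulo 2^32
        return state;
    }

    // Uniform float in [0, 1) built from the top 24 bits of the state.
    float uniform() { return (next() >> 8) * (1.0f / 16777216.0f); }
};

// Buggy variant: the thread id is ignored, so every thread (and therefore
// every image row) draws exactly the same series.
Lcg make_shared_seed_rng(unsigned /*thread_id*/) { return Lcg(12345u); }

// Fixed variant: the seed depends on the thread id, so rows diverge.
Lcg make_per_thread_rng(unsigned thread_id) { return Lcg(12345u + thread_id); }
```

With the shared seed, the generators for rows 0 and 1 return identical values forever, which shows up as structured artifacts repeated on every line. Seeds that differ by a small offset may still give correlated LCG streams, so a more careful setup might hash the thread id first.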
GPGPU with CUDA
All optimizations used for the business card raytracer were applied in e.cu.
- Maximize occupancy -> 0.2 :/.
- Maximize branch efficiency -> 90% :).
- Avoid float64.
- Use intrinsics.
- Use -use_fast_math flag.
C:\Users\fab>nvcc -O3 -o e -use_fast_math e.cu
C:\Users\fab>e.exe 2048 > NUL
Time: 59s
The speed gain (2048spp) was monstrous. Unfortunately, so was the generated image.
Probably a bug in the kernel.
That bug took an afternoon to track down. Ultimately the culprit turned out to be the fast-math flag, which made the intrinsic __powf too inaccurate for ray-marching. Replacing it with manual multiplication (f.cu) solved the problem but also reduced spp to 600. There was also another issue.
GPGPU rendering, 600spp. And still noisy.
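Replacing a power function with manual multiplication can be sketched as follows. The exponent of 8 and the name `pow8` are assumptions chosen for illustration; the exact expression changed in f.cu is not shown here. The function is written as plain C++, and in a CUDA kernel it would additionally need a `__device__` qualifier.

```cpp
// Computes x^8 with three multiplications by repeated squaring.
// Unlike an approximate pow intrinsic, each step is an ordinary float
// multiply, so the only error is the usual rounding of each product.
inline float pow8(float x) {
    float x2 = x * x;   // x^2
    float x4 = x2 * x2; // x^4
    return x4 * x4;     // x^8
}
```

The trade-off matches what the timings suggest: the multiply chain is accurate enough for the distance estimates ray-marching depends on, but it gives up the speed of the fast approximate intrinsic.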
Denoising
Starting with OptiX 5.0, NVidia provides an AI-based denoiser with a pre-trained neural network. Even though NVidia claimed the model was "trained using tens of thousands of images rendered from one thousand 3D scenes", I had reservations about the result. I didn't even check the API and instead used Declan Russell's standalone executable.
The denoiser ran in 300ms and the result made me eat my words. It is so good it is mind-blowing. A denoised 600spp image, which takes one minute to render, is equivalent to a 40,960spp image, which takes one hour to render.