7 Tips For Max PyTorch Performance
May 12 · 5 min read
Over the last 10 months, while working on PyTorch Lightning, the team and I have seen many styles of structuring PyTorch code, and we have identified a few key places where people inadvertently introduce bottlenecks.
We’ve taken great care to make sure that PyTorch Lightning does not make any of these mistakes in the code we automate for you, and we even try to correct them for users when we detect them. However, since Lightning is just structured PyTorch and you still control all of the scientific PyTorch, in many cases there’s not much we can do for the user.
In addition, if you’re not using Lightning, you might inadvertently introduce these issues into your code.
To help you train faster, here are 7 tips covering mistakes that might be slowing down your code.
Use workers in DataLoaders
This first mistake is an easy one to correct. PyTorch allows loading data on multiple processes simultaneously (documentation).
With num_workers=8, for example, PyTorch sidesteps the GIL by preparing 8 batches at once, each in a separate worker process. How many workers should you use? A good rule of thumb is:
num_worker = 4 * num_GPU
This answer has a good discussion about this.
Warning: The downside is that your memory usage will also increase (source).
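As a rough illustration of the rule of thumb above, the sketch below builds a DataLoader whose worker count is derived from the number of GPUs. The helper names suggested_num_workers and make_loader are hypothetical, and capping the result at the CPU count is an assumption added here, not part of the original rule.

```python
import os

import torch
from torch.utils.data import DataLoader, TensorDataset


def suggested_num_workers(num_gpus: int) -> int:
    """4 workers per GPU (rule of thumb from the text), capped at the CPU count.

    The cap and the floor of one GPU are assumptions made for this sketch.
    """
    cpu_count = os.cpu_count() or 1
    return max(1, min(4 * max(num_gpus, 1), cpu_count))


def make_loader(dataset, batch_size: int, num_gpus: int) -> DataLoader:
    # Worker processes are only started once the loader is iterated.
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=suggested_num_workers(num_gpus),
    )


# Toy dataset: 256 samples with 10 features and a binary label.
dataset = TensorDataset(torch.randn(256, 10), torch.randint(0, 2, (256,)))
loader = make_loader(dataset, batch_size=32, num_gpus=torch.cuda.device_count())
print(f"using {loader.num_workers} workers for {len(loader)} batches")
```

On platforms that start workers with spawn (Windows, macOS), iterate over the loader inside an `if __name__ == "__main__":` guard. Treat the worker count as a starting point and measure, since each worker also adds memory overhead.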
Pin memory
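Pinned (page-locked) memory is host memory that the operating system cannot page out, which lets the GPU copy from it directly and asynchronously.
When you enable pin_memory in a DataLoader it “automatically puts the fetched data Tensors in pinned memory, and enables faster data transfer to CUDA-enabled GPUs” (source).
A related point: you know how sometimes your GPU memory shows that it’s full but you’re pretty sure that your model isn’t using that much? That overhead is not pinned memory. It is GPU memory that PyTorch’s caching allocator has reserved as a type of “working allocation” so it can be reused without going back to the driver. This means you should not unnecessarily call: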
torch.cuda.empty_cache()
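A minimal sketch of pinned loading in plain PyTorch, assuming a toy TensorDataset: pin_memory is enabled only when CUDA is available, and non_blocking=True lets the host-to-device copy overlap with other work when the source tensor is pinned. On a CPU-only machine the same code runs and the flags are effectively no-ops.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# Toy data standing in for a real dataset.
dataset = TensorDataset(torch.randn(64, 3), torch.randint(0, 2, (64,)))

# pin_memory places each fetched batch in page-locked host memory.
loader = DataLoader(dataset, batch_size=16, pin_memory=use_cuda)

for x, y in loader:
    # With a pinned source tensor, non_blocking=True makes the copy asynchronous.
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    # ... forward and backward pass would go here ...
```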
Avoid CPU to GPU transfers or vice-versa
# bad
.cpu()
.item()
.numpy()
I see heavy usage of the .item() or .cpu() or .numpy() calls. This is really bad for performance because every one of these calls transfers data from GPU to CPU and dramatically slows your performance.
If you’re trying to clear up the attached computational graph, use .detach() instead.
# good
.detach()
This won’t transfer memory to the CPU, and it will remove any computational graph attached to that variable.
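One common place this shows up is loss logging. The sketch below, with a made-up loss computation, accumulates the running loss as a tensor on the same device using .detach() and calls .item() only once at the end, so there is a single device-to-host transfer instead of one per step.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

num_steps = 5
running_loss = torch.zeros((), device=device)  # accumulator stays on the device

for step in range(num_steps):
    # Stand-in for a model output that carries a computational graph.
    pred = torch.randn(8, requires_grad=True, device=device)
    loss = (pred ** 2).mean()
    loss.backward()

    # detach() drops the graph without moving data off the device.
    running_loss += loss.detach()

# A single transfer to the CPU once the loop is done.
mean_loss = (running_loss / num_steps).item()
print(f"mean loss: {mean_loss:.4f}")
```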
Construct tensors directly on GPUs
Most people create tensors on GPUs like this:
t = torch.rand(2, 2).cuda()
However, this first creates a CPU tensor, and THEN transfers it to the GPU… this is really slow. Instead, create the tensor directly on the device you want.
t = torch.rand(2, 2, device=torch.device('cuda:0'))
If you’re using Lightning, we automatically put your model and the batch on the correct GPU for you. But if you create a new tensor somewhere inside your code (e.g. sampling random noise for a VAE), then you must place the tensor on the device yourself.
t = torch.rand(2, 2, device=self.device)
Every LightningModule has a convenient self.device attribute which works whether you are on CPU, multiple GPUs, or TPUs (i.e. Lightning will choose the right device for that tensor).
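Outside Lightning, one way to get the same effect is to take the device from a tensor you already have. The sketch below uses a hypothetical NoisyLinear module that samples noise directly on the input's device, so no CPU tensor is created and copied over.

```python
import torch
from torch import nn


class NoisyLinear(nn.Module):
    """Hypothetical module: adds Gaussian noise to the input before a linear layer."""

    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Allocate the noise on x's device with x's dtype: no CPU round trip.
        noise = torch.randn(x.shape, device=x.device, dtype=x.dtype)
        return self.linear(x + noise)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = NoisyLinear(3).to(device)
x = torch.zeros(4, 3, device=device)
out = model(x)
print(out.shape, out.device)
```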
Use DistributedDataParallel not DataParallel
PyTorch has two main models for training on multiple GPUs. The first, DataParallel (DP), splits a batch across multiple GPUs. But this also means that the model has to be copied to each GPU on every forward pass, and the gradients computed on each copy have to be gathered back on GPU 0.
That’s a lot of GPU transfers, which are expensive! Instead, DistributedDataParallel (DDP) creates a siloed copy of the model on each GPU (in its own process), and makes only a portion of the data available to that GPU. Then it’s like having N independent models training, except that once each one calculates the gradients, they all sync gradients across models… this means we only transfer data across GPUs once during each batch.
In Lightning, you can trivially switch between both:
# DistributedDataParallel
Trainer(distributed_backend='ddp', gpus=8)
# DataParallel
Trainer(distributed_backend='dp', gpus=8)
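For readers not using Lightning, here is a rough sketch of one DDP training step in plain PyTorch. To keep it runnable on any machine it makes several assumptions that differ from a real setup: it uses the gloo backend on CPU, a world size of 1, and a temporary file for rendezvous. In practice you would start one process per GPU (for example with a launcher such as torchrun) and use the nccl backend. The function name ddp_train_step is hypothetical.

```python
import os
import tempfile

import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP


def ddp_train_step(rank: int, world_size: int, init_file: str) -> float:
    """Run one optimizer step inside a process group and return the loss."""
    dist.init_process_group(
        backend="gloo",  # assumption: CPU-friendly backend for this sketch
        init_method=f"file://{init_file}",
        rank=rank,
        world_size=world_size,
    )
    try:
        # Each process wraps its own copy of the model.
        model = DDP(nn.Linear(4, 1))
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

        # Each process would normally see only its own shard of the data.
        x, y = torch.randn(8, 4), torch.randn(8, 1)
        loss = nn.functional.mse_loss(model(x), y)

        optimizer.zero_grad()
        loss.backward()  # gradients are averaged across processes here
        optimizer.step()
        return loss.item()
    finally:
        dist.destroy_process_group()


with tempfile.TemporaryDirectory() as tmp_dir:
    loss_value = ddp_train_step(0, 1, os.path.join(tmp_dir, "ddp_init"))
print(f"loss after one DDP step: {loss_value:.4f}")
```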
Use 16-bit precision
This is another way to speed up training which we don’t see many people using. In 16-bit training, parts of your model and your data go from 32-bit numbers to 16-bit numbers. This has a few advantages:
- You use roughly half the memory for those parts (which often means you can use a larger batch size and cut training time).
- Certain GPUs (V100, 2080Ti) give you automatic speed-ups (3x-8x faster) because they are optimized for 16-bit computations.
In Lightning this is trivial to enable:
Trainer(precision=16)
Note: Before PyTorch 1.6 you ALSO had to install Nvidia Apex… now 16-bit is native to PyTorch. But if you’re using Lightning, it supports both and automatically switches depending on the detected PyTorch version.
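Without Lightning, native mixed precision is typically written with an autocast context and a gradient scaler. The following is a minimal sketch on a toy linear model, assuming a reasonably recent PyTorch release that provides torch.autocast; mixed precision is switched on only when a CUDA device is present, so on CPU the same code runs in plain 32-bit.

```python
import torch
from torch import nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
use_amp = device.type == "cuda"  # assumption: only enable 16-bit on GPU

model = nn.Linear(16, 4).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# The scaler guards against fp16 gradient underflow; it is a no-op when disabled.
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

x = torch.randn(32, 16, device=device)
y = torch.randn(32, 4, device=device)

optimizer.zero_grad()
# Inside autocast, eligible ops run in 16-bit while the weights stay 32-bit.
with torch.autocast(device_type=device.type, enabled=use_amp):
    loss = nn.functional.mse_loss(model(x), y)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
print(f"loss: {loss.item():.4f}")
```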
Profile your code
This last tip may be hard to do without Lightning, but you can use things like cProfile to do so. However, in Lightning you can get a summary of all the calls made during training in two ways:
First, the built-in basic profiler:
Trainer(profiler=True)
which reports how long each action took once training finishes,
or the advanced profiler:
profiler = AdvancedProfiler()
trainer = Trainer(profiler=profiler)
which gets very granular, down to individual function calls.
The full documentation for the Lightning profiler can be found here.
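If you are not using Lightning, a rough equivalent with the standard library's cProfile might look like the sketch below. The train_steps function is a hypothetical stand-in for your own training loop, and the report is sorted by cumulative time so the most expensive call paths come first.

```python
import cProfile
import io
import pstats

import torch
from torch import nn


def train_steps(n_steps: int = 20) -> float:
    """Hypothetical stand-in for a real training loop."""
    model = nn.Linear(32, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss = torch.tensor(0.0)
    for _ in range(n_steps):
        x, y = torch.randn(64, 32), torch.randn(64, 1)
        loss = nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return loss.item()


profiler = cProfile.Profile()
profiler.enable()
final_loss = train_steps()
profiler.disable()

# Collect the ten most expensive entries, ordered by cumulative time.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
print(report)
```

Keep in mind that cProfile only sees time spent on the Python side. CUDA kernels run asynchronously, so GPU-heavy code may need a GPU-aware profiler to get an accurate picture.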
Adopting Lightning in your code
PyTorch Lightning is nothing more than structured PyTorch.
If you’re ready to have most of these tips automated for you (and well tested), then check out this video on refactoring your PyTorch code into the Lightning format!