7 Tips For Squeezing Maximum Performance From PyTorch


Throughout the last 10 months of working on PyTorch Lightning, the team and I have been exposed to many styles of structuring PyTorch code, and we have identified a few key places where we see people inadvertently introducing bottlenecks.

We’ve taken great care to make sure that PyTorch Lightning does not make any of these mistakes in the code we automate for you, and we even try to correct them for users when we detect them. However, since Lightning is just structured PyTorch and you still control all of the scientific PyTorch, in many cases there’s not much we can do for the user.

In addition, if you’re not using Lightning, you might inadvertently introduce these issues into your code.

To help you train faster, here are 7 issues you should be aware of that might be slowing down your code.

Use workers in DataLoaders

This first mistake is an easy one to correct. PyTorch allows loading data on multiple processes simultaneously (documentation).

With num_workers=8, for example, PyTorch can bypass the GIL by loading 8 batches in parallel, each in a separate process. How many workers should you use? A good rule of thumb is:

num_worker = 4 * num_GPU

This answer has a good discussion of how to choose the number of workers.

Warning: The downside is that your memory usage will also increase (source).
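
As a rough sketch (the train_dataset here is a hypothetical stand-in for your own data), the rule of thumb looks like this in practice:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in dataset; substitute your own.
train_dataset = TensorDataset(torch.randn(1000, 32), torch.randint(0, 2, (1000,)))

num_gpus = max(torch.cuda.device_count(), 1)

# Each worker is a separate process, so batch loading is not
# serialized by the Python GIL.
train_loader = DataLoader(
    train_dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4 * num_gpus,  # the rule of thumb from above
)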

Pin memory

You know how sometimes your GPU memory shows that it’s full, but you’re pretty sure your model isn’t using that much? That overhead is called pinned memory, i.e., memory that has been reserved as a type of “working allocation.”

When you enable pin_memory in a DataLoader it “automatically puts the fetched data Tensors in pinned memory, and enables faster data transfer to CUDA-enabled GPUs” (source).

Pinned memory, as described in this NVIDIA blogpost.

This also means you should not unnecessarily call:

torch.cuda.empty_cache()
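
Putting the pieces together, here is a minimal sketch (the non_blocking=True flag is the usual companion to pinned memory; it lets host-to-GPU copies overlap with computation):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 32), torch.randint(0, 2, (1000,)))

# pin_memory=True places fetched batches in page-locked host memory,
# which makes host-to-GPU copies faster.
loader = DataLoader(dataset, batch_size=64, pin_memory=True)

device = torch.device('cuda:0')
for x, y in loader:
    # With a pinned source, non_blocking=True lets this copy overlap
    # with other GPU work.
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)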

Avoid CPU to GPU transfers or vice-versa

# bad
.cpu()
.item()
.numpy()

I see heavy usage of .item(), .cpu(), and .numpy() calls. This is really bad for performance because every one of these calls transfers data from the GPU to the CPU, which dramatically slows you down.

If you’re trying to clear up the attached computational graph, use .detach() instead.

# good
.detach()

This won’t transfer the tensor to the CPU, and it will remove any computational graph attached to that variable.
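
For example, in a hypothetical training loop, keep the running loss on the GPU and transfer it only once, when you actually need the Python number:

import torch

device = torch.device('cuda:0')
model = torch.nn.Linear(32, 1).to(device)
x = torch.randn(64, 32, device=device)

loss = model(x).pow(2).mean()

# bad: running_loss += loss.item()  # forces a GPU -> CPU sync every step
# good: .detach() drops the graph but keeps the tensor on the GPU
running_loss = loss.detach()

# Transfer once, outside the hot loop, e.g. for logging.
print(running_loss.item())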

Construct tensors directly on GPUs

Most people create tensors on GPUs like this:

t = torch.rand(2, 2).cuda()

However, this first creates a CPU tensor and THEN transfers it to the GPU… this is really slow. Instead, create the tensor directly on the device you want.

t = torch.rand(2, 2, device=torch.device('cuda:0'))

If you’re using Lightning, we automatically put your model and the batch on the correct GPU for you. But if you create a new tensor somewhere inside your code (e.g., sampling random noise for a VAE), then you must place that tensor on the correct device yourself.

t = torch.rand(2, 2, device=self.device)

Every LightningModule has a convenient self.device property which works whether you are on CPU, multiple GPUs, or TPUs (i.e., Lightning will choose the right device for that tensor).
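
Here is a minimal sketch of a hypothetical LightningModule that samples noise directly on self.device:

import torch
import pytorch_lightning as pl

class NoisyRegressor(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(2, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        # Created directly on whatever device Lightning chose for
        # this module (CPU, a specific GPU, or a TPU core).
        noise = torch.rand(x.size(0), 2, device=self.device)
        return torch.nn.functional.mse_loss(self.layer(x + noise), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)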

Use DistributedDataParallel not DataParallel

PyTorch has two main approaches for training on multiple GPUs. The first, DataParallel (DP), splits a batch across multiple GPUs. But this also means that the model has to be copied to each GPU, and once gradients are calculated on GPU 0, they must be synced to the other GPUs.

That’s a lot of expensive GPU transfers! Instead, DistributedDataParallel (DDP) creates a siloed copy of the model on each GPU (in its own process), and makes only a portion of the data available to that GPU. Then it’s like having N independent models training, except that once each one calculates the gradients, they all sync gradients across models… this means we only transfer data across GPUs once per batch.

In Lightning, you can trivially switch between the two:

Trainer(distributed_backend='ddp', gpus=8)
Trainer(distributed_backend='dp', gpus=8)

Note that both PyTorch and Lightning discourage DP use.
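
If you’re not using Lightning, the raw-PyTorch version of DDP looks roughly like this (a minimal single-node sketch; the address, port, and backend are assumptions you’d adapt to your setup, and in a real job a DistributedSampler would hand each process its portion of the data):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank, world_size):
    # One process per GPU; each joins the same process group.
    os.environ['MASTER_ADDR'] = '127.0.0.1'  # assumed single-node setup
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group('nccl', rank=rank, world_size=world_size)

    model = torch.nn.Linear(32, 1).to(rank)
    # DDP keeps a full model copy on each GPU and syncs only
    # the gradients, once per backward pass.
    ddp_model = DDP(model, device_ids=[rank])

    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    x = torch.randn(64, 32, device=rank)
    loss = ddp_model(x).pow(2).mean()
    loss.backward()  # gradients are all-reduced across GPUs here
    opt.step()
    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = torch.cuda.device_count()
    mp.spawn(train, args=(world_size,), nprocs=world_size)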

Use 16-bit precision

This is another way to speed up training that we don’t see many people using. In 16-bit training, parts of your model and your data go from 32-bit numbers to 16-bit numbers. This has a few advantages:

  1. You use half the memory (which means you can double the batch size and cut training time in half).
  2. Certain GPUs (V100, 2080Ti) give you automatic speed-ups (3x-8x faster) because they are optimized for 16-bit computations.

In Lightning this is trivial to enable:

Trainer(precision=16)

Note: Before PyTorch 1.6, you ALSO had to install NVIDIA Apex… now 16-bit is native to PyTorch. If you’re using Lightning, it supports both and automatically switches depending on the detected PyTorch version.
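
In plain PyTorch 1.6+, the native route is torch.cuda.amp; a minimal sketch:

import torch
from torch.cuda.amp import autocast, GradScaler

device = torch.device('cuda:0')
model = torch.nn.Linear(32, 1).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = GradScaler()  # scales the loss to avoid fp16 gradient underflow

x = torch.randn(64, 32, device=device)

opt.zero_grad()
with autocast():  # ops inside run in float16 where it is safe to do so
    loss = model(x).pow(2).mean()
scaler.scale(loss).backward()
scaler.step(opt)
scaler.update()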

Profile your code

This last tip may be hard to do without Lightning, but you can use tools like Python’s built-in cProfile to do so (a minimal sketch appears at the end of this section). In Lightning, you can get a summary of all the calls made during training in two ways:

First, the built-in basic profiler

Trainer(profiler=True)

which gives an output like this:

[Screenshot: the basic profiler’s summary of time spent in each training action]

or the advanced profiler:

from pytorch_lightning.profiler import AdvancedProfiler

profiler = AdvancedProfiler()
trainer = Trainer(profiler=profiler)

which gets very granular:

[Screenshot: the advanced profiler’s per-function breakdown]

The full documentation for the Lightning profiler can be found here.
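
And if you’re not using Lightning, the cProfile route mentioned above takes only a few lines (train_one_epoch is a hypothetical stand-in for your own loop):

import cProfile
import pstats

def train_one_epoch():
    # Hypothetical stand-in for your actual training loop.
    return sum(i * i for i in range(1_000_000))

profiler = cProfile.Profile()
profiler.enable()
train_one_epoch()
profiler.disable()

# Print the 10 most time-consuming calls.
pstats.Stats(profiler).sort_stats('cumulative').print_stats(10)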

Adopting Lightning in your code

PyTorch Lightning is nothing more than structured PyTorch.


If you’re ready to have most of these tips automated for you (and well tested), then check out this video on refactoring your PyTorch code into the Lightning format!

