Gradient Accumulation: Overcoming Memory Constraints in Deep Learning


A brief overview of the problem and the solution


Photo by Nic Low on Unsplash

Let’s be honest: deep learning without GPUs sucks big time! Yes, people will claim you can do without one, but life isn’t just about training a neat and cool MNIST classifier.

So for training state-of-the-art (SOTA) models, a GPU is pretty much a necessity. And even if we do manage to procure one, there comes the problem of memory constraints. We are more or less accustomed to seeing the OOM (Out of Memory) error whenever we throw a large batch at the GPU. The problem is far more apparent with state-of-the-art computer vision models. We have come a long way since the days of VGG or even ResNet-18. Modern, deeper architectures like UNet, ResNet-152, R-CNN and Mask R-CNN are extremely memory intensive, so there is a high chance of running out of memory while training them.

Here is a typical OOM error thrown while training a model in PyTorch.

RuntimeError: CUDA out of memory. Tried to allocate 44.00 MiB (GPU 0; 10.76 GiB total capacity; 9.46 GiB already allocated; 30.94 MiB free; 9.87 GiB reserved in total by PyTorch)

There are two things practitioners usually reach for the moment they hit the OOM error:

  1. Reduce batch size
  2. Reduce image dimensions

In over 90% of cases, these two fixes are more than enough. So the question you want to ask is: why do the remaining cases need something else? To answer that, let’s look at the images below.


From the Kaggle notebook of Dimitre Oliveira

It’s from the Kaggle competition Understanding Clouds from Satellite Images, where the task was to correctly segment the different types of clouds. The images are of a very high resolution, 1400 x 2100. As you can imagine, reducing the image dimensions too much would be very harmful in this scenario, since the minute patterns and textures are exactly the features the model needs to learn. So the only other option is to reduce the batch size.

As a refresher, if you remember gradient descent, or specifically mini-batch gradient descent in our case, you’ll recall that instead of calculating the loss and the resulting gradients on the whole dataset, we do it on smaller batches. Besides helping us fit the data into memory, this also helps us converge faster, since the parameters are updated after each mini-batch. But what happens when the batch size becomes too small, as in the case above? As a rough estimate, maybe four such images fit into a single batch on an 11 GB GPU, and the loss and gradients calculated from four samples will not accurately represent the whole dataset. As a result, the model will converge a lot slower or, worse, not converge at all.
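To put that refresher in symbols (standard mini-batch SGD notation, not something from the original post), a single update with learning rate η on a mini-batch B looks like:

```latex
\theta_{t+1} \;=\; \theta_t \;-\; \eta \, \frac{1}{|B|} \sum_{i \in B} \nabla_{\theta}\, \ell(x_i, y_i; \theta_t)
```

With |B| = 4, that average runs over just four samples, so it is a very noisy estimate of the gradient over the full dataset.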

Enter gradient accumulation.

The idea behind gradient accumulation is stupidly simple. It calculates the loss and gradients after each mini-batch, but instead of updating the model parameters right away, it waits and accumulates the gradients over consecutive batches, and only updates the parameters based on the cumulative gradient after a specified number of batches.
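Concretely, if we accumulate over N batches and divide each batch loss by N (as the PyTorch sketch further below does), the update that finally gets applied is, in the plain-SGD case (adaptive optimizers like Adam change the details):

```latex
\theta \;\leftarrow\; \theta \;-\; \eta \sum_{k=1}^{N} \frac{1}{N}\, \nabla_{\theta} L_k(\theta)
       \;=\; \theta \;-\; \eta \, \nabla_{\theta}\!\left( \frac{1}{N} \sum_{k=1}^{N} L_k(\theta) \right)
```

That is the same update a single batch N times larger would have produced (up to the parameters being held fixed across the N small batches), so the gradient the optimizer finally sees represents the data far better than any single tiny batch.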

Coding the gradient accumulation part is also ridiculously easy in PyTorch. All you need to do is call backward() on the loss of each batch, which accumulates the gradients, and only step the optimizer after the set number of batches you choose.
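Here is a minimal sketch of such a training loop. The names model, dataloader, criterion and optimizer are placeholders for your own objects, and accumulation_steps is the number of batches to accumulate over:

```python
accumulation_steps = 4  # effective batch size = batch_size * accumulation_steps

model.zero_grad()  # reset gradients once before the loop
for i, (inputs, labels) in enumerate(dataloader):
    outputs = model(inputs)                    # forward pass
    loss = criterion(outputs, labels)
    loss = loss / accumulation_steps           # normalize, since backward() sums gradients
    loss.backward()                            # accumulate gradients in the parameters
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()                       # update parameters every accumulation_steps batches
        model.zero_grad()                      # reset the accumulated gradients
```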

We hold off on calling optimizer.step(), which updates the parameters, for accumulation_steps batches. model.zero_grad() is called at the same time to reset the accumulated gradients.

Doing the same thing is a little trickier in Keras/TensorFlow. There are different versions written by people that you’ll find on the internet; one of them was written by @alexeydevederkin.
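That version is not reproduced here; as a rough illustration of the same idea, here is a minimal sketch using a TensorFlow 2 custom training loop with tf.GradientTape (again, model, dataset, loss_fn and optimizer are placeholder names):

```python
import tensorflow as tf

accumulation_steps = 4
# one gradient accumulator per trainable variable
accumulated = [tf.zeros_like(v) for v in model.trainable_variables]

for step, (x, y) in enumerate(dataset):
    with tf.GradientTape() as tape:
        predictions = model(x, training=True)
        loss = loss_fn(y, predictions) / accumulation_steps   # same normalization as before
    grads = tape.gradient(loss, model.trainable_variables)
    accumulated = [acc + g for acc, g in zip(accumulated, grads)]
    if (step + 1) % accumulation_steps == 0:
        optimizer.apply_gradients(zip(accumulated, model.trainable_variables))
        accumulated = [tf.zeros_like(v) for v in model.trainable_variables]
```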

There are also shorter TensorFlow-only versions around; you’ll find those easily.

Gradient accumulation is a great tool for hobbyists with limited compute, and just as useful for practitioners who want to train on images without scaling them down. Whichever one you are, it is always a handy trick to have in your armory.

