Quick Start to Distributed Multi-GPU Deep Learning on AWS Sagemaker using TF.Distribute

Introduction

This article is a quick start guide to running distributed multi-GPU deep learning using AWS Sagemaker and TensorFlow 2.2.0 tf.distribute.


Code

All of my code related to this article can be found in my GitHub repository, here. The code in my repository is an example of running a version of BERT on data from Kaggle, specifically the Jigsaw Multilingual Toxic Comment Classification competition. Some of my code is adapted from a top public kernel.

The Need-to-Know Information

Getting Started

First, we need to understand our options for running deep learning on AWS Sagemaker.

  1. Run your code in a notebook instance
  2. Run your code in a tailored Sagemaker TensorFlow container

In this article, we focus on option #2 because it’s cheaper and it’s the intended design of Sagemaker.

(option #1 is a nice way to get started, but it’s more expensive because you’re paying for every second the notebook instance is running).

Running a Sagemaker TensorFlow Container

Sagemaker TensorFlow containers offer a lot of flexibility, but we're going to focus on the bare essentials.


To start, we need to launch a Sagemaker notebook instance and store our data on S3. If you don't know how to do this, I review some simple options on my blog. Once we have our data in S3, we can launch a Jupyter notebook (from our notebook instance) and start coding. This notebook will be responsible for launching your training job, i.e., your Sagemaker TensorFlow container.
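
For reference, one simple option is to let the Sagemaker Python SDK upload a local folder to your default bucket for you. A minimal sketch is below; the folder name data and the key prefix jigsaw-data are placeholders, not values from my repository:

import sagemaker

# Upload a local 'data/' folder to the default Sagemaker bucket and
# get back the S3 URI we'll hand to the estimator later.
sess = sagemaker.Session()
data_s3 = sess.upload_data(path='data', key_prefix='jigsaw-data')
print(data_s3)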

Again, we’re going to focus on the bare essentials. We need a variable to indicate where our data is located, and then we need to add that location to a dictionary.

data_s3 = 's3://<your-bucket>/'
inputs = {'data':data_s3}

Pretty simple. Now we need to create a Sagemaker TensorFlow estimator object (this is what configures and launches our container).

Our entry_point is a Python script (which we'll make later) that contains all of our modeling code. Our train_instance_type is a multi-GPU Sagemaker instance type. You can find a full list of Sagemaker instance types here. Notice that an ml.p3.8xlarge runs 4 NVIDIA V100 GPUs. And since we're going to be using MirroredStrategy (more on this later) we need train_instance_count=1. So that's 1 machine with 4 V100s. The other settings you can leave alone for now, or research further as needed.

The main settings we need to get right are entry_point and train_instance_type. (And for MirroredStrategy we need train_instance_count=1.)

# create estimator
import sagemaker
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(entry_point='jigsaw_DistilBert_SingleRun_v1_sm_tfdist0.py',
                       train_instance_type='ml.p3.8xlarge',
                       output_path="s3://<your-bucket>",
                       train_instance_count=1,
                       role=sagemaker.get_execution_role(),
                       framework_version='2.1.0',
                       py_version='py3',
                       script_mode=True)

We can kick off our training job by running the following line.

estimator.fit(inputs)

Notice that we included our dictionary (which contains our S3 location) as an input to 'fit()'. Before we run this code, we need to create the Python script that we tied to entry_point (otherwise our container won't have any code to run).

Create Training Script

I have a lot going on in my training script because I’m running a version of BERT on some data from Kaggle, but I’m going to highlight the main code required for Sagemaker.


First we need to grab our data location, which was passed when we ran ‘estimator.fit(inputs)’. We can do this using argparse.

import argparse
import os

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--data',
                        type=str,
                        default=os.environ.get('SM_CHANNEL_DATA'))
    return parser.parse_known_args()

args, _ = parse_args()

You could probably simplify this even further by just hard coding your S3 location in your training script.
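
As a quick illustration, here is how the parsed channel path might be used inside the training script. The file name train.csv and the use of pandas are assumptions for this sketch, not necessarily what my repository does:

import os
import pandas as pd

# Sagemaker copies the S3 data for the 'data' channel into the container
# and exposes the local path via SM_CHANNEL_DATA, which parse_args() read above.
train_df = pd.read_csv(os.path.join(args.data, 'train.csv'))  # hypothetical file name
print(train_df.shape)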

If all we wanted to do was run our training job in a Sagemaker container, that's basically all we need. Now, if we want to run multi-GPU training using tf.distribute, we need a few more things.

Say Goodbye to Horovod, Say Hello to TF.Distribute


First we need to indicate that we want to run multi-GPU training. We can do that very easily with the following line.

strategy = tf.distribute.MirroredStrategy()

We're going to use our strategy object throughout our training code. Next, we need to adjust our batch size for multi-GPU training by including the following line. On an ml.p3.8xlarge, strategy.num_replicas_in_sync is 4 (one per GPU), so a per-replica batch size of 16 becomes a global batch size of 64.

BATCH_SIZE = 16 * strategy.num_replicas_in_sync

To distribute our model, we define it inside the strategy's scope as well.

with strategy.scope():
 # define model here

And that's it! We can then continue on to run 'model.fit()' as we usually do.
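
To tie the pieces together, here is a minimal, self-contained sketch of the distributed portion of a training script. The tiny dense model and the random data are placeholders for illustration only, not the BERT model from my repository:

import numpy as np
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print('Number of replicas:', strategy.num_replicas_in_sync)

# Scale the global batch size by the number of GPUs (16 per replica).
BATCH_SIZE = 16 * strategy.num_replicas_in_sync

# Build and compile the model inside the strategy's scope so its
# variables are mirrored across all GPUs.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation='relu', input_shape=(768,)),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])

# Placeholder data; in practice this would be your real training set.
x = np.random.rand(256, 768).astype('float32')
y = np.random.randint(0, 2, size=(256, 1))

# model.fit() works as usual; Keras handles the distribution across GPUs.
model.fit(x, y, batch_size=BATCH_SIZE, epochs=1)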

Again, full code related to this article can be found in my GitHub repository, here.

Thanks for reading and hope you find this helpful!

