Google Open Sources a New Architecture for Massively Scalable Deep Reinforcement Learning


The new architecture improves upon the IMPALA model to achieve massive levels of scalability.

Deep reinforcement learning (DRL) is one of the fastest-moving areas of research in the deep learning space. Responsible for some of the top AI milestones of recent years, such as AlphaGo, OpenAI Five (for Dota 2), and AlphaStar, DRL seems to be the discipline that comes closest to approximating human intelligence. However, despite all this progress, real-world implementations of DRL methods remain constrained to the big artificial intelligence (AI) labs. This is partly because DRL architectures rely on disproportionately large amounts of training, which makes them computationally expensive and impractical for most organizations. Recently, Google Research published a paper proposing SEED RL, a new architecture for massively scalable DRL models.

The challenge of implementing DRL models in the real world is directly tied to their architecture. Intrinsically, DRL comprises heterogeneous tasks such as running environments, model inference, model training, and replay buffers. Most modern DRL architectures fail to distribute compute resources efficiently across these tasks, which makes them unreasonably expensive to implement. Components such as AI hardware accelerators have helped with some of these limitations, but they can only go so far. In recent years, new architectures have emerged that have been adopted by many of the most successful DRL implementations in the market.

Drawing Inspiration from IMPALA

Among the current generation of DRL architectures, IMPALA set a new standard for the space. Originally proposed by DeepMind in a 2018 research paper, IMPALA introduced a model that made use of accelerators specialized for numerical calculations, taking advantage of the speed and efficiency from which supervised learning has benefited for years. At the center of IMPALA is an actor-based model of the kind commonly used to maximize concurrency and parallelization.

The architecture of an IMPALA-based DRL agent is separated into two main components: actors and learners. In this model, the actors typically run on CPUs and iterate between taking steps in the environment and running inference on the model to predict the next action. Frequently the actor will update the parameters of the inference model, and after collecting a sufficient amount of observations, will send a trajectory of observations and actions to the learner, which then optimizes the model. In this architecture, the learner trains the model on GPUs using input from distributed inference on hundreds of machines. From a computational standpoint, the IMPALA architecture enables the acceleration of learners using GPUs while actors can be scaled across many machines.
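To make that division of labor concrete, here is a minimal Python sketch of the actor loop just described. Everything here is illustrative: `learner_client`, its `get_parameters`/`send_trajectory` methods, and the linear stand-in policy are hypothetical, not IMPALA's actual API.

```python
import numpy as np

class ImpalaStyleActor:
    """Sketch of an IMPALA-style actor: steps the environment on CPU,
    runs inference locally, and ships full trajectories to the learner."""

    def __init__(self, env, learner_client, unroll_length=100):
        self.env = env                   # any Gym-like environment
        self.learner = learner_client    # hypothetical connection to the learner
        self.unroll_length = unroll_length

    def act(self, params, observation):
        # Stand-in for running the policy network on the actor's CPU.
        logits = params @ observation
        return int(np.argmax(logits))

    def run(self):
        params = self.learner.get_parameters()  # pull (possibly stale) weights
        observation = self.env.reset()
        while True:
            trajectory = []
            for _ in range(self.unroll_length):
                action = self.act(params, observation)  # local CPU inference
                next_obs, reward, done, _ = self.env.step(action)
                trajectory.append((observation, action, reward, done))
                observation = self.env.reset() if done else next_obs
            self.learner.send_trajectory(trajectory)  # learner optimizes on GPU
            params = self.learner.get_parameters()    # refresh the local copy
```

Note the two cost drivers this loop exposes: inference runs on the actor's CPU, and both model parameters and trajectories cross the network, which leads directly to the limitations listed below.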

IMPALA set new standards in DRL architectures. However, the model has some intrinsic limitations:

· Using CPUs for neural network inference: The actor machines are usually CPU-based. When the computational needs of a model increase, the time spent on inference starts to outweigh the environment-step computation. The workaround is to increase the number of actors, which increases cost and affects convergence.

· Inefficient resource utilization: Actors alternate between two tasks: environment steps and inference steps. The compute requirements for the two tasks are often dissimilar, which leads to poor utilization or slow actors.

· Bandwidth requirements: Model parameters, recurrent state, and observations are transferred between actors and learners. Furthermore, memory-based models send large states, increasing bandwidth requirements.

Using the IMPALA actor model as inspiration, Google worked on a new architecture that addresses some of the limitations of its predecessors in scaling DRL models.

SEED RL

At a high level, Google's SEED RL architecture looks incredibly similar to IMPALA, but it introduces a few variations that address some of the main limitations of the DeepMind model. In SEED RL, neural network inference is done centrally by the learner on specialized hardware (GPUs or TPUs), enabling accelerated inference and avoiding the data transfer bottleneck by ensuring that the model parameters and state are kept local. For every single environment step, the observations are sent to the learner, which runs the inference and sends actions back to the actors. This clever solution addresses the inference limitations of models like IMPALA but might introduce latency challenges.
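To make the contrast concrete, here is a minimal sketch of the actor's side under SEED RL's split. `inference_client` and its `infer` method are hypothetical stand-ins for the real gRPC interface, not SEED RL's actual API.

```python
def seed_style_actor_loop(env, inference_client):
    """Sketch of a SEED RL-style actor: it holds no model and receives
    no parameters; every action comes back from the learner."""
    observation = env.reset()
    reward, done = 0.0, False
    while True:
        # One round trip per environment step: ship the observation,
        # get back the action chosen on the learner's accelerator.
        action = inference_client.infer(observation, reward, done)
        observation, reward, done, _ = env.step(action)
        if done:
            observation, reward, done = env.reset(), 0.0, False
```

The actor machine is now a pure environment-stepping worker, which is exactly why per-step latency becomes the new concern.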

To minimize the latency impact, SEED RL relies on gRPC for messaging and streaming. Specifically, SEED RL leverages streaming RPCs in which the connection from actor to learner is kept open and metadata is sent only once. Furthermore, the framework includes a batching module that efficiently batches multiple actor inference calls together.
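The batching idea can be sketched in a few lines of Python. This is a simplified illustration using Python queues rather than the real accelerator-side batching and streaming gRPC; `policy_fn` stands in for a batched policy such as a TensorFlow function.

```python
import queue

class InferenceBatcher:
    """Sketch of the batching idea: collect concurrent actor requests
    and evaluate the policy once per batch."""

    def __init__(self, policy_fn, batch_size=32, timeout=0.002):
        self.policy_fn = policy_fn    # batched policy (hypothetical stand-in)
        self.batch_size = batch_size
        self.timeout = timeout        # don't stall small batches forever
        self.requests = queue.Queue()

    def infer(self, observation):
        # Called concurrently by the per-actor handler threads.
        result = queue.Queue(maxsize=1)
        self.requests.put((observation, result))
        return result.get()           # block until the batch is served

    def serve_forever(self):
        while True:
            batch = [self.requests.get()]  # wait for at least one request
            try:
                while len(batch) < self.batch_size:
                    batch.append(self.requests.get(timeout=self.timeout))
            except queue.Empty:
                pass                       # timeout: serve a partial batch
            observations = [obs for obs, _ in batch]
            actions = self.policy_fn(observations)  # one accelerator call
            for (_, result), action in zip(batch, actions):
                result.put(action)
```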

Deep diving into the SEED RL architecture, there are three fundamental types of threads running on the learner:

1. Inference

2. Data Prefetching

3. Training

Inference threads receive a batch of observations, rewards and episode termination flags. They load the recurrent states and send the data to the inference TPU core. The sampled actions and new recurrent states are received, and the actions are sent back to the actors while the latest recurrent states are stored. When a trajectory is fully unrolled it is added to a FIFO queue or replay buffer and later sampled by data prefetching threads. Finally, the trajectories are pushed to a device buffer for each of the TPU cores taking part in training. The training thread (the main Python thread) takes the prefetched trajectories, computes gradients using the training TPU cores and applies the gradients on the models of all TPU cores (inference and training) synchronously. The ratio of inference and training cores can be adjusted for maximum throughput and utilization.
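Below is a structural sketch of how those three thread types might fit together. The TPU-facing calls (`run_inference`, `compute_and_apply_gradients`) are hypothetical placeholders for the real TensorFlow functions; only the queue wiring reflects the flow described above.

```python
import queue

BATCH_SIZE = 64
unroll_queue = queue.Queue()            # completed trajectories: FIFO or replay buffer
device_buffer = queue.Queue(maxsize=8)  # prefetched batches staged for training cores

def inference_thread(request_queue, recurrent_states):
    """1. Inference: answer batched actor requests on the inference core(s)."""
    while True:
        actor_ids, observations, rewards, dones, reply = request_queue.get()
        states = [recurrent_states[i] for i in actor_ids]
        # Hypothetical stand-in for one call to the inference TPU core.
        actions, new_states, completed_unrolls = run_inference(
            observations, rewards, dones, states)
        for i, state in zip(actor_ids, new_states):
            recurrent_states[i] = state   # recurrent state never leaves the learner
        reply(actions)                    # only the actions travel back to actors
        for unroll in completed_unrolls:  # fully unrolled trajectories
            unroll_queue.put(unroll)

def prefetch_thread():
    """2. Data prefetching: sample trajectories and stage them on device."""
    while True:
        batch = [unroll_queue.get() for _ in range(BATCH_SIZE)]
        device_buffer.put(batch)

def training_loop():
    """3. Training (main Python thread): synchronous updates across all cores."""
    while True:
        batch = device_buffer.get()
        # Hypothetical stand-in: compute gradients on the training cores and
        # apply them to the models on all cores (inference and training).
        compute_and_apply_gradients(batch)
```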

The SEED RL architecture allows learners to be scaled to thousands of cores, and the number of actors can be scaled to thousands of machines to fully utilize the learner, making it possible to train at millions of frames per second. SEED RL is based on the TensorFlow 2 API, and its performance is accelerated by TPUs.

To evaluate SEED RL, Google used common DRL benchmarks such as the Arcade Learning Environment, DeepMind Lab environments, and the recently released Google Research Football environment. The results across all environments were remarkable. For instance, on DeepMind Lab, SEED RL achieved 2.4 million frames per second with 64 Cloud TPU cores, which represents an 80x improvement over the previous state-of-the-art distributed agent, IMPALA. Improvements in speed and CPU utilization were also observed.

SEED RL represents a major step forward in massively scalable DRL models. Google Research open sourced the initial SEED RL implementation on GitHub. I can imagine it becoming the underlying model for many practical DRL implementations in the foreseeable future.

