The Evolution of AlphaGo to MuZero


DeepMind recently released their MuZero algorithm, headlined by superhuman ability in 57 different Atari games.

Reinforcement Learning agents that can play Atari games are interesting because, in addition to a visually complex state space, Atari games don’t come with a perfect simulator.

This idea of a “perfect simulator” is one of the key limitations that keeps AlphaGo and subsequent improvements such as AlphaGo Zero and AlphaZero limited to Chess, Shogi, and Go, and unusable for certain real-world applications such as Robotic Control.

Reinforcement Learning problems are framed within Markov Decision Processes (MDPs) depicted below:

Chess, Go, and Shogi come with a simulator that knows exactly how each action transforms the state of the board.
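To make the MDP framing concrete, here is a minimal sketch of the agent–environment loop, using a toy environment invented for illustration (the class and method names are assumptions, not from any specific library):

```python
import random

class ToyEnvironment:
    """A stand-in MDP: 5 abstract states, 2 actions, random transitions."""

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # The environment, not the agent, decides the next state and reward.
        self.state = random.randrange(5)
        reward = 1.0 if self.state == 4 else 0.0
        done = self.state == 4
        return self.state, reward, done

def random_policy(state):
    """The policy is a mapping from state to action; here it is just random."""
    return random.choice([0, 1])

env = ToyEnvironment()
state = env.reset()
done = False
while not done:
    action = random_policy(state)            # policy: state -> action
    state, reward, done = env.step(action)   # environment: (state, action) -> (next state, reward)
```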

The family of algorithms from AlphaGo, AlphaGo Zero, AlphaZero, and MuZero extends this framework by using planning, depicted below:

Integrated planning extends the framing of Reinforcement Learning problems

DeepMind’s AlphaGo, AlphaGo Zero, and AlphaZero exploit having a perfect model of (action, state) → next state to do lookahead planning in the form of Monte Carlo Tree Search (MCTS). MCTS is a perfect complement to using Deep Neural Networks for policy mappings and value estimation because it averages out the errors from these function approximations. MCTS provides a huge boost for AlphaZero in Chess, Shogi, and Go, where you can do perfect planning because you have a perfect model of the environment.

MuZero salvages MCTS planning in domains without a perfect simulator by learning a dynamics model, depicted below:

MuZero’s Monte Carlo Tree Search

MuZero’s approach to Model-Based Reinforcement Learning is to have a parametric model map (s, a) → (s’, r) without exactly reconstructing the pixel space at s’. Contrast that with the image below from “World Models” by Ha and Schmidhuber:

An example of Model-Based RL reconstructing the pixel-space in the model. Image taken from: https://worldmodels.github.io/
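To make that contrast concrete, here is a hedged sketch of the two model interfaces: a World-Models-style model predicts the next observation in pixel space, while a MuZero-style dynamics model predicts only a next hidden state and a reward. The function bodies are placeholders for illustration, not either paper’s actual architecture:

```python
import numpy as np

def pixel_space_model(frame, action):
    """World-Models style: the model's output lives in observation (pixel) space."""
    next_frame = np.zeros_like(frame)          # placeholder for a learned decoder's output
    return next_frame

def latent_dynamics_model(hidden_state, action):
    """MuZero style: the model only predicts a next hidden state and a reward;
    no pixels are ever reconstructed."""
    next_hidden = np.zeros_like(hidden_state)  # placeholder for the learned dynamics function
    reward = 0.0
    return next_hidden, reward

frame = np.zeros((96, 96, 3), dtype=np.float32)    # an observation-sized tensor
hidden = np.zeros(256, dtype=np.float32)           # a much smaller latent vector
print(pixel_space_model(frame, 0).shape)           # (96, 96, 3)
print(latent_dynamics_model(hidden, 0)[0].shape)   # (256,)
```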

This planning algorithm makes MuZero very successful in the Atari domain and could have enormous application potential for other Reinforcement Learning problems. This article will explain the evolution through AlphaGo, AlphaGo Zero, AlphaZero, and MuZero to get a better understanding of how MuZero works. I have also made a video explaining this if you are interested:

AlphaGo

AlphaGo is the first paper in the series, showing that Deep Neural Networks could play the game of Go by predicting a policy (a mapping from state to action) and a value estimate (the probability of winning from a given state). These policy and value networks are used to enhance tree-based lookahead search by selecting which actions to take from given states and which states are worth exploring further.

AlphaGo uses 4 Deep Convolutional Neural Networks: 3 policy networks and a value network. 2 of the policy networks are trained with supervised learning on expert moves.

Supervised learning here means a loss function of the form L(y’, y) between a prediction and a label. In this case, y’ is the action the policy network predicted from a given state, and y is the action the expert human player actually took in that state.
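As a minimal sketch of that objective, assuming the policy network outputs a probability for each of the 361 Go moves (shapes and names are illustrative, not AlphaGo’s actual code):

```python
import numpy as np

def sl_policy_loss(predicted_probs, expert_move):
    """Cross-entropy L(y', y): penalize putting low probability on the expert's move."""
    return -np.log(predicted_probs[expert_move] + 1e-12)

predicted_probs = np.full(361, 1.0 / 361)            # y': a uniform, untrained network
expert_move = 72                                     # y: index of the move the human played
print(sl_policy_loss(predicted_probs, expert_move))  # ~= log(361) ~= 5.89
```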

The rollout policy is a smaller neural network that also takes in a smaller input state representation. As a consequence, the rollout policy models expert moves with significantly lower accuracy than the higher-capacity network. However, its inference time (the time to predict an action given a state) is 2 microseconds compared to 3 milliseconds for the larger network, making it useful for Monte Carlo Tree Search simulations.

The SL policy network is used to initialize the 3rd policy network, which is trained with self-play and policy gradients. Policy gradients describe the idea of optimizing the policy directly with respect to the resulting rewards, in contrast to other RL algorithms that learn a value function and then make the policy greedy with respect to that value function. The policy-gradient-trained network plays against previous iterations of its own parameters, optimizing them to select the moves that result in wins. The self-play dataset is then used to train a value network to predict the winner of a game from a given state.
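Here is a rough, self-contained sketch of the policy-gradient idea in the REINFORCE style. The one-state “network” and the simulated game outcome are placeholders for illustration; AlphaGo’s actual networks and self-play setup are far larger:

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

logits = np.zeros(3)        # a tiny stand-in "policy network": one state, 3 moves
learning_rate = 0.1

for game in range(2000):
    probs = softmax(logits)
    move = np.random.choice(3, p=probs)
    # Pretend move 2 wins more often against the previous version of the policy.
    outcome = 1.0 if (move == 2 and np.random.rand() < 0.7) else -1.0
    # REINFORCE: nudge the policy toward moves whose games were won.
    grad_log_pi = -probs
    grad_log_pi[move] += 1.0
    logits += learning_rate * outcome * grad_log_pi

print(softmax(logits))      # probability mass should concentrate on move 2
```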

The final workhorse of AlphaGo is the combination of policy and value networks in MCTS, depicted below:

The idea of MCTS is to perform lookahead search to get a better estimate of which immediate action to take. This is done by starting from a root node (the current state of the board), expanding that node by selecting an action, and repeating this with the subsequent states that result from each state, action transition. MCTS chooses which edge of the tree to follow based on a Q + u(P) term: a weighted combination of the value estimate of the state, the prior probability the policy network assigned to that move, and a negative weighting on how many times the node has already been visited, since the search is repeated over and over again. Unique to AlphaGo is the use of rollout policy simulations to complement the value network. The rollout policy simulates until the end of the episode, and whether that resulted in a win or a loss is blended with the value function’s estimate of that state using an extra parameter, lambda.
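The Q + u(P) rule can be written compactly. Below is a simplified, PUCT-style sketch of the edge-selection score; the constant and exact functional form are condensed from the papers, so treat this as an illustration rather than AlphaGo’s exact implementation:

```python
import math

def select_action(edges, c_puct=1.0):
    """Pick the child edge maximizing Q + u(P).

    Each edge carries:
      Q -- mean value of simulations through this edge (value net blended with rollouts)
      P -- prior probability the policy network assigned to this move
      N -- visit count, which shrinks the exploration bonus for well-explored edges
    """
    total_visits = sum(e["N"] for e in edges.values())

    def score(e):
        u = c_puct * e["P"] * math.sqrt(total_visits) / (1 + e["N"])
        return e["Q"] + u

    return max(edges, key=lambda a: score(edges[a]))

edges = {
    "move_a": {"Q": 0.4, "P": 0.5, "N": 10},
    "move_b": {"Q": 0.3, "P": 0.4, "N": 1},    # rarely visited, so u(P) is still large
}
print(select_action(edges))                    # "move_b" wins on the exploration bonus
```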

AlphaGo Zero

AlphaGo Zero significantly improves the AlphaGo algorithm by making it more general and starting from “Zero” human knowledge. AlphaGo Zero avoids the supervised-learning initialization from expert moves and combines the value and policy networks into a single neural network. This network is also scaled up, using a ResNet rather than the simpler convolutional architecture in AlphaGo. The contribution of the ResNet performing both value and policy mappings is evident in the diagram below, comparing the dual-task ResNet to separate-task CNNs:

One of the most interesting characteristics of AlphaGo Zero is the way it trains its policy network using the action distribution found by MCTS, depicted below:

The action distribution found by MCTS is used as a supervision signal to update the policy network. This is a clever idea, since MCTS produces a better action distribution through lookahead search than the policy network’s instant mapping from state to action.
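A minimal sketch of how the MCTS visit counts become the training target for the policy head (array shapes and names are illustrative):

```python
import numpy as np

def policy_target_from_visits(visit_counts, temperature=1.0):
    """Turn MCTS visit counts into the improved action distribution pi."""
    counts = np.asarray(visit_counts, dtype=np.float64) ** (1.0 / temperature)
    return counts / counts.sum()

def policy_loss(network_probs, mcts_pi):
    """Cross-entropy pulling the network's output p toward the search distribution pi."""
    return -np.sum(mcts_pi * np.log(network_probs + 1e-12))

visit_counts = [90, 5, 5]                    # lookahead search strongly preferred move 0
network_probs = np.array([0.4, 0.3, 0.3])    # the raw policy head was much less sure
pi = policy_target_from_visits(visit_counts)
print(pi, policy_loss(network_probs, pi))
```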

AlphaZero

AlphaZero is the first step towards generalizing the AlphaGo family outside of Go, looking at changes needed to play Chess and Shogi as well. This requires formulating input state and output action representations for the residual neural network.

In AlphaGo, the state representation uses a few handcrafted feature planes, depicted below:

AlphaGo Zero uses a more general representation, simply passing in the stone positions from the previous 8 board states for both players and a binary feature plane telling the agent which player it is controlling, depicted below:

AlphaZero uses a similar idea to encode the input state representation for Chess and Shogi, depicted below:
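As a concrete, simplified sketch of this style of encoding for Go: stone positions for both players over the last 8 board states, plus a colour plane, stack into a 17 × 19 × 19 tensor. The helper below is an illustrative approximation, not the published preprocessing code:

```python
import numpy as np

def encode_state(history, current_player, board_size=19, num_past=8):
    """Stack binary stone planes for both players over the last `num_past` positions,
    plus one plane indicating whose turn it is: shape (2 * num_past + 1, 19, 19)."""
    planes = []
    for board in history[-num_past:]:              # each board: +1 own stones, -1 opponent's
        planes.append((board == 1).astype(np.float32))
        planes.append((board == -1).astype(np.float32))
    while len(planes) < 2 * num_past:              # pad missing early-game history with zeros
        planes.append(np.zeros((board_size, board_size), dtype=np.float32))
    planes.append(np.full((board_size, board_size), float(current_player == 1), dtype=np.float32))
    return np.stack(planes)

history = [np.zeros((19, 19), dtype=np.int8)]          # an empty board at move 0
print(encode_state(history, current_player=1).shape)   # (17, 19, 19)
```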

AlphaZero also makes some subtler changes to the algorithm, such as the way the self-play champion is crowned and the elimination of Go-specific data augmentations such as reflections and rotations.

MuZero

This leads us to the current state-of-the-art in the series, MuZero. MuZero presents a very powerful generalization of the algorithm that allows it to learn without a perfect simulator. Chess, Shogi, and Go are all examples of games that come with a perfect simulator: if you move your pawn forward 2 positions, you know exactly what the resulting state of the board will be. You can’t say the same thing about applying 30 N of force to a given joint in complex dexterous manipulation tasks like OpenAI’s Rubik’s Cube hand.

The diagram below illustrates the key ideas of MuZero:

Diagram A shows the pipeline of using a representation function h to map raw observations into a hidden state s0 that is used for tree-based planning. In MuZero, the combined value / policy network reasons in this hidden state space, so rather than mapping raw observations to actions or value estimates, it takes these hidden states as inputs. The dynamics function g learns to map from a hidden state and an action to a future hidden state.
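Here is a minimal sketch of the three learned functions and how a planning step composes them. The tiny random linear “networks” below are stand-ins purely for illustration; the real h, g, and f are deep residual networks:

```python
import numpy as np

rng = np.random.default_rng(0)
W_h = rng.normal(size=(8, 4))    # representation h: observation -> hidden state
W_g = rng.normal(size=(8, 9))    # dynamics g: (hidden state, action) -> next hidden state
W_f = rng.normal(size=(3, 8))    # prediction f: hidden state -> (policy logits, value)

def h(observation):
    return np.tanh(W_h @ observation)

def g(hidden, action):
    x = np.concatenate([hidden, [float(action)]])
    next_hidden = np.tanh(W_g @ x)
    reward = float(next_hidden.sum())        # toy reward head for illustration
    return next_hidden, reward

def f(hidden):
    out = W_f @ hidden
    policy_logits, value = out[:2], float(out[2])
    return policy_logits, value

# One imagined planning step: embed the observation, then roll forward entirely in hidden space.
s0 = h(np.ones(4))
s1, r1 = g(s0, action=1)
policy_logits, value = f(s1)
print(s1.shape, r1, policy_logits.shape, value)
```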

Diagram B shows how the policy network is similarly trained by mimicking the action distribution produced by MCTS as first introduced in AlphaGo Zero.

Diagram C shows how this system is trained. Each of the three neural networks is trained in a joint optimization of the difference between the value network’s prediction and the actual return, the difference between the intermediate reward experienced and the one predicted by the dynamics model, and the difference between the MCTS action distribution and the policy mapping.
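Written out, the MuZero paper’s joint objective over K unrolled steps looks roughly like the following (notation paraphrased: u is the observed reward, z the return target, pi the MCTS action distribution, and r, v, p the corresponding predictions from the model):

```latex
l_t(\theta) = \sum_{k=0}^{K} \Big[\, l^{r}\!\left(u_{t+k},\, r_t^{k}\right)
            + l^{v}\!\left(z_{t+k},\, v_t^{k}\right)
            + l^{p}\!\left(\pi_{t+k},\, p_t^{k}\right) \Big]
            + c\,\lVert \theta \rVert^{2}
```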

How does the representation function h get trained in this optimization loop?

The representation function h comes into play in this joint optimization equation through back-propagation through time. Let’s say you are taking the difference between the MCTS action distribution pi(s1) and the policy distribution p(s1). The output of p(s1) is a result of p(g(s0, a1)), which is a result of p(g(h(raw_input), a1)). This is how backprop through time sends update signals all the way back into the hidden representation function as well.
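A toy sketch of that gradient path, with h, g, and the policy head reduced to single scalar weights so the chain rule can be written by hand (everything here is invented for illustration; in practice an autograd framework handles this):

```python
# Toy scalar stand-ins for the three functions.
w_h, w_g, w_p = 0.5, 0.8, 1.2
raw_input, action, mcts_target = 2.0, 1.0, 0.3

# Unrolled forward pass: s0 = h(x), s1 = g(s0, a1), prediction = p(s1)
s0 = w_h * raw_input
s1 = w_g * (s0 + action)
pred = w_p * s1
loss = 0.5 * (pred - mcts_target) ** 2       # stand-in for the policy loss at step 1

# Backprop through time: the error flows through p, then g, and all the way into h.
dloss_dpred = pred - mcts_target
dloss_ds1 = dloss_dpred * w_p
dloss_ds0 = dloss_ds1 * w_g
dloss_dw_h = dloss_ds0 * raw_input           # the representation function h receives a gradient
print(dloss_dw_h)
```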

AlphaGo → AlphaGo Zero → AlphaZero → MuZero

I hope this article helped clarify how MuZero works within the context of the previous algorithms, AlphaGo, AlphaGo Zero, and AlphaZero! Thanks for reading!

