Hierarchical Reinforcement Learning: FeUdal Networks

Letting computers see the bigger picture

Every task can be naturally divided into subtasks. When we prepare dinner, we don’t micromanage every little movement of our hands. Instead, we segment the task into smaller pieces (taking out ingredients, cutting, cooking, serving) and then focus on how to accomplish each one individually. This concept of decomposition is what inspired hierarchical approaches to reinforcement learning. Specifically, FeUdal Networks (FuNs) divide computation between a manager and a worker using a modular neural network. The manager assigns the worker specific, local goals, and the worker learns to achieve them; at the same time, the manager learns how to assign these subgoals so as to best accomplish the “bigger-picture” task.

In this article, we outline the architecture, intuition, and mathematics behind FeUdal Networks.

FuN’s Architecture

FuN is a modular neural network (MNN) consisting of two independent networks: the manager and the worker. Here, we assume an environment with a discrete action space.

FuN Modular Neural Network Architecture

The architecture can seem dense, so here’s a breakdown of what each of the nodes on the graph represents:

Variable Definitions

  • x: the raw observation from the environment
  • z: a shared intermediate representation of x, fed to both the manager and the worker
  • s: the manager’s latent state, a higher-dimensional encoding of the environment
  • g: the goal output by the manager, a direction of desired state change
  • c: the horizon over which goals are pooled
  • w: the pooled goal embedding, a k-dimensional vector in the worker’s space
  • U: the worker’s output matrix, with one k-dimensional row per action
  • π: the worker’s probability distribution over actions

It’s useful to take a step back and dissect what’s going on. Given a task to learn, we divide labor between two entities: the “manager” and the “worker” sections of our MNN. The variable z is just another representation of our observation x: it carries the same information, just encoded differently. We pass that same information to both the worker and the manager, but each handles it somewhat differently.
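To make this concrete, here’s a minimal PyTorch sketch of a shared perceptual module that produces z. The class name, layer sizes, and the fully connected layer are our own illustrative choices; the paper applies a small convolutional network to pixel observations.

```python
import torch
import torch.nn as nn

class Perception(nn.Module):
    """Shared perceptual module: maps a raw observation x to the embedding z.

    Minimal sketch: we assume a flat observation of size obs_dim, whereas the
    paper uses a convolutional network over pixels.
    """
    def __init__(self, obs_dim: int, d: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, d), nn.ReLU())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # z carries the same information as x, just in a different representation
        return self.net(x)
```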

The Manager

After receiving z, the manager creates a latent state s by passing z through another function. This latent state is yet another representation of the environment, but in a higher-dimensional space. The manager operates in a much higher-dimensional vector space than the worker, reflecting that it reasons about the bigger picture rather than purely local information.

Then, the manager pushes this latent state s into a recurrent neural network (RNN), which outputs a goal for the worker to achieve. This goal represents a desired relative change in state. More formally:

Manager RNN Forward Function:

s_t = f_Mspace(z_t),    h_t^M, ĝ_t = f_Mrnn(s_t, h_{t-1}^M),    g_t = ĝ_t / ||ĝ_t||

where h^M represents the manager RNN’s hidden state. After normalizing the goal, we do something special: we pool the goals over a finite horizon c and pass the result through a linear transformation with no bias. This effectively transitions from the manager’s vector space to the worker’s vector space and encodes a representation of the previous c goals assigned by the manager (see the code sketch after the list below).

Pooling Followed By Linear Transformation:

w_t = φ( Σ_{i = t-c}^{t} g_i ),    where φ is a linear map with no bias into R^k

The vector w has k dimensions and two key properties:

  • Since the transformation has no bias, it can never produce a constant, non-zero vector. As a result, the worker will never ignore the manager’s input. There is always some “meaning” for the worker to extract.
  • Due to pooling, the manager’s conditioning varies smoothly over time. This prevents any erratic changes in goals that the worker cannot understand or handle.
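Here’s a rough PyTorch sketch of the manager and its goal-embedding step. It simplifies a few things: the paper uses a dilated LSTM for the manager, which we replace with a plain LSTMCell, and the module names (f_mspace, f_mrnn, phi) and the sizes d and k are our own illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Manager(nn.Module):
    """Manager sketch: z -> latent state s -> unit goal vector g."""
    def __init__(self, d: int = 256, k: int = 16):
        super().__init__()
        self.f_mspace = nn.Sequential(nn.Linear(d, d), nn.ReLU())  # z -> s
        self.f_mrnn = nn.LSTMCell(d, d)      # recurrent core (a dilated LSTM in the paper)
        self.phi = nn.Linear(d, k, bias=False)  # no-bias projection into the worker's space

    def forward(self, z, h):
        s = self.f_mspace(z)                 # latent state in the manager's space
        h_new = self.f_mrnn(s, h)            # (hidden, cell) state of the RNN
        g = F.normalize(h_new[0], dim=-1)    # goal: unit direction of desired state change
        return s, g, h_new

    def goal_embedding(self, goals):
        """Pool the last c goals and project the sum into the worker's space (w)."""
        return self.phi(torch.stack(goals, dim=0).sum(dim=0))  # goals: list of [batch, d]
```

The bias-free phi and the sum over the stored goals correspond directly to the two properties of w listed above.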

The Worker

Once the worker receives z, it passes z into a different recurrent neural network. However, instead of outputting a vector, the worker’s RNN outputs a matrix U, with one row per possible action and k columns.

Worker RNN Forward Function:

h_t^W, U_t = f_Wrnn(z_t, h_{t-1}^W),    where U_t has one k-dimensional row per action

where h^W represents the worker RNN’s hidden state. To develop intuition for why we output a matrix instead of a vector, look at the equation below:

Action Output for Worker:

π_t = SoftMax(U_t w_t)

This output is the probability distribution over the worker’s actions. To see why, assume that each row of U encodes the resulting state if we chose the corresponding action. Then each element of the vector Uw is the dot product between a row of U and the encoded goal w. Treating the dot product as a measure of similarity, and noting that SoftMax preserves relative ordering, each element of this distribution reflects how well the corresponding action aligns with the manager’s goal. It therefore makes sense to sample actions from this distribution.
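A minimal PyTorch sketch of the worker, again with hypothetical sizes (num_actions, k) and a plain LSTMCell standing in for the worker’s LSTM:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Worker(nn.Module):
    """Worker sketch: z -> matrix U (num_actions x k); action probabilities = SoftMax(U w)."""
    def __init__(self, d: int = 256, k: int = 16, num_actions: int = 4):
        super().__init__()
        self.k, self.num_actions = k, num_actions
        self.f_wrnn = nn.LSTMCell(d, num_actions * k)

    def forward(self, z, w, h):
        h_new = self.f_wrnn(z, h)
        U = h_new[0].view(-1, self.num_actions, self.k)     # one k-dimensional row per action
        logits = torch.bmm(U, w.unsqueeze(-1)).squeeze(-1)   # dot product of each row with the goal embedding w
        pi = F.softmax(logits, dim=-1)                       # distribution over actions
        return pi, h_new
```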

The entire forward process for FuN is shown below.

FuN Forward
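As a sketch of that forward pass, the hypothetical modules above can be chained for a single time step (pooling here over a trivial horizon of one goal):

```python
import torch

# Wiring the sketches above together for one time step.
percept, manager, worker = Perception(obs_dim=64), Manager(), Worker()

x = torch.randn(1, 64)                         # dummy observation
z = percept(x)                                 # shared representation
s, g, h_m = manager(z, None)                   # manager proposes a goal direction
w = manager.goal_embedding([g])                # pool recent goals, project to the worker's space
pi, h_w = worker(z, w, None)                   # worker turns the goal into action probabilities
action = torch.multinomial(pi, num_samples=1)  # sample an action from the distribution
```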

How It Learns

Consider what happens after executing an action: we receive a reward and a new observation. We could then train the entire MNN as usual with TD-learning, optimizing over the actions taken by the worker and propagating those gradients into the manager as well. However, this defeats the purpose of hierarchical learning, since the manager’s outputs g would lose all semantic meaning. FuN would be no different from any other network, with g reduced to just another internal latent variable. As a result, we instead train the manager and the worker independently.

The Manager

Intuitively, we want the manager to give the worker goals that not only maximize reward over time but are also achievable by the worker. Therefore, we maximize a similarity measure between the change in state produced by the worker’s behavior and the goal set by the manager.

The manager’s section of the MNN is updated according to the equation:

Manager Policy Update Equation:

∇g_t = A_t^M ∇_θ d_cos( s_{t+c} - s_t , g_t(θ) ),    where A_t^M = R_t - V_t^M(x_t)

where d_cos is the cosine similarity between two vectors, A^M is the manager’s advantage function, and c is the manager’s horizon. By weighting the similarity measure with the advantage, this update rule trades off feasibility against payoff. The advantage is computed using the manager’s internal value function, which is updated much as in other actor-critic algorithms. The manager’s reward function is defined depending on the task at hand.
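A hedged sketch of this update written as a loss term; the function name and the choice to detach the state change and the advantage (so gradients flow only through the goal) are our own:

```python
import torch
import torch.nn.functional as F

def manager_loss(s_t, s_tc, g_t, advantage):
    """Transition policy-gradient term for the manager.

    s_t, s_tc : latent states at times t and t + c
    g_t       : goal emitted at time t (depends on the manager's parameters)
    advantage : A_t^M = R_t - V_t^M(x_t), treated here as a constant
    """
    # Cosine similarity between the realized state change and the proposed goal;
    # the state change is treated as given, so gradients flow only through g_t.
    similarity = F.cosine_similarity((s_tc - s_t).detach(), g_t, dim=-1)
    # Gradient ascent on advantage-weighted similarity = descent on its negative.
    return -(advantage.detach() * similarity).mean()
```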

The Worker

We want to encourage the worker to follow the manager’s provided goal. As a result, let’s define an intrinsic reward:

Worker Intrinsic Reward:

r_t^I = (1/c) Σ_{i=1}^{c} d_cos( s_t - s_{t-i} , g_{t-i} )

This reward is an average of how closely the worker has followed the manager’s instructions over a finite horizon. The worker is trained to maximize a weighted sum of the environment reward and this intrinsic reward. Using this combined signal, we train the worker’s value function, just as we did for the manager’s. Then, we update the worker’s policy using:

Worker Policy Update Equation:

∇π_t = A_t^D ∇_θ log π(a_t | x_t; θ),    where A_t^D = R_t + α R_t^I - V_t^D(x_t)

where we maximize the log probabilities of the chosen actions, scaled by the worker’s advantage function A^D, in which α weights the intrinsic return against the environment return. This is analogous to typical actor-critic algorithms.
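A short sketch of the intrinsic reward and the worker’s loss term, with hypothetical function names and α exposed as a tunable weight:

```python
import torch
import torch.nn.functional as F

def intrinsic_reward(states, goals, t, c):
    """Average cosine similarity between recent state changes and past goals.

    states, goals: lists of [batch, dim] tensors indexed by time step (assumes t >= 1).
    """
    sims = [
        F.cosine_similarity(states[t] - states[t - i], goals[t - i], dim=-1)
        for i in range(1, c + 1) if t - i >= 0
    ]
    return torch.stack(sims).mean(dim=0)

def worker_loss(log_prob_action, env_return, intrinsic_return, value, alpha=0.5):
    """Advantage actor-critic term for the worker; alpha weights the intrinsic return."""
    advantage = env_return + alpha * intrinsic_return - value.detach()
    return -(advantage * log_prob_action).mean()
```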

The paper notes that the manager and worker can use different discount factors. As a result, the worker can focus on immediate, local rewards while the manager focuses on long-term events.

The Results

FuN Performance on DeepMind Lab’s Labyrinth

The FuN paper [1] uses a range of experiments to demonstrate the algorithm’s learning ability, most notably on Montezuma’s Revenge and on DeepMind Lab’s games. Using recurrent LSTM networks trained with A3C as baselines, FuN outperforms these baselines in both sets of experiments.

FuN Performance on Montezuma’s Revenge

Even more impressively, FuN learns semantically meaningful subgoals. In the visualization above, the tall bars represent goals that the manager issues consistently, each of which corresponds to a major “turning point” in the game.

That’s It!

FeUdal Networks are an important stepping stone for reinforcement learning, giving agents the ability to autonomously decompose a task into semantically meaningful pieces. Next time, we’ll explore how this algorithm can be extended to various multi-agent frameworks.

References

[1] A. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu, FeUdal Networks for Hierarchical Reinforcement Learning (2017), ICML ‘17.

