Hierarchical Reinforcement Learning: FeUdal Networks
Letting computers see the bigger picture
Every task can be naturally divided into subtasks. When we prepare dinner, we don’t micromanage every little movement that our hands make. Instead, we segment the task into smaller pieces (taking out ingredients, cutting, cooking, serving), and then we focus on how to accomplish each one individually. This concept of decomposition is what inspired hierarchical approaches to reinforcement learning. Specifically, FeUdal Networks (FuNs) divide computation between a manager and a worker by using a modular neural network. The manager assigns the worker specific, local goals, which the worker learns to accomplish optimally. At the same time, the manager learns how to assign these subgoals so as to best accomplish a “bigger-picture” task.
In this article, we outline the architecture, intuition, and mathematics behind FeUdal Networks.
FuN’s Architecture
FuN is a modular neural network (MNN) consisting of two independent networks: the manager and the worker. Here, we describe an environment with a discrete action space.
The architecture (diagrammed in the original paper) can seem dense, so it’s useful to take a step back and dissect what’s going on. Given a task to learn, we divide labor between two entities: the “manager” and the “worker” sections of our MNN. The variable z is just another representation of our observation x; in other words, z carries the same information as x, just encoded differently. We pass that same information to both the worker and the manager, but each handles it somewhat differently.
The Manager
After receiving z, the manager creates a latent state s by passing z through another function. This latent state is yet another representation of the environment, but in a higher dimension. The manager operates in a much higher-dimensional vector space than the worker, reflecting that it reasons about the bigger picture rather than solely about local information.
Then, the manager passes this latent state s into a recurrent neural network (RNN), which outputs a goal for the worker to achieve. This goal represents a relative change in state for the worker. More formally:

h^M_t, ĝ_t = f^{Mrnn}(s_t, h^M_{t-1}), with g_t = ĝ_t / ||ĝ_t||,

where h^M represents the manager RNN’s hidden state. After normalizing the goal, we do something special: we pool the goals over a finite horizon c and then pass the result through a linear transformation φ with no bias,

w_t = φ( Σ_{i=t-c}^{t} g_i ).

This effectively transitions from the manager’s vector space to the worker’s vector space and encodes a representation of the previous c goals assigned by the manager.
The vector w has k dimensions and two key properties:
- Since the transformation has no bias, it can never produce a constant, non-zero vector. As a result, the worker will never ignore the manager’s input. There is always some “meaning” for the worker to extract.
- Due to pooling, the manager’s conditioning varies smoothly over time. This prevents any erratic changes in goals that the worker cannot understand or handle.
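To make the manager’s data flow concrete, here is a minimal PyTorch-style sketch. It is only an illustration under simplifying assumptions: a plain LSTM cell stands in for the paper’s dilated LSTM, the goal is taken directly from the (normalized) hidden state, and the class and dimension names (Manager, z_dim, s_dim, k, horizon_c) are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Manager(nn.Module):
    """Sketch of the manager: z -> latent state s -> normalized goal g -> pooled goal w."""
    def __init__(self, z_dim, s_dim, k, horizon_c):
        super().__init__()
        self.f_space = nn.Linear(z_dim, s_dim)      # z -> higher-dimensional latent state s
        self.rnn = nn.LSTMCell(s_dim, s_dim)        # plain LSTM standing in for the dilated LSTM
        self.phi = nn.Linear(s_dim, k, bias=False)  # no-bias projection into the worker's k-dim space
        self.horizon_c = horizon_c

    def forward(self, z, hidden, goal_history):
        s = F.relu(self.f_space(z))                 # latent state s_t
        h, c = self.rnn(s, hidden)                  # recurrent core
        g = F.normalize(h, dim=-1)                  # unit-norm goal g_t (simplified: goal = hidden state)
        goal_history = goal_history + [g]
        pooled = torch.stack(goal_history[-self.horizon_c:]).sum(dim=0)  # sum of the last c goals
        w = self.phi(pooled)                        # conditioning vector w_t handed to the worker
        return s, g, w, (h, c), goal_history
```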
The Worker
Once the worker receives z, it passes z into its own recurrent neural network. However, instead of outputting a vector, the worker’s RNN outputs a matrix U with one row for each possible action and k columns:

h^W_t, U_t = f^{Wrnn}(z_t, h^W_{t-1}),

where h^W represents the worker RNN’s hidden state. To develop intuition for why we output a matrix instead of a vector, we look at the equation below:

π_t = SoftMax(U_t w_t).
This output is the probability distribution over the worker’s actions. However, let’s take a slightly different perspective. Assume that each row of U encodes the resulting state if we chose the corresponding action. Then, each element of the vector Uw is the dot product between a row of U and the encoded goal w. Thinking of the dot product as a measure of similarity, and knowing that SoftMax preserves relative ordering, the elements of this vector are proportional to the probability of achieving the manager’s goal given that the worker chooses the corresponding action. As a result, it makes sense to sample actions according to this distribution.
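A matching sketch of the worker, under the same assumptions as the manager sketch above (hypothetical Worker class, batch-first tensors), might look like this:

```python
class Worker(nn.Module):
    """Sketch of the worker: z -> action-embedding matrix U -> SoftMax(Uw) over actions."""
    def __init__(self, z_dim, hidden_dim, num_actions, k):
        super().__init__()
        self.num_actions, self.k = num_actions, k
        self.rnn = nn.LSTMCell(z_dim, hidden_dim)
        self.to_U = nn.Linear(hidden_dim, num_actions * k)    # flattened matrix U

    def forward(self, z, hidden, w):
        h, c = self.rnn(z, hidden)
        U = self.to_U(h).view(-1, self.num_actions, self.k)   # one k-dim row per action
        # Score each row of U against the goal embedding w with a dot product,
        # then SoftMax turns the scores into a probability distribution over actions.
        logits = torch.bmm(U, w.unsqueeze(-1)).squeeze(-1)     # shape: (batch, num_actions)
        return F.softmax(logits, dim=-1), (h, c)
```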
The entire forward process for FuN chains these pieces together: the shared perception produces z, the manager turns z into goals and the pooled conditioning vector w, and the worker turns z and w into a distribution over actions.
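To show how the pieces fit together, here is a hypothetical single time step that wires the two sketches above into one forward pass; the dimensions are made up purely for illustration.

```python
z_dim, s_dim, k, num_actions, c = 256, 256, 16, 4, 10
manager = Manager(z_dim, s_dim, k, horizon_c=c)
worker = Worker(z_dim, hidden_dim=256, num_actions=num_actions, k=k)

z = torch.randn(1, z_dim)                                  # shared perceptual embedding z_t
m_hidden = (torch.zeros(1, s_dim), torch.zeros(1, s_dim))  # manager LSTM state
w_hidden = (torch.zeros(1, 256), torch.zeros(1, 256))      # worker LSTM state

s, g, w, m_hidden, goals = manager(z, m_hidden, goal_history=[])
pi, w_hidden = worker(z, w_hidden, w)
action = torch.multinomial(pi, num_samples=1)              # sample an action from SoftMax(Uw)
```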
How It Learns
Let’s consider how training works. After executing an action, we receive a reward and another set of observations. We could then train the entire MNN as usual with TD-learning, optimizing over the actions taken by the worker, and afterward propagate these gradients to the manager as well. However, this defeats the purpose of hierarchical learning, since the manager’s outputs g would lose all semantic meaning; FuN would be no different from any other network, as g would become just another internal latent variable. As a result, we instead train the manager and worker independently.
The Manager
Intuitively, we want the manager to give the worker goals that not only maximize reward over time but are also achievable by the worker. Therefore, we maximize a similarity measure between the worker’s change in state and the goal set by the manager.
The manager’s section of the MNN is updated according to the equation:

∇g_t = A^M_t ∇_θ d_cos(s_{t+c} - s_t, g_t(θ)),

where d_cos represents the cosine similarity between two vectors, A^M_t is the manager’s advantage function, and c is the manager’s horizon. By weighting the similarity measure with the advantage, this update rule effectively finds the optimal balance between feasibility and pay-off. The advantage is computed using the manager’s internal value function and is updated in much the same way as in other actor-critic algorithms. The manager’s reward function depends on the task at hand.
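As a rough illustration, a loss term implementing this update might look like the sketch below. It assumes the latent states, goals, and advantages have already been collected; `manager_loss` is a hypothetical helper, not the paper’s implementation.

```python
def manager_loss(s_t, s_t_plus_c, g_t, manager_advantage):
    """Advantage-weighted cosine similarity between the realised state change over
    horizon c and the goal that was emitted at time t."""
    state_change = (s_t_plus_c - s_t).detach()             # treat the observed transition as fixed
    similarity = F.cosine_similarity(state_change, g_t, dim=-1)
    # Gradient ascent on A_t * d_cos(...) equals gradient descent on its negative.
    return -(manager_advantage.detach() * similarity).mean()
```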
The Worker
We want to encourage the worker to follow the goals provided by the manager. As a result, we define an intrinsic reward:

r^I_t = (1/c) Σ_{i=1}^{c} d_cos(s_t - s_{t-i}, g_{t-i}).

This reward averages how closely the worker’s recent state changes follow the manager’s instructions over a finite horizon. The worker is trained to maximize a weighted sum of the environment reward and this intrinsic reward, R_t + α R^I_t. Using this mixed return, we train the worker’s value function, similarly to the manager’s. Then, we update the worker’s policy using:

∇π_t = A^D_t ∇_θ log π(a_t | x_t; θ), where A^D_t = R_t + α R^I_t - V^D_t(x_t; θ).

That is, we maximize the log probabilities of the chosen actions scaled by the worker’s advantage, which is analogous to typical actor-critic algorithms.
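Putting the two pieces together, a sketch of the intrinsic reward and the worker’s policy-gradient loss could look like this; both helpers are hypothetical, and `states` and `goals` are assumed to be lists indexed by time step.

```python
def intrinsic_reward(states, goals, t, c):
    """Average cosine similarity between the worker's recent state changes and the
    goals the manager issued over the last c steps."""
    sims = [F.cosine_similarity(states[t] - states[t - i], goals[t - i], dim=-1)
            for i in range(1, c + 1)]
    return torch.stack(sims).mean(dim=0)

def worker_loss(log_prob_action, worker_advantage):
    """Standard actor-critic policy gradient; the advantage uses the mixed return
    R_t + alpha * R^I_t minus the worker's value estimate."""
    return -(worker_advantage.detach() * log_prob_action).mean()
```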
The paper also notes that the manager and worker can use different discount factors. As a result, the worker can focus more on immediate, local rewards while the manager focuses on long-term events.
The Results
The paper on FuNs [1] presents many experiments to show the algorithm’s robust learning ability, most notably on Montezuma’s Revenge and DeepMind Lab’s games. With recurrent LSTM networks trained with A3C as baselines, FuN outperforms these methods in both sets of experiments.
Even more incredibly, FuN learns semantically meaningful subgoals. In the paper’s visualizations, the tall bars represent consistently administered goals from the manager, each of which corresponds to a big “turning point” in the game.
That’s It!
FeUdal Networks are a significant stepping stone for reinforcement learning, giving agents the ability to autonomously decompose a task into semantically meaningful subgoals. Next time, we’ll explore how this algorithm can be extended to various multi-agent frameworks.
References
[1] A. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu, FeUdal Networks for Hierarchical Reinforcement Learning (2017), ICML ‘17.