How DeepMind’s UNREAL Agent Performed 9 Times Better Than Experts on Atari



Auxiliary Control Tasks

We can think of auxiliary tasks as “side quests.” Although they don’t directly help achieve the overall goal, they help the agent learn about environment dynamics and extract relevant information. In turn, that helps the agent learn how to achieve the desired overall end state. We can also view them as additional pseudo-reward functions for the agent to interact with.

Overall, the goal is to maximize the sum of two terms:

  1. The expected cumulative extrinsic reward
  2. The expected cumulative sum of auxiliary rewards
$$\underset{\theta}{\arg\max}\;\; \mathbb{E}_{\pi}\!\left[R_{1:\infty}\right] \;+\; \lambda_c \sum_{c \in \mathcal{C}} \mathbb{E}_{\pi_c}\!\left[R^{(c)}_{1:\infty}\right]$$

where the superscript c denotes an auxiliary control task reward. Here are the two control tasks used by UNREAL:

  • Pixel Changes (Pixel Control): The agent tries to maximize changes in pixel values since these changes often correspond to important events.
  • Network Features (Feature Control): The agent tries to maximize the activation of all units in a given layer. This can force the policy and value networks to extract more task-relevant, high-level information.

For more details on how these tasks are defined and learned, feel free to skim the paper [1]. For now, just know that the agent tries to learn accurate Q-value functions for each auxiliary task, using auxiliary rewards defined by the user.
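To make the pixel-control pseudo-reward concrete, here is a minimal numpy sketch. This is an illustration of the idea (mean absolute pixel change per spatial cell between consecutive frames), not DeepMind's exact preprocessing; the cell size and grayscale input are assumptions.

```python
import numpy as np

def pixel_control_rewards(frames, cell=4):
    """Auxiliary rewards for the pixel-control task (illustrative sketch).

    frames: (T, H, W) grayscale observations; H and W divisible by `cell`.
    Returns a (T-1, H//cell, W//cell) array: the mean absolute pixel
    change in each non-overlapping cell between consecutive frames.
    """
    # Absolute per-pixel change between frame t and t+1
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0))  # (T-1, H, W)
    t, h, w = diffs.shape
    # Split the image into cell x cell regions and average within each
    cells = diffs.reshape(t, h // cell, cell, w // cell, cell)
    return cells.mean(axis=(2, 4))
```

Each cell's reward is then the target the corresponding pixel-control Q-head learns to predict and maximize.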

Okay, perfect! Now we just add the extrinsic and auxiliary rewards together and run A3C using the sum as a newly defined reward. Right?

How UNREAL is Clever

In actuality, UNREAL does something different. Instead of training a single policy to optimize this combined reward, it trains a separate policy for each auxiliary task on top of the base A3C policy. While all auxiliary task policies share some network components with the base A3C agent, each also adds individual components that define a separate policy.

For example, the “Pixel Control” task has a deconvolutional network after the shared convolutional network and LSTM. The output defines the Q-values for the pixel control policy. (Skim [1] for details on the implementation)
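The shared-trunk-plus-separate-heads idea can be sketched in plain numpy. The layer sizes, the single linear "trunk" standing in for the conv+LSTM encoder, and the flat linear layer standing in for the deconvolutional pixel-control head are all illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(x, w, b):
    return x @ w + b

# Shared trunk: a stand-in for the conv+LSTM encoder (sizes are illustrative)
w_shared = rng.normal(size=(64, 32), scale=0.1); b_shared = np.zeros(32)

# Separate heads built on the shared features
w_policy = rng.normal(size=(32, 6), scale=0.1);  b_policy = np.zeros(6)   # A3C policy logits
w_value  = rng.normal(size=(32, 1), scale=0.1);  b_value  = np.zeros(1)   # A3C value estimate
w_pixctl = rng.normal(size=(32, 49 * 6), scale=0.1)                        # stand-in for the
b_pixctl = np.zeros(49 * 6)                                                # deconv Q-value head

obs_features = rng.normal(size=(64,))
h = np.tanh(linear(obs_features, w_shared, b_shared))   # shared representation

logits  = linear(h, w_policy, b_policy)                 # only this head picks env actions
value   = linear(h, w_value, b_value)
pixel_q = linear(h, w_pixctl, b_pixctl).reshape(49, 6)  # 7x7 cells, one Q-value per action
```

The key point is that gradients from the pixel-control loss flow through `w_shared`, the same parameters the A3C policy and value heads read from.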

Each of the policies optimizes an n-step Q-learning loss:

$$\mathcal{L}^{(c)}_Q = \mathbb{E}\!\left[\left(R_{t:t+n} + \gamma^{\,n} \max_{a'} Q^{(c)}(s', a', \theta^-) - Q^{(c)}(s, a, \theta)\right)^2\right]$$
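As a sanity check on the n-step target, here is a small numpy sketch of the squared n-step Q-learning error for one auxiliary task. The function name and array layout are my own; the bootstrap value stands in for the target network's $\max_{a'} Q^{(c)}(s', a', \theta^-)$.

```python
import numpy as np

def n_step_q_loss(q_values, actions, rewards, q_bootstrap, gamma=0.99):
    """Squared n-step Q-learning error for one auxiliary control task.

    q_values:    (n, num_actions) predicted Q-values at steps t..t+n-1
    actions:     (n,) actions taken at those steps
    rewards:     (n,) auxiliary rewards (e.g. pixel changes)
    q_bootstrap: scalar max_a' Q(s_{t+n}, a') from the target network
    """
    n = len(rewards)
    returns = np.zeros(n)
    g = q_bootstrap
    # Work backwards: R_k = r_k + gamma * R_{k+1}, seeded by the bootstrap
    for k in reversed(range(n)):
        g = rewards[k] + gamma * g
        returns[k] = g
    q_taken = q_values[np.arange(n), actions]  # Q-value of each taken action
    return np.mean((returns - q_taken) ** 2)
```

Each auxiliary head minimizes its own copy of this loss while the base A3C agent keeps its usual actor-critic objective.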

Even more surprisingly, we never explicitly use these auxiliary control task policies. Even though we discover which actions optimize each auxiliary task, only the base A3C agent's actions are taken in the environment. You may then think all this auxiliary training was for nothing!

Not quite. The key is that parts of the architecture are shared between the A3C agent and the auxiliary control tasks! As we optimize the auxiliary-task policies, we change parameters that the base agent also uses. This has what I like to call a "nudging effect."

Updating shared components not only helps learn auxiliary tasks but also better equips the agent to solve the overall problem by extracting relevant information from the environment.

In other words, we get more information from the environment than if we did not use auxiliary tasks.

