Tutoring Reinforcement Learning



Reinforcement Learning agents start from scratch, knowing nothing and learning by experience, which is effective but slow. Could we give them some hints to get them started?

This story is based on a paper I co-authored, which you can find here.

Reinforcement Learning (RL) has been shown to achieve amazing results in several tasks such as video games, robotics and recommender systems in recent years. However, these successful results come only after training RL agents for millions of iterations. Until then, the agent’s performance is far from great. In fact, when the agent starts, its behavior is random while it explores the different actions it can take, and even after several iterations and some accumulated experience, the agent will often make mistakes because of the variability of its actions in the environment and unseen states that may arise at any time.

In environments like video games this might not be a problem (even though it means you need time and some serious computing resources), but in real-world systems it is a serious challenge. Think of a robot that uses RL to learn how to move: the robot might take days to learn how to move properly, and it might damage itself if a movement is dangerous.

You’ve got to love the fails though [video from IEEE].

As humans, we do not learn everything from scratch; if we did, we would not be far from cavemen. All the progress we have made has been achieved by learning from others’ achievements and improving upon them. Couldn’t we do something similar with RL? Couldn’t we teach the agent some general rules that it can apply while learning from its own experience?

With my colleagues at NEC, we have been working on this idea and have developed an approach we call Tutor4RL. We published it at AAAI-MAKE: Combining Machine Learning and Knowledge Engineering in Practice this year and you can see it here (unfortunately, the spring symposium was canceled this year, but the papers were still published). Tutor4RL is still a work in progress, so here I present our approach and the initial results we have obtained so far.

Common approaches

As I have said, slow learning is a common challenge in RL and many methods have emerged to address it:

  • Simulations: one common approach is to create a simulation environment in which the agent can experiment and be trained before being deployed in the real environment. This approach is very effective, but creating the simulation requires great effort and we normally end up making assumptions about the real environment that might not always hold. These assumptions hurt the performance of the agent in the real environment.
  • Model-based RL: as opposed to model-free RL, model-based RL builds a model of the environment, which allows the agent to learn more quickly. In the case of robots, we could base our model on physics and the mechanics of the robot. However, building this model requires a great deal of knowledge about the environment and, again, we usually make assumptions that will hurt us. In addition, the resulting agent is specific to its environment: if we want to use it in a different environment, we need to modify its model, which involves extra work.
  • Learning from Demonstrations: in Hester et al. (2018), the authors develop an approach in which an agent learns efficiently from human demonstrations. Learning is accelerated a great deal and the agent achieves state-of-the-art results. However, what happens when we do not have access to the environment beforehand and it is therefore not possible to provide demonstrations? This approach can be combined with Tutor4RL, using the tutor to provide demonstrations once the agent is already deployed in its environment.

Tutor4RL

We want our agent to perform well (or at least decently) from the start, but this is very hard when we do not have access to the environment beforehand; and if we make assumptions about the environment that turn out to be wrong, the agent’s performance will suffer, not only initially but throughout its whole life. However, this does not mean we cannot give the agent some useful information it can use once it is deployed in the environment.

After all, when we learn something we are not given all the details of every situation we will encounter. Instead, we can learn from theoretical ideas and hints other people give us. Can an RL agent do the same?

In an attempt to do this, we’ve modified the RL framework as shown in the figure below:


Standard RL framework compared to the Tutor4RL framework [taken from the original publication].

We’ve added a component we call the Tutor, which contains external knowledge in the form of an ensemble of knowledge functions. These are ordinary programmed functions that take the state and reward as input and output a vector with a value for each action, similar to how a policy maps states to Q-values over actions. There are two types of knowledge functions:

  • Guide functions: these functions express guidance or hints to the agent and are interpreted as suggestions; even if they are wrong, the agent will learn from its own experience that they are not good and will stop following them.
  • Constrain functions: these functions limit the behavior of the agent. They are useful when we are sure the agent should not do something in a certain scenario, perhaps because it might damage itself or cause a dangerous situation. Constrain functions tell the agent what NOT to do, and output a vector with a 1 for each action that can be taken and a 0 for each action that must not be taken. (A minimal sketch of both function types follows below.)
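To make this concrete, here is a minimal sketch of what the two kinds of knowledge functions could look like for a discrete action space. The NumPy representation, the function names, and the placeholder heuristics are my own illustration, not the paper’s implementation:

```python
import numpy as np

NUM_ACTIONS = 4  # e.g. NOOP, FIRE, RIGHT, LEFT in Breakout

def guide_function(state, reward):
    """Guide function: returns a preference value per action.
    Higher values suggest, but do not force, an action."""
    prefs = np.zeros(NUM_ACTIONS)
    # ...a domain heuristic fills in preferences here...
    return prefs

def constrain_function(state, reward):
    """Constrain function: returns a binary mask per action.
    1 = the action may be taken, 0 = the action must not be taken."""
    mask = np.ones(NUM_ACTIONS)
    # ...e.g. mask[dangerous_action] = 0 for unsafe actions...
    return mask
```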

The agent can ask the tutor for its knowledge when it is uncertain about what to do, for example in its initial steps. The tutor replies with the ensemble of its functions’ outputs, which the agent then uses to choose the action to execute, and it learns from this experience. In this way, the tutor guides the agent, but the agent can always discover that a suggestion from the tutor is wrong or, by using an exploration mechanism such as ε-greedy, that there is a better action than the tutor’s choice. Constrain functions, however, are always applied to both the guide functions and the agent’s policy, providing a safety layer that avoids serious errors that might put the agent in danger or have a large negative impact on the task.
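As a rough sketch of how this could be wired up, the snippet below combines the constrain masks with either the guide output or the policy’s Q-values, assuming both are NumPy vectors as in the previous sketch (the function and variable names are illustrative, not the paper’s code):

```python
import numpy as np

def select_action(q_values, guide_prefs, constrain_masks, use_tutor):
    """Choose an action, always applying the constrain masks as a safety layer.

    q_values:        the policy's Q-value estimate for each action
    guide_prefs:     the combined output of the guide functions
    constrain_masks: list of binary masks (1 = allowed, 0 = forbidden)
    use_tutor:       whether to follow the tutor's guidance on this step
    """
    # An action is allowed only if every constrain function allows it.
    mask = np.ones_like(q_values, dtype=float)
    for m in constrain_masks:
        mask = np.minimum(mask, m)

    # The constrain mask is applied to both the guide output and the policy.
    scores = guide_prefs if use_tutor else q_values
    masked = np.where(mask > 0, scores, -np.inf)
    return int(np.argmax(masked))
```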

Evaluation


Our environment: Breakout from OpenAI Gym.

We have implemented a prototype of Tutor4RL using Keras-RL in Python and applied it to a DQN agent (Mnih et al. 2015) playing Breakout on OpenAI Gym. We used one simple guide function that tells the agent to move in the direction of the ball whenever the ball is not directly above the paddle (note that in this test we did not use constrain functions). In addition, we implemented a very simple approach to control when the agent uses the tutor’s guidance: we define a parameter called τ that is used in the same way as ε is used in ε-greedy exploration. When τ is greater than a sample drawn from a U(0,1) distribution, we use the tutor’s output; otherwise, we use the policy’s output. We initialize the agent with τ=1 and decrease it linearly over time, so when the agent starts, the tutor’s output is used heavily, but this decreases as the agent gathers more experience.
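The sketch below shows what this τ-based switching and a Breakout-style guide function could look like. The action indices, the way the ball and paddle positions are obtained, and the decay constants (the 1.5 million steps are inferred from the results below) are my assumptions, not the paper’s implementation:

```python
import numpy as np

# Assumed Breakout action indices (Gym's ordering: NOOP, FIRE, RIGHT, LEFT).
NOOP, FIRE, RIGHT, LEFT = 0, 1, 2, 3

def breakout_guide(ball_x, paddle_x, tolerance=2.0):
    """Guide function: prefer moving toward the ball when it is not
    directly above the paddle; otherwise stay put."""
    prefs = np.zeros(4)
    if ball_x > paddle_x + tolerance:
        prefs[RIGHT] = 1.0
    elif ball_x + tolerance < paddle_x:
        prefs[LEFT] = 1.0
    else:
        prefs[NOOP] = 1.0
    return prefs

def use_tutor(step, tau_start=1.0, tau_end=0.0, decay_steps=1_500_000):
    """Return True when the tutor's output should be used at this step.
    τ decays linearly from tau_start to tau_end over decay_steps."""
    tau = max(tau_end, tau_start - (tau_start - tau_end) * step / decay_steps)
    return tau > np.random.uniform(0.0, 1.0)
```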

Below, you can see the reward achieved by the DQN agent with Tutor4RL compared against a standard, plain DQN agent. The reward in Breakout is simply the score achieved in the game: the more blocks you break, the higher the score and the higher the reward. As you can see, the DQN agent with Tutor4RL performs well from the start thanks to the tutor’s guidance, while the plain DQN agent struggles. It takes about 1.3 million iterations for the plain DQN agent to catch up, and at iteration 1.5 million the tutor ceases to be used entirely. Note that before this point the tutor is used intermittently, depending on the value of τ, which decreases at each iteration. After that, both agents show similar performance, indicating that the tutored DQN agent learned just as well while avoiding the rough start of the plain DQN agent.


Average reward of a DQN agent with Tutor4RL compared against a plain DQN agent [taken from the original publication].

Conclusion

Tutor4RL has proven able to help the agent start with some knowledge of its environment, improving its performance in its initial steps. However, Tutor4RL is still a work in progress and several things can be improved:

  • We have only tested guide functions, so the next step is implementing the mechanism for constrain functions, along with some example constrain functions, and testing them.
  • Managing the uncertainty with τ is a naïve approach and could be much improved by using other approaches such as bootstrapping (Kahn et al. (2017)) or a Bayesian approach (Clements et al. (2020)).
  • Better ways to ensemble and use the output vectors from the knowledge functions are possible, for example by exploiting the correlation between functions as in Snorkel (Ratner et al. (2019)).

As always, thanks for reading! I hope you found this approach interesting and I look forward to hearing your feedback. I’m sure there are more ways in which we can improve Tutor4RL.

