Reinforcement Learning agents start from scratch, knowing nothing and learning by experience, which is effective but slow. Could we give them some hints to get them started?
This story is based on a paper I co-authored, which you can find here.
Reinforcement Learning (RL) has been shown to achieve amazing results in several tasks such as video games, robotics and recommender systems in recent years. However, these successful results come after training RL agents for millions of iterations. Until then, the agent's performance will be far from great. In fact, when the agent starts, its behavior is random while it explores the different actions it can take, and even after several iterations and some experience, the agent will often make mistakes because of the variability of its actions in the environment and because of unseen states that may arise at any time.
In environments like video games, this might not be a problem (even though it means you need time and some serious computing resources), but in real-world systems it is a serious challenge. Think of a robot that uses RL to learn how to move: the robot might take days to learn how to move properly, and it might damage itself if a movement is dangerous.
As humans, we do not learn everything from scratch; if we did, we would not be far from cavemen. All the progress we have made has been achieved by learning from others' achievements and improving upon them. Couldn't we do something similar with RL? Couldn't we teach the agent some general rules that it can apply while learning from its own experience?
With my colleagues at NEC, we have been working on this idea and have developed an approach we call Tutor4RL. We published it this year at AAAI-MAKE: Combining Machine Learning and Knowledge Engineering in Practice, and you can see it here (unfortunately, the in-person spring event was canceled this year, but the papers were still published). Tutor4RL is still a work in progress, so here I present our approach and the initial results we have obtained so far.
Common approaches
As I have said, slow learning is a common challenge in RL and many methods have emerged to address it:
- Simulations: one common approach is to create a simulation environment in which the agent can experiment and be trained before being deployed in the real environment. This approach is very effective, but creating the simulation environment requires great effort, and we normally end up making assumptions about the real environment that might not always hold. These assumptions affect the agent's performance in the real environment.
- Model-based RL: as opposed to model-free RL, model-based RL builds a model of the environment, which allows the agent to learn more quickly. In the case of a robot, we could base our model on physics and the mechanics of the robot. However, to build this model we need a great deal of knowledge about the environment and, again, we usually make assumptions that will hurt us. In addition, the resulting agent is specific to its environment; if we want to use it in a different environment, we need to modify its model, which involves extra work.
- Learning from Demonstrations: in Hester et al. (2018), the authors develop an approach that lets an agent learn efficiently from human demonstrations. Learning is accelerated a great deal and the agent achieves state-of-the-art results. However, what happens when we do not have access to the environment beforehand and it is therefore not possible to provide demonstrations? This approach can be combined with Tutor4RL, using the tutor to provide demonstrations once the agent is already deployed in its environment.
Tutor4RL
We want our agent to perform well (or at least decently) from the start, but this is very hard when we do not have access to the environment beforehand; and if we make assumptions about the environment that turn out to be wrong, the agent's performance will suffer, not only initially but throughout its whole life. However, this doesn't mean we cannot give the agent some useful information it can use once it is deployed in the environment.
After all, when we learn something we are not given all the details of every situation we will encounter. Instead, we can learn from theoretical ideas and hints other people give us. Can an RL agent do the same?
In an attempt to do this, we’ve modified the RL framework as shown in the figure below:
We've added a component we call the Tutor, which contains external knowledge in the form of an ensemble of knowledge functions. These are ordinary programmable functions that take the state and reward as input and output a vector with a value for each action, much like a policy maps a state to a Q-value for each action. There are two types of knowledge functions, sketched in code after the list below:
- Guide functions: these functions express guides or hints to the agent and are interpreted as suggestions; even if they are wrong, the agent will learn from its own experience that they are not good and will stop following them.
- Constrain functions: these functions limit the agent's behavior. They are useful when we are sure the agent should not do something in a certain scenario, for example because it might damage itself or cause a dangerous situation. Constrain functions tell the agent what NOT to do, and output a vector with a 1 for each action that can be taken and a 0 for each action that must not be taken.
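To make the two types concrete, here is a minimal sketch of a guide function and a constrain function for a Breakout-like environment with four discrete actions. The action indices and the state fields (`ball_x`, `paddle_x`, `screen_width`) are illustrative assumptions, not the representation used in our prototype:

```python
import numpy as np

# Illustrative action indices for a Breakout-like game (an assumption; the
# real Gym action mapping may differ).
NOOP, FIRE, RIGHT, LEFT = 0, 1, 2, 3
N_ACTIONS = 4

def guide_follow_ball(state, reward):
    """Guide function (hypothetical): suggest moving the paddle toward the
    ball. Returns one value per action; higher means 'more recommended'."""
    scores = np.zeros(N_ACTIONS)
    if state["ball_x"] > state["paddle_x"]:
        scores[RIGHT] = 1.0
    elif state["ball_x"] < state["paddle_x"]:
        scores[LEFT] = 1.0
    else:
        scores[NOOP] = 1.0  # the ball is already above the paddle
    return scores

def constrain_keep_paddle_on_screen(state, reward):
    """Constrain function (hypothetical): returns 1 for allowed actions and 0
    for forbidden ones, here forbidding moves past the edges of the screen."""
    mask = np.ones(N_ACTIONS)
    if state["paddle_x"] <= 0:
        mask[LEFT] = 0.0
    if state["paddle_x"] >= state["screen_width"] - 1:
        mask[RIGHT] = 0.0
    return mask
```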
The agent can ask the tutor for its knowledge when it is uncertain about what to do, for example in its initial steps. The tutor replies with the combined output of its functions, which the agent then uses to choose the action to execute and to learn from that experience. In this way, the tutor guides the agent, but the agent can always find out that a suggestion from the tutor is wrong or, by using an exploration mechanism such as ε-greedy, discover a better action than the tutor's choice. Constrain functions, however, are always applied to both the guide functions and the agent's policy, providing a safety layer that avoids serious errors that might put the agent in danger or have a large negative impact on the task.
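As a rough sketch of how this combination could look in code (the simple averaging of guide outputs and the masking of Q-values below are my own illustrative choices, not necessarily the exact scheme of our implementation):

```python
import numpy as np

def tutor_suggestion(state, reward, guide_fns, constrain_fns):
    """Ensemble the guide functions (here: a plain average), zero out any
    action forbidden by a constrain function, then pick the best action."""
    scores = np.mean([g(state, reward) for g in guide_fns], axis=0)
    for c in constrain_fns:
        scores = scores * c(state, reward)  # forbidden actions get score 0
    return int(np.argmax(scores))

def policy_action(q_values, state, reward, constrain_fns):
    """The agent's own greedy choice, with the same constrain masks applied
    so the safety layer also covers the learned policy."""
    q = np.array(q_values, dtype=float)
    for c in constrain_fns:
        q[c(state, reward) == 0] = -np.inf  # never pick a forbidden action
    return int(np.argmax(q))
```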
Evaluation
We have implemented a prototype of Tutor4RL using Keras-RL in Python. We applied Tutor4RL to a DQN agent (Mnih et al. 2015) playing Breakout on OpenAI Gym. We used one simple guide function that tells the agent to move in the direction of the ball whenever the ball is not directly above the paddle (note that in this test we did not use constrain functions). In addition, we implemented a very simple approach to control when the agent uses the tutor's guides: we define a parameter called τ that is used in the same way as ε in ε-greedy exploration. When τ is greater than a sample drawn from a U(0,1) distribution, we use the tutor's output; otherwise, we use the policy's output. We initialize the agent with τ=1 and decrease it linearly over time, so when the agent starts, the tutor's output is used heavily, and this reliance decreases as the agent gathers more experience.
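A minimal sketch of this τ mechanism is shown below, assuming a linear decay over 1.5 million steps (roughly where the tutor stops being used in the run described next); the helper names and the exact decay horizon are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_tau(step, decay_steps=1_500_000):
    """τ decays linearly from 1 to 0 over decay_steps iterations."""
    return max(0.0, 1.0 - step / decay_steps)

def use_tutor(step):
    """Use the tutor's output when τ exceeds a U(0,1) sample; otherwise use
    the policy's output (ε-greedy exploration is layered on top, omitted here)."""
    return rng.uniform() < linear_tau(step)

# Early in training the tutor dominates; by 1.5M steps it is never used.
for step in (0, 500_000, 1_000_000, 1_500_000):
    print(step, linear_tau(step))
```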
Below, you can see the reward achieved by the DQN agent with Tutor4RL compared against a standard, plain DQN agent. The reward in Breakout is simply the score achieved in the game: the more blocks you break, the higher the score and the higher the reward. As you can see, the DQN agent with Tutor4RL performs well from the start thanks to the tutor's guidance, while the plain DQN agent struggles. It takes about 1.3 million iterations for the plain DQN agent to catch up, and at iteration 1.5 million the tutor ceases to be used completely. Note that before this point, the tutor is used intermittently, depending on the value of τ, which decreases at each iteration. After that, both agents show similar performance, indicating that the tutored DQN agent also learned on its own but avoided the rough start of the plain DQN agent.
Conclusion
Tutor4RL has proven able to give the agent some knowledge of its environment from the start, improving its performance during its initial steps. However, Tutor4RL is still a work in progress and several things can be improved:
- We have only tested guide functions, so the next step is implementing the mechanism for constrain functions, along with some example constrain functions, and testing them.
- Managing the uncertainty with τ is a naïve approach; it could be much improved with techniques such as bootstrapping (Kahn et al. (2017)) or a Bayesian approach (Clements et al. (2020)).
- Better ways to ensemble and use the output vectors of the knowledge functions are possible, for example by exploiting the correlation between functions as in Snorkel (Ratner et al. (2019)).
As always, thanks for reading! I hope you found this approach interesting and I look forward to hearing your feedback. I’m sure there are more ways in which we can improve Tutor4RL.