Although we have made good progress in reinforcement learning research, a unified framework for comparing algorithms is still missing. Furthermore, the metrics reported in research papers do not give enough information. Here, we discuss what to look at to make the analysis more rigorous.
We all know how reinforcement learning papers mostly work. Researcher A publishes algorithm B; algorithm B outperforms a subset of other “state-of-the-art” algorithms on a strategically chosen subset of environments that coincidentally work well for the algorithm. In addition, the authors may or may not optimize the hyperparameters of the baselines, but in turn report the best runs for algorithm B.
Without getting into what has caused this research trend, we can do something to improve it. Proper evaluation metrics (in addition to proper benchmarks) are essential for a valid comparison. What researchers mostly use to evaluate the performance of an algorithm is the mean performance across runs; if you get lucky, they even report the median, which is a bit more informative. Although this sounds a bit cynical about RL research, I have to say that RL is relatively well-behaved compared to what happens in other subfields of machine learning, such as not even having multiple runs of the algorithm (vision people, I am talking to you :) ).
Hence, this post is about what we have to look at to compare one reinforcement learning algorithm to another. A great source of inspiration is [1], where the authors suggest a concrete way of calculating various reliability metrics for RL algorithms, but we will look at the topic from a top-level perspective, since the intricacies are just technical details of the greater goal.
Most intuitively, when you develop an algorithm, you should look at how sensitive it is to various factors during the training procedure, such as the random seed and the hyperparameters. Less variability means that the algorithm is more stable, robust, and reliable. Beyond general variability, we also want to look at the worst case, i.e., the lower tail of the metric’s distribution. No wonder the authors of [1] took inspiration from finance to define concrete metrics, since it turns out that we care about risk and variability in RL as well. All in all, the different “reliability” categories can be separated as follows:
1. Variability during Training within Rollouts
Ideally, we would like continuous, monotonic improvement: the average performance should increase within and across rollouts, and the performance shouldn’t get (significantly) worse from rollout to rollout. Unfortunately, this is mostly not the case; RL algorithms tend to be unstable. The source of the variability can also be the environment, so you want to account for the stochasticity of the environment and adjust the metric for it. Ideally, you would want to obtain the best performance at the end of the training run, not somewhere in the middle.
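To make this concrete, here is a minimal sketch (my own illustration, not the exact procedure from [1]) of how within-run variability could be measured: detrend the training curve by taking step-to-step differences, then average the interquartile range of those differences over sliding windows. The array `training_curve` is a hypothetical log of evaluation scores recorded during a single training run.

```python
import numpy as np

def dispersion_within_run(training_curve, window=10):
    """Within-run variability: detrend the curve by differencing, then
    average the interquartile range (IQR) of the changes over windows."""
    diffs = np.diff(np.asarray(training_curve, dtype=float))  # removes the overall trend
    windows = [diffs[i:i + window] for i in range(0, len(diffs) - window + 1, window)]
    iqrs = [np.percentile(w, 75) - np.percentile(w, 25) for w in windows]
    return float(np.mean(iqrs))

# Toy example: a noisy but improving training curve.
rng = np.random.default_rng(0)
curve = np.cumsum(rng.normal(loc=0.5, scale=2.0, size=200))
print(dispersion_within_run(curve))
```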
2. Variability across Different Training Runs
The initial conditions of training shouldn’t influence the algorithm’s performance significantly, which is why it is important to look at different random seeds across training runs (vision people, I am looking at you!). Sensitivity to hyperparameters should also be accounted for.
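As a hedged sketch, the point is simply to report a robust spread statistic over runs that differ only in the random seed (and, separately, over different hyperparameter settings), not just the mean. The scores below are made up for illustration.

```python
import numpy as np

# Hypothetical final scores of the same algorithm trained with 10 different seeds.
final_scores = np.array([310., 295., 410., 120., 305., 330., 290., 415., 300., 135.])

mean = final_scores.mean()
median = np.median(final_scores)
iqr = np.percentile(final_scores, 75) - np.percentile(final_scores, 25)  # robust spread

print(f"mean={mean:.1f}  median={median:.1f}  IQR across seeds={iqr:.1f}")
```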
3. Variability across Rollouts in Evaluation
We would like the algorithm to produce similar performance and behavior in evaluation. This shows how the algorithm deals with the stochasticity of the environment and with different initialization conditions. One must also take into account that the maximum achievable performance within a rollout can depend on the initial state.
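One hedged way to account for that state-dependent maximum is to normalise each evaluation return by the best return observed from the same initial state before measuring the spread. The grouping by `initial_state_ids` below is a simplifying assumption of mine; it only works when initial states repeat, e.g. with a fixed set of evaluation seeds.

```python
import numpy as np

def eval_dispersion(returns, initial_state_ids):
    """Spread of evaluation returns across rollouts, after normalising each
    return by the best return seen from the same initial state."""
    returns = np.asarray(returns, dtype=float)
    ids = np.asarray(initial_state_ids)
    normalised = np.empty_like(returns)
    for s in np.unique(ids):
        mask = ids == s
        normalised[mask] = returns[mask] / returns[mask].max()
    return np.percentile(normalised, 75) - np.percentile(normalised, 25)

# Toy data: three rollouts from each of two initial states.
print(eval_dispersion([9.0, 7.5, 8.8, 4.0, 3.1, 3.9], [0, 0, 0, 1, 1, 1]))
```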
4. Short-term Risk within Training Rollouts
The algorithm should exhibit some guarantees on worst-case performance. This is especially important in situations with safety considerations during training. In the short-term case, we want the algorithm’s performance not to wiggle too much locally. Looking at the risk effectively means looking at the expected value of the lowest tail of the (local) distribution, below a certain percentile (let’s say 5%).
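This “expected value below a percentile” is the conditional value at risk (CVaR) borrowed from finance. Here is a minimal sketch, applied to the step-to-step changes of a hypothetical training curve, so that a very negative value flags sudden local drops in performance.

```python
import numpy as np

def cvar(values, alpha=0.05):
    """Conditional value at risk: the mean of the worst alpha-fraction of `values`."""
    values = np.asarray(values, dtype=float)
    cutoff = np.quantile(values, alpha)
    return values[values <= cutoff].mean()

def short_term_risk(training_curve, alpha=0.05):
    """Short-term risk within a run: CVaR of the step-to-step changes."""
    return cvar(np.diff(np.asarray(training_curve, dtype=float)), alpha)

rng = np.random.default_rng(1)
curve = np.cumsum(rng.normal(loc=0.5, scale=2.0, size=200))
print(short_term_risk(curve))
```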
5. Long-term Risk within Training Rollouts
Looking at the whole rollout, we want to close the gap between the worst and the best performance within the rollout. In comparison to the short-term case, here we fit the distribution over the whole rollout and look at the expected value of the worst performance that we rarely, but possibly, obtain within it. Obviously, this can again come from the instability of the algorithm, but also from the characteristics of the environment.
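One hedged way to capture this gap is the drawdown from the best performance reached so far in the run, followed by the expected value of its worst tail (here the largest 5% of drops); this is an illustration in the spirit of [1], not their exact formula.

```python
import numpy as np

def long_term_risk(training_curve, alpha=0.05):
    """Long-term risk within a run: expected value of the largest drops
    below the best performance reached so far (drawdown CVaR)."""
    curve = np.asarray(training_curve, dtype=float)
    drawdown = np.maximum.accumulate(curve) - curve   # 0 whenever we are at a new best
    cutoff = np.quantile(drawdown, 1.0 - alpha)       # the biggest drops sit in the upper tail
    return drawdown[drawdown >= cutoff].mean()

rng = np.random.default_rng(2)
curve = np.cumsum(rng.normal(loc=0.3, scale=3.0, size=500))
print(long_term_risk(curve))
```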
6. Risk across Training Runs
In contrast to point 2, where we look at the variability after discarding outliers, here we want to see what happens with low probability: we get a really bad seed or a really bad set of hyperparameters.
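Concretely, this could be the expected value of the worst fraction of final performances over many runs; the scores below are randomly generated stand-ins for runs with different seeds and hyperparameter settings.

```python
import numpy as np

def risk_across_runs(final_scores, alpha=0.1):
    """Risk across training runs: the mean of the worst alpha-fraction
    of final performances (left-tail CVaR over runs)."""
    scores = np.asarray(final_scores, dtype=float)
    cutoff = np.quantile(scores, alpha)
    return scores[scores <= cutoff].mean()

# Hypothetical final scores of 20 runs with different seeds / hyperparameters.
rng = np.random.default_rng(3)
scores = rng.normal(loc=300.0, scale=60.0, size=20)
print(risk_across_runs(scores))
```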
7. Risk across Rollouts at Evaluation
In contrast to point 3, here we look at the worst-case performance across many rollouts in evaluation. Again, the source of the variability can be the algorithm, but also the environment.
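To tie this back to why the mean alone is not enough, here is a made-up comparison of two algorithms over evaluation rollouts: algorithm B has the higher mean return but occasionally fails catastrophically, which only the tail metric reveals.

```python
import numpy as np

def left_tail_cvar(returns, alpha=0.05):
    """Mean of the worst alpha-fraction of evaluation returns."""
    returns = np.asarray(returns, dtype=float)
    cutoff = np.quantile(returns, alpha)
    return returns[returns <= cutoff].mean()

# Hypothetical evaluation returns over 1000 rollouts per algorithm:
# B is better on average but has rare catastrophic rollouts.
rng = np.random.default_rng(4)
algo_a = rng.normal(loc=100.0, scale=10.0, size=1000)
algo_b = np.where(rng.random(1000) < 0.05,
                  rng.normal(loc=0.0, scale=10.0, size=1000),
                  rng.normal(loc=110.0, scale=10.0, size=1000))

for name, returns in [("A", algo_a), ("B", algo_b)]:
    print(f"{name}: mean={returns.mean():.1f}  CVaR(5%)={left_tail_cvar(returns):.1f}")
```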