Infinite Steps CartPole Problem With Variable Reward

Category: IT Technology · Published: 5 years ago


Modify Step Method of CartPole OpenAI Gym Environment Using Inheritance

In the last blog post, we wrote our first reinforcement learning application: the CartPole problem. We used a Deep Q-Network to train the agent. As seen in that post, a fixed reward of +1 was given for every stable state, and a reward of 0 was given when the CartPole lost its balance. We also saw that the CartPole tends to lose balance as it approaches 200 steps. We ended that post with a remark: the maximum number of steps (which we set to 200) and the fixed reward may have led to this behavior. Today, let’s remove the step limit, modify the reward, and see how the CartPole behaves.

CartPole Problem Definition

The CartPole problem is considered solved when the average reward is greater than or equal to 195.0 over 100 consecutive trials. This assumes a fixed reward of 1.0 per step. Given this definition, it makes sense to keep a fixed reward of 1.0 for every balanced state and to limit the maximum number of steps to 200. Happily, the problem was solved in the previous blog post.
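As a rough illustration of this criterion, here is a minimal sketch. The function name `is_solved` and its parameters are hypothetical, and for simplicity it only checks the most recent 100 episodes rather than every window of 100 consecutive trials.

```python
def is_solved(episode_rewards, window=100, threshold=195.0):
    """Return True if the mean of the last `window` episode rewards
    is at least `threshold`. Needs at least `window` episodes."""
    if len(episode_rewards) < window:
        return False
    recent = episode_rewards[-window:]
    return sum(recent) / window >= threshold


print(is_solved([200.0] * 100))  # True
print(is_solved([190.0] * 100))  # False
```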

The CartPole problem has the following conditions for episode termination:

  1. Pole angle is more than ±12 degrees from vertical.
  2. Cart position is more than ±2.4, i.e., the center of the cart reaches the edge of the display.
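The two conditions above can be sketched as a small check. The names `is_terminal`, `X_LIMIT`, and `THETA_LIMIT_DEG` are hypothetical, and the angle is taken in degrees to match the text. The gym environment itself stores the angle in radians, so a conversion would be needed there.

```python
X_LIMIT = 2.4            # cart position limit
THETA_LIMIT_DEG = 12.0   # pole angle limit, in degrees


def is_terminal(x, theta_deg):
    """Return True if either termination condition is met."""
    return abs(x) > X_LIMIT or abs(theta_deg) > THETA_LIMIT_DEG


print(is_terminal(0.5, 3.0))    # False: both values are within limits
print(is_terminal(-2.5, 0.0))   # True: the cart is past the edge
print(is_terminal(0.0, 12.5))   # True: the pole has tilted too far
```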

Variable Reward

Our goal here is to remove the limit on the number of steps and give a variable reward to each state.

If x and θ represent the cart position and the pole angle (in degrees), respectively, we define the reward as:

reward = (1 - (x ** 2) / 11.52 - (θ ** 2) / 288)

Here, the cart position and pole angle terms are each normalized by their termination limits, so that (x / 2.4)² and (θ / 12)² lie in the [0, 1] interval, and each term is weighted by 1/2 to give them equal weight (11.52 = 2 × 2.4² and 288 = 2 × 12²). Let’s look at a 2D view of the 3D graph of this reward.

We see in the graph that when the CartPole is perfectly balanced (i.e., x = 0 and θ = 0), the maximum reward of 1 is achieved. As the absolute values of x and θ increase, the reward decreases, reaching 0 when |x| = 2.4 and |θ| = 12 degrees.
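To make these values concrete, here is a small sketch that evaluates the reward formula at the center and at the limits. The function name `variable_reward` is hypothetical, and θ is assumed to be in degrees.

```python
def variable_reward(x, theta_deg):
    """reward = 1 - x^2 / 11.52 - theta^2 / 288, with theta in degrees."""
    return 1 - (x ** 2) / 11.52 - (theta_deg ** 2) / 288


# Center, each limit on its own, and both limits together.
for x, theta_deg in [(0.0, 0.0), (2.4, 0.0), (0.0, 12.0), (2.4, 12.0)]:
    print(x, theta_deg, round(variable_reward(x, theta_deg), 6))
```

The reward is 1 at the center, about 0.5 when only one of the two limits is reached, and about 0 when both are reached at once.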

Let’s derive our custom class, CustomCartPoleEnv, from the gym CartPole environment class (CartPoleEnv) and override the step method. In the step method, we compute the variable reward instead of the fixed reward.
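The override can be sketched roughly as follows. This is not the original code from this post. So that the sketch runs without gym installed, it subclasses a hypothetical stand-in, `StubCartPoleEnv`, which returns a fixed state. The sketch assumes that the parent step returns a tuple whose first two items are the observation [x, x_dot, θ, θ_dot] (θ in radians) and the reward. Clamping the reward at 0 is also an assumption.

```python
import math


class StubCartPoleEnv:
    """Hypothetical stand-in for gym's CartPoleEnv (not the real class)."""

    def step(self, action):
        # Fixed state: cart at x = 1.2, pole tilted by 6 degrees.
        obs = [1.2, 0.0, math.radians(6.0), 0.0]
        return obs, 1.0, False, {}


class CustomCartPoleEnv(StubCartPoleEnv):
    """Overrides step() to return the variable reward instead of the fixed one."""

    def step(self, action):
        # Let the parent class do the physics; keep everything except the reward.
        obs, _fixed_reward, *rest = super().step(action)
        x = float(obs[0])
        theta_deg = math.degrees(float(obs[2]))  # parent angle is in radians
        reward = 1 - (x ** 2) / 11.52 - (theta_deg ** 2) / 288
        # Assumption: never return a negative reward past the limits.
        reward = max(0.0, reward)
        return (obs, reward, *rest)


obs, reward, done, info = CustomCartPoleEnv().step(0)
print(round(reward, 3))  # 0.75
```

With gym installed, the stand-in would be replaced by the real CartPoleEnv as the parent class. Its import path and the length of the tuple returned by step may differ between gym versions, which is why the remaining items are passed through unchanged.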

With this custom environment, the TF-Agents components are built and the Deep Q-Network is trained. We see that the CartPole is even more balanced and stable over a large number of steps.

Demonstration

Let’s see the video of how our CartPole behaves after using the variable reward.

One episode lasts 35.4 seconds on average. Impressive, isn’t it?

Possible Improvements

Here, the reward becomes zero only when both of the terms (pole angle and cart position) reach their extreme values. We can employ a different reward function that returns zero as soon as either extreme condition is reached. I expect such a reward function to do even better. Readers are therefore encouraged to try such a reward function and comment on how the CartPole behaves. Happy RLing!
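One candidate with this property is a product of the two normalized terms. This is only an illustrative sketch, not something evaluated in this post. The function name `product_reward` is hypothetical, and θ is assumed to be in degrees.

```python
def product_reward(x, theta_deg):
    """Return 1 at perfect balance and 0 as soon as either limit is reached."""
    x_term = max(0.0, 1 - (x / 2.4) ** 2)
    theta_term = max(0.0, 1 - (theta_deg / 12.0) ** 2)
    return x_term * theta_term


for x, theta_deg in [(0.0, 0.0), (1.2, 6.0), (2.4, 0.0), (0.0, 12.0)]:
    print(x, theta_deg, product_reward(x, theta_deg))
```

Unlike the additive reward above, this one drops to zero when only the cart position or only the pole angle is at its limit.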

