MC Control Methods

Category: IT Technology · Published: 4 years ago


Constant-α MC Control


In this new post of the Deep Reinforcement Learning Explained series, we will improve the Monte Carlo Control Methods to estimate the optimal policy presented in the previous post. In the previous algorithm for Monte Carlo control, we collect a large number of episodes to build the Q-table. Then, after the values in the Q-table have converged, we use the table to come up with an improved policy.

However, Monte Carlo prediction methods can be implemented incrementally, on an episode-by-episode basis and this is what we will do in this post. Even though the policy is updated before the values in the Q-table accurately approximate the action-value function, this lower-quality estimate nevertheless still has enough information to help propose successively better policies.

Improvements to Monte Carlo Control

In the previous post we introduced how the Monte Carlo control algorithm collects a large number of episodes to build the Q-table (policy evaluation step). Then, once the Q-table closely approximates the action-value function, the algorithm uses the table to come up with an improved policy π′ that is ϵ-greedy with respect to the Q-table (indicated as ϵ-greedy(Q)), which yields a policy that is better than the original policy π (policy improvement step).

Would it be more efficient to update the Q-table after every episode? Yes, we could amend the policy evaluation step to update the Q-table after every episode of interaction. Then, the updated Q-table could be used to improve the policy. That new policy could then be used to generate the next episode, and so on:

[Figure: MC control loop in which the Q-table and the policy are updated after every episode]

Constant-alpha MC Control Algorithm

The most popular variation of the MC control algorithm that updates the policy after every episode (instead of waiting to update the policy until after the values of the Q-table have fully converged from many episodes) is the Constant-alpha MC Control.

Constant-alpha MC Control

In this variation of MC control, during the policy evaluation step, the Agent collects an episode

$$S_0, A_0, R_1, S_1, A_1, R_2, \ldots, S_{T-1}, A_{T-1}, R_T, S_T$$

using the most recent policy π. After the episode finishes at time-step T, for each time-step t, the Q-table entry for the corresponding state-action pair (St, At) is modified using the following update equation:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \, \bigl(G_t - Q(S_t, A_t)\bigr)$$

where Gt is the return at time-step t, and Q(St, At) is the entry in the Q-table corresponding to state St and action At.

Generally speaking, the basic idea behind this update equation is that the Q(St, At) element of the Q-table contains the Agent’s estimate of the expected return if the Environment is in state St and the Agent selects action At. If the return Gt is not equal to the expected return contained in Q(St, At), we “push” the value of Q(St, At) to make it agree slightly more with the return Gt. The magnitude of the change that we make to Q(St, At) is controlled by the hyperparameter α, which acts as a step-size for the update step.

We should always set α to a number greater than zero and less than or equal to one. To see why, consider the two extreme cases:

  • If α =0, then the action-value function estimate is never updated by the Agent.
  • If α =1, then the final value estimate for each state-action pair is always equal to the last return that was experienced by the Agent.
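To make the role of α concrete, here is a minimal sketch that applies the update equation to a single Q-table entry over a short sequence of returns. The function names and the sample returns are illustrative assumptions, not code or data from the post.

```python
def constant_alpha_update(q, g, alpha):
    """One constant-alpha step: move the estimate q toward the return g."""
    return q + alpha * (g - q)


def final_estimate(returns, alpha, q0=0.0):
    """Apply the update once per observed return, starting from q0."""
    q = q0
    for g in returns:
        q = constant_alpha_update(q, g, alpha)
    return q


# Made-up returns for one state-action pair, for illustration only.
returns = [1.0, -1.0, 1.0, 1.0]

for alpha in (0.0, 0.5, 1.0):
    print(alpha, final_estimate(returns, alpha))
```

With α = 0 the estimate never leaves its initial value, with α = 1 it simply equals the most recent return, and intermediate values blend old and new information.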

Epsilon-greedy policy

In the previous post we argued that random behavior is better at the beginning of training, when our Q-table approximation is poor, because it gives us more uniformly distributed information about the Environment states. However, as training progresses, random behavior becomes inefficient, and we want to use our Q-table approximation to decide how to act. For this purpose we introduced Epsilon-Greedy policies, which mix these two extreme behaviors by switching between a random policy and the Q-based policy according to the probability hyperparameter ϵ. By varying ϵ, we can select the ratio of random actions.

We say that a policy is ϵ-greedy with respect to an action-value function estimate Q if, for every state,

  • with probability 1− ϵ , the Agent selects the greedy action, and
  • with probability ϵ , the Agent selects an action uniformly at random from the set of available (non-greedy and greedy) actions.

So the larger ϵ is, the more likely you are to pick one of the non-greedy actions.

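To construct a policy π that is ϵ-greedy with respect to the current action-value function estimate Q, mathematically we will set the policy as

$$\pi(a \mid s) = 1 - \epsilon + \frac{\epsilon}{|\mathcal{A}(s)|}$$

if action a maximizes Q(s, a). Else

$$\pi(a \mid s) = \frac{\epsilon}{|\mathcal{A}(s)|}$$

for each s ∈ S and a ∈ A(s).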

This equation includes an extra term ϵ/∣A(s)∣ for the greedy action (∣A(s)∣ is the number of possible actions) because the greedy action can also be picked during the uniformly random selection, and because the probabilities must sum to 1. Indeed, summing the probabilities of all non-greedy actions gives (∣A(s)∣−1)×ϵ/∣A(s)∣, and adding this to 1−ϵ+ϵ/∣A(s)∣, the probability of the greedy action, gives exactly one.
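As a quick numerical check of this definition, the following plain-Python sketch builds the ϵ-greedy probabilities for one state and confirms that they sum to one. The helper name `epsilon_greedy_probs` and the action values are assumptions made for illustration.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Return the epsilon-greedy action probabilities for one state."""
    n_actions = len(q_values)
    # Every action receives the exploration share epsilon / |A(s)|.
    probs = [epsilon / n_actions] * n_actions
    # The greedy action (first maximizer on ties) also gets 1 - epsilon.
    greedy = max(range(n_actions), key=lambda a: q_values[a])
    probs[greedy] = 1.0 - epsilon + epsilon / n_actions
    return probs


# Hypothetical action values for a state with three actions.
probs = epsilon_greedy_probs([0.2, -0.1, 0.05], epsilon=0.3)
print(probs)       # greedy action 0 gets about 0.8, the others about 0.1
print(sum(probs))  # approximately 1.0
```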

Setting the Value of Epsilon

Remember that in order to guarantee that MC control converges to the optimal policy π∗, we need to satisfy the Greedy in the Limit with Infinite Exploration conditions (presented in the previous post), which ensure that the Agent continues to explore for all time steps while gradually exploiting more and exploring less. One way to satisfy these conditions is to make the value of ϵ decay gradually when specifying an ϵ-greedy policy.

The usual practice is to start with ϵ = 1.0 (100% random actions) and slowly decrease it to some small value ϵ > 0 (in our example we will use ϵ = 0.05). In general, this can be obtained by introducing a factor ϵ-decay, with a value near 1, that multiplies ϵ in each iteration.

Pseudocode

We can summarize all the previous explanations with this pseudocode for the constant- α MC Control algorithm that will guide our implementation of the algorithm:

[Figure: pseudocode for the constant-α MC Control algorithm]

A simple MC Control implementation

In this section, we will write an implementation of constant-α MC control that can help an Agent recover the optimal policy for the Blackjack Environment, following the pseudocode introduced in the previous section.

The entire code of this post can be found on GitHub and can be run as a Google Colab notebook using this link.

MC Control algorithm for Blackjack Environment

In the previous post, we implemented a policy for the Blackjack Environment in which the player almost always sticks if the sum of her cards exceeds 18. In that case, the function generate_episode sampled an episode using this policy, which was defined by the programmer.

Here, instead of the policy being hardcoded by the programmer, the MC Control algorithm will estimate and return an optimal policy, together with the Q-table:

import gym

env = gym.make('Blackjack-v0')
num_episodes = 1000000
alpha = 0.02
eps_decay = .9999965
gamma = 1.0

policy, Q = MC_control(env, num_episodes, alpha, eps_decay, gamma)

Specifically, policy is a dictionary whose key corresponds to a state s (a 3-tuple indicating the player’s current sum, the dealer’s face-up card, and whether or not the player has a usable ace) and the value of the corresponding entry indicates the action that the Agent chooses after observing this state following this policy.

Remember that the other object returned by the function, the Q-table Q, is a dictionary in which the key of each entry corresponds to a state s and the value is an array whose length equals the number of actions (2 in our case), where each element contains the estimated action-value for the corresponding action.

As input, this MC Control algorithm takes the following arguments:

  • env: an instance of an OpenAI Gym Environment.
  • num_episodes: the number of episodes that are generated.
  • alpha: the step-size parameter α for the update step.
  • eps_decay: the decay factor applied to ϵ after every episode.
  • gamma: the discount rate γ.

Setting the Value of Epsilon

Before starting to program the code based on the previously presented pseudocode, let’s take a moment to see how we modify the value of ϵ, making it gradually decay when specifying an ϵ-greedy policy. Remember that this is important to guarantee that MC control converges to the optimal policy π∗.

The following code sets the value of ϵ in each episode and prints it periodically, so we can check that eps_decay = 0.9999965 produces a gradual decay of ϵ:

eps_start = 1.0
eps_decay = .9999965
eps_min = 0.05

epsilon = eps_start
for episode in range(num_episodes):
    epsilon = max(epsilon * eps_decay, eps_min)
    if episode % 100000 == 0:
        print("Episode {} -> epsilon={}.".format(episode, epsilon))

[Figure: printed output showing the value of ϵ every 100,000 episodes]

Before entering the loop over episodes, we initialize the value of epsilon to one. Then, for each episode, we slightly decay the value of epsilon by multiplying it by eps_decay. We don’t want epsilon to get too small, because we want to ensure at least some small amount of exploration throughout the process; this is why it is never allowed to drop below eps_min.
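It can help to know in advance when the floor is reached. Since ϵ after n episodes is eps_start × eps_decay^n until it hits eps_min, the number of episodes needed follows from a logarithm. The sketch below is an illustrative calculation, and the helper name `episodes_until_floor` is an assumption rather than part of the post's code.

```python
import math


def episodes_until_floor(eps_start, eps_decay, eps_min):
    """Smallest n such that eps_start * eps_decay**n <= eps_min (approximately)."""
    return math.ceil(math.log(eps_min / eps_start) / math.log(eps_decay))


n = episodes_until_floor(eps_start=1.0, eps_decay=0.9999965, eps_min=0.05)
print(n)  # roughly 856,000 episodes
```

With the values used in this post, ϵ therefore keeps decaying for most of the 1,000,000 training episodes and stays at 0.05 only for the final stretch.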

Main function

Let’s start programming the code based on the previously presented pseudocode. The first thing, following the pseudocode, is to initialize all the values in the Q-table to zero. So Q is initialized as an empty dictionary whose entries default to arrays of zeros, with one element per action available in the Environment:

from collections import defaultdict
import numpy as np
nA = env.action_space.n
Q = defaultdict(lambda: np.zeros(nA))

After that, we loop over num_episodes episodes. In each episode we compute the corresponding ϵ, construct the corresponding ϵ-greedy policy with respect to the most recent estimate of the Q-table, and then generate an episode using that ϵ-greedy policy. Finally, we update the Q-table using the update equation presented before:

epsilon = eps_start
for episode in range(1, num_episodes+1):
    epsilon = max(epsilon*eps_decay, eps_min)
    episode_generated = generate_episode_from_Q(env, Q, epsilon, nA)
    Q = update_Q(env, episode_generated, Q, alpha, gamma)

After finishing the loop of episodes, the policy corresponding to the final Q-table is calculated with the following code:

policy = dict((state, np.argmax(actions))
              for state, actions in Q.items())

That is, the policy indicates for each state which action to take, which just corresponds to the action that has the maximum action-value in the Q-table.

See the GitHub repository for the complete code of the main algorithm of our Constant-α MC Control method.
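To show how the pieces of the main function fit together without depending on Gym, here is a self-contained sketch in plain Python. It is not the post's implementation: `ToyEnv` is a hypothetical one-step environment invented for this example, its `n_actions` attribute stands in for Gym's `env.action_space.n`, and the hyperparameter values are arbitrary.

```python
import random
from collections import defaultdict


class ToyEnv:
    """Hypothetical one-step environment: action 1 pays +1, action 0 pays 0."""
    n_actions = 2

    def reset(self):
        return "start"

    def step(self, action):
        reward = 1.0 if action == 1 else 0.0
        return "end", reward, True, {}  # next_state, reward, done, info


def epsilon_greedy_action(q_values, epsilon, rng):
    """With probability epsilon act uniformly at random, otherwise greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def mc_control(env, num_episodes, alpha, gamma=1.0,
               eps_start=1.0, eps_decay=0.99, eps_min=0.05, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(lambda: [0.0] * env.n_actions)
    epsilon = eps_start
    for _ in range(num_episodes):
        epsilon = max(epsilon * eps_decay, eps_min)
        # Policy evaluation, part 1: generate one episode with the
        # epsilon-greedy policy derived from the current Q-table.
        episode, state, done = [], env.reset(), False
        while not done:
            action = epsilon_greedy_action(Q[state], epsilon, rng)
            next_state, reward, done, _ = env.step(action)
            episode.append((state, action, reward))
            state = next_state
        # Policy evaluation, part 2: walk the episode backwards so the
        # return G can be accumulated, then apply the constant-alpha update.
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            Q[state][action] += alpha * (G - Q[state][action])
    # Final greedy policy with respect to the learned Q-table.
    policy = {s: max(range(env.n_actions), key=lambda a: q[a])
              for s, q in Q.items()}
    return policy, Q


policy, Q = mc_control(ToyEnv(), num_episodes=500, alpha=0.1)
print(policy, dict(Q))
```

In this toy setting the learned policy should prefer action 1 in the only state. Like the post's update, this sketch updates every visited state-action pair in the episode.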

Generate episodes using Q-table and epsilon-greedy policy

The construction of the corresponding ϵ-greedy policy and the generation of an episode using this ϵ-greedy policy are wrapped up in the generate_episode_from_Q function called in the previous code.

This function takes as input the Environment, the most recent estimate of the Q-table, the value of current Epsilon and the number of actions. As an output, it returns an episode.

The Agent will use the Epsilon-greedy policy to select actions. We have implemented this using the random.choice method from NumPy, which takes as input the set of possible actions and the probabilities corresponding to the Epsilon-greedy policy. The action probabilities corresponding to the ϵ-greedy policy are obtained with this code:

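def get_probs(Q_s, epsilon, nA):
    # Every action starts with the exploration share epsilon / nA.
    policy_s = np.ones(nA) * epsilon / nA
    max_action = np.argmax(Q_s)
    # The greedy action additionally receives the remaining 1 - epsilon.
    policy_s[max_action] = 1 - epsilon + (epsilon / nA)
    return policy_s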

If you take a look at get_probs function code, it implements the epsilon-greedy policy detailed in the previous section.

If the state is not yet in the Q-table, we randomly choose one action using action_space.sample(). The complete function that generates an episode following the epsilon-greedy policy is coded as follows:

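def generate_episode_from_Q(env, Q, epsilon, nA):
    # Written for the classic Gym API: reset() returns the state and
    # step() returns (next_state, reward, done, info).
    episode = []
    state = env.reset()
    while True:
        # Test membership before indexing: Q is a defaultdict, so Q[state]
        # would insert the state and make the check always succeed.
        if state in Q:
            probs = get_probs(Q[state], epsilon, nA)
            action = np.random.choice(np.arange(nA), p=probs)
        else:
            action = env.action_space.sample()
        next_state, reward, done, info = env.step(action)
        episode.append((state, action, reward))
        state = next_state
        if done:
            break
    return episode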

Update Q-table

Once we have the episode, we look at each state-action pair in it and apply the update equation:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \, \bigl(G_t - Q(S_t, A_t)\bigr)$$

The code that implements this equation is

def update_Q(env, episode, Q, alpha, gamma):
    states, actions, rewards = zip(*episode)
    rewards = np.array(rewards)
    # discounts[k] = gamma**k
    discounts = np.array([gamma**i for i in range(len(rewards) + 1)])
    for i, state in enumerate(states):
        old_Q = Q[state][actions[i]]
        # Discounted return from step i until the end of the episode.
        G = sum(rewards[i:] * discounts[:-(1 + i)])
        Q[state][actions[i]] = old_Q + alpha * (G - old_Q)
    return Q
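The quantity pushed into the Q-table at each step is the discounted return from that step onward. As a small worked example, the following plain-Python sketch computes these returns for a three-step episode. The helper name `discounted_returns` and the reward sequence are illustrative assumptions.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, G_1, ...], where G_t = sum over k of gamma**k * rewards[t + k].

    rewards[t] is the reward received after the action taken at step t.
    """
    returns = [0.0] * len(rewards)
    g = 0.0
    # Work backwards so each return reuses the one computed after it.
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


# A made-up episode in which only the last step is rewarded.
print(discounted_returns([0.0, 0.0, 1.0], gamma=0.9))  # about [0.81, 0.9, 1.0]
print(discounted_returns([0.0, 0.0, 1.0], gamma=1.0))  # every return equals 1.0
```

With gamma = 1.0, as used for Blackjack in this post, every step of an episode receives the same return, namely the final reward.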

Plot state-value function

As in the example of the previous post, we can obtain the corresponding estimated optimal state-value function and plot it:
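One common way to derive the state-value estimate from a Q-table, assuming the Agent acts greedily at the end of training, is to take the best action-value in each state. The sketch below illustrates this with a hypothetical helper and made-up Q-table entries. These are not trained values, and the plotting itself is left to the notebook.

```python
def state_values_from_Q(Q):
    """Greedy state values: V(s) = max over actions a of Q(s, a)."""
    return {state: max(action_values) for state, action_values in Q.items()}


# Hypothetical Q-table entries keyed by (player sum, dealer card, usable ace).
Q_example = {
    (20, 10, False): [0.4, -0.9],
    (12, 6, True): [-0.2, 0.1],
}
V = state_values_from_Q(Q_example)
print(V)
```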

[Figure: plots of the estimated optimal state-value function]

Remember there are two plots corresponding to whether we do or don’t have a usable ace.

Comparing the plots of this post with those of the previous post, we can see that the policy obtained with the MC Control method presented here is better, since the estimated state values are much higher.

What is next?

We have reached the end of this post! So far we have seen Monte Carlo methods and dynamic programming, which, according to Dr. Sutton’s book, are the two main threads from which modern RL originates. A third thread arrived later in the form of temporal difference learning, which we will introduce in the following post. See you in the next post!

