内容简介:In this new post of theHowever, Monte Carlo prediction methodsIn theprevious post we have introduced how the Monte Carlo control algorithm collects a large number of episodes to build the Q-table (
Constant- α MC Control
Jul 25 ·10min read
In this new post of the “ Deep Reinforcement Learning Explained ” series, we will improve the Monte Carlo Control Methods to estimate the optimal policy presented in the previous post. In the previous algorithm for Monte Carlo control, we collect a large number of episodes to build the Q-table. Then, after the values in the Q-table have converged, we use the table to come up with an improved policy.
However, Monte Carlo prediction methods can be implemented incrementally, on an episode-by-episode basis and this is what we will do in this post. Even though the policy is updated before the values in the Q-table accurately approximate the action-value function, this lower-quality estimate nevertheless still has enough information to help propose successively better policies.
Improvements to Monte Carlo Control
In theprevious post we have introduced how the Monte Carlo control algorithm collects a large number of episodes to build the Q-table ( policy evaluation step). Then, once the Q-table closely approximates the action-value function qπ , the algorithm uses the table to come up with an improved policy π ′ that is ϵ -greedy with respect to the Q-table (indicated as ϵ-greedy(Q) ), which will yield a policy that is better than the original policy π ( policy improvement step).
Maybe would it be more efficient to update the Q-table after every episode? Yes, we could amend the policy evaluation step to update the Q-table after every episode of interaction. Then, the updated Q-table could be used to improve the policy. That new policy could then be used to generate the next episode, and so on:
The most popular variation of the MC control algorithm that updates the policy after every episode (instead of waiting to update the policy until after the values of the Q-table have fully converged from many episodes) is the Constant-alpha MC Control.
Constant-alpha MC Control
In this variation of MC control, during the policy evaluation step, the Agent collects an episode
using the most recent policy π . After the episode finishes in time-step T , for each time-step t , the corresponding state-action pair (St, At) is modified using the following update equation :
where Gt is the return at time-step t , and Q(St,At) is the entry in the Q-table corresponding to state St and action At .
Generally speaking, the basic idea behind this update equation is that the Q ( St , At ) element of Q-table contains the Agent’s estimate for the expected return if the Environment is in state St and the Agent selects action At . Then, If the return Gt is not equal to the expected return contained in Q(St,At) , we “push” the value of Q(St,At) to make it agree slightly more with the return Gt . The magnitude of the change that we make to Q(St,At) is controlled by the hyperparameter α that acts as a step-size for the update step.
We always should set the value for α to a number greater than zero and less than (or equal to) one. In the outermost cases:
- If α =0, then the action-value function estimate is never updated by the Agent.
- If α =1, then the final value estimate for each state-action pair is always equal to the last return that was experienced by the Agent.
Epsilon-greedy policy
In theprevious post we advanced that random behavior is better at the beginning of the training when our Q-table approximation is bad, as it gives us more uniformly distributed information about the Environment states. However, as our training progresses, random behavior becomes inefficient, and we want to use our Q-table approximation to decide how to act. We introduced Epsilon-Greedy policies in theprevious post for this purpose, a method that performs such a mix of two extreme behaviors which just is switching between random and Q policy using the probability hyperparameter ϵ . By varying ϵ , we can select the ratio of random actions.
We will define that a policy is ϵ -greedy with respect to an action-value function estimate Q if for every state,
- with probability 1− ϵ , the Agent selects the greedy action, and
- with probability ϵ , the Agent selects an action uniformly at random from the set of available (non-greedy and greedy) actions.
So the larger ϵ is, the more likely you are to pick one of the non-greedy actions.
To construct a policy π that is ϵ -greedy with respect to the current action-value function estimate Q , mathematically we will set the policy as
if action a maximizes Q ( s , a ). Else
for each s ∈S and a ∈A( s ).
In this equation, it is included an extra term ϵ /∣A( s )∣ for the optimal action (∣A( s )∣ is the number of possible actions) because the sum of all the probabilities needs to be 1. Note that if we sum over the probabilities of performing all non-optimal actions, we will get (∣A(s)∣−1)×ϵ/∣A(s)∣, and adding this to 1− ϵ + ϵ /∣A( s )∣ , the probability of the optimal action, the sum gives one.
Setting the Value of Epsilon
Remember that in order to guarantee that MC control converges to the optimal policy π ∗, we need to ensure the conditions Greedy in the Limit with Infinite Exploration (presented in the previous post) that ensure the Agent continues to explore for all time steps, and the Agent gradually exploits more and explores less. We presented that one way to satisfy these conditions is to modify the value of ϵ , making it gradually decay, when specifying an ϵ -greedy policy.
The usual practice is to start with ϵ = 1.0 (100% random actions) and slowly decrease it to some small value ϵ > 0 (in our example we will use ϵ = 0.05) . In general, this can be obtained by introducing a factor ϵ-decay with a value near 1 that multiply the ϵ in each iteration.
Pseudocode
We can summarize all the previous explanations with this pseudocode for the constant- α MC Control algorithm that will guide our implementation of the algorithm:
A simple MC Control implementation
In this section, we will write an implementation of constant- MC control that can help an Agent recover the optimal policy the Blackjack Environment following the pseudocode introduced in the previous post.
The entire code of this post can be found on GitHub and can be run as a Colab google notebook using this link .
MC Control algorithm for Blackjack Environment
In the previous post, we implemented a policy where the player almost always sticks if the sum of her cards exceeds 18 for the BlackJack Environment. In this case, the function generate_episode
sampled an episode using this defined policy by the programmer.
Here, instead of being the policy hardcoded by the programmer, the MC Control algorithm will estimate and return an optimal policy, together with the Q-table:
env = gym.make('Blackjack-v0') num_episodes=1000000 alpha = 0.02 eps_decay=.9999965 gamma=1.0policy, Q = MC_control(env, num_episodes, alpha, eps_decay, gamma)
Specifically, policy
is a dictionary whose key corresponds to a state s
(a 3-tuple indicating the player’s current sum, the dealer’s face-up card, and whether or not the player has a usable ace) and the value of the corresponding entry indicates the action that the Agent chooses after observing this state following this policy.
Remember that the other dictionary returned by the function, the Q-table Q
, is a dictionary where the key of a given entry in the dictionary corresponds to a state s
and the value of the corresponding entry contains an array of dimension equal to the number of actions (2 dimensions in our case) where each element contains the estimated action-value for each action.
As input this MC Control algorithm has the following arguments:
env num_episodes alpha eps_decay gamma
Setting the Value of Epsilon
Before starting to program, a code based on the previously presented pseudocode takes a moment to see how we modify the value of ϵ , making it gradually decay when specifying an ϵ -greedy policy. Remember that this is important to guarantee that MC control converges to the optimal policy π ∗.
With the following code that sets the value for ϵ in each episode and monitor its evolutions with a print
we can check that selecting an eps_decay
= 0.9999965
we can obtain the gradual decay of ϵ:
eps_start=1.0 eps_decay=.9999965 eps_min=0.05epsilon = eps_start for episode in range(num_episodes): epsilon = max(epsilon*eps_decay, eps_min) if episode % 100000 == 0: print(“Episode {} -> epsilon={}.”.format(episode, epsilon))
Before entering the loop over episodes, we initialize the value of epsilon to one. Then, for each episode, we slightly decay the value of Epsilon by multiplying it by the value eps_decay
. We don’t want Epsilon to get too small because we want to constantly ensure at leans some small amount of exploration throughout the process.
Main function
Let’s start to program a code based on the previously presented pseudocode. The first thing, following the pseudocode, is to initialize all the values in the Q-table to zero. So Q
is initialized to an empty dictionary of arrays with the total number of actions that are in the Environment:
nA = env.action_space.n Q = defaultdict(lambda: np.zeros(nA)
After that, we loop num_episodes
over episodes, and then with each episode we compute the corresponding ϵ, construct the corresponding ϵ- greedy policy with respect to the most recent estimate of the Q-table, and then generate an episode using that ϵ -greedy policy. Finally, we update the Q-table using the update equation presented before:
for episode in range(1, num_episodes+1): epsilon = max(epsilon*eps_decay, eps_min) episode_generated=generate_episode_from_Q(env,Q,epsilon,nA) Q = update_Q(env, episode_generated, Q, alpha, gamma)
After finishing the loop of episodes, the policy corresponding to the final Q-table is calculated with the following code:
policy=dict((state,np.argmax(actions)) \ for state, actions in Q.items())
That is, the policy indicates for each state which action to take, which just corresponds to the action that has the maximum action-value in the Q-table.
See the GitHub for the complete code the main algorithm of our Constant-α MC Control method.
Generate episodes using Q-table and epsilon-greedy policy
The construction of the corresponding ϵ- greedy policy and the generation of an episode using this ϵ -greedy policy are wrapped up in the generate_episode_from_Q
function instantiated in the previous code.
This function takes as input the Environment, the most recent estimate of the Q-table, the value of current Epsilon and the number of actions. As an output, it returns an episode.
The Agent will use the Epsilon-greedy policy to select actions. We have implemented that using the random.choice
method from Numpy, which takes as input the set of possible actions and the probabilities corresponding to the Epsilon greedy policy. The obtention of the action probabilities corresponding to ϵ -greedy policy will be done using this code:
def get_probs(Q_s, epsilon, nA): policy_s = np.ones(nA) * epsilon / nA max_action = np.argmax(Q_s) policy_s[max_action] = 1 — epsilon + (epsilon / nA) return policy_s
If you take a look at get_probs
function code, it implements the epsilon-greedy policy detailed in the previous section.
Obviously, if the state is not already in Q-table, we randomly choose one action using the action_space.sample().
The complete code for this function that generates an episode following the epsilon-greedy policy is coded as follows:
def generate_episode_from_Q(env, Q, epsilon, nA): episode = [] state = env.reset() while True: probs = get_probs(Q[state], epsilon, nA) action = np.random.choice(np.arange(nA), p=probs) \ if state in Q else env.action_space.sample() next_state, reward, done, info = env.step(action) episode.append((state, action, reward)) state = next_state if done: break return episode
Update Q-table
Once we have the episode we just look at each state-action and we apply the update equation:
The code that programs this equation is
def update_Q(env, episode, Q, alpha, gamma): states, actions, rewards = zip(*episode) discounts=np.array([gamma**i for i in range(len(rewards)+1)]) for i, state in enumerate(states): old_Q = Q[state][actions[i]] Q[state][actions[i]] = old_Q + alpha \ (sum(rewards[i:]*discounts[:-(1+i)]) — old_Q) return Q
Plot state-value function
As in the example of the previous post, we can obtain the corresponding estimated optimal state-value function and plot it:
Remember there are two plots corresponding to whether we do or don’t have a usable ace.
With a simple visual analysis of the graphs of this post and those of the previous post, we can see that the policy obtained with the MC Control presented here is better since the state-value values are much higher.
What is next?
We have reached the end of this post!. So far we have seen Monte Carlo methods and dynamic programming, which are the two main threads from which they derive the origins of modern RL according to the book of Dr. Sutton. A third thread arriving later in the form of temporal difference learning , that we will introduce in the following post. See you in the next post!
Deep Reinforcement Learning Explained - Jordi TORRES.AI
Content of this series
torres.ai
以上就是本文的全部内容,希望本文的内容对大家的学习或者工作能带来一定的帮助,也希望大家多多支持 码农网
猜你喜欢:本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们。