内容简介:In theprevious post, we presented theTheNext, we will present in detail the code that makes up the Value Iteration method that we presented in the previous post. So let’s come to the code. In the beginning, we import the used libraries and define the main
V-function in Practice for Frozen-Lake Environment
Jun 15 ·9min read
In theprevious post, we presented the Value Iteration method to calculate the V-values and Q-values required by Value-based Agents. In this post, we will present how to implement the Value Iteration method for computing the state value by solving the Frozen-Lake Environment.
Value Iteration with V-function in Practice
The entire code of this post can be found on GitHub and can be run as a Colab google notebook using this link .
Next, we will present in detail the code that makes up the Value Iteration method that we presented in the previous post. So let’s come to the code. In the beginning, we import the used libraries and define the main constants:
import gym import collections from torch.utils.tensorboard import SummaryWriterENV_NAME="FrozenLake-v0"GAMMA = 0.9 TEST_EPISODES = 20 N =100 REWARD_GOAL = 0.8
Agent’s data structures
The main data structures which will keep Agent’s information are:
-
rewards
: A dictionary with the composite key “source state” + “action” +“target state”. The value is obtained from the immediate reward. -
transits
: A table as a dictionary keeping counters of the experienced transitions. The key is the composite “state” + “action”, and the value is another dictionary that maps the target state into a count of times that we have seen it. -
values
: A dictionary that maps a state into the calculated value of this state (V-value). -
state
: Current state of the Agent.
The main data structures are created in the class constructor of the Agent.
Training Algorithm
The overall logic of our training algorithm is simple. Until the desired Reward goal is not reached we will execute the following steps:
- play N random steps from the environment to populating the
reward
andtransits
tables. - After those N steps, it performs a value iteration step over all states, updating the
value
table. - Then we play several full episodes to check the improvements using the updated value table.
- If the average reward for those test episodes is above the desired boundary, then we stop training.
Before we go into more detail about this code, it will help to review the agent’s code first.
Agent Class
In the Agent
class constructor, we create an Environment that we will be using for data samples, obtain our first observation, and define tables for rewards, transitions, and values:
class Agent: def __init__(self): self.env = gym.make(ENV_NAME) self.state = self.env.reset() self.rewards = collections.defaultdict(float) self.transits = collections.defaultdict( collections.Counter) self.values = collections.defaultdict(float)
Play random steps
Remember that in the previous post we advanced that the estimation of Transitions and Rewards will be obtained through the history of the Agent’s interaction with the Environment.
This is done by the method play_n_random_steps
that plays N
random steps from the Environment, populating the reward
and transits
tables with random experiences.
def play_n_random_steps(self, count): for _ in range(count): action = self.env.action_space.sample() new_state, reward, is_done, _ = self.env.step(action) self.rewards[(self.state, action, new_state)] = reward self.transits[(self.state, action)][new_state] += 1 if is_done: self.state = self.env.reset() else: self.state = new_state
Note that we don’t need to wait for the end of the episode to start learning; we just perform N
steps and remember their outcomes. This is one of the differences between the Value Iteration and the Cross-Entropy method shown in a previous post, which requires full episodes to learn.
Value of the action
The next method calculates the Q-function, the value of an action from a state using the transits
, reward
, and value
tables of the Agent. We will use it for two purposes: to select the best action to perform from the state and to calculate the new value of the state on the Value Iteration algorithm.
def calc_action_value(self, state, action): target_counts = self.transits[(state, action)] total = sum(target_counts.values()) action_value = 0.0 for tgt_state, count in target_counts.items(): reward = self.rewards[(state, action, tgt_state)] val = reward + GAMMA * self.values[tgt_state] action_value += (count / total) * val return action_value
First, from the transits
table, we extract transition counters for the given state and action that the method receives as arguments. We sum all counters to obtain the total count of times we have executed the action from the state. Then we iterate every target state that our action has landed on and calculate its contribution to the total action value using the formula presented in the Bellman equation post:
This value is equal to immediate reward plus discounted value for the target state and multiplied this sum to the probability of this transition (individual counter divided by the total value computed before). We add the result for each iteration to a variable action_value
, the variable that will be returned.
Select best Action
In order to select the best action from a given state, we have the method select_action
that uses the previous calc_action_value
method to make the decision:
def select_action(self, state): best_action, best_value = None, None for action in range(self.env.action_space.n): action_value = self.calc_action_value(state, action) if best_value is None or best_value < action_value: best_value = action_value best_action = action return best_action
The code of this method iterates over all possible actions in the Environment and calculates the value for every action and returns the action with the largest Q-value.
Value Iteration function
And here we have the main function, that as we described in the previous post, what Value Iteration method do is just loop over all states in the Environment:
def value_iteration(self): for state in range(self.env.observation_space.n): state_values = [ self.calc_action_value(state, action) for action in range(self.env.action_space.n) ] self.values[state] = max(state_values)
For every state, we update its value with the maximum value of the action available from the state.
Training Loop and the monitoring of the code
After presenting the Agent’s class and its methods, we come back to describe the main loop. First of all, we create the Environment that we will be using for testing, create an instance of the Agent class, the summary writer for TensorBoard, and some variables that we will use:
test_env = gym.make(ENV) agent = Agent() writer = SummaryWriter()iter_no = 0 best_reward = 0.0
Remember that Value Iteration method formally requires an infinite number of iterations to converge exactly to obtain the optimal value function. In practice, in the previous post, we showed that we can stop once the value function changes by only a small amount in an iteration of the training loop.
n this example, in order to keep the code simple and with few iterations, we decide to stop when we reach some reward threshold. But all the rest of the code is identical.
The overall logic of our code is simply a loop that will iterate until the Agent achieve the expected performance (If the average reward for those test episodes is above the REWARD_GOAL
boundary, then we stop training):
while best_reward < REWARD_GOAL: agent.play_n_random_steps(N) agent.value_iteration() ...
The body of the loop is composed of the steps introduced earlier:
STEP 1: play N random steps calling the method plays_n_random_steps.
STEP 2: It performs a value iteration sweep over all states calling the method value_iteration.
STEP 3: Then we play several full episodes to check the improvements using the updated value table. For that the code uses agent.elect_action()
to find the best action to take in the new test_env
Environment (we don’t want to mess with the current state of the main Environment used to gather random data), to check improvements of the Agent:
iter_no += 1 reward_avg = 0.0 for _ in range(TEST_EPISODES): total_reward = 0.0 state = test_env.reset() while True: action = Select_action(state) new_state, new_reward, is_done, _ = test_env.step(action) total_reward += new_reward if is_done: break state = new_state reward_test += total_reward reward_test /= TEST_EPISODES
Additionally, the code writes data into TensorBoard in order to tracks the best average reward:
writer.add_scalar("reward", reward_test, iter_no) if reward_test > best_reward: print("Best reward updated %.2f at iteration %d " % (reward_test ,iter_no) ) best_reward = reward_test
And that’s all!
Run the program
Okay, let’s run our program:
Best reward updated 0.40 in iteration 13 Best reward updated 0.65 in iteration 20 Best reward updated 0.80 in iteration 23 Best reward updated 0.85 in iteration 28 Best reward updated 0.90 in iteration 76
Test the client
Now, if we try with the same client test code as in the case of Cross-Entropy, we can see that the Agent we have built, can learn from a slippery Environment:
new_test_env = gym.make(‘FrozenLake-v0’, is_slippery=True) state = new_test_env.reset() new_test_env.render() is_done = False iter_no = 0 while not is_done: action = Select_action(state) new_state, reward, is_done, _ = new_test_env.step(action) test_env.render() state = new_state iter_no +=1 print(“reward = “, reward) print(“iterations =”, iter_no)
. . .
Conclusions
Note that the algorithm presented is stochastic and for different executions, it employs a different number of iterations to reach a solution. However, it could learn from a slippery Environment instead the Cross-Entropy presented earlier. We can plot all of them using Tensorboard that will show graphs like the following one:
We can notice that in all cases, it needs a few seconds as a maximum to find a good policy that solves the Environment in 80% of runs. If you remember the Cross-Entropy method, for a slippery Environment, it took many hours to achieve a success rate of only 60%.
Moreover, we can apply this algorithm to a larger version of FrozenLake, which has the name FrozenLake8x8-v0. The larger version of FrozenLake can take many more iterations to solve, and, according to TensorBoard charts, most of the time it waits for the first successful episode (it need to have at least one successful episode to start learning from useful value table), then it very quickly reaches convergence. Below is a chart that compares reward dynamics during training on FrozenLake-4x4 and 8x8 version:
In addition to changing the Environment in ENV_NAME
(“FrozenLake-v0” vs “FrozenLake8x8-v0”), the reader can perform tests with different values of the hyperparameters as GAMMA, TEST_EPISODE, REWARD_GOAL
or N
. Why don’t you try it?
What is next?
In the next post, we will present the Value Iteration method code that learns the values of the actions instead of learns the values of the states as we have done here. See you in the next post!
The entire code of this post can be found on GitHub and can be run as a Colab google notebook using this link .
Acknowledgments: The code presented in this post has been inspired from the code of Maxim Lapan who has written an excellent practical book on the subject .
以上就是本文的全部内容,希望对大家的学习有所帮助,也希望大家多多支持 码农网
猜你喜欢:本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们。