Implementation of the Cross-Entropy Training Loop
Jun 8 · 9 min read
In this post, we describe in detail the training loop of the Cross-Entropy method, which we skipped in the previous post, and see how the Agent's learning can be improved by using more complex neural networks. We also present an improved variant of the method that keeps “elite” episodes for several iterations of the training process. Finally, we show the limitations of the Cross-Entropy method to motivate other approaches.
Overview of the Training Loop
Next, we present in detail the code that makes up the training loop introduced in the previous post.
The entire code of this post can be found on GitHub and can be run as a Google Colab notebook using this link.
Main variables
The code begins by defining the main parameters of the method.
BATCH_SIZE = 100
GAMMA = 0.9
PERCENTILE = 30
REWARD_GOAL = 0.8
Helper classes
We will require a series of helper classes:
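from collections import namedtuple
Episode = namedtuple('Episode', field_names=['reward', 'steps'])
EpisodeStep = namedtuple('EpisodeStep', field_names=['observation', 'action'])
Here we define two helper classes that are named tuples from the collections package in the standard library:
- EpisodeStep represents a single step the Agent takes in an episode; it stores the observation from the Environment and the action the Agent chose.
- Episode represents a whole episode; it stores the total reward and the list of EpisodeStep instances.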
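As a quick illustration of how these two named tuples fit together, the sketch below builds one short episode by hand. The observation and action values are made up for the example and do not come from a real run.

```python
from collections import namedtuple

Episode = namedtuple('Episode', field_names=['reward', 'steps'])
EpisodeStep = namedtuple('EpisodeStep', field_names=['observation', 'action'])

# Hypothetical two-step episode: the observations and actions are illustrative only.
steps = [
    EpisodeStep(observation=0, action=2),
    EpisodeStep(observation=1, action=1),
]
episode = Episode(reward=1.0, steps=steps)

# Fields are accessed by name, which keeps the training loop readable.
first_action = episode.steps[0].action
n_steps = len(episode.steps)
```

Because named tuples are immutable, a finished episode cannot be modified by accident once it has been appended to a batch.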
Initialization of variables
At this point, a set of variables that we will use in the training loop are initialized. We will present each of them as they are required in the loop:
iter_no = 0
reward_mean = 0
full_batch = []
batch = []
episode_steps = []
episode_reward = 0.0
state = env.reset()
The training loop
We learned in the previous post that the training loop of our Agent that implements the Cross-Entropy algorithm repeats 4 main steps until we become satisfied with the result:
1 — Play N number of episodes
2 — Calculate the Return for every episode and decide on a return boundary
3 — Throw away all episodes with a return below the boundary.
4 — Train the neural network using episode steps from the “elite” episodes
We have decided that the Agent must be trained until a certain Reward threshold is reached. Specifically, we use a threshold of 80%, set in the variable REWARD_GOAL:
while reward_mean < REWARD_GOAL:
STEP 1 — Play N number of episodes
The next piece of code is the one that generates the batches with episodes:
action = select_action(state)
next_state, reward, episode_is_done, _ = env.step(action)
episode_steps.append(EpisodeStep(observation=state, action=action))
episode_reward += reward
if episode_is_done:  # Episode finished
    batch.append(Episode(reward=episode_reward, steps=episode_steps))
    next_state = env.reset()
    episode_steps = []
    episode_reward = 0.0
    <STEP 2>
    <STEP 3>
    <STEP 4>
state = next_state
The main variables we will use are:
- batch accumulates the list of Episode instances (BATCH_SIZE = 100).
- episode_steps accumulates the list of steps in the current episode.
- episode_reward maintains a reward counter for the current episode (in our case the only Reward comes at the end of the episode, but the algorithm is described for the more general situation where Rewards can arrive at any step).
The list of episode steps is extended with an (observation, action) pair. It is important to note that we save the observed state that was used to choose the action (not the observation next_state returned by the Environment as a result of the action):
episode_steps.append(EpisodeStep(observation=state,action=action))
The reward is added to the current episode’s total reward:
episode_reward += reward
When the current episode is over (hole or goal state), we append the finished episode to the batch, saving the total reward and the steps we have taken. Then we reset the Environment to start over, and reset the variables episode_steps and episode_reward to start tracking the next episode:
batch.append(Episode(reward=episode_reward, steps=episode_steps))
next_state = env.reset()
episode_steps = []
episode_reward = 0.0
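To make the bookkeeping of this step concrete, here is a minimal, self-contained sketch of episode collection. It assumes a hypothetical ToyCorridor environment (not part of Gym and not used in this series) and a uniformly random policy instead of the Agent's neural network; only the structure of the loop mirrors the code above.

```python
import random
from collections import namedtuple

Episode = namedtuple('Episode', field_names=['reward', 'steps'])
EpisodeStep = namedtuple('EpisodeStep', field_names=['observation', 'action'])


class ToyCorridor:
    """Hypothetical stand-in environment: states 0..3 on a line.

    Action 1 moves right, action 0 moves left. Reaching state 3 ends the
    episode with reward 1.0; otherwise the episode ends after 10 steps
    with reward 0.0.
    """

    def reset(self):
        self.pos, self.t = 0, 0
        return self.pos

    def step(self, action):
        self.t += 1
        self.pos = max(0, self.pos + (1 if action == 1 else -1))
        done = self.pos == 3 or self.t >= 10
        reward = 1.0 if self.pos == 3 else 0.0
        return self.pos, reward, done, {}


def play_episodes(env, n_episodes, rng):
    """Collect n_episodes with a uniformly random policy."""
    batch, episode_steps, episode_reward = [], [], 0.0
    state = env.reset()
    while len(batch) < n_episodes:
        action = rng.choice([0, 1])
        next_state, reward, done, _ = env.step(action)
        # Store the state that was used to choose the action, not next_state.
        episode_steps.append(EpisodeStep(observation=state, action=action))
        episode_reward += reward
        if done:
            batch.append(Episode(reward=episode_reward, steps=episode_steps))
            next_state = env.reset()
            episode_steps, episode_reward = [], 0.0
        state = next_state
    return batch


batch = play_episodes(ToyCorridor(), n_episodes=5, rng=random.Random(0))
```

Every collected episode starts from the reset state and carries its own list of steps, because episode_steps is rebound to a fresh list after each append.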
STEP 2 — Calculate the Return for every episode and decide on a return boundary
The next piece of code implements step 2:
if len(batch) == BATCH_SIZE:
    reward_mean = float(np.mean(list(map(lambda s: s.reward, batch))))
    elite_candidates = batch
    returnG = list(map(lambda s: s.reward * (GAMMA ** len(s.steps)), elite_candidates))
    reward_bound = np.percentile(returnG, PERCENTILE)
The training loop executes this step once BATCH_SIZE episodes have been played:
if len(batch) == BATCH_SIZE:
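First, the code calculates the Return for all the episodes. The Return is the episode's total reward discounted by GAMMA raised to the number of steps, so a shorter successful episode gets a higher Return than a longer one:
elite_candidates = batch
returnG = list(map(lambda s: s.reward * (GAMMA ** len(s.steps)), elite_candidates))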
In this step, from the given batch of episodes and the percentile value, we calculate the return boundary, which is used to select the “elite” episodes that train the Agent's neural network:
reward_bound = np.percentile(returnG, PERCENTILE)
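To obtain the boundary, we use NumPy's percentile function, which takes a list of values and the desired percentile and returns that percentile's value. With PERCENTILE = 30, the boundary is the 30th percentile of the returns, so only episodes whose return is strictly above it (at most the top 70% of the batch) become “elite” episodes. In Frozen-Lake, where most early episodes end with a return of 0, the boundary is typically 0 and only the successful episodes pass the filter.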
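A small worked example may help here. The sketch below applies the same return formula and the same np.percentile call to a hypothetical batch of ten episodes, summarized as (total reward, number of steps) pairs; the numbers are invented for illustration.

```python
import numpy as np

GAMMA = 0.9
PERCENTILE = 30

# Hypothetical batch: two successful episodes (6 and 10 steps) and eight failures.
episodes = [(1.0, 6), (1.0, 10)] + [(0.0, 5)] * 8

# Same formula as in the loop: total reward discounted by the episode length.
returns = [reward * (GAMMA ** n_steps) for reward, n_steps in episodes]

# 30th percentile of the returns; here it falls among the zero-return episodes.
reward_bound = np.percentile(returns, PERCENTILE)

# Episodes strictly above the boundary are the "elite" ones.
n_elite = sum(1 for g in returns if g > reward_bound)
```

In this example the 6-step success has a return of about 0.53 and the 10-step success about 0.35, the boundary is 0.0, and only the two successful episodes are kept.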
During this step we also compute reward_mean, which is used to decide when to finish the training loop:
reward_mean = float(np.mean(list(map(lambda s: s.reward, batch))))
STEP 3 — Throw away all episodes with a return below the boundary
Next, we filter out the episodes below the boundary with the following code:
train_obs = []
train_act = []
elite_batch = []
for example, discounted_reward in zip(elite_candidates, returnG):
    if discounted_reward > reward_bound:
        train_obs.extend(map(lambda step: step.observation, example.steps))
        train_act.extend(map(lambda step: step.action, example.steps))
        elite_batch.append(example)
full_batch = elite_batch
state = train_obs
acts = train_act
For every episode in the batch:
for example, discounted_reward in zip(elite_candidates, returnG):
we check that the episode has a higher discounted return than our boundary:
if discounted_reward > reward_bound:
and if it has, we will populate the list of observed states and actions that we will train on, and keep track of the elite episodes:
train_obs.extend(map(lambda step: step.observation, example.steps))
train_act.extend(map(lambda step: step.action, example.steps))
elite_batch.append(example)
Then we update these three variables with the “elite” episodes and the lists of states and actions with which we will train our neural network:
full_batch = elite_batch
state = train_obs
acts = train_act
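The filtering step can also be packaged as a small function, which makes it easy to try on hand-made data. This is a sketch under stated assumptions: the function name filter_elite and the two example episodes are hypothetical and are not part of the post's code.

```python
from collections import namedtuple

Episode = namedtuple('Episode', field_names=['reward', 'steps'])
EpisodeStep = namedtuple('EpisodeStep', field_names=['observation', 'action'])


def filter_elite(candidates, returns, bound):
    """Keep episodes whose return is strictly above bound (hypothetical helper).

    Returns the elite episodes plus the flattened observations and actions
    of all their steps, ready to be used as training inputs and labels.
    """
    train_obs, train_act, elite = [], [], []
    for episode, g in zip(candidates, returns):
        if g > bound:
            train_obs.extend(step.observation for step in episode.steps)
            train_act.extend(step.action for step in episode.steps)
            elite.append(episode)
    return elite, train_obs, train_act


# Illustrative data: one successful two-step episode and one failed episode.
good = Episode(1.0, [EpisodeStep(0, 2), EpisodeStep(1, 1)])
bad = Episode(0.0, [EpisodeStep(0, 0)])
elite, obs, acts = filter_elite([good, bad], returns=[0.81, 0.0], bound=0.0)
```

Note that observations and actions stay aligned by position: the i-th action in acts is the one taken in the i-th observation in obs.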
STEP 4 — Train the neural network using episode steps from the “elite” episodes
Every time our loop accumulates enough episodes (BATCH_SIZE), we compute the “elite” episodes, and in the same iteration the loop trains the Agent's neural network with this code:
state_t = torch.FloatTensor(state)
acts_t = torch.LongTensor(acts)
optimizer.zero_grad()
action_scores_t = net(state_t)
loss_t = objective(action_scores_t, acts_t)
loss_t.backward()
optimizer.step()
iter_no += 1
batch = []
This code trains the neural network using the episode steps from the “elite” episodes, with the state s as the input and the issued action a as the label (desired output). Let's go through the code line by line:
First, we transform the variables to tensors:
state_t = torch.FloatTensor(state)
acts_t = torch.LongTensor(acts)
We zero gradients of our neural network
optimizer.zero_grad()
and pass the observed state to the neural network, obtaining its action scores:
action_scores_t = net(state_t)
These scores are passed to the objective function, which will calculate cross-entropy between the neural network output and the actions that the agent took
loss_t = objective(action_scores_t, acts_t)
Remember that we only consider “elite” actions. The idea of this is to reinforce our neural network to carry out those “elite” actions that have led to good rewards.
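To see what the objective measures, the following NumPy sketch reimplements the cross-entropy between raw action scores and the actions taken: the mean, over the batch, of minus the log-softmax probability of the chosen action. It is an illustration written for this explanation, not the PyTorch implementation itself, and the score values are made up.

```python
import numpy as np


def cross_entropy(scores, actions):
    """Mean of -log softmax(scores)[action] over the batch."""
    scores = np.asarray(scores, dtype=np.float64)
    # Subtract the row maximum for numerical stability; softmax is unchanged.
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(actions)), actions].mean())


# Equal scores for 4 actions mean a uniform policy: the loss is ln(4).
uniform_loss = cross_entropy(np.zeros((3, 4)), [0, 2, 3])

# Scores that favor the action actually taken give a lower loss.
confident_loss = cross_entropy([[5.0, 0.0, 0.0, 0.0]], [0])
```

With the four actions of Frozen-Lake, a uniform policy has a loss of ln(4), roughly 1.386, which is consistent with the losses of about 1.38 printed in the first iterations of training.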
Finally, we calculate the gradients of the loss using the backward method and adjust the parameters of our neural network using the step method of the optimizer:
loss_t.backward()
optimizer.step()
Monitor the progress of the Agent
In order to monitor the progress of the Agent’s learning performance, we included this print in the training loop:
print("%d: loss=%.3f, reward_mean=%.3f" % (iter_no, loss_t.item(), reward_mean))
With it we show the iteration number, the loss and the mean reward of the batch (in the next section we also write the same values to TensorBoard to get a nice chart):
0: loss=1.384, reward_mean=0.020
1: loss=1.353, reward_mean=0.040
2: loss=1.332, reward_mean=0.010
3: loss=1.362, reward_mean=0.020
4: loss=1.337, reward_mean=0.020
5: loss=1.378, reward_mean=0.020
. . .
639: loss=0.471, reward_mean=0.730
640: loss=0.511, reward_mean=0.730
641: loss=0.472, reward_mean=0.760
642: loss=0.481, reward_mean=0.650
643: loss=0.472, reward_mean=0.750
644: loss=0.492, reward_mean=0.720
645: loss=0.480, reward_mean=0.660
646: loss=0.479, reward_mean=0.740
647: loss=0.474, reward_mean=0.660
648: loss=0.517, reward_mean=0.830
We can check that the last value of the reward_mean variable (0.830) is the one that exceeded REWARD_GOAL and ended the training loop.
Improving the Agent with a better neural network
In a previous post, we introduced TensorBoard, a tool that helps with data visualization. Instead of the print used in the previous section, we can use these two statements to plot the behavior of these two variables:
writer.add_scalar("loss", loss_t.item(), iter_no)
writer.add_scalar("reward_mean", reward_mean, iter_no)
In this case, the output is:
More complex Neural Network
One question that arises is whether we could improve the Agent's neural network. For instance, what happens if we use a hidden layer with more neurons, let's say 128:
HIDDEN_SIZE = 128
net = nn.Sequential(
    nn.Linear(obs_size, HIDDEN_SIZE),
    nn.Sigmoid(),
    nn.Linear(HIDDEN_SIZE, n_actions)
)
objective = nn.CrossEntropyLoss()
optimizer = optim.Adam(params=net.parameters(), lr=0.001)
train_loop()
The result is shown here (or can be reproduced by running the GitHub code):
We can see that this network learns faster than the previous one.
ReLU activation function
What happens if we change the activation function, e.g. to a ReLU instead of a Sigmoid?
Below you can see what happens: the network converges much earlier, finishing training in only 200 iterations.
Improving the Cross-Entropy algorithm
So far we have shown how to improve the neural network architecture. But we can also improve the algorithm itself: we can keep “elite” episodes for a longer time. The previous version of the algorithm samples episodes from the Environment, trains on the best ones and throws them away. However, when the number of successful episodes is small, the “elite” episodes can be kept for several iterations and trained on again. We need to change only one line in the code:
elite_candidates = full_batch + batch
# elite_candidates = batch
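The effect of this one-line change can be sketched with plain numbers. The example below assumes a hypothetical helper select_elite and represents each episode as an (id, discounted return) pair with invented values; it compares selecting from the fresh batch alone with selecting from the previous elites plus the fresh batch.

```python
import numpy as np

PERCENTILE = 30


def select_elite(candidates, percentile):
    """Keep (episode_id, discounted_return) pairs strictly above the percentile."""
    bound = np.percentile([g for _, g in candidates], percentile)
    return [c for c in candidates if c[1] > bound]


# Iteration 1: one success among five episodes (hypothetical numbers).
batch_1 = [("a", 0.0), ("b", 0.53), ("c", 0.0), ("d", 0.0), ("e", 0.0)]
full_batch = select_elite(batch_1, PERCENTILE)

# Iteration 2: the fresh batch contains no successful episode at all.
batch_2 = [("f", 0.0), ("g", 0.0), ("h", 0.0), ("i", 0.0), ("j", 0.0)]
without_memory = select_elite(batch_2, PERCENTILE)
with_memory = select_elite(full_batch + batch_2, PERCENTILE)
```

In this example the fresh batch alone yields no elite episodes, while carrying over full_batch lets the earlier success "b" be used for training again.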
The result seen through TensorBoard is:
We can see that the number of iterations required is reduced again.
Limitations of the Cross-Entropy method
So far we have seen that, with the proposed improvements, we can find a good neural network in very few iterations of the training loop. But this is because we are dealing with a very simple “non-slippery” Environment. What if we have a “slippery” Environment?
slippedy_env = gym.make('FrozenLake-v0', is_slippery=True)

class OneHotWrapper(gym.ObservationWrapper):
    def __init__(self, env):
        super(OneHotWrapper, self).__init__(env)
        self.observation_space = gym.spaces.Box(0.0, 1.0, (env.observation_space.n, ), dtype=np.float32)

    def observation(self, observation):
        r = np.copy(self.observation_space.low)
        r[observation] = 1.0
        return r

env = OneHotWrapper(slippedy_env)
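The observation method of the wrapper turns the integer state into a one-hot vector, which is the form the neural network takes as input. The standalone sketch below reproduces just that encoding with NumPy, without Gym; the function name one_hot is hypothetical, and 16 is the number of states of the default 4x4 Frozen-Lake map.

```python
import numpy as np


def one_hot(observation, n_states):
    """Encode an integer state as a float32 vector with a single 1.0."""
    r = np.zeros(n_states, dtype=np.float32)
    r[observation] = 1.0
    return r


# State 5 of the 16 states in the 4x4 grid.
encoded = one_hot(5, 16)
```

Each state therefore activates exactly one input of the network, so the first linear layer effectively learns a separate row of weights per grid cell.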
Again, TensorBoard is a big help. In the following figure, we see the behavior of the algorithm during the first iterations. The Reward is not able to take off:
But if we wait for 5,000 more iterations, we see that it does improve, though from there it stagnates and is no longer able to surpass a threshold:
And although we have waited more than two hours, it fails to improve and does not surpass the threshold of 60%:
Conclusion
With an example as simple as the slippery Frozen-Lake, we see that the Cross-Entropy method fails to train a neural network that reaches the Reward goal. Later in the series, you will become familiar with other methods that address these limitations. See you in the next post.
Acknowledgments: The code presented in this post was inspired by the code of Maxim Lapan, who has written an excellent practical book on the subject.