Cross-Entropy Method Performance Analysis



Implementation of the Cross-Entropy Training Loop


In this post, we will describe in detail the training loop of the Cross-Entropy method, which we skipped in the previous post, and we will see how we can improve the learning of the Agent by considering more complex neural networks. We will also present an improved variant of the method that keeps “elite” episodes for several iterations of the training process. Finally, we will show the limitations of the Cross-Entropy method to motivate other approaches.

Overview of the Training Loop

Next, we will walk in detail through the code that makes up the training loop introduced in the previous post.

The entire code of this post can be found on GitHub and can be run as a Google Colab notebook using this link.

Main variables

The code begins by defining the main parameters of the method.

BATCH_SIZE = 100     # number of episodes played per training iteration
GAMMA = 0.9          # discount factor used to compute the Return
PERCENTILE = 30      # percentile used to compute the reward boundary
REWARD_GOAL = 0.8    # mean reward at which training stops

Helper classes

We will require a series of helper classes:

from collections import namedtuple

Episode = namedtuple('Episode', field_names=['reward', 'steps'])
EpisodeStep = namedtuple('EpisodeStep',
                         field_names=['observation', 'action'])

Here we define two helper classes, both named tuples from the collections package in the standard library (a brief usage example follows the list):

  • EpisodeStep : a single step the Agent took in an episode, storing the observed state and the action chosen from it.
  • Episode : a single finished episode, storing its total undiscounted reward and the list of EpisodeStep it consists of.
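
For example (illustrative values, not taken from an actual run), a single step and a finished episode could be built like this:

step = EpisodeStep(observation=0, action=2)    # the Agent saw state 0 and chose action 2
episode = Episode(reward=1.0, steps=[step])    # an episode that ended with total reward 1.0
print(episode.reward, len(episode.steps))      # fields are accessed by name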

Initialization of variables

At this point, a set of variables that we will use in the training loop are initialized. We will present each of them as they are required in the loop:

iter_no = 0
reward_mean = 0
full_batch = []
batch = []
episode_steps = []
episode_reward = 0.0
state = env.reset()

The training loop

We learned in the previous post that the training loop of our Agent that implements the Cross-Entropy algorithm repeats 4 main steps until we become satisfied with the result:

1 — Play N number of episodes

2 — Calculate the Return for every episode and decide on a return boundary

3 — Throw away all episodes with a return below the boundary.

4 — Train the neural network using episode steps from the “elite” episodes

We have decided that the Agent must be trained until a certain Reward threshold is reached. Specifically, we have set a threshold of 80%, indicated by the variable REWARD_GOAL :

while reward_mean < REWARD_GOAL:

STEP 1 — Play N number of episodes

The next piece of code is the one that generates the batches with episodes:

action = select_action(state)
next_state, reward, episode_is_done, _ = env.step(action)

episode_steps.append(EpisodeStep(observation=state, action=action))
episode_reward += reward

if episode_is_done:  # Episode finished
    batch.append(Episode(reward=episode_reward,
                         steps=episode_steps))
    next_state = env.reset()
    episode_steps = []
    episode_reward = 0.0

    <STEP 2>
    <STEP 3>
    <STEP 4>

state = next_state
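
The block above relies on the select_action helper introduced in the previous post. As a reminder, a minimal sketch of it could look as follows, assuming that the observation state is already a vector the network can consume (e.g. one-hot encoded) and that net , torch , nn and np are in scope:

sm = nn.Softmax(dim=1)

def select_action(state):
    state_t = torch.FloatTensor([state])      # batch with a single observation
    act_probs_t = sm(net(state_t))            # action scores -> probabilities
    act_probs = act_probs_t.data.numpy()[0]
    return np.random.choice(len(act_probs), p=act_probs)   # sample the next action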

The main variables we will use are:

  • batch accumulates the list of Episode instances ( BATCH_SIZE=100 ).
  • episode_steps accumulates the list of steps in the current episode.
  • episode_reward maintains a reward counter for the current episode (in our case we only get a Reward at the end of the episode, but the algorithm is described for a more general situation where Rewards can appear at any step, not only at the last one).

The list of episode steps is extended with an (observation, action) pair. It is important to note that we save the observed state that was used to choose the action (but not the observation next_state returned by the Environment as a result of the action):

episode_steps.append(EpisodeStep(observation=state, action=action))

The reward is added to the current episode’s total reward:

episode_reward += reward

When the current episode is over (the Agent reaches a hole or the goal state), we append the finalized episode to the batch, saving the total reward and the steps we have taken. Then we reset our Environment to start over, and we reset the variables episode_steps and episode_reward to start tracking the next episode:

batch.append(Episode(reward=episode_reward, steps=episode_steps))
next_state = env.reset()
episode_steps = []
episode_reward = 0.0

STEP 2 — Calculate the Return for every episode and decide on a return boundary

The next piece of code implements step 2:

if len(batch) == BATCH_SIZE:
    reward_mean = float(np.mean(list(map(lambda s: s.reward, batch))))
    elite_candidates = batch
    returnG = list(map(lambda s: s.reward * (GAMMA ** len(s.steps)),
                       elite_candidates))
    reward_bound = np.percentile(returnG, PERCENTILE)

The training loop executes this step when a number of plays equal to BATCH_SIZE have been run:

if len(batch) == BATCH_SIZE:

First, the code calculates the Return for all the episodes:

elite_candidates = batch
returnG = list(map(lambda s: s.reward * (GAMMA ** len(s.steps)),
                   elite_candidates))
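
Since the Environment returns a Reward of 1.0 only when the goal is reached, discounting by GAMMA ** len(s.steps) simply favours shorter successful episodes. For example (illustrative episode lengths, not taken from a real run):

GAMMA = 0.9
print(1.0 * GAMMA ** 10)   # a successful 10-step episode -> Return of about 0.35
print(1.0 * GAMMA ** 25)   # a successful 25-step episode -> Return of about 0.07
print(0.0 * GAMMA ** 15)   # any failed episode           -> Return of 0.0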

In this step, from the given batch of episodes and the percentile value, we calculate a boundary reward, which will be used to filter the “elite” episodes used to train the Agent's neural network:

reward_bound = np.percentile(returnG, PERCENTILE)

To obtain the boundary reward, we use NumPy's percentile function, which, given a list of values and the desired percentile, returns the value of that percentile. In this code, the episodes whose Return lies above the 30th-percentile boundary (indicated by the variable PERCENTILE ) become the “elite” episodes.
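
As a quick illustration of how the boundary is obtained (hypothetical Return values and variable names, not from a real run):

import numpy as np

example_returns = [0.0, 0.0, 0.35, 0.0, 0.23, 0.43, 0.0, 0.28, 0.0, 0.31]
example_bound = np.percentile(example_returns, 30)          # 30th percentile of the Returns
elite = [g for g in example_returns if g > example_bound]   # here the bound is 0.0, so all
print(example_bound, elite)                                 # successful episodes are kept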

During this step we compute the reward_mean that is used to decide when to finish the training loop:

reward_mean = float(np.mean(list(map(lambda s: s.reward, batch))))

STEP 3 — Throw away all episodes with a return below the boundary

Next, we will filter our episodes with the following code:

train_obs = []
train_act = []
elite_batch = []

for example, discounted_reward in zip(elite_candidates, returnG):
    if discounted_reward > reward_bound:
       train_obs.extend(map(lambda step: step.observation,
                            example.steps))
       train_act.extend(map(lambda step: step.action,
                            example.steps))
       elite_batch.append(example)

full_batch = elite_batch
state = train_obs
acts = train_act

For every episode in the batch:

for example, discounted_reward in zip(elite_candidates, returnG):

we check whether the episode's discounted Return is above our boundary:

if discounted_reward > reward_bound:

and if it has, we will populate the list of observed states and actions that we will train on, and keep track of the elite episodes:

train_obs.extend(map(lambda step: step.observation, example.steps))
train_act.extend(map(lambda step: step.action, example.steps))
elite_batch.append(example)

Then we update these three variables with the “elite” episodes and the lists of states and actions with which we will train our neural network:

full_batch = elite_batch
state = train_obs
acts = train_act
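
At this point state and acts are parallel, flattened lists: for every step of every “elite” episode, state holds the observation the Agent saw and acts holds the action it took there, so each position is one training example. A minimal sanity check (not part of the original code) would be:

assert len(state) == len(acts)   # one action label per observed state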

STEP 4 — Train the neural network using episode steps from the “elite” episodes

Every time our loop accumulates enough episodes ( BATCH_SIZE ), we compute the “elite” episodes and, in the same iteration, train the Agent's neural network with this code:

state_t = torch.FloatTensor(state)
acts_t = torch.LongTensor(acts)

optimizer.zero_grad()
action_scores_t = net(state_t)
loss_t = objective(action_scores_t, acts_t)
loss_t.backward()
optimizer.step()

iter_no += 1
batch = []

This code trains the neural network using episode steps from the “elite” episodes, using the states s as the input and the issued actions a as the labels (desired output). Let's comment on the code lines in more detail.

First, we transform the variables to tensors:

state_t = torch.FloatTensor(state)
acts_t = torch.LongTensor(acts)

We zero the gradients of our neural network:

optimizer.zero_grad()

and pass the observed state to the neural network, obtaining its action scores:

action_scores_t = net(state_t)

These scores are passed to the objective function, which calculates the cross-entropy between the neural network output and the actions that the agent took:

loss_t = objective(action_scores_t, acts_t)
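
Note that nn.CrossEntropyLoss expects raw action scores (logits) together with the indices of the target actions, and applies log-softmax internally; this is why net ends with a plain nn.Linear layer and no softmax. A tiny, self-contained illustration with hypothetical values:

import torch
import torch.nn as nn

objective = nn.CrossEntropyLoss()
scores = torch.randn(5, 4)               # raw scores for 5 elite steps and 4 actions
targets = torch.tensor([0, 2, 2, 1, 3])  # actions actually taken in those steps
loss = objective(scores, targets)        # scalar loss to backpropagate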

Remember that we only consider “elite” actions. The idea of this is to reinforce our neural network to carry out those “elite” actions that have led to good rewards.

Finally, we need to calculate gradients on the loss using the backward method and adjust the parameters of our neural network using the step method of the optimizer:

loss_t.backward()
optimizer.step()

Monitor the progress of the Agent

In order to monitor the progress of the Agent’s learning performance, we included this print in the training loop:

print("%d: loss=%.3f, reward_mean=%.3f" %
      (iter_no, loss_t.item(), reward_mean))

With it we show the iteration number, the loss and the mean reward of the batch (in the next section we also write the same values to TensorBoard to get a nice chart):

0: loss=1.384, reward_mean=0.020 
1: loss=1.353, reward_mean=0.040  
2: loss=1.332, reward_mean=0.010  
3: loss=1.362, reward_mean=0.020  
4: loss=1.337, reward_mean=0.020   
5: loss=1.378, reward_mean=0.020
...
639: loss=0.471, reward_mean=0.730
640: loss=0.511, reward_mean=0.730 
641: loss=0.472, reward_mean=0.760 
642: loss=0.481, reward_mean=0.650 
643: loss=0.472, reward_mean=0.750 
644: loss=0.492, reward_mean=0.720 
645: loss=0.480, reward_mean=0.660 
646: loss=0.479, reward_mean=0.740 
647: loss=0.474, reward_mean=0.660  
648: loss=0.517, reward_mean=0.830 

We can check that the last value of the reward_mean variable is the one that exceeded REWARD_GOAL and therefore ended the training loop.

Improving the Agent with a better neural network

In a previous post we already introduced TensorBoard, a tool that helps with data visualization. Instead of the print used in the previous section, we could use these two statements to plot the behavior of these two variables:

writer.add_scalar("loss", loss_t.item(), iter_no)
writer.add_scalar("reward_mean", reward_mean, iter_no)
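
These calls assume a writer object was created earlier in the script; a minimal sketch, assuming PyTorch's bundled TensorBoard support is used, could be:

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(comment="-frozenlake-crossentropy")
# ... training loop containing the writer.add_scalar(...) calls above ...
writer.close()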

In this case, the output is:

[TensorBoard chart: loss and reward_mean per training iteration]

More complex Neural Network

One question that arises is whether we could improve the Agent's neural network. For instance, what happens if we consider a hidden layer with more neurons, say 128 neurons?

HIDDEN_SIZE = 128

net = nn.Sequential(
           nn.Linear(obs_size, HIDDEN_SIZE),
           nn.Sigmoid(),
           nn.Linear(HIDDEN_SIZE, n_actions)
           )

objective = nn.CrossEntropyLoss()
optimizer = optim.Adam(params=net.parameters(), lr=0.001)

train_loop()

The result is shown here (and can be reproduced by executing the GitHub code):

[TensorBoard chart: reward_mean with a 128-neuron hidden layer]

We can see that this network learns faster than the previous one.

ReLU activation function

What happens if we change the activation function, e.g. a ReLU instead of a Sigmoid?
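
Only the activation in the network definition needs to change; a sketch of the modified network (everything else stays exactly as in the block above):

net = nn.Sequential(
           nn.Linear(obs_size, HIDDEN_SIZE),
           nn.ReLU(),
           nn.Linear(HIDDEN_SIZE, n_actions)
           )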

Below you can see what happens: the network converges much earlier, finishing in only about 200 iterations.

[TensorBoard chart: reward_mean with the ReLU activation function]

Improving the Cross-Entropy algorithm

So far we have shown how to improve the neural network architecture, but we can also improve the algorithm itself: we can keep the “elite” episodes for a longer time. The previous version of the algorithm sampled episodes from the Environment, trained on the best ones and threw them away. However, when the number of successful episodes is small, the “elite” episodes can be maintained longer, keeping them for several iterations to train on them. We need to change only one line in the code:

elite_candidates = full_batch + batch
# elite_candidates = batch

The result seen through TensorBoard is:

[TensorBoard chart: reward_mean when “elite” episodes are kept across iterations]

We can see that the number of iterations required is reduced again.

Limitations of the Cross-Entropy method

So far we have seen that, with the proposed improvements, very few iterations of the training loop are enough to find a good neural network. But this is because we are dealing with a very simple “non-slippery” Environment. What happens if we have a “slippery” Environment?

slippery_env = gym.make('FrozenLake-v0', is_slippery=True)

class OneHotWrapper(gym.ObservationWrapper):
      def __init__(self, env):
          super(OneHotWrapper, self).__init__(env)
          self.observation_space = gym.spaces.Box(0.0, 1.0,
                (env.observation_space.n, ), dtype=np.float32)

      def observation(self, observation):
          r = np.copy(self.observation_space.low)
          r[observation] = 1.0
          return r

env = OneHotWrapper(slippery_env)
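
The wrapper turns the discrete FrozenLake state (an integer from 0 to 15 on the 4x4 map) into a one-hot vector that the neural network can consume. For example:

obs = env.reset()    # the starting tile, state 0
print(obs.shape)     # (16,) -- one entry per grid position
print(obs)           # [1. 0. 0. ... 0.] -- a 1.0 at the current state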

Again, TensorBoard is a big help. In the following figure, we see the behavior of the algorithm during the first iterations; the Reward value is not able to take off:

[TensorBoard chart: reward_mean during the first iterations on the slippery Environment]

But if we wait for 5,000 more iterations, we see that it does improve, but then it stagnates and is no longer able to surpass a certain threshold:

[TensorBoard chart: reward_mean after about 5,000 more iterations]

And although we waited more than two hours, it fails to improve and does not surpass the 60% threshold:

[TensorBoard chart: reward_mean after more than two hours of training]

Conclusion

With an example as simple as Frozen-Lake, we see that the Cross-Entropy method cannot find a solution (that is, it cannot train a neural network that solves the slippery Environment). Later in the series, you will become familiar with other methods that address these limitations. See you in the next post.

The entire code of this post can be found on GitHub and can be run as a Google Colab notebook using this link.

Acknowledgments: The code presented in this post has been inspired by the code of Maxim Lapan, who has written an excellent practical book on the subject.

