Cross-Entropy Method Performance Analysis



Implementation of the Cross-Entropy Training Loop


In this post, we will describe in detail the training loop of the Cross-Entropy method, which we skipped in the previous post, and we will see how we can improve the learning of the Agent by considering more complex neural networks. We will also present an improved variant of the method that keeps “elite” episodes for several iterations of the training process. Finally, we will show the limitations of the Cross-Entropy method to motivate other approaches.

Overview of the Training Loop

Next, we will walk in detail through the code that makes up the training loop introduced in the previous post.

The entire code of this post can be found on GitHub and can be run as a Google Colab notebook using this link.

Main variables

The code begins by defining the main parameters of the method.

BATCH_SIZE = 100     # number of episodes played per training iteration
GAMMA = 0.9          # discount factor used to compute the Return
PERCENTILE = 30      # percentile used to compute the reward boundary
REWARD_GOAL = 0.8    # mean reward at which training stops

Helper classes

We will require a series of helper classes:

from collections import namedtuple

Episode = namedtuple('Episode', field_names=['reward', 'steps'])
EpisodeStep = namedtuple('EpisodeStep',
                         field_names=['observation', 'action'])

Here we define two helper classes, both named tuples from the collections package in the standard library (a brief usage example follows the list):

  • EpisodeStep : a single step the Agent took in an episode, storing the observed state and the action chosen from it.
  • Episode : a single finished episode, storing its total undiscounted reward and the list of EpisodeStep it consists of.
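
For example (illustrative values, not taken from an actual run), a single step and a finished episode could be built like this:

step = EpisodeStep(observation=0, action=2)    # the Agent saw state 0 and chose action 2
episode = Episode(reward=1.0, steps=[step])    # an episode that ended with total reward 1.0
print(episode.reward, len(episode.steps))      # fields are accessed by name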

Initialization of variables

At this point, a set of variables that we will use in the training loop are initialized. We will present each of them as they are required in the loop:

iter_no = 0
reward_mean = 0
full_batch = []
batch = []
episode_steps = []
episode_reward = 0.0
state = env.reset()

The training loop

We learned in the previous post that the training loop of our Agent that implements the Cross-Entropy algorithm repeats 4 main steps until we become satisfied with the result:

1 — Play N number of episodes

2 — Calculate the Return for every episode and decide on a return boundary

3 — Throw away all episodes with a return below the boundary.

4 — Train the neural network using episode steps from the “elite” episodes

We have decided that the Agent must be trained until a certain Reward threshold is reached. Specifically, we have set a threshold of 80%, indicated by the variable REWARD_GOAL :

while reward_mean < REWARD_GOAL:

STEP 1 — Play N number of episodes

The next piece of code is the one that generates the batches with episodes:

action = select_action(state)
next_state, reward, episode_is_done, _ = env.step(action)

episode_steps.append(EpisodeStep(observation=state, action=action))
episode_reward += reward

if episode_is_done:  # Episode finished
    batch.append(Episode(reward=episode_reward,
                         steps=episode_steps))
    next_state = env.reset()
    episode_steps = []
    episode_reward = 0.0

    <STEP 2>
    <STEP 3>
    <STEP 4>

state = next_state
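
The block above relies on the select_action helper introduced in the previous post. As a reminder, a minimal sketch of it could look as follows, assuming that the observation state is already a vector the network can consume (e.g. one-hot encoded) and that net , torch , nn and np are in scope:

sm = nn.Softmax(dim=1)

def select_action(state):
    state_t = torch.FloatTensor([state])      # batch with a single observation
    act_probs_t = sm(net(state_t))            # action scores -> probabilities
    act_probs = act_probs_t.data.numpy()[0]
    return np.random.choice(len(act_probs), p=act_probs)   # sample the next action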

The main variables we will use are:

  • batch accumulates the list of Episode instances ( BATCH_SIZE=100 ).
  • episode_steps accumulates the list of steps in the current episode.
  • episode_reward maintains a reward counter for the current episode (in our case we only get a Reward at the end of the episode, but the algorithm is described for a more general situation where Rewards can appear at any step, not only at the last one).

The list of episode steps is extended with an (observation, action) pair. It is important to note that we save the observed state that was used to choose the action (but not the observation next_state returned by the Environment as a result of the action):

episode_steps.append(EpisodeStep(observation=state, action=action))

The reward is added to the current episode’s total reward:

episode_reward += reward

When the current episode is over (the Agent reaches a hole or the goal state), we append the finalized episode to the batch, saving the total reward and the steps we have taken. Then we reset our Environment to start over, and we reset the variables episode_steps and episode_reward to start tracking the next episode:

batch.append(Episode(reward=episode_reward, steps=episode_steps))
next_state = env.reset()
episode_steps = []
episode_reward = 0.0

STEP 2 — Calculate the Return for every episode and decide on a return boundary

The next piece of code implements step 2:

if len(batch) == BATCH_SIZE:
    reward_mean = float(np.mean(list(map(lambda s: s.reward, batch))))
    elite_candidates = batch
    returnG = list(map(lambda s: s.reward * (GAMMA ** len(s.steps)),
                       elite_candidates))
    reward_bound = np.percentile(returnG, PERCENTILE)

The training loop executes this step when a number of plays equal to BATCH_SIZE have been run:

if len(batch) == BATCH_SIZE:

First, the code calculates the Return for all the episodes:

elite_candidates = batch
returnG = list(map(lambda s: s.reward * (GAMMA ** len(s.steps)),
                   elite_candidates))
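
Since the Environment returns a Reward of 1.0 only when the goal is reached, discounting by GAMMA ** len(s.steps) simply favours shorter successful episodes. For example (illustrative episode lengths, not taken from a real run):

GAMMA = 0.9
print(1.0 * GAMMA ** 10)   # a successful 10-step episode -> Return of about 0.35
print(1.0 * GAMMA ** 25)   # a successful 25-step episode -> Return of about 0.07
print(0.0 * GAMMA ** 15)   # any failed episode           -> Return of 0.0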

In this step, from the given batch of episodes and the percentile value, we calculate a boundary reward, which will be used to filter the “elite” episodes used to train the Agent's neural network:

reward_bound = np.percentile(returnG, PERCENTILE)

To obtain the boundary reward, we use NumPy's percentile function, which, given a list of values and the desired percentile, returns the value of that percentile. In this code, the episodes whose Return lies above the 30th-percentile boundary (indicated by the variable PERCENTILE ) become the “elite” episodes.
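
As a quick illustration of how the boundary is obtained (hypothetical Return values and variable names, not from a real run):

import numpy as np

example_returns = [0.0, 0.0, 0.35, 0.0, 0.23, 0.43, 0.0, 0.28, 0.0, 0.31]
example_bound = np.percentile(example_returns, 30)          # 30th percentile of the Returns
elite = [g for g in example_returns if g > example_bound]   # here the bound is 0.0, so all
print(example_bound, elite)                                 # successful episodes are kept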

During this step we compute the reward_mean that is used to decide when to finish the training loop:

reward_mean = float(np.mean(list(map(lambda s: s.reward, batch))))

STEP 3 — Throw away all episodes with a return below the boundary

Next, we will filter our episodes with the following code:

train_obs = []
train_act = []
elite_batch = []

for example, discounted_reward in zip(elite_candidates, returnG):
    if discounted_reward > reward_bound:
       train_obs.extend(map(lambda step: step.observation,
                            example.steps))
       train_act.extend(map(lambda step: step.action,
                            example.steps))
       elite_batch.append(example)

full_batch = elite_batch
state = train_obs
acts = train_act

For every episode in the batch:

for example, discounted_reward in zip(elite_candidates, returnG):

we check whether the episode's discounted Return is above our boundary:

if discounted_reward > reward_bound:

and if it has, we will populate the list of observed states and actions that we will train on, and keep track of the elite episodes:

train_obs.extend(map(lambda step: step.observation, example.steps))
train_act.extend(map(lambda step: step.action, example.steps))
elite_batch.append(example)

Then we update these three variables with the “elite” episodes and the lists of states and actions with which we will train our neural network:

full_batch = elite_batch
state = train_obs
acts = train_act
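
At this point state and acts are parallel, flattened lists: for every step of every “elite” episode, state holds the observation the Agent saw and acts holds the action it took there, so each position is one training example. A minimal sanity check (not part of the original code) would be:

assert len(state) == len(acts)   # one action label per observed state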

STEP 4 — Train the neural network using episode steps from the “elite” episodes

Every time our loop accumulates enough episodes ( BATCH_SIZE ), we compute the “elite” episodes and, in the same iteration, train the Agent's neural network with this code:

state_t = torch.FloatTensor(state)
acts_t = torch.LongTensor(acts)

optimizer.zero_grad()
action_scores_t = net(state_t)
loss_t = objective(action_scores_t, acts_t)
loss_t.backward()
optimizer.step()

iter_no += 1
batch = []

This code trains the neural network using episode steps from the “elite” episodes, using the states s as the input and the issued actions a as the labels (desired output). Let's comment on the code lines in more detail.

First, we transform the variables to tensors:

state_t = torch.FloatTensor(state)
acts_t = torch.LongTensor(acts)

We zero the gradients of our neural network:

optimizer.zero_grad()

and pass the observed state to the neural network, obtaining its action scores:

action_scores_t = net(state_t)

These scores are passed to the objective function, which calculates the cross-entropy between the neural network output and the actions that the agent took:

loss_t = objective(action_scores_t, acts_t)
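
Note that nn.CrossEntropyLoss expects raw action scores (logits) together with the indices of the target actions, and applies log-softmax internally; this is why net ends with a plain nn.Linear layer and no softmax. A tiny, self-contained illustration with hypothetical values:

import torch
import torch.nn as nn

objective = nn.CrossEntropyLoss()
scores = torch.randn(5, 4)               # raw scores for 5 elite steps and 4 actions
targets = torch.tensor([0, 2, 2, 1, 3])  # actions actually taken in those steps
loss = objective(scores, targets)        # scalar loss to backpropagate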

Remember that we only consider “elite” actions. The idea of this is to reinforce our neural network to carry out those “elite” actions that have led to good rewards.

Finally, we need to calculate gradients on the loss using the backward method and adjust the parameters of our neural network using the step method of the optimizer:

loss_t.backward()
optimizer.step()

Monitor the progress of the Agent

In order to monitor the progress of the Agent’s learning performance, we included this print in the training loop:

print("%d: loss=%.3f, reward_mean=%.3f" %
      (iter_no, loss_t.item(), reward_mean))

With it we show the iteration number, the loss and the mean reward of the batch (in the next section we also write the same values to TensorBoard to get a nice chart):

0: loss=1.384, reward_mean=0.020 
1: loss=1.353, reward_mean=0.040  
2: loss=1.332, reward_mean=0.010  
3: loss=1.362, reward_mean=0.020  
4: loss=1.337, reward_mean=0.020   
5: loss=1.378, reward_mean=0.020
...
639: loss=0.471, reward_mean=0.730
640: loss=0.511, reward_mean=0.730 
641: loss=0.472, reward_mean=0.760 
642: loss=0.481, reward_mean=0.650 
643: loss=0.472, reward_mean=0.750 
644: loss=0.492, reward_mean=0.720 
645: loss=0.480, reward_mean=0.660 
646: loss=0.479, reward_mean=0.740 
647: loss=0.474, reward_mean=0.660  
648: loss=0.517, reward_mean=0.830 

We can check that the last value of the reward_mean variable is the one that exceeded REWARD_GOAL and therefore ended the training loop.

Improving the Agent with a better neural network

In a previous post we already introduced TensorBoard, a tool that helps with data visualization. Instead of the print used in the previous section, we could use these two statements to plot the behavior of these two variables:

writer.add_scalar("loss", loss_t.item(), iter_no)
writer.add_scalar("reward_mean", reward_mean, iter_no)
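
These calls assume a writer object was created earlier in the script; a minimal sketch, assuming PyTorch's bundled TensorBoard support is used, could be:

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(comment="-frozenlake-crossentropy")
# ... training loop containing the writer.add_scalar(...) calls above ...
writer.close()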

In this case, the output is:

[TensorBoard chart: loss and reward_mean per training iteration]

More complex Neural Network

One question that arises is whether we could improve the Agent's neural network. For instance, what happens if we consider a hidden layer with more neurons, say 128 neurons?

HIDDEN_SIZE = 128

net = nn.Sequential(
           nn.Linear(obs_size, HIDDEN_SIZE),
           nn.Sigmoid(),
           nn.Linear(HIDDEN_SIZE, n_actions)
           )

objective = nn.CrossEntropyLoss()
optimizer = optim.Adam(params=net.parameters(), lr=0.001)

train_loop()

The result is shown here (and can be reproduced by executing the GitHub code):

[TensorBoard chart: reward_mean with a 128-neuron hidden layer]

We can see that this network learns faster than the previous one.

ReLU activation function

What happens if we change the activation function, e.g. a ReLU instead of a Sigmoid?
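
Only the activation in the network definition needs to change; a sketch of the modified network (everything else stays exactly as in the block above):

net = nn.Sequential(
           nn.Linear(obs_size, HIDDEN_SIZE),
           nn.ReLU(),
           nn.Linear(HIDDEN_SIZE, n_actions)
           )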

Below you can see what happens: the network converges much earlier, finishing in only about 200 iterations.

[TensorBoard chart: reward_mean with the ReLU activation function]

Improving the Cross-Entropy algorithm

So far we have shown how to improve the neural network architecture, but we can also improve the algorithm itself: we can keep the “elite” episodes for a longer time. The previous version of the algorithm sampled episodes from the Environment, trained on the best ones and threw them away. However, when the number of successful episodes is small, the “elite” episodes can be maintained longer, keeping them for several iterations to train on them. We need to change only one line in the code:

elite_candidates = full_batch + batch
# elite_candidates = batch

The result seen through TensorBoard is:

[TensorBoard chart: reward_mean when “elite” episodes are kept across iterations]

We can see that the number of iterations required is reduced again.

Limitations of the Cross-Entropy method

So far we have seen that, with the proposed improvements, very few iterations of the training loop are enough to find a good neural network. But this is because we are dealing with a very simple “non-slippery” Environment. What happens if we have a “slippery” Environment?

slippery_env = gym.make('FrozenLake-v0', is_slippery=True)

class OneHotWrapper(gym.ObservationWrapper):
      def __init__(self, env):
          super(OneHotWrapper, self).__init__(env)
          self.observation_space = gym.spaces.Box(0.0, 1.0,
                (env.observation_space.n, ), dtype=np.float32)

      def observation(self, observation):
          r = np.copy(self.observation_space.low)
          r[observation] = 1.0
          return r

env = OneHotWrapper(slippery_env)
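
The wrapper turns the discrete FrozenLake state (an integer from 0 to 15 on the 4x4 map) into a one-hot vector that the neural network can consume. For example:

obs = env.reset()    # the starting tile, state 0
print(obs.shape)     # (16,) -- one entry per grid position
print(obs)           # [1. 0. 0. ... 0.] -- a 1.0 at the current state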

Again, TensorBoard is a big help. In the following figure, we see the behavior of the algorithm during the first iterations; the Reward value is not able to take off:

[TensorBoard chart: reward_mean during the first iterations on the slippery Environment]

But if we wait for 5,000 more iterations, we see that it does improve, but then it stagnates and is no longer able to surpass a certain threshold:

[TensorBoard chart: reward_mean after about 5,000 more iterations]

And although we waited more than two hours, it fails to improve and does not surpass the 60% threshold:

[TensorBoard chart: reward_mean after more than two hours of training]

Conclusion

With an example as simple as Frozen-Lake, we see that the Cross-Entropy method cannot find a solution (that is, it cannot train a neural network that solves the slippery Environment). Later in the series, you will become familiar with other methods that address these limitations. See you in the next post.

The entire code of this post can be found on GitHub and can be run as a Google Colab notebook using this link.

Acknowledgments: The code presented in this post has been inspired by the code of Maxim Lapan, who has written an excellent practical book on the subject.

