Solving a Reinforcement Learning Problem Using Cross-Entropy Method


Agent Creation Using Deep Neural Networks


After a parenthesis of three posts introducing the basics of Deep Learning and PyTorch, in this post we put the focus back on Reinforcement Learning.

In a previous post we explained that an Agent makes decisions to solve complex decision-making problems under uncertainty. For this purpose the Agent employs a policy, a strategy that determines the next action a based on the current state s.

Even for fairly simple environments, we can have a variety of policies, so we need a method to automatically find optimal ones. From this post onwards we will explore different methods to obtain a policy that allows an Agent to make decisions.

In this post we will start with the Cross-Entropy method. Despite its simplicity, it works well in basic environments and is easy to implement, which makes it an ideal baseline method to try.

The Cross-Entropy Method

Overview

Remember that a policy, denoted by π(a|s), says which action a the Agent should take for every state s observed. In this post the core of our Agent will be a neural network that produces the policy π. We can refer to the methods that solve this type of problem as policy gradient methods, which train the neural network with the goal of maximizing the expected Return (G).

In practice, the policy is usually represented as a probability distribution over the actions that the Agent can take at a given state, which makes it very similar to a classification problem presented before (in the Deep Learning post), with the number of classes equal to the number of actions we can carry out. In our case the output of our neural network is an action vector that represents a probability distribution:

[Figure: the neural network takes the state as input and outputs a probability distribution over the available actions]

We refer to it as a stochastic policy gradient, because it returns a probability distribution over actions rather than a single deterministic action.
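
To make this concrete, here is a toy illustration (the numbers are invented, not produced by any real network) of what such a stochastic policy looks like for Frozen-Lake, which has four actions:

import numpy as np

# Hypothetical action probabilities for one state (actions 0..3: LEFT, DOWN, RIGHT, UP)
act_probs = np.array([0.1, 0.6, 0.2, 0.1])                # non-negative and sums to 1.0
action = np.random.choice(len(act_probs), p=act_probs)    # DOWN (1) is sampled most often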

How to improve our policy?

We want a policy, a probability distribution over actions, and we initialize it at random. We then improve it by playing a few games and adjusting the policy (the parameters of the neural network) so that it becomes more effective, and we repeat this process so that our policy gradually gets better. One algorithm that can be used for this is the Cross-Entropy method.

Training Dataset

Since a neural network is the heart of this first Agent, we need some way to obtain data that we can treat as a training dataset, with input data and their respective labels.

During the Agent's lifetime, its experience is presented as episodes. Every episode is a sequence of the state observations that the Agent has received from the Environment, the actions it has issued, and the Rewards obtained for these actions.

Imagine that our Agent has played several such episodes. For every episode, we can calculate the Return (total reward) that the Agent has obtained. Remember that an Agent tries to accumulate as much total Reward as possible by interacting with the Environment.

Again, for simplicity we will use the Frozen-Lake example. To understand what's going on, we need to look deeper at the Reward structure of the Frozen-Lake Environment. We get a reward of 1.0 only when we reach the goal, and this Reward says nothing about how good each episode was. Was it quick and efficient, or did we wander around the lake before we randomly stepped into the final cell? We don't know; it's just a 1.0 reward and that's it.

Let’s imagine that we already have the Agent programmed and we use it to create 4 episodes, that we will then visualize with the .render() method already presented:

[Figure: .render() output of the four sample episodes]

Note that due to randomness in the Environment and in the way the Agent selects actions, the episodes have different lengths and also show different Rewards. Obviously an episode with a Reward of 1.0 is better than one with a reward of 0.0. But what about episodes that end with the same reward?

It is clear that we can consider some episodes "better" than others, e.g. the third is shorter than the second. For this, we can use the discount factor gamma = 0.9 presented previously. In this case, the Return (G) of shorter successful episodes will be higher than the Return of longer ones.
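
As a quick sanity check (the step counts below are only illustrative), a small helper shows how the discount makes a shorter successful episode score a higher Return than a longer one:

GAMMA = 0.9

def discounted_return(rewards, gamma=GAMMA):
    # Accumulate the Return G backwards: G_t = r_t + gamma * G_{t+1}
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Frozen-Lake gives a single reward of 1.0 at the final (successful) step
print(discounted_return([0.0]*5 + [1.0]))   # goal reached in 6 steps -> 0.9**5 ≈ 0.59
print(discounted_return([0.0]*7 + [1.0]))   # goal reached in 8 steps -> 0.9**7 ≈ 0.48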

Let’s illustrate these four episodes with a diagram where each cell represents the Agent’s step in the episode and its Return:

[Figure: diagram of the four episodes, one cell per Agent step, with the Return of each episode]

Cross-Entropy Algorithm

The core of the Cross-Entropy method is simple: generate episodes, throw away the bad ones and train on the better ones. A summary of the steps of the method can be described as follows (a minimal sketch of steps 2 and 3 follows the list):

  1. Play a number of episodes in the Environment using our current Agent model.
  2. Calculate the Return for every episode and decide on a return boundary. Usually, we use some percentile of all the returns.
  3. Throw away all episodes with a return below the return boundary.
  4. Train the neural network of the Agent using episode steps (tuples <s,a,r>) from the remaining "elite" episodes, using the state s as the input and the issued action a as the label (desired output).
  5. Repeat from step 1 until we become satisfied with the result.
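
As announced above, here is a minimal sketch of steps 2 and 3 using NumPy's percentile function (the returns and the 70th-percentile boundary are hypothetical choices of mine, just to illustrate the idea):

import numpy as np

# Hypothetical Returns of a batch of four episodes
returns = [0.0, 0.59, 0.0, 0.48]

boundary = np.percentile(returns, 70)                 # return boundary (step 2)
elite = [ret for ret in returns if ret >= boundary]   # keep only the best episodes (step 3)
print(boundary, elite)                                # boundary ≈ 0.49; only the 0.59 episode survives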

A variant of the method, which we will discuss in the next post, is to keep the "elite" episodes for a longer time. The default version of the algorithm samples episodes from the Environment, trains on the best ones, and throws them away. However, when the number of successful episodes is small, the "elite" episodes can be kept for several iterations so that we keep training on them.

The Environment

The Environment is the source of data from which we are going to create the dataset that will be used to train the neural network of our Agent.

Episode steps

The Agent will start from a random policy, where the probability of all actions is uniform, and while training it will hopefully learn, from the data obtained from the Environment, to move its policy toward the optimal one.

The data that come from the Environment are episode steps, expressed as tuples of the form <s,a,r> (state, action and Reward), obtained at each timestep as indicated in the following scheme:

[Figure: the Agent-Environment interaction loop producing a <s,a,r> tuple at each timestep]
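
One simple way to store these episode steps in Python (my own convention here, not necessarily the exact structure used in the GitHub code) is with named tuples:

from collections import namedtuple

# One <s,a,r> tuple per timestep, plus a container for a whole episode
EpisodeStep = namedtuple('EpisodeStep', ['state', 'action', 'reward'])
Episode = namedtuple('Episode', ['episode_return', 'steps'])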

Coding the Environment

The entire code of this post can be found on GitHub and can be run as a Colab google notebook using this link .

Let’s code it. We must first import several packages:

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import gym
import gym.spaces

We will start by creating the non-slippery Environment (in the next post we will discuss more about the slippery version):

env = gym.make('FrozenLake-v0', is_slippery=False)

Our state space is discrete, which means that it’s just a number from zero to fifteen inclusive (our current position in the grid). The action space is also discrete, from zero to three.
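
We can verify this directly on the Environment we just created (these are the values reported by the standard FrozenLake-v0 map):

print(env.observation_space)   # Discrete(16)
print(env.action_space)        # Discrete(4)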

Our neural network expects a vector of numbers. To get this, we can apply the traditional one-hot encoding of discrete inputs (presented in this previous post), which means that the input to our network will have 16 numbers, with zero everywhere except at the index that we want to encode. To simplify the code, we can use the ObservationWrapper class from Gym and implement our own OneHotWrapper class:

class OneHotWrapper(gym.ObservationWrapper):
    def __init__(self, env):
        super(OneHotWrapper, self).__init__(env)
        self.observation_space = gym.spaces.Box(0.0, 1.0,
                (env.observation_space.n, ), dtype=np.float32)

    def observation(self, observation):
        r = np.copy(self.observation_space.low)
        r[observation] = 1.0
        return r

env = OneHotWrapper(env)
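
A quick check shows the effect of the wrapper: resetting the Environment now returns a 16-dimensional one-hot vector instead of a plain integer (the Agent always starts in cell 0, so the 1.0 appears at index 0):

state = env.reset()
print(state.shape)   # (16,)
print(state)         # [1. 0. 0. ... 0.]  only the current cell is set to 1.0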

As a summary, we now have in env an Environment (non-slippery Frozen-Lake) that we will use to obtain the data to train our Agent.

The Agent

We have already mentioned that our Agent is based on a neural network. Let's see how to code this neural network and how it is used to select the actions that the Agent takes.

The model

Our model's core is a one-hidden-layer neural network with 32 neurons and a Sigmoid activation function. There is nothing special about this network; we simply start with an arbitrary number of layers and neurons.

obs_size = env.observation_space.shape[0]
n_actions = env.action_space.n
HIDDEN_SIZE = 32

net = nn.Sequential(
     nn.Linear(obs_size, HIDDEN_SIZE),
     nn.Sigmoid(),
     nn.Linear(HIDDEN_SIZE, n_actions)
)

The neural network takes a single observation from the Environment as an input vector and outputs a number for every action we can perform, representing a probability distribution over actions. A straightforward way to proceed would be to include a softmax nonlinearity after the last layer. However, remember from a previous post that we try to avoid applying softmax explicitly to increase the numerical stability of the training process. Rather than calculating softmax and then computing the Cross-Entropy loss on it, in this example we use the PyTorch class nn.CrossEntropyLoss, which combines both softmax and Cross-Entropy in a single, more numerically stable expression. CrossEntropyLoss requires raw, unnormalized values from the neural network (also called logits).

Optimizer and Loss function

Other "hyperparameters", such as the Loss function and the Optimizer, are also chosen almost arbitrarily for this example:

objective = nn.CrossEntropyLoss()
optimizer = optim.SGD(params=net.parameters(), lr=0.001)

As we will see, the method is robust and converges very quickly, giving us plenty of room to choose the hyperparameters.
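
To make the point about logits concrete, here is a tiny (made-up) example of how this objective would be applied: the network output is passed in raw, and the labels are the integer indices of the actions actually taken:

states_batch = torch.FloatTensor(np.eye(obs_size, dtype=np.float32)[[0, 1]])   # two one-hot states
actions_taken = torch.LongTensor([2, 0])                                       # hypothetical action labels
loss = objective(net(states_batch), actions_taken)                             # softmax + Cross-Entropy in one step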

Get an Action

This abstraction makes our Agent very simple: it needs to pass the observation (state) that it receives from the Environment to the neural network model and sample the resulting probability distribution to get an action to carry out:

sm = nn.Softmax(dim=1)

def select_action(state):
1:  state_t = torch.FloatTensor([state])
2:  act_probs_t = sm(net(state_t))
3:  act_probs = act_probs_t.data.numpy()[0]
4:  action = np.random.choice(len(act_probs), p=act_probs)
    return action

Line 1: The first step of this function is to transform the state into a tensor that can be fed to our neural network. At every call, we convert our current observation (a NumPy array of 16 positions) to a PyTorch tensor and pass it to the model to obtain the action probabilities. Remember that our neural network model needs tensors as input data.

Line 2: As a consequence of using nn.CrossEntropyLoss, we need to remember to apply softmax ourselves every time we want to obtain probabilities from our neural network output.

Line 3: We need to convert the output tensor (remember that the model and the softmax function return tensors) into a NumPy array. This array has the same 2D structure as the input, with the batch dimension on axis 0, so we take the first batch element to obtain a 1D vector of action probabilities.

Line 4: With this probability distribution over actions, we obtain the actual action for the current step by sampling the distribution with NumPy's random.choice() function.
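
For example, once the Environment and the network are defined as above, a single decision looks like this:

state = env.reset()              # one-hot vector of 16 positions
action = select_action(state)    # an integer between 0 and 3

Note that because we sample from the distribution instead of always taking the most probable action, the Agent naturally tries different actions, which helps it generate varied episodes to learn from.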

Training the Agent

In the next figure we show a screenshot of the training loop, indicating the general steps of the Cross-Entropy algorithm:

[Figure: screenshot of the training loop, annotated with the general steps of the Cross-Entropy algorithm]

In order not to make this post too long, we leave the detailed explanation of this loop for the next post. Remember that the entire code of this post can be found on GitHub. For now I simply propose to run the code of this loop and look at the results. Just to mention that we consider a good result to be a Reward of 80%, i.e. the Agent reaching the Goal in 80% of the episodes.
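
Since the detailed walkthrough is left for the next post, what follows is only a rough sketch of what such a training loop could look like, based on the steps listed above (the batch size, the 70th-percentile boundary and the variable names are my own choices here, not necessarily those of the GitHub code):

BATCH_SIZE = 100    # episodes played per iteration (assumed value)
PERCENTILE = 70     # return boundary: keep roughly the top 30% of episodes
GAMMA = 0.9

for iteration in range(50):
    # 1. Play a batch of episodes with the current policy
    batch = []
    for _ in range(BATCH_SIZE):
        state = env.reset()
        steps, episode_return, discount, is_done = [], 0.0, 1.0, False
        while not is_done:
            action = select_action(state)
            new_state, reward, is_done, _ = env.step(action)
            steps.append((state, action))
            episode_return += discount * reward
            discount *= GAMMA
            state = new_state
        batch.append((episode_return, steps))

    # 2-3. Compute the return boundary and keep only the "elite" episodes
    returns = [ret for ret, _ in batch]
    boundary = np.percentile(returns, PERCENTILE)
    elite = [steps for ret, steps in batch if ret >= boundary and ret > 0]
    if not elite:
        continue    # no successful episodes yet, play another batch
    train_states = [s for steps in elite for s, a in steps]
    train_actions = [a for steps in elite for s, a in steps]

    # 4. Train the network: states as inputs, the actions actually taken as labels
    optimizer.zero_grad()
    logits = net(torch.FloatTensor(np.array(train_states)))
    loss = objective(logits, torch.LongTensor(train_actions))
    loss.backward()
    optimizer.step()

    # 5. Stop when the Agent succeeds in more than 80% of the episodes
    success_rate = np.mean([ret > 0 for ret in returns])
    print("iteration %d: loss=%.3f, success=%.2f" % (iteration, loss.item(), success_rate))
    if success_rate > 0.8:
        break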

Test the Agent

In any case, what remains now is to see whether the Agent really makes good decisions. To check this, we can create a new Environment (test_env) and check whether our Agent is able to reach the Goal cell (we will use the .render() method in the code to make it more visual):

test_env = OneHotWrapper(gym.make('FrozenLake-v0', 
           is_slippery=False))
state = test_env.reset()
test_env.render()

is_done = False
while not is_done:
   action = select_action(state)
   new_state, reward, is_done, _ = test_env.step(action)
   test_env.render()
   state = new_state

print("reward = ", reward)

If we try it several times we will see that it does it well enough:

[Figure: rendered output of a test episode showing the Agent reaching the Goal]

What next?

In the next post we will describe the training loop in detail (which we have skipped in this post), as well as see how we can improve the learning of the Agent with a better neural network (more neurons or different activation functions). We will also consider the variant of the method that keeps the "elite" episodes for several iterations of the training process. See you in the following post.

The entire code of this post can be found on GitHub and can be run as a Colab google notebook using this link .

