内容简介:Beyond Temporal Difference learning (TD), we’ve discussed theThe rapid intra-episodic responsiveness that characterize online learning approaches such as TD make them suitable for highly dynamic environments,where the values of states and actions is contin
Fundamentals of Reinforcement Learning
Jun 2 ·13min read
Introduction
O nline learning methods are a dynamic family of reinforcement learning algorithms behind many of the achievements in general AI over the past decade. Belonging to the sample-based learning class of reinforcement learning approaches, online learning methods allow for the determination of state values simply through repeated observations, eliminating the need for transition dynamics. Unlike their offline counterparts , online learning approaches allow for the incremental updates of the values of states and actions during an environmental episode, allowing for constant, incremental performance improvements to be observed.
Beyond Temporal Difference learning (TD), we’ve discussed the theory and practical implementations of Q-learning, an evolution of TD designed to allow for incremental estimations and improvement of state-action values. Q-learning has been made famous as becoming the backbone of reinforcement learning approaches to simulated game environments, such as those observed in OpenAI’s gyms. As we’ve already covered theoretical aspects of Q-learning inpast articles, they will not be repeated here.
The rapid intra-episodic responsiveness that characterize online learning approaches such as TD make them suitable for highly dynamic environments,where the values of states and actions is continuously updated throughout time through sets of estimates. Perhaps most notably, TD is the foundation of Q-learning, a more advanced algorithm used to train agents tackling game environments such as those observed in the OpenAI Atari gyms, as covered insome of ourprevious implementations.
In this article, we’ll explore how Q-learning can be applied to training an agent to play the classic FPS game Doom , through the use of the open-source OpenAI gym wrapper library Vizdoomgym . We’ll guide you through setting up your first agent, and lay the foundations of future work.
Beyond TD: SARSA & Q-learning
Recall that in Temporal Difference learning, we observed that an agent behaves cyclically in an environment, through sequence of States (S), Actions (A), and (Rewards).
During TD, we can update the value of the previous state as soon as we reach the next state. We can further expand the scope of our model to include state-action values, in an approach termed SARSA, an on-policy TD control algorithm for estimating action values . During SARSA, we continuously estimate the action-values across incremental timesteps under a given policy, while using this knowledge to modify the policy towards an action-value greedy alternative.
Let’s compare the state-action and state-value TD update equations:
Q-learning differs from SARSA by forcing a selection of the action with the current highest action value during an update, in a similar approach to what’s observed with Bellman Optimality Equations. We can inspect SARSA and Q-learning next to the Bellman and Bellman Optimality Equations, below:
You may be wondering about how ensure complete exploration of our state-action space, given the need to constantly select actions for a state with the highest existing action-value. In theory, we could be avoiding the optimal action simply by failing to evaluate it in the first place. To encourage exploration, we can use a decaying e-greedy policy, essentially forcing the agent to select an apparent sub-optimal action at a decaying probability in order to learn more about its value. By using a decay value, we can limit exploration once all of the states have been evaluated, after which we’ll permanently select the optimal actions for each state.
With our theory covered, let’s get started on our implementation.
Implementation
Our Google Colaboratory implementation is written in Python utilizing Tensorflow Core, and can be found on the GradientCrescent Github. Readers following our publication will find the code similar to our previous implementations forAtari environments. As the implementation for this approach is quite convoluted, let’s summarize the order of actions required :
1. We define our Deep Q-learning neural network . This is a CNN that takes in-game screen images and outputs the probabilities of each of the actions, or Q-values, in the Ms-Pacman gamespace. To acquire a tensor of probabilitieses, we do not include any activation function in our final layer.
2. As Q-learning require us to have knowledge of both the current and next states, we need to start with data generation . We feed preprocessed input images of the game space, representing initial states s , into the network, and acquire the initial probability distribution of actions, or Q-values. Before training, these values will appear random and sub-optimal.
3. With our tensor of probabilities, we then select the action with the current highest probability using the argmax() function, and use it to build an epsilon greedy policy.
4. Using our policy, we’ll then select the action a , and evaluate our decision in the gym environment to receive information on the new state s’ , the reward r , and whether the episode has been finished.
5. We store this combination of information in a buffer in the list form <s,a,r,s’,d>, and repeat steps 2–4 for a preset number of times to build up a large enough buffer dataset.
6. Once step 5 has finished,we move to generate our target y -values, R’ and A’ , that are required for the loss calculation. While the former is simply discounted from R , we obtain the A’ by feeding S’ into our network.
7. With all of our components in place, we can then calculate the loss to train our network .
8. Once training has finished, we’ll evaluate the performance of our agent under a new game episode, and record the performance.
Let’s get started. With Tensorflow 2 in use for Colaboratory environments, we’ve converted our code to be TF2 compliant, using the new compat package. Note that this code is not TF2 native.
Let’s by importing all of the necessary packages, including the OpenAI and Vizdoomgym environments and Tensorflow Core:
import gym import vizdoomgym !pip install tensorflow==1.15 import numpy as np import tensorflow as tf from tensorflow.contrib.layers import flatten, conv2d, fully_connected from collections import deque, Counter import random from datetime import datetime
Next, we define a preprocessing function to normalize and resize the observations from our gym environment and convert them into one-dimensional tensors.
from skimage.color import rgb2gray from skimage import transform#prepro (240, 320, 3) uint8 frame into 30x40 1D float vector color = np.array([240, 320, 74]).mean() def preprocess_observation(obs): img =obs/255.0 img[img==color] = 0 img_gray = rgb2gray(img) preprocessed_frame = transform.resize(img_gray, [60,80]) return preprocessed_frame
Next, let’s initialize the gym environment. We’ll be using Vizdoomgym’s health gathering scenario, where the objective is to collect as many health tokens as possible in order to stay alive while navigating a square room with a damaging acidic floor.
env = gym.make(‘VizdoomHealthGathering-v0’) n_outputs = env.action_space.n print(n_outputs)observation = env.reset()import tensorflow as tf import matplotlib.pyplot as pltfor i in range(22): if i > 20: print(observation.shape) plt.imshow(observation) plt.show()observation, _, _, _ = env.step(1)
We can inspect a screen of gameplay, and also view the 3 actions available within the gamespace, namely to turn left, right, or to move forward. Naturally, this information is not available to our agent.
We can take this chance to compare our original and preprocessed input images:
Next, we introduce input frame stacking and frame composition into our preprocessing pipeline, two techniques introduced by Deepmind in 2015. These methods serve to provide a temporal and motion reference for our inputs, respectively.
We apply frame composition by by taking two of our input frames, and returning an element-wise maximum summation maxframe of the two. These composed frames are then stored in a deque or “stack”, which automatically removes older entries as new entries are introduced.
stack_size = 4 # We stack 4 composite frames in total stacked_frames = deque([np.zeros((60,80), dtype=np.int) for i in range(stack_size)], maxlen=4)def stack_frames(stacked_frames, state, is_new_episode): # Preprocess frame frame = preprocess_observation(state) if is_new_episode: # Clear our stacked_frames stacked_frames = deque([np.zeros((60,80), dtype=np.int) for i in range(stack_size)], maxlen=4) # Because we’re in a new episode, copy the same frame 4x, apply elementwise maxima maxframe = np.maximum(frame,frame) stacked_frames.append(maxframe) stacked_frames.append(maxframe) stacked_frames.append(maxframe) stacked_frames.append(maxframe) # Stack the frames stacked_state = np.stack(stacked_frames, axis=2) else: #Since deque append adds t right, we can fetch rightmost element maxframe=np.maximum(stacked_frames[-1],frame) # Append frame to deque, automatically removes the oldest frame stacked_frames.append(maxframe) # Build the stacked state (first dimension specifies different frames) stacked_state = np.stack(stacked_frames, axis=2) return stacked_state, stacked_frames
Next, let’s define our model, a deep Q-network. This is essentially a three layer convolutional network that takes preprocessed input obserations, with the generated flattened output fed to a fully-connected layer, generating probabilities of taking each action in the game space as an output. Note there are no activation layers here, as the presence of one would result in a binary output distribution.
tf.compat.v1.reset_default_graph() def q_network(X, name_scope): # Initialize layers initializer = tf.compat.v1.keras.initializers.VarianceScaling(scale=2.0) with tf.compat.v1.variable_scope(name_scope) as scope: # initialize the convolutional layers layer_1 = conv2d(X, num_outputs=32, kernel_size=(8,8), stride=4, padding=’SAME’, weights_initializer=initializer) tf.compat.v1.summary.histogram(‘layer_1’,layer_1) layer_2 = conv2d(layer_1, num_outputs=64, kernel_size=(4,4), stride=2, padding=’SAME’, weights_initializer=initializer) tf.compat.v1.summary.histogram(‘layer_2’,layer_2) layer_3 = conv2d(layer_2, num_outputs=64, kernel_size=(3,3), stride=1, padding=’SAME’, weights_initializer=initializer) tf.compat.v1.summary.histogram(‘layer_3’,layer_3) # Flatten the result of layer_3 before feeding to the fully connected layer flat = flatten(layer_3) # Insert fully connected layer fc = fully_connected(flat, num_outputs=128, weights_initializer=initializer) tf.compat.v1.summary.histogram(‘fc’,fc) #Add final output layer output = fully_connected(fc, num_outputs=n_outputs, activation_fn=None, weights_initializer=initializer) tf.compat.v1.summary.histogram(‘output’,output) # Vars will store the parameters of the network such as weights vars = {v.name[len(scope.name):]: v for v in tf.compat.v1.get_collection(key=tf.compat.v1.GraphKeys.TRAINABLE_VARIABLES, scope=scope.name)} #Return both variables and outputs together return vars, output
Let’s also take this chance to define our hyperparameters for our model and training process. Note that the X_shape is now (None, 60, 80, 4) , on account of our stacked frames.
num_episodes = 1000 batch_size = 48input_shape = (None, 60, 80, 1)learning_rate = 0.002 #Modified for composite stacked frames X_shape = (None, 60, 80, 4) discount_factor = 0.99global_step = 0 copy_steps = 100 steps_train = 4 start_steps = 2000
Recall, that Q-learning requires us to select actions with the highest action-values. To ensure that we still visit every single possible state-action combination, we’ll have our agent follow a decaying epsilon-greedy policy, with an exploration rate of 5%.
epsilon = 0.5 eps_min = 0.05 eps_max = 1.0 eps_decay_steps = 500000def epsilon_greedy(action, step): p = np.random.random(1).squeeze() #1D entries returned using squeeze epsilon = max(eps_min, eps_max — (eps_max-eps_min) * step/eps_decay_steps) #Decaying policy with more steps if p< epsilon: return np.random.randint(n_outputs) else: return action
Recall from the equations above, that the update function for Q-learning requires the following:
- The current state s
- The current action a
- The reward following the current action r
- The next state s’
- The next action a’
To supply these parameters in meaningful quantities, we need to evaluate our current policy following a set of parameters and store all of the variables in a buffer, from which we’ll draw data in minibatches during training. Let’s go ahead and create our buffer and a simple sampling function:
buffer_len = 20000 exp_buffer = deque(maxlen=buffer_len)def sample_memories(batch_size): perm_batch = np.random.permutation(len(exp_buffer))[:batch_size] mem = np.array(exp_buffer)[perm_batch] return mem[:,0], mem[:,1], mem[:,2], mem[:,3], mem[:,4]
Next, let’s copy the weight parameters of our original network into a target network. This dual-network approach allows us to generate data during the training process using an existing policy while still optimizing our parameters for the next policy iteration, reducing loss oscillations.
# we build our Q network, which takes the input X and generates Q values for all the actions in the state mainQ, mainQ_outputs = q_network(X, ‘mainQ’)# similarly we build our target Q network, for policy evaluation targetQ, targetQ_outputs = q_network(X, ‘targetQ’)copy_op = [tf.compat.v1.assign(main_name, targetQ[var_name]) for var_name, main_name in mainQ.items()] copy_target_to_main = tf.group(*copy_op)
Finally, we’ll also define our loss. This is simply the squared difference of our target action (with the highest action value) and our predicted action. We’ll use an ADAM optimizer to minimize our loss during training.
# define a placeholder for our output i.e action y = tf.compat.v1.placeholder(tf.float32, shape=(None,1))# now we calculate the loss which is the difference between actual value and predicted value loss = tf.reduce_mean(input_tensor=tf.square(y — Q_action))# we use adam optimizer for minimizing the loss optimizer = tf.compat.v1.train.AdamOptimizer(learning_rate) training_op = optimizer.minimize(loss)init = tf.compat.v1.global_variables_initializer() loss_summary = tf.compat.v1.summary.scalar(‘LOSS’, loss) merge_summary = tf.compat.v1.summary.merge_all() file_writer = tf.compat.v1.summary.FileWriter(logdir, tf.compat.v1.get_default_graph())
With all of our code defined, let’s run our network and go over the training process. We’ve defined most of this in the initial summary, but let’s recall for posterity.
- For each epoch, we feed an input image stack into our network to generate a probability distribution of the available actions, before using an epsilon-greedy policy to select the next action
- We then input this into the network, and obtain information on the next state and accompanying rewards, and store this into our buffer. We update our stack and repeat this process over a number of pre-defined steps.
- After our buffer is large enough, we feed the next states into our network in order to obtain the next action. We also calculate the next reward by discounting the current one
- We generate our target y-values through the Q-learning update function, and train our network.
- By minimizing the training loss, we update the network weight parameters to output improved state-action values for the next policy.
with tf.compat.v1.Session() as sess: init.run() # for each episode history = [] for i in range(num_episodes): done = False obs = env.reset() epoch = 0 episodic_reward = 0 actions_counter = Counter() episodic_loss = [] # First step, preprocess + initialize stack obs,stacked_frames= stack_frames(stacked_frames,obs,True) # while the state is not the terminal state while not done: #Data generation using the untrained network # feed the game screen and get the Q values for each action actions = mainQ_outputs.eval(feed_dict={X:[obs], in_training_mode:False}) # get the action action = np.argmax(actions, axis=-1) actions_counter[str(action)] += 1 # select the action using epsilon greedy policy action = epsilon_greedy(action, global_step) # now perform the action and move to the next state, next_obs, receive reward next_obs, reward, done, _ = env.step(action) #Updated stacked frames with new episode next_obs, stacked_frames = stack_frames(stacked_frames, next_obs, False) # Store this transition as an experience in the replay buffer! exp_buffer.append([obs, action, next_obs, reward, done]) # After certain steps, we train our Q network with samples from the experience replay buffer if global_step % steps_train == 0 and global_step > start_steps: o_obs, o_act, o_next_obs, o_rew, o_done = sample_memories(batch_size) # states o_obs = [x for x in o_obs] # next states o_next_obs = [x for x in o_next_obs] # next actions next_act = mainQ_outputs.eval(feed_dict={X:o_next_obs, in_training_mode:False}) # discounted reward: these are our Y-values y_batch = o_rew + discount_factor * np.max(next_act, axis=-1) * (1-o_done) # merge all summaries and write to the file mrg_summary = merge_summary.eval(feed_dict={X:o_obs, y:np.expand_dims(y_batch, axis=-1), X_action:o_act, in_training_mode:False}) file_writer.add_summary(mrg_summary, global_step) # To calculate the loss, we run the previously defined functions mentioned while feeding inputs train_loss, _ = sess.run([loss, training_op], feed_dict={X:o_obs, y:np.expand_dims(y_batch, axis=-1), X_action:o_act, in_training_mode:True}) episodic_loss.append(train_loss) # after some interval we copy our main Q network weights to target Q network if (global_step+1) % copy_steps == 0 and global_step > start_steps: copy_target_to_main.run() obs = next_obs epoch += 1 global_step += 1 episodic_reward += reward next_obs=np.zeros(obs.shape) exp_buffer.append([obs, action, next_obs, reward, done]) obs= env.reset() obs,stacked_frames= stack_frames(stacked_frames,obs,True) history.append(episodic_reward) print(‘Epochs per episode:’, epoch, ‘Episode Reward:’, episodic_reward,”Episode number:”, len(history))
Once training is complete, we can plot the reward distribution against incremental episodes. The first 1000 episodes are shown below:
This result is acceptable, with the beginnings of improvement being visible, given how the original Vizdoom paper suggests that hundreds of thousands of episodes would be needed for a more significant improvement to be observed.
img_array=[] with tf.compat.v1.Session() as sess: init.run() observation, stacked_frames = stack_frames(stacked_frames, observation, True) while True:# feed the game screen and get the Q values for each action actions = mainQ_outputs.eval(feed_dict={X:[observation], in_training_mode:False})# get the action action = np.argmax(actions, axis=-1) actions_counter[str(action)] += 1# select the action using epsilon greedy policy action = epsilon_greedy(action, global_step) environment.render() new_observation, stacked_frames = stack_frames(stacked_frames, new_observation, False) observation = new_observation # now perform the action and move to the next state, next_obs, receive reward new_observation, reward, done, _ = environment.step(action) img_array.append(new_observation) if done: #observation = env.reset() break environment.close()
Finally, we can take our list of frames and feed it to the scikit-video library to generate a video sequence output for inspection:
from random import choice import cv2 from google.colab.patches import cv2_imshowimport numpy as np import skvideo.ioout_video = np.empty([len(img_array), 240, 320, 3], dtype = np.uint8) out_video = out_video.astype(np.uint8)for i in range(len(img_array)): frame = img_array[i] out_video[i] = frame # Writes the the output image sequences in a video file skvideo.io.vwrite(“/content/doom.mp4”, out_video)
Let’s view our agent in action!
Note how the agent pauses when it spots a health pack, before proceeding to move towards it. Just for fun, we also trained an agent for the basic scenario, where the objective is to hit the monster as soon as possible. While we achieved a best time of ca. 1.3 seconds, we’ll show one of the earlier episodes below.
That wraps up this implementation on Q-learning. In our next article, we’ll move on to tackling more complex Doom scenarios with more advanced Q-learning approaches, and switch our code over to Pytorch, one of the most popular languages in AI research.
We hope you enjoyed this article, and hope you check out the many other articles on GradientCrescent, covering applied and theoretical aspects of AI. To stay up to date with the latest updates on GradientCrescent , please consider following the publication and following our Github repository.
References
Sutton et. al, Reinforcement Learning
White et. al, Fundamentals of Reinforcement Learning, University of Alberta
Silva et. al, Reinforcement Learning, UCL
Makarov et. al, Deep Reinforcement Learning with VizDoom First-Person Shooter, HSE
Kempka et. al, ViZDoom: A Doom-based AI Research Platform for Visual Reinforcement Learning, PUT
Ravichandiran et. al, Hands-On Reinforcement Learning with Python
以上就是本文的全部内容,希望本文的内容对大家的学习或者工作能带来一定的帮助,也希望大家多多支持 码农网
猜你喜欢:本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们。