Reinforcement Learning with TensorFlow Agents — Tutorial
Try TF-Agents for RL with this simple tutorial, published as a Google colab notebook so you can run it directly from your browser.
Jul 1 ·7min read
Some weeks ago, I wrote an article naming different frameworks you can use to implement Reinforcement Learning (RL) in your projects, showing the ups and downs of each of them and wondering if any of them would rule them all at some point. Since then, I’ve come to know TF-Agents, a library for RL based on TensorFlow and with the full support of its community (note that TF-Agents is not an official Google product, but it is published as a repository from the official TensorFlow account on GitHub).
I am currently using TF-Agents on a project and it has been easy to get started with, thanks to its good documentation, including tutorials. It is updated regularly and has lots of contributors, which makes me think we may well see TF-Agents become the standard framework for implementing RL in the near future. Because of this, I’ve decided to write this article to give you a quick introduction, so you can also benefit from this library. I have published all the code used here as a Google Colab notebook, so you can easily run it online.
You can find the GitHub repository with all the code and documentation for TF-Agents here. You won’t need to clone the repository, but it’s always useful to have the official GitHub for reference. The following example partially follows one of their tutorials (1_dqn_tutorial), but I have simplified it further and applied it to CartPole in this article. Let’s get hands on.
Installing TF Agents and Dependencies
As already said, TF-Agents runs on TensorFlow, more specifically TensorFlow 2.2.0. In addition you will need to install the following packages if you don’t have them already:
pip install tensorflow==2.2.0
pip install tf-agents
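To confirm the installation before running the rest of the code, a quick check like the one below may help. It is only a sketch: the helper name `installed_version` is made up for this illustration, and it relies on the standard library's `importlib.metadata` (Python 3.8+) rather than on anything from TF-Agents.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing.

    `installed_version` is a hypothetical helper, not part of TF-Agents.
    """
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Report what is installed without failing when a package is absent.
for name in ("tensorflow", "tf-agents"):
    version = installed_version(name)
    print(name, "->", version if version is not None else "not installed")
```

If either line reports "not installed", rerun the corresponding pip command above.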
Implementing a DQN Agent for CartPole
We will implement a DQN agent (Mnih et al. 2015) and use it for CartPole, a classic control problem. If you would like to solve something more exciting like, say, an Atari game, you just need to replace the environment name with the one you want, choosing it from all the available OpenAI environments.
We start by doing all of the necessary imports. As you can see below, we import quite a few objects from TF-Agents. These are all things we can customize and switch for our implementation.
from __future__ import absolute_import, division, print_function

import base64
import IPython
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

from tf_agents.agents.dqn import dqn_agent
from tf_agents.drivers import dynamic_step_driver
from tf_agents.environments import suite_gym
from tf_agents.environments import tf_py_environment
from tf_agents.eval import metric_utils
from tf_agents.metrics import tf_metrics
from tf_agents.networks import q_network
from tf_agents.replay_buffers import tf_uniform_replay_buffer
from tf_agents.trajectories import trajectory
from tf_agents.utils import common
Environment
Now, we move on to creating our environment. In CartPole, we have a cart with a pole on top of it, and the agent’s mission is to learn to keep the pole upright by moving the cart left and right. Note that we will use an environment from suite_gym, already included in TF-Agents, which is a slightly customized version of the OpenAI Gym environments, adapted for use with TF-Agents (if you’re interested, you can check the differences with OpenAI’s implementation here). We will also use a wrapper for our environment called TFPyEnvironment, which converts the numpy arrays used for state observations, actions and rewards into TensorFlow tensors. TensorFlow models (i.e., neural networks) work with tensors, so this wrapper saves us the effort of converting the data ourselves.
env = suite_gym.load('CartPole-v1')
env = tf_py_environment.TFPyEnvironment(env)
Agent
There are different agents in TF-Agents we can use: DQN, REINFORCE, DDPG, TD3, PPO and SAC. We will use DQN, as said above. One of the main parameters of the agent is its Q (neural) network, which will be used to calculate the Q-values for the actions in each step. A q_network has two compulsory parameters, input_tensor_spec and action_spec, which define the observation shape and the action shape. We can get both from our environment, so we will define our q_network as follows:
q_net = q_network.QNetwork(env.observation_spec(), env.action_spec())
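To make the role of the Q-values concrete: once the network has produced one value per action, a greedy policy simply picks the action with the largest value, while an epsilon-greedy policy occasionally picks a random action in order to explore. The sketch below uses plain Python lists instead of a real QNetwork; the function names and the Q-values are made up for illustration.

```python
import random


def greedy_action(q_values):
    """Index of the action with the highest Q-value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def epsilon_greedy_action(q_values, epsilon, rng=random):
    """With probability epsilon explore at random, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return greedy_action(q_values)


# Hypothetical Q-values for CartPole's two actions (0 = push left, 1 = push right).
q_values = [0.37, 0.52]
print(greedy_action(q_values))
print(epsilon_greedy_action(q_values, epsilon=0.1))
```

The second call usually agrees with the first, but about one time in ten it draws a random action instead.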
There are many more parameters we can customize for our q_network, as you can see here, but for now we will go with the default ones. The agent also requires an optimizer to find the values of the q_network parameters. Let’s keep it classic and use Adam.
optimizer = tf.compat.v1.train.AdamOptimizer(learning_rate=0.001)
Finally, we define and initialize our agent with the following parameters:
- The time_step_spec, which we get from our environment and which defines the structure of our time steps.
- The action_spec, same as for the q_network.
- The Q network we created before.
- The optimizer we have also created before.
- The TD error loss function, which plays the same role as the loss function when training any other neural network.
- The train step counter, which is just a rank 0 tensor (a.k.a. scalar) that counts the number of training steps we apply to the agent.
train_step_counter = tf.Variable(0)

agent = dqn_agent.DqnAgent(
    env.time_step_spec(),
    env.action_spec(),
    q_network=q_net,
    optimizer=optimizer,
    td_errors_loss_fn=common.element_wise_squared_loss,
    train_step_counter=train_step_counter)

agent.initialize()
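For intuition on the TD error loss, here is a hand-computed sketch for a single transition. It is not how DqnAgent is implemented internally; it only illustrates the idea of comparing the current estimate Q(s, a) with the one-step target r + gamma * max Q(s', a') and squaring the difference. All numbers, the discount value 0.99 and the function names are illustrative assumptions.

```python
def td_target(reward, next_q_values, gamma, is_terminal):
    """One-step target: reward plus the discounted best next Q-value."""
    if is_terminal:
        # No future reward can follow a terminal state.
        return reward
    return reward + gamma * max(next_q_values)


def squared_td_loss(q_value, target):
    """Squared difference between the current estimate and the target."""
    return (target - q_value) ** 2


# Illustrative numbers only: reward 1, next-state Q-values [2, 3], current estimate 3.5.
target = td_target(reward=1.0, next_q_values=[2.0, 3.0], gamma=0.99, is_terminal=False)
loss = squared_td_loss(q_value=3.5, target=target)
print(target, loss)
```

Training nudges the network so that its estimate moves towards the target, which shrinks this loss over time.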
Helper Methods: Average Cumulative Return and Collecting Data
We will also need some helper methods. The first one will iterate over the environment for a number of episodes, applying the policy to choose what actions to follow, and return the average cumulative reward over these episodes. This will come in handy to evaluate the policy learned by our agent. Below, we also try the method in our environment for 5 episodes.
def compute_avg_return(environment, policy, num_episodes=10):
    total_return = 0.0
    for _ in range(num_episodes):
        time_step = environment.reset()
        episode_return = 0.0
        while not time_step.is_last():
            action_step = policy.action(time_step)
            time_step = environment.step(action_step.action)
            episode_return += time_step.reward
        total_return += episode_return
    avg_return = total_return / num_episodes
    return avg_return.numpy()[0]

# Evaluate the agent's policy once before training.
avg_return = compute_avg_return(env, agent.policy, 5)
returns = [avg_return]
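The same evaluation logic can be tried without TensorFlow. The sketch below swaps in a made-up environment, `CountdownEnv`, whose episodes always last a fixed number of steps with a reward of 1 per step, so the average return should equal the episode length. Both the class and the `average_return` function are hypothetical stand-ins, not TF-Agents objects.

```python
class CountdownEnv:
    """Hypothetical toy environment: each episode lasts `length` steps, reward 1 per step."""

    def __init__(self, length):
        self.length = length
        self.remaining = length

    def reset(self):
        self.remaining = self.length
        return self.remaining  # the observation is just the number of steps left

    def step(self, action):
        self.remaining -= 1
        reward = 1.0
        done = self.remaining == 0
        return self.remaining, reward, done


def average_return(env, policy, num_episodes):
    """Average of the summed rewards over several episodes."""
    total_return = 0.0
    for _ in range(num_episodes):
        observation = env.reset()
        done = False
        episode_return = 0.0
        while not done:
            observation, reward, done = env.step(policy(observation))
            episode_return += reward
        total_return += episode_return
    return total_return / num_episodes


# The policy here ignores the observation and always returns action 0.
print(average_return(CountdownEnv(length=7), policy=lambda obs: 0, num_episodes=5))
```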
We will also implement a method to collect data when training our agent. One of the breakthroughs of DQN was experience replay, in which we store the experiences of the agent (state, action, reward) and use them to train the Q network in batches in each step. This improves the learning by making it faster and more stable. To do this, TF-Agents includes the object TFUniformReplayBuffer, which stores these experiences so they can be re-used later, so we first create this object, which we will need later on.
The method takes an environment, a policy and a buffer. It reads the current time_step (the state observation and reward at that point), asks the policy for an action, and steps the environment to obtain the next time_step. Then, it stores all of this in the replay buffer. Note that the replay buffer stores objects called Trajectory, so we build one from the elements named above and then save it to the buffer using the method add_batch.
replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=agent.collect_data_spec,
    batch_size=env.batch_size,
    max_length=100000)

def collect_step(environment, policy, buffer):
    time_step = environment.current_time_step()
    action_step = policy.action(time_step)
    next_time_step = environment.step(action_step.action)
    traj = trajectory.from_transition(time_step, action_step, next_time_step)

    # Add trajectory to the replay buffer
    buffer.add_batch(traj)
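If the idea of a replay buffer is new to you, the following plain-Python sketch may help. It is a deliberately minimal stand-in, not the TFUniformReplayBuffer implementation: the class name, its methods and the choice to sample with replacement are all assumptions made for illustration.

```python
import random
from collections import deque


class UniformReplayBuffer:
    """Hypothetical minimal buffer: fixed capacity, uniform random sampling."""

    def __init__(self, max_length):
        # A deque with maxlen silently drops the oldest item once it is full.
        self._storage = deque(maxlen=max_length)

    def add(self, transition):
        self._storage.append(transition)

    def sample(self, batch_size, rng=random):
        # Sampling with replacement, so a batch can be drawn even from a small buffer.
        return [rng.choice(self._storage) for _ in range(batch_size)]

    def __len__(self):
        return len(self._storage)


buffer = UniformReplayBuffer(max_length=3)
for step in range(5):
    # Each transition is (state, action, reward, next_state); values are placeholders.
    buffer.add((step, 0, 1.0, step + 1))

print(len(buffer))
print(buffer.sample(batch_size=2))
```

Because the capacity is 3, only the three most recent transitions survive, which mirrors the role of max_length in the real buffer.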
Train Agent
We can finally train our agent. We define the number of steps we will collect in every iteration; after collecting them, we train our agent, modifying its policy. For now, let’s just use 1 step per iteration. We also define the batch size with which our Q network will be trained and an iterator so we can iterate over the experience of the agent.
Then, we will just gather some experience for our buffer and start with the common RL loop. Get experience by acting on the environment, train policy and repeat. We additionally print the loss and evaluate the performance of the agent every 200 and 1000 steps respectively.
collect_steps_per_iteration = 1
batch_size = 64
dataset = replay_buffer.as_dataset(
    num_parallel_calls=3,
    sample_batch_size=batch_size,
    num_steps=2).prefetch(3)
iterator = iter(dataset)
num_iterations = 20000
env.reset()

for _ in range(batch_size):
    collect_step(env, agent.policy, replay_buffer)

for _ in range(num_iterations):
    # Collect a few steps using collect_policy and save to the replay buffer.
    for _ in range(collect_steps_per_iteration):
        collect_step(env, agent.collect_policy, replay_buffer)

    # Sample a batch of data from the buffer and update the agent's network.
    experience, unused_info = next(iterator)
    train_loss = agent.train(experience).loss

    step = agent.train_step_counter.numpy()

    # Print loss every 200 steps.
    if step % 200 == 0:
        print('step = {0}: loss = {1}'.format(step, train_loss))

    # Evaluate agent's performance every 1000 steps.
    if step % 1000 == 0:
        avg_return = compute_avg_return(env, agent.policy, 5)
        print('step = {0}: Average Return = {1}'.format(step, avg_return))
        returns.append(avg_return)
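The num_steps=2 argument is worth a note. As I understand it, each sampled element contains two consecutive time steps, so the agent sees both the current observation and the next one when computing its loss. The plain-Python sketch below illustrates the windowing idea only; the function name is made up, and the real buffer samples such windows at random rather than enumerating them all.

```python
def adjacent_windows(steps, num_steps=2):
    """All runs of `num_steps` consecutive items taken from `steps`."""
    return [steps[i:i + num_steps] for i in range(len(steps) - num_steps + 1)]


# Placeholder labels standing in for four consecutive time steps.
episode = ["s0", "s1", "s2", "s3"]
for window in adjacent_windows(episode):
    print(window)
```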
Plot
We can now plot how the cumulative average reward varies as we train the agent. For this, we will use matplotlib to make a very simple plot.
iterations = range(0, num_iterations + 1, 1000)
plt.plot(iterations, returns)
plt.ylabel('Average Return')
plt.xlabel('Iterations')
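A quick sanity check on the plot: matplotlib needs the x and y sequences to have the same length, so the number of evaluations has to match the number of points in the range. The arithmetic below assumes the values used in this article (20000 iterations, one evaluation before training and one every 1000 training steps); the variable names `eval_interval` and `expected_evaluations` are introduced here for illustration only.

```python
num_iterations = 20000
eval_interval = 1000  # matches the `step % 1000 == 0` check in the training loop

# One evaluation before training, then one after every `eval_interval` training steps.
expected_evaluations = 1 + num_iterations // eval_interval
iterations = range(0, num_iterations + 1, eval_interval)

print(expected_evaluations, len(iterations))
```

If you change num_iterations or the evaluation interval, keep the step of the range in sync, otherwise the plot call will fail with a shape mismatch.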
Complete Code
I have shared all the code in this article as a Google Colab notebook. You can run all the code directly as it is; if you would like to change it, you have to save a copy to your own Google Drive account, and then you can do whatever you like. You can also download it to run it locally on your computer if you wish.
Where to go from here
- You can follow the tutorials included in the repository of TF-Agents on Github
- If you would like to check other nice frameworks for RL, you can see my previous post here:
- You can also check other environments in which to try TF-Agents (or any RL algorithm of your choice) in this other article I wrote some time ago.
As usual, thank you for reading! Let me know in the responses what you think about TF-Agents, and also if you have any questions or have found any bugs in the code.