Deep reinforcement learning on GCP: using hyperparameters and Cloud ML Engine to best OpenAI Gym games


Source: Deep reinforcement learning on GCP: using hyperparameters and Cloud ML Engine to best OpenAI Gym games from Google Cloud

Like many other areas of machine learning research, reinforcement learning (RL) is evolving at breakneck speed. Just as they have done in other research areas, researchers are leveraging deep learning to achieve state-of-the-art results. In particular, reinforcement learning has significantly outperformed prior ML techniques in game playing, reaching human-level and even world-best performance on Atari games, beating the human Go champion, and showing promising results in more difficult games like StarCraft II. It has also helped advance fields like self-driving cars, recommendation systems, bidding, and even machine learning model development.

Training deep RL agents can be expensive in terms of both computing resources and time. It is common for an algorithm to take tens of thousands, or even millions, of simulation/training steps before the accumulated reward starts to rise, and it can take many more steps beyond that for the algorithm to converge. For example, a paper from DeepMind indicates that it took approximately 20 epochs (1 epoch = 0.5 hours, according to the paper) before the agent showed clear signs of learning. The entire process took 100 epochs, and that was only a single trial!

Furthermore, RL problems are notoriously sensitive to hyperparameters, which means it’s necessary to evaluate many different hyperparameters.

Using these factors as motivation, in this post, we show how to train many jobs in parallel using Google Cloud’s hyperparameter tuning service with ML Engine. This service gives you two key benefits:

  • You can train many models in parallel, which allows you to iterate on your concept quickly, and you only pay for the compute resources you use in each job.
  • You benefit from the managed hyper-parameter tuning service, which typically results in quicker convergence than just using a naïve grid search.

While these properties are beneficial for all machine learning problems, they are particularly beneficial for reinforcement learning problems where the compute is intensive and it’s often necessary to run many trials, evaluating different hyperparameters.

All of the code for this post is located in the GitHub repository here and here . The code provides examples for using RL on GCP; it could be refined for better results.

Reinforcement learning 101

Reinforcement learning (RL) is a form of machine learning in which an agent takes actions in an environment to maximize a given objective (a reward) over a sequence of steps. Unlike more traditional supervised learning techniques, the data points are not labelled and the agent only has access to “sparse” rewards. While the history of RL dates back to the 1950s and there are many RL algorithms out there, two easy-to-implement yet powerful deep RL algorithms have attracted a lot of attention recently: deep Q-network (DQN) and deep deterministic policy gradient (DDPG). We briefly introduce these algorithms and variants based on them in this section.


A conceptual process diagram of the Reinforcement Learning problem

What is a Deep Q-network?

The Deep Q-network (DQN) was introduced by Google DeepMind’s group in this Nature paper in 2015. Encouraged by the success of deep learning in the field of image recognition, the authors incorporated deep neural networks into Q-learning and tested their algorithm in the Atari Game Engine Simulator, in which the dimension of the observation space is very large. The deep neural network acts as a function approximator that predicts the output Q-values, or the desirability of taking an action, given a certain input state. Accordingly, DQN is a value-based method: during training, DQN updates Q-values according to Bellman’s equation, and to avoid the difficulty of fitting a moving target, it employs a second deep neural network that estimates the target values. This target network’s weights are updated only at a specified interval, which helps stabilize training. Since RL collects data by interacting with the environment continuously, the data is time-correlated, which can prevent the DQN from converging during training. To address this problem, the authors introduced a replay buffer that holds previously collected data, from which mini-batches are sampled at each training iteration. For more details on the techniques the DeepMind group used to build this successful strategy, take a look at this paper.
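As a rough sketch of the value update described above (assuming a mini-batch sampled from the replay buffer and a separate, slowly updated target network; the function and field names below are illustrative):

```python
import numpy as np

def dqn_targets(batch, q_target_model, gamma=0.99):
    """Compute Bellman targets y = r + gamma * max_a' Q_target(s', a')
    for a mini-batch sampled from the replay buffer."""
    rewards = batch['rewards']            # shape: [batch]
    next_states = batch['next_states']    # shape: [batch, obs_dim]
    dones = batch['dones']                # 1.0 for terminal transitions
    # The target network (not the online network) scores the next states.
    next_q = q_target_model.predict(next_states)   # shape: [batch, n_actions]
    max_next_q = np.max(next_q, axis=1)
    # Do not bootstrap past terminal states.
    return rewards + gamma * (1.0 - dones) * max_next_q
```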

DDPG, TD3 and C2A2

Like DQN, DDPG is a value-based RL method. To be concrete, it is an actor-critic method, where the actor is responsible for choosing actions given a state from the environment and the critic estimates the value of a given state (or state-action pair). As in DQN, deep neural networks serve as the function approximators for both the actor and the critic.

One major problem with DDPG is that the critic tends to overestimate values during learning (this is actually a problem observed in most value-based RL methods, including DQN). To address this, Twin Delayed Deep Deterministic policy gradient (TD3) maintains two critics and, for any given (state, action) pair, takes the minimum of their two estimates as the value (a sketch of this idea follows below). Beyond this, TD3 also delays the updates of its target networks to limit the growth of estimation error, and applies target policy smoothing regularization to prevent overfitting to narrow peaks in the value estimate. You can learn more about TD3 in this paper.
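A minimal sketch of this clipped double-Q target (the function and argument names are assumptions for illustration, not the code from our repository):

```python
import tensorflow as tf

def td3_target(reward, done, next_state, actor_target,
               critic_1_target, critic_2_target, gamma=0.99):
    """Clipped double-Q target: evaluate the target actor's action with
    both target critics and bootstrap from the smaller estimate."""
    next_action = actor_target(next_state)  # target policy smoothing noise could be added here
    q1 = critic_1_target([next_state, next_action])
    q2 = critic_2_target([next_state, next_action])
    return reward + gamma * (1.0 - done) * tf.minimum(q1, q2)
```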

TD3 works great and is easy to implement. But it would be better (and fun!) to have options over actions when the agent thinks the estimated value is low. We therefore added one more actor to TD3 and dub the algorithm C2A2 (for an apparent reason). C2A2 estimates values the same way TD3 does, but the extra actor gives the agent a choice between candidate actions, and it picks the one with the larger estimated value (sketched below). Notably, creating a new algorithm to run on Cloud ML Engine requires no platform-specific code modifications.
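A sketch of how the second actor can be used at action-selection time (again illustrative; see the repository for the actual C2A2 implementation):

```python
import tensorflow as tf

def c2a2_select_action(state, actor_1, actor_2, critic):
    """Ask both actors for a candidate action and keep the one that the
    critic values more highly."""
    a1, a2 = actor_1(state), actor_2(state)
    q1, q2 = critic([state, a1]), critic([state, a2])
    better = tf.squeeze(q1 >= q2, axis=-1)  # boolean per batch element
    return tf.where(better, a1, a2)
```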

Deep RL in the OpenAI gym environment

OpenAI Gym is a popular platform for trying out and evaluating RL algorithms. Using OpenAI Gym, we can initialize an environment for our agent to interact with:
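For instance, a minimal sketch using the classic CartPole environment:

```python
import gym

# Create the environment and get the initial observation.
env = gym.make('CartPole-v0')
observation = env.reset()
```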

Assuming we have already trained our agent, it will select an action using the observations from the environment as inputs:
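For example, a rollout loop might look like the following (`agent.select_action` is a stand-in for whatever trained policy object you have):

```python
done = False
total_reward = 0.0
while not done:
    # The policy maps an observation to an action in env.action_space.
    action = agent.select_action(observation)
    observation, reward, done, info = env.step(action)
    total_reward += reward
```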

Steps to run RL on Cloud ML Engine

Running a hyperparameter tuning job on Google Cloud is straightforward. Here are the 5 steps you’ll need to take:

  1. Build your model as you normally would.

  2. Establish your hyperparameters as command-line arguments, thereby decoupling your hyperparameters from your model's logic.

  3. Create a Python package (i.e., add an __init__.py and a setup.py).

  4. Write your hyperparameter.yaml file.

  5. Submit your first training job from the command line using gcloud ml-engine jobs submit training.

In practice, you will typically spend most of your time on step 1, whereas you can start with boilerplate code for the rest of the process. In our setup, we have implemented a simple policy gradients agent and a DQN agent compatible with OpenAI's Gym library, as well as TD3 for continuous control problems. You can check out the implementations here and here.


Submitting a hyperparameter tuning job to ML Engine. The user must expose hyperparameters as command-line arguments and provide a YAML file via the --config argument.

Hyper-Parameter Tuning: Preparing your code

Once you've written your reinforcement learning agent and defined the task that you want to solve, you need to set up your hyperparameters as command-line arguments. This is crucial because it decouples the hyperparameters from one another, allowing us to evaluate the effect of each hyperparameter independently. For example, using the argparse library, hyperparameters can be set as command-line arguments:
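A minimal sketch (the flag names and defaults below are illustrative):

```python
import argparse

parser = argparse.ArgumentParser()
# Hyperparameters exposed on the command line.
parser.add_argument('--learning_rate', type=float, default=1e-3)
parser.add_argument('--batch_size', type=int, default=64)
parser.add_argument('--gamma', type=float, default=0.99)
# ML Engine passes the output location through --job-dir.
parser.add_argument('--job-dir', default='/tmp/rl-output')
args, _ = parser.parse_known_args()
```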

For hyperparameter tuning, there are two important additions to make to the model: 1) specify the metric we want to optimize, and 2) give each model output a unique name.

In Keras, you can specify the metric you want to optimize using the metrics argument of a Model object:
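For instance (a toy model purely to illustrate the metrics argument; the architecture, optimizer, and loss are placeholders):

```python
from tensorflow import keras

# A toy Q-network; the hyperparameter values flow in from the argparse flags above.
model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(4,)),
    keras.layers.Dense(2)
])
model.compile(optimizer=keras.optimizers.Adam(lr=args.learning_rate),
              loss='mse',
              metrics=['mae'])  # built-in metrics can be passed by name
```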

You may want to add a custom metric: in our case, we’ll call it the reward. This can be accomplished like so:
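A sketch of the mechanism, reusing the toy model compiled above (in a real RL loop the reward comes from the environment rather than from y_true/y_pred, so this stub only shows how a metric named "reward" gets registered and reported):

```python
import tensorflow as tf

def reward(y_true, y_pred):
    # Placeholder computation: in practice you would feed the episode reward
    # in through the targets or a callback; Keras will report whatever this
    # function returns under the name "reward".
    return tf.reduce_mean(y_true - y_pred)

model.compile(optimizer='adam', loss='mse', metrics=[reward])
```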

Alternatively, if you are using TensorFlow, you simply need to log your metric as you would for TensorBoard:
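For example, in TF 1.x style (the variable names and values below are assumptions):

```python
import tensorflow as tf

output_dir = 'gs://your-bucket/run0'   # hypothetical output location
episode, episode_reward = 1, 200.0     # e.g., total reward of the latest episode

# Write the value under the tag "reward" so that both TensorBoard and the
# hyperparameter tuning service can read it.
summary_writer = tf.summary.FileWriter(output_dir)
summary = tf.Summary(value=[tf.Summary.Value(tag='reward',
                                             simple_value=episode_reward)])
summary_writer.add_summary(summary, global_step=episode)
summary_writer.flush()
```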

Below, we will tell ML Engine to optimize over this reward parameter in a YAML file.

Secondly, it is important to have a unique output directory for each trial: this prevents trials from accidentally overwriting one another. You can achieve this automatically by using the --job-dir command-line parameter as the output directory. Alternatively, you can manually retrieve the trial number via the TF_CONFIG environment variable:
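A sketch of that lookup (assuming args.job_dir from the argparse example above):

```python
import json
import os

# The tuning service injects the trial number into TF_CONFIG; append it to
# the base output path so that each trial writes to its own directory.
tf_config = json.loads(os.environ.get('TF_CONFIG', '{}'))
trial_id = tf_config.get('task', {}).get('trial', '')
output_dir = os.path.join(args.job_dir, trial_id)
```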

Additionally, we need to package our code as a Python package. You can read more about this process here, but essentially it means adding an empty __init__.py file and a setup.py to our folder structure. Here are the latest modifications to our code:
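A minimal setup.py might look like the following (the package name and dependency list are placeholders):

```python
from setuptools import find_packages, setup

setup(
    name='rl_on_gcp',            # hypothetical package name
    version='0.1',
    packages=find_packages(),
    include_package_data=True,
    install_requires=['gym'],    # dependencies not already in the ML Engine runtime
)
```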

Hyper-Parameter Tuning: Submitting a Job

Now that we have created our model, we will define a hyperparameter YAML file that indicates which hyperparameters we want to tune, the value ranges for each hyperparameter, and which metric we want to optimize ("reward" in the RL case).

Additionally, here you can also specify the scale-tier, which indicates how much compute power you want to use. In our case, we are going to use a single GPU for each job. If you wanted a custom setup, with multiple GPUs, for example, you can specify a custom scale tier . A nice benefit of using ML Engine for machine learning is that it allows you to focus on model development and deployment without worrying about infrastructure.

It's important to note that because the hyperparameter tuning service uses Bayesian optimization, it is a sequential algorithm that learns from each prior step. You must therefore specify both maxTrials and maxParallelTrials.

In the extreme case, if maxParallelTrials is equal to maxTrials, the process will finish quickly, in about the time it takes to train a single model (in this case you can set the algorithm parameter to GRID_SEARCH to perform a grid search instead of Bayesian optimization). Conversely, if maxParallelTrials is small, tuning will take longer, but it will generally result in better performance, because the Bayesian optimization process benefits from more completed trials from which it can incrementally improve its suggestions. Here is an example:
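A sketch of such a config file (the parameter ranges and scale types below are illustrative, not the exact values from our runs):

```yaml
trainingInput:
  scaleTier: BASIC_GPU
  hyperparameters:
    goal: MAXIMIZE
    hyperparameterMetricTag: reward
    maxTrials: 40
    maxParallelTrials: 5
    params:
      - parameterName: learning_rate
        type: DOUBLE
        minValue: 0.0001
        maxValue: 0.01
        scaleType: UNIT_LOG_SCALE
      - parameterName: batch_size
        type: INTEGER
        minValue: 16
        maxValue: 256
        scaleType: UNIT_LINEAR_SCALE
```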

The YAML file specifies the name of the metric (“reward” in the case of RL) and the goal (to maximize this metric). We are also giving the hyperparameter tuner a budget—we are allowing it to train 40 total models, with 5 of them run in parallel at any time. Finally, we specify the hyperparameters in the model that can be tuned (learning_rate, batch_size, etc.), along with ranges for each hyper-parameter.

Lastly, we are now ready to submit a hyperparameter tuning job to ML Engine:
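For example (the bucket, job, and module names below are placeholders):

```bash
gcloud ml-engine jobs submit training rl_hptuning_$(date +%s) \
  --package-path trainer \
  --module-name trainer.task \
  --job-dir gs://your-bucket/rl_hptuning \
  --config hptuning_config.yaml \
  --runtime-version 1.10 \
  --region us-central1
```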

Notice the --config argument, which points to our hyperparameter YAML file. The process will then fire off multiple training jobs using different hyperparameter values.

You can view the job status in two primary ways:

Can our model play successfully?

We wrote three reference examples for running reinforcement learning on GCP:

  1. A very simple policy gradients implementation to solve Cartpole using OpenAI Gym in TensorFlow
  2. A more advanced DQN implementation to solve Breakout Gym environments using tf.keras
  3. TD3 using TensorFlow in the BipedalWalker environment

We will share results from all three implementations, including the hyperparameter tuning results, along with visualizations of the agents solving CartPole, Breakout, and BipedalWalker.

Because we wrote all of our jobs to model directories on Cloud Storage, we can visualize the trial results using TensorBoard:


Reward over time for various trials as viewed in TensorBoard.


Learning curves (accumulated reward in each episode evolving over the learning process) from hyper-parameter tuning on the TD3 algorithm.


Learning curve from the best of the above trials. BipedalWalker is considered "solved" if the agent reaches an average reward of 300 over 100 consecutive episodes. The agent we trained met this criterion before episode #2000, which would put us in third place (at the time of writing, 2018-12-10) on the leaderboard. Try modifying the hyper-parameter tuning configurations here and see if you can earn a higher score.

It can be a bit hard to interpret evaluation curves for multiple trials, so often we simply look at the final objective value. We can view our final results under ML Engine > Jobs in the Cloud Console:


Example outputs as viewed in the Cloud Console.

Reinforcement learning posts aren’t nearly as fun when they omit some GIF animations of our agent playing a video game, so we’ll introduce some now. For our policy gradients implementation with CartPole, the best parameters—limited to using only 30 gradient steps (given enough iterations, CartPole will reach the terminal 200 reward even with suboptimal hyperparameters)—were:


Cartpole using the best parameters after only 30 gradient updates.

In contrast, using suboptimal hyperparameters, the episode ended before reaching the terminal 200 reward, achieving a reward of only 32:


Additionally, we also built a DQN implementation to solve more complex Gym environments. We ran our code on Breakout, and below are a couple of our results using the better hyperparameters. Performance would no doubt improve with longer training and a search over a larger hyperparameter space.


Breakout after approximately 300,000 iterations, reaching a maximum reward of 18.


Breakout after approximately 1 million iterations, reaching a maximum reward of 36.

Results from BipedalWalker:

After only 100 episodes of training, the agent could barely stand.


After training for 1000 episodes, the agent is able to finish the course (in an inefficient way).


The agent’s gait became agile after training for 2000 episodes.


Conclusion

Reinforcement learning is currently undergoing a really exciting period of rapid innovation. ML Engine serves as a valuable tool for practitioners and researchers working on solving problems with reinforcement learning. Specifically, the hyperparameter tuning service in ML Engine allows users to evaluate many different hyperparameter combinations while benefiting from a managed Bayesian optimization process that speeds up the search compared to a naive grid search. For more information on hyperparameter tuning, check out this documentation page.

Additional Resources
