Deep Reinforcement Learning and Hyperparameter Tuning



Using Ray’s Tune to Optimize your Models


One of the most difficult and time-consuming parts of deep reinforcement learning is the optimization of hyperparameters. These values, such as the discount factor $\gamma$ or the learning rate, can make all the difference in the performance of your agent.

Agents need to be trained to see how the hyperparameters affect performance — there’s no a priori way to know whether a higher or lower value for a given parameter will improve total rewards. This translates into multiple, costly training runs to get a good agent in addition to tracking the experiments, data, and everything associated with training the models.

Ray provides a way to deal with all of this through the Tune library, which automatically handles your various models, saves the data, adjusts your hyperparameters, and summarizes the results for quick and easy reference.

TL;DR

We walk through a brief example of using Tune’s grid search features to optimize our hyperparameters.

Installing Tune

Tune is part of the Ray project but requires a separate install, so if you haven't installed it yet, you'll need to run the following to get Tune working.

pip install ray[tune]

From here, we can import our packages to train our model.

import ray
from ray import tune

Tuning your First Model

Starting with the basics, let's use Tune to train an agent to solve CartPole-v0. Tune takes a few dictionaries with various settings and criteria for training. The two to focus on here are the config and stop arguments.

The config dictionary provides Tune with the environment it needs to run as well as any environment-specific configurations you may want to specify. This is also where most of your hyperparameters are going to reside, but we'll get to that in a moment.
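For example, if you were training on a custom environment, the environment-specific settings would go under env_config, which RLlib passes to the environment's constructor. Here is a minimal sketch; the environment name and the keys inside env_config are hypothetical (CartPole-v0 itself doesn't need any):

config = {
    'env': 'MyCustomEnv-v0',        # hypothetical registered environment
    'env_config': {
        'max_episode_steps': 200,   # hypothetical key your environment might accept
    },
    'lr': 1e-4,                     # hyperparameters such as the learning rate also live here
}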

The stop dictionary tells Tune when to finish a training run or when to stop training altogether. It can be customized based on reward criteria, elapsed time, number of steps taken, and so forth. When I first started with Tune, I overlooked setting any stopping criteria and wound up letting an algorithm train for hours before realizing it. So, you can run it without this, but you may rack up a decent AWS bill if you're not careful!
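A stop dictionary can also combine several criteria; the trial ends as soon as any one of them is met. A minimal sketch (the reward threshold and time limit below are just illustrative values):

stop = {
    'episode_reward_mean': 195,   # stop once the mean episode reward reaches this level
    'timesteps_total': 100000,    # ...or after this many environment steps
    'time_total_s': 600,          # ...or after ten minutes of wall-clock time
}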

Try the code below to run the PPO algorithm on CartPole-v0 for 10,000 time steps.

ray.init(ignore_reinit_error=True)  # Start Ray (or reuse an existing session)
config = {
    'env': 'CartPole-v0'  # The environment to train on
}
stop = {
    'timesteps_total': 10000  # End training after 10,000 environment steps
}
results = tune.run(
    'PPO',  # Specify the algorithm to train
    config=config,
    stop=stop
)

With these settings, you should see a printout of the status of your workers and memory usage, as well as the logdir where all of the data is stored for later analysis.


The console will print these values with each iteration unless the verbose argument in tune.run() is set to 0 (silent).

When training is complete, you'll get an output showing the status as TERMINATED, along with the elapsed time and the mean reward for the past 100 episodes, among other data.

Using Grid Search to Tune Hyperparameters

The power of Tune really comes through when we leverage it to adjust our hyperparameters. For this, we'll turn to the grid_search function, which allows the user to specify a set of hyperparameter values for the model to test.

To do this, we just need to wrap a list of values in the tune.grid_search() function and place that in our configuration dictionary. Let's go back to our CartPole example above. We might want to see if the learning rate makes any difference and if a two-headed network provides any benefit. We can use grid_search() to implement the different combinations of these as shown below:

config = {
    'env': 'CartPole-v0',
    'num_workers': 2,
    'vf_share_layers': tune.grid_search([True, False]),  # share layers between the policy and value heads or not
    'lr': tune.grid_search([1e-4, 1e-5, 1e-6]),          # learning rates to compare
}
results = tune.run(
    'PPO',
    stop={
        'timesteps_total': 100000
    },
    config=config
)

Now we see an expanded status printout which contains the various trials we want to run: two settings for vf_share_layers times three learning rates gives six trials in total.


As Ray kicks off each one of these, it will show the combination of hyperparameters we want to explore as well as the rewards, iterations, and elapsed time for each. When it completes, we should see TERMINATED as the status for each to show that it worked properly (otherwise it would read ERROR).


Analyzing Tune Results

The output of our tune.run() function is an analysis object that we've labeled results. We can use this to access further details about our experiments. The relevant data can be accessed via results.dataframe(), which returns a Pandas data frame containing average rewards, iterations, KL divergence, configuration settings, and more. The data frame also contains the specific directory your experiments were saved in (logdir), so you can dig into the details of your particular run.
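As a quick sketch of how you might poke around in it (exact column names can vary a bit between Ray versions):

# One row per trial, with the last reported metrics for each
df = results.dataframe()
print(df[['episode_reward_mean', 'timesteps_total', 'logdir']])

# Rank the trials by mean reward to find the most promising configuration
best = df.sort_values('episode_reward_mean', ascending=False).iloc[0]
print(best['logdir'])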

If you look into the logdir directory, you'll find a number of files that contain the saved data from your training runs. The primary file for our purposes will be progress.csv - this contains the training data from each of the iterations, allowing you to dive into the details.
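A quick way to see what's in one of these files is to load it with Pandas and list the columns (a sketch; the exact set of columns depends on the algorithm and Ray version):

import pandas as pd

# Take the first trial's directory from the analysis data frame
logdir = results.dataframe()['logdir'].iloc[0]
progress = pd.read_csv(logdir + '/progress.csv')
print(progress.columns.tolist())
print(progress[['timesteps_total', 'episode_reward_mean']].tail())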

For example, if we want to view the training and loss curves for our different settings, we can loop over the logdir column in our data frame, load each of the progress.csv files and plot the results.

# Plot training results
import matplotlib.pyplot as plt
import pandas as pd

colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
df = results.dataframe()

# Get column indices for total loss, policy loss, and value loss
tl_col = [i for i, j in enumerate(df.columns)
          if 'total_loss' in j][0]
pl_col = [i for i, j in enumerate(df.columns)
          if 'policy_loss' in j][0]
vl_col = [i for i, j in enumerate(df.columns)
          if 'vf_loss' in j][0]

labels = []
fig, ax = plt.subplots(2, 2, figsize=(15, 15), sharex=True)
for i, path in df['logdir'].items():
    data = pd.read_csv(path + '/progress.csv')
    # Build a legend label from the experiment tag (learning rate and vf_share_layers setting)
    lr = data['experiment_tag'][0].split('=')[1].split(',')[0]
    layers = data['experiment_tag'][0].split('=')[-1]
    labels.append('LR={}; Shared Layers={}'.format(lr, layers))

    # Mean episode reward
    ax[0, 0].plot(data['timesteps_total'],
                  data['episode_reward_mean'], c=colors[i],
                  label=labels[-1])
    # Total loss
    ax[0, 1].plot(data['timesteps_total'],
                  data.iloc[:, tl_col], c=colors[i],
                  label=labels[-1])
    # Policy loss
    ax[1, 0].plot(data['timesteps_total'],
                  data.iloc[:, pl_col], c=colors[i],
                  label=labels[-1])
    # Value function loss
    ax[1, 1].plot(data['timesteps_total'],
                  data.iloc[:, vl_col], c=colors[i],
                  label=labels[-1])

ax[0, 0].set_ylabel('Mean Rewards')
ax[0, 0].set_title('Training Rewards by Time Step')
ax[0, 0].legend(labels=labels, loc='upper center',
                ncol=3, bbox_to_anchor=[0.75, 1.2])
ax[0, 1].set_title('Total Loss by Time Step')
ax[0, 1].set_ylabel('Total Loss')
ax[0, 1].set_xlabel('Time Step')
ax[1, 0].set_title('Policy Loss by Time Step')
ax[1, 0].set_ylabel('Policy Loss')
ax[1, 0].set_xlabel('Time Step')
ax[1, 1].set_title('Value Loss by Time Step')
ax[1, 1].set_ylabel('Value Loss')
ax[1, 1].set_xlabel('Time Step')
plt.show()

[Figure: training rewards, total loss, policy loss, and value loss by time step for each hyperparameter combination]

Beyond Grid Search

There are far more tuning options available in Tune. If you want to see what you can tweak, take a look at the documentation for your particular algorithm. Moreover, Tune enables different approaches to hyperparameter optimization. Grid search can be slow, so just by changing a few options you can use Bayesian optimization, HyperOpt, and others. Finally, Tune makes population-based training (PBT) easy, allowing multiple agents to scale across various machines. All of this will be covered in future posts!
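As a small taste of what that looks like, the sketch below swaps the grid for simple random sampling: the learning rate is drawn from a log-uniform distribution and num_samples controls how many configurations Tune tries (the range and sample count here are just examples, not recommendations):

config = {
    'env': 'CartPole-v0',
    'num_workers': 2,
    'lr': tune.loguniform(1e-6, 1e-3),   # sample the learning rate on a log scale
}
results = tune.run(
    'PPO',
    config=config,
    stop={'timesteps_total': 100000},
    num_samples=4                        # draw four random configurations instead of a fixed grid
)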

