Deep Reinforcement Learning and Hyperparameter Tuning



Using Ray’s Tune to Optimize your Models


One of the most difficult and time-consuming parts of deep reinforcement learning is the optimization of hyperparameters. These values, such as the discount factor $\gamma$ or the learning rate, can make all the difference in the performance of your agent.

Agents need to be trained to see how the hyperparameters affect performance — there’s no a priori way to know whether a higher or lower value for a given parameter will improve total rewards. This translates into multiple, costly training runs to get a good agent in addition to tracking the experiments, data, and everything associated with training the models.

Ray provides a way to deal with all of this through the Tune library, which automatically handles your various models, saves the data, adjusts your hyperparameters, and summarizes the results for quick and easy reference.

TL;DR

We walk through a brief example of using Tune’s grid search features to optimize our hyperparameters.

Installing Tune

Tune is part of the Ray project but requires a separate install, so if you haven't installed it yet, you'll need to run the following to get Tune working.

pip install ray[tune]

From here, we can import our packages to train our model.

import ray
from ray import tune

Tuning your First Model

Starting with the basics, let's use Tune to train an agent to solve CartPole-v0. Tune takes a few dictionaries with various settings and criteria for training. The two to focus on here are the config and stop arguments.

The config dictionary provides Tune with the environment it needs to run as well as any environment-specific configurations you may want to specify. This is also where most of your hyperparameters are going to reside, but we'll get to that in a moment.
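For example, if you were training on a custom environment, the environment-specific settings would go under env_config, which RLlib passes to the environment's constructor. Here is a minimal sketch; the environment name and the keys inside env_config are hypothetical (CartPole-v0 itself doesn't need any):

config = {
    'env': 'MyCustomEnv-v0',        # hypothetical registered environment
    'env_config': {
        'max_episode_steps': 200,   # hypothetical key your environment might accept
    },
    'lr': 1e-4,                     # hyperparameters such as the learning rate also live here
}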

The stop dictionary tells Tune when to finish a training run or when to stop training altogether. It can be customized based on reward criteria, elapsed time, number of steps taken, and so forth. When I first started with Tune, I overlooked setting any stopping criteria and wound up letting an algorithm train for hours before realizing it. So, you can run it without this, but you may rack up a decent AWS bill if you're not careful!
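A stop dictionary can also combine several criteria; the trial ends as soon as any one of them is met. A minimal sketch (the reward threshold and time limit below are just illustrative values):

stop = {
    'episode_reward_mean': 195,   # stop once the mean episode reward reaches this level
    'timesteps_total': 100000,    # ...or after this many environment steps
    'time_total_s': 600,          # ...or after ten minutes of wall-clock time
}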

Try the code below to run the PPO algorithm on CartPole-v0 for 10,000 time steps.

ray.init(ignore_reinit_error=True)  # Start Ray (or reuse an existing session)
config = {
    'env': 'CartPole-v0'  # The environment to train on
}
stop = {
    'timesteps_total': 10000  # End training after 10,000 environment steps
}
results = tune.run(
    'PPO',  # Specify the algorithm to train
    config=config,
    stop=stop
)

With these settings, you should see a printout of the status of your workers and memory usage, as well as the logdir where all of the data is stored for later analysis.


The console will print these values with each iteration unless the verbose argument in tune.run() is set to 0 (silent).

When training is complete, you'll get an output showing the status as TERMINATED, along with the elapsed time and the mean reward for the past 100 episodes, among other data.

Using Grid Search to Tune Hyperparameters

The power of Tune really comes through when we leverage it to adjust our hyperparameters. For this, we'll turn to the grid_search function, which allows the user to specify a set of hyperparameter values for the model to test.

To do this, we just need to wrap a list of values in the tune.grid_search() function and place that in our configuration dictionary. Let's go back to our CartPole example above. We might want to see if the learning rate makes any difference and if a two-headed network provides any benefit. We can use grid_search() to implement the different combinations of these as shown below:

config = {
    'env': 'CartPole-v0',
    'num_workers': 2,
    'vf_share_layers': tune.grid_search([True, False]),  # share layers between the policy and value heads or not
    'lr': tune.grid_search([1e-4, 1e-5, 1e-6]),          # learning rates to compare
}
results = tune.run(
    'PPO',
    stop={
        'timesteps_total': 100000
    },
    config=config
)

Now we see an expanded status printout which contains the various trials we want to run: two settings for vf_share_layers times three learning rates gives six trials in total.


As Ray kicks off each one of these, it will show the combination of hyperparameters we want to explore as well as the rewards, iterations, and elapsed time for each. When it completes, we should see TERMINATED as the status for each to show that it worked properly (otherwise it would read ERROR).


Analyzing Tune Results

The output of our tune.run() function is an analysis object that we've labeled results. We can use this to access further details about our experiments. The relevant data can be accessed via results.dataframe(), which returns a Pandas data frame containing average rewards, iterations, KL divergence, configuration settings, and more. The data frame also contains the specific directory your experiments were saved in (logdir), so you can dig into the details of your particular run.
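As a quick sketch of how you might poke around in it (exact column names can vary a bit between Ray versions):

# One row per trial, with the last reported metrics for each
df = results.dataframe()
print(df[['episode_reward_mean', 'timesteps_total', 'logdir']])

# Rank the trials by mean reward to find the most promising configuration
best = df.sort_values('episode_reward_mean', ascending=False).iloc[0]
print(best['logdir'])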

If you look into the logdir directory, you'll find a number of files that contain the saved data from your training runs. The primary file for our purposes will be progress.csv - this contains the training data from each of the iterations, allowing you to dive into the details.
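A quick way to see what's in one of these files is to load it with Pandas and list the columns (a sketch; the exact set of columns depends on the algorithm and Ray version):

import pandas as pd

# Take the first trial's directory from the analysis data frame
logdir = results.dataframe()['logdir'].iloc[0]
progress = pd.read_csv(logdir + '/progress.csv')
print(progress.columns.tolist())
print(progress[['timesteps_total', 'episode_reward_mean']].tail())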

For example, if we want to view the training and loss curves for our different settings, we can loop over the logdir column in our data frame, load each of the progress.csv files and plot the results.

# Plot training results
import matplotlib.pyplot as plt
import pandas as pd

colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
df = results.dataframe()

# Get column indices for total loss, policy loss, and value loss
tl_col = [i for i, j in enumerate(df.columns)
          if 'total_loss' in j][0]
pl_col = [i for i, j in enumerate(df.columns)
          if 'policy_loss' in j][0]
vl_col = [i for i, j in enumerate(df.columns)
          if 'vf_loss' in j][0]

labels = []
fig, ax = plt.subplots(2, 2, figsize=(15, 15), sharex=True)
for i, path in df['logdir'].items():
    data = pd.read_csv(path + '/progress.csv')
    # Build a legend label from the experiment tag (learning rate and vf_share_layers setting)
    lr = data['experiment_tag'][0].split('=')[1].split(',')[0]
    layers = data['experiment_tag'][0].split('=')[-1]
    labels.append('LR={}; Shared Layers={}'.format(lr, layers))

    # Mean episode reward
    ax[0, 0].plot(data['timesteps_total'],
                  data['episode_reward_mean'], c=colors[i],
                  label=labels[-1])
    # Total loss
    ax[0, 1].plot(data['timesteps_total'],
                  data.iloc[:, tl_col], c=colors[i],
                  label=labels[-1])
    # Policy loss
    ax[1, 0].plot(data['timesteps_total'],
                  data.iloc[:, pl_col], c=colors[i],
                  label=labels[-1])
    # Value function loss
    ax[1, 1].plot(data['timesteps_total'],
                  data.iloc[:, vl_col], c=colors[i],
                  label=labels[-1])

ax[0, 0].set_ylabel('Mean Rewards')
ax[0, 0].set_title('Training Rewards by Time Step')
ax[0, 0].legend(labels=labels, loc='upper center',
                ncol=3, bbox_to_anchor=[0.75, 1.2])
ax[0, 1].set_title('Total Loss by Time Step')
ax[0, 1].set_ylabel('Total Loss')
ax[0, 1].set_xlabel('Time Step')
ax[1, 0].set_title('Policy Loss by Time Step')
ax[1, 0].set_ylabel('Policy Loss')
ax[1, 0].set_xlabel('Time Step')
ax[1, 1].set_title('Value Loss by Time Step')
ax[1, 1].set_ylabel('Value Loss')
ax[1, 1].set_xlabel('Time Step')
plt.show()

[Figure: training rewards, total loss, policy loss, and value loss by time step for each hyperparameter combination]

Beyond Grid Search

There are far more tuning options available in Tune. If you want to see what you can tweak, take a look at the documentation for your particular algorithm. Moreover, Tune enables different approaches to hyperparameter optimization. Grid search can be slow, so just by changing a few options you can use Bayesian optimization, HyperOpt, and others. Finally, Tune makes population-based training (PBT) easy, allowing multiple agents to scale across various machines. All of this will be covered in future posts!
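As a small taste of what that looks like, the sketch below swaps the grid for simple random sampling: the learning rate is drawn from a log-uniform distribution and num_samples controls how many configurations Tune tries (the range and sample count here are just examples, not recommendations):

config = {
    'env': 'CartPole-v0',
    'num_workers': 2,
    'lr': tune.loguniform(1e-6, 1e-3),   # sample the learning rate on a log scale
}
results = tune.run(
    'PPO',
    config=config,
    stop={'timesteps_total': 100000},
    num_samples=4                        # draw four random configurations instead of a fixed grid
)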

