Programmatically interpretable reinforcement learning


Programmatically interpretable reinforcement learning, Verma et al., ICML 2018

Being able to trust (interpret, verify) a controller learned through reinforcement learning (RL) is one of the key challenges for real-world deployments of RL that we looked at earlier this week. It's also an essential requirement for agents in human-machine collaborations (i.e., all deployments at some level) as we saw last week. Since reading some of Cynthia Rudin's work last year I've been fascinated with the notion of interpretable models. I believe there is a large set of use cases where an interpretable model should be the default choice. There are so many deployment benefits, even putting aside any ethical or safety concerns.

So how do you make an interpretable model? Today's paper choice is the third paper we've looked at along these lines (following CORELS and RiskSlim), enough for a recognisable pattern to start to emerge. The first step is to define a language (grammar and associated semantics) in which the ultimate model to be deployed will be expressed. For CORELS this consists of simple rule-based expressions, and for RiskSlim it is scoring sheets. For Programmatically Interpretable Reinforcement Learning (PIRL), as we shall soon see, it's a minimal functional language and an accompanying program sketch. The key thing is that, by the definition of the language and the rules for valid expressions within it, any model expressed in that language will be interpretable. Given the model expression language (a DSL), we can now use all of the machine learning techniques at our disposal (including black-box methods) to learn an expression in that language. Ultimately it's that learned and interpretable (by humans as well as machines) expression that we deploy in our systems. Hence black-box and other models become not the ultimate output of our learning process, but an intermediate step along the way.

In PIRL, Verma et al. embody this pattern in the following way:

  1. There’s a tiny functional language based on a small number of side-effect free combinators
  2. For a given task, a program template (which the authors call a sketch ), further constrains the set of programs that can be learned for the problem in hand. This also very handily constrains the search space of course, helping to make learning a suitable policy program tractable.
  3. To help guide the search within the set of programs conforming to the sketch, a standard reinforcement learning algorithm is used to learn a (black box) policy.
  4. The black box policy is used as an oracle (the Neural Policy Oracle ), and a neurally directed program search (NDPS) tries to find the sketch-conforming program that behaves as closely to the oracle as possible.

There’s a subtle point here that it may not be possible to exactly mirror the behaviour of the Oracle in a sketch-conforming program. For one thing, that program is likely to have a much smaller state space than the Oracle. There’s a nice thing that happens in the evaluation section, where the authors compare the performance of a learned program (interpretable model) against the best black box model (the Oracle). While the interpretable model may not get to quite the same level of performance as the Oracle on the exact task used for training, it turns out that the process of generating the interpretable model results in something which generalises much better to new situations. The authors present this without commentary. My hypothesis is that it’s the result of the smoothing effect of dimensionality reduction, reducing overfitting.

We demonstrate that NDPS is able to discover human-readable policies that pass some significant performance bars. We also show that PIRL policies can have smoother trajectories, and can be more easily transferred to environments not encountered during training, than corresponding policies discovered by DRL.

A DSL for policies

In PIRL, policies are expressed using a high-level DSL.

…to facilitate search through the space of programs expressible in the language, it is desirable for the language to express computations as compactly and canonically as possible. Because of this, we propose to express parameterized policies using a functional language based on a small number of side-effect free combinators. It is known from prior work that such languages offer natural advantages in program synthesis.

The language supports:

  • Numerical constants
  • A set of basic operators (+, -, *, /, …)
  • Variables
  • A peek(x, i) operation that gives the observed value of a variable x, i timesteps ago in a history.
  • A fold operator that operates over histories

[Figure: the grammar of the policy DSL]
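To make the flavour of the language concrete, here's a minimal sketch (my own illustration in Python, not the paper's implementation) of how peek and fold might operate over a history of sensor observations:

```python
# Minimal illustration of the DSL's history operations (my own sketch,
# not the paper's implementation). A history is a list of observation
# dicts, most recent last.

def peek(history, x, i):
    """The observed value of variable x, i timesteps ago."""
    return history[-1 - i][x]

def fold(f, history, x, window, init=0.0):
    """Fold f over the last `window` observed values of variable x."""
    acc = init
    for i in range(window - 1, -1, -1):   # oldest to newest within the window
        acc = f(acc, peek(history, x, i))
    return acc

# Example: accumulate the last 5 track-position readings, the kind of
# sum the integral term of a PID controller needs.
history = [{"trackPos": p} for p in [0.10, 0.08, 0.05, 0.02, -0.01]]
print(fold(lambda acc, v: acc + v, history, "trackPos", window=5))  # ~0.24
```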

Sketches

Given the vast number of possible programs that can be generated using the DSL, the user provides, for a given problem, a grammar of allowable expressions. The running example in the paper is learning a policy to drive a race car around circuits in TORCS (The Open Racing Car Simulator). The policy has to control steering, acceleration, and so on. We could use our domain-specific knowledge to figure out that a good 'shape' (or sketch) of a likely solution is to use a number of PID controllers. But how they should be parameterised, and how we should coordinate amongst them, is unknown. The following sketch grammar shows how we could then constrain the policy program search space:

[Figure: the sketch grammar for TORCS policies]

By acting over a fixed-size window of history, the fold (I) can be used as a discrete approximation of the integral term in a PID controller.
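As a rough illustration of that idea (again my own sketch, reusing the peek and fold helpers above, with made-up gain parameters), a PID-style steering expression of this shape might look like:

```python
# PID-style steering over the sensor history: P uses the current reading,
# I uses a fold over the last `window` readings (a discrete integral),
# D uses a one-step difference. The gains kp, ki, kd are made up.

def pid_steer(history, kp, ki, kd, window=5):
    error      = peek(history, "trackPos", 0)
    error_sum  = fold(lambda a, v: a + v, history, "trackPos", window)
    error_diff = error - peek(history, "trackPos", 1)
    return kp * error + ki * error_sum + kd * error_diff
```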

Here’s an example program conforming to this grammar that PIRL learns for TORCS:

[Figure: the policy program learned by PIRL for TORCS]

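Purely as a hypothetical illustration of the shape such a program takes (a conditional that selects between PID-like expressions over the sensor histories), here's a made-up example building on the helpers sketched earlier; see the figure for the actual learned program:

```python
# Hypothetical sketch-conforming policy: a conditional that picks between
# two PID-like steering expressions. The threshold and gains are invented
# for illustration; the paper's figure shows the actual learned program.

def steer_policy(history):
    if peek(history, "trackPos", 0) > 0.011:
        return pid_steer(history, kp=0.45, ki=0.03, kd=12.0)
    return pid_steer(history, kp=0.38, ki=0.02, kd=10.0)
```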

Neurally directed program search

The non-smoothness of the space of programmatic policies conforming to a sketch means that standard search approaches can't be used. So instead PIRL uses standard deep reinforcement learning to learn a policy oracle for the given environment. The Neurally Directed Program Search (NDPS) algorithm (see below) then seeks to find a program that closely mimics the behaviour of the oracle. This is a form of imitation learning.

[Figure: the NDPS algorithm]

There’s a tricky bit here whereby programs encountered during the search may generate histories that are impossible under the oracle (and so we can’t ask the oracle for advice on what to do). For example, the oracle may never drive the race car into a wall, but if we generate a program that does do this, we very much want advice!

Our solution to this problem is input augmentation, or periodic updates to the history set H. More precisely, after a certain number of search steps for a fixed set H, and after choosing the best available synthesized program for this set, we sample a set of additional histories by simulating the current programmatic policy, and add these samples to H.
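Putting the pieces together, here's a much-simplified sketch of the NDPS loop with input augmentation (my own rendering in Python; the two callables passed in are placeholders standing in for the paper's subroutines, and the algorithm figure above is the authoritative version):

```python
# Much-simplified sketch of the NDPS loop with input augmentation.
# collect_histories and synthesize_closest_program stand in for the
# paper's subroutines: roll out a policy to gather histories, and search
# the sketch-defined program space for the program whose actions best
# match the oracle's on a given set of histories.

def ndps(oracle, sketch, env,
         collect_histories,            # (policy, env, n) -> list of histories
         synthesize_closest_program,   # (sketch, oracle, histories) -> program
         n_iters=10, n_rollouts=25):
    # Start from histories generated by the neural policy oracle itself.
    histories = collect_histories(oracle, env, n_rollouts)
    program = synthesize_closest_program(sketch, oracle, histories)
    for _ in range(n_iters):
        # Input augmentation: roll out the *current program*, so the history
        # set also covers states the oracle alone would never reach (e.g.
        # the car heading into a wall), and ask the oracle what to do there.
        histories = histories + collect_histories(program, env, n_rollouts)
        # Re-run the local search for the sketch-conforming program whose
        # actions are closest to the oracle's on the augmented set.
        program = synthesize_closest_program(sketch, oracle, histories)
    return program
```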

Learning to drive a car

The authors use NDPS to train a policy for the practice mode of TORCS – input is available from 29 sensors, and the agent must learn to control acceleration and steering. The sketch used in the experiment was the one we looked at above, and fold calculations were restricted to the five most recent history observations. Each distinct race track is viewed as a distinct POMDP (partially observable Markov decision process). In addition to TORCS, NDPS is used on three simpler control games as well: Acrobot, CartPole, and MountainCar.

For two racetracks, here’s how NDPS performed when compared to the following approaches:

  • DRL – an agent using deep reinforcement learning directly
  • Naive – a program synthesised without access to a policy oracle
  • NoAug – a program synthesised without input augmentation
  • NoSketch – a program synthesised without sketch guidance
  • NoIF – a program synthesised with a restricted sketch that does not permit conditional branching.

[Table: performance of NDPS vs. the above approaches on two TORCS race tracks]

The DRL policy performs better, but the NDPS policy is pretty good, and it is interpretable by construction. The NDPS policy is also less aggressive with its control inputs, resulting in smoother steering actions. When noise is added (by simulating defective sensors with dropouts) then the NDPS policy does noticeably better than the DRL one:

[Table: performance with simulated sensor noise (defective sensors with dropouts)]

It also does better on race tracks the agent hasn’t seen before:

[Table: performance on race tracks not seen during training]

Verifying policy properties

So far as we know, the current state of the art neural network verifiers cannot verify the DRL network we are using in a reasonable amount of time, due to the size and complexity of the network…

On the other hand, the simple program produced by NDPS can easily be verified. The authors show one simple proof that the generated program guarantees smooth steering behaviour, and another proof of global bounds on action properties.
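To get a feel for why this is so much easier than verifying a deep network, here's a toy illustration (mine, not the paper's proof) of establishing a global bound on a PID-like policy's output by simple interval reasoning, assuming each sensor reading lies in a known range:

```python
# Toy bound check for a PID-like steering expression (my illustration,
# not the paper's verification procedure). Assuming non-negative gains and
# every trackPos reading in [lo, hi], the output magnitude is bounded
# term by term:
#   |P| <= kp * max(|lo|, |hi|)
#   |I| <= ki * window * max(|lo|, |hi|)
#   |D| <= kd * (hi - lo)

def steer_bound(kp, ki, kd, window, lo=-1.0, hi=1.0):
    m = max(abs(lo), abs(hi))
    return kp * m + ki * window * m + kd * (hi - lo)

print(steer_bound(kp=0.45, ki=0.03, kd=12.0, window=5))  # ~24.6
```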

One nice thing it occurs to me you could do here is to design the sketch in such a way that the programs conforming to it can easily have an accompanying TLA+ (or similar) model. Then you can use the proof system to reason about the generated programs. It might even be possible to have some kind of reward feedback loop whereby generated programs that don't satisfy certain desired properties are pruned.

