Programmatically interpretable reinforcement learning


Programmatically interpretable reinforcement learning, Verma et al., ICML 2018

Being able to trust (interpret, verify) a controller learned through reinforcement learning (RL) is one of the key challenges for real-world deployments of RL that we looked at earlier this week. It's also an essential requirement for agents in human-machine collaborations (i.e., all deployments at some level) as we saw last week. Since reading some of Cynthia Rudin's work last year I've been fascinated with the notion of interpretable models. I believe there is a large set of use cases where an interpretable model should be the default choice. There are so many deployment benefits, even putting aside any ethical or safety concerns.

So how do you make an interpretable model? Today's paper choice is the third paper we've looked at along these lines (following CORELS and RiskSlim), enough for a recognisable pattern to start to emerge. The first step is to define a language — grammar and associated semantics — in which the ultimate model to be deployed will be expressed. For CORELS this consists of simple rule-based expressions, and for RiskSlim it is scoring sheets. For Programmatically Interpretable Reinforcement Learning (PIRL), as we shall soon see, it's a minimal functional language and an accompanying program sketch. The key thing is that, by the definition of the language and the rules for valid expressions within it, any model expressed in that language will be interpretable. Given the model expression language (a DSL), we can now use all of the machine learning techniques at our disposal (including black-box methods) to learn an expression in that language. Ultimately it's that learned and interpretable (by humans as well as machines) expression that we deploy in our systems. Hence black-box and other models become not the ultimate output of our learning process, but an intermediate step along the way.

In PIRL, Verma et al. embody this pattern in the following way:

  1. There’s a tiny functional language based on a small number of side-effect free combinators
  2. For a given task, a program template (which the authors call a sketch ), further constrains the set of programs that can be learned for the problem in hand. This also very handily constrains the search space of course, helping to make learning a suitable policy program tractable.
  3. To help guide the search within the set of programs conforming to the sketch, a standard reinforcement learning algorithm is used to learn a (black box) policy.
  4. The black box policy is used as an oracle (the Neural Policy Oracle ), and a neurally directed program search (NDPS) tries to find the sketch-conforming program that behaves as closely to the oracle as possible.
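
Pulled together, the four stages compose roughly as in the sketch below. Every callable here (train_drl_oracle, candidate_programs, action_distance) is a hypothetical placeholder of my own standing in for the corresponding stage; none of them is an API from the paper.

```python
# Hypothetical glue for the four stages above; all callables are supplied by
# the caller and merely mark where each stage plugs in.

def pirl(env, sketch, train_drl_oracle, candidate_programs, action_distance):
    # Step 3: learn a black-box policy with standard deep RL.
    oracle = train_drl_oracle(env)
    # Steps 1+2 live inside candidate_programs: it yields only programs that
    # are expressible in the DSL and conform to the sketch.
    # Step 4: neurally directed search -- keep the sketch-conforming program
    # whose behaviour is closest to the oracle's. The interpretable program,
    # not the neural network, is what gets deployed.
    return min(
        candidate_programs(sketch),
        key=lambda program: action_distance(program, oracle, env),
    )
```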

There’s a subtle point here that it may not be possible to exactly mirror the behaviour of the Oracle in a sketch-conforming program. For one thing, that program is likely to have a much smaller state space than the Oracle. There’s a nice thing that happens in the evaluation section, where the authors compare the performance of a learned program (interpretable model) against the best black box model (the Oracle). While the interpretable model may not get to quite the same level of performance as the Oracle on the exact task used for training, it turns out that the process of generating the interpretable model results in something which generalises much better to new situations. The authors present this without commentary. My hypothesis is that it’s the result of the smoothing effect of dimensionality reduction, reducing overfitting.

We demonstrate that NDPS is able to discover human-readable policies that pass some significant performance bars. We also show that PIRL policies can have smoother trajectories, and can be more easily transferred to environments not encountered during training, than corresponding policies discovered by DRL.

A DSL for policies

In PIRL, policies are expressed using a high-level DSL.

…to facilitate search through the space of programs expressible in the language, it is desirable for the language to express computations as compactly and canonically as possible. Because of this, we propose to express parameterized policies using a functional language based on a small number of side-effect free combinators. It is known from prior work that such languages offer natural advantages in program synthesis.

The language supports:

  • Numerical constants
  • A set of basic operators (+, -, *, /, …)
  • Variables
  • A peek(x, i) operation that gives the observed value of a variable x, i timesteps ago in a history.
  • A fold operator that operates over histories

[Figure: the grammar of the policy DSL]
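
To make the two history operators concrete, here is a small runnable sketch of how peek and fold might behave over a history of sensor readings. Representing a history as a plain list of per-timestep readings, and treating i=1 as the most recent observation, are my assumptions rather than the paper's exact semantics.

```python
# A runnable sketch of the two history operators. The list representation and
# the indexing convention (i=1 is the most recent reading) are assumptions.

def peek(history, i):
    """Observed value i timesteps ago (i=1 is the most recent reading)."""
    return history[-i]

def fold(op, history, init=0.0):
    """Left fold of a binary operator over the history window."""
    acc = init
    for value in history:
        acc = op(acc, value)
    return acc

# Example: the five most recent readings of a 'trackPos'-style sensor.
track_pos = [0.12, 0.10, 0.07, 0.05, 0.02]
print(peek(track_pos, 1))                   # 0.02 (latest reading)
print(fold(lambda a, b: a + b, track_pos))  # ~0.36 (sum over the window)
```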

Sketches

Given the vast number of possible programs that can be generated using the DSL, the user provides, for a given problem, a grammar of allowable expressions. The running example in the paper is learning a policy to drive a race car around circuits in TORCS (The Open Racing Car Simulator). The policy has to control steering, acceleration, and so on. We could use our domain-specific knowledge to figure out that a good 'shape' (or sketch) of a likely solution is a number of PID controllers. But how they should be parameterised, and how we should coordinate amongst them, is unknown. The following sketch grammar shows how we could then constrain the policy program search space:

[Figure: the sketch grammar constraining the TORCS policy search space]

By acting over a fixed-size window of history, the fold can be used as a discrete approximation of the integral (I) term in a PID controller.
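
As a purely illustrative example of what a program of this shape looks like, here is a PID-style action computation built from the peek/fold helpers sketched earlier. The gains and error values below are invented; they are not the parameters NDPS actually learns for TORCS.

```python
# Entirely illustrative PID-style program, reusing the peek/fold helpers
# defined above. Kp, Ki, Kd and the error history are made-up values.

Kp, Ki, Kd = 0.8, 0.05, 0.3

def pid_action(errors):
    """errors: the five most recent tracking errors, oldest first."""
    p = Kp * peek(errors, 1)                      # proportional: latest error
    i = Ki * fold(lambda a, b: a + b, errors)     # fold over the window ~ integral term
    d = Kd * (peek(errors, 1) - peek(errors, 2))  # finite difference ~ derivative term
    return p + i + d

print(pid_action([0.12, 0.10, 0.07, 0.05, 0.02]))
```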

Here’s an example program conforming to this grammar that PIRL learns for TORCS:

[Figure: the policy program learned by PIRL for TORCS]


Neurally directed program search

The non-smoothness of the space of programmatic policies conforming to a sketch means that standard search approaches can't be used. So instead PIRL uses standard deep reinforcement learning to learn a policy oracle for the given environment. The Neurally Directed Program Search algorithm (see below) then seeks to find a program that closely mimics the behaviour of the oracle. This is a form of imitation learning.

[Figure: the NDPS algorithm]

There’s a tricky bit here whereby programs encountered during the search may generate histories that are impossible under the oracle (and so we can’t ask the oracle for advice on what to do). For example, the oracle may never drive the race car into a wall, but if we generate a program that does do this, we very much want advice!

Our solution to this problem is input augmentation, or periodic updates to the set H (the history set). More precisely, after a certain number of search steps for a fixed set H, and after choosing the best available synthesized program for this set, we sample a set of additional histories by simulating the current programmatic policy, and add these samples to H.
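
Reading the algorithm and the input augmentation step together, the search loop comes out roughly as below. The local-neighbourhood search move, the squared action distance, and the rollout/neighbours helpers are my simplifications, not the paper's exact procedure.

```python
# A condensed sketch of the NDPS loop with input augmentation. Assumptions:
# rollout(policy) simulates the policy in the environment and returns a list
# of observation histories; neighbours(program) enumerates sketch-conforming
# programs one local edit away; actions are scalars compared by squared
# distance. These are simplifications, not the paper's API.

def ndps(initial_program, oracle, rollout, neighbours, rounds=10, steps=50):
    histories = list(rollout(oracle))   # start from inputs the oracle actually visits
    program = initial_program
    for _ in range(rounds):
        for _ in range(steps):
            # Imitation step: move to the neighbouring program whose actions
            # are closest to the oracle's over the current history set.
            program = min(
                neighbours(program),
                key=lambda p: sum((p(h) - oracle(h)) ** 2 for h in histories),
            )
        # Input augmentation: the programmatic policy may reach states the
        # oracle never does (e.g. close to a wall), so extend the history set
        # by simulating the current program and querying the oracle there.
        histories += rollout(program)
    return program
```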

Learning to drive a car

The authors use NDPS to train a policy for the practice mode of TORCS – input is available from 29 sensors, and the agent must learn to control acceleration and steering. The sketch used in the experiment was the one we looked at above, and fold calculations were restricted to the five most recent history observations. Each distinct race track is viewed as a distinct POMDP (partially observable Markov decision process). In addition to TORCS, NDPS is used on three simpler control games as well: Acrobot, CartPole, and MountainCar.

For two racetracks, here’s how NDPS performed when compared to the following approaches:

  • DRL – an agent using deep reinforcement learning directly
  • Naive – a program synthesised without access to a policy oracle
  • NoAug – a program synthesised without input augmentation
  • NoSketch – a program synthesised without sketch guidance
  • NoIF – a program synthesised with a restricted sketch that does not permit conditional branching.

[Table: performance of NDPS versus the baseline approaches on two TORCS race tracks]

The DRL policy performs better, but the NDPS policy is pretty good, and it is interpretable by construction. The NDPS policy is also less aggressive with its control inputs, resulting in smoother steering actions. When noise is added (by simulating defective sensors with dropouts) then the NDPS policy does noticeably better than the DRL one:

[Table: performance comparison with noisy (dropout-prone) sensors]

It also does better on race tracks the agent hasn’t seen before:

[Table: performance on race tracks not seen during training]

Verifying policy properties

So far as we know, the current state of the art neural network verifiers cannot verify the DRL network we are using in a reasonable amount of time, due to the size and complexity of the network…

On the other hand, the simple program produced by NDPS can easily be verified. The authors show one simple proof that the generated program guarantees smooth steering behaviour, and another proof of global bounds on action properties.
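
To give a flavour of why such a proof is cheap for a PID-shaped program, here is a toy bound argument of my own (with invented gains and an assumed per-timestep error bound), not the paper's actual proof:

```python
# Toy illustration: for a PID-shaped program, if the error is bounded by E at
# every timestep and the fold window holds W observations, then
#   |output| <= |Kp|*E + |Ki|*W*E + |Kd|*2*E.
# The gains and bounds below are invented for the sake of the arithmetic.

Kp, Ki, Kd = 0.8, 0.05, 0.3
E, W = 1.0, 5

action_bound = abs(Kp) * E + abs(Ki) * W * E + abs(Kd) * 2 * E
print(action_bound)   # ~1.65: the program's action can never exceed this magnitude
```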

One nice thing it occurs to me you could do here is to design the sketch in such a way that the programs conforming to it can easily have an accompanying TLA+ (or similar) model. Then you can use the proof system to reason about the generated programs. It might even be possible to have some kind of reward feedback loop whereby generated programs that don't satisfy certain desired properties are pruned.

