The hidden linear algebra of reinforcement learning


Bellman Updates

Richard E. Bellman was a mathematician who laid the groundwork for modern control and optimization theory. Through a recursive one-step equation, the Bellman Update Equation, large optimization problems can be solved efficiently. With a recursive Bellman update, one can set up an optimization or control problem with Dynamic Programming, the process of breaking a problem into smaller, more computationally tractable subproblems. This process proceeds recursively from the end, a receding-horizon approach.

Richard E. Bellman (1920–1984) — Wikipedia.
  1. Bellman Equation: necessary condition for optimality in optimization problems formulated as Dynamic Programming.
  2. Dynamic Programming: process to simplify an optimization problem by breaking it down into an optimal substructure.

In reinforcement learning, we use the Bellman Update process to solve for the optimal values and q-values of a state-action space. This ultimately formulates the expected sum of future rewards from a given state.

Here we can see how all of these values interleave. The notation (*) denotes optimal, meaning true or converged. The value of a state is determined by its best action's q-value, and then two recursive definitions follow. The recursive values balance the probability of visiting any state, T(s,a,s'), against the reward of any transition, R(s,a,s'), to create a global map of values over the state-action space.

The best value is tied to the best action's q-value, so the value and q-value update rules are very similar (weighting transitions, rewards, and discount).

Top: coupling of values to q-values; middle: q-value recursion; bottom: value iteration. Source: CS 188 at UC Berkeley.
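Written out in standard notation (a reconstruction matching that caption, with the usual discount factor $\gamma$), the three relations are:

```latex
V^*(s) = \max_a Q^*(s,a)

Q^*(s,a) = \sum_{s'} T(s,a,s')\,\bigl[\,R(s,a,s') + \gamma\, V^*(s')\,\bigr]

V^*(s) = \max_a \sum_{s'} T(s,a,s')\,\bigl[\,R(s,a,s') + \gamma\, V^*(s')\,\bigr]
```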

The key point here is that we are multiplying matrices (R, T) by vectors (V, U) to iteratively solve for convergence. The values converge from any initial state because the values of one state are determined by those of its neighbors s'. For more on MDPs, see this introduction I wrote.
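As a concrete sketch of that matrix-vector view, here is value iteration in NumPy on a hypothetical toy MDP (the array shapes and the names `T`, `R`, `gamma` are my assumptions for illustration, not from the original figures):

```python
import numpy as np

# Hypothetical toy MDP: 3 states, 2 actions.
# T[a, s, s'] = probability of landing in s' after taking action a in state s.
# R[a, s, s'] = reward for that transition.
n_states, n_actions = 3, 2
rng = np.random.default_rng(0)
T = rng.random((n_actions, n_states, n_states))
T /= T.sum(axis=2, keepdims=True)      # each (a, s) row sums to 1
R = rng.random((n_actions, n_states, n_states))
gamma = 0.9                            # discount factor

V = np.zeros(n_states)                 # any initial vector converges
for _ in range(1000):
    # Q[a, s] = sum_{s'} T(s,a,s') * (R(s,a,s') + gamma * V(s'))
    Q = (T * (R + gamma * V)).sum(axis=2)
    V_new = Q.max(axis=0)              # best action per state
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new
print(V)
```

The inner line is exactly the multiplication the paragraph describes: T weights the neighbor values, R adds the rewards, and the max picks the best action per state.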

Reinforcement Learning?

“I was told there would be RL,” you say, four minutes in. This is all reinforcement learning, and I assert that understanding the assumptions and the model the algorithms are built on will prepare you vastly better than just copying Python tutorials from OpenAI. Do that after. I've mentored multiple students into working in RL, and the ones who get the most done are always the ones who learn what is going on, and then how to apply it.

That being said, this is one small step away from online q-learning, where we estimate the same Bellman updates with samples of T and R rather than using them explicitly in the equations. All the same assertions apply, but over probability distributions and expectations. Q-learning is the famous algorithm that solved Atari games and more in about 2015.
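A minimal sketch of that sampled update, tabular q-learning with a learning rate `alpha` (the function and its arguments are illustrative assumptions, not the article's code):

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One online Bellman update from a single observed transition
    (s, a, r, s_next); no explicit access to T or R is needed."""
    target = r + gamma * np.max(Q[s_next])  # sampled estimate of the optimum
    Q[s, a] += alpha * (target - Q[s, a])   # nudge Q(s, a) toward the target
```

Averaged over many sampled transitions, this update estimates the same expectation over T and R that value iteration computes explicitly.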

Hidden Math

Eigenvalues? Huh.

Recall that an eigenvalue-eigenvector pair (λ, u) of a system A is a scalar and a vector such that the vector, acted on by the system, returns a scalar multiple of the original vector.

The eigenvalue-eigenvector equation: $A u = \lambda u$.

The beautiful thing about eigenvalues and eigenvectors is that when they span the state space (which they are guaranteed to do for most physical systems, via something called generalized eigenvectors), every vector can be written as a linear combination of the eigenvectors. In discrete systems, the eigenvectors then control the evolution from any initial state, since any initial vector decomposes into a linear combination of them.
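Concretely, if the eigenvectors $u_i$ of $A$ span the space, any initial vector decomposes along them, and each step of the system scales each component by its eigenvalue:

```latex
x_0 = \sum_i c_i u_i
\quad\Longrightarrow\quad
A^k x_0 = \sum_i c_i \lambda_i^k u_i
```

Components with $|\lambda_i| < 1$ decay, so the long-run behavior is dominated by the eigenvectors with the largest eigenvalues.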

Stochastic Matrices and Markov Chains

MDPs are very close to, but not the same in structure as, Markov chains. A Markov chain is determined by its transition matrix P, which acts like the transition matrix T(s,a,s') summed over the actions. In a Markov chain, the next state is determined by:

Evolution under a stochastic matrix: $x_{t+1} = P x_t$.

This matrix P has some special values: at a stationary distribution, x = Px, which is an eigenvalue equation with the eigenvalue equal to one (picture a λ = 1 pre-multiplying the left side of the equation). To guarantee that a matrix has an eigenvalue equal to one, all of its columns must sum to 1.
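A quick numerical check of that claim, on a hypothetical 3-state chain (the matrix entries here are made up; only the column sums matter):

```python
import numpy as np

# Column-stochastic transition matrix: each column sums to 1,
# so P is guaranteed to have an eigenvalue equal to 1.
P = np.array([[0.90, 0.20, 0.10],
              [0.05, 0.70, 0.30],
              [0.05, 0.10, 0.60]])

eigvals, eigvecs = np.linalg.eig(P)
print(eigvals)                          # one eigenvalue is (numerically) 1

# The matching eigenvector, normalized to sum to 1, is the
# stationary distribution: x = P x.
i = np.argmin(np.abs(eigvals - 1.0))
x = np.real(eigvecs[:, i])
x /= x.sum()
print(x, P @ x)                         # x and P @ x agree
```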

What we are looking for in RL now is how the evolution of our solutions relates to the convergence of probability distributions. We do this by formulating the iterative operators for V* and Q* as a linear operator (a matrix) B. Convergence can be tricky: the value and q-value vectors we use are not the eigenvectors themselves; they converge toward the eigenvectors, but that distinction is not essential to seeing how eigenvectors govern the system.

$B V^* = V^*$: the Bellman operator B acts like a linear transformation with an eigenvector-eigenvalue pair at λ = 1.
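The standard way to make that convergence precise (not shown in the figure, but the usual argument) is that B is a contraction in the max norm with factor $\gamma < 1$, so repeated application squeezes any starting vector toward the fixed point $V^*$:

```latex
\| B V - B V' \|_\infty \;\le\; \gamma \,\| V - V' \|_\infty
\quad\Longrightarrow\quad
\| B^k V_0 - V^* \|_\infty \;\le\; \gamma^k \,\| V_0 - V^* \|_\infty
```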
