A proof of the existence of the optimal policy for finite MDPs
Jul 18 · 8 min read
In a finite Markov Decision Process (MDP), the optimal policy is defined as a policy that maximizes the value of all states at the same time¹. In other words, if an optimal policy exists, then the policy that maximizes the value of state s is the same as the policy that maximizes the value of state s'. ² But why should such a policy exist?
The famous introductory book of Sutton and Barto on reinforcement learning¹ takes the existence of optimal policies for granted and leaves this question unanswered. I had a hard time believing it and continuing to read!
In this article, I am going to give a proof of the existence of the optimal policy in finite MDPs³.
Notation and definitions
Markov decision processes and policies
A finite MDP is characterized by a finite set of states (usually shown by a curvy S), a finite set of actions for each state (usually shown by a curvy A), and a probability distribution over the next state s’ and the next reward value r, given the current state s and the currently chosen action a, denoted as p(s’,r|s,a).
Given the current state s, a policy π is a probability distribution over the possible actions at state s, denoted as π(a|s). Then, given a policy, an agent can navigate the environment (i.e. go from one state to another) and collect rewards through each transition.
We denote random variables by capital letters and their values by lowercase letters. Time is indicated by a subscript on each variable. Then, given a policy and an MDP, and given the initial state s (at time t=1), for any T > 1, the joint distribution of states, actions, and reward values is
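Spelled out in standard form (following Sutton and Barto's convention that the reward R_{t+1} follows the action A_t), this joint distribution factorizes as:

```latex
\Pr\!\left(A_1 = a_1,\, R_2 = r_2,\, S_2 = s_2,\, \dots,\, R_T = r_T,\, S_T = s_T \;\middle|\; S_1 = s\right)
\;=\; \prod_{t=1}^{T-1} \pi(a_t \mid s_t)\; p(s_{t+1}, r_{t+1} \mid s_t, a_t),
\qquad s_1 = s .
```

The Markov property is what makes this factorization possible: each transition depends only on the current state and action.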
Values and Bellman equation
Given a policy π and a discount factor 0 ≤ γ < 1, the value of each state is defined as
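In standard notation, with the expectation taken over trajectories generated by π:

```latex
v_\pi(s) \;=\; \mathbb{E}_\pi\!\left[\,\sum_{t=1}^{\infty} \gamma^{\,t-1}\, R_{t+1} \;\middle|\; S_1 = s \right]
```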
and the value of each pair of state and action as
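which, in the same notation, is:

```latex
q_\pi(s, a) \;=\; \mathbb{E}_\pi\!\left[\,\sum_{t=1}^{\infty} \gamma^{\,t-1}\, R_{t+1} \;\middle|\; S_1 = s,\; A_1 = a \right]
```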
It is easy to show that the values of states and action-state pairs can be written in a recursive way
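Conditioning on the first transition gives the standard recursions:

```latex
v_\pi(s) \;=\; \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[\, r + \gamma\, v_\pi(s') \,\right],
```

```latex
q_\pi(s, a) \;=\; \sum_{s', r} p(s', r \mid s, a)\left[\, r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a') \,\right].
```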
These sets of equations are known as the Bellman equations .
We will use later the fact that
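namely, that the value of a state is the π-weighted average of its action-values — this appears to be the "Eq. 1" invoked later in the proof of Theorem 1:

```latex
v_\pi(s) \;=\; \sum_{a \in \mathcal{A}(s)} \pi(a \mid s)\, q_\pi(s, a). \tag{1}
```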
Optimal policy
The policy π* is an optimal policy if and only if we have
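that is, in symbols:

```latex
v_{\pi^*}(s) \;\ge\; v_\pi(s)
```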
for any state s and any other policy π.
The Bellman optimality equations
We show the set of all possible states by curvy S and the set of all possible actions at state s by curvy A(s). We denote the Kronecker delta by δ and start this section with the following theorem.
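Theorem 1 is, in essence, the one-step policy improvement result. A statement consistent with the surrounding text (the Kronecker delta δ is used exactly for the construction of the new policy) is:

```latex
\textbf{Theorem 1.}\quad
\text{Let } \pi \text{ be a policy and suppose } q_\pi(s^*, a^*) \ge v_\pi(s^*)
\text{ for some state--action pair } (s^*, a^*).
```

```latex
\text{Define } \pi'(a \mid s) \;=\;
\begin{cases}
\delta_{a, a^*} & \text{if } s = s^*, \\[2pt]
\pi(a \mid s) & \text{otherwise.}
\end{cases}
\qquad
\text{Then } v_{\pi'}(s) \;\ge\; v_\pi(s) \text{ for every state } s .
```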
Proof comment: We used Eq. 1 in the 1st line of the proof, and then repeatedly used the fact that the value of the pair of state and action (s*, a*) is greater than or equal to the value of state s*.
Theorem 1 states that whenever there is a pair of state and action (s*, a*) with a value greater than the value of state s* with respect to policy π, there is another policy π’ whose state-values are greater than or equal to those of π in all states. As a result, if an optimal policy π* exists, its values should satisfy, for any state s,
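in symbols (this is the "Equation 2" expanded in the next step):

```latex
v_{\pi^*}(s) \;=\; \max_{a \in \mathcal{A}(s)} q_{\pi^*}(s, a) \tag{2}
```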
where curvy A(s) stands for the set of all possible actions at state s — one can easily prove this statement by contradiction. Using the Bellman equations, we can expand Equation 2 as
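substituting the Bellman equation for q into Equation 2 gives, for every state s:

```latex
v_{\pi^*}(s) \;=\; \max_{a \in \mathcal{A}(s)} \sum_{s', r} p(s', r \mid s, a)\left[\, r + \gamma\, v_{\pi^*}(s') \,\right]. \tag{3}
```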
This set of non-linear equations (as many as the number of states) is called the “Bellman optimality equations”. So, if an optimal policy exists, its values should satisfy this set of equations³.
Therefore, to show that an optimal policy exists, one must prove the following two statements:
- The set of the Bellman optimality equations has solutions, and
- One of its solutions has values greater than or equal to the values of the other solutions in all states.
The existence and the uniqueness of the solution
In this section, we prove that the set of the Bellman optimality equations has a unique solution. By doing so, we prove the two aforementioned statements at the same time.
The Bellman optimality operator
Given a set of values over states, we define the vector of values as
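that is, fixing an ordering s₁, …, s₍|𝒮|₎ of the states:

```latex
V \;=\; \left(\, v(s_1),\; v(s_2),\; \dots,\; v(s_{|\mathcal{S}|}) \,\right)^{\!\top} \;\in\; \mathbb{R}^{|\mathcal{S}|}
```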
which is simply a real-value vector whose elements are equal to the values of different states. Then, we define the “Bellman optimality operator” T as a mapping
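defined component-wise by applying the right-hand side of Equation 3 to an arbitrary value vector:

```latex
T : \mathbb{R}^{|\mathcal{S}|} \to \mathbb{R}^{|\mathcal{S}|},
\qquad
(T V)(s) \;=\; \max_{a \in \mathcal{A}(s)} \sum_{s', r} p(s', r \mid s, a)\left[\, r + \gamma\, V(s') \,\right].
```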
The operator T takes a vector of values and maps it to another vector of values. Using this new notation, it is easy to see that Equations 2 and 3 are equivalent to
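the single fixed-point condition:

```latex
T\, V_{\pi^*} \;=\; V_{\pi^*}
```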
This observation means that the solutions of the Bellman optimality equations are exactly the fixed points of the Bellman optimality operator. Therefore, to prove the existence and uniqueness of the solution to the Bellman optimality equations, one can prove that the Bellman optimality operator has a unique fixed point.
To do so, we need to introduce another concept and another theorem.
Contraction mapping and Banach fixed-point theorem
Consider a metric space (M,d) , i.e. M is a set, and d is a metric defined on this set for computing the distance of every two elements of M ⁵. A mapping T: M → M is a contraction mapping if there exists 0 ≤ k < 1 such that for any x and y in M , we have
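in symbols:

```latex
d\!\left(T(x),\, T(y)\right) \;\le\; k\, d(x, y).
```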
Intuitively, contraction mapping makes points closer to each other. Figure 1 shows an illustration of repeatedly applying a contraction mapping on two points.
The reason that we are interested in contraction mappings is the following famous theorem, known as Banach fixed-point theorem .
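One detail worth making explicit: the theorem requires the metric space to be complete, and ℝ^{|𝒮|} with the infinity-norm metric is complete, so the theorem applies in our setting. The statement is:

```latex
\textbf{Theorem (Banach fixed-point).}\quad
\text{If } (M, d) \text{ is a complete metric space and } T : M \to M \text{ is a contraction mapping,}
```

```latex
\text{then } T \text{ has a unique fixed point } x^* \in M;
\text{ moreover, for any } x_0 \in M, \text{ the iterates } x_{n+1} = T(x_n) \text{ converge to } x^*.
```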
Proof comment: The proof of the theorem is not hard, but I do not include it in this article, because the theorem is well known and the proof can easily be found elsewhere.
The whole idea behind the theorem is what is illustrated in Figure 1: All points are getting closer to each other after mapping, and hence, by repeating the mapping all points converge to one point which is the unique fixed point of T .
As a result, to prove the existence and uniqueness of the solution of the Bellman optimal equations, it is sufficient to show that there is a metric in which the Bellman optimality operator is a contraction mapping.
The Bellman optimality operator is a contraction mapping in infinity norm
For any pair of value vectors V and V', the infinity-norm distance between them is defined as
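the largest component-wise gap:

```latex
\| V - V' \|_{\infty} \;=\; \max_{s \in \mathcal{S}} \left|\, V(s) - V'(s) \,\right|.
```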
In this section, we want to prove that the Bellman optimality operator is a contraction mapping in this norm. To do so, we first need the following technical Lemma.
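Given how it is used in the main proof (to pull a max outside an absolute difference), the lemma is presumably the standard inequality:

```latex
\textbf{Lemma.}\quad
\text{For any functions } f, g \text{ on a finite set } X, \qquad
\left|\, \max_{x \in X} f(x) \;-\; \max_{x \in X} g(x) \,\right|
\;\le\;
\max_{x \in X} \left|\, f(x) - g(x) \,\right|.
```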
Proof comment: Although the Lemma is pretty non-trivial, its proof is not difficult and needs only elementary techniques. I had some fun proving it and thought it might be good to leave its proof as an exercise for interested readers⁶.
Now, having the lemma, we can finally go to our main theorem.
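The claim is that T is a γ-contraction in the infinity norm: for any pair of value vectors V and V',

```latex
\| T V - T V' \|_{\infty} \;\le\; \gamma\, \| V - V' \|_{\infty}.
```

Since 0 ≤ γ < 1, this makes T a contraction mapping, and the Banach fixed-point theorem applies.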
Proof comment: To go from the 2nd to the 3rd line of the proof, we used the Lemma, and to go from 4th to 5th line we used the convexity of the absolute value function. The rest is straightforward.
As a result, the Bellman optimality operator has a unique fixed point⁷, and the Bellman optimality equations have a unique solution. It is easy to show that any greedy policy on a solution of the Bellman optimality equations has values equal to that solution. Therefore, optimal policies exist!
Conclusion
We showed that (1) the values of an optimal policy must satisfy the Bellman optimality equations. We then showed that (2) the solutions to the Bellman optimality equations are the fixed points of the Bellman optimality operator. By showing that (3) the Bellman optimality operator is a contraction mapping in the infinity norm and using (4) the Banach fixed-point theorem, we proved that (5) the Bellman optimality operator has a unique fixed point. As a result, (6) there exist policies that maximize the values of all states at the same time.