A proof of the existence of the optimal policy for finite MDPs
Jul 18 · 8 min read
In a finite Markov Decision Process (MDP), the optimal policy is defined as a policy that maximizes the value of all states at the same time¹. In other words, if an optimal policy exists, then the policy that maximizes the value of state s is the same as the policy that maximizes the value of state s'. ² But why should such a policy exist?
The famous introductory book of Sutton and Barto on reinforcement learning¹ takes the existence of optimal policies for granted and leaves this question unanswered. I had a hard time believing it and continuing to read!
In this article, I am going to give a proof of the existence of the optimal policy in finite MDPs³.
Notation and definitions
Markov decision processes and policies
A finite MDP is characterized by a finite set of states (usually shown by a curvy S), a finite set of actions for each state (usually shown by a curvy A), and a probability distribution over the next state s’ and the next reward value r, given the current state s and the currently chosen action a, denoted as p(s’,r|s,a).
Given the current state s, a policy π is a probability distribution over the possible actions at state s, denoted as π(a|s). Then, given a policy, an agent can navigate the environment (i.e. go from one state to another) and collect rewards through each transition.
We denote random variables by capital letters and their values by lowercase letters. Time is indicated by a subscript on each variable. Then, given a policy and an MDP, and given the initial state s (at time t=1), for any T > 1, the joint distribution of states, actions, and reward values is
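Spelled out in standard form (following Sutton and Barto's convention that the reward R_{t+1} follows the action A_t), this joint distribution factorizes as:

```latex
\Pr\!\left(A_1 = a_1,\, R_2 = r_2,\, S_2 = s_2,\, \dots,\, R_T = r_T,\, S_T = s_T \;\middle|\; S_1 = s\right)
\;=\; \prod_{t=1}^{T-1} \pi(a_t \mid s_t)\; p(s_{t+1}, r_{t+1} \mid s_t, a_t),
\qquad s_1 = s .
```

The Markov property is what makes this factorization possible: each transition depends only on the current state and action.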
Values and Bellman equation
Given a policy π and a discount factor 0 ≤ γ < 1, the value of each state is defined as
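In standard notation, with the expectation taken over trajectories generated by π:

```latex
v_\pi(s) \;=\; \mathbb{E}_\pi\!\left[\,\sum_{t=1}^{\infty} \gamma^{\,t-1}\, R_{t+1} \;\middle|\; S_1 = s \right]
```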
and the value of each pair of state and action as
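which, in the same notation, is:

```latex
q_\pi(s, a) \;=\; \mathbb{E}_\pi\!\left[\,\sum_{t=1}^{\infty} \gamma^{\,t-1}\, R_{t+1} \;\middle|\; S_1 = s,\; A_1 = a \right]
```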
It is easy to show that the values of states and action-state pairs can be written in a recursive way
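Conditioning on the first transition gives the standard recursions:

```latex
v_\pi(s) \;=\; \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[\, r + \gamma\, v_\pi(s') \,\right],
```

```latex
q_\pi(s, a) \;=\; \sum_{s', r} p(s', r \mid s, a)\left[\, r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a') \,\right].
```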
These sets of equations are known as the Bellman equations .
We will use later the fact that
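namely, that the value of a state is the π-weighted average of its action-values — this appears to be the "Eq. 1" invoked later in the proof of Theorem 1:

```latex
v_\pi(s) \;=\; \sum_{a \in \mathcal{A}(s)} \pi(a \mid s)\, q_\pi(s, a). \tag{1}
```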
Optimal policy
The policy π* is an optimal policy if and only if we have
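that is, in symbols:

```latex
v_{\pi^*}(s) \;\ge\; v_\pi(s)
```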
for any state s and any other policy π.
The Bellman optimality equations
We show the set of all possible states by curvy S and the set of all possible actions at state s by curvy A(s). We denote the Kronecker delta by δ and start this section with the following theorem.
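Theorem 1 is, in essence, the one-step policy improvement result. A statement consistent with the surrounding text (the Kronecker delta δ is used exactly for the construction of the new policy) is:

```latex
\textbf{Theorem 1.}\quad
\text{Let } \pi \text{ be a policy and suppose } q_\pi(s^*, a^*) \ge v_\pi(s^*)
\text{ for some state--action pair } (s^*, a^*).
```

```latex
\text{Define } \pi'(a \mid s) \;=\;
\begin{cases}
\delta_{a, a^*} & \text{if } s = s^*, \\[2pt]
\pi(a \mid s) & \text{otherwise.}
\end{cases}
\qquad
\text{Then } v_{\pi'}(s) \;\ge\; v_\pi(s) \text{ for every state } s .
```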
Proof comment: We used Eq. 1 in the 1st line of the proof, and then repeatedly used the fact that the value of the pair of state and action (s*, a*) is greater than or equal to the value of state s*.
Theorem 1 states that whenever there is a pair of state and action (s*, a*) with a value greater than the value of state s* with respect to policy π, there is another policy π’ whose state-values are greater than or equal to those of π in all states. As a result, if an optimal policy π* exists, its values should satisfy, for any state s,
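in symbols (this is the "Equation 2" expanded in the next step):

```latex
v_{\pi^*}(s) \;=\; \max_{a \in \mathcal{A}(s)} q_{\pi^*}(s, a) \tag{2}
```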
where curvy A(s) stands for the set of all possible actions at state s — one can easily prove this statement by contradiction. Using the Bellman equations, we can expand Equation 2 as
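substituting the Bellman equation for q into Equation 2 gives, for every state s:

```latex
v_{\pi^*}(s) \;=\; \max_{a \in \mathcal{A}(s)} \sum_{s', r} p(s', r \mid s, a)\left[\, r + \gamma\, v_{\pi^*}(s') \,\right]. \tag{3}
```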
This set of non-linear equations (as many as the number of states) is called the “Bellman optimality equations”. So, if an optimal policy exists, its values should satisfy this set of equations³.
Therefore, to show that an optimal policy exists, one must prove the following two statements:
- The set of the Bellman optimality equations has solutions, and
- One of its solutions has values greater than or equal to the values of the other solutions in all states.
The existence and the uniqueness of the solution
In this section, we prove that the set of the Bellman optimality equations has a unique solution. By doing so, we prove the two aforementioned statements at the same time.
The Bellman optimality operator
Given a set of values over states, we define the vector of values as
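that is, fixing an ordering s₁, …, s₍|𝒮|₎ of the states:

```latex
V \;=\; \left(\, v(s_1),\; v(s_2),\; \dots,\; v(s_{|\mathcal{S}|}) \,\right)^{\!\top} \;\in\; \mathbb{R}^{|\mathcal{S}|}
```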
which is simply a real-value vector whose elements are equal to the values of different states. Then, we define the “Bellman optimality operator” T as a mapping
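defined component-wise by applying the right-hand side of Equation 3 to an arbitrary value vector:

```latex
T : \mathbb{R}^{|\mathcal{S}|} \to \mathbb{R}^{|\mathcal{S}|},
\qquad
(T V)(s) \;=\; \max_{a \in \mathcal{A}(s)} \sum_{s', r} p(s', r \mid s, a)\left[\, r + \gamma\, V(s') \,\right].
```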
The operator T takes a vector of values and maps it to another vector of values. Using this new notation, it is easy to see that Equations 2 and 3 are equivalent to
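the single fixed-point condition:

```latex
T\, V_{\pi^*} \;=\; V_{\pi^*}
```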
This observation means that the solutions of the Bellman optimality equations are exactly the fixed points of the Bellman optimality operator. Therefore, to prove the existence and uniqueness of the solution to the Bellman optimality equations, one can prove that the Bellman optimality operator has a unique fixed point.
To do so, we need to introduce another concept and another theorem.
Contraction mapping and Banach fixed-point theorem
Consider a metric space (M,d) , i.e. M is a set, and d is a metric defined on this set for computing the distance of every two elements of M ⁵. A mapping T: M → M is a contraction mapping if there exists 0 ≤ k < 1 such that for any x and y in M , we have
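in symbols:

```latex
d\!\left(T(x),\, T(y)\right) \;\le\; k\, d(x, y).
```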
Intuitively, contraction mapping makes points closer to each other. Figure 1 shows an illustration of repeatedly applying a contraction mapping on two points.
The reason that we are interested in contraction mappings is the following famous theorem, known as Banach fixed-point theorem .
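One detail worth making explicit: the theorem requires the metric space to be complete, and ℝ^{|𝒮|} with the infinity-norm metric is complete, so the theorem applies in our setting. The statement is:

```latex
\textbf{Theorem (Banach fixed-point).}\quad
\text{If } (M, d) \text{ is a complete metric space and } T : M \to M \text{ is a contraction mapping,}
```

```latex
\text{then } T \text{ has a unique fixed point } x^* \in M;
\text{ moreover, for any } x_0 \in M, \text{ the iterates } x_{n+1} = T(x_n) \text{ converge to } x^*.
```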
Proof comment: The proof of the theorem is not hard, but I do not include it in this article, because the theorem is well known and the proof can easily be found elsewhere.
The whole idea behind the theorem is what is illustrated in Figure 1: All points are getting closer to each other after mapping, and hence, by repeating the mapping all points converge to one point which is the unique fixed point of T .
As a result, to prove the existence and uniqueness of the solution of the Bellman optimal equations, it is sufficient to show that there is a metric in which the Bellman optimality operator is a contraction mapping.
The Bellman optimality operator is a contraction mapping in infinity norm
For any pair of value vectors V and V', the infinity-norm distance between them is defined as
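the largest component-wise gap:

```latex
\| V - V' \|_{\infty} \;=\; \max_{s \in \mathcal{S}} \left|\, V(s) - V'(s) \,\right|.
```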
In this section, we want to prove that the Bellman optimality operator is a contraction mapping in this norm. To do so, we first need the following technical Lemma.
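Given how it is used in the main proof (to pull a max outside an absolute difference), the lemma is presumably the standard inequality:

```latex
\textbf{Lemma.}\quad
\text{For any functions } f, g \text{ on a finite set } X, \qquad
\left|\, \max_{x \in X} f(x) \;-\; \max_{x \in X} g(x) \,\right|
\;\le\;
\max_{x \in X} \left|\, f(x) - g(x) \,\right|.
```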
Proof comment: Although the Lemma is pretty non-trivial, its proof is not difficult and needs only elementary techniques. I had some fun proving it and thought it might be good to leave its proof as an exercise for interested readers⁶.
Now, having the lemma, we can finally go to our main theorem.
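The claim is that T is a γ-contraction in the infinity norm: for any pair of value vectors V and V',

```latex
\| T V - T V' \|_{\infty} \;\le\; \gamma\, \| V - V' \|_{\infty}.
```

Since 0 ≤ γ < 1, this makes T a contraction mapping, and the Banach fixed-point theorem applies.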
Proof comment: To go from the 2nd to the 3rd line of the proof, we used the Lemma, and to go from 4th to 5th line we used the convexity of the absolute value function. The rest is straightforward.
As a result, the Bellman optimality operator has a unique fixed point⁷, and the Bellman optimality equations have a unique solution. It is easy to show that any greedy policy on a solution of the Bellman optimality equations has values equal to that solution. Therefore, optimal policies exist!
Conclusion
We showed that (1) the values of an optimal policy must satisfy the Bellman optimality equations. We then showed that (2) the solutions to the Bellman optimality equations are the fixed points of the Bellman optimality operator. By showing that (3) the Bellman optimality operator is a contraction mapping in the infinity norm and using (4) the Banach fixed-point theorem, we proved that (5) the Bellman optimality operator has a unique fixed point. As a result, (6) there exist policies that maximize the values of all states at the same time.