V-function and Q-function Explained
In the previous post, we used the Frozen-Lake Environment example to see the limitations of the Cross-Entropy Method. Continuing with the style of this series of gradually adding knowledge, in this post we will present another type of Agent that allows us to solve the previous problem: the so-called Value-based Agents, which store the value function and base their decisions on it. For this purpose, we will present the Bellman equation, one of the central elements of many Reinforcement Learning algorithms and required for calculating the value functions in this post.
Value-based Agents
Remember that the Agent’s goal is to find a sequence of actions that will maximize the return: the sum of rewards (discounted or undiscounted — depending on the value of gamma) during an episode or the entire life of the Agent, depending on the task.
In a continuing task, this sum can be infinite. We can handle this with the help of the discount factor, already introduced in post 2, which lies in the [0, 1] range. The formula for the discounted return G at time step t is as follows:
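$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}$$

where $R_{t+k+1}$ denotes the reward received $k+1$ steps after time step $t$ (this is the standard notation of Sutton and Barto).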
Although the sum is still infinite, if γ < 1 (and the rewards are bounded), then Gt will have a finite value. If γ = 0, the Agent is only interested in the immediate reward and discards the long-term return. Conversely, if γ = 1, the Agent weights all future rewards equally with the immediate reward.
We can rewrite this equation with a recursive relationship:
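$$G_t = R_{t+1} + \gamma \left( R_{t+2} + \gamma R_{t+3} + \dots \right) = R_{t+1} + \gamma\, G_{t+1}$$

That is, the return at time step $t$ is the first reward plus the discounted return from time step $t+1$.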
In short, the Agent must be able to exploit the information expressed by this return G to make its decisions.
Almost all reinforcement learning algorithms executed by the Agents involve estimating value functions — functions of states (or of state-action pairs). These are the so-called Value-based Agents.
A value function estimates how good it is for the Agent to be in a given state (or how good it is to perform a given action in a given state) in terms of the return G. Note that the return G an Agent may obtain depends on the actions it will take. Accordingly, the value functions introduced in post 2 are defined with respect to particular ways of acting, called policies.
The V-function: the value of the state
The first value function we will introduce is the V-function. Generally speaking, we can say that the V-function answers the basic question of “What to expect from here?”.
The state-value function
More formally, the V-function, also referred to as the state-value function, or even the value function, or simply V, measures the goodness of each state. In other words, it measures how good or bad it is to be in a particular state according to the return G obtained when following a policy π.
That is, we can define the V-function as the expected total reward (discounted or undiscounted, depending on the value of gamma) that is obtainable from the state. Formally, the value of Vπ(s) is:
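$$V_\pi(s) = \mathbb{E}_\pi\left[\, G_t \mid S_t = s \,\right]$$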
This describes the expected value of the total return G at time step t, starting from state s at time t and then following policy π. The expectation Eπ[·] is used in this definition because the environment transition function might act in a stochastic way.
To clarify the concept a little more, let’s consider a very simple environment with three states:
- The Agent’s initial state 0
- The final state 1 that the agent is in after executing action “left” from the initial state. The reward obtained from this is r=+1
- The final state 2 that the agent is in after taking action “right.” The reward obtained from this is r=+2
We can represent the environment's state transitions and their rewards as follows:
The environment is always deterministic — every action succeeds and we always start from state 0. Once we reach either state 1 or state 2, the episode ends.
Now, the question is, what is the value of state 0? An important detail is that the value of a state is always calculated in terms of (that is, it depends on) some policy that our Agent follows. Even in a simple environment, our Agent can have different behaviors, each of which will have its own value for state 0. Consider some example policies:
- The Agent always goes left
- The Agent always goes right
- The Agent goes left with a probability of 0.5 and right with a probability of 0.5.
- The Agent goes left with a probability of 0.2 and right with a probability of 0.8.
The value of state 0, V(0), under each of the above policies is as follows (the short code sketch after the list reproduces these numbers):
- The value of state 0 in the case of the “always left” Agent is V(0) = 1.0 (every time it goes left, it obtains +1 and the episode ends).
- For the “always right” agent, the value of state 0 is V(0) = 2.0 (every time it goes right, it obtains +2 and the episode ends).
- For the “0.5 left+0.5 right” Agent, the value is V(0) = 1.0 * 0.5 + 2.0 * 0.5 = 1.5
- For the “0.2 left+0.8 right” Agent, the value is V(0) = 1.0 * 0.2 + 2.0 * 0.8 = 1.8
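As a quick check, here is a minimal sketch of this computation in Python; the action names and reward values simply mirror the toy environment described above and are not part of any library:

```python
# Reward obtained for each action taken from state 0 in the one-step environment.
REWARDS = {"left": 1.0, "right": 2.0}

def v0(policy):
    """V(0) = sum over actions a of pi(a|0) * reward(a)."""
    return sum(prob * REWARDS[action] for action, prob in policy.items())

print(round(v0({"left": 1.0, "right": 0.0}), 2))  # always left  -> 1.0
print(round(v0({"left": 0.0, "right": 1.0}), 2))  # always right -> 2.0
print(round(v0({"left": 0.5, "right": 0.5}), 2))  # 50/50        -> 1.5
print(round(v0({"left": 0.2, "right": 0.8}), 2))  # 20/80        -> 1.8
```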
Since the goal of the Agent is to get as much total reward as possible, the optimal policy for this Agent in this simple one-step Environment is policy 2, the “always right” policy.
But the preceding example may give the false impression that we should “be greedy” and always take the action with the highest immediate reward. Unfortunately, it's not that simple. For example, let's extend our preceding environment with yet another state that is reachable from state 2. State 2 is no longer a terminal state but a transition to state 3, with a bad reward of –10:
The new environment's state transitions and rewards can be represented as:
With that addition, the V(0) for each policy will be (again reproduced by the short sketch after the list):
- For the “always left” policy, it is the same: V(0) = 1.0.
- For the “always right” policy: V(0) = 2.0 + (–10.0) = –8.0
- For the “0.5 left + 0.5 right” policy: V(0) = 1.0 * 0.5 + (2.0 + (–10.0)) * 0.5 = –3.5
- For the “0.2 left + 0.8 right” policy: V(0) = 1.0 * 0.2 + (2.0 + (–10.0)) * 0.8 = –6.2
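The same kind of sketch works for the extended environment, where the “right” branch now also collects the –10 reward from the forced transition to state 3 (again, names and values are just those of the toy example):

```python
# Cumulative (undiscounted) reward collected along each branch of the extended environment.
BRANCH_RETURNS = {"left": 1.0, "right": 2.0 + (-10.0)}  # "right" now drags in the -10

def v0(policy):
    return sum(prob * BRANCH_RETURNS[action] for action, prob in policy.items())

print(round(v0({"left": 1.0, "right": 0.0}), 2))  # always left  ->  1.0
print(round(v0({"left": 0.0, "right": 1.0}), 2))  # always right -> -8.0
print(round(v0({"left": 0.5, "right": 0.5}), 2))  # 50/50        -> -3.5
print(round(v0({"left": 0.2, "right": 0.8}), 2))  # 20/80        -> -6.2
```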
So, the best policy for this new Environment is now policy 1: “always go left”. Notice that once the Agent has chosen the “right” action in state 0, the bad reward is unavoidable, since from state 2 there is only one exit.
This example, based on a naïve Environment, is intended to make the reader aware of the complexity of this optimality problem and to prepare them to appreciate the importance of the Bellman equation.
The Bellman Equation
The Bellman equation shows up everywhere in the Reinforcement Learning literature, being one of the central elements of many Reinforcement Learning algorithms. In summary, we can say that the Bellman equation decomposes the value function into two parts, the immediate reward plus the discounted future values.
This equation simplifies the computation of the value function, such that rather than summing over multiple time steps, we can find the optimal solution of a complex problem by breaking it down into simpler, recursive subproblems and finding their optimal solutions.
To facilitate understanding of the formulation in the following sections, the next diagram shows the naming convention for the variables and their relationships:
In this diagram, P(s', r | s, a) denotes the probability that action a, issued in state s, ends up in state s' with reward r.
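In symbols, using the standard MDP dynamics notation:

$$P(s', r \mid s, a) = \Pr\{\, S_{t+1} = s',\; R_{t+1} = r \mid S_t = s,\; A_t = a \,\}$$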
Bellman equation for the State-value function
We already saw that we can define the discounted return G in recursive terms. Let's now see how we can recursively define the Bellman equation for the state-value function:
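$$V_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s',\, r} P(s', r \mid s, a)\,\bigl[\, r + \gamma\, V_\pi(s') \,\bigr]$$

written here in the standard form, using the probability P and the policy π(a|s) from the notation introduced above.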
This equation tells us how to find the value of a state s. We can intuitively see that it recursively breaks down the value computation into the expected immediate reward (a sum weighted by the policy and transition probabilities) plus the discounted value of the states that follow the current one.
The Bellman equation is important because it gives us the ability to describe the value of a state s in terms of the value of the state s', and with an iterative approach that we will present in the next post, we can calculate the values of all states.
Unfortunately, in most scenarios, we do not know the probability P and the reward r , so we cannot solve MDPs by directly applying the Bellman equation. But do not worry, in the next post we present some alternatives to find them from experience.
The Q-function: The value of the action
In post 2 we extended the definition of the state-value function to state-action pairs, defining a value for each state-action pair, which is called the action-value function, also known as the Q-function or simply Q. It defines the value of taking action a in state s under a policy π, denoted by Qπ(s, a), as the expected return G starting from s, taking the action a, and thereafter following policy π:
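$$Q_\pi(s, a) = \mathbb{E}_\pi\left[\, G_t \mid S_t = s,\; A_t = a \,\right]$$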
Next, let's denote by π(a|s) the probability that the policy π selects action a given the current state s. Again, the expectation Eπ[·] is used in this definition because the environment transition function might act in a stochastic way.
Now that we have defined Qπ(s, a), we can assert that the state-value function is equivalent to the sum of the action-value functions of all actions a outgoing from s, each multiplied by the policy's probability of selecting that action:
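$$V_\pi(s) = \sum_{a} \pi(a \mid s)\, Q_\pi(s, a)$$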
Note that the sum of probabilities of all outbound actions from s is equal to 1:
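$$\sum_{a} \pi(a \mid s) = 1$$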
Bellman equation for the Action-value function
We also have the Bellman equation for the action-value function:
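$$Q_\pi(s, a) = \sum_{s',\, r} P(s', r \mid s, a)\,\Bigl[\, r + \gamma \sum_{a'} \pi(a' \mid s')\, Q_\pi(s', a') \,\Bigr]$$

again in the standard form, with the same notation as before.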
Again, to facilitate understanding of the formulation, we can construct this diagram with the relationships between the variables for the action-value function:
The Bellman equation of optimality
As we will see in the next post, the Bellman equation is used by the algorithms that calculate the optimal value functions.
Bellman optimality equation for the state-value function
Bellman proved that the optimal value of a state is obtained by taking the action that gives us the maximum possible expected immediate reward plus the discounted long-term value of the next state:
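$$V_*(s) = \max_{a} \sum_{s',\, r} P(s', r \mid s, a)\,\bigl[\, r + \gamma\, V_*(s') \,\bigr]$$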
Optimal Policy
The goal of the Agent is to maximize the total cumulative reward in the long run. The policy that maximizes the total cumulative reward is called the optimal policy. Notice that there could be different optimal policies, but they all share the same value functions, the “optimal” value functions.
The Bellman optimality equation not only gives us the best reward that we can obtain, but it also gives us the optimal policy to obtain that reward.
If our Agent knows the value of every state, then it knows how to gather all this reward: at each time step it only needs to select the action that leads to the state with the maximum expected value.
Bellman optimality equation for the action-value function
We can also define V(s) via Q(s,a):
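$$V_*(s) = \max_{a} Q_*(s, a)$$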
This just means that the value of a state equals the value of the best action we can execute from this state.
Bellman also proved that the optimal action-value can be defined recursively as:
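$$Q_*(s, a) = \sum_{s',\, r} P(s', r \mid s, a)\,\bigl[\, r + \gamma \max_{a'} Q_*(s', a') \,\bigr]$$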
That means that, in order to obtain the optimal policy, we can obtain the optimal state-value via the action-value.
To show this, let's look at a simple example of how to obtain V(s) via Q(s,a). Let's consider an environment that is similar to the Frozen-Lake Environment, but with a much simpler structure. We have one initial state (s0) surrounded by four target states, s1, s2, s3, and s4, with rewards +1, +2, +3, and +4 respectively.
Every action is probabilistic, in the same way as in the Frozen-Lake Environment: there is a 33% chance that our action will be executed without modification, a 33% chance that we will slip to the cell to the left of our target cell, and a 33% chance that we will slip to the cell to the right. For simplicity, we use a discount factor γ = 1. The environment's state transitions can be represented as:
Terminal states 1, 2, 3, and 4 have no outbound connections, so the Q-values for those states are equal to 0. Due to this, the values of the terminal states are equal to their immediate rewards according to the previous formula. Then, V(1) = 1, V(2) = 2, V(3) = 3, and V(4) = 4.
Now it is time to calculate the value of state 0. According to the formula, the Q-values of the four actions are:
Q(0, 0) = 0.33 * V(1) + 0.33 * V(2) + 0.33 * V(3)
Q(0, 1) = 0.33 * V(2) + 0.33 * V(3) + 0.33 * V(4)
Q(0, 2) = 0.33 * V(1) + 0.33 * V(3) + 0.33 * V(4)
Q(0, 3) = 0.33 * V(1) + 0.33 * V(2) + 0.33 * V(4)
so,
Q(0, 0) = 0.33 * 1 + 0.33 * 2 + 0.33 * 3 = 1.98
Q(0, 1) = 0.33 * 2 + 0.33 * 3 + 0.33 * 4 = 2.97
Q(0, 2) = 0.33 * 1 + 0.33 * 3 + 0.33 * 4 = 2.64
Q(0, 3) = 0.33 * 1 + 0.33 * 2 + 0.33 * 4 = 2.31
The value of state 0 is the maximum of these Q-values: V(0) = 2.97.
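A minimal Python sketch of this calculation follows; the mapping from each action index to the three terminal states it can land in is a hypothetical layout chosen only to reproduce the numbers above:

```python
# In this toy example, the value of each terminal state equals its immediate reward.
V_TERMINAL = {1: 1.0, 2: 2.0, 3: 3.0, 4: 4.0}

# Hypothetical action -> the three terminal states it can land in
# (intended target plus the two "slip" neighbours), each with probability 0.33.
OUTCOMES = {
    0: [1, 2, 3],
    1: [2, 3, 4],
    2: [1, 3, 4],
    3: [1, 2, 4],
}

q = {a: sum(0.33 * V_TERMINAL[s] for s in states) for a, states in OUTCOMES.items()}
v0 = max(q.values())  # V(0) = max_a Q(0, a)

print({a: round(val, 2) for a, val in q.items()})  # {0: 1.98, 1: 2.97, 2: 2.64, 3: 2.31}
print(round(v0, 2))                                # 2.97
```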
Q-learning methods
Q-values are much more convenient in practice, as it is much simpler for the Agent to make decisions about actions based on Q-values than on V-values. With Q-values, to choose an action in a given state, the Agent just needs to calculate the Q-values for all available actions in the current state and choose the action with the largest Q-value.
To do the same using the values of the states, the Agent needs to know not only the values but also the transition probabilities. In practice, we rarely know them in advance, so the Agent needs to estimate the transition probabilities for every action and state pair.
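As a small illustration of the difference, here is a sketch of both decision rules; the Q-values, state values, and transition model below are made-up placeholders taken from the previous toy example, not part of any real environment:

```python
# Acting from Q-values: pick the action with the largest Q(s, a); no model needed.
q_values = {0: 1.98, 1: 2.97, 2: 2.64, 3: 2.31}           # Q(0, a) from the example above
greedy_action = max(q_values, key=q_values.get)            # -> action 1

# Acting from V-values: we also need the transition model P(s'|s, a) and the rewards,
# because each action must be evaluated as sum_{s'} P(s'|s, a) * (r + gamma * V(s')).
V = {1: 1.0, 2: 2.0, 3: 3.0, 4: 4.0}
P = {0: [1, 2, 3], 1: [2, 3, 4], 2: [1, 3, 4], 3: [1, 2, 4]}   # hypothetical model
gamma, r = 1.0, 0.0   # rewards are folded into the terminal-state values, as above
action_values = {a: sum(0.33 * (r + gamma * V[s]) for s in succ) for a, succ in P.items()}
greedy_action_from_v = max(action_values, key=action_values.get)  # -> also action 1

print(greedy_action, greedy_action_from_v)
```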
This approach gave a name to the whole family of methods called Q-learning that we will review in future posts.
What is next?
In future posts, you will see these formulas in practice by solving the Frozen-Lake environment. However, to be able to do this, we have one important thing still missing: a general way to calculate those V-values and Q-values. In the next post, we will present the Value Iteration method for it. See you in the next post.
For more detail on this part, the reader can review the excellent book Reinforcement Learning: An Introduction, Second Edition, by Richard S. Sutton and Andrew G. Barto.