How do we specify a measure of the quality of a policy?
Finite-horizon
For a given MDP, policy $\pi$, and horizon $h$, we define the “horizon $h$ value” of a state, $V^h_\pi(s)$. This is the expected sum of rewards given that we start in state $s$ and execute policy $\pi$ for $h$ steps.
We do this by induction on the horizon $h$, which is the number of steps left to go. The base case of $h = 0$ is when there are no steps remaining, in which case, no matter what state we’re in, the value is $0$, so:

$$V^0_\pi(s) = 0$$
Note that the indexing here is reversed! $V^h_\pi(s)$ means that there are $h$ steps until the end. Above, $V^0_\pi(s)$ means that we are at the end of the process.
Then, the value for horizon $h$ is the reward for taking the action $\pi(s)$ in state $s$, plus the expected horizon-$(h-1)$ value of the next state. So, starting with horizons $1$ and $2$, we have:

$$V^1_\pi(s) = R(s, \pi(s))$$

$$V^2_\pi(s) = R(s, \pi(s)) + \sum_{s'} T(s, \pi(s), s')\, V^1_\pi(s')$$
Like we mentioned, the summation term is an expected value; we don’t know exactly what state we’re going to end up in but we do know a probability distribution for the possibilities. Recall that:
- $T(s, a, s')$ is the probability of transitioning to state $s'$ from state $s$ after taking action $a$.
- $R(s, a)$ is the reward for taking action $a$ in state $s$.
- In the above, we could replace $a$ with the action chosen by the policy, $\pi(s)$.
In general, this becomes:

$$V^h_\pi(s) = R(s, \pi(s)) + \sum_{s'} T(s, \pi(s), s')\, V^{h-1}_\pi(s')$$
Once again, the sum over $s'$ is an expected value. We are considering all possible next states $s'$ of $s$ and computing the average of their $(h-1)$-horizon values, weighted by the probability that the transition function from state $s$ with the action chosen by the policy, $\pi(s)$, assigns to arriving in state $s'$.
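To make the recursion concrete, here is a minimal Python sketch of finite-horizon policy evaluation. The representation is an illustrative assumption, not fixed by these notes: states as a list, the transition model as nested dictionaries `T[s][a][s2]`, rewards as `R[s][a]`, and the policy `pi` as a dictionary mapping each state to an action.

```python
# Minimal sketch of finite-horizon policy evaluation (illustrative
# representation: T[s][a][s2] = transition probability, R[s][a] = reward,
# pi = dict mapping each state to an action).

def finite_horizon_values(states, T, R, pi, h):
    """Return {s: V^h_pi(s)} computed by the recursion above."""
    V = {s: 0.0 for s in states}  # base case: V^0_pi(s) = 0
    for _ in range(h):
        # V^k(s) = R(s, pi(s)) + sum_{s'} T(s, pi(s), s') * V^{k-1}(s')
        V = {s: R[s][pi[s]] + sum(T[s][pi[s]][s2] * V[s2] for s2 in states)
             for s in states}
    return V

# Tiny made-up two-state MDP, just to show the call.
states = ["A", "B"]
pi = {"A": "go", "B": "stay"}
R = {"A": {"go": 1.0, "stay": 0.0}, "B": {"go": 0.0, "stay": 2.0}}
T = {"A": {"go": {"A": 0.2, "B": 0.8}, "stay": {"A": 1.0, "B": 0.0}},
     "B": {"go": {"A": 1.0, "B": 0.0}, "stay": {"A": 0.0, "B": 1.0}}}
print(finite_horizon_values(states, T, R, pi, 3))
```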
We can say that a policy $\pi_1$ is better than $\pi_2$ for horizon $h$ if $V^h_{\pi_1}(s) \geq V^h_{\pi_2}(s)$ for all $s$, and there exists at least one $s$ such that $V^h_{\pi_1}(s) > V^h_{\pi_2}(s)$.
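Continuing the sketch above (same made-up MDP and representation), this comparison is just a pointwise check of the two value functions:

```python
def better_for_horizon(states, T, R, pi1, pi2, h):
    """True if pi1 is better than pi2 for horizon h: at least as good in
    every state, and strictly better in at least one."""
    V1 = finite_horizon_values(states, T, R, pi1, h)
    V2 = finite_horizon_values(states, T, R, pi2, h)
    return (all(V1[s] >= V2[s] for s in states)
            and any(V1[s] > V2[s] for s in states))

pi_stay = {"A": "stay", "B": "stay"}
print(better_for_horizon(states, T, R, pi, pi_stay, 3))  # is pi better than always staying?
```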
Infinite-horizon
For infinite-horizon, we evaluate a policy $\pi$ in terms of the expected discounted infinite-horizon value that the agent will get in the MDP if it executes that policy.
We define the value of a state $s$ under policy $\pi$ as

$$V_\pi(s) = \mathbb{E}\left[ R(s_0, \pi(s_0)) + \gamma R(s_1, \pi(s_1)) + \gamma^2 R(s_2, \pi(s_2)) + \cdots \mid s_0 = s \right]$$

where $\gamma$ is the discount factor.
The expectation of a linear combination of random variables is the linear combination of the expectations, so

$$V_\pi(s) = R(s, \pi(s)) + \gamma \sum_{s'} T(s, \pi(s), s')\, V_\pi(s')$$
You could write down one of these equations for each of the $n$ states. There are $n$ unknowns $V_\pi(s)$. These are $n$ linear equations, and so it’s easy to solve them using Gaussian elimination to find the value of each state under this policy.
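Rearranging the equation above gives $(I - \gamma T_\pi) V = R_\pi$, where $T_\pi$ and $R_\pi$ are the transition matrix and reward vector under $\pi$. Here is a sketch of that calculation, reusing the illustrative representation from the finite-horizon sketch and letting numpy’s linear solver stand in for hand-rolled Gaussian elimination:

```python
import numpy as np

def infinite_horizon_values(states, T, R, pi, gamma):
    """Solve the linear system (I - gamma * T_pi) V = R_pi, one equation per state."""
    n = len(states)
    idx = {s: i for i, s in enumerate(states)}
    T_pi = np.zeros((n, n))  # T_pi[i, j] = T(s_i, pi(s_i), s_j)
    R_pi = np.zeros(n)       # R_pi[i]   = R(s_i, pi(s_i))
    for s in states:
        R_pi[idx[s]] = R[s][pi[s]]
        for s2, p in T[s][pi[s]].items():
            T_pi[idx[s], idx[s2]] = p
    V = np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)
    return {s: V[idx[s]] for s in states}

print(infinite_horizon_values(states, T, R, pi, gamma=0.9))
```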