Given an MDP, our goal is typically to find a policy that is optimal, in the sense that it gets as much total reward as possible, in expectation over the stochastic transitions that the domain makes. Finite-horizon MDPs have a horizon $H$, which indicates the number of steps of interaction that the agent will have with the MDP.
We want to find an optimal policy, $\pi^*$, such that:

$$\pi^* = \arg\max_{\pi} V^h_{\pi}(s) \quad \text{for all } s \in \mathcal{S} \text{ and all horizons } h,$$

where $V^h_{\pi}(s)$ is the value function for the MDP at the current horizon $h$ and state $s$. At least one such $\pi^*$ exists.
Optimality Considerations
An important observation to make is that, in a finite-horizon problem, the best action doesn't just depend on the current state; it also depends on the horizon. If I have one day left to live, should I behave the same way as if I have 100 years?
Because of the horizon consideration above, the optimal policy changes depending on the horizon. Thus, $\pi^*$ is more like a set of optimal policies, one for each horizon:

$$\pi^* = \{\pi^*_h \mid h = 0, 1, \ldots, H\},$$

where $\pi^*_h$ is an optimal policy to follow when $h$ steps remain.
So how do we find an optimal policy? The naive method is to enumerate all possible policies, calculate their value functions, and pick the best one, but there are exponentially many policies (one action choice per state, per horizon), so this is far too much work.
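For concreteness, here is a rough sketch of that naive enumeration in Python. The representation is assumed for this illustration and is not from the original text: `R[s][a]` is the immediate reward, `T[s][a][s2]` is the transition probability, and the function names (`evaluate`, `brute_force`) are hypothetical.

```python
from itertools import product

# Naive approach (sketch): enumerate every horizon-dependent deterministic
# policy and evaluate each one, keeping the best value found per state.

def evaluate(pi, states, T, R):
    """Finite-horizon policy evaluation: pi is a sequence of action maps,
    where pi[k-1] gives the action map used when k steps remain."""
    V = {s: 0.0 for s in states}                          # V^0 = 0
    for step in pi:                                       # horizons 1, 2, ..., H
        V = {s: R[s][step[s]] +
                sum(T[s][step[s]][s2] * V[s2] for s2 in states)
             for s in states}
    return V

def brute_force(states, actions, T, R, H):
    """Search all (|A|^|S|)^H horizon-dependent policies for the best values."""
    maps = [dict(zip(states, assign))                     # all |A|^|S| action maps
            for assign in product(actions, repeat=len(states))]
    best = {s: float("-inf") for s in states}
    for pi in product(maps, repeat=H):                    # one action map per step
        V = evaluate(pi, states, T, R)
        best = {s: max(best[s], V[s]) for s in states}
    return best
```

Even as a sketch, the cost is visible: the outer loop ranges over $(|\mathcal{A}|^{|\mathcal{S}|})^H$ candidate policies.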
Action-Value Functions
One way to find an optimal policy is to compute an optimal action-value function, $Q$.
We define $Q^h(s, a)$ to be the expected value of future rewards given:
- starting in state $s$,
- executing action $a$, and
- executing for $h - 1$ more steps, with an optimal policy for the appropriate horizon on each step.
Similar to our definition of $V^h_\pi$ for evaluating a policy, we define the $Q^h$ function recursively across the horizon. The only difference is that, on each step with horizon $h$, rather than selecting an action specified by a given policy, we select the value of $a'$ that will maximize the expected value of the next state:

$$Q^0(s, a) = 0$$

$$Q^h(s, a) = R(s, a) + \sum_{s'} T(s, a, s') \max_{a'} Q^{h-1}(s', a')$$
- To clarify, $a'$ is the best action that can be taken given that we are in state $s'$.
- Also, similar to our definition of value functions (see Policy Evaluation), we use a summation over the next states $s'$, weighted by the transition probabilities $T(s, a, s')$, to express the expected value and capture the uncertainty of the probabilistic transitions.
We can solve for the values of $Q^h$ with a simple recursive value iteration algorithm, which just computes $Q^h$ starting from horizon $0$ and working backward to the desired horizon $H$. Given $Q^h$, an optimal policy is easy to find:

$$\pi^*_h(s) \in \arg\max_{a} Q^h(s, a)$$
There may be multiple possible optimal policies.
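Here is a minimal sketch of this value iteration procedure in Python, using the same assumed `T`/`R` representation as above; the function names are illustrative, not a prescribed API.

```python
# Minimal sketch of finite-horizon value iteration.  Assumed representation:
# R[s][a] is the immediate reward, T[s][a][s2] is the probability of landing
# in s2 after taking action a in state s.

def value_iteration(states, actions, T, R, H):
    """Return Q, where Q[h][s][a] stores Q^h(s, a), for h = 0..H."""
    # Base case: with zero steps remaining, no reward can be collected.
    Q = [{s: {a: 0.0 for a in actions} for s in states}]
    for h in range(1, H + 1):
        Q_h = {}
        for s in states:
            Q_h[s] = {}
            for a in actions:
                # Expected value of landing in s2 and then acting optimally
                # for the remaining h - 1 steps.
                future = sum(T[s][a][s2] * max(Q[h - 1][s2].values())
                             for s2 in states)
                Q_h[s][a] = R[s][a] + future
        Q.append(Q_h)
    return Q

def optimal_action(Q, h, s):
    """Any action maximizing Q^h(s, a); ties are why the optimal
    policy need not be unique."""
    return max(Q[h][s], key=Q[h][s].get)
```

Computing the table from $h = 0$ upward and reusing the stored $Q^{h-1}$ values on each sweep is exactly the dynamic-programming idea discussed in the next section.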
Dynamic Programming for MDPs
Most methods for solving MDPs or computing value functions rely on dynamic programming to be efficient. The principle of dynamic programming is to compute and store the solutions to simple sub-problems that can be re-used later in the computation.
Let's consider what would happen if we tried to compute $Q^4(s, a)$ for all $(s, a)$ by directly using the definition:
- To compute $Q^4(s, a)$ for any one $(s, a)$, we would need to compute $Q^3(s, a)$ for all $(s, a)$ pairs.
- To compute $Q^3(s, a)$ for any one $(s, a)$, we'd need to compute $Q^2(s, a)$ for all $(s, a)$ pairs.
- To compute $Q^2(s, a)$ for any one $(s, a)$, we'd need to compute $Q^1(s, a)$ for all $(s, a)$ pairs.
- Luckily, those are just our $R(s, a)$ values.
So, if we have $n$ states and $m$ actions, this is $O((nm)^4)$ work, which is way too much, especially as the horizon increases. But observe that we really only have $4 \cdot n \cdot m$ values that need to be computed, $Q^h(s, a)$ for all $h$, $s$, and $a$. If we start with $h = 1$, compute and store those values, then use and reuse the stored $Q^{h-1}$ values to compute the $Q^h$ values, we can do all this computation in time $O(4 \cdot n^2 \cdot m^2)$ (each of the $4nm$ stored entries takes $O(nm)$ work to compute from the previous layer), which is much better.
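To see the gap concretely, the following toy counters (hypothetical helper names, same $n$ and $m$ as above) compare how many bottom-level Q-value computations the direct expansion triggers against how many entries the dynamic-programming table actually stores.

```python
# Count the work of expanding the definition directly versus filling a DP table.

def naive_calls(n_states, n_actions, h):
    """Number of Q^1 (i.e., R(s, a)) lookups triggered by expanding one
    Q^h entry directly, with no reuse of sub-results."""
    if h <= 1:
        return 1                      # Q^1(s, a) = R(s, a): a single lookup
    # each entry expands into all (s', a') sub-entries at horizon h - 1
    return n_states * n_actions * naive_calls(n_states, n_actions, h - 1)

def dp_entries(n_states, n_actions, h):
    """Number of stored entries the dynamic-programming table needs."""
    return h * n_states * n_actions

n, m, H = 10, 3, 4
print("direct expansion:", n * m * naive_calls(n, m, H))  # (n*m)**H = 810000
print("DP table entries:", dp_entries(n, m, H))            # H*n*m   = 120
```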