It’s more common for the actual horizon of a problem to be unknown. If we tried to take our definition of the horizon-$h$ value function from Finite-horizon MDP Solutions and set $h = \infty$, values for all actions could be infinite, and there would be no way to prefer one action over another.

There are two standard ways to deal with this problem:

  • Take the average reward over all time-steps.
  • Take a discounted infinite horizon.

In the discounted infinite horizon, we select a discount factor $\gamma \in (0, 1)$. On each step, our life might end with probability $1 - \gamma$, giving us an expected lifetime of $1/(1 - \gamma)$. Unlike Finite-horizon MDP Solutions, we don’t need a different policy for each horizon; if we survive today, our expected future lifetime is just as long as it was yesterday.
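
As a quick sanity check on that expected-lifetime claim, here is a minimal simulation sketch (my own illustration, not part of the notes; $\gamma = 0.9$ and the trial count are arbitrary choices):

```python
import random

# Sketch: if the process terminates with probability 1 - gamma on each step,
# the number of steps experienced is geometric with mean 1 / (1 - gamma).
gamma = 0.9          # arbitrary discount factor for illustration
trials = 100_000

def lifetime(gamma):
    """Steps experienced, counting the step on which termination occurs."""
    steps = 1
    while random.random() < gamma:  # survive this step with probability gamma
        steps += 1
    return steps

empirical = sum(lifetime(gamma) for _ in range(trials)) / trials
print(empirical)        # close to 10
print(1 / (1 - gamma))  # 10.0
```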

Instead of trying to find a policy that maximizes expected finite-horizon undiscounted value,

$$\mathbb{E}\left[\sum_{t=0}^{h-1} R_t \;\middle|\; \pi, s_0\right],$$

we will try to find one that maximizes infinite-horizon discounted value:

$$\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_t \;\middle|\; \pi, s_0\right],$$

where $R_t$ is the reward received on step $t$ of the interaction.

Unlike the finite-horizon case and its use of $h$ (the horizon), the indices here are not the number of steps to go, but the number of steps forward from the starting state (there is no notion of “steps to go” in the infinite-horizon case).
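
To see concretely why discounting fixes the problem from the start of this section, here is a small sketch (my own illustration; the constant reward of 1 per step and $\gamma = 0.9$ are arbitrary choices) contrasting the undiscounted and discounted sums:

```python
# Sketch: with a constant reward of 1 per step, the undiscounted sum grows
# without bound, while the discounted sum converges to 1 / (1 - gamma).
gamma = 0.9

undiscounted = 0.0
discounted = 0.0
for t in range(1000):             # 1000 steps stands in for "infinity" here
    undiscounted += 1.0           # sum_t R_t keeps growing
    discounted += gamma**t * 1.0  # sum_t gamma^t R_t converges

print(undiscounted)     # 1000.0, and still growing with more steps
print(discounted)       # ~10.0
print(1 / (1 - gamma))  # 10.0, the limit of the geometric series
```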

Why do we do discounting? Two reasons:

  • In economic terms, you’d generally rather have some money today than that same amount of money next week (because you could use it now or invest it).
  • Think of the whole process as terminating with probability $1 - \gamma$ on each step of the interaction. The discounted value defined above is then the expected total reward the agent would gain under this terminating model.

Finding an optimal policy

The best way of behaving in an infinite-horizon discounted MDP is not time-dependent. At every step, your expected future lifetime, given that you have survived until now, is $1/(1 - \gamma)$.

An important theorem about MDPs is: there exists a stationary optimal policy $\pi^*$ (there may be more than one) such that for all states $s$ and all other policies $\pi$, we have $V^{\pi^*}(s) \geq V^{\pi}(s)$.

Value iteration

Define $Q^*(s, a)$ to be the expected infinite-horizon discounted value of being in state $s$, executing action $a$, and executing an optimal policy thereafter. Using reasoning similar to the recursive definition of $V^\pi$ (see Policy Evaluation), we can express this value recursively as

$$Q^*(s, a) = R(s, a) + \gamma \sum_{s'} T(s, a, s') \max_{a'} Q^*(s', a'),$$

where $T(s, a, s')$ is the probability of transitioning to state $s'$ when taking action $a$ in state $s$.

This is also a set of equations, one for each $(s, a)$ pair. If we knew the optimal action-value function $Q^*$, then we could derive an optimal policy as

$$\pi^*(s) = \arg\max_{a} Q^*(s, a).$$
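
As a tiny illustration of that argmax, here is a sketch with a made-up $Q$ table (the dictionary representation and the state and action names are assumptions for illustration only):

```python
# Sketch: extracting the greedy policy from a Q function stored as a dict
# keyed by (state, action). The numbers are made up for illustration.
states = ["s1", "s2"]
actions = ["left", "right"]
Q = {("s1", "left"): 1.0, ("s1", "right"): 2.5,
     ("s2", "left"): 0.3, ("s2", "right"): 0.1}

def greedy_policy(Q, states, actions):
    """pi(s) = argmax_a Q(s, a)."""
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}

print(greedy_policy(Q, states, actions))  # {'s1': 'right', 's2': 'left'}
```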

We can iteratively solve for the $Q^*$ values with the value iteration algorithm: initialize $Q(s, a)$ arbitrarily (e.g., to zero) for every $(s, a)$ pair, then repeatedly apply the recursive equation above as an update until no entry changes by more than some tolerance $\epsilon$.
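
Below is a minimal sketch of that procedure in Python (my own illustration, not the notes’ original listing; the dictionary-based MDP representation, the function name `value_iteration`, and the tiny two-state example are all assumptions):

```python
# Sketch of value iteration (representation choices are mine):
#   Q[s, a] <- R(s, a) + gamma * sum_s' T(s, a, s') * max_a' Q[s', a']
# repeated until no entry changes by more than eps.

def value_iteration(states, actions, T, R, gamma, eps=1e-6):
    """T[(s, a)] maps next states s' to probabilities; R[(s, a)] is a scalar reward."""
    Q = {(s, a): 0.0 for s in states for a in actions}  # arbitrary initialization
    while True:
        new_Q = {
            (s, a): R[(s, a)] + gamma * sum(
                p * max(Q[(s2, a2)] for a2 in actions)
                for s2, p in T[(s, a)].items()
            )
            for s in states for a in actions
        }
        if max(abs(new_Q[sa] - Q[sa]) for sa in Q) < eps:  # converged?
            return new_Q
        Q = new_Q

# Tiny made-up MDP: two states, two actions.
states = ["A", "B"]
actions = ["stay", "go"]
T = {("A", "stay"): {"A": 1.0}, ("A", "go"): {"B": 1.0},
     ("B", "stay"): {"B": 1.0}, ("B", "go"): {"A": 1.0}}
R = {("A", "stay"): 0.0, ("A", "go"): 0.0,
     ("B", "stay"): 1.0, ("B", "go"): 0.0}

Q_star = value_iteration(states, actions, T, R, gamma=0.9)
print(Q_star)  # e.g. Q*("B", "stay") = 1 / (1 - 0.9) = 10
```

Plugging the result into the argmax above gives the optimal policy (here: take "go" in A and "stay" in B).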

This is essentially just executing the above equation over and over again until we converge to the optimal $Q$-values! Super simple.

Value Iteration Theory

There are a lot of nice theoretical results about value iteration. For some given (not necessarily optimal) $Q$ function, define the greedy policy $\pi_Q(s) = \arg\max_a Q(s, a)$.

  • After executing value iteration with parameter $\epsilon$, $\|V^{\pi_Q} - V^{\pi^*}\|_{\max} < \epsilon$.
  • There is a value of $\epsilon$ such that if $\|Q_{\text{old}} - Q_{\text{new}}\|_{\max} < \epsilon$ on an iteration, then the greedy policy $\pi_{Q_{\text{new}}}$ is optimal.
  • As the algorithm executes, $\|V^{\pi_Q} - V^{\pi^*}\|_{\max}$ decreases monotonically on each iteration.
  • The algorithm can be executed asynchronously, in parallel: as long as all $(s, a)$ pairs are updated infinitely often in an infinite run, it still converges to the optimal value (a small sketch of this asynchronous variant follows the list).
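
The last bullet is easy to see in a small sketch (illustrative only; it reuses the same tiny two-state MDP and an arbitrary number of updates): single $(s, a)$ entries are updated in place, in random order, and the result still approaches the same fixed point as the synchronous sweep.

```python
import random

# Sketch: asynchronous value iteration. One (s, a) entry is updated in place
# per step, in random order; as long as every pair keeps being updated,
# Q still converges to the same fixed point as the synchronous version.
states = ["A", "B"]
actions = ["stay", "go"]
T = {("A", "stay"): {"A": 1.0}, ("A", "go"): {"B": 1.0},
     ("B", "stay"): {"B": 1.0}, ("B", "go"): {"A": 1.0}}
R = {("A", "stay"): 0.0, ("A", "go"): 0.0,
     ("B", "stay"): 1.0, ("B", "go"): 0.0}
gamma = 0.9

Q = {(s, a): 0.0 for s in states for a in actions}
for _ in range(20_000):  # plenty of updates for every pair
    s, a = random.choice(states), random.choice(actions)
    Q[(s, a)] = R[(s, a)] + gamma * sum(
        p * max(Q[(s2, a2)] for a2 in actions) for s2, p in T[(s, a)].items()
    )

print(Q)  # approaches the synchronous answer, e.g. Q[("B", "stay")] ~ 10
```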