How do we specify the measure of the quality of a policy?

Finite-horizon

For a given MDP policy $\pi$ and horizon $h$, we define the “horizon $h$ value” of a state, $V^h_\pi(s)$. This is the expected sum of rewards given that we start in state $s$, and execute policy $\pi$ for $h$ steps.

We do this by induction on the horizon $h$, which is the number of steps left to go. The base case of $h = 0$ is when there are no steps remaining, in which case, no matter what state we’re in, the value is $0$, so:

$$V^0_\pi(s) = 0$$

Note that the indexing here is reversed! $V^h_\pi$ means that there are $h$ steps until the end. Above, $h = 0$ means that we are at the end of the process.

Then, for $h \geq 1$, the value of being in state $s$ with $h$ steps to go is the reward for taking the action $\pi(s)$ in that state, plus the expected horizon-$(h-1)$ value of the next state.

So, starting with horizons $1$ and $2$, we have:

$$V^1_\pi(s) = R(s, \pi(s)) + \sum_{s'} T(s, \pi(s), s')\, V^0_\pi(s') = R(s, \pi(s))$$

$$V^2_\pi(s) = R(s, \pi(s)) + \sum_{s'} T(s, \pi(s), s')\, V^1_\pi(s')$$

Like we mentioned, the summation term is an expected value; we don’t know exactly what state we’re going to end up in, but we do know a probability distribution over the possibilities. Recall that:

  • $T(s, a, s')$ is the probability of transitioning to state $s'$ from state $s$ after taking action $a$.
  • $R(s, a)$ is the reward for taking action $a$ in state $s$.
  • In the above, we could replace $a$ with the action chosen by the policy, $\pi(s)$.

In general, this becomes:

$$V^h_\pi(s) = R(s, \pi(s)) + \sum_{s'} T(s, \pi(s), s')\, V^{h-1}_\pi(s')$$

Once again, the sum over $s'$ is an expected value. We are considering all possible next states $s'$ and computing the average of their $(h-1)$-horizon values, weighted by the probability that the transition function from state $s$, with the action chosen by the policy, $\pi(s)$, assigns to arriving in state $s'$.
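To make the recursion concrete, here is a minimal sketch in NumPy. The two-state, two-action MDP and the names `T`, `R`, and `policy` are invented for illustration; they are not part of the text above.

```python
import numpy as np

# A tiny, made-up MDP with 2 states and 2 actions.
# T[s, a, s'] = probability of moving to s' after taking action a in state s.
T = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions out of state 0
    [[0.5, 0.5], [0.0, 1.0]],   # transitions out of state 1
])
# R[s, a] = reward for taking action a in state s.
R = np.array([
    [1.0, 0.0],
    [0.0, 2.0],
])
# A fixed policy: policy[s] is the action pi(s) chosen in state s.
policy = np.array([0, 1])

def finite_horizon_value(T, R, policy, h):
    """Apply V^k(s) = R(s, pi(s)) + sum_{s'} T(s, pi(s), s') V^{k-1}(s') for k = 1..h."""
    n = T.shape[0]
    V = np.zeros(n)                       # base case: V^0_pi(s) = 0
    for _ in range(h):
        V = np.array([R[s, policy[s]] + T[s, policy[s]] @ V for s in range(n)])
    return V

print(finite_horizon_value(T, R, policy, h=1))  # horizon-1 values: just R(s, pi(s))
print(finite_horizon_value(T, R, policy, h=2))  # horizon-2 values
```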

We can say that a policy $\pi_1$ is better than $\pi_2$ for horizon $h$, if $V^h_{\pi_1}(s) \geq V^h_{\pi_2}(s)$ for all $s$, and there exists at least one $s$ such that $V^h_{\pi_1}(s) > V^h_{\pi_2}(s)$.
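This comparison is just an elementwise check on the two value arrays. A small sketch, assuming `V1` and `V2` hold $V^h_{\pi_1}(s)$ and $V^h_{\pi_2}(s)$ for every state (for example, as computed by the sketch above):

```python
import numpy as np

def is_better(V1, V2):
    """Policy 1 is better than policy 2 for this horizon if its value is at least
    as large in every state and strictly larger in at least one state."""
    V1, V2 = np.asarray(V1), np.asarray(V2)
    return bool(np.all(V1 >= V2) and np.any(V1 > V2))

print(is_better([2.1, 4.0], [2.1, 3.5]))  # True: tied in state 0, strictly better in state 1
print(is_better([2.1, 4.0], [2.5, 3.5]))  # False: worse in state 0
```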

Infinite-horizon

For the infinite-horizon case, we evaluate a policy $\pi$ in terms of the expected discounted infinite-horizon value that the agent will get in the MDP if it executes that policy.

We define the value of a state $s$ under policy $\pi$ as

$$V_\pi(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, \pi(s_t)) \;\middle|\; s_0 = s\right],$$

where $\gamma \in [0, 1)$ is the discount factor.

The expectation of a linear combination of random variables is the linear combination of the expectations, so

$$V_\pi(s) = R(s, \pi(s)) + \gamma \sum_{s'} T(s, \pi(s), s')\, V_\pi(s')$$

You could write down one of these equations for each of the $n$ states. There are $n$ unknowns, the values $V_\pi(s)$. These are linear equations, and so it’s easy to solve them using Gaussian elimination to find the value of each state under this policy.
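Here is a minimal sketch of that linear solve in NumPy, reusing the same invented two-state MDP as in the finite-horizon sketch; the discount factor `gamma = 0.9` is an assumed example value, and `np.linalg.solve` performs the elimination for us.

```python
import numpy as np

# Same invented 2-state, 2-action MDP as in the finite-horizon sketch.
T = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.0, 1.0]],
])
R = np.array([
    [1.0, 0.0],
    [0.0, 2.0],
])
policy = np.array([0, 1])
gamma = 0.9                                  # assumed discount factor

n = T.shape[0]
T_pi = T[np.arange(n), policy]               # T_pi[s, s'] = T(s, pi(s), s')
R_pi = R[np.arange(n), policy]               # R_pi[s]     = R(s, pi(s))

# The n equations V = R_pi + gamma * T_pi @ V rearrange to
# (I - gamma * T_pi) V = R_pi, a linear system a standard solver can handle.
V_pi = np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)
print(V_pi)                                  # value of each state under this policy
```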