A Markov decision process (MDP) is a variation of a state machine in which:
- The transition function is stochastic: it defines a probability distribution over the next state given the previous state and input, and each time it is evaluated it draws a new state from that distribution.
- The output function is the identity; that is, the output is always equal to the state, so the state is directly observable.
- Some states or state-action pairs are more desirable than others
An MDP can be used to model interaction with an outside “world”, such as a single-player game. The idea is that an agent (a robot or a game-player) can model its environment as an MDP and try to choose actions that will drive the process into states that have high scores.
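To make that loop concrete, here is a minimal Python sketch with an entirely made-up two-state world: at each step the agent picks an action, the world samples the next state from the transition distribution, and the agent collects a reward. All names and numbers below are illustrative, not part of any standard API.

```python
import random

# Hypothetical world: transitions[state][action] is a list of
# (probability, next_state, reward) outcomes.
transitions = {
    "cool": {"stay": [(0.9, "cool", 1), (0.1, "hot", 0)],
             "go":   [(0.5, "cool", 2), (0.5, "hot", 0)]},
    "hot":  {"stay": [(1.0, "hot", -1)],
             "go":   [(0.8, "cool", 0), (0.2, "hot", -1)]},
}

def step(state, action):
    """Sample a next state and reward from the distribution for (state, action)."""
    outcomes = transitions[state][action]
    r, cumulative = random.random(), 0.0
    for prob, next_state, reward in outcomes:
        cumulative += prob
        if r < cumulative:
            return next_state, reward
    return outcomes[-1][1:]  # guard against floating-point round-off

state, total_reward = "cool", 0
for _ in range(10):
    action = random.choice(["stay", "go"])  # a very naive agent
    state, reward = step(state, action)
    total_reward += reward
print("score after 10 steps:", total_reward)
```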
Formalization
Formally, an MDP is a tuple $\langle \mathcal{S}, \mathcal{A}, T, R, \gamma \rangle$ where:
- $\mathcal{S}$ is the state space and $\mathcal{A}$ is the action space;
- Transition model: $T : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R}$, where $T(s, a, s') = P(S_{t} = s' \mid S_{t-1} = s, A_{t-1} = a)$, specifying a conditional probability distribution. Uppercase letters above like $S_{t}$ are random variables.
- Reward function: $R : \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$, where $R(s, a)$ specifies how desirable it is to be in state $s$ and take action $a$; and
- Discount factor: $\gamma \in [0, 1]$.
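One straightforward way to encode this tuple in code is sketched below; the class name, dictionary layout, and numbers are all illustrative choices, not a standard library interface.

```python
from dataclasses import dataclass

@dataclass
class MDP:
    states: list   # S, the state space
    actions: list  # A, the action space
    T: dict        # T[(s, a)][s'] = P(S_t = s' | S_{t-1} = s, A_{t-1} = a)
    R: dict        # R[(s, a)] = reward for taking action a in state s
    gamma: float   # discount factor in [0, 1]

# A tiny hypothetical example with two states and two actions.
mdp = MDP(
    states=["s0", "s1"],
    actions=["a0", "a1"],
    T={("s0", "a0"): {"s0": 0.7, "s1": 0.3},
       ("s0", "a1"): {"s0": 0.2, "s1": 0.8},
       ("s1", "a0"): {"s0": 1.0},
       ("s1", "a1"): {"s1": 1.0}},
    R={("s0", "a0"): 0.0, ("s0", "a1"): 1.0,
       ("s1", "a0"): 5.0, ("s1", "a1"): -1.0},
    gamma=0.9,
)

# Each row of T is a distribution over next states, so it must sum to 1.
assert all(abs(sum(dist.values()) - 1.0) < 1e-9 for dist in mdp.T.values())
```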
A policy is a function $\pi : \mathcal{S} \rightarrow \mathcal{A}$ that specifies what action to take in each state.
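For a finite state space, a policy can be as simple as a dictionary from states to actions. The sketch below (reusing the hypothetical `mdp` object defined above) follows a fixed policy and draws one noisy sample of its discounted return, the usual objective $\sum_{t} \gamma^{t} R(s_{t}, a_{t})$ that the discount factor is used to define.

```python
import random

def rollout(mdp, policy, state, horizon=50):
    """Follow `policy` from `state`, returning the discounted sum of rewards."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        action = policy[state]
        total += discount * mdp.R[(state, action)]
        discount *= mdp.gamma
        dist = mdp.T[(state, action)]  # distribution over next states
        state = random.choices(list(dist), weights=list(dist.values()))[0]
    return total

policy = {"s0": "a1", "s1": "a0"}  # one arbitrary policy for the example MDP
print(rollout(mdp, policy, "s0"))
```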
Markov Assumption
The Markov assumption is that the next state depends only on the current state and action, not on the earlier history:

$$P(S_{t} = s_{t} \mid S_{0} = s_{0}, A_{0} = a_{0}, \ldots, S_{t-1} = s_{t-1}, A_{t-1} = a_{t-1}) = P(S_{t} = s_{t} \mid S_{t-1} = s_{t-1}, A_{t-1} = a_{t-1})$$
Not all systems can be modeled in this way! If the next state genuinely depends on more of the history than the current state captures, the model as given is not Markov (although the property can often be restored by enlarging the state to include the relevant history).