A Markov decision process (MDP) is the mathematical framework for reinforcement learning. To understand MDPs, we first consider the basic concept of a Markov process, or Markov chain, a stochastic model of a system that can be characterized by a set of states $\mathcal{S} = \{s_1, \ldots, s_N\}$ of size $N = |\mathcal{S}|$ (the cardinality of $\mathcal{S}$). The dynamics of the system, in terms of its state transitions, are modeled by the transition probability $p_{ij} = P(s_{t+1} = s_j \mid s_t = s_i)$ from the current state $s_i$ to the next state $s_j$, with $\sum_{j=1}^N p_{ij} = 1$. The system is assumed to be memoryless, i.e., its future is independent of the past history $s_0, \ldots, s_{t-1}$, given the present $s_t$:

$$P(s_{t+1} \mid s_t, s_{t-1}, \ldots, s_0) = P(s_{t+1} \mid s_t) \tag{1}$$
For example, if the state of a helicopter is described by its linear and angular position and velocity, from which its future position and velocity can be completely determined, then it can be modeled as a Markov process. But if the state is described only by its position, then its past positions are also needed to determine its future (e.g., to find its velocity), and the process is not Markov.
All $N^2$ transition probabilities can be organized as an $N \times N$ state transition matrix:

$$P = [p_{ij}] = \begin{bmatrix} p_{11} & p_{12} & \cdots & p_{1N} \\ p_{21} & p_{22} & \cdots & p_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ p_{N1} & p_{N2} & \cdots & p_{NN} \end{bmatrix}, \qquad \sum_{j=1}^N p_{ij} = 1 \ \text{for all } i \tag{2}$$
In the following, our discussion mostly concentrates on the general transition from the current state $s$ to the next state $s'$, with transition probability $p(s' \mid s)$.
For example, the daily weather in Los Angeles can be modeled as a Markov chain whose transition matrix gives the probability of tomorrow's weather conditioned only on today's weather.
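As a minimal illustration of these definitions, the sketch below simulates such a two-state weather chain in Python; the transition probabilities used here are made-up placeholders, not values taken from the text.

```python
import numpy as np

# Hypothetical 2-state weather chain: state 0 = "dry", state 1 = "wet".
# The transition probabilities below are illustrative placeholders only.
P = np.array([[0.9, 0.1],    # p(dry|dry), p(wet|dry)
              [0.5, 0.5]])   # p(dry|wet), p(wet|wet)

rng = np.random.default_rng(0)

def simulate(P, s0, steps):
    """Sample a state sequence from a Markov chain with transition matrix P."""
    states = [s0]
    for _ in range(steps):
        s = states[-1]
        # the next state depends only on the current state (Markov property)
        states.append(rng.choice(len(P), p=P[s]))
    return states

print(simulate(P, s0=0, steps=10))
```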
We next consider a Markov reward process (MRP), represented by a tuple $(\mathcal{S}, P, R, \gamma)$ of four elements for the states, state transition probabilities, rewards, and the discount factor. Here the reward $r$ can be considered as a feedback signal to the agent at each time step of the sequence of state transitions, indicating how well it is doing with respect to the overall task. The reward can be negative, as a penalty, for situations to be avoided. The performance of the agent is measured by the sum of rewards accumulated over all time steps of the Markov process, weighted by the discount factor $\gamma \in [0, 1]$ that discounts future rewards. If $\gamma$ is close to 0, then the immediate reward is emphasized (short-sighted or greedy), but if $\gamma$ is close to 1, then rewards in future steps will be almost as valuable as immediate ones (far-sighted).
The sequence of state transitions of an MRP from an initial state $s_0$ to a terminal state $s_T$ is called an episode, and the number of time steps $T$ is called the horizon, which can be either finite or infinite.
Both the reward $r$ and the next state $s'$ resulting from arriving at the current state $s$ are considered as random variables with joint probability $p(s', r \mid s)$, from which both the state transition probability and the state reward probability can be found by marginalization:

$$p(s' \mid s) = \sum_r p(s', r \mid s), \qquad p(r \mid s) = \sum_{s'} p(s', r \mid s) \tag{5}$$
We define the return $G_t$ at each time step $t$ as the accumulated reward: the sum of the immediate reward $r_t$ received upon arriving at the current state $s_t$ and all delayed rewards received in future states, up to a terminal state $s_T$ with reward $r_T$ at the end of the episode, discounted by $\gamma$:

$$G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{T-t} r_T = \sum_{k=0}^{T-t} \gamma^k r_{t+k} \tag{6}$$

The value of a state $s$ is defined as the expected return starting from that state:

$$v(s) = \mathbb{E}[G_t \mid s_t = s] \tag{7}$$

Separating the immediate reward from the discounted return of the subsequent states, we get

$$v(s) = \mathbb{E}[r_t + \gamma\, G_{t+1} \mid s_t = s] = r(s) + \gamma \sum_{s'} p(s' \mid s)\, v(s') \tag{8}$$

where $r(s) = \sum_r r\, p(r \mid s)$ is the expected immediate reward at state $s$.
This equation is called the Bellman equation, by which the value at the current state $s$ is expressed recursively in terms of the immediate reward $r(s)$ and the values $v(s')$ of all possible next states, without explicitly invoking all future rewards. Based on this bootstrap idea of the Bellman equation, a multi-step MRP problem can be expressed recursively as a subproblem concerning only a single state transition from $s$ to $s'$, as a backward induction to find the current state value from all future ones. For this reason, the Bellman equation plays an essential role in the later discussion of some important algorithms in reinforcement learning.
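To make the bootstrap idea concrete, the short sketch below computes the discounted return of a small, made-up reward sequence and checks the one-step recursion $G_t = r_t + \gamma G_{t+1}$ that underlies the Bellman equation.

```python
# Discounted return of an episode and the one-step recursion behind Eq. (8).
# The reward sequence below is made up for illustration only.
gamma = 0.9
rewards = [1.0, 0.0, -2.0, 5.0]          # r_0, r_1, r_2, r_3 (terminal)

def discounted_return(rewards, gamma, t=0):
    """G_t = sum_k gamma^k * r_{t+k} over the remainder of the episode."""
    return sum(gamma**k * r for k, r in enumerate(rewards[t:]))

G0 = discounted_return(rewards, gamma, t=0)
G1 = discounted_return(rewards, gamma, t=1)
assert abs(G0 - (rewards[0] + gamma * G1)) < 1e-12   # G_0 = r_0 + gamma * G_1
print(G0)
```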
The Bellman equation in Eq. (8) holds for all $N$ states $s \in \mathcal{S}$, and the resulting $N$ equations can be expressed in vector form as:

$$\mathbf{v} = \mathbf{r} + \gamma P \mathbf{v} \tag{9}$$

where $\mathbf{v} = [v(s_1), \ldots, v(s_N)]^T$ and $\mathbf{r} = [r(s_1), \ldots, r(s_N)]^T$ are respectively the vectors of state values and expected rewards, and $P$ is the $N \times N$ state transition matrix in Eq. (2). This linear system can be solved directly for the value function:

$$\mathbf{v} = (I - \gamma P)^{-1} \mathbf{r} \tag{10}$$
Alternatively, the Bellman equation in Eq. (8) can also be solved iteratively by the general method of dynamic programming (DP), which solves a multi-stage planning problem by backward induction and finds the value function recursively. We first rewrite the Bellman equation as

$$v(s) = r(s) + \gamma \sum_{s'} p(s' \mid s)\, v(s') \tag{11}$$

and turn it into an iterative update of an estimated value function $v_k$:

$$v_{k+1}(s) = r(s) + \gamma \sum_{s'} p(s' \mid s)\, v_k(s'), \qquad k = 0, 1, 2, \ldots \tag{12}$$

Starting from an arbitrary initial estimate $v_0$, each iteration shrinks the maximum error with respect to the true value function $v$ in Eq. (8):

$$\max_s \big| v_{k+1}(s) - v(s) \big| = \gamma \max_s \Big| \sum_{s'} p(s' \mid s)\big(v_k(s') - v(s')\big) \Big| \le \gamma \max_s \big| v_k(s) - v(s) \big| \tag{13}$$

so that after $k$ iterations the error is reduced by a factor of $\gamma^k$:

$$\max_s \big| v_k(s) - v(s) \big| \le \gamma^k \max_s \big| v_0(s) - v(s) \big| \tag{14}$$

and the iteration converges to the unique solution of Eq. (11):

$$\lim_{k \to \infty} v_k(s) = v(s), \qquad \text{for all } s \in \mathcal{S} \tag{15}$$
In summary, the Bellman equation in Eq. (8) can be solved by either of the two methods in Eqs. (10) and (12), so long as $\gamma < 1$. When the size $N$ of the state space is large, the $O(N^3)$ complexity of the matrix inversion in Eq. (10) is too high, and the iterative method may be more efficient.
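The following sketch contrasts the two solution methods on a small, made-up MRP; the transition matrix, reward vector, and discount factor are illustrative assumptions only.

```python
import numpy as np

# A small, made-up 3-state MRP: transition matrix P, expected rewards r, discount gamma.
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.0, 0.8],
              [0.0, 0.1, 0.9]])
r = np.array([1.0, 0.0, -1.0])
gamma = 0.9

# Method 1 (Eq. (10)): solve the linear system (I - gamma P) v = r directly, O(N^3).
v_direct = np.linalg.solve(np.eye(3) - gamma * P, r)

# Method 2 (Eq. (12)): iterate v <- r + gamma P v until the update is negligible.
v = np.zeros(3)
for _ in range(1000):
    v_prev, v = v, r + gamma * P @ v
    if np.max(np.abs(v - v_prev)) < 1e-10:
        break

print(v_direct, v)   # the two methods agree
```

For a handful of states the direct solve is trivial; the cubic cost of the inversion only matters as $N$ grows, which is when the iteration becomes the more practical choice.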
Based on the definition of the MRP, we further define a Markov decision process (MDP) as an MRP with a decision-making rule called a policy, denoted by $\pi$; following the policy, the agent takes one of the actions $a \in \mathcal{A}(s)$ available at each state $s$ to control the state transition of the process. An MDP can therefore be described by a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$, of which $\mathcal{A}$ is the set of all actions. The policy can be either stochastic, denoted by $\pi(a \mid s)$ as the conditional probability of taking action $a$ in state $s$, or deterministic, denoted by $a = \pi(s)$. A policy is soft if every action available at a state may be taken, i.e., $\pi(a \mid s) > 0$ for all $s \in \mathcal{S}$ and all $a \in \mathcal{A}(s)$. In particular, a policy $\pi$ is $\epsilon$-soft if $\pi(a \mid s) \ge \epsilon / |\mathcal{A}(s)|$ for some small value $\epsilon > 0$.
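As an illustration of the $\epsilon$-soft condition, the sketch below constructs an $\epsilon$-greedy policy (one particular $\epsilon$-soft policy) from a made-up table of action preferences; the sizes and preference values are arbitrary assumptions.

```python
import numpy as np

def epsilon_soft_policy(preferences, eps=0.1):
    """Return pi(a|s): eps/|A| on every action, plus the remaining 1-eps on the preferred one."""
    n_states, n_actions = preferences.shape
    pi = np.full((n_states, n_actions), eps / n_actions)
    pi[np.arange(n_states), preferences.argmax(axis=1)] += 1.0 - eps
    return pi                       # every entry >= eps/|A|, and each row sums to 1

prefs = np.array([[1.0, 0.5, 0.0],  # made-up preferences for 2 states x 3 actions
                  [0.0, 2.0, 1.0]])
print(epsilon_soft_policy(prefs))
```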
As there are $|\mathcal{A}(s)|$ possible actions to take in each state $s$, the total number of deterministic policies is $\prod_{s \in \mathcal{S}} |\mathcal{A}(s)|$. If the number of available actions $|\mathcal{A}|$ is the same for all states, then the total number of policies is $|\mathcal{A}|^{|\mathcal{S}|}$.
Following a given policy $\pi$ from an initial state $s_0$ will result in a sequence of state transitions called the trajectory of the episode:

$$s_0, a_0, r_0 \;\longrightarrow\; s_1, a_1, r_1 \;\longrightarrow\; s_2, a_2, r_2 \;\longrightarrow\; \cdots \;\longrightarrow\; s_T \tag{16}$$
Along the trajectory of an episode, an accumulation of discounted rewards from all states visited will be received. Our goal is to find the optimal policy $\pi^*$, as a sequence of actions, for a given MDP model of the environment in terms of its state transition dynamics and rewards, so that the value $v(s_0)$ at the start state $s_0$, the expectation of the sum of all discounted future rewards, is maximized. This optimization problem can be solved by the method of dynamic programming, and such a process is called planning.
In an MDP, both the next state $s'$ and the reward $r$ are assumed to be random variables with joint probability $p(s', r \mid s, a)$, conditioned on the action $a$ taken and the previous state $s$. The state transition probability becomes

$$p(s' \mid s, a) = \sum_r p(s', r \mid s, a) \tag{19}$$

and the reward probability becomes

$$p(r \mid s, a) = \sum_{s'} p(s', r \mid s, a) \tag{20}$$
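Putting these pieces together, the sketch below samples one episode, a trajectory as in Eq. (16), from a toy MDP whose dynamics are given in a tabular form of $p(s', r \mid s, a)$, following a made-up stochastic policy, and accumulates the discounted return; all numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy MDP dynamics: P[s][a] = list of (probability, next_state, reward) triples,
# i.e. a tabular form of the joint p(s', r | s, a).  All numbers are made up.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.7, 1, 1.0), (0.3, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(0.6, 2, 5.0), (0.4, 1, 0.0)]},
    2: {},                                    # state 2 is terminal
}
pi = {0: [0.2, 0.8], 1: [0.1, 0.9]}           # stochastic policy pi(a|s) for non-terminal states
gamma = 0.9

def run_episode(s=0):
    """Sample one trajectory until the terminal state and return its discounted return."""
    G, discount = 0.0, 1.0
    while P[s]:                               # stop at the terminal state
        a = rng.choice(len(pi[s]), p=pi[s])   # sample an action from the policy
        probs, nexts, rewards = zip(*P[s][a])
        i = rng.choice(len(probs), p=probs)   # sample (s', r) from p(s', r | s, a)
        G += discount * rewards[i]
        discount *= gamma
        s = nexts[i]
    return G

print(run_episode())
```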
We further define two important functions with respect to a given policy $\pi$ of an MDP. The first is the state value function, the expected return when starting from state $s$ and following $\pi$ thereafter:

$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid s_t = s] = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\big[\, r + \gamma\, v_\pi(s') \,\big] \tag{21}$$

The second is the state-action value function, the expected return when starting from state $s$, taking action $a$, and following $\pi$ thereafter:

$$q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid s_t = s, a_t = a] = \sum_{s', r} p(s', r \mid s, a)\,\Big[\, r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a') \,\Big] \tag{22}$$

The state-action value $q_\pi(s, a)$ plays a more important role than $v_\pi(s)$, as it allows the freedom of taking any action independent of the given policy $\pi$, thereby allowing the opportunity to improve an existing policy, e.g., by taking a greedy action to maximize the value $q_\pi(s, a)$. The state-action value function is often abbreviated as the action value or Q-value for convenience in later discussions.
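As a sketch of how Eqs. (21) and (22) can be used, the code below evaluates a given stochastic policy on a toy MDP by iterating the Bellman equation for $v_\pi$, computes the action values from the state values, and extracts a greedy policy. The MDP, its expected rewards $R(s,a)$ (used here in place of the full joint $p(s', r \mid s, a)$), and the policy are all made-up assumptions.

```python
import numpy as np

# Toy MDP in tabular form (all numbers are made-up assumptions):
# P[s, a, s'] = p(s'|s, a),  R[s, a] = expected immediate reward for taking a in s.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[1.0, 0.0], [0.3, 0.7]]])
R = np.array([[ 1.0, 0.0],
              [-1.0, 2.0]])
pi = np.array([[0.5, 0.5],        # pi(a|s): the stochastic policy to be evaluated
               [0.9, 0.1]])
gamma = 0.9

# Iterative policy evaluation of Eq. (21): v <- sum_a pi(a|s) [R(s,a) + gamma sum_s' p(s'|s,a) v(s')]
v = np.zeros(2)
for _ in range(500):
    q = R + gamma * P @ v             # q[s, a] = R(s,a) + gamma * sum_s' p(s'|s,a) v(s')
    v = (pi * q).sum(axis=1)          # v[s]    = sum_a pi(a|s) q(s,a)

greedy = q.argmax(axis=1)             # a greedy policy w.r.t. q, i.e. one policy improvement step
print(v, q, greedy)
```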
As a function of the state-action pair $(s, a)$, the state-action value $q_\pi(s, a)$ can be represented as a table of $|\mathcal{S}|$ rows, one for each state $s \in \mathcal{S}$, and $|\mathcal{A}|$ columns, one for each action $a \in \mathcal{A}$. The Q-value for each state-action pair stored in the table can be updated iteratively by various algorithms for learning the Q-values.
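A minimal sketch of such a table, with a generic placeholder update toward a target value (the specific targets used by actual learning algorithms are discussed later):

```python
import numpy as np

n_states, n_actions = 4, 2                    # made-up sizes for illustration
Q = np.zeros((n_states, n_actions))           # Q-table: one row per state, one column per action

def update(Q, s, a, target, alpha=0.1):
    """Move the stored Q-value for (s, a) a small step toward a given target."""
    Q[s, a] += alpha * (target - Q[s, a])

update(Q, s=2, a=1, target=1.0)
greedy_action = Q[2].argmax()                 # greedy action for state 2, read off the table
print(Q, greedy_action)
```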
In particular, if the policy is deterministic, with $a = \pi(s)$ taken with probability $\pi(a \mid s) = 1$, then we have

$$v_\pi(s) = q_\pi(s, \pi(s)) \tag{23}$$
The figure above illustrates the Bellman equations in Eqs. (21) and (22), showing the bootstrapping of the state value function in the dashed box on the left, and that of the action value function in the dashed box on the right. Specifically, here the word bootstrapping means the iterative method that updates the estimated value at a state $s$ based on the estimated value of the next state $s'$. After taking one of the actions $a$ available in state $s$ based on the policy $\pi$, an immediate reward $r$ is received. However, which state the MDP will transition into as the next state $s'$ is random, depending on $p(s' \mid s, a)$ of the environment but independent of the policy. This uncertainty is represented graphically by a dashed circle, called a chance node, associated with the action value $q_\pi(s, a)$ as the sum of the immediate reward $r$ and the expected value of the next state, i.e., the weighted average of the values $v_\pi(s')$ of all possible next states $s'$.