Model-Based Planning

Given a complete MDP model of the environment in terms of $p(s',r\vert s,a)$, the goal of model-based planning is to find the optimal policy $\pi^*$ that achieves the maximum value at every state:

$\displaystyle v^*(s)=\max_\pi v_\pi(s),\;\;\;\;\;\;\;\;\;\;\;
\pi^*=\argmax_\pi v_\pi(s),\;\;\;\;\;\;\;\;\;\;\;
\pi^*\ge \pi\;\;\;\;\;\;\forall\pi$ (25)

based on the partial order that compares different policies according to the corresponding values:
If $\displaystyle \;\;\;\;\;$ $\displaystyle v_{\pi_1}(s)\ge v_{\pi_2}(s)\ge
\cdots \ge v_{\pi_k}(s), \;\;\;\;\forall s\in S$  
Then $\displaystyle \;\;\;\;$ $\displaystyle \pi_1\ge\pi_2\ge\cdots\ge\pi_k$ (26)

The problem of policy optimization can be addressed based on two subproblems: policy evaluation to measure how good a policy is, and policy improvement to get a better policy for higher values.

One way to solve this optimization problem is brute-force search: enumerate every deterministic policy by considering all $\vert A(s)\vert$ available actions at each of the $N=\vert S\vert$ states $s\in S$, and evaluate each candidate. If we assume for simplicity that the set of available actions is the same in all states, i.e., $A(s_1)=\cdots=A(s_N)=A$, then the number of candidate policies, and hence the computational complexity of this method, is $O(\vert A\vert^{\vert S\vert})$, which is exponential in $\vert S\vert$ and therefore impractical when $N$ is large.
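As a minimal sketch of this brute-force approach (assuming a hypothetical tabular model stored as an $N\times\vert A\vert$ reward array r of $r(s,a)$ and an $N\times\vert A\vert\times N$ transition array P of $P(s'\vert s,a)$, which are not part of the original text), each candidate deterministic policy can be evaluated exactly by solving the linear Bellman equation ${\bf v}_\pi=({\bf I}-\gamma{\bf P}_\pi)^{-1}{\bf r}_\pi$:

\begin{verbatim}
import itertools
import numpy as np

def evaluate_policy(policy, r, P, gamma):
    # Exact policy evaluation: solve v = r_pi + gamma * P_pi v for v.
    N = r.shape[0]
    r_pi = r[np.arange(N), policy]        # r(s, pi(s)) for every state s
    P_pi = P[np.arange(N), policy, :]     # P(s'|s, pi(s)) for every state s
    return np.linalg.solve(np.eye(N) - gamma * P_pi, r_pi)

def brute_force_search(r, P, gamma):
    # Enumerate all |A|^|S| deterministic policies and keep the dominant one.
    N, A = r.shape
    best_v, best_pi = None, None
    for policy in itertools.product(range(A), repeat=N):
        v = evaluate_policy(np.array(policy), r, P, gamma)
        if best_v is None or np.all(v >= best_v):
            best_v, best_pi = v, policy
    return best_pi, best_v
\end{verbatim}

The loop over itertools.product makes the $O(\vert A\vert^{\vert S\vert})$ cost explicit: even a modest MDP with $\vert S\vert=20$ and $\vert A\vert=4$ already requires about $10^{12}$ policy evaluations.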

A more efficient and therefore more practical way to optimize the policy is to do it iteratively. Consider a simpler task of improving upon an existing deterministic policy $\pi$ at a single transition step from the current state $s$ to a next state $s'$. Instead of taking the action $a=\pi(s)$ dictated by the policy, we can improve it by taking the greedy action that maximizes the action value function in Eq. (22), while subsequent steps still follow the old policy $\pi$. Now we get a new policy:

$\displaystyle \pi'(s)=\argmax_a q_\pi(s,a)
=\argmax_a \left[ r(s,a)+\gamma \sum_{s'} P(s'\vert s,a)
v_\pi(s')\right]$ (27)

so that the resulting action value is at least as high as that of the old policy:

$\displaystyle q_\pi(s,\pi'(s))=\max_a q_\pi(s,a)
=\max_a \left[ r(s,a)+\gamma \sum_{s'} P(s'\vert s,a)
v_\pi(s')\right]
\ge
q_\pi(s,\pi(s))=v_\pi(s)$ (28)

where the last equality is due to Eq. (24) for a deterministic policy.
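With the same hypothetical tabular arrays r and P as in the earlier sketch, this greedy improvement step of Eq. (27) reduces to a short computation (a sketch under those assumptions, not code from the text):

\begin{verbatim}
import numpy as np

def greedy_improvement(v_pi, r, P, gamma):
    # Eq. (27): pi'(s) = argmax_a [ r(s,a) + gamma * sum_s' P(s'|s,a) v_pi(s') ]
    q = r + gamma * P @ v_pi      # q_pi(s,a) as an N x |A| array
    return q.argmax(axis=1)       # greedy deterministic policy pi'
\end{verbatim}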

While it is obvious that this greedy method yields a return no lower than before for the single state transition from $s$ to $s'$, we can further prove the following policy improvement theorem: by following this greedy method in all subsequent states, we obtain a new policy $\pi'$ whose values are at least as high as those of the old policy $\pi$ at all states:

$\displaystyle v_{\pi'}(s)\ge v_\pi(s),\;\;\;\;\;\;\;\forall s\in S$ (29)

The proof proceeds by recursively replacing the old policy $\pi$ by the greedy policy $\pi'$ at each subsequent step, one step at a time, applying the inequality of Eq. (28) at every step:

$\displaystyle v_\pi(s)$ $\displaystyle \le$ $\displaystyle q_\pi(s,\pi'(s)) =\max_a q_\pi(s,a)
=\max_a\left[ r(s,a)+\gamma \sum_{s'} P(s'\vert s,a) v_\pi(s')\right]$  
  $\displaystyle =$ $\displaystyle r(s,\pi'(s))+\gamma \sum_{s'} P(s'\vert s,\pi'(s)) v_\pi(s')$  
  $\displaystyle \le$ $\displaystyle r(s,\pi'(s))+\gamma \sum_{s'} P(s'\vert s,\pi'(s)) \max_{a'} q_\pi(s',a')$  
  $\displaystyle =$ $\displaystyle r(s,\pi'(s))+\gamma \sum_{s'} P(s'\vert s,\pi'(s))
\left[ r(s',\pi'(s'))+\gamma \sum_{s''} P(s''\vert s',\pi'(s')) v_\pi(s'')\right]$  
  $\displaystyle =$ $\displaystyle r(s,\pi'(s))+\gamma \sum_{s'} P(s'\vert s,\pi'(s))\, r(s',\pi'(s'))
+\gamma^2 \sum_{s'} P(s'\vert s,\pi'(s)) \sum_{s''} P(s''\vert s',\pi'(s')) v_\pi(s'')$  
  $\displaystyle \le$ $\displaystyle \cdots$  
  $\displaystyle =$ $\displaystyle E_{\pi'}\left[ r(s,\pi'(s))+\gamma\, r(s',\pi'(s'))+\gamma^2 r(s'',\pi'(s''))
+ \cdots \;\Big\vert\; s\right]$  
  $\displaystyle =$ $\displaystyle v_{\pi'}(s)$ (30)

where the expectation $E_{\pi'}[\cdot]$ in the second-to-last step is over the successor states $s',s'',\ldots$ reached by following $\pi'$; i.e., $\pi'\ge\pi$ according to the partial order in Eq. (26).

The optimal state value $v^*(s)$ and action value $q^*(s,a)$ can therefore be characterized by replacing the weighted average over all possible actions at each state ($s$ and $s'$) in the Bellman equations in Eqs. (21) and (22) by a maximum over actions:

$\displaystyle v^*(s)=\max_a q^*(s,a) =\max_a \left(
r(s,a)+\gamma \sum_{s'} P(s'\vert s,a) v^*(s') \right)$ (31)

and

$\displaystyle q^*(s,a)=r(s,a)+\gamma \sum_{s'} P(s'\vert s,a) v^*(s')
=r(s,a)+\gamma \sum_{s'} P(s'\vert s,a) \max_{a'} q^*(s',a')$ (32)
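As a quick sanity check of these two equations, consider a toy MDP (not from the text) with a single state $s$ and two actions that both return to $s$ with probability 1, with rewards $r(s,a_1)=1$ and $r(s,a_2)=2$:

$\displaystyle v^*(s)=\max\left(1+\gamma v^*(s),\; 2+\gamma v^*(s)\right)
=2+\gamma v^*(s)
\;\;\;\;\Rightarrow\;\;\;\;
v^*(s)=\frac{2}{1-\gamma}$

and by Eq. (32), $q^*(s,a_1)=1+\gamma v^*(s)=1+\frac{2\gamma}{1-\gamma}$ while $q^*(s,a_2)=2+\gamma v^*(s)=\frac{2}{1-\gamma}=v^*(s)$, so the optimal policy always takes $a_2$.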

Eqs. (31) and (32) are the Bellman optimality equations, which should hold for all $N$ states in $S=\{s_1,\cdots,s_N\}$ and can be written in vector form:

$\displaystyle {\bf v}^*=\max_a\left({\bf r}_a
+\gamma{\bf P}_a{\bf v}^*\right)$ (33)

where

$\displaystyle {\bf v}^*=\left[\begin{array}{c}v^*(s_1)\\ \vdots\\ v^*(s_N)\end{array}\right],
\;\;\;\;\;\;
{\bf r}_a=\left[\begin{array}{c}r(s_1,a)\\ \vdots\\ r(s_N,a)\end{array}\right],
\;\;\;\;\;\;
{\bf P}_a=\left[\begin{array}{ccc}
P(s_1\vert s_1,a) & \cdots & P(s_N\vert s_1,a)\\
\vdots & \ddots & \vdots \\
P(s_1\vert s_N,a) & \cdots & P(s_N\vert s_N,a)\end{array}\right]$ (34)

This nonlinear equation system can be solved iteratively for ${\bf v}^*$:

$\displaystyle {\bf v}_{n+1} =\max_a\left({\bf r}_a
+\gamma{\bf P}_a{\bf v}_n \right)=B({\bf v}_n)$ (35)

This iteration serves as the value-update step of the value iteration algorithm below for finding the optimal policy. It converges to ${\bf v}^*$ because the operator $B({\bf v})$ is a contraction mapping when $\gamma<1$:
$\displaystyle \vert\vert B({\bf v}_i)-B({\bf v}_j)\vert\vert$ $\displaystyle =$ $\displaystyle \vert\vert\max_{a_i}\left({\bf r}_{a_i}+\gamma{\bf P}_{a_i}{\bf v}_i \right)
-\max_{a_j}\left({\bf r}_{a_j}+\gamma{\bf P}_{a_j}{\bf v}_j \right)\vert\vert$  
  $\displaystyle \le$ $\displaystyle \vert\vert\max_{a_i}
\left({\bf r}_{a_i}+\gamma {\bf P}_{a_i}{\bf v}_i
-{\bf r}_{a_i}-\gamma{\bf P}_{a_i}{\bf v}_j \right)\vert\vert$  
  $\displaystyle =$ $\displaystyle \max_{a_i}\gamma\vert\vert{\bf P}_{a_i} ({\bf v}_i-{\bf v}_j)\vert\vert
\le \max_{a_i}\gamma\vert\vert{\bf P}_{a_i}\vert\vert\; \vert\vert{\bf v}_i-{\bf v}_j\vert\vert$  
  $\displaystyle =$ $\displaystyle \gamma\vert\vert{\bf v}_i-{\bf v}_j\vert\vert<\vert\vert{\bf v}_i-{\bf v}_j\vert\vert$ (36)

where $\vert\vert{\bf P}_a\vert\vert=\vert\vert{\bf P}_a\vert\vert _\infty=1$, as the $\infty$-norm (maximum row sum) of a stochastic matrix is always 1, each of its rows summing to one.
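A minimal value-iteration sketch implementing the update of Eq. (35), again assuming the hypothetical tabular arrays r (of shape $N\times\vert A\vert$) and P (of shape $N\times\vert A\vert\times N$) introduced earlier:

\begin{verbatim}
import numpy as np

def value_iteration(r, P, gamma, tol=1e-8):
    # Iterate v <- B(v) = max_a (r_a + gamma * P_a v), Eq. (35), to a fixed point.
    # r: N x |A| array of r(s,a);  P: N x |A| x N array of P(s'|s,a).
    N, A = r.shape
    v = np.zeros(N)
    while True:
        q = r + gamma * P @ v                # q(s,a) under the current value estimate
        v_new = q.max(axis=1)                # elementwise max over actions: B(v)
        if np.max(np.abs(v_new - v)) < tol:  # contraction guarantees convergence
            return v_new, q.argmax(axis=1)   # approximate v* and its greedy policy
        v = v_new
\end{verbatim}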

(Figure: the Bellman optimality equations)

Given complete information about the MDP model in terms of $p(s',r\vert s,a)$, we can find the optimal policy $\pi^*$ using either policy iteration (PI) or value iteration (VI), both based on dynamic programming (DP).

In the PI method, built on the policy improvement step considered above, the value function is fully evaluated at all states before the policy is updated at all states, as illustrated below:

(Figure: general policy iteration)

(Figure: the alternation of policy evaluation and policy improvement)
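For comparison, here is a sketch of the synchronous PI loop under the same hypothetical tabular model as before: exact policy evaluation by solving a linear system, followed by greedy improvement as in Eq. (27), repeated until the policy stops changing:

\begin{verbatim}
import numpy as np

def policy_iteration(r, P, gamma):
    # Synchronous PI: full policy evaluation, then greedy improvement (Eq. (27)),
    # repeated until the policy no longer changes.
    N, A = r.shape
    policy = np.zeros(N, dtype=int)               # arbitrary initial policy
    while True:
        r_pi = r[np.arange(N), policy]            # r(s, pi(s))
        P_pi = P[np.arange(N), policy, :]         # P(s'|s, pi(s))
        v = np.linalg.solve(np.eye(N) - gamma * P_pi, r_pi)   # policy evaluation
        new_policy = (r + gamma * P @ v).argmax(axis=1)       # policy improvement
        if np.array_equal(new_policy, policy):    # stable policy: optimal
            return v, policy
        policy = new_policy
\end{verbatim}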

This synchronous policy iteration can be modified so that the values at some, or even just one, of the states are evaluated in place before the other state values are updated, and the policy at some, or even just one, of the states is updated in place before the policy at the other states. The result is an asynchronous method called general policy iteration (GPI), to be discussed in detail in the next sections on model-free control.