Control based on Function Approximation

The control algorithms based on approximated value functions also follow the general scheme of generalized policy iteration, as illustrated below:

[Figure PQiterate2.png: generalized policy iteration with the approximated action-value function $\hat{q}(s,a,{\bf w})$]

We note that this is similar to the algorithms for model-free control illustrated in Fig. 1.4, but with the action-value function $q_\pi(s,a)$ replaced by the parameter vector ${\bf w}$ of the approximated action-value function $\hat{q}(s,a,{\bf w})$. In particular, for a linear function approximation, we have:

$\displaystyle \hat{q}(s,a,{\bf w})=\sum_n w_n x_n(s,a)={\bf w}^T{\bf x}(s,a)$ (94)

where ${\bf x}(s,a)$ is the feature vector of the state-action pair $(s,a)$. We need to find the optimal parameter vector ${\bf w}$ that minimizes the objective function, defined as the mean squared error of the approximation:

$\displaystyle J({\bf w})=\frac{1}{2}E_\pi\left[(q_\pi(s,a)-\hat{q}(s,a,{\bf w}))^2\right]
=\frac{1}{2}E_\pi\left[(q_\pi(s,a)-{\bf w}^T{\bf x}(s,a))^2\right]$ (95)

with the gradient vector:
$\displaystyle \triangledown J({\bf w})=\frac{d}{d{\bf w}}J({\bf w})
=-E_\pi\left[(q_\pi(s,a)-\hat{q}(s,a,{\bf w}))\,\triangledown\hat{q}(s,a,{\bf w})\right]
=-E_\pi\left[(q_\pi(s,a)-{\bf w}^T{\bf x}(s,a))\,{\bf x}(s,a)\right]$ (96)
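
To make the linear approximation in Eq. (94) and the gradient in Eq. (96) concrete, here is a small sketch in Python/NumPy (the one-hot feature construction, the toy problem sizes, and the sampled value are illustrative assumptions) that evaluates $\hat{q}(s,a,{\bf w})={\bf w}^T{\bf x}(s,a)$ and numerically verifies the per-sample gradient $-(q_\pi(s,a)-{\bf w}^T{\bf x}(s,a))\,{\bf x}(s,a)$:

```python
import numpy as np

N_STATES, N_ACTIONS = 5, 3            # sizes of a toy discrete problem (assumed)

def features(s, a):
    """One-hot feature vector x(s,a) for a discrete state-action pair."""
    x = np.zeros(N_STATES * N_ACTIONS)
    x[s * N_ACTIONS + a] = 1.0
    return x

def q_hat(s, a, w):
    """Linear approximation q^(s,a,w) = w^T x(s,a) of Eq. (94)."""
    return np.dot(w, features(s, a))

# Numerical check of the per-sample gradient in Eq. (96)
rng = np.random.default_rng(0)
w = rng.normal(size=N_STATES * N_ACTIONS)
s, a, q_true = 2, 1, 0.7              # one sample; q_true stands in for q_pi(s,a)

x = features(s, a)
analytic = -(q_true - q_hat(s, a, w)) * x         # -(q_pi - w^T x) x

eps = 1e-6
numeric = np.zeros_like(w)
for i in range(len(w)):               # finite differences of 0.5*(q_pi - w^T x)^2
    wp, wm = w.copy(), w.copy()
    wp[i] += eps
    wm[i] -= eps
    numeric[i] = (0.5 * (q_true - np.dot(wp, x))**2
                  - 0.5 * (q_true - np.dot(wm, x))**2) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```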

If stochastic gradient descent is used, based on a single sample of the action value $q_\pi(s,a)$ rather than its expectation, then $E_\pi$ can be dropped, and the optimal weight vector ${\bf w}^*$ that minimizes $J({\bf w})$ in Eq. (95) can be learned iteratively:
$\displaystyle {\bf w}_{t+1}={\bf w}_t+\Delta{\bf w}={\bf w}_t-\alpha\triangledown J({\bf w}_t)
={\bf w}_t+\alpha\left[q_\pi(s_t,a_t)-\hat{q}(s_t,a_t,{\bf w}_t)\right]\triangledown\hat{q}(s_t,a_t,{\bf w}_t)
={\bf w}_t+\alpha\left[q_\pi(s_t,a_t)-{\bf w}_t^T{\bf x}(s_t,a_t)\right]{\bf x}(s_t,a_t)$ (97)

where $\Delta{\bf w}$ is the increment of the update:
$\displaystyle \Delta{\bf w}=-\alpha\triangledown J({\bf w}_t)
=\alpha\left[q_\pi(s_t,a_t)-\hat{q}(s_t,a_t,{\bf w}_t)\right]\triangledown\hat{q}(s_t,a_t,{\bf w}_t)
=\alpha\left[q_\pi(s_t,a_t)-{\bf w}_t^T{\bf x}(s_t,a_t)\right]{\bf x}(s_t,a_t)$ (98)
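
A minimal sketch of this update step, assuming NumPy arrays and a linear $\hat{q}$ (the function name and the scalar target standing in for the unknown $q_\pi(s_t,a_t)$, discussed next, are illustrative):

```python
import numpy as np

def sgd_update(w, x, target, alpha):
    """One stochastic gradient step of Eqs. (97)/(98) for a linear q-hat.

    w      -- current weight vector w_t
    x      -- feature vector x(s_t, a_t)
    target -- sample estimate standing in for the unknown q_pi(s_t, a_t)
    alpha  -- step size
    """
    error = target - np.dot(w, x)      # q_pi(s_t,a_t) - w_t^T x(s_t,a_t)
    return w + alpha * error * x       # w_{t+1} = w_t + alpha * error * x(s_t,a_t)
```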

As the true action value $q_\pi(s_t,a_t)$ in the update above is unknown, it needs to be estimated by some target value that depends on the specific method used, such as the Monte Carlo return $G_t$ or a bootstrapped TD target.
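
For example, assuming the commonly used Monte Carlo, SARSA, and Q-learning targets (a sketch only; the helper q_hat and the argument names are illustrative):

```python
import numpy as np

def q_hat(w, x):
    """Linear action-value estimate w^T x(s,a)."""
    return np.dot(w, x)

def mc_target(G_t):
    """Monte Carlo: the complete return observed after (s_t, a_t)."""
    return G_t

def sarsa_target(r, x_next, w, gamma):
    """SARSA (on-policy TD): bootstrap with the next state-action pair actually visited."""
    return r + gamma * q_hat(w, x_next)                        # x_next = x(s_{t+1}, a_{t+1})

def q_learning_target(r, x_next_actions, w, gamma):
    """Q-learning (off-policy TD): bootstrap with the greedy action in s_{t+1}."""
    return r + gamma * max(q_hat(w, x) for x in x_next_actions)
```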

In summary, the general model-free control based on an approximated action-value function conceptually (if not necessarily algorithmically) repeats three steps: train the parameter vector ${\bf w}$ from samples collected while following the current policy $\pi$ (policy evaluation); compute the Q-values from the resulting approximation $\hat{q}(s,a,{\bf w})$; and improve $\pi$ with respect to these Q-values, e.g., $\epsilon$-greedily (policy improvement).

These steps are also illustrated below:

$\displaystyle \mbox{Training of ${\bf w}$}\;\Longrightarrow\;\mbox{Q-value}\;\Longrightarrow\;\mbox{Policy $\pi$}$ (109)
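
Putting these steps together, below is a sketch of an episodic semi-gradient SARSA control loop realizing this cycle; the environment interface (env.reset() returning a state, env.step() returning next state, reward, and a termination flag), the features(s, a) helper, and all parameter values are illustrative assumptions:

```python
import numpy as np

def epsilon_greedy(w, s, n_actions, features, epsilon, rng):
    """Policy improvement step: act greedily w.r.t. q^(s,.,w) with probability 1 - epsilon."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    q_values = [np.dot(w, features(s, a)) for a in range(n_actions)]
    return int(np.argmax(q_values))

def semi_gradient_sarsa(env, features, n_features, n_actions,
                        episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Model-free control with a linear q^(s,a,w): train w, read off Q-values, improve pi."""
    rng = np.random.default_rng(seed)
    w = np.zeros(n_features)
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(w, s, n_actions, features, epsilon, rng)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            x = features(s, a)
            if done:
                target = r                                  # no bootstrapping at the terminal step
            else:
                a_next = epsilon_greedy(w, s_next, n_actions, features, epsilon, rng)
                target = r + gamma * np.dot(w, features(s_next, a_next))   # SARSA target
            w += alpha * (target - np.dot(w, x)) * x        # weight update of Eqs. (97)/(98)
            if not done:
                s, a = s_next, a_next
    return w
```

Here the weight update plays the role of policy evaluation, while the $\epsilon$-greedy action selection with respect to $\hat{q}(s,a,{\bf w})$ plays the role of policy improvement.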