The control algorithms based on approximated value functions also follow the general approach of general policy iteration, as illustrated below:
We note that this is similar to the algorithms for model-free control illustrated in Fig. 1.4, but with the action-value function $q(S,A)$ replaced by the parameter vector $\mathbf{w}$ of the approximate action-value function $\hat{q}(S,A,\mathbf{w})$. In particular, for a linear function, we have:
$$\hat{q}(S,A,\mathbf{w}) = \mathbf{x}(S,A)^\top \mathbf{w} = \sum_{j=1}^{n} x_j(S,A)\,w_j \tag{96}$$

$$\Delta\mathbf{w} = \alpha\left(q_\pi(S,A) - \hat{q}(S,A,\mathbf{w})\right)\nabla_{\mathbf{w}}\hat{q}(S,A,\mathbf{w})
 = \alpha\left(q_\pi(S,A) - \hat{q}(S,A,\mathbf{w})\right)\mathbf{x}(S,A) \tag{98}$$
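As an illustration of Eqs. (96) and (98), here is a minimal NumPy sketch of a linear action-value approximator and its update toward a generic target; the feature vector, its dimensionality, and the step size below are illustrative assumptions, not taken from the text:

```python
import numpy as np

def q_hat(x_sa, w):
    """Linear action-value estimate q_hat(S,A,w) = x(S,A)^T w  (Eq. 96)."""
    return x_sa @ w

def linear_update(w, x_sa, target, alpha=0.1):
    """Move w toward a given target; for a linear q_hat the gradient is x(S,A)  (Eq. 98)."""
    return w + alpha * (target - q_hat(x_sa, w)) * x_sa

# Usage with a hypothetical 4-dimensional feature vector
w = np.zeros(4)
x_sa = np.array([1.0, 0.0, 0.5, 0.0])
w = linear_update(w, x_sa, target=1.0)
```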
In the Monte Carlo method, the true $q_\pi(S_t,A_t)$ is replaced by the sample return $G_t$ as the target, obtained at the end of each episode:

$$\Delta\mathbf{w} = \alpha\left(G_t - \hat{q}(S_t,A_t,\mathbf{w})\right)\nabla_{\mathbf{w}}\hat{q}(S_t,A_t,\mathbf{w}) \tag{99}$$
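A sketch of how the Monte Carlo target of Eq. (99) might be applied with a linear $\hat{q}$, assuming the episode has been collected as a list of (feature vector, reward) pairs; the data layout, step size, and discount are assumptions for illustration:

```python
import numpy as np

def mc_episode_update(w, episode, alpha=0.1, gamma=1.0):
    """Gradient Monte Carlo update (Eq. 99): after the episode ends, each
    visited step is updated toward its sample return G_t.
    `episode` is a list of (x_sa, reward) pairs in time order."""
    G = 0.0
    targets = []
    for x_sa, r in reversed(episode):   # compute returns backward in time
        G = r + gamma * G
        targets.append((x_sa, G))
    for x_sa, G in targets:             # apply the incremental updates
        w = w + alpha * (G - x_sa @ w) * x_sa
    return w
```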
In the TD method, the action-value function $q_\pi(S_t,A_t)$ is replaced by the TD target, the sum of the immediate reward $R_{t+1}$, available at each step of each episode, and the discounted approximate action value at the next state $\hat{q}(S_{t+1},A_{t+1},\mathbf{w})$:

$$\Delta\mathbf{w} = \alpha\left(R_{t+1} + \gamma\,\hat{q}(S_{t+1},A_{t+1},\mathbf{w}) - \hat{q}(S_t,A_t,\mathbf{w})\right)\nabla_{\mathbf{w}}\hat{q}(S_t,A_t,\mathbf{w}) \tag{100}$$

Following Eq. (71), the TD error is defined as:

$$\delta_t = R_{t+1} + \gamma\,\hat{q}(S_{t+1},A_{t+1},\mathbf{w}) - \hat{q}(S_t,A_t,\mathbf{w}) \tag{101}$$

so that the update above can be written compactly as

$$\Delta\mathbf{w} = \alpha\,\delta_t\,\nabla_{\mathbf{w}}\hat{q}(S_t,A_t,\mathbf{w}) \tag{102}$$
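The TD update of Eqs. (100)-(102) amounts to one semi-gradient SARSA(0) step. A minimal sketch for the linear case, with illustrative argument names and step size:

```python
import numpy as np

def sarsa_step(w, x_sa, r, x_next_sa, alpha=0.1, gamma=0.99, done=False):
    """Semi-gradient SARSA(0) update (Eqs. 100-102) for a linear q_hat.
    x_sa, x_next_sa: feature vectors of the current and next (state, action)."""
    q = x_sa @ w
    q_next = 0.0 if done else x_next_sa @ w
    delta = r + gamma * q_next - q      # TD error (Eq. 101)
    return w + alpha * delta * x_sa     # gradient of linear q_hat is x(S,A) (Eq. 102)
```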
In the forward-view version of the TD($\lambda$) method, the action-value function is approximated by using the $\lambda$-return $q_t^\lambda$ as the target, available only at the end of each episode:

$$q_t^\lambda = (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}\,q_t^{(n)} \tag{103}$$

$$\Delta\mathbf{w} = \alpha\left(q_t^\lambda - \hat{q}(S_t,A_t,\mathbf{w})\right)\nabla_{\mathbf{w}}\hat{q}(S_t,A_t,\mathbf{w}) \tag{104}$$
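The $\lambda$-return targets of Eqs. (103)-(104) can be computed once the episode has ended. The sketch below uses the standard backward recursion $q_t^\lambda = R_{t+1} + \gamma\left[(1-\lambda)\,\hat{q}(S_{t+1},A_{t+1},\mathbf{w}) + \lambda\,q_{t+1}^\lambda\right]$, which is equivalent to the explicit sum in Eq. (103); the array layout is an assumption for illustration:

```python
import numpy as np

def lambda_returns(rewards, q_values, lam=0.9, gamma=0.99):
    """Forward-view lambda-return targets, computed at episode end.
    rewards[t]  : R_{t+1} received after step t
    q_values[t] : q_hat(S_t, A_t, w) for each step of the episode"""
    T = len(rewards)
    targets = np.zeros(T)
    g = 0.0                                   # lambda-return beyond the terminal step
    for t in reversed(range(T)):
        q_next = q_values[t + 1] if t + 1 < T else 0.0
        g = rewards[t] + gamma * ((1 - lam) * q_next + lam * g)
        targets[t] = g
    return targets
```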
The backward-view version of the TD($\lambda$) method based on eligibility traces is more advantageous in both space and time complexity as well as learning efficiency. Analogous to Eq. (70), we first define an eligibility trace vector $E_t$, which is set to zero at the beginning of the episode and thereafter decays by a factor $\gamma\lambda$ at every step while accumulating the gradient of the current estimate:

$$E_0 = \mathbf{0} \tag{105}$$

$$E_t = \gamma\lambda\,E_{t-1} + \nabla_{\mathbf{w}}\hat{q}(S_t,A_t,\mathbf{w}) \tag{106}$$
Recall the TD error for the backward view of the TD($\lambda$) method first given in Eq. (71):

$$\delta_t = R_{t+1} + \gamma\,\hat{q}(S_{t+1},A_{t+1},\mathbf{w}) - \hat{q}(S_t,A_t,\mathbf{w}) \tag{107}$$

At every step, the weights are updated in proportion to both this TD error and the eligibility trace:

$$\Delta\mathbf{w} = \alpha\,\delta_t\,E_t \tag{108}$$
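Putting Eqs. (105)-(108) together, one step of the backward-view method for a linear $\hat{q}$ might look like the following sketch (argument names, step size, and discount are illustrative assumptions):

```python
import numpy as np

def sarsa_lambda_step(w, e, x_sa, r, x_next_sa,
                      alpha=0.1, gamma=0.99, lam=0.9, done=False):
    """One backward-view SARSA(lambda) step for a linear q_hat (Eqs. 105-108).
    e is the eligibility trace vector, reset to zeros at the start of each episode."""
    e = gamma * lam * e + x_sa              # trace decay plus accumulated gradient (Eq. 106)
    q = x_sa @ w
    q_next = 0.0 if done else x_next_sa @ w
    delta = r + gamma * q_next - q          # TD error (Eq. 107)
    w = w + alpha * delta * e               # update all eligible components (Eq. 108)
    return w, e
```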
In summary, here are the conceptual (not necessarily algorithmic) steps for general model-free control based on an approximated action-value function:

$$\text{policy evaluation: } \hat{q}(\cdot,\cdot,\mathbf{w}) \approx q_\pi, \qquad
\text{policy improvement: } \pi = \epsilon\text{-greedy}(\hat{q}) \tag{109}$$
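As a rough illustration of these conceptual steps, the sketch below combines the backward-view updates above with $\epsilon$-greedy policy improvement into one control loop. The environment interface (`reset()`/`step()`) and the feature function `features(s, a)` are hypothetical assumptions introduced only for this example, not part of the original text:

```python
import numpy as np

def epsilon_greedy(w, features, state, n_actions, eps=0.1):
    """Policy improvement step: epsilon-greedy with respect to the current q_hat(s,.,w)."""
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    q = np.array([features(state, a) @ w for a in range(n_actions)])
    return int(np.argmax(q))

def sarsa_lambda_control(env, features, n_features, n_actions,
                         episodes=500, alpha=0.1, gamma=0.99, lam=0.9, eps=0.1):
    """Model-free control with a linear q_hat: backward-view SARSA(lambda) as the
    approximate policy-evaluation step and epsilon-greedy improvement (Eq. 109)."""
    w = np.zeros(n_features)
    for _ in range(episodes):
        e = np.zeros(n_features)                    # reset trace (Eq. 105)
        s = env.reset()
        a = epsilon_greedy(w, features, s, n_actions, eps)
        done = False
        while not done:
            s_next, r, done = env.step(a)           # hypothetical environment interface
            a_next = epsilon_greedy(w, features, s_next, n_actions, eps)
            x = features(s, a)
            x_next = features(s_next, a_next)
            q_next = 0.0 if done else x_next @ w
            delta = r + gamma * q_next - x @ w      # TD error (Eq. 107)
            e = gamma * lam * e + x                 # trace update (Eq. 106)
            w = w + alpha * delta * e               # weight update (Eq. 108)
            s, a = s_next, a_next
    return w
```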