The MC and TD methods considered previously can be unified by the $n$-step TD and TD($\lambda$) methods, which span a spectrum of which the MC and TD methods are two special cases at the opposite extremes.
Recall that both the MC and TD algorithms iteratively update the value function being estimated and the policy being improved based on Eq. (49), but they estimate the target in the equation differently. In an MC algorithm, the target is the actual return $G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1} R_T$, the sum of the discounted rewards of all future steps up to the terminal state, available only at the end of each episode. In a TD algorithm, on the other hand, the target is $R_{t+1} + \gamma V(S_{t+1})$, the sum of the immediate reward $R_{t+1}$ available at each step of the episode and the discounted value of the next state, based on bootstrapping. We therefore see that an MC algorithm updates the value function and policy once every episode, while a TD algorithm updates them once every step of the episode.
As a trade-off between the MC and TD methods, the $n$-step TD algorithm can be considered a generalization of the TD method, where the target in Eq. (49) is an $n$-step return $G_t^{(n)}$, the sum of the discounted rewards of the $n$ subsequent steps and the discounted value function of the state reached after those steps, with $n = 1, 2, \ldots$:
$$G_t^{(1)} = R_{t+1} + \gamma V(S_{t+1}) \tag{57}$$
$$G_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2}) \tag{58}$$
$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n}) \tag{59}$$
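To make the definition concrete, the following is a minimal sketch (in Python, with hypothetical names) of how an $n$-step return could be computed from a recorded trajectory; with $n = 1$ it reduces to the TD target of Eq. (57), and once $n \ge T - t$ it becomes the full MC return.

```python
def n_step_return(rewards, values, t, n, gamma):
    """Compute the n-step return G_t^(n) from time step t.

    rewards: list of rewards R_1, ..., R_T (rewards[k] is R_{k+1})
    values:  current state-value estimates V(S_0), ..., V(S_{T-1})
    t:       time step at which the return is computed
    n:       number of reward terms to include
    gamma:   discount factor
    """
    T = len(rewards)                     # terminal time step
    G = 0.0
    # Sum the discounted rewards R_{t+1}, ..., R_{t+n} (or up to R_T).
    for k in range(min(n, T - t)):
        G += (gamma ** k) * rewards[t + k]
    # Bootstrap from V(S_{t+n}) only if the episode has not ended yet.
    if t + n < T:
        G += (gamma ** n) * values[t + n]
    return G
```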
Based on these $n$-step returns with different values of $n$, the $n$-step TD algorithm can be further generalized to a more computationally advantageous and therefore more useful algorithm called TD($\lambda$), by which the MC and TD algorithms are again unified as two special cases at the opposite ends of a spectrum.
We first define the $\lambda$-return $G_t^\lambda$ as the weighted average of all $n$-step returns $G_t^{(n)}$ for $n = 1, \ldots, T-t$:

$$G_t^\lambda = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)} \tag{60}$$
$$(1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} = 1 \tag{61}$$
$$G_t^\lambda = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G_t^{(n)} + \lambda^{T-t-1} G_t \tag{62}$$
$$(1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} + \lambda^{T-t-1} = 1 \tag{63}$$

where Eqs. (61) and (63) verify that the weights sum to one in the infinite-horizon and episodic forms, respectively.
Again consider two special cases:

$$\lambda = 0:\qquad G_t^0 = G_t^{(1)} = R_{t+1} + \gamma V(S_{t+1}) \tag{64}$$
$$\lambda = 1:\qquad G_t^1 = G_t^{(T-t)} = G_t \tag{65}$$

i.e., the $\lambda$-return reduces to the TD(0) target when $\lambda = 0$ and to the MC return when $\lambda = 1$.
In Eq. (62), $G_t^\lambda$ is calculated as the weighted sum of all $n$-step returns $G_t^{(n)}$, up to the last one $G_t^{(T-t)} = G_t$ available at the end of the episode.
This summation can be truncated to include fewer terms, up to some horizon $h < T$ before reaching the terminal state at the end of the episode:

$$G_t^{\lambda,h} = (1-\lambda) \sum_{n=1}^{h-t-1} \lambda^{n-1} G_t^{(n)} + \lambda^{h-t-1} G_t^{(h-t)} \tag{66}$$
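As an illustration, the $\lambda$-return of Eq. (62) could be computed by weighting the $n$-step returns of an episode as in the following sketch (Python, reusing the hypothetical n_step_return helper above):

```python
def lambda_return(rewards, values, t, gamma, lam):
    """Compute the lambda-return G_t^lambda of Eq. (62) for one episode.

    rewards: list of rewards R_1, ..., R_T (rewards[k] is R_{k+1})
    values:  current state-value estimates V(S_0), ..., V(S_{T-1})
    """
    T = len(rewards)
    # Weighted sum of the n-step returns for n = 1, ..., T-t-1.
    G = sum((1.0 - lam) * lam ** (n - 1) *
            n_step_return(rewards, values, t, n, gamma)
            for n in range(1, T - t))
    # Final term: the full MC return G_t, with weight lambda^(T-t-1).
    G += lam ** (T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)
    return G
```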
Based on the $\lambda$-return, the iterative update of the value function $V(S_t)$ in Eq. (52) can be modified to:

$$\delta_t^\lambda = G_t^\lambda - V(S_t) \tag{67}$$
$$V(S_t) \leftarrow V(S_t) + \alpha\, \delta_t^\lambda = V(S_t) + \alpha \left[ G_t^\lambda - V(S_t) \right] \tag{68}$$

This is the forward view of TD($\lambda$): forming the target $G_t^\lambda$ requires looking forward to the rewards of all subsequent steps, so the update can only be completed at the end of the episode.
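The following sketch (Python, hypothetical names, built on the lambda_return helper above) shows how the forward-view update of Eq. (68) might be applied offline, once per episode, to every visited state:

```python
def forward_view_td_lambda(episode_states, rewards, values, alpha, gamma, lam):
    """Offline forward-view TD(lambda) update for one recorded episode.

    episode_states: visited states S_0, ..., S_{T-1}
    rewards:        rewards R_1, ..., R_T
    values:         dict mapping state -> current value estimate V(s)
    """
    T = len(rewards)
    # Snapshot of V(S_0), ..., V(S_{T-1}) used for bootstrapping the targets.
    v = [values[s] for s in episode_states]
    for t, s in enumerate(episode_states):
        # Target is the lambda-return G_t^lambda of Eq. (62).
        G = lambda_return(rewards, v, t, gamma, lam)
        # Forward-view update of Eq. (68).
        values[s] += alpha * (G - values[s])
    return values
```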
An alternative method is the backward view of TD($\lambda$), which can be shown to be equivalent to the forward view of TD($\lambda$), but is similar to the TD(0) method in that both are based on the immediate reward $R_{t+1}$ available at each time step of the episode, and is therefore more computationally convenient.
Specifically, the backward view of TD($\lambda$) is based on the eligibility trace $E_t(s)$ maintained for each of the states, which is incremented whenever the state is visited and decays exponentially upon each state transition:

$$E_0(s) = 0, \qquad E_t(s) = \gamma \lambda E_{t-1}(s) + \mathbf{1}(S_t = s) \tag{69}$$
Now the iterative update of the estimated value function in Eq. (52) is modified so that all states, instead of only the one currently visited, are updated, but to different extents based on $E_t(s)$:

$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$
$$V(s) \leftarrow V(s) + \alpha\, \delta_t\, E_t(s), \qquad \text{for all } s$$
The eligibility trace as defined above is motivated by the frequency and recency of the visits to each state. If a state $s$ has been visited more frequently and more recently than others, its $E_t(s)$ is greater, and its value function $V(s)$ will be updated by a greater increment.
Here is the pseudo code for the backward view of the TD($\lambda$) method for policy evaluation:
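A minimal tabular sketch along these lines in Python (the env.reset()/env.step() interface, the policy callable, and the parameter defaults are assumptions):

```python
import collections

def td_lambda_evaluation(env, policy, num_episodes, alpha=0.1, gamma=0.99, lam=0.9):
    """Backward-view TD(lambda) policy evaluation with accumulating traces."""
    V = collections.defaultdict(float)        # state-value estimates V(s)
    for _ in range(num_episodes):
        E = collections.defaultdict(float)    # eligibility traces E(s), reset per episode
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # One-step TD error: delta_t = R_{t+1} + gamma * V(S_{t+1}) - V(S_t).
            target = reward if done else reward + gamma * V[next_state]
            delta = target - V[state]
            # Accumulating trace for the visited state (Eq. 69).
            E[state] += 1.0
            # Update every state in proportion to its eligibility trace.
            for s in list(E):
                V[s] += alpha * delta * E[s]
                E[s] *= gamma * lam           # exponential decay of the trace
            state = next_state
    return V
```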
Again consider two special cases. When $\lambda = 0$, the trace reduces to $E_t(s) = \mathbf{1}(S_t = s)$, so only the currently visited state is updated and the method reduces to TD(0). When $\lambda = 1$, the trace decays only by the discount factor $\gamma$, and the updates accumulated over the episode are equivalent to those of the MC method.
This backward view of the TD($\lambda$) method can be applied to model-free control when the state value function $V(s)$ is replaced by the action value function $Q(s, a)$. Corresponding to the SARSA and Q-learning algorithms based on TD(0) in the previous section, here are the two algorithms based on eligibility traces, SARSA($\lambda$) and Q($\lambda$):
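As one illustration, here is a minimal sketch of a SARSA($\lambda$) control loop in Python, an on-policy extension of the evaluation sketch above (the ε-greedy helper, the action list, and the env interface are assumptions); Watkins's Q($\lambda$) differs mainly in bootstrapping from the greedy action and resetting the traces after exploratory actions.

```python
import random
import collections

def sarsa_lambda(env, actions, num_episodes, alpha=0.1, gamma=0.99, lam=0.9, epsilon=0.1):
    """SARSA(lambda): on-policy control with accumulating eligibility traces."""
    Q = collections.defaultdict(float)        # action-value estimates Q(s, a)

    def epsilon_greedy(state):
        # Explore with probability epsilon, otherwise act greedily w.r.t. Q.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        E = collections.defaultdict(float)    # traces E(s, a), reset per episode
        state = env.reset()
        action = epsilon_greedy(state)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(next_state)
            # SARSA target uses the action actually selected by the policy.
            target = reward if done else reward + gamma * Q[(next_state, next_action)]
            delta = target - Q[(state, action)]
            E[(state, action)] += 1.0
            # Update all state-action pairs in proportion to their traces.
            for sa in list(E):
                Q[sa] += alpha * delta * E[sa]
                E[sa] *= gamma * lam
            state, action = next_state, next_action
    return Q
```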