Temporal Difference (TD) Algorithms

The temporal difference (TD) method is a combination of the MC method considered above and the bootstrapping DP method based on the Bellman equation. The main difference between the TD and MC methods is the target, the return $G$, used in the incremental average in Eq. (49) for estimating either the state or action value functions. While in the MC method the target is the actual return $G_t$, calculated at the end of the episode when all subsequent rewards $r_{t+1},\cdots,r_T$ are available, in the TD method the target, called the TD target, is the sum of the immediate reward $r_{t+1}$ and the discounted, previously estimated value $v_\pi(s_{t+1})$ of the next state:

$\displaystyle G_t=r_{t+1}+\gamma v_\pi(s_{t+1})$ (51)

We see that the TD method is an in-place bootstrapping method. It makes more efficient use of the sample data, as it updates the estimated value functions and the improved policy at every step of an episode, instead of only at the end of the episode as in the MC method. Also, it can be used even if the dynamic process of the environment is non-episodic, with an infinite horizon.

Again, we first consider the simpler problem of policy evaluation. As in the MC method, we estimate the value function $v_\pi(s)$ as a running average computed over multiple episodes while sampling the environment, but now with the TD target above in place of the actual return $G$. Substituting this TD target into the running average in Eq. (49) for the value function, we get

$\displaystyle v_\pi(s)\Leftarrow v_\pi(s)+\alpha(G-v_\pi(s)) = v_\pi(s)+\alpha(r+\gamma v_\pi(s')-v_\pi(s)) = v_\pi(s)+\alpha\delta$ (52)

where $\delta$, called the TD error, is the difference between the TD target and the previous estimate:

$\displaystyle \delta=G-v_\pi(s)=r+\gamma v_\pi(s')-v_\pi(s)$ (53)

and $v_\pi(s')$ is the current estimate of the value of the next state (bootstrapping). It can be shown that this TD iteration converges to the true value $v_\pi(s)$ if the step size $\alpha$ is small enough.
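
As a quick numeric illustration with arbitrarily chosen values, suppose $\alpha=0.1$, $\gamma=0.9$, the current estimates are $v_\pi(s)=0.5$ and $v_\pi(s')=1.0$, and the observed reward is $r=1$. Then the TD error is $\delta=1+0.9\times 1.0-0.5=1.4$, and the update gives $v_\pi(s)\Leftarrow 0.5+0.1\times 1.4=0.64$.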

Here is the pseudocode for policy evaluation using the TD method, based on the parameters $\gamma$ and $\alpha$:

Input: a given policy $\pi$ to be evaluated
Initialize: $v(s)$ for all $s\in S$ arbitrarily, $v(s)=0$ for terminal state $s$, $\alpha\in(0,\;1]$
loop (for each episode)
    $s=s_0$
    while $s$ is not terminal (for each step)
        take action $a=\pi(s)$, get reward $r$ and next state $s'$
        $v(s)=v(s)+\alpha[ r+\gamma v(s')-v(s)]$
        $s=s'$
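
To make this concrete, here is a minimal Python sketch of TD(0) policy evaluation, assuming a hypothetical environment object whose `reset()` method returns the initial state and whose `step(a)` method returns the next state, the reward, and a termination flag, together with a `policy` function mapping a state to an action; these names and the parameter values are illustrative assumptions, not part of the pseudocode above.

    from collections import defaultdict

    def td0_policy_evaluation(env, policy, num_episodes=1000, alpha=0.1, gamma=0.9):
        # TD(0) evaluation of a given policy: v(s) <- v(s) + alpha*(r + gamma*v(s') - v(s))
        v = defaultdict(float)                 # value table, arbitrarily initialized to 0
        for _ in range(num_episodes):          # loop (for each episode)
            s = env.reset()                    # start state s0
            done = False
            while not done:                    # while s is not terminal (for each step)
                a = policy(s)                              # take action a = pi(s)
                s_next, r, done = env.step(a)              # get reward r and next state s'
                target = r if done else r + gamma * v[s_next]   # TD target (v = 0 at terminal)
                v[s] += alpha * (target - v[s])            # update with the TD error
                s = s_next                                 # move to the next state
        return v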

As the TD algorithm updates the value function at every step of the episode, it uses the sample data more frequently and efficiently, and therefore has lower variance than the MC method, which updates the estimated value function only at the end of each episode. On the other hand, the TD method may be biased when compared to the first-visit MC method, due to the arbitrary initialization of the value functions used as bootstrapping targets.

Based on the TD method for model-free policy evaluation, we now consider the TD method for model-free control, which gradually learns the optimal policy by updating the Q-values of all state-action pairs $(s,a)$, estimated iteratively as the running average of the sampled Q-values at each time step of an episode while sampling the environment.

The Q-values of all state-action pairs can be stored in a state-action table, which is iteratively updated by the TD method over many episodes of the unknown MDP, so that it gradually approaches the maximum Q-value achievable by taking each action at each state; the optimal action at a state is then the one with the highest Q-value.
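
To illustrate, here is a minimal Python sketch of such a state-action table, stored as a dictionary keyed by $(s,a)$ pairs, together with $\epsilon$-greedy action selection based on it; the names `q`, `actions`, and `epsilon` are illustrative assumptions rather than part of the text above.

    import random
    from collections import defaultdict

    q = defaultdict(float)      # state-action table: q[(s, a)] is the estimated Q-value

    def greedy_action(s, actions):
        # exploit: the optimal action at s is the one with the highest Q-value
        return max(actions, key=lambda a: q[(s, a)])

    def epsilon_greedy_action(s, actions, epsilon=0.1):
        # explore with probability epsilon, otherwise exploit the current Q-table
        if random.random() < epsilon:
            return random.choice(actions)
        return greedy_action(s, actions)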

This method has two flavors: the on-policy algorithm, which updates the Q-value by following the current policy, such as an $\epsilon$-greedy policy, and the off-policy algorithm, which updates the Q-value based on actions different from those of the current policy, such as the greedy action. In particular, if the policy currently being followed is greedy (instead of $\epsilon$-greedy), the two algorithms are the same.
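
The two flavors differ only in how the TD target for the Q-value update is formed. Here is a minimal Python sketch of the two updates, assuming the same kind of dictionary-based Q-table `q` as above and an action set `actions`; all names and parameter values are illustrative assumptions.

    def on_policy_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
        # on-policy: the TD target uses the action a_next actually chosen by the
        # current (e.g. epsilon-greedy) policy at the next state s_next
        target = r + gamma * q[(s_next, a_next)]
        q[(s, a)] += alpha * (target - q[(s, a)])

    def off_policy_update(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
        # off-policy: the TD target always uses the greedy (maximum) Q-value at the
        # next state, regardless of the action the behavior policy actually takes
        target = r + gamma * max(q[(s_next, b)] for b in actions)
        q[(s, a)] += alpha * (target - q[(s, a)])

If the behavior policy happens to be purely greedy, the next action `a_next` it selects is exactly the maximizing action, so the two targets and hence the two updates coincide, consistent with the remark above.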

Here is a comparison of the MC and TD methods in terms of their pros and cons:

    MC: unbiased (first-visit) estimates, but higher variance; it must wait until the end of a complete episode before updating, so it cannot handle non-episodic tasks.
    TD: lower variance and updates at every step, so it can also handle non-episodic (infinite-horizon) tasks; but it is biased, due to bootstrapping from arbitrarily initialized value estimates.

Here is a summary of the dynamic programming (DP) method for model-based planning, and the Monte-Carlo (MC) and temporal difference (TD) methods for model-free control:

[Figure DPMCTDmethods.png: summary of the DP, MC, and TD methods]