Monte Carlo (MC) Algorithms

We first consider the simple problem of evaluating an existing policy $\pi$ in terms of its value function $v_\pi(s)=E_\pi[\,G_t\mid s_t=s\,]$, the expected return $G_t$ starting from state $s=s_t$. In the model-based case this can be obtained by the Bellman equation in Eq. (24):

$\displaystyle v_\pi(s)=r(s,\pi(s))+\gamma\sum_{s'}p(s'\vert s,\pi(s))v_\pi(s')$ (50)

based on the values $v_\pi(s')$ of all possible next states, each weighted by the corresponding transition probability $p(s'\vert s,\pi(s))$, which is no longer available in the model-free case. We can still find the value function by the MC method: run multiple sample episodes in the environment and estimate $v_\pi(s)$ as the average of the actual returns $G_t$ observed from state $s=s_t$. Each such return is the sum of the discounted rewards $r_{t+1},\cdots,r_T$ collected from the current state $s=s_t$ onward to the terminal state $s_T$ reached at the end of the episode (assumed to have a finite horizon).
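
Explicitly, with the reward indexing used above, the return from step $t$ is the discounted sum

$\displaystyle G_t=r_{t+1}+\gamma r_{t+2}+\cdots+\gamma^{T-t-1}r_T=\sum_{k=0}^{T-t-1}\gamma^k\,r_{t+k+1}$

so each episode provides one sample of $G_t$ for every state visited along its trajectory.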

This method has two versions, first-visit and every-visit, depending on whether only the first visit or every visit to a state in the trajectory of an episode is counted in the calculation of the return. The pseudocode below implements the first-visit version because of the if statement, which can be removed to obtain the every-visit version.

Input: a given policy $\pi$ to be evaluated
Initialize: $Returns(s)=[\;\;]$ (an empty list) for every state $s$
loop (for each episode)
    run the episode to the end while following $\pi$ to get sample data: $(s_0,a_0,r_1),\;(s_1,a_1,r_2),\cdots,(s_{T-1},a_{T-1},r_T)$
    $G=0$
    for $t=T-1,T-2,\cdots,0$ (for each step, backward)
        $G=\gamma G+r_{t+1}$
        if $s_t\notin \{s_0,\cdots,s_{t-1}\}$ (first visit)
            append $G$ to list $Returns(s_t)$
            $V(s_t)=average(Returns(s_t))$

In the every-visit case, all sample returns $G_t$ from multiple visits to state $s_t$ are used, although they may be correlated rather than independent of each other, i.e., they are not necessarily i.i.d. samples and their average may be biased. This problem is avoided in the first-visit case, where only the return of the first visit to each state is counted, so the returns can be treated as independent samples drawn from the same distribution and their average is unbiased. However, as only a fraction of the collected sample points is used in the calculation, the cost of this unbiased estimation is its high variance, which can be reduced only by running a large number of episodes to get enough samples for a statistically reliable average. This is the typical trade-off between variance and bias errors.
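
As a concrete illustration, here is a minimal Python sketch of MC policy evaluation covering both variants. The environment interface (env.reset() returning the initial state, env.step(a) returning the next state, reward, and a termination flag) and the policy function pi are assumptions made for this sketch, not part of the algorithm above.

# A minimal sketch of first-visit / every-visit MC policy evaluation.
# env.reset(), env.step(a) and pi(s) -> a are assumed interfaces.
from collections import defaultdict

def mc_prediction(env, pi, num_episodes, gamma=0.99, first_visit=True):
    returns = defaultdict(list)   # Returns(s): list of sampled returns per state
    V = defaultdict(float)        # estimated state values V(s)

    for _ in range(num_episodes):
        # Run one episode following pi to collect (s_t, a_t, r_{t+1}) samples.
        episode = []
        s, done = env.reset(), False
        while not done:
            a = pi(s)
            s_next, r, done = env.step(a)
            episode.append((s, a, r))
            s = s_next

        # Backward pass: G accumulates the discounted return from each step t.
        G = 0.0
        states = [s for (s, _, _) in episode]
        for t in range(len(episode) - 1, -1, -1):
            s_t, _, r_next = episode[t]
            G = gamma * G + r_next
            if first_visit and s_t in states[:t]:
                continue                      # skip all but the first visit to s_t
            returns[s_t].append(G)
            V[s_t] = sum(returns[s_t]) / len(returns[s_t])
    return V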

In the pseudocode above, the function average computes the mean of all elements of a list, which can only be done once all sample returns are available at the end of each episode. A more computationally efficient method (in terms of both space and computation) is to update the average incrementally as in Eq. (49). Based on this incremental average, policy evaluation in terms of both the state and action values can be carried out by the pseudocode below:

Initialize: $v(s)=0,\;\forall s$
loop (for each episode)
    run an episode following $\pi$ to get $G_t,\;\forall t$
    for $t=0,\cdots,T-1$ (for each step)
        if $s_t$ is visited for the first time
            $v(s_t)=v(s_t)+\alpha (G_t-v(s_t))$

and, for the action values,

Initialize: $q(s,a)=0,\;\forall (s,a)$
loop (for each episode)
    run an episode following $\pi$ to get $G_t,\;\forall t$
    for all $(s,a)$ pairs visited
        if $(s,a)$ is visited for the first time
            $q(s,a)=q(s,a)+\alpha (G_t-q(s,a))$

The if condition can be removed for the every-visit version of the algorithm.
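
The same two updates can be written compactly in Python. This is only a sketch under the same assumptions as before; the helper run_episode, which rolls out one episode under $\pi$ and returns the list of $(s_t,a_t,r_{t+1})$ triples (e.g., as in the previous sketch), is hypothetical.

# A minimal sketch of incremental (constant step-size alpha) first-visit
# evaluation of both v(s) and q(s,a).  run_episode(env, pi) is an assumed
# helper returning one trajectory [(s_0, a_0, r_1), (s_1, a_1, r_2), ...].
from collections import defaultdict

def mc_incremental_evaluation(env, pi, num_episodes, gamma=0.99, alpha=0.05):
    V = defaultdict(float)    # v(s) = 0 for all s
    Q = defaultdict(float)    # q(s,a) = 0 for all (s,a)

    for _ in range(num_episodes):
        episode = run_episode(env, pi)
        # Compute the return G_t for every step t by one backward sweep.
        G, G_list = 0.0, [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            G = gamma * G + episode[t][2]
            G_list[t] = G

        seen_s, seen_sa = set(), set()
        for t, (s, a, _) in enumerate(episode):
            if s not in seen_s:                    # first visit to state s
                seen_s.add(s)
                V[s] += alpha * (G_list[t] - V[s])        # v(s) += alpha (G_t - v(s))
            if (s, a) not in seen_sa:              # first visit to pair (s,a)
                seen_sa.add((s, a))
                Q[(s, a)] += alpha * (G_list[t] - Q[(s, a)])
    return V, Q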

We next consider the MC algorithm for model-free control, based on the generalized policy iteration in which policy evaluation and policy improvement alternate and interact. Specifically, while sampling the environment by following an existing policy $\pi$, the action value $q_\pi(s,a)$, instead of the state value $v_\pi(s)$, is gradually learned, and the policy is gradually improved at the same time. This iteration converges to the optimal policy in the limit as the number of iterations goes to infinity.

The pseudocode for this algorithm is listed below.

Input: $\epsilon>0$ and $\alpha>0$
Initialize: $q(s,a)=0,\;\forall(s,a)$, and $\pi$ as an arbitrary $\epsilon$-soft policy
loop (for each episode)
    run an episode following $\pi$ to get $G_0,\cdots,G_{T-1}$
    for $t=0,\cdots,T-1$
        if $(s_t,a_t)$ is visited for the first time
            $q(s_t,a_t)=q(s_t,a_t)+\alpha(G_t-q(s_t,a_t))$
            $a^*=\argmax_a q(s_t,a)$
            for all $a\in A(s_t)$
                $\pi(a\vert s_t)=\left\{\begin{array}{ll}
                1-\epsilon+\epsilon/\vert A(s_t)\vert & \mbox{if } a=a^*\\
                \epsilon/\vert A(s_t)\vert & \mbox{else}
                \end{array}\right.$
Here the $\epsilon$-greedy policy $\pi(a\vert s)$ is based on the action value function $q(s,a)$, which is maximized by the greedy action $a^*$. Throughout the iteration, the policy is modified to follow the action value function while, at the same time, the action value function is updated based on the policy, until the process converges to the desired optimality.

We note that the policy $\pi(a\vert s)$ obtained by the algorithm above is not completely deterministic due to the $\epsilon$-greedy approach, but it is said to be near-deterministic as the greedy action is favored over the non-greedy actions.
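
The sketch below puts the pieces together as on-policy first-visit MC control with an $\epsilon$-greedy policy. It represents the policy implicitly through the current $q(s,a)$ rather than as an explicit probability table, which yields the same action probabilities as in the pseudocode; env.reset(), env.step(a), and env.actions(s) (the action set $A(s)$) are assumed interfaces.

# A minimal sketch of on-policy first-visit MC control with an eps-greedy policy.
# env.reset(), env.step(a) and env.actions(s) are assumed interfaces.
import random
from collections import defaultdict

def mc_control(env, num_episodes, gamma=0.99, alpha=0.05, epsilon=0.1):
    Q = defaultdict(float)                     # q(s,a) = 0 for all (s,a)

    def policy(s):
        # eps-greedy: greedy action w.p. 1 - eps + eps/|A(s)|, others w.p. eps/|A(s)|.
        actions = env.actions(s)
        if random.random() < epsilon:
            return random.choice(actions)      # uniform exploration (may pick greedy)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        # Generate one episode by following the current eps-greedy policy.
        episode, s, done = [], env.reset(), False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            episode.append((s, a, r))
            s = s_next

        # First-visit constant-alpha update of q(s,a), backward over the episode.
        G = 0.0
        pairs = [(s, a) for (s, a, _) in episode]
        for t in range(len(episode) - 1, -1, -1):
            s_t, a_t, r_next = episode[t]
            G = gamma * G + r_next
            if (s_t, a_t) not in pairs[:t]:    # first visit to the pair (s_t, a_t)
                Q[(s_t, a_t)] += alpha * (G - Q[(s_t, a_t)])
    return Q, policy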

[Figure: MCmethod.png]