We first consider the simple problem of evaluating an existing policy $\pi$ in terms of its value function $v_\pi(s)$ at state $s$, the expectation of the return $G_t$, which can be obtained in the model-based case by the Bellman equation in Eq. (24). In the model-free case, $v_\pi(s)$ can instead be estimated by the Monte Carlo (MC) method, as the average of the returns sampled over many episodes that visit state $s$.
This method has two versions, first-visit and every-visit, depending on whether only the first visit or every visit to a state within an episode's trajectory is counted when computing the returns. The pseudo code below is for the first-visit version, due to the if statement, which can be removed to obtain the every-visit version.
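As a minimal Python sketch of this idea (the helper `generate_episode`, which is assumed to return the list of state-reward pairs produced by following the policy being evaluated, and the parameter names are assumptions, not part of the original listing):

```python
from collections import defaultdict

def first_visit_mc_evaluation(generate_episode, gamma=0.9, num_episodes=1000):
    """First-visit Monte Carlo policy evaluation (sketch)."""
    returns = defaultdict(list)   # state -> list of sampled returns
    V = defaultdict(float)        # state -> current value estimate

    for _ in range(num_episodes):
        # generate_episode() is assumed to return [(S_0, R_1), (S_1, R_2), ...]
        episode = generate_episode()
        G = 0.0
        for t in reversed(range(len(episode))):
            state, reward = episode[t]
            G = gamma * G + reward   # discounted return from time t
            # The if statement below makes this the first-visit version;
            # remove it to obtain the every-visit version.
            if state not in [s for s, _ in episode[:t]]:
                returns[state].append(G)
                V[state] = sum(returns[state]) / len(returns[state])
    return V
```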
In the every-visit case, all sample returns from multiple visits to state $s$ are used, although they may be correlated rather than independent of each other, i.e., they are not necessarily i.i.d. samples and their average may be biased. This problem is avoided in the first-visit case, where only the return of the first visit to each state is counted, so the returns are independent samples drawn from the same distribution and their average is unbiased. However, as only a fraction of the collected sample points is used in computing the returns, the cost of this unbiased estimation is higher variance, which can be reduced only by running a large number of episodes to gather enough samples for a statistically reliable average. This is the typical trade-off between bias and variance errors.
In the code above, the function average finds the average of all elements of a list, and can be applied only when all sample returns are available at the end of each episode. A more efficient method (in terms of both space and time complexity) is to compute the average incrementally, as in Eq. (49). Based on this incremental average, policy evaluation in terms of both the state and action values can also be carried out by the code below:
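A sketch of this incremental-average version, assuming `generate_episode` now also records the actions taken (again, the helper and parameter names are assumptions):

```python
from collections import defaultdict

def mc_evaluation_incremental(generate_episode, gamma=0.9, num_episodes=1000):
    """Every-visit MC evaluation of V(s) and Q(s, a) with incremental averaging."""
    V, n_v = defaultdict(float), defaultdict(int)   # state values, visit counts
    Q, n_q = defaultdict(float), defaultdict(int)   # action values, visit counts

    for _ in range(num_episodes):
        # generate_episode() is assumed to return [(S_0, A_0, R_1), (S_1, A_1, R_2), ...]
        episode = generate_episode()
        G = 0.0
        for state, action, reward in reversed(episode):
            G = gamma * G + reward
            # Incremental average, as in Eq. (49): mean += (sample - mean) / n.
            n_v[state] += 1
            V[state] += (G - V[state]) / n_v[state]
            n_q[(state, action)] += 1
            Q[(state, action)] += (G - Q[(state, action)]) / n_q[(state, action)]
    return V, Q
```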
We next consider the MC algorithm for model-free control, based on generalized policy iteration, in which policy evaluation and policy improvement alternate and interact. Specifically, while sampling the environment by following an existing policy, the action value $q_\pi(s, a)$, instead of the state value $v_\pi(s)$, is gradually learned, and the policy is gradually improved at the same time. This iteration converges to the optimal policy in the limit as the number of iterations goes to infinity. The pseudo code for this algorithm is listed below.
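The following Python sketch illustrates this scheme as on-policy first-visit MC control with an $\epsilon$-greedy policy; the environment interface (`env_reset`, `env_step`) and the list `actions` are assumed here for illustration and are not part of the original listing:

```python
import random
from collections import defaultdict

def mc_control_epsilon_greedy(env_reset, env_step, actions,
                              gamma=0.9, epsilon=0.1, num_episodes=10000):
    """On-policy first-visit MC control with an epsilon-greedy policy (sketch)."""
    Q = defaultdict(float)   # action-value estimates q(s, a)
    N = defaultdict(int)     # visit counts for incremental averaging

    def policy(state):
        # Epsilon-greedy: explore with probability epsilon, otherwise exploit
        # the greedy action under the current Q estimates.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        # Policy evaluation: sample one episode by following the current policy.
        episode, state, done = [], env_reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env_step(state, action)
            episode.append((state, action, reward))
            state = next_state
        # Update Q incrementally from the first visit to each (state, action).
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if (s, a) not in [(s_, a_) for s_, a_, _ in episode[:t]]:
                N[(s, a)] += 1
                Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]
        # Policy improvement is implicit: policy() always reads the latest Q,
        # so it becomes greedier with respect to the improved estimates.
    return Q, policy
```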
We note that the optimal policy obtained by the algorithm above is not completely deterministic due to the $\epsilon$-greedy approach, but it is said to be near-deterministic, as the greedy action is favored over the other non-greedy actions.
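For concreteness, with the notation assumed here (action set $\mathcal{A}(s)$ and current action-value estimate $q(s, a)$), an $\epsilon$-greedy policy assigns
\[
\pi(a \mid s) =
\begin{cases}
1 - \epsilon + \dfrac{\epsilon}{|\mathcal{A}(s)|}, & a = \arg\max_{a'} q(s, a'), \\[2pt]
\dfrac{\epsilon}{|\mathcal{A}(s)|}, & \text{otherwise},
\end{cases}
\]
so the greedy action carries most of the probability mass, while every non-greedy action retains a small nonzero probability of being explored.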