The previously discussed dynamic programming methods find
the optimal policy based on the assumption that the MDP
model of the environment is completely available, i.e.,
the dynamics of the MDP in terms of its state transition and
reward mechanism are known. A given policy can then
be evaluated based on the known transition and reward probabilities,
and improved based on greedy action selection at each
state. Such a model-based optimization problem, called
planning, involves no learning of the environment.
Now we consider the optimization problem with an unknown MDP model of the environment. In this model-free problem, called control, the state transition and reward probabilities of the MDP model are unknown, so the value functions can no longer be calculated directly from Eqs. (21) and (22) as in model-based planning. Instead, we need to learn about the environment by sampling it, i.e., by repeatedly running a large number of episodes of the stochastic process of the environment while following some given policy, and estimating the value functions as the average of the actual returns received during the sampling process. This approach is generally referred to as the Monte Carlo (MC) method. At the same time, the given policy can also be improved to gradually approach optimality.
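As a concrete illustration, the sketch below shows how action values could be estimated as the average of sampled returns while following a fixed policy. It is only a sketch: the `sample_episode` helper, its episode format, and the discount factor `gamma` are assumptions introduced here, not part of the text above.

```python
from collections import defaultdict

def mc_evaluate(sample_episode, policy, gamma=0.9, n_episodes=10000):
    """First-visit Monte Carlo estimate of the action values Q(s, a),
    obtained by averaging sampled returns while following a fixed policy.

    sample_episode(policy) is assumed to return one terminated episode
    as a list of (state, action, reward) tuples.
    """
    returns_sum = defaultdict(float)   # accumulated returns per (s, a)
    returns_cnt = defaultdict(int)     # number of first visits per (s, a)
    Q = defaultdict(float)             # estimated action values

    for _ in range(n_episodes):
        episode = sample_episode(policy)
        G = 0.0
        # Walk the episode backwards, accumulating the discounted return.
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = r + gamma * G
            # First-visit: only the earliest occurrence of (s, a) counts.
            if all((s, a) != (s2, a2) for s2, a2, _ in episode[:t]):
                returns_sum[(s, a)] += G
                returns_cnt[(s, a)] += 1
                Q[(s, a)] = returns_sum[(s, a)] / returns_cnt[(s, a)]
    return Q
```

Each (state, action) estimate is simply the average of the returns that followed its first visits, which approaches the true action value under the policy as the number of sampled episodes grows.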
This approach is called general policy iteration (GPI), as illustrated in Eq. (40) below, similar to the policy iteration (PI) algorithm for model-based planning illustrated in Eq. (37). In general, all model-free control algorithms are based on GPI, by which the two alternating processes of policy evaluation and policy improvement are carried out iteratively. This GPI process can also be illustrated in a way similar to Fig. 1.3 for PI.
While GPI and PI, illustrated respectively in Eqs. (40) and (37), may look similar to each other, there are some essential differences between the two, as listed below.
This issue of exploration versus exploitation
can be addressed by the $\epsilon$-greedy method,
which makes a proper trade-off between the exploitation of
the greedy action and the exploration of other non-greedy
actions, so that the agent can be exposed to all possible
state-action pairs and gradually learn the action value
function while at the same time also improving the policy
being followed. In this method, we define $\epsilon$
as the probability of exploring a randomly chosen action
out of all $|\mathcal{A}(s)|$ actions available in state $s$,
each with an equal probability $\epsilon/|\mathcal{A}(s)|$, while $1-\epsilon$
is the probability of taking the greedy action, which, as one of the
$|\mathcal{A}(s)|$ actions, is therefore selected with total probability
$1-\epsilon+\epsilon/|\mathcal{A}(s)|$.
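A minimal sketch of this selection rule is given below; the action-value table `Q`, the list of available actions, and the dictionary key format are assumptions used only for illustration.

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon):
    """Select an action in `state` under the epsilon-greedy rule.

    With probability epsilon a uniformly random action is explored (each
    with probability epsilon / |A(s)|); otherwise the action with the
    highest current estimate Q[(state, a)] is exploited.
    """
    if random.random() < epsilon:
        return random.choice(actions)                          # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit
```

Note that the greedy action can also be drawn in the exploration branch, so its total selection probability is $1-\epsilon+\epsilon/|\mathcal{A}(s)|$, as stated above.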
We further note that a larger value of $\epsilon$ (close to
1) can be used to emphasize exploration at the early stage
of the iteration, when many state-action pairs have not been
visited yet, while progressively smaller $\epsilon$ values
(approaching 0) can be used to emphasize exploitation for
higher values in the later stage of the iteration, when the
action value function has been more reliably learned. For
example, we can let $\epsilon$ be inversely proportional
to the current number of iterations $k$ (e.g., $\epsilon=1/k$),
so that the larger $\epsilon$ values at the early stage of the iteration allow
more state-action pairs to be explored, while smaller
$\epsilon$ values in the later stage of the iteration
encourage exploitation of the greedy actions. In the limit
$k\rightarrow\infty$ and $\epsilon\rightarrow 0$, the
$\epsilon$-greedy method approaches the absolutely greedy method.
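A decaying schedule of this kind could be folded into the selection rule sketched earlier; the $1/k$ schedule below is just one possible choice.

```python
def epsilon_schedule(k):
    """Exploration rate inversely proportional to the iteration count k >= 1."""
    return 1.0 / k

# As episodes accumulate, exploration fades and exploitation dominates:
for k in (1, 10, 100, 1000):
    print(f"k={k:5d}  epsilon={epsilon_schedule(k):.3f}")
```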
Similar to how we proved the policy improvement theorem
in Eq. (30), stating that the greedy
policy $\pi'$ achieves a state value $v_{\pi'}(s)$ no lower
than $v_\pi(s)$ achieved by the original policy
$\pi$, i.e., $v_{\pi'}(s)\ge v_\pi(s)$, here we can also prove the same policy
improvement theorem stating that the $\epsilon$-greedy
policy $\pi'$ achieves an action value $q_\pi(s,\pi'(s))$ no lower
than the state value $v_\pi(s)$ achieved by the original ($\epsilon$-soft) policy
$\pi$:
$$q_\pi(s,\pi'(s)) = \sum_a \pi'(a|s)\,q_\pi(s,a)
= \frac{\epsilon}{|\mathcal{A}(s)|}\sum_a q_\pi(s,a) + (1-\epsilon)\max_a q_\pi(s,a) \tag{42}$$

$$\ge \frac{\epsilon}{|\mathcal{A}(s)|}\sum_a q_\pi(s,a)
+ (1-\epsilon)\sum_a \frac{\pi(a|s)-\epsilon/|\mathcal{A}(s)|}{1-\epsilon}\,q_\pi(s,a)
= \sum_a \pi(a|s)\,q_\pi(s,a) = v_\pi(s) \tag{43}$$
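As a quick sanity check of this inequality (not part of the derivation itself), the snippet below compares the value of the $\epsilon$-greedy policy with $v_\pi(s)$ for randomly drawn action values and a randomly drawn $\epsilon$-soft policy; the state is left implicit.

```python
import random

def check_improvement(n_actions=4, epsilon=0.2, trials=1000):
    """Numerically verify q_pi(s, pi'(s)) >= v_pi(s) on random cases."""
    for _ in range(trials):
        q = [random.uniform(-1, 1) for _ in range(n_actions)]
        # Draw a random epsilon-soft policy: every action has prob >= eps/|A|.
        w = [random.random() for _ in range(n_actions)]
        base = epsilon / n_actions
        pi = [base + (1 - epsilon) * wi / sum(w) for wi in w]
        v_pi = sum(p * qa for p, qa in zip(pi, q))     # value under pi
        # Expected action value under the epsilon-greedy policy w.r.t. q.
        q_greedy = base * sum(q) + (1 - epsilon) * max(q)
        assert q_greedy >= v_pi - 1e-12
    print("inequality held in all", trials, "random trials")

check_improvement()
```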
The iterative process of the GPI method for model-free
control is illustrated in the figure below, similar to
the PI method for model-based planning
illustrated in Fig. 1.3, but with the state
value $v(s)$ replaced by the action value $q(s,a)$,
and the greedy method replaced by the
$\epsilon$-greedy method.
Also, here the estimated action value is
obtained with only one iteration of policy evaluation,
instead of the much more accurate estimate that could
otherwise be obtained if the iterative evaluation were
fully carried out to its convergence (the top line).
At the same time, based on the estimated action value,
policy improvement is carried out by the $\epsilon$-greedy
method (the bottom line).
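One way to organize this alternation is sketched below; the `sample_episode`, `evaluate`, and `improve` helpers are placeholders standing for the episode sampling, one-sweep evaluation, and $\epsilon$-greedy improvement steps described above, and the $1/k$ exploration schedule is one possible choice.

```python
def gpi_control(sample_episode, evaluate, improve, n_iterations=1000):
    """Skeleton of model-free control by GPI: alternate one partial
    policy-evaluation step with one epsilon-greedy improvement step.

    evaluate(Q, episode) updates the action-value estimates in place from
    one sampled episode; improve(Q, epsilon) returns the epsilon-greedy
    policy with respect to the current estimates.
    """
    Q = {}                                    # action-value estimates
    policy = improve(Q, epsilon=1.0)          # start fully exploratory
    for k in range(1, n_iterations + 1):
        episode = sample_episode(policy)      # sample the environment
        evaluate(Q, episode)                  # partial policy evaluation
        policy = improve(Q, epsilon=1.0 / k)  # epsilon-greedy improvement
    return Q, policy
```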
The main task in model-free control is to evaluate
the state and action value functions given in
Eqs. (21) and (22)
while following a given policy without a specific
MDP model. This is done by sampling the dynamic process
of the environment and then estimating the value functions
as the average of the actual returns received by the agent.
Specifically, we run a large number of episodes (all assumed
to terminate) of the MDP model of the environment by taking
an action $A_t$ in each state $S_t$ visited:
$$S_0, A_0, R_1, S_1, A_1, R_2, S_2, \ldots, S_{T-1}, A_{T-1}, R_T, S_T \tag{45}$$
Such a sequence of states visited is called a trajectory, and the trajectories of different episodes are in general different from one another due to the random nature of the environment. In the following sections we will consider specific algorithms for the implementation of the general model-free control approach discussed above.
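For concreteness, episode sampling of this kind could be implemented as below; the Gym-style `reset()`/`step()` environment interface (here with the environment passed explicitly) is an assumption used only for illustration.

```python
def sample_episode(env, policy, max_steps=10000):
    """Run one episode under `policy` and return its trajectory.

    The environment is assumed to follow a Gym-style interface where
    reset() returns the initial state and step(action) returns
    (next_state, reward, done). The trajectory is returned as a list of
    (state, action, reward) tuples, ending when a terminal state is reached.
    """
    trajectory = []
    state = env.reset()
    for _ in range(max_steps):
        action = policy(state)
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory
```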
To prepare for the specific discussion of the GPI methods, we
first consider the general problem of estimating the
value of a random variable as the running average of
its samples being collected in real time. Specifically,
the average $\mu_{k-1}$ based on the previous $k-1$
samples $x_1,\ldots,x_{k-1}$ is updated incrementally upon receiving
a new sample $x_k$:
$$\mu_k = \frac{1}{k}\sum_{i=1}^{k} x_i
= \frac{1}{k}\Big(x_k + (k-1)\,\mu_{k-1}\Big)
= \mu_{k-1} + \frac{1}{k}\big(x_k - \mu_{k-1}\big) \tag{46}$$
This incremental update can be stated in the general form

$$\begin{aligned}
\text{newEstimate} &= \text{oldEstimate} + \text{stepSize}\,\big(\text{target}-\text{oldEstimate}\big) &\text{(47)}\\
&= (1-\text{stepSize})\times\text{oldEstimate} + \text{stepSize}\times\text{target}, &\text{(48)}
\end{aligned}$$

where the target is the newly received sample $x_k$ and the step size is $1/k$.
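A minimal sketch of this incremental update with the $1/k$ step size is shown below; it produces the same result as recomputing the batch average of all samples.

```python
def incremental_mean(samples):
    """Running average updated one sample at a time, as in Eq. (47)."""
    estimate, count = 0.0, 0
    for x in samples:                 # x is the newly received target
        count += 1
        step_size = 1.0 / count
        estimate += step_size * (x - estimate)   # old + step * (target - old)
    return estimate

data = [2.0, 4.0, 6.0, 8.0]
print(incremental_mean(data), sum(data) / len(data))   # both print 5.0
```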
Specifically, in model-free control Eq. (47)
can be used to iteratively update the estimated value
functions while sampling the environment. Both the
state and action values, defined as expected returns, are
estimated as the average of all previous sample returns,
and they are updated incrementally whenever a new sample
return, the target, becomes available: