All RL algorithms previously considered are based on either the state or the action value function, and the policy is indirectly derived from them by the greedy or $\epsilon$-greedy method. However, as the ultimate goal of an RL problem is to find the optimal policy that maximizes the return, it makes sense to treat this as an optimization problem and directly find the optimal policy, by applying the gradient ascent method to an objective function representing the total return to be maximized, as discussed in the following.
Previously we approximated the state value function $v_\pi(s)$ and the action value function $q_\pi(s,a)$ by some parameterized functions $\hat{v}(s,{\bf w})$ and $\hat{q}(s,a,{\bf w})$, respectively, of which the parameter ${\bf w}$ can be obtained by sampling the environment, as a supervised learning process. Now we construct a model of a stochastic policy as a parameterized function:

\pi(a|s,\theta) = \Pr( A_t = a \mid S_t = s,\; \theta )    (110)
As a specific example, the policy model can be based on the soft-max function of some parameterized action preferences $h(s,a,\theta)$:

\pi(a|s,\theta) = \frac{ e^{h(s,a,\theta)} }{ \sum_b e^{h(s,b,\theta)} }    (111)

satisfying $\pi(a|s,\theta) > 0$ and $\sum_a \pi(a|s,\theta) = 1$ for every state $s$.
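For concreteness, below is a minimal Python sketch of such a soft-max policy with linear action preferences $h(s,a,\theta)=\theta^\top {\bf x}(s,a)$; the feature matrix X standing in for ${\bf x}(s,a)$ and the problem sizes are assumptions of this example, not part of the original formulation.

```python
import numpy as np

N_ACTIONS, N_FEATURES = 3, 4                      # sizes assumed for illustration
rng = np.random.default_rng(0)
X = rng.standard_normal((N_ACTIONS, N_FEATURES))  # stand-in for the features x(s,a) at one state s

def softmax_policy(theta, X):
    """pi(a|s,theta) of Eq. (111) with linear preferences h(s,a,theta) = theta . x(s,a)."""
    h = X @ theta                                 # preferences h(s,a,theta) for all actions a
    h -= h.max()                                  # shift for numerical stability (does not change pi)
    p = np.exp(h)
    return p / p.sum()

theta = np.zeros(N_FEATURES)
print(softmax_policy(theta, X))                   # uniform distribution [1/3 1/3 1/3] when theta = 0
```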
Different from how we find the parameter ${\bf w}$ of $\hat{v}(s,{\bf w})$ by minimizing the mean squared error between the value function $v_\pi(s)$ and its model $\hat{v}(s,{\bf w})$ based on gradient descent, here we find the parameter $\theta$ of $\pi(a|s,\theta)$ by maximizing an objective function $J(\theta)$ representing the value function, i.e., the expected return under the policy, based on gradient ascent. Such methods are therefore called policy gradient methods.
The value-function-based methods considered previously and the policy-based methods considered here are summarized below, together with the actor-critic method as the combination of the two: value-based methods learn a value function from which the policy is implicitly derived; policy-based methods learn a parameterized policy directly without requiring a value function; actor-critic methods learn both a policy (the actor) and a value function (the critic).
Advantages of policy-based RL include better convergence properties, effectiveness in high-dimensional or continuous action spaces, and the ability to learn stochastic policies; its drawbacks are that it may get stuck at a local optimum and typically suffers from high variance.
How good a policy is may be measured by different objective functions, all related to the values or rewards associated with the policy being evaluated, depending on the specific problem and environment (e.g., episodic or continuing):

J(\theta) = v_{\pi_\theta}(s_0) = {\mathbb E}_{\pi_\theta}[\, G_0 \mid S_0 = s_0 \,]    (112)

J(\theta) = \sum_s \mu_{\pi_\theta}(s)\, v_{\pi_\theta}(s)    (113)

J(\theta) = \sum_s \mu_{\pi_\theta}(s) \sum_a \pi(a|s,\theta)\, {\cal R}_s^a    (114)

\mu_{\pi_\theta}(s) = \lim_{t\rightarrow\infty} \Pr( S_t = s \mid s_0,\, \pi_\theta )    (115)

Here Eq. (112) is the value of the start state of an episodic task, Eqs. (113) and (114) are the average state value and the average reward per time step of a continuing task, and $\mu_{\pi_\theta}(s)$ in Eq. (115) is the stationary distribution of the states under policy $\pi_\theta$.
While it is conceptually straightforward to see how the gradient ascent method can be used to find the optimal parameter $\theta$ that maximizes $J(\theta)$, it is not easy to actually find its gradient $\nabla_\theta J(\theta)$, which depends not only on what actions are taken at the states, directly determined by the policy $\pi(a|s,\theta)$, but also on how the states are distributed under the policy in an unknown environment. This challenge is addressed by the following policy gradient theorem:

\nabla_\theta J(\theta) \;\propto\; \sum_s \mu_\pi(s) \sum_a q_\pi(s,a)\, \nabla_\theta \pi(a|s,\theta)    (116)

where $\mu_\pi(s)$ is the state distribution under the policy. The proof of the theorem below leads to an expression of the gradient $\nabla_\theta J(\theta)$ that can be used to find the optimal parameter $\theta$ iteratively by the stochastic gradient ascent method.
Proof of policy gradient theorem:
We further define $\Pr(s\rightarrow x,\,k,\,\pi)$ as the probability of transitioning from state $s$ to a state $x$ after $k$ steps following $\pi$:

\Pr(s\rightarrow x,\,k,\,\pi) = \sum_{s'} \Pr(s\rightarrow s',\,k-1,\,\pi) \sum_a \pi(a|s',\theta)\, p(x|s',a),
\qquad \Pr(s\rightarrow s,\,0,\,\pi) = 1    (119)

Unrolling the recursive expression of $\nabla_\theta v_\pi(s)$ with this $k$-step transition probability and evaluating it at the start state $s_0$, we get

\nabla_\theta J(\theta) = \nabla_\theta v_\pi(s_0)
 = \sum_s \left( \sum_{k=0}^\infty \Pr(s_0\rightarrow s,\,k,\,\pi) \right) \sum_a \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a)
 \;\propto\; \sum_s \mu_\pi(s) \sum_a \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a)    (126)
Q.E.D.
Now Eq. (116) can be further written as

\nabla_\theta J(\theta)
 = {\mathbb E}_\pi\!\left[ \sum_a q_\pi(S_t,a)\, \nabla_\theta \pi(a|S_t,\theta) \right]
 = {\mathbb E}_\pi\!\left[ q_\pi(S_t,A_t)\, \frac{ \nabla_\theta \pi(A_t|S_t,\theta) }{ \pi(A_t|S_t,\theta) } \right]
 = {\mathbb E}_\pi\!\left[ q_\pi(S_t,A_t)\, \nabla_\theta \ln \pi(A_t|S_t,\theta) \right]    (129)

We see that the gradient $\nabla_\theta J(\theta)$ is now expressed as the expectation of the action value function $q_\pi(S_t,A_t)$, weighted by the gradient of the logarithm of the corresponding policy, $\nabla_\theta \ln \pi(A_t|S_t,\theta)$.
We further note that Eqs. (126) and (129) still hold if an arbitrary bias (baseline) term $b(s)$ independent of the action is included, because $\sum_a b(s)\,\nabla_\theta \pi(a|s,\theta) = b(s)\,\nabla_\theta \sum_a \pi(a|s,\theta) = b(s)\,\nabla_\theta 1 = 0$:

\nabla_\theta J(\theta) \;\propto\; \sum_s \mu_\pi(s) \sum_a \nabla_\theta \pi(a|s,\theta)\, \left( q_\pi(s,a) - b(s) \right)    (131)

or equivalently, in the expectation form of Eq. (129),

\nabla_\theta J(\theta) = {\mathbb E}_\pi\!\left[ \left( q_\pi(S_t,A_t) - b(S_t) \right) \nabla_\theta \ln \pi(A_t|S_t,\theta) \right]    (132)
The expectation
in Eqs. (129)
and (132) can be dropped if the method of
stochastic gradient ascent is used, as in all algorithms
below, where each iterative step is based on only one
sample data point instead of its expectation.
Specifically, for a soft-max policy model as given in Eq. (111), we have

\nabla_\theta \ln \pi(a|s,\theta) = \nabla_\theta h(s,a,\theta) - \sum_b \pi(b|s,\theta)\, \nabla_\theta h(s,b,\theta)    (133)

which, for linear action preferences $h(s,a,\theta)=\theta^\top {\bf x}(s,a)$, reduces to ${\bf x}(s,a) - \sum_b \pi(b|s,\theta)\,{\bf x}(s,b)$.
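As a quick numerical sanity check of Eq. (133) in the linear-preference case, the closed-form gradient can be compared with a finite-difference approximation; the feature matrix X below is an arbitrary stand-in, not something defined in the text.

```python
import numpy as np

rng = np.random.default_rng(1)
N_ACTIONS, N_FEATURES = 3, 4
X = rng.standard_normal((N_ACTIONS, N_FEATURES))   # stand-in features x(s,a) at one fixed state s
theta = rng.standard_normal(N_FEATURES)

def pi(theta):
    """Soft-max policy of Eq. (111) with linear preferences h = X @ theta."""
    h = X @ theta
    h -= h.max()
    p = np.exp(h)
    return p / p.sum()

def grad_log_pi(a, theta):
    """Eq. (133): grad_theta ln pi(a|s,theta) = x(s,a) - sum_b pi(b|s,theta) x(s,b)."""
    return X[a] - pi(theta) @ X

# finite-difference check of the closed-form gradient for one action
a, eps = 1, 1e-6
numeric = np.array([(np.log(pi(theta + eps * e)[a]) - np.log(pi(theta - eps * e)[a])) / (2 * eps)
                    for e in np.eye(N_FEATURES)])
print(np.allclose(grad_log_pi(a, theta), numeric, atol=1e-5))   # True
```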
We list a set of popular policy-based algorithms below, based on either the MC or the TD method generally used in the previous algorithms.
In the REINFORCE algorithm (Monte Carlo policy gradient), the action value function $q_\pi(S_t,A_t)$ in Eq. (129) is replaced by the return $G_t$ obtained at the end of each episode while sampling the environment, and the expectation is dropped. Now the equation becomes the update rule:

\theta \;\leftarrow\; \theta + \alpha\, \gamma^t\, G_t\, \nabla_\theta \ln \pi(A_t|S_t,\theta)    (134)
Here is the pseudo code for the algorithm:
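A minimal Python sketch of this update is given below, assuming a soft-max policy with linear preferences as in Eq. (133) and a generic episodic environment with reset() and step(a) methods; these interface names, the feature function x(s, a), and the hyperparameters are assumptions of the sketch.

```python
import numpy as np

def reinforce(env, x, n_actions, n_features, alpha=0.01, gamma=0.99, n_episodes=1000):
    """Monte Carlo policy gradient (REINFORCE) with the update of Eq. (134).
    env : assumed episodic environment with reset() -> s and step(a) -> (s_next, r, done)
    x   : assumed feature function, x(s, a) -> vector of length n_features"""
    theta = np.zeros(n_features)

    def pi(s):
        # soft-max policy of Eq. (111) with linear preferences h(s,a,theta) = theta . x(s,a)
        h = np.array([theta @ x(s, a) for a in range(n_actions)])
        h -= h.max()
        p = np.exp(h)
        return p / p.sum()

    for _ in range(n_episodes):
        # generate one episode S_0, A_0, R_1, ..., S_{T-1}, A_{T-1}, R_T following pi
        s, done, episode = env.reset(), False, []
        while not done:
            a = np.random.choice(n_actions, p=pi(s))
            s_next, r, done = env.step(a)
            episode.append((s, a, r))
            s = s_next

        # compute the return G_t from each step t backward, then apply Eq. (134) forward in t
        G, updates = 0.0, []
        for t, (s, a, r) in zip(range(len(episode) - 1, -1, -1), reversed(episode)):
            G = r + gamma * G
            updates.append((t, s, a, G))
        for t, s, a, G in reversed(updates):
            grad_log_pi = x(s, a) - pi(s) @ np.array([x(s, b) for b in range(n_actions)])
            theta += alpha * (gamma ** t) * G * grad_log_pi
    return theta
```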
As its name suggests, the actor-critic algorithm is based on two approximation function models: the first for the policy $\pi(a|s,\theta)$ parameterized by $\theta$, the actor, same as in the REINFORCE algorithm above; and the second for the value function $\hat{v}(s,{\bf w})$ parameterized by ${\bf w}$, the critic, same as in Eq. (97).
Specifically, in Eq. (132), the action value function $q_\pi(S_t,A_t)$ is replaced by its bootstrapping expression $R_{t+1} + \gamma\,\hat{v}(S_{t+1},{\bf w})$, and the bias term $b(S_t)$ is replaced by the approximated value function $\hat{v}(S_t,{\bf w})$. Now both parameters $\theta$ and ${\bf w}$ can be found iteratively at every step of an episode while sampling the environment (the TD method) by the stochastic gradient method with the expectation dropped:

\delta_t = R_{t+1} + \gamma\,\hat{v}(S_{t+1},{\bf w}) - \hat{v}(S_t,{\bf w}),
\qquad
{\bf w} \;\leftarrow\; {\bf w} + \alpha_{\bf w}\, \delta_t\, \nabla_{\bf w}\hat{v}(S_t,{\bf w}),
\qquad
\theta \;\leftarrow\; \theta + \alpha_\theta\, \gamma^t\, \delta_t\, \nabla_\theta \ln \pi(A_t|S_t,\theta)    (135)
Here is the pseudo code for the algorithm:
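A minimal Python sketch of these one-step (TD(0)) actor-critic updates of Eq. (135) is given below, assuming a linear critic $\hat{v}(s,{\bf w})={\bf w}^\top\phi(s)$, the soft-max actor used earlier, and the same generic environment interface; all of these modeling choices, names, and hyperparameters are assumptions of the sketch.

```python
import numpy as np

def one_step_actor_critic(env, x, phi, n_actions, n_theta, n_w,
                          alpha_theta=0.01, alpha_w=0.05, gamma=0.99, n_episodes=1000):
    """One-step (TD(0)) actor-critic implementing the updates of Eq. (135),
    with a linear critic v_hat(s,w) = w . phi(s) and the soft-max actor of Eq. (111).
    The env/x/phi interfaces and all hyperparameters are assumptions of this sketch."""
    theta, w = np.zeros(n_theta), np.zeros(n_w)

    def pi(s):
        h = np.array([theta @ x(s, a) for a in range(n_actions)])
        h -= h.max()
        p = np.exp(h)
        return p / p.sum()

    for _ in range(n_episodes):
        s, done, I = env.reset(), False, 1.0
        while not done:
            a = np.random.choice(n_actions, p=pi(s))
            s_next, r, done = env.step(a)

            # TD error: the bootstrapped target replaces q_pi, and v_hat(s,w) serves as the baseline b(s)
            v_next = 0.0 if done else w @ phi(s_next)
            delta = r + gamma * v_next - w @ phi(s)

            # critic update: for a linear model, grad_w v_hat(s,w) = phi(s)
            w += alpha_w * delta * phi(s)
            # actor update of Eq. (135), with I = gamma^t
            grad_log_pi = x(s, a) - pi(s) @ np.array([x(s, b) for b in range(n_actions)])
            theta += alpha_theta * I * delta * grad_log_pi

            I *= gamma
            s = s_next
    return theta, w
```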
Following similar steps, the policy gradient theorem for environments with continuous state and action spaces can also be proven, with the summations over states and actions replaced by integrals:

\nabla_\theta J(\theta) \;\propto\; \int_{\cal S} \mu_\pi(s) \int_{\cal A} q_\pi(s,a)\, \nabla_\theta \pi(a|s,\theta)\; da\; ds    (136)
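For a continuous action space, one common policy model, used here purely as an illustration and not taken from the text, is a Gaussian whose mean is a parameterized function of the state; its log-gradient is the quantity a sample-based estimate of Eq. (136) would use.

```python
import numpy as np

def gaussian_policy_sample_and_grad(s, theta, phi, sigma=1.0, rng=np.random.default_rng()):
    """Continuous-action policy pi(a|s,theta) = N(mu(s), sigma^2), with an assumed
    linear mean mu(s) = theta . phi(s) and fixed sigma.  Returns a sampled action and
    grad_theta ln pi(a|s,theta) = ((a - mu(s)) / sigma^2) * phi(s)."""
    mu = theta @ phi(s)
    a = rng.normal(mu, sigma)
    grad_log_pi = (a - mu) / sigma ** 2 * phi(s)
    return a, grad_log_pi

# example usage: phi maps a scalar state to simple polynomial features (an arbitrary choice)
phi = lambda s: np.array([1.0, s, s ** 2])
print(gaussian_policy_sample_and_grad(0.5, np.zeros(3), phi))
```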