All previously considered algorithms are based on either the state value function $v_\pi(s)$ for each state $s$, or the action value function $q_\pi(s,a)$ for each state-action pair $(s,a)$. As these function values can be considered as either a 1-D or a 2-D table, the algorithms based on such function values are called tabular methods. However, when the number of states and the number of actions in each state are large (e.g., the game of Go has on the order of $10^{170}$ states), the tabular methods are no longer suitable due to the unrealistic table sizes.
In such a case, we can consider approximating the state value function $v_\pi(s)$ or action value function $q_\pi(s,a)$ by a function $\hat{v}(s,\mathbf{w})$ or $\hat{q}(s,a,\mathbf{w})$, parameterized by a set of parameters as the components of a vector $\mathbf{w}=(w_1,\dots,w_N)^T$. By doing so, the state or action value functions can be represented more conveniently in terms of a much smaller number of parameters in $\mathbf{w}$.
This approach can be considered as a regression problem: we fit a continuous function $\hat{v}(s,\mathbf{w})$ or $\hat{q}(s,a,\mathbf{w})$ to a set of discrete and finite data points $v_\pi(s)$ or $q_\pi(s,a)$ obtained by sampling the MDP of the environment, so that all points in the state space, including those for states or state-action pairs never actually observed during the sampling process, can be represented. Examples of such value function approximators include linear combinations of features (considered below) and artificial neural networks.
As in regression problems, we wish to find the optimal parameter vector $\mathbf{w}$ for the modeling function $\hat{v}(s,\mathbf{w})$ or $\hat{q}(s,a,\mathbf{w})$ so that the following objective function, the mean squared error between the approximated values and the sample values, is minimized:
$$J(\mathbf{w})=E_\pi\left[\left(v_\pi(S)-\hat{v}(S,\mathbf{w})\right)^2\right] \tag{72}$$
or, for action values,
$$J(\mathbf{w})=E_\pi\left[\left(q_\pi(S,A)-\hat{q}(S,A,\mathbf{w})\right)^2\right] \tag{73}$$
This objective function can be minimized by gradient descent:
$$\Delta\mathbf{w}=-\frac{1}{2}\,\alpha\,\nabla_\mathbf{w}J(\mathbf{w}) \tag{74}$$
$$=\alpha\,E_\pi\left[\left(v_\pi(S)-\hat{v}(S,\mathbf{w})\right)\nabla_\mathbf{w}\hat{v}(S,\mathbf{w})\right] \tag{75}$$
In the following, we will consider the special case where the state value is approximated by a linear function of an $N$-dimensional feature vector $\mathbf{x}(s)=(x_1(s),\dots,x_N(s))^T$ that represents the state $s$:
$$\hat{v}(s,\mathbf{w})=\mathbf{x}^T(s)\,\mathbf{w}=\sum_{n=1}^N x_n(s)\,w_n \tag{77}$$
As a simple example, if we represent each of the $N$ states $s_n$ $(n=1,\dots,N)$ by an $N$-dimensional binary feature vector $\mathbf{x}(s_n)$ with all zero components except the $n$-th one, $x_n(s_n)=1$, then the weight vector can be chosen as $\mathbf{w}=(v(s_1),\dots,v(s_N))^T$, and we have $\hat{v}(s_n,\mathbf{w})=\mathbf{x}^T(s_n)\,\mathbf{w}=w_n=v(s_n)$. In this case, each state $s_n$ is represented by its value function $v(s_n)$, the same as in all previous discussions.
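As a quick check of this special case, the following Python sketch (the five states and their tabular values are made up purely for illustration) shows that with one-hot feature vectors the linear approximator reproduces the stored table entries exactly:

```python
import numpy as np

N = 5                                            # number of states (illustrative)
v_table = np.array([0.1, 0.4, 0.2, 0.9, 0.5])    # a made-up tabular value function

def one_hot(n, N):
    """Binary feature vector x(s_n): all zeros except the n-th component."""
    x = np.zeros(N)
    x[n] = 1.0
    return x

w = v_table.copy()                               # choose the weights to be the table entries
for n in range(N):
    v_hat = one_hot(n, N) @ w                    # v_hat(s_n, w) = x(s_n)^T w = w_n
    assert np.isclose(v_hat, v_table[n])
print("one-hot features reproduce the table exactly:", w)
```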
The objective function for the mean squared error of such a linear approximating function is
$$J(\mathbf{w})=E_\pi\left[\left(v_\pi(S)-\mathbf{x}^T(S)\,\mathbf{w}\right)^2\right] \tag{78}$$
and its gradient is
$$\nabla_\mathbf{w}J(\mathbf{w})=E_\pi\left[\nabla_\mathbf{w}\left(v_\pi(S)-\mathbf{x}^T(S)\,\mathbf{w}\right)^2\right] \tag{79}$$
$$=-2\,E_\pi\left[\left(v_\pi(S)-\mathbf{x}^T(S)\,\mathbf{w}\right)\nabla_\mathbf{w}\left(\mathbf{x}^T(S)\,\mathbf{w}\right)\right] \tag{80}$$
$$=-2\,E_\pi\left[\left(v_\pi(S)-\mathbf{x}^T(S)\,\mathbf{w}\right)\mathbf{x}(S)\right] \tag{81}$$
However, as reinforcement learning is typically an online problem, the parameter vector $\mathbf{w}$ of the approximator needs to be modified in real time whenever a new piece of data becomes available during the sampling of the environment. In this case, we can use the stochastic gradient descent (SGD) method, based on the gradient of the squared error of a single sample rather than its expectation:
$$\Delta\mathbf{w}=-\frac{1}{2}\,\alpha\,\nabla_\mathbf{w}\left(v_\pi(S_t)-\hat{v}(S_t,\mathbf{w})\right)^2 \tag{82}$$
$$=\alpha\left(v_\pi(S_t)-\hat{v}(S_t,\mathbf{w})\right)\nabla_\mathbf{w}\hat{v}(S_t,\mathbf{w}) \tag{83}$$
$$=\alpha\left(v_\pi(S_t)-\mathbf{x}^T(S_t)\,\mathbf{w}\right)\mathbf{x}(S_t) \tag{84}$$
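The SGD update in Eqs. (82)-(84) amounts to a single line of code per sample. Below is a minimal Python sketch for the linear case, where the argument `target` is a hypothetical placeholder for whatever estimate of $v_\pi(S_t)$ is available (the MC return, TD target, or $\lambda$-return discussed below):

```python
import numpy as np

def sgd_update(w, x_s, target, alpha):
    """One stochastic gradient step for a linear approximator:
    w <- w + alpha * (target - x(s)^T w) * x(s), as in Eqs. (82)-(84)."""
    v_hat = x_s @ w                  # current approximation x(s)^T w
    return w + alpha * (target - v_hat) * x_s

# toy usage with made-up numbers
w = np.zeros(3)
w = sgd_update(w, x_s=np.array([1.0, 0.0, 2.0]), target=1.5, alpha=0.1)
print(w)                             # [0.15 0.   0.3 ]
```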
We note that the expectation of the estimate of $\mathbf{w}$ found by this SGD method is the same as that obtained by the full gradient descent method. Also, this iteration will converge to the global minimum of the objective function $J(\mathbf{w})$ in Eq. (78), as it is a quadratic function of $\mathbf{w}$ with only one minimum, which is global. However, the true state value function $v_\pi(S_t)$ appearing in the update is unknown, and it can be estimated in several different ways, similar to the corresponding algorithms discussed before.
Monte Carlo (MC) method: The value function is estimated by the actual return $G_t$ for each state $S_t$ visited in an episode, obtained only at the end of each episode while sampling the environment:
$$\Delta\mathbf{w}=\alpha\left(G_t-\hat{v}(S_t,\mathbf{w})\right)\nabla_\mathbf{w}\hat{v}(S_t,\mathbf{w})=\alpha\left(G_t-\mathbf{x}^T(S_t)\,\mathbf{w}\right)\mathbf{x}(S_t) \tag{85}$$
The resulting algorithm based on the MC method is similar to the value evaluation algorithm discussed previously, with the table update replaced by the weight update above.
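A minimal Python sketch of this gradient MC evaluation algorithm is given below; the environment interface (`env.reset()` returning the initial state, `env.step(a)` returning `(next_state, reward, done)`) and the `features` and `policy` helpers are assumptions for illustration, not part of the original algorithm description:

```python
import numpy as np

def mc_prediction(env, policy, features, n_features,
                  episodes=1000, alpha=0.01, gamma=1.0):
    """Gradient Monte Carlo evaluation of v_pi with a linear approximator.
    Assumed interface: env.reset() -> state; env.step(a) -> (next_state, reward, done)."""
    w = np.zeros(n_features)
    for _ in range(episodes):
        trajectory = []                          # (state, reward) pairs of one episode
        s, done = env.reset(), False
        while not done:
            s_next, r, done = env.step(policy(s))
            trajectory.append((s, r))
            s = s_next
        G = 0.0                                  # accumulate returns backwards: G_t = r + gamma * G_{t+1}
        for s, r in reversed(trajectory):
            G = r + gamma * G
            x = features(s)
            w += alpha * (G - x @ w) * x         # Eq. (85), with v_pi(S_t) estimated by G_t
    return w
```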
Temporal difference (TD) method: The value function is estimated by the TD target, the sum of the immediate reward $R_{t+1}$ and the discounted approximated value of the next state $\gamma\,\hat{v}(S_{t+1},\mathbf{w})$, at each time step of each episode while sampling the environment:
$$\Delta\mathbf{w}=\alpha\left(R_{t+1}+\gamma\,\hat{v}(S_{t+1},\mathbf{w})-\hat{v}(S_t,\mathbf{w})\right)\nabla_\mathbf{w}\hat{v}(S_t,\mathbf{w}) \tag{86}$$
The resulting algorithm based on the TD method is likewise similar to the value evaluation algorithm discussed previously, with the table update replaced by the weight update above.
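A corresponding Python sketch of the TD version, under the same assumed environment interface and helpers, is:

```python
import numpy as np

def td0_prediction(env, policy, features, n_features,
                   episodes=1000, alpha=0.01, gamma=1.0):
    """TD(0) evaluation of v_pi with a linear approximator.
    Assumed interface: env.reset() -> state; env.step(a) -> (next_state, reward, done)."""
    w = np.zeros(n_features)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            s_next, r, done = env.step(policy(s))
            x = features(s)
            if done:
                target = r                       # no next state at the end of the episode
            else:
                target = r + gamma * (features(s_next) @ w)   # TD target
            w += alpha * (target - x @ w) * x    # Eq. (86)
            s = s_next
    return w
```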
TD($\lambda$) method: The value function is estimated based on the $\lambda$-return $G_t^\lambda$ given in Eq. (62). As discussed before, TD($\lambda$) has two flavors. In the forward view, the $\lambda$-return serves as the sample value:
$$\Delta\mathbf{w}=\alpha\left(G_t^\lambda-\hat{v}(S_t,\mathbf{w})\right)\nabla_\mathbf{w}\hat{v}(S_t,\mathbf{w}) \tag{87}$$
In the backward view, an eligibility trace vector $\mathbf{e}_t$ is maintained and the weights are updated by the TD error $\delta_t$ at each time step:
$$\delta_t=R_{t+1}+\gamma\,\hat{v}(S_{t+1},\mathbf{w})-\hat{v}(S_t,\mathbf{w}),\qquad \mathbf{e}_t=\gamma\lambda\,\mathbf{e}_{t-1}+\nabla_\mathbf{w}\hat{v}(S_t,\mathbf{w}) \tag{88}$$
$$\Delta\mathbf{w}=\alpha\,\delta_t\,\mathbf{e}_t \tag{89}$$
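A sketch of the backward view in Eqs. (88) and (89), maintaining an eligibility trace vector, under the same assumed environment interface and helpers:

```python
import numpy as np

def td_lambda_prediction(env, policy, features, n_features,
                         episodes=1000, alpha=0.01, gamma=1.0, lam=0.9):
    """Backward-view TD(lambda) evaluation with a linear approximator.
    Assumed interface: env.reset() -> state; env.step(a) -> (next_state, reward, done)."""
    w = np.zeros(n_features)
    for _ in range(episodes):
        e = np.zeros(n_features)                 # eligibility trace vector e_t
        s, done = env.reset(), False
        while not done:
            s_next, r, done = env.step(policy(s))
            x = features(s)
            v_next = 0.0 if done else features(s_next) @ w
            delta = r + gamma * v_next - x @ w   # TD error, Eq. (88)
            e = gamma * lam * e + x              # trace update, Eq. (88)
            w += alpha * delta * e               # Eq. (89)
            s = s_next
    return w
```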
Given a policy $\pi$ for an MDP, we define a probability distribution $d_\pi(s)$ over all states $s$ visited according to $\pi$, satisfying
$$d_\pi(s)\ge 0\quad\text{for all }s \tag{90}$$
$$\sum_s d_\pi(s)=1 \tag{91}$$
Given the probability distribution $d_\pi(s)$, the mean squared error of the value function approximation for a policy $\pi$ can be expressed as
$$J(\mathbf{w})=\sum_s d_\pi(s)\left(v_\pi(s)-\hat{v}(s,\mathbf{w})\right)^2 \tag{92}$$
$$=E_{d_\pi}\left[\left(v_\pi(S)-\hat{v}(S,\mathbf{w})\right)^2\right] \tag{93}$$
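For a finite state space, this weighted mean squared error can be evaluated directly once $d_\pi$, $v_\pi$, and the features are known; a small NumPy sketch with made-up numbers:

```python
import numpy as np

d_pi = np.array([0.5, 0.3, 0.2])                 # made-up state distribution d_pi(s)
v_pi = np.array([1.0, 0.0, -1.0])                # made-up true values v_pi(s)
X = np.array([[1.0, 0.0],                        # made-up feature vectors x(s), one row per state
              [0.5, 0.5],
              [0.0, 1.0]])
w = np.array([0.8, -0.9])                        # some weight vector

v_hat = X @ w                                    # approximated values for all states
J = np.sum(d_pi * (v_pi - v_hat) ** 2)           # Eq. (92)
print("weighted MSE J(w) =", J)
```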