Consider a linear equation system of $m$ equations and $n$ variables:
\[
\begin{cases}
a_{11} x_1 + a_{12} x_2 + \cdots + a_{1n} x_n = b_1 \\
\quad\vdots \\
a_{m1} x_1 + a_{m2} x_2 + \cdots + a_{mn} x_n = b_m
\end{cases}
\tag{2}
\]
This can be expressed in vector form:
\[
\mathbf{A} \mathbf{x} = \mathbf{b}
\tag{3}
\]
where
\[
\mathbf{A} = \begin{pmatrix}
a_{11} & \cdots & a_{1n} \\
\vdots & \ddots & \vdots \\
a_{m1} & \cdots & a_{mn}
\end{pmatrix},
\qquad
\mathbf{x} = \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix},
\qquad
\mathbf{b} = \begin{pmatrix} b_1 \\ \vdots \\ b_m \end{pmatrix}
\tag{4}
\]
In general, the existence and uniqueness of the solution can be
determined based on the \emph{fundamental theorem of linear algebra}
(see the table in the section on the fundamental theorem of linear
algebra), depending on the rank $r$ of the coefficient matrix
$\mathbf{A}$, and on whether the system is underdetermined
(underconstrained), with $m < n$, or overdetermined (overconstrained),
with $m > n$.
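As a small numerical illustration (a Python/NumPy sketch only; the
particular matrices below are made up for illustration), the rank of
$\mathbf{A}$ tells whether the system is over- or underdetermined, and
comparing it with the rank of the augmented matrix
$[\mathbf{A}\;\mathbf{b}]$ tells whether $\mathbf{b}$ lies in the column
space of $\mathbf{A}$, i.e., whether an exact solution exists:
\begin{verbatim}
# A small sketch (Python/NumPy; matrices made up for illustration):
# classify a system by the rank of A and check whether b lies in the
# column space of A, i.e., whether an exact solution exists.
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])        # m = 3 equations, n = 2 unknowns
b = np.array([1.0, 1.0, 2.0])

m, n = A.shape
r = np.linalg.matrix_rank(A)
r_aug = np.linalg.matrix_rank(np.column_stack([A, b]))

print(f"m={m}, n={n}, rank(A)={r}")
print("overdetermined" if m > n else "underdetermined or square")
print("b in column space of A:", r == r_aug)   # False here: no exact solution
\end{verbatim}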
If $\mathbf{A}$ is a full-rank square matrix with $m = n$, then its
inverse $\mathbf{A}^{-1}$ exists and the system has a unique solution
$\mathbf{x} = \mathbf{A}^{-1}\mathbf{b}$. But if $m > n$ and
$\mathbf{b}$ is not in the column space of $\mathbf{A}$, the system is
overconstrained with no solutions. In this case we can still find an
optimal approximate solution that minimizes the
\emph{sum-of-squares error (SSE)}, called the \emph{objective function}
in general optimization problems, defined as
\[
\varepsilon(\mathbf{x})
= \frac{1}{2} \|\mathbf{r}\|^2
= \frac{1}{2} \|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2
= \frac{1}{2} (\mathbf{A}\mathbf{x} - \mathbf{b})^T (\mathbf{A}\mathbf{x} - \mathbf{b})
\tag{5}
\]
where $\mathbf{r} = \mathbf{A}\mathbf{x} - \mathbf{b}$ is the
\emph{residual} of the system. The optimal solution $\hat{\mathbf{x}}$
that minimizes $\varepsilon(\mathbf{x})$ can be found by setting its
derivative with respect to $\mathbf{x}$ to zero (see the section on
differentiation with respect to a vector variable):
\[
\frac{d}{d\mathbf{x}} \varepsilon(\mathbf{x})
= \frac{d}{d\mathbf{x}} \frac{1}{2} \|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2
= \mathbf{A}^T (\mathbf{A}\mathbf{x} - \mathbf{b})
= \mathbf{A}^T \mathbf{A} \mathbf{x} - \mathbf{A}^T \mathbf{b}
= \mathbf{0}
\tag{6}
\]
Solving this matrix equation we get the optimal approximate solution
\[
\hat{\mathbf{x}} = (\mathbf{A}^T \mathbf{A})^{-1} \mathbf{A}^T \mathbf{b}
= \mathbf{A}^- \mathbf{b}
\tag{7}
\]
where $\mathbf{A}^- = (\mathbf{A}^T \mathbf{A})^{-1} \mathbf{A}^T$ is the
\emph{pseudo-inverse} (see the section on the pseudo-inverse) of the
non-square matrix $\mathbf{A}$. Here we assume the equations are
independent, so that the $n \times n$ square matrix
$\mathbf{A}^T \mathbf{A}$ has full rank $n$, i.e., it is nonsingular and
invertible with all eigenvalues non-zero.
\begin{figure}[ht]
\centering
\includegraphics[width=90mm]{figures/PseudoInverse.eps}
\caption{The Pseudo Inverse}
\end{figure}
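As a numerical check (a minimal Python/NumPy sketch; the data are
randomly generated for illustration only), the pseudo-inverse solution
of Eq. (7) can be computed directly from the normal equations and
compared against a standard least-squares solver:
\begin{verbatim}
# A minimal sketch (Python/NumPy, random data for illustration): compute
# the least-squares solution of an overdetermined system A x = b via the
# pseudo-inverse of Eq. (7) and compare with numpy's built-in solver.
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 3                        # m > n: overdetermined
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)         # generally not in the column space of A

x_hat = np.linalg.inv(A.T @ A) @ A.T @ b       # (A^T A)^{-1} A^T b, Eq. (7)
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)  # library least-squares solution

print(np.allclose(x_hat, x_ref))               # True: same minimizer of SSE
print(np.linalg.norm(A @ x_hat - b))           # residual norm ||A x - b||
\end{verbatim}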
If the equations are barely independent, i.e., some eigenvalues of
$\mathbf{A}^T \mathbf{A}$ are very close to zero, then it is
near-singular but still invertible, and some eigenvalues of its inverse
$(\mathbf{A}^T \mathbf{A})^{-1}$ may take huge values. Consequently the
result in Eq. (7) may vary greatly due to small random fluctuations in
either $\mathbf{A}$ or $\mathbf{b}$. In this case, the system is said to
be \emph{ill-conditioned} or \emph{ill-posed}, as it is prone to noise,
i.e., the solution is highly sensitive to noise and therefore
unreliable.
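A toy numerical example may make this concrete (a Python/NumPy sketch;
the nearly dependent columns below are contrived for illustration): a
perturbation of $\mathbf{b}$ on the order of $10^{-3}$ produces a
drastically different solution when $\mathbf{A}^T \mathbf{A}$ is
near-singular.
\begin{verbatim}
# A contrived sketch of ill-conditioning: the two columns of A are nearly
# dependent, so A^T A is near-singular and the solution of Eq. (7) changes
# drastically under a tiny perturbation of b.
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.0001],
              [1.0, 0.9999]])
b = np.array([2.0, 2.0, 2.0])

def pinv_solve(A, b):
    return np.linalg.inv(A.T @ A) @ A.T @ b          # Eq. (7)

x  = pinv_solve(A, b)
xp = pinv_solve(A, b + np.array([0.0, 1e-3, 0.0]))   # perturb b slightly

print(np.linalg.eigvalsh(A.T @ A))   # one eigenvalue is nearly zero
print(x)                             # original solution, roughly (2, 0)
print(xp)                            # very different after a 0.001 change in b
\end{verbatim}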
Such an ill-posed problem can be addressed by \emph{regularization}, so
that the solution $\mathbf{x}$ is forced to avoid taking huge values and
becomes less sensitive to noise. Specifically, we add to the objective
function a penalty term for large $\|\mathbf{x}\|$, scaled by a
\emph{hyperparameter} $\lambda$:
\[
\varepsilon_\lambda(\mathbf{x})
= \frac{1}{2} \left( \|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2
+ \lambda \|\mathbf{x}\|^2 \right)
\tag{8}
\]
By minimizing $\varepsilon_\lambda(\mathbf{x})$, we can obtain a
solution $\hat{\mathbf{x}}$ of small norm $\|\hat{\mathbf{x}}\|$ as well
as a small error $\|\mathbf{A}\hat{\mathbf{x}} - \mathbf{b}\|$. Just as
before, the solution $\hat{\mathbf{x}}$ can be obtained by setting the
derivative of $\varepsilon_\lambda(\mathbf{x})$ to zero,
\[
\frac{d}{d\mathbf{x}} \varepsilon_\lambda(\mathbf{x})
= \mathbf{A}^T (\mathbf{A}\mathbf{x} - \mathbf{b}) + \lambda \mathbf{x}
= \mathbf{0}
\tag{9}
\]
and solving the resulting equation to get:
\[
\hat{\mathbf{x}} = (\mathbf{A}^T \mathbf{A} + \lambda \mathbf{I})^{-1}
\mathbf{A}^T \mathbf{b}
\tag{10}
\]
We see that even if $\mathbf{A}^T \mathbf{A}$ is near-singular, the
matrix $\mathbf{A}^T \mathbf{A} + \lambda \mathbf{I}$ is not, due to the
additional term $\lambda \mathbf{I}$.
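Continuing the toy example above (a sketch only; \texttt{lam} stands for
the hyperparameter $\lambda$), the regularized solution of Eq. (10)
stays well-behaved even though $\mathbf{A}^T \mathbf{A}$ is
near-singular:
\begin{verbatim}
# A minimal sketch of the regularized solution of Eq. (10), reusing the
# nearly dependent A and b from the previous sketch.
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.0001],
              [1.0, 0.9999]])
b = np.array([2.0, 2.0, 2.0])

def ridge_solve(A, b, lam):
    n = A.shape[1]
    return np.linalg.inv(A.T @ A + lam * np.eye(n)) @ A.T @ b   # Eq. (10)

x_reg = ridge_solve(A, b, lam=0.1)
print(x_reg)                               # both components stay moderate
print(np.linalg.norm(A @ x_reg - b))       # residual remains small
\end{verbatim}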
By adjusting the hyperparameter $\lambda$, we can make a proper
trade-off between accuracy and stability (illustrated numerically in the
sketch following the list below):
- Small $\lambda$: the solution is more accurate but less stable, as it
is more prone to noise, i.e., the variance error may be large. This is
called \emph{overfitting};
- Large $\lambda$: the solution is less accurate but more stable, as it
is less affected by noise. This is called \emph{underfitting}.
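The trade-off can be seen in a rough sketch like the following (random
data generated for illustration only), where the residual
$\|\mathbf{A}\hat{\mathbf{x}} - \mathbf{b}\|$ measures accuracy and the
solution norm $\|\hat{\mathbf{x}}\|$ serves as a proxy for stability:
\begin{verbatim}
# A rough sketch of the accuracy/stability trade-off: sweep lambda and
# report the residual (accuracy) and the solution norm (stability proxy).
import numpy as np

rng = np.random.default_rng(1)
m, n = 20, 5
A = rng.standard_normal((m, n))
b = A @ rng.standard_normal(n) + 0.1 * rng.standard_normal(m)  # noisy data

for lam in [1e-6, 1e-2, 1.0, 100.0]:
    x = np.linalg.inv(A.T @ A + lam * np.eye(n)) @ A.T @ b      # Eq. (10)
    print(f"lambda={lam:8.0e}  residual={np.linalg.norm(A @ x - b):.3f}  "
          f"norm={np.linalg.norm(x):.3f}")
# Small lambda: smallest residual but the solution can chase the noise
# (overfitting); large lambda: larger residual, smaller norm (underfitting).
\end{verbatim}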
The issue of overfitting versus underfitting is of essential importance
in machine learning in general, and will be addressed more formally when
we discuss various regression and classification algorithms in later
chapters.