Linear Equation Systems

Consider a linear equation system of $m$ equations and $n$ variables:

\[
\begin{cases}
a_{11} x_1 + a_{12} x_2 + \cdots + a_{1n} x_n = b_1 \\
a_{21} x_1 + a_{22} x_2 + \cdots + a_{2n} x_n = b_2 \\
\qquad\qquad\vdots \\
a_{m1} x_1 + a_{m2} x_2 + \cdots + a_{mn} x_n = b_m
\end{cases}
\tag{2}
\]
This can be expressed in vector form:
\[
\mathbf{A}\mathbf{x} = \mathbf{b}
\tag{3}
\]
where
\[
\mathbf{A} = \begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{bmatrix},
\qquad
\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix},
\qquad
\mathbf{b} = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{bmatrix}
\tag{4}
\]
In general, the existence and uniqueness of the solution can be determined based on the fundamental theorem of linear algebra (Table \ref{TableFundamentalAlgebra} in Section \ref{fundamentalTheoremAlgebra}), depending on the rank $r$ of the coefficient matrix $\mathbf{A}$ and on whether the system is underdetermined (underconstrained) with $m < n$, or overdetermined (overconstrained) with $m > n$.
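As an illustration of this classification, the following is a minimal NumPy sketch (the matrices are made-up toy examples, not taken from the text) that computes the rank of $\mathbf{A}$ and of the augmented matrix $[\mathbf{A}\,|\,\mathbf{b}]$ to decide whether an exact solution exists:

\begin{verbatim}
import numpy as np

def classify_system(A, b):
    """Classify the linear system A x = b by its shape and ranks."""
    m, n = A.shape
    r = np.linalg.matrix_rank(A)                             # rank of A
    r_aug = np.linalg.matrix_rank(np.column_stack([A, b]))   # rank of [A | b]
    if r < r_aug:
        return "no exact solution: b is not in the column space of A"
    if r == n:
        return "unique solution"
    return "underdetermined: infinitely many solutions"

# Made-up examples
A_square = np.array([[2.0, 1.0], [1.0, 3.0]])                # m = n = 2, full rank
A_tall   = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])    # m = 3 > n = 2
print(classify_system(A_square, np.array([1.0, 2.0])))       # unique solution
print(classify_system(A_tall,   np.array([1.0, 2.0, 4.0])))  # no exact solution
\end{verbatim}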

If $\mathbf{A}$ is a full-rank square matrix with $m = n$, then its inverse $\mathbf{A}^{-1}$ exists and the system has a unique solution $\mathbf{x} = \mathbf{A}^{-1}\mathbf{b}$. But if $m > n$ and $\mathbf{b}$ is not in the column space of $\mathbf{A}$, the system is overconstrained with no solution. In this case we can still find an optimal approximate solution that minimizes the sum-of-squares error (SSE), called the objective function in general optimization problems, defined as

\[
\varepsilon(\mathbf{x})
= \|\mathbf{r}\|^2
= \|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2
= (\mathbf{A}\mathbf{x} - \mathbf{b})^T (\mathbf{A}\mathbf{x} - \mathbf{b})
\tag{5}
\]
where $\mathbf{r} = \mathbf{A}\mathbf{x} - \mathbf{b}$ is the residual of the system. The optimal solution $\mathbf{x}^*$ that minimizes $\varepsilon(\mathbf{x})$ can be found by setting its derivative with respect to $\mathbf{x}$ to zero (see Section \ref{vectorMatrixDifferentiation} for differentiation with respect to a vector variable):
\[
\frac{d}{d\mathbf{x}} \varepsilon(\mathbf{x})
= \frac{d}{d\mathbf{x}} (\mathbf{A}\mathbf{x} - \mathbf{b})^T (\mathbf{A}\mathbf{x} - \mathbf{b})
= 2 \mathbf{A}^T (\mathbf{A}\mathbf{x} - \mathbf{b}) = \mathbf{0}
\tag{6}
\]
Solving this matrix equation we get the optimal approximate solution
\[
\mathbf{x}^* = (\mathbf{A}^T \mathbf{A})^{-1} \mathbf{A}^T \mathbf{b} = \mathbf{A}^- \mathbf{b}
\tag{7}
\]
where $\mathbf{A}^- = (\mathbf{A}^T \mathbf{A})^{-1} \mathbf{A}^T$ is the pseudo-inverse (see Section \ref{pseudoInverse}) of the non-square matrix $\mathbf{A}$. Here we assume the $n$ columns of $\mathbf{A}$ are linearly independent, i.e., the $n \times n$ square matrix $\mathbf{A}^T \mathbf{A}$ has full rank $n$, so that it is nonsingular and invertible with all eigenvalues non-zero.

Figure 1.1: The Pseudo Inverse (figures/PseudoInverse.eps)
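As a minimal numerical sketch of Eq. (7), assuming a small made-up overdetermined system, the pseudo-inverse solution can be computed directly and cross-checked against NumPy's built-in least-squares routines:

\begin{verbatim}
import numpy as np

# Made-up overdetermined system: m = 4 equations, n = 2 unknowns
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
b = np.array([6.0, 5.0, 7.0, 10.0])

# Eq. (7): x* = (A^T A)^{-1} A^T b
x_star = np.linalg.inv(A.T @ A) @ A.T @ b

# Cross-checks with library routines
x_pinv  = np.linalg.pinv(A) @ b                  # pseudo-inverse via SVD
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]   # least-squares solver

print(x_star, x_pinv, x_lstsq)                   # all three agree up to round-off
print("SSE:", np.sum((A @ x_star - b) ** 2))     # minimized sum-of-squares error
\end{verbatim}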

If the columns of $\mathbf{A}$ are barely independent, i.e., some eigenvalues of $\mathbf{A}^T \mathbf{A}$ are very close to zero, then $\mathbf{A}^T \mathbf{A}$ is near-singular but still invertible, and some eigenvalues of its inverse $(\mathbf{A}^T \mathbf{A})^{-1}$ may take huge values. Consequently the result in Eq. (7) may vary greatly due to small random fluctuations in either $\mathbf{A}$ or $\mathbf{b}$. In this case, the system is said to be ill-conditioned or ill-posed, as it is prone to noise, i.e., the solution is highly sensitive to noise and therefore unreliable.
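To see this sensitivity numerically, here is a small sketch with a deliberately near-singular toy system (the matrix, right-hand side, and perturbation are assumptions for illustration); a tiny change in $\mathbf{b}$ moves the solution of Eq. (7) drastically:

\begin{verbatim}
import numpy as np

# Made-up near-singular case: the two columns of A are almost parallel
A = np.array([[1.0, 1.0],
              [1.0, 1.0001],
              [1.0, 0.9999]])
b = np.array([2.0, 2.0, 2.0])

AtA = A.T @ A
print("eigenvalues of A^T A:", np.linalg.eigvalsh(AtA))  # one eigenvalue near zero

x       = np.linalg.inv(AtA) @ A.T @ b                   # solution of Eq. (7)
x_noisy = np.linalg.inv(AtA) @ A.T @ (b + 1e-4 * np.array([1.0, -1.0, 1.0]))

print("x        :", x)        # roughly [2, 0]
print("x (noisy):", x_noisy)  # changes by order 1 from a 1e-4 perturbation of b
\end{verbatim}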

Such an ill-posed problem can be addressed by regularization, which forces the solution $\mathbf{x}^*$ to avoid taking huge values and thus makes it less sensitive to noise. Specifically, we add to the objective function a penalty term for large $\|\mathbf{x}\|$, scaled by a hyperparameter $\lambda$:

\[
\varepsilon_\lambda(\mathbf{x}) = \|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2 + \lambda \|\mathbf{x}\|^2
\tag{8}
\]
By minimizing $\varepsilon_\lambda(\mathbf{x})$, we can obtain a solution $\mathbf{x}^*$ of small norm $\|\mathbf{x}^*\|$ as well as a small error $\|\mathbf{A}\mathbf{x}^* - \mathbf{b}\|$. Just as before, the solution $\mathbf{x}^*$ can be obtained by setting the derivative of $\varepsilon_\lambda(\mathbf{x})$ to zero
\[
\frac{d}{d\mathbf{x}} \varepsilon_\lambda(\mathbf{x})
= 2 \mathbf{A}^T (\mathbf{A}\mathbf{x} - \mathbf{b}) + 2 \lambda \mathbf{x} = \mathbf{0}
\tag{9}
\]
and solving the resulting equation to get:
\[
\mathbf{x}^* = (\mathbf{A}^T \mathbf{A} + \lambda \mathbf{I})^{-1} \mathbf{A}^T \mathbf{b}
\tag{10}
\]
We see that even if $\mathbf{A}^T \mathbf{A}$ is near-singular, the matrix $\mathbf{A}^T \mathbf{A} + \lambda \mathbf{I}$ is not, due to the additional term $\lambda \mathbf{I}$.
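Continuing the near-singular toy example above, a minimal sketch of the regularized solution in Eq. (10) (the value of $\lambda$ is an arbitrary choice for illustration) shows that the same perturbation of $\mathbf{b}$ now barely moves the solution:

\begin{verbatim}
import numpy as np

# Same made-up near-singular system as before
A = np.array([[1.0, 1.0],
              [1.0, 1.0001],
              [1.0, 0.9999]])
b = np.array([2.0, 2.0, 2.0])
lam = 1e-3                       # hyperparameter lambda (arbitrary choice)
n = A.shape[1]

# Eq. (10): x* = (A^T A + lambda I)^{-1} A^T b
x_reg       = np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)
x_reg_noisy = np.linalg.solve(A.T @ A + lam * np.eye(n),
                              A.T @ (b + 1e-4 * np.array([1.0, -1.0, 1.0])))

print("regularized x        :", x_reg)
print("regularized x (noisy):", x_reg_noisy)   # nearly identical to x_reg
\end{verbatim}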

By adjusting the hyperparameter $\lambda$, we can make a proper trade-off between accuracy and stability (a numerical sketch of this trade-off follows the two cases below).

  • Small $\lambda$: the solution is more accurate but less stable, as it is more prone to noise, i.e., the variance error may be large. This is called overfitting;
  • Large $\lambda$: the solution is less accurate but more stable, as it is less affected by noise. This is called underfitting.
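The following sketch makes this trade-off visible for the toy system above (the grid of $\lambda$ values and the noisy right-hand side are arbitrary choices): as $\lambda$ grows, the residual error increases while the solution norm shrinks and stabilizes.

\begin{verbatim}
import numpy as np

A = np.array([[1.0, 1.0], [1.0, 1.0001], [1.0, 0.9999]])
b = np.array([2.0, 2.1, 1.9])                  # made-up noisy right-hand side
n = A.shape[1]

for lam in [0.0, 1e-6, 1e-3, 1e-1, 1.0, 10.0]:
    x = np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)
    err  = np.linalg.norm(A @ x - b)           # accuracy: residual error
    norm = np.linalg.norm(x)                   # stability: size of the solution
    print(f"lambda={lam:8.1e}  error={err:8.4f}  ||x||={norm:10.4f}")
\end{verbatim}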
The issue of overfitting versus underfitting is of essential importance in machine learning in general, and will be addressed more formally when discussing various regression and classification algorithms in later chapters.