
2.5: The Projection Theorem and the Least Squares Estimate


    The solution to our least squares problem is now given by the Projection Theorem, also referred to as the Orthogonality Principle, which states that

    \[\widehat{e}=(y-A \widehat{x}) \quad \perp \quad \mathcal{R}(A)\nonumber\]

    from which - as we shall see - \(\widehat{x}\) can be determined. In words, the theorem/"principle" states that the point \(\widehat{y}= A \widehat{x}\) in the subspace \(\mathcal{R}(A)\) that comes closest to \(y\) is characterized by the fact that the associated error \(\widehat{e}= y- \widehat{y}\) is orthogonal to \(\mathcal{R}(A)\), i.e., orthogonal to the space spanned by the vectors in \(A\). This principle was presented and proved in the previous chapter. We repeat the proof here in the context of the above problem.
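
    As a simple check of this characterization (a small example of our own, not from the original notes), take \(A\) to consist of the single vector \(a=(1,1,0)^{\prime}\) in \(R^{3}\), and let \(y=(1,2,3)^{\prime}\). The closest point to \(y\) in \(\mathcal{R}(A)\) is

    \[\widehat{y}=\frac{a^{\prime} y}{a^{\prime} a}\, a=\left(\tfrac{3}{2}, \tfrac{3}{2}, 0\right)^{\prime}, \quad \text{so} \quad \widehat{e}=y-\widehat{y}=\left(-\tfrac{1}{2}, \tfrac{1}{2}, 3\right)^{\prime}, \quad a^{\prime} \widehat{e}=0\nonumber\]

    and the error is indeed orthogonal to \(\mathcal{R}(A)\).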

    Proof:

    We first show that \(y\) has a unique decomposition of the form \(y = y_{1}+y_{2}\), where \(y_{1} \in \mathcal{R}(A)\) and \(y_{2} \in \mathcal{R}^{\perp}(A)\). We can write any \(y_{1} \in \mathcal{R}(A)\) in the form \(y_{1} = A \alpha\) for some vector \(\alpha\).

    If we want \((y- y_{1}) \in \mathcal{R}^{\perp}(A)\), we must see if there is an \(\alpha\) that satisfies

    \[\left\langle a_{i},(y-A \alpha)\right\rangle=0, \quad i=1, \ldots, n\nonumber\]

    or, using our Gram product notation,

    \[\prec A,(y-A \alpha) \succ=0\nonumber\]

    Rearranging this equation and using the linearity of the Gram product, we get

    \[\prec A, A \succ \alpha=\prec A, y \succ\nonumber\]

    which is in the form of the normal equations that we encountered in Lecture 1. Under our assumption that the vectors making up the columns of \(A\) are independent, the Gram matrix lemma shows that \(\prec A, A \succ\) is invertible, so the unique solution of the preceding equation is

    \[\alpha=\prec A, A \succ^{-1} \prec A, y \succ\nonumber\]

    We now have the decomposition that we sought.
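
    For concreteness, here is a small numerical sketch of this construction (our own example; the matrix and data below are arbitrary choices): we solve the normal equations for \(\alpha\) and check that the resulting \(y_{2}=y-A\alpha\) is orthogonal to every column of \(A\).

    ```python
    # Numerical sketch (illustrative, not from the original text): build the
    # decomposition y = y1 + y2 with y1 in R(A) and y2 orthogonal to R(A), by
    # solving the normal equations  <A, A> alpha = <A, y>  for a small example.
    import numpy as np

    A = np.array([[1.0, 0.0],
                  [1.0, 1.0],
                  [1.0, 2.0],
                  [1.0, 3.0]])          # columns assumed independent
    y = np.array([1.0, 2.0, 0.0, 4.0])

    gram = A.T @ A                      # Gram matrix  <A, A>
    rhs = A.T @ y                       # <A, y>
    alpha = np.linalg.solve(gram, rhs)  # invertible by the Gram matrix lemma

    y1 = A @ alpha                      # component in R(A)
    y2 = y - y1                         # component in R(A)-perp

    print(A.T @ y2)                     # ~[0, 0]: y2 is orthogonal to every column of A
    ```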

    To show that the preceding decomposition is unique, let \(y = y_{1a} + y_{2a}\) be another such decomposition, with \(y_{1a} \in \mathcal{R}(A)\) and \(y_{2a} \in \mathcal{R}^{\perp}(A)\). Then

    \[y_{1} - y_{1a} = y_{2a} - y_{2}\nonumber\]

    and the left side is in \( \mathcal{R}(A)\) while the right side is in its orthogonal complement. It is easy to show that the only vector common to a subspace and its orthogonal complement is the zero vector, so \(y_{1} - y_{1a}=0\) and \(y_{2a} - y_{2}= 0\), i.e., the decomposition of \(y\) is unique.

    To proceed, decompose the error \(e = y - Ax\) similarly (and uniquely) into the sum of \(e_{1} \in \mathcal{R}(A)\) and \(e_{2} \in \mathcal{R}^{\perp}(A)\). Note that

    \[\|e\|^{2}=\left\|e_{1}\right\|^{2}+\left\|e_{2}\right\|^{2}\nonumber\]

    Now we can rewrite \(e = y - Ax\) as

    \[e_{1}+e_{2}=y_{1}+y_{2}-A x\nonumber\]

    or

    \[e_{2} - y_{2}=y_{1} - e_{1}-A x\nonumber\]

    Since the right side of the above equation lies in \(\mathcal{R}(A)\) and the left side lies in \(\mathcal{R}^{\perp}(A)\), each side separately must equal 0 - again because this is the only vector common to a subspace and its orthogonal complement. We thus have \(e_{2}=y_{2}\), and the choice of \(x\) can do nothing to affect \(e_{2}\). On the other hand, \(e_{1} = y_{1} - Ax = A( \alpha - x)\), and the best we can do as far as minimizing \(\|e\|^{2}\) is to make \(e_{1} = 0\) by choosing \(x = \alpha\), so \( \widehat{x}= \alpha\), i.e.,

    \[\boxed{\widehat{x}=\prec A, A \succ^{-1} \prec A, y \succ}\nonumber\]

    This solves the least squares estimation problem that we have posed.
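
    A quick numerical illustration of the optimality argument (again an example of our own): since \(e_{2}\) is unaffected by \(x\) and \(e_{1}=0\) at \(\widehat{x}\), no perturbation of \(\widehat{x}\) can reduce \(\|e\|\).

    ```python
    # Quick check (illustrative): e_2 is fixed by y, and e_1 = 0 at x_hat, so any
    # perturbation of x_hat can only increase the residual norm ||y - A x||.
    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.standard_normal((5, 2))
    y = rng.standard_normal(5)
    x_hat = np.linalg.solve(A.T @ A, A.T @ y)

    base = np.linalg.norm(y - A @ x_hat)
    worse = min(np.linalg.norm(y - A @ (x_hat + 0.1 * rng.standard_normal(2)))
                for _ in range(100))
    print(base <= worse)  # True: no random perturbation does better
    ```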

    The above result, though rather abstractly developed, is immediately applicable to many concrete cases of interest.

    • Specializing to the case of \(R^{m}\) or \(C^{m}\), and choosing \(x\) to minimize the usual Euclidean norm,

    \[\|e\|^{2}=e^{\prime} e=\sum_{i=1}^{m}\left|e_{i}\right|^{2}\nonumber\]

    we have

    \[\widehat{x}=\left(A^{\prime} A\right)^{-1} A^{\prime} y\nonumber\]

    Note that if the columns of \(A\) form a mutually orthogonal set (i.e. an orthogonal basis for \(\mathcal{R}(A)\)), then \(A^{\prime}A\) is diagonal, and its inversion is trivial.
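
    The following sketch (using randomly generated data of our own) confirms that the closed-form expression \(\left(A^{\prime} A\right)^{-1} A^{\prime} y\) matches a standard least squares solver, and that mutually orthonormal columns give a Gram matrix that is trivial to invert.

    ```python
    # Euclidean-norm case (illustrative): the closed form (A'A)^{-1} A'y agrees
    # with a library least-squares solver; orthonormal columns make A'A diagonal.
    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((6, 3))
    y = rng.standard_normal(6)

    x_hat = np.linalg.solve(A.T @ A, A.T @ y)        # (A'A)^{-1} A'y via the normal equations
    x_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)  # library solution for comparison
    print(np.allclose(x_hat, x_lstsq))               # True

    Q, _ = np.linalg.qr(A)        # columns of Q are mutually orthonormal
    print(np.round(Q.T @ Q, 10))  # identity: inverting this Gram matrix is trivial
    ```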

    • If instead we choose to minimize \(e^{\prime}Se\) for some positive definite Hermitian \(S\) (\( \neq I\)), we have a weighted least squares problem, with solution given by

    \[\widehat{x}=\left(A^{\prime} SA\right)^{-1} A^{\prime} Sy\nonumber\]

    For instance, with a diagonal \(S\), the criterion that we are trying to minimize becomes

    \[\sum_{i=1}^{m} s_{i i}\left|e_{i}\right|^{2}\nonumber\]

    where the \(s_{ii}\) are all positive. We can thereby preferentially weight those equations in our linear system for which we want a smaller error in the final solution; a larger value of \(s_{ii}\) will encourage a smaller \(e_{i}\).

    Such weighting is important in any practical situation, where different measurements \(y_{i}\) may have been subjected to different levels of noise or uncertainty. One might expect that \(s_{ii}\) should be inversely proportional to the noise intensity on the \(i\)th equation. In fact, a probabilistic derivation, assuming zero-mean noise on each equation in the system but noise that is uncorrelated across equations, shows that \(s_{ii}\) should vary inversely with the variance of \(e_{i}\).

    A full matrix \(S\) rather than a diagonal one would make sense if the errors were correlated across measurements. A probabilistic treatment shows that the proper weighting matrix is \(S=\left(E\left[e e^{\prime}\right]\right)^{-1}\), the inverse of the covariance matrix of \(e\). In the deterministic setting, one has far less guidance on picking a good \(S\).
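
    A small simulation (our own construction, with hypothetical noise levels) illustrates the effect of such weighting: equations subject to high noise are down-weighted, and the weighted estimate is typically closer to the underlying \(x\) than the unweighted one.

    ```python
    # Weighted least squares sketch (illustrative assumptions): with a diagonal,
    # positive-definite S, the estimate (A'SA)^{-1} A'Sy down-weights noisy rows.
    import numpy as np

    rng = np.random.default_rng(1)
    x_true = np.array([1.0, -2.0])
    A = rng.standard_normal((50, 2))
    sigma = np.where(np.arange(50) < 25, 0.1, 2.0)   # first half: low noise; second half: high noise
    y = A @ x_true + sigma * rng.standard_normal(50)

    S = np.diag(1.0 / sigma**2)                      # s_ii inversely proportional to the error variance
    x_w = np.linalg.solve(A.T @ S @ A, A.T @ S @ y)  # (A'SA)^{-1} A'Sy
    x_u = np.linalg.solve(A.T @ A, A.T @ y)          # unweighted, for comparison

    print(np.linalg.norm(x_w - x_true), np.linalg.norm(x_u - x_true))
    # the weighted estimate is typically closer to x_true
    ```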

    • The boxed result also allows us to immediately write down the choice of coefficients \(x_{i}\) that minimizes the integral

    \[\int\left[y(t)-a_{1}(t) x_{1}-a_{2}(t) x_{2}-\cdots-a_{n}(t) x_{n}\right]^{2} d t\nonumber\]

    for specified functions \(y(t)\) and \(a_{i}(t)\). If, for instance, \(y(t)\) is of finite extent (or finite "support") \(T\), and the \(a_{i}(t)\) are sinusoids whose frequencies are integral multiples of \(2 \pi /T\), then the formulas that we obtain for the \(x_{i}\) are just the familiar Fourier series expressions. A simplification in this example is that the vectors in \(A\) are orthogonal, so \(\prec A, A \succ\) is diagonal.
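
    As a numerical sketch of this last point (our own discretization of the integral; the particular \(y(t)\) below is an arbitrary choice), the Gram matrix of sinusoids at harmonics of \(2 \pi / T\) is essentially diagonal, and the least squares coefficients reproduce the Fourier coefficients.

    ```python
    # Fourier-series sketch (illustrative): with sinusoidal a_i(t) at frequencies
    # k*2*pi/T, the discretized Gram matrix <A, A> is (numerically) diagonal and
    # the least-squares x_i recover the Fourier coefficients of y(t).
    import numpy as np

    T = 2.0
    t = np.linspace(0.0, T, 2001)
    dt = t[1] - t[0]

    # basis: cos and sin at the first two harmonics of 2*pi/T
    A = np.column_stack([np.cos(2 * np.pi * k * t / T) for k in (1, 2)] +
                        [np.sin(2 * np.pi * k * t / T) for k in (1, 2)])
    y = 3.0 * np.cos(2 * np.pi * t / T) - 0.5 * np.sin(4 * np.pi * t / T)

    gram = A.T @ A * dt      # <a_i, a_j> = integral of a_i(t) a_j(t) dt (discretized)
    rhs = A.T @ y * dt       # <a_i, y>
    x_hat = np.linalg.solve(gram, rhs)

    print(np.round(gram, 3))   # close to diagonal, ~ (T/2) I
    print(np.round(x_hat, 3))  # ~[3, 0, 0, -0.5]: the Fourier coefficients of y(t)
    ```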


    This page titled 2.5: The Projection Theorem and the Least Squares Estimate is shared under a CC BY-NC-SA 4.0 license and was authored, remixed, and/or curated by Mohammed Dahleh, Munther A. Dahleh, and George Verghese (MIT OpenCourseWare) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.