Moore-Penrose Pseudoinverse

Definition

Let $V$ and $W$ be finite-dimensional inner product spaces over the same field $F$ and let $T:V\to W$ be a linear transformation. Let $L:\ker T^\perp\to \text{im}T$ be the linear transformation defined by $L\mathbf{x} = T\mathbf{x}$ for all $\mathbf{x}\in \ker T^\perp$. The pseudo-inverse (or Moore-Penrose generalized inverse) of $T$, denoted by $T^\dagger$, is defined as the unique linear transformation from $W$ to $V$ such that

\[\begin{align*} T^\dagger(\mathbf{y}) = \begin{cases} L^{-1}(\mathbf{y}) &\text{ for } \mathbf{y} \in \text{im}T \\ \mathbf{0} &\text{ for } \mathbf{y} \in \text{im} T^\perp \end{cases} \end{align*}\]

Remark

Note that $L: \ker T^\perp \to \text{im}T$ is invertible since $\ker L = \{ \mathbf{0}\}$ and $\dim \ker T^\perp = \dim\text{im}T$. The pseudo-inverse of $T$ exists even if $T$ is not invertible. Furthermore, if $T$ is invertible, then $T^\dagger=T^{-1}$ since $\ker T^\perp=V$ and thus $L=T$.

We can use the singular value theorem to describe the pseudo-inverse of a linear transformation. Suppose that $V$ and $W$ are finite-dimensional inner product spaces and $T:V\to W$ is a linear transformation of rank $r$. Let $\{\mathbf{v}_1, \ldots, \mathbf{v}_n\}$ and $\{\mathbf{u}_1, \ldots, \mathbf{u}_m \}$ be orthonormal bases for $V$ and $W$, respectively, and let $\sigma_1\geq\sigma_2 \geq \cdots \geq \sigma_r$ be the non-zero singular values of $T$ such that

\[\begin{align*} T(\mathbf{v}_i)=\begin{cases} \sigma_i \mathbf{u}_i & \text{if } 1\leq i \leq r\\ \mathbf{0} & \text{if } i >r. \end{cases} \end{align*}\]

Since $\dim\ker T = \dim V - \text{rank} T = n-r$ and $T (\mathbf{v}_i)=\mathbf{0}$ for $i>r$,

\[\begin{align*} \{\mathbf{v}_{r+1}, \ldots, \mathbf{v}_n\} \end{align*}\]

is a basis for $\ker T$. On the other hand, since $\mathbf{v}_i \perp \mathbf{v}_j$ for all $1\leq i \leq r$ and $r<j \leq n$, $\{\mathbf{v}_1, \ldots, \mathbf{v}_r\}$ is a linearly independent subset of $\ker T^\perp$.

Since $V= \ker T \oplus \ker T^\perp$, $\dim \ker T^\perp = \dim V - \dim \ker T=n - (n-r) =r$. Thus $\{\mathbf{v}_1,\ldots, \mathbf{v}_r\}$ is a basis for $\ker T^\perp$.

Similarly, we can show that

\[\begin{align*} \{\mathbf{u}_{1}, \ldots, \mathbf{u}_r \} \end{align*}\]

is a basis for $\text{im} T$ and $\{\mathbf{u}_{r+1}, \ldots, \mathbf{u}_m \}$ is a basis for $\text{im}T^\perp$.


Let $L$ be the restriction of $T$ to $\ker T^\perp$ as in the definition of the pseudo-inverse. Then $L^{-1}(\mathbf{u}_i)=\frac{1}{\sigma_i}\mathbf{v}_i$ for $1 \leq i \leq r$. Therefore,

\[\begin{align*} T^\dagger (\mathbf{u}_i) = \begin{cases} \frac{1}{\sigma_i} \mathbf{v}_i &\text{ if } 1 \leq i \leq r \\ \mathbf{0} &\text{ if } r< i \leq m. \end{cases} \end{align*}\]
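For a concrete illustration (a toy example, not from the source text), take $T:\mathbb{R}^2\to\mathbb{R}^2$ defined by $T(x_1, x_2) = (3x_1, 0)$. Then $r=1$, $\sigma_1=3$, and we may take $\mathbf{v}_i=\mathbf{u}_i=\mathbf{e}_i$, so that

\[\begin{align*} T^\dagger(\mathbf{u}_1) = \tfrac{1}{3}\mathbf{v}_1, \qquad T^\dagger(\mathbf{u}_2) = \mathbf{0}, \end{align*}\]

that is, $T^\dagger(y_1, y_2) = \left(\tfrac{1}{3}y_1, 0\right)$.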

Definition (Pseudo-inverse of a Matrix)

Let $A$ be an $m\times n$ matrix of rank $r$. Then there exists a unique $n\times m$ matrix $B$ such that $L^\dagger_A: F^m\to F^n$ is equal to $L_B$. We call $B$ the pseudo-inverse of $A$ and denote it by $B=A^\dagger$.

Theorem

Let $A$ be an $m\times n$ matrix of rank $r$ with a singular value decomposition $A =U\Sigma V^\ast$ and non-zero singular values $\sigma_1\geq \sigma_2\geq \cdots \geq\sigma_r$. Let $\Sigma^\dagger$ be the $n\times m$ matrix defined by

\[\begin{align*} \Sigma^\dagger_{ij} = \begin{cases} \frac{1}{\sigma_i} &\text{ if } i=j \leq r \\ 0 & \text{otherwise}. \end{cases} \end{align*}\]

Then $A^\dagger = V \Sigma^\dagger U^\ast$.

<Proof>

Let $\beta =\{\mathbf{v}_1, \ldots, \mathbf{v}_n \}$ be an orthonormal basis for $F^n$ with $V = [\mathbf{v}_1\cdots \mathbf{v}_n]\in F^{n\times n}$ and let $\gamma =\{ \mathbf{u}_1, \ldots, \mathbf{u}_m\}$ be an orthonormal basis for $F^m$ with $U=[\mathbf{u}_1\cdots \mathbf{u}_m]\in F^{m\times m}$. Let $\mathcal{E}$ and $\mathcal{F}$ be the standard bases for $F^n$ and $F^m$, respectively.

Then, by the change-of-coordinates formula for matrix representations, the following equations hold.

\[\begin{align*} A^\dagger &= [L^\dagger_A]_\mathcal{F}^\mathcal{E} \\ &= [Id_{F^n}\circ L^\dagger_A\circ Id_{F^m}]_\mathcal{F}^\mathcal{E} \\ &=[Id_{F^n}]_\beta^\mathcal{E} [L_A^\dagger]_\gamma^\beta[Id_{F^m}]_\mathcal{F}^\gamma \\ &=[Id_{F^n}]_\beta^\mathcal{E} [L_A^\dagger]_\gamma^\beta([Id_{F^m}]_\gamma^\mathcal{F})^{-1} \\ &=V\Sigma^\dagger U^{-1} \\ &=V\Sigma^\dagger U^{\ast}, \end{align*}\]

since $[Id_{F^n}]_\beta^\mathcal{E}=V$ and $[Id_{F^m}]_\gamma^\mathcal{F}=U$ have the basis vectors as their columns, $[L_A^\dagger]_\gamma^\beta = \Sigma^\dagger$ by the basis description of the pseudo-inverse above, and $U^{-1}=U^\ast$ because $U$ is unitary. \[\tag*{$\square$}\]
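As a numerical sanity check of this theorem, here is a minimal sketch using NumPy (the random matrix, shapes, and tolerance are illustrative choices, not from the source): build $\Sigma^\dagger$ from a full SVD and compare $V\Sigma^\dagger U^\ast$ against NumPy's built-in pseudo-inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 3))  # 5x3, rank 2

# Full SVD: A = U @ Sigma @ Vh with U (5x5) and Vh (3x3).
U, s, Vh = np.linalg.svd(A, full_matrices=True)

# Sigma^dagger is n x m with 1/sigma_i on the diagonal for the non-zero
# singular values and zeros elsewhere.
tol = 1e-10 * s.max()                 # loose cutoff for "numerically zero"
r = int(np.sum(s > tol))              # numerical rank
Sigma_dag = np.zeros((A.shape[1], A.shape[0]))
Sigma_dag[:r, :r] = np.diag(1.0 / s[:r])

A_dag = Vh.conj().T @ Sigma_dag @ U.conj().T
print(np.allclose(A_dag, np.linalg.pinv(A)))  # expected: True
```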

Lemma

Let $V$ and $W$ be finite-dimensional inner product spaces and let $T:V\to W$ be linear. Then

(a) $T^\dagger T$ is the orthogonal projection of $V$ onto $\ker T^\perp$.

(b) $T T^\dagger$ is the orthogonal projection of $W$ onto $\text{im} T$.

<Proof>

We define $L: \ker T^\perp \to \text{im} T$ by $L(\mathbf{x}) = T(\mathbf{x})$ for all $\mathbf{x}\in\ker T^\perp$.

(a) If $\mathbf{x}\in\ker T^\perp$, then $T^\dagger T (\mathbf{x})=L^{-1} (L(\mathbf{x}))= \mathbf{x}$. If $\mathbf{x}\in \ker T$, then $T^\dagger T(\mathbf{x})=T^\dagger(\mathbf{0})=\mathbf{0}$.

Since $V=\ker T\oplus\ker T^\perp$, these two cases determine $T^\dagger T$ on all of $V$. $\therefore T^\dagger T$ is the orthogonal projection of $V$ onto $\ker T^\perp$.

(b) If $\mathbf{y} \in \text{im}T$, then $TT^\dagger(\mathbf{y})=T(L^{-1}(\mathbf{y}))= L(L^{-1}(\mathbf{y}))=\mathbf{y}$. On the other hand, if $\mathbf{y} \in \text{im}T^\perp$, then $TT^\dagger(\mathbf{y})=T(\mathbf{0})=\mathbf{0}$.

$\therefore TT^\dagger$ is the orthogonal projection of $W$ onto $\text{im} T$.

\[\tag*{$\square$}\]
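The lemma can also be observed numerically: orthogonal projections are exactly the self-adjoint idempotent operators, so a sketch like the following (the random matrix and shapes are arbitrary choices) can test $A^\dagger A$ and $AA^\dagger$ for both properties.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 3))  # 5x3, rank 2
A_dag = np.linalg.pinv(A)

P = A_dag @ A   # should project F^3 orthogonally onto (ker A)^perp
Q = A @ A_dag   # should project F^5 orthogonally onto im A

# Orthogonal projections are symmetric (Hermitian) and idempotent.
print(np.allclose(P, P.T), np.allclose(P @ P, P))  # expected: True True
print(np.allclose(Q, Q.T), np.allclose(Q @ Q, Q))  # expected: True True
```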

Theorem

Consider the system of linear equations $A\mathbf{x}=\mathbf{b}$, where $A$ is an $m \times n$ matrix and $\mathbf{b}\in F^m$. If $\mathbf{z} = A^\dagger \mathbf{b}$, then $\mathbf{z}$ has the following properties.

(a) If $A\mathbf{x}=\mathbf{b}$ is consistent, then $\mathbf{z}$ is the unique solution to the system having minimum norm. That is, $\mathbf{z}$ is a solution to the system, and if $\mathbf{y}$ is any solution to the system, then $\lVert \mathbf{z} \rVert \leq \lVert \mathbf{y} \rVert$ with equality if and only if $\mathbf{y}=\mathbf{z}$.

(b) If $A\mathbf{x}=\mathbf{b}$ is inconsistent, then $\mathbf{z}$ is the unique best approximation to a solution having minimum norm. That is, $\lVert A\mathbf{z}-\mathbf{b} \rVert \leq \lVert A\mathbf{y}-\mathbf{b}\rVert$ for any $\mathbf{y}\in F^n$ with equality if and only if $A\mathbf{z}=A\mathbf{y}$. Furthermore, if $A\mathbf{z}=A\mathbf{y}$, then $\lVert \mathbf{z} \rVert \leq \lVert \mathbf{y} \rVert$ with equality if and only if $\mathbf{y}=\mathbf{z}$.

<Proof>

(a) Let $T = L_A$. Suppose that $A\mathbf{x}= \mathbf{b}$ is consistent and let $\mathbf{z}=A^\dagger\mathbf{b}$. Since $\mathbf{b} \in \text{im} T$,

\[\begin{align*} A\mathbf{z}=AA^\dagger\mathbf{b}=TT^\dagger(\mathbf{b})=\mathbf{b} \end{align*}\]

by the previous lemma. Thus $\mathbf{z}$ is a solution to the system. Now suppose that $\mathbf{y}$ is a solution to the system. Then

\[\begin{align*} T^\dagger T(\mathbf{y}) = A^\dagger A\mathbf{y}=A^\dagger \mathbf{b}=\mathbf{z}, \end{align*}\]

that is, $\mathbf{z}$ is the orthogonal projection of $\mathbf{y}$ onto $\ker T^\perp$. Since $F^n=\ker T \oplus\ker T^\perp$, $\mathbf{y}=\mathbf{x}+\mathbf{z}$ for some $\mathbf{x}\in\ker T$.

Since $\langle \mathbf{x}, \mathbf{z}\rangle = 0$,

\[\begin{align*} \lVert \mathbf{y} \rVert^2 &= \lVert \mathbf{x+z} \rVert^2 \\ &= \langle \mathbf{x}, \mathbf{x} \rangle+ \langle \mathbf{z}, \mathbf{z}\rangle + \langle \mathbf{x}, \mathbf{z} \rangle + \langle \mathbf{z}, \mathbf{x} \rangle \\ &= \langle \mathbf{x}, \mathbf{x} \rangle+ \langle \mathbf{z}, \mathbf{z}\rangle \\ & = \lVert \mathbf{x} \rVert^2 + \lVert \mathbf{z} \rVert^2 \\ &\geq \lVert \mathbf{z} \rVert^2 \end{align*}\]

$\therefore \lVert \mathbf{z} \rVert\leq \lVert \mathbf{y}\rVert$ for every solution $\mathbf{y}\in F^n$, with equality if and only if $\mathbf{x}=\mathbf{0}$, that is, $\mathbf{y}=\mathbf{z}$.

(b) Suppose that $A\mathbf{x}=\mathbf{b}$ is inconsistent. By the previous lemma,

\[\begin{align*} A\mathbf{z}= AA^\dagger \mathbf{b} = TT^\dagger(\mathbf{b}). \end{align*}\]

Thus $A\mathbf{z}$ is the orthogonal projection of $\mathbf{b}$ onto $\text{im} T$; that is, $A\mathbf{z}$ is the best approximation to $\mathbf{b}$ in $\text{im} T$. Thus, $\lVert A\mathbf{z}-\mathbf{b}\rVert \leq \lVert A\mathbf{y} -\mathbf{b}\rVert$ for all $\mathbf{y}\in F^n$, with equality if and only if $A\mathbf{y}=A\mathbf{z}$, since the orthogonal projection is the unique closest point in $\text{im}T$.

Finally, suppose that $\mathbf{y}\in F^n$ is any vector such that $A\mathbf{z}=A\mathbf{y} = \mathbf{c}$. With a singular value decomposition of $A$, we can write

\[\begin{align*} A=U\Sigma V^\ast , \qquad A^\dagger = V\Sigma^\dagger U^\ast . \end{align*}\]

Thus

\[\begin{align} \begin{split} A^\dagger A A^\dagger &= (V\Sigma^\dagger U^\ast )( U\Sigma V^\ast)(V\Sigma^\dagger U^\ast) \\ &=V\Sigma^\dagger \Sigma \Sigma^\dagger U^\ast \\ &=V\begin{pmatrix} I_r & 0 \\ 0 & 0 \end{pmatrix} \Sigma^\dagger U^\ast \\ &=V\Sigma^\dagger U^\ast \\ &=A^\dagger. \end{split} \label{eq:1} \end{align}\]

Then by equation $\eqref{eq:1}$,

\[\begin{align*} A^\dagger \mathbf{c}=A^\dagger A\mathbf{z}=A^\dagger AA^\dagger \mathbf{b} = A^\dagger\mathbf{b}=\mathbf{z}. \end{align*}\]

Since $\mathbf{z}=A^\dagger\mathbf{c}$, we can apply part (a) to the linear system $A\mathbf{x}=\mathbf{c}$, which is consistent. Thus $\lVert \mathbf{z} \rVert \leq \lVert \mathbf{y}\rVert$ with equality if and only if $\mathbf{y}=\mathbf{z}$.

\[\tag*{$\square$}\]
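Both cases of the theorem lend themselves to a numerical check. The sketch below (the random matrices, seed, and shapes are illustrative choices) verifies the minimum-norm property on a consistent underdetermined system, and checks the least-squares property on an inconsistent overdetermined one against `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(2)

# (a) A consistent, underdetermined system: 2 equations, 4 unknowns.
A = rng.standard_normal((2, 4))
b = A @ rng.standard_normal(4)        # b lies in im A by construction
z = np.linalg.pinv(A) @ b
print(np.allclose(A @ z, b))          # z solves the system: True

# Any other solution is z plus a kernel vector, and is strictly longer.
_, _, Vh = np.linalg.svd(A)
n = Vh[-1]                            # a unit vector in ker A (rank 2 < 4)
y = z + n
print(np.allclose(A @ y, b))                  # still a solution: True
print(np.linalg.norm(z) < np.linalg.norm(y))  # minimum norm: True

# (b) An inconsistent, overdetermined system: pinv gives least squares.
B = rng.standard_normal((5, 3))
c = rng.standard_normal(5)            # almost surely not in im B
w = np.linalg.pinv(B) @ c
w_lstsq, *_ = np.linalg.lstsq(B, c, rcond=None)
print(np.allclose(w, w_lstsq))        # expected: True
```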

Remark

Let $A$ be an $m\times n$ matrix of rank $n$ with $m>n$. For the linear regression problem, we want to find $\mathbf{x}\in F^n$ that solves

\[\begin{align*} \underset{\mathbf{x} \in F^n}{\text{minimize}} \lVert A\mathbf{x} -\mathbf{y} \rVert \end{align*}\]

for given labels $\mathbf{y}\in F^m$. From the normal equation $A^\ast A\mathbf{x}=A^\ast\mathbf{y}$, we know that $(A^\ast A)^{-1}A^\ast\mathbf{y}$ is the best approximate solution to the linear system.

Then the previous theorem says that $A^\dagger \mathbf{y}$ is this unique best approximate solution; that is, $A^\dagger = (A^\ast A)^{-1}A^\ast$.

<Proof> With a singular value decomposition of $A$, we write $A=U\Sigma V^\ast$, where $U$ and $V$ are unitary matrices and

\(\Sigma_{ij}=\begin{cases} \sigma_i &\text{if } i = j \leq n \\ 0 & \text{otherwise} \end{cases}\) with non-zero singular values $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_n>0$.

Since

\[\begin{align*} A^\ast A &= (V\Sigma^\top U^\ast )( U\Sigma V^\ast) \\ &=V\Sigma^\top \Sigma V^\ast \\ &= V \begin{pmatrix}\sigma^2_1 & & &\\ & \sigma^2_2 & & \\ & & \ddots &\\ & & & \sigma^2_n \end{pmatrix} V^\ast, \end{align*}\]

we get $(A^\ast A)^{-1}= V\text{diag}(1/\sigma^2_1, \ldots, 1/\sigma^2_n )V^\ast$. Thus,

\[\begin{align*} (A^\ast A)^{-1}A^{\ast}&=V\begin{pmatrix}\frac{1}{\sigma^2_1} & & &\\ & \frac{1}{\sigma^2_2} & & \\ & & \ddots &\\ & & & \frac{1}{\sigma^2_n} \end{pmatrix}V^\ast V\Sigma^\top U^\ast \\ &=V\Sigma^\dagger U^\ast \\ &=A^\dagger. \end{align*}\]

\[\tag*{$\square$}\]
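A short numerical check of this identity, as a sketch assuming a real matrix so that $A^\ast = A^\top$ (the shape is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 3))            # m > n, full column rank a.s.

normal_eq = np.linalg.inv(A.T @ A) @ A.T   # (A* A)^{-1} A* for real A
print(np.allclose(normal_eq, np.linalg.pinv(A)))  # expected: True
```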

Reference

  • Stephen H. Friedberg, Arnold J. Insel, and Lawrence E. Spence, Linear Algebra.