The two-layer linear neural network is arguably the simplest nontrivial neural network. This post fleshes out the details of the seminal paper ^{1}, in which the authors set out to understand the solution space of the optimization problem \[\begin{align} \label{eq:nn_main_global}
\min \; f\left(\mathbf A, \mathbf B\right) = \frac{1}{2}\left\| \mathbf Y - \mathbf A \mathbf B \mathbf X \right\|_{F}^2, \mathbf A \in \mathbb{R}^{n \times p}, \mathbf B \in \mathbb{R}^{p \times n}
\end{align}\] where \(\mathbf Y \in \mathbb{R}^{n \times m}\) and \(\mathbf X \in \mathbb{R}^{n \times m}\) are known, with \(m \geq n\); their columns are the response–input vector pairs used to train the network. When \(p \ll n\) and \(\mathbf Y = \mathbf X\) (the autoassociative case), it is easy to see that at any global optimum \(\mathbf A \mathbf B = \mathbf U \mathbf U^\top\), where \(\mathbf U\) is an orthonormal basis for the subspace spanned by the eigenvectors of \(\mathbf X \mathbf X^\top\) associated with its \(p\) largest eigenvalues. Hence this formulation is closely related to the well-known principal component analysis. An interesting aspect of this optimization problem is that, under some conditions, every local minimum (up to invertible linear transformations, of course) is also a global minimum. We will use some shorthand notation in the proof: \(\mathbf \Sigma_{\mathbf X\mathbf X} \doteq \mathbf X \mathbf X^\top\), and similarly for \(\mathbf \Sigma_{\mathbf Y \mathbf Y}\), \(\mathbf \Sigma_{\mathbf X \mathbf Y}\) and \(\mathbf \Sigma_{\mathbf Y \mathbf X}\).
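The PCA connection can be checked numerically. The sketch below (using NumPy; the dimensions and random data are made up for illustration) sets \(\mathbf Y = \mathbf X\), forms \(\mathbf U\) from the top-\(p\) eigenvectors of \(\mathbf X \mathbf X^\top\), and verifies that the loss of \(\mathbf A \mathbf B = \mathbf U \mathbf U^\top\) matches the Eckart–Young optimum given by the rank-\(p\) truncated SVD of \(\mathbf X\):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p = 8, 50, 3                 # illustrative sizes with p << n and m >= n
X = rng.standard_normal((n, m))
Y = X                              # autoassociative case

# U: orthonormal basis of the top-p eigenspace of X X^T
eigvals, eigvecs = np.linalg.eigh(X @ X.T)   # eigh returns ascending order
U = eigvecs[:, ::-1][:, :p]                  # p leading eigenvectors

loss_proj = 0.5 * np.linalg.norm(Y - U @ U.T @ X, "fro") ** 2

# Eckart-Young: best rank-p approximation of X via truncated SVD
Ux, s, Vt = np.linalg.svd(X, full_matrices=False)
X_p = Ux[:, :p] @ np.diag(s[:p]) @ Vt[:p, :]
loss_svd = 0.5 * np.linalg.norm(X - X_p, "fro") ** 2

print(loss_proj, loss_svd)        # the two losses agree
```

The agreement is no accident: the top-\(p\) eigenvectors of \(\mathbf X \mathbf X^\top\) are exactly the leading left singular vectors of \(\mathbf X\), so \(\mathbf U \mathbf U^\top \mathbf X\) *is* the truncated SVD.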

Theorem 1. Suppose \(\mathbf X\) has full (row) rank, so that \(\mathbf \Sigma_{\mathbf X \mathbf X}\) is invertible. Further assume \(\mathbf \Sigma \doteq \mathbf \Sigma_{\mathbf Y \mathbf X} \mathbf \Sigma_{\mathbf X \mathbf X}^{-1} \mathbf \Sigma _{\mathbf X \mathbf Y}\) is full rank with \(n\) distinct eigenvalues \(\lambda_1 > \cdots > \lambda_n > 0\). Then \(\eqref{eq:nn_main_global}\) has no spurious local minima: every local minimum is a global minimum, up to the equivalent versions obtained from invertible transformations \(\left(\mathbf A, \mathbf B\right) \mapsto \left(\mathbf A \mathbf C, \mathbf C^{-1} \mathbf B\right)\).
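The theorem can be probed empirically: if there are no spurious local minima, plain gradient descent from a generic random initialization should reach the global optimum. The sketch below (an illustration with made-up data; the step size and iteration count are ad hoc choices, not from the paper) compares the gradient-descent loss against the closed-form global optimum of the rank-constrained problem, obtained by projecting \(\mathbf Y\) onto the row space of \(\mathbf X\) (the ordinary least-squares fit) and truncating that fit to rank \(p\):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, p = 5, 30, 2
X = rng.standard_normal((n, m))
C = rng.standard_normal((n, n))
Y = C @ X + 0.5 * rng.standard_normal((n, m))   # generic targets with noise

# Global optimum of min ||Y - W X||_F^2 over rank(W) <= p:
# OLS fit, then Eckart-Young truncation of the fitted values.
Yhat = Y @ X.T @ np.linalg.inv(X @ X.T) @ X      # OLS fitted values
U_, s_, Vt_ = np.linalg.svd(Yhat, full_matrices=False)
Yhat_p = U_[:, :p] @ np.diag(s_[:p]) @ Vt_[:p, :]
loss_opt = 0.5 * (np.linalg.norm(Y - Yhat, "fro") ** 2
                  + np.linalg.norm(Yhat - Yhat_p, "fro") ** 2)

# Plain gradient descent on f(A, B) from a small random initialization.
A = 0.1 * rng.standard_normal((n, p))
B = 0.1 * rng.standard_normal((p, n))
lr = 2e-3
for _ in range(50_000):
    R = Y - A @ B @ X                 # residual
    gA = -R @ X.T @ B.T               # df/dA
    gB = -A.T @ R @ X.T               # df/dB
    A -= lr * gA
    B -= lr * gB

loss_gd = 0.5 * np.linalg.norm(Y - A @ B @ X, "fro") ** 2
print(loss_gd, loss_opt)              # the two values nearly agree
```

The decomposition behind `loss_opt` uses the fact that the OLS residual \(\mathbf Y - \hat{\mathbf Y}\) is orthogonal (in the Frobenius inner product) to anything in the row space of \(\mathbf X\), so minimizing over rank-\(p\) maps \(\mathbf W = \mathbf A \mathbf B\) reduces to a rank-\(p\) truncation of \(\hat{\mathbf Y}\).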