Ju Sun (孙举)

Dedicated to my research and life

Closed Form Solutions in Nuclear Norm Related Optimization

| Comments

The nuclear norm has been on the spotlight of optimization for quite a while, largely due to the breakthrough work of Emmanuel Candès on low-rank matrix completion and recovery years back (see also this). Summing up the singular values, this norm can be considered as the \(\ell_1\) norm in the spectral domain, reminiscent of the intriguing relationship between the \(\ell_0\) (quasi-)norm and the \(\ell_1\) norm revealed in compressive sensing. In fact, this norm best approximates the rank for a given matrix under some technical conditions (aka. serving as the convex surrogate). To be precise, for a particular matrix \(\mathbf{A}\), its nuclear norm is defined as \(\|\mathbf{A}\|_* = \sum_i \sigma_i\), where \(\sigma_i\) denotes the \(i^{th}\) singular value (henceforth we assume singular values sorted in descending order). Then \(\|\mathbf{A}\|_*\) is the best under-estimator (technically the convex envelope) of \(\textrm{rank} \left(\mathbf{A}\right)\) over the domain \(\|\mathbf{A}\| \leq 1\), where \(\| \cdot\|\) is the operator norm or equivalently the largest singular value \(\sigma_1\). The last condition on the spectral norm is in fact trivial, as one can always scale the matrix to make \(\sigma_1\) valid (in practical optimization, this possibly entails parameter tuning).

Most nuclear norm related (convex) optimization can be cast as SemiDefinite Programming problems, due to the simple variational characterization of the nuclear norm (see Proposition 2.1 in this paper) \[\begin{align} \min_{\mathbf{W}_1, \mathbf{W}_2} \quad \frac{1}{2}\left(\mathrm{trace} \left(\mathbf{W}_1\right)+ \mathrm{trace} \left(\mathbf{W}_2\right) \right), \qquad \mathrm{s.t.} \quad \begin{bmatrix} \mathbf{W}_1 & \mathbf{X} \newline \mathbf{X} & \mathbf{W}_2 \end{bmatrix} \succeq \mathbf{0}. \end{align}\] Despite the elegance, this approach is numerically unwise as most interior-point solvers can only handle problems up to hundreds of variables both for hardware and algorithmic reasons. On the other hand, most current applications such as large-scale matrix completion and the Robust PCA (by Candès et al) have been numerically feasible due to simple closed-form solutions to some canonical forms of nuclear norm optimization (most popular and foremost the singular value thresholding (SVT) operator). This line of results has served as the key to success of recent numerical schemes such as the accelerated proximal gradient (APG) and the augmented Lagragian multiplier (ALM) methods , for optimization problems involving nuclear norm. Hence discovery and (hopefully) systematic investigation into such closed-form solutions are both necessary and interesting in their own right.

Realizing the importance of unitarily invariant property (UIP)

My recent investigation into the low rank representation for subspace segmentation (Ref. my earlier post here and this paper) has boiled down to understanding this problem \[\begin{align} \min_{\mathbf{Z}} \quad \| \mathbf{Z}\|_*, \qquad \mathrm{s.t. } \quad \mathbf{XZ} = \mathbf{X} \label{eqn:lrr} \end{align}\] and it turns out

Theorem 1 The minimizer \(\mathbf{Z}^*\) to problem \(\eqref{eqn:lrr}\) obeys \(\mathbf{Z}^* = \mathbf{VV}^\top\), where \(\mathbf{X} = \mathbf{U\Sigma V}^\top\) is the singular value decomposition of \(\mathbf{X}\). Moreover, \(\mathbf{Z}^*\) is also the minimizer to this problem: \[\begin{align} \min_{\mathbf{Z}} \quad \| \mathbf{Z}\|_*, \qquad \mathrm{s.t. } \quad \mathbf{XZ} = \mathbf{X} , \mathbf{Z}\succeq \mathbf{0}. \label{eqn:lrr-sdp} \end{align}\]

It may come surprising that the above problem could have simple closed-form solution. It came to me when I was trying to model practical data points contaminated with outliers and considering numerical routine for this optimization problem
\[\begin{align} \min_{\mathbf{Z, E}} \quad \| \mathbf{Z}\|_* + \lambda \| \mathbf{E}\|_{2, 1}, \qquad \mathrm{s.t. } \quad \mathbf{XZ} +\mathbf{E} = \mathbf{X} , \mathbf{Z}\succeq \mathbf{0}. \label{eqn:lrr-sdp-rubost} \end{align}\] It turns out proximal gradient kind of algorithm entails solving this subproblem: \[\begin{align} \min_{\mathbf{Z}} \quad \| \mathbf{Z}\|_* + \frac{\mu}{2}\| \mathbf{X - Z}\|_F^2, \qquad \mathrm{s.t. } \quad \mathbf{Z} \succeq \mathbf{0}. \end{align}\] To establish the closed-form solution to this problem, it turns out the unitarily invariance property of the nuclear norm is critical.

Definition 2 (Unitarily Invariant Norms) A norm \(\|\cdot\|_{\ell}\) is called unitarily invariant if for any matrix \(\mathbf X\) and any unitary matrices \(\mathbf U, \mathbf V\) of compatible dimension, we have \(\|\mathbf{X}\|_\ell = \|\mathbf{UXV}\|_\ell\).

Two families of unitarily invariant norms are particularly relevant here. One is the Schattern \(p\) norms defined as \(\|\mathbf{X}\|_{\mathrm{sp}} = \left(\sum_{i=1}^r \sigma_i^p\right)^{1/p}\) (assuming \(\operatorname{rank}\left(\mathbf X\right) = r\)); the other is the Ky-Fan \(k\) norms, defined as $||{} = {i=1}^k _i $ (assuming descending order of singular values). Clearly the nuclear norm is the Schatten \(1\) norm (or Ky-Fan \(r\) norm), and hence unitarily invariant. Similarly the Frobenius norm is the Schatten \(2\) norm and hence also enjoys the nice property.

Deriving closed-form solutions of nuclear norm problems by unitarily invariance property

I have progressively realized the UIP can be used for figuring out closed form solutions to several canonical forms of nuclear norm optimization problems. Though results presented below are known from various authors, I don’t like their proofs in general because their proofs are by construction in nature – it is beautiful but not generalizable nor stimulating, esp. to complicated forms and further problems.

Before the exposition, I’d like to briefly review some facts from convex analysis and optimization (ref. Boyd and Vandenberghe). A real-valued function \(f\left(x\right)\) is convex if and only if its domain \(\mathrm{dom}\left(f\right)\) is a convex set and \(f\left(\alpha x_1 +\left(1-\alpha\right)x_2\right) \leq \alpha f\left(x_1\right) + \left(1-\alpha\right)f\left(x_2\right)\) for all \(x_1, x_2 \in \mathrm{dom}\left(f\right)\) and all \(0\leq\alpha\leq 1\) (the description can be considerably simplified with the extended-value extensions); it’s strictly convex if inequality always holds strictly whenever \(x_1 \neq x_2\) and \(0 < \alpha < 1\) (geometrically this means the line segmentation connecting any two distinct points of the function curve lies strictly above the part of function curve in between). Now consider minimizing a convex function \(f\left(x\right)\) in practice, \[\begin{align} \min. \quad f\left(x\right), \qquad \textrm{s.t.} \quad x\in \mathcal{C}, \label{eqn:generic_opt} \end{align}\]
where \(\mathcal{C}\) is the convex domain that’s dictated by constraints. Suppose \(\mathcal{C}\) is nonempty, so that the problem is feasible, then

Lemma 3 (Convexity and Uniqueness of Minimizers) For the generic convex optimization \(\eqref{eqn:generic_opt}\) with a nonempty domain \(\mathcal{C}\), we have

  • For convex functions \(\left\{f_i\left(x\right), x \in \mathcal C_i\right\}_{i \in [k]}\), the additive composite \(g\left(x\right) = \sum_{i=1}^k w_i f_i\left(x\right)\) with \(w_i \geq 0\) and \(x \in \cap_{i=1}^k \mathcal C_i\) is strictly convex so long as any component function is strictly convex and is with strictly positive weight.
  • If \(f\left(x\right)\) is strictly convex, the minimizer to \(\eqref{eqn:generic_opt}\) is always unique.
  • When \(f\left(x\right)\) is not strictly convex, the minimizer(s) to \(\eqref{eqn:generic_opt}\) may or may not be unique.

The first and second can be easily verified from the definition of strong convexity, while the third can be loosely cemented with examples, esp on why strict convexity is not a necessary condition for the uniqueness. Take \(f\left(x\right)=\| x\|_1, x \in \mathcal{R}^n\) for example. The objective is obviously not strictly convex (in fact, local linearity is a typical kind of antidote for thus), but the minimizer is only the origin (it looks as if it was strictly convex in the \(\varepsilon\) vicinity of the origin). There are numerous situations such unicity can arise and I won’t push towards enumerating them; instead, I’d like to note that a useful technique to prove such unicty is by perturbation analysis: for a candidate minimizer of \(\eqref{eqn:generic_opt}\), say \(x_\diamond\), it is the unique minimizer if and only if \(f\left(x_\diamond +\Delta x\right) > f\left(x_\diamond\right)\) where \(x_\diamond +\Delta x \in \mathcal{C}\) and \(\Delta x \neq 0\). This is justified since for convex functions, any local minimizer is also a global minimizer while the convexity of domain suggests such minimizers must be spatially contiguous.

Now I’ll close the loop by presenting the UIP version of sketchy proofs to several known results on nuclear norm.

Theorem 4 (Singular Value Thresholding and EigenValue Thresholding) We have the following analytic results

  • The minimizer to \[\begin{align} \min_{\mathbf{X}} \quad \mu \|\mathbf{X}\|_* + \frac{1}{2}\|\mathbf{ Y - X}\|_F^2 \label{eqn:nc_1} \end{align}\] obeys \(\mathbf{X} = \mathbf{U} \max\left(\mathbf \Lambda - \mu \mathbf I, 0\right)\mathbf{V}^\top\), for SVD of \(\mathbf{Y}\) as \(\mathbf{Y} = \mathbf{U\Lambda V}^\top\).
  • For symmetric matrix \(\mathbf{Y}\), the minimizer to \[\begin{align} \min_{\mathbf{X}\succeq \mathbf{0}} \quad \mu \|\mathbf{X}\|_* + \frac{1}{2}\|\mathbf{ Y - X}\|_F^2 \label{eqn:nc_2} \end{align}\] obeys \(\mathbf{X} = \mathbf{U} \max\left(\mathbf{\Lambda} - \mu \mathbf{I}, \mathbf{0}\right)\mathbf{U}^\top\), for eigenvalue decomposition of \(\mathbf{Y}\) as \(\mathbf{Y} = \mathbf{U\Lambda U}^\top\).
  • For general square matrix \(\mathbf{Y}\), the minimizer to \[\begin{align} \min_{\mathbf{X}\succeq \mathbf{0}} \quad \mu \|\mathbf{X}\|_* + \frac{1}{2}\|\mathbf{Y - X}\|_F^2 \label{eqn:nc_3} \end{align}\] obeys \(\mathbf{X} = \mathbf{U} \max\left(\mathbf{\Lambda} - \mu \mathbf{I}, \mathbf{0}\right)\mathbf{U}^\top\), for eigenvalue decomposition of \(\widetilde{\mathbf{Y}} = \frac{1}{2}\left(\mathbf{Y} +\mathbf{Y}^\top\right)\) as \(\widetilde{\mathbf{Y}} = \mathbf{U\Lambda U}^\top\).

Proof.  (1) We observe \(\eqref{eqn:nc_1}\) is strictly convex (as \(\frac{1}{2}\|\mathbf{ Y - X}\|_F^2\) is) and hence has a unique minimizer. It is not hard to see by UIP of Schatten norms that \[\begin{align} \min_{\mathbf{X}} \quad \mu \|\mathbf{X}\|_* + \frac{1}{2}\|\mathbf{ Y - X}\|_F^2 \Longleftrightarrow \min_{\mathbf{X}} \quad \mu \|\mathbf{U}^\top\mathbf{XV}\|_* + \frac{1}{2}\|\mathbf{ \Lambda} - \mathbf{U}^\top \mathbf{XV}\|_F^2, \end{align}\] where the right hand side is minimized only when \(\mathbf{\Sigma} = \mathbf{U}^\top \mathbf{XV}\) is diagonal. The last assertion comes from the facts that for any matrix \(\mathbf{A}\) decomposed into its diagonal and off-diagonal elements, \(\mathbf{A = \Sigma + B}\), we have \(\| \mathbf{\Sigma + B}\|_* \geq \|\mathbf{\Sigma}\|_*\) (generalized from Lemma 13 in this paper; also known in quantum information theory as (a particular case of) the pinching equality – see this slides) and \(\| \mathbf{\Sigma + B}\|_F^2 \geq \|\mathbf{\Sigma}\|_F^2\), so the problem to solve reduces to seeking \(\mathbf{\Sigma}_\diamond = \mathop{\arg\min}_{\mathbf{\Sigma}} \, \mu \|\mathbf{\Sigma}\|_* + \frac{1}{2}\|\mathbf{ \Lambda} - \mathbf{\Sigma}\|_F^2\) (solved by simple component-wise arguments similar to that of soft-thresholding for the Lasso problem), of which the optimal variable is \(\mathbf{X}_\diamond = \mathbf{U\Sigma}_\diamond \mathbf{V}^\top\).

(2) On top of the above, the key is to verify \(\mathbf{X}\succeq \mathbf{0} \Longleftrightarrow \mathbf{U}^\top \mathbf{X}\mathbf{U} \succeq \mathbf{0}\) for orthogonal \(\mathbf{U}\) (convention here: “orthogonal’’ means \(\mathbf{U}^\top \mathbf{U} = \mathbf{U}\mathbf{U}^\top =\mathbf{I}\)), for which both directions are easy by the definition of positive semi-definiteness.

(3) Observing that \(\| \mathbf{Y -X} \|_F^2 = \|\mathbf{Y}^\top - \mathbf{X}\|_F^2\) for symmetric \(\mathbf{X}\) (implied from \(\mathbf{X}\succeq \mathbf{0}\)), one can easily verify that \(\|\mathbf{Y-X}\|_F^2 + \|\mathbf{Y}^\top - \mathbf{X} \|_F^2 = 2 \|\frac{\mathbf{Y} +\mathbf{Y}^\top}{2} - \mathbf{X}\|_F^2\). So the objective is equivalent to \(\min_{\mathbf{X}\succeq \mathbf{0}} \; \mu \|\mathbf{X}\|_* + \frac{1}{2}\|\frac{\mathbf{Y} +\mathbf{Y}^\top}{2} - \mathbf{ X}\|_F^2\). \(\Box\)

As mentioned, the above results are critical to devising proximal point kind of numerical algorithms (and also alternating direction method) for a large family of optimization problems involving the nuclear norm. Recently this paper working on subspace segmentation has a nice take into this slightly more interesting version of closed form solution (though the authors have failed to verified the uniqueness of the solution before they used the term the optimal solution, also their proof is by construction – less favorable to us as discussed).

Theorem 5 The optimal solution to the unconstrained optimization \[\begin{align} \min_{\mathbf{Z}} \quad \|\mathbf{Z}\|_* + \frac{\tau}{2}\|\mathbf{ X- XZ}\|_F^2 \label{eqn:nuclear_new} \end{align}\] is \[\begin{align} \mathbf{Z}_\diamond = \mathbf{V}\max\left(\mathbf{0}, \mathbf{I} - \frac{1}{\tau} \mathbf{\Lambda}^{-2}\right)\mathbf{V}^\top, \end{align}\] where \(\mathbf{X} = \mathbf{U\Lambda V}^\top\) is the SVD of \(\mathbf{X}\) and we have assumed \(\frac{1}{0} = \infty\) and \(\infty^2 = \infty\) for convenience. Furthermore \(\mathbf{Z}_\diamond \succeq \mathbf{0}\).

Before moving into the actual proof, we note that the form of \(\mathbf{Z}_\diamond\) is reminiscent of the solution to \(\eqref{eqn:lrr}\) and \(\eqref{eqn:lrr-sdp}\). Indeed the optimization problems are closely related – the difference lies at if the equality \(\mathbf{X = XZ}\) is enforced through a constraint or through a soft loss term. It’s easy to see the \(\left\| \mathbf X - \mathbf X \mathbf Z \right\|_{F}^2\) is convex in \(\mathbf Z\), as convex composition of affine function of \(\mathbf Z\). To see if it is strictly convex, for any \(0 < \alpha < 1\) and \(\mathbf Z_1 \neq \mathbf Z_2\), \[\begin{align} & \| \mathbf{X}-\alpha \mathbf{X} \mathbf{Z}_1 - \left(1-\alpha\right)\mathbf{X} \mathbf{Z}_2 \|_F^2 \newline = \; & \| \alpha \left( \mathbf{X}- \mathbf{X} \mathbf{Z}_1\right) + \left(1-\alpha\right) \left( \mathbf{X}- \mathbf{X} \mathbf{Z}_2\right) \|_F^2 \newline \leq \; & \alpha \| \mathbf{X}- \mathbf{X} \mathbf{Z}_1 \|_F^2 + \left(1-\alpha\right) \| \mathbf{X}- \mathbf{X} \mathbf{Z}_2 \|_F^2, \end{align}\] Since \(\left\| \cdot \right\|_{F}^2\) is strictly convex, the inequality becomes equality only when \(\mathbf{X-XZ}_1 = \mathbf{X-XZ}_2 \Longrightarrow \mathbf{XZ}_1 =\mathbf{XZ}_2\). This provides an important hint on when \(g\left(\mathbf{Z}\right)\) is strictly convex:

Proposition 6 For a given matrix \(\mathbf{X}\), if \(\mathbf{XZ}_1 \neq\mathbf{XZ}_2 ~\forall \mathbf{Z}_1 \neq \mathbf{Z}_2\), \(g\left(\mathbf{Z}\right)\) is strictly convex. In other words, \(g\left(\mathbf{Z}\right)\) is strictly convex when \(\mathbf{X}\) is injective, or equivalently \(\mathbf{X}\) has full column rank.

So far it’s clear that the objective in \(\eqref{eqn:nuclear_new}\) is in general not strictly convex, and hence the optimization may not yield unique minimizer. The following deferred proof shows otherwise positive.

Proof.   (of Theorem 5)We assume the full SVD of \(\mathbf{X}\) as \(\mathbf{X} =\mathbf{U\Lambda V}^\top\). By the UIP of Schattern norms, it’s easy to see \(\|\mathbf{X-XZ}\|_F^2 =\|\mathbf{U}^\top \left(\mathbf{X-XZ}\right)\mathbf{V}\|_F^2 = \|\mathbf{\Lambda} - \mathbf{\Lambda V}^\top \mathbf{ZV}\|_F^2\), and hence the objective can be written as \(\|\mathbf{V}^\top \mathbf{Z} \mathbf{V}\|_*+ \frac{\tau}{2}\|\mathbf{\Lambda} - \mathbf{\Lambda V}^\top \mathbf{ZV}\|_F^2 = \|\mathbf{D}\|_*+ \frac{\tau}{2}\|\mathbf{\Lambda} - \mathbf{\Lambda D}\|_F^2\) with the substitution \(\mathbf{D} = \mathbf{V}^\top \mathbf{ZV}\). For the diagonal (but non-vanishing) \(\mathbf{\Lambda}\) and a positive \(\tau\), it’s not hard to see that \(\|\mathbf{D}\|_*+ \frac{\tau}{2}\|\mathbf{\Lambda} - \mathbf{\Lambda D}\|_F^2\) is minimized only when \(\mathbf{D}\) is also diagonal (by the pinching equality for nuclear norm as mentioned above and also property for squared Frobenius norm). The diagonal elements can be uniquely determined from the corresponding scalar optimization problems. For invertible \(\mathbf{V}\), we must have \(\mathbf{Z} =\mathbf{VDV}^{\top}\). The positive semi-definiteness can be easily established as the diagonal matrix \(\mathbf{D}_\diamond = \max \left(\mathbf{0}, \mathbf{I} - \frac{1}{\tau}\mathbf{\Lambda}^{-2}\right) \geq \mathbf{0}\), i.e., is component-wise nonnegative.

Uniqueness. It is easy to see the minimizer(s) is determined only by singular vector pairs with singular values greater than \(1/\sqrt{\tau}\), i.e., \(\mathbf{Z}_\diamond = \sum_{i: \; \sigma_i > 1/\sqrt{\tau}} d_{\sigma_i} \mathbf{v}_i\mathbf{v}_i^{\top}\). For any singular value \(\sigma_i\) with multiplicity \(1\), the corresponding singular vector \(\mathbf v\) is uniquely defined up to sign, so \(\mathbf v \mathbf v^{\top}\) is uniquely defined. For any singular value with multiplicity greater than \(1\), the singular vectors can be taken as any orthonormal basis from the corresponding invariant subspace. Fix a particular such basis \(\mathbf V_{\sigma_i}\), any other ortho-basis can be written as \(\mathbf V_{\sigma_i} \mathbf R\) for some orthogonal matrix \(\mathbf R\). But we always have \(\mathbf V_{\sigma_i} \mathbf R \left(\mathbf V_{\sigma_i} \mathbf R\right)^{\top} = \mathbf V_{\sigma_i} \mathbf V_{\sigma_i}^{\top}\). \(\Box\)

The above proof has verified Lemma 1 in this paper. Lemma 2 and 3 therein are built on the results of Lemma 1, and the uniqueness of the respective solution can also be established by similar perturbation argument as shown above.

Connection to other work and generalization

Recently unitarily invariant norms used as regularizers in machine learning have been touched on in Ref [2]. This paper has in particular focused on characterize optimal solutions to such spectral regularization problems. In addition, Ref [1] shares similar flavors and talks about unitarily invariant functions as well . These techniques draws heavily from matrix analysis, and hence the seminal paper Ref [3] proves to be extremely useful here. Besides, the very fresh monograph Ref [4] on matrix-valued functions are also pertinent in this line (e.g., refer to this paper to see how some recent matrix cone programming are tackled).

[Update Aug 12 2011] Refer to this paper also for similar discussions. (Citation: Yu, YL and Schuurmans. Rank/Norm Regularization with Closed-Form Solutions: Application to Subspace Clustering. In UAI 2011).

[Update Jan 25 2012] I repeatedly encountered situations where variational characterization of nuclear norm could be useful. One piece of result, for example, is \(\|\mathbf{X}\|_* = \min_{\mathbf{X=UV}^\top} \|\mathbf{U}\|_F \|\mathbf{V}\|_F = \min_{\mathbf{X=UV}^\top} \frac{1}{2} \left(\|\mathbf{U}\|_F^2 + \|\mathbf{V}\|_F^2\right)\) (this can be found in this paper without proof). To see this is true, one can set \(\mathbf{U} = \mathbf{L\Sigma}^{1/2}\) and \(\mathbf{V} = \mathbf{\Sigma}^{1/2} \mathbf{R}^\top\) where \(\mathbf{X} = \mathbf{L\Sigma R}^\top\) is the SVD of \(\mathbf{X}\), suggesting \[\begin{align} \|\mathbf{X}\|_* \geq \min_{\mathbf{X=UV}^\top} \frac{1}{2} \left(\|\mathbf{U}\|_F^2 + \|\mathbf{V}\|_F^2\right) \geq \min_{\mathbf{X=UV}^\top} \|\mathbf{U}\|_F \|\mathbf{V}\|_F. \end{align}\] On the other hand, by Theorem 3.3.14(a) of Topics in matrix analysis (by Horn and Johnson, 1e), \(\left\| \mathbf X \right\|_{\ast} = \left\| \mathbf U \mathbf V^\top \right\|_{\ast} \leq \left\langle \mathbf \sigma\left(\mathbf U\right), \mathbf \sigma\left(\mathbf V\right) \right\rangle\) (this can also be extracted from Theorem IV 2.5 of Matrix analysis (by Bhatia), though one should know how to generalize that to general non-square matrices.) And applying Cauchy–Schwarz we get \(\left\| \mathbf X \right\|_{\ast} \leq \left\| \mathbf U \right\|_{\ast} \left\| \mathbf V \right\|_{\ast}\).

[Update Nov 27 2013] This note has been revised with some error correction when moved to this new blog.