Local Convergence of Proximal Splitting Methods for Rank Constrained Problems

10/11/2017
by Christian Grussler, et al.
LUNDS TEKNISKA HÖGSKOLA

We analyze the local convergence of proximal splitting algorithms for optimization problems that are convex apart from a rank constraint. To this end, we show conditions under which the proximal operator of a function involving the rank constraint locally coincides with the proximal operator of its convex envelope, which implies local convergence. The conditions imply that the non-convex algorithms locally converge to a solution whenever a convex relaxation involving the convex envelope can be expected to solve the non-convex problem.


1 Introduction

Proximal splitting methods such as Douglas-Rachford splitting, the alternating direction method of multipliers, forward-backward splitting and many others (see [6, 2, 8, 3, 9, 10, 7]) are often used for solving large-scale convex optimization problems of the form

(1)

where and (or) have cheaply computable proximal mappings. Since many non-convex functions also possess cheap proximal computations, there is great interest in analyzing whether these iterates still converge to a solution. This paper focuses on analyzing the performance of splitting methods applied to problems where is convex and

(2)

is non-convex with

  • being an increasing, convex function,

  • being a unitarily invariant norm,

  • being the indicator function for matrices that have at most rank .

Analogously, one can consider vector-valued problems where the rank constraint is replaced by a cardinality constraint. Both problem types are very common in statistics, machine learning, automatic control and many other fields (see 

[12, 14, 26, 30, 25, 5, 4, 15]).

To this day, only special instances of solving this problem with proximal splitting methods have been analyzed [16, 21, 24, 17, 29], mainly under the assumption that is the indicator function of an affine set and . In this paper, we deal with general convex functions and a large class of functions , which allows us to provide an alternative analysis for showing local convergence.

Letting denote the bi-conjugate (convex envelope) of , we show conditions under which the proximal operators of the non-convex function in Eq. 2 and of its convex envelope (introduced in [12]) coincide. We translate these conditions to the setting of applying the Douglas-Rachford and forward-backward splitting algorithms to the non-convex problem Eq. 1 with as in Eq. 2, and to its optimal convex relaxation

We show that the conditions imply local convergence of the non-convex splitting methods whenever all solutions to the convex relaxation are solutions to Eq. 1. Thus, in many practical examples, there is no loss in directly using the non-convex algorithms. In fact, there are many examples where the non-convex methods can find a low-rank solution where the optimal convex relaxation fails. In other words, the non-convex algorithm can have low-rank limit points while the convex one has none, but not vice versa. This fact is explicitly analysed for the case where is the Frobenius norm and .

Interestingly, we will see that, unlike in the convex case, proximal splitting methods applied to Eq. 1 and

(3)

where , do not necessarily converge to the same limit points. Furthermore, the existence of a limit point as well as the region of attraction in our local convergence result highly depend on the size of . On the one hand, if the optimal convex relaxation does not possess a low-rank solution, it is shown that has to be chosen sufficiently small for a limit point to exist. On the other hand, in the case of our guaranteed local convergence, the region of attraction grows with , i.e. for every initial point of the proximal algorithms there exists a sufficiently large such that the algorithm converges.

Finally, note that besides being able to find low-rank solutions when the convex relaxation fails, the non-convex algorithms are computationally more favourable, because the proximal computations of are significantly cheaper than those of the convex envelope (see [11]).

2 Background

The following notation for real matrices and vectors is used in this paper. The non-increasingly ordered singular values of , counted with multiplicity, are denoted by , where . Further, for and we define the unique optimal rank-r approximation with respect to unitarily invariant norms (see [19, Theorem 7.4.9.1]) as , where is a singular value decomposition (SVD) of . If , then . Further, the inner product for is defined by
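For illustration, the optimal rank-r approximation defined above can be computed directly from a singular value decomposition. The following is a minimal numpy sketch; the function name svd_r and the consistency check are our own choices rather than the paper's notation or code.

```python
import numpy as np

def svd_r(M, r):
    """Best rank-r approximation of M with respect to any unitarily
    invariant norm (Schmidt-Mirsky): keep the r largest singular triplets."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

# Sanity check: in the Frobenius norm, the approximation error equals the
# norm of the discarded singular values.
M = np.random.randn(5, 4)
s = np.linalg.svd(M, compute_uv=False)
assert np.isclose(np.linalg.norm(M - svd_r(M, 2), "fro"), np.linalg.norm(s[2:]))
```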

2.1 Norms

A function is called a symmetric gauge function if

  1. is a norm.

  2. , where denotes the element-wise absolute value.

  3. for all permutation matrices and all .

A norm on is unitarily invariant if for all and all unitary matrices and it holds that . Since all unitarily invariant norms on define a symmetric gauge function and vice versa (see [19]), we define

By [19], the dual norm of is also unitarily invariant and is therefore associated with a symmetric gauge function , i.e.

For , the truncated symmetric gauge functions are given by

Then, the so-called low-rank inducing norms are defined in [12] as the dual norms of

The following properties have been shown in [12].

Lemma 1.

For all symmetric gauge functions and it holds that

(4)
(5)

Finally, the Frobenius norm is given by
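In standard notation, writing M for a real matrix and σ_i(M) for its singular values, this is

```latex
\|M\|_F := \sqrt{\textstyle\sum_{i,j} M_{ij}^2}
         = \sqrt{\operatorname{trace}(M^\top M)}
         = \sqrt{\textstyle\sum_i \sigma_i(M)^2}.
```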

2.2 Functions

The effective domain of a function is defined as

Then is said to be:

  • proper if .

  • closed if for each is a closed set.

A function is called increasing if

  • .

The conjugate and bi-conjugate function and of are defined as

and . If , then the monotone conjugate is given by

The subdifferential of in is defined as

The proximal mapping of at is defined by

Finally, for the indicator function is defined as
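For reference, the standard definitions of these objects read as follows, stated for a proper function f: ℝⁿ → ℝ ∪ {∞} with generic placeholder symbols; the paper's exact notation and scaling conventions may differ.

```latex
\operatorname{dom} f := \{x : f(x) < \infty\}, \qquad
f^*(y) := \sup_{x}\,\bigl(\langle x, y\rangle - f(x)\bigr), \qquad
f^{**}(x) := \sup_{y}\,\bigl(\langle x, y\rangle - f^*(y)\bigr),
```

```latex
\partial f(x) := \{v : f(z) \ge f(x) + \langle v, z - x\rangle \ \text{for all } z\}, \qquad
\operatorname{prox}_{\gamma f}(z) := \operatorname*{argmin}_{x}\Bigl(f(x) + \tfrac{1}{2\gamma}\|x - z\|^2\Bigr), \qquad
\chi_{\mathcal{C}}(x) := \begin{cases} 0, & x \in \mathcal{C},\\ \infty, & x \notin \mathcal{C}.\end{cases}
```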

2.3 Optimal Convex Relaxation

It is shown in [12] that every low-rank inducing norm is the biconjugate (convex envelope) of Eq. 2 for different .

Proposition 1.

Assume is an increasing closed convex function, and let be defined on with . Then,

(6)
(7)

These characterizations can be used to formulate Fenchel dual problems and optimal convex relaxations to our rank constrained problems. This is shown in the following proposition, which is from [12].

Proposition 2.

Let and be proper, closed, convex functions with . Further let be increasing. Then,

(8)
(9)
(10)

If solves Eq. 10 such that , then equality holds, and is also a solution to Eq. 8.

3 Theoretical Results

In this section we derive the theoretical results that are needed for our convergence analysis in Section 4. The proofs to these results are given in the appendix.

Theorem 1.

Let , and , where is a proper, closed and increasing convex function and . Then for all it holds that

Moreover, let

then the following are equivalent:

  1. for all .

Computing the prox of the non-convex function at reduces to evaluating the convex prox of either or the convex envelope at . Therefore, only the first singular values and vectors are needed to compute the non-convex prox. This can be compared to the prox of the convex envelope at , where all singular values and vectors might be needed. Computing the prox of is cheaper than computing the prox of , except for rank- matrices (see [12]). Therefore, it is often much cheaper to evaluate the prox of the non-convex function than of its convex envelope .
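To make this reduction concrete, the following sketch computes the non-convex prox from the r leading singular triplets only, for the special case where the norm part is the squared Frobenius norm, i.e. the non-convex function is ½‖X‖_F² plus the rank-constraint indicator. This special case, the function name, and the use of scipy.sparse.linalg.svds are our own illustrative choices, not the paper's implementation.

```python
import numpy as np
from scipy.sparse.linalg import svds

def prox_rank_constrained_sq_frob(Z, r, gamma):
    """Prox of gamma * (0.5*||X||_F^2 + indicator(rank(X) <= r)) at Z.

    Completing the square and applying Schmidt-Mirsky gives
    svd_r(Z) / (1 + gamma), so only the r leading singular triplets
    of Z are needed.
    """
    # svds requires 1 <= r < min(Z.shape); it returns the r largest triplets
    # (their ordering does not matter when forming the truncated sum).
    U, s, Vt = svds(np.asarray(Z, dtype=float), k=r)
    return (U * s) @ Vt / (1.0 + gamma)
```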

In order to relate Theorem 1 to the solutions of Eq. 8 and Eq. 10, the following results, which are proven in Sections A.3 and A.2 respectively, will be needed.

Lemma 2.

Let and . Assume that

where either or for some . Then all fulfill that

Moreover, if , then .

Proposition 3.

Let be solutions to Eq. 9 and Eq. 10, respectively. Assume

where either or for some . Further, if let . Then,

In particular, if there exists a solution to Eq. 9 such that or , then all solutions to Eq. 10 are solutions to Eq. 8.

4 Convergence Analysis

Next it is discussed how Propositions 3 and 1 can be used to show local convergence of proximal splitting algorithms applied to problems of the form

(11)

where is a convex function with cheaply computable proximal mapping and a convex, increasing function. To illustrate and support our analysis, let us first recall the following two well-known proximal splitting algorithms applied to Eq. 1.

Douglas-Rachford Splitting

The Douglas-Rachford splitting method is one of the most well-known splitting algorithms for solving large-scale convex problems [7, 22, 8, 6]. In fact, the alternating direction method of multipliers (ADMM) is a special case of this algorithm (see [10, 9, 3]). The Douglas-Rachford iterations are given by

(12a)
(12b)
(12c)

where and . For convex and , and converge towards an identical solution of Eq. 1 and is non-increasing, where (see [7, 22, 8]).
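For concreteness, a generic Douglas-Rachford iteration of the form Eq. 12 can be sketched as below. This is an illustrative sketch rather than the paper's implementation; prox_f and prox_g stand for the two proximal mappings (in whichever order the splitting assigns them) and gamma for the step-size parameter.

```python
import numpy as np

def douglas_rachford(prox_f, prox_g, z0, gamma, n_iter=500):
    """Generic Douglas-Rachford iteration (cf. Eq. 12), relaxation parameter 1.

    prox_f, prox_g: callables (point, gamma) -> proximal point.
    Returns the last x-iterate and the fixed-point residuals ||z_{k+1} - z_k||.
    """
    z = np.asarray(z0, dtype=float)
    residuals = []
    for _ in range(n_iter):
        x = prox_f(z, gamma)            # Eq. 12a
        y = prox_g(2 * x - z, gamma)    # Eq. 12b
        z_new = z + y - x               # Eq. 12c
        residuals.append(np.linalg.norm(z_new - z))
        z = z_new
    return x, residuals
```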

Forward-Backward Splitting

Another popular splitting method is the so-called forward-backward splitting algorithm (see [6, 2, 20, 27]). In this case, is assumed to be differentiable with Lipschitz continuous gradient, i.e. for all

Then the forward-backward iterations are given by

where . Also here, if and are convex, then it can be shown that converges towards a solution of Eq. 1 and is non-increasing with .
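Similarly, a generic forward-backward (proximal gradient) iteration can be sketched as follows; grad_f and prox_g are assumed callables for the gradient of the smooth term and the prox of the other term, respectively, and are our own names rather than the paper's.

```python
import numpy as np

def forward_backward(grad_f, prox_g, x0, gamma, n_iter=500):
    """Generic forward-backward splitting: x <- prox_{gamma*g}(x - gamma*grad_f(x)).

    In the convex case, convergence requires gamma in (0, 2/L), with L the
    Lipschitz constant of grad_f.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        # Forward (gradient) step on the smooth term, then backward (prox) step.
        x = prox_g(x - gamma * grad_f(x), gamma)
    return x
```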

Local Convergence

One of the steps in the above two methods (and many other operator splitting methods) when applied to solve Eq. 1 is

If and are convex, then converges to a solution of Eq. 1 in both methods and is a non-increasing sequence, where . Next, we will show that the latter and Theorem 1 imply local convergence of proximal splitting algorithms applied to the non-convex problem in Eq. 11.

In the following we will refer to a proximal splitting algorithm applied to the optimal convex relaxation in Eq. 10, which is restated here,

(13)

as the convex splitting algorithm with iterates

Correspondingly, if the algorithm is applied to Eq. 11, i.e. , we speak of the non-convex splitting algorithm with iterates

Let us assume that is a solution to Eq. 13 with

By (firm) nonexpansiveness of and the continuity of the singular values (see [28, Corollary 4.9]), Theorem 1 implies that

for all , where . Thus, since is non-increasing, it follows that

This proves the local convergence of the non-convex algorithm if .

We will conclude this section by linking this condition to the solution set of Eq. 13, which is the same as Eq. 10. A necessary optimality condition for solving Eq. 9 and Eq. 10 is that (see [23, Theorem 7.12.1] and [27, Theorem 23.5.])

Now, relating this to the optimality condition of the convex prox computation:

implies that

is a solution to the dual problem Eq. 9, i.e.,

(14)

By Theorem 1 we can conclude that if . Hence, Proposition 3 implies the local convergence of non-convex proximal splitting algorithms, if there exists a solution to Eq. 14 such that

(15)

This condition ensures, by Proposition 3, that Eq. 13 has only solutions of at most rank . Note that if Eq. 13 has solutions of rank larger than , then a convex algorithm cannot be expected to find solutions of rank , despite their possible existence. This is because the solution set of a convex problem is a convex set.

In other words, non-convex proximal splitting methods locally converge to a solution of Eq. 11, whenever one can expect to find such a solution by solving Eq. 13. Moreover, the region of attraction to contains the ball with

This means that for each initial point there exists a that guarantees the convergence to . Finally, numerical experiments indicate that the non-convex algorithms can also find rank-r solutions to Eq. 13 despite the fact that Eq. 13 may have higher rank solutions.

5 Douglas-Rachford Limit Points

In the following, let us compare the limit points of Douglas-Rachford applied to the optimal convex relaxation (convex Douglas-Rachford) with the limit points of the non-convex Douglas-Rachford for problems Eq. 1 where

Using completion of squares and the well-known Schmidt-Mirsky Theorem (see [19, Theorem 7.4.9.1]), we get that

(16)
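For the squared-Frobenius data term ½‖X − N‖_F² (with N, Z and γ used here as generic placeholder symbols, not necessarily the paper's notation), completion of squares and the Schmidt-Mirsky theorem give the explicit prox

```latex
\operatorname{prox}_{\gamma\bigl(\frac{1}{2}\|\cdot - N\|_F^2 + \chi_{\operatorname{rank}\le r}\bigr)}(Z)
 = \operatorname*{argmin}_{\operatorname{rank}(X)\le r}
   \Bigl(\tfrac{\gamma}{2}\|X - N\|_F^2 + \tfrac{1}{2}\|X - Z\|_F^2\Bigr)
 = \operatorname*{argmin}_{\operatorname{rank}(X)\le r}
   \tfrac{\gamma+1}{2}\Bigl\|X - \tfrac{\gamma N + Z}{\gamma+1}\Bigr\|_F^2
 = \mathrm{svd}_r\!\Bigl(\tfrac{\gamma N + Z}{\gamma+1}\Bigr),
```

where svd_r denotes the best rank-r approximation as in the earlier sketch.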

This allows us to derive the following comparative result on the limit points of the convex and non-convex Douglas-Rachford, which is proven in Section A.4.

Theorem 2.

Let with and . Then is a limit point of the convex (non-convex) Douglas-Rachford splitting iterate Eq. 12a if and only if there exists such that

and in the

  • convex case:

  • non-convex case:

Theorem 2 verifies what has been discussed at the end of the previous section: all limit points of the convex Douglas-Rachford are limit points of the non-convex Douglas-Rachford, but not vice versa. More importantly, it shows the importance of choosing a feasible . In the presence of a duality gap in Eq. 9, Theorem 2 implies that if is chosen too large, then the non-convex Douglas-Rachford may not possess a limit point, but choosing sufficiently small can help to recover convergence. Analytical examples where this applies have been studied in [11] and a numerical example is given in the next section. This is very much in contrast to the convex case, where convergence is independent of . Finally, note that by choosing just small enough for a limit point to exist, the problem of multiple limit points may be avoided, thus making the algorithm independent of the initialization. Similar derivations can be carried out for all of the form of Eq. 2.

6 Example

In many areas, such as automatic control, the rank of a Hankel operator/matrix is crucial because it determines the order of a linear dynamical system. Whereas the celebrated Adamyan-Arov-Krein theorem (see [1]) answers the question of optimal low-rank approximation of infinite-dimensional Hankel operators, the following finite-dimensional case is still unsolved:

subject to

where . In the following, we show how non-convex Douglas-Rachford splitting performs on this problem class in comparison with the optimal convex relaxation. To this end, we rewrite the problem in view of Eq. 13 and Eq. 10 as

where . For our numerical experiments we use

The non-convex Douglas-Rachford uses and is initialized with for all . The ranks of the solutions to the optimal convex relaxation are shown in Figure 1. We observe that only for does the convex relaxation manage to find guaranteed solutions to the non-convex problem. In contrast, the non-convex Douglas-Rachford converges for all . Figure 2 shows the relative errors of these solutions and the (sub-optimal) solutions to the convex relaxation, as well as the lower bound provided by the convex relaxation (see Proposition 2). Note that the convex relaxation is not able to obtain a sub-optimal solution of rank . From Figure 2 it can be seen that the non-convex solutions for coincide with the convex solutions, just as our local convergence guarantee suggests. However, for all other , the non-convex approximations outperform the sub-optimal solutions of the convex relaxation. Finally, it has been observed that, if one chooses sufficiently large, the non-convex Douglas-Rachford does not converge for . This can be explained through Theorem 2.
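To illustrate how such an experiment might be set up, the sketch below applies a non-convex Douglas-Rachford iteration to low-rank Hankel approximation: one prox is the projection onto the affine set of Hankel matrices (averaging along anti-diagonals), the other is the prox of the squared-Frobenius data term plus the rank constraint, computed via a truncated SVD as in the completion-of-squares identity above. All function names, the splitting order and the parameter values are our own illustrative choices, not the paper's setup.

```python
import numpy as np
from scipy.linalg import hankel

def project_hankel(Z):
    """Orthogonal projection onto Hankel matrices: average each anti-diagonal."""
    n, m = Z.shape
    P = np.empty_like(Z)
    for k in range(n + m - 1):
        idx = [(i, k - i) for i in range(max(0, k - m + 1), min(n, k + 1))]
        mean = np.mean([Z[i, j] for i, j in idx])
        for i, j in idx:
            P[i, j] = mean
    return P

def svd_r(Z, r):
    """Best rank-r approximation (Schmidt-Mirsky)."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

def hankel_low_rank_dr(N, r, gamma=1.0, n_iter=2000):
    """Non-convex Douglas-Rachford sketch for
    minimize 0.5*||X - N||_F^2  s.t.  X Hankel, rank(X) <= r."""
    z = N.copy()
    for _ in range(n_iter):
        x = project_hankel(z)                                  # prox of the Hankel indicator
        y = svd_r((gamma * N + 2 * x - z) / (gamma + 1.0), r)  # prox of the non-convex term
        z = z + y - x
    return x  # Hankel iterate; at a fixed point it is (approximately) of rank r

# Illustrative usage: rank-2 Hankel approximation of a noisy Hankel matrix.
N = hankel(np.arange(1.0, 7.0), np.arange(6.0, 12.0))
N += 0.01 * np.random.randn(*N.shape)
X = hankel_low_rank_dr(N, r=2)
print(np.linalg.svd(X, compute_uv=False))
```

Consistent with the discussion after Theorem 2, the parameter gamma matters for the non-convex iteration: choosing it smaller makes the existence of a limit point more likely when the convex relaxation has no low-rank solution.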

Figure 1: Hankel matrix approximation – Rank of the solutions to the optimal convex relaxation.

Figure 2: Hankel matrix approximation – Relative errors of the approximations obtained by the optimal convex relaxation and by the non-convex Douglas-Rachford, together with the lower bound obtained by the optimal convex relaxation.

7 Conclusion

We have shown conditions under which the proximal mapping of the non-convex function Eq. 2 coincides with the proximal mapping of its convex envelope. This allowed us to state conditions under which the non-convex and convex Douglas-Rachford and forward-backward methods coincide, which in turn guarantees local convergence of the non-convex methods in these situations. Furthermore, we have provided a comparison between the convex and non-convex Douglas-Rachford limit points for the common instance of the squared Frobenius norm. Unlike in the convex case, this has demonstrated that scaling the problem may have a significant impact. Finally, we discussed a numerical example in which a non-convex method also converges when the stated assumptions do not hold. In those situations, the quality of the solution from the non-convex algorithm was better than that of the solution obtained by the optimal convex relaxation.

References

  • [1] A. Antoulas, Approximation of Large-Scale Dynamical Systems.   SIAM, 2005.
  • [2] H. H. Bauschke and P. L. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces, ser. CMS Books in Mathematics.   Springer New York, 2011.
  • [3] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
  • [4] E. J. Candès, J. Romberg, and T. Tao, “Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information,” IEEE Transactions on Information Theory, vol. 52, no. 2, pp. 489–509, 2006.
  • [5] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky, “The convex geometry of linear inverse problems,” Foundations of Computational Mathematics, vol. 12, no. 6, pp. 805–849, 2012.
  • [6] P. L. Combettes and J.-C. Pesquet, Proximal Splitting Methods in Signal Processing.   Springer New York, 2011, pp. 185–212.
  • [7] J. Douglas and H. H. Rachford, “On the numerical solution of heat conduction problems in two and three space variables,” Transactions of the American Mathematical Society, vol. 82, no. 2, pp. 421–439, 1956.
  • [8] J. Eckstein and D. P. Bertsekas, “On the Douglas–Rachford splitting method and the proximal point algorithm for maximal monotone operators,” Mathematical Programming, vol. 55, no. 1, pp. 293–318, 1992.
  • [9] D. Gabay and B. Mercier, “A dual algorithm for the solution of nonlinear variational problems via finite element approximation,” Computers and Mathematics with Applications, vol. 2, no. 1, pp. 17–40, 1976.
  • [10] R. Glowinski and A. Marroco, “Sur l’approximation, par éléments finis d’ordre un, et la résolution, par pénalisation-dualité d’une classe de problémes de dirichlet non linéaires,” ESAIM: Mathematical Modelling and Numerical Analysis - Modélisation Mathématique et Analyse Numérique, vol. 9, pp. 41–76, 1975.
  • [11] C. Grussler, “Rank reduction with convex constraints,” Ph.D. dissertation, Lund University, 02 2017.
  • [12] C. Grussler and P. Giselsson, “Low-rank inducing norms with optimality interpretations,” 2016, preprint.
  • [13] C. Grussler, A. Rantzer, and P. Giselsson, “Low-rank optimization with convex constraints,” 2016.
  • [14] C. Grussler, A. Zare, M. R. Jovanovic, and A. Rantzer, “The use of the heuristic in covariance completion problems,” in 55th IEEE Conference on Decision and Control (CDC), 2016.
  • [15] T. Hastie, R. Tibshirani, and M. Wainwright, Statistical Learning with Sparsity: The Lasso and Generalizations.   CRC Press, 2015.
  • [16] R. Hesse, D. R. Luke, and P. Neumann, “Alternating projections and Douglas-Rachford for sparse affine feasibility,” IEEE Transactions on Signal Processing, vol. 62, no. 18, pp. 4868–4881, 2014.
  • [17] R. Hesse and D. R. Luke, “Nonconvex Notions of Regularity and Convergence of Fundamental Algorithms for Feasibility Problems,” SIAM Journal on Optimization, vol. 23, no. 4, pp. 2397–2419, 2013.
  • [18] J.-B. Hiriart-Urruty and C. Lemaréchal, Convex analysis and minimization algorithms I: Fundamentals, ser. Grundlehren der mathematischen Wissenschaften.   Springer Berlin Heidelberg, 2013, vol. 305.
  • [19] R. A. Horn and C. R. Johnson, Matrix Analysis, 2nd ed.   Cambridge University Press, 2012.
  • [20] E. Levitin and B. Polyak, “Constrained minimization methods,” USSR Computational Mathematics and Mathematical Physics, vol. 6, no. 5, pp. 1 – 50, 1966.
  • [21] A. S. Lewis, “The convex analysis of unitarily invariant matrix functions,” Journal of Convex Analysis, vol. 2, no. 1, pp. 173–183, 1995.
  • [22] P.-L. Lions and B. Mercier, “Splitting algorithms for the sum of two nonlinear operators,” SIAM Journal on Numerical Analysis, vol. 16, no. 6, pp. 964–979, 1979.
  • [23] D. G. Luenberger, Optimization by Vector Space Methods.   John Wiley & Sons, 1968.
  • [24] D. R. Luke, “Prox-Regularity of Rank Constraint Sets and Implications for Algorithms,” Journal of Mathematical Imaging and Vision, vol. 47, no. 3, pp. 231–238, 2013.
  • [25] B. Recht, M. Fazel, and P. A. Parrilo, “Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization,” SIAM Review, vol. 52, no. 3, pp. 471–501, 2010.
  • [26] G. C. Reinsel and R. Velu, Multivariate Reduced-Rank Regression: Theory and Applications, ser. Lecture Notes in Statistics.   Springer New York, 1998, vol. 136.
  • [27] R. T. Rockafellar, Convex Analysis.   Princeton University Press, 1970, no. 28.
  • [28] G. W. Stewart and J.-g. Sun, Matrix Perturbation Theory.   Academic press, 1990.
  • [29] A. Themelis, L. Stella, and P. Patrinos, “Forward-backward envelope for the sum of two nonconvex functions: Further properties and nonmonotone line-search algorithms,” 2016.
  • [30] R. Vidal, Y. Ma, and S. S. Sastry, Generalized Principal Component Analysis, ser. Interdisciplinary Applied Mathematics.   Springer-Verlag New York, 2016, vol. 40.
  • [31] G. Watson, “Characterization of the subdifferential of some matrix norms,” Linear Algebra and its Applications, vol. 170, pp. 33 – 45, 1992.

Appendix A Appendix

A.1 Proof of Theorem 1

Proof.

For and , let us define

By [21, Corollary 2.5.] and the unitary invariance of , it can be seen that and have simultaneous SVDs, i.e. if , then . Hence,

Further, [19, Theorem 7.4.8.4.] implies that

for all , which yields that

where the last equality and the inclusion follow by [19, Corollary 7.4.1.3.], [21, Corollary 2.5.] and the unitary invariance of . This proves that

Moreover, by Eq. 4 it follows that implies

By the extended Moreau decomposition (see e.g. [3]) and Proposition 1 it holds that

where . As before, , and can be shown to have simultaneous SVDs, which is why

(17)

Thus if and only if for . Since only depends on , this is equivalent to

This shows the equivalence between Items iv, iii and ii. Finally, note that this is also equivalent to

Since is unique, this can only be true if and thus , which concludes the proof. ∎

A.2 Proof of Lemma 2

Proof.

Let be an SVD of and the corresponding vector of singular values. Further, let for all , be defined as

By [31, Theorem 2] it holds that

(18)

Next we show that . Letting denote the cardinality, it follows from [19, Theorem 7.4.8.4.] that

and therefore by [18, Corollary VI.4.3.2]

where denotes the convex hull. However, [19, Theorem 7.4.8.4.] implies that if

In this case, only depends on variables , which is why for all and hence all it holds that