Statistical Machine Learning project
Many problems in machine learning and statistics can be formulated as (generalized) eigenproblems. In terms of the associated optimization problem, computing linear eigenvectors amounts to finding critical points of a quadratic function subject to quadratic constraints. In this paper we show that a certain class of constrained optimization problems with nonquadratic objective and constraints can be understood as nonlinear eigenproblems. We derive a generalization of the inverse power method which is guaranteed to converge to a nonlinear eigenvector. We apply the inverse power method to 1-spectral clustering and sparse PCA which can naturally be formulated as nonlinear eigenproblems. In both applications we achieve state-of-the-art results in terms of solution quality and runtime. Moving beyond the standard eigenproblem should be useful also in many other applications and our inverse power method can be easily adapted to new problems.
Eigenvalue problems associated to a symmetric and positive semi-definite matrix are quite abundant in machine learning and statistics. However, considering the eigenproblem from a variational point of view using Courant-Fischer theory, the objective is a ratio of quadratic functions, which is quite restrictive from a modeling perspective. We show in this paper that using a ratio of $p$-homogeneous functions leads quite naturally to a nonlinear eigenvalue problem, associated to a certain nonlinear operator. Clearly, such a generalization is only interesting if certain properties of the standard problem are preserved and efficient algorithms for the computation of nonlinear eigenvectors are available. In this paper we present an efficient generalization of the inverse power method (IPM) to nonlinear eigenvalue problems and study the relation to the standard problem. While our IPM is a general purpose method, we show for two unsupervised learning problems that it can be easily adapted to a particular application.
The first application is spectral clustering . In prior work  we proposed $p$-spectral clustering based on the graph $p$-Laplacian, a nonlinear operator on graphs which reduces to the standard graph Laplacian for $p = 2$. For $p$ close to one, we obtained much better cuts than standard spectral clustering, at the cost of higher runtime. Using the new IPM, we efficiently compute eigenvectors of the $1$-Laplacian for $1$-spectral clustering. Similar to the recent work of , we improve considerably compared to  both in terms of runtime and the achieved Cheeger cuts. However, in contrast to the method suggested in  our IPM is guaranteed to converge to an eigenvector of the $1$-Laplacian.
The second application is sparse Principal Component Analysis (PCA). The motivation for sparse PCA is that the largest PCA component is difficult to interpret as usually all components are nonzero. In order to allow a direct interpretation it is therefore desirable to have only a few features with nonzero components but which still explain most of the variance. This kind of trade-off has been widely studied in recent years, see and references therein. We show that also sparse PCA has a natural formulation as a nonlinear eigenvalue problem and can be efficiently solved with the IPM.
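The standard eigenproblem for a symmetric matrix $A \in \mathbb{R}^{n \times n}$ is of the form
$$A f - \lambda f = 0, \tag{1}$$
where $f \in \mathbb{R}^n$ and $\lambda \in \mathbb{R}$. It is a well-known result from linear algebra that for symmetric matrices $A$, the eigenvectors of $A$ can be characterized as critical points of the functional
$$F_{\mathrm{Standard}}(f) = \frac{\langle f, A f\rangle}{\|f\|_2^2}. \tag{2}$$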
The eigenvectors of $A$ can be computed using the Courant-Fischer min-max principle. While the ratio of quadratic functions is useful in several applications, it is a severe modeling restriction. This restriction, however, can be overcome using nonlinear eigenproblems. In this paper we consider functionals $F$ of the form
$$F(f) = \frac{R(f)}{S(f)}, \tag{3}$$
where, with $\mathbb{R}_+ = \{x \in \mathbb{R} \mid x \ge 0\}$, we assume $R: \mathbb{R}^n \to \mathbb{R}_+$ and $S: \mathbb{R}^n \to \mathbb{R}_+$ to be convex, Lipschitz continuous, even and positively $p$-homogeneous with $p \ge 1$. (A function $G: \mathbb{R}^n \to \mathbb{R}$ is positively homogeneous of degree $p$ if $G(\gamma x) = \gamma^p G(x)$ for all $\gamma \ge 0$.) Moreover, we assume that $S(f) = 0$ if and only if $f = 0$. The condition that $R$ and $S$ are $p$-homogeneous and even will imply for any eigenvector $v$ that also $\alpha v$ for $\alpha \in \mathbb{R}$ is an eigenvector. It is easy to see that the functional of the standard eigenvalue problem in Equation (2) is a special case of the general functional in (3).
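To gain some intuition, let us first consider the case where $R$ and $S$ are differentiable. Then it holds for every critical point $f^*$ of $F$,
$$\nabla F(f^*) = 0 \iff \nabla R(f^*) - \frac{R(f^*)}{S(f^*)}\,\nabla S(f^*) = 0.$$
Let $r, s: \mathbb{R}^n \to \mathbb{R}^n$ be the operators defined as $r(f) = \nabla R(f)$, $s(f) = \nabla S(f)$, and $\lambda^* = \frac{R(f^*)}{S(f^*)}$; we see that every critical point $f^*$ of $F$ satisfies the nonlinear eigenproblem
$$r(f^*) - \lambda^*\, s(f^*) = 0, \tag{4}$$
which is in general a system of nonlinear equations, as $r$ and $s$ are nonlinear operators. If $R$ and $S$ are both quadratic, $r$ and $s$ are linear operators and one gets back the standard eigenproblem (1).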
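As a small numerical illustration of the quadratic special case, the sketch below checks that the gradient of the Rayleigh quotient $\langle f, Af\rangle / \|f\|_2^2$ vanishes at eigenvectors of a symmetric matrix and not at a generic vector. The function names and the random test matrix are assumptions made for this example only.

```python
import numpy as np

def rayleigh_quotient(A, f):
    """Ratio <f, Af> / ||f||_2^2 for a symmetric matrix A."""
    return float(f @ A @ f) / float(f @ f)

def rayleigh_gradient(A, f):
    """Gradient of the Rayleigh quotient: 2 (A f - lambda f) / ||f||_2^2."""
    lam = rayleigh_quotient(A, f)
    return 2.0 * (A @ f - lam * f) / float(f @ f)

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = B @ B.T  # symmetric positive semi-definite test matrix

eigvals, eigvecs = np.linalg.eigh(A)

# Gradient norms at the five eigenvectors (expected to be numerically zero).
grad_norms = [np.linalg.norm(rayleigh_gradient(A, eigvecs[:, i])) for i in range(5)]

# Gradient norm at a random vector (generically not a critical point).
random_grad_norm = np.linalg.norm(rayleigh_gradient(A, rng.standard_normal(5)))

# At an eigenvector the functional value equals the eigenvalue.
quotients = [rayleigh_quotient(A, eigvecs[:, i]) for i in range(5)]
```

The vanishing gradient is exactly the condition $Af - \lambda f = 0$ with $\lambda$ equal to the value of the functional, which is the pattern the nonlinear eigenproblem generalizes.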
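Before we proceed to the general nondifferentiable case, we have to introduce some important concepts from nonsmooth analysis. Note that $F$ is in general nonconvex and nondifferentiable. In the following we denote by $\partial F(f)$ the generalized gradient of $F$ at $f$ according to Clarke ,
$$\partial F(f) = \{\xi \in \mathbb{R}^n \mid F^0(f, v) \ge \langle \xi, v\rangle \ \text{for all } v \in \mathbb{R}^n\},$$
where $F^0(f, v) = \limsup_{g \to f,\ t \downarrow 0} \frac{F(g + t v) - F(g)}{t}$. In the case where $F$ is convex, $\partial F$ is the subdifferential of $F$ and $F^0(f, v)$ the directional derivative for each $v \in \mathbb{R}^n$. A characterization of critical points of nonsmooth functionals is as follows.
Definition 2.1. A point $f \in \mathbb{R}^n$ is a critical point of $F$ if $0 \in \partial F(f)$.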
This generalizes the well-known fact that the gradient of a differentiable function vanishes at a critical point. We now show that the nonlinear eigenproblem (4) is a necessary condition for a critical point and in some cases even sufficient. A useful tool is the generalized Euler identity.
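Theorem 2.1. Let $R: \mathbb{R}^n \to \mathbb{R}$ be a positively $p$-homogeneous and convex continuous function. Then, for each $x \in \mathbb{R}^n$ and $r^* \in \partial R(x)$ it holds that $\langle x, r^*\rangle = p\,R(x)$.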
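The generalized Euler identity, read here as $\langle x, r\rangle = p\,R(x)$ for a subgradient $r$ of a positively $p$-homogeneous convex function $R$, can be checked numerically on two familiar cases. The sketch below uses the 1-norm ($p = 1$) and a quadratic form ($p = 2$); the variable names and random data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(6)  # almost surely has no zero entries

# p = 1: R(x) = ||x||_1, with subgradient sign(x) wherever x has no zero entries.
R1 = np.abs(x).sum()
r1 = np.sign(x)

# p = 2: R(x) = <x, Ax> with symmetric positive semi-definite A, gradient 2 A x.
B = rng.standard_normal((6, 6))
A = B @ B.T
R2 = float(x @ A @ x)
r2 = 2.0 * A @ x

euler_p1 = (float(x @ r1), 1 * R1)  # both entries should agree
euler_p2 = (float(x @ r2), 2 * R2)  # both entries should agree
```

In both cases the inner product of the point with its (sub)gradient reproduces $p$ times the function value.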
The next theorem characterizes the relation between nonlinear eigenvectors and critical points of .
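Theorem 2.2. Suppose that $R, S$ fulfill the stated conditions. Then a necessary condition for $f^*$ being a critical point of $F$ is
$$0 \in \partial R(f^*) - \lambda^*\,\partial S(f^*), \quad \text{where } \lambda^* = \frac{R(f^*)}{S(f^*)}. \tag{5}$$
If $S$ is continuously differentiable at $f^*$, then this is also sufficient.
Proof: Let $f^*$ fulfill the nonlinear eigenproblem in (5) for some $\lambda^*$, that is, there exist $r^* \in \partial R(f^*)$ and $s^* \in \partial S(f^*)$ such that $r^* - \lambda^* s^* = 0$. Then by Theorem 2.1,
$$0 = \langle f^*, r^*\rangle - \lambda^* \langle f^*, s^*\rangle = p\,R(f^*) - \lambda^*\, p\, S(f^*),$$
and thus $\lambda^* = R(f^*)/S(f^*)$. As $R, S$ are Lipschitz continuous, we have, see Prop. 2.3.14 in ,
$$\partial\Big(\frac{R}{S}\Big)(f^*) \subseteq \frac{S(f^*)\,\partial R(f^*) - R(f^*)\,\partial S(f^*)}{S(f^*)^2}. \tag{6}$$
Thus if $f^*$ is a critical point, that is $0 \in \partial F(f^*)$, then
$$0 \in \partial R(f^*) - \lambda^*\,\partial S(f^*),$$
given that $f^* \neq 0$. Moreover, by Prop. 2.3.14 in  we have equality in (6) if $S$ is continuously differentiable at $f^*$, and thus (5) implies that $f^*$ is a critical point of $F$.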
Finally, the definition of the associated nonlinear operators in the nonsmooth case is a bit tricky as $r$ and $s$ can be set-valued. However, as we assume $R$ and $S$ to be Lipschitz, the set where $R$ and $S$ are nondifferentiable has measure zero and thus $r$ and $s$ are single-valued almost everywhere.
A standard technique to obtain the smallest eigenvalue of a positive semi-definite symmetric matrix $A$ is the inverse power method . Its main building block is the fact that the iterative scheme
$$A f^{k+1} = f^k \tag{7}$$
converges to the eigenvector corresponding to the smallest eigenvalue of $A$. Transforming (7) into the optimization problem
$$f^{k+1} = \arg\min_{u}\ \tfrac{1}{2}\langle u, A u\rangle - \langle u, f^k\rangle \tag{8}$$
is the motivation for the general IPM. The direct generalization tries to solve
$$0 \in r(f^{k+1}) - s(f^k) \quad \text{or equivalently} \quad f^{k+1} = \arg\min_{u}\ R(u) - \langle u, s(f^k)\rangle, \tag{9}$$
where $r(f) \in \partial R(f)$ and $s(f) \in \partial S(f)$. For $p > 1$ this leads directly to Algorithm 2, however for $p = 1$ the direct generalization fails. In particular, the ball constraint has to be introduced in Algorithm 1 as the objective in the optimization problem (9) is otherwise unbounded from below. (Note that the 2-norm is only chosen for algorithmic convenience.) Moreover, the introduction of $\lambda^k$ in Algorithm 1 is necessary to guarantee descent whereas in Algorithm 2 it would just yield a rescaled solution of the problem in the inner loop (called inner problem in the following).
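For orientation, the classical linear inverse power method that motivates the general scheme can be sketched as follows. In this quadratic case the inner problem is just a linear solve with $A$, followed by a normalization. This is a minimal sketch of the standard method only, not of Algorithm 1 or Algorithm 2; the function name, stopping rule and test matrix are assumptions, and $A$ is taken positive definite so that the solve is well defined.

```python
import numpy as np

def inverse_power_method(A, f0, tol=1e-12, max_iter=1000):
    """Classical inverse power iteration for a symmetric positive definite A."""
    f = f0 / np.linalg.norm(f0)
    lam = float(f @ A @ f)
    for _ in range(max_iter):
        # Inner problem: argmin_u 1/2 <u, A u> - <u, f>, i.e. solve A u = f.
        g = np.linalg.solve(A, f)
        f = g / np.linalg.norm(g)
        lam_new = float(f @ A @ f)  # Rayleigh quotient of the unit vector f
        converged = abs(lam_new - lam) <= tol * abs(lam)
        lam = lam_new
        if converged:
            break
    return lam, f

# Test matrix with known spectrum {1, 2, 3, 5, 8} (illustrative assumption).
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
A = Q @ np.diag([1.0, 2.0, 3.0, 5.0, 8.0]) @ Q.T

lam_min, f_min = inverse_power_method(A, rng.standard_normal(5))
```

The relative change of the eigenvalue estimate is used as stopping criterion here; any other criterion on the iterates would serve equally well for this illustration.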
For both methods we show convergence to a solution of (4), which by Theorem 2.2 is a necessary condition for a critical point of $F$ and often also sufficient. Interestingly, both applications are naturally formulated as $1$-homogeneous problems so that we use in both cases Algorithm 1. Nevertheless, we state the second algorithm for completeness. Note that we cannot guarantee convergence to the smallest eigenvector even though our experiments suggest that we often do so. However, as the method is fast one can afford to run it multiple times with different initializations and use the eigenvector with smallest eigenvalue.
The inner optimization problem is convex for both algorithms. It turns out that both for $1$-spectral clustering and sparse PCA the inner problem can be solved very efficiently; for sparse PCA it even has a closed form solution. While we do not yet have results about convergence speed, empirical observation shows that one usually converges quite quickly to an eigenvector.
To the best of our knowledge both suggested methods have not been considered before. In  they propose an inverse power method specially tailored towards the continuous $p$-Laplacian for $p > 1$, which can be seen as a special case of Algorithm 2. In  a generalized power method has been proposed which will be discussed in Section 5. Finally, both methods can be easily adapted to compute the largest nonlinear eigenvalue, which however we have to omit due to space constraints.
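Proof of Lemma 3.1 for Algorithm 1: Denote the objective of the inner problem by $\Phi_{f^k}(u) = R(u) - \lambda^k \langle u, s(f^k)\rangle$. First note that the optimal value of the inner problem is non-positive as $\Phi_{f^k}(0) = 0$. Moreover, as $\Phi_{f^k}$ is $1$-homogeneous, the minimum of $\Phi_{f^k}$ is always attained at the boundary of the constraint set. Thus any $f^k$ fulfills $\|f^k\|_2 = 1$ and thus is feasible, and
$$\min_{\|f\|_2 \le 1} \Phi_{f^k}(f) \le \Phi_{f^k}(f^k) = R(f^k) - \lambda^k \langle f^k, s(f^k)\rangle = R(f^k) - \lambda^k S(f^k) = 0,$$
where we used $\langle f^k, s(f^k)\rangle = S(f^k)$ from Theorem 2.1. If the optimal value is zero, then $f^k$ is a possible minimizer and the sequence terminates and $f^k$ is an eigenvector, see the proof of Theorem 3.1 for Algorithm 1. Otherwise the optimal value is negative and at the optimal point $f^{k+1}$ we get $R(f^{k+1}) < \lambda^k \langle f^{k+1}, s(f^k)\rangle$. The definition of the subdifferential together with the $1$-homogeneity of $S$ yields
$$S(f^{k+1}) \ge S(f^k) + \langle f^{k+1} - f^k, s(f^k)\rangle = \langle f^{k+1}, s(f^k)\rangle,$$
and finally $F(f^{k+1}) = \frac{R(f^{k+1})}{S(f^{k+1})} < \lambda^k = F(f^k)$.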
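Proof of Theorem 3.1 for Algorithm 1: By Lemma 3.1 the sequence $F(f^k)$ is monotonically decreasing. By assumption $S$ and $R$ are nonnegative and hence $F$ is bounded below by zero. Thus we have convergence towards a limit
$$\lambda^* = \lim_{k \to \infty} F(f^k).$$
Note that $\|f^k\|_2 \le 1$ for every $k$, thus the sequence $f^k$ is contained in a compact set, which implies that there exists a subsequence $f^{k_j}$ converging to some element $f^*$. As the sequence $F(f^{k_j})$ is a subsequence of a convergent sequence, it has to converge towards the same limit, hence also
$$\lim_{j \to \infty} F(f^{k_j}) = \lambda^*.$$
As shown before, the objective of the inner optimization problem is nonpositive at the optimal point. Write $\Phi_{f^*}(u) = R(u) - \lambda^* \langle u, s(f^*)\rangle$ and assume now that $\min_{\|f\|_2 \le 1} \Phi_{f^*}(f) < 0$. Then the vector $f^{**} = \arg\min_{\|f\|_2 \le 1} \Phi_{f^*}(f)$ satisfies
$$R(f^{**}) < \lambda^* \langle f^{**}, s(f^*)\rangle = \lambda^* \big(S(f^*) + \langle f^{**} - f^*, s(f^*)\rangle\big) \le \lambda^* S(f^{**}),$$
where we used the definition of the subdifferential and the $1$-homogeneity of $S$. Hence
$$F(f^{**}) < \lambda^* = F(f^*),$$
which is a contradiction to the fact that the sequence $F(f^k)$ has converged to $\lambda^*$. Thus we must have $\min_{\|f\|_2 \le 1} \Phi_{f^*}(f) = 0$, i.e. the function $\Phi_{f^*}$ is nonnegative in the unit ball. Using the fact that for any $\alpha \ge 0$,
$$\Phi_{f^*}(\alpha f) = \alpha\, \Phi_{f^*}(f),$$
we can even conclude that the function $\Phi_{f^*}$ is nonnegative everywhere, and thus $\min_{f} \Phi_{f^*}(f) = 0$. Note that $\Phi_{f^*}(f^*) = 0$, which implies that $f^*$ is a global minimizer of $\Phi_{f^*}$, and hence
$$0 \in \partial \Phi_{f^*}(f^*) = \partial R(f^*) - \lambda^* s(f^*),$$
which implies that $f^*$ is an eigenvector with eigenvalue $\lambda^*$. Note that this argument was independent of the choice of the subsequence, thus every convergent subsequence converges to an eigenvector with the same eigenvalue $\lambda^*$. Clearly we have $\lambda^* \le F(f^0)$.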
The following lemma is useful in the convergence proof of Algorithm 2.
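Lemma 3.2. Let $R: \mathbb{R}^n \to \mathbb{R}$ be a convex, positively $p$-homogeneous function with $p \ge 1$. Then for any $x \in \mathbb{R}^n$, $r \in \partial R(x)$, and any $t \ge 0$ we have $t^{p-1} r \in \partial R(t x)$.
Proof: Using the definition of the subgradient, we have for any $r \in \partial R(x)$ and any $y \in \mathbb{R}^n$,
$$R(y) \ge R(x) + \langle r, y - x\rangle.$$
Using the $p$-homogeneity of $R$, we can rewrite this as
$$R(t y) \ge R(t x) + \langle t^{p-1} r,\ t y - t x\rangle,$$
which implies $t^{p-1} r \in \partial R(t x)$.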
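The scaling behaviour of subgradients under positive homogeneity can be illustrated on a smooth example. The sketch below assumes the reading "$t^{p-1} r$ is a subgradient at $t x$ whenever $r$ is a subgradient at $x$" and tests it for $R(x) = \frac{1}{p}\|x\|_p^p$, whose gradient is $\mathrm{sign}(x)\,|x|^{p-1}$; the choice of $R$, $p$ and $t$ is an assumption for illustration.

```python
import numpy as np

def R(x, p):
    """Convex, positively p-homogeneous test function (1/p) * ||x||_p^p."""
    return np.sum(np.abs(x) ** p) / p

def grad_R(x, p):
    """Gradient of R for p > 1."""
    return np.sign(x) * np.abs(x) ** (p - 1)

rng = np.random.default_rng(2)
x = rng.standard_normal(5)
y = rng.standard_normal(5)
p, t = 3.0, 0.7

# Candidate subgradient at t*x obtained by rescaling the gradient at x.
scaled = t ** (p - 1) * grad_R(x, p)

# Subgradient inequality at t*x, tested against an arbitrary point y.
gap = R(y, p) - R(t * x, p) - float(scaled @ (y - t * x))
```

For this differentiable $R$ the rescaled vector coincides with the gradient at $t x$, and the subgradient inequality holds with a nonnegative gap.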
The following Proposition generalizes a result by Zarantonello .
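Proposition 3.1. Let $R: \mathbb{R}^n \to \mathbb{R}$ be a convex, continuous, positively $p$-homogeneous and even functional and $\partial R(f)$ its subdifferential at $f$. Then it holds for any $f, g \in \mathbb{R}^n$ and $r \in \partial R(f)$,
$$|\langle r, g\rangle| \le p\, R(f)^{\frac{p-1}{p}}\, R(g)^{\frac{1}{p}}.$$
Proof: First observe that for any points $x_0, \dots, x_m \in \mathbb{R}^n$ and $r_i \in \partial R(x_i)$, the subdifferential inequality yields
$$R(x_{i+1}) \ge R(x_i) + \langle r_i, x_{i+1} - x_i\rangle, \quad i = 0, \dots, m-1,$$
and hence, by summing up,
$$R(x_m) - R(x_0) \ge \sum_{i=0}^{m-1} \langle r_i, x_{i+1} - x_i\rangle. \tag{10}$$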
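Let now $f, g \in \mathbb{R}^n$, $r \in \partial R(f)$ and $q \in \partial R(g)$. We construct a set of $2m+2$ points $x_0, \dots, x_{2m+1}$ in $\mathbb{R}^n$, where $m \in \mathbb{N}$, as follows:
$$x_i = \tfrac{i}{m}\, f \ \ (i = 0, \dots, m), \qquad x_{m+1+j} = \big(1 - \tfrac{j}{m}\big)\, g \ \ (j = 0, \dots, m).$$
By Lemma 3.2 for all $i$ there exists an $r_i \in \partial R(x_i)$ s.t.
$$r_i = \big(\tfrac{i}{m}\big)^{p-1} r \ \ (i = 0, \dots, m), \qquad r_{m+1+j} = \big(1 - \tfrac{j}{m}\big)^{p-1} q \ \ (j = 0, \dots, m-1).$$
Eq. (10) now yields, as $x_0 = x_{2m+1} = 0$,
$$0 \ge \sum_{i=0}^{m-1} \big(\tfrac{i}{m}\big)^{p-1} \tfrac{1}{m} \langle r, f\rangle + \langle r, g - f\rangle - \sum_{j=0}^{m-1} \big(1 - \tfrac{j}{m}\big)^{p-1} \tfrac{1}{m} \langle q, g\rangle,$$
which simplifies to
$$0 \ge p\,R(f)\, \frac{1}{m}\sum_{i=0}^{m-1} \big(\tfrac{i}{m}\big)^{p-1} + \langle r, g\rangle - p\,R(f) - p\,R(g)\, \frac{1}{m}\sum_{j=0}^{m-1} \big(1 - \tfrac{j}{m}\big)^{p-1}, \tag{11}$$
where we used Theorem 2.1. By letting $m \to \infty$ we obtain for the two sums
$$\frac{1}{m}\sum_{i=0}^{m-1} \big(\tfrac{i}{m}\big)^{p-1} \to \int_0^1 t^{p-1}\, dt = \frac{1}{p}, \qquad \frac{1}{m}\sum_{j=0}^{m-1} \big(1 - \tfrac{j}{m}\big)^{p-1} \to \frac{1}{p}.$$
Hence in total in the limit Eq. (11) becomes
$$\langle r, g\rangle \le (p-1)\,R(f) + R(g). \tag{12}$$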
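As the inequality $\langle r, g\rangle \le (p-1) R(f) + R(g)$ in (12) holds for all $g \in \mathbb{R}^n$, clearly we can now perform the substitution $g \to t g$, where $t \in \mathbb{R}_+$, which gives by the homogeneity of $R$
$$t\,\langle r, g\rangle - t^p R(g) \le (p-1)\,R(f).$$
For $p = 1$ this is already the claimed bound, so let $p > 1$ and $\langle r, g\rangle > 0$ (otherwise the one-sided bound is trivial). A local optimum with respect to $t$ of the left side satisfies the necessary condition
$$\langle r, g\rangle - p\,t^{p-1} R(g) = 0,$$
which implies that
$$t^* = \Big(\frac{\langle r, g\rangle}{p\,R(g)}\Big)^{\frac{1}{p-1}}.$$
Plugging this into (12) yields
$$\frac{p-1}{p}\, \langle r, g\rangle^{\frac{p}{p-1}}\, \big(p\,R(g)\big)^{-\frac{1}{p-1}} \le (p-1)\,R(f).$$
Rearranging, we then have
$$\langle r, g\rangle \le p\,R(f)^{\frac{p-1}{p}}\, R(g)^{\frac{1}{p}}.$$
Finally, note that we can replace the left side by its absolute value since replacing $g$ with $-g$ yields
$$-\langle r, g\rangle \le p\,R(f)^{\frac{p-1}{p}}\, R(-g)^{\frac{1}{p}} = p\,R(f)^{\frac{p-1}{p}}\, R(g)^{\frac{1}{p}},$$
where we used the fact that $R$ is even.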
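A Hölder-type bound of the form $|\langle r, g\rangle| \le p\,R(f)^{(p-1)/p} R(g)^{1/p}$ for $r \in \partial R(f)$ (the form assumed in this sketch) can be sanity-checked numerically. For the test function $R(x) = \frac{1}{p}\|x\|_p^p$ it reduces to the classical Hölder inequality, so the check below is only an illustration on one convenient example, not a verification of the general statement.

```python
import numpy as np

def R(x, p):
    """Convex, even, positively p-homogeneous test function (1/p) * ||x||_p^p."""
    return np.sum(np.abs(x) ** p) / p

def grad_R(x, p):
    """Gradient of R for p > 1."""
    return np.sign(x) * np.abs(x) ** (p - 1)

def holder_gap(f, g, p):
    """Right side minus left side of |<r, g>| <= p R(f)^((p-1)/p) R(g)^(1/p)."""
    lhs = abs(float(grad_R(f, p) @ g))
    rhs = p * R(f, p) ** ((p - 1) / p) * R(g, p) ** (1 / p)
    return rhs - lhs

rng = np.random.default_rng(3)
gaps = [holder_gap(rng.standard_normal(8), rng.standard_normal(8), p)
        for p in (1.5, 2.0, 3.0) for _ in range(100)]

# For p = 2 and g parallel to f the bound is attained (Cauchy-Schwarz equality).
equality_gap = holder_gap(np.array([1.0, -2.0, 3.0]), np.array([2.0, -4.0, 6.0]), 2.0)
```

All gaps are nonnegative on the random samples, and the gap closes when $g$ is a multiple of $f$ in the quadratic case.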
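Proof of Lemma 3.1 for Algorithm 2: Denote the objective of the inner problem by $\Phi_{f^k}(u) = R(u) - \langle u, s(f^k)\rangle$, and recall that the iterates are normalized such that $S(f^k) = 1$, so that $R(f^k) = \lambda^k$. By Theorem 2.1 we have along the ray through $f^k$, for $t \ge 0$,
$$\Phi_{f^k}(t f^k) = t^p \lambda^k - t\,p,$$
and thus the minimum is attained at $t^* = (\lambda^k)^{-\frac{1}{p-1}}$ and
$$\Phi_{f^k}(t^* f^k) = (1-p)\,(\lambda^k)^{-\frac{1}{p-1}}.$$
Assume there exists $g$ that satisfies $\Phi_{f^k}(g) < \Phi_{f^k}(\tilde f^k)$ where $\tilde f^k = t^* f^k$. Hence, also $\min_{\beta \ge 0} \Phi_{f^k}(\beta g) < \Phi_{f^k}(\tilde f^k)$, which implies
$$\frac{1-p}{p}\, \langle g, s(f^k)\rangle^{\frac{p}{p-1}}\, \big(p\,R(g)\big)^{-\frac{1}{p-1}} < (1-p)\,(\lambda^k)^{-\frac{1}{p-1}},$$
where we used the fact that $\langle f^k, s(f^k)\rangle = p\,S(f^k) = p$ and $R(f^k) = \lambda^k$. Rearranging, we obtain
$$R(g) < \lambda^k \Big(\frac{\langle g, s(f^k)\rangle}{p}\Big)^{p}.$$
Using the Hölder-type inequality of Proposition 3.1 and $S(f^k) = 1$, we obtain
$$R(g) < \lambda^k\, S(g),$$
which gives $F(g) < \lambda^k$. Let now $g^{k+1}$ be the minimizer of $\Phi_{f^k}$. Then $g^{k+1}$ satisfies $\Phi_{f^k}(g^{k+1}) \le \Phi_{f^k}(\tilde f^k)$. If equality holds then $\tilde f^k$ is a minimizer of the inner problem and the sequence terminates. In this case $f^k$ is an eigenvector, see the proof of Theorem 3.1 for Algorithm 2. Otherwise $\Phi_{f^k}(g^{k+1}) < \Phi_{f^k}(\tilde f^k)$ and thus $g^{k+1}$ fulfills the above assumption and we get $F(f^{k+1}) = F(g^{k+1}) < F(f^k)$, as claimed.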
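Proof of Theorem 3.1 for Algorithm 2: Note that as $F \ge 0$, the sequence $F(f^k)$ is bounded from below, and by Lemma 3.1 it is monotonically decreasing and thus converges to some $\lambda^*$. Moreover, $S(f^k) = 1$ for all $k$. As $S$ is continuous it attains its minimum $m$ on the unit sphere in $\mathbb{R}^n$. By assumption $m > 0$. We obtain
$$1 = S(f^k) = \|f^k\|_2^p\, S\Big(\frac{f^k}{\|f^k\|_2}\Big) \ge m\, \|f^k\|_2^p.$$
Thus the sequence $f^k$ is bounded and there exists a convergent subsequence $f^{k_j} \to f^*$. Clearly, $\lim_{j \to \infty} F(f^{k_j}) = \lambda^* = F(f^*)$. Let now $\Phi_{f^*}(u) = R(u) - \langle u, s(f^*)\rangle$, and suppose that there exists $g$ with $\Phi_{f^*}(g) < \Phi_{f^*}(\tilde f^*)$ where $\tilde f^* = (\lambda^*)^{-\frac{1}{p-1}} f^*$. Then, analogously to the proof of Lemma 3.1, one can conclude that $F(g) < \lambda^*$, which contradicts the fact that $F(f^k)$ has $\lambda^*$ as its limit. Thus $\tilde f^*$ is a minimizer of $\Phi_{f^*}$, which implies
$$0 \in \partial \Phi_{f^*}(\tilde f^*) = \partial R(\tilde f^*) - s(f^*) = \frac{1}{\lambda^*}\, \partial R(f^*) - s(f^*),$$
so that $f^*$ is an eigenvector with eigenvalue $\lambda^*$. As this argument was independent of the subsequence, any convergent subsequence of $f^k$ converges towards an eigenvector with eigenvalue $\lambda^*$.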
By the proof of Lemma 3.1, descent in $F$ is not only guaranteed for the optimal solution of the inner problem, but for any vector $u$ which has inner objective value $\Phi_{f^k}(u) < 0 = \Phi_{f^k}(f^k)$ for Alg. 1 and $\Phi_{f^k}(u) < \Phi_{f^k}\big(F(f^k)^{\frac{1}{1-p}} f^k\big)$ in the case of Alg. 2, where $\Phi_{f^k}$ denotes the objective of the respective inner problem. This has two important practical implications. First, for the convergence of the IPM, it is sufficient to use a vector $u$ satisfying the above conditions instead of the optimal solution of the inner problem. In particular, in an early stage where one is far away from the limit, it makes no sense to invest much effort to solve the inner problem accurately. Second, if the inner problem is solved by a descent method, a good initialization for the inner problem at step $k+1$ is given by $f^k$ in the case of Alg. 1 and $F(f^k)^{\frac{1}{1-p}} f^k$ in the case of Alg. 2, as descent in $F$ is guaranteed after one step.
Spectral clustering is a graph-based clustering method (see  for an overview) based on a relaxation of the NP-hard problem of finding the optimal balanced cut of an undirected graph. The spectral relaxation has as its solution the second eigenvector of the graph Laplacian and the final partition is found by optimal thresholding. While usually spectral clustering is understood as relaxation of the so called ratio/normalized cut, it can be equally seen as relaxation of the ratio/normalized Cheeger cut, see . Given a weighted undirected graph with vertex set $V$ and weight matrix $W$, the ratio Cheeger cut ($\mathrm{RCC}$) of a partition $(C, \overline{C})$, where $C \subset V$ and $\overline{C} = V \setminus C$, is defined as
$$\mathrm{RCC}(C, \overline{C}) := \frac{\mathrm{cut}(C, \overline{C})}{\min\{|C|, |\overline{C}|\}}, \quad \text{where } \mathrm{cut}(A, B) := \sum_{i \in A,\ j \in B} w_{ij},$$
where we assume in the following that the graph is connected. Due to limited space the normalized version is omitted, but the proposed IPM can be adapted to this case. In  we proposed $p$-spectral clustering, a generalization of spectral clustering based on the second eigenvector of the nonlinear graph $p$-Laplacian (see ; the graph Laplacian is recovered for $p = 2$). The main motivation was the relation between the optimal Cheeger cut $h_{\mathrm{RCC}} = \min_{C \subset V} \mathrm{RCC}(C, \overline{C})$ and the Cheeger cut $h^*_{\mathrm{RCC}}$ obtained by optimal thresholding of the second eigenvector of the $p$-Laplacian, see [6, 9],
$$\frac{h_{\mathrm{RCC}}}{\max_i d_i} \le \frac{h^*_{\mathrm{RCC}}}{\max_i d_i} \le p\, \Big(\frac{h_{\mathrm{RCC}}}{\max_i d_i}\Big)^{\frac{1}{p}}, \quad \text{for all } p > 1,$$
where $d_i = \sum_{j} w_{ij}$ denotes the degree of vertex $i$. While the inequality is quite loose for spectral clustering ($p = 2$), it becomes tight for $p \to 1$. Indeed in  much better cuts than standard spectral clustering were obtained, at the expense of higher runtime. In  the idea was taken up and they considered directly the variational characterization of the ratio Cheeger cut, see also [2, 9],
$$h_{\mathrm{RCC}} = \min_{f \text{ nonconstant},\ \mathrm{median}(f) = 0}\ \frac{\frac{1}{2}\sum_{i,j=1}^{n} w_{ij}\, |f_i - f_j|}{\|f\|_1}. \tag{13}$$
In  they proposed a minimization scheme based on the Split Bregman method . Their method produces comparable cuts to the ones in , while being computationally much more efficient. However, they could not provide any convergence guarantee about their method.
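To make the ratio Cheeger cut and the optimal thresholding step concrete, here is a minimal sketch. It assumes the usual definition of the ratio Cheeger cut (cut weight divided by the size of the smaller side) and thresholds the second eigenvector of the standard graph Laplacian, i.e. the $p = 2$ relaxation, not the 1-Laplacian method of this paper. The toy graph and all function names are assumptions for illustration.

```python
import numpy as np

def ratio_cheeger_cut(W, mask):
    """RCC of the partition (C, complement of C), with C given as a boolean mask."""
    mask = np.asarray(mask, dtype=bool)
    smaller = min(mask.sum(), (~mask).sum())
    if smaller == 0:
        return np.inf  # trivial partition
    return W[np.ix_(mask, ~mask)].sum() / smaller

def optimal_thresholding(W, f):
    """Best RCC over all partitions {i : f_i > t} obtained by thresholding f."""
    best_value, best_mask = np.inf, None
    for t in np.unique(f)[:-1]:  # the largest value would give an empty set
        mask = f > t
        value = ratio_cheeger_cut(W, mask)
        if value < best_value:
            best_value, best_mask = value, mask
    return best_value, best_mask

# Toy graph: two triangles joined by a single edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
W = np.zeros((6, 6))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0

# Second eigenvector of the standard graph Laplacian L = D - W.
L = np.diag(W.sum(axis=1)) - W
_, vecs = np.linalg.eigh(L)
rcc_value, cluster = optimal_thresholding(W, vecs[:, 1])
```

On this graph the thresholded eigenvector separates the two triangles, cutting the single bridging edge.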
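In this paper we consider the functional associated to the $1$-Laplacian $\Delta_1$,
$$F_1(f) = \frac{\frac{1}{2}\sum_{i,j=1}^{n} w_{ij}\, |f_i - f_j|}{\|f\|_1}, \tag{14}$$
and study its associated nonlinear eigenproblem $0 \in \Delta_1 f - \lambda\, \mathrm{sign}(f)$, where $\Delta_1 f$ denotes the subdifferential of the numerator of $F_1$ and $\mathrm{sign}$ is applied componentwise and is set-valued at zero.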
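The link between the functional of the 1-Laplacian and graph cuts can be seen on indicator vectors. The sketch below assumes the functional is the total variation $\frac{1}{2}\sum_{i,j} w_{ij}|f_i - f_j|$ divided by $\|f\|_1$; for the indicator vector of a set $C$ this evaluates to $\mathrm{cut}(C, \overline{C}) / |C|$. The toy graph and the name `F1` are illustrative assumptions.

```python
import numpy as np

def F1(W, f):
    """Total variation of f on the graph divided by the 1-norm of f."""
    pairwise = np.abs(f[:, None] - f[None, :])
    return 0.5 * np.sum(W * pairwise) / np.sum(np.abs(f))

# Toy graph: two triangles joined by a single edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
W = np.zeros((6, 6))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0

indicator = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])  # C = {0, 1, 2}
value_indicator = F1(W, indicator)        # cut(C, complement) / |C|
value_scaled = F1(W, -2.5 * indicator)    # invariant under rescaling and sign flip
```

Both numerator and denominator are even and positively 1-homogeneous, which is why rescaling the vector leaves the value unchanged.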
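Proposition 4.1. Any non-constant eigenvector $f^*$ of the $1$-Laplacian has median zero. Moreover, let $\lambda_2$ be the second eigenvalue of the $1$-Laplacian, then if $G$ is connected it holds $\lambda_2 = h_{\mathrm{RCC}}$.
Proof: The subdifferential of the numerator of $F_1$ can be computed as
$$\partial\Big(\frac{1}{2}\sum_{i,j=1}^{n} w_{ij}\, |f_i - f_j|\Big)_i = \Big\{\sum_{j=1}^{n} w_{ij}\, u_{ij} \ \Big|\ u_{ij} = -u_{ji},\ u_{ij} \in \mathrm{sign}(f_i - f_j)\Big\},$$
where we use the set-valued mapping
$$\mathrm{sign}(x) = \begin{cases} -1, & x < 0, \\ [-1, 1], & x = 0, \\ 1, & x > 0. \end{cases}$$
Moreover, the subdifferential of the denominator of $F_1$ is
$$\partial \|f\|_1 = \mathrm{sign}(f).$$
Note that, assuming that the graph is connected, any non-constant eigenvector must have $\lambda > 0$. Thus if $f$ is an eigenvector of the $1$-Laplacian, there must exist $u_{ij}$ with $u_{ij} = -u_{ji}$ and $u_{ij} \in \mathrm{sign}(f_i - f_j)$ and $\alpha_i$ with $\alpha_i \in \mathrm{sign}(f_i)$ such that
$$0 = \sum_{j=1}^{n} w_{ij}\, u_{ij} - \lambda\, \alpha_i.$$
Summing over $i$ yields due to the anti-symmetry of $u_{ij}$, $\sum_i \alpha_i = |f_+| - |f_-| + \sum_{i:\, f_i = 0} \alpha_i = 0$, where $|f_+|, |f_-|$ are the cardinalities of the positive and negative part of $f$ and $|f_0|$ is the number of components with value zero. Thus we get
$$\big|\, |f_+| - |f_-|\, \big| \le |f_0|,$$
which implies with $|f_+| + |f_-| + |f_0| = n$ that $|f_+| \le \frac{n}{2}$ and $|f_-| \le \frac{n}{2}$. Thus the median of $f$ is zero if $n$ is odd. If $n$ is even, the median is non-unique and is contained in $[f_{(n/2)}, f_{(n/2+1)}]$ (the two middle order statistics), which contains zero.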
If the graph is connected, the only eigenvector corresponding to the first eigenvalue $\lambda_1 = 0$ of the $1$-Laplacian is the constant one. As all non-constant eigenvectors have median zero, it follows with Equation 13 that $\lambda_2 \ge h_{\mathrm{RCC}}$. For the other direction, we have to use the algorithm we present in the following and some subsequent results. By Lemma 4.2 there exists a vector $f^* = \mathbf{1}_C$ with $|C| \le |\overline{C}|$ such that $F_1(f^*) = h_{\mathrm{RCC}}$. Obviously, $f^*$ is non-constant and has median zero and thus can be used as initial point $f^0$ for Algorithm 3. By Lemma 4.1 starting with $f^0 = f^*$ the sequence either terminates and the current iterate $f^0$ is an eigenvector or one finds an $f^1$ with $F_1(f^1) < F_1(f^0)$, where $f^1$ has median zero. Suppose that there exists such an $f^1$, then
$$F_1(f^1) < F_1(f^0) = h_{\mathrm{RCC}} = \min_{f \text{ nonconstant},\ \mathrm{median}(f) = 0} F_1(f),$$
which is a contradiction. Therefore the sequence has to terminate and thus by the argument in the proof of Theorem 4.1 the corresponding iterate is an eigenvector. Thus we get $\lambda_2 \le h_{\mathrm{RCC}}$ and thus with $\lambda_2 \ge h_{\mathrm{RCC}}$ we arrive at the desired result.
For the computation of the second eigenvector we have to modify the IPM which is discussed in the next section.
The direct minimization of (14) would be compatible with the IPM, but the global minimizer is the first eigenvector which is constant. For computing the second eigenvector note that, unlike in the case $p = 2$, we cannot simply project on the space orthogonal to the constant eigenvector, since mutual orthogonality of the eigenvectors does not hold in the nonlinear case.
Algorithm 3 is a modification of Algorithm 1 which computes a nonconstant eigenvector of the $1$-Laplacian. The notation $|f^{k+1}_+|$, $|f^{k+1}_-|$ and $|f^{k+1}_0|$ refers to the cardinality of positive, negative and zero elements, respectively. Note that Algorithm 1 requires in each step the computation of some subgradient $s(f^k) \in \partial S(f^k)$, whereas in Algorithm 3 the subgradient $v^k$ has to satisfy $\langle v^k, \mathbf{1}\rangle = 0$. This condition ensures that the inner objective is invariant under addition of a constant and thus not affected by the subtraction of the median. In contrast to  we can prove convergence to a nonconstant eigenvector of the $1$-Laplacian. However, we cannot guarantee convergence to the second eigenvector. Thus we recommend using multiple random initializations and keeping the result which achieves the best ratio Cheeger cut.
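One way to obtain a subgradient of the 1-norm whose entries sum to zero is sketched below. The rule used for the zero entries, spreading the value $-(|f_+| - |f_-|)/|f_0|$ evenly over them, is an assumption that is consistent with the stated condition and with the cardinality notation above; it is not claimed to be the exact update of Algorithm 3. The function name is hypothetical.

```python
import numpy as np

def zero_sum_sign_subgradient(f):
    """Subgradient v of ||f||_1 with sum(v) = 0 (requires f to have median zero).

    Assumption: nonzero entries get sign(f_i); the zero entries share the
    value -(n_pos - n_neg) / n_zero so that the entries of v sum to zero.
    """
    pos, neg = f > 0, f < 0
    zero = ~(pos | neg)
    n_pos, n_neg, n_zero = int(pos.sum()), int(neg.sum()), int(zero.sum())
    if abs(n_pos - n_neg) > n_zero:
        raise ValueError("no subgradient of the 1-norm sums to zero for this f")
    v = np.sign(f).astype(float)
    if n_zero > 0:
        v[zero] = -(n_pos - n_neg) / n_zero  # lies in [-1, 1] by the check above
    return v

f = np.array([-2.0, -1.0, 0.0, 0.0, 3.0])  # median zero
v = zero_sum_sign_subgradient(f)

# Adding a constant to f does not change <f, v> because the entries of v sum to zero.
shifted = f + 0.37
inner_original = float(v @ f)
inner_shifted = float(v @ shifted)
```

Because the entries of `v` sum to zero, the linear term of the inner objective is unchanged when a constant such as the median is subtracted from the iterate.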
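Lemma 4.1. The sequence $f^k$ produced by Algorithm 3 satisfies $F_1(f^k) > F_1(f^{k+1})$ for all $k \ge 0$ or the sequence terminates.
Proof: Note that, analogously to the proof of Lemma 3.1, we can conclude that the inner objective is nonpositive at the optimum, where the sequence terminates if the optimal value is zero as the previous $f^k$ is among the minimizers of the inner problem. Now observe that the objective of the inner optimization problem is invariant under addition of a constant. This follows from the fact that we always have $\langle v^k, \mathbf{1}\rangle = 0$, which can be easily verified. Hence, with $\langle f^{k+1}, v^k\rangle \le \|f^{k+1}\|_1$, where $f^{k+1}$ denotes the minimizer of the inner problem after subtraction of its median, we get
$$\frac{1}{2}\sum_{i,j=1}^{n} w_{ij}\, |f^{k+1}_i - f^{k+1}_j| - \lambda^k\, \|f^{k+1}\|_1 \le \frac{1}{2}\sum_{i,j=1}^{n} w_{ij}\, |f^{k+1}_i - f^{k+1}_j| - \lambda^k\, \langle f^{k+1}, v^k\rangle < 0.$$
Dividing both sides by $\|f^{k+1}\|_1$ yields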