# Scale Invariant Power Iteration

Power iteration has been generalized to solve many interesting problems in machine learning and statistics. Despite its striking success, theoretical understanding of when and how such an algorithm enjoys good convergence property is limited. In this work, we introduce a new class of optimization problems called scale invariant problems and prove that they can be efficiently solved by scale invariant power iteration (SCI-PI) with a generalized convergence guarantee of power iteration. By deriving that a stationary point is an eigenvector of the Hessian evaluated at the point, we show that scale invariant problems indeed resemble the leading eigenvector problem near a local optimum. Also, based on a novel reformulation, we geometrically derive SCI-PI which has a general form of power iteration. The convergence analysis shows that SCI-PI attains local linear convergence with a rate being proportional to the top two eigenvalues of the Hessian at the optimum. Moreover, we discuss some extended settings of scale invariant problems and provide similar convergence results for them. In numerical experiments, we introduce applications to independent component analysis, Gaussian mixtures, and non-negative matrix factorization. Experimental results demonstrate that SCI-PI is competitive to state-of-the-art benchmark algorithms and often yield better solutions.

## Authors

• 2 publications
• 4 publications
• 66 publications
05/25/2018

### EM algorithms for ICA

Independent component analysis (ICA) is a widely spread data exploration...
04/20/2022

### Hessian Averaging in Stochastic Newton Methods Achieves Superlinear Convergence

We consider minimizing a smooth and strongly convex objective function u...
01/18/2016

### Sub-Sampled Newton Methods I: Globally Convergent Algorithms

Large scale optimization problems are ubiquitous in machine learning and...
07/15/2021

### Newton-LESS: Sparsification without Trade-offs for the Sketched Newton Update

In second-order optimization, a potential bottleneck can be computing th...
07/12/2018

### Convergence Rate of Block-Coordinate Maximization Burer-Monteiro Method for Solving Large SDPs

Semidefinite programming (SDP) with equality constraints arise in many o...
06/20/2017

### Frank-Wolfe Optimization for Symmetric-NMF under Simplicial Constraint

We propose a Frank-Wolfe (FW) solver to optimize the symmetric nonnegati...
11/19/2021

### Impact of spatial coarsening on Parareal convergence

We study the impact of spatial coarsening on the convergence of the Para...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

We consider a generalization of power iteration for finding the leading eigenvector of a matrix . In power iteration, the update rule

is repeatedly applied until some stopping criterion is satisfied. Since no hyperparameter is required, this update rule is very practical yet attains global linear convergence with the rate of

where is the largest absolute eigenvalue of . This convergence result is analogous to that of gradient descent for convex optimization. Therefore, many variants including coordinate-wise [13], accelerated [23], stochastic [17]

, stochastic variance-reduced (VR)

[19, 20], and stochastic VR heavy ball [10] power iterations have been developed, drawing a parallel literature to gradient descent for convex optimization. Also, a general form of power iteration has been used to solve

 maximizef(x)subject tox∈∂Bd≜{x∈Rd:∥x∥=1} (1)

in many applications such as sparse principal component analysis (PCA)

[8, 16], -norm kernel PCA [9], phase synchronization [15], and the Burer-Monteiro factorization of semi-definite programs [3]. (All norms are 2-norms unless indicated otherwise.) Nevertheless, theoretical understanding of when such an algorithm enjoys the attractive convergence property of power iteration is limited. While convex is considered in [8], only global sublinear convergence is shown, not generalizing the appealing linear convergence property of power iteration.

In this work, we introduce a new class of optimization problems called scale invariant problems and show that they can be efficiently solved by scale invariant power iteration (SCI-PI) with a generalized convergence guarantee of power iteration. Scale invariant problems consider scale invariant functions in (1). We say that is scale invariant, which is rigorously defined later, if its geometric surface is invariant under constant multiplication of . Many important optimization problems in statistics and machine learning can be formulated as scale invariant problems, for instance,

-norm kernel PCA and maximum likelihood estimation of mixture proportions, to name a few. Moreover, as studied herein, independent component analysis (ICA), non-negative matrix factorization (NMF), and Gaussian mixture models (GMM) can be formulated as extended settings of scale invariant problems.

Derivatives of scale invariant functions have the interesting relation that holds for some . Using the KKT condition, we derive an eigenvector property stating that any stationary point satisfying for some is an eigenvector of . Due to the eigenvector property, scale invariant problems can be locally seen as the leading eigenvector problem. Therefore, we can expect that a simple update rule like power iteration would efficiently solve scale invariant problems near a local optimum . Another interesting property of scale invariant problems is that by swapping the objective function and the constraint, a geometrically interpretable dual problem with the goal of finding the closest point to the origin from the constraint is obtained. By mapping an iterate to the dual space, taking a descent step in the dual space and mapping it back to the original space, we geometrically derive SCI-PI, which replaces with in power iteration. We show that SCI-PI converges to a local maximum at a linear rate when initialized close to it. The convergence rate is proportional to / where is the spectral norm of and is the Lagrange multiplier corresponding to , generalizing the convergence rate of power iteration. Moreover, under some mild conditions, we provide an explicit expression regarding the initial condition on to ensure linear convergence.

In the extended settings, we discuss three variants of scale invariant problems. In the first setting,

is replaced with a sum of scale invariant functions. This setting covers a Kurtosis-based ICA and can be solved by SCI-PI with similar convergence guarantees. We also consider a block version of scale invariant problems which covers NMF and the Burer-Monteiro factorization of semi-definite programs. To solve block scale invariant problems, we present a block version of SCI-PI and show that it attains linear convergence in a two-block case. Lastly, we consider partially scale invariant problems which include general mixture problems such as GMM. To solve partially scale invariant problems, we present an alternative algorithm based on SCI-PI and the gradient method and prove its local linear convergence. In numerical experiments, we benchmark the proposed algorithms against state-of-the-art methods for KL-NMF, GMM and ICA. The experimental results show that our algorithms are computationally competitive and result in better solutions in select cases.

Our work has the following contributions.

1. We introduce scale invariant problems which cover interesting examples in statistics and machine learning yet can be efficiently solved by SCI-PI due to the eigenvector property.

2. We present a geometric derivation of SCI-PI using a dual reformulation and provide a convergence analysis for it. We show that SCI-PI converges to a local maximum at a linear rate when initialized close to , generalizing the attractive convergence property of power iteration. Moreover, we introduce 3 extended settings of scale invariant problems together with their convergence analyses.

3. We report numerical experiments including a novel reformulation of KL-NMF to extended settings of scale invariant problems. The experimental results demonstrate that SCI-PI are not only computationally competitive to state-of-the-art methods but also often yield better solutions.

The paper is organized as follows. In Section 2, we define scale invariance and present interesting properties of scale invariant problems including an eigenvector property and a dual formulation. We then provide a geometric derivation of SCI-PI and a convergence analysis in Section 3. The extended settings are discussed in Section 4 and we report the numerical experiments in Section 5. All proofs are deferred to Supplementary Material.

## 2 Scale Invariant Problems

Before presenting properties of scale invariant problems, we first define scale invariant functions.

###### Definition 2.1.

We say that a function is multiplicatively scale invariant if it satisfies

 f(cx)=u(c)f(x) (2)

and is additively scale invariant if it satisfies

 f(cx)=f(x)+v(c) (3)

for some even functions with and with .

The following proposition characterizes the exact form of and for continuous .

###### Proposition 2.2.

If a continuous function satisfies (2) with a multiplicative factor , then we have for some . Also, if a continuous function satisfies (3) with an additive factor , then we have for some where .

Next, we establish derivative-based properties of scale invariant functions.

###### Proposition 2.3.

Suppose is twice differentiable. If satisfies (2) with , we have

 c∇f(cx)=|c|p∇f(x),∇f(x)Tx=pf(x),∇2f(x)x=(p−1)∇f(x). (4)

Also, if satisfies (3) with , we have

 c∇f(cx)=∇f(x),∇f(x)Tx=log−1(a),∇2f(x)x=−∇f(x). (5)

The interesting relation that holds for some is presented in Proposition 2.3. Using the first-order optimality conditions, we derive an eigenvector property as follows.

###### Proposition 2.4.

Suppose that is twice differentiable and let be a stationary point of (1) such that . If satisfies (2) with , then we have . Also, if satisfies (3) with , then we have . In both cases, is an eigenvector of . Moreover, if is greater than the remaining eigenvalues of , then is a local maximum to (1).

Proposition 2.4 states that a stationary point is an eigenvector of and becomes a local maximum if the Lagrange multiplier is greater than the remaining eigenvalues of . Due to this property, scale invariant problems can be considered as a generalization of the leading eigenvector problem. Next, we introduce a dual formulation of scale invariant problems.

###### Proposition 2.5.

Suppose that a continuous function is either multiplicatively scale invariant such that or additively scale invariant with an additive factor with . Then, solving (1) is equivalent to solving the following optimization problem

 \rm minimize∥w∥\rm subject tof(w)=1. (6)

Note that a dual reformulation for a multiplicatively scale invariant with or an additively scale invariant with can be obtained by replacing with in (6). The dual formulation (6) has a nice geometric interpretation that an optimal solution is the closest point to the origin from . This understanding is used to derive SCI-PI in Section 3.

-norm kernel PCA, estimation of mixture proportions, and KL-divergence NMF are all cases of scale invariant problems. The details of these cases are provided in Appendix A.1.

## 3 Scale Invariant Power Iteration

In this section, we provide a geometric derivation of SCI-PI to find a local optimal solution of (1). The algorithm is developed using the geometric interpretation of the dual formulation (6) as illustrated in Figure 1. Starting with an iterate , we obtain a dual iterate by projecting to the constraint . Given

, we identify the hyperplane

which the current iterate lies on and is tangent to . After identifying the equation of , we find the closest point to the origin from and obtain a new dual iterate by projecting to the constraint . Finally, we obtain a new primal iterate by mapping back to the set .

Now, we develop an algorithm based on the above idea. For derivation of the algorithm, we assume that an objective function is continuous and satisfies either (2) with where and for all or (3) with where . Under these conditions, a scalar mapping from to can be well defined as or , respectively. Let . Since is on the constraint

, the tangent vector of the hyperplane

is . Therefore, we can write down the equation of the hyperplane as . Note that is a scalar multiple of where the scalar can be determined from the requirement that is on . Since is the projection of , it must be a scalar multiple of the tangent vector . Therefore, we can write as . Finally, by projecting to , we obtain

 xk+1=wk+1∥wk+1∥=dkyk∥dkyk∥=yk∥yk∥=∇f(wk)∥∇f(wk)∥=∇f(ckxk)∥∇f(ckxk)∥=∇f(xk)∥∇f(xk)∥

where the last equality follows from Proposition 2.3. Summarizing all the above, we obtain SCI-PI presented in Algorithm 1.

Next, we provide a convergence analysis of SCI-PI. Global sublinear convergence of SCI-PI for convex has been addressed in [8]. We additionally show that SCI-PI yields an ascent step even for quasi-convex .

###### Proposition 3.1.

If is quasi-convex and differentiable, a sequence of iterates generated by SCI-PI satisfies for .

If is quasi-convex, is convex, therefore, from Figure 1, we can expect that SCI-PI would yield an ascent step. If is not quasi-convex, is not necessarily increasing, making it hard to analyze global convergence. Assuming that an initial point is close to a local maximum , we study local convergence of SCI-PI as follows.

###### Theorem 3.2.

Let be a scale invariant, twice continuously differentiable function on an open set containing and let be a local maximum satisfying and where is an eigen-pair of with . Then, there exists some such that under the initial condition , the sequence of iterates generated by SCI-PI satisfies

 1−(xTkx∗)2≤k−1∏t=0(¯¯¯λ2λ∗+γt)2(1−(xT0x∗)2)andlimk→∞γk=0.

Moreover, if has a continuous Hessian on an open set containing , we can explicitly write as

where

 ¯¯¯λ1=|λ1|,M=maxx∈∂Bd,y∈Bd,∞√∑di=1(xTGi(y)x)2,Gi(y)=∑dj=1vi,jHj(y).

Theorem 3.2 presents a local convergence result of SCI-PI with generalizing the convergence rate of power iteration. Note that Theorem 3.2 requires while it is sufficient to have for to ensure local optimality. However, by adding for some to the objective function , we can always enforce . On the other hand, by adding for some , we may improve the convergence rate by increasing the relative gap between and .

## 4 Extended Settings

### 4.1 Sum of Scale Invariant Functions

Consider a sum of scale invariant functions having the form of where is a multiplicatively scale invariant function with and is an additively scale invariant function with . Note that this does not imply that is scale invariant in general. However, by Proposition 2.3, the gradient of has the form of

 ∇f(x)=m∑i=1∇gi(x)+n∑j=1∇hj(x)=[m∑i=1(1pi−1)∇2gi(x)−n∑j=1∇2hj(x)]x=F(x)x,

therefore a stationary point satisfying is an eigenvector of . We present a local convergence analysis of SCI-PI for a sum of scale invariant functions as follows.

###### Theorem 4.1.

Let be a sum of scale invariant functions and twice continuously differentiable on an open set containing and let be a local maximum satisfying and Then, there exists some such that under the initial condition , the sequence of iterates generated by SCI-PI satisfies

 1−(xTkx∗)2≤k−1∏t=0(¯¯¯λ2λ∗+γt)2(1−(xT0x∗)2)andlimk→∞γk=0.

Moreover, if has a continuous Hessian on an open set containing ,

where

 ¯¯¯λ1=√2⋅∥∇2f(x∗)x∗∥,M=maxx∈∂Bd,y∈Bd,∞√∑di=1(xTGi(y)x)2,Gi(y)=∑dj=1vi,jHj(y).

Note that has the additional factor which comes from the fact that is not necessarily an eigenvector of . Nonetheless, the asymptotic convergence rate in Theorem 4.1 provides a generalization of the convergence rate in Theorem 3.2.

### 4.2 Block Scale Invariant Problems

Next, consider a class of optimization problems having the form of: where is scale invariant in for fixed and vice versa. We derive the following alternating maximization algorithm called block SCI-PI as

 xk+1←∇xf(x,yk)/∥∇xf(x,yk)∥,yk+1←∇yf(xk,y)/∥∇yf(xk,y)∥. (7)
###### Theorem 4.2.

Suppose that is twice continuously differentiable on an open set containing and let be a local maximum satisfying , , , where and are eigen-pairs of and , respectively such that and . If holds, then for the sequence of iterates generated by (7), there exists some such that if then we have for some sequence such that where

 Δk=⎡⎢ ⎢⎣√1−(xTkx∗)2√1−(yTky∗)2⎤⎥ ⎥⎦,ρ=12⎛⎜⎝¯¯¯λ2λ∗+¯¯¯s2s∗+ ⎷(¯¯¯λ2λ∗−¯¯¯s2s∗)2+4ν2λ∗s∗⎞⎟⎠<1.

If and are independent (), we have . Otherwise, increases as increases. Note the result of Theorem 3.2 can be restored by dropping or in Theorem 4.2 and the algorithm and convergence analysis can be easily generalized to more than two blocks.

### 4.3 Partially Scale Invariant Problems

Lastly, we consider a class of optimization problems of the form: where is a scale invariant function in for each . This problem has the form of (1) with respect to once is fixed. Also, by fixing , we obtain an unconstrained optimization problem with respect to . Using SCI-PI and the gradient method, an alternative maximization algorithm is derived as

 xk+1←∇xf(xk,yk)/∥∇xf(xk,yk)∥,yk+1←yk+α∇yf(xk,yk). (8)

While the gradient method is used in (8), any method for unconstrained optimization can replace it.

###### Theorem 4.3.

Suppose that is scale invariant in for each , -strongly concave in with an -Lipschitz continuous for each , and three-times continuously differentiable on an open set containing . Let be a local maximum satisfying and where is an eigen-pair of with . If holds, then for the sequence of iterates generated by (8) with , there exists some such that if , then we have for some sequence such that where

 Δk=⎡⎣√1−(xTkx∗)2∥yk−y∗∥⎤⎦,ρ=12⎛⎜⎝¯¯¯λ2λ∗+L−μL+μ+ ⎷(¯¯¯λ2λ∗−L−μL+μ)2+8ν2λ∗(L+μ)⎞⎟⎠<1.

As in the result of Theorem 4.2, the rate increases as increases and is equal to when . Also, by dropping , we can restore the convergence result of Theorem 3.2.

## 5 Numerical Experiments

We tested the proposed algorithms on real-world data sets. All experiments were implemented on a standard laptop (2.6 GHz Intel Core i7 processor and 16GM memory) using the Julia programming language. Let us emphasize that scale invariant problems frequently appear in many important applications in statistics and machine learning. We select 3 important applications, KL-NMF, GMM and ICA. A description of the data sets is provided in Supplementary Material. All of them are standard sets used in prior works on these 3 problems.

#### Kl-Nmf:

The KL-divergence NMF (KL-NMF) subproblem can be solved via SCI-PI (see Supplementary Material A.1). Our focus is to compare this algorithm with other famous alternating minimization algorithms listed below, updating and alternatively. To lighten the notation, let , and denote element-wise product, division and square, respectively. We let and denote a vector of ones.

• Projected gradient descent (PGD): It iterates followed by projection onto the simplex, where is an appropriate learning rate [14].

• Multiplicative update (MU): A famous multiplicative update algorithm is originally suggested by [12], which iterates and is learning rate free.

• Our method (SCI-PI): It iterates and rescales , where is a shift parameter. We simply use for preconditioning.

• Sequential quadratic programming (MIXSQP): Solving each subproblem via a convex solver mixsqp [11]. This algorithm performs sequential non-negative least squares.

To study the convergence rate for KL-NMF subproblems, we use four simulated data sets exhibited in [11]. We study MU, PGD and SCI-PI since they have the same order of computational complexity per iteration, but omit MIXSQP since it is a second-order method which cannot be directly compared. For PGD, the learning rate is optimized by grid search. The stopping criterion is where is the solution obtained by MIXSQP after extensive computation time. The average runtime for aforementioned 3 methods are 33, 33 and 30 seconds for 10,000 iterations, respectively. The result is shown in Figure 2111For each evaluation, we randomly draw 10 initial points and report the averaged relative errors with respect to . The initial input for the KL-NMF problem is a one-step MU update of a Unifrandom matrix.. It shows that SCI-PI outperforms the other 2 for all simulated data sets. Also, all methods seems to exhibit linear convergence.

Next, we test the 4 algorithms on 4 real-world data sets for 3 different purposes: 287 waving tree (WT) images for image reconstruction, two bag-of-words data sets from the KOS blog and NIPS full papers for topic modeling, and a Wikipedia (WIKI) vote network for graph clustering. We estimate factors. At each iteration, all 4 algorithms solve subproblems simultaneously.

The result is summarized in Figure 3222In all plots we do not show the first few iterations. The initial random solutions have the gap of approximately 50% which drops to a few percent after 10 iterations where the plots start.. The convergence plots are based on the average relative errors over 10 repeated runs with random initializations. The result shows that SCI-PI is an overall winner, showing faster convergence rates. The stopping criterion is the same as above. To assess overall performance when initialized differently, we select KOS and WIKI and run MU, PGD, SCI-PI, and MIXSQP 10 times. The 3 algorithms except MIXSQP have (approximately) the same computational cost per iteration, take runtime of 391, 396, 408 seconds for KOS data and 372, 390, 418 seconds for WIKI data, respectively for 200 iterations. MIXSQP has a larger per iteration cost. After 400 seconds, SCI-PI achieves lowest objective values in all cases but one for each data set (38 out of 40 in total). Thus it clearly outperforms other methods and also achieves the lowest variance. Unlike the other 3 algorithms, SCI-PI is not an ascent algorithm but an eigenvalue-based fixed-point algorithm. We observe that sometimes SCI-PI converges to a better solution due to this fact. Admittedly, this can be potentially dangerous but for the KL-NMF problem its performance turns out to be stable.

#### Gmm:

GMM fits a mixture of Gaussian distributions to the underlying data. Let

where is the sample index and the cluster index and let be the actual mixture proportion vector. GMM fits into our restricted scale invariant setting (Sec. 4.3) with reparametrization, but the gradient update for is replaced by the exact coordinate ascent step. The EM and SCI-PI updates for can be written respectively as

 r=1⊘(Lπ),πnewk∝π⊙(LTr)(EM),πnewk∝π⊙(α+LTr)⊙2,(SCI-PI). (9)

We compare SCI-PI and EM for different real-world data sets. All the algorithms initialize from the same standard Gaussian random variable, repeatedly for 10 times. The result is summarized in the left panel in Figure

4. The stopping criterion is . In some cases, SCI-PI achieves much larger objective values even if initialized the same. In many cases the 2 algorithms exhibit the same performance. This is because estimation of ’s and ’s are usually harder than estimation of , and EM and SCI-PI have the same updates for and . For a few cases EM outperforms SCI-PI. Let us mention that SCI-PI and EM have the same order of computational complexity and require 591 and 590 seconds of total computation time, respectively.

#### Ica:

We implement SCI-PI on the Kurtosis-based ICA problem [6] and compare it with the benchmark algorithm FastICA [5], which is the most popular algorithm. Given a pre-processed333A centered matrix is pre-processed by so that . data matrix , we seek to maximize an approximated negative entropy subject to , for maximizing Kurtosis-based non-Gaussianity [7]. This problem fits into the sum of scale invariant setting (Sec. 4.1). SCI-PI iterates and FastICA iterates , both followed by normalization.

We compare SCI-PI and FastICA for different real-world data sets. The majority of data points (81 out of 100 in total) show that SCI-PI tends to find a better solution with a larger objective value, but in a few cases SCI-PI converges to a sub-optimal point. Both algorithms are fixed-point based and thus have no guarantee of global convergence but overall SCI-PI outperforms FastICA. SCI-PI and FastICA have the same order of computational complexity and require 11 and 12 seconds of total computation time, respectively.

## 6 Final Remarks

In this paper, we propose a new class of optimization problems called the scale invariant problems, together with a generic solver SCI-PI, which is indeed an eigenvalue-based fixed-point iteration. We showed that SCI-PI directly generalizes power iteration and enjoys similar properties, for instance, that SCI-PI has local linear convergence under mild conditions and its convergence rate is determined by eigenvalues of the Hessian matrix at a solution. Also, we extend scale invariant problems to problems with more general settings. Although scale invariance is a rather restrictive assumption, we show by experiments that SCI-PI can be a competitive option for numerous important problems such as KL-NMF, GMM and ICA. Finding more examples and extending SCI-PI further to a more general setting is a promising direction for future studies.

## References

• [1] Christopher M Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
• [2] Sébastien Bubeck. Convex Optimization: Algorithms and Complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.
• [3] Murat A Erdogdu, Asuman Ozdaglar, Pablo A Parrilo, and Nuri Denizcan Vanli. Convergence Rate of Block-Coordinate Maximization Burer-Monteiro Method for Solving Large SDPs. arXiv preprint arXiv:1807.04428, 2018.
• [4] Cédric Févotte and Jérôme Idier. Algorithms for Nonnegative Matrix Factorization with the -divergence. Neural Computation, 23(9):2421–2456, 2011.
• [5] Aapo Hyvarinen.

Fast ICA for Noisy Data using Gaussian Moments.

In Proceedings of the 1999 IEEE International Symposium on Circuits and Systems VLSI, volume 5, pages 57–61. IEEE, 1999.
• [6] Aapo Hyvärinen, Juha Karhunen, and Erkki Oja. Independent Component Analysis, volume 46. John Wiley & Sons, 2004.
• [7] Aapo Hyvärinen and Erkki Oja. Independent Component Analysis: Algorithms and Applications. Neural Networks, 13(4-5):411–430, 2000.
• [8] Michel Journée, Yurii Nesterov, Peter Richtárik, and Rodolphe Sepulchre. Generalized Power Method for Sparse Principal Component Analysis. Journal of Machine Learning Research, 11(Feb):517–553, 2010.
• [9] Cheolmin Kim and Diego Klabjan. A Simple and Fast Algorithm for L1-norm Kernel PCA. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
• [10] Cheolmin Kim and Diego Klabjan. Stochastic Variance-reduced Heavy Ball Power Iteration. arXiv preprint arXiv:1901.08179, 2019.
• [11] Youngseok Kim, Peter Carbonetto, Matthew Stephens, and Mihai Anitescu. A Fast Algorithm for Maximum Likelihood Estimation of Mixture Proportions Using Sequential Quadratic Programming. arXiv preprint arXiv:1806.01412, 2018.
• [12] Daniel D Lee and H Sebastian Seung. Algorithms for Non-negative Matrix Factorization. In Advances in Neural Information Processing Systems, pages 556–562, 2001.
• [13] Qi Lei, Kai Zhong, and Inderjit S Dhillon. Coordinate-wise Power method. In Advances in Neural Information Processing Systems, pages 2064–2072, 2016.
• [14] Chih-Jen Lin. Projected Gradient Methods for Non-negative Matrix Factorization. Neural Computation, 19(10):2756–2779, 2007.
• [15] Huikang Liu, Man-Chung Yue, and Anthony Man-Cho So. On the Estimation Performance and Convergence Rate of the Generalized Power Method for Phase Synchronization. SIAM Journal on Optimization, 27(4):2426–2446, 2017.
• [16] Ronny Luss and Marc Teboulle. Conditional Gradient Algorithms for Rank-One Matrix Approximations with a Sparsity Constraint. SIAM REVIEW, 55(1):65–98, 2013.
• [17] Erkki Oja.

Simplified Neuron Model as a Principal Component Analyzer.

Journal of Mathematical Biology, 15(3):267–273, 1982.
• [18] Prasanna K Sahoo and Palaniappan Kannappan. Introduction to Functional Equations. Chapman and Hall/CRC, 2011.
• [19] Ohad Shamir. A Stochastic PCA and SVD Algorithm with an Exponential Convergence Rate. In International Conference on Machine Learning, pages 144–152, 2015.
• [20] Ohad Shamir. Fast Stochastic Algorithms for SVD and PCA: Convergence Properties and Convexity. In International Conference on Machine Learning, pages 248–256, 2016.
• [21] Rong Wang, Feiping Nie, Xiaojun Yang, Feifei Gao, and Minli Yao. Robust 2DPCA With Non-greedy -Norm Maximization for Image Analysis. IEEE Transactions on Cybernetics, 45(5):1108–1112, 2015.
• [22] Yu-Xiong Wang and Yu-Jin Zhang. Nonnegative Matrix Factorization: A Comprehensive Review. IEEE Transactions on Knowledge and Data Engineering, 25(6):1336–1353, 2013.
• [23] Peng Xu, Bryan He, Christopher De Sa, Ioannis Mitliagkas, and Chris Re. Accelerated Stochastic Power Iteration. In

International Conference on Artificial Intelligence and Statistics

, pages 58–67, 2018.

## Appendix A Supplementary Material

### a.1 Examples

We introduce two immediate applications, -norm kernel PCA and the mixture model, which have been intensively studied over the past few decades in the field of statistics and machine learning.

###### Example A.1 (Lp-norm kernel PCA).

For , -norm kernel PCA [21] is defined as

 \rm maximizefp(x)=n−1∑ni=1∥Φ(ai)Tx∥pp\rm subject tox∈Bd, (10)

which satisfies property (2) with . The example includes the standard -norm PCA.

###### Example A.2 (Estimation of Mixture Proportions).

Given a design matrix satisfying , the problem of estimating mixture proportions seeks to find a vector

of mixture proportions on the probability simplex

that maximizes the log-likelihood . We reformulate the problem by reparametrizing by and obtain

 \rm maximizef0(x)=n−1∑nj=1log(∑dk=1Ljkx2k)\rm subject% tox∈Bd, (11)

which now satisfies property (3) with .

The reformulation idea in Example A.2 implies that any simplex-constrained problem with scale invariant can be reformulated to a scale invariant problem. Example A.2 has a direct application to general mixture models, including the GMM [1]. The same optimization problem also appears in the Kullback-Leibler (KL) divergence NMF problem. In what follows, we show that the KL divergence NMF subproblem is indeed a scale invariant problem.

###### Example A.3 (Kl-Nmf).

The KL-NMF problem [4, 12, 22] is defined as

 minimize DKL(V∥WH)≜∑i,j[VijlogVij∑kWikHkj−Vij+∑kWikHkj] (12) subject to Wik≥0, Hkj≥0, i=1,⋯,n, j=1,⋯,m, k=1,⋯,K.

Many popular algorithms for KL-NMF are based on alternate minimization of and . We consider a subproblem given and :

 \rm minimizefKL(h)=∑i[vilogvi∑kWikhk−vi+∑kWikhk]\rm subject tohk≥0, (13)

where we let and , as the objective is decomposed into separate subproblems. Problem (13) can be reformulated to a scale invariant problem as follows.

###### Lemma A.4.

The KL-NMF subproblem (13) is equivalent to the following scale invariant problem:

 \rm maximize−∑ivilog∑kWik¯hk\rm subject to∑k¯hk=1,  ¯hk≥0, (14)

with the relationship .

### a.2 Description of Data Sets

For KL divergence nonnegative matrix factorization (Section 5), we used 4 public real data sets available online444These 4 data sets are retrieved from https://www.microsoft.com/en-us/research/project, https://archive.ics.uci.edu/ml/datasets/bag+of+words, and https://snap.stanford.edu/data/wiki-Vote.html. Waving Trees (WT) has 287 images, each having pixels. KOS and NIPS are sparse, large matrices implemented for topic modeling. WIKI is a large binary matrix having values or representing the adjacency matrix of a directed graph.

For GMM (Section 5), we used 10 public real data sets. We used all small/moderate data sets provided by the mlbench package in R. We select data sets for multi-class classification problems and run EM and SCI-PI for given number of classes without class labels. The sample size varies from to , the dimension varies from to , and the number of classes varies from to . If missing data exists, we simply replace it by since our main focus is to solve the optimization problem better.

For ICA, we used 9 public data sets from the UCI Machine Learning repository. The sample size varies from to 58,000 and the dimension varies from to .

### a.3 More on Nonnegative Matrix Factorization

We first draw averaged convergence plots for the 4 real world data sets in Figure LABEL:fig:nmftotal. For the WT data set, MIXSQP exhibits a best convergence. Also, the convergence of SCI-PI is much faster than those of MU and PGD. For the other 3 data sets, MIXSQP sometimes converges to suboptimal points. Also, SCI-PI exhibits fastest convergence.

Next, we design a simple simulation study to evaluate the performance of block SCI-PI on KL-NMF problems. To this end, we sample a data matrix

independently from a single “zero-inflated” Poisson distribution (ZIP):

 Vij∼π0δ0+(1−π0)Poisson(l) (15)

where is the proportion of zero inflation and is the mean parameter of the Poisson distribution. Although this data generating distribution does not always reflect empirical distributions of real-world data sets, our focus here is to understand the behavior of SCI-PI compared to the other two methods, MU and PGD.

Let and be the row and column lengths of , be the number of non-negative factors and be the number of zero entries in . We set , , , and as a default and change some parameters to understand how the algorithms work for different settings.

Figure 5 summarizes the result. We conclude that SCI-PI tends to perform better in comparison to MU and PGD when is denser ((1,1) vs. (1,3) and (2,1) vs. (2,2) in Figure 5), when is larger ((1,1) vs. (2,3) in Figure 5) and when

is more uniformly distributed ((1,1) vs (1,2) and (1,1) vs (2,1) in Figure

5).

### a.4 Proofs of Results in Section 2

###### Proof of Proposition 2.2.

We first consider the multiplicative scale invariant case. Let be a point such that . Then, we have

 f(rsx)=u(rs)f(x)=u(r)u(s)f(x),

resulting in

 u(rs)=u(r)u(s)

for all . Letting , we have

 g(r+s)=ln(u(er+s))=ln(u(eres))=ln(u(er)u(es))=ln(u(er))+ln(u(es))=g(r)+g(s),

implying that satisfies the Cauchy functional equation. Since is continuous, so is and thus . Therefore, by [18], we have

 g(r)=rg(1) (16)

for all . From the definition of and (16), we have

 u(er)=eg(r)=(er)g(1). (17)

Representing as and using (17), we obtain

 u(r)=u(eln(r))=rg(1)=rln(u(e))=rp.

If , then since , we have

 limr→0+f(rx)=limr→0+u(r)f(x)=f(x)⋅limr→0+rp=f(x)⋅∞≠f(0)<∞,

contradicting the fact that is continuous at . If , we get , which contradicts . Therefore, we must have . From being an even function, we finally have

 u(r)=|r|p

for .

Consider now the additive scale invariant case. For any , we have

 f(rsx)=f(x)+v(rs)=f(x)+v(r)+v(s),

resulting in

 v(rs)=v(r)+v(s)

for all . Letting , we have

 g(r+s)=v(er+s)=v(eres)=v(er)+v(es)=g(r)+g(s).

Since is continuous and satisfies the Cauchy functional equation, we have

 g(r)=rg(1)

for all . For , letting , we have

 v(r)=v(eln(r))=g(ln(r))=g(1)ln(r)=v(e)ln(r)=loga(r)

where