Matrix factorization plays an important role in machine learning, due to its wide applications including collaborative filtering
Matrix factorization plays an important role in machine learning, due to its wide applications including collaborative filtering, parts-based learning  and clustering . Given a data matrix (where denotes the number of data samples and denotes the dimension of each data sample), one aims to find matrices and such that , where usually . Many existing algorithms in the literature find and by minimizing (possibly with regularizations on or to induce structural properties). However, when the number of data samples becomes large, solving this problem using batch algorithms can be highly inefficient, in terms of both computation and storage. To improve the efficiency, online (stochastic) matrix factorization algorithms [4, 5, 6, 7, 8, 9, 10, 11, 12] have been proposed to learn the dictionary (matrix) from a sequence of randomly drawn data samples (possibly in mini-batches).111We use “online” and “stochastic” interchangeably. In the literature, online algorithms also cover the setting where the number of data samples is infinite. However, we only consider to be finite, but can be very large. Extensive numerical evidence has shown that the stochastic matrix factorization algorithms can learn significantly faster than their batch counterparts, in terms of (empirical) convergence rate. In addition, only storage space are consumed by these algorithms, in contrast to by batch methods. Due to these advantages, stochastic matrix factorization algorithms have gained much popularity in recent years.
From an optimization point of view, previous stochastic matrix factorization algorithms [4, 5, 6, 7, 8, 9, 10, 11, 12] aim to solve a nonconvex stochastic program involving the dictionary (see (4) in Remark 1). They mainly leverage two optimization frameworks, namely stochastic majorization-minimization (SMM)  and stochastic (sub-)gradient descent (SGD) . Based on either framework, almost sure asymptotic convergence to stationary points of the stochastic program has been shown. However, the asymptotic nature of this convergence analysis cannot provide insights into the dictionary learning process at any finite time instant. Thus, we wish to understand the non-asymptotic convergence of dictionary learning, as well as the sample complexity for learning a “reasonable” dictionary . Besides, we desire improved stochastic methods that yield faster convergence, at least for large but finite datasets.
Main Contributions. Inspired by the recent advances in variance reduction techniques (particularly for nonconvex problems) [15, 16, 17], we propose a unified framework to speed up the existing stochastic matrix factorization algorithms via variance reduction (VR). As shown in Section 2, our framework is general and it subsumes eight well-known SMF formulations as special instances, including those robust variants that explicitly model outliers in their objective functions. Within this framework, we derive a convergence analysis of our framework, where in each step the coefficient (and outlier) vectors are only learned approximately. In the context of stochastic (variance-reduced) nonconvex proximal gradient algorithms, this is the
and it subsumes eight well-known SMF formulations as special instances, including those robust variants that explicitly model outliers in their objective functions. Within this framework, we derive anon-asymptotic convergence rate, measured in terms of how “fast” the algorithm converges to a stationary point in expectation (see Section 4.1 for precise statements). Accordingly, we derive sample and computational complexities for our algorithm to converge to an -stationary point in expectation. To the best of our knowledge, this is the first time that a non-asymptotic convergence analysis (together with sample complexity) is established for SMF algorithms. To further improve the efficiency of our framework, we also provide an inexact
convergence analysis of our framework, where in each step the coefficient (and outlier) vectors are only learned approximately. In the context of stochastic (variance-reduced) nonconvex proximal gradient algorithms, this is thefirst inexact analysis thus far. We show, via extensive experiments on various SMF formulations and datasets, that our variance-reduced framework consistently outperforms state-of-the-art frameworks (including SMM and SGD), in terms of both convergence rate and accuracy of learned dictionaries.
Related works. Another line of works [18, 19, 20] considers using SMF algorithms in a matrix completion setting, where the matrices and are jointly learned from a random sequence of entries (or columns) of a data matrix . By leveraging low-rankness and the restricted isometry property (RIP) on , the product of and can exactly recover (with high probability). However, their problem setting differs from ours in two aspects. First, we do not store or explicitly learn
(with high probability). However, their problem setting differs from ours in two aspects. First, we do not store or explicitly learn, due to space constraints. Second, our convergence results do not rely on low-rankness or RIP. Therefore, similar to [4, 5, 6, 7, 8, 9, 10, 11, 12], our results are stated in terms of stationary points, which are oftentimes appealing in real applications [4, 11]. In the context of stochastic PCA, VR techniques have been applied to yield a linear convergence . Again, the problem setting therein vastly differs from ours.
Notations. We use boldface capital letters, boldface lowercase letters and lowercase letters to denote matrices, column vectors and scalars respectively. Denote as and as . For any , define and . For a matrix , denote its -th column as . We use to denote the -norm of a vector and the Frobenius norm of a matrix. For a convex and closed set , denote its indicator function as , such that if and otherwise. All the sections, lemmas and algorithms beginning with ‘S’ will appear in the supplemental material.
2 Problem Formulation
Let denote the set of data samples.
We first define a loss function
denote the set of data samples. We first define a loss functionw.r.t. a data sample and a dictionary as
where and denote the constraint set and regularizer of the coefficient vector respectively. Since large-sale datasets may contain outliers, we can define a “robust” version of as
where and denote the constraint set and regularizer of the outlier vector respectively. For convenience, let denote either or . Based on , we formulate our problem as
where and denote the constraint set and regularizer of the dictionary respectively. Our targeted problem (3) is general and flexible, in the sense that by choosing the constraint sets , and and the regularizers , and in different manners, (3) encompasses many important examples in the literature of SMF. For loss function , we have:
Online Structured Sparse Learning (OSSL) : , , and , where is a subvector of , and .
Smooth Sparse ODL (SSODL) : , , and , where and is positive semidefinite.
For loss function , we have:
In almost all the works cited in this section, the optimization problem is formulated as a stochastic program that minimizes the expected risk of dictionary training. It has the form
where is a probability distribution. The reasons that we consider (
is a probability distribution. The reasons that we consider (3) rather than (4) are three-fold. First, (3) can be regarded as the sample average approximation  of (4). Solving this approximation has been a popular and efficient approach for solving a general stochastic program like (4) . Second, (3) can also be interpreted as minimizing the empirical risk incurred by any large-scale finite training dataset. Solving (3) efficiently can greatly speed up the dictionary learning process. Last but not least, since (3) is a finite-sum minimization problem, we can employ the recently developed variance reduction techniques [15, 16] to solve it in an efficient manner. See Section 3 for details.
For algorithm and convergence analysis, we will focus on the loss function , which contains as a special case (by choosing ).
We present the pseudo-code for our stochastic matrix factorization algorithm with variance reduction techniques in Algorithm 1. To develop our algorithm, we first make two mild assumptions. We then describe our algorithm in details, with focus on learning the coefficient and outlier vectors.
, and are convex, proper and closed functions on , and respectively, with proximal operators that can be evaluated efficiently.222For a function , the “efficient” evaluation of means that for any , can be computed in arithemetic operations. The same definitions apply to “efficient projections” in ii.
, and are convex and admit efficient (Euclidean) projections onto them.
Algorithm Description. Variance-reduced stochastic optimization algorithms typically contain outer and inner loops [15, 24]. In Algorithm 1, we use and to denote the indices for outer and inner iterations respectively. At the beginning of each outer iteration , we solve (regularized) least-square regression problems to learn the set of coefficient and outlier vectors based on . Based on the learned vectors, we compute , which will be shown to be in Section 4. In each inner iteration , we solve a (random) mini-batch of regression problems, based on and respectively. The regression results, together with , are used to calculate a search direction , which acts as a variance-reduced gradient estimate of the loss function
, which acts as a variance-reduced gradient estimate of the loss functionat . The proximal operator of either has a closed form  or can be obtained using Douglas-Rachford splitting or alternating direction method of multipliers , both in (arithmetic) operations. We choose the final dictionary via option I or II (in lines 15 and 16 respectively).333In this work, the line numbers always refer to those in Algorithm 1. In practice, we choose the final dictionary as the last iterate , i.e., option II. However, to streamline the analysis, we output the dictionary using option I. This is consistent with the many works on variance-reduced SGD algorithms, e.g., .
Learning coefficient and outlier vectors. It can be seen that the computational complexity of Algorithm 1 crucially depends on solving the problem (2), which appears in lines 4, 8 and 9 of Algorithm 1. To solve this problem, we leverage the block successive upperbound minimization framework , by alternating between the following two steps, i.e.,
where and denote the updated values of and respectively (in one iteration). The proximal operators of and can be obtained in the same way as that of . Overall, the computational complexity of solving (2) is .
In our implementation, we do not store before computing . Instead, after obtaining and , we add them to the running sum in line 5, and then discard them. The same procedure applies to computing . Thus the storage complexity of Algorithm 1 is . This complexity is linear in the dimension of the optimization variable , which has entries.
4 Convergence Analysis
Additional assumptions. For analysis purposes, we make the following three additional assumptions:
The data samples lie in a compact set .
For any and , is jointly strongly convex on .
is compact; is compact or is coercive on ; is compact or is coercive on .
Assumption i is natural since all real data are bounded. Assumption ii is common in the literature [4, 11] and can be satisfied by adding Tikhonov regularizers (see [11, Remark 4] for a detailed discussion). Assumption iii is satisfied by all the Problems 1 to 4, except in 2, 1, 2 and 3, is not bounded. However, for each problem, it was proven that the sequence of basis matrices generated by the corresponding algorithm in ,  or  belonged to a compact set almost surely. Thus the compactness of is a reasonable assumption.
4.1 Key Lemmas and Main Results
We first present some definitions that are used in the statements of our results. We then state our key lemmas (Lemmas 1 and 2) and main convergence theorems (Theorems 1 and 2), together with the derived sample and computational complexities (Corollary 1). The proof sketches of all the results are shown in Section 4.2, with complete proofs deferred to Section S-1 to Section S-4.
Definitions. Define a filtration such that is the -algebra generated by the random sets . It is easy to see that is adapted to to . Consider a composite function defined on a Euclidean space , where is differentiable and is proper, closed and convex. For any and , define the (proximal) gradient mapping  of at with step size , . A point is -stationary w.r.t. the problem and if . Finally, define .
Lemma 1 (Smoothness of ).
Define . The loss function in (2) is differentiable on and
In addition, is continuous on . Moreover, for any , is Lipschitz on with a Lipschitz constant independent of .
Lemma 2 (Variance bound of ).
In Algorithm 1, and
Measuring convergence rate. Since the loss function in (3) is nonconvex (due to bilinearity), obtaining global minima of is in general out-of-reach. Thus in almost all the previous works [4, 5, 6, 7, 8, 9, 10, 11, 12], convergence to stationary points of (3) was studied instead. Define . Following [29, 16], for the sequence of (random) dictionaries generated in Algorithm 1, we propose to use to measure its convergence rate (to stationary points).To see the validity of this measure, define the directional derivative of at along , i.e., . In previous works [4, 5, 6, 7, 8, 9, 10, 11, 12], is characterized as a stationary point of (3) if and only if for any , . It can be shown that this condition is equivalent to , for any . In particular, if , we can verify that both conditions are equivalent to the condition that lies in the normal cone of at .
Equipped with the proper convergence measure, we now state our main convergence theorem.
Note that denotes the total number of inner iterations. Thus (9) states a sublinear convergence rate of towards the set of stationary points of (3). In addition, since , we can choose , and such that to satisfy the conditions in Theorem 1.
In practice, for efficiency considerations, it is important that we learn the coefficient and outlier vectors (in lines 4, 8 and 9) approximately, by only finding inexact solutions to the subproblem (2). The errors in these inexact solutions will be propagated to the variance-reduced gradient as a whole. In the next theorem, we show that if the (additive) error in is properly controlled (i.e., decreases at a certain rate), then we can achieve the same convergence rate as in Theorem 1. Note that although inexact analyses have been done for batch  and incremental  proximal gradient (PG) algorithms, our result is the first one for a stochastic (variance-reduced) PG algorithm.
Remark 4 (Bound ).
Denote an approximate solution of as . From the definition of and the compactness of and , we see that can be bounded by the learning errors in coefficient and outlier vectors, i.e., . For each , these learning errors can be further bounded by the (infimum of) norm of (sub-)gradient of at , due to the first-order optimality conditions. Therefore, we are able to bound in terms of the approximate solutions .
4.2 Proof Sketches
Proof Sketch of Lemma 1. By Assumption iii, if is not compact, then is coercive. This implies the boundedness of . A similar argument also applies to . Thus it is equivalent to minimizing over a compact set . In addition, Assumptions i and iii ensure the compactness of and respectively. Since Assumptions ii and ii ensure the uniqueness of , for any , we can invoke Danskin’s theorem (see Lemma S-1) to guarantee the differentiability of , and compute as in (7). Additionally, we can invoke the Maximum theorem (see Lemma S-2) to ensure the continuity of on . This implies the continuity of . Based on this, we can assert the Lipschitz continuity of both and on . The proceeding arguments, together with the compactness of , , and , allow us to conclude the Lipschitz continuity of . ∎
Conditioning on , we then take expectations w.r.t. on both sides of (10). Using Lemma 2, we can bound in terms of and . Then, instead of directly telescoping (10), we construct a surrogate function , where the sequence is given by the recursion , where . Using this recursive relation, together with the conditions and in Theorem 1, we manage to arrive at
Define . We now telescope (11) over to obtain
Finally, we telescope (12) over and use option I to choose the final dictionary .∎
Proof Sketch of Theorem 2. The proof of Theorem 2 is modified from that of Theorem 1. Due to the error , in (10) is replaced by . We decouple from through . We then take expectation on both sides and carefully follow the telescoping procedure (over and ) used in proving Theorem 1. After some algebraic manipulations, we arrive at
Since , for any , . Therefore by using option I to choose , we complete the proof. ∎
Proof of Corollary 1. By Theorem 1, we know that inner iterations are needed for . By evenly distributing the data samples drawn at the start of each outer iteration into inner iterations, we see that each inner iteration takes in samples. Thus the sample complexity is . Based on it, we can also derive the computational complexity to be . This is because both subproblem (2) (in lines 4, 8 and 9) and the proximal operator of (in line 7) can be solved or evaluated in arithmetic operations (see Section 3). From the condition in Theorem 1, we see that . Thus the sample and computational complexities become and respectively. To optimize both costs, it is necessary that , which amounts to . (Note that this also implies .) ∎
5 Numerical Experiments
5.1 Tested SMF Formulations and Datasets
We tested the performance of our proposed framework (Algorithm 1) on four representative SMF formulations in Section 2, including ODL, ONMF, ORPCA and ORNMF. Among them, ODL and ONMF are non-robust SMF formulations whereas ORPCA and ORNMF are robust ones, i.e., they explicitly model outliers. Therefore, for ODL and ONMF, we tested their algorithms on the CBCL  and MNIST  datasets; while for ORPCA and ORNMF, we tested their algorithms on the Synth and YaleB  datasets. The CBCL and MNIST datasets are commonly used in testing (non-robust) matrix factorization algorithms, e.g., .
The YaleB dataset consists of face images taken under various lighting conditions. The shadow and gloss in these images caused by imperfect illumination can be regarded as outliers.
The Synth dataset was generated synthetically in a similar fashion to those in [7, 8, 11]. Specifically, we first generated a matrix and a matrix , where , and . The entries of and were drawn i.i.d. from a normal distribution with mean 0.5 and variance denotes the outlier density. The remaining entries were then generated i.i.d. from a uniform distribution with support
were drawn i.i.d. from a normal distribution with mean 0.5 and variance. We then generated a outlier matrix by first uniformly randomly selecting of its entries to be zero, where
denotes the outlier density. The remaining entries were then generated i.i.d. from a uniform distribution with support. We set and . Finally .
5.2 Benchmarking Frameworks and Choice of Parameters
For each tested SMF formulation, we compared our variance-reduced SMF framework (denoted as VR) against another two optimization frameworks commonly used in previous works on SMF, namely SMM and SGD. For completeness, the pseudo-codes for these two methods are shown in Algorithms S-2 and S-3 respectively. We next describe the parameter setting in our method. From our analysis in Corollary 1, we set the mini-batch size and the number of inner iterations , where and . We set the number of outer iterations and the parameter in Theorem 2 to . For each tested SMF formulation on each dataset, we chose the step size such that our method yielded the best (empirical) convergence rate. The plots of objective values resulting from other step sizes are shown in Figure S-31. For the latent dimension , following the convention in [2, 4, 11], we set it to 49. This parameter can also be chosen from domain knowledge or a Bayesian approach, e.g., . Since in practice we found the performance of all the three frameworks (VR, SMM and SGD) was not sensitive to , we fixed it for simplicity. For both SMM and SGD, we used the same mini-batch size as in VR. The step sizes in SGD are chosen in the classical way , i.e., . Similar to VR, we chose such that SGD achieved the best empirical convergence rate. For simplicity, we initialized the dictionary such that all its entries are equal to one. Finally, following [4, 6, 7, 11], we set the regularization weight in ODL, ONMF and ORNMF to . In addition, we set