 # Dictionary Learning for Massive Matrix Factorization

Sparse matrix factorization is a popular tool for obtaining interpretable data decompositions, which are also effective for data completion or denoising. Its applicability to large datasets has been addressed with online and randomized methods that reduce the complexity in one of the matrix dimensions, but not in both. In this paper, we tackle matrices that are very large in both dimensions. We propose a new factorization method that scales gracefully to terabyte-scale datasets, which could not be processed by previous algorithms in a reasonable amount of time. We demonstrate the efficiency of our approach on massive functional Magnetic Resonance Imaging (fMRI) data, and on matrix completion problems for recommender systems, where we obtain significant speed-ups compared to state-of-the-art coordinate descent methods.

## Code Repositories

### modl

Randomized online matrix factorization


## 1 Background on Dictionary Learning

In this section, we introduce dictionary learning as a matrix factorization problem, and present stochastic algorithms that observe one column (or a minibatch) at every iteration.

### 1.1 Problem Statement

The goal of matrix factorization is to decompose a matrix $X \in \mathbb{R}^{p\times n}$ – typically $n$ signals of dimension $p$ – as a product of two smaller matrices:

$$X \approx DA \quad \text{with} \quad D \in \mathbb{R}^{p\times k},\ A \in \mathbb{R}^{k\times n}, \tag{1}$$

with potential sparsity or structure requirements on $D$ and $A$. In statistical signal applications, this is often a dictionary learning problem, enforcing sparse coefficients $\alpha_i$ (the columns of $A$). In such a case, we call $D$ the “dictionary” and $A$ the sparse codes. We use this terminology throughout the paper.

Learning the dictionary is typically performed by minimizing a quadratic data-fitting term, with constraints and/or penalties over the code and the dictionary:

$$\min_{\substack{D \in \mathcal{C} \\ A = [\alpha_1,\dots,\alpha_n] \in \mathbb{R}^{k\times n}}} \sum_{i=1}^{n} \frac{1}{2}\|x_i - D\alpha_i\|_2^2 + \lambda\,\Omega(\alpha_i), \tag{2}$$

where $\mathcal{C}$ is a convex subset of $\mathbb{R}^{p\times k}$, and $\Omega$ is a penalty over the code, to enforce structure or sparsity. In large-$n$ and large-$p$ settings, typical in recommender systems, this problem is solved via block coordinate descent, which boils down to alternating least squares if the regularizations on $D$ and $A$ are quadratic (Hastie et al., 2014).
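To illustrate the quadratic case, alternating least squares can be sketched in a few lines of NumPy; this is a minimal sketch with hypothetical names, not the solver used in the paper:

```python
import numpy as np

def als_factorize(X, k, lam=0.1, n_iter=50, seed=0):
    """Minimal alternating least squares sketch for X ~ D A with
    quadratic (ridge) regularization on both factors."""
    rng = np.random.default_rng(seed)
    p, n = X.shape
    D = rng.standard_normal((p, k))
    I = np.eye(k)
    A = np.zeros((k, n))
    for _ in range(n_iter):
        # Solve for A with D fixed: (D^T D + lam I) A = D^T X
        A = np.linalg.solve(D.T @ D + lam * I, D.T @ X)
        # Solve for D with A fixed: D (A A^T + lam I) = X A^T
        D = np.linalg.solve(A @ A.T + lam * I, A @ X.T).T
    return D, A

# A low-rank matrix is recovered up to the (small) regularization bias.
rng = np.random.default_rng(42)
X = rng.standard_normal((30, 5)) @ rng.standard_normal((5, 40))
D, A = als_factorize(X, k=5, lam=1e-3)
err = np.linalg.norm(X - D @ A) / np.linalg.norm(X)
```

Each half-step is a ridge regression in closed form, which is why quadratic regularization makes block coordinate descent particularly simple.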

#### Constraints and penalties.

The constraint set $\mathcal{C}$ is traditionally a technical constraint ensuring that the dictionary elements remain bounded, so that the coefficients $\alpha_i$ do not vanish and the effect of the penalty $\Omega$ does not disappear. However, other constraints can also be used to enforce sparsity or structure on the dictionary (see Varoquaux et al., 2013). In our paper, $\mathcal{C}$ is the Cartesian product of $\ell_1$- or $\ell_2$-norm balls:

$$\mathcal{C} = \big\{ D \in \mathbb{R}^{p\times k} \ \text{ s.t. } \ \psi(d_j) \le 1 \quad \forall j = 1,\dots,k \big\}, \tag{3}$$

where $d_j$ denotes the $j$-th column of $D$ and $\psi = \|\cdot\|_1$ or $\|\cdot\|_2$. The choice of $\psi$ and $\Omega$ typically offers some flexibility in the regularization effect that is desired for a specific problem; for instance, classical dictionary learning uses $\psi = \ell_2$ and $\Omega = \ell_1$, leading to sparse coefficients $\alpha_i$, whereas our experiments on fMRI use $\psi = \ell_1$ and a quadratic $\Omega$, leading to sparse dictionary elements $d_j$ that can be interpreted as brain activation maps.

### 1.2 Streaming Signals with Online Algorithms

In stochastic optimization, the number of signals $n$ is assumed to be large (or potentially infinite), and the dictionary can be written as a solution of

$$\min_{D\in\mathcal{C}} f(D) \quad \text{where} \quad f(D) = \mathbb{E}_x\big[\,l(x, D)\,\big], \tag{4}$$
$$l(x, D) = \min_{\alpha\in\mathbb{R}^k} \frac{1}{2}\|x - D\alpha\|_2^2 + \lambda\,\Omega(\alpha),$$

where the signals $x$ are assumed to be i.i.d. samples from an unknown probability distribution. Based on this formulation, Mairal et al. (2010) have introduced an online dictionary learning approach that draws a single signal $x_t$ (or a minibatch) at iteration $t$, and computes its sparse code $\alpha_t$ using the current dictionary $D_{t-1}$ according to

$$\alpha_t \leftarrow \operatorname*{argmin}_{\alpha\in\mathbb{R}^k} \frac{1}{2}\|x_t - D_{t-1}\alpha\|_2^2 + \lambda\,\Omega(\alpha). \tag{5}$$
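With $\Omega = \ell_1$, the sparse coding step (5) is a Lasso problem. A minimal proximal-gradient (ISTA) sketch – one possible solver, not necessarily the one used in the paper – could look like:

```python
import numpy as np

def sparse_code(x, D, lam=0.1, n_iter=200):
    """Solve min_a 0.5*||x - D a||^2 + lam*||a||_1 with ISTA,
    a plain proximal-gradient sketch."""
    a = np.zeros(D.shape[1])
    L = np.linalg.norm(D, 2) ** 2      # Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)
        z = a - grad / L               # gradient step
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return a

rng = np.random.default_rng(0)
D = rng.standard_normal((20, 10))
D /= np.linalg.norm(D, axis=0)         # unit-norm atoms (psi = l2 ball)
a_true = np.zeros(10)
a_true[[2, 7]] = [1.5, -2.0]
x = D @ a_true
a = sparse_code(x, D, lam=0.05)        # support should concentrate on {2, 7}
```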

Then, the dictionary is updated by approximately minimizing the following surrogate function

$$g_t(D) = \frac{1}{t}\sum_{i=1}^{t} \frac{1}{2}\|x_i - D\alpha_i\|_2^2 + \lambda\,\Omega(\alpha_i), \tag{6}$$

which involves the sequence of past signals $x_1,\dots,x_t$ and the sparse codes $\alpha_1,\dots,\alpha_t$ that were computed in the past iterations of the algorithm. The function $g_t$ is called a “surrogate” in the sense that it only approximates the objective $f$. In fact, it is possible to show that it converges to a locally tight upper bound of the objective, and that minimizing $g_t$ at each iteration asymptotically provides a stationary point of the original optimization problem. The underlying principle is that of majorization-minimization, used in a stochastic fashion (Mairal, 2013).

One key to obtaining efficient dictionary updates is the observation that the surrogate $g_t$ can be summarized by a few sufficient statistics that are updated at every iteration. In other words, it is possible to describe $g_t$ without explicitly storing the past signals $x_i$ and codes $\alpha_i$ for $i \le t$. Indeed, we may define two matrices $C_t \in \mathbb{R}^{k\times k}$ and $B_t \in \mathbb{R}^{p\times k}$:

$$C_t = \frac{1}{t}\sum_{i=1}^{t} \alpha_i\alpha_i^\top, \qquad B_t = \frac{1}{t}\sum_{i=1}^{t} x_i\alpha_i^\top, \tag{7}$$

and the surrogate function is then written:

$$g_t(D) = \frac{1}{2}\operatorname{Tr}\!\big(D^\top D C_t\big) - \operatorname{Tr}\!\big(D^\top B_t\big) + \frac{\lambda}{t}\sum_{i=1}^{t}\Omega(\alpha_i). \tag{8}$$

The gradient of $g_t$ with respect to $D$ can be computed as

$$\nabla_D\, g_t(D) = D C_t - B_t. \tag{9}$$

Minimization of $g_t$ is performed using block coordinate descent on the columns of $D$. In practice, the following updates are successively performed by cycling over the dictionary elements $d_j$ for $j = 1,\dots,k$:

$$d_j \leftarrow \operatorname{Proj}_{\psi(\cdot)\le 1}\Big[\, d_j - \frac{1}{C_t[j,j]}\,\nabla_{d_j} g_t(D) \Big], \tag{10}$$

where $\operatorname{Proj}$ denotes the Euclidean projection onto the norm constraint $\psi(d_j) \le 1$. It can be shown that this update corresponds to minimizing $g_t$ with respect to $d_j$ when fixing the other dictionary elements (see Mairal et al., 2010).
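Putting the pieces together, the online loop of Mairal et al. (2010) – streaming columns, updating the statistics (7), and cycling over dictionary columns as in (10) – can be sketched as follows. For brevity, the sparse coding step is replaced here by a single crude soft-thresholding step (an assumption for illustration, not the method of the paper):

```python
import numpy as np

def online_dict_learning(X, k, lam=0.1, seed=0):
    """Sketch of online dictionary learning: stream columns, maintain
    the sufficient statistics C_t and B_t, and update D by one block
    coordinate descent cycle with l2-ball projection."""
    rng = np.random.default_rng(seed)
    p, n = X.shape
    D = rng.standard_normal((p, k))
    D /= np.maximum(np.linalg.norm(D, axis=0), 1.0)
    C = np.zeros((k, k))
    B = np.zeros((p, k))
    for t, i in enumerate(rng.permutation(n), start=1):
        x = X[:, i]
        # crude sparse code: one soft-thresholded gradient step from zero
        a = D.T @ x
        a = np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)
        # sufficient statistics, eq. (7), in running-average form
        C += (np.outer(a, a) - C) / t
        B += (np.outer(x, a) - B) / t
        # dictionary update, eq. (10): cycle over the columns
        for j in range(k):
            if C[j, j] < 1e-12:
                continue
            grad_j = D @ C[:, j] - B[:, j]     # column j of eq. (9)
            dj = D[:, j] - grad_j / C[j, j]
            D[:, j] = dj / max(1.0, np.linalg.norm(dj))  # l2 projection
    return D

rng = np.random.default_rng(1)
X = rng.standard_normal((15, 200))
D = online_dict_learning(X, k=4)
```

The memory footprint is $O(pk + k^2)$: the stream itself is never stored, which is what makes the approach online.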

### 1.3 Handling Missing Values

Factorization of matrices with missing values has raised significant interest in signal processing and machine learning, especially as a solution for recommender systems. In the context of dictionary learning, a similar effort has been made by Szabó et al. (2011) to adapt the framework to missing values. Formally, a mask $M_t$, represented as a binary diagonal matrix in $\mathbb{R}^{p\times p}$, is associated with every signal $x_t$, such that the algorithm can only observe the product $M_t x_t$ at iteration $t$ instead of a full signal $x_t$. In this setting, we naturally derive the following objective:

$$\min_{D\in\mathcal{C}} f(D) \quad \text{where} \quad f(D) = \mathbb{E}_{x,M}\big[\,l(x, M, D)\,\big], \tag{11}$$

where the pairs $(x, M)$ are drawn from the (unknown) data distribution. Adapting the online algorithm of Mairal et al. (2010) would consist of drawing a sequence of pairs $(x_t, M_t)$ and building the surrogate

$$g_t(D) = \frac{1}{t}\sum_{i=1}^{t} \frac{p}{2 s_i}\big\|M_i(x_i - D\alpha_i)\big\|_2^2 + \lambda\,\Omega(\alpha_i), \tag{12}$$

where $s_i$ is the size of the mask $M_i$ and

$$\alpha_i \in \operatorname*{argmin}_{\alpha\in\mathbb{R}^k} \frac{p}{2 s_i}\big\|M_i(x_i - D_{i-1}\alpha)\big\|_2^2 + \lambda\,\Omega(\alpha). \tag{13}$$

Unfortunately, this surrogate cannot be summarized by a few sufficient statistics due to the masks $M_i$: some approximations are required. This is the approach chosen by Szabó et al. (2011). Nevertheless, the complexity of their update rules is linear in the full signal size $p$, which makes them ill-suited to the large-$p$ regime that we consider.
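For concreteness, the masked sparse coding problem (13) only involves the observed rows. Below is a hedged ISTA sketch with $\Omega = \ell_1$; the solver itself is an illustrative choice, not prescribed by the paper:

```python
import numpy as np

def masked_sparse_code(x, mask, D, lam=0.1, n_iter=200):
    """Solve eq. (13) on observed entries only: `mask` is a boolean
    vector and the p/(2 s) factor rescales the data-fitting term."""
    p = x.shape[0]
    s = int(mask.sum())
    Dm, xm = D[mask], x[mask]          # keep only the observed rows
    scale = p / (2.0 * s)
    a = np.zeros(D.shape[1])
    L = 2.0 * scale * np.linalg.norm(Dm, 2) ** 2 + 1e-12  # Lipschitz const.
    for _ in range(n_iter):
        grad = 2.0 * scale * Dm.T @ (Dm @ a - xm)
        z = a - grad / L
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return a

rng = np.random.default_rng(0)
D = rng.standard_normal((100, 8))
D /= np.linalg.norm(D, axis=0)
a_true = np.zeros(8)
a_true[3] = 2.0
x = D @ a_true
mask = rng.random(100) < 0.3           # observe roughly 30% of the entries
a = masked_sparse_code(x, mask, D, lam=0.05)
```

Note that the cost is driven by the number of observed rows $s$, not by $p$, which is the point of the masked formulation.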

## 2 Dictionary Learning for Massive Data

Using the formalism exposed above, we now consider the problem of factorizing a large matrix $X \in \mathbb{R}^{p\times n}$ into two factors $D \in \mathbb{R}^{p\times k}$ and $A \in \mathbb{R}^{k\times n}$ with the following setting: both $p$ and $n$ are large (up to several millions), whereas $k$ remains moderate, which is not the standard dictionary-learning setting; some entries of $X$ may be missing. Our objective is to recover a good dictionary $D$, taking into account appropriate regularization.

To achieve our goal, we propose to use an objective akin to (11), where the masks are now random variables independent from the samples. In other words, we want to combine ideas of online dictionary learning with random subsampling, in a principled manner. This leads us to consider an infinite stream of samples $(x_t, M_t)$, where the signals $x_t$ are i.i.d. samples from the data distribution – that is, columns of $X$ selected at random – and $M_t$ “selects” a random subset of the observed entries in $x_t$. This setting can accommodate missing entries, never selected by the masks, and only requires loading a subset of $X$ at each iteration.

The main justification for choosing this objective function is that in the large-sample regime that we consider, computing the code $\alpha_t$ using only a random subset $M_t x_t$ of the data according to (13) is a good approximation of the code that may be computed with the full vector $x_t$ in (5). This of course requires choosing a mask that is large enough; on the fMRI dataset, a subsampling factor of about 10 – that is, only about one tenth of the entries of $x_t$ are observed – achieved the same accuracy as the original approach without subsampling, with a corresponding speed-up (see experimental section). This point of view also justifies the natural scaling factor introduced in (11).

An efficient algorithm must address two challenges: (i) performing dictionary updates that do not depend on the signal size $p$ but only on the mask size; (ii) finding an approximate surrogate function that can be summarized by a few sufficient statistics. We provide a solution to these two issues in the next subsections, and present the method in Algorithm 1.

### 2.1 Approximate Surrogate Function

To approximate the surrogate (8) from the codes $\alpha_t$ computed in (13), we consider the function $h_t$ defined by

$$h_t(D) = \frac{1}{2}\operatorname{Tr}\!\big(D^\top D C_t\big) - \operatorname{Tr}\!\big(D^\top B_t\big) + \frac{\lambda}{t}\sum_{i=1}^{t}\frac{s_i}{p}\,\Omega(\alpha_i), \tag{14}$$

with the same matrix $C_t$ as in (8), which is updated as

$$C_t \leftarrow \Big(1 - \frac{1}{t}\Big)\, C_{t-1} + \frac{1}{t}\,\alpha_t\alpha_t^\top, \tag{15}$$

and to replace $B_t$ in (8) by the matrix

$$B_t = \Big(\sum_{i=1}^{t} M_i\Big)^{-1}\sum_{i=1}^{t} M_i x_i\alpha_i^\top, \tag{16}$$

which is the same as (7) when $M_i = I$. Since each $M_i$ is a diagonal matrix, $\sum_{i\le t} M_i$ is also diagonal and simply “counts” how many times each row has been seen by the algorithm. $B_t$ thus behaves like the matrix of (7) for large $t$, as in the fully-observed algorithm. By design, only the rows of $B_t$ selected by the mask differ from those of $B_{t-1}$. The update can therefore be achieved in a number of operations that depends on the mask size $s_t$ rather than on $p$:

$$M_t B_t \leftarrow M_t B_{t-1} + M_t \Big(\sum_{i=1}^{t} M_i\Big)^{-1}\big(x_t\alpha_t^\top - B_{t-1}\big). \tag{17}$$

This only requires keeping in memory the diagonal matrix $\sum_{i\le t} M_i$ and updating the rows of $B_t$ selected by the mask. All operations depend only on the mask size $s_t$ instead of the signal size $p$.
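The masked update of $B_t$ can be checked against the batch definition (16) on a toy example. The sketch below keeps the diagonal counter as a vector; names are illustrative, not those of the released implementation:

```python
import numpy as np

def update_B(B, count, x, a, mask):
    """Masked update of B_t, eqs. (16)-(17): only the rows selected by
    the mask are touched; `count` holds the diagonal of sum_i M_i."""
    count[mask] += 1.0
    # B_t = B_{t-1} + E_t^{-1} M_t (x_t a_t^T - B_{t-1}) on masked rows
    B[mask] += (np.outer(x[mask], a) - B[mask]) / count[mask, None]
    return B, count

# Cross-check the streaming update against the batch formula (16).
rng = np.random.default_rng(0)
p, k, T = 12, 3, 50
B = np.zeros((p, k)); count = np.zeros(p)
num = np.zeros((p, k)); den = np.zeros(p)
for _ in range(T):
    x = rng.standard_normal(p)
    a = rng.standard_normal(k)
    mask = rng.random(p) < 0.5
    B, count = update_B(B, count, x, a, mask)
    num[mask] += np.outer(x[mask], a)   # sum_i M_i x_i a_i^T
    den[mask] += 1.0                    # sum_i M_i
ref = num / np.maximum(den, 1.0)[:, None]
```

By induction, each masked row of `B` equals its cumulative sum divided by its counter, which is exactly the batch definition (16).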

### 2.2 Efficient Dictionary Update Rules

With a surrogate function in hand, we now describe how to update the codes $\alpha_t$ and the dictionary $D_t$ when only partial access to the data is possible. The complexity of computing the sparse codes $\alpha_t$ is obviously independent of $p$, since (13) consists in solving a reduced penalized linear regression of $M_t x_t$ on $M_t D_{t-1}$. Thus, we focus here on the dictionary update rules.

The naive dictionary update (10) has a complexity proportional to $p$, due to the matrix-vector multiplication required for computing the gradient. Reducing the single-iteration complexity accordingly requires reducing the dimensionality of the dictionary update phase. We propose two strategies to achieve this, both using block coordinate descent, by considering

$$d_j \leftarrow \operatorname{Proj}_{\psi(\cdot)\le 1}\Big[\, d_j - \frac{1}{C_t[j,j]}\, M_t\nabla_{d_j} h_t(D) \Big], \tag{18}$$

where $M_t\nabla_{d_j} h_t(D)$ is the partial derivative of $h_t$ with respect to the $j$-th column $d_j$, restricted to the rows selected by the mask.

The update (18) represents a classical block coordinate descent step involving particular blocks. Following Mairal et al. (2010), we perform one cycle over the columns, warm-started on $D_{t-1}$. Formally, the gradient step without projection for the $j$-th component consists of updating the vector $d_j$:

$$d_j \leftarrow d_j - \frac{1}{C_t[j,j]}\, M_t\nabla_{d_j} h_t(D) = d_j - \frac{1}{C_t[j,j]}\big(M_t D c_t^j - M_t b_t^j\big), \tag{19}$$

where $c_t^j$ and $b_t^j$ are the $j$-th columns of $C_t$ and $B_t$, respectively. The update has low complexity since it only involves the $s_t$ rows of $D$ selected by the mask, and only $s_t$ entries of $d_j$ have changed.
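One masked block coordinate descent cycle, eq. (19), can be sketched as follows, leaving out the $\psi$-ball projection for clarity (an illustrative sketch, not Alg. 1 itself):

```python
import numpy as np

def masked_bcd_step(D, C, B, mask):
    """One cycle over dictionary columns, eq. (19): each column d_j
    moves only on the rows selected by the mask."""
    Dm = D[mask].copy()                    # s_t x k masked view
    Bm = B[mask]
    for j in range(D.shape[1]):
        if C[j, j] < 1e-12:
            continue
        grad_j = Dm @ C[:, j] - Bm[:, j]   # M_t (D c_t^j - b_t^j)
        Dm[:, j] -= grad_j / C[j, j]       # exact minimization in d_j
    D = D.copy()
    D[mask] = Dm
    return D

rng = np.random.default_rng(0)
p, k = 20, 4
D0 = rng.standard_normal((p, k))
G = rng.standard_normal((k, k)); C = G @ G.T + 0.1 * np.eye(k)
B = rng.standard_normal((p, k))
mask = rng.random(p) < 0.5

def h(D):  # masked quadratic part of the surrogate, cf. eq. (14)
    Dm = D[mask]
    return 0.5 * np.trace(Dm.T @ Dm @ C) - np.trace(Dm.T @ B[mask])

D1 = masked_bcd_step(D0, C, B, mask)
```

Each column step is an exact minimization of the quadratic surrogate in $d_j$ (the Hessian block is $C_t[j,j]\,I$), so the masked objective decreases monotonically and the unmasked rows are untouched.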

#### Projection step.

Block coordinate descent algorithms require orthogonal projections onto the constraint set $\mathcal{C}$. In our case, this amounts to a projection onto the unit ball of the norm $\psi$ in (18). The complexity of such a projection is usually linear in $p$, both for the $\ell_1$- and $\ell_2$-norms (see Duchi et al., 2008). We consider here two strategies to reduce this cost.
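For reference, the plain (non-lazy) Euclidean projection onto the $\ell_1$-ball can be written with the classical sort-based algorithm; this $O(p \log p)$ variant is simpler than the linear-time and lazy versions discussed by Duchi et al. (2008):

```python
import numpy as np

def project_l1_ball(v, radius=1.0):
    """Euclidean projection of v onto {u : ||u||_1 <= radius},
    using the standard sort-and-threshold algorithm."""
    if np.abs(v).sum() <= radius:
        return v.copy()                      # already feasible
    u = np.sort(np.abs(v))[::-1]             # sorted magnitudes, descending
    css = np.cumsum(u)
    # largest index rho such that u[rho] > (css[rho] - radius) / (rho + 1)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - radius)[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

w = project_l1_ball(np.array([0.5, -2.0, 1.0]))   # lands on the ball boundary
```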

##### Exact lazy projection for ℓ2.

When $\psi = \ell_2$, it is possible to perform the projection implicitly, with a complexity independent of $p$. The computational trick is to notice that the projection amounts to a simple rescaling operation

$$d_j \leftarrow \frac{d_j}{\max\big(1, \|d_j\|_2\big)}, \tag{20}$$

which may have low complexity if the dictionary elements $d_j$ are stored in memory as a product $d_j = f_j / \max(1, l_j)$, where $f_j$ is in $\mathbb{R}^p$ and $l_j$ is a rescaling coefficient such that $l_j = \|f_j\|_2$. We write the gradient step (19) followed by the $\ell_2$-ball projection as the updates

$$\begin{aligned} n_j &\leftarrow \|M_t f_j\|_2^2 \\ f_j &\leftarrow f_j - \frac{\max(1, l_j)}{C_t[j,j]}\big(M_t D c_t^j - M_t b_t^j\big) \\ l_j &\leftarrow \sqrt{l_j^2 - n_j + \|M_t f_j\|_2^2} \end{aligned} \tag{21}$$

Note that the update of $f_j$ corresponds to the gradient step without projection (19), which only touches the rows selected by the mask, whereas the norm $l_j$ is refreshed from the $s_t$ masked entries alone. The computational complexity is thus independent of $p$, and the only price to pay is rescaling the dictionary elements on the fly, each time we need access to them.
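The bookkeeping of eqs. (20)–(21) can be sketched with a small class storing $f_j$ and maintaining $l_j = \|f_j\|_2$ incrementally (an illustrative sketch, not the released implementation):

```python
import numpy as np

class LazyL2Column:
    """Dictionary column stored as d_j = f_j / max(1, l_j), with
    l_j = ||f_j||_2 maintained incrementally, cf. eqs. (20)-(21)."""

    def __init__(self, f):
        self.f = f.astype(float)
        self.l = np.linalg.norm(self.f)

    def masked_gradient_step(self, mask, grad_masked, step):
        # n_j <- ||M_t f_j||^2 before the step, eq. (21)
        n_old = np.dot(self.f[mask], self.f[mask])
        # gradient step on the masked entries only
        self.f[mask] -= step * max(1.0, self.l) * grad_masked
        n_new = np.dot(self.f[mask], self.f[mask])
        # refresh the norm from the s_t masked entries instead of all p
        self.l = np.sqrt(max(self.l ** 2 - n_old + n_new, 0.0))

    @property
    def d(self):
        # implicit projection, eq. (20), applied on access
        return self.f / max(1.0, self.l)

rng = np.random.default_rng(0)
col = LazyL2Column(rng.standard_normal(1000))
mask = rng.random(1000) < 0.05                 # touch ~5% of the entries
col.masked_gradient_step(mask, rng.standard_normal(int(mask.sum())), 0.1)
```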

##### Exact lazy projection for ℓ1.

The case $\psi = \ell_1$ is slightly different but can be handled in a similar manner, by storing an additional scalar $l_j$ for each dictionary element $d_j$. More precisely, we store a vector $f_j$ in $\mathbb{R}^p$ such that $d_j = S_{l_j}[f_j]$; a classical result (see Duchi et al., 2008) states that there exists a scalar $l_j \ge 0$ such that

$$d_j = S_{l_j}[f_j], \qquad S_\lambda(u) = \operatorname{sign}(u)\cdot\max\big(|u| - \lambda,\, 0\big), \tag{22}$$

where $S_\lambda$ is the soft-thresholding operator, applied elementwise to the entries of $f_j$. Similar to the case $\psi = \ell_2$, the “lazy” projection consists of tracking the coefficient $l_j$ for each dictionary element and updating it after each gradient step, which only involves the $s_t$ masked coefficients. For such sparse updates followed by a projection onto the $\ell_1$-ball, Duchi et al. (2008) proposed an algorithm to find the threshold $l_j$ without visiting all $p$ coefficients. The lazy algorithm involves particular data structures such as red-black trees and is not easy to implement; this motivated us to investigate another simple heuristic that also performs well in practice.

##### Approximate low-dimension projection.

The heuristic consists in performing the projection while forcing the coefficients outside the mask not to change. This results in the orthogonal projection of each $d_j$ onto a subset of the original constraint set $\mathcal{C}$, in which the entries outside the mask are held fixed.
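Assuming an $\ell_1$ constraint, freezing the entries outside the mask reduces the step to an $\ell_1$-ball projection of the masked entries onto the remaining radius, since the $\ell_1$-norm separates across entries. A hedged sketch:

```python
import numpy as np

def _project_l1(v, radius):
    """Sort-based projection onto the l1-ball of the given radius."""
    if radius <= 0:
        return np.zeros_like(v)
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - radius)[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def approx_masked_projection(d, mask):
    """Heuristic projection: entries outside the mask are frozen; the
    masked entries are projected onto the l1-ball of the leftover radius."""
    out = d.copy()
    budget = 1.0 - np.abs(d[~mask]).sum()   # radius left for masked entries
    out[mask] = _project_l1(d[mask], budget)
    return out

d = np.array([0.4, -0.4, 0.5, 0.3])
mask = np.array([True, True, False, False])  # only the first two entries move
d_proj = approx_masked_projection(d, mask)
```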

All the computations require keeping only four matrices in memory – $C_t$, $B_t$, $D_t$ and the diagonal counter $\sum_{i\le t} M_i$ – with the additional vectors $(f_j)$ and scalars $(l_j)$ for the exact projection case, as summarized in Alg. 1.

### 2.3 Discussion

Figure 1: Acceleration of sparse matrix factorization with random subsampling on the HCP dataset (2 TB). Reducing streamed data with stochastic masks permits 10× speed-ups without deteriorating goodness of fit on test data or altering the sparsity of the final dictionary.

#### Relation to classical matrix completion formulation.

Our model is related to the classical $\ell_2$-penalized matrix completion model (e.g. Bell & Koren, 2007), which we rewrite

$$\sum_{i=1}^{n} \big\|M_i(x_i - D\alpha_i)\big\|_2^2 + \lambda\, s_i\|\alpha_i\|_2^2 + \lambda\,\Big\|\Big(\sum_{i=1}^{n} M_i\Big)^{\!\frac{1}{2}} D\Big\|_2^2. \tag{23}$$

With quadratic regularization on $D$ and $A$ – that is, using $\psi = \ell_2$ and a quadratic $\Omega$ – (11) differs from (23) only in the way it handles the dictionary: it constrains $D$ instead of penalizing it. Srebro et al. (2004) introduced the trace-norm regularization to solve a convex problem equivalent to (23). The major difference is that we adopt a non-convex optimization strategy, thus losing the benefits of convexity, but gaining on the other hand the possibility of using stochastic optimization.

#### Practical considerations.

Our algorithm can be slightly modified to use weights that differ from $\frac{1}{t}$ for $C_t$ and $B_t$, as advocated by Mairal (2013). It also proves beneficial to perform the code computation on mini-batches of masked samples. The dictionary update is then performed on the rows that are seen at least once in the masks of the mini-batch.

## 3 Experiments

The proposed algorithm was designed to handle massive datasets: masking data enables streaming a sequence $(M_t x_t)$ instead of $(x_t)$, reducing single-iteration computational complexity and IO stress by a factor $r$, while still accessing an accurate description of the data. Hence, we analyze in detail how our algorithm improves performance for sparse decomposition of fMRI datasets. Moreover, as it relies on data masks, our algorithm is well suited for matrix completion, reconstructing a data stream $(x_t)$ from the masked stream $(M_t x_t)$. We demonstrate the accuracy of our algorithm on explicit recommender systems, and show considerable computational speed-ups compared to an efficient coordinate-descent based algorithm.

We use scikit-learn (Pedregosa et al., 2011) in experiments, and have released a Python package for reproducibility.

### 3.1 Sparse Matrix Factorization for fMRI

#### Context.

Matrix factorization has long been used on functional Magnetic Resonance Imaging (McKeown et al., 1998). Data are temporal series of 3D images of brain activity, to be decomposed into spatial modes capturing regions that activate together. The matrices to decompose are dense and heavily redundant, both spatially and temporally: nearby voxels and successive records are correlated. Data can be huge: we use the whole HCP dataset (Van Essen et al., 2013), with 2,000 records of 1,200 time points each, totaling 2 TB of dense data.

Interesting dictionaries for neuroimaging capture spatially-localized components, covering a few brain regions each. This can be obtained by enforcing sparsity on the dictionary: in our formalism, this is achieved with the $\ell_1$-ball projection for $D$, i.e. $\psi = \ell_1$. Historically, such decompositions have been obtained with the classical dictionary learning objective on transposed data (Varoquaux et al., 2013): the code $A$ holds sparse spatial maps and voxel time-series are streamed. However, given the dimensions of our dataset, this method is not usable in practice.

Figure 2: Brain atlases: outlines of each map at half the maximum value (λ = 10−4). Left: the reference algorithm on the full dataset. Middle: the reference algorithm on a twentieth of the dataset. Right: the proposed algorithm with a similar run time: half the dataset and r = 9. Compared to a full run of the baseline algorithm, the figure explores two possible strategies to decrease computation time: processing less data (middle), or our approach (right). Our approach achieves a result closer to the gold standard within a given time budget.

Handling such a volume of data sets new constraints. First, efficient disk access becomes critical for speed. In our case, learning the dictionary is done by accessing the data in row batches, which is coherent with fMRI data storage: no time is lost seeking data on disk. Second, reducing the IO load on the storage is also crucial, as it lifts bottlenecks that appear when many processes access the same storage at the same time, e.g. during cross-validation within a supervised pipeline. Our approach reduces disk usage by a factor $r$. Finally, parallel methods based on message passing, such as asynchronous coordinate descent, are unlikely to be efficient given the network / disk bandwidth that each process requires to load the data. This makes it crucial to design efficient sequential algorithms.

#### Experiment

We quantify the effect of random subsampling on sparse matrix factorization, in terms of speed and accuracy. A natural performance evaluation is to measure an empirical estimate of the loss $f$ defined in Eq. 4 on unseen data, to rule out any overfitting effect. For this, we evaluate the loss on a held-out test set. Practically, we sample $X$ in a pseudo-random manner: we randomly select a record, from which we select a random batch of rows, with a batch size empirically found to be efficient. We load the batch in memory and perform an iteration of the algorithm. The mask sequence is sampled by breaking random permutation vectors into chunks of the desired mask size.

Figure 3: Evolution of the objective function with epochs for three reduction factors. Learning speed per epoch is little reduced by stochastic subsampling, despite the speed-up factor it provides.
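The mask sampling scheme described above – breaking random permutations into chunks – can be sketched as follows (names are illustrative):

```python
import numpy as np

def mask_sequence(p, r, rng):
    """Yield boolean masks: each random permutation of the p rows is
    broken into r chunks, so every row is seen once per cycle."""
    while True:
        perm = rng.permutation(p)
        for chunk in np.array_split(perm, r):
            mask = np.zeros(p, dtype=bool)
            mask[chunk] = True
            yield mask

rng = np.random.default_rng(0)
gen = mask_sequence(p=10, r=3, rng=rng)
masks = [next(gen) for _ in range(3)]      # one full cycle
covered = np.logical_or.reduce(masks)
```

Sampling masks from permutations rather than independently per entry guarantees that every row of the data is visited within each cycle.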

#### Results

Fig. 1(a) compares our algorithm, for several subsampling ratios $r$, to the vanilla online dictionary learning algorithm ($r = 1$), plotting trajectories of the test objective against real CPU time. There is no obvious choice of $\lambda$ due to the unsupervised nature of the problem: we use a high and a low value, bounding the range of $\lambda$ that provides interpretable dictionaries.

First, we observe the convergence of the objective function for all tested $r$, providing evidence that the approximations made in the derivation of the update rules do not break convergence. Fig. 1(b) shows the validity of the obtained dictionary relative to the reference output: both the objective function and the sparsity ratio – the relevant value to measure sparsity in our setting – are comparable to the baseline values for moderate $r$. For high regularization and large $r$, our algorithm tends to yield somewhat sparser solutions than the original algorithm, due to the approximate $\ell_1$-projection we perform. The obtained maps still prove as interpretable as those of the baseline algorithm.

Our algorithm proves much faster than the original one in finding a good dictionary. Single-iteration time is indeed reduced by a factor close to $r$, which enables our algorithm to go over a single epoch roughly $r$ times faster than the vanilla algorithm and to capture the variability of the dataset earlier. To quantify speed-ups, we plot the empirical objective value against the number of observed records in Fig. 3. For moderate $r$, increasing $r$ hardly reduces convergence speed per epoch: random subsampling does not shrink much the quantity of information learned at each iteration.

This brings a near $\times r$ speed-up: for both high and low regularization, our algorithm converges within hours with subsampling, whereas the vanilla online algorithm requires several times longer. Qualitatively, Fig. 2 shows that with the same time budget, the proposed reduction approach with $r = 9$ on half of the data gives better results than processing a small fraction of the data without reduction: segmented regions are less noisy and closer to those obtained by processing the full data.

These results advocate the use of a substantial subsampling rate in this setting. When sparse matrix decomposition is part of a supervised pipeline with scoring capabilities, it is possible to find $r$ efficiently: start by setting it unreasonably high and decrease it geometrically until supervised performance (e.g. in classification) ceases to improve.

### 3.2 Collaborative Filtering with Missing Data

Figure 4: Learning speed for collaborative filtering on datasets of different sizes: the larger the dataset, the greater our speed-up.

We validate the performance of the proposed algorithm on recommender systems for explicit feedback, a well-studied matrix completion problem. We evaluate the scalability of our method on datasets of different sizes: MovieLens 1M, MovieLens 10M, and the 140M-ratings Netflix dataset.

We compare our algorithm to a coordinate-descent based method (Yu et al., 2012) that provides state-of-the-art convergence time on our largest dataset. Although stochastic gradient descent methods for matrix factorization can provide slightly better single-run performance (Takács et al., 2009), these are notoriously hard to tune and require a precise grid search to uncover a working schedule of learning rates. In contrast, coordinate descent methods do not require any hyper-parameter setting and are therefore more efficient in practice. We benchmarked various recommender-system codes (MyMediaLite, LibFM, SoftImpute, …) and chose the coordinate descent algorithm from spira, as it was by far the fastest.

#### Completion from dictionary Dt.

We stream user ratings to our algorithm: $p$ is the number of movies and $n$ the number of users. As $n \gg p$ on the Netflix dataset, this increases the benefit of using an online method. We have observed comparable prediction performance when streaming item ratings instead. Past the first epoch, every column of $X$ can be predicted from the last code that was computed for this column. Prediction thus only requires an additional matrix product after the factorization.
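Prediction from the factorization then amounts to one matrix product, plus the bias terms discussed below; a hypothetical helper (names and signature are illustrative):

```python
import numpy as np

def predict_ratings(D, A, item_mean=None, user_mean=None):
    """Reconstruct the full rating matrix from the factorization,
    optionally adding back per-item and per-user bias terms."""
    X_hat = D @ A
    if item_mean is not None:
        X_hat += item_mean[:, None]    # per-movie (row) bias
    if user_mean is not None:
        X_hat += user_mean[None, :]    # per-user (column) bias
    return X_hat

rng = np.random.default_rng(0)
D = rng.standard_normal((6, 3))        # p movies x k components
A = rng.standard_normal((3, 4))        # k components x n users
X_hat = predict_ratings(D, A)
```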

#### Preprocessing.

Successful prediction should take into account user and item biases. We compute these biases on the train data following Hastie et al. (2014) (alternated debiasing). We use them to center the samples that are streamed to the algorithm, and to perform the final prediction.

#### Tools and experiments.

Both the baseline and the proposed algorithm are implemented in a computationally optimal way, enabling a fair comparison based on CPU time. Benchmarks were run using a single 2.7 GHz Xeon CPU, with a 30-component dictionary. For the MovieLens datasets, we use a random fraction of the data for testing and the rest for training, and average results over five train/test splits (Table 1). On Netflix, the probe dataset is used for testing. The regularization parameter $\lambda$ is set by cross-validation on the training set: the training data is split 3 times, holding out a fraction of the MovieLens datasets (and a smaller fraction of Netflix) for evaluation, and a grid search is performed over 15 values of $\lambda$. We assess the quality of the obtained decomposition by measuring the root mean squared error (RMSE) between predictions on the test set and ground truth. We use mini-batches of masked samples.

#### Results.

We report the evolution of the test RMSE over time in Fig. 4, along with its value at convergence and the numerical convergence time in Table 1. Benchmarks are performed on the final run, after selection of the parameter $\lambda$.

The two variants of the proposed method converge toward a solution that is at least as good as that of coordinate descent, and slightly better on MovieLens 10M and Netflix. Our algorithm brings a substantial performance improvement on medium and large scale datasets. On Netflix, convergence is almost reached within minutes (score within a small deviation from the final RMSE), which makes our method several times faster than coordinate descent. Moreover, the relative performance of our algorithm increases with dataset size. Indeed, as datasets grow, fewer epochs are needed for our algorithm to reach convergence (Fig. 4). This is a significant advantage over coordinate descent, which requires a stable number of cycles over the coordinates to reach convergence, regardless of dataset size. The variant with partial projection performs slightly better, which can be explained by the extra regularization on $D$ brought by this heuristic.

#### Learning weights.

Unlike SGD, and similar to the vanilla online dictionary learning algorithm, our method does not critically suffer from hyper-parameter tuning. We tried weights decaying as $1/t^\beta$, as described in Sec. 2.3, and observed that a wide range of $\beta$ yields fast convergence. Theoretically, Mairal (2013) shows that stochastic majorization-minimization converges for a range of $\beta$ close to 1. We verify this empirically, and obtain optimal convergence speed for $\beta \in [.85, .95]$ (Fig. 5). We report results for the best such $\beta$.

Figure 5: Learning weights: on two different datasets, optimal convergence is obtained for β ∈ [.85, .95], as predicted by theory.

## 4 Conclusion

Whether it is sensor data, as in fMRI, or e-commerce databases, sample sizes and numbers of features are rapidly growing, rendering current matrix factorization approaches intractable. We have introduced an online algorithm that leverages random feature subsampling, giving up to 8-fold speed and memory gains on large data. Datasets are getting bigger, and they often come with more redundancies. Such approaches, blending online and randomized methods, will yield even larger speed-ups on next-generation data.

## Acknowledgements

The research leading to these results was supported by the ANR (MACARON project, ANR-14-CE23-0003-01; NiConnect project, ANR-11-BINF-0004) and has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 604102 (HBP).

## References

• Aharon et al. (2006) Aharon, Michal, Elad, Michael, and Bruckstein, Alfred. k-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11):4311–4322, 2006.
• Bell & Koren (2007) Bell, Robert M. and Koren, Yehuda. Lessons from the Netflix prize challenge. ACM SIGKDD Explorations Newsletter, 9(2):75–79, 2007.
• Bingham & Mannila (2001) Bingham, Ella and Mannila, Heikki. Random projection in dimensionality reduction: applications to image and text data. In Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 245–250. ACM, 2001.
• Blondel et al. (2015) Blondel, Mathieu, Fujino, Akinori, and Ueda, Naonori. Convex factorization machines. In Machine Learning and Knowledge Discovery in Databases, pp. 19–35. Springer, 2015.
• Bottou (2010) Bottou, Léon. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT, pp. 177–186. Springer, 2010.
• Candès & Recht (2009) Candès, Emmanuel J. and Recht, Benjamin. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.
• Candès & Tao (2006) Candès, Emmanuel J. and Tao, Terence. Near-optimal signal recovery from random projections: Universal encoding strategies? Information Theory, IEEE Transactions on, 52(12):5406–5425, 2006.
• Duchi et al. (2008) Duchi, John, Shalev-Shwartz, Shai, Singer, Yoram, and Chandra, Tushar. Efficient projections onto the l 1-ball for learning in high dimensions. In Proceedings of the International Conference on Machine Learning, pp. 272–279. ACM, 2008.
• Halko et al. (2009) Halko, Nathan, Martinsson, Per-Gunnar, and Tropp, Joel A. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. arXiv:0909.4061 [math], 2009.
• Hastie et al. (2014) Hastie, Trevor, Mazumder, Rahul, Lee, Jason, and Zadeh, Reza. Matrix completion and low-rank SVD via fast alternating least squares. arXiv:1410.2596 [stat], 2014.
• Johnson & Lindenstrauss (1984) Johnson, William B. and Lindenstrauss, Joram. Extensions of Lipschitz mappings into a Hilbert space. Contemporary mathematics, 26(189-206):1, 1984.
• Mairal (2013) Mairal, Julien. Stochastic majorization-minimization algorithms for large-scale optimization. In Advances in Neural Information Processing Systems, pp. 2283–2291, 2013.
• Mairal (2014) Mairal, Julien. Sparse Modeling for Image and Vision Processing. Foundations and Trends in Computer Graphics and Vision, 8(2-3):85–283, 2014.
• Mairal et al. (2010) Mairal, Julien, Bach, Francis, Ponce, Jean, and Sapiro, Guillermo. Online learning for matrix factorization and sparse coding. The Journal of Machine Learning Research, 11:19–60, 2010.
• McKeown et al. (1998) McKeown, M. J., Makeig, S., Brown, G. G., Jung, T. P., Kindermann, S. S., Bell, A. J., and Sejnowski, T. J. Analysis of fMRI Data by Blind Separation into Independent Spatial Components. Human Brain Mapping, 6(3):160–188, 1998. ISSN 1065-9471.
• Pedregosa et al. (2011) Pedregosa, Fabian, Varoquaux, Gaël, Gramfort, Alexandre, Michel, Vincent, Thirion, Bertrand, Grisel, Olivier, Blondel, Mathieu, Prettenhofer, Peter, Weiss, Ron, Dubourg, Vincent, Vanderplas, Jake, Passos, Alexandre, Cournapeau, David, Brucher, Matthieu, Perrot, Matthieu, and Duchesnay, Édouard. Scikit-learn: machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
• Pourkamali-Anaraki et al. (2015) Pourkamali-Anaraki, Farhad, Becker, Stephen, and Hughes, Shannon M. Efficient dictionary learning via very sparse random projections. In Proceedings of the IEEE International Conference on Sampling Theory and Applications, pp. 478–482. IEEE, 2015.
• Rendle (2010) Rendle, Steffen. Factorization machines. In Proceedings of the IEEE International Conference on Data Mining, pp. 995–1000. IEEE, 2010.
• Rendle & Schmidt-Thieme (2008) Rendle, Steffen and Schmidt-Thieme, Lars. Online-updating regularized kernel matrix factorization models for large-scale recommender systems. In Proceedings of the ACM Conference on Recommender systems, pp. 251–258. ACM, 2008.
• Srebro et al. (2004) Srebro, Nathan, Rennie, Jason, and Jaakkola, Tommi S. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems, pp. 1329–1336, 2004.
• Szabó et al. (2011) Szabó, Zoltán, Póczos, Barnabás, and Lorincz, András. Online group-structured dictionary learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2865–2872. IEEE, 2011.
• Takács et al. (2009) Takács, Gábor, Pilászy, István, Németh, Bottyán, and Tikk, Domonkos. Scalable collaborative filtering approaches for large recommender systems. The Journal of Machine Learning Research, 10:623–656, 2009.
• Van Essen et al. (2013) Van Essen, David C., Smith, Stephen M., Barch, Deanna M., Behrens, Timothy E. J., Yacoub, Essa, and Ugurbil, Kamil. The WU-Minn Human Connectome Project: An overview. NeuroImage, 80:62–79, 2013.
• Varoquaux et al. (2011) Varoquaux, Gaël, Gramfort, Alexandre, Pedregosa, Fabian, Michel, Vincent, and Thirion, Bertrand. Multi-subject dictionary learning to segment an atlas of brain spontaneous activity. In Proceedings of the Information Processing in Medical Imaging Conference, volume 22, pp. 562–573. Springer, 2011.
• Varoquaux et al. (2013) Varoquaux, Gaël, Schwartz, Yannick, Pinel, Philippe, and Thirion, Bertrand. Cohort-level brain mapping: learning cognitive atoms to single out specialized regions. In Proceedings of the Information Processing in Medical Imaging Conference, pp. 438–449. Springer, 2013.
• Yu et al. (2012) Yu, Hsiang-Fu, Hsieh, Cho-Jui, and Dhillon, Inderjit. Scalable coordinate descent approaches to parallel matrix factorization for recommender systems. In Proceedings of the International Conference on Data Mining, pp. 765–774. IEEE, 2012.