# Randomized Nonnegative Matrix Factorization

Nonnegative matrix factorization (NMF) is a powerful tool for data mining. However, the emergence of `big data' has severely challenged our ability to compute this fundamental decomposition using deterministic algorithms. This paper presents a randomized hierarchical alternating least squares (HALS) algorithm to compute the NMF. By deriving a smaller matrix from the nonnegative input data, a more efficient nonnegative decomposition can be computed. Our algorithm scales to big data applications while attaining a near-optimal factorization, i.e., the algorithm scales with the target rank of the data rather than the ambient dimension of measurement space. The proposed algorithm is evaluated using synthetic and real world data and shows substantial speedups compared to deterministic HALS.

## Authors

• 16 publications
• 1 publication
• 1 publication
• 48 publications
05/17/2018

### Accelerating Nonnegative Matrix Factorization Algorithms using Extrapolation

In this paper, we propose a general framework to accelerate significantl...
09/23/2021

### Memory-Efficient Convex Optimization for Self-Dictionary Separable Nonnegative Matrix Factorization: A Frank-Wolfe Approach

Nonnegative matrix factorization (NMF) often relies on the separability ...
07/12/2020

### An Alternating Rank-K Nonnegative Least Squares Framework (ARkNLS) for Nonnegative Matrix Factorization

Nonnegative matrix factorization (NMF) is a prominent technique for data...
08/07/2020

### Nyström Approximation with Nonnegative Matrix Factorization

Motivated by the needs of estimating the proximity clustering with parti...
07/28/2017

### Sparse Deep Nonnegative Matrix Factorization

Nonnegative matrix factorization is a powerful technique to realize dime...
01/02/2020

### On Large-Scale Dynamic Topic Modeling with Nonnegative CP Tensor Decomposition

There is currently an unprecedented demand for large-scale temporal data...
02/10/2021

### Forecasting Nonnegative Time Series via Sliding Mask Method (SMM) and Latent Clustered Forecast (LCF)

We consider nonnegative time series forecasting framework. Based on rece...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Techniques for dimensionality reduction, such as principal component analysis (PCA), are essential to the analysis of high-dimensional data. These methods take advantage of redundancies in the data in order to find a low-rank and parsimonious model describing the underlying structure of the input data. Indeed, at the core of machine learning is the assumption that low-rank structures are embedded in high-dimensional data

(Udell and Townsend, 2017)

. Dimension reduction techniques find basis vectors which represent the data as a linear combination in lower-dimensional space. This enables the identification of key features and efficient analysis of high-dimensional data.

A significant drawback of PCA and other commonly-used dimensionality reduction techniques is that they permit both positive and negative terms in their components. In many data analysis applications, negative terms fail to hold physically meaningful interpretation. For example, images are represented as a grid of nonnegative pixel intensity values. In this context, the negative terms in principal components have no interpretation.

To address this problem, researchers have proposed restricting the set of basis vectors to nonnegative terms (Paatero and Tapper, 1994; Lee and Seung, 1999). The paradigm is called nonnegative matrix factorization (NMF) and it has emerged as a powerful dimension reduction technique. This versatile tool allows computation of sparse (parts-based) and physically meaningful factors that describe coherent structures within the data. Prominent applications of NMF are in the areas of image processing, information retrieval and gene expression analysis, see for instance the surveys by Berry et al. (2007) and Gillis (2014). However, NMF is computationally intensive and becomes infeasible for massive data. Hence, innovations that reduce computational demands are increasingly important in the era of ‘big data’.

Randomized methods for linear algebra have been recently introduced to ease the computational demands posed by classical matrix factorizations (Mahoney, 2011; Drineas and Mahoney, 2016). Wang and Li (2010) proposed to use random projections to efficiently compute the NMF. Later, Tepper and Sapiro (2016) proposed compressed NMF algorithms based on the idea of bilateral random projections (Zhou and Tao, 2012). While these compressed algorithms reduce the computational load considerably, they often fail to converge in our experiments.

We follow the probabilistic approach for matrix approximations formulated by Halko et al. (2011). Specifically, we propose a randomized hierarchical alternating least squares (HALS) algorithm to compute the NMF. We demonstrate that the randomized algorithm eases the computational challenges posed by massive data, assuming that the input data feature low-rank structure. Experiments show that our algorithm is reliable and attains a near-optimal factorization. Further, this manuscript is accompanied by the open-software package ristretto, written in Python, which allows the reproduction of all results (GIT repository: https://github.com/erichson/ristretto).

The manuscript is organized as follows: First, Section 2

briefly reviews the NMF as well as the basic concept of randomized matrix algorithms. Then, Section

3 describes a randomized variant of the HALS algorithm. This is followed by an empirical evaluation in Section 4, where synthetic and real world data are used to demonstrate the performance of the algorithm. Finally, Section 5 concludes the manuscript.

## 2 Background

### 2.1 Low-Rank Matrix Factorization

Low-rank approximations are fundamental and widely used tools for data analysis, dimensionality reduction, and data compression. The goal of these methods is to find two matrices of much lower rank that approximate a high-dimensional matrix :

 X≈WH.m×nm×kk×n (1)

The target rank of the approximation is denoted by , an integer between and

. A ubiquitous example of these tools, the singular value decomposition (SVD), finds the exact solution to this problem in a least-square sense

(Eckart and Young, 1936). While the optimality property of the SVD and similar methods is desirable in many scientific applications, the resulting factors are not guaranteed to be physically meaningful in many others. This is because the SVD imposes orthogonality constraints on the factors, leading to a holistic, rather than parts-based, representation of the input data. Further, the basis vectors in the SVD and other popular decompositions are mixed in sign.

Thus, it is natural to formulate alternative factorizations which may not be optimal in a least-square sense, but which may preserve useful properties such as sparsity and nonnegativity. Such properties are found in the NMF.

### 2.2 Nonnegative Matrix Factorization

The roots of NMF can be traced back to the work by Paatero and Tapper (1994). Lee and Seung (1999) independently introduced and popularized the concept of NMF in the context of psychology several years later.

Formally, the NMF attempts to solve Equation (1) with the additional nonnegativity constraints: and . These constraints enforce that the input data are expressed as an additive linear combination. This leads to sparse, parts-based features appearing in the decomposition, which have an intuitive interpretation. For example, NMF components of a face image dataset reveal individual nose and mouth features, whereas PCA components yield holistic features, known as ‘eigenfaces’.

Though NMF bears the desirable property of interpretability, the optimization problem is inherently nonconvex and ill-posed. In general, no convexification exists to simplify the optimization, meaning that no exact or unique solution is guaranteed (Gillis, 2017). Different NMF algorithms, therefore, can produce distinct decompositions that minimize the objective function.

We refer the reader to Lee and Seung (2001)Berry et al. (2007), and Gillis (2014) for a comprehensive discussion of the NMF and its applications. There are two main classes of NMF algorithms (Gillis, 2017), discussed in the following.

#### 2.2.1 Standard Nonlinear Optimization Schemes

Traditionally, the challenge of finding the nonnegative factors is formulated as the following optimization problem:

 minimize f(W,H)=∥X−WH∥2F (2) subject to W≥0 and H≥0.

Here denotes the Frobenius norm of a matrix. However, the optimization problem in Equation (2) is nonconvex with respect to both the factors and . To resolve this, most NMF algorithms divide the problem into two simpler subproblems which have closed-form solutions. The convex subproblem is solved by keeping one factor fixed while updating the other, alternating and iterating until convergence. One of the most popular techniques to minimize the subproblems is the method of multiplicative updates (MU) proposed by Lee and Seung (1999). This procedure is essentially a rescaled version of gradient descent. Its simple implementation comes at the expense of a much slower convergence.

More appealing are alternating least squares methods and their variants. Among these, HALS proves to be highly efficient (Cichocki and Anh-Huy, 2009). Without being exhaustive, we would also like to point out the interesting work by Gillis and Glineur (2012) as well as by Kim et al. (2014) who proposed improved and accelerated HALS algorithms for computing NMF.

#### 2.2.2 Separable Schemes

Another approach to compute NMF is based on the idea of column subset selection. In this method, columns of the input matrix are chosen to form the factor matrix , where denotes the index set. The factor matrix is found by solving the following optimization problem:

 minimizef(H)=∥X−W(:,J)H∥2Fs.t.H≥0. (3)

In context of NMF, this approach is appealing if the input matrix is separable (Arora et al., 2012). This means it must be possible to select basis vectors for from the columns of the input matrix . In this case, selecting actual columns from preserves the underlying structure of the data and allows a meaningful interpretation. This assumption is intrinsic in many applications, e.g., document classification and blind hyperspectral unmixing (Gillis, 2017). However, this approach has limited potential in applications where the data is dense or noisy.

These separable schemes are not unique and can be obtained through various algorithms. Finding a meaningful column subset is explored in the CX decomposition (Boutsidis et al., 2014), which extracts columns that best describe the data. In addition, the CUR decomposition (Mahoney and Drineas, 2009) leverages statistical significance of both columns and rows to improve interpretability, leading to near-optimal decompositions. Another interesting algorithm to compute the near-separable NMF was proposed by Zhou et al. (2013). This algorithm finds conical hulls in which smaller subproblems can be computed in parallel in 1D or 2D. For details on ongoing research in column selection algorithms, we refer the reader to Wang and Zhang (2013)Boutsidis and Woodruff (2017), and Wang et al. (2016).

### 2.3 Probabilistic Framework

In the era of ‘big data’, probabilistic methods have become indispensable for computing low-rank matrix approximations. The central concept is to utilize randomness in order to form a surrogate matrix which captures the essential information of a high-dimensional input matrix. This assumes that the input matrix features low-rank structure, i.e., the effective rank is smaller than its ambient dimensions. We refer the reader to the surveys by Halko et al. (2011)Mahoney (2011)Drineas and Mahoney (2016) and Martinsson (2016) for more detailed discussions of randomized algorithms. For implementations details, for instance, see Szlam et al. (2014)Voronin and Martinsson (2015), and Erichson et al. (2016).

Following Halko et al. (2011), the probabilistic framework for low-rank approximations proceeds as follows. Let be an matrix, without loss of generality we assume that . First, we aim to approximate the range of . While the SVD provides the best possible basis in a least-square sense, a near-optimal basis can be obtained using random projections

 Y:=AΩ, (4)

where is a random test matrix. Recall, that the target rank of the approximation is denoted by the integer , and is assumed to be . Typically, the entries of

are independently and identically drawn from the standard normal distribution. Next, the QR-decomposition of

is used to form a matrix with orthogonal columns. Thus, this matrix forms a near-optimal normal basis for the input matrix such that

 A≈QQ⊤A (5)

is satisfied. Finally, a smaller matrix is computed by projecting the input matrix to low-dimensional space

 B:=Q⊤A. (6)

Hence, the input matrix can be approximately decomposed (also called QB decomposition) as

 A≈QB. (7)

This process preserves the geometric structure in a Euclidean sense. The smaller matrix is, in many applications, sufficient to construct a desired low-rank approximation. The approximation quality can be controlled by oversampling and the use of power iterations. Oversampling is required to find a good basis matrix. This is because real-world data often do not have exact rank. Thus, instead of just computing random projections we compute random projections in order to form the basis matrix

. Specifically, this procedure increases the probability that

approximately captures the column space of . Our experiments show that small oversampling values of about achieve good approximation results to compute the NMF. Next, the idea of power iterations is to pre-process the input matrix in order to sample from a matrix which has a faster decaying singular value spectrum (Rokhlin et al., 2009). Therefore, Equation (4) is replaced by

 Y=((XX∗)qX)Ω (8)

where specifies the number of power iterations. The drawback to this method, is that additional passes over the input matrix are required. Note that the direct implementation of power iteration is numerically unstable, thus, subspace iterations are used instead (Gu, 2015).

#### 2.3.1 Scalability

We can construct the basis matrix using a deterministic algorithm when the data fit into fast memory. However, deterministic algorithms can become infeasible for data which are too big to fit into fast memory. Randomized methods for linear algebra provide a scalable architecture, which ease some of the challenges posed by big data. One of the key advantages of randomized methods is pass efficiency, i.e., the number of complete passes required over the entire data matrix.

The randomized algorithm, which is sketched above, requires only two passes over the entire data matrix, compared to passes required by deterministic methods. Hence, the smaller matrix can be efficiently constructed if a subset of rows or columns of the data can be accessed efficiently. Specifically, we can construct by sequentially reading in columns or blocks of consecutive columns. See, Appendix A for more details and a prototype algorithm. Note, that there exist also single pass algorithms to construct  (Tropp et al., 2016), however, the performance depends substantially on the singular value spectrum of the data. Thus, we favor the slightly more expensive multi-pass framework. Also, see the work by Bjarkason (2018) for an interesting discussion on the pass efficiency of randomized methods.

The randomized framework can be also extended to distributed and parallel computing. Voronin and Martinsson (2015) proposed a blocked scheme to compute the QB decomposition in parallel. Using this algorithm, can be constructed by distributing the data across processors which have no access to a shared memory to exchange information.

#### 2.3.2 Theoretical Performance of the QB Decomposition

Martinsson (2016) provides the following simplified description of the expected error:

 E∥A−QB∥2≤[1+√kp−1+e√k+pp⋅√n−k]12q+1σk+1(A).

It follows that as increases, the error tends towards the best possible approximation error, i.e., the singular value . A rigorous error analysis is provided by Halko et al. (2011).

## 3 Randomized Nonnegative Matrix Factorization

High-dimensional data pose a computational challenge for deterministic nonnegative matrix factorization, despite modern optimization techniques. Indeed, the costs of solving the optimization problem formulated in Equation (2) can be prohibitive. Our motivation is to use randomness as a strategy to ease the computational demands of extracting low-rank features from high-dimensional data. Specifically, a randomized hierarchical alternating least squares (HALS) algorithm is formulated to efficiently compute the nonnegative matrix factorization.

### 3.1 Hierarchal Alternating Least Squares

Block coordinate descent (BCD) methods are a universal approach to algorithmic optimization (Wright, 2015). These iterative methods fix a block of components and optimize with respect to the remaining components. Figure 1 illustrates the process for a simple 2-dimensional function , where we iteratively update while is fixed

 x=x+ν∂f∂x(x,y),

and while is fixed

 y=y+ν∂f∂y(x,y),

until convergence is reached. The parameter controls the step size and can be chosen in various ways (Wright, 2015). Following this philosophy, the HALS algorithm unbundles the original problem into a sequence of simpler optimization problems. This allows the efficient computation of the NMF (Cichocki and Anh-Huy, 2009).

Suppose that we update and by fixing most terms except for the block comprised of the th column and th row . Thus, each subproblem is essentially reduced to a smaller minimization. HALS approximately minimizes the cost function in Equation (2) with respect to the remaining components

 minimize Jj(W(:,j),H(j,:))=∥R(j)−W(:,j)H(j,:)∥2F, (9)

where is the th residual

 R(j):=X−k∑i≠jW(:,i)H(i,:). (10)

This can be viewed as a decomposition of the residual (Kimura et al., 2015). Then, it is simple to derive the gradients to find the stationary points for both components. First, we expend the cost function in Eq. (9) as

 Jj =Tr[R(j)⊤R(j)−2R(j)⊤W(:,j)H(j,:)+H⊤(j,:)W⊤(:,j)W(:,j)H(j,:)] =∥R(j)∥2F−2Tr[R(j)⊤W(:,j)H(j,:)]+Tr[H⊤(j,:)W⊤(:,j)W(:,j)H(j,:)]

Then, we take the gradient of with respect to

 0=∂Jj∂W(:,j)=W(:,j)H(j,:)H⊤(j,:)−R(j)H⊤(j,:),

and the gradient of with respect to

 0=∂Jj∂H(j,:)=H⊤(j,:)W⊤(:,j)W(:,j)−R(j)⊤W(:,j).

The update rules for the th component of and are

 W+(:,j)←⎡⎣[R(j)H⊤(j,:)H(j,:)H⊤(j,:)⎤⎦+, (11)
 H+(j,:)←⎡⎣R(j)⊤W(:,j)W⊤(:,j)W(:,j)⎤⎦+, (12)

where the maximum operator, defined as , ensures that the components remain nonzero. Note, that we can express Eq. (10) also as

 R(j) :=X−k∑i≠jW(:,i)H(i,:)=X−WH+W(:,j)H(j,:). (13)

Then, Eq. 13 can be substituted into the above update rules in order to avoid the explicit computation of the residual . Hence, we obtain the following simplified update rules:

 W+(:,j)←⎡⎢ ⎢ ⎢⎣W(:,j)+[XH⊤](:,j)−W[HH⊤](:,j)[HH⊤](j,j)⎤⎥ ⎥ ⎥⎦+, (14)
 H+(j,:)←⎡⎢ ⎢ ⎢⎣H(j,:)+[X⊤W](:,j)−H⊤[W⊤W](:,j)[W⊤W](j,j)⎤⎥ ⎥ ⎥⎦+. (15)

### 3.2 Randomized Hierarchal Alternating Least Squares

Employing randomness, we reformulate the optimization problem in Equation (2) as a low-dimensional optimization problem. Specifically, the high-dimensional input matrix is replaced by the surrogate matrix , which is formed as described in Section 2.3. Thus we yield the following optimization problem:

 minimize ~f(~W,H)=∥B−~WH∥2F (16) subject to Q~W≥0 and H≥0.

The nonnegativity constraints need to apply to the high-dimensional factor matrix , but not necessarily to . The matrix can be rotated back to high-dimensional space using the following approximate relationship . Equation (16) can only be solved approximately, since . Further, there is no reason that has nonnegative entries, yet the low-dimensional projection will decrease the objective function in Equation (16).111A proof can be demonstrated similar to the proof by Cohen et al. (2015)

for the compressed nonnegative CP tensor decomposition.

Now, we formulate the randomized HALS algorithm as

 minimize ~Jj(~W(:,j),H(j,:))=∥~R(j)−~W(:,j)H(j,:)∥2F, (17)

where is the th compressed residual

 ~R(j):=B−k∑i≠j~W(:,i)H(i,:). (18)

Then, the update rule for the th component of is as follows

 H+(j,:)←⎡⎢ ⎢ ⎢⎣H(j,:)+[B⊤~W](:,j)−H⊤[~W⊤~W](:,j)[~W⊤~W](j,j)⎤⎥ ⎥ ⎥⎦+, (19)

Note, that we use for scaling in practice in order to ensure the correct scaling in high-dimensional space. Next, the update rule for the th component of is

 ~W(:,j)←~W(:,j)+[BH⊤](:,j)−~W[HH⊤](:,j)[HH⊤](j,j). (20)

Then, we employ the following scheme to update the th component in high-dimensional space, followed by projecting the updated factor matrix back to low-dimensional space

 W+(:,j)←[Q~W(:,j)]+. (21)
 ~W+(:,j)←Q⊤W+(:,j). (22)

This HALS framework allows the choice of different update orders (Kim et al., 2014). For instance, we can proceed by using one of the following two cyclic update orders:

 W+(:,1)→H+(1,:)→⋯→W+(:,k)→H+(k,:), (23)

or

 W+(:,1)→W+(:,2)→⋯→W+(:,k)→H+(1,:)→H+(2,:)⋯→H+(k,:), (24)

which are illustrated in Figure 2. We favor the latter scheme in the following. Alternatively, a shuffled update order can be used, wherein the components are updated in a random order in each iteration. Wright (2015) noted that this scheme performs better in some applications. Yet another exciting strategy is randomized block coordinate decent (Nesterov, 2012).

The computational steps are summarized in Algorithm 1.

### 3.3 Stopping Criterion

Predefining a maximum number of iterations is not satisfactory in practice. Instead, we aim to stop the algorithm when a suitable measure has reached some convergence tolerance. For a discussion on stopping criteria, for instance, see Gillis and Glineur (2012). One option is to terminate the algorithm if the value of objective function is smaller than some stopping condition

 ||X−WH||2F<ϵ. (25)

This criteria is expensive to compute and often not a good measure for convergence. An interesting alternative stopping condition is the projected gradient (Lin, 2007; Hsieh and Dhillon, 2011). The projected gradient measures how close the current solution is to a stationary point. We compute the projected gradient with respect to as

 ∇PWij:=⎧⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪⎩∂f(W,H)∂Wij,if Wij>0\parmin(0,∂f(W,H)∂Wij),% if Wij=0, (26)

and in a similar way with respect to . Accordingly, the stopping condition can be formulated:

 ||∇Pf(W,H)||2F<ϵ||∇Pf(W0,H0)||2F, (27)

where and indicate the initial points. Note, that it follows from the Karush–Kuhn–Tucker (KKT) condition that a stationary point is reached if and only if the projected gradient is for some optimal points and .

###### Remark 1 (Random Test Matrix).

The entries of the random test matrix

are drawn independently from the uniform distribution in the interval of

. Nonnegative random entries perform better than Gaussian distributed entries, and seem to be the natural choice in the context of nonnegative data.

###### Remark 2 (Initialization).

The performance of NMF algorithms depends on the procedure used for initializing the factor matrices. We refer to Langville et al. (2014) for an excellent discussion on this topic. A standard approach is to initialize the factor matrices with Gaussian entries, where negative elements are set to 0. However, in many applications the performance can be improved by using an initialization scheme which is based on the (randomized) singular value decomposition (Boutsidis and Gallopoulos, 2008).

### 3.4 Regularized Hierarchal Alternating Least Squares

Nonnegative matrix factorization can be extend by incorporating extra constraints such as regularization terms. Specifically, we can formulate the problem more generally as

 minimize (28) subject to W≥0 and H≥0,

where is a regularization term. Popular choices for regularization are the norm, the norm, and a combination of both. The different norms are illustrated in Figure 3.

The norm (see Fig. 2(a)), or Frobenious norm, as a regularizer adds a small bias against large terms into the updating rules, which is also known as ridge. This regularizer can be defined as

 r(x)=α∥x∥2F,

where is a tuning parameter. An additional benefit is less overfitting.

In many application it is favorable to obtain a more parsimonious model, i.e., a model that is simpler and more interpretable. This can be achieved by using sparsity promoting regularizers which help to identify the most meaningful ‘active’ (non-zero) entries, while forcing many terms to zero. The most popular choice is to use the norm as sparsity-promoting regularizer

 r(x)=β∥x∥1,

which is also known as LASSO (least absolute shrinkage and selection operator), see Fig. 2(b). The tuning parameter can be used to control the level of sparsity.

A third regularization method, elastic net (Zou and Hastie, 2005), maintains the properties of both ridge and LASSO defined as

 r(x)=α∥x∥1+β∥x∥22.

Essentially, the elastic net is just a combination of the and the Frobenious norm, see Fig. 2(c).

The discussed regularizes are simple to integrate into the HALS update rules. For details see Cichocki and Anh-Huy (2009) and Kim et al. (2014).

##### Hierarchal Alternating Least Squares with ℓ2 Regularization:

We start by augmenting the cost function with additional regularization terms for the components

 minimize Jj(W(:,j),H(j,:))=∥R(j)−W(:,j)H(j,:)∥2F+αW∥W(:,j)∥22+αH∥H(j,:)∥22, (29)

where and are tuning parameters. We take the gradient of with respect to

 0=∂Jj∂W(:,j)=W(:,j)H(j,:)H⊤(j,:)−R(j)H⊤(j,:)+αWW(:,j),

Then, rearranging terms and substituting Eq. (10) yields the following regularized update rule for the th component of

 W+(:,j)←⎡⎢ ⎢ ⎢⎣[HH⊤](j,j)[HH⊤](j,j)+αWW(:,j)+[XH⊤](:,j)−W[HH⊤](:,j)[HH⊤](j,j)+αW⎤⎥ ⎥ ⎥⎦+. (30)

Similar, we yield the following regularized update rule for the components of

 H+(j,:)←⎡⎢ ⎢ ⎢⎣[W⊤W](j,j)[W⊤~W](j,j)+αHH(j,:)+[X⊤W](:,j)−H⊤[W⊤W](:,j)[W⊤W](j,j)+αH⎤⎥ ⎥ ⎥⎦+, (31)
##### Hierarchal Alternating Least Squares with ℓ1 Regularization:

We start by augmenting the cost function with additional regularization terms for the components

 minimize ~Jj(~W(:,j),H(j,:))=∥~R(j)−~W(:,j)H(j,:)∥2F+βW∥W(:,j)∥1+βH∥H(j,:)∥1, (32)

where and are tuning parameters to control the level of sparsity. Noting, that the entries of are nonnegative we have that . Hence, taking the gradient of with respect to gives

 0=∂Jj∂W(:,j)=W(:,j)H(j,:)H⊤(j,:)−R(j)H⊤(j,:)+βW1.

Then, we can formulate the following update rule for the th component of

 W+(:,j)←⎡⎢ ⎢ ⎢⎣W(:,j)+[XH⊤−βW1](:,j)−W[HH⊤](:,j)[HH⊤](j,j)⎤⎥ ⎥ ⎥⎦+, (33)

and similar we yield the update rules for the components of

 H+(j,:)←⎡⎢ ⎢ ⎢⎣H(j,:)+[X⊤W−βH1](:,j)−H⊤[W⊤W](:,j)[W⊤W](j,j)⎤⎥ ⎥ ⎥⎦+. (34)

Both, and regularization can be combined, which leads to the elastic net (Zou and Hastie, 2005). The elastic net, which combines both the effects of ridge and lasso, shows often a favorable performance in high-dimensional data settings.

## 4 Experimental Evaluation

In the following, we evaluate the proposed randomized algorithm and compare it to the deterministic HALS algorithm as implemented in the scikit-learn package (Pedregosa et al., 2011). Througout all experiments, we set the oversampling parameter to and the number of subspace iterations to for the randomized algorithm. For comparison, we also provide the results of the compressed MU algorithm Tepper and Sapiro (2016).222Note, that Tepper and Sapiro (2016) have also used the active set and the alternating direction method of multipliers for computing the NMF. While both of these methods perform better than the MU algorithm in several empirical experiments, we faced some numerical issues in applications of interest to us. Hence, we present only the results of the compressed MU algorithm in the following. All computations are performed on a laptop with Intel Core i7-7700K CPU Quad-Core 4.20GHz, 64GB fast memory.

### 4.1 Facial Feature Extraction

Feature extraction from facial images is an essential preprocessing step for facial recognition. An ordinary approach is to use the SVD or PCA for this task, which was first studied by Kirby and Sirovich (1990) and later by Turk and Pentland (1991). The resulting basis images represent ‘shadows’ of the faces, the so-called eigenfaces. However, instead of learning holistic representations, it seems more natural to learn a parts-based representation of the facial data. This was first demonstrated in the seminal work by Lee and Seung (1999) using the NMF. The corresponding basis images are more robust to occlusions and are easier to interpret.

In the following we use the downsampled cropped Yale face database B (Georghiades et al., 2001). The dataset comprises grayscale images, cropped and aligned. Each image is of dimension . Thus, we yield a data matrix of dimension after vectorizing and stacking the images. Figure 4 shows the dominant features extracted via the deterministic and randomized HALS, which characterize some facial features. For comparison, we show also the holistic basis functions computed via the SVD. Clearly, the features extracting using the NMF are more parsimonious and better to interpret.

The randomized algorithm achieves a substantial speed-up of a factor of about , while the reconstruction error remains near-optimal, see Table 1. In comparison, the compressed MU algorithm performs slightly poorer overall. The compressed MU algorithm is extremely fast and the computational costs per iteration are lower than the costs of the randomized HALS algorithm. However, the MU algorithm requires a large number iterations to converge. Hence, the randomized HALS algorithms has a more favorable trade-off between speed and accuracy.

To better contextualize the behavior of the randomized HALS algorithm we plot the relative error and the projected gradient against the computational time in Figure 5. The cost per iteration compared to the deterministic algorithms is substantially lower, i.e., the objective function decreases in less time. In addition, Figure 6 shows the results plotted against the number of iterations. Note that using the SVD for initialization increases the accuracy. While the SVD adds additional costs to the overall computational time, it is often the favorable initialization scheme (Boutsidis and Gallopoulos, 2008).

### 4.2 Hyperspectral Unmixing

Hyperspectral imaging is often used in order to determine what materials and underlying processes are present in a scene. It uses data collected from across the electromagnetic spectrum. The difficulty, however, is that pixels of hyperspectral images commonly consist of a mixture of several materials. Thus, the aim of hyperspectral unmixing (HU) is to separate the pixel spectra into a collection of the so-called endmembers and abundances. Specifically, the endmembers represent the pure materials, while the abundances reflect the fraction of each endmember that is present in the pixel (Bioucas-Dias et al., 2012)

. The NMF represents a simple linear mixing model for blind HU in form of

 pi=WH(:,i) (35)

where the th pixel is denoted as . The basis matrix represents the spectral signatures of the endmembers, while the th column of the weight matrix represents the abundances of the corresponding pixel.

In the following we use the popular ‘urban’ hyperspectral image for demonstration (the data are obtained from http://www.agc.army.mil/). The hyperspectral image is of dimension pixels, each corresponding to an area of meters. Further, the image consists of spectral bands; however, we omit several channels due to dense water vapor and atmospheric effects. We use only bands for the following analysis, which aims to automatically extract the four endmembers: asphalt, grass, tree and roof. Figure 7 shows the dominant basis images reflecting the four endmembers as well as the corresponding abundance map. Both the deterministic and randomized algorithms faithfully extracted the spectral signatures and the abundance maps. Note that we are using the SVD for initialization here, which shows a favorable performance compared to randomly initialized factor matrices.

Table 2 quantifies the results. The randomized HALS and compressed MU algorithms require more iterations to converge compared to the deterministic HALS. The speedup of randomized HALS is considerable by a factor of about , whereas the MU algorithm took longer. This is, because the MU algorithm requires a larger number of iterations to converge.

Overall, NMF is able to extract the four different endmembers, yet the modes seem to be somewhat mixed. Now, we are interested in improving the interpretability of this standard model. Therefore, we introduce some additional sparsity constraints in order to obtain a more parsimonious model. More concretely, we employ regularization to yield a sparser factor matrix . We control the level of sparsity via the tuning parameter which we set to . Figure 6(c) shows the resulting modes. Clearly, the different endmembers are less mixed and appear to be better interpretable. Note that the corresponding spectra remains the same.

Next, we contextualize the performance of the randomized HALS algorithms by plotting the relative error and projected gradient vs computational time and the number of iterations, Figure 8 and 6. Again, we see that the randomized algorithm faithfully converges in a fraction of the computational time required by the deterministic algorithm. Using the SVD, rather than random initialization, leads to a lower relative error on average.

### 4.3 Handwritten Digits

In practice, it is often of great interest to extract features from data in order to characterize the underlying structure. The extracted features can then be used, for instance, for visualization or classification. Here, we are interested in investigating whether the features extracted via the randomized NMF algorithms yield a good classification performance.

In the following we use the MNIST (Modified National Institute of Standards and Technology) database of handwritten digits, which comprises training and testing images, for demonstration (the data are obtained from http://yann.lecun.com/exdb/mnist/). Each image patch is of dimension and depicts a digit between and . Figure 10 shows the first dominant basis images computed via the deterministic and randomized HALS as well as the SVD. Compared to the holistic features found by the SVD, NMF computes a parts-based representation of the digits. Hence, each digit can be formed as an additive combination of the individual parts. Furthermore, the basis images are simple to interpret.

Table 3 summarizes the computational results and shows that the randomized algorithms achieve a near-optimal reconstruction error while reducing computation. Here we limit the number of iterations to to keep the computational time low. A higher number of iterations, however, does not significantly improve the classification performance.

Now, we investigate the question of whether the quality of the features computed by the randomized algorithm is sufficient for classification. We use the basis images to first project the data into low-dimensional space, then we use the -nearest-neighbors method, with , for classification. The results for both the training and test samples are shown in Table 4.

Interestingly, both the randomized and deterministic features yield a similar classification performance in terms of precision, recall, and the F1-score.

### 4.4 Computational performance

We use synthetic nonnegative data to contextualize the computational performance of the algorithms. Specifically, we construct low-rank matrices consisting of nonnegative elements drawn from the Gaussian distribution.

First, we compute the NMF for low-rank matrices of dimension and , both of rank . Figure 11 shows the relative error, the timings, and the speedup for varying target ranks , averaged over 20 runs. First, we notice that the randomized HALS algorithm shows a significant speedup over the deterministic HALS algorithm by a factor of to , while achieving a high accuracy. Here, we have limited the maximum number of iterations to for both the randomized and deterministic HALS algorithm. If required, an even higher accuracy could be achieved by allowing for a larger number of iterations.

The MU algorithm is known to require a larger number of iterations. Thus, we have allowed for a maximum number of iterations of . Despite the large number of iterations, the compressed MU algorithms shows a patchy performance in comparison. While the results for small target ranks are satisfactory, the algorithm has difficulties in approximating factors of larger ranks accurately. In both cases, the compressed MU algorithm does not converge.

Figure 12 and 13 shows the convergence behavior for a random generated low-rank matrix with dimensions . Like the deterministic algorithm, the randomized HALS algorithms approximates the data with nearly machine-precision. While using the SVD for initialization requires some additional computational costs, the accuracy is slightly better.

## 5 Conclusion

Massive nonnegative data poses a computational challenge for deterministic NMF algorithms. However, randomized algorithms are a promising alternative for computing the approximate nonnegative factorization. The computational complexity of the proposed randomized algorithm scales with the target rank rather than ambient dimension of the measurement space. Hence, the computational advantage becomes pronounced with increasing dimensions of the input matrix.

The randomized HALS algorithm substantially reduces the computational demands while achieving near-optimal reconstruction results. Thereby, the trade-off between precision and speed can be controlled via oversampling and utilizing power iterations. We prose and the computation of subspace iterations as default values. These settings show a good performance throughout all our experiments. Overall, the randomized HALS algorithm shows considerable advantages over the deterministic HALS and the previously proposed compressed MU algorithm. In addition, regularized HALS can help to compute more interpretable modes.

Future research will investigate a GPU-accelerated implementation of the proposed randomized HALS algorithm. Furthermore, the presented ideas can be applied to nonnegative tensor factorization using the randomized framework proposed by Erichson et al. (2017).

## Acknowledgments

We would like to express our gratitude to the two anonymous reviewers for their helpful feedback which allowed us improve the manuscript. NBE and JNK acknowledge support from an SBIR grant through SURVICE Inc. (FA9550-17-C-0007)

## Appendix A QB Decomposition

If the data matrix fits into fast memory, the QB decomposition can be computed efficiently using BLAS-3 operations. However, in a big data environment we face the challenge that the data matrix is to big to fit into fast memory. The randomized scheme allows to build the smaller matrix by simply iterating over the columns or blocks of columns of , once at a time. For instance, the HDF5 file system allows a handy framework to access subsets of columns of a data matrix.

Algorithm 2 shows a prototype implementation of the QB decomposition. Without additional power iterations, the algorithm requires two passes over the entire data matrix. Each additional power iteration requires one additional pass. Note, that in practice it is more efficient to read in blocks, rather than just a single column. Further, the for-loops in line (4), (8) and (12) can be executed in parallel.

## References

• Arora et al. (2012) Arora S, Ge R, Kannan R, Moitra A (2012). “Computing a nonnegative matrix factorization–provably.” In

Proceedings of the forty-fourth annual ACM symposium on Theory of computing

, pp. 145–162. ACM.
• Berry et al. (2007) Berry MW, Browne M, Langville AN, Pauca VP, Plemmons RJ (2007). “Algorithms and applications for approximate nonnegative matrix factorization.” Computational statistics & data analysis, 52(1), 155–173.
• Bioucas-Dias et al. (2012) Bioucas-Dias JM, Plaza A, Dobigeon N, Parente M, Du Q, Gader P, Chanussot J (2012). “Hyperspectral unmixing overview: Geometrical, statistical, and sparse regression-based approaches.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 5(2), 354–379.
• Bjarkason (2018) Bjarkason EK (2018). “Pass-Efficient Randomized Algorithms for Low-Rank Matrix Approximation Using Any Number of Views.” arXiv preprint arXiv:1804.07531.
• Boutsidis et al. (2014) Boutsidis C, Drineas P, Magdon-Ismail M (2014). “Near-optimal column-based matrix reconstruction.” SIAM Journal on Computing, 43(2), 687–717.
• Boutsidis and Gallopoulos (2008) Boutsidis C, Gallopoulos E (2008). “SVD based initialization: A head start for nonnegative matrix factorization.” Pattern Recognition, 41(4), 1350–1362.
• Boutsidis and Woodruff (2017) Boutsidis C, Woodruff DP (2017). “Optimal CUR matrix decompositions.” SIAM Journal on Computing, 46(2), 543–589.
• Cichocki and Anh-Huy (2009) Cichocki A, Anh-Huy P (2009). “Fast local algorithms for large scale nonnegative matrix and tensor factorizations.” IEICE transactions on fundamentals of electronics, communications and computer sciences, 92(3), 708–721.
• Cohen et al. (2015) Cohen J, Farias RC, Comon P (2015). “Fast decomposition of large nonnegative tensors.” IEEE Signal Processing Letters, 22(7), 862–866.
• Drineas and Mahoney (2016) Drineas P, Mahoney MW (2016). “RandNLA: Randomized Numerical Linear Algebra.” Communications of the ACM, 59(6), 80–90. doi:10.1145/2842602.
• Eckart and Young (1936) Eckart C, Young G (1936). “The Approximation of one Matrix by Another of Lower Rank.” Psychometrika, 1(3), 211–218.
• Erichson et al. (2017) Erichson NB, Manohar K, Brunton SL, Kutz JN (2017). “Randomized CP Tensor Decomposition.” Preprint arXiv:1703.09074, pp. 1–29.
• Erichson et al. (2016) Erichson NB, Voronin S, Brunton SL, Kutz JN (2016). “Randomized matrix decompositions using R.” arXiv preprint arXiv:1608.02148.
• Erichson et al. (2018) Erichson NB, Zheng P, Manohar K, Brunton SL, Kutz JN, Aravkin AY (2018). “Sparse Principal Component Analysis via Variable Projection.” arXiv preprint arXiv:1804.00341.
• Georghiades et al. (2001) Georghiades AS, Belhumeur PN, Kriegman DJ (2001). “From Few to Many: Illumination Cone Models for Face Recognition Under Variable Lighting and Pose.” IEEE transactions on pattern analysis and machine intelligence, 23(6), 643–660.
• Gillis (2014) Gillis N (2014). “The why and how of nonnegative matrix factorization.” In

Regularization, Optimization, Kernels, and Support Vector Machines

, pp. 257–291. Chapman and Hall/CRC.
• Gillis (2017) Gillis N (2017). “Introduction to Nonnegative Matrix Factorization.” arXiv preprint arXiv:1703.00663.
• Gillis and Glineur (2012) Gillis N, Glineur F (2012). “Accelerated multiplicative updates and hierarchical ALS algorithms for nonnegative matrix factorization.” Neural computation, 24(4), 1085–1105.
• Gu (2015) Gu M (2015). “Subspace Iteration Randomization and Singular Value Problems.” SIAM Journal on Scientific Computing, 37(3), 1139–1173.
• Halko et al. (2011) Halko N, Martinsson PG, Tropp JA (2011). “Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions.” SIAM Review, 53(2), 217–288. doi:10.1137/090771806.
• Hsieh and Dhillon (2011) Hsieh CJ, Dhillon IS (2011). “Fast coordinate descent methods with variable selection for non-negative matrix factorization.” In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1064–1072. ACM.
• Kim et al. (2014) Kim J, He Y, Park H (2014). “Algorithms for nonnegative matrix and tensor factorizations: A unified view based on block coordinate descent framework.” Journal of Global Optimization, 58(2), 285–319.
• Kimura et al. (2015) Kimura K, Tanaka Y, Kudo M (2015). “A Fast Hierarchical Alternating Least Squares Algorithm for Orthogonal Nonnegative Matrix Factorization.” In Proceedings of the Sixth Asian Conference on Machine Learning, volume 39 of Proceedings of Machine Learning Research, pp. 129–141. PMLR.
• Kirby and Sirovich (1990) Kirby M, Sirovich L (1990). “Application of the Karhunen-Loeve Procedure for the Characterization of Human Faces.” Pattern Analysis and Machine Intelligence, IEEE Transactions on, 12(1), 103–108.
• Langville et al. (2014) Langville AN, Meyer CD, Albright R, Cox J, Duling D (2014). “Algorithms, initializations, and convergence for the nonnegative matrix factorization.” arXiv preprint arXiv:1407.7299.
• Lee and Seung (1999) Lee DD, Seung HS (1999). “Learning the parts of objects by non-negative matrix factorization.” Nature, 401(6755), 788–791. doi:10.1038/44565.
• Lee and Seung (2001) Lee DD, Seung HS (2001). “Algorithms for non-negative matrix factorization.” In Advances in Neural Information Processing Systems, pp. 556–562.
• Lin (2007) Lin CJ (2007). “Projected gradient methods for nonnegative matrix factorization.” Neural computation, 19(10), 2756–2779.
• Mahoney (2011) Mahoney MW (2011). “Randomized Algorithms for Matrices and Data.” Foundations and Trends in Machine Learning, 3(2), 123–224. doi:10.1561/2200000035.
• Mahoney and Drineas (2009) Mahoney MW, Drineas P (2009). “CUR Matrix Decompositions for Improved Data Analysis.” Proceedings of the National Academy of Sciences, 106(3), 697–702.
• Martinsson (2016) Martinsson PG (2016). “Randomized Methods for Matrix Computations and Analysis of High Dimensional Data.” Preprint arXiv:1607.01649, pp. 1–55.
• Nesterov (2012) Nesterov Y (2012). “Efficiency of coordinate descent methods on huge-scale optimization problems.” SIAM Journal on Optimization, 22(2), 341–362.
• Paatero and Tapper (1994) Paatero P, Tapper U (1994).

“Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values.”

Environmetrics, 5(2), 111–126.
• Pedregosa et al. (2011) Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011). “Scikit-learn: Machine Learning in Python.” Journal of Machine Learning Research, 12, 2825–2830.
• Rokhlin et al. (2009) Rokhlin V, Szlam A, Tygert M (2009). “A Randomized Algorithm for Principal Component Analysis.” SIAM Journal on Matrix Analysis and Applications, 31(3), 1100–1124.
• Szlam et al. (2014) Szlam A, Kluger Y, Tygert M (2014). “An Implementation of a Randomized Algorithm for Principal Component Analysis.” Preprint arXiv:1412.3510, pp. 1–13.
• Tepper and Sapiro (2016) Tepper M, Sapiro G (2016). “Compressed nonnegative matrix factorization is fast and accurate.” IEEE Transactions on Signal Processing, 64(9), 2269–2283.
• Tropp et al. (2016) Tropp JA, Yurtsever A, Udell M, Cevher V (2016). “Randomized Single-View Algorithms for Low-Rank Matrix approximation.” arXiv preprint arXiv:1609.00048.
• Turk and Pentland (1991) Turk MA, Pentland AP (1991). “Face Recognition using Eigenfaces.” In

Proceedings on Computer Vision and Pattern Recognition

, pp. 586–591. IEEE.
• Udell and Townsend (2017) Udell M, Townsend A (2017). “Nice latent variable models have log-rank.” arXiv preprint arXiv:1705.07474.
• Voronin and Martinsson (2015) Voronin S, Martinsson PG (2015). “RSVDPACK: Subroutines for Computing Partial Singular Value Decompositions via Randomized Sampling on Single Core, Multi Core, and GPU Architectures.” Preprint arXiv:1502.05366, pp. 1–15.
• Wang and Li (2010) Wang F, Li P (2010). “Efficient nonnegative matrix factorization with random projections.” In Proceedings of the 2010 SIAM International Conference on Data Mining, pp. 281–292. SIAM.
• Wang and Zhang (2013) Wang S, Zhang Z (2013). “Improving CUR matrix decomposition and the Nyström approximation via adaptive sampling.” The Journal of Machine Learning Research, 14(1), 2729–2769.
• Wang et al. (2016) Wang S, Zhang Z, Zhang T (2016). “Towards more efficient SPSD matrix approximation and CUR matrix decomposition.” Journal of Machine Learning Research, 17(210), 1–49.
• Wright (2015) Wright SJ (2015). “Coordinate descent algorithms.” Mathematical Programming, 151(1), 3–34.
• Zhou et al. (2013) Zhou T, Bian W, Tao D (2013). “Divide-and-conquer anchoring for near-separable nonnegative matrix factorization and completion in high dimensions.” In Data Mining (ICDM), 2013 IEEE 13th International Conference on, pp. 917–926. IEEE.
• Zhou and Tao (2012) Zhou T, Tao D (2012). “Bilateral random projections.” In Information Theory Proceedings (ISIT), 2012 IEEE International Symposium on, pp. 1286–1290. IEEE.
• Zou and Hastie (2005) Zou H, Hastie T (2005). “Regularization and variable selection via the elastic net.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301–320.