Semi-Orthogonal Non-Negative Matrix Factorization

Non-negative Matrix Factorization (NMF) is a popular clustering and dimension reduction method that decomposes a non-negative matrix into the product of two lower-dimensional matrices composed of basis vectors. In this paper, we propose a semi-orthogonal NMF method that enforces one of the matrices to be orthogonal with mixed signs, thereby guaranteeing the rank of the factorization. Our method preserves strict orthogonality by implementing the Cayley transformation to force the solution path to lie exactly on the Stiefel manifold, as opposed to the approximately orthogonal solutions in the existing literature. We apply a line search update scheme along with an SVD-based initialization, which produces rapid convergence of the algorithm compared to other existing approaches. In addition, we present formulations of our method for both continuous and binary design matrices. Through various simulation studies, we show that our model has an advantage over other NMF variants in factorization accuracy, rate of convergence, and degree of orthogonality while remaining computationally competitive. We also apply our method to a text-mining data set of triage notes, and show the effectiveness of our model in reducing classification error compared to the conventional bag-of-words model and other alternative matrix factorization approaches.


1 Introduction

Nonnegative Matrix Factorization (NMF) has gained much attention due to its wide applications in machine learning, such as cluster analysis (Kim and Park, 2008b; Aggarwal and Reddy, 2013), text mining (Pauca et al., 2004; Shahnaz et al., 2006; Aggarwal and Zhai, 2012a), and image recognition (Lee and Seung, 1999; Buciu and Pitas, 2004). The purpose of NMF is to uncover non-negative latent factors and relationships that provide meaningful interpretations for practical applications. NMF was first extensively studied by Paatero and Tapper (1994) as positive matrix factorization, and was subsequently popularized by the work of Lee and Seung (1999, 2001) in machine learning. Specifically, NMF seeks to approximate the target matrix X by factorizing it into two lower-rank non-negative matrices F and G, which together provide a lower-dimensional approximation of the original matrix. Other variations of NMF include Sparse NMF (Hoyer, 2004), Orthogonal NMF (Ding et al., 2006), and Semi-NMF (Ding et al., 2010). Additionally, NMF methods for binary data structures have been developed using a logistic regression approach (Kaban et al., 2004; Schachtner et al., 2010; Tomé et al., 2015; Larsen and Clemmensen, 2015).

In addition, Ding et al. (2006) proposed Orthogonal NMF (ONMF), which adds an orthogonality constraint on either F or G. The dual constraints of non-negativity and orthogonality force each row or column of the constrained matrix to have exactly one non-zero entry, which creates a rigid clustering interpretation; under this formulation, the objective function is also equivalent to that of K-means clustering (Ding et al., 2006). Orthogonality can be imposed through a Lagrangian multiplier (Ding et al., 2006; Kimura et al., 2015) or by employing the natural gradient on the Stiefel manifold (Yoo and Choi, 2008). Other methods replace the Lagrangian multiplier with a weighting parameter (Li et al., 2010; Mirzal, 2014) to force orthogonality. However, achieving strict orthogonality is infeasible in the current literature due to the non-negativity constraint, and the deviation from strict orthogonality grows as the number of basis vectors increases. Ding et al. (2010) later extended NMF to mixed-sign matrices by relaxing the non-negativity constraint on F, introducing Semi-NMF. Following ONMF, they drew an interesting connection between Semi-NMF and K-means clustering, where F can be interpreted as the cluster centroids and G as the cluster membership indicators. They further argue that these variants of NMF can be viewed as relaxed versions of K-means clustering.

The most widely implemented algorithm for NMF and its variants is a multiplicative update scheme based on matrix-wise alternating block coordinate descent (Lee and Seung, 1999, 2001; Ding et al., 2006; Yoo and Choi, 2008; Ding et al., 2010; Mirzal, 2014). Although widely applied and simple to implement, these methods have a relatively slow convergence rate since they only incorporate first-order gradient information (Lin, 2007; Han et al., 2009; Cichocki and Phan, 2009), and they suffer from the zero-locking problem. To overcome this inefficiency, several state-of-the-art NMF algorithms were developed using block coordinate descent (Kim et al., 2014), projected gradient descent (Lin, 2007), and alternating constrained least squares (Kim and Park, 2008a). As an alternative to matrix-wise updates, column-wise (Cichocki et al., 2007; Cichocki and Phan, 2009; Kimura et al., 2015) and element-wise updates (Hsieh and Dhillon, 2011) have been investigated and shown to significantly speed up convergence for NMF.

In this paper, we propose a new semi-orthogonal matrix factorization method for both continuous and binary observations. For the former, we take advantage of the fast convergence of the column-wise update scheme (Cichocki and Phan, 2009), along with a line search implementation to further increase the rate of convergence. Orthogonality is guaranteed by an orthogonality-preserving update scheme (Wen and Yin, 2013). Additionally, we show that, because strict orthogonality is enforced at each iteration, the column-wise update scheme reduces to a simple matrix-wise alternating least squares update, which further reduces the complexity of the algorithm. The algorithm for the binary case is based on a logistic regression formulation (Kaban et al., 2004; Tomé et al., 2015) of the objective function with the same orthogonality-preserving scheme, combined with Newton's method.

Our numerical studies show that the objective function is monotonically decreasing under our proposed algorithm, with a significantly faster convergence rate than existing methods, while strict orthogonality is retained throughout. Our model also performs consistently well regardless of the true structure of the target matrix thanks to a deterministic SVD-based initialization, whereas other models are susceptible to local minima.

One of the most prominent applications of NMF is word and document clustering (Xu et al., 2003; Yoo and Choi, 2008; Aggarwal and Zhai, 2012a, b), owing to the bi-clustering nature of NMF. We perform an extensive text-mining analysis on classifying triage notes and show that classification accuracy can be improved by transforming the data set to a basis representation obtained by our method. Additionally, the orthogonal formulation within our method yields uncorrelated topic vectors. Furthermore, words with significant positive and negative weights can be viewed as separate clusters under the same topic vector, providing a sub-bi-clustering interpretation. On the other hand, the non-negativity of the document-topic matrix permits the conventional sparse, by-parts interpretation that NMF is well known for.

The paper is organized as follows. Section 2 reviews established NMF methodologies and algorithms. Sections 3 and 4 present the proposed method for the continuous and binary cases, respectively. Section 5 provides an extensive numerical study using both simulated data and real triage text data sets. Section 6 concludes with a discussion.

2 Notations and Review of Existing Literature

We first summarize the notation used in this paper. We denote an arbitrary matrix as X and a vector as x; a matrix with p column vectors is written as $X = [x_1, \dots, x_p]$. The Frobenius norm of a matrix X is $\|X\|_F = \sqrt{\sum_{i,j} x_{ij}^2}$.

We next provide a brief review of some popular NMF methods. Non-negative matrix factorization (Lee and Seung, 1999) aims to factorize a non-negative matrix into the product of two non-negative matrices F and G, i.e.,

$$\min_{F \ge 0, \, G \ge 0} \|X - FG\|_F^2. \qquad (1)$$

Given an observed design matrix $X \in \mathbb{R}_+^{n \times p}$, we can approximate it as the product of two lower-rank matrices $F \in \mathbb{R}_+^{n \times K}$ and $G \in \mathbb{R}_+^{K \times p}$. In typical applications, K is much less than n or p. More specifically, each column of X can be rewritten as $x_j \approx F g_j$, where $x_j$ and $g_j$ are the corresponding columns of X and G. Thus, each column vector $x_j$ is approximated by a linear combination of the columns of F, weighted by the entries of $g_j$; equivalently, F can be regarded as the matrix of basis vectors for the linear approximation of X.

The minimization problem in (1) is not jointly convex in F and G, but it is convex in one matrix when the other is fixed; therefore, only a local minimum can be guaranteed. The general approach to solving (1) is to alternate updates of F and G, fixing one while updating the other, until the objective function converges. The most widely used algorithm is the multiplicative update scheme (Lee and Seung, 1999, 2001), where F and G are updated by multiplying the current value with an adaptive factor that depends on a rescaling of the gradient of (1):

$$F \leftarrow F \circ \frac{X G^T}{F G G^T}, \qquad G \leftarrow G \circ \frac{F^T X}{F^T F G}, \qquad (2)$$

where $\circ$ and the quotient denote element-wise multiplication and division.

Aside from a slow convergence rate (Lin, 2007), the zero-locking problem (Langville et al., 2014) is also an issue: once an entry in either F or G becomes 0, it remains stuck at 0. The zero-locking problem can be tackled by non-multiplicative methods such as projected gradient descent (Lin, 2007) or block coordinate descent (Kim et al., 2014), which are computationally fast, but the rate of convergence is still not optimal due to the inefficiency of matrix-wise updates. Cichocki et al. (2007) proposed a column-wise update scheme known as hierarchical alternating least squares (HALS), which significantly increases the rate of convergence. The improvement comes from the fact that the update of one matrix depends not only on the other matrix but also on its own columns; consequently, every updated column affects the updates of the remaining columns.
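To make the scheme concrete, the following minimal R sketch implements the multiplicative updates in (2). This is our own illustration, not the authors' package code; the function name and the eps guard against zero denominators are ours.

  # Minimal sketch of Lee-Seung multiplicative updates for X ~ Fm %*% Gm.
  # Fm and Gm play the roles of F and G; eps guards the denominators.
  nmf_multiplicative <- function(X, K, iters = 200, eps = 1e-9) {
    n <- nrow(X); p <- ncol(X)
    Fm <- matrix(runif(n * K), n, K)  # random non-negative initialization
    Gm <- matrix(runif(K * p), K, p)
    for (it in seq_len(iters)) {
      Gm <- Gm * (t(Fm) %*% X) / (t(Fm) %*% Fm %*% Gm + eps)
      Fm <- Fm * (X %*% t(Gm)) / (Fm %*% (Gm %*% t(Gm)) + eps)
    }
    list(F = Fm, G = Gm)
  }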

Ding et al. (2006) impose orthogonality on NMF through the additional constraint $F^T F = I$. This guarantees the uniqueness of the solution and provides a hard clustering interpretation similar to K-means. Since both non-negativity and orthogonality constraints are applied to F, each row of F has at most one non-zero entry, indicating the membership of the corresponding observation in a single cluster. The optimization problem therefore becomes the minimization of the Lagrangian function

$$L(F, G, \Lambda) = \|X - FG\|_F^2 + \mathrm{Tr}\!\left[\Lambda \left(F^T F - I\right)\right],$$

where Tr denotes the trace and $\Lambda$ is a symmetric matrix of Lagrangian multipliers. Ding et al. (2006) ignored the non-negativity of the components in F and relied only on the stationarity condition to solve for a unique value of $\Lambda$. The resulting multiplicative update rules for F and G are

$$F \leftarrow F \circ \frac{X G^T}{F F^T X G^T}, \qquad G \leftarrow G \circ \frac{F^T X}{F^T F G}.$$

Yoo and Choi (2008) later developed multiplicative updates directly from the true (natural) gradient on the Stiefel manifold, as opposed to the additive orthogonality constraint in other work. Kimura et al. (2015) implemented a HALS update scheme on top of Ding et al. (2006)'s method. Both methods were shown to outperform the original ONMF in terms of the orthogonality of their solutions, but they still deviate from strict orthogonality due to the additional non-negativity constraint.

Ding et al. (2010) relaxed the non-negativity constraint on F and proposed Semi-NMF, which extends NMF to matrices with mixed-sign elements. They showed that Semi-NMF can be motivated from a clustering perspective and drew connections with K-means clustering (Ding et al., 2005, 2010). Suppose we apply K-means clustering to X; then F indicates the cluster centroids and G denotes the cluster indicators, i.e., $g_{kj} = 1$ if $x_j$ belongs to cluster k and $g_{kj} = 0$ otherwise. Due to this close relationship with K-means, G is initialized by a K-means clustering of X, with a small constant added to all elements of G to prevent zero-locking during initialization. F and G can then be solved by alternating between

$$F \leftarrow X G^T \left(G G^T\right)^{-1}, \qquad G \leftarrow G \circ \sqrt{\frac{(F^T X)^+ + (F^T F)^- G}{(F^T X)^- + (F^T F)^+ G}}, \qquad (3)$$

where

$$A^+ = \frac{|A| + A}{2}, \qquad A^- = \frac{|A| - A}{2}$$

denote the positive and negative parts of a matrix A. The separation of the positive and negative parts of A preserves the non-negativity of G at each iteration. Semi-NMF has been reported to yield non-sparse solutions, which is not ideal for interpretation.
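For concreteness, one alternating Semi-NMF step, rewritten in the $X \approx FG$ convention used here, can be sketched in R as follows. This is an illustration of (3); the function and helper names are ours, not from any package.

  # One alternating Semi-NMF step (Ding et al., 2010): exact least squares
  # for the mixed-sign F, then a multiplicative update for non-negative G.
  pos_part <- function(A) (abs(A) + A) / 2
  neg_part <- function(A) (abs(A) - A) / 2
  semi_nmf_step <- function(X, Fm, Gm, eps = 1e-9) {
    Fm  <- X %*% t(Gm) %*% solve(Gm %*% t(Gm))
    FtX <- t(Fm) %*% X
    FtF <- t(Fm) %*% Fm
    Gm  <- Gm * sqrt((pos_part(FtX) + neg_part(FtF) %*% Gm + eps) /
                     (neg_part(FtX) + pos_part(FtF) %*% Gm + eps))
    list(F = Fm, G = Gm)
  }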

3 Semi-orthogonal Matrix Factorization for Continuous Matrix

3.1 Motivation

Some challenges and drawbacks of NMF variants in the existing literature are as follows. First, current formulations and algorithms do not yield exactly orthogonal solutions even with orthogonality constraints, since there is a trade-off between non-negativity and orthogonality when solving for the factorized solutions. Second, the rate of convergence is insufficient, and numerical instability can lead to zero-locking or zero values in the denominator for multiplicative-update-based algorithms; in addition, the quality of the solutions is highly dependent on initialization. Third, a non-orthogonal formulation of NMF has potential overfitting and rank-deficiency issues in the basis matrices due to correlation and linear dependence between basis vectors, which can affect the performance of subsequent supervised approaches. The overlapping information between basis vectors also complicates the clustering interpretation in text-mining applications.

To address the above limitations, we propose a new framework and algorithm for matrix factorization problems and introduce the Semi-orthogonal Non-negative Matrix Factorization (SONMF) in the following.

3.2 Methodology

Consider the following matrix factorization problem with cost function

$$\min_{F, G} \; f(F, G) = \|X - FG\|_F^2 \quad \text{subject to } F^T F = I, \; G \ge 0,$$

where $X \in \mathbb{R}^{n \times p}$, $F \in \mathbb{R}^{n \times K}$, and $G \in \mathbb{R}_+^{K \times p}$. We solve this problem by alternately updating the matrices F and G. However, the uniqueness of the proposed method is that it takes advantage of the Stiefel manifold $\mathcal{V}_K(\mathbb{R}^n) = \{F \in \mathbb{R}^{n \times K} : F^T F = I\}$, which is the feasible set for F. In particular, our method enforces the entire solution path of F to lie exactly on this manifold, thereby preserving strict orthogonality. In this section, we first present the proposed orthogonality-preserving update scheme for F, then demonstrate that this strict enforcement of orthogonality leads to an extremely efficient update for G.

3.2.1 Orthogonal Preserving Update Scheme for F

The main challenge in achieving an exactly orthogonal solution for ONMF is the additional non-negativity constraint that is imposed (Ding et al., 2006; Yoo and Choi, 2008; Kimura et al., 2015). In contrast, we allow F to be mixed-sign, which lets us utilize the full gradient of F and perform a line search for a better optimum. Following Wen and Yin (2013), we initialize F with a column-wise orthonormal matrix and then apply the Cayley transformation to preserve orthogonality while performing a line search at each update. In matrix representation, the gradient of the cost function with respect to F is

$$R = \frac{\partial f}{\partial F} = -2\left(X - FG\right)G^T.$$

We can define a skew-symmetric matrix S by

$$S = R F^T - F R^T.$$

The next iterate of F is determined by the Crank-Nicolson-like scheme

$$F(\tau) = F - \frac{\tau}{2}\, S\left(F + F(\tau)\right),$$

where $F(\tau)$ has the closed-form solution

$$F(\tau) = Q F,$$

and Q is obtained via the Cayley transformation,

$$Q = \left(I + \frac{\tau}{2} S\right)^{-1}\left(I - \frac{\tau}{2} S\right). \qquad (4)$$

Thus,

$$F(\tau) = \left(I + \frac{\tau}{2} S\right)^{-1}\left(I - \frac{\tau}{2} S\right) F,$$

where $\tau$ is a step size. The Cayley transformation is commutative, so the order of the two factors in (4) is interchangeable. The above formulation has several desirable properties. Since Q is an orthogonal matrix, $F(\tau)$ is also a matrix with orthonormal columns for all values of $\tau$. In addition,

$$F(\tau)^T F(\tau) = F^T Q^T Q F = F^T F = I;$$

therefore, orthonormality is successfully preserved along the entire solution path. Moreover, $F'(0) = -SF$ equals the projection of $-R$ onto the tangent space of the Stiefel manifold at F. Therefore, $\{F(\tau)\}_{\tau \ge 0}$ is a descent path, which motivates us to perform a line search along it. The step size $\tau$ is adaptive, as a proper step size must be found at each iteration to guarantee convergence of the algorithm; we choose the initial $\tau$ to be 1. Following the idea of backtracking line search, the value of $\tau$ can be adjusted easily by computing the difference between the mean squared errors of the previous and current iterations, $\Delta = \mathrm{MSE}^{(t-1)} - \mathrm{MSE}^{(t)}$: we increase $\tau$ by a factor of 2 when $\Delta > 0$, and divide $\tau$ by a factor of 2 when $\Delta < 0$.

On the other hand, $(I + \frac{\tau}{2}S)^{-1}$ is rather intensive to compute directly, as S is an $n \times n$ matrix. However, typical NMF applications have $K \ll n$, so we can use the Sherman-Morrison-Woodbury (SMW) formula to reduce the inversion to that of a $2K \times 2K$ matrix by rewriting S as a product of two low-rank matrices. Let $U = [R, F]$ and $V = [F, -R]$, both $n \times 2K$; then $S = UV^T$. The SMW formula is given as

$$\left(B + \alpha U V^T\right)^{-1} = B^{-1} - \alpha B^{-1} U \left(I + \alpha V^T B^{-1} U\right)^{-1} V^T B^{-1}$$

for an invertible square matrix B. Letting $B = I$ and $\alpha = \tau/2$, and using $S = UV^T$, we have

$$F(\tau) = \left[I - \frac{\tau}{2} U \left(I + \frac{\tau}{2} V^T U\right)^{-1} V^T\right]\left(F - \frac{\tau}{2} U V^T F\right), \qquad (5)$$

$$F(\tau) = F - \tau U \left(I + \frac{\tau}{2} V^T U\right)^{-1} V^T F, \qquad (6)$$

where in Eq. (5) we replace the third identity matrix I by $(I + \frac{\tau}{2}V^T U)(I + \frac{\tau}{2}V^T U)^{-1}$ and simplify. This leads to the final update rule in Eq. (6).
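In code, one such orthogonality-preserving step is compact. The R sketch below implements the final update (6); the function name and arguments are ours, with R_grad denoting the current gradient of the cost with respect to F.

  # One Cayley-transform step on the Stiefel manifold (Wen and Yin, 2013).
  # Only a (2K x 2K) linear system is solved, thanks to the SMW identity.
  cayley_step <- function(Fm, R_grad, tau) {
    K <- ncol(Fm)
    U <- cbind(R_grad, Fm)   # S = U %*% t(V) is the skew-symmetric matrix
    V <- cbind(Fm, -R_grad)
    M <- diag(2 * K) + (tau / 2) * (t(V) %*% U)
    Fm - tau * U %*% solve(M, t(V) %*% Fm)   # Eq. (6): F(tau) = Q %*% F
  }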

3.2.2 Hierarchical Alternating Least Squares Update Scheme for G

The main idea of the hierarchical alternating least squares (HALS) update scheme (Cichocki et al., 2007) is to combine global updates of the matrices with local updates of their individual columns: instead of updating F or G as whole matrices, their columns are updated one at a time, with each updated column immediately entering the updates of the remaining columns, thereby speeding up the rate of convergence. We adopt this idea for our update of G and show that our strict orthogonality constraint further simplifies the scheme to a matrix-wise alternating least squares update.

Following Cichocki et al. (2007), Cichocki and Phan (2009), and Kimura et al. (2015), we start with the derivation of the column-wise update of G by fixing F. Writing $f_k$ for the k-th column of F and $g_k$ for the k-th row of G, the cost function restricted to the k-th component is

$$f(g_k) = \|X^{(k)} - f_k g_k\|_F^2,$$

where $X^{(k)} = X - \sum_{l \ne k} f_l g_l$ is the residual matrix without the k-th component. To find a stationary point for $g_k$, we set the gradient to zero, which leads to

$$\frac{\partial f}{\partial g_k} = -2 f_k^T \left(X^{(k)} - f_k g_k\right) = 0.$$

Thus, the naive column-wise update scheme of G is

$$g_k \leftarrow \left[\frac{f_k^T X^{(k)}}{f_k^T f_k}\right]_+, \qquad (7)$$

where $[\cdot]_+ = \max(\cdot, 0)$ preserves the non-negativity of G. The algorithm can be simplified due to the strict orthogonality of F on the entire solution path, which implies $f_k^T f_k = 1$. Thus, the updating rule in Eq. (7) simplifies to

$$g_k \leftarrow \left[f_k^T X^{(k)}\right]_+.$$

Since $X^{(k)} = X - FG + f_k g_k$, the complete updating rule for each $g_k$ is given by

$$g_k \leftarrow \left[f_k^T X - f_k^T F G + g_k\right]_+,$$

which yields the least squares algorithm form in Cichocki and Phan (2009) and Kimura et al. (2015). Now, taking advantage of the orthogonality constraint on the entire solution path, we have $f_k^T F G = g_k$. Hence, the updating rule is simply $g_k \leftarrow [f_k^T X]_+$. Since this column-wise update is not affected by changes in the other columns, we are essentially using a simplified matrix-wise ALS update scheme,

$$G \leftarrow \left[F^T X\right]_+. \qquad (8)$$

This simplification is only possible if $F^T F = I$ is exactly preserved throughout the solution path. The final part of this algorithm is to choose a suitable convergence criterion apart from a predefined number of iterations. We define the convergence criterion to be the difference of the objective function values between two consecutive iterations,

$$\left|f^{(t-1)} - f^{(t)}\right| < \epsilon,$$

where any sufficiently small value of $\epsilon$ is a feasible choice, such as 0.0005.
The algorithm for the continuous response is given as follows.

  Input: Arbitrary matrix X, number of basis vectors K.
  Output: Mixed-sign matrix F and non-negative matrix G such that X ≈ FG and F^T F = I.
  Initialization: Initialize F with orthonormal columns (left singular vectors of X) and τ = 1.
  repeat
     G ← [F^T X]_+
     R ← -2(X - FG)G^T
     U ← [R, F];  V ← [F, -R]
     F(τ) ← F - τU(I + (τ/2)V^T U)^{-1} V^T F
     Δ ← MSE(F) - MSE(F(τ))
     repeat
        if Δ > 0 then
           F ← F(τ);  τ ← 2τ
        else if Δ < 0 then
           τ ← τ/2; recompute F(τ) and Δ
        end if
     until Δ > 0
  until Convergence criterion is satisfied
Algorithm 1 Semi-Orthogonal NMF for Continuous X
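Putting the pieces together, a compact end-to-end R sketch of Algorithm 1 might read as follows. This is a sketch under the conventions above, not the MatrixFact package API; cayley_step is the helper sketched in Section 3.2.1, and the defaults are ours.

  # End-to-end sketch of continuous SONMF: SVD initialization for F,
  # G = [t(F) X]_+, and a backtracking line search along the Cayley path.
  sonmf_continuous <- function(X, K, max_iter = 500, tol = 5e-4, tau = 1) {
    Fm  <- svd(X, nu = K, nv = 0)$u             # SVD-based initialization
    Gm  <- pmax(t(Fm) %*% X, 0)
    obj <- mean((X - Fm %*% Gm)^2)
    for (it in seq_len(max_iter)) {
      R_grad <- -2 * (X - Fm %*% Gm) %*% t(Gm)  # gradient w.r.t. F
      repeat {                                  # adaptive step size
        F_new   <- cayley_step(Fm, R_grad, tau)
        G_new   <- pmax(t(F_new) %*% X, 0)
        obj_new <- mean((X - F_new %*% G_new)^2)
        if (obj_new < obj || tau < 1e-12) { tau <- 2 * tau; break }
        tau <- tau / 2
      }
      converged <- abs(obj - obj_new) < tol
      Fm <- F_new; Gm <- G_new; obj <- obj_new
      if (converged) break
    }
    list(F = Fm, G = Gm, mse = obj)
  }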

4 Semi-orthogonal Matrix Factorization for Binary Matrix

For matrices with binary responses, the previous method for the continuous response is not applicable. In this section, we first review some existing methodologies on binary matrix factorization, then proceed to propose our new model.

4.1 Existing Work for Binary Matrix Factorization

Binary data arise often in real-world applications due to their simplicity in representing information; typical applications of binary data analysis and clustering include market-basket data, document-term data, recommender systems, and protein-protein interaction networks. One way to factorize a binary matrix is to do so directly (Li, 2005; Zhang et al., 2007, 2010), where the factorized solutions are also binary. However, this leads to poor interpretation, as these methods do not account for the importance of each entry within the factorized components. Alternatively, methods that utilize the Bernoulli likelihood have been shown to be more effective in this regard. In this formulation, we assume each entry $x_{ij}$ follows an independent Bernoulli distribution with parameter $p_{ij}$, where each $p_{ij}$ is given by the logistic function of the matrix product,

$$p_{ij} = \sigma\!\left((FG)_{ij}\right) = \frac{1}{1 + \exp\!\left(-(FG)_{ij}\right)}.$$

Earlier related work includes probabilistic formulations of principal component analysis (PCA) (Tipping and Bishop, 1999; Collins et al., 2002; Schein et al., 2003), which extend this well-known multivariate analysis method to binary data. In particular, Schein et al. (2003) proposed logistic PCA, where the optimization problem is formulated through the Bernoulli log-likelihood,

$$\max_{F, G} \; L(F, G) = \sum_{i,j} \left[ x_{ij} \log \sigma\!\left((FG)_{ij}\right) + \left(1 - x_{ij}\right) \log\!\left(1 - \sigma\!\left((FG)_{ij}\right)\right) \right]. \qquad (9)$$

However, since this formulation has no non-negativity constraint, the underlying factors might not have a meaningful interpretation.

Alternatively, Schachtner et al. (2010) proposed binary non-negative matrix factorization (binNMF). The binNMF model is similar to logistic PCA in that its formulation uses a non-linear sigmoidal function to map the matrix product FG onto the range of log-odds. The major difference between logistic PCA and binNMF is that binNMF models the log-odds in the Bernoulli distribution as the product of two non-negative factors in accord with a parts-based model, which enforces the entries of both F and G to be strictly non-negative. A similar formulation was also proposed by Larsen and Clemmensen (2015). Tomé et al. (2015) then expanded on binNMF and proposed the logNMF model, where they relaxed the non-negativity constraint on G. Their model requires strictly non-negative basis vectors (F) but is more flexible as it allows negative feature components (G).

We propose a new binary matrix factorization in which the matrix of basis vectors F is orthonormal and the non-negativity constraint is imposed only on G. Tomé et al. (2015) used a gradient ascent scheme with a fixed step size for both F and G, while we apply Newton's method for G and a line search algorithm for F. Note that the Newton's method implemented here for optimization is different from the root-finding Newton-Raphson algorithm. The fixed step size in Tomé et al. (2015) is rather restrictive, as the algorithm converges at only a modest rate with a very small step size, while convergence issues may arise with larger step sizes. In comparison, our model effectively remedies this issue through the line search implementation in F: even if the step size for the update of G is too large, the error can be corrected by choosing a more appropriate step size for F. Additionally, the combination of a line search update for F and Newton's method for G provides more efficient convergence towards the solution.

4.2 Methodology

The idea of the factorization is to find appropriate F and G that maximize the log-likelihood function given in Eq. (9). Writing the logistic sigmoid function in its full form, we define the cost function based on the negative log-likelihood,

$$C(F, G) = -\sum_{i,j} \left[ x_{ij} \log \frac{1}{1 + e^{-(FG)_{ij}}} + \left(1 - x_{ij}\right) \log \frac{e^{-(FG)_{ij}}}{1 + e^{-(FG)_{ij}}} \right]. \qquad (10)$$

Since the second derivative of the cost function is well defined in our problem, we implement Newton's method to find the update rule for G, which allows a faster rate of convergence with a slight increase in computation. This increase is fairly trivial, however, since the gradient and Hessian operate at an element-wise level; the inversion of the Hessian can therefore be avoided, saving a considerable amount of computation time. Assuming both X and F are fixed, the first and second derivatives of the cost function with respect to G are given as

$$\frac{\partial C}{\partial G} = F^T \left( \sigma(FG) - X \right)$$

and

$$\frac{\partial^2 C}{\partial g_{kj}^2} = \sum_i f_{ik}^2 \, \sigma\!\left((FG)_{ij}\right)\left(1 - \sigma\!\left((FG)_{ij}\right)\right).$$

Following Newton's method, the update rule for G in matrix notation is given by

$$G \leftarrow \left[ G - \eta \, \frac{F^T\!\left(\sigma(FG) - X\right)}{(F \circ F)^T \left(\sigma(FG) \circ \left(\mathbf{1} - \sigma(FG)\right)\right)} \right]_+,$$

where $\eta$ is a step size, $\mathbf{1}$ is the matrix of all 1's, and the quotient, product $\circ$, and exponential function here are element-wise operations on matrices.
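Because the Hessian acts element-wise, the Newton step for G requires no matrix inversion. A minimal R sketch (names and the small eps safeguard are ours) is:

  # Element-wise Newton update for G in the binary model (step size eta);
  # the division grad / hess is element-wise, and pmax keeps G non-negative.
  sigma <- function(Z) 1 / (1 + exp(-Z))
  newton_step_G <- function(X, Fm, Gm, eta = 0.05, eps = 1e-9) {
    P    <- sigma(Fm %*% Gm)                 # current probability estimates
    grad <- t(Fm) %*% (P - X)                # dC/dG
    hess <- t(Fm^2) %*% (P * (1 - P)) + eps  # element-wise d2C/dG2
    pmax(Gm - eta * grad / hess, 0)
  }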

For the update rule of F, the only difference from the continuous case is the gradient; the orthogonality-preserving scheme remains the same as in the continuous case. Following a derivation analogous to that for G, the gradient of the cost with respect to F is given as

$$\frac{\partial C}{\partial F} = \left( \sigma(FG) - X \right) G^T.$$

Since the algorithm minimizes the cost function, an over-fitting problem might arise from the nature of the problem: the algorithm seeks to push each estimated probability $\hat p_{ij}$ toward 1 or 0 according to whether $x_{ij}$ is 1 or 0. Since F is constrained to be orthonormal, the scale of the approximation depends solely on G, so larger values in G increase the risk of over-fitting. To avoid this issue, the step size for the update of G needs to be relatively small; in our algorithm, we choose the default value to be 0.05.

  Input: Arbitrary matrix X with binary elements, number of basis vectors K.
  Output: Mixed-sign matrix F and non-negative matrix G such that X ≈ σ(FG).
  Initialization: Initialize F with orthonormal columns, G arbitrary, η = 0.05, and τ = 1.
  repeat
     P ← σ(FG)
     G ← [G - η (F^T(P - X)) / ((F ∘ F)^T (P ∘ (1 - P)))]_+   (element-wise quotient)
     R ← (σ(FG) - X)G^T
     U ← [R, F];  V ← [F, -R]
     F(τ) ← F - τU(I + (τ/2)V^T U)^{-1} V^T F
     Δ ← C(F) - C(F(τ))
     repeat
        if Δ > 0 then
           F ← F(τ);  τ ← 2τ
        else if Δ < 0 then
           τ ← τ/2; recompute F(τ) and Δ
        end if
     until Δ > 0
  until Convergence criterion is met
Algorithm 2 Semi-Orthogonal NMF for Binary X

5 Simulation and Data Experiments

We evaluate the performance of our model through several data experiments. In the first part, we compare our model with several well-established NMF algorithms for the continuous case under different simulation settings. For the binary version, we show that both the cost function and the difference between the true and estimated probability matrices converge monotonically under our algorithm, along with a comparison against another state-of-the-art model. In the second part, we apply our method to real text data sets composed of triage notes from a hospital and evaluate classification performance on whether a patient should be admitted to the hospital for additional supervision. Overall, our method performs best in improving classification accuracy over the conventional bag-of-words model, compared with the other NMF variants. Finally, we discuss the interpretation of the word clusters extracted by our model by examining the largest positive and smallest negative word loadings in the F matrix, and we show that these two clusters of words can be interpreted as further sub-bi-clusters.

We developed an R package, "MatrixFact", that implements the two proposed methods in this paper. In addition, the package implements the original NMF (Lee and Seung, 2001), ONMF (Kimura et al., 2015), Semi-NMF (Ding et al., 2010), and logNMF (Tomé et al., 2015). To the best of our knowledge, existing R packages only provide algorithms for regular NMF and lack access to these other methods; our package bridges this gap. The package is available on GitHub and can be installed from R via devtools::install_github("cronshells/MatrixFact").

5.1 Simulation for Continuous Case

For the continuous case, we evaluate the average residual and the orthogonal residual,

$$\text{Average Residual} = \frac{\|X - \hat{F}\hat{G}\|_F^2}{np}, \qquad (11)$$

$$\text{Orthogonal Residual} = \|\hat{F}^T \hat{F} - I\|_F. \qquad (12)$$

In addition to the two main criteria above, we also evaluate how well the algorithms recover the original matrices F and G by calculating the differences between the column spaces of the true underlying matrices F, G and the approximated matrices $\hat F$, $\hat G$ from the factorization. That is,

$$D_F = \|P_F - P_{\hat F}\|_F, \qquad D_G = \|P_G - P_{\hat G}\|_F, \qquad (13)$$

where $P_F$, $P_G$, $P_{\hat F}$, and $P_{\hat G}$ are the projection matrices of their respective counterparts, e.g. $P_F = F(F^T F)^{-1}F^T$. Additionally, we investigate the sparsity of the factorized solutions: we shrink all elements whose magnitude falls below a small threshold to 0 and then evaluate the proportion of 0's within F and G respectively.
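These criteria translate directly into code. The R sketch below follows the standard definitions; the exact normalizations in (11)-(13) are assumptions on our part, since they are not fully recoverable from the text.

  # Evaluation criteria sketches: average residual, orthogonal residual,
  # and the distance between two column spaces via projection matrices.
  avg_residual  <- function(X, Fm, Gm) mean((X - Fm %*% Gm)^2)
  orth_residual <- function(Fm) norm(t(Fm) %*% Fm - diag(ncol(Fm)), "F")
  space_diff    <- function(A, B) {   # A, B: matrices with full column rank
    PA <- A %*% solve(t(A) %*% A, t(A))
    PB <- B %*% solve(t(B) %*% B, t(B))
    norm(PA - PB, "F")
  }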

We compare our method with three other popular NMF methods: NMF with multiplicative updates (Lee and Seung, 2001), Semi-NMF (Ding et al., 2010), and ONMF (Kimura et al., 2015). The simulations are conducted on an i7-7700HQ with four cores at 3.8 GHz. Three different scenarios are considered:

  1. Mixed-sign F and non-negative G;

  2. Non-negative and orthogonal F, and non-negative G;

  3. Orthonormal (mixed-sign) F and non-negative G.

After generating the true underlying F and G, we construct the true X as

$$X = FG + E,$$

where E is a matrix of random error with independent entries $e_{ij} \sim N(0, \sigma^2)$. In this simulation experiment, we consider K = 10, 30, and 50. As a side note, refer to section 8.2 of the appendix for further discussion of the case where the true rank of F and G is less than the target factorization rank.

Initialization of NMF methods is crucial and has been extensively studied (Xue et al., 2008; Langville et al., 2006, 2014; Boutsidis and Gallopoulos, 2008), as good initialization allows an algorithm to achieve a better local minimum at a faster convergence rate. An excellent choice of starting point for F is the left singular matrix U of the singular value decomposition of X. We apply the truncated SVD to decompose X into its best rank-K factorization,

$$X \approx U_K \Sigma_K V_K^T,$$

where K is the rank of the target factorization. The truncated SVD is implemented as it provides the best rank-K approximation of any given matrix (Eckart and Young, 1936; Wall et al., 2003; Gillis and Kumar, 2014; Qiao, 2015). Furthermore, the formulation of our model does not require an initialization of G, since the update rule for G given in (8) depends only on X and F. For more discussion of initialization, please refer to section 8.1 of the appendix.

For Semi-NMF, we apply the K-means initialization to F and G following Ding et al. (2010). Lee and Seung (2001) and Kimura et al. (2015) proposed random initialization for NMF and ONMF respectively. For a fair comparison, we instead initialize their F using a slightly modified SVD approach, where we truncate all negative values of U to a small positive constant to enforce non-negativity and avoid the zero-locking problem. We then apply our update rule for G as the initialization of G, i.e. $G = [F^T X]_+$.

The average values of the above four criteria over 200 simulation trials with different underlying true F and G are reported for the three scenarios in Tables 1, 2, and 3 respectively, where each trial is set to run 500 iterations. We display the convergence plots of the objective function over the first 250 iterations in Figures 1, 2, and 3, where the convergence criterion under consideration is $|f^{(t-1)} - f^{(t)}| < \epsilon$. The last two columns of Tables 1, 2, and 3 give the time and the number of iterations each algorithm needs to reach this criterion.

Figure 1: Convergence comparison under Scenario 1 using the average residual (11).

Method     K    Avg. Residual   Orth. Residual   D_F     D_G     F Sp. (%)   G Sp. (%)   Time (Secs)   Iterations Until Threshold
SONMF      10   0.0878          7.16             0.461   0.347   0           0           0.58          10.6
SONMF      30   0.0807          4.23             0.694   0.634   0           0           1.07          11.7
SONMF      50   0.0750          9.29             0.913   0.840   0           0           1.95          12.9
ONMF       10   0.2631          0.040            3.504   0.653   87.06       0           0.76          17.3
ONMF       30   0.7440          0.439            7.031   3.276   94.63       0.04        1.71          44.3
ONMF       50   1.1684          1.404            8.753   3.008   95.00       0.13        3.78          70.29
NMF        10   0.1598          N/A              1.390   1.773   1.39        1.77        3.20          292.1
NMF        30   0.4190          N/A              3.344   4.001   0.93        0.57        11.07+        500+
NMF        50   1.5420          N/A              5.493   5.854   0.82        0.41        18.48+        500+
Semi-NMF   10   0.1206          N/A              0.174   0.207   0           0           4.13          480.3
Semi-NMF   30   0.1879          N/A              1.839   3.499   0           0           11.35+        500+
Semi-NMF   50   0.2708          N/A              2.609   5.006   0           0           18.571+       500+

Table 1: Comparisons of the proposed method with other NMF methods on factorization accuracy, sparsity of the solutions, computation time, and convergence speed under simulation scenario (1). Sparsity measures are given in percentages; D_F and D_G are the column-space differences in (13). A plus sign in the last two columns indicates that the convergence threshold was not satisfied after 500 iterations, so the required time and iterations exceed the displayed values.
Figure 2: Convergence comparison under Scenario 2 using the average residual (11).

Method     K    Avg. Residual   Orth. Residual   D_F     D_G     F Sp. (%)   G Sp. (%)   Time (Secs)   Iterations Until Threshold
SONMF      10   0.0765          1.08             0.278   0.485   0           3.95        0.77          10.1
SONMF      30   0.0729          1.29             0.822   1.160   0           5.39        1.12          8.3
SONMF      50   0.0678          1.34             1.408   1.774   0           4.77        1.59          7.8
ONMF       10   0.0747          0.578            0.492   0.595   59.5        0.021       0.79          18.9
ONMF       30   0.0703          0.975            0.638   1.085   72.9        0.496       1.79          40.5
ONMF       50   0.0666          1.636            0.669   1.575   76.6        2.51        3.38          60.0
NMF        10   0.1064          N/A              0.181   0.597   24.9        20.8        3.73          318.9
NMF        30   0.1297          N/A              0.469   1.497   38.4        30.6        11.71         394.2
NMF        50   0.1421          N/A              1.171   2.696   42.0        33.3        19.33         411.1
Semi-NMF   10   0.0749          N/A              0.277   0.367   0           0           0.39          29.3
Semi-NMF   30   0.0676          N/A              0.816   0.931   0           0           1.28          39.1
Semi-NMF   50   0.0628          N/A              1.401   1.586   0           0           2.49          49.6

Table 2: Comparisons of the proposed method with other NMF methods on factorization accuracy, computation time, and convergence speed under simulation scenario (2).
Figure 3: Convergence comparison under Scenario 3 for SONMF and Semi-NMF using the average residual (11).

Method     K    Avg. Residual   Orth. Residual   D_F     D_G     F Sp. (%)   G Sp. (%)   Time (Secs)   Iterations Until Threshold
SONMF      10   0.0857          4.6              3.665   3.679   0           18.93       0.84          6.7
SONMF      30   0.0770          2.8              6.351   6.375   0           17.37       1.27          7.6
SONMF      50   0.0694          1.6              7.976   8.011   0           16.59       1.88          8.8
Semi-NMF   10   0.0823          N/A              3.661   3.598   0           0           1.31          33.6
Semi-NMF   30   0.0758          N/A              6.336   6.330   0           0           1.86          37.1
Semi-NMF   50   0.0670          N/A              7.960   7.961   0           0           2.37          40.2

Table 3: Comparisons of the proposed method with Semi-NMF on factorization accuracy, computation time, and convergence speed under simulation scenario (3). Only the first 50 iterations are shown in the figure, as both models have already reached the convergence criterion.

Tables 1-3 show that SONMF has several advantages over the other methods. First, our model converges quickly and consistently regardless of the structure of the true matrix considered in the simulation. For example, our model reaches the convergence criterion and the true error in about 10 iterations on average, greatly surpassing the rate of convergence of the other models. The deterministic SVD-based initialization gives us a head start over the other methods and places us on an excellent solution path, leading to rapid convergence to the true error; the line search implementation further speeds up the convergence rate.

For scenario (1), our model is significantly better in terms of factorization accuracy and recovery of the true matrices, as shown by the smallest average residual and column-space differences, especially for larger K. For ONMF and NMF, the mean values fail to converge to the true error, mainly due to the large number of saddle points that the true F possesses, as reflected in the large D_F values. Due to their more restricted parameter spaces, it is harder for NMF, and especially ONMF, to escape a bad solution path, unlike Semi-NMF, whose parameter space is less confined. Additionally, the ONMF proposed by Kimura et al. (2015) does not guarantee a monotonic decrease of the objective function, owing to the additional orthogonality constraint. Semi-NMF has the fewest constraints among these four models and eventually converges to the true error, but it is computationally heavy due to its slow rate of convergence.

When the underlying structure of F is better defined, as in scenario (2), all four models converge to the true error, with NMF having the slowest rate of convergence. For factorization accuracy, our model outperforms ONMF and NMF, but performs slightly worse than Semi-NMF because of the extra orthogonality constraint. In terms of recovering the true underlying matrix, it is not surprising that NMF and ONMF recover F better, as the true F is easier to estimate due to its particular structure.

For scenario (3), our model and Semi-NMF perform similarly, although we retain an advantage in the rate of convergence. As in scenario (2), Semi-NMF performs slightly better than our model in factorization accuracy and recovery of the true matrix due to its less confined parameter space. Lastly, since our algorithm preserves strict orthogonality throughout the entire process, SONMF has a negligible orthogonality residual, attributable only to floating-point error, as opposed to the orthogonality residual of ONMF, which grows with the number of basis vectors.

The strictly non-negative versions yield sparser solutions than the relaxed variants, which is not surprising, as the sparsity of NMF solutions has been well studied in previous literature (Lee and Seung, 2001; Guan and Dy, 2009; Ding et al., 2010). Semi-NMF produces few or no sparse entries in all three scenarios, consistent with Ding et al. (2010)'s findings. Our model has a slight advantage over Semi-NMF on this criterion, with a moderate degree of sparsity in the third scenario; the empirical studies on real data sets below show that our algorithm yields a moderate degree of sparsity in the G matrix as well.

5.2 Simulation for Binary Case

For the binary response, we use the mean value of the cost function in equation (10) as our evaluation criterion instead of the normalized residual. That is,

$$\text{Average Cost} = \frac{C(\hat F, \hat G)}{np}, \qquad (14)$$

where np is the total number of elements in X. We also consider the orthogonal residual and the column-space differences given in equations (12) and (13) respectively. Additionally, we evaluate the difference between the true probability matrix and the estimated probability matrix from our binary method,

$$\Delta P = \|P - \hat{P}\|_F, \qquad \hat{P} = \sigma(\hat F \hat G).$$

We show that both the objective function and the difference between the true and estimated probability matrices decrease monotonically until convergence.

For the binary simulation setting, we generate a mixed-sign F and a non-negative G. We then construct the true probability matrix P using the logistic sigmoid function,

$$P = \sigma(FG),$$

add a matrix of random error E to P, and finally generate the true X with each $x_{ij} \sim \text{Bernoulli}(p_{ij})$, where X has dimension 500-by-500. Similar to the continuous case, we consider K = 10, 30, and 50. The average values of the above five criteria over 50 simulation trials are reported. We compare the performance of our method with logNMF (Tomé et al., 2015), whose gradient-ascent step size is set to 0.001.
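A sketch of this generating process in R is below; the dimensions follow the text, while the specific distributions for F and G are our assumptions, and the additive error step is omitted for simplicity.

  # Generate a 500 x 500 binary X: orthonormal mixed-sign F, non-negative G,
  # logistic link to form P, then entry-wise Bernoulli draws.
  set.seed(1)
  n <- 500; p <- 500; K <- 10
  Fm <- qr.Q(qr(matrix(rnorm(n * K), n, K)))   # mixed-sign, orthonormal columns
  Gm <- matrix(runif(K * p), K, p)             # non-negative weights
  P  <- 1 / (1 + exp(-(Fm %*% Gm)))            # true probability matrix
  X  <- matrix(rbinom(n * p, 1, as.vector(P)), n, p)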

For our model, we set the step size of the Newton's method update for G to 0.05. The initialization for F and G is the same as in the continuous case. The stopping condition for both models is either a maximum of 500 iterations or a change in cost between consecutive iterations of less than 0.0001.

5.2.1 Results for Simulation (Binary)

Figure 4: Comparison of performance criteria between our method and logNMF. Top: mean cost; bottom: difference between the estimated and true probability matrices.

Method           K    Avg. Cost   Orth. Residual   ||P - P_hat||_F   D_F     D_G     F Sp. (%)   G Sp. (%)   Time (Secs)   Iterations Until Threshold
SONMF (Binary)   10   0.1718      1.814            42.001            1.696   1.029   0           15.56       23.52         228.3
SONMF (Binary)   30   0.0845      7.552            59.472            3.821   2.456   0           23.77       40.57         316.0
SONMF (Binary)   50   0.0220      1.012            65.114            5.627   4.111   0           28.03       64.21         407.9
logNMF           10   0.1725      N/A              47.128            1.711   1.054   18.10       0           29.35         323.1
logNMF           30   0.0861      N/A              62.670            3.891   2.717   26.02       0           49.60         471.5
logNMF           50   0.0589      N/A              66.028            5.658   4.456   32.03       0           60.67+        500+

Table 4: Comparisons of the proposed method with logNMF on factorization accuracy, sparsity of the solutions, computation time, and convergence speed. A plus sign indicates that the convergence threshold was not satisfied after 500 iterations.

The results above show that our model converges faster toward the true cost and attains an overall lower mean cost than logNMF. Our model reaches the true cost at around 150 iterations, while logNMF requires more than 300. The number of iterations required to converge also increases with K, since more basis vectors mean more parameters to estimate. Additionally, our model has lower errors for the probability-matrix difference and the column-space differences when both algorithms reach the convergence criterion. The sparsity of the solutions is similar.

Unlike the continuous case, the SVD-based initialization does not provide rapid convergence to the true error for our model. The reason is that the SVD is applied to X, while F and G estimate the underlying probability matrix of X rather than X itself. For each K, the difference between P and the estimated probability matrix converges once the average cost of the factorization reaches the true cost. Further iterations cause the model to over-fit: the estimated probabilities tend toward 1 when $x_{ij}$ is 1 and toward 0 when $x_{ij}$ is 0, which in turn increases the difference from the true P.

An important caveat is that the rate of convergence of our model can, to a certain extent, be increased or decreased through the step size for G. A smaller step size results in a more conservative solution with a slower rate of convergence, while a larger step size achieves the opposite. In our numerical experiments, 0.05 or 0.1 turned out to be a good step size with respect to this trade-off between the rate of convergence and the risk of over-fitting, whereas a step size greater than 1 resulted in convergence issues.

5.3 Case Study (Triage Notes)

In this section, we focus on applying our model to real text data sets. Our goal is to build a machine learning model that classifies whether a patient's condition is severe enough to warrant further inspection at the hospital. We show that the classification error can be reduced by a linear change of basis with our model for a suitable number of basis vectors. We also present the interpretation of the word clusters our model identifies in the considered data sets.

The triage data are provided by the Alberta Medical Center. There are approximately 500,000 patients (observations), each with a response variable, the disease (medical complaint) of the patient, and a column of text collected during the triage phase describing the reason for the visit. Demographic information and vital signs are also recorded, but we focus only on the text in this study. The response variable is a binary indicator of the disposition of the patient, i.e., whether the patient is admitted or discharged from the hospital after the doctor's examination. The vocabulary used in the notes differs considerably across diseases, so we believe it is necessary to consider each disease separately. For this study, we analyze 7 triage data sets, each related to a different type of disease.

We first pre-process the data with the tm package (Feinerer et al., 2008) by removing numbers, punctuation, and a standard list of default stop words, and by stemming words to their root form. We then represent the text data using a vector-space model, also known as bag-of-words (Salton et al., 1975). Given n documents, we construct a word-document matrix whose (i, j) entry corresponds to the occurrence or significance of word i in document j, depending on the weighting scheme. In this study, we use tf-idf weighting for the continuous case and binary weighting for the binary case (Gupta et al., 2009).
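A typical construction of the bag-of-words matrix with the tm package might look like the following sketch (standard tm calls; the variable notes is a hypothetical character vector of triage notes):

  # Clean the notes and build a tf-idf weighted document-term matrix;
  # binary weighting is analogous (weighting = weightBin).
  library(tm)
  corpus <- VCorpus(VectorSource(notes))
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  corpus <- tm_map(corpus, stemDocument)   # stemming, requires SnowballC
  dtm <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))
  # t(as.matrix(dtm)) then gives the word-by-document matrix used here.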

For classification, we denote the training and testing bag-of-words matrices as $X_{\text{train}}$ and $X_{\text{test}}$ respectively. Applying the matrix factorization method to $X_{\text{train}}$ yields a word-topic matrix F and a topic-document matrix G, where the mixed-sign and orthogonality constraints are imposed on the word-topic vectors in F.

After obtaining the factorized solution, we then project both $X_{\text{train}}$ and $X_{\text{test}}$ onto the column space of F. Let $H_{\text{train}} = F^T X_{\text{train}}$ and $H_{\text{test}} = F^T X_{\text{test}}$; then $H_{\text{train}}$ and $H_{\text{test}}$ are reduced-dimension representations of $X_{\text{train}}$ and $X_{\text{test}}$, and they effectively replace the original matrices as the new feature matrices. Depending on the number of basis vectors, this transformation allows us to effectively increase the classification accuracy while decreasing the time needed to train a model, owing to the reduced dimension of the feature matrix. Intuitively, we can regard F as a summary device, where each topic/basis vector consists of a linear combination of different words; after applying the projection, each column of H summarizes a document as a linear combination of the topics in F.
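Since F has orthonormal columns, the projection step reduces to a single matrix product, as in the brief sketch below (variable names are ours):

  # Project word-by-document matrices onto the learned topic space;
  # Fm is word-by-topic with t(Fm) %*% Fm = I, so coordinates are t(Fm) X.
  project_docs <- function(X, Fm) t(t(Fm) %*% X)   # documents x topics
  # H_train <- project_docs(X_train, Fm); H_test <- project_docs(X_test, Fm)
  # H_train and H_test then serve as feature matrices for the LASSO fit.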

We apply 5-fold cross-validation for classification, with results averaged over 20 different runs, where the observations in each run are randomly assigned to folds with stratified sampling via the caret package (Kuhn, 2008) to avoid unbalanced training and testing sets. We compare our model with four other NMF methods under the same procedure. For tf-idf weighting, we apply our continuous model and compare its performance with NMF (Lee and Seung, 2001), ONMF (Kimura et al., 2015), and Semi-NMF (Ding et al., 2010). For binary weighting, we compare with logNMF (Tomé et al., 2015). The stopping criteria are either a maximum of 200 iterations or a change in the cost function (11) or (14) between iterations below a small tolerance. During our experiments, we noticed that the step size suggested by Tomé et al. (2015) was too large for logNMF on these data sets, leading to unstable performance; we therefore set a smaller step size for logNMF in these experiments. We use LASSO (Tibshirani, 1996) via the glmnet package (Friedman et al., 2010) for classification. We show that our factorization method not only improves the classification result over the naive bag-of-words but also outperforms other matrix factorization methods. In addition, we present the average residual from the factorization and the sparsity percentages of the solutions; note that these two measurements are computed from the full bag-of-words matrix, not from the training folds in cross-validation.

5.3.1 Results for Classification of Triage Notes

The 7 triage data sets considered in this study are given in the table below, along with the dimension of each data set, the baseline accuracy of classifying all observations as discharged, and the classification accuracy of LASSO on the raw document-word matrix.

Data set                         Dimension (docs x words)   Proportion Discharged (%)   LASSO Accuracy, tf-idf (%)   LASSO Accuracy, binary (%)
Altered Level of Consciousness   5220 x 5126                48.85                       73.79                        73.65
Cough                            13084 x 5876               84.35                       87.25                        87.30
Fever                            7302 x 4770                77.20                       81.33                        81.45
General Weakness                 7442 x 5455                47.79                       69.34                        69.45
Lower Extremity Injury           12377 x 5180               82.50                       88.36                        88.45
Shortness of Breath              9322 x 4659                55.04                       74.24                        74.17
Stroke                           5036 x 3869                45.17                       74.41                        74.19

Table 5: Data sets considered in this study, with the dimension of the document-word matrix after the data cleaning process and the classification accuracies of the majority class and of LASSO (baselines for classification accuracy).

The factorization and classification results for the first two datasets are provided in the following tables. Note that the sparsity of the solutions and classification accuracy are given in percentages.

Altered Level of Consciousness
Method     K     Residual   Sparsity (F, G)   LASSO
SONMF      10    0.1374     (0, 23.29)        73.98
SONMF      30    0.1302     (0, 34.53)        74.11
SONMF      50    0.1246     (0, 40.03)        74.59
SONMF      100   0.1132     (0, 47.15)        74.92
SONMF      150   0.1047     (0, 50.53)        75.24
NMF        10    0.1380     (65.52, 45.10)    72.60
NMF        30    0.1315     (79.20, 59.26)    73.13
NMF        50    0.1262     (83.46, 65.04)    73.57
NMF        100   0.1157     (87.94, 73.91)    74.35
NMF        150   0.1072     (89.95, 79.01)    74.53
ONMF       10    0.1378     (66.24, 40.51)    73.04
ONMF       30    0.1311     (77.51, 60.84)    73.47
ONMF       50    0.1260     (81.01, 70.08)    73.69
ONMF       100   0.1156     (83.74, 83.33)    74.63
ONMF       150   0.1077     (84.74, 89.12)    74.95
Semi-NMF   10    0.1372     (0, –)            73.49
Semi-NMF   30    0.1299     (0, –)            74.20
Semi-NMF   50    0.1239     (0, –)            74.52
Semi-NMF   100   0.1118     (0, –)            74.82
Semi-NMF   150   0.1022     (0, –)            74.93

Cough
Method     K     Residual   Sparsity (F, G)   LASSO
SONMF      10    0.0943     (0, 19.47)        86.59
SONMF      30    0.0882     (0, 32.08)        87.31
SONMF      50    0.0836     (0, 39.19)        87.44
SONMF      100   0.0747     (0, 45.18)        87.58
SONMF      150   0.0679     (0, 48.86)        87.71
NMF        10    0.0948     (64.76, 39.84)    85.26
NMF        30    0.0891     (77.81, 53.56)    86.70
NMF        50    0.0846     (82.09, 60.59)    87.07
NMF        100   0.0757     (86.68, 69.07)    87.28
NMF        150   0.0688     (88.73, 73.20)    87.49
ONMF       10    0.0946     (67.66, 35.51)    85.13
ONMF       30    0.0888     (77.98, 57.69)    87.18
ONMF       50    0.0843     (81.31, 68.48)    87.25
ONMF       100   0.0756     (84.84, 80.23)    87.31
ONMF       150   0.0688     (86.11, 85.41)    87.52
Semi-NMF   10    0.0942     (0, –)            86.74
Semi-NMF   30    0.0880     (0, –)            87.36
Semi-NMF   50    0.0832     (0, –)            87.44
Semi-NMF   100   0.0740     (0, –)            87.55
Semi-NMF   150   0.0665     (0, 0)            87.67

Table 6: Results of the continuous NMF variants on the tf-idf bag-of-words matrix over different numbers of topics.
Altered Level of Consciousness (Binary)
Method   K     Cost     Sparsity (F, G)   LASSO
SONMF    10    0.0172   (0, 0.04)         70.33
SONMF    30    0.0127   (0, 0.77)         71.76
SONMF    50    0.0102   (0, 1.69)         72.63
SONMF    100   0.0055   (0, 4.13)         73.58
SONMF    150   0.0030   (0, 7.28)         73.93
logNMF   10    0.0182   (0.06, 0)         55.17
logNMF   30    0.0186   (0.10, 0)         58.54
logNMF   50    0.0232   (0.99, 0)         61.03
logNMF   100   0.0367   (8.53, 0)         64.38
logNMF   150   0.0501   (16.28, 0)        66.32

Cough (Binary)
Method   K     Cost     Sparsity (F, G)   LASSO
SONMF    10    0.0210   (0, 0.72)         84.43
SONMF    30    0.0174   (0, 1.98)         84.92
SONMF    50    0.0080   (0, 2.53)         85.70
SONMF    100   0.0045   (0, 3.32)         87.17
SONMF    150   0.0039   (0, 4.75)         87.42
logNMF   10    0.0178   (0.02, 0)         84.36
logNMF   30    0.0185   (0.05, 0)         84.40
logNMF   50    0.0213   (1.16, 0)         84.44
logNMF   100   0.0277   (2.56, 0)         84.75
logNMF   150   0.0402   (5.33, 0)         85.25

Table 7: Results of the binary NMF variants on the binary bag-of-words matrix over different numbers of topics.

For the continuous case, the mean residual exhibits a consistent and reasonable pattern across the presented data sets. Semi-NMF has the lowest mean residual, followed closely by our method, which outperforms both ONMF and NMF. This is expected, as the mean residual depends on the number of constraints imposed on the model. Both ONMF and NMF yield significantly sparser results than the mixed-sign variants. Our model yields increasingly sparse solutions in the G matrix as the number of topics increases, which is advantageous for interpretation compared to the dense solutions found by Semi-NMF. On the other hand, neither binary method yields sparse results.

Depending on the number of basis vectors, applying the factorization and projection leads to a modest change in performance relative to the bag-of-words model. We also observe that our model retains higher classification accuracy than the other methods when under-fitting. In general, classification performance for the continuous case is better than for the binary case, with logNMF performing significantly worse than the others; we therefore recommend the continuous bag-of-words and factorization methods in practice. Among these models, the methods with an orthogonality constraint show a slight improvement over the unconstrained ones, as orthogonality gives a clearer cluster representation. Since orthogonality implies uncorrelated projection directions, it further improves the performance of the LASSO model by eliminating multicollinearity during model fitting. Furthermore, SONMF and Semi-NMF both perform consistently better than their non-negative counterparts, perhaps because the less confined parameter space for F yields a more accurate representation of the original data set.

SONMF generally improves over the other methods, with significantly better performance than the non-negative variants. The classification accuracy increases as the number of topics increases, but with diminishing returns; based on our experiments, more than 150 basis vectors does not provide a substantial improvement in classification performance. Aside from over-fitting, the computational cost of factorizing such a large bag-of-words matrix increases sharply with K, so the trade-off is not warranted.

5.3.2 Interpretation of Word Clusters

The basis vectors above can be interpreted as topics through a latent connection between the words and documents under the bag-of-words model. Since the word-topic matrix F is strictly orthogonal, our formulation of the problem yields orthogonal topic vectors. Orthogonal topic vectors have previously been considered by Ding et al. (2006); in their formulation, each word can belong to only a single cluster, which creates distinct, uncorrelated topics with non-overlapping meanings. However, we believe this implication is too strong, as most words have multiple meanings and could belong to multiple topics. Relaxing non-negativity on the orthogonal matrix thus preserves the interpretation of uncorrelated topic vectors while allowing words to belong to multiple clusters.

Aside from uncorrelated topic vectors, representing the loading of words through a linear combination of positive and negative terms also leads to a different interpretation from the conventional by-parts representation. A positive loading indicates that a word belongs to the cluster with positive strength, while a negative loading represents the distance of a word from a specific topic. Words with large positive values under a topic are the most representative of that topic, and they tend to appear together and be highly correlated with each other. On the other hand, words with the most negative values deviate significantly from the topic and can be viewed as a separate cluster of antonyms of the positive words within the same column. This interpretation creates a bi-clustering structure within a topic, in contrast to the zero representation of words in the non-negative case. As shown in our empirical results, the topic vectors in F are not sparse; therefore, interpretation of a topic vector should rely on the words whose loadings have the largest absolute values, while words with small magnitudes can be regarded as noise.

In this section, we present examples of the word-topic vectors generated by our method from the ’Lower Extremity Injury’ and ’Symptoms of Stroke’ data sets.

Lower Extremity Injury
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
Positive
hip weight knee xray ago
rotate bear pedal done day
glf ankle puls fracture week
break ubl strong told soccer
morphin roll fell show twist
Negative
ankle knee ankle alright head
knee ambulance ago date land
lower ice day now pulse
swell fall soccer able pedal
calf land play page back
Table 8: Words with the largest positive values and smallest negative values under first five topics for the Lower Extremity Injury data set, where glf stands for ground-level fall and ubl stands for Ubiquitin-like protein.
Symptoms of Stroke
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
Positive
left team right equal episode
side stroke side grip min
leg see eye strong last
arm aware facial steady resolve
weak place face strength approx
Negative
deny deny left hours day
note resolve family tingly note
right symptom weak unsteady onset
episode home state sudden side
state week confus yesterday place
Table 9: First five topics of the Symptoms of Stroke data set, where gcs stands for glasgow coma scale.

From Table 8, each topic specifically mentions the location of and reason for the injury. Examining the words with positive weights, we can interpret Topic 1 as mainly about injuries from falling, Topic 2 about ankle injuries, Topic 3 about knee injuries from biking, Topic 4 about x-rays, and Topic 5 about soccer. There is no overlap of meanings between topics, and the words with negative weights under the same topic refer to a completely different location and cause; for instance, Topics 3 and 5 are almost complete opposites of each other. The meaning of the topics also fits our intuition: it is unlikely that patients who injured their knees biking would also twist their ankles playing soccer, since people tend to refrain from further physical activity after either injury. The contrast representation is even more evident in the Symptoms of Stroke data set, shown in Table 9. For Topics 1 and 3, our model correctly separates 'left' from 'right'. For Topic 4, we also observe that the words 'steady' and 'unsteady' have been placed with opposite signs under the same topic. These examples show that our model not only clusters correlated terms meaningfully but also differentiates between dissimilar words.

6 Conclusion

In this paper, we proposed the semi-orthogonal non-negative matrix factorization (SONMF) as a new extension of existing NMF methods, with formulations for the continuous and binary cases respectively. Our model relaxes the non-negativity constraint and adds an orthogonality constraint on F while retaining the non-negativity of G. Implementing the Cayley transformation within our update scheme yields a strictly orthogonal solution, a vital property not available in the existing literature. Additionally, the orthogonal formulation yields uncorrelated topic vectors for text mining applications, along with a sub-bi-clustering structure within each topic.

For matrices with continuous entries, our model implements a hierarchical alternating least squares update scheme for G and a line search algorithm along the Stiefel manifold for F to preserve strict orthogonality. The strict orthogonality further simplifies the column-wise update scheme for G to a matrix-wise non-negative least squares update. For binary matrices, our model applies a logistic regression formulation to approximate the underlying true probability matrix, incorporating second-order information through Newton's method to update G and the same line search algorithm to update F as in the continuous case. Through numerical experiments and real data, we have shown that our method is competitive in factorization accuracy, degree of orthogonality, rate of convergence, classification accuracy, and interpretability in text mining applications.

Similar to other NMF methods, our model is capable of performing clustering effectively. Since our G is only non-negative, our model has a soft clustering interpretation, where each observation or feature can be mapped to multiple clusters (Ding et al., 2006, 2010). The orthogonality constraint on F enforces the identified cluster centroids to be orthogonal to each other, which maximizes the contrast and separation between clusters. The relationship with clustering can also be drawn through the connection with PCA/SVD, where $X = UDV^T$. Absorbing D into V, we can re-write this as $X = U(VD)^T$, where U has orthonormal columns and VD is mixed-sign. The F matrix in our formulation is thus similar to U, and our G can be viewed as the analogue of $(VD)^T$ with the additional enforcement of non-negativity; G can also be viewed as the principal components and F as the rotation matrix. It has been well studied in previous literature that PCA effectively performs K-means clustering (Zha et al., 2002; Ding et al., 2005). In this sense, G serves as the cluster indicator, and the principal directions in F project all data points onto the subspace spanned by the K cluster centroids (Ding et al., 2005, 2006).

Additional work can be conducted in the future on adding sparsity and smoothness to the model, as well as on addressing the over-fitting problem that is prevalent in logistic regression formulations. Another interesting direction is to investigate the convergence rate of optimization on manifolds. Boumal et al. (2016) recently established global convergence rate guarantees for Riemannian gradient descent methods (optimization on the Stiefel manifold being a special case) in terms of the tolerance level, and we believe similar convergence rates can be obtained for our method. Finally, a drawback of our model is that the projection used to preserve the non-negativity of G is still ad hoc; investigating the theoretical convergence properties of SONMF will require a new non-negativity-preserving update scheme for G.

7 Acknowledgments

This research is partially supported by NSF DMS-1613190 and DMS-182119. We thank the two anonymous reviewers and the action editor for their valuable comments and feedback.

8 Appendix

8.1 Additional Results for Triage Data Sets

In this section, we present the results of the remaining five data sets in alphabetical order.

Fever
Method     K     Residual   Sparsity (F, G)   LASSO
SONMF      10    0.1200     (0, 19.67)        81.09
SONMF      30    0.1127     (0, 34.31)        81.87
SONMF      50    0.1071     (0, 39.51)        81.95
SONMF      100   0.0961     (0, 44.90)        82.03
SONMF      150   0.0877     (0, 47.59)        82.16
NMF        10    0.1207     (64.88, 40.81)    79.58
NMF        30    0.1141     (78.26, 53.03)    81.23
NMF        50    0.1087     (82.66, 60.12)    81.36
NMF        100   0.0980     (87.12, 69.61)    81.60
NMF        150   0.0892     (89.08, 74.57)    81.68
ONMF       10    0.1204     (68.83, 36.91)    79.79
ONMF       30    0.1135     (78.36, 58.42)    81.47
ONMF       50    0.1082     (81.82, 67.97)    81.63
ONMF       100   0.0979     (84.58, 81.28)    81.70
ONMF       150   0.0894     (85.43, 87.36)    81.75
Semi-NMF   10    0.1199     (0, –)            80.98
Semi-NMF   30    0.1123     (0, –)            81.50
Semi-NMF   50    0.1064     (0, –)            81.82
Semi-NMF   100   0.0946     (0, 0)            81.89
Semi-NMF   150   0.0853     (0, 0)            81.91

General Weakness
Method     K     Residual   Sparsity (F, G)   LASSO
SONMF      10    0.1205     (0, 20.26)        68.05
SONMF      30    0.1144     (0, 32.55)        68.84
SONMF      50    0.1094     (0, 38.69)        69.18
SONMF      100   0.0995     (0, 45.45)        69.57
SONMF      150   0.0918     (0, 47.35)        69.95
NMF        10    0.1211     (62.94, 39.51)    66.64
NMF        30    0.1155     (77.32, 53.18)    67.18
NMF        50    0.1107     (81.84, 59.92)    67.75
NMF        100   0.1011     (86.58, 69.40)    68.56
NMF        150   0.0932     (88.64, 74.62)    68.99
ONMF       10    0.1209     (66.61, 35.62)    66.53
ONMF       30    0.1151     (77.52, 57.51)    67.22
ONMF       50    0.1104     (80.95, 67.70)    67.88
ONMF       100   0.1010     (83.96, 81.85)    68.82
ONMF       150   0.0935     (85.10, 87.80)    69.28
Semi-NMF   10    0.1204     (0, –)            68.16
Semi-NMF   30    0.1140     (0, –)            68.86
Semi-NMF   50    0.1088     (0, –)            69.03
Semi-NMF   100   0.0982     (0, 0)            69.54
Semi-NMF   150   0.0896     (0, 0)            69.87

Table 10: Results of continuous NMF variants on the Fever and General Weakness data sets.

Fever (Binary)
Method   K     Cost     Sparsity (F, G)   LASSO
SONMF    10    0.0199   (0, 0.04)         77.22
SONMF    30    0.0130   (0, 0.21)         79.93
SONMF    50    0.0102   (0, 0.63)         80.96
SONMF    100   0.0054   (0, 1.81)         81.54
SONMF    150   0.0037   (0, 2.13)         81.93
logNMF   10    0.0201   (0.02, 0)         77.21
logNMF   30    0.0195   (0.07, 0)         77.35
logNMF   50    0.0228   (0.23, 0)         77.60
logNMF   100   0.0350   (3.83, 0)         78.06
logNMF   150   0.0475   (9.01, 0)         78.63

General Weakness (Binary)
Method   K     Cost     Sparsity (F, G)   LASSO
SONMF