1 Introduction
Nonnegative Matrix Factorization (NMF) has gained much attention due to its wide applications in machine learning, such as cluster analysis (Kim and Park, 2008b; Aggarwal and Reddy, 2013), text mining (Pauca et al., 2004; Shahnaz et al., 2006; Aggarwal and Zhai, 2012a), and image recognition (Lee and Seung, 1999; Buciu and Pitas, 2004). The purpose of NMF is to uncover nonnegative latent factors and their relationships so as to provide meaningful interpretations for practical applications. NMF was first extensively studied by Paatero and Tapper (1994) as positive matrix factorization, and was subsequently popularized by the work of Lee and Seung (1999, 2001) in the machine learning field. Specifically, NMF seeks to approximate a target matrix X by factorizing it into two lower-rank nonnegative matrices F and G, which together provide a lower-dimensional approximation of the original matrix. Other variations of NMF include Sparse NMF (Hoyer, 2004), Orthogonal NMF (Ding et al., 2006), and Semi-NMF (Ding et al., 2010). Additionally, NMF methods for binary data structures have been developed using a logistic regression approach
(Kaban et al., 2004; Schachtner et al., 2010; Tomé et al., 2015; Larsen and Clemmensen, 2015).

Ding et al. (2006) proposed Orthogonal NMF (ONMF) by adding an orthogonality constraint on either F or G. The dual constraints of nonnegativity and orthogonality force each column of the constrained matrix to have exactly one nonzero entry, which creates a rigid clustering interpretation; under this formulation, the objective function is also shown to be equivalent to that of K-means clustering (Ding et al., 2006). Orthogonality can be imposed by using a Lagrangian multiplier (Ding et al., 2006; Kimura et al., 2015), or by employing the natural gradient on the Stiefel manifold (Yoo and Choi, 2008). Other methods introduce a weighting parameter instead of the Lagrangian multiplier (Li et al., 2010; Mirzal, 2014) to force orthogonality. However, achieving strict orthogonality is infeasible in the current literature due to the nonnegativity constraint, and the deviation from strict orthogonality increases further as the number of basis vectors grows. Ding et al. (2010) later extended NMF to mixed-sign matrices by relaxing the nonnegativity constraint on F, introducing Semi-NMF. Following ONMF, they drew an interesting connection between Semi-NMF and K-means clustering, where F can be interpreted as the cluster centroids and G as the cluster membership indicators. They further discussed that these variants of NMF can be considered relaxed versions of K-means clustering.

The most widely implemented algorithm for NMF and its variants is a multiplicative update scheme based on matrix-wise alternating block coordinate descent (Lee and Seung, 1999, 2001; Ding et al., 2006; Yoo and Choi, 2008; Ding et al., 2010; Mirzal, 2014). Although widely applied and simple to implement, these methods have a relatively slow convergence rate, since they only incorporate first-order gradient information (Lin, 2007; Han et al., 2009; Cichocki and Phan, 2009), and they suffer from a zero-locking problem. To overcome this inefficiency of multiplicative updates, several state-of-the-art NMF algorithms were developed using block-coordinate descent (Kim et al., 2014), projected gradient descent (Lin, 2007), and alternating constrained least squares (Kim and Park, 2008a).
As an alternative to matrix-wise updates, column-wise updates (Cichocki et al., 2007; Cichocki and Phan, 2009; Kimura et al., 2015) and element-wise updates (Hsieh and Dhillon, 2011) have been investigated and shown to significantly speed up the convergence of NMF.
In this paper, we propose a new semi-orthogonal matrix factorization method under the framework of both continuous and binary observations. For the former, we take advantage of the fast convergence of the column-wise update scheme (Cichocki and Phan, 2009), along with a line search implementation to further increase the rate of convergence. The orthogonality property is guaranteed by using an orthogonality-preserving scheme (Wen and Yin, 2013). Additionally, we show that the column-wise update scheme reduces to a simple matrix-wise alternating least squares update due to the strict orthogonality enforced at each iteration, which further reduces the complexity of the algorithm. The algorithm for the binary case is based on a logistic regression formulation (Kaban et al., 2004; Tomé et al., 2015) of the objective function with the same orthogonality-preserving scheme, along with an implementation of Newton's method. Our numerical studies show that the objective function is monotonically decreasing under the proposed algorithm, with a significantly faster convergence rate than existing methods, while the strict orthogonality constraint is retained throughout. Our model also performs consistently well regardless of the true structure of the target matrix via a deterministic SVD-based initialization, whereas other models are susceptible to local minima.
One of the most prominent applications of NMF is word clustering and document clustering (Xu et al., 2003; Yoo and Choi, 2008; Aggarwal and Zhai, 2012a,b), due to the biclustering nature of NMF. We perform an extensive text mining analysis on classifying triage notes and show that classification accuracy can be improved by transforming the data set into a basis representation using our method. Additionally, the orthogonal formulation within our method yields uncorrelated topic vectors. Furthermore, words with significant positive and negative weights can be viewed as separate clusters under the same topic vector, providing a sub-biclustering interpretation. On the other hand, the enforcement of nonnegativity on the document-topic matrix permits the conventional sparse and by-parts interpretation that NMF is well known for.
The paper is organized as follows. First, we review some of the established methodologies and algorithms of NMF in section 2. The proposed methods for the continuous and binary cases are then presented in sections 3 and 4, respectively. Section 5 provides an extensive numerical study using both simulated and real triage text data sets. Discussions and conclusions are presented in section 6.
2 Notations and Review of Existing Literature
We first introduce the notation used in this paper. We denote an arbitrary matrix as $\mathbf{X}$ and a vector as $\mathbf{x}$; a matrix with column vectors is written $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_n]$. The Frobenius norm of a matrix is given by $\|\mathbf{X}\|_F = \bigl(\sum_{i,j} x_{ij}^2\bigr)^{1/2}$.
We next provide a brief review of some popular NMF methods. Nonnegative matrix factorization (Lee and Seung, 1999) aims to factorize a nonnegative matrix $\mathbf{X}$ into the product of two nonnegative matrices $\mathbf{F}$ and $\mathbf{G}$ by solving

$$\min_{\mathbf{F} \ge 0, \; \mathbf{G} \ge 0} \; \|\mathbf{X} - \mathbf{F}\mathbf{G}\|_F^2. \qquad (1)$$
Given an observed design matrix $\mathbf{X} \in \mathbb{R}_+^{p \times n}$, we can approximate it as the product of two lower-rank matrices $\mathbf{F} \in \mathbb{R}_+^{p \times K}$ and $\mathbf{G} \in \mathbb{R}_+^{K \times n}$. In typical applications, $K$ is much less than $p$ or $n$. More specifically, the columns of $\mathbf{X}$ can be rewritten as $\mathbf{x}_j \approx \mathbf{F}\mathbf{g}_j$, where $\mathbf{x}_j$ and $\mathbf{g}_j$ are the corresponding columns of $\mathbf{X}$ and $\mathbf{G}$. Thus, each column vector $\mathbf{x}_j$ is approximated by a linear combination of the columns of $\mathbf{F}$, weighted by the entries of $\mathbf{g}_j$; equivalently, $\mathbf{F}$ can be regarded as the matrix consisting of the basis vectors for the linear approximation of $\mathbf{X}$.
The minimization problem in (1) is not convex in $\mathbf{F}$ and $\mathbf{G}$ jointly, but it is convex when either matrix is fixed; therefore, only a local minimum can be guaranteed. The general approach to solving (1) is to alternately update $\mathbf{F}$ and $\mathbf{G}$, fixing one while updating the other, until the objective function converges. The most widely used algorithm implements a multiplicative update scheme (Lee and Seung, 1999, 2001), where $\mathbf{F}$ and $\mathbf{G}$ are updated by multiplying the current value by an adaptive factor that depends on a rescaling of the gradient of (1):
$$\mathbf{F} \leftarrow \mathbf{F} \odot \frac{\mathbf{X}\mathbf{G}^T}{\mathbf{F}\mathbf{G}\mathbf{G}^T}, \qquad \mathbf{G} \leftarrow \mathbf{G} \odot \frac{\mathbf{F}^T\mathbf{X}}{\mathbf{F}^T\mathbf{F}\mathbf{G}}, \qquad (2)$$

where $\odot$ and the quotient denote element-wise multiplication and division, respectively.
Aside from a slow convergence rate (Lin, 2007), the zero-locking problem (Langville et al., 2014) is also an issue: once an entry of $\mathbf{F}$ or $\mathbf{G}$ becomes 0, it remains stuck at 0 under multiplicative updates. The zero-locking problem can be tackled by non-multiplicative methods such as projected gradient descent (Lin, 2007) or block-coordinate descent (Kim et al., 2014), which are computationally fast, but their convergence is still limited by the inefficiency of matrix-wise updates. Cichocki et al. (2007) proposed a column-wise update scheme known as hierarchical alternating least squares (HALS), which significantly increases the rate of convergence. The improvement comes from the fact that the update of one matrix depends not only on the other matrix but also on its own columns; consequently, every updated column affects the updates of the remaining columns.
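To make the multiplicative scheme in (2) concrete, below is a minimal numpy sketch of the Lee-Seung updates. The function name and the small constant added to the denominators (to guard against division by zero) are our own illustration, not part of the original algorithm.

```python
import numpy as np

def nmf_multiplicative(X, K, n_iter=200, eps=1e-10, seed=0):
    """Lee-Seung multiplicative updates for min ||X - F G||_F^2
    with F >= 0, G >= 0 (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    F = rng.random((m, K))
    G = rng.random((K, n))
    for _ in range(n_iter):
        G *= (F.T @ X) / (F.T @ F @ G + eps)   # update G with F fixed
        F *= (X @ G.T) / (F @ G @ G.T + eps)   # update F with G fixed
    return F, G
```

Because every update is a nonnegative rescaling, nonnegativity is preserved automatically, but an entry that reaches exactly 0 can never become positive again, which is precisely the zero-locking behavior described above.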
Ding et al. (2006) impose orthogonality on NMF, where $\mathbf{F}^T\mathbf{F} = \mathbf{I}$ is an additional constraint. This guarantees the uniqueness of the solution and provides a hard clustering interpretation similar to K-means. Since both nonnegativity and orthogonality constraints are applied to $\mathbf{F}$, each row of $\mathbf{F}$ has at most one nonzero entry, indicating the cluster membership of the corresponding row of $\mathbf{X}$. Therefore, the optimization problem becomes the minimization of the Lagrangian function
$$L(\mathbf{F}, \mathbf{G}) = \|\mathbf{X} - \mathbf{F}\mathbf{G}\|_F^2 + \mathrm{Tr}\bigl[\boldsymbol{\lambda}(\mathbf{F}^T\mathbf{F} - \mathbf{I})\bigr],$$
where Tr denotes the trace. Ding et al. (2006) ignored the nonnegativity of the components in $\mathbf{F}$ and relied only on the orthogonality condition to solve for a unique value of $\boldsymbol{\lambda}$. The update rules for $\mathbf{F}$ and $\mathbf{G}$ are as follows,
$$\mathbf{F} \leftarrow \mathbf{F} \odot \sqrt{\frac{\mathbf{X}\mathbf{G}^T}{\mathbf{F}\mathbf{F}^T\mathbf{X}\mathbf{G}^T}}, \qquad \mathbf{G} \leftarrow \mathbf{G} \odot \frac{\mathbf{F}^T\mathbf{X}}{\mathbf{F}^T\mathbf{F}\mathbf{G}}.$$
Yoo and Choi (2008) later developed multiplicative updates directly from the true (natural) gradient on the Stiefel manifold, as opposed to the additive orthogonality constraint used in other work. Kimura et al. (2015) implemented a HALS update scheme on top of Ding et al. (2006)'s method. Both methods were shown to outperform the original ONMF in terms of the orthogonality of their solutions, but they still deviate from strict orthogonality due to the additional nonnegativity constraint.
Ding et al. (2010) relaxed the nonnegativity constraint on $\mathbf{F}$ and proposed Semi-NMF, which extends NMF to matrices with mixed-sign elements. They showed that Semi-NMF can be motivated from a clustering perspective and drew connections with K-means clustering (Ding et al., 2005, 2010). Suppose we apply K-means clustering to $\mathbf{X}$; then $\mathbf{F}$ contains the cluster centroids and $\mathbf{G}$ denotes the cluster indicators, i.e., $g_{kj} = 1$ if $\mathbf{x}_j$ belongs to cluster $k$ and $g_{kj} = 0$ otherwise. Due to this close relationship with K-means, $\mathbf{G}$ is initialized using a K-means clustering of $\mathbf{X}$, and a small constant is then added to all elements of $\mathbf{G}$ to prevent the zero-locking problem at initialization. $\mathbf{F}$ and $\mathbf{G}$ can then be solved using an alternating iterative scheme given as
$$\mathbf{F} = \mathbf{X}\mathbf{G}^T(\mathbf{G}\mathbf{G}^T)^{-1}, \qquad \mathbf{G} \leftarrow \mathbf{G} \odot \sqrt{\frac{(\mathbf{F}^T\mathbf{X})^+ + (\mathbf{F}^T\mathbf{F})^-\mathbf{G}}{(\mathbf{F}^T\mathbf{X})^- + (\mathbf{F}^T\mathbf{F})^+\mathbf{G}}}, \qquad (3)$$

where $\mathbf{A}^+ = (|\mathbf{A}| + \mathbf{A})/2$ and $\mathbf{A}^- = (|\mathbf{A}| - \mathbf{A})/2$ denote the positive and negative parts of a matrix $\mathbf{A}$. The separation of the positive and negative parts of $\mathbf{A}$ preserves the nonnegativity of $\mathbf{G}$ in each step of the iteration. Semi-NMF has been reported to yield non-sparse solutions, which is not ideal for interpretation.
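The sign-split alternation above can be sketched in a few lines of numpy. The names `pos`, `neg`, and `seminmf_step` are our own, and the step follows the Ding et al. (2010) scheme as we read it: an exact least-squares solve for the mixed-sign F, then a multiplicative update that keeps G nonnegative.

```python
import numpy as np

def pos(A):
    return (np.abs(A) + A) / 2   # positive part A+

def neg(A):
    return (np.abs(A) - A) / 2   # negative part A-

def seminmf_step(X, F, G, eps=1e-10):
    """One Semi-NMF alternation: X ~ F G with F mixed-sign, G >= 0."""
    F = X @ G.T @ np.linalg.pinv(G @ G.T)          # exact LS solve for F
    FtX, FtF = F.T @ X, F.T @ F
    G = G * np.sqrt((pos(FtX) + neg(FtF) @ G) /
                    (neg(FtX) + pos(FtF) @ G + eps))  # sign-split update
    return F, G
```

Because both the numerator and denominator inside the square root are nonnegative, the update can only rescale entries of G by nonnegative factors, which is how nonnegativity is preserved without a projection.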
3 Semi-orthogonal Matrix Factorization for Continuous Matrix
3.1 Motivation
Some challenges and drawbacks of NMF variants in the existing literature are as follows. First, current formulations and algorithms do not yield exactly orthogonal solutions for NMF even with orthogonality constraints, as there is a trade-off between nonnegativity and orthogonality when solving for the factorized solutions. Second, the rate of convergence of multiplicative-update-based algorithms is slow, and numerical instability can lead to zero-locking or zero values in the denominator; in addition, the quality of the solutions is highly dependent on initialization. Third, there are potential overfitting and rank-deficiency issues in the basis matrices for a non-orthogonal formulation of NMF, due to correlation and linear dependence between basis vectors, which can affect the performance of subsequent supervised approaches. The overlapping information between basis vectors also complicates the clustering interpretation in text-mining applications.
To address the above limitations, we propose a new framework and algorithm for matrix factorization problems and introduce the Semi-orthogonal Nonnegative Matrix Factorization (SONMF) in the following.
3.2 Methodology
Consider the following matrix factorization problem with cost function
$$\min_{\mathbf{F}, \, \mathbf{G} \ge 0} \; C(\mathbf{F}, \mathbf{G}) = \|\mathbf{X} - \mathbf{F}\mathbf{G}\|_F^2, \quad \text{subject to } \mathbf{F}^T\mathbf{F} = \mathbf{I},$$
where $\mathbf{X} \in \mathbb{R}^{p \times n}$, $\mathbf{F} \in \mathbb{R}^{p \times K}$, and $\mathbf{G} \in \mathbb{R}_+^{K \times n}$. We solve this problem by alternately updating the matrices $\mathbf{F}$ and $\mathbf{G}$. The uniqueness of the proposed method is that it takes advantage of the Stiefel manifold $\mathcal{V}_K(\mathbb{R}^p) = \{\mathbf{F} \in \mathbb{R}^{p \times K} : \mathbf{F}^T\mathbf{F} = \mathbf{I}\}$, which is the feasible set for $\mathbf{F}$. In particular, our method enforces the entire solution path of $\mathbf{F}$ to lie exactly on this manifold, thereby preserving strict orthogonality. In this section, we first present the proposed orthogonality-preserving update scheme for $\mathbf{F}$, and then demonstrate that this strict enforcement of orthogonality leads to an extremely efficient update for $\mathbf{G}$.
3.2.1 Orthogonality-Preserving Update Scheme for F
The main challenge in achieving an exactly orthogonal solution for ONMF is the additional nonnegativity constraint that is imposed (Ding et al., 2006; Yoo and Choi, 2008; Kimura et al., 2015). In contrast, we relax $\mathbf{F}$ to be mixed-sign, which allows us to utilize the full gradient of $\mathbf{F}$ and to perform a line search for a better optimum. Following Wen and Yin (2013), we initialize $\mathbf{F}$ with a column-wise orthonormal matrix and then apply the Cayley transformation to preserve orthogonality while performing a line search at each update. Under the matrix representation, the gradient of the cost function with respect to $\mathbf{F}$ is
$$\mathbf{R} = \frac{\partial C}{\partial \mathbf{F}} = -2(\mathbf{X} - \mathbf{F}\mathbf{G})\mathbf{G}^T.$$
We then define a skew-symmetric matrix $\mathbf{S}$ by
$$\mathbf{S} = \mathbf{R}\mathbf{F}^T - \mathbf{F}\mathbf{R}^T.$$
The next iterate of $\mathbf{F}$ is determined by the Crank-Nicolson-like scheme
$$\mathbf{F}(\tau) = \mathbf{F} - \frac{\tau}{2}\,\mathbf{S}\bigl(\mathbf{F} + \mathbf{F}(\tau)\bigr),$$
where $\mathbf{F}(\tau)$ has the closed-form solution
$$\mathbf{F}(\tau) = \mathbf{Q}\mathbf{F},$$
and $\mathbf{Q}$ is obtained via the Cayley transformation,
$$\mathbf{Q} = \Bigl(\mathbf{I} + \frac{\tau}{2}\mathbf{S}\Bigr)^{-1}\Bigl(\mathbf{I} - \frac{\tau}{2}\mathbf{S}\Bigr). \qquad (4)$$
Thus,
$$\mathbf{F}(\tau) = \Bigl(\mathbf{I} + \frac{\tau}{2}\mathbf{S}\Bigr)^{-1}\Bigl(\mathbf{I} - \frac{\tau}{2}\mathbf{S}\Bigr)\mathbf{F},$$
where $\tau$ is a step size. The two factors of the Cayley transformation commute, so their order in (4) is interchangeable. This formulation has several desirable properties. Since $\mathbf{Q}$ is an orthogonal matrix, $\mathbf{F}(\tau)$ is also a matrix with orthonormal columns for all values of $\tau$. In addition,
$$\mathbf{F}(\tau)^T\mathbf{F}(\tau) = \mathbf{F}^T\mathbf{Q}^T\mathbf{Q}\mathbf{F} = \mathbf{F}^T\mathbf{F} = \mathbf{I};$$
therefore, the scheme preserves orthonormality through the entire solution path. Moreover, $\mathbf{F}'(0) = -\mathbf{S}\mathbf{F}$ equals the negative projection of $\mathbf{R}$ onto the tangent space of $\mathcal{V}_K(\mathbb{R}^p)$ at $\mathbf{F}$. Therefore, $\mathbf{F}(\tau)$ is a descent path, which motivates us to perform a line search along the path. The step size $\tau$ is adaptive, as a proper step size must be found at each iteration to guarantee the convergence of the algorithm; we choose a fixed initial value for $\tau$. Using the idea of a backtracking line search, the value of $\tau$ can be chosen easily by calculating the difference between the mean squared errors of the previous and current iterations, $\Delta = \mathrm{MSE}_{t-1} - \mathrm{MSE}_t$. We increase $\tau$ by a factor of 2 when $\Delta > 0$; if $\Delta < 0$, then $\tau$ is divided by a factor of 2.
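A minimal numpy sketch of one orthogonality-preserving step is given below, assuming F is a p x K matrix with orthonormal columns and R is the gradient of the cost with respect to F. The second routine anticipates the Sherman-Morrison-Woodbury reduction discussed in the next paragraph; both function names are our own.

```python
import numpy as np

def cayley_step(F, R, tau):
    """One Cayley-transform update (Wen & Yin, 2013): F(tau) = Q F."""
    p = F.shape[0]
    S = R @ F.T - F @ R.T                  # skew-symmetric: S^T = -S
    I = np.eye(p)
    # Solve (I + tau/2 S) F_new = (I - tau/2 S) F instead of forming Q.
    return np.linalg.solve(I + (tau / 2) * S, (I - (tau / 2) * S) @ F)

def cayley_step_smw(F, R, tau):
    """Same step, but inverting only a 2K x 2K matrix via SMW,
    using S = U V^T with U = [R, F] and V = [F, -R]."""
    K = F.shape[1]
    U = np.hstack([R, F])
    V = np.hstack([F, -R])
    M = np.linalg.solve(np.eye(2 * K) + (tau / 2) * (V.T @ U), V.T @ F)
    return F - tau * U @ M
```

Because S is skew-symmetric, the Cayley factor Q is exactly orthogonal, so the updated F has orthonormal columns up to floating-point precision regardless of the step size, which is the property the line search relies on.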
On the other hand, $\mathbf{Q}$ is rather expensive to compute, as it requires inverting $\mathbf{I} + \frac{\tau}{2}\mathbf{S}$, which is a $p \times p$ matrix. However, typical NMF applications have $K \ll p$, so we can use the Sherman-Morrison-Woodbury (SMW) formula to reduce this inversion to that of a $2K \times 2K$ matrix by rewriting $\mathbf{S}$ as a product of two low-rank matrices. Let $\mathbf{U} = [\mathbf{R}, \mathbf{F}]$ and $\mathbf{V} = [\mathbf{F}, -\mathbf{R}]$; then $\mathbf{S} = \mathbf{U}\mathbf{V}^T$. The SMW formula is given as
$$(\mathbf{B} + \alpha\mathbf{U}\mathbf{V}^T)^{-1} = \mathbf{B}^{-1} - \alpha\mathbf{B}^{-1}\mathbf{U}\bigl(\mathbf{I} + \alpha\mathbf{V}^T\mathbf{B}^{-1}\mathbf{U}\bigr)^{-1}\mathbf{V}^T\mathbf{B}^{-1}$$
for some invertible square matrix $\mathbf{B}$. Letting $\mathbf{B} = \mathbf{I}$ and $\alpha = \tau/2$, we can derive the final update for $\mathbf{F}$ as follows. Using $\mathbf{F}(\tau) = (\mathbf{I} + \frac{\tau}{2}\mathbf{U}\mathbf{V}^T)^{-1}(\mathbf{I} - \frac{\tau}{2}\mathbf{U}\mathbf{V}^T)\mathbf{F}$, we have
$$\mathbf{F}(\tau) = \Bigl[\mathbf{I} - \frac{\tau}{2}\mathbf{U}\Bigl(\mathbf{I} + \frac{\tau}{2}\mathbf{V}^T\mathbf{U}\Bigr)^{-1}\mathbf{V}^T\Bigr]\Bigl(\mathbf{I} - \frac{\tau}{2}\mathbf{U}\mathbf{V}^T\Bigr)\mathbf{F} \qquad (5)$$
$$= \mathbf{F} - \tau\,\mathbf{U}\Bigl(\mathbf{I} + \frac{\tau}{2}\mathbf{V}^T\mathbf{U}\Bigr)^{-1}\mathbf{V}^T\mathbf{F}, \qquad (6)$$
where, in simplifying Eq. (5), we replace the identity matrix $\mathbf{I}$ by $(\mathbf{I} + \frac{\tau}{2}\mathbf{V}^T\mathbf{U})^{-1}(\mathbf{I} + \frac{\tau}{2}\mathbf{V}^T\mathbf{U})$; collecting terms then leads to the final update rule in Eq. (6), which requires inverting only a $2K \times 2K$ matrix.

3.2.2 Hierarchical Alternating Least Squares Update Scheme for G
The main idea of the hierarchical alternating least squares (HALS) update scheme (Cichocki et al., 2007) is to combine global matrix-wise updates with local vector-wise updates. Instead of updating $\mathbf{F}$ and $\mathbf{G}$ only as whole matrices, as in matrix-wise update methods, the individual components of $\mathbf{F}$ and $\mathbf{G}$ are updated in sequence, so each newly updated component immediately informs the updates of the remaining ones, thereby speeding up the rate of convergence. We use this idea in our update for $\mathbf{G}$ and show that our strict orthogonality constraint further simplifies the scheme to a matrix-wise alternating least squares update.
Following Cichocki et al. (2007), Cichocki and Phan (2009), and Kimura et al. (2015), we start with the derivation of the column-wise update of $\mathbf{G}$ by fixing $\mathbf{F}$. Writing $\mathbf{F} = [\mathbf{f}_1, \ldots, \mathbf{f}_K]$ and letting $\mathbf{g}^k$ denote the $k$th row of $\mathbf{G}$, the objective for the $k$th component is
$$\min_{\mathbf{g}^k \ge 0} \; \|\mathbf{X}^{(k)} - \mathbf{f}_k \mathbf{g}^k\|_F^2,$$
where $\mathbf{X}^{(k)} = \mathbf{X} - \sum_{j \ne k} \mathbf{f}_j \mathbf{g}^j$ is the residual matrix without the $k$th component. To find a stationary point for $\mathbf{g}^k$, we calculate the gradient and set it to zero, which leads to
$$\mathbf{f}_k^T\bigl(\mathbf{X}^{(k)} - \mathbf{f}_k \mathbf{g}^k\bigr) = \mathbf{0}.$$
Thus, the naive column-wise update scheme for $\mathbf{G}$ is
$$\mathbf{g}^k \leftarrow \Bigl[\frac{\mathbf{f}_k^T \mathbf{X}^{(k)}}{\mathbf{f}_k^T \mathbf{f}_k}\Bigr]_+, \qquad (7)$$
where $[a]_+ = \max(a, 0)$ preserves the nonnegativity of $\mathbf{G}$. The algorithm can be simplified due to the strict orthogonality constraint on $\mathbf{F}$ over the entire solution path, which implies $\mathbf{f}_k^T \mathbf{f}_k = 1$. Thus, the updating rule in Eq. (7) simplifies to
$$\mathbf{g}^k \leftarrow \bigl[\mathbf{f}_k^T \mathbf{X}^{(k)}\bigr]_+.$$
Since $\mathbf{X}^{(k)} = \mathbf{X} - \sum_{j \ne k} \mathbf{f}_j \mathbf{g}^j$, the complete updating rule for each $\mathbf{g}^k$ is given by
$$\mathbf{g}^k \leftarrow \Bigl[\mathbf{f}_k^T \mathbf{X} - \sum_{j \ne k} (\mathbf{f}_k^T \mathbf{f}_j)\,\mathbf{g}^j\Bigr]_+,$$
which yields the least squares algorithm form in Cichocki and Phan (2009) and Kimura et al. (2015). Now, taking advantage of the orthogonality constraint on the entire solution path, we have $\mathbf{f}_k^T \mathbf{f}_j = 0$ for all $j \ne k$. Hence, the updating rule is simply $\mathbf{g}^k \leftarrow [\mathbf{f}_k^T \mathbf{X}]_+$. Since this column-wise update is not affected by changes in the other components, we are essentially using a simplified matrix-wise ALS update scheme,
$$\mathbf{G} \leftarrow \bigl[\mathbf{F}^T \mathbf{X}\bigr]_+. \qquad (8)$$
This simplification is only possible when $\mathbf{F}^T\mathbf{F} = \mathbf{I}$ is exactly preserved throughout the solution path. The final part of this algorithm is to choose a suitable convergence criterion apart from a predefined number of iterations. We define the convergence criterion to be the difference of the objective function values between two consecutive iterations,
$$|C_t - C_{t+1}| < \epsilon,$$
where any sufficiently small value of $\epsilon$ is a feasible choice, such as 0.0005.
The algorithm for the continuous response is given as follows.
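As an illustration, the loop below is a compact numpy sketch of the continuous-case algorithm, combining the orthonormal start, the closed-form G update in (8), and a crude accept/reject line search on the step size. It is a simplified stand-in for the paper's algorithm, using the full p x p Cayley solve rather than the SMW-reduced form, and the adaptation rule is our own minimal version of the backtracking scheme described above.

```python
import numpy as np

def sonmf(X, K, n_iter=100, tau=1.0):
    """Illustrative continuous-case SONMF loop (simplified sketch)."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    F = U[:, :K]                                  # orthonormal initialization
    p = X.shape[0]
    I = np.eye(p)

    def cost(F):
        G = np.maximum(F.T @ X, 0.0)              # Eq. (8): G = [F^T X]_+
        return np.sum((X - F @ G) ** 2)

    c = cost(F)
    for _ in range(n_iter):
        G = np.maximum(F.T @ X, 0.0)
        R = -2.0 * (X - F @ G) @ G.T              # gradient w.r.t. F
        S = R @ F.T - F @ R.T                     # skew-symmetric matrix
        F_new = np.linalg.solve(I + (tau / 2) * S, (I - (tau / 2) * S) @ F)
        c_new = cost(F_new)
        if c_new <= c:                            # accept step, enlarge tau
            F, c, tau = F_new, c_new, tau * 2
        else:                                     # reject step, shrink tau
            tau /= 2
    return F, np.maximum(F.T @ X, 0.0)
```

The accept/reject rule makes the sketch monotone by construction, while every accepted iterate stays on the Stiefel manifold because it is produced by a Cayley transform.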
4 Semi-orthogonal Matrix Factorization for Binary Matrix
For matrices with binary responses, the previous method for the continuous response is not applicable. In this section, we first review some existing methodologies on binary matrix factorization, then proceed to propose our new model.
4.1 Existing Work for Binary Matrix Factorization
Binary data arise often in real-world applications due to their simplicity in representing data information. Typical applications of binary data analysis and clustering include market-basket data, document-term data, recommender systems, and protein-protein complex interaction networks. One way to factorize a binary matrix is to do so directly (Li, 2005; Zhang et al., 2007, 2010), where the factorized solutions are also binary. However, this leads to poor interpretation, as these methods do not account for the importance of each entry within the factorized components. Alternatively, methods that utilize the Bernoulli likelihood have been shown to be more effective in this respect. In this formulation, we assume each $x_{ij}$ follows an independent Bernoulli distribution with parameter $p_{ij}$, where each $p_{ij}$ is modeled through the logistic function
$$p_{ij} = \sigma(\theta_{ij}), \qquad \sigma(\theta) = \frac{1}{1 + e^{-\theta}}, \qquad \theta_{ij} = (\mathbf{F}\mathbf{G})_{ij}.$$
Earlier related work includes probabilistic formulations of principal component analysis (PCA) (Tipping and Bishop, 1999; Collins et al., 2002; Schein et al., 2003), which extend this well-known multivariate analysis method to binary data. In particular, Schein et al. (2003) proposed logistic PCA, where the optimization problem is formulated using a Bernoulli likelihood function,
$$\max_{\mathbf{F}, \mathbf{G}} \; \ell(\mathbf{F}, \mathbf{G}) = \sum_{i,j} \Bigl[x_{ij}\log\sigma(\theta_{ij}) + (1 - x_{ij})\log\bigl(1 - \sigma(\theta_{ij})\bigr)\Bigr]. \qquad (9)$$
However, since there is no nonnegativity constraint in this formulation, the underlying factors might not have a meaningful interpretation.
Alternatively, Schachtner et al. (2010) proposed binary nonnegative matrix factorization (binNMF). The binNMF model is similar to logistic PCA in that its formulation is based on a nonlinear sigmoidal function that transforms the matrix product $\mathbf{F}\mathbf{G}$ onto the log-odds scale. The major difference between logistic PCA and binNMF is that binNMF models the log-odds of the Bernoulli distribution with two nonnegative factors in accord with a parts-based model, enforcing the entries of both F and G to be strictly nonnegative. A similar formulation was also proposed by Larsen and Clemmensen (2015). Tomé et al. (2015) then expanded on binNMF and proposed the logNMF model, which relaxes the nonnegativity constraint on G. Their model requires strictly nonnegative basis vectors (F) but is more flexible, as it allows negative feature components (G).

We propose a new binary matrix factorization such that the matrix of basis vectors F is orthonormal, and the nonnegativity constraint is imposed only on G. Tomé et al. (2015) used a gradient ascent scheme with a fixed step size for both F and G, while we apply Newton's method for G and a line search algorithm for F. Note that the implemented Newton's method for optimization is different from the root-finding Newton-Raphson algorithm. The use of a fixed step size in Tomé et al. (2015) is rather restrictive, as the algorithm converges at only a modest rate with a very small step size, while convergence issues may arise with larger step sizes. In comparison, our model effectively remedies this issue through the line search implementation in F: even if the step size for the update of G is too large, the error can be corrected by choosing a more appropriate step size for F. Additionally, the combination of a line search update for F and Newton's method for G provides more efficient convergence towards the solution.
4.2 Methodology
The idea of the factorization is to find appropriate $\mathbf{F}$ and $\mathbf{G}$ that maximize the log-likelihood function given in Eq. (9). Writing the logistic sigmoid function in its full form, we have
$$p_{ij} = \frac{1}{1 + e^{-(\mathbf{F}\mathbf{G})_{ij}}}.$$
We define the cost function as the negative log-likelihood,
$$C(\mathbf{F}, \mathbf{G}) = -\sum_{i,j}\Bigl[x_{ij}\log p_{ij} + (1 - x_{ij})\log(1 - p_{ij})\Bigr]. \qquad (10)$$
Since the second derivative of the cost function is well defined in our problem, we implement Newton's method to find the update rule for $\mathbf{G}$, which allows a faster rate of convergence with a slight increase in computation. This increase is fairly trivial, since the gradient and Hessian operations are carried out element-wise; in particular, the inversion of the Hessian can be avoided, which saves a considerable amount of computation time. Assuming both $\mathbf{X}$ and $\mathbf{F}$ are fixed, the first and second derivatives of the cost function with respect to $\mathbf{G}$ are given as
$$\frac{\partial C}{\partial \mathbf{G}} = \mathbf{F}^T(\mathbf{P} - \mathbf{X}), \qquad \mathbf{P} = \sigma(\mathbf{F}\mathbf{G}),$$
and
$$\frac{\partial^2 C}{\partial g_{kj}^2} = \sum_i f_{ik}^2\, p_{ij}(1 - p_{ij}).$$
Following Newton's method, the update rule for $\mathbf{G}$ in matrix notation is given by
$$\mathbf{G} \leftarrow \Bigl[\mathbf{G} - \eta\, \frac{\mathbf{F}^T(\mathbf{P} - \mathbf{X})}{(\mathbf{F} \odot \mathbf{F})^T\bigl(\mathbf{P} \odot (\mathbf{1} - \mathbf{P})\bigr)}\Bigr]_+,$$
where $\eta$ is a step size and $\mathbf{1}$ is the matrix of all 1's. The quotient and the exponential function here are element-wise operations on matrices.
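A sketch of the element-wise (diagonal) Newton step for G is given below, following our reading of the derivatives above; the function names, the small constant guarding the denominator, and the projection used to keep G nonnegative are our own illustration.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def newton_g_step(X, F, G, eta=0.05, eps=1e-10):
    """One element-wise Newton step for G on the negative Bernoulli
    log-likelihood, with P = sigmoid(F G); illustrative sketch."""
    P = sigmoid(F @ G)                             # current probabilities
    grad = F.T @ (P - X)                           # dC/dG
    hess = (F * F).T @ (P * (1.0 - P)) + eps       # element-wise d^2C/dG^2
    return np.maximum(G - eta * grad / hess, 0.0)  # damped, projected step
```

Because the Hessian is used only element-wise, the step is a simple array division rather than a matrix inversion, which is the computational saving noted above.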
For the update rule of $\mathbf{F}$, the only difference from the continuous case is the gradient; the orthogonality-preserving scheme remains the same. Following a derivation analogous to that for $\mathbf{G}$, the gradient with respect to $\mathbf{F}$ is
$$\mathbf{R} = \frac{\partial C}{\partial \mathbf{F}} = (\mathbf{P} - \mathbf{X})\mathbf{G}^T.$$
Since the algorithm minimizes the cost function, an overfitting problem might arise due to the nature of the problem: the algorithm seeks to maximize the probability that each $x_{ij}$ is either 0 or 1 by pushing the corresponding entries of the probability matrix close to 0 or 1. Since $\mathbf{F}$ is constrained to be orthonormal, the scale of the approximation depends solely on $\mathbf{G}$; thus, larger values in $\mathbf{G}$ increase the risk of overfitting. To avoid this issue, the step size for the update of $\mathbf{G}$ needs to be relatively small, and in our algorithm we choose the default value to be 0.05.

5 Simulation and Data Experiments
We evaluate the performance of our model through different data experiments. In the first part, we compare the performance of our model with several well-established NMF algorithms for the continuous case under different simulation settings. For the binary version, we show that both the cost function and the difference between the true and estimated probability matrices converge monotonically under the algorithm, along with a comparison with another state-of-the-art model. In the second part, we apply our method to numerous real text data sets composed of triage notes from a hospital and evaluate classification performance on whether a patient should be admitted to the hospital for additional supervision. Overall, our method performs best in improving classification accuracy on top of the conventional bag-of-words model, as opposed to the other NMF alternatives. Finally, we discuss the interpretation of the word clusters extracted by our model by examining the largest positive values and the smallest negative values of the words in the F matrix, and show that these two clusters of words can be interpreted as further sub-biclusters.

We developed an R package "MatrixFact" that implements the two methods proposed in this paper. In addition, we also implemented the original NMF (Lee and Seung, 2001), ONMF (Kimura et al., 2015), Semi-NMF (Ding et al., 2010), and logNMF (Tomé et al., 2015) in the same package. To the best of our knowledge, existing R packages only include various algorithms for regular NMF but lack access to the other methods; our package bridges this gap. The package is available on GitHub and can be installed in R via the command devtools::install_github("cronshells/MatrixFact").
5.1 Simulation for Continuous Case
For the continuous case, we evaluate the average residual and the orthogonality residual, defined as
$$\text{Residual} = \frac{1}{pn}\|\mathbf{X} - \hat{\mathbf{F}}\hat{\mathbf{G}}\|_F^2, \qquad (11)$$
$$\text{Orthogonality Residual} = \|\hat{\mathbf{F}}^T\hat{\mathbf{F}} - \mathbf{I}\|_F. \qquad (12)$$
In addition to the two main criteria above, we also evaluate how well the algorithms recover the original matrices by calculating the difference between the column spaces of $\mathbf{F}$, $\mathbf{G}$ and $\hat{\mathbf{F}}$, $\hat{\mathbf{G}}$, where $\mathbf{F}$ and $\mathbf{G}$ are the true underlying matrices and $\hat{\mathbf{F}}$ and $\hat{\mathbf{G}}$ are the approximated matrices from the factorization. That is,
$$D_F = \|\mathbf{P}_{\mathbf{F}} - \mathbf{P}_{\hat{\mathbf{F}}}\|_F, \qquad D_G = \|\mathbf{P}_{\mathbf{G}} - \mathbf{P}_{\hat{\mathbf{G}}}\|_F, \qquad (13)$$
where $\mathbf{P}_{\mathbf{F}}$, $\mathbf{P}_{\mathbf{G}}$, $\mathbf{P}_{\hat{\mathbf{F}}}$, and $\mathbf{P}_{\hat{\mathbf{G}}}$ are the projection matrices of their respective counterparts, i.e., $\mathbf{P}_{\mathbf{F}} = \mathbf{F}(\mathbf{F}^T\mathbf{F})^{-1}\mathbf{F}^T$. Additionally, we investigate the sparsity of the factorized solutions. To measure this quantity, we shrink all elements with magnitude below a small threshold to 0 and then evaluate the proportion of 0's within $\mathbf{F}$ and $\mathbf{G}$, respectively.
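The projection-based criterion in (13) is straightforward to compute; `proj` and `subspace_error` below are illustrative names. The metric is invariant to the choice of basis within each column space, which is why it is a fair measure of recovery even when the factors are only identified up to an invertible transformation.

```python
import numpy as np

def proj(A):
    """Orthogonal projector onto the column space of A: A (A^T A)^-1 A^T."""
    return A @ np.linalg.pinv(A.T @ A) @ A.T

def subspace_error(F_true, F_hat):
    """||P_F - P_Fhat||_F, the column-space distance used in Eq. (13)."""
    return np.linalg.norm(proj(F_true) - proj(F_hat))
```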
We compare our method with three other popular NMF methods: NMF with multiplicative updates (Lee and Seung, 2001), Semi-NMF (Ding et al., 2010), and ONMF (Kimura et al., 2015). The simulations are conducted on an i7-7700HQ with four cores at 3.8 GHz. Three different scenarios are considered for generating the true underlying matrices:

1. Mixed-sign F and nonnegative G;

2. Nonnegative and orthogonal F and nonnegative G;

3. Orthonormal F and nonnegative G.
After generating the true underlying $\mathbf{F}$ and $\mathbf{G}$, we construct the true $\mathbf{X}$ as
$$\mathbf{X} = \mathbf{F}\mathbf{G} + \mathbf{E},$$
where $\mathbf{E}$ is a matrix of random errors. In this simulation experiment, we consider $K = 10$, 30, and 50. As a side note, refer to section 8.2 of the appendix for further discussion of the case where the true underlying rank of $\mathbf{F}$ and $\mathbf{G}$ is less than the target factorization rank.
Initialization of NMF methods is crucial and has been extensively studied (Xue et al., 2008; Langville et al., 2006, 2014; Boutsidis and Gallopoulos, 2008), as a good initialization allows the algorithm to achieve a better local minimum at a faster convergence rate. An excellent choice of starting point for $\mathbf{F}$ is the left singular matrix $\mathbf{U}$ of the singular value decomposition of $\mathbf{X}$. We apply the truncated SVD to decompose $\mathbf{X}$ into its best rank-$K$ factorization,
$$\mathbf{X} \approx \mathbf{U}_K \boldsymbol{\Sigma}_K \mathbf{V}_K^T,$$
where $K$ is the rank of the target factorization. The truncated SVD is implemented because it provides the best rank-$K$ approximation of any given matrix (Eckart and Young, 1936; Wall et al., 2003; Gillis and Kumar, 2014; Qiao, 2015). Furthermore, the formulation of our model does not require a separate initialization of $\mathbf{G}$, since the update rule for $\mathbf{G}$ given in (8) depends only on $\mathbf{X}$ and $\mathbf{F}$. For more discussion of initialization, please refer to section 8.1 of the appendix.
For Semi-NMF, we apply the K-means initialization to F and G according to Ding et al. (2010). Lee and Seung (2001) and Kimura et al. (2015) proposed random initialization for NMF and ONMF, respectively. For a fair comparison, we instead initialize F and G using a slightly modified SVD approach, where we truncate all negative values of $\mathbf{U}$ to a small positive constant to enforce nonnegativity and avoid the zero-locking problem for NMF. We then apply our update rule for G as the initialization for G, i.e.,
$$\mathbf{G}_0 = \bigl[\mathbf{F}_0^T \mathbf{X}\bigr]_+.$$
The average values of the above four criteria over 200 simulation trials with different underlying true F and G are reported for the three scenarios in Tables 1, 2, and 3, respectively, where each trial is set to run 500 iterations. We display the convergence plots of the objective function over the first 250 iterations in Figures 1, 2, and 3, where the convergence criterion under consideration is $|C_t - C_{t+1}| < \epsilon$.
The last two columns of Tables 1, 2, and 3 indicate the time and the number of iterations for each algorithm to reach this criterion.
Table 1: Simulation results for scenario (1).

K    Residual   Orth. res.   D_F     D_G     Sparsity F (%)   Sparsity G (%)   Time (s)   Iterations
SONMF
10   0.0878     7.16         0.461   0.347   0                0                0.58       10.6
30   0.0807     4.23         0.694   0.634   0                0                1.07       11.7
50   0.0750     9.29         0.913   0.840   0                0                1.95       12.9
ONMF
10   0.2631     0.040        3.504   0.653   87.06            0                0.76       17.3
30   0.7440     0.439        7.031   3.276   94.63            0.04             1.71       44.3
50   1.1684     1.404        8.753   3.008   95.00            0.13             3.78       70.29
NMF
10   0.1598     N/A          1.390   1.773   1.39             1.77             3.20       292.1
30   0.4190     N/A          3.344   4.001   0.93             0.57             11.07      500
50   1.5420     N/A          5.493   5.854   0.82             0.41             18.48      500
Semi-NMF
10   0.1206     N/A          0.174   0.207   0                0                4.13       480.3
30   0.1879     N/A          1.839   3.499   0                0                11.35      500
50   0.2708     N/A          2.609   5.006   0                0                18.571     500
Table 2: Simulation results for scenario (2).

K    Residual   Orth. res.   D_F     D_G     Sparsity F (%)   Sparsity G (%)   Time (s)   Iterations
SONMF
10   0.0765     1.08         0.278   0.485   0                3.95             0.77       10.1
30   0.0729     1.29         0.822   1.160   0                5.39             1.12       8.3
50   0.0678     1.34         1.408   1.774   0                4.77             1.59       7.8
ONMF
10   0.0747     0.578        0.492   0.595   59.5             0.021            0.79       18.9
30   0.0703     0.975        0.638   1.085   72.9             0.496            1.79       40.5
50   0.0666     1.636        0.669   1.575   76.6             2.51             3.38       60.0
NMF
10   0.1064     N/A          0.181   0.597   24.9             20.8             3.73       318.9
30   0.1297     N/A          0.469   1.497   38.4             30.6             11.71      394.2
50   0.1421     N/A          1.171   2.696   42.0             33.3             19.33      411.1
Semi-NMF
10   0.0749     N/A          0.277   0.367   0                0                0.39       29.3
30   0.0676     N/A          0.816   0.931   0                0                1.28       39.1
50   0.0628     N/A          1.401   1.586   0                0                2.49       49.6
Table 3: Simulation results for scenario (3).

K    Residual   Orth. res.   D_F     D_G     Sparsity F (%)   Sparsity G (%)   Time (s)   Iterations
SONMF
10   0.0857     4.6          3.665   3.679   0                18.93            0.84       6.7
30   0.0770     2.8          6.351   6.375   0                17.37            1.27       7.6
50   0.0694     1.6          7.976   8.011   0                16.59            1.88       8.8
Semi-NMF
10   0.0823     N/A          3.661   3.598   0                0                1.31       33.6
30   0.0758     N/A          6.336   6.330   0                0                1.86       37.1
50   0.0670     N/A          7.960   7.961   0                0                2.37       40.2
Tables 1-3 show that SONMF has several advantages over the other methods. First, our model converges quickly and consistently regardless of the structure of the true matrix considered in the simulation. For example, our model reaches the convergence criterion and the true error in an average of about 10 iterations, greatly surpassing the rate of convergence of the other models. This is because the deterministic SVD-based initialization gives us a head start compared to the other methods, leading to rapid convergence to the true error along an excellent solution path; the line search implementation further speeds up the convergence rate.
For scenario (1), our model is significantly better in terms of factorization accuracy and recovery of the true matrices, as shown by the smallest average residual, $D_F$, and $D_G$ values, especially for larger $K$'s. For ONMF and NMF, the mean value over 50 trials fails to converge to the true error, mainly due to the large number of saddle points that the true F possesses, as shown by the large $D_F$ values. Due to their more restricted parameter spaces, it is harder for NMF, and especially ONMF, to escape a bad solution path, unlike Semi-NMF, which has a less confined parameter space. Additionally, the ONMF proposed by Kimura et al. (2015) does not guarantee a monotonic decrease of the objective function value, due to the additional orthogonality constraint. Semi-NMF has the fewest constraints among these four models and eventually converges to the true error, but it is computationally heavy due to its slow rate of convergence.
When the underlying structure of F is better defined, as in scenario (2), all four models converge to the true error, with NMF having the slowest rate of convergence. In terms of factorization accuracy, our model outperforms ONMF and NMF but performs slightly worse than Semi-NMF due to the extra orthogonality constraint. In terms of recovering the true underlying matrix, it is not surprising that NMF and ONMF recover F better, as the true F is easier to estimate due to its unique structure.
For scenario (3), our model and SemiNMF have similar performances, although our model has an advantage over SemiNMF in rate of convergence. Similar to scenario (2), SemiNMF performs slightly better than our model in terms of factorization accuracy and recovery of the true matrix due to its less confined parameter space. Lastly, since our algorithm preserves strict orthogonality throughout the entire process, SONMF has a negligible orthogonality residual due only to floating-point error, as opposed to the orthogonality residual of ONMF, which grows as the number of basis vectors increases.
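The orthogonality residual discussed here can be checked directly. A small sketch, assuming the unnormalized Frobenius-norm deviation of FᵀF from the identity (the paper's exact normalization in its residual definition is not shown in this excerpt):

```python
import numpy as np

def orth_residual(F):
    """Frobenius-norm deviation of F from strict column orthonormality.
    (The paper's exact normalization is not reproduced here; this is the
    plain, unnormalized version.)"""
    k = F.shape[1]
    return np.linalg.norm(F.T @ F - np.eye(k))

# A matrix with exactly orthonormal columns has residual near machine epsilon,
# matching the "negligible residual due to floating-point error" claim.
Q, _ = np.linalg.qr(np.random.default_rng(1).standard_normal((100, 10)))
small = orth_residual(Q)

# A non-orthogonal matrix has a large residual.
large = orth_residual(np.ones((4, 2)))
```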
The strictly nonnegative versions yield sparser solutions than the relaxed variants, which is not surprising, as the sparsity of NMF solutions has been well studied in previous literature (Lee and Seung, 2001; Guan and Dy, 2009; Ding et al., 2010). SemiNMF produces few or no sparse solutions in all three scenarios, consistent with the findings of Ding et al. (2010). Our model has a slight advantage over SemiNMF on this criterion, with a moderate degree of sparsity in the third scenario. Moreover, the empirical studies on real data sets below show that our algorithm yields a moderate degree of sparsity in the G matrix as well.
5.2 Simulation for Binary Case
For the binary response, we use the mean value of the cost function in equation (10) as our evaluation criterion instead of the normalized residual. That is,

\[ \text{Cost} = -\frac{1}{N}\sum_{i,j}\left[x_{ij}\log \hat{p}_{ij} + (1-x_{ij})\log(1-\hat{p}_{ij})\right], \tag{14} \]

where N is the total number of elements in X and \(\hat{p}_{ij}\) is the estimated probability for entry \(x_{ij}\). We also consider the orthogonality residual and the sparsity measure given in equations (12) and (13) respectively. Additionally, we evaluate the difference between the true probability matrix P and the estimated probability matrix \(\hat{P}\) from our binary method.
We show that both the objective function and the difference between the true and estimated probability matrix are monotonically decreasing until convergence.
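A sketch of the two binary evaluation criteria described above, assuming the mean cost is the average negative Bernoulli log-likelihood under the logistic link and the probability discrepancy is a normalized Frobenius distance (both normalizations are our assumptions, not confirmed by this excerpt):

```python
import numpy as np

def mean_logistic_cost(X, F, G, eps=1e-12):
    """Mean negative Bernoulli log-likelihood over all N = n*m entries,
    with P_hat = sigmoid(F G^T)."""
    P_hat = 1.0 / (1.0 + np.exp(-(F @ G.T)))
    P_hat = np.clip(P_hat, eps, 1 - eps)   # guard against log(0)
    return -np.mean(X * np.log(P_hat) + (1 - X) * np.log(1 - P_hat))

def prob_diff(P, F, G):
    """Normalized Frobenius distance between the true probability matrix P
    and its estimate (the normalization is an assumption)."""
    P_hat = 1.0 / (1.0 + np.exp(-(F @ G.T)))
    return np.linalg.norm(P - P_hat) / np.linalg.norm(P)
```

When F and G exactly generate P, the probability discrepancy is zero while the mean cost stays strictly positive, which is why the two criteria can move in opposite directions once the model starts to overfit.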
For the binary simulation setting, we generate a mixed-sign F and a nonnegative G such that F^T F = I and G ≥ 0. We then construct the true probability matrix P using the logistic sigmoid function, p_ij = 1 / (1 + exp(−(FG^T)_ij)).
We then add a matrix of random error E to P. Finally, we generate the true X, where each entry x_ij is drawn as a Bernoulli variable with probability p_ij, and X has dimension 500-by-500. Similar to the continuous case, we consider K = 10, 30, and 50. The average values of the above five criteria over 50 simulation trials are reported. We compare the performance of our method with logNMF (Tomé et al., 2015), with the step size for its gradient ascent set to 0.001.
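The data-generating process described above can be sketched as follows. The generating distributions for F and G, and the distribution of the noise matrix E, are not specified in this excerpt, so placeholders are used and the perturbation by E is omitted:

```python
import numpy as np

rng = np.random.default_rng(42)
n, m, k = 500, 500, 10

# Mixed-sign F and nonnegative G (placeholder distributions; the paper's
# exact generating scheme, including F^T F = I, is not reproduced here).
F = rng.standard_normal((n, k))
G = np.abs(rng.standard_normal((m, k)))

# True probability matrix via the logistic sigmoid of FG^T.
P = 1.0 / (1.0 + np.exp(-(F @ G.T)))
# (The paper additionally adds a random-error matrix E to P; its
# distribution is not specified in this excerpt, so it is omitted.)

# Each entry of X is a Bernoulli draw with success probability p_ij.
X = (rng.random((n, m)) < P).astype(float)
```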
For our model, we set the step size of the Newton's method update scheme for G to 0.05. The initializations of F and G are the same as in the continuous case. The stopping condition for both models is either reaching the maximum of 500 iterations or a change in cost between subsequent iterations of less than 0.0001.
5.2.1 Results for Simulation (Binary)
SONMF
K = 10:  0.1718  1.814  42.001  1.696  1.029  0  15.56  23.52  228.3
K = 30:  0.0845  7.552  59.472  3.821  2.456  0  23.77  40.57  316.0
K = 50:  0.0220  1.012  65.114  5.627  4.111  0  28.03  64.21  407.9
logNMF
K = 10:  0.1725  N/A  47.128  1.711  1.054  18.10  0  29.35  323.1
K = 30:  0.0861  N/A  62.670  3.891  2.717  26.02  0  49.60  471.5
K = 50:  0.0589  N/A  66.028  5.658  4.456  32.03  0  60.67+  500+
The results above show that our model converges faster towards the true cost and attains a lower overall mean cost than logNMF. Our model reaches the true cost in around 150 iterations, while logNMF requires more than 300. The number of iterations required to converge also increases with K, since more basis vectors mean more parameters to estimate. Additionally, our model has a lower error across the evaluation criteria when both algorithms reach the convergence criterion. The sparsity of the solutions is similar.
Unlike the continuous case, the SVD-based initialization does not provide rapid convergence to the true error for our model. This is because the SVD is applied to X, while F and G estimate the underlying probability matrix of X rather than X itself. For each K, the difference between P and the estimated probability matrix converges once the average cost of the factorization reaches the true cost. Further iterations cause the model to overfit: the estimated probability tends toward 1 when x_ij is 1 and toward 0 when x_ij is 0, which in turn increases the difference between P and its estimate.
An important caveat is that the rate of convergence of our model can, to a certain extent, be increased or decreased through the step size for G. A smaller step size results in a more conservative solution with a slower rate of convergence, while a larger step size achieves the opposite. In our numerical experiments, we found that 0.05 or 0.1 is a good step size in terms of this trade-off between rate of convergence and risk of overfitting. On the other hand, a step size greater than 1 results in convergence issues.
5.3 Case Study (Triage Notes)
In this section, we focus on the application of our model to real text data sets. Our goal is to build a machine learning model that classifies whether a patient's condition is severe enough to warrant further inspection at the hospital. We show that, for a suitable number of basis vectors, the classification error can be improved by a linear transformation of basis with our model. We also present interpretations of the word clusters that our model identifies in the considered data sets.
The triage data is provided by the Alberta Medical Center. There are approximately 500,000 patients (observations), each with a response variable, the disease (medical complaint) of the patient, and a column of text regarding the reason for the visit collected during the triage phase. The demographic information and vital signs of the patients are also recorded, but we focus only on the text information in this study. The response variable is a binary indicator of the disposition of the patient, i.e., whether the patient is admitted to or discharged from the hospital after the doctor's examination. The vocabulary used in the notes differs considerably across diseases, so we believe it is necessary to consider each disease separately. For this study, we analyze seven triage data sets, each related to a different type of disease.
We first preprocess the data with the tm package (Feinerer et al., 2008) by removing numbers, punctuation, and a standard list of default stop words, and by stemming words to their root form. We then represent the text data using a vector-space model (a.k.a. bag-of-words) (Salton et al., 1975). Given n documents, we construct a word-document matrix X, where x_ij corresponds to the occurrence or significance of word i in document j, depending on the weighting scheme. In this study, we consider tf-idf and binary weighting (Gupta et al., 2009) for the continuous and binary cases respectively.
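A minimal, library-free sketch of the word-document matrix and the two weighting schemes; the tf-idf variant shown is one common choice and not necessarily the exact scheme used in the study:

```python
import numpy as np

def bag_of_words(docs):
    """Build a word-document count matrix X (words x documents)
    from a list of tokenized documents."""
    vocab = sorted({w for d in docs for w in d})
    idx = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d:
            X[idx[w], j] += 1
    return X, vocab

def binary_weight(X):
    """Binary weighting: 1 if the word occurs in the document, else 0."""
    return (X > 0).astype(float)

def tfidf_weight(X):
    """One common tf-idf variant (an assumption, not the paper's
    confirmed scheme): raw term frequency times log inverse
    document frequency."""
    n_docs = X.shape[1]
    df = (X > 0).sum(axis=1)        # documents containing each word
    return X * np.log(n_docs / df)[:, None]

docs = [["knee", "pain", "fell"], ["ankle", "twist", "pain"]]
X, vocab = bag_of_words(docs)
```

Note that under this tf-idf variant a word appearing in every document (here "pain") receives weight zero, which is one reason weighting choice affects downstream classification.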
For classification, we denote the training and testing bag-of-words matrices as X_train and X_test respectively. Applying the matrix factorization method yields a word-topic matrix F and a document-topic matrix G, where the mixed-sign and orthogonality constraints are imposed on the word-topic vectors in F.
After obtaining the factorized solution, we project both X_train and X_test onto the column space of F. Let G_train = X_train^T F and G_test = X_test^T F; then G_train and G_test are reduced-dimension representations of X_train and X_test respectively, and effectively replace them as the new features. Depending on the number of basis vectors, this transformation allows us to increase classification accuracy while decreasing the computing time required to train a model, owing to the reduced dimension of the feature matrix. Intuitively, we can regard F as a summary device, where each cluster/basis is a linear combination of different words. After the projection, G_train can be viewed as a summary of the original bag-of-words matrix, in which each document is a linear combination of the topics in F.
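Because F has orthonormal columns, the projection onto its column space reduces each document to K topic coordinates; a numpy sketch (matrix names and dimensions are hypothetical):

```python
import numpy as np

def project_to_topics(X, F):
    """Reduced-dimension representation: each document (a column of the
    word-document matrix X) becomes a length-K vector of topic loadings.
    With orthonormal F, X^T F gives each document's coordinates in the
    topic subspace."""
    return X.T @ F   # shape: (n_documents, K)

rng = np.random.default_rng(3)
# Hypothetical orthonormal word-topic matrix (1000 words, 20 topics).
F, _ = np.linalg.qr(rng.standard_normal((1000, 20)))
X_train = rng.random((1000, 80))   # hypothetical training bag-of-words
X_test = rng.random((1000, 40))    # hypothetical testing bag-of-words
G_train = project_to_topics(X_train, F)   # 80 documents x 20 topics
G_test = project_to_topics(X_test, F)     # 40 documents x 20 topics
```

The classifier is then trained on G_train instead of the raw, much wider X_train.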
We apply 5-fold cross-validation for classification, and results are averaged over 20 runs, where the observations in each run are randomly assigned to folds with stratified sampling via the caret package (Kuhn, 2008) to avoid unbalanced training and testing sets. We compare our model with four other NMF methods under the same procedure. For tf-idf weighting, we apply our continuous model and compare its performance with NMF (Lee and Seung, 2001), ONMF (Kimura et al., 2015), and SemiNMF (Ding et al., 2010). For binary weighting, we compare against logNMF (Tomé et al., 2015). The stopping criteria are either reaching the maximum of 200 iterations or a change in the cost function (11) or (14) between subsequent iterations below the tolerance. During our experiments, we noticed that the step size suggested by Tomé et al. (2015) was too large for logNMF on these data sets, leading to unstable performance; we therefore used a smaller step size for logNMF in these experiments. We use LASSO (Tibshirani, 1996) via the glmnet package (Friedman et al., 2010) for classification. We show that our method of factorization not only improves the classification result over the naive bag-of-words but also outperforms the other matrix factorization methods. In addition, we present the average residual from the factorization and the sparsity percentages of the solutions. Note that these two measurements are computed from the full bag-of-words matrix, not from the training sets used in cross-validation.
5.3.1 Results for Classification of Triage Notes
The seven triage data sets considered in this study are given in the table below, together with the dimensions of each data set, the baseline accuracy of classifying all observations as discharged, and the classification accuracy of LASSO on the document-word matrix.
Data sets  Documents  Words  Baseline  LASSO (tf-idf)  LASSO (binary)
Altered Level of Consciousness  5220  5126  48.85  73.79  73.65
Cough  13084  5876  84.35  87.25  87.30
Fever  7302  4770  77.20  81.33  81.45
General Weakness  7442  5455  47.79  69.34  69.45
Lower Extremity Injury  12377  5180  82.50  88.36  88.45
Shortness of Breath  9322  4659  55.04  74.24  74.17
Stroke  5036  3869  45.17  74.41  74.19
The factorization and classification results for the first two datasets are provided in the following tables. Note that the sparsity of the solutions and classification accuracy are given in percentages.
For the continuous case, the mean residual exhibits a consistent and reasonable pattern across the presented data sets. SemiNMF has the lowest mean residual, followed closely by our method, which outperforms both ONMF and NMF. This is expected, as the mean residual depends on the number of constraints imposed on the model. Both ONMF and NMF yield significantly sparser results than the mixed-sign variants. Our model yields increasingly sparse solutions in the G matrix as the number of topics increases, which is advantageous for interpretation compared to the dense solutions found by SemiNMF. On the other hand, neither binary method yields sparse results.
Depending on the number of basis vectors, applying the factorization and projection changes performance relative to the bag-of-words model. We also observe that our model retains higher classification accuracy than the other methods when underfitting. In general, classification performance in the continuous case is better than in the binary case, with logNMF performing significantly worse than the others; we therefore recommend the continuous bag-of-words and factorization method in practice. Among these models, the methods with an orthogonality constraint show a slight improvement over the unconstrained ones, as orthogonality gives a clearer cluster representation. Since orthogonality implies uncorrelated projection directions, it further improves the performance of the LASSO model by eliminating multicollinearity during model fitting. Furthermore, SONMF and SemiNMF both perform consistently better than their nonnegative counterparts, perhaps owing to the less confined parameter space allowed for F in the factorization, which is reflected in a more accurate representation of the original data set.
SONMF generally shows an overall improvement over the other methods, with significantly better performance than the nonnegative variants. The classification accuracy increases with the number of topics, but with diminishing returns. Based on our experiments, using more than 150 basis vectors does not provide a substantial improvement in classification performance. Aside from overfitting, the computational cost of factorizing such a high-dimensional bag-of-words matrix increases sharply with K, so the trade-off is not warranted.
5.3.2 Interpretation of Word Clusters
The above basis vectors can be interpreted as topics through a latent connection between the words and documents under the bag-of-words model. Since the word-topic matrix F is strictly orthogonal, our formulation of the problem yields orthogonal topic vectors. Orthogonal topic vectors have previously been considered by Ding et al. (2006). In their formulation, each word can belong to only a single cluster, which creates distinct, uncorrelated topics with non-overlapping meanings. However, we believe this implication is too strong, as most words have multiple meanings and could belong to multiple topics. Relaxing nonnegativity on the orthogonal matrix thus preserves the interpretation of uncorrelated topic vectors while allowing words to belong to multiple clusters.
Aside from uncorrelated topic vectors, representing the loading of words through a linear combination of positive and negative terms also leads to a different interpretation from the conventional by-parts representation. A positive loading indicates that a word belongs to the cluster with positive strength, while a negative loading represents the distance of a word from the topic. Words with large positive values under a topic are the most representative of that topic, and also tend to appear together and be highly correlated with each other. Conversely, words with large negative values deviate significantly from the topic and can be viewed as belonging to a separate cluster, acting as antonyms of the positive words within the same column. This interpretation creates a biclustering structure within a topic, in contrast to the zero representation of words in the nonnegative case. As shown in our empirical results, the topic vectors in F are not sparse; interpretation of a topic vector should therefore rely on the words with the largest absolute loadings, while words with small magnitudes can be regarded as noise.
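Reading off a topic in this biclustering fashion amounts to sorting one column of F by its loadings; a small sketch (the function name and inputs are ours, for illustration):

```python
import numpy as np

def top_words(F, vocab, topic, n=5):
    """For one topic column of F, return the n words with the largest
    positive loadings (most representative of the topic) and the n words
    with the most negative loadings (the contrasting sub-cluster)."""
    order = np.argsort(F[:, topic])          # ascending by loading
    pos = [vocab[i] for i in order[::-1][:n]]  # largest loadings first
    neg = [vocab[i] for i in order[:n]]        # most negative loadings first
    return pos, neg
```

Applied to each column of the fitted F, this produces tables like the "Positive"/"Negative" word lists shown below.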
In this section, we present examples of the wordtopic vectors generated by our method from the ’Lower Extremity Injury’ and ’Symptoms of Stroke’ data sets.
Lower Extremity Injury  
Topic 1  Topic 2  Topic 3  Topic 4  Topic 5 
Positive  
hip  weight  knee  xray  ago 
rotate  bear  pedal  done  day 
glf  ankle  puls  fracture  week 
break  ubl  strong  told  soccer 
morphin  roll  fell  show  twist 
Negative  
ankle  knee  ankle  alright  head 
knee  ambulance  ago  date  land 
lower  ice  day  now  pulse 
swell  fall  soccer  able  pedal 
calf  land  play  page  back 
Symptoms of Stroke  
Topic 1  Topic 2  Topic 3  Topic 4  Topic 5 
Positive  
left  team  right  equal  episode 
side  stroke  side  grip  min 
leg  see  eye  strong  last 
arm  aware  facial  steady  resolve 
weak  place  face  strength  approx 
Negative  
deny  deny  left  hours  day 
note  resolve  family  tingly  note 
right  symptom  weak  unsteady  onset 
episode  home  state  sudden  side 
state  week  confus  yesterday  place 
From Table 8, each topic specifically mentions the location and cause of the injury. Examining the words with positive weights, we can interpret Topic 1 as mainly about injury from falling, Topic 2 about ankle injuries, Topic 3 about knee injuries from biking, Topic 4 about x-rays, and Topic 5 about soccer. There is no overlap of meaning between topics, and the words with negative weights under the same topic refer to a completely different location and cause. For instance, Topics 3 and 5 are almost complete opposites of each other. The meaning of the topics also fits our intuition: it is unlikely that patients who injured their knees while biking would also twist their ankles playing soccer, since people tend to refrain from further physical activity once either injury occurs. The contrast representation is even more evident in the Symptoms of Stroke data set, as shown in Table 9. For Topics 1 and 3, our model correctly distinguishes 'left' from 'right'. For Topic 4, the words 'steady' and 'unsteady' are placed with opposite signs under the same topic. These examples show that our model is able not only to cluster correlated terms meaningfully but also to differentiate between dissimilar words.
6 Conclusion
In this paper, we proposed the semi-orthogonal nonnegative matrix factorization (SONMF) as a new extension of existing NMF methods and provided formulations for the continuous and binary cases respectively. Our model relaxes the nonnegativity constraint and adds an orthogonality constraint on F while retaining the nonnegativity of G. Implementing the Cayley transformation within our update scheme yields a strictly orthogonal solution, a vital property not present in the existing literature. Additionally, the orthogonal formulation yields uncorrelated topic vectors for text mining applications, along with a biclustering structure within each topic.
For matrices with continuous entries, our model uses a hierarchical alternating least squares update scheme for G and a line search algorithm along the Stiefel manifold for F to preserve strict orthogonality. The strict orthogonality further simplifies the column-wise update scheme for G to a matrix-wise nonnegative least squares update. For binary matrices, our model applies a logistic regression formulation to approximate the underlying true probability matrix, incorporating second-order information through Newton's method to update G and using the same line search algorithm to update F as in the continuous case. Through numerical experiments and real data, we have shown the competitiveness of our method in terms of factorization accuracy, degree of orthogonality, rate of convergence, classification accuracy, and interpretability in text mining applications.
Similar to other NMF methods, our model is capable of performing clustering effectively. Since our G is only nonnegative, our model has a soft clustering interpretation, where each observation or feature can be mapped to different clusters (Ding et al., 2006, 2010). The orthogonality constraint on F enforces that the identified cluster centroids are orthogonal to each other, which maximizes the contrast and separation between clusters. The relationship with clustering can also be drawn through the connection with PCA/SVD, where X = UDV^T. Absorbing D into V, we can rewrite this as X = U(VD)^T, where U is orthogonal and both U and VD have mixed signs. The F matrix in our formulation is thus similar to U, and our G can be viewed as VD with the additional enforcement of nonnegativity. The matrix G can also be viewed as the principal components and F as the rotation matrix. It has been well studied in previous literature that PCA effectively performs K-means clustering (Zha et al., 2002; Ding et al., 2005). In this sense, G is the cluster indicator for the k clusters, and the principal directions project all data points into the subspace spanned by the k cluster centroids (Ding et al., 2005, 2006).
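The PCA/SVD analogy above can be verified numerically: absorbing the singular values D into V recovers X exactly as the product of an orthonormal factor (the analogue of F) and a mixed-sign factor (the analogue of G, minus the nonnegativity constraint). A short sketch:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.standard_normal((30, 20))

# Thin SVD: X = U diag(d) Vt, with U having orthonormal columns.
U, d, Vt = np.linalg.svd(X, full_matrices=False)

# Absorb the singular values D into V: V_tilde = V D.
V_tilde = Vt.T * d

# X is exactly recovered as U @ V_tilde^T, with U orthonormal
# (analogous to F) and V_tilde mixed-sign (analogous to G without
# the nonnegativity constraint).
recon = U @ V_tilde.T
```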
Additional work can be conducted in the future on adding sparsity and smoothness to the model, as well as on mitigating the overfitting problem that is prevalent in logistic regression formulations. Another interesting direction is to investigate the convergence rate of optimization on manifolds. Boumal et al. (2016) recently showed that Riemannian gradient descent methods (of which optimization on the Stiefel manifold is a special case) have a convergence rate of O(1/ε²), where ε is the tolerance level. Thus, we believe that similar convergence rates apply to our method. Finally, a drawback of our model is that the projection used to preserve nonnegativity is still ad hoc. To investigate the theoretical convergence properties of SONMF, a new update scheme for G that preserves nonnegativity will be needed.
7 Acknowledgments
This research is partially supported by NSF DMS-1613190 and DMS-182119. We thank the two anonymous reviewers and the action editor for their valuable comments and feedback.
8 Appendix
8.1 Additional Results for Triage Data Sets
In this section, we present the results of the remaining five data sets in alphabetical order.