A Block Coordinate Descent-based Projected Gradient Algorithm for Orthogonal Non-negative Matrix Factorization

03/23/2020
by Soodabeh Asadi, et al.
University of Ljubljana; FHNW

This article utilizes the projected gradient (PG) method for a non-negative matrix factorization (NMF) problem in which one or both matrix factors must have orthonormal columns or rows. We penalise the orthonormality constraints and apply the PG method via a block coordinate descent approach: at each step one matrix factor is fixed and the other is updated by moving along the steepest descent direction computed from the penalised objective function and projecting onto the space of non-negative matrices. Our method is tested on two sets of synthetic data for various values of the penalty parameters. The performance is compared to the well-known multiplicative update (MU) method of Ding (2006) and to a modified, globally convergent variant of the MU algorithm recently proposed by Mirzal (2014). We provide extensive numerical results coupled with appropriate visualizations, which demonstrate that our method is very competitive and usually outperforms the other two methods.

1 Introduction

1.1 Motivation

Many machine learning applications require processing large and high-dimensional data. The data could be images, videos, kernel matrices, spectral graphs, etc., represented as an n × m matrix R. The data size and the amount of redundancy increase rapidly when n and m grow. To make the analysis and the interpretation easier, it is favorable to obtain a compact and concise low-rank approximation of the original data R. Such low-rank approximations are known to be very efficient in a wide range of applications, such as text mining [2, 27, 30], document classification [3], clustering [20, 32], spectral data analysis [2, 13], face recognition [35], and many more.

There exist many different low-rank approximation methods. For instance, two well-known strategies, broadly used for data analysis, are the singular value decomposition (SVD) [10] and principal component analysis (PCA) [12]. Much real-world data is non-negative, and the related hidden parts express physical features only when the non-negativity holds. The factorizing matrices in SVD or PCA can have negative entries, which makes it hard or impossible to give them a physical interpretation. Non-negative matrix factorization was introduced as an attempt to overcome this drawback, i.e., to provide the desired low-rank non-negative matrix factors.

1.2 Problem formulation

A non-negative matrix factorization (NMF) problem is the problem of factorizing the input non-negative matrix R ∈ ℝ_+^{n×m} into the product of two lower-rank non-negative matrices U ∈ ℝ_+^{n×k} and V ∈ ℝ_+^{k×m}:

    R ≈ U V,                                                            (1)

where R usually corresponds to the data matrix, U represents the basis matrix, and V is the coefficient matrix. With k we denote the number of factors (the inner dimension), for which it is desired that k ≪ min{n, m}. If we consider each of the m columns of R to be a sample of n-dimensional vector data, the factorization (1) represents each instance (column of R) as a non-negative linear combination of the columns of U, where the coefficients correspond to the columns of V. The columns of U can therefore be interpreted as the pieces that constitute the data R. To compute U and V, condition (1) is usually rewritten as a minimization problem using the Frobenius norm:

    min_{U ≥ 0, V ≥ 0} ||R − U V||_F^2.                                 (NMF)

It has been demonstrated in certain applications that the performance of the standard NMF in (NMF) can often be improved by adding auxiliary constraints, which could be sparseness, smoothness, or orthogonality. Orthogonal NMF (ONMF) was introduced by Ding et al. [9]. To improve the clustering capability of the standard NMF, they imposed orthogonality constraints on the columns of U or on the rows of V. Considering the orthogonality on the columns of U, it is formulated as follows:

    min_{U ≥ 0, V ≥ 0} ||R − U V||_F^2   s.t.  U^T U = I.               (ONMF)

If we enforce orthogonality on the columns of U and on the rows of V, we obtain the bi-orthogonal ONMF (bi-ONMF), which is formulated as

    min_{U ≥ 0, V ≥ 0} ||R − U V||_F^2   s.t.  U^T U = I,  V V^T = I,   (bi-ONMF)

where I denotes the (k × k) identity matrix.
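To make the orthogonality constraints more tangible, note that a non-negative matrix with orthonormal columns necessarily has at most one non-zero entry per row (otherwise two columns would have a strictly positive inner product). The following small NumPy sketch, added here purely as an illustration with arbitrary sizes and entries, constructs such a matrix U and verifies U^T U = I numerically.

import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 3

# Non-negative U with orthonormal columns: one positive entry per row,
# placed so that every column is hit at least once, then each column is
# normalized to unit Euclidean norm.
U = np.zeros((n, k))
U[np.arange(n), np.arange(n) % k] = rng.random(n) + 0.1
U /= np.linalg.norm(U, axis=0)

print(np.allclose(U.T @ U, np.eye(k)))   # True: U >= 0 and U^T U = I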

1.3 Related work

NMF was first studied by Paatero et al. [26, 1] and was made popular by Lee and Seung [18, 19]. There are several existing methods to solve (NMF). The most widely used approach to minimize (NMF) is the simple MU method proposed by Lee and Seung [18, 19]. In Chu et al. [6], several gradient-type approaches are mentioned; Chu et al. reformulated (NMF) as an unconstrained optimization problem and then applied the standard gradient descent method. Considering both U and V as variables in (NMF), the objective function is obviously non-convex. However, considering U and V separately, we obtain two convex sub-problems. Accordingly, a block coordinate descent (BCD) approach [19] can be applied to obtain values of U and V that correspond to a local minimum of the objective function. Generally, the scheme adopted by BCD algorithms is to recurrently update only blocks of variables while the remaining variables are kept fixed. NMF methods which adopt this optimization technique are, e.g., the MU rule [18], the active-set-like method [16], and the PG method for NMF [21]. In [21], two PG methods were proposed for the standard NMF. The first one is an alternating least squares (ALS) method using projected gradients: one factor is fixed first and a new value of the other factor is obtained by PG; then, with the latter fixed at its new value, the PG method looks for a new value of the former. The objective function in each least squares problem is quadratic. This enabled the author to use the Taylor expansion of the objective function to obtain a condition equivalent to the Armijo rule while checking the sufficient decrease of the objective function as a termination criterion in the step-size selection procedure. The other method proposed in [21] is a direct application of the PG method to (NMF). There is also a hierarchical ALS method for NMF, originally proposed in [7, 11] as an improvement of the ALS method; it consists of a BCD method with single component vectors as coordinate blocks.

As the original ONMF algorithms in [20, 32] and their variants [33, 34, 5] are all based on the MU rule, there is no convergence guarantee for these algorithms. For example, Ding et al. [9] only prove that the successive updates of the orthogonal factors converge to a local minimum of the problem. Because the orthogonality constraints cannot be rewritten into a non-negativity-constrained ALS framework, convergent algorithms for the standard NMF (e.g., see [21, 15, 14, 17]) cannot be used for solving the ONMF problems. Thus, no convergent algorithm was available for ONMF until recently. Mirzal [24] developed a convergent algorithm for ONMF. The proposed algorithm was designed by generalizing the work of Lin [23], in which a convergent algorithm was provided for the standard NMF based on a modified version of the additive update (AU) technique of Lee and Seung [19]. Mirzal [24] proves global convergence of his algorithm for the ONMF problem: he first proves the non-increasing property of the objective function along the sequence of iterates, secondly he shows that every limit point of the generated sequence is a stationary point, and finally he proves that the sequence of iterates possesses a limit point.

1.4 Our contribution

In this paper, we consider the penalty reformulations of (ONMF) and (bi-ONMF), i.e., we add the orthogonality constraints, multiplied by penalty parameters, to the objective function. The main contributions are:

  • We develop an algorithm for (ONMF) and (bi-ONMF) which is essentially a BCD algorithm, in the literature also known as alternating minimization, coordinate relaxation, the Gauss–Seidel method, subspace correction, domain decomposition, etc., see e.g. [4, 29]. For each block optimization, we use a PG method with the Armijo rule to find a suitable step-size.

  • We construct synthetic data sets of instances for (ONMF) and (bi-ONMF), for which we know the optimum value by construction.

  • We use MATLAB [31] to implement our algorithm and two well-known (MU-based) algorithms: the algorithm of Ding [9] and of Mirzal [24]. The code is available upon request.

  • The implemented algorithms are compared on the constructed synthetic data sets in terms of: (i) the accuracy of the reconstruction, and (ii) the deviation of the factors from orthonormality. Accuracy is measured by the so-called root-square error (RSE), defined as

    RSE = ||R − U V||_F / ||R||_F,                                      (2)

    and the deviations from orthonormality are computed using formulas (17) and (18) from Sect. 4 (a small illustrative computation of both quantities is sketched right after this list). Our numerical results show that our algorithm is very competitive and almost always outperforms the MU algorithms.
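As a rough illustration of these two quality measures, the following NumPy sketch computes the relative error (2) and a Frobenius-norm deviation of U^T U from the identity; the exact infeasibility measures used in the experiments are the ones given by (17) and (18) in Sect. 4, which may be normalized differently.

import numpy as np

def rse(R, U, V):
    # Root-square error (2): relative Frobenius-norm residual of R ≈ U V.
    return np.linalg.norm(R - U @ V, 'fro') / np.linalg.norm(R, 'fro')

def orth_deviation(U):
    # Deviation of the columns of U from orthonormality, ||U^T U - I||_F
    # (the infeasibility measures (17)-(18) are of this type).
    k = U.shape[1]
    return np.linalg.norm(U.T @ U - np.eye(k), 'fro')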

1.5 Notations

Some notation used throughout this work is described here. We denote scalars and indices by lower-case Latin letters, vectors by lower-case boldface Latin letters, and matrices by capital Latin letters. ℝ^{n×m} denotes the set of n-by-m real matrices, and I symbolizes the identity matrix. We use the notation ∇f to denote the gradient of a real-valued function f. We define A+ and A− as the positive and (unsigned) negative parts of a matrix A, respectively, i.e., A = A+ − A−. The symbols ⊙ and ⊘ denote element-wise multiplication and element-wise division, respectively.

1.6 Structure of the paper

The rest of our work is organized as follows. In Sect. 2, we review the well-known MU method and the rules used for updating the factors in each iteration of our computations. We also outline the globally convergent MU variant of Mirzal [24]. We then present our PG method and discuss its stopping criteria. Sect. 4 presents the synthetic data and the results obtained by the three decomposition methods presented in Sects. 2 and 3. This is done both for the problem (ONMF) and for (bi-ONMF). Some concluding remarks are given in Sect. 5.

2 Existing methods to solve (NMF)

2.1 MU method of Ding [9]

Several popular approaches to solve (NMF) are based on so-called MU algorithms, which are simple to implement and often yield good results. The MU algorithms originate from the work of Lee and Seung [19]. Various MU variants were later proposed by several researchers; for an overview see [8]. At each iteration of these methods, the elements of U and V are multiplied by certain updating factors.

As already mentioned, (ONMF) was proposed by Ding et al. [9] as a tool to improve the clustering capability of the associated optimization approaches. To adapt the MU algorithm to this problem, they employed standard Lagrangian techniques: they introduced a Lagrangian multiplier Λ (a symmetric matrix of size k × k) for the orthogonality constraint and minimized the Lagrangian function, in which the orthogonality constraint is moved into the objective function as the term tr(Λ(U^T U − I)). The complementarity conditions from the related KKT conditions can be rewritten as a fixed-point relation, which finally leads to the following MU rule for (ONMF):

(3)

They extended this approach to a non-negative three-factor factorization in which two of the factors are required to satisfy orthogonality conditions; this is a generalization of (bi-ONMF). The MU rules (28)–(30) from [9], adapted to (bi-ONMF), are the main ingredients of Algorithm 2.1, which we will call Ding's algorithm.

 

Algorithm 1. Ding’s MU algorithm for (bi-ONMF)

 

INPUT: R, inner dimension k.

  1. Initialize: generate U as an n × k random matrix and V as a k × m random matrix.

  2. Repeat

    (4)
  3. Until convergence or a maximum number of iterations or maximum time is reached.

OUTPUT: U, V.

 

Algorithm 2.1 converges in the sense that the solution pairs U and V generated by this algorithm yield a sequence of decreasing RSEs, see [9, Theorems 5, 7].

If U or V has a zero column or row, a division by zero may occur. Moreover, denominators close to zero may also cause numerical problems. To avoid this situation, we follow [28] and add a small positive number to the denominators of the MU terms (4). Note that Algorithm 2.1 can easily be adapted to solve (ONMF) by replacing the second MU rule from (4) with the second MU rule of (3).
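For concreteness, the sketch below shows one iteration of the classical Lee–Seung MU rules for the plain problem (NMF), with a small positive number added to the denominators as discussed above. It is only an illustration of the multiplicative-update mechanism and of the eps-stabilization, not a reproduction of Ding's orthogonality-aware rules (3)–(4).

import numpy as np

def mu_step(R, U, V, eps=1e-9):
    # One Lee-Seung multiplicative update for min ||R - U V||_F^2, U, V >= 0.
    # eps stabilizes the element-wise divisions (cf. [28]); Ding's rules
    # (3)-(4) for (ONMF)/(bi-ONMF) use different numerators and denominators.
    U = U * (R @ V.T) / (U @ V @ V.T + eps)
    V = V * (U.T @ R) / (U.T @ U @ V + eps)
    return U, V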

2.2 MU method of Mirzal [24]

In [24], Mirzal proposed an algorithm for (ONMF) which is designed by generalizing the work of Lin [23]. Mirzal used the so-called modified additive update rule (the MAU rule), in which the update term is added to the current value of each of the factors. This additive rule was used by Lin in [23] in the context of the standard NMF. Lin also provided in his paper a convergence proof, stating that the iterates generated by his algorithm converge in the sense that the objective value is decreasing and the limit point is a stationary point. In [24], Mirzal discussed the orthogonality constraint on the rows of the coefficient factor, while in [25] the same results are developed for the case of (bi-ONMF).

Here we review Mirzal's algorithm for (bi-ONMF), presented in the unpublished paper [25]. This algorithm actually solves the equivalent problem (pen-ONMF), in which the orthogonality constraints are moved into the objective function (the so-called penalty approach) and the importance of the orthogonality constraints is controlled by the penalty parameters α and β:

    min_{U ≥ 0, V ≥ 0}  F(U, V) := ||R − U V||_F^2 + α ||U^T U − I||_F^2 + β ||V V^T − I||_F^2.        (pen-ONMF)

The gradients of the objective function F from (pen-ONMF) with respect to U and V are:

    ∇_U F(U, V) = 2 (U V − R) V^T + 4 α U (U^T U − I),
    ∇_V F(U, V) = 2 U^T (U V − R) + 4 β (V V^T − I) V.                  (5)

For the objective function in (pen-ONMF), Mirzal proposed the MAU rules, which use modified iterates Ū and V̄ instead of U and V in order to avoid the zero locking phenomenon [24, Section 2]:

(6)
(7)

where δ is a small positive number.

Note that the algorithms working with the MU rules for (pen-ONMF) must be initialized with positive matrices to avoid zero locking from the start, but non-negative matrices can be used to initialize the algorithm working with the MAU rules (see [25]).

Mirzal [25] used the MAU rules with some modifications: he considers δ_U and δ_V in order to guarantee the non-increasing property of the objective value, and increases them by a constant step until this property is satisfied. Here, δ_U and δ_V are the values added within the MAU terms to the denominators of the update terms for U and V, respectively. The algorithm proposed by Mirzal [25] is summarised as Algorithm 2.2 below.
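The gradients (5) of the penalized objective can be evaluated directly. The following sketch is our own illustration (not Mirzal's code), and it assumes that the penalty terms are squared Frobenius norms exactly as written in (pen-ONMF) above.

import numpy as np

def grad_pen_onmf(R, U, V, alpha, beta):
    # Gradients (5) of F(U, V) = ||R - U V||_F^2
    #   + alpha * ||U^T U - I||_F^2 + beta * ||V V^T - I||_F^2.
    k = U.shape[1]
    E = U @ V - R
    grad_U = 2 * E @ V.T + 4 * alpha * U @ (U.T @ U - np.eye(k))
    grad_V = 2 * U.T @ E + 4 * beta * (V @ V.T - np.eye(k)) @ V
    return grad_U, grad_V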

 

Algorithm 2. Mirzal’s algorithm for bi-ONMF [25]

 

INPUT: R, inner dimension k, maximum number of iterations maxit, small positive δ, small positive step used to increase δ_U and δ_V.

  1. Compute initial U and V.

  2. For t = 1, …, maxit:
     update U and V by the MAU rules (6) and (7), increasing δ_U and δ_V by the constant step whenever this is needed to guarantee that the objective value does not increase.

OUTPUT: U, V.

 

3 PG method for (ONMF) and (bi-ONMF)

3.1 Main steps of PG method

In this subsection we adapt the PG method proposed by Lin [21] to solve both (ONMF) and (bi-ONMF). Lin applied PG to (NMF) in two ways. The first approach is actually a BCD method: it consecutively fixes one block of variables (U or V) and minimizes the simplified problem in the other variable. The second approach by Lin directly minimizes (NMF). Lin's main focus was on the first approach, and we follow it. We again solve the penalised version of the problem, (pen-ONMF), by the block coordinate descent method, which is summarised in Algorithm 3.1.

 

Algorithm 3.1. BCD method for (pen-ONMF)

 

INPUT: R, inner dimension k, initial matrices U^0, V^0.

  1. Set U = U^0, V = V^0.

  2. Repeat

    • Fix V and compute a new U as follows:

      U ← argmin_{U ≥ 0} F(U, V)                                        (8)
    • Fix U and compute a new V as follows:

      V ← argmin_{V ≥ 0} F(U, V)                                        (9)
  3. Until some stopping criterion is satisfied

OUTPUT: U, V.

 

The objective function in (pen-ONMF) is no longer quadratic in each block of variables, so we lose the nice properties of the Armijo rule that represent advantages for Lin. We nevertheless managed to use the Armijo rule directly and still obtained good numerical results, see Sect. 4.

We refer to (8) and (9) as sub-problems. Obviously, solving these sub-problems in every iteration could be more costly than one iteration of Algorithm 2.1 or 2.2. Therefore, we must find effective methods for solving these sub-problems. Similarly to Lin, we apply the PG method to solve the sub-problems (8)–(9). Algorithm 3.2 contains the main steps of the PG method for solving the latter and can be straightforwardly adapted for the former.

For the sake of simplicity, we denote by F_U the function that we optimize in (8), which is actually a simplified version (terms not depending on U removed) of the objective function from (pen-ONMF) for V fixed:

    F_U(U) = ||R − U V||_F^2 + α ||U^T U − I||_F^2.

Similarly, for U fixed, the objective function from (9) will be denoted by

    F_V(V) = ||R − U V||_F^2 + β ||V V^T − I||_F^2.

In Algorithm 3.2, [·]_+ is the projection operator which projects the new point (matrix) onto the cone of non-negative matrices (we simply set the negative entries to 0). Inequality (10) is the Armijo rule used to find a suitable step-size λ_t guaranteeing a sufficient decrease. Searching for λ_t is a time-consuming operation, therefore we strive to test only a small number of trial step sizes in each iteration. Similarly to Lin [21], we allow λ_t to take any positive value. More precisely, we start from an initial step size and, if the Armijo rule (10) is satisfied, we increase the step size by dividing it by the updating factor γ < 1. We repeat this until (10) is no longer satisfied or the same matrix as in the previous trial is obtained. If the starting step size does not yield a matrix satisfying the Armijo rule (10), then we decrease it by the factor γ and repeat this until (10) is satisfied. The numerical results obtained using different values of the parameters γ (updating factor for the step size) and σ (parameter used to check (10)) are reported in the following subsections.

 

Algorithm 3.2. PG method using the Armijo rule to solve sub-problem (9)

 

INPUT: R, the fixed factor U, parameters σ and γ, and initial V^0.

  1. Set t = 0.

  2. Repeat

    • Find a step size λ_t (using the updating factor γ) such that for V^{t+1} = [ V^t − λ_t ∇F_V(V^t) ]_+ we have

      F_V(V^{t+1}) − F_V(V^t) ≤ σ ⟨∇F_V(V^t), V^{t+1} − V^t⟩,           (10)

      where ⟨·,·⟩ denotes the Frobenius inner product;
    • Set V^{t+1} as the new iterate;

    • Set t = t + 1;

  3. Until some stopping criterion is satisfied.

OUTPUT: V.
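The following NumPy sketch illustrates one possible implementation of Algorithm 3.2 for the sub-problem (9). It is an illustration under the assumptions made above (the penalized objective as written in (pen-ONMF) and placeholder parameter values), not the authors' MATLAB code. For brevity, the step-size search here only decreases λ by the factor γ when the Armijo condition (10) fails, whereas Algorithm 3.2 also tries to increase λ when (10) already holds.

import numpy as np

def F_V(R, U, V, beta):
    # Sub-problem objective from (9): pen-ONMF objective with U fixed.
    k = V.shape[0]
    return (np.linalg.norm(R - U @ V, 'fro') ** 2
            + beta * np.linalg.norm(V @ V.T - np.eye(k), 'fro') ** 2)

def grad_F_V(R, U, V, beta):
    # Gradient of F_V, cf. (5).
    k = V.shape[0]
    return 2 * U.T @ (U @ V - R) + 4 * beta * (V @ V.T - np.eye(k)) @ V

def pg_subproblem_V(R, U, V, beta, sigma=0.01, gamma=0.1, max_iter=20, lam=1.0):
    # Projected gradient with Armijo backtracking for sub-problem (9).
    # sigma, gamma and lam are placeholder values, not the paper's settings.
    for _ in range(max_iter):
        g = grad_F_V(R, U, V, beta)
        f_old = F_V(R, U, V, beta)
        while True:
            V_new = np.maximum(V - lam * g, 0.0)   # projection onto V >= 0
            # Armijo sufficient-decrease condition, cf. (10)
            if F_V(R, U, V_new, beta) - f_old <= sigma * np.sum(g * (V_new - V)):
                break
            lam *= gamma                           # shrink the step size
            if lam < 1e-16:
                return V
        if np.array_equal(V_new, V):               # no further progress
            return V
        V = V_new
    return V

An analogous routine with F_U and its gradient handles the sub-problem (8), and the outer loop of Algorithm 3.1 simply alternates between the two.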

 

3.2 Stopping criteria for Algorithms 3.1 and 3.2

As practiced in the literature (see, e.g., [22]), in a constrained optimization problem min f(x) subject to x ≥ 0, a common condition to check whether a point x is close to a stationary point is

    ||∇^P f(x)|| ≤ ε,                                                   (11)

where f is the differentiable function that we try to optimize and ∇^P f is the projected gradient, defined component-wise as

    (∇^P f(x))_i = ∇f(x)_i           if x_i > 0,
                   min{0, ∇f(x)_i}   if x_i = 0,                        (12)

and ε is a small positive tolerance. For Algorithm 3.1, (11) becomes

(13)

We impose a time limit in seconds and a maximum number of iterations for Algorithm 3.1 as well. Following [21], we also define stopping conditions for the sub-problems. The matrices U and V returned by Algorithm 3.2 when solving the sub-problems (8) and (9), respectively, must satisfy

(14)

where

(15)

and ε is the same tolerance as used in (13). If the PG method for solving the sub-problem (8) or (9) stops after the first iteration, then we decrease the corresponding stopping tolerance as follows:

(16)

where the constant used in (16) is smaller than 1.
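As a small illustration of the criterion (11)–(12), the projected gradient and its norm can be computed as follows (an illustrative sketch; here X stands for the non-negative variable, U or V, and G for the corresponding gradient).

import numpy as np

def projected_gradient(X, G):
    # Projected gradient (12) for the constraint X >= 0: equal to the
    # gradient where X > 0 and to min(0, gradient) where X = 0.
    return np.where(X > 0, G, np.minimum(G, 0.0))

def pg_norm(X, G):
    # Norm of the projected gradient, used in stopping criteria such as (11).
    return np.linalg.norm(projected_gradient(X, G), 'fro')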

4 Numerical results

In this section we demonstrate how the PG method described in Sect. 3 performs compared to the MU-based algorithms of Ding and Mirzal, which were described in Subsections 2.1 and 2.2, respectively.

4.1 Artificial data

We created two sets of synthetic data using MATLAB [31]. The first set we call the bi-orthonormal set (BION). It consists of instances of matrices R, which were created as products of U and V, where U has orthonormal columns while V has orthonormal rows. We created five instances of R for each pair n and k from Table 1.

Matrices U were created in two phases: firstly, we randomly (uniform distribution) selected a position in each row; secondly, we selected a random number (uniform distribution) for the selected position in each row. Finally, if it happened that after this procedure some column of U was zero or had a norm below a small threshold, we found the first non-zero element in the largest column of U (according to the Euclidean norm) and moved it into the zero column. We created V similarly.

Table 1: Pairs (n, k) for which we created the UNION and BION datasets

Each triple (R, U, V) was saved as a triple of txt files. For example, NMF_BIOG_data_R_n=200_k=80_id=5.txt contains the matrix R obtained by multiplying matrices U and V, which were generated as explained above. With id=5 we denote that this is the 5th matrix corresponding to this pair (n, k). The second set contains similar data to BION, but only one factor (U) is orthonormal, while the other (V) is non-negative but not necessarily orthonormal. We call this dataset uni-orthonormal (UNION). All computations were done using MATLAB [31] and the high performance computer available at the Faculty of Mechanical Engineering of the University of Ljubljana: an Intel Xeon X5670 (1536 hyper-cores) HPC cluster and an E5-2680 V3 (1008 hyper-cores) DP cluster, with an IB QDR interconnection, 164 TB of LUSTRE storage, 4.6 TB of RAM, and 24 TFlop/s of performance.
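The construction described above can be sketched as follows. This is an illustrative NumPy version, not the authors' MATLAB script; the interval from which the positive entries are drawn and the final column normalization are our assumptions.

import numpy as np

def random_orthonormal_nonneg(n, k, rng):
    # One random positive entry per row, at a uniformly chosen position.
    U = np.zeros((n, k))
    U[np.arange(n), rng.integers(0, k, size=n)] = rng.random(n)
    # Repair (near-)zero columns: move the first non-zero element of the
    # column with the largest Euclidean norm into the empty column.
    for j in range(k):
        if np.linalg.norm(U[:, j]) < 1e-12:
            big = int(np.argmax(np.linalg.norm(U, axis=0)))
            i = int(np.nonzero(U[:, big])[0][0])
            U[i, j], U[i, big] = U[i, big], 0.0
    # Normalize the columns so that U^T U = I (rows have disjoint supports).
    return U / np.linalg.norm(U, axis=0)

rng = np.random.default_rng(1)
n, k = 200, 80
U = random_orthonormal_nonneg(n, k, rng)    # orthonormal columns
V = random_orthonormal_nonneg(n, k, rng).T  # orthonormal rows
R = U @ V                                   # one BION-type instance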

4.2 Numerical results for UNION

In this subsection, we present the numerical results obtained by Ding's, Mirzal's, and our algorithm for the uni-orthogonal problem (ONMF), using the UNION data introduced in the previous subsection. We have adapted the latter two algorithms (Algorithms 2.2 and 3.1) to the UNION data by setting β = 0 in the problem formulation (pen-ONMF) and in all formulas underlying these two algorithms.

Recall that for UNION data we have, for each pair (n, k) from Table 1, five matrices for which we try to solve (ONMF) by Algorithms 2.1, 2.2 and 3.1. Note that all these algorithms demand as input the internal dimension, i.e., the number of columns of the factor U, which is in general not known in advance. Even though we know this dimension by construction for the UNION data, we tested the algorithms using internal dimensions equal to 20 %, 40 %, 60 %, 80 %, and 100 % of k. For the internal dimension equal to k, we know that the optimum of the problem is 0, so in this case we can also estimate how good the tested algorithms are at finding the global optimum.

The first question we had to answer was which value of α to use in Mirzal's and the PG algorithm. It is obvious that larger values of α move the focus from optimizing the RSE to guaranteeing orthonormality, i.e., feasibility for the original problem. We decided not to fix the value of α but to run both algorithms for several values of α and report all the results.

For each solution pair (U, V) returned by the algorithms, the non-negativity constraints hold by the construction of the algorithms, so we only need to consider the deviation of U from orthonormality, which we call infeasibility and define as

(17)

The computational results that follow in the rest of this subsection were obtained by setting the tolerance ε in the stopping criterion, the maximum number of iterations in Algorithm 3.1 (with 20 iterations allowed in Algorithm 3.2), and a time limit in seconds. Additionally, we fixed the Armijo parameter σ and the step-size updating factor γ used in Algorithm 3.2, as well as the constant from (16).

In general, Algorithm 3.1 converges to a solution in early iterations and the norm of the projected gradient falls below the tolerance shortly after running the algorithm.

Results in Tables 2 and 3 and their visualisations in Figures 1(a)–1(f) and 2(a)–2(f) confirm expectations. More precisely, we can see that the smaller the value of α, the better the RSE. Likewise, the larger the value of α, the smaller the infeasibility. In practice, we want to meet both criteria: small RSE and small infeasibility, so some compromise must be made. If RSE is more important than infeasibility, we choose a smaller value of α, and vice versa. We can also observe that regarding RSE the three compared algorithms do not differ a lot. However, when the input dimension approaches the real inner dimension k, Algorithm 3.1 comes closest to the global optimum 0. The situation with infeasibility is a bit different. While Algorithm 2.1 performs very well on all instances, Algorithm 2.2 reaches better feasibility on the smaller instances, and Algorithm 3.1 outperforms the others for the larger values of α.

n   inner dim. (% of k)   RSE of Alg. 2.1   RSE of Alg. 2.2 (four increasing values of α)   RSE of Alg. 3.1 (four increasing values of α)
50 40 0.3143 0.2965 0.3070 0.3329 0.3898 0.2963 0.3081 0.3425 0.3508
50 60 0.2348 0.2227 0.2356 0.2676 0.3459 0.2201 0.2382 0.2733 0.2765
50 80 0.1738 0.1492 0.1634 0.1894 0.3277 0.1468 0.1620 0.1953 0.2053
50 100 0.0002 0.0133 0.0004 0.0932 0.2973 0.0000 0.0000 0.0000 0.0000
100 20 0.4063 0.3914 0.3955 0.4063 0.4254 0.3906 0.3959 0.4083 0.4210
100 40 0.3384 0.3139 0.3210 0.3415 0.3677 0.3116 0.3210 0.3488 0.3625
100 60 0.2674 0.2462 0.2541 0.2730 0.2978 0.2403 0.2528 0.2801 0.2974
100 80 0.1847 0.1737 0.1581 0.1909 0.2263 0.1629 0.1744 0.1959 0.2090
100 100 0.0126 0.0532 0.0427 0.0089 0.1515 0.0000 0.0000 0.0000 0.0075
200 20 0.4213 0.4024 0.4077 0.4080 0.4257 0.4005 0.4032 0.4162 0.4337
200 40 0.3562 0.3315 0.3398 0.3401 0.3647 0.3270 0.3313 0.3497 0.3738
200 60 0.2845 0.2675 0.2746 0.2748 0.2955 0.2573 0.2617 0.2812 0.3061
200 80 0.1959 0.1958 0.2013 0.1996 0.2085 0.1773 0.1819 0.1960 0.2133
200 100 0.0191 0.0753 0.0632 0.0622 0.0415 0.0000 0.0000 0.0069 0.0181
500 20 0.4332 0.4120 0.4119 0.4120 0.4121 0.4092 0.4096 0.4197 0.4346
500 40 0.3711 0.3506 0.3509 0.3507 0.3505 0.3430 0.3440 0.3537 0.3753
500 60 0.3003 0.2919 0.2923 0.2916 0.2909 0.2756 0.2766 0.2845 0.3031
500 80 0.2098 0.2186 0.2192 0.2207 0.2151 0.1931 0.1941 0.1999 0.2122
500 100 0.0273 0.0822 0.0864 0.0853 0.0713 0.0002 0.0003 0.0002 0.0097
1000 20 0.4386 0.4195 0.4194 0.4193 0.4195 0.4156 0.4160 0.4216 0.4324
1000 40 0.3777 0.3641 0.3640 0.3638 0.3637 0.3545 0.3548 0.3588 0.3707
1000 60 0.3070 0.3047 0.3055 0.3051 0.3036 0.2881 0.2880 0.2906 0.3006
1000 80 0.2164 0.2265 0.2248 0.2254 0.2236 0.2024 0.2029 0.2050 0.2106
1000 100 0.0329 0.0725 0.0772 0.0761 0.0709 0.0173 0.0030 0.0035 0.0035
Table 2: RSE achieved by Algorithms 2.1, 2.2 and 3.1 on the UNION dataset. For each n we take all 10 matrices (five for each of the two values of k associated with this n in Table 1). We run all three algorithms on these matrices with inner dimensions equal to 20–100 % of the true k and with all tested values of α. Each row reports the average (arithmetic mean) RSE obtained on the instances corresponding to the given n and the given percentage of k. For example, the last row shows the average RSE over the 10 instances of dimension 1000 obtained by the three algorithms, for all four values of α, when run with the input dimension equal to k.
Figure 1: Visualisation of the data from Table 2 (six plots, omitted here). Panels (a), (c) and (e) show the values of RSE obtained for different values of α by Algorithms 2.1 and 2.2, and panels (b), (d) and (f) the values obtained by Algorithms 2.1 and 3.1, for three problem sizes. Regarding RSE, the performance of these algorithms on the UNION dataset does not differ a lot. As expected, larger values of α yield larger values of RSE, but the differences are rather small. However, when the input dimension approaches 100 % of k, Algorithm 3.1 comes closest to the global optimum 0.
n   inner dim. (% of k)   Infeas. of Alg. 2.1   Infeas. of Alg. 2.2 (four increasing values of α)   Infeas. of Alg. 3.1 (four increasing values of α)
50 20 0.0964 0.2490 0.0924 0.0155 0.0038 0.2298 0.0909 0.0154 0.0022
50 40 0.0740 0.1886 0.0676 0.0131 0.0040 0.1845 0.0670 0.0135 0.0023
50 60 0.0553 0.1324 0.0465 0.0068 0.0040 0.1245 0.0440 0.0091 0.0015
50 80 0.0324 0.0964 0.0241 0.0053 0.0034 0.0789 0.0250 0.0069 0.0020
50 100 0.0023 0.0257 0.0022 0.0023 0.0039 0.0000 0.0000 0.0000 0.0000
100 20 0.0774 0.2624 0.1441 0.0258 0.0064 0.2588 0.1308 0.0258 0.0036
100 40 0.0539 0.1754 0.0928 0.0168 0.0036 0.1654 0.0819 0.0182 0.0035
100 60 0.0400 0.1205 0.0545 0.0102 0.0024 0.1109 0.0487 0.0138 0.0033
100 80 0.0239 0.0890 0.0324 0.0062 0.0022 0.0623 0.0258 0.0083 0.0018
100 100 0.0062 0.0452 0.0153 0.0009 0.0016 0.0002 0.0000 0.0000 0.0000
200 20 0.0584 0.2157 0.1437 0.1433 0.0054 0.2087 0.1512 0.0348 0.0074
200 40 0.0356 0.1379 0.1004 0.1000 0.0036 0.1240 0.0806 0.0207 0.0053
200 60 0.0260 0.0955 0.0791 0.0793 0.0031 0.0754 0.0434 0.0143 0.0047
200 80 0.0154 0.0657 0.0634 0.0629 0.0017 0.0416 0.0218 0.0080 0.0026
200 100 0.0059 0.0412 0.0517 0.0512 0.0016 0.0002 0.0001 0.0002 0.0001
500 20 0.0332 0.1587 0.1894 0.1908 0.1908 0.1475 0.1268 0.0436 0.0087
500 40 0.0189 0.1155 0.1343 0.1349 0.1347 0.0770 0.0621 0.0227 0.0069
500 60 0.0134 0.0889 0.1095 0.1102 0.1055 0.0412 0.0312 0.0123 0.0038
500 80 0.0084 0.0656 0.0946 0.0954 0.0826 0.0300 0.0154 0.0061 0.0021
500 100 0.0050 0.0499 0.0847 0.0853 0.0693 0.0249 0.0003 0.0001 0.0001
1000 20 0.0211 0.1200 0.1344 0.1349 0.1350 0.1043 0.0970 0.0471 0.0097
1000 40 0.0122 0.0863 0.0951 0.0954 0.0954 0.0542 0.0422 0.0199 0.0059
1000 60 0.0073 0.0662 0.0776 0.0779 0.0779 0.0414 0.0205 0.0098 0.0037
1000 80 0.0045 0.0539 0.0671 0.0675 0.0675 0.0336 0.0103 0.0047 0.0018
1000 100 0.0040 0.0475 0.0600 0.0603 0.0604 0.0296 0.0066 0.0005 0.0003
Table 3: Feasibility (orthonormality) of the solutions U computed by Algorithms 2.1, 2.2 and 3.1 on the UNION dataset, i.e., the average infeasibility (17) of the solutions underlying Table 2.

Results from Table 3, corresponding to selected values of n, are depicted in Figures 2(a)–2(f).

Figure 2: Visualisation of the data from Table 3 (six plots, omitted here). Panels (a), (c) and (e) show the infeasibility obtained for different values of α by Algorithms 2.1 and 2.2, and panels (b), (d) and (f) the values obtained by Algorithms 2.1 and 3.1, for three problem sizes. Regarding infeasibility, the performance of these algorithms on the UNION dataset does not differ a lot. As expected, larger values of α yield smaller infeasibility, but the differences are rather small.

4.3 Numerical results for bi-orthonormal data (BION)

In this subsection we provide the same type of results as in the previous subsection, but for the BION dataset.

We used almost the same settings as for the UNION dataset: the tolerance ε, maxit, and the time limit were kept, while the parameters σ and γ were slightly changed (based on experimental observations). Additionally, we decided to take the same values for α and β in Algorithms 2.2 and 3.1, since the matrices in the BION dataset are symmetric and both orthogonality constraints are equally important. We computed the results for several values of α = β. In Tables 4–5 we report the average RSE and the average infeasibility, respectively, of the solutions obtained by Algorithms 2.1, 2.2, and 3.1. Since for this dataset we need to monitor how orthonormal both matrices U and V are, we adapt the measure of infeasibility as follows:

(18)
n   inner dim. (% of k)   RSE of Alg. 2.1   RSE of Alg. 2.2 (four increasing values of α = β)   RSE of Alg. 3.1 (four increasing values of α = β)
50 20 0.7053 0.7053 0.7053 0.7053 0.8283 0.7053 0.7053 0.7055 0.8259
50 40 0.6108 0.6108 0.6108 0.6108 0.9066 0.6108 0.6108 0.6108 0.6631
50 60 0.4987 0.4987 0.4987 0.5442 0.9665 0.4987 0.4987 0.4987 0.5000
50 80 0.3526 0.3671 0.3742 0.4497 1.0282 0.3526 0.3796 0.3527 0.4374
50 100 0.0607 0.1712 0.2786 0.5198 1.0781 0.1145 0.1820 0.2604 0.3689
100 20 0.7516 0.7516 0.7516 0.7517 0.9070 0.7516 0.7516 0.7517 0.8224
100 40 0.6509 0.6509 0.6509 0.7174 0.9779 0.6509 0.6509 0.6509 0.6514
100 60 0.5315 0.5315 0.5315 0.5504 1.0401 0.5315 0.5315 0.5315 0.5352
100 80 0.3758 0.3787 0.4106 0.4542 1.1082 0.3801 0.3888 0.3917 0.3898
100 100 0.1377 0.1993 0.3311 0.4898 1.1734 0.0457 0.1016 0.2758 0.3757
200 20 0.7884 0.7884 0.7884 0.7884 0.9499 0.7884 0.7884 0.7884 0.7888
200 40 0.6828 0.6828 0.6828 0.6828 1.0325 0.6828 0.6828 0.6828 0.6828
200 60 0.5575 0.5575 0.5575 0.5647 1.0938 0.5575 0.5575 0.5575 0.5610
200 80 0.3942 0.3942 0.3965 0.5019 1.1618 0.3942 0.3942 0.3942 0.4373
200 100 0.1447 0.1851 0.3014 0.5400 1.2297 0.0202 0.1429 0.2964 0.3315
500 20 0.8242 0.8242 0.8242 0.8242 0.9956 0.8242 0.8242 0.8242 0.8243
500 40 0.7138 0.7138 0.7138 0.7138 1.0679 0.7138 0.7138 0.7138 0.7138
500 60 0.5828 0.5828 0.5828 0.6045 1.1534 0.5828 0.5828 0.5828 0.5828
500 80 0.4121 0.4121 0.4203 0.5285 1.2160 0.4121 0.4121 0.4121 0.4334
500 100 0.1405 0.1814 0.3401 0.5854 1.2822 0.0067 0.1059 0.2044 0.3378
1000 20 0.8436 0.8436 0.8436 0.8436 1.0261 0.8436 0.8436 0.8436 0.8436
1000 40 0.7306 0.7306 0.7306 0.7309 1.0916 0.7306 0.7306 0.7306 0.7306
1000 60 0.5965 0.5965 0.5965 0.6121 1.1669 0.5965 0.5965 0.5965 0.5968
1000 80 0.4218 0.4218 0.4256 0.5338 1.2389 0.4218 0.4218 0.4218 0.4397
1000 100 0.1346 0.1635 0.3324 0.5755 1.3080 0.0096 0.0697 0.1661 0.2188
Table 4: RSE obtained by Algorithms 2.1, 2.2 and 3.1 on the BION data. For the latter two algorithms we used α = β. For each n we take all ten matrices (five for each of the two values of k associated with this n in Table 1). We run all three algorithms on these matrices with inner dimensions equal to 20–100 % of the true k and with all tested values of α. Like before, each row represents the average (arithmetic mean) RSE obtained on the instances corresponding to the given n and the given inner dimension, expressed as a percentage of k. We can see that the larger the α, the worse the RSE, which is consistent with expectations.
n   inner dim. (% of k)   Infeas. of Alg. 2.1   Infeas. of Alg. 2.2 (four increasing values of α = β)   Infeas. of Alg. 3.1 (four increasing values of α = β)
50 20 0.0001 0.0070 0.0036 0.0010 0.0068 0.0017 0.0021 0.0021 0.0026
50 40 0.0000 0.0041 0.0021 0.0004 0.0056 0.0008 0.0012 0.0012 0.0014
50 60 0.0000 0.0030 0.0009 0.0032 0.0038 0.0005 0.0008 0.0009 0.0009
50 80 0.0000 0.0183 0.0030 0.0021 0.0028 0.0004 0.0202 0.0006 0.0013
50 100 0.0355 0.0533 0.0127 0.0045 0.0027 0.0418 0.0478 0.0123 0.0021
100 20 0.0001 0.0051 0.0024 0.0006 0.0063 0.0010 0.0012 0.0013 0.0016
100 40 0.0000 0.0029 0.0017 0.0066 0.0040 0.0004 0.0006 0.0007 0.0007
100 60 0.0000 0.0019 0.0008 0.0009 0.0027 0.0003 0.0004 0.0005 0.0005
100 80 0.0000 0.0039 0.0048 0.0015 0.0021 0.0062 0.0149 0.0037 0.0006
100 100 0.0606 0.0454 0.0105 0.0022 0.0018 0.0106 0.0228 0.0173 0.0028
200 20 0.0002 0.0033 0.0019 0.0005 0.0043 0.0005 0.0007 0.0007 0.0007
200 40 0.0001 0.0017 0.0010 0.0002 0.0027 0.0002 0.0003 0.0004 0.0003
200 60 0.0001 0.0010 0.0005 0.0004 0.0019 0.0001 0.0002 0.0002 0.0004
200 80 0.0000 0.0006 0.0006 0.0015 0.0014 0.0001 0.0001 0.0002 0.0013
200 100 0.0425 0.0280 0.0064 0.0019 0.0015 0.0046 0.0224 0.0240 0.0034
500 20 0.0001 0.0017 0.0011 0.0003 0.0025 0.0002 0.0003 0.0003 0.0003
500 40 0.0001 0.0008 0.0005 0.0001 0.0016 0.0001 0.0001 0.0002 0.0002
500 60 0.0000 0.0005 0.0003 0.0006 0.0013 0.0001 0.0001 0.0001 0.0002
500 80 0.0000 0.0003 0.0009 0.0009 0.0008 0.0000 0.0001 0.0001 0.0016
500 100 0.0258 0.0184 0.0045 0.0013 0.0007 0.0017 0.0101 0.0175 0.0053
1000 20 0.0001 0.0010 0.0006 0.0002 0.0024 0.0001 0.0002 0.0002 0.0002
1000 40 0.0000 0.0005 0.0003 0.0001 0.0009 0.0001 0.0002 0.0003 0.0002
1000 60 0.0000 0.0003 0.0002 0.0004 0.0009 0.0003 0.0002 0.0003 0.0003
1000 80 0.0000 0.0002 0.0005 0.0007 0.0006 0.0040 0.0001 0.0002 0.0020
1000 100 0.0173 0.0117 0.0031 0.0009 0.0005 0.0043 0.0050 0.0121 0.0060
Table 5: Feasibility (orthonormality) of the solutions U and V computed by Algorithms 2.1, 2.2, and 3.1 on the BION dataset, i.e., the average infeasibility (18) of the solutions underlying Table 4. We can observe that with these settings all algorithms very often bring the infeasibility down to very small values, for all values of α.

Figures 3(a)–3(f) and 4(a)–4(f) depict the RSE and the infeasibility reached by the three compared algorithms for selected values of n. We can see that all three algorithms behave well; however, Algorithm 3.1 is more stable and less dependent on the choice of α. It is interesting to see that α does not have a big impact on the RSE and the infeasibility of Algorithm 3.1; a significant difference can be observed only when the internal dimension is equal to the real internal dimension, i.e., when the input dimension is 100 % of k. Based on these numerical results, we can conclude that smaller values of α achieve better RSE and almost the same infeasibility, so it would make sense to use the smaller values of α.

For Algorithm 2.2 these differences are bigger and it is less obvious which α is appropriate. Again, if RSE is more important, then smaller values of α should be taken, otherwise larger values.

Figure 3: Six plots (omitted here) illustrating the quality of Algorithms 2.1, 2.2 and 3.1 regarding RSE on BION instances, with α = β, for three problem sizes. Panels (a), (c) and (e) compare Algorithms 2.1 and 2.2, and panels (b), (d) and (f) compare Algorithms 2.1 and 3.1. We can observe that Algorithm 3.1 is more stable, less dependent on the choice of α, and computes better values of RSE.
Figure 4: Six plots (omitted here) illustrating the quality of Algorithms 2.1, 2.2 and 3.1 regarding infeasibility on BION instances, with α = β, for three problem sizes. Panels (a), (c) and (e) compare Algorithms 2.1 and 2.2, and panels (b), (d) and (f) compare Algorithms 2.1 and 3.1. We can observe that Algorithm 3.1 computes solutions whose infeasibility (18) is slightly smaller than that of the solutions computed by Algorithm 2.2.

5 Concluding remarks

We presented a projected gradient method for solving the orthogonal non-negative matrix factorization problem. We penalized the deviation from orthonormality with positive parameters and added the resulting terms to the objective function of the standard non-negative matrix factorization problem. Then, we minimized the resulting objective function under the non-negativity constraints only, in a block coordinate descent approach.

The method was tested on two sets of synthetic data, one containing uni-orthonormal matrices and the other containing bi-orthonormal matrices. Different values of the orthogonality penalty parameters were used in the implementation to determine good pairs of such values. The performance of our algorithm was compared with two algorithms based on multiplicative update rules. The algorithms were compared with respect to the quality of the factorization (RSE) and to how much the resulting factors deviate from orthonormality.

We provided an extensive list of numerical results which demonstrate that our method is very competitive and outperforms the others.

Acknowledgment

The work of the first author is supported by the Swiss Government Excellence Scholarships grant number ESKAS-2019.0147. This author also thanks the University of Applied Sciences and Arts, Northwestern Switzerland for supporting the work.

The work of the second author was partially funded by Slovenian Research Agency under research program P2-0256 and research projects N1-0057, N1-0071, and J1-8155.

The authors would also like to thank Andri Mirzal (Faculty of Computing, Universiti Teknologi Malaysia) for providing the code for his algorithm (Algorithm 2.2) to solve (ONMF). This code was also adapted by the authors to solve (bi-ONMF).


References

  • [1] P. Anttila, P. Paatero, U. Tapper, and O. Järvinen (1995) Source identification of bulk wet deposition in finland by positive matrix factorization. Atmospheric Environment 29 (14), pp. 1705–1718. Cited by: §1.3.
  • [2] M. W. Berry, M. Browne, A. N. Langville, V. P. Pauca, and R. J. Plemmons (2007) Algorithms and applications for approximate nonnegative matrix factorization. Computational statistics & data analysis 52 (1), pp. 155–173. Cited by: §1.1.
  • [3] M. W. Berry, N. Gillis, and F. Glineur (2009) Document classification using nonnegative matrix factorization and underapproximation. In 2009 IEEE International Symposium on Circuits and Systems, pp. 2782–2785. Cited by: §1.1.
  • [4] D.P. Bertsekas (2016) Nonlinear programming. Athena scientific optimization and computation series, Athena Scientific. External Links: ISBN 9781886529052, Link Cited by: 1st item.
  • [5] S. Choi (2008) Algorithms for orthogonal nonnegative matrix factorization. In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), pp. 1828–1832. Cited by: §1.3.
  • [6] M. Chu (2004) Optimality, computation, and interpretation of nonnegative matrix factorizations. In SIAM Journal on Matrix Analysis, Cited by: §1.3.
  • [7] A. Cichocki, R. Zdunek, and S. Amari (2007) Hierarchical ALS algorithms for nonnegative matrix and 3D tensor factorization. In International Conference on Independent Component Analysis and Signal Separation, pp. 169–176. Cited by: §1.3.
  • [8] A. Cichocki, R. Zdunek, A. H. Phan, and S. Amari (2009) Nonnegative matrix and tensor factorizations: applications to exploratory multi-way data analysis and blind source separation. John Wiley & Sons. Cited by: §2.1.
  • [9] C. Ding, T. Li, W. Peng, and H. Park (2006) Orthogonal nonnegative matrix t-factorizations for clustering. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 126–135. Cited by: 3rd item, §1.2, §1.3, §2.1, §2.1, §2.1.
  • [10] G. H. Golub and C. Reinsch (1971) Singular value decomposition and least squares solutions. In Linear Algebra, pp. 134–151. Cited by: §1.1.
  • [11] N. Halko, P. Martinsson, and J. A. Tropp (2011) Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM review 53 (2), pp. 217–288. Cited by: §1.3.
  • [12] I. Jolliffe (2005) Principal component analysis. Wiley Online Library. Cited by: §1.1.
  • [13] A. Kaarna (2006) Non-negative matrix factorization features from spectral signatures of aviris images. In 2006 IEEE International Symposium on Geoscience and Remote Sensing, pp. 549–552. Cited by: §1.1.
  • [14] D. Kim, S. Sra, and I. S. Dhillon (2007) Fast newton-type methods for the least squares nonnegative matrix approximation problem. In Proceedings of the 2007 SIAM international conference on data mining, pp. 343–354. Cited by: §1.3.
  • [15] D. Kim, S. Sra, and I. S. Dhillon (2008) Fast projection-based methods for the least squares nonnegative matrix approximation problem. Statistical Analysis and Data Mining: The ASA Data Science Journal 1 (1), pp. 38–51. Cited by: §1.3.
  • [16] H. Kim and H. Park (2008) Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method. SIAM journal on matrix analysis and applications 30 (2), pp. 713–730. Cited by: §1.3.
  • [17] J. Kim and H. Park (2008) Toward faster nonnegative matrix factorization: a new algorithm and comparisons. In 2008 Eighth IEEE International Conference on Data Mining, pp. 353–362. Cited by: §1.3.
  • [18] D. D. Lee and H. S. Seung (1999) Learning the parts of objects by non-negative matrix factorization. Nature 401 (6755), pp. 788. Cited by: §1.3.
  • [19] D. D. Lee and H. S. Seung (2001) Algorithms for non-negative matrix factorization. In Advances in neural information processing systems, pp. 556–562. Cited by: §1.3, §1.3, §2.1.
  • [20] T. Li and C. Ding (2006) The relationships among various nonnegative matrix factorization methods for clustering. In Sixth International Conference on Data Mining (ICDM’06), pp. 362–371. Cited by: §1.1, §1.3.
  • [21] C.J. Lin (2007) Projected gradient methods for nonnegative matrix factorization. Neural computation 19, pp. 2756–2779. Cited by: §1.3, §1.3, §3.1, §3.1, §3.2.
  • [22] C. Lin and J. J. Moré (1999) Newton’s method for large bound-constrained optimization problems. SIAM Journal on Optimization 9 (4), pp. 1100–1127. Cited by: §3.2.
  • [23] C. Lin (2007) On the convergence of multiplicative update algorithms for nonnegative matrix factorization. IEEE Transactions on Neural Networks 18 (6), pp. 1589–1596. Cited by: §1.3, §2.2.
  • [24] A. Mirzal (2014) A convergent algorithm for orthogonal nonnegative matrix factorization. Journal of Computational and Applied Mathematics 260, pp. 149–166. Cited by: 3rd item, §1.3, §1.6, §2.2, §2.2, §2.2.
  • [25] A. Mirzal (2017) A convergent algorithm for bi-orthogonal nonnegative matrix tri-factorization. arXiv preprint arXiv:1710.11478. Cited by: §2.2, §2.2, §2.2, §2.2, §2.2.
  • [26] P. Paatero and U. Tapper (1994) Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values. Environmetrics 5 (2), pp. 111–126. Cited by: §1.3.
  • [27] V. P. Pauca, F. Shahnaz, M. W. Berry, and R. J. Plemmons (2004) Text mining using non-negative matrix factorizations. In Proceedings of the 2004 SIAM International Conference on Data Mining, pp. 452–456. Cited by: §1.1.
  • [28] J. Piper, V. P. Pauca, R. J. Plemmons, and M. Giffin (2004) Object characterization from spectral data using nonnegative factorization and information theory. In Proceedings of AMOS Technical Conference, Cited by: §2.1.
  • [29] P. Richtárik and M. Takác (2014) Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming 144 (1-2), pp. 1–38. Cited by: 1st item.
  • [30] F. Shahnaz, M. W. Berry, V. P. Pauca, and R. J. Plemmons (2006) Document clustering using nonnegative matrix factorization. Information Processing & Management 42 (2), pp. 373–386. Cited by: §1.1.
  • [31] The MathWorks MATLAB version r2019a. Cited by: 3rd item, §4.1, §4.1.
  • [32] W. Xu, X. Liu, and Y. Gong (2003) Document clustering based on non-negative matrix factorization. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval, pp. 267–273. Cited by: §1.1, §1.3.
  • [33] J. Yoo and S. Choi (2008) Orthogonal nonnegative matrix factorization: multiplicative updates on stiefel manifolds. In International conference on intelligent data engineering and automated learning, pp. 140–147. Cited by: §1.3.
  • [34] J. Yoo and S. Choi (2010) Orthogonal nonnegative matrix tri-factorization for co-clustering: multiplicative updates on stiefel manifolds. Information processing & management 46 (5), pp. 559–570. Cited by: §1.3.
  • [35] S. Zafeiriou, A. Tefas, I. Buciu, and I. Pitas (2006) Exploiting discriminant information in nonnegative matrix factorization with application to frontal face verification. IEEE Transactions on Neural Networks 17 (3), pp. 683–695. Cited by: §1.1.