Mixed membership matrix factorization has been used in document topic modeling , collaborative filtering , population genetics , and social network analysis . The underlying assumption is that an observed feature for a given sample is a mixture of shared, underlying groups. These groups are called topics in document modeling, subpopulations in population genetics, and communities in social network analysis. In bioinformatics applications the groups are called subtypes and we adopt that terminology here. Mixed membership matrix factorization simultaneously identifies both the underlying subtypes and the distribution over those subtypes for each individual sample.
1.1. Mixed Membership Model
The mixed membership matrix factorization problem can equivalently be viewed as inference in a particular statistical model 
. These models typically have a latent Dirichlet random variable that allows each sample to have its own distribution over subtypes and a latent variable for the feature weights that describe each subtype. The inferential goal is to estimate the joint posterior distribution over these latent variables and thus obtain the distribution over subtypes for each sample and the feature vector for each subtype. Non-negative matrix factorization techniques have been used in image analysis and collaborative filtering applications[15, 17]. Topic models for document clustering have also been cast as a matrix factorization problem .
The basic mixed membership model structure has been extended in various interesting ways. A hierarchical Dirichlet prior allows one to obtain a posterior distribution over the number of subtypes . A prior on the subtype variables allows one to impose specific sparsity constraints on the subtypes [13, 16, 21]. Correlated information may be incorporated to improve the coherence of the subtypes .
Sampling or variational inference methods are commonly used to estimate the posterior distribution of interest for mixed membership models, but these only provide local or approximate estimates. A mean-field variational algorithm  and a collapsed Gibbs sampling algorithm have been developed for Latent Dirichlet Allocation . However, Gibbs sampling is approximate for finite chain lengths and variational inference is only guaranteed to converge to a local optimum.
1.2. Benders’ Decomposition and Global Optimization (GOP)
In many applications it is important to obtain a globally optimal solution rather than a local or approximate solution. Recently, there have been significant advances in deterministic optimization methods for general biconvex optimization problems [8, 12]. Here, we show that mixed membership matrix factorization can be cast as a biconvex optimization problem and the -global optimum can be obtained by these deterministic optimization methods.
Benders’ decomposition exploits the idea that in a given optimization problem there are often “complicating variables” – variables that when held fixed yield a much simpler problem over the remaining variables 
. Benders developed a cutting plane method for solving mixed integer optimization problems that can be so decomposed. Geoffrion later extended Benders’ decomposition to situations where the primal problem (parametrized by fixed complicating variable values) no longer needs to be a linear program. The Global Optimization (GOP) approach is an adaptation of the original Benders’ decomposition that can handle a more general class of problems that includes mixed-integer biconvex optimization problems . Here, we exploit the GOP approach for solving a particular mixed membership matrix factorization problem.
Our contribution is bringing the GOP algorithm into contact with the mixed membership matrix factorization problem, computational improvements to the branch-and-bound GOP algorithm, and experimental results. Our discussion of the GOP algorithm here is necessarily brief. The details of problem conditions, convergence properties, and a full outline of the algorithm steps for the branch-and-bound version of the algorithm are found elsewhere .
We outline the general sparse mixed membership matrix factorization problem in Section 2. In Section 3, we use GOP to obtain an -global optimum solution for the mixed membership matrix factorization problem. In Section 5, we show empirical accuracy and convergence time results on a synthetic data set. Finally, we discuss further computational and statistical issues in Section 6.
2. Problem Formulation
The problem data is a matrix , where an element is an observation of feature in sample . We would like to represent each sample as a convex combination of subtype vectors, , where is a matrix of subtype vectors and is the mixing proportion of each subtype. We would like to be sparse because doing so makes interpreting the subtypes easier and often is believed to be sparse a priori for many interesting problems. In the specific case of cancer subtyping, may be a normalized gene expression measurement for gene in sample . We write this matrix factorization problem as
where is a -dimensional simplex.
Optimization problem (2) can be recast with a biconvex objective and a biconvex domain as
If either or is fixed then (2) reduces to a convex optimization problem. Indeed, if
is fixed, the optimization problem is a form of constrained linear regression. If
is fixed, we have a form of LASSO regression. We prove that (2) is a biconvex problem in Appendix B. Since both problems are computationally simple, we could take either or to be the “complicating variables” in Benders’ decomposition and we choose .
A common approach for solving an optimization problem with a nonconvex objective function is to alternate between fixing one variable and optimizing over the other. However, this approach only provides a local optimum . A key to the GOP algorithm is the Benders’-based idea that feasibility and optimality information is shared between the primal problems in the form of constraints.
We use the global optimization approach to solve for -global optimum values of and [9, 6, 7]. First, we partition the optimization problem decision variables into “complicating” and “non-complicating” variables. Then, the GOP algorithm alternates between solving a primal problem over for fixed , and solving a relaxed dual problem over for fixed . The primal problem provides an upper bound on the original optimization problem because it contains more constraints than the original problem ( is fixed). The relaxed dual problem contains fewer constraints and forms a valid global lower bound. The algorithm iteratively tightens the upper and lower bounds on the global optimum by alternating between the primal and relaxed dual problem and tightening the relaxation in the relaxed dual problem at each iteration.
We start by partitioning the problem into a relaxed dual problem and a primal problem. Recall our decision that the relaxed dual problem optimizes over for fixed values of the complicating variables and the primal problem optimizes over . We also initialize an iteration counter .
At each iteration, the relaxed dual problem is solved by forming a partition of the domain of and solving a relaxed dual primal problem for each subset. A branch-and-bound tree data structure is used to store the solution of each of these relaxed dual primal problems and we initialize the root node where . The parents of is denoted , the set of ancestors of is denoted , and the set of children of is denoted .
Finally, we initialize at a random feasible point, , and store it in since we will be starting the GOP iterations by solving the primal problem over for a fixed .
3.2. Solve Primal Problem and Update Upper Bound
The primal problem is (2) constrained to fixed value of at , ,
Since the primal problem is more constrained than (2), the solution, , is a global upper bound. We store the value of the upper bound, , where PUBD stores the tightest upper bound.
3.3. Solve the Relaxed Dual Problem and Update Lower Bound
The relaxed dual problem is a relaxed version of (2) in that it contains fewer constraints than the original problem. Initially, at the root node the domain of the relaxed dual problem is the entire domain of , . Each node stores a set of linear constraints (cuts) such that when all of the constraints are satisfied, they define a region in . Sibling nodes form a partition of parent’s region and a node deeper in the tree defines a smaller region than shallower nodes when incorporating the constraints of the node and all of its ancestors. These constraints are called qualifying constraints. Since the objective function is convex in for a fixed value of , a Taylor series approximation of the Lagrangian with respect to provides a valid lower bound on the objective function. Finally, since the objective function is convex in , the Taylor approximation is linear and the optimal objective is at a bound of . The GOP algorithm as outlined in  makes these ideas rigorous.
The relaxed dual problem for the mixed membership matrix factorization problem (2) for a node is
Relaxed Dual Problem
where is the linearized Lagrangian of (2), is the -th qualifying constraint, and is the value of at the bound such that the linearized Lagrangian is a valid lower bound in the region defined by the qualifying constraints at node . We have taken a second Taylor approximation with respect to to ensure the qualifying constraints are linear in and thus valid cuts as recommended in .
Construct a child node in the branch-and-bound tree.
A unique region in for the leaf node is defined by the -th row of derived from the primal problem at node . We can express this region as the qualifying constraint set,
First, we create the child node of and populate it with this constraint set and which will be used in the construction of the Lagrange function lower bound in the relaxed dual problem.
Second, we construct and solve the relaxed dual problem at . First, we add the qualifying constraint sets contained in each node along the path in the branch-and-bound tree from to the root, inclusively. For example, the qualifying constraint set for a node along the path is
where is the node’s qualifying constraint, is the node’s relaxed dual problem optimizer, and is a - vector defining the unique region for node since .
Third, we add the Lagrangian function lower bound constraints constructed from each node along the path in the branch-and-bound tree from to the root, inclusively. For example the linearized Lagrange function for node ,
Populate the child node with the linearized Lagrange function and qualifying constraints.
The Lagrangian function for the primal problem is
with Lagrange multipliers and .
The relaxed dual problem makes use of the Lagrangian function linearized about which we obtain through a Taylor series approximation,
where the qualifying constraint function is
The qualifying constraint is quadratic in . However, we require it to be linear in to yield a convex domain if or . So, we linearize the Lagrangian first with respect to about then about at . While the linearized Lagrangian is not a lower bound everywhere in , it is a valid lower bound in the region bound by the qualifying constraints with set at the corresponding bounds in the Lagrangian function.
The Lagrangian function linearized about is
Subsequently, the Lagrangian function linearized about is
and the gradient used in the qualifying constraint is
Solve the relaxed dual problem at the child node.
Once the valid qualifying constraints from the previous iterations have been identified and incorporated, the constraint for the current iteration is
The resulting relaxed dual problem is a linear program and can be solved efficiently using the off-the-shelf LP solver Gurobi . We store the optimal objective function value and the optimizing decision variables in the node.
Update the lower bound.
The global lower bound is provided by the lowest lower bound across all the leaf nodes in the branch-and-bound tree. We store this global lower bound in a variable, RLBD. Operationally, we maintain a dictionary where the value of a record is a pointer to a branch-and-bound tree node and the key is the optimal value of the relaxed dual problem at that leaf node. Using this dictionary, we select the smallest key and bound to the node of the tree indicated by the value. This element is eliminated from the dictionary since at the end of the next iteration, it will be an interior node and not available for consideration. We increment the iteration count and we update the global lower bound RLBD with the optimal value of the relaxed dual problem at the new node.
Since we always select the lowest lower bound provided by the relaxed dual problem, the lower bound is non-decreasing. If our convergence criteria has been met, then we exit the algorithm and report the optimal from the node’s primal problem and the optimal from the node’s relaxed dual problem. Finite -convergence and -global optimality proofs can be found in .
4. Computational Improvements
In the relaxed dual problem branch-and-bound tree, a leaf node below the current node
is constructed for each unique region defined by the hyperplane arrangement. In the GOP framework, there arehyperplanes, one of each so-called “connected variable” and all of the elements of are connected variables. So, an upper bound on the number of regions defined by cuts is because each region may be found by selecting a side of each cut. Thus we have the computationally complex situation of needing to solve a relaxed dual problem for each of the possible regions.
Let an arrangement denote a set of hyperplanes and denote the set of unique regions defined by . In our particular situation, all of the hyperplanes pass through the unique point , so all of the regions are unbounded except by the constraints provided in . A recursive algorithm for counting the number of regions known as Zaslavsky’ Theorem, is outlined in . Indeed, is often much less that . However, due to its recursive nature, computing the number of hyperplanes using Zaslavsky’s theorem is computationally slow.
4.1. Cell Enumeration Algorithm
We have developed an A-star search algorithm for cell enumeration to simultaneously identify and count the set of unique regions defined by arrangement with sign vectors. First, we preprocess the arrangement to eliminate trivial and redundant hyperplanes. We eliminate a hyperplane from if the coefficients are all zero and eliminate duplicate hyperplanes in (see Appendix C). We are left with a reduced arrangement, .
Here we define two concepts, strict hyperplane and adjacent region. A strict hyperplane is defined as non-redundant bounding hyperplane in a single region. If two regions exist that have sign vectors differing in only one hyperplane, then this hyperplane is a strict hyperplane. We define an adjacent region of region as a neighbor region of if they are separated by exactly one strict hyperplane. The general idea of the A-star algorithm uses ideas from partial order sets. We first initialize a root region using an interior point method and then determine all of its adjacent regions by identifying the set of strict hyperplanes. This process guarantees that we can enumerate all unique regions.
We define . The rows are regions and there are columns. Each element of this matrix is either or . The region in is uniquely identified by the zero-one vector in the row of . If the element of the row of is , then . Similarly, if the element of the row of is , then . The A-star search algorithm completes the matrix for the current node and a leaf node is generated for each row of . Thus each unique region defined by the qualifying constraint cuts provided by the Lagrange dual of the primal problem at the current node. The details of the A-star search algorithm are covered in Section C.
In this section, we present our experiments on synthetic data sets and show accuracy and convergence speed. Computational complexity is evaluated by both the theoretical and empirical time complexity.
5.1. Illustrative Example
We use a simple data set to show the operation of the algorithm in detail and facilitate visualization of the cut sets. The data set, , and true decision variable values, , are
We ran the GOP algorithm with sparsity constraint variable and convergence tolerance . There are connected variables, so we solve at most relaxed dual problems at each iteration. These relaxed dual problems are independent and can be distributed to different computational threads or cores. The primal problem is a single optimization problem and will not be distributed. The optimal decision variables after 72 iterations are
and the Lagrange multipliers are and .
Figure 1 (a) shows the convergence of the upper and lower bounds by iteration. The upper bound converges quickly and the majority of the time in the algorithm is spent proving optimality. With each iteration regions of the solution space are tested until the lower bound is tightened sufficiently to meet the stopping criterion. Figure 1 (b) shows the first ten values considered by the algorithm with isoclines of the objective function with fixed. It is evident that the algorithm is not performing hill-climbing or any other gradient ascent algorithm during its search for the global optimum. Instead, the algorithm explores a region bound by the qualifying constraints to construct a lower bound on the objective function. We run it using 20 random initial values and the optimal objective functions for all random initializations are all 0, which shows that the GOP algorithm found the globally optimal solutions of this small instance. Furthermore, the algorithm does not search nested regions, but considers previously explored cut sets (Figure 1 (b)).
Figure 2 shows the branch-and-bound tree and corresponding -space region with the sequence of cut sets for the first three iterations of the algorithm. One cut in Figure 2 (b, d, f) is obtained for each of the qualifying constraints. We initialize the algorithm at .
5.2. Accuracy and Convergence Speed
We ran our GOP algorithm using 64 processors on a synthetic data set which is randomly generated on the scale of one feature (), two subtyes () and ten samples (). Figure 3 (a) shows that our GOP algorithm converges very quickly to -0.17 duality gap in the first 89 iterations in 120 seconds. The optimal and of each iteration are shown with a range of colors to represent corresponding RLBD in Figure 3 (b, c). The dark blue represents low RLBD and the dark red represents high RLBD. The RLBD of the initial , , is -59.87; The RLBD of iteration 89, , is -0.17. It demonstrates that the GOP algorithm can change modes very easily without getting stuck in local optima.
5.3. Computational Complexity
We evaluate the GOP algorithm by theoretical analysis and empirical measurements of the time complexity on simulated data sets. The problem has four main components: primal problem, preprocessing, unique region identification, and relaxed dual problems.
5.3.1. Theoretical Time Complexity
The primal problem is a convex quadratic program with decision variables. The time complexity for the primal problem solving is then .
We address the cases of overlapping qualifying constraint cuts by sorting the rows of the qualifying constraint coefficient matrix and comparing the coefficients of adjacent rows. We first sort the rows of the qualifying constraint coefficient matrix using heapsort which takes time on average. The algorithm subsequently passes through the rows of the matrix to identify all-zero coefficients and duplicate cuts; each pass takes time. We define as the number of unique qualifying constraints.
Unique region identification
The interior point method that we used in the A-star search algorithm is a linear program of size with the time complexity of . The time complexity for enumerating the set of unique regions is
, which exhibits polynomial behavior. The time complexity of the partial order A-star algorithm is polynomial in the best case and exponential in the worst case, depending on the heuristic. We defineas the number of identified unique regions.
Relaxed dual problems
There are decision variables for each relaxed dual problem, so the time complexity for each is . The total time for solving the relaxed dual problems is , which depends on the number of relaxed dual problems.
5.3.2. Empirical Timing Results
We constructed 12 synthetic data sets in a full-factorial arrangement with , , and and measured CPU time for each component of one iteration. For each arrangement, each element of the true is:
is the sample from a Normal distribution by its mean. For the true , for are evenly spaced samples over the interval of ; for are evenly spaced samples over the interval of .
Primal: primal problem. Pre: preprocessing. URI: unique region identification. Num: number of relaxed dual problems. Dual: relaxed dual problems. Total: total time of one iteration. Here we have two subtypes. A single processor is used. Time for solving relaxed dual problems is highlighted in percentage.
Table 1 shows that the time per iteration increases linearly with when and are fixed. The time for solving all the relaxed dual problems increases as the number of samples increases. Even though the step of solving all the relaxed dual problems takes more than of the total time per iteration when the number of samples is , our algorithm is easily parallelized to solve the relaxed dual problems, allowing the algorithm to scale nearly linearly with the size of the data set.
We have presented a global optimization algorithm for a mixed membership matrix factorization problem. Our algorithm brings ideas from the global optimization community (Benders’ decomposition and the GOP method) into contact with statistical inference problems for the first time. The cost of the global optimal solution is the need to solve a number of linear programs that grows exponentially in the number of so-called “connected” variables in the worst case – in this case the elements of . Many of these linear programs are redundant or yield optimal solutions that are greater than the current upper bound and thus not useful. A branch-and-bound framework  reduces the need to solve all possible relaxed dual problems by fathoming parts of the solution space We further mitigate this cost by developing an search algorithm for identifying and enumerating the true number of unique linear programs.
We are exploring the connections between GOP and the other alternating optimization algorithms such as the expectation maximization (EM) and variational EM algorithm. Since the complexity of GOP only depends on the connected variables, the graphical model structure connecting the complicating and non-complicating variables may be used to identify the worst-case complexity of the algorithm prior to running the algorithm. A factorized graph structure may provide an approximate, but computationally efficient algorithm based on GOP. Additionally, because the Lagrangian function factorizes into the sum of Lagrangian functions for each sample in the data set, we may be able to update the parameters based on GOP for a selected subset of the data in an iterative or sequential algorithm. We are exploring the statistical consistency properties of such an update procedure.
Finally, we have derived an algorithm for particular loss functions for the sparsity constraint and objective function. The GOP framework can handle integer variables and thus may be used with ancounting “norm” rather than the norm to induce sparsity. This would give us a mixed-integer biconvex program, but the conditions for the framework. Structured sparsity constraints can also be defined as is done for elastic-net extensions of LASSO regression. It may be useful to consider other loss functions for the objective function depending on the application.
We acknowledge Hachem Saddiki for valuable discussions and comments on the manuscript.
-  Edoardo M Airoldi, David M Blei, Stephen E Fienberg, and Eric P Xing. Mixed Membership Stochastic Blockmodels. J. Mach. Learn. Res., 9:1981–2014, September 2008.
-  Jacques F Benders. Partitioning Procedures for Solving Mixed-variables Programming Problems. Numer. Math., 4(1):238—-252, 1962.
-  David M. Blei and John D. Lafferty. Correlated Topic Models. In Proc. Int. Conf. Mach. Learn., pages 113–120, 2006.
-  David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet Allocation. J. Mach. Learn. Res., 3:993–1022, 2003.
-  Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.
-  Christodoulos A. Floudas. Deterministic Global Optimization, volume 37 of Nonconvex Optimization and Its Applications. Springer US, Boston, MA, 2000.
-  Christodoulos A Floudas. Deterministic global optimization: theory, methods and applications, volume 37. Springer Science & Business Media, 2013.
-  Christodoulos A. Floudas and C E Gounaris. A Review of Recent Advances in Global Optimization. J. Glob. Optim., 45:3–38, 2008.
-  Christodoulos A. Floudas and V Visweswaran. A Global Optimization Algorithm (GOP) for Certain Classes of Nonconvex NLPs. Comput. Chem. Eng., pages 1–34, 1990.
-  A. M. Geoffrion. Generalized Benders Decomposition. J. Optim. Theory Appl., 10:237–260, 1972.
-  Jochen Gorski, Frank Pfeuffer, and Kathrin Klamroth. Biconvex Sets and Optimization with Biconvex Functions: A Survey and Extensions. Math. Methods Oper. Res., 66:373–407, 2007.
-  Reiner Horst and Hoang Tuy. Global optimization: Deterministic approaches. Springer Science & Business Media, 2013.
-  Ata Kabán. On Bayesian Classification with Laplace Priors. Pattern Recognit. Lett., 28(10):1271–1282, July 2007.
-  Peter Lancaster, Miron Tismenetsky, et al. The theory of matrices: with applications. Elsevier, 1985.
-  D D Lee and H S Seung. Learning the Parts of Objects by Non-negative Matrix Factorization. Nature, 401:788–791, 1999.
D J C MacKay.
Bayesian Interpolation.Neural Comput., 1992.
-  Lester Mackey, David Weiss, and Michael I Jordan. Mixed Membership Matrix Factorization. In Int. Conf. Mach. Learn., pages 1–8, 2010.
-  Gurobi Optimization et al. Gurobi optimizer reference manual. URL: http://www. gurobi. com, 2:1–3, 2012.
-  J K Pritchard, M Stephens, and P Donnelly. Inference of Population Structure using Multilocus Genotype Data. Genetics, 155:945–959, 2000.
-  Ajit P. Singh and Geoffrey J. Gordon. A Unified View of Matrix Factorization Models. In Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), volume 5212 LNAI, pages 358–373, 2008.
-  Matt Taddy. Multinomial Inverse Regression for Text Analysis. J. Am. Stat. Assoc., page 121008121831000, 2012.
-  Yee Whye Teh, Michael I Jordan, Matthew J Beal, and David M Blei. Sharing Clusters Among Related Groups: Hierarchical Dirichlet Processes. In Adv. Neural Inf. Process. Syst., number 1, 2005.
-  Han Xiao and Thomas Stibor. Efficient Collapsed Gibbs Sampling for Latent Dirichlet Allocation. J. Mach. Learn. Res. …, pages 63–78, 2010.
-  Wei Xu, Xin Liu, and Yihong Gong. Document Clustering Based on Non-negative Matrix Factorization. Proc. 26th Annu. Int. ACM SIGIR Conf. Res. Dev. informaion Retr. - SIGIR ’03, page 267, 2003.
-  Thomas Zaslavsky. Facing up to Arrangements: Face-Count Formulas for Partitions of Space by Hyperplanes: Face-count Formulas for Partitions of Space by Hyperplanes, volume 154. American Mathematical Soc., 1975.
Appendix A Derivation of relaxed dual problem constraints
The Lagrange function is the sum of the Lagrange functions for each sample,
and the Lagrange function for a single sample is
We see that the Lagrange function is biconvex in and . We develop the constraints for a single sample for the remainder.
a.1. Linearized Lagrange function with respect to
Casting as a vector and rewriting the Lagrange function gives
where is formed by stacking the columns of in order. The coefficients are formed such that
The linear coefficient matrix is the vector,
The quadratic coefficient is the and block matrix
The Taylor series approximation about is
The gradient with respect to is
Plugging the gradient into the Taylor series approximation gives
Simplifying the linearized Lagrange function gives
Finally, we write the linearized Lagrangian using the matrix form of ,
While the original Lagrange function is convex in for a fixed , the linearized Lagrange function is not necessarily convex in . This can be seen by collecting the quadratic, linear and constant terms with respect to ,
Now, if and only if is positive semidefinite, then is convex. The condition is satisfied at but may be violated at some other value of .
a.2. Linearized Lagrange function with respect to
Now, we linearize (15) with respect to . Using the Taylor series approximation with respect to gives
The gradient for this Taylor series approximation is
where is the vector of qualifying constraints associated with the Lagrange function. The qualifying constraint is linear in .
Plugging the gradient into the approximation gives
The linearized Lagrange function is bi-linear in and .
Finally, simplifying the linearized Lagrange function gives
Appendix B Proof of Biconvexity
To prove the optimization problem is biconvex, first we show the feasible region over which we are optimizing is biconvex. Then, we show the objective function is biconvex by fixing and showing convexity with respect to , and then vice versa.
b.1. The constraints form a biconvex feasible region
Our constraints can be written as
The inequality constraint (22) is convex if either or is fixed, because any norm is convex. The equality constraints (23) is an affine combination that is still affine if either or is fixed. Every affine set is convex. The inequality constraint (24) is convex if either or is fixed, because is a linear function.
b.2. The objective is convex with respect to
We prove the objective is a biconvex function using the following two theorems.
Let be a convex open set and let be twice differentiable. Write for the Hessian matrix of at . If is positive semidefinite for all , then is convex ().
symmetric matrix is positive semidefinite (PSD) if and only if there exists such that ().
The objective of our problem is,
The objective function is the sum of the objective functions for each sample.
The gradient with respect to ,
Take second derivative with respect to to get Hessian matrix,