# Computation of the Maximum Likelihood estimator in low-rank Factor Analysis

Factor analysis, a classical multivariate statistical technique is popularly used as a fundamental tool for dimensionality reduction in statistics, econometrics and data science. Estimation is often carried out via the Maximum Likelihood (ML) principle, which seeks to maximize the likelihood under the assumption that the positive definite covariance matrix can be decomposed as the sum of a low rank positive semidefinite matrix and a diagonal matrix with nonnegative entries. This leads to a challenging rank constrained nonconvex optimization problem. We reformulate the low rank ML Factor Analysis problem as a nonlinear nonsmooth semidefinite optimization problem, study various structural properties of this reformulation and propose fast and scalable algorithms based on difference of convex (DC) optimization. Our approach has computational guarantees, gracefully scales to large problems, is applicable to situations where the sample covariance matrix is rank deficient and adapts to variants of the ML problem with additional constraints on the problem parameters. Our numerical experiments demonstrate the significant usefulness of our approach over existing state-of-the-art approaches.

## Authors

• 9 publications
• 35 publications
07/31/2019

### Binary Component Decomposition Part I: The Positive-Semidefinite Case

This paper studies the problem of decomposing a low-rank positive-semide...
05/12/2021

### A new perspective on low-rank optimization

A key question in many low-rank problems throughout optimization, machin...
02/16/2018

### High-dimensional covariance matrix estimation using a low-rank and diagonal decomposition

We study high-dimensional covariance/precision matrix estimation under t...
06/01/2019

### Multi-reference factor analysis: low-rank covariance estimation under unknown translations

We consider the problem of estimating the covariance matrix of a random ...
07/31/2019

### Binary component decomposition Part II: The asymmetric case

This paper studies the problem of decomposing a low-rank matrix into a f...
06/28/2019

### Learning Markov models via low-rank optimization

Modeling unknown systems from data is a precursor of system optimization...
03/13/2013

### Ranking and combining multiple predictors without labeled data

In a broad range of classification and decision making problems, one is ...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Factor Analysis (FA) [2, 6, 24]

, arguably a fundamental dimensionality reduction technique for multivariate data has been around for more than a hundred years. FA is popularly used to understand the correlation structure among a collection of observed variables in terms of a smaller number of common factors. In a typical FA model, we assume that the (mean centered) observed random vector

may be expressed in the form: , where, is a matrix of factor loadings, is a random vector of scores and is a vector of uncorrelated variables. We assume that and are uncorrelated, have zero means and without loss of generality we set —this leads to the following decomposition:

 Cov(x):=Σ=LL⊤+Ψ, (1)

where, . Decomposition (1) suggests that the population covariance matrix , can be written as the sum of a low rank positive semidefinite (PSD) matrix and a diagonal matrix with nonnegative entries. In particular, this implies that can be decomposed in two parts where,

represents the variance of

which is shared with other variables via common factors (communality) and represents the variance of not shared with other variables (specific or unique variance).

One of the most popular FA estimation methods is the Maximum Likelihood (ML) procedure [24, 2, 6]. Given multivariate samples , which are assumed to be mean-centered, the task is to minimize the negative log-likelihood with respect to that is of the form (1). This leads to the following optimization problem:

 minimize L(Σ):=−logdet(Σ−1)+tr(Σ−1S) (2) s.t. Σ=Ψ+LL⊤ Ψ=diag(ψ1,…,ψp)⪰ϵI,

where, is the sample covariance matrix; and are the optimization variables; the notation means that is PSD; and is the usual trace operator. Here, is a small positive constant, specified a-priori to ensure that Problem (2) is bounded below111Indeed, implies that . Thus, and which shows that Problem 2 is bounded below. Note that Problem (2) with need not have a finite solution, i.e., the ML solution need not exist. Note that if one of the then , a similar argument applies if becomes unbounded. Thus the infimum of Problem (2) is attained whenever .. We note that Problem (2) can be rewritten in terms of a new variable and placing a rank constraint on as: – hence, in what follows, we will often refer to Problem (2) as a rank constrained optimization problem.

Observe that Problem (2) is nonconvex in since (a) the objective function is not convex [10] in and (b) the equality constraint is nonconvex.

##### Related work:

FA has a long tradition that dates back to more than a hundred years [33]. We present here a selective overview that is relevant for this work—important contributions in FA have been nicely documented in [24, 2, 6, 4]. Despite being a problem of fundamental importance in statistical estimation, not much is known about its computational properties. Unfortunately, popular off-the-shelf implementations for ML factor analysis (as available in Matlab and R that are in routine use) are quite unstable222We have observed this in our experiments and this is also reported in our section on numerical experiments.

– they are based on rather ad-hoc computational algorithms and lead to negative variance estimates which are problematic from a statistical inferential viewpoint. This is perhaps not surprising, given that the basic problem underlying ML factor analysis is a difficult (nonconvex) optimization problem and there has been limited work in developing mathematical optimization based algorithms for this problem. It is also difficult to generalize existing algorithms in the presence of additional constraints, depending upon the problem/application context.

[21, 2] present a nice overview of classical algorithms used for ML factor analysis. Some modern computational approaches for ML factor analysis are based on the seminal contribution of [17]. This approach requires to be of full rank and one needs to assume that has exactly rank . It applies a (rather ad-hoc) gradient descent like algorithm w.r.t . Recently [28] provide necessary and sufficient conditions for existence of a solution of Problem (2) with , however, they do not provide any computational algorithm for solving Problem (2). Another popular approach for ML factor analysis is based on the EM algorithm[30, 8]. Publicly available implementations of the EM-type methods apply to the case where, is smaller than and hence is not full rank.

Not all methods in FA are based on the ML framework. In other approaches, one seeks to estimate a matrix of the form , which is close to the sample covariance matrix in terms of some metric (for e.g., the Frobenius norm: ). Some popular methods in the literature are minimum residual FA, principal axis, principal component method, minimum trace FA, among others —see [32, 6, 7] for more description on these approaches. Fairly recently [7] proposed integer optimization based methods for the problem of minimizing , where, is of the form (1) with an additional restriction of . [31] study the noiseless decomposition for the FA problem, using nuclear norm relaxations of the rank constraint on . The aforementioned line of work is different from the ML criterion in FA as the data-fidelity measure is different. In this paper, our focus is on the computational properties of the ML problem (2).

##### Contributions:

The main contributions of this paper can be summarized as follows:

1. We propose a new computational framework for the task of (Gaussian) Maximum Likelihood estimation in Factor Analysis – a problem central to classical and modern multivariate statistical learning. The associated optimization problem is challenging: it is given by the minimization of a nonconvex function subject to a rank constraint and additional semidefinite constraints. We reformulate the problem as the minimization of a nonsmooth nonconvex function of eigenvalues of a positive semidefinite matrix, subject to (simple) polyhedral constraints.

2. Using (convexity) properties of spectral functions, we show that the objective function can be expressed as a difference of convex functions; and is hence amenable to computational techniques in difference of convex optimization (see for example [16]

). The computational bottleneck of our algorithm is a low rank singular value decomposition (SVD) of a

matrix that needs to be performed for every iteration – exploiting problem structure, we show that this can be computed with cost . An important advantage of our proposal, when compared to many other commonly used, publicly available implementations; is that it applies to the case where the sample covariance matrix is rank deficient.

3. We explore computational guarantees of our proposed algorithm in terms of reaching a first order stationary point. We demonstrate that on a series of numerical examples (including both real and synthetic datasets), our method significantly outperforms commonly used approaches for ML factor analysis (in terms of superior numerical stability, smaller computation times and obtaining solutions with better objective values). To our knowledge, our proposal is one of the most scalable computational mathematical optimization-based approaches for factor analysis. Our approach also generalizes to instances of ML factor analysis where, is not necessarily diagonal.

##### Notation:

For a real symmetric matrix , we will denote by , the vector of real eigenvalues of , i.e., with for all . For a real postive semidefinite (PSD) matrix we use to denote its symmetric square root. If is invertible and PSD, then we use to denote the square root of . The matrix

denotes the identity matrix (with dimension determined from the context, unless otherwise mentioned). For a vector

, we use the notation to denote component-wise inequality; for a matrix , we use the notation (or ) to denote that the matrix is positive semidefinite (respectively, positive definite). We will assume all diagonals of are strictly greater than zero. For a non-negative integer we denote by .

## 2 Methodology

We state a simple result (the proof of which is omitted) regarding the eigenvalues of the product of two matrices – a property that is used throughout the paper.

###### Proposition 1.

For any two symmetric matrices and , we have .

A simple corollary of the above that is used widely in the paper is:

###### Corollary 1.

For any PSD matrix , we have:

 λ(Φ12SΦ12)=λ(SΦ)=λ(ΦS)=λ(S12ΦS12) (3)

### 2.1 Reformulations

In this section we present a reformulation of the low rank optimization Problem (2) to one that does not involve any (combinatorial) rank constraint: Proposition 2 reformulates Problem (2) as an optimization problem in . The resulting problem (4) is amenable to efficient optimization techniques based on difference of convex optimization. In Proposition 3 we provide a characterization of an optimal solution of Problem (4).

###### Proposition 2.

(a) Problem (2) is equivalent to:

 minimize {logdet(Ψ)+tr(S∗)+r∑i=1(log(max{1,λ∗i})−max{1,λ∗i}+1)} (4) s.t. Ψ=diag(ψ1,…,ψp)⪰ϵI,

where, denote the eigenvalues of ; and the optimization variables are and .

(b) Suppose is a minimizer of Problem (4) and where,

are eigenvectors of

corresponding to its top eigenvalues with for all . Then is a minimizer of Problem (2).

###### Proof.

Part (a): We first minimize Problem (2) with respect to for a fixed value of . A simple application of the Sherman Woodbury formula, with some rearrangement gives:

 Σ−1=(Ψ+LL⊤)−1=Ψ−1−Ψ−1L(I+L⊤Ψ−1L)−1L⊤Ψ−1=Ψ−1−Ψ−12Ψ−12L(I+L⊤Ψ−12Ψ−12L)−1L⊤Ψ−12Ψ−12 (5)

Writing in the last line of display (5), we get:

 Σ−1=Ψ−1−Ψ−12L∗(I+(L∗)⊤L∗)−1(L∗)⊤Ψ−12.

The above implies that:

 tr(Σ−1S)= tr(Ψ−1S)−tr(Ψ−12L∗(I+(L∗)⊤L∗)−1(L∗)⊤Ψ−12S) (6) = tr(Ψ−12SΨ−12)−tr((L∗)⊤Ψ−12SΨ−12L∗(I+(L∗)⊤L∗)−1) = tr(S∗)−tr((L∗)⊤S∗L∗(I+(L∗)⊤L∗)−1)     (Using, S∗=Ψ−12SΨ−12)

where, in the second and third lines of display (6) we (repeatedly) used the fact that .

 −logdet(Σ−1)=logdet(Σ)=logdet(Ψ+LL⊤)=logdet(Ψ)+logdet(I+L∗⊤L∗) (7)

Using (6), (7) in the objective function of Problem (2), i.e., we get the following equivalent reformulation of Problem (2):

 (8)

where, recall we use the notation: and ; and the optimization variables in Problem (8) are (and consequently, ). Note that

for any orthogonal matrix

. So we can substitute by in Problem (8). We choose such that the columns of are orthogonal or zero vectors. Note that the derivative of w.r.t. is given by:

 ∂h(Ψ,L)∂L=−2Σ−1(Σ−S)Σ−1L. (9)

Setting the above gradient to be zero and writing we have . After some elementary algebra this can be written as:

 S∗L∗(I+L∗⊤L∗)−1=L∗. (10)

Since we choose the columns of to be pairwise orthogonal or zero vectors, it follows that is a diagonal matrix. This means that (10) is a collection of eigenvector equations for the matrix . From condition (10) we have that at an optimal value of , the columns of are either pairwise orthogonal eigenvectors of with eigenvalues as the diagonal entries of or they are zero vectors. Denoting the columns of by the part of the function in display (8), that depends upon is given by:

 g(L)=r∑i=1(log(1+z⊤izi)−z⊤iS∗zi1+z⊤izi). (11)

Since s are pairwise orthogonal or zero vectors, it follows from equation (10) that

 S∗zi=βizi, with βi=1+z⊤izi. (12)

Note that in the above equation, either with or and equals some eigenvalue of with eigenvector — thus (11) becomes

 g(L)=r∑i=1(log(βi)−βi+1). (13)

Note that is strictly decreasing for all . So it is easy to see that (13) is minimized for for where, are the top eigenvalues of . The optimal choice of is given by when and when , is an eigenvector of with eigenvalue and with . Noting that

 minΨ⪰ϵI,L h(Ψ,L) = minΨ⪰ϵI {minLh(Ψ,L)}, (14)

and substituting the values of that minimize the inner minimization problem, above, into the objective function , we obtain (4).

Part (b): The proof of this part is a consequence of the proof of Part (a). ∎

The method of minimizing the objective function w.r.t. with held fixed, is inspired by the classical work of [19, 20, 17]—this line of work however, assumes to be of full rank. We do not assume that is full rank. [28] investigate the existence of ML solutions for a general — no algorithms for computing the solution are presented. The expression (4) derived herein, does not appear in [28]. Formulation (4) plays a key role in developing algorithms for Problem (4), a main focus of this paper.

Proposition 3 shows that any solution of Problem (4) is bounded above.

###### Proposition 3.

Let be a solution of Problem (4). Then .

###### Proof.

Note that for any , . Let be the optimal value of for fixed value of . Then from the proof of Proposition 2, it follows that at . Setting (9) to zero, we have ; and applying Sherman Woodbury formula on , we have the following chain of inequalities:

 L = SΣ−1L (15) = = SΨ−1L−SΨ−1L((I+L⊤Ψ−1L)−1L⊤Ψ−1L) (16) = (17) = SΨ−1L(I+L⊤Ψ−1L)−1. (18)

Eqn (15) follows from (5); Eqn (17) follows from (16) by using the observation that for a PSD matrix we have the following identity: (which can be verified by simple algebra).

Moreover using (5) on the expression simplifies as:

 SΣ−1 = SΨ−1−(SΨ−1L(I+L⊤Ψ−1L)−1)(L⊤Ψ−1) (19) = SΨ−1−LL⊤Ψ−1   (Using expression of L from rhs of~{}% (???)) = SΨ−1−(Σ−Ψ)Ψ−1   (Since, Σ=LL⊤+Ψ) = SΨ−1−ΣΨ−1+I.

Note that we have the following expression for :

 Σ−1(Σ−S)Σ−1=Σ−1−Σ−1(SΣ−1)=Σ−1−Σ−1(SΨ−1−ΣΨ−1+I)=−(Σ−1S)Ψ−1+Ψ−1=−(Ψ−1S−Ψ−1Σ+I)Ψ−1+Ψ−1=Ψ−1(Σ−S)Ψ−1. (20)

where, the second line follows by using expression of from (19); and the fourth line follows by using the same expression for .

Using (20), the expression for reduces to

 ∂h∂Ψ=diag(Σ−1(Σ−S)Σ−1)=Ψ−1diag(Σ−S)Ψ−1. (21)

So, we conclude that if for some . We are minimizing Problem  (4), so the optimal solution must satisfy . This completes the proof. ∎

We present another equivalent representation of Problem (4), by a simple change of variables . In what follows below, unless otherwise specified, we use the shorthand and .

###### Corollary 2.

Problem (4) is equivalent to the following optimization problem in :

 minimize f(ϕ):=p∑i=1(−logϕi+siiϕi)+r∑i=1(log(max{1,λ∗i})−max{1,λ∗i}+1) (22) s.t. 0≺Φ=diag(ϕ1,…,ϕp)⪯1ϵI,

where, are the top eigenvalues of . If is a solution to Problem (4) then, we have where, is a solution to Problem (22).

###### Remark 1.

Problem (4) (and Problem (22)) is a minimization problem in (respectively, ), unlike the (rank constrained) Problem (2) with variables and . Note that Problem (22) is nonconvex due to the nonconvex objective function, though the constraints are convex. Corollary 3 shows that the the objective function appearing in Problem (22) is neither convex nor concave but it can be written as a difference of simple convex functions.

### 2.2 Expressing Problem (22) as a difference of convex functions

In this section we show via Proposition 4 and  5 that the objective function in Problem (22) can be written as a difference of two convex functions (Corollary 3). This renders the application of algorithms based on difference of convex optimization, to get good solutions to Problem (22). Proposition 6 shows that when the sample covariance matrix is full rank, the objective function in Problem (22) can be expressed purely in terms of the eigenvalues of .

Let be an ordering of and define:

 Hr(y):=r∑i=1(log(max{1,y(i)})−max{1,y(i)}+1). (23)

The following proposition shows that is concave on .

###### Proposition 4.

as defined in (23), is concave on .

###### Proof.

We first establish that admits the following representation:

 Hr(y)=minw˜H(w;y):=p∑i=1wi(log(max{1,yi})−max{1,yi}+1)s.t.p∑i=1wi=r,   0≤wi≤1,i∈[p], (24)

as the minimum (w.r.t. the optimization variable ) of the linear functional . To see why this is true, note that the scalar function is decreasing on . Hence the sum will be minimized for a choice: whenever is one of the top elements among ; and for all other choices of . This justifies representation (24).

For any , note that is concave. Hence, for every fixed , the function

 (y1,…,yp)↦p∑i=1wi(log(max{1,yi})−max{1,yi}+1)

is concave on . Since the pointwise infimum of a family of concave functions is concave [10], is concave on the nonnegative reals. ∎

###### Proposition 5.

For any , the function

 ϕ↦h(ϕ):=r∑i=1(log(max{1,λ∗i})−max{1,λ∗i}+1) (25)

is concave on ; where, are the eigenvalues of .

###### Proof.

Note (Corollary 1). By a classic result due to Davis [12, 23] the following mapping

 S12ΦS12↦r∑i=1(log(max{1,λ∗i})−max{1,λ∗i}+1) (26)

is concave in if and only if the function (23) is symmetric333A function is said to be symmetric in its arguments if, for any permutation of the indices , we have and concave in on . It is easy to see that the function in (23) is symmetric and concavity follows from Proposition 4. So we conclude that the map in (26) is concave in . The linearity of the map implies that is concave in on . This completes the proof of the proposition. ∎

###### Corollary 3.

For any , can be written as the difference of two convex functions, that is: , where,

 f1(ϕ)=p∑i=1(−logϕi+siiϕi)   and   f2(ϕ)=−r∑i=1(log(max{1,λ∗i})−max{1,λ∗i}+1).
###### Proof.

The convexity of is easy to see. Proposition 5 implies that is convex. ∎

When is full rank, i.e., , then Problem (22) can be rewritten purely as a function of the eigenvalues – this is established in Proposition 6.

###### Proposition 6.

If Problem (22) is equivalent to the following problem:

 minimize¯f(ϕ):=p∑i=1(−logλ∗i+λ∗i)+r∑i=1(log(max{1,λ∗i})−max{1,λ∗i}+1)s.t.0≺Φ=diag(ϕ1,…,ϕp)⪯1ϵI, (27)

where, are the eigenvalues of .

###### Proof.

Problem (22) is equivalent to minimizing over The function can be expressed as:

 ¯f(ϕ)= p∑i=1(−logϕi+siiϕi)−logdet(S)+r∑i=1(log(max{1,λ∗i})−max{1,λ∗i}+1) (28) = −logdet(S∗)+tr(S∗)+r∑i=1(log(max{1,λ∗i})−max{1,λ∗i}+1) (29) = (30)

where, line (29) follows from (28) by observing that

 p∑i=1(−log(ϕi)+siiϕi)−logdet(S)= −logdet(Φ)+tr(SΦ)−logdet(S) (31) = −logdet(SΦ)+tr(SΦ) = −logdet(S∗)+tr(S∗),

where, ; and the last equality in (31) made use of Corollary 1. Moreover, note that since is of full rank and ; all the eigenvalues . This completes the proof. ∎

Proposition 6 provides an interesting alternate characterization of the formulation presented in Corollary 3 – this helps us gain additional understanding of the optimization problem for ML factor analysis when .

Herein, we study the computation of (sub)gradients [29] of properties of the functions and . Note that is differentiable. However, the convex (spectral) function is not differentiable. A subgradient of can be computed following the work of [22] on differentiability of spectral functions of Hermitian matrices. To this end, consider the representation of in (24); and define the function on . Define which is a convex function in . If is a subgradient of , then it can be computed by using Danskin’s theorem: where, is a minimizer of the inner optimization task in Problem (24); and is the gradient of , with th coordinate given by and for all . The function is differentiable at iff is unique444We note that this occurs if .. The set of all subgradients of is given by

 Conv({−p∑i=1^wi∇g(yi):^w is a minimizer of Problem~{}(???)}).

Let us consider a matrix , with eigen decomposition , and consequently consider the spectral convex function Using properties of subgradients of spectral functions [22], we have that a subgradient of is given by:

 ∂~gr(A)=Vdiag(∂˜Hr(λ))V⊤, (32)

where, is a subgradient of .

If is the eigen decomposition of

; then using the chain rule, the

th coordinate of is given by:

 ∂if2(ϕ)=(Φ−12UD1U⊤Φ12S)ii, (33)

where, , with

 δi=⎧⎪⎨⎪⎩max{0,1−1λ∗i}if 1≤i≤r 0otherwise.

### 2.3 Algorithm for Problem (22): A DC optimization approach

Problem (22) is a nonconvex optimization problem with semidefinite constraints and obtaining a global minimum for a general is quite challenging. We thus focus on developing efficient computational procedures for obtaining good (feasible) solutions to Problem (22). By Corollary 3, Problem (22) is equivalent to the following nonconvex optimization problem:

 minimize f(ϕ)=f1(ϕ)−f2(ϕ) (34) s.t. ϕ∈C:={ϕ:1ϵ≥ϕi>0,i∈[p]}.

We use a sequential linearization procedure: wherein, at every iteration, we linearize the function (leaving as is) and solve the resultant convex problem. This is an instance of the well-known difference of convex optimization based algorithms [16, 35, 27]

or DC algorithms in short. In the machine learning community, these methods are also known as the convex concave procedure

[36, 37]. These algorithms have gained significant traction in the wider optimization community, especially recently, due to their pervasive use in practice – some excellent recent works on this topic include [1, 13, 25, 26] (see also references therein). However, to our knowledge, ours is the first paper to use this approach in the context of ML factor analysis. We will also like to remind the reader that the reformulations in Section 2.1 play a key role in arriving at decomposition (34) — this sets the stage for the application of the DC approach.

Let us formally describe the algorithm. If denotes the value of at the th iteration, we linearize at with unchanged and obtain a convex approximation of , denoted by ; and this is given by:

 f(ϕ)≈f1(ϕ)−(f2(ϕ(k))+⟨∇k,ϕ−ϕ(k)⟩):=F(ϕ;ϕ(k)), (35)

where, is a subgradient of at (Section 2.2.1). We compute as:

 ϕ(k+1)∈argmin1ϵ1≥ϕ>0  F(ϕ;ϕ(k))=argmin1ϵ1≥ϕ>0  p∑i=1(−logϕi+siiϕi−∇k,iϕi), (36)

where, is the th coordinate of (for all ). The th entry of is given by:

 ϕ(k+1)i=min{1sii−∇k,i,1ϵ}   for   i∈[p].

The updates continue till some stopping criterion is satisfied. This can be in terms of the relative change in the successive objective values: