SQUAREM: An R Package for Off-the-Shelf Acceleration of EM, MM and Other EM-like Monotone Algorithms

10/26/2018, by Yu Du, et al., Johns Hopkins University and Eli Lilly and Company

We discuss the R package SQUAREM for accelerating iterative algorithms which exhibit slow, monotone convergence. These include the well-known expectation-maximization (EM) algorithm, majorize-minimize (MM) algorithms, and other EM-like algorithms such as expectation conditional maximization and generalized EM algorithms. We demonstrate the simplicity, generality, and power of SQUAREM through a wide array of applications of EM/MM problems, including binary Poisson mixture, factor analysis, interval censoring, genetics admixture, and logistic regression maximum likelihood estimation (an MM problem). We show that SQUAREM is easy to apply and can accelerate any fixed-point, smooth, contraction mapping with a linear convergence rate. The squared iterative scheme (Squarem) provides significant speed-up of EM-like algorithms. The advantage of Squarem is especially large for high-dimensional problems or when an EM step is relatively expensive to evaluate. Squarem can be used off-the-shelf since there is no need for the user to tweak any control parameters to optimize performance. Given its remarkable ease of use, Squarem may be considered a default accelerator for slowly converging EM-like algorithms. All comparisons of CPU computing time in the paper are made on a quad-core 2.3 GHz Intel Core i7 Mac computer. The R package SQUAREM can be downloaded at https://cran.r-project.org/web/packages/SQUAREM/index.html.


1 Introduction

The R package SQUAREM provides convergence acceleration techniques for speeding up slow, monotone iterative algorithms. These include the well-known expectation-maximization (EM) algorithm [Dempster et al., 1977], majorize-minimize (MM) algorithms [Lange et al., 2000], and other algorithmic variants such as expectation-conditional maximization (ECM) [Meng and Rubin, 1993] and expectation-conditional maximization either (ECME) [Liu and Rubin, 1998], among others. Dempster et al. [1977] refer to such variants as “generalized EM (GEM)” when the M-step is only partially implemented. In this paper, we term these “EM-like algorithms”, because they all have a contractive fixed-point mapping with a linear rate of convergence, like EM. For the definitions of linear rate of convergence and contractive mapping, please refer to [Ortega and Rheinboldt, 1970a, Chapter 5]. All of these algorithms are essentially based on the idea that a relatively difficult optimization problem can be converted to a much simpler iterative algorithm with guaranteed, albeit slow, convergence. A visual image is apt here: instead of embarking upon a direct and treacherously steep climb, we approach the summit through a winding, gradually ascending path. Interestingly, this idea has become very attractive with the advent of the big data revolution and high-dimensional applications, where solving the original optimization problem directly is either impossible or prohibitively expensive. There is a nice analogy to this in numerical linear algebra for solving large-scale linear systems of equations. Indirect, iterative techniques (e.g., Gauss-Seidel) for solving linear systems were long considered too slow and impractical, and only of pedagogical interest. Instead, the attention of the research community was focused on direct methods such as the various decomposition and factorization methods (LU, QR, SVD). But such direct techniques are ill-suited to solving the modern-day, large-scale linear systems with millions of equations. Therefore, clever adaptations of indirect iterative methods are emerging as the methods of choice (e.g., conjugate gradient) [Censor and Zenios, 1997; Saad, 2003]. Similarly, for the estimation of statistical models in large, high-dimensional modern applications, EM-like algorithms are becoming indispensable tools in the arsenal of computational scientists [Patro et al., 2014; Shiraishi et al., 2015; Raj et al., 2015; Chiou et al., 2017, etc.].

EM-like algorithms are characterized by two essential features: reliable, monotone convergence, and slow, linear rate of convergence. Therefore, any strategy that can accelerate the rate of convergence of these algorithms, without compromising on their reliability and ease of use, will be of huge help. Zhou et al. [2011] remarked that “In many statistical problems, maximum likelihood estimation by an EM or MM algorithm suffers from excruciatingly slow convergence. This tendency limits the application of these algorithms to modern high-dimensional problems in data mining, genomics, and imaging. Unfortunately, most existing acceleration techniques are ill-suited to complicated models involving large numbers of parameters. The squared iterative methods (SQUAREM) recently proposed by Varadhan and Roland constitute one notable exception.” The goal of this paper is to demonstrate the utility of this “notable exception", Squarem, proposed by Varadhan and Roland [2008], which is available in the R package SQUAREM [Varadhan, 2016].

The main aim of SQUAREM is to facilitate the development of computationally efficient new statistical models. In particular, the package provides acceleration schemes that can speed up the estimation of statistical models whose parameters are estimated with monotone, EM-like algorithms. Acceleration of these estimation algorithms can be readily achieved using the function squarem(). Here we demonstrate the simplicity, generality, and power of SQUAREM through a wide array of EM/MM applications in R [R Core Team, 2015], illustrating how easy it is to use Squarem to derive efficient solutions. However, it should be recognized that there is no foolproof numerical algorithm; in poorly identified problems, where even the EM algorithm can fail, Squarem is not guaranteed to work.

2 Squared iterative method

Suppose we have observed data $y$ that come from a probability density function $f(y; \theta)$, where $\theta$ is the parameter of interest. We are often interested in computing the maximum likelihood estimate (MLE) of $\theta$, denoted by $\hat{\theta}$. The EM algorithm is a popular technique for computing the MLE; it consists of two steps, an E-step and an M-step [Dempster et al., 1977]. The EM algorithm is natural when there is a missing data component $z$ in the probability model which, when known, greatly simplifies the estimation of $\theta$. Let us use $(y, z)$ to denote the complete data. The EM algorithm then becomes:

  • E-step

    A function $Q(\theta \mid \theta^{(k)})$ is constructed such that
    $$Q(\theta \mid \theta^{(k)}) = E\left[\log f(y, z; \theta) \mid y, \theta^{(k)}\right] = \int \log f(y, z; \theta)\, f(z \mid y, \theta^{(k)})\, dz,$$
    where $k$ refers to the $k$-th iteration of the algorithm, $\log f(y, z; \theta)$ is the complete data log-likelihood, and $f(z \mid y, \theta^{(k)})$ is the conditional density function of the missing data $z$ given the observed data $y$. Thus, the function $Q$ computes the expected value of the complete data log-likelihood given the current estimates of the parameter values and the observed data.

  • M-step

    The M-step maximizes the function $Q(\theta \mid \theta^{(k)})$ from the E-step over $\theta$ to compute the next, $(k+1)$-th, iteration of parameter values, $\theta^{(k+1)}$, such that
    $$\theta^{(k+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(k)}).$$

The EM algorithm therefore defines a fixed-point mapping $F$ such that $\theta^{(k+1)} = F(\theta^{(k)})$, $k = 0, 1, \ldots$, and the MLE $\hat{\theta}$ is a fixed point of $F$, i.e., $F(\hat{\theta}) = \hat{\theta}$.

Two convergence criteria can be applied and are both satisfied by the EM algorithm: 1) for the parameter estimates $\theta^{(k)}$, $\|\theta^{(k+1)} - \theta^{(k)}\| \rightarrow 0$ as $k \rightarrow \infty$ ($\|\cdot\|$ is the Euclidean norm); 2) convergence defined on the sequence of likelihood values of the parameter estimates, $L(\theta^{(k)})$, such that $|L(\theta^{(k+1)}) - L(\theta^{(k)})| \rightarrow 0$ as $k \rightarrow \infty$. The EM algorithm is guaranteed to produce monotone convergence, such that $L(\theta^{(k+1)}) \geq L(\theta^{(k)})$. By Taylor's theorem, under regularity conditions, expanding $F$ around $\hat{\theta}$ gives
$$\theta^{(k+1)} - \hat{\theta} = F(\theta^{(k)}) - F(\hat{\theta}) \approx J(\hat{\theta})\,\big(\theta^{(k)} - \hat{\theta}\big),$$
where $J(\hat{\theta})$ is the Jacobian matrix of $F$ evaluated at $\hat{\theta}$. Dempster et al. [1977] showed that $J(\hat{\theta})$, the Jacobian matrix, measures the fraction of missing information. Under weak regularity conditions, the eigenvalues of $J(\hat{\theta})$ lie on $[0, 1)$. Thus, the largest eigenvalue of $J(\hat{\theta})$ governs the rate of convergence of EM. The closer this eigenvalue is to unity, indicating a large fraction of missing information, the slower EM converges.

Motivated by the Cauchy-Barzilai-Borwein (CBB) method [Raydan and Svaiter, 2002], Roland and Varadhan [2005] and Varadhan and Roland [2008] constructed Squarem by defining the following recursive error relation:
$$e^{(k+1)} = \left[I - \alpha_k \left(J(\hat{\theta}) - I\right)\right]^{2} e^{(k)},$$
where $e^{(k)} = \theta^{(k)} - \hat{\theta}$, $I$ is the identity matrix, and $\alpha_k$ is the steplength, which takes into account the larger eigenvalues of $J(\hat{\theta})$. The pseudocode for the Squarem algorithm is listed in Table 1 [Varadhan and Roland, 2008], which demonstrates the remarkable simplicity of the algorithm.

Input: $\theta_0$, fixed-point mapping $F(\cdot)$, merit function $L(\cdot)$, tolerance
While not converged:
    $\theta_1 = F(\theta_0)$;  $r = \theta_1 - \theta_0$
    $\theta_2 = F(\theta_1)$;  $v = (\theta_2 - \theta_1) - r$
    Compute steplength $\alpha = -\sqrt{r^{\top} r / v^{\top} v}$
    If $\alpha > -1$, set $\alpha = -1$; else keep $\alpha$
    $\theta' = \theta_0 - 2\alpha r + \alpha^2 v$
    $\theta_0 \leftarrow F(\theta')$, stabilization step (done only if $\alpha < -1$); otherwise $\theta_0 \leftarrow \theta'$
Table 1: Pseudocode for Squarem.

There are three choices for the steplength $\alpha$, as described in Varadhan and Roland [2008]. It is our experience that the third steplength, $\alpha = -\sqrt{r^{\top} r / v^{\top} v}$ (used in Table 1), generally works the best, and hence it is the default steplength used in SQUAREM. Varadhan and Roland [2008] also showed global convergence of the Squarem algorithm, i.e., Squarem can converge to a stationary point from any starting value in the parameter space, or at least in a large part of it, by modifying the steplength to ensure monotonicity. Note that when the steplength is equal to $-1$, a Squarem evaluation is the same as two EM updates. Thus, each iteration of Squarem involves 2 or 3 evaluations of EM. Hence, when we compare EM to Squarem, we use the number of EM steps rather than the number of iterations. Apart from the EM steps, there is minimal cost in computing the Squarem parameter updates, including the computation of the value of the likelihood function. In addition to the convergence criteria provided earlier, we give a definition of convergence acceleration as follows: suppose $\{\theta^{(k)}_{1}\}$ is the sequence of estimates produced by Algorithm 1, while $\{\theta^{(k)}_{2}\}$ is that given by Algorithm 2, both converging to $\hat{\theta}$; then we say that Algorithm 2 accelerates Algorithm 1 if $\|\theta^{(k)}_{2} - \hat{\theta}\| / \|\theta^{(k)}_{1} - \hat{\theta}\| \rightarrow 0$ as $k \rightarrow \infty$.
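To make Table 1 concrete, below is a minimal R sketch of a single first-order (K = 1) Squarem update with the default steplength. It is a simplified illustration only, not the safeguarded implementation inside squarem(), which additionally handles near-monotonicity control, steplength bounds, and convergence checking; the function name is ours.

R> squarem.step <- function(theta0, fixptfn, ...) {
+    theta1 <- fixptfn(theta0, ...)               # first EM evaluation
+    theta2 <- fixptfn(theta1, ...)               # second EM evaluation
+    r <- theta1 - theta0
+    v <- (theta2 - theta1) - r
+    alpha <- -sqrt(sum(r^2) / sum(v^2))          # default steplength
+    if (alpha > -1) alpha <- -1                  # never take a shorter step than two EM updates
+    theta.new <- theta0 - 2 * alpha * r + alpha^2 * v
+    if (alpha < -1) theta.new <- fixptfn(theta.new, ...)  # stabilization step
+    theta.new
+  }

For example, squarem.step(c(0.3, 1, 5), poissmix.em, y = y) would carry out one such update using the Poisson mixture EM step developed in Section 4.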

3 Description of R package Squarem

R Package SQUAREM is available on CRAN. It can be downloaded via https://cran.r-project.org/web/packages/SQUAREM/index.html. SQUAREM works for any smooth, contraction mapping with a linear convergence rate (e.g., EM-like algorithms). We describe below the two main functions, squarem() and fpiter(). Obviously, squarem() is the featured function in the package.

  • squarem(), for squared iterative scheme

    squarem() is a function to accelerate any smooth, contractive, fixed-point iteration algorithm, including EM/MM and other EM-like algorithms. The main arguments include par, fixptfn, objfn and control. par denotes the starting value of the parameters. The argument fixptfn defines a function representing the fixed-point iteration, $\theta^{(k+1)} = F(\theta^{(k)})$; fixptfn encodes a single step of any EM-like algorithm.

    R> fixptfn <- function(par, data, ...) {
    +    pnew <- F(par, data, ...)
    +    return(pnew)
    +  }
    

    objfn is the objective function we want to minimize. In the case of EM-like algorithms, it is the negative log-likelihood function of the data. It is not essential to supply the objective function in order for Squarem to work, but its provision guarantees global convergence. control specifies a list of algorithm options, including maxiter, the maximum number of iterations, and tol, the tolerance, among others. If $\|\theta^{(k+1)} - \theta^{(k)}\| < \text{tol}$, the algorithm declares convergence at the $(k+1)$-th iteration ($\|\cdot\|$ denotes the Euclidean norm). Under regularity conditions given by Wu [1983], the satisfaction of this convergence criterion does imply a local optimum. There are three important control parameters in squarem(), namely K, method and objfn.inc. method specifies the choice of steplength and K specifies the order of the squared iterative scheme. The default values method = 3 and K = 1 generally work well. objfn.inc controls the monotonicity of the objective function. Setting objfn.inc = 0 ensures strict monotonicity, while objfn.inc = Inf results in an unguarded acceleration scheme, where the objective function is not evaluated at all. The default is objfn.inc = 1, which results in a nearly-monotone acceleration scheme. We can also set objfn.inc equal to the average log-likelihood, i.e., the log-likelihood per individual sample. (An explicit call spelling out these control options is shown after this list.)

    To summarize, the default usage of squarem() is such that

    squarem(par, fixptfn, objfn, ..., control = list()).

  • fpiter(), for fixed-point iteration scheme

    fpiter() is a function to run a fixed-point iteration algorithm, including EM, MM and other EM-like algorithms, without any acceleration. The main arguments include par, fixptfn, objfn and control, which work the same way as in squarem(), except that there are no Squarem-specific control parameters in the argument control.

    To summarize, the default usage of fpiter() is such that

    fpiter(par, fixptfn, objfn, ..., control = list()).
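As a quick illustration of the control options discussed above, the following call spells out K, method, and objfn.inc at their default values, together with an explicit tolerance, using the Poisson mixture functions poissmix.em() and poissmix.loglik() developed in Section 4:

R> fit <- squarem(par = c(0.3, 1, 5), fixptfn = poissmix.em,
+    objfn = poissmix.loglik, y = y,
+    control = list(K = 1, method = 3, objfn.inc = 1, tol = 1e-07))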

In the next section, we provide a detailed illustration of how to use SQUAREM.

4 How to apply Squarem acceleration

Imagine that an EM-like algorithm with a slow, linear rate of convergence is used to estimate a model. In order to speed up the algorithm using the R package SQUAREM, two main ingredients need to be prepared.

  • Step 1

    Write the core of the EM-like algorithm as an R function that performs only one step of the algorithm. This function corresponds to the argument fixptfn in the function squarem().

  • Step 2

    Write an associated merit function to minimize, for example, the negative log likelihood function. This function corresponds to the argument objfn in function squarem().

There are also other arguments in the function squarem(), as specified in Section 3, such as the starting values, tolerance and maximum number of iterations, but the default choices often work well. Once these arguments are ready, the acceleration is a single call to squarem(). We now illustrate this usage of the R package SQUAREM in detail with a simple mixture problem introduced below, which was also discussed in Varadhan and Roland [2008]. Here we revisit this example mainly to illustrate how remarkably easy it is to apply the squarem() function, starting from an existing EM algorithm function.

In many studies, the study sample comes from a population which is a mix of two or more types of units, each with varying characteristics. Finite mixture models are ideally suited to account for this kind of heterogeneity. A finite mixture model estimates parameters describing each subpopulation and their mixing probabilities. The EM algorithm is a popular technique to compute the maximum likelihood estimates of mixture models, but it is notorious for its slow convergence. Here, we use a two-component Poisson mixture to illustrate the usage and power of Squarem compared to the EM algorithm. We use data on the number of deaths of women 80 years and older during the years 1910-1912 from The London Times [Hasselblad, 1969].

We use $p$ to denote the mixing probability and let $\mu_1$, $\mu_2$ be the means of the Poisson distributions for populations 1 and 2, respectively. Let $i$ be the number of deaths and $y_i$ be the number of days on which $i$ deaths occurred. The data are displayed in Table 2, where the death count $i = 0, 1, \ldots, 9$.

Death ($i$)  Frequency ($y_i$)  Death ($i$)  Frequency ($y_i$)
0 162 5 61
1 267 6 27
2 271 7 8
3 185 8 3
4 111 9 1
Table 2: Data on deaths of women 80 years or older during 1910 to 1912 from The London Times.

EM algorithm

For derivation of EM step, see Appendix A.1.

The EM update is
$$p^{(k+1)} = \frac{\sum_i y_i z_i^{(k)}}{\sum_i y_i}, \qquad \mu_1^{(k+1)} = \frac{\sum_i i\, y_i z_i^{(k)}}{\sum_i y_i z_i^{(k)}}, \qquad \mu_2^{(k+1)} = \frac{\sum_i i\, y_i \left(1 - z_i^{(k)}\right)}{\sum_i y_i \left(1 - z_i^{(k)}\right)},$$
where $(p^{(k)}, \mu_1^{(k)}, \mu_2^{(k)})$ are the $k$-th iteration estimates and $z_i^{(k)}$ is defined in Appendix A.1.

Next, we demonstrate how easy it is to set up Squarem acceleration of an EM-like algorithm. We implement the EM algorithm using function EM.poisson.mixture() below.

R> EM.poisson.mixture <- function(p, maxiter = 5000, tol = 1e-08, y) {
+    iter <- 1
+    conv <- FALSE
+    pnew <- rep(NA,3)
+    while(iter < maxiter) {
+      i <- 0 : (length(y) - 1)
+      zi <- p[1] * exp(-p[2]) * p[2]^i /
+        (p[1] * exp(-p[2]) * p[2]^i + (1 -
+        p[1]) * exp(-p[3]) * p[3]^i)
+      pnew[1] <- sum(y * zi) / sum(y)
+      pnew[2] <- sum(y * i * zi) / sum(y * zi)
+      pnew[3] <- sum(y * i * (1 - zi)) / sum(y * (1 - zi))
+      res <- sqrt(crossprod(pnew - p))
+      p <- pnew
+      if(res < tol) {
+        conv <- TRUE
+        break
+      }
+      iter <- iter + 1
+    }
+    return(list(par = p, fpevals = iter, convergence = conv))
+  }

In order to apply the squared iterative scheme (Squarem) using the function squarem(), we extract the part of the above EM function that corresponds to one EM step and put it into a separate function, poissmix.em(). This function corresponds to the argument fixptfn in the squarem() function. Cutting and pasting the code chunk from the above function, we create the function for fixptfn and complete Step 1 of applying Squarem.

R> poissmix.em <- function(p,y) {
+    pnew <- rep(NA,3)
+    i <- 0 : (length(y) - 1)
+    zi <- p[1] * exp(-p[2]) * p[2]^i /
+      (p[1] * exp(-p[2]) * p[2]^i + (1 -
+      p[1]) * exp(-p[3]) * p[3]^i)
+    pnew[1] <- sum(y * zi) / sum(y)
+    pnew[2] <- sum(y * i * zi) / sum(y * zi)
+    pnew[3] <- sum(y * i * (1 - zi)) / sum(y * (1 - zi))
+    p <- pnew
+    return(pnew)
+  }

Step 2 is to write an associated merit function to minimize, in this case the negative log likelihood function. The log likelihood of the observed data is
$$L(p, \mu_1, \mu_2) = \sum_{i} y_i \log\left( p\, \frac{e^{-\mu_1} \mu_1^{i}}{i!} + (1 - p)\, \frac{e^{-\mu_2} \mu_2^{i}}{i!} \right).$$
Therefore, the negative log likelihood is coded into the function poissmix.loglik(). This function corresponds to the argument objfn in the function squarem().

R> poissmix.loglik <- function(p,y) {
+    i <- 0 : (length(y) - 1)
+    loglik <- y * log(p[1] * exp(-p[2]) * p[2]^i / exp(lgamma(i + 1)) +
+      (1 - p[1]) * exp( - p[3]) * p[3]^i / exp(lgamma(i + 1)))
+    return(-sum(loglik))
+  }

Now, we are all set to apply squarem(). Let us then compare Squarem and the EM algorithm with the starting value $(p^{(0)}, \mu_1^{(0)}, \mu_2^{(0)}) = (0.3, 1, 5)$ and tolerance $10^{-8}$, to see how remarkably easy it is to apply Squarem, an off-the-shelf acceleration, once the EM algorithm has been implemented.

R> library("SQUAREM")
R> poissmix.dat <- data.frame(death = 0 : 9,
+    freq = c(162, 267, 271, 185, 111, 61, 27, 8, 3, 1))
R> y <- poissmix.dat$freq
R> p0 <- c(0.3,1,5)
R> system.time(f0 <- EM.poisson.mixture(p = p0, y = y))
user system elapsed
0.036 0.005 0.040
R> f0
$par
[1] 0.3598864 1.2560968 2.6634056

$value.objfn
[1] 1989.946

$fpevals
[1] 2696

$convergence
[1] TRUE
R> system.time(f1 <- fpiter(par = p0, fixptfn = poissmix.em,
+    objfn = poissmix.loglik, control = list(tol = 1.e-08), y = y))
user system elapsed
0.039 0.000 0.039
R> f1
$par
[1] 0.3598864 1.2560968 2.6634056

$value.objfn
[1] 1989.946

$fpevals
[1] 2696

$objfevals
[1] 0

$convergence
[1] TRUE
R> system.time(f2 <- squarem(par = p0, fixptfn = poissmix.em,
+    objfn = poissmix.loglik, control = list(tol = 1.e-08), y = y))
user system elapsed
0.003 0.000 0.002
R> f2
$par
[1] 0.3598859 1.2560960 2.6634050

$value.objfn
[1] 1989.946

$iter
[1] 19

$fpevals
[1] 54

$objfevals
[1] 19

$convergence
[1] TRUE

The output shows the equivalence of the standard EM loop function EM.poisson.mixture() and function fpiter() in R Package SQUAREM, and a dramatic improvement for Squarem over the EM algorithm. Squarem outperforms EM for this case by a factor of 50 in terms of the number of EM evaluations and by a factor of 20 with regards to the CPU running time. From this point on, we will use function fpiter() to implement EM and other EM-like algorithms.

We run the two algorithms for 5000 randomly generated starting values, with each starting value drawn uniformly at random. Table 3 displays the results. We provide the mean and 95% confidence interval for the number of EM evaluations, fevals, as well as the CPU time (in seconds). In general, Squarem converges 13 times faster than EM, using only 3.2% of the number of EM evaluations that the EM algorithm takes.

fevals CPU time(s)
EM 3140(2510, 3182) 0.039(0.031, 0.046)
Squarem 101(57,132) 0.003(0.002, 0.005)
Table 3: The convergence performance in terms of the number of EM evaluations, fevals and the CPU running time comparing EM to Squarem for Poisson mixture estimation with 5000 randomly generated starting values.

We also plot the error, $\|\theta^{(k)} - \theta^{*}\|$, as a function of the number of EM evaluations in Figure 1, using a fixed starting value, where $\theta^{(k)}$ is the $k$-th iteration estimate. The truth, $\theta^{*}$, is derived by running squarem() with a very small convergence tolerance. Figure 1 shows that the error for the Squarem algorithm drops at a much faster rate than for EM. Squarem converges in approximately 50 EM evaluations, while the error is still quite large for EM after 100 iterations.
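For instance, such a high-accuracy reference solution can be obtained simply by tightening the tolerance well below the default; the tolerance value shown here is illustrative, not necessarily the one used for the figure:

R> p.true <- squarem(par = p0, fixptfn = poissmix.em,
+    objfn = poissmix.loglik, control = list(tol = 1e-12), y = y)$par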

In the next section, we continue to demonstrate the utility of SQUAREM through a wide array of EM/MM applications, including factor analysis, interval censoring, genetics admixture estimation, and logistic regression maximum likelihood estimation (an MM problem).

Figure 1: The comparison of convergence behaviour between Squarem and EM.

5 Examples

5.1 Factor analysis

Factor analysis is a statistical modeling approach that aims to explain the variability among observed variables in terms of a smaller set of unobserved factors. Factor analysis is widely applied in areas where observed variables may be conceptualized as manifesting from some unobserved latent factors, such as psychometrics, behavioral sciences, social sciences, and marketing. The latent factors can be regarded as missing data in a multivariate normal model and the EM algorithm [Dempster et al., 1977], therefore, becomes a natural way to compute the maximum likelihood estimates. We will illustrate using two examples the dramatic accelerations of EM by Squarem and also compare with ECME [Liu and Rubin, 1998], an acceleration of EM. One example comes from real data, as used by Liu and Rubin [1998] and Rubin and Thayer [1982], while the other is a simulation example.

Notation

Following the notation in Rubin and Thayer [1982], let $Y$ denote the $n \times p$ observed data matrix and $Z$ the $n \times q$ unobserved factor matrix, where the rows $(y_i, z_i)$, $i = 1, \ldots, n$, are independently and identically distributed vectors following a multivariate normal distribution. The marginal distribution of the factors $z_i$ is multivariate normal with mean zero. Let the variance of each component of $z_i$ be 1, so the covariance matrix of $z_i$ is also the correlation matrix of $z_i$. The factor analysis model assumes that, given the factors $z_i$, the components of the vector $y_i$ become independent and multivariate normal, $y_i \mid z_i \sim N(\alpha + z_i \beta,\ \tau^2)$, where $\tau^2 = \mathrm{diag}(\tau^2_1, \ldots, \tau^2_p)$. The $q \times p$ matrix $\beta$ is called the factor loading matrix, while the diagonal variances in $\tau^2$ are called uniquenesses in factor analysis. In general, maximum likelihood factor analysis involves estimating $\alpha$, $\beta$, $\tau^2$, and the factor covariance matrix. But the factor covariance matrix is often taken to be the identity matrix (the orthogonal factor model), and the maximum likelihood estimator of $\alpha$ is always $\bar{y}$, the vector of column means of the observed data matrix $Y$. Suppose we center the matrix $Y$ by its column means; then the estimate of $\alpha$ is always the zero vector. Therefore, we are left with only $\beta$ and $\tau^2$ to estimate. Under these assumptions, the marginal distribution of $y_i$ is multivariate normal with mean zero and covariance matrix $\Sigma = \tau^2 + \beta^{\top} \beta$. Thus we can write the log likelihood of the observed data matrix $Y$ as
$$\ell(\beta, \tau^2 \mid Y) = -\frac{n}{2} \log |\Sigma| - \frac{n}{2}\, \mathrm{tr}\!\left(\Sigma^{-1} C_{yy}\right),$$
where $C_{yy}$ is the sample covariance matrix of $Y$. The negative log likelihood to be minimized is coded in the function factor.loglik(), which is used to check monotonicity in the Squarem algorithm.

EM algorithm

For derivation of EM step, see Appendix A.2.

If the loading matrix $\beta$ is unrestricted, the EM update is
$$\beta^{(k+1)} = \left(\delta^{\top} C_{yy}\, \delta + \Delta\right)^{-1} \delta^{\top} C_{yy}, \qquad \tau^{2\,(k+1)}_{j} = \left(C_{yy}\right)_{jj} - \left(\left(\delta^{\top} C_{yy}\right)^{\top} \left(\delta^{\top} C_{yy}\, \delta + \Delta\right)^{-1} \left(\delta^{\top} C_{yy}\right)\right)_{jj},$$
where $(\beta^{(k)}, \tau^{2\,(k)})$ are the $k$-th iteration estimates, and the matrices $\delta$ and $\Delta$, which depend on the current estimates, are defined in Appendix A.2.

Similarly, if the loading matrix has a priori zeros, the EM update is computed variable by variable using only the factors with nonzero loadings: for the $j$-th variable,
$$\beta^{(k+1)}_{[j],\, j} = \left[\left(\delta^{\top} C_{yy}\, \delta + \Delta\right)_{[j],[j]}\right]^{-1} \left(\delta^{\top} C_{yy}\right)_{[j],\, j}, \qquad \tau^{2\,(k+1)}_{j} = \left(C_{yy}\right)_{jj} - \left(\delta^{\top} C_{yy}\right)_{[j],\, j}^{\top} \left[\left(\delta^{\top} C_{yy}\, \delta + \Delta\right)_{[j],[j]}\right]^{-1} \left(\delta^{\top} C_{yy}\right)_{[j],\, j},$$
where $j$ refers to the $j$-th variable in the vector $y$, the subscript $[j]$ selects the factors with nonzero loadings for the $j$-th variable, and $\tau^{2}_{j}$ is the $j$-th diagonal element of $\tau^{2}$. Next we consider two data examples to illustrate the simplicity and stability of Squarem in accelerating the EM algorithm.

Real data example

We use the real data from Jöreskog [1967], as in Rubin and Thayer [1982] and Liu and Rubin [1998]. The data consist of 9 variables and 4 factors, with two patterns of a priori zeros for the loadings: zero loadings on factor 4 for variables 1 through 4, and zero loadings on factor 3 for variables 5 through 9. There are otherwise no restrictions. The $9 \times 9$ sample covariance matrix, cyy, is taken from Jöreskog [1967].

We use the starting values of $\beta$ and $\tau^2$ as in Liu and Rubin [1998]. The negative log likelihood is given by the function factor.loglik() below, which corresponds to the argument objfn in squarem().

R> factor.loglik <- function(param, cyy) {
+    beta.vec <- param[1:36]
+    beta.mat <- matrix(beta.vec, 4, 9)
+    tau2 <- param[37:45]
+    tau2.mat <- diag(tau2)
+    Sig <- tau2.mat + t(beta.mat) %*% beta.mat
+    loglik <- -145/2 * log(det(Sig)) - 145/2 * sum(diag(solve(Sig, cyy)))
+    return(-loglik)
+  }

One EM update is given by the function factor.em() below, which corresponds to the argument fixptfn in squarem().

R> factor.em <- function(param, cyy) {
+    param.new <- rep(NA, 45)
+    beta.vec <- param[1:36]
+    beta.mat <- matrix(beta.vec, 4, 9)
+    tau2 <- param[37:45]
+    tau2.mat <- diag(tau2)
+    inv.quantity <- solve(tau2.mat + t(beta.mat) %*% beta.mat)
+    small.delta <- inv.quantity %*% t(beta.mat)
+    big.delta <- diag(4) - beta.mat %*% inv.quantity %*% t(beta.mat)
+    cyy.inverse <- t(small.delta) %*% cyy %*% small.delta + big.delta
+    cyy.mat <- t(small.delta) %*% cyy
+    beta.new <- matrix(0, 4, 9)
+    beta.p1 <- solve(cyy.inverse[1:3, 1:3]) %*% cyy.mat[1:3, 1:4]
+    beta.p2 <- solve(cyy.inverse[c(1, 2, 4), c(1, 2, 4)]) %*%
+      cyy.mat[c(1, 2, 4), 5:9]
+    beta.new[1:3, 1:4] <- beta.p1
+    beta.new[c(1, 2, 4), 5:9] <- beta.p2
+    tau.p1 <- diag(cyy)[1:4] - diag(t(cyy.mat[1:3, 1:4])
+      %*% solve(cyy.inverse[1:3, 1:3])
+      %*% cyy.mat[1:3, 1:4])
+    tau.p2 <- diag(cyy)[5:9] - diag(t(cyy.mat[c(1, 2, 4), 5:9])
+      %*% solve(cyy.inverse[c(1, 2, 4), c(1, 2, 4)])
+      %*% cyy.mat[c(1, 2, 4), 5:9])
+    tau.new <- c(tau.p1, tau.p2)
+    param.new <- c(as.numeric(beta.new), tau.new)
+    param <- param.new
+    return(param.new)
+  }

In order to compare with the ECME algorithm as implemented by Liu and Rubin [1998], we also write the function factor.ecme() below to perform one ECME iteration. The only difference from the EM algorithm is in the M-step: after updating the loading matrix $\beta$, we find the $\tau^2$ that maximizes the actual constrained likelihood of the observed data matrix $Y$ given the updated $\beta$, using a Newton-Raphson step.

R> factor.ecme <- function(param, cyy) {
+    n <- 145
+    param.new <- rep(NA, 45)
+    beta.vec <- param[1:36]
+    beta.mat <- matrix(beta.vec, 4, 9)
+    tau2 <- param[37:45]
+    tau2.mat <- diag(tau2)
+    inv.quantity <- solve(tau2.mat + t(beta.mat) %*% beta.mat)
+    small.delta <- inv.quantity %*% t(beta.mat)
+    big.delta <- diag(4) - beta.mat %*% inv.quantity %*% t(beta.mat)
+    cyy.inverse <- t(small.delta) %*% cyy %*% small.delta + big.delta
+    cyy.mat <- t(small.delta) %*% cyy
+    beta.new <- matrix(0, 4, 9)
+    beta.p1 <- solve(cyy.inverse[1:3, 1:3]) %*% cyy.mat[1:3, 1:4]
+    beta.p2 <- solve(cyy.inverse[c(1, 2, 4), c(1, 2, 4)]) %*%
+      cyy.mat[c(1, 2, 4), 5:9]
+    beta.new[1:3, 1:4] <- beta.p1
+    beta.new[c(1, 2, 4), 5:9] <- beta.p2
+    A <- solve(tau2.mat + t(beta.new) %*% beta.new)
+    sum.B <- A %*% (n * cyy) %*% A
+    gradient <- - tau2/2 * (diag(n*A) - diag(sum.B))
+    hessian <- (0.5 * (tau2 %*% t(tau2))) * (A * (n * A - 2 * sum.B))
+    diag(hessian) <- diag(hessian) + gradient
+    U <- log(tau2)
+    U <- U - solve(hessian, gradient)
+    tau.new <- exp(U)
+    param.new <- c(as.numeric(beta.new), tau.new)
+    param <- param.new
+    return(param.new)
+  }

Next we use the R package SQUAREM to compute the MLE by the EM, Squarem, ECME, and Squared ECME algorithms. The tolerance is set to $10^{-8}$ across all algorithms.

  • EM

    In order to run the EM algorithm, we use the function fpiter() in the SQUAREM package. The arguments consist of a starting value, the function factor.em() that encodes one EM update, the negative log likelihood function factor.loglik(), other variables needed by these functions, and a control list to specify changes to default values. The starting values for $\beta$ and $\tau^2$ come from Liu and Rubin [1998].

    R> library("SQUAREM")
    R> system.time(f1 <- fpiter(par = param.start, cyy=cyy,
    +    fixptfn = factor.em, objfn = factor.loglik,
    +    control = list(tol = 10^(-8), maxiter = 20000)))
    
     user  system elapsed
    2.805   0.028   2.834
    
    R> f1$fpevals
    
    [1] 14659
    

    The EM algorithm takes 14659 iterations to converge, using 2.834 seconds.

  • ECME

    We replace the function factor.em() with factor.ecme(), which implements one ECME update, to run the ECME algorithm.

    R> system.time(f2 <- fpiter(par = param.start, cyy = cyy,
    +    fixptfn = factor.ecme, objfn = factor.loglik,
    +    control = list(tol = 10^(-8), maxiter = 20000)))
    
     user  system elapsed
    1.378   0.029   1.409
    
    R> f2$fpevals
    
    [1] 6408
    

    ECME takes 6408 iterations to converge, less than half of what EM needs. It also uses 1.409 seconds, approximately half of the time it takes the EM algorithm to converge.

  • Squarem

    Next, we use the function squarem() in the SQUAREM package to apply the Squarem algorithm to accelerate EM. The arguments are the same as in the function fpiter(), except for a few control parameters specific to the Squarem algorithm.

    R> system.time(f3 <- squarem(par = param.start, cyy = cyy,
    +    fixptfn = factor.em, objfn = factor.loglik,
    +    control = list(tol = 10^(-8))))
    
     user  system elapsed
    0.226   0.006   0.233
    
    R> f3$fpevals
    
    [1] 876
    

    Squarem takes only 876 EM updates to converge, which is faster by a factor of 17 and 7 in terms of the number of fixed point evaluations when compared to EM and ECME, respectively. Moreover, Squarem uses only 0.233 seconds to converge, 12 times faster than EM and 6 times faster than ECME.

  • Squared ECME

    Squarem algorithm can even be used to accelerate ECME, which is already a faster version of the EM algorithm. Let us call this Squared ECME.

    R> system.time(f4 <- squarem(par = param.start, cyy = cyy,
    +    fixptfn = factor.ecme, objfn = factor.loglik,
    +    control = list(tol = 10^(-8))))
    
     user  system elapsed
    0.111   0.005   0.117
    
    R> f4$fpevals
    
    [1] 400
    

    The Squared ECME converges in only 400 iterations, compared to 6408 iterations for ECME, and it takes only 0.11 seconds.

In order to accommodate the randomness of CPU time, we run the above four schemes 100 times and summarize the mean and standard deviation of CPU running time in Table 4, along with the number of fixed point evaluations needed.

EM ECME Squarem Squared ECME
CPU time 2.799(0.116) 1.382(0.046) 0.221(0.012) 0.107(0.005)
Number of EM iterations 14659 6408 876 400
Table 4: The comparison of EM, ECME, Squarem and Squared ECME algorithms on real data from Jöreskog [1967]. Note the CPU time is in seconds.

Table 4 demonstrates that the Squarem algorithm can greatly and easily improve on both the EM and ECME algorithms for the factor analysis problem.

Simulation example

In the simulation example, we generate 200 observations of 32 subject scores and assume that there are 4 latent factors. In this case, we do not impose any a priori zero loadings, for ease of comparison. The function used to generate the data and compare the performance of EM to Squarem is coded in simulate.FAEM(), accessible from the replication script; a rough sketch of such a data-generating step is given below.
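The following is a minimal sketch of how a dataset of this size could be simulated (hypothetical parameter values; this is not the authors' simulate.FAEM(), and factor.em() above would need its a priori zero pattern and hard-coded dimensions generalized before being applied to such data):

R> set.seed(1)
R> n <- 200; p <- 32; q <- 4
R> beta.true <- matrix(rnorm(q * p), q, p)        # true q x p loading matrix
R> tau2.true <- runif(p, 0.2, 1)                  # true uniquenesses
R> Z <- matrix(rnorm(n * q), n, q)                # latent factors
R> E <- matrix(rnorm(n * p), n, p) %*% diag(sqrt(tau2.true))
R> Y <- Z %*% beta.true + E                       # observed data, mean approximately zero
R> cyy <- crossprod(Y) / n                        # sample covariance matrix for the EM functions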

The results of comparison of CPU running time and the number of EM evaluations between EM algorithm and Squarem are summarized in Figure 2. It can be seen that Squarem performs consistently better than EM algorithm for both criteria, by a factor of at least 10 in most cases.

(a) The cumulative distribution function of CPU running time for the EM algorithm versus Squarem.
(b) The cumulative distribution function of the number of EM evaluations (log10 scale) for the EM algorithm versus Squarem.
(c) The scatter plot of CPU running time for the EM algorithm versus Squarem.
(d) The scatter plot of the number of EM evaluations for the EM algorithm versus Squarem.
Figure 2: The comparison of CPU running time and the number of EM evaluations between EM algorithm and Squarem.

5.2 Interval censoring

Interval censoring is a common phenomenon in survival analysis, where we do not observe the precise time of an event for each individual, but only know the time interval during which the individual's event occurs. Following the notation in Gentleman and Geyer [1994], we assume that the survival times (also known as failure times) come from a distribution $F$. Each individual goes through a sequence of inspection times. The survival time for individual $i$ is not observed; however, the last inspection time prior to the event, $L_i$, and the first inspection time after the event, $R_i$, are recorded. An example of interval censored data is displayed in Table 5.

Last inspection time prior to the event ($L_i$)  First inspection time after the event ($R_i$)
Individual 1  1  3
Individual 2  2  6
Individual $n$  3  4
Table 5: An example of interval censored data (unit: years).

Therefore, the data consist of a time interval $(L_i, R_i]$ for each individual, and the event for individual $i$ is known to happen during that interval. Let $s_0 = 0 < s_1 < \cdots < s_m$ be the unique ordered times in the set of all interval endpoints $\{L_i, R_i : i = 1, \ldots, n\}$, and let $\alpha_{ij}$, the $(i, j)$ cell of an $n \times m$ matrix, be such that $\alpha_{ij} = 1$ if $s_j \in (L_i, R_i]$ and $\alpha_{ij} = 0$ otherwise, and let $p_j = F(s_j) - F(s_{j-1})$, $j = 1, \ldots, m$. The log likelihood of the data is therefore
$$\ell(p) = \sum_{i=1}^{n} \log \Big( \sum_{j=1}^{m} \alpha_{ij}\, p_j \Big).$$
The negative log likelihood is coded in the function loglik(), corresponding to the argument objfn in squarem(). A in the function loglik() refers to the matrix of $\alpha_{ij}$ values and pvec is the vector of probabilities $p_j$.

R> loglik <- function(pvec, A) {
+    -sum(log(c(A %*% pvec)))
+  }

EM algorithm

For derivation of EM step, see Appendix A.3.

The EM update is
$$p_j^{(k+1)} = \frac{1}{n} \sum_{i=1}^{n} \frac{\alpha_{ij}\, p_j^{(k)}}{\sum_{l=1}^{m} \alpha_{il}\, p_l^{(k)}}, \qquad j = 1, \ldots, m,$$
where $p^{(k)}$ is the $k$-th iteration estimate. One such EM update is written in the function intEM(), corresponding to the argument fixptfn in squarem().

R> intEM <- function(pvec, A) {
+    tA <- t(A)
+    Ap <- pvec * tA
+    pnew <- colMeans(t(Ap)/colSums(Ap))
+    pnew * (pnew > 0)
+  }

EM-ICM algorithm

Wellner and Zhan [1997] developed a hybrid algorithm called EM-ICM for MLE computation with interval censored data. The algorithm alternates steps of the iterative convex minorant (ICM) algorithm and of EM. Wellner and Zhan [1997] showed that EM-ICM substantially improves the performance of the EM algorithm. We also compare Squarem with EM-ICM using a real data example and a simulated one. We use the function EMICM() in the R package interval, written by Fay and Shaw [2010], to implement the EM-ICM algorithm.

Real data example

The real data come from Finkelstein and Wolfe [1985], giving the interval during which cosmetic deterioration occurred in 46 individuals with early breast cancer under radiotherapy. Table 6 shows the censored interval for each individual.

(45,Inf] (6,10] (0,7] (46,Inf] (7,16] (17,Inf]
(7,14] (37,44] (0,8] (4,11] (15,Inf] (11,15]
(22,Inf] (46,Inf] (46,Inf] (25,37] (46, Inf] (26, 40]
(46, Inf] (27,34] (36,44] (46, Inf] (36, 48] (37, Inf]
(40, Inf] (17,25] (46, Inf] (11,18] (38, Inf] (5, 12]
(37, Inf] (0,5] (18, Inf] (24, Inf] (36, Inf] (5, 11]
(19, 35] (17,25] (24, Inf] (32, Inf] (33, Inf] (19,26]
(37,Inf] (34,Inf] (36, Inf] (46,Inf]
Table 6: The censored intervals when cosmetic deterioration occurred.

We use the function Aintmap() in the R package interval to produce the matrix A and then generate starting values.

R> library("interval")
R> A <- Aintmap(dat[, 1], dat[, 2])
R> m <- ncol(A)
R> pvec <- rep(1/m, length = m)

We modified the function EMICM() in the R package interval in order to use the same starting values across all algorithms, namely the uniform starting value $p_j^{(0)} = 1/m$, $j = 1, \ldots, m$.

Next, we compare the performance of the above three algorithms: EM, Squarem and EM-ICM. We did not include EM-ICM in the comparison of the number of EM evaluations because EM-ICM is intrinsically a hybrid algorithm in which each EM-ICM step is different from an EM evaluation. The tolerance for convergence is set at $10^{-8}$, the same across all algorithms.

  • EM algorithm

    R> system.time(ans1 <- fpiter(par = pvec, fixptfn = intEM,
    +    objfn = loglik, A = A, control = list(tol = 1e-8)))
    
     user  system elapsed
    0.008   0.000   0.009
    
    ans1$fpevals
    
    [1] 216
    
  • Squarem

    R> system.time(ans2 <- squarem(par = pvec, fixptfn = intEM,
    +    objfn = loglik, A = A, control = list(tol = 1e-8)))
    
     user  system elapsed
    0.002   0.000   0.002
    
    ans2$fpevals
    
    [1] 40
    
  • EM-ICM algorithm

    R> system.time(ans3 <- EMICM.mod(dat, EMstep = TRUE,
    +    ICMstep = TRUE, keepiter = FALSE,
    +    tol = 1e-08, maxiter = 1000))
    
     user  system elapsed
    0.025   0.001   0.027
    
R> max(abs(ans1$par - ans2$par))
[1] 0
R> max(abs(ans2$par - ans3$pf))
[1] 4.707805e-05

All three algorithms converge to the same point, as evidenced by the maximum absolute difference between the parameter estimates produced by the algorithms. The EM algorithm performs fairly well on this real dataset, perhaps due to the small sample size. Even so, Squarem still outperforms EM by a factor of 5 in terms of the number of EM evaluations and a factor of 4 in CPU running time. We show in the following section, using a simulated example, that the Squarem and EM-ICM algorithms become even more advantageous relative to the EM algorithm as the sample size increases.

Simulation example

For each individual, we randomly generate censored intervals by creating a survival time (event) and a stochastic sequence of inspection times. The left end of interval is the last inspection time before the event while the right end is the first inspection time after. The function we use to generate interval censored data is coded in gendata().

R> gendata <- function(n, mu.nexam = 5) {
+    foo <- matrix(NA, nrow = n, ncol = 3)
+    for(i in 1:n) {
+      st <- rweibull(1, shape = 1, scale = 5)
+      nexam <- rpois(1, mu.nexam)
+      exam <- round(runif(nexam, 0, 10), 1)
+      exam <- c(0, exam, Inf)
+      foo[i, ] <- c(time = st, L = max(exam[st > exam]),
+        R = min(exam[st <= exam]))}
+    return(foo)
+  }

First, let us try a moderate sample size, simulate 100 datasets, and compare the performance of the EM, Squarem and EM-ICM algorithms; a sketch of one replication is given below. The results are summarized in Figure 3.
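The following is a sketch of a single replication of this comparison, mirroring the calls used for the real data example above; the sample size is illustrative, not necessarily the value used in the paper:

R> dat <- gendata(500)[, 2:3]                     # keep only the (L, R] interval endpoints
R> A <- Aintmap(dat[, 1], dat[, 2])
R> pvec <- rep(1/ncol(A), ncol(A))
R> em.fit <- fpiter(par = pvec, fixptfn = intEM, objfn = loglik,
+    A = A, control = list(tol = 1e-8))
R> sq.fit <- squarem(par = pvec, fixptfn = intEM, objfn = loglik,
+    A = A, control = list(tol = 1e-8))
R> c(EM = em.fit$fpevals, Squarem = sq.fit$fpevals)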

Figure 3: The comparison of CPU running time among the EM, Squarem and EM-ICM algorithms, applied to 100 simulated datasets with a moderate sample size.

It can be seen from Figure 3 that the Squarem and EM-ICM algorithms are both, on average, approximately 10 times faster than EM for a moderate sample size. The performance of the two algorithms is comparable. Although Squarem has a lower median CPU running time, its distribution is more variable than that of the EM-ICM algorithm. In order to show the improvement in the number of EM evaluations for Squarem over the EM algorithm, we summarize the mean and standard deviation of the number of EM evaluations for both algorithms in Table 7 for this simulation study.

EM Squarem
Mean 6745 446
Standard deviation 4737 306
Table 7: The comparison of the mean and standard deviation of the number of EM evaluations between the EM algorithm and Squarem on the simulated data example with a moderate sample size.

On average, the EM algorithm takes 15 times more EM steps to converge than Squarem in this simulation study with a moderate sample size. Next, we increase the sample size and evaluate the performance of the EM, Squarem and EM-ICM algorithms again on 100 simulated interval censored datasets.

Figure 4: The comparison of CPU running time among the EM, Squarem and EM-ICM algorithms, applied to 100 simulated datasets with a large sample size.

As the sample size grows, Figure 4 shows that the advantage of the Squarem and EM-ICM algorithms becomes greater. Both algorithms on average converge at least 17 times faster than the EM algorithm. EM-ICM is specifically tailored to interval censoring maximum likelihood estimation, so it is not surprising that it outperforms the EM algorithm by a large margin. However, it is noteworthy that Squarem, a general purpose accelerator of EM-like algorithms, is very competitive with the EM-ICM algorithm in this simulation study. Table 8 again compares the number of EM evaluations between the EM algorithm and Squarem; Squarem on average outperforms the EM algorithm by a factor of 17 in terms of the number of EM evaluations.

EM Squarem
Mean 13397 792
Standard deviation 3947 262
Table 8: The comparison of the mean and standard deviation of the number of EM evaluations between the EM algorithm and Squarem on the simulated data example with a large sample size.

5.3 Genetics global ancestry estimation problem

Here we demonstrate the use of Squarem to solve an important problem in quantitative genetics that is notoriously computationally challenging. Suppose our study population is an admixed population with $K$ ancestral populations. The goal is to estimate, for each individual's entire genome, the proportion of ancestry from each contributing population, and simultaneously to estimate the allele frequencies of the ancestral populations. Let us use $q_{ik}$ to denote the admixture proportion for individual $i$, where $q_{ik}$ is the proportion of subject $i$'s genome attributed to ancestral population $k$, and $n$ is the number of subjects. Let Q be the $n \times K$ admixture proportion matrix. We assume that all genome-wide markers are bi-allelic (either allele 1 or allele 2). Let F be the population allele frequency matrix with $f_{jk}$ being the frequency of allele 1 at marker $j$ in population $k$. The matrices F and Q contain the parameters we are interested in estimating. The data consist of genetic polymorphism data sampled from $n$ diploid individuals. Specifically, we have recorded the genotype at $p$ genetic polymorphisms ("markers") for each individual. The genotype at marker $j$ for individual $i$ is represented as an allele 1 count, $x_{ij} \in \{0, 1, 2\}$. We assume that individuals are independent and, under the ADMIXTURE model, the log likelihood of the data is
$$\ell(F, Q) = \sum_{i=1}^{n} \sum_{j=1}^{p} \left[ x_{ij} \log \Big( \sum_{k=1}^{K} q_{ik} f_{jk} \Big) + (2 - x_{ij}) \log \Big( \sum_{k=1}^{K} q_{ik} (1 - f_{jk}) \Big) \right] + C,$$
where $C$ is a constant that does not contain parameters from F and Q. See Alexander et al. [2009] for a full description of the model.

The negative log likelihood of data is coded in function loglike(), corresponding to the argument objfn in squarem().

R> loglike <- function(param, X, K) {
+    n <- nrow(X)
+    p <- ncol(X)
+    F <- matrix(param[1: (p * K)], p, K)
+    Q <- matrix(param[(p * K + 1): (p * K + n * K)], n, K)
+    loglikelihood <- sum(X * log(Q %*% t(F)) + (2 - X) *
+      log(Q %*% (1 - t(F))))
+    return(-loglikelihood)
+  }

EM algorithm

For derivation of EM step, see Appendix A.4.

The EM update of the matrices F and Q is
$$f_{jk}^{(t+1)} = \frac{n^{1}_{jk}}{n^{0}_{jk} + n^{1}_{jk}}, \qquad q_{ik}^{(t+1)} = \frac{m_{ik}}{\sum_{l=1}^{K} m_{il}},$$
where the expected allele counts $n^{0}_{jk}$, $n^{1}_{jk}$ and the expected ancestry counts $m_{ik}$ are defined in Appendix A.4.

This one EM evaluation is written in the function admixture.em(), corresponding to the argument fixptfn in squarem(). We adapt the code provided on Peter Carbonetto's GitHub account [Carbonetto, 2016].

R> admixture.em <- function(param, X, K) {
+    eps <- 1e-6
+    n <- nrow(X)
+    p <- ncol(X)
+    m  <- matrix(eps, n, K)
+    n0 <- matrix(eps, p, K)
+    n1 <- matrix(eps, p, K)
+    F <- matrix(param[1: (p * K)], p, K)
+    Q <- matrix(param[(p * K + 1): (p * K + n * K)], n, K)
+    r <- array(0, dim = c(p, 4, K, K))
+    for (i in 1:n) {
+      colnames(r) <- c("00","01","10","11")
+        for (j in 1:K)
+          for (k in 1:K) {
+            r[,"00", j, k] <- (X[i, ] == 0) *
+              (1 - F[, j]) * (1 - F[, k])
+            r[,"01", j, k] <- (X[i, ] == 1) *
+              (1 - F[, j]) * F[, k]
+            r[,"10", j, k] <- (X[i, ] == 1) *
+              F[, j] * (1 - F[, k])
+            r[,"11", j, k] <- (X[i, ] == 2) *
+              F[, j] * F[, k]
+            r[, , j, k]     <- r[, ,j,k] *
+              Q[i, j] * Q[i, k]
+          }
+      dim(r) <- c(p, 4 * K^2)
+      r <- r / rowSums(r)
+      dim(r) <- c(p, 4, K, K)
+      colnames(r) <- c("00", "01", "10", "11")
+      m[i, ] <- m[i, ] + apply(r, 3, sum) + apply(r, 4, sum)
+      for (k in 1:K) {
+        n0[, k] <- n0[, k] + rowSums(drop(r[, "00", k, ])) +
+          rowSums(drop(r[, "01", k, ])) +
+          rowSums(drop(r[, "00", , k])) +
+          rowSums(drop(r[, "10", , k]))
+        n1[, k] <- n1[, k] + rowSums(drop(r[, "10", k, ])) +
+          rowSums(drop(r[, "11", k, ])) +
+          rowSums(drop(r[, "01", , k])) +
+          rowSums(drop(r[, "11", , k]))
+      }
+    }
+    F <- n1/(n0 + n1)
+    Q <- m/rowSums(m)
+    return(c(as.vector(F), as.vector(Q)))
+  }

Simulation example

We simulate an allele 1 count matrix with 150 individuals and 100 markers per individual. We use $K = 3$. The starting value of F is drawn randomly from a uniform distribution on $(0, 1)$, while that of Q is set to $1/K$ for every entry. We implement the EM algorithm and Squarem to compute the maximum likelihood estimates of the matrices F and Q and compare their performance. A sketch of how such a genotype matrix could be simulated is given below.
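The genotype matrix geno used below is loaded from a file (geno.sim.RData); as a hypothetical sketch (not the authors' simulation script), an allele-count matrix of this kind could be generated under the binomial ADMIXTURE model as follows:

R> set.seed(1)
R> n <- 150; p <- 100; K <- 3
R> F.true <- matrix(runif(p * K), p, K)           # allele-1 frequency at marker j in population k
R> Q.true <- matrix(rexp(n * K), n, K)
R> Q.true <- Q.true / rowSums(Q.true)             # admixture proportions sum to 1 for each individual
R> P <- Q.true %*% t(F.true)                      # expected allele-1 frequency for each (i, j)
R> geno <- matrix(rbinom(n * p, size = 2, prob = P), n, p)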

  • EM algorithm

    R> load("geno.sim.RData")
    R> set.seed(413)
    R> p <- 100
    R> n <- 150
    R> K <- 3
    R> F <- matrix(runif(p * K), p, K)
    R> Q <- matrix(1/K, n, K)
    R> param.start <- c(as.vector(F), as.vector(Q))
    R> system.time(f1 <- fpiter(par = param.start, fixptfn = admixture.em,
    +    objfn = loglike, control = list(tol = 1e-4), X = geno, K = 3))
    
      user   system elapsed
    197.094   5.817 203.040
    
    R> f1$fpevals
    
    [1] 1115
    
  • Squarem

    R> system.time(f2 <- squarem(par = param.start, fixptfn = admixture.em,
    +    objfn = loglike,
    +    control = list(tol = 1e-4, maxiter = 2000), X = geno, K = 3))
    
     user   system  elapsed
    47.460   1.403  48.869
    
    R> f2$fpevals
    
    [1] 270
    

In this example, Squarem outperforms the EM algorithm by a factor of 4 in terms of CPU running time as well as the number of EM evaluations. For large genetic datasets, the E-step is by far the most computationally intensive part of the algorithm. For a faster implementation of the E-step using C (interfaced to R using the .Call() function), see Carbonetto [2016]. Although the admixture problem is naturally framed using EM, its convergence is very slow. Alexander et al. [2009] implemented a faster solution to this problem (using a block relaxation optimization method), which has permitted application of ADMIXTURE to very large genetic datasets. Our EM-based implementation in R is much slower than the ADMIXTURE software, but it nevertheless serves to illustrate the benefits of Squarem in a difficult optimization problem from genetics.

5.4 MM algorithm - logistic regression maximum likelihood estimation

In this section, we discuss a quadratic majorization algorithm (an MM algorithm) for computing the maximum likelihood estimates of logistic regression coefficients. Minorize and maximize or equivalently, majorize and minimize (MM) algorithms typically exhibit slow linear convergence just like the EM algorithm. We show that Squarem can provide significant acceleration of MM algorithms.

The EM algorithm may be viewed as a special case of MM algorithms [Zhou and Zhang, 2012]. Majorization algorithms are widely applied, for example, in the work of De Leeuw [1994], Heiser [1995], and Lange et al. [2000], among others. Suppose we want to minimize a function $f(\theta)$ over $\Theta$. We construct a majorization function $g(\theta \mid \theta^{(k)})$ on $\Theta$ such that
$$g(\theta \mid \theta^{(k)}) \geq f(\theta) \;\; \text{for all } \theta \in \Theta, \qquad g(\theta^{(k)} \mid \theta^{(k)}) = f(\theta^{(k)}),$$
where $k$ denotes the $k$-th iteration. Therefore, instead of minimizing $f(\theta)$, we minimize the majorization function $g(\theta \mid \theta^{(k)})$, such that
$$\theta^{(k+1)} = \arg\min_{\theta \in \Theta} g(\theta \mid \theta^{(k)}).$$
We repeat the updates of $\theta^{(k)}$ until convergence, and this completes the majorization algorithm. Note that in the EM algorithm, the function $Q(\theta \mid \theta^{(k)})$ plays the role of the minorizing function.

Quadratic majorization algorithm

Taylor's theorem often leads to quadratic majorization algorithms [Böhning and Lindsay, 1988], where the majorization function is quadratic. By Taylor's theorem, expanding $f(\theta)$ at $\theta^{(k)}$,
$$f(\theta) = f(\theta^{(k)}) + \nabla f(\theta^{(k)})^{\top} (\theta - \theta^{(k)}) + \frac{1}{2} (\theta - \theta^{(k)})^{\top} \nabla^2 f(\tilde{\theta})\, (\theta - \theta^{(k)}),$$
where $\tilde{\theta}$ is on the line between $\theta$ and $\theta^{(k)}$. The majorization function is constructed by finding a matrix $B$ such that $B - \nabla^2 f(\theta)$ is always positive semi-definite regardless of $\theta$. So,
$$g(\theta \mid \theta^{(k)}) = f(\theta^{(k)}) + \nabla f(\theta^{(k)})^{\top} (\theta - \theta^{(k)}) + \frac{1}{2} (\theta - \theta^{(k)})^{\top} B\, (\theta - \theta^{(k)})$$
is a majorization function for $f(\theta)$. Let us define the variable $z^{(k)} = \theta^{(k)} - B^{-1} \nabla f(\theta^{(k)})$, so that the majorization function is equivalent, up to a constant not involving $\theta$, to
$$g(\theta \mid \theta^{(k)}) = \frac{1}{2} (\theta - z^{(k)})^{\top} B\, (\theta - z^{(k)}) + \text{constant}.$$
At the $k$-th iteration, minimizing $g(\theta \mid \theta^{(k)})$ over $\theta$ is therefore simply minimizing $(\theta - z^{(k)})^{\top} B\, (\theta - z^{(k)})$, and thus the majorization algorithm becomes
$$\theta^{(k+1)} = z^{(k)} = \theta^{(k)} - B^{-1} \nabla f(\theta^{(k)}).$$

Therefore, in order to implement a quadratic majorization algorithm, we need to construct the matrix $B$ and compute the gradient of the function $f$. Let us consider logistic regression maximum likelihood estimation. Suppose we have an $n \times q$ design matrix $Z$, where there are $n$ subjects and $q$ predictors (including the intercept). Let $y_i$ be the binary response for subject $i$. We use $\beta$ to denote the regression coefficients. The goal is to derive the maximum likelihood estimate of $\beta$.

The negative log likelihood of the data is
$$f(\beta) = -\sum_{i=1}^{n} \left[ y_i\, z_i^{\top} \beta + \log(1 - \pi_i) \right],$$
where
$$\pi_i = \frac{1}{1 + \exp(-z_i^{\top} \beta)}$$
and $z_i^{\top}$ is the $i$-th row of the design matrix $Z$. The gradient of $f$ is
$$\nabla f(\beta) = Z^{\top} (\pi - y),$$
where the $i$-th element of $\pi$ is $\pi_i$, while the second derivative of $f$ is
$$\nabla^2 f(\beta) = Z^{\top} W Z,$$
where $W$ is a diagonal matrix with $i$-th diagonal element $\pi_i (1 - \pi_i)$. Based on the fact that $\pi_i (1 - \pi_i) \leq 1/4$, the matrix $B$ can be constructed as $B = \frac{1}{4} Z^{\top} Z$. Thus the quadratic algorithm becomes
$$\beta^{(k+1)} = \beta^{(k)} - 4\, (Z^{\top} Z)^{-1} Z^{\top} \big(\pi^{(k)} - y\big).$$
Let us call the above algorithm the uniform bound quadratic algorithm, since $\pi_i (1 - \pi_i) \leq 1/4$ uniformly for any $\beta$ and each subject $i$. Jaakkola and Jordan [2000] and Groenen et al. [2003] developed a non-uniform bound, $B(\beta^{(k)}) = Z^{\top} \tilde{W}^{(k)} Z$, where $\tilde{W}^{(k)}$ is a diagonal matrix with elements $(2\pi_i^{(k)} - 1) / (2\, z_i^{\top} \beta^{(k)})$, $i = 1, \ldots, n$. Thus the non-uniform bound quadratic algorithm becomes
$$\beta^{(k+1)} = \beta^{(k)} - \left(Z^{\top} \tilde{W}^{(k)} Z\right)^{-1} Z^{\top} \big(\pi^{(k)} - y\big).$$

Real data example

We use the cancer remission data in Lee [1974]. Column 1 (V1) is the intercept, variables V2 through V7 are the results of six medical tests, and the outcome (V8) is a binary indicator of whether cancer remission occurred for the subject. The first five lines of the data are as follows:

R> ld <- read.table("lee_data.txt")
R> head(ld, 5)
V1  V2   V3   V4  V5    V6    V7 V8
 1 0.8 0.83 0.66 1.9 1.100 0.996  1
 1 0.9 0.36 0.32 1.4 0.740 0.992  1
 1 0.8 0.88 0.70 0.8 0.176 0.982  0
 1 1.0 0.87 0.87 0.7 1.053 0.986  0
 1 0.9 0.75 0.68 1.3 0.519 0.980  1

The negative log likelihood function is coded in binom.loglike(), corresponding to the argument objfn in squarem().

R> binom.loglike <- function(par, Z, y) {
+    zb <- c(Z %*% par)
+    pib <- 1 / (1 + exp(- zb))
+    return(as.numeric(-t(y) %*% (Z %*% par) - sum(log(1 - pib))))
+  }

The uniform bound quadratic majorization algorithm update and the non-uniform one are written in the functions qmub.update() and qmvb.update(), respectively, corresponding to the argument fixptfn in squarem().

R> qmub.update <- function(par, Z, y) {
+    Zmat <- solve(crossprod(Z)) %*% t(Z)
+    zb <- c(Z %*% par)
+    pib <- 1 / (1 + exp(-zb))
+    ub <-  pib - y
+    par <- par - 4 * c(Zmat %*% ub)
+    par
+  }
R> qmvb.update <- function(par, Z, y) {
+    zb <- c(Z %*% par)
+    pib <- 1 / (1 + exp(-zb))
+    wmat <- diag((2 * pib - 1)/(2 * zb))
+    ub <-  pib - y
+    Zmat <- solve(t(Z) %*% wmat %*% Z) %*% t(Z)
+    par <- par - c(Zmat %*% ub)
+    par
+  }

Now let us apply these two quadratic majorization algorithms and their Squared versions, in which we apply the squared iterative scheme, and compare their performance. We use the default tolerance, and the starting value is a vector of 10s of length 7 (one per regression coefficient).

  • Uniform bound quadratic majorization algorithm

    R> library("SQUAREM")
    R> Z <- as.matrix(ld[, 1:7])
    R> y <- ld[, 8]
    R> p0 <- rep(10, 7)
    R> system.time(ans1 <- fpiter(par = p0, fixptfn = qmub.update,
    +    objfn = binom.loglike,
    +    control = list(maxiter = 20000), Z = Z, y = y))
    
     user  system  elapsed
    0.051   0.003   0.055
    
    R> ans1
    
    $par
    [1]  58.0384838  24.6615508  19.2935824 -19.6012695   3.8959635
    [6]  0.1510923  -87.4339059
    
    $value.objfn
    [1] 10.87533
    
    $fpevals
    [1] 1127
    
    $objfevals
    [1] 0
    
    $convergence
    [1] TRUE
    
  • Squared uniform bound quadratic majorization algorithm

    R> system.time(ans2 <- squarem(par = p0, fixptfn = qmub.update,
    +    objfn = binom.loglike, Z = Z, y = y))
    
     user  system  elapsed
    0.011   0.000   0.011
    
    R> ans2
    
    $par
    [1]  58.0384863  24.6615466  19.2935777 -19.6012645   3.8959634
    [6]  0.1510923  -87.4339043
    
    $value.objfn
    [1] 10.87533
    
    $iter
    [1] 41
    
    $fpevals
    [1] 118
    
    $objfevals
    [1] 43
    
    $convergence
    [1] TRUE
    
  • Non-uniform bound quadratic majorization algorithm

    R> system.time(ans3 <- fpiter(par = p0, fixptfn = qmvb.update,
    +    objfn = binom.loglike,
    +    control = list(maxiter = 20000), Z = Z, y = y))
    
     user  system  elapsed
    0.029   0.001   0.030
    
    R> ans3
    
    $par
    [1]  58.0384866  24.6615451  19.2935760 -19.6012627   3.8959634
    [6]  0.1510923  -87.4339030
    
    $value.objfn
    [1] 10.87533
    
    $fpevals
    [1] 442
    
    $objfevals
    [1] 0
    
    $convergence
    [1] TRUE
    
  • Squared non-uniform bound quadratic majorization algorithm

    R> system.time(ans4 <- squarem(par = p0, fixptfn = qmvb.update,
    +    objfn = binom.loglike, Z = Z, y = y))
    
     user  system  elapsed
    0.009   0.000   0.009
    
    R> ans4
    
    $par
    [1]  58.0384868  24.6615443  19.2935751 -19.6012618   3.8959633
    [6]  0.1510923  -87.4339024
    
    $value.objfn
    [1] 10.87533
    
    $iter
    [1] 30
    
    $fpevals
    [1] 88
    
    $objfevals
    [1] 30
    
    $convergence
    [1] TRUE
    

All four algorithms converge to the same maximum likelihood estimates but Squarem improves on both uniform and non-uniform bound quadratic majorization algorithms in terms of the number of quadratic majorization updates and CPU running time (in seconds). For uniform bound, Squarem converges around 5 times faster and saves the number of quadratic majorization updates by a factor of 10. The non-uniform bound quadratic majorization improves on the uniform bound one, but its Squared version provides further acceleration. Compared to non-uniform bound algorithm, Squarem shortens the computing time by a factor of 3 and cuts the number of quadratic majorization updates by a factor of 5.

We randomly generate 500 starting values, with each component drawn uniformly at random. We then summarize the performance of these four algorithms in terms of CPU running time and the number of QM (quadratic majorization) updates in Figure 5 and Figure 6.

Figure 5: The comparison among basic and Squared uniform bound quadratic majorization algorithms, and basic and Squared non-uniform bound quadratic majorization algorithms in terms of CPU running time (in seconds).
Figure 6: The comparison among basic and Squared uniform bound quadratic majorization algorithms, and basic and Squared non-uniform bound quadratic majorization algorithms in terms of the number of quadratic majorization(QM) updates.

Figure 5 and Figure 6 clearly show that Squarem provides substantial acceleration for both the uniform bound and non-uniform bound quadratic majorization algorithms. Table 9 shows that, across the different starting values, the Squared version of the uniform bound QM algorithm on average reduces the number of QM updates by a factor of 7 and runs 5 times faster than the basic version. Compared to the non-uniform bound QM algorithm, itself already faster than the uniform bound one, the Squared version further improves by a factor of 3 in CPU running time and a factor of 5 in the number of QM updates.

Ub QM Squared Ub QM Non-Ub QM Squared Non-Ub QM
CPU time(mean) 0.079 0.017 0.021 0.006
CPU time(sd) 0.006 0.005 0.003 0.001
QM updates(mean) 2068 273 419 80
QM updates(sd) 22 81 19 15
Table 9: The comparison among basic and Squared uniform bound quadratic majorization algorithms, and basic and Squared non-uniform bound quadratic majorization algorithms in terms of the mean and standard deviation of CPU running time and the number of quadratic majorization (QM) updates for cancer remission data with 500 randomly selected starting values.

6 Discussion

Since the seminal work of Dempster et al. [1977], EM and its variants have become the workhorse of computational statistics. More broadly, there are iterative algorithms which do not fit into the missing-data framework, but which are EM-like in the sense that they exhibit slow, monotone, global convergence like the EM algorithm. These include the minorize and maximize (MM) algorithm. Even more broadly, we can include all fixed-point iterations which are contractive [Ortega and Rheinboldt, 1970b] and linearly convergent. They are broader in the sense that there need not be an objective function (e.g., log-likelihood) associated with the contraction mapping. The remarkable fact is that it is extremely easy to use Squarem to try and accelerate these iterative algorithms. All that the user has to do is to create a function, fixptfn(), that implements a single step of the fixed-point iteration, whether it is EM/ECM/ECME/GEM/MM or any other contractive mapping. The objective function, objfn(), is optional, although we recommend that it be provided, if it is easy to code. The convergence criterion used in function squarem() is stringent for high-dimensional problems, and in future versions, we will incorporate other parameter based criteria and criteria based on the objective function. In addition, there are other features of Squarem not illustrated in this paper, including the options of higher-order Squarem schemes and the tracking of algorithm’s progress. For full features of Squarem, see https://cran.r-project.org/web/packages/SQUAREM/SQUAREM.pdf.

The main theme of the paper is that existing modeling problems based on EM-like algorithms can potentially be made computationally more efficient by using the convergence acceleration provided by Squarem. This is particularly true in high-dimensional problems where EM-like algorithms can be excruciatingly slow. There are several examples in the literature where Squarem has been effectively used. To name a few, Matthew Stephens lab at University of Chicago has been using SQUAREM in many of their models pertaining to genetic studies where a very large number of parameters are estimated [Shiraishi et al., 2015; Raj et al., 2015, among others]. Patro et al. [2014] incorporated Squarem in the development of their new computational method, "Sailfish", to substantially increase the efficiency of processing sequencing reads. More recently, Chiou et al. [2017] have used SQUAREM to speed up the estimation in a semi-parametric model for panel recurrent event count data. There are more such examples.

Squarem is computationally efficient. It requires little effort beyond that of the basic fixed-point iteration. The computation of the Squarem parameter update is trivially easy, since it only requires a couple of vector products. The only place where some inefficiency could occur is in the evaluation of the objective function to assess whether the Squarem step can be accepted. When the Squarem step results in non-monotonicity, it is rejected and the most recent EM update is retained instead. This results in some wasted effort. This is typical for algorithms with fast local convergence, for example Newton's method in unconstrained optimization, which needs to be safeguarded with a line-search or trust-region approach to ensure global convergence. However, Squarem is efficient when the objective function evaluation is relatively cheap compared to the fixed-point iteration, which is often the case.

Squarem can be used off-the-shelf since there is no need for the user to tweak any control parameters to optimize its performance. Given its ease of application, Squarem may be considered as a default accelerator for EM-like algorithms. We invite the reader to try Squarem acceleration on his/her own EM-like algorithm or any slowly convergent, contractive fixed-point iteration.

Appendix A Derivation of EM

A.1 Poisson mixtures

Let us define the missing indicator variable $z$ for each observation, such that $z = 1$ if the observation comes from population 1 and $z = 0$ if it comes from population 2. Let $P(z = 1) = p$. Given $z = 1$, the death count comes from population 1, following a Poisson distribution with mean $\mu_1$; otherwise, the death count comes from population 2, following a Poisson distribution with mean $\mu_2$.

  • E-step

    The expected complete data log likelihood is
    $$Q(p, \mu_1, \mu_2 \mid p^{(k)}, \mu_1^{(k)}, \mu_2^{(k)}) = \sum_{i} y_i \left[ z_i^{(k)} \left( \log p - \mu_1 + i \log \mu_1 \right) + \left(1 - z_i^{(k)}\right) \left( \log (1 - p) - \mu_2 + i \log \mu_2 \right) \right] + \text{constant},$$
    where
    $$z_i^{(k)} = \frac{p^{(k)} e^{-\mu_1^{(k)}} \big(\mu_1^{(k)}\big)^{i}}{p^{(k)} e^{-\mu_1^{(k)}} \big(\mu_1^{(k)}\big)^{i} + \big(1 - p^{(k)}\big) e^{-\mu_2^{(k)}} \big(\mu_2^{(k)}\big)^{i}}$$
    is the conditional probability that an observation with death count $i$ comes from population 1.

  • M-step

    We take the derivatives of the function $Q$ with respect to $p$, $\mu_1$ and $\mu_2$ and set them to zero in order to derive the estimates at the $(k+1)$-th iteration.

    Setting $\partial Q / \partial p = 0$ gives
    $$p^{(k+1)} = \frac{\sum_i y_i z_i^{(k)}}{\sum_i y_i}.$$

    Setting $\partial Q / \partial \mu_1 = 0$ gives
    $$\mu_1^{(k+1)} = \frac{\sum_i i\, y_i z_i^{(k)}}{\sum_i y_i z_i^{(k)}}.$$

    Similarly we can derive that
    $$\mu_2^{(k+1)} = \frac{\sum_i i\, y_i \left(1 - z_i^{(k)}\right)}{\sum_i y_i \left(1 - z_i^{(k)}\right)}.$$

A.2 Factor analysis

  • E-step