I Introduction
The staggering growth in the size and complexity of datasets used in statistics, control, and machine learning has inevitably spurred interest in optimization algorithms that efficiently solve the problems arising in these settings. For instance, a fundamental problem in unsupervised learning is that of
clustering, where the task in question is to group certain objects of interest into subgroups called clusters, such that all objects in a particular cluster share features (in some predefined sense) with each other, but not with objects in other clusters [1, 2]. The Expectation-Maximization (EM) algorithm [3] is one of the most commonly used methods in distribution-based clustering analysis and density estimation [4, 5]. Finite mixture models
consist of distributions in which each sample in the dataset originates from a distribution randomly selected from a finite collection, each characterized by a finite number of parameters. These parameters, when lumped together, characterize the overall sampling distribution, and are often estimated through the EM algorithm.
Gaussian mixture models
(GMMs) consist of finite mixture models of multivariate Gaussian distributions, with the parameters being the means, covariance matrices, and distribution probabilities. In this framework,
GMM clustering consists of estimating the parameters in a GMM that maximize its likelihood function (iteratively maximized through the EM algorithm), followed by assigning to each data point the “cluster” corresponding to its most likely multivariate Gaussian distribution in the GMM.

A current trend at the intersection of optimization, machine learning, and control is to leverage ideas from dynamical systems to gain insight into iterative optimization algorithms [6, 7, 8, 9, 10]
. The key idea is to view the iterations as the evolution of a state vector over time, with the initial state representing the conditions over which the algorithm is initialized. Then, the rule to construct each subsequent iteration is modeled as a function that dictates the dynamics of the state-space dynamical system which represents the algorithm.
We can use general point-to-set notions of convergence of optimization algorithms, such as Zangwill’s convergence theorem [11], to show the local convergence of the EM algorithm. The convergence of the EM algorithm is well studied in the literature, particularly for GMMs; detailed analyses can be found in [12], [13], and [14]. In [15], we proposed to change the perspective on local optimizers and convergence of the EM algorithm by assessing, respectively, the equilibria and asymptotic stability (in the sense of Lyapunov) of a nonlinear dynamical system that represents the standard EM algorithm, through explicit use of Lyapunov stability theory.
In this paper, we propose borrowing the notions of integral quadratic constraints (IQCs) from robust control theory [16] as a tool for analysis of generalized EM (GEM) algorithms and EM-like algorithms. We will see how IQCs provide us with an elegant framework to analyze system blocks which are noisy, and sometimes impossible to model, simply using information about constraints that link inputs and outputs of operators. Additionally, IQCs allow us to derive rates of convergence of optimization algorithms applied on not only general strongly convex functions, but also functions that evolve adversarially or stochastically over time, since all we need is for the functions to satisfy a suitable IQC. For strongly convex functions, we can readily derive these IQCs, which, in turn, allow us to analyze first-order optimization methods using small semidefinite programs [7, 9]. As a consequence, we build upon these results and propose to use IQCs for strongly convex functions to obtain tractable semidefinite programs that allow us to recover comparable rates of convergence to those of the standard EM algorithm for GMMs [14, 17].
The main contributions of this manuscript are as follows. We start by showing how we can cast gradient-based EM-like algorithms as LTI systems with a feedback nonlinearity. We then investigate the absolute stability of systems of this form using IQCs. Following this, we craft a strongly convex function that allows us to design a novel EM-like algorithm for GMMs. Finally, combining all these ingredients, we establish bounds on the convergence rates of a GEM algorithm, and the aforementioned novel EM-like algorithm, which generalizes bounds found in the literature for the EM on GMMs.
II Problem Statement
In this section, we review the Expectation-Maximization (EM) and generalized EM (GEM) algorithms and provide a statement for the problem considered in this paper.
Let be some vector of unknown (but deterministic) parameters characterizing a distribution of interest, which we seek to infer from a collected dataset (from now on assumed fixed) and a statistical model composed by a family of joint probability density or mass functions (possibly mixed) indexed by , where
(1)
is some latent (hidden) random vector. The EM algorithm (Algorithm 1) seeks to find a local maximizer of the incomplete likelihood function given by
(2)
The mapping is, naturally, referred to as the complete likelihood function. To optimize , the EM algorithm alternates at each iteration between two steps. First, in the expectation step (E-step), we compute , defined through
(3a)
(3b) |
so that denotes the expected value of the complete log-likelihood function with respect to . Second, in the maximization step (M-step), we maximize and update the current estimate as that maximizer.
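To make the two steps concrete, consider their familiar closed forms for a Gaussian mixture model (the model studied later in this paper). The following numpy sketch is illustrative only, with all names (`em_gmm_step`, `w`, `mu`, `Sigma`, `r`) our own:

```python
import numpy as np

def em_gmm_step(X, w, mu, Sigma):
    """One EM iteration (E-step + M-step) for a Gaussian mixture model.

    X: (n, d) data; w: (K,) mixing weights; mu: (K, d) means;
    Sigma: (K, d, d) covariances.  All names are illustrative.
    """
    n, d = X.shape
    K = len(w)
    # E-step: responsibilities r[i, k] = P(component k | x_i, current params).
    r = np.zeros((n, K))
    for k in range(K):
        diff = X - mu[k]
        inv = np.linalg.inv(Sigma[k])
        quad = np.einsum('ij,jk,ik->i', diff, inv, diff)
        norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma[k]))
        r[:, k] = w[k] * np.exp(-0.5 * quad) / norm
    r /= r.sum(axis=1, keepdims=True)
    # M-step: closed-form maximizers of the expected complete log-likelihood.
    Nk = r.sum(axis=0)
    w_new = Nk / n
    mu_new = (r.T @ X) / Nk[:, None]
    Sigma_new = np.zeros_like(Sigma)
    for k in range(K):
        diff = X - mu_new[k]
        Sigma_new[k] = (r[:, k, None] * diff).T @ diff / Nk[k]
    return w_new, mu_new, Sigma_new
```

For GMMs the M-step maximizer is available in closed form, which is what the update above exploits; for general models the M-step requires an inner numerical maximization.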
Before formally stating the EM algorithm, let us make some mild simplifying assumptions that will avoid pathological behavior of the Q-function.
Assumption 1.
does not depend on and has positive Lebesgue measure.
Assumption 2.
is twice continuously differentiable in .
Notice that, from Assumption 1, the conditional distribution is well defined in , since for every . Finally, we make the following simplifying assumption, which makes the M-step well defined.
Assumption 3.
has a unique global maximizer in .
With all these ingredients and assumptions, we summarize the EM algorithm in Algorithm 1.
Any variant of Algorithm 1 that does not explicitly maximize at the M-step (e.g., the -EM algorithm we introduced in [15]) is referred to as a generalized EM (GEM) algorithm. In other words, a GEM is any algorithm that produces a sequence of iterations from an initial state such that for every satisfying .
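A minimal sketch of the defining GEM condition, namely that each iterate must not decrease the Q-function at the current estimate. Both `Q` and `grad_Q` below are illustrative stand-ins (a toy concave surrogate), not the paper's construction:

```python
def gem_step(Q, grad_Q, theta, eta=0.5):
    """One generalized-EM (GEM) step: instead of maximizing Q(. | theta)
    exactly, take a gradient-ascent step and backtrack until the defining
    GEM condition Q(theta_new | theta) >= Q(theta | theta) holds.
    Q and grad_Q are illustrative stand-ins for the E-step surrogate."""
    base = Q(theta, theta)
    step = eta
    while True:
        theta_new = theta + step * grad_Q(theta, theta)
        if Q(theta_new, theta) >= base:
            return theta_new
        step /= 2  # backtrack; terminates for differentiable Q

# Illustrative surrogate: a concave quadratic standing in for Q(. | theta).
Q = lambda t, t0: -(t - 3.0) ** 2
grad_Q = lambda t, t0: -2.0 * (t - 3.0)
theta = 0.0
for _ in range(20):
    theta = gem_step(Q, grad_Q, theta)  # each step satisfies the GEM condition
```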
Given the above framework, the problem that we wish to address in this paper is as follows.
Problem 1.
(i) Construct a linear first-order numerical optimization scheme that approximates the EM algorithm (Algorithm 1) in the form of a GEM algorithm or EM-like algorithm. (ii) Derive explicit bounds for the convergence rate of the previously constructed GEM or EM-like algorithms through the use of integral quadratic constraints and other results borrowed from robust control theory.
III Analysis of Gradient-Based EM-like Algorithms via IQCs
In this section, we provide an overview of dynamical systems perspectives on iterative optimization algorithms, with a particular focus on gradient-based EM-like and GEM algorithms. We then review results on IQCs, which provide insights that allow us to design a novel EM-like algorithm. Finally, we establish bounds on the convergence rates of the studied algorithms.
III-A A Dynamical Systems Perspective on Iterative Optimization Algorithms
Consider the unconstrained optimization problem
(4) |
of an arbitrary nonlinear scalar function with , assumed to be continuously differentiable and strongly convex with Lipschitz continuous gradient (e.g., for maximum likelihood estimation). By far the most commonly used algorithms for such problems are standard first-order methods such as the gradient descent method [18], the heavy-ball method [19], and Nesterov’s accelerated method [20]. It is shown in [7] and [9] that all the aforementioned first-order methods can be represented by a discrete-time, linear time-invariant (LTI) system in state space, connected in feedback through a nonlinearity , by setting as the gradient mapping.
For instance, the standard gradient descent algorithm with constant step size , given by
(5) |
can be represented as the dynamical system
(6a)
(6b)
(6c) |
where and denote, respectively, the input and output vector signals of the underlying LTI subsystem given by (6a)–(6b). To achieve this representation, it suffices to choose
(7) |
as the underlying LTI subsystem, as the feedback nonlinearity, and to identify . We can perform the same representation for both the heavy-ball method and Nesterov’s accelerated method, with the only change being in the state matrices . It is also important to ensure that the feedback interconnection in (6) is well posed. This is guaranteed by setting the feedforward matrix to zero in the representation of each of the dynamical systems, so that the output does not depend directly on the feedback input.
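As a sanity check of this representation, the following sketch simulates the LTI system (6), taking the identity, minus the step size times the identity, and the identity for the three state matrices, in feedback with the gradient of an illustrative quadratic, and confirms that the trajectory coincides with plain gradient descent (5):

```python
import numpy as np

# Gradient descent with step size alpha, written as the LTI system (6):
#   x_{k+1} = A x_k + B u_k,  y_k = C x_k,  u_k = grad_f(y_k),
# with A = I, B = -alpha*I, C = I (and zero feedforward, so the loop is
# well posed).  The quadratic objective below is illustrative.
alpha = 0.1
A, B, C = np.eye(2), -alpha * np.eye(2), np.eye(2)
Q = np.diag([1.0, 4.0])          # f(y) = 0.5 * y^T Q y
grad_f = lambda y: Q @ y

x_lti = np.array([1.0, 1.0])     # state of the feedback representation
x_gd = x_lti.copy()              # plain gradient-descent iterate
for _ in range(20):
    y = C @ x_lti                # output of the LTI block
    u = grad_f(y)                # feedback nonlinearity
    x_lti = A @ x_lti + B @ u
    x_gd = x_gd - alpha * grad_f(x_gd)
```

The two trajectories coincide step by step; the heavy-ball and Nesterov variants only change the state matrices, as noted above.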
More generally, any given iterative optimization algorithm of the form
(8) |
can be represented as a time-varying nonlinear dynamical system in state space,
(9a)
(9b) |
by identifying .
III-B Linear First-Order GEM and EM-like Algorithms
From the previously discussed perspective, limit points and convergence of algorithm (8) correspond, respectively, to equilibria and asymptotic stability (in the sense of Lyapunov) of its dynamical system representation (9). This perspective was explicitly pursued in [15] for Algorithm 1, which reduces to the time-invariant system
(10) |
with
(11) |
under the framework and assumptions of Section II.
According to [21], for most distributions , the iterations of the EM algorithm can be stated as
(12) |
for a suitable transformation matrix . For instance, an explicit expression for the transformation matrix for GMMs was first given in [14], and later extended to models in the exponential family [22].
Before we proceed, we will make one last assumption, which holds for distributions belonging to any mixture model of the exponential family [21].
Assumption 4.
has a unique stationary point in , for every .
Remark 1.
The transformation matrix also provides us with valuable insights regarding the rate of convergence of the EM algorithm. Indeed, differentiating the equation
(14) |
which generalizes (12), we have
(15) |
where with ,
(16) |
and denotes the Hessian matrix of . Therefore, near a stationary point of over which is bounded, we have
(17) |
As a consequence, it follows that, under the aforementioned conditions, the EM algorithm exhibits superlinear convergence when
approaches zero. In particular, the nature of convergence is dictated by the eigenvalues of the matrix
. If the eigenvalues are near zero, then the transformation matrix scales the EM update step by approximately the negative inverse Hessian, and the EM algorithm behaves as Newton’s method. On the other hand, if the eigenvalues are near unity (in absolute value), then the EM algorithm exhibits first-order convergence.

Let us now explicitly consider a GMM, described as
(18) |
with , , positive definite, and such that . In other words, a sample from this distribution originates from randomly choosing among multivariate Gaussian distributions (), with respective probabilities , and then sampling from the chosen distribution. The vector of unknown parameters lumps together the scalar parameters within for . Consequently, we have
(19) |
for , with and . Let be a matrix whose columns constitute a basis for the space . Notice then that the EM algorithm can be represented as a gradient ascent with fixed unit step size,
(20) |
on the projected space , where , so that [14].
Recall that, for a convergent sequence , its linear convergence rate is defined as
(21) |
where denotes the limit of the sequence. Let denote a local maximizer of . Now, given that the matrix is the projection of the bilinear matrix product onto the set , EM demonstrates a rate of convergence
(22) |
where and
denote, respectively, the smallest and largest singular value of the matrix
[14], , and . Note that, even though this result has been proved for GMMs, it can be readily extended to any model of the exponential family for which Assumptions 3 and 4 hold [21]. In addition, the reader is also directed to [23] for a learning-theoretic analysis of the convergence properties of gradient EM with a general number of clusters and mixing coefficients [17].

To conclude this discussion, we introduce two algorithms: the first is an EM-like linear first-order algorithm based on a specially crafted strongly convex function, and the second is a GEM algorithm based on a gradient ascent of the Q-function.
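Before turning to these algorithms, note that the linear rate (21) can be estimated numerically from the tail of an iterate sequence. A small sketch on an illustrative quadratic (gradient descent with step size 1/L and curvatures m and L, where the expected rate is 1 - m/L):

```python
import numpy as np

def linear_rate(seq, x_star):
    """Estimate the linear convergence rate (21) of a sequence from the
    ratio of successive distances to its limit x_star."""
    errs = [np.linalg.norm(x - x_star) for x in seq]
    ratios = [e1 / e0 for e0, e1 in zip(errs, errs[1:]) if e0 > 1e-12]
    return ratios[-1]  # late ratios approximate the asymptotic rate

# Illustrative instance: gradient descent with step 1/L on a quadratic with
# curvatures m = 1 and L = 10; the observed rate should approach 1 - m/L.
m, L = 1.0, 10.0
Qm = np.diag([m, L])
x, seq = np.array([1.0, 1.0]), []
for _ in range(60):
    x = x - (1.0 / L) * (Qm @ x)
    seq.append(x.copy())
rate = linear_rate(seq, np.zeros(2))
print(rate)  # close to 1 - m/L = 0.9
```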
III-B1 An EM-like Algorithm
III-B2 A GEM Algorithm
III-C Optimization of Quadratic Forms
Consider a given strongly convex quadratic function
(26) |
where for some constants . We now overview a dynamical system representation of linear first-order numerical schemes on (26), of the form (6).
We start by setting so that the feedback is well posed. Since we are interested in representing linear first-order optimization methods, we set . It is straightforward to see that , and thus the only stationary point is . Therefore, we have , and, enforcing that is an equilibrium of the system, we have
(27a)
(27b) |
since . Equipped with this realization, we can rewrite the state equation as
(28) |
and thus,
(29) |
where is defined as the closed-loop state transition matrix.
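As a quick illustration of this closed-loop view, for gradient descent on an illustrative quadratic whose curvature matrix lies between m times the identity and L times the identity, the closed-loop transition matrix is the identity minus the step size times the curvature matrix, and its spectral radius recovers the familiar rate max(|1 - alpha*m|, |1 - alpha*L|). All numerical values below are illustrative:

```python
import numpy as np

# For the strongly convex quadratic f(y) = 0.5 (y - y_star)^T Qf (y - y_star)
# with m I <= Qf <= L I, gradient descent x+ = x - alpha * grad f(x) has the
# closed-loop state-transition matrix T = I - alpha * Qf, whose spectral
# radius governs the convergence rate of (29).
m, L, alpha = 1.0, 10.0, 0.1
Qf = np.diag([m, 3.0, L])
T = np.eye(3) - alpha * Qf
rho = max(abs(np.linalg.eigvals(T)))
print(rho)  # equals max(|1 - alpha*m|, |1 - alpha*L|) = 0.9
```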
III-D Integral Quadratic Constraints
The problem of connecting a linear dynamical system in feedback with a nonlinearity is one of the most studied problems in nonlinear stability theory, and its stability analysis is known as the Lur’e problem [24]. Megretski and Rantzer [16] proposed the idea of integral quadratic constraints (IQCs) to investigate the solution of the Lur’e problem using an LMI-based approach and to provide a framework that generalizes the Kalman-Yakubovich-Popov (KYP) lemma [25] and the S-procedure [26]. In the following discussion, we briefly explore the notions of pointwise and sector IQCs, which we will then use to bound the convergence rates of the GEM and EM-like algorithms considered before.
Definition 1 (Pointwise IQC).
A nonlinear mapping is said to satisfy the pointwise IQC defined by a symmetric and indefinite matrix if
(30) |
for every .
In Section III-A, we noted that when we analyze optimization algorithms from the dynamical systems perspective, a common source of the nonlinearity is the gradient map of a function . It is especially instructive to define IQCs that characterize the properties of a strongly convex function with Lipschitz continuous gradient.
Definition 2 (Sector IQC for the gradient map).
For a strongly convex function with strong convexity parameter , having Lipschitz continuous gradients with Lipschitz constant , the gradient map satisfies the sector IQC defined by
(31) |
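This sector IQC can be verified pointwise on a simple example. The sketch below checks the quadratic form of the sector IQC for the gradient of an illustrative quadratic whose curvatures lie in [m, L]: writing e for the deviation from the optimizer and g for the corresponding gradient difference, the form -2mL e'e + 2(m+L) e'g - 2 g'g must be nonnegative at every point.

```python
import numpy as np

# Pointwise check of the sector IQC for the gradient of an m-strongly
# convex function with L-Lipschitz gradient, on an illustrative quadratic:
# with e = y - y_star and g = grad_f(y) - grad_f(y_star),
#   [e; g]^T [[-2mL I, (m+L) I], [(m+L) I, -2 I]] [e; g] >= 0.
m, L = 1.0, 10.0
Qf = np.diag([m, 5.0, L])                    # m I <= Qf <= L I
grad_f = lambda y: Qf @ y                    # y_star = 0 here
rng = np.random.default_rng(0)
for _ in range(1000):
    e = rng.standard_normal(3)
    g = grad_f(e)
    quad = -2 * m * L * (e @ e) + 2 * (m + L) * (e @ g) - 2 * (g @ g)
    assert quad >= -1e-9                     # IQC holds at every sample
```

For this diagonal example the form factors coordinate-wise as -2 e_i^2 (q_i - m)(q_i - L), which is nonnegative precisely because each curvature q_i lies in the sector [m, L].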
Next, we state the lossless dimensionality reduction principle found in Section 4.2 of [7].
Lemma 1.
For the dynamical systems representation (6) of the gradient descent, heavy-ball, and Nesterov’s accelerated method, which are all first-order linear optimization methods (), we can assume, without loss of generality for the computation of their corresponding convergence rates, that the dimension of is .
This principle is likely applicable to other first-order numerical optimization schemes as well. Finally, we state a modified form of Theorem 4 in [7].
Lemma 2.
Remark 2.
It is interesting to note that the feasibility of the LMI is equivalent to the feasibility of a semidefinite program (SDP). The size of the SDP is proportional to , and, if is large, it can be computationally costly to solve large SDPs. However, since most of the optimization algorithms that we are interested in have a block-diagonal structure, we can use the lossless dimensionality reduction argument of Lemma 1 to analyze the SDP for without loss of generality.
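To illustrate how small the reduced program is, the following sketch solves the corresponding LMI for gradient descent directly, as a numpy-only stand-in for an SDP solver: since the LMI is homogeneous in the Lyapunov variable and the multiplier, we fix the former to 1, grid over the multiplier, and bisect on the rate. The step size and the constants m and L are illustrative, and the IQC multiplier matrix is the sector form from (31).

```python
import numpy as np

# Reduced LMI for gradient descent (scalar state after Lemma 1: A = 1,
# B = -alpha, C = 1) with the sector IQC matrix M = [[-2mL, m+L], [m+L, -2]].
# Fixing the Lyapunov variable P = 1 (the LMI is homogeneous in (P, lambda)),
# we grid over lambda >= 0 and bisect on the rate rho.
m, L = 1.0, 10.0
alpha = 1.0 / L

def lmi_feasible(rho, lam_grid):
    for lam in lam_grid:
        X = np.array([
            [(1 - rho**2) - 2 * lam * m * L, -alpha + lam * (m + L)],
            [-alpha + lam * (m + L), alpha**2 - 2 * lam],
        ])
        if np.linalg.eigvalsh(X).max() <= 1e-9:  # X negative semidefinite
            return True
    return False

lam_grid = np.linspace(0.0, 0.05, 5001)
lo, hi = 0.0, 1.0
for _ in range(40):                              # bisection on rho
    mid = 0.5 * (lo + hi)
    lo, hi = (lo, mid) if lmi_feasible(mid, lam_grid) else (mid, hi)
print(hi)  # certified rate, close to max(|1-alpha*m|, |1-alpha*L|) = 0.9
```

A general-purpose SDP solver would treat the Lyapunov variable and the multiplier jointly, but for this 2-by-2 reduced problem the grid-and-bisect approach already certifies the rate.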
III-E Convergence Rates Using IQCs
Now, we present a new viewpoint for deriving the rates of convergence of the EM-like algorithm (23) for GMMs and the GEM algorithm (25) for arbitrary probability distributions.
For the first scenario, we will show that the gradient of the crafted strongly convex objective function (24) seen as a feedback nonlinearity of the dynamical system representation (6) with suitable , and , allows us to bound the convergence rate for (23).
Theorem 1.
Proof.
Notice that Algorithm (23) can be represented as the LTI system with a feedback nonlinearity (6), with , , , and . Due to the regularity of , we can invoke Lemmas 1 and 2 to yield the following LMI
(33) |
for some and , where denotes the convergence rate. Since is constructed as a scalar (from Lemma 1), we can set without loss of generality to obtain the following LMI
(34) |
The negative semidefiniteness of the above matrix can be ensured if both and the Schur complement of the bottom right block are negative semidefinite. Thus, we have
(35a)
(35b) |
Combining these two, we have
(36) |
which yields . ∎
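A quick numerical companion to this argument: the quantity max(|1 - alpha*m|, |1 - alpha*L|), the familiar gradient-descent rate certified by LMI arguments of this type, is minimized at step size 2/(m + L), where it equals (L - m)/(L + m). The scan below is purely illustrative:

```python
import numpy as np

# Scan the worst-case rate bound rho(alpha) = max(|1 - alpha*m|, |1 - alpha*L|)
# over step sizes; the minimum occurs at alpha = 2/(m + L), with value
# (L - m)/(L + m).  The constants m and L are illustrative.
m, L = 1.0, 10.0
alphas = np.linspace(0.01, 0.2, 2000)
rho = np.maximum(np.abs(1 - alphas * m), np.abs(1 - alphas * L))
best = alphas[np.argmin(rho)]
print(best, rho.min())  # near 2/(m+L) = 2/11 and (L-m)/(L+m) = 9/11
```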
Remark 3.
In this analysis, the only assumption we made about is that it satisfies the sector IQC in (31). In general, not only does the analysis hold for general strongly convex functions, it also holds for functions that change over time, because all we need is that each input-output pair associated with the gradient block satisfies the pointwise sector IQC.
We now focus on Algorithm (25) for arbitrary probability distributions, which can be similarly represented as the dynamical system (6), allowing us to bound its convergence rate.
Theorem 2.
If is an -strongly convex function with -Lipschitz gradient satisfying
(37) |
for , then the convergence rate of the sequence generated by (25), with , can be bounded by .
Proof.
It suffices to replicate the proof of Theorem 1. ∎
IV Conclusions and Future Work
In this paper, we proposed to analyze a GEM algorithm and a newly designed EM-like algorithm as LTI systems with a feedback nonlinearity, by leveraging tools from robust control theory, particularly IQCs. This analysis allowed us to craft a strongly convex objective function, which led to the design of the aforementioned novel EM-like algorithm for GMMs. Furthermore, we leveraged recent results on the stability of linear first-order numerical optimization schemes represented by LTI systems connected to a sector-bounded nonlinearity to establish bounds on the convergence rates of the studied algorithms. In particular, the derived bounds for our proposed EM-like algorithm generalize bounds found in the literature for the EM algorithm on GMMs, and our analysis of an existing gradient ascent GEM algorithm based on the Q-function allowed us to derive bounds similar to those found in the literature.
The perspective used in this paper also provided us with insight into how control systems theory can be used to analyze, and eventually design, new classes of iterative optimization schemes. Indeed, even though the convergence properties of the considered GEM algorithms are well-known, our analysis is intended to help open the field to this line of research, due to its potential for the design of accelerated or more memory-efficient numerical optimization schemes.
References
- [1] C. Bishop, Pattern Recognition and Machine Learning. Springer Verlag, 2006.
- [2] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining, (First Edition). Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 2005.
- [3] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society. Series B (Methodological), pp. 1–38, 1977.
- [4] R. D. Nowak, “Distributed EM algorithms for density estimation and clustering in sensor networks,” IEEE Transactions on Signal Processing, vol. 51, no. 8, pp. 2245–2253, Aug 2003.
- [5] T. M. Mitchell, Machine Learning, 1st ed. New York, NY, USA: McGraw-Hill, Inc., 1997.
- [6] S. Pequito, A. P. Aguiar, B. Sinopoli, and D. A. Gomes, “Unsupervised learning of finite mixture models using mean field games,” in Proceedings of the 2011 49th Annual Allerton Conference on Communication, Control, and Computing, Sep. 2011, pp. 321–328.
- [7] L. Lessard, B. Recht, and A. Packard, “Analysis and design of optimization algorithms via integral quadratic constraints,” SIAM Journal on Optimization, vol. 26, no. 1, pp. 57–95, Jan. 2016.
- [8] J. Wang and N. Elia, “A control perspective for centralized and distributed convex optimization,” in Proceedings of the 2011 IEEE Conference on Decision and Control and European Control Conference, Dec 2011, pp. 3800–3805.
- [9] M. Fazlyab, A. Ribeiro, M. Morari, and V. Preciado, “Analysis of Optimization Algorithms via Integral Quadratic Constraints: Nonstrongly Convex Problems,” SIAM Journal on Optimization, vol. 28, no. 3, pp. 2654–2689, 2018.
- [10] Z. E. Nelson and E. Mallada, “An integral quadratic constraint framework for real-time steady-state optimization of linear time-invariant systems,” in Proceedings of the 2018 American Control Conference, June 2018, pp. 597–603.
- [11] W. I. Zangwill, Nonlinear programming: a unified approach. Prentice-Hall Englewood Cliffs, NJ, 1969, vol. 196, no. 9.
- [12] C. J. Wu, “On the convergence properties of the EM algorithm,” The Annals of Statistics, pp. 95–103, 1983.
- [13] R. A. Redner and H. F. Walker, “Mixture densities, maximum likelihood and the EM algorithm,” SIAM Review, vol. 26, no. 2, pp. 195–239, 1984.
- [14] L. Xu and M. I. Jordan, “On Convergence Properties of the EM Algorithm for Gaussian Mixtures,” Neural Computation, vol. 8, no. 1, pp. 129–151, Jan 1996.
- [15] O. Romero, S. Chatterjee, and S. Pequito, “Convergence of the Expectation-Maximization Algorithm Through Discrete-Time Lyapunov Stability Theory,” to appear in Proceedings of the 2019 American Control Conference, 2019.
- [16] A. Megretski and A. Rantzer, “System analysis via integral quadratic constraints,” IEEE Transactions on Automatic Control, vol. 42, no. 6, pp. 819–830, 1997.
- [17] S. Balakrishnan, M. J. Wainwright, and B. Yu, “Statistical guarantees for the EM algorithm: From population to sample-based analysis,” The Annals of Statistics, vol. 45, no. 1, pp. 77–120, 2017.
- [18] M. Avriel, Nonlinear programming: analysis and methods. Courier Corporation, 2003.
- [19] E. Ghadimi, H. R. Feyzmahdavian, and M. Johansson, “Global convergence of the heavy-ball method for convex optimization,” in Proceedings of the 2015 European Control Conference. IEEE, 2015, pp. 310–315.
- [20] Y. E. Nesterov, “A method for solving the convex programming problem with convergence rate ,” in Dokl. Akad. Nauk SSSR, vol. 269, 1983, pp. 543–547.
- [21] R. Salakhutdinov, S. Roweis, and Z. Ghahramani, “Optimization with EM and Expectation-conjugate-gradient,” in Proceedings of the Twentieth International Conference on International Conference on Machine Learning, ser. ICML’03. AAAI Press, 2003, pp. 672–679.
- [22] ——, “Relationship between gradient and EM steps in latent variable models,” Department of Computer Science, University of Toronto, Gatsby Neuroscience Unit, University College London, Tech. Rep., 2003.
- [23] B. Yan, M. Yin, and P. Sarkar, “Convergence of Gradient EM on Multi-component Mixture of Gaussians,” in Advances in Neural Information Processing Systems, 2017, pp. 6956–6966.
- [24] H. K. Khalil, Nonlinear Systems. Englewood Cliffs, New Jersey: Prentice-Hall, 2001.
- [25] A. Rantzer, “On the Kalman-Yakubovich-Popov lemma,” Systems and Control Letters, vol. 28, no. 1, pp. 7–10, 1996.
- [26] I. Pólik and T. Terlaky, “A survey of the S-lemma,” SIAM Review, vol. 49, no. 3, pp. 371–418, 2007.