I. Introduction
With the ever-expanding size and complexity of the datasets used in statistics, control systems, and machine learning, there has been growing interest in developing algorithms that efficiently solve the optimization problems arising in these settings. For example, a fundamental problem in exploratory data mining is cluster analysis, where the central task is to group objects into subgroups (i.e., clusters) such that the objects in a particular cluster share several characteristics (or features), while being, in some sense, sufficiently different from objects in other clusters [1, 2].
The Expectation-Maximization (EM) algorithm [3] is one of the most popular methods used in distribution-based clustering analysis and density estimation [4, 5]. Given a dataset, we can assume that the data is distributed according to a finite mixture of Gaussian distributions whose parameters are randomly initialized and then iteratively improved using the EM algorithm, which seeks to maximize the likelihood of the data under the mixture model. This leads to a finite Gaussian mixture that hopefully best fits the dataset in question.
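As a concrete illustration of this fitting loop, the following sketch runs EM on a two-component univariate Gaussian mixture. The synthetic data, initialization, and fixed iteration count are illustrative assumptions, not details prescribed by the text:

```python
import numpy as np

def em_gmm_1d(y, mu, var, pi, iters=100):
    """Sketch of EM for a two-component 1D Gaussian mixture.
    mu, var, pi are length-2 arrays: initial means, variances, weights."""
    for _ in range(iters):
        # E-step: responsibilities r[i, k] = P(component k | y_i)
        dens = np.stack([
            pi[k] / np.sqrt(2 * np.pi * var[k])
            * np.exp(-(y - mu[k]) ** 2 / (2 * var[k]))
            for k in range(2)
        ], axis=1)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: closed-form updates maximizing the expected log-likelihood
        nk = r.sum(axis=0)
        mu = (r * y[:, None]).sum(axis=0) / nk
        var = (r * (y[:, None] - mu) ** 2).sum(axis=0) / nk
        pi = nk / len(y)
    return mu, var, pi

rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(-2, 1, 500), rng.normal(3, 1, 500)])
mu, var, pi = em_gmm_1d(y, np.array([-1.0, 1.0]),
                        np.array([1.0, 1.0]), np.array([0.5, 0.5]))
```

With the synthetic data above, the estimated means approach the generating values -2 and 3.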
A current trend in optimization, machine learning, and control systems is that of leveraging a dynamical systems interpretation of iterative optimization algorithms [6, 7, 8]. The key idea is to view the estimates produced by the iterations of the algorithm as a state vector at different discrete instants of time (in particular, the initial approximation is viewed as the initial state), while the mechanism used to construct each subsequent estimate is modeled as a state-space dynamical system. Then, local optimizers and convergence of the optimization algorithm roughly translate to equilibria and asymptotic stability (in the sense of Lyapunov) of its dynamical system interpretation.
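A minimal sketch of this interpretation, using gradient ascent on a hypothetical one-dimensional objective f(x) = -(x - 1)^2 as a stand-in for the EM update: one iteration of the algorithm is one application of a state-transition map, and the maximizer is an equilibrium.

```python
import numpy as np

# View an iterative algorithm x_{k+1} = G(x_k) as a discrete-time dynamical
# system. Here G is a gradient-ascent map for f(x) = -(x - 1)^2, whose
# maximizer x* = 1 is a fixed point of G, i.e., an equilibrium.
def G(x, step=0.1):
    grad = -2.0 * (x - 1.0)   # f'(x)
    return x + step * grad    # one iteration = one state transition

x = 5.0                       # initial state (initial approximation)
trajectory = [x]
for _ in range(100):
    x = G(x)
    trajectory.append(x)
# the trajectory converges to the equilibrium x* = 1
```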
The convergence of the EM algorithm has been studied from the point-of-view of general point-to-set notions of convergence of optimization algorithms such as Zangwill’s convergence theorem [9]. Works such as [10] and [11] provide proofs of the convergence of the sequence of estimates generated by the EM algorithm.
The main contribution of this paper is to present a dynamical systems perspective on the convergence of the EM algorithm. The convergence of the EM algorithm is well known. However, our nonlinear stability analysis approach is intended to help open the field to new iteration schemes through the possible addition of an artificial external input in the dynamical system representation of the EM algorithm. Then, leveraging tools from feedback systems theory, we could design a control law that translates into accelerated convergence of the algorithm for specific subclasses of distributions.
The rest of the paper is organized as follows. In Section II, we briefly review the problem of maximum likelihood estimation and the EM algorithm. In Section III, we propose a dynamical systems perspective of the EM algorithm and propose a particular generalized EM (GEM) algorithm. In Section IV we establish our main convergence results by leveraging discrete-time Lyapunov stability theory. Finally, Section V concludes the paper.
Notation
The set of non-negative integers is represented by $\mathbb{N}$, the set of real numbers is represented by $\mathbb{R}$, and $\mathbb{R}^n$ denotes the set of $n$-dimensional real vectors. The Euclidean norm is denoted by $\|\cdot\|$. We denote the open $\delta$-ball around a point $\theta' \in \mathbb{R}^n$ as $\mathcal{B}_\delta(\theta') = \{\theta \in \mathbb{R}^n : \|\theta - \theta'\| < \delta\}$, and the closed $\delta$-ball as $\bar{\mathcal{B}}_\delta(\theta')$. The gradient and Hessian matrix of a scalar function $f$ are denoted, respectively, by $\nabla f$ and $\nabla^2 f$. The notation $A \prec 0$ denotes that the real-valued square matrix $A$ is negative definite. We do not distinguish random vectors and their corresponding realizations through notation, but instead let it be implicit through context. For a given random vector $X$, we denote its probability distribution as $p_X$. For simplicity, we will assume that every random vector is continuous, and hence, that every distribution admits a probability density function. We denote the expected value of a function $f$ of $X$ with respect to the distribution $p_X$ by $\mathbb{E}_{p_X}[f(X)]$, or simply $\mathbb{E}[f(X)]$ when the distribution is clear from context.

II. Expectation-Maximization Algorithm
In this section, we recall the Expectation-Maximization (EM) algorithm. Let $\theta \in \Theta \subseteq \mathbb{R}^n$ be some vector of unknown (but deterministic) parameters characterizing a distribution of interest, which we seek to infer from a collected dataset $y$ (from now on assumed fixed). To estimate $\theta$ from the dataset $y$, we first need a statistical model, i.e., an indexed class of probability distributions $\{p_Y(\cdot\,;\theta) : \theta \in \Theta\}$. The function $L : \Theta \to \mathbb{R}$ given by
$L(\theta) = p_Y(y; \theta)$  (1)
denotes the likelihood function. The objective is to compute the maximum likelihood estimate (MLE):
$\theta_{\mathrm{ML}} = \arg\max_{\theta \in \Theta} L(\theta),$  (2)
where the maximizer of $L$ is not necessarily unique, and hence, neither is the MLE. For that reason, it is actually more accurate to use

$\theta_{\mathrm{ML}} \in \arg\max_{\theta \in \Theta} L(\theta)$  (3)

as the definition of the MLE. From this point on, we will treat

$\Theta_{\mathrm{ML}} := \arg\max_{\theta \in \Theta} L(\theta)$  (4)

as a set, unless it consists of a single point $\theta_{\mathrm{ML}}$, in which case we may use $\Theta_{\mathrm{ML}} = \{\theta_{\mathrm{ML}}\}$ and $\theta_{\mathrm{ML}}$ interchangeably.
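As a toy instance of (2)-(4) (with hypothetical sample counts), the likelihood of $n$ i.i.d. Bernoulli samples with $s$ successes can be maximized over a fine grid; here the argmax set is the singleton $\{s/n\}$:

```python
import numpy as np

# Bernoulli likelihood theta^s (1 - theta)^(n - s); its unique maximizer
# over (0, 1) is the empirical frequency s/n, so Theta_ML is a singleton.
n, s = 20, 7                                   # hypothetical counts
thetas = np.linspace(1e-6, 1 - 1e-6, 100001)   # fine grid over (0, 1)
log_lik = s * np.log(thetas) + (n - s) * np.log(1 - thetas)
theta_ml = thetas[np.argmax(log_lik)]          # grid approximation of the MLE
```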
Next, we will introduce some assumptions that will ensure well-definedness throughout this paper.
Assumption 1.
$L(\theta) > 0$ for every $\theta \in \Theta$.
This assumption is simply a mild technical condition intended to avoid pathological behaviors, and it is satisfied by most mixtures of distributions used in practice (e.g., Gaussian, Poisson, Beta). Furthermore, we surely have $L(\theta) > 0$ for at least some $\theta \in \Theta$ (since, otherwise, the dataset is entirely useless for maximum likelihood estimation), and thus it suffices to disregard from $\Theta$ any $\theta$ such that $L(\theta) = 0$.
The underlying assumption of the EM algorithm is that there exists some latent (non-observable) random vector $Z$ for which we possess a “complete” statistical model $\{p_{Y,Z}(\cdot,\cdot\,;\theta) : \theta \in \Theta\}$ (as opposed to the “incomplete” model $\{p_Y(\cdot\,;\theta) : \theta \in \Theta\}$), and for which maximizing the expected value of the complete log-likelihood function is easier than maximizing the incomplete likelihood function. However, since $Z$ is latent, the idea behind the EM algorithm is to iteratively maximize the expected complete log-likelihood.
Assumption 2.
The support of $p_{Z|Y}(\cdot \mid y; \theta)$ does not depend on $\theta \in \Theta$.
Together with Assumption 1, this assumption will further allow us to avoid certain pathological cases. Specifically, we can properly define the expected log-likelihood function (hereafter, also referred to as the $Q$-function) $Q : \Theta \times \Theta \to \mathbb{R}$, defined as

$Q(\theta, \theta') = \mathbb{E}_{p_{Z|Y}(\cdot \mid y;\theta')}[\log p_{Y,Z}(y, Z; \theta)]$  (5a)
$\phantom{Q(\theta, \theta')} = \int \log(p_{Y,Z}(y, z; \theta))\, p_{Z|Y}(z \mid y; \theta')\, \mathrm{d}z,$  (5b)

where $\theta' \in \Theta$. With all these ingredients and assumptions, we summarize the EM algorithm in Algorithm 1. Notice that the term $Q(\theta, \hat{\theta}_k)$ in Algorithm 1 denotes the expected complete log-likelihood function for any given iteration $k \in \mathbb{N}$.
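A hedged sketch of the $Q$-function for a deliberately simplified statistical model — a two-component Gaussian mixture with unit variances and equal weights, parameterized only by its means (an illustrative assumption, not the general case) — where the E-step distribution enters through the responsibilities:

```python
import numpy as np

def q_function(theta, theta_hat, y):
    """Q(theta, theta_hat) for a two-component 1D Gaussian mixture with
    unit variances and equal weights, theta = (mu0, mu1)."""
    def comp_dens(y, mu):
        return np.exp(-(y - mu) ** 2 / 2) / np.sqrt(2 * np.pi)
    # E-step distribution: responsibilities under theta_hat
    d0 = 0.5 * comp_dens(y, theta_hat[0])
    d1 = 0.5 * comp_dens(y, theta_hat[1])
    r0 = d0 / (d0 + d1)
    r1 = 1.0 - r0
    # expected complete log-likelihood evaluated at theta
    return np.sum(r0 * np.log(0.5 * comp_dens(y, theta[0]))
                  + r1 * np.log(0.5 * comp_dens(y, theta[1])))

rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])
# the M-step would maximize theta -> q_function(theta, theta_hat, y)
```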
Remark 1.
In practice, the iterations of Algorithm 1 are computed until some stopping criterion is met, such that the final iterate approximates $\theta_{\mathrm{ML}}$.
III. Dynamical System Interpretation of the EM Algorithm and Convergence
Formally, the convergence of the EM algorithm is concerned with the existence and characteristics of the limit of the sequence $\{\hat{\theta}_k\}$ as $k \to \infty$. In particular, local convergence refers to the property of the sequence converging to the same point $\hat{\theta}_\infty$ for every initial approximation $\hat{\theta}_0$ that is sufficiently close to $\hat{\theta}_\infty$. On the other hand, global convergence refers to convergence to the same point for any initial approximation. Ideally, $\hat{\theta}_\infty$ is a global maximizer (or at least a local one) of the likelihood function. In practice, Algorithm 1 may converge to other stationary points of the likelihood function [3, 10].
We will now see how the EM algorithm (like many other iterative optimization algorithms) can be interpreted as a dynamical system in state-space, for which convergence translates to (asymptotic) stability in the sense of Lyapunov.
To start, recall that any discrete-time time-invariant nonlinear dynamical system in state-space can be described by its dynamics, which are of the form

$x_{k+1} = f(x_k), \quad k \in \mathbb{N},$  ($\Sigma$)

where $x_k \in \mathcal{X} \subseteq \mathbb{R}^n$ denotes the state of the system and $f : \mathcal{X} \to \mathcal{X}$ is some known function. In particular, any $f$ that satisfies

$f(\theta) \in \arg\max_{\theta' \in \Theta} Q(\theta', \theta)$  (6)

for every $\theta \in \Theta$ represents a particular realization of the different iterations of Algorithm 1. For the sake of simplicity, let us make the following assumption.
Assumption 3.
$Q(\cdot, \theta)$ has a unique global maximizer in $\Theta$ for each $\theta \in \Theta$.
Remark 2.
Assumption 3 does not imply that the likelihood function has a unique global maximizer, and subsequently, the MLE may still be non-unique. Furthermore, under Assumption 3, the sequence $\{\hat{\theta}_k\}$ generated by Algorithm 1 is unique for each $\hat{\theta}_0 \in \Theta$, the function $G : \Theta \to \Theta$ given by

$G(\theta) = \arg\max_{\theta' \in \Theta} Q(\theta', \theta)$  (EM)

for $\theta \in \Theta$ is uniquely defined, and ($\Sigma$) with $f = G$ captures the dynamical evolution emulated by Algorithm 1, i.e., $x_k = \hat{\theta}_k$ for every $k \in \mathbb{N}$.
Recall that, for a dynamical system of the form ($\Sigma$), we say that $x^\star \in \mathcal{X}$ is an equilibrium of the system if $x_0 = x^\star$ implies that $x_k = x^\star$ for every $k \in \mathbb{N}$; in other words, if $x^\star$ is a fixed point of $f$, i.e., $f(x^\star) = x^\star$. For completeness, we now formally define Lyapunov stability.
Definition 1 (Lyapunov stability).
Let $x^\star$ be an equilibrium of the dynamical system ($\Sigma$). We say that $x^\star$ is stable if the trajectory $\{x_k\}$ remains arbitrarily close to $x^\star$ provided that it starts sufficiently close to $x^\star$. In other words, if, for any $\varepsilon > 0$, there exists some $\delta = \delta(\varepsilon) > 0$ such that $\|x_0 - x^\star\| < \delta$ implies that $\|x_k - x^\star\| < \varepsilon$ for every $k \in \mathbb{N}$.
Further, we say that $x^\star$ is (locally) asymptotically stable if, apart from being stable, the trajectory converges to $x^\star$ provided that it starts sufficiently close to $x^\star$. In other words, if $x^\star$ is stable and there exists some $\delta > 0$ such that $\|x_0 - x^\star\| < \delta$ implies that $x_k \to x^\star$ as $k \to \infty$.
From the previous definition, it readily follows that asymptotic stability of $\hat{\theta}_\infty$ in ($\Sigma$) with $f = G$ is nearly equivalent to local convergence of the EM algorithm, since $\hat{\theta}_k \to \hat{\theta}_\infty$ for any $\hat{\theta}_0$ in some sufficiently small open ball centered around $\hat{\theta}_\infty$. Therefore, to establish local convergence of the EM algorithm from the point of view of the asymptotic stability of the corresponding dynamical system, we first need to establish that the points of interest (i.e., the local maxima of the likelihood function) are equilibria of the system (i.e., fixed points of $G$).
Let $\theta^\star$ be a local maximizer of $L$. More precisely, let us start by considering $\theta^\star \in \Theta_{\mathrm{ML}}$. Notice that the $Q$-function can be re-written as follows:

$Q(\theta, \hat{\theta}) = \mathbb{E}_{p_{Z|Y}(\cdot \mid y;\hat{\theta})}[\log p_{Y,Z}(y, Z; \theta)]$  (7a)
$\phantom{Q(\theta, \hat{\theta})} = \mathbb{E}_{p_{Z|Y}(\cdot \mid y;\hat{\theta})}[\log(p_{Z|Y}(Z \mid y; \theta)\, p_Y(y; \theta))]$  (7b)
$\phantom{Q(\theta, \hat{\theta})} = \log L(\theta) - D\big(p_{Z|Y}(\cdot \mid y; \hat{\theta})\,\big\|\,p_{Z|Y}(\cdot \mid y; \theta)\big) - h\big(p_{Z|Y}(\cdot \mid y; \hat{\theta})\big),$  (7c)

where $D(p \,\|\, q)$ denotes the Kullback-Leibler divergence from $p$ to $q$, and $h(p)$ denotes the differential Shannon entropy of $p$ [12]. Since the entropy term in (7c) does not depend on $\theta$, then

$\arg\max_{\theta \in \Theta} Q(\theta, \hat{\theta}) = \arg\max_{\theta \in \Theta} \big[\log L(\theta) - D\big(p_{Z|Y}(\cdot \mid y; \hat{\theta})\,\big\|\,p_{Z|Y}(\cdot \mid y; \theta)\big)\big]$  (8)

for $\hat{\theta} \in \Theta$.
Remark 3.
In general, the Kullback-Leibler divergence from an arbitrary distribution $p$ to another distribution $q$, denoted as $D(p \,\|\, q)$, satisfies the following two properties: (i) $D(p \,\|\, q) \geq 0$ for every $p$ and $q$, a result known as Gibbs’ inequality; and (ii) $D(p \,\|\, q) = 0$ if and only if $p = q$ almost everywhere.
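These two properties can be checked numerically (here for discrete distributions; a sanity check, not a proof):

```python
import numpy as np

# D_KL(p || q) >= 0 (Gibbs' inequality), with equality iff p == q.
def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(1)
p = rng.random(10); p /= p.sum()   # random strictly positive distribution
q = rng.random(10); q /= q.sum()
# kl(p, q) >= 0 and kl(p, p) == 0 up to floating-point error
```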
Therefore, we can now state the following.

Proposition 1.
Every $\theta^\star \in \Theta_{\mathrm{ML}}$ is an equilibrium of ($\Sigma$) with $f = G$, i.e., a fixed point of $G$.

Proof.
Inspecting (8) at $\hat{\theta} = \theta^\star$, it readily follows that $\log L(\theta)$ and $-D\big(p_{Z|Y}(\cdot \mid y; \theta^\star)\,\|\,p_{Z|Y}(\cdot \mid y; \theta)\big)$ are both maximized at $\theta = \theta^\star$, which in turn implies that $G(\theta^\star) = \theta^\star$. ∎
However, it should be clear that the previous argument does not hold for non-global maximizers of the likelihood function. One approach to get around this issue is to consider a specific variant of the generalized EM algorithm (GEM)¹. More precisely, we propose a GEM algorithm that searches for a global maximizer of $Q(\cdot, \hat{\theta}_k)$ in a restricted parameter space: $\Theta \cap \bar{\mathcal{B}}_\delta(\hat{\theta}_k)$ (i.e., a $\delta$-ball around the current estimate). We call this algorithm the $\delta$-EM algorithm, which is summarized in Algorithm 2.

¹ A GEM algorithm is any variant of the EM algorithm where the M-step is replaced by a search for some $\theta$ such that $Q(\theta, \hat{\theta}_k) > Q(\hat{\theta}_k, \hat{\theta}_k)$, if one exists (otherwise $\hat{\theta}_{k+1} = \hat{\theta}_k$), not necessarily a global maximizer of $Q(\cdot, \hat{\theta}_k)$ in $\Theta$.
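A sketch of the restricted M-step for a scalar parameter, with the search over the closed $\delta$-ball implemented as a naive grid search (an illustrative device only). The toy $Q$ below is the log-likelihood of a fully observed $N(\theta, 1)$ model, so the E-step distribution plays no role:

```python
import numpy as np

def delta_em_step(q, theta_hat, delta, grid=201):
    """Restricted M-step: maximize q(., theta_hat) over the closed
    delta-ball [theta_hat - delta, theta_hat + delta] via grid search."""
    candidates = np.linspace(theta_hat - delta, theta_hat + delta, grid)
    values = np.array([q(c, theta_hat) for c in candidates])
    return candidates[np.argmax(values)]

# toy Q: complete data = observed data, so Q is the log-likelihood itself
y = np.array([0.5, 1.5, 2.0])
q = lambda theta, theta_hat: -np.sum((y - theta) ** 2) / 2.0

theta = -3.0
for _ in range(50):
    theta = delta_em_step(q, theta, delta=0.5)
# iterates climb toward the unrestricted maximizer, the sample mean
```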
Next, we make the following simplifying assumption.
Assumption 4.
$Q(\cdot, \theta)$ has a unique global maximizer in $\Theta \cap \bar{\mathcal{B}}_\delta(\theta)$, for every $\theta \in \Theta$ and $\delta > 0$.
Naturally, we can interpret Algorithm 2 as the dynamical system ($\Sigma$) with $f = G_\delta$ given by

$G_\delta(\theta) = \arg\max_{\theta' \in \Theta \cap \bar{\mathcal{B}}_\delta(\theta)} Q(\theta', \theta)$  ($\delta$-EM)
$\phantom{G_\delta(\theta)} = \arg\max_{\theta' \in \Theta \cap \bar{\mathcal{B}}_\delta(\theta)} \big[\log L(\theta') - D\big(p_{Z|Y}(\cdot \mid y; \theta)\,\big\|\,p_{Z|Y}(\cdot \mid y; \theta')\big)\big],$  (9)

where (9) was derived following a similar argument to the one that led to (8). Therefore, similar to Proposition 1, we have the following result.
Proposition 2.
Every local maximizer $\theta^\star$ of $L$ is an equilibrium of ($\Sigma$) with $f = G_\delta$, provided that $\delta > 0$ is small enough.

Proof.
Let $\theta^\star$ be a local maximizer of $L$. Inspecting (9) at $\theta = \theta^\star$, it readily follows that $-D\big(p_{Z|Y}(\cdot \mid y; \theta^\star)\,\|\,p_{Z|Y}(\cdot \mid y; \theta')\big)$ is maximized at $\theta' = \theta^\star$. Furthermore, if $\delta$ is small enough, then $\log L(\theta')$ is also maximized (in $\Theta \cap \bar{\mathcal{B}}_\delta(\theta^\star)$) at $\theta' = \theta^\star$, which implies that $G_\delta(\theta^\star) = \theta^\star$. ∎
IV. Local Convergence Through Discrete-Time Lyapunov Stability Theory
In this section, we will discuss how we can establish the local convergence of the EM algorithm by exploiting classical results from Lyapunov stability theory in the dynamical system interpretation of the EM algorithm.
To start, we state the discrete-time version of the Lyapunov theorem (Theorem 1.2 in [13]). First, recall that a function $V : \mathcal{X} \to \mathbb{R}$ is said to be positive semidefinite if $V(x) \geq 0$ for every $x \in \mathcal{X}$. Furthermore, we say that $V$ is positive definite with respect to $x^\star \in \mathcal{X}$ if $V(x^\star) = 0$ and $V(x) > 0$ for $x \neq x^\star$.
Theorem 1 (Lyapunov Stability).
Let $x^\star$ be an equilibrium of the dynamical system ($\Sigma$) in the interior of $\mathcal{X}$ and let $\delta > 0$ be such that $\mathcal{B}_\delta(x^\star) \subseteq \mathcal{X}$. Suppose that $f$ is continuous and there exists some continuous function $V : \mathcal{B}_\delta(x^\star) \to \mathbb{R}$ (called a Lyapunov function) such that $V$ and $-\Delta V$ are, respectively, positive definite with respect to $x^\star$ and positive semidefinite, where $\Delta V(x) := V(f(x)) - V(x)$. Then, $x^\star$ is stable. If $-\Delta V$ is also positive definite with respect to $x^\star$, then $x^\star$ is asymptotically stable.
Remark 4.
The proof of Theorem 1 can be found in [13]. The statement of the theorem therein assumes local Lipschitz continuity of $f$. This is likely a residual from the classical assumption of local Lipschitz continuity of $f$ in continuous-time systems with dynamics of the form $\dot{x} = f(x)$, which is required by the Picard-Lindelöf theorem to ensure the unique existence of a solution to the differential equation for each initial state $x_0$. For discrete-time systems, on the other hand, the unique existence of the trajectory $\{x_k\}$ is immediate. However, a careful analysis of the argument used in the proof found in [13] reveals that the continuity of $f$ is nevertheless implicitly needed to ensure the continuity of $V \circ f$, since the extreme value theorem is invoked for this composition.
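Theorem 1 can be illustrated numerically on the hypothetical scalar map used earlier, $G(x) = x - 0.2(x - 1)$, with candidate $V(x) = (x - 1)^2$; here $\Delta V(x) = -0.36\,(x - 1)^2 \leq 0$, so the equilibrium $x^\star = 1$ is asymptotically stable:

```python
import numpy as np

# Candidate Lyapunov function check for the toy map G:
# V is positive definite w.r.t. x* = 1, and Delta V = V(G(x)) - V(x) <= 0.
G = lambda x: x + 0.1 * (-2.0 * (x - 1.0))   # G(x) - 1 = 0.8 (x - 1)
V = lambda x: (x - 1.0) ** 2

xs = np.linspace(-4.0, 6.0, 1001)
delta_V = V(G(xs)) - V(xs)                   # equals -0.36 (x - 1)^2
```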
In order to leverage the previous theorem to establish local convergence of the EM algorithm to local maxima of the likelihood function, we need to propose a candidate Lyapunov function. However, before doing so, we need to ensure that $G$ is continuous, which is attained by imposing some regularity on the likelihood function.
Assumption 5.
$L$ is twice continuously differentiable.
Subsequently, we obtain the following result.
Lemma 1.
$G$ and $G_\delta$ are both continuous.
Proof.
Under Assumption 5, it readily follows from (5b) that $Q$ is continuous in both of its arguments. Let $\{\theta_m\}$ be a sequence in $\Theta$ converging to $\theta \in \Theta$. Note that, for each $m \in \mathbb{N}$, we have $Q(G(\theta_m), \theta_m) \geq Q(\theta', \theta_m)$ for every $\theta' \in \Theta$. Taking the limit as $m \to \infty$, and leveraging the continuity of $Q$, we have $Q(\lim_{m \to \infty} G(\theta_m), \theta) \geq Q(\theta', \theta)$ for every $\theta' \in \Theta$. Consequently,

$\lim_{m \to \infty} G(\theta_m) \in \arg\max_{\theta' \in \Theta} Q(\theta', \theta),$  (10)

and therefore, by Assumption 3,

$\lim_{m \to \infty} G(\theta_m) = G(\theta).$  (11)

This same argument can be readily adapted for $G_\delta$. ∎
Let $\theta^\star$ be a local maximizer of $L$. Once again, let us start by considering $\theta^\star \in \Theta_{\mathrm{ML}}$. From Proposition 1, we know that $\theta^\star$ is an equilibrium of ($\Sigma$) for $f = G$. Next, a naive guess of a candidate Lyapunov function would be $V(\theta) = -L(\theta)$, since this would satisfy $\Delta V(\theta) \leq 0$ for every $\theta \in \Theta$, but $V(\theta^\star) \neq 0$ in general; hence, it is not a Lyapunov function. Notwithstanding, if we subtract $-L(\theta^\star)$ from the previous candidate, i.e., $V(\theta) = L(\theta^\star) - L(\theta)$, then $V(\theta^\star) = 0$ and $V(\theta) \geq 0$ for $\theta \in \Theta$. As a consequence, it should be clear that

$V(\theta) = L(\theta^\star) - L(\theta)$  (12)

appears to be the ideal candidate, since $V(\theta) \geq 0$ for every $\theta \in \Theta$ and $V(\theta^\star) = 0$. Yet, $V$ may be only positive semidefinite instead of positive definite (with respect to $\theta^\star$), since $V(\theta) = 0$ if and only if $L(\theta) = L(\theta^\star)$. To circumvent this issue, we will need to assume that $\theta^\star$ is an isolated maximizer² of $L$.

² We say that $\theta^\star$ is an isolated maximizer (stationary point) of $L$ if it is the only local maximizer (stationary point) of $L$ in $\mathcal{B}_\delta(\theta^\star)$ for some small enough $\delta > 0$.
Lemma 2.
Suppose that $\theta^\star$ is an isolated maximizer of $L$. Then, $V$ given by (12) is positive definite with respect to $\theta^\star$, and $-\Delta V$ is positive semidefinite for $f = G$ or $f = G_\delta$, provided that $\delta > 0$ is small enough.
Proof.
Let us focus on the case $f = G$. The positive definiteness of $V$ follows from (12) and the definition of isolated maximizer. On the other hand, from (8), it follows that $G(\theta)$ maximizes $\theta' \mapsto \log L(\theta') - D\big(p_{Z|Y}(\cdot \mid y; \theta)\,\|\,p_{Z|Y}(\cdot \mid y; \theta')\big)$, and therefore,

$\log L(G(\theta)) \;\geq\; \log L(G(\theta)) - D\big(p_{Z|Y}(\cdot \mid y; \theta)\,\big\|\,p_{Z|Y}(\cdot \mid y; G(\theta))\big) \;\geq\; \log L(\theta)$  (13)

for every $\theta \in \Theta$. Thus, from the strict monotonicity of the logarithm function, it follows that $L(G(\theta)) \geq L(\theta)$, i.e., $-\Delta V$ is indeed positive semidefinite. The case $f = G_\delta$ follows by essentially the same argument. ∎
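The monotonicity expressed by (13) can be observed numerically: along EM iterates the log-likelihood never decreases, so the candidate $V$ is non-increasing along trajectories. The mixture below (unit variances, equal weights, mean-only updates) is an illustrative special case:

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 300)])

def log_lik(mu):
    """Incomplete log-likelihood of the equal-weight, unit-variance mixture."""
    d = np.stack([np.exp(-(y - m) ** 2 / 2) / np.sqrt(2 * np.pi) for m in mu],
                 axis=1)
    return float(np.sum(np.log(0.5 * d.sum(axis=1))))

def em_step(mu):
    """One EM iteration on the means (exact M-step for this parameterization)."""
    d = np.stack([np.exp(-(y - m) ** 2 / 2) / np.sqrt(2 * np.pi) for m in mu],
                 axis=1)
    r = d / d.sum(axis=1, keepdims=True)      # responsibilities
    return (r * y[:, None]).sum(axis=0) / r.sum(axis=0)

mu = np.array([-0.5, 0.5])
lls = [log_lik(mu)]
for _ in range(30):
    mu = em_step(mu)
    lls.append(log_lik(mu))
# np.diff(lls) is (numerically) non-negative: L(G(theta)) >= L(theta)
```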
Remark 5.
We are now in a position to establish the stability of isolated MLEs in the dynamical system that represents the EM algorithm and, through a similar argument, the stability of arbitrary isolated local maximizers of the likelihood function in the dynamical system that represents the $\delta$-EM algorithm with $\delta$ small enough. However, non-asymptotic stability does not seem to translate into any interesting aspect of the convergence of the EM or $\delta$-EM algorithms.
In order to establish the positive definiteness of $-\Delta V$, we first need to characterize the equilibria of the dynamical systems that represent Algorithms 1 and 2.
Lemma 3.
Every fixed point of $G$ and of $G_\delta$ in the interior of $\Theta$ is a stationary point of the likelihood function.
Proof.
Let $\hat{\theta} = G(\hat{\theta})$ be in the interior of $\Theta$. Then, $\hat{\theta}$ maximizes $Q(\cdot, \hat{\theta})$, and hence $\nabla_\theta Q(\theta, \hat{\theta})\big|_{\theta = \hat{\theta}} = 0$. From (7c), $\nabla \log L(\hat{\theta}) = \nabla_\theta Q(\theta, \hat{\theta})\big|_{\theta = \hat{\theta}} + \nabla_\theta D\big(p_{Z|Y}(\cdot \mid y; \hat{\theta})\,\|\,p_{Z|Y}(\cdot \mid y; \theta)\big)\big|_{\theta = \hat{\theta}}$, where the second term vanishes since $\theta \mapsto D\big(p_{Z|Y}(\cdot \mid y; \hat{\theta})\,\|\,p_{Z|Y}(\cdot \mid y; \theta)\big)$ attains its minimum at $\theta = \hat{\theta}$. Therefore, $\nabla L(\hat{\theta}) = 0$. The same argument applies to fixed points of $G_\delta$. ∎
Next, we will need to assume additional regularity on the likelihood function. A very common assumption for a parameterized statistical model is for the parameterization to be injective, meaning that each distribution in the model is indexed by exactly one instance of the parameter space.
Assumption 6.
The mapping $\theta \mapsto p_{Z|Y}(\cdot \mid y; \theta)$ is injective.
Equipped with the last two assumptions (Assumptions 5 and 6), we are ready to establish the positive definiteness of $-\Delta V$, and subsequently, the local convergence of Algorithm 1 to ML estimates.
Theorem 2 (Local Convergence of EM to MLE).
If $\Theta_{\mathrm{ML}} = \{\theta_{\mathrm{ML}}\}$ with $\theta_{\mathrm{ML}}$ in the interior of $\Theta$ and $\nabla^2 L(\theta_{\mathrm{ML}}) \prec 0$, then the sequence $\{\hat{\theta}_k\}$ generated by Algorithm 1 converges to $\theta_{\mathrm{ML}}$ for every initial approximation $\hat{\theta}_0$ in a small enough open ball centered around $\theta_{\mathrm{ML}}$.
Proof.
First, recall from Proposition 1 that $\theta_{\mathrm{ML}}$ is an equilibrium of ($\Sigma$) with $f = G$ (i.e., a fixed point of $G$), and from Lemma 1 that $G$ is continuous. Furthermore, $\theta_{\mathrm{ML}}$ is in the interior of $\Theta$ by hypothesis.
Let $\delta > 0$ be small enough such that $\theta_{\mathrm{ML}}$ is the only stationary point of $L$ in $\mathcal{B}_\delta(\theta_{\mathrm{ML}})$. Such a $\delta$ can be chosen since $\theta_{\mathrm{ML}}$ is itself a stationary point of $L$ and $\nabla^2 L(\theta_{\mathrm{ML}}) \prec 0$ (which, together, ensure isolated stationarity). In particular, $\theta_{\mathrm{ML}}$ is an isolated maximizer. Then, from Lemma 2, the function $V$ given by (12) with $\theta^\star = \theta_{\mathrm{ML}}$ is positive definite with respect to $\theta_{\mathrm{ML}}$, and $-\Delta V$ is positive semidefinite.
Note that the inequality of the divergence term in (13) is strict for $\theta \neq \theta_{\mathrm{ML}}$. This is because, from Assumption 6, $D\big(p_{Z|Y}(\cdot \mid y; \theta)\,\|\,p_{Z|Y}(\cdot \mid y; G(\theta))\big) = 0$ if and only if $\theta$ is a fixed point of $G$. But such a point needs to be a stationary point of $L$ (see Lemma 3), which would lead to the contradiction $\theta = \theta_{\mathrm{ML}}$, since $\theta_{\mathrm{ML}}$ is the only stationary point of $L$ in $\mathcal{B}_\delta(\theta_{\mathrm{ML}})$. Therefore, $L(G(\theta)) > L(\theta)$, i.e., $-\Delta V(\theta) > 0$, for every $\theta \neq \theta_{\mathrm{ML}}$ in $\mathcal{B}_\delta(\theta_{\mathrm{ML}})$. Finally, since $\theta_{\mathrm{ML}}$ is a fixed point of $G$, then $\Delta V(\theta_{\mathrm{ML}}) = 0$, which concludes that $-\Delta V$ is positive definite. The conclusion follows by invoking Theorem 1, since $\theta_{\mathrm{ML}}$ was just proved to be asymptotically stable. ∎
Theorem 3 (Local Convergence of -EM to Local Maxima).
If $\theta^\star$ is such that $\nabla L(\theta^\star) = 0$ and $\nabla^2 L(\theta^\star) \prec 0$, and $\delta > 0$ is small enough, then the sequence $\{\hat{\theta}_k\}$ generated by Algorithm 2 converges to $\theta^\star$ for every initial approximation $\hat{\theta}_0$ in a small enough open ball centered around $\theta^\star$.
Proof.
The proof follows similar steps to those in the proof of Theorem 2, with the following adaptations: first, replace $G$ by $G_\delta$; and secondly, $\theta_{\mathrm{ML}}$ by $\theta^\star$. ∎
Notice that the reason why Theorem 3 cannot be readily adapted for the EM algorithm (as opposed to the $\delta$-EM algorithm) is that local maximizers may fail to be fixed points of $G$, and therefore, equilibria of ($\Sigma$) with $f = G$. To circumvent this limitation, we will focus on the limit points of the EM algorithm. First, recall that $\hat{\theta}_\infty$ is a limit point of Algorithm 1 if there exists some $\hat{\theta}_0 \in \Theta$ such that $\hat{\theta}_k \to \hat{\theta}_\infty$ as $k \to \infty$ for the sequence $\{\hat{\theta}_k\}$ generated by Algorithm 1. Limit points turn out to also be fixed points, which is captured by the following result.
Lemma 4.
If $\hat{\theta}_\infty$ is a limit point of Algorithm 1, then it is also a fixed point of $G$.
Proof.
Let $\hat{\theta}_0$ be such that $\hat{\theta}_k \to \hat{\theta}_\infty$ as $k \to \infty$, where $\{\hat{\theta}_k\}$ was generated by Algorithm 1. Then, by the continuity of $G$ (Lemma 1), it follows that $G(\hat{\theta}_\infty) = \lim_{k \to \infty} G(\hat{\theta}_k) = \lim_{k \to \infty} \hat{\theta}_{k+1} = \hat{\theta}_\infty$. ∎
Remark 6.
Notice that, while not every local maximizer of the likelihood function is a limit point of the EM algorithm, the same is not true for the $\delta$-EM algorithm, provided that $\delta$ is sufficiently small and the local maximizer is sufficiently regular (i.e., an isolated stationary point of the likelihood function).
Upon Remark 6, and the convergence results established before, we can now establish the following claim.
Theorem 4 (Local Convergence of EM to its Limit Points).
If $\hat{\theta}_\infty$ is a limit point of Algorithm 1 in the interior of $\Theta$ such that $\nabla^2 L(\hat{\theta}_\infty) \prec 0$, then the sequence $\{\hat{\theta}_k\}$ generated by Algorithm 1 converges to $\hat{\theta}_\infty$ for every initial approximation $\hat{\theta}_0$ in a small enough open ball centered around $\hat{\theta}_\infty$.

Proof.
From Lemma 4, $\hat{\theta}_\infty$ is a fixed point of $G$, and hence an equilibrium of ($\Sigma$) with $f = G$. From Lemma 3, $\hat{\theta}_\infty$ is a stationary point of $L$, which, together with $\nabla^2 L(\hat{\theta}_\infty) \prec 0$, ensures that it is an isolated maximizer. The remainder of the proof follows the same steps as the proof of Theorem 2, with $\theta_{\mathrm{ML}}$ replaced by $\hat{\theta}_\infty$. ∎
We conclude this section by exploring how the notion of exponential stability can be leveraged to bound the convergence rate of the EM algorithm. First, recall that the (linear) rate of convergence of a sequence $\{\theta_k\}$ converging to $\theta_\infty$ is the number $r$ given by

$r = \lim_{k \to \infty} \frac{\|\theta_{k+1} - \theta_\infty\|}{\|\theta_k - \theta_\infty\|},$  (15)

provided that the limit exists. Additionally, recall that an equilibrium $x^\star$ of ($\Sigma$) is said to be exponentially stable if there exist constants $\alpha > 0$ and $\beta \in (0, 1)$ such that, for every $x_0$ in a sufficiently small open ball centered around $x^\star$, we have $\|x_k - x^\star\| \leq \alpha \beta^k \|x_0 - x^\star\|$ for every $k \in \mathbb{N}$.
Remark 7.
If $\hat{\theta}_\infty$ is an exponentially stable equilibrium of ($\Sigma$) with $f = G$, then the sequence $\{\hat{\theta}_k\}$ generated by Algorithm 1 converges to $\hat{\theta}_\infty$ with linear convergence rate

$r = \lim_{k \to \infty} \frac{\|\hat{\theta}_{k+1} - \hat{\theta}_\infty\|}{\|\hat{\theta}_k - \hat{\theta}_\infty\|} \leq \beta$  (16)

for every initial approximation $\hat{\theta}_0$ in a sufficiently small open ball centered around $\hat{\theta}_\infty$.
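The relation between the exponential-stability constants and the rate (15) can be observed on a hypothetical contraction map standing in for $G$, with contraction factor 0.8: the estimated per-step error ratios approach that factor.

```python
import numpy as np

# Toy contraction map with equilibrium x* = 1 and contraction factor 0.8:
# ||x_{k+1} - x*|| = 0.8 ||x_k - x*||, so the linear rate (15) equals 0.8.
G = lambda x: 1.0 + 0.8 * (x - 1.0)

x = 5.0
errors = []
for _ in range(60):
    errors.append(abs(x - 1.0))
    x = G(x)
# empirical per-step ratios approximating the limit in (15)
rates = [errors[k + 1] / errors[k] for k in range(len(errors) - 1)]
```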
The following theorem (adapted from Theorem 5.7 in [13]) allows us to ensure exponential stability, and subsequently to bound the linear convergence rate, provided our Lyapunov function satisfies some additional (mild) regularity conditions.
Theorem 5 (Exponential Stability).
Let $x^\star$ be an equilibrium of ($\Sigma$) in the interior of $\mathcal{X}$, with $\mathcal{B}_\delta(x^\star) \subseteq \mathcal{X}$ for some small enough $\delta > 0$, and $f$ continuous. Let $V : \mathcal{B}_\delta(x^\star) \to \mathbb{R}$ be a continuous and positive definite function (with respect to $x^\star$) such that

$c_1 \|x - x^\star\|^a \leq V(x) \leq c_2 \|x - x^\star\|^a$  (17a)
$\Delta V(x) \leq -c_3 \|x - x^\star\|^a$  (17b)

for every $x \in \mathcal{B}_\delta(x^\star)$, for some constants $a, c_1, c_2, c_3 > 0$ with $c_3 \leq c_2$. Then, $x^\star$ is exponentially stable. More precisely, we have $\|x_k - x^\star\| \leq \alpha \beta^k \|x_0 - x^\star\|$ for every $k \in \mathbb{N}$ and every $x_0 \in \mathcal{B}_{\delta'}(x^\star)$ with $\delta' > 0$ small enough, where $\alpha = (c_2/c_1)^{1/a}$ with

$\beta = \left(1 - \frac{c_3}{c_2}\right)^{1/a}$  (18)

and $\beta \in [0, 1)$.
Equipped with this result, we are now ready to establish the following sufficient conditions.
Proposition 3.
Let $\hat{\theta}_\infty$ be a limit point of Algorithm 1 such that $\nabla^2 L(\hat{\theta}_\infty) \prec 0$. Suppose that there exist constants $a, c_1, c_2, c_3 > 0$ such that

$c_1 \|\theta - \hat{\theta}_\infty\|^a \leq L(\hat{\theta}_\infty) - L(\theta) \leq c_2 \|\theta - \hat{\theta}_\infty\|^a$  (19a)
$L(G(\theta)) - L(\theta) \geq c_3 \|\theta - \hat{\theta}_\infty\|^a$  (19b)

or

$c_1 \|\theta - \hat{\theta}_\infty\|^a \leq \log L(\hat{\theta}_\infty) - \log L(\theta) \leq c_2 \|\theta - \hat{\theta}_\infty\|^a$  (20a)
$\log L(G(\theta)) - \log L(\theta) \geq c_3 \|\theta - \hat{\theta}_\infty\|^a$  (20b)

for every $\theta$ in some small enough $\mathcal{B}_\delta(\hat{\theta}_\infty)$. Then, $\hat{\theta}_\infty$ is exponentially stable.
Proof.
From Lemma 4, it follows that $\hat{\theta}_\infty$ is an equilibrium of ($\Sigma$) with $f = G$. The result follows from invoking Theorem 5. First, to see that condition (17a) is verified, we notice that it is equivalent to (19a) for $V(\theta) = L(\hat{\theta}_\infty) - L(\theta)$ (which is continuous and positive definite with respect to $\hat{\theta}_\infty$ in $\mathcal{B}_\delta(\hat{\theta}_\infty)$). On the other hand, condition (17b) readily follows from (19b), since $-\Delta V(\theta) = L(G(\theta)) - L(\theta)$. Similarly, (17) follows from (20) for $V(\theta) = \log L(\hat{\theta}_\infty) - \log L(\theta)$ (which is also continuous and positive definite with respect to $\hat{\theta}_\infty$ in $\mathcal{B}_\delta(\hat{\theta}_\infty)$). ∎
Remark 8.
Notice that (19a) actually readily holds with $a = 2$, as a consequence of $\nabla^2 L(\hat{\theta}_\infty) \prec 0$ together with a second-order Taylor expansion of $L$ around $\hat{\theta}_\infty$.
V. Conclusion and Future Work
We proposed a dynamical systems perspective on the Expectation-Maximization (EM) algorithm by analyzing the evolution of its estimates as a nonlinear state-space dynamical system. In particular, we drew parallels between limit points and equilibria, and between convergence and asymptotic stability, and we leveraged results from discrete-time Lyapunov stability theory to establish local convergence results for the EM algorithm. Moreover, we derived conditions that allow us to construct explicit bounds on the linear convergence rate of the EM algorithm by establishing exponential stability in the dynamical system that represents it. Future work will be dedicated to leveraging tools from integral quadratic constraints (IQCs) to construct accelerated EM-like algorithms by including artificial control inputs optimally designed through feedback.
References
- [1] C. Bishop, Pattern Recognition and Machine Learning. Springer Verlag, 2006.
- [2] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining, (First Edition). Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 2005.
- [3] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society, Series B (Methodological), pp. 1–38, 1977.
- [4] R. D. Nowak, “Distributed EM algorithms for density estimation and clustering in sensor networks,” IEEE Transactions on Signal Processing, vol. 51, no. 8, pp. 2245–2253, Aug. 2003.
- [5] T. M. Mitchell, Machine Learning, 1st ed. New York, NY, USA: McGraw-Hill, Inc., 1997.
- [6] L. Lessard, B. Recht, and A. Packard, “Analysis and design of optimization algorithms via integral quadratic constraints,” SIAM Journal on Optimization, vol. 26, no. 1, pp. 57–95, 2016.
- [7] J. Wang and N. Elia, “A control perspective for centralized and distributed convex optimization,” in Proceedings of the 2011 IEEE Conference on Decision and Control and European Control Conference, Dec 2011, pp. 3800–3805.
- [8] Z. E. Nelson and E. Mallada, “An integral quadratic constraint framework for real-time steady-state optimization of linear time-invariant systems,” in Proceedings of the 2018 American Control Conference, June 2018, pp. 597–603.
- [9] W. I. Zangwill, Nonlinear programming: a unified approach. Prentice-Hall Englewood Cliffs, NJ, 1969, vol. 196, no. 9.
- [10] C. J. Wu, “On the convergence properties of the EM algorithm,” The Annals of Statistics, pp. 95–103, 1983.
- [11] R. A. Redner and H. F. Walker, “Mixture densities, maximum likelihood and the EM algorithm,” SIAM Review, vol. 26, no. 2, pp. 195–239, 1984.
- [12] T. M. Cover and J. A. Thomas, Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). New York, NY, USA: Wiley-Interscience, 2006.
- [13] N. Bof, R. Carli, and L. Schenato, “Lyapunov theory for discrete time systems,” arXiv:1809.05289, 2018.