# A Dynamical Systems Approach for Convergence of the Bayesian EM Algorithm

Out of the recent advances in systems and control (S&C)-based analysis of optimization algorithms, not enough work has been specifically dedicated to machine learning (ML) algorithms and their applications. This paper addresses this gap by illustrating how (discrete-time) Lyapunov stability theory can serve as a powerful tool to aid, or even lead, in the analysis (and potential design) of optimization algorithms that are not necessarily gradient-based. The particular ML problem that this paper focuses on is that of parameter estimation in an incomplete-data Bayesian framework via the popular optimization algorithm known as maximum a posteriori expectation-maximization (MAP-EM). Following first principles from dynamical systems stability theory, conditions for convergence of MAP-EM are developed. Furthermore, if additional assumptions are met, we show that fast convergence (linear or quadratic) is achieved, which could have been difficult to unveil without our adopted S&C approach. The convergence guarantees in this paper effectively expand the set of sufficient conditions for EM applications, thereby demonstrating the potential of similar S&C-based convergence analysis of other ML algorithms.


## 1 Introduction

This work is inspired by the recent stream of papers importing ideas from (dynamical) systems and control (S&C) theory into optimization [Elia2011, Su2014, Lessard2016, Wibisono2016, Fazlyab2017b, Scieur2017, Franca2018, Taylor2018, Wilson2018, Franca2019, Orvieto2019]. While a significant volume of these optimization-based papers have been published at machine learning (ML) venues, only a few have been explicitly dedicated to addressing concrete ML problems, applications, or algorithms [Plumbley1995, Pequito2011UnsupervisedLO, Zhu2018AnOC, Aquilanti2019, Liu2019]. Furthermore, only a small subset of this emerging topic of research has focused directly on discrete-time analysis that is the direct result of discretizations of an underlying continuous-time version of the algorithms [Lessard2016, Fazlyab2018, Fazlyab2018b, Taylor2018, romero2019].

The aforementioned effort to bring ideas from S&C into optimization is largely done via Lyapunov-based tools, which are most popular for nonlinear systems [Khalil2001]. Simply put, Lyapunov functions may be seen as abstract surrogates of energy in a dynamical system. If the function is monotonically decreasing over time, then some form of stability must be present. Likewise, if persistently strictly increasing, then instability is inevitable. There exist several close relationships and parallels between notions of stability of dynamical systems and convergence of optimization algorithms. However, to the best of the authors’ knowledge, the current literature lacks a comprehensive summary of these relationships, with the closest work that we are aware of being [Schropp1995, Lessard2016, Fazlyab2018, Taylor2018, Wilson2018].

Furthermore, most literature in S&C regarding Lyapunov stability theory deals exclusively with continuous-time systems, and (dynamical) stability properties are typically not preserved after discretization [Stuart1998]. However, a vast amount of general Lyapunov-based results possess a discrete-time equivalent [Kalman1959, Byrnes1994, Agarwal2000, Jiang2001, Jiang2002, Lakshmikantham2002, Kellett2004, Kellett2005, Geiselhart2016, Bof2018]. These discrete-time equivalent Lyapunov-based stability results have largely not been adequately leveraged for convergence analysis of optimization algorithms with ML applications.

Further challenges arise when it comes to putting Lyapunov’s ideas into practice, particularly in the context of optimization and ML. Most notably, finding a suitable Lyapunov function, often out of a set of candidates, may prove extraordinarily difficult. For instance, for physical engineering systems, Lyapunov functions are often constructed from first principles as some meaningful measure of energy. In optimization, the situation is somewhat similar in the sense that a suitable Lyapunov function may often be constructed as a surrogate of the objective function. There exist some general Lyapunov function construction methods, but they come with steep limitations regarding their scope [Li2014]. Outside of these scenarios, Lyapunov functions are traditionally constructed rather empirically, on a case-by-case basis. Furthermore, while the literature contains certain reciprocal results on the existence of Lyapunov functions for stable systems, as well as Lyapunov-based instability results, it should be noted that failing to find or construct a Lyapunov function is not sufficient grounds to assert instability.

### Contributions

In this paper, we conduct an S&C-based analysis of the convergence of a widely popular algorithm used for incomplete-data estimation and unsupervised learning – the expectation-maximization (EM) algorithm. More precisely, we will focus on the Bayesian variant of the EM algorithm originally proposed in [Dempster1977], which we refer to as the MAP-EM algorithm, since it is used for maximum a posteriori (MAP) estimation [Figueiredo2004]. We will leverage notions from discrete-time Lyapunov stability theory to study the convergence of MAP-EM, and in the process provide exclusive insights on the robustness of our derived conditions for stability (asymptotic or otherwise), and thus convergence guarantees. Furthermore, it should be noted that these results could likely not have been constructed without our approach. The primary contributions of this paper are:


• Using Lyapunov stability theory, we show that, under certain regularity assumptions, the MAP-EM algorithm is locally convergent to strict local maxima of the incomplete-data posterior, including the MAP estimator.

• We provide conditions to establish Lyapunov exponential stability, which translates to a linear convergence rate of EM, or even a quadratic one under stronger conditions.

• We prove that under certain conditions (most notably, unimodality of the incomplete-data posterior), the EM algorithm is globally convergent.

With this, we argue for the possibility of extending our S&C-based framework to discover robust stability conditions and novel convergence guarantees of alternative iterative optimization algorithms used in machine learning.

## 2 Background: MAP-EM Algorithm

Let θ be an unknown parameter of interest that we seek to infer from an idealized unobservable dataset x, via the statistical model p(x|θ) and prior p(θ). Given that x is not directly observable, suppose another random variable y, which may be seen as an incomplete version of x, is observable. In this definition, x = (y, z), with z seen as missing data. For this reason, x is typically referred to as the complete dataset. More generally, we could have (y, z) = T(x) for some transformation T, with y observable and z hidden, or, simplistically, y = T(x) for some T. In practice, there could even be no explicit missing data or no relationship between x and y in terms of transformations. Instead, the only relationship could be the Markov condition θ → x → y [Gupta2011], meaning that y is conditionally independent of θ subject to x, i.e. p(y|x, θ) = p(y|x).

We adopt the notation that θ, x, y are all (absolutely) continuous random variables, but, in reality, only θ is required to be so, with x and y being allowed to be discrete, continuous, or of mixed type. For ease of notation, we use ∫⋅dθ, ∫⋅dx, ∫⋅dy to refer, respectively, to integration with respect to (w.r.t.) implicit σ-finite measures that dominate the probability distributions of θ, x, y, or appropriate conditionals of these, and which coincide with the construction of the respective densities via Radon–Nikodym derivatives. We aim to estimate θ from y via the maximum a posteriori (MAP) estimator ^θMAP ∈ argmaxθ p(θ|y), where the maximization is taken over the entire parameter space. In some situations a global maximizer may not exist, and we need to be content with (or even give preference to) a “good” local maximizer [Figueiredo2004]. The mapping θ ↦ log p(y|θ) is typically referred to as the incomplete-data log-likelihood function, whereas θ ↦ log p(x|θ) is the complete-data log-likelihood function.

Under the incomplete-data framework, a popular approach to compute the MAP estimator is through the expectation-maximization (EM) algorithm (also known as MAP-EM in our Bayesian framework), whose iterations are

 ^θk+1 = argmaxθ Q(θ, ^θk),  where  Q(θ, ^θ) ≜ Ex∼p(x|y,^θ)[log p(θ|x)] (1)

from a given initial estimate ^θ0, where Q(θ, ^θ) denotes the expected complete-data log-posterior, conditional on the observed data y and the current parameter estimate ^θ. The maximum in (1) is taken within the entire parameter space. Eventually, as we will demonstrate later, EM is (in general) a local (greedy) search method w.r.t. the actual objective function – the (incomplete-data) log-posterior log p(θ|y).¹

¹ The underlying assumption of the EM algorithm is that (1) is easy to globally maximize in θ. If it cannot be exactly and globally maximized, but we can instead find some θ such that Q(θ, ^θk) > Q(^θk, ^θk) (potentially excluding the case when ^θk is already a fixed point of EM), then any variant of EM that settles for such a sequence of iterates, where ^θk+1 = θ, is known as a generalized EM (GEM) algorithm.
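As a concrete sketch of iteration (1), the following minimal example runs EM for a two-component one-dimensional Gaussian mixture with known weights and unit variances, estimating only the means. With a flat prior, MAP-EM reduces to classical EM (see Subsection 2.1); the data, weights, and initialization below are hypothetical choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: two-component 1D GMM with known weights and unit variances.
y = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 300)])
w = np.array([0.4, 0.6])  # known mixture weights

def em_step(mu):
    """One EM iteration: the fixed-point map mu_{k+1} = F(mu_k) for the means."""
    # E-step: responsibilities p(component j | y_i, mu), i.e. the posterior over
    # the hidden labels (the "missing data").
    dens = w * np.exp(-0.5 * (y[:, None] - mu) ** 2) / np.sqrt(2 * np.pi)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: closed-form maximizer of Q: responsibility-weighted sample means.
    return (resp * y[:, None]).sum(axis=0) / resp.sum(axis=0)

mu = np.array([0.0, 1.0])  # initial estimate theta_0
for _ in range(100):
    mu = em_step(mu)
```

Here the M-step is available in closed form, which is precisely the situation the footnote's underlying assumption refers to; when it is not, a GEM-style partial improvement of Q would be used instead.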

### 2.1 An Information-Theoretic Perspective

As we dive into the information-theoretic perspective, recall that DKL[p ∥ q] ≜ ∫ p log(p/q) denotes the Kullback–Leibler (KL) divergence between the probability density functions p and q w.r.t. the same σ-finite dominating measure over x. The (random) variables of interest satisfy the following assumption.

###### Assumption 1.

The random variables (θ, x, y) satisfy the Markovian condition p(y|x, θ) = p(y|x).

Now, the Markov Assumption 1 allows us to restate the Q-function and subsequently the EM algorithm in information-theoretic terms, leading to the following proposition.

###### Proposition 1.

Under Assumption 1, we have

 Q(θ,^θ)=logp(θ|y)−DKL[pX(⋅|y,^θ)∥pX(⋅|y,θ)]+terms that do not depend on θ, (2)

and thus, the MAP-EM algorithm (1) can be rewritten as

 ^θk+1=argmaxθ{logp(θ|y)−d(θ,^θk)}, (3)

where d(θ, ^θ) ≜ DKL[pX(⋅|y, ^θ) ∥ pX(⋅|y, θ)] ≥ 0, with d(^θ, ^θ) = 0.

Since the sole purpose of the Q-function is to be maximized at the M-step, we thus redefine it as Q(θ, ^θ) ≜ log p(θ|y) − d(θ, ^θ) by dropping the terms that do not depend on θ. Notice that the EM algorithm can be recognized as a (generalized) proximal point algorithm (PPA)

 ^θk+1 = argminθ {ℓ(θ) + βk d(θ, ^θk)}, (4)

with a fixed step size βk ≡ 1, where ℓ(θ) ≜ −log p(θ|y) is the function we seek to minimize. Furthermore, d(⋅, ⋅) may be seen as a regularizing premetric². This perspective was explored in detail in [Chretien2000, Figueiredo2004], which served as inspiration for this work.

² It is well known that DKL[p ∥ q] ≥ 0, with equality corresponding almost exclusively to p = q. More precisely, consider two probability distributions P and Q over the same measurable space, both dominated by some σ-finite measure μ on that space. With a slight overload of notation, let DKL[p ∥ q] ≜ ∫ p log(p/q) dμ, where p = dP/dμ and q = dQ/dμ. Then DKL[p ∥ q] ≥ 0, with equality if and only if p = q (μ-a.s.), which is also equivalent to P = Q. Despite this, the KL divergence is non-symmetric and does not satisfy the triangle inequality, in general, and thus it is not a metric.

Notice also that, if the prior p(θ) is “flat”, meaning that we assign the (typically) non-informative (and potentially degenerate) prior given by a uniform distribution over the parameter space, then log p(θ|y) equals log p(y|θ) up to an additive constant, with log p(y|θ) naturally denoting the incomplete-data log-likelihood function. Therefore, the Bayesian EM algorithm (MAP-EM) generalizes the traditional EM algorithm, and for this reason, from this point on we will largely refer to MAP-EM as simply EM.

## 3 The Dynamical Systems Approach

We now interpret the EM algorithm as a (discrete-time and time-invariant) nonlinear state-space dynamical system,

 ^θk+1 = F(^θk), ∀k ∈ Z+,  with  F(^θ) ≜ argmaxθ Q(θ, ^θ). (5)

In the language of dynamical systems, ^θk is known as the state of the system (5) at time step k (usually denoted as xk, much like in reinforcement learning, but for the sake of notational consistency with EM we opted for ^θk), and, in particular, ^θ0 is called the initial state. The space of points from which the state can take values is known as the state space. In our case, the parameter space and the state space coincide. The sequence {^θk} is often called a trajectory starting from initial state ^θ0. The function F represents the dynamics of the system. The following assumption ensures that F is uniquely defined.

###### Assumption 2.

Q(⋅, ^θ) has a unique global maximizer for each ^θ.

In other words, the expected complete-data log-posterior must have a unique global maximizer, which in principle does not prevent the incomplete-data log-posterior from having multiple global maxima or from being unbounded, such as in the case of GMMs with unknown covariance matrices.

### 3.1 Equilibrium in Dynamical Systems

We say that a point ^θ⋆ in the state space is an equilibrium of the dynamical system (5) if ^θ0 = ^θ⋆ implies that ^θk = ^θ⋆ for every k ≥ 0. In other words, ^θ⋆ is an equilibrium point if and only if ^θ⋆ is a fixed point of F, i.e. F(^θ⋆) = ^θ⋆. This implies that the equilibria of the dynamical system representation of EM naturally coincide with its fixed points.

Recall that limit points of the EM algorithm (1) consist of points ^θ⋆ for which there exists some ^θ0 such that ^θk → ^θ⋆ as k → ∞. However, EM need not be locally convergent near limit points of its dynamical system representation, which requires that ^θk → ^θ⋆ as k → ∞ for every ^θ0 in a small enough neighborhood of ^θ⋆. Such points are known as (locally) attractive in dynamical systems theory. In order to establish a key relationship between limit points of EM and equilibria of its dynamical systems representation, F needs to be continuous, as stated in Assumption 3.

###### Assumption 3.

The map (θ, ^θ) ↦ Q(θ, ^θ) is continuous.

This follows, for instance, if log p(θ|x) is continuous and, conditional on y and ^θ, x has finite support, which is the case for GMM clustering.

###### Lemma 1.

Under Assumptions 1–3, F is continuous.

Referring the reader to the supplementary section for the proof of Lemma 1, we move ahead to establish a key relationship between limit points of EM and its dynamical systems representation.

###### Proposition 2.

Under Assumptions 1–3, any limit point of the EM algorithm (1) is also a fixed point of F given by (5), and thus an equilibrium of EM’s dynamical system representation (5).

The reciprocal, however, is not true in general, as is known to occur for unstable equilibria of nonlinear systems. Therefore, we now focus on another key concept in dynamical systems theory – that of (Lyapunov) stability – and proceed to study its relationship with the convergence of the EM algorithm.

### 3.2 Lyapunov Stability

We say that ^θ⋆ in the parameter space is a stable point of the system (5) if the trajectory {^θk} remains arbitrarily close to ^θ⋆, provided that ∥^θ0 − ^θ⋆∥ is sufficiently small. More precisely, ^θ⋆ is stable if and only if, for any ε > 0, there exists some δ > 0 such that, for every ^θ0 satisfying ∥^θ0 − ^θ⋆∥ < δ, we have ∥^θk − ^θ⋆∥ < ε for every k ≥ 0. If, in addition, there exists some small enough δ > 0 such that, for every ^θ0 satisfying ∥^θ0 − ^θ⋆∥ < δ, we have ^θk → ^θ⋆ as k → ∞, then we say ^θ⋆ is a (locally) asymptotically stable point of the system. In other words, asymptotically stable points are simply stable and attractive points of the system. If the attractiveness is global, meaning that δ can be made arbitrarily large, then we say that ^θ⋆ is globally asymptotically stable. Finally, if in addition there exist c > 0 and μ ∈ (0, 1) such that, for every ^θ0 satisfying ∥^θ0 − ^θ⋆∥ < δ, we have ∥^θk − ^θ⋆∥ ≤ c μ^k ∥^θ0 − ^θ⋆∥ for every k ≥ 0, then we say that ^θ⋆ is a (locally) exponentially stable point of the system. Global exponential stability holds if δ can be made arbitrarily large.
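As a toy illustration of these definitions (deliberately unrelated to EM), consider the scalar system with the contraction map F(θ) = θ⋆ + ½(θ − θ⋆): every trajectory obeys ∥^θk − θ⋆∥ = (½)^k ∥^θ0 − θ⋆∥, so θ⋆ is globally exponentially stable with c = 1 and μ = ½. The equilibrium value and initial state below are arbitrary.

```python
theta_star = 2.0  # the equilibrium of this toy system
F = lambda t: theta_star + 0.5 * (t - theta_star)  # contraction toward theta_star

theta, traj = 10.0, []
for _ in range(20):
    traj.append(theta)
    theta = F(theta)

# Exponential-stability bound with c = 1 and mu = 0.5:
ok = all(
    abs(t - theta_star) <= 0.5 ** k * abs(traj[0] - theta_star) + 1e-12
    for k, t in enumerate(traj)
)
assert ok
```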

###### Lemma 2.

Consider the dynamical system (5) with an arbitrary continuous function . Then, every stable point is also an equilibrium.

In particular, we can now see that, under Assumptions 1–3, every stable point of the dynamical system that represents EM is also a fixed point of EM. Furthermore, it should be clear that local maxima of the incomplete-data log-posterior (that happen to be asymptotically stable) must be locally convergent points of EM. On the other hand, exponentially stable local maxima lead EM to attain a linear convergence rate with rate μ, where μ originates from the definition of exponential stability. In particular, if the cost function ℓ is Lipschitz continuous, then clearly ℓ(^θk) − ℓ(^θ⋆) = O(μ^k). The relationships between all the aforementioned concepts are summarized in supplementary section B.

To establish the different notions of stability, we use the ideas proposed by Lyapunov, which we summarize in Lemma 3 in what we refer to as the “Lyapunov theorem” (actually a collection of results). Before stating the Lyapunov theorem, let us introduce some convenient terminology borrowed from nonlinear systems theory. We say that a function V is:

1. (locally) positive semidefinite w.r.t. ^θ⋆ if V(^θ⋆) = 0 and V(^θ) ≥ 0 for ^θ near ^θ⋆;

2. (locally) positive definite w.r.t. ^θ⋆ if V(^θ) ≥ 0, with equality if and only if ^θ = ^θ⋆, for every ^θ near ^θ⋆;

3. (locally) negative semidefinite (respectively, negative definite) if −V is positive semidefinite (respectively, positive definite);

4. radially unbounded if the state space is an entire Euclidean space and V(^θ) → ∞ as ∥^θ∥ → ∞.

Global variants of definitions 1–3 hold if ^θ can be picked anywhere in the state space. Finally, we say that a scalar function α : [0, ∞) → [0, ∞) is of class K if α(0) = 0 and if it is continuous and strictly increasing.

We are now ready to state Lyapunov’s theorem for generic discrete-time systems of the form (5). These results, and Lyapunov-based results alike, are of crucial importance in practice for showing the different forms of stability of nonlinear systems.

###### Lemma 3 (Lyapunov theorem).

Consider the dynamical system (5) with an arbitrary continuous function F, and let ^θ⋆ be an arbitrary point in the state space. Let V be a continuous function and define ΔV(^θ) ≜ V(F(^θ)) − V(^θ). Consider the following assertions about these functions:

1. V is positive definite w.r.t. ^θ⋆;

2. ΔV is negative semidefinite w.r.t. ^θ⋆;

3. ΔV is negative definite w.r.t. ^θ⋆;

4. V and −ΔV are both globally positive definite w.r.t. ^θ⋆ and V is radially unbounded;

5. There exist class-K functions α1, α2, α3 of the form αi(s) = ai s^λ near s = 0, for some λ > 0, such that, for every ^θ near ^θ⋆:

 α1(∥^θ − ^θ⋆∥) ≤ V(^θ) ≤ α2(∥^θ − ^θ⋆∥) (6a)
 ΔV(^θ) ≤ −α3(∥^θ − ^θ⋆∥). (6b)

Then,  is


• stable if 1 and 2 hold;

• asymptotically stable if 1 and 3 hold;

• globally asymptotically stable if 4 holds;

• exponentially stable if 5 holds.

See the supplementary section for the proof. Notice that ΔV was designed so that ΔV(^θk) = Vk+1 − Vk, where Vk ≜ V(^θk) and ^θk+1 = F(^θk). Furthermore, it is intended to represent a discrete-time equivalent to the time derivative V̇ of continuous-time Lyapunov theory. The negative semidefiniteness of ΔV simply translates to non-strict monotonicity of {Vk}, which can be interpreted as a surrogate to V̇ ≤ 0.

## 4 Main Results: Convergence of EM

We now provide conditions that establish the different notions of Lyapunov stability explored thus far, for the dynamical system representation of the EM algorithm, and make appropriate conclusions in terms of the convergence of EM. To proceed, we first propose the natural candidate for a Lyapunov function in optimization, the function

 V(^θ)≜ℓ(^θ)−ℓ(^θ⋆)=logp(^θ⋆|y)−logp(^θ|y), (7)

where ^θ⋆ is a particular strict local maximum of interest (fixed for the remainder of this section), i.e. a particular point in the parameter space (state space) to which we seek convergence of EM. Since employing the Lyapunov theorem requires V to be continuous, we make a mild assumption on the continuity of the incomplete-data posterior and then establish (non-asymptotic) stability.

###### Assumption 4.

p(θ|y) is continuous in θ.

###### Proposition 3.

Under Assumptions 1–4, any strict local maximum of the incomplete-data log-posterior is a stable equilibrium of the dynamical system (5) that represents EM.

###### Proof.

Consider the candidate Lyapunov function (7), defined for ^θ near ^θ⋆. Clearly, V is positive definite w.r.t. ^θ⋆. Furthermore, notice that, since F(^θ) = argmaxθ Q(θ, ^θ), then Q(F(^θ), ^θ) ≥ Q(^θ, ^θ). Therefore, we have

 log p(F(^θ)|y) − DKL[pX(⋅|y, ^θ) ∥ pX(⋅|y, F(^θ))] = Q(F(^θ), ^θ)
 ≥ log p(^θ|y) − DKL[pX(⋅|y, ^θ) ∥ pX(⋅|y, ^θ)] = Q(^θ, ^θ) = log p(^θ|y), (8)

where d is defined in (4). In particular, we have

 ΔV(^θ)=V(F(^θ))−V(^θ)=logp(^θ|y)−logp(F(^θ)|y)≤−d(F(^θ),^θ)≤0, (9a)

thus the result follows by invoking Lyapunov’s theorem. ∎
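The inequality (9a), i.e. ΔV ≤ 0, says that the incomplete-data log-posterior never decreases along EM iterates. This monotone-ascent property is easy to verify numerically; the sketch below does so for a hypothetical two-component GMM with known weights and unit variances (flat prior, so the log-posterior equals the log-likelihood up to a constant).

```python
import numpy as np

rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(-1.0, 1.0, 150), rng.normal(2.0, 1.0, 150)])
w = np.array([0.5, 0.5])  # known weights, unit variances

def dens(mu):
    return w * np.exp(-0.5 * (y[:, None] - mu) ** 2) / np.sqrt(2 * np.pi)

def loglik(mu):
    """Incomplete-data log-likelihood of the candidate means."""
    return float(np.log(dens(mu).sum(axis=1)).sum())

def em_step(mu):
    r = dens(mu)
    r /= r.sum(axis=1, keepdims=True)          # E-step responsibilities
    return (r * y[:, None]).sum(axis=0) / r.sum(axis=0)  # M-step

mu, vals = np.array([0.0, 0.5]), []
for _ in range(50):
    vals.append(loglik(mu))
    mu = em_step(mu)

# Delta V <= 0  <=>  monotone ascent of the log-likelihood along EM iterates.
assert all(b >= a - 1e-9 for a, b in zip(vals, vals[1:]))
```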

To establish asymptotic stability, it suffices that the inequality in (8) be strict, since, through the same argument used for non-asymptotic stability, we then have ΔV(^θ) < 0 for every ^θ ≠ ^θ⋆ with ∥^θ − ^θ⋆∥ sufficiently small. In order to achieve this, we make the following assumption.

###### Assumption 5.

d(θ, ^θ) = 0 if and only if θ = ^θ, for ^θ near ^θ⋆ and arbitrary θ.

This is also a relatively mild condition when ^θ⋆ is a “good” local minimizer of ℓ. It follows, for instance, from a strong form of identifiability of the parameterized posterior latent distribution. It suffices that, for ^θ near ^θ⋆, we have θ = ^θ whenever

 Px∼p(x|y,^θ)[p(x|y,θ)=p(x|y,^θ)]=1. (10)

Failure to satisfy Assumption 5 near a particular strict local maximizer ^θ⋆ could result in that point not being a fixed point of EM (i.e. an equilibrium of EM’s dynamical system representation), let alone an asymptotically stable point, and thus not a locally convergent point or even a limit point. Unfortunately, it is a well-documented behavior of the EM algorithm that its convergence properties and overall performance heavily rely on whether the initialization occurred near a “good” local maximizer of the incomplete-data log-likelihood or log-posterior [Figueiredo2004]. In the context of GMM clustering, for instance, Assumption 5 simply means that, after having sampled the GMM at hand, the different posterior class probabilities will strictly depend on any perturbation to the parameters in the model (namely, the prior class probabilities and their corresponding means and covariance matrices).

### 4.1 Local and Global Convergence of EM

We now formally state and prove the local convergence of EM as a consequence of asymptotic Lyapunov stability. The proof is the straightforward culmination of the previous section’s discussion.

###### Theorem 1 (Local Convergence).

Let ^θ⋆ be a strict local maximizer of the incomplete-data posterior and an isolated fixed point of EM. If Assumptions 1–5 hold, then ^θ⋆ is an asymptotically stable equilibrium of the dynamical system (5) that represents the EM algorithm (1), and thus EM is locally convergent to ^θ⋆.

###### Proof.

Consider, once again, the candidate Lyapunov function (7), defined for ^θ near ^θ⋆. Following the same strategy used in the proof of Proposition 3, we see that V is continuous and positive definite w.r.t. ^θ⋆, but now ΔV is not only negative semidefinite w.r.t. ^θ⋆ but also negative definite, due to Assumption 5. ∎

In practice, since fixed points of EM must be stationary points of the log-posterior [Figueiredo2004], a sufficient condition for the isolated-fixed-point assumption to hold would be the continuous differentiability of the log-posterior near ^θ⋆, together with ^θ⋆ being an isolated stationary point. We now state the conditions that lead to global convergence of EM.

###### Theorem 2.

Suppose that the parameter space is an entire Euclidean space and that the conditions of Theorem 1 hold, with Assumption 5 holding globally (i.e. d(θ, ^θ) = 0 if and only if θ = ^θ, for every θ and ^θ). If V is radially unbounded and ^θ⋆ is the only fixed point of EM, then ^θ⋆ is a globally asymptotically stable equilibrium of the dynamical system (5) that represents the EM algorithm (1), and thus EM is globally convergent to ^θ⋆.

###### Proof.

We can reuse the argument in the proofs of Proposition 3 and Theorem 1 to check that condition 4 of the Lyapunov theorem holds with the additional assumptions. ∎

In particular, the uniqueness of the fixed point follows, in the case of a continuously differentiable incomplete-data log-posterior, for unimodal distributions. Furthermore, we require the support of the posterior to be the entire Euclidean space of appropriate dimension, and the posterior to vanish radially (i.e. p(θ|y) → 0 as ∥θ∥ → ∞). These last two conditions are largely technical and can be roughly circumvented in practice (for absolutely continuous posterior distributions). On the other hand, unimodality rarely occurs and, in fact, EM can easily converge to saddle points or diverge.
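One simple way the unimodality requirement fails in practice: for mixture models the likelihood is invariant under label permutations, so distinct fixed points always coexist and global convergence to a single ^θ⋆ cannot hold. A minimal sketch with hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(3)
y = np.concatenate([rng.normal(-2.0, 1.0, 150), rng.normal(2.0, 1.0, 150)])
w = np.array([0.5, 0.5])  # known weights, unit variances

def em_step(mu):
    dens = w * np.exp(-0.5 * (y[:, None] - mu) ** 2)
    r = dens / dens.sum(axis=1, keepdims=True)
    return (r * y[:, None]).sum(axis=0) / r.sum(axis=0)

def run(mu0, iters=200):
    mu = np.array(mu0, float)
    for _ in range(iters):
        mu = em_step(mu)
    return mu

a = run([-1.0, 1.0])   # converges near (-2, 2)
b = run([1.0, -1.0])   # converges near (2, -2): a distinct, label-swapped fixed point
assert np.allclose(a, b[::-1], atol=1e-6)
assert np.linalg.norm(a - b) > 1.0
```

Both limits are legitimate local maxima of the same likelihood, so no single point can be globally asymptotically stable here.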

### 4.2 Linear and Quadratic Convergence of EM

Returning to the assumptions that lead to exponential stability, and thus local convergence of EM, we now explore alternatives that can lead to linear and quadratic convergence of EM. We achieve this by further strengthening the strong identifiability Assumption 5. In addition, we want condition 5 of the Lyapunov theorem to be satisfied for the candidate Lyapunov function (7).

###### Assumption 6.

There exist some class-K functions α1, α2, α3 such that αi(s) = ai s^λ holds near s = 0, for some λ > 0, and, for every θ and ^θ with ^θ near ^θ⋆,

 e^−α2(∥^θ−^θ⋆∥) ≤ p(^θ|y) / p(^θ⋆|y) ≤ e^−α1(∥^θ−^θ⋆∥),  and  d(θ, ^θ) ≥ α3(∥^θ−^θ⋆∥). (11)

This assumption sets the stage to state conditions for the linear convergence (in the sense of optimization) of the EM algorithm iterates.

###### Theorem 3 (Linear convergence of iterates).

Under the conditions of Theorem 1 and Assumption 6, ^θ⋆ is an exponentially stable equilibrium (with some rate μ ∈ (0, 1)) of the dynamical system (5) that represents the EM algorithm (1), and thus ∥^θk − ^θ⋆∥ = O(μ^k).

Similarly to what is noted in [Taylor2018], if ℓ is Lipschitz continuous, then we clearly have ℓ(^θk) − ℓ(^θ⋆) = O(μ^k). In that sense, the sequence of objective values actually converges quadratically, in the sense of optimization. Alternatively, a weaker condition would have sufficed to achieve the same conclusion. However, as discussed in Section 3.2, we can relax the condition in Lyapunov’s theorem required for exponential stability into something that leads to a notion of stability stronger than asymptotic stability but weaker than exponential stability, and which allows us to directly establish the Q-linear convergence of {Vk}.

###### Theorem 4 (Quadratic convergence of posterior).

Under the conditions of Theorem 1, if there exists some μ ∈ (0, 1) such that

 p(^θ|y) ≥ p(^θ⋆|y) ⋅ e^−d(F(^θ),^θ)/(1−μ), (12)

for every ^θ near ^θ⋆, then Vk ≜ V(^θk) → 0 with a Q-linear convergence rate upper bounded by μ. In particular, we have Vk ≤ μ^k V0.

Notice that (12) can be restated as d(F(^θ), ^θ) ≥ (1 − μ) V(^θ) for ^θ near ^θ⋆, to more closely resemble (11). Naturally, the main disadvantage of the last result is that checking the inequality (12) is likely to be virtually impossible in practice, given that it is directly based on the EM dynamics F. Furthermore, it has been widely observed that EM’s convergence rate is often sublinear, so the conditions in the previous results likely only hold near a few “good” local maximizers of the incomplete-data log-posterior. The last result focuses on linear convergence of the posterior, by using V(^θ) ≜ p(^θ⋆|y) − p(^θ|y) as the new candidate Lyapunov function.

###### Theorem 5 (Linear convergence of the posterior).

Under the conditions of Theorem 1, if the concavity-like condition

 p(F(^θ)|y) ≥ μ p(^θ|y) + (1 − μ) p(^θ⋆|y), (13)

holds for every ^θ near ^θ⋆, then p(^θk|y) → p(^θ⋆|y) with a Q-linear convergence rate upper bounded by μ.

For ease of readability, the proofs of Theorems 3–5 are detailed in the supplementary section.

### 4.3 Experimental validation of the convergence properties

We demonstrate the convergence properties of the MAP-EM algorithm on a general GMM with independent Gaussian priors on the unknown means. The details can be found in Appendix K, where we can see that, as the prior becomes more informative, convergence is achieved at a faster rate. We also note that our convergence results readily apply to the MAP-EM algorithm for many distributions other than GMMs, and thus to the wide range of applications in which it has been employed.
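Since Appendix K is not reproduced here, the following self-contained sketch illustrates the kind of experiment described: MAP-EM for a two-component GMM with known weights and unit variances, independent Gaussian priors N(m0_j, τ²) on the unknown means, and monotone ascent of the (unnormalized) log-posterior along the iterates. All data and hyperparameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
y = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(2.0, 1.0, 200)])
w, s2 = np.array([0.5, 0.5]), 1.0        # known weights and component variance
m0, tau2 = np.array([-1.0, 1.0]), 4.0    # Gaussian prior N(m0_j, tau2) on each mean

def resp(mu):
    dens = w * np.exp(-0.5 * (y[:, None] - mu) ** 2 / s2)
    return dens / dens.sum(axis=1, keepdims=True)

def map_em_step(mu):
    """M-step of MAP-EM: a precision-weighted average of data and prior mean."""
    r = resp(mu)
    return ((r * y[:, None]).sum(axis=0) / s2 + m0 / tau2) / (
        r.sum(axis=0) / s2 + 1.0 / tau2
    )

def log_post(mu):
    """log p(mu | y) up to an additive constant."""
    dens = w * np.exp(-0.5 * (y[:, None] - mu) ** 2 / s2)
    return float(np.log(dens.sum(axis=1)).sum() - 0.5 * ((mu - m0) ** 2 / tau2).sum())

mu, trace = np.array([0.0, 0.1]), []
for _ in range(100):
    trace.append(log_post(mu))
    mu = map_em_step(mu)

assert all(b >= a - 1e-9 for a, b in zip(trace, trace[1:]))  # monotone ascent
```

The precision-weighted M-step shows where the prior enters: as τ² shrinks (a more informative prior), the update is pulled more strongly toward m0.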

## 5 Conclusion and Next Steps

In this paper, we addressed a gap in analyzing and designing optimization algorithms from the point of view of dynamical systems and control theory. Most recent literature largely focuses on continuous-time representations (via ordinary differential equations or inclusions) of general-purpose gradient-based nonlinear optimization algorithms. In contrast, we provide a unifying framework for the study of iterative optimization algorithms as discrete-time dynamical systems, and describe several relationships between forms of Lyapunov stability of state-space dynamical systems and convergence of optimization algorithms. In particular, we explored how exponential stability can be used to derive linear (or superlinear) convergence rates. We then narrowed this framework in detail to analyze convergence of the expectation-maximization (EM) algorithm for maximum a posteriori (MAP) estimation. Following first principles from dynamical systems stability theory, conditions for convergence of MAP-EM were developed, including conditions that show fast convergence (linear or quadratic), even though EM often converges sublinearly.

The conditions we derived have a convenient statistical and information-theoretic interpretation and may thus be used in the future to design other novel EM-like algorithms with provable convergence rates. These conditions would have been difficult to unveil without our approach, and thus we argue that a treatment similar to ours can be adopted for the convergence analysis of many other algorithms in machine learning. For future work, we believe that our approach can prove valuable in the design of EM-like algorithms for online estimation subject to concept drift or data poisoning. In fact, by carefully designing a controller (input) on an otherwise unstable system, we can leverage a process known as stabilization to enforce different notions of stability in the dynamical systems that represent an optimization algorithm, and thus improve its convergence rate or robustness.

## Acknowledgements

This work was supported by the Rensselaer-IBM AI Research Collaboration (http://airc.rpi.edu), part of the IBM AI Horizons Network (http://ibm.biz/AIHorizons).

## Appendix A Convergence via Subexponential Stability

As is often remarked in the context of EM, monotonicity alone is generally not enough to attain convergence. Nevertheless, the negative definiteness condition ensures asymptotic stability, and thus (local) convergence. Indeed, from an optimization perspective, the Lyapunov conditions for exponential stability are simply equivalent to strict monotonicity of {Vk}, together with an attainable lower bound of the surrogate V of ℓ at ^θ⋆.

Note that from (6a) it follows that

 ΔV(^θ) ≤ −(α3 ∘ α2^−1)(V(^θ)) (14)

for every  near , which can be restated as

 Vk+1 ≤ (id − α3 ∘ α2^−1)(Vk), (15)

for ^θk near ^θ⋆, where Vk ≜ V(^θk). Therefore, Vk ≤ g^k(V0), where g ≜ id − α3 ∘ α2^−1 and g^k denotes its k-fold composition. In particular, if αi(s) = ai s^λ with a3 ≤ a2 and λ > 0, then (14) and (15) become

 ΔVk≤−(1−μ)Vk (16)

and

 Vk+1≤μVk, (17)

respectively, where μ ≜ 1 − a3/a2 ∈ [0, 1). In that case, we have Vk → 0 as k → ∞, with a Q-linear convergence rate upper bounded by μ. In particular, we have Vk ≤ μ^k V0. We further note that a1 and λ do not directly influence the bound μ. This observation is similar in spirit to Lemma 1 in [Aitken1994].
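A quick numerical sanity check of this derivation, with hypothetical coefficients a2, a3 (a3 ≤ a2): the recursion (17) yields geometric decay at rate μ = 1 − a3/a2, independently of a1 and λ.

```python
a2, a3 = 2.0, 0.5          # hypothetical class-K coefficients with a3 <= a2
mu = 1.0 - a3 / a2         # contraction factor from (16)-(17); here 0.75

V = [1.0]                  # V_0
for _ in range(30):
    V.append(mu * V[-1])   # worst case: equality in (17)

# V_k <= mu**k * V_0 for all k (here with equality).
assert all(V[k] <= mu ** k * V[0] + 1e-12 for k in range(len(V)))
```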

With these remarks in consideration, we note that if $V$ is continuous, positive definite w.r.t. $\hat{\theta}^\star$, and (16) (equivalently, (17)) holds for $\hat{\theta}_k$ near $\hat{\theta}^\star$, then the linear rate for $V_k$ still holds, without necessarily having linear convergence of the iterates $\hat{\theta}_k$. Therefore, (14) without necessarily (6a) induces a weaker notion than exponential stability, but stronger than asymptotic stability. In fact, it coincides with the notion of stability in terms of two measures [Lakshmikantham2002].

From an optimization perspective, we can leverage the ideas discussed in the previous paragraphs by noting that we are often concerned with the convergence rate in terms of the objective function, or some meaningful surrogate of it, rather than the iterates of the numerical optimization scheme directly. This approach was implicit, for instance, in [Taylor2018, Vaquero2019].

## Appendix B Relationships between stability and convergence

We represent the relationships between all the aforementioned concepts (local variants only, for the sake of simplicity) through the following diagram:

```
exponentially stable  ⟹  linearly convergent
         ⇓                        ⇓
asymptotically stable ⟹  locally convergent
         ⇓                        ⇓
     stable point            limit point
         ⇓                        ⇓
     equilibrium      ⟺     fixed point
```

## Appendix C Proof of Proposition 1

From Assumption 1, we have

$$p(y \mid x) = p(y \mid x, \theta) = \frac{p(y \mid \theta)}{p(x \mid \theta)}\, p(x \mid y, \theta) = \frac{1}{p(\theta \mid x)} \frac{p(y)}{p(x)}\, p(\theta \mid y)\, p(x \mid y, \theta), \tag{18}$$

and therefore

$$p(\theta \mid x) = \frac{p(y)/p(x)}{p(y \mid x)}\, p(\theta \mid y)\, p(x \mid y, \theta) = \frac{p(\theta \mid y)\, p(x \mid y, \theta)}{p(x \mid y)}. \tag{19}$$

Let $\sim$ indicate equality up to an additive term that is constant in $\theta$. Then, we have

$$\log p(\theta \mid x) \sim \log p(\theta \mid y) + \log p(x \mid y, \theta) \sim \log p(\theta \mid y) - \log\frac{p(x \mid y, \hat{\theta})}{p(x \mid y, \theta)} \tag{20}$$

for any fixed $\hat{\theta}$. Taking the expected value w.r.t. $x \sim p(\,\cdot \mid y, \hat{\theta})$ and attending to the definitions of EM (1) and KL divergence (Subsection 2.1), the expression (2) follows. Finally, (3) readily follows by combining (1) and (2) and disregarding any additive terms that do not depend on $\theta$, since those will not affect the so-called M-step (maximization step) of the EM algorithm, i.e. the maximization in (1).
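The identity (19) is straightforward to sanity-check numerically. The following sketch is our own toy construction (all distributions chosen arbitrarily): it builds a discrete model in which $y = f(x)$ is a deterministic function of the complete data $x$, so that $p(y \mid x) = p(y \mid x, \theta)$ as in Assumption 1, and verifies $p(\theta \mid x) = p(\theta \mid y)\, p(x \mid y, \theta) / p(x \mid y)$ for every $x$.

```python
import numpy as np

# Toy discrete model: theta in {0, 1}, complete data x in {0, 1, 2},
# incomplete data y = f(x) deterministic. All probabilities are arbitrary.
p_theta = np.array([0.3, 0.7])                 # prior p(theta)
p_x_given_theta = np.array([[0.2, 0.5, 0.3],   # p(x | theta=0)
                            [0.6, 0.1, 0.3]])  # p(x | theta=1)
f = np.array([0, 1, 0])                        # y = f(x)

# Joint p(theta, x) and the exact posterior p(theta | x).
joint = p_theta[:, None] * p_x_given_theta     # shape (theta, x)
p_x = joint.sum(axis=0)
p_theta_given_x = joint / p_x

for x in range(3):
    y = f[x]
    mask = (f == y)                            # x-values mapping to this y
    p_y_given_theta = p_x_given_theta[:, mask].sum(axis=1)
    p_y = (p_theta * p_y_given_theta).sum()
    p_theta_given_y = p_theta * p_y_given_theta / p_y
    p_x_given_y_theta = p_x_given_theta[:, x] / p_y_given_theta
    p_x_given_y = p_x[x] / p_y                 # valid since y = f(x)
    rhs = p_theta_given_y * p_x_given_y_theta / p_x_given_y
    assert np.allclose(p_theta_given_x[:, x], rhs)   # identity (19)
```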

## Appendix D Proof of Lemma 1

Let $\{\hat{\theta}_k\}$ be a convergent sequence, not necessarily generated by the EM algorithm (i.e. without necessarily having $\hat{\theta}_{k+1} = F(\hat{\theta}_k)$), with limit $\hat{\theta}$. Notice that

$$\begin{aligned}
Q\Big(\lim_{k\to\infty} F(\hat{\theta}_k),\, \hat{\theta}\Big) &= Q\Big(\lim_{k\to\infty} F(\hat{\theta}_k),\, \lim_{k\to\infty} \hat{\theta}_k\Big) && \text{(21a)} \\
&= \lim_{k\to\infty} Q\big(F(\hat{\theta}_k),\, \hat{\theta}_k\big) && \text{(21b)} \\
&\ge \lim_{k\to\infty} Q\big(F(\hat{\theta}),\, \hat{\theta}_k\big) && \text{(21c)} \\
&= Q\Big(F(\hat{\theta}),\, \lim_{k\to\infty} \hat{\theta}_k\Big) && \text{(21d)} \\
&= Q\big(F(\hat{\theta}),\, \hat{\theta}\big) && \text{(21e)} \\
&\ge Q\Big(\lim_{k\to\infty} F(\hat{\theta}_k),\, \hat{\theta}\Big), && \text{(21f)}
\end{aligned}$$

where (21b) and (21d) both follow from the continuity of the $Q$-function (Assumption 3). On the other hand, the inequalities (21c) and (21f) follow by noting that $Q(F(\hat{\theta}'), \hat{\theta}') \ge Q(\theta, \hat{\theta}')$ for every $\theta$ and $\hat{\theta}'$. In particular, they follow by choosing $(\theta, \hat{\theta}') = (F(\hat{\theta}), \hat{\theta}_k)$ and $(\theta, \hat{\theta}') = (\lim_{k\to\infty} F(\hat{\theta}_k), \hat{\theta})$, respectively, and taking the limit on both sides.

Therefore, we have

$$Q\Big(\lim_{k\to\infty} F(\hat{\theta}_k),\, \hat{\theta}\Big) = Q\big(F(\hat{\theta}),\, \hat{\theta}\big) = \max_\theta Q(\theta, \hat{\theta}), \tag{22}$$

and thus $\lim_{k\to\infty} F(\hat{\theta}_k) = F(\hat{\theta})$, which in turn makes $F$ continuous.

## Appendix E Proof of Proposition 2

Let $\hat{\theta}^\star$ be such that $\hat{\theta}_k \to \hat{\theta}^\star$, where $\{\hat{\theta}_k\}$ was generated by (1). Then,

$$F(\hat{\theta}^\star) = F\Big(\lim_{k\to\infty} \hat{\theta}_k\Big) = \lim_{k\to\infty} F(\hat{\theta}_k) = \lim_{k\to\infty} \hat{\theta}_{k+1} = \hat{\theta}^\star, \tag{23}$$

where the second equality follows from the continuity of $F$.

## Appendix F Proof of Lemma 2

Let such that as and let be such that, if , then for every . Naturally, , and thus as . Therefore, given any sequence such that , then as . Furthermore, , and thus as . From the continuity of , it follows that .

## Appendix G Proof of Lemma 3 (Lyapunov Theorem)

For the non-asymptotic and asymptotic stability parts of the Lyapunov theorem, please refer to the proofs of Theorems 5.9.1–5.9.2 in [Agarwal2000], Corollary 4.8.1 and Theorem 4.8.3 in [Lakshmikantham2002], or Theorem 1.2 in [Bof2018]. For global asymptotic stability, please refer to Theorem 5.9.8 in [Agarwal2000], Theorem 4.9.1 in [Lakshmikantham2002], or Theorem 1.4 in [Bof2018].

For exponential stability, we can adapt the ideas in [Aitken1994] and the proof of Theorem 5.4 in [Bof2018]. More precisely, we have

$$\alpha_1\big(\|\hat{\theta}_{k+1} - \hat{\theta}^\star\|\big) \le V_{k+1} = \Delta V_k + V_k \le \alpha_2\big(\|\hat{\theta}_k - \hat{\theta}^\star\|\big) - \alpha_3\big(\|\hat{\theta}_k - \hat{\theta}^\star\|\big) \le \alpha_1\big(\rho\, \|\hat{\theta}_k - \hat{\theta}^\star\|\big), \tag{24}$$

and thus $\|\hat{\theta}_{k+1} - \hat{\theta}^\star\| \le \rho\, \|\hat{\theta}_k - \hat{\theta}^\star\|$, since $\alpha_1$ is strictly increasing. In other words, $\hat{\theta}_k \to \hat{\theta}^\star$ Q-linearly, with a rate upper bounded by $\rho$.

## Appendix H Proof of Theorem 3 (Linear Convergence of Iterates)

Consider the candidate Lyapunov function $V(\hat{\theta}) = \log p(\hat{\theta}^\star \mid y) - \log p(\hat{\theta} \mid y)$, which is continuous due to Assumption 4. As seen in the proof of Proposition 3, we have (6a). But then, from (11), we obtain (6b). Therefore, condition 5 of Lemma 3 is satisfied, and thus exponential stability is certified.

## Appendix I Proof of Theorem 4 (Quadratic Convergence of Posterior)

Consider once again the continuous and positive definite candidate Lyapunov function $V_k = \log p(\hat{\theta}^\star \mid y) - \log p(\hat{\theta}_k \mid y)$. We will leverage the ideas in Appendix A. Indeed, notice that (12) can be rewritten as

$$p(\hat{\theta}_k \mid y) \le p(\hat{\theta}^\star \mid y)\, e^{-\frac{d(\hat{\theta}_{k+1},\, \hat{\theta}_k)}{1-\mu}}. \tag{25}$$

Taking the log and rearranging terms, we get

$$d(\hat{\theta}_{k+1}, \hat{\theta}_k) \le (1-\mu)\big[\log p(\hat{\theta}^\star \mid y) - \log p(\hat{\theta}_k \mid y)\big] = (1-\mu)\,V_k. \tag{26}$$

Finally, since , then  follows.

## Appendix J Proof of Theorem 5 (Linear Convergence of Posterior)

This time, we consider the continuous and positive definite candidate Lyapunov function $V_k = p(\hat{\theta}^\star \mid y) - p(\hat{\theta}_k \mid y)$. Clearly, (13) can be rewritten as

$$p(\hat{\theta}_{k+1} \mid y) \ge \mu\, p(\hat{\theta}_k \mid y) + (1-\mu)\, p(\hat{\theta}^\star \mid y), \tag{27}$$

and thus

$$V_{k+1} = p(\hat{\theta}^\star \mid y) - p(\hat{\theta}_{k+1} \mid y) \le \mu\big(p(\hat{\theta}^\star \mid y) - p(\hat{\theta}_k \mid y)\big) = \mu\, V_k. \tag{28}$$

The result follows from the discussion in Appendix A.
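To see the recursion (28) in action, here is a minimal numerical sketch (our own construction; the values of $\mu$ and the posteriors are arbitrary) that iterates the extreme case of (27) with equality, so the posterior gap $V_k$ contracts by exactly $\mu$ per step.

```python
# Hypothetical values: mu in (0, 1), p_star = p(theta* | y), p = p(theta_0 | y).
mu, p_star = 0.8, 0.9
p = 0.1
V = [p_star - p]                      # V_0 = p(theta* | y) - p(theta_0 | y)
for _ in range(15):
    p = mu * p + (1 - mu) * p_star    # (27) with equality
    V.append(p_star - p)              # then V_{k+1} = mu * V_k, as in (28)

# Linear convergence of the posterior gap: V_k = mu^k * V_0.
assert all(abs(V[k] - mu ** k * V[0]) < 1e-9 for k in range(len(V)))
```

When (27) holds with strict inequality, the same loop yields $V_{k+1} \le \mu V_k$ instead of equality, and the bound $V_k \le \mu^k V_0$ from Appendix A still applies.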

## Appendix K An Illustrative Example & Experiments

Let be some dataset that we wish to cluster. Let us assume that is randomly, but independently, sampled from an unknown class , which we thus seek to infer. For the sake of simplicity, consider that the class probabilities such that