 # Fitting A Mixture Distribution to Data: Tutorial

This paper is a step-by-step tutorial for fitting a mixture distribution to data. It assumes only a background in calculus and linear algebra; other required background is briefly reviewed before explaining the main algorithm. In explaining the main algorithm, first, fitting a mixture of two distributions is detailed, with examples of fitting two Gaussians and two Poissons for the continuous and discrete cases, respectively. Thereafter, fitting several distributions in the general case is explained, and examples of several Gaussians (Gaussian Mixture Model) and several Poissons are again provided. Model-based clustering, as one of the applications of mixture distributions, is also introduced. Numerical simulations are provided for both the Gaussian and Poisson examples for the sake of better clarification.


## 1 Introduction

Every random variable can be considered as a sample from a distribution, whether a well-known distribution or a not very well-known (or "ugly") distribution. Some random variables are drawn from a single distribution, such as a normal distribution. But life is not always so easy! Most real-life random variables might have been generated from a mixture of several distributions rather than a single distribution. The mixture distribution is a weighted summation of $K$ distributions $\{g_1(x;\Theta_1),\dots,g_K(x;\Theta_K)\}$ where the weights $\{w_1,\dots,w_K\}$ sum to one. Every distribution in the mixture has its own parameter $\Theta_k$. The mixture distribution is formulated as:

$$f(x;\Theta_1,\dots,\Theta_K)=\sum_{k=1}^{K} w_k\, g_k(x;\Theta_k), \qquad (1)$$
$$\text{subject to}\quad \sum_{k=1}^{K} w_k=1.$$

The distributions can be from different families, for example beta and normal distributions. However, this makes the problem very complex and sometimes useless; therefore, the distributions in a mixture are mostly from one family (e.g., all normal distributions) but with different parameters. This paper aims to find the parameters $\{\Theta_1,\dots,\Theta_K\}$ of the distributions in the mixture as well as the weights $\{w_1,\dots,w_K\}$ (also called "mixing probabilities").

The remainder of this paper is organized as follows. Section 2 reviews some technical background required for explaining the main algorithm. Afterwards, the methodology of fitting a mixture distribution to data is explained in Section 3. In that section, first the mixture of two distributions, as a special case of mixture distributions, is introduced and analyzed. Then, the general mixture distribution is discussed. Meanwhile, examples of mixtures of Gaussians (for the continuous case) and Poissons (for the discrete case) are given for better clarification. Section 4 briefly introduces clustering as one of the applications of mixture distributions. In Section 5, the discussed methods are implemented through some simulations in order to give a better sense of how these algorithms work. Finally, Section 6 concludes the paper.

## 2 Background

This section reviews some technical background required for explaining the main algorithm. This review includes probability and Bayes rule, probability mass/density function, expectation, maximum likelihood estimation, expectation maximization, and Lagrange multiplier.

### 2.1 Probability and Bayes Rule

If $S$ denotes the total sample space and $A$ denotes an event in this sample space, the probability of event $A$ is:

$$P(A)=\frac{|A|}{|S|}. \qquad (2)$$

The conditional probability, i.e., the probability of occurrence of event $A$ given that event $B$ happens, is:

$$P(A|B)=\frac{P(A,B)}{P(B)} \qquad (3)$$
$$=\frac{P(B|A)\,P(A)}{P(B)}, \qquad (4)$$

where $P(B|A)$, $P(A|B)$, $P(A)$, and $P(B)$ are called the likelihood, posterior, prior, and marginal probabilities, respectively. If we assume that the event $A$ consists of some cases $\{A_1,\dots,A_n\}$, we can write:

$$P(A_i|B)=\frac{P(B|A_i)\,P(A_i)}{\sum_{j=1}^{n} P(B|A_j)\,P(A_j)}. \qquad (5)$$

Equations (4) and (5) are two versions of Bayes rule.
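
Equation (5) can be sanity-checked numerically. The numbers below (a diagnostic test with 95% sensitivity, a 5% false-positive rate, and 1% prevalence) are our own illustrative assumptions, not from the paper:

```python
# Numerical check of equation (5) with two cases A1 (has condition) and
# A2 (does not), and evidence B (a positive test). All numbers are made up.
p_B_given = {"A1": 0.95, "A2": 0.05}   # likelihoods P(B | Aj)
prior = {"A1": 0.01, "A2": 0.99}       # priors P(Aj)

marginal = sum(p_B_given[a] * prior[a] for a in prior)    # P(B), the denominator
posterior_A1 = p_B_given["A1"] * prior["A1"] / marginal   # P(A1 | B)
print(round(posterior_A1, 4))  # → 0.161
```

Even with a fairly accurate test, the posterior is only about 16% because the prior is so small; this is the classic base-rate effect that equation (5) captures.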

### 2.2 Probability Mass/Density Function

In discrete cases, the probability mass function is defined as:

$$f(x)=P(X=x), \qquad (6)$$

where $X$ and $x$ are a random variable and a number, respectively.

In continuous cases, the probability density function is:

$$f(x)=\lim_{\Delta x\to 0}\frac{P(x\le X\le x+\Delta x)}{\Delta x}=\frac{\partial P(X\le x)}{\partial x}. \qquad (7)$$

In this work, by a mixture of distributions, we mean a mixture of mass/density functions.

### 2.3 Expectation

Expectation means the value of a random variable on average. Therefore, expectation is a weighted average where the weights are the probabilities of the random variable taking its different values. In the discrete and continuous cases, the expectation is:

$$\mathbb{E}(X)=\sum_{\text{dom}\,x} x\,f(x), \qquad (8)$$
$$\mathbb{E}(X)=\int_{\text{dom}\,x} x\,f(x)\,dx, \qquad (9)$$

respectively, where $\text{dom}\,x$ is the domain of $X$. The conditional expectation is defined as:

$$\mathbb{E}_{X|Y}(X|Y)=\sum_{\text{dom}\,x} x\,f(x|y), \qquad (10)$$
$$\mathbb{E}_{X|Y}(X|Y)=\int_{\text{dom}\,x} x\,f(x|y)\,dx, \qquad (11)$$

for the discrete and continuous cases, respectively.

### 2.4 Maximum Likelihood Estimation

Assume we have a sample of size $n$, i.e., $\{x_1,\dots,x_n\}$. Also assume that we know the distribution from which this sample has been randomly drawn, but we do not know the parameters of that distribution. For example, we know it is drawn from a normal distribution, but the mean and variance of this distribution are unknown. The goal is to estimate the parameters of the distribution using the sample $\{x_1,\dots,x_n\}$ available from it. This estimation of parameters from the available sample is called "point estimation". One of the approaches for point estimation is Maximum Likelihood Estimation (MLE). As is obvious from its name, MLE deals with the likelihood of the data.

We postulate that the sample values $x_1,\dots,x_n$ are independent random variates of data having the sample distribution. In other words, the data have a joint distribution $f_X(x_1,\dots,x_n;\Theta)$ with parameter $\Theta$, and we assume the variates are independent and identically distributed (iid), i.e., $x_i \overset{iid}{\sim} f(x;\Theta)$ with the same parameter $\Theta$. Considering the Bayes rule, equation (4), we have:

$$L(\Theta|x_1,\dots,x_n)=\frac{f_X(x_1,\dots,x_n|\Theta)\,\pi(\Theta)}{f_X(x_1,\dots,x_n)}. \qquad (12)$$

The MLE aims to find the parameter $\Theta$ which maximizes the likelihood:

$$\hat{\Theta}=\arg\max_{\Theta} L(\Theta). \qquad (13)$$

According to the definition, the likelihood can be written as:

$$L(\Theta|x_1,\dots,x_n):=f(x_1,\dots,x_n;\Theta)\overset{(a)}{=}\prod_{i=1}^{n} f(x_i;\Theta), \qquad (14)$$

where $(a)$ is because the $x_1,\dots,x_n$ are iid. Note that in the literature, $L(\Theta|x_1,\dots,x_n)$ is also denoted by $L(\Theta)$ for simplicity.

Usually, for more convenience, we use log-likelihood rather than likelihood:

$$\ell(\Theta):=\log L(\Theta) \qquad (15)$$
$$=\log\prod_{i=1}^{n} f(x_i;\Theta)=\sum_{i=1}^{n}\log f(x_i;\Theta). \qquad (16)$$

Often, the logarithm is the natural logarithm, for the sake of compatibility with the exponential in the well-known normal density function. Notice that since the logarithm is a monotonic function, it does not change the location of the maximum of the likelihood.
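
As a concrete instance, setting the derivative of equation (16) to zero for a normal density yields the familiar closed forms: the MLE of the mean is the sample mean, and the MLE of the variance is the biased sample variance. The sketch below checks this on synthetic data; the true parameters ($\mu=2$, $\sigma=1.5$) are our own illustrative choices:

```python
import math
import random

# MLE for a normal sample: mu_hat = sample mean, sigma2_hat = biased
# sample variance (both follow from equation (16) by setting the
# derivatives with respect to mu and sigma to zero).
random.seed(0)
n = 10_000
sample = [random.gauss(2.0, 1.5) for _ in range(n)]  # drawn from N(2, 1.5**2)

mu_hat = sum(sample) / n
sigma2_hat = sum((x - mu_hat) ** 2 for x in sample) / n
print(mu_hat, sigma2_hat)  # close to 2 and 1.5**2 = 2.25
```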

### 2.5 Expectation Maximization

Sometimes, the data are not fully observable. For example, the data are known only to be either zero or greater than zero. As an illustration, assume the data are collected for a particular disease, but for the convenience of the patients participating in the survey, the severity of the disease is not recorded; only the existence or non-existence of the disease is reported. So, the data do not give us complete information, as the exact value of a nonzero datum is not observed.

In such cases, MLE cannot be directly applied because we do not have access to complete information and some data are missing. Here, Expectation Maximization (EM) is useful. The main idea of EM can be summarized in this short friendly conversation:

– What shall we do? The data is missing! The log-likelihood is not known completely so MLE cannot be used.
– Mmm, probably we can replace the missing data with something…
– Aha! Let us replace it with its mean.
– You are right! We can take the mean of log-likelihood over the possible values of the missing data. Then everything in the log-likelihood will be known, and then…
– And then we can do MLE!

Assume $D^{(\text{obs})}$ and $D^{(\text{miss})}$ denote the observed data (the existence/non-existence reports in the above example) and the missing data (the exact severities in the above example), respectively. The EM algorithm includes two main steps, i.e., the E-step and the M-step.

In the E-step, the expectation of the log-likelihood (equation (15)) is taken with respect to the missing data in order to obtain a mean estimate of it. Let $Q(\Theta)$ denote the expectation of the log-likelihood with respect to $D^{(\text{miss})}$:

$$Q(\Theta):=\mathbb{E}_{D^{(\text{miss})}|D^{(\text{obs})},\Theta}\big[\ell(\Theta)\big]. \qquad (17)$$

Note that in the above expectation, $D^{(\text{obs})}$ and $\Theta$ are conditioned on, so they are treated as constants and not as random variables.

In the M-step, the MLE approach is used where the log-likelihood is replaced with its expectation, i.e., $Q(\Theta)$; therefore:

$$\hat{\Theta}=\arg\max_{\Theta} Q(\Theta). \qquad (18)$$

These two steps are iteratively repeated until convergence of the estimated parameters $\hat{\Theta}$.
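
The two steps can be sketched as a generic loop. The `e_step` and `m_step` callables below are placeholders for the model-specific computations derived later in Section 3; the toy fixed-point in the usage line only illustrates the control flow, not a real model:

```python
# A minimal sketch of the EM control flow for a scalar parameter.
# e_step(theta) computes the expected statistics of the missing data given
# the current estimate; m_step(stats) returns the maximizer of Q.
def em(theta0, e_step, m_step, tol=1e-10, max_iter=1000):
    theta = theta0
    for _ in range(max_iter):
        stats = e_step(theta)        # E-step: expectation over missing data
        new_theta = m_step(stats)    # M-step: argmax of Q(theta)
        if abs(new_theta - theta) < tol:
            return new_theta         # converged
        theta = new_theta
    return theta

# Toy usage: a contraction whose fixed point is 4 (stands in for a real model).
print(em(0.0, e_step=lambda t: t, m_step=lambda s: 0.5 * s + 2.0))
```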

### 2.6 Lagrange Multiplier

Suppose we have a multivariate objective function $Q(\Theta_1,\dots,\Theta_K)$ and we want to maximize (or minimize) it. However, this optimization is constrained, and its constraint is the equality $P(\Theta_1,\dots,\Theta_K)=c$ where $c$ is a constant. So, the constrained optimization problem is:

$$\begin{aligned}&\underset{\Theta_1,\dots,\Theta_K}{\text{maximize}}\quad Q(\Theta_1,\dots,\Theta_K), \qquad (19)\\ &\text{subject to}\quad P(\Theta_1,\dots,\Theta_K)=c.\end{aligned}$$

For solving this problem, we can introduce a new variable $\alpha$, which is called the "Lagrange multiplier". Also, a new function $\mathcal{L}(\Theta_1,\dots,\Theta_K,\alpha)$, called the "Lagrangian", is introduced:

$$\mathcal{L}(\Theta_1,\dots,\Theta_K,\alpha)=Q(\Theta_1,\dots,\Theta_K)-\alpha\big(P(\Theta_1,\dots,\Theta_K)-c\big). \qquad (20)$$

Maximizing (or minimizing) this Lagrangian function gives us the solution to the optimization problem (Boyd & Vandenberghe, 2004):

$$\nabla_{\Theta_1,\dots,\Theta_K,\alpha}\,\mathcal{L}\overset{\text{set}}{=}0, \qquad (21)$$

which gives us:

$$\nabla_{\Theta_1,\dots,\Theta_K}\,\mathcal{L}\overset{\text{set}}{=}0 \implies \nabla_{\Theta_1,\dots,\Theta_K}\,Q=\alpha\,\nabla_{\Theta_1,\dots,\Theta_K}\,P,$$
$$\nabla_{\alpha}\,\mathcal{L}\overset{\text{set}}{=}0 \implies P(\Theta_1,\dots,\Theta_K)=c.$$
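
As a small worked example (the specific objective here is our own choice, picked because the same pattern reappears for the mixing weights in Section 3): maximize $Q(w_1,\dots,w_K)=\sum_{k=1}^{K} a_k\log w_k$ for given constants $a_k>0$, subject to $\sum_{k=1}^{K} w_k=1$. The Lagrangian is:

$$\mathcal{L}=\sum_{k=1}^{K} a_k\log w_k-\alpha\Big(\sum_{k=1}^{K} w_k-1\Big),$$

and setting its gradient to zero gives:

$$\frac{\partial\mathcal{L}}{\partial w_k}=\frac{a_k}{w_k}-\alpha\overset{\text{set}}{=}0 \implies w_k=\frac{a_k}{\alpha}, \qquad \frac{\partial\mathcal{L}}{\partial\alpha}\overset{\text{set}}{=}0 \implies \sum_{k=1}^{K} w_k=1,$$

so $\alpha=\sum_{j=1}^{K} a_j$ and $w_k=a_k/\sum_{j=1}^{K} a_j$: the constrained maximizer simply normalizes the coefficients.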

## 3 Fitting A Mixture Distribution

As was mentioned in the introduction, the goal of fitting a mixture distribution is to find the parameters and weights of a weighted summation of distributions (see equation (1)). First, as a special case of mixture distributions, we work on the mixture of two distributions, and then we discuss the general mixture of $K$ distributions.

### 3.1 Mixture of Two Distributions

Assume that we want to fit a mixture of two distributions $g_1(x;\Theta_1)$ and $g_2(x;\Theta_2)$ to the data. Note that, in theory, these two distributions are not necessarily from the same distribution family. As we have only two distributions in the mixture, equation (1) is simplified to:

$$f(x;\Theta_1,\Theta_2)=w\,g_1(x;\Theta_1)+(1-w)\,g_2(x;\Theta_2). \qquad (22)$$

Note that the parameter $w$ (or $w_k$ in general) is called the "mixing probability" (Friedman et al., 2009) and is sometimes denoted by $\pi$ (or $\pi_k$ in general) in the literature.

The likelihood and log-likelihood for this mixture are:

$$\begin{aligned} L(\Theta_1,\Theta_2)&=f(x_1,\dots,x_n;\Theta_1,\Theta_2)\overset{(a)}{=}\prod_{i=1}^{n} f(x_i;\Theta_1,\Theta_2)\\ &=\prod_{i=1}^{n}\big[w\,g_1(x_i;\Theta_1)+(1-w)\,g_2(x_i;\Theta_2)\big],\end{aligned}$$
$$\ell(\Theta_1,\Theta_2)=\sum_{i=1}^{n}\log\big[w\,g_1(x_i;\Theta_1)+(1-w)\,g_2(x_i;\Theta_2)\big],$$

where $(a)$ is because of the assumption that the $x_i$ are iid. Optimizing this log-likelihood is difficult because of the summation within the logarithm. However, we can use a nice trick here (Friedman et al., 2009). Let $\Delta_i$ be defined as:

$$\Delta_i:=\begin{cases}1 & \text{if } x_i \text{ belongs to } g_1(x;\Theta_1),\\ 0 & \text{if } x_i \text{ belongs to } g_2(x;\Theta_2),\end{cases}$$

and let its probability be:

$$\begin{cases}P(\Delta_i=1)=w,\\ P(\Delta_i=0)=1-w.\end{cases}$$

Therefore, the log-likelihood can be written as:

$$\ell(\Theta_1,\Theta_2)=\begin{cases}\sum_{i=1}^{n}\log\big[w\,g_1(x_i;\Theta_1)\big] & \text{if } \Delta_i=1,\\ \sum_{i=1}^{n}\log\big[(1-w)\,g_2(x_i;\Theta_2)\big] & \text{if } \Delta_i=0.\end{cases}$$

The above expression can be restated as:

$$\ell(\Theta_1,\Theta_2)=\sum_{i=1}^{n}\Big[\Delta_i\log\big[w\,g_1(x_i;\Theta_1)\big]+(1-\Delta_i)\log\big[(1-w)\,g_2(x_i;\Theta_2)\big]\Big].$$

The $\Delta_i$ here is the incomplete (missing) datum because we do not know whether it is $\Delta_i=1$ or $\Delta_i=0$ for a given $x_i$. Hence, using the EM algorithm, we try to estimate it by its expectation.

The E-step in EM:

$$\begin{aligned} Q(\Theta_1,\Theta_2)=\sum_{i=1}^{n}\Big[&\mathbb{E}[\Delta_i|X,\Theta_1,\Theta_2]\,\log\big[w\,g_1(x_i;\Theta_1)\big]\\ +\,&\mathbb{E}[(1-\Delta_i)|X,\Theta_1,\Theta_2]\,\log\big[(1-w)\,g_2(x_i;\Theta_2)\big]\Big].\end{aligned}$$

Notice that the above expression is linear with respect to $\Delta_i$, and that is why the two logarithms could be factored out of the expectations. Let $\hat{\gamma}_i:=\mathbb{E}[\Delta_i|X,\Theta_1,\Theta_2]$, which is called the "responsibility" of $x_i$ (Friedman et al., 2009).

The $\Delta_i$ is either $0$ or $1$; therefore:

$$\mathbb{E}[\Delta_i|X,\Theta_1,\Theta_2]=0\times P(\Delta_i=0|X,\Theta_1,\Theta_2)+1\times P(\Delta_i=1|X,\Theta_1,\Theta_2)=P(\Delta_i=1|X,\Theta_1,\Theta_2).$$

According to Bayes rule (equation (5)), we have:

$$P(\Delta_i=1|X,\Theta_1,\Theta_2)=\frac{P(X,\Theta_1,\Theta_2,\Delta_i=1)}{P(X;\Theta_1,\Theta_2)}=\frac{P(X,\Theta_1,\Theta_2|\Delta_i=1)\,P(\Delta_i=1)}{\sum_{j=0}^{1}P(X,\Theta_1,\Theta_2|\Delta_i=j)\,P(\Delta_i=j)}.$$

The marginal probability in the denominator is:

$$P(X;\Theta_1,\Theta_2)=w\,g_1(x_i;\Theta_1)+(1-w)\,g_2(x_i;\Theta_2).$$

Thus:

$$\hat{\gamma}_i=\frac{\hat{w}\,g_1(x_i;\Theta_1)}{\hat{w}\,g_1(x_i;\Theta_1)+(1-\hat{w})\,g_2(x_i;\Theta_2)}, \qquad (23)$$

and

$$Q(\Theta_1,\Theta_2)=\sum_{i=1}^{n}\Big[\hat{\gamma}_i\log\big[w\,g_1(x_i;\Theta_1)\big]+(1-\hat{\gamma}_i)\log\big[(1-w)\,g_2(x_i;\Theta_2)\big]\Big]. \qquad (24)$$

Some simplification of $Q(\Theta_1,\Theta_2)$ will help in the next step:

$$Q(\Theta_1,\Theta_2)=\sum_{i=1}^{n}\Big[\hat{\gamma}_i\log w+\hat{\gamma}_i\log g_1(x_i;\Theta_1)+(1-\hat{\gamma}_i)\log(1-w)+(1-\hat{\gamma}_i)\log g_2(x_i;\Theta_2)\Big].$$

The M-step in EM:

$$\hat{\Theta}_1,\hat{\Theta}_2,\hat{w}=\arg\max_{\Theta_1,\Theta_2,w} Q(\Theta_1,\Theta_2,w).$$

Note that the function $Q$ is also a function of $w$, and that is why we wrote it as $Q(\Theta_1,\Theta_2,w)$.

$$\frac{\partial Q}{\partial \Theta_1}=\sum_{i=1}^{n}\Big[\frac{\hat{\gamma}_i}{g_1(x_i;\Theta_1)}\,\frac{\partial g_1(x_i;\Theta_1)}{\partial \Theta_1}\Big]\overset{\text{set}}{=}0, \qquad (25)$$
$$\frac{\partial Q}{\partial \Theta_2}=\sum_{i=1}^{n}\Big[\frac{1-\hat{\gamma}_i}{g_2(x_i;\Theta_2)}\,\frac{\partial g_2(x_i;\Theta_2)}{\partial \Theta_2}\Big]\overset{\text{set}}{=}0, \qquad (26)$$
$$\frac{\partial Q}{\partial w}=\sum_{i=1}^{n}\Big[\hat{\gamma}_i\,\frac{1}{w}-(1-\hat{\gamma}_i)\,\frac{1}{1-w}\Big]\overset{\text{set}}{=}0 \implies \hat{w}=\frac{1}{n}\sum_{i=1}^{n}\hat{\gamma}_i. \qquad (27)$$

So, the mixing probability is the average of the responsibilities, which makes sense. Solving equations (25), (26), and (27) gives us the estimates $\hat{\Theta}_1$, $\hat{\Theta}_2$, and $\hat{w}$ in every iteration.

The iterative algorithm for finding the parameters of the mixture of two distributions is shown in Algorithm 1.

[Algorithm 1: EM for the mixture of two distributions — iterate the E-step (equation (23)) and the M-step (equations (25), (26), and (27)) until convergence of $\hat{\Theta}_1$, $\hat{\Theta}_2$, and $\hat{w}$.]

#### 3.1.1 Mixture of Two Gaussians

Here, we consider a mixture of two one-dimensional Gaussian distributions as an example for mixture of two continuous distributions. In this case, we have:

$$g_1(x;\mu_1,\sigma_1^2)=\frac{1}{\sqrt{2\pi\sigma_1^2}}\exp\Big(\!-\frac{(x-\mu_1)^2}{2\sigma_1^2}\Big)=\phi\Big(\frac{x-\mu_1}{\sigma_1}\Big),$$
$$g_2(x;\mu_2,\sigma_2^2)=\frac{1}{\sqrt{2\pi\sigma_2^2}}\exp\Big(\!-\frac{(x-\mu_2)^2}{2\sigma_2^2}\Big)=\phi\Big(\frac{x-\mu_2}{\sigma_2}\Big),$$

where $\phi(\cdot)$ is the probability density function of the standard normal distribution. Therefore, equation (22) becomes:

$$f(x;\mu_1,\mu_2,\sigma_1^2,\sigma_2^2)=w\,\phi\Big(\frac{x-\mu_1}{\sigma_1}\Big)+(1-w)\,\phi\Big(\frac{x-\mu_2}{\sigma_2}\Big). \qquad (28)$$

Equation (23) becomes:

$$\hat{\gamma}_i=\frac{\hat{w}\,\phi\big(\frac{x_i-\mu_1}{\sigma_1}\big)}{\hat{w}\,\phi\big(\frac{x_i-\mu_1}{\sigma_1}\big)+(1-\hat{w})\,\phi\big(\frac{x_i-\mu_2}{\sigma_2}\big)}. \qquad (29)$$

The $Q(\mu_1,\mu_2,\sigma_1^2,\sigma_2^2)$ is:

$$\begin{aligned} Q(\mu_1,\mu_2,\sigma_1^2,\sigma_2^2)=\sum_{i=1}^{n}\Big[&\hat{\gamma}_i\log w+\hat{\gamma}_i\Big(\!-\frac{1}{2}\log(2\pi)-\log\sigma_1-\frac{(x_i-\mu_1)^2}{2\sigma_1^2}\Big)\\ +\,&(1-\hat{\gamma}_i)\log(1-w)+(1-\hat{\gamma}_i)\Big(\!-\frac{1}{2}\log(2\pi)-\log\sigma_2-\frac{(x_i-\mu_2)^2}{2\sigma_2^2}\Big)\Big].\end{aligned}$$

Therefore:

$$\frac{\partial Q}{\partial \mu_1}=\sum_{i=1}^{n}\Big[\hat{\gamma}_i\,\frac{x_i-\mu_1}{\sigma_1^2}\Big]\overset{\text{set}}{=}0 \implies \hat{\mu}_1=\frac{\sum_{i=1}^{n}\hat{\gamma}_i\,x_i}{\sum_{i=1}^{n}\hat{\gamma}_i}, \qquad (30)$$
$$\frac{\partial Q}{\partial \mu_2}=\sum_{i=1}^{n}\Big[(1-\hat{\gamma}_i)\,\frac{x_i-\mu_2}{\sigma_2^2}\Big]\overset{\text{set}}{=}0 \implies \hat{\mu}_2=\frac{\sum_{i=1}^{n}(1-\hat{\gamma}_i)\,x_i}{\sum_{i=1}^{n}(1-\hat{\gamma}_i)}, \qquad (31)$$
$$\frac{\partial Q}{\partial \sigma_1}=\sum_{i=1}^{n}\Big[\hat{\gamma}_i\Big(\!-\frac{1}{\sigma_1}+\frac{(x_i-\mu_1)^2}{\sigma_1^3}\Big)\Big]\overset{\text{set}}{=}0 \implies \hat{\sigma}_1^2=\frac{\sum_{i=1}^{n}\hat{\gamma}_i\,(x_i-\hat{\mu}_1)^2}{\sum_{i=1}^{n}\hat{\gamma}_i}, \qquad (32)$$
$$\frac{\partial Q}{\partial \sigma_2}=\sum_{i=1}^{n}\Big[(1-\hat{\gamma}_i)\Big(\!-\frac{1}{\sigma_2}+\frac{(x_i-\mu_2)^2}{\sigma_2^3}\Big)\Big]\overset{\text{set}}{=}0 \implies \hat{\sigma}_2^2=\frac{\sum_{i=1}^{n}(1-\hat{\gamma}_i)\,(x_i-\hat{\mu}_2)^2}{\sum_{i=1}^{n}(1-\hat{\gamma}_i)}, \qquad (33)$$

and $\hat{w}$ is the same as equation (27).

Iteratively solving equations (29), (30), (31), (32), (33), and (27) using Algorithm 1 gives us the estimates of $\hat{\mu}_1$, $\hat{\mu}_2$, $\hat{\sigma}_1^2$, $\hat{\sigma}_2^2$, and $\hat{w}$ in equation (28).
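
These updates can be sketched in a few lines of code. Everything below — the synthetic sample with true parameters $w=0.3$, $\mu_1=0$, $\mu_2=5$, $\sigma_1=\sigma_2=1$, the initialization at the data extremes, and a fixed iteration count in place of a convergence check — is our own illustrative choice, not prescribed by the paper:

```python
import math
import random

random.seed(1)
# Synthetic sample: N(0,1) with probability 0.3, otherwise N(5,1).
data = [random.gauss(0.0, 1.0) if random.random() < 0.3 else random.gauss(5.0, 1.0)
        for _ in range(2000)]
n = len(data)

def phi(x, mu, var):
    """Gaussian density with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Assumed initialization: anchor the two components at the data extremes.
w, mu1, mu2, var1, var2 = 0.5, min(data), max(data), 1.0, 1.0

for _ in range(100):
    # E-step: responsibilities, equation (29)
    gam = [w * phi(x, mu1, var1) /
           (w * phi(x, mu1, var1) + (1 - w) * phi(x, mu2, var2)) for x in data]
    g = sum(gam)
    # M-step: equations (30)-(33) and (27)
    mu1 = sum(gi * x for gi, x in zip(gam, data)) / g
    mu2 = sum((1 - gi) * x for gi, x in zip(gam, data)) / (n - g)
    var1 = sum(gi * (x - mu1) ** 2 for gi, x in zip(gam, data)) / g
    var2 = sum((1 - gi) * (x - mu2) ** 2 for gi, x in zip(gam, data)) / (n - g)
    w = g / n

print(round(w, 2), round(mu1, 2), round(mu2, 2))  # close to 0.3, 0, 5
```

With two well-separated components, this converges to the generating parameters from almost any reasonable initialization; for overlapping components, multiple restarts are commonly used because EM only finds a local maximum of the likelihood.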

#### 3.1.2 Mixture of Two Poissons

Here, we consider a mixture of two Poisson distributions as an example for mixture of two discrete distributions. In this case, we have:

$$g_1(x;\lambda_1)=\frac{e^{-\lambda_1}\lambda_1^{x}}{x!}, \qquad g_2(x;\lambda_2)=\frac{e^{-\lambda_2}\lambda_2^{x}}{x!};$$

therefore, equation (22) becomes:

$$f(x;\lambda_1,\lambda_2)=w\,\frac{e^{-\lambda_1}\lambda_1^{x}}{x!}+(1-w)\,\frac{e^{-\lambda_2}\lambda_2^{x}}{x!}. \qquad (34)$$

Equation (23) becomes:

$$\hat{\gamma}_i=\frac{\hat{w}\,\big(\frac{e^{-\hat{\lambda}_1}\hat{\lambda}_1^{x_i}}{x_i!}\big)}{\hat{w}\,\big(\frac{e^{-\hat{\lambda}_1}\hat{\lambda}_1^{x_i}}{x_i!}\big)+(1-\hat{w})\,\big(\frac{e^{-\hat{\lambda}_2}\hat{\lambda}_2^{x_i}}{x_i!}\big)}. \qquad (35)$$

The $Q(\lambda_1,\lambda_2)$ is:

$$\begin{aligned} Q(\lambda_1,\lambda_2)=\sum_{i=1}^{n}\Big[&\hat{\gamma}_i\log w+\hat{\gamma}_i\big(\!-\lambda_1+x_i\log\lambda_1-\log x_i!\big)\\ +\,&(1-\hat{\gamma}_i)\log(1-w)+(1-\hat{\gamma}_i)\big(\!-\lambda_2+x_i\log\lambda_2-\log x_i!\big)\Big].\end{aligned}$$

Therefore:

$$\frac{\partial Q}{\partial \lambda_1}=\sum_{i=1}^{n}\Big[\hat{\gamma}_i\Big(\!-1+\frac{x_i}{\lambda_1}\Big)\Big]\overset{\text{set}}{=}0 \implies \hat{\lambda}_1=\frac{\sum_{i=1}^{n}\hat{\gamma}_i\,x_i}{\sum_{i=1}^{n}\hat{\gamma}_i}, \qquad (36)$$
$$\frac{\partial Q}{\partial \lambda_2}=\sum_{i=1}^{n}\Big[(1-\hat{\gamma}_i)\Big(\!-1+\frac{x_i}{\lambda_2}\Big)\Big]\overset{\text{set}}{=}0 \implies \hat{\lambda}_2=\frac{\sum_{i=1}^{n}(1-\hat{\gamma}_i)\,x_i}{\sum_{i=1}^{n}(1-\hat{\gamma}_i)}, \qquad (37)$$

and $\hat{w}$ is the same as equation (27).

Iteratively solving equations (35), (36), (37), and (27) using Algorithm 1 gives us the estimates of $\hat{\lambda}_1$, $\hat{\lambda}_2$, and $\hat{w}$ in equation (34).
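
The discrete case follows the same loop, only with Poisson mass functions. The true rates ($\lambda_1=2$, $\lambda_2=9$), $w=0.4$, the Knuth-style sampler, and the initialization are all our own illustrative assumptions:

```python
import math
import random

random.seed(2)

def draw_poisson(lam):
    """Knuth's multiplication method (adequate for small lambda); our helper."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= limit:
            return k
        k += 1

# Synthetic counts: Poisson(2) with probability 0.4, otherwise Poisson(9).
data = [draw_poisson(2.0) if random.random() < 0.4 else draw_poisson(9.0)
        for _ in range(2000)]
n = len(data)

def pmf(x, lam):
    """Poisson probability mass function, as in equation (34)."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

w, lam1, lam2 = 0.5, 1.0, 10.0   # assumed initial guesses

for _ in range(100):
    # E-step: responsibilities, equation (35)
    gam = [w * pmf(x, lam1) / (w * pmf(x, lam1) + (1 - w) * pmf(x, lam2))
           for x in data]
    g = sum(gam)
    # M-step: equations (36), (37), and (27)
    lam1 = sum(gi * x for gi, x in zip(gam, data)) / g
    lam2 = sum((1 - gi) * x for gi, x in zip(gam, data)) / (n - g)
    w = g / n

print(round(w, 2), round(lam1, 2), round(lam2, 2))  # close to 0.4, 2, 9
```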

### 3.2 Mixture of Several Distributions

Now, assume a more general case where we want to fit a mixture of $K$ distributions $\{g_1(x;\Theta_1),\dots,g_K(x;\Theta_K)\}$ to the data. Again, in theory, these distributions are not necessarily from the same distribution family. For the reader's convenience, equation (1) is repeated here:

$$f(x;\Theta_1,\dots,\Theta_K)=\sum_{k=1}^{K} w_k\,g_k(x;\Theta_k), \qquad \text{subject to}\quad \sum_{k=1}^{K} w_k=1.$$

The likelihood and log-likelihood for this mixture are:

$$\begin{aligned} L(\Theta_1,\dots,\Theta_K)&=f(x_1,\dots,x_n;\Theta_1,\dots,\Theta_K)\overset{(a)}{=}\prod_{i=1}^{n} f(x_i;\Theta_1,\dots,\Theta_K)\\ &=\prod_{i=1}^{n}\sum_{k=1}^{K} w_k\,g_k(x_i;\Theta_k),\end{aligned}$$
$$\ell(\Theta_1,\dots,\Theta_K)=\sum_{i=1}^{n}\log\Big[\sum_{k=1}^{K} w_k\,g_k(x_i;\Theta_k)\Big],$$

where $(a)$ is because of the assumption that the $x_i$ are iid. Optimizing this log-likelihood is difficult because of the summation within the logarithm. We use the same trick as for the mixture of two distributions:

$$\Delta_{i,k}:=\begin{cases}1 & \text{if } x_i \text{ belongs to } g_k(x;\Theta_k),\\ 0 & \text{otherwise},\end{cases}$$

and its probability is:

$$\begin{cases}P(\Delta_{i,k}=1)=w_k,\\ P(\Delta_{i,k}=0)=1-w_k.\end{cases}$$

Therefore, the log-likelihood can be written as:

$$\ell(\Theta_1,\dots,\Theta_K)=\begin{cases}\sum_{i=1}^{n}\log\big[w_1\,g_1(x_i;\Theta_1)\big] & \text{if } \Delta_{i,1}=1 \text{ and } \Delta_{i,k}=0\ \forall k\neq 1,\\ \sum_{i=1}^{n}\log\big[w_2\,g_2(x_i;\Theta_2)\big] & \text{if } \Delta_{i,2}=1 \text{ and } \Delta_{i,k}=0\ \forall k\neq 2,\\ \qquad\vdots\\ \sum_{i=1}^{n}\log\big[w_K\,g_K(x_i;\Theta_K)\big] & \text{if } \Delta_{i,K}=1 \text{ and } \Delta_{i,k}=0\ \forall k\neq K.\end{cases}$$

The above expression can be restated as:

$$\ell(\Theta_1,\dots,\Theta_K)=\sum_{i=1}^{n}\Big[\sum_{k=1}^{K}\Delta_{i,k}\log\big(w_k\,g_k(x_i;\Theta_k)\big)\Big].$$

The $\Delta_{i,k}$ here is the incomplete (missing) datum because we do not know whether it is $\Delta_{i,k}=1$ or $\Delta_{i,k}=0$ for a given $x_i$ and a specific $k$. Therefore, using the EM algorithm, we try to estimate it by its expectation.

The E-step in EM:

$$Q(\Theta_1,\dots,\Theta_K)=\sum_{i=1}^{n}\Big[\sum_{k=1}^{K}\mathbb{E}[\Delta_{i,k}|X,\Theta_1,\dots,\Theta_K]\,\log\big(w_k\,g_k(x_i;\Theta_k)\big)\Big].$$

The $\Delta_{i,k}$ is either $0$ or $1$; therefore:

$$\begin{aligned}\mathbb{E}[\Delta_{i,k}|X,\Theta_1,\dots,\Theta_K]&=0\times P(\Delta_{i,k}=0|X,\Theta_1,\dots,\Theta_K)+1\times P(\Delta_{i,k}=1|X,\Theta_1,\dots,\Theta_K)\\&=P(\Delta_{i,k}=1|X,\Theta_1,\dots,\Theta_K).\end{aligned}$$

According to Bayes rule (equation (5)), we have:

$$P(\Delta_{i,k}=1|X,\Theta_1,\dots,\Theta_K)=\frac{P(X,\Theta_1,\dots,\Theta_K,\Delta_{i,k}=1)}{P(X;\Theta_1,\dots,\Theta_K)}=\frac{P(X,\Theta_1,\dots,\Theta_K|\Delta_{i,k}=1)\,P(\Delta_{i,k}=1)}{\sum_{k'=1}^{K}P(X,\Theta_1,\dots,\Theta_K|\Delta_{i,k'}=1)\,P(\Delta_{i,k'}=1)}.$$

The marginal probability in the denominator is:

$$P(X;\Theta_1,\dots,\Theta_K)=\sum_{k'=1}^{K} w_{k'}\,g_{k'}(x_i;\Theta_{k'}).$$

Assuming $\hat{\gamma}_{i,k}:=\mathbb{E}[\Delta_{i,k}|X,\Theta_1,\dots,\Theta_K]$ (called the responsibility of $x_i$ for the $k$-th component), we have:

$$\hat{\gamma}_{i,k}=\frac{\hat{w}_k\,g_k(x_i;\Theta_k)}{\sum_{k'=1}^{K}\hat{w}_{k'}\,g_{k'}(x_i;\Theta_{k'})}, \qquad (38)$$

and

$$Q(\Theta_1,\dots,\Theta_K)=\sum_{i=1}^{n}\sum_{k=1}^{K}\hat{\gamma}_{i,k}\log\big(w_k\,g_k(x_i;\Theta_k)\big). \qquad (39)$$

Some simplification of $Q(\Theta_1,\dots,\Theta_K)$ will help in the next step:

$$Q(\Theta_1,\dots,\Theta_K)=\sum_{i=1}^{n}\sum_{k=1}^{K}\Big[\hat{\gamma}_{i,k}\log w_k+\hat{\gamma}_{i,k}\log g_k(x_i;\Theta_k)\Big].$$
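
For Gaussian components, the general $K$-component E-step (equation (38)) and the corresponding weighted-average M-step updates can be sketched as follows. The three-component synthetic data, the initialization, and the fixed iteration count are all our own illustrative assumptions:

```python
import math
import random

random.seed(3)
# Synthetic data: three Gaussians with means -4, 0, 4, unit variances,
# and mixing probabilities 0.2, 0.5, 0.3 (all assumed for illustration).
true_mu, true_w = [-4.0, 0.0, 4.0], [0.2, 0.5, 0.3]

def draw():
    u, acc = random.random(), 0.0
    for m, wk in zip(true_mu, true_w):
        acc += wk
        if u < acc:
            return random.gauss(m, 1.0)
    return random.gauss(true_mu[-1], 1.0)

data = [draw() for _ in range(1500)]
n = len(data)

def phi(x, mu, var):
    """Gaussian density with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

K = 3
mu = [-3.0, 1.0, 5.0]      # assumed initial guesses
var = [1.0] * K
w = [1.0 / K] * K

for _ in range(100):
    # E-step: responsibilities gamma_{i,k}, equation (38)
    gam = []
    for x in data:
        num = [w[k] * phi(x, mu[k], var[k]) for k in range(K)]
        s = sum(num)
        gam.append([v / s for v in num])
    # M-step: weighted means/variances per component; w_k = mean responsibility
    for k in range(K):
        gk = sum(row[k] for row in gam)
        mu[k] = sum(row[k] * x for row, x in zip(gam, data)) / gk
        var[k] = sum(row[k] * (x - mu[k]) ** 2 for row, x in zip(gam, data)) / gk
        w[k] = gk / n

print([round(m, 2) for m in mu], [round(v, 2) for v in w])
```

Note that the estimated weights always sum to one by construction, since each row of responsibilities is normalized in the E-step; this is exactly what the Lagrange-multiplier argument of Section 2.6 enforces.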