## 1 Introduction

Every random variable can be regarded as a sample from a distribution, whether a well-known distribution or a less familiar (or “ugly”) one. Some random variables are drawn from a single distribution, such as a normal distribution. But life is not always so easy! Most real-life random variables may have been generated from a mixture of several distributions rather than a single distribution. The mixture distribution is a weighted sum of $K$ distributions $\{g_1(x; \Theta_1), \dots, g_K(x; \Theta_K)\}$ where the weights $\{w_1, \dots, w_K\}$ sum to one. Every distribution in the mixture has its own parameter $\Theta_k$. The mixture distribution is formulated as:

$$
f(x; \Theta_1, \dots, \Theta_K) = \sum_{k=1}^{K} w_k\, g_k(x; \Theta_k), \quad \text{subject to} \quad \sum_{k=1}^{K} w_k = 1. \tag{1}
$$
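As a minimal sketch of equation (1), the snippet below evaluates a mixture density as a weighted sum of component densities. The helper names (`normal_pdf`, `mixture_pdf`) and the two-component numbers are illustrative assumptions, not part of the paper.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, weights, components):
    """Weighted sum of component densities, as in equation (1)."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to one")
    return sum(w * g(x) for w, g in zip(weights, components))

# Hypothetical example: 0.3 * N(0, 1) + 0.7 * N(4, 2^2)
components = [lambda x: normal_pdf(x, 0.0, 1.0),
              lambda x: normal_pdf(x, 4.0, 2.0)]
weights = [0.3, 0.7]
density_at_one = mixture_pdf(1.0, weights, components)
```

Because the weights sum to one and each component integrates to one, the mixture is itself a valid density.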

The distributions can come from different families, for example beta and normal distributions. However, this makes the problem very complex and is rarely useful; therefore, the distributions in a mixture are mostly from one family (e.g., all normal distributions) but with different parameters. This paper aims to find the parameters $\Theta_1, \dots, \Theta_K$ of the distributions in the mixture distribution $f(x; \Theta_1, \dots, \Theta_K)$ as well as the weights $w_k$ (also called “mixing probabilities”).

The remainder of the paper is organized as follows. Section 2 reviews the technical background required for explaining the main algorithm. The methodology of fitting a mixture distribution to data is then explained in Section 3. In that section, the mixture of two distributions is first introduced and analyzed as a special case of mixture distributions. Then, the general mixture distribution is discussed. Along the way, mixtures of Gaussians (an example for continuous cases) and Poissons (an example for discrete cases) are used for clarification. Section 4 briefly introduces clustering as one of the applications of mixture distributions. In Section 5, the discussed methods are implemented through simulations to give a better sense of how these algorithms work. Finally, Section 6 concludes the paper.

## 2 Background

This section reviews some technical background required for explaining the main algorithm. This review includes probability and Bayes rule, probability mass/density function, expectation, maximum likelihood estimation, expectation maximization, and Lagrange multiplier.

### 2.1 Probability and Bayes Rule

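If $S$ denotes the total sample space and $A$ denotes an event in this sample space, the probability of event $A$ is:

$$
P(A) = \frac{|A|}{|S|}. \tag{2}
$$

The conditional probability, i.e., the probability of occurrence of event $A$ given that event $B$ happens, is:

$$
P(A \mid B) = \frac{P(A, B)}{P(B)} \tag{3}
$$

$$
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}, \tag{4}
$$

where $P(B \mid A)$, $P(A \mid B)$, $P(A)$, and $P(B)$ are called likelihood, posterior, prior, and marginal probabilities, respectively. If we assume that the event $A$ consists of some cases $A = \{A_1, \dots, A_n\}$, we can write:

$$
P(A_i \mid B) = \frac{P(B \mid A_i)\, P(A_i)}{\sum_{j=1}^{n} P(B \mid A_j)\, P(A_j)}. \tag{5}
$$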
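The following sketch applies equation (5) numerically: given priors $P(A_i)$ and likelihoods $P(B \mid A_i)$, it returns the posteriors $P(A_i \mid B)$. The function name `posterior` and the numbers are hypothetical, chosen only for illustration.

```python
def posterior(priors, likelihoods):
    """Return P(A_i | B) for every case A_i, following equation (5)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    marginal = sum(joint)  # P(B), the denominator of equation (5)
    return [j / marginal for j in joint]

# Hypothetical numbers: two cases with priors 0.3 and 0.7
post = posterior([0.3, 0.7], [0.8, 0.1])
```

The same computation reappears later as the “responsibility” of a mixture component for a data point.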

### 2.2 Probability Mass/Density Function

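In discrete cases, the probability mass function is defined as:

$$
f(x) = P(X = x), \tag{6}
$$

where $X$ and $x$ are a random variable and a number, respectively.

In continuous cases, the probability density function is:

$$
f(x) = \lim_{\Delta x \to 0} \frac{P(x \leq X \leq x + \Delta x)}{\Delta x} = \frac{\partial P(X \leq x)}{\partial x}. \tag{7}
$$

In this work, by a mixture of distributions, we mean a mixture of mass/density functions.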

### 2.3 Expectation

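Expectation means the value of a random variable on average. Therefore, expectation is a weighted average where the weights are the probabilities with which the random variable takes its different values. In discrete and continuous cases, the expectation is:

$$
E(X) = \sum_{\text{dom}\, x} x\, f(x), \tag{8}
$$

$$
E(X) = \int_{\text{dom}\, x} x\, f(x)\, dx, \tag{9}
$$

respectively, where $\text{dom}\, x$ is the domain of $X$. The conditional expectation is defined as:

$$
E_{X \mid Y}(X \mid Y) = \sum_{\text{dom}\, x} x\, f(x \mid y), \tag{10}
$$

$$
E_{X \mid Y}(X \mid Y) = \int_{\text{dom}\, x} x\, f(x \mid y)\, dx, \tag{11}
$$

for discrete and continuous cases, respectively.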

### 2.4 Maximum Likelihood Estimation

Assume we have a sample of size $n$, i.e., $\{x_1, \dots, x_n\}$. Also assume that we know the distribution from which this sample has been randomly drawn but we do not know the parameters of that distribution. For example, we know it is drawn from a normal distribution but the mean and variance of this distribution are unknown. The goal is to estimate the parameters of the distribution using the sample $\{x_1, \dots, x_n\}$ available from it. This estimation of parameters from the available sample is called “point estimation”. One of the approaches for point estimation is Maximum Likelihood Estimation (MLE). As its name suggests, MLE deals with the likelihood of the data.

We postulate that the values of the sample, i.e., $x_1, \dots, x_n$, are independent random variates drawn from the distribution of the sample. In other words, the data has a joint distribution $f_X(x_1, \dots, x_n \mid \Theta)$ with parameter $\Theta$ and we assume the variates are independent and identically distributed (iid), i.e., $x_i \overset{iid}{\sim} f_X(x_i; \Theta)$ with the same parameter $\Theta$. Considering the Bayes rule, equation (4), we have:

$$
L(\Theta \mid x_1, \dots, x_n) = \frac{f_X(x_1, \dots, x_n \mid \Theta)\, \pi(\Theta)}{f_X(x_1, \dots, x_n)}, \tag{12}
$$

where $\pi(\Theta)$ denotes the prior on $\Theta$.

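MLE aims to find the parameter $\Theta$ which maximizes the likelihood:

$$
\widehat{\Theta} = \arg\max_{\Theta}\, L(\Theta \mid x_1, \dots, x_n). \tag{13}
$$

According to the definition, the likelihood can be written as:

$$
L(\Theta \mid x_1, \dots, x_n) := f(x_1, \dots, x_n; \Theta) \overset{(a)}{=} \prod_{i=1}^{n} f(x_i; \Theta), \tag{14}
$$

where $(a)$ holds because $x_1, \dots, x_n$ are iid. Note that in the literature, $L(\Theta \mid x_1, \dots, x_n)$ is also denoted by $L(\Theta)$ for simplicity.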

Usually, for convenience, we use the log-likelihood rather than the likelihood:

$$
\ell(\Theta) := \log L(\Theta) \tag{15}
$$

$$
\ell(\Theta) = \log \prod_{i=1}^{n} f(x_i; \Theta) = \sum_{i=1}^{n} \log f(x_i; \Theta). \tag{16}
$$

Often, the logarithm is the natural logarithm, for compatibility with the exponential in the well-known normal density function. Because the logarithm is strictly increasing, it does not change the location of the maximum of the likelihood.
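To make equations (13) and (16) concrete, here is a minimal sketch for the normal distribution, whose maximum likelihood estimates have the well-known closed form of the sample mean and the (biased) sample variance. The function names and the simulated sample are assumptions made for illustration.

```python
import math
import random

def normal_log_likelihood(sample, mu, sigma2):
    """Log-likelihood of an iid normal sample, i.e., equation (16) for a normal density."""
    n = len(sample)
    return (-0.5 * n * math.log(2.0 * math.pi * sigma2)
            - sum((x - mu) ** 2 for x in sample) / (2.0 * sigma2))

def normal_mle(sample):
    """Closed-form maximizers of the normal log-likelihood."""
    n = len(sample)
    mu_hat = sum(sample) / n
    sigma2_hat = sum((x - mu_hat) ** 2 for x in sample) / n  # divides by n, not n - 1
    return mu_hat, sigma2_hat

# Hypothetical sample: 1000 draws from N(5, 2^2)
random.seed(0)
sample = [random.gauss(5.0, 2.0) for _ in range(1000)]
mu_hat, sigma2_hat = normal_mle(sample)
```

Evaluating `normal_log_likelihood` at any other parameter pair gives a value no larger than at `(mu_hat, sigma2_hat)`.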

### 2.5 Expectation Maximization

Sometimes, the data are not fully observable. For example, the data are only known to be either zero or greater than zero. As an illustration, assume the data are collected for a particular disease but, for the convenience of the patients who participated in the survey, the severity of the disease is not recorded and only the existence or non-existence of the disease is reported. So, the data do not give us complete information: for $X_i > 0$, it is not obvious whether $X_i = 2$ or $X_i = 1000$.

In this case, MLE cannot be applied directly, as we do not have access to complete information and some data are missing. Expectation Maximization (EM) is useful here. The main idea of EM can be summarized in this short friendly conversation:

– What shall we do? The data is missing! The log-likelihood is not known completely so MLE cannot be used.

– Mmm, probably we can replace the missing data with something…

– Aha! Let us replace it with its mean.

– You are right! We can take the mean of log-likelihood over the possible values of the missing data. Then everything in the log-likelihood will be known, and then…

– And then we can do MLE!

Assume $D^{(obs)}$ and $D^{(miss)}$ denote the observed data (the $X_i = 0$’s in the above example) and the missing data (the $X_i > 0$’s in the above example), respectively. The EM algorithm includes two main steps, the E-step and the M-step.

In the E-step, the expectation of the log-likelihood (equation (15)) is taken with respect to the missing data $D^{(miss)}$ in order to have a mean estimate of it. Let $Q(\Theta)$ denote the expectation of the log-likelihood with respect to $D^{(miss)}$:

$$
Q(\Theta) := E_{D^{(miss)} \mid D^{(obs)}, \Theta}\big[\ell(\Theta)\big]. \tag{17}
$$

Note that in the above expectation, $D^{(obs)}$ and $\Theta$ are conditioned on, so they are treated as constants and not random variables.

In the M-step, the MLE approach is used where the log-likelihood is replaced with its expectation, i.e., $Q(\Theta)$; therefore:

$$
\widehat{\Theta} = \arg\max_{\Theta}\, Q(\Theta). \tag{18}
$$

These two steps are repeated iteratively until the estimated parameters $\widehat{\Theta}$ converge.

### 2.6 Lagrange Multiplier

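Suppose we have a multi-variate function $Q(\Theta_1, \dots, \Theta_K)$ (called “objective function”) and we want to maximize (or minimize) it. However, this optimization is constrained and its constraint is the equality $P(\Theta_1, \dots, \Theta_K) = c$ where $c$ is a constant. So, the constrained optimization problem is:

$$
\begin{aligned}
& \underset{\Theta_1, \dots, \Theta_K}{\text{maximize}} & & Q(\Theta_1, \dots, \Theta_K), \\
& \text{subject to} & & P(\Theta_1, \dots, \Theta_K) = c.
\end{aligned} \tag{19}
$$

For solving this problem, we can introduce a new variable $\alpha$ which is called “Lagrange multiplier”. Also, a new function $\mathcal{L}(\Theta_1, \dots, \Theta_K, \alpha)$, called “Lagrangian”, is introduced:

$$
\mathcal{L}(\Theta_1, \dots, \Theta_K, \alpha) = Q(\Theta_1, \dots, \Theta_K) - \alpha \big( P(\Theta_1, \dots, \Theta_K) - c \big). \tag{20}
$$

Setting the gradient of this Lagrangian to zero gives us the candidate solutions to the optimization problem (Boyd & Vandenberghe, 2004):

$$
\nabla_{\Theta_1, \dots, \Theta_K, \alpha}\, \mathcal{L} \overset{\text{set}}{=} 0, \tag{21}
$$

which gives us:

$$
\nabla_{\Theta_1, \dots, \Theta_K}\, \mathcal{L} = 0 \implies \nabla_{\Theta_1, \dots, \Theta_K}\, Q = \alpha\, \nabla_{\Theta_1, \dots, \Theta_K}\, P, \qquad
\nabla_{\alpha}\, \mathcal{L} = 0 \implies P(\Theta_1, \dots, \Theta_K) = c.
$$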
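As a small worked illustration of equations (19) to (21), consider maximizing $\sum_{k=1}^{K} a_k \log w_k$ over weights $w_1, \dots, w_K$ subject to $\sum_{k=1}^{K} w_k = 1$, for fixed constants $a_k > 0$. The symbols $a_k$ are introduced here only for this example; the setup is an assumption chosen because a constraint of this form (weights summing to one) appears in equation (1).

$$
\begin{aligned}
\mathcal{L}(w_1, \dots, w_K, \alpha) &= \sum_{k=1}^{K} a_k \log w_k - \alpha \Big( \sum_{k=1}^{K} w_k - 1 \Big), \\
\frac{\partial \mathcal{L}}{\partial w_k} &= \frac{a_k}{w_k} - \alpha \overset{\text{set}}{=} 0 \implies w_k = \frac{a_k}{\alpha}, \\
\frac{\partial \mathcal{L}}{\partial \alpha} &= -\Big( \sum_{k=1}^{K} w_k - 1 \Big) \overset{\text{set}}{=} 0 \implies \alpha = \sum_{j=1}^{K} a_j, \\
\widehat{w}_k &= \frac{a_k}{\sum_{j=1}^{K} a_j}.
\end{aligned}
$$

The multiplier $\alpha$ ends up being exactly the normalizing constant that makes the weights sum to one.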

## 3 Fitting A Mixture Distribution

As was mentioned in the introduction, the goal of fitting a mixture distribution is to find the parameters and weights of a weighted sum of distributions (see equation (1)). First, as a special case of mixture distributions, we work on the mixture of two distributions, and then we discuss the general mixture of distributions.

### 3.1 Mixture of Two Distributions

Assume that we want to fit a mixture of two distributions $g_1(x; \Theta_1)$ and $g_2(x; \Theta_2)$ to the data. Note that, in theory, these two distributions are not necessarily from the same distribution family. As we have only two distributions in the mixture, equation (1) simplifies to:

$$
f(x; \Theta_1, \Theta_2) = w\, g_1(x; \Theta_1) + (1 - w)\, g_2(x; \Theta_2). \tag{22}
$$

Note that the parameter $w$ (or $w_k$ in general) is called “mixing probability” (Friedman et al., 2009) and is sometimes denoted by $\pi$ (or $\pi_k$ in general) in the literature.

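The likelihood and log-likelihood for this mixture are:

$$
L(\Theta_1, \Theta_2) = f(x_1, \dots, x_n; \Theta_1, \Theta_2) \overset{(a)}{=} \prod_{i=1}^{n} f(x_i; \Theta_1, \Theta_2),
$$

$$
\ell(\Theta_1, \Theta_2) = \sum_{i=1}^{n} \log \Big[ w\, g_1(x_i; \Theta_1) + (1 - w)\, g_2(x_i; \Theta_2) \Big],
$$

where $(a)$ holds because of the assumption that $x_1, \dots, x_n$ are iid. Optimizing this log-likelihood is difficult because of the summation within the logarithm. However, we can use a nice trick here (Friedman et al., 2009): let $\Delta_i$ be defined as:

$$
\Delta_i :=
\begin{cases}
1 & \text{if } x_i \text{ belongs to } g_1(x; \Theta_1), \\
0 & \text{if } x_i \text{ belongs to } g_2(x; \Theta_2),
\end{cases}
$$

and its probability be:

$$
P(\Delta_i = 1) = w, \qquad P(\Delta_i = 0) = 1 - w.
$$

Therefore, the log-likelihood can be written as:

$$
\ell(\Theta_1, \Theta_2) = \sum_{i=1}^{n}
\begin{cases}
\log\big( w\, g_1(x_i; \Theta_1) \big) & \text{if } \Delta_i = 1, \\
\log\big( (1 - w)\, g_2(x_i; \Theta_2) \big) & \text{if } \Delta_i = 0.
\end{cases}
$$

The above expression can be restated as:

$$
\ell(\Theta_1, \Theta_2) = \sum_{i=1}^{n} \Big[ \Delta_i \log\big( w\, g_1(x_i; \Theta_1) \big) + (1 - \Delta_i) \log\big( (1 - w)\, g_2(x_i; \Theta_2) \big) \Big].
$$

The $\Delta_i$ here is the incomplete (missing) datum because we do not know whether $\Delta_i = 0$ or $\Delta_i = 1$ for $x_i$. Hence, using the EM algorithm, we try to estimate it by its expectation.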

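The E-step in EM:

$$
Q(\Theta_1, \Theta_2) = \sum_{i=1}^{n} \Big[ E[\Delta_i \mid X, \Theta_1, \Theta_2] \log\big( w\, g_1(x_i; \Theta_1) \big) + E[(1 - \Delta_i) \mid X, \Theta_1, \Theta_2] \log\big( (1 - w)\, g_2(x_i; \Theta_2) \big) \Big].
$$

Notice that the above expression is linear with respect to $\Delta_i$, which is why the two logarithms were factored out of the expectations. Let $\widehat{\gamma}_i := E[\Delta_i \mid X, \Theta_1, \Theta_2]$, which is called the “responsibility” of $x_i$ (Friedman et al., 2009).

The $\Delta_i$ is either $0$ or $1$; therefore:

$$
E[\Delta_i \mid X, \Theta_1, \Theta_2] = 0 \times P(\Delta_i = 0 \mid X, \Theta_1, \Theta_2) + 1 \times P(\Delta_i = 1 \mid X, \Theta_1, \Theta_2) = P(\Delta_i = 1 \mid X, \Theta_1, \Theta_2).
$$

According to Bayes rule (equation (5)), we have:

$$
P(\Delta_i = 1 \mid X, \Theta_1, \Theta_2) = \frac{P(X, \Theta_1, \Theta_2, \Delta_i = 1)}{P(X; \Theta_1, \Theta_2)} = \frac{P(X, \Theta_1, \Theta_2 \mid \Delta_i = 1)\, P(\Delta_i = 1)}{\sum_{j=0}^{1} P(X, \Theta_1, \Theta_2 \mid \Delta_i = j)\, P(\Delta_i = j)}.
$$

The marginal probability in the denominator is:

$$
P(X; \Theta_1, \Theta_2) = (1 - w)\, g_2(x_i; \Theta_2) + w\, g_1(x_i; \Theta_1).
$$

Thus:

$$
\widehat{\gamma}_i = \frac{\widehat{w}\, g_1(x_i; \widehat{\Theta}_1)}{\widehat{w}\, g_1(x_i; \widehat{\Theta}_1) + (1 - \widehat{w})\, g_2(x_i; \widehat{\Theta}_2)}, \tag{23}
$$

and

$$
Q(\Theta_1, \Theta_2) = \sum_{i=1}^{n} \Big[ \widehat{\gamma}_i \log\big( w\, g_1(x_i; \Theta_1) \big) + (1 - \widehat{\gamma}_i) \log\big( (1 - w)\, g_2(x_i; \Theta_2) \big) \Big]. \tag{24}
$$

Some simplification of $Q(\Theta_1, \Theta_2)$ will help in the next step:

$$
Q(\Theta_1, \Theta_2) = \sum_{i=1}^{n} \Big[ \widehat{\gamma}_i \log w + \widehat{\gamma}_i \log g_1(x_i; \Theta_1) + (1 - \widehat{\gamma}_i) \log(1 - w) + (1 - \widehat{\gamma}_i) \log g_2(x_i; \Theta_2) \Big].
$$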

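The M-step in EM:

$$
\widehat{\Theta}_1, \widehat{\Theta}_2, \widehat{w} = \arg\max_{\Theta_1, \Theta_2, w}\, Q(\Theta_1, \Theta_2, w).
$$

Note that the function $Q(\Theta_1, \Theta_2)$ is also a function of $w$, which is why we wrote it as $Q(\Theta_1, \Theta_2, w)$.

$$
\frac{\partial Q}{\partial \Theta_1} = \sum_{i=1}^{n} \left[ \frac{\widehat{\gamma}_i}{g_1(x_i; \Theta_1)} \frac{\partial g_1(x_i; \Theta_1)}{\partial \Theta_1} \right] \overset{\text{set}}{=} 0, \tag{25}
$$

$$
\frac{\partial Q}{\partial \Theta_2} = \sum_{i=1}^{n} \left[ \frac{1 - \widehat{\gamma}_i}{g_2(x_i; \Theta_2)} \frac{\partial g_2(x_i; \Theta_2)}{\partial \Theta_2} \right] \overset{\text{set}}{=} 0, \tag{26}
$$

$$
\frac{\partial Q}{\partial w} = \sum_{i=1}^{n} \left[ \widehat{\gamma}_i \frac{1}{w} + (1 - \widehat{\gamma}_i) \frac{-1}{1 - w} \right] \overset{\text{set}}{=} 0 \implies \widehat{w} = \frac{1}{n} \sum_{i=1}^{n} \widehat{\gamma}_i. \tag{27}
$$

So, the mixing probability is the average of the responsibilities, which makes sense. Solving equations (25), (26), and (27) gives us the estimates $\widehat{\Theta}_1$, $\widehat{\Theta}_2$, and $\widehat{w}$ in every iteration.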

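The iterative algorithm for finding the parameters of the mixture of two distributions is shown in the two-mixture algorithm listing: starting from initial estimates, it alternates the E-step (equation (23)) and the M-step (equations (25), (26), and (27)) until the estimates converge.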

#### 3.1.1 Mixture of Two Gaussians

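Here, we consider a mixture of two one-dimensional Gaussian distributions as an example for the mixture of two continuous distributions. In this case, we have:

$$
g_1(x; \mu_1, \sigma_1^2) = \frac{1}{\sqrt{2 \pi \sigma_1^2}} \exp\left( -\frac{(x - \mu_1)^2}{2 \sigma_1^2} \right) = \phi(x; \mu_1, \sigma_1^2),
$$

$$
g_2(x; \mu_2, \sigma_2^2) = \frac{1}{\sqrt{2 \pi \sigma_2^2}} \exp\left( -\frac{(x - \mu_2)^2}{2 \sigma_2^2} \right) = \phi(x; \mu_2, \sigma_2^2),
$$

where $\phi(x; \mu, \sigma^2)$ is the probability density function of the normal distribution with mean $\mu$ and variance $\sigma^2$. Therefore, equation (22) becomes:

$$
f(x; \mu_1, \mu_2, \sigma_1^2, \sigma_2^2) = w\, \phi(x; \mu_1, \sigma_1^2) + (1 - w)\, \phi(x; \mu_2, \sigma_2^2). \tag{28}
$$

The equation (23) becomes:

$$
\widehat{\gamma}_i = \frac{\widehat{w}\, \phi(x_i; \widehat{\mu}_1, \widehat{\sigma}_1^2)}{\widehat{w}\, \phi(x_i; \widehat{\mu}_1, \widehat{\sigma}_1^2) + (1 - \widehat{w})\, \phi(x_i; \widehat{\mu}_2, \widehat{\sigma}_2^2)}. \tag{29}
$$

The $Q(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2)$ is:

$$
\begin{aligned}
Q(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2) = \sum_{i=1}^{n} \Big[ & \widehat{\gamma}_i \log w + \widehat{\gamma}_i \Big( -\tfrac{1}{2} \log(2\pi) - \log \sigma_1 - \frac{(x_i - \mu_1)^2}{2 \sigma_1^2} \Big) \\
& + (1 - \widehat{\gamma}_i) \log(1 - w) + (1 - \widehat{\gamma}_i) \Big( -\tfrac{1}{2} \log(2\pi) - \log \sigma_2 - \frac{(x_i - \mu_2)^2}{2 \sigma_2^2} \Big) \Big].
\end{aligned}
$$

Therefore:

$$
\frac{\partial Q}{\partial \mu_1} = \sum_{i=1}^{n} \widehat{\gamma}_i \frac{x_i - \mu_1}{\sigma_1^2} \overset{\text{set}}{=} 0 \implies \widehat{\mu}_1 = \frac{\sum_{i=1}^{n} \widehat{\gamma}_i\, x_i}{\sum_{i=1}^{n} \widehat{\gamma}_i}, \tag{30}
$$

$$
\frac{\partial Q}{\partial \mu_2} = \sum_{i=1}^{n} (1 - \widehat{\gamma}_i) \frac{x_i - \mu_2}{\sigma_2^2} \overset{\text{set}}{=} 0 \implies \widehat{\mu}_2 = \frac{\sum_{i=1}^{n} (1 - \widehat{\gamma}_i)\, x_i}{\sum_{i=1}^{n} (1 - \widehat{\gamma}_i)}, \tag{31}
$$

$$
\frac{\partial Q}{\partial \sigma_1} = \sum_{i=1}^{n} \widehat{\gamma}_i \Big( \frac{-1}{\sigma_1} + \frac{(x_i - \mu_1)^2}{\sigma_1^3} \Big) \overset{\text{set}}{=} 0 \implies \widehat{\sigma}_1^2 = \frac{\sum_{i=1}^{n} \widehat{\gamma}_i (x_i - \widehat{\mu}_1)^2}{\sum_{i=1}^{n} \widehat{\gamma}_i}, \tag{32}
$$

$$
\frac{\partial Q}{\partial \sigma_2} = \sum_{i=1}^{n} (1 - \widehat{\gamma}_i) \Big( \frac{-1}{\sigma_2} + \frac{(x_i - \mu_2)^2}{\sigma_2^3} \Big) \overset{\text{set}}{=} 0 \implies \widehat{\sigma}_2^2 = \frac{\sum_{i=1}^{n} (1 - \widehat{\gamma}_i) (x_i - \widehat{\mu}_2)^2}{\sum_{i=1}^{n} (1 - \widehat{\gamma}_i)}, \tag{33}
$$

and $\widehat{w}$ is the same as equation (27).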
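The updates for the mixture of two Gaussians can be sketched in a few lines of Python. This is a minimal illustration, not the paper's reference implementation: the function names, the fixed iteration count (instead of a convergence test), the initial guesses, and the simulated data are all assumptions.

```python
import math
import random

def normal_pdf(x, mu, sigma2):
    """Normal density with mean mu and variance sigma2."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def fit_two_gaussians(data, mu1, mu2, sigma2_1, sigma2_2, w=0.5, iterations=100):
    """EM for a two-component 1-D Gaussian mixture, run for a fixed number of iterations."""
    n = len(data)
    for _ in range(iterations):
        # E-step: responsibility of the first component for every point
        gamma = []
        for x in data:
            a = w * normal_pdf(x, mu1, sigma2_1)
            b = (1.0 - w) * normal_pdf(x, mu2, sigma2_2)
            gamma.append(a / (a + b))
        # M-step: responsibility-weighted means and variances, then the mixing probability
        s1 = sum(gamma)
        s2 = n - s1
        mu1 = sum(g * x for g, x in zip(gamma, data)) / s1
        mu2 = sum((1.0 - g) * x for g, x in zip(gamma, data)) / s2
        sigma2_1 = sum(g * (x - mu1) ** 2 for g, x in zip(gamma, data)) / s1
        sigma2_2 = sum((1.0 - g) * (x - mu2) ** 2 for g, x in zip(gamma, data)) / s2
        w = s1 / n
    return w, mu1, mu2, sigma2_1, sigma2_2

# Hypothetical data: 300 draws from N(0, 1) and 700 draws from N(6, 1.5^2)
random.seed(1)
data = ([random.gauss(0.0, 1.0) for _ in range(300)]
        + [random.gauss(6.0, 1.5) for _ in range(700)])
# Initial guesses (an assumption): smallest and largest observations, unit variances
w_hat, mu1_hat, mu2_hat, var1_hat, var2_hat = fit_two_gaussians(
    data, min(data), max(data), 1.0, 1.0)
```

EM only finds a local maximum, so the result depends on the initial guesses; with well-separated components such as these, the estimates should land close to the generating values.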

#### 3.1.2 Mixture of Two Poissons

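Here, we consider a mixture of two Poisson distributions as an example for the mixture of two discrete distributions. In this case, we have:

$$
g_1(x; \lambda_1) = \frac{e^{-\lambda_1} \lambda_1^{x}}{x!}, \qquad g_2(x; \lambda_2) = \frac{e^{-\lambda_2} \lambda_2^{x}}{x!},
$$

therefore, equation (22) becomes:

$$
f(x; \lambda_1, \lambda_2) = w\, \frac{e^{-\lambda_1} \lambda_1^{x}}{x!} + (1 - w)\, \frac{e^{-\lambda_2} \lambda_2^{x}}{x!}. \tag{34}
$$

The equation (23) becomes:

$$
\widehat{\gamma}_i = \frac{\widehat{w}\, e^{-\widehat{\lambda}_1} \widehat{\lambda}_1^{x_i} / x_i!}{\widehat{w}\, e^{-\widehat{\lambda}_1} \widehat{\lambda}_1^{x_i} / x_i! + (1 - \widehat{w})\, e^{-\widehat{\lambda}_2} \widehat{\lambda}_2^{x_i} / x_i!}. \tag{35}
$$

The $Q(\lambda_1, \lambda_2)$ is:

$$
\begin{aligned}
Q(\lambda_1, \lambda_2) = \sum_{i=1}^{n} \Big[ & \widehat{\gamma}_i \log w + \widehat{\gamma}_i \big( -\lambda_1 + x_i \log \lambda_1 - \log x_i! \big) \\
& + (1 - \widehat{\gamma}_i) \log(1 - w) + (1 - \widehat{\gamma}_i) \big( -\lambda_2 + x_i \log \lambda_2 - \log x_i! \big) \Big].
\end{aligned}
$$

Therefore:

$$
\frac{\partial Q}{\partial \lambda_1} = \sum_{i=1}^{n} \widehat{\gamma}_i \Big( -1 + \frac{x_i}{\lambda_1} \Big) \overset{\text{set}}{=} 0 \implies \widehat{\lambda}_1 = \frac{\sum_{i=1}^{n} \widehat{\gamma}_i\, x_i}{\sum_{i=1}^{n} \widehat{\gamma}_i}, \tag{36}
$$

$$
\frac{\partial Q}{\partial \lambda_2} = \sum_{i=1}^{n} (1 - \widehat{\gamma}_i) \Big( -1 + \frac{x_i}{\lambda_2} \Big) \overset{\text{set}}{=} 0 \implies \widehat{\lambda}_2 = \frac{\sum_{i=1}^{n} (1 - \widehat{\gamma}_i)\, x_i}{\sum_{i=1}^{n} (1 - \widehat{\gamma}_i)}, \tag{37}
$$

and $\widehat{w}$ is the same as equation (27).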
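The discrete case can be sketched the same way. In this illustration the function names, the initial guesses, the fixed iteration count, and the simulated counts are assumptions; `sample_poisson` is a small demo helper based on Knuth's multiplication method, used only because Python's standard library has no Poisson sampler.

```python
import math
import random

def poisson_pmf(x, lam):
    """Poisson probability mass function, computed in log space for stability."""
    return math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))

def fit_two_poissons(data, lam1, lam2, w=0.5, iterations=100):
    """EM for a two-component Poisson mixture, run for a fixed number of iterations."""
    n = len(data)
    for _ in range(iterations):
        # E-step: responsibility of the first component for every count
        gamma = []
        for x in data:
            a = w * poisson_pmf(x, lam1)
            b = (1.0 - w) * poisson_pmf(x, lam2)
            gamma.append(a / (a + b))
        # M-step: responsibility-weighted means, then the mixing probability
        s1 = sum(gamma)
        lam1 = sum(g * x for g, x in zip(gamma, data)) / s1
        lam2 = sum((1.0 - g) * x for g, x in zip(gamma, data)) / (n - s1)
        w = s1 / n
    return w, lam1, lam2

def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method (demo helper)."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k

# Hypothetical data: 400 counts from Poisson(2) and 600 counts from Poisson(15)
rng = random.Random(0)
counts = ([sample_poisson(2.0, rng) for _ in range(400)]
          + [sample_poisson(15.0, rng) for _ in range(600)])
w_hat, lam1_hat, lam2_hat = fit_two_poissons(counts, lam1=1.0, lam2=20.0)
```

The only difference from the Gaussian sketch is the component density and the parameter update; the responsibilities and the mixing probability are computed identically.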

### 3.2 Mixture of Several Distributions

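Now, assume a more general case where we want to fit a mixture of $K$ distributions $g_1(x; \Theta_1), \dots, g_K(x; \Theta_K)$ to the data. Again, in theory, these distributions are not necessarily from the same distribution family. For the convenience of the reader, equation (1) is repeated here:

$$
f(x; \Theta_1, \dots, \Theta_K) = \sum_{k=1}^{K} w_k\, g_k(x; \Theta_k), \quad \text{subject to} \quad \sum_{k=1}^{K} w_k = 1.
$$

The likelihood and log-likelihood for this mixture are:

$$
L(\Theta_1, \dots, \Theta_K) = f(x_1, \dots, x_n; \Theta_1, \dots, \Theta_K) \overset{(a)}{=} \prod_{i=1}^{n} f(x_i; \Theta_1, \dots, \Theta_K),
$$

$$
\ell(\Theta_1, \dots, \Theta_K) = \sum_{i=1}^{n} \log \Big[ \sum_{k=1}^{K} w_k\, g_k(x_i; \Theta_k) \Big],
$$

where $(a)$ holds because of the assumption that $x_1, \dots, x_n$ are iid. Optimizing this log-likelihood is difficult because of the summation within the logarithm. We use the same trick as for the mixture of two distributions:

$$
\Delta_{i,k} :=
\begin{cases}
1 & \text{if } x_i \text{ belongs to } g_k(x; \Theta_k), \\
0 & \text{otherwise},
\end{cases}
$$

and its probability is:

$$
P(\Delta_{i,k} = 1) = w_k, \qquad P(\Delta_{i,k} = 0) = 1 - w_k.
$$

Therefore, the log-likelihood can be written as:

$$
\ell(\Theta_1, \dots, \Theta_K) = \sum_{i=1}^{n}
\begin{cases}
\log\big( w_1\, g_1(x_i; \Theta_1) \big) & \text{if } \Delta_{i,1} = 1 \text{ and } \Delta_{i,k} = 0 \;\; \forall k \neq 1, \\
\quad \vdots \\
\log\big( w_K\, g_K(x_i; \Theta_K) \big) & \text{if } \Delta_{i,K} = 1 \text{ and } \Delta_{i,k} = 0 \;\; \forall k \neq K.
\end{cases}
$$

The above expression can be restated as:

$$
\ell(\Theta_1, \dots, \Theta_K) = \sum_{i=1}^{n} \sum_{k=1}^{K} \Delta_{i,k} \log\big( w_k\, g_k(x_i; \Theta_k) \big).
$$

The $\Delta_{i,k}$ here is the incomplete (missing) datum because we do not know whether $\Delta_{i,k} = 0$ or $\Delta_{i,k} = 1$ for $x_i$ and a specific $k$. Therefore, using the EM algorithm, we try to estimate it by its expectation.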

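The E-step in EM:

$$
Q(\Theta_1, \dots, \Theta_K) = \sum_{i=1}^{n} \sum_{k=1}^{K} E[\Delta_{i,k} \mid X, \Theta_1, \dots, \Theta_K] \log\big( w_k\, g_k(x_i; \Theta_k) \big).
$$

The $\Delta_{i,k}$ is either $0$ or $1$; therefore:

$$
E[\Delta_{i,k} \mid X, \Theta_1, \dots, \Theta_K] = 0 \times P(\Delta_{i,k} = 0 \mid X, \Theta_1, \dots, \Theta_K) + 1 \times P(\Delta_{i,k} = 1 \mid X, \Theta_1, \dots, \Theta_K) = P(\Delta_{i,k} = 1 \mid X, \Theta_1, \dots, \Theta_K).
$$

According to Bayes rule (equation (5)), we have:

$$
P(\Delta_{i,k} = 1 \mid X, \Theta_1, \dots, \Theta_K) = \frac{P(X, \Theta_1, \dots, \Theta_K, \Delta_{i,k} = 1)}{P(X; \Theta_1, \dots, \Theta_K)} = \frac{P(X, \Theta_1, \dots, \Theta_K \mid \Delta_{i,k} = 1)\, P(\Delta_{i,k} = 1)}{\sum_{k'=1}^{K} P(X, \Theta_1, \dots, \Theta_K \mid \Delta_{i,k'} = 1)\, P(\Delta_{i,k'} = 1)}.
$$

The marginal probability in the denominator is:

$$
P(X; \Theta_1, \dots, \Theta_K) = \sum_{k'=1}^{K} w_{k'}\, g_{k'}(x_i; \Theta_{k'}).
$$

Assuming that $\widehat{\gamma}_{i,k} := E[\Delta_{i,k} \mid X, \Theta_1, \dots, \Theta_K]$ (called the responsibility of $x_i$), we have:

$$
\widehat{\gamma}_{i,k} = \frac{\widehat{w}_k\, g_k(x_i; \widehat{\Theta}_k)}{\sum_{k'=1}^{K} \widehat{w}_{k'}\, g_{k'}(x_i; \widehat{\Theta}_{k'})}, \tag{38}
$$

and

$$
Q(\Theta_1, \dots, \Theta_K) = \sum_{i=1}^{n} \sum_{k=1}^{K} \widehat{\gamma}_{i,k} \log\big( w_k\, g_k(x_i; \Theta_k) \big). \tag{39}
$$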
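The responsibilities of the general $K$-component case can be computed for a single observation as in the sketch below. The function name `responsibilities` and the three-component Gaussian example are illustrative assumptions; any component densities could be passed in.

```python
import math

def normal_pdf(x, mu, sigma2):
    """Normal density with mean mu and variance sigma2."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def responsibilities(x, weights, densities):
    """Responsibility of each component for x: w_k g_k(x) divided by the sum over all components."""
    joint = [w * g(x) for w, g in zip(weights, densities)]
    total = sum(joint)  # marginal probability in the denominator
    return [j / total for j in joint]

# Hypothetical three-component mixture with means -3, 0, 4 and unit variances
params = [(-3.0, 1.0), (0.0, 1.0), (4.0, 1.0)]
densities = [lambda x, m=m, s2=s2: normal_pdf(x, m, s2) for m, s2 in params]
weights = [0.2, 0.3, 0.5]
gammas = responsibilities(0.5, weights, densities)
```

For each observation the responsibilities are non-negative and sum to one, so they act as soft assignments of the point to the components.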

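Some simplification of $Q(\Theta_1, \dots, \Theta_K)$ will help in the next step:

$$
Q(\Theta_1, \dots, \Theta_K) = \sum_{i=1}^{n} \sum_{k=1}^{K} \Big[ \widehat{\gamma}_{i,k} \log w_k + \widehat{\gamma}_{i,k} \log g_k(x_i; \Theta_k) \Big].
$$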
