1. A good reference is the Tom Mitchell: Machine learning, A gentle tutorial on EM algorithm.
2. What is a mixture model?
Stolen from Wikipedia
A mixture model is a model in which the independent variables are measured as fractions of a total. For example, suppose researchers are trying to find the optimal mixture of ingredients for a fruit punch consisting of grape juice, mango juice, and pineapple juice. A mixture model is suitable here because the results of the taste tests will not depend on the amount of ingredients used to make the batch but rather on the fraction of each ingredient present in the punch. The sum of the mixture components is always 100%, and a mixture model takes this restriction into account.
A mixture model can also be a formalism for modeling a probability density function as a sum of parameterized functions. In mathematical terms,
where pX(x) is the modeled probability distribution function, K is the number of components in the mixture model, and ak is mixture proportion of component k. By definition, 0 < ak < 1 for all and .
h(x | λk) is a probability distribution parameterized by λk.
Mixture models are often used when we know h(x) and we can sample from pX(x), but we would like to determine the ak and λk values. Such situations can arise in studies in which we sample from a population that is composed of several distinct subpopulations.
Two common approaches for estimation in mixture models
It's common to think of mixture modeling (under the second definition) as a missing data problem. One way to understand this is to assume that the data points under consideration have "membership" in one of the distributions we are using to model the data. When we start, this membership is unknown, or missing. The job of estimation is to devise appropriate parameters for the model functions we choose, with the connection to the data points being represented as their membership in the individual model distributions.
The Expectation-maximization algorithm is one way to compute the missing memberships of data points in our chosen distribution model. It is an iterative procedure, where we start with initial parameters for our model distribution (the ak's and λk's of the model listed above). The estimation process proceeds iteratively in two steps, the Expectation Step, and the Maximization Step.
The expectation step
With initial guesses for the parameters in our mixture model, we compute "partial membership" of each data point in each constituent distribution. This is done by calculating expectation values for the membership variables of each data point. An example will provide some clarity. Let's consider a simple example. We have a collection of data points xi that can be modeled as coming from a sum of two Gaussian distributions. The probability expression for our model is
Where f is the mixing coefficient in [0,1] (ak = 2 = f and ak = 1 = 1 − f), and we assume σ is known and constant. For each of our data points xi, we can compute a membership value for each of the two Gaussians as follows
and similarly for y2,i.
In the case of a Gaussian mixture model,
The maximization step
With our expectation values in hand for group membership, we can recompute plug-in estimates of our distribution parameters. For the mixing coefficient f this is simply the fractional membership of all data points in the second Gaussian.
where N is the total number of data points. For μ1,
With new estimates for f and the μ's, we proceed back to the Expectation step to recompute new membership values. The procedure is repeated until there is no further change in the mixture model parameters.