Talk:Expectation–maximization algorithm

Error in Gaussian Mixture Example?
I am not an expert (far from it), but there seems to be an error in the final step, the calculation of Sigma(t+1) from Sigma(t). Shouldn't there be a square root over the entire term? Two (not very scientific) arguments: the standard deviation (meaning Sigma) is given in the same units as x, but the calculation of Sigma(t+1) is a weighted average (weighted by T) of the square of delta x, so it is in x^2 units. From the point of view of the dimensions of Sigma and x, there should therefore be a square root. The other argument is that when implementing this algorithm (straight from Wikipedia), I got several errors until I added the square root over the average of delta_x^2.

To summarize: Right now the last lines read Sigma_j(t+1) =, but I believe they should read Sigma_j(t+1) = sqrt. Can someone who knows what he's doing have a look? —Preceding unsigned comment added by 212.150.57.142 (talk) 11:18, 18 January 2010 (UTC)


 * $$\Sigma$$ is the covariance matrix, which is the multivariate extension of the variance, and so is in square units. The code in R (programming language) that generated the picture is in the comments at File:Em old faithful.gif (though I used a built-in function to calculate the empirical covariance). —3mta3 (talk) 23:04, 18 January 2010 (UTC)
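To make the units point concrete, here is a hedged univariate sketch in Python/NumPy (not the article's R code; the single-component setup and weights are my own simplification). The M-step update itself yields the variance, in squared units; the square root only enters if the implementation then stores a standard deviation to feed into the normal density.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=10_000)  # 1-D data, true sigma = 2
T = np.ones_like(x)                              # membership weights (all 1 here)

# M-step update for a single univariate component:
mu = np.sum(T * x) / np.sum(T)
var = np.sum(T * (x - mu) ** 2) / np.sum(T)      # squared units: this is sigma^2

# The update produces a variance; take the square root only when a
# *standard deviation* is needed, e.g. inside the normal pdf.
sigma = np.sqrt(var)
print(sigma)  # close to 2.0
```

In the multivariate case of the article, $$\Sigma$$ is a covariance matrix, so no square root appears in the update at all.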

Not strictly correct
I don't think this is strictly correct:

M-Step: re-estimate the distribution parameters to [M]aximize the likelihood of the data, given the expected estimates of the unknown variables

The issue is "given the expected estimates of the unknown variables." It is not quite that simple in some cases. For example, suppose m were missing and had an expectation with respect to the current parameter estimates and observed data. Note that E(log(m)) != log(E(m)). Now suppose such a thing appeared in your log likelihood (a more complicated version appears in my work, e.g. E(log(det(I))) != log(det(E(I)))). You have to maximize the left-hand side of my equation, but that is not the same thing as maximizing the right. Now, in the more popular examples of EM this is never an issue! Or, as I like to say to my advisor, E(l(m,pi) | obs, current) = l(E(m | obs, current), pi), which is a really nice feature that essentially says: find the expectations for the missing data and plug em in to your log likelihood. This happens, for example, in binomial models.

Why didn't I change it directly? Because I am actually thumbing through literature to make sure this is true: "You have to maximize the left of my equation but that is not the same thing as maximizing the right." Even though the two quantities aren't equal, they may be maximized at the same time given some constraints on the problem. So, dunno... —Preceding unsigned comment added by 129.123.249.243 (talk) 00:10, 22 January 2009 (UTC)
 * I agree. I think the problem arises because the expectation can be used in the simple linear case used in most examples, which gives the wrong impression for more advanced uses. –3mta3 (talk) 12:06, 22 March 2009 (UTC)
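For what it's worth, the inequality at issue is easy to check numerically. A short Python sketch (exponential draws standing in for the missing variable m; illustrative only):

```python
import math
import random

random.seed(1)
m = [random.expovariate(1.0) for _ in range(100_000)]  # positive "missing" values

e_log_m = sum(math.log(v) for v in m) / len(m)  # E[log m]
log_e_m = math.log(sum(m) / len(m))             # log E[m]

# Jensen's inequality: log is concave, so E[log m] <= log E[m],
# with equality only for a degenerate (constant) m.
print(e_log_m, log_e_m)  # roughly -0.58 vs 0.0 for Exp(1)
```

So plugging E[m] into a log likelihood that contains log(m) is not, in general, the same as taking the expectation of the log likelihood, which is the commenter's point.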

Simple example?
Can someone contribute a very simple toy example to show how EM works? The one currently there (a mixture of Gaussians) is too long and is highly mathematical. My feeling is that very few people are going to wade through it.

My usual comment on such matters: Wikipedia is supposed to be an encyclopedia, not a maths textbook! 99.9% of readers of an article will not be experts, but will be trying to understand a new subject. Therefore articles should concentrate on building readers' insight, and not on being pedantically correct and complete. That comes later. Now, this may come as a bit of a shock to some of you, but most people regard complex mathematical formulae as obstacles to be overcome (or ignored).

"There's nothing so complicated that it can't be explained simply" (or something to that effect) -- A. Einstein

--84.9.77.220 (talk) 23:23, 21 February 2008 (UTC)

While repetitive, I second the above opinion. And, while I have significant background in science, math, and engineering, a quick read through the EM article does not yield useful insights. I ask that authors exert some effort to write clearly and succinctly in educating the reader in the significance of their topic, as well as causing the reader to clearly understand "how it works". The current EM article does neither, even though I believe the authors are quite competent and expert in their subject. Will you please take the time to educate those of us less knowledgeable in your area? —Preceding unsigned comment added by 214.3.40.3 (talk) 19:48, 11 March 2008 (UTC)

See the new section EM in a nutshell. Comments are appreciated. I'd like to dress up the formatting once I'm more experienced with wiki editing. Aanthony1243 (talk) 14:15, 22 May 2008 (UTC)

I found this article easier to digest than most anything else I read on EM. Unfortunately, it seems that the Gaussian mixture model is the simplest example out there, unless one can concoct something with Bernoulli distributions.
 * The Gaussian mixture is the canonical example. I've simplified it somewhat, reducing it to a 2 component mixture, and getting rid of the unnecessary Lagrange multiplier business. –3mta3 (talk) 09:58, 23 March 2009 (UTC)

Why not just make it a mixture of normal distributions? Making it a mixture of multivariate normals makes it much more complicated than it needs to be, and obscures a bit of the intuition of the method behind the mathematics of the multivariate case. —Preceding unsigned comment added by 209.6.15.128 (talk) 00:45, 20 October 2009 (UTC)

Like so many math-related articles in Wikipedia, this one is totally pointless. Most people are looking for an explanation of a concept, not a mathematical definition. There is no example, no hint here why this algorithm is useful and for what. Most algorithms are easy to explain (divide and conquer, sorting stuff, gradient search, etc.), so should this not be possible here as well? Maximilianh (talk) 10:25, 6 July 2010 (UTC)

Proposal
I propose the following "simple example". It might be stupid or wrong, but I just want to start somewhere.

Estimation algorithms like expectation maximization are used when one cannot try all possible combinations to optimize a model. This happens in real-world situations when single tests are too expensive or time-consuming, and in numerical optimization procedures when there are too many combinations to check.

A simple example application of expectation maximization could be the optimization of a formula to recognize cars that contain drugs based on their appearance (brand, color, year). The formula specifies the probability that a car contains drugs given its appearance. It is far too time-consuming to search all cars in order to optimize the formula. One could sample cars randomly, but this might not be the most effective way to optimize the formula. Instead, starting from a very simplified or random initial formula, one could first use it to calculate the probability for each car. Maximilianh (talk) 10:20, 6 July 2010 (UTC)
 * Expectation step: One samples according to these probabilities, i.e. preferentially picks cars with higher probabilities over those with lower ones. These cars are then searched.
 * Maximization step: Based on the results, the parameters of the formula are adapted to fit the results of the searches as well as possible.
 * That is a bit complicated... A simpler example is the coin-tossing one. Given a set of observations of heads and tails, such as HTHHTTTHHHTTH, find the most likely bias for the coin, where 0.5 would be a "fair" coin that gives heads and tails equally often. To do this, one needs to write the probability of observing that sequence for a given bias of the coin (as a formula), then take the derivative of this to find the local maxima. The local maxima give the most likely (expectation maximised) result. The probability of observing a particular sequence is simply the binomial distribution pmf, with p as the dependent variable ($$V = \textstyle {n \choose k}\, p^k (1-p)^{n-k}$$, where n is the number of tosses and k is the number of heads (or tails, but choose one)). Differentiate the binomial ($$\frac{dV}{dp}$$) to locate the maximum and you are done. I haven't written this up, as it has been a while since I have done EM properly. (I believe the answer is $$p = k/n$$.) You can then follow this up, assuming the given bias, with a confidence interval computation, to find out how certain you can be that the coin is biased, and when you are 99% certain, you then tackle the coin-tosser and demand your bet losses back. User A1 (talk) 10:35, 6 July 2010 (UTC)
 * Of course, this particular example does not display the iterative nature of many EM methods. User A1 (talk) 10:54, 6 July 2010 (UTC)
 * The other problem with your suggestion is that it acts on a non continuous parameter space (eg brand). this is a slightly different problem. User A1 (talk) 12:33, 6 July 2010 (UTC)
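The closed-form answer p = k/n suggested above checks out numerically. A short Python sketch (the grid search is my own device for illustration, not part of any EM reference; as noted, this is a one-shot maximum likelihood fit rather than an iterative EM):

```python
from math import comb

seq = "HTHHTTTHHHTTH"
n, k = len(seq), seq.count("H")  # 13 tosses, 7 heads

def likelihood(p):
    """Binomial pmf of observing k heads in n tosses with bias p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Grid-search the likelihood; the maximum should sit at p = k/n,
# matching the closed-form MLE obtained from dV/dp = 0.
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=likelihood)
print(p_hat, k / n)  # both near 7/13 ≈ 0.538
```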

Expected value
There is a serious conflict between the phrase "expected value" as it is used in this article vs how it is defined in the expected value article. Jim Bowery (talk) 18:41, 16 October 2008 (UTC)

Hello. I have a comment about the recent edit which credits the popularity of EM methods to the convergence proof instead of the ease of formulation. Maybe ease of formulation is not quite on the mark, but it seems that the convergence proof isn't either -- after all, the other methods (gradient ascent, etc) will also converge within some neighborhood of a stationary point, and that's all that's guaranteed for EM methods, right? -- FWIW, my point about ease of formulation was this: suppose you have a method that computes maximum likelihood estimates in the complete data case. When you consider the case of incomplete data, the previous method can often be adapted in a very straightforward way; the canonical example would be Gaussian mixture density estimation, in which computation of mean and variance are just replaced by a weighted mean and a weighted variance. I guess ultimately the reason for popularity is an empirical question; we would have to look at a lot of papers and see what reasons, if any, are given. Happy editing, Wile E. Heresiarch 15:37, 29 Apr 2004 (UTC)


 * Given that it is hard to pin down people's motivations for liking or disliking an algorithm, perhaps we should drop the sentence from the article. But, based on what I heard at machine learning conferences, EM was preferred due to convergence. I believe that the paper triggered the shift from gradient descent (used in neural networks) to EM. That paper doesn't have a nice simple M-step like mixture of Gaussians (see equations 27-29, and following), so the popularity wasn't based on simplicity, I think. -- hike395 07:34, 1 May 2004 (UTC)


 * Well, I agree that it's hard to divine motivation. I dropped the sentence that referred to a reason for popularity. Wile E. Heresiarch 22:45, 1 May 2004 (UTC)


 * You note that EM is only guaranteed to maximize likelihood locally. This point should be made clearly from the start of the article.  And yes, other methods such as gradient ascent or conjugate gradient will work just as well.  (In fact, EM is closely related to gradient ascent and can be regarded as a "self-scaling" version of gradient ascent (i.e., it picks its own stepsize).  In many instances of EM, like the forward-backward and inside-outside algorithms, the E step is actually computing the gradient of the log probability.) 70.16.28.121 01:48, 13 August 2006 (UTC)

Notation
Can someone tell me what the subscripts next to the expected value (the big E) are supposed to mean? I am not familiar with math notation. -confused
 * The notation comes from probability theory, where the vertical bar in an expression $$P(A|B)$$ denotes a conditional probability. In the article, the notation is saying: given $$x$$ (the observations) and the current parameter estimate $$\theta^{(t)}$$, what is the expected value under the imposed statistical model. The key link is the conditional distribution link above. $$t$$ denotes the tth iteration of the EM algorithm. User A1 (talk) 10:34, 26 June 2009 (UTC)

Consistency of symbol font
At present, the symbols that are inline are sometimes in a bigger font than the text, sometimes the same font size (look at subscripted thetas for example). The difference seems to depend on whether there's a "/," in the text string between the math tags. They should at least be consistent - I thought maybe I'd just change them to one or the other, but perhaps it's worth asking first whether there's a preference amongst the original authors, or whether there's a general convention.

Animated picture
On 3 January 2007 someone removed this animated image:

I thought the image was quite nice, but maybe it shouldn't be at the top of the article but nearer the bottom. So I thought that if someone (with a little bit more knowledge about this topic than me) could write a little piece about the application in question, we could move it to the "Examples" section. Skaakt 09:56, 9 February 2007 (UTC)
 * This image doesn't really show anything about expectation maximization other than it yields incremental improvements at each step. It may make sense in the context of an application though. Sancho 06:18, 11 February 2008 (UTC)

Eliminate alpha-EM?
The section on alpha-EM should be nixed IMO. It is not 1) notable (the linked paper is not highly cited; it is mostly self-cited in conference reports), 2) clearly explained (e.g. "α-log likelihood ratio" and "α-divergence" are undefined), nor 3) of value to a general readership. Given that the username who added it (Yasuo2) also wrote the article on its author, I assume this is a vanity edit. — Preceding unsigned comment added by 108.75.137.21 (talk) 14:31, 6 May 2012 (UTC)

Description of the two steps
In the article it says for the E step:

Calculate the expected value of the log likelihood function, with respect to the conditional distribution of $$\mathbf{Z}$$ given $$\mathbf{X}$$ under the current estimate of the parameters $$\boldsymbol\theta^{(t)}$$:
 * $$Q(\boldsymbol\theta|\boldsymbol\theta^{(t)}) = \operatorname{E}_{\mathbf{Z}|\mathbf{X},\boldsymbol\theta^{(t)}}\left[ \log L (\boldsymbol\theta;\mathbf{X},\mathbf{Z}) \right] \,$$

According to the article, the M step is the maximization of that very same quantity, just calculated. This sounds a bit confusing to me. I suggest defining the E step as done in "Pattern Recognition and Machine Learning" by Christopher M. Bishop: as the evaluation of the quantities $$p(\boldsymbol Z | \boldsymbol X, \boldsymbol\theta^{(t)})$$, which can then be used in the M step for the maximization.

Dont you agree, that this description makes more sense? If there are no objections, I would go ahead and change the article accordingly. --Sagzehn (talk) 14:49, 23 June 2012 (UTC)


 * Bishop is dealing with only a special case of the EM algorithm. Possibly the formula in the E step could be written in terms of the conditional distribution rather than the likelihood, but that may not be easy if one is to accommodate discrete, continuous and mixed continuous-discrete distributions. But it is important to retain a separate E-step, firstly because that is how the algorithm is traditionally explained and, secondly, because the expectation in the E-step is often evaluated analytically, rather than by brute force as in Bishop's case. The outcome of the E-step can be a function to be maximised that is much simpler/quicker to evaluate than a direct substitution.
 * If you can find a source that treats the general case of the EM algorithm in a better way, there is no reason why what is here can't be changed. Melcombe (talk) 16:12, 23 June 2012 (UTC)

I agree this is kind of confusing. I have adjusted the wording a bit. In any algorithm I know the output of the expectation step in the code are the membership probabilities $$p(\boldsymbol Z | \boldsymbol X, \boldsymbol\theta^{(t)})$$. But, the actual expectation step is of course: $$Q(\boldsymbol\theta|\boldsymbol\theta^{(t)}) = \operatorname{E}_{\mathbf{Z}|\mathbf{X},\boldsymbol\theta^{(t)}}\left[ \log L (\boldsymbol\theta;\mathbf{X},\mathbf{Z}) \right] \,$$. However, this would be stupid to do in code, because the subsequent maximization can maximize for $$\theta$$, etc. separately. So, $$Q$$ must be seen as a kind of theoretical place-holder that you wouldn't use directly in the final algorithm, but which results you use in the final formulation of the maximization procedures for the individual parameters. Anne van Rossum (talk) 15:35, 21 May 2014 (UTC)
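To put the point above in code: a minimal NumPy sketch of a two-component univariate mixture (the component count, data, and initial values are my own assumptions, not the article's). The E-step stores the membership probabilities $$p(\boldsymbol Z | \boldsymbol X, \boldsymbol\theta^{(t)})$$, and the M-step then maximizes each parameter separately, so $$Q$$ itself is never formed explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 500), rng.normal(3, 1, 500)])

# Initial guesses for (tau, mu, sigma) of the two components.
tau = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])

def normal_pdf(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

for _ in range(50):
    # E-step: membership probabilities p(Z | X, theta^(t)) -- the quantities
    # an implementation actually stores, standing in for Q.
    w = tau[:, None] * normal_pdf(x[None, :], mu[:, None], sigma[:, None])
    w /= w.sum(axis=0)

    # M-step: each parameter is maximized separately given the memberships.
    tau = w.mean(axis=1)
    mu = (w * x).sum(axis=1) / w.sum(axis=1)
    sigma = np.sqrt((w * (x - mu[:, None]) ** 2).sum(axis=1) / w.sum(axis=1))

print(np.sort(mu))  # close to the true means -2 and 3
```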

minor grammar corrections?
Your edit here made nonsense of a grammatical (but long and awkward) sentence. What was your intent? Dicklyon (talk) 17:15, 1 July 2012 (UTC)

In the Description portion of the page, you have "Note that in typical models to which EM is applied:" and list that the set of observed data points X can be discrete or continuous. Then you say that the missing variables Z are discrete, and there is one for each observed data point. This is not possible if X is continuous and Z is discrete. Also, the wording "typical models" is ambiguous. Is that saying that it will not work when Z is continuous? I think (at least in this section) you should have a formal definition of the algorithm. It would actually be nice to have it in pseudocode like most CS algorithms are generally given. MindAfterMath (talk) 10:48, July 19, 2012 (UTC)


 * You are misreading the (poorly explained) notation ... X and Z refer to the whole dataset at once, not per case. It seems to be saying that in this "typical case" the potentially observed variables may be either continuous or discrete, but that only the discrete ones might be missing in this supposedly "typical case". But X and Z sort the potentially observable data into two separate sets: those actually observed, X, and those missing, Z. The "typical case" seems to correspond to a "mixture model", where an observed value (in X, continuous or discrete) may have arisen from one of a number of populations (labelled by a discrete value which is either in X if observed, or in Z if missing), in unknown proportions, with each sub-population having its own density or mass function with its own parameters ... and in this "typical case" the label is always missing: see mixture model, where this instance of EM is discussed. In the general case, parts or all of Z would be continuous. As to pseudocode, it seems impossible to do this for the general description of EM covered in this article, as the important steps need to be converted into an implementation by thinking mathematically and statistically, and applying algebra. If you had a specific application of EM in mind, there might be pseudocode in the corresponding article. Melcombe (talk) 12:17, 21 July 2012 (UTC)

Accessibility
This article sounds like written by mathematicians for mathematicians. Learning curve is too steep. Since the algorithm is used by widespread computer library like OpenCV, it would be useful to add concrete, practical and simple examples. -- O.C.

The EM solution to the Gaussian mixture model, presented here, is about as simple as it gets. The only simplification that is possible is to use univariate instead of multivariate normal RV, but I suspect you would still have the same objections. Fact of the matter is, the EM algorithm is conceptually difficult. Without at least a year or two of mathematical statistics under one's belt, I don't think it's reasonable to expect to develop a deep intuitive understanding of the EM algorithm in its general form. This is my general reaction to the preponderance of "this math article is too hard" comments about technical topics on wikipedia. An explanation of the EM algorithm (and many other technical topics) *not* in mathematical terms would be so vague that it would be essentially meaningless. (I'm not an editor, but those are my two cents.) Hawthoreus (talk) 21:33, 22 January 2014 (UTC)

Incorrect formula in Gaussian Mixture Example?
Shouldn't the $$\tau_j$$ be omitted from the fourth indented equation, section 12.1, since the likelihood is conditional on observing $$\mathbf{z}$$? This would make sense since $$f(x_i;z_i,\theta) = \sum_{j=1}^2 \mathbb{I}(z_i=j) \phi(x_i;\mu_j,\sigma_j)$$, with $$\phi$$ the normal density function with given parameters $$\mu_j$$ and $$\sigma_j$$. With $$\tau_j$$ it isn't a true density. Hawthoreus (talk) 21:15, 22 January 2014 (UTC)

Should the number of latent variables k be indexed by j and not n, since it need not be the same as the n of x? Bc pitt (talk) 16:36, 17 February 2015 (UTC)

Why is a "citation needed" for the statement that the derivative is zero at an inflection point?
The statement "that the derivative of the likelihood is (arbitrarily close to) zero at that point, which in turn means that the point is either a maximum or a saddle point" is flagged with a "citation needed," but this property of smooth functions is, I believe, covered in secondary school calculus classes. I think that the "citation" flag detracts from the discussion for no practical effect, and should be removed.

PCT4671 (talk) 03:01, 18 March 2014 (UTC)


 * An inflection point is where the second derivative is zero. Where the derivative is zero, it's a min or max or saddle point.  A ref for the contents of that paragraph would be useful, so we could rewrite it more clearly, and back up the "in fact it can be proven that in this particular context ..." whatever it's trying to say.  Dicklyon (talk) 03:10, 18 March 2014 (UTC)

Requested move

 * The following is a closed discussion of a requested move. Please do not modify it. Subsequent comments should be made in a new section on the talk page. Editors desiring to contest the closing decision should consider a move review. No further edits should be made to this section. 

The result of the move request was: no consensus to move the page at this time, per the discussion below. Dekimasu よ! 00:55, 12 November 2014 (UTC)

Expectation–maximization algorithm → Expectation-maximization algorithm – "Expectation-maximization" is a compound word and should therefore use a hyphen, not an en dash as is currently the case. There already exists an article Expectation-maximization algorithm, though, otherwise I would have just moved the article directly. This other article needs to be removed first. —Kri (talk) 13:32, 5 November 2014 (UTC)
 * Not so sure about that: The algorithm has two distinct steps for each iteration – an expectation step and a maximization step. A simple hyphenation would seem more a description of an algorithm for maximizing an expectation rather than an algorithm involving repeated operation of an expectation step and a maximization step. —BarrelProof (talk) 20:24, 5 November 2014 (UTC)
 * I wasn't quite sure until I saw Barrel's point, which strongly suggests an en dash. "Expectation" does not modify "maximization". The algorithm has alternating expectation and maximization steps. It's a valuable example of the clarifying purpose of the en dash, as set out at WP:DASH. The en dash gets a reasonable showing in Google Books. Tony (talk) 01:12, 7 November 2014 (UTC)


 * The above discussion is preserved as an archive of a requested move. Please do not modify it. Subsequent comments should be made in a new section on this talk page or in a move review. No further edits should be made to this section.

"Alternatives to EM" needs work
The final section of the article, "Alternatives to EM" is poorly written and reads like an advertisement, citing three papers by only a single group of researchers. I think this section should be rewritten with more citations from a broader array of groups, or this section should be removed.

References for Alternatives section
Apparently, in response to a previous comment, the Alternatives section was re-written, but now there are no links to information about the alternatives.

I would be grateful if the section either hyperlinked the alternatives to other articles in Wikipedia, or provided a citation to find references.

Thanks Gwis999 (talk) 17:27, 24 January 2019 (UTC)