Talk:Pooled variance

The .pdf linked to in the external links section of "Pooled Variance" is a dead link. I'm going to delete it.

One comment: why is the pooled variance based on a weighted mean of the *degrees of freedom*? A simple weighted mean should be based on the n_i, not the (n_i - 1)? 24.203.205.14 (talk) 23:56, 13 May 2008 (UTC) cousined.

Text from Talk:pooled standard deviation
I don't have a source for "n-1 is used instead of n for the same reason it may be used in calculating standard deviations from samples.", but it seemed like it needed explaining and that seems to be why: to cancel out the (n-1) in the denominator of the sample variances (squared standard deviations, which are often more straightforward to compute with). Also, in terms of variances, isn't this operation just the weighted mean of the samples' variances, with the weights taken to be (n_i - 1)? I suppose, since the purpose is estimating a parameter σ (the thing which I called the "true standard deviation"), it makes sense to assume that the ( ÷ (n-1)) version of standard deviation is being used, rather than the ( ÷ n) one that fulfills the conditions in Standard deviation... Help appreciated from anyone who has more of a clue than me :-)
 * --Isaac Dupree (talk) 00:13, 18 February 2007 (UTC)

63.81.122.66 (talk) 12:34, 29 July 2008 (UTC) We use n-1 because the system has one fewer degree of freedom. For instance, given n data points and their average (mean) m, all I need for complete information is n-1 data points and the mean, since I can compute the missing data point from the other n-1 points and the mean. So to keep the data independent, we need to count only the independent (random) data. That's all I know. By the way, I know that variance would be more precise terminology, but the people who are looking for this information, like me, aren't statisticians, so I vote to keep it as is.
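
The degrees-of-freedom argument above can be checked with a minimal sketch. The sample values and the helper name `recover_missing` are made up for illustration; the point is only that the last data point is fully determined by the other n-1 points and the mean.

```python
from statistics import mean

def recover_missing(known_points, sample_mean, n):
    """Return the one missing value, given the other n-1 values and the mean of all n."""
    # The n values sum to n * mean, so the missing one is whatever is left over.
    return n * sample_mean - sum(known_points)

data = [2.0, 4.0, 9.0, 5.0]          # hypothetical sample
m = mean(data)
recovered = recover_missing(data[:-1], m, len(data))
print(recovered, data[-1])           # the two values agree
```

Once the mean is fixed, only n-1 of the deviations from it can vary freely, which is why the sum of squared deviations is divided by n-1.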

I made some minor changes on the issue (updated the link and added the notion of Bessel's correction). --NMeden (talk) 16:01, 25 November 2009 (UTC)

Variance and standard deviation are different. Mbaha (talk) 20:50, 15 September 2008 (UTC)

Equation has mistake...?

 * $$s_p^2=\frac{(n_1 - 1)s_1^2+(n_2 - 1)s_2^2+\cdots+(n_k - 1)s_k^2}{n_1+n_2+\cdots+n_k - k}$$

should be:
 * $$s_p^2=\frac{(n_1 - 1)s_1^2+(n_2 - 1)s_2^2+\cdots+(n_k - 1)s_k^2}{n_1+n_2+\cdots+n_k}$$

Correct? TFJamMan (talk) 20:36, 26 May 2010 (UTC)

I believe the original formula as stated above is correct. Why? Because a fundamental property of weights is that they must total to 1. Sorry, I don't have time to figure out the math markup language used here, but the weights in the proposed revision would total 1 - k/SUM(n_i), not 1. Ipscheer (talk) 21:53, 16 July 2010 (UTC)
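
For reference, the weight sums for the two denominators can be written out as follows, using N for the total sample size (a symbol introduced here, not used in the article):

 * $$\sum_{i=1}^{k}\frac{n_i-1}{N-k}=\frac{N-k}{N-k}=1,\qquad N=n_1+n_2+\cdots+n_k$$
 * $$\sum_{i=1}^{k}\frac{n_i-1}{N}=\frac{N-k}{N}=1-\frac{k}{N}$$

So only the denominator $$N-k$$ makes the expression a proper weighted mean of the $$s_i^2$$.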

Attention
Besides being poorly written as regards statistical phraseology, the article lacks important detail about the statistical context to which this material applies. Thus we have "maximum likelihood" in a heading with no mention of any distribution being assumed, no mention of use in testing, etc. Melcombe (talk) 09:55, 29 October 2010 (UTC)

The equation is wrong
Using the sample case provided, the square of sigma is 2.76. If we calculate the square of sigma directly, the answer is 74.

The source of error comes from the square term when we calculate sigma. — Preceding unsigned comment added by McMEM (talk • contribs) 01:43, 26 January 2011 (UTC)


 * The source of the difference actually comes from this article being about pooled variance, which *estimates* the total variance (ignoring differences in the mean). This yields 2.76 in this example, which, as you stated, is far from the true variance of 73.95. If you want to calculate the exact variance of the whole data set, you also need the means of the subsets. You can use the following equation:


 * $$s_p^2=\frac{(n_1 - 1)s_1^2+(n_2 - 1)s_2^2+\cdots+(n_k - 1)s_k^2+n_1(\bar{x_1} - \bar{x_p})^2+n_2(\bar{x_2} - \bar{x_p})^2+\cdots+n_k(\bar{x_k}-\bar{x_p})^2}{n_1+n_2+\cdots+n_k - 1}$$


 * The symbols $$\bar{x_p}$$ and $$\bar{x_1}\cdots\bar{x_k}$$ denote the mean of the complete data set and the means of the groups/subsets respectively. Remember you can easily calculate the mean of the whole data set $$\bar{x_p}$$ from the individual means of the subsets:


 * $$\bar{x_p}=\frac{n_1\bar{x_1}+n_2\bar{x_2}+\cdots+n_k\bar{x_k}}{n_1+n_2+\cdots+n_k}$$
 * Sources:, . Please note that those tutorials are calculating the uncorrected variance, whereas I am applying Bessel's correction in the equation above. Choose the one which fits your use of corrected vs. uncorrected variance.
 * Anjoschu (talk) 08:29, 5 July 2013 (UTC)
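
The two quantities discussed above can be compared with a minimal sketch. The data are hypothetical (not the article's example), and the function names `pooled_variance` and `combined_variance` are chosen here for illustration. The second function follows the Bessel-corrected equation given above and uses only the group sizes, means and variances.

```python
from statistics import mean, variance

def pooled_variance(groups):
    """Weighted mean of the sample variances, with weights n_i - 1."""
    numerator = sum((len(g) - 1) * variance(g) for g in groups)
    denominator = sum(len(g) for g in groups) - len(groups)
    return numerator / denominator

def combined_variance(groups):
    """Bessel-corrected variance of the whole data set, from group summaries only."""
    ns = [len(g) for g in groups]
    means = [mean(g) for g in groups]
    variances = [variance(g) for g in groups]
    total = sum(ns)
    # Mean of the complete data set, from the group means.
    grand_mean = sum(n * m for n, m in zip(ns, means)) / total
    within = sum((n - 1) * v for n, v in zip(ns, variances))
    between = sum(n * (m - grand_mean) ** 2 for n, m in zip(ns, means))
    return (within + between) / (total - 1)

# Hypothetical groups with clearly different means.
groups = [[1.0, 2.0, 3.0], [10.0, 12.0, 14.0, 16.0]]
all_points = [x for g in groups for x in g]

print(pooled_variance(groups))        # ignores the difference between group means
print(combined_variance(groups))      # includes the spread of the group means
print(variance(all_points))           # direct calculation, agrees with the line above
```

When the group means differ, the pooled variance is much smaller than the variance of the combined data, which is the discrepancy raised at the top of this section.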

Exact Pooled Variance
Is there any merit to mentioning the "exact pooled variance", which is a slightly different animal, being the variance of an entire pooled data set? In other words, assuming that the two data sets are sampled from the same population (such as for repetitions of the same experiment), we can take the variance of the combined sample to get a more refined estimate of the population variance. It is a trivial task to calculate when you have all the underlying data points, but as pointed out by Rudmin [] it can also be calculated very easily with just the means, variances and sample sizes from the individual samples. The simple formula is "the (weighted) mean of the variances plus the (weighted) variance of the means". Would this make sense to have its own page, linked to from here, or a disambiguation page? — Preceding unsigned comment added by TigreGeek (talk • contribs) 20:59, 11 September 2015 (UTC)
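
A minimal sketch of the "mean of the variances plus variance of the means" rule follows. It assumes the weights are the sample sizes n_i and that all variances are uncorrected (divided by n); this is the law of total variance applied to the pooled data and may differ in detail from Rudmin's presentation. The function name and the data are hypothetical.

```python
from statistics import fmean, pvariance

def exact_pooled_variance(ns, means, pvars):
    """Uncorrected variance of the pooled data from per-sample sizes, means and variances."""
    total = sum(ns)
    grand_mean = sum(n * m for n, m in zip(ns, means)) / total
    # Weighted mean of the per-sample (uncorrected) variances.
    mean_of_variances = sum(n * v for n, v in zip(ns, pvars)) / total
    # Weighted (uncorrected) variance of the per-sample means.
    variance_of_means = sum(n * (m - grand_mean) ** 2 for n, m in zip(ns, means)) / total
    return mean_of_variances + variance_of_means

# Hypothetical samples; only their summaries are passed to the function.
samples = [[1.0, 2.0, 3.0], [10.0, 12.0, 14.0, 16.0]]
ns = [len(s) for s in samples]
means = [fmean(s) for s in samples]
pvars = [pvariance(s) for s in samples]

result = exact_pooled_variance(ns, means, pvars)
direct = pvariance([x for s in samples for x in s])
print(result, direct)                 # the two values agree up to rounding
```

To get the Bessel-corrected version instead, multiply the result by N/(N-1), where N is the total number of data points.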

I think there is. I spent half a day trying to figure out how to calculate the "exact pooled variance," in part because I couldn't even figure out what it was called in the first place. (I should've just come to the talk section here...) Rudmin's term "exact pooled variance" is alternatively referred to as the "combined variance" and "joint variance" by other sources. I'm not knowledgeable or skilled enough to write the section, however.--Dysquist (talk) 17:25, 11 July 2017 (UTC)

Role of assumption of normality needs to be discussed
The article doesn't mention any assumption of normality and is written as if the expression given for the pooled variance were independent of the distribution and followed from general principles. At one point, it even claims that the use of the weighting factors $$n_i-1$$ comes from Bessel's correction. This is wrong. Whereas Bessel's correction is required to get an unbiased estimator of the variance of any distribution, without assumption of normality, the expression for the pooled variance can only be derived as the minimum variance estimator and maximum likelihood estimator of the variance under the assumption of normality (see this math.stackexchange thread). The weight factors that minimize the variance of the resulting variance estimator depend on the fourth central moment, and the result is $$n_i-1$$ only for the normal distribution. This should be mentioned here. Joriki (talk) 16:03, 26 May 2020 (UTC)
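
For reference, a short sketch of the calculation behind this, using the standard result for the variance of the Bessel-corrected sample variance of n i.i.d. observations with variance $$\sigma^2$$ and fourth central moment $$\mu_4$$ (the symbols $$w_i$$ for the weights are introduced here):

 * $$\operatorname{Var}(s^2)=\frac{1}{n}\left(\mu_4-\frac{n-3}{n-1}\sigma^4\right)$$
 * For the normal distribution $$\mu_4=3\sigma^4$$, so $$\operatorname{Var}(s_i^2)=\frac{2\sigma^4}{n_i-1}$$.
 * Among unbiased combinations $$\sum_i w_i s_i^2$$ with $$\sum_i w_i=1$$ of independent samples, the variance is minimized by $$w_i\propto 1/\operatorname{Var}(s_i^2)$$, which gives $$w_i\propto n_i-1$$ in the normal case only.

For other distributions the optimal weights depend on $$\mu_4$$ through the first expression.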