Wikipedia:Reference desk/Archives/Mathematics/2011 July 23

= July 23 =

== Estimating population standard deviation in the limit of small numbers ==
I have a problem where I need to measure the standard deviation of a population that is difficult to sample, and so I would like to use as few samples as I can practically get away with.

In the limit of large numbers, and assuming that the underlying population is normally distributed, I know that the appropriate estimates are:

$$\bar y = {\sum{y_i} \over N}$$

$$s(y) = \sqrt {\sum{(y_i - \bar y)^2} \over N-1}$$

where the standard error of the standard deviation is

$$SE( s(y) ) = {s(y) \over \sqrt{2 N}}$$

This implies that if I had 200 samples, I would expect to know the standard deviation to within about 5%.

But this is derived in the limit of large N. I would like to know how the uncertainty might change in the limit of small N (e.g. N = 5 or 10). Applied as is, the formulas suggest that at N = 5 the estimate of the population standard deviation will have an error of about 30% in the typical case. But is that really true, or when considering such small numbers would my error be significantly worse than that (and how much worse)?
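
For reference, the arithmetic behind the 5% and 30% figures can be checked with a short sketch. The function name `relative_se_large_n` is made up for illustration; it simply evaluates the large-N formula above, so it says nothing yet about whether that formula is trustworthy at small N.

```python
import math

def relative_se_large_n(n):
    """Large-N approximation SE(s)/s = 1/sqrt(2N), as quoted above."""
    return 1.0 / math.sqrt(2.0 * n)

# Relative standard error, in percent, for the sample sizes mentioned.
for n in (200, 10, 5):
    print(n, round(100.0 * relative_se_large_n(n), 1))
```

This gives 5% at N = 200 and a little over 30% at N = 5, matching the figures in the question.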

Also, in the limit of small numbers, are there any procedures that can improve the estimate of the population standard deviation? For example, would the interquartile range be less subject to fluctuations? I doubt it, but it is probably worth asking. Dragons flight (talk) 18:03, 23 July 2011 (UTC)
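
One way to probe the interquartile-range question empirically is a small Monte Carlo experiment. The sketch below is only an illustration under stated assumptions: the population is standard normal, the quartiles follow the default ("exclusive") rule of Python's `statistics.quantiles`, and the function name, seed and trial count are arbitrary choices. It compares the coefficient of variation (spread divided by mean) of the two statistics, which does not depend on how the interquartile range is rescaled into a standard deviation estimate.

```python
import math
import random
import statistics

def simulate_cv(n=5, trials=20000, seed=12345):
    """Return (CV of s, CV of IQR) over repeated normal samples of size n."""
    rng = random.Random(seed)
    s_vals, iqr_vals = [], []
    for _ in range(trials):
        y = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mean = sum(y) / n
        # Sample standard deviation with the N-1 divisor.
        s_vals.append(math.sqrt(sum((v - mean) ** 2 for v in y) / (n - 1)))
        # Quartiles by the stdlib's default "exclusive" interpolation rule.
        q1, _median, q3 = statistics.quantiles(y, n=4)
        iqr_vals.append(q3 - q1)

    def cv(values):
        m = sum(values) / len(values)
        var = sum((v - m) ** 2 for v in values) / len(values)
        return math.sqrt(var) / m

    return cv(s_vals), cv(iqr_vals)

cv_s, cv_iqr = simulate_cv()
print(cv_s, cv_iqr)
```

For normal data one would expect the first number to be the smaller of the two, since s is built from the sufficient statistic; for heavy-tailed data the comparison could come out differently.
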
 * A lot of this depends on your assumptions about the distribution of the data, and whether you take a frequentist or Bayesian approach. If Bayesian, there's no easy way out: just choose your prior and do what Bayes says. The result will probably not have a nice closed form.
 * The formula you give is the square root of an unbiased estimator of the variance. It is not, however, an unbiased estimator of the standard deviation. Unbiased estimators of the standard deviation also exist if we assume normality.
 * If our estimator for the standard deviation takes the form $$s=\sqrt{c(N)\sum(y_i-\bar{y})^2}$$, then it is always the case that the variance of s is exactly (if my calculations are correct) $$c(N)(N-1)\sigma^2-\mathbb{E}[s]^2$$. -- Meni Rosenfeld (talk) 08:51, 24 July 2011 (UTC)
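
As a rough illustration of the last two points, assuming a normal population and the usual choice $$c(N) = 1/(N-1)$$: the expectation of s is then $$c_4(N)\,\sigma$$ with $$c_4(N) = \sqrt{2/(N-1)}\;\Gamma(N/2)/\Gamma((N-1)/2)$$, so $$s/c_4(N)$$ is unbiased for $$\sigma$$, and the variance formula above reduces to $$\sigma^2(1 - c_4(N)^2)$$. The sketch below evaluates these quantities; the function names are made up for illustration.

```python
import math

def c4(n):
    """E[s] / sigma for normal samples of size n (N-1 divisor in s)."""
    return math.sqrt(2.0 / (n - 1)) * math.exp(
        math.lgamma(n / 2.0) - math.lgamma((n - 1) / 2.0)
    )

def relative_sd_of_s(n):
    """Exact SD(s) / E[s] under normality: sqrt(1 - c4^2) / c4."""
    c = c4(n)
    return math.sqrt(1.0 - c * c) / c

def relative_se_large_n(n):
    """Large-N approximation 1 / sqrt(2N) from the question."""
    return 1.0 / math.sqrt(2.0 * n)

for n in (5, 10, 200):
    print(n, c4(n), relative_sd_of_s(n), relative_se_large_n(n))
```

Under these assumptions the exact relative spread at N = 5 comes out at roughly 0.36, compared with about 0.32 from the large-N formula, so the small-sample error is somewhat worse than the asymptotic expression suggests but not dramatically so.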