Wikipedia:Reference desk/Archives/Mathematics/2009 July 24

= July 24 =

== population proportion ==
When we make inferences about one population proportion, what assumptions do we need to make? Mark all that apply.


 * a. Random samples.
 * b. Normal distribution of the response variable.
 * c. The sample size is 30 or greater.
 * d. Counts of successes and failures at least 15 each.
 * e. Counts of successes and failures at least 5 each.

Well, I do assume simple random sample (A). And since data is categorical (yes/no), it's not normally distributed (so not B). The sample size (n) of 30 is a population mean/sample mean assumption (so not C). But what about D or E? —Preceding unsigned comment added by 70.169.186.78 (talk • contribs) 05:34, 24 July 2009


 * Consider a population of $$\scriptstyle N$$  items of which $$\scriptstyle I$$  are special. Take a sample of $$\scriptstyle n$$  items of which $$\scriptstyle i$$  are special. This can be done in $$\scriptstyle\binom I i \binom {N-I}{n-i}$$ ways. Dividing by the total number of possible samples, $$\scriptstyle\binom N n$$, gives the well-known hypergeometric distribution formula.
 * Deduction is estimating sample information from population data. Knowing $$\scriptstyle N,n,I$$ the mean value of $$\scriptstyle i$$  is $$\scriptstyle \mu = \frac{nI}N$$,  and the variance-to-mean ratio is  $$\scriptstyle \varepsilon  = \frac{(N-n)(N-I)}{N(N-1)}$$, so the estimate is  $$\scriptstyle i\approx f(N,n,I)=\mu\pm\sqrt{\mu\varepsilon}$$.
 * Example: $$\scriptstyle f(2,1,1)=\mu\pm\sqrt{\mu\varepsilon}$$ where $$\scriptstyle \mu = \frac{(1)(1)}{2}=\frac 1 2$$ and $$\scriptstyle \varepsilon  = \frac{(2-1)(2-1)}{(2)(2-1)} = \frac 1 2$$. So  $$\scriptstyle  f(2,1,1)=\frac 1 2\pm \sqrt {(\frac 1 2)(\frac 1 2)} =\frac 1 2\pm \frac 1 2$$.
 * This result is exactly what should be expected: if the population contains two items, ($$\scriptstyle N = 2$$), one of which is special, ($$\scriptstyle I = 1$$), you take a sample containing one item, ($$\scriptstyle n = 1$$), then you don't know whether this selected item is special or not, so the estimate of the number of special items in the sample is $$\scriptstyle i\approx \frac 1 2\pm \frac 1 2$$.
 * Induction (or inference) is estimating population information from sample data. Knowing $$\scriptstyle N,n,i$$ you estimate $$\scriptstyle I\approx F(N,n,i)=-1-f(-2-n,-2-N,-1-i)$$. This formula is exact, for small or big samples, and for small or big populations. The only assumption is that the sample is random.
 * Example: $$\scriptstyle F(1,0,0)=-1-f(-2-0,-2-1,-1-0)=-1-f(-2,-3,-1)=-1-(\mu\pm\sqrt{\mu\varepsilon})=(-1-\mu)\mp\sqrt{\mu\varepsilon}$$ where $$\scriptstyle \mu = \frac{(-3)(-1)}{-2}=-\frac 3 2$$ and $$\scriptstyle \varepsilon  = \frac{((-2)-(-3))((-2)-(-1))}{(-2)((-2)-1)} = -\frac 1 6$$. So  $$\scriptstyle  F(1,0,0)=(-1-(-\frac 3 2))\pm \sqrt {(-\frac 3 2)(-\frac 1 6)} =\frac 1 2\pm \frac 1 2$$.
 * This result is exactly what should be expected: if you take no sample, ($$\scriptstyle n = i = 0$$), and the population contains one item, ($$\scriptstyle N = 1$$), then you don't know whether this item is special or not, so the estimate of the number of special items in the population is $$\scriptstyle I\approx \frac 1 2\pm \frac 1 2$$. Bo Jacoby (talk) 13:30, 24 July 2009 (UTC).
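The two formulas above can be checked numerically. What follows is only a minimal sketch, assuming the expressions for $$\scriptstyle \mu$$ and $$\scriptstyle \varepsilon$$ exactly as written in the post; the function names `deduce` and `induce` are made up for illustration and do not come from the thread. Exact fractions are used so that the negative arguments in the induction formula do not introduce rounding noise.

```python
from fractions import Fraction
from math import sqrt


def deduce(N, n, I):
    """Return (mean, spread) for i, the number of special items in the sample.

    Uses mu = n*I/N and eps = (N-n)(N-I) / (N(N-1)) as stated in the post;
    the spread is sqrt(mu * eps). Not defined for N = 0 or N = 1.
    """
    N, n, I = Fraction(N), Fraction(n), Fraction(I)
    mu = n * I / N
    eps = (N - n) * (N - I) / (N * (N - 1))
    return mu, sqrt(mu * eps)


def induce(N, n, i):
    """Return (mean, spread) for I, the number of special items in the population.

    Applies the substitution from the post: F(N, n, i) = -1 - f(-2-n, -2-N, -1-i).
    The spread is unchanged by the sign flip, since +/- is symmetric.
    """
    mu, spread = deduce(-2 - n, -2 - N, -1 - i)
    return -1 - mu, spread


# The two worked examples from the post.
print(deduce(2, 1, 1))   # mean 1/2, spread 0.5
print(induce(1, 0, 0))   # mean 1/2, spread 0.5
```

Both calls reproduce the $$\scriptstyle \frac 1 2\pm \frac 1 2$$ results worked out by hand above.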

You need (a) or something like it. Let's say you want to estimate the proportion of voters who will vote Republican next week. If you take your sample from the group of Republicans who are meeting in the building next door, you're making a mistake. It doesn't make sense to assume a normal distribution. Each person will either vote Republican or not, so you get either a 0 or a 1, and that's not normally distributed. But you might conclude that the total number who will vote Republican is approximately normally distributed; that depends in part on sample size and in part on how the sample was taken. But it's a logical inference, not an assumption.

For a binary response variable, the question of whether that sum is approximately normally distributed depends not only on sample size, but also on how close the proportion is to either of the two extremes, 0 and 1. And there are ways of drawing inferences when it's not approximately normally distributed and the sample size is small.

One sometimes sees a rough rule of thumb that you shouldn't conclude approximate normality unless you've got at least five outcomes in each of the two categories. I would add that you should use a continuity correction unless the numbers in both categories are pretty big. Michael Hardy (talk) 23:52, 24 July 2009 (UTC)
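To see what the continuity correction buys, one can compare the normal approximation with the exact binomial probability. The following is a rough sketch using only the standard library; the helper names (`counts_ok`, `exact_cdf`, `normal_cdf_approx`) and the example numbers (20 trials, true proportion 0.5, at most 7 successes) are illustrative assumptions, not anything taken from the posts above. The default threshold of 5 is the rule of thumb just mentioned.

```python
from math import comb, erf, sqrt


def counts_ok(successes, failures, threshold=5):
    """Rule-of-thumb check: at least `threshold` outcomes in each category."""
    return successes >= threshold and failures >= threshold


def exact_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k + 1))


def normal_cdf_approx(k, n, p, continuity=True):
    """Normal approximation to P(X <= k), optionally continuity-corrected."""
    mean = n * p
    sd = sqrt(n * p * (1 - p))
    x = k + 0.5 if continuity else k      # the correction shifts k by one half
    z = (x - mean) / sd
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))  # standard normal CDF at z


n, p, k = 20, 0.5, 7
print(counts_ok(k, n - k))                             # 7 and 13 both pass
print(exact_cdf(k, n, p))                              # about 0.13
print(normal_cdf_approx(k, n, p))                      # about 0.13
print(normal_cdf_approx(k, n, p, continuity=False))    # about 0.09
```

In this small example the corrected approximation lands much closer to the exact value than the uncorrected one, which is the point of the advice; with large counts in both categories the two approximations differ very little.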