Wikipedia:Reference desk/Archives/Mathematics/2014 June 23

= June 23 =

Basic statistics
I have two basic statistics questions: I'm hoping the answers will be fairly simple, but starting at the Statistics article doesn't really give me any hints on how to proceed. If it's possible to determine the answers without knowing N (other than knowing N >> n), that would be a big advantage. Tevildo (talk) 00:18, 23 June 2014 (UTC)
 * 1) I have a bag containing N balls, which can be black or white. I take a sample of size n, which contains a white balls and b black balls.  What test do I use to enable me to say "There's a 95% probability that the percentage of white balls in the bag is between x and y"?
 * 2) I have two bags. I take a sample of size n from each:  the sample from the first bag contains a1 white balls and b1 black balls, and the sample from the second bag contains a2 white balls and b2 black balls.  What test enables me to say "There's a 95% probability that the percentage of white balls in Bag 1 is different to the percentage in Bag 2"?
 * See . Bo Jacoby (talk) 03:55, 23 June 2014 (UTC).
 * That page seems to require a Facebook login, so I can't access it. Tevildo (talk) 18:17, 23 June 2014 (UTC)
 * It appears that you could sign up without a Facebook or Google account, but I'd be amazed if there isn't an equally good discussion somewhere that doesn't require membership of any kind. —Tamfang (talk) 07:56, 24 June 2014 (UTC)
 * It's not accurate to say "There's a 95% probability that the percentage of white balls in the bag is between x and y" because it either is in that interval or not. What you really want to say is that "95% of the possible samples of size n will have a sample proportion (of white balls) in this interval." You cannot test that unless your sample is a simple random sample. You seem to be looking for a confidence interval or z-test for the proportion, for which the standard error is $$\sigma=\sqrt{\frac{p(1-p)}{n}}$$, where p is the proportion you want to test. The sample size should be reasonably large in both cases.--Jasper Deng (talk) 06:12, 23 June 2014 (UTC)
 * Thanks - how is the standard error interpreted? For example, let's say n = 100 and a = 20, so p (in your equation) is 0.2.  This gives us a value of sigma of 0.04.  Does this mean that 98% of samples from the bag will have between 16 and 24 white balls, or have I misinterpreted it? Tevildo (talk) 18:17, 23 June 2014 (UTC)
 * You have to take the inverse of the cumulative distribution function to construct the interval. In the case of 95%, you want z-scores (number of standard deviations) of approximately plus or minus 1.96 (as under these particular conditions, the sampling distribution is approximately normal). In this case, that means your interval is from .12 to .28, which 95% of samples of size 100 from your population will fall in.
 * I do not know how to justify why, but this holds because you have both a and 100-a greater than or equal to ten and 100 is less than or equal to 10% of the population size.--Jasper Deng (talk) 18:58, 23 June 2014 (UTC)
 * Thanks again. So sigma in your equation is the standard deviation, and two standard deviations (which is what I really need) is ±8.00 balls in this case? Question 1 is answered.  Any hints on Question 2?  Apologies for not really knowing enough about the subject to fully appreciate the nuances. Tevildo (talk) 19:34, 23 June 2014 (UTC)
 * A quick follow-up - what if a _is_ less than 10? The actual data I have to deal with have p = 0.23, but it's not impossible that it might be around 0.05 for a different test. Tevildo (talk) 19:36, 23 June 2014 (UTC)
 * Firstly, the answer to question 2 is technically that there is zero probability of the two percentages being exactly the same, because that is only one of uncountably many values of the difference. Instead you want to ask "what is the probability that the difference between the two is within a set error bound?". Now we can answer this with a two-proportion z-test; the random variable $$p_1-p_2$$, the difference of the sample proportions, is approximately normally distributed since the original sample proportions are approximately normally distributed, under the right conditions (they are basically the same as the ones I outlined in the previous comment). You then test the null hypothesis $$p_1=p_2$$ (that is, $$p_1-p_2=0$$) against the alternative hypothesis that they are not equal. Here $$\sigma=\sqrt{\frac{p_1(1-p_1)}{n_1}+\frac{p_2(1-p_2)}{n_2}}$$ where the two denominators are the respective sample sizes. With this standard deviation, you apply the cumulative distribution function to find the probability that a randomly chosen sample's difference will be in your "tolerable interval".--Jasper Deng (talk) 19:45, 23 June 2014 (UTC)
 * And if the size conditions aren't met, the test won't necessarily fail, but the normal approximation may no longer be appropriate.--Jasper Deng (talk) 20:06, 23 June 2014 (UTC)
 * That's great, thanks, the boss will be happy. Is there a more appropriate test to use when p < 0.1?  This isn't so critical, but I'll get a few brownie points if I can introduce it. Tevildo (talk) 20:18, 23 June 2014 (UTC)
 * p<.1 is fine as long as np and n(1-p) are large enough (i.e. make n bigger). There are other, better tests to use, but I don't know them (I have only learned introductory-level statistics). It is extremely important that you check the other conditions too: the sample must be a simple random sample and the population has to be at least ten times the sample size, otherwise this could produce invalid results. Importantly, I was not told why these rather arbitrary conditions are required, so I can say nothing about what happens when they are violated. Also, if this is a business decision, you probably want more than just elementary statistics.--Jasper Deng (talk) 20:36, 23 June 2014 (UTC)
 * Jasper, see Bayesian probability. The percentage of white balls is either in the range or not, but we don't know which, so we can talk about our subjective probability of it being true.
 * That said, no prior about the bag is given, so we need to be careful about how we arrive at that subjective probability. -- Meni Rosenfeld (talk) 19:10, 24 June 2014 (UTC)
 * I am sure that the answers given will make Tevildo's boss happy; most bosses are clueless about statistics. But in my view they are pretty much meaningless. To answer the questions properly, we need to consider how the bag got filled. One possibility: it was filled by taking balls at random from a large store containing equal numbers of black and white balls. Another possibility: it was delivered by a supplier, which supplies bags of three kinds: all black, all white, and 70 black 30 white. There are of course countless other possibilities. Maproom (talk) 21:22, 23 June 2014 (UTC)
 * Well, if that's an important factor, the bags were filled (at random) from a large store containing A white balls and B black balls, and we want to estimate what A / (A + B) is, knowing a and b (the number of balls in the sample). And many employees are clueless about statistics, as well. ;) Tevildo (talk) 21:59, 23 June 2014 (UTC)
 * That, and our knowledge of N, make it into a meaningful question. And here is some serious advice: it will be better for your career prospects to give a simple answer which your boss thinks he can understand, than a correct one which he knows he can't. Maproom (talk) 23:01, 23 June 2014 (UTC)
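
The two normal-approximation formulas given in the replies above can be sketched in Python as follows. This is only an illustration under the stated conditions (simple random sample, np and n(1-p) reasonably large, N much larger than n). The function names and the second-bag figure of 35 white balls are made up for the example, and the test statistic uses the unpooled standard error exactly as written above; many textbooks pool the two samples under the null hypothesis instead.

```python
import math

Z95 = 1.959964  # two-sided 95% quantile of the standard normal distribution


def proportion_ci(a, n, z=Z95):
    """Normal-approximation interval for a proportion: a white balls in n draws."""
    p = a / n
    se = math.sqrt(p * (1 - p) / n)  # standard error sqrt(p(1-p)/n)
    return p - z * se, p + z * se


def two_proportion_z(a1, n1, a2, n2):
    """z statistic and two-sided p-value for H0: p1 == p2 (unpooled standard error).

    Assumes neither sample proportion is exactly 0 or 1 in both samples,
    otherwise the standard error is zero and the statistic is undefined.
    """
    p1, p2 = a1 / n1, a2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = (p1 - p2) / se
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|z|))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Question 1 with the figures used in the thread: n = 100, a = 20
low, high = proportion_ci(20, 100)
print(f"95% interval for the proportion: {low:.4f} to {high:.4f}")

# Question 2 with a hypothetical second sample containing 35 white balls
z, p_value = two_proportion_z(20, 100, 35, 100)
print(f"z = {z:.3f}, two-sided p-value = {p_value:.4f}")
```

With n = 100 and a = 20 the interval comes out at roughly .12 to .28, matching the figures quoted above. A small p-value in the second calculation is evidence against the two bags having the same proportion; it is not the probability that the proportions differ.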

Sorry that you cannot access the link. Try this dropbox link instead.

Your first basic question, with a=20 white balls in the sample, b=80 black balls in the sample, n=a+b=100 balls in the sample, N=100000 balls in the population, has the answer
    20 80 (ci@induce%]) 100000
 0.12624 0.28551
 0.71449 0.87376
meaning that the percentage of white balls in the population is, with 95% confidence, between 13% and 29%, and the percentage of black balls in the population is, with 95% confidence, between 71% and 87%.

Your second basic question, with n=100 balls in each sample, a1=20 white balls in the first sample, b1=80 black balls in the first sample, has the answer
    20 80 ci@predict 100
 10 31
 69 90
meaning that the number of white balls in the second sample is, with 95% confidence, between 10 and 31, and the number of black balls in the second sample is, with 95% confidence, between 69 and 90.

The four J-programs used in these calculations are the following.
 deduce %~`*`:3"2@(,: (%:@* -.))@(+/@[ %~ 1, ,:)
 predict (deduce~-@>:)~
 induce (,:0:)+[predict(-+/)~
 ci {.(([:>.0:>.-),.[:<.+/@[<.+)+:@{:
Bo Jacoby (talk) 22:55, 23 June 2014 (UTC).

To Meni's warning: "no prior about the bag is given, so we need to be careful about how we arrive at that subjective probability". The number of white balls may possibly be 0,1,2,3,...,N. Without further information these N+1 possibilities are equally credible. The computed result is not subjective. Given the same information different people will obtain the same answer. Bo Jacoby (talk) 10:27, 25 June 2014 (UTC).
 * Statistics does not completely rule out the real proportion being outside the confidence interval; it only says that the proportion is unlikely to lie outside the interval.
 * But I must say that the test I did (and evidently Bo Jacoby did too, since we got the same result) relies on the sampling distribution, which is normal under suitable conditions (which however includes N at least ten times the size of n, for some reason), notably not including anything about the population's distribution. Naturally if the number of white balls were 0 or N, then we would have a problem since the concept of a confidence interval breaks down (as the sample proportion will always be 0 or 1, so the standard error of the sampling distribution becomes 0).--Jasper Deng (talk) 18:22, 25 June 2014 (UTC)
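
The breakdown described above (a standard error of 0 when the sample proportion is 0 or 1) is specific to the simple normal-approximation interval. One commonly used alternative for small proportions, which nobody in this thread proposed and which is shown here only as a sketch, is the Wilson score interval. It still assumes a simple random sample with N much larger than n. The function name and the example counts are chosen for illustration.

```python
import math


def wilson_interval(a, n, z=1.959964):
    """Wilson score interval for a proportion: a white balls in a sample of n."""
    p = a / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half_width, centre + half_width


# Even with no white balls in the sample, the interval has positive width
lo0, hi0 = wilson_interval(0, 100)
print(f"a = 0, n = 100: {lo0:.4f} to {hi0:.4f}")

# A small proportion such as the p = 0.05 case asked about earlier
lo5, hi5 = wilson_interval(5, 100)
print(f"a = 5, n = 100: {lo5:.4f} to {hi5:.4f}")
```

Unlike p ± 1.96σ, this interval never extends below 0 or above 1, which is one reason it is often preferred when a is small.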

It seems as if Jasper used Frequentist inference. I used Bayesian inference. The conditional credibility (K|k) that the bag (of N balls) contained K white balls when a random sample (of n balls) contained k white balls is computed as
 * $$(K|k)=(k|K){(K|0)\over(k|0)}$$

where
 * $$(K|0)={1 \over N+1}$$

is the unconditional credibility that the bag (of N balls) contained K white balls, and
 * $$(k|0)={1 \over n+1}$$

is the unconditional probability that a sample (of n balls) contains k white balls, and
 * $$(k|K)= {\binom{K}{k}\binom{N-K}{n-k}\over \binom{N}{n}}$$

is the conditional probability that a random sample (of n balls) contains k white balls, assuming that the bag (of N balls) contained K white balls. So the conditional credibility is
 * $$(K|k)={\binom{K}{k}\binom{N-K}{n-k}\over \binom{N}{n}}{{1 \over N+1}\over{1 \over n+1}}={\binom{K}{k}\binom{N-K}{n-k}\over \binom{N+1}{n+1}}$$

The mean value μ and the standard deviation σ are computed from the equations
 * $$\mu=\sum_K (K|k)K$$
 * $$\mu^2 + \sigma^2=\sum_K (K|k)K^2$$

A 95% confidence interval is approximately μ±2σ. Bo Jacoby (talk) 23:04, 25 June 2014 (UTC).
 * Bo, as you surely remember, we've had this argument countless times, and I doubt we'll resolve it now; I'll just do my part and say: Assuming a uniform prior on the number of white balls is reasonable, but is far from being an obvious choice. Especially considering that in practice, more information about the bag is probably already available and can be revealed by simply asking. -- Meni Rosenfeld (talk) 17:31, 26 June 2014 (UTC)

Yes, my friend, I do remember! If some possibility (say K=7) should be considered less credible than another possibility (say K=8), then it reflects some information about the bag which is not present a priori. So the uniform prior on the number of white balls, (K|0)=1/(1+N) for 0≤K≤N, is indeed the obvious choice. In practice the only information available is obtained by sampling. Bo Jacoby (talk) 19:38, 26 June 2014 (UTC).
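
For readers who do not use J, the uniform-prior calculation described in this thread can be evaluated directly in Python. The following is a minimal sketch, not a translation of the J programs: it sums the unnormalised weights $$\binom{K}{k}\binom{N-K}{n-k}$$ over K, normalises them, and reports μ±2σ as a fraction of N. The function name is made up for the example, and whether a uniform prior is appropriate is exactly the point disputed above.

```python
import math


def posterior_summary(N, n, k):
    """Posterior mean and standard deviation of K, the number of white balls in a
    bag of N, given k white balls in a random sample of n and a uniform prior on K."""
    total = 0   # sum of the weights C(K,k) * C(N-K,n-k)
    first = 0   # sum of K * weight
    second = 0  # sum of K^2 * weight
    # K must be at least k and leave room for the n-k black balls in the sample
    for K in range(k, N - (n - k) + 1):
        w = math.comb(K, k) * math.comb(N - K, n - k)
        total += w
        first += K * w
        second += K * K * w
    mean = first / total
    # Exact integer arithmetic for the variance numerator avoids cancellation error
    variance = (second * total - first * first) / (total * total)
    return mean, math.sqrt(variance), total


# Figures from the thread: N = 100000, n = 100, k = 20 white balls in the sample
N, n, k = 100000, 100, 20
mu, sigma, _ = posterior_summary(N, n, k)
lo, hi = (mu - 2 * sigma) / N, (mu + 2 * sigma) / N
print(f"approximate 95% interval for the white fraction: {lo:.5f} to {hi:.5f}")
```

The result agrees with the 0.12624 to 0.28551 quoted earlier up to the rounding of ball counts to whole numbers. The loop performs about N big-integer multiplications, so it is meant for checking the formulas rather than for very large N.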