Wikipedia:Reference desk/Archives/Mathematics/2008 August 1

= August 1 =

Statistics
Hello. I'm a grad student who's been asked to look at some data and come up with some statistics for possible publication, but the data set is small and limited to "present/not present", so I'm not sure how to approach it. To give a little more detail, I have been provided two collections of items, each of which has been attributed with specific features, as follows: My experience with statistics is very limited, and I'm having trouble working with the various statistics articles, so the questions I have are as follows: Honestly, I'm not even sure where to start, so any help you can provide would be extremely helpful. Many thanks. - But I Played One On TV (talk) 15:18, 1 August 2008 (UTC)
 * 1) How would I go about determining the degree of significance of individual features with this data, considering that it's limited to present/absent data as opposed to a range of values?
 * 2) How could I figure out the likelihood and error that, given the value of three features, an item belongs in a particular set?
 * 3) What kind of other useful statistics can I even generate regarding this data?


 * For Q3, one possibility might be odds ratio. For Q1 and Q2, you would need to apply a reasonable statistical model.  The top of the odds ratio article might offer a hint.  Baccyak4H (Yak!) 16:07, 1 August 2008 (UTC)
 * Thanks for the odds ratio pointer! I've been reading all day and definitely learning, albeit slowly. Now then, would I be completely out to lunch to try to calculate the p-values for each feature by doing the following?:
 * Define the null hypothesis as OR=1.0
 * Run the natural log of the odds ratio through the standard logistic function to get a range 0<=f(OR)<=1
 * Use the probability mass function where n = sum of collection members (80), k = n * f(OR), and p = 0.5 (which is f(1.0))
 * Thanks again! - But I Played One On TV (talk) 22:11, 1 August 2008 (UTC)
 * Further, if you find those articles too dense, call someone in your school's stats (or in lieu of that, math) department and ask them. Since you mention a publication is possible, you will probably get someone's attention. :-)  Baccyak4H (Yak!) 16:21, 1 August 2008 (UTC)
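
As a rough illustration of the odds ratio suggested above, here is a minimal Python sketch. The function name `odds_ratio_with_ci` and the counts are hypothetical (they are not the asker's data), and the interval uses the usual normal approximation on the log odds ratio, which is only one of several possible choices and is shaky for small counts.

```python
import math

def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Odds ratio for a 2x2 present/absent table.

    a, b: items in the first collection with / without the feature
    c, d: items in the second collection with / without the feature
    Returns (odds_ratio, lower, upper), where the interval comes from the
    normal approximation on the log odds ratio (an assumption of this sketch).
    """
    if min(a, b, c, d) == 0:
        # With an empty cell the plain odds ratio is 0 or undefined.
        raise ValueError("every cell must be non-zero for this simple formula")
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper

# Hypothetical counts, chosen only for illustration.
or_value, ci_low, ci_high = odds_ratio_with_ci(5, 25, 30, 20)
print(or_value, ci_low, ci_high)
```

If the whole interval lies on one side of 1, the data are hard to reconcile with the null hypothesis OR = 1.0 under this approximation.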

Hello. Assuming that the collections, A and B, are samples from a big population, you want to know about the population based on knowing the sample. The probability of each feature can be estimated with some uncertainty from the sample data. A collection contains n items of which i have some feature. If you knew the probability, x, that a random item of the population has the feature, then you could compute the probability $$p_i(x)$$ that the collection contains i = 0, 1, 2, ..., n items having that feature. This probability is given by the binomial distribution
 * $$p_i(x) = {n\choose i}\cdot x^i \cdot(1-x)^{n-i}.$$

This distribution is summarized by its mean value ± standard deviation:
 * $$i\approx n\cdot x \pm\sqrt{n\cdot x\cdot(1-x)}.$$

This means that if you know the population frequency, x, of some feature, then you can estimate the sample frequency of the feature, which is
 * $$\frac i n\approx x \pm\sqrt{\frac{x\cdot(1-x)}n}.$$
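
The binomial formulas above can be checked numerically. The following is a minimal sketch; the values n = 30 and x = 0.2 are hypothetical and chosen only for illustration.

```python
import math

def binomial_pmf(i, n, x):
    """Probability that exactly i of n items have a feature of frequency x."""
    return math.comb(n, i) * x**i * (1 - x) ** (n - i)

def binomial_mean_sd(n, x):
    """Mean and standard deviation of the count i, as in the formula above."""
    return n * x, math.sqrt(n * x * (1 - x))

n, x = 30, 0.2  # hypothetical sample size and population frequency
pmf = [binomial_pmf(i, n, x) for i in range(n + 1)]

# Mean and standard deviation computed directly from the distribution.
mean_direct = sum(i * p for i, p in enumerate(pmf))
sd_direct = math.sqrt(sum((i - mean_direct) ** 2 * p for i, p in enumerate(pmf)))

mean_formula, sd_formula = binomial_mean_sd(n, x)
print(mean_direct, mean_formula)
print(sd_direct, sd_formula)
```

Dividing both the mean and the standard deviation by n gives the sample frequency form.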

This formula for estimating i knowing x and n is however not what you want. You want a formula for estimating x knowing i and n. The distribution function for x, knowing i and n is still
 * $$p_i(x) = {n\choose i}\cdot x^i \cdot(1-x)^{n-i}$$

apart from an unimportant normalization factor. This distribution function is known as the beta distribution. The mean value of the beta distribution is not $$\frac i n$$, but rather $$\frac {i+1}{n+2}$$, and the standard deviation of the beta distribution is not $$\sqrt{\frac{{\frac i n}\cdot(1-{\frac i n})}n}$$ but rather $$\sqrt{\frac{{\frac {i+1} {n+2}}\cdot(1-{\frac {i+1} {n+2}})}{n+3}}.$$ So the formula you want is
 * $$x\approx \frac {i+1}{n+2} \pm\sqrt{\frac{\frac {i+1}{n+2}\cdot(1-\frac {i+1}{n+2})}{n+3}}.$$
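
A minimal Python sketch of this estimate follows. The function name `estimate_frequency` and the counts passed to it are hypothetical, used only to show the formula at work.

```python
import math

def estimate_frequency(i, n):
    """Estimate the population frequency x from i hits among n items,
    using the mean and standard deviation of the beta distribution above."""
    mean = (i + 1) / (n + 2)
    sd = math.sqrt(mean * (1 - mean) / (n + 3))
    return mean, sd

# Hypothetical counts: a feature seen in 0 of 30 items and in 49 of 50 items.
mean_a, sd_a = estimate_frequency(0, 30)
mean_b, sd_b = estimate_frequency(49, 50)
print(f"{mean_a:.2f} ± {sd_a:.2f}")
print(f"{mean_b:.2f} ± {sd_b:.2f}")
```

Note that a feature never seen in the sample still gets a small non-zero estimate, which is the practical difference from using i/n directly.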

Substituting your data into this formula gives the following estimates for the population frequencies: Now you want to know if the two collections can be believed to come from the same population. Is a number from a distribution 0.03±0.03 likely to be equal to a number from a distribution 0.96±0.03? Compute the difference
 * $$(0.96\pm 0.03)-(0.03\pm 0.03)=(0.96-0.03)\pm\sqrt{0.03^2+0.03^2}\approx 0.93\pm 0.04$$

Is this difference likely to be zero? Zero is about 0.93/0.04 ≈ 23 standard deviations away from the mean value of the distribution. This difference is highly significant.
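
The comparison can be sketched in a few lines of Python. This assumes the two estimates are independent, so that their standard deviations add in quadrature; the helper name `difference` is hypothetical.

```python
import math

def difference(mean_1, sd_1, mean_2, sd_2):
    """Difference of two independent estimates and its standard deviation."""
    return mean_1 - mean_2, math.sqrt(sd_1**2 + sd_2**2)

# The two rounded estimates quoted in the text.
diff, sd_diff = difference(0.96, 0.03, 0.03, 0.03)

# Number of standard deviations separating the difference from zero.
z = diff / sd_diff
print(diff, sd_diff, z)
```

With these rounded inputs the combined standard deviation comes out near 0.042, so zero lies roughly twenty standard deviations from the estimated difference.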

Be warned that different statisticians use different approximations and thus may reach different results. The above approach may not be considered standard by your teacher. Have fun! Bo Jacoby (talk) 23:16, 1 August 2008 (UTC).


 * Hi again! I just wanted to ask you a quick followup question to my original question. How would I go about determining the probability that an element is part of "Collection B" if it has the three most significant features, in this case Feature 1, Feature 2, and an absence of Feature 5 (with individual probabilities of 0.96±0.03, 0.88±0.04, and 0.66±0.08 respectively)? Many thanks for your help! User: But I Played One On TV (talk) 15:47, 11 September 2008 (UTC) (The question is moved to here from my talk page. Bo Jacoby (talk) 17:56, 11 September 2008 (UTC).)

This question has nothing to do with the populations but only with the samples, so the probabilities above are not relevant. You need to go back to the original data and count how many elements in each of the collections A and B have the new Feature 6, which is "Feature 1 and Feature 2 and not Feature 5". Then you get four numbers:
 * A11 = number of elements in collection A having feature 6 (= 0)
 * A12 = number of elements in collection A not having feature 6
 * A21 = number of elements in collection B having feature 6
 * A22 = number of elements in collection B not having feature 6

and the sums
 * A10 = A11+A12 = number of elements in collection A (= 30)
 * A20 = A21+A22 = number of elements in collection B (= 50)
 * A01 = A11+A21 = number of elements having feature 6
 * A02 = A12+A22 = number of elements not having feature 6
 * A00 = A10+A20 = A01+A02 = total number of elements

The probability you ask for is A21/A01. Bo Jacoby (talk) 05:01, 12 September 2008 (UTC).
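
As a small sketch of this counting argument in Python: the collection sizes (30 and 50) and A11 = 0 come from the lists above, while the value used for A21 is hypothetical, as is the function name `probability_in_b`.

```python
def probability_in_b(a11, a12, a21, a22):
    """Probability that a sample element with feature 6 lies in collection B.

    a11, a12: elements of collection A with / without feature 6
    a21, a22: elements of collection B with / without feature 6
    """
    a01 = a11 + a21  # all sample elements having feature 6
    if a01 == 0:
        raise ValueError("no element has feature 6, so the ratio is undefined")
    return a21 / a01

# A11 = 0 and the row sums 30 and 50 are from the thread; A21 = 28 is made up.
p_thread = probability_in_b(0, 30, 28, 22)

# A fully hypothetical table with a non-zero A11, for contrast.
p_other = probability_in_b(3, 27, 28, 22)
print(p_thread, p_other)
```

Because A11 = 0, any positive A21 gives a probability of exactly 1 within the sample.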


 * Fantastic, thank you! If I wanted to include error in this calculation, could I use the values we got from the beta function and propagate those deviations? - But I Played One On TV (talk) 19:57, 15 September 2008 (UTC)

Thanks for the nice words. No, there is no error of calculation. In this case the probability is known. What does it mean? Out of the elements having feature 6 you pick one element at random. The probability that this element is in collection B is exactly A21/A01. Bo Jacoby (talk) 18:48, 16 September 2008 (UTC)
 * Thank you Bo. You're my statistical hero. - But I Played One On TV (talk) 19:34, 16 September 2008 (UTC)

Perhaps you want to consider a random element from the population? Then rename the variables. You must distinguish between the number of elements in the population, A000, inside the sample, A100, and outside the sample, A200, and between the corresponding known sample counts, A111, A112, A121, A122, and the unknown counts outside the sample, A211, A212, A221, A222. (A zero in a digit position in the index indicates a summation over that position.) The probability that a random element having feature 6 is in collection B is A021/A001=(A021/A000)/(A001/A000). Now A021/A000 is the probability that a random element of the population is in collection B and has feature 6, and A001/A000 is the probability that a random element of the population has feature 6. These probabilities are not known but can (assuming A000>>1) be estimated by the beta distribution
 * $$ \frac{A_{021}}{A_{000}}\approx \frac {A_{121}+1}{A_{100}+2} \pm\sqrt{\frac{\frac{A_{121}+1}{A_{100}+2} \cdot(1-\frac{A_{121}+1}{A_{100}+2})}{A_{100}+3}}$$

and
 * $$ \frac{A_{001}}{A_{000}}\approx \frac {A_{101}+1}{A_{100}+2} \pm\sqrt{\frac{\frac{A_{101}+1}{A_{100}+2} \cdot(1-\frac{A_{101}+1}{A_{100}+2})}{A_{100}+3}}$$

Division gives
 * $$ \frac{A_{021}}{A_{001}}\approx \frac{\frac {A_{121}+1}{A_{100}+2} \pm\sqrt{\frac{\frac{A_{121}+1}{A_{100}+2} \cdot(1-\frac{A_{121}+1}{A_{100}+2})}{A_{100}+3}}}{\frac {A_{101}+1}{A_{100}+2} \pm\sqrt{\frac{\frac{A_{101}+1}{A_{100}+2} \cdot(1-\frac{A_{101}+1}{A_{100}+2})}{A_{100}+3}}}
=\frac{ (A_{121}+1) \cdot \sqrt {A_{100}+3} \pm \sqrt{(A_{121}+1)\cdot(A_{100}-A_{121}+1)} } { (A_{101}+1) \cdot \sqrt {A_{100}+3} \pm \sqrt{(A_{101}+1)\cdot(A_{100}-A_{101}+1)} } $$

I don't know any exact formula for the mean and standard deviation of the quotient between two random variables. Bo Jacoby (talk) 21:23, 16 September 2008 (UTC).
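
One way around the missing formula is simulation. The sketch below is not the method used above: it swaps in a Monte Carlo draw from a Dirichlet distribution over the four cell probabilities, assuming a uniform prior over the four cells (a different assumption from the two separate beta estimates). The function name `simulate_ratio` and the counts are hypothetical.

```python
import random
import statistics

def simulate_ratio(a11, a12, a21, a22, draws=20000, seed=0):
    """Monte Carlo mean and standard deviation of p21 / (p11 + p21).

    The four cell probabilities are drawn from a Dirichlet distribution with
    parameters (count + 1), which assumes a uniform prior over the four cells.
    """
    rng = random.Random(seed)
    ratios = []
    for _ in range(draws):
        # A Dirichlet draw is a set of independent gamma draws, normalized.
        gammas = [rng.gammavariate(count + 1, 1.0) for count in (a11, a12, a21, a22)]
        total = sum(gammas)
        p11, p12, p21, p22 = (g / total for g in gammas)
        ratios.append(p21 / (p11 + p21))
    return statistics.fmean(ratios), statistics.stdev(ratios)

# Hypothetical sample counts: A111 = 0, A112 = 30, A121 = 28, A122 = 22.
ratio_mean, ratio_sd = simulate_ratio(0, 30, 28, 22)
print(ratio_mean, ratio_sd)
```

Under this Dirichlet assumption the ratio itself follows a beta distribution with parameters A121+1 and A111+1, so the simulated mean should land close to (A121+1)/(A121+A111+2), which gives a quick sanity check on the simulation.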

Geometry: convex quadrilateral
Let ABCD be a convex quadrilateral. Consider the points E and F such that C is the midpoint of the line segment AE and D is the midpoint of the line segment BF. Evaluate, with proof, the ratio of [ABEF] and [ABCD].

Actually I am not sure what is meant by [ABCD]. —Preceding unsigned comment added by Khubab (talk • contribs) 22:33, 1 August 2008 (UTC)


 * The brackets mean "the area of". I'm thinking about what the best solution to this problem may be (I already have one solution, but it's almost surely not what you're looking for). But remember, do your own homework -- that's the policy of the Reference Desk. You should also be a little more specific when you ask for help. 76.238.91.100 (talk) 04:11, 2 August 2008 (UTC)


 * Au contraire — I thought the brackets meant the cross-ratio, as used for example here. —Keenan Pepper 04:15, 2 August 2008 (UTC)


 * Hmmm, I didn't know about that. Anyway, if the brackets mean area of, then the answer would be very simple; just use the formula
 * $$ \mbox{Area(ABCD)} = \frac{1}{2} \overline{AC} \; \overline{BD} \sin \theta$$
 * where $$\theta$$ is the angle between the diagonals $$\overline{AC}$$ and $$\overline{BD}$$. This formula works for all quadrilaterals.
 * EDIT: Concerning the cross-ratio, I'm reading into it.
 * EDIT2: I looked at this page, and I believe there are typos in their formulas. Cut-the-knot has the correct ones. Also, does anyone know how Wolfram defined the cross ratio of a square to be 2 here (or the values for any of the other shapes for that matter)? 76.238.91.100 (talk) 04:25, 2 August 2008 (UTC)
 * (Wolfram) Well, AC and BD are L·sqrt(2) whereas AD and BC are just L (the points on the square go ABCDA etc.). Is that it? 87.102.86.73 (talk) 22:32, 2 August 2008 (UTC)
 * That was for coplanar points - it made sense to me? (except the ratio 1/2 for a square when they already have 2 for a square) 87.102.86.73 (talk) 22:36, 2 August 2008 (UTC)
 * There is an EDIT to this below. Please explain. The Wolfram article just offhandedly states that it is possible to define a cross ratio for any 4 coplanar points, but it doesn't really elaborate how to perform the calculation. Elsewhere I found out that one could create an arbitrary point P, draw lines from P to the 4 points in question (in this case, the corners of the square), and then find the cross ratio of the 4 intersecting lines thus formed (for which there is an unambiguous definition for the value of the cross ratio). However, the value depends on the placement of P, meaning that with this method a shape cannot be assigned a cross ratio value without specification of the reference point.
 * EDIT: Alright, I think I've got it figured out. On the Wolfram page, they are treating the points as complex numbers to calculate the cross ratios for the shapes like the square. Multiple values are possible for each shape (explaining why a square has a cross ratio of 2 and 1/2), although these values are all related by the equations given earlier in the page. However, the cross ratio of a quadrilateral in the complex plane is a real number only if the points are concyclic. The OP just says a "convex quadrilateral," so I assume now that what I said earlier about using a reference point P is the only reasonable method to use here. I haven't determined whether the OP is properly solvable (i.e., yields an answer that doesn't depend on the placement of P) this way, though. But then again, there is always the small chance that the answer is supposed to be a complex number. 75.4.141.243 (talk) 00:42, 3 August 2008 (UTC)
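
The complex-number reading described above can be sketched in Python. The ordering convention in `cross_ratio` is an assumption (one common definition; other sources permute the points, which only moves the result among the same six related values), and the coordinates of both quadrilaterals are hypothetical examples.

```python
from itertools import permutations

def cross_ratio(z1, z2, z3, z4):
    """Cross-ratio of four distinct complex numbers (one common convention)."""
    return ((z1 - z3) * (z2 - z4)) / ((z1 - z4) * (z2 - z3))

# Unit square with corners taken in order A, B, C, D.
square = [0 + 0j, 1 + 0j, 1 + 1j, 0 + 1j]

# Collect the distinct values over all orderings of the four corners.
values = {
    complex(round(v.real, 9), round(v.imag, 9))
    for v in (cross_ratio(*p) for p in permutations(square))
}
real_values = sorted(v.real for v in values)
print(real_values)

# A convex quadrilateral whose corners are not concyclic.
general = [0 + 0j, 2 + 0j, 3 + 1j, 0 + 1j]
general_ratio = cross_ratio(*general)
print(general_ratio)
```

For the square every ordering gives a real value, including both 2 and 1/2, while the second quadrilateral gives a value with a non-zero imaginary part, in line with the remark about concyclic points.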