Wikipedia:Reference desk/Archives/Mathematics/2007 June 19

= June 19 =

What is the probability that a vector came from a set?
Firstly, this isn't homework, this is for my job. I'm a programmer but most of the time I don't have to deal with this much math. I've been through my college stats book, and it has some hints but not the answer to this question:

There are N vectors (x1, x2, ... xM) where N is usually around 10 and M is around 40, and another new vector with the same number of elements. All of the elements xm are normally distributed and completely independent of each other. I need to know the probability that the new vector came from the same set as the N known vectors did. Note that I'm not trying to reject or accept the proposition at a given confidence level; I need to know the probability that the new vector is from the same population. Please help. 75.35.79.57 02:14, 19 June 2007 (UTC)


 * What do you mean, "came from": is a member of? is a linear combination of?


 * Is a member of: So, e.g., (1,2,3) is very likely to come from {(0.8,2.2,2.9),(1.1,2.1,3.2)} but (7,8,9) isn't.  75.35.79.57 03:07, 19 June 2007 (UTC)


 * That's not membership. Furthermore, you need to define some sort of tolerance value to get anywhere meaningful: (1,2,3) - (1.1,2.1,3.2) = (0.1, 0.1, 0.2) may be okay, but is (1,2,3) - (1.5, 1.9, 2.4) = (0.5, -0.1, 0.4)?


 * (edit conflict) Be nice to the guy, anon. So he doesn't know the right terminology.
 * I wasn't explicitly rude to the guy. One needs to ask a question reasonably precisely to expect a reasonably precise answer.


 * I think what the guy wants is something related to the probability that a certain vector was generated by a random variable coming from some specific probability distribution of vectors (BTW, I'm not even sure I'm saying this right). For this purpose, the Mahalanobis distance is often used.  To calculate the Mahalanobis distance, you need the estimated means, variances and covariances.  In your case, if all elements are supposed to be completely independent, things are much simpler.


 * The Mahalanobis distance is given by:


 * $$D_M(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x-\mu)}.\, $$


 * But in your case, for 3 dimensions $$\Sigma = \left( \begin{smallmatrix} v_1&0&0 \\ 0&v_2&0 \\ 0&0&v_3 \end{smallmatrix} \right)$$ where $$v_i$$ is the variance of each element i.


 * which gives $$ \Sigma^{-1} = \left( \begin{smallmatrix} 1/v_1&0&0 \\ 0&1/v_2&0 \\ 0&0&1/v_3 \end{smallmatrix} \right)$$


 * and if I'm not mistaken, in this instance $$D_M(x) = \sqrt{ \sum^n_{i=1} \frac {(x_i - \mu_i)^2}{v_i} }$$ where $$n=3$$
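 * The simplified formula above can be sketched in a few lines of NumPy. This is only an illustration under the thread's assumptions (independent components, diagonal covariance); the function name and toy data are mine, not from the discussion:

```python
import numpy as np

def mahalanobis_diag(x, samples):
    """Mahalanobis distance of x from a sample set, assuming
    independent components (i.e. a diagonal covariance matrix)."""
    samples = np.asarray(samples, dtype=float)
    mu = samples.mean(axis=0)        # per-component estimated means
    v = samples.var(axis=0, ddof=1)  # per-component sample variances
    return np.sqrt(np.sum((np.asarray(x, dtype=float) - mu) ** 2 / v))

# Toy data echoing the example earlier in the thread:
known = [(0.8, 2.2, 2.9), (1.1, 2.1, 3.2)]
print(mahalanobis_diag((1.0, 2.0, 3.0), known))  # close to the set
print(mahalanobis_diag((7.0, 8.0, 9.0), known))  # far from the set
```

With only two sample vectors the variance estimates are very rough, but the far vector still comes out with a much larger distance than the near one.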


 * However, you still need some sort of threshold, and this part is a bit tricky. What distance is too far?  A normal distribution can produce any value, but its nicety is that most values lie within a few standard deviations of the mean.  For one-dimensional statistics, about 95% of the distribution is within 2 standard deviations of the mean (and about 99.7% within 3; see confidence interval).  You could use this kind of logic here (though I don't know offhand how this translates to higher dimensions).  That is, you set the distance to be related to the percentage of the distribution you want to capture.


 * But there are problems with this. If multiple distributions must be tested, the above logic does not work (two distributions could be very close to each other and the likelihood that the vector came from either could be very high).  However, you can use the distance as calculated above to test which distribution is "closest" to the vector, and that would give you the distribution with the maximum likelihood.  Of course, if your distributions are practically on top of each other, your error rate will soar through the roof and you may want to consider finding better data for each entity these vectors represent.


 * You also may want to take a look at Expectation maximization and data clustering articles. Root4(one) 04:57, 19 June 2007 (UTC)


 * A word of caution, though... I'm not too certain that some data clustering algorithms are anything more than "solutions" looking for problems. I think a healthy skepticism should be involved when reading about some people's methods of use.   In some cases their use may be nothing more than 21st century numerology. Root4(one)


 * I mean in the statistical sense, just like a scalar has a certain probability of belonging to (having been taken from?) a normal distribution. In that case you would use a test to reject or accept membership at a certain confidence level.  I need the analogous test for (1) a vector, and (2) providing the confidence level on the cusp of rejection. 75.35.79.57 04:08, 19 June 2007 (UTC)


 * So you're actually asking for the probability that vector x' comes from the same distribution that generated the other vectors? Unless you happen to know what the original distribution was (i.e. that it was N(0,1)^M or N(mu,sigma^2)^M or whatever), in which case you don't need the set of vectors, then you've got to assume that the set of vectors you have is a reasonable approximation of the distribution. In which case, then I guess you'd produce a statistic of the form $$((x_i-\mu_i)/\sigma_i)$$, where $$\mu_i$$ and $$\sigma_i$$ are the mean and sd of the ith components of the set of vectors. I suspect each component of the statistic vector would then be t-distributed, but I'm not 100% sure. Confusing Manifestation 04:14, 19 June 2007 (UTC)


 * Yes, that's exactly what I want and your stated assumptions are true. Those are standard scores for each element, right?  So, any idea how to combine them?  Pythagorean means? 75.35.79.57 04:38, 19 June 2007 (UTC)


 * The (null) hypothesis to be tested for is that the new vector is from the same population as the N known vectors, where that population is assumed to have a distribution that is the joint distribution of M independent normally distributed random variables. The population parameters can be estimated independently for each component. For each of the M components, estimate its mean and variance from the data of the N known vectors, giving you estimates m1, ..., mM and v1, ..., vM. For each component (x1, ..., xM) of the new vector, define qi = (xi−mi)²/vi, and compute the sum Σ qi of all M quantities. Under the null hypothesis this sum has a chi-square distribution with M degrees of freedom. You can then use this as a test statistic for the null hypothesis in the usual way.
 * A word about the meaning of this. You wrote that you need to know "the probability that the new vector is from the same population". The only thing you can estimate, is the probability that a random vector from the same population deviates not more from the norm than the new one does. That is not the same statement. Using the notation of conditional probability, it is similar to the difference between P(A|B) and P(B|A). --Lambiam Talk  08:52, 19 June 2007 (UTC)
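 * The test described above can be sketched in Python with scipy. The function name and the simulated data are illustrative only; note also that since the means and variances are themselves estimated from just N vectors, the chi-square distribution is only an approximation:

```python
import numpy as np
from scipy.stats import chi2

def same_population_pvalue(new_vec, known):
    """p-value for the null hypothesis that new_vec comes from the
    same population as the rows of `known` (independent normal
    components). The test statistic sum((x_i - m_i)^2 / v_i) is
    approximately chi-square with M degrees of freedom."""
    known = np.asarray(known, dtype=float)
    m = known.mean(axis=0)         # estimated means m1..mM
    v = known.var(axis=0, ddof=1)  # estimated variances v1..vM
    stat = np.sum((np.asarray(new_vec, dtype=float) - m) ** 2 / v)
    M = known.shape[1]
    return chi2.sf(stat, df=M)     # upper-tail probability

# Simulated data matching the question: N=10 known vectors, M=40
rng = np.random.default_rng(0)
known = rng.normal(loc=5.0, scale=2.0, size=(10, 40))
typical = rng.normal(loc=5.0, scale=2.0, size=40)  # same population
outlier = typical + 10.0                           # clearly not
print(same_population_pvalue(typical, known))
print(same_population_pvalue(outlier, known))
```

The p-value for the outlier is essentially zero, while the vector drawn from the same population gets a p-value of the usual unremarkable size. As the caveat below notes, this is a tail probability under the null hypothesis, not "the probability that the vector is from the same population".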

Rational numbers
From what I understand, any rational number can be expressed as a decimal with an infinitely recurring digit or sequence of digits at the end, and the converse is true (any decimal with an infinitely recurring digit or sequence of digits is rational). Going from a/b form to a recurring decimal is pretty easy: sit down and do long division till you notice a pattern, or better yet, pull out a calculator (or computer calculator) that will show you enough digits to see the recurring ones. Here's my question: how do you go the other way around? Let's say I give you the arbitrary number 13.4588888... or 3.626362636263... Is there a straightforward (or not so straightforward) way that always works for finding the equivalent a/b fraction? 209.53.181.75 17:44, 19 June 2007 (UTC)
 * This gives a good explanation. I'm sure there's plenty of tools on the internet that will just spit it out for you if you're not interested in the method or need to do this frequently... just google around  Adam2288  T  C  17:50, 19 June 2007 (UTC)


 * If you want to find the method yourself, here's a hint. Break up the number into three parts: (i) front part with none of the repeating digits (this part may be 0), (ii) the first occurrence of the set of repeating digits, (iii) all the rest of the repeating digits.  For your first example, this gives (i) 13.45, (ii) 0.008, (iii) 0.00088888...  Now stare at the last two for a while, and keep in mind that the full repeating part is their sum.  Baccyak4H (Yak!) 17:53, 19 June 2007 (UTC)


 * Say you have a number x with some repeating digits, 0.ABCDABCDABCD... Now consider 10000x − x.  That is, you have ABCD.ABCDABCD... minus 0.ABCDABCD...  The A's, B's, C's, and D's match up on those two numbers infinitely, leaving you ABCD.00000..., or just ABCD!  But this is 10000x − x = 9999x, so your number is equal to ABCD/9999.  The same method is used in the link Adam2288 gives above.  Of course, don't assume that just because a number starts out 0.68686868686868... it is rational.  It always depends on the process that created it.  For all you know, the digits after the trillion-trillionth place start out ...31415926... (the digits of pi). 68/99 would then be a very good approximate representation of this number, but it would in fact be $$ \frac{68/99 \cdot (10^{10^{24}} - 1) + \frac{\pi}{10}}{10^{10^{24}}} $$, the $$\pi$$ making it completely irrational. Root4(one) 20:00, 19 June 2007 (UTC)
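 * The 10000x − x trick above can be carried out exactly with Python's Fraction type. This is just a sketch; the function name and the string-based interface (non-repeating prefix plus repeating block) are my own choices for illustration:

```python
from fractions import Fraction

def repeating_to_fraction(non_repeating, repeating):
    """Exact fraction for the decimal `non_repeating` followed by the
    digit block `repeating` recurring forever.
    E.g. 13.45888... -> repeating_to_fraction('13.45', '8')."""
    base = Fraction(non_repeating)
    # how many digits after the decimal point the prefix already uses
    rep_start = len(non_repeating.split('.')[1]) if '.' in non_repeating else 0
    # the 10^k x - x trick: a block of k repeating digits contributes
    # block / (10^k - 1), shifted past the non-repeating digits
    k = len(repeating)
    rep = Fraction(int(repeating), (10 ** k - 1) * 10 ** rep_start)
    return base + rep

print(repeating_to_fraction('13.45', '8'))   # 13.45888...
print(repeating_to_fraction('3', '6263'))    # 3.62636263...
```

Both of the questioner's examples come out as exact fractions, which long division confirms reproduce the repeating decimals.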


 * See our article on Recurring decimals. --Lambiam Talk  22:30, 19 June 2007 (UTC)

Chances of seeing someone you know in porn
There was a song years ago about someone being shocked to see a former girlfriend in a "girlie magazine". Given the huge amount of porn on the internet and elsewhere nowadays, I've been wondering what the chances are of a model being recognised by someone she knows, or if the huge volume means that it is effectively done anonymously, unless you're a "star". If TNOM=total number of models, TNOW=total number of women, WK=number of women known by the viewer, MWK=model women known, MS=number of models viewed, then (TNOM/TNOW)xWK=MWK and MWKxMS/TNOM=the chance of seeing someone you know. Then this cancels out to equal WKxMS/TNOW=the chance of seeing someone you know. This seems odd to me as TNOM cancels out and does not appear in the final equation. Have I got it right? 80.0.132.197 20:26, 19 June 2007 (UTC)


 * I'd say it's reliant on your definition of porn. Are we talking cheesy topless myspace photos, or Backdoor Sluts 9 here? -- Phoeba WrightOBJECTION! 20:47, 19 June 2007 (UTC)


 * The numerical result does depend heavily on the definition, but the calculations involved do not. Now, you have asked two different questions here: What is the probability for a model to be recognized, and what is the probability for a viewer to recognize a model. Your calculations seem correct (under highly simplified assumptions, of course) for the latter, and the result, indeed, should not depend on the total number of models (think of it this way: The viewer knows a certain percentage of the women in the world, and the probability for every model he views to be one he knows is roughly equal to that percentage). As for the former, there are additional variables involved, but the result will be inversely proportional to the number of models. -- Meni Rosenfeld (talk) 21:08, 19 June 2007 (UTC)
 * What about the model's risk of being spotted by someone she knows? If WKxMS/TNOW=the chance of seeing someone you know, then wouldn't this risk be simply PKxWKxMS/TNOW, where PK=number of people known by the model? This still does not involve the total number of models. 80.2.198.47 12:20, 20 June 2007 (UTC)
 * No. PKxWKxMS/TNOW is the chance that someone the model knows will recognize some model, not necessarily her! In case this happens, this can be any model, so the chances of it being her are 1/TNOM. So the probability to be recognized is PKxWKxMS/(TNOWxTNOM). Of course, this approximation is only valid as long as the probability that all people the model knows, together, will recognize more than one model, is negligible. Otherwise, a better approximation will be given by an exponential similar to the one in Lambiam's reply. -- Meni Rosenfeld (talk) 13:57, 20 June 2007 (UTC)
 * Estimating numbers for PKxWKxMS/(TNOWxTNOM) does suggest that the chances of a model new to the scene being spotted by people back home are very low: say 50x50x50/(300000000x300000)=a probability of about 1.39 x 10^-9, or virtually zero. I would have imagined the chances would be higher, at an intuitive guess say 5% 80.2.213.219 09:00, 21 June 2007 (UTC)
 * I don't think the formula is right. The people known by the model will see about PK×MS models (not taking account of repeats because several are viewing the same material). The probability that a given model from a pool of TNOM equiprobable models is in there, is about PK×MS/TNOM. Using the values from above, that is 50×50/300000 = 1/120. I don't know about you, but I know many more than 50 people; I think 1000 is a better estimate. I don't have any reliable material, but MS = 50 may also be a low average, depending on the time period this ranges over. Of course, much depends on the outlet. Some have a much wider dissemination than others. --Lambiam Talk  14:47, 21 June 2007 (UTC)
 * Interesting comments, although I have to point out that p=PKxMS/TNOM is not as far as I am aware dimensionally balanced according to the technique of dimensional analysis, although I could very easily be wrong. 80.0.101.211 22:54, 21 June 2007 (UTC)
 * Let V and M be the dimensions of quantities counting viewers and models, respectively. Use [PK] = V, [MS] = M·V^−1 and [TNOM] = M. Then [PK×MS/TNOM] = V×(M·V^−1)/M = 1, so here we have a dimensionless quantity. --Lambiam Talk  09:48, 22 June 2007 (UTC)
 * Lambiam is correct, of course. In my calculation above, I should have divided by MWK instead of TNOM (a model recognized by a viewer must be someone he knows, not any model), which under the assumption of independence, indeed gives PKxMS/TNOM. -- Meni Rosenfeld (talk) 10:56, 22 June 2007 (UTC)
 * I'm not so sure - in the theoretical case where the number of models was less than the number of people known, you would get a probability greater than 1. For example 50x1/10=5. And, if you consider the dimension H for people, the equation reduces to 1=HxH/H=H. 1=H is not dimensionally balanced. 80.0.107.145 15:08, 22 June 2007 (UTC)
 * Again, the actual probability is more like 1 − exp (−PKxMS/TNOM), which is close to PKxMS/TNOM when it is small (not when it is 5!) -- Meni Rosenfeld (talk) 15:45, 22 June 2007 (UTC)
 * There are enough problems in the world as it is - no need to invent problems that don't exist. If you insist on analyzing this from a dimensional point of view, you have to do it right. Lambiam has already hinted at your mistake - MS is the number of models seen by every viewer, so its dimension is models per viewer, or persons per person, that is, dimensionless. -- Meni Rosenfeld (talk) 10:35, 24 June 2007 (UTC)
 * If MWK = 0 (you don't know a single model), then the chances that you will recognize a model are slim, obviously. This is true, however large WK and MS are. This is not reflected in the formula you gave. Assuming that all models are equally likely to appear in a photo viewed by a given viewer, the probability that any single randomly selected photo looked at shows a known model equals MWK/TNOM. The expected number among MS different models viewed, if MS << TNOM, is then MS×MWK/TNOM. To turn this into a probability, use 1 − exp(−MS×MWK/TNOM) – but if MS×MWK/TNOM is fairly small, it is a good approximation of the probability. --Lambiam Talk  22:25, 19 June 2007 (UTC)
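 * A quick numerical sketch of the 1 − exp(−MS×MWK/TNOM) formula above, showing why the exponential form matters when the expected count is not small (the function name and sample numbers are illustrative, not from the thread):

```python
import math

def recognition_probability(ms, mwk, tnom):
    """Probability of seeing at least one known model among ms viewed,
    treating sightings as independent draws from a pool of tnom
    equally likely models, of whom mwk are known to the viewer."""
    expected = ms * mwk / tnom     # expected number of known models seen
    return 1 - math.exp(-expected)

# Small expected count: the exponential and the linear approximation
# MS*MWK/TNOM nearly agree (both ~0.0005 here).
print(recognition_probability(50, 3, 300000))
# Large expected count: the linear form would give 5, which is not a
# probability; the exponential form stays below 1.
print(recognition_probability(5000, 300, 300000))
```

This also answers the dimensional worry raised further down the thread: the exponential form can never exceed 1, however the inputs are chosen.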


 * But if we assume that acquaintance with the viewer is independent of being a model, then MWK/TNOM = WK/TNOW. -- Meni Rosenfeld (talk) 22:40, 19 June 2007 (UTC)

Incidentally, Centerfold (song), by the J. Geils Band, was released in 1981. StuRat 02:53, 20 June 2007 (UTC)


 * There's an earlier song in the same vein by Shel Silverstein ("Polly in a Porny") and a later one by King Missile ("Muffy", 1987). —Tamfang 05:27, 20 June 2007 (UTC)


 * One wrinkle to consider is that people are not islands: if someone in Polly McPornstar's social circle happens to see her in flagrante delicto, there's a good chance s/he will forward the picture to others in that circle (suggested subject line: "Omigod!"), so that widens the net. --TotoBaggins 18:55, 20 June 2007 (UTC)