Wikipedia:Reference desk/Archives/Mathematics/2018 May 24

= May 24 =

Looking for a statistical test
Let us suppose (wrongly, I'm sure) that whether or not it will rain on a given day is determined by flipping a biased coin. The bias varies from day to day, but the flips are independent. And let us suppose that when the weather channel reports the chance of rain, they are purporting to report the bias of that day's flip. How would I test their correctness?

In other words, I have a sequence of independent 0-1-valued trials for which I know the results, and I have a sequence of values, and I want to test that the values are the probabilities of success for each trial. I could extract just those days when the weather channel reported a particular chance, say 30%, and do a test of proportion on those. But that's throwing away a lot of information. It seems it would be better to design a test that considered the entire sample. But I've no idea how to design a test statistic for that. — Preceding unsigned comment added by 108.36.85.137 (talk) 14:53, 24 May 2018 (UTC)
 * Butterfly effect — Preceding unsigned comment added by 94.101.250.53 (talk) 17:08, 24 May 2018 (UTC)
 * I don't see anything on that page that's relevant to the question I asked? 108.36.85.137 (talk) 21:20, 24 May 2018 (UTC)
 * New technologies are coming within a few decades, not very late! — Preceding unsigned comment added by 188.210.69.196 (talk) 10:36, 25 May 2018 (UTC)
 * Here's a natural thing to do. (And if I knew statistics, I would know what it was called, but I don't, sorry.) There is a number that you can associate to the data: for each day, take (truth - prediction)^2 and then add these numbers up.  In your case, the truth is 1 if it did rain, and 0 if it didn't.  A perfect forecaster tells you the truth every day, and gets total error 0.  A zero-knowledge forecaster tells you 50% every day, so gets a squared error of (1 - 0.5)^2 = 0.25 every day, and so has a total error after n days of n/4.  If the world operates the way you describe, then the best possible thing a forecaster can do is to report the true probability p of rain on each day.  The actual computed error can be compared with this theoretical error to give a sense of how close they are.  (And one could probably do some statistics with it, as well.)  --JBL (talk) 18:29, 24 May 2018 (UTC)
 * That would be the Least squares method, a common way of finding a Simple linear regression, from which you calculate the R^2 value to see how good the calculated regression is (an R^2 of 1 would be a perfect forecaster, for example). Iffy★Chat -- 19:07, 24 May 2018 (UTC)
 * Hmm. Ideally I'd like a test statistic that reduces to the standard test of proportion statistic when the forecaster gives a constant prediction, and this doesn't seem to do that (unless I'm seriously botching my algebra). 108.36.85.137 (talk) 21:20, 24 May 2018 (UTC)
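The tally JBL describes is, up to averaging, what statisticians call the Brier score. A minimal Python sketch (the function name is illustrative, not from any library):

```python
def total_squared_error(outcomes, forecasts):
    """Sum of (truth - prediction)^2 over all days.

    outcomes: list of 0/1 values (1 = it rained);
    forecasts: the probabilities of rain the forecaster reported.
    """
    return sum((a - p) ** 2 for a, p in zip(outcomes, forecasts))

rain = [1, 0, 0, 1, 1]
perfect = [1.0, 0.0, 0.0, 1.0, 1.0]
coin_flip = [0.5] * 5  # zero-knowledge forecaster

print(total_squared_error(rain, perfect))    # 0.0
print(total_squared_error(rain, coin_flip))  # 1.25, i.e. n/4 with n = 5
```

Dividing by the number of days gives the mean squared error, which is the usual normalization of the Brier score.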


 * (edit conflict) You could compute the sum of squared errors (SSE) (the sum of squares of (the forecast probability minus the value 0 or 1)) under two scenarios: (1) Assume that all the weather channel’s probabilities p_i are correct, and that the 0-1 variables take on not the observed outcomes but rather, separately for each p_i value, zeroes and ones in the ratio implied by the p_i’s. Subtracting these from the p_i’s, squaring, and adding them together gives the theoretical SSE. (2) Using the observed errors (the weather channel probabilities minus the observed 0-1 values), squaring them, and adding them together gives the observed SSE. You could look at F-test to see if the distributions in your problem allow the use of the F-test. Loraof (talk) 22:32, 24 May 2018 (UTC)
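One way to read scenario (1) above: if each p_i is correct, then for a fraction p_i of such days the squared error is (1 - p_i)^2 and for a fraction (1 - p_i) it is p_i^2, which sums to p_i(1 - p_i), the variance of a Bernoulli(p_i) variable. A sketch under that reading (function names are made up):

```python
def observed_sse(outcomes, forecasts):
    # Scenario (2): squared errors against what actually happened.
    return sum((a - p) ** 2 for a, p in zip(outcomes, forecasts))

def theoretical_sse(forecasts):
    # Scenario (1): expected squared error on day i, if p_i is correct,
    # is p_i*(1-p_i)**2 + (1-p_i)*p_i**2 = p_i*(1-p_i).
    return sum(p * (1 - p) for p in forecasts)

# Example: two days forecast at 30% and 70%; it rained only on day 2.
print(theoretical_sse([0.3, 0.7]))        # 0.42
print(observed_sse([0, 1], [0.3, 0.7]))   # 0.18
```

An observed SSE far from the theoretical one, in either direction, is evidence against the forecaster's probabilities being the true biases.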


 * Or, you could separately run the test you mentioned – comparing p_i with the fraction of observed values that are 1 – for each i. Under a null hypothesis of the weather forecaster’s probabilities equaling the underlying values, and using for example a confidence level of (1–alpha) = 0.99 for each test_i, under the null we expect all the tests to be below the critical value – i.e., we accept all the null hypotheses – with probability 0.99^N, where N is the number of values that the forecast can take on. (N different probabilities are multiplied together because the tests are mutually independent.) So the Type I error rate is 1 – 0.99^N. Loraof (talk) 02:25, 25 May 2018 (UTC)
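The per-forecast-value tests can be sketched as follows, grouping days by the reported probability and running an exact two-sided binomial test on each group (built here from math.comb; scipy.stats.binomtest would do the same job — everything below is illustrative):

```python
import math
from collections import defaultdict

def binom_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def exact_binom_pvalue(k, n, p):
    # Two-sided exact test: sum the probabilities of all outcomes
    # no more likely than the one observed.
    obs = binom_pmf(k, n, p)
    return sum(binom_pmf(j, n, p) for j in range(n + 1)
               if binom_pmf(j, n, p) <= obs + 1e-12)

def per_forecast_tests(outcomes, forecasts):
    # One p-value per distinct reported probability.
    groups = defaultdict(list)
    for a, p in zip(outcomes, forecasts):
        groups[p].append(a)
    return {p: exact_binom_pvalue(sum(v), len(v), p)
            for p, v in groups.items()}

# With N distinct forecast values each tested at confidence 0.99,
# the family-wise Type I rate under the null is 1 - 0.99**N.
print(1 - 0.99**5)  # about 0.049 for N = 5
```

The family-wise correction is the same idea as the Šidák adjustment for independent tests.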


 * Our article about Kullback–Leibler divergence is about the best entry point you can get to entropy-loss methods. Depending on your needs, those either completely solve your problem with much more detail and math than needed, or barely scratch the surface.
 * The problem with the aforementioned least-squares method is that it does not punish big mistakes harshly enough. Imagine an event occurred, and the forecasters' given probabilities were 0 for A, 0.2 for B, 0.47 for C. The squared-error method will penalize A relative to B about as much as B relative to C, but intuitively A was much more committed to the outcome and should be punished much more for the incorrect forecast. The more math-y way to say that is that the information that a given predictor purports to give depends non-linearly on the predicted probability value. Tigraan Click here to contact me 09:09, 25 May 2018 (UTC)
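The contrast is easy to see numerically: when the event happens, compare the squared-error penalty (1 - p)^2 with the surprisal -log(p) for the three forecasters in the example above.

```python
import math

def squared_penalty(p):
    # Penalty when the event occurred but was given probability p.
    return (1 - p) ** 2

def surprisal(p):
    # Information-theoretic penalty; diverges as p -> 0.
    return math.inf if p == 0 else -math.log(p)

for p in (0.0, 0.2, 0.47):
    print(p, squared_penalty(p), surprisal(p))
# squared: 1.0, 0.64, ~0.28  -- the A-B gap (0.36) roughly equals
#                               the B-C gap (~0.36)
# surprisal: inf, ~1.61, ~0.76 -- A's confident miss is penalized
#                                 infinitely more than B's
```

So the squared metric treats A-vs-B and B-vs-C as comparable mistakes, while the surprisal metric singles out the categorical (p = 0) forecast as unforgivable.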


 * The surprisal of a wrong prediction is related to the above and might be worth looking at. 173.228.123.166 (talk) 04:38, 26 May 2018 (UTC)
 * This does seem to work better when there are more than two possible outcomes. Suppose, for example, there are three possible outcomes A, B, C (e.g. sunny, overcast with no precipitation, precipitation). Let the forecaster assign probabilities p_A, p_B, p_C to each outcome, and let q_A, q_B, q_C be the 'true' probabilities, where p_A+p_B+p_C = q_A+q_B+q_C = 1. The forecaster is 'scored' on each day according to the actual event and the probability that the forecaster assigned to that event. With the surprisal metric the score is log(p) if the event was predicted to have probability p. But other metrics might work as well, for example −(1−p)^2 for a least-squares metric. In any case you'd want the score to be an increasing function of p, and you'd also want the expected score to have a maximum at p_A=q_A, p_B=q_B, p_C=q_C. For a score function of log(p) the expected score is q_A log p_A + q_B log p_B + q_C log p_C. Using Lagrange multipliers it's fairly easy to find that the critical point of this function with the constraint p_A+p_B+p_C=1 does indeed occur at p_A=q_A, p_B=q_B, p_C=q_C. For the least-squares metric the problem gets more complicated, but it looks like there is no critical point where expected. The log metric satisfies Tigraan's objections as well, since if the forecaster predicted that the actual event had probability 0 then the score would be −∞; in other words, you should never trust the forecaster again until the end of time. This is a very interesting problem to me since it could be applied to any situation where a prediction is made -- weather, elections, economic forecasts, etc. Such predictions appear in the media all the time but no one seems to bother with applying some kind of objective quality control to them. --RDBury (talk) 09:24, 26 May 2018 (UTC)
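The Lagrange-multiplier step above can be written out explicitly: maximize the expected log score over the p_i subject to the probabilities summing to one,

$$\mathcal{L} = \sum_i q_i \log p_i - \lambda\Big(\sum_i p_i - 1\Big), \qquad \frac{\partial \mathcal{L}}{\partial p_i} = \frac{q_i}{p_i} - \lambda = 0 \implies p_i = \frac{q_i}{\lambda}.$$

Imposing the constraint gives $$\sum_i p_i = \frac{1}{\lambda}\sum_i q_i = \frac{1}{\lambda} = 1$$, so $$\lambda = 1$$ and $$p_i = q_i$$ at the critical point, confirming that honest reporting maximizes the expected log score.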

Let $$a_1, \ldots, a_N$$ be the observed zeroes and ones according to rain or sun, and $$p_1, \ldots, p_N$$ be the predictions. Then $$\sum a_k$$ is the actual number of sunny days, and $$\sum (1-a_k)$$ is the actual number of rainy days, while $$\sum p_k$$ is the predicted mean number of sunny days, and $$\sum (1-p_k)$$ is the predicted mean number of rainy days. The inner product $$\sum p_k a_k $$ is a kind of correlation between predicted and observed sunshine, while $$\sum (1-p_k)(1- a_k) $$ is a kind of correlation between predicted and observed rain. These elementary sums summarize how well the prediction fits the observations. Bo Jacoby (talk) 06:55, 28 May 2018 (UTC).
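The elementary sums above can be collected in a short sketch (function and variable names are illustrative):

```python
def summary_sums(a, p):
    """The six elementary sums for 0/1 observations a and predictions p."""
    sunny = sum(a)                                   # actual sunny days
    rainy = sum(1 - x for x in a)                    # actual rainy days
    pred_sunny = sum(p)                              # predicted mean sunny days
    pred_rainy = sum(1 - x for x in p)               # predicted mean rainy days
    match_sun = sum(pk * ak for pk, ak in zip(p, a))           # sun "correlation"
    match_rain = sum((1 - pk) * (1 - ak) for pk, ak in zip(p, a))  # rain "correlation"
    return sunny, rainy, pred_sunny, pred_rainy, match_sun, match_rain

print(summary_sums([1, 0, 1], [0.9, 0.2, 0.6]))
```

A well-calibrated forecaster should have the predicted counts close to the actual counts, and the two inner products close to their values under the reported probabilities.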