User:Evercat/Stats

One sample t-test
This is to test whether the (unknown) true population mean is the same as a reference figure...

Estimated standard deviation:

$$ s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} $$ i.e. find corrected sums of squares, divide by df, and take square root.

Alternative method for estimated standard deviation:

$$ s = \sqrt{\frac{\sum x_i^2 - (\sum x_i)^2 / n}{n - 1}} $$

Estimated standard error of the mean:

$$ \frac{s}{\sqrt{n}} $$ i.e. stddev divided by square root of the sample size.
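A quick Python sketch of the above (the sample values are made up, just to check that both forms of s agree):

```python
import math

x = [4.1, 5.3, 6.0, 5.5, 4.8]          # made-up sample
n = len(x)
mean = sum(x) / n

# Corrected sum of squares divided by df, then square root
s1 = math.sqrt(sum((xi - mean) ** 2 for xi in x) / (n - 1))

# Alternative form: sum of squares minus (sum)^2 / n, over df
s2 = math.sqrt((sum(xi ** 2 for xi in x) - sum(x) ** 2 / n) / (n - 1))

se = s1 / math.sqrt(n)                 # estimated standard error of the mean

print(s1, s2, se)                      # s1 and s2 agree (up to rounding)
```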

One sample t-test:

$$ t_0 = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} $$ i.e. the difference in the values divided by the standard error of the sample mean.
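Sketch of the test statistic in Python; the sample and the reference value μ0 are made up for illustration:

```python
import math

x = [4.1, 5.3, 6.0, 5.5, 4.8]             # made-up sample
mu0 = 5.0                                  # made-up reference value
n = len(x)
mean = sum(x) / n
s = math.sqrt(sum((xi - mean) ** 2 for xi in x) / (n - 1))

t0 = (mean - mu0) / (s / math.sqrt(n))     # difference over standard error
print(t0)                                  # compare against the t table with n - 1 df
```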

Confidence interval for true population mean:

$$ \bar{x} \pm t \cdot s / \sqrt{n} $$ i.e. the sample mean ± t standard errors.

If this interval contains the reference value, the null hypothesis has not been rejected.

Subtracting the reference mean from the values gives a confidence interval for the difference between reference and population.
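Sketch of the interval in Python, assuming the two-sided 5% t value has already been looked up in the table (here df = n − 1 = 4, so t ≈ 2.776); the data are made up:

```python
import math

x = [4.1, 5.3, 6.0, 5.5, 4.8]                 # made-up sample, n = 5
n = len(x)
mean = sum(x) / n
s = math.sqrt(sum((xi - mean) ** 2 for xi in x) / (n - 1))

t = 2.776                                     # two-sided 5% t value from the table, df = 4
half_width = t * s / math.sqrt(n)

print(mean - half_width, mean + half_width)   # if mu0 lies inside, H0 is not rejected
```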

Paired t-test
This is to test whether the (unknown) true population means of two populations are the same, but our samples are linked...

It is equivalent to a one sample t-test where the data are the differences between pairs, and the reference value is zero.

Note that Xbar is the mean difference between pairs, s is the stddev for the differences, and n is the number of pairs:

$$ t_0 = \frac{\bar{X}}{s / \sqrt{n}} $$ i.e. the mean difference divided by the standard error of the differences.
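Sketch in Python with made-up before/after pairs:

```python
import math

before = [12.0, 15.1, 13.4, 14.2, 16.0]       # made-up paired measurements
after  = [12.8, 15.5, 13.1, 15.0, 16.9]

d = [a - b for a, b in zip(after, before)]    # differences between pairs
n = len(d)
mean_d = sum(d) / n
s_d = math.sqrt(sum((di - mean_d) ** 2 for di in d) / (n - 1))

t0 = mean_d / (s_d / math.sqrt(n))            # mean difference over its standard error
print(t0)                                     # compare against the t table with n - 1 df
```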

Two sample t-test
This is to test whether the (unknown) true population means of two populations are the same...

Pooled estimate of variance:

$$ s^2 = \frac{\mbox{df}_1 \cdot s_1^2 + \mbox{df}_2 \cdot s_2^2}{\mbox{df}_1 + \mbox{df}_2} $$ This is just the weighted average of the two sample variances, with weights being degrees of freedom.

Estimated standard error of the difference between the means:

$$ s \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} $$

Note that s is derived from the pooled variance s². This is not used for the paired t-test, only the two sample t-test.

Two sample t-test:

$$ t_0 = \frac{\bar{x}_1 - \bar{x}_2}{s \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} $$

i.e. the difference divided by the standard error of the difference. Note that s here is calculated from the pooled variance s².

Confidence interval for the difference between the two means:

$$ \bar{x}_1 - \bar{x}_2 \pm t \cdot s \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} $$ i.e. the estimated difference ± t standard errors, where t is taken straight from the table. The df to use is just n1 + n2 - 2. Note that s here is calculated from the pooled variance s².
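Sketch of the whole two sample calculation in Python; the samples are made up, and the t value 2.262 is the tabulated two-sided 5% value for df = n1 + n2 − 2 = 9:

```python
import math

x1 = [5.1, 4.8, 6.2, 5.5, 5.0, 5.9]        # made-up samples
x2 = [4.4, 4.9, 5.1, 4.2, 4.7]

n1, n2 = len(x1), len(x2)
m1, m2 = sum(x1) / n1, sum(x2) / n2
v1 = sum((xi - m1) ** 2 for xi in x1) / (n1 - 1)   # sample variances
v2 = sum((xi - m2) ** 2 for xi in x2) / (n2 - 1)

# Pooled variance: df-weighted average of the two sample variances
s2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
se = math.sqrt(s2) * math.sqrt(1 / n1 + 1 / n2)    # SE of the difference in means

t0 = (m1 - m2) / se                                # two sample t statistic
t = 2.262                                          # two-sided 5% t value, df = 9
print(t0, (m1 - m2 - t * se, m1 - m2 + t * se))    # statistic and confidence interval
```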

Binomial distribution
There are n trials, each with probability p of success. Then the binomial probability of x successes is:

$$ (\mbox{combinations that give } x \mbox{ successes}) \cdot (\mbox{chance of getting any such combination}) $$

The second factor is trivial to calculate; it's just the chance of getting precisely x successes in one specific order, e.g. S-S-S-F.

The whole thing can be formalised as:

$$ \binom{n}{x} p^x (1-p)^{n-x} $$

Where:

$$ \binom{n}{x} = \frac{n!}{x!(n - x)!} $$
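Sketch of the pmf in Python (math.comb gives the binomial coefficient); the numbers are just an example:

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n trials with success probability p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

print(binomial_pmf(3, 4, 0.5))    # 3 successes in 4 trials, p = 0.5 -> 0.25
```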

The binomial distribution has the following characteristics:


 * Mean successes = n * p
 * Variance is n * p * (1 - p)
 * Stddev is sqrt(variance)

As n increases, the binomial distribution starts to resemble a Normal distribution with those same parameters.
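Quick numerical check of the listed mean and variance, summing over the pmf (n and p are made up):

```python
from math import comb, sqrt

n, p = 20, 0.3
pmf = [comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(n + 1)]

mean = sum(x * pmf[x] for x in range(n + 1))
var = sum((x - mean) ** 2 * pmf[x] for x in range(n + 1))

print(mean, n * p)              # both ~6.0 (up to rounding)
print(var, n * p * (1 - p))     # both ~4.2
print(sqrt(var))                # stddev
```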

95% confidence intervals

Given n and an estimate of p, we can calculate a 95% confidence interval for the true value of p:

$$ \hat{p} \pm 1.96 \sqrt{\hat{p}(1 - \hat{p}) / n} $$

Or for the true expected number of successes:

$$ n\hat{p} \pm 1.96 \sqrt{n\hat{p}(1 - \hat{p})} $$

One can convert between the two confidence intervals: e.g. if p has a lower bound of 0.12, then in 100 trials, the lower bound for the expected successes is 12.

If one knows the true value of p, one can create a 95% confidence interval for what one will actually observe, in the same way.
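Sketch of both intervals in Python, with made-up counts:

```python
import math

n = 100
successes = 23
p_hat = successes / n                          # estimate of p

se_p = math.sqrt(p_hat * (1 - p_hat) / n)
ci_p = (p_hat - 1.96 * se_p, p_hat + 1.96 * se_p)                      # CI for the true p

se_count = math.sqrt(n * p_hat * (1 - p_hat))
ci_count = (n * p_hat - 1.96 * se_count, n * p_hat + 1.96 * se_count)  # CI for expected successes

print(ci_p, ci_count)    # ci_count is just ci_p scaled up by n
```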

Poisson distribution
There is an average of μ successes per unit of space/time. Then the Poisson probability of x successes is:

$$ \frac{\mu^x e^{-\mu}}{x!} $$

μ is both the mean and the variance of the distribution. As μ increases, the distribution also comes to resemble a Normal distribution with mean = μ and variance = μ.
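Sketch of the pmf in Python (the numbers are just an example):

```python
from math import exp, factorial

def poisson_pmf(x, mu):
    """Probability of exactly x successes when the mean rate is mu."""
    return mu ** x * exp(-mu) / factorial(x)

print(poisson_pmf(2, 3.0))    # P(2 successes) when mu = 3
```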

95% confidence intervals

If we observe x successes in a certain unit of space/time, an approximate 95% confidence interval for the true value of μ is x ± 1.96 standard deviations:

$$ x \pm 1.96 \sqrt{x} $$
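Sketch in Python with a made-up count:

```python
import math

x = 40                                   # observed successes in one unit of space/time
half_width = 1.96 * math.sqrt(x)
print(x - half_width, x + half_width)    # approximate 95% CI for the true mu
```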

Contagion
For both binomial and Poisson models, contagion is indicated when the observed variance is significantly larger than the expected variance.

Chi-squared tables
Note: expected count for a cell in a table of data is:

$$ \frac{(\mbox{row total})(\mbox{column total})}{\mbox{overall total}} $$

Degrees of freedom for a table is (rows - 1)(columns - 1).
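Sketch in Python with a made-up 2 × 2 table; the expected counts feed into the usual chi-squared statistic, Σ(O − E)²/E:

```python
observed = [[30, 10],                    # made-up 2 x 2 table of counts
            [20, 40]]

rows = [sum(r) for r in observed]                   # row totals
cols = [sum(c) for c in zip(*observed)]             # column totals
total = sum(rows)

expected = [[rt * ct / total for ct in cols] for rt in rows]
df = (len(rows) - 1) * (len(cols) - 1)

# Chi-squared statistic: sum of (observed - expected)^2 / expected over all cells
chi2 = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
           for i in range(len(rows)) for j in range(len(cols)))
print(expected, df, chi2)
```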

Correlation coefficient

$$ r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} $$

A value near 1 implies a straight line with positive gradient. Near -1 implies a straight line with negative gradient. Near 0 implies weak linear correlation.

Meanings of Sxy, Sxx, Syy:

$$ S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - (\sum x_i)(\sum y_i) / n $$

$$ S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - (\sum x_i)^2 / n $$

$$ S_{yy} = \sum (y_i - \bar{y})^2 = \sum y_i^2 - (\sum y_i)^2 / n $$
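Sketch in Python with made-up data, using the shortcut forms above:

```python
import math

xs = [1.0, 2.0, 3.0, 4.0, 5.0]            # made-up data
ys = [2.1, 3.9, 6.2, 8.0, 9.8]
n = len(xs)

sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
sxx = sum(x ** 2 for x in xs) - sum(xs) ** 2 / n
syy = sum(y ** 2 for y in ys) - sum(ys) ** 2 / n

r = sxy / math.sqrt(sxx * syy)
print(r)                                  # close to 1: strong positive linear relationship
```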

Regression
We are trying to determine a and b for this equation:

$$ y = a + bx + \mbox{random variable} $$

a is the intercept, b is the slope. The random variable (the residual) is assumed to be normally distributed with mean zero and variance independent of x.

$$ \hat{b} = S_{xy} / S_{xx} $$

$$ \hat{a} = \bar{y} - \hat{b}\bar{x} $$
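Sketch in Python, reusing the same made-up data as the correlation example:

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0]            # made-up data
ys = [2.1, 3.9, 6.2, 8.0, 9.8]
n = len(xs)

sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
sxx = sum(x ** 2 for x in xs) - sum(xs) ** 2 / n

b_hat = sxy / sxx                          # slope
a_hat = sum(ys) / n - b_hat * sum(xs) / n  # intercept: y-bar minus b-hat times x-bar

print(a_hat, b_hat)                        # fitted line: y = a_hat + b_hat * x
```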