Talk:G-test

Feasibility of Fisher Exact test
Before writing the words below, I ran several such calculations using this web-based application (http://home.clara.net/sisa/twoby2.htm) in the Firefox browser: for examples in which all cells had values between 10,000 and 20,000, it took about 30 seconds to finish the calculations.

For example, a laptop with a 1.7 GHz Pentium and 1 GB of RAM, specifications not considered particularly high-end in 2006, can readily handle cases of the Fisher exact test in which each cell's value is around 10,000 with commonly available statistical software.


 * Reverted as off topic; not really about the G-test. Pete.Hurd 17:31, 31 July 2006 (UTC)

similarity to Kullback-Leibler divergence
Do the G-test and the Kullback-Leibler divergence mean the same thing, just seen from different points of view?
 * It seems that they are the same thing. --Memming (talk) 18:58, 8 April 2009 (UTC)
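They are indeed closely related: if the observed and expected counts share the same total $$N$$, then $$G$$ is exactly $$2N$$ times the Kullback-Leibler divergence between the observed and expected proportions. A minimal numerical sketch (the counts below are made up for illustration):

```python
import numpy as np

# Made-up observed and expected counts with the same total N
O = np.array([30.0, 70.0])
E = np.array([50.0, 50.0])
N = O.sum()

G = 2 * np.sum(O * np.log(O / E))                 # G statistic
kl = np.sum((O / N) * np.log((O / N) / (E / N)))  # KL divergence of proportions

# G equals 2 * N * kl up to floating-point error
print(G, 2 * N * kl)
```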

G^2
Note that the "G-test" is referred to as the G^2 (g-squared) test (at least in psychology-related statistics).
 * Humph. I've never seen that - please give a reference.  seglea 23:29, 22 July 2005 (UTC)

To name a few references to G^2 in psychological stats (this is common in multinomial modeling work in the memory literature and is becoming more common in fitting other types of models as well):

Dodson, Holland, & Shimamura (1998). Using Excel to estimate parameters from observed data: An example from source memory data. Behavior Research Methods, Instruments, & Computers, 30(3), 517-526.

Batchelder & Riefer (1999). Theoretical and empirical review of multinomial process tree modeling. Psychonomic Bulletin & Review, 6(1), 57-86.

Bayen, Murnane, & Erdfelder (1996). Source discrimination, item detection, and multinomial models of source monitoring. Journal of Experimental Psychology: Learning, Memory, and Cognition, 22(1), 197-215.

Erdfelder & Buchner (1998). Process-dissociation measurement models: Threshold theory or detection theory? Journal of Experimental Psychology: General, 127(1), 83-96.

fisher.g.test in GeneTS not G-test as described?
The fisher.g.test function implemented in GeneTS is an exact test of whether a time series differs from Gaussian white noise, not the likelihood-ratio alternative to the chi-squared test described here.

Where does the 2 come from?
I've been trying to work out how Pearson's formula is an approximation for this test.

$$\ln(1 + x) \approx x $$ (for small x)

$$G = 2 \sum_i O_i \ln (O_i / E_i)$$

$$= 2 \sum_i O_i \ln (1 + \frac{O_i-E_i}{E_i})$$

$$\approx 2 \sum_i O_i \frac{O_i-E_i}{E_i}$$

$$= 2 \sum_i \frac{O_i}{E_i} (O_i - E_i) - 2 \sum_i (O_i - E_i)$$ (since $$\sum_i (O_i - E_i) = 0$$)

$$= 2 \sum_i \left[ \frac{O_i}{E_i} (O_i - E_i) - (O_i - E_i) \right]$$

$$= 2 \sum_i \frac{O_i - E_i}{E_i} (O_i - E_i)$$

$$= 2 \sum_i \frac{(O_i - E_i)^2}{E_i}$$

This is the formula for $$\chi^2$$, except that the factor of 2 is still there. What was my error? Thanks! &mdash; ciphergoth 14:11, 3 June 2006 (UTC)


 * Your approximation for ln(1+x) at the $$\approx$$ step wasn't good enough; it roughly works for each term, but its accumulated error over the positive and negative deviations is enough to account for the factor of 2. Taking the $$-x^2/2$$ term and another approximation should get you there. --Henrygb 15:01, 9 March 2007 (UTC)

Then please tell me where my error is:

$$\ln(1 + x) \approx x - \frac{x^2}{2}$$ (for small x)

$$G = 2 \sum_i O_i \ln (O_i / E_i)$$

$$= 2 \sum_i O_i \ln (1 + \frac{O_i-E_i}{E_i} )$$

$$\approx 2 \sum_i O_i ( \frac{O_i-E_i}{E_i} - \frac{(O_i-E_i)^2}{2 E_i^2})$$

$$= 2 \sum_i \frac{O_i}{2} ( 2 \frac{O_i-E_i}{E_i} - \frac{(O_i-E_i)^2}{E_i^2})$$

$$= \sum_i O_i ( 1 -1 + 2 \frac{O_i-E_i}{E_i} - \frac{(O_i-E_i)^2}{E_i^2})$$

$$= \sum_i O_i ( 1 - (1 - \frac{O_i-E_i}{E_i})^2 )$$

$$= \sum_i O_i - \sum_i O_i \left( \frac{O_i}{E_i} \right)^2$$

$$= n - \sum_i \frac{O_i^3}{E_i^2} $$


 * You made a mistake in the one-before-last equality:
 * $$1 - \frac{O_i-E_i}{E_i} \neq \frac{O_i}{E_i}$$  —Preceding unsigned comment added by 87.69.46.105 (talk) 07:38, 23 February 2008 (UTC)

Then please tell me where my error is:

$$G = \sum_i O_i ( 1 - (1 - \frac{O_i-E_i}{E_i})^2 )$$

$$= \sum_i O_i ( 1 + 1 - \frac{O_i-E_i}{E_i}) ( 1 - 1 + \frac{O_i-E_i}{E_i})$$

$$= \sum_i O_i \frac{O_i-E_i}{E_i} ( 2 - \frac{O_i-E_i}{E_i})$$

$$= 2\sum_i O_i \frac{O_i-E_i}{E_i} - \sum_i \frac{O_i}{E_i} \frac{(O_i-E_i)^2}{E_i}$$

$$= 2\sum_i \frac{(O_i-E_i)^2}{E_i} - \sum_i \frac{O_i}{E_i} \frac{(O_i-E_i)^2}{E_i}$$

$$= \sum_i \frac{(O_i-E_i)^2}{E_i} ( 2 - \frac{O_i}{E_i})$$
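For what it's worth, the last line resolves the puzzle: under the null hypothesis $$O_i/E_i \approx 1$$, so the factor $$(2 - O_i/E_i) \approx 1$$ and $$G \approx \sum_i (O_i-E_i)^2/E_i$$ with no stray factor of 2. A quick numerical sketch (the counts are made up, chosen close to their expectations):

```python
import numpy as np

# Made-up counts close to their expected values
O = np.array([105.0, 95.0, 98.0, 102.0])
E = np.array([100.0, 100.0, 100.0, 100.0])

G = 2 * np.sum(O * np.log(O / E))   # G statistic
chi2 = np.sum((O - E) ** 2 / E)     # Pearson chi-squared

print(G / chi2)  # close to 1, not 2
```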

More precise stating of distribution of G under null hypothesis
This sentence should be made more precise:
 * Given the null hypothesis that the observed frequencies result from random sampling from a distribution with the given expected frequencies, the distribution of G is approximately that of chi-squared, with the same number of degrees of freedom as in the corresponding chi-squared test.

Does it converge in distribution? So does the $$ \chi^2 $$ statistic, right? Is the asymptotic rate of convergence quicker for $$ G $$ than for $$ \chi^2 $$? I don't have any references on $$ G $$ so I'm afraid I won't be of any help answering these questions.

Andyrew609 19:39, 27 November 2006 (UTC)
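Lacking a reference, one can at least probe the claim by simulation: under a fully specified multinomial null with k cells, G should be approximately chi-squared distributed with k − 1 degrees of freedom (so its mean should be near k − 1). A Monte Carlo sketch (the cell count, sample size, replication count, and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
k, n, reps = 4, 500, 20000
p = np.full(k, 1.0 / k)  # fully specified null probabilities
E = n * p                # expected counts

counts = rng.multinomial(n, p, size=reps)
# Treat 0 * log(0) as 0 so empty cells do not produce NaN
with np.errstate(divide="ignore", invalid="ignore"):
    terms = np.where(counts > 0, counts * np.log(counts / E), 0.0)
G = 2 * terms.sum(axis=1)

print(G.mean())  # should be near k - 1 = 3
```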

splitting of the G statistics
I am currently going through Agresti's Categorical Data Analysis (2002), and on page 82 he gives a clean explanation of how to partition the G statistic. (P.S.: be aware that in the 2007 edition of the book, most of this section was cut - so don't bother looking for it there.)

This partitioning is useful - so it might be worth noting in the article... Talgalili —Preceding unsigned comment added by Talgalili (talk • contribs) 17:38, 5 September 2007 (UTC)

Maybe a squeamish comment about notation
In the G formulae, it would be more correct to write the bigger brackets after the summation operator so that they contain the whole indexed expression being summed.

How to handle zero frequencies in observations?
Since in the formula

$$ G = 2\sum_{i} {O_i \cdot \ln(O_i/E_i) }$$

the logarithm is used, how are terms handled where $$ O_i = 0$$?


 * I guess you should use Pearson's in that case, as implicitly recommended for small $$ | O_i - E_i | $$: “[T]he approximation to the theoretical chi-square distribution for the G-test is better than for the Pearson chi-squared tests in cases where for any cell |Oi − Ei | > Ei, and in any such case the G-test should always be used.”
 * l0b0 (talk) 15:45, 16 April 2009 (UTC)

If $$ O_i = 0$$ the term $${O_i \cdot \ln(O_i/E_i) }$$ in the sum should be counted as zero. Entropeter (talk) 20:25, 4 February 2012 (UTC)
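In code, that convention (a term with $$O_i = 0$$ contributes nothing, since $$x \ln x \to 0$$ as $$x \to 0$$) looks like the following sketch; the function name and the sample counts are illustrative, not from any particular package:

```python
import math

def g_statistic(observed, expected):
    """G = 2 * sum_i O_i * ln(O_i / E_i), counting terms with O_i = 0 as zero."""
    total = 0.0
    for o, e in zip(observed, expected):
        if o > 0:  # 0 * ln(0 / e) is taken as 0
            total += o * math.log(o / e)
    return 2.0 * total

# The zero cell drops out; only the remaining cells contribute
print(g_statistic([0, 10, 10], [5, 5, 10]))
```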

Better introduction
The opening paragraph never mentions that the "cells" it refers to are contingency table cells. Holopoj (talk) 21:01, 15 July 2013 (UTC)

Rohlf and Sokal's incredibly late mention of the G test remotely notable?
The G-test (though not under that name) has been recommended (by comparison with the chi-square test) since its development in the 1930s. How on earth does a mention of it by Rohlf and Sokal -- nearly half a century late to the party -- count as notable? Their books are widely known in some application areas perhaps and so that mention might perhaps be the first major push for the G-test in those areas -- but again, so what? Why would it be important to flag how late those areas understood what had been going on in statistics nearly 5 decades before? How does that belong anywhere in an article on the G-test, let alone "above the fold"? Is there any evidence this recommendation by Rohlf and Sokal substantially influenced practice among *statisticians* (rather than people working in some application area)?

If this part is to stay, it needs some support for notability. Glenbarnett (talk) 07:07, 28 January 2016 (UTC)

Why G-test?
The page says the test is "increasingly being used in situations where chi-squared tests were previously recommended." Why? dfrankow (talk) 16:04, 9 May 2017 (UTC)

Historical comment
I removed the sentence "This approximation was developed by Karl Pearson because at the time it was unduly laborious to calculate log-likelihood ratios." from the Relation to the chi-squared tests section. The reason is that there is no citation for this claim. According to Johnson and Kotz, Pearson developed the chi-squared statistic using geometrical considerations. Moreover, Stigler notes that the first appearances of the term "likelihood" in a mathematical sense were in two papers by Fisher in 1921 and 1922. The concept of a likelihood ratio first appeared in papers by E. Pearson and J. Neyman, later in the 1920s.

יוסי לוי (talk) 10:55, 10 December 2018 (UTC)

Woolf (1957)
The article currently fails to mention the original paper that coined the term "G-test": Woolf, B. (1957). The log likelihood ratio test (the G-test). Annals of Human Genetics 21(4), 397-409. GKSmyth (talk) 09:31, 28 June 2020 (UTC)