Talk:Multiple comparisons problem

Wiki Education Foundation-supported course assignment
This article is or was the subject of a Wiki Education Foundation-supported course assignment. Further details are available on the course page. Student editor(s): Jbrowning17. Peer reviewers: Jbrowning17.

Above undated message substituted from Template:Dashboard.wikiedu.org assignment by PrimeBOT (talk) 04:36, 17 January 2022 (UTC)

Coin flip calculation
"the likelihood that a fair coin would come up heads at least 9 out of 10 times is 11 * (½)^10 = 0.0107."

Can someone please explain where the 11 came from? My own understanding is that it should be a 10 —Preceding unsigned comment added by 67.169.127.132 (talk) 05:07, 31 December 2008 (UTC)


 * It says "at least nine", not "exactly nine".
 * And "likelihood" is the wrong word here; I've changed it to "probability". Michael Hardy (talk) 05:57, 31 December 2008 (UTC)
 * P(Heads at least 9 out of 10 times)= P(Heads 10 out of 10 times) + P(Heads 9 out of 10 times)
 * = 1/2^{10} + choose(10,1) 1/2^{10}
 * = 11 * 1/2^{10} 128.231.234.23 (talk) 19:14, 26 March 2024 (UTC)

Can you add a reference of some kind ? Maybe looks obvious to a statistician, but not so much for me. And I'm a scientist ! Would be very helpful for the general reader. —Preceding unsigned comment added by 146.50.10.49 (talk) 09:48, 31 March 2009 (UTC)

Any particular sequence of 10 tosses of a fair coin has probability (1/2) ^ 10. There is only one way to get 10 heads, and there are 10 ways of getting 9 heads (one for each of the 10 throws being the tail, e.g. the first is a tail or the third is a tail), hence (1 + 10) * (1/2) ^ 10. 213.173.165.162 (talk) 14:40, 24 March 2010 (UTC)
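
For readers who want to check the counting argument above, here is a minimal Python sketch. It is not taken from the article; the helper name `prob_at_least` is made up for illustration, and it assumes independent tosses of a fair coin.

```python
from math import comb

def prob_at_least(k_min: int, n: int, p: float = 0.5) -> float:
    """Probability of at least k_min successes in n independent trials (hypothetical helper)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))

# Number of 10-toss sequences with 9 or 10 heads: C(10, 9) + C(10, 10) = 10 + 1
favourable = comb(10, 9) + comb(10, 10)
p_tail = prob_at_least(9, 10)

print(favourable)        # 11
print(round(p_tail, 4))  # 0.0107
```

The factor 11 is simply the count of favourable sequences, each of which has probability (1/2)^10.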

Multiple comparisons or multiple testing ?
I have always heard of the problems explained in the article under the name multiple testing (cf also the Benjamini & Hochberg paper cited at end of the page), so I would be tempted to suggest to move the article to this name, but it may simply be a bias on my side. Any opinion ? Schutz 21:00, 21 September 2005 (UTC)


 * Multiple comparisons is what I've always heard; multiple testing to me sounds like simultaneous testing of multiple null hypotheses, and that is not the same topic. So I oppose such a move. Michael Hardy 22:09, 15 October 2005 (UTC)
 * multiple testing is exactly what you call multiple comparisons, but I'd be interested to know of any reference that documents the meaning that you indicated above. As I wrote above, I have read mainly papers that use the terminology multiple testing, but this may be a bias among researchers specialised in a specific area. For example, almost all the literature on the statistical analysis of DNA microarrays uses this term (I just added a sentence on this on the microarray page). Given that no one else has answered my suggestion (thanks for jumping in !), I will not move the page, but I will indicate clearly that multiple testing is (also) what we are talking about here. Schutz 00:03, 16 October 2005 (UTC)
 * Ok, after rereading the article again, it seems to me that it is very confusing in its present state ! If one reads only the first sentence, multiple comparisons is indeed not the same as multiple testing: if, by multiple comparison, this article means basically "what you do after obtaining a significant ANOVA F-test", then indeed we are not talking about multiple testing in general (which covers more generally the problems of "using statistical tests repeatedly" as indicated in the intro). Do you agree ? In this case, most of the discussion could go in a more general article on multiple testing, that I will be happy to start. Unfortunately, all the definitions I have been able to find so far for multiple comparisons blur the distinction with multiple testing. Schutz 00:53, 16 October 2005 (UTC)
 * I have to agree with Michael Hardy that the term often used is multiple comparisons. I have used it myself, and had reviewers of my own research papers claim I need to do "multiple comparisons corrections." Also, google returns 177,000 hits for the search bonferroni+"multiple comparisons", and 48,000 for bonferroni+"multiple testing". Debivort 08:09, 3 January 2006 (UTC)
 * Sorry, I am lost here; please see the last comment I have written above. It seemed clear to me that this article was about the ANOVA F-test multiple comparison problem; indeed, most of the procedures linked from this page are specifically about "comparing sets of means", as was the lead section. Based on this, the article has been split between multiple comparisons (this page), and multiple testing (the general problem). The (good) changes you have made are about the general problem. If the consensus is that multiple comparisons is the general problem, then the two pages should be merged, and this article should be cleaned up. But I must say that I like the split approach, and it seems logical: with the ANOVA, you are really comparing a set of means, while testing really refers to the application of multiple statistical tests, whatever they are. The google searches do not tell us if the two terms have the exact same meaning (some of the links for multiple comparisons point towards the ANOVA question only; some talk about the more general problem). For the record, even though it is not relevant to this particular discussion, I have mostly seen the term multiple testing for the general problem, including in reviews of research papers. Hey, the only paper in the bibliography of the article that mentions anything says multiple testing, and it is about the general problem ;-). Schutz 15:30, 3 January 2006 (UTC)
 * Mathworld says that Bonferroni corrections address the multiple comparisons problem. They alas do not have an entry on "multiple testing". It seems like the article text (parts not by me) and all of the statistical tests linked below that I am familiar with address "multiple comparisons" as the problem is conceived by me and Mathworld. Is your conception of multiple comparisons (i.e. the ANOVA f-test) a specific example of multiple testing/my conception of multiple comparisons? I wonder if we aren't just running into a linguistic rather than content-based hurdle here. Debivort 16:54, 3 January 2006 (UTC)
 * Basically, I first thought it was only a linguistic question when I started this discussion a few months ago. It is only based on the comments above (it was mentioned in particular that multiple testing and multiple comparisons were not the same thing) and the content of the article that I assumed that multiple comparisons (i.e. the ANOVA f-test) was a specific example of multiple testing; while it was not my conception, I was ok with the distinction and spun off the multiple testing article, which no one objected to. This is why I am a bit puzzled about the going back. I wonder if there may be a systematic difference in vocabulary between statisticians working in different fields; the statistical papers I have seen so far were all about multiple testing (starting with Benjamini-Hochberg, as mentioned above). This is probably why I easily believed that multiple comparisons was the special case, but it may be only a bias. In any case, if the consensus is that multiple testing==multiple comparison (hopefully other people will say something), then the first priority would be to merge the other article, instead of rewriting it (although it may be too late). As a side note, at least some of the linked articles are indeed specially related to ANOVA. Schutz 17:35, 3 January 2006 (UTC)
 * Yeah, it does seem like we need to rope in some other comments. I'll ask around if anyone has the time to comment on it. Maybe you can do the same? Debivort 05:20, 4 January 2006 (UTC)


 * Try asking at Wikipedia talk:WikiProject Mathematics. linas 15:08, 5 January 2006 (UTC)

Disclaimer: I don't know if I am biased by my concrete problem, as I am not a statistician, nor am I a native English speaker, but I'll try to help. According to the dictionary:

Testing n. 1. The act of testing or proving; trial; proof. [1913 Webster]

Comparison n. 1. The act of comparing; an examination of two or more objects with the view of discovering the resemblances or differences; relative estimate. [1913 Webster]

With these definitions, I think that making _multiple tests_ is repeating a test several times. An example: we want to test if A is better than B (or equal, or whatever). After that we get C and we want to test A vs C, and B vs C. Then comes D and I want A-D, B-D, C-D... If we do that with a t-test or Wilcoxon, false positives become more likely (the coin example in the article). In this way, we would be accepting a false hypothesis, for example saying that A has the same mean as D. For this reason, we have tests designed to avoid this: ANOVA (parametric), Friedman (nonparametric), others??...

After performing ANOVA or Friedman, we only know that for example H0: A = B = C = D is not true. Then we would probably want to know which one is different from the others. For this purpose, we can apply one of the techniques that allow us to _compare_ every pair: Tukey test, Nemenyi, Bonferroni...

The previous could clearly split the article in two, but I have probably left out other ideas, like those about techniques to repeat a test in order to increase power, which I do not know about. I think we should clarify which contents we want here before deciding about one or two articles. Arauzo 18:59, 20 April 2006 (UTC)


 * Reviewing some bibliography: in (Zar, 1999) these are chapters 10 and 11:


 * Multiple Hypotheses: the analysis of variance. This chapter introduces the problem of repeating the same test over different samples to confirm various hypotheses about them (coin example). It then explains ANOVA and its non-parametric extensions such as Kruskal–Wallis, and points to chapter 14 for other techniques with more than one factor, e.g. Friedman.


 * Multiple comparisons. This chapter explains how the comparisons among pairs of the samples tested in an ANOVA test should be done, and different tests for comparisons like Tukey.


 * At the start of chapter 11: 'The term "multiple comparisons" was introduced by D. B. Duncan in 1951', according to (David 1995).


 * H. A. David, First (?) occurrence of common terms in mathematical statistics. Amer. Statist. 49: 121-133, 1995.
 * Jerrold H. Zar, Biostatistical Analysis, 4th ed. Prentice-Hall 1999, ISBN 013081542X
 * Arauzo 11:35, 23 April 2006 (UTC)

Strong Support. I'm late to this discussion, but I've never heard of multiple testing, until just now. I use multiple comparisons as a term all the time (especially Bonferroni and friends). Could it be a UK/US thing, or a case where SPSS has dictated the vocabulary to the world? Otherwise, I think the time for merger may be here. I'll plan to do it in a couple of days, if I don't hear from anyone else. -Scott Alberts 03:59, 6 September 2006 (UTC)


 * Multiple testing and multiple comparisons should remain different pages. The difference is not merely UK/US or terminology, but one of essential difference. Multiple testing, or multiple hypotheses testing, is the general problem of testing several null hypotheses while controlling the overall chance of false positive; Bonferroni correction of P-values is the most common method for doing this, but not the only one. Multiple comparisons are the special cases where the tests are comparisons between groups, typically several pairwise comparisons on a set of groups (e.g. all-against-all or one-against-all); although Bonferroni correction is still valid, it is typically not very powerful as it does not make use of the dependencies between different pairwise tests performed on the same groups. I'll see if I can make an update to the multiple testing page at some point which will make this difference more clear. --Septagon 23:43, 7 January 2007 (UTC)

I'm late to the party here, but "multiple comparisons" and "multiple testing" are typically considered as fairly interchangeable terms these days. One of the most respected experts on the topic is J.P. Shaffer, who said: "The term 'multiple comparisons' has come to be used synonymously with 'simultaneous inference,' even when the inferences do not deal with comparisons" (from "Multiple hypothesis testing: A review," Annual Review of Psychology, Shaffer, 1995). In various instances over the decades, people have sometimes used the term "multiple comparisons" more narrowly (referring to ANOVA-type contexts), but I disagree with Septagon's statement that "multiple comparisons" can only be used in the context of group comparisons. For example, in a within-subjects (repeated measures) design, there aren't separate groups, but there could still be "multiple comparisons" if there is more than one factor, more than two factor levels, or more than one outcome variable.

"Multiple comparisons" is an older term than "multiple testing," so it is not surprising that it shows up more in searches. But "multiple testing" is now very much a standard, well accepted term in the field. For instance, there is a well known text called Multiple Testing Problems in Pharmaceutical Statistics, which is authored by and edited by some of the most respected statisticians in the area. In the manual Multiple Comparisons and Multiple Testing Using SAS, by Westfall, Tobias, and Wolfinger (all respected names in the field), the authors say "the term 'multiple testing' is more common than 'multiple comparisons' when analyzing modern high-dimensional data." They make a fairly loose distinction between "multiple comparisons" and "multiple testing," saying that "multiple comparisons" is "often" used for comparisons of several treatment means, whereas "multiple testing" is used in "a broader class of applications." But certainly if one substituted nonparametric tests (in which case the comparisons wouldn't be of means), there would still be "multiple comparisons" in the literal sense. In fact, even multiple correlation tests can be considered multiple "comparisons," as each test is effectively a comparison of the correlation to zero.

To avoid confusion, maybe the best approach for this article is to treat "multiple comparisons" as indeed equivalent to "multiple testing," but mention that the term has sometimes been defined more narrowly. 23.242.195.76 (talk) 17:37, 27 June 2021 (UTC)

Lead section
I think that the lead section should be a little more accessible. The big picture in plain language. There is plenty of room for the subtleties of the concept further down. ike9898 01:59, 8 October 2005 (UTC)


 * I agree. I don't know what the sentence "The experimentwise α level increases exponentially as the number of comparisons increases." means. What is an α level, or where do I go to look it up? Not really a field I know that much about, so I look forward to a more clear article. -- Jake 07:13, 15 October 2005 (UTC)


 * I agree as well, and will take a crack at an edit with a more accessible intro, taking into account the current trend in multiple comparisons v testing (above). Debivort 08:10, 3 January 2006 (UTC)

Tukey's Studentized Range Test/Distribution
There is a nice summary of this by NIST at which I believe is in the public domain, as NIST is a US government agency. In fact I made a template for this: NIST-PD. Btyner 18:57, 15 May 2006 (UTC)

Redundancy
There is a section in the middle of the article that is repeated word-for-word later in the article. Please fix. Thanx. --Cromwellt 5PM, 16 Feb 2007 (having login problems) —The preceding unsigned comment was added by 67.142.130.42 (talk) 23:03, 16 February 2007 (UTC).

RVs
Recent reversions may be problematic as they restore claims about Bonferroni that are uncited and have already been called into dispute. Can some of the other editors weigh in on this? de Bivort 15:38, 13 November 2007 (UTC)

Multiple comparisons for confidence intervals and hypothesis tests
The paragraph says:

"If the inferences are hypothesis tests rather than confidence intervals, the same issue arises. With just one test performed at the 5% level, there is only a 5% chance of incorrectly rejecting the null hypothesis if the null hypothesis is true. However, for 100 tests where all null hypotheses are false, the expected number of incorrect rejections is 5.  If the tests are independent, the probability of at least one incorrect rejection is 99.4%.  These errors are called false positives."

Shouldn't it be "However, for 100 tests where all null hypotheses are true, the expected number of incorrect rejections is 5."?

That is, we reject the null hypothesis when we think it is false (based on the alpha level). In this case the problem arises because we reject it even though it is true. Sorry if I misunderstood; non-statistician here. Diego Diez 13:22, 23 September 2010 (UTC) —Preceding unsigned comment added by Kurai yousei (talk • contribs)
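
The two figures in the quoted paragraph (5 expected incorrect rejections, 99.4% chance of at least one) can be checked with a short Python sketch. This is only an illustration under the reading proposed above: it assumes all 100 null hypotheses are true and the tests are independent. The variable names are not from the article.

```python
alpha = 0.05   # per-test significance level
m = 100        # number of independent tests; assumption: every null hypothesis is true

# Each true null is rejected with probability alpha, so the expected count is m * alpha.
expected_false_positives = m * alpha

# At least one false positive is the complement of "no test rejects".
p_at_least_one = 1 - (1 - alpha) ** m

print(expected_false_positives)
print(round(p_at_least_one, 3))  # 0.994
```

An incorrect rejection (false positive) is only possible when the null hypothesis is true, which is why the arithmetic matches the "all null hypotheses are true" wording.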

Unhelpful section: "Classification of m hypothesis tests"
I suggest a re-thinking of the purpose and goal of this section. The main problem is that nothing in this section appears elsewhere in the article, so it should be deleted if no more work is done to make it useful.

The section comes at an important point in the article and risks throwing the reader off-track. The section throws a lot of variables at the reader, forcing him or her to ponder a complicated table just to come up with some pretty intuitive and obvious ideas, such as that the number of false positives and true positives add up to the number of discoveries. Do we need all this quasi-math to know that?

The table is confusing, mostly because there is no clear relation between rows and columns. I believe "Declared significant" means "Researchers believe alternative hypothesis to be true", and "Declared non-significant" means "Researchers believe null hypothesis to be true". This relabeling might make the table clearer.

So the question comes down to: what do you want to teach the reader? Currently, the section teaches nothing worthwhile. — Preceding unsigned comment added by AndrewOram (talk • contribs) 11:36, 22 April 2014 (UTC)

Dr. Wolf's comment on this article
Dr. Wolf has reviewed this Wikipedia page, and provided us with the following comments to improve its quality:

"The article is not structured well, has many holes, and even contains some wrong (or at least misleading) statements. It might be better to scrap it altogether and refer to a well-crafted review paper instead. Sorry about this negative rating but I feel the need to say it as it is.

http://www.dictionaryofeconomics.com/article?id=pde2010_M000425&edition=current&q=romano%20wolf&topicid=&result_number=1"

We hope Wikipedians on this talk page can take advantage of these comments and improve the quality of the article accordingly.

We believe Dr. Wolf has expertise on the topic of this article, since he has published relevant scholarly research:


 * Reference : Joseph P. Romano & Azeem M. Shaikh & Michael Wolf, 2009. "Hypothesis testing in econometrics," IEW - Working Papers 444, Institute for Empirical Research in Economics - University of Zurich.

ExpertIdeasBot (talk) 22:42, 24 September 2016 (UTC)

Criticism
My issues with this article are: 1) the controlling procedures section is repetitive since its merge with multiple testing correction, so that needs to be condensed and cleaned up 2) the large-scale multiple testing section is largely missing citations for its claims Jbrowning17 (talk) 03:18, 18 October 2016 (UTC)

sources for edit
I must say it is really ridiculous that something that can be found so easily using a search engine (as I mentioned in the original edit summary) would still need to have a source brought for it and be deleted otherwise. Well, here are the sources; you could just google "MCP conference XXX" (where XXX is the year of the conference) and get these as first results, but I'll list those results anyway: Orielno (talk) 17:43, 19 October 2016 (UTC)
 * 2011 conference
 * 2013 conference
 * 2015 conference
 * 2017 conference

Multiple comparisons correction section is highly problematic
For example:

Multiple testing correction refers to re-calculating probabilities obtained from a statistical test which was repeated multiple times.

Not clear what "recalculating probabilities" means. And "repeated multiple times" doesn't make sense in this context. Multiple testing doesn't mean the same test is repeated; it means multiple different tests are conducted.

In order to retain a prescribed family-wise error rate α in an analysis involving more than one comparison, the error rate for each comparison must be more stringent than α.

Not true. For instance, you could use sequential testing or a closed testing procedure. Also, the familywise error rate is only one "total error rate" that has been defined. It would be better to mention the familywise error rate and false discovery rate, and link to the wiki articles on those topics.

Boole's inequality implies that if each of m tests is performed to have type I error rate α/m, the total error rate will not exceed α.

That sentence is grammatically indecipherable.

'''In some situations, the Bonferroni correction is substantially conservative, i.e., the actual family-wise error rate is much less than the prescribed level α. This occurs when the test statistics are highly dependent (in the extreme case where the tests are perfectly dependent, the family-wise error rate with no multiple comparisons adjustment and the per-test error rates are identical).'''

Too much detail about one specific procedure (the Bonferroni procedure) in one implausible situation (perfect dependence). Why not just briefly mention the Bonferroni procedure and link to the wiki article on that topic?

'''For example, in fMRI analysis,[8][9] tests are done on over 100,000 voxels in the brain. The Bonferroni method would require p-values to be smaller than .05/100000 to declare significance.'''

fMRI analysis is often done on over 100,000 voxels, but not always. Also, this is not the best example, since using p-values for thresholding in fMRI (as described in the cited sources) isn't quite the same as using p-values in standard null hypothesis tests.

Since adjacent voxels tend to be highly correlated, this threshold is generally too stringent.

It's POV to say the Bonferroni threshold is "generally too stringent."

Because simple techniques such as the Bonferroni method can be conservative, there has been a great deal of attention paid to developing better techniques, such that the overall rate of false positives can be maintained without excessively inflating the rate of false negatives.

It's POV to refer to other techniques being "better" than Bonferroni.

'''Such methods can be divided into general categories:

Methods where total alpha can be proved to never exceed 0.05 (or some other chosen value) under any conditions. These methods provide "strong" control against Type I error, in all conditions including a partially correct null hypothesis. Methods where total alpha can be proved not to exceed 0.05 except under certain defined conditions. Methods which rely on an omnibus test before proceeding to multiple comparisons. Typically these methods require a significant ANOVA, MANOVA, or Tukey's range test. These methods generally provide only "weak" control of Type I error, except for certain numbers of hypotheses. Empirical methods, which control the proportion of Type I errors adaptively, utilizing correlation and distribution characteristics of the observed data.'''

That entire list is unsourced, and there is no particular reason to divide multiple testing procedures into those particular categories. For one thing, these categories are not mutually exclusive (e.g., methods that provide weak control are a subset of methods that provide control "except under certain defined conditions"). Also, the first category (methods that work "under any conditions") is so narrowly defined that it excludes basically any standard method, as there are pretty much always statistical assumptions that are required. And the second category (methods that work "except under certain defined conditions") is so vague as to be essentially meaningless. Also, some of the info here is false. For example, Tukey's range test is NOT an omnibus test; and a method can require a significant omnibus test and still provide strong control (e.g., Hayter's procedure). Also, the list of categories ignores procedures that are designed to control the false discovery rate, rather than the familywise error rate. It would be better to link to the familywise error rate article than attempt to cram a bunch of detail in here.

'''The advent of computerized resampling methods, such as bootstrapping and Monte Carlo simulations, has given rise to many techniques in the latter category. In some cases where exhaustive permutation resampling is performed, these tests provide exact, strong control of Type I error rates; in other cases, such as bootstrap sampling, they provide only approximate control.'''

That may be true (though no source is provided). But again, there is an entire wiki article on the familywise error rate, which covers methods of control. Why not just link to that article, instead of going into detail in this article which is supposed to be about the general problem of multiple comparisons?

Problems with the "controlling procedures" section
'''If m independent comparisons are performed, the family-wise error rate (FWER), is given by ᾱ = 1 − (1 − α)^m. Hence, unless the tests are perfectly positively dependent (i.e., identical), ᾱ increases as the number of comparisons increases.'''

The second statement, though true, doesn't follow from the first, so why "Hence?" Also, FWER is only one error rate that's been defined; what about the per-family error rate and the false discovery rate?

If we do not assume that the comparisons are independent, then we can still say: ᾱ ≤ m ⋅ α_{per comparison}

"We can still say?" I think what was meant was that, regardless of the dependence of the tests, ᾱ ≤ m ⋅ α_{per comparison}, and the more positively dependent the tests, the more ᾱ shrinks toward α.

Example: 0.2649 = 1 − (1 − .05)^6 ≤ .05 × 6 = 0.3

That "example" has no explanatory value. What's the point of simply plugging arbitrary values into the formula without providing explanation that connects those values to some useful context?
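
One possible context for those numbers, offered only as a sketch: m = 6 is the number of pairwise comparisons among four groups, each tested at α = .05. The Python below evaluates the independent-tests formula and the Boole bound for a few values of m. The function names are illustrative assumptions, not anything from the article.

```python
def fwer_independent(alpha: float, m: int) -> float:
    """FWER for m independent tests at level alpha, assuming all nulls are true."""
    return 1 - (1 - alpha) ** m

def boole_bound(alpha: float, m: int) -> float:
    """Upper bound on the FWER from Boole's inequality; holds under any dependence."""
    return min(1.0, m * alpha)

alpha = 0.05
for m in (1, 6, 20, 100):
    # The m = 6 row reproduces the quoted figures: 0.2649 and 0.3
    print(m, round(fwer_independent(alpha, m), 4), round(boole_bound(alpha, m), 4))
```

The gap between the two columns widens as m grows, and the bound becomes uninformative once m ⋅ α reaches 1.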

The most conservative method, which is free of dependence and distributional assumptions, is the Bonferroni correction

Not necessarily. There are procedures that can be more conservative than Bonferroni in some cases (e.g., the Benjamini–Yekutieli procedure and the Scheffé procedure). Also, note that we don't need to mention the Bonferroni procedure both here and in the next section. Also, "free of dependence and distributional assumptions" is vague and likely unclear to many readers. I suggest either being more explicit (e.g., "The Bonferroni procedure does not require any assumptions about the dependence (correlation) of the tests, and does not impose any added assumptions about the distributions of the test statistics") or simply linking to the Bonferroni correction article without going into the issue of assumptions here.

A marginally less conservative correction can be obtained by solving the equation for the family-wise error rate of m independent comparisons for α_{per comparison}.

That method (i.e., the Šidák procedure) is indeed marginally less conservative than Bonferroni, but it's actually marginally liberal when the tests are negatively dependent. Also, focusing on the Šidák and Holm procedures is arbitrary. Numerous procedures have been developed. Each procedure has its pros and cons, and some procedures are more generalized or specialized than others. Mentioning Bonferroni makes sense (though we don't need to do so multiple times in the article), since that's the simplest method. Why don't we just mention Bonferroni and note that there are other procedures, rather than going into detail about two particular alternatives?
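
To make "marginally less conservative" concrete, here is a rough Python sketch comparing the two per-comparison thresholds for m independent tests. The function names are hypothetical and the choice of α = .05, m = 6 is only an example. It says nothing about the negatively dependent case mentioned above.

```python
def bonferroni_per_test(alpha: float, m: int) -> float:
    """Bonferroni per-comparison level: alpha / m."""
    return alpha / m

def sidak_per_test(alpha: float, m: int) -> float:
    """Šidák per-comparison level: solve 1 - (1 - a)^m = alpha for a."""
    return 1 - (1 - alpha) ** (1 / m)

def fwer_independent(per_test: float, m: int) -> float:
    """FWER when m independent tests are each run at level per_test."""
    return 1 - (1 - per_test) ** m

alpha, m = 0.05, 6
b = bonferroni_per_test(alpha, m)
s = sidak_per_test(alpha, m)

print(b, s)                                            # Šidák threshold is slightly larger
print(fwer_independent(b, m), fwer_independent(s, m))  # Bonferroni lands just under alpha
```

Under independence the Šidák level recovers α (up to rounding), while Bonferroni falls slightly below it, which is the sense in which the difference is marginal.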

Continuous generalizations of the Bonferroni and Šidák correction are presented in.[7]

That's not even a complete sentence. — Preceding unsigned comment added by 23.242.195.76 (talk) 16:49, 27 June 2021 (UTC)