Talk:F-test of equality of variances

This page provides a quick reference for a statistical procedure that used to be widely taught, and is still sometimes taught, although professional statisticians have long deprecated its use.

Otherwise, the warning about this test is only a short part of the F-test article, which makes it hard to reference and which invites confusion. Some months ago, I corrected the F-test article when a (non-statistician?) editor stated that the F-test (for the null hypothesis of the analysis of variance) is not robust.

Having a separate article will help clarify the difference. Thanks Kiefer.Wolfowitz (talk) 00:19, 5 March 2010 (UTC)

Another argument for keeping this article is that only a separate article (with a name like this) can be listed in the Category of Obsolete statistical procedures. Kiefer.Wolfowitz (talk) 00:46, 5 March 2010 (UTC)

Title
The title of this article is somewhat long and cumbersome. Would F-test of equality of variances be reasonable? Michael Hardy (talk) 16:44, 6 March 2010 (UTC)


 * That is reasonable (and agreeable)! Please move the article to the better name. Thanks for the improvement! Kiefer.Wolfowitz (talk) 17:58, 6 March 2010 (UTC)
 * I moved it. Kiefer.Wolfowitz (talk) 18:14, 6 March 2010 (UTC)

Promulgating myths?
The value of this category is diminished when it also serves to promulgate myths, such as the claim that the F test for means is robust to departures from population normality, when a host of Monte Carlo studies since 1980 have shown it isn't for smaller alpha levels (e.g., .01) and/or unbalanced layouts, particularly when the group with the largest standard deviation has the smallest n. Edstat (talk) 04:55, 9 May 2010 (UTC)
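The unbalanced-layout effect described here (pooled t-test, normal data, unequal variances, with the larger standard deviation in the smaller group) is easy to demonstrate by simulation. The sketch below is only illustrative: the sample sizes, standard deviations, and replication count are arbitrary choices, not taken from any of the studies under discussion.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, sd1, sd2 = 5, 15, 2.0, 1.0   # null is true: both population means are 0
reps, crit = 20_000, 2.101            # two-sided t critical value for df = 18

rejections = 0
for _ in range(reps):
    x = rng.normal(0.0, sd1, n1)
    y = rng.normal(0.0, sd2, n2)
    # pooled-variance (classical) two-sample t statistic
    sp2 = ((n1 - 1) * x.var(ddof=1) + (n2 - 1) * y.var(ddof=1)) / (n1 + n2 - 2)
    t = (x.mean() - y.mean()) / np.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))
    rejections += abs(t) > crit

print(f"empirical Type I error: {rejections / reps:.3f}")  # typically well above .05
```

With these settings the empirical rejection rate comes out well above the nominal .05; swapping which group has the larger standard deviation makes the test conservative instead.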


 * There is no statement that the robustness covers everything. However, this F-test is relatively robust according to the usual reliable sources, e.g. Moore and McCabe. In particular, Hinkelmann and Kempthorne discuss simulation results for unbalanced data, noting increased variance of the simulated p-values under the null hypothesis when imbalance increases, but concluding that the F-test is rather good. There is a more detailed discussion of imbalance in the book by André Khuri, Thomas Mathew, and Bimal Sinha. The phrase "promulgate myths" seems to depart from your previous respect for Kempthorne, et alia. Thanks, Kiefer.Wolfowitz (talk) 19:42, 9 May 2010 (UTC)


 * This is the danger in restricting one's knowledge to textbook authors, instead of primary sources.* The Type I error and power characteristics are famously documented - first on the side of the F test was Lindquist in the 1940s (who cited results from his student in his book), followed by the Boneau studies of 1960 and 1961, culminating with the Glass, Peckham, and Sanders study in 1972. J V Bradley, in 1968, raised the flag against the F test, and through the mid 1970s fought with GP&S until Bradley was finally blacklisted by obnoxious so-called purists who couldn't see beyond their biases and prejudices. Blair's study in 1980 came to the current understanding with the F, and Sawilowsky in 1992 with the t, that for larger alpha levels and balanced layouts the parametric tests are generally (and sometimes remarkably) robust to departures from population normality when testing for a shift in location parameter, but for smaller alpha levels and unbalanced layouts, some Type I inflations/deflations do occur. (E.g., if my memory serves correctly, Type I errors for the t, with nominal alpha at .05, fall to .044 for the exponential, .042 for t (df=3), .038 for the chi-square (df=2), and .018 for the Cauchy.)
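The balanced-layout half of this claim can be checked the same way. The following sketch (again with arbitrary, illustrative settings; it does not attempt to reproduce the figures quoted above) estimates the Type I error of the pooled two-sample t-test when both samples come from one exponential population:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps, crit = 10, 20_000, 2.101    # balanced groups; t critical value, df = 18

rejections = 0
for _ in range(reps):
    x = rng.exponential(1.0, n)       # both groups from the same exponential
    y = rng.exponential(1.0, n)       # population, so the null of equal means holds
    sp2 = (x.var(ddof=1) + y.var(ddof=1)) / 2.0   # pooled variance, equal n
    t = (x.mean() - y.mean()) / np.sqrt(sp2 * 2.0 / n)
    rejections += abs(t) > crit

print(f"empirical Type I error: {rejections / reps:.3f}")  # typically near .05
```

The rate stays close to the nominal level, which is the sense in which the balanced t-test is called robust to this kind of nonnormality.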


 * The more important point, however, was first made by Scheffe in the early 1950s but it fell on deaf ears, again due to those who couldn't see beyond their biases and prejudices, that the cost of that general (and sometimes remarkable) robustness was the potential to impact comparative power. The Monte Carlo studies of Blair and Sawilowsky, among others, throughout the 1980s reframed the issue by demonstrating that even if the t/F are robust, they are no longer UMPU, that nonparametric counterparts are often 3-4 times more powerful, and that by definition those npar tests are completely robust to this violation. In short, in the rare case of normality the parametric procedures maintain a power advantage of .01-.03, whereas for nonnormality, the nonparametric procedures easily have power, for example under extreme skew, of .99 while the parametric test's power is .051. When I mentioned that I have had to fight clever workers, that was code for textbook authors, who still think Siegel's 1956 or Tate & Clelland's 1957 npar book is up to date, and hence every few years try to take us back several decades.
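The power comparison claimed here can also be illustrated numerically. The sketch below compares the pooled t-test with a hand-rolled Wilcoxon rank-sum test (large-sample normal approximation, no tie correction) for a location shift between two lognormal populations. The shift size, sample size, and distribution are illustrative choices only, so the resulting power gap should not be read as the 3-4x figure quoted above.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps, shift = 20, 5_000, 1.0
t_crit, z_crit = 2.024, 1.96          # two-sided: t with df = 38, standard normal

t_rej = w_rej = 0
for _ in range(reps):
    x = rng.lognormal(0.0, 1.0, n) + shift   # location-shifted lognormal
    y = rng.lognormal(0.0, 1.0, n)
    # pooled-variance two-sample t statistic (equal n)
    sp2 = (x.var(ddof=1) + y.var(ddof=1)) / 2.0
    t = (x.mean() - y.mean()) / np.sqrt(sp2 * 2.0 / n)
    t_rej += abs(t) > t_crit
    # Wilcoxon rank-sum statistic with a large-sample normal approximation
    ranks = np.argsort(np.argsort(np.concatenate([x, y]))) + 1
    w = ranks[:n].sum()                # sum of ranks of the first sample
    mu = n * (2 * n + 1) / 2.0
    sigma = np.sqrt(n * n * (2 * n + 1) / 12.0)
    w_rej += abs((w - mu) / sigma) > z_crit

print(f"t power: {t_rej / reps:.3f}, Wilcoxon power: {w_rej / reps:.3f}")
```

Under this heavily skewed parent the rank test detects the shift more often than the t-test, which is the direction of the effect being described.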


 * Kempthorne cannot be faulted for MC results that came after his time. There are some textbook authors who have actually read the primary sources and get a lot of this more or less correct. Rand Wilcox's later textbooks come to mind (but then again, there are a few dozen points where even he is still behind the times, but he is still LIGHT YEARS ahead of the those intro textbook authors you've mentioned above).


 * *This also underscores why Wikipedia must advertise itself as a non-professional encyclopedia, because by definition it calls on non-academics to base what they write on secondary and tertiary sources! Edstat (talk) 23:46, 9 May 2010 (UTC)


 * Edstat, thanks for your explanation. Could you cite some other authors in this article, with a view towards the weight given in surveys in the most reliable sources? I do think that Akritas, Hettmansperger, etc. are serious and have shown great respect for Shlomo Sawilowsky before, so maybe you could find something by such an author? I know that Matthews is serious, etc. Thanks, Kiefer.Wolfowitz (talk) 01:56, 10 May 2010 (UTC)


 * Well, be specific about which points you are interested in and I will see what I can do. Edstat (talk) 13:23, 10 May 2010 (UTC)

Removing from obsolete category
I have again removed this from Category:Obsolete statistical procedures. The reasons for not including it, which are clearly stated in Category talk:Obsolete statistical procedures, clearly apply. They are:
 * Has anyone ever referred to these topics as obsolete? -No.
 * Are the underlying statistical ideas wrong? -No.
 * Do they still appear in current statistical text books? -Yes.
 * Are they something that a well-grounded statistician should know about? -Yes.

So the test is clearly not obsolete. Melcombe (talk) 12:21, 11 May 2010 (UTC)


 * I disagree with your assessment. Numerous Monte Carlo studies have shown that what was previously considered to be a legitimate test is in fact nonrobust, and hence it is obsolete. There are many secondary textbook authors who have noted this. It is appropriate, if you believe insufficient documentation is presented, to ask for more citations, but not to delete it. Similarly, the underlying statistical idea, initially based on asymptotic theory, has subsequently been shown to be invalid for small samples. I will wait a few days, and if there is no response, I will reinstate this category, and you can feel free to put a citations-needed tag if you believe it requires more documentation. Someone finally does something of professional value in Wikipedia and it should be supported, not censored. Edstat (talk) 15:19, 12 May 2010 (UTC)


 * "Similarly, the underlying statistical idea, initially based on asymptotic theory, has subsequently been shown to be invalid for small samples." Are you saying that the theory used to derive the F-distribution for any sample size is wrong, given the assumption of normality? I am sure that will be news to the statistical community. And it seems you need to be reminded of WP:NOTAMANUAL ... that Wikipedia is not some sort of guide book to what should be done. Melcombe (talk) 17:01, 12 May 2010 (UTC)


 * Again, the name "obsolete" was chosen to be consistent with the super-category. (I would prefer "deprecated" or "non-robust or underpowered or inefficient or inadmissible", etc.) Will Melcombe also remove the point-of-view super-category? Kiefer.Wolfowitz (talk) 16:50, 12 May 2010 (UTC)


 * Clearly some theories have been shown to be wrong and hence the theory might well be called "obsolete". The procedure here is perfectly correct if the assumptions are valid. Melcombe (talk) 17:01, 12 May 2010 (UTC)


 * Melcombe, that is your opinion and a big "if". The WP-relevant point is that the most respected textbooks (in the USA, at least) warn against the F-test for equality of variances, precisely because it's useless in practice. The Wikipedia article on alchemy doesn't include the Banach-Tarski paradox for doubling the mass of gold, because Martin Gardner's procedure is worthless in practice. Don't you acknowledge that deprecation of this F-test is standard among the most reliable sources (although perhaps not updates of Kendall's works from the 1940s)? Kiefer.Wolfowitz (talk) 17:19, 12 May 2010 (UTC)


 * Below you have now agreed that the F-test is valid if the assumptions behind it are valid. I agree that the test has been shown to give poor results when the assumption of normality does not hold. But there is a distinction between "warned against" and "obsolete". You have had the opportunity to add in citations for "obsolete" but have not done so. I encourage you to add in citations backing up the statement that the test is warned-against as it is evidently true that the F-test here is both warned-against and should be warned-against. But it remains true that the F-test is valid and should be used when the assumptions required are true (and known to be true)... there may be limited circumstances where this applies but they exist. It is essentially a matter of finding a phrasing that adequately gets over just when/why the test is warned-against. Melcombe (talk) 13:07, 13 May 2010 (UTC)


 * Melcombe is correct that normality suffices for the F-test, and that the wording about asymptotics needs to be removed. Kiefer.Wolfowitz (talk) 19:12, 12 May 2010 (UTC)


 * No, what I said was a continuation of the previous sentence. Not only is the F test for variances not robust to departures from population normality, its degree of nonrobustness INCREASES as the sample size (drawn from a nonnormal distribution) DECREASES, despite what is predicted from asymptotic theory. No, that is not news to anyone who has conducted a Monte Carlo study on this. Edstat (talk) 19:52, 12 May 2010 (UTC)


 * I don't see any mention of asymptotics in the article. Clearly if the simulations and supposed asymptotic theory disagree (and disagree more as the sample size increases, which is what you seem to claim), then one of these must apply: the simulations are being undertaken incorrectly, or the asymptotic theory is being applied incorrectly, or they are being applied under contradictory assumptions. Melcombe (talk) 12:50, 13 May 2010 (UTC)


 * Furthermore, the statement - "Clearly some theories have shown to be wrong and hence the theory might well be called 'obsolete'. The procedure here is perfectly correct if the assumptions are valid" - is internally inconsistent. The F test on variances is used to test homoscedasticity, and it turns out, unexpectedly, to be based on the assumption of homoscedasticity! In other words, the only time this test may be "perfectly correct" is when it is "perfectly unneeded". No one in applied statistics would consider this argument to be of any value. Edstat (talk) 19:55, 12 May 2010 (UTC)


 * I stand corrected. I should have read the sentences again. Kiefer.Wolfowitz (talk) 20:02, 12 May 2010 (UTC)


 * Let me also clarify: Suppose sample x is drawn from a normal population with mean zero and sd 1. Suppose further sample y is drawn from an exponential distribution. There are an infinite number of exponential distributions, but only one has mean 1, and hence sd 1. Therefore, in violating normality, there is almost a guarantee there will be a violation of homogeneous variances, and hence my comment above that the test unexpectedly turns out to be sensitive to nonnormality because concomitant with it is heteroscedasticity. I hope that clarifies my point. Edstat (talk) 23:54, 12 May 2010 (UTC)
 * No it does not. Just what is it that you think is being tested? It is perfectly possible to test for equality of variances without making an assumption of normality, which is presumably exactly what the tests being suggested as replacements do. Your argument would imply that these tests are wrong too. You are wrong to say that there is only one such exponential distribution, as the context here is one in which there is an unknown location parameter for each population, so the case being treated would need to be considering shifted exponential distributions (involving two free parameters), not just ordinary exponentials. Of course it would be possible to construct tests appropriate to the assumptions you seem to want, but they are straying too far from what the F-test is testing. You would have to specify exactly what assumptions you are prepared to assume: what classes of distributions are involved and what types of differences between the distributions under the alternative hypothesis. It seems that the F-test and at least some of the "replacements" involve assuming that the possible differences are of location-scale form, while only the F-test assumes normality.
 * You seem to be confusing the assumptions behind the test with what is being tested. The only point in the overall test procedure at which an assumption of "homogeneity" is being made is in determining the distribution of the test statistic. The test is a valid test against "homogeneity", essentially because the test statistic is selected under the assumption that "homogeneity" does NOT hold. These are just standard steps, essentially basic to all of hypothesis testing. The assumption of normality is used in two ways: (i) in determining the null distribution of the test statistic, which is why the test is invalid (has the wrong size) if the assumption of normality is wrong; (ii) in determining the form of the test statistic as some form of "best test statistic" (either through power comparisons or derivation via maximum likelihood), and here normality under both the null and alternative hypothesis is required. To consider other tests, the assumption of "homogeneity" must always be made under the null hypothesis, essentially because that is the hypothesis being tested.
 * Melcombe (talk) 12:41, 13 May 2010 (UTC)

So, what do we see here? (Yes I like that phrase.) We have a test that does not have the required properties (size) in situations in which the assumptions made for its validity are not valid. That clearly would apply to any test. Non-normality is not particularly special. The majority of things labelled as "statistical tests" rely on the assumption of statistical independence somewhere within their set of assumptions ... are we to label all tests as "obsolete" because complete statistical independence never holds? This non-robustness to validity of assumptions does not separate the "F-test of equality of variances" from any other test. JA(000)Davidson (talk) 09:53, 13 May 2010 (UTC)
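For concreteness, here is a minimal sketch of the point at issue in this exchange: the same variance-ratio F-test, applied to data where the null hypothesis of equal variances is true, holds its nominal size under a normal parent but not under an exponential one. All settings (sample sizes, replication count, the tabulated critical value) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 21, 20_000
f_hi = 2.4645                # approximate F_{.975}(20, 20)
f_lo = 1.0 / f_hi            # two-sided rejection outside [f_lo, f_hi]

def empirical_size(draw):
    """Rejection rate of the variance-ratio F-test when the null is true."""
    rej = 0
    for _ in range(reps):
        ratio = draw().var(ddof=1) / draw().var(ddof=1)
        rej += ratio > f_hi or ratio < f_lo
    return rej / reps

normal_rate = empirical_size(lambda: rng.normal(0.0, 1.0, n))
expo_rate = empirical_size(lambda: rng.exponential(1.0, n))
print(f"normal parent: {normal_rate:.3f}, exponential parent: {expo_rate:.3f}")
```

The contrast between the two printed rates is the non-robustness discussed above: the test's size depends on the shape of the parent distribution, not only on whether the variances are equal.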

Name change to resolve this problem?
Here is a suggestion. There are numerous statistical tests that are (1) obsolete (i.e., superior tests have since been developed for the null and alternative hypothesis for which it was designed), but there are also (2) tests that subsequently have been shown by Monte Carlo methods to be nonrobust to departures from independence, departures from homogeneity of variance, departures from normality, and/or combinations thereof, or are (3) lacking in comparative statistical power. Perhaps the argument of Melcombe is just semantics due to the category title. I suggest moving this to a title such as: Obsolete, non-Robust, and non-Powerful Statistical Tests, or Poor Statistical Tests, or some such title that is more encompassing than just "obsolete." Edstat (talk) 00:05, 13 May 2010 (UTC)


 * Well, it is not only semantics, it is also a question of providing adequate back-up for any statements being made. I agree with the contribution in the section above, that we need to be careful of basing the idea of labelling a test as "not to be used" just on the fact that it has poor behaviour if it is used in cases where the basic assumptions behind the test are not valid. "Poor power" might be a consideration, but clearly does not apply here because the test is optimal provided the assumptions are valid: but a test with poor power need not always be regarded as "not to be used" if it has other virtues, such as being readily calculable by hand. I suggest that, if the separation-out of some procedures really is wanted, the term "superseded" (as in "superseded statistical procedures") might do for a category name, but that this needs to be backed up in every such article by a section labelled "superseded procedures" (or similar) which would be specific about:
 * exactly what test or procedure was being said to be superseded (to cover instances where an article might describe several procedures only some of which are "superseded", and protection against merging of articles).
 * the context in which the thing is superseded (e.g. for practical statistical analyses, or ranges of sample sizes) to avoid giving the impression that the theory is somehow wrong. If there are other instances of things being "superseded" they may not be as supposedly clear-cut as for this article, and there must be more room for providing helpful information that just sticking an article in a category does not provide.
 * replacement procedures
 * sources justifying the label as superseded and the supposed replacements ... and no, primary sources are not good enough on their own; there needs to be some more general backing in the statistics community. For this particular test, we had requests for citations backing the label as "obsolete" that were never met, but "superseded" may fare better.
 * My reason for saying that "primary sources are not good enough on their own" is partly that this is a Wikipedia requirement, but also to provide protection from cases such as where, in KW's example (on the other talk page) of "flat-earth", we could very well find a "primary source" setting out this theory and extolling its virtues but would be unable to find adequate secondary sources to justify its worth, notability or usefulness. If a category is set up, then I think it needs to be a requirement for inclusion that a section such as outlined above should exist in the article, because there must be a way in which a value-judgement such as that implied here can be given appropriate citations and discussed as usual, and because there should be an easy way of finding the justification for being in the category within the article ... and perhaps more importantly the exact circumstances of when a procedure is "superseded" can be fine-tuned.
 * The question would then remain of whether the procedure in the present article actually is "superseded". There are arguments both ways, given that the test can (and should) be used in contexts where adequate checking of the normality assumption has already been made, perhaps across multiple sets of data of the same type from the same source. Thus the test procedure cannot be regarded as "superseded" in all circumstances. But that is why discussion of exactly when a procedure is inadvisable needs to be included in the article. And it exemplifies why I would prefer "superseded" to "obsolete", as the latter term does seem to have an indivisible, no-exceptions character.
 * Melcombe (talk) 11:59, 13 May 2010 (UTC)

Issues
I think the discussion above has raised an important question regarding the "replacement tests" being indicated ... just what is it that these tests provide tests of, and what assumptions/structure are required? One can imagine replacing the normality-location-scale framework leading to the F-test by either a "same-but-unknown family of distributions differing by location-scale" or a "possibly different distribution families, unknown locations but differing variances" framework, or even having "variances" here replaced by some other type of measure of spread. If you want to try to indicate "good practice", then this should start with choosing an appropriate thing to test, within an appropriate set of assumptions and structure of change-to-be-detected. Melcombe (talk) 13:24, 13 May 2010 (UTC)

Once again around the percentage bend (yet another obsolete procedure)
"We have a test that does not have the required properties (size) in situations in which the assumptions made for its validity are not valid. That clearly would apply to any test." This, I suppose, is the difference between a mathematical statistician's view of the field as opposed to an applied statistician's view of the field. The utility of a statistical test, for situations when its underlying assumptions are not known to be true or cannot reasonably be assumed to be true, is based solely on its robustness first, and comparative power second. The F test on variances is notoriously non-robust to non-normality in all its forms, and applied statisticians are well trained to realize that population normality is not the norm (see, e.g., the surveys of Pearson and Please; Tan; Micceri; etc.). In other words, the only time this test can be reasonably used is in a fictional world, and hence its appeal to the naïve mathematical statistician who doesn't have to testify in court, be responsible for a clinical trial, make a business decision, etc.

This is the reason that there are countless primary studies (Monte Carlo or otherwise) that have demonstrated how dangerous this test is, and if one wanted to do so, one could find countless textbook authors (i.e., secondary sources) mimicking this fact.

The argument being set forth in defense of this test is the reason why it pops up periodically. If the F test was a drug, those pushing it would be sued for product liability.

I lost interest in this argument 30 years ago – there has been due diligence in the field by ethical and responsible applied statisticians, and if every few years someone wants to make the claim that the F test on variances is useful in any applied situation, I would challenge them to specify in what real life circumstances they know this to be true, and furthermore, to support it.

Finally, it takes about 5 minutes to write and execute a Monte Carlo study showing that conducting the F test on variances conditional on a non-significant test of normality inflates the Experiment-wise Type I error rate to nearly double, and produces a marked loss in power. So, if you can live with that, I'm done editing here. Edstat (talk) 00:21, 15 May 2010 (UTC)
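The two-stage procedure described here can be sketched as follows. A crude skewness-based screen stands in for the normality pretest (it is not any particular published test), and the variance-ratio F-test is then run only on datasets that pass the screen. Since both populations are exponential with equal variances, every rejection is a Type I error. All settings, including the tabulated F critical value, are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 21, 20_000
f_hi, f_lo = 2.4645, 1.0 / 2.4645     # approximate F_{.975}(20, 20), two-sided
skew_crit = 1.96 * np.sqrt(6.0 / n)   # rough large-sample cutoff for |skewness|

def sample_skew(a):
    d = a - a.mean()
    return (d ** 3).mean() / (d ** 2).mean() ** 1.5

passed = cond_rej = total_rej = 0
for _ in range(reps):
    # equal variances, nonnormal populations: every rejection is a Type I error
    x, y = rng.exponential(1.0, n), rng.exponential(1.0, n)
    rej = not (f_lo < x.var(ddof=1) / y.var(ddof=1) < f_hi)
    total_rej += rej
    if abs(sample_skew(x)) < skew_crit and abs(sample_skew(y)) < skew_crit:
        passed += 1                    # both samples "look normal" to the screen
        cond_rej += rej

print(f"unconditional size of the F-test: {total_rej / reps:.3f}")
print(f"size among the {passed} datasets passing the pretest: "
      f"{cond_rej / max(passed, 1):.3f}")
```

Because the pretest has limited power at these sample sizes, many nonnormal datasets pass the screen, and the conditional rejection rate printed on the second line can then be compared with the nominal .05.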


 * It remains the case that no-one has added anything that expresses what an applied statistician's view would be (and provided citations). There is not even a statement of what someone might think they are testing if they are not assuming normality. I have not read anything here that could be interpreted as being "the argument being set forth in defense of this test". It remains a fact that the F-test is perfectly valid if the assumption of normality holds ... no one has yet added anything that even begins to explain why someone would contemplate applying this test if they think that normality might not hold. Thus the article currently has no context for saying anything about the poor performance if there is non-normality. I am certainly not against having valid facts for practical statisticians, but we need to be clear about what those facts are and when they apply, without being in conflict with WP:NOTAMANUAL. Melcombe (talk) 08:54, 18 May 2010 (UTC)


 * Well then, if you insist on repeating the same thing, I give up. After all, Wikipedia is a nonprofessional encyclopedia, so why not promote a test that has no practical use, and in fact is dangerous in all real data situations (and before you ask, as have others, what is meant by real data, see Micceri)? I've discharged my obligation to the profession; have you? Edstat (talk) 03:14, 21 May 2010 (UTC)


 * I shall provide citations from Moore and McCabe about the imprudence of this test. Edstat's view is established among competent statisticians, and it's a waste of time to discuss the never-never land of what if we had a normal distribution. It just suffices to report verifiable information from the most reliable sources, giving due weight to differences of opinion among authorities. It's a waste of Edstat's time to argue further, since convincing Melcombe has never been a Wikipedia criterion. Kiefer.Wolfowitz (talk) 13:30, 21 May 2010 (UTC)
 * The introduction's bipartition of the statistical theory of the F-statistic into mathematical statistics and applied statistics is nonstandard. Who is responsible for this faulty, original research? Kiefer.Wolfowitz (talk) 13:34, 21 May 2010 (UTC)
 * One might ask who is responsible for Kiefer.Wolfowitz (talk)'s lack of understanding of WP:NOR, let alone WP:NPOV. The introduction is clearly not a "bipartition of the statistical theory": firstly it doesn't claim to separate "statistical theory" into two disjoint parts and secondly "applied statistics" is not part of "statistical theory" ... so a distinct lack of logic here. An article needs to be about the subject specified by the title: its content should certainly not be constrained deliberately to make it conflict with WP:NPOV, whereas we see from the discussion above that Kiefer.Wolfowitz (talk) created this page for that very purpose. Melcombe (talk) 13:57, 24 May 2010 (UTC)


 * Just stumbled across this - my question is what is being implied by WP:NOTAMANUAL? Melcombe wants, apparently, citations of when to or when not to use this test, which makes good sense to me, but then seems to be against such documentation, calling it WP:NOTAMANUAL, so I'm confused. I guess the bigger question is why is there so much talk about this page if it has been deleted? Might this mean the category should be called "Controversial Hypothesis Tests"? My belief is that there are a lot of controversial hypothesis tests that might be discussed in such a category, but of course that would require the specific circumstances when they are controversial. Finally, the idea that the test is valid when all goes well should be obvious - otherwise it wouldn't be a legitimate hypothesis test to begin with. 76.112.241.229 (talk) 20:28, 21 May 2010 (UTC)
 * I am not against "such documentation", just against making statements without documentation and without being clear about what is actually being said. I think that a bald statement "don't use this test, use that one" reeks of "just a manual". There is just slightly more than this, but not enough for a reader to know what these supposed simulation studies actually found, and not enough to contain reliable sources for recommendations about actual statistical practice, as opposed to the primary sources who would naturally talk up the importance of what they think they have found. Melcombe (talk) 13:57, 24 May 2010 (UTC)


 * REMARKABLE. Have you really convinced yourself that Wikipedia, which advertises itself as a "nonprofessional" encyclopedia, written by same, is somehow more of a scientific tone than "primary" sources that are peer-reviewed, double-blind, etc., and in which the latter would in any way permit "talking up the importance" of what gets published in the professional literature? Have you ever published anything in the professional literature? REALITY CHECK TIME! If I wasn't convinced Wikipedia is a (sometimes pleasant but nevertheless) waste of time before, I certainly am now! Professional encyclopedias rely on primary sources first, and supplement with secondary sources (tertiary: out of the question!). The reason Wikipedia relies on secondary and tertiary sources (read in here: find it on Google, because going to a university-grade library or subscribing to online journals is out of the question) has nothing to do with them being reliable, independent, or informed: it is because (1) it is hoped that what has become common knowledge is accurate, because same can only know common knowledge, and (2) the project is based on the modern journalism concept that if one can find two sources of hearsay it meets the legal test of disproving intent to defame or libel. Well, no sense in giving a snappy comeback, because to this page I won't return. Edstat (talk) 23:09, 25 May 2010 (UTC)


 * Please avoid the phrase "Wiki warriors". It may be helpful to appeal to an administrator or third opinion, if things cannot be resolved among you two. Thanks, Kiefer.Wolfowitz (talk) 10:58, 26 May 2010 (UTC)