Talk:Mann–Whitney U test

Ties
"All the formulae here are made more complicated in the presence of tied ranks, but if the number of these is small (and especially if there are no large tie bands) these can be ignored when doing calculations by hand. The computer statistical packages will use them as a matter of routine."

First, what does one do with ties? (A link is sufficient if it is described elsewhere.)

Second, what do computer statistical packages use routinely? The ignoring procedure, or the proper (undescribed) way to handle ties?

dfrankow (talk) 19:45, 29 December 2008 (UTC)


 * Yes, that was not well expressed. I have had a go at rephrasing it - does it make better sense now?  The actual formula in the case of ties would be a bit of a pig to enter in Wiki code and I will leave that job for someone more fluent in the coding than I am, though it's certainly true that for completeness we ought to have it here.  seglea (talk) 01:40, 30 December 2008 (UTC)
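For anyone who wants the tie handling spelled out in the meantime, here is a minimal Python sketch. It assumes the usual conventions: tied observations share the mean of the ranks they span (midranks), and the standard deviation used in the normal approximation is reduced by a term built from the tie-group sizes t_k. The function names are illustrative only and are not taken from the article or from any package.

```python
from collections import Counter
from math import sqrt

def midranks(values):
    """Return 1-based ranks, giving tied values the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def sigma_u(sample1, sample2):
    """SD of U under the null for the normal approximation, corrected for ties."""
    n1, n2 = len(sample1), len(sample2)
    n = n1 + n2
    tie_sizes = Counter(list(sample1) + list(sample2)).values()
    tie_term = sum(t**3 - t for t in tie_sizes) / (n * (n - 1))
    return sqrt(n1 * n2 / 12 * ((n + 1) - tie_term))
```

With no ties every t_k is 1, the tie term vanishes, and sigma_u reduces to the familiar sqrt(n1 n2 (n1 + n2 + 1) / 12).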

One-tailed versus two-tailed distributions
"Note that since U1 + U2 = n1 n2, the mean n1 n2/2 used in the normal approximation is the mean of the two values of U. Therefore, you can use U and get the same result, the only difference being between a left-tailed test and a right-tailed test.

Huh? Perhaps it would be clearer to say which value goes with a left-tailed and which with a right-tailed test. dfrankow (talk) 19:45, 29 December 2008 (UTC)
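One way to see which value goes with which tail is to compute both z-scores. The sketch below assumes the plain normal approximation (no ties, no continuity correction); the numbers n1 = n2 = 6 and U1 = 11 are illustrative choices of mine, not values fixed by the quoted passage.

```python
from math import erfc, sqrt

def z_score(u, n1, n2):
    """Normal-approximation z for a U value (no ties, no continuity correction)."""
    mean = n1 * n2 / 2
    sd = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (u - mean) / sd

def upper_tail_p(z):
    """P(Z >= z) for a standard normal Z."""
    return 0.5 * erfc(z / sqrt(2))

n1, n2 = 6, 6              # illustrative sample sizes (assumption)
u1 = 11                    # illustrative U for sample 1 (assumption)
u2 = n1 * n2 - u1          # follows from U1 + U2 = n1 * n2

z1, z2 = z_score(u1, n1, n2), z_score(u2, n1, n2)

# z1 and z2 are equal in size and opposite in sign, so the lower tail
# for U1 is the same event as the upper tail for U2.
p_lower_u1 = upper_tail_p(-z1)
p_upper_u2 = upper_tail_p(z2)
p_two_sided = 2 * min(upper_tail_p(z1), upper_tail_p(-z1))
```

So the smaller of the two U values sits in the left tail and the larger in the right tail; a two-sided p-value is the same whichever one is used.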

This test can be reduced to one formula with one reference table
Really, someone should look into that... I don't have the time. COYW (talk) 22:26, 11 May 2015 (UTC)

Assumptions
In the general formulation, the null hypothesis of the Mann-Whitney U-test is not about the equality of distributions. It is about the symmetry between two populations with respect to the probability of obtaining a larger observation. Of course, two identical distributions possess the property of symmetry but two different distributions (for example, 2 normals with the same mean but different variances) can also be perfectly symmetric with respect to the probability of obtaining a larger observation.

The whole issue of correctly formulating a null hypothesis is very important for consideration of the power of the test. Consider again 2 normal distributions with the same mean and different variances. If the null hypothesis is defined as the equality of 2 distributions, we are likely to fail to reject the null hypothesis even though we know a priori that it is not true. Only for 2 distributions with a similar variance but different means (more specifically, 2 distributions with a location shift) will we have a fair chance (i.e. good power) of rejecting the null hypothesis. So such a formulation of the null hypothesis severely restricts the applicability of the test.

However, if we define the null hypothesis as a hypothesis of symmetry with respect to obtaining a larger observation then everything works perfectly and the power of the test does not depend on diverging variances. Indeed, the inability to reject the null hypothesis for 2 normals with the same means but different variances is not a failure of the test because we know a priori that the null hypothesis is satisfied in this case.

-- —Preceding unsigned comment added by Marenty (talk • contribs) 02:19, 29 July 2010 (UTC)

The article gives a piece of misleading information on the assumptions of the Mann-Whitney U test. It says:

"In a less general formulation, the Wilcoxon-Mann-Whitney two-sample test may be thought of as testing the null hypothesis that the probability of an observation from one population exceeding an observation from the second population is 0.5. This formulation requires the additional assumption that the distributions of the two populations are identical except for possibly a shift (i.e. f1(x) = f2(x + δ) )"

Testing the alternative hypothesis P(A>B) > 0.5 (where A is from population 1 and B is from population 2) does not require the restricting assumption that both distributions are equal except for a shift in location! How come? What is the basis for this statement? The test statistic in the U-test is just the proportion of pairs, one observation from population 1 and one from population 2, in which the first exceeds the second. The distribution of this test statistic can perhaps most easily be calculated theoretically for the special case of shifted distributions, but that does not restrict the use of the test and has nothing to do with test assumptions! —Preceding unsigned comment added by Marenty (talk • contribs) 22:06, 24 May 2008 (UTC)

Assumptions seem necessary: in the unequal-variance case, p-values under the null hypothesis are not uniformly distributed (I used two normals with the same mean and different variances). 66.218.169.47 (talk) 23:55, 13 February 2009 (UTC)
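This kind of check is easy to repeat. The following is only a sketch under assumptions of my own choosing: sample sizes 5 and 20, standard deviations 10 and 1, a fixed seed, and the two-sided normal approximation at the 5% level rather than exact tables. It uses the standard library only.

```python
import random
from math import sqrt

def u_statistic(xs, ys):
    """Count pairs with x > y, scoring ties as 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in xs for y in ys)

def rejection_rate(n1, n2, sd1, sd2, n_sims=2000, z_crit=1.96, seed=0):
    """Share of simulated samples in which the two-sided normal-approximation test rejects."""
    rng = random.Random(seed)
    mean = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    rejections = 0
    for _ in range(n_sims):
        # Both samples have mean 0, so P(X > Y) = 0.5 in every configuration.
        xs = [rng.gauss(0.0, sd1) for _ in range(n1)]
        ys = [rng.gauss(0.0, sd2) for _ in range(n2)]
        if abs(u_statistic(xs, ys) - mean) / sd_u > z_crit:
            rejections += 1
    return rejections / n_sims

equal_var = rejection_rate(5, 20, 1.0, 1.0)     # identical distributions
unequal_var = rejection_rate(5, 20, 10.0, 1.0)  # same mean, very different spread
```

With identical distributions the rejection rate should stay near the nominal 5%. In the unequal-variance configuration, where the small sample has the large spread, I would expect it to come out well above that, which is the behaviour described in the comment above.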

Can anyone typeset the formulae better? I am not familiar with TeX. seglea 05:39, 17 Jan 2004 (UTC)

Assumptions
The hypothesis stated in this article refers both to the testing of equality of central tendency, and equality of distribution. The central tendency hypothesis requires the additional assumption that the distributions of the two samples are the same except for a shift (i.e. f1(X) = f2(X+delta)). The test can also be described as a general test of equality of distribution (H0: f1=f2). In this case the shift alternative is not required; however, the test is used most often as a test of central tendency, so the original formulation (with the addition of the shift assumption) is most appropriate. I have added this assumption to the main page. —Preceding unsigned comment added by 132.239.102.171 (talk) 00:20, 30 January 2008 (UTC)

"The hypothesis stated in this article refers both to the testing of equality of central tendency, and equality of distribution"

Really, this test is precisely only for testing stochastic dominance of two variables A and B, that is, of Prob(A>B) > Prob(B>A). In other words, it tests whether a randomly chosen sample from A is expected to be greater than a sample from B. Look at the test statistic: it is a function of the proportion of pairs A>B where A is from the 1st distribution and B is from the 2nd distribution. For testing the stochastic dominance, no additional assumptions are needed (besides the assumption that the underlying distribution is ordinal).

It is incorrect to use the MW U-test for general testing of "equality of two distributions" as asserted. Two normal distributions A and B with the same mean and different variances are different distributions, yet the test will (incorrectly) never reject the null hypothesis if we are testing for equality of the distributions. But if we are testing for stochastic dominance instead, then the test (correctly) does not reject the null hypothesis.

On the other hand, "central tendency" is a nebulous concept, but in reality testing "equality of central tendency" with the U-test will be nothing more than testing of stochastic dominance. If we want to use the test for detection of a "shift", then we do need to add an assumption about distributions A and B having the same shapes. But this additional (and unnecessary) assumption follows from the very definition of the "shift" rather than from intrinsic requirements of the U test.

In summary, the test should be used in general for testing that Prob(A>B)>0.5, and as such has only one assumption that the samples are comparable (i.e. ordinal).

74.0.49.2 (talk) 01:51, 8 June 2009 (UTC)

P value
I believe that one cannot interpret results from this test without understanding the P-value. As I understand it, the smaller the P value, the more different the two populations are. What I would like to know is whether there is a critical value, as there is with a t-test. Thanks ADS

Inexact explanation of what this test should be used for
A significant MW test does not necessarily imply that the distributions have different medians. This is a common misconception. It is most powerful for detecting a difference in medians, which is why this is commonly misstated. The MW tests whether the samples were taken from different distributions.
 * This issue seems to have been elaborated in the section "Illustration of object of test". Let me first analyze what is remarked above. What is probably meant is 'a significant outcome' of an MW test. And then: a significant outcome never implies anything about any distribution. It merely points to some property assumed beforehand. The MW test is sensitive to shifted support of the distributions. In the mentioned section a (strange) example is given, but an even stranger remark is made about the "medians". The medians in the sample are merely estimates of the population medians, and the test does not rely on the sample medians. So I do not understand what is meant by this section. Madyno (talk) 09:54, 3 November 2020 (UTC)

calculation of mu in the normal approx
Is this right? I would have thought it should be symmetrical in n1 and n2...

161.130.68.84 21:11, 28 December 2006 (UTC)JWD

I'm confused about "Therefore U = 32 − (6×7)/2 = 32 − 21 = 11 (same as method one)." What is 6x7? Needs clarification. — Preceding unsigned comment added by 12.199.98.26 (talk) 18:07, 10 January 2019 (UTC)
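On the 6×7 question: assuming the 6 is the size of the sample whose ranks add up to 32, then (6×7)/2 is n(n+1)/2, the smallest rank sum that sample could possibly have (ranks 1 through 6). Subtracting it turns the rank sum into a count that starts at zero. A small sketch, with a function name that is mine:

```python
def u_from_rank_sum(rank_sum, n):
    """U for a sample of size n whose 1-based ranks add up to rank_sum."""
    min_rank_sum = n * (n + 1) // 2   # 1 + 2 + ... + n; for n = 6 this is 21
    return rank_sum - min_rank_sum

u = u_from_rank_sum(32, 6)   # 32 - 21 = 11
```

A sample that took the six lowest ranks would have rank sum 21 and therefore U = 0.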

Introduction requires clarity
These two statements are not equivalent:
 * It requires the two samples to be independent, and the observations to be ordinal or continuous measurements
 * i.e. one can at least say, of any two observations, which is the greater.

Link to rank article needs to be more specific
The 'rank' link currently points to a disambiguation page which doesn't include an article explaining what the rank of a sample is. Tim (talk) 02:31, 3 June 2008 (UTC)

Distribution of U-statistic and Table of Values
The table of values link (pdf) is broken, I am changing the address to a different document. I have been unable to find on the web some kind of explanation of the distribution of the U-statistic. This article could use at least some explanation of how the statistic is distributed, and optimally a formula or plot, if possible. I'll keep working, but if someone's got it on hand, that would be great. Lovewarcoffee (talk) 19:57, 4 August 2008 (UTC)


 * Agreed. The earlier formula for S(X,Y) indicates that U should always be positive, but later it states that U is 'normally distributed' and can take on negative values (yet still somehow retaining its equivalence to the area under the ROC????).
 * Part of the confusion may stem from the fact that the Wilcoxon statistic is not the same as the Wilcoxon statistic. I'll try to fix this.
 * The contradictions in the current article mean it is not helpful to anyone who wants to use this test. 24.127.190.251 (talk) 13:47, 24 July 2023 (UTC)
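On how U is distributed: when there are no ties, the exact null distribution can be built by counting orderings with a simple recurrence on the largest observation. The sketch below assumes U counts the pairs in which the sample-1 value (X) exceeds the sample-2 value (Y); the function names are illustrative.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def count(u, n1, n2):
    """Number of orderings of n1 X's and n2 Y's (no ties) with exactly u pairs X > Y."""
    if u < 0:
        return 0
    if n1 == 0 or n2 == 0:
        return 1 if u == 0 else 0
    # The largest observation is either an X (it beats all n2 Y's)
    # or a Y (it adds nothing to U).
    return count(u - n2, n1 - 1, n2) + count(u, n1, n2 - 1)

def null_pmf(n1, n2):
    """Exact null probabilities P(U = u) for u = 0 .. n1 * n2."""
    total = comb(n1 + n2, n1)   # all orderings are equally likely under the null
    return [count(u, n1, n2) / total for u in range(n1 * n2 + 1)]

pmf = null_pmf(2, 2)   # counts 1, 1, 2, 1, 1 out of 6 orderings
```

This makes the point raised above concrete: U only takes values from 0 to n1 n2, so it is never negative. The normal distribution is an approximation to this discrete, symmetric distribution centred on n1 n2 / 2.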

Just what is it that we are talking about?
The article starts:


 * In statistics, the Mann-Whitney U test (also called the Mann-Whitney-Wilcoxon (MWW), Wilcoxon rank-sum test, or Wilcoxon-Mann-Whitney test) is. . ..

Thereafter it talks of "MWW". "MWW" strikes me as an odd abbreviation for "Mann-Whitney U test." If this article is correctly titled, I suggest that the test should be abbreviated as "MW."

David J. Sheskin devotes pp 513–75 of Handbook of Parametric and Nonparametric Statistical Procedures, 4th ed. (Boca Raton: Chapman & Hall, 2007) to this one test, which he calls the "Mann–Whitney U" test. (If you're a purist, note the dash: it's not one statistician with a double-barreled name, but two separate people, Mann and Whitney.) He writes at the start:


 * Two versions of the test to be described under the label of the Mann–Whitney U test were independently developed by Mann and Whitney (1947) and Wilcoxon (1949). The version to be described here is commonly identified as the Mann–Whitney U test while the version developed by Wilcoxon (1949) is usually referred to as the Wilcoxon–Mann–Whitney test. Although they employ different equations and different tables, the two versions of the test yield comparable results. (513)

(Unfortunately even Sheskin's 1700+ pages don't include any further coverage of [what he calls] the Wilcoxon–Mann–Whitney test.)

And Sheskin adds in an endnote:


 * The test to be described in this chapter is also referred to as the Wilcoxon rank-sum test and the Mann–Whitney–Wilcoxon test. . . . (569)

This of course doesn't agree with what's written in this Wikipedia article. To follow Sheskin, it would instead say something like:


 * In statistics, the Mann-Whitney U test (also called the Mann-Whitney-Wilcoxon (MWW) or Wilcoxon rank-sum test) is. . . . (Although very similar, the "Wilcoxon-Mann-Whitney" test is different.) . ..

What's the authority for what the article now says? Tama1988 (talk) 09:49, 17 November 2008 (UTC)


 * The Mann-Whitney U test and Wilcoxon two sample test were developed independently, but provide an identical test statistic. Both Snedecor and Cochran and Sokal and Rohlf retain the distinction and neither concatenates the names. Regards, G716 (T·C) 20:58, 18 November 2008 (UTC)


 * So how about:
 * In statistics, the Mann-Whitney U test (also called the Mann-Whitney-Wilcoxon (MWW), Wilcoxon two sample test, or Wilcoxon rank-sum test) is. . . . (Although very similar, the "Wilcoxon-Mann-Whitney" test is different.) . ..
 * ? Or are you saying that when Sheskin talks of "comparable results" he means "the same results" and that Wilcoxon's 1949 test (the "Wilcoxon-Mann-Whitney test") is the same as Mann-Whitney? Tama1988 (talk) 09:14, 19 November 2008 (UTC)

Wilcoxon rank test is more widely known, so it should be the one described. Or it merits its own entry. — Preceding unsigned comment added by 2601:4C1:4180:12D0:35E8:A608:9ED6:3D81 (talk) 03:02, 3 December 2019 (UTC)

Receiver operating characteristic
I have removed the following from the section on Herrnstein's rho. As it stands, it does not make sense. It may well be true, but if so it needs a lot more explanation.
 * "ρ is also known as the area under the receiver operating characteristic (ROC) curve."

seglea (talk) 23:14, 12 January 2009 (UTC)

This is well known so I'm unclear on why this was removed. See

Hanley, J. A. and McNeil, B. J. (1982). "The meaning and use of the area under a receiver operating characteristic (ROC) curve". Radiology 143: 29-36. Harrelfe (talk) 14:32, 14 February 2009 (UTC)

I just want to add a little comment here on the rho statistic discussed above. It is a bit unclear: one section says that the AUC is directly related to the U statistic, giving the equation AUC = U/(n1 n2), which is clear and good, and then the section on rho below gives exactly the same formula as rho = U/(n1 n2) and informs the reader about "this commonly used test statistic...". Would it not be better to integrate these two sections? Also, rho is such a common symbol in math/stats that it would be nice to call it Herrnstein's rho, to make it clear that he was the first to call this statistic by this name. I don't mind that in this case the AUC is called by something else, since usually people are interested in AUC as a tool in classification of binary outcomes at different probability levels, where having AUC values close to one indicates a "good" score for a classifier, and scores approaching 0.5 are considered valueless as classifiers, since they are no better than random assignment. In the case of rho, however, where the test is coming from a comparison of two populations, there is not this judgement about values close to 1 being "better". ~Frieda
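To make the AUC relation concrete, here is a minimal sketch that computes U by direct pairwise comparison and divides by the number of pairs. The scores are made-up values for a hypothetical classifier, and the function name is mine.

```python
def auc_from_u(positives, negatives):
    """AUC as U / (n1 * n2), with U counting (positive, negative) pairs ranked correctly."""
    u = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return u / (len(positives) * len(negatives))

scores_pos = [0.9, 0.8, 0.4]   # hypothetical scores for the positive class
scores_neg = [0.7, 0.3]        # hypothetical scores for the negative class
auc = auc_from_u(scores_pos, scores_neg)   # 5 of the 6 pairs are correctly ordered
```

Read this way, rho and AUC are the same quantity: the proportion of cross-sample pairs in which the observation from the first group is the larger one.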

Assumptions and Formalization of Hypotheses
I will shortly change the "Formal statement of object of test" and "Assumptions" sections and put them into one section. There were several errors.

1. Previously the article stated that the MWW test does not test for differences in medians. But if you make the location shift assumptions, then it does in fact strictly test for the differences in medians.

2. Previously the article stated that one proper formulation for the MWW test is to have the null hypothesis be that P(X>Y)=0.5. In fact, this is not true. If you have two normal distributions with the same mean but different variances then the MWW test is no longer valid under that null even though P(X>Y)=.5 (see Pratt, 1964, Journal of the American Statistical Association, 665-680).

3. Previously the article stated "Without making such a strong assumption [about the location shift] (and verifying its validity) it is incorrect to use the MWW test as a test for shift in location." In fact, although we can invalidate the location shift assumption, we cannot verify that assumption with a finite amount of data. (Testing and finding no significant violation of an assumption is not the same as verifying that assumption. You might have failed to find significance simply because of a small sample size.) In statistics we make assumptions all the time, so saying it is incorrect to use an assumption without verifying it seems contrary to the practice of statistics, although it may be a good idea to check the assumption if you can.

4. The following statement was made: "the Mann-Whitney U test is valid for testing of stochastic dominance under very broad conditions, without making any additional assumptions, including any additional assumptions about variances of the two samples". This is not correct, see Pratt, 1964 referenced above. The paragraph following was mostly redundant, so I deleted it. Mpf3205 (talk) 05:18, 7 March 2010 (UTC)

Mann-Whitney vs. Wilcoxon
I spent a great deal of time puzzling over the table of critical values in the second external link (www.stat.auckland.ac.nz), wondering why it didn't match up with the first link, and more importantly, why some of the values appeared to be theoretically impossible (e.g. greater than 100 for a 10*10 test). After careful reading, I realized that the test statistic in the link was calculated differently than the one in the article. (i.e. a straight R1 sum of ranks versus the R1 - n1(n1+1)/2).

I have no external experience with this, but the best I can tell from the external link, while the Mann-Whitney and Wilcoxon tests are equivalent, they are not identical, in that the numeric form of the statistic differs. Whereas the Mann-Whitney includes the n1(n1+1)/2 adjustment, the Wilcoxon is a straight sum of ranks. While not changing the application or conclusions of the test, this is crucial to know when looking at critical value tables, as what works for one won't work for the other.

I altered the text for the external link so hopefully others will not be as confused, but could someone who has a better understanding of the history and situation add a clarification about the different functional forms to the article? (If you would add info about why they're equivalent, and why one form might be preferred over the other, so much the better.)

P.S. While you're at it, a discussion of how to treat identical-valued items in calculating the test statistic would also be appreciated. The Auckland link discusses it for the straight Wilcoxon sum of ranks, but I'm still not sure how they are accounted for in the Mann-Whitney statistic. -- 140.142.20.229 (talk) 22:30, 8 March 2010 (UTC)


 * Just thought I'd point to a reference which touches on the difference: Journal of the American Statistical Association, Vol. 59, No. 307 (Sep., 1964), pp. 925-934  -- 140.142.20.229 (talk) 22:39, 8 March 2010 (UTC)


 * Different implementations are using differently defined test statistics! There definitely needs to be a section that makes this clear. It would save a lot of wasted time. I have been using some R implementations, the standard one in 'stats' called via wilcox.test and the advanced version in the 'coin' package invoked via wilcox_test. The test statistic reported from wilcox_test (coin package) is equal to the sum of ranks R1 in the wikipedia example [in the R idiom, this is accessed via 'statistic(out, "linear")' where "out" is the output of wilcox_test; without the "linear" option, you'll get a Z score]. The test statistic reported from wilcox.test (stats package) is equal to the sum of ranks R1 minus a factor that corresponds to n1(n1+1)/2 in the wikipedia notation [in the R idiom, this is invoked via out$statistic, where "out" is the output of wilcox.test]. Neither of these implementations computes the "U" statistic as defined in the wikipedia article. Dabs (talk) 19:34, 4 December 2014 (UTC)
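The puzzle about table values above 100 for a 10×10 test falls out of the arithmetic. A short sketch of the two scales, assuming 1-based ranks and no ties (function names are mine):

```python
def rank_sum_range(n1, n2):
    """Smallest and largest possible rank sum W for the sample of size n1."""
    low = n1 * (n1 + 1) // 2      # sample holds ranks 1 .. n1
    return low, low + n1 * n2     # sample holds ranks n2+1 .. n1+n2

def rank_sum_to_u(w, n1):
    """Convert a rank sum W (Wilcoxon form) to U (Mann-Whitney form)."""
    return w - n1 * (n1 + 1) // 2

print(rank_sum_range(10, 10))   # (55, 155)
print(rank_sum_to_u(155, 10))   # 100
```

So a rank-sum table for two samples of 10 runs from 55 to 155, while the U scale runs from 0 to 100. A critical value read from one kind of table has to be shifted by n1(n1+1)/2 before it can be compared with a statistic on the other scale.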

What do you need to assume under the null hypothesis?
Under the section that describes the assumptions, I had previously added that you need to have both distributions be equal under the null. That was deleted. But I assert that you need that assumption. If you state the null hypothesis as only needing that Pr[X>Y]+ .5 Pr[X=Y] = .5, that does not give sufficient conditions for validity. Here is a counter example (see Pratt, 1964, JASA, cited in my previous notes above): if you have two normal distributions with the same mean but different variances then Pr[X>Y]+ .5 Pr[X=Y] = .5, but your type I error can be inflated (i.e., the test can reject the null hypothesis more often than the nominal significance level).

The paragraph that I deleted also had that same mistaken idea. —Preceding unsigned comment added by Mpf3205 (talk • contribs) 05:31, 27 August 2010 (UTC)

Un-clarity in terms : ranks or observations ? lower rank or smaller ?
"Taking each observation in sample 1, count the number of observations in sample 2 that are smaller than it (count a half for any that are equal to it)."

If I understand correctly the test does not require to compare observed values but just the ranks, so this sentence should be changed using ranks.

"Choose the sample for which the ranks seem to be smaller" "count the number of hares it is beaten by (lower rank)"

is 1st the lowest rank ? For me it is the opposite. The word smaller seems less ambiguous to me.

"Arrange all the observations into a single ranked series. That is, rank all the observations without regard to which sample they are in."

This has to be done for small samples or big samples, so this sentence should precede.

I will do the modifications I propose, it is probably better so you see what I mean.

Arnaud —Preceding unsigned comment added by 132.183.93.37 (talk) 16:10, 16 November 2010 (UTC)


 * This does not seem to have gotten fixed. I just fixed it. There are two issues here. First, the explanation as given was incorrect, as implied above. We want to count the *wins*, and counting the wins for sample 1 means counting the observations in sample 2 that are *larger* (and counting 0.5 for ties). The second issue is about whether or not ranks are needed. The answer is no, they are not needed with the simple method introduced in the "calculations" section. It is possible to calculate U correctly without assigning numeric ranks. Just do each pairwise comparison, and count the wins. Here is a numeric example: A = (1, 3) and B = (2, 3, 7). The pairwise wins for A are (3, 1.5), i.e., 1 beats 2, 3 and 7, and 3 beats 7 and ties with 3. This means that U equals 4.5. For B, the number of wins is only 1.5, because 2 beats 3, and 3 ties with 3. The sum of these two U values is 6, which is the product of 2 and 3, as expected. If you go through the second method of calculation, the sums of ranks are 1 + 3.5 = 4.5 for A, and 2 + 3.5 + 5 = 10.5 for B (this particular numeric example isn't ideal because in this case r1 just happens to equal 4.5, same as u1). Given that n1=2 and n2=3, the formulas for U will give you exactly the same result as for the simple method. So, to reiterate, we do not need to assign numeric ranks to do the simple method; we only need to make binary comparisons. Dabs (talk) 21:53, 8 December 2014 (UTC)
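The numbers in this example can be checked with a few lines of Python. This is a sketch under two assumptions: the smaller value wins (as in a race), and ranks are assigned in ascending order with midranks for ties. The helper names are illustrative.

```python
def wins(sample, other):
    """Pairwise count where the smaller value wins (as in a race); ties score 0.5."""
    return sum(1.0 if s < o else 0.5 if s == o else 0.0 for s in sample for o in other)

def rank_sum(sample, other):
    """Sum of the sample's 1-based midranks within the pooled data."""
    pooled = list(sample) + list(other)
    total = 0.0
    for s in sample:
        below = sum(1 for v in pooled if v < s)
        equal = sum(1 for v in pooled if v == s)
        total += below + (equal + 1) / 2   # midrank of s in the pooled data
    return total

a, b = [1, 3], [2, 3, 7]
u_a, u_b = wins(a, b), wins(b, a)          # 4.5 and 1.5
r_a, r_b = rank_sum(a, b), rank_sum(b, a)  # 4.5 and 10.5

# With ascending ranks, R - n(n+1)/2 counts the pairs in which the sample's
# value is the *larger* one, so it reproduces the other sample's win count.
assert r_a - len(a) * (len(a) + 1) / 2 == u_b
assert r_b - len(b) * (len(b) + 1) / 2 == u_a
```

Under these conventions the two methods produce the same pair of numbers, 4.5 and 1.5, which add up to n1 n2 = 6. Which of the two gets attributed to A depends on whether a small value or a large value counts as the win, so it is worth stating that convention next to any formula.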

Where is the non-technical summary?
I'm sorry, but WikiPedia is used by non-experts to gain an understanding of something they may have run across in a technical or semi-technical setting. As such, one of the joys of reading many WP articles is that someone has taken the time to explain, in layman's terms, just what exactly is covered in the topic. That is NOT the case here. I think it's great that so many of you can chime in here as "experts", able to contribute because you have the required background.

But when the first sentence of the article uses a phrase "have equally large values" -- it just doesn't make sense. On the face of it, why should it be difficult to determine whether two "samples" have "equally large values"? Doesn't that simply mean looking at the largest value in each sample and seeing if they are the same? Clearly not, which is precisely why a layman's version, at least in the first paragraph, should be offered.

I hope someone is willing to stoop to the level of the non-cognoscente and explain what the heck this is all about. I find that in statistics, almost more than in any other discipline, practitioners are unwilling to translate their statements into simple real-world examples and plain speaking. I often wonder if that's because they're afraid, in some way, that someone will claim the emperor has no clothes? -roricka 1/1/11 — Preceding unsigned comment added by Roricka (talk • contribs) 22:29, 1 January 2011 (UTC)


 * Hi Roricka. Reading your comment, I truly would like to help out but am not sure how. You wrote how the first sentence doesn't make sense.  Well, indeed it doesn't make sense to anyone not familiar with the most basic notions of probability and statistics.  However, you can not put them into the first sentence since it would require explaining what a random variable is and how that relates to statistical tests and hypothesis testing (notice how proper wikilinks are present in the first sentence).
 * Also, IMHO, I don't think all wikipedia articles can be formulated as independent modules of knowledge, easily understood without context. This particular article is a good example of that.
 * If, after reading more, you'd gain an insight as to how to make this article clearer - I'll be most interested to see how.
 * Yours,
 * Talgalili (talk) 10:32, 2 January 2011 (UTC)

I read the Spearman coefficient page and understood it immediately. This page I just found incomprehensible in comparison. I have to echo what the OP said. I suggest using the Spearman page as an example of "how to do it right" maybe? Especially the images, which were great! — Preceding unsigned comment added by 84.92.230.173 (talk) 14:27, 6 June 2011 (UTC)

Would someone please clarify how the name of the test is pronounced, for those who aren't used to seeing the symbol? radcen (talk) 20:36, 11 December 2015 (UTC)

Separate pages for Mann-Whitney U test and Wilcoxon rank-sum test?
Although the Mann-Whitney U test and the Wilcoxon rank-sum test are equivalent, they are two different tests, as pointed out by 140.142.20.229. In the current version, Mann–Whitney U is described, while the Wilcoxon rank-sum test is not. Since they are two different tests, shouldn't we create a new page for the Wilcoxon rank-sum test, containing the description of this method, and then say that the Mann-Whitney U test and the Wilcoxon rank-sum test are equivalent? --Gorif (talk) 23:42, 12 February 2012 (UTC)

Do Ranks start at 1 or at 0
The Wikipedia article says that ranks start at 0, because you are considering how many hares the tortoise beats, which could be 0. But everything else I've seen about Wilcoxon test says the ranks start at 1. If one has tables for looking up the implication of Wilcoxon rank, it is rather important whether ranks are from 0 or 1. Further the observation that U1+U2=n1*n2 only holds if ranks start at 0, not 1.

This issue needs clarification!!

Ian Davis

textserver.com@gmail.com — Preceding unsigned comment added by 209.183.141.235 (talk) 01:52, 2 December 2014 (UTC)


 * Ranks begin at 1, otherwise the formulas will be off. However, this is *not* contradicted by the tortoises and hares example, because it does not use ranks.  This is explained in the beginning of the "calculations" section-- "method 1" is simply to count the pairwise wins.  This calculation does not require assigning numeric ranks, and it gives the correct value. Dabs (talk) 21:52, 8 December 2014 (UTC)


 * To follow up: It is not true that U_1 + U_2 = n_1 * n_2 only holds if ranks start at 0. Since ranks start at 1, the case of n_1 = n_2 = 1 gives U_1 and U_2 being 1-1=0 and 2-1=1, which sum to n_1 * n_2 = 1. LachlanA (talk) 01:29, 5 August 2016 (UTC)

Is the t-Test Valid for Non-Normal Distributions?
The introduction says the Wilcoxon is more efficient than the t-test when the distribution is non-normal. That's misleading. The t-test is not just inefficient (is that term meaningful for a test as opposed to an estimator?), but rather fails to be valid if the distribution is not normal. It's a parametric test. See https://en.wikipedia.org/wiki/Student's_t-test Erasmuse~enwiki (talk) 12:22, 22 June 2015 (UTC)

Can a two-tailed test show the sign of the difference?
The example of the hare and the tortoise says
 * significant evidence that hares tend to have lower completion times than tortoises (p < 0.05, two-tailed)

My understanding of a two-tailed test is that the alternative hypothesis is simply that the distributions are unequal, rather than one dominates the other. To test whether hares have lower completion times, don't we need a one-sided test? In particular, being significant on a two-sided test doesn't AFAICT show that the sample with the lower mean was drawn from a distribution with a lower mean with the same significance. (If it did, why would anyone use one-sided tests?) LachlanA (talk) 01:20, 5 August 2016 (UTC)

Paragraph about Wilcoxon signed-rank test
There's been some edit activity re: the paragraph where this test is compared to a Wilcoxon signed-rank test (1, 2, 3, and mine). I think I have clarified it but I don't want to start an edit war, so: what do you think of the current phrasing? —Cousteau (talk) 13:34, 21 July 2017 (UTC)
 * Thank you for this. I did intervene initially because I was worried that it looked as if had got it wrong, identifying something as repetition (and removing it) when it was not. Then, however, I thought that I had myself perhaps been somewhere along the hasty<-->ignorant spectrum in making this edit (and in my edit on the user's Talk page) so I undid it all. Because I have no understanding of this topic area I think it was unwise editing from me, and I am withdrawing from further involvement, leaving it to people who actually know what they are doing. So: apologies, thanks and best wishes DBaK (talk) 20:03, 23 July 2017 (UTC)

Clarifying needed: a simple shift interpretation
In the 'Relation to other tests' under alternative it says: "If one desires a simple shift interpretation..." What is meant by simple shift interpretation? This seems extremely vague. shift of what? — Preceding unsigned comment added by 70.187.129.94 (talk) 20:45, 14 July 2018 (UTC)

Comparing MW to t-test
Hi User:So_many_suspicious_toenails, I am happy to see all the changes you're making to this article. Several things to talk about:

1. You keep referencing how the t-test vs MW deal with different H0. What is worth mentioning though is that both of their H0 will hold if the two groups are of the same distribution and that being the normal distribution. Hence, you can use both tests to check if H0: group A has the same distribution as group B (and all are i.i.d normal). Now, the interesting point would be how powerful each of the tests would be for which type of deviation. So I do think they can be compared for specific Nulls.

2. The t-test does assume the underlying observations come from a normal distribution, this is essential in order to make sure the estimation of S^2 comes from a chi-square distribution. It is true that t-test is robust to various types of deviations. And that it could still be valid (in the sense that the type I error would be less than or equal to alpha), under these deviations. But in order to be an exact valid test, as far as I know, the data should be normal.

What do you think of what I wrote, and of how they impact the text?

Tal Galili (talk) 09:45, 29 July 2019 (UTC)

Hi Tal Galili,

1. The null tested by the t-test is in general equality of means. As you indicate (at least if I'm interpreting your comments correctly), if we further assume that observations in both groups are normally distributed with the same variance, then the null is indeed of equality in distribution. However, it is exceedingly rare to know a priori that data are normally distributed (unless the data are synthetic), and under more realistic assumptions (e.g., observations in group 1 ~ F and observations in group 2 ~ G, with F and G unknown a priori but assumed to have some number of finite moments), M-W and the t-test do not test the same nulls against the same alternatives. The Lumley et al. (2002) article includes a very nice discussion of choosing between these (and other) tests/methods.

2. I really can't recommend the Lumley et al. (2002) article enough, to be honest. It is true that the test statistic must be exactly t-distributed in order for the t-test to be exact (your conditions are sufficient, but I suspect not necessary). However, the t-test is extremely/primarily useful as an asymptotic test since we almost never know that data are truly normally distributed and standard theory for inference in this case does not apply if we condition on a test for normality.

In short, the t-test is asymptotically valid, and this constitutes its primary usefulness. I think putting a lot of emphasis on exactness is misleading, as normality is in fact not an important assumption, practically speaking, for the validity of this test. It would be too bad if readers came away from this page thinking it was only appropriate to use the t-test when data were known to be normally distributed (or worse, when they passed a test for normality).

Is that clear? I agree with you about conditions for exactness, but I suspect we may have different thoughts about how important exactness is practically speaking.

So many suspicious toenails (talk) 18:11, 29 July 2019 (UTC)


 * Hey again,
 * Having the t-test be valid (i.e., keeping the type I error rate) depends on the variance estimate in the denominator following a (scaled) chi-square distribution, which holds if the observations are normally distributed (you could also think of cases where that would happen when the data are not normal, but I have not come across such uses in practice). It is true that for large samples this quickly stops mattering, since the t-test becomes an asymptotic z-test (specifically, a Wald test). However, the reason we care about the t-test is often for relatively small sample sizes (say, n<30). In these cases, if the distribution of the data is not normal (and, say, very right-skewed), this leads to a very wrong test (i.e., the type I error can inflate, especially if the test is one-sided). From my personal experience, this is a common use of the t-test (for small sample sizes). And for such cases, knowing that normality is a strong assumption for validity is important (in contrast, using MW with exact tables is more lenient in its assumptions). Hence, I think we generally have very similar opinions, but I think you understate the situation in which researchers use the t-test for small samples while needing to assume normality (even though they may know it is a wrong assumption) and get invalid results (where using MW would have worked). And I think this should be better reflected in the article. I'll get to doing such an edit at some point in the future, but before doing so, I wanted to make sure we are aligned about what I wrote. What do you think?
 * (p.s.: One small point - t-test, when referring to the exact version, assumes equality of distributions for both groups. It is best powered for a difference in means. It could have also been used for difference in medians or the 75% quantile, or whatever. It just would have less power to detect such a case)
 * Tal Galili (talk) 13:35, 31 July 2019 (UTC)


 * Hi again, again:


 * I agree that under assumptions of homoscedasticity and normality, testing a difference of means is equivalent to testing a difference of an arbitrary location measure (because the normal distribution is completely determined by its first two moments). However, this seems, as you suggest, to be a minor point, since data are rarely if ever truly normally distributed (to say nothing of normally distributed with common variance).


 * I'm totally pro-Mann-Whitney if the target of scientific interest is P(X > Y) + 0.5 P(X = Y) for X and Y random draws from two distributions of interest – this is what MW tests under the loosest assumptions. To interpret MW as a test of location, however, we need more assumptions, which may be violated. Violations of these assumptions have been shown to lead to inflated type 1 error relative to the t-test in some (imho realistic) situations (i.e., non-normal data and heteroscedasticity).


 * I agree that using a t-test in situations where the t-test is likely to have poor performance in terms of frequentist criteria is a bad idea. If what you care about is a difference of means, however, MW does not provide a robust alternative, since MW may actually have worse performance if interpreted as a test of location.


 * In short, I'm totally with you on cautionary notes about using t-tests when they are unlikely to perform well. In these situations, MW is completely reasonable if the probability that a randomly drawn observation from one population is larger than a randomly drawn observation from the other is of interest. I don't, however, think the evidence supports MW as an adequate substitute for the t-test as a test of location.


 * So many suspicious toenails (talk) 20:49, 6 August 2019 (UTC)
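
A toy illustration of the distinction discussed above, offered only as a sketch: the two samples below are made-up numbers (not data from any source) chosen so that their means are equal, and `prob_superiority` is a hypothetical helper name. It computes the sample analogue of P(X > Y) + 0.5 P(X = Y) by counting pairs.

```python
# Hypothetical toy samples, chosen so that both means equal 3.
x = [1, 1, 1, 9]
y = [2, 2, 2, 6]


def prob_superiority(a, b):
    """Share of (a_i, b_j) pairs with a_i > b_j, counting ties as one half."""
    wins = sum(ai > bj for ai in a for bj in b)
    ties = sum(ai == bj for ai in a for bj in b)
    return (wins + 0.5 * ties) / (len(a) * len(b))


mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)
p_xy = prob_superiority(x, y)

print(mean_x, mean_y, p_xy)
```

Here the two means coincide while the pairwise share is 0.25 rather than 0.5, so a statement about means and a statement about P(X > Y) + 0.5 P(X = Y) can disagree when the shapes differ.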

Effect size section should be tightened
Specifically, the following section: https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test#Proportion_of_concordance_out_of_all_pairs has three subsections with different names for the same measure. Tal Galili (talk) 19:39, 31 July 2019 (UTC)

Rank-biserial_correlation should be merged with text from other articles
This section: Mann–Whitney_U_test can take material from:


 * Effect_size
 * Rank_correlation
 * Wilcoxon_signed-rank_test

In both of these cases I linked to the section on this page, but I think they might contain material that could be added to this section (and the ones in these articles should probably be shortened).

Tal Galili (talk) 17:07, 3 August 2019 (UTC)

Formula for Common Language Effect Size?
The formula for the common language effect size, U/(n1*n2), seems to me to work only when there are no ties. In general: "for continuous data, it is the probability that a score sampled at random from one distribution will be greater than a score sampled from some other distribution" (McGraw & Wong, 1992, p. 361). — Preceding unsigned comment added by Stikpet (talk • contribs) 11:08, 2 September 2020 (UTC)
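
A small numerical sketch of the tie issue raised above, under stated assumptions: U is computed from the rank sum with tied values given midranks (average ranks), the samples are made-up numbers, and `midranks` and `u_from_ranks` are hypothetical helper names. With that convention, U/(n1*n2) matches the share of pairs with X > Y plus half the share of tied pairs, not the strict share of pairs with X > Y.

```python
def midranks(values):
    """Return 1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current block of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def u_from_ranks(x, y):
    """U for the first sample: rank sum minus n1 * (n1 + 1) / 2."""
    ranks = midranks(list(x) + list(y))
    r1 = sum(ranks[:len(x)])
    return r1 - len(x) * (len(x) + 1) / 2


# Hypothetical toy samples containing ties between the groups.
x = [1, 2, 2, 3]
y = [2, 2, 4]
n_pairs = len(x) * len(y)

u1 = u_from_ranks(x, y)
strict = sum(a > b for a in x for b in y)   # pairs with x strictly larger
ties = sum(a == b for a in x for b in y)    # tied pairs

u_over_pairs = u1 / n_pairs
strict_share = strict / n_pairs
tie_adjusted_share = (strict + 0.5 * ties) / n_pairs

print(u_over_pairs, strict_share, tie_adjusted_share)
```

In this example U/(n1*n2) agrees with the tie-adjusted share and exceeds the strict share, which is consistent with the concern that the formula only equals P(X > Y) when there are no ties.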

Illustration of object of test
I have hidden the section 'Illustration of object of test' as it is not correct. The author doesn't properly distinguish between population and sample. Madyno (talk) 09:20, 10 November 2020 (UTC)

New link added to another online calculator
Hi everyone I have added a link under "External Links" to my online app, that can perform the Mann-Whitney U test. Question: Was it all right for me to do that? Linking to my own website? Or would that be considered conflict of interest or spam? The site is completely free of charge and no registration required. — Preceding unsigned comment added by IwanHBon (talk • contribs) 20:48, 5 March 2021 (UTC)


 * No, you should not add links to your website. See WP:ELNO - MrOllie (talk) 22:09, 5 March 2021 (UTC)

Help with table
Can someone help me format the table in the efficiency section so it is on the right? DMH43 (talk) 17:40, 6 December 2023 (UTC)

U-Statistic
Link to U-Statistic should be discussed in greater depth. Why is the Whitney U statistic unbiased etc... Biggerj1 (talk) 19:39, 19 February 2024 (UTC)

Does it test for a difference in mean as well as a difference in median?
"if the responses are assumed to be continuous and the alternative is restricted to a shift in location, i.e., F1(x) = F2(x + δ), we can interpret a significant Mann–Whitney U test as showing a difference in medians."

If the shape of the distribution is the same, then the mean should move too, right? Why is it always described as being about the median? Transient Being (talk) 18:01, 24 May 2024 (UTC)
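
A minimal numerical sketch of the pure-shift case in the question, using made-up numbers and the hypothetical shift size `delta`: adding the same constant to every value moves the sample mean and the sample median by exactly that constant. This only illustrates the shift model; it assumes the mean exists, which is not guaranteed for heavy-tailed distributions such as the Cauchy, whereas a median always exists.

```python
import statistics

# Hypothetical right-skewed toy sample and shift size.
sample = [1, 2, 3, 5, 19]
delta = 4

# Pure location shift: every observation moves by the same amount.
shifted = [v + delta for v in sample]

mean_shift = statistics.mean(shifted) - statistics.mean(sample)
median_shift = statistics.median(shifted) - statistics.median(sample)

print(mean_shift, median_shift)
```

Under these assumptions both differences equal `delta`, so in a pure shift model a difference in medians and a difference in means describe the same displacement.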