Talk:Wilcoxon signed-rank test

Difference between Wilcoxon signed-rank and rank-sum test
Is the difference between these two tests that the signed-rank test is paired and the rank-sum test is unpaired? If so, perhaps this could be made more explicit, for example: "The Wilcoxon signed-rank test is not the same as the Wilcoxon rank-sum test. While both are nonparametric and involve summation of ranks, the Wilcoxon signed-rank test requires that the data are paired, while the Wilcoxon rank-sum test is used for unpaired data." --JonathanWilliford (talk) 20:38, 3 March 2015 (UTC)

Copyright violation?
It appears that some portion of this has been copied from a textbook: "The recommended cutoff varies from textbook to textbook — here we use 20 although some put it lower (10) or higher (25)." Is there a copyright violation happening here? 96.49.201.125 (talk) 04:39, 1 December 2009 (UTC)

This is no longer in the article. Should this comment be removed? Arauzo (talk) 11:57, 24 October 2016 (UTC)

Ordinal data
I'm no expert, but I'm pretty sure that this test can also deal with ordinal data.

The test is in parts not well described; even I, as someone who teaches statistics, have difficulty fully understanding the test from the text. sigbert Wed Apr 11 08:09:57 CEST 2007


 * NO! Since you are subtracting two values (e.g. pre vs. post) it has to be interval data, not just ordinal. --Statprof (talk) 17:18, 15 April 2008 (UTC)

I'm pretty sure it could work with ordinal (Ratings of women after 5 pints of beer?) —Preceding unsigned comment added by 128.243.253.114 (talk) 21:06, 21 April 2008 (UTC)


 * Wrong: As Statprof mentioned, subtracting values presupposes interval data. In your example it would presuppose that someone rated 10 is as far from someone rated 9 as someone rated, say, 3 is from someone rated 2. 128.232.231.16 (talk) —Preceding undated comment added 11:07, 2 March 2009 (UTC).

I would say that the test generally doesn't work for ordinal data. But if you have ordinal data and it is possible to assume that a change of two scale steps is always a greater change than a change of one scale step, a change of three steps is always a greater change than a change of two steps, and so forth (a kind of ordinal data which is close to interval, i.e. semi-interval, even though it is not perfectly equidistant), then the test will work. It demands careful validation of the scale and rather heavy assumptions; to be on the safe side, a sign test could be preferable. //MG Stat. —Preceding unsigned comment added by 90.231.200.20 (talk) 06:16, 22 April 2010 (UTC)

An already referenced article says,  'the measures of XA and XB have the properties of at least an ordinal scale of measurement, so that it is meaningful to speak of "greater than," "less than," and "equal to."' , and lots of other pages (just one example) agree. To people claiming that 'subtraction requires interval data': the test uses the sign and the rank (not the value!) of each difference, which make sense for ordinal data. I'm changing the article. --asqueella (talk) 11:49, 24 March 2011 (UTC)
 * The above comment by 90.231.200.20 makes a lot of sense, but I couldn't find a source for that. --asqueella (talk) 12:30, 24 March 2011 (UTC)
 * All known WP:Reliable sources say it works for ordinal data, so the article should reflect that, not ill-informed speculation and WP:original research. Subtracting values does not require interval data. Defining a measure of effect size might do so, but this is a hypothesis test, not a measure of effect size. Qwfp (talk) 19:21, 13 April 2011 (UTC)

I agree with the notion that the test works on ordinal variables (essentially, if you have ordinal data, you skip one step). Reference for instance here: http://www.sussex.ac.uk/Users/grahamh/RM1web/WilcoxonHandoout2011.pdf. Emil, 14:02, 29.4.2013. — Preceding unsigned comment added by 91.214.156.1 (talk) 12:04, 29 April 2013 (UTC)

The test is based on ranks, so it is designed to deal with ordinal data. Potential confusion arises because the data that need to be ordinal are the differences between the matched observations, not the individual scores for each member of the pair. So if you have a sample of people and can rank them from least different to most different, then you have ordinal data and no problem. If, on the other hand, you have observations for these people (say) at pre and post on a five-point scale, you have to calculate the difference scores, and it is here that you encounter a problem if the measurements are only made on an ordinal scale. As Statprof points out, you have to calculate a difference between the pre and post measurements, and if the data are only ordinal you don't know if a change for person A from the 2nd to 4th category is bigger than, smaller than, or the same as a change from the 3rd to 5th category for person B. Or even, for that matter, a change from the 4th to 5th category for person C. SpeakSince (talk) 03:20, 6 June 2014 (UTC)
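To make the point about difference scores concrete, here is a minimal sketch in plain Python. The three pre/post ratings and the two numeric codings are hypothetical, and the helpers `midranks` and `signed_rank_sum` are written for this illustration only. Both codings preserve the order of the five categories, yet the sum of signed ranks changes, while the number of positive and negative differences (what the sign test uses) does not.

```python
def midranks(values):
    """Return 1-based ranks of values, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def signed_rank_sum(pairs, coding):
    """Sum of signed ranks of (post - pre) after mapping categories to numbers."""
    diffs = [coding[post] - coding[pre] for pre, post in pairs]
    diffs = [d for d in diffs if d != 0]  # zero differences are dropped
    ranks = midranks([abs(d) for d in diffs])
    return sum(r if d > 0 else -r for d, r in zip(diffs, ranks))


# Hypothetical (pre, post) ratings on a five-category ordinal scale.
pairs = [(1, 2), (2, 3), (5, 4)]

equal_steps = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5}
stretched_top = {1: 1, 2: 2, 3: 3, 4: 4, 5: 10}  # same order, different spacing

w_equal = signed_rank_sum(pairs, equal_steps)
w_stretched = signed_rank_sum(pairs, stretched_top)
print(w_equal, w_stretched)
```

With equal spacing all three differences tie in magnitude; with the stretched coding the single negative difference becomes the largest one, which is why the two sums differ.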

Assumptions for Wilcoxon-signed rank test?
What are the assumptions for the Wilcoxon signed-rank test? Unfortunately, the information and applications I find mostly contradict each other. Some statistical books state that one assumption of the test is that the distribution of the differences should be symmetric. But wouldn't this assumption only hold under the null? Thanks. —Preceding unsigned comment added by Fanny151984 (talk • contribs) 10:11, 23 August 2008 (UTC)
 * For what it is worth, the official MATLAB (version 2013b) software used by several engineers at large companies makes the following claim about the Wilcoxon signed rank test: "The data are assumed to come from a continuous distribution, symmetric about its median." Their official documentation references [1] Gibbons, J.D., and S. Chakraborti. Nonparametric Statistical Inference, 5th Ed., Boca Raton, FL: Chapman & Hall/CRC Press, Taylor & Francis Group, 2011. [2] Hollander, M., and D. A. Wolfe. Nonparametric Statistical Methods. Hoboken, NJ: John Wiley & Sons, Inc., 1999. This does not mean they are correct, but it is evidence to support the claim that this test assumes the data are symmetric about the median and, if it exists, the mean. 150.135.223.6 (talk) 23:07, 11 December 2014 (UTC)

30-06-2015 There is indeed a clear problem with H0, which cannot be "median difference between the pairs is zero". Here is an example satisfying that H0 which is rejected by the Wilcoxon test:

N = 101; X_1, ..., X_50 = (0, 1); X_51 = (2, 2); X_52, ..., X_101 = (5, 3). The median difference is 0 (50 at 1; 1 at 0; 50 at -2), but the Wilcoxon test rejects the null hypothesis (the sum of signed ranks is not 0).
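The counterexample can be checked numerically. The following is a minimal sketch in plain Python; the helper `midranks` is written here for illustration, and the z score assumes the symmetric null (each rank carries a + or - sign with probability 1/2, independently), so its variance is the sum of squared ranks.

```python
from math import sqrt
from statistics import median


def midranks(values):
    """Return 1-based ranks of values, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


# The 101 pairs (x1, x2) from the example.
pairs = [(0, 1)] * 50 + [(2, 2)] + [(5, 3)] * 50
diffs = [x2 - x1 for x1, x2 in pairs]

median_diff = median(diffs)

# Signed-rank sum: drop zero differences, rank |d|, re-attach the signs.
nonzero = [d for d in diffs if d != 0]
ranks = midranks([abs(d) for d in nonzero])
signed_rank_sum = sum(r if d > 0 else -r for d, r in zip(nonzero, ranks))

# Normal approximation under the symmetric null: Var = sum of squared ranks.
z = signed_rank_sum / sqrt(sum(r * r for r in ranks))

print(median_diff, signed_rank_sum, round(z, 2))
```

The median difference is exactly 0, while the signed-rank sum is far from 0 relative to its null standard deviation, which is the behaviour the example describes.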

"Median difference between the pairs is zero" is the null hypothesis of the sign test. I have no source, but I propose H0: "the difference between the pairs follows a symmetric distribution around 0". Under this hypothesis $$|x_{2,i} - x_{1,i}|$$ and $$\operatorname{sgn}(x_{2,i} - x_{1,i})$$ are independent, so it should imply E(W)=0 (I do not have the courage to find a proof, but I guess it may be simple). I do not know if there is a simple formulation for a reduced H1 hypothesis, particularly for the one-sided "Positive differences are more often greater than negative differences", but I think it is incorrect. — Preceding unsigned comment added by Samuelboudet (talk • contribs) 12:46, 30 June 2015 (UTC)

The Signed-Rank test absolutely needs the assumption of the distribution being symmetric. Without it, it will tend to reject the null hypothesis more. This paper here shows in Table 3.2 that in some skewed distributions (Gamma distributions with reasonable parameters), the samples from the distribution will reject its own median ~87% of the time, meaning there is a huge Type-1 error rate without the symmetric assumption! Rackdude (talk)
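A rough simulation of this effect, as a sketch only: it uses an Exponential(1) distribution (a Gamma distribution with shape 1, chosen here for convenience and not taken from the cited paper), a sample size of 200, 500 trials, and the large-sample normal approximation at the two-sided 5% level. Each sample is shifted by the true median ln 2, so "median = 0" holds by construction.

```python
import math
import random


def signed_rank_z(xs):
    """z score of W+ under the normal approximation (continuous data, no ties)."""
    n = len(xs)
    order = sorted(range(n), key=lambda i: abs(xs[i]))
    w_plus = sum(rank for rank, i in enumerate(order, start=1) if xs[i] > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return (w_plus - mean) / sd


def rejection_rate(n=200, trials=500, seed=0):
    """Fraction of samples whose own true median is rejected at the 5% level."""
    rng = random.Random(seed)
    true_median = math.log(2)  # median of the Exponential(1) distribution
    rejections = 0
    for _ in range(trials):
        xs = [rng.expovariate(1.0) - true_median for _ in range(n)]
        if abs(signed_rank_z(xs)) > 1.96:
            rejections += 1
    return rejections / trials


rate = rejection_rate()
print(rate)
```

If the test were a valid test of "median = 0" for this distribution, the printed rate should be near 0.05; the exact figure depends on the distribution, the sample size and the seed.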


 * I think I got it :).
 * For paired data: H0 is that, if (X1,Y1) and (X2,Y2) are two independent random pairs following the tested distribution, then median(X1-Y1+X2-Y2)=0, and H1 is that median(X1-Y1+X2-Y2) is either greater or lower than 0.
 * For single-sample data: H0: if X1 and X2 are two independent random variables following the tested distribution, then median(X1+X2)=0; H1: median(X1+X2) is either greater or lower than 0.
 * I have a doubt: it may be median(X1+X2 | X1+X2<>0), which would correspond to P(X1+X2>0) = P(X1+X2<0).
 * With that definition it is not required to assume a symmetric distribution. Strangely, it is possible that median(X1)=median(X2)<0 and median(X1+X2)>0. The only test for the median of differences is the sign test. (I guess $$\operatorname{median}\left( \sum_{i=1}^{n} X_i / n \right)$$ converges to E(X) (E is the expected value), since $$\sum_i X_i /n$$ converges to a Gaussian law with mean E(X). The idea is that the Wilcoxon test has a kind of semi-assumption of Gaussianity to be meaningful, since median(X1+X2) is closer to the mean (multiplied by 2).)
 * I just sent a bug report to MathWorks asking them to correct their help documentation (e.g. >> A=[ 100:200 230:300];B=[101:201 0:70]; median(B-A) %-> 1 signtest(A,B,'tail','left')  %->0.0135  ranksum(A,B,'tail','right') %->1.2510e-25 ).
 * I have a kind of proof, but I may have made a mistake and anyway I have no sources. Could someone please validate this before I change the main article? I think I will also add a paragraph discussing this counterintuitive fact and common source of mistakes. --Samuelboudet (talk) 10:51, 5 July 2017 (UTC)


 * After years, I apologize for my previous mistake. It seems someone has made the same mistake as me and put it on the main page in the "Null and alternative hypotheses" section.
 * The current H0 on the page: $$\Pr(\tfrac12(X_1 + X_2) > 0)=1/2$$, where $$X_1$$ and $$X_2$$ are IID $$F$$-distributed random variables.
 * First it contradicts the previous sentences: "The one-sample Wilcoxon signed-rank test can be used to test whether data comes from a symmetric population with a specified median. If the population median is known, then it can be used to test whether data is symmetric about its center."
 * Secondly, here is a counterexample, assuming that $$F$$ has pdf
 * $$ p(x) = \begin{cases} 1 & \text{if } x \in [-\frac{\sqrt{2}}{2}, 0] \cup [\frac{\sqrt{2}}{2}, 1] \\ 0 & \text{otherwise} \end{cases} $$. Suppose we take $$n$$ samples $$X_1, \ldots, X_n$$ and compute the p-value for the statistic value $$U_n=-\frac{n(n+1)}{2}$$, which corresponds to all $$X_i$$ being negative. Then $$P(U_n=-\frac{n(n+1)}{2})=P(\bigcap_i \{X_i<0\}) = (\frac{\sqrt{2}}{2})^n$$. This is very different from the Wilcoxon p-value: under the right null hypothesis (symmetry around 0), we assume that every $$P(X_i<0)=\frac{1}{2}$$ and then $$P(\bigcap_i \{X_i<0\})=0.5^n$$. For example, with n=5 we will have a probability of 0.1768 of rejecting the null hypothesis at a threshold of 0.0312 (i.e. an event which is supposed to happen with a probability of at most 0.0312 will actually occur with a probability of 0.1768), and this is a contradiction.
 * Since I have made the same error, I think I have understood where it came from. Let $$U_n = \sum_{i=1}^n \text{sign}(X_i) R_i$$ be the signed-rank sum statistic.
 * On average $$E(U_n) = n(n-1) \left( P(X_1 + X_2 > 0) - \frac{1}{2} \right) + 2n \left( P(X_1 > 0) - \frac{1}{2} \right)$$, where $$X_1$$ and $$X_2$$ are iid following the same continuous distribution.
 * When $$n \to \infty$$, $$E(U_n)=0 \Leftrightarrow P(X_1 + X_2 > 0) = \frac{1}{2}$$, and $$P(X_1 + X_2 > 0) = \frac{1}{2}$$ is the $$H_0$$ written on the wiki page. This assumption is however not enough to determine the entire distribution of $$U_n$$ as described in the "Computing the null distribution" section. As a counterexample, consider the distribution $$H$$ with pdf
 * $$ p(x) = \begin{cases} 1 & \text{if } x \in [-s, 0] \cup [s, 1] \\ 0 & \text{otherwise} \end{cases} $$ with $$ s = \frac{\sqrt{2(n^2+1)}-2}{2(n-1)} $$: we have $$E(U_n)=0$$, which conforms to the "false" $$H_0$$ assumption, but the probability of rejecting H0 does not equal the actual p-values of the Wilcoxon test.
 * Samuelboudet (talk) 19:26, 17 July 2024 (UTC)
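The numbers in the first counterexample (the density with break point √2/2) can be verified with a few lines of arithmetic. This is a sketch only; the variable names are mine. It checks that the density integrates to 1, that P(X1 + X2 > 0) = 1/2 holds exactly, and that the probability of all five observations being negative is about 0.1768 rather than 0.5^5.

```python
import math

a = math.sqrt(2) / 2  # break point of the density: support is [-a, 0] and [a, 1]

total_mass = (0 - (-a)) + (1 - a)  # the density is 1 on both intervals
p_negative = a                     # P(X < 0), the mass of [-a, 0]
p_positive = 1 - a                 # P(X > 0), the mass of [a, 1]

# X1 + X2 > 0 unless both draws are negative: any positive draw is at least a,
# which is at least as large as the magnitude of any negative draw.
p_sum_positive = 1 - p_negative ** 2

n = 5
p_all_negative_actual = p_negative ** n   # under the density above
p_all_negative_symmetric = 0.5 ** n       # what the signed-rank null assumes

print(total_mass, p_sum_positive)
print(round(p_all_negative_actual, 4), p_all_negative_symmetric)
```

So the distribution satisfies P(X1 + X2 > 0) = 1/2 exactly, yet the most extreme outcome of the statistic is several times more likely than the tabulated null distribution says.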

A common error has come back in the introduction: "The Wilcoxon test can be a good alternative to the t-test when population means are not of interest; for example, when one wishes to test whether a population's median is nonzero, or whether there is a better than 50% chance that a sample from one population is greater than a sample from another population." Testing whether the median is zero is the object of the sign test, not the Wilcoxon test. The counterexample: N = 101; X_1, ..., X_50 = (0, 1); X_51 = (2, 2); X_52, ..., X_101 = (5, 3). The median of the differences is 0 (50 at 1; 1 at 0; 50 at -2), but the Wilcoxon test rejects the null hypothesis (the sum of signed ranks is not 0). Moreover, the use of the t-test is not due to "population means are of interest" but to a Gaussianity assumption, which may or may not be made. I propose a correction. Samuelboudet (talk) 15:07, 17 July 2024 (UTC)

One sample testing against hypothesis
Some statistics software (e.g. GraphPad Prism) claims to use the Wilcoxon signed-rank test for non-parametric one-sample testing, i.e. it compares the median of a single group to a hypothetical median. Prism distinguishes this from the more common two-sample test by calling the latter the Wilcoxon matched pairs test.

N.B. The one sample test on the difference between matched pairs in two groups seems to be equivalent to a Wilcoxon signed rank test on those two groups and comparing to the null hypothesis that the median difference is equal to zero. Although in the one sample test you can compare to any hypothetical median, not just one equal to zero.

If this is legitimate, a section on this should probably be included. I'm not completely sure so haven't just added it in. schroding79 (talk) 00:59, 2 September 2008 (UTC)

Example wrong?
The example describes the W+ statistic as a sum of the signed ranks. This contradicts the "Test Procedure" section (and my understanding of the W+ statistic) that says W+ is the minimum of the sum of ranks for positive differences and the sum of ranks for negative differences.


 * Yeah, I was trying to figure out how the test could be that the value is less than the critical value if it is this type of sum (i.e. if all were positive, then would not-reject always, but then it almost has to be a reject). Also, the previous section makes reference to the statistic converging (presumably in distribution) to the normal but then doesn't say what the mean and SD are. 018 (talk) 18:19, 3 February 2010 (UTC)

In the example, should we be looking up the critical values for n=10 as the page says, or should we use n=9 because we disregard the data point where the values are equal? —Preceding unsigned comment added by 82.25.68.121 (talk) 20:37, 14 October 2010 (UTC)

I've corrected the example using the sum of the signed ranks and the appropriate critical value, and I corrected the test procedure to match this. — Preceding unsigned comment added by Kastchei (talk • contribs) 02:06, 20 April 2012 (UTC)

I don't understand why the critical value is supposed to be 33. Looking at table B.3 in the reference table, the critical value with alpha=0.05, two-tailed, and n=9 is 5, not 33. — Preceding unsigned comment added by Jdfekete (talk • contribs) 15:01, 30 October 2019 (UTC)

Conflict with Siegel & Castellan?
65.78.66.242 commented on the article page: "The diecision rule state here is in error. According to Siegel & Castellan 1988 pp. 88-89, the stated decision rule is if the calculated value is less than or equal to the critical value then the null hypothesis is rejected (not retained as stated in the Wikipeda text)." MichaK (talk) 16:13, 25 October 2010 (UTC)

The W statistic
According to, which is currently one of the external links at the end of this article, the test statistic W is computed as the sum of W+ and W-. This differs from this article, which uses the minimum of W+ and W-. Apparently this is an unresolved issue as some descriptions use only W+, others use the minimum of W+ and W-, and yet others use the sum of W+ and W-. I emailed the author of that page (Dr. Lowry) about this. He said that using the sum of W+ and W- will converge to approximate the Normal distribution with fewer comparisons, and that the method of using the minimum of W+ and W- date to the time when the properties of the relevant sampling distributions had to be worked out laboriously by hand, and are only useful for small-sample cases (where n is less than about 10). He referred me to Mosteller & Rourke, Sturdy Statistics: Nonparametrics and Order Statistics, Addison-Wesley, 1973. I believe his argument is sound, and I think that the method of using the minimum of W+ and W- will tend to lead to more false-positives simply because it relies on fewer observations. I propose that we change this article to describe the simpler and more robust method of computing W as the sum of W+ and W-.--Headlessplatter (talk) 18:25, 21 June 2011 (UTC)

I've changed the test procedure to clarify all the issues you've mentioned. This section and the example now match the procedure recommended by Dr. Lowry. — Preceding unsigned comment added by Kastchei (talk • contribs) 02:10, 20 April 2012 (UTC)

Assumptions for Wilcoxon-signed rank test - needs excpention
I have just added a proper citation for the assumptions of the Wilcoxon signed-rank test:

http://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test#Assumptions

I am not sure this section is complete, since under the framework of randomization tests the Wilcoxon test has a different H0 (where it concerns the effect of the treatment in changing the "distribution", and not necessarily a shift of the location parameter).

Tal Galili (talk) 10:17, 8 December 2011 (UTC)

Confidence Interval section is not good
The section on confidence intervals is complete garbage. This article covers the paired-sample test, but it looks like someone copied some text that was intended for a single-sample non-paired test, and even did a poor job at that. The notation doesn't make sense (e.g., D_i has a single subscript, but seems to define a 2-D matrix of differences), isn't consistent with the notation of the previous section, and the wording itself contains numerous typos, poor grammar, and awkward descriptions.

How to compute the confidence interval for the paired test is not at all obvious, and thus deserves coverage here. As is, the section that is there is doing a disservice and would be better removed. I would, however, like to have a reference for how it is done in the paired case. With the $$\min(W_-,W_+)$$ in the test, I don't think it is as easy as using the Walsh averages of Zi.

-- Lonnie Chrisman 19:33, 8 December 2011 (UTC)

I removed this section until someone can add the correct information.Kastchei (talk) 02:08, 20 April 2012 (UTC)

A new section discussing the theory behind it and why it works?
Hello, I'm not an expert in this area and after a little reading from (http://www.cis.uoguelph.ca/~wineberg/publications/ECStat2004.pdf) I discovered how this test works. I think the average user looking this up would like to know how the tables for small sample sizes are generated.

(1) Essentially the ranks generated are uniformly distributed provided the source distributions are continuously distributed. Being continuously distributed means the probability of tie is zero (not impossible, just zero probability). Discrete distributions introduce the possibility of ties; each tie drives the distribution of ranks to be less uniform.

(2) A function of the sums (or differences) of uniform random variables (RV) is performed via convolution. This follows from the basics of the study of RVs.

(3) So in the limit the distribution approaches a normal distribution. Note this is not required to perform the test, but describes why once the sample size is sufficiently large (e.g., 30) then you can use the normal distribution. Otherwise the statistic is distributed according to the convolution of n uniform RVs.

Anyways, these are just my two cents.Mouse7mouse9 00:41, 9 August 2013 (UTC) — Preceding unsigned comment added by Mouse7mouse9 (talk • contribs)
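For reference, the convolution argument in (2) and (3) can be written out as follows. This is a sketch under the usual null assumptions (continuous differences, symmetric about zero, no ties), in which the rank i carries a positive sign with probability 1/2 independently of the other ranks. Each term of the sum is then a two-point variable on {0, i} rather than a continuous uniform one, and the small-sample tables are the coefficients of the product below divided by 2^n.

```latex
W_+ = \sum_{i=1}^{n} i\,B_i, \qquad B_i \overset{\text{iid}}{\sim} \mathrm{Bernoulli}\left(\tfrac12\right)

\Pr(W_+ = w) = \frac{c_n(w)}{2^n}, \qquad \sum_{w} c_n(w)\,x^w = \prod_{i=1}^{n} \left(1 + x^i\right)

\mathrm{E}(W_+) = \sum_{i=1}^{n} \frac{i}{2} = \frac{n(n+1)}{4}, \qquad
\operatorname{Var}(W_+) = \sum_{i=1}^{n} \frac{i^2}{4} = \frac{n(n+1)(2n+1)}{24}
```

Because the terms are independent with bounded, growing variances, the standardized sum approaches a normal distribution as n increases, which is the large-sample approximation mentioned in (3).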

"History" section
In the history section it says that this test is also referred to as the "t-test for matched pairs" or the "t-test for dependent samples". Is this really true? Because those phrases should refer to the parametric paired-samples t-test, which of course is completely different than this, uses the mean, variance, correlation, assumes Gaussian distribution, etc. So if people do indeed refer to this test with those phrases, that would be either wrong or misleading, correct? Is this something that should be addressed in the article? Eflatmajor7th (talk) 22:32, 19 October 2013 (UTC)
 * Hearing nothing, I'm going to delete the last sentence of the History section, to avoid confusion. Eflatmajor7th (talk) 08:41, 3 December 2013 (UTC)

"come from the same population"
What does it mean? "come from the same population"

It doesn't mean that $$x_{i,1}$$ has the same distribution as $$x_{i,2}$$, that would be strange. Lockywolf (talk) 10:34, 28 May 2015 (UTC)

As N_r increases, the sampling distribution of W converges to a normal distribution.
This is impossible, since W is always positive. — Preceding unsigned comment added by Lockywolf (talk • contribs) 11:07, 28 May 2015 (UTC)

Critical value correct?
I highly doubt the correctness of the critical value W = 35 (two-tailed, alpha=0.05, n=9). I have looked up multiple tables and it seems to be 5/5,4/6, depending on rounding in the table itself. The value of 35 does appear in external link 1, but I don't know why it says so. — Preceding unsigned comment added by 89.146.22.128 (talk) 12:46, 9 December 2016 (UTC)
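One possible reading of the discrepancy is that the two numbers refer to different statistics: 5 for min(W+, W-) and 35 for |W+ - W-|. The sketch below, in plain Python, computes the exact null distribution of W+ for n = 9 by counting sign assignments and then converts the cutoff between the two conventions. The function names are mine, and the null model assumed is the usual one (no ties, each rank positive with probability 1/2).

```python
def w_plus_counts(n):
    """counts[w] = number of sign assignments of ranks 1..n with W+ == w."""
    max_w = n * (n + 1) // 2
    counts = [0] * (max_w + 1)
    counts[0] = 1
    for rank in range(1, n + 1):
        # Go downwards so that each rank is used at most once.
        for w in range(max_w, rank - 1, -1):
            counts[w] += counts[w - rank]
    return counts


def critical_min(n, alpha=0.05):
    """Largest t with two-sided P(min(W+, W-) <= t) <= alpha, or None."""
    counts = w_plus_counts(n)
    total = 2 ** n
    best = None
    cumulative = 0
    for t in range(len(counts) // 2):
        cumulative += counts[t]
        # By symmetry the two tails have equal probability and do not overlap.
        if 2 * cumulative / total <= alpha:
            best = t
        else:
            break
    return best


n = 9
max_w = n * (n + 1) // 2        # W+ + W- = 45 when n = 9
t_crit = critical_min(n)        # cutoff for min(W+, W-)
w_crit = max_w - 2 * t_crit     # the same cutoff expressed as |W+ - W-|
print(t_crit, w_crit)
```

Since W+ + W- = 45, min(W+, W-) <= 5 is the same event as |W+ - W-| >= 35. By the same conversion, a cutoff of 6 for the minimum corresponds to 33 for the absolute signed-rank sum.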

Question moved from article to talk page
Kevinwm0 asked a question inside the article. Moved here:


 * Question: Can someone explain how you get $$ W_{\alpha = 0.05,\ 9 \text{, two-sided}} = 6$$?
 * If we were to use $$z = \frac{W}{\sigma_W}, \sigma_W = \sqrt{\frac{9(9 + 1)(2*9 + 1)}{6}} = \sqrt{285}$$
 * and Z is far less than 1.96.

Variance off by factor 4
I just ran some simulations on the distribution of the statistic as described here and the Variance formula seems to be off by a factor 4.

This might be a leftover from the earlier discussions on using only the positive or only the negative values. I'll fix this in the article, but I am not entirely sure about the references to older materials, which might report values for the test statistic using only the positive or negative values. Xenonoxid (talk) 22:56, 17 June 2019 (UTC)
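The factor of 4 can be confirmed exactly for small n rather than by simulation. The sketch below enumerates every sign assignment for the ranks 1..n (assuming no ties and equally likely signs) and compares the variance of W+ with that of the signed-rank sum W = W+ - W-. Since W = 2·W+ - n(n+1)/2, the second variance is four times the first. The function name is mine.

```python
from fractions import Fraction
from itertools import product


def exact_variances(n):
    """Exact null variances of W+ and of W = W+ - W-, by full enumeration."""
    ranks = range(1, n + 1)
    w_plus_values = []
    w_values = []
    for signs in product((1, -1), repeat=n):
        w_plus_values.append(sum(r for r, s in zip(ranks, signs) if s > 0))
        w_values.append(sum(r * s for r, s in zip(ranks, signs)))

    def variance(values):
        mean = Fraction(sum(values), len(values))
        return sum((v - mean) ** 2 for v in values) / len(values)

    return variance(w_plus_values), variance(w_values)


n = 6
var_w_plus, var_w = exact_variances(n)
print(var_w_plus, var_w)
print(Fraction(n * (n + 1) * (2 * n + 1), 24), Fraction(n * (n + 1) * (2 * n + 1), 6))
```

So n(n+1)(2n+1)/24 is the variance for W+ (or W-) alone, and n(n+1)(2n+1)/6 is the variance for the signed-rank sum; older tables that use only one of W+ or W- would need the former.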

Link to the Dutch wikipedia
Hello

When going to the Dutch wiki, I arrive at https://nl.wikipedia.org/wiki/Rangtekentoets. But it should go to https://nl.wikipedia.org/wiki/Wilcoxontoets The last one is about 2 samples. --Rcsmit (talk) 11:59, 2 November 2020 (UTC)

Clarity to a layperson
I don't think a layperson would have a chance of understanding what Wilcoxon signed rank is from this article. Perhaps adding a "Basic concept"/"Informal concept" section could solve that? EditorPerson53 (talk) 00:01, 31 December 2021 (UTC)


 * I hear that you would like additional explanation, but I'm not sure what you would like additional explanation of. Could you be more specific about what you see as lacking?  Ozob (talk) 06:07, 31 December 2021 (UTC)

Calculating the p-values
In this article, when the p-values were calculated keeping 0 and removing 0, where do 55 and 109 come from? https://en.wikipedia.org/w/index.php?title=Wilcoxon_signed-rank_test

The W+ was 66. How was the p-value calculated? If the hypothesis test is an equivalence test, such that H0 is µ≠0 and H1 is µ=0 (or within a given boundary, say between -0.5 and 0.5), instead of a test for difference, how should one approach calculating the p-value? Daannoo (talk) 23:10, 11 May 2023 (UTC)