Talk:Anscombe's quartet

Variances...
The variances of the x and y variables had been miscalculated. Whoever did the first calculation seems to have summed all (x - mean_x)^2 and then divided by 10 instead of N = 11. This error seems to have been repeated elsewhere on the internet, but there are webpages (like this one) which give the correct standard deviation/variance.

Could people "refix"ing the values on the page leave a comment here explaining their calculation? Erkcan (talk) 07:10, 18 April 2008 (UTC)


 * The x-variances for the 4 datasets are 10,10,10,10. The y-variances are 3.75206280991736, 3.75239008264463, 3.74783636363636, 3.74840826446281. Erkcan (talk) 07:25, 18 April 2008 (UTC)


 * There is some confusion here between population and sample variance: in the former case the denominator is n (11), in the latter n-1 (10). Which one is correct depends on whether x and y are the population or a sample. But it doesn't really matter; what is more important is that the variance and mean are the same (however calculated) for each data set. It is incorrect to refer to the mean and variance of "each x" or "each y"; "mean of x" would be better. Also, the lines in the graphs don't intercept the y-axis at 3; I presume the origin is not zero, which is a bit confusing. I would also ask that the other statistics from the original paper be added; this seems to be in hand from the page source. Jmgibbons (talk) 13:57, 2 September 2009 (UTC)
 * I have rephrased the table to avoid the "each x" usage. Melcombe (talk) 17:24, 12 November 2009 (UTC)


 * Whoops, I made an anonymous edit refixing those values before I read this page. I actually ran across this as I was writing a minor report on the quartet, and the page threw me off for a while, making me think I was somehow calculating the variance wrongly. So yes, I assure you that correct statistics matter, and I would kind of like to know why people calculated them with n instead of n-1. Also, the image was generated with the n-1 variances, so there's another reason to keep them as such. 81.57.247.167 (talk) 08:09, 12 November 2009 (UTC)

Delete part of a sentence
Finally, the fourth example (bottom right) shows another example when one outlier is enough to produce a high correlation coefficient, even though the relationship between the two variables is not linear.

replaced with

Finally, the fourth example (bottom right) shows another example when one outlier is enough to produce a high correlation coefficient.

REASON: The relation between x and E[Y|x] in this "made-up population" may or may not be linear. There is no basis to test lack of linear fit with the given "design" of x values. (There are degrees of freedom for pure error only, but NONE for lack of fit when there are only 2 distinct x values.) I think that going into these matters is beyond the scope of the page, so my proposal is just a deletion.

129.1.23.19 (talk) 20:42, 30 September 2011 (UTC)

What the heck is "d.p." in the first table?
Can anyone substitute in the longer statistical terminology? — Preceding unsigned comment added by 18.111.93.217 (talk) 14:36, 15 October 2011 (UTC)


 * It means decimal places --Rumping (talk) 00:27, 17 November 2011 (UTC)

File:Anscombe's quartet 3.svg to appear as POTD soon
Hello! This is a note to let the editors of this article know that File:Anscombe's quartet 3.svg will be appearing as picture of the day on December 11, 2011. You can view and edit the POTD blurb at Template:POTD/2011-12-11. If this article needs any attention or maintenance, it would be preferable if that could be done before its appearance on the Main Page so Wikipedia doesn't look bad. :) Thanks!  howcheng  {chat} 18:37, 9 December 2011 (UTC)

Certainly an interesting picture - but shouldn't the discrepancies cited here on this discussion page be resolved first?--173.69.135.105 (talk) 03:16, 14 December 2011 (UTC)

Regression lines
The regression lines shown (and mentioned in the body of the article) are least squares regression lines. As other forms of regression can be carried out, giving different results, I'm going to insert "least squares" at the first mention of the regression lines. An L1 regression, for example, minimizes the sum of the absolute values of the residuals, and has the property that the outlier in the third dataset will be effectively ignored.

Addendum: after inserting the phrase "least squares" and reviewing the article before saving it, I came to the conclusion that the wording had become overly convoluted. I'm not going to save the edited version of the article, but I am still of the opinion that the article needs to make clear, in some way, that the regression lines shown are least squares. Floozybackloves (talk) 04:08, 11 December 2011 (UTC)
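For reference, the least-squares line is easy to verify directly from the data. Here is a short sketch of my own (not part of the article) using the closed-form formulas slope = Sxy/Sxx and intercept = mean_y - slope*mean_x, applied to the published values of Anscombe's first dataset:

```python
# Closed-form least-squares fit for Anscombe's first dataset
# (values as published by Anscombe, 1973).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

n = len(x)
mx = sum(x) / n
my = sum(y) / n
# slope = Sxy / Sxx, intercept = mean_y - slope * mean_x
slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
intercept = my - slope * mx
print(round(slope, 3), round(intercept, 2))  # 0.5 3.0
```

All four datasets yield approximately the same line, y = 3.00 + 0.500x, which is the point of the quartet.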

I think that the variance values are wrong
I was doing homework on means and variances and tried one of the datasets from Anscombe's quartet. The variance results I calculated were different from the ones written in Wikipedia. First I thought I had made a mistake and searched for it. After checking, everything seemed correct, so I started searching on the internet. The previous versions of the Wikipedia page had these numbers:

variance of x = 10

variance of y = 3.75

These numbers are the same as the results I found. Can someone check it? — Preceding unsigned comment added by 193.140.194.64 (talk) 18:57, 9 October 2012 (UTC)

The issue is N = 11 versus N-1 = 10 in the denominator in the computation of variance, where N is the number of observations. If the data are a sample from a population and the mean is estimated from the sample, then using N-1 in the denominator has the desirable property that E(sample variance) = population variance. The intuition is that estimating the sample mean "uses up" one of the observations so that dividing by N (instead of N less one used up observation) would understate the spread in the data. The Wikipedia page on Bessel's correction has a very clear discussion.

The sample variances computed with N-1 in the denominator are:

sample variance of X = 11

sample variance of Y ≈ 4.12

Michaelaoash (talk) 20:19, 8 August 2013 (UTC)
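To make the N versus N-1 distinction concrete, here is a small sketch of my own (not part of the original discussion) computing both conventions for Anscombe's first dataset; the other three datasets give the same x-variance and nearly identical y-variances:

```python
# Both variance conventions for Anscombe's first dataset
# (values as published by Anscombe, 1973).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

def variance(data, ddof=0):
    """Divide by n - ddof: ddof=0 gives population variance, ddof=1 sample variance."""
    n = len(data)
    m = sum(data) / n
    return sum((v - m) ** 2 for v in data) / (n - ddof)

print(variance(x))     # population (N = 11): 10.0
print(variance(x, 1))  # sample (N - 1 = 10): 11.0
print(variance(y))     # population: about 3.752
print(variance(y, 1))  # sample: about 4.127
```

Both sets of numbers quoted above are therefore internally consistent; they just use different denominators.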

4th dataset
We say:
 * the fourth graph (bottom right) shows an example when one outlier is enough to produce a high correlation coefficient, even though the relationship between the two variables is not linear

Well I think this misses the point - it appears there is no relationship, linear or otherwise! I'd like to say something like the following instead:
 * the fourth graph (bottom right) shows an example when one outlier is enough to produce a high correlation coefficient, even though the two variables otherwise are independent/unrelated/uncorrelated

But I haven't checked all the sources, so I won't be bold and change it right now.

By the way, I think one can today quite easily do a little better than Anscombe did, using e.g. Excel Goal Seek -- for instance, based on Anscombe's quartet, I've created a sextet where I also include an uncorrelated scatterplot (x and y are independently normally distributed) + one outlier, as well as an exponential function. But of course this article is and should be only about Anscombe's original dataset.--Nø (talk) 08:20, 31 January 2017 (UTC)


 * I agree with your point, and the proposed change looks good. Smyth (talk) 18:09, 31 January 2017 (UTC)
 * Done - I chose unrelated as last word.--Nø (talk) 16:13, 2 February 2017 (UTC)
 * "Unrelated" is incorrect in this case. There clearly is a relation between the two variables: that's what the graph illustrates. It's just not entirely a straight-line relation. Obviously the two variables are not statistically independent either, not are they uncorrelated. MartinPoulter (talk) 17:52, 3 February 2017 (UTC)
 * The implication of the graph seems to be that the true value of X is constant, and the outlier is a measurement error. In which case there is no real-world relationship whatsoever between X and Y, even though mathematically there is a correlation. Smyth (talk) 22:42, 3 February 2017 (UTC)
 * I agree with Smyth and disagree with MartinPoulter. I'll not revert Poulter's revert, but I think someone should, or if possible find a better wording.--Nø (talk) 10:29, 4 February 2017 (UTC)
 * Having thought about it more, there are other possible interpretations of the graph. For example, the underlying model may be some sort of step function. So how about ... one outlier is enough to produce a high correlation coefficient, even though the other data points do not indicate any relationship between the variables? Smyth (talk) 13:24, 5 February 2017 (UTC)
 * I support that solution.--Nø (talk) 14:01, 5 February 2017 (UTC)
 * That's an improvement. I think we should be very wary that the statements should be about the data, not about subjective personal interpretations of the data, so I'm glad to see the comments go in that direction. MartinPoulter (talk) 17:18, 6 February 2017 (UTC)