Wikipedia:Reference desk/Archives/Mathematics/2023 June 1

= June 1 =

== Central limit theorem ==
I'm sorry, but to tell the truth, by now, I have become utterly confused about the terminology or rather the concepts behind them applied in the central limit theorem. The problem is that there seems to be some confounding of the terms sample and variable, as can be seen from the definition given for the Classical CLT: This implies an equation of sample and variable, but, as far as I can follow, samples as such do not constitute variables, but instead consist of the latter!

With this in mind, I finally get into serious trouble when considering the example given here on the right [later corrected to: left] (comparison of probability density functions $p(k)$ for the sum of $n$ fair 6-sided dice). As the example deals not with the distribution of the averages but with that of the sums of spots, I asked myself how the CLT as defined at the very beginning of the article's lead ("the sampling distribution of the [standardized] sample mean tends towards a [/ the standard] normal distribution") could be applied here. Now here's the deal: For each die the average number of spots equals 3.5, so with each additional die the average of the possible spot sums grows by 3.5. This, however, seems to me merely a linear and not a normal relation to the growth of n ... Now where's my fallacy? Please somebody help me get out of this quagmire, as I'm literally beginning to go mad about this. Thanks a lot in advance for any assistance. Hildeoc (talk) 20:49, 1 June 2023 (UTC)

PS: As to the terminology problem, doesn't each sample consist of $$X_1, X_2, X_3, \ldots, X_n,$$ as n denotes e.g. the number of dice within a single sample in the given example? (Hence, wouldn't consequently $$\mu_1, \mu_2, \mu_3, \ldots, \mu_n$$ have to denote the expected values of those IID variables within that single sample instead of the expected values of multiple samples? But if so, what exactly would these expected values of the IID variables within the single sample constitute numerically, e.g. for a sample with n = 4, and how would they form a normal distribution for growing n?) --Hildeoc (talk) 21:04, 1 June 2023 (UTC)


 * A quick response in haste. The terminology in the article is occasionally non-standard, both in the lead ("If $X_1, X_2, \dots, X_n, \dots$ are random samples drawn from a population ...") and, as you noted, further on ("Let $\{X_1, \ldots, X_n\}$ be a sequence of random samples ..."). The data set $X_1, X_2, \dots, X_n$ is the sample. As to the example in the section Applications and examples, the text to the left and the image with histograms to the right describe different cases. The image shows the distribution of sample averages for increasingly larger sample sizes, denoted there by a capital $$N$$. The population from which these samples are drawn has a uniform distribution, just like for a fair die, but the possible values are the numbers from 0 to 100, instead of 1 to 6, so the average should be 50. --Lambiam 23:26, 1 June 2023 (UTC)


 * Thank you very much. However, I'm very sorry to say that, in my "frenzy", I made a very stupid mistake: I actually meant to refer merely to the example on the left (i.e., this one), not the right! Let's take, for instance, the sample with n = 3. Then I get three IID variables $$X_1, X_2, X_3$$ and one single average $$\mu = \mu_1 + \mu_2 + \mu_3 = 3 \times 3.5 = 10.5$$, right? For a larger sample, I would accordingly get another single average of $$n \times 3.5$$, right? Now how exactly do I get a normal distribution for averages (plural!)? By plotting the probabilities (= f(x)) of the averages of multiple samples (= x), right? (cf. here, for example) If that's the way to go, why exactly do you deem the terminology in the article "non-standard" in this respect then? Or did I get anything wrong here?--Hildeoc (talk) 00:44, 2 June 2023 (UTC)
 * Fix a sample size $$n$$. Take a sample of that size and note its sum. Repeat until you have many such sums, enough to get a good idea of their distribution. Let's take the case where you are throwing fair dice. If $$n=0$$ you'll soon notice you get a one-point distribution with $$\mu=0,\sigma^2=0.$$ If $$n=1,$$ after taking a lot of samples you'll notice not only that $$\mu\approx \tfrac 72,\sigma^2\approx \tfrac{35}{12},$$ but also that the distribution is nearly uniform. If $$n=1000,$$ you'll observe that now $$\mu\approx 3500,\sigma^2\approx 2917,$$ but also that the distribution has a bell shape and is closely approximated by the normal distribution with these parameters.
 * What is confusing is that there are two levels of sampling. You take one sample $$X_1,...,X_n$$ and get a sum $$S_1.$$ You take an independent second sample and get a sum $$S_2.$$ The sequence of sums obtained, $$S_1,...,S_k,$$ is itself a sample drawn from the population of "$$n$$-sums". To get a good idea of the distribution of that population, $$k$$ needs to be fairly large. That is true in general for sampling and has nothing in particular to do with the CLT. The CLT is about what happens to the distribution as $$n$$ tends to infinity. --Lambiam 08:41, 2 June 2023 (UTC)
 * @Lambiam: I'm sorry but now I'm confused.
 * If n is 0, then I don't have any distribution at all, i.e. not even a one-point distribution, do I?
 * If my n is 1, and I take several samples with 1 die, the average for each sample equals simply the number of dots for each rolled die, as I only have one single value for each sample.
 * With $$n = 1000$$, my $$\mu$$ becomes $$3.5 * 1000 = 3500$$ (not $$350$$), right?
 * Apart from that, when mapping the various sum averages (= x-values) against their probabilities (= y-values), it doesn't matter – as to the CLT – that, even for very large n, the consecutive x-values (i.e. average for a sample with n variables, average for a sample with n + 1 variables etc.) of my resulting normal distribution can always only be values discretized by the factor 3.5, meaning the resulting distribution can actually never become continuous, whatever the n?
 * Did I get it right: When dealing with the CLT in terms of sums of IID variables, we can get close to the normal distribution with one single large sample with a large n, i.e. many IID variables within that single sample (e.g., many dice with their numbers added as in one of the charts here)? Whereas when dealing with averages, on the other hand, we need to map the sum averages of several samples with the same number of variables (rather than with increasing, different large n; cf. here: "The central limit theorem for sample means says that if you keep drawing larger and larger samples (such as rolling one, two, five, and finally, ten dice) and calculating their means, the sample means form their own normal distribution (the sampling distribution).")? But if so, this will only make a difference in terms of empirical, not theoretical values (as in theory, the average sum for a fixed number of dice, for instance, will always stay $$n \times 3.5$$).
 * (I'm honestly sorry if, which seems actually very likely to me, these questions may appear quite lowbrow to professionals, but I'm really just trying to fully grasp the idea behind the CLT!) @David Eppstein, @Michael Hardy, what do you think? Hildeoc (talk) 17:24, 3 June 2023 (UTC)


 * Point by point:
 * If $$n=0,$$ there are no dice and therefore no dots, so the total number of dots is always equal to $$0.$$
 * If $$n=1,$$ the average value of the one-element sample is indeed the number of dots. Assuming the throws are independent, the die is fair if (and only if) the distribution is uniform.
 * With $$n=1000,$$ $$\mu$$ should indeed become $$3.5 \times 1000 = 3500$$. I have corrected the error.
 * Whatever the value of $$n,$$ the random variable that is the sum of the values in a sample of die throws will have a discrete probability distribution, since it can only assume integral values in the range $$n,n+1,...,6n.$$
 * Assume we have some real-valued random variable $$X$$ that has a positive but finite variance. We define two families of derived random variables. One family has members $$S_1(X),S_2(X),S_3(X),...,$$ where $$S_n(X)$$ is the value obtained by taking the sum of an IID sample of $$X$$ of size $$n$$. The family $$A_1(X),A_2(X),A_3(X),...$$ is defined similarly, but now $$A_n(X)$$ is the value obtained by taking the average of an IID sample of $$X$$ of size $$n$$. Each member of these two families is a random variable with a probability distribution. The three random variables $$X,$$ $$S_1(X)$$ and $$A_1(X)$$ have the same distribution. What the CLT essentially says is that as $$n$$ gets larger and larger, the distribution of $$S_n(X)$$ will start to look more and more like a normal distribution. For the case that $$X$$ represents a fair die, we know (by definition) the distribution of $$X$$ precisely. We can use that to calculate the distribution of $$\underline{S_n(X)}$$ exactly. For example, we know that $$\mu=\tfrac 72n$$ without actually throwing any dice. If we do not know the distribution of $$X,$$ we need to take a number of samples of $$S_n(X)$$ and look at the distribution experimentally obtained. If $$n$$ is fairly large, $$S_n(X)$$ should itself be approximately normally distributed, but this can only be verified experimentally by taking a large number of samples of $$S_n(X).$$ Everything said about $$S_n(X)$$ applies equally to $$A_n(X).$$ Since $$A_n(X)=S_n(X)/n,$$ the two have the same distribution except for the scale of the x-axis when plotting them.
 * I hope this clarifies the issue. --Lambiam 21:06, 3 June 2023 (UTC)
 * Thank you very much indeed for thoroughly clarifying that. Your argumentation seems plausible to me. So, to resume my – accordingly modified – summary question: As to plotting the distribution for the various empirical averages of samples with the same large number of IID variables, I only get different x-values (i.e. averages) due to the variation that occurs in empirical data (as for the given example, strictly speaking in terms of theoretical probability, the average for various samples of the same size would always have to amount to $$n * 3.5$$, correct?)
 * Follow-up question: How exactly can I know ex ante whether the population variance is finite, in fact?
 * Also, shouldn't this confusing cumulative reference to $$X_n,$$ $$S_n(X)$$ and $$A_n(X)$$ as IID variables invoked by you rather be correspondingly expounded in the article in question to avoid further ambiguity and misunderstandings (like mine)?
 * Hildeoc (talk) 00:19, 4 June 2023 (UTC)
 * Again point by point:
 * Yes. It is not different from experimentally determining the distribution of any random variable. Suppose you and a colleague are both tasked with finding out if a given physical die is fair. You decide to work independently and both cast the die 3000 times. Upon comparison, your histograms will not be identical.
 * There is no general way of knowing this if the population is infinite and there is no limit on the absolute value of the property of interest. (Otherwise the variance is easily seen to be finite.) You may hope to create a plausible parametrized mathematical model and prove that for all reasonably possible settings of the parameters the model gives you a finite variance. However, you will never have a guarantee that the model is in this respect an adequate description of reality.
 * I've briefly looked into improving the article but am wary of introducing my own approaches and notations, and the reliable sources I looked at (only a few, but they were supposed to be the best) seemed as needlessly confusing as the article's text, which appeared to be following their approach. However, my examination was only cursory. I expect that we have many editors who are experts in this field, which I am not, but I have also noticed that the experts tend to be less interested in getting the more basic maths articles in good shape.
 * --Lambiam 01:28, 4 June 2023 (UTC)
 * I highly appreciate your time and effort once more. As to your last observation, this is really a shame in view of the importance of those basic articles for a true understanding of the fundamental concepts and principles, not least for non-professionals like me. Hildeoc (talk) 18:51, 4 June 2023 (UTC)
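
The two levels of sampling described above can be tried out numerically. The following is only a minimal sketch, not anything taken from the article: the function names `simulate_sums` and `distance_to_normal`, the seed and the choice of 5000 repetitions are all assumptions made for illustration. For each number of dice $$n$$ it collects $$k$$ independent sums $$S_1,...,S_k$$, compares their empirical mean and variance with the theoretical values $$\tfrac 72n$$ and $$\tfrac{35}{12}n$$, and measures how far the empirical distribution is from the normal distribution with those parameters.

```python
import random
import statistics
from collections import Counter


def simulate_sums(n, k, seed=None):
    """Return k independent sample sums, each the total of n fair-die throws."""
    rng = random.Random(seed)
    faces = range(1, 7)
    return [sum(rng.choices(faces, k=n)) for _ in range(k)]


def distance_to_normal(sums, mu, sigma):
    """Largest gap between the empirical CDF and the normal CDF,
    evaluated at each distinct observed value."""
    normal = statistics.NormalDist(mu, sigma)
    counts = Counter(sums)
    k = len(sums)
    cumulative = 0
    worst = 0.0
    for value in sorted(counts):
        cumulative += counts[value]
        worst = max(worst, abs(cumulative / k - normal.cdf(value)))
    return worst


if __name__ == "__main__":
    k = 5000  # number of sums collected (the "outer" sample; illustrative choice)
    for n in (1, 2, 10, 100):  # dice per sample (the "inner" sample size)
        sums = simulate_sums(n, k, seed=2023)
        mu = 3.5 * n        # theoretical mean of the n-sum
        var = 35 / 12 * n   # theoretical variance of the n-sum
        print(
            f"n={n:4d}  mean={statistics.mean(sums):8.2f} (theory {mu:6.1f})  "
            f"variance={statistics.variance(sums):8.2f} (theory {var:7.2f})  "
            f"distance to normal={distance_to_normal(sums, mu, var ** 0.5):.3f}"
        )
```

Running it with a different seed gives slightly different empirical means and variances for the same $$n$$, which is the experimental variation discussed above; the theoretical values do not change. The distance to the normal distribution should shrink as $$n$$ grows, although it never reaches zero because the sums remain integers.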

I am somewhat unsure what question is being asked here, but let's see if an example sheds some light. Suppose a four-sided die is thrown three times. The following are the possible samples:

$$
\begin{array}{|l|c|c|}
\hline
& \text{sample} & \text{sample} \\
\text{sample} & \text{sum} & \text{mean} \\
\hline
1,1,1 & 3 & 1\phantom{.0000\ldots} \\
1,1,2 & 4 & 1.3333\ldots \\
1,1,3 & 5 & 1.6666\ldots \\
1,1,4 & 6 & 2\phantom{.0000\ldots} \\
1,2,1 & 4 & 1.3333\ldots \\
1,2,2 & 5 & 1.6666\ldots \\
1,2,3 & 6 & 2\phantom{.0000\ldots} \\
1,2,4 & 7 & 2.3333\ldots \\
1,3,1 & 5 & 1.6666\ldots \\
1,3,2 & 6 & 2\phantom{.0000\ldots} \\
1,3,3 & 7 & 2.3333\ldots \\
1,3,4 & 8 & 2.6666\ldots \\
1,4,1 & 6 & 2\phantom{.0000\ldots} \\
1,4,2 & 7 & 2.3333\ldots \\
1,4,3 & 8 & 2.6666\ldots \\
1,4,4 & 9 & 3\phantom{.0000\ldots} \\
2,1,1 & 4 & 1.3333\ldots \\
2,1,2 & 5 & 1.6666\ldots \\
2,1,3 & 6 & 2\phantom{.0000\ldots} \\
2,1,4 & 7 & 2.3333\ldots \\
2,2,1 & 5 & 1.6666\ldots \\
2,2,2 & 6 & 2\phantom{.0000\ldots} \\
2,2,3 & 7 & 2.3333\ldots \\
2,2,4 & 8 & 2.6666\ldots \\
2,3,1 & 6 & 2\phantom{.0000\ldots} \\
2,3,2 & 7 & 2.3333\ldots \\
2,3,3 & 8 & 2.6666\ldots \\
2,3,4 & 9 & 3\phantom{.0000\ldots} \\
2,4,1 & 7 & 2.3333\ldots \\
2,4,2 & 8 & 2.6666\ldots \\
2,4,3 & 9 & 3\phantom{.0000\ldots} \\
2,4,4 & 10 & 3.3333\ldots \\
3,1,1 & 5 & 1.6666\ldots \\
3,1,2 & 6 & 2\phantom{.0000\ldots} \\
3,1,3 & 7 & 2.3333\ldots \\
3,1,4 & 8 & 2.6666\ldots \\
3,2,1 & 6 & 2\phantom{.0000\ldots} \\
3,2,2 & 7 & 2.3333\ldots \\
3,2,3 & 8 & 2.6666\ldots \\
3,2,4 & 9 & 3\phantom{.0000\ldots} \\
3,3,1 & 7 & 2.3333\ldots \\
3,3,2 & 8 & 2.6666\ldots \\
3,3,3 & 9 & 3\phantom{.0000\ldots} \\
3,3,4 & 10 & 3.3333\ldots \\
3,4,1 & 8 & 2.6666\ldots \\
3,4,2 & 9 & 3\phantom{.0000\ldots} \\
3,4,3 & 10 & 3.3333\ldots \\
3,4,4 & 11 & 3.6666\ldots \\
4,1,1 & 6 & 2\phantom{.0000\ldots} \\
4,1,2 & 7 & 2.3333\ldots \\
4,1,3 & 8 & 2.6666\ldots \\
4,1,4 & 9 & 3\phantom{.0000\ldots} \\
4,2,1 & 7 & 2.3333\ldots \\
4,2,2 & 8 & 2.6666\ldots \\
4,2,3 & 9 & 3\phantom{.0000\ldots} \\
4,2,4 & 10 & 3.3333\ldots \\
4,3,1 & 8 & 2.6666\ldots \\
4,3,2 & 9 & 3\phantom{.0000\ldots} \\
4,3,3 & 10 & 3.3333\ldots \\
4,3,4 & 11 & 3.6666\ldots \\
4,4,1 & 9 & 3\phantom{.0000\ldots} \\
4,4,2 & 10 & 3.3333\ldots \\
4,4,3 & 11 & 3.6666\ldots \\
4,4,4 & 12 & 4\phantom{.0000\ldots} \\
\hline
\end{array}
$$

So look at the distribution of the sample sum:

$$
\begin{array}{rrrrrrrrrrrr}
3 \\
4 & 4 & 4 \\
5 & 5 & 5 & 5 & 5 & 5 \\
6 & 6 & 6 & 6 & 6 & 6 & 6 & 6 & 6 & 6 \\
7 & 7 & 7 & 7 & 7 & 7 & 7 & 7 & 7 & 7 & 7 & 7 \\
8 & 8 & 8 & 8 & 8 & 8 & 8 & 8 & 8 & 8 & 8 & 8 \\
9 & 9 & 9 & 9 & 9 & 9 & 9 & 9 & 9 & 9 \\
10 & 10 & 10 & 10 & 10 & 10 \\
11 & 11 & 11 \\
12
\end{array}
$$

You can see the "bell-shaped" curve.

The same thing applies to the sample means. Michael Hardy (talk) 23:07, 5 June 2023 (UTC)
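
The enumeration above can also be reproduced mechanically. This is a small sketch under the same assumption of a fair four-sided die thrown three times; the function name `sum_distribution` is made up for illustration. It lists every equally likely outcome, counts how often each sample sum occurs, and prints the counts as a crude text histogram alongside the corresponding sample mean.

```python
from collections import Counter
from itertools import product


def sum_distribution(sides, throws):
    """Count how many equally likely outcomes give each possible sample sum."""
    faces = range(1, sides + 1)
    return Counter(sum(outcome) for outcome in product(faces, repeat=throws))


if __name__ == "__main__":
    throws = 3
    counts = sum_distribution(sides=4, throws=throws)
    total = sum(counts.values())  # number of possible samples, here 4**3
    for s in sorted(counts):
        bar = "#" * counts[s]
        print(f"sum {s:2d}  mean {s / throws:6.4f}  {bar:<12}  {counts[s]}/{total}")
```

Because the sample mean is just the sample sum divided by 3, the histogram of means has exactly the same shape; only the labels on the horizontal axis change.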

== Geographic center of the United States ==
Geographic center of the United States says the following:

This is distinct from the contiguous geographic center, which has not changed since the 1912 admissions of New Mexico and Arizona to the 48 contiguous United States.

Aside from the fact that NM and AZ were part of the US before they became states, why would the contiguous geographic centre have changed because of their admission? Imagine that they were foreign countries, for example — why would they have affected the location of the central spot? Neither state includes the contiguous US' easternmost, westernmost, southernmost, or northernmost points. Nyttend (talk) 20:55, 1 June 2023 (UTC)


 * The 'geographic center' discussed there is the centroid - "the arithmetic mean position of all the points in the surface of the figure" - which will certainly change when new territory is added. Note how the U.S. center article describes its original determination: "In 1918, the Coast and Geodetic Survey found this location by balancing on a point a cardboard cutout shaped like the U.S." AndyTheGrump (talk) 21:04, 1 June 2023 (UTC)
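
To illustrate the point about centroids, here is a toy calculation on a flat plane. The shapes are entirely hypothetical and the function name `polygon_centroid` is an assumption; nothing here models the actual outline of the U.S. or the curvature of the Earth. An L-shaped region and the full square obtained by filling in its missing corner have the same northernmost, southernmost, easternmost and westernmost points, yet their centroids differ.

```python
def polygon_centroid(vertices):
    """Centroid of a simple polygon given as a list of (x, y) vertices in order,
    computed with the standard shoelace-based formula."""
    twice_area = 0.0
    cx = cy = 0.0
    for (x0, y0), (x1, y1) in zip(vertices, vertices[1:] + vertices[:1]):
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    # Cx = sum / (6 * area), and 6 * area == 3 * twice_area.
    return cx / (3 * twice_area), cy / (3 * twice_area)


if __name__ == "__main__":
    # Hypothetical L-shaped territory: a 2x2 square with its north-east corner missing.
    l_shape = [(0, 0), (2, 0), (2, 1), (1, 1), (1, 2), (0, 2)]
    # The same territory after the missing corner is added.
    full_square = [(0, 0), (2, 0), (2, 2), (0, 2)]
    print("before:", polygon_centroid(l_shape))
    print("after: ", polygon_centroid(full_square))
```

The extreme points are identical in both cases, so the shift of the balance point comes purely from the added interior area.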