Talk:Stein's example

In several places this article had things like this:


 * $$[\frac{1}{2}]$$

where the brackets were too small. One fixes this by using \left and \right with this result:


 * $$\left[\frac{1}{2}\right].$$

We also saw


 * $$e^{\frac{-1}{2}}$$

and the like, i.e., "stacked" fractions in exponents. These are hideous, and I changed them to


 * $$e^{-(1/2)}\,$$

etc. In several cases, TeX's matrix environment was used to align "=" on successive lines. This has several bad effects, including making fractions look smaller. In some cases, the matrix environment was used when there was only one line! I got rid of the matrix environment. I'd use it for actual matrices, but not for this. Michael Hardy 00:18, 23 October 2005 (UTC)

The second instance of R(theta, d) would be clearer as R(theta, d'), since it is computed for the new estimator d'.

Also, it seems strange that there is no reference to the Scientific American article that clearly explains the notion; it is where many of us first saw this concept:

B. Efron, C. Morris, "Stein's paradox in statistics", Sci. Am. 238(5), pp. 119-127 (1977)

c. landauer cal@aero.org

71.118.91.16 00:42, 29 January 2006 (UTC)

(Was) too technical
This article needs a rewrite: Any opinions before I go and rewrite this? --Zvika 12:54, 12 October 2006 (UTC)
 * The lead does not give any information about what Stein's example is, only about the context.
 * You need to read through almost a screenful, including a few equations, before you begin to understand what the concept means.
 * The proof sketch is long and technical, and not very insightful; I am not sure if it qualifies as encyclopedic, and even if it does, it should probably be placed at the end of the article, or in a subpage.


 * Well guys, I've put a lot of work into the new article, I hope you will like it! --Zvika 07:45, 1 December 2006 (UTC)

I am disappointed by the deletion of the hierarchical Bayes example (you removed the Interpreting Stein's Example section, moved part of this elsewhere, but deleted this section). To be sure, I wrote that piece, but it was following good authority (Berger's book, and personal conversations with him), and I think that this example helps to understand how Stein's example can arise (since the Bayesian example automatically produces an admissible decision rule). Did you have a specific reason to delete this section? Bill Jefferys 00:35, 14 March 2007 (UTC)


 * Hi, I'm glad to see some response to my edits, it means I am not the only one in the world interested in this topic! When I rewrote the article, I tried to keep as much of the original content as possible. In this case I thought the explanation was a bit unclear and could not figure out exactly what was meant. In the end, seeing that nobody seemed to have anything to say about it, I replaced it with the current explanation. I confess that this is also a result of the fact that I find the hierarchical Bayes explanation somewhat problematic: Who is to say what prior we are to choose? Why does one prior dominate the ordinary estimator while another does not? And, most importantly, how come the resulting estimator has lower risk, even for values of $$\theta$$ which are highly unlikely under the chosen prior? IMO, the frequentist explanation offered in the current version is much more convincing.
 * All the same, you are correct in that the empirical Bayes story is commonly used to explain the phenomenon. I will be very happy to cooperate with you in placing both explanations in the article. For example, we could make two subsections within Stein's example, one with the empirical Bayes explanation and one with the frequentist explanation. However, we will need to address the problems I mentioned above. If you are interested, why don't you try writing something like this, maybe based on the version you had before I came along? I will try to edit it constructively.
 * --Zvika 09:41, 14 March 2007 (UTC)

Sounds like a plan. Be patient, my plate is rather full.

The point is that given a loss (e.g., square error) and proper priors, the generalized Bayes rule will virtually always be admissible. On the other hand, if one used improper flat priors, one would get the inadmissible rule that arises, e.g., from the MLE. This provides a close connection between the Bayesian approach and Stein's example. The James-Stein estimator, and related ones such as Efron-Morris, are not admissible in general; what they do is give a specific example that dominates the usual rule, thus demonstrating the inadmissibility of the latter. Recall that Stein's original proof was only an existence proof and did not display a dominating rule. On the other hand, by minimizing the expected loss in a Bayesian context, we constructively obtain a decision rule that is admissible. Bill Jefferys 13:48, 14 March 2007 (UTC)


 * This is exactly the point: Stein's example is not about admissibility but rather about inadmissibility. The surprising result is not that Bayesian estimators are admissible, but that the standard estimator that everyone uses is inadmissible. What is the relevance of showing other, admissible estimators? The real question is why doesn't the MLE work as well as we'd expect. The best response that I have seen yet is the one about the expected norm of the MLE which is too large, as written in the current version. --Zvika 19:28, 14 March 2007 (UTC)
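
For reference, the "expected norm is too large" point can be written out in one line. This is only a sketch under the usual setup, which is an assumption here since the thread does not state it: $$X \sim N(\theta, \sigma^2 I_n)$$, with the MLE of $$\theta$$ being $$X$$ itself.

$$E\|X\|^2 = \sum_{i=1}^n E[X_i^2] = \sum_{i=1}^n (\theta_i^2 + \sigma^2) = \|\theta\|^2 + n\sigma^2 > \|\theta\|^2.$$

So on average the MLE overshoots the squared length of $$\theta$$ by $$n\sigma^2$$, which is what makes shrinking it plausible.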

I guess I see your point and reasoning for not including the Bayesian example here. But surely it deserves mentioning somewhere, if not in this article then somewhere else (and crosslinked to this one). I mean, if you go to the Efron-Morris article, the example they give for baseball statistics is easily implemented as a hierarchical Bayes model with either a binomial or an (approximately) normal model, and the results are essentially the same numerically as Efron and Morris got using their estimator (I presume that the article used E-M rather than J-S, but I don't have it in front of me). Jim Berger has been quite insistent that when doing hierarchical Bayes, it's really important to check for admissibility, since the usual priors on scale variables can give badly inadmissible results. Unfortunately, there doesn't seem to be a page on hierarchical Bayes, at least I can't find one. The page on admissible decision rule says something, but no example is given. Any ideas? Bill Jefferys 20:32, 14 March 2007 (UTC)


 * You're right, the only mention I could find of hierarchical Bayes is a redlink to hierarchial Bayesian model in empirical Bayes method. I see hierarchical Bayes as a special case of empirical Bayes, so unless you want to write a particularly detailed account, I think an example paragraph describing the idea in hierarchical Bayesian model will suffice for starters. We could then add a short paragraph to Stein's example (and maybe also to James-Stein estimator) saying that there is an empirical Bayes derivation of the JS estimator, and linking to empirical Bayes method (and also citing Efron and Morris' empirical Bayes derivation). --Zvika 07:08, 16 March 2007 (UTC)

Hierarchical Bayes is Bayesian; Empirical Bayes is not. Jim Berger (private communication) has stated this very clearly to me. Thus, I cannot agree that HB is a "special case" of EB.

There needs to be a real article on HB. I have some free time coming up next semester, and I hope I can get to it. Bill Jefferys (talk) 02:24, 21 November 2007 (UTC)

Technical error regarding admissibility/inadmissibility
The definition of admissibility requires strict inequality for some $$\theta$$. Otherwise no admissible rule would exist if at least two rules satisfy the "less than or equal" criterion for all theta but none has "less than" for some theta (under the definition in the article before I revised it). Bill Jefferys (talk) 02:19, 21 November 2007 (UTC)
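
In symbols, writing $$R(\theta, d)$$ for the risk of a rule $$d$$ (the notation used earlier on this page), the definition with the strict inequality reads roughly as follows. The name $$d'$$ for the competing rule is only illustrative.

$$d' \text{ dominates } d \iff R(\theta, d') \le R(\theta, d) \text{ for all } \theta \quad \text{and} \quad R(\theta, d') < R(\theta, d) \text{ for some } \theta.$$

A rule $$d$$ is then admissible if no rule dominates it. Without the strict part, two rules with identical risk functions would each "dominate" the other, and neither could be admissible.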

Variances
Does this depend on the variances being known, or unknown but equal, or at least known to be in a given ratio? If so, then that is quite an assumption and needs to be stated explicitly and early in the article. Is it true if the underlying distributions are completely unknown? --Rumping (talk) 14:24, 22 December 2008 (UTC)
 * The answers to many of these questions can be found in James-Stein estimator. --Zvika (talk) 18:21, 22 December 2008 (UTC)
 * I had read that. Hence my question. It looks to me that the answer to the first question mark is "yes" and the answer to the second might be "no". But I am not sure. --Rumping (talk) 01:45, 23 December 2008 (UTC)

Implications Section Wrong?
The implications section provides the example of using a shrinkage estimator to estimate a set of unrelated statistics (US wheat yield, spectators at Wimbledon and the weight of a candy bar). It says that on average, the new estimator is better. If I understand the shrinkage estimator correctly, this only holds as long as the three statistics are drawn from a distribution whose mean exists. For example, it works if the true US wheat yield, the true number of spectators at Wimbledon, and the true weight of a candy bar are random variables stemming from a normal distribution. However, it would not work if they are random variables stemming from a Cauchy distribution (note that the optimal shrinkage factor alpha goes to zero as the variance of these three statistics goes to infinity). Note that in both cases, they are "unrelated" if unrelated means that their correlation is 0. However, for the shrinkage estimator to work, they must still have "something to do with each other", namely being drawn from a distribution whose mean exists (and which probably must also have a finite variance). Unfortunately, that is all based on my own calculations, so these thoughts do not conform to Wikipedia's standards. It would be good to have a source for this. — Preceding unsigned comment added by 2A02:1205:5076:CE60:9495:5865:8674:7A67 (talk) 09:33, 17 June 2015 (UTC)

Actually, the startling and revolutionary fact about the Stein estimator is that the θ's do NOT have to be drawn from any distribution. The method of the proof uses hierarchical thinking as motivation, but then strips away the marginalization over the θ vector, so that the final result is frequentist and therefore applies conditional on any and every θ vector. The famous Efron-Morris paper makes that clear. Notsofeo (talk) 18:26, 7 September 2015 (UTC)
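
A quick Monte Carlo check of this point, offered as a sketch rather than a proof. The θ vector below is fixed by hand and not drawn from any distribution; its values, the function names, the unit-variance normal noise, and the use of the plain (not positive-part) James-Stein form that shrinks toward the origin are all assumptions made for illustration.

```python
import random


def james_stein(x, sigma2=1.0):
    """Plain James-Stein estimate: shrink x toward the origin (needs len(x) >= 3)."""
    p = len(x)
    norm2 = sum(v * v for v in x)
    shrink = 1.0 - (p - 2) * sigma2 / norm2
    return [shrink * v for v in x]


def simulate_risks(theta, trials=50_000, seed=0):
    """Estimate the total squared-error risk of the MLE and of James-Stein
    for one fixed theta, using X_i ~ N(theta_i, 1) independently."""
    rng = random.Random(seed)
    mle_loss = 0.0
    js_loss = 0.0
    for _ in range(trials):
        x = [rng.gauss(t, 1.0) for t in theta]  # the MLE is x itself
        js = james_stein(x)
        mle_loss += sum((xi - t) ** 2 for xi, t in zip(x, theta))
        js_loss += sum((ji - t) ** 2 for ji, t in zip(js, theta))
    return mle_loss / trials, js_loss / trials


# Hypothetical fixed parameter vector; no prior is placed on it.
theta = [1.0, -0.5, 2.0, 0.3, -1.2]
mle_risk, js_risk = simulate_risks(theta)
print(f"MLE risk: {mle_risk:.3f}  James-Stein risk: {js_risk:.3f}")
```

The MLE risk should come out close to the dimension (5 here), with the James-Stein risk below it. The gap shrinks as the norm of θ grows, so for very large θ a simulation of this size may not resolve it.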

Intuitive Explanation Section
The "intuitive explanation" section doesn't sound right to me. Can anyone note a source for it? A textbook maybe? Wbrinda (talk) 10:09, 10 March 2012 (UTC)

- I didn't really understand this phenomenon from the "intuitive explanation". The sentence that helped me to understand comes much later in the article:

"This occurs because the cost of a bad estimate in one component of the vector is compensated by a better estimate in another component."

I would suggest that this explanation or something like it should appear near the top of the article, for the benefit of simpler minded folk like myself. 103.1.70.178 (talk) 18:40, 16 June 2013 (UTC)

Intuitive Explanation
Given n≥3 population means, Stein's method will have a lower mean squared error than just choosing sample values as estimators. But, given n sample values, choosing the sample values is the "optimal" decision. This follows from it being the optimal solution for each mean, and linearity of expectation.

It seems like Stein's method takes advantage of the population means being finite by slightly underestimating their magnitude. Say the sample is 7, then 6 and 8 are not treated as equally likely population means. 6 is "more likely" (and in general the population mean is likely to be slightly lower than the sample).

By multiplying the sample value by a factor that is less than 1 but close enough to 1 before using it, we reduce large sample values more than small ones, resulting overall in a lower MSE. 2600:1008:B007:530D:8A6:BA6A:FA37:C28F (talk) 12:11, 4 September 2022 (UTC)
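
The effect of a fixed scaling factor can be worked out directly. This is a sketch under assumptions not stated in the comment above: $$X \sim N(\theta, I_n)$$ and a constant factor $$c$$ that does not depend on the data. The cross term vanishes because $$E[X - \theta] = 0$$:

$$E\|cX - \theta\|^2 = E\|c(X - \theta) - (1 - c)\theta\|^2 = c^2 n + (1 - c)^2 \|\theta\|^2.$$

Setting the derivative with respect to $$c$$ to zero gives

$$c^* = \frac{\|\theta\|^2}{\|\theta\|^2 + n} < 1,$$

so some factor below 1 always beats $$c = 1$$ (whose risk is $$n$$). The catch is that $$c^*$$ depends on the unknown $$\theta$$. The James-Stein estimator instead uses a data-dependent factor, and it is that version which requires n≥3.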