Talk:Empirical Bayes method

Data sets are conditionally independent given the prior parameters
As far as I understand from the modern Bayesian perspective empirical Bayes is about hierarchical Bayesian models and learning the parameters of a prior distribution by sharing that distribution across different data sets. In other words it is an assumption that these data sets are conditionally independent given the prior parameters. I don't detect any of that in this article. Am I missing the point or something?


 * I don't see how you're failing to detect it. Look at the quantity called &Theta; in the article. Michael Hardy 22:44, 30 July 2006 (UTC)

So the point is that there are several approaches to Empirical Bayes, and what is presented here is simply the "Robbins" method (see Carlin and Louis).

In a broader context, Empirical Bayes is about using empirical data to estimate and/or evaluate the marginal distribution that arises in the posterior, from Bayes' theorem. For simple models (with simple conjugate priors such as Beta-Binomial, Gaussian-Gaussian, Poisson-Gamma, Multinomial-Dirichlet, Uniform-Pareto, etc.), there are several simple and elegant results that basically estimate the marginal using the maximum likelihood estimate (MLE), and then a simple point estimate for the posterior (i.e. a point estimate for the prior). The basic results are quite easy to interpret (i.e. as a linear regression) and implement. It would be nice to have discussion on these topics here, and some summary of the basic models.

A good example would be to work out the Beta-Binomial model, since this model is somewhat complicated and is a good starting point for modeling small, discrete data sets.
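As a rough illustration of what such a Beta-Binomial example could look like, here is a minimal sketch. It makes several assumptions that are not in the discussion above: every group has the same number of trials `n`, the hyperparameters are fitted by the method of moments on the marginal (used here in place of the MLE because it has a closed form), and the function names and data are made up for illustration.

```python
def fit_beta_binomial_moments(counts, n):
    """Method-of-moments fit of Beta(alpha, beta) from Beta-Binomial counts.

    Assumes every group has the same number of trials n (an assumption of
    this sketch) and that the counts are overdispersed relative to a plain
    binomial, otherwise the moment equations have no valid solution.
    """
    k = len(counts)
    m1 = sum(counts) / k                  # first sample moment of the counts
    m2 = sum(c * c for c in counts) / k   # second sample moment of the counts
    denom = n * (m2 / m1 - m1 - 1) + m1
    if denom <= 0 or n * m1 - m2 <= 0:
        raise ValueError("moment equations have no valid solution for these data")
    alpha = (n * m1 - m2) / denom
    beta = (n - m1) * (n - m2 / m1) / denom
    return alpha, beta


def posterior_means(counts, n, alpha, beta):
    """Posterior mean of each group's success probability under Beta(alpha, beta)."""
    return [(alpha + c) / (alpha + beta + n) for c in counts]


# Hypothetical data: successes out of n = 10 trials in ten groups.
counts = [0, 1, 2, 2, 3, 3, 4, 5, 7, 9]
n = 10
alpha, beta = fit_beta_binomial_moments(counts, n)
shrunk = posterior_means(counts, n, alpha, beta)
for c, s in zip(counts, shrunk):
    print(f"raw {c / n:.2f} -> shrunk {s:.3f}")
```

Each shrunk estimate lies between the group's raw proportion and the pooled proportion, which is the "borrowing" behaviour that the parametric approach is usually credited with.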

Once the basic ideas are laid out, it would be good to then add sections on computational methods for complex models, such as Expectation Maximization, Markov Chain Monte Carlo, etc.

Some applications would be nice too, such as modeling consumer marketing data (which is what I do with it) —Preceding unsigned comment added by Charlesmartin14 (talk • contribs)

deletion of a misguided paragraph
This section was recently added to the article (this version is after some cleanups including misspellings, fixing links, some TeX corrections, etc.):


 * Bayes' theorem


 * In the Bayesian approach to statistics, we consider the problem of estimating some future outcome or event, based on measurements of our data, a model for these measurements, and some model for our prior beliefs about the system. Let us consider a standard two-stage model, where we write our data measurements as a vector $$y = \{y_1, y_2, \dots, y_n\}$$, and our prior beliefs as some vector of random unknowns $$\theta$$. We assume we can model our measurements with a conditional probability distribution (the likelihood) $$\Pr(y|\theta)$$, and also the prior as $$\Pr(\theta|\eta)$$, where $$\eta$$ is some hyper-parameter. For example, we might choose $$\Pr(y|\theta)$$ to be a binomial distribution, and $$\Pr(\theta|\eta)$$ as a Beta distribution (the conjugate prior). Empirical Bayes then employs empirical data to make inferences about the prior $$\theta$$, and then plugs this into the likelihood $$\Pr(y|\theta)$$ to make estimates for future outcomes.

This could leave the impression that empirical Bayes methods are an instance of the Bayesian approach to statistics. But that is incorrect: the Bayesian approach is about the degree-of-belief interpretation of probability.

This could also leave the impression that the Bayesian approach is about estimating FUTURE outcomes or events. It's not. (In some cases it may be about the future, but that's certainly nowhere near essential.)

This characterization of "likelihood" fails to make clear that the likelihood is a function of &theta; and not of y. It also works only for discrete distributions, whereas likelihood is more general than that.

It uses the word "hyperparameter" without explanation.

After the words "for example", the examples are far too terse to be comprehensible. The examples the article already gave are comprehensible. More could be useful if they were presented in the same way. Michael Hardy 21:23, 27 September 2006 (UTC)

This is an attempt to provide some additional information about how Empirical Bayes is done in practice and the basic formulation from first principles, as opposed to just saying "use Bayes' theorem and the result pops out!" It is a first start... these things do take time.

Some comments:

(1) I am not sure what you mean by saying the likelihood is not a function of y. Perhaps you mean it is not a function of the data? The point is that Empirical Bayes will eventually plug this in. (2) Add a section on the hyperparameter to define it... why did you delete it? (3) If you don't like the word "future", then change it. Why did you delete everything?

(4) You leave the impression that Empirical Bayes is nothing more than Robbins' method?! Under your logic, I would delete the entire page and just start from scratch.

(5) Empirical Bayes (EB) is an approach to Bayesian statistics which combines the Bayesian formalism and empirical data. Again, what's the problem? Here the issue is more between the Robbins-style non-parametric EB and the Carlin and Louis style parametric EB.

(6) The examples ARE NOT comprehensible... you did not explain anything except give Robbins' formula and explain how to plug in the results... you need to explain WHY Robbins is doing what he is doing in the more general context, based on the rigor and presentation in other areas of probability theory on Wikipedia. For example, you could explain that Robbins' method is actually the Bayes estimate of the prior under squared error loss.

The point of the article should be to explain the primary models involved with some rigor and derivation, such as the Beta-Binomial, the Poisson-Gamma, etc, since these are commonly used and not explained elsewhere.


 * The likelihood function is $$L(\theta) = f_\theta(x)$$, where $$f_\theta$$ is the probability density function given that $$\theta$$ is the value of the parameter. The argument to the function $$L(\theta)$$ is of course $$\theta$$.


 * No, empirical Bayes is more than Robbins' examples. I have no problem with additional examples.


 * There's nothing inherently Bayesian about empirical Bayes methods. Empirical Bayes methods that are most often used are not Bayesian.  The mere fact that Bayes' theorem is used does not make it any more Bayesian than it would otherwise be.  Bayesianism is the degree-of-belief interpretation, as opposed to the frequency interpretation or some others, of probability.


 * The example I referred to is one that you very tersely mentioned but did not explain. The example that was already there was explained.  In the examples section below the paragraph I criticized, you could add some fully explained examples. Michael Hardy 22:48, 27 September 2006 (UTC)

This issue about the functional dependence is merely notation... I am simply following the convention in the Wikipedia entry on Bayes' theorem and other well-known treatises on conditional distributions, such as Carlin and Louis to papers. It would be good to have a consistent notation across the Wikipedia page after some other issues are cleaned up. I fail to see the need to use the term "bound variable" because that is really confusing to anyone who is not a programmer, and especially here, since the point of Empirical Bayes is to "unbind the variables" and approximate them with their empirical counterparts in the marginal.


 * I am not a programmer. The term "bound variable" is older than computers and older than computer software. Michael Hardy 17:33, 29 September 2006 (UTC)

The formulae presented are fine for discrete and continuous distributions? What specifically do you mean?

As for being inherently Bayesian, the point is that Empirical Bayes methods use empirical data to approximate the Bayesian marginal and/or posterior distribution under certain approximations (such as squared error loss, Stein estimation, maximum likelihood estimate (MLE), etc.), or they may use computational methods to approximate the marginal (Gaussian quadrature, Metropolis Monte Carlo, Markov chain Monte Carlo, etc.). This is true with Robbins' method, with the Beta-Binomial model, with Bayesian regression, etc. Each "example" uses some combination of these approximations (i.e. Robbins is a point estimate assuming a non-informative, unspecified prior and squared error loss).


 * I strongly object to the inclusion of numerical and simulation methods in this page as a type of empirical Bayes analysis. These computational methods do not modify the nature of the Bayesian inference, as they only induce some numerical error in the exploitation of the posterior. Empirical Bayes methods use the data to construct (part of) the "prior" and are thus conducting another form of inference that is not entirely Bayesian. Empirical has a specific meaning in the field, it does not mean approximate or ad hoc. Ethel Mannin (talk) 09:12, 24 July 2021 (UTC)

The current explanation of Robbins' method for Empirical Bayes does not clearly explain how the marginal is being approximated... indeed, you just refer to it as the "normalizing constant", and while there is a Wikipedia entry on this, it is just not transparent and not the terminology used in some of the popular literature on Empirical Bayes (Carlin and Louis, Rossi, etc.). It is also confusing since it is not actually constant (i.e. it will be a function of any hyperparameters as well as the data that appears in the likelihood).

You should take out the comments like "That is why the word Bayes appears" and "that is why the word empirical appears" and, instead, explain concisely but from general first principles what is going on. I have tried to add some of this in the introduction. One good formulation, at least for Robbins' method, is to show that Bayes' rule arises as a point estimate when you minimize the posterior squared error loss (i.e. risk minimization), and it takes about 3-4 lines of basic calculus.
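To make the point about the marginal concrete, here is a minimal sketch of Robbins' estimator, assuming Poisson-distributed observations (the standard setting for Robbins' formula). Under squared error loss the Bayes estimate is the posterior mean, which for a Poisson likelihood equals (y + 1) p(y + 1) / p(y), where p is the marginal distribution of the data. The sketch replaces p by observed frequencies. The function name and the data are invented for illustration.

```python
from collections import Counter


def robbins_estimate(observations, y):
    """Robbins' nonparametric empirical Bayes estimate of E[theta | y].

    Assumes each observation is Poisson(theta_i) with theta_i drawn from an
    unspecified prior. The marginal probabilities p(y) and p(y + 1) are
    replaced by the observed counts N(y) and N(y + 1); the sample size
    cancels in the ratio.
    """
    counts = Counter(observations)
    if counts[y] == 0:
        raise ValueError("no observations equal to y; the estimate is undefined")
    return (y + 1) * counts[y + 1] / counts[y]


# Hypothetical data, e.g. accident counts for ten individuals.
observations = [0, 0, 0, 0, 1, 1, 1, 2, 2, 3]
for y in sorted(set(observations)):
    print(y, robbins_estimate(observations, y))
```

Note that the estimate for the largest observed value is always zero, because no larger value has been seen. This is a known weakness of the raw frequency ratio, and it is one reason smoothed or parametric versions get used in practice.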

I have included some of this and will need to fix up the notation to complete it...IMHO it would be important to present the basic derivations and their motivations

Your example with the normal distribution is incomplete because you are describing, again, a very specific case of Bayesian Regression, whereas a more complete discussion would at least include the Parametric Point Estimates for Gaussian and its conjugate priors (either for unknown mean and known variance, or for unknown mean and unknown variance).

Most importantly, the article should explain the difference between Non-Parametric and Parametric EB and also discuss the basic result of the Parametric EB and Point Estimation, which include the notions of information "borrowing,"  "shrinkage," and the trade-off between bias and variance

I have cleaned up the proof considerably by explaining why we can use a point estimate for the prior and clarified that we are essentially estimating the marginals. This will also make it easier to add a section on parametric Empirical Bayes. Clearly much more work is needed!

The next planned step is to add a section on Parametric Empirical Bayes, to derive the Beta-Binomial model, and to provide a numerical example.


 * Writing Pr(y|&theta;) is at best valid only for discrete distributions. "Pr" means probability, and must be between 0 and 1 (inclusive).  For continuous distributions, you need probability density, which, of course, need not be less than 1, and writing "Pr" for probability density seems rather strange to me. Michael Hardy 17:35, 29 September 2006 (UTC)


 * The current version of the page already uses a more common notation
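The distinction raised above between a probability and a probability density can be checked numerically. The following is a small sketch using only the Python standard library; the particular distributions and parameter values are arbitrary choices for illustration.

```python
from math import comb
from statistics import NormalDist

# A probability density may exceed 1: a normal density with a small
# standard deviation is tall and narrow around its mean.
density_at_mean = NormalDist(mu=0.0, sigma=0.1).pdf(0.0)


def binomial_pmf(k, n, theta):
    """Probability of k successes in n trials with success probability theta."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)


# A probability mass function, by contrast, takes values in [0, 1]
# and sums to 1 over the support.
pmf_values = [binomial_pmf(k, 10, 0.3) for k in range(11)]

print("normal density at the mean:", density_at_mean)
print("largest binomial probability:", max(pmf_values))
print("sum of binomial probabilities:", sum(pmf_values))
```

The density value here is roughly 3.99, so a notation that suggests "probability" for it is at least potentially misleading, while the binomial values behave as probabilities should.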

I have now sketched out mathematical details for the so-called "example for a normal distribution". There are some subtleties here I have avoided (such as a proper derivation of the conjugate priors), among other things.

It would be good to create an example which uses the results of the mathematical derivation. It would also be good to add specific sections for the Beta-Binomial model, which is non-trivial, and, perhaps, the multivariate linear regression model (which is commented on in Estimation of covariance matrices).

Some confusions
In the introduction, I understand most of it until the sentence "Empirical Bayes then employs the complete set of empirical data to make inferences about the prior θ, and then plugs this into the likelihood ρ(y | θ) to make estimates for future outcomes of individual measurements." First of all, what kind of estimation won't use the "complete set of empirical data"? I am not an expert in statistics, so forgive me if I am ignorant. As far as I know, for estimation it is always better to use the entire dataset, unless you are talking about cross-validation. Secondly, what do you mean by "plugs this estimation into the likelihood"? It would be better to write the statement as a mathematical equation. I am guessing you are saying that one predicts new data $$y_\text{new}$$ by $$p(y_\text{new}|y_\text{old}) = p(y_\text{new}|\theta)\,p(\theta|y_\text{old})$$. Is that correct? —Preceding unsigned comment added by 76.121.137.101 (talk) 10:23, 18 February 2008 (UTC)

The entire section about using an iterative approximate approach to computing the hierarchical posterior should be removed, as (a) it is not an empirical Bayes approach, (b) it requires a hyperprior, and (c) it corresponds to a specific form of Gibbs sampling.

Clarity
This article should be re-written. I submit that it is way over the head of most Wikipedia readers. The Gunning Fog Index of the Introduction paragraph is over 19, indicating a post-graduate education is needed to understand it. The introductions should explain the basic concept to the average reader, even if the particulars are too technical.

I came to this article because Empirical Bayes is becoming common in highway safety research, but despite 3 years of math in college, and 16 years experience as a traffic engineer, I learned nothing from this article.--Triskele Jim (talk) 18:08, 8 January 2009 (UTC)


 * Agreed. I tried to write a short, to-the-point intro. —Preceding unsigned comment added by 84.238.115.164 (talk) 17:35, 20 February 2010 (UTC)

Shrinkage/James-Stein estimators
The famous James-Stein estimator (the classic example of a shrinkage estimator) for the mean of a Gaussian has its theoretical foundation in empirical Bayes. It results from applying empirical Bayes to the estimation of the covariance matrix of a Gaussian prior with fixed mean zero. The JS estimate is simply the Bayes estimator for the resulting posterior. Would be interesting to add a section about this. —Preceding unsigned comment added by 84.238.115.164 (talk) 17:17, 20 February 2010 (UTC)
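For reference, here is a minimal sketch of the estimator being discussed, assuming observations y ~ N(θ, σ² I) with known variance `sigma2` and a prior θ ~ N(0, τ² I). The Bayes estimator is (1 - σ²/(σ² + τ²)) y, and since (p - 2)/‖y‖² is an unbiased estimate of 1/(σ² + τ²) under the marginal of y, plugging it in gives the James-Stein rule. The function name and example vector are made up for illustration.

```python
def james_stein(y, sigma2=1.0):
    """James-Stein estimate of a mean vector, shrinking toward zero.

    Assumes y ~ N(theta, sigma2 * I) with known sigma2 and dimension p >= 3.
    The shrinkage factor 1 - (p - 2) * sigma2 / ||y||^2 is the empirical
    Bayes plug-in for the Bayes factor 1 - sigma2 / (sigma2 + tau2).
    """
    p = len(y)
    if p < 3:
        raise ValueError("the James-Stein estimator requires dimension p >= 3")
    norm2 = sum(v * v for v in y)
    if norm2 == 0:
        raise ValueError("the estimator is undefined when y is the zero vector")
    shrink = 1.0 - (p - 2) * sigma2 / norm2
    return [shrink * v for v in y]


# Hypothetical observation vector with squared norm 25 and p = 4,
# so the shrinkage factor is 1 - 2/25 = 0.92.
y = [1.0, 2.0, 2.0, 4.0]
print(james_stein(y))
```

This is the plain version; the shrinkage factor can become negative when ‖y‖² is small, which the positive-part variant avoids by truncating the factor at zero.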

Original meaning of the term
[This is my first post. Please be forgiving] The term "Empirical Bayes" is used in many contexts with many different meanings. A short review is suggested in I. J. Good, “Introduction to Robbins (1955) An empirical Bayes approach to statistics,” Breakthroughs in Statistics: Foundations and basic theory (1992): 379. In a nutshell, it was originally used in an estimation problem with a physical (frequentist, non-epistemic), non-parametric prior distribution. Soon enough it started being used for "pseudo-Bayes" problems with a parametric prior, and even for pure Bayes (epistemic/De Finetti prior) with a composite parametric prior, even though the original EB was purely frequentist. I believe emphasis should be put on the frequentist nature of the "original" empirical Bayes, versus its epistemic spin-offs.


 * What does epistemic mean in this context? As for what the article should focus on, I'd say that the essence of Empirical Bayes is that it is a frequentist's approach for selecting the hyperparameters that determine a Bayesian's prior distribution for the parameters.  — Quantling (talk | contribs) 14:20, 10 February 2011 (UTC)


 * By "epistemic" I mean subjective prior. I.e.- the prior describes my beliefs and not the relative frequencies in some hyper-population. John.ros (talk) 15:50, 12 February 2011 (UTC)

Error in Poisson-Gamma model
There seems to be an error (or simply poor notation) in the final example. In particular, the example is using y as the full data, but the posterior is apparently only using 1 data point. Then the final Bayes estimator includes $$\bar y$$, but this is only correct for a sample of size one. For a sample of size n, the estimator should be $$\frac{n \bar y + \alpha}{n+1/ \beta}$$.
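A quick sketch of the estimator proposed above, assuming a Poisson likelihood and a Gamma prior with shape α and scale β (so the prior rate is 1/β), which is the parameterization the formula implies. The posterior is then Gamma with shape α + nȳ and rate n + 1/β, and its mean is the expression given. In an actual empirical Bayes analysis α and β would be estimated from the marginal; here they are simply passed in, and the numbers are invented.

```python
def poisson_gamma_posterior_mean(ys, alpha, beta):
    """Posterior mean of a Poisson rate under a Gamma(shape=alpha, scale=beta) prior.

    With n observations the posterior is Gamma(shape=alpha + sum(ys),
    rate=n + 1/beta), so the mean is (n * ybar + alpha) / (n + 1/beta).
    """
    n = len(ys)
    if n == 0:
        raise ValueError("at least one observation is required")
    return (sum(ys) + alpha) / (n + 1.0 / beta)


# Hypothetical hyperparameters and data.
alpha, beta = 2.0, 0.5
print(poisson_gamma_posterior_mean([3], alpha, beta))        # sample of size one
print(poisson_gamma_posterior_mean([2, 4, 3], alpha, beta))  # sample of size three
```

For n = 1 this reduces to (y + α) / (1 + 1/β), which matches a formula written with ȳ only when the sample has a single observation.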

I'd make the change myself, but it would almost surely be reverted before anyone even looks at the change. — Preceding unsigned comment added by 130.113.126.119 (talk) 18:45, 31 July 2012 (UTC)