Talk:Conjugate prior

Article seems to mix up 'Posterior' and 'Likelihood'
At the top of the article it says that a conjugate prior is the case when the prior and posterior are of the same family. However, the rest of the article implies that it is the likelihood and prior that need to be of the same family for it to be a conjugate prior; e.g., the 'Table of conjugate distributions' only lists pairs of likelihoods and priors while not mentioning the posterior.
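To make the definition concrete, the Beta-Binomial case can be checked numerically: the prior is a Beta, and multiplying by a binomial likelihood yields a posterior that is again a Beta. A minimal sketch (the hyperparameters and data below are made up for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical prior Beta(a, b) on a Bernoulli success probability p
a, b = 2.0, 3.0
n, k = 10, 7  # suppose we observe 7 successes in 10 trials

# Unnormalized posterior on a grid: prior(p) * likelihood(k | n, p)
p = np.linspace(1e-6, 1 - 1e-6, 10001)
unnorm = stats.beta.pdf(p, a, b) * stats.binom.pmf(k, n, p)
posterior = unnorm / (unnorm.sum() * (p[1] - p[0]))

# Conjugacy: the posterior is Beta(a + k, b + n - k), the same family as the prior
conjugate = stats.beta.pdf(p, a + k, b + n - k)
max_err = np.max(np.abs(posterior - conjugate))
```

Here the likelihood is binomial while the prior and posterior are both Beta, which is the sense in which the Beta is "conjugate to" the binomial.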

Untitled
You may want to see some of the pages on Empirical Bayes, the Beta-Binomial Model, and Bayesian Linear Regression. Charlesmartin14 23:43, 19 October 2006 (UTC).

This article and http://en.wikipedia.org/wiki/Poisson_distribution#Bayesian_inference disagree about the hyper parameters to the posterior. —Preceding unsigned comment added by 98.202.187.2 (talk) 00:57, 12 May 2009 (UTC)
 * This was my comment last night, sorry I didn't sign it, I've just corrected this on the page. Bazugb07 (talk) 14:32, 12 May 2009 (UTC)

Could someone fill in the table for multivariate normals and Pareto? 128.114.60.100 06:21, 21 February 2007 (UTC)

It would be nice to actually state which parameters mean what, since the naming in the table does not correspond to the naming on the pages for the corresponding distributions (at the moment I have a problem figuring out which of the hyperparameters for the prior for the normal (variance and mean) belong to the inverse gamma distribution and which to the normal) —Preceding unsigned comment added by 129.26.160.2 (talk) 10:44, 14 September 2007 (UTC)

For the Gamma likelihood with a prior over the rate parameter, the posterior parameters are $$\alpha_0+n\alpha,\ \beta_0+\sum_n x_n$$ for $$n$$ observations. This is in the Fink reference. Paulpeeling (talk) 11:53, 24 May 2008 (UTC)
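For what it's worth, this update can be sanity-checked numerically by comparing the claimed Gamma posterior against a grid evaluation of prior × likelihood; a minimal sketch assuming the rate parameterization throughout (the shape, hyperparameters, and simulated data are made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 2.0        # known shape of the Gamma likelihood
a0, b0 = 3.0, 1.5  # hypothetical Gamma(a0, rate b0) prior on the unknown rate
x = rng.gamma(shape=alpha, scale=1.0 / 2.5, size=8)  # simulated data, true rate 2.5
n, sx = len(x), x.sum()

# Unnormalized posterior over the rate on a grid
beta = np.linspace(1e-4, 20.0, 20001)
log_unnorm = stats.gamma.logpdf(beta, a0, scale=1.0 / b0) \
    + stats.gamma.logpdf(x[:, None], alpha, scale=1.0 / beta[None, :]).sum(axis=0)
unnorm = np.exp(log_unnorm - log_unnorm.max())
posterior = unnorm / (unnorm.sum() * (beta[1] - beta[0]))

# Claimed conjugate update: alpha_0 + n*alpha, beta_0 + sum(x)
conjugate = stats.gamma.pdf(beta, a0 + n * alpha, scale=1.0 / (b0 + sx))
max_err = np.max(np.abs(posterior - conjugate))
```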

May want to consider splitting the tables into scalar and multivariate conjugate distributions.

Changed "assuming dependence" (under normal with no known parameters) to "assuming exchangeability". "Dependence" is wrong; "independence" is better, but since technically that should be "independence, conditional on parameters", I replaced it with the usual "exchangeability" for brevity. 128.59.111.72 (talk) 00:48, 18 October 2008 (UTC)


 * Wasn't the "dependence" referring to dependence among the parameters, not the data? --128.187.80.2 (talk) 23:00, 30 March 2009 (UTC)

Family of distributions
How does one tell whether two distributions are conjugate priors? What distinguishes "families"?

Incorrect posterior parameters
Has anyone else noticed that the posterior parameters are wrong? At least according to (DeGroot, 1970), the multivariate normal distribution posterior in terms of precision is listed incorrectly: it should be what the multivariate normal distribution in terms of the covariance matrix is listed as in the table. I don't really have the time to make these changes right now or check any of the other posterior parameters for accuracy, but someone needs to double-check these tables. Maybe I'll do it when I'm not so busy. Also, the Fink (1995) article disagrees with DeGroot on a number of points, so I question its legitimacy, given that the latter is published work and the former is an ongoing report. Maybe it should be removed as a source? DeverLite (talk) 23:22, 8 January 2010 (UTC)

I just implemented the multivariate Gaussian with the Normal-Wishart conjugate prior according to the article and found that the stated posterior does not integrate to one. I corrected the posterior distribution in that case, but the others probably also need to be corrected. — Preceding unsigned comment added by 169.229.222.176 (talk) 01:55, 20 August 2012 (UTC)

To prevent confusion, it should be made clear that the Student's t distribution specified as the posterior for the multivariate normal cases is a multivariate Student's t distribution parametrized by the precision matrix, not by the covariance as in the Wikipedia article on the multivariate Student's t distribution. — Preceding unsigned comment added by 169.229.222.176 (talk) 02:03, 20 August 2012 (UTC)

I was just looking at it and it looked wrong to me. The precision-based posterior parameters should have no inversions (as can be seen in the univariate case, for example). I can fix that according to DeGroot's formulation. --Olethros (talk) 15:35, 14 September 2010 (UTC)

Marginal distributions
I think it would be useful to augment the tables with the "marginal distribution" as well. The drawback here is the tables will widen, and they are already pretty dense. Thoughts? --128.187.80.2 (talk) 23:00, 30 March 2009 (UTC)
 * I am not clear what you mean by marginal distribution here ... if it is what I first thought (the marginal distribution of the observations) then these marginal distributions might find a better and more useful place under an article named something like compound distributions. Or is it the marginal distribution of new observations conditional on the existing observations, marginalised over the parameters (i.e. predictive distributions)? Melcombe (talk) 08:59, 31 March 2009 (UTC)
 * I was referring to the marginal distribution of the observations (not the predictive distribution). I often use this page as a reference guide (much simpler than pulling out my copy of Gelman et al.) and at times I have wanted to know the marginal distribution of the data. Granted, many books don't include this information, but it would be useful. As an example, in the Poisson-Gamma model $$\mathbf{x} \sim NegBin \left( \alpha, \frac{\beta}{1+\beta} \right)$$ (when the gamma is parameterized by rate). This information is largely contained in Negative binomial but that article does not specifically mention that it is the marginal distribution of the data in the Bayesian setting. Plus, it would be more convenient to have the information in one place. Your proposal to put it on a dedicated page may be a reasonable compromise since the tables are already large and this information is used much less frequently. --128.187.80.2 (talk) 17:27, 1 April 2009 (UTC)


 * I thought that giving such marginal distributions would be unusual in a Bayesian context, but I see that Bernardo & Smith do include them in the table in their book ... but they do this by having a separate list of results for each distribution/model, which would be a drastic rearrangement of what is here. An article on compound distributions does seem to be needed for its own sake. Melcombe (talk) 13:21, 2 April 2009 (UTC)
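The Poisson-Gamma marginal quoted above can be verified numerically by integrating the Poisson pmf against the Gamma prior; a minimal sketch (note it uses SciPy's nbinom(n, p) convention, whose pmf is C(k+n-1, k) p^n (1-p)^k):

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

alpha, beta = 3.0, 2.0  # hypothetical Gamma(shape alpha, rate beta) prior on the Poisson rate

def marginal_pmf(k):
    # Integrate Poisson(k | lam) * Gamma(lam | alpha, rate beta) over lam
    f = lambda lam: stats.poisson.pmf(k, lam) * stats.gamma.pdf(lam, alpha, scale=1.0 / beta)
    val, _ = quad(f, 0, np.inf)
    return val

ks = np.arange(0, 15)
mixture = np.array([marginal_pmf(k) for k in ks])

# The Gamma-Poisson mixture matches nbinom with n = alpha, p = beta / (1 + beta)
negbin = stats.nbinom.pmf(ks, alpha, beta / (1.0 + beta))
max_err = np.max(np.abs(mixture - negbin))
```

Under that convention the mixture matches $$NegBin\left(\alpha, \frac{\beta}{1+\beta}\right)$$ as quoted above; be aware that other texts swap the roles of p and 1-p.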

Poisson-Gamma
It keeps getting changed back and forth, but I have the hyperparameters as: \alpha + n,\ \beta + \sum_{i=1}^n x_i\!

-- There is certainly a problem as it currently stands. The Wikipedia page on the gamma distribution explicitly gives both forms, in particular k = alpha and beta = 1/theta. Hence the update rules must be consistent with this notation. I have corrected this for now. —Preceding unsigned comment added by 129.215.197.80 (talk) 15:23, 20 January 2010 (UTC)

Please add discussion if this is incorrect before changing it! —Preceding unsigned comment added by Occawen (talk • contribs) 05:06, 6 December 2009 (UTC)
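One way to settle this without edit-warring is a numerical check: evaluate prior × likelihood on a grid and compare it with both candidate updates. A minimal sketch (rate parameterization of the Gamma; the data and hyperparameters are made up):

```python
import numpy as np
from scipy import stats

x = np.array([2, 0, 3, 1, 4])  # hypothetical Poisson counts
a, b = 2.0, 1.0                # Gamma(shape a, rate b) prior on the Poisson rate
n, sx = len(x), x.sum()

# Unnormalized posterior over the rate on a grid
lam = np.linspace(1e-6, 15.0, 30001)
log_unnorm = stats.gamma.logpdf(lam, a, scale=1.0 / b) \
    + stats.poisson.logpmf(x[:, None], lam[None, :]).sum(axis=0)
unnorm = np.exp(log_unnorm - log_unnorm.max())
posterior = unnorm / (unnorm.sum() * (lam[1] - lam[0]))

candidate_1 = stats.gamma.pdf(lam, a + sx, scale=1.0 / (b + n))  # alpha + sum(x), beta + n
candidate_2 = stats.gamma.pdf(lam, a + n, scale=1.0 / (b + sx))  # alpha + n, beta + sum(x)

err_1 = np.max(np.abs(posterior - candidate_1))
err_2 = np.max(np.abs(posterior - candidate_2))
```

On this check only the $$\alpha + \sum_{i=1}^n x_i,\ \beta + n$$ update reproduces the grid posterior, which is the standard Poisson-Gamma result; the swapped form $$\alpha + n,\ \beta + \sum_i x_i$$ is the update for an exponential likelihood, which may be the source of the back-and-forth.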

Most unintelligible article on Wikipedia
Just a cheeky comment to say that this is the hardest article to understand of all those I've read so far. It assumes a lot of background knowledge of statistics. Maybe a real-world analogy or example would help clarify what a conjugate prior is. Abstractions are valuable but people need concrete examples if they want to jump in half-way through the course. I'm really keen to understand the relationship between the beta distribution and the binomial distribution, but this article (and the ones it links to) just leave me befuddled. 111.69.251.147 (talk) 00:39, 21 June 2010 (UTC)


 * You haven't read enough of Wikipedia if you think this is its most unintelligible! However, I agree that it's baffling. I came here with no prior knowledge of what a conjugate prior is, following a link from a page that mentioned the beta distribution being the conjugate of various other distributions. I find myself reading a page that tells me a conjugate prior is (in effect) a likelihood function that changes hyperparameters but not form when given new data; this does not tell me how this prior's form is conjugate *to* any other distribution, which was what I was trying to glean. Lurking in the back is the fact that the variate being modelled has a distribution, let's call it X; when the prior for its parameter has distribution Y, then data about the primary refines our knowledge of the parameter to a posterior of the same form as the prior Y; in such a case, I'm guessing "the form of Y" is what's being described as "conjugate to" (possibly the form of) X; but I don't actually see the text **saying that**, so I'm left wondering whether I've guessed wrong. An early remark about the Gaussian seemed to be saying that, but it was hard to be sure, because it was being described as self-conjugate and similar phrasing was used to describe the prior and posterior as conjugate, so I was left in doubt as to whether X=Gauss has Y=Gauss work. I lost hope of finding any confirmation or correction for my guess as the subsequent page descended into unintelligible gibberish. (It might not seem like that to its author, but that's the problem of knowing what you're talking about and only being used to talking about it with others who already understand it: as you talk about it, you say the things you think when you think about it, and can't see that, although it all fits nicely together within any mind that understands it already, it *conveys nothing* to anyone who doesn't already understand it. Such writing will satisfy examiners or your peers that you understand the subject matter, but won't teach a student anything.) -- Eddy 84.215.30.244 (talk) 06:14, 30 July 2015 (UTC)

Another less cheeky comment.
Paragraphs 1 through to the contents - great. The rest - incomprehensible. I have no doubt that if you already know the content it is probably superb, but I saw a long trail of introduced jargon going seemingly in no particular direction. I was looking for a "what" and some "why do this", but I did not find it here. Many thanks for the opening paragraphs. Yes, I may be asking for you to be the first ever to actually explain Bayesian (conjugate) priors in an intuitive way. [not logged in] — Preceding unsigned comment added by 131.203.13.81 (talk) 20:13, 1 August 2011 (UTC)

So, working through the example (thanks for one, it being my only hope of working out what it all means): "If we sample this random ... f" - ah, not the "f" of a few lines above. "x" - ah, "s,f", that's x, and "x" - well, that's the value for q = x, that's theta, from a few lines above. I'm rewriting it on my own page just to get the example clear. — Preceding unsigned comment added by 131.203.13.81 (talk) 03:12, 10 August 2011 (UTC)

This article
wat

Simple English sans maths in the intro would be great. —Preceding unsigned comment added by 78.101.145.17 (talk) 14:48, 24 March 2011 (UTC)

Broken link
The external link is broken. Should I remove it? — Preceding unsigned comment added by 163.1.211.163 (talk) 17:38, 12 December 2011 (UTC)

Wrong posterior
Some of the posteriors are wrong. I just discovered one: in the row for a normal with known precision τ (prior on μ, the mean), the posterior variance should be $$(\tau_0+n\tau)^{-1}$$. — Preceding unsigned comment added by 173.19.34.157 (talk) 04:09, 15 May 2012 (UTC)
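That claim can be checked numerically; a minimal sketch comparing a grid evaluation of prior × likelihood against a normal with precision $$\tau_0 + n\tau$$ (the data and hyperparameters below are invented):

```python
import numpy as np
from scipy import stats

tau = 2.0             # known precision of the likelihood
mu0, tau0 = 0.0, 0.5  # Normal prior on the mean: N(mu0, 1/tau0)
x = np.array([1.2, 0.8, 1.5, 1.1])
n = len(x)

# Unnormalized posterior over the mean on a grid
mu = np.linspace(-4.0, 6.0, 40001)
log_unnorm = stats.norm.logpdf(mu, mu0, 1.0 / np.sqrt(tau0)) \
    + stats.norm.logpdf(x[:, None], mu[None, :], 1.0 / np.sqrt(tau)).sum(axis=0)
unnorm = np.exp(log_unnorm - log_unnorm.max())
posterior = unnorm / (unnorm.sum() * (mu[1] - mu[0]))

# Claimed posterior: precision tau0 + n*tau, i.e. variance (tau0 + n*tau)^-1
post_tau = tau0 + n * tau
post_mu = (tau0 * mu0 + tau * x.sum()) / post_tau
conjugate = stats.norm.pdf(mu, post_mu, 1.0 / np.sqrt(post_tau))
max_err = np.max(np.abs(posterior - conjugate))
```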

I just discovered another, for the normal with unknown mean and variance. The second hyperparameter should be $$\frac{1}{\nu'} = \frac{1}{\nu} + n$$. — Preceding unsigned comment added by 193.48.2.5 (talk) 10:56, 17 January 2019 (UTC)

That Table
Yeah... That table, while informative, is not formatted very well. It wasn't clear at first what the Posterior Hyperparameters column represented, or what any of the variables meant in the Posterior Predictive column. — Preceding unsigned comment added by 129.93.5.131 (talk) 05:00, 10 December 2013 (UTC)

Is there some reference for the log-normal to normal conversion? It seems strange that the estimates would be optimal after exponentiation. — Preceding unsigned comment added by Amrozack (talk • contribs) 21:50, 17 June 2020 (UTC)

Assessment comment
Substituted at 19:53, 1 May 2016 (UTC)

Any appetite for a "practical application" section?
I've recently used Bayesian conjugate priors for computing the probability that there will be at least 1 rental car available in my area on any given day. Would there be any appetite for a section showing how one can use the table to compute something like this? If so, I would write that up in the next few days. I figure it might help with making the page a little more understandable. — Preceding unsigned comment added by Rasmusbergpalm (talk • contribs) 08:59, 11 February 2020 (UTC)


 * Rasmusbergpalm I've found the practical example very helpful, thank you. However, I was wondering about the correctness of the gamma distribution hyperparameters picked; perhaps I'm missing something. Because in the gamma distribution mean=alpha/beta and variance=alpha/beta^2, we can calculate beta=mean/variance and alpha=beta*mean. From the example we have mean=8/3 and sample variance=7/3, from which we can calculate beta=8/7 and alpha=64/21. Does this make sense?--Gciriani (talk) 16:18, 2 December 2020 (UTC)


 * Gciriani Thanks! Glad you liked it :) Remember, the gamma distribution is a distribution over the rate of the Poisson distribution (from which the samples are drawn), not over the samples directly. The information from the data enters through the likelihood term, not the prior term. Also, the prior hyperparameters are inherently subjective. There's no "right" answer. They represent your prior belief. Ideally you should set them before you observe any data. One way to check whether your prior belief is reasonable is what are called prior predictive samples: you sample parameters from your prior, then sample data from your likelihood model given those parameters, and see if they're reasonable. I made a small notebook where I do this that you can check out. link to notebook. I hope this clears things up. If you wish to learn more, here are some excellent resources: http://www.stat.columbia.edu/~gelman/book/ and https://arxiv.org/pdf/2011.01808.pdf. Rasmusbergpalm (talk) — Preceding undated comment added 20:03, 3 December 2020 (UTC)
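The prior predictive check described above can be sketched in a few lines (the hyperparameters here are hypothetical, not the ones from the article's example):

```python
import numpy as np

rng = np.random.default_rng(42)
alpha, beta = 3.0, 2.0  # hypothetical Gamma(shape alpha, rate beta) prior on the Poisson rate

# Prior predictive sampling: draw a rate from the prior, then a count given that rate
rates = rng.gamma(shape=alpha, scale=1.0 / beta, size=100_000)
samples = rng.poisson(rates)

# If these simulated counts look unreasonable, revisit the prior hyperparameters.
pred_mean, pred_var = samples.mean(), samples.var()
```

For a Gamma(α, rate β) prior the predictive mean is α/β and the predictive variance is α/β + α/β², which the sample moments should approximate; eyeballing a histogram of `samples` is the usual informal check.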

Beta note
I am not persuaded that we should say the interpretation of a Beta prior/posterior distribution with hyperparameters α and β is $$\alpha - 1$$ successes and $$\beta - 1$$ failures, with the note

"The exact interpretation of the parameters of a beta distribution in terms of number of successes and failures depends on what function is used to extract a point estimate from the distribution. The mode of a beta distribution is $\frac{\alpha - 1}{\alpha + \beta - 2},$ which corresponds to $\alpha - 1$ successes and $\beta - 1$ failures; but the mean is $\frac{\alpha}{\alpha + \beta},$ which corresponds to $\alpha$ successes and $\beta$ failures. The use of $\alpha - 1$ and $\beta - 1$ has the advantage that a uniform ${\rm Beta}(1,1)$ prior corresponds to 0 successes and 0 failures, but the use of $\alpha$ and $\beta$ is somewhat more convenient mathematically and also corresponds well with the fact that Bayesians generally prefer to use the posterior mean rather than the posterior mode as a point estimate.  The same issues apply to the Dirichlet distribution."

My problem is that this becomes nonsense when used with the Jeffreys prior $$\alpha=\beta=\tfrac12$$: the fractional values might be explained away, but the negative values really cannot. I would much prefer saying the interpretation of α and β is $$\alpha$$ successes and $$\beta$$ failures, with a note like the following - 11:03, 23 June 2020 (UTC)

"The exact interpretation of the parameters of a beta distribution in terms of number of successes and failures depends on what function is used to extract a point estimate from the distribution. The mean of a beta distribution is $\frac{\alpha}{\alpha + \beta},$ which corresponds to $\alpha$ successes and $\beta$ failures, while the mode is $\frac{\alpha - 1}{\alpha + \beta - 2},$ which corresponds to $\alpha - 1$ successes and $\beta - 1$ failures. Bayesians generally prefer to use the posterior mean rather than the posterior mode as a point estimate, justified by a quadratic loss function, and the use of $\alpha$ and $\beta$ is more convenient mathematically, while the use of $\alpha - 1$ and $\beta - 1$ has the advantage that a uniform ${\rm Beta}(1,1)$ prior corresponds to 0 successes and 0 failures. The same issues apply to the Dirichlet distribution."

The "minus 1" that appears in the Dirichlet, Gamma, etc pdfs comes from the associated Haar measure, rather than the parameters, which is the main source of confusion here I think. The Dirichlet (Beta as special case) distribution can be constructed as a random vector of gamma random variables divided by their sum. The proportion of the sum associated with a particular gamma variable is independent of the actual sum, which hints at how the sum is marginalized out and the Dirichlet pdf can be described by an integral over a subset of the positive reals, and that's where the Haar measure comes in. In the language of exponential family distributions, 1/(x*(1-x)) may be seen as the carrier measure, and this cleanly separates the sufficient statistics log(x) and log(1-x):

$$\frac{1}{x(1-x)} \exp\big(a \log(x) + b \log(1-x) - \log B(a,b)\big)$$

This gives the canonical exponential family form. Cswitch (talk) 00:08, 19 February 2023 (UTC)
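Since $$\frac{1}{x(1-x)}x^{a}(1-x)^{b} = x^{a-1}(1-x)^{b-1}$$, the canonical form above can be checked against the usual Beta pdf numerically; a small sketch (using the Jeffreys values a = b = 1/2, though any a, b > 0 work):

```python
import numpy as np
from scipy import stats
from scipy.special import betaln

a, b = 0.5, 0.5  # Jeffreys prior values
x = np.linspace(0.01, 0.99, 99)

# Carrier measure 1/(x(1-x)) with sufficient statistics log(x), log(1-x)
canonical = 1.0 / (x * (1 - x)) * np.exp(a * np.log(x) + b * np.log(1 - x) - betaln(a, b))
pdf = stats.beta.pdf(x, a, b)
max_err = np.max(np.abs(canonical - pdf))
```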

looks like an error in predictive priors
For a Poisson likelihood and Gamma prior, the predictive distribution is given as $$\operatorname{NB}\left(\tilde{x}\mid k', {\theta'}\right)$$ when specifying the Gamma distribution using scale ($$\theta$$), and as $$\operatorname{NB}\left(\tilde{x}\mid\alpha', \frac{1}{1 + \beta'}\right)$$ when specifying it using rate ($$\beta$$). Since $$\theta'=1/\beta'$$, I do not see how both equations can be correct.

Furthermore, there are a number of conventions for the negative binomial (NB), and the table does not specify which is used.

The root of all these problems is that the equations are unsourced. Every single equation should be sourced, or deleted. Adpete (talk) 01:38, 26 August 2020 (UTC)

mean + variance gaussian chain
These notes (https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/lectures/lecture5.pdf) present a chained prior for the Gaussian when neither the mean nor the variance is fixed - is it a good idea to put this in the table too? Thank you, — Preceding unsigned comment added by Reim (talk • contribs) 08:16, 6 November 2020 (UTC)