Talk:Metropolis–Hastings algorithm

x prime or x_t?
Here are two conflicting comments on the case a>1, which I have cut and pasted for readability. --Rinconsoleao 15:12, 16 July 2007 (UTC)


 * There seems to be a typo in the last display. If a>1, we would accept the proposal, x', not x_t. --Youyifong 06:38, 28 July 2006 (UTC)


 * Hi, is there a mistake here? In the step-by-step instructions,

x^{t+1} = \left \{ \begin{matrix} x'(x(t)?)                           & \mbox{if }a > 1 \\ x'\mbox{ with probability }a, & \mbox{if }a < 1 \end{matrix} \right. $$
 * (Postdoc 02:30, 16 July 2007 (UTC))

The algorithm always accepts if a>1. That is, $$x^{t+1}=x'$$ when a>1. Notice that this is consistent with the statements about the uniform random variable $$u$$ in the previous paragraph. I have rewritten the "step-by-step instructions" to make this clearer. --Rinconsoleao 15:12, 16 July 2007 (UTC)
 * Please check the page again. The pages mis-typed x' as x_t, as Youyifong mentioned again. Also, should it be "a >= 1" ? (Postdoc 05:38, 30 July 2007 (UTC))
 * Thanks! --Rinconsoleao 15:44, 31 July 2007 (UTC)

Gibbs sampling
The article says that the Gibbs-sampler is a special case of the MH-algorithm. As far as I know it is a complementary MCMC-method which works quite differently. --Smeyen 01:07, 24 February 2007 (UTC)


 * In Gibbs sampler, its proposal distribution is the full conditional distribution of some part of parameter space conditional on the rest, which always resultes in a acceptance probability of 1. See [http:://www.stat.purdue.edu/~lebanon/notes/metropolis.pdf] (Postdoc 05:46, 30 July 2007 (UTC))

Gibbs sampling is a special case of the Metropolis-Hastings sampler, with the a proposal distribution that at each iteration, updates one element of the parameter vector from the distribution of that element conditional on the current value of all of the others: $$g(x'|x) = P(x_i | x_{-i})$$, resulting in an acceptance probability that is always 1, and hence is always accepted. Proof left as an exercise for the reader, &#9786;. Gregory R. Warnes (talk) 23:27, 27 October 2017 (UTC)

Complexity

 * The Markov chain is started from a random initial value \displaystyle x^0 and the algorithm is run for many iterations until this initial state is "forgotten"

no bounds are known on the number of needed iterations, e.g. the mixing time of the Markov chain? Can we really claim in the introduction that this algorithm allow to compute something if we don't known how long it should be iterate? Levochik (talk) 07:43, 26 August 2009 (UTC)

Example
First of all, thank you for this article. It's the best I could find in the web about Monte Carlo simualtion. But unfortunately, I still don't get it. How can I compute a=a_1a_2 if I don't know P(x)? Perhaps additionally, you could include an example. Something where one can see the step-by-step instruction applied to a simple example. Just like a record of a view steps from a real simulation of some simple case. —Preceding unsigned comment added by Askamaster (talk • contribs) 10:30, 25 March 2010 (UTC)
 * You don't need P(x), you only need a function proportional to P(x), as the normalizations cancel in the ratio. It is quite often the case that you can easily determine a function proportional to P(x), but the integral to normalize it is very difficult to perform. This comes up quite often in Bayesian computations of the posterior distribution. Rjljr2 (talk) 15:49, 11 April 2010 (UTC)


 * article is NOT good, is NOT clear, you want improvement?

best part is who created it then explain why, what, when, are no answered I will give you it tells how does it give algorithm in detail so one can code it? nope, does it provide example? noJuror1 (talk) 10:28, 30 November 2017 (UTC)

The article suffers a bit from the Phd-on-wikipedia-syndrome. It descibes a fairly simple idea, but it tries to be very general and that makes it hard to read for people that don't know the subject beforehand. Compare with the article on Simulated_annealing for an simpler article on the same subject. nielsle — Preceding unsigned comment added by 89.23.239.219 (talk) 08:17, 13 January 2018 (UTC)

Proposal density
Clearly the Markov chain doesn't converge for all proposal densities Q. Certainly Q > 0 is sufficient, but I don't believe it is necessary. What condition is required for the proposal density? — Preceding unsigned comment added by WuTheFWasThat (talk • contribs) 19:24, 16 May 2012 (UTC)

Overview
The section Overview tries to give a formal derivation of the Algorithm. The foregoing sections are so well written that it was very easy to understand for me, even though I am not an expert in the field. Bringing the section Overview to the same didadactic level would be an asset. The questions I would like to have answered are: How can I see that the equation is still redundant and what does the term redundant mean in this context? How does one come to the idea of defining $$\displaystyle A(x\rightarrow x')$$ as the specified minimum function?

Some other proposals for improvements: The meaning of the variable $$\displaystyle x' $$ seems to be different from the one in the section Intuition. Would it be an option to use another notation? Furthermore, the section Overview repeats the definition of the algorithm in the section Intuition. However, at the same time it lists the steps of skipping samples to mitigate correlations. I suggest to put the complete list of steps into the previous section. Then, a proper name for the section Overview would e.g. be Formal Derivation.

Sorry, I am new to wikipedia and do not want to just poke your text. Please let me know if you wish me to do some of the mentioned improvements. —Contribo (talk) 20:20, 3 December 2012 (UTC)


 * I agree with you, "overview" is not the proper name; maybe "formal derivation" as you suggested sounds better. Nevertheless, the intuition section does not present the modern view of metropolis-hastings algorithm in the sense that it does not uses a general proposal (it is always assumed symmetrical, which is not the general case. However, I think the intuition part is to understand the idea of the method; that is why it skips the fact that the proposal also has to be considered in the acceptance in the general case.


 * Redundant in the sense that there can be another choices for A which also fulfills the equation you mention (maybe under-determined). I don't recall my statistical physics lectures, but I know there is at least another choice of A which also fulfills it and yet, it does not use the "min".Jorgecarleitao (talk) 20:53, 3 December 2012 (UTC)


 * Thank you for your explanations, I did not expect to get an Answer so quickly. I bring these issues into discussion not just because I would like to understand the matter but also to motivate you to work towards improving this article together with me. In my role as the learner I offer to ask you questions and you as the author could modify the article, such as to answer these questions. Of course I offer to do modifications as well — as far as I am qualified. Does this sound attractive?


 * In this sense let me point out some further aspects. You write the proposal also has to be considered in the acceptance in the general case. I am not sure if I get you right. Isn't this considered in the section Step-by-step Instructions? This brings me to another point: the Step-by-step Instructions should contain the skipping of T states to make the resulting sequence uncorrelated.


 * I noticed some more points regarding the section Overview: What is the role of ergodicity in this explanation. The text says that it can be shown that ergodicity is a requirement for the process to converge asymptotically. A reference to the proof would be great. How is it ensured that the process is ergodic? Another Point: The markov matrices in the wiki-reference have finite dimensions. However, the markov matrix for the process under discussion is infinte dimensional, this difference should at least be mentioned. --Contribo (talk) 21:24, 3 December 2012 (UTC)

Easter egg link
The link which is labeled "in practice" and leads to Bayesian statistics should be rephrased so as to not surprise the reader, as per WP:EASTEREGG. There should probably be some brief remark explaining what Bayesian statistics are and how they relate, but I don't know how to write such a remark. — Preceding unsigned comment added by 132.65.153.93 (talk) 18:03, 12 November 2013 (UTC)

Interacting Metropolis-Hastings methodology
(this was in the main article, and even though has a lot of information, it does not belong to Metropolis-Hastings. It is about sequential Monte Carlo and Markov Chain Monte Carlo as the text below itself claims. I'm storing it here so it is not lost in history.)

The central idea is to consider an interpolating sequence of target distributions $$P_k(x)~, k=0,\ldots,n$$ with an increasing complexity of sampling and such that $$P_n(x)=P(x)$$. We let $$h_k(x)$$ be a function that is proportional to $$P_k(x)/P_{k-1}(x)$$, for each $$k\geq 1.$$. In this notation, for any k we clearly have the product formula$$P_k(x)=\frac{1}{\mathcal Z_k}~\left\{\prod_{1\leq l\leq k}h_l(x)\right\}~P_0(x)$$for some normalizing constants $$\mathcal Z_k.$$ When the state space $S$ is finite, for any $$k\geq 1$$  we have

$$P_k(x)=\displaystyle\frac{h_k(x)~P_{k-1}(x)}{\displaystyle\sum_{y\in S}h_k(y)~P_{k-1}(y)}\quad\mbox{and}\quad \frac{\mathcal Z_k}{\mathcal Z_{k-1}}=\sum_{y\in S}h_k(x)~P_{k-1}(x)$$ For probability densities w.r.t. the Lebesgue measure dx on $S=\mathbb R^d$, for some dimension $$d\geq 1$$, we have

$$P_k(x)=\displaystyle\frac{h_k(x)~P_{k-1}(x)}{\displaystyle\int_{y\mathbb R^d}h_k(y)~P_{k-1}(y)~dy}\quad\mbox{and}\quad \frac{\mathcal Z_k}{\mathcal Z_{k-1}}=\int_{y\in \mathbb R^d}h_k(x)~P_{k-1}(x)~dx$$We assume that it is easy to sample independent random variables with common probability$$P_0(x)$$, and the functions $$h_k(x)$$ take values in $$[0,1]$$ and they can be computed easily for any state x.

We illustrate these rather abstract models with a couple of examples: ~\left\{\prod_{1\leq l\leq k}h_l(x)\right\}~P_0(x)\quad\mbox{with}\quad h_l(x)=1_{A_l}(x)$$ For any $$k\geq 0$$ denote by $$P_k(x\mapsto x')$$ the transition probability of the Metropolis–Hastings algorithm with target invariant measure $P_k(x)$. To simplify the presentation, we assume that the state space s finite.
 * Boltzmann-Gibbs distributions associated with an energy function $$V~:~x\in S\mapsto V(x)\in[0,\infty]$$ on some finite set $$S$$ and some inverse temperature parameter $$\beta>0$$ are defined by$$P(x)=\frac{1}{\mathcal Z}~e^{-\beta V(x)}\quad\mbox{with}\quad \mathcal Z=\sum_{y\in S}~e^{-\beta V(y)}$$In this situation, we can choose the interpolating sequence $$P_k(x)=\frac{1}{\mathcal Z_k}~e^{-\beta_k V(x)}\quad\mbox{with}\quad \mathcal Z_k=\sum_{y\in S}~e^{-\beta_k V(y)}\quad\mbox{and some parameters}\quad \beta_0=0<\beta_1<\ldots<\beta_n=\beta$$By construction, we have $$P_k(x)=\frac{1}{\mathcal Z_k}~\left\{\prod_{1\leq l\leq k}h_l(x)\right\}~P_0(x)\quad\mbox{with}\quad h_l(x)=e^{-(\beta_{l}-\beta_{l-1})V(x)}$$
 * Let $P_0(x)$ be the probability density of some real valued random variable $$X$$. Let $A\subset \mathbb R$  be some subset such that $$\mathbb P(X\in A)>0$$ and set $$P(x)=\frac{1}{\mathcal Z}~1_A(x)~P_0(x)\quad\mbox{with}\quad \mathcal Z=\int_{y\in \mathbb R}~1_A(y)~P_0(y)~dy$$with the indicator function $$1_A(.)$$ of a set $A$ . In this situation, we can choose the interpolating sequence $$P_k(x)=\frac{1}{\mathcal Z_k}~1_{A_k}(x)~P_0(x)\quad\mbox{with}\quad \mathcal Z_k=\int_{y\in \mathbb R}~1_{A_k}(y)~P_0(y)~dy>0\quad\mbox{and some decreasing subsets}\quad A_0=\mathbb R\supset A_1\supset \ldots\supset A_n=A$$By construction, we have $$P_k(x)=\frac{1}{\mathcal Z_k}
 * Initially, we sample a large number $$N$$ of independent random variables $\xi_0=\left(\xi^i_0\right)_{1\leq i\leq N}$ with common probability distribution $$P_0(x)$$. By the law of large numbers, we have $$P_0^N(x):=\frac{1}{N}\sum_{1\leq i\leq N}1_{\xi^i_0}(x)~\approx_{N\uparrow\infty}~P_0(x)\qquad\mbox{and}\qquad \sum_{y\in S}h_1(x)~P^N_{0}(x)=\frac{1}{N}\sum_{1\leq i\leq N}h_1(\xi^i_0)\approx_{N\uparrow\infty}~\frac{\mathcal Z_1}{\mathcal Z_{0}}$$
 * Acceptance-rejection with recycling:   The objective is to design a new population of samples $\xi_0~\longrightarrow~ \widehat{\xi}_0=\left(\widehat{\xi}^i_0\right)_{1\leq i\leq N}$  with distribution $P_1(x)$ . With a probability $$h_1(\xi^i_0)$$  we accept the state $$\xi^i_0$$ and we set $$\widehat{\xi}^i_0=\xi^i_0$$.  Otherwise, we sample a new indidual $$\widetilde{\xi}^i_0$$ in the current population with the probability measure

$$Proba\left(\widetilde{\xi}^i_0=\xi^j_0~|~\xi_0\right)=\frac{h_1(\xi^j_0)}{\sum_{1\leq k\leq N}h_1(\xi^k_0)},~\mbox{for each}~j\in\{1,\ldots,N\}$$and we set $\widehat{\xi}^i_0=\widetilde{\xi}^i_0.$ Using the fact that$$\displaystyle\frac{h_1(x)~P^N_{0}(x)}{\displaystyle\sum_{y\in S}h_1(y)~P^N_{0}(y)}=\sum_{1\leq i\leq N}\frac{h_1(\xi^i_0)}{\sum_{1\leq j\leq N}h_1(\xi^j_0)}~1_{\xi^i_0}(x)~\approx_{N\uparrow\infty}P_1(x)$$  we have the estimates$$\widehat{P}_0^N(x):=\frac{1}{N}\sum_{1\leq i\leq N}1_{\widehat{\xi}^i_0}(x)~\approx_{N\uparrow\infty}~P_1(x)$$ The interacting Metropolis-Hastings defined above belong to the class of mean field interacting particle methods and Feynman-Kac particle models (a.k.a. Sequential Monte Carlo methodologies ). In contrast to traditional Markov Chain Monte Carlo methods relying on the stability of Markov chain, the precision interacting Metropolis-Hastings only depends on the size $$N$$ of the system, in the sense that for any time $$k$$ we have
 * Proposal-exploration: The objective is to design a new population $$\widehat{\xi}_0~\longrightarrow~\xi_1=\left(\xi^i_1\right)_{1\leq i\leq N}$$ with distribution $P_1(x)$  using the Metropolis-Hastings transitions $$P_1(x\mapsto x')$$. From each selected states $$\widehat{\xi}^i_0$$ we sample independently a transition $$\widehat{\xi}^i_0~\longrightarrow~\xi^i_1~\sim~P_1(\widehat{\xi}^i_0\mapsto x')$$ By the law of large numbers, we have$$P^N_1(x^{\prime})=\frac{1}{N}\sum_{1\leq i\leq N}1_{\xi^i_1}(x^{\prime})~\approx_{N\uparrow\infty}~\sum_{x\in S}P_1(x)~P_1(x\mapsto x^{\prime})=P_1(x^{\prime})\qquad\mbox{and}\qquad \sum_{y\in S}h_2(x)~P^N_{1}(x)=\frac{1}{N}\sum_{1\leq i\leq N}h_2(\xi^i_1)\approx_{N\uparrow\infty}~\frac{\mathcal Z_2}{\mathcal Z_{1}}$$The next acceptance-rejection transition with recycling $\xi_1~\longrightarrow~ \widehat{\xi}_1=\left(\widehat{\xi}^i_1\right)_{1\leq i\leq N}$  is defined as above by replacing $h_0(.)$  by $h_1(.)$ ; and the second proposal-exploration $$\widehat{\xi}_1~\longrightarrow~\xi_2=\left(\xi^i_2\right)_{1\leq i\leq N}$$ is defined by moving the selected states with the Metropolis-Hastings transition $$P_2(x\mapsto x')$$, and so on.

$$P^N_k(x)=\frac{1}{N}\sum_{1\leq i\leq N}1_{\xi^i_k}(x)~\approx_{N\uparrow\infty}~P_k(x)$$In addition, we have the unbiased estimates of the normalizing constants$$\forall 1\leq k\leq n\qquad \frac{\mathcal Z_k}{\mathcal Z_{0}}=\prod_{1\leq l\leq k}\frac{\mathcal Z_l}{\mathcal Z_{l-1}}~\approx_{N\uparrow\infty}~\prod_{1\leq l\leq k}\frac{1}{N}\sum_{1\leq i\leq N}h_l(\xi^i_{l-1})$$In addition, for any function f(.) on S we have the Feynman-Kac formulae

$$\sum_{x\in S}~f(x)~P_k(x)=\frac{E\left(f(X_k)~\prod_{1\leq l\leq k}h_l(X_{l-1})\right)}{E\left(\prod_{1\leq l\leq k}h_l(X_{l-1})\right)}~\qquad\mbox{and}\qquad~{\mathcal Z_k}/{\mathcal Z_0}=E\left(\prod_{1\leq l\leq k}h_l(X_{l-1})\right)$$where $$X_k$$ stands for a non homogenous Markov chain with initial distribution $$P_0(x)$$ and Markov transitions $$\mathbb P\left(X_k=x'~|~X_{k-1}=x\right):=P_k(x\mapsto x')$$.

The analysis of the convergence of Feynman-Kac particle methods has been started in 1996 and in 2000 in the book and the series of articles. More recent developments can be found in the books. Applications of these Feynman-Kac methodologies to interacting Metropolis–Hastings algorithms are discussed in the series of articles. The unbiased properties of the particle estimate of the ratio $$\mathcal Z_k/\mathcal Z_0$$ are proved in in the context of filtering problems. Online adaptive choices of the temperature parameters and the decreasing subsets in the above examples can also be found in. In Operation Research, the interacting MCMC algorithm associated with target distributions defined in terms of indicator functions is sometimes referred as subset simulation.

Thinning
The paragraph on thinning is very poor. In particular the statement "this means that if we want a set of independent samples, we have to throw away the majority of samples and only take every nth sample" is plain wrong.

There is no such guarantee of independence, regardless of the interval chosen. In addition, most authors advise against the practice, since while thinning reduces correlation between subsequent steps, it also greatly reduces the number of samples finally obtained. There are some exceptions, as Art Owen discusses here http://statweb.stanford.edu/~owen/reports/bestthinning.pdf.

Patrick1894 (talk) 14:42, 17 April 2019 (UTC)

Rename? (In Wikipedia parlance suggested Move)
In my experience, supported for example by the frequencies of the terms "Metropolis", "Metropolis-Hastings" and "Monte Carlo" recorded between 1953 and 2012 in Google's Ngram viewer, this algorithm is far more often called "Metropolis" than "Metropolis-Hastings." Therefore I suggest the article be renamed (moved to) "Metropolis algorithm" with Metropolis-Hastings redirecting to it, and with a subsection, now lacking, that specifically explains and compares Hastings' 1970 improvement and generalization to the original 1953 algorithm.CharlesHBennett (talk) 22:06, 22 January 2020 (UTC)
 * Neutral. In my experience, "Metropolis" is more common than "Metropolis-Hastings" as well. However, I don't think we get a lot by moving this. Metropolis algorithm already redirects to this article. BernardoSulzbach (talk) 10:59, 23 January 2020 (UTC)

Formal derivation of transition probability incomplete
It is stated that the transition probability $$P(x'\mid x)$$ can be separated into a proposal step and accept/reject step. Mathematically,


 * $$P(x'\mid x) = g(x' \mid x) A(x', x).$$

where $$g(x' \mid x)$$ is the proposal distribution and $$A(x', x)$$ the acceptance distribution.

This transition distribution does not take into account rejection however. The transition probability is complete with an additional rejection term


 * $$P(x'\mid x) = g(x' \mid x) A(x', x) + \delta(x' - x) (1 - A(x', x))$$

where $$\delta(x)$$ is understood as the Dirac delta function. It is only later in the derivation with the choice for $$A(x', x)$$ that the last term vanishes, since a transition where $$x' = x$$ will always be accepted ($$A(x, x) = 1$$). — Preceding unsigned comment added by 213.125.242.82 (talk) 19:46, 8 April 2020 (UTC)

Oughtn't the core procedure be made a tad clearer? Suggest revision.
Calculate the acceptance ratio $$\alpha = f(x')/f(x_t)$$, which will be used to decide whether to accept or reject the candidate. Because f is proportional to the density of P, we have that $$\alpha = f(x')/f(x_t) = P(x')/P(x_t)$$.


 * Accept or reject:
 * Generate a uniform random number $$u \in [0, 1]$$.
 * If $$u \le \alpha$$, then accept the candidate by setting $$x_{t+1} = x'$$,
 * If $$u > \alpha$$, then reject the candidate and set $$x_{t+1} = x_t$$ instead.

May I suggest a rewrite as follows which shows the Hastings step more clearly? The above is correct, but:
 * It forces a generation of $$u$$ unnecessarily when $$\alpha > 1$$, and
 * It requires thinking to see that it works, specifically in the case $$\alpha = 1$$.

Suggested revision:


 * Accept or reject:
 * If $$\alpha > 1$$, then accept the candidate by setting $$x_{t+1} = x'$$.
 * If $$\alpha \le 1$$, then
 * Generate a uniform random number $$u \in [0, 1]$$.
 * If $$u \le \alpha$$, then accept the candidate with probability $$\alpha$$ by setting $$x_{t+1} = x'$$.
 * If $$u > \alpha$$, then reject the candidate and set set $$x_{t+1} = x_t$$ instead.

bayesianlogic.1@gmail.com 19:28, 14 December 2020 (UTC)

The plot with the three chains needs clarification
The rosenbrock function is typically used as an example for minimization, not maximization. The Metropolis Hastings can be seen as a maximization algorithm though, and indeed the caption is written as if we were maximizing something. Perhaps the function being used is 1/rosenbrock, in such a way that the minimum becomes a maximum?

Lnz.Rossi (talk) 16:59, 22 May 2021 (UTC)


 * Maximization of some function f is the same as minimization of -f. Here, instead of minimizing Rosenbrock, the Metropolis-Hastings algorithm is maximizing -1×Rosenbrock. RFZYN SPY  talk 02:47, 21 March 2024 (UTC)

Acceptance ratio formula failing to parse
Under 'Intuition', I see the following parsing error instead of the acceptance ratio formula: "Calculate the acceptance ratio Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "/mathoid/local/v1/":): {\displaystyle \alpha = f(x')/f(x_t)}, which will be used to decide whether to accept or reject the candidate. Because f is proportional to the density of P, we have that..."

When I try to edit, this error doesn't show up in the preview, so not sure how to fix this. 128.12.123.150 (talk) 03:22, 17 February 2023 (UTC)


 * It looks fine from my end. Perhaps some sort of browser incompatibility? Qflib, aka KeeYou Flib (talk) 04:43, 18 February 2023 (UTC)