Chinese restaurant process

In probability theory, the Chinese restaurant process is a discrete-time stochastic process, analogous to seating customers at tables in a restaurant. Imagine a restaurant with an infinite number of circular tables, each with infinite capacity. Customer 1 sits at the first table. The next customer either sits at the same table as customer 1 or at the next table. This continues, with each customer choosing either to sit at an occupied table, with probability proportional to the number of customers already there (so a table with many customers is more likely to be chosen than one with few), or to sit at an unoccupied table. At time n, the n customers have been partitioned among m ≤ n tables (or blocks of the partition). The results of this process are exchangeable, meaning the order in which the customers sit does not affect the probability of the final partition. This property greatly simplifies a number of problems in population genetics, linguistic analysis, and image recognition.

The restaurant analogy first appeared in a 1985 write-up by David Aldous, where it was attributed to Jim Pitman (who additionally credits Lester Dubins).

An equivalent partition process was published a year earlier by Fred Hoppe, using an "urn scheme" akin to Pólya's urn. In comparison with Hoppe's urn model, the Chinese restaurant process has the advantage that it naturally lends itself to describing random permutations via their cycle structure, in addition to describing random partitions.

Formal definition
For any positive integer $$n$$, let $$\mathcal{P}_{n}$$ denote the set of all partitions of the set $$\{ 1, 2, 3,..., n \} \triangleq [n]$$. The Chinese restaurant process takes values in the infinite Cartesian product $$\prod_{n \geq 1} \mathcal{P}_{n}$$.

The value of the process at time $$n$$ is a partition $$B_n$$ of the set $$[n]$$, whose probability distribution is determined as follows. At time $$n=1$$, the trivial partition $$B_1 = \{ \{ 1 \} \}$$ is obtained (with probability one). At time $$n+1$$ the element "$$n+1$$" is either:
 * 1) added to one of the blocks of the partition $$B_n$$, where each block $$b$$ is chosen with probability $$|b|/(n+1)$$, where $$|b|$$ is the size of the block (i.e. number of elements), or
 * 2) added to the partition $$B_n$$ as a new singleton block, with probability $$1/(n+1)$$.

The random partition so generated has some special properties. It is exchangeable in the sense that relabeling $$\{ 1,..., n \}$$ does not change the distribution of the partition, and it is consistent in the sense that the law of the partition of $$[n-1]$$ obtained by removing the element $$n$$ from the random partition $$B_n$$ is the same as the law of the random partition $$B_{n-1}$$.

The probability assigned to any particular partition (ignoring the order in which customers sit around any particular table) is



$$ \Pr(B_n = B) = \frac{\prod_{b\in B} (|b| -1)!}{n!}, \qquad B \in \mathcal{P}_{n} $$

where $$b$$ is a block in the partition $$B$$ and $$|b|$$ is the size of $$b$$.
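As a sanity check on the formula above, the following sketch enumerates every partition of a small set and sums the stated probabilities exactly. The helper names `set_partitions` and `crp_partition_prob` are illustrative choices made here, not part of any library.

```python
from fractions import Fraction
from math import factorial


def set_partitions(items):
    """Yield every partition of the list `items` as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing block in turn ...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ... or give `first` a block of its own.
        yield [[first]] + partition


def crp_partition_prob(partition):
    """Probability prod_b (|b| - 1)! / n! of a partition under the basic process."""
    n = sum(len(block) for block in partition)
    numerator = 1
    for block in partition:
        numerator *= factorial(len(block) - 1)
    return Fraction(numerator, factorial(n))


if __name__ == "__main__":
    total = sum(crp_partition_prob(p) for p in set_partitions([1, 2, 3, 4]))
    print(total)  # the probabilities over all partitions of [4] should sum to one
```

For $$n=3$$ the five partitions receive probabilities $$1/3$$ (one block), $$1/6$$ for each of the three partitions with blocks of sizes 2 and 1, and $$1/6$$ (three singletons).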

The definition can be generalized by introducing a parameter $$\theta>0$$ which modifies the probability of the new customer sitting at a new table to $$\frac{\theta}{n+\theta}$$ and correspondingly modifies the probability of them sitting at a table of size $$|b|$$ to $$\frac{|b|}{n+\theta}$$. The vanilla process introduced above can be recovered by setting $$\theta=1$$. Intuitively, $$\theta$$ can be interpreted as the effective number of customers sitting at the first empty table.
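The seating rule with parameter $$\theta$$ can be simulated directly. The sketch below is a minimal illustration using only the Python standard library; the function name `sample_crp` and its arguments are assumptions made for this example.

```python
import random


def sample_crp(n, theta=1.0, rng=None):
    """Seat customers 1..n; return the tables as lists of customer numbers."""
    rng = rng or random.Random()
    tables = []
    for customer in range(1, n + 1):
        # Unnormalised weights: |b| for each occupied table, theta for a new one.
        # rng.choices normalises them, which corresponds to dividing by
        # (number of seated customers + theta).
        weights = [len(table) for table in tables] + [theta]
        choice = rng.choices(range(len(weights)), weights=weights)[0]
        if choice == len(tables):
            tables.append([customer])      # open a new table
        else:
            tables[choice].append(customer)
    return tables


if __name__ == "__main__":
    for table in sample_crp(20, theta=1.0, rng=random.Random(0)):
        print(table)
```

Larger values of `theta` tend to produce more tables, since the new-table weight is larger relative to the occupancy counts.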

Alternative definition
An equivalent but subtly different way to define the Chinese restaurant process is to let new customers choose companions rather than tables. Customer $$n+1$$ chooses to sit at the same table as any one of the $$n$$ seated customers with probability $$\frac{1}{n+\theta}$$, or chooses to sit at a new, unoccupied table with probability $$\frac{\theta}{n+\theta}$$. Notice that in this formulation the customer chooses a table without having to count table occupancies: the block sizes $$|b|$$ are not needed.
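The companion-based rule might be sketched as follows. Only a table label per customer is stored and no occupancy counts are kept; the name `sample_crp_by_companion` is hypothetical.

```python
import random


def sample_crp_by_companion(n, theta=1.0, rng=None):
    """Return a list whose i-th entry is the table label of customer i + 1."""
    rng = rng or random.Random()
    table_of = []
    num_tables = 0
    for seated in range(n):
        # Total weight is `seated` (one unit per seated customer) plus theta.
        if rng.random() * (seated + theta) < theta:
            table_of.append(num_tables)          # sit at a new table
            num_tables += 1
        else:
            companion = rng.randrange(seated)    # uniformly chosen seated customer
            table_of.append(table_of[companion])
    return table_of


if __name__ == "__main__":
    print(sample_crp_by_companion(15, theta=1.0, rng=random.Random(1)))
```

Because a table with $$|b|$$ customers is reached through any of its $$|b|$$ occupants, it is selected with total probability $$\frac{|b|}{n+\theta}$$, matching the table-based rule.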

Distribution of the number of tables
The Chinese restaurant table distribution (CRT) is the probability distribution on the number of tables in the Chinese restaurant process. It can be understood as the sum of $$n$$ independent Bernoulli random variables, each with a different parameter:



$$ \begin{align} K & = \sum_{i=1}^n b_i \\[4pt] b_i & \sim \operatorname{Bernoulli} \left( \frac \theta {i-1+\theta}\right) \end{align} $$

The probability mass function of $$K$$ is given by



$$ f(k) = \frac{\Gamma(\theta)}{\Gamma(n+\theta)} |s(n,k)| \theta^k, \quad k=1,\dots,n, $$

where $$s$$ denotes Stirling numbers of the first kind.
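The two descriptions of the table-count distribution can be compared numerically. The sketch below computes the mass function once from the Stirling-number formula and once by convolving the Bernoulli variables, using exact rational arithmetic. The function names are illustrative, and $$\Gamma(\theta)/\Gamma(n+\theta)$$ is evaluated as the reciprocal of $$\theta(\theta+1)\cdots(\theta+n-1)$$.

```python
from fractions import Fraction


def unsigned_stirling_first(n):
    """Return [|s(n, 0)|, ..., |s(n, n)|], unsigned Stirling numbers of the first kind."""
    row = [1]  # |s(0, 0)| = 1
    for m in range(n):
        # Recurrence: |s(m + 1, k)| = m * |s(m, k)| + |s(m, k - 1)|
        new = [0] * (m + 2)
        for k, value in enumerate(row):
            new[k] += m * value
            new[k + 1] += value
        row = new
    return row


def table_count_pmf(n, theta):
    """P(K = k) for k = 0..n from the Stirling-number formula."""
    theta = Fraction(theta)
    rising = Fraction(1)
    for i in range(n):
        rising *= theta + i          # equals Gamma(n + theta) / Gamma(theta)
    stirling = unsigned_stirling_first(n)
    return [stirling[k] * theta**k / rising for k in range(n + 1)]


def table_count_pmf_bernoulli(n, theta):
    """P(K = k) for k = 0..n by convolving the n independent Bernoulli variables."""
    theta = Fraction(theta)
    pmf = [Fraction(1)]
    for i in range(1, n + 1):
        p = theta / (i - 1 + theta)
        new = [Fraction(0)] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            new[k] += mass * (1 - p)
            new[k + 1] += mass * p
        pmf = new
    return pmf


if __name__ == "__main__":
    print(table_count_pmf(3, 1))
    print(table_count_pmf_bernoulli(3, 1))
```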

Two-parameter generalization
This construction can be generalized to a model with two parameters, $$\theta$$ and $$\alpha$$, commonly called the strength (or concentration) and discount parameters respectively. At time $$n+1$$, the next customer to arrive finds $$|B|$$ occupied tables and decides to sit at an empty table with probability



$$ \frac{\theta + |B| \alpha}{n + \theta}, $$

or at an occupied table $$b$$ of size $$|b|$$ with probability



$$ \frac{|b| - \alpha}{n + \theta}. $$

In order for the construction to define a valid probability measure it is necessary to suppose that either $$\alpha<0$$ and $$\theta = -L\alpha$$ for some $$L \in \{1,2,\ldots\}$$; or that $$0\leq\alpha<1$$ and $$\theta>-\alpha$$.
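A minimal simulation of the two-parameter seating rule is sketched below, tracking only table sizes. The function name is an assumption of this example, and the parameters are expected to satisfy one of the two validity conditions just stated. The first customer always opens the first table, which is handled separately so that a negative $$\theta$$ (allowed when $$\theta>-\alpha$$) causes no trouble.

```python
import random


def sample_two_parameter_crp(n, theta, alpha, rng=None):
    """Return the list of table sizes after seating n customers."""
    rng = rng or random.Random()
    sizes = []
    for _ in range(n):
        if not sizes:
            sizes.append(1)          # the first customer opens the first table
            continue
        # Unnormalised weights; they sum to (seated customers + theta).
        weights = [size - alpha for size in sizes]
        new_table_weight = theta + len(sizes) * alpha
        if new_table_weight > 0:     # zero once L tables exist when alpha < 0
            weights.append(new_table_weight)
        choice = rng.choices(range(len(weights)), weights=weights)[0]
        if choice == len(sizes):
            sizes.append(1)
        else:
            sizes[choice] += 1
    return sizes


if __name__ == "__main__":
    print(sample_two_parameter_crp(100, theta=1.0, alpha=0.5, rng=random.Random(0)))
    print(sample_two_parameter_crp(100, theta=1.5, alpha=-0.5, rng=random.Random(0)))
```

In the second call $$\alpha=-0.5$$ and $$\theta=1.5=-3\alpha$$, so at most three tables can ever be occupied. With floating-point parameters that are not exactly representable, the cutoff at $$L$$ tables is only as exact as the arithmetic.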

Under this model the probability assigned to any particular partition $$B$$ of $$[n]$$ can be expressed in the general case (for any values of $$\theta,\alpha$$ that satisfy the above-mentioned constraints) in terms of the Pochhammer k-symbol, as



$$ \Pr(B_n = B \mid \theta,\alpha) = \frac{(\theta + \alpha)_{|B|-1, \alpha}}{(\theta+1)_{n-1, 1}} \prod_{b\in B}(1-\alpha)_{|b|-1, 1} $$

where the Pochhammer k-symbol is defined as follows: by convention, $$(a)_{0,k} = 1$$, and for $$m > 0$$



$$ (a)_{m,k} = \prod_{i=0}^{m-1}(a+ik) = \begin{cases} a^m & \text{if }k = 0, \\ \\ k^m\,\left(\frac{a}{k}\right)^{\overline m} & \text{if }k>0, \\ \\ \left|k\right|^m\,\left(\frac{a}{\left|k\right|}\right)^{\underline m} & \text{if }k<0 \end{cases} $$

where $$x^{\overline m}=\prod_{i=0}^{m-1}(x+i)$$ is the rising factorial and $$x^{\underline m}=\prod_{i=0}^{m-1}(x-i)$$ is the falling factorial. It is worth noting that for the parameter setting where $$\alpha<0$$ and $$\theta = -L\alpha$$, then $$(\theta + \alpha)_{|B|-1, \alpha}=(|\alpha|(L-1))_{|B|-1, \alpha}$$, which evaluates to zero whenever $$|B|>L$$, so that $$L$$ is an upper bound on the number of blocks in the partition; see the subsection on the Dirichlet-categorical model below for more details.
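The Pochhammer form of the partition probability can be evaluated exactly with rational arithmetic. In the sketch below the function names are illustrative, and a partition is represented only by its list of block sizes, which is all the formula depends on.

```python
from fractions import Fraction


def pochhammer_k(a, m, k):
    """Pochhammer k-symbol (a)_{m,k} = prod_{i=0}^{m-1} (a + i k), with (a)_{0,k} = 1."""
    result = Fraction(1)
    for i in range(m):
        result *= a + i * k
    return result


def partition_prob(block_sizes, theta, alpha):
    """Probability of one particular partition of [n] with the given block sizes."""
    theta, alpha = Fraction(theta), Fraction(alpha)
    n = sum(block_sizes)
    num_blocks = len(block_sizes)
    prob = pochhammer_k(theta + alpha, num_blocks - 1, alpha)
    prob /= pochhammer_k(theta + 1, n - 1, 1)
    for size in block_sizes:
        prob *= pochhammer_k(1 - alpha, size - 1, 1)
    return prob


if __name__ == "__main__":
    # alpha < 0 with theta = -L * alpha and L = 3: four blocks are impossible.
    print(partition_prob([1, 1, 1, 1], Fraction(3, 2), Fraction(-1, 2)))
    # One-parameter case theta = 1: reduces to prod (|b| - 1)! / n!.
    print(partition_prob([2, 1], 1, 0))
```

For $$n=3$$ there is one partition with a single block, three with block sizes 2 and 1, and one with three singletons, so the corresponding probabilities weighted by 1, 3 and 1 should sum to one for any valid parameters.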

For the case when $$\theta > 0$$ and $$0<\alpha<1$$, the partition probability can be rewritten in terms of the Gamma function as



$$ \Pr(B_n = B\mid \theta,\alpha) =\frac{\Gamma(\theta)}{\Gamma(\theta+n)}\dfrac{\alpha^{|B|}\,\Gamma(\theta/\alpha + |B|) }{\Gamma(\theta/\alpha)}\prod_{b\in B}\dfrac{\Gamma(|b|-\alpha)}{\Gamma(1-\alpha)}. $$

In the one-parameter case, where $$\alpha$$ is zero, and $$\theta>0$$ this simplifies to



$$ \Pr(B_n = B\mid\theta) = \frac{\Gamma(\theta)\,\theta^{|B|}}{\Gamma(\theta+n)}\prod_{b\in B} \Gamma(|b|). $$

Or, when $$\theta$$ is zero, and $$0<\alpha<1$$



$$ \Pr(B_n = B\mid\alpha) =\frac{\alpha^{|B|-1}\,\Gamma(|B|) }{\Gamma(n)}\prod_{b\in B} \frac{\Gamma(|b|-\alpha)}{\Gamma(1-\alpha)}. $$

As in the one-parameter case, the probability assigned to any particular partition depends only on the block sizes, so the random partition is exchangeable in the sense described above. The consistency property also still holds, by construction.

If $$\alpha=0$$, the probability distribution of the random partition of the integer $$n$$ thus generated is the Ewens distribution with parameter $$\theta$$, used in population genetics and the unified neutral theory of biodiversity.
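For larger $$n$$ the Gamma-function forms are usually evaluated on the log scale to avoid overflow. The following sketch, restricted to $$\theta>0$$ and $$0\le\alpha<1$$, uses `math.lgamma` from the Python standard library; the function name `log_partition_prob` is hypothetical.

```python
from math import exp, lgamma, log


def log_partition_prob(block_sizes, theta, alpha=0.0):
    """Log-probability of one partition with the given block sizes.

    Assumes theta > 0 and 0 <= alpha < 1.
    """
    n = sum(block_sizes)
    num_blocks = len(block_sizes)
    result = lgamma(theta) - lgamma(theta + n)
    if alpha == 0.0:
        # One-parameter form: theta^{|B|} * prod Gamma(|b|)
        result += num_blocks * log(theta)
        result += sum(lgamma(size) for size in block_sizes)
    else:
        # Two-parameter form with 0 < alpha < 1
        result += num_blocks * log(alpha)
        result += lgamma(theta / alpha + num_blocks) - lgamma(theta / alpha)
        result += sum(lgamma(size - alpha) - lgamma(1.0 - alpha) for size in block_sizes)
    return result


if __name__ == "__main__":
    print(exp(log_partition_prob([2, 1], theta=1.0)))             # about 1/6
    print(exp(log_partition_prob([3], theta=1 / 3, alpha=0.5)))   # about 27/112
```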



Derivation
Here is one way to derive this partition probability. Let $$C_i$$ be the random block into which the number $$i$$ is added, for $$i =1,2,3,...$$. Then



$$ \Pr(C_i = c\mid C_1,\ldots,C_{i-1}) = \begin{cases} \dfrac{\theta + |B| \alpha }{\theta + i -1} & \text{if }c \in \text{new block}, \\ \\ \dfrac{|b| - \alpha }{\theta + i - 1} & \text{if }c\in b. \end{cases} $$

The probability that $$B_n$$ is any particular partition of the set $$\{ 1,...,n \}$$ is the product of these probabilities as $$i$$ runs from $$1$$ to $$n$$. Now consider the size of block $$b$$: it increases by one each time we add one element into it. When the last element in block $$b$$ is to be added in, the block size is $$|b|-1$$. For example, consider this sequence of choices: (generate a new block $$b$$)(join $$b$$)(join $$b$$)(join $$b$$). In the end, block $$b$$ has 4 elements and the product of the corresponding numerators in the above equation is $$(\theta+k\alpha)(1-\alpha)(2-\alpha)(3-\alpha)$$, where $$k$$ is the number of blocks that already existed when $$b$$ was created; for $$\alpha=0$$ this is $$\theta\cdot 1\cdot 2\cdot 3$$. Multiplying these contributions over all blocks, and the denominators $$\theta+i-1$$ over all $$i$$, we obtain $$\Pr(B_n = B)$$ as above.
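The sequential product in this derivation can be computed step by step. The sketch below multiplies the conditional probabilities along a given sequence of block labels, using exact fractions; the function name and the label encoding are assumptions of this example. Two different arrival orders that lead to the same block sizes give the same value, which illustrates exchangeability.

```python
from fractions import Fraction


def sequential_prob(labels, theta, alpha=0):
    """Multiply the conditional probabilities Pr(C_i | C_1, ..., C_{i-1}).

    `labels[i]` names the block joined by element i + 1; a label not seen
    before means that a new block is created.
    """
    theta, alpha = Fraction(theta), Fraction(alpha)
    sizes = {}
    prob = Fraction(1)
    for seated, label in enumerate(labels):
        if seated == 0:
            sizes[label] = 1         # the first element forms a block with probability one
            continue
        if label in sizes:
            prob *= (sizes[label] - alpha) / (theta + seated)
            sizes[label] += 1
        else:
            prob *= (theta + len(sizes) * alpha) / (theta + seated)
            sizes[label] = 1
    return prob


if __name__ == "__main__":
    theta, alpha = Fraction(1, 3), Fraction(1, 2)
    print(sequential_prob(["a", "a", "b"], theta, alpha))
    print(sequential_prob(["a", "b", "a"], theta, alpha))
```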

Expected number of tables
For the one-parameter case, with $$\alpha=0$$ and $$0<\theta<\infty$$, the number of tables is distributed according to the Chinese restaurant table distribution. The expected value of this random variable, given that there are $$n$$ seated customers, is



$$ \begin{align} \sum_{k=1}^n \frac \theta {\theta+k-1} = \theta \cdot (\Psi(\theta+n) - \Psi(\theta)) \end{align} $$

where $$\Psi(\theta)$$ is the digamma function. In the general case ($$\alpha>0$$) the expected number of occupied tables is



$$ \begin{align} \frac{\Gamma(\theta+n+\alpha)\Gamma(\theta+1)}{\alpha \Gamma(\theta+n) \Gamma(\theta+\alpha)} - \frac \theta \alpha, \end{align} $$

where $$\Gamma(\cdot)$$ is the gamma function; for $$n=1$$ the expression reduces to $$1$$, as it must.
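A numerical cross-check of these expectations is sketched below. It relies on the fact that, given $$K_n$$ occupied tables, a new table is opened with probability $$\frac{\theta+\alpha K_n}{n+\theta}$$, so by linearity $$E[K_{n+1}] = E[K_n] + \frac{\theta+\alpha E[K_n]}{n+\theta}$$. The closed form is evaluated here with `math.lgamma`, that is, reading $$\Gamma$$ as the ordinary gamma function, which is an assumption of this sketch; the function names are illustrative.

```python
from math import exp, lgamma


def expected_tables(n, theta, alpha=0.0):
    """E[K_n] from the recursion E[K_{m+1}] = E[K_m] + (theta + alpha E[K_m]) / (m + theta)."""
    expected = 0.0
    for seated in range(n):
        if seated == 0:
            expected = 1.0           # the first customer always opens a table
        else:
            expected += (theta + alpha * expected) / (seated + theta)
    return expected


def expected_tables_closed_form(n, theta, alpha):
    """Closed form for alpha > 0, evaluated on the log scale for stability."""
    log_ratio = (lgamma(theta + n + alpha) + lgamma(theta + 1)
                 - lgamma(theta + n) - lgamma(theta + alpha))
    return exp(log_ratio) / alpha - theta / alpha


if __name__ == "__main__":
    print(expected_tables(50, 1.0, 0.5), expected_tables_closed_form(50, 1.0, 0.5))
    print(expected_tables(3, 1.0))   # alpha = 0: 1 + 1/2 + 1/3
```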

The Dirichlet-categorical model
For the parameter choice $$\alpha<0$$ and $$\theta=-L\alpha$$, where $$L\in\{1,2,3,\ldots\}$$, the two-parameter Chinese restaurant process is equivalent to the Dirichlet-categorical model, which is a hierarchical model that can be defined as follows. Notice that for this parameter setting, the probability of occupying a new table, when there are already $$L$$ occupied tables, is zero; so that the number of occupied tables is upper bounded by $$L$$. If we choose to identify tables with labels that take values in $$\{1,2,\ldots,L\}$$, then to generate a random partition of the set $$[n]=\{1,2,\ldots,n\}$$, the hierarchical model first draws a categorical label distribution, $$\mathbf p=(p_1,p_2,\ldots,p_L)$$ from the symmetric Dirichlet distribution, with concentration parameter $$\gamma=-\alpha>0$$. Then, independently for each of the $$n$$ customers, the table label is drawn from the categorical $$\mathbf p$$. Since the Dirichlet distribution is conjugate to the categorical, the hidden variable $$\mathbf p$$ can be marginalized out to obtain the posterior predictive distribution for the next label state, $$\ell_{n+1}$$, given $$n$$ previous labels

$$ P(\ell_{n+1}=i\mid \ell_1,\ldots,\ell_n) = \frac{\gamma+ \left| {b_i} \right| }{L\gamma+n} $$

where $$\left|{b_i}\right|\ge0$$ is the number of customers that are already seated at table $$i$$. With $$\alpha=-\gamma$$ and $$\theta=L\gamma$$, this agrees with the above general formula, $$\frac{|b_i| - \alpha}{n + \theta}$$, for the probability of sitting at an occupied table when $$|b_i|\ge1$$. The probability of sitting at any of the $$L-|B|$$ unoccupied tables also agrees with the general formula and is given by



$$ \sum_{i: |b_i| = 0} P(\ell_{n+1}=i\mid \ell_1,\ldots,\ell_n) = \frac{(L-|B|)\gamma}{n + L\gamma} = \frac{\theta + |B| \alpha}{n + \theta} $$

The marginal probability for the labels is given by



$$ P(\ell_1,\ldots,\ell_n) = P(\ell_1)\prod_{t=1}^{n-1} P(\ell_{t+1}\mid\ell_1,\ldots,\ell_t) = \frac{\prod_{i=1}^L\gamma^{\overline{\left|{b_i}\right|} }}{(L\gamma)^{\overline n}} $$

where $$P(\ell_1)=\frac1L$$ and $$x^{\overline m}=\prod_{i=0}^{m-1}(x+i)$$ is the rising factorial. In general, however, multiple label states correspond to the same partition. For a given partition, $$B$$, which has $$\left|B\right|\le L$$ blocks, the number of label states that correspond to this partition is given by the falling factorial, $$L^{\underline{\left|B\right|} }=\prod_{i=0}^{\left|B\right|-1}(L-i)$$. Taking this into account, the probability for the partition is



$$ \text{Pr}(B_n=B\mid\gamma,L) = L^{\underline{\left|B\right|}}\,\frac{\prod_{i=1}^L\gamma^{\overline{\left|{b_i}\right|} }}{(L\gamma)^{\overline n}} $$

which can be verified to agree with the general version of the partition probability that is given above in terms of the Pochhammer k-symbol. Notice again, that if $$B$$ is outside of the support, i.e. $$|B|>L$$, the falling factorial, $$L^{\underline{|B|}}$$ evaluates to zero as it should. (Practical implementations that evaluate the log probability for partitions via $$\log L^{\underline{|B|}}=\log\left|\Gamma(L+1)\right|-\log\left|\Gamma(L+1-|B|)\right|$$ will return $$-\infty$$, whenever $$|B|>L$$, as required.)
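The Dirichlet-categorical partition probability can be evaluated exactly as in the sketch below. Function names are illustrative; empty tables contribute a factor $$\gamma^{\overline 0}=1$$ and are therefore omitted from the list of block sizes.

```python
from fractions import Fraction


def rising(x, m):
    """Rising factorial x (x + 1) ... (x + m - 1)."""
    result = Fraction(1)
    for i in range(m):
        result *= x + i
    return result


def falling(x, m):
    """Falling factorial x (x - 1) ... (x - m + 1)."""
    result = Fraction(1)
    for i in range(m):
        result *= x - i
    return result


def dirichlet_categorical_partition_prob(block_sizes, gamma, L):
    """Probability of one partition with the given non-empty block sizes."""
    gamma = Fraction(gamma)
    n = sum(block_sizes)
    prob = falling(L, len(block_sizes))      # number of label states per partition
    for size in block_sizes:
        prob *= rising(gamma, size)
    return prob / rising(L * gamma, n)


if __name__ == "__main__":
    # gamma = 1/2 and L = 3 correspond to alpha = -1/2, theta = 3/2.
    print(dirichlet_categorical_partition_prob([2, 1], Fraction(1, 2), 3))
    print(dirichlet_categorical_partition_prob([1, 1, 1, 1], Fraction(1, 2), 3))
```

With $$\gamma=1/2$$ and $$L=3$$ the partition with block sizes 2 and 1 has probability $$6/35$$, which coincides with the Pochhammer form evaluated at $$\alpha=-1/2$$, $$\theta=3/2$$, namely $$\frac{(1)_{1,-1/2}}{(5/2)_{2,1}}\,(3/2)_{1,1} = \frac{3/2}{35/4}$$.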

Relationship between Dirichlet-categorical and one-parameter CRP
Consider on the one hand the one-parameter Chinese restaurant process, with $$\alpha=0$$ and $$\theta>0$$, which we denote $$\text{CRP}(\alpha=0,\theta)$$; and on the other hand the Dirichlet-categorical model with $$L$$ a positive integer and $$\gamma=\frac{\theta}{L}$$, which, as shown above, is equivalent to $$\text{CRP}(\alpha=-\frac{\theta}{L},\theta)$$. Since the discount $$-\frac{\theta}{L}$$ tends to $$0$$ as $$L\to\infty$$ while $$\theta$$ stays fixed, the seating probabilities of the Dirichlet-categorical model converge to those of $$\text{CRP}(0,\theta)$$. The Dirichlet-categorical model can therefore be made arbitrarily close to $$\text{CRP}(0,\theta)$$ by making $$L$$ large.

Stick-breaking process
The two-parameter Chinese restaurant process can equivalently be defined in terms of a stick-breaking process. For the case where $$0\le\alpha<1$$ and $$\theta>-\alpha$$, the stick-breaking process can be described as a hierarchical model, much like the above Dirichlet-categorical model, except that there is an infinite number of label states. The table labels are drawn independently from the infinite categorical distribution $$\mathbf p=(p_1,p_2,\ldots)$$, the components of which are sampled using stick breaking: start with a stick of length 1 and randomly break it in two; the length of the left part is $$p_1$$, and the right part is broken again recursively to give $$p_2,p_3,\ldots$$. More precisely, the left fraction, $$f_k$$, of the $$k$$-th break is sampled from the beta distribution:

$$ f_k\sim B(1-\alpha,\theta+k\alpha),\; \text{for }k\ge1\text{ and }0\le\alpha<1 $$

The categorical probabilities are:



$$ p_k=f_k\prod_{i=1}^{k-1}(1-f_i),\;\text{where the empty product evaluates to one.} $$

For the parameter settings $$\alpha<0$$ and $$\theta=-\alpha L$$, where $$L$$ is a positive integer, and where the categorical is finite: $$\mathbf p=(p_1,\ldots, p_L)$$, we can sample $$\mathbf p$$ from an ordinary Dirichlet distribution as explained above, but it can also be sampled with a truncated stick-breaking recipe, where the formula for sampling the fractions is modified to:



$$ f_k \sim B(-\alpha, \theta+k\alpha),\;\text{for }1\le k\le L-1\text{ and }\alpha<0 $$

and $$f_L=1$$.
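Both stick-breaking recipes can be sketched with `random.betavariate` from the Python standard library. The function names are assumptions of this example, and the infinite sequence is necessarily truncated after a chosen number of weights, with the unbroken remainder of the stick returned separately.

```python
import random


def stick_breaking_weights(theta, alpha, num_weights, rng=None):
    """First `num_weights` probabilities p_k for 0 <= alpha < 1 and theta > -alpha.

    Returns (weights, remaining), where `remaining` is the stick length not yet broken.
    """
    rng = rng or random.Random()
    weights = []
    remaining = 1.0                       # running product of (1 - f_i) for i < k
    for k in range(1, num_weights + 1):
        f_k = rng.betavariate(1 - alpha, theta + k * alpha)
        weights.append(f_k * remaining)
        remaining *= 1 - f_k
    return weights, remaining


def finite_stick_breaking_weights(alpha, L, rng=None):
    """All L probabilities for alpha < 0 and theta = -alpha * L."""
    rng = rng or random.Random()
    theta = -alpha * L
    weights = []
    remaining = 1.0
    for k in range(1, L):
        f_k = rng.betavariate(-alpha, theta + k * alpha)
        weights.append(f_k * remaining)
        remaining *= 1 - f_k
    weights.append(remaining)             # f_L = 1: the last piece takes what is left
    return weights


if __name__ == "__main__":
    w, rest = stick_breaking_weights(theta=1.0, alpha=0.5, num_weights=10, rng=random.Random(0))
    print(w, rest)
    print(finite_stick_breaking_weights(alpha=-0.5, L=4, rng=random.Random(0)))
```

Table labels for the customers can then be drawn independently from the resulting categorical distribution, for example with `random.choices`.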

The Indian buffet process
It is possible to adapt the model such that each data point is no longer uniquely associated with a class (i.e., we are no longer constructing a partition), but may be associated with any combination of the classes. This strains the restaurant-tables analogy and so is instead likened to a process in which a series of diners samples from some subset of an infinite selection of dishes on offer at a buffet. The probability that a particular diner samples a particular dish is proportional to the popularity of the dish among diners so far, and in addition the diner may sample from the untested dishes. This has been named the Indian buffet process and can be used to infer latent features in data.

Applications
The Chinese restaurant process is closely connected to Dirichlet processes and Pólya's urn scheme, and is therefore useful in applications of Bayesian statistics, including nonparametric Bayesian methods. The generalized (two-parameter) Chinese restaurant process is closely related to the Pitman–Yor process. These processes have been used in many applications, including modeling text, clustering biological microarray data, biodiversity modelling, and image reconstruction.