User:Therustyone

= Data Clustering using the Information Bottleneck =

This application of the bottleneck method to non-Gaussian sampled data is described in [1]. The treatment there involves two independent phases: first, estimation of the unknown parent probability densities from which the data samples are drawn, and second, the use of these densities within the information-theoretic framework of the bottleneck.

== Density Estimation ==
Since the bottleneck method is framed in probabilistic rather than statistical terms, we first need to estimate the underlying probability density at the sample points $$X$$. This is a well-known problem with a number of solutions [2]. In the present method, probability densities at the sample points are found by a Markov transition matrix method, which has some mathematical synergy with the bottleneck method itself.

Define an arbitrary increasing function $$f \,$$ of the distance between sample pairs and define the distance matrix $$d_{i,j}=f \Big ( \Big| x_i - x_j \Big | \Big )$$. Then compute transition probabilities between sample pairs $$P_{i,j}=\exp (- \lambda d_{i,j} ) \,$$ for some $$\lambda > 0 \,$$. Treating the samples as states, and $$P \,$$ as a Markov state transition probability matrix, the vector of probabilities of the 'states' after $$t$$ steps, conditioned on the initial state $$p(0) \,$$, is $$p(t)=P^t p(0) \,$$. We are interested here only in the equilibrium probability vector $$p(\infty ) \,$$, given in the usual way by the dominant eigenvector of matrix $$P \,$$, which is independent of the initialising vector $$p(0) \,$$. This Markov transition method establishes a probability at each sample point which is claimed to be proportional to the probability density there.
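The density estimation step above can be sketched in NumPy as follows. This is a minimal illustration, not the implementation from [1]: the function name is invented, $$f \,$$ is taken to be the squared distance used in the example at the end, and the columns of $$P \,$$ are normalised here so that it is a genuine stochastic matrix (an assumption the text leaves implicit).

```python
import numpy as np

def equilibrium_density(X, lam=3.0):
    """Markov-transition density estimate at the rows of X (a sketch)."""
    # Pairwise distances d_ij = f(|x_i - x_j|), with f(r) = r^2 assumed.
    diff = X[:, None, :] - X[None, :, :]
    d = np.sum(diff ** 2, axis=2)
    # Transition weights P_ij = exp(-lambda * d_ij), columns normalised so
    # that P maps probability vectors to probability vectors.
    P = np.exp(-lam * d)
    P /= P.sum(axis=0, keepdims=True)
    # Equilibrium vector p(inf): dominant eigenvector of P (eigenvalue 1),
    # independent of the initialising vector p(0).
    w, v = np.linalg.eig(P)
    p = np.abs(np.real(v[:, np.argmax(np.real(w))]))
    return p / p.sum()
```

The returned vector plays the role of the marginal sample distribution $$p(x) \,$$ used by the clustering iteration below.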

== Clusters ==
In the following, the reference vector $$Y \,$$ contains the sample categories and the joint probability $$p(X,Y) \,$$ is assumed known. A cluster $$\tilde x_k \,$$ is defined by its probability distribution over the data samples $$x: \,\,\, p( \tilde x_k |x)$$. In [1] Tishby et al. present the following iterative set of equations for determining the clusters

$$\begin{cases} p(\tilde x|x)=Kp(\tilde x) \exp \Big( -\beta\,D^{KL} \Big[ p(y|x) \,|| \, p(y| \tilde x)\Big ] \Big)\\ p(y| \tilde x)=\textstyle \sum_x p(y|x)p( \tilde x | x) p(x) \big / p(\tilde x) \\ p(\tilde x) = \textstyle \sum_x p(\tilde x | x) p(x) \\ \end{cases} $$

The function of each line of the iteration is expanded as follows.

Line 1: This is a matrix-valued set of conditional probabilities

$$A_{i,j} = p(\tilde x_i | x_j )=Kp(\tilde x_i) \exp \Big( -\beta\,D^{KL} \Big[ p(y|x_j) \,|| \, p(y| \tilde x_i)\Big ] \Big)$$

The Kullback–Leibler distance $$D^{KL} \,$$ between the $$Y \,$$ distributions generated by a data sample $$x \,$$ and those generated by its reduced-information proxy $$\tilde x \,$$ assesses the fidelity of the compressed vector with respect to the categorical data $$Y \,$$, in accordance with the fundamental bottleneck equation. Here $$D^{KL}(a||b)\,$$ is the Kullback–Leibler distance between distributions $$a, b \,$$

$$D^{KL}(a \, || \, b)= \int a(x) \log \Big ( \frac{a(x)}{b(x)} \Big ) \, dx $$

and $$K \,$$ is a scalar normalization. The weighting by the negative exponential of the distance means that prior cluster probabilities are downweighted in line 1 when the Kullback–Leibler distance is large, so successful clusters grow in probability while unsuccessful ones decay.
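The line-1 update can be sketched concretely as below. This is a minimal NumPy sketch: the names `p_xt` (the cluster priors $$p(\tilde x) \,$$) and `D` (the matrix of Kullback–Leibler distances $$D_{i,j}^{KL} \,$$) are illustrative, and the column-wise normalisation plays the role of $$K \,$$.

```python
import numpy as np

def line1_update(p_xt, D, beta=2.5):
    """p(xt_i | x_j) = K p(xt_i) exp(-beta D_ij), normalised over clusters i."""
    A = p_xt[:, None] * np.exp(-beta * D)   # unnormalised weights
    return A / A.sum(axis=0, keepdims=True) # K = per-column normaliser
```

Each column of the result is a probability distribution over clusters for one sample, so clusters with large KL distance to that sample receive exponentially less weight.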

Line 2: This is a second matrix-valued set of conditional probabilities

$$B_{i,j}=p(y_i | \tilde x_j ) = \sum_k p(y_i | x_k ) p(\tilde x_j | x_k ) p(x_k) \big / p(\tilde x_j )$$

The steps in deriving this are as follows. We have, by definition

$$\begin{align} p(y|\tilde x) & = \int_x p(y|x)p(x|\tilde x) \\ & =\int_x p(y|x)p(x, \tilde x ) \big / p(\tilde x) \\ & =\int_x p(y|x)p(\tilde x | x) p(x) \big / p(\tilde x) \\ \end{align}$$

where the Bayes identities $$p(a,b)=p(a|b)p(b)=p(b|a)p(a) \,$$ are used. Finally the integral is rewritten as the summation over the sample points $$k$$ as in the first equation above.

Line 3: This line finds the marginal distribution of $$\tilde x \,$$

$$\begin{align} p(\tilde x_i) & =\sum_j p(\tilde x_i | x_j) p(x_j) \\ & = \sum_j p(\tilde x_i, x_j) \end{align}$$

This is also derived from standard results.

Further inputs to the algorithm are the marginal sample distribution $$p(x) \,$$, which has already been determined from the dominant eigenvector of $$P \,$$, and the matrix-valued Kullback–Leibler distance function

$$D_{i,j}^{KL}=D^{KL} \Big[ p(y|x_j) \,|| \, p(y| \tilde x_i)\Big ]$$ derived from the sample spacings and transition probabilities.

The matrices $$p(y_i | \tilde x_j), p(\tilde x_i | x_j)$$ can be initialised randomly.
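The full three-line iteration can be sketched as follows, assuming $$p(y|x) \,$$ and $$p(x) \,$$ are already known. The function and variable names, the fixed iteration count, and the small numerical guards are illustrative assumptions, not details from [1].

```python
import numpy as np

def kl(a, b, eps=1e-12):
    """Discrete KL distance D(a || b), columns treated as distributions."""
    return np.sum(a * np.log((a + eps) / (b + eps)), axis=0)

def ib_cluster(py_x, px, n_clusters, beta=2.5, n_iter=100, seed=0):
    """Iterate the three bottleneck equations to a self-consistent solution.

    py_x : array [n_y, n_x], columns are p(y | x_j)
    px   : array [n_x], the marginal sample distribution p(x)
    """
    rng = np.random.default_rng(seed)
    n_y, n_x = py_x.shape
    # Random initialisation of p(xt | x), as noted above.
    pxt_x = rng.random((n_clusters, n_x))
    pxt_x /= pxt_x.sum(axis=0, keepdims=True)
    for _ in range(n_iter):
        # Line 3: cluster marginals p(xt_i) = sum_j p(xt_i|x_j) p(x_j).
        pxt = pxt_x @ px
        # Line 2: p(y|xt) = sum_x p(y|x) p(xt|x) p(x) / p(xt).
        py_xt = (py_x * px) @ pxt_x.T / (pxt + 1e-12)
        # Line 1: KL-weighted re-assignment of samples to clusters.
        D = np.stack([kl(py_x, py_xt[:, [i]]) for i in range(n_clusters)])
        A = pxt[:, None] * np.exp(-beta * D)
        pxt_x = A / A.sum(axis=0, keepdims=True)
    return pxt_x, py_xt, pxt
```

Each pass recomputes the cluster marginals (line 3), the per-cluster category distributions (line 2), and the soft assignments (line 1), in that order.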

== Defining Decision Contours ==
To categorize a new sample $$ x' \,$$ external to the training set $$X$$, first calculate the probability that it belongs to each of the clusters, i.e. the conditional probability $$p(\tilde x | x' ) \,$$. To find this, apply the previous distance function to obtain the transition probabilities between $$ x' \,$$ and all samples in $$X \,$$, $$P(x_i)= \exp \Big( -\lambda \, f \Big ( \Big| x_i - x' \Big | \Big ) \Big)$$. Secondly, apply the last two lines of the 3-line algorithm to get the cluster and conditional category probabilities.

$$\begin{align} p(\tilde x_i ) & = \sum_j p(\tilde x_i | x_j)p(x_j) \\ p(y_i | \tilde x_j) & = \sum_k p(y_i | x_k) p(\tilde x_j | x_k)p(x_k) / p(\tilde x_j ) \\ \end{align}$$

Finally we have

$$p(y_i)= \sum_j p(y_i | \tilde x_j) p(\tilde x_j)$$
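This classification step might be sketched as below, under the same array conventions as the training sketch: `X` holds the training samples, and `pxt_x`, `py_x` stand in for $$p(\tilde x|x) \,$$ and $$p(y|x) \,$$ from a completed training run. The normalised transition probabilities from $$x' \,$$ play the role of $$p(x) \,$$ in lines 2 and 3.

```python
import numpy as np

def classify(x_new, X, pxt_x, py_x, lam=3.0):
    """Category probabilities p(y) for a sample outside the training set."""
    # Transition weights from x_new to each training sample, f(r) = r^2
    # assumed, normalised to act as p(x).
    d = np.sum((X - x_new) ** 2, axis=1)
    px = np.exp(-lam * d)
    px /= px.sum()
    # Line 3: cluster probabilities given the new sample.
    pxt = pxt_x @ px
    # Line 2: category distribution of each cluster.
    py_xt = (py_x * px) @ pxt_x.T / pxt
    # Mixture over clusters gives p(y_i) = sum_j p(y_i|xt_j) p(xt_j).
    return py_xt @ pxt
```

Scanning `x_new` over the input space and thresholding the returned probabilities traces out the decision contours.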

Generally the algorithm converges rapidly, often in tens of iterations. However, the parameter $$\beta \,$$ must be kept under close supervision since, as it is increased from zero, increasing numbers of features in the category probability space click into focus at certain critical values.

There is some analogy between this algorithm and a neural network with a single hidden layer. The hidden nodes are represented by the clusters $$\tilde x_j \,$$. The first and second layers of network weights are the conditional probabilities $$p(\tilde x_i | x_j)$$ and $$p(y_i | \tilde x_j) $$ respectively. However, unlike a standard neural network, the present algorithm always uses probabilities of samples as inputs rather than the sample values themselves, and the non-linear functions are encapsulated in the Kullback–Leibler distances and the transition probabilities rather than in sigmoid functions. Compared to a neural network this algorithm seems to converge much more quickly, and by varying $$\beta \,$$ and $$\lambda \,$$ various levels of focus on features can be achieved. There are also similarities to some varieties of Fuzzy Logic algorithms.

For blind classification and clustering, the transient behaviour of $$p(t) \,$$ is analysed; this is discussed in more detail in [2], but this extra complication is not necessary for the supervised training described here.

== An Example ==
In the following simple case we investigate clustering for a four-quadrant multiplier with random inputs $$u, v \,$$ and two categories of output, $$\pm 1 \,$$, generated by $$y=\mathrm{sign}(uv) \,$$. This function has the property that there are two spatially separated clusters for each category, and so it demonstrates that the method can handle such distributions.

20 samples are taken, uniformly distributed on the square $$[-1,1]^2 \,$$. Using more clusters than the number of categories (two in this case) has little effect on performance, and the results are shown for two clusters using parameters $$\lambda = 3,\, \beta = 2.5 \,$$ and the distance function $$d_{i,j} = f \Big ( \Big| X_i - X_j \Big | \Big ) = \Big| X_i - X_j \Big |^2$$ where $$X_i = (u_i,v_i)^T \,$$. The figure shows the locations of the twenty samples, with '0' representing $$Y = 1 \,$$ and 'x' representing $$Y = -1 \,$$. The contour at the unity likelihood ratio, $$L= Pr(1) \big/ Pr(-1) = 1$$, is shown as a new sample $$x' \,$$ is scanned over the square. Theoretically the contour should align with the $$u=0 \,$$ and $$v=0 \,$$ axes, but for such small sample numbers it has instead followed the spurious clusterings of the sample points.
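The setup of this example can be reproduced as follows. This is a hedged sketch: the random seed is arbitrary (so the particular sample layout will differ from the figure), and ties with $$uv = 0 \,$$ are assigned to category $$+1 \,$$ for convenience.

```python
import numpy as np

# 20 samples uniform on [-1, 1]^2; inputs are (u, v) pairs.
rng = np.random.default_rng(42)
X = rng.uniform(-1.0, 1.0, size=(20, 2))
# Categories y = sign(u * v); the (measure-zero) tie uv = 0 goes to +1.
y = np.where(X[:, 0] * X[:, 1] >= 0, 1.0, -1.0)
# p(y|x) as a hard (one-hot) distribution over the categories +1, -1.
py_x = np.stack([(y == 1).astype(float), (y == -1).astype(float)])
# Distance matrix d_ij = |X_i - X_j|^2 and transition weights with lambda = 3.
d = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
P = np.exp(-3.0 * d)
```

Feeding `py_x` and the equilibrium distribution of (a normalised) `P` into the three-line iteration with $$\beta = 2.5 \,$$, and scanning a new sample over the square, reproduces the experiment described above.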