User:Mctinker/Affinity Propagation

The Affinity Propagation algorithm is an algorithm designed for Cluster analysis. In cluster analysis, the goal is to identify a subset of representative examples, based on a measure of similarity, and then group the remaining points around them. In traditional clustering algorithms such as k-means clustering, the initial "exemplars" (the representative sample of each cluster) are randomly selected; such an algorithm is usually rerun many times with different initializations in an attempt to find a good solution, and it works well only when the initial choice is close to a good solution. The Affinity Propagation algorithm introduces a new way to select these exemplars that avoids these drawbacks. The algorithm was first presented by Brendan J. Frey and Delbert Dueck in 2007.

Algorithm
The Affinity Propagation algorithm takes as input a collection of real-valued similarities between data points (a matrix S), denoted $$s(i,j)$$, which indicates how close (similar) two points are. One common choice for the similarity is the negative Euclidean distance, where applicable. When $$i=j$$, $$s(i,i)$$ stores the "preference" of point i: a value reflecting how suitable point i is to be selected as an exemplar. A point k with a higher $$s(k,k)$$ value is more likely to be chosen as an exemplar.
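As a sketch of this input, the following stdlib-only Python builds a similarity matrix from 2-D points using the negative Euclidean distance, with all preferences set to one common value (here the median of the off-diagonal similarities; `similarity_matrix` is an illustrative helper name, not from the paper):

```python
import math

# Hypothetical helper: build the similarity matrix S from 2-D points.
# Off-diagonal entries use the negative Euclidean distance; the diagonal
# "preferences" are all set to the median of the off-diagonal similarities.
def similarity_matrix(points):
    n = len(points)
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                (x1, y1), (x2, y2) = points[i], points[j]
                S[i][j] = -math.hypot(x1 - x2, y1 - y2)
    off_diag = sorted(S[i][j] for i in range(n) for j in range(n) if i != j)
    mid = len(off_diag) // 2
    pref = (off_diag[mid - 1] + off_diag[mid]) / 2  # even count: n*(n-1)
    for i in range(n):
        S[i][i] = pref
    return S

# Small demonstration on a 3-4-5 right triangle of points.
S = similarity_matrix([(0.0, 0.0), (0.0, 3.0), (4.0, 0.0)])
```

For the triangle above, the off-diagonal similarities are −3, −4, and −5 (each appearing twice), so every diagonal entry becomes the median, −4.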

Real-valued messages are exchanged between data points until a good set of exemplars and clusters emerges from the inputs. There are two kinds of messages, "responsibility" $$r(i,j)$$ and "availability" $$a(i,j)$$. The responsibility $$r(i,j)$$, sent from data point i to candidate exemplar point j (j is itself a data point), reflects how well suited point j is to serve as the exemplar for point i, taking into account other potential exemplars for i. The availability $$a(i,j)$$, sent from candidate exemplar j to point i, reflects how appropriate it would be for point i to choose point j as its exemplar, taking into account the support from other points for j being an exemplar. In each iteration, the algorithm updates these messages until the convergence conditions are satisfied. In other words, there are two further matrices, $$A$$ and $$R$$, which are updated in every iteration.

At first, all availabilities are set to zero, $$a(i,k)=0$$ for all i, k. Then, in each iteration, the matrix $$R$$ is updated first, using the formula
 * $$r(i,k) = s(i,k) - \max_{k' \neq k} \{\, a(i,k') + s(i,k') \,\}$$.

The next step is updating matrix $$A$$; there are two cases:
 * when i≠k, $$a(i,k) = \min\bigl(0,\, r(k,k) + f(k)\bigr)$$, where $$f(k) = \sum_{i' \notin \{i,k\}} \max\bigl(0,\, r(i',k)\bigr)$$;
 * when i=k, $$a(k,k) = f(k)$$, the same sum as above, which for $$i=k$$ runs over all $$i' \neq k$$.

From the matrices $$R$$ and $$A$$, for every point i, find the point k that maximizes $$a(i,k)+r(i,k)$$. This k is the exemplar for point i, and if $$i=k$$, point i itself is an exemplar.

Repeat the iterations until a stopping condition is satisfied.
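The update rules and the exemplar decision above can be sketched in Python as follows (undamped, with `S`, `R`, `A` as plain lists of lists; the function names are illustrative, not from the paper):

```python
# A minimal, undamped sketch of one message-passing iteration.
def update_messages(S, R, A):
    n = len(S)
    # Responsibilities: r(i,k) = s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
    for i in range(n):
        for k in range(n):
            R[i][k] = S[i][k] - max(A[i][kp] + S[i][kp]
                                    for kp in range(n) if kp != k)
    # Availabilities: a(i,k) = min(0, r(k,k) + f(k)) for i != k,
    # and a(k,k) = f(k), where f(k) sums the positive responsibilities.
    for k in range(n):
        for i in range(n):
            f = sum(max(0.0, R[ip][k]) for ip in range(n) if ip not in (i, k))
            A[i][k] = f if i == k else min(0.0, R[k][k] + f)

def exemplars(R, A):
    # For each point i, the exemplar is the k maximizing a(i,k) + r(i,k).
    n = len(R)
    return [max(range(n), key=lambda k: A[i][k] + R[i][k]) for i in range(n)]

# One iteration on the 4-point similarity matrix from the example below.
S = [[-4.00, -2.06, -5.16, -6.64],
     [-2.06, -4.00, -4.00, -5.50],
     [-5.16, -4.00, -4.00, -1.51],
     [-6.64, -5.50, -1.51, -4.00]]
R = [[0.0] * 4 for _ in range(4)]
A = [[0.0] * 4 for _ in range(4)]
update_messages(S, R, A)
```

Since all availabilities start at zero, this first responsibility update reduces to $$r(i,k) = s(i,k) - \max_{k' \neq k} s(i,k')$$.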

Pseudocode
The pseudocode for the basic version of the Affinity Propagation algorithm follows:

set a[i][k] ← 0 for all i, k
while the convergence conditions are not satisfied
    for each i, k
        find the k' ≠ k that maximizes a[i][k'] + s[i][k']
        r[i][k] ← s[i][k] − (a[i][k'] + s[i][k'])
    end for
    for each i, k
        sum ← 0
        for each i' ∉ {i, k}
            if r[i'][k] > 0 then
                sum ← sum + r[i'][k]
            end if
        end for
        if i == k then
            a[i][k] ← sum
        else if r[k][k] + sum < 0 then
            a[i][k] ← r[k][k] + sum
        else
            a[i][k] ← 0
        end if
    end for
    for each i
        find the k that maximizes a[i][k] + r[i][k]
        set k as the exemplar of i
        put point i into the cluster of k
    end for
end while

Damping factor
The matrices $$R$$ and $$A$$ are updated in each iteration, but it is important that the updates be damped to avoid the numerical oscillations that arise in some circumstances. To this end, the algorithm introduces a damping factor λ; each message is then updated as follows:
 * $$\text{newvalue} = \lambda \cdot \text{oldvalue} + (1-\lambda) \cdot \text{updatevalue}$$

Here, updatevalue is the value computed by the update formulas above, and the factor λ lies between 0 and 1.
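The damped update is a one-liner; a minimal sketch (the helper name is illustrative, and λ is written `lam`):

```python
# Damped message update: mix the old value with the newly computed one.
# lam (the damping factor λ) lies between 0 and 1; higher values damp more.
def damp(old, new, lam=0.5):
    return lam * old + (1 - lam) * new

# First iteration starting from a zero message, as in the example below.
damped = damp(0.0, 1.94)
```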

Stop conditions
There are several possible conditions for terminating the iterations, such as:
 * a fixed number of iterations has been performed;
 * the sum of the changes of all messages in one iteration falls below a threshold;
 * the exemplar decisions have not changed for N iterations.
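The third condition can be sketched as follows (a hypothetical helper; `history` holds one exemplar-decision list per iteration):

```python
# Stop once the exemplar decisions have stayed identical for N iterations.
def converged(history, N=10):
    if len(history) < N:
        return False
    return all(h == history[-1] for h in history[-N:])

# Decisions unchanged for 10 iterations -> converged.
done = converged([[0, 0, 2, 2]] * 10)
```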

Example
Here is a small example to show how the algorithm works:

There are 4 points: point 1 (-2.3, 3.7), point 2 (-1.5, 1.8), point 3 (2.5, 1.8), and point 4 (4.0, 1.6).

Use the negative Euclidean distance for the similarities, and set the preferences to a common value (the median of the input similarities). Terminate the algorithm when the exemplar decisions have not changed for 10 iterations, and set the damping factor λ to 0.5. The algorithm then proceeds as follows:

At first, the matrix S is:

 -4.00  -2.06  -5.16  -6.64
 -2.06  -4.00  -4.00  -5.50
 -5.16  -4.00  -4.00  -1.51
 -6.64  -5.50  -1.51  -4.00

Matrices A and R are zero.
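As a sanity check, the off-diagonal entries of S can be reproduced from the four points (stdlib-only sketch; the diagonal entries are the common preference value, −4.00 in this example):

```python
import math

points = [(-2.3, 3.7), (-1.5, 1.8), (2.5, 1.8), (4.0, 1.6)]

# s(i, j) for i != j: the negative Euclidean distance between points i and j.
def s(i, j):
    (x1, y1), (x2, y2) = points[i], points[j]
    return -math.hypot(x1 - x2, y1 - y2)

print(round(s(0, 1), 2))  # -2.06  (points 1 and 2)
print(round(s(2, 3), 2))  # -1.51  (points 3 and 4)
```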

Then follow the algorithm and calculate $$r(i,j)$$, taking the damping factor λ into account. The matrix R becomes:

 -0.9700   0.9700  -1.5500  -2.2900
  0.9700  -0.9700  -0.9700  -1.7200
 -1.8250  -1.2450  -1.2450   1.2450
 -2.5650  -1.9950   1.2450  -1.2450

Next, calculate $$a(i,j)$$, again taking the damping factor λ into account. The matrix A becomes:

  0.4850  -0.4850   0.0000   0.0000
 -0.4850   0.4850   0.0000   0.0000
  0.0000   0.0000   0.6225  -0.6225
  0.0000   0.0000  -0.6225   0.6225

From matrices A and R, the exemplar decisions can be read off:

Point 1 selects point 2 as its exemplar, while point 2 selects point 1; point 3 selects point 4, and point 4 selects point 3.

Then iterate until the exemplar decisions do not change for 10 iterations.

The final result: points 1 and 2 belong to cluster 1, whose exemplar is point 1; points 3 and 4 belong to cluster 2, whose exemplar is point 3.

Number of clusters
Unlike other traditional clustering algorithms, we do not need to input the expected number of clusters when using the Affinity Propagation algorithm.

Here, the number of clusters (the number of identified exemplars) is influenced by the values of the input preferences, but also emerges from the message-passing procedure. If all data points are equally suitable as exemplars, the preferences should be set to a common value; this value can be varied to produce different numbers of clusters. For example, if the shared value is the median of the input similarities, the algorithm returns a moderate number of clusters; if it is set to the minimum similarity, it produces a small number of clusters.
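The two choices of shared preference can be sketched as follows (a hypothetical helper over a similarity matrix S stored as lists of lists):

```python
# Shared preference value from the off-diagonal similarities of S:
# the median gives a moderate number of clusters, the minimum a small one.
def shared_preference(S, mode="median"):
    n = len(S)
    vals = sorted(S[i][j] for i in range(n) for j in range(n) if i != j)
    if mode == "min":
        return vals[0]
    mid = len(vals) // 2
    return (vals[mid - 1] + vals[mid]) / 2  # n*(n-1) entries: always even

# The similarity matrix of the 4-point example above.
S = [[-4.00, -2.06, -5.16, -6.64],
     [-2.06, -4.00, -4.00, -5.50],
     [-5.16, -4.00, -4.00, -1.51],
     [-6.64, -5.50, -1.51, -4.00]]
```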

Complexity
For the Affinity Propagation algorithm, the running time depends on the number of iterations. With n samples, updating one $$r(i,k)$$ value costs $$O(n)$$ time, and there are $$n^2$$ values in matrix $$R$$; likewise, updating each $$a(i,k)$$ value takes $$O(n)$$ time. As a result, each iteration requires $$O(n^3)$$ time.

As for space, three n-by-n matrices ($$S$$, $$R$$, and $$A$$) must be stored, i.e. $$O(n^2)$$ space.

Correctness
As introduced above, $$r(i,j)$$ reflects how well suited point j is to serve as the exemplar for point i, taking into account other potential exemplars for i, while $$a(i,j)$$ reflects how appropriate it would be for point i to choose point j as its exemplar, taking into account the support from other points for j being an exemplar. If two points are more similar, they are more likely to be put into the same cluster, and in this algorithm they have a higher similarity value. From the update formula for $$r(i,j)$$, a high similarity yields a higher $$r(i,j)$$, meaning a high probability for j to serve as the exemplar for point i. At the same time, it makes the other $$r(i,j')$$ values lower, making every other j' less likely to serve as the exemplar for point i, which corresponds to "taking into account other potential exemplars for i".

Then, when updating $$a(i,j)$$, only the positive portions of the incoming responsibilities are added, because it is only necessary for a good exemplar to explain some data points well (positive responsibilities), regardless of how poorly it explains other data points (negative responsibilities). If the self-responsibility $$r(k,k)$$ is negative (indicating that point k is currently better suited to belong to another exemplar than to be an exemplar itself), the availability of point k as an exemplar can still be increased if some other points have positive responsibilities for point k being their exemplar. A higher $$r(k,k)$$ means k is better suited to the role of exemplar, and a high summation means stronger support, received from other points, for k to serve as an exemplar; together they yield a high $$a(i,k)$$, meaning i is more likely to choose k. Finally, $$a(k,k)$$ represents the accumulated evidence that point k is an exemplar, based on the support k receives.

We choose k as the exemplar for point i when k maximizes $$a(i,k)+r(i,k)$$; this combines the evidence that k is well suited to serve as the exemplar for point i with the evidence that i should select k as its exemplar.

So the algorithm returns a good clustering result. In the experiments reported by Frey and Dueck, it not only returned good results but also outperformed many traditional algorithms.

Applications
There are many applications based on this algorithm, such as:
 * Constructing Treatment Portfolios Using Affinity Propagation;
 * Solving the Uncapacitated Facility Location Problem Using Message Passing Algorithms;
 * Hierarchical Affinity Propagation;
 * Non-metric affinity propagation for unsupervised image categorization;
 * FLoSS: Facility Location for Subspace Segmentation.

Take the clustering of DNA segments (corresponding to putative exons) as an example. The measure of similarity between putative exons was based on their proximity in the genome and the degree of coordination of their transcription levels across tissues. Most similarities were set to -∞, corresponding to distant DNA segments that could not possibly be part of the same gene; the algorithm was then modified so that no messages are exchanged between two samples whose similarity is negative infinity. The algorithm worked very well on this problem, being both efficient and producing good clustering results.

Related algorithms
There are many clustering algorithms (see Cluster analysis), such as the k-means algorithm, fuzzy clustering, and Hierarchical clustering. Compared with traditional algorithms, the Affinity Propagation algorithm is more efficient and tends to lead to better results. What's more, it can handle problems that traditional algorithms cannot: in some realistic problems, for three points $$A,B,C$$, $$d(A,B)+d(B,C)$$ may be smaller than $$d(A,C)$$, violating the triangle inequality; traditional algorithms that assume a metric cannot handle this, but the Affinity Propagation algorithm can. However, it is important to choose proper preference values, and another caveat is that the Affinity Propagation algorithm's advantages show mainly when n is large enough (in some examples, n > 1000).