
=k-means++ Algorithm=

The k-means++ clustering algorithm was proposed in 2007 by David Arthur and Sergei Vassilvitskii as an approximation algorithm for the NP-hard k-means problem.

==Background==
The k-means problem is to find cluster centers that minimize the sum of squared distances from each data point being clustered to its nearest cluster center. Although finding an exact solution to the k-means problem for arbitrary input is NP-hard, the standard approach to finding an approximate solution (often called Lloyd's algorithm or simply the k-means algorithm) is widely used and frequently finds reasonable solutions quickly.
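The alternation between assignment and update steps described above can be sketched in pure Python; the function and variable names here are illustrative, not from any particular library.

```python
def squared_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def lloyds(points, centers, max_iters=100):
    """A minimal sketch of Lloyd's algorithm: points and centers are
    lists of coordinate tuples; returns final centers and labels."""
    for _ in range(max_iters):
        # Assignment step: attach each point to its nearest center.
        labels = [min(range(len(centers)), key=lambda j: squared_dist(p, centers[j]))
                  for p in points]
        # Update step: move each center to the mean of its assigned points.
        new_centers = []
        for j, c in enumerate(centers):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(c)  # keep an empty cluster's center in place
        if new_centers == centers:  # converged: assignments can no longer change
            return new_centers, labels
        centers = new_centers
    return centers, labels
```

Note that the quality of the result depends entirely on the initial `centers`, which is exactly the issue k-means++ addresses.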

However, the k-means algorithm has at least two major theoretical shortcomings:
 * First, the worst-case running time of the algorithm has been shown to be super-polynomial in the input size.
 * Second, the approximation found can be arbitrarily bad with respect to the objective function compared to the optimal clustering.

In a nutshell, k-means++ addresses the second of these shortcomings by specifying a procedure to initialize the cluster centers before proceeding with the standard k-means optimization iterations. With the k-means++ initialization, the algorithm is guaranteed to find a solution whose expected cost is O(log k)-competitive with the optimal k-means solution.

==Example Bad Case==
To illustrate how the k-means algorithm can perform arbitrarily poorly with respect to the objective function (the sum of squared distances from points to their assigned cluster centers), consider four points in $$\mathbb{R}^2$$ that form an axis-aligned rectangle whose width is somewhat larger than its height.

If $$k=2$$ and the two initial cluster centers lie at the midpoints of the top and bottom edges of the rectangle formed by the four data points, the k-means algorithm converges without moving these cluster centers, and the resulting cluster assignments are consequently suboptimal.

Now consider stretching the rectangle horizontally to an arbitrary width. The standard k-means algorithm still clusters the points suboptimally, and by increasing the horizontal distance between the two data points in each cluster, we can make the algorithm perform arbitrarily poorly with respect to the k-means objective function.
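The bad case can be checked numerically; the following is a sketch (not from the original paper) using the four corners of a wide rectangle, with the two initial centers placed at the midpoints of the top and bottom edges.

```python
W = 1000.0  # rectangle width (illustrative choice); height is 1

points = [(0.0, 0.0), (W, 0.0), (0.0, 1.0), (W, 1.0)]
centers = [(W / 2, 0.0), (W / 2, 1.0)]  # bottom-edge and top-edge midpoints

def squared_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

# One assignment step: both bottom corners attach to the bottom center and
# both top corners to the top center, so the centroids do not move and
# Lloyd's algorithm has already converged.
labels = [min(range(2), key=lambda j: squared_dist(p, centers[j])) for p in points]
cost = sum(squared_dist(p, centers[l]) for p, l in zip(points, labels))

# Converged cost: 4 * (W/2)^2 = W^2. The optimal clustering (left pair vs
# right pair, centers at the midpoints of the vertical edges) costs only
# 4 * (1/2)^2 = 1, so the cost ratio W^2 grows without bound as W increases.
optimal_cost = 1.0
print(cost / optimal_cost)  # prints 1000000.0
```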

==Initialization Algorithm==
The intuition behind this approach is to spread the k initial cluster centers away from each other. The first cluster center is chosen uniformly at random from the data points being clustered. Each subsequent cluster center is then chosen from the remaining data points with probability proportional to the squared distance from the point to its closest already-chosen cluster center.
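The seeding procedure above can be sketched in pure Python; the cumulative-weight sampling below is one common way to draw a point with probability proportional to its squared distance (names are illustrative).

```python
import random

def squared_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeanspp_init(points, k, rng=random):
    """Sketch of k-means++ seeding: returns k initial centers chosen
    from 'points' (a list of coordinate tuples) by D^2 weighting."""
    # First center: uniformly at random from the data points.
    centers = [rng.choice(points)]
    while len(centers) < k:
        # D(x)^2: squared distance from each point to its nearest chosen center.
        d2 = [min(squared_dist(p, c) for c in centers) for p in points]
        # Sample the next center with probability proportional to D(x)^2
        # by walking the cumulative weights until we pass a random threshold.
        r = rng.random() * sum(d2)
        cum = 0.0
        for p, w in zip(points, d2):
            cum += w
            if cum >= r:
                centers.append(p)
                break
    return centers
```

The resulting centers are then handed to the standard k-means iterations; already-chosen centers have zero weight, so (for distinct data points) they are effectively never picked twice.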