Automatic clustering algorithms

Automatic clustering algorithms are algorithms that can perform clustering without prior knowledge of data sets. In contrast with other cluster analysis techniques, automatic clustering algorithms can determine the optimal number of clusters even in the presence of noise and outlier points.

Centroid-based
Given a set of n objects, centroid-based algorithms create k partitions based on a dissimilarity function, such that k≤n. A major problem in applying this type of algorithm is determining the appropriate number of clusters for unlabeled data. Therefore, most research in clustering analysis has been focused on the automation of the process.

Automated selection of k in the k-means clustering algorithm, one of the most widely used centroid-based clustering algorithms, is still a major problem in machine learning. The most widely accepted solution to this problem is the elbow method. It consists of running k-means clustering on the data set for a range of values of k, calculating the sum of squared errors for each, and plotting the results in a line chart. If the chart looks like an arm, the best value of k will be at the "elbow", the point beyond which adding more clusters yields only a small further reduction in error.
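The elbow method can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not a reference implementation: the helper names (`farthest_first`, `kmeans_sse`, `elbow`), the deterministic farthest-first seeding, and the "largest second difference" rule for locating the elbow numerically are all choices made for this sketch; in practice the elbow is often read off the plot by eye.

```python
import math

def farthest_first(points, k):
    """Deterministic seeding (an assumption of this sketch): start at the
    first point, then repeatedly add the point farthest from all centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def kmeans_sse(points, k, iterations=100):
    """Run Lloyd's k-means and return the sum of squared errors (SSE)."""
    centers = farthest_first(points, k)
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # Each center moves to the mean of its group; an empty group keeps its center.
        updated = [tuple(sum(axis) / len(g) for axis in zip(*g)) if g else c
                   for g, c in zip(groups, centers)]
        if updated == centers:
            break
        centers = updated
    return sum(min(math.dist(p, c) for c in centers) ** 2 for p in points)

def elbow(sse_by_k):
    """Pick the interior k where the SSE curve bends most sharply
    (largest second difference); a heuristic stand-in for reading the plot."""
    ks = sorted(sse_by_k)
    return max(ks[1:-1], key=lambda k: sse_by_k[k - 1] - 2 * sse_by_k[k] + sse_by_k[k + 1])

# Toy data: three well-separated groups of four points each.
square = [(0, 0), (0, 1), (1, 0), (1, 1)]
points = [(x + dx, y + dy) for dx, dy in [(0, 0), (10, 0), (5, 9)] for x, y in square]

sse_by_k = {k: kmeans_sse(points, k) for k in range(1, 7)}
best_k = elbow(sse_by_k)
print(best_k)  # 3
```

On this toy data the error drops from 422 at k = 1 to about 6 at k = 3 and can barely improve afterwards, so the bend is unambiguous. On real data the curve is often smoother and the choice less clear-cut.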

Another method that modifies the k-means algorithm to choose the optimal number of clusters automatically is the G-means algorithm. It was developed from the hypothesis that a subset of the data follows a Gaussian distribution. Thus, k is increased until the data assigned to each k-means center appear Gaussian. This algorithm requires only the standard statistical significance level as a parameter and does not set limits for the covariance of the data.
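The per-cluster check behind this idea can be illustrated with an Anderson-Darling normality test on one-dimensional data. This is only a sketch of the decision "does this cluster look Gaussian, or should it be split?", not the full G-means procedure: the splitting into two child centers and the projection of multidimensional data onto the line joining them are omitted. The function names are hypothetical, and the small-sample correction factor and the default critical value of 1.8692 (a value commonly quoted for a significance level of 0.0001) should be treated as assumptions to verify against a statistical table.

```python
import math
from statistics import NormalDist, mean, stdev

def anderson_darling(values):
    """Anderson-Darling statistic for normality, with mean and variance
    estimated from the sample and a small-sample correction applied."""
    n = len(values)
    mu, sigma = mean(values), stdev(values)
    z = sorted((v - mu) / sigma for v in values)
    cdf = [NormalDist().cdf(x) for x in z]
    total = sum((2 * i + 1) * (math.log(cdf[i]) + math.log(1 - cdf[n - 1 - i]))
                for i in range(n))
    a_squared = -n - total / n
    return a_squared * (1 + 4 / n - 25 / n ** 2)   # correction factor: an assumption

def looks_gaussian(values, critical=1.8692):
    """True if the sample passes the test, i.e. the cluster would be kept as is."""
    return anderson_darling(values) <= critical

n = 40
# An idealized bell-shaped sample built from evenly spaced normal quantiles.
bell = [NormalDist(0, 1).inv_cdf((i + 0.5) / n) for i in range(n)]
# Two separated lumps that a single center would describe poorly.
two_lumps = [-5 + 0.1 * j for j in range(20)] + [5 + 0.1 * j for j in range(20)]

print(looks_gaussian(bell))       # True: keep one center
print(looks_gaussian(two_lumps))  # False: split and increase k
```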

Connectivity-based (hierarchical clustering)
Connectivity-based clustering or hierarchical clustering is based on the idea that objects have more similarities to other nearby objects than to those further away. Therefore, the generated clusters from this type of algorithm will be the result of the distance between the analyzed objects.

Hierarchical models can be either divisive, where partitions are formed by successively splitting the entire data set, or agglomerative, where each partition begins with a single object and additional objects are added to the set. Although hierarchical clustering has the advantage of allowing any valid metric to be used as the defined distance, it is sensitive to noise and fluctuations in the data set and is more difficult to automate.
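To make the agglomerative case concrete, the sketch below builds a single-linkage hierarchy and then chooses where to cut it. The cut rule used here, stopping at the largest jump between consecutive merge distances, is a simple heuristic assumed for illustration; it is not one of the published automated methods discussed in this article, and the function names are hypothetical.

```python
import math
from itertools import combinations

def single_linkage_merge_distances(points):
    """Distances at which single linkage merges clusters, in ascending
    order (equivalently, the edge weights of a minimum spanning tree)."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(len(points)), 2))
    merges = []
    for distance, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:                # the edge joins two different clusters
            parent[root_i] = root_j
            merges.append(distance)
    return merges

def cut_at_largest_gap(points):
    """Return (number of clusters, distance threshold) for a cut placed in
    the largest gap between consecutive merge distances. Needs >= 3 points."""
    merges = single_linkage_merge_distances(points)
    gaps = [later - earlier for earlier, later in zip(merges, merges[1:])]
    last_kept = max(range(len(gaps)), key=gaps.__getitem__)
    n_clusters = len(points) - (last_kept + 1)
    threshold = (merges[last_kept] + merges[last_kept + 1]) / 2
    return n_clusters, threshold

square = [(0, 0), (0, 1), (1, 0), (1, 1)]
points = [(x + dx, y + dy) for dx, dy in [(0, 0), (10, 0), (5, 9)] for x, y in square]

n_clusters, threshold = cut_at_largest_gap(points)
print(n_clusters)  # 3
```

Because single linkage merges on the closest pair of objects, a few noisy points lying between two groups can bridge them and remove the gap entirely, which illustrates the sensitivity to noise mentioned above.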

Methods have been developed to improve and automate existing hierarchical clustering algorithms, such as an automated version of single linkage hierarchical cluster analysis (HCA). This computerized method bases its success on a self-consistent outlier reduction approach followed by the building of a descriptive function that permits the definition of natural clusters. Discarded objects can also be assigned to these clusters. Essentially, one need not resort to external parameters to identify natural clusters. The information gathered from HCA, automated and reliable, can be summarized in a dendrogram with the number of natural clusters and the corresponding separation, an option not found in classical HCA. The method consists of two steps: the removal of outliers (a step applied in many filtering applications) and an optional classification that expands the clusters to cover the whole set of objects.

BIRCH (balanced iterative reducing and clustering using hierarchies) is an algorithm used to perform connectivity-based clustering for large data sets. It is regarded as one of the fastest clustering algorithms, but it is limited because it requires the number of clusters as an input. Therefore, new algorithms based on BIRCH have been developed that do not need the cluster count to be provided in advance, while preserving the quality and speed of the clustering. The main modification is to remove the final step of BIRCH, in which the user had to input the cluster count, and to improve the rest of the algorithm, referred to as tree-BIRCH, by optimizing a threshold parameter from the data. In the resulting algorithm, the threshold parameter is calculated from the maximum cluster radius and the minimum distance between clusters, which are often known. This method proved to be efficient for data sets of tens of thousands of clusters. Beyond that amount, a supercluster splitting problem is introduced. To address it, other algorithms have been developed, such as MDB-BIRCH, which reduces supercluster splitting with relatively high speed.
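The role of the threshold parameter can be illustrated with a flat, single-pass simplification of how BIRCH absorbs points into subclusters. Each subcluster is summarized by a clustering feature (count, linear sum, sum of squares), and a point joins the nearest subcluster only if the resulting radius stays within the threshold. This sketch omits the tree structure that gives BIRCH its speed, and it does not derive the threshold from the data as tree-BIRCH does: the value 1.5 is hand-picked for the toy data, and the class and function names are hypothetical.

```python
import math

class ClusteringFeature:
    """Summary of a subcluster: point count, linear sum, sum of squares."""

    def __init__(self, point):
        self.n = 1
        self.linear_sum = list(point)
        self.square_sum = sum(x * x for x in point)

    def centroid(self):
        return [s / self.n for s in self.linear_sum]

    def radius_if_added(self, point):
        """Root-mean-square distance to the centroid if `point` were absorbed."""
        n = self.n + 1
        linear_sum = [s + x for s, x in zip(self.linear_sum, point)]
        square_sum = self.square_sum + sum(x * x for x in point)
        variance = square_sum / n - sum((s / n) ** 2 for s in linear_sum)
        return math.sqrt(max(variance, 0.0))   # guard against rounding below zero

    def add(self, point):
        self.n += 1
        self.linear_sum = [s + x for s, x in zip(self.linear_sum, point)]
        self.square_sum += sum(x * x for x in point)

def absorb_stream(points, threshold):
    """Process points once, absorbing each into the nearest subcluster
    when the radius allows it and starting a new subcluster otherwise."""
    features = []
    for p in points:
        nearest = min(features, key=lambda f: math.dist(f.centroid(), p), default=None)
        if nearest is not None and nearest.radius_if_added(p) <= threshold:
            nearest.add(p)
        else:
            features.append(ClusteringFeature(p))
    return features

square = [(0, 0), (0, 1), (1, 0), (1, 1)]
points = [(x + dx, y + dy) for dx, dy in [(0, 0), (10, 0), (5, 9)] for x, y in square]

features = absorb_stream(points, threshold=1.5)   # hand-picked threshold (assumption)
print([f.n for f in features])  # [4, 4, 4]
```

With a well-chosen threshold no cluster count is needed: the number of subclusters falls out of the data. A threshold that is too small fragments the groups, and one that is too large merges them, which is why estimating it from the cluster radius and separation matters.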

Density-based
Unlike partitioning and hierarchical methods, density-based clustering algorithms are able to find clusters of any arbitrary shape, not only spheres.

Density-based clustering algorithms use autonomous machine learning to identify patterns based on geographical location and distance to a particular number of neighbors. They are considered autonomous because a priori knowledge of what constitutes a cluster is not required. Several methods of this type are available for finding clusters in the data. The fastest is DBSCAN, which uses a defined distance to differentiate between dense groups of information and sparser noise. HDBSCAN, in contrast, can self-adjust by using a range of distances instead of a specified one. Lastly, OPTICS creates a reachability plot based on the distance from neighboring features to separate noise from clusters of varying density.
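A compact, unoptimized version of DBSCAN shows how a single defined distance separates dense groups from sparse noise. This is a sketch for illustration: it uses a brute-force neighbor search rather than a spatial index, and the parameter names `eps` (the defined distance) and `min_pts` (the number of neighbors needed for a point to count as dense) follow common usage but are assumptions as far as this article is concerned.

```python
import math
from collections import deque

NOISE = -1

def dbscan(points, eps, min_pts):
    """Label each point with a cluster index, or NOISE (-1)."""
    labels = [None] * len(points)          # None means "not visited yet"

    def neighbors(i):
        # Brute-force search; the point itself counts as its own neighbor.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:           # not dense; may later become a border point
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:         # border point: joins but is not expanded
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # dense point: keep growing the cluster
                queue.extend(reachable)
    return labels

square = [(0, 0), (0, 1), (1, 0), (1, 1)]
points = [(x + dx, y + dy) for dx, dy in [(0, 0), (10, 0), (5, 9)] for x, y in square]
points.append((20, 20))                    # an isolated outlier

labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, -1]
```

The number of clusters is not supplied anywhere; it emerges from the two density parameters, and the isolated point is reported as noise rather than forced into a cluster.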

These methods still require the user to supply input parameters and therefore cannot be considered fully automatic. The Automatic Local Density Clustering Algorithm (ALDC) is an example of the new research focused on developing automatic density-based clustering. ALDC computes the local density and distance deviation of every point, which widens the difference between potential cluster centers and other points. This separation allows the algorithm to work automatically: it identifies the cluster centers and assigns each remaining point to the cluster of its nearest neighbor of higher density.
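The general idea of scoring points by local density and by distance to denser points, then propagating labels from the centers, can be sketched as follows. This is not the ALDC algorithm itself, whose exact density and deviation measures are not given here. It is a generic density-peak style illustration in which the `cutoff` radius, the score (density multiplied by distance to the nearest denser point), and the largest-gap rule for choosing centers are all assumptions of the sketch. Unlike ALDC as described, it still needs `cutoff` from the user.

```python
import math

def density_peak_clusters(points, cutoff):
    """Return (center indices, labels). Needs at least two points."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Local density: number of other points within the cutoff distance.
    density = [sum(1 for j in range(n) if j != i and dist[i][j] <= cutoff)
               for i in range(n)]
    order = sorted(range(n), key=lambda i: (-density[i], i))   # densest first
    delta, parent = [0.0] * n, [None] * n
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = max(dist[i])        # densest point: use its farthest distance
        else:
            # Nearest point that precedes i in the density ordering.
            parent[i] = min(order[:rank], key=lambda j: dist[i][j])
            delta[i] = dist[i][parent[i]]
    # Centers combine high density with a large distance to any denser point.
    score = [density[i] * delta[i] for i in range(n)]
    ranked = sorted(order, key=lambda i: -score[i])            # stable sort
    gaps = [score[a] - score[b] for a, b in zip(ranked, ranked[1:])]
    n_centers = max(range(len(gaps)), key=gaps.__getitem__) + 1
    centers = ranked[:n_centers]
    labels = [None] * n
    for label, c in enumerate(centers):
        labels[c] = label
    for i in order:                        # densest first, so parents are labeled
        if labels[i] is None:
            labels[i] = labels[parent[i]]
    return centers, labels

# Three plus-shaped groups: a dense middle point with four arms around it.
plus = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]
points = [(x + dx, y + dy) for dx, dy in [(0, 0), (10, 0), (5, 9)] for x, y in plus]

centers, labels = density_peak_clusters(points, cutoff=1.2)
print(sorted(centers))  # [0, 5, 10]
```

In the toy data the middle point of each group has four neighbors within the cutoff while every arm has one, so the three middle points stand out clearly and each arm inherits the label of its own middle point.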

In the automation of data density to identify clusters, research has also focused on artificially generating the algorithms. For instance, an Estimation of Distribution Algorithm (EDA) guarantees the generation of valid algorithms by means of a directed acyclic graph (DAG), in which nodes represent procedures (building blocks) and edges represent possible execution sequences between two nodes. The building blocks determine the EDA's alphabet or, in other words, any generated algorithm. In experimental results, the artificially generated clustering algorithms are compared with DBSCAN, a manually designed algorithm.