Compression Digest
compression/_posts/2015-07-29-clustering.md
Summary of Clustering Algorithms
[Literal] Clustering aims to divide a set of objects into meaningful groups where objects within a cluster are similar, and objects in different clusters are dissimilar. [Literal] Centroid-based partitioning methods include k-center and k-means. [Literal] Hierarchical methods allow exploration of various clustering results from a dendrogram. [Literal] Density-based methods are also mentioned but not detailed.
Key points
- [Literal] Clustering groups similar objects together and dissimilar objects apart.
- [Literal] K-center aims to find k centers that minimize the radius of the largest cluster, but it is NP-hard; a 2-approximation exists.
- [Literal] K-means initializes k random centroids, assigns points to the nearest centroid, and updates centroids as the average of assigned points until convergence.
- [Literal] K-means always terminates because the cost (distance) strictly decreases each round, and there's a finite number of possible centroid sets.
- [Literal] K-means++ seeding improves the initial centroid selection by choosing subsequent centroids with probabilities proportional to the squared distance to existing centroids, offering better approximation ratios than arbitrary seeding.
- [Literal] K-means struggles with clusters of differing sizes, densities, or non-globular shapes.
- [Literal] Hierarchical methods, particularly agglomerative ones, merge the most similar clusters iteratively.
- [Literal] A dendrogram from hierarchical clustering can yield different numbers of clusters (k) by cutting the tree at various levels.
- [Literal] An agglomerative approach using a binary search tree to store cluster distances takes O(n^2 log n) time.
- [Literal] The choice of distance function is crucial in hierarchical clustering.
- [Literal] Density-based clustering is mentioned as a category but not elaborated upon.
Patterns / reminders
- [AI Synthesis] The trade-off between computational complexity (NP-hard k-center) and practical approximation algorithms (k-means, k-means++) is a common theme in optimization problems.
- [AI Synthesis] Hierarchical clustering offers flexibility by allowing users to derive different clusterings from a single computed hierarchy, catering to varied analytical needs.
- [AI Synthesis] The "careful seeding" in k-means++ highlights the significant impact of initialization on the performance and convergence of iterative algorithms.
Sources
- (Source: raw/_posts/2015-07-29-clustering.md)