
To make sense of a gene expression matrix, biologists often rearrange its genes so that similar expression vectors correspond to consecutive rows. They then assign a color to each element of the matrix according to its expression level, resulting in a heatmap (shown below).
Figure: The heatmap of a gene expression matrix. The rows and columns of the matrix are ordered according to the order of the leaves in the tree constructed by hierarchical clustering.
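As an illustration, the following sketch reorders the rows of a (randomly generated placeholder) expression matrix by the leaf order of a hierarchical clustering tree and draws the resulting heatmap; the matrix, linkage method, and color map are arbitrary choices, not those used for the figure above.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, leaves_list

expression = np.random.rand(50, 10)                          # placeholder gene expression matrix
order = leaves_list(linkage(expression, method="average"))   # leaf order of the clustering tree

plt.imshow(expression[order], aspect="auto", cmap="viridis")  # one colored cell per matrix element
plt.xlabel("sample")
plt.ylabel("gene (reordered)")
plt.show()
```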
Given a set of k centers in multi-dimensional space, its Voronoi diagram is a partitioning of the space into k regions (called Voronoi cells), each containing exactly one center. A center's Voronoi cell consists of all points closer to that center than to any other. The figure below (source: https://en.wikipedia.org/wiki/Voronoi_diagram) shows the Voronoi diagram of 20 centers.
For a given center, the Centers to Clusters step of the Lloyd algorithm forms a cluster consisting of all data points in its Voronoi cell. Note that since all points in a cluster are located within a single Voronoi cell, and Voronoi cells are convex, the cluster's center of gravity is also located within the same Voronoi cell. Thus, since distinct Voronoi cells do not overlap, all centers of gravity constructed by the Lloyd algorithm are different.
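For concreteness, here is a minimal NumPy sketch of the two steps just described: the Centers to Clusters step assigns each point to the center of the Voronoi cell containing it, and the Clusters to Centers step moves each center to its cluster's center of gravity. The function names are illustrative, and empty clusters are not handled.

```python
import numpy as np

def centers_to_clusters(data, centers):
    """Assign each point to its nearest center, i.e., to the center whose
    Voronoi cell contains the point (ties broken by the lowest center index)."""
    dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def clusters_to_centers(data, labels, k):
    """Replace each center by the center of gravity of its cluster
    (assumes every cluster is nonempty)."""
    return np.array([data[labels == i].mean(axis=0) for i in range(k)])
```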
k-means++ Initializer requires O(n · k) distance computations, where n is the number of data points and k is the number of centers. It therefore results in roughly the same running time as a single iteration of the Lloyd algorithm.
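A sketch of the k-means++ initialization idea, assuming the standard scheme in which the first center is chosen uniformly at random and each subsequent center is sampled with probability proportional to its squared distance from the nearest center chosen so far (function and parameter names are illustrative):

```python
import numpy as np

def kmeans_plus_plus_initializer(data, k, rng=None):
    """Pick k initial centers; each new center is sampled with probability
    proportional to its squared distance from the nearest chosen center."""
    rng = np.random.default_rng() if rng is None else rng
    centers = [data[rng.integers(len(data))]]
    for _ in range(k - 1):
        # squared distance of every point to its nearest chosen center
        # (recomputed here for clarity; caching the minimum keeps the total at O(n·k))
        d2 = np.min([np.sum((data - c) ** 2, axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()
        centers.append(data[rng.choice(len(data), p=probs)])
    return np.array(centers)
```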
Since FarthestFirstTraversal is a deterministic algorithm, its result is determined by the choice of the first center (up to how ties are resolved). Thus, the number of possible initializations is limited by the number of data points. This may present a limitation if we want to run the Lloyd algorithm many times and select the best solution.
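A sketch of FarthestFirstTraversal, which is deterministic once the index of the first center is fixed (names are illustrative):

```python
import numpy as np

def farthest_first_traversal(data, k, first=0):
    """Each new center is the point farthest from the set of centers chosen so far."""
    centers = [data[first]]
    min_dist = np.linalg.norm(data - centers[0], axis=1)
    for _ in range(k - 1):
        idx = int(min_dist.argmax())                  # point farthest from all chosen centers
        centers.append(data[idx])
        min_dist = np.minimum(min_dist, np.linalg.norm(data - data[idx], axis=1))
    return np.array(centers)
```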
Let {n, k} denote the number of partitions of n points into k clusters. We will establish a recurrence relation to compute {n, k}. To do so, select an arbitrary point x, and note that there are two possibilities. First, x could belong to a cluster all by itself. In this case, the remaining points form k - 1 clusters, and we know that there are {n-1, k-1} ways to partition them. Second, x could belong to a cluster with other points. In this case, there are {n-1, k} ways to partition the other points into k clusters, and then there are k different choices for the cluster to which x belongs.
Combining these two cases results in the following recurrence relation for all n > 1 and all k satisfying 1 < k < n:
{n, k} = {n-1, k-1} + k · {n-1, k}
As for a base case, note that {n, 1} = 1 for all n ≥ 1 because there is only one way to partition n points into a single cluster. By the same reasoning, {n, n} = 1 for all n ≥ 1.
Exercise Break: Find a formula for {n, 2} in terms of n.
The numbers {n, k} are called Stirling numbers of the second kind.
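The recurrence and base cases above translate directly into a short memoized function (names are illustrative):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def partitions(n, k):
    """Number of ways to partition n points into k nonempty clusters,
    i.e., the Stirling number of the second kind {n, k}."""
    if k < 1 or k > n:
        return 0
    if k == 1 or k == n:          # base cases: {n, 1} = {n, n} = 1
        return 1
    return partitions(n - 1, k - 1) + k * partitions(n - 1, k)

print(partitions(10, 3))          # 9330
```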
Given two vectors a = (a1, …, an) and b = (b1, …, bn), their dot product is defined as the sum of pairwise products of their elements:

a · b = a1 · b1 + a2 · b2 + … + an · bn
Note that the two occurrences of the "·" symbol in this equation mean different things. On the left side, it represents the dot product of vectors, which we have just defined. On the right side, it indicates a normal product of two numbers ai and bi.
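In code, the definition is a one-liner; the vectors below are arbitrary examples:

```python
a = (1, 2, 3)
b = (4, 5, 6)
dot = sum(ai * bi for ai, bi in zip(a, b))   # 1*4 + 2*5 + 3*6 = 32
```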
Before clustering even starts, biologists have to decide how to represent expression vectors in multi-dimensional space. Some representations may result in better clustering outcomes than others (see the FAQ on how scaling the input data affects clustering). It turns out that taking logarithms of the expression values results in better clustering, but there is no good theoretical justification for this observation.
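For instance, a log-transform of an expression matrix might look as follows; the matrix is a random placeholder, and the pseudocount of 1 is one common but arbitrary way to avoid taking the logarithm of zero:

```python
import numpy as np

expression = np.abs(np.random.randn(50, 10)) * 100   # placeholder expression values
log_expression = np.log2(expression + 1)              # pseudocount of 1 avoids log(0)
```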
The silhouette value is a popular approach for evaluating the quality of a clustering. For each point DataPoint, define ClusterDistance(DataPoint) as the average distance between this point and all other points within the same cluster. We also define Distance(DataPoint, Cluster) as the average of the distances from DataPoint to all points in Cluster. The cluster minimizing Distance(DataPoint, Cluster), among all clusters that do not contain DataPoint, is called the neighboring cluster of DataPoint, and we refer to this minimum distance as NeighborDistance(DataPoint). We define
Separation(DataPoint) = NeighborDistance(DataPoint) - ClusterDistance(DataPoint)
The silhouette value Silhouette(DataPoint) is defined as follows:
Silhouette(DataPoint) = Separation(DataPoint)/max(NeighborDistance(DataPoint), ClusterDistance(DataPoint))
The silhouette value ranges from −1 to +1 and measures how similar a data point is to its own cluster compared to other clusters. A high silhouette value indicates that the data point is well matched to its own cluster and poorly matched to neighboring clusters. If most data points have a high silhouette value, then the clustering is adequate. If many data points have low silhouette values, then the clustering configuration may have too many or too few clusters. The average silhouette value over all data points is a measure of how well the data have been clustered.
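A direct, quadratic-time sketch of these definitions, assuming data is an n × m array and labels holds the cluster index of each point (names are illustrative; singleton clusters are assigned a silhouette value of 0 by convention):

```python
import numpy as np

def silhouette(data, labels):
    """Silhouette value of every point, computed directly from the definitions:
    ClusterDistance = mean distance to the other points of the same cluster,
    NeighborDistance = smallest mean distance to the points of another cluster."""
    labels = np.asarray(labels)
    n = len(data)
    dists = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)
    values = np.zeros(n)
    for i in range(n):
        same = (labels == labels[i])
        same[i] = False
        if not same.any():                     # singleton cluster: silhouette is 0
            continue
        cluster_distance = dists[i, same].mean()
        neighbor_distance = min(
            dists[i, labels == c].mean() for c in set(labels) if c != labels[i]
        )
        separation = neighbor_distance - cluster_distance
        values[i] = separation / max(neighbor_distance, cluster_distance)
    return values
```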
To remove outliers, we must first define what we mean by an outlier. It turns out that outlier detection is related to clustering, and so it is not easy to define what an outlier is without clustering! In fact, a common approach to defining outliers is to apply the Lloyd algorithm and list as outliers the top m points that are the farthest away from their nearest cluster centers. However, the Lloyd algorithm itself is sensitive to outliers, and such outliers may have an impact on the clusters that this algorithm constructs. Moreover, some outliers may form small tight clusters (e.g., clusters consisting of two data points) that will not be removed by the described procedure. An attempt to remove all small clusters as outliers is also risky, since some small clusters may represent a feature of a dataset rather than an artifact caused by outliers.
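The procedure described above might be sketched as follows, assuming centers have already been produced by a Lloyd-style clustering (names are illustrative):

```python
import numpy as np

def top_m_outliers(data, centers, m):
    """Indices of the m points farthest from their nearest cluster center."""
    dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    nearest = dists.min(axis=1)        # distance of each point to its nearest center
    return np.argsort(nearest)[-m:]    # the m largest of these distances
```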
Biologists sometimes use silhouette analysis to select the value of k (see the previous FAQ on alternate metrics of cluster quality). For example, compute the average silhouette value for each k ranging from 2 to some maximum value, and select the value of k that maximizes the average silhouette value.
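A sketch of this selection loop, using scikit-learn's KMeans as a stand-in for any Lloyd-style implementation and its silhouette_score, which computes the average silhouette value as defined above (names are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def choose_k(data, k_max):
    """Return the k (2 <= k <= k_max) with the largest average silhouette value."""
    scores = {}
    for k in range(2, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(data)
        scores[k] = silhouette_score(data, labels)
    return max(scores, key=scores.get)
```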
The 1-center clustering problem is also known as the Minimum Covering Sphere Problem: given a set of n data points in m-dimensional space, find an m-dimensional sphere of minimum radius containing all these points. In the case of two-dimensional space, the corresponding Minimum Covering "Circle" Problem can be solved in O(n^4) time by capitalizing on the observation that the minimum covering circle of a set of data points is determined either by two points (which form a diameter of the circle) or by three points lying on its boundary. Thus, we can explore all pairs and triples of points. Since checking whether the circle formed by a given pair or triple contains all data points takes O(n) time, and there are O(n^3) candidates to check, the running time of this algorithm is O(n^4).
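A sketch of this brute-force approach in the plane: try every circle determined by a pair of points (taken as a diameter) or by a triple of points (their circumcircle), and keep the smallest one that covers all points. Names are illustrative, and at least two points are assumed.

```python
import itertools
import math

def circle_from_two(p, q):
    """Circle having the segment pq as a diameter."""
    center = ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)
    return center, math.dist(p, q) / 2

def circle_from_three(p, q, r):
    """Circumcircle of three points, or None if they are (nearly) collinear."""
    ax, ay = p; bx, by = q; cx, cy = r
    d = 2 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        return None
    ux = ((ax**2 + ay**2) * (by - cy) + (bx**2 + by**2) * (cy - ay) + (cx**2 + cy**2) * (ay - by)) / d
    uy = ((ax**2 + ay**2) * (cx - bx) + (bx**2 + by**2) * (ax - cx) + (cx**2 + cy**2) * (bx - ax)) / d
    return (ux, uy), math.dist((ux, uy), p)

def covers(center, radius, points, eps=1e-9):
    """Check in O(n) time whether the circle contains all points."""
    return all(math.dist(center, pt) <= radius + eps for pt in points)

def min_covering_circle(points):
    """O(n^4) brute force over all pairs and triples of points."""
    best = None
    candidates = itertools.chain(
        (circle_from_two(p, q) for p, q in itertools.combinations(points, 2)),
        (circle_from_three(p, q, r) for p, q, r in itertools.combinations(points, 3)),
    )
    for circle in candidates:
        if circle is None:
            continue
        center, radius = circle
        if covers(center, radius, points) and (best is None or radius < best[1]):
            best = (center, radius)
    return best
```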
There exists a faster, linear-time algorithm for solving the Minimum Covering Sphere Problem, proposed in: N. Megiddo. Linear-time algorithms for linear programming in R^3 and related problems. SIAM Journal on Computing, 12:759–776 (1983).