Hierarchical Clustering Tips:
There are several approaches we might use to stopping the hierarchical clustering process:
1) We could be told, or have a belief, about how many clusters there are in the data. For example, if we are told that the data about sports news is taken from Cricket, Hockey, and Football, then we know to stop when there are three clusters left.
2) We could stop combining when at some point the best combination of existing clusters produces a cluster that is inadequate. There can be various tests for the adequacy of a cluster. But for an example, we could insist that any cluster have an average distance between the centroid and its points no greater than some limit. This approach is only sensible if we have a reason to believe that no cluster extends over too much of the space.
3) We could continue clustering until there is only one cluster. However, it is meaningless to return a single cluster consisting of all the points. Rather, we return the tree representing the way in which all the points were combined. This form of answer makes good sense in some applications, such as one in which the points are genomes of different species, and the distance measure reflects the difference in the genome.Then, the tree represents the evolution of these species, that is, the likely order in which two species branched from a common ancestor.
4) Stop if the diameter of the cluster that results from the best merger exceeds a threshold. We can also base this rule on the radius, or on any of the variants of the radius.
5) Stop if the density of the cluster that results from the best merger is below some threshold. The density can be defined in many different ways. Roughly, it should be the number of cluster points per unit volume of the cluster. That ratio can be estimated by the number of points divided by some power of the diameter or radius of the cluster. The correct power could be the number of dimensions of the space. Sometimes, 1 or 2 is chosen as the power, regardless of the number of dimensions.
6) Stop when there is evidence that the next pair of clusters to be combined yields a bad cluster. For example, we could track the average diameter of all the current clusters. As long as we are combining points that belong in a cluster, this average will rise gradually. However, if we combine two clusters that really don’t deserve to be combined, then the average diameter will take a sudden jump.
Efficiency of Hierarchical Clustering :
The basic algorithm for hierarchical clustering is not very efficient. At each step, we must compute the distances between each pair of clusters, in order to find the best merger. The initial step takes O(n2) time, but subsequent steps take time proportional to (n-1)2, (n-2)2, ....... The sum of squares up to n is O(n3), so this algorithm is cubic. Thus, it cannot be run except for fairly small numbers of points.
More efficient implementation steps are :
Step 1 : We start, as we must, by computing the distances between all pairs of points, and this step is O(n2).
Step 2 : Form the pairs and their distances into a priority queue, so we can always find the smallest distance in one step. This operation is also O(n2).
Step 3 : When we decide to merge two clusters C and D, we remove all entries in the priority queue involving one of these two clusters; that requires work O(n log n) since there are at most 2n deletions to be performed, and priority-queue deletion can be performed in O(log n) time.
Step 4 : We then compute all the distances between the new cluster and the remaining clusters. This work is also O(n log n), as there are at most n entries to be inserted into the priority queue, and insertion into a priority queue can also be done in O(log n) time.
Since the last two steps are executed at most n times, and the first two steps are executed only once, the overall running time of this algorithm is O(n2 log n).
That is better than O(n3), but it still puts a strong limit on how large n can be before it becomes infeasible to use this clustering approach.
K-Means Clustering Tips:
Picking the Right Value of k :
We may not know the correct value of k to use in a k-means clustering. However, if we can measure the quality of the clustering for various values of k, we can usually guess what the right value of k as :
If we take a measure of appropriateness for clusters, such as average radius or diameter, that value will grow slowly, as long as the number of clusters we assume remains at or above the true number of clusters.However, as soon as we try to form fewer clusters than there really are, the measure will rise precipitously. The idea is expressed by the below diagram as :
Above figure shows that average diameter or another measure of diffuseness rises quickly as soon as the number of clusters falls below the true number present in the data.
  
Picking the Right Centroids :
Problem : If centroid chosen initially wrong then it can be end up with wrong clusters like as it can be possible one cluster is very large and other clusters are very small. Thats all depend on centroid chosen initially.
Solution : To avoid this, we should do the k-means clustering process with different random initialization and calculate the cost function
with each random initialize centroids. We should pick that random initialization which has highest minimized cost function.
There are several approaches we might use to stopping the hierarchical clustering process:
1) We could be told, or have a belief, about how many clusters there are in the data. For example, if we are told that the data about sports news is taken from Cricket, Hockey, and Football, then we know to stop when there are three clusters left.
2) We could stop combining when at some point the best combination of existing clusters produces a cluster that is inadequate. There can be various tests for the adequacy of a cluster. But for an example, we could insist that any cluster have an average distance between the centroid and its points no greater than some limit. This approach is only sensible if we have a reason to believe that no cluster extends over too much of the space.
3) We could continue clustering until there is only one cluster. However, it is meaningless to return a single cluster consisting of all the points. Rather, we return the tree representing the way in which all the points were combined. This form of answer makes good sense in some applications, such as one in which the points are genomes of different species, and the distance measure reflects the difference in the genome.Then, the tree represents the evolution of these species, that is, the likely order in which two species branched from a common ancestor.
4) Stop if the diameter of the cluster that results from the best merger exceeds a threshold. We can also base this rule on the radius, or on any of the variants of the radius.
5) Stop if the density of the cluster that results from the best merger is below some threshold. The density can be defined in many different ways. Roughly, it should be the number of cluster points per unit volume of the cluster. That ratio can be estimated by the number of points divided by some power of the diameter or radius of the cluster. The correct power could be the number of dimensions of the space. Sometimes, 1 or 2 is chosen as the power, regardless of the number of dimensions.
6) Stop when there is evidence that the next pair of clusters to be combined yields a bad cluster. For example, we could track the average diameter of all the current clusters. As long as we are combining points that belong in a cluster, this average will rise gradually. However, if we combine two clusters that really don’t deserve to be combined, then the average diameter will take a sudden jump.
Efficiency of Hierarchical Clustering :
The basic algorithm for hierarchical clustering is not very efficient. At each step, we must compute the distances between each pair of clusters, in order to find the best merger. The initial step takes O(n2) time, but subsequent steps take time proportional to (n-1)2, (n-2)2, ....... The sum of squares up to n is O(n3), so this algorithm is cubic. Thus, it cannot be run except for fairly small numbers of points.
More efficient implementation steps are :
Step 1 : We start, as we must, by computing the distances between all pairs of points, and this step is O(n2).
Step 2 : Form the pairs and their distances into a priority queue, so we can always find the smallest distance in one step. This operation is also O(n2).
Step 3 : When we decide to merge two clusters C and D, we remove all entries in the priority queue involving one of these two clusters; that requires work O(n log n) since there are at most 2n deletions to be performed, and priority-queue deletion can be performed in O(log n) time.
Step 4 : We then compute all the distances between the new cluster and the remaining clusters. This work is also O(n log n), as there are at most n entries to be inserted into the priority queue, and insertion into a priority queue can also be done in O(log n) time.
Since the last two steps are executed at most n times, and the first two steps are executed only once, the overall running time of this algorithm is O(n2 log n).
That is better than O(n3), but it still puts a strong limit on how large n can be before it becomes infeasible to use this clustering approach.
K-Means Clustering Tips:
Picking the Right Value of k :
We may not know the correct value of k to use in a k-means clustering. However, if we can measure the quality of the clustering for various values of k, we can usually guess what the right value of k as :
If we take a measure of appropriateness for clusters, such as average radius or diameter, that value will grow slowly, as long as the number of clusters we assume remains at or above the true number of clusters.However, as soon as we try to form fewer clusters than there really are, the measure will rise precipitously. The idea is expressed by the below diagram as :
Above figure shows that average diameter or another measure of diffuseness rises quickly as soon as the number of clusters falls below the true number present in the data.
Picking the Right Centroids :
Problem : If centroid chosen initially wrong then it can be end up with wrong clusters like as it can be possible one cluster is very large and other clusters are very small. Thats all depend on centroid chosen initially.
Solution : To avoid this, we should do the k-means clustering process with different random initialization and calculate the cost function
with each random initialize centroids. We should pick that random initialization which has highest minimized cost function.
