Sunday, May 26, 2013

Clustering Tips

Hierarchical Clustering Tips:

There are several approaches we might use to stop the hierarchical clustering process:
1) We could be told, or have a belief, about how many clusters there are in the data. For example, if we are told that the data about sports news is taken from Cricket, Hockey, and Football, then we know to stop when there are three clusters left.
2) We could stop combining when at some point the best combination of existing clusters produces a cluster that is inadequate. There can be various tests for the adequacy of a cluster. But for an example, we could insist that any cluster have an average distance between the centroid and its points no greater than some limit. This approach is only sensible if we have a reason to believe that no cluster extends over too much of the space.
3) We could continue clustering until there is only one cluster. However, it is meaningless to return a single cluster consisting of all the points. Rather, we return the tree representing the way in which all the points were combined. This form of answer makes good sense in some applications, such as one in which the points are genomes of different species, and the distance measure reflects the difference in the genome. Then, the tree represents the evolution of these species, that is, the likely order in which two species branched from a common ancestor.
4) Stop if the diameter of the cluster that results from the best merger exceeds a threshold. We can also base this rule on the radius, or on any of the variants of the radius.
5) Stop if the density of the cluster that results from the best merger is below some threshold. The density can be defined in many different ways. Roughly, it should be the number of cluster points per unit volume of the cluster. That ratio can be estimated by the number of points divided by some power of the diameter or radius of the cluster. The correct power could be the number of dimensions of the space. Sometimes, 1 or 2 is chosen as the power, regardless of the number of dimensions.
6) Stop when there is evidence that the next pair of clusters to be combined yields a bad cluster. For example, we could track the average diameter of all the current clusters. As long as we are combining points that belong in a cluster, this average will rise gradually. However, if we combine two clusters that really don’t deserve to be combined, then the average diameter will take a sudden jump.
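
As a rough illustration of criterion (1) and a rule in the spirit of criterion (4), here is a minimal sketch using SciPy's agglomerative clustering utilities. The toy array X, the choice of average-link distance, and the 0.5 cut-off are illustrative assumptions, not values suggested by the discussion above.

# A minimal sketch, assuming Euclidean points stored in a NumPy array X.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(50, 2)            # toy data: 50 points in 2-D
Z = linkage(X, method="average")     # bottom-up merging with average-link distance

# Criterion (1): stop when a known number of clusters (here 3) remains.
labels_by_count = fcluster(Z, t=3, criterion="maxclust")

# A rule in the spirit of criterion (4): stop merging once the merge
# distance exceeds a threshold (0.5 is an arbitrary illustrative value).
labels_by_threshold = fcluster(Z, t=0.5, criterion="distance")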


Efficiency of Hierarchical Clustering:

The basic algorithm for hierarchical clustering is not very efficient. At each step, we must compute the distances between each pair of clusters, in order to find the best merger. The initial step takes O(n²) time, but subsequent steps take time proportional to (n−1)², (n−2)², .... The sum of squares up to n is O(n³), so this algorithm is cubic. Thus, it cannot be run except for fairly small numbers of points.

The steps of a more efficient implementation are:

Step 1 : We start, as we must, by computing the distances between all pairs of points, and this step is O(n²).
Step 2 : Form the pairs and their distances into a priority queue, so we can always find the smallest distance in one step. This operation is also O(n²).
Step 3 : When we decide to merge two clusters C and D, we remove all entries in the priority queue involving one of these two clusters; that requires work O(n log n) since there are at most 2n deletions to be performed, and priority-queue deletion can be performed in O(log n) time.
Step 4 : We then compute all the distances between the new cluster and the remaining clusters. This work is also O(n log n), as there are at most n entries to be inserted into the priority queue, and insertion into a priority queue can also be done in O(log n) time.

Since the last two steps are executed at most n times, and the first two steps are executed only once, the overall running time of this algorithm is O(n² log n).
That is better than O(n³), but it still puts a strong limit on how large n can be before it becomes infeasible to use this clustering approach.
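
The following sketch shows one way the priority-queue scheme above might look in Python, with clusters represented by their centroids. The centroid representation and the use of lazy deletion (skipping stale heap entries instead of removing them explicitly) are simplifying assumptions; the text leaves those details open.

import heapq
import itertools
from math import dist   # Euclidean distance, Python 3.8+

def agglomerative(points, stop_at=1):
    """Sketch of the priority-queue scheme above. Clusters are represented
    by their centroids, and heap entries that refer to already-merged
    clusters are skipped lazily rather than deleted explicitly."""
    clusters = {i: (tuple(p), 1) for i, p in enumerate(points)}  # id -> (centroid, size)
    new_ids = itertools.count(len(points))

    # Steps 1-2: all pairwise distances go into a heap.
    heap = [(dist(clusters[i][0], clusters[j][0]), i, j)
            for i in clusters for j in clusters if i < j]
    heapq.heapify(heap)

    while len(clusters) > stop_at and heap:
        _, i, j = heapq.heappop(heap)
        if i not in clusters or j not in clusters:
            continue                      # stale entry: one of the clusters was merged away
        # Steps 3-4: merge i and j, then add distances to the new cluster.
        (ci, ni), (cj, nj) = clusters.pop(i), clusters.pop(j)
        merged = tuple((a * ni + b * nj) / (ni + nj) for a, b in zip(ci, cj))
        k = next(new_ids)
        for other_id, (co, _) in clusters.items():
            heapq.heappush(heap, (dist(merged, co), k, other_id))
        clusters[k] = (merged, ni + nj)
    return clusters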


K-Means Clustering Tips:

Picking the Right Value of k:
We may not know the correct value of k to use in a k-means clustering. However, if we can measure the quality of the clustering for various values of k, we can usually guess the right value of k as follows:

If we take a measure of appropriateness for clusters, such as average radius or diameter, that value will grow slowly as long as the number of clusters we assume remains at or above the true number of clusters. However, as soon as we try to form fewer clusters than there really are, the measure will rise precipitously. The idea is illustrated by the figure below:

[Figure: average diameter (or another measure of diffuseness) versus the number of clusters; the measure rises quickly as soon as the number of clusters falls below the true number present in the data.]
If we have no idea what the correct value of k is, we can find a good value using a number of clustering operations that grows only logarithmically with the true number. Begin by running the k-means algorithm for k = 1, 2, 4, 8, . . . . Eventually, you will find two values v and 2v between which there is very little decrease in the average diameter, or whatever measure of cluster cohesion you are using. We may conclude that the value of k that is justified by the data lies between v/2 and v. If we use a binary search in that range, we can find the best value for k in another log₂ v clustering operations, for a total of 2 log₂ v clusterings. Since the true value of k is at least v/2, we have used a number of clusterings that is logarithmic in k.
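
A sketch of this doubling-then-binary-search strategy is shown below, using scikit-learn's KMeans and the average cluster diameter as the cohesion measure. The tolerance tol, which decides what counts as "very little decrease", and the helper names are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans

def avg_diameter(X, labels):
    """Average, over clusters, of the largest intra-cluster distance
    (one possible measure of diffuseness; others would work as well)."""
    diams = []
    for c in np.unique(labels):
        pts = X[labels == c]
        if len(pts) < 2:
            diams.append(0.0)
        else:
            gaps = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
            diams.append(gaps.max())
    return float(np.mean(diams))

def measure(X, k):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    return avg_diameter(X, labels)

def pick_k(X, tol=0.10):
    """Double k until going from v to 2v gives very little decrease,
    then binary-search (v/2, v] for the smallest adequate k."""
    v, prev = 1, measure(X, 1)
    while 2 * v <= len(X):
        cur = measure(X, 2 * v)
        if prev - cur <= tol * prev:      # little decrease between v and 2v
            break
        v, prev = 2 * v, cur
    plateau = prev                        # measure at v, which is already adequate
    lo, hi = max(1, v // 2 + 1), v
    while lo < hi:
        mid = (lo + hi) // 2
        if measure(X, mid) <= (1 + tol) * plateau:
            hi = mid                      # mid clusters are still adequate
        else:
            lo = mid + 1                  # measure jumped: true k is larger
    return hi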

Picking the Right Centroids:

Problem : If the initial centroids are chosen badly, k-means can end up with poor clusters; for example, one cluster may turn out very large while the others are very small. The outcome depends heavily on the centroids chosen initially.

Solution : To avoid this, we should run the k-means clustering process with several different random initializations and compute the cost function

    J = (1/n) · ∑_i || x_i − μ_c(i) ||²

(the average squared distance from each point x_i to the centroid μ_c(i) of the cluster it is assigned to) for each set of randomly initialized centroids. We should then pick the random initialization that achieves the lowest cost.
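
A minimal NumPy sketch of this restart strategy is given below, under the assumption that the cost J is the average squared distance just described; Lloyd's algorithm, the function names, and the default of 10 restarts are illustrative choices.

import numpy as np

def kmeans_cost(X, centroids, labels):
    """Cost J: average squared distance from each point to its assigned centroid."""
    return float(np.mean(np.sum((X - centroids[labels]) ** 2, axis=1)))

def assign(X, centroids):
    """Assign each point to its nearest centroid."""
    return np.argmin(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1)

def lloyd(X, k, rng, iters=100):
    """One run of Lloyd's algorithm from k randomly chosen data points."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = assign(X, centroids)
        new = np.array([X[labels == c].mean(axis=0) if np.any(labels == c)
                        else centroids[c] for c in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    labels = assign(X, centroids)
    return centroids, labels

def best_of_restarts(X, k, restarts=10, seed=0):
    """Run k-means from several random initializations and keep the run
    with the lowest cost J."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(restarts):
        centroids, labels = lloyd(X, k, rng)
        cost = kmeans_cost(X, centroids, labels)
        if best is None or cost < best[0]:
            best = (cost, centroids, labels)
    return best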

Classification

Classification is the task of assigning items to one of a fixed set of classes, such as "spam" vs. "non-spam", based on labeled training examples.
Many algorithms are used for classification problems. Some of them are:
  a) Naive Bayes
  b) Balanced Winnow
  c) Winnow
  d) Maximum Entropy
  e) Decision Tree
  f) Support Vector Machine (SVM)

Balanced Winnow Algorithm:


Balanced Winnow keeps two weights for each feature, w⁺ and w⁻.

The overall weight of a feature is (w⁺ − w⁻).

For a given document with feature vector x = (s_p1, s_p2, …, s_pd), the algorithm deems the document relevant iff:

    ∑_{j=1..d} (w⁺_pj − w⁻_pj) · s_pj  >  τ

where:
w⁺_pj, w⁻_pj : the two weights corresponding to active feature p_j
s_pj : the strength of feature p_j in the document
d : the number of features in the document's representation (the length of x)
τ : the threshold

In the Balanced Winnow algorithm, the weights w⁺ and w⁻ are initialized to 2τ/m and τ/m respectively, where m is the average number of active features per document in the collection.
The weights of active features are updated only when a mistake is made, and the update is either a promotion step or a demotion step.

Positive Example : Following a mistake on a positive example (the algorithm predicts 0 but the label is 1), the positive part of each active feature's weight is promoted and the negative part is demoted:

    w⁺_pj ← α · w⁺_pj        w⁻_pj ← β · w⁻_pj
Negative Example : Following a mistake on a negative example (the algorithm predicts 1 but the label is 0), the positive part of each active feature's weight is demoted and the negative part is promoted:

    w⁺_pj ← β · w⁺_pj        w⁻_pj ← α · w⁻_pj
α : the promotion parameter, with α > 1.
β : the demotion parameter, with 0 < β < 1.

There are three ways of setting the value of s(i, d), the strength of feature i in document d, according to the feature's frequency in the document:
1) If the feature is present (active) in the document then s(i, d) = 1; otherwise s(i, d) = 0.
2) The strength depends on the number of occurrences n(i, d) of the feature in the document: s(i, d) = n(i, d).
3) The strength is damped by normalizing the value of n(i, d): s(i, d) = sqrt(n(i, d)).
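
These three choices can be captured in a tiny helper; the function name and signature below are illustrative assumptions.

from math import sqrt

def strength(n_occurrences, variant=1):
    """The three choices of s(i, d) listed above, given the count n(i, d)
    of feature i in document d."""
    if variant == 1:                       # 1) boolean: present or not
        return 1.0 if n_occurrences > 0 else 0.0
    if variant == 2:                       # 2) raw number of occurrences
        return float(n_occurrences)
    return sqrt(n_occurrences)             # 3) square-root damping

# For the example below, "retrieval" occurs 6 times, so
# strength(6, 1) = 1.0, strength(6, 2) = 6.0, strength(6, 3) ≈ 2.45.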

Example :
Let document d contain the terms {information, retrieval, retrieval, retrieval, retrieval, information, retrieval, retrieval}.
Let the overall weight vector for the two features "information, retrieval" be W_d = (0.5, 0.1) and the threshold τ = 1.
The score under each strength function is calculated as:
1) X1 = (1, 1) (boolean: feature present or not); with strength function (1), ∑ = 0.5·1 + 0.1·1 = 0.6.
2) X2 = (2, 6) (number of times each feature occurs); with strength function (2), ∑ = 0.5·2 + 0.1·6 = 1.6.
3) X3 = (1.41, 2.45) (square root of the number of occurrences); with strength function (3), ∑ = 0.5·1.41 + 0.1·2.45 = 0.95.


In the above example, the Balanced Winnow algorithm will predict 0 when using the first and third strength functions (0.6 and 0.95 are below the threshold τ = 1) and 1 when using the second (1.6 exceeds τ).


A computational advantage of Balanced Winnow in sparse, high-dimensional spaces is that only the weights of active features are updated.
This also makes the algorithm more sensitive to relationships among the features, relationships that may go unnoticed by an algorithm that is based only on counts of features.
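
To make the rules above concrete, here is a minimal Python sketch of Balanced Winnow. The class name, the default values α = 1.5 and β = 0.5, and the choice to multiply each weight by α or β exactly once per mistake (rather than scaling the update by the feature strength) are illustrative assumptions, not details fixed by the description above.

from collections import defaultdict

class BalancedWinnow:
    """Mistake-driven Balanced Winnow with two weights per feature.
    tau is the threshold; m is the (average) number of active features
    per document, used only to set the initial weights 2*tau/m and tau/m."""
    def __init__(self, tau, m, alpha=1.5, beta=0.5):
        self.tau, self.alpha, self.beta = tau, alpha, beta
        self.w_pos = defaultdict(lambda: 2.0 * tau / m)   # w+ weights
        self.w_neg = defaultdict(lambda: tau / m)         # w- weights

    def score(self, doc):
        """doc maps each active feature to its strength s(i, d)."""
        return sum((self.w_pos[f] - self.w_neg[f]) * s for f, s in doc.items())

    def predict(self, doc):
        return 1 if self.score(doc) > self.tau else 0

    def update(self, doc, label):
        """Only the weights of doc's active features ever change,
        and only when the prediction was a mistake."""
        pred = self.predict(doc)
        if pred == label:
            return
        if label == 1:                    # mistake on a positive example
            for f in doc:                 # promote w+, demote w-
                self.w_pos[f] *= self.alpha
                self.w_neg[f] *= self.beta
        else:                             # mistake on a negative example
            for f in doc:                 # demote w+, promote w-
                self.w_pos[f] *= self.beta
                self.w_neg[f] *= self.alpha

Training would then consist of one or more passes over the labeled documents, calling update on each (document, label) pair.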


Positive Winnow : Positive Winnow keeps only one weight w_pj per feature p_j, instead of the two weights w⁺ and w⁻ kept by Balanced Winnow.
For a given document with feature vector x = (s_p1, s_p2, …, s_pd), the algorithm deems the document relevant iff:

    ∑_{j=1..d} w_pj · s_pj  >  τ

When a mistake is made, the weight vector is updated. If the algorithm predicts 0 and the label is 1 (a positive example), the weights of all the active features are promoted: each weight w_pj is multiplied by α.
Otherwise, if the algorithm predicts 1 and the label is 0 (a negative example), the weights of all the active features are demoted: each weight w_pj is multiplied by β.
No matter which mistake is made, the weights of the inactive features remain unchanged.
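
A matching sketch of Positive Winnow follows; the initial weight w0 and the default α and β are again illustrative assumptions.

from collections import defaultdict

class PositiveWinnow:
    """Single-weight variant of Winnow: one weight per feature, the same
    mistake-driven promotion/demotion, and inactive features left untouched."""
    def __init__(self, tau, alpha=1.5, beta=0.5, w0=1.0):
        self.tau, self.alpha, self.beta = tau, alpha, beta
        self.w = defaultdict(lambda: w0)        # one weight w_pj per feature, created lazily

    def predict(self, doc):
        """doc maps each active feature to its strength s(i, d)."""
        return 1 if sum(self.w[f] * s for f, s in doc.items()) > self.tau else 0

    def update(self, doc, label):
        pred = self.predict(doc)
        if pred == 0 and label == 1:            # mistake on a positive example
            for f in doc:                       # promote the active features
                self.w[f] *= self.alpha
        elif pred == 1 and label == 0:          # mistake on a negative example
            for f in doc:                       # demote the active features
                self.w[f] *= self.beta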