Friday, May 24, 2013

Clustering


Clustering is an unsupervised learning process. It is the process of examining a collection of "points" and grouping the points into "clusters" according to some distance measure. The goal is that points in the same cluster have a small distance from one another, while points in different clusters are at a large distance from one another.

Applications:
Insurance: identifying groups of motor insurance policy holders with a high average claim cost; identifying fraud.
WWW: document classification; clustering weblog data to discover groups of similar access patterns.
Market segmentation.


Clustering algorithms may be classified as listed below:

    Hierarchical Clustering : Hierarchical Clustering
    Exclusive Clustering  : K-means
    Overlapping Clustering : Fuzzy C-means
    Probabilistic Clustering : Mixture of Gaussian

Distance Measure in Clustering :
The distance measure is an important component of clustering. If the components of the data instance vectors are all in the same physical units, the simple Euclidean distance metric is often sufficient to group similar data instances successfully. But sometimes the Euclidean distance can be misleading even when the data instance vectors are all in the same physical units.
The Euclidean distance is given by

d(x_i, x_j) = \sqrt{ \sum_{k=1}^{d} (x_{i,k} - x_{j,k})^2 }

[Figure: the same set of points clustered differently under two different scalings of the axes.]
The figure above shows that different scalings can lead to different clusterings.
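As a small plain-Python illustration (the numbers are made up), changing the unit of one feature changes which points end up closest:

# Height in metres: b looks closer to a than c does.
a, b, c = (1.80, 60.0), (1.60, 61.0), (1.79, 70.0)   # (height, weight)

def euclid(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5

print(euclid(a, b), euclid(a, c))   # ~1.02 vs ~10.0

# Height in centimetres: now c looks closer to a than b does.
a, b, c = (180.0, 60.0), (160.0, 61.0), (179.0, 70.0)
print(euclid(a, b), euclid(a, c))   # ~20.0 vs ~10.0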
For higher dimensional data, a popular measure is the Minkowski metric:

d_p(x_i, x_j) = \left( \sum_{k=1}^{d} |x_{i,k} - x_{j,k}|^p \right)^{1/p}
Here d is the dimensionality of the data. The Euclidean distance is the special case p = 2, while the Manhattan metric has p = 1.
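As a rough sketch in plain Python (function names are just illustrative), these metrics can be computed as:

def minkowski(x, y, p):
    # Minkowski distance between two equal-length vectors x and y.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def euclidean(x, y):   # special case p = 2
    return minkowski(x, y, 2)

def manhattan(x, y):   # special case p = 1
    return minkowski(x, y, 1)

print(euclidean((1, 1), (4, 5)))   # 5.0
print(manhattan((1, 1), (4, 5)))   # 7.0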



Steps of Hierarchical Clustering :

Step 1 : Start by assigning each item to its own cluster, so that if we have N items, we now have N clusters, each containing just one item.
Let the distances (similarities) between the clusters be the same as the distances (similarities) between the items they contain.

Step 2 : Find the closest (most similar) pair of clusters and merge them into a single cluster, so that now we have one cluster less.

Step 3: Compute distances (similarities) between the new cluster and each of the old clusters.

Step 4: Repeat steps 2 and 3 until all N items are clustered into a single cluster.

Step 3 can be done in three different ways, described below (a short code sketch of the whole procedure follows these definitions):

a) Single-Linkage : the distance between two clusters is taken to be the shortest distance from any member of one cluster to any member of the other cluster.

b) Complete-Linkage : the distance between two clusters is taken to be the greatest distance from any member of one cluster to any member of the other cluster.

c) Average-Linkage : the distance between two clusters is taken to be the average distance from any member of one cluster to any member of the other cluster.
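A minimal sketch of the whole agglomerative procedure with single linkage, assuming a precomputed distance matrix (plain Python; the names are illustrative, not a library API):

# Naive agglomerative clustering with single linkage.
# dist is a symmetric matrix (list of lists) of pairwise item distances,
# labels gives a name for each item.
def single_linkage(dist, labels):
    clusters = [[i] for i in range(len(labels))]   # step 1: one cluster per item
    merges = []
    while len(clusters) > 1:
        best = None
        # step 2: find the closest pair of clusters, where cluster-to-cluster
        # distance is the shortest distance between any two of their members
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        # step 3: merge the pair and record the level at which the merge happened
        merges.append((d, [labels[k] for k in clusters[i] + clusters[j]]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [clusters[i] + clusters[j]]
    return merges

Run on the city distance matrix of the example below, the first merge it reports would be MI and TO at level 138.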

Let's take an example of hierarchical clustering using single-linkage:

Step 1 is done by assigning each item (city) to its own cluster.

[Figure: input distance matrix between the cities.]
In step 2, the nearest pair of cities found is MI and TO, at distance 138 (see the figure above).
These are merged into a single cluster called "MI/TO". The level of the new cluster is L(MI/TO) = 138.
The distance matrix after this merge is:

[Figure: distance matrix after merging MI and TO into the cluster MI/TO.]
Steps 2 and 3 are repeated until everything merges into a single cluster:

[Figures: distance matrices and merges for the remaining iterations.]
Finally, the last two clusters merge at level 295.
The process is summarized by the following hierarchical tree (dendrogram):

[Figure: dendrogram of the merges, starting with MI/TO at level 138 and ending with the final merge at level 295.]
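For real data, such a dendrogram can be produced with SciPy's hierarchical-clustering routines (assuming SciPy and Matplotlib are available); the points below are made up for illustration and are not the city data above:

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Illustrative 2-D points.
points = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])

# method='single' selects single-linkage; 'complete' and 'average'
# correspond to the other two strategies described above.
Z = linkage(points, method='single', metric='euclidean')

dendrogram(Z, labels=['A', 'B', 'C', 'D', 'E'])
plt.show()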
K-Means Clustering : K-means is one of the simplest unsupervised learning algorithms that solves the well-known clustering problem.
At the beginning of k-means clustering we choose the number of clusters K and assume the centroids (centers) of these clusters. Any K random objects can be taken as the initial centroids, or the first K objects in sequence can also serve as the initial centroids.
The steps are:
Iterate until stable (i.e. no object moves between groups):
Step 1 : Determine the centroid coordinates.
Step 2 : Determine the distance of each object to the centroids.
Step 3 : Group the objects based on minimum distance.
A short code sketch of this loop is given after the figure below.

[Figure: flowchart of the k-means iteration.]
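The loop above can be written as a short sketch in plain Python (illustrative only; in practice a library implementation such as scikit-learn's KMeans would normally be used):

# Minimal k-means sketch. points and centroids are lists of coordinate tuples.
def kmeans(points, centroids, max_iter=100):
    for _ in range(max_iter):
        # Steps 2-3: assign each object to its nearest centroid (squared Euclidean distance).
        groups = [[] for _ in centroids]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            groups[dists.index(min(dists))].append(p)
        # Step 1 (next iteration): recompute each centroid as the mean of its group.
        new_centroids = [
            tuple(sum(xs) / float(len(g)) for xs in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
        if new_centroids == centroids:   # centroids unchanged => grouping is stable
            break
        centroids = new_centroids
    return centroids, groups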
Let's go through a numeric example of k-means:

Suppose we have four objects, and each object has two attributes/features, as shown below:

Object    Attribute 1 (Weight)    Attribute 2 (Density)
A         1                       1
B         2                       1
C         4                       3
D         5                       4
Our goal is to group the objects into K = 2 groups based on the two features (Weight and Density).

Step 1 : Initial value of the centroids: suppose we take Object A and Object B as the first centroids. Let c1 and c2 denote the coordinates of the centroids; then c1 = (1,1) and c2 = (2,1).

Step 2 : Objects-centroid distances: calculate the distance between each cluster centroid and each object.
Let us use the Euclidean distance for this calculation. The distance matrix at iteration 0 is:

                    A       B       C       D
c1 = (1,1)          0.00    1.00    3.61    5.00
c2 = (2,1)          1.00    0.00    2.83    4.24
Each column in the distance matrix corresponds to one object. The first row of the distance matrix gives the distance of each object to the first centroid and the second row gives the distance of each object to the second centroid.

Step 3 : In the object clustering step, we assign each object to a group based on minimum distance. Thus Object A is assigned to group 1, Object B to group 2, Object C to group 2 and Object D to group 2.
In the group matrix below, an element is 1 if and only if the object is assigned to that group.

              A    B    C    D
group 1       1    0    0    0
group 2       0    1    1    1
Step 4 : In iteration 1, determine the centroids of both groups:
Centroid of group 1: c1 = (1,1), because group 1 has only one member.
Centroid of group 2: c2 = ((2+4+5)/3, (1+3+4)/3) = (11/3, 8/3), because group 2 has three members.


Step 5 : Compute the distance matrix with the new centroids, as done in Step 2:

                    A       B       C       D
c1 = (1,1)          0.00    1.00    3.61    5.00
c2 = (11/3, 8/3)    3.14    2.36    0.47    1.89
Step 6 : Compute the group matrix from the distance matrix of Step 5:

              A    B    C    D
group 1       1    1    0    0
group 2       0    0    1    1
Step 7 : Repeat Step 4 to calculate the new centroids:

Centroid of group 1: c1 = ((1+2)/2, (1+1)/2) = (1.5, 1)
Centroid of group 2: c2 = ((4+5)/2, (3+4)/2) = (4.5, 3.5)
Step 8 : In iteration 2, calculate the distance matrix with the new centroids from Step 7:

                    A       B       C       D
c1 = (1.5, 1)       0.50    0.50    3.20    4.61
c2 = (4.5, 3.5)     4.30    3.54    0.71    0.71
Step 9 : Compute the group matrix from the distance matrix of Step 8:

              A    B    C    D
group 1       1    1    0    0
group 2       0    0    1    1
We obtain G2 = G1. Comparing the grouping of the last iteration and this iteration reveals that no object moves between groups anymore. Thus the k-means computation has reached its stable state and no more iterations are needed.

The final grouping results are:

Object    Attribute 1 (Weight)    Attribute 2 (Density)    Group
A         1                       1                        1
B         2                       1                        1
C         4                       3                        2
D         5                       4                        2
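Assuming the object coordinates used above (A = (1,1), B = (2,1), C = (4,3), D = (5,4), with A and B as the initial centroids; the coordinates of C and D are inferred from the Step 4 centroid calculation, so treat them as an assumption), the same result can be reproduced with scikit-learn:

import numpy as np
from sklearn.cluster import KMeans

# Assumed coordinates of objects A, B, C, D (Weight, Density).
X = np.array([[1.0, 1.0], [2.0, 1.0], [4.0, 3.0], [5.0, 4.0]])
init = np.array([[1.0, 1.0], [2.0, 1.0]])   # c1 = A, c2 = B

km = KMeans(n_clusters=2, init=init, n_init=1).fit(X)
print(km.labels_)            # [0 0 1 1] -> A, B in group 1; C, D in group 2
print(km.cluster_centers_)   # [[1.5 1. ] [4.5 3.5]]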
Tips for clustering are available in this post: http://datachurn.blogspot.in/2013/05/clustering-tips.html
