Sunday, December 15, 2013

Map/Reduce Programming Paradigm Example

MapReduce is the heart of Hadoop. It is the programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster. The MapReduce concept is fairly simple to understand for those who are familiar with clustered scale-out data processing solutions.

The term MapReduce actually refers to two separate and distinct tasks that Hadoop programs perform. The first is the map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce job is always performed after the map job.

Let's start with the Word Count example:
Assume we have three web documents, each containing some text. A real application won't be quite so simple: it is likely to contain millions or even billions of rows, and they might not be neatly formatted rows at all. But no matter how big or small the amount of data we need to analyze, the key principles covered here remain the same.

Doc 1 : Sachin is playing 200th Test.
Doc 2 : Dhoni is batting with Sachin in his 50th Test. Test is going good.
Doc 3 : Sachin and Dhoni playing Test Cricket Match in Mumbai.

In the map phase each sentence is split into words, and the initial key/value pairs are formed as follows:
Doc 1 :
<Sachin,1>
<is,1>
<playing,1>
<200th,1>
<Test,1>

Doc 2 :
<Dhoni,1>
<is,2>
<batting,1>
<with,1>
<Sachin,1>
<in,1>
<his,1>
<50th,1>
<Test,2>
<going,1>
<good,1>

Doc 3:
<Sachin,1>
<and,1>
<Dhoni,1>
<playing,1>
<Test,1>
<Cricket,1>
<Match,1>
<in,1>
<Mumbai,1>

In the reduce phase the keys are grouped together, and the values for identical keys are added up. The output key/value pairs are therefore:
<Sachin,3>
<is,3>
<playing,2>
<200th,1>
<Test,4>
<Dhoni,2>
<batting,1>
<with,1>
<in,2>
<his,1>
<50th,1>
<going,1>
<good,1>
<Cricket,1>
<Match,1>
<Mumbai,1>
<and,1>

This gives the number of occurrences of each word in the input. Reduce therefore acts as an aggregation phase over the keys.

The point to note here is that the mapper first executes completely over the entire data set, splitting the words and forming the initial key/value pairs. Only after this entire process is completed does the reducer start.


A Map/Reduce program divides into three parts:
Part 1 : Mapper class
Part 2 : Reducer class
Part 3 : Driver class, which triggers the Map/Reduce job.

Part 1 : Code of the Mapper class :
The functionality of the map method is as follows:
 Step 1 : Create an IntWritable variable 'one' with the value 1
 Step 2 : Convert the input line from Text to a String
 Step 3 : Use a tokenizer to split the line into words
 Step 4 : Iterate through the words and form key/value pairs:
          a) Assign each word from the tokenizer (a String) to a Text 'word'
          b) Form a key/value pair <word,one> for each word and push it to the output collector



import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>
{
      /*Hadoop supported data types. The data types provided here are Hadoop specific data types designed for operational efficiency suited for massive parallel and lightning fast read write operations*/

      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();
   
      /*Map method that performs the tokenizing and frames the initial key/value pairs.
        Mapper<LongWritable, Text, Text, IntWritable> refers to the data types of the input and output key/value pairs of the mapper (or rather the map method), i.e. Mapper<Input Key Type, Input Value Type, Output Key Type, Output Value Type>. */

      public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
      {
            //taking one line at a time and tokenizing the same
            String line = value.toString();
          StringTokenizer tokenizer = new StringTokenizer(line);
       
          //iterating through all the words available in that line and forming the key value pair
            while (tokenizer.hasMoreTokens())
            {
               word.set(tokenizer.nextToken());
               //send to the output collector, which in turn passes the pair on to the reducer
                 output.collect(word, one);
            }
       }
}



Part 2 : Code of the Reducer class :
The functionality of the reduce method is as follows:
   Step a) Initialize a variable 'sum' to 0
   Step b) Iterate through all the values for a key and sum them up
   Step c) Push the key and the obtained sum as the value to the output collector


import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCountReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable>
{
      //the reduce method accepts the key/value pairs from the mappers, aggregates the values by key and produces the final output
      public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
      {
            int sum = 0;
            /*iterates through all the values available with a key and add them together and give the
            final result as the key and sum of its values*/
          while (values.hasNext())
          {
               sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));
      }
}

Part 3 : Code of the Driver class, which triggers the Map/Reduce job.

The functionality of the driver is as follows:
   Step a) Create a JobConf object
   Step b) Set the configuration object with the key and value classes for Map/Reduce
   Step c) Provide the mapper and reducer class names
   Step d) Provide the input and output HDFS directory paths for Map/Reduce

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;


public class WordCountDriver extends Configured implements Tool{
      public int run(String[] args) throws Exception
      {
            //creating a JobConf object and assigning a job name for identification purposes
            JobConf conf = new JobConf(getConf(), WordCountDriver.class);
            conf.setJobName("WordCount");

            //Setting configuration object with the Data Type of output Key and Value
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);

            //Providing the mapper and reducer class names
            conf.setMapperClass(WordCountMapper.class);
            conf.setReducerClass(WordCountReducer.class);

            //the hdfs input and output directory to be fetched from the command line
            FileInputFormat.addInputPath(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            JobClient.runJob(conf);
            return 0;
      }
     
      public static void main(String[] args) throws Exception
      {
            int res = ToolRunner.run(new Configuration(), new WordCountDriver(),args);
            System.exit(res);
      }
}
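To run the job, the three classes are packaged into a jar and submitted with the hadoop command. The jar name and the HDFS input/output paths below are only placeholders; substitute your own (note that the output directory must not already exist):

hadoop jar wordcount.jar WordCountDriver /user/hduser/wordcount/input /user/hduser/wordcount/output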


Friday, November 8, 2013

Steps to run Presto on linux m/c

Facebook announced on 6th November 2013 that it is committing its Presto low-latency, SQL-compliant query system for Hadoop to open source. Facebook has one of the largest data warehouses in the world, now surpassing 300 petabytes. That sheer scale has forced Facebook to invent its own tools for working with variably-structured data at high-scale. 


Steps to run Presto on linux m/c  :


Step 1 : First install Hadoop and Hive. The Hadoop installation can be followed from the blog "Setup Hadoop".
Setup Hive as :
Step a : Download the Release of Hive from : http://mirror.reverse.net/pub/apache/hive/ and extract Hive Release
Step b : export the path of hive folder as :
export HIVE_HOME=/home/hduser/hadoop/hive-0.11
Step c : Add $HIVE_HOME/bin to your PATH:
export PATH=$HIVE_HOME/bin:$PATH
Step d : Create directory on hdfs for hive as :
bin/hadoop fs -mkdir       /home/hduser/hadoop-workDir/warehouse
On my machine HDFS is located at "/home/hduser/hadoop-workDir/". Use the corresponding HDFS location of your own Hadoop installation.
Step e : To use the hive command line interface (cli) from the shell:
$HIVE_HOME/bin/hive
Step f : Simple example of create table on hive as :
hive> CREATE TABLE hiveExample (eId INT, eName STRING);
Step g : hive> SHOW TABLES;

Step 2 : Download and deploy Presto as per the instructions at :
http://prestodb.io/docs/current/installation/deployment.html
Start the presto server as :
bin/launcher run

Step 3 : Download presto-cli-0.52-executable.jar(download from http://prestodb.io/docs/current/installation/cli.html), rename it to presto, then run it:
./presto --server localhost:8080 --catalog hive --schema default

Step 4 : From another prompt, start the Hive metastore. Use the same port that was specified in Step 2 in "etc/catalog/hive.properties" when starting the metastore below:
bin/hive --service metastore -p 9083

Step 5 : At the Presto prompt from Step 3, run the command "show tables"; it will list the table (hiveExample) we created in Hive.
All Hive operations can be performed through Presto, and it is very fast compared to using Hive directly.
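For example, once the Presto prompt from Step 3 is up, ordinary SQL against the Hive catalog works. The queries below are only illustrations against the hiveExample table created earlier (your prompt string may differ):

presto:default> SHOW TABLES;
presto:default> SELECT * FROM hiveexample LIMIT 10;
presto:default> SELECT count(*) FROM hiveexample;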

Tuesday, October 29, 2013

Steps to run K-Means Clustering on Apache Mahout

For the k-means example on Mahout, I am downloading the Reuters dataset from : http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz

I am assuming you have already configured Apache Mahout and Hadoop. Otherwise, first set up Hadoop and Mahout (the Hadoop setup and the Apache Mahout setup are explained in previous blogs).

Steps to run K-Means algorithm on apache mahout using this reuters dataset are :

Step 1 : untar this file as : tar -xvzf reuters21578.tar.gz

Step 2:  Run Lucene indexing on this dataset as :
mahout org.apache.lucene.benchmark.utils.ExtractReuters reuters reuters-out

Step 3 : Copy this "reuters-out" lucene index file folder from local file system to hadoop file system as :
hadoop dfs -put reuters-out /home/hduser/hadoop-workDir/reuters-out

Step 4 : Mahout understands sequence files/directories, so convert the "reuters-out" folder to a sequence directory:
mahout seqdirectory -i /home/hduser/hadoop-workDir/reuters-out -o /home/hduser/hadoop-workDir/reuters-out-seqdir -c UTF-8 -chunk 5

-c      : The name of the character encoding of the input files. Defaults to UTF-8.
-chunk  : The chunk size in megabytes. Defaults to 64.

Step 5 : Convert these files of sequence directory to vectors as :
mahout seq2sparse -i /home/hduser/hadoop-workDir/reuters-out-seqdir -o /home/hduser/hadoop-workDir/reuters-out-seqdir-sparse-kmeans --maxDFPercent 85 --namedVector

--maxDFPercent : The maximum percentage of docs for the DF (document frequency). Can be used to remove really high-frequency terms. Expressed as an integer between 0 and 100. Default is 99. Useful for removing junk/stop words that are repeated many times in the documents.

--namedVector : (Optional) Whether the output vectors should be NamedVectors. Setting this helps to get the vector name (i.e., the file name) in the cluster output.

Step 6:  Run k-means clustering on these vectors as :
mahout kmeans -i /home/hduser/hadoop-workDir/reuters-out-seqdir-sparse-kmeans/tfidf-vectors -c /home/hduser/hadoop-workDir/reuters-kmeans-clusters -o /home/hduser/hadoop-workDir/reuters-kmeans -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -k 20 -ow --clustering

-dm : The distance measure to use. "org.apache.mahout.common.distance.CosineDistanceMeasure" is just one type of distance measure; Mahout ships with many implementations and we can use any one according to our requirement, such as EuclideanDistanceMeasure, MahalanobisDistanceMeasure, ManhattanDistanceMeasure, etc.
-x : The maximum number of iterations.
-k : The value of k for k-means; 20 in this example. See http://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set or http://datachurn.blogspot.in/2013/05/clustering-tips.html for how to choose it.

Step 7 : Dump the clustered points to the local machine to analyze which documents belong to which clusters:
mahout seqdumper -i /home/hduser/hadoop-workDir/reuters-kmeans/clusteredPoints/part-m-00000 -o resultFile.txt

Step 8 : Get a clean result from this dump file with the Java code below:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.InputStreamReader;
import java.util.LinkedHashMap;
import java.util.Map;

public class mahoutClusterOutput {

    String inputFile = "/home/hduser/hadoop/mahout-0.8/reuters-kmeans-sgm/resultFile.txt";
    String outputFile = "/home/tarun/resultFileCluster.txt";
    LinkedHashMap<String, String> mpIdData = new LinkedHashMap<String, String>();

    public void fileRead() {
        try {
            BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(inputFile)));
            String strLine;
            int lineNumber = 0;
            while ((strLine = br.readLine()) != null) {
                lineNumber++;
                // skip the seqdumper header lines and any line too short to hold a record
                if (lineNumber < 3 || strLine.length() < 20) {
                    continue;
                }
                // tempStr[1] holds the cluster id and tempStr[4] the document name
                String[] tempStr = strLine.split(":");
                String docName = tempStr[4].split("=")[0].replace("/", "");
                if (mpIdData.containsKey(tempStr[1])) {
                    mpIdData.put(tempStr[1], mpIdData.get(tempStr[1]) + "," + docName);
                } else {
                    mpIdData.put(tempStr[1], docName);
                }
            }
            br.close();

            BufferedWriter out = new BufferedWriter(new FileWriter(outputFile, true));
            for (Map.Entry<String, String> entry : mpIdData.entrySet()) {
                out.write("Cluster Id : " + entry.getKey());
                out.write("-------> Documents : " + entry.getValue());
                out.write("\n");
            }
            out.close();
        } catch (Exception e) {
            System.out.println("Exception message : " + e.getMessage());
        }
    }

    public static void main(String[] args) {
        mahoutClusterOutput ob = new mahoutClusterOutput();
        ob.fileRead();
    }
}


 

Wednesday, October 9, 2013

Ensemble Learning

Ensemble learning builds a prediction model by combining the strengths of a collection of simpler base models.
Bagging, Boosting and Random Forests fall into the category of ensemble learning.
Ensemble learning can be broken down into two tasks:
            1) Developing a population of base learners from the training data.
            2) Combining them to form the composite predictor.

Boosting and Regularization Paths :

Boosting goes a step further: it builds an ensemble model by conducting a regularized and supervised search in a high-dimensional space of weak learners.

Before going forward, let's start with some terminology:

Least Squares :
            Least squares is a standard approach to the approximate solution of overdetermined systems, i.e., sets of equations in which there are more equations than unknowns. "Least squares" means that the overall solution minimizes the sum of the squares of the errors made in the result of every single equation.

Regularized versions :
Regularization, in the field of machine learning, refers to a process of introducing additional information in order to prevent overfitting.
           
 Ridge regression :
                        Ridge regression adds a constraint that ‖β‖₂, the L2-norm of the parameter vector, is not greater than a given value. Equivalently, it may solve an unconstrained minimization of the least-squares penalty with α‖β‖₂² added, where α is a constant.

Lasso method :

An alternative regularized version of least squares is the Lasso (least absolute shrinkage and selection operator), which uses the constraint that ‖β‖₁, the L1-norm of the parameter vector, is no greater than a given value. (As above, this is equivalent to an unconstrained minimization of the least-squares penalty with α‖β‖₁ added.)
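Written out as penalized least-squares problems (standard textbook forms, included here for reference, with x_i the predictor vector of the i-th observation and α the penalty constant), ridge regression and the Lasso solve, respectively:

$$\hat{\beta}^{\,\text{ridge}} = \arg\min_{\beta}\; \sum_{i=1}^{n}\bigl(y_i - x_i^{T}\beta\bigr)^{2} \;+\; \alpha \sum_{j=1}^{p}\beta_j^{2}$$

$$\hat{\beta}^{\,\text{lasso}} = \arg\min_{\beta}\; \sum_{i=1}^{n}\bigl(y_i - x_i^{T}\beta\bigr)^{2} \;+\; \alpha \sum_{j=1}^{p}\bigl|\beta_j\bigr|$$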


One of the prime differences between Lasso and ridge regression is that in ridge regression, as the penalty is increased, all parameters are reduced while still remaining non-zero, while in Lasso, increasing the penalty will cause more and more of the parameters to be driven to zero.


Bet on Sparsity Principle :

Boosting’s forward stagewise strategy with shrinkage approximately minimizes the same loss function with a Lasso-style L1 penalty.

However, the sometimes superior performance of boosting over procedures such as the support vector machine may be largely due to the implicit use of the L1 versus L2 penalty. The shrinkage resulting from the L1 penalty is better suited to sparse situations, where there are few basis functions with nonzero coefficients.

If the true coefficients are dense (e.g., Gaussian), then in a Bayesian sense the best predictor is ridge regression; that is, we should use an L2 rather than an L1 penalty when fitting the coefficients.
On the other hand, if there are only a small number (e.g., 1,000) of coefficients that are nonzero, the Lasso (L1 penalty) will work better. We think of this as a sparse scenario, while the first case (Gaussian coefficients) is dense.

In other words, use of the L1 penalty follows what we call the "bet on sparsity" principle for high-dimensional problems: use a procedure that does well in sparse problems, since no procedure does well in dense problems.



Monday, July 22, 2013

Regression Model Analysis

Regression model analysis is commonly done with the RMSE (Root Mean Square Error) value.
An RMSE of zero, meaning that the estimator $\hat{\theta}$ predicts observations of the parameter $\theta$ with perfect accuracy, is the ideal, but it is practically never achieved.
The regression line predicts the average y value associated with a given x value. Note that it is also necessary to get a measure of the spread of the y values around that average. To do this, we use the root-mean-square error (r.m.s. error).
To construct the r.m.s. error, you first need to determine the residuals. Residuals are the differences between the actual values and the predicted values, denoted by

$$e_i = y_i - \hat{y}_i$$

where $y_i$ is the observed value for the $i$th observation and $\hat{y}_i$ is the predicted value.

They can be positive or negative, according to whether the predicted value under- or over-estimates the actual value. Squaring the residuals, averaging the squares, and taking the square root gives us the r.m.s. error. You then use the r.m.s. error as a measure of the spread of the y values about the predicted y value.
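Putting that verbal recipe into a formula (the standard definition, stated here for reference):

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^{2}}$$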






Tuesday, June 11, 2013

Data Preparation and Predictive Modeling Mistakes in M/C Learning and How to Avoid These Mistakes?


1) Mistake (Including ID Fields as Predictors) :
Because most IDs look like continuous integers (and older IDs are typically smaller), it is possible that they may make their way into the model as predictive variables. Be sure to exclude them as early in the process as possible to avoid any confusion while building the model.

2) Mistake(Using Anachronistic Variables) :
Make sure that no predictor variables contain information about the outcome. Because models are built using historical data, it is possible that some of the variables accessible to you when building the model would not have been available at the point in time the model is meant to reflect.

3) Mistake(Selecting the wrong Y-variable) :
When building your dataset for a logistic regression model, you'll want to select the response with the smaller number of data points as your y-variable. A great example of this from the higher-ed world comes from building a retention model: in most cases, we actually want to model attrition, identifying those employees who are likely to leave (hopefully the smaller group) rather than those who are likely to stay.

4) Mistake(Allowing Duplicate Records) :
Don’t include duplicates in a model file. Including just two records per person gives that person twice as much predictive power. To make sure that each person’s influence counts equally, only one record per person or action being modeled should be included.

5) Mistake(Modeling on Too Small of a Population) :
Double-check your population size. A good goal to shoot for in a modeling dataset is at least 1,000 records spanning three years. Including at least three years helps to account for any year-to-year fluctuations in the dataset. The larger our population size is, the more robust our model will be.

6) Mistake(Judging the quality of a model using one measure) :
It is a mistake to capture the quality of a model with a single measure. We should use methods such as cross-validation and look at more than one evaluation metric.

7) Mistake(Not Accounting for Outliers and/or Missing Values) :
Be sure to account for any outliers and/or missing values. Large rifts in individual variables can add up when we are combining those variables to build a predictive model. Checking the minimum and maximum values for each variable can be a quick way to spot any records that are out of the usual realm.

8) Mistake(Failing to consider enough variables) :
When deciding which variables to audition for a model, we want to include anything we have on hand that we think could possibly be predictive. Separating out the extra variables is something the modeling program will do, so don't be afraid to throw the kitchen sink at it for your first pass.

Monday, June 10, 2013

Underfitting/Overfitting Problem in M/C learning


Underfitting : If our algorithm works badly even on the points in our training data set, then the algorithm is underfitting the data set. This can be checked easily through the cost function: in linear regression the cost function is half the mean squared error, i.e. if the mean squared error is C the cost is 0.5C. If in an experiment the cost ends up high even after many iterations, then chances are we have an underfitting problem, and we can say the learning algorithm is not a good fit for the problem. Underfitting is also known as high bias (a strong bias towards its hypothesis). In other words, the hypothesis space the learning algorithm explores is too small to properly represent the data.

How to avoid underfitting :
More data will not generally help; in fact, it is likely to increase the training error. Instead we should add more features, because that expands the hypothesis space. This includes making new features from existing features. In the same way, adding more parameters may also expand the hypothesis space.

Overfitting : If our algorithm works well with the points in our data set but not on new points, then the algorithm is overfitting the data set. Overfitting can be checked easily by splitting the data set so that 90% of the data is in the training set and 10% in a cross-validation set: train on the training set, then measure the cost on the cross-validation set. If the cross-validation cost is much higher than the training cost, then chances are we have an overfitting problem. In other words, the hypothesis space is too large, and perhaps some features are faking the learning algorithm out.

How to avoid overfitting :
To avoid overfitting, add regularization when there are many features. Regularization forces the magnitudes of the parameters to be smaller (shrinking the hypothesis space). For this, add a new term to the cost function which penalizes the magnitudes of the parameters, e.g. for regularized linear regression:

$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta\bigl(x^{(i)}\bigr) - y^{(i)}\right)^{2} \;+\; \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^{2}$$
Sunday, May 26, 2013

Clustering Tips

Hierarchical Clustering Tips:

There are several approaches we might use to stop the hierarchical clustering process:
1) We could be told, or have a belief, about how many clusters there are in the data. For example, if we are told that the data about sports news is taken from Cricket, Hockey, and Football, then we know to stop when there are three clusters left.
2) We could stop combining when at some point the best combination of existing clusters produces a cluster that is inadequate. There can be various tests for the adequacy of a cluster. But for an example, we could insist that any cluster have an average distance between the centroid and its points no greater than some limit. This approach is only sensible if we have a reason to believe that no cluster extends over too much of the space.
3) We could continue clustering until there is only one cluster. However, it is meaningless to return a single cluster consisting of all the points. Rather, we return the tree representing the way in which all the points were combined. This form of answer makes good sense in some applications, such as one in which the points are genomes of different species, and the distance measure reflects the difference in the genome. Then the tree represents the evolution of these species, that is, the likely order in which two species branched from a common ancestor.
4) Stop if the diameter of the cluster that results from the best merger exceeds a threshold. We can also base this rule on the radius, or on any of the variants of the radius.
5) Stop if the density of the cluster that results from the best merger is below some threshold. The density can be defined in many different ways. Roughly, it should be the number of cluster points per unit volume of the cluster. That ratio can be estimated by the number of points divided by some power of the diameter or radius of the cluster. The correct power could be the number of dimensions of the space. Sometimes, 1 or 2 is chosen as the power, regardless of the number of dimensions.
6) Stop when there is evidence that the next pair of clusters to be combined yields a bad cluster. For example, we could track the average diameter of all the current clusters. As long as we are combining points that belong in a cluster, this average will rise gradually. However, if we combine two clusters that really don’t deserve to be combined, then the average diameter will take a sudden jump.


Efficiency of Hierarchical Clustering :

The basic algorithm for hierarchical clustering is not very efficient. At each step, we must compute the distances between each pair of clusters, in order to find the best merger. The initial step takes O(n²) time, but subsequent steps take time proportional to (n−1)², (n−2)², .... The sum of squares up to n is O(n³), so this algorithm is cubic. Thus, it cannot be run except for fairly small numbers of points.
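Spelled out, the total work of the naive algorithm is the sum of squares up to n, which is cubic:

$$\sum_{k=1}^{n} k^{2} \;=\; \frac{n(n+1)(2n+1)}{6} \;=\; O\!\left(n^{3}\right)$$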

More efficient implementation steps are :

Step 1 : We start, as we must, by computing the distances between all pairs of points; this step is O(n²).
Step 2 : Form the pairs and their distances into a priority queue, so we can always find the smallest distance in one step. This operation is also O(n²).
Step 3 : When we decide to merge two clusters C and D, we remove all entries in the priority queue involving one of these two clusters; that requires work O(n log n) since there are at most 2n deletions to be performed, and priority-queue deletion can be performed in O(log n) time.
Step 4 : We then compute all the distances between the new cluster and the remaining clusters. This work is also O(n log n), as there are at most n entries to be inserted into the priority queue, and insertion into a priority queue can also be done in O(log n) time.

Since the last two steps are executed at most n times, and the first two steps are executed only once, the overall running time of this algorithm is O(n² log n).
That is better than O(n³), but it still puts a strong limit on how large n can be before it becomes infeasible to use this clustering approach.


K-Means Clustering Tips:

Picking the Right Value of k :
We may not know the correct value of k to use in a k-means clustering. However, if we can measure the quality of the clustering for various values of k, we can usually guess what the right value of k is, as follows:

If we take a measure of appropriateness for clusters, such as average radius or diameter, that value will grow slowly as long as the number of clusters we assume remains at or above the true number of clusters. However, as soon as we try to form fewer clusters than there really are, the measure will rise precipitously. The idea is expressed by the figure below: the average diameter, or another measure of diffuseness, rises quickly as soon as the number of clusters falls below the true number present in the data.

[Figure: average diameter versus the number of clusters]
If we have no idea what the correct value of k is, we can find a good value using a number of clustering operations that grows only logarithmically with the true number. Begin by running the k-means algorithm for k = 1, 2, 4, 8, .... Eventually, you will find two values v and 2v between which there is very little decrease in the average diameter, or whatever measure of cluster cohesion you are using. We may conclude that the value of k that is justified by the data lies between v/2 and v. If we use a binary search in that range, we can find the best value for k in another log₂ v clustering operations, for a total of 2 log₂ v clusterings. Since the true value of k is at least v/2, we have used a number of clusterings that is logarithmic in k.
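A sketch of this doubling-then-binary-search strategy in Java is below. clusterQuality(k) is a hypothetical stand-in for "run k-means with k clusters and return the average diameter (or another cohesion measure)", and the tolerance used to decide "very little decrease" is likewise an assumption:

public class ChooseK {

    // hypothetical hook: runs a clustering with k clusters and returns its average diameter
    interface QualityFunction {
        double clusterQuality(int k);
    }

    static int chooseK(QualityFunction q, double tolerance) {
        // Phase 1: double k until the measure barely decreases between v and 2v.
        int v = 1;
        double qualityAtV = q.clusterQuality(v);
        while (true) {
            double qualityAt2V = q.clusterQuality(2 * v);
            if (qualityAtV - qualityAt2V < tolerance) {
                break;                       // very little decrease between v and 2v
            }
            v *= 2;
            qualityAtV = qualityAt2V;
        }
        // Phase 2: binary search in [v/2, v] for the smallest k that is still adequate,
        // i.e. whose measure has not risen far above the measure observed at v.
        int lo = Math.max(1, v / 2);
        int hi = v;
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            if (q.clusterQuality(mid) - qualityAtV < tolerance) {
                hi = mid;                    // mid clusters are still good enough
            } else {
                lo = mid + 1;                // the measure has jumped; need more clusters
            }
        }
        return lo;
    }
}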

Picking the Right Centroids :

Problem : If the initial centroids are chosen badly, we can end up with wrong clusters; for example, one cluster may be very large while the others are very small. This all depends on the initially chosen centroids.

Solution : To avoid this, we should run the k-means clustering process with several different random initializations and calculate the cost function (the k-means distortion)

$$J = \frac{1}{m}\sum_{i=1}^{m}\left\lVert x^{(i)} - \mu_{c^{(i)}}\right\rVert^{2}$$

for each set of randomly initialized centroids, where μ_c(i) is the centroid of the cluster to which point x(i) is assigned. We should pick the random initialization that ends up with the lowest minimized cost.

Classification

Classification is the task of distinguishing between a fixed set of classes, such as "spam" vs. "non-spam", based on labeled training examples.
Many algorithms are used for classification problems. Some of them are :
  a) Naive Bayes
  b) Balanced Winnow
  c) Winnow
  d) Maximum Entropy
  e) Decision Tree
  f) Support Vector M/C(SVM) 

Balanced Winnow Algorithm :


Balanced Winnow keeps two weights for each feature, w+ and w-.

The overall weight of a feature is (w+ - w-).

For a given document with feature vector x = (S_P1, S_P2, ..., S_Pd), the algorithm deems the document relevant iff

$$\sum_{j=1}^{d}\left(w^{+}_{P_j} - w^{-}_{P_j}\right) S_{P_j} \;>\; \tau$$

where

w+_Pj, w-_Pj : the weights corresponding to the active feature Pj
S_Pj         : the strength of feature Pj
d            : the dimension of the feature vector (number of features)
τ            : the threshold

In the Balanced Winnow algorithm the weights w+ and w- are initialized to 2τ/m and τ/m respectively, where m is the number of active features of a document in the collection.
The weights of active features are updated only when a mistake is made, and they are updated with a promotion step or a demotion step.

Positive example : Following a mistake on a positive example (the algorithm predicts 0 and the label is 1), the positive part of each active feature's weight is promoted and the negative part is demoted:

$$w^{+}_{P_j} \leftarrow \alpha \cdot w^{+}_{P_j}, \qquad w^{-}_{P_j} \leftarrow \beta \cdot w^{-}_{P_j}$$

Negative example : Following a mistake on a negative example (the algorithm predicts 1 and the label is 0), the positive part of each active feature's weight is demoted and the negative part is promoted:

$$w^{+}_{P_j} \leftarrow \beta \cdot w^{+}_{P_j}, \qquad w^{-}_{P_j} \leftarrow \alpha \cdot w^{-}_{P_j}$$

α : Promotion Parameter. The value of α  is greater than 1(α > 1)
β : Demotion Parameter. The value of β lies between 0 and 1(1>β>0).

There are three ways of adjusting the value of s(i, d) (the strength of feature i in document d) according to the feature's frequency in the document:
1) If the feature is present (active) in the document then s(i, d) = 1, otherwise s(i, d) = 0.
2) The strength of the feature depends on its number of occurrences n(i, d) in the document: s(i, d) = n(i, d).
3) Reduce the strength by normalizing the value of n(i, d): s(i, d) = sqrt(n(i, d)).

Example :
Let the document d contain the terms {information, retrieval, retrieval, retrieval, retrieval, information, retrieval, retrieval}.
Let the weight vector of the two features "information, retrieval" be Wd = (0.5, 0.1) and the threshold τ = 1.
The score under each of the above strength functions is calculated as:
1) X1 = (1, 1) (boolean: feature present or not), score = 0.5*1 + 0.1*1 = 0.6.
2) X2 = (2, 6) (number of times each feature occurs), score = 0.5*2 + 0.1*6 = 1.6.
3) X3 = (1.41, 2.45) (normalized occurrence counts), score = 0.5*1.41 + 0.1*2.45 = 0.95.


In this example the Balanced Winnow algorithm predicts 0 when using the first and third strength functions, and 1 when using the second strength function.


Balanced Winnow has a computational advantage in a sparse, high-dimensional space because only the weights of active features are updated.
This also makes the algorithm more sensitive to relationships among the features. These relationships may go unnoticed by an algorithm that is based only on counts of features.
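To make the update rules above concrete, here is a minimal Java sketch of Balanced Winnow. The class and method names, the sparse map representation of feature strengths, and the lazy initialization of weights are illustrative assumptions, not taken from any particular library:

import java.util.HashMap;
import java.util.Map;

public class BalancedWinnow {
    private final Map<String, Double> wPlus = new HashMap<>();   // w+ per feature
    private final Map<String, Double> wMinus = new HashMap<>();  // w- per feature
    private final double tau;    // threshold
    private final double alpha;  // promotion parameter, alpha > 1
    private final double beta;   // demotion parameter, 0 < beta < 1
    private final double m;      // typical number of active features, used for initialization

    public BalancedWinnow(double tau, double alpha, double beta, double m) {
        this.tau = tau; this.alpha = alpha; this.beta = beta; this.m = m;
    }

    // score = sum over active features of (w+ - w-) * strength
    public double score(Map<String, Double> doc) {
        double sum = 0.0;
        for (Map.Entry<String, Double> e : doc.entrySet()) {
            initIfAbsent(e.getKey());
            sum += (wPlus.get(e.getKey()) - wMinus.get(e.getKey())) * e.getValue();
        }
        return sum;
    }

    public int predict(Map<String, Double> doc) {
        return score(doc) > tau ? 1 : 0;
    }

    // Mistake-driven update: only the active features of the current document change.
    public void train(Map<String, Double> doc, int label) {
        if (predict(doc) == label) return;                        // no mistake, no update
        for (String feature : doc.keySet()) {
            if (label == 1) {                                     // mistake on a positive example
                wPlus.put(feature, wPlus.get(feature) * alpha);   // promote w+
                wMinus.put(feature, wMinus.get(feature) * beta);  // demote w-
            } else {                                              // mistake on a negative example
                wPlus.put(feature, wPlus.get(feature) * beta);    // demote w+
                wMinus.put(feature, wMinus.get(feature) * alpha); // promote w-
            }
        }
    }

    private void initIfAbsent(String feature) {
        wPlus.putIfAbsent(feature, 2 * tau / m);                  // initial w+ = 2τ/m
        wMinus.putIfAbsent(feature, tau / m);                     // initial w- = τ/m
    }
}

Training simply calls train(doc, label) repeatedly over the labeled collection; only the active features of a misclassified document are touched, which is exactly what gives the algorithm its advantage in sparse, high-dimensional spaces.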


Positive Winnow : Positive Winnow keeps only one weight w_Pj per feature Pj, instead of the two weights w+ and w- kept by Balanced Winnow.
For a given document with feature vector x = (S_P1, S_P2, ..., S_Pd), the algorithm deems the document relevant iff

$$\sum_{j=1}^{d} w_{P_j}\, S_{P_j} \;>\; \tau$$

When a mistake is made, the weight vector is updated. If the algorithm predicts 0 and the label is 1 (a positive example), the weights of all the active features are promoted: each weight w_Pj is multiplied by α.
Otherwise, if the algorithm predicts 1 and the label is 0 (a negative example), the weights of all the active features are demoted: each weight w_Pj is multiplied by β.
No matter which mistake is made, the weights of the inactive features remain unchanged.

Friday, May 24, 2013

Clustering


Clustering is an unsupervised learning process. It is the process of examining a collection of "points" and grouping the points into "clusters" according to some distance measure. The goal is that points in the same cluster have a small distance from one another, while points in different clusters are at a large distance from one another.

Applications :
Insurance: Identifying groups of motor insurance policy holders with a high average claim cost; identifying frauds;
WWW: Document classification; clustering weblog data to discover groups of similar access patterns.
Market Segmentation.


Clustering algorithms may be classified as listed below:

    Hierarchical Clustering : e.g. agglomerative hierarchical clustering
    Exclusive Clustering : e.g. K-means
    Overlapping Clustering : e.g. Fuzzy C-means
    Probabilistic Clustering : e.g. Mixture of Gaussians

Distance Measure in Clustering :
A distance measure is an important component of clustering. If the components of the data instance vectors are all in the same physical units, the simple Euclidean distance metric is often sufficient to group similar data instances successfully. But sometimes even then Euclidean distance can be misleading, as in the example below.

[Figure: the same data clustered under two different axis scalings]

The figure shows that different scalings can lead to different clusterings.
For higher dimensional data, a popular measure is the Minkowski metric:

$$d_p(x_i, x_j) = \left(\sum_{k=1}^{d}\left|x_{i,k} - x_{j,k}\right|^{p}\right)^{1/p}$$

where d is the dimensionality of the data. The Euclidean distance is a special case where p = 2, while the Manhattan metric has p = 1.
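As a small illustration, the Minkowski metric can be computed in a few lines of Java; the class and method names are placeholders:

public class MinkowskiDistance {
    static double minkowski(double[] x, double[] y, double p) {
        double sum = 0;
        for (int i = 0; i < x.length; i++) {
            sum += Math.pow(Math.abs(x[i] - y[i]), p);   // |x_i - y_i|^p summed over dimensions
        }
        return Math.pow(sum, 1.0 / p);                   // p-th root of the sum
    }

    public static void main(String[] args) {
        double[] a = {1, 1};
        double[] b = {4, 5};
        System.out.println(minkowski(a, b, 2));          // Euclidean distance: 5.0
        System.out.println(minkowski(a, b, 1));          // Manhattan distance: 7.0
    }
}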



Steps of Hierarchical Clustering :

Step 1 : Start by assigning each item to its own cluster, so that if we have N items, we now have N clusters, each containing just one item.
Let the distances (similarities) between the clusters be the same as the distances (similarities) between the items they contain.

Step 2 : Find the closest (most similar) pair of clusters and merge them into a single cluster, so that now we have one cluster less.

Step 3: Compute distances (similarities) between the new cluster and each of the old clusters.

Step 4: Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.

Step 3 can be done in three different ways, described below (a small code sketch of these linkage functions follows the list):

a)  Single-Linkage : In this we consider the distance between one cluster and another cluster to be equal to the shortest distance from any member of one cluster to any member of the other cluster.

b) Complete-Linkage : In this we consider the distance between one cluster and another cluster to be equal to the greatest distance from any member of one cluster to any member of the other cluster.

c) Average-Linkage : In this we consider the distance between one cluster and another cluster to be equal to the average distance from any member of one cluster to any member of the other cluster.
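Here is a small Java sketch of the three linkage functions, assuming a cluster is simply a list of points and distances are Euclidean; the class and method names are illustrative only:

import java.util.List;

public class LinkageDistances {
    static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }

    // Single linkage: shortest distance between any member of one cluster and any member of the other
    static double singleLinkage(List<double[]> c1, List<double[]> c2) {
        double best = Double.POSITIVE_INFINITY;
        for (double[] p : c1) for (double[] q : c2) best = Math.min(best, euclidean(p, q));
        return best;
    }

    // Complete linkage: greatest distance between any member of one cluster and any member of the other
    static double completeLinkage(List<double[]> c1, List<double[]> c2) {
        double worst = 0;
        for (double[] p : c1) for (double[] q : c2) worst = Math.max(worst, euclidean(p, q));
        return worst;
    }

    // Average linkage: average distance over all cross-cluster pairs
    static double averageLinkage(List<double[]> c1, List<double[]> c2) {
        double total = 0;
        for (double[] p : c1) for (double[] q : c2) total += euclidean(p, q);
        return total / (c1.size() * c2.size());
    }
}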

Let's take an example of hierarchical clustering using single linkage.

Step 1 is done by assigning each item (here, a city) to its own cluster; the input is the matrix of pairwise distances between the cities.

[Figure: input distance matrix between the cities]
Step 2 is done by finding the nearest pair of cities, MI and TO, at distance 138 in the matrix above.
These are merged into a single cluster called "MI/TO". The level of the new cluster is L(MI/TO) = 138.
The distance matrix after this merge is:

[Figure: distance matrix with MI and TO merged into MI/TO]
Steps 2 and 3 are repeated until everything merges into a single cluster:

[Figures: distance matrices for the successive merges]
Finally, the last two clusters merge at level 295.
The process is summarized by the resulting hierarchical tree (dendrogram):

[Figure: dendrogram of the merge sequence]
K-Means Clustering : K-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem.
At the beginning of k-means clustering we choose the number of clusters K and assume an initial centroid (center) for each of these clusters. We can take any random objects as the initial centroids, or the first K objects in sequence can serve as the initial centroids.
The steps, iterated until stable (i.e., no object moves between groups), are:
Step 1 : Determine the centroid coordinates.
Step 2 : Determine the distance of each object to the centroids.
Step 3 : Group the objects based on minimum distance (see the Java sketch below).
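Here is a minimal Java sketch of these three steps. The sample points (the four objects A to D of the numeric example that follows) and the choice of the first two objects as initial centroids are assumptions for illustration:

import java.util.Arrays;

public class KMeansExample {
    public static void main(String[] args) {
        double[][] points = { {1, 1}, {2, 1}, {4, 3}, {5, 4} };   // objects A, B, C, D
        int k = 2;
        double[][] centroids = { points[0].clone(), points[1].clone() };  // first k objects
        int[] assignment = new int[points.length];

        boolean changed = true;
        while (changed) {                                          // iterate until stable
            changed = false;
            // Steps 2 and 3: assign each object to the nearest centroid
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (distance(points[i], centroids[c]) < distance(points[i], centroids[best])) {
                        best = c;
                    }
                }
                if (assignment[i] != best) { assignment[i] = best; changed = true; }
            }
            // Step 1 (for the next iteration): recompute each centroid as the mean of its group
            for (int c = 0; c < k; c++) {
                double sx = 0, sy = 0; int n = 0;
                for (int i = 0; i < points.length; i++) {
                    if (assignment[i] == c) { sx += points[i][0]; sy += points[i][1]; n++; }
                }
                if (n > 0) { centroids[c][0] = sx / n; centroids[c][1] = sy / n; }
            }
        }
        System.out.println("Assignments: " + Arrays.toString(assignment));
        System.out.println("Centroids:   " + Arrays.deepToString(centroids));
    }

    private static double distance(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1];
        return Math.sqrt(dx * dx + dy * dy);                       // Euclidean distance
    }
}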

Let's go through a numeric example of k-means.

Suppose we have four objects, each with two attributes/features, as shown below (coordinates are (feature 1, feature 2)):

Object A : (1, 1)
Object B : (2, 1)
Object C : (4, 3)
Object D : (5, 4)

Our goal is to group the objects into K = 2 groups based on their two features (Weight/Density).

Step 1 : Initial Value of centroid : Suppose we consider Object A and Object B as the first centroids. Let c1 and c2 denote the coordinates of centroids, then c1=(1,1) and c2=(2,1)

Step 2 : Objects-centroid distances: calculate the distance between each cluster centroid and each object.
Let us use the Euclidean distance for this calculation. The distance matrix at iteration 0 is:

         A      B      C      D
c1 :   0.00   1.00   3.61   5.00
c2 :   1.00   0.00   2.83   4.24

Each column in the distance matrix corresponds to an object. The first row gives the distance of each object to the first centroid, and the second row the distance of each object to the second centroid.

Step 3 : In the object clustering step, we assign each object to the group with the minimum distance. Thus Object A is assigned to group 1, and Objects B, C and D to group 2.
In the group matrix below, an element is 1 if and only if the object is assigned to that group:

         A   B   C   D
G1 :     1   0   0   0
G2 :     0   1   1   1

Step 4 : In iteration 1, determine the centroid of each group:
Centroid of group 1 : c1 = (1, 1), because group 1 has only one member.
Centroid of group 2 : c2 = ((2+4+5)/3, (1+3+4)/3) = (11/3, 8/3), because group 2 has three members.


Step 5 : Compute the distance matrix with the new centroids, as in Step 2:

         A      B      C      D
c1 :   0.00   1.00   3.61   5.00
c2 :   3.14   2.36   0.47   1.89

Step 6 : Compute the group matrix from the distance matrix of Step 5:

         A   B   C   D
G1 :     1   1   0   0
G2 :     0   0   1   1

Step 7 : Repeat Step 4 to calculate the new centroids:
c1 = ((1+2)/2, (1+1)/2) = (1.5, 1)
c2 = ((4+5)/2, (3+4)/2) = (4.5, 3.5)
Step 8 : In iteration 2, calculate the distance matrix with the new centroids of Step 7:

         A      B      C      D
c1 :   0.50   0.50   3.20   4.61
c2 :   4.30   3.54   0.71   0.71

Step 9 : The group matrix from Step 8 is:

         A   B   C   D
G1 :     1   1   0   0
G2 :     0   0   1   1

We obtain G2 = G1: comparing the grouping of this iteration with the last one reveals that no object moves to a different group anymore. Thus the computation of the k-means clustering has reached its stable state and no more iterations are needed.

The final grouping result is:
Group 1 : Objects A and B, with centroid (1.5, 1)
Group 2 : Objects C and D, with centroid (4.5, 3.5)
Tips for clustering are presented in this blog post: http://datachurn.blogspot.in/2013/05/clustering-tips.html