Tuesday, October 29, 2013

Steps to run K-Means Clustering on Apache Mahout

For this k-means example on Mahout, I am using the Reuters-21578 dataset, downloaded from : http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz

I am assuming you have already configured Apache Mahout and Hadoop. If not, set up Hadoop and Mahout first (the Hadoop setup and the Apache Mahout setup are explained in previous blog posts).

The steps to run the k-means algorithm on Apache Mahout using this Reuters dataset are :

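Step 1 : Untar the file into a directory named "reuters" (this is the input directory that Step 2 reads from) as :
mkdir reuters && tar -xvzf reuters21578.tar.gz -C reuters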

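Step 2 : Extract the individual articles from the SGML files into plain text files, using the ExtractReuters utility from the Lucene benchmark package, as :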
mahout org.apache.lucene.benchmark.utils.ExtractReuters reuters reuters-out

Step 3 : Copy the "reuters-out" folder of extracted text files from the local file system to the Hadoop file system as :
hadoop dfs -put reuters-out /home/hduser/hadoop-workDir/reuters-out

Step 4 : Mahout works on sequence files, so convert the "reuters-out" directory to a sequence-file directory as :
mahout seqdirectory -i /home/hduser/hadoop-workDir/reuters-out -o /home/hduser/hadoop-workDir/reuters-out-seqdir -c UTF-8 -chunk 5

-c : The name of the character encoding of the input files. Defaults to UTF-8.
-chunk : The chunk size in megabytes. Defaults to 64.

Step 5 : Convert the files in the sequence directory to sparse vectors as :
mahout seq2sparse -i /home/hduser/hadoop-workDir/reuters-out-seqdir -o /home/hduser/hadoop-workDir/reuters-out-seqdir-sparse-kmeans --maxDFPercent 85 --namedVector

--maxDFPercent : The maximum percentage of documents a term may appear in (its document frequency, DF). Expressed as an integer between 0 and 100; the default is 99. Use it to remove very high-frequency terms, such as stop words and other junk words that repeat in most documents.

--namedVector : (Optional) Whether the output vectors should be NamedVectors. When set, each vector carries its name (here, the file name), which makes it possible to see the document name in the cluster output.
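
To make the effect of --maxDFPercent more concrete, here is a minimal Java sketch of document-frequency pruning on a tiny in-memory corpus. It is only an illustration of the idea, not Mahout's actual implementation: the class name MaxDfPruningSketch, the method keptTerms and the rule "drop a term when its DF percentage is strictly greater than the threshold" are all assumptions made for this example.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of Mahout: shows the idea behind --maxDFPercent.
final class MaxDfPruningSketch {

    // Returns the terms whose document frequency (in percent) is at most maxDfPercent.
    static Set<String> keptTerms(List<List<String>> docs, int maxDfPercent) {
        // Count, for every term, the number of documents that contain it at least once.
        Map<String, Integer> documentFrequency = new TreeMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                documentFrequency.merge(term, 1, Integer::sum);
            }
        }
        // Keep a term only if it does not appear in too many documents.
        // Assumption: "too many" means strictly more than maxDfPercent percent.
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : documentFrequency.entrySet()) {
            double percent = 100.0 * entry.getValue() / docs.size();
            if (percent <= maxDfPercent) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }
}
```

With a threshold of 85, a word that occurs in every document (100 percent) is dropped, while a word that occurs in half of the documents is kept.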

Step 6:  Run k-means clustering on these vectors as :
mahout kmeans -i /home/hduser/hadoop-workDir/reuters-out-seqdir-sparse-kmeans/tfidf-vectors -c /home/hduser/hadoop-workDir/reuters-kmeans-clusters -o /home/hduser/hadoop-workDir/reuters-kmeans -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -k 20 -ow --clustering

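-dm : The distance measure to use. "org.apache.mahout.common.distance.CosineDistanceMeasure" is just one of them. Mahout ships several implementations, and we can use any one according to our requirement. Some others are : EuclideanDistanceMeasure, MahalanobisDistanceMeasure, ManhattanDistanceMeasure etc.
-x : The maximum number of iterations.
-k : The value of k for k-means, i.e. the number of clusters. In this example it is 20. For help in choosing k, go through : http://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set or http://datachurn.blogspot.in/2013/05/clustering-tips.html
-c : The path of the initial cluster centroids. When -k is also given, Mahout picks k random input vectors as the initial centroids and writes them to this path first.
-ow : Overwrite the output directory if it already exists.
--clustering : Assign the input points to the final clusters after the iterations finish. This produces the "clusteredPoints" directory that Step 7 reads.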
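
To show what the distance measure and the iteration limit mean in practice, here is a small Java sketch of k-means with cosine distance on plain double arrays. It is a simplified illustration, not Mahout's code: the class KMeansSketch and its methods are made up for this post, the initial centroids are passed in by hand, and it stops when no point changes cluster, whereas Mahout checks a convergence delta on the centroids.

```java
import java.util.Arrays;

// Hypothetical sketch, not Mahout code: k-means with cosine distance on dense arrays.
final class KMeansSketch {

    // Cosine distance = 1 - (a . b) / (|a| * |b|); 0 means the same direction.
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 1.0; // assumption for this sketch: a zero vector is treated as far from everything
        }
        return 1.0 - dot / Math.sqrt(normA * normB);
    }

    // Index of the centroid closest to the given point.
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        double bestDistance = cosineDistance(point, centroids[0]);
        for (int c = 1; c < centroids.length; c++) {
            double d = cosineDistance(point, centroids[c]);
            if (d < bestDistance) {
                best = c;
                bestDistance = d;
            }
        }
        return best;
    }

    // Runs at most maxIterations rounds (the role of -x) and returns the cluster index of each point.
    // The centroids array is updated in place.
    static int[] cluster(double[][] points, double[][] centroids, int maxIterations) {
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);
        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: attach every point to its nearest centroid.
            boolean changed = false;
            for (int p = 0; p < points.length; p++) {
                int c = nearest(points[p], centroids);
                if (c != assignment[p]) {
                    assignment[p] = c;
                    changed = true;
                }
            }
            if (!changed) {
                break; // nothing moved, so further rounds would give the same result
            }
            // Update step: move every centroid to the mean of its members.
            for (int c = 0; c < centroids.length; c++) {
                double[] sum = new double[centroids[c].length];
                int count = 0;
                for (int p = 0; p < points.length; p++) {
                    if (assignment[p] == c) {
                        for (int d = 0; d < sum.length; d++) {
                            sum[d] += points[p][d];
                        }
                        count++;
                    }
                }
                if (count > 0) {
                    for (int d = 0; d < sum.length; d++) {
                        centroids[c][d] = sum[d] / count;
                    }
                }
            }
        }
        return assignment;
    }
}
```

Here the number of rows in the centroids array plays the role of k. Cosine distance ignores vector length and only compares direction, which is why it is a common choice for TF-IDF document vectors.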

Step 7 : Dump the clustered points to the local machine to analyze which documents belong to which clusters as :
mahout seqdumper -i /home/hduser/hadoop-workDir/reuters-kmeans/clusteredPoints/part-m-00000 -o resultFile.txt

Step 8 : Get a clean result from this dump file with Java code as :

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

public class MahoutClusterOutput {

    // Dump file produced by seqdumper in Step 7, and the cleaned-up result file.
    String inputFile = "/home/hduser/hadoop/mahout-0.8/reuters-kmeans-sgm/resultFile.txt";
    String outputFile = "/home/tarun/resultFileCluster.txt";

    // Cluster id -> comma-separated document names, in order of first appearance.
    LinkedHashMap<String, String> mpIdData = new LinkedHashMap<String, String>();

    public void fileRead() {
        try {
            try (BufferedReader br = new BufferedReader(new FileReader(inputFile))) {
                String strLine;
                int lineNumber = 0;
                while ((strLine = br.readLine()) != null) {
                    lineNumber++;

                    // Skip the two header lines and short lines such as the trailing count.
                    if (lineNumber < 3 || strLine.length() < 20) {
                        continue;
                    }

                    // Field 1 is the cluster id; field 4 starts with the document name.
                    String[] tempStr = strLine.split(":");
                    if (tempStr.length < 5) {
                        continue; // not a clustered-point line
                    }
                    String clusterId = tempStr[1].trim();
                    String docName = tempStr[4].split("=")[0].replace("/", "").trim();

                    if (mpIdData.containsKey(clusterId)) {
                        mpIdData.put(clusterId, mpIdData.get(clusterId) + "," + docName);
                    } else {
                        mpIdData.put(clusterId, docName);
                    }
                }
            }

            // The file is opened in append mode, so a re-run adds to the existing content.
            try (BufferedWriter out = new BufferedWriter(new FileWriter(outputFile, true))) {
                for (Map.Entry<String, String> entry : mpIdData.entrySet()) {
                    out.write("Cluster Id : " + entry.getKey());
                    out.write("-------> Documents : " + entry.getValue());
                    out.write("\n");
                }
            }
        } catch (IOException e) {
            System.out.println("Exception message : " + e.getMessage());
        }
    }

    public static void main(String[] args) {
        MahoutClusterOutput ob = new MahoutClusterOutput();
        ob.fileRead();
    }
}
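
The indexes used in the code above (field 1 for the cluster id, field 4 for the document name) depend on the shape of each line in the dump. The following sketch applies the same splitting logic to a single line so it can be tried without any files. The sample line in main is hypothetical: it is written only to match the shape the indexes imply, so compare it against a real line from your own resultFile.txt before relying on it. The class DumpLineSketch and the method parse are names made up for this example.

```java
// Hypothetical sketch: the same splitting logic as Step 8, applied to one line.
final class DumpLineSketch {

    // Returns {clusterId, documentName}, or null if the line has fewer than five ':' separated fields.
    static String[] parse(String line) {
        String[] parts = line.split(":");
        if (parts.length < 5) {
            return null;
        }
        String clusterId = parts[1].trim();
        // Keep the text before '=' and drop the leading '/' of the file name.
        String docName = parts[4].split("=")[0].replace("/", "").trim();
        return new String[] { clusterId, docName };
    }

    public static void main(String[] args) {
        // Assumed line shape, for illustration only.
        String sample = "Key: 21: Value: 1.0: /reut2-000.sgm-0.txt = [5:0.3, 9:0.7]";
        String[] result = parse(sample);
        System.out.println("Cluster Id : " + result[0] + "-------> Documents : " + result[1]);
    }
}
```

If the lines in your dump are shaped differently, only the two indexes inside parse need to change.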