Friday, May 24, 2013

Clustering


Clustering is an unsupervised learning process. It is the process of examining a collection of "points" and grouping the points into "clusters" according to some distance measure. The goal is that points in the same cluster have a small distance from one another, while points in different clusters are at a large distance from one another.

Applications :
Insurance: Identifying groups of motor insurance policy holders with a high average claim cost; identifying frauds;
WWW: Document classification; clustering weblog data to discover groups of similar access patterns.
Market Segmentation.


Clustering algorithms may be classified as listed below:-

    Hierarchical Clustering : Hierarchical Clustering
    Exclusive Clustering  : K-means
    Overlapping Clustering : Fuzzy C-means
    Probabilistic Clustering : Mixture of Gaussian

Distance Measure in Clustering :
A distance measure is an important component of clustering. If the components of the data instance vectors are all in the same physical units, it is possible that the simple Euclidean distance metric is sufficient to successfully group similar data instances. But sometimes the Euclidean distance can be misleading even when the data instance vectors are all in the same physical units.
Eq. : d(x, y) = sqrt( Σk (xk − yk)² )












The figure above shows that different scaling can lead to different clusterings.
For higher-dimensional data, a popular measure is the Minkowski metric:

d_p(x, y) = ( Σk=1..d |xk − yk|^p )^(1/p)








Here d is the dimensionality of the data. The Euclidean distance is the special case p = 2, while the Manhattan metric has p = 1.
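As a small illustration (this sketch is not from the original post; the class and method names are my own), the Minkowski distance can be computed in Java as:

public class MinkowskiDistance {

    // d_p(x, y) = ( sum_k |x_k - y_k|^p )^(1/p)
    public static double distance(double[] x, double[] y, double p) {
        double sum = 0.0;
        for (int k = 0; k < x.length; k++) {
            sum += Math.pow(Math.abs(x[k] - y[k]), p);
        }
        return Math.pow(sum, 1.0 / p);
    }

    public static void main(String[] args) {
        double[] a = {1.0, 1.0};
        double[] b = {4.0, 5.0};
        System.out.println(distance(a, b, 2)); // Euclidean distance: 5.0
        System.out.println(distance(a, b, 1)); // Manhattan distance: 7.0
    }
}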



Steps of Hierarchical Clustering :

Step 1 : Start by assigning each item to its own cluster, so that if we have N items, we now have N clusters, each containing just one item.
Let the distances (similarities) between the clusters be the same as the distances (similarities) between the items they contain.

Step 2 : Find the closest (most similar) pair of clusters and merge them into a single cluster, so that now we have one cluster less.

Step 3: Compute distances (similarities) between the new cluster and each of the old clusters.

Step 4: Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.

Step 3 can be done in three different ways, listed below (a short code sketch follows the list):

a)  Single-Linkage : In this we consider the distance between one cluster and another cluster to be equal to the shortest distance from any member of one cluster to any member of the other cluster.

b) Complete-Linkage : In this we consider the distance between one cluster and another cluster to be equal to the greatest distance from any member of one cluster to any member of the other cluster.

c) Average-Linkage : In this we consider the distance between one cluster and another cluster to be equal to the average distance from any member of one cluster to any member of the other cluster.
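Here is a rough Java sketch of the three linkage rules (not from the original post); it assumes a cluster is simply a list of points and uses the Euclidean distance for the pairwise distances:

import java.util.List;

public class Linkage {

    // a) single linkage: shortest pairwise distance between the two clusters
    static double singleLink(List<double[]> c1, List<double[]> c2) {
        double best = Double.MAX_VALUE;
        for (double[] p : c1)
            for (double[] q : c2)
                best = Math.min(best, euclidean(p, q));
        return best;
    }

    // b) complete linkage: greatest pairwise distance between the two clusters
    static double completeLink(List<double[]> c1, List<double[]> c2) {
        double worst = 0.0;
        for (double[] p : c1)
            for (double[] q : c2)
                worst = Math.max(worst, euclidean(p, q));
        return worst;
    }

    // c) average linkage: average pairwise distance between the two clusters
    static double averageLink(List<double[]> c1, List<double[]> c2) {
        double sum = 0.0;
        for (double[] p : c1)
            for (double[] q : c2)
                sum += euclidean(p, q);
        return sum / (c1.size() * c2.size());
    }

    private static double euclidean(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }
}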

Let's take an example of hierarchical clustering using single-linkage:

Step 1 is done by assigning each item (city) to its own cluster.









Step 2: The nearest pair of cities is MI and TO, at distance 138, as seen in the distance matrix above.
These are merged into a single cluster called "MI/TO". The level of the new cluster is L(MI/TO) = 138.
The distance matrix after that is:









Steps 2 and 3 are repeated until everything merges into a single cluster:





















Finally, the last two clusters merge at level 295.
The process is summarized by the following hierarchical tree:











K-Means Clustering : K-means is one of the simplest unsupervised learning algorithms that solve the well known clustering problem.
At the beginning of k-means clustering we determine the number of clusters K and assume the centroids (centers) of these clusters. We can take any random objects as the initial centroids, or the first K objects in the sequence can also serve as the initial centroids.
Steps are (a short code sketch follows):
Iterate until stable (= no object moves between groups):
Step 1 : Determine the centroid coordinates.
Step 2 : Determine the distance of each object to the centroids.
Step 3 : Group the objects based on minimum distance.
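Below is a minimal, illustrative Java sketch of this loop; it is not Mahout code, and the object coordinates in main() are taken from the numeric example further below.

import java.util.Arrays;

public class KMeans {

    public static int[] cluster(double[][] points, double[][] centroids) {
        int k = centroids.length;
        int[] group = new int[points.length];
        boolean changed = true;
        while (changed) {                       // iterate until no object moves
            changed = false;
            // Step 2 + 3: assign each object to its nearest centroid
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double dist = euclidean(points[i], centroids[c]);
                    if (dist < bestDist) { bestDist = dist; best = c; }
                }
                if (group[i] != best) { group[i] = best; changed = true; }
            }
            // Step 1 (for the next iteration): recompute each centroid as the mean of its group
            double[][] sum = new double[k][points[0].length];
            int[] count = new int[k];
            for (int i = 0; i < points.length; i++) {
                count[group[i]]++;
                for (int dim = 0; dim < points[i].length; dim++) sum[group[i]][dim] += points[i][dim];
            }
            for (int c = 0; c < k; c++)
                if (count[c] > 0)
                    for (int dim = 0; dim < sum[c].length; dim++) centroids[c][dim] = sum[c][dim] / count[c];
        }
        return group;
    }

    private static double euclidean(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        // The numeric example below: A(1,1), B(2,1), C(4,3), D(5,4), initial centroids A and B
        double[][] objects = {{1, 1}, {2, 1}, {4, 3}, {5, 4}};
        double[][] centroids = {{1, 1}, {2, 1}};
        System.out.println(Arrays.toString(cluster(objects, centroids))); // expected: [0, 0, 1, 1]
    }
}

Running main() reproduces the final grouping of the numeric example: A and B end up in one group and C and D in the other.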














Let's go through a numeric example of k-means:

Suppose we have four objects and each object has 2 attributes/features, as shown below:












Our goal is to group the objects into K = 2 groups based on the two features (Weight, Density).

Step 1 : Initial value of the centroids: suppose we take object A and object B as the first centroids. Let c1 and c2 denote the coordinates of the centroids; then c1 = (1,1) and c2 = (2,1).

Step 2 : Object-centroid distances: calculate the distance between each cluster centroid and each object.
Let us use the Euclidean distance for this calculation. The distance matrix at iteration 0 is:








Each column in the distance matrix represents an object. The first row of the distance matrix corresponds to the distance of each object to the first centroid, and the second row is the distance of each object to the second centroid.

Step 3 : In the object clustering step, we assign each object based on the minimum distance. Thus object A is assigned to group 1, and objects B, C and D to group 2.
In the group matrix below, an element is 1 if and only if the object is assigned to that group.







Step 4 : In iteration 1, determine the centroid of each group:
Centroid of group 1: c1 = (1,1), because group 1 has only one member.
Centroid of group 2: c2 = ((2+4+5)/3, (1+3+4)/3) = (11/3, 8/3), because group 2 has three members.


Step 5 : Compute the distance matrix with the new centroids, as done in Step 2:






Step 6 : Compute the group matrix from the distance matrix of Step 5:

  





Step 7 : Repeat Step 4 to calculate the new centroids:










Step 8 : In iteration 2, calculate the distance matrix with the new centroids of Step 7:






Step 9 : Group matrix from Step 8 as :









We obtain the result G2 = G1. Comparing the grouping of the last iteration with this iteration reveals that no object moves between groups anymore. Thus the k-means computation has reached its stable state and no more iterations are needed.

Final grouping results are :









Tips for clustering are presented in this blog post: http://datachurn.blogspot.in/2013/05/clustering-tips.html

Dimensionality Reduction in M/C Learning

Dimensionality reduction in machine learning and statistics is the process of reducing the number of random variables under consideration.
It is divided into 2 categories :
1) Feature Selection
2) Feature Extraction

Feature Selection : Reduces dimensionality by selecting a subset of the original variables. It finds the interesting features within the original feature set.

Why Use Feature Selection? :


a) Simplifying or speeding up computations with only little loss in classification quality.
b) Reduce dimensionality of feature space and improve the efficiency, performance gain, and precision of the classifier.
c) Improve classification effectiveness, computational efficiency, and accuracy.
d) Remove non-informative and noisy features and reduce the feature space to a manageable size.
e) Keep computational requirements and dataset size small, especially for those text categorization algorithms that do not scale with the feature set size.


Some feature selection methods are:
1) TF and TF-IDF :
TF :  In the case of the term frequency tf(t,d), the simplest choice is to use the raw frequency of a term in a document, i.e. the number of times that term t occurs in document d. If we denote the raw frequency of t by f(t,d), then the simple tf scheme is tf(t,d) = f(t,d).

IDF : The inverse document frequency is a measure of whether the term is common or rare across all documents. It is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient.

idf(t,D) = log( |D| / |{d ∈ D : t ∈ d}| )


TF-IDF : tf–idf is the product of two statistics, term frequency and inverse document frequency.  tfidf(t,d,D) = tf(t,d) * idf(t,D)
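As a toy illustration (assumed, not part of the original post), tf-idf can be computed over a tiny in-memory corpus like this:

import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class TfIdf {

    // tfidf(t,d,D) = tf(t,d) * idf(t,D), with tf = raw count and idf = log(|D| / |{d in D : t in d}|)
    public static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
        double tf = Collections.frequency(doc, term);
        long docsWithTerm = corpus.stream().filter(d -> d.contains(term)).count();
        double idf = Math.log((double) corpus.size() / docsWithTerm);
        return tf * idf;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("the", "cat", "sat"),
                Arrays.asList("the", "dog", "ran"),
                Arrays.asList("the", "cat", "and", "the", "dog"));
        System.out.println(tfIdf("cat", corpus.get(0), corpus)); // rarer term, non-zero weight
        System.out.println(tfIdf("the", corpus.get(0), corpus)); // appears in every document, idf = 0
    }
}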

2) Mutual Information : The mutual information method assumes that a term with a higher category ratio is more effective for classification.
MI(t,c) = log( (A * N) / ((A+C)(A+B)) )

A : Number of documents that contain term t and also belong to category c.
B : Number of documents that contain term t but don't belong to category c.
C : Number of documents that don't contain term t but belong to category c.
D : Number of documents that don't contain term t and also don't belong to category c.

3) Chi Square : Chi-square measures the lack of independence between a term t and a category c:
χ² = N(AD − CB)² / ((A+C)(B+D)(A+B)(C+D))

A, B, C, D are as defined above and N is the total number of documents.
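As a rough sketch, both the mutual information and chi-square scores can be computed directly from these counts; the counts in main() are made up purely for illustration:

public class TermCategoryScores {

    // MI(t,c) = log( (A * N) / ((A+C)(A+B)) ), with N = A + B + C + D
    static double mutualInformation(double a, double b, double c, double d) {
        double n = a + b + c + d;
        return Math.log((a * n) / ((a + c) * (a + b)));
    }

    // chi2(t,c) = N (AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D))
    static double chiSquare(double a, double b, double c, double d) {
        double n = a + b + c + d;
        return n * Math.pow(a * d - c * b, 2) / ((a + c) * (b + d) * (a + b) * (c + d));
    }

    public static void main(String[] args) {
        // hypothetical contingency counts for one (term, category) pair
        double a = 49, b = 27, c = 141, d = 774;
        System.out.println(mutualInformation(a, b, c, d));
        System.out.println(chiSquare(a, b, c, d));
    }
}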
From the chi-square value, the significance (p-value) can be read from a table, as shown below:











Feature Extraction : It reduces dimensionality by a (linear or non-linear) projection of a D-dimensional vector onto a d-dimensional vector (d < D).
The main linear technique for dimensionality reduction, principal component analysis, performs a linear mapping of the data to a lower dimensional space in such a way that the variance of the data in the low-dimensional representation is maximized.
Principal Component Analysis (PCA) Algorithm :

1) Create the covariance matrix for the features. The covariance matrix (also known as the dispersion matrix or variance–covariance matrix) is a matrix whose element in the (i, j) position is the covariance between the i-th and j-th elements of a random vector (that is, of a vector of random variables).
The covariance matrix is shown below:








It measures the relation between features.

2) Compute the "eigenvectors" of the covariance matrix ∑.

[u,s,v] = svd(covariance matrix ∑)

SVD : Singular Value Decomposition

SVD returns U matrix as shown below :
















For calculating the k-dimensional z, we first take the first k columns of the U matrix (the reduced U):



















Calculate z from the reduced U matrix as:

















It can be written as:
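This is presumably the standard PCA projection; written in LaTeX notation, with m the number of examples and U_reduce the first k columns of U:

\Sigma = \frac{1}{m}\sum_{i=1}^{m} x^{(i)} (x^{(i)})^{T}, \qquad
[U, S, V] = \mathrm{svd}(\Sigma), \qquad
z^{(i)} = U_{\mathrm{reduce}}^{T}\, x^{(i)}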







Classifier/Model Analysis

After generating the classifier/model, we analyze it. Two terms, precision and recall, define the strength of the classifier/model.
Precision (positive predicted value) : the fraction of retrieved instances that are relevant. Example: for a query, a search engine returns 10 pages to the user out of a collection of 20 pages. Only 6 of the 10 retrieved pages correctly match the query, while in total 9 of the 20 pages exactly match the query.

Precision : (number of retrieved documents that exactly match the query) / (total number of retrieved documents) = 6/10
Precision : (True Positive) / (True Positive + False Positive) = 6 / (6 + 4)

Precision can be seen as a measure of exactness or quality.

Recall  (Sensitivity) : Fraction of relevant instances retrieved.

Recall : (number of retrieved documents that exactly match the query) / (total number of documents that exactly match the query) = 6/9

Recall : (True Positive) / (True Positive + False Negative) = 6 / (6 + 3)

Recall is a measure of completeness or quantity.

In statistics, if the null hypothesis is that all and only the relevant items are retrieved, absence of type I and type II errors corresponds respectively to maximum precision (no false positives) and maximum recall (no false negatives). 

type I error  :  False Positive : 10 -6 = 4 for above example.
type II error : False Negative : 9 -6 = 3 for above example.

Often, there is an inverse relationship between precision and recall, where it is possible to increase one at the cost of reducing the other.

Usually, precision and recall scores are not measured in isolation. Instead, either the values of one measure are compared at a fixed level of the other, or both are combined into a single measure such as the F-measure (balanced F-score), which is the harmonic mean of precision and recall.
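As a quick check of the numbers in the example above (TP = 6, FP = 4, FN = 3), here is a tiny Java snippet:

public class PrecisionRecall {
    public static void main(String[] args) {
        double tp = 6, fp = 4, fn = 3;
        double precision = tp / (tp + fp);                          // 6/10 = 0.6
        double recall = tp / (tp + fn);                             // 6/9  ~ 0.667
        double f1 = 2 * precision * recall / (precision + recall);  // harmonic mean ~ 0.632
        System.out.printf("precision=%.3f recall=%.3f F1=%.3f%n", precision, recall, f1);
    }
}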

Another analysis used to check the efficiency of a model is decile analysis.
Decile analysis is created to test the model's ability to predict the intended outcome. Each column of the decile analysis chart (x-axis) represents a collection of records that have been scored using the model. The height of each column (y-axis) represents the average of those records' actual behavior.

Steps to calculate Decile Analysis are :

Step 1 : The records are sorted by their predicted scores in descending order and divided into ten equal-sized bins or deciles. The top decile contains the 10% of the population most likely to respond and the bottom decile contains the 10% of the population least likely to respond, based on the model scores.

Step 2 : The deciles and their actual response rates are graphed on the x and y axes, respectively.

When we're looking at a decile analysis, we want to see a staircase effect; that is, we want the bars to descend in order from left to right, as shown below:














In contrast, if the bars seem to be out of order or flat, the decile analysis is telling us that the model is not doing a very good job of predicting actual responses.
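A rough sketch of how such a decile table can be computed (illustrative only; it assumes the number of records is a multiple of 10):

import java.util.Arrays;

public class DecileAnalysis {

    // Sort records by model score (descending), split into 10 equal bins,
    // and return the average actual response of each bin.
    public static double[] decileResponse(double[] scores, double[] actuals) {
        Integer[] order = new Integer[scores.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, (x, y) -> Double.compare(scores[y], scores[x]));
        double[] avg = new double[10];
        int binSize = scores.length / 10;
        for (int d = 0; d < 10; d++) {
            double sum = 0.0;
            for (int i = d * binSize; i < (d + 1) * binSize; i++) sum += actuals[order[i]];
            avg[d] = sum / binSize;
        }
        return avg; // a good model shows a staircase: avg[0] >= avg[1] >= ... >= avg[9]
    }
}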

Thursday, May 23, 2013

Apache Mahout Setup with different configuration on Hadoop Cluster

Setup Mahout for Classification :

Step 1 : Download and extract Mahout as the hadoop user ("hduser" or another), as mentioned in the previous hadoop setup blog.

Step 2 : Start hadoop. If there is any issue, follow the previous hadoop setup blog. Set the HADOOP_HOME environment variable to the hadoop path as :
export HADOOP_HOME=/home/hduser/hadoop/

Step 3 : Put the classification data on the hadoop file system from the local file system as :

/home/hduser/hadoop/bin/hadoop fs -put  /home/hduser/train-data  train-data

We can check whether the data has been copied to HDFS with this command :
/home/hduser/hadoop/bin/hadoop fs -ls  /home/hduser/

Step 4 : Convert this data to a sequence file (HDFS format) as :
/home/hduser/mahout/bin/mahout seqdirectory -i train-data -o train-seq
It is assumed that mahout is present at the /home/hduser/mahout path.

Step 5 : Convert this sequence file to sparse vectors as :
/home/hduser/mahout/bin/mahout seq2sparse -i train-seq -o complete-vectors

Step 6 : Split these vectors into a training part and a testing part as :

/home/hduser/mahout/bin/mahout split -i complete-vectors/tfidf-vectors --trainingOutput train-vectors --testOutput test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
This command splits the data 60% for training and 40% for testing.
The --randomSelectionPct 40 option gives the percentage used for testing; the rest is used for training.

Step 7 : Train on the training data with any of the available algorithms. This example uses Naive Bayes training :
 /home/hduser/mahout/bin/mahout trainnb -i train-vectors -el -li labelindex -o model -ow -c

Step 8 : Test the model with the testing data as :
/home/hduser/mahout/bin/mahout testnb -i test-vectors -m model -l labelindex -ow -o op-testing -c 2>&1 | tee Result.txt

It will also print the confusion matrix.

Tips for Apache Mahout
1) Apache Mahout will run on the hadoop cluster if MAHOUT_LOCAL is not set in the environment. If MAHOUT_LOCAL is set to true, it will run on the local node.

2) Apache Mahout provides only the "seqdirectory" option, which converts a directory of normal text files to a sequence directory. It does not provide an option to convert one single large file for classification to a sequence file.

For this text-file-to-sequence-file conversion, we have to write Java code explicitly and then put the resulting sequence file on HDFS as :

/home/hduser/hadoop/bin/hadoop fs -put  /home/hduser/train-seq  train-seq 


Then start the process from Step 5.

The Java code for conversion of a text file (.txt/.csv etc.) is :



import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.Writer;
import org.apache.hadoop.io.Text;
public class ConvertTextToSeq {

/**
* @param args
*/
public static void main(String[] args)
{
try
{
System.out.println("Start time " + System.currentTimeMillis());
if (args.length != 2) {
           System.err.println("Arguments: [input tsv file] [output sequence file]");
           return;
       }
       String inputFileName = args[0];
       String outputDirName = args[1];

       final Configuration configuration = new Configuration();
       final FileSystem fs = FileSystem.get(configuration);
       SequenceFile.Writer writer = new SequenceFile.Writer(fs, configuration, new Path(outputDirName + "/chunk-0"),
               Text.class, Text.class);

       int count = 0;
       BufferedReader reader = new BufferedReader(new FileReader(inputFileName));
       Text key = new Text();
       Text value = new Text();
       while(true) {
           String line = reader.readLine();
           if (line == null) {
               break;
           }
           String[] strArr = line.split(" ");
           int len = strArr[0].length() + strArr[1].length()+ 2;
         
         
           String category = strArr[0].split("##")[0];
           String id = strArr[1];
           String message = line.substring(len);
         
           key.set("/" + id + "/" + category);
         
           value.set(message);
           writer.append(key, value);
           count++;
       }
       writer.close();
       System.out.println("time took " + System.currentTimeMillis());
     
     
   
}
catch(Exception e)
{
e.printStackTrace();
}

}

}

Run the program as :
java ConvertTextToSeq /home/hduser/mahout/train.csv /home/hduser/mahout/train-seq

The CSV file in this example has 3 columns: the first is the category, the second is the label (id), and the third is the description. In the sequence-file conversion, always use the label column in the key and the description column as the value.
Then do all the steps from Step 5 onward.


3) In the steps above, we used --randomSelectionPct 40 for the training/testing split. To explicitly provide our own training and testing data and compute the accuracy ourselves, instead of getting it from Apache Mahout, we follow these steps and use the Java code below.


#################################################################################

Step A : If the input data is a directory, use Steps 1 to 4 above; otherwise convert the text file to a sequence file with the ConvertTextToSeq code above.
After Step A, the train-seq file will be on HDFS.

Step B : Convert sequence file to vectors as :
/home/hduser/mahout/bin/mahout seq2sparse -i train-seq -o train-vectors

Step C : Train the model on the training data with any algorithm provided by Mahout. The following example uses Naive Bayes :

/home/hduser/mahout/bin/mahout trainnb -i train-vectors/tfidf-vectors -el -li labelindex -o model -ow -c

Step D : After this, get these files from HDFS to the local system as :
/home/hduser/hadoop/bin/hadoop fs -get labelindex /home/hduser/mahout-work/labelindex
/home/hduser/hadoop/bin/hadoop fs -get model /home/hduser/mahout-work/model
/home/hduser/hadoop/bin/hadoop fs -get train-vectors/dictionary.file-0 /home/hduser/mahout-work/dictionary.file-0
/home/hduser/hadoop/bin/hadoop fs -getmerge train-vectors/df-count /home/hduser/mahout-work/df-count

Step E : Run the Java code below, giving the Step D files as command line arguments :

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.mahout.classifier.naivebayes.BayesUtils;
import org.apache.mahout.classifier.naivebayes.NaiveBayesModel;
import org.apache.mahout.classifier.naivebayes.StandardNaiveBayesClassifier;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.Vector.Element;
import org.apache.mahout.vectorizer.DefaultAnalyzer;
import org.apache.mahout.vectorizer.TFIDF;

import com.google.common.collect.ConcurrentHashMultiset;
import com.google.common.collect.Multiset;


public class ClassificationForMahout {

public static Map<String, Integer> readDictionnary(Configuration conf, Path dictionnaryPath) {
Map<String, Integer> dictionnary = new HashMap<String, Integer>();
for (Pair<Text, IntWritable> pair : new SequenceFileIterable<Text, IntWritable>(dictionnaryPath, true, conf)) {
dictionnary.put(pair.getFirst().toString(), pair.getSecond().get());
}
return dictionnary;
}

public static Map<Integer, Long> readDocumentFrequency(Configuration conf, Path documentFrequencyPath) {
Map<Integer, Long> documentFrequency = new HashMap<Integer, Long>();
for (Pair<IntWritable, LongWritable> pair : new SequenceFileIterable<IntWritable, LongWritable>(documentFrequencyPath, true, conf)) {
documentFrequency.put(pair.getFirst().get(), pair.getSecond().get());
}
return documentFrequency;
}

public static void main(String[] args) throws Exception {
System.out.println("Start time :" + System.currentTimeMillis());
if (args.length < 5) {
System.out.println("Arguments: [model] [label index] [dictionnary] [document frequency] [tweet file]");
return;
}
String modelPath = args[0];
String labelIndexPath = args[1];
String dictionaryPath = args[2];
String documentFrequencyPath = args[3];
String testFilePath = args[4];

Configuration configuration = new Configuration();

// model is a matrix (wordId, labelId) => probability score
NaiveBayesModel model = NaiveBayesModel.materialize(new Path(modelPath), configuration);

StandardNaiveBayesClassifier classifier = new StandardNaiveBayesClassifier(model);

// labels is a map label => classId
Map<Integer, String> labels = BayesUtils.readLabelIndex(configuration, new Path(labelIndexPath));
Map<String, Integer> dictionary = readDictionnary(configuration, new Path(dictionaryPath));
Map<Integer, Long> documentFrequency = readDocumentFrequency(configuration, new Path(documentFrequencyPath));


// analyzer used to extract word from tweet
Analyzer analyzer = new DefaultAnalyzer();

int labelCount = labels.size();
int documentCount = documentFrequency.get(-1).intValue();

System.out.println("Number of labels: " + labelCount);
System.out.println("Number of documents in training set: " + documentCount);
BufferedReader reader = new BufferedReader(new FileReader(testFilePath));
String outputFile = "/home/hduser/result.txt";
FileWriter f1 = new FileWriter(outputFile,true); 
BufferedWriter out = new BufferedWriter(f1);
int correctCounter=0;
int totalCounter=0;
while(true)
{
String line = reader.readLine();
if (line == null) {
break;
}
String[] arr = line.split(" ");
String catId = arr[0];
String label = arr[1];
String msg = line.substring(arr[0].length() + arr[1].length() + 2);


Multiset<String> words = ConcurrentHashMultiset.create();

// extract words from Msg
TokenStream ts = analyzer.reusableTokenStream("text", new StringReader(msg));
CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
ts.reset();
int wordCount = 0;
while (ts.incrementToken()) {
if (termAtt.length() > 0) {
String word = ts.getAttribute(CharTermAttribute.class).toString();
Integer wordId = dictionary.get(word);
// if the word is not in the dictionary, skip it
if (wordId != null) {
words.add(word);
wordCount++;
}
}
}

// create vector wordId => weight using tfidf
Vector vector = new RandomAccessSparseVector(10000);
TFIDF tfidf = new TFIDF();
for (Multiset.Entry<String> entry:words.entrySet()) {
String word = entry.getElement();
int count = entry.getCount();
Integer wordId = dictionary.get(word);
Long freq = documentFrequency.get(wordId);
double tfIdfValue = tfidf.calculate(count, freq.intValue(), wordCount, documentCount);
vector.setQuick(wordId, tfIdfValue);
}
// With the classifier, we get one score for each label 
// The label with the highest score is the one the tweet is more likely to
// be associated to
Vector resultVector = classifier.classifyFull(vector);
//double bestScore = -Double.MAX_VALUE;
double bestScore =Double.MAX_VALUE;
int bestCategoryId = -1;
String resultStr=catId+" ";
for(Element element: resultVector) 
{
int categoryId = element.index();
double score = -1 * element.get();
if (score < bestScore) {
bestScore = score;
bestCategoryId = categoryId;
}
//System.out.print("  " + labels.get(categoryId) + ": " + score);
if(resultStr.equalsIgnoreCase(catId + " "))
{
resultStr=resultStr + labels.get(categoryId) + " " + score;
}
else
{
resultStr=resultStr + "   " + labels.get(categoryId) + " " + score;
}
}
try
{
out.write(resultStr);
out.write("\n");
 
}
catch(Exception e)
{
}
//System.out.println(label + " => " + labels.get(bestCategoryId));
out.write(label + " => " + labels.get(bestCategoryId));
out.write("\n");
totalCounter++;
if(label.equalsIgnoreCase(labels.get(bestCategoryId)))
{
 
correctCounter++;
System.out.println("correctCounter : " + correctCounter);
}
};
//Close the output stream
System.out.println("correctCounter : " + correctCounter + " TotalCounter :" + totalCounter);
System.out.println("End time :" + System.currentTimeMillis());
System.out.println("Accuracy : " +  (double)correctCounter/totalCounter);
out.close();
}
}

There are 5 command line arguments to run this code :

Argument 1 : model
Argument 2 :  labelindex
Argument 3 :  dictionary.file-0
Argument 4 :  df-count
Argument 5 :  testing file

It generates result.txt, which contains the score for each label corresponding to each test description. The description is assigned to the label with the best score.





Hadoop Setup on Multinode Cluster(Linux)


Step 1 : Create a new user for hadoop (hduser or another) on Ubuntu/Red Hat:

useradd hduser
passwd hduser
/*Type the password*/

Step 2 : Create a new group hadoop and add hduser to it.
addgroup hadoop
adduser --ingroup hadoop hduser

Do all the following steps from /home/hduser, otherwise permission denied problems occur at some steps.
Step 3 : Download hadoop tar file

Step 4 : Extract in /home/hduser/

Step 5 : Disable IPv6 as :
Open /etc/sysctl.conf and add these lines:
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

 Step 6 : Check whether IPv6 is enabled on your machine with the following command:
cat /proc/sys/net/ipv6/conf/all/disable_ipv6



Step 7 : Add entries for all machines in /etc/hosts. If you are not able to edit this file, change its permissions so that everyone has access
(chmod 777 /etc/hosts)

like this :

152.144.198.245 tarunrhels1
152.144.198.246 tarunrhels2
152.144.198.247 tarunrhels3


Here tarunrhels1, tarunrhels2 and tarunrhels3 are the names of the machines we are using for the hadoop cluster. The list includes both Namenodes and Datanodes.



Steps for the Namenode

Do all of the above 7 steps on each machine that will be used for hadoop.
For the hadoop setup, we have to create one Namenode and make the others Datanodes.


Step 8 : If ssh is not running on the machine, first install ssh.

Generate an ssh key for hduser as :
ssh-keygen -t rsa -P ""

Step 9 : Enable SSH access to local machine with this newly created key as :
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Step 10 : Hadoop uses temporary directories both for the local file system and for HDFS, where it generates data files.
For the local system, create the directory as :
mkdir -p /home/hduser/hdfs

Step 11 : Change the JAVA_HOME path in the conf/hadoop-env.sh file according to the java installed on the linux machine.

Step 12 : Change conf/core-site.xml as :

 <?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

 <!-- Put site-specific property overrides in this file. -->

 <configuration>

 <property>
    <name>fs.default.name</name>
        <value>hdfs://152.144.198.245</value>
            <description>The name of the default file system.  Either the
                  literal string "local" or a host:port for NDFS.
            </description>
        <final>true</final>
</property>


 </configuration>



Step 13 : Change conf/mapred-site.xml as :

 <?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

 <!-- Put site-specific property overrides in this file. -->

 <configuration>

 <property>
    <name>mapred.job.tracker</name>
        <value>152.144.198.245:50300</value>
            <final>true</final>
            </property>

 <property>
    <name>mapred.system.dir</name>
        <value>/home/marvin1/mapred/system</value>
            <final>true</final>
            </property>

 <property>
    <name>mapred.local.dir</name>
        <value>/home/marvin1/cache/mapred/local</value>
            <final>true</final>
            </property>


 </configuration>


Step 14 : Change conf/hdfs-site.xml as :

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

 <property>
     <name>dfs.name.dir</name>
         <value>/home/hduser/hdfs/name</value>
             <description>Determines where on the local filesystem the DFS name
node should store the name table.  If this is a comma-delimited list
of directories then the name table is replicated in all of the
                  directories, for redundancy.
                </description>
          <final>true</final>
  </property>

  <property>
      <name>dfs.data.dir</name>
          <value>/home/hduser/hdfs/data</value>
              <description>Determines where on the local filesystem an DFS data node should store its blocks.  If this is a comma-delimited list of directories,
then data will be stored in all named directories, typically on different devices.Directories that do not exist are ignored.
              </description>
      <final>true</final>
   </property>


</configuration>



Step 15 : Add a new file conf/masters.
In the masters file, add the entry for the Namenode. The entry will be the IP address of the machine or localhost.

Step 16 : Add a new file conf/slaves.
In the slaves file, add an entry (the IP address) for each slave/Datanode machine. If we want the master node to act as a Datanode as well, add the master node's entry to slaves too.
The slaves file looks like :

152.144.198.245 tarunrhels1
152.144.198.246 tarunrhels2
152.144.198.247 tarunrhels3

Step 17 : To format the filesystem for HDFS, run the command :
/home/hduser/hadoop/bin/hadoop namenode -format

 ###############################################################################
Do the following steps on each Datanode.

Login to the slave/datanode (152.144.198.246) and do these steps:

 Step 1 to Step 7 as mentioned above.

Step A : Copy conf/core-site.xml, conf/mapred-site.xml and conf/hdfs-site.xml from the Namenode/Master machine to the same conf/ paths on the Datanode/Slave machine using the scp command.

Step B : Copy id_rsa.pub from the master node to the slave node through scp as :

scp hduser@152.144.198.245:/home/hduser/.ssh/id_rsa.pub /home/hduser/.ssh/

Step C : Do this :
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys


Login to master node again and do these steps

Step X_1 : Start hadoop processes from master node as :
/home/hduser/hadoop/bin/start-all.sh

Step X_2 : Check the processes with the jps command; it should list processes like:


 50682 TaskTracker
50471 JobTracker
49554 Jps
50084 DataNode
50281 SecondaryNameNode
49881 NameNode


Step X_3 : Run the same jps command on each slave/Datanode machine. It should display results like:

62216 Jps
8122 TaskTracker


Tips :

T1) If the jps command is not working, add your java installation to the PATH variable as :
export PATH=$PATH:/usr/java/jdk1.7.0_15/bin

jps helps to check the status of the hadoop processes.

T2) Check with netstat whether Hadoop is listening on the configured port on the master machine:
netstat -plten | grep java

If the above command shows that the port you configured in conf/mapred-site.xml is in a conflicting state, change the port number in conf/mapred-site.xml and start the hadoop processes again. Then check the port status again with the above command.

T3) Sometimes hadoop is not able to write to HDFS and gives a permission denied error or a java.io.IOException. In that case do these steps on the master machine:
Step a) Stop the hadoop processes:
/home/hduser/hadoop/bin/stop-all.sh
Step b) Delete the data and name folders from the /home/hduser/hdfs folder.
Step c) Start hadoop again through /home/hduser/hadoop/bin/start-all.sh
Hopefully by this it will start working.

T4) Sometimes you may face the problem of safe mode. If safe mode is on, hadoop starts giving errors. Turn off safe mode with:
/home/hduser/hadoop/bin/hadoop dfsadmin -safemode leave


Wednesday, May 22, 2013

Recommendation System in M/C Learning

A recommendation system is an important feature of machine learning. Many e-commerce portals such as eBay, Amazon and Netflix provide recommendations to users. They provide recommendations by analyzing user behavior through the past history of items purchased by the user or through item selling patterns.

Here is an example of predicting movie ratings by user:

Example :










In the table above, a question mark (?) marks the movies the user didn't rate. According to the rating table, Ravi, Deepak and Vijay like romantic movies, while Alice and Bob like action movies.


nu : number of users, nm : number of movies
r(i,j) = 1 if user j has rated movie i.
y(i,j) = rating given by user j to movie i (defined only if r(i,j) = 1).


The recommendation system will predict the rating for every '?' in the table above by analyzing user behavior, e.g. Ravi, Deepak and Vijay like romantic movies and Alice and Bob like action movies.


Different approaches are used in building recommendation systems. One approach is the content-based recommendation system.

Content-Based Recommendation System:
Let's assume we have two features x1, x2 for the movie rating data above, which indicate how strongly a movie is romantic/action.






For each user we predict y(i,j), calculated as the product of Θ(j) transpose and x(i). Here Θ(j) is the parameter vector of user j and x(i) is the feature vector of movie i.

For the movie DDLJ the feature vector x(1) is [1 0.9 0], and let's assume we have already learned the parameter vector Θ(1) for user 1 as:








(Θ(1))ᵀ x(1) = 1*0 + 0.9*4 + 0*1 = 3.6

So user Ravi will give rating 3.6 to Aashiqui-2.



Learn Θ :


To learn Θ, we minimize the following function:
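This is presumably the standard regularized squared-error objective, written here in LaTeX notation:

\min_{\theta^{(j)}} \; \frac{1}{2} \sum_{i:\, r(i,j)=1} \left( (\theta^{(j)})^{T} x^{(i)} - y^{(i,j)} \right)^{2} + \frac{\lambda}{2} \sum_{k=1}^{n} \left( \theta_{k}^{(j)} \right)^{2}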









Another approach is Collaborative Filtering :


In a content-based recommendation system the feature vectors x(1), x(2) are already provided. In collaborative filtering they are learned automatically from the users.


We get information from each user about what kind of movies (romantic/action) he/she likes, i.e. we get a Θ value from each user.












The figure above explains that we get a Θ value from each user, and from those values we predict the feature values, i.e. how strongly a movie is an action movie or a romantic movie.

To get the value of x(i), we minimize the following function:
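This is presumably the analogous objective with the roles of Θ and x swapped, in LaTeX notation:

\min_{x^{(i)}} \; \frac{1}{2} \sum_{j:\, r(i,j)=1} \left( (\theta^{(j)})^{T} x^{(i)} - y^{(i,j)} \right)^{2} + \frac{\lambda}{2} \sum_{k=1}^{n} \left( x_{k}^{(i)} \right)^{2}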









For each product (Θ(j))ᵀ x(i), we are learning the feature vector x(i).


x(1) = Romance,   x(2) = Action,   x(3)= Horror


How can we find movie i related to movie j ?


Find the minimum value of || x(i)  -  x(j)||


The movie i with the minimum value of ||x(i) − x(j)|| is the movie most similar to movie j.


In this way we can find the 5/10/20 most similar movies to recommend for movie j.
In the equation above we predict the x values from given Θ values, while in the content-based recommendation earlier we predicted Θ values from given feature values x.

So from x predict  Θ and from  Θ predict x.

X ----predict--->Θ---predict---->X----predict---->Θ---predict---->X----predict--->Θ---predict---->X


This is collaborative filtering, where each user collaborates with the other users to predict movie ratings in a better way.


Mean Normalization : The mean normalization method is used to predict ratings for users who haven't rated any movie yet.


For each user j, predict the rating for movie i as:
(Θ(j))ᵀ x(i) + µ(i), where µ(i) is the average of the ratings given to movie i.

For a new user 6, Θ(6) is:


then (Θ(6))ᵀ x(i) = 0 and (Θ(6))ᵀ x(i) + µ(i) = µ(i).
So the rating from the new user would be the average of the ratings given to movie i, as shown in the figure: