Tuesday, June 11, 2013

Data Preparation and Predictive Modeling Mistakes in M/C Learning and How to Avoid Them


1) Mistake (Including ID Fields as Predictors) :
Because most IDs look like continuous integers (and older IDs are typically smaller), they can make their way into the model as predictive variables. Be sure to exclude them as early in the process as possible to avoid any confusion while building the model.
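A minimal sketch of dropping ID-like columns up front, assuming a pandas DataFrame whose identifier columns end in "_id" (the column names here are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [101, 102, 103],   # ID field: exclude from modeling
    "age": [34, 51, 29],
    "balance": [1200.0, 340.5, 980.0],
})

# Drop anything that looks like an identifier before modeling.
id_cols = [c for c in df.columns if c.endswith("_id")]
X = df.drop(columns=id_cols)
```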

2) Mistake (Using Anachronistic Variables) :
Make sure that no predictor variables leak information about the outcome. Because models are built using historical data, it is possible that some of the variables you have access to when building your model would not have been available at the point in time the model is meant to reflect.

3) Mistake (Selecting the Wrong Y-Variable) :
When building your dataset for a logistic regression model, select the response with the smaller number of data points as your y-variable. A great example from the higher ed world comes from building a retention model. In most cases, we actually want to model attrition, identifying those who are likely to leave (hopefully the smaller group) rather than those who are likely to stay.

4) Mistake (Allowing Duplicate Records) :
Don't include duplicates in a model file. Including two records for the same person gives that person twice as much predictive power. To make sure that each person's influence counts equally, include only one record per person or action being modeled.
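With pandas, deduplication is a one-liner; this sketch assumes a hypothetical person_id key:

```python
import pandas as pd

df = pd.DataFrame({
    "person_id": [1, 1, 2, 3],
    "outcome":   [0, 0, 1, 0],
})

# Keep one record per person so each counts equally.
df = df.drop_duplicates(subset="person_id", keep="first")
```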

5) Mistake (Modeling on Too Small a Population) :
Double-check your population size. A good goal to shoot for in a modeling dataset is at least 1,000 records spanning three years. Including at least three years helps to account for any year-to-year fluctuations in our dataset. The larger our population size is, the more robust our model will be.

6) Mistake (Judging the Quality of a Model Using One Measure) :
It's a mistake to capture the quality of a model with a single measure. We should evaluate it from several angles, for example by combining cross-validation with multiple metrics.

7) Mistake (Not Accounting for Outliers and/or Missing Values) :
Be sure to account for any outliers and/or missing values. Large rifts in individual variables can add up when we combine those variables to build a predictive model. Checking the minimum and maximum values for each variable is a quick way to spot records that fall outside the usual realm.
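A quick sketch of the min/max spot-check described above, using pandas (the column names and values are made up):

```python
import pandas as pd

df = pd.DataFrame({"age": [34, 51, 29, 240],        # 240 is clearly an outlier
                   "balance": [1200.0, None, 980.0, 340.5]})

# Min/max reveal out-of-range records; isna().sum() counts missing values.
print(df.agg(["min", "max"]))
print(df.isna().sum())
```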

8) Mistake (Failing to Consider Enough Variables) :
When deciding which variables to audition for a model, include anything you have on hand that could possibly be predictive; leaving variables out based on intuition is the mistake. Separating out the unhelpful variables is something the modeling program will do, so don't be afraid to throw the kitchen sink at it for your first pass.

Monday, June 10, 2013

Underfitting/Overfitting Problem in M/C Learning


Underfitting : If our algorithm works badly even on the points in our data set, then the algorithm is underfitting the data set. This can be checked easily through the cost function. The cost function in linear regression is half the mean squared error; for example, if the mean squared error is C, the cost function is 0.5C. If in an experiment the cost ends up high even after many iterations, then chances are we have an underfitting problem, and we can say that the learning algorithm is not a good fit for the problem. Underfitting is also known as high bias (a strong bias towards its hypothesis). In other words, the hypothesis space the learning algorithm explores is too small to properly represent the data.

How to avoid underfitting :
More data will not generally help; it will, in fact, likely increase the training error. Instead, we should add more features, because that expands the hypothesis space. This includes making new features from existing features. In the same way, more parameters may also expand the hypothesis space.
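One common way to make new features from existing ones is a polynomial expansion; here is a minimal sketch using scikit-learn (the data is made up):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0], [2.0], [3.0]])       # one original feature

# A degree-3 expansion turns x into (x, x^2, x^3), enlarging the hypothesis space.
poly = PolynomialFeatures(degree=3, include_bias=False)
X_expanded = poly.fit_transform(X)
print(X_expanded)   # [[1, 1, 1], [2, 4, 8], [3, 9, 27]]
```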

Overfitting : If our algorithm works well on the points in our data set but not on new points, then the algorithm is overfitting the data set. Overfitting can be checked easily by splitting the data set so that 90% of the data is in our training set and 10% is in a cross-validation set. Train on the training set, then measure the cost on the cross-validation set. If the cross-validation cost is much higher than the training cost, then chances are we have an overfitting problem. In other words, the hypothesis space is too large, and perhaps some features are faking the learning algorithm out.
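The 90/10 split check might look like this with scikit-learn (the model and data are placeholders, not a prescription):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
X_tr, X_cv, y_tr, y_cv = train_test_split(X, y, test_size=0.1, random_state=0)

model = LinearRegression().fit(X_tr, y_tr)
train_cost = mean_squared_error(y_tr, model.predict(X_tr)) / 2
cv_cost = mean_squared_error(y_cv, model.predict(X_cv)) / 2
# A cross-validation cost far above the training cost suggests overfitting.
print(train_cost, cv_cost)
```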

How to avoid overfitting :
To avoid overfitting, add regularization if there are many features. Regularization forces the magnitudes of the parameters to be smaller (shrinking the hypothesis space). For this, add a new term to the cost function:

J(θ) = (1/(2m)) [ Σ_{i=1..m} (h_θ(x^(i)) − y^(i))² + λ Σ_{j=1..n} θ_j² ]

The added term, λ Σ_{j} θ_j², penalizes the magnitudes of the parameters.
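As a sketch of the effect, scikit-learn's Ridge applies exactly this kind of squared-magnitude penalty; a larger alpha (playing the role of λ) shrinks the coefficients (data here is random stand-in):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))
y = X[:, 0] + 0.1 * rng.normal(size=50)

plain = LinearRegression().fit(X, y)
regularized = Ridge(alpha=10.0).fit(X, y)

# The penalty lambda * sum(theta_j^2) pulls the weights toward zero.
print(np.abs(plain.coef_).sum(), np.abs(regularized.coef_).sum())
```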
Sunday, May 26, 2013

Clustering Tips

Hierarchical Clustering Tips:

There are several approaches we might use to stop the hierarchical clustering process:
1) We could be told, or have a belief, about how many clusters there are in the data. For example, if we are told that the data about sports news is taken from Cricket, Hockey, and Football, then we know to stop when there are three clusters left.
2) We could stop combining when at some point the best combination of existing clusters produces a cluster that is inadequate. There can be various tests for the adequacy of a cluster, but as an example, we could insist that any cluster have an average distance between the centroid and its points no greater than some limit. This approach is only sensible if we have a reason to believe that no cluster extends over too much of the space.
3) We could continue clustering until there is only one cluster. However, it is meaningless to return a single cluster consisting of all the points. Rather, we return the tree representing the way in which all the points were combined. This form of answer makes good sense in some applications, such as one in which the points are genomes of different species, and the distance measure reflects the difference in the genome. Then, the tree represents the evolution of these species, that is, the likely order in which two species branched from a common ancestor.
4) Stop if the diameter of the cluster that results from the best merger exceeds a threshold (see the sketch after this list). We can also base this rule on the radius, or on any of the variants of the radius.
5) Stop if the density of the cluster that results from the best merger is below some threshold. The density can be defined in many different ways. Roughly, it should be the number of cluster points per unit volume of the cluster. That ratio can be estimated by the number of points divided by some power of the diameter or radius of the cluster. The correct power could be the number of dimensions of the space. Sometimes, 1 or 2 is chosen as the power, regardless of the number of dimensions.
6) Stop when there is evidence that the next pair of clusters to be combined yields a bad cluster. For example, we could track the average diameter of all the current clusters. As long as we are combining points that belong in a cluster, this average will rise gradually. However, if we combine two clusters that really don't deserve to be combined, then the average diameter will take a sudden jump.
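A minimal sketch of the threshold-based stopping rule (approach 4), using scipy's agglomerative clustering and cutting the merge tree at a chosen distance (the threshold value and data are made up):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 5.2]])

Z = linkage(points, method="complete")          # merge history (the tree)
# Stop merging once a cluster would exceed a distance of 1.0.
labels = fcluster(Z, t=1.0, criterion="distance")
print(labels)   # two clusters: points near the origin vs. points near (5, 5)
```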


Efficiency of Hierarchical Clustering :

The basic algorithm for hierarchical clustering is not very efficient. At each step, we must compute the distances between each pair of clusters, in order to find the best merger. The initial step takes O(n²) time, but subsequent steps take time proportional to (n−1)², (n−2)², ... The sum of squares up to n is O(n³), so this algorithm is cubic. Thus, it cannot be run except for fairly small numbers of points.

More efficient implementation steps are :

Step 1 : We start, as we must, by computing the distances between all pairs of points, and this step is O(n²).
Step 2 : Form the pairs and their distances into a priority queue, so we can always find the smallest distance in one step. This operation is also O(n²).
Step 3 : When we decide to merge two clusters C and D, we remove all entries in the priority queue involving one of these two clusters; that requires work O(n log n), since there are at most 2n deletions to be performed, and priority-queue deletion can be performed in O(log n) time.
Step 4 : We then compute all the distances between the new cluster and the remaining clusters. This work is also O(n log n), as there are at most n entries to be inserted into the priority queue, and insertion into a priority queue can also be done in O(log n) time.

Since the last two steps are executed at most n times, and the first two steps are executed only once, the overall running time of this algorithm is O(n² log n).
That is better than O(n³), but it still puts a strong limit on how large n can be before it becomes infeasible to use this clustering approach.


K-Means Clustering Tips:

Picking the Right Value of k :
We may not know the correct value of k to use in a k-means clustering. However, if we can measure the quality of the clustering for various values of k, we can usually guess what the right value of k is:

If we take a measure of appropriateness for clusters, such as average radius or diameter, that value will grow slowly as long as the number of clusters we assume remains at or above the true number of clusters. However, as soon as we try to form fewer clusters than there really are, the measure will rise precipitously. The idea is expressed by the figure below:
[Figure: average diameter (or another measure of diffuseness) versus the number of clusters. The measure rises quickly as soon as the number of clusters falls below the true number present in the data.]
If we have no idea what the correct value of k is, we can find a good value using a number of clustering operations that grows only logarithmically with the true number. Begin by running the k-means algorithm for k = 1, 2, 4, 8, ... Eventually, you will find two values v and 2v between which there is very little decrease in the average diameter, or whatever measure of cluster cohesion you are using. We may conclude that the value of k that is justified by the data lies between v/2 and v. If we use a binary search in that range, we can find the best value for k in another log₂ v clustering operations, for a total of 2 log₂ v clusterings. Since the true value of k is at least v/2, we have used a number of clusterings that is logarithmic in k.
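A sketch of this doubling-then-binary-search procedure, using the average cluster diameter as the cohesion measure; the tolerance for "very little decrease" is a made-up parameter, and enough data points for each candidate k are assumed:

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.cluster import KMeans

def avg_diameter(X, k):
    """Average over clusters of the max pairwise distance within each cluster."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    diams = []
    for c in range(k):
        members = X[labels == c]
        diams.append(pdist(members).max() if len(members) > 1 else 0.0)
    return float(np.mean(diams))

def pick_k(X, tol=0.05):
    # Double k until the measure barely improves between v and 2v ...
    v = 1
    while avg_diameter(X, v) - avg_diameter(X, 2 * v) > tol * avg_diameter(X, v):
        v *= 2
    # ... then binary-search (v/2, v] for the smallest adequate k.
    lo, hi = max(v // 2, 1), v
    while lo < hi:
        mid = (lo + hi) // 2
        if avg_diameter(X, mid) - avg_diameter(X, v) <= tol * avg_diameter(X, v):
            hi = mid
        else:
            lo = mid + 1
    return lo

# Usage, with any 2-D numpy array of points:  k = pick_k(X)
```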

Picking the Right Centroids :

Problem : If the centroids are chosen badly at initialization, we can end up with wrong clusters; for example, one cluster may turn out very large while the other clusters are very small. It all depends on the initially chosen centroids.

Solution : To avoid this, we should run the k-means clustering process with several different random initializations and calculate the cost function (distortion)

J = (1/m) Σ_{i=1..m} || x^(i) − μ_{c(i)} ||²

for each set of randomly initialized centroids, where μ_{c(i)} is the centroid of the cluster to which x^(i) is assigned. We should pick the random initialization that achieves the lowest cost.
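A sketch of random restarts with scikit-learn, keeping the run with the lowest cost (inertia_ is exactly the summed squared distance to the assigned centroids; the data is a random stand-in):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))

best = None
for seed in range(10):                       # 10 different random initializations
    km = KMeans(n_clusters=3, n_init=1, random_state=seed).fit(X)
    if best is None or km.inertia_ < best.inertia_:
        best = km                            # keep the lowest-cost run

print(best.inertia_)
# Equivalently, KMeans(n_clusters=3, n_init=10) does the restarts internally.
```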

Classification

Classification is the task of distinguishing between a fixed set of classes, such as "spam" vs. "non-spam", based on labeled training examples.
Many algorithms are used for classification problems. Some of them are :
  a) Naive Bayes
  b) Balanced Winnow
  c) Winnow
  d) Maximum Entropy
  e) Decision Tree
  f) Support Vector M/C(SVM) 

Balanced Winnow Algorithm :


Balanced Winnow keeps two weights for each feature, w⁺ and w⁻.

Overall weight of the feature : (w⁺ − w⁻).

For a given document with feature vector x = (s_p1, s_p2, ..., s_pd), the algorithm deems the document relevant iff:

Σ_{j=1..d} s_pj · (w⁺_pj − w⁻_pj) > τ

where:

w⁺_pj, w⁻_pj : weights corresponding to active feature pj
s_pj : strength of feature pj
d : number of active features
τ : threshold

In the Balanced Winnow algorithm, the weights w⁺ and w⁻ are initialized to 2τ/m and τ/m respectively, where m is the number of active features in a document in the collection.
The weights of active features are updated only when a mistake is made, and they are updated with a promotion step or a demotion step.

Positive Example : Following a mistake on a positive example (the algorithm predicts 0 and the label is 1), the positive part of the weight is promoted and the negative part is demoted:

w⁺_pj ← α · w⁺_pj        w⁻_pj ← β · w⁻_pj

Negative Example : Following a mistake on a negative example (the algorithm predicts 1 and the label is 0), the positive part of the weight is demoted and the negative part is promoted:

w⁺_pj ← β · w⁺_pj        w⁻_pj ← α · w⁻_pj

α : Promotion parameter, with α > 1.
β : Demotion parameter, with 0 < β < 1.

There are three ways of adjusting the value of s(i, d) (the strength of feature i in document d) according to the feature's frequency in the document:
1) If the feature is present (active) in the document then s(i, d) = 1, otherwise s(i, d) = 0.
2) The strength of the feature depends on its number of occurrences n(i, d) in the document: s(i, d) = n(i, d).
3) Reduce the strength by normalizing the value of n(i, d): s(i, d) = sqrt(n(i, d)).

Example :
Let the document d contain the terms {information, retrieval, retrieval, retrieval, retrieval, information, retrieval, retrieval}.
Let the weight vector of the two features (information, retrieval) be W_d = (0.5, 0.1) and the threshold τ = 1.
The score under each strength function above is calculated as :
1) X1 = (1, 1) (boolean: feature present or not), so the score is ∑ = 0.5*1 + 0.1*1 = 0.6.
2) X2 = (2, 6) (number of times each feature occurs), so the score is ∑ = 0.5*2 + 0.1*6 = 1.6.
3) X3 = (1.41, 2.45) (square root of the number of occurrences), so the score is ∑ = 0.5*1.41 + 0.1*2.45 = 0.95.

In this example, the Balanced Winnow algorithm predicts 0 when using the first and third strength functions (score below τ) and 1 when using the second strength function (score above τ).


A computational advantage of Balanced Winnow appears in sparse, high-dimensional spaces, because only the weights of active features are updated.
This also makes the algorithm more sensitive to the relationships among the features; such relationships may go unnoticed by an algorithm that is based only on counts of features.
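A compact sketch of Balanced Winnow as described above; the parameter values (τ, α, β, m) are illustrative choices, not prescribed ones:

```python
import math

class BalancedWinnow:
    def __init__(self, n_features, tau=1.0, alpha=1.5, beta=0.5, m=10):
        self.tau, self.alpha, self.beta = tau, alpha, beta
        self.w_pos = [2 * tau / m] * n_features   # w+ initialized to 2*tau/m
        self.w_neg = [tau / m] * n_features       # w- initialized to tau/m

    def predict(self, s):
        # Relevant iff sum_j s_j * (w+_j - w-_j) > tau
        score = sum(s[j] * (self.w_pos[j] - self.w_neg[j]) for j in range(len(s)))
        return 1 if score > self.tau else 0

    def update(self, s, label):
        if self.predict(s) == label:
            return                                # update only on mistakes
        for j, sj in enumerate(s):
            if sj == 0:
                continue                          # only active features change
            if label == 1:                        # promotion step
                self.w_pos[j] *= self.alpha
                self.w_neg[j] *= self.beta
            else:                                 # demotion step
                self.w_pos[j] *= self.beta
                self.w_neg[j] *= self.alpha

# Strength option 3, s(i, d) = sqrt(n(i, d)), on the example document:
doc_counts = [2, 6]                               # (information, retrieval)
s = [math.sqrt(c) for c in doc_counts]
clf = BalancedWinnow(n_features=2)
clf.update(s, label=1)
```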


Positive Winnow : Positive Winnow keeps only one weight w_pj per feature pj, instead of the two weights w⁺ and w⁻ kept by Balanced Winnow.
For a given document with feature vector x = (s_p1, s_p2, ..., s_pd), the algorithm deems the document relevant iff:

Σ_{j=1..d} s_pj · w_pj > τ
When a mistake is made, the weight vector is updated. If the algorithm predicts 0 and the label is 1 (positive example), then the weights of all the active features are promoted: the weight w_pj is multiplied by α.
Otherwise, if the algorithm predicts 1 and the label is 0 (negative example), then the weights of all the active features are demoted: the weight w_pj is multiplied by β.
No matter which mistake is made, the weights of the inactive features remain unchanged.

Friday, May 24, 2013

Clustering


Clustering is an unsupervised learning process. It is the process of examining a collection of "points" and grouping the points into "clusters" according to some distance measure. The goal is that points in the same cluster have a small distance from one another, while points in different clusters are at a large distance from one another.

Applications :
Insurance: Identifying groups of motor insurance policy holders with a high average claim cost; identifying frauds;
WWW: Document classification; clustering weblog data to discover groups of similar access patterns.
Market Segmentation.


Clustering algorithms may be classified as listed below:-

    Hierarchical Clustering : Hierarchical Clustering
    Exclusive Clustering  : K-means
    Overlapping Clustering : Fuzzy C-means
    Probabilistic Clustering : Mixture of Gaussian

Distance Measure in Clustering :
The distance measure is an important component of clustering. Often, if the components of the data instance vectors are all in the same physical units, the simple Euclidean distance metric may be sufficient to successfully group similar data instances. But sometimes Euclidean distance can be misleading even when all components are in the same physical units, as the example below shows.
[Figure: the same set of points clustered under two different axis scalings, producing different clusterings.]

As the figure shows, different scaling can lead to different clusterings.
For higher-dimensional data, a popular measure is the Minkowski metric:

d_p(x_i, x_j) = ( Σ_{k=1..d} | x_{i,k} − x_{j,k} |^p )^(1/p)

where d is the dimensionality of the data. The Euclidean distance is the special case p = 2, while the Manhattan metric has p = 1.



Steps of Hierarchical Clustering :

Step 1 : Start by assigning each item to its own cluster, so that if we have N items, we now have N clusters, each containing just one item.
Let the distances (similarities) between the clusters be the same as the distances (similarities) between the items they contain.

Step 2 : Find the closest (most similar) pair of clusters and merge them into a single cluster, so that we now have one cluster fewer.

Step 3 : Compute the distances (similarities) between the new cluster and each of the old clusters.

Step 4 : Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.

Step 3 can be done in three different ways:

a) Single-Linkage : The distance between one cluster and another is taken to be the shortest distance from any member of one cluster to any member of the other cluster.

b) Complete-Linkage : The distance between one cluster and another is taken to be the greatest distance from any member of one cluster to any member of the other cluster.

c) Average-Linkage : The distance between one cluster and another is taken to be the average distance from any member of one cluster to any member of the other cluster. The sketch after this list shows all three in code.
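A minimal sketch comparing the three linkage choices with scipy (the points are made up):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

points = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])

for method in ("single", "complete", "average"):
    Z = linkage(points, method=method)   # merge tree under this linkage rule
    print(method, Z[-1, 2])              # distance at which the last merge happens
```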

Let's take an example of hierarchical clustering using single-linkage:

Step 1 is done by assigning each item (a city) to its own cluster.

[Distance matrix of pairwise distances between the cities]

Step 2 is done by finding the nearest pair of cities in the matrix above: MI and TO, at distance 138.
These are merged into a single cluster called "MI/TO". The level of the new cluster is L(MI/TO) = 138.
The distance matrix after that is :
[Updated distance matrix after merging MI and TO]
Steps 2 and 3 are repeated until everything merges into a single cluster:

[Sequence of intermediate merges and the shrinking distance matrices]
Finally, the last two clusters merge at level 295.
The process is summarized by the following hierarchical tree:
[Dendrogram summarizing the order and levels of the merges]
K-Means Clustering : K-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem.
At the beginning of k-means clustering we determine the number of clusters K and assume the centroids, or centers, of these clusters. We can take any random objects as the initial centroids, or the first K objects in sequence can also serve as the initial centroids.
The steps, iterated until stable (no object moves between groups), are :
Step 1 : Determine the centroid coordinates.
Step 2 : Determine the distance of each object to the centroids.
Step 3 : Group the objects based on minimum distance.
[Figure: flowchart of the k-means iteration]
Let's go through a numeric example of k-means:

Suppose we have four objects, and each object has two attributes/features, as shown below:

Object   Weight   Density
A        1        1
B        2        1
C        4        3
D        5        4

Our goal is to group the objects into K = 2 groups based on the two features (Weight, Density).

Step 1 : Initial value of centroids : Suppose we take Object A and Object B as the first centroids. Let c1 and c2 denote the coordinates of the centroids; then c1 = (1, 1) and c2 = (2, 1).

Step 2 : Objects-centroid distances : calculate the distance between each cluster centroid and each object.
Let us use Euclidean distance for this calculation. The distance matrix at iteration 0 is:

         A      B      C      D
c1:      0      1      3.61   5
c2:      1      0      2.83   4.24

Each column in the distance matrix corresponds to an object. The first row of the distance matrix holds each object's distance to the first centroid, and the second row its distance to the second centroid.

Step 3 : In the object clustering step, we assign each object based on minimum distance. Thus Object A is assigned to group 1, and Objects B, C, and D to group 2.
In the group matrix below, an element is 1 if and only if the object is assigned to that group:

          A  B  C  D
group 1:  1  0  0  0
group 2:  0  1  1  1
Step 4 : In iteration 1, determine the centroid of each group:
Centroid of group 1 : c1 = (1, 1), because group 1 has only one member.
Centroid of group 2 : c2 = ((2+4+5)/3, (1+3+4)/3) = (11/3, 8/3), because group 2 has three members.


Step 5 : Compute the distance matrix with the new centroids, as in Step 2:

         A      B      C      D
c1:      0      1      3.61   5
c2:      3.14   2.36   0.47   1.89

Step 6 : Compute the group matrix from the distance matrix of Step 5; each object goes to its nearest centroid:

          A  B  C  D
group 1:  1  1  0  0
group 2:  0  0  1  1

Step 7 : Repeat Step 4 to calculate the new centroids:
c1 = ((1+2)/2, (1+1)/2) = (1.5, 1)
c2 = ((4+5)/2, (3+4)/2) = (4.5, 3.5)

Step 8 : In iteration 2, calculate the distance matrix with the new centroids of Step 7:

         A      B      C      D
c1:      0.5    0.5    3.20   4.61
c2:      4.30   3.54   0.71   0.71

Step 9 : The group matrix from Step 8 is:

          A  B  C  D
group 1:  1  1  0  0
group 2:  0  0  1  1

We obtain G2 = G1: comparing the grouping of this iteration with the last reveals that no object moves between groups anymore. Thus the k-means computation has reached its stable state, and no more iterations are needed.

The final grouping results are :

Object   Weight   Density   Group
A        1        1         1
B        2        1         1
C        4        3         2
D        5        4         2
Tips for clustering are covered in this post: http://datachurn.blogspot.in/2013/05/clustering-tips.html
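The whole procedure above can be sketched in a few lines of numpy; this reproduces the worked example, starting from centroids A and B as in Step 1 (it assumes no group ever empties):

```python
import numpy as np

X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)  # objects A-D
centroids = X[:2].copy()                     # start from A and B

while True:
    # Distance of every object to every centroid, then nearest-centroid groups.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    groups = dists.argmin(axis=1)
    new_centroids = np.array([X[groups == g].mean(axis=0) for g in range(2)])
    if np.allclose(new_centroids, centroids):
        break                                # stable: no centroid moved
    centroids = new_centroids

print(groups)        # [0 0 1 1] -> A, B in group 1; C, D in group 2
```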

Dimensionality Reduction in M/C Learning

Dimensionality reduction in machine learning and statistics is the process of reducing the number of random variables under consideration.
It is divided into 2 categories :
1) Feature Selection
2) Feature Extraction

Feature Selection : Reduce dimensionality by selecting a subset of the original variables. It finds the interesting features within the original feature set.

Why Use Feature Selection? :


a) Simplifying or speeding up computations with only little loss in classification quality.
b) Reduce dimensionality of feature space and improve the efficiency, performance gain, and precision of the classifier.
c) Improve classification effectiveness, computational efficiency, and accuracy.
d) Remove non-informative and noisy features and reduce the feature space to a manageable size.
e) Keep computational requirements and dataset size small, especially for those text categorization algorithms that do not scale with the feature set size.


Some of Feature Selection Methods are :
1) TF and TF-IDF :
TF :  In the case of the term frequency tf(t,d), the simplest choice is to use the raw frequency of a term in a document, i.e. the number of times that term t occurs in document d. If we denote the raw frequency of t by f(t,d), then the simple tf scheme is tf(t,d) = f(t,d).

IDF : The inverse document frequency is a measure of whether the term is common or rare across all documents. It is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient.

idf(t, D) = log( |D| / |{d ∈ D : t ∈ d}| )


TF-IDF : tf–idf is the product of two statistics, term frequency and inverse document frequency.  tfidf(t,d,D) = tf(t,d) * idf(t,D)
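A quick sketch of these definitions in plain Python (the toy documents are made up):

```python
import math

docs = [
    "information retrieval retrieval".split(),
    "machine learning".split(),
    "information theory".split(),
]

def tf(t, d):
    return d.count(t)                       # raw frequency f(t, d)

def idf(t, D):
    containing = sum(1 for d in D if t in d)
    return math.log(len(D) / containing)    # log(|D| / |{d in D : t in d}|)

def tfidf(t, d, D):
    return tf(t, d) * idf(t, D)

print(tfidf("retrieval", docs[0], docs))    # high: frequent here, rare elsewhere
```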

2) Mutual Information : The mutual information method assumes that a term with a higher category ratio is more effective for classification:

MI(t, c) = log( (A × N) / ((A + C)(A + B)) )

A : Number of documents that contain term t and belong to category c.
B : Number of documents that contain term t but don't belong to category c.
C : Number of documents that don't contain term t but belong to category c.
D : Number of documents that don't contain term t and don't belong to category c.

3) Chi Square : Chi square measures the lack of independence between a term t and a category c:

χ² = N(AD − CB)² / ((A + C)(B + D)(A + B)(C + D))

with A, B, C, D as defined above, and N the total number of documents.
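Both scores are easy to compute from the four counts; a sketch with hypothetical counts for one (term, category) pair:

```python
import math

def mutual_information(A, B, C, N):
    return math.log((A * N) / ((A + C) * (A + B)))

def chi_square(A, B, C, D):
    N = A + B + C + D
    return N * (A * D - C * B) ** 2 / ((A + C) * (B + D) * (A + B) * (C + D))

A, B, C, D = 49, 27, 141, 774                 # made-up document counts
print(mutual_information(A, B, C, A + B + C + D), chi_square(A, B, C, D))
```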
From the chi-square value we read off p (the significance value) from a table such as the one below:

[Table: χ² critical values and their corresponding significance levels p]

Feature Extraction : It reduces dimensionality by a (linear or non-linear) projection of a D-dimensional vector onto a d-dimensional vector (d < D).
The main linear technique for dimensionality reduction, principal component analysis, performs a linear mapping of the data to a lower-dimensional space in such a way that the variance of the data in the low-dimensional representation is maximized.
Principal Component Analysis (PCA) Algorithm :

1) Create the covariance matrix of the features. The covariance matrix (also known as dispersion matrix or variance-covariance matrix) is a matrix whose element in the (i, j) position is the covariance between the i-th and j-th elements of a random vector (that is, of a vector of random variables). For m examples x^(1), ..., x^(m) it can be computed as:

∑ = (1/m) Σ_{i=1..m} x^(i) (x^(i))^T

It measures the relations between features.

2) Compute the "eigenvectors" of the covariance matrix ∑ via singular value decomposition:

[u, s, v] = svd(∑)

SVD : Singular Value Decomposition

SVD returns the matrix U, whose columns are the eigenvectors:

U = [ u^(1)  u^(2)  ...  u^(n) ]   (an n × n matrix)
For calculating the k-dimensional z, we first take the first k columns of U:

U_reduce = U(:, 1:k) = [ u^(1)  ...  u^(k) ]   (an n × k matrix)
Calculate z from the reduced U as:

z = U_reduce' * x   (a k-dimensional vector)
Component-wise, it can be written as:

z_j = (u^(j))' * x
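The whole pipeline is a few lines of numpy; a sketch with random stand-in data, mirroring the ∑ → svd → U_reduce → z steps above (centering the features first is an added, standard preprocessing step):

```python
import numpy as np

m, n, k = 100, 5, 2
X = np.random.default_rng(0).normal(size=(m, n))   # m examples, n features
X = X - X.mean(axis=0)                             # center the features first

Sigma = (X.T @ X) / m                              # covariance matrix (n x n)
U, s, Vt = np.linalg.svd(Sigma)                    # columns of U = eigenvectors

U_reduce = U[:, :k]                                # first k columns of U
Z = X @ U_reduce                                   # z = U_reduce' * x, per example
print(Z.shape)                                     # (100, 2)
```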
Classifier/Model Analysis

After generating the classifier/model, we analyze it. Two terms, precision and recall, describe the strength of a classifier/model.
Precision (positive predictive value) : The fraction of retrieved instances that are relevant. For example, suppose a search engine returns 10 pages to a user for a query, out of a collection of 20 pages; only 6 of the 10 retrieved pages actually match the query, while 9 pages in the whole collection match it.

Precision = (number of retrieved documents that exactly match the query) / (total number of retrieved documents) = 6/10
Precision = (True Positive) / (True Positive + False Positive) = 6 / (6 + 4)

Precision can be seen as a measure of exactness or quality.

Recall  (Sensitivity) : Fraction of relevant instances retrieved.

Recall = (number of retrieved documents that exactly match the query) / (total number of documents that exactly match the query) = 6/9

Recall = (True Positive) / (True Positive + False Negative) = 6 / (6 + 3)

Recall is a measure of completeness or quantity.

In statistics, if the null hypothesis is that all and only the relevant items are retrieved, absence of type I and type II errors corresponds respectively to maximum precision (no false positives) and maximum recall (no false negatives). 

Type I error : False Positive : 10 − 6 = 4 in the above example.
Type II error : False Negative : 9 − 6 = 3 in the above example.

Often, there is an inverse relationship between precision and recall, where it is possible to increase one at the cost of reducing the other.

Usually, precision and recall scores are not measured in isolation. Instead, either values for one measure are compared at a fixed level of the other, or both are combined into a single measure such as the F-measure (balanced F-score), the harmonic mean of precision and recall: F = 2PR / (P + R).
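The worked example above in a few lines of Python:

```python
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
    return precision, recall, f1

# 6 relevant retrieved, 4 irrelevant retrieved, 3 relevant missed:
print(precision_recall_f1(tp=6, fp=4, fn=3))   # (0.6, 0.666..., 0.631...)
```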

There is another analysis, decile analysis, for checking the efficiency of a model.
A decile analysis is created to test the model's ability to predict the intended outcome. Each column of the decile analysis chart (x-axis) represents a collection of records that have been scored using the model. The height of each column (y-axis) represents the average of those records' actual behavior.

Steps to calculate Decile Analysis are :

Step 1 : The records are sorted by their predicted scores in descending order and divided into ten equal-sized bins or deciles. The top decile contains the 10% of the population most likely to respond and the bottom decile contains the 10% of the population least likely to respond, based on the model scores.

Step 2 : The deciles and their actual response rates are graphed on the x and y axes, respectively.

When we're looking at a decile analysis, we want to see a staircase effect; that is, we want the bars to descend in order from left to right, as shown below:

[Figure: decile analysis chart with response-rate bars descending from the top decile to the bottom decile]

In contrast, if the bars are out of order or flat, the decile analysis is telling us that the model is not doing a very good job of predicting actual responses.
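A sketch of Steps 1 and 2 with pandas (the scores and outcomes are random stand-ins):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
scores = rng.random(1000)                            # model scores, stand-ins
responded = (rng.random(1000) < scores).astype(int)  # actual behavior
df = pd.DataFrame({"score": scores, "responded": responded})

# Step 1: cut the score-sorted records into ten equal-sized deciles (0 = lowest).
df["decile"] = pd.qcut(df["score"], 10, labels=False)

# Step 2: actual response rate per decile, reversed so the top decile comes first.
rates = df.groupby("decile")["responded"].mean()[::-1]
print(rates)   # a good model shows a staircase descending left to right
```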