Tuesday, June 11, 2013

Data Preparation and Predictive Modeling Mistakes in M/C Learning and How to Avoid Them


1) Mistake (Including ID Fields as Predictors) :
Because most IDs look like continuous integers (and older IDs are typically smaller), they may make their way into the model as predictive variables. Be sure to exclude them as early in the process as possible to avoid any confusion while building the model.
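For example, with pandas the ID columns can be dropped before any modeling step (the file name and column names below are hypothetical):

import pandas as pd

df = pd.read_csv("modeling_file.csv")          # hypothetical modeling file

id_columns = ["person_id", "record_id"]        # identifiers only, not predictors
X = df.drop(columns=id_columns + ["outcome"])  # keep only the predictor variables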

2) Mistake(Using Anachronistic Variables) :
Make sure that no predictor variables contain information about the outcome. Because models are built using historical data, some of the variables accessible when you build the model may not have been available at the point in time the model is meant to reflect.
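One way to guard against this is to record when each variable first becomes available and drop anything recorded after the point in time the model should reflect. A minimal sketch, where the column names, dates, and the availability dictionary are all hypothetical:

import pandas as pd

df = pd.read_csv("modeling_file.csv")          # hypothetical modeling file

# When is each predictor first known?
availability = {
    "gpa_term1": "2012-12-15",
    "exit_survey_score": "2013-06-01",         # only collected after someone leaves -> anachronistic
}

as_of = pd.Timestamp("2013-01-01")             # the point in time the model is meant to reflect
usable = [col for col, date in availability.items() if pd.Timestamp(date) <= as_of]
X = df[usable]                                 # anachronistic variables are excluded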

3) Mistake(Selecting the wrong Y-variable) :
When building your dataset for a logistic regression model, you’ll want to select the response with the smaller number of data points as your y-variable. A great example from the higher ed world comes from building a retention model. In most cases, you’ll actually want to model attrition, identifying those employees who are likely to leave (hopefully the smaller group) rather than those who are likely to stay.
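A minimal sketch with scikit-learn, continuing the hypothetical df and X from the first sketch, where an "outcome" column records who stayed and who left:

from sklearn.linear_model import LogisticRegression

y = (df["outcome"] == "left").astype(int)   # 1 = attrition, the smaller group we want to model
print(y.value_counts())                     # confirm which outcome is actually the minority

model = LogisticRegression(max_iter=1000)
model.fit(X, y)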

4) Mistake(Allowing Duplicate Records) :
Don’t include duplicates in a model file. Including just two records per person gives that person twice as much predictive power. To make sure that each person’s influence counts equally, only one record per person or action being modeled should be included.
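With pandas, continuing the hypothetical df from above, deduplication is one line:

before = len(df)
df = df.drop_duplicates(subset="person_id", keep="first")   # one record per person
print(f"Removed {before - len(df)} duplicate records")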

5) Mistake(Modeling on Too Small of a Population) :
Double-check your population size. A good goal to shoot for in a modeling dataset is at least 1,000 records spanning three years. Including at least three years helps to account for any year-to-year fluctuations in the dataset. The larger the population, the more robust the model will be.
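A quick sanity check on both counts, assuming a hypothetical "year" column in df:

print("Records:", len(df))                      # aim for at least 1,000
print("Years covered:", df["year"].nunique())   # aim for at least 3
print(df["year"].value_counts().sort_index())   # look for year-to-year fluctuations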

6) Mistake(Judging the quality of a model using one measure) :
It’s a mistake to judge the quality of a model with a single measure. Use several complementary measures (for example accuracy, ROC AUC, and precision/recall) and estimate them with cross-validation rather than a single train/test split.
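For example, scikit-learn can score the same model on several measures at once, each estimated with 5-fold cross-validation (X and y as in the sketches above):

from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression

scores = cross_validate(
    LogisticRegression(max_iter=1000), X, y,
    cv=5,
    scoring=["accuracy", "roc_auc", "f1"],
)
for metric in ["test_accuracy", "test_roc_auc", "test_f1"]:
    print(metric, scores[metric].mean())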

7) Mistake(Not Accounting for Outliers and/or Missing Values) :
Be sure to account for any outliers and/or missing values. Large gaps in individual variables can add up when those variables are combined to build a predictive model. Checking the minimum and maximum values for each variable is a quick way to spot records that fall outside the usual realm.
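A quick pandas pass over the hypothetical df covers both checks; the median fill at the end is just one simple option, not a universal fix:

print(df.describe().loc[["min", "max"]])   # scan minimums and maximums for outliers
print(df.isna().sum())                     # count missing values per variable

numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())   # simple median fill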

8) Mistake(Fail to consider enough variables) :
When deciding which variables to audition for a model, include anything you have on hand that you think could possibly be predictive; failing to consider enough variables is the real mistake. Separating out the extra variables is something your modeling program will do, so don’t be afraid to throw the kitchen sink at it for your first pass, as in the sketch below.
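One way to let the software separate out the extra variables is an automatic selector such as scikit-learn's SelectKBest (k=10 below is an arbitrary, hypothetical choice; X and y as above):

from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(score_func=f_classif, k=10)        # keep the 10 strongest variables
X_selected = selector.fit_transform(X, y)
print("Kept columns:", list(X.columns[selector.get_support()]))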

Monday, June 10, 2013

Underfitting/Overfitting Problem in M/C learning


Underfitting : If our algorithm works badly even on the points in our data set, then the algorithm is underfitting the data set. This can be checked easily through the cost function. The cost function in linear regression is half the mean squared error, e.g. if the mean squared error is C, the cost is 0.5C. If in an experiment the cost ends up high even after many iterations, then chances are we have an underfitting problem, and the learning algorithm is not a good fit for the problem. Underfitting is also known as high bias (a strong bias towards its own hypothesis). In other words, the hypothesis space the learning algorithm explores is too small to properly represent the data.
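Written out for m training examples, this is the usual linear regression cost:

J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2

which is exactly half the mean squared error of the hypothesis h_\theta.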

How to avoid underfitting :
More data will not generally help; in fact, it will likely increase the training error. Instead, we should add more features, because that expands the hypothesis space. This includes making new features from existing features, as shown below. In the same way, adding more parameters may also expand the hypothesis space.
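For example, scikit-learn's PolynomialFeatures builds new features (squares and pairwise products) from the existing ones, expanding the hypothesis space; X here is a hypothetical numeric feature matrix:

from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False)
X_expanded = poly.fit_transform(X)              # original features plus squares and products
print(X.shape, "->", X_expanded.shape)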

Overfitting : If our algorithm works well with points in our data set, but not on new points, then the algorithm is overfitting the data set. Overfitting can be checked easily by splitting the data set so that 90% of the data is in our training set and 10% is in a cross-validation set. Train on the training set, then measure the cost on the cross-validation set. If the cross-validation cost is much higher than the training cost, then chances are we have an overfitting problem. In other words, the hypothesis space is too large, and perhaps some features are fooling the learning algorithm.
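A minimal sketch of that check, assuming a hypothetical numeric feature matrix X and target y for a regression problem:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# 90% training / 10% cross-validation split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1, random_state=0)

model = LinearRegression().fit(X_train, y_train)
train_cost = 0.5 * mean_squared_error(y_train, model.predict(X_train))  # half the MSE, as above
val_cost = 0.5 * mean_squared_error(y_val, model.predict(X_val))
print(train_cost, val_cost)   # validation cost much higher than training cost -> overfitting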

How to avoid overfitting :
To avoid overfitting, add regularization if there are many features. Regularization forces the magnitudes of the parameters to be smaller (shrinking the hypothesis space). For this, add a new term to the cost function which penalizes the magnitudes of the parameters, giving a regularized cost such as

J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]

where the new term \lambda\sum_{j=1}^{n}\theta_j^2 penalizes large parameters and \lambda controls the strength of the penalty.
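In practice, scikit-learn's Ridge regression applies this kind of squared-magnitude penalty for linear regression; its alpha parameter plays the role of \lambda (scikit-learn scales the terms slightly differently, so this is a sketch rather than the exact formula above):

from sklearn.linear_model import Ridge

model = Ridge(alpha=1.0)        # larger alpha = stronger penalty on parameter magnitudes
model.fit(X_train, y_train)     # X_train, y_train from the split in the previous sketch
print(model.coef_)              # coefficients shrink toward zero compared to plain linear regression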