Tuesday, June 11, 2013

Data Preparation and Predictive Modeling Mistakes in M/C Learning and How to Avoid These Mistakes?


1) Mistake (Including ID Fields as Predictors) :
Because most IDs look like continuous integers (and older IDs are typically smaller), it is possible that they may make their way into the model as a predictive variables. Be sure to exclude them as early on in the process as possible to avoid any confusion while building the model.

2) Mistake(Using Anachronistic Variables) :
Make sure that no predictor variables contain information about the outcome. Because models are built using historical data, it is possible that some of the variables you have accessible when building your model were not available at the time the model is built to reflect.

3) Mistake(Selecting the wrong Y-variable) :
When building your dataset for a logistic regression model, we’ll want to select the response with the smaller number of data points as your y-variable. A great example of this from the higher ed world would come from building a retention model. In most cases, we’ll actually want to model attrition, identifying those employee who are likely to leave (hopefully the smaller group) rather than those who are likely to stay.

4) Mistake(Allowing Duplicate Records) :
Don’t include duplicates in a model file. Including just two records per person gives that person twice as much predictive power. To make sure that each person’s influence counts equally, only one record per person or action being modeled should be included.

5) Mistake(Modeling on Too Small of a Population) :
Double-check your population size. A good goal to shoot for in a modeling dataset is at least 1,000 records spanning three years. Including at least three years helps to account for any year-to-year fluctuations in our dataset. The larger our population size is, the most robust our model will be.

6) Mistake(Judging the quality of a model using one measure) :
It’s wrong to capture the quality of a model with a single Measure. We should use Cross Validation etc methods.

7) Mistake(Not Accounting for Outliers and/or Missing Values) :
Be sure to account for any outliers and/or missing values. Large rifts in individual variables can add up when we are combining those variables to build a predictive model. Checking the minimum and maximum values for each variable can be a quick way to spot any records that are out of the usual realm.

8) Mistake(Fail to consider enough variables) :
When deciding which variables to audition for a model, we want to include anything we have on-hand that we think could possibly be predictive. Thats wrong way. We should separate out the extra variables is something that we modeling program will do, so don’t be afraid to throw the kitchen sink at it for your first pass.

No comments:

Post a Comment