Wednesday, October 9, 2013

Ensemble Learning

Ensemble learning builds a prediction model by combining the strengths of a collection of simpler base models.
Bagging, Boosting and Random Forests all fall into the category of ensemble learning.
Ensemble learning can be broken down into two tasks (a minimal sketch follows this list):
1) Developing a population of base learners from the training data.
2) Combining them to form the composite predictor.
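
As a rough illustration of these two steps, here is a minimal sketch in Python (using scikit-learn and a synthetic dataset; the tree depth, ensemble size and data are arbitrary choices) that fits a population of decision-tree base learners on bootstrap samples and then averages them, which is essentially bagging:

# Minimal bagging-style ensemble: train base learners on bootstrap samples,
# then average their predictions. Dataset and settings are illustrative only.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
rng = np.random.RandomState(0)

# 1) Develop a population of base learners from the training data.
base_learners = []
for _ in range(50):
    idx = rng.randint(0, len(X), size=len(X))                    # bootstrap sample
    tree = DecisionTreeRegressor(max_depth=3).fit(X[idx], y[idx])
    base_learners.append(tree)

# 2) Combine them to form the composite predictor (simple average).
def ensemble_predict(X_new):
    return np.mean([t.predict(X_new) for t in base_learners], axis=0)

print(ensemble_predict(X[:5]))

A Random Forest follows the same recipe but additionally randomizes the features considered at each tree split.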

Boosting and Regularization Paths:

Boosting goes a step further: it builds an ensemble model by conducting a regularized and supervised search in a high-dimensional space of weak learners.
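
One way to picture that search in code: at each step a weak learner (here a shallow regression tree) is fit to the current residuals and a shrunken copy of it is added to the model. This is only a sketch of least-squares gradient boosting with arbitrary settings, not a full treatment:

# Least-squares boosting sketch: repeatedly fit a shallow tree to the
# residuals and add a shrunken copy to the ensemble. Settings are illustrative.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

learning_rate = 0.1          # the shrinkage that implicitly regularizes the search
prediction = np.zeros_like(y, dtype=float)
trees = []

for _ in range(100):
    residual = y - prediction                                    # current residuals
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)   # weak learner
    prediction += learning_rate * tree.predict(X)                # shrunken update
    trees.append(tree)

print("training MSE:", np.mean((y - prediction) ** 2))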

Before going forward, let's start with some terminology:

Least Squares:
Least squares is a standard approach to the approximate solution of overdetermined systems, i.e., sets of equations in which there are more equations than unknowns. "Least squares" means that the overall solution minimizes the sum of the squares of the errors made in the results of every single equation.
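
For a concrete toy example, an overdetermined system of five equations in two unknowns can be solved in the least-squares sense with NumPy:

# Overdetermined system: 5 equations, 2 unknowns. np.linalg.lstsq returns the
# coefficients minimizing the sum of squared residuals ||A x - b||^2.
import numpy as np

A = np.array([[1., 1.], [1., 2.], [1., 3.], [1., 4.], [1., 5.]])
b = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

x, residuals, rank, sv = np.linalg.lstsq(A, b, rcond=None)
print("least-squares solution:", x)
print("sum of squared errors:", residuals)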

Regularized versions:
Regularization, in the field of machine learning, refers to the process of introducing additional information in order to prevent overfitting.
           
Ridge regression:
Ridge regression adds the constraint that ||β||₂, the L2-norm of the parameter vector, is not greater than a given value. Equivalently, it may solve an unconstrained minimization of the least-squares penalty with α||β||₂² added, where α is a constant.
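
A minimal sketch of that penalized form, using the closed-form solution β = (XᵀX + αI)⁻¹Xᵀy on toy data (the value of α here is an arbitrary choice):

# Ridge regression via its closed form: beta = (X^T X + alpha * I)^-1 X^T y.
# alpha and the data are arbitrary, for illustration only.
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
beta_true = np.array([2.0, 0.0, -1.5, 0.0, 0.5])
y = X @ beta_true + 0.1 * rng.randn(100)

alpha = 1.0
beta_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)
print(beta_ridge)   # all coefficients shrunk toward zero, none exactly zero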

Lasso method:
An alternative regularized version of least squares is the Lasso (least absolute shrinkage and selection operator), which uses the constraint that ||β||₁, the L1-norm of the parameter vector, is no greater than a given value. (As above, this is equivalent to an unconstrained minimization of the least-squares penalty with α||β||₁ added.)


One of the prime differences between the Lasso and ridge regression is that in ridge regression, as the penalty is increased, all parameters are reduced while still remaining non-zero, whereas in the Lasso, increasing the penalty drives more and more of the parameters to exactly zero.
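
This difference is easy to see empirically. The sketch below fits scikit-learn's Ridge and Lasso to synthetic data at a few arbitrary penalty values and counts how many coefficients are exactly zero:

# Compare how coefficients behave as the penalty grows: ridge shrinks them all
# toward zero but keeps them nonzero; lasso drives more of them exactly to zero.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.RandomState(0)
X = rng.randn(100, 8)
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 1.5, 0.0, 0.0, 0.5])
y = X @ beta_true + 0.1 * rng.randn(100)

for alpha in [0.01, 0.1, 1.0, 10.0]:
    ridge = Ridge(alpha=alpha).fit(X, y)
    lasso = Lasso(alpha=alpha).fit(X, y)
    print(f"alpha={alpha:5}  ridge zeros: {np.sum(ridge.coef_ == 0)}  "
          f"lasso zeros: {np.sum(lasso.coef_ == 0)}")

As the penalty grows, the ridge coefficients keep shrinking but stay nonzero, while the Lasso zeroes out more and more of them.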


Bet on Sparsity Principle:

Boosting’s forward stagewise strategy with shrinkage approximately minimizes the same loss function with a Lasso-style L1 penalty.
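
A minimal sketch of that forward stagewise idea for a linear model (incremental forward stagewise with a small step size eps; the data and eps are illustrative): at each step only the predictor most correlated with the current residual gets a tiny, shrunken coefficient update, and many such steps trace out a lasso-like coefficient path.

# Incremental forward stagewise regression: repeatedly find the predictor most
# correlated with the residual and nudge its coefficient by a small step eps.
# With many small steps this traces a path close in spirit to the lasso path.
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(200, 6)
X = (X - X.mean(0)) / X.std(0)             # standardize predictors
beta_true = np.array([4.0, 0.0, -3.0, 0.0, 0.0, 1.0])
y = X @ beta_true + 0.5 * rng.randn(200)
y = y - y.mean()

eps = 0.01
beta = np.zeros(X.shape[1])
for _ in range(2000):
    residual = y - X @ beta
    corr = X.T @ residual                   # correlation with each predictor
    j = np.argmax(np.abs(corr))             # most correlated predictor
    beta[j] += eps * np.sign(corr[j])       # small, shrunken update

print(beta)   # sparse-looking solution, similar in spirit to a lasso fit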

However, the sometimes superior performance of boosting over procedures such as the support vector machine may be largely due to the implicit use of the L1 versus L2 penalty. The shrinkage resulting from the L1 penalty is better suited to sparse situations, where there are few basis functions with nonzero coefficients.

If the true coefficients arise from a Gaussian distribution, so that many of them are nonzero but small, then in a Bayesian sense the best predictor is ridge regression; that is, we should use an L2 rather than an L1 penalty when fitting the coefficients.
On the other hand, if there are only a small number (e.g., 1,000) of coefficients that are nonzero, the Lasso (L1 penalty) will work better. We think of this as a sparse scenario, while the first case (Gaussian coefficients) is dense.
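
A rough simulation of the two scenarios, comparing cross-validated ridge and lasso test error on synthetic data (sizes and settings are arbitrary, and the outcome will vary with the random seed):

# Sparse vs. dense truth: lasso tends to win when only a few coefficients are
# nonzero; ridge tends to win when many small (Gaussian) coefficients are dense.
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV

rng = np.random.RandomState(0)
n, p = 100, 50

def test_mse(beta_true):
    X = rng.randn(n, p);  y = X @ beta_true + rng.randn(n)     # training data
    Xt = rng.randn(n, p); yt = Xt @ beta_true + rng.randn(n)   # test data
    ridge = RidgeCV().fit(X, y)
    lasso = LassoCV(cv=5).fit(X, y)
    return (np.mean((yt - ridge.predict(Xt)) ** 2),
            np.mean((yt - lasso.predict(Xt)) ** 2))

sparse_beta = np.zeros(p); sparse_beta[:3] = 5.0   # few large coefficients
dense_beta = 0.5 * rng.randn(p)                    # many small coefficients

print("sparse truth (ridge MSE, lasso MSE):", test_mse(sparse_beta))
print("dense  truth (ridge MSE, lasso MSE):", test_mse(dense_beta))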

In other words, use of the L1 penalty follows what we call the "bet on sparsity" principle for high-dimensional problems:
use a procedure that does well in sparse problems, since no procedure does well in dense problems.