Ensemble learning builds a prediction model by combining the strengths of a
collection of simpler base models.
Bagging, boosting, and random forests all fall into the category of ensemble learning.
Ensemble learning can be broken down into two tasks:
1) Developing a population of base learners from the training data.
2) Then combining them to form the composite predictor.
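The two tasks above can be sketched with bagging as the example. This is a minimal illustration, not a reference implementation: the base learner (a one-split regression stump), the helper names `fit_stump`, `build_population`, and `combine`, and the toy step-function data are all assumptions made for the example.

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split regression stump: returns (threshold, left_mean, right_mean)."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # all x identical: fall back to a constant predictor
        m = sum(ys) / len(ys)
        return (xs[0], m, m)
    return best[1:]

def predict_stump(stump, x):
    t, lm, rm = stump
    return lm if x <= t else rm

def build_population(xs, ys, n_learners, rng):
    """Task 1: fit each base learner on a bootstrap resample of the training data."""
    n = len(xs)
    stumps = []
    for _ in range(n_learners):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps

def combine(stumps, x):
    """Task 2: form the composite predictor by averaging the base learners."""
    return sum(predict_stump(s, x) for s in stumps) / len(stumps)

# Toy data (assumed): a noisy step from 0 to 1 at x = 2.
rng = random.Random(0)
xs = [i / 10 for i in range(40)]
ys = [(0.0 if x < 2.0 else 1.0) + rng.gauss(0, 0.1) for x in xs]

ensemble = build_population(xs, ys, n_learners=25, rng=rng)
print(combine(ensemble, 0.5), combine(ensemble, 3.5))
```

Boosting and random forests differ in how the population is built and how the learners are weighted, but they follow the same two-task outline.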
Boosting and Regularization Paths :
Boosting goes a step further: it builds an ensemble model by conducting
a regularized and supervised search in a high-dimensional space of weak
learners.
Before going further, let's start with
some terminology :
Least Squares :
Least squares is a standard approach to approximately solving overdetermined systems, i.e., sets of equations in which there are more equations than unknowns. "Least squares" means that the overall solution minimizes the sum of the squares of the errors made in the results of every single equation.
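As a small worked example, fitting a straight line to four points gives four equations in two unknowns (intercept and slope), which is overdetermined. The sketch below uses the closed-form solution for simple linear regression; the data values and the function names `fit_line` and `sum_sq_errors` are assumptions chosen for illustration.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y ~ intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def sum_sq_errors(xs, ys, intercept, slope):
    """Sum of squared errors over all equations intercept + slope * x_i = y_i."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Four equations, two unknowns: no line passes through all four points.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]

intercept, slope = fit_line(xs, ys)
print(intercept, slope, sum_sq_errors(xs, ys, intercept, slope))
```

Any other choice of intercept and slope gives a larger sum of squared errors on these points.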
Regularized versions :
Regularization, in the field of
machine learning, refers to the process of introducing additional information in
order to prevent overfitting.
Ridge regression :
Ridge regression adds the constraint that $\|\beta\|_2^2$, the squared L2-norm of the parameter vector, is not greater than a given value. Equivalently, it solves an unconstrained minimization of the least-squares objective with $\alpha \|\beta\|_2^2$ added, where $\alpha$ is a constant.
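The unconstrained form can be sketched directly, since minimizing the least-squares objective plus the squared L2 penalty leads to the linear system (X^T X + alpha I) b = X^T y. The code below is a minimal pure-Python illustration under that formulation; the helper `solve`, the function name `ridge`, and the toy data are assumptions, and no intercept is fitted.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def ridge(X, y, alpha):
    """Minimize ||y - X b||^2 + alpha * ||b||_2^2 via (X^T X + alpha I) b = X^T y."""
    p = len(X[0])
    gram = [[sum(row[i] * row[j] for row in X) + (alpha if i == j else 0.0)
             for j in range(p)] for i in range(p)]
    rhs = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(gram, rhs)

# Toy data (assumed): five observations, two correlated predictors.
X = [[1.0, 0.5], [2.0, 1.0], [3.0, 2.5], [4.0, 3.0], [5.0, 5.5]]
y = [1.2, 2.1, 3.4, 3.9, 5.6]

for alpha in (0.0, 1.0, 10.0, 100.0):
    b = ridge(X, y, alpha)
    print(alpha, [round(v, 4) for v in b], round(sum(v * v for v in b), 4))
```

The printed squared norm of the coefficient vector decreases as alpha grows. With correlated predictors, as here, an individual coefficient need not decrease at every step even though the overall norm does.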
Lasso method :
An alternative regularized version of least squares is the lasso (least absolute shrinkage and selection operator), which uses the constraint that $\|\beta\|_1$, the L1-norm of the parameter vector, is no greater than a given value. (As above, this is equivalent to an unconstrained minimization of the least-squares objective with $\alpha \|\beta\|_1$ added.)
A key difference between the lasso and ridge regression is that in ridge regression, as the penalty is increased, the coefficients are shrunk toward zero but generally remain non-zero, whereas in the lasso, increasing the penalty drives more and more of the coefficients exactly to zero.
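The contrast can be seen on a small example. The sketch below solves the L1-penalized problem by coordinate descent with soft-thresholding, which is one common approach and not the only one. The function names, the toy design with mutually orthogonal columns, and the shortcut `ridge_orthogonal` (valid only for such a design) are assumptions made for illustration.

```python
def soft_threshold(rho, t):
    """Shrink rho toward zero by t, returning exactly 0.0 inside [-t, t]."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0

def lasso(X, y, alpha, n_sweeps=100):
    """Coordinate descent for ||y - X b||^2 + alpha * ||b||_1 (no intercept)."""
    p = len(X[0])
    beta = [0.0] * p
    col_sq = [sum(row[j] ** 2 for row in X) for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            # Inner product of column j with the residual that excludes column j.
            rho = sum(
                row[j] * (yi - sum(row[k] * beta[k] for k in range(p) if k != j))
                for row, yi in zip(X, y)
            )
            beta[j] = soft_threshold(rho, alpha / 2.0) / col_sq[j]
    return beta

def ridge_orthogonal(X, y, alpha):
    """Ridge solution for ||y - X b||^2 + alpha * ||b||_2^2.
    Valid only when the columns of X are mutually orthogonal."""
    p = len(X[0])
    return [
        sum(row[j] * yi for row, yi in zip(X, y))
        / (sum(row[j] ** 2 for row in X) + alpha)
        for j in range(p)
    ]

# Toy design (assumed): orthogonal columns, y = 3*x1 + 1*x2 + 0.2*x3.
X = [[1.0, 1.0, 1.0], [1.0, -1.0, 1.0], [1.0, 1.0, -1.0], [1.0, -1.0, -1.0]]
y = [4.2, 2.2, 3.8, 1.8]

for alpha in (0.0, 2.0, 10.0, 30.0):
    print(alpha, lasso(X, y, alpha), ridge_orthogonal(X, y, alpha))
```

In this toy design the lasso sets the smallest coefficient to zero at alpha = 2 and the second one at alpha = 10, while the ridge coefficients shrink but stay non-zero at every alpha shown.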
Bet on Sparsity Principle :
Boosting’s forward stagewise strategy with shrinkage approximately minimizes
the boosting loss function with a lasso-style L1 penalty added.
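To make the connection concrete, here is a minimal sketch of incremental forward stagewise fitting for a linear model with squared-error loss, where the columns of X stand in for the weak learners. It assumes the columns have equal norm (a stand-in for standardized predictors), so picking the column with the largest absolute inner product with the residual is the same as picking the best single-column fit. The function name, step size, and toy data are assumptions for illustration.

```python
def forward_stagewise(X, y, step=0.01, n_steps=1000):
    """Repeatedly nudge the coefficient of the column most aligned with the residual.
    Assumes the columns of X have equal norm."""
    p = len(X[0])
    beta = [0.0] * p
    residual = list(y)
    for _ in range(n_steps):
        corr = [sum(row[j] * r for row, r in zip(X, residual)) for j in range(p)]
        j = max(range(p), key=lambda k: abs(corr[k]))
        if abs(corr[j]) < 1e-12:  # residual is orthogonal to every column
            break
        delta = step if corr[j] > 0 else -step  # shrinkage: a small fixed move
        beta[j] += delta
        residual = [r - delta * row[j] for row, r in zip(X, residual)]
    return beta

# Toy design (assumed): orthogonal, equal-norm columns, y = 3*x1 + 1*x2 + 0.2*x3.
X = [[1.0, 1.0, 1.0], [1.0, -1.0, 1.0], [1.0, 1.0, -1.0], [1.0, -1.0, -1.0]]
y = [4.2, 2.2, 3.8, 1.8]

print(forward_stagewise(X, y, n_steps=175))   # early stop: a sparse fit
print(forward_stagewise(X, y, n_steps=2000))  # long run: close to least squares
```

Stopping early plays the role of a large L1 penalty: after 175 small steps only the first coefficient has moved, to about 1.75, which for this design matches the minimizer of ||y - X b||^2 + alpha * ||b||_1 at alpha = 10. Running longer relaxes the implicit penalty, and the coefficients approach the least-squares values.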
The sometimes superior performance of boosting over
procedures such as the support vector machine may be largely due to this
implicit use of the L1 rather than the L2 penalty. The shrinkage resulting from
the L1 penalty is better
suited to sparse situations, where there are few basis functions with nonzero
coefficients.
If the true coefficients arise from a Gaussian distribution, then in a Bayesian
sense the best predictor is ridge regression. That is, we should
use an L2 rather than an L1
penalty when fitting the coefficients.
On the
other hand, if only a small number (e.g., 1000) of the coefficients are
nonzero, the lasso (L1
penalty) will work better. We think of this as a sparse scenario, while the
first case (Gaussian coefficients) is dense.
In other
words, use of the L1
penalty follows what we call the “bet on sparsity” principle for
high-dimensional problems:
Use a procedure that does well in sparse problems, since no procedure does well in dense problems.