An ensemble model combines multiple ‘individual’ (diverse) models and delivers superior predictive power. A good model should maintain a balance between bias and variance; this is known as the bias-variance trade-off. Ensemble learning is one way to manage this trade-off.
Bagging :
Bagging is an approach where you draw random samples of the data, build a learning algorithm on each sample, and average the resulting predictions.
The objective is to average many noisy but roughly unbiased models to create a combined model with lower variance.
- Create Multiple Datasets:
- Sampling is done with replacement on the original data to form new datasets.
- The new datasets can contain a fraction of the columns as well as the rows; these fractions are generally hyper-parameters in a bagging model.
- Taking row and column fractions less than 1 helps build robust models that are less prone to overfitting.
- Build Multiple Classifiers:
- Classifiers are built on each data set.
- Generally the same classifier is modeled on each data set and predictions are made.
- Combine Classifiers:
- The predictions of all the classifiers are combined using the mean, median, or mode, depending on the problem at hand.
- The combined prediction is generally more robust than that of a single model (see the sketch below).
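A minimal sketch of this procedure, using scikit-learn's BaggingClassifier with row and column fractions below 1 (`max_samples` / `max_features`); the synthetic dataset and hyper-parameter values are illustrative assumptions, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative synthetic data
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: bootstrap samples of rows plus a random fraction of columns,
# the same base classifier fitted on each sample, predictions combined by voting
bagging = BaggingClassifier(
    DecisionTreeClassifier(),   # same classifier on every dataset
    n_estimators=100,           # number of resampled datasets / models
    max_samples=0.8,            # row fraction < 1
    max_features=0.8,           # column fraction < 1
    bootstrap=True,             # sampling with replacement
    random_state=0,
)
bagging.fit(X_train, y_train)
print("test accuracy:", bagging.score(X_test, y_test))
```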
Boosting :
The term ‘Boosting’ refers to a family of algorithms that convert weak learners into strong learners.
- It starts by assigning equal weights to each observation.
- A base learning algorithm is applied.
- Misclassified observations are assigned a higher weight.
- The next base learning algorithm is applied.
- The iterations continue until a limit on the number of base learners is reached or the desired accuracy is achieved.
Finally, it combines the outputs of the weak learners to create a strong learner, which improves the predictive power of the model. Boosting focuses on examples that were misclassified or had higher errors under the preceding weak rules.
It has shown better predictive accuracy than bagging, but it also tends to overfit the training data.
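To make the reweighting loop above concrete, here is a minimal AdaBoost-style sketch; the synthetic dataset, number of rounds, and use of decision stumps as weak learners are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
y = np.where(y == 0, -1, 1)            # labels in {-1, +1}

n_rounds = 50
w = np.full(len(X), 1 / len(X))        # start with equal weights
stumps, alphas = [], []

for _ in range(n_rounds):
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X, y, sample_weight=w)   # base learner on the weighted data
    pred = stump.predict(X)
    err = np.sum(w * (pred != y)) / np.sum(w)
    if err >= 0.5:                     # weak learner no better than chance
        break
    alpha = 0.5 * np.log((1 - err) / (err + 1e-10))
    # misclassified observations receive higher weight for the next round
    w *= np.exp(-alpha * y * pred)
    w /= w.sum()
    stumps.append(stump)
    alphas.append(alpha)

# strong learner: weighted vote of all the weak learners
scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
final_pred = np.sign(scores)
print("training accuracy:", np.mean(final_pred == y))
```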
Stacking :
Stacking works in two phases. First, we use multiple base classifiers to predict the class. Second, a new learner is used to combine their predictions with the aim of reducing the generalization error.
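A minimal sketch of the two phases using scikit-learn's StackingClassifier; the choice of base learners and of logistic regression as the combining learner is an illustrative assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Phase 1: base classifiers predict the class
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("svm", SVC(probability=True, random_state=0)),
]

# Phase 2: a new learner combines their predictions
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(),
    cv=5,  # out-of-fold predictions help reduce the generalization error
)
stack.fit(X, y)
```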
Random Forest
Random Forest works as a large collection of decorrelated decision trees and is based on bagging (bootstrap aggregating).
- A random sample is drawn from the population, using a random subset of the variables (feature selection).
- A decision tree is built on the sample.
- Multiple samples are drawn with replacement (bootstrap sampling).
- Multiple decision trees are built, one per sample.
- All the decision trees vote on the class.
- The final prediction is the class with the most votes; for regression, it is the average of the outputs of the different trees (see the sketch below).
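A minimal sketch of this procedure with scikit-learn's RandomForestClassifier, which handles the bootstrap sampling and per-split feature selection internally; the dataset and hyper-parameters are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,       # number of bootstrapped decision trees
    max_features="sqrt",    # random subset of predictors considered at each split
    random_state=0,
)
rf.fit(X_train, y_train)

# Final prediction = majority vote across all the trees
print("test accuracy:", rf.score(X_test, y_test))
```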
Advantages:
- Capable of performing both regression and classification
- Performs dimension reduction and handles missing values / outliers effectively.
- Can handle a huge number of variables
- Shows the importance of each variable (see the sketch after this list)
- RF differs from plain bagging: RF also selects a random subset of predictors at each split, which results in decorrelated trees.
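As a short sketch of the variable-importance point, scikit-learn exposes impurity-based importance scores on a fitted forest via `feature_importances_`; the dataset here is an illustrative assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_informative=4, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X, y)

# One importance score per variable; the scores sum to 1
for i, score in enumerate(rf.feature_importances_):
    print(f"feature {i}: {score:.3f}")
```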
Disadvantages:
- Good at classification, but does not give precise predictions for continuous targets in regression
- Black box: there is very little control over what the model does; only the model's hyper-parameters can be tuned.
Gradient Boosting (GBM) & AdaBoost
- The base learner takes all the observations and assigns equal weight (attention) to each one.
- If the first base learning algorithm makes prediction errors, the misclassified observations receive higher attention, and the next base learning algorithm is applied.
- These steps are iterated until a limit on the number of base learners is reached or the desired accuracy is achieved.
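A minimal sketch fitting both algorithms with scikit-learn; the dataset, number of estimators, learning rates, and tree depth are illustrative assumptions, not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# AdaBoost: reweights misclassified observations at each round
ada = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=0)
ada.fit(X_train, y_train)

# Gradient Boosting: each new tree fits the errors of the current ensemble
gbm = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0
)
gbm.fit(X_train, y_train)

print("AdaBoost test accuracy:", ada.score(X_test, y_test))
print("GBM test accuracy:", gbm.score(X_test, y_test))
```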