Linear Regression

[Method = Ordinary Least Squares (OLS)]

Y = Continuous ; X = Continuous

The objective is to minimise the sum of squares of the residuals (a residual is the difference between an observation and the fitted line).
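A minimal sketch of an OLS fit with statsmodels on synthetic data (the variables and numbers below are made up purely for illustration):

```python
# Minimal OLS fit: the estimator picks the intercept and slope that minimise
# the sum of squared residuals. Data here is synthetic and purely illustrative.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=100)   # true line: y = 2 + 3x + noise

X = sm.add_constant(x)        # add the intercept column
model = sm.OLS(y, X).fit()    # ordinary least squares fit
print(model.params)           # estimated intercept and slope
print(model.summary())        # coefficients, R², standard diagnostics
```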

Assumptions

  1. Linear relationship between dependent & independent variables
  2. No presence of outliers
  3. Independent variables are independent of each other (non collinear)
  4. Errors (also called residuals):
    1. Should have constant variance (homoscedasticity)
    2. Are independent and identically distributed (iid), i.e. no autocorrelation
    3. Are normally distributed with a mean of 0

Tests for Assumptions :

  • Linearity :
    • Methods :
      • Residuals vs Predicted plot  / Residuals vs Actuals plot
    • Corrections : 
      • Log transformation for strictly positive variables
      • Adding a regressor that is a non-linear function of an existing one, e.g. both x and x²
      • Creating a new variable that is the sum/product (interaction) of two variables A & B
  • Multicollinearity
    • Methods:
      • Correlation Matrix
      • VIF (Variance Inflation Factor)

VIF is calculated only on the independent variables. It runs a series of auxiliary regressions, each regressing one Xi on the other IVs to get its R²; VIFi = 1 / (1 − R²i).

Eg : If regressing X1 against X2, X3, X4 gives a high R², it means X2, X3, X4 can explain a large share of the variation in X1, so X1 is redundant. Range = 1 to ∞; roughly VIF < 5 is low, 5 – 10 is moderate, > 10 is high.
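A sketch of computing VIF with statsmodels' variance_inflation_factor; the helper name vif_table and the column names in the usage line are illustrative, not from the notes:

```python
# Compute VIF for each independent variable; variance_inflation_factor runs the
# auxiliary regression of column i on the remaining columns internally.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X: pd.DataFrame) -> pd.DataFrame:
    """Return VIF = 1 / (1 - R²_i) for every independent variable in X."""
    Xc = sm.add_constant(X)   # include an intercept in the auxiliary regressions
    rows = [
        {"variable": col, "VIF": variance_inflation_factor(Xc.values, i)}
        for i, col in enumerate(Xc.columns)
        if col != "const"
    ]
    return pd.DataFrame(rows)

# Hypothetical usage: vif_table(df[["X1", "X2", "X3", "X4"]])
```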

  • Homoscedasticity
    • Methods:
      • Goldfeld-Quandt test
      • Scatter plot (residuals vs predicted)
    • Corrections : 
      • Plot the actual or predicted values of the DV against the errors; the plot should look random. If there is a trend (e.g. the spread of errors grows with the fitted values), take the log of the DV.
  • Autocorrelation
    • Durbin-Watson Test : Tests for serial correlation between errors

Range : 0 to 4; a value near 2 indicates no autocorrelation, values below 2 indicate positive autocorrelation, and values above 2 indicate negative autocorrelation.
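A sketch of the Goldfeld-Quandt and Durbin-Watson checks with statsmodels, run on a synthetic fit (all data and names below are illustrative):

```python
# Homoscedasticity (Goldfeld-Quandt) and autocorrelation (Durbin-Watson) checks.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_goldfeldquandt
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + rng.normal(size=200)
X = sm.add_constant(x)
model = sm.OLS(y, X).fit()

# Goldfeld-Quandt splits the sample and compares the residual variances of the
# two halves; a small p-value suggests heteroscedasticity.
f_stat, p_value, _ = het_goldfeldquandt(y, X)
print(f"Goldfeld-Quandt: F = {f_stat:.3f}, p = {p_value:.3f}")

# Durbin-Watson on the residuals: ~2 means no serial correlation,
# below 2 positive autocorrelation, above 2 negative autocorrelation.
print(f"Durbin-Watson: {durbin_watson(model.resid):.3f}")
```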

  • Multivariate Normal
    • Methods:
      • Kolmogorov-Smirnov test / Shapiro-Wilk / Anderson-Darling / Jarque-Bera
      • Q-Q Plot
      • Histogram with fitted normal curve
    • Corrections:
      • Nonlinear / Log transformation
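The normality checks above are applied to the residuals of the fitted model; a sketch with scipy and statsmodels on synthetic data (names illustrative):

```python
# Normality tests and a Q-Q plot for the residuals of a fitted OLS model.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + rng.normal(size=200)
model = sm.OLS(y, sm.add_constant(x)).fit()
resid = model.resid

print(stats.shapiro(resid))                                           # Shapiro-Wilk
print(stats.anderson(resid))                                          # Anderson-Darling
print(stats.jarque_bera(resid))                                       # Jarque-Bera
print(stats.kstest(resid, "norm", args=(resid.mean(), resid.std()))) # Kolmogorov-Smirnov

sm.qqplot(resid, line="45", fit=True)   # residuals should hug the 45-degree line
plt.show()
```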

Dummy Variable Trap

  • Include one fewer dummy variable than the number of categories when adding dummies to a regression (see the sketch below).
  • The excluded category serves as the base (reference) level.
  • The coefficients of all the other dummies are interpreted relative to that base level.
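A sketch of how drop_first in pandas' get_dummies avoids the trap; the city column and its values are made up for illustration:

```python
# One dummy fewer than the number of categories: drop_first=True drops the first
# level, which then acts as the base that the remaining dummies are compared to.
import pandas as pd

df = pd.DataFrame({"city": ["Delhi", "Mumbai", "Chennai", "Delhi", "Mumbai"]})
dummies = pd.get_dummies(df["city"], prefix="city", drop_first=True)
print(dummies)
# 'Chennai' (the first level alphabetically) is dropped and becomes the base;
# the coefficients on city_Delhi and city_Mumbai are read relative to Chennai.
```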

Model Performance :

  • R Square : the % of variance in Y that is explained by X. It equals the square of the correlation between the predicted and actual values.

R² = SS(explained by Independent Vars) / [ SS(explained by Independent Vars) + SS(Errors) ]
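A quick numeric check of this decomposition on a synthetic OLS fit (statsmodels; the data is made up, and the equality holds because the model includes an intercept):

```python
# Verify R² = SS(explained) / (SS(explained) + SS(errors)) on a fitted model.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.normal(size=50)
y = 1.0 + 2.0 * x + rng.normal(size=50)
fit = sm.OLS(y, sm.add_constant(x)).fit()

ss_errors = np.sum(fit.resid ** 2)                         # unexplained sum of squares
ss_explained = np.sum((fit.fittedvalues - y.mean()) ** 2)  # explained by the IVs
print(ss_explained / (ss_explained + ss_errors))           # formula above
print(fit.rsquared)                                        # matches statsmodels' R²
```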

  • Adjusted R Square : It penalizes the addition of insignificant variables to the model
  • MSE (Mean Squared Error) : mean of the squared errors
  • RMSE (Root Mean Square Error) : It measures standard deviation of the residuals.

Among competing models fitted to the same data, the one with the lowest RMSE is the best.

RMSE = sqrt( Sum of Squared Errors / no. of obs ) = sqrt( mean( (Actual – Predicted)² ) )

Mean Square : Sum of squares / df

  • MAE (Mean Absolute Error) : sum( |Error| ) / n, where Error = Actual – Predicted and |Error| is the absolute error
  • MAPE (Mean Absolute Percentage Error) : average( | (Actual – Predicted) / Actual | ) × 100; as a rule of thumb it should not exceed ~8% – 10% (see the metrics sketch after this list)
  • AIC (Akaike Information Criterion) : penalizes model complexity; lower is better
  • BIC (Bayesian Information Criterion) : penalizes model complexity; lower is better
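A sketch computing the error metrics in this list with scikit-learn and numpy; the actual/predicted arrays and the choice of p = 1 predictor are purely illustrative. (AIC and BIC come directly from a fitted statsmodels result as model.aic and model.bic.)

```python
# R², adjusted R², MSE, RMSE, MAE and MAPE for a set of predictions.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

actual = np.array([10.0, 12.0, 15.0, 18.0, 20.0])
predicted = np.array([11.0, 11.5, 14.0, 18.5, 21.0])

r2 = r2_score(actual, predicted)
mse = mean_squared_error(actual, predicted)
rmse = np.sqrt(mse)                                          # sqrt(mean squared error)
mae = mean_absolute_error(actual, predicted)
mape = np.mean(np.abs((actual - predicted) / actual)) * 100  # in percent

n, p = len(actual), 1                                        # p = number of predictors (assumed)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)                # penalizes extra regressors

print(f"R2={r2:.3f} adjR2={adj_r2:.3f} MSE={mse:.3f} RMSE={rmse:.3f} "
      f"MAE={mae:.3f} MAPE={mape:.1f}%")
```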

Loss Functions: objective is to minimise these

  • MAE : Mean Absolute Error (mean of the absolute errors)
  • MSE : Mean Squared Error (mean of the squared errors)
  • RMSE : Root Mean Squared Error (square root of the mean of squared errors)