Linear Regression

[Method = Ordinary Least Squares (OLS)]

Y = Continuous ; X = Continuous

The objective is to minimise the sum of squares of the residuals (a residual is the difference between an observation and the fitted line).
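A minimal sketch of an OLS fit with statsmodels on synthetic data (the variables and numbers below are made up purely for illustration):

```python
# Minimal OLS fit: the estimator picks the intercept and slope that minimise
# the sum of squared residuals. Data here is synthetic and purely illustrative.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=100)   # true line: y = 2 + 3x + noise

X = sm.add_constant(x)        # add the intercept column
model = sm.OLS(y, X).fit()    # ordinary least squares fit
print(model.params)           # estimated intercept and slope
print(model.summary())        # coefficients, R², standard diagnostics
```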

Assumptions

  1. Linear relationship between dependent & independent variables
  2. No presence of outliers
  3. Independent variables are independent of each other (non collinear)
  4. Errors (also called residuals):
    1. Should have constant variance (homoscedasticity)
    2. Are independent and identically distributed (iid), i.e. no autocorrelation
    3. Are normally distributed with a mean of 0

Tests for Assumptions :

  • Linearity :
    • Methods :
      • Residuals vs Predicted plot  / Residuals vs Actuals plot
    • Corrections : 
      • Log transformation for strictly positive variables
      • Adding a regressor that is a non-linear function of an existing one, e.g. both x and x²
      • Creating a new variable that is the sum/product (interaction) of two variables A & B
  • Multicollinearity
    • Methods:
      • Correlation Matrix
      • VIF (Variance Inflation Factor)

VIF is calculated only on the independent variables. It runs a series of auxiliary regressions, each regressing one Xi on the other IVs to get its R²; VIFi = 1 / (1 − R²i).

Eg : If regressing X1 against X2, X3, X4 gives a high R², it means X2, X3, X4 can explain a large share of the variation in X1, so X1 is redundant. Range = 1 to ∞; roughly VIF < 5 is low, 5 – 10 is moderate, > 10 is high.
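A sketch of computing VIF with statsmodels' variance_inflation_factor; the helper name vif_table and the column names in the usage line are illustrative, not from the notes:

```python
# Compute VIF for each independent variable; variance_inflation_factor runs the
# auxiliary regression of column i on the remaining columns internally.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X: pd.DataFrame) -> pd.DataFrame:
    """Return VIF = 1 / (1 - R²_i) for every independent variable in X."""
    Xc = sm.add_constant(X)   # include an intercept in the auxiliary regressions
    rows = [
        {"variable": col, "VIF": variance_inflation_factor(Xc.values, i)}
        for i, col in enumerate(Xc.columns)
        if col != "const"
    ]
    return pd.DataFrame(rows)

# Hypothetical usage: vif_table(df[["X1", "X2", "X3", "X4"]])
```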

  • Homoscedasticity
    • Methods:
      • Goldfeld-Quandt test
      • Scatter plot (residuals vs predicted)
    • Corrections : 
      • Plot the actual or predicted values of the DV against the errors; the plot should look random. If there is a trend (e.g. the spread of errors grows with the fitted values), take the log of the DV.
  • Autocorrelation
    • Durbin-Watson Test : Tests for serial correlation between errors

Range : 0 to 4; a value near 2 indicates no autocorrelation, values below 2 indicate positive autocorrelation, and values above 2 indicate negative autocorrelation.
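A sketch of the Goldfeld-Quandt and Durbin-Watson checks with statsmodels, run on a synthetic fit (all data and names below are illustrative):

```python
# Homoscedasticity (Goldfeld-Quandt) and autocorrelation (Durbin-Watson) checks.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_goldfeldquandt
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + rng.normal(size=200)
X = sm.add_constant(x)
model = sm.OLS(y, X).fit()

# Goldfeld-Quandt splits the sample and compares the residual variances of the
# two halves; a small p-value suggests heteroscedasticity.
f_stat, p_value, _ = het_goldfeldquandt(y, X)
print(f"Goldfeld-Quandt: F = {f_stat:.3f}, p = {p_value:.3f}")

# Durbin-Watson on the residuals: ~2 means no serial correlation,
# below 2 positive autocorrelation, above 2 negative autocorrelation.
print(f"Durbin-Watson: {durbin_watson(model.resid):.3f}")
```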

  • Multivariate Normal
    • Methods:
      • Kolmogorov-Smirnov test / Shapiro-Wilk / Anderson-Darling / Jarque-Bera
      • Q-Q Plot
      • Histogram with fitted normal curve
    • Corrections:
      • Nonlinear / Log transformation
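The normality checks above are applied to the residuals of the fitted model; a sketch with scipy and statsmodels on synthetic data (names illustrative):

```python
# Normality tests and a Q-Q plot for the residuals of a fitted OLS model.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + rng.normal(size=200)
model = sm.OLS(y, sm.add_constant(x)).fit()
resid = model.resid

print(stats.shapiro(resid))                                           # Shapiro-Wilk
print(stats.anderson(resid))                                          # Anderson-Darling
print(stats.jarque_bera(resid))                                       # Jarque-Bera
print(stats.kstest(resid, "norm", args=(resid.mean(), resid.std()))) # Kolmogorov-Smirnov

sm.qqplot(resid, line="45", fit=True)   # residuals should hug the 45-degree line
plt.show()
```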

Dummy Variable Trap

  • Include one fewer dummy variable than the number of categories when adding dummies to a regression (see the sketch below).
  • The excluded category serves as the base (reference) level.
  • The coefficients of all the other dummies are interpreted relative to that base level.
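A sketch of how drop_first in pandas' get_dummies avoids the trap; the city column and its values are made up for illustration:

```python
# One dummy fewer than the number of categories: drop_first=True drops the first
# level, which then acts as the base that the remaining dummies are compared to.
import pandas as pd

df = pd.DataFrame({"city": ["Delhi", "Mumbai", "Chennai", "Delhi", "Mumbai"]})
dummies = pd.get_dummies(df["city"], prefix="city", drop_first=True)
print(dummies)
# 'Chennai' (the first level alphabetically) is dropped and becomes the base;
# the coefficients on city_Delhi and city_Mumbai are read relative to Chennai.
```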

Model Performance :

  • R Square : the % of variance in Y that is explained by X. It equals the square of the correlation between the predicted and actual values.

R² = SS(explained by Independent Vars) / [ SS(explained by Independent Vars) + SS(Errors) ]
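A quick numeric check of this decomposition on a synthetic OLS fit (statsmodels; the data is made up, and the equality holds because the model includes an intercept):

```python
# Verify R² = SS(explained) / (SS(explained) + SS(errors)) on a fitted model.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.normal(size=50)
y = 1.0 + 2.0 * x + rng.normal(size=50)
fit = sm.OLS(y, sm.add_constant(x)).fit()

ss_errors = np.sum(fit.resid ** 2)                         # unexplained sum of squares
ss_explained = np.sum((fit.fittedvalues - y.mean()) ** 2)  # explained by the IVs
print(ss_explained / (ss_explained + ss_errors))           # formula above
print(fit.rsquared)                                        # matches statsmodels' R²
```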

  • Adjusted R Square : It penalizes the addition of insignificant variables to the model
  • MSE (Mean Squared Error) : mean of the squared errors
  • RMSE (Root Mean Square Error) : It measures standard deviation of the residuals.

Among competing models fitted to the same data, the one with the lowest RMSE is the best.

RMSE = sqrt( Sum of Squared Errors / no. of obs ) = sqrt( mean( (Actual – Predicted)² ) )

Mean Square : Sum of squares / df

  • MAE (Mean Absolute Error) : sum( |Error| ) / n, where Error = Actual – Predicted and |Error| is the absolute error
  • MAPE (Mean Absolute Percentage Error) : average( | (Actual – Predicted) / Actual | ) × 100; as a rule of thumb it should not exceed ~8% – 10% (see the metrics sketch after this list)
  • AIC (Akaike Information Criterion) : penalizes model complexity; lower is better
  • BIC (Bayesian Information Criterion) : penalizes model complexity; lower is better
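A sketch computing the error metrics in this list with scikit-learn and numpy; the actual/predicted arrays and the choice of p = 1 predictor are purely illustrative. (AIC and BIC come directly from a fitted statsmodels result as model.aic and model.bic.)

```python
# R², adjusted R², MSE, RMSE, MAE and MAPE for a set of predictions.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

actual = np.array([10.0, 12.0, 15.0, 18.0, 20.0])
predicted = np.array([11.0, 11.5, 14.0, 18.5, 21.0])

r2 = r2_score(actual, predicted)
mse = mean_squared_error(actual, predicted)
rmse = np.sqrt(mse)                                          # sqrt(mean squared error)
mae = mean_absolute_error(actual, predicted)
mape = np.mean(np.abs((actual - predicted) / actual)) * 100  # in percent

n, p = len(actual), 1                                        # p = number of predictors (assumed)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)                # penalizes extra regressors

print(f"R2={r2:.3f} adjR2={adj_r2:.3f} MSE={mse:.3f} RMSE={rmse:.3f} "
      f"MAE={mae:.3f} MAPE={mape:.1f}%")
```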

Loss Functions: objective is to minimise these

  • MAE : Mean Absolute Error (mean of the absolute errors)
  • MSE : Mean Squared Error (mean of the squared errors)
  • RMSE : Root Mean Squared Error (square root of the mean of squared errors)