Logistic Regression

[Method = Maximum Likelihood Estimation / Chi-square]

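Logistic regression is used to model the probability of an outcome. It is based on the concept of the Generalized Linear Model (binomial family with a logit link).

Since the dependent variable is binary, the errors are not normally distributed. The errors are also heteroskedastic, because their variance p(1 - p) changes with the predicted probability.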

Y = Binary ; X = Continuous or Categorical

Types : > Binary / Dichotomous > Ordered > Multinomial (classification)

Assumptions

  1. Binary logistic regression requires the dependent variable to be binary coded.
  2. The model should have little or no multicollinearity.
  3. The model should be fitted correctly: neither overfitting nor underfitting should occur.
  4. The error terms should be independent, i.e. the data should not be before-after (paired) samples.
  5. Requires a comparatively large data sample (min 30 observations, as a rough rule of thumb).
  6. There should be no extreme outliers. Assess this by converting continuous predictors to standardized (z) scores and removing values below -3.29 or above 3.29.
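
The z-score screen in the outlier assumption can be sketched as follows. This is a minimal illustration, not a prescribed procedure: the function name `drop_outliers` and the sample data are hypothetical, and the sample standard deviation is assumed.

```python
import statistics

def drop_outliers(values, threshold=3.29):
    """Keep only values whose absolute z-score is at most `threshold`."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (assumption)
    return [v for v in values if abs((v - mean) / sd) <= threshold]

# Hypothetical predictor: 30 ordinary values plus one extreme value.
data = [10.0 + 0.1 * i for i in range(30)] + [100.0]
cleaned = drop_outliers(data)
print(len(data), len(cleaned))
```

Note that in very small samples the largest possible z-score is bounded by the sample size, so this screen can only flag outliers once there are enough observations.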

Technique

It is similar to linear regression, except that Y is not regressed directly; instead, the log odds of Y = 1 are modeled as a linear function of the predictors.

Logit is the log of the odds, and the odds are a function of P: odds = P / (1 - P).

Linear regression » -∞ to +∞

Probability values »   0 to   1

Odds »   0 to +∞

Log odds » -∞ to +∞

The (natural) log of the odds is taken because it expresses the results on a symmetric scale:

eg: probabilities of 90% and 10% give odds of (0.9/0.1) = 9 and, conversely, (0.1/0.9) = 0.11

However, ln(0.9/0.1) = 2.197 and, conversely, ln(0.1/0.9) = -2.197, which are symmetric around zero and so relate in a much better way.
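
The chain from probability to odds to log odds can be sketched in a few lines. The function names `odds`, `logit` and `inverse_logit` are illustrative choices, not part of any particular library.

```python
import math

def odds(p):
    """Odds of an event with probability p (0 < p < 1)."""
    return p / (1.0 - p)

def logit(p):
    """Natural log of the odds; maps (0, 1) onto (-inf, +inf)."""
    return math.log(odds(p))

def inverse_logit(z):
    """Logistic function; maps a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

for p in (0.1, 0.5, 0.9):
    print(p, round(odds(p), 3), round(logit(p), 3))
```

Complementary probabilities such as 0.9 and 0.1 produce log odds of equal size and opposite sign, which is the symmetry that raw odds lack.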

Interpretation

Logistic regression coefficients give the change in the log odds of the outcome for a one unit increase in the predictor variable, holding the other predictors constant.
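
Because a coefficient is a change in log odds, exponentiating it gives the multiplicative change in the odds (the odds ratio). The sketch below checks this numerically; the intercept and coefficient values are made up for illustration and do not come from a fitted model.

```python
import math

def predicted_probability(intercept, beta, x):
    """P(Y = 1) for a single-predictor logistic model."""
    return 1.0 / (1.0 + math.exp(-(intercept + beta * x)))

def odds(p):
    return p / (1.0 - p)

# Hypothetical values, for illustration only.
intercept, beta = -1.5, 0.7

odds_at_2 = odds(predicted_probability(intercept, beta, 2.0))
odds_at_3 = odds(predicted_probability(intercept, beta, 3.0))

# A one-unit increase in x multiplies the odds by exp(beta).
print(odds_at_3 / odds_at_2, math.exp(beta))
```

The change in probability for a one-unit increase is not constant; only the change in log odds (and hence the odds ratio) is.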

Output

  • Null Deviance : The deviance of a model with nothing but an intercept. It is the baseline against which the fitted model is compared, so the difference between null and residual deviance should be high.
  • Residual Deviance : The deviance of the model after adding the independent variables. Lower the value, better the model.
  • Fisher scoring iterations : The number of iterations the Fisher scoring algorithm needed to converge to the estimates. It is a convergence diagnostic, not a measure of fit like the AIC value.
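
For ungrouped binary data, the deviance equals -2 times the log-likelihood, so the null and residual deviance can be compared directly. The sketch below assumes a tiny made-up response vector and hypothetical fitted probabilities (`p_model` is not the output of an actual fit).

```python
import math

def log_likelihood(y, p):
    """Bernoulli log-likelihood of outcomes y under probabilities p."""
    return sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
               for yi, pi in zip(y, p))

def deviance(y, p):
    """Deviance for ungrouped binary data: -2 * log-likelihood."""
    return -2.0 * log_likelihood(y, p)

y = [0, 0, 0, 1, 0, 1, 1, 1]                         # made-up outcomes
p_null = [sum(y) / len(y)] * len(y)                  # intercept-only: base rate
p_model = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]   # hypothetical fitted values

null_dev = deviance(y, p_null)
resid_dev = deviance(y, p_model)
print(round(null_dev, 3), round(resid_dev, 3), round(null_dev - resid_dev, 3))
```

The drop from null to residual deviance is what a likelihood-ratio (chi-square) test evaluates, with degrees of freedom equal to the number of added predictors.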

Validation

  • The same variables should be significant in both the training and validation samples.
  • The behavior of the variables should be the same in both samples (same sign of coefficients).
  • Beta coefficients should be close in the training and validation samples.
  • The maximum KS statistic should occur within the top 3 deciles.
  • As a rule of thumb, the KS statistic should be between 40 and 70 (on a 0 to 100 scale).
  • Rank Ordering – There should not be any break in rank ordering.
  • Lift Curve – The larger the cumulative lift value the better the accuracy
  • Goodness of Fit Tests – The model should fit the data well. Check the Hosmer-Lemeshow test and the deviance and residual tests.
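
The KS and rank-ordering checks above can be sketched with a simple decile table. This is one common way to compute them, not necessarily the exact procedure intended here; the function names and the 20 scored observations are hypothetical.

```python
def ks_table(y_true, scores, n_bins=10):
    """Rows of (bin, events, non_events, ks) with bins sorted by score, highest first."""
    ranked = sorted(zip(scores, y_true), key=lambda t: t[0], reverse=True)
    n = len(ranked)
    total_events = sum(y for _, y in ranked)
    total_non_events = n - total_events
    cum_events = cum_non_events = 0
    rows = []
    for b in range(n_bins):
        chunk = ranked[b * n // n_bins:(b + 1) * n // n_bins]
        events = sum(y for _, y in chunk)
        cum_events += events
        cum_non_events += len(chunk) - events
        # KS: gap between cumulative event and non-event capture rates (0-100).
        ks = 100.0 * (cum_events / total_events - cum_non_events / total_non_events)
        rows.append((b + 1, events, len(chunk) - events, ks))
    return rows

def is_rank_ordered(rows):
    """True if event counts never increase from a higher-score bin to a lower one."""
    return all(a[1] >= b[1] for a, b in zip(rows, rows[1:]))

# Hypothetical scored sample: higher score should mean more events.
scores = [i / 20 for i in range(20)]
y_true = [0] * 12 + [1, 0, 1, 0] + [1] * 4

rows = ks_table(y_true, scores)
best = max(rows, key=lambda r: r[3])
print("max KS %.1f in decile %d, rank ordered: %s" % (best[3], best[0], is_rank_ordered(rows)))
```

With equal-sized bins the event count per bin stands in for the event rate; with unequal bins the rate should be compared instead.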

Loss Function

  • A loss function is a measure of fit between a mathematical model of data and the actual data.
  • The parameters of the model are chosen to minimize the badness of fit, or equivalently to maximize the goodness of fit, of the model to the data.
  • With least squares, we minimize SSres, the residual sum of squares, which is equivalent to maximizing SSreg, the sum of squares due to regression.
  • With the logistic curve there is no closed-form solution for the parameter estimates, so they are found by iterative optimization.
  • For many of these models, the estimates are obtained by maximum likelihood, i.e. the loss function is the negative log-likelihood.

A likelihood is a conditional probability (eg P(Y|X), the probability of Y given X), viewed as a function of the model parameters for the observed data.
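
As a rough illustration of maximum likelihood as an optimization problem, the sketch below fits a one-predictor logistic model by plain gradient descent on the negative log-likelihood. Statistical packages typically use Fisher scoring or Newton-type methods instead; gradient descent is swapped in here only because it is short. The data, learning rate and step count are arbitrary assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neg_log_likelihood(b0, b1, xs, ys):
    """Loss: negative Bernoulli log-likelihood of the data given (b0, b1)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def fit(xs, ys, lr=0.1, steps=5000):
    """Gradient descent on the mean negative log-likelihood."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(b0 + b1 * x) - y  # gradient term: predicted minus observed
            g0 += err
            g1 += err * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1

# Made-up, non-separable data so that finite estimates exist.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit(xs, ys)
print(b0, b1, neg_log_likelihood(b0, b1, xs, ys))
```

Minimizing the negative log-likelihood is the same as maximizing the likelihood, since the logarithm is monotonic.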