Dimensionality Reduction

Principal Component Analysis (PCA)

In this technique, the original variables are transformed into a new set of variables, each a linear combination of the originals. These new variables are known as principal components.

PCA rotates the data in space so that the directions of maximum variance become the new axes.

The axes that describe the variation are the principal components: the 1st PC spans the direction of greatest variance, and the 2nd PC is orthogonal to the 1st and captures the largest remaining variance.

Eigenvectors : describe the directions of the principal components.
Eigenvalues : the amount of variance found in the direction specified by the corresponding eigenvector.

[Analogy : rotate a teapot in 3D to find the angle that shows the most detail in the photo]
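A minimal sketch with scikit-learn; the random data, the correlation injected into it, and the choice of two components are all illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Illustrative data: 100 samples, 5 features, two of them correlated
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=100)

# Standardize first: PCA is sensitive to feature scale
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)            # keep the top two principal components
X_reduced = pca.fit_transform(X_std)

print(pca.components_)               # eigenvectors: directions of the new axes
print(pca.explained_variance_)       # eigenvalues: variance along each direction
print(pca.explained_variance_ratio_)
```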

Factor Analysis

Factor Analysis takes groups of correlated variables and combines them into factors. Variables within a factor are highly correlated with one another but have low correlation with variables in other factors.

Variables are thus reduced to a smaller number of factors, reducing dimensionality.

  • EFA (Exploratory Factor Analysis)
  • CFA (Confirmatory Factor Analysis)

Principal component analysis involves extracting linear combinations of observed variables.

Factor analysis involves predicting observed variables from theoretical latent factors.

Use PCA to reduce correlated variables into a smaller set of independent composite variables.

Use Factor Analysis to test a theoretical model of latent factors causing the observed variables.
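A rough EFA-style sketch using scikit-learn's FactorAnalysis; the simulated two-factor data and the number of factors are assumptions for illustration:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Illustrative data: two latent factors generating six observed variables
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
loadings = rng.normal(size=(2, 6))
X = latent @ loadings + 0.3 * rng.normal(size=(200, 6))  # observed = factors * loadings + noise

fa = FactorAnalysis(n_components=2)
scores = fa.fit_transform(X)   # factor scores per observation
print(fa.components_)          # estimated loadings of each factor on the variables
```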

Weight of Evidence (WOE)

It’s a variable transformation method: the variable’s values are grouped into buckets, and a WOE value is calculated for each bucket.

WOE measures the strength of an independent variable in separating the two categories of a binary dependent variable.

  1. Create buckets.
  2. Count the number of Events and Non-Events in every bucket.
  3. Calculate the % of Events and % of Non-Events (bucket count / column total).
  4. Calculate the WOE for each bucket: WOE = ln(% of Non-Events / % of Events).

Example : Assume Good = 0 (the non-event) & Bad = 1 (the event)

Initial Variable :

  Age   Good/Bad
  10    1
  20    1
  30    0
  40    0
  50    0
  60    1
  70    1

Transformation :

  Age Group   Zeros   Ones   Zero %   One %   WOE
  <35         1       2      0.33     0.50    -0.41
  >35         2       2      0.67     0.50     0.29
  Total       3       4

Transformed Variable (each Age replaced by its bucket's WOE) :

  Good/Bad   Age (WOE)
  1          -0.41
  1          -0.41
  0          -0.41
  0           0.29
  0           0.29
  1           0.29
  1           0.29
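A minimal pandas sketch of steps 1-4 on this example; the column names and the <35 / >35 bucket boundary are taken from the example above, nothing else is assumed:

```python
import numpy as np
import pandas as pd

# The example data: Good = 0, Bad = 1 (the "event")
df = pd.DataFrame({"age": [10, 20, 30, 40, 50, 60, 70],
                   "bad": [1, 1, 0, 0, 0, 1, 1]})

# Step 1: create buckets
df["bucket"] = np.where(df["age"] < 35, "<35", ">35")

# Steps 2-3: count events/non-events per bucket, then take column percentages
grp = df.groupby("bucket")["bad"].agg(ones="sum", total="count")
grp["zeros"] = grp["total"] - grp["ones"]
grp["one_pct"] = grp["ones"] / grp["ones"].sum()
grp["zero_pct"] = grp["zeros"] / grp["zeros"].sum()

# Step 4: WOE = ln(% of non-events / % of events)
grp["woe"] = np.log(grp["zero_pct"] / grp["one_pct"])

# Transform the original variable: replace each age with its bucket's WOE
df["age_woe"] = df["bucket"].map(grp["woe"])
print(grp[["zeros", "ones", "zero_pct", "one_pct", "woe"]])
```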

Applications

  • Missing value treatment (missing values can be assigned their own bucket)
  • Effective outlier treatment (outliers fall into the extreme buckets)
  • Monotonicity (WOE-coded buckets give a monotonic relationship with the target)
  • Grouping, variable selection, and assessing predictive strength

Information Value (IV)

It measures the overall strength / predictive power of the variable: the sum over all buckets of (% of Non-Events − % of Events) × WOE. Each term is non-negative, because the difference and the WOE always share the same sign.

Example : Assume Good=0 & Bad=1

  Age Group   Zero %   One %   IV Formula                        IV
  <35         0.33     0.50    (0.33 − 0.50) × ln(0.33 / 0.50)   0.0706
  >35         0.67     0.50    (0.67 − 0.50) × ln(0.67 / 0.50)   0.0498

  Total Information Value : 0.1204

  IV < 0.02        : not useful
  0.02 ≤ IV < 0.1  : weak
  0.1 ≤ IV < 0.3   : medium
  IV ≥ 0.3         : strong
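A continuation sketch, assuming the grp DataFrame computed in the WOE sketch above; IV needs only one more column per bucket:

```python
# grp, zero_pct, one_pct and woe come from the WOE sketch above.
# IV contribution per bucket: (% non-events − % events) × WOE
grp["iv"] = (grp["zero_pct"] - grp["one_pct"]) * grp["woe"]
iv_total = grp["iv"].sum()   # ≈ 0.12 for this example: a medium predictor
print(iv_total)
```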

Linear Discriminant Analysis (LDA)

LDA is closely related to analysis of variance (ANOVA) and regression analysis, which also attempt to express one dependent variable as a linear combination of other features or measurements.[1][2] However, ANOVA uses categorical independent variables and a continuous dependent variable, whereas discriminant analysis has continuous independent variables and a categorical dependent variable (i.e. the class label).[3] Logistic regression and probit regression are more similar to LDA than ANOVA is, as they also explain a categorical variable by the values of continuous independent variables. These other methods are preferable in applications where it is not reasonable to assume that the independent variables are normally distributed, which is a fundamental assumption of the LDA method.

LDA is also closely related to principal component analysis (PCA) and factor analysis in that they both look for linear combinations of variables which best explain the data.[4] LDA explicitly attempts to model the difference between the classes of data. PCA on the other hand does not take into account any difference in class, and factor analysis builds the feature combinations based on differences rather than similarities. Discriminant analysis is also different from factor analysis in that it is not an interdependence technique: a distinction between independent variables and dependent variables (also called criterion variables) must be made.

LDA works when the measurements made on independent variables for each observation are continuous quantities.
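A minimal sketch with scikit-learn's LinearDiscriminantAnalysis; the iris dataset is an illustrative choice with continuous features and a categorical class label. Unlike PCA, the fit uses the labels:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Iris: 4 continuous independent variables, 3-class categorical label
X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis(n_components=2)  # at most (n_classes - 1) components
X_proj = lda.fit_transform(X, y)                  # supervised projection, unlike PCA

print(lda.explained_variance_ratio_)
print(lda.score(X, y))  # LDA doubles as a classifier
```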

Automatic Variable Selection

Forward :

  1. Modeling starts with only the intercept.
  2. Variables are added one by one.
  3. The most significant variable is added first.
  4. The process continues until no significant variables are left to add.

Backward :

  1. Modeling starts with all the variables.
  2. Variables are removed one by one.
  3. The least significant variable is removed first.
  4. The process continues until only significant variables remain.

Step-wise :

  1. Combines forward and backward selection.
  2. Variables are added or removed based on their significance at that particular step (see the sketch after this list).
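As a sketch, scikit-learn's SequentialFeatureSelector implements the same greedy idea; note that it adds or removes features by cross-validated score rather than by p-value significance, which is a deliberate substitution for the textbook procedure above:

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)

# direction="forward" starts from the empty set and adds features one by one;
# direction="backward" starts from all features and removes them one by one.
sfs = SequentialFeatureSelector(LinearRegression(),
                                n_features_to_select=4,
                                direction="forward")
sfs.fit(X, y)
print(sfs.get_support())  # boolean mask of the selected features
```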

Other Methods :

Low Variance : remove variables with low variance (i.e., nearly the same value for every observation).
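For instance, a sketch with scikit-learn's VarianceThreshold (the toy matrix is an assumption):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[1.0, 0.0, 3.1],
              [1.0, 0.1, 2.9],
              [1.0, 0.3, 3.5]])  # first column is constant

selector = VarianceThreshold()         # default threshold 0.0 drops constant features
X_reduced = selector.fit_transform(X)
print(selector.get_support())          # [False  True  True]: constant column removed
```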

Decision Trees : variables used in splits near the top of a fitted tree are the most informative and can be kept.

Random Forest : use the model's built-in variable-importance scores to rank and select features.
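A minimal sketch of ranking variables by a random forest's built-in importances (the dataset choice is illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(data.data, data.target)

# Impurity-based importance per feature: higher means more useful for splitting
ranked = sorted(zip(data.feature_names, rf.feature_importances_), key=lambda t: -t[1])
for name, imp in ranked[:5]:
    print(f"{name}: {imp:.3f}")
```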

Multicollinearity : drop variables that are highly correlated with other predictors; the variance inflation factor (VIF) is a common diagnostic.
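A sketch using the variance inflation factor from statsmodels; the simulated collinear data and the conventional VIF > 5 warning threshold are assumptions:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=100)})
df["x2"] = 0.9 * df["x1"] + 0.1 * rng.normal(size=100)  # nearly collinear with x1
df["x3"] = rng.normal(size=100)
df["const"] = 1.0  # intercept column, since VIF assumes a regression with a constant

# VIF of each predictor against all the others; VIF > 5 is a common warning sign
for i, col in enumerate(["x1", "x2", "x3"]):
    print(col, variance_inflation_factor(df[["x1", "x2", "x3", "const"]].values, i))
```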

Backward Feature Elimination : In this method, we start with all n features. We fit the model n times, each time leaving out one variable, and compute the sum of squared residuals (SSR) for each fit. The variable whose removal produces the smallest increase in SSR is then eliminated, leaving n−1 input features; the round repeats until the desired number of features remains.
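A sketch of one elimination round under these definitions, with plain least squares in numpy; the synthetic data is an assumption, and in practice the round would repeat until the desired number of features remains:

```python
import numpy as np

def ssr(X, y):
    """Sum of squared residuals of an ordinary least-squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                    # n = 5 candidate features
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(size=100)

# Fit n models, each with one feature left out, and record the SSR
ssr_without = [ssr(np.delete(X, j, axis=1), y) for j in range(X.shape[1])]

# Drop the feature whose removal increases SSR the least
drop = int(np.argmin(ssr_without))
X_reduced = np.delete(X, drop, axis=1)           # n-1 features remain
print("dropped feature index:", drop)
```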