Clustering is a set of data-driven partitioning techniques designed to group a collection of objects into clusters.
Clustering is finding the borders between groups.
Segmentation is using those borders to form groups.
Methods :
- Linkage method
- Variance method
- Centroid method
Types :
Agglomerative : Start from n clusters and get to one cluster
Divisive : Start from one cluster and get to n clusters
Applications :
Market Segmentation
Sales segmentation : what type of customer wants what
Credit risk
Operations : identifying high-performing employees for promotions
Insurance : identifying groups with high average claim cost
Data reduction : grouping observations to reduce the number of observations
Distance Between Clusters
Single link : Shortest distance between an element in one cluster and an element in the other cluster.
Complete link : Largest distance between elements of the two clusters.
Average : Average of all pairwise distances between elements of the two clusters.
Centroid : Distance between the centroids of the two clusters.
Medoid : Distance between the most centrally located objects (medoids) of the two clusters.
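As a sketch of how these definitions differ in practice, each distance can be computed directly with NumPy/SciPy; the two small clusters A and B below are made-up data for illustration:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two small example clusters (rows = observations) -- made-up data
A = np.array([[1.0, 2.0], [2.0, 2.5]])
B = np.array([[6.0, 7.0], [7.5, 8.0], [6.5, 6.0]])

d = cdist(A, B)          # all pairwise distances between elements of A and B

single   = d.min()       # single link: shortest pairwise distance
complete = d.max()       # complete link: largest pairwise distance
average  = d.mean()      # average link: mean of all pairwise distances
centroid = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))  # between centroids

# medoid: most centrally located object of each cluster
mA = A[cdist(A, A).sum(axis=1).argmin()]
mB = B[cdist(B, B).sum(axis=1).argmin()]
medoid = np.linalg.norm(mA - mB)
```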
How to build clusters :
Select a distance measure > Select a clustering algorithm > Define the distance between two clusters > Determine the number of clusters > Validate the analysis
Non Hierarchical
It is based on dividing objects into non-overlapping subsets. It is faster and more reliable than hierarchical clustering, and works with large datasets.
K means
The main objective is to form clusters that are internally homogeneous and heterogeneous with respect to one another. It works only with continuous variables.
Process :
- Identify value of ‘k’
- Assign random k observations as seeds
- Assign each record to one of the k seeds based on proximity
- Form clusters
- Calculate the centroids of the clusters
- Use these centroids as the new seeds
- Form new clusters
- Recalculate centroids and reassign records
- Continue until stable clusters are formed (cluster boundaries cease to change)
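A minimal K-means sketch using scikit-learn, which runs this assign-recompute loop internally; the data and the choice k = 3 are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 2))   # stand-in for real continuous features

# n_clusters is the chosen 'k'; n_init reruns with different random seeds
km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = km.fit_predict(X)      # iterate: assign records -> recompute centroids

print(km.cluster_centers_)      # final centroids
print(km.inertia_)              # within-cluster sum of squares
```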
Elbow Criterion:
K-means clustering does not itself provide an estimate of the number of clusters required, so the elbow criterion is used to determine the optimal number.
The method says to choose a number of clusters such that adding another cluster does not add sufficient information. The ratio of within-cluster variance to between-cluster variance is plotted against the number of clusters; the objective is to minimize the within-cluster distances and maximize the between-cluster distances.
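A sketch of the elbow plot, here using the within-cluster sum of squares (KMeans.inertia_) against k; X is assumed to be the feature matrix from the sketch above:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

inertias = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)   # within-cluster sum of squares at this k

plt.plot(ks, inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("within-cluster sum of squares")
plt.show()   # pick k at the 'elbow', where the curve flattens out
```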
Validation :
- R squared : R² = between-cluster sum of squares / total sum of squares
- Pseudo-F statistic
- CLC
- CCC (Cubic Clustering Criterion)
- Scree plot (elbow criterion)
- Silhouette
- PC (principal components) plot
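A quick sketch of two of these checks: R squared computed by hand, and scikit-learn's silhouette_score. X, labels, and km are assumed to come from the K-means sketch above:

```python
from sklearn.metrics import silhouette_score

# X, labels, km are assumed from the K-means sketch above
total_ss = ((X - X.mean(axis=0)) ** 2).sum()   # total sum of squares
r2 = 1 - km.inertia_ / total_ss                # R2 = between SS / total SS

score = silhouette_score(X, labels)            # mean silhouette, in [-1, 1]
print(r2, score)                               # higher is better for both
```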
Hierarchical
A set of nested clusters organized as a hierarchical tree. No upfront decision about the number of clusters is required.
It is not used when the data is large, due to its higher processing time.
Hierarchical clustering can be
- Top-down approach (divisive)
- Bottom-up approach (agglomerative)
The results of hierarchical clustering can be shown using a dendrogram.
- At the bottom, we start with n data points (observations), each assigned to a separate cluster
- The two closest clusters are then merged repeatedly until just one cluster remains at the top
- The height in the dendrogram at which two clusters are merged represents the distance between those two clusters in the data space.
The best choice for the number of clusters is the number of vertical lines in the dendrogram cut by a horizontal line that can traverse the maximum vertical distance without intersecting a cluster.
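A minimal bottom-up (agglomerative) sketch with SciPy, again assuming the X matrix from earlier; the Ward linkage and the cut into 3 clusters are illustrative choices:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

Z = linkage(X, method="ward")   # bottom-up merge tree ('single', 'complete', ... also work)

dendrogram(Z)                   # merge height = distance between the merged clusters
plt.show()

labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
```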
The decision to merge two clusters is taken on the basis of their closeness. There are multiple metrics for measuring the closeness of two clusters :
- Euclidean distance: ||a-b||_2 = √(Σ_i (a_i - b_i)²)
- Squared Euclidean distance: ||a-b||_2² = Σ_i (a_i - b_i)²
- Manhattan distance: ||a-b||_1 = Σ_i |a_i - b_i|
- Maximum distance: ||a-b||_∞ = max_i |a_i - b_i|
- Mahalanobis distance: √((a-b)ᵀ S⁻¹ (a-b)), where S is the covariance matrix
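All of these are available in scipy.spatial.distance; a small sketch with made-up vectors a and b:

```python
import numpy as np
from scipy.spatial.distance import chebyshev, cityblock, euclidean, mahalanobis, sqeuclidean

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 5.0])

euclidean(a, b)     # sqrt(sum((a_i - b_i)^2))
sqeuclidean(a, b)   # sum((a_i - b_i)^2)
cityblock(a, b)     # Manhattan: sum(|a_i - b_i|)
chebyshev(a, b)     # maximum: max(|a_i - b_i|)

# Mahalanobis takes the *inverse* of the covariance matrix S
data = np.random.default_rng(0).normal(size=(100, 3))  # made-up sample
S = np.cov(data, rowvar=False)
mahalanobis(a, b, np.linalg.inv(S))
```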
Singular Value Decomposition