Clustering is a set of data-driven partitioning techniques designed to group a collection of objects into clusters.
Clustering is finding the borders between groups.
Segmentation is using those borders to form groups.
Methods :
- Linkage method
- Variance method
- Centroid method
Types :
Agglomerative : Start from n clusters and get to one cluster
Divisive : Start from one cluster and get to n clusters
Applications :
Market Segmentation
Sales segmentation : what type of customer wants what
Credit risk
Operations : identifying high-performing employees for promotions
Insurance : identifying groups with high average claim cost
Data reduction : grouping observations to reduce the number of observations
Distance Between Clusters
Single link : Shortest distance between an element in one cluster and an element in the other cluster.
Complete link : Largest distance between elements of the two clusters.
Average : Average of all pairwise distances between elements of the two clusters.
Centroid : Distance between the centroids of the two clusters.
Medoid : Distance between the most centrally located objects (medoids) of the two clusters.
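As a sketch of how these definitions differ in practice, each distance can be computed directly with NumPy/SciPy; the two small clusters A and B below are made-up data for illustration:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two small example clusters (rows = observations) -- made-up data
A = np.array([[1.0, 2.0], [2.0, 2.5]])
B = np.array([[6.0, 7.0], [7.5, 8.0], [6.5, 6.0]])

d = cdist(A, B)          # all pairwise distances between elements of A and B

single   = d.min()       # single link: shortest pairwise distance
complete = d.max()       # complete link: largest pairwise distance
average  = d.mean()      # average link: mean of all pairwise distances
centroid = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))  # between centroids

# medoid: most centrally located object of each cluster
mA = A[cdist(A, A).sum(axis=1).argmin()]
mB = B[cdist(B, B).sum(axis=1).argmin()]
medoid = np.linalg.norm(mA - mB)
```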
How to build clusters :
Select a distance measure > Select a clustering algorithm > Define the distance between two clusters > Determine the number of clusters > Validate the analysis
Non Hierarchical
It is based on dividing objects into non-overlapping subsets. It is faster and more reliable than hierarchical clustering, and works with large datasets.
K means
The main objective is to form clusters that are internally homogeneous and heterogeneous with respect to one another. It works only with continuous variables.
Process :
- Identify value of ‘k’
- Assign random k observations as seeds
- Assign each record to one of the k seeds based on proximity
- Form clusters
- Calculate the centroids of the clusters
- Use these centroids as the new seeds
- Form new clusters
- Recalculate centroids and reassign records
- Continue until stable clusters are formed (cluster boundaries cease to change)
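A minimal K-means sketch using scikit-learn, which runs this assign-recompute loop internally; the data and the choice k = 3 are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 2))   # stand-in for real continuous features

# n_clusters is the chosen 'k'; n_init reruns with different random seeds
km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = km.fit_predict(X)      # iterate: assign records -> recompute centroids

print(km.cluster_centers_)      # final centroids
print(km.inertia_)              # within-cluster sum of squares
```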
Elbow Criterion:
K-means clustering does not itself provide an estimate of the number of clusters required, so the elbow criterion is used to determine the optimal number.
The method says to choose a number of clusters such that adding another cluster does not add sufficient information. The ratio of within-cluster variance to between-cluster variance is plotted against the number of clusters; the objective is to minimize the within-cluster distances and maximize the between-cluster distances.
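A sketch of the elbow plot, here using the within-cluster sum of squares (KMeans.inertia_) against k; X is assumed to be the feature matrix from the sketch above:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

inertias = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)   # within-cluster sum of squares at this k

plt.plot(ks, inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("within-cluster sum of squares")
plt.show()   # pick k at the 'elbow', where the curve flattens out
```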
Validation :
- R squared : R² = between-cluster sum of squares / total sum of squares
- Pseudo-F statistic
- CLC
- CCC (Cubic Clustering Criterion)
- Scree plot (elbow criterion)
- Silhouette
- PC (principal components) plot
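A quick sketch of two of these checks: R squared computed by hand, and scikit-learn's silhouette_score. X, labels, and km are assumed to come from the K-means sketch above:

```python
from sklearn.metrics import silhouette_score

# X, labels, km are assumed from the K-means sketch above
total_ss = ((X - X.mean(axis=0)) ** 2).sum()   # total sum of squares
r2 = 1 - km.inertia_ / total_ss                # R2 = between SS / total SS

score = silhouette_score(X, labels)            # mean silhouette, in [-1, 1]
print(r2, score)                               # higher is better for both
```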
Hierarchical
A set of nested clusters organized as a hierarchical tree. No upfront decision about the number of clusters is required.
It is not used when the data is large, due to its higher processing time.
Hierarchical clustering can be
- Top-down approach (divisive)
- Bottom-up approach (agglomerative)
The results of hierarchical clustering can be shown using a dendrogram.
- At the bottom, we start with n data points (observations), each assigned to a separate cluster
- The two closest clusters are then merged repeatedly until just one cluster remains at the top
- The height in the dendrogram at which two clusters are merged represents the distance between those two clusters in the data space.
The best choice for the number of clusters is the number of vertical lines in the dendrogram cut by a horizontal line that can traverse the maximum vertical distance without intersecting a cluster.
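A minimal bottom-up (agglomerative) sketch with SciPy, again assuming the X matrix from earlier; the Ward linkage and the cut into 3 clusters are illustrative choices:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

Z = linkage(X, method="ward")   # bottom-up merge tree ('single', 'complete', ... also work)

dendrogram(Z)                   # merge height = distance between the merged clusters
plt.show()

labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
```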
The decision to merge two clusters is taken on the basis of their closeness. There are multiple metrics for measuring the closeness of two clusters :
- Euclidean distance: ||a-b||_2 = √(Σ_i (a_i - b_i)²)
- Squared Euclidean distance: ||a-b||_2² = Σ_i (a_i - b_i)²
- Manhattan distance: ||a-b||_1 = Σ_i |a_i - b_i|
- Maximum distance: ||a-b||_∞ = max_i |a_i - b_i|
- Mahalanobis distance: √((a-b)ᵀ S⁻¹ (a-b)), where S is the covariance matrix
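All of these are available in scipy.spatial.distance; a small sketch with made-up vectors a and b:

```python
import numpy as np
from scipy.spatial.distance import chebyshev, cityblock, euclidean, mahalanobis, sqeuclidean

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 5.0])

euclidean(a, b)     # sqrt(sum((a_i - b_i)^2))
sqeuclidean(a, b)   # sum((a_i - b_i)^2)
cityblock(a, b)     # Manhattan: sum(|a_i - b_i|)
chebyshev(a, b)     # maximum: max(|a_i - b_i|)

# Mahalanobis takes the *inverse* of the covariance matrix S
data = np.random.default_rng(0).normal(size=(100, 3))  # made-up sample
S = np.cov(data, rowvar=False)
mahalanobis(a, b, np.linalg.inv(S))
```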
Singular Value Decomposition