Hierarchical Clustering
Learning Objectives
Understand the hierarchical clustering algorithm
Learn the binary tree and its properties
Use the tree’s properties to gain insight into the data
Implement the algorithm for hierarchical clustering
Limitations of the k-means Algorithm
The previous chapter explains the k-means clustering algorithm. It is simple and useful for an initial analysis of data, but it also has several major problems:
- There is no obvious way to select a particular value of \(k\). The previous chapter shows an example: the data is generated from 10 clusters, but the smallest distance occurs when \(k\) is 13.
- The algorithm is non-deterministic, and it is often necessary to run the same program multiple times.
- For a given cluster, there is no easy answer as to which other cluster is closest. It is possible to calculate the distances between centroids, but this requires additional work.
- The k-means algorithm assigns all data points to one of the \(k\) clusters in the very first step. Then the algorithm adjusts the clusters by finding the closest centroid for each data point. Because the initial assignments have direct impacts on the final results, the program often needs to run multiple times so that a better result can be selected.
Hierarchical clustering is a method that does not have these problems.
Examples of Hierarchical Clustering
Hierarchical clustering iteratively finds the closest pair of clusters; at the beginning, each data point is a cluster by itself. The algorithm makes the pair the two children of a binary tree node, and that node becomes a new cluster. For hierarchical clustering, the tree is called a dendrogram; this book uses the two terms interchangeably. The algorithm continues until only one tree node (one cluster) is left. Before formally describing the algorithm, let us go through several examples. The simplest case has only one data point, but it is not very meaningful. Thus, let us start with two data points.
| index | x | y |
|---:|---:|---:|
| 0 | -66 | 45 |
| 1 | 95 | -84 |
They are the two children of a binary tree:
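The dendrogram for this two-point example can be drawn with scipy. This is a minimal sketch (not part of the chapter’s program, which appears at the end of this section) using the same linkage and dendrogram functions:
# sketch: dendrogram for the two data points above
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

points = [[-66, 45], [95, -84]]
Z = linkage(points, 'centroid')   # with only two points, a single merge is produced
dendrogram(Z)
plt.show()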
Next, we consider the case when there are four data points:
| index | x | y |
|---:|---:|---:|
| 0 | -66 | 45 |
| 1 | 95 | -84 |
| 2 | -35 | -70 |
| 3 | 26 | 94 |
We use the Euclidean distance here. Consider two data points: \((x_a, y_a)\) and \((x_b, y_b)\). The distance between them is defined as
\(\sqrt{(x_a - x_b)^2 + (y_a - y_b)^2}\).
Consider three data points: \((x_a, y_a)\), \((x_b, y_b)\), and \((x_c, y_c)\). The following equivalence holds:
\(\sqrt{(x_a - x_b)^2 + (y_a - y_b)^2} > \sqrt{(x_a - x_c)^2 + (y_a - y_c)^2} \Leftrightarrow ((x_a - x_b)^2 + (y_a - y_b)^2) > ((x_a - x_c)^2 + (y_a - y_c)^2)\).
Since we care only about the order of the distances, we can skip the square root and compare the squared distances directly.
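As a minimal sketch of this idea (the helper name squared_distance is chosen here for illustration):
# squared Euclidean distance; it preserves the order of the true distances,
# so the square root can be skipped when only the order matters
def squared_distance(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

p0, p1, p3 = (-66, 45), (95, -84), (26, 94)
print(squared_distance(p0, p1))   # 42562
print(squared_distance(p0, p3))   # 10865, the smaller of the two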
The following table shows \((x_a - x_b)^2 + (y_a - y_b)^2\) between the pairs of the data points.
| | 0 | 1 | 2 | 3 |
|---:|---:|---:|---:|---:|
| 0 | 0 | 42562 | 14186 | 10865 |
| 1 | 42562 | 0 | 17096 | 36445 |
| 2 | 14186 | 17096 | 0 | 30617 |
| 3 | 10865 | 36445 | 30617 | 0 |
The shortest distance (10865) occurs between \((x_0, y_0)\) and \((x_3,y_3)\). They are the first pair to be put into the same cluster. Thus, points 0 and 3 are the two children of the same parent node.
How do we represent the cluster that includes points 0 and 3? There are several commonly used representations. This example uses the centroid method: a cluster is represented by the centroid of the data points in the cluster. The cluster that contains \((x_0, y_0)\) and \((x_3, y_3)\) is represented by \((\frac{-66+26}{2}, \frac{45+94}{2}) = (-20, 69.5)\). This centroid becomes the new \((x_0, y_0)\), and \((x_3, y_3)\) no longer exists. We can recompute the distances among the pairs of points:
| | 0 | 1 | 2 |
|---:|---:|---:|---:|
| 0 | 0.0 | 36787.25 | 19685.25 |
| 1 | 36787.25 | 0 | 17096 |
| 2 | 19685.25 | 17096 | 0 |
Now the shortest distance (17096) occurs between \((x_1, y_1)\) and \((x_2, y_2)\). These two data points become the two children of a tree node. At this point only two clusters remain, and they become the two children of the root node. The final result is shown below.
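As a quick check, the four-point example can be reproduced with scipy’s centroid linkage. This is a small sketch using the four points listed above:
# sketch: the four-point example with scipy's centroid linkage
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

points = [[-66, 45], [95, -84], [-35, -70], [26, 94]]
Z = linkage(points, 'centroid')
print(Z)    # each row: the two merged clusters, their distance, the new cluster's size
dendrogram(Z)
plt.show()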
Next, let’s add two more data points.
| index | x | y |
|---:|---:|---:|
| 0 | -66 | 45 |
| 1 | 95 | -84 |
| 2 | -35 | -70 |
| 3 | 26 | 94 |
| 4 | 15 | 20 |
| 5 | 66 | -3 |
The pairwise distances are shown below:
| | 0 | 1 | 2 | 3 | 4 | 5 |
|---:|---:|---:|---:|---:|---:|---:|
| 0 | 0 | 42562 | 14186 | 10865 | 7186 | 19728 |
| 1 | 42562 | 0 | 17096 | 36445 | 17216 | 7402 |
| 2 | 14186 | 17096 | 0 | 30617 | 10600 | 14690 |
| 3 | 10865 | 36445 | 30617 | 0 | 5597 | 11009 |
| 4 | 7186 | 17216 | 10600 | 5597 | 0 | 3130 |
| 5 | 19728 | 7402 | 14690 | 11009 | 3130 | 0 |
The shortest distance (3130) occurs between \((x_4, y_4)\) and \((x_5, y_5)\). These two data points become the two children of a node. This cluster is represented by the centroid \((\frac{15 + 66}{2}, \frac{20 + (-3)}{2}) = (\frac{81}{2}, \frac{17}{2}) = (40.5, 8.5)\). The centroid becomes the new \((x_4, y_4)\), and \((x_5, y_5)\) is removed. Now there are four data points and one cluster. The distances are shown in the following table:
| | 0 | 1 | 2 | 3 | 4 |
|---:|---:|---:|---:|---:|---:|
| 0 | 0 | 42562 | 14186 | 10865 | 12674.5 |
| 1 | 42562 | 0 | 17096 | 36445 | 11526.5 |
| 2 | 14186 | 17096 | 0 | 30617 | 11862.5 |
| 3 | 10865 | 36445 | 30617 | 0 | 7520.5 |
| 4 | 12674.5 | 11526.5 | 11862.5 | 7520.5 | 0 |
The smallest distance is 7520.5, between \((x_3, y_3)\) and the cluster created earlier. Thus, \((x_3, y_3)\) and that cluster are put together by making them the two children of a binary tree node. This process continues until only one cluster is left. The final binary tree is shown below:
The examples are generated using the following program:
#!/usr/bin/python3
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
import numpy as np
import random
from scipy.spatial.distance import pdist

random.seed(1)

def distance(x1, y1, x2, y2):
    # no need to take square root
    # print (x1 - x2, y1 - y2)
    return (x1 - x2) * (x1 - x2) + (y1 - y2) * (y1 - y2)

if __name__ == "__main__":
    random.seed(1)
    numdp = 6
    x = []
    for i in range(numdp):
        x.append([random.randint(-100, 100),
                  random.randint(-100, 100)])
    print (x)
    Z = linkage(x, 'centroid')
    dendrogram(Z)
    plt.show()
    pairdist = []
    for i1 in range(numdp):
        row = [0] * numdp
        pairdist.append(row)
    for i1 in range(numdp):
        for i2 in range(numdp):
            d = distance(x[i1][0], x[i1][1], x[i2][0], x[i2][1])
            pairdist[i1][i2] = d
        print ('pairdist', pairdist[i1])
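The pairwise distance tables above can also be produced with scipy directly. This is a small sketch, assuming the same six points; pdist (already imported above) returns the condensed Euclidean distances and squareform expands them into a full matrix. The values are squared here to match the tables:
# sketch: the pairwise distance table with scipy
import numpy as np
from scipy.spatial.distance import pdist, squareform

points = [[-66, 45], [95, -84], [-35, -70], [26, 94], [15, 20], [66, -3]]
table = squareform(pdist(points, 'euclidean')) ** 2   # squared distances
print(np.round(table).astype(int))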
Hierarchical Clustering Algorithm
The hierarchical clustering algorithm starts by treating each data point as a cluster. It then repeatedly finds the two closest clusters and makes them the two children of a new binary tree node; the new node is a new cluster that replaces the two. As a result, each step reduces the number of clusters by one. This process continues until only one cluster is left. The algorithm is described below:
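The following outline is a rough sketch of the procedure (the chapter’s actual implementation appears later); distance_between stands for whichever cluster-distance definition is chosen in the next subsection, and the sketch tracks only cluster membership, not the tree.
# outline of hierarchical clustering (sketch, not the chapter's implementation)
def hierarchical_clustering(points, distance_between):
    # each data point starts as its own cluster
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        # find the pair of clusters with the smallest distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = distance_between(clusters[i], clusters[j])
                if (best is None) or (d < best[0]):
                    best = (d, i, j)
        _, i, j = best
        # merge the two closest clusters into one
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0]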
Define Distance of Two Clusters
The earlier examples use the centroid to represent each cluster. Other definitions of the distance between two clusters are also used. There are four commonly adopted definitions (An Introduction to Statistical Learning by James, Witten, Hastie, and Tibshirani): complete, single, average, and centroid. They are described below.
Suppose \(A = \{a_1, a_2, ..., a_n\}\) is a cluster and \(a_1\), \(a_2\), …, \(a_n\) are the \(n\) data points in this cluster. Suppose \(B = \{b_1, b_2, ..., b_m\}\) is another cluster and \(b_1\), \(b_2\), …, \(b_m\) are the \(m\) data points in this cluster. The distance between these two clusters can be defined as
Complete. Compute the pairwise distances between every point in \(A\) and every point in \(B\), then select the largest distance. Mathematically, the distance is defined as \(\underset{a_i \in A, b_j \in B}{\max}{|a_i - b_j|}\). Here, \(|a_i - b_j|\) means the distance between the two points.
Single. This definition takes the smallest distance among all pairs of points in \(A\) and \(B\): \(\underset{a_i \in A, b_j \in B}{\min}{|a_i - b_j|}\).
Average. This definition computes the average of the pairwise distances: \(\frac{1}{n \times m} \underset{a_i \in A, b_j \in B}{\Sigma} {|a_i - b_j|}\).
Centroid. Find the centroid \(c_A\) of \(A\) and the centroid \(c_B\) of \(B\), as in the chapter on k-means. The distance between the two clusters is the distance between the two centroids: \(|c_A - c_B|\). A small code sketch of all four definitions follows this list.
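The sketch below implements the four definitions for two-dimensional points; the function names are chosen here for illustration and are not part of the chapter’s implementation.
# sketch: the four cluster-distance definitions for 2-D points
from math import dist, fsum   # math.dist is the Euclidean distance (Python 3.8+)

def complete(A, B):
    # largest pairwise distance
    return max(dist(a, b) for a in A for b in B)

def single(A, B):
    # smallest pairwise distance
    return min(dist(a, b) for a in A for b in B)

def average(A, B):
    # average of all pairwise distances
    return fsum(dist(a, b) for a in A for b in B) / (len(A) * len(B))

def centroid(A, B):
    # distance between the two centroids
    ca = (fsum(a[0] for a in A) / len(A), fsum(a[1] for a in A) / len(A))
    cb = (fsum(b[0] for b in B) / len(B), fsum(b[1] for b in B) / len(B))
    return dist(ca, cb)

A = [(-66, 45), (26, 94)]   # the cluster of points 0 and 3
B = [(95, -84)]             # point 1 by itself
print(complete(A, B), single(A, B), average(A, B), centroid(A, B))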
Implementation
Data Structures
The implementation uses two data structures. A binary tree represents the clusters: each tree node is a cluster, and the original data points are stored in the leaf nodes. A doubly linked list holds the clusters (tree nodes) that have not been merged yet; the list shrinks by one node with every merge until a single cluster remains.
Procedure
To implement hierarchical clustering, we use two types of data structures: binary tree nodes and list nodes. Each binary tree node represents a cluster. The original data points are stored in the leaf nodes (a node is a leaf if it has no children), and each data point starts as a cluster of its own. These binary tree nodes are stored in a list. In each step, the closest two clusters are fused into a single cluster: the two clusters are removed from the list and the new cluster is added to the list. Because two clusters are removed and one is added, the number of clusters is reduced by one in each step. This process continues until only one cluster is left.
TO DO: Make a figure
The program’s starting point is relatively simple. It accepts the name of the input file and the choice of distance definition as arguments. Please notice that it is different from the k-means solution: hierarchical clustering does not need the value of \(k\).
The checkArgs function uses argparse to define two options: -f (or --filename) gives the name of the data file, and -d (or --distance) selects how the distance between two clusters is measured: c (complete), s (single), a (average), or t (centroid).
#!/usr/bin/python3
# main.py for hierarchical clustering
import sys
import cluster
import argparse

def checkArgs(argv = None):
    parser = argparse.ArgumentParser(description='parse arguments')
    parser.add_argument('-f', '--filename',
                        help = 'name of the data file', default = 'data.txt')
    parser.add_argument('-d', '--distance',
                        help = 'distance definition',
                        choices = ('c', 's', 'a', 't'), default = 'c')
    # four methods to measure distances between clusters:
    # complete, single, average, centroid
    args = parser.parse_args(argv)
    return args

if __name__ == "__main__":
    args = checkArgs(sys.argv[1:])
    cluster.cluster(args)
# cluster.py for hierarchical clustering
import sys
from hclist import HCList

def cluster(args):
    filename = args.filename
    distdef = args.distance
    try:
        fhd = open(filename)   # file handler
    except OSError:
        sys.exit('open fail')
    # create an empty hierarchical clustering list
    hcl = HCList(distdef)
    # read the data, one point per line
    count = 0
    for oneline in fhd:
        data = list(map(int, oneline.split()))
        count = count + 1
        # print ("add to list", data)
        hcl.add(data)
    # print ("--- finish reading file ---")
    hcl.printList()
    hcl.cluster()
    print ("--- final cluster ---")
    hcl.printList()
# hctree.py for hierarchical clustering
class TreeNode:
    def __init__(self, val):
        self.value = val
        self.left = None
        self.right = None

    def printNode(self):
        if ((self.left == None) and (self.right == None)):
            print ("Leaf: ", self.value)
        else:
            print (self.value)
        if (self.left != None):
            self.left.printNode()
        if (self.right != None):
            self.right.printNode()

    def getLeft(self):
        return self.left

    def getRight(self):
        return self.right

    def distanceComplete(self, trnd):
        dist1 = self._distanceCompleteNode(trnd)
        dist2 = dist1
        dist3 = dist1
        if (self.left != None):
            dist2 = self.left.distanceComplete(trnd)
        if (self.right != None):
            dist3 = self.right.distanceComplete(trnd)
        return max(dist1, dist2, dist3)

    def distanceSingle(self, trnd):
        # placeholder: not implemented yet
        return 0

    def distanceAverage(self, trnd):
        # placeholder: not implemented yet
        return 0

    def distanceCentroid(self, trnd):
        # placeholder: not implemented yet
        return 0

    def _distance(self, sndnode):
        # squared distance between the values of two nodes
        # (no need to take the square root)
        sum = 0
        for ind in range(len(self.value)):
            dist = self.value[ind] - sndnode.value[ind]
            sum = sum + dist * dist
        return sum

    def _distanceCompleteNode(self, trnd):
        # find the maximum distance of one node with all nodes in trnd
        dist1 = self._distance(trnd)
        dist2 = dist1
        dist3 = dist1
        if (trnd.left != None):
            dist2 = self._distanceCompleteNode(trnd.left)
        if (trnd.right != None):
            dist3 = self._distanceCompleteNode(trnd.right)
        return max(dist1, dist2, dist3)
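In the code above, distanceSingle, distanceAverage, and distanceCentroid are placeholders that return 0. As one possible way to fill in the single-linkage case, the following sketch (not the book’s final code) compares only the data points stored in the leaves, since non-leaf nodes carry placeholder values:
# sketch: a possible single-linkage distance between two TreeNode clusters
from hctree import TreeNode

def leaf_values(node):
    # collect the data points stored in the leaves of a cluster's tree
    if (node.left == None) and (node.right == None):
        return [node.value]
    values = []
    if node.left != None:
        values = values + leaf_values(node.left)
    if node.right != None:
        values = values + leaf_values(node.right)
    return values

def squared_distance(val1, val2):
    return sum((a - b) * (a - b) for a, b in zip(val1, val2))

def distance_single(trnd1, trnd2):
    # single linkage: the smallest pairwise (squared) distance
    return min(squared_distance(v1, v2)
               for v1 in leaf_values(trnd1)
               for v2 in leaf_values(trnd2))

# example: distance_single(TreeNode([0, 0]), TreeNode([3, 4])) is 25
The next file, hclist.py, stores these tree nodes in a doubly linked list.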
# hclist.py for hierarchical clustering
from hctree import TreeNode
import sys

class ListNode:
    def __init__(self, tn):
        # doubly linked list, with after and before
        # use after and before, not next and prev,
        # because next is already defined in Python
        self.trnode = tn
        self.after = None
        self.before = None

class HCList:
    def __init__(self, distdef):
        # the list has a dummy node. It is both head and tail
        trnd = TreeNode(None)
        lsnd = ListNode(trnd)
        self.head = lsnd
        self.tail = lsnd
        self.dist = distdef   # definition of distance

    # add a cluster to the list
    # a new cluster is stored in a tree node
    # the tree node is added to the list
    def add(self, val):
        trnd = TreeNode(val)
        lsnd = ListNode(trnd)
        self.tail.after = lsnd
        lsnd.before = self.tail
        self.tail = lsnd
        return trnd

    def delete(self, lsnd):
        # cannot delete the first node (dummy)
        if (lsnd == self.head):
            print ("ERROR, delete dummy head")
            return
        # delete the last node
        if (lsnd == self.tail):
            self.tail = self.tail.before
            self.tail.after = None
            return
        # delete a node in the middle
        before = lsnd.before
        after = lsnd.after
        before.after = after
        after.before = before

    def printList(self):
        print ("HCList::printList")
        lsnd = self.head.after
        while (lsnd != None):
            lsnd.trnode.printNode()
            lsnd = lsnd.after

    def distance(self, lsnd1, lsnd2):
        # lsnd1 and lsnd2 are list nodes; the distance is computed
        # between the tree nodes (clusters) they store
        if ((self.dist == 'c') or (self.dist == 'C')):
            return lsnd1.trnode.distanceComplete(lsnd2.trnode)
        if ((self.dist == 's') or (self.dist == 'S')):
            return lsnd1.trnode.distanceSingle(lsnd2.trnode)
        if ((self.dist == 'a') or (self.dist == 'A')):
            return lsnd1.trnode.distanceAverage(lsnd2.trnode)
        if ((self.dist == 't') or (self.dist == 'T')):
            return lsnd1.trnode.distanceCentroid(lsnd2.trnode)
        print ('Unknown method for calculating distance')
        return 0

    def merge(self, trnd1, trnd2):
        # value not used for non-leaf nodes
        val = [-1] * len(trnd1.value)
        trnd = self.add(val)
        trnd.left = trnd1
        trnd.right = trnd2
        trnd.printNode()

    def cluster(self):
        # this implementation is not efficient but it should
        # be easy to understand
        # remember that head is the dummy node
        while (self.head.after != self.tail):
            # at least two list nodes, not done yet
            # distance of the first two list nodes as reference
            lsnd1 = self.head.after
            lsnd2 = lsnd1.after
            mindist = self.distance(lsnd1, lsnd2)
            # find the pair with the smallest distance
            flsnd = self.head.after   # first list node
            while (flsnd != self.tail):
                slsnd = flsnd.after   # second list node
                while (slsnd != None):
                    dist = self.distance(flsnd, slsnd)
                    if (dist < mindist):   # notice, no =
                        mindist = dist
                        lsnd1 = flsnd
                        lsnd2 = slsnd
                    slsnd = slsnd.after
                flsnd = flsnd.after
            # remove the two nodes that have the smallest distance
            # from the list, then merge them into a new cluster
            print ("merge", lsnd1.trnode.value, lsnd2.trnode.value)
            self.delete(lsnd1)
            self.delete(lsnd2)
            self.merge(lsnd1.trnode, lsnd2.trnode)
        return self.head.after.trnode
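A small usage sketch: the file name data.txt and its contents below are only an example of the expected whitespace-separated format (one data point per line).
# sketch: run the clustering on a small example file
import cluster
from main import checkArgs

with open('data.txt', 'w') as f:
    f.write('-66 45\n95 -84\n-35 -70\n26 94\n')

args = checkArgs(['-f', 'data.txt', '-d', 'c'])   # 'c' selects complete linkage
cluster.cluster(args)
Equivalently, from a shell: python3 main.py -f data.txt -d c.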