Hierarchical Clustering
==========================

Learning Objectives

- Understand the hierarchical clustering algorithm
- Learn the binary tree and its properties
- Use the tree's properties to gain insight into data
- Implement the algorithm for hierarchical clustering

Limitations of the k-means Algorithm
--------------------------------------

The previous chapter explains the k-means clustering algorithm.  It is simple and useful as an initial analysis of data, but it also has several major problems:

- There is no obvious reason to select a particular value of ``k``.  The previous chapter shows an example: the data is generated for 10 clusters, but the smallest distance occurs when ``k`` is 13.

- The algorithm is non-deterministic, and it is often necessary to run the same program multiple times.

- For a given cluster, there is no easy way to tell which other cluster is the closest.  It is possible to calculate the distances between centroids, but this requires additional work.

- The k-means algorithm assigns all data points to one of the ``k`` clusters in the very first step.  Then the algorithm adjusts the clusters by finding the closest centroid for each data point.  Because the initial assignments have a direct impact on the final results, the program often needs to be run multiple times to select a better result.

*Hierarchical clustering* is a method that does not have these problems.

Examples of Hierarchical Clustering
-------------------------------------

Hierarchical clustering iteratively finds the closest pair of clusters (at the beginning, each data point is a cluster by itself).  The algorithm makes the pair the two children of a binary tree node.  In hierarchical clustering, the tree is called a *dendrogram* instead of a binary tree; this book uses the two terms interchangeably.  The new tree node becomes a new cluster.  The algorithm continues until only one tree node is left.

Before formally describing the algorithm, let us go through several examples.  The simplest case is when there is only one data point, but it is not very meaningful.  Thus, let us start with two data points.

+-------+-----+-----+
| index | x   | y   |
+=======+=====+=====+
| 0     | -66 | 45  |
+-------+-----+-----+
| 1     | 95  | -84 |
+-------+-----+-----+

They are the two children of a binary tree:

.. figure:: hierarchical/figures/cluster2.png

   Two data points are the two children of a binary tree.

Next, we consider the case when there are four data points:

+-------+-----+-----+
| index | x   | y   |
+=======+=====+=====+
| 0     | -66 | 45  |
+-------+-----+-----+
| 1     | 95  | -84 |
+-------+-----+-----+
| 2     | -35 | -70 |
+-------+-----+-----+
| 3     | 26  | 94  |
+-------+-----+-----+

We use the *Euclidean distance* here.  Consider two data points :math:`(x_a, y_a)` and :math:`(x_b, y_b)`.  The distance between them is defined as :math:`\sqrt{(x_a - x_b)^2 + (y_a - y_b)^2}`.  Consider three data points :math:`(x_a, y_a)`, :math:`(x_b, y_b)`, and :math:`(x_c, y_c)`.  The following equivalence holds:

.. math::

   \sqrt{(x_a - x_b)^2 + (y_a - y_b)^2} > \sqrt{(x_a - x_c)^2 + (y_a - y_c)^2}
   \;\Leftrightarrow\;
   (x_a - x_b)^2 + (y_a - y_b)^2 > (x_a - x_c)^2 + (y_a - y_c)^2.

We care only about the order of the distances, so we can skip the square root and compare the squared distances directly.
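As a quick check, the following small sketch computes these squared distances for the four points above; its output matches the table below.  The sketch is only an illustration: the variable names and the use of plain Python lists are choices made here, not part of the book's example program.

.. code-block:: python

   # Compute the squared Euclidean distance between every pair of points.
   points = [(-66, 45), (95, -84), (-35, -70), (26, 94)]

   def squared_distance(a, b):
       # (x_a - x_b)**2 + (y_a - y_b)**2; the square root is skipped
       # because only the order of the distances matters here.
       return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

   for i in range(len(points)):
       print([squared_distance(points[i], points[j]) for j in range(len(points))])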
The following table shows :math:`(x_a - x_b)^2 + (y_a - y_b)^2` for every pair of data points.

+---+-------+-------+-------+-------+
|   |     0 |     1 |     2 |     3 |
+===+=======+=======+=======+=======+
| 0 |     0 | 42562 | 14186 | 10865 |
+---+-------+-------+-------+-------+
| 1 | 42562 |     0 | 17096 | 36445 |
+---+-------+-------+-------+-------+
| 2 | 14186 | 17096 |     0 | 30617 |
+---+-------+-------+-------+-------+
| 3 | 10865 | 36445 | 30617 |     0 |
+---+-------+-------+-------+-------+

The shortest distance (10865) occurs between :math:`(x_0, y_0)` and :math:`(x_3, y_3)`.  They are the first pair to be put into the same cluster.  Thus, points 0 and 3 become the two children of the same parent node.

How do we represent the cluster that includes points 0 and 3?  There are several commonly used representations.  This example uses the centroid method: a cluster is represented by the centroid of the data points in the cluster.  The cluster that contains :math:`(x_0, y_0)` and :math:`(x_3, y_3)` is represented by :math:`(\frac{-66+26}{2}, \frac{45+94}{2}) = (-20, 69.5)`.  This centroid becomes the new :math:`(x_0, y_0)`, and :math:`(x_3, y_3)` no longer exists.  We can recompute the distances among the pairs of points:

+---+----------+----------+----------+
|   |        0 |        1 |        2 |
+===+==========+==========+==========+
| 0 |      0.0 | 36787.25 | 19685.25 |
+---+----------+----------+----------+
| 1 | 36787.25 |        0 |    17096 |
+---+----------+----------+----------+
| 2 | 19685.25 |    17096 |        0 |
+---+----------+----------+----------+

Now, the shortest distance (17096) occurs between :math:`(x_1, y_1)` and :math:`(x_2, y_2)`, so these two data points become the two children of a tree node.  At this moment only two clusters are left, and they become the two children of the root node.  The final result is shown below:

.. figure:: hierarchical/figures/cluster4.png

   Clusters of four data points.  Do not worry about the colors, nor the numbers along the vertical axis.  Pay attention to the shape only.
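To make the recomputation above concrete, the following small sketch repeats it in code: it replaces points 0 and 3 with their centroid and prints the new squared distances.  This is only an illustration of the worked example; it is not the program used to generate these figures.

.. code-block:: python

   # Recompute the distance table after points 0 and 3 are merged.
   points = [(-66, 45), (95, -84), (-35, -70), (26, 94)]

   def squared_distance(a, b):
       return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

   # The centroid of points 0 and 3 becomes the new point 0; point 3 is removed.
   centroid = ((points[0][0] + points[3][0]) / 2,
               (points[0][1] + points[3][1]) / 2)
   clusters = [centroid, points[1], points[2]]

   for a in clusters:
       print([squared_distance(a, b) for b in clusters])

   # The rows match the recomputed table above: the smallest non-zero
   # entry is now 17096, the distance between points 1 and 2.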
Next, let's add two more data points.

+-------+-----+-----+
| index | x   | y   |
+=======+=====+=====+
| 0     | -66 | 45  |
+-------+-----+-----+
| 1     | 95  | -84 |
+-------+-----+-----+
| 2     | -35 | -70 |
+-------+-----+-----+
| 3     | 26  | 94  |
+-------+-----+-----+
| 4     | 15  | 20  |
+-------+-----+-----+
| 5     | 66  | -3  |
+-------+-----+-----+

The pairwise distances are shown below:

+---+-------+-------+-------+-------+-------+-------+
|   |     0 |     1 |     2 |     3 |     4 |     5 |
+===+=======+=======+=======+=======+=======+=======+
| 0 |     0 | 42562 | 14186 | 10865 |  7186 | 19728 |
+---+-------+-------+-------+-------+-------+-------+
| 1 | 42562 |     0 | 17096 | 36445 | 17216 |  7402 |
+---+-------+-------+-------+-------+-------+-------+
| 2 | 14186 | 17096 |     0 | 30617 | 10600 | 14690 |
+---+-------+-------+-------+-------+-------+-------+
| 3 | 10865 | 36445 | 30617 |     0 |  5597 | 11009 |
+---+-------+-------+-------+-------+-------+-------+
| 4 |  7186 | 17216 | 10600 |  5597 |     0 |  3130 |
+---+-------+-------+-------+-------+-------+-------+
| 5 | 19728 |  7402 | 14690 | 11009 |  3130 |     0 |
+---+-------+-------+-------+-------+-------+-------+

The shortest distance (3130) occurs between :math:`(x_4, y_4)` and :math:`(x_5, y_5)`.  These two data points become the two children of a node.  This cluster is represented by the centroid :math:`(\frac{15 + 66}{2}, \frac{20 + (-3)}{2}) = (\frac{81}{2}, \frac{17}{2}) = (40.5, 8.5)`.  The centroid becomes the new :math:`(x_4, y_4)`, and :math:`(x_5, y_5)` is removed.  Now there are four data points and one cluster.  The distances are shown in the following table:

+---+---------+---------+---------+---------+---------+
|   |       0 |       1 |       2 |       3 |       4 |
+===+=========+=========+=========+=========+=========+
| 0 |       0 |   42562 |   14186 |   10865 | 12674.5 |
+---+---------+---------+---------+---------+---------+
| 1 |   42562 |       0 |   17096 |   36445 | 11526.5 |
+---+---------+---------+---------+---------+---------+
| 2 |   14186 |   17096 |       0 |   30617 | 11862.5 |
+---+---------+---------+---------+---------+---------+
| 3 |   10865 |   36445 |   30617 |       0 |  7520.5 |
+---+---------+---------+---------+---------+---------+
| 4 | 12674.5 | 11526.5 | 11862.5 |  7520.5 |       0 |
+---+---------+---------+---------+---------+---------+

The smallest distance is 7520.5, between :math:`(x_3, y_3)` and the cluster created earlier.  Thus, :math:`(x_3, y_3)` and the cluster are put together by making them the two children of a binary tree node.  This process continues until only one cluster is left.  The final binary tree is shown below:

.. figure:: hierarchical/figures/cluster6.png

   Clusters of the six data points.

The examples are generated using the following program:

.. literalinclude:: hierarchical/code/example1.py
   :language: python

Hierarchical Clustering Algorithm
-----------------------------------

The hierarchical clustering algorithm starts by treating each data point as a cluster.  Then it repeatedly finds the two closest clusters.  These two clusters become the two children of a binary tree node and together form one new cluster.  As a result, each step turns two clusters into one.  This process continues until only one cluster is left.  The algorithm is described below:

.. figure:: hierarchical/figures/hierarchicalalgorithm.png

   Hierarchical Clustering Algorithm
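As a concrete companion to the figure, the following is a minimal, self-contained sketch of this algorithm in Python.  It uses the centroid representation from the earlier examples and builds the tree out of nested tuples; the function names and the tuple-based tree are illustrative choices for this sketch, not the data structures of the implementation later in this chapter.

.. code-block:: python

   # A minimal sketch of hierarchical clustering with the centroid method.
   # Each cluster is a pair (centroid, tree), where tree is either a single
   # data point (a leaf) or a tuple (left_subtree, right_subtree).

   def squared_distance(a, b):
       return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

   def hierarchical_clustering(points):
       # At the beginning, every data point is a cluster by itself.
       clusters = [(p, p) for p in points]
       while len(clusters) > 1:
           # Find the two closest clusters (by centroid distance).
           best = None
           for i in range(len(clusters)):
               for j in range(i + 1, len(clusters)):
                   d = squared_distance(clusters[i][0], clusters[j][0])
                   if best is None or d < best[0]:
                       best = (d, i, j)
           _, i, j = best
           (ca, ta), (cb, tb) = clusters[i], clusters[j]
           # The two clusters become the children of a new tree node,
           # represented by the centroid of their two centroids.
           merged = (((ca[0] + cb[0]) / 2, (ca[1] + cb[1]) / 2), (ta, tb))
           clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
           clusters.append(merged)
       return clusters[0][1]   # the root of the dendrogram

   points = [(-66, 45), (95, -84), (-35, -70), (26, 94), (15, 20), (66, -3)]
   print(hierarchical_clustering(points))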
Defining the Distance between Two Clusters
--------------------------------------------

The earlier examples use the centroid to represent each cluster.  Other definitions of the distance between two clusters are also used.

There are four commonly adopted definitions for the distance between two clusters (*An Introduction to Statistical Learning* by James, Witten, Hastie, and Tibshirani): *complete*, *single*, *average*, and *centroid*.  They are described below.  Suppose :math:`A = \{a_1, a_2, ..., a_n\}` is a cluster and :math:`a_1`, :math:`a_2`, ..., :math:`a_n` are the :math:`n` data points in this cluster.  Suppose :math:`B = \{b_1, b_2, ..., b_m\}` is another cluster and :math:`b_1`, :math:`b_2`, ..., :math:`b_m` are the :math:`m` data points in this cluster.  The distance between these two clusters can be defined as follows.

- Complete: Compute the pairwise distances between every point in :math:`A` and every point in :math:`B`, then select the largest distance.  Mathematically, the distance is defined as :math:`\underset{a_i \in A, b_j \in B}{\max}{|a_i - b_j|}`.  Here, :math:`|a_i - b_j|` means the distance between the two points.

- Single: Consider the smallest distance among all pairs of points in :math:`A` and :math:`B`: :math:`\underset{a_i \in A, b_j \in B}{\min}{|a_i - b_j|}`.

- Average: Compute the average of the pairwise distances: :math:`\frac{1}{n \times m} \underset{a_i \in A, b_j \in B}{\sum} {|a_i - b_j|}`.

- Centroid: Find the centroid :math:`c_A` of :math:`A` and the centroid :math:`c_B` of :math:`B`, as described in the k-means chapter.  The distance between the two clusters is the distance between the two centroids: :math:`|c_A - c_B|`.

A short code sketch of these four definitions is given at the end of this chapter.

Implementation
--------------

Data Structures
^^^^^^^^^^^^^^^

The implementation uses two types of data structures: *binary tree nodes* and *list nodes*.  Each binary tree node represents a cluster: a leaf node (a node with no child) holds one original data point, and an interior node represents the cluster formed by fusing its two children.  The list keeps track of the clusters that have not yet been merged.

Procedure
^^^^^^^^^

At the beginning, each data point is a cluster of its own, stored in a leaf node of the tree, and all of these nodes are placed in the list.  Then, the closest two clusters are fused into a single cluster: the two clusters are removed from the list, they become the two children of a new binary tree node, and this new node is added to the list.  In each step, two clusters are removed and one cluster is added; as a result, the number of clusters is reduced by one.  This process continues until only one cluster is left.

TO DO: Make a figure

The program's starting point is relatively simple.  It accepts one argument, the input file.  Please notice that this is different from the k-means solution, since hierarchical clustering does not need the value of ``k``.

TO DO: Explain the code

.. literalinclude:: hierarchical/code/main.py
   :language: python

.. literalinclude:: hierarchical/code/cluster.py
   :language: python

.. literalinclude:: hierarchical/code/hctree.py
   :language: python

.. literalinclude:: hierarchical/code/hclist.py
   :language: python
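As a complement, the following sketch shows one way the four definitions of cluster distance given earlier in this chapter could be written for two-dimensional points.  The function names and the use of plain tuples are assumptions made for this illustration; they are not taken from the files included above.

.. code-block:: python

   # Sketches of the four definitions of the distance between two clusters.
   # A cluster is a list of (x, y) points; math.dist is the Euclidean distance.
   from math import dist

   def complete_distance(A, B):
       # Largest pairwise distance between a point in A and a point in B.
       return max(dist(a, b) for a in A for b in B)

   def single_distance(A, B):
       # Smallest pairwise distance between a point in A and a point in B.
       return min(dist(a, b) for a in A for b in B)

   def average_distance(A, B):
       # Average of all pairwise distances.
       return sum(dist(a, b) for a in A for b in B) / (len(A) * len(B))

   def centroid_distance(A, B):
       # Distance between the centroids of the two clusters.
       ca = (sum(x for x, _ in A) / len(A), sum(y for _, y in A) / len(A))
       cb = (sum(x for x, _ in B) / len(B), sum(y for _, y in B) / len(B))
       return dist(ca, cb)

   # The two clusters from the four-point example: {0, 3} and {1, 2}.
   A = [(-66, 45), (26, 94)]
   B = [(95, -84), (-35, -70)]
   print(complete_distance(A, B), single_distance(A, B))
   print(average_distance(A, B), centroid_distance(A, B))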