K-Means Clustering
==================

Learning Objectives

- Understand data clustering
- Learn the k-means algorithm
- Write a test generator for creating test cases
- Implement the algorithm in Python

Clustering Data
---------------

Consider the following scenario: A department store is planning its annual
promotion. The store wants to give customers discount coupons based on their
purchase history. If a customer has purchased a desk at this store, the
customer is given a coupon for a desk. If a customer has purchased shoes, the
customer is given a coupon for shoes. If a customer has purchased a jacket,
the customer is given a coupon for a jacket. The store sells thousands of
products and it does not want to issue one type of coupon per product,
because doing so would create thousands of types of coupons. It would be too
expensive to make so many types of coupons and to program the checkout
machines to recognize all of them. Instead, the store wants to provide only a
few (say 10) types of coupons. The store wants to group the customers based
on what they bought together. For example:

- If a customer has bought shirts, shoes, and a sweater together, this
  customer is given a coupon for clothes.
- If another customer has bought a desk, a dining table, and four chairs,
  this customer is given a coupon for furniture.
- If a third customer has bought a tie and shoes, this customer also receives
  a coupon for clothes (not for furniture).
- If a fourth customer has bought a chair and a desk, this customer receives
  a coupon for furniture (not for clothes).

The problem is that the store does not know how to group the customers so
that the customers in the same group get the same coupon. In other words, the
store wants to divide the customers into groups so that the customers in each
group have bought similar items.

The discussion above assumes that customers would likely buy what they have
already bought. In reality, stores would give customers coupons for what they
may buy later. For example, if a customer has bought a desk, the customer is
unlikely to buy another desk soon. Instead of giving a coupon for another
desk, the store may give a coupon for a chair. If a customer has bought a
jacket, the store may give a coupon for a sweater.

Consider another example: A research project wants to find the commonalities
among lung cancer patients. The researchers want to divide the patients into
groups based on personal information and behavior, such as age, location,
occupation, education, marital status, diet, and amount of sleep. They want
to know whether one group has more patients than the other groups.

.. index::
   unsupervised learning
   unsupervised learning: clustering

Both examples are *clustering* problems: they divide data (people in both
cases) into groups so that the data inside each group are similar and the
data in different groups are dissimilar. This is unsupervised learning
because there is no teacher telling computers whether two pieces of data
belong to the same group. The correct answer depends on the other pieces of
data.

Clustering into Groups
----------------------

.. index::
   unsupervised learning: clustering: k-means

Consider the department store coupon problem again. Suppose the store decides
to issue exactly ten types of coupons. The problem becomes dividing the
customers into ten groups based on their purchase history.
This section describes the *k-means clustering algorithm*; here, :math:`k` is
the number of groups, which is given as part of the problem. For the
department store, :math:`k` is 10. Before explaining how this clustering
method works, let us understand what this method wants to accomplish. An
obvious question is how to decide the value of :math:`k`. This book will
discuss that later.

Suppose there are :math:`n` data points (each point is a person):
:math:`x_1`, :math:`x_2`, ..., :math:`x_n`. Each data point is a vector.
Suppose the department store considers these products:

- refrigerator
- washer
- dryer
- jackets
- shirts
- shoes
- pants
- sweaters

For example, :math:`x_1 = <1, 0, 0, 3, 5, 0, 0, 2>` means that in the past
twelve months the first customer has bought one refrigerator, no washer, no
dryer, three jackets, five shirts, no shoes, no pants, and two sweaters.
Similarly, :math:`x_2 = <0, 1, 1, 0, 0, 1, 0, 0>` means the second customer
has bought no refrigerator, one washer, one dryer, no jackets, no shirts, one
pair of shoes, no pants, and no sweaters.

Define the K-Means Clustering Problem
-------------------------------------

We want to group these :math:`n` data points into :math:`k` clusters:
:math:`C_1`, :math:`C_2`, ..., :math:`C_k`. Obviously, :math:`k` must not
exceed :math:`n`. In practice, :math:`k` is usually much smaller than
:math:`n`, i.e., :math:`k \ll n`. Each cluster contains a set of data points.
In this book, a vector uses a lower-case letter such as :math:`x`; a set uses
an upper-case letter such as :math:`C`. The following properties must be
satisfied:

- A data point must belong to one cluster: :math:`\forall i, 1 \le i \le n,
  \exists m, 1 \le m \le k`, such that :math:`x_i \in C_m`. This means that
  for any value of :math:`i` between 1 and :math:`n` (inclusive), there is a
  value :math:`m` between 1 and :math:`k` (inclusive) such that :math:`x_i`
  is an element of :math:`C_m`.
- Each cluster contains one or more data points, i.e., :math:`|C_j| \ge 1`
  for :math:`1 \le j \le k`. In other words, an empty cluster is not
  acceptable.
- The clusters together include all data points:
  :math:`C_1 \cup C_2 \cup ... \cup C_k = \{x_1, x_2, ..., x_n\}`.
- A data point must not belong to two or more clusters: :math:`\forall i,
  1 \le i \le n`, if :math:`x_i \in C_j` and :math:`x_i \in C_m`, then
  :math:`j = m`, where :math:`1 \le j, m \le k`. To put it another way,
  :math:`C_j` and :math:`C_m` have no overlap if :math:`j \ne m`:
  :math:`C_j \cap C_m = \emptyset`.

How is the problem defined? Suppose :math:`D` (for distance) measures how
dissimilar the data points within a cluster are. If the data points are quite
similar, :math:`D` is small. If the data points are quite dissimilar,
:math:`D` is large. The goal is to assign :math:`x_1`, :math:`x_2`, ...,
:math:`x_n` to :math:`C_1`, :math:`C_2`, ..., :math:`C_k` so that the total
cost :math:`\sum_{j=1}^{k} D(C_j)` is as small as possible. This is a
*minimization problem*. A minimization problem aims to make a quantity,
called the *cost function*, as small as possible. Minimization problems are
*optimization problems*, as are *maximization problems*. A maximization
problem aims to make a quantity, called the *profit function* or *score
function*, as large as possible.

.. index::
   cost function
   profit function
   minimization problem
   maximization problem
   optimization problem
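To make the four properties listed above concrete, here is a minimal sketch
of a checker, assuming each cluster is represented as a set of data-point
indices between :math:`0` and :math:`n - 1`. The function
``is_valid_clustering`` is a hypothetical helper for illustration only; it is
not part of this chapter's programs.

.. code-block:: python

   def is_valid_clustering(n, clusters):
       """Check the partition properties for n data points, where
       clusters is a list of sets of point indices 0 .. n-1."""
       # Property: no cluster may be empty.
       if any(len(c) == 0 for c in clusters):
           return False
       # Property: the union of the clusters covers every point.
       covered = set().union(*clusters)
       if covered != set(range(n)):
           return False
       # Property: no point appears in two clusters, so the cluster
       # sizes must add up to exactly n.
       return sum(len(c) for c in clusters) == n


   # Example: five points in two clusters.
   print(is_valid_clustering(5, [{0, 2}, {1, 3, 4}]))  # True
   print(is_valid_clustering(5, [{0, 2}, {2, 3, 4}]))  # False: 2 repeated, 1 missing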
How is :math:`D` defined? It can be defined in many ways. One commonly used
definition is the sum of pairwise squared Euclidean distances:

.. index::
   Euclidean distance

:math:`D(C_j) = \sum_{x_r, x_s \in C_j} (x_r - x_s)^2`

Here :math:`x_r` and :math:`x_s` are :math:`p`-dimensional vectors:
:math:`x_r = (x_{r1}, x_{r2}, ..., x_{rp})` and
:math:`x_s = (x_{s1}, x_{s2}, ..., x_{sp})`. The distance between them is
defined as the sum of the squares of the differences in each dimension:
:math:`(x_r - x_s)^2 = \sum_{i=1}^{p} (x_{ri} - x_{si})^2`.

It is common to divide :math:`D(C_j)` by the number of data points in
:math:`C_j` so that clusters of different sizes are treated equally. Thus,
:math:`D(C_j)` is redefined as
:math:`D(C_j) = \frac{1}{|C_j|} \sum_{x_r, x_s \in C_j} (x_r - x_s)^2`.
Rewrite the cost function as
:math:`\min \sum_{j=1}^{k} \frac{1}{|C_j|} \sum_{x_r, x_s \in C_j} (x_r - x_s)^2`.

Number of Solutions for Clustering
----------------------------------

Deciding the proper value of :math:`k` is a difficult problem because there
are many possible solutions when :math:`n` and :math:`k` are large. How many
possible solutions are there? This is the *set partition problem*: divide a
set of :math:`n` elements into :math:`k` non-overlapping, non-empty subsets.
If each of the :math:`n` data points may be assigned to any of the :math:`k`
clusters, there are :math:`k^n` possibilities. This, however, allows empty
clusters, so the assignments that leave one or more clusters empty must be
excluded. Excluding them takes some care: subtracting, for each of the
:math:`k` clusters, the :math:`(k-1)^n` assignments that leave that cluster
empty removes the assignments that leave two clusters empty twice, so those
must be added back, and so on. This inclusion-exclusion argument gives

:math:`\sum_{i=0}^{k} (-1)^i \binom{k}{i} (k-i)^n`

assignments with no empty cluster. (Since the clusters are interchangeable,
this counts each grouping :math:`k!` times, but either way the number grows
astronomically with :math:`n` and :math:`k`.)

K-Means Algorithm
-----------------

When :math:`n` and :math:`k` are large, there are too many possible solutions
and finding the best solution (or one of the best solutions, if several
solutions are equally good and better than the others) would be difficult.
Instead of finding the best solution, a heuristic called the *k-means
algorithm* usually finds good solutions. These are the steps of the k-means
algorithm:

.. figure:: kmean/figures/kmeanalgorithm.png

   K-Means Algorithm

A cluster's centroid is the center of the data points assigned to that
cluster. For cluster :math:`C_j`, its centroid is
:math:`\frac{1}{|C_j|} \sum_{x_r \in C_j} x_r`. Please remember that each
data point is a :math:`p`-dimensional vector.

This is a *heuristic solution*, meaning that the solution often works but
there is no assurance; it may fail in some cases. For many problems, finding
the best solution (or solutions) can be computationally difficult because
there are too many possibilities to check. Instead of finding the best
solution, an algorithm is adopted to find "good enough" solutions fast. For
the k-means clustering problem, there is no easy way to determine the best
value of :math:`k`; thus, there is no clearly defined best solution.
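The steps in the figure can be sketched in a few lines of Python. This is a
minimal illustration under simplifying assumptions (each data point is a
tuple of numbers, and a cluster that becomes empty is simply dropped); the
chapter's full implementation, shown later, is organized differently.

.. code-block:: python

   import random


   def squared_distance(a, b):
       """Sum of the squared differences in each dimension."""
       return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


   def centroid(points):
       """Component-wise mean of a non-empty list of points."""
       dim = len(points[0])
       return tuple(sum(x[i] for x in points) / len(points)
                    for i in range(dim))


   def kmeans(points, k, rounds=100):
       """Cluster points into at most k clusters; returns one label per point."""
       # Assign every point to a random cluster. This random start is the
       # source of the non-deterministic behavior discussed later.
       labels = [random.randint(0, k - 1) for _ in points]
       for _ in range(rounds):
           # Compute the centroid of each non-empty cluster.
           centroids = {}
           for j in range(k):
               members = [x for x, lab in zip(points, labels) if lab == j]
               if members:
                   centroids[j] = centroid(members)
           # Reassign each point to its nearest centroid.
           new_labels = [min(centroids,
                             key=lambda j: squared_distance(x, centroids[j]))
                         for x in points]
           if new_labels == labels:  # converged: no assignment changed
               return labels
           labels = new_labels
       return labels


   # A tiny example: two well-separated groups in one dimension.
   data = [(1,), (2,), (3,), (10,), (11,), (12,)]
   print(kmeans(data, 2))

The loop stops when a full pass changes no assignment, because the next pass
would then produce exactly the same centroids and the same assignment again.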
Generate Test Cases
-------------------

Before showing the implementation in Python, this section discusses how to
generate test cases. Lacking test cases with known properties is one common
problem in developing software for machine learning. The purpose of machine
learning is to recognize unknown patterns in data. How can we know that the
programs are correct if we do not know what the programs should find? Two
methods are commonly used:

- Creating simple test cases manually with known patterns.
- Adopting widely used test cases whose patterns have already been studied.

The first method is restricted to very small test cases that are unlikely to
have the sophisticated patterns needed to test computer programs. The second
method, in contrast, may have sophisticated patterns, but the data may be too
complex for identifying problems in the programs. Another approach is to
write another program (or several programs) as a "test case generator" to
generate test cases with known patterns. More specifically, for this problem,
we can write a program that generates :math:`n` data points in :math:`k`
clusters (:math:`n > k`). Since the data points are generated intentionally,
it is easy to test whether the k-means program is correct. Writing a good
test case generator takes some thinking because the generator has to consider
many different possible properties. The programs in this chapter handle
integers only because floating-point numbers have limited precision and
sometimes give surprising results, to be discussed later in this book.

This program generates two output files: ``data.txt`` and ``cluster.txt``.
The first file has :math:`n` lines and each line has :math:`p` integers. The
second file also has :math:`n` lines; each line has the same :math:`p`
integers, followed by one integer between :math:`0` and :math:`m - 1` to
indicate which cluster this line belongs to.

.. literalinclude:: kmean/code/testmain.py
   :language: python

.. literalinclude:: kmean/code/testgen.py
   :language: python

TO DO: Need to explain the program

To run the program using the default values, type ``python3 main.py``. This
is an example output ``data.txt``:

.. literalinclude:: kmean/code/data.txt

This is the corresponding ``cluster.txt``:

.. literalinclude:: kmean/code/cluster.txt
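Although ``testgen.py`` is not explained yet, the idea behind such a
generator can be sketched: pick a random integer center for each cluster,
then scatter each cluster's points around its center. The sketch below is a
simplified stand-in, not the chapter's generator; the parameter names and
value ranges are assumptions for illustration, and the real program's
command-line flags may differ.

.. code-block:: python

   import random


   def generate(n, k, p, spread=5, low=0, high=100):
       """Generate n points with p integer coordinates in k clusters.
       Returns (points, labels); point i belongs to cluster labels[i]."""
       centers = [tuple(random.randint(low, high) for _ in range(p))
                  for _ in range(k)]
       points, labels = [], []
       for i in range(n):
           j = i % k  # deal points to clusters round-robin, so none is empty
           points.append(tuple(c + random.randint(-spread, spread)
                               for c in centers[j]))
           labels.append(j)
       return points, labels


   points, labels = generate(n=30, k=3, p=2)
   with open("data.txt", "w") as f:
       for pt in points:
           f.write(" ".join(map(str, pt)) + "\n")
   with open("cluster.txt", "w") as f:
       for pt, lab in zip(points, labels):
           f.write(" ".join(map(str, pt)) + " " + str(lab) + "\n")

With a small ``spread`` relative to the distances between centers, the
clusters are clearly separated; with a large ``spread``, they overlap.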
Visualize Data
--------------

Does the test case generator produce data with the expected properties? One
way to answer this question is to visualize the data (up to three
dimensions). Visualization can help inspect whether the data produced by the
test generator is indeed clustered into several regions.

.. literalinclude:: kmean/code/visualize.py
   :language: python

TODO: explain the code

.. figure:: kmean/figures/figure1.png

   Visualize a generated test case of three clusters, clearly separated.

.. figure:: kmean/figures/figure2.png

   A test case generated with ``-d 1 -m 4`` arguments.

.. figure:: kmean/figures/figure3.png

   A test case generated with ``-d 3 -m 3`` arguments.

.. figure:: kmean/figures/figure4.png

   A test case generated with ``-t -m 3`` arguments.

K-Means Implementation
----------------------

Now is the time to show how the k-means algorithm can be implemented. The
implementation closely follows the algorithm described earlier.

TODO: Explain the program

.. literalinclude:: kmean/code/kmeanmain.py
   :language: python

.. literalinclude:: kmean/code/kmean.py
   :language: python

.. literalinclude:: kmean/code/datapoint.py
   :language: python

.. literalinclude:: kmean/code/centroid.py
   :language: python

Does this program work? Let's try an example.

.. figure:: kmean/figures/figure5.png

   A generated test case.

.. figure:: kmean/figures/figure6.png

   The clusters are marked by different labels.

The result is encouraging. Next, consider some more test cases:

.. figure:: kmean/figures/figure7.png

   A test case generated with ``-m 4`` arguments.

.. figure:: kmean/figures/figure8.png

   K-means result using only two clusters.

The next example allows clusters to overlap by adding ``-t`` in the test
generator.

.. figure:: kmean/figures/figure9.png

   A test case generated with ``-t -m 4`` arguments.

.. figure:: kmean/figures/figure10.png

   K-means result using three clusters.

This looks pretty good. Running the same clustering multiple times, we may
get different results:

.. figure:: kmean/figures/figure11.png

.. figure:: kmean/figures/figure12.png

.. figure:: kmean/figures/figure13.png

   Running the program three times using four clusters.

The top one seems reasonable but the other two are questionable. Why?

Understand the Non-Deterministic Behavior of K-Means
----------------------------------------------------

This is the place where most books stop and why this book is unique: this
book gives deeper insight. How can this program produce different results
even though the input data is the same? A simple answer is that this program
is *non-deterministic* because of this line:

``clu = random.randint(0, kval - 1)``

It assigns each data point to a cluster between ``0`` and ``k - 1`` randomly.
It is possible that two or more data points in the same cluster (by the test
generator) are assigned to different clusters.

Python (like most programming languages) has a *random number generator*. It
would be difficult to predict the next number after observing a sequence of
numbers. Let us consider several examples. This is a simple program calling
Python's ``random.randint``:

.. literalinclude:: kmean/code/sequence1.py
   :language: python

The following shows the results of executing the program three times.

.. literalinclude:: kmean/code/sequence1.txt

.. literalinclude:: kmean/code/sequence2.txt

.. literalinclude:: kmean/code/sequence3.txt

It is difficult to predict the next number from the numbers that have already
been seen. Python allows programmers to set the *seed* of the random number
generator. If the same seed is used, the same sequence of numbers is
generated. In other words, the sequence becomes *deterministic*. The
following program always generates the same sequence of numbers.

.. literalinclude:: kmean/code/sequence2.py
   :language: python

The words "random" and "non-deterministic" are often confused. Random means
it is difficult to predict in advance. Non-deterministic means that if the
same program runs again with the same inputs, the results may be different.
It is possible to have a deterministic sequence of random numbers, like the
ones generated by ``sequence2.py``. The sequence is deterministic because
running the program again generates the same sequence. The sequence is random
because knowing the generated numbers does not help predict the future
numbers.

Another common confusion is between "random" and "unknown". If a sequence of
numbers is random, the next unseen number is unknown (unless the seed is
set). However, something that is unknown may not be random. For example, you
may not know the birthday of a person, but this person's birthday is not
"random".

A common way of varying the seed is using the microsecond of the current
time, in the following way:

``random.seed(datetime.datetime.now().microsecond)``

Be careful using time as a random number seed for keys in security programs.
Using microseconds restricts the possible seeds to the values between ``0``
and ``999999``, not the full range of all possible integers.
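Here is a minimal, self-contained sketch of this seeding technique; the
surrounding code is for illustration only.

.. code-block:: python

   import datetime
   import random

   # Seed with the microsecond field of the current time (0 to 999999).
   # This varies between runs, but note the small seed space: fine for
   # experiments, unsafe for generating keys in security programs.
   random.seed(datetime.datetime.now().microsecond)
   print([random.randint(0, 9) for _ in range(5)])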
It is possible that data points that should be in the same cluster are put
into different clusters. When this occurs, the results are not very useful.
Since the k-means algorithm does not guarantee finding the best clustering
solution, it is advisable to run the program multiple times and keep the
result that has the smallest value of the cost function.

Determine ``k``'s Value
-----------------------

One important question is how to determine the value of ``k``. This can also
be answered by trying different values and choosing the one that minimizes
the cost function. A function along the lines of the sketch below considers
different values of ``k`` and runs the clustering multiple times for each
value.
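This is a sketch only, reusing the ``squared_distance``, ``kmeans``, and
``data`` names from the earlier k-means sketch; the chapter's actual driver
program may be organized differently.

.. code-block:: python

   def cost(points, labels, k):
       """Sum over clusters of pairwise squared distances divided by
       cluster size. This sums over ordered pairs, which differs from
       the unordered sum only by a constant factor of two, so it does
       not affect comparisons between clusterings."""
       total = 0.0
       for j in range(k):
           members = [x for x, lab in zip(points, labels) if lab == j]
           if not members:
               continue
           pair_sum = sum(squared_distance(a, b)
                          for a in members for b in members)
           total += pair_sum / len(members)
       return total


   def sweep_k(points, k_values, runs=5):
       """Run k-means several times for each k; keep the smallest cost."""
       return {k: min(cost(points, kmeans(points, k), k)
                      for _ in range(runs))
               for k in k_values}


   print(sweep_k(data, range(2, 6)))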
Consider the data generated using ``-d 6 -c 10 -m 50 -M 60`` as the input.
The following figure shows the sum of the distances to the centroids for
different values of ``k``. It is clear that when ``k`` is too small (below
8), the distances are much larger.

.. figure:: kmean/figures/iteratedistance.png

   Distances for different values of ``k``, five runs for each value.

.. figure:: kmean/figures/iteratemindistance.png

   Smallest distance for each value.

Please notice that the smallest distance occurs when ``k`` is 13, even
though the data is generated for 10 clusters. Each value of ``k`` runs five
times and has five bars. The size of each cluster is shown as the height.
Since the clusters together contain all data points, the sum of the
clusters' sizes is always the same. The sizes of the clusters can vary
significantly. For example, when ``k`` is 5, the smallest cluster has no
data points and the largest cluster has more than one third of the data
points.

.. figure:: kmean/figures/clustersizes.png

   Sizes of clusters for different values of ``k``.

Clean Data before Clustering
----------------------------

The k-means algorithm seems relatively simple. Is it actually useful? Yes.
The beginning of this chapter describes several scenarios where clustering
would be helpful. Here are some other applications. Imagine that you own
several pizza stores and want to open new stores. You want to decide the new
stores' locations. You have the budget to open at most five more stores (in
this case ``k`` is between 1 and 5). You want to use the addresses of
deliveries to determine the stores' locations as the centroids of the
clusters based on customers' addresses. This can be applied to many other
scenarios. As another example, a bank wants to decide where to install four
ATMs (automatic teller machines) based on customers' home and office
addresses (``k`` is 4 in this case). Both can take advantage of the
clustering method.

It is common that the data needs to be "cleaned" before being sent to a
clustering program. For example, if you want to decide the locations of pizza
stores, you may want to exclude the customers that are too far away from the
other customers, maybe more than 10 km away from all the other customers.
Maybe you also want to consider how often customers order deliveries and give
frequent customers larger weights so that the stores are closer to these
customers. The same thinking can be applied to selecting ATM locations: maybe
you want to give higher weights to the customers that have high account
balances or use ATMs more often.

Such "data cleaning" usually requires human judgment. In the example of
deciding the pizza stores' locations, should the customers be excluded if
they are more than 10 km away from all the other customers? Why should it be
10 km? Why not 5, 8, or 15 km? If you want to include the customers that
order deliveries often even though they live more than 10 km away, how do you
define "often"? Is more than once per week considered often? Or more than
twice per week? Is one week too short? Would you prefer instead to consider
ordering more than five times per month?

Doing meaningful data cleaning often requires knowledge about the business.
You need to know how much it costs to deliver pizza to customers living at
different distances from the store. You also need to know the traffic
conditions: delivering pizza to a customer living in the center of a city may
take longer even though the distance is shorter. Business knowledge is
essential for understanding what data is needed, what data should be
excluded, and how to interpret the results from the clustering method. Using
the k-means method, or any machine learning method, is not as simple as
throwing data into a program and getting meaningful results effortlessly.

Summary
-------

K-means is a simple clustering method and it can be the starting point for
more complex methods. It can handle high-dimensional data (even though
visualization above three dimensions is difficult). When you get a set of
data, you may use k-means (experimenting with different values of ``k``) to
get some idea of whether the data is somewhat clustered.