Slide 6.20: Data clustering methods (cont.)

Data Clustering Methods (Cont.)

Similarity Measures
In order to cluster the items in a data set, some means of quantifying the degree of association between them is required. This may be a distance measure, or a measure of similarity or dissimilarity.

where C is the number of terms that D_i and D_j have in common, and A and B are the number of terms in D_i and D_j.

An Example of Data Clustering Using MST Algorithms
A minimal spanning tree (MST) of a weighted graph is a collection of edges connecting all the vertexes such that the sum of the weights of the edges is at least as small as the sum of the weights of any other collection of edges connecting all the vertexes. Or an MST is a tree linking N objects with N-1 connections so that there are no loops and the sum of the N-1 dissimilarities is minimized. Once an MST has been constructed, the corresponding single link hierarchy can be generated in O(N²) operations.

Two fundamental construction principles for MSTs are
1. Any isolated point can be connected to a nearest neighbor.
2. Any isolated fragment (subset of an MST) can be connected to a nearest neighbor by a shortest available link.

The MST construction algorithm
1. Place an arbitrary point in the MST and connect its nearest to it.
2. Find the point not in the MST closest to any point in the MST and add it to the fragment.
3. If a point remains that is not in the fragment, return to Step 2.