Our similarity measures for 2D and 3D structures, along with a wide variety of descriptors, allow us to apply techniques that either order data points into groups by similarity (cluster analysis) or examine how similar or dissimilar the data points in a dataset are to each other (diversity analysis).

Cluster analysis

Cluster analysis refers to a group of statistical methods that are used for identifying groups ("clusters") of similar items in multidimensional space. They require a measure of similarity between items. To illustrate this, let's imagine a simple 2D space (say, two property descriptors), so we can take a look at the points in this space visually. In the diagram below, the blue circles represent compounds plotted in this space, and the red circles indicate one possible set of clusters. Note that there is no single definitive clustering: determining groups is subjective and dependent on the method used to identify them. Data points that are not put into groups are called singletons.
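In code, the similarity measure underpinning these methods can be sketched in a few lines. The example below is illustrative only, using the Tanimoto coefficient (discussed later in these notes) on hypothetical fingerprints represented as Python sets of "on" bit positions:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two bit-string fingerprints,
    each given as a set of 'on' bit positions."""
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)

# Two hypothetical fingerprints sharing three on-bits
fp1 = {1, 4, 7, 9}
fp2 = {1, 4, 8, 9, 12}
print(tanimoto(fp1, fp2))  # 3 / (4 + 5 - 3) = 0.5
```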

Cluster analysis methods have been applied in many different areas, and are now widely used in fields like data mining, pattern recognition and machine learning. In cheminformatics, clustering methods are used for three main purposes:

Grouping compounds into chemical series (or something approximating this), as a way of organizing large datasets. For example, it is easier for a chemist to browse through 500 clusters (where the molecules in each cluster are similar) than through 50,000 arbitrarily ordered compounds.

Identifying new bioactive molecules: if a compound with unknown activity is in a cluster that is biased towards compounds with known activity, we can make a prediction of the probability of activity of the unknown compound (for example, if 75% of the compounds in the cluster are active, we might say the probability of activity is 75%).

Picking representative subsets: if we cluster a set of compounds, we can then take one compound from each cluster as a "representative" of this cluster, and the total set of representative compounds as a representative subset of the whole dataset. This is sometimes more useful than random selection.

Hierarchical clustering

Clustering methods can be either hierarchical or non-hierarchical. Hierarchical clustering creates a "tree" of clusters, with, at the bottom level, every item in its own cluster, and at the top level all items in one cluster. Algorithmically, this can be done either by starting at the bottom and progressively merging clusters (agglomerative) or starting at the top and breaking up clusters (divisive).

Most hierarchical agglomerative methods follow the same basic algorithm, but differ in how they decide which clusters to merge at each level. Ward's method (described in detail below) has been used widely in cheminformatics, and is distinguished by merging the pair of clusters that produces the smallest increase in variance from the mean (i.e. creates the "tightest" cluster when merged). Other methods include single linkage (merge the two clusters with the minimum distance between their nearest points); complete linkage (merge the two clusters with the minimum distance between their farthest points); and group average (merge the two clusters with the minimum mean distance over all pairs of points, one from each cluster). To create a partitioned grouping of a dataset (i.e. where every item is in one and only one cluster, or is a singleton), one must select a horizontal slice through the tree. Due to their computational complexity and memory requirements, hierarchical methods do not scale well to very large datasets, and thus they are giving way to faster, nonhierarchical methods. The only divisive method that has seen significant use in cheminformatics is divisive K-means.

To see how these methods work, let's take a look at the sample data set that we used in the 2D databases class:

Hierarchical agglomerative methods work by initially putting every item in a cluster by itself at the bottom level (giving n clusters, where n is the number of points). The algorithm then identifies the two clusters to merge (according to the criterion described above) and merges them to form a new cluster at the next level. The next level up therefore consists of one cluster with two points, with all the remaining points in clusters by themselves (i.e. n-1 clusters). The process repeats until there is just one cluster at the top containing all the points, yielding a cluster hierarchy. This is demonstrated for our sample set below:
In order to extract a partition from the hierarchy (i.e. a grouping of compounds where every compound is in one and only one cluster), we need to select a "level" from this hierarchy. There are many algorithms for selecting "good" levels; see, for example, those discussed in Wild, D.J., Blankley, C.J., Comparison of 2D Fingerprint Types and Hierarchy Level Selection Methods for Structural Grouping Using Ward's Clustering, Journal of Chemical Information and Computer Sciences, 2000, 40, 155-162.
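The agglomerative merge loop described above can be sketched in plain Python. This is a toy illustration, not the implementation of any particular package: it uses single linkage on hypothetical 2D points and records the cluster partition at every level of the hierarchy:

```python
import itertools

def euclid(p, q):
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

def single_linkage(c1, c2, points):
    # distance between the nearest two points of the two clusters
    return min(euclid(points[i], points[j]) for i in c1 for j in c2)

def agglomerate(points):
    """Return the hierarchy as a list of partitions, from n singleton
    clusters at the bottom to one all-inclusive cluster at the top."""
    clusters = [frozenset([i]) for i in range(len(points))]
    levels = [list(clusters)]
    while len(clusters) > 1:
        # merge the pair of clusters with the smallest linkage distance
        a, b = min(itertools.combinations(clusters, 2),
                   key=lambda pair: single_linkage(pair[0], pair[1], points))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        levels.append(list(clusters))
    return levels

# Five hypothetical compounds in a 2D descriptor space
hierarchy = agglomerate([(0, 0), (0, 1), (5, 5), (5, 6), (9, 0)])
for level in hierarchy:
    print([sorted(c) for c in level])
```

Selecting a partition then amounts to picking one element of the returned list, which is exactly the "level selection" problem discussed above.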

Nonhierarchical clustering

Nonhierarchical methods can use a variety of algorithms, but they generally all produce a single partitioning of the dataset into clusters (versus a tree which can yield many partitions). Some of the more common nonhierarchical methods are Jarvis-Patrick, K-means and K-medoids.

Jarvis-Patrick (JP) is a non-hierarchical method where, for each compound in a set, the j nearest neighbors (i.e. the other compounds in the dataset that are the most similar to it) are identified. Compounds are then placed in the same cluster if they (i) are in each other's list of j nearest neighbors, and (ii) have at least k of their j nearest neighbors in common. This method doesn't require level selection, but does require j and k to be predefined. The Tanimoto coefficient is usually used as the measure of similarity. JP is fast, but has had mixed results in cheminformatics in terms of cluster quality.
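As a sketch (with a hypothetical precomputed similarity matrix standing in for the Tanimoto comparisons), the two JP criteria can be implemented with neighbour lists and a union-find structure:

```python
def jarvis_patrick(sim, j, k):
    """Jarvis-Patrick sketch: sim is a symmetric similarity matrix,
    j = neighbour-list length, k = required number of shared neighbours."""
    n = len(sim)
    # the j nearest neighbours of each item (excluding itself)
    nn = [set(sorted((x for x in range(n) if x != i),
                     key=lambda x: -sim[i][x])[:j])
          for i in range(n)]
    parent = list(range(n))           # union-find forest
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a in range(n):
        for b in range(a + 1, n):
            # merge if mutually in each other's lists and sharing >= k neighbours
            if a in nn[b] and b in nn[a] and len(nn[a] & nn[b]) >= k:
                parent[find(a)] = find(b)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# Two tight groups of three compounds (hypothetical similarities)
sim = [[1.0, 0.9, 0.8, 0.1, 0.1, 0.1],
       [0.9, 1.0, 0.85, 0.1, 0.1, 0.1],
       [0.8, 0.85, 1.0, 0.1, 0.1, 0.1],
       [0.1, 0.1, 0.1, 1.0, 0.9, 0.8],
       [0.1, 0.1, 0.1, 0.9, 1.0, 0.85],
       [0.1, 0.1, 0.1, 0.8, 0.85, 1.0]]
print(jarvis_patrick(sim, j=2, k=1))  # two clusters: [0, 1, 2] and [3, 4, 5]
```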

K-means clustering is more widely used than JP. It requires that the number of desired clusters m be known in advance. An initial set of m cluster centroids is created, for example by randomly selecting compounds to use as centroids. Each of the n items is then placed into the nearest cluster, by calculating the similarity between the item and each of the cluster centroids. After one pass through all of the items, the centroids are recalculated (as the centers of the newly formed clusters). Since this changes the cluster assignments, the process is repeated, reassigning members to clusters, until no more items change cluster (i.e. the clusters are stable). Generally only a few iterations (<100, often <10) are required to converge.
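The assign-and-recompute loop can be sketched as follows; this is a minimal illustration on hypothetical 2D descriptor vectors, with randomly chosen compounds as the initial centroids:

```python
import random

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, m, seed=0, max_iter=100):
    """Minimal k-means sketch: returns (assignment, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, m)   # m compounds as initial centroids
    assignment = None
    for _ in range(max_iter):
        # assign each point to its nearest centroid
        new_assignment = [min(range(m), key=lambda c: dist2(p, centroids[c]))
                          for p in points]
        if new_assignment == assignment:   # clusters are stable: stop
            break
        assignment = new_assignment
        # recompute each centroid as the mean of its cluster's members
        for c in range(m):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(xs) / len(members)
                                     for xs in zip(*members))
    return assignment, centroids

# Two well-separated groups of hypothetical descriptor vectors
pts = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
assignment, centroids = kmeans(pts, m=2)
print(assignment)
```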

K-medoids is a derivative of K-means that is implemented as PAM (Partitioning Around Medoids) in the R statistical package (and thus is commonly used in that environment). It differs from K-means in that it uses real exemplar compounds, or "medoids", to represent cluster centers, rather than computed centroids.

Some good references for clustering in cheminformatics are:
  • Chemical Similarity Searching, P. Willett, J.M. Barnard, G.M. Downs, J. Chem. Inf. Comput. Sci., 1998, 38, 983-996
  • Clustering of Chemical Structures on the Basis of Two-Dimensional Similarity Measures, J. Chem. Inf. Comput. Sci., 1992, 32, 644-649
  • Clustering Methods and Their Uses in Computational Chemistry, G.M. Downs and J.M. Barnard, Reviews in Computational Chemistry, 2002, 18, 1-40

Diversity analysis

Diversity analysis gained popularity in the late 1990s in response to the following needs in the pharmaceutical industry:

  • There was much interest in how well the corporate collections of compounds held by pharmaceutical companies “covered” possible chemistry / drug space
  • Combinatorial Chemistry experiments were producing many new compounds, and people wanted to know if these compounds added anything new (in terms of chemical or biological functionality) to their corporate collections, i.e. if they made the datasets more diverse, or just replicated what was already in there
  • Libraries of thousands of compounds became available for purchase – are they worth the money?

This provoked a discussion about "descriptor spaces", i.e. the multidimensional Euclidean spaces created by treating each descriptor of a compound as a dimension. Of particular interest was how these descriptor spaces might map to a conceptual "chemistry space" (if all the compounds that could theoretically be made were made, the chemistry space is the region of a multidimensional descriptor space, as defined by a given descriptor set, that they would occupy) and a "drug space" (the regions of chemistry space that would be inhabited by drug molecules). Conceptually, we could think of something like this for an imaginary two-dimensional space:

Coverage and cell-based methods

Once one has defined these spaces, one can consider identifying regions of chemistry or drug space that are or are not covered by a collection, and increasing the diversity of a collection by adding compounds that would increase its coverage of chemistry or drug space. The most straightforward way of performing this kind of analysis is to choose two or three descriptors, and to plot compounds in the two or three dimensional space created by these descriptors, indicating whether the compounds are drugs or not (in the case of differentiating drug and chemical space). For an example of this, see Shemetulskis et al., Enhancing the diversity of a corporate database using chemical database clustering and analysis, Journal of Computer-Aided Molecular Design, 1995, 9, 407-416.

Cell-based methods have also been used. These discretize each dimension into a number of bins, thus creating "cells" in Euclidean space at the intersections of the bins. One can then determine how many of these cells are populated, and the extent to which they are over- or under-represented relative to the space as a whole.
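A cell-based coverage calculation can be sketched as follows, assuming hypothetical 2D descriptor values with known lower and upper bounds in each dimension:

```python
def cell_coverage(points, bins, lo, hi):
    """Discretise each descriptor dimension into `bins` equal-width bins
    and report which cells are occupied and the fraction covered."""
    def cell(p):
        # clamp so that values at the upper bound fall in the last bin
        return tuple(min(int((x - l) / (h - l) * bins), bins - 1)
                     for x, l, h in zip(p, lo, hi))
    occupied = {cell(p) for p in points}
    return occupied, len(occupied) / bins ** len(lo)

# Four compounds in a hypothetical 2D descriptor space bounded by [0, 10]^2
cells, frac = cell_coverage([(0.5, 0.5), (1, 1), (9, 9), (5, 5)],
                            bins=5, lo=(0, 0), hi=(10, 10))
print(cells, frac)  # 3 of 25 cells occupied -> coverage 0.12
```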

Relative diversity

Additionally, several methods were developed for measuring relative diversity, i.e. how internally dissimilar the compounds in a set are. The most straightforward way of measuring this is to calculate the mean intermolecular similarity over all pairs of compounds in the set using the Tanimoto coefficient, and then subtract this value from 1 to give a measure of diversity. It is important to recognize that this says nothing about how much the set covers a particular space, just how dissimilar the compounds in the set are to each other. For example, which of these sets is the most diverse (by dissimilarity and coverage)?
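This mean-pairwise-dissimilarity measure is a short loop over all compound pairs; the fingerprints below are hypothetical sets of on-bit positions:

```python
from itertools import combinations

def mean_pairwise_diversity(fps):
    """1 minus the mean pairwise Tanimoto similarity over a set of
    fingerprints (each a set of on-bit positions)."""
    sims = [len(a & b) / len(a | b) for a, b in combinations(fps, 2)]
    return 1 - sum(sims) / len(sims)

# Two similar compounds plus one outlier (hypothetical fingerprints)
fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}]
print(mean_pairwise_diversity(fps))  # 1 - (0.5 + 0 + 0) / 3 = 0.8333...
```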

One can also perform cluster analysis on a set, and identify which clusters are associated with higher or lower prevalence of compounds of interest (e.g. drug molecules) relative to the dataset as a whole.

Comparing datasets

Both coverage and relative methods can be used to compare sets as well as to investigate single datasets. For example, we can answer the question "how diverse is set A compared to set B?" by comparing the coverage of the sets (e.g. the number of cells populated), or by comparing the mean dissimilarities of the two sets. We can answer the question "how different are these two sets of compounds?" by looking at their overlap with coverage methods, or by seeing how the mean dissimilarity of one set changes when the second is added.

Diverse subset selection

It is also often desirable to extract a diverse subset, i.e. a subset of the compounds in a database that represents the chemical or biological diversity of the set as a whole. Note that this is different from taking a random subset, which merely attempts to sample the distribution in a set. There are several ways to pick a diverse subset. With a cell-based coverage approach, we can take a single compound as representative of an entire cell, or of a set of cells if there are more populated cells than desired subset members. If we are taking a relative approach, we can adapt the mean intermolecular similarity approach into dissimilarity-based compound selection (DBCS): for example, select an initial compound (e.g. at random), then select the next compound as the one that is maximally dissimilar to the first, then the next as the one maximally dissimilar to the first two, and so on.
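The greedy DBCS procedure just described (sometimes called MaxMin selection) can be sketched as follows, again on hypothetical fingerprints given as sets of on-bit positions:

```python
def dbcs_maxmin(fps, subset_size, start=0):
    """Greedy dissimilarity-based compound selection: repeatedly add the
    compound whose nearest already-selected neighbour is most dissimilar."""
    def tanimoto(a, b):
        return len(a & b) / len(a | b)
    selected = [start]
    while len(selected) < subset_size:
        rest = [i for i in range(len(fps)) if i not in selected]
        # maximise the minimum dissimilarity to the compounds picked so far
        nxt = max(rest, key=lambda i: min(1 - tanimoto(fps[i], fps[s])
                                          for s in selected))
        selected.append(nxt)
    return selected

# Hypothetical fingerprints: two near-duplicate pairs and one outlier
fps = [{1, 2}, {1, 2, 3}, {8, 9}, {8, 9, 10}, {20, 21}]
print(dbcs_maxmin(fps, 3))  # picks one of each distinct family: [0, 2, 4]
```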

We can also apply cluster analysis to this problem, by clustering a set into n clusters (where n is the size of the desired subset), and then picking a representative from each cluster (e.g. the compound closest to the cluster centroid).
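Given clusters of hypothetical descriptor vectors (e.g. the output of one of the clustering methods above), picking the member closest to each cluster centroid is straightforward:

```python
def representatives(clusters):
    """From each cluster of descriptor vectors, pick the member closest
    to the cluster centroid (by squared Euclidean distance)."""
    reps = []
    for members in clusters:
        centroid = tuple(sum(xs) / len(members) for xs in zip(*members))
        reps.append(min(members,
                        key=lambda p: sum((a - b) ** 2
                                          for a, b in zip(p, centroid))))
    return reps

# Two hypothetical clusters of 2D descriptor vectors
clusters = [[(0, 0), (1, 0), (0, 2)], [(10, 10), (12, 10)]]
print(representatives(clusters))
```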

There are many articles published on diversity. Here are a few examples:
  • New Perspectives in Lead Generation II: Evaluating Molecular Diversity, M.J. Ashton, M.C. Jaye, J.S. Mason, Drug Discovery Today, 1996, Vol. 1, No. 2
  • Molecular Diversity and Representativity in Chemical Datasets, D.M. Bayada, H. Hamersma, V. van Geerestein, J. Chem. Inf. Comput. Sci., 1999, 39, 1-10
  • Rapid Quantification of Molecular Diversity for Selective Database Acquisition, D.B. Turner, S.M. Tyrrell, P. Willett, J. Chem. Inf. Comput. Sci., 1997, 37, 18-22

  • Challenges and prospects for computational aids to molecular diversity, Y. Martin, Perspectives in Drug Discovery and Design, 1997, 7/8, 159-172
  • Descriptors for diversity analysis, R.Brown, Perspectives in Drug Discovery and Design, 1997, 7/8, 31-49
  • Cluster-based selection, J.B. Dunbar, Perspectives in Drug Discovery and Design, 1997, 7/8, 51-63
  • Enhancing the diversity of a corporate database using chemical database clustering and analysis, N.E. Shemetulskis, J.B. Dunbar, B.W. Dunbar, D.W. Moreland, C. Humblet, Journal of Computer-Aided Molecular Design, 1995, 9, 407-416

Resources for clustering
1. R Bioconductor
2. Notes on Clustering
3. Methods to determine number of clusters
4. Data Mining book in R

Reading assignments

  • Leach & Gillet, Chapter 6