Data mining is defined in a variety of ways. Here we use it quite broadly to refer to any computational method that allows us to gain insights into the biological actions of chemicals by analyzing large amounts of data.

Kinds of data mining

Data mining is generally held to be a subdiscipline of (or synonymous with) the field of Knowledge Discovery, or Knowledge Discovery in Databases (KDD). It is usually defined as the process of identifying valid, novel, potentially useful, and ultimately understandable patterns in large collections of data. Data mining can be used to test hypotheses (verification goals) or to autonomously find entirely new patterns (discovery goals). Discovery goals may be descriptive (characterizing general properties of the data in the database) or predictive (making predictions using the data in the database). For example, we could think of diversity analysis as a descriptive discovery goal.

Data mining is not just about analyzing data: it can also include activities such as concept description, data cleaning, integration, and selection.

The tools of data mining include visualization, clustering, supervised learning, database querying, statistical analysis, and pattern matching.

Data visualization

We can consider data visualization in cheminformatics to be a specific instance of the general problem of visualization, except that we often wish to visualize chemical structures themselves, and to organize the structures in a visualization by structural similarity. The former usually requires specialized extensions to visualization software, whilst the latter can be achieved by working with structural descriptors such as fingerprints.
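For instance, similarity between two fingerprints is commonly measured with the Tanimoto coefficient. Here is a minimal sketch in plain Python, assuming fingerprints are represented simply as sets of "on" bit positions (real toolkits such as RDKit produce fixed-length bit vectors, but the arithmetic is the same):

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprints,
    each given as the set of bit positions that are set to 1."""
    if not fp_a and not fp_b:
        return 1.0  # convention: two empty fingerprints are identical
    common = len(fp_a & fp_b)
    return common / (len(fp_a) + len(fp_b) - common)

# Two hypothetical fingerprints sharing 3 of 5 distinct bits:
print(tanimoto({1, 2, 3, 4}, {2, 3, 4, 5}))  # → 0.6
```

A similarity of 1.0 means identical fingerprints, 0.0 means no bits in common; similarity-based ordering of structures in a visualization can be built on exactly this kind of pairwise score.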

There are several generalized packages for data visualization that have been adapted for bespoke use in cheminformatics; Spotfire in particular is very popular in the pharmaceutical industry. Other packages include Miner3D. Even Excel is used, particularly with extensions such as Accord for Excel.

Most of these tools involve plotting in two or three dimensions. Much can be learned by selecting sets of descriptors and viewing simple plots of one descriptor against one or two others. For example, we can see whether two biological activities are correlated, whether activity and gene expression are related, or whether activity is related to a particular physical property. We can also look at the relationships of certain descriptors (such as structural fingerprints) to others (such as activities) through the use of clustering and heat maps, where we cluster entities on two axes, then color each intersection by a particular dependent variable (for example, we can plot compounds against targets and color by activity). More recently, people have started looking at network visualizations (see, e.g., Quantifying the relationships between drug classes and PubChem as a source of polypharmacology).
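As a concrete example of checking whether two biological activities are correlated, a Pearson correlation coefficient can be computed directly. This is a sketch using only the standard library (in practice one would use numpy or a statistics package), and the assay values are invented:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length
    sequences (e.g. activities of the same compounds in two assays)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical percent-inhibition values for five compounds in two assays:
assay1 = [12.0, 35.0, 50.0, 64.0, 90.0]
assay2 = [10.0, 30.0, 55.0, 60.0, 88.0]
print(round(pearson(assay1, assay2), 3))
```

A value near +1 or -1 suggests the two activities are worth plotting against each other; values near 0 suggest they carry independent information.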

We can also project multiple descriptors into two or three dimensions using a variety of techniques:
  • Multidimensional scaling (MDS), which maps similarities in m dimensions to similarities in n dimensions
  • Nonlinear dimensionality reduction, which maps points in m-dimensional space to points in n-dimensional space
  • Unsupervised clustering, particularly Self-Organizing Maps (SOMs), sometimes called Kohonen Maps, and Generative Topographic Mapping (GTM)
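To illustrate the last technique, here is a toy 1-D self-organizing map in plain Python. This is a sketch, not a production implementation: the data, grid size, and learning schedule are all invented for the example, and real SOM software uses 2-D grids and more careful neighbourhood functions.

```python
import random

def train_som(data, n_units, epochs=100, lr0=0.5, radius0=None, seed=0):
    """Train a 1-D self-organizing map on a list of equal-length
    numeric vectors; returns the unit weight vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    radius0 = radius0 if radius0 is not None else n_units / 2.0
    units = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
             for _ in range(n_units)]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                    # learning rate decays
        radius = max(radius0 * (1.0 - frac), 0.5)  # neighbourhood shrinks
        for v in data:
            # best-matching unit (closest weight vector)
            bmu = min(range(n_units),
                      key=lambda i: sum((units[i][d] - v[d]) ** 2
                                        for d in range(dim)))
            # pull the BMU and its grid neighbours toward the sample
            for i in range(n_units):
                grid_dist = abs(i - bmu)
                if grid_dist <= radius:
                    infl = lr * (1.0 - grid_dist / (radius + 1.0))
                    for d in range(dim):
                        units[i][d] += infl * (v[d] - units[i][d])
    return units

def map_to_unit(units, v):
    """Index of the map unit a vector falls on."""
    return min(range(len(units)),
               key=lambda i: sum((units[i][d] - v[d]) ** 2
                                 for d in range(len(v))))

# Two invented clusters of 2-D descriptor vectors:
data = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
        [10.0, 10.0], [9.8, 10.2], [10.1, 9.7]]
units = train_som(data, n_units=4)
```

After training, vectors from the two clusters land on different map units, which is exactly what makes SOMs useful for laying out compounds in two dimensions.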

Mining data from high throughput technologies

Over the last decade or so, high-throughput technologies have had a significant effect on the amount of experimental chemical and biological data available. In particular, High Throughput Screening (HTS) allows hundreds of thousands of compounds to be tested for biological activity; microarray assays allow whole-genome regulation patterns for tissue samples (with or without compound treatment) to be determined; and High Content Screening (HCS) allows complex effects of compounds on cells to be determined.

When analyzing HTS data, we are concerned not just with which compounds are active and inactive, but with the overall trends and patterns. For example, we might have some chemical series that are always weakly active, and others where small changes can result in compounds being either inactive or highly active. We need to start by understanding the properties of the experiment and the dataset as a whole (what the error margins are, what cutoffs we use to determine activity and inactivity, what kinds of errors we can get, and so on). Results are usually presented as percent inhibition, or as IC50 values if they are the result of multiple runs (LD50 in the case of toxicity assays), although these figures are derived from raw signal values. Note that HTS screens can be enzymatic (i.e., measuring inhibition of a protein) but are increasingly cellular (i.e., measuring effects on particular cells, particularly cytotoxicity).
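To make the normalization and cutoff steps concrete, here is one common way to turn raw plate signals into percent inhibition using on-plate controls. This is a sketch: control conventions vary by assay, and the signal values and cutoffs below are invented.

```python
def percent_inhibition(signal, neg_ctrl, pos_ctrl):
    """Normalize a raw well signal to percent inhibition.
    neg_ctrl: mean control signal with no inhibitor (defines 0%);
    pos_ctrl: mean control signal at full inhibition (defines 100%)."""
    return 100.0 * (neg_ctrl - signal) / (neg_ctrl - pos_ctrl)

def classify(pi, active_cut=50.0, inactive_cut=20.0):
    """Bin a percent-inhibition value using (arbitrary) cutoffs."""
    if pi >= active_cut:
        return "active"
    if pi < inactive_cut:
        return "inactive"
    return "borderline"

# Hypothetical plate: uninhibited controls average 1000 counts,
# fully inhibited controls average 200 counts.
pi = percent_inhibition(signal=600, neg_ctrl=1000, pos_ctrl=200)
print(pi, classify(pi))  # → 50.0 active
```

The choice of cutoffs is exactly the kind of dataset-level decision the paragraph above describes; it should be driven by the assay's error margins, not fixed defaults.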

Here are some data mining methods commonly used for HTS:
  • Cluster the compounds (e.g., by structural descriptors) and look for SAR patterns within clusters (or even build QSAR models on individual clusters). For an example of how this method can be extended, see the VisualiSAR presentation.
  • Project the compounds (e.g., with MDS) into two dimensions, color by activity, then manually look for "hot spots". For an example of this, see Tripos HTS Data Miner.
  • For multiple HTS runs, create a heat map with compounds clustered vertically (by structure) and assays clustered horizontally (by target or gene, or even gene expression), and color by activity.
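The heat-map idea can be sketched in miniature: given a compound × assay activity matrix, order the rows so that similar activity profiles sit together, then render the matrix. The greedy nearest-neighbour ordering below is a crude stand-in for proper hierarchical clustering, and the activity values are invented:

```python
def seriate(rows):
    """Greedy nearest-neighbour ordering of matrix rows, a crude
    stand-in for hierarchical clustering: similar activity
    profiles end up adjacent."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    remaining = list(range(len(rows)))
    order = [remaining.pop(0)]
    while remaining:
        nxt = min(remaining, key=lambda i: dist(rows[i], rows[order[-1]]))
        remaining.remove(nxt)
        order.append(nxt)
    return order

def ascii_heatmap(rows, labels, chars=" .:*#"):
    """Render a small activity matrix as text, darker = higher."""
    lo = min(v for r in rows for v in r)
    hi = max(v for r in rows for v in r)
    span = (hi - lo) or 1
    lines = []
    for i in seriate(rows):
        cells = "".join(chars[min(int((v - lo) / span * len(chars)),
                                  len(chars) - 1)] for v in rows[i])
        lines.append("{:>6} {}".format(labels[i], cells))
    return "\n".join(lines)

# Invented compound x assay activities:
rows = [[0, 0, 9], [9, 9, 0], [0, 1, 8], [8, 9, 1]]
labels = ["cpd1", "cpd2", "cpd3", "cpd4"]
print(ascii_heatmap(rows, labels))
```

In the output, cpd1/cpd3 and cpd2/cpd4 end up adjacent because their activity profiles are similar; a real tool would cluster columns the same way and map values to colors rather than characters.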

Here is an example of the first approach using VisualiSAR (figure not shown).

Here is an example heat map taken from Learning from the Data: Mining of Large High-Throughput Screening Databases (figure not shown).
Whereas HTS gives us information relating compounds to protein targets or cellular effects, microarray assays give us information about genome-wide regulation patterns in cells. There are thus some clever things we can do linking this with HTS. First, we can use the microarray results (typically 14,000 or more data points for a cell line or tissue sample) as descriptors of the cell line, and then use these to cluster cellular HTS targets (for example, putting certain kinds of cancer cells together). This is especially useful in conjunction with heat map representations. For an example of this kind of data mining, see the paper Chemical Data Mining of the NCI Human Tumor Cell Line Database.
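A toy version of using expression profiles as cell-line descriptors: rank cell lines by how well their profiles correlate with a query profile. The cell-line names and four-gene profiles are invented (real microarray profiles have thousands of values per line):

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length profiles."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rank_by_similarity(profiles, query):
    """Rank cell lines by correlation of their expression profile
    to the query profile (highest correlation first)."""
    return sorted(profiles, key=lambda name: -pearson(profiles[name], query))

# Three hypothetical cell lines, four genes each:
profiles = {
    "lineA": [1.0, 2.0, 3.0, 4.0],
    "lineB": [4.0, 3.0, 2.0, 1.0],
    "lineC": [1.0, 2.0, 2.0, 5.0],
}
print(rank_by_similarity(profiles, [2.0, 4.0, 6.0, 8.0]))
# → ['lineA', 'lineC', 'lineB']
```

Applied across all pairs of cell lines, the same correlation values form a distance matrix that can drive the clustering on one axis of a heat map.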

HCS runs tend to produce a set of descriptors based on analysis of cellular images. One can thus try to find relationships between, say, chemical structure descriptors and image descriptors.

Both of these applications lend themselves to association rule mining (ARM), a technique developed for generic large-scale data mining applications. ARM looks for statistical associations between multiple descriptors, and then expresses these as rules (the classic example: men who buy diapers on Fridays also buy beer). We can use the same reasoning to determine relationships such as "compounds with features X and Y tend to cause apoptosis in cells with up-regulation of genes A, B, and C."
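The support/confidence computation at the heart of ARM can be sketched with a brute-force miner. Each "transaction" here is the set of features and readouts recorded for one compound; the feature names and thresholds are invented (real implementations use the Apriori or FP-growth algorithms to avoid enumerating all candidate rules):

```python
from itertools import combinations

def mine_rules(transactions, min_support=0.3, min_conf=0.7):
    """Brute-force single-consequent association rules.
    Each transaction is a set of items observed together
    (e.g. structural features plus cellular readouts)."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    rules = []
    for size in (1, 2):                       # antecedents of 1 or 2 items
        for ante in combinations(items, size):
            a = set(ante)
            n_a = sum(1 for t in transactions if a <= t)
            if n_a == 0:
                continue
            for c in items:
                if c in a:
                    continue
                n_ac = sum(1 for t in transactions if a <= t and c in t)
                support, conf = n_ac / n, n_ac / n_a
                if support >= min_support and conf >= min_conf:
                    rules.append((ante, c, round(support, 3), round(conf, 3)))
    return rules

# Invented observations for five compounds:
trans = [
    {"featX", "featY", "apoptosis"},
    {"featX", "featY", "apoptosis"},
    {"featX", "featY", "apoptosis"},
    {"featX"},
    {"featY"},
]
for ante, cons, sup, conf in mine_rules(trans):
    print(set(ante), "->", cons, "support", sup, "confidence", conf)
```

Among the rules found is {featX, featY} → apoptosis with confidence 1.0, the shape of relationship described in the paragraph above.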

Using the PubChem API to download compounds and generate heat maps
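A starting point, assuming PubChem's PUG REST interface: the URL patterns below follow PubChem's documented scheme (check the current PUG REST documentation before relying on the exact property names), and the actual download is left commented out so the sketch has no network dependency. The example CIDs are arbitrary.

```python
from urllib.parse import quote
# from urllib.request import urlopen   # uncomment to actually fetch

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def property_url(cids, properties, fmt="CSV"):
    """Build a PUG REST URL requesting properties for a list of CIDs."""
    return "{}/compound/cid/{}/property/{}/{}".format(
        BASE, ",".join(str(c) for c in cids), ",".join(properties), fmt)

def name_to_cids_url(name):
    """Build a PUG REST URL resolving a compound name to CIDs."""
    return "{}/compound/name/{}/cids/JSON".format(BASE, quote(name))

print(property_url([2244, 3672], ["MolecularWeight", "XLogP"]))
# To actually download:
# csv_text = urlopen(property_url([2244, 3672],
#                                 ["MolecularWeight", "XLogP"])).read().decode()
```

The resulting CSV can then be parsed into a compound × property matrix and passed to whatever heat-map tool is in use.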