Introducing Cheminformatics: Navigating the world of chemical data
Quantitative Structure-Activity Relationships (QSAR) and Predictive Models
We have previously discussed the use of 2D and 3D descriptors to characterize compounds, and how these can be used in similarity calculation, clustering, diversity, and so on. In this section we are concerned with ways of correlating these descriptors with outcomes, such as biological activities, properties, and toxicity, and the building of predictive models based on these descriptors and correlations.
Quantitative Structure-Activity Relationships (QSAR)
The establishment of structure-activity relationships (SAR) in medicinal chemistry predates the use of computers in chemistry, and relies on correlating structural features with experimental results for multiple compounds, usually in the same series. It is common in medicinal chemistry to synthesize several related compounds (e.g. methyl-, ethyl-, butyl- forms), and then to investigate the effect of these changes on a particular property or biological activity (we might find, for instance, that extending the methyl chain reduces a particular activity). The relationship between structure and activity may or may not be quantified.
Quantitative Structure-Activity Relationships (QSAR) were originally designed as an attempt to add some mathematical basis to this process, particularly to define the activity as some function of descriptors (note that when the outcome is a property or a toxicity, this is sometimes referred to as QSPR or QSTR respectively). If we develop a function that relates descriptors to a particular activity, we can then use that function predictively for compounds where the activity is unknown but the descriptors can be calculated.
The earliest examples of QSAR were Hansch analysis and Free-Wilson analysis, which are both applications of linear regression. Hansch analysis pertained to property descriptors, and Free-Wilson, which we shall discuss here, to structural descriptors. Free-Wilson defined a function that equates activity (defined as the log of 1 / the concentration) with weighted descriptors, the weightings, or coefficients, being determined by linear regression. That is, we have the equation:
Log (1/C) = a1x1 + a2x2 + a3x3 ...
where C is the concentration required for activity; x1, x2, x3, etc. are the descriptor values (usually 1 or 0 to represent the presence or absence of features); and a1, a2, a3, etc. are the coefficients derived from linear regression. Linear regression is a generalized technique that optimizes the coefficients applied to the independent variables so that the dependent variable (in this case log 1/C) most closely matches the observed values for a set of descriptors. Thus one can think of a regression equation being built using data with known dependent values, and then being applied predictively to data with unknown dependent values. Linear regression works by minimizing the sum of the squared differences between the values predicted by the equation and the actual observations.
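As a sketch of how such a Free-Wilson fit might be set up in practice, the following Python snippet derives the coefficients by least squares; the substituent indicators and activity values here are invented purely for illustration:

```python
import numpy as np

# Hypothetical Free-Wilson data: each row is one compound, each column a
# binary indicator (1 = present, 0 = absent) for a structural feature.
X = np.array([
    [1, 0, 0],  # methyl only
    [0, 1, 0],  # ethyl only
    [0, 0, 1],  # butyl only
    [1, 1, 0],  # methyl + ethyl
    [1, 0, 1],  # methyl + butyl
], dtype=float)

# Observed activities, expressed as log(1/C).
y = np.array([1.2, 1.8, 0.9, 2.9, 2.2])

# Least-squares fit of log(1/C) = a1*x1 + a2*x2 + a3*x3.
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)

# Use the fitted equation predictively for a new compound
# bearing the second and third features.
prediction = np.array([0, 1, 1], dtype=float) @ coeffs
print(coeffs, prediction)
```

The fitted coefficients can then be read directly as the estimated activity contribution of each structural feature, which is the main attraction of the Free-Wilson formulation.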
If a regression equation is to be used predictively, then we need some way of gauging its accuracy. The simplest way to do this is with r-squared, which is the proportion of the variance in the dependent variable that is explained by the regression equation (i.e. if r-squared = 1.0, then all the actual points lie on the regression line; if r-squared = 0.0, then the variance around the regression line is as high as the overall variance of the dependent variable).
There is a problem with r-squared, though: the same data that is used to build the equation is also used to evaluate it. This can be addressed using q-squared (sometimes called cross-validated r-squared). Here, we build n versions of the equation, each time leaving one of the original known values out (it is thus an example of leave-one-out validation); q-squared is then computed from the equations' predictions for the values that were left out. q-squared is thus always less than r-squared.
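The distinction can be sketched in a few lines of Python; the descriptor matrix and activities below are invented for illustration, with r-squared computed from the full fit and q-squared from leave-one-out refits:

```python
import numpy as np

def r_squared(y, y_pred):
    # Proportion of the variance in y explained by the predictions.
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def q_squared_loo(X, y):
    # Leave-one-out cross-validation: refit the regression n times,
    # each time predicting the single compound that was held out.
    preds = np.empty_like(y)
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        coeffs, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
        preds[i] = X[i] @ coeffs
    return r_squared(y, preds)

# Invented binary substituent indicators and log(1/C) activities
# for six hypothetical compounds.
X = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1],
              [1, 1, 0], [1, 0, 1], [0, 1, 1]], dtype=float)
y = np.array([1.1, 1.7, 1.0, 3.0, 2.1, 2.8])

coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
r2 = r_squared(y, X @ coeffs)
q2 = q_squared_loo(X, y)
print(r2, q2)  # q2 comes out below r2
```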
Nonlinear approaches to QSAR
The main drawback of these early approaches is that they assume that the activity varies linearly with the descriptor values that affect it. However, this is usually not the case. Nonlinear approaches still try to correlate descriptors and outcomes, but do not make this assumption. They are thus at least theoretically more useful, although there is usually some trade-off (such as speed, scalability or interpretability). Nonlinear approaches are generally examples of supervised learning (as opposed to unsupervised learning such as clustering, although unsupervised methods may also be employed). The method used will also sometimes depend on the kind of QSAR that is to be determined - in particular, there is a difference between classification problems (such as predicting whether compounds are active or inactive) and quantitative prediction problems (where we want to predict an activity value). Some of the most frequently used nonlinear methods for QSAR are:
Decision trees (such as Recursive Partitioning)
Neural networks
Support Vector Machines
Different methods have different strengths and weaknesses: for example, neural nets are a "black box" approach and thus are not useful if we want to know why a particular prediction was made, while basic decision trees are suited mainly to classification problems.
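To make the flavour of recursive partitioning concrete, here is a minimal, illustrative sketch in Python of a single partitioning step - a one-level decision tree (a "stump") that picks the descriptor and threshold best separating actives from inactives. The descriptor values and activity classes are invented for illustration:

```python
def fit_stump(X, y):
    """Return (feature, threshold, left_label, right_label) minimising
    the number of misclassified compounds for a single split."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [y[i] for i, row in enumerate(X) if row[f] <= t]
            right = [y[i] for i, row in enumerate(X) if row[f] > t]
            # Majority vote on each side of the split.
            l_label = max(set(left), key=left.count) if left else 0
            r_label = max(set(right), key=right.count) if right else 0
            errors = sum(1 for v in left if v != l_label) + \
                     sum(1 for v in right if v != r_label)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_label, r_label)
    return best[1:]

def predict_stump(stump, row):
    f, t, l_label, r_label = stump
    return l_label if row[f] <= t else r_label

# Invented descriptors (say, logP and a size measure) and activity
# classes (1 = active, 0 = inactive) for six hypothetical compounds.
X = [[1.1, 3.2], [1.8, 2.9], [2.5, 3.1], [3.0, 2.2], [2.8, 4.0], [0.9, 2.5]]
y = [0, 0, 1, 1, 1, 0]

stump = fit_stump(X, y)
print(stump, predict_stump(stump, [2.6, 3.3]))
```

A real recursive partitioning method applies this step recursively to each side of the split, and the resulting tree of descriptor thresholds is directly interpretable - in contrast to the "black box" character of a neural net.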
Regardless of the method used, building a model will generally be done in three phases: training (presenting known data to build the model); validation (testing the model with known data that was not used to build it, such as a validation set); and prediction (using the model on truly unknown data).
Effective evaluation of models
If predictive models are to be properly evaluated there are a few basic principles that should be adhered to:
For publication, public datasets should be used, and the method and descriptors should be made freely available or described well enough that a reader could replicate the experiment
A validation set should always be used, and any success statistics should be based on the validation set, not the training set
For classification problems, always create a confusion matrix. From this, you can derive measures like sensitivity and specificity, and precision and recall. For large sets, particularly for virtual screening applications, it is appropriate to show a ROC curve (one can also calculate the AUC, or area under the curve).
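These measures all fall out of the four cells of the confusion matrix. As a sketch, the following Python snippet evaluates a hypothetical classifier's predictions on an invented validation set (labels: 1 = active, 0 = inactive):

```python
def confusion_matrix(actual, predicted):
    # Count the four outcomes: true/false positives and negatives.
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return tp, tn, fp, fn

# Invented validation-set labels and model predictions.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, tn, fp, fn = confusion_matrix(actual, predicted)
sensitivity = tp / (tp + fn)   # recall: fraction of actives found
specificity = tn / (tn + fp)   # fraction of inactives correctly rejected
precision   = tp / (tp + fp)   # fraction of predicted actives that are real
print(tp, tn, fp, fn, sensitivity, specificity, precision)
```

Note that sensitivity and recall are the same quantity; the sensitivity/specificity pairing is common in the QSAR literature, while precision/recall is the usual pairing in information retrieval.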
Some essential QSAR References
History of QSAR (a must read)
Modeling methods in QSAR/QSPR
Predictive QSAR modeling based on diversity sampling of experimental datasets for the training and test set selection
Here is a video describing how to perform QSAR in R.
After reading the materials, take the following Quiz on QSAR