Assignment 6

ASSIGNMENT 6 - Pharmacophore Modeling and Virtual Screening

1. Watch the 3D methods video on ligand-based pharmacophore modeling using LigandScout before starting this assignment.

2. You need LigandScout for this assignment. After downloading LigandScout, follow the installation instructions. You can get a 1-month license by emailing the company.
3. Several tutorials for LigandScout are available; search Google or use the Help menu in LigandScout. The video I showed covers the full process of ligand-based virtual screening and validation.

4. For this assignment I am providing a dataset based on a PknB assay from Mycobacterium tuberculosis. Download the active and inactive compounds in SDF format. Make sure all your structures are in 3D format. You can also generate 3D structures inside LigandScout using the MMFF force field, as described in the video.
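One way to check that an SDF file really contains 3D structures is to look at the z coordinates of each record. The sketch below is only an illustration: it assumes V2000 mol blocks (atom count in columns 1-3 of the counts line, z coordinate in columns 21-30 of each atom line), and the helper names `make_record` and `has_3d_coords` and the two toy records are made up for this example.

```r
# Build a tiny V2000 record for demonstration (toy data, not from the PknB set).
make_record <- function(name, xyz) {
  c(name, "  sketch", "",
    sprintf("%3d%3d  0  0  0  0  0  0  0  0999 V2000", nrow(xyz), 0L),
    sprintf("%10.4f%10.4f%10.4f %-3s 0  0", xyz[, 1], xyz[, 2], xyz[, 3], "C"),
    "M  END", "$$$$")
}

# Return one logical per record: TRUE if any atom has a non-zero z coordinate.
has_3d_coords <- function(lines) {
  ends <- which(lines == "$$$$")           # record terminators
  starts <- c(1, head(ends, -1) + 1)       # first line of each record
  vapply(seq_along(ends), function(k) {
    rec <- lines[starts[k]:ends[k]]
    n_atoms <- as.integer(substr(rec[4], 1, 3))   # counts line is line 4
    atom_lines <- rec[5:(4 + n_atoms)]
    z <- as.numeric(substr(atom_lines, 21, 30))   # z occupies columns 21-30
    any(abs(z) > 1e-6)
  }, logical(1))
}

flat <- make_record("flat", rbind(c(0, 0, 0), c(1.5, 0, 0)))
bent <- make_record("bent", rbind(c(0, 0, 0), c(1.2, 0.3, 0.8)))
sdf_lines <- c(flat, bent)
is_3d <- has_3d_coords(sdf_lines)
print(is_3d)
```

For a real file, the lines would come from `readLines()` on your downloaded SDF. A record flagged FALSE is probably 2D and needs 3D generation before conformer search.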

5. Mark every compound with a Kd value >10000 as inactive and the remaining ones as active. To create a good model you may want to exclude some structures from the active set or the inactive set.
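The labeling rule in step 5 can be sketched in a few lines of R. The compound IDs and Kd values below are invented for illustration and are not taken from the PknB dataset; the cutoff is in whatever units the dataset uses.

```r
# Hypothetical table of compounds with measured Kd values.
compounds <- data.frame(
  id = c("cmpd_1", "cmpd_2", "cmpd_3", "cmpd_4"),
  kd = c(35, 10000, 12500, 480)
)

kd_cutoff <- 10000
# Strictly greater than the cutoff means inactive; everything else is active.
compounds$label <- ifelse(compounds$kd > kd_cutoff, "inactive", "active")
print(table(compounds$label))
```

Note that a compound with Kd exactly equal to 10000 ends up in the active set under this reading of the rule.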

6. Generate conformers using the fast or best settings and create a merged feature pharmacophore model. Select the best 3 models and then plot a ROC curve of your data. For the ROC curve you need to convert the actives and inactives, separately, into .ldb format and screen them against your model. Have a look at this excellent blog for plotting ROC curves, both part 1 and part 2. I have also provided a PknB dataset for testing; it has 36 actives and 999 decoys (1035 compounds). Post your pharmacophore model on the cheminfoclub wiki and compare the model with the paper discussed here.

Note: A large dataset might increase the computation time. First try a short run on 10 active and 10 or 20 inactive compounds.
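If you want to draw the ROC curve yourself in R, the following is a minimal sketch in base R. It assumes each screened compound has a numeric score where higher means a better match to the model, and a label of 1 for actives and 0 for inactives. The scores and labels below are made up, the function names `roc_points` and `auc_trapezoid` are illustrative, and tied scores are not treated specially.

```r
# ROC points: walk down the ranked list and accumulate true/false positive rates.
roc_points <- function(score, label) {
  ord <- order(score, decreasing = TRUE)
  label <- label[ord]
  tpr <- c(0, cumsum(label == 1) / sum(label == 1))
  fpr <- c(0, cumsum(label == 0) / sum(label == 0))
  data.frame(fpr = fpr, tpr = tpr)
}

# Area under the curve by the trapezoid rule.
auc_trapezoid <- function(roc) {
  sum(diff(roc$fpr) * (head(roc$tpr, -1) + tail(roc$tpr, -1)) / 2)
}

score <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4)   # toy scores
label <- c(1, 1, 0, 1, 0, 0)               # toy labels
roc <- roc_points(score, label)
auc <- auc_trapezoid(roc)

plot(roc$fpr, roc$tpr, type = "l",
     xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)   # diagonal = random ranking
print(auc)
```

An AUC near 0.5 means the model ranks actives no better than chance; values closer to 1 mean actives are ranked ahead of inactives.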

7. Calculate the enrichment factor for top 1%, top 5% and top 20% of the database hits.

I have provided an R function below to calculate enrichment:
enrichment <- function (x, y, top = 0.05, decreasing = TRUE)
{
    # x: scores, y: labels (1 = active), top: fraction of the ranked list to examine
    N <- length(y)
    n <- sum(y == 1)
    x_prev <- -Inf
    area <- 0
    fp = tp = fp_prev = tp_prev = 0
    ord <- order(x, decreasing = decreasing)
    for (i in seq_along(ord)) {
        j <- ord[i]
        if (x[j] != x_prev) {
            if (fp + tp >= N * top) {
                # cutoff reached: interpolate within the last block of tied scores
                n_right <- (fp - fp_prev) + (tp - tp_prev)
                rat <- (N * top - (fp_prev + tp_prev))/n_right
                tp_r <- tp_prev + rat * (tp - tp_prev)
                return((tp_r/(N * top))/(n/N))
            }
            x_prev <- x[j]
            fp_prev <- fp
            tp_prev <- tp
        }
        if (y[j] == 1) {
            tp <- tp + 1
        }
        else {
            fp <- fp + 1
        }
    }
    # fallback if the cutoff is never reached inside the loop
    # (for example top = 1, where the enrichment is 1 by definition)
    return(1)
}
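# To use it, suppose x holds the scores and y the labels (1 = active, 0 = inactive).
x <- rnorm(1000)
# y is an illustrative label vector assumed here: 50 actives, 950 inactives
y <- c(rep(1, 50), rep(0, 950))
e <- enrichment(x, y, top = 0.05)
# e.g. e = 1.1; random scores give no real enrichment, so the value scatters around 1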
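As a cross-check on the function above, the enrichment factor can also be computed directly from its definition: the fraction of actives in the top-ranked subset divided by the fraction of actives in the whole database. The sketch below is a simplified version that does not interpolate across tied scores, so it can differ slightly from the function above. The name `ef_at` is illustrative, and the scores are an artificial perfect ranking used only to show the upper bound for a set of 36 actives and 999 decoys.

```r
# Simple enrichment factor at a given fraction of the ranked list.
ef_at <- function(score, label, top) {
  n_top <- max(1, round(length(label) * top))      # compounds in the top subset
  ord <- order(score, decreasing = TRUE)
  hits_top <- sum(label[ord][seq_len(n_top)] == 1) # actives found in the subset
  (hits_top / n_top) / (sum(label == 1) / length(label))
}

label <- c(rep(1, 36), rep(0, 999))   # 36 actives, 999 decoys
score <- rev(seq_along(label))        # artificial scores: every active ranked first

ef <- sapply(c(0.01, 0.05, 0.20), function(p) ef_at(score, label, p))
print(round(ef, 2))   # approximately 28.75, 19.90, 5.00
```

These are the best values any model could reach on a set of this size, so the enrichment factors from your own model can be read against them.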

8. Which features do you think are most important in your model, compared to the paper discussed?