ASSIGNMENT 5.1

1. How can machine learning methods be used in drug discovery studies?
2. Enumerate the steps involved in building a QSAR/QSPR model.
3. What is a confusion matrix?
4. List the various machine learning methods and their uses.
5. Why are pipelining tools useful in data mining?


ASSIGNMENT 5.2 - PREDICTIVE MODELING

Software requirements: R, Open Babel, and the R packages rcdk, fingerprint, rpubchem, ROCR, rpart, randomForest, and party. (You can use these packages or write your own R scripts instead.)

1. Install Open Babel, R, and the given R packages on Windows, Linux, or macOS.

2. PubChem BioAssay (http://www.ncbi.nlm.nih.gov/pcassay) provides access to the bioassay results of tested compounds. Select any one of the given bioassays and perform your experiments on it.
a) AID: 602118 b) AID: 602156 c) AID: 540252 d) AID: 504318

3. Download the data for your chosen bioassay, i.e. the active and inactive sets of compounds, in SDF format.

4. Using Open Babel, filter the compounds: strip the salts, remove duplicate compounds, keep only compounds whose number of heteroatoms is >= 10 and <= 60, and append the activity outcome (active or inactive) to the output file (SDF or SMILES). A sample command line is sketched below.
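
A sketch of one way to do this on the command line; the option and descriptor names are taken from the Open Babel documentation, so verify them against your installation with obabel -H and obabel -L descriptors:

# Strip salts (-r keeps only the largest fragment), drop duplicates,
# keep molecules with 10-60 atoms, and tag every record with its outcome.
# Note: the "atoms" descriptor counts all atoms; a strict heteroatom
# count would need a SMARTS-based filter instead.
obabel Actives.sdf -O Actives_filtered.sdf -r --unique \
       --filter "atoms>=10 atoms<=60" --property Outcome active
# Repeat with Inactives.sdf and --property Outcome inactive.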

5. Install the rcdk, fingerprint, ROCR, rpart, randomForest, and party packages in R.

6. Load all the packages into R. We will use rcdk to generate fingerprints/descriptors for the molecules (actives and inactives) filtered with Open Babel. You can select any one of the datasets to perform the modeling experiment. A minimal install-and-load sketch is given below.
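
A minimal way to install (once) and load the required packages:

install.packages(c("rcdk", "fingerprint", "ROCR", "rpart", "randomForest", "party"))
library(rcdk)
library(fingerprint)
library(ROCR)
library(rpart)
library(randomForest)
library(party)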

7. Generate the MACCS (166-bit), PubChem (881-bit), extended (depth = 6, size = 1024), and E-state fingerprints for both the active and inactive compounds, and save the data as CSV files. The following lines show how to load molecules, generate fingerprints, and convert the fingerprints to a binary matrix; a sketch for the remaining fingerprint types follows the block. If you want to model with other rcdk features, you are welcome to do so. Features other than binary ones sometimes need to be scaled; a sample scaling function that you can modify is given below.

# Load the molecules, compute MACCS fingerprints, and save them as a
# binary matrix (one row per molecule, one column per bit).
dat <- load.molecules("Actives.smi")
fplist <- lapply(dat, get.fingerprint, type = "maccs")
fpm <- fp.to.matrix(fplist)
write.csv(fpm, "Actives.csv", row.names = FALSE)  # keep row indices out of the feature matrix
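
The same pattern covers the remaining fingerprint types; a minimal sketch (type names per the rcdk documentation; depth and size apply only to the extended fingerprint):

fplist.pubchem  <- lapply(dat, get.fingerprint, type = "pubchem")
fplist.extended <- lapply(dat, get.fingerprint, type = "extended", depth = 6, size = 1024)
fplist.estate   <- lapply(dat, get.fingerprint, type = "estate")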
# Given a data frame and the column names to scale, creates new
# columns: feature.scale = (feature - mean) / sd
feature.scale <- function(data, cols) {
  for (col in cols) {
    sigma <- sd(data[[col]])   # data[[col]] extracts the column as a vector
    mu <- mean(data[[col]])
    data[paste0(col, ".scale")] <- (data[[col]] - mu) / sigma
  }
  return(data)
}
 
data <- feature.scale(mydata, c("area", "bedrooms"))


8. Add an Outcome column (active or inactive) to each CSV file and merge the two files into a single file. Load the merged CSV file and perform the modeling experiments; a sketch is given below.
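
A minimal sketch of the merge, assuming the fingerprint matrices from step 7 were saved (without row names) as Actives.csv and Inactives.csv:

actives   <- read.csv("Actives.csv")
inactives <- read.csv("Inactives.csv")
actives$Outcome   <- "active"      # append the activity outcome
inactives$Outcome <- "inactive"
full <- rbind(actives, inactives)  # merge into a single data set
write.csv(full, "dataset.csv", row.names = FALSE)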

9. Replace any NA values in the data with the corresponding column means. For modeling you also need to remove constant columns and highly correlated columns (correlation > 0.95). Perform your modeling with the remaining attributes.

## Removing constant or near-constant columns
# (drops columns in which more than 80% of the values are zero)

d <- data.frame(somedata)
dropc <- apply(d, 2, function(x) { length(which(x == 0)) / length(x) > 0.8 })
d <- d[, !dropc]
 



Note: to replace missing values with column means you can use the one-liner below; to reduce variables based on correlations you can use the FSelector package or the code given after it.

# Replace each NA with the mean of its column (column-major recycling).
dataset <- ifelse(is.na(data),
                  rep(colMeans(data, na.rm = TRUE), rep(nrow(data), ncol(data))),
                  unlist(data))

# Removing correlated columns: returns the indices of the columns to keep.
# cutoff is treated as an r^2 threshold, so pairwise |r| is compared
# against sqrt(cutoff); one column of each correlated pair is dropped
# at random.
corr.rem <- function(d, cutoff = 0.9) {
  if (cutoff > 1 || cutoff <= 0) {
    stop("cutoff must satisfy 0 < cutoff <= 1")
  }
  if (!is.matrix(d) && !is.data.frame(d)) {
    stop("Must supply a data.frame or matrix")
  }
  r2cut <- sqrt(cutoff)
  cor.mat <- cor(d)
  bad.idx <- which(abs(cor.mat) > r2cut, arr.ind = TRUE)
  # keep only the lower triangle so each correlated pair appears once
  bad.idx <- matrix(bad.idx[bad.idx[, 1] > bad.idx[, 2]], ncol = 2)
  drop.idx <- ifelse(runif(nrow(bad.idx)) > 0.5, bad.idx[, 1], bad.idx[, 2])
  if (length(drop.idx) == 0) {
    1:ncol(d)
  } else {
    (1:ncol(d))[-unique(drop.idx)]
  }
}
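
Hypothetical usage on the merged data frame, keeping the Outcome column out of the correlation check:

feat <- full[, names(full) != "Outcome"]
keep <- corr.rem(feat, cutoff = 0.95)                # indices of the columns to keep
full <- cbind(feat[, keep], Outcome = full$Outcome)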


10. Split the data into a training set (80%) and a test set (20%), and train on the training set with 5-fold cross-validation (you may need to write the k-fold cross-validation code yourself). A sketch is given below.
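
A minimal sketch of the 80/20 split and the fold assignment, assuming the merged data frame full from step 8:

set.seed(1)                          # reproducible split
full$Outcome <- factor(full$Outcome)
train.idx <- sample(nrow(full), floor(0.8 * nrow(full)))
train <- full[train.idx, ]
test  <- full[-train.idx, ]

folds <- sample(rep(1:5, length.out = nrow(train)))  # one fold label per row
for (k in 1:5) {
  cv.train <- train[folds != k, ]
  cv.test  <- train[folds == k, ]
  # fit the classifier on cv.train, e.g. randomForest(Outcome ~ ., data = cv.train),
  # and evaluate it on cv.test
}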

11. Check the performance measures of the different fingerprints and describe which one performs best. The performance measure variables are given in this wiki; a sketch of how to derive them from a confusion matrix is given below.
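
For reference, a minimal sketch of a confusion matrix and measures derived from it, assuming pred.class holds the predicted classes for the test set and both it and test$Outcome use the labels "active" and "inactive":

cm <- table(Predicted = pred.class, Actual = test$Outcome)
accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["active", "active"] / sum(cm[, "active"])        # TP / (TP + FN)
specificity <- cm["inactive", "inactive"] / sum(cm[, "inactive"])  # TN / (TN + FP)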

12. Use the ROCR package to compute the performance measures and plot the ROC curve and lift chart for each fingerprint.
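
A minimal ROCR sketch, assuming probs holds the predicted probability of the active class for each test compound (e.g. predict(model, test, type = "prob")[, "active"] for randomForest):

library(ROCR)
pred <- prediction(probs, test$Outcome == "active")
perf.roc <- performance(pred, "tpr", "fpr")    # ROC curve
plot(perf.roc, main = "ROC curve")
perf.lift <- performance(pred, "lift", "rpp")  # lift chart
plot(perf.lift, main = "Lift chart")
performance(pred, "auc")@y.values[[1]]         # area under the ROC curve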

13. Submit a PDF file for this assignment that contains:

(i) your ROC plots for the extended, PubChem, and MACCS fingerprints,

(ii) the confusion matrices,

(iii) a short description of which you think is the best model and why, and

(iv) your R code or KNIME workflow in ZIP format.




You can also use KNIME to prepare the model; a video on how to use KNIME for predictive modeling is given below. You need to extend this workflow by including more machine learning classifiers and by taking more features into account. Check my Git repository for the workflows.