We will look at two toolkits: the CDK, an open source Java toolkit, and OEChem, a C/C++ toolkit created by OpenEye. These materials are based on those originally written by Rajarshi Guha.

CDK

What Do We Need?

  • The CDK jar files. Since the CDK is modular, there are a number of jar files that may be required. An easier way out is to get the comprehensive jar which contains all the CDK jars as well as any dependencies. You can get the
  • A Java IDE
  • The source code examples are in the I573 SVN repository
  • For more examples of nearly all classes in the CDK look at the extensive JUnit test files

What Does the CDK Provide?

  • Fundamental chemical objects
    • atoms
    • bonds
    • molecules
  • More complex objects are also available
    • Sequences (such as a protein sequence)
    • Reactions
    • Collections of molecules
  • Input/output for a wide variety of chemical file formats
  • SMILES parsing and generation
  • Fingerprints
  • Fragment generation
  • Rigid alignments
  • Substructure search
  • Atom typing
  • Partial charges (Gasteiger-Marsili)
  • 3D coordinates and force fields
  • Molecular descriptors
  • Integration with R and Weka
Also see the keyword list to get an idea of the things that can be done. A feature list is also available.

Who's Using It?

A number of projects use the CDK under the hood

Programming With The CDK


Atoms

Create a carbon atom object using
IAtom atom = DefaultChemObjectBuilder.getInstance().newAtom("C");
 
You can also specify 2D or 3D coordinates
Once you have an atom, you can get/set coordinates, charge, hydrogen count, parity etc.

Bonds

Creating a bond is similar to creating an atom. As before, use DefaultChemObjectBuilder
IBond bond = DefaultChemObjectBuilder.getInstance().newBond(atom1, atom2, order);
 
In addition it is also possible to specify the stereo orientation of the bond.

Molecules

In the CDK a molecule is fundamentally represented as an AtomContainer object. For many purposes this is fine. In some cases specialization is required so we have
  • Crystal
  • Ring
  • Molecule
  • Fragment
as subclasses. Some methods require a specific subclass. Many methods simply require an AtomContainer and so any of these subclasses can be supplied. As before we can create an object doing
IAtomContainer container = DefaultChemObjectBuilder.getInstance().newAtomContainer();
 
and then populate it
container.addAtom( atom1 );
container.addAtom( atom2 );
container.addBond( 0, 1, 1.5 );  // atoms are referenced by their index in the container
 
Since most code will deal with these types of objects what can we do with them?
  • Loop over atoms or bonds
Iterator iter = container.atoms();
while( iter.hasNext() ) {
  IAtom atom = (IAtom) iter.next();
}
 
  • Get a specific atom by serial number
IAtom atom = container.getAtom(4);
 
  • Get the serial number for a given atom
int serial = container.getAtomNumber(anAtom);
 

Input/Output

A common format is the SMILES format. For this we use the SmilesParser:
SmilesParser sp = new SmilesParser(DefaultChemObjectBuilder.getInstance());
IMolecule molecule = sp.parseSmiles("CCCC(CC)CC=CC=CC");
 
We can also create a SMILES string from a molecule object
SmilesGenerator sg = new SmilesGenerator();
String smiles = sg.createSMILES(molecule);
System.out.println("SMILES = "+smiles);
 
Another requirement will be to read in a file from disk, such as an SD file. This process is a little longer, due to the abstraction but the basic idea is:
FileReader fileReader = new FileReader(new File(args[0]));
MDLReader mdlReader = new MDLReader(fileReader);
 
IChemFile chemFile = (ChemFile) mdlReader.read(new ChemFile());
List containers = ChemFileManipulator.getAllAtomContainers(chemFile);
 
We can also easily write a molecule to any supported format, say SDF, by doing
FileWriter w1 = new FileWriter(new File("molecule.sdf"));
try {
   MDLWriter mw = new MDLWriter(w1);
   mw.write(molecule);
   mw.close();
} catch (Exception e) {
   System.out.println(e.toString());
}
 

SD tags

  • SD files are a common format
  • Includes atom, bond information
  • Can also contain arbitrary information in form of tags:
  CDK    1/28/07,15:24
 
  2  1  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
M  END
> <myProperty>
1.2
 
> <anotherProperty>
Hello world
 
  • When a molecule is read in from an SD file we can get the value of a given tag by doing
Object value = molecule.getProperty("myProperty");
 
  • When writing a molecule we can supply a set of tags in a HashMap
HashMap tags = new HashMap();
tags.put("myProperty", new Double(1.2));
 
To ensure tags are written to disk we would do
FileWriter w1 = new FileWriter(new File("molecule.sdf"));
try {
   MDLWriter mw = new MDLWriter(w1);
   mw.setSdFields( tags );
   mw.write(molecule);
   mw.close();
} catch (Exception e) {
   System.out.println(e.toString());
}
 
Note that a molecule might have several properties associated with it. If we wanted to write all of them to disk we can avoid creating an extra HashMap and simply do
   MDLWriter mw = new MDLWriter(w1);
   mw.setSdFields( molecule.getProperties() );
   mw.write(molecule);
   mw.close();
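
To make the tag layout concrete, here is a minimal, toolkit-independent sketch in Python that pulls the tag-value pairs out of the SD record shown earlier. It is not how the CDK reads SD files; the function name read_sd_tags and the simplified handling (values end at the first blank line, a record ends at $$$$) are assumptions made for illustration.

```python
import re

# A data header looks like "> <tagName>"; capture the name between < and >
HEADER = re.compile(r"^>.*<([^>]+)>")

def read_sd_tags(record):
    """Collect the '> <tag>' data items of one SD record into a dict."""
    tags = {}
    current = None      # tag whose value lines are being collected
    in_data = False     # becomes True once the "M  END" line has been seen
    for line in record.splitlines():
        if not in_data:
            in_data = line.startswith("M  END")
            continue
        if line.startswith("$$$$"):     # end of this record
            break
        match = HEADER.match(line)
        if match:
            current = match.group(1)
            tags[current] = []
        elif line.strip() == "":        # a blank line closes the current value
            current = None
        elif current is not None:
            tags[current].append(line)
    return {tag: "\n".join(value) for tag, value in tags.items()}

record = """\

  CDK    1/28/07,15:24

  2  1  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
M  END
> <myProperty>
1.2

> <anotherProperty>
Hello world

$$$$
"""

tags = read_sd_tags(record)
print(tags)
```

Note that every value comes back as a string, so a numeric tag such as myProperty has to be converted by the caller.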
 

Cheminformatics tasks


Generating a canonical SMILES

The SmilesGenerator class will create canonical SMILES. So a very simple example is
String smiles = "C(C)(C)CC=CC(C(CC(C))CC)CC";
SmilesParser sp = new SmilesParser(DefaultChemObjectBuilder.getInstance());
IMolecule mol = sp.parseSmiles(smiles);
 
SmilesGenerator sg = new SmilesGenerator();
String canSmi = sg.createSMILES(mol);
 
Similarly, we can read in a SDF formatted molecule and generate a canonical SMILES string

Getting fingerprints

Fingerprints are used for a variety of purposes, such as database screening and similarity searching.
The CDK can generate binary fingerprints very simply:
BitSet fingerprint = Fingerprinter.getFingerprint(molecule);
 
This gives a default fingerprint of 1024 bits. It is possible to specify smaller or larger fingerprints. The CDK also provides variants of the default fingerprint
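
To give a feel for what a fixed-length hashed fingerprint is, the following toy sketch in Python enumerates short atom paths in a molecular graph and hashes each one to a position in a 1024-bit space. It is only an illustration of the general idea and is not the CDK's algorithm; the function name path_fingerprint, the path labels and the use of CRC32 as the hash are all assumptions.

```python
import zlib

def path_fingerprint(symbols, bonds, max_len=4, n_bits=1024):
    """Toy hashed path fingerprint; returns the set of 'on' bit positions."""
    # Build an adjacency list from the (atom index, atom index) bond pairs
    neighbours = {i: [] for i in range(len(symbols))}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)

    bits = set()

    def walk(path):
        # Hash the element symbols along the current path to one bit position
        label = "-".join(symbols[i] for i in path)
        bits.add(zlib.crc32(label.encode("ascii")) % n_bits)
        if len(path) < max_len:
            for nxt in neighbours[path[-1]]:
                if nxt not in path:     # never revisit an atom within a path
                    walk(path + [nxt])

    for start in range(len(symbols)):
        walk([start])
    return bits

# Heavy-atom graphs of ethanol (C-C-O) and propane (C-C-C)
ethanol = path_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
propane = path_fingerprint(["C", "C", "C"], [(0, 1), (1, 2)])
print(sorted(ethanol))
print(sorted(propane))
```

Because both molecules contain the paths C and C-C, the two bit sets overlap, which is what makes such fingerprints usable for screening and similarity.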

Evaluating similarity

Given two molecules we are interested in determining how similar they are. A common approach is to evaluate their fingerprints and then calculate the Tanimoto coefficient. We can do this in the CDK quite easily
BitSet fp1 = Fingerprinter.getFingerprint(molecule1);
BitSet fp2 = Fingerprinter.getFingerprint(molecule2);
 
float tc = Tanimoto.calculate(fp1, fp2);
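
For reference, the Tanimoto coefficient of two binary fingerprints is the number of bits set in both divided by the number of bits set in either. The short Python sketch below computes it on fingerprints represented as sets of 'on' bit positions. It is independent of the CDK; the function name tanimoto and the choice to score two empty fingerprints as 0 are assumptions.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bits."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # assumption: two empty fingerprints are scored as 0
    return float(len(a & b)) / union

fp1 = {1, 5, 9, 200}
fp2 = {5, 9, 77}
print(tanimoto(fp1, fp2))  # 2 shared bits / 5 distinct bits = 0.4
```

A value of 1.0 means the two fingerprints are identical and 0.0 means they share no bits.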
 

Calculating a molecular descriptor

If you know which descriptor to calculate, say ZagrebIndexDescriptor, you can use the following procedure
ZagrebIndexDescriptor desc = new ZagrebIndexDescriptor();
DescriptorValue value = desc.calculate(atomContainer);
DoubleResult result = (DoubleResult) value.getValue();
System.out.println("value = "+result.doubleValue());
 
The value object is useful since in addition to containing the numeric value it also stores information such as who wrote the descriptor and a link to a dictionary from which we can extract further information.
In case we want to evaluate all descriptors available we can do
IMolecule molecule;
 
DescriptorEngine engine = new DescriptorEngine();
engine.process(molecule);
 
The descriptors are placed as properties of the molecule (which can be obtained using getProperty)
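
As an illustration of what a simple topological descriptor computes, the sketch below evaluates a Zagreb index in plain Python, assuming the usual definition as the sum of squared heavy-atom degrees. The CDK implementation may differ in details such as hydrogen handling; the function name zagreb_index and the graph representation are assumptions.

```python
def zagreb_index(n_atoms, bonds):
    """Sum of squared vertex degrees of a hydrogen-suppressed molecular graph."""
    degree = [0] * n_atoms
    for a, b in bonds:      # each bond raises the degree of both end atoms
        degree[a] += 1
        degree[b] += 1
    return sum(d * d for d in degree)

# n-butane: C0-C1-C2-C3, degrees 1, 2, 2, 1
n_butane = zagreb_index(4, [(0, 1), (1, 2), (2, 3)])
# isobutane: central C0 bonded to C1, C2, C3, degrees 3, 1, 1, 1
isobutane = zagreb_index(4, [(0, 1), (0, 2), (0, 3)])
print(n_butane, isobutane)  # 10 12
```

The more branched isomer gets the larger value, which is why indices of this kind are used as rough measures of branching.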


OEChem



Where to Get It

  • OpenEye downloads
    • The toolkit distribution includes OEChem as well as the other toolkits
    • Python and Java wrappers
  • You will need an OpenEye license
    • Cheminfo has this

Set Up

  • If you plan on using the Python wrappers, you'll need to set some paths
#> export PYTHONPATH=/usr/local/openeye/wrappers/python:$PYTHONPATH
#> export LD_LIBRARY_PATH=/usr/local/openeye/wrappers/libs:$LD_LIBRARY_PATH
 
  • When writing Python code, add the following import to get all the OEChem library functionality
from openeye.oechem import *
 
  • If you're going to be coding from C/C++ then the gcc invocation will need the following arguments
-I/usr/local/openeye/include -L/usr/local/openeye/lib \
-loechem -loesystem -loeplatform -loebio \
-lz -lm -lpthread
 

Resources

Where Is It Used?

As a commercial toolkit, it's quite polished. As a result, it's quite widely used in the pharmaceutical industry. Some well known and public sites that use the OE toolkits include

What Features Does It Provide?

  • Basic chemical objects
    • Atoms, bonds, molecules
  • Complex objects
    • Protein sequences
    • Conformers
  • Support for many chemical formats
  • SMILES parsing/generation
  • Good stereochemistry support
  • Multiple models for aromaticity

Programming with the Python Wrappers to OEChem


Atoms and Bonds

  • These objects only exist in the context of a molecule
mol = OEMol()
a1 = mol.NewAtom(6)
a2 = mol.NewAtom(6)
a3 = mol.NewAtom(6)
b1 = mol.NewBond(a1,a2, 1)
b2 = mol.NewBond(a2,a3, 1)
 
  • Atoms have a variety of properties that can be obtained or set
    • atomic number, formal charge
    • symmetry
    • type (is it a carbon, metal, halogen etc)
  • Similarly bonds will have properties such as order, is it in a ring or not and so on
  • It is possible to get numerical indices for atoms and bonds using the GetIdx() member function
atom.GetIdx()
 

Molecules

There are two types of molecule objects that can be utilized
  • OEMol
  • OEGraphMol
Very simply, OEMol objects can handle both single-conformer and multi-conformer molecules, whereas OEGraphMol objects only handle single-conformer molecules. Since Python is dynamically typed, we don't really have to care about the difference unless we specifically need to take multi-conformer molecules into account.
  • We can create an empty molecule by
>>> mol = OEMol()
 
Given a molecule object we can loop over the atoms and bonds very simply
mol = OEMol()
OEParseSmiles(mol, 'C1CC(CC=N)CC1')
for atom in mol.GetAtoms():
  print(atom.GetAtomicNum())
 
  • When a molecule is read in, it is already configured. That is,
    • aromaticity detection is performed
    • ring perception is performed
    • ...

Input/Output

As before, we'll consider the easiest way to get a molecule into the program, using SMILES. First we need to create the molecule and then parse a SMILES into the molecule
>>> mol = OEMol()
>>> status = OEParseSmiles(mol, 'c1ccccc1')
 
Note that the OEParseSmiles function returns a boolean: True indicates that the SMILES was parsed OK. If not, it returns False and prints a warning message.
We can then write a molecule as a SMILES string. In this case we use OECreateCanSmiString, which generates a canonical SMILES string
>>> mol = OEMol()
>>> status = OEParseSmiles(mol, 'C1CCC(CC=CC(C)(C))CCC1')
>>> smi = OECreateCanSmiString(mol)
>>> smi
'CC(C)C=CCC1CCCCCC1'
 
NOTE: Due to the way the OEChem toolkit is designed it may be required to explicitly clear a molecule object before reusing it. So for example if we did:
>>> mol = OEMol()
>>> OEParseSmiles(mol, 'C1CCC1')
>>> OEParseSmiles(mol, 'c1ccccc1')
 
and we then generate the canonical SMILES string, we do not get just benzene (as it was the last one we parsed). Instead we get cyclobutane and benzene as disconnected fragments
>>> smi = OECreateCanSmiString(mol)
>>> smi
'c1ccccc1.C1CCC1'
 
So the correct way to do this is to clear the molecule object when we want to consider a new one (you could also just create a new molecule object, but this is not memory efficient)
>>> mol = OEMol()
>>> OEParseSmiles(mol, 'C1CCC1')
>>> mol.Clear()
>>> OEParseSmiles(mol, 'c1ccccc1')
>>> smi = OECreateCanSmiString(mol)
>>> smi
'c1ccccc1'
 

Reading from files

Reading a file is similar to the way it is done in C++, except we can use Python generators. Format is automatically detected:
ifs = oemolistream()
ifs.open('pdgfr.sdf')
for mol in ifs.GetOEMols():
  print(mol.GetTitle())
 
Writing is done similarly, though in this case we need to specify the format
ifs = oemolistream()
ofs = oemolostream()
ifs.open('pdgfr.sdf')
ofs.open('mypdgfr.smi')
ofs.SetFormat(OEFormat_ISM)
for mol in ifs.GetOEMols():
  print(mol.GetTitle())
  OEWriteMolecule(ofs, mol)
 

NOTE: When iterating through a file like above, don't try and save the molecules (say in a list) since the molecule object is reused in the loop. If you need to save molecules for use outside the loop, do:
ifs = oemolistream()
ifs.open('pdgfr.sdf')
 
mlist = []
for mol in ifs.GetOEMols():
  aMol = OEMol(mol)
  mlist.append(aMol)
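
The pitfall is not specific to OEChem: any generator that recycles one object behaves this way. The plain-Python analogy below uses a hypothetical generator, reused_records, that stands in for the file reader (no OEChem calls are involved) and shows why storing the yielded object directly fails while storing a copy works.

```python
import copy

def reused_records(titles):
    """Hypothetical stand-in for a reader that recycles one object per step."""
    record = {}
    for title in titles:
        record.clear()              # the same dict is overwritten every time
        record["title"] = title
        yield record

titles = ["mol-1", "mol-2", "mol-3"]

# Storing the yielded object keeps three references to the same dict
aliased = [rec for rec in reused_records(titles)]

# Storing a copy keeps three independent records
copied = [copy.deepcopy(rec) for rec in reused_records(titles)]

print([rec["title"] for rec in aliased])  # ['mol-3', 'mol-3', 'mol-3']
print([rec["title"] for rec in copied])   # ['mol-1', 'mol-2', 'mol-3']
```

Constructing a new molecule from the loop variable, as in the OEChem snippet above, plays the same role as the copy here.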
 

SD Tags

As before we can easily extract SD tag values.
ifs = oemolistream()
ifs.open('comp.sdf')
mol = OEGraphMol()
OEReadMolecule(ifs, mol)
 
OEGetSDData(mol, 'PUBCHEM_COMPOUND_CID')
 
We can also easily loop over tag-value pairs by doing
for tagpair in OEGetSDDataPairs(mol):
  print(tagpair.GetTag(), ' --> ', tagpair.GetValue())
 
Setting a tag is also easy
OESetSDData(mol, 'myOwnTag', '1.2.3.4')
 
When the molecule is written out (in SD format) it will have this tag-value pair written as well


Cheminformatics Tasks


Generating a canonical SMILES

This is a trivial example
mol = OEMol()
OEParseSmiles(mol, 'c1ccccc1')
OECreateCanSmiString(mol)
 

Convert an SD file to a SMILES file

ifs = oemolistream()
ifs.open('file1.sdf')
 
ofs = oemolostream()
ofs.open('file2.smi')
ofs.SetFormat(OEFormat_CAN)
 
for mol in ifs.GetOEMols():
  OEWriteMolecule(ofs, mol)

Add explicit hydrogens

In some applications we are interested in explicit hydrogens (such as when we build a 3D structure). We can loop over each atom in a molecule and add explicit hydrogens. This is useful when we need to consider, say, a subset of the atoms in a molecule.
mol = OEMol()
OEParseSmiles(mol, 'c1ccc(CC=CC)cc1')
for atom in mol.GetAtoms():
  OEAddExplicitHydrogens(mol, atom)
 
If we simply want to add explicit hydrogens to all atoms in the molecule we can do
mol = OEMol()
OEParseSmiles(mol, 'c1ccc(CC=CC)cc1')
OEAddExplicitHydrogens(mol)