Introducing Cheminformatics: Navigating the world of chemical data
Representation of 2D structures on computer
Why do we need to handle chemical information in special ways on a computer? This module will cover the need for special 2D chemical representations, the SMILES and InChI linear notations, internal graph theory representations using adjacency matrices, and some of the subtleties that come from aromaticity, tautomerism and stereochemistry.
Related websites & resources
Henry Stewart 2D Representation talk
Introducing Cheminformatics eBook
The simple rules of SMILES are:
Atoms are represented by their atomic symbols
Hydrogen atoms automatically saturate free valences and are not written explicitly.
Neighbouring atoms stand next to each other.
Double and triple bonds are represented by "=" and "#" respectively.
Branches are represented by parentheses ().
Rings are described by allocating digits to the two "connecting" ring atoms.
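The rules above can be sketched as a very small parser. This is a minimal illustration only, assuming a restricted subset of SMILES (a handful of elements, no brackets, charges, aromatic lowercase atoms or stereochemistry); the names `parse_smiles`, `implicit_hydrogens` and `VALENCE` are made up for this example and do not come from any toolkit.

```python
# Assumed default valences for the few elements this sketch supports.
VALENCE = {"B": 3, "C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1, "Br": 1, "I": 1}
BOND_ORDER = {"-": 1, "=": 2, "#": 3}


def parse_smiles(smiles):
    """Parse a restricted SMILES string into (atoms, bonds).

    atoms is a list of element symbols; bonds is a list of
    (atom_index, atom_index, bond_order) tuples.
    """
    atoms, bonds = [], []
    prev = None          # atom that the next atom will be bonded to
    pending = 1          # order of the next bond (single unless stated)
    branch_stack = []    # atoms to return to when a branch closes
    ring_open = {}       # ring digit -> (atom index, bond order)
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch in BOND_ORDER:
            pending = BOND_ORDER[ch]
        elif ch == "(":
            branch_stack.append(prev)
        elif ch == ")":
            prev = branch_stack.pop()
        elif ch.isdigit():
            if ch in ring_open:              # second digit closes the ring
                start, order = ring_open.pop(ch)
                bonds.append((start, prev, max(order, pending)))
            else:                            # first digit opens the ring
                ring_open[ch] = (prev, pending)
            pending = 1
        else:
            two = smiles[i:i + 2]
            symbol = two if two in VALENCE else ch
            if symbol not in VALENCE:
                raise ValueError(f"unsupported SMILES token: {symbol!r}")
            atoms.append(symbol)
            if prev is not None:             # neighbouring atoms are bonded
                bonds.append((prev, len(atoms) - 1, pending))
            prev = len(atoms) - 1
            pending = 1
            i += len(symbol) - 1
        i += 1
    return atoms, bonds


def implicit_hydrogens(atoms, bonds):
    """Hydrogens needed to saturate each atom's free valences."""
    used = [0] * len(atoms)
    for a, b, order in bonds:
        used[a] += order
        used[b] += order
    return [VALENCE[sym] - u for sym, u in zip(atoms, used)]


atoms, bonds = parse_smiles("CC(=O)O")       # acetic acid
print(atoms, bonds, implicit_hydrogens(atoms, bonds))
```

Real SMILES toolkits handle far more (aromaticity, charges, isotopes, stereochemistry), so this sketch is only meant to make the six rules concrete.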
Try to create a SMILES by hand for
. You can cross-reference your solution with the SMILES on the ChemSpider page.
Historic ways of representing chemicals
Trivial name, e.g. Baking Soda, Aspirin, Citric Acid, etc. Identifies the compound, but gives no (or little) information about what it consists of
Chemical formula, e.g. C6H12O6. Specifies the type and quantity of the atoms in the compound, but not its structure (i.e. how the atoms are connected by bonds)
Systematic name, e.g. 1,2-dibromo-3-chloropropane. Identifies the atoms present and how they are connected by bonds.
2D chemical structure diagram
3D chemical structure diagram
Early pioneers in using computers in chemistry considered two questions: how do we communicate structural information between humans and (text-only) computers, and how do we represent the atoms and bonds in a molecule once they are stored internally on a computer? The answer to the former question was linear notations: clever ways of representing 2D structures in a text string. The earliest example was
Wiswesser Line Notation
, followed by Beilstein's ROSDAL (which is still used in a limited fashion today). Early work was also done on ways of using linear notations for indexing structures, including the
Today, linear representations are extremely useful, not because computers can only work in text, but because text is still the most efficient way of storing and communicating information. The most popular current linear representations are
(try it with
InChI unofficial FAQ
) , although some others are in use, such as
. Here is an example of the SMILES for a common drug:
Linear notations are not the only way of communicating structure: also popular are file-based formats such as
(a variant of XML). These have the advantage of flexibility, although they are much more verbose.
Internally, a 2D structure is represented in the same way as a mathematical graph (which is useful - see later!). The atom lookup table assigns a unique number to each atom, along with listing other properties such as atom type; the connection table shows which atoms are bonded to which other atoms, the bond order being indicated by the number in a cell (i.e. 1=single bond, 2=double bond, 3=triple bond). By convention, a 4 can be used for an "aromatic" bond. Here is an example atom lookup table and connection table for Acetaminophen (Tylenol, Paracetamol):
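As a rough sketch of how these two tables might be held in code, the following builds an atom lookup table and a connection table for the heavy atoms of acetaminophen, CC(=O)Nc1ccc(O)cc1. The atom numbering is an arbitrary choice made for this illustration (it simply follows the order of atoms in that SMILES) and is not necessarily the numbering used in the course's own figure.

```python
# Atom lookup table: atom number -> element (hydrogens are left implicit).
atoms = {
    1: "C",   # methyl carbon
    2: "C",   # carbonyl carbon
    3: "O",   # carbonyl oxygen
    4: "N",   # amide nitrogen
    5: "C", 6: "C", 7: "C", 8: "C",   # ring carbons (8 carries the OH)
    9: "O",   # hydroxyl oxygen
    10: "C", 11: "C",                 # remaining ring carbons
}

# Bonds as (atom, atom, order); 4 marks an "aromatic" bond by convention.
bonds = [
    (1, 2, 1), (2, 3, 2), (2, 4, 1), (4, 5, 1),
    (5, 6, 4), (6, 7, 4), (7, 8, 4), (8, 10, 4), (10, 11, 4), (11, 5, 4),
    (8, 9, 1),
]

# Connection table: a symmetric matrix whose cells hold the bond order,
# with 0 meaning "not bonded".
n = len(atoms)
connection_table = [[0] * n for _ in range(n)]
for a, b, order in bonds:
    connection_table[a - 1][b - 1] = order
    connection_table[b - 1][a - 1] = order

# Print the matrix with atom numbers and elements as row labels.
print("      " + " ".join(f"{i:2d}" for i in atoms))
for i, row in zip(atoms, connection_table):
    print(f"{i:2d} {atoms[i]:2s} " + " ".join(f"{v:2d}" for v in row))
```

A full matrix is easy to read but mostly zeros; a list of bonds like `bonds` above carries the same information more compactly.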
Note that if we need to ensure that the same molecule is numbered the same way each time, we need an algorithm that consistently numbers atoms via rules. Fortunately, this can be done with the Morgan Algorithm (see Leach & Gillet). In this algorithm, each atom is given a "connectivity value" reflecting how many atoms it is connected to. This value is iteratively replaced by the sum of the connectivity values of its neighbors, until the number of different values is maximized. Atoms are then numbered in decreasing order of connectivity value. In the case of a tie, other properties are used (e.g. atomic number, bond order, etc). Doing this is an important basis for producing canonical representations, e.g. canonical SMILES.
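The iteration described above can be sketched in a few lines. This is a simplified illustration that follows the description in the text, not a complete canonical numbering procedure: the function names are invented for this example, the acetaminophen numbering is an arbitrary starting choice, and ties are broken only by element symbol and then by the original atom number, which is an assumption made to keep the sketch short.

```python
def morgan_values(neighbors):
    """Iterate connectivity values until the number of distinct values
    stops increasing, and return the last set of values that increased it.

    neighbors maps each atom to the list of atoms it is bonded to.
    """
    values = {atom: len(nbrs) for atom, nbrs in neighbors.items()}
    n_classes = len(set(values.values()))
    while True:
        # Replace each value by the sum of the neighbours' values.
        new_values = {
            atom: sum(values[nbr] for nbr in nbrs)
            for atom, nbrs in neighbors.items()
        }
        new_classes = len(set(new_values.values()))
        if new_classes <= n_classes:
            return values
        values, n_classes = new_values, new_classes


def morgan_ranking(neighbors, symbols):
    """Order atoms by decreasing connectivity value (simplified tie-break)."""
    values = morgan_values(neighbors)
    return sorted(neighbors, key=lambda a: (-values[a], symbols[a], a))


# Acetaminophen heavy atoms, numbered in the order of CC(=O)Nc1ccc(O)cc1.
symbols = {1: "C", 2: "C", 3: "O", 4: "N", 5: "C", 6: "C",
           7: "C", 8: "C", 9: "O", 10: "C", 11: "C"}
neighbors = {
    1: [2], 2: [1, 3, 4], 3: [2], 4: [2, 5], 5: [4, 6, 11], 6: [5, 7],
    7: [6, 8], 8: [7, 9, 10], 9: [8], 10: [8, 11], 11: [10, 5],
}

values = morgan_values(neighbors)
print(values)
print(morgan_ranking(neighbors, symbols))
```

Atoms that are related by symmetry (here the two pairs of equivalent ring carbons) end up with the same value, which is expected; the methyl carbon and carbonyl oxygen also tie on connectivity alone and are separated only by their element.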
We now have some neat, simple ways of representing and communicating 2D chemical structures. However, there are some nuances of chemistry that complicate matters. In particular,
For stereochemistry, most representations don't inherently store stereochemical information, so we have to make a policy decision about whether we actually want to differentiate stereoisomers (in some instances, such as
, it makes a life or death difference!). This can be done at the representation level, or the computation level. Stereoisomerism is addressed in Isomeric SMILES and
For aromaticity, it is not always entirely clear whether a ring should be considered "aromatic" or not, and even if so, it may be represented with alternating single and double bonds, or in "aromatic" form. This can be addressed at the representation or computation level.
For tautomerism, the same functional group can be represented differently, either through different conventions or to indicate a particular state (usually at a particular pH). Tautomerism is addressed in InChI.
The usefulness of graph theory
Graph theory is a branch of mathematics that studies graphs: objects (nodes) with links between them (edges).
How does this apply to chemical structures? Well, if we consider atoms as nodes and bonds as edges, we have access to a large number of graph theory algorithms: for example, comparing two chemical structures to see if they are the same becomes a graph isomorphism problem; determining if a chemical structure contains a given substructure becomes a subgraph isomorphism problem, solvable with the Ullmann algorithm (see Leach & Gillet).
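To make the atoms-as-nodes, bonds-as-edges idea concrete, here is a minimal substructure search written as plain backtracking. It is deliberately a brute-force sketch and not the algorithm described in Leach & Gillet, which adds pruning steps to make the search practical on large structures; the function names and the graph layout are assumptions made for this example.

```python
def make_graph(symbols, bond_list):
    """Return (symbols, bonds) with bonds keyed by unordered atom pairs."""
    return symbols, {frozenset({a, b}): order for a, b, order in bond_list}


def find_substructure(query, target):
    """Return one mapping {query atom: target atom}, or None if no match."""
    q_symbols, q_bonds = query
    t_symbols, t_bonds = target
    q_atoms = list(q_symbols)

    def extend(mapping):
        if len(mapping) == len(q_atoms):
            return dict(mapping)
        q = q_atoms[len(mapping)]            # next query atom to place
        for t in t_symbols:
            if t in mapping.values() or t_symbols[t] != q_symbols[q]:
                continue
            # Every query bond from q to an already placed atom must exist
            # in the target with the same bond order.
            if all(
                t_bonds.get(frozenset({t, mapping[p]})) == q_bonds[frozenset({q, p})]
                for p in mapping
                if frozenset({q, p}) in q_bonds
            ):
                mapping[q] = t
                result = extend(mapping)
                if result is not None:
                    return result
                del mapping[q]               # backtrack
        return None

    return extend({})


def same_structure(g1, g2):
    """Equal atom and bond counts plus a full match means the same graph."""
    return (len(g1[0]) == len(g2[0]) and len(g1[1]) == len(g2[1])
            and find_substructure(g1, g2) is not None)


# Target: acetaminophen heavy atoms; 4 marks an aromatic bond.
acetaminophen = make_graph(
    {1: "C", 2: "C", 3: "O", 4: "N", 5: "C", 6: "C",
     7: "C", 8: "C", 9: "O", 10: "C", 11: "C"},
    [(1, 2, 1), (2, 3, 2), (2, 4, 1), (4, 5, 1), (5, 6, 4), (6, 7, 4),
     (7, 8, 4), (8, 9, 1), (8, 10, 4), (10, 11, 4), (11, 5, 4)],
)
amide = make_graph({1: "C", 2: "O", 3: "N"}, [(1, 2, 2), (1, 3, 1)])
nitrile = make_graph({1: "C", 2: "N"}, [(1, 2, 3)])

print(find_substructure(amide, acetaminophen))
print(find_substructure(nitrile, acetaminophen))
```

The worst-case cost of this kind of search grows very quickly with molecule size, which is why pruning and pre-screening matter in real systems.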
Structural representations of reactions need to identify only the arrangement of products and reagents, and possibly which reagent atom maps to which product atom; other information such as
and yield are generally stored separately.
Reaction SMILES is a superset of SMILES with symbols for arrows and to separate components of the reaction.
SMIRKS is a superset of Reaction SMILES that allows mapping of individual atoms. Note that Reaction SMILES and SMIRKS are languages for representing
, which may or may not be valid reactions. For example a common use for SMIRKS is representing generic reaction rules.
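As a small illustration of the "arrow" and component separators, the sketch below splits a reaction string into its parts. It assumes the commonly used Daylight-style layout `reactants>agents>products`, with `.` separating individual molecules; the function name is made up for this example, and the code does no chemical validation at all.

```python
def split_reaction_smiles(reaction):
    """Split 'reactants>agents>products' into three lists of SMILES."""
    parts = reaction.split(">")
    if len(parts) != 3:
        raise ValueError("expected the form reactants>agents>products")
    # '.' separates molecules within each part; empty parts give [].
    return tuple([s for s in part.split(".") if s] for part in parts)


# Acetic acid + ethanol -> ethyl acetate + water, with no agents listed.
reactants, agents, products = split_reaction_smiles("CC(=O)O.OCC>>CC(=O)OCC.O")
print(reactants, agents, products)
```

Because the string is only a representation, nothing here checks that the atoms balance or that the reaction is chemically sensible.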
Representing generic (Markush) structures
Genericized forms of chemical structures are thought to have been first introduced by Eugene Markush in 1924 as part of a patent (prior to that, patents were for specific structures). Thus the term "Markush structures" came to be used for 2D representations that describe more than one actual structure (for example, by enumerating alternate groups on particular points of the molecule). Representing generic structures is difficult because a Markush structure can represent an unlimited number of compounds (e.g. "aryl group"). However, this problem has been addressed with text-based languages for describing generic structures, such as GENSAL, and extended connection table representations for internal use. They are widely used in patent searching systems. We will be looking at Markush structures in more detail in a later class.
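For the simple case where each variable position has a finite list of alternatives, enumeration can be sketched as a Cartesian product over the substituent lists. This is an illustration only: the scaffold template, the `{R1}`/`{R2}` placeholder convention and the function name are assumptions made for this example (they are not GENSAL or any patent system's format), and each substituent is written as a SMILES fragment starting with its attachment atom.

```python
from itertools import product


def enumerate_markush(scaffold, r_groups):
    """Yield one SMILES per combination of substituents.

    scaffold is a SMILES template with {R1}, {R2}, ... placeholders;
    r_groups maps each placeholder name to its list of alternatives.
    """
    names = sorted(r_groups)
    for combo in product(*(r_groups[name] for name in names)):
        yield scaffold.format(**dict(zip(names, combo)))


# A para-disubstituted benzene with two variable positions.
scaffold = "c1({R1})ccc({R2})cc1"
r_groups = {
    "R1": ["C", "CC", "Cl"],   # methyl, ethyl, chloro
    "R2": ["O", "N"],          # hydroxyl, amino
}

structures = list(enumerate_markush(scaffold, r_groups))
print(len(structures), structures)
```

Open-ended definitions such as "aryl group" cannot be enumerated this way, which is part of why dedicated generic-structure representations are needed.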
Leach & Gillet Chapter 1
Brown Sections 2 & 3