Improved Atom Typing in the Chemistry Development Kit (CDK)


, , , , , , , , ,

Atom type test casesEnriched CDK Atom Type Model vs. ChemAxon vs. Present CDK Atom Type Model

Improved Atom typing leads to better hydrogen handling in the CDK

Updated on : 28th July 2011

A chemically valid atom typing leads to better chemistry and consistent outputs from any chemo-informatics toolkit. In my previous post, I had highlighted the performance of the CDK atom typing on the KEGG dataset and the pressing need to improve it. Mr. Nimish Gopal (from IIT Roorkee, India) has taken up this herculean task to fix the missing CDK atom types (reported in the KEGG molecules) as part of his summer internship in Prof. Thornton’s group at the EMBL-EBI. Since I am deeply involved with this project, I thought it would be fruitful for the community to know about the progress we have made in this direction.

Aim: The aim of this project was to enrich the atom typing model in the CDK.

Assumption: A valid atom typing will lead to an accurate explicit hydrogen count.

Conclusion: We have successfully added around 90 missing 124 missing/curated atom types in the CDK. They range from metals to salts, etc. You can find the atom type enriched CDK on my github CDK branch named as atomtype.

Model: We have performed cross validation using Chemaxon as gold standard. The KEGG molecules were used as test cases. Each KEGG mol file was read by the CDK; hydrogens were stripped and two cloned copies were generated. Explicit hydrogens were added using the CDK and Chemaxon on the respective copies of the cloned molecules. The explicit hydrogen count was recored and if they were empirically same then a subgraph Isomorphism search was performed on them (in order to make sure the hydrogens were placed correctly).

Result: 15499 KEGG molecules were tested and only 5 of them disagreed between the CDK and Chemaxon explicit hydrogen adder results. From the graphs its clear that the improved and enriched atom typing in the CDK outperforms the present CDK atom typing model. The new enriched atom typing model based CDK hydrogen adder also concurs with the Chemaxon hydrogen adder results.

The scatterplot and regression lines are linear as the resulting explicit hydrogen counts are same except few outliers

The failed cases are of ambiguous nature (C11065C13932C18368C18380C18384) and both softwares have different approach to handle such cases. The Chemaxon adds hydrogens to each atom in a molecules which is perceived correctly and skips ones (sets an error flag) which are not defined correctly. Whereas, CDK adds hydrogens to each atom in a molecule but exits (throws exception) as soon as it finds an untyped atom. Theoretically, they should end up giving same results but technically they differ.

The good news is that now CDK is able to atom type all the valid molecules from the KEGG database (June 2011 release). I am sure that there are few missing atom types which might crop up with some other small molecule databases ( e.g. ChEBI or PubChem etc.).


  • Prof. Thornton for her support and guidance.
  • I must thank Gilleain who helped Nimish to get well versed with JAVA code hierarchy in the CDK.
  • As a CDK starter, Nimish also found the Groovy book on the CDK by Egon very helpful.
  • Egon’s blog post for reporting missing CDK Atom types.
  • The Chemaxon software for granting us the license to use its hydrogen adder.
  • The SMSD for performing the isomorphism between molecules with explicit hydrogens generated by the CDK and Chemaxon.
  • The EMBL for funding this project.
We are glad to learn about the strong interest shown by the CDK community to have this work integrated back into the CDK. Thank you all for your support, we (Nimish, Gilleain and myself) have already submitted a CDK patch and it contains the following atom types.

Yakhni Pulao (Rice + meat = Non Veg Dish)


, , , , , ,

    This is a light biryani (less masala), hence should appeal to all taste buds.

Total Time: 1 hr

For Yakhni (stock)

1 1/2 kg chicken (cut into pieces) or 1 1/2 kg mutton/lamb (cut into pieces)
5-6 cups water
2 teaspoon salt
1 medium onion (peeled & cut in two pieces)
1 teaspoon clove (laung)
1 teaspoon whole black peppercorn (sabut kali mirch)
4 black cardamom pods (bari elaichi)
4 cinnamon sticks (dalchini)
3 bay leaf (tez patta)
10-12 garlic cloves (Lehsan)
4 inches ginger (adrak)
2 teaspoon coriander seed (sukha dhaniya)
2 pinch ground nutmeg (jaifal)
1/2 teaspoon anise (javetri)

For Masala (curry)

1/2 cup oil (prefer olive oil)
2-3 medium onion (thinly sliced)
2-3 teaspoon cumin seed (zeera)
2 1/2 teaspoons ginger-garlic paste
1 medium tomato (chopped)
1 cup yogurt
1-2 teaspoon salt
1 teaspoon red chili powder
1 teaspoon coriander powder
2 teaspoon garam masala powder
4 green cardamoms (choti elaichi)
1-2 1/2 teaspoons ground aniseed (saunf)
2-4 tablespoon mint leaf, coriander and green chilli (chopped)

4 cups basmati rice (soaked in water for at least 20-30 mins)

Directions (patience is key to success):

Prep Time: 10 mins

  • Trim most of the fat from the meat. Wash thoroughly with cold water. Put water in a large heavy based saucepan and bring to boil.
  • Tie aniseed and coriander seed in a piece of muslin to make spice bag (you can add whole spices in it if you want to avoid them being chewed by accident while eating!).
  • Mix all ingredients for yakhni.
  • Bring them to a boil & let them boil on medium high heat for 20 minutes or till the lamb gets tender (slow cooking helps).
  • Strain the yakhni only leaving the lamb in it.

    Fry the onions till golden

  • In pan heat oil, add sliced onions & fry the onions till they turn golden brown.
  • Take 1/4 out of it & keep aside.
  • Now add cumin seeds (zeera), ginger garlic paste, tomatoes, salt, red chilly powder to the remaining fried onions & stir-fry for 3-4 minutes.
  • Next add yogurt, garam masala and green cardamom (choti elaichi).
  • Cook on medium high heat till the tomatoes get tender.
  • Add the bag, lamb pieces to the masala & let it simmer for another 2-4 minutes.
  • Yakhni (Stock): Cook the lamb with masala and yoghurt

    Finally pour the masala in the yakhni, already back in the cooking pot.

  • Discard the spices bag from meat then add rice & cook it uncovered on high heat for 5 minutes or till the water “almost” dries up.

Sprinkle the fried onions & mint leaves.

Close the pot by tightly covering the pot with foil just where the lid goes making sure no steam escapes (Dum dena).

Cook on low heat for 15-20 minutes or till the rice is done.

Enjoy with chutney/cucumber raita.

Atom Typing with the CDK


, , , ,

Atom typing is an important and integral part of any chemoinformatics software. Most of the calculations, and more importantly the assignment of implicit/explicit hydrogen(s) depend on this. Thanks to Egon, Rajarshi and others, the CDK has improved a lot over the years.

I (with Nimish a summer intern) did a quick test on the KEGG molecules (before KEGG becomes a paid service on 30th June, which is a pity…..!).

The good news is that only 1.37% molecules failed the “Atom Typing” test, as compared to last year (the failure rate was approximately 10%, mostly phosphates). Most of the failed atoms are metals (Cl,N,S,Fe,Mn,Zn,Cu, etc).

Here are the raw results:

Wish list: I hope the CDK developers or chemists can fix these too!

Kindly comment and leave your suggestions.

Join/Fuse two molecules


, , , , , , , ,

Lately, an interesting question was posed to me.

How can I join two molecules?

Well for many of us this might be a regular exercise on a chemical editor but this is a little trickier to program. So mathematically, we are aiming for a union between two molecules. The union between two molecules can be defined as (A ∪ B) = n(A) + n(B) – (A ∩ B), where A is a query molecule and B is the target molecule. The intersection (A ∩ B) can be obtained by finding isomorphism between two molecules. The joining part is a little challenging as not all combination(s) might be chemically valid. So we need to find combination(s) which are unique and unsaturated!

Here is a little snippet to perform this task in Java using the SMSD and CDK.

You can download the java source UnionMolecule and java executable Union.jar.

Union between query and target molecule

Union between query and target molecules

Usage example:

java -jar Union.jar "OOC1=CC=CC=C1" "c1ccc(cc1)c2ccccc2"


------------- Combination 1--------------------
Query SMILES OOC1=CC=CC=C1, count 8
Target SMILES c1ccc(cc1)c2ccccc2, count 12
Union SMILES OOc1ccccc1c2ccccc2, count 14

------------- Combination 2--------------------
Query SMILES OOC1=CC=CC=C1, count 8
Target SMILES c1ccc(cc1)c2ccccc2, count 12
Union SMILES OOc1cccc(c1)c2ccccc2, count 14

------------- Combination 3--------------------
Query SMILES OOC1=CC=CC=C1, count 8
Target SMILES c1ccc(cc1)c2ccccc2, count 12
Union SMILES OOc1ccc(cc1)c2ccccc2, count 14

Benchmarking Substructure Search


, , , , , ,

Here are some interesting results as they highlight the run time between three flavours of substructure search algorithms. I have chosen, a) UIT (Universal Isomorphism Tester) as implemented in the CDK (Chemistry Development Kit); b) VF2 as implemented in the SMSD (I have further optimised it); and c) VF2 as ported from ChemKit. The former two implementations (i.e. a and b) are optimised for substructure search whereas the later implementation (i.e. c) is optimised for the graph automorphism search (further extension would be good/ideal).

While benchmarking the substructure search, I realised that the CDK AtomContainer takes ample amount of time for methods such as getConnectedAtomsList(atom) or getConnectedAtomsCount(atom). Hence, I ended up refactoring the AtomContainer to store the adjacency map of atoms and bonds. This will speed up the above mentioned methods and the benchmark results will bring out the actual differences in the run time. Please find the updated AtomContainer here (there is scope for further optimisation).

Now let’s focus on the results from the benchmark 🙂

Benchmark results from the substructure search
Benchmark results from the substructure search. X-axis: 31 test cases subjected to 15030 targets (results are sorted in ascending order of the query atom count). Y- axis: total percentage of time spent in calculating the substructures.

The comparative study/benchmark results between the three algorithms revealed some very interesting characteristics.

1) ChemKit version of the VF2 is very fast as it just looks for graph automorphisms (light green bars on the graph).

2) CDK UIT seems to perform better on graphs with atom count <= 19.

3) VF2 implementation in the SMSD performs better on graphs with atom count > 19 and yes this implementation is better off in the worst case (i.e. if no hits were reported, e.g.: first bar with 11 atoms).

Further permutation based tests will reveal more intricacies, artefacts and the innate nature of these algorithms and implementation. If you are interested in these implementations and test cases please feel free to refer to the benchmark code on the github.

Update (16 March 2011): VF2 implementation from the Chemkit now calculates substructure isomorphism. Thanks to Kyle Lutz for his feedback. The new runtime of the VF2 chemkit implementation looks as good as the one in the SMSD except that the rejection of a solution (in case of mismatch) is faster in the SMSD version of the VF2 whereas matching is faster in the former. Rest of the facts regarding size and runtime mentioned in the graph above still hold good. Below is the updated graph.

Update (17 March 2011): Further optimisation of the ChemKit ported VF2 implementation makes is fastest (reporting the matches and rejection of solution; refer to the benchmark code on the github).

Performance benchmark of substructure searches VF2 vs CDK's UIT
Performance benchmark of substructure searches VF2 vs CDK’s UIT
X-axis: 31 test cases subjected to 15030 targets (results are sorted in ascending order of the query atom count). Y- axis: total percentage of time spent in calculating the substructures. 

Ultra Fast substructure search based on the VF2 outperforms CDK UIT


, , , , , ,

Updated (17 March 2010): Gilleain and myself were able to code an ultrafast VF2 algorithm for Isomorphism substructure search (permutation bug is resolved). The code is ported or re-factored from Chemkit and I have further optimised the code at our end. The previous VF2 code in the SMSD is more MCS oriented which was a big compromise in terms of speed while performing a substructure search. I have mentioned these issues in a previous blog. The whole idea of a substructure search is to answer a boolean question i.e. is A a subgraph/substructure of B ? If the answer is true then matching atoms are reported. On the other hand MCS where it’s about reporting the commonality between A and B.

Below is the benchmark graph which speaks about the performance (30 queries were subjected to 15030 targets, same as my previous benchmark test cases) of the VF2.

Optimised Chemkit VF2 vs CDK's UIT
Substructure benchmark results between VF and CDK’s UIT
X-axis: 30 test cases subjected to 15030 targets (results are sorted in ascending order of the query atom count). Y- axis: total percentage of time spent in calculating the substructures.

As is apparent from the benchmark graph, VF2 takes less than 10-20% of the total time (in most cases) for performing substructure searches (results are sorted in ascending order of the query atom count). So as for now we have an implementation which at last outperforms UIT (CDK code) in terms of speed for automorphic graphs. The performance of the VF2 improves as the query atom count (graph size) increases. We might need a few changes in the CDK AtomContainer class itself to further improve the speed/performance but this will involve further discussions with CDK developers.

How are enzymes classified?


, , , , , , , , , ,

How are enzymes classified?

Metabolism influences building or replacement of tissue, conversion of food to energy, disposal of waste materials, reproduction etc. “Catalysis” is defined as the acceleration of a chemical reaction by a substance which itself undergoes no permanent chemical change. Most biochemical reactions do not take place spontaneously and enzyme catalysis plays an important role in biochemical reactions necessary for all life processes. Without enzymes, these reactions would take place at a rate far too slow for effective metabolism.

Enzymes can be classified by the kind of chemical reaction they catalyze. One such scheme of enzyme classification is defined by IUBMB.

The IUBMB assigns a 4-digit code to each enzyme. Each enzyme is prefixed by EC, followed by the digits.

For exampleoxidoreductases EC

1.     The first digit denotes “Class” of the enzyme

2.     The second digit indicates, “Sub-class” of the enzyme

3.     The third digit gives “Sub sub-class” of the enzyme

4.     The fourth digit in the code is “Serial number” of the enzyme

The classification is as follows:

Group Name Type of Reaction Catalysed Example
Oxidoreductases Oxidation-reduction reactions Alcohol oxidoreductase (EC 1.1)
Transferases Transfer of functional groups Methyltransferase (EC 2.1)
Hydrolases Hydrolysis reactions Lipase (EC 3.1)
Lyases Addition to double bonds or single bonds Decarboxylases (EC 4.1)
Isomerases Isomerization reactions Epimerases and Racemases (EC 5.1)
Ligases Formation of bonds with ATP cleavage Enzymes forming carbon-oxygen bonds (EC 6.1)

b) How can I find similar enzymes?

Any similarity search is based on the presence of similar patterns (similar bond changes and/or small molecules) shared between query and target reactions. A large number of shared patterns results in higher similarity score or lesser distance score. In Bioinformatics, the concept of similarity or distance is used to find similar sequences based on amino acid similarity, structural topology, etc. In Chemoinformatics similarity between small molecules/drug molecules (i.e. based on Tanimoto score) is based on the presence of similar bonds and atoms between query and target molecules.

c) Literature

  1. Automatic Assignment of EC Numbers.
  2. Computational assignment of the EC numbers for genomic-scale analysis of enzymatic reactions.
  3. Automatic Determination of Reaction Mappings and Reaction Center Information. 2. Validation on a Biochemical Reaction Database.
  4. Genome-scale classification of metabolic reactions and assignment of EC numbers with self-organizing maps.
  5. Chemical similarity searching.
  6. Quantitative comparison of catalytic mechanisms and overall reactions in convergently evolved enzymes: implications for classification of enzyme function.
  7. Using Reaction Mechanism to Measure Enzyme Similarity
  8. etc.

I reckon in the near future we might see such concepts being adapted by IUBMB itself to annotate and classify enzymes.

This would be vital in the study of the interactions between the components of biological systems (metabolites, enzymes and metabolic pathways), and how these interactions give rise to the function and behavior of that system.

As always, thoughts/suggestions are welcome!

Performing substructure searches?


, , ,

The design principle of the SMSD code was always to find Maximum Common Subgraph (MCS) between two molecules. Of course, it can be optimised to find substructures between two molecules. I was lazy in doing so but a recent post on the CDK mailing list about the SMSD based substructure search performance was a wakeup call for me.

I have made some changes to the SMSD code and now the substructure search is not that bad. Nonetheless, I have to redesign the substructure search part of the SMSD as substructure search ideally shouldn’t perform an exhaustive comparison between all possible combinations of the graph. It’s usually a boolean question i.e. is A a subgraph/substructure of B, unlike in the case of MCS where it’s about reporting the commonality between A and B.

So now one can query substructure search in the SMSD using the following code.

private static int getSMSDSolutionCount(IMolecule query, IMolecule target) throws CDKException {
SubStructure substructure = new SubStructure();

substructure.set(query, target);
if (substructure.isSubgraph(true)) {

return 1;

} else {

return 0;



The underlying substructure code can be found here

New link has optimised substructure search code

Results: Here is a small substructure search benchmark test between UIT and SMSD. I have taken 30 query molecules and fired each of them against a decoy of 15030 molecules. At least a hit is expected from each query run.

Substructure benchmark between SMSD vs UIT

Substructure benchmark between SMSD vs UIT

In this case UIT seems to have an edge over SMSD as, in a few cases it takes more time. I have to check if the real difference is between implementation of the code or the actual algorithm itself.

You can find the results and the benchmark code here

The change(s) which is required in the SMSD code will involve heavy refactoring of the VF lib code. The present code works fine for the MCS but it’s time to optimise it for the substructure search. Any suggestions in optimising the VF code would be more than welcome. Please refer to the following class if you have any suggestions 🙂

Links updated: 1st Oct 2011

I hope someone out there will have a cracking solution!

MCS between N-Molecules


, , ,

Recently we have been working on extending and cleaning the SMSD command line tool. One of the priority requests from the SMSD users was to be able to obtain MCS between N-molecules. Thanks to Gilleain who has been instrumental in cleaning the code and providing a good visual output to the searches. He has implemented a circular layout for the MCS search between N-molecules. This looks cleaner and visually appealing to the users.

You can download SMSDcmd-3.0.jar from our website. Try the following command line option for obtaining MCS between N-molecules

> java -jar SMSDcmd-3.0.jar -T SDF -N Data/arom.sdf -g -b

You can download/clone the arom.sdf.

The result would look something like this

MCS between N-molecules

The SMSD integrated with the CDK


, , ,

Today with the help of Egon the SMSD was successfully integrated with the CDK. The nightly build is available via nightly build. If you are interested in the source code then please visit GIT SMSD.

Q: How can I compute Substructure search between a query and target molecule?

A: An example for Substructure search:

//Turbo mode search
//Bond Sensitive is set true
SMSD comparison = new SMSD(Algorithm.SubStructure, true);
// set molecules and remove hydrogens
comparison.init(queryMol, targetMol, true, true);
// set chemical filter true
comparison.setChemFilters(true, true, true);
if (comparison.isSubgraph()) {
//Get similarity score
System.out.println(“Tanimoto coefficient:  ” + comparison.getTanimotoSimilarity());

Q: How can I compute MCS search between a query and target molecule?

A: An example for MCS search:

//{ 0: Default SMSD Algorithm, 1: MCSPlus Algorithm, 2: VFLibMCS Algorithm, 3: CDKMCS Algorithm}
//Bond Sensitive is set true
SMSD comparison = new SMSD(Algorithm.DEFAULT, true);
// set molecules and remove hydrogens
comparison.init(queryMol, targetMol, true, true);
// set chemical filter true
comparison.setChemFilters(true, true, true);
//Get similarity score
System.out.println(“Tanimoto coefficient:  ” + comparison.getTanimotoSimilarity());


If you interested in the complete java code please refer to:

  1. MCS Search code
  2. Substructure search code