Updated on: 28th July 2011
Chemically valid atom typing leads to better chemistry and consistent outputs from any chemo-informatics toolkit. In my previous post, I highlighted the performance of the CDK atom typing on the KEGG dataset and the pressing need to improve it. Mr. Nimish Gopal (from IIT Roorkee, India) has taken up the herculean task of fixing the missing CDK atom types (reported in the KEGG molecules) as part of his summer internship in Prof. Thornton’s group at the EMBL-EBI. Since I am deeply involved with this project, I thought it would be useful for the community to know about the progress we have made in this direction.
Aim: The aim of this project was to enrich the atom typing model in the CDK.
Assumption: A valid atom typing will lead to an accurate explicit hydrogen count.
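To make the assumption concrete, here is a minimal sketch of how an atom type can determine a hydrogen count: the type fixes how many neighbours the atom should have, and the hydrogens to add are whatever is missing. The lookup table, the function name `hydrogens_to_add` and the tuple representation are assumptions made for illustration only; the type names merely imitate the CDK naming style and this is not the CDK's own implementation.

```python
# Hypothetical lookup: atom type name -> formal neighbour count.
# The names imitate the CDK "Element.hybridisation" style, but this small
# table is an illustrative assumption, not the CDK's actual type list.
FORMAL_NEIGHBOURS = {
    "C.sp3": 4,
    "C.sp2": 3,
    "C.sp": 2,
    "N.sp3": 3,
    "O.sp3": 2,
}


def hydrogens_to_add(atom_type, heavy_neighbours):
    """Return the number of hydrogens needed to saturate one atom.

    Raises KeyError when the atom type is unknown, since no count can be
    derived for an untyped atom.
    """
    formal = FORMAL_NEIGHBOURS[atom_type]
    return max(formal - heavy_neighbours, 0)


# Ethanol heavy-atom skeleton C-C-O: (atom type, number of heavy neighbours).
ethanol = [("C.sp3", 1), ("C.sp3", 2), ("O.sp3", 1)]
total_h = sum(hydrogens_to_add(t, n) for t, n in ethanol)
print(total_h)  # 6, matching C2H6O
```

An atom whose type is missing from the table cannot be given a hydrogen count at all, which is why every missing atom type shows up directly as a hydrogen adder failure.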
Conclusion: We have successfully added around 124 missing/curated atom types to the CDK (this figure supersedes the earlier count of 90 missing types). They range from metals to salts, etc. You can find the atom type enriched CDK on my GitHub CDK branch named atomtype.
Model: We performed cross-validation using Chemaxon as the gold standard, with the KEGG molecules as test cases. Each KEGG mol file was read by the CDK, its hydrogens were stripped, and two cloned copies were generated. Explicit hydrogens were then added using the CDK on one copy and Chemaxon on the other. The explicit hydrogen counts were recorded and, if they were equal, a subgraph isomorphism search was performed on the two molecules to make sure the hydrogens were placed correctly.
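The workflow described above can be sketched in a few lines of plain Python. Everything here is a toy stand-in: molecules are dictionaries of atoms plus a bond list, the two "adders" are the same valence-filling function run with two hypothetical valence tables (`TABLE_A`, `TABLE_B`), and a per-atom comparison of attached hydrogens replaces the subgraph isomorphism search, which is only valid because both copies keep the same atom indices. It does not call the CDK, Chemaxon or the SMSD.

```python
from functools import partial


def strip_hydrogens(atoms, bonds):
    """Return the heavy-atom graph: drop 'H' atoms and bonds touching them."""
    heavy = {i: el for i, el in atoms.items() if el != "H"}
    kept = [(a, b) for a, b in bonds if a in heavy and b in heavy]
    return heavy, kept


def add_hydrogens(atoms, bonds, valence):
    """Toy adder: fill each heavy atom up to its valence (single bonds only)."""
    atoms, bonds = dict(atoms), list(bonds)
    next_id = max(atoms) + 1
    for i in sorted(atoms):  # snapshot of heavy-atom ids before adding H
        degree = sum(1 for a, b in bonds if i in (a, b))
        for _ in range(valence[atoms[i]] - degree):
            atoms[next_id] = "H"
            bonds.append((i, next_id))
            next_id += 1
    return atoms, bonds


def hydrogen_profile(atoms, bonds):
    """Count the hydrogens attached to each heavy atom."""
    profile = {i: 0 for i, el in atoms.items() if el != "H"}
    for a, b in bonds:
        if atoms[a] == "H" and atoms[b] != "H":
            profile[b] += 1
        elif atoms[b] == "H" and atoms[a] != "H":
            profile[a] += 1
    return profile


def cross_validate(mol, adder_a, adder_b):
    """Strip hydrogens, run both adders on separate copies, compare results."""
    heavy_atoms, heavy_bonds = strip_hydrogens(*mol)
    result_a = adder_a(dict(heavy_atoms), list(heavy_bonds))
    result_b = adder_b(dict(heavy_atoms), list(heavy_bonds))
    count_a = sum(1 for el in result_a[0].values() if el == "H")
    count_b = sum(1 for el in result_b[0].values() if el == "H")
    if count_a != count_b:
        return "count mismatch"
    # Stand-in for the isomorphism check: same hydrogens on the same atoms.
    if hydrogen_profile(*result_a) != hydrogen_profile(*result_b):
        return "placement mismatch"
    return "agree"


# Two hypothetical valence tables playing the role of the two toolkits.
TABLE_A = {"C": 4, "N": 3, "O": 2}
TABLE_B = {"C": 3, "N": 4, "O": 2}
adder_a = partial(add_hydrogens, valence=TABLE_A)
adder_b = partial(add_hydrogens, valence=TABLE_B)

ethanol = ({0: "C", 1: "C", 2: "O", 3: "H"}, [(0, 1), (1, 2), (2, 3)])
methylamine = ({0: "C", 1: "N"}, [(0, 1)])

print(cross_validate(ethanol, adder_a, adder_a))      # agree
print(cross_validate(ethanol, adder_a, adder_b))      # count mismatch
print(cross_validate(methylamine, adder_a, adder_b))  # placement mismatch
```

The methylamine case shows why the second check matters: both tables add five hydrogens in total, but they put them on different atoms, so comparing counts alone would have reported agreement.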
Result: 15499 KEGG molecules were tested and only 5 of them (about 0.03%) disagreed between the CDK and Chemaxon explicit hydrogen adder results. From the graphs it is clear that the improved and enriched atom typing in the CDK outperforms the present CDK atom typing model. The CDK hydrogen adder based on the new enriched atom typing model also concurs with the Chemaxon hydrogen adder results.
The scatterplot and regression lines are linear because the resulting explicit hydrogen counts are the same, except for a few outliers.
The failed cases (C11065, C13932, C18368, C18380, C18384) are ambiguous in nature, and the two software packages handle such cases differently. Chemaxon adds hydrogens to every atom in a molecule that is perceived correctly and skips the ones that are not defined correctly (setting an error flag), whereas the CDK adds hydrogens atom by atom but exits (throws an exception) as soon as it finds an untyped atom. In theory they should give the same results, but in practice they differ.
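The two failure-handling strategies can be contrasted with a small sketch. Both functions below are hypothetical illustrations written for this post (`add_hydrogens_skipping` for the skip-and-flag behaviour, `add_hydrogens_fail_fast` for the exception behaviour); the valence table and the `(element, degree)` atom representation are assumptions, and neither function is the actual code of either toolkit.

```python
# Hypothetical valence table; "Xx" below stands for an atom that cannot be typed.
KNOWN_VALENCE = {"C": 4, "N": 3, "O": 2}


class UntypedAtomError(Exception):
    """Raised by the fail-fast adder when an atom cannot be typed."""


def add_hydrogens_skipping(atoms):
    """Skip-and-flag strategy: handle every typable atom, flag the rest."""
    counts, flagged = {}, []
    for idx, (element, degree) in enumerate(atoms):
        if element not in KNOWN_VALENCE:
            flagged.append(idx)  # error flag instead of an exception
            continue
        counts[idx] = max(KNOWN_VALENCE[element] - degree, 0)
    return counts, flagged


def add_hydrogens_fail_fast(atoms):
    """Fail-fast strategy: abort at the first atom that cannot be typed."""
    counts = {}
    for idx, (element, degree) in enumerate(atoms):
        if element not in KNOWN_VALENCE:
            raise UntypedAtomError(f"atom {idx} ({element}) has no atom type")
        counts[idx] = max(KNOWN_VALENCE[element] - degree, 0)
    return counts


# (element, number of heavy neighbours); the middle atom is untypable.
ambiguous = [("C", 1), ("Xx", 2), ("O", 1)]

counts, flagged = add_hydrogens_skipping(ambiguous)
print(counts, flagged)  # {0: 3, 2: 1} [1]

try:
    add_hydrogens_fail_fast(ambiguous)
except UntypedAtomError as err:
    print("aborted:", err)
```

On a fully typable molecule the two strategies return identical counts; they only diverge when an untyped atom is present, where one yields a partial result and the other yields none, so a comparison of hydrogen counts reports a disagreement.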
The good news is that the CDK is now able to atom type all the valid molecules from the KEGG database (June 2011 release). I am sure that a few missing atom types will still crop up with other small molecule databases (e.g. ChEBI or PubChem).
Acknowledgements:
- Prof. Thornton for her support and guidance.
- Gilleain, who helped Nimish become well versed in the Java code hierarchy of the CDK.
- Egon’s Groovy book on the CDK, which Nimish found very helpful as a CDK beginner.
- Egon’s blog post on reporting missing CDK atom types.
- Chemaxon for granting us a license to use its hydrogen adder.
- The SMSD for performing the isomorphism search between molecules with explicit hydrogens generated by the CDK and Chemaxon.
- The EMBL for funding this project.
We are glad to learn about the strong interest shown by the CDK community in having this work integrated back into the CDK. Thank you all for your support. We (Nimish, Gilleain and I) have already submitted a CDK patch, and it contains the following atom types: