Here are some interesting results as they highlight the run time between three flavours of substructure search algorithms. I have chosen, a) UIT (Universal Isomorphism Tester) as implemented in the CDK (Chemistry Development Kit); b) VF2 as implemented in the SMSD (I have further optimised it); and c) VF2 as ported from ChemKit. The former two implementations (i.e. a and b) are optimised for substructure search whereas the later implementation (i.e. c) is optimised for the graph automorphism search (further extension would be good/ideal).
While benchmarking the substructure search, I realised that the CDK AtomContainer takes ample amount of time for methods such as getConnectedAtomsList(atom) or getConnectedAtomsCount(atom). Hence, I ended up refactoring the AtomContainer to store the adjacency map of atoms and bonds. This will speed up the above mentioned methods and the benchmark results will bring out the actual differences in the run time. Please find the updated AtomContainer here (there is scope for further optimisation).
Now let’s focus on the results from the benchmark 🙂
- Benchmark results from the substructure search. X-axis: 31 test cases subjected to 15030 targets (results are sorted in ascending order of the query atom count). Y- axis: total percentage of time spent in calculating the substructures.
The comparative study/benchmark results between the three algorithms revealed some very interesting characteristics.
1) ChemKit version of the VF2 is very fast as it just looks for graph automorphisms (light green bars on the graph).
2) CDK UIT seems to perform better on graphs with atom count <= 19.
3) VF2 implementation in the SMSD performs better on graphs with atom count > 19 and yes this implementation is better off in the worst case (i.e. if no hits were reported, e.g.: first bar with 11 atoms).
Further permutation based tests will reveal more intricacies, artefacts and the innate nature of these algorithms and implementation. If you are interested in these implementations and test cases please feel free to refer to the benchmark code on the github.
Update (16 March 2011): VF2 implementation from the Chemkit now calculates substructure isomorphism. Thanks to Kyle Lutz for his feedback. The new runtime of the VF2 chemkit implementation looks as good as the one in the SMSD except that the rejection of a solution (in case of mismatch) is faster in the SMSD version of the VF2 whereas matching is faster in the former. Rest of the facts regarding size and runtime mentioned in the graph above still hold good. Below is the updated graph.
Update (17 March 2011): Further optimisation of the ChemKit ported VF2 implementation makes is fastest (reporting the matches and rejection of solution; refer to the benchmark code on the github).
- Performance benchmark of substructure searches VF2 vs CDK’s UIT
X-axis: 31 test cases subjected to 15030 targets (results are sorted in ascending order of the query atom count). Y- axis: total percentage of time spent in calculating the substructures.