jCompoundMapper: An Open Source Java Library and Command-Line Tool for Chemical Fingerprints
Georg Hinselmann, Lars Rosenbaum, Andreas Jahn, Nikolas Fechner, and Andreas Zell, University of Tuebingen, Center for Bioinformatics Tuebingen (ZBIT), Sand 1, 72076 Tuebingen, Germany
What is jCompoundMapper?
- jCompoundMapper provides popular fingerprinting algorithms for chemical graphs such as depth-first search fingerprints, shortest-path fingerprints, extended connectivity fingerprints, autocorrelation fingerprints (e.g. CATS2D), radial fingerprints (e.g. Molprint2D), geometrical Molprint, atom pairs, and pharmacophore fingerprints
- jCompoundMapper provides exporters for several formats for machine learning tools such as LIBSVM, LIBLINEAR, and WEKA.
- jCompoundMapper lets you configure parameters such as search depth, distance cut-offs, or atom typing. For hashed fingerprints, you may also configure the size of the hash space.
- jCompoundMapper can compute similarity matrices, either for clustering approaches or as input for LIBSVM.
- jCompoundMapper can be used as a lightweight jar library or a stand-alone executable
- jCompoundMapper is solely based on open source software with a liberal license. It uses the chemical expert system of the Chemistry Development Kit.
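To make the notions of a configurable hash space and a similarity matrix concrete, here is a minimal sketch in plain Java. It is not jCompoundMapper's API or its actual hashing scheme: the class `HashedFingerprintSketch`, its method names, the use of `String.hashCode()`, and the feature strings in `main` are all assumptions made for illustration. It folds a list of feature strings into a bit set of a chosen size and compares bit sets with the Tanimoto coefficient.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Illustrative sketch only: not jCompoundMapper's API or hashing scheme.
final class HashedFingerprintSketch {

    /** Folds feature strings into a bit set with hashSpaceSize positions. */
    static BitSet fold(List<String> features, int hashSpaceSize) {
        if (hashSpaceSize <= 0) {
            throw new IllegalArgumentException("hashSpaceSize must be positive");
        }
        BitSet bits = new BitSet(hashSpaceSize);
        for (String feature : features) {
            // floorMod keeps the index non-negative for negative hash codes.
            bits.set(Math.floorMod(feature.hashCode(), hashSpaceSize));
        }
        return bits;
    }

    /** Tanimoto coefficient: shared bits / bits set in either fingerprint. */
    static double tanimoto(BitSet a, BitSet b) {
        BitSet intersection = (BitSet) a.clone();
        intersection.and(b);
        BitSet union = (BitSet) a.clone();
        union.or(b);
        int unionSize = union.cardinality();
        // Assumption: two empty fingerprints are treated as identical.
        return unionSize == 0 ? 1.0 : (double) intersection.cardinality() / unionSize;
    }

    /** Symmetric matrix of pairwise Tanimoto similarities. */
    static double[][] similarityMatrix(List<BitSet> fingerprints) {
        int n = fingerprints.size();
        double[][] matrix = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i; j < n; j++) {
                double s = tanimoto(fingerprints.get(i), fingerprints.get(j));
                matrix[i][j] = s;
                matrix[j][i] = s;
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        // Hypothetical feature strings standing in for real fingerprint features.
        List<BitSet> fps = new ArrayList<>();
        fps.add(fold(Arrays.asList("C.3-1-O.3", "C.3-2-N.3"), 1024));
        fps.add(fold(Arrays.asList("C.3-1-O.3", "C.ar-1-C.ar"), 1024));
        System.out.println(Arrays.deepToString(similarityMatrix(fps)));
    }
}
```

A smaller hash space saves memory but makes collisions between distinct features more likely, which is why the size is worth exposing as a parameter.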
Goals
- Provide standard fingerprint implementations for data mining experiments
- Open source implementations with an exact definition
- Easy-to-use command-line tool for the described tasks
- Provide a basis for the development of new fingerprint encodings, which can then be compared, for example, with existing algorithms that use exactly the same labeling algorithm.
Getting started
- All you need is this binary executable jar file and a Java 1.6 runtime environment
- A demo of how to train a LIBSVM model with a few program calls is given in this tutorial. You can reproduce the results of the command-line interface example using the data sets from the environmental toxicity challenge (prepared MDL SD files can be downloaded here: train, test). Binaries for LIBSVM can be downloaded from here.
- Input format: MDL SD files with attached hydrogen atoms are required. 3D fingerprints additionally require a reasonable 3D geometry.
- Examples of fingerprints in string format
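For readers unfamiliar with the sparse text format that LIBSVM reads, the following sketch shows how one labeled feature vector becomes one line of the form `label index:value index:value ...`, with 1-based indices in ascending order. It is not the exporter shipped with jCompoundMapper: the class `LibsvmLineSketch` and the method `toLibsvmLine` are hypothetical names used only for this illustration.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only: not the exporter shipped with jCompoundMapper.
final class LibsvmLineSketch {

    /**
     * Formats one sample as a LIBSVM sparse line.
     * Feature indices must be 1-based; zero-valued features are omitted.
     */
    static String toLibsvmLine(double label, Map<Integer, Double> features) {
        StringBuilder line = new StringBuilder();
        line.append(label);
        // TreeMap sorts the indices in ascending order, as LIBSVM expects.
        for (Map.Entry<Integer, Double> entry : new TreeMap<>(features).entrySet()) {
            int index = entry.getKey();
            double value = entry.getValue();
            if (index < 1) {
                throw new IllegalArgumentException("feature indices must start at 1");
            }
            if (value == 0.0) {
                continue; // sparse format: zeros are simply left out
            }
            line.append(' ').append(index).append(':').append(value);
        }
        return line.toString();
    }

    public static void main(String[] args) {
        Map<Integer, Double> features = new TreeMap<>();
        features.put(17, 2.0);
        features.put(3, 1.0);
        System.out.println(toLibsvmLine(1.0, features)); // prints "1.0 3:1.0 17:2.0"
    }
}
```

For a hashed fingerprint, the position in the hash space plus one can serve as the feature index, so the configured hash-space size bounds the dimensionality of the exported data.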
Files
Background
The first lines of code of jCompoundMapper were written about three years ago, when we needed reference approaches for benchmarking new algorithms in the field of cheminformatics. The tool has since been extended with exporters for common machine learning tools, so that users can train a model on their data set with a few calls from the command line. We think that parts of the source code, or the tool itself, may be useful for academic users who need alternative fingerprints for the analysis of their data or just reference approaches to test their own implementations. A problem with many implementations is that they are closed source, so nobody knows exactly what they do. This is a serious problem in academia, where scientists have to state explicitly which tools were used for their experiments.
To show that some of the implementations perform comparably to the state of the art, and to make the methods usable as reference methods in scientific papers, we published a study.
Most of the implementations have already been published; the original publications are cited in the paper. Many fingerprint algorithms are configurable, so depending on the chosen settings they may differ from their original implementations.