Papers: Literature text mining and extensions to UniChem


Two new papers from the group have just been published, both in the Journal of Cheminformatics and, of course, both Open Access.

The first deals with some extensions to UniChem to allow far more flexible searches. The abstract is:

UniChem is a low-maintenance, fast and freely available compound identifier mapping service, recently made available on the Internet. Until now, the criterion of molecular equivalence within UniChem has been on the basis of complete identity between Standard InChIs. However, a limitation of this approach is that stereoisomers, isotopes and salts of otherwise identical molecules are not considered as related. Here, we describe how we have exploited the layered structural representation of the Standard InChI to create new functionality within UniChem that integrates these related molecular forms. The service, called ‘Connectivity Search’ allows molecules to be first matched on the basis of complete identity between the connectivity layer of their corresponding Standard InChIs, and the remaining layers then compared to highlight stereochemical and isotopic differences. Parsing of Standard InChI sub-layers permits mixtures and salts to also be included in this integration process. Implementation of these enhancements required simple modifications to the schema, loader and web application, but none of which have changed the original UniChem functionality or services. The scope of queries may be varied using a variety of easily configurable options, and the output is annotated to assist the user to filter, sort and understand the difference between query and retrieved structures. A RESTful web service output may be easily processed programmatically to allow developers to present the data in whatever form they believe their users will require, or to define their own level of molecular equivalence for their resource, albeit within the constraint of identical connectivity.
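
To give a flavour of the layer-based matching the abstract describes, here is a minimal Python sketch. It is not UniChem's code and does not call the UniChem web service. The function names `split_inchi` and `compare_inchis` are hypothetical. Using the formula plus the `/c` layer as the matching key is an assumption made for illustration, and mixtures and salts (multi-component InChIs) are not handled.

```python
PREFIX = "InChI=1S/"


def split_inchi(inchi):
    """Split a Standard InChI string into a dict of its layers.

    The first layer (no prefix letter) is stored under "formula".
    Layers that follow the isotopic layer "/i" get an "i" prepended
    to their key so they do not overwrite the main-layer entries.
    """
    if not inchi.startswith(PREFIX):
        raise ValueError("not a Standard InChI: %r" % inchi)
    parts = inchi[len(PREFIX):].split("/")
    layers = {"formula": parts[0]}
    isotopic = False
    for part in parts[1:]:
        if not part:
            continue
        key = part[0]
        if key == "i":
            isotopic = True
        elif isotopic:
            key = "i" + key
        layers[key] = part[1:]
    return layers


def compare_inchis(query, candidate):
    """Return None if formula or connectivity differ (assumed match key).

    Otherwise return the sorted list of remaining layer keys whose
    content differs between the two structures.
    """
    q, c = split_inchi(query), split_inchi(candidate)
    for key in ("formula", "c"):
        if q.get(key) != c.get(key):
            return None
    keys = sorted((set(q) | set(c)) - {"formula", "c"})
    return [k for k in keys if q.get(k) != c.get(k)]


# L- and D-alanine share connectivity and differ only in stereochemistry.
L_ALA = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
D_ALA = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"
GLYCINE = "InChI=1S/C2H5NO2/c3-1-2(4)5/h1,3H2,(H,4,5)"

print(compare_inchis(L_ALA, D_ALA))    # layers that differ
print(compare_inchis(L_ALA, GLYCINE))  # None: connectivity differs
```

The formula is included in the key because the `/c` layer only records atom numbering and bonds, so two molecules with different elements can share the same `/c` string.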

The second deals with using text mining approaches to find papers that look like they could be abstracted into ChEMBL; that is, they contain keywords enriched in medicinal chemistry and compound structure concepts. The abstract for this paper is:


The large increase in the number of scientific publications has fuelled a need for semi- and fully automated text mining approaches in order to assist in the triage process, both for individual scientists and also for larger-scale data extraction and curation into public databases. Here, we introduce a document classifier, which is able to successfully distinguish between publications that are ‘ChEMBL-like’ (i.e. related to small molecule drug discovery and likely to contain quantitative bioactivity data) and those that are not. The unprecedented size of the medicinal chemistry literature collection, coupled with the advantage of manual curation and mapping to chemistry and biology make the ChEMBL corpus a unique resource for text mining.
The method has been implemented as a data protocol/workflow for both Pipeline Pilot (version 8.5) and KNIME (version 2.9) respectively. Both workflows and models are freely available at: ftp://ftp.ebi.ac.uk/pub/databases/chembl/text-mining. These can be readily modified to include additional keyword constraints to further focus searches.
Large-scale machine learning document classification was shown to be very robust and flexible for this particular application, as illustrated in four distinct text-mining-based use cases. The models are readily available on two data workflow platforms, which we believe will allow the majority of the scientific community to apply them to their own data.
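
For readers who have not met document classification before, the general idea can be sketched in a few lines of Python. This is not the published method: the real models are the Pipeline Pilot and KNIME workflows at the FTP address above. The sketch is a toy multinomial naive Bayes classifier. The class `TinyNaiveBayes`, the label names and the four training sentences are all made up for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lower-case the text and keep purely alphabetic tokens."""
    return [w for w in text.lower().split() if w.isalpha()]


class TinyNaiveBayes:
    """Toy multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of token counts
        self.doc_counts = Counter()  # label -> number of training docs
        self.vocab = set()

    def fit(self, docs, labels):
        for doc, label in zip(docs, labels):
            tokens = tokenize(doc)
            self.doc_counts[label] += 1
            self.word_counts.setdefault(label, Counter()).update(tokens)
            self.vocab.update(tokens)
        return self

    def log_scores(self, doc):
        """Return label -> unnormalised log posterior for one document."""
        total_docs = sum(self.doc_counts.values())
        # Tokens never seen in training are ignored.
        tokens = [t for t in tokenize(doc) if t in self.vocab]
        scores = {}
        for label, counts in self.word_counts.items():
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.doc_counts[label] / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, doc):
        scores = self.log_scores(doc)
        return max(scores, key=scores.get)


# Invented example titles, for illustration only.
train_docs = [
    "synthesis and inhibitory activity of novel kinase inhibitors",
    "potent selective antagonists with nanomolar binding affinity",
    "survey of bird migration patterns across europe",
    "economic effects of regional trade policy",
]
train_labels = ["chembl-like", "chembl-like", "other", "other"]

model = TinyNaiveBayes().fit(train_docs, train_labels)
print(model.predict("novel inhibitors with potent binding activity"))
print(model.predict("trade policy across europe"))
```

A real classifier of this kind would be trained on many thousands of abstracts, which is where a large curated corpus such as ChEMBL's becomes valuable.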

%T UniChem: extension of InChI-based compound mapping to salt, connectivity and stereochemistry layers
%A J Chambers
%A M Davies
%A A Gaulton
%A G Papadatos
%A A Hersey
%A JP Overington
%J Journal of Cheminformatics 
%D 2014
%V 6:43  
%O doi:10.1186/s13321-014-0043-5
%O http://www.jcheminf.com/content/6/1/43

%T A document classifier for medicinal chemistry publications trained on the ChEMBL corpus
%A G Papadatos
%A GJP van Westen
%A S Croset
%A R Santos
%A S Trubian
%A JP Overington
%J Journal of Cheminformatics 
%D 2014
%V 6:40  
%O doi:10.1186/s13321-014-0040-8
%O http://www.jcheminf.com/content/6/1/40