As chemical curator for ChEMBL, I spend a lot of time
processing, checking and standardising the compounds in the database. I
use various pieces of software for this, but mostly it’s Pipeline Pilot. For
those of you who don’t know, Pipeline Pilot is Accelrys’s graphical scientific
workflow authoring application, which lets us pass hundreds of thousands
of compounds through various components to make sure they meet our standards for
loading into ChEMBL.
However, it’s always incredibly useful to use other available software in a complementary manner, to see whether anything
may have been missed, whether something could be done differently, or just to see what alternative results
you get. One such open source software package is Indigo, created by GGA Software Services. One of the web
application developers was passing all of the ChEMBL compounds through
the standard Indigo loader, via a Python script, during the course of his work, and found that
about 9,000 compounds (0.7% of the current database) failed to load. The list of exceptions was then examined to see where the errors had come from. An important lesson here is that different toolkits will throw different exceptions, since these structures were all happy within the Pipeline Pilot (PP) environment.
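For anyone wanting to try a similar check, here is a minimal sketch of the general idea: push every structure through a loader, catch whatever it throws, and tally the exception messages. This is not the script that was actually used. The function name, the record IDs and the toy loader are made up for illustration, and the loader is passed in as a parameter so that the sketch runs without any toolkit installed.

```python
from collections import Counter


def tally_load_failures(records, load):
    """Try to load each (compound_id, molfile) pair and collect the failures.

    `load` is any callable that raises an exception on a structure it cannot
    read. With Indigo installed, `Indigo().loadMolecule` could be passed here
    (it raises IndigoException); that wiring is an assumption, not the
    original script.
    """
    failures = {}
    for compound_id, molfile in records:
        try:
            load(molfile)
        except Exception as exc:  # kept broad on purpose for this sketch
            failures[compound_id] = str(exc)
    # Group identical messages so the most common causes stand out.
    return failures, Counter(failures.values())


def toy_loader(molfile):
    # Hypothetical stand-in for a real toolkit loader.
    if "WIGGLY" in molfile:
        raise ValueError("query bond not allowed")
    return molfile


# Made-up records: the IDs and "molfiles" are placeholders, not ChEMBL data.
records = [("CPD-1", "ok"), ("CPD-2", "WIGGLY"), ("CPD-3", "WIGGLY")]
failures, reasons = tally_load_failures(records, toy_loader)
print(reasons.most_common())
```

Grouping by message is what makes a list of thousands of failures manageable: a handful of distinct messages usually covers most of them.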
The reasons
they had failed were as follows:
1. The presence of a wiggly (query) bond
2. Two stereo bonds connected to one chiral centre
This was split into two cases:
Firstly, where the two bonds effectively cancelled each other out and no stereochemistry was recorded at that centre.
Secondly, where the stereochemistry was present at that centre, but having two stereo bonds is against IUPAC drawing standards.
3. Presence of a stereo bond when there’s no chiral centre
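The first two categories in the list above can be spotted, very roughly, straight from the bond block of a V2000 molfile. The sketch below is only a heuristic for pulling out candidates for manual inspection. It is not what Indigo does internally, and it cannot detect category 3, which needs proper stereocentre perception. It assumes a well-formed V2000 molfile in which a single bond's stereo field is 1 (up), 6 (down) or 4 (either, i.e. wiggly), with the narrow end of a wedge at the first atom. The function names and the toy molecule are made up.

```python
from collections import Counter


def stereo_bond_flags(molfile):
    """Return (wiggly_bonds, crowded_atoms) for one V2000 molfile string."""
    lines = molfile.split("\n")
    # Counts line (4th line): atom count in columns 1-3, bond count in 4-6.
    n_atoms, n_bonds = int(lines[3][0:3]), int(lines[3][3:6])
    first_bond = 4 + n_atoms
    wiggly = []
    wedges_from = Counter()
    for line in lines[first_bond:first_bond + n_bonds]:
        atom1, atom2 = int(line[0:3]), int(line[3:6])
        bond_type, stereo = int(line[6:9]), int(line[9:12])
        if bond_type == 1 and stereo == 4:
            # Single bond with "either" stereo: the wiggly bond.
            wiggly.append((atom1, atom2))
        elif bond_type == 1 and stereo in (1, 6):
            # Wedge (up) or hash (down); count it against its first atom.
            wedges_from[atom1] += 1
    # Atoms that are the narrow end of two or more wedge/hash bonds.
    crowded = sorted(atom for atom, n in wedges_from.items() if n >= 2)
    return wiggly, crowded


def atom_line(symbol):
    # Toy atom line: coordinates are all zero because only bonds are read.
    return f"{0.0:10.4f}{0.0:10.4f}{0.0:10.4f} {symbol:<3} 0  0"


def bond_line(atom1, atom2, bond_type, stereo):
    return f"{atom1:3d}{atom2:3d}{bond_type:3d}{stereo:3d}"


# Made-up test structure: one carbon bonded to four halogens.
toy_molfile = "\n".join(
    ["toy", "  sketch", "", "  5  4  0  0  0  0  0  0  0  0999 V2000"]
    + [atom_line(s) for s in ("C", "F", "Cl", "Br", "I")]
    + [bond_line(1, 2, 1, 1), bond_line(1, 3, 1, 6),
       bond_line(1, 4, 1, 0), bond_line(1, 5, 1, 4)]
    + ["M  END"]
)
wiggly, crowded = stereo_bond_flags(toy_molfile)
print(wiggly, crowded)
```

On the toy structure this flags the bond between atoms 1 and 5 as wiggly, and atom 1 as carrying two stereo bonds. Anything flagged this way would still need checking by eye, ideally against the original paper.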
Some examples of typical scenarios are shown below:
From this Indigo check, I was able to extract and fix these compounds, a lot of which won’t have new standard InChIs, just updated molfiles (i.e. they will keep their CHEMBL ID). For most of these compounds, to confirm the changes that I was going to make, I went back to the original published literature. It is interesting to note that the majority of compounds with the two stereo bonds on a single chiral centre had been extracted exactly as they had been drawn in the paper.
These changes will be visible in ChEMBL_18 and I am aiming to
incorporate this Indigo loader into our standard compound cleanup and loading protocol. This will probably be implemented under the Indigo toolkit extension that is found in Knime.
If you have any questions or queries about what I have done, please feel free to
email: Chembl-help@ebi.ac.uk
An outstanding opportunity for computational research using approaches such as bioinformatics, genomics, systems biology, mathematical modelling, image analysis.
The Francis Crick Institute will open at St Pancras in central London in 2015. Its research will use interdisciplinary approaches to investigate the biology of human health and disease, supported by core funding from CRUK, the MRC, and the Wellcome Trust, and by grants from UK and international funding agencies.
The Crick is expanding Computational Biology research as a key component of its scientific strategy. The institute will offer an outstanding environment for computational research, with excellent opportunities for wet/dry collaborations across the range of biomedical and clinical research disciplines, supported by a strategic alliance with the Wellcome Trust Sanger Institute. The new Crick laboratories will feature excellent computational facilities including a state-of-the-art data centre.
The London Research Institute (LRI) is the largest Cancer Research UK research institute, with 40 research groups focusing on fundamental cancer biology. The Institute is based in well-equipped laboratories at Lincoln's Inn Fields in central London, and at Clare Hall in Hertfordshire.
Computational Biology in Cancer: http://tinyurl.com/ohgnxw8
The LRI recruitment process for 2013 will be carried out jointly with the Crick Institute. We shall appoint outstanding scientists seeking to establish independent and innovative research programmes focussed on:
Newly appointed group leaders will receive core funding for research personnel, travel and consumables, and access to the Institute's comprehensive computational core facilities, backed by competitive employment terms. The new group leaders will move to the Crick laboratories in 2015.
Over the last year we have been doing a lot of work designing and building an API layer for the ChEMBL database. The reason for adding this programmatic interface is to simplify many of the daily tasks we carry out on the database. From a technical perspective the API is actually a series of Object Relational Mapper (ORM) classes built on top of the ChEMBL database using the Python Django web framework. For many of our daily programmatic tasks we use the ORM directly, but we also expose the ORM as a RESTful interface using Tastypie.
Some example tools and processes currently using the new API include the ChEMBL Twitter bot and the database migration process (creating PostgreSQL and MySQL versions of the ChEMBL Oracle database during the ChEMBL release cycle). We are now at the stage where we can start to think about updating some of the existing larger services to run off the new API, and the first of these to make the transition are the ChEMBL Web Services. So, what have we done? Essentially we have rewritten the Web Services using the API (actually we use the ORM in this case) to interact with the ChEMBL data model. We have made this new set of Web Services available under the following base URL:
https://www.ebi.ac.uk/chemblws2
Those familiar with our current Web Services will notice we have added a ‘2’ to the end. An example call to the current live service looks like:
The new Web Service base URL will provide you with all the same methods listed on the page above and, more importantly, the format of the results returned by the Web Services will also be the same. Our plan going forward is to run both services for the next 4-6 weeks, and we ask users of the current ChEMBL Web Services to test the new versions (remember, you just need to add a 2) and report back any issues encountered. Assuming we do not hit any major obstacles, after the 4-6 week period we will replace the current live services with the new ChEMBL API based services.
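If your scripts have the base URL hard-coded in many places, a small helper along these lines may make testing easier. It is only a sketch. The old base URL is inferred from the "just add a 2" rule above, and the method path in the usage example is an illustrative placeholder, so check both against the calls you actually make.

```python
# Assumption: the current live base URL is the new one without the trailing '2'.
OLD_BASE = "https://www.ebi.ac.uk/chemblws"
NEW_BASE = "https://www.ebi.ac.uk/chemblws2"


def to_new_service(url):
    """Point a call to the current live service at the new base URL.

    URLs that do not start with the old base (including ones already using
    the new base) are returned unchanged.
    """
    if url == OLD_BASE or url.startswith(OLD_BASE + "/"):
        return NEW_BASE + url[len(OLD_BASE):]
    return url


# The path below is a placeholder chosen for illustration only.
print(to_new_service("https://www.ebi.ac.uk/chemblws/compounds/CHEMBL1"))
```

Because the result format is meant to be identical, running the same call against both base URLs and comparing the responses is a simple way to spot anything worth reporting.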
This first Web Service update is technology focused. We want to ensure the new services scale and perform well in the wild and that our end users do not notice a change (well, we are hoping you will see a performance boost). Further down the line we will make some bigger changes to the Web Services, such as reviewing methods, attributes and naming conventions, introducing paging and more. We will obviously consult the community and allow for a period of transition before releasing any such changes. Now is the time to tell us if you have any must-have new Web Service features.
Finally, it is not strictly true that the new Web Services are identical to the current live versions. There are a couple of new features we have built in, such as improved image rendering and JSONP responses. We will blog about these new features in the next couple of days, but in the meantime please have a look at the new ChEMBL Web Services and let us know how you get on.
Ben Stauch in the group has just been examined on his thesis - 'Methods for the Investigation of Protein-Ligand Complexes'. This was a tour de force of many techniques - NMR, computational and X-ray crystallography. Ben will be around for a few more months, writing things up, and completing/starting some experimental work on Xe complex refinement and characterisation.
Congratulations to Ben from all the group!
In due course, the thesis will be downloadable from the EBI and EMBL websites, and I'll update this post when the files are there.
We usually blog about exciting scientific and technological updates, interesting concepts, ideas and publications within the realm of life sciences and drug discovery.
This post is slightly different, as it deals with something that might be (even) more important:
A number of us in the ChEMBL Group (Rita (not in the picture above), Patricia, Felix, Anna, Sam, Anne, Mark, Michal, George, Gerard and Ashwini) are doing a Fun Run at Victoria Park on 12th October to help raise money for Cancer Research UK. We are doing this to support a colleague who is currently receiving treatment for cancer.
We've set up a JustGiving page which makes donations fast, easy and secure.
Anything you can donate (in almost any currency :)) to this worthwhile cause would be very much appreciated.
Following up on yesterday's post by George and Mark, I put together a slide, hopefully illustrating the advantages of document comparison using objects other than words alone.
Many of you will have noticed a new section on the ChEMBL interface, specifically at the Document Report Card page, called Related Documents. It consists of a table listing the links for up to 5 other ChEMBL documents (i.e. publications aka papers) that are scored to be the most similar to the one featured in the report card. Here's an example.
How does this work? There are examples of related documents sections online, e.g. in PubMed or on various journal publishers' websites. Document 'related-ness' or similarity can be assessed by comparing MeSH keywords or by clustering documents using TF-IDF weighted term vectors. Fortunately, ChEMBL puts a lot of effort into manually extracting and curating the compounds and biological targets from publications, so why not use these as descriptors to assess document similarity instead? As far as we know, this is the first time this approach has been implemented.
So, here's how it works:
Firstly, for each document in ChEMBL, its list of references is retrieved using the excellent EuropePMC web services. By considering documents as nodes which are connected with an edge if one paper cites the other, a directed graph structure emerges. By doing this for all ~50K documents in ChEMBL, you get the massive graph illustrated above in Cytoscape. As a bonus, by measuring the in- and out- degree of the nodes, one could check which are the most cited papers in ChEMBL - but that's the topic of another blog post. This graph could be further annotated with protein target families, authors and institutions, as it has been elegantly done here.
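To make the graph idea concrete, here is a small sketch of how in- and out-degree could be counted from a list of citation edges. It does not call EuropePMC and is not the code behind the Related Documents section. The document IDs are made up, and each edge is assumed to be a (citing, cited) pair.

```python
from collections import Counter


def degree_counts(citations):
    """Count in- and out-degree for a directed citation graph.

    `citations` is an iterable of (citing_doc, cited_doc) pairs; duplicate
    pairs are collapsed so each edge is counted once.
    """
    edges = set(citations)
    out_degree = Counter(citing for citing, _ in edges)  # references made
    in_degree = Counter(cited for _, cited in edges)     # times cited
    return in_degree, out_degree


# Made-up edges: doc_A cites doc_B and doc_C, doc_D cites doc_B.
citations = [("doc_A", "doc_B"), ("doc_A", "doc_C"), ("doc_D", "doc_B")]
in_degree, out_degree = degree_counts(citations)
print(in_degree.most_common(1))  # the most cited document in this toy graph
```

In this toy graph doc_B has the highest in-degree (it is cited twice), which is the same measure that would pick out the most cited papers in ChEMBL.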
Moving on, once a relationship between two documents is established, we need a way to quantify their similarity. As hinted above, we used the normalised overlap of compounds and targets reported in the two documents. This is done using the classic Tanimoto coefficient, so if doc A reports compounds (1,2,3) and doc B reports compounds (3,4,5), their compound Tanimoto similarity T is 1/5 or 0.2. Exactly the same applies for the target-based document similarity. The composite score we use to rank docs in the Related Documents section is simply the maximum of the two individual ones.
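The scoring described above fits in a few lines. This is a sketch of the arithmetic only, not the production code. The function names and the target labels are made up, and returning 0.0 when both sets are empty is an assumption, since the post does not cover that case.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    # Assumption: two empty sets are treated as having zero similarity.
    return len(a & b) / len(union) if union else 0.0


def document_similarity(compounds_a, compounds_b, targets_a, targets_b):
    """Composite score: the larger of the compound and target similarities."""
    return max(tanimoto(compounds_a, compounds_b),
               tanimoto(targets_a, targets_b))


# The worked example from the text: one shared compound out of five.
print(tanimoto({1, 2, 3}, {3, 4, 5}))  # 0.2

# Made-up target sets: one shared target out of two gives 0.5, which wins.
print(document_similarity({1, 2, 3}, {3, 4, 5}, {"T1"}, {"T1", "T2"}))  # 0.5
```

Taking the maximum means two papers score highly if they share either their compounds or their targets; they do not need to share both.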
What does all that mean in practice? It means that two papers are listed as similar if their reported compounds or biological targets overlap significantly (and one cites the other). For example, papers with follow-up experiments on the same candidate drug will be deemed similar, e.g. this one. The same will apply to two papers that involve kinase panel screening assays. A desirable side-effect is that by following the links, the tenacious user may traverse the whole graph displayed above!
A paper from Gerard in the group on some of his proteochemometric modelling work; a link to the paper is here. Z-scales rule! (The original Sandberg et al. J Med Chem paper on the Z-scales was one of the 'lightbulb turning on' moments in my professional life - go hunt it down if you don't know it.)
%T Benchmarking of protein descriptor sets in proteochemometric modeling (part 2): modeling performance of 13 amino acid descriptor sets
%A G.J.P. van Westen
%A R.F. Swier
%A I. Cortes-Ciriano
%A J.K. Wegner
%A J.P. Overington
%A A.P. IJzerman
%A H.W.T. van Vlijmen
%A A. Bender
%J J. Cheminformatics
%D 2013
%V 5
%O doi:10.1186/1758-2946-5-42