Crowdsourced Validation of Research Code Compounds

01 Sep 2012

I love the drug discovery process and how it works: the progression from 'secret' to 'public' as the health benefits of a potential new drug become clearer. There is also the excitement of seeing a compound or target validation result for the first time - these things can really change your world view and advance your own ideas. This free and liquid exchange of accurate, reliable data is a fundamental part of invention, discovery and the progress of science.

One part of this process is the Research Code of a compound - the name that looks like a few characters followed (usually) by some digits, e.g. UK-92,480. There are some other posts on research codes on the ChEMBL-og, and there are also a few tables in the ChEMBL database that contain some useful metadata connected with these.

Anyway, making a reliable link between a research code and a 2-D structure or sequence is an important thing, and should be easier than it is. So, over a dinner the other evening, in Vienna, with Tudor Oprea, we stumbled across the following idea, and I think it's a reasonable one, worth sharing, and following up on.

There are sometimes errors in the structures associated with research codes in the literature, on the web, etc. The people who know the compound structures best are the companies themselves, and given that they have disclosed the structures themselves, whether in a scientific publication, at a conference, or in regulatory filings, the data is not secret or confidential. It's public, pure and simple.

So, here's the proposal - why don't we bundle up, for each company, the compounds for which we have both a published research code and a 2-D structure, and then get them validated by the company that originally synthesized them? Ship a simple list of research code and InChI pairs, and get them checked and, where necessary, corrected. This 'blessing' of the name provides provenance and a mark of quality for other resources that may inherit and use the pairs. It also provides a mapping between the name and an InChI, and I'm sure you all know the near-infinite array of things that can be done once you have an InChI ;)
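To make the round trip a bit more concrete, here is a minimal Python sketch of how the bundling and checking might look. It is only an illustration under several assumptions: the function names are made up for this post, the research codes in the example are invented, the InChIs are those of trivial molecules used purely as placeholders, and the leading letters of a code are used as a rough stand-in for the originating company (in practice that mapping would need curating by hand).

```python
from collections import defaultdict


def company_prefix(research_code):
    """Return the leading letters of a research code, e.g. 'UK' for 'UK-92,480'.

    Assumption: the letter prefix is a usable proxy for the originating company.
    """
    prefix = []
    for ch in research_code.strip():
        if ch.isalpha():
            prefix.append(ch.upper())
        else:
            break
    return "".join(prefix)


def bundle_by_prefix(pairs):
    """Group (research_code, inchi) pairs into one bundle per code prefix."""
    bundles = defaultdict(list)
    for code, inchi in pairs:
        bundles[company_prefix(code)].append((code, inchi))
    return dict(bundles)


def compare_with_validated(submitted, validated):
    """Classify each submitted pair against the list a company sends back.

    Both arguments are lists of (research_code, inchi) pairs. A code that is
    missing from the returned list is reported as 'unchecked'.
    """
    validated_map = dict(validated)
    report = {"confirmed": [], "corrected": [], "unchecked": []}
    for code, inchi in submitted:
        if code not in validated_map:
            report["unchecked"].append(code)
        elif validated_map[code] == inchi:
            report["confirmed"].append(code)
        else:
            # Keep the company's structure as the corrected value.
            report["corrected"].append((code, validated_map[code]))
    return report


if __name__ == "__main__":
    # Invented codes with placeholder InChIs (methane, water, ethanol).
    submitted = [
        ("ABC-123", "InChI=1S/CH4/h1H4"),
        ("ABC-456", "InChI=1S/H2O/h1H2"),
        ("XYZ-789", "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"),
    ]
    # Hypothetical reply from the 'ABC' company: one pair confirmed, one changed.
    validated = [
        ("ABC-123", "InChI=1S/CH4/h1H4"),
        ("ABC-456", "InChI=1S/H3N/h1H3"),
    ]
    print(bundle_by_prefix(submitted))
    print(compare_with_validated(submitted, validated))
```

Comparing InChI strings for exact equality is deliberately strict here; a real exercise would probably also want to record why a structure was corrected (stereochemistry, salt form, and so on), which this sketch does not attempt.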

We have already done something very similar with a large pharma company, as one way of estimating the errors in ChEMBL. This had really interesting results (I'll try to get permission to share the outcome, but it was definitely worth doing). I floated the idea with a few pharma people at a recent meeting, and the response was favourable. So we'll try to progress this as a pre-competitive project, and ensure that the data ends up in the public domain - if anyone else wants to join these efforts, get in touch.

Notes:

Of course, this will involve some effort on the part of the companies doing the validation, and it is not intended as a way to 'fish' for non-disclosed or 'borderline' disclosed structures. So restricting it to cases where there is a PubMed source, which ensures that the compounds really are definitively public, is probably a good place to start.
I find that some of the best sources of newly disclosed structures are far-east chemical suppliers; they are great at making products available really quickly. I would like to find out what websites they use to get the information so fast.
It's also potentially a little bit like the trust certificates issued to web sites by certification authorities, so that when you use a secure, private web resource like ChEMBL or Facebook over HTTPS, you know you're getting the real deal.