Annotating ChEMBL With Disease/Indication Data


A surprisingly difficult thing to do is to search ChEMBL for potential anti-cancer compounds. An obvious place to start is to build a list of genes involved in cancer, pull back all the data against these genes, and then apply activity filters, etc. Alternatively, you could search for keywords like 'xenograft' in the assay descriptions, since these are likely to be linked quite strongly to anticancer indications. However, it is a big pain to do, far harder than it should be; the good news is that we are starting to think about fixing this, a little. A surprising number of gotchas get in the way - for example, a target clearly linked to cancer could be an anti-target in a cardiovascular project.

The data in ChEMBL is largely centred around 'depositions from projects'. Here a project can be the assembled data in a publication, which was either part of, or the whole of, a 'project' in the authors' lab(s), or it can be a deposition that isn't from the literature, for example the GSK malaria HTS dataset. Each of these projects had an intent - they were making compounds to cure gout, or malaria, etc. Capturing this 'intent' data is the key thing to try to do, and it is often pretty simple to do just from the title of the paper, for example:

New Serotonin 5-HT1A Receptor Agonists with Neuroprotective Effect against Ischemic Cell Damage


as a title gives a pretty clear clue that the intent was to discover compounds useful for the treatment of stroke. That particular paper is here.

So how successful is the simple approach of inferring the disease area of a project from the titles of papers? It turns out it is pretty good, largely due to the frequent use of canonically constructed titles in the literature - "X for the treatment of Y", where X is a compound-related term and Y is a disease-related term.
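As a rough illustration of the "X for the treatment of Y" pattern, here is a minimal Python sketch that splits such a title into its compound-related and disease-related parts. The regular expression, the function name `split_title` and the sample title are all our own assumptions for illustration, not anything that exists in ChEMBL, and real titles will need many more patterns than this one.

```python
import re

# Hypothetical pattern for titles shaped like "X for the treatment of Y".
# Real titles vary a lot ("X as a treatment for Y", "anti-Y agents", ...),
# so this only covers the single canonical form discussed in the text.
TREATMENT_PATTERN = re.compile(
    r"^(?P<compound>.+?)\s+for\s+the\s+treatment\s+of\s+(?P<disease>.+?)\.?$",
    re.IGNORECASE,
)


def split_title(title):
    """Return (compound_term, disease_term), or None if the title does not match."""
    match = TREATMENT_PATTERN.match(title.strip())
    if match is None:
        return None
    return match.group("compound"), match.group("disease")


# A made-up title, purely for illustration.
print(split_title("Novel kinase inhibitors for the treatment of gout"))
```

Titles that do not follow the canonical form simply return `None`, which is a reminder that this kind of pattern matching is only a first pass.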

So it sounds like a simple problem: take the titles, tag them up with disease terms, and away you go. However, this is where it gets complicated - there is not a good taxonomy/ontology for diseases, or at least not one that maps back into discovery space well. You can get quite a long way with a list of synonyms for various diseases, but the world needs a common public dictionary/vocabulary for disease terms. To balance this lack of a fantastic existing standard, there is the ATC classification for drugs. It is a little inconsistent in the way it mixes chemotypes, pathways and targets, but it is stable, accepted and robust, and all new approved drugs will be placed into this taxonomy. So let's see what ATC-style tagging can achieve.
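To make the synonym-list idea concrete, here is a small Python sketch of tagging titles from a hand-built dictionary. The dictionary contents and the names `DISEASE_SYNONYMS` and `tag_title` are hypothetical and chosen only for illustration; a real list would be far larger and would need curating.

```python
import re

# Hypothetical, hand-built synonym lists; not a real vocabulary.
DISEASE_SYNONYMS = {
    "malaria": ["malaria", "antimalarial", "plasmodium"],
    "stroke": ["stroke", "ischemic", "ischaemic"],
    "tuberculosis": ["tuberculosis", "antitubercular"],
}


def tag_title(title):
    """Return a sorted list of disease tags whose synonyms appear in the title."""
    lowered = title.lower()
    tags = []
    for disease, synonyms in DISEASE_SYNONYMS.items():
        # Whole-word matching, so that short synonyms do not fire inside longer words.
        if any(re.search(r"\b" + re.escape(s) + r"\b", lowered) for s in synonyms):
            tags.append(disease)
    return sorted(tags)


title = ("New Serotonin 5-HT1A Receptor Agonists with Neuroprotective "
         "Effect against Ischemic Cell Damage")
print(tag_title(title))
```

With this toy dictionary the 5-HT1A title above is tagged as stroke via the word 'Ischemic', which is exactly the kind of inference a curator makes by eye.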

What we are specifically trying to do is take neglected diseases (defined here as tuberculosis, helminth infections, schistosomiasis, trypanosomiasis, HIV/AIDS and malaria), and manually tag up the assays in ChEMBL with the corresponding ATC codes, at the 'depth' supported by the title. For example, if malaria compounds are artemisinin-based, then the assay can be placed at a deeper level (P01BE) than one that is just a malaria-targeted approach (P01B). Of course, it's then possible to do cool things that group data across the span of the ATC classification.
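Because ATC codes are hierarchical (levels are identified by the first 1, 3, 4, 5 and 7 characters of the code), grouping tagged assays at any depth comes down to truncating the code. The sketch below assumes a simple mapping from assay identifier to ATC code; the identifiers and the function names are made up for illustration and do not reflect how ChEMBL stores this.

```python
from collections import defaultdict

# Number of characters that identify each of the five ATC levels.
ATC_LEVEL_LENGTHS = (1, 3, 4, 5, 7)


def atc_ancestors(code):
    """Return the code truncated to every ATC level it reaches, shallowest first."""
    return [code[:n] for n in ATC_LEVEL_LENGTHS if n <= len(code)]


def group_by_atc_level(tagged_assays, level):
    """Group assay identifiers by their ATC code truncated to the given level (1-5).

    Assays tagged at a shallower depth than the requested level are left out.
    """
    length = ATC_LEVEL_LENGTHS[level - 1]
    groups = defaultdict(list)
    for assay_id, code in tagged_assays.items():
        if len(code) >= length:
            groups[code[:length]].append(assay_id)
    return dict(groups)


# Hypothetical assay identifiers, tagged at the depth the title supports.
tagged = {"assay_1": "P01BE", "assay_2": "P01B", "assay_3": "J04A"}
print(atc_ancestors("P01BE"))
print(group_by_atc_level(tagged, 3))
```

Grouping at level 3 puts the artemisinin-tagged assay (P01BE) together with the more general antimalarial one (P01B), while asking for level 4 keeps only the assays that were tagged that deeply.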

We'll let you know how we get on!