ChEMBL blog

A sort of H-index for the coverage of bioactivity databases
26 Aug 2013

After a long time out of the office on summer holiday, I'm just sorting out my satchel, uniform and pencil case for the autumn term. I've missed being in the office, and I've missed my group and colleagues. I've had time to think of some ideas, some bad, and some potentially useful - the group are pretty good at sorting them into the relevant piles.

So here's a little idea about quantifying the coverage/diversity of the contents of a bioactivity database (like ChEMBL, but also the internal knowledge of a company in its screening and lead optimisation programs, etc). Essentially, it's applying the H-index, regularly used for citation analysis, to bioassay results. There's a lot of criticism of the H-index in its use for comparing researchers, and plenty of problems in cross-field comparison, but that is not for here. However, the H-index is a pretty robust statistic capturing the structure of a frequency-class distribution.

In the context of bioassay data, the H-index (well, let's call it the Ch-index and Ass-index from now on to avoid confusion) can capture the number of bioassay data points for a set of compounds, or the number of compounds screened across a set of bioassays. This is probably best illustrated with a series of pictures of hypothetical bioactivity matrices - a red cell indicates the presence of a measured bioactivity, the columns are assays, and the rows are compounds.

So here is a high-throughput screen - a single assay, with a large number of tested compounds.

And here is broad profiling of a single compound.

Here is a sparse matrix, essentially full of cherry-picked bioassay data points, each for one compound in one assay. There's very little SAR data within this set (the kind that allows the exploration of differences within an assay or compound series), so building predictive models, and a whole bunch of other stuff that one would want to do, is difficult.

Imagine some more experiments (profiling) are done on this set of assays/compounds, and you end up with a matrix such as the following. You can see there are now blocks of data, and stripes (both columns and rows). A row is a compound run across multiple assays, and a column is an assay with multiple compounds tested. Of course, the axes can be ordered to maximise the 'blockiness' of the view of the data.

Here is the matrix after some more assays are run; the Ch- and Ass-index will both increase further. The data becomes more useful, since it is likely that a larger number of the queries one would want to make are already answered by a measurement, and for the missing ones, one would assume that better predictive models could be built.

Finally, complete knowledge, everything becomes a simple lookup, assuming the data is accurate (etc.).

So as one goes through this progression of filling in the matrix (and incurring the expense) the Ch- and Ass-indices both get larger. For the above, the possible 'space' has been confined, but of course new compounds are made all the time, and new bioassays are developed all the time, so the total possible space increases.
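For the curious, here is a minimal Python sketch of how the two indices could be computed from a boolean compound-by-assay matrix. It assumes the usual H-index definition is carried over directly: the Ch-index is the largest h such that h compounds each have data in at least h assays, and the Ass-index is the same thing over columns. The function names and the tiny 4x4 matrices are made up for illustration.

```python
from typing import List, Sequence

Matrix = List[List[bool]]  # rows are compounds, columns are assays


def h_index(counts: Sequence[int]) -> int:
    """Largest h such that at least h of the counts are >= h."""
    h = 0
    for rank, count in enumerate(sorted(counts, reverse=True), start=1):
        if count < rank:
            break
        h = rank
    return h


def ch_index(matrix: Matrix) -> int:
    """H-index over rows: h compounds each measured in at least h assays."""
    return h_index([sum(row) for row in matrix])


def ass_index(matrix: Matrix) -> int:
    """H-index over columns: h assays each run on at least h compounds."""
    return h_index([sum(col) for col in zip(*matrix)])


T, F = True, False

# Hypothetical 4 compounds x 4 assays, mimicking the progression of pictures.
hts = [[T, F, F, F] for _ in range(4)]          # one assay, every compound
sparse = [[i == j for j in range(4)] for i in range(4)]  # isolated points
blocks = [[T, T, F, F],
          [T, T, F, F],
          [F, F, T, F],
          [F, F, F, T]]                          # one 2x2 block plus singletons
full = [[T] * 4 for _ in range(4)]               # complete knowledge

for name, m in [("hts", hts), ("sparse", sparse), ("blocks", blocks), ("full", full)]:
    print(name, ch_index(m), ass_index(m))
# hts 1 1
# sparse 1 1
# blocks 2 2
# full 4 4
```

Note that a single huge screen and a handful of isolated data points score the same under this definition; the indices only grow once blocks of shared compounds and assays appear.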

There are also some interesting features to the above; imagine collapsing data across assays for orthologues - if you assume that mouse, and dog, and human and zebra activities for a given target are all pretty much the same, you don't really 'value' an extra species added to the matrix. You can however go further, and collapse across protein families (for example Pfam domains), to get an idea of total target class diversity. Similarly, it's possible to index/cluster compounds by shared scaffolds/chemotypes, and one can imagine the exploration of a series of 'lenses' that allow one to view coverage from a set of different perspectives.
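One way such a 'lens' could work, sketched in Python under my own assumptions: give every assay column a group label (a target regardless of species, or a protein family), merge the columns that share a label with a logical OR, and recompute the index on the collapsed matrix. The labels, the function names and the small matrix below are all hypothetical.

```python
from typing import List, Sequence


def h_index(counts: Sequence[int]) -> int:
    """Largest h such that at least h of the counts are >= h."""
    h = 0
    for rank, count in enumerate(sorted(counts, reverse=True), start=1):
        if count < rank:
            break
        h = rank
    return h


def collapse_assays(matrix: List[List[bool]], groups: Sequence[str]) -> List[List[bool]]:
    """Merge assay columns sharing a group label; a merged cell is True
    if the compound was measured in any assay of that group."""
    labels = list(dict.fromkeys(groups))  # unique labels, first-seen order
    return [
        [any(v for v, g in zip(row, groups) if g == label) for label in labels]
        for row in matrix
    ]


def ass_index(matrix: List[List[bool]]) -> int:
    """H-index over assay columns."""
    return h_index([sum(col) for col in zip(*matrix)])


T, F = True, False

# Hypothetical: three orthologue assays for target X, one assay for target Y.
groups = ["X", "X", "X", "Y"]
matrix = [[T, T, T, F],
          [T, T, F, F],
          [T, F, T, T]]

print(ass_index(matrix))                           # 2
print(ass_index(collapse_assays(matrix, groups)))  # 1
```

Here the orthologue assays were doing most of the work: once they are treated as one target, the Ass-index drops from 2 to 1. The same function applied to the transposed matrix, with scaffold labels for the compounds, would give the chemotype view.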

So how does this map on to the real world?

ChEMBL classic (for want of a better phrase) is like the 4th matrix in the series above, largely a set of stripe- and some block-like structures around a chemical series (since chemists typically explore the chemical space around a lead in optimisation, and screen against related targets). ChEMBL depositions, such as the great GSK PKIS sets, are larger blocks: more comprehensive profiling of a set of compounds across a set of assays. PubChem Bioassay - specifically the output of the NIH MLP - is 'complete' for a relatively small set of assays, but the compounds within the set are diverse.

Finally - Ass-index is probably not the best acronym.

jpo

Removal of Metal-Containing Compounds
19 Aug 2013

Further to my post a few months ago (To Remove or Not to Remove) about removing certain problem metal-containing compounds, we have now come up with a plan of what to do.

Instead of labeling this curation as ‘removal of inorganics’, or ‘removal of organometallics’, we simply want this to be known as ‘removal of some metal-containing compounds’.

The criterion that we used was to exclude a large proportion of compounds that contained a metal, apart from cases where a metal was commonly found as part of a pharmaceutical preparation (e.g. Ranitidine Bismuth Citrate CHEMBL2111286, Silver Sulfadiazine CHEMBL1382627, Bacitracin Zinc CHEMBL2096639). The reasoning behind the removal of such compounds was that most of these metals are bonded to the rest of the compound components via coordinate bonds. However, due to InChI limitations, there is no way of creating a Standard InChI that retains coordinate bond information. As we use Standard InChI as the main compound identifier of uniqueness in ChEMBL, it was decided to exclude the structures altogether.
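As a rough illustration of the kind of rule described above, here is a Python sketch that flags a compound from its molecular formula and spares a short list of exceptions. It is only a sketch under stated assumptions: the set of metal symbols is a deliberately incomplete, illustrative choice (not the list used for ChEMBL_17), the function names are made up, and the real curation excluded 'a large proportion' of metal-containing compounds rather than applying a mechanical filter. The three exception identifiers are the examples named in this post.

```python
import re
from typing import Set

# Illustrative and incomplete; NOT the actual list used in ChEMBL curation.
METALS: Set[str] = {
    "Fe", "Cu", "Zn", "Pt", "Pd", "Ru", "Au", "Ag",
    "Bi", "Co", "Ni", "Mn", "Hg", "Sn", "Gd",
}

# Pharmaceutical preparations named in the post as being kept.
PHARMA_EXCEPTIONS: Set[str] = {
    "CHEMBL2111286",  # Ranitidine Bismuth Citrate
    "CHEMBL1382627",  # Silver Sulfadiazine
    "CHEMBL2096639",  # Bacitracin Zinc
}

# An element symbol is one capital letter, optionally followed by one lowercase.
ELEMENT = re.compile(r"[A-Z][a-z]?")


def elements_in(formula: str) -> Set[str]:
    """Element symbols present in a molecular formula such as 'Cl2H6N2Pt'."""
    return set(ELEMENT.findall(formula))


def contains_metal(formula: str) -> bool:
    """True if the formula contains any symbol from the illustrative set."""
    return bool(elements_in(formula) & METALS)


def structure_withheld(chembl_id: str, formula: str) -> bool:
    """True if the structure would be dropped under this sketch of the rule."""
    return contains_metal(formula) and chembl_id not in PHARMA_EXCEPTIONS


# 'CHEMBL_PLACEHOLDER' is a made-up identifier, used only for the example.
print(structure_withheld("CHEMBL_PLACEHOLDER", "Cl2H6N2Pt"))  # True
print(structure_withheld("CHEMBL1382627", "C10H9AgN4O2S"))    # False
print(structure_withheld("CHEMBL_PLACEHOLDER", "C9H8O4"))     # False
```

Working from the formula rather than the structure fits the situation described here, since the formula is the piece of information that is kept for these compounds.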

This change will come into effect with the release of ChEMBL_17, and only affects ~3,200 compounds. The compound image on the interface will be replaced with an icon that shows it’s a metal-containing compound (see picture, above). The structures will not be part of the download set on the FTP site, but we will retain the molecular formula in both the downloads and on the ChEMBL interface, so that you can still see the elemental make-up of the compound. We will, of course, retain all of the bioactivity data on these compounds.

Any questions, please feel free to contact chembl-help@ebi.ac.uk

Differences in Timeline of European and US Approval of Drugs
16 Aug 2013

I had a question the other day - paraphrased it was 'Why do you focus on US approvals for the great ChEMBL-og drug monographs; don't you miss things in Europe?'. Well for an admittedly small subset, here's the reason.

The graph above is the difference (in years) between approval in the US and Europe, for all worldwide approved protein kinase inhibitors (N=30). 28 of these are approved in the US, and 20 in Europe - and all of these 20 are approved in the US (as of 16th August 2013). As you can see, typically drugs are approved around a year later in Europe than in the US, and no examples from this set show the reverse behaviour. The two 'Japan only' compounds are Fasudil and Umirolimus (remember, we include the rolimus class - but they're not the classical small molecule kinase inhibitors).

Caveat - the data is initial, and I haven't gone through and checked every data point yet, but things won't change a lot. I'll also go through and add Japan to this analysis when I have a little more free time....

Update - the initial data from Japan is quite different; there are a few (so far) where Japan approved an NME prior to the rest of the world. Anyway, more later....

jpo

It's that time again.. An update on the GPCR structures!

13 Aug 2013

It's been roughly a month since the last update on GPCR structures, and, oh boy, we do live in interesting times!

As I mentioned in my last post, the next major publication would probably be the glucagon receptor. And indeed it was (4L6R, 1). However, the good folks at Heptares had a surprise for us, as they simultaneously released the structure of the corticotropin-releasing factor receptor 1 (4K5Y, 2). Two class B GPCRs in just one month!

Despite the glucagon receptor structure only having a resolution of 3.3 Å, it does show some interesting features. The binding pocket of the protein is exceptionally large, which is not surprising when considering its main ligand, glucagon. Also, the N-terminus of the first transmembrane helix is a bit longer than in any solved class A structure. This protein has been recognised as a potential drug target for type 2 diabetes, and this solved structure can hopefully help in this process. The protein was crystallized in complex with the antagonist NNC0640 (Sorry, no ChEMBL entry just yet!); however, the ligand could not be reliably identified in the electron density, so sadly the binding mode could not be determined.

The corticotropin-releasing factor receptor is an interesting target, in the sense that it is already a well-established drug target for diseases like diabetes, depression etc. The receptor was crystallized in a complex with the antagonist CP-376395.

The GPCR structure research has also entered what I'd like to call phase 2. The first GPCR structures with a disease-causing mutation were actually released a few months ago, but I missed them! 4BEZ and 4BEY both feature a G90D mutation in rhodopsin which causes night blindness (3). It'll be interesting to see when these start being as common as mutated kinases!

Ah, and as I ranted about in my last post, alignment based on pure sequence alone is next to impossible. Luckily, GPCRs share the common fold of having 7 transmembrane helices, so a 3D-based pairwise alignment works out quite well! See first figure for the MNYFIT 'referenceless' structural alignment. The sequence alignment produced using t_coffee with Joy markup is displayed at the bottom.

1. 4grvA (turquoise) - Rat neurotensin receptor NTS1
2. 4l6rA (light purple) - Human Glucagon receptor
3. 1f88A (yellow) - Bovine Rhodopsin
4. 4k5yA (red) - Human corticotropin-releasing factor receptor

                           10        20        30        40        50  
4grvA  (  52 )                            nsdldVnTdiyskvlvtaiYlalfvv
4l6rA  ( 123 )                         mdgeeievqkevakmyssfqvmytvGYsl
1f88A  (   1 )    mnGtegpnfyVPfsnktgvVrsPfeapQyyLaepwqFsmlAayMflliml
4k5yA  ( 115 )                                       hyhvaaiinylGhci
                                                   aaaaaaaaaaaaaaaaa

                           60        70        80        90        100 
4grvA  (  78 )    GtvgNsvtlftlar-k--slqstvhyHlgsLalSDllILllAMpvElyNF
4l6rA  ( 152 )    SlgaLllAlaiLggl--sklhctrNaIHanLFaSFvlkAssv-lvidgl-
1f88A  (  51 )    GfpiNflTlyVTvqHk--kLrtpLNyILlnLAvADlfMVfgGFtTTlyT-
4k5yA  ( 130 )    SlvaLlvAfvlFlr--arsircLrNiIHanLIaAFilrnatw-fvvqlT-
                  aaaaaaaaaaaa          aaaaaaaaaaaaaaaaaaaa aaaaaa 

                           110       120       130       140       150 
4grvA  ( 129 )    IwvhhpWafgdagÇrgyYflRDactYATAlNVasLSvaRylAichpfkak
4l6rA  ( 198 )    lrt--lsdgavagÇrvaavfmqyGiVaNYcWLlVEglyLhnllglatl--
1f88A  (  98 )    Slh-GyFvfgptGÇnlEGffATLGGEIaLwSLvvLaieRyvvvckpms-n
4k5yA  ( 176 )    msp-evhqsnvgwÇrlvtaaynyfhVTNFFWMfGeGcylhtaIvl-----
                           aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa       

                           160       170       180       190       200 
4grvA  ( 179 )    tlmsrsrtkkfisaIwlaSallAi-pMlftMGlqnrSad-gthpgGlVÇT
4l6rA  ( 259 )    p--ersffslylgigwgaPmlfVvpwavvkclf------en-v----qçw
1f88A  ( 146 )    frfgenhAimgvafTwvmAlaCAa-pPlvgwSrYIPEGM------QCSÇG
4k5yA  ( 220 )    t--drlrawmficiGwgvPfpiivaWaigKlyy------dn-e----kÇw
                          aaaaaaa  aaaaaa   aaaa                    

                           210       220       230       240       250 
4grvA  ( 227 )    ----PivdtatvkvvIqvNtfmSFlfPmlvIsilNtvIAnkLtvmv----
4l6rA  ( 296 )    t-------s-ndnmgfwwilrfPvflailiNffifvrIvqllvaklra--
1f88A  ( 189 )    IDYYTpheetnNesFViyMfvvHfiiPlivIffcygqLvftvkeaA--aS
4k5yA  ( 260 )    aG------krpgvyTdyiyqgp-MalvlliNfiflfnIvrilmtklra--
                               aaaaaaaaa  aaaaaaaaaaaaaaaaaa        

                           260       270       280       290       300 
4grvA  ( 300 )    ---v------qalrhGVlvAraVviafvvcWlpYHvRRlmFCyisdeqWt
4l6rA  ( 336 )    ----rqmhhtdykfrlAksTltLIplLGvhevvfafvt-d-ehaq-----
1f88A  ( 241 )    attq------kaekevTrMViiMviaFliCWlpYAgvAfyIfthq--g--
4k5yA  ( 301 )    ----sttseTiqArkavkaTlvLlplLgitymlafvnevs----------
                             aaaaaaaaa aaaaaaaaaaaaaa   aa          

                           310       320       330       340       350 
4grvA  ( 341 )    tflFdfYHyfYmlTNalAYasSAinpilYnlvsanFrqv           
4l6rA  ( 375 )    ---gtlrsaklffdlflsSfqGllVAvlYCflnkeVqselrrrwhrwrlg
1f88A  ( 281 )    ---sdfgPifMtipAFfAKtSAvyNPviYimmnkqFrnCmvttlccgknp
4k5yA  ( 341 )    ------rvvfiyfnAfLeSfqGffVSvfAcflns                
                        aaaaaaaaaaaaa aaaaaaaaaaaa                  

                           360   
4grvA                            
4l6rA  ( 422 )    kvlweern       
1f88A  ( 334 )    sttvsktetsqvapa
4k5yA
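
To put a rough number on how little sequence these receptors share, one could compute percent identity over the aligned (non-gap) columns of two rows. The Python sketch below is my own illustration and not part of the MNYFIT/t_coffee/Joy workflow: the function name is made up, letter case is ignored because the Joy markup uses it as structural annotation, and the two rows are copied from the second block of the alignment above (4l6rA and 4k5yA, the two class B receptors).

```python
def percent_identity(row_a: str, row_b: str) -> float:
    """Percent of identical residues over columns where neither row has a gap.

    Case is ignored, and '-' and ' ' are both treated as gaps.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    aligned = 0
    matches = 0
    for a, b in zip(row_a, row_b):
        if a in "- " or b in "- ":
            continue  # skip columns with a gap in either sequence
        aligned += 1
        if a.upper() == b.upper():
            matches += 1
    return 100.0 * matches / aligned if aligned else 0.0


# Rows taken from the alignment block covering columns 51-100.
glucagon_r = "SlgaLllAlaiLggl--sklhctrNaIHanLFaSFvlkAssv-lvidgl-"
crf1_r = "SlvaLlvAfvlFlr--arsircLrNiIHanLIaAFilrnatw-fvvqlT-"

print(f"{percent_identity(glucagon_r, crf1_r):.1f}% identity over this block")
```

Running the same function on a class A versus class B pair from the alignment should make the point about sequence-only alignment rather vividly.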

Nature. 2013 Jul 25;499(7459):444-9
(1) Siu, F.Y. et al., Nature 16 July 2013: 499, 444–9
(2) Hollenstein, K. et al., Nature 16 July 2013: Advance Online Publication, 1476-4687
(3) Singhal, A et al., EMBO Rep, June 2013, 14(6):520-6

david

2nd RDKit UGM - A reminder
13 Aug 2013

For those who forgot to register, this is a gentle reminder for the 2nd RDKit User Group Meeting. The meeting will take place October 2nd-4th here at the Genome Campus in Hinxton, UK. We're using a different format for the meeting this year:

Days 1 and 2: Talks, lightning talks, roundtable(s), discussion, and something new: talktorials! Talktorials are somewhere between a talk and a tutorial, they cover something interesting done with the RDKit and include the code used to do the work. During the presentation you'll give an overview of what you did and also show the pieces of the code that are central to the work. The idea is to mix the science up with the tutorial aspects.

Day 3 will be the first ever RDKit sprint: those who choose to stay will spend an intense day working in small groups to produce useful artifacts: new bits of code, KNIME nodes, KNIME workflows, tutorials, documentation, IPython notebooks, etc. We'll see who's there and what folks are interested in contributing and go from there.

There will also be, of course, social and networking activities!

Registration is free at the following link: http://rdkitugm2.eventbrite.co.uk/

We are also looking for people who are willing to do presentations or talktorials on the first two days. If you're interested in contributing, please send us an email.

We are really looking forward to seeing a bunch of you again, to meet some new people from the ever growing RDKit developer and user community, and to hear some more cool stories about what people do with the RDKit.

Greg and George

What is the R&D Cost of a New Medicine?
12 Aug 2013

Here's a recent (2012), and excellent, analysis and estimate of the development costs of a new medicine (specifically an NME, a chemically distinct, novel molecule). There is a good overview of the historical trends in costs and attrition, and a collection of all significant previous estimates of the R&D costs of a new drug. There's some nice exploration of the sensitivity of the costs to various factors, and differential success and costs across various therapeutic areas.

In case you wanted to jump to the punchline, the cost estimated in this study is $1,506,000,000 (i.e. $1.5bn) at 2011 USD prices.

The report is free, with only registration at the OHE website required to download the report. Great value!
```
%T The R&D Cost of a New Medicine
%A J. Mestre-Ferrandiz
%A J. Sussex
%A A. Towse
%I Office of Health Economics
%D 2012
%O ISBN 978-1-899040-19-3
```
jpo

USAN Watch: August 2013

09 Aug 2013

The USANs for August 2013 have recently been published.

USAN	Research Code	InChIKey (Parent)	Drug Class	Therapeutic class	Target
apatorsen	OGX-427	n/a	therapeutic	oligonucleotide	HSP-27
brincidofovir	CMX-001		therapeutic	synthetic small molecule prodrug	CMV DNA polymerase
censavudine	BMS-986001		therapeutic	natural product derived small molecule prodrug	HIV RT
daratumumab	HuMax-CD38, 3003-005	n/a	therapeutic	monoclonal antibody	CD38
diclofenac		DCOPUUMXTXDBNB-UHFFFAOYSA-N	therapeutic	synthetic small molecule	COX
duvelisib	IPI-145; INK-1197		therapeutic	synthetic small molecule	PI3K-delta, PI3K-gamma
elbasvir			therapeutic	synthetic small molecule	HCV-NS5A
grapiprant	RQ-7, RQ-00000007, MR10A7, AAT-007, CJ-023, 423		therapeutic	synthetic small molecule	EP4
samatasvir	IDX-18719, IDX-719		therapeutic	synthetic small molecule	HCV-NS5A
sotagliflozin	LP-802034, LX-4211		therapeutic	synthetic small molecule	SGLT1, SGLT2
taladegib	LY-2940680	SZBGQDXLNMELTB-UHFFFAOYSA-N	therapeutic	synthetic small molecule	SMO-1
veledimex	INXN-1001		therapeutic	synthetic small molecule	Adenoviral Vector Ad-RTS-IL-12

Open PHACTS KNIME and Pipeline Pilot Components
02 Aug 2013

Open PHACTS has released a collection of Pipeline Pilot and KNIME workflow components which integrate with the Open PHACTS API. Integration with these well-established graphical workflow tools allows the pharmacological and physicochemical data within the Open PHACTS Discovery Platform to be easily accessed and consumed.

Open PHACTS (Open PHArmacological Concepts Triple Store) is a project of the Innovative Medicines Initiative (IMI) and has seen SMEs, academia and the pharmaceutical industry work together to create a freely-available online platform giving access to multiple, integrated sources of publicly available pharmacological data. The project ends in 2014 and the project’s not-for-profit successor organisation, the Open PHACTS Foundation, will continue to support and develop the infrastructure created.

The Open PHACTS Discovery Platform has been designed to answer various critical pharmacology questions, many of which can be addressed using the newly released Pipeline Pilot and KNIME nodes. The portal to the workflow integration collection can be found at dev.openphacts.org/workflow.