-
A sort of H-index for the coverage of bioactivity databases
So here's a little idea about quantifying the coverage/diversity of the contents of a bioactivity database (like ChEMBL, but also the internal knowledge of a company in its screening and lead optimisation programs, etc). Essentially, it's applying the H-index, regularly used for citation analysis, to bioassay results. There's a lot of criticism of the H-index in its use for comparing researchers, and plenty of problems in cross-field comparison, but that is not for here. However, the H-index is a pretty robust statistic capturing the structure of a frequency-class distribution.
In the context of bioassay data, the H-index (well, let's call it the Ch-index and Ass-index from now on to avoid confusion) can capture the number of bioassay data points for a set of compounds, or the number of compounds screened across a set of bioassays. This is probably best illustrated with a series of pictures of hypothetical bioactivity matrices: a red cell indicates the presence of a measured bioactivity, the columns are assays, and the rows are compounds.
So here is a high-throughput screen - a single assay, with a large number of tested compounds.
Here is a sparse matrix, essentially full of cherry-picked bioassay data points of one compound in one assay. There's very little SAR data within this set (the kind of data that allows the exploration of differences within an assay or compound series), so building predictive models, and a whole bunch of other stuff that one would want to do, is difficult.
Imagine some more experiments (profiling) are done on this set of assays/compounds, and you end up with a matrix such as the following. You can see there are now blocks of data, and stripes (both columns and rows). A row is a compound run across multiple assays, and a column is an assay with multiple compounds tested. Of course, the axes can be ordered to maximise the 'blockiness' of the view of the data.
Here is the matrix after some more assays are run; the Ch- and Ass-index will both increase further. The data becomes more useful, since it is likely that more of the queries one would want to make are already answered by a measurement, and for the missing ones, one would assume that better predictive models could be built.
Finally, complete knowledge: everything becomes a simple lookup, assuming the data is accurate (etc.).
So as one goes through this progression of filling in the matrix (and incurring expense), the Ch- and Ass-indices both get larger. For the above, the possible 'space' has been confined, but of course new compounds are made all the time, and new bioassays are developed all the time, so the total possible space increases.
There are also some interesting features to the above. Imagine collapsing data across assays for orthologues - if you assume that mouse, and dog, and human and zebra activities for a given target are all pretty much the same, you don't really 'value' an extra species added to the matrix. You can however go further, and collapse across protein families (for example Pfam domains), to get an idea of total target class diversity. Similarly, it's possible to index/cluster compounds by shared scaffolds/chemotypes, and one can imagine the exploration of a series of 'lenses' that allow one to view coverage from a set of different perspectives.
So how does this map on to the real world? ChEMBL classic (for want of a better phrase) is like the 4th matrix in the series above, largely a set of stripe- and some block-like structures around a chemical series (since chemists typically explore the chemical space around a lead in optimisation, and screen against related targets). ChEMBL depositions, such as the great GSK PKIS sets, are larger blocks: more comprehensive profiling of a set of compounds across a set of assays. PubChem Bioassay - specifically the output of the NIH MLP - is 'complete' for a relatively small set of assays, but the compounds within the set are diverse.
Finally - Ass-index is probably not the best acronym.
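To make the progression above concrete, here is a minimal Python sketch of how the two indices might be computed from a binary compound-by-assay matrix. It assumes the usual H-index reading (the Ch-index is the largest h such that h compounds each have at least h measured assays, and the Ass-index is the same thing over columns); that reading, the function names and the toy matrices are all illustrative assumptions rather than a definition given in the post.

```python
def h_index(counts):
    """Largest h such that at least h entries of counts are >= h."""
    ranked = sorted(counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def ch_and_ass_index(matrix):
    """Return (Ch-index, Ass-index) for a 0/1 matrix.

    Rows are compounds, columns are assays, 1 marks a measured bioactivity.
    The definitions used here are an assumed reading of the post.
    """
    per_compound = [sum(row) for row in matrix]       # assays measured per compound
    per_assay = [sum(col) for col in zip(*matrix)]    # compounds tested per assay
    return h_index(per_compound), h_index(per_assay)


# Hypothetical toy matrices mirroring the pictures described above.
hts = [[1] for _ in range(6)]                         # one assay, six compounds
sparse = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
blocky = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]
complete = [[1] * 5 for _ in range(3)]                # three compounds, five assays

for name, m in [("hts", hts), ("sparse", sparse), ("blocky", blocky), ("complete", complete)]:
    print(name, ch_and_ass_index(m))
```

Under these assumed definitions the single-assay screen and the cherry-picked matrix both score low, and the indices only rise once rows and columns start to overlap in blocks.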
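The 'lens' idea of collapsing assays across orthologues or protein families can be sketched in the same spirit. The snippet below is only an illustration under stated assumptions: the grouping labels, the function name and the rule that a merged cell counts as filled if any member assay was measured are all hypothetical choices, not something specified in the post.

```python
def collapse_assays(matrix, assay_groups):
    """Merge assay columns that share a group label.

    matrix: list of 0/1 rows (compounds) over assay columns.
    assay_groups: one label per column, e.g. a target or protein family.
    A merged cell is 1 if the compound was measured in any member assay
    (an assumed rule, chosen for simplicity).
    """
    labels = list(dict.fromkeys(assay_groups))        # unique labels, first-seen order
    collapsed = []
    for row in matrix:
        if len(row) != len(assay_groups):
            raise ValueError("each row needs one cell per assay label")
        filled = {label: 0 for label in labels}
        for cell, label in zip(row, assay_groups):
            if cell:
                filled[label] = 1
        collapsed.append([filled[label] for label in labels])
    return labels, collapsed


# Hypothetical example: two orthologue assays for target A, one assay for target B.
groups = ["TargetA", "TargetA", "TargetB"]
matrix = [
    [1, 0, 0],   # compound measured against the first target A orthologue only
    [0, 1, 0],   # compound measured against the second target A orthologue only
    [0, 1, 1],   # compound measured against target A and target B
]
labels, collapsed = collapse_assays(matrix, groups)
print(labels)
print(collapsed)
```

The same function could be applied to the transposed matrix to collapse compounds by scaffold or chemotype before recomputing coverage.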
jpo -
Removal of Metal-Containing Compounds
Further to my post a few months ago (To Remove or Not to Remove) about removing certain problem metal-containing compounds, we have now come up with a plan of what to do.
Instead of labeling this curation as ‘removal of inorganics’, or ‘removal of organometallics’, we simply want this to be known as ‘removal of some metal-containing compounds’.
The criterion that we used was to exclude a large proportion of compounds that contained a metal, apart from cases where a metal was commonly found as part of a pharmaceutical preparation (e.g. Ranitidine Bismuth Citrate CHEMBL2111286, Silver Sulfadiazine CHEMBL1382627, Bacitracin Zinc CHEMBL2096639). The reasoning behind the removal of such compounds was that most of these metals are bonded to the rest of the compound components via coordinate bonds. However, due to InChI limitations, there is no way of creating a Standard InChI that retains coordinate bond information. As we use Standard InChI as the main compound identifier of uniqueness in ChEMBL, it was decided to exclude the structures altogether.
This change will come into effect with the release of ChEMBL_17, and only affects ~3,200 compounds. The compound image on the interface will be replaced with an icon that shows it’s a metal-containing compound (see picture, above). The structures will not be part of the download set on the FTP site, but we will retain the molecular formula in both the downloads and on the ChEMBL interface, so that you can still see the elemental make-up of the compound. We will, of course, retain all of the bioactivity data on these compounds.
Any questions, please feel free to contact chembl-help@ebi.ac.uk -
Differences in Timeline of European and US Approval of Drugs
I had a question the other day - paraphrased it was 'Why do you focus on US approvals for the great ChEMBL-og drug monographs; don't you miss things in Europe?'. Well for an admittedly small subset, here's the reason.
The graph above is the difference (in years) between approval in the US and Europe, for all worldwide approved protein kinase inhibitors (N=30). 28 of these are approved in the US, and 20 in Europe - and all of these 20 are approved in the US (as of 16th August 2013). As you can see, typically drugs are approved around a year later in Europe than in the US, and no examples from this set show the reverse behaviour. The two 'Japan only' compounds are Fasudil and Umirolimus (remember, we include the rolimus class - but they're not the classical small molecule kinase inhibitors).
Caveat - the data is preliminary, and I haven't gone through and checked every data point yet, but things won't change a lot. I'll also go through and add Japan to this analysis when I have a little more free time....
Update - the initial data from Japan is quite different; there are a few cases (so far) where Japan approved an NME prior to the rest of the world. Anyway, more later....
jpo -
It's that time again.. An update on the GPCR structures!
It's been roughly a month since the last update on GPCR structures, and, oh boy, we do live in interesting times!
As I mentioned in my last post, the next major publication would probably be the glucagon receptor. And indeed it was (4L6R, 1). However, the good folks at Heptares had a surprise for us, as they simultaneously released the structure of the corticotropin-releasing factor receptor 1 (4K5Y, 2). Two class B GPCRs in just one month!
Despite the glucagon receptor structure only having a resolution of 3.3Å, it does show some interesting features. The binding pocket of the protein is exceptionally large, which is not surprising when considering its main ligand, glucagon. Also, the N-terminus of the first transmembrane helix is a bit longer than in any solved class A structure. This protein has been recognised as a potential drug target for type 2 diabetes, and this solved structure can hopefully help in this process. The protein was crystallized in complex with the antagonist NNC0640 (sorry, no ChEMBL entry just yet!). However, the ligand could not be reliably identified in the electron density, so sadly the binding mode could not be determined.
The corticotropin-releasing factor receptor is an interesting target, in the sense that it is already a well-established drug target for diseases like diabetes, depression etc. The receptor was crystallized in a complex with the antagonist CP-376395.
GPCR structure research has also entered what I'd like to call phase 2. The first GPCR structures with a disease-causing mutation were actually released a few months ago, but I missed them! 4BEZ and 4BEY both feature a G90D mutation in rhodopsin which causes night blindness (3). It'll be interesting to see when these start being as common as mutated kinases!
Ah, and as I ranted about in my last post, alignment based on pure sequence alone is next to impossible. Luckily, GPCRs share the common fold of having 7 transmembrane helices, so a 3D-based pairwise alignment works out quite well! See the first figure for the MNYFIT 'referenceless' structural alignment. The sequence alignment produced using t_coffee with Joy markup is displayed at the bottom.
1. 4grvA (turquoise) - Rat neurotensin receptor NTS1
2. 4l6rA (light purple) - Human Glucagon receptor
3. 1f88A (yellow) - Bovine Rhodopsin
4. 4k5yA (red) - Human corticotropin-releasing factor receptor
10 20 30 40 50
4grvA ( 52 ) nsdldVnTdiyskvlvtaiYlalfvv
4l6rA ( 123 ) mdgeeievqkevakmyssfqvmytvGYsl
1f88A ( 1 ) mnGtegpnfyVPfsnktgvVrsPfeapQyyLaepwqFsmlAayMflliml
4k5yA ( 115 ) hyhvaaiinylGhci
aaaaaaaaaaaaaaaaa

60 70 80 90 100
4grvA ( 78 ) GtvgNsvtlftlar-k--slqstvhyHlgsLalSDllILllAMpvElyNF
4l6rA ( 152 ) SlgaLllAlaiLggl--sklhctrNaIHanLFaSFvlkAssv-lvidgl-
1f88A ( 51 ) GfpiNflTlyVTvqHk--kLrtpLNyILlnLAvADlfMVfgGFtTTlyT-
4k5yA ( 130 ) SlvaLlvAfvlFlr--arsircLrNiIHanLIaAFilrnatw-fvvqlT-
aaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaa aaaaaa

110 120 130 140 150
4grvA ( 129 ) IwvhhpWafgdagÇrgyYflRDactYATAlNVasLSvaRylAichpfkak
4l6rA ( 198 ) lrt--lsdgavagÇrvaavfmqyGiVaNYcWLlVEglyLhnllglatl--
1f88A ( 98 ) Slh-GyFvfgptGÇnlEGffATLGGEIaLwSLvvLaieRyvvvckpms-n
4k5yA ( 176 ) msp-evhqsnvgwÇrlvtaaynyfhVTNFFWMfGeGcylhtaIvl-----
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

160 170 180 190 200
4grvA ( 179 ) tlmsrsrtkkfisaIwlaSallAi-pMlftMGlqnrSad-gthpgGlVÇT
4l6rA ( 259 ) p--ersffslylgigwgaPmlfVvpwavvkclf------en-v----qçw
1f88A ( 146 ) frfgenhAimgvafTwvmAlaCAa-pPlvgwSrYIPEGM------QCSÇG
4k5yA ( 220 ) t--drlrawmficiGwgvPfpiivaWaigKlyy------dn-e----kÇw
aaaaaaa aaaaaa aaaa

210 220 230 240 250
4grvA ( 227 ) ----PivdtatvkvvIqvNtfmSFlfPmlvIsilNtvIAnkLtvmv----
4l6rA ( 296 ) t-------s-ndnmgfwwilrfPvflailiNffifvrIvqllvaklra--
1f88A ( 189 ) IDYYTpheetnNesFViyMfvvHfiiPlivIffcygqLvftvkeaA--aS
4k5yA ( 260 ) aG------krpgvyTdyiyqgp-MalvlliNfiflfnIvrilmtklra--
aaaaaaaaa aaaaaaaaaaaaaaaaaa

260 270 280 290 300
4grvA ( 300 ) ---v------qalrhGVlvAraVviafvvcWlpYHvRRlmFCyisdeqWt
4l6rA ( 336 ) ----rqmhhtdykfrlAksTltLIplLGvhevvfafvt-d-ehaq-----
1f88A ( 241 ) attq------kaekevTrMViiMviaFliCWlpYAgvAfyIfthq--g--
4k5yA ( 301 ) ----sttseTiqArkavkaTlvLlplLgitymlafvnevs----------
aaaaaaaaa aaaaaaaaaaaaaa aa

310 320 330 340 350
4grvA ( 341 ) tflFdfYHyfYmlTNalAYasSAinpilYnlvsanFrqv
4l6rA ( 375 ) ---gtlrsaklffdlflsSfqGllVAvlYCflnkeVqselrrrwhrwrlg
1f88A ( 281 ) ---sdfgPifMtipAFfAKtSAvyNPviYimmnkqFrnCmvttlccgknp
4k5yA ( 341 ) ------rvvfiyfnAfLeSfqGffVSvfAcflns
aaaaaaaaaaaaa aaaaaaaaaaaa

360
4grvA
4l6rA ( 422 ) kvlweern
1f88A ( 334 ) sttvsktetsqvapa
4k5yA
Nature. 2013 Jul 25;499(7459):444-9
(1) Siu, F.Y. et al., Nature 16 July 2013: 499, 444–9
(2) Hollenstein, K. et al., Nature 16 July 2013: Advance Online Publication, 1476-4687
(3) Singhal, A et al., EMBO Rep, June 2013, 14(6):520-6
david -
2nd RDKit UGM - A reminder
For those who forgot to register, this is a gentle reminder for the 2nd RDKit User Group Meeting. The meeting will take place October 2nd-4th here at the Genome Campus in Hinxton, UK. We're using a different format for the meeting this year:
Days 1 and 2: Talks, lightning talks, roundtable(s), discussion, and something new: talktorials! Talktorials are somewhere between a talk and a tutorial, they cover something interesting done with the RDKit and include the code used to do the work. During the presentation you'll give an overview of what you did and also show the pieces of the code that are central to the work. The idea is to mix the science up with the tutorial aspects.
Day 3 will be the first ever RDKit sprint: those who choose to stay will spend an intense day working in small groups to produce useful artifacts: new bits of code, KNIME nodes, KNIME workflows, tutorials, documentation, IPython notebooks, etc. We'll see who's there and what folks are interested in contributing and go from there.
There will also be, of course, social and networking activities!
Registration is free at the following link: http://rdkitugm2.eventbrite.co.uk/
We are also looking for people who are willing to do presentations or talktorials on the first two days. If you're interested in contributing, please send us an email.
We are really looking forward to seeing a bunch of you again, to meet some new people from the ever-growing RDKit developer and user community, and to hear some more cool stories about what people do with the RDKit.
Greg and George -
What is the R&D Cost of a New Medicine?
Here's a recent (2012), and excellent, analysis and estimate of the development costs of a new medicine (specifically an NME, a chemically distinct, novel molecule). There is a good overview of the historical trends in costs and attrition, and a collection of all significant previous estimates of the R&D costs of a new drug. There's some nice exploration of the sensitivity of the costs to various factors, and differential success and costs across various therapeutic areas.
In case you wanted to jump to the punchline, the cost estimated in this study is $1,506,000,000 (i.e. $1.5bn) at 2011 USD prices.
The report is free; only registration at the OHE website is required to download it. Great value!
%T The R&D Cost of a New Medicine
%A J. Mestre-Ferrandiz
%A J. Sussex
%A A. Towse
%I Office of Health Economics
%D 2012
%O ISBN 978-1-899040-19-3
jpo
-
USAN Watch: August 2013
The USANs for August 2013 have recently been published.
USAN | Research Code | InChIKey (Parent) | Drug Class | Therapeutic class | Target
apatorsen | OGX-427 | n/a | therapeutic | oligonucleotide | HSP-27
brincidofovir | CMX-001 | | therapeutic | synthetic small molecule prodrug | CMV DNA polymerase
censavudine | BMS-986001 | | therapeutic | natural product derived small molecule prodrug | HIV RT
daratumumab | HuMax-CD38, 3003-005 | n/a | therapeutic | monoclonal antibody | CD38
diclofenac | | DCOPUUMXTXDBNB-UHFFFAOYSA-N | therapeutic | synthetic small molecule | COX
duvelisib | IPI-145; INK-1197 | | therapeutic | synthetic small molecule | PI3K-delta, PI3K-gamma
elbasvir | | | therapeutic | synthetic small molecule | HCV-NS5A
grapiprant | RQ-7, RQ-00000007, MR10A7, AAT-007, CJ-023, 423 | | therapeutic | synthetic small molecule | EP4
samatasvir | IDX-18719, IDX-719 | | therapeutic | synthetic small molecule | HCV-NS5A
sotagliflozin | LP-802034, LX-4211 | | therapeutic | synthetic small molecule | SGLT1, SGLT2
taladegib | LY-2940680 | SZBGQDXLNMELTB-UHFFFAOYSA-N | therapeutic | synthetic small molecule | SMO-1
veledimex | INXN-1001 | | therapeutic | synthetic small molecule | Adenoviral Vector Ad-RTS-IL-12
-
Open PHACTS KNIME and Pipeline Pilot Components
Open PHACTS has released a collection of Pipeline Pilot and KNIME workflow components which integrate with the Open PHACTS API. Integration with these well-established graphical workflow tools allows the pharmacological and physicochemical data within the Open PHACTS Discovery Platform to be easily accessed and consumed.
Open PHACTS (Open PHArmacological Concepts Triple Store) is a project of the Innovative Medicines Initiative (IMI) and has seen SMEs, academia and the pharmaceutical industry work together to create a freely available online platform giving access to multiple, integrated sources of publicly available pharmacological data. The project ends in 2014 and the project’s not-for-profit successor organisation, the Open PHACTS Foundation, will continue to support and develop the infrastructure created.
The Open PHACTS Discovery Platform has been designed to answer various critical pharmacology questions, many of which can be addressed using the newly released Pipeline Pilot and KNIME nodes. The portal to the workflow integration collection can be found at dev.openphacts.org/workflow.