ChEMBL blog

GPCR Structure: Human PAR1 Receptor

12 Dec 2012

Another GPCR structure, this time that of the human Protease Activated Receptor-1 (PAR1), also known as the thrombin receptor, complexed with the late-stage clinical candidate antagonist vorapaxar (SCH-530348). This brings the total number of sequence-distinct GPCRs now known structurally to 18. The PDB code is 3vw7. The structure helps explain how the binding of the thrombin-cleaved N-terminus, which is involved in receptor activation, is regulated.

Update: I just looked at the tracking page at Scripps, and there's a lot of exciting near-term stuff by the looks of it: 5HT2B and 5HT1C in refinement (these will be great for exploring the polypharmacology of many centrally acting drugs), and the glucagon receptor, the first class 2 GPCR (Heptares have announced that they have the structure of a class 2 receptor as well, but with no plans to publish). We do live in interesting times!

The link to the paper is here.

%A C. Zhang
%A Y. Srinivasan
%A D.H. Arlow
%A J.J. Fung
%A D. Palmer
%A Y. Zheng
%A H.F. Green
%A A. Pandey
%A R.O. Dror
%A D.E. Shaw
%A W.I. Weis
%A S.R. Coughlin
%A B.K. Kobilka
%T High-resolution crystal structure of human protease-activated receptor 1
%J Nature
%D 2012
%O http://dx.doi.org/10.1038/nature11701

3uon - human muscarinic M2 receptor
4daj - rat muscarinic M3 receptor
3rze - human histamine H1 receptor
2rh1 - human beta-2 adrenergic receptor
2vt4 - turkey beta-1 adrenergic receptor
3pbl - human dopamine D3 receptor
2ydv - human adenosine A2a receptor
3v2w - human sphingosine-1-phosphate receptor
4djh - human kappa opioid receptor
4dkl - mouse mu opioid receptor
4ej4 - mouse delta opioid receptor
4ea3 - human nociceptin receptor
4grv - rat neurotensin receptor
3odu - human CXCR4 receptor
2lnl - human CXCR1 receptor (NMR)
3vw7 - human PAR1 receptor
1u19 - bovine rhodopsin
2z73 - squid rhodopsin
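
As a quick sanity check on the count of 18, the list above can be tallied by species. This is only a minimal sketch: the parsing assumes every line has the form "code - species description", which holds for the list as given here but is not a general PDB format.

```python
from collections import Counter

# The list from the post, copied verbatim.
STRUCTURES = """\
3uon - human muscarinic M2 receptor
4daj - rat muscarinic M3 receptor
3rze - human histamine H1 receptor
2rh1 - human beta-2 adrenergic receptor
2vt4 - turkey beta-1 adrenergic receptor
3pbl - human dopamine D3 receptor
2ydv - human adenosine A2a receptor
3v2w - human sphingosine-1-phosphate receptor
4djh - human kappa opioid receptor
4dkl - mouse mu opioid receptor
4ej4 - mouse delta opioid receptor
4ea3 - human nociceptin receptor
4grv - rat neurotensin receptor
3odu - human CXCR4 receptor
2lnl - human CXCR1 receptor (NMR)
3vw7 - human PAR1 receptor
1u19 - bovine rhodopsin
2z73 - squid rhodopsin
"""


def parse_entry(line):
    """Split 'code - species description' into (code, species, description)."""
    code, description = line.split(" - ", 1)
    species = description.split()[0]  # assumption: first word is the species
    return code, species, description


entries = [parse_entry(line) for line in STRUCTURES.splitlines() if line.strip()]
by_species = Counter(species for _code, species, _description in entries)

print(len(entries), "structures")
for species, count in by_species.most_common():
    print(f"{species:8s} {count}")
```

The tally also makes it easy to see how heavily the set leans towards human receptors.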

                           10        20        30        40        50        60        70    
3uon   (  20 )                                             tfevvfivlvagslSlvTiigNilVmvSIkvnrh
4dajA  (  64 )                                             iwqvvfiafltgflAlvTiigNilVivAFkvnkq
3rze   (  28 )                                                 mplvvvlsticlvTvglNllVlyAvrserk
2rh1   (  29 )                                            devwvvgmgivmslivlaIvfgNvlVitAIakfer
2vt4A  (  40 )                                               weagmsllmalVvllIvagNvlViaAigstqr
3pblA  (  32 )                                                   yalsYcalilaIvfgNglVcmAVlkera
2ydv   (   3 )                                             imgssvYitvElaiavlAilgNvlVcwAvwlnsn
3v2w   (  17 )           sdyvnydIIvrHYnyTgklnisa                ltsvvfiliCcfIileNifvlltiwktkk
4djhA  (  55 )                                            spaipviitavysvvfvvGlvgNslVmfVIirytk
4dkl   (  65 )                                             mvtaitimalYsiVcvvGlfgNflvmyvIvrytk
4ej4   (  41 )                                        rsasslalaiaitalYsavcavGllgNvlvmfgIvrytk
4ea3A  (  47 )                                            plglkvtIvglYlavcvgGllgNclvmyVIlrhtk
4grvA  (  52 )                                    nsdldVnTdiyskvlvtaiYlalfvvGtvgNsvtlftlark s
3oduA  (  27 )            pçfre-------------------------enanfnkiflptiYsiIfltGivgNglvilvMgyqkk
2lnl   (  29 )            pÇmle--------------------------tetLnkYvviiayalvFllsllgNslvMlvilysrv
3vw7   (  91 )                                     dasgYLtsswLtlfVPsvYtgVfvvSlplNimaivvFilkmk
1u19A  (   1 )            mnGtegpnfyVPfsnktgvVrsPFeapQyyLaepwqFsmlAayMflLimlGfpiNflTlyVTvqHkk
2z73A  (   9 )         etwwyNpsIvVhpHWref--------------dqvpdavYyslGifIgiCgiiGcggNgiViyLFtktks
                                                              aaaaaaaaaaaaaaaaaaaaaaaaaaaa   

                      80        90        100       110       120       130       140       150 
3uon   (  54 )    LqtvnnyflfSLAcADliiGvfSMnlytlytvi--gyWplgpvvÇdlWlalDYvVSNAsVmNLliiSfdryfcvt
4dajA  (  98 )    LktvnnyFllSLAcADliIGviSMnlFttyiim--nrWalgnlaÇdlwLSiDYvASNAsVmNLlvISfDryfsit
3rze   (  58 )    LhtvGnlYIvsLSvADliVGavVMpmnilyllm--skwsLgrplÇlfWLSmDYVASTASIfSVfiLCiDryrsvq
2rh1   (  64 )    LqtvtnyFItsLAcADlvMGlaVVpfgaahilm--kmWtfgnfwçefWTSiDVlCVTASIeTLcvIAvdryfAIt
2vt4A  (  72 )    LqtltnlFItsLAcADlvvGllVVpfgatlvvr--gtWlwgsflçelWTSlDVlCVTAsIeTLcvIAiDrylait
3pblA  (  60 )    LqtttnyLVvsLAvADllvAtlVMpwvvylevt-ggvWnfsricÇdvFVTlDVmMcTAsIwNLCaISidRytAVv
2ydv   (  37 )    LqnvtnyFVvsAAaADilVGvlAIpfaiaIst----GfçaaçhgÇLfiACfVLVLTASSIfSLlaIAiDryiair
3v2w   (  76 )    FhrpMYyFIgnLAlSDllaGvaYtaNlllsga---tTykLtPaqWFlREGsMFvALSASVfSLlaIAieryitml
4djhA  (  90 )    mktaTniYIfNLAlADalVTtTMpfqstvylmn---sWpfgdvlÇkiVlsiDyyNMfTSIfTLtmMSvdRyiaVc
4dkl   (  99 )    MktAtniYIfNLAlADalATsTLpfqsvnylmg---tWpfgnilÇkiviSidYyNMFTSIfTLctMSvdRyiAVC
4ej4   (  80 )    LktATniYIfNLAlADalATstLpfqsakylme---tWpfgellÇkaVlSidYyNMFTSIfTLtmMSvDRyiavc
4ea3A  (  82 )    mktatNiYIfNLAlADtlVLlTLpfQGtdillg---fWpfgnalÇktVIaiDyyNMFTSTfTLtaMSvdryvaic
4grvA  (  98 )    lqstvhyHlgsLalSDllILllAMpvElyNFIWvhhpWafgdagÇrgyYflRDactYATAlNVasLSvaRylAic
3oduA  (  69 )    lrsmtdkYRlhLSvADllFVitLpfWavDAva----nWyfgnflÇkaVHviYTVNlYSSVwILAfISlDRylAiV
2lnl   (  70 )    GrsvTdvyLlnLalaDllfaltlpiwaaSkvn----gwifgtfLÇkvVslLkEvnfYsgilLlacIsvdrylaiv
3vw7   ( 133 )    vkkPAVVyMlhLAtADvlFVsvLpfkisYyfsg--SdWqfgselÇrfVtAaFYcnMYASIlLMtvISiDrflAVv
1u19A  (  68 )    LrtplNyILlnLAvADlfMVfg-GFtTTlyTSl-hGyFvfgptGÇnlEGffATLGGEIaLWSLvvLaieRyvvVc
2z73A  (  65 )    LqtpanmFiinLAfSDftFSlvNGfplMtiSCf-lkkWifgfaaÇkvYGfiGGiFGFMsIMTMAMiSiDrynViG
                     aaaaaaaaaaaaaaaaaa aaaaaaaaa          aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 

                           160       170       180       190       200       210       220   
3uon   ( 127 )    kpltypvk---rttkmAgmmiaaAwvlSfilwapaIlfwqfivg-----------vrtVedgeÇyIqff------
4dajA  ( 171 )    rpltyrak---rttkrAgvmiglAwviSfvlWApaIlfwqyfvg-----------krtVppgeÇfIqfl------
3rze   ( 131 )    qplrylky---rtktrAsatilgawflSfl-WvipIlgwnh                 rredkÇeTdfy------
2rh1   ( 137 )    spfkyqSl---ltknkArviilmvwivSgltSflpIqmhwyr-----athqeAinÇyae-etçÇdff--------
2vt4A  ( 145 )    spfryqsl---mtrarAkviictvwaiSalvSflpImmhwWr-----dedpqAlkçyqd-pgçÇdfv--------
3pblA  ( 134 )    mpvhyqhgtgqsscrrValmitavwvlAfaVSc-pLlfgfNtTg---------------dptvÇsIs--------
2ydv   ( 108 )    iplryngl---vtgtrAkgiiaicwvlSfaIGltPmlgwnnÇgqp--kegkahsqgÇgegqvAÇlFedVV-----
3v2w   ( 148 )    k           nnfrlfllisacwviSlilGglPimgwn---------------ÇisalssÇSTVLP-------
4djhA  ( 162 )    hpvkaldf---rtplkAkiinicIwllSssvGisAivlGGtkvred------------vdvieÇslqFpdddysw
4dkl   ( 171 )    hpvkaldf---rtprnAkivnvcNwilSsaiGlpVmfmAttkyrqg--------------sidçtltfsh-ptwy
4ej4   ( 152 )    hpvkaldf---rtpakAklinicIwvlAsgvGvpimvmAvtqprdg--------------avvÇmlqfps-pswy
4ea3A  ( 154 )    hp          tsskAqavnvaIwalAsvvGvpvaimGsAqvede--------------eieÇlveipt-pqdy
4grvA  ( 173 )    hpfkaktl---msrsrtkkfisaIwlaSallAipMlftMGlqnrSadg--------thpgGlVÇTPiv----dta
3oduA  ( 140 )    hatn---sqrprkllAekvVyvgVwipAlllT-ipDfif--Anvsead-----------dryiÇdrfyp---ndl
2lnl   ( 141 )    haTr----tltqkrhlvkfvclgcwglsmnlS-lpFflf--RQayhpN----------NsSPvÇyEVlg-ndtak
3vw7   ( 206 )    ypm        rtlgrAsftClaiwalAiagV-vpLllkeQtiqvpg-----------lgitTçhdvlsetLleg
1u19A  ( 141 )    kpmsn----frfgenhaimgvafTwvmAlaCAapPlvgwSrYIPE-------------GMQCSÇGIDYYTpheet
2z73A  ( 139 )    rpmaas---kkMshrrAfimiifVwlwSvlwAigPifgwGaYtLE-------------GVLCNÇSFdYIsr--ds
                               aaaaaaaaaaaaaaaaaaa  aaa                                      

                      230       240       250       260       270       280       290       300 
3uon   ( 182 )    snaavtfgtAiaaFylpviiMtvlywhisrasksri                   pppsrekkvtrtilaIllaF
4dajA  ( 226 )    septitfgtAiaaFymPvtiMtilywrIyketek                       like   aqTlsaIllaF
3rze   ( 186 )    dvtwfkvmtaiinFylPtllMlwfyakIykaVrqhc                   lhmnrerkaakQLgfIMaaF
2rh1   ( 195 )    TnqayaiasSivSFyvplviMvfvYsrVfqeakrql                   kfclkeHkaLktlgiIMgtF
2vt4A  ( 203 )    TnrayaiasSiiSFyipLliMifvalrvyreakeq                       irehkalktlgiImgvF
3pblA  ( 185 )    -npdFViySSvvSFylPfgvTvlvyarIyvvlkqrrrk-----------------gvplrekkatqMVaiVlgaF
2ydv   ( 173 )    pmnYMVyfNffaCVlvPlllMlgvylrIflaarrqlkqmesq             stlqkevhaakSLaiIvglF
3v2w   ( 197 )    LYhkhYIlfCTtvFtllllsIvilYcriyslvrtr                   asrssenvaLlkTViiVLsvF
4djhA  ( 222 )    wdlfmkicVfifAfviPvliIivcytlMilrlksvrllsg              rekdrnlrritrLVlvVVavF
4dkl   ( 228 )    wenllKicVfifAfimPvliItvcyglmilrlksvr                   ekdrnlrritrMVlvVvavF
4ej4   ( 209 )    wdtvtkicvflfAfvvPiliitvcyglMllrlrsvr                   ekdrslrriTrMVlvVvgaF
4ea3A  ( 211 )    wgpvfaiciflfSFivPvlvIsvcyslMirrlrgvrlls-------------gsrekdrnlrritrLVlvVvavF
4grvA  ( 233 )    tvkvvIqvNtfmSFlfPmlvIsilNtvIAnkLtvmv                     vqalrhGVlvAraVviaf
3oduA  ( 195 )    wvvvfqfqhimvglilPgivIlsCyciIisklshs                     kghqkrkalktTviLilaF
2lnl   ( 198 )    wrmvLrilPHtfGfivplfvmlfcygftlrtlf---------------------kahmgqkhrAmrvIfaVvlif
3vw7   ( 266 )    yyayyfsafSavfFfvpliiStvCyvsIirclsssa                   anrskksrAlfLSaaVfcIF
1u19A  ( 199 )    nNesFViyMfvvHfiiPlivIffcygqLvftvkeaaaq------------qqesattqkaekevTrMviiMviaF
2z73A  ( 196 )    ttrsNIlcMFilGffgPiliiffCyfnIvmsvsnhekemaamakrlnakelrkaqaganaemrlAkIsivIVsqF
                    aaaaaaaaaaa aaaaaaaaaaaaaaaaa                            aaaaaaaaaaaaaaaa

                           310       320       330       340       350       360       370   
3uon   ( 397 )    iitWapYNvmVlintfçap--------ç--ipntvwtiGywlCYinstiNpacYalcnatFkktfkhllm     
4dajA  ( 500 )    iitWtpyNimVlvntfçds--------ç--ipktywnlgywlCYiNStvNPvcYalcnktFrttfkt        
3rze   ( 425 )    ilCWipYFiffmviafçkn--------ç--cnehlhmftiWlGYiNStlNPliYplCnenFkktfkrilhi    
2rh1   ( 283 )    tlcWlpFFiVNivhviqdn----------lirkevyillNwiGYvNSgfNpliYc-rspdfriAfqellcl    
2vt4A  ( 300 )    tlCWlpFFlvnivnvfnrd----------lvpdwlfvafnwlGYAnSAmnpiiYc-rspdfrkAfkrlla     
3pblA  ( 339 )    ivCWlpFFltHvlnthçqt--------ç-hvspelysattwlGYvNsalNPviYttfnieFrkAflkilsc    
2ydv   ( 243 )    alCWlpLHiiNcftffçpd--------çshaplwlMylAivlSHtNSvvNPfiyAyrireFrqTFrkiirshvlr
3v2w   ( 266 )    iacwapLFiLLllDvgçkvk------tç--diLfrAeyfLvlAvlNSgtNPiiytltNkemrrafiri       
4djhA  ( 284 )    vvcWtpIHifilvealgs            aalssyyfcIalGytNSslNPilYafldenFkrcfrdfcfp    
4dkl   ( 290 )    ivcWtpIHiyViikaliti-------pettfqtvswhfcialGYtNSclNpvlYafldenFkrCfrefci     
4ej4   ( 271 )    vvCWapIHifVivwtlvdi------nrrdplvvaalhlcialGYaNSslNpvlYaflDenfkrc           
4ea3A  ( 273 )    vgcWtpVQvfvlaqglgvq-------pssetavailrfctAlGYvNSclNpilYafldenFkacfr         
4grvA  ( 318 )    vvcWlpYHvRRlmFCyisdeq--WttflFdfYHyfYmlTNalAYasSAinpilYnlvsanFrqv           
3oduA  ( 249 )    facWlpyyigisidsfilleiikqgçefentvhkwisitEAlAFfHCclNpilyaflgakfktsaqhalts    
2lnl   ( 252 )    llcwlpynlvlLadTlmrtqviqesçeRrNnIGraLdatEilGflhsclnpiiyafigqnfrhgflkilamhg  
3vw7   ( 323 )    iiCFgpTNvlLiaHYsflsh-----tstteaAYfaYLlcvCvSSiSCciDplIyyyAssec              
1u19A  ( 262 )    liCWlpYAgvAfyIfthqgsd---------fgpifMTipAFfAKtSAvyNPviYimmnkqFrnCmvttlccgknp
2z73A  ( 271 )    llSWspYAvvAllAQfgplew---------VtpyaAQlpVMfAKaSaihNPmiYsvsHpkFreAIsqtfpwvLtc
                  aaaaaaaaaaaaaaaa                aaaaaaaaaaaaa   aaaaaaaa  aaaaaaaa         

                      380       390       400       410 
3uon                                                   
4dajA                                                  
3rze                                                   
2rh1                                                   
2vt4A                                                  
3pblA                                                  
2ydv   ( 310 )    qqepfkaa                             
3v2w                                                   
4djhA                                                  
4dkl                                                   
4ej4                                                   
4ea3A                                                  
4grvA                                                  
3oduA                                                  
2lnl                                                   
3vw7                                                   
1u19A  ( 328 )    lgddeasttVsktetsqvapa                
2z73A  ( 337 )    cqfddketeddkdaeteipage

New Drug Approvals 2012 - Pt. XXVII - Choline C-11
11 Dec 2012

On September 12, the FDA approved Choline C-11, an intravenous radioactive diagnostic agent used as a tracer during Positron Emission Tomography (PET) scans to help detect sites of recurrent Prostate Cancer (OMIM: 176807; MeSH: D011471).

Prostate cancer is the most common cause of death from cancer in men over age 75, and is rarely found in men younger than 40. Unlike many other cancers, prostate cancer usually progresses very slowly. Sometimes the cancer cells may metastasize from the prostate to other parts of the body. Overall, it is estimated to be the sixth leading cause of cancer-related death in men.

Choline is a naturally occurring component of the vitamin B complex, and is necessary for normal cell structure and signalling. Choline C-11 is a radiolabeled synthetic analog of choline that releases a positron by beta decay, which can be visualised by PET. Choline is rapidly taken up by the prostate cells and this allows the prostate to be imaged.

Choline is a precursor molecule essential for the biosynthesis of phospholipids, which are structural components of cell membranes, and for the modulation of trans-membrane signalling. Increased phospholipid synthesis has been associated with increased cell proliferation and with the transformation process that occurs in tumour cells.

Choline C-11 is a positron-emitting radiopharmaceutical that is used for diagnostic purposes in conjunction with PET imaging. The active ingredient is Choline C-11, and each millilitre of the injection contains 148 MBq to 1225 MBq of the active ingredient.
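
To get a feel for these activity figures, here is a small worked example of radioactive decay. It is a sketch under one stated assumption: the physical half-life of carbon-11, taken here as about 20.4 minutes, is not given in this post. The function names are made up for illustration.

```python
# Assumption (not stated in the post): half-life of carbon-11 in minutes.
C11_HALF_LIFE_MIN = 20.4
# 1 mCi is defined as 37 MBq.
MBQ_PER_MCI = 37.0


def remaining_activity(initial_mbq, minutes):
    """Activity left after `minutes` of decay: A(t) = A0 * (1/2)^(t / t_half)."""
    return initial_mbq * 0.5 ** (minutes / C11_HALF_LIFE_MIN)


def mbq_to_mci(mbq):
    """Convert megabecquerels to millicuries."""
    return mbq / MBQ_PER_MCI


for per_ml in (148.0, 1225.0):  # the per-millilitre range quoted above
    after_hour = remaining_activity(per_ml, 60.0)
    print(
        f"{per_ml:7.1f} MBq/mL = {mbq_to_mci(per_ml):5.1f} mCi/mL; "
        f"after 60 min about {after_hour:6.1f} MBq/mL remains"
    )
```

With a half-life this short, roughly seven-eighths of the activity is gone within an hour, which is why such a tracer has to be made close to where it is used.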

IUPAC Name (Choline) : 2-hydroxy-N,N,N-trimethylethanaminium

Canonical Smiles : [Cl-].[11CH3][N+](C)(C)CCO

Standard InChI : 1S/C5H14NO.ClH/c1-6(2,3)4-5-7;/h7H,4-5H2,1-3H3;1H/q+1;/p-1/i1-1;

Following intravenous administration, Choline C-11 distributes mainly to the pancreas, kidney, liver, spleen and colon. Radioactivity accumulates rapidly within the prostate, with peak uptake within 5 minutes of administration. Choline C-11 is metabolised, and 11C-betaine is detected as the major metabolite in blood. The rate of excretion of Choline C-11 in urine was 0.014 mL/min.

Choline C-11 has been developed and marketed by Mayo Clinic.

Full prescribing information is found here.

Browsers and Bugs
11 Dec 2012

We had a support email recently saying that some things on the interface (an export function) didn't work with Chrome - we couldn't reproduce the issue with the equipment we have here at ChEMBL Towers. But there are a lot of OSs and a lot of browsers out there, and we can't recreate every possible environment - interestingly, Chrome is really popular amongst you people (the image above is a Google Analytics report of a week's access to this very blog). I'm a Safari man myself....

So as a reminder, we love hearing about bugs and issues, we really do, so send them to chembl-help@ebi.ac.uk!

ChEMBL Cross Reference Links Now In UniProt
29 Nov 2012

So, some great news for those of you that use UniProt - there are now links to the corresponding target pages in ChEMBL in there.

Here's the link (http://www.uniprot.org/uniprot/?query=database%3AChEMBL&sort=score) to the list of ChEMBL targets that are in UniProt. The links to ChEMBL appear in the Cross References section of each entry.

jpo

Phinterest - A More Sketched Out Idea For An App To Cover Conferences
25 Nov 2012

Returning for a moment to some stuff we've covered in the blog before - the capture and open sharing of timely data to help drug discovery. The basic idea is: is it possible to rapidly capture and share key disclosure data (compound structures, toxicology, efficacy, ADME, etc.) so that you can incorporate accurate, timely data into your own experiments? At the moment, this area is very active commercially, with large corporations providing the needed data to people who can pay, who are not necessarily the best consumers of such data. There has also been some experimentation by professional bodies - C&EN live-blogged some of the key Med Chem talks from a National ACS meeting a few years ago, which hasn't been repeated, despite being well received.

To me this seems a great opportunity for citizen science - attendees at key conferences sharing results openly, in real time, for the benefit of all - introducing knowledge and data 'liquidity' to research.

Let's now suspend reality for a few seconds, and especially ignore the likely tightening of rules on reporting/sharing data if this Citizen Science impacts valuable commercial streams for middlemen/conference organizers, as well as the copyright of the original slide producer. We may return to this in a future post.

So the basic idea is
1. go to conference
2. write down stuff
3. share it with the world
There can't be anything wrong with that surely?

I tweeted a few months ago asking if there was an app that could take photos from, say a poster, then do structure conversion. There was not a lot out there at all (the one thing out there was pretty duff, but then the developers said it was pretty duff), but there was quite some interest in the idea based on replies, retweets, etc. This has led to a little spare time thinking, and the following now seems to be technically possible.

Names for stuff are important to me - so we'll call this 'Phinterest' - named in homage to the online pinboard website Pinterest. This has a really simple paradigm: upload some pictures and provide some tags. Which is where we start.

1) You go to the conference of interest, and either cruise the posters, or attend talks. Almost everybody has a smartphone now, with a camera capable of capturing good pictures. You'd then take pictures of whatever captures your interest and upload them with a single click to phinterest.org. There is often built-in location tagging (so phinterest.org would capture the time and location of the upload - this could support provenance of the uploaded photo, and allow auto-tagging with conference name, etc). There would be the ability to tag the photos if you wanted, but it's not really needed.

2) The pictures could then be bundled automatically into sets from a conference - a stream, and would be visible by all, as they were uploaded. The crappy out of focus ones could be down-voted, and the useful ones would be quality auto-curated in this way, so the community sorts out the interesting stuff. If you really cared about the structure of the selective MEK inhibitor RG-7421, you could read the photo, and get what you need right there and then.

3) The photos could then go through auto-OCR - this is pretty simple to do and set up - for example there's the website http://www.free-ocr.com that takes pictures and does OCR - some simple technical and semantic rules to enhance OCR for things like Vd, IC50, gene names, research codes, etc would be pretty easy, as would pairing an IC50 with a numeric value and a unit. This sort of thing is now pretty standard in biomedical literature. So no big shakes. This would then add some useful tags to the photos, and a simple search functionality would allow useful searches if you knew what you were looking for. Regardless of how accurate this would be, you'd always have the original evidence photo to check against.

4) A special feature would be to perform useful OCR on molecular objects - DNA and protein sequences are pretty simple to extract and OCR, and then tag with parent genes, patents, etc. Secondly, and more technically challenging, is performing OCR on chemical structures. OSRA is great, and there are already some web services that allow upload of images and extraction of structures. On phinterest.org, these could be displayed alongside the parent photo, then confirmed/flagged by the community of experts.
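
The "pairing an IC50 with a numeric value and a unit" rule from step 3 can be sketched with a regular expression run over the OCR output. This is only an illustration: the pattern, the short list of measurement names and units, and the function name are all assumptions, and the sample text uses made-up numbers.

```python
import re

# Hypothetical rule: a measurement name, an optional '=', ':' or 'of',
# a number, and a molar concentration unit.
MEASUREMENT = re.compile(
    r"\b(?P<kind>IC50|EC50|Ki|Kd)\b"   # measurement name
    r"\s*(?:=|:|of)?\s*"               # optional separator
    r"(?P<value>\d+(?:\.\d+)?)\s*"     # integer or decimal value
    r"(?P<unit>[pnuµm]M)\b"            # pM, nM, uM, µM or mM
)


def extract_measurements(text):
    """Return (kind, value, unit) tuples found in a block of OCR'd text."""
    return [
        (m["kind"], float(m["value"]), m["unit"])
        for m in MEASUREMENT.finditer(text)
    ]


# Made-up example text, standing in for the OCR output of a poster photo.
sample = "Compound X (made-up numbers): IC50 = 4.2 nM, Ki: 15 uM"
for kind, value, unit in extract_measurements(sample):
    print(kind, value, unit)
```

A rule this strict misses OCR slips such as "IC5O" (letter O instead of zero), which is one reason to keep the original photo alongside the extracted tags.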

Basic workflow is therefore....

Photograph - Upload - OCR - Tagging - Sharing

Even if you only got to stage 1) it would be useful to the community of drug discoverers. To me the challenges are...
- Dealing with segmenting the images, and selecting the high quality useful ones, but with enough sources of photos of the same thing, it's likely that a couple of useful ones could be found.
- It is the way of the world that idiots would abuse the facility, upload rude or offensive photos, or use it for inappropriate marketing.
- Of course, direct real-time upload of slides would be great - in the real world though this doesn't happen. How many times have you asked the speaker for a copy of the slides, been told, 'yeah, of course buddy', then got nothing? For conference organisers the slides are part of the overall commercial package/benefits in some cases. So it just ain't gonna happen - at least not without a significant nudge - and anyway, steps 2 to 4 would still be required - just you'd have a higher resolution starting point.
- Legally, it's an interesting area. How different is it from me writing something in a notebook and using it in my research, or sharing it with a colleague - just set in an Internet age? Sharing data is exactly what conferences are for. However, there are all sorts of concerns, image copyright, recording permissions, etc. As I said at the start, it's likely that conference organisers and publishers would try and strangle this idea at birth, primarily for the commercial interests of their shareholders and highly paid management.....
As a pharma employee said over dinner the other evening, "let's get precompetitive!"

This would form a great project for an intern, so if you're interested in coming to the lab to work on this, let me know.

A 101 Thankyou's!
24 Nov 2012

This week, our ChEMBL NAR Database paper passed the milestone of a hundred citations (in less than a year too). This made us all very, very happy, and for a few moments, we rested our fingers from our keyboards, and used them instead to grasp a mug of coffee/tea; but only for a few seconds, before we got back to mixing and baking and cooking ChEMBL 15 for you all.

Here's a list of the current citations of Gaulton et al., NAR Database, 40, D1100-D1107, 2012. Remember this is an Open Access paper.

Please keep on keeping us happy by using our work; it's probably the biggest satisfaction we can get :)

ChEMBL At SWAT4LS
23 Nov 2012

As part of the ChEMBL group's involvement in the OpenPhacts project, a representative from the ChEMBL team will be attending SWAT4LS next week. As well as hacking and learning about new Semantic technologies, there may be time to catch up with ChEMBL users also attending the workshop. So if you would like to hear about what we are doing with the Semantic Web or RDF, or just have a general chat about ChEMBL, please get in touch.

Ever had a funny dream? The InChI filesystem.
21 Nov 2012

I had a dream recently - I occasionally have really technical dreams, where unlike in the real world, I'm smart, I have great insights and I solve 'big' problems - then frustratingly the great insight of the dream disappears and I'm left with a half-formed memory and the complete lack of the insight. Of probably the hundred or so dreams like this I have ever had, a few have actually led to some interesting research, which I think has been useful - never as grand as the impact in my dream, but useful. But I also have dreams about discovering new British mushroom species, fishing for electric eels with capacitors and resistors as lures, so to be clear, most of my dreams are simple nonsense.

The dream was set in the future, and I was running a group looking after the largest chemical database in the world (clearly a fantasy!!). The database was huge, about 10²⁰ molecules. We used InChIs for chemical structures, but to get over a lot of the storage problems we used the filesystem structure itself to hold the structures and the relations between them. At the tips of the filesystem were just a set of standard files containing descriptors of the molecule - ./logP, containing a logP value, alongside a bunch of other useful descriptors. In this system we treated the InChI as the complete filename, with the layer separators (/) acting as directory separators, so all the isomers of C3H4F2 were contained beneath that directory on the /InChI=1 root filesystem. So really this is just using the hierarchical structure of the InChI itself in a hierarchical tree form.

In the database, we used links between files to store relationships (say from all tautomers to a standard InChI), with different types of links for isomers, salts, etc. The reason we did this was space: the very size of the database precluded storing the data in a conventional database, and there was never any prospect of holding it in core memory. This InChI filesystem approach was very efficient and scalable (there was something in the dream as well about having to use ZFS, which can currently scale to 16 Exabytes as a single volume. We'd optimized this, though, for the really small block sizes required by the data). The directory/file dates were used to store the date of registration, which was important for patent novelty checking. Querying of the database was based on 'extended' unix filesystem tools, like a pharmacophore-enabled 'find'. The duality of the filename as a location on a disk and a location on the internet, and the ubiquity and beauty of everything in unix being a file, also played their role.
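
The links and dates described here can be mocked up with ordinary symlinks and directory timestamps. What follows is a toy sketch, not the system from the dream: the two entries under C3H4F2 are placeholder names rather than real InChI layers, the relation name "standard" is invented, and `registered_before` is a stand-in for a find-style query. It assumes a filesystem that supports symlinks.

```python
import os
import tempfile
from datetime import datetime, timezone
from pathlib import Path

# Placeholders only: these are NOT real InChI layers.
STANDARD = "InChI=1/C3H4F2/placeholder-standard-form"
TAUTOMER = "InChI=1/C3H4F2/placeholder-tautomer"


def add_entry(root, inchi_path):
    """Create the directory for one structure with a descriptor file at its tip."""
    entry = root / inchi_path
    entry.mkdir(parents=True, exist_ok=True)
    (entry / "logP").write_text("placeholder\n")
    return entry


def add_link(entry, relation, target):
    """Store a typed relationship as a symlink named after the relation."""
    (entry / relation).symlink_to(target, target_is_directory=True)


def stamp(entry, when):
    """Use the directory mtime as the registration date (call this last,
    because adding files or links to a directory resets its mtime)."""
    ts = when.timestamp()
    os.utime(entry, (ts, ts))


def registered_before(root, cutoff):
    """A crude 'find': entries (directories holding logP) older than the cutoff."""
    return [
        Path(dirpath)
        for dirpath, _dirs, files in os.walk(root)
        if "logP" in files and os.path.getmtime(dirpath) < cutoff.timestamp()
    ]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    standard = add_entry(root, STANDARD)
    tautomer = add_entry(root, TAUTOMER)
    add_link(tautomer, "standard", standard)
    stamp(standard, datetime(2012, 11, 21, tzinfo=timezone.utc))
    stamp(tautomer, datetime(2012, 11, 23, tzinfo=timezone.utc))

    link_ok = (tautomer / "standard").resolve() == standard.resolve()
    cutoff = datetime(2012, 11, 22, tzinfo=timezone.utc)
    older_names = [p.name for p in registered_before(root, cutoff)]

print(link_ok, older_names)
```

Different relationship types would simply be different link names, and the date query is the kind of thing a plain `find` could already answer.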

Finally, there was something really really cool for drug discovery that could be done precisely because of this InChI as a filesystem model, and that's the bit that's missing from the recollection of the dream.

What a bummer!

Here's the directory containing the data for our good friend aspirin.

/InChI=1/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)
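
Mapping an InChI onto such a path is little more than splitting on the layer separator. The sketch below builds the aspirin directory above inside a temporary root, drops a logP file at the tip, and then does the equivalent of an ls and a cat. The function name is made up, and the descriptor value is a placeholder rather than a measured logP.

```python
import tempfile
from pathlib import Path


def inchi_to_path(root, inchi):
    """Map an InChI to a directory: each '/'-separated layer is one level."""
    layers = inchi.strip().lstrip("/").split("/")
    if not layers[0].startswith("InChI="):
        raise ValueError(f"not an InChI: {inchi!r}")
    return root.joinpath(*layers)


aspirin = "InChI=1/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)"

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    entry = inchi_to_path(root, aspirin)
    entry.mkdir(parents=True)
    (entry / "logP").write_text("placeholder\n")  # not a real value

    # 'ls' on the formula directory lists every connectivity layer stored for C9H8O4.
    formula_dir = root / "InChI=1" / "C9H8O4"
    listing = sorted(p.name for p in formula_dir.iterdir())

    # 'cat' on a descriptor file at the tip of the tree.
    descriptor = (entry / "logP").read_text().strip()
    depth = len(entry.relative_to(root).parts)

print(listing)
print(descriptor, depth)
```

Because isomers share the formula directory, a plain directory listing already answers "what else do we hold with this formula?".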

This blogpost was based closely on a mail I sent to a friend/collaborator (you know who you are). I've also added a few more sentences to increase readability and accessibility for those outside of the InChI field. A few spurious personal references have been removed from the original text (particular people had roles, and to prevent them appearing as apparent fact in Google, I've removed them; and anyway, the people may not have enjoyed their dream roles ;) ).

I've explained the contents of the dream to a few people now, and the story usually makes people smile (always a good sign), go quiet (an even better sign) and then ask a bunch of questions that try and dismiss it as fantasy (the best sign of all).

Since that time, we've reduced the core idea to practice -
- There's a tarfile of a toy InChI filesystem (thanks Gerard), with which you can do a surprising amount of chemoinformatics using just ls and cat.
- There's some initial work comparing the efficiency and scaling of this filesystem approach to classical prefix and suffix trees (thanks Michał), but these seem to have scaling problems.
- General cheminformatics InChI related stuff (thanks Francis).
If you'd like to know more, get in touch via the comments.