• GPCR Structure: Human PAR1 Receptor


    Another GPCR structure has been published, this time that of human protease-activated receptor 1 (PAR1), also known as the thrombin receptor, in complex with the late-stage clinical candidate antagonist vorapaxar (SCH-530348). This brings the total number of sequence-distinct GPCRs with known structures to 18. The PDB code is 3vw7. The structure helps explain how binding of the thrombin-cleaved N-terminus, which is involved in receptor activation, is regulated.

    Update: I just looked at the tracking page at Scripps, and by the looks of it there's a lot of exciting near-term stuff: 5HT2B and 5HT1C in refinement (these will be great for exploring the polypharmacology of many centrally acting drugs), and the glucagon receptor, the first class 2 GPCR (Heptares have announced that they have the structure of a type 2 as well, but have no plans to publish). We do live in interesting times!

    The paper is referenced below.


    %A C. Zhang
    %A Y. Srinivasan
    %A D.H. Arlow
    %A J.J. Fung
    %A D. Palmer
    %A Y. Zheng
    %A H.F. Green
    %A A. Pandey
    %A R.O. Dror
    %A D.E. Shaw
    %A W.I. Weis
    %A S.R. Coughlin
    %A B.K. Kobilka
    %T High-resolution crystal structure of human protease-activated receptor 1
    %J Nature
    %D 2012
    %O http://dx.doi.org/10.1038/nature11701
    
    1. 3uon - human muscarinic M2 receptor 
    2. 4daj - rat muscarinic M3 receptor 
    3. 3rze - human histamine H1 receptor
    4. 2rh1 - human beta-2 adrenergic receptor 
    5. 2vt4 - turkey beta-1 adrenergic receptor 
    6. 3pbl - human dopamine D3 receptor
    7. 2ydv - human adenosine A2a receptor 
    8. 3v2w - human sphingosine-1-phosphate receptor 
    9. 4djh - human kappa opioid receptor 
    10. 4dkl - mouse mu opioid receptor 
    11. 4ej4 - mouse delta opioid receptor
    12. 4ea3 - human nociceptin receptor
    13. 4grv - rat neurotensin receptor
    14. 3odu - human CXCR4 receptor 
    15. 2lnl - human CXCR1 receptor (NMR)
    16. 3vw7 - human PAR1 receptor
    17. 1u19 - bovine rhodopsin 
    18. 2z73 - squid rhodopsin
                               10        20        30        40        50        60        70    
    3uon   (  20 )                                             tfevvfivlvagslSlvTiigNilVmvSIkvnrh
    4dajA  (  64 )                                             iwqvvfiafltgflAlvTiigNilVivAFkvnkq
    3rze   (  28 )                                                 mplvvvlsticlvTvglNllVlyAvrserk
    2rh1   (  29 )                                            devwvvgmgivmslivlaIvfgNvlVitAIakfer
    2vt4A  (  40 )                                               weagmsllmalVvllIvagNvlViaAigstqr
    3pblA  (  32 )                                                   yalsYcalilaIvfgNglVcmAVlkera
    2ydv   (   3 )                                             imgssvYitvElaiavlAilgNvlVcwAvwlnsn
    3v2w   (  17 )           sdyvnydIIvrHYnyTgklnisa                ltsvvfiliCcfIileNifvlltiwktkk
    4djhA  (  55 )                                            spaipviitavysvvfvvGlvgNslVmfVIirytk
    4dkl   (  65 )                                             mvtaitimalYsiVcvvGlfgNflvmyvIvrytk
    4ej4   (  41 )                                        rsasslalaiaitalYsavcavGllgNvlvmfgIvrytk
    4ea3A  (  47 )                                            plglkvtIvglYlavcvgGllgNclvmyVIlrhtk
    4grvA  (  52 )                                    nsdldVnTdiyskvlvtaiYlalfvvGtvgNsvtlftlark s
    3oduA  (  27 )            pçfre-------------------------enanfnkiflptiYsiIfltGivgNglvilvMgyqkk
    2lnl   (  29 )            pÇmle--------------------------tetLnkYvviiayalvFllsllgNslvMlvilysrv
    3vw7   (  91 )                                     dasgYLtsswLtlfVPsvYtgVfvvSlplNimaivvFilkmk
    1u19A  (   1 )            mnGtegpnfyVPfsnktgvVrsPFeapQyyLaepwqFsmlAayMflLimlGfpiNflTlyVTvqHkk
    2z73A  (   9 )         etwwyNpsIvVhpHWref--------------dqvpdavYyslGifIgiCgiiGcggNgiViyLFtktks
                                                                  aaaaaaaaaaaaaaaaaaaaaaaaaaaa   
    
                          80        90        100       110       120       130       140       150 
    3uon   (  54 )    LqtvnnyflfSLAcADliiGvfSMnlytlytvi--gyWplgpvvÇdlWlalDYvVSNAsVmNLliiSfdryfcvt
    4dajA  (  98 )    LktvnnyFllSLAcADliIGviSMnlFttyiim--nrWalgnlaÇdlwLSiDYvASNAsVmNLlvISfDryfsit
    3rze   (  58 )    LhtvGnlYIvsLSvADliVGavVMpmnilyllm--skwsLgrplÇlfWLSmDYVASTASIfSVfiLCiDryrsvq
    2rh1   (  64 )    LqtvtnyFItsLAcADlvMGlaVVpfgaahilm--kmWtfgnfwçefWTSiDVlCVTASIeTLcvIAvdryfAIt
    2vt4A  (  72 )    LqtltnlFItsLAcADlvvGllVVpfgatlvvr--gtWlwgsflçelWTSlDVlCVTAsIeTLcvIAiDrylait
    3pblA  (  60 )    LqtttnyLVvsLAvADllvAtlVMpwvvylevt-ggvWnfsricÇdvFVTlDVmMcTAsIwNLCaISidRytAVv
    2ydv   (  37 )    LqnvtnyFVvsAAaADilVGvlAIpfaiaIst----GfçaaçhgÇLfiACfVLVLTASSIfSLlaIAiDryiair
    3v2w   (  76 )    FhrpMYyFIgnLAlSDllaGvaYtaNlllsga---tTykLtPaqWFlREGsMFvALSASVfSLlaIAieryitml
    4djhA  (  90 )    mktaTniYIfNLAlADalVTtTMpfqstvylmn---sWpfgdvlÇkiVlsiDyyNMfTSIfTLtmMSvdRyiaVc
    4dkl   (  99 )    MktAtniYIfNLAlADalATsTLpfqsvnylmg---tWpfgnilÇkiviSidYyNMFTSIfTLctMSvdRyiAVC
    4ej4   (  80 )    LktATniYIfNLAlADalATstLpfqsakylme---tWpfgellÇkaVlSidYyNMFTSIfTLtmMSvDRyiavc
    4ea3A  (  82 )    mktatNiYIfNLAlADtlVLlTLpfQGtdillg---fWpfgnalÇktVIaiDyyNMFTSTfTLtaMSvdryvaic
    4grvA  (  98 )    lqstvhyHlgsLalSDllILllAMpvElyNFIWvhhpWafgdagÇrgyYflRDactYATAlNVasLSvaRylAic
    3oduA  (  69 )    lrsmtdkYRlhLSvADllFVitLpfWavDAva----nWyfgnflÇkaVHviYTVNlYSSVwILAfISlDRylAiV
    2lnl   (  70 )    GrsvTdvyLlnLalaDllfaltlpiwaaSkvn----gwifgtfLÇkvVslLkEvnfYsgilLlacIsvdrylaiv
    3vw7   ( 133 )    vkkPAVVyMlhLAtADvlFVsvLpfkisYyfsg--SdWqfgselÇrfVtAaFYcnMYASIlLMtvISiDrflAVv
    1u19A  (  68 )    LrtplNyILlnLAvADlfMVfg-GFtTTlyTSl-hGyFvfgptGÇnlEGffATLGGEIaLWSLvvLaieRyvvVc
    2z73A  (  65 )    LqtpanmFiinLAfSDftFSlvNGfplMtiSCf-lkkWifgfaaÇkvYGfiGGiFGFMsIMTMAMiSiDrynViG
                         aaaaaaaaaaaaaaaaaa aaaaaaaaa          aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 
    
                               160       170       180       190       200       210       220   
    3uon   ( 127 )    kpltypvk---rttkmAgmmiaaAwvlSfilwapaIlfwqfivg-----------vrtVedgeÇyIqff------
    4dajA  ( 171 )    rpltyrak---rttkrAgvmiglAwviSfvlWApaIlfwqyfvg-----------krtVppgeÇfIqfl------
    3rze   ( 131 )    qplrylky---rtktrAsatilgawflSfl-WvipIlgwnh                 rredkÇeTdfy------
    2rh1   ( 137 )    spfkyqSl---ltknkArviilmvwivSgltSflpIqmhwyr-----athqeAinÇyae-etçÇdff--------
    2vt4A  ( 145 )    spfryqsl---mtrarAkviictvwaiSalvSflpImmhwWr-----dedpqAlkçyqd-pgçÇdfv--------
    3pblA  ( 134 )    mpvhyqhgtgqsscrrValmitavwvlAfaVSc-pLlfgfNtTg---------------dptvÇsIs--------
    2ydv   ( 108 )    iplryngl---vtgtrAkgiiaicwvlSfaIGltPmlgwnnÇgqp--kegkahsqgÇgegqvAÇlFedVV-----
    3v2w   ( 148 )    k           nnfrlfllisacwviSlilGglPimgwn---------------ÇisalssÇSTVLP-------
    4djhA  ( 162 )    hpvkaldf---rtplkAkiinicIwllSssvGisAivlGGtkvred------------vdvieÇslqFpdddysw
    4dkl   ( 171 )    hpvkaldf---rtprnAkivnvcNwilSsaiGlpVmfmAttkyrqg--------------sidçtltfsh-ptwy
    4ej4   ( 152 )    hpvkaldf---rtpakAklinicIwvlAsgvGvpimvmAvtqprdg--------------avvÇmlqfps-pswy
    4ea3A  ( 154 )    hp          tsskAqavnvaIwalAsvvGvpvaimGsAqvede--------------eieÇlveipt-pqdy
    4grvA  ( 173 )    hpfkaktl---msrsrtkkfisaIwlaSallAipMlftMGlqnrSadg--------thpgGlVÇTPiv----dta
    3oduA  ( 140 )    hatn---sqrprkllAekvVyvgVwipAlllT-ipDfif--Anvsead-----------dryiÇdrfyp---ndl
    2lnl   ( 141 )    haTr----tltqkrhlvkfvclgcwglsmnlS-lpFflf--RQayhpN----------NsSPvÇyEVlg-ndtak
    3vw7   ( 206 )    ypm        rtlgrAsftClaiwalAiagV-vpLllkeQtiqvpg-----------lgitTçhdvlsetLleg
    1u19A  ( 141 )    kpmsn----frfgenhaimgvafTwvmAlaCAapPlvgwSrYIPE-------------GMQCSÇGIDYYTpheet
    2z73A  ( 139 )    rpmaas---kkMshrrAfimiifVwlwSvlwAigPifgwGaYtLE-------------GVLCNÇSFdYIsr--ds
                                   aaaaaaaaaaaaaaaaaaa  aaa                                      
    
                          230       240       250       260       270       280       290       300 
    3uon   ( 182 )    snaavtfgtAiaaFylpviiMtvlywhisrasksri                   pppsrekkvtrtilaIllaF
    4dajA  ( 226 )    septitfgtAiaaFymPvtiMtilywrIyketek                       like   aqTlsaIllaF
    3rze   ( 186 )    dvtwfkvmtaiinFylPtllMlwfyakIykaVrqhc                   lhmnrerkaakQLgfIMaaF
    2rh1   ( 195 )    TnqayaiasSivSFyvplviMvfvYsrVfqeakrql                   kfclkeHkaLktlgiIMgtF
    2vt4A  ( 203 )    TnrayaiasSiiSFyipLliMifvalrvyreakeq                       irehkalktlgiImgvF
    3pblA  ( 185 )    -npdFViySSvvSFylPfgvTvlvyarIyvvlkqrrrk-----------------gvplrekkatqMVaiVlgaF
    2ydv   ( 173 )    pmnYMVyfNffaCVlvPlllMlgvylrIflaarrqlkqmesq             stlqkevhaakSLaiIvglF
    3v2w   ( 197 )    LYhkhYIlfCTtvFtllllsIvilYcriyslvrtr                   asrssenvaLlkTViiVLsvF
    4djhA  ( 222 )    wdlfmkicVfifAfviPvliIivcytlMilrlksvrllsg              rekdrnlrritrLVlvVVavF
    4dkl   ( 228 )    wenllKicVfifAfimPvliItvcyglmilrlksvr                   ekdrnlrritrMVlvVvavF
    4ej4   ( 209 )    wdtvtkicvflfAfvvPiliitvcyglMllrlrsvr                   ekdrslrriTrMVlvVvgaF
    4ea3A  ( 211 )    wgpvfaiciflfSFivPvlvIsvcyslMirrlrgvrlls-------------gsrekdrnlrritrLVlvVvavF
    4grvA  ( 233 )    tvkvvIqvNtfmSFlfPmlvIsilNtvIAnkLtvmv                     vqalrhGVlvAraVviaf
    3oduA  ( 195 )    wvvvfqfqhimvglilPgivIlsCyciIisklshs                     kghqkrkalktTviLilaF
    2lnl   ( 198 )    wrmvLrilPHtfGfivplfvmlfcygftlrtlf---------------------kahmgqkhrAmrvIfaVvlif
    3vw7   ( 266 )    yyayyfsafSavfFfvpliiStvCyvsIirclsssa                   anrskksrAlfLSaaVfcIF
    1u19A  ( 199 )    nNesFViyMfvvHfiiPlivIffcygqLvftvkeaaaq------------qqesattqkaekevTrMviiMviaF
    2z73A  ( 196 )    ttrsNIlcMFilGffgPiliiffCyfnIvmsvsnhekemaamakrlnakelrkaqaganaemrlAkIsivIVsqF
                        aaaaaaaaaaa aaaaaaaaaaaaaaaaa                            aaaaaaaaaaaaaaaa
    
                               310       320       330       340       350       360       370   
    3uon   ( 397 )    iitWapYNvmVlintfçap--------ç--ipntvwtiGywlCYinstiNpacYalcnatFkktfkhllm     
    4dajA  ( 500 )    iitWtpyNimVlvntfçds--------ç--ipktywnlgywlCYiNStvNPvcYalcnktFrttfkt        
    3rze   ( 425 )    ilCWipYFiffmviafçkn--------ç--cnehlhmftiWlGYiNStlNPliYplCnenFkktfkrilhi    
    2rh1   ( 283 )    tlcWlpFFiVNivhviqdn----------lirkevyillNwiGYvNSgfNpliYc-rspdfriAfqellcl    
    2vt4A  ( 300 )    tlCWlpFFlvnivnvfnrd----------lvpdwlfvafnwlGYAnSAmnpiiYc-rspdfrkAfkrlla     
    3pblA  ( 339 )    ivCWlpFFltHvlnthçqt--------ç-hvspelysattwlGYvNsalNPviYttfnieFrkAflkilsc    
    2ydv   ( 243 )    alCWlpLHiiNcftffçpd--------çshaplwlMylAivlSHtNSvvNPfiyAyrireFrqTFrkiirshvlr
    3v2w   ( 266 )    iacwapLFiLLllDvgçkvk------tç--diLfrAeyfLvlAvlNSgtNPiiytltNkemrrafiri       
    4djhA  ( 284 )    vvcWtpIHifilvealgs            aalssyyfcIalGytNSslNPilYafldenFkrcfrdfcfp    
    4dkl   ( 290 )    ivcWtpIHiyViikaliti-------pettfqtvswhfcialGYtNSclNpvlYafldenFkrCfrefci     
    4ej4   ( 271 )    vvCWapIHifVivwtlvdi------nrrdplvvaalhlcialGYaNSslNpvlYaflDenfkrc           
    4ea3A  ( 273 )    vgcWtpVQvfvlaqglgvq-------pssetavailrfctAlGYvNSclNpilYafldenFkacfr         
    4grvA  ( 318 )    vvcWlpYHvRRlmFCyisdeq--WttflFdfYHyfYmlTNalAYasSAinpilYnlvsanFrqv           
    3oduA  ( 249 )    facWlpyyigisidsfilleiikqgçefentvhkwisitEAlAFfHCclNpilyaflgakfktsaqhalts    
    2lnl   ( 252 )    llcwlpynlvlLadTlmrtqviqeeRrNnIGraLdatEilGflhsclnpiiyafigqnfrhgflkilamhg  
    3vw7   ( 323 )    iiCFgpTNvlLiaHYsflsh-----tstteaAYfaYLlcvCvSSiSCciDplIyyyAssec              
    1u19A  ( 262 )    liCWlpYAgvAfyIfthqgsd---------fgpifMTipAFfAKtSAvyNPviYimmnkqFrnCmvttlccgknp
    2z73A  ( 271 )    llSWspYAvvAllAQfgplew---------VtpyaAQlpVMfAKaSaihNPmiYsvsHpkFreAIsqtfpwvLtc
                      aaaaaaaaaaaaaaaa                aaaaaaaaaaaaa   aaaaaaaa  aaaaaaaa         
    
                          380       390       400       410 
    3uon                                                   
    4dajA                                                  
    3rze                                                   
    2rh1                                                   
    2vt4A                                                  
    3pblA                                                  
    2ydv   ( 310 )    qqepfkaa                             
    3v2w                                                   
    4djhA                                                  
    4dkl                                                   
    4ej4                                                   
    4ea3A                                                  
    4grvA                                                  
    3oduA                                                  
    2lnl                                                   
    3vw7                                                   
    1u19A  ( 328 )    lgddeasttVsktetsqvapa                
    2z73A  ( 337 )    cqfddketeddkdaeteipage               
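
    Each row of the alignment above carries a structure identifier, the residue number at which that block starts (in parentheses), and the aligned residues. Here is a minimal Python sketch for pulling those fields out of a single row. The function name parse_row is hypothetical, and treating every letter (including the ç marks) as one residue and everything else as a gap is an assumption about the notation, not something stated in the alignment itself.

```python
import re

# One alignment row looks like: "2ydv   ( 310 )    qqepfkaa"
ROW = re.compile(r"^\s*(\w+)\s+\(\s*(\d+)\s*\)(.*)$")


def parse_row(line):
    """Return (structure id, first residue number, ungapped residues), or None.

    Rows without a "( n )" field carry no residues in that block.
    """
    match = ROW.match(line)
    if match is None:
        return None
    name, start, body = match.groups()
    # Assumption: every letter is a residue; spaces and '-' are gaps.
    residues = "".join(ch for ch in body if ch.isalpha())
    return name, int(start), residues


print(parse_row("    1u19A  ( 328 )    lgddeasttVsktetsqvapa"))
print(parse_row("    3uon"))
```

    Because residue numbering can jump within a structure, the start number should be read per block rather than extrapolated from the previous one.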
                                                           
    
    

  • New Drug Approvals 2012 - Pt. XXVII - Choline C-11



    On September 12, the FDA approved Choline C-11, an intravenous radioactive diagnostic agent used as a tracer during positron emission tomography (PET) scans to help detect sites of recurrent prostate cancer (OMIM: 176807; MeSH: D011471).

    Prostate cancer is the most common cause of death from cancer in men over age 75, and is rarely found in men younger than 40. Unlike many other cancers, prostate cancer usually progresses very slowly. Sometimes the cancer cells may metastasize from the prostate to other parts of the body. Overall, it is estimated to be the sixth leading cause of cancer-related death in men.

    Choline is a naturally occurring component of the vitamin B complex, and is necessary for normal cell structure and signalling. Choline C-11 is a radiolabeled synthetic analog of choline that releases a positron by beta decay, which can be visualised by PET. Choline is rapidly taken up by prostate cells, and this allows the prostate to be imaged.


    Choline is a precursor molecule essential for the biosynthesis of phospholipids, which are the structural components of cell membranes, as well as for the modulation of trans-membrane signalling. Increased phospholipid synthesis has been associated with increased cell proliferation and with the transformation process that occurs in tumour cells.
    Choline C-11 is a positron-emitting radiopharmaceutical that is used for diagnostic purposes in conjunction with PET imaging. The active ingredient is Choline C-11, and each millilitre of the injection contains 148 MBq to 1225 MBq of the active ingredient.

    IUPAC Name (Choline) : 2-hydroxy-N,N,N-trimethylethanaminium
    Canonical Smiles : [Cl-].[11CH3][N+](C)(C)CCO
    Standard InChI : 1S/C5H14NO.ClH/c1-6(2,3)4-5-7;/h7H,4-5H2,1-3H3;1H/q+1;/p-1/i1-1;
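
    The InChI string above is built from slash-separated layers: a version, a formula, and then layers identified by their first letter (c for connectivity, h for hydrogens, q for charge, p for protons, i for isotopes). The "i1-1" entry is what marks atom 1 as one mass unit lighter than ordinary carbon, i.e. carbon-11. A rough Python sketch that splits the string into those layers follows. The function name split_inchi_layers is made up for illustration, and the splitting is purely textual rather than a chemistry-aware parse.

```python
def split_inchi_layers(inchi):
    """Split an InChI string into its version, formula and prefixed layers."""
    prefix = "InChI="
    text = inchi[len(prefix):] if inchi.startswith(prefix) else inchi
    version, formula, *rest = text.split("/")
    layers = {"version": version, "formula": formula}
    for layer in rest:
        # The first character names the layer (c, h, q, p, i, ...).
        # A repeated prefix would overwrite the earlier entry in this sketch.
        layers[layer[0]] = layer[1:]
    return layers


choline_c11 = "1S/C5H14NO.ClH/c1-6(2,3)4-5-7;/h7H,4-5H2,1-3H3;1H/q+1;/p-1/i1-1;"
for name, value in split_inchi_layers(choline_c11).items():
    print(name, value)
```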

    Following intravenous administration, Choline C-11 distributes mainly to the pancreas, kidney, liver, spleen and colon. Radioactivity accumulated rapidly within the prostate, and peak uptake appeared within 5 minutes of administration. Choline C-11 undergoes metabolism, with 11C-betaine detected as the major metabolite in blood. The rate of excretion of Choline C-11 in urine was 0.014 mL/min.

    Choline C-11 has been developed and marketed by Mayo Clinic.

    Full prescribing information is found here.

  • Browsers and Bugs


    We had a support email recently saying that some things on the interface (an export function) didn't work with Chrome. We couldn't reproduce the issue with the equipment we have here at ChEMBL Towers, but there are a lot of operating systems and a lot of browsers out there, and we can't recreate every possible environment. Interestingly, Chrome is really popular amongst you people (the image above is a Google Analytics report of a week's access to this very blog). I'm a Safari man myself....

    So as a reminder, we love hearing about bugs and issues, we really do, so send them to chembl-help@ebi.ac.uk!

  • ChEMBL Cross Reference Links Now In UniProt


    So, some great news for those of you that use UniProt - there are now links to the corresponding target pages in ChEMBL in there.

    Here's the link (http://www.uniprot.org/uniprot/?query=database%3AChEMBL&sort=score) to the list of ChEMBL targets that are in UniProt. The links to ChEMBL are in the Cross References section of each entry.
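
    The link above is an ordinary query URL, with the colon in "database:ChEMBL" percent-encoded as %3A. A small Python sketch using only the standard library shows how the query decodes and how it could be rebuilt. It makes no network request, so whether the UniProt site still serves this URL layout is not something it checks.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

url = "http://www.uniprot.org/uniprot/?query=database%3AChEMBL&sort=score"

# Decode the query string into a dict of parameter name -> list of values.
params = parse_qs(urlsplit(url).query)
print(params)  # {'query': ['database:ChEMBL'], 'sort': ['score']}

# Rebuilding the query string percent-encodes the colon again.
rebuilt = urlencode({"query": "database:ChEMBL", "sort": "score"})
print(rebuilt)  # query=database%3AChEMBL&sort=score
```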

    jpo

  • Phinterest - A More Sketched Out Idea For An App To Cover Conferences


    Returning for a moment to some stuff we've covered in the blog before: the capture and open sharing of timely data to help drug discovery. The basic question is whether it is possible to rapidly capture and share key disclosure data (compound structures, toxicology, efficacy, ADME, etc.) so that accurate, timely data can be incorporated into your own experiments. At the moment, this area is very active commercially, with large corporations providing the data to people who can pay, who are not necessarily the best consumers of such data. There has also been some experimentation by professional bodies: a few years ago C&EN live-blogged some of the key Med Chem talks from a National ACS meeting, which hasn't been repeated despite being well received.

    To me this seems a great opportunity for citizen science - attendees at key conferences sharing results openly, in real time, for the benefit of all - introducing knowledge and data 'liquidity' to research.

    Let's now suspend reality for a few seconds, and especially ignore the likely tightening of rules on reporting/sharing data if this citizen science impacts valuable commercial streams for middlemen/conference organizers, as well as the copyright of the original slide producer. We may return to this in a future post.

    So the basic idea is

    1. go to conference
    2. write down stuff
    3. share it with the world

    There can't be anything wrong with that surely?


    I tweeted a few months ago asking if there was an app that could take photos from, say a poster, then do structure conversion. There was not a lot out there at all (the one thing out there was pretty duff, but then the developers said it was pretty duff), but there was quite some interest in the idea based on replies, retweets, etc. This has led to a little spare time thinking, and the following now seems to be technically possible.



    Names for stuff are important to me, so we'll call this 'Phinterest', as an homage to the online pinboard website Pinterest. This has a really simple paradigm: upload some pictures and provide some tags. Which is where we start.

    1) You go to the conference of interest, and either cruise the posters, or attend talks. Almost everybody has a smartphone now, with a camera capable of capturing good pictures. You'd then take pictures of whatever captures your interest and upload them with a single click to phinterest.org. There is often built in location tagging (so phinterest.org would capture time and location of the upload - this could support provenance of the uploaded photo, and allow auto-tagging with conference name, etc). There would be the ability to tag the photos if you wanted, but it's not really needed.

    2) The pictures could then be bundled automatically into sets from a conference - a stream, and would be visible by all, as they were uploaded. The crappy out of focus ones could be down-voted, and the useful ones would be quality auto-curated in this way, so the community sorts out the interesting stuff. If you really cared about the structure of the selective MEK inhibitor RG-7421, you could read the photo, and get what you need right there and then.

    3) The photos could then go through auto-OCR. This is pretty simple to do and set up; for example, the website http://www.free-ocr.com takes pictures and does OCR. Some simple technical and semantic rules to enhance OCR for things like Vd, IC50, gene names, research codes, etc. would be pretty easy, as would pairing an IC50 with a numeric value and a unit. This sort of thing is now pretty standard in biomedical literature, so no big shakes. This would then add some useful tags to the photos, and a simple search functionality would allow useful searches if you knew what you were looking for. Regardless of how accurate this would be, you'd always have the original evidence photo to check against.

    4) A special feature would be to perform useful OCR on molecular objects. DNA and protein sequences are pretty simple to extract and OCR, and then tag with parent genes, patents, etc. The second, more technically challenging task is to perform OCR on chemical structures. OSRA is great, and there are already some web services that allow upload of images and extraction of structures. On phinterest.org, these could be displayed alongside the parent photo, then confirmed/flagged by the community of experts.
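
    The rule from step 3, pairing a measurement name such as IC50 with a numeric value and a unit, can be sketched with a regular expression. This is only an illustration in Python: the pattern, the list of units, the function name tag_measurements and the sample text are all made up here, and real OCR output would need far more forgiving rules.

```python
import re

# Hypothetical rule: a measurement name, an optional "=" or ":", a number, a unit.
MEASUREMENT = re.compile(
    r"\b(IC50|EC50|Ki|Kd|Vd)\s*[=:]?\s*(\d+(?:\.\d+)?)\s*(nM|uM|µM|mM|pM|L/kg)"
)


def tag_measurements(ocr_text):
    """Return (name, value, unit) tuples found in a piece of OCR'd text."""
    return [
        (name, float(value), unit)
        for name, value, unit in MEASUREMENT.findall(ocr_text)
    ]


# Invented sample text standing in for the OCR output of a poster photo.
sample = "Compound X: IC50 = 4.2 nM, Vd 1.8 L/kg"
print(tag_measurements(sample))
```

    The extracted tuples would become searchable tags, with the original photo kept alongside as the evidence to check against.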

    Basic workflow is therefore....

    Photograph - Upload - OCR - Tagging - Sharing

    Even if you only got to stage 1) it would be useful to the community of drug discoverers. To me the challenges are...
    • Dealing with segmenting the images, and selecting the high quality useful ones, but with enough sources of photos of the same thing, it's likely that a couple of useful ones could be found.
    • It is the way of the world that idiots would abuse the facility, load rude or offensive photos, or use it for inappropriate marketing.
    • Of course, direct real-time upload of slides would be great, but in the real world this doesn't happen. How many times have you asked the speaker for a copy of the slides, been told 'yeah, of course buddy', then got nothing? For conference organisers the slides are part of the overall commercial package/benefits in some cases. So it just ain't gonna happen, at least not without a significant nudge, and anyway, steps 2 to 4 would still be required; you'd just have a higher resolution starting point.
    • Legally, it's an interesting area. How different is it from me writing something in a notebook and using it in my research, or sharing it with a colleague - just set in an Internet age? Sharing data is exactly what conferences are for. However, there are all sorts of concerns, image copyright, recording permissions, etc. As I said at the start, it's likely that conference organisers and publishers would try and strangle this idea at birth, primarily for the commercial interests of their shareholders and highly paid management.....
    As a pharma employee said over dinner the other evening, "let's get precompetitive!"

    This would form a great project for an intern, so if you're interested in coming to the lab to work on this, let me know.

  • A 101 Thankyou's!


    This week, our ChEMBL NAR Database paper reached the milestone of over a hundred citations (in less than a year too). This made us all very, very happy, and for a few moments we rested our fingers from our keyboards and used them instead to grasp a mug of coffee/tea; but only for a few seconds, before we got back to mixing and baking and cooking ChEMBL 15 for you all.

    Here's a list of the current citations of Gaulton et al., NAR Database, 40, D1100-D1107, 2012. Remember this is an Open Access paper.

    Please keep keeping us happy by using our work; it's probably the biggest satisfaction we can get :)

  • ChEMBL At SWAT4LS





    As part of the ChEMBL group's involvement in the OpenPhacts project, a representative from the ChEMBL team will be attending SWAT4LS next week. As well as hacking and learning about new Semantic technologies, there may be time to catch up with ChEMBL users also attending the workshop. So if you would like to hear about what we are doing with the Semantic Web or RDF, or just have a general chat about ChEMBL, please get in touch.


  • Ever had a funny dream? The InChI filesystem.


    I had a dream recently - I occasionally have really technical dreams, where unlike in the real world, I'm smart, I have great insights and I solve 'big' problems - then frustratingly the great insight of the dream disappears and I'm left with a half-formed memory and the complete lack of the insight. Of probably the hundred or so dreams like this I have ever had, a few have actually led to some interesting research, which I think has been useful - never as grand as the impact in my dream, but useful. But I also have dreams about discovering new British mushroom species, fishing for electric eels with capacitors and resistors as lures, so to be clear, most of my dreams are simple nonsense.

    The dream was set in the future, and I was running a group looking after the largest chemical database in the world (clearly a fantasy!!). The database was huge, about 10^20 molecules. We used InChIs for chemical structures, but to get over a lot of the storage problems we used the filesystem structure itself to hold the structures and the relations between them. At the tips of the filesystem were just a set of standard files containing descriptors of the molecule: ./logP, containing a logP value, alongside a bunch of other useful descriptors. In this system we treated the InChI as the complete filename, with the layer separators (/) acting as directory separators, so all the isomers of C3H4F2 sat beneath that directory on the /InChI=1 root filesystem. So really this is just using the hierarchical structure of the InChI itself in a hierarchical tree form.

    In the database, we used links between files to store relationships (say from all tautomers to a standard InChI), with different types of links for isomers, salts, etc. The reason we did this was space: the very size of the database precluded storing the data in a conventional database, and there was never any prospect of holding it in core memory. This InChI filesystem approach was very efficient and scalable (there was something in the dream as well about having to use ZFS, which can currently scale to 16 exabytes as a single volume, though we'd optimized it for the really small block sizes required by the data). The directory/file dates were used to store the date of registration, which was important for patent novelty checking. Querying of the database was based on 'extended' unix filesystem tools, like a pharmacophore-enabled 'find'. The duality of the filename as a location on a disk and a location on the internet, and the ubiquity and beauty of everything in unix being a file, also played their role.
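
    The idea of using directory dates as registration dates and querying them with find-like tools can be mimicked in a few lines of Python. This is only a toy sketch of that one idea: the function names register_on and registered_before are hypothetical, the registration dates are invented, and it says nothing about how the dream system (or ZFS) would behave at scale.

```python
import os
import tempfile
from datetime import datetime
from pathlib import Path


def register_on(root, inchi, when):
    """Create the directory for an InChI and stamp it with a registration time."""
    directory = Path(root).joinpath(*inchi.split("/"))
    directory.mkdir(parents=True, exist_ok=True)
    (directory / "logP").touch()  # empty placeholder descriptor file
    # Set the timestamp last, because creating files updates the directory mtime.
    os.utime(directory, (when, when))
    return directory


def registered_before(root, cutoff):
    """Yield molecule directories (those holding files) modified before cutoff."""
    for dirpath, _dirnames, filenames in os.walk(root):
        if filenames and os.stat(dirpath).st_mtime < cutoff:
            yield Path(dirpath)


aspirin = "InChI=1/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)"
ethanol = "InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3"

with tempfile.TemporaryDirectory() as root:
    # Invented registration dates, purely for the demonstration.
    register_on(root, aspirin, datetime(2005, 6, 1).timestamp())
    register_on(root, ethanol, datetime(2012, 6, 1).timestamp())
    cutoff = datetime(2010, 1, 1).timestamp()
    for hit in registered_before(root, cutoff):
        print(hit.relative_to(root))
```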

    Finally, there was something really really cool for drug discovery that could be done precisely because of this InChI as a filesystem model, and that's the bit that's missing from the recollection of the dream.

    What a bummer!

    Here's the directory containing the data for our good friend aspirin.


    /InChI=1/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)
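
    To make the mapping concrete, here is a minimal Python sketch that turns an InChI into a directory path, one path component per layer, and writes descriptor files at the tip. It is not the toy filesystem mentioned further down; the function names inchi_dir and register are hypothetical, the tree is built in a temporary directory, and the logP value is a placeholder rather than a measured number.

```python
import tempfile
from pathlib import Path


def inchi_dir(root, inchi):
    """Map an InChI onto a directory path, one path component per layer."""
    return Path(root).joinpath(*inchi.split("/"))


def register(root, inchi, descriptors):
    """Create the directory for a molecule and write one file per descriptor."""
    directory = inchi_dir(root, inchi)
    directory.mkdir(parents=True, exist_ok=True)
    for name, value in descriptors.items():
        (directory / name).write_text(f"{value}\n")
    return directory


aspirin = "InChI=1/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)"

with tempfile.TemporaryDirectory() as root:
    # The logP value is a placeholder for illustration, not a measured number.
    directory = register(root, aspirin, {"logP": 1.2})
    print(directory.relative_to(root))
    print((directory / "logP").read_text().strip())
    # Everything sharing the formula C9H8O4 sits under the same directory.
    print(sorted(p.name for p in inchi_dir(root, "InChI=1/C9H8O4").iterdir()))
```

    Listing the formula directory is the Python equivalent of an ls, and reading a descriptor file is the equivalent of a cat.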


    This blogpost was based closely on a mail I sent to a friend/collaborator (you know who you are). I've also added a few more sentences to increase readability and accessibility for those outside of the InChI field. A few spurious personal references have been removed from the original text (particular people had roles, and to prevent them appearing as apparent fact in Google, I've removed them; and anyway, the people may not have enjoyed their dream roles ;) ).

    I've explained the contents of the dream to a few people now, and the story usually makes people smile (always a good sign), go quiet (an even better sign) and then ask a bunch of questions that try and dismiss it as fantasy (the best sign of all). 

    Since that time, we've reduced the core idea to practice - 

    • There's a tarfile of a toy InChI filesystem (thanks Gerard), with which you can do a surprising amount of chemoinformatics using just ls and cat.
    • There's some initial work comparing the efficiency and scaling of this filesystem approach with classical prefix and suffix trees (thanks Michał), but these seem to have scaling problems.
    • General cheminformatics InChI related stuff (thanks Francis). 


    If you'd like to know more, get in touch via the comments.