-
GPCR Structure: Human PAR1 Receptor
Another GPCR structure, this time that of human Protease-Activated Receptor 1 (PAR1), also known as the thrombin receptor, complexed with the late-stage clinical candidate antagonist vorapaxar (SCH-530348). This brings the total number of sequence-distinct GPCRs now known structurally to 18. The PDB code is 3vw7. The structure helps explain how binding of the thrombin-cleaved N-terminus, which is involved in receptor activation, is regulated.
Update: I just looked at the tracking page at Scripps, and by the looks of it there's a lot of exciting near-term stuff - 5HT2B and 5HT1C in refinement (these will be great for exploring the polypharmacology of many centrally acting drugs), and the glucagon receptor, the first class 2 GPCR (Heptares have announced that they have the structure of a class 2 receptor as well, but with no plans to publish). We do live in interesting times!
The link to the paper is here.
%A C. Zhang %A Y. Srinivasan %A D.H. Arlow %A J.J. Fung %A D. Palmer %A Y. Zheng %A H.F. Green %A A. Pandey %A R.O. Dror %A D.E. Shaw %A W.I. Weis %A S.R. Coughlin %A B.K. Kobilka %T High-resolution crystal structure of human protease-activated receptor 1 %J Nature %D 2012 %O http://dx.doi.org/10.1038/nature11701
- 3uon - human muscarinic M2 receptor
- 4daj - rat muscarinic M3 receptor
- 3rze - human histamine H1 receptor
- 2rh1 - human beta-2 adrenergic receptor
- 2vt4 - turkey beta-1 adrenergic receptor
- 3pbl - human dopamine D3 receptor
- 2ydv - human adenosine A2a receptor
- 3v2w - human sphingosine-1-phosphate receptor
- 4djh - human kappa opioid receptor
- 4dkl - mouse mu opioid receptor
- 4ej4 - mouse delta opioid receptor
- 4ea3 - human nociceptin receptor
- 4grv - rat neurotensin receptor
- 3odu - human CXCR4 receptor
- 2lnl - human CXCR1 receptor (NMR)
- 3vw7 - human PAR1 receptor
- 1u19 - bovine rhodopsin
- 2z73 - squid rhodopsin
10 20 30 40 50 60 70
3uon ( 20 ) tfevvfivlvagslSlvTiigNilVmvSIkvnrh
4dajA ( 64 ) iwqvvfiafltgflAlvTiigNilVivAFkvnkq
3rze ( 28 ) mplvvvlsticlvTvglNllVlyAvrserk
2rh1 ( 29 ) devwvvgmgivmslivlaIvfgNvlVitAIakfer
2vt4A ( 40 ) weagmsllmalVvllIvagNvlViaAigstqr
3pblA ( 32 ) yalsYcalilaIvfgNglVcmAVlkera
2ydv ( 3 ) imgssvYitvElaiavlAilgNvlVcwAvwlnsn
3v2w ( 17 ) sdyvnydIIvrHYnyTgklnisa ltsvvfiliCcfIileNifvlltiwktkk
4djhA ( 55 ) spaipviitavysvvfvvGlvgNslVmfVIirytk
4dkl ( 65 ) mvtaitimalYsiVcvvGlfgNflvmyvIvrytk
4ej4 ( 41 ) rsasslalaiaitalYsavcavGllgNvlvmfgIvrytk
4ea3A ( 47 ) plglkvtIvglYlavcvgGllgNclvmyVIlrhtk
4grvA ( 52 ) nsdldVnTdiyskvlvtaiYlalfvvGtvgNsvtlftlark s
3oduA ( 27 ) pçfre-------------------------enanfnkiflptiYsiIfltGivgNglvilvMgyqkk
2lnl ( 29 ) pÇmle--------------------------tetLnkYvviiayalvFllsllgNslvMlvilysrv
3vw7 ( 91 ) dasgYLtsswLtlfVPsvYtgVfvvSlplNimaivvFilkmk
1u19A ( 1 ) mnGtegpnfyVPfsnktgvVrsPFeapQyyLaepwqFsmlAayMflLimlGfpiNflTlyVTvqHkk
2z73A ( 9 ) etwwyNpsIvVhpHWref--------------dqvpdavYyslGifIgiCgiiGcggNgiViyLFtktks
aaaaaaaaaaaaaaaaaaaaaaaaaaaa

80 90 100 110 120 130 140 150
3uon ( 54 ) LqtvnnyflfSLAcADliiGvfSMnlytlytvi--gyWplgpvvÇdlWlalDYvVSNAsVmNLliiSfdryfcvt
4dajA ( 98 ) LktvnnyFllSLAcADliIGviSMnlFttyiim--nrWalgnlaÇdlwLSiDYvASNAsVmNLlvISfDryfsit
3rze ( 58 ) LhtvGnlYIvsLSvADliVGavVMpmnilyllm--skwsLgrplÇlfWLSmDYVASTASIfSVfiLCiDryrsvq
2rh1 ( 64 ) LqtvtnyFItsLAcADlvMGlaVVpfgaahilm--kmWtfgnfwçefWTSiDVlCVTASIeTLcvIAvdryfAIt
2vt4A ( 72 ) LqtltnlFItsLAcADlvvGllVVpfgatlvvr--gtWlwgsflçelWTSlDVlCVTAsIeTLcvIAiDrylait
3pblA ( 60 ) LqtttnyLVvsLAvADllvAtlVMpwvvylevt-ggvWnfsricÇdvFVTlDVmMcTAsIwNLCaISidRytAVv
2ydv ( 37 ) LqnvtnyFVvsAAaADilVGvlAIpfaiaIst----GfçaaçhgÇLfiACfVLVLTASSIfSLlaIAiDryiair
3v2w ( 76 ) FhrpMYyFIgnLAlSDllaGvaYtaNlllsga---tTykLtPaqWFlREGsMFvALSASVfSLlaIAieryitml
4djhA ( 90 ) mktaTniYIfNLAlADalVTtTMpfqstvylmn---sWpfgdvlÇkiVlsiDyyNMfTSIfTLtmMSvdRyiaVc
4dkl ( 99 ) MktAtniYIfNLAlADalATsTLpfqsvnylmg---tWpfgnilÇkiviSidYyNMFTSIfTLctMSvdRyiAVC
4ej4 ( 80 ) LktATniYIfNLAlADalATstLpfqsakylme---tWpfgellÇkaVlSidYyNMFTSIfTLtmMSvDRyiavc
4ea3A ( 82 ) mktatNiYIfNLAlADtlVLlTLpfQGtdillg---fWpfgnalÇktVIaiDyyNMFTSTfTLtaMSvdryvaic
4grvA ( 98 ) lqstvhyHlgsLalSDllILllAMpvElyNFIWvhhpWafgdagÇrgyYflRDactYATAlNVasLSvaRylAic
3oduA ( 69 ) lrsmtdkYRlhLSvADllFVitLpfWavDAva----nWyfgnflÇkaVHviYTVNlYSSVwILAfISlDRylAiV
2lnl ( 70 ) GrsvTdvyLlnLalaDllfaltlpiwaaSkvn----gwifgtfLÇkvVslLkEvnfYsgilLlacIsvdrylaiv
3vw7 ( 133 ) vkkPAVVyMlhLAtADvlFVsvLpfkisYyfsg--SdWqfgselÇrfVtAaFYcnMYASIlLMtvISiDrflAVv
1u19A ( 68 ) LrtplNyILlnLAvADlfMVfg-GFtTTlyTSl-hGyFvfgptGÇnlEGffATLGGEIaLWSLvvLaieRyvvVc
2z73A ( 65 ) LqtpanmFiinLAfSDftFSlvNGfplMtiSCf-lkkWifgfaaÇkvYGfiGGiFGFMsIMTMAMiSiDrynViG
aaaaaaaaaaaaaaaaaa aaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

160 170 180 190 200 210 220
3uon ( 127 ) kpltypvk---rttkmAgmmiaaAwvlSfilwapaIlfwqfivg-----------vrtVedgeÇyIqff------
4dajA ( 171 ) rpltyrak---rttkrAgvmiglAwviSfvlWApaIlfwqyfvg-----------krtVppgeÇfIqfl------
3rze ( 131 ) qplrylky---rtktrAsatilgawflSfl-WvipIlgwnh rredkÇeTdfy------
2rh1 ( 137 ) spfkyqSl---ltknkArviilmvwivSgltSflpIqmhwyr-----athqeAinÇyae-etçÇdff--------
2vt4A ( 145 ) spfryqsl---mtrarAkviictvwaiSalvSflpImmhwWr-----dedpqAlkçyqd-pgçÇdfv--------
3pblA ( 134 ) mpvhyqhgtgqsscrrValmitavwvlAfaVSc-pLlfgfNtTg---------------dptvÇsIs--------
2ydv ( 108 ) iplryngl---vtgtrAkgiiaicwvlSfaIGltPmlgwnnÇgqp--kegkahsqgÇgegqvAÇlFedVV-----
3v2w ( 148 ) k nnfrlfllisacwviSlilGglPimgwn---------------ÇisalssÇSTVLP-------
4djhA ( 162 ) hpvkaldf---rtplkAkiinicIwllSssvGisAivlGGtkvred------------vdvieÇslqFpdddysw
4dkl ( 171 ) hpvkaldf---rtprnAkivnvcNwilSsaiGlpVmfmAttkyrqg--------------sidçtltfsh-ptwy
4ej4 ( 152 ) hpvkaldf---rtpakAklinicIwvlAsgvGvpimvmAvtqprdg--------------avvÇmlqfps-pswy
4ea3A ( 154 ) hp tsskAqavnvaIwalAsvvGvpvaimGsAqvede--------------eieÇlveipt-pqdy
4grvA ( 173 ) hpfkaktl---msrsrtkkfisaIwlaSallAipMlftMGlqnrSadg--------thpgGlVÇTPiv----dta
3oduA ( 140 ) hatn---sqrprkllAekvVyvgVwipAlllT-ipDfif--Anvsead-----------dryiÇdrfyp---ndl
2lnl ( 141 ) haTr----tltqkrhlvkfvclgcwglsmnlS-lpFflf--RQayhpN----------NsSPvÇyEVlg-ndtak
3vw7 ( 206 ) ypm rtlgrAsftClaiwalAiagV-vpLllkeQtiqvpg-----------lgitTçhdvlsetLleg
1u19A ( 141 ) kpmsn----frfgenhaimgvafTwvmAlaCAapPlvgwSrYIPE-------------GMQCSÇGIDYYTpheet
2z73A ( 139 ) rpmaas---kkMshrrAfimiifVwlwSvlwAigPifgwGaYtLE-------------GVLCNÇSFdYIsr--ds
aaaaaaaaaaaaaaaaaaa aaa

230 240 250 260 270 280 290 300
3uon ( 182 ) snaavtfgtAiaaFylpviiMtvlywhisrasksri pppsrekkvtrtilaIllaF
4dajA ( 226 ) septitfgtAiaaFymPvtiMtilywrIyketek like aqTlsaIllaF
3rze ( 186 ) dvtwfkvmtaiinFylPtllMlwfyakIykaVrqhc lhmnrerkaakQLgfIMaaF
2rh1 ( 195 ) TnqayaiasSivSFyvplviMvfvYsrVfqeakrql kfclkeHkaLktlgiIMgtF
2vt4A ( 203 ) TnrayaiasSiiSFyipLliMifvalrvyreakeq irehkalktlgiImgvF
3pblA ( 185 ) -npdFViySSvvSFylPfgvTvlvyarIyvvlkqrrrk-----------------gvplrekkatqMVaiVlgaF
2ydv ( 173 ) pmnYMVyfNffaCVlvPlllMlgvylrIflaarrqlkqmesq stlqkevhaakSLaiIvglF
3v2w ( 197 ) LYhkhYIlfCTtvFtllllsIvilYcriyslvrtr asrssenvaLlkTViiVLsvF
4djhA ( 222 ) wdlfmkicVfifAfviPvliIivcytlMilrlksvrllsg rekdrnlrritrLVlvVVavF
4dkl ( 228 ) wenllKicVfifAfimPvliItvcyglmilrlksvr ekdrnlrritrMVlvVvavF
4ej4 ( 209 ) wdtvtkicvflfAfvvPiliitvcyglMllrlrsvr ekdrslrriTrMVlvVvgaF
4ea3A ( 211 ) wgpvfaiciflfSFivPvlvIsvcyslMirrlrgvrlls-------------gsrekdrnlrritrLVlvVvavF
4grvA ( 233 ) tvkvvIqvNtfmSFlfPmlvIsilNtvIAnkLtvmv vqalrhGVlvAraVviaf
3oduA ( 195 ) wvvvfqfqhimvglilPgivIlsCyciIisklshs kghqkrkalktTviLilaF
2lnl ( 198 ) wrmvLrilPHtfGfivplfvmlfcygftlrtlf---------------------kahmgqkhrAmrvIfaVvlif
3vw7 ( 266 ) yyayyfsafSavfFfvpliiStvCyvsIirclsssa anrskksrAlfLSaaVfcIF
1u19A ( 199 ) nNesFViyMfvvHfiiPlivIffcygqLvftvkeaaaq------------qqesattqkaekevTrMviiMviaF
2z73A ( 196 ) ttrsNIlcMFilGffgPiliiffCyfnIvmsvsnhekemaamakrlnakelrkaqaganaemrlAkIsivIVsqF
aaaaaaaaaaa aaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaa

310 320 330 340 350 360 370
3uon ( 397 ) iitWapYNvmVlintfçap--------ç--ipntvwtiGywlCYinstiNpacYalcnatFkktfkhllm
4dajA ( 500 ) iitWtpyNimVlvntfçds--------ç--ipktywnlgywlCYiNStvNPvcYalcnktFrttfkt
3rze ( 425 ) ilCWipYFiffmviafçkn--------ç--cnehlhmftiWlGYiNStlNPliYplCnenFkktfkrilhi
2rh1 ( 283 ) tlcWlpFFiVNivhviqdn----------lirkevyillNwiGYvNSgfNpliYc-rspdfriAfqellcl
2vt4A ( 300 ) tlCWlpFFlvnivnvfnrd----------lvpdwlfvafnwlGYAnSAmnpiiYc-rspdfrkAfkrlla
3pblA ( 339 ) ivCWlpFFltHvlnthçqt--------ç-hvspelysattwlGYvNsalNPviYttfnieFrkAflkilsc
2ydv ( 243 ) alCWlpLHiiNcftffçpd--------çshaplwlMylAivlSHtNSvvNPfiyAyrireFrqTFrkiirshvlr
3v2w ( 266 ) iacwapLFiLLllDvgçkvk------tç--diLfrAeyfLvlAvlNSgtNPiiytltNkemrrafiri
4djhA ( 284 ) vvcWtpIHifilvealgs aalssyyfcIalGytNSslNPilYafldenFkrcfrdfcfp
4dkl ( 290 ) ivcWtpIHiyViikaliti-------pettfqtvswhfcialGYtNSclNpvlYafldenFkrCfrefci
4ej4 ( 271 ) vvCWapIHifVivwtlvdi------nrrdplvvaalhlcialGYaNSslNpvlYaflDenfkrc
4ea3A ( 273 ) vgcWtpVQvfvlaqglgvq-------pssetavailrfctAlGYvNSclNpilYafldenFkacfr
4grvA ( 318 ) vvcWlpYHvRRlmFCyisdeq--WttflFdfYHyfYmlTNalAYasSAinpilYnlvsanFrqv
3oduA ( 249 ) facWlpyyigisidsfilleiikqgçefentvhkwisitEAlAFfHCclNpilyaflgakfktsaqhalts
2lnl ( 252 ) llcwlpynlvlLadTlmrtqviqesçeRrNnIGraLdatEilGflhsclnpiiyafigqnfrhgflkilamhg
3vw7 ( 323 ) iiCFgpTNvlLiaHYsflsh-----tstteaAYfaYLlcvCvSSiSCciDplIyyyAssec
1u19A ( 262 ) liCWlpYAgvAfyIfthqgsd---------fgpifMTipAFfAKtSAvyNPviYimmnkqFrnCmvttlccgknp
2z73A ( 271 ) llSWspYAvvAllAQfgplew---------VtpyaAQlpVMfAKaSaihNPmiYsvsHpkFreAIsqtfpwvLtc
aaaaaaaaaaaaaaaa aaaaaaaaaaaaa aaaaaaaa aaaaaaaa

380 390 400 410
3uon
4dajA
3rze
2rh1
2vt4A
3pblA
2ydv ( 310 ) qqepfkaa
3v2w
4djhA
4dkl
4ej4
4ea3A
4grvA
3oduA
2lnl
3vw7
1u19A ( 328 ) lgddeasttVsktetsqvapa
2z73A ( 337 ) cqfddketeddkdaeteipage
-
New Drug Approvals 2012 - Pt. XXVII - Choline C-11
On September 12, the FDA approved Choline C-11, an intravenous radioactive diagnostic agent used as a tracer during Positron Emission Tomography (PET) scans to help detect sites of recurrent prostate cancer (OMIM: 176807; MeSH: D011471).
Prostate cancer is the most common cause of death from cancer in men over age 75, and is rarely found in men younger than 40. Unlike many other cancers, prostate cancer usually progresses very slowly. Sometimes the cancer cells may metastasize from the prostate to other parts of the body. Overall, it is estimated to be the sixth leading cause of cancer-related death in men.
Choline is a naturally occurring component of the vitamin B complex, and is necessary for normal cell structure and signalling. Choline C-11 is a radiolabeled synthetic analog of choline that releases a positron by beta decay, which can be visualised by PET. Choline is rapidly taken up by prostate cells, and this allows the prostate to be imaged.
Choline is a precursor molecule essential for the biosynthesis of phospholipids, which are the structural components of cell membranes, as well as for the modulation of trans-membrane signalling. Increased phospholipid synthesis has been associated with increased cell proliferation and the transformation process that occurs in tumour cells.
Choline C-11 is a positron-emitting radiopharmaceutical used for diagnostic purposes in conjunction with PET imaging. The active ingredient is Choline C-11, and each millilitre of the injection contains 148 MBq to 1225 MBq of the active ingredient.
IUPAC Name (Choline): 2-hydroxy-N,N,N-trimethylethanaminium
Canonical SMILES: [Cl-].[11CH3][N+](C)(C)CCO
Standard InChI: 1S/C5H14NO.ClH/c1-6(2,3)4-5-7;/h7H,4-5H2,1-3H3;1H/q+1;/p-1/i1-1;
Following intravenous administration, Choline C-11 distributes mainly to the pancreas, kidney, liver, spleen and colon. Radioactivity accumulated rapidly within the prostate, and peak uptake appeared within 5 minutes of administration. Choline C-11 undergoes metabolism, with 11C-betaine detected as the major metabolite in blood. The rate of excretion of Choline C-11 in urine was 0.014 mL/min.
Choline C-11 has been developed and marketed by Mayo Clinic.
Full prescribing information is found here.
-
Browsers and Bugs
We had a support email recently saying that some things on the interface (an export function) didn't work with Chrome. We couldn't reproduce the issue with the equipment we have here at ChEMBL Towers. But there are a lot of OSs and a lot of browsers out there, and we can't recreate every possible environment. Interestingly, Chrome is really popular amongst you people (the image above is a Google Analytics report of a week's access to this very blog). I'm a Safari man myself....
So as a reminder, we love hearing about bugs and issues, we really do, so send them to chembl-help@ebi.ac.uk!
-
ChEMBL Cross Reference Links Now In UniProt
So, some great news for those of you that use UniProt - there are now links to the corresponding target pages in ChEMBL in there.
Here's the link (http://www.uniprot.org/uniprot/?query=database%3AChEMBL&sort=score) to the list of ChEMBL targets that are in UniProt. The links to ChEMBL appear in the Cross References section of each entry.
jpo
-
Phinterest - A More Sketched Out Idea For An App To Cover Conferences
Returning for a moment to some stuff we've covered in the blog before - the capture and open sharing of timely data to help drug discovery. The basic question is whether it is possible to rapidly capture and share key disclosure data (compound structures, toxicology, efficacy, ADME, etc.) so that accurate, timely data can be incorporated into your own experiments. At the moment, this area is very active commercially, with large corporations providing the needed data to people who can pay, who are not necessarily the best consumers of such data. There has also been some experimentation by professional bodies - C&EN live-blogged some of the key Med Chem talks from a National ACS meeting a few years ago, which hasn't been repeated, despite being well received.
To me this seems a great opportunity for citizen science - attendees at key conferences sharing results openly, in real time, for the benefit of all - introducing knowledge and data 'liquidity' to research.
Let's now suspend reality for a few seconds, and especially ignore the likely tightening of rules on reporting/sharing data if this Citizen Science impacts valuable commercial streams for middlemen/conference organizers, as well as the copyright of the original slide producer. We may return to this in a future post.
So the basic idea is
- go to conference
- write down stuff
- share it with the world
There can't be anything wrong with that surely?
I tweeted a few months ago asking if there was an app that could take photos from, say a poster, then do structure conversion. There was not a lot out there at all (the one thing out there was pretty duff, but then the developers said it was pretty duff), but there was quite some interest in the idea based on replies, retweets, etc. This has led to a little spare time thinking, and the following now seems to be technically possible.
Names for stuff are important to me - so we'll call this 'Phinterest' - named in homage to the online pinboard website Pinterest. Pinterest has a really simple paradigm: upload some pictures and provide some tags. Which is where we start.
1) You go to the conference of interest, and either cruise the posters, or attend talks. Almost everybody has a smartphone now, with a camera capable of capturing good pictures. You'd then take pictures of whatever captures your interest and upload them with a single click to phinterest.org. There is often built in location tagging (so phinterest.org would capture time and location of the upload - this could support provenance of the uploaded photo, and allow auto-tagging with conference name, etc). There would be the ability to tag the photos if you wanted, but it's not really needed.
2) The pictures could then be bundled automatically into sets from a conference - a stream, and would be visible by all, as they were uploaded. The crappy out of focus ones could be down-voted, and the useful ones would be quality auto-curated in this way, so the community sorts out the interesting stuff. If you really cared about the structure of the selective MEK inhibitor RG-7421, you could read the photo, and get what you need right there and then.
3) The photos could then go through auto-OCR - this is pretty simple to do and set up - for example there's the website http://www.free-ocr.com that takes pictures and does OCR. Some simple technical and semantic rules to enhance OCR for things like Vd, IC50, gene names, research codes, etc. would be pretty easy, as would pairing an IC50 with a numeric value and a unit. This sort of thing is now pretty standard in biomedical literature. So no big shakes. This would then add some useful tags to the photos, and a simple search functionality would allow useful searches if you knew what you were looking for. Regardless of how accurate this would be, you'd always have the original evidence photo to check with.
4) A special feature would be to perform useful OCR on molecular objects. First, DNA and protein sequences are pretty simple to extract and OCR, and then tag with parent genes, patents, etc. Second, and more technically challenging, is performing OCR on chemical structures. OSRA is great, and there are already some web services to allow upload of images and extraction of structures. On phinterest.org, these could be displayed alongside the parent photo, then confirmed/flagged by the community of experts.
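The tagging rules in step 3 can be sketched in a few lines of Python. This is only a minimal illustration, assuming the OCR engine has already returned plain text; the label list, the unit list, the research-code pattern and the function name `tag_ocr_text` are all assumptions made for this sketch, and the compound code and numbers in the sample text are invented.

```python
import re

# Assumed rule: a label, an optional relation symbol, a number, then a unit.
MEASUREMENT = re.compile(
    r"\b(?P<label>IC50|EC50|Ki|Kd|Vd)\b"       # assay or PK label
    r"\s*(?P<relation><=|>=|[<>=~:])?\s*"      # optional relation symbol
    r"(?P<value>\d+(?:\.\d+)?)\s*"             # numeric value
    r"(?P<unit>pM|nM|uM|µM|mM|L/kg|L)\b"       # unit
)

# Assumed shape of a research code: 2-4 capitals, a hyphen, 3-7 digits.
RESEARCH_CODE = re.compile(r"\b[A-Z]{2,4}-\d{3,7}\b")


def tag_ocr_text(ocr_text):
    """Return the research codes and paired measurements found in OCR output."""
    measurements = [
        (m.group("label"), float(m.group("value")), m.group("unit"))
        for m in MEASUREMENT.finditer(ocr_text)
    ]
    return {
        "codes": RESEARCH_CODE.findall(ocr_text),
        "measurements": measurements,
    }


# Invented sample text standing in for the OCR of a poster photo.
tags = tag_ocr_text("ABC-1234: IC50 = 4.2 nM, Vd 1.5 L/kg")
```

Real OCR output would be noisier than this (a zero read as the letter O, units split across lines), which is why keeping the original photo alongside the tags matters.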
Basic workflow is therefore....
Photograph - Upload - OCR - Tagging - Sharing
Even if you only got to stage 1) it would be useful to the community of drug discoverers. To me the challenges are...
- Dealing with segmenting the images, and selecting the high quality useful ones, but with enough sources of photos of the same thing, it's likely that a couple of useful ones could be found.
- It is the way of the world that idiots would abuse the facility, loading rude or offensive photos, or using it for inappropriate marketing.
- Of course, direct real-time upload of slides would be great - in the real world though this doesn't happen. How many times have you asked the speaker for a copy of the slides, been told, 'yeah, of course buddy', then got nothing? For conference organisers the slides are part of the overall commercial package/benefits in some cases. So it just ain't gonna happen - at least not without a significant nudge - and anyway, steps 2 to 4 would still be required - you'd just have a higher resolution starting point.
- Legally, it's an interesting area. How different is it from me writing something in a notebook and using it in my research, or sharing it with a colleague - just set in an Internet age? Sharing data is exactly what conferences are for. However, there are all sorts of concerns, image copyright, recording permissions, etc. As I said at the start, it's likely that conference organisers and publishers would try and strangle this idea at birth, primarily for the commercial interests of their shareholders and highly paid management.....
This would form a great project for an intern, so if you're interested in coming to the lab to work on this, let me know.
-
A 101 Thankyou's!
This week, our ChEMBL NAR Database paper made the milestone of over a hundred citations (in less than a year too). This made us all very, very happy, and for a few moments, we rested our fingers from our keyboards, and used them instead to grasp a mug of coffee/tea; but only for a few seconds, before we got back to mixing and baking and cooking ChEMBL 15 for you all.
Here's a list of the current citations of Gaulton et al., NAR Database, 40, D1100-D1107, 2012. Remember this is an Open Access paper.
Please keep on keeping us happy by using our work; it's probably the biggest satisfaction we can get :)
-
ChEMBL At SWAT4LS
As part of the ChEMBL group's involvement in the OpenPhacts project, a representative from the ChEMBL team will be attending SWAT4LS next week. As well as hacking and learning about new Semantic technologies, there may be time to catch up with ChEMBL users also attending the workshop. So if you would like to hear about what we are doing with the Semantic Web or RDF, or just have a general chat about ChEMBL, please get in touch.
-
Ever had a funny dream? The InChI filesystem.
I had a dream recently - I occasionally have really technical dreams, where unlike in the real world, I'm smart, I have great insights and I solve 'big' problems - then frustratingly the great insight of the dream disappears and I'm left with a half-formed memory and the complete lack of the insight. Of probably the hundred or so dreams like this I have ever had, a few have actually led to some interesting research, which I think has been useful - never as grand as the impact in my dream, but useful. But I also have dreams about discovering new British mushroom species, fishing for electric eels with capacitors and resistors as lures, so to be clear, most of my dreams are simple nonsense.
The dream was set in the future, and I was running a group looking after the largest chemical database in the world (clearly a fantasy!!). The database was huge, about 10^20 molecules. We used InChIs for chemical structures, but to get over a lot of the storage problems we used the filesystem structure itself to hold the structures and the relations between them. At the tips of the filesystem were just a set of standard files containing descriptors of the molecule - ./logP, containing a logP value, alongside a bunch of other useful descriptors. In this system we treated the InChI as the complete filename, with the slash layer separators (/) acting as directory separators, so all the isomers of C3H4F2 were held beneath that directory on the /InChI=1 root filesystem. So really this is just using the hierarchical structure of the InChI itself in a hierarchical tree form.
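To make the mapping concrete, here is a minimal Python sketch of it, using the aspirin InChI quoted later in this post. The function names `inchi_to_path` and `register` are made up for illustration, a temporary directory stands in for the root of the filesystem, and the logP value is a placeholder rather than a measured number.

```python
import tempfile
from pathlib import Path


def inchi_to_path(root, inchi):
    """Map an InChI to a directory, one path component per InChI layer."""
    # The '/' layer separators in the InChI become directory separators.
    return Path(root).joinpath(*inchi.split("/"))


def register(root, inchi, descriptors):
    """Create the molecule's directory and write one small file per descriptor."""
    mol_dir = inchi_to_path(root, inchi)
    mol_dir.mkdir(parents=True, exist_ok=True)
    for name, value in descriptors.items():
        (mol_dir / name).write_text(f"{value}\n")
    return mol_dir


# A throwaway directory stands in for the root of the filesystem.
root = Path(tempfile.mkdtemp())
aspirin = "InChI=1/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)"

# Placeholder logP for illustration only, not a measured value.
aspirin_dir = register(root, aspirin, {"logP": 1.2})

# The InChI layers reappear as path components.
layers = aspirin_dir.relative_to(root).parts
# Reading a descriptor is the equivalent of `cat ./logP`.
stored_logp = float((aspirin_dir / "logP").read_text())
# Listing a formula directory is the equivalent of `ls`.
connectivities = sorted(p.name for p in (root / "InChI=1" / "C9H8O4").iterdir())
```

Listing the formula directory returns every connectivity layer registered under that formula, which is the "all the isomers under one directory" idea from the dream.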
In the database, we used links between files to store relationships (say from all tautomers to a standard InChI), with different types of links for isomers, salts, etc. We did this to save space: the very size of the data precluded storing it in a conventional database, and there was never any prospect of holding it in core memory. This InChI filesystem approach was very efficient and scalable (there was something in the dream as well about having to use ZFS, which can currently scale to 16 exabytes as a single volume, though we'd optimized it for the really small block sizes required by the data). The directory/file dates were used to store the date of registration, which was important for patent novelty checking. Querying of the database was based on 'extended' unix filesystem tools, like a pharmacophore-enabled 'find'. The duality of the filename as a location on a disk and a location on the internet, and the ubiquity and beauty of everything in unix being a file, also played their role.
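Links and dates can be sketched with ordinary POSIX calls. The dream did not specify how the link types were told apart, so the sketch below simply assumes the relation type is encoded in the symlink's name. The directory names are placeholders rather than real InChI layers, the timestamps are arbitrary, and `registered_before` is only a crude stand-in for a date-based `find`.

```python
import os
import tempfile
from pathlib import Path


def link_relation(source_dir, target_dir, relation):
    """Store a typed relationship as a symlink whose name is the relation type."""
    link = Path(source_dir) / relation
    link.symlink_to(Path(target_dir).resolve(), target_is_directory=True)
    return link


def set_registration_date(mol_dir, epoch_seconds):
    """Use the directory's modification time as its registration date."""
    # Adding entries to a directory updates its mtime, so call this last.
    os.utime(mol_dir, (epoch_seconds, epoch_seconds))


def registered_before(root, cutoff):
    """Return descriptor-holding directories whose mtime is earlier than cutoff."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if "logP" in filenames and os.stat(dirpath).st_mtime < cutoff:
            hits.append(Path(dirpath))
    return sorted(hits)


# Placeholder directories, not real InChI layers.
root = Path(tempfile.mkdtemp()).resolve()
standard = root / "InChI=1" / "FORMULA" / "standard-form"
tautomer = root / "InChI=1" / "FORMULA" / "tautomer-form"
for mol_dir in (standard, tautomer):
    mol_dir.mkdir(parents=True)
    (mol_dir / "logP").write_text("0.0\n")

link = link_relation(tautomer, standard, "standard_inchi")
set_registration_date(standard, 1_000_000_000)
set_registration_date(tautomer, 1_300_000_000)
early = registered_before(root, 1_200_000_000)
```

One catch the sketch exposes: a directory's modification time changes whenever an entry is added to it, so a real system would need to be careful about when the registration date is stamped.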
Finally, there was something really really cool for drug discovery that could be done precisely because of this InChI as a filesystem model, and that's the bit that's missing from the recollection of the dream.
What a bummer!
Here's the directory containing the data for our good friend aspirin.
/InChI=1/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)
This blogpost was based closely on a mail I sent to a friend/collaborator (you know who you are). I've also added a few more sentences to increase readability and accessibility for those outside of the InChI field. A few spurious personal references have been removed from the original text (particular people had roles, and to prevent them appearing as apparent fact in Google, I've removed them; and anyway, the people may not have enjoyed their dream roles ;) ).
I've explained the contents of the dream to a few people now, and the story usually makes people smile (always a good sign), go quiet (an even better sign) and then ask a bunch of questions that try and dismiss it as fantasy (the best sign of all).
Since that time, we've reduced the core idea to practice -
- There's a tarfile of a toy InChI filesystem (thanks Gerard) with which you can do a surprising amount of chemoinformatics using just ls and cat.
- Some initial work comparing the efficiency and scaling of this filesystem approach to classical prefix and suffix trees (thanks Michał), but these seem to have scaling problems.
- General cheminformatics InChI related stuff (thanks Francis).
If you'd like to know more, get in touch via the comments.