-
Rainbow tables for InChI keys - are they of any practical use?
I love computer security stuff - 'love' in the sense of really interested with no real competence. If I could have a parallel professional life, it would be as a forensic accountant - now that would be exciting ;)
Here's an example of a Standard InChI to Standard InChI key transform, for the natural product morphine. The InChI key is a hash of the long InChI with a whole host of usability benefits.
InChI=1S/C17H19NO3/c1-18-7-6-17-10-3-5-13(20)16(17)21-15-12(19)4-2-9(14(15)17)8-11(10)18/h2-5,10-11,13,16,19-20H,6-8H2,1H3/t10-,11+,13-,16-,17-/m0/s1 -> BQJCRHHNABKAKU-KBQPJGBKSA-N
Anyway, one thing I recently started to think about is the Standard InChI to InChI key hash. This is widely viewed as a one-way function in that it is not possible to reconstruct the chemical structure from the hash - outside of a simple dictionary-based attack (so if you have a database of 2-D structures with precalculated keys, it's trivial to do the reverse lookup). The InChI key algorithm was not designed as a cryptographic hash, but it does a good job of condensing the long meta-character stuffed InChI into a short web friendly ASCII string with good collision properties - yes, there are some known duplicate hashes for different molecules, but they are believed to be rare; and if you can't handle these cases in systems you build, perhaps you should try a different job ;)
However, converting an InChI key back into a real structure is an important task, and there are many public and private services to do this - all essentially based on an extant dictionary lookup - of course these cover the majority of current common use cases, but not some of the more crazy innovative cases where you want to deliberately explore/search novel chemical space.
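To make the limitation concrete, here is a minimal sketch of what a dictionary-based resolver boils down to. The table and the `resolve` helper are hypothetical names for illustration, not any particular service's API; the single entry is the morphine pair quoted above. Anything not already in the table simply cannot be resolved.

```python
# Minimal sketch of a dictionary-based InChI key resolver.
# KEY_TO_INCHI and resolve() are hypothetical names, not a real service's API.

MORPHINE_INCHI = (
    "InChI=1S/C17H19NO3/c1-18-7-6-17-10-3-5-13(20)16(17)21-15-12(19)4-2-9"
    "(14(15)17)8-11(10)18/h2-5,10-11,13,16,19-20H,6-8H2,1H3"
    "/t10-,11+,13-,16-,17-/m0/s1"
)

# Precomputed key -> structure table (here, just the morphine example).
KEY_TO_INCHI = {"BQJCRHHNABKAKU-KBQPJGBKSA-N": MORPHINE_INCHI}


def resolve(inchi_key):
    """Return the InChI for a known key, or None if the key is not in the table."""
    return KEY_TO_INCHI.get(inchi_key.strip().upper())
```

A key for a novel molecule returns `None` here, which is exactly the gap the rest of this post is about.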
So, what does the world of password cracking teach us about this problem? Well, one cool thing is the development of Rainbow Tables - essentially very large tables of precomputed hashes for a given password space. So it seems to me that, in principle, you could exhaustively enumerate molecules (given composition and chemical stability common sense constraints), then build an InChI rainbow table to perform arbitrary lookups (within the original constraints). The sizes of the possible password and chemical spaces are sort of comparable, and the password cracking community has already applied GPU processing and highly optimised code; you could also think of adding crowdsourcing as a way of generating additional computing resource to get the job done.
An example of the sort of software that handles this sort of thing in the password cracking space is RainbowCrack.
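For a feel of the mechanics, here is a toy sketch of the chain-based time-memory trade-off that rainbow tables use: rather than storing every hash, you store only the start and end of long hash/reduce chains and replay a chain on lookup. Everything here is an assumption for illustration: the "molecules" are just strings over a three-letter alphabet, `toy_hash` is a truncated SHA-256 standing in for the real InChI key algorithm, and all function names are made up.

```python
import hashlib

ALPHABET = "CNO"                  # toy 'atoms'; purely illustrative
LENGTH = 6                        # toy 'molecules' are strings of this length
SPACE = len(ALPHABET) ** LENGTH   # size of the enumerated toy space
CHAIN_LEN = 20                    # hash/reduce steps per chain


def candidate(index):
    """Map an integer in [0, SPACE) to a toy structure string."""
    chars = []
    for _ in range(LENGTH):
        index, r = divmod(index, len(ALPHABET))
        chars.append(ALPHABET[r])
    return "".join(chars)


def toy_hash(text):
    """Stand-in hash (truncated SHA-256); NOT the InChI key algorithm."""
    return hashlib.sha256(text.encode()).hexdigest()[:16]


def reduce_hash(digest, step):
    """Step-dependent reduction: map a hash back into the candidate space."""
    return candidate((int(digest, 16) + step) % SPACE)


def build_table(n_chains):
    """Build chains but keep only endpoint -> list of start points."""
    table = {}
    for i in range(n_chains):
        start = candidate(i)
        current = start
        for step in range(CHAIN_LEN):
            current = reduce_hash(toy_hash(current), step)
        table.setdefault(current, []).append(start)
    return table


def lookup(table, digest):
    """Try to invert digest; return a matching candidate, or None if not covered."""
    for pos in range(CHAIN_LEN - 1, -1, -1):
        # Assume digest sits at chain position pos and walk on to the endpoint.
        current = reduce_hash(digest, pos)
        for step in range(pos + 1, CHAIN_LEN):
            current = reduce_hash(toy_hash(current), step)
        for start in table.get(current, []):
            # Replay the chain from its start, looking for the preimage.
            node = start
            for step in range(CHAIN_LEN):
                h = toy_hash(node)
                if h == digest:
                    return node
                node = reduce_hash(h, step)
    return None
```

Coverage is probabilistic: a structure is only recoverable if it happens to sit on one of the stored chains, so real tables need many chains (and a lot of compute) to cover a space well.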
Something like GDB-13 (and higher) would be a great place to start for the initial seed structures; applying some frequent med-chem transforms to these using a simple empirical rule-base is another interesting way to extend these towards more medchem friendly/optimised chemotypes. Imagine what we could do when we get to GDB-25 or so!
Once you had this InChI Rainbow table you would have an unmatchable InChI resolver. Extension of this concept to include salts (not literal as in chemistry, but literal as in password strengthening) opens up some interesting opportunities for secure, blind data sharing based around a salted-InChI key - but that is for another day, over dinner, when I've had some Merlot.....
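A rough sketch of how the salted-key idea might look, under stated assumptions: two parties agree a secret salt out of band, each hashes their own InChI keys with it, and only the salted values are exchanged. I've used HMAC-SHA256 (a keyed hash) rather than literal salt concatenation, and the function names are hypothetical.

```python
import hashlib
import hmac


def salted_key(inchi_key, shared_salt):
    """Keyed hash of an InChI key; shared_salt is a secret byte string."""
    return hmac.new(shared_salt, inchi_key.encode("ascii"), hashlib.sha256).hexdigest()


def salted_set(inchi_keys, shared_salt):
    """Salted tokens for a whole compound list; this is what gets exchanged."""
    return {salted_key(k, shared_salt) for k in inchi_keys}
```

Each side computes `salted_set(...)` on its own list; the intersection of the two sets tells them which compounds they have in common without either side revealing the rest. It only works if the salt stays secret: anyone who knows it can run the dictionary (or rainbow table) attack again.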
If anyone is interested in an internship on this sort of thing, using our large (~22K node) farm, and large storage (~15 PB) infrastructure here, get in touch.
As a final aside, the 'parent compound'/'salt form' compound pair has a lot of similarities to the salting used in password strengthening. The odd thing is that the salt complexity in chemicals is relatively small (chloride, sodium, malate, etc), so the salt complexity is only a few bits (8 bits would be 256 salts) compared to the 48 to 128 bit salts of typical passwords. So an interesting subproblem is the generation of salted-hashed-forms of anions/cations. This is quite funny if you think about it - really it is! -
Interest in a ChEMBL seminar in your lab in the South-East of the US next Spring?
I'm chairing a session at the ACS Spring meeting in New Orleans, in early April 2013 (the 7th thru 11th are the dates of the ACS meeting itself, but I'll probably be finished by the 9th) and am making a visit to Miami and Kentucky on the same trip. I can probably squeeze in two more lab visits if there is interest in a ChEMBL seminar (I would need to leave the US at the latest on Tuesday 16th April). I'll need to look into the practicalities of travel, and realistically they'll need to be in the South-East, but I'm pretty hardy in the air - I don't mind odd timed flights.
So if there is any interest, let me know. -
PhD Position in the ChEMBL Group for an October 2013 Start
We have a PhD position available in the group for an October 2013 start. Details of the EMBL PhD Programme (EIPP) can be found here, with more details on stipend etc. here. Students at the EBI are also typically registered at the University of Cambridge, and fully participate in college life with access to all the facilities of the University. Several other PIs at the EBI, and the rest of EMBL, have positions available, so these may be of interest too.
The project areas we are interested in for the ChEMBL group are connected to computational approaches to drug discovery or drug safety. However, being part of EMBL gives students access to great experimental facilities, and we regularly collect data sets useful for our work, or test and validate predictions. We have a few ideas at the moment that would be suitable for a PhD project, including...
- Reconstruction of assay cascades from ChEMBL.
- Is the concept of a 'disease pathway' a useful one?
- Automated design and optimization of biological drugs using a rule-based approach.
- Polypharmacology, Compound Promiscuity and Drug Safety.
- The structural basis of allosteric regulation.
But proposals for other areas of research are very welcome.
The deadline for applications is the 19th November 2012, but you need to register for the online registration system by the 12th November 2012. Interviews will be in Hinxton on the 6th to 8th February 2013.
-
Drug Approval Timeline Visualisation
We're playing around with some visualisation techniques at the moment for ChEMBL, with one of our interests being the display of timelines. Here is a little standalone visualisation of the timeline of FDA drug approvals, annotated with ATC codes, loaded with some toy data (so don't rely on it for structures/publications/analysis!).
Update: So the toolkit we use looks to be quite browser sensitive - we'll look into this, but by default, it looks like it doesn't render in Chrome..... -
LinkedIn Endorsements and that Johari Window thing....
So, LinkedIn have just started up this endorsement thing: if you've listed some skills, your contacts are variously prompted to endorse/validate you against these skills. A sort of "X-factor", or "Scientist's got Talent" vote.
A picture of my current endorsements is above - yeah, yeah, stop laughing - it's only been going a few days, and in a couple of weeks everything's gonna be maxed out (I hope).
I find this whole concept pretty unsettling; for instance, if people have voted for my bioinformatics skills and ignored the chemoinformatics button, does it mean they think I stink at that? Is it a back-handed compliment they've just paid me?
The other odd thing is that my personal perception of my skills is not well correlated with the LinkedIn world's views (hence the Johari Window reference in the title). Perhaps people don't really see the private underlying enabling skills (e.g. Fortran) that are sublimed into the public facing stuff (e.g. Bioinformatics) we do.
I know I am obsessive and paranoid, but I know a lot of people who are more obsessive and paranoid (and fragile) than me, and they are probably losing sleep over this sort of stuff.
Come on, everyone knows I have leet Fortran haxor skillz - get clickin' -
USAN Watch - August/September 2012
The USANs for August and September 2012 have recently been published. Sorry for being slow in posting these.... structures will appear over the next few days, and if anyone has the research codes for the various Merck compounds in this list, please post in the comments (telmapitant, vibegron and tildrakizumab).
| USAN | Research Code | Structure | Drug Class | Therapeutic class | Target |
|---|---|---|---|---|---|
| aldoxorubicin, aldoxorubicin hydrochloride | INNO-206 | | natural product-derived small molecule | therapeutic | topoisomerase II |
| belnacasan | VX-765 | | synthetic small molecule | therapeutic | ICE-1 |
| deferitazole, deferitazole magnesium | FBS-071, SPD-602 | | synthetic small molecule | therapeutic | n/a |
| delparantag, delparantag pentahydrochloride | PMX-60056 | | synthetic small molecule | therapeutic | heparin |
| dupilumab | REGN-668, SAR-231893 | | mAb | therapeutic | IL4R-alpha |
| elosulfase alfa | BMN 110, rhGALNS, N-acetylgalactosamine-6-sulfatase, chondroitinsulfatase, galactose-6-sulfate sulfatase, EC=3.1.6.4 | | enzyme | therapeutic | n/a |
| emixustat, emixustat hydrochloride | ACU-4429 | | synthetic small molecule | therapeutic | RPE-65 |
| empagliflozin | BI-10773 | | synthetic small molecule | therapeutic | SGLT2 |
| entolimod | CBLB-502 | | protein | therapeutic | TLR5 |
| eravacycline | TP-434 | | synthetic small molecule | therapeutic | 30S ribosome |
| evodenoson | ATL313, DE-112 | | synthetic small molecule | therapeutic | A2A |
| filorexant | | | | therapeutic | |
| gandotinib | LY-2784544 | | synthetic small molecule | therapeutic | JAK2 |
| ilorasertib | ABT-348, A-968660 | | synthetic small molecule | therapeutic | AUR VEGFR SRC |
| lusutrombopag | S-888711 | | synthetic small molecule | therapeutic | |
| mericitabine | RO5024048, R7128 | | synthetic small molecule | therapeutic | HCV NS5B polymerase |
| nelipepimut-s | E75 peptide | | peptide vaccine | therapeutic | HER2 |
| nesvacumab | REGN-910 | | mAb | therapeutic | angiopoietin 2 |
| ozanezumab | GSK-1223249 | | mAb | therapeutic | Nogo-A |
| peginesatide, peginesatide acetate | AS-37702, Omontys | | peptide | therapeutic | EPOr |
| peginterferon beta-1a | BIIB-017 | | protein | therapeutic | IFNAR |
| pidilizumab | CT-011 | | mAb | therapeutic | PD1 |
| pimasertib | MSC1936369A; AS703026; EMD 1036239 | | synthetic small molecule | therapeutic | MEK |
| ramucirumab | IMC1121B | | mAb | therapeutic | VEGFR2 |
| rebastinib, rebastinib tosylate | DP-1919.TO, DCC-2036 | | synthetic small molecule | therapeutic | Tie2 TrkA |
| riociguat | BAY 63-2521 | | synthetic small molecule | therapeutic | soluble guanylate cyclase |
| sofosbuvir | PSI-7977 GS-7977 | | synthetic small molecule prodrug | therapeutic | HCV pol |
| telmapitant | | | synthetic small molecule | therapeutic | |
| tenofovir alafenamide | GS-7340 | | synthetic small molecule prodrug | therapeutic | HIV-1 RT |
| tergenpumatucel-L | HAL | | cell | therapeutic | n/a |
| tildrakizumab | | | mAb | therapeutic | IL-23 |
| ulodesine, ulodesine succinate | BCX-4208 | | synthetic small molecule | therapeutic | purine nucleoside phosphorylase (PNP) |
| vibegron | | | synthetic small molecule | therapeutic | |
-
Timeline of Structural Biology Nobel Prizes
Our BFFs in PDBe have put together an interactive timeline of Structural Biology Nobel Prize Awards - to celebrate the award of the Nobel Prize for Chemistry 2012 to Lefkowitz and Kobilka.
And now that there is a class B GPCR structure known (but not public), things are moving along very nicely! -
Brain-1.0 - Biomedical knowledge manipulation
Brain is a library created to achieve such linkage: it can handle and query large biomedical knowledge-bases. The Brain library can also serve as a framework for users interested in Description Logic and Biology. Documentation: https://github.com/loopasam/Brain/wiki