• Rainbow tables for InChI keys - are they of any practical use?


    I love computer security stuff - 'love' in the sense of really interested with no real competence. If I could have a parallel professional life, it would be as a forensic accountant  - now that would be exciting ;)

    Here's an example of a Standard InChI to Standard InChI key transform, for the natural product morphine. The InChI key is a hash of the long InChI with a whole host of usability benefits.

    InChI=1S/C17H19NO3/c1-18-7-6-17-10-3-5-13(20)16(17)21-15-12(19)4-2-9(14(15)17)8-11(10)18/h2-5,10-11,13,16,19-20H,6-8H2,1H3/t10-,11+,13-,16-,17-/m0/s1 -> BQJCRHHNABKAKU-KBQPJGBKSA-N

    Anyway, one thing I recently started to think about is the Standard InChI to InChI key hash. This is widely viewed as a one-way function in that it is not possible to reconstruct the chemical structure from the hash - outside of a simple dictionary-based attack (so if you have a database of 2-D structures with precalculated keys, it's trivial to do the reverse lookup). The InChI key algorithm was not designed as a cryptographic hash, but it does a good job of condensing the long meta-character stuffed InChI into a short web friendly ASCII string with good collision properties - yes, there are some known duplicate hashes for different molecules, but they are believed to be rare; and if you can't handle these cases in systems you build, perhaps you should try a different job ;)

    However, converting an InChI key back into a real structure is an important task, and there are many public and private services to do this - all essentially based on an extant dictionary lookup - of course these cover the majority of current common use cases, but not some of the more crazy innovative cases where you want to deliberately explore/search novel chemical space.
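    To make the dictionary-lookup idea concrete, here is a minimal sketch in Python. Note the assumptions: `toy_key` is a hypothetical stand-in (a truncated SHA-256 digest), not the real InChI key algorithm, and the three small InChIs are just illustrative entries in a pretend structure database.

```python
import hashlib

def toy_key(inchi: str) -> str:
    """Stand-in hash (assumption): NOT the real InChI key algorithm."""
    return hashlib.sha256(inchi.encode("ascii")).hexdigest()[:16]

def build_lookup(inchis):
    """Map each key to every InChI that produces it, so collisions stay visible."""
    table = {}
    for inchi in inchis:
        table.setdefault(toy_key(inchi), []).append(inchi)
    return table

# A tiny "database" of structures with precalculated keys.
known = [
    "InChI=1S/CH4/h1H4",                  # methane
    "InChI=1S/H2O/h1H2",                  # water
    "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",  # ethanol
]
lookup = build_lookup(known)

def resolve(key: str):
    """Reverse lookup: returns [] for any structure that was never enumerated."""
    return lookup.get(key, [])
```

    Resolving the key for water returns its InChI, while the key for benzene (which is not in the dictionary) returns an empty list - which is exactly the novel-chemical-space limitation described above.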

    So, what does the world of password cracking teach us about this problem? Well, one cool thing is the development of Rainbow Tables - essentially very large precomputed tables that let you trade storage for lookup time when inverting hashes over a given password space. So it seems to me that, in principle, you could exhaustively enumerate molecules (given composition and common-sense chemical stability constraints), then build an InChI rainbow table to perform arbitrary lookups (within the original constraints). The sizes of the possible password and chemical spaces are sort of comparable, and the password cracking community have already applied GPU processing and highly optimised code; you could also think of adding crowdsourcing as a way of generating additional computing resource to get the job done.

    An example of the sort of software that handles this sort of thing in the password cracking space is RainbowCrack.
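    For the curious, here is a toy sketch of how a rainbow table works, in Python. Everything in it is an assumption for illustration: the 64 three-letter strings stand in for an enumerated chemical space, `toy_hash` is a truncated SHA-256 rather than the InChI key hash, and the chain length is tiny. It is not how RainbowCrack is implemented, just the general chain-and-reduce idea.

```python
import hashlib
from itertools import product

# Toy "chemical space" (assumption): every 3-letter string over a small alphabet.
ALPHABET = "CNOS"
SPACE = ["".join(p) for p in product(ALPHABET, repeat=3)]  # 64 items
CHAIN_LEN = 8

def toy_hash(item: str) -> str:
    """Stand-in hash (assumption): NOT the InChI key algorithm."""
    return hashlib.sha256(item.encode("ascii")).hexdigest()[:12]

def reduce_hash(digest: str, step: int) -> str:
    """Map a digest back into SPACE; a different mapping is used at each chain step."""
    return SPACE[(int(digest, 16) + step) % len(SPACE)]

def build_table(starts):
    """Walk each chain but store only its two ends (end -> start)."""
    table = {}
    for start in starts:
        item = start
        for step in range(CHAIN_LEN):
            item = reduce_hash(toy_hash(item), step)
        table[item] = start
    return table

def lookup(table, target):
    """Invert target by guessing its position in a chain, then replaying that chain."""
    for pos in range(CHAIN_LEN - 1, -1, -1):
        # Assume target is the hash of the item at position pos; walk to the chain end.
        item = reduce_hash(target, pos)
        for step in range(pos + 1, CHAIN_LEN):
            item = reduce_hash(toy_hash(item), step)
        start = table.get(item)
        if start is None:
            continue
        # Replay from the stored start; a miss here is a false alarm, so keep going.
        candidate = start
        for step in range(CHAIN_LEN):
            if toy_hash(candidate) == target:
                return candidate
            candidate = reduce_hash(toy_hash(candidate), step)
    return None

starts = SPACE[::8]          # 8 chain starting points
table = build_table(starts)  # at most 8 stored pairs, each covering up to 8 items
```

    The point is that the table stores only chain ends, yet `lookup(table, toy_hash(x))` can recover any `x` that happens to sit on a stored chain; items that no chain visits simply return `None`, so coverage depends on how many chains you build.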

    Something like GDB-13 (and higher) would be a great place to start for the initial seed structures. Applying some frequent med-chem transforms to these, using a simple empirical rule-base, is another interesting way to extend them towards more medchem friendly/optimised chemotypes. Imagine what we could do when we get to GDB-25 or so!

    Once you had this InChI Rainbow table you would have an unmatchable InChI resolver. Extension of this concept to include salts (not literal as in chemistry, but literal as in password strengthening) opens up some interesting opportunities for secure, blind data sharing based around a salted-InChI key - but that is for another day, over dinner, when I've had some Merlot.....

    If anyone is interested in an internship on this sort of thing, using our large (~22K node) farm, and large storage (~15 PB) infrastructure here, get in touch.


    As a final aside, the 'parent compound'/'salt form' compound pair has a lot of similarities to the salting used in password strengthening. The odd thing is that the salt complexity in chemicals is relatively small (chloride, sodium, malate, etc.), so it amounts to only a few bits (8 bits would cover 256 salts) compared to the 48 to 128 bit salts typically used with passwords. So an interesting subproblem is the generation of salted-hashed-forms of anions/cations. This is quite funny if you think about it - really it is!
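    The bit-counting, and the password-style salting idea, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `salt_bits` and `salted_key` are hypothetical helper names, and the salted hash is plain SHA-256 over salt plus InChI, not any existing salted-InChI key scheme.

```python
import hashlib
import secrets

def salt_bits(n_salts: int) -> int:
    """Smallest number of bits that can label n_salts distinct salts."""
    return (n_salts - 1).bit_length()

def salted_key(inchi: str, salt: bytes) -> str:
    """Toy salted hash (assumption): SHA-256 over salt + InChI."""
    return hashlib.sha256(salt + inchi.encode("ascii")).hexdigest()

# A password-style 128-bit salt, shared only between the collaborating parties.
shared_salt = secrets.token_bytes(16)
other_salt = secrets.token_bytes(16)

water = "InChI=1S/H2O/h1H2"
key_shared = salted_key(water, shared_salt)
key_other = salted_key(water, other_salt)
```

    Two parties holding the same salt get the same key for the same structure and so can compare keys without exchanging structures, while a table precomputed without that salt is of no help.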

  • Interest in a ChEMBL seminar in your lab in the South-East of the US next Spring?


    I'm chairing a session at the ACS Spring meeting in New Orleans, in early April 2013 (the 7th thru 11th are the dates of the ACS meeting itself, but I'll probably be finished by the 9th) and am making a visit to Miami and Kentucky on the same trip. I can probably squeeze in two more lab visits if there is interest in a ChEMBL seminar (I would need to leave the US at the latest on Tuesday 16th April). I'll need to look into the practicalities of travel, and realistically they'll need to be in the South-East, but I'm pretty hardy in the air - I don't mind odd timed flights.

    So if there is any interest, let me know.

  • PhD Position in the ChEMBL Group for an October 2013 Start



    We have a PhD position available in the group for an October 2013 start. Details of the EMBL PhD Programme (EIPP) can be found here, with more details on stipend etc. here. Students at the EBI are also typically registered at the University of Cambridge, and fully participate in college life with access to all the facilities of the University. Several other PIs at the EBI, and the rest of EMBL, have positions available, so these may be of interest too.

    The project areas we are interested in for the ChEMBL group are connected to computational approaches to drug discovery or drug safety. However, being part of EMBL gives students access to great experimental facilities, and we regularly collect data sets useful for our work, or test and validate predictions. We have a few ideas at the moment that would be suitable for a PhD project, including...

    • Reconstruction of assay cascades from ChEMBL.
    • Is the concept of a 'disease pathway' a useful one?
    • Automated design and optimization of biological drugs using a rule-based approach.
    • Polypharmacology, Compound Promiscuity and Drug Safety.
    • The structural basis of allosteric regulation.

    But proposals for other areas of research are very welcome.

    The deadline for applications is the 19th November 2012, but you need to have registered with the online registration system by the 12th November 2012. Interviews will be in Hinxton on the 6th to 8th February 2013.

  • Drug Approval Timeline Visualisation


    We're playing around with some visualisation techniques at the moment for ChEMBL, with one of our interests being the display of timelines. Here is a little standalone visualisation of the timeline of FDA drug approvals, annotated with ATC codes, loaded with some toy data (so don't rely on it for structures/publications/analysis!).

    Update: So the toolkit we use looks to be quite browser sensitive - we'll look into this, but by default, it looks like it doesn't render in Chrome.....

  • LinkedIn Endorsements and that Johari Window thing....


    So, LinkedIn have just started up this endorsement thing: if you've listed some skills, your contacts are variously prompted to endorse/validate you against these skills. A sort of "X-factor", or "Scientist's got Talent" vote.

    A picture of my current endorsements is above - yeah, yeah, stop laughing - it's only been going a few days, and in a couple of weeks everything's gonna be maxed out (I hope).

    I find this whole concept pretty unsettling; for instance, if people have voted for my bioinformatics skills and ignored the chemoinformatics button, does it mean they think I stink at that? Is it a back-handed compliment they've just paid me?

    The other odd thing is that my personal perception of my skills is not well correlated with the LinkedIn world's views (hence the Johari Window reference in the title). Perhaps, people don't really see the private underlying enabling skills (e.g. Fortran) that are sublimed into the public facing stuff (e.g. Bioinformatics) we do.

    I know I am obsessive and paranoid, but I know a lot of people who are more obsessive and paranoid (and fragile) than me, and they are probably losing sleep over this sort of stuff.

    Come on, everyone knows I have leet Fortran haxor skillz - get clickin'

  • USAN Watch - August/September 2012

    The USANs for August and September 2012 have recently been published. Sorry for being slow in posting these.... structures will appear over the next few days, and if anyone has the research codes for the various Merck compounds in this list, please post in the comments (telmapitant, vibegron and tildrakizumab).

    | USAN | Research Code | Structure | Drug Class | Therapeutic class | Target |
    |---|---|---|---|---|---|
    | aldoxorubicin, aldoxorubicin hydrochloride | INNO-206 | | natural product-derived small molecule | therapeutic | topoisomerase II |
    | belnacasan | VX-765 | | synthetic small molecule | therapeutic | ICE-1 |
    | deferitazole, deferitazole magnesium | FBS-071, SPD-602 | | synthetic small molecule | therapeutic | n/a |
    | delparantag, delparantag pentahydrochloride | PMX-60056 | | synthetic small molecule | therapeutic | heparin |
    | dupilumab | REGN-668, SAR-231893 | | mAb | therapeutic | IL4R-alpha |
    | elosulfase alfa | BMN 110, rhGALNS, N-acetylgalactosamine-6-sulfatase, chondroitinsulfatase, galactose-6-sulfate sulfatase, EC=3.1.6.4 | | enzyme | therapeutic | n/a |
    | emixustat, emixustat hydrochloride | ACU-4429 | | synthetic small molecule | therapeutic | RPE-65 |
    | empagliflozin | BI-10773 | | synthetic small molecule | therapeutic | SGLT2 |
    | entolimod | CBLB-502 | | protein | therapeutic | TLR5 |
    | eravacycline | TP-434 | | synthetic small molecule | therapeutic | 30S ribosome |
    | evodenoson | ATL313, DE-112 | | synthetic small molecule | therapeutic | A2A |
    | filorexant | | | | therapeutic | |
    | gandotinib | LY-2784544 | | synthetic small molecule | therapeutic | JAK2 |
    | ilorasertib | ABT-348, A-968660 | | synthetic small molecule | therapeutic | AUR VEGFR SRC |
    | lusutrombopag | S-888711 | | synthetic small molecule | therapeutic | |
    | mericitabine | RO5024048, R7128 | | synthetic small molecule | therapeutic | HCV NS5B polymerase |
    | nelipepimut-s | E75 peptide | | peptide vaccine | therapeutic | HER2 |
    | nesvacumab | REGN-910 | | mAb | therapeutic | angiopoietin 2 |
    | ozanezumab | GSK-1223249 | | mAb | therapeutic | Nogo-A |
    | peginesatide, peginesatide acetate | AS-37702, Omontys | | peptide | therapeutic | EPOr |
    | peginterferon beta-1a | BIIB-017 | | protein | therapeutic | IFNAR |
    | pidilizumab | CT-011 | | mAb | therapeutic | PD1 |
    | pimasertib | MSC1936369A; AS703026; EMD 1036239 | | synthetic small molecule | therapeutic | MEK |
    | ramucirumab | IMC1121B | | mAb | therapeutic | VEGFR2 |
    | rebastinib, rebastinib tosylate | DP-1919.TO, DCC-2036 | | synthetic small molecule | therapeutic | Tie2 TrkA |
    | riociguat | BAY 63-2521 | | synthetic small molecule | therapeutic | soluble guanylate cyclase |
    | sofosbuvir | PSI-7977 GS-7977 | | synthetic small molecule prodrug | therapeutic | HCV pol |
    | telmapitant | | | synthetic small molecule | therapeutic | |
    | tenofovir alafenamide | GS-7340 | | synthetic small molecule prodrug | therapeutic | HIV-1 RT |
    | tergenpumatucel-L | HAL | | cell | therapeutic | n/a |
    | tildrakizumab | | | mAb | therapeutic | IL-23 |
    | ulodesine, ulodesine succinate | BCX-4208 | | synthetic small molecule | therapeutic | purine nucleoside phosphorylase (PNP) |
    | vibegron | | | synthetic small molecule | therapeutic | |

  • Timeline of Structural Biology Nobel Prizes



    Our BFFs in PDBe have put together an interactive timeline of Structural Biology Nobel Prize Awards - to celebrate the award of the Nobel Prize for Chemistry 2012 to Lefkowitz and Kobilka.

    And now that a class B GPCR structure is known (but not public), things are moving along very nicely!

  • Brain-1.0 - Biomedical knowledge manipulation


    The world of data informatics is seeing a cultural change, from a world of databases, such as ChEMBL, to one where data is more self-descriptive and ad hoc queryable - an evolution into knowledge-bases. The data will be organized around controlled dictionaries and ontologies (Semantic Web), more exposed to programmatic and web service infrastructures, and more robustly linked to other repositories (for a current example of a large nascent network of coordinated data repositories, see the ELIXIR project).

    Brain is a library created to achieve such linkage: It can handle and query large biomedical knowledge-bases. The Brain library can also serve as a framework for users interested in Description Logic and Biology.

    Website of the library: http://loopasam.github.com/Brain/