ChEMBL blog

Rainbow tables for InChI keys - are they of any practical use?
23 Oct 2012

I love computer security stuff - 'love' in the sense of really interested with no real competence. If I could have a parallel professional life, it would be as a forensic accountant - now that would be exciting ;)

Here's an example of a Standard InChI to Standard InChI key transform, for the natural product morphine. The InChI key is a hash of the long InChI with a whole host of usability benefits.
```
InChI=1S/C17H19NO3/c1-18-7-6-17-10-3-5-13(20)16(17)21-15-12(19)4-2-9(14(15)17)8-11(10)18/h2-5,10-11,13,16,19-20H,6-8H2,1H3/t10-,11+,13-,16-,17-/m0/s1 -> BQJCRHHNABKAKU-KBQPJGBKSA-N
```
Anyway, one thing I recently started to think about is the Standard InChI to InChI key hash. This is widely viewed as a one-way function in that it is not possible to reconstruct the chemical structure from the hash - outside of a simple dictionary-based attack (so if you have a database of 2-D structures with precalculated keys, it's trivial to do the reverse lookup). The InChI key algorithm was not designed as a cryptographic hash, but it does a good job of condensing the long meta-character stuffed InChI into a short web friendly ASCII string with good collision properties - yes, there are some known duplicate hashes for different molecules, but they are believed to be rare; and if you can't handle these cases in systems you build, perhaps you should try a different job ;)

However, converting an InChI key back into a real structure is an important task, and there are many public and private services to do this - all essentially based on an extant dictionary lookup - of course these cover the majority of current common use cases, but not some of the more crazy innovative cases where you want to deliberately explore/search novel chemical space.

So, what does the world of password cracking teach us about this problem? Well, one cool thing is the development of Rainbow Tables - essentially very large tables of precomputed hashes for a given password space. So it seems to me that, in principle, you could exhaustively enumerate molecules (given composition and chemical stability common-sense constraints) and then build an InChI rainbow table to perform arbitrary lookups (within the original constraints). The sizes of the possible password and chemical spaces are sort of comparable, and the password cracking community has already applied GPU processing and highly optimised code to the problem; you could also think of adding crowdsourcing as a way of generating additional computing resource to get the job done.
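To make the idea concrete, here is a minimal sketch of the precomputed-lookup step. It is a plain dictionary of hashes, not a true rainbow table (which would use hash chains and reduction functions to save space), and `toy_key` is only a truncated SHA-256 stand-in, not the real InChI key algorithm. The function names and the three tiny example molecules are assumptions made for illustration; a real run would feed in an enumerated set such as GDB-13.

```python
import hashlib


def toy_key(inchi: str) -> str:
    """Stand-in hash for illustration only; NOT the real InChI key algorithm."""
    return hashlib.sha256(inchi.encode("ascii")).hexdigest()[:27].upper()


def build_lookup(inchis):
    """Precompute key -> list of InChIs (a list, so collisions are kept)."""
    table = {}
    for inchi in inchis:
        table.setdefault(toy_key(inchi), []).append(inchi)
    return table


def resolve(table, key):
    """Reverse lookup: return every enumerated InChI that hashes to this key."""
    return table.get(key, [])


# Hypothetical 'enumerated chemical space': methane, ethane and water.
enumerated = [
    "InChI=1S/CH4/h1H4",
    "InChI=1S/C2H6/c1-2/h1-2H3",
    "InChI=1S/H2O/h1H2",
]
table = build_lookup(enumerated)
print(resolve(table, toy_key("InChI=1S/H2O/h1H2")))
```

Storing a list per key means the rare duplicate hashes come back as multiple candidates rather than being silently overwritten.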

An example of the sort of software that handles this sort of thing in the password cracking space is RainbowCrack.

Something like GDB-13 (and higher) would be a great place to start for the initial seed structures. Applying some frequent med-chem transforms to these using a simple empirical rule-base is another interesting way to extend them towards more medchem friendly/optimised chemotypes. Imagine what we could do when we get to GDB-25 or so!

Once you had this InChI Rainbow table you would have an unmatchable InChI resolver. Extension of this concept to include salts (not literal as in chemistry, but literal as in password strengthening) opens up some interesting opportunities for secure, blind data sharing based around a salted-InChI key - but that is for another day, over dinner, when I've had some Merlot.....
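As a rough sketch of what a salted key could look like for blind sharing: two parties who agree on a secret salt can compare salted hashes and find their overlap without exchanging structures. The scheme below uses HMAC-SHA256 from the Python standard library as a stand-in; it is an assumption for illustration, not the InChI key algorithm and not a worked-out protocol, and the compound lists are toy examples.

```python
import hashlib
import hmac
import secrets


def salted_key(inchi: str, salt: bytes) -> str:
    """Keyed hash of an InChI; HMAC-SHA256 is an illustrative stand-in."""
    return hmac.new(salt, inchi.encode("ascii"), hashlib.sha256).hexdigest()


# Hypothetical shared secret agreed between the two parties (128 bits).
shared_salt = secrets.token_bytes(16)

# Toy compound collections: methane and water vs. ethane and water.
mine = {salted_key(i, shared_salt) for i in
        ["InChI=1S/CH4/h1H4", "InChI=1S/H2O/h1H2"]}
theirs = {salted_key(i, shared_salt) for i in
          ["InChI=1S/C2H6/c1-2/h1-2H3", "InChI=1S/H2O/h1H2"]}

# Only the salted keys are compared; the overlap reveals shared compounds.
overlap = mine & theirs
print(len(overlap))
```

Without the salt, an outsider holding a precomputed table for the unsalted hash gets no help, because every salt would need its own table.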

If anyone is interested in an internship on this sort of thing, using our large (~22K node) farm, and large storage (~15 PB) infrastructure here, get in touch.

As a final aside, the 'parent compound'/'salt form' compound pair has a lot of similarities to the salting used in password strengthening. The odd thing is that the salt complexity in chemicals is relatively small (chloride, sodium, malate, etc.), so it amounts to only a few bits (8 bits would be 256 salts) compared to the 48 to 128 bit salts of typical passwords. So an interesting subproblem is the generation of salted-hashed-forms of anions/cations. This is quite funny if you think about it - really it is!
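The bit-counting here is just powers of two, and can be checked in a couple of lines. The helper names are made up for illustration.

```python
import math


def salt_count(bits: int) -> int:
    """Number of distinct salts that fit in the given number of bits."""
    return 2 ** bits


def salt_bits(n_salts: int) -> float:
    """Bits of salt complexity for a given number of distinct salts."""
    return math.log2(n_salts)


print(salt_count(8))    # 256 possible chemical 'salts' in 8 bits
print(salt_count(48))   # 281474976710656 for a 48-bit password salt
print(salt_bits(256))   # 8.0
```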
Interest in a ChEMBL seminar in your lab in the South-East of the US next Spring?
21 Oct 2012

I'm chairing a session at the ACS Spring meeting in New Orleans, in early April 2013 (the 7th thru 11th are the dates of the ACS meeting itself, but I'll probably be finished by the 9th) and am making a visit to Miami and Kentucky on the same trip. I can probably squeeze in two more lab visits if there is interest in a ChEMBL seminar (I would need to leave the US at the latest on Tuesday 16th April). I'll need to look into the practicalities of travel, and realistically they'll need to be in the South-East, but I'm pretty hardy in the air - I don't mind odd timed flights.

So if there is any interest, let me know.
PhD Position in the ChEMBL Group for an October 2013 Start
20 Oct 2012

We have a PhD position available in the group for an October 2013 start. Details of the EMBL PhD Programme (EIPP) can be found here, with more details on stipend etc. here. Students at the EBI are also typically registered at the University of Cambridge, and fully participate in college life with access to all the facilities of the University. Several other PIs at the EBI, and the rest of EMBL, have positions available, so these may be of interest too.

The project areas we are interested in for the ChEMBL group are connected to computational approaches to drug discovery or drug safety. However, being part of EMBL gives students access to great experimental facilities, and we regularly collect data sets useful for our work, or test and validate predictions. We have a few ideas at the moment that would be suitable for a PhD project, including...
- Reconstruction of assay cascades from ChEMBL.
- Is the concept of a 'disease pathway' a useful one?
- Automated design and optimization of biological drugs using a rule-based approach.
- Polypharmacology, Compound Promiscuity and Drug Safety.
- The structural basis of allosteric regulation.
But proposals for other areas of research are very welcome.

The deadline for applications is the 19th November 2012, but you need to register for the online registration system by the 12th November 2012. Interviews will be in Hinxton on the 6th to 8th February 2013.
Drug Approval Timeline Visualisation
19 Oct 2012

We're playing around with some visualisation techniques at the moment for ChEMBL, with one of our interests being the display of timelines. Here is a little standalone visualisation of the timeline of FDA drug approvals, annotated with ATC codes, loaded with some toy data (so don't rely on it for structures/publications/analysis!).

Update: So the toolkit we use looks to be quite browser sensitive - we'll look into this, but by default, it looks like it doesn't render in Chrome.....
LinkedIn Endorsements and that Johari Window thing....
11 Oct 2012

So, LinkedIn have just started up this endorsement thing: if you've listed some skills, your contacts are variously prompted to endorse/validate you against these skills. A sort of "X-factor", or "Scientist's got Talent" vote.

A picture of my current endorsements is above - yeah, yeah, stop laughing - it's only been going a few days, and in a couple of weeks everything's gonna be maxed out (I hope).

I find this whole concept pretty unsettling; for instance, if people have voted for my bioinformatics skills and ignored the chemoinformatics button, does it mean they think I stink at that? Is it a back-handed compliment they've just paid me?

The other odd thing is that my personal perception of my skills is not well correlated with the LinkedIn world's views (hence the Johari Window reference in the title). Perhaps, people don't really see the private underlying enabling skills (e.g. Fortran) that are sublimed into the public facing stuff (e.g. Bioinformatics) we do.

I know I am obsessive and paranoid, but I know plenty of people who are more obsessive and paranoid (and fragile) than me, and they are probably losing sleep over this sort of stuff.

Come on, everyone knows I have leet Fortran haxor skillz - get clickin'

USAN Watch - August/September 2012
11 Oct 2012

The USANs for August and September 2012 have recently been published. Sorry for being slow in posting these.... structures will appear over the next few days, and if anyone has the research codes for the various Merck compounds in this list, please post in the comments (telmapitant, vibegron and tildrakizumab).

USAN	Research Code	Structure	Drug Class	Therapeutic class	Target
aldoxorubicin, aldoxorubicin hydrochloride	INNO-206		natural product-derived small molecule	therapeutic	topoisomerase II
belnacasan	VX-765		synthetic small molecule	therapeutic	ICE-1
deferitazole, deferitazole magnesium	FBS-071, SPD-602		synthetic small molecule	therapeutic	n/a
delparantag, delparantag pentahydrochloride	PMX-60056		synthetic small molecule	therapeutic	heparin
dupilumab	REGN-668, SAR-231893		mAb	therapeutic	IL4R-alpha
elosulfase alfa	BMN 110, rhGALNS, N-acetylgalactosamine-6-sulfatase, chondroitinsulfatase, galactose-6-sulfate sulfatase, EC=3.1.6.4		enzyme	therapeutic	n/a
emixustat, emixustat hydrochloride	ACU-4429		synthetic small molecule	therapeutic	RPE-65
empagliflozin	BI-10773		synthetic small molecule	therapeutic	SGLT2
entolimod	CBLB-502		protein	therapeutic	TLR5
eravacycline	TP-434		synthetic small molecule	therapeutic	30S ribosome
evodenoson	ATL313, DE-112		synthetic small molecule	therapeutic	A2A
filorexant				therapeutic
gandotinib	LY-2784544		synthetic small molecule	therapeutic	JAK2
ilorasertib	ABT-348, A-968660		synthetic small molecule	therapeutic	AUR VEGFR SRC
lusutrombopag	S-888711		synthetic small molecule	therapeutic
mericitabine	RO5024048, R7128		synthetic small molecule	therapeutic	HCV NS5B polymerase
nelipepimut-s	E75 peptide		peptide vaccine	therapeutic	HER2
nesvacumab	REGN-910		mAb	therapeutic	angiopoietin 2
ozanezumab	GSK-1223249		mAb	therapeutic	Nogo-A
peginesatide, peginesatide acetate	AS-37702, Omontys		peptide	therapeutic	EPOr
peginterferon beta-1a	BIIB-017		protein	therapeutic	IFNAR
pidilizumab	CT-011		mAb	therapeutic	PD1
pimasertib	MSC1936369A; AS703026; EMD 1036239		synthetic small molecule	therapeutic	MEK
ramucirumab	IMC1121B		mAb	therapeutic	VEGFR2
rebastinib, rebastinib tosylate	DP-1919.TO, DCC-2036		synthetic small molecule	therapeutic	Tie2 TrkA
riociguat	BAY 63-2521		synthetic small molecule	therapeutic	soluble guanylate cyclase
sofosbuvir	PSI-7977, GS-7977		synthetic small molecule prodrug	therapeutic	HCV pol
telmapitant			synthetic small molecule	therapeutic
tenofovir alafenamide	GS-7340		synthetic small molecule prodrug	therapeutic	HIV-1 RT
tergenpumatucel-L	HAL		cell	therapeutic	n/a
tildrakizumab			mAb	therapeutic	IL-23
ulodesine, ulodesine succinate	BCX-4208		synthetic small molecule	therapeutic	purine nucleoside phosphorylase (PNP)
vibegron			synthetic small molecule	therapeutic

Timeline of Structural Biology Nobel Prizes
11 Oct 2012

Our BFFs in PDBe have put together an interactive timeline of Structural Biology Nobel Prize Awards - to celebrate the award of the Nobel Prize for Chemistry 2012 to Lefkowitz and Kobilka.

And now that a class B GPCR structure is known (but not public), things are moving along very nicely!
Brain-1.0 - Biomedical knowledge manipulation
09 Oct 2012

The world of data informatics is seeing a cultural change, from a world of databases, such as ChEMBL, to one where data is more self-descriptive and ad hoc queryable - an evolution into knowledge-bases. The data will be organized around controlled dictionaries and ontologies (Semantic Web), more exposed to programmatic and web service infrastructures, and more robustly linked to other repositories (for a current example of a large nascent network of coordinated data repositories, see the ELIXIR project).

Brain is a library created to achieve such linkage: It can handle and query large biomedical knowledge-bases. The Brain library can also serve as a framework for users interested in Description Logic and Biology.

Website of the library: http://loopasam.github.com/Brain/

Documentation: https://github.com/loopasam/Brain/wiki