Salts, Hashes, Password Cracking and InChI keys
I rate the InChI software that they’ve developed as one of the seven wonders of the modern informatics world. Long live InChI!
One of the great things is the InChI key, a hashed form of an InChI that is fixed length and contains only alphabetic characters (and dashes). That makes it metacharacter friendly, so you don’t need to worry about escaping goofy things for the web and command line text processing. There are reports of hash collisions, but these are very, very rare and not too difficult to deal with. For me though, one of the usability issues with InChI keys is that they do not have usable neighborhood properties at all. By this I mean that two compounds whose InChI keys differ by only one or two characters are not chemically similar at all, and they shouldn’t be; it’s a hash after all.
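To make the “fixed length, metacharacter friendly” point concrete, here is a minimal Python sketch. It assumes the standard InChI key layout of 14 letters, a dash, 10 letters, a dash and a final letter (27 characters in all); the function name is made up for illustration, and the key used is the one for ethanol.

```python
import re
import shlex
from urllib.parse import quote

# Standard InChI key layout: 14 letters, dash, 10 letters, dash, 1 letter.
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")

def looks_like_inchikey(text: str) -> bool:
    """Check the shape of the string only; this says nothing about the chemistry."""
    return INCHIKEY_PATTERN.fullmatch(text) is not None

ethanol_key = "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"

valid = looks_like_inchikey(ethanol_key)
key_length = len(ethanol_key)

# Neither the shell nor a URL needs any escaping for a key of this shape.
shell_safe = shlex.quote(ethanol_key) == ethanol_key
url_safe = quote(ethanol_key) == ethanol_key
```

A check like this only confirms that a string has the right shape; it cannot tell you whether the key corresponds to a real compound.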
There are times though when it would be nice to retrieve related compounds via an InChI key search, especially in these times of Google and the web. There is a simple edge case for which it’s possible to enumerate some closely structurally related InChI keys, and it turns out that this edge case is pretty common and highly useful, especially for pharmaceutically important compounds: salts.
Here is a paragraph for people new to the field; the wizards can safely skip this… Many pharmaceuticals are salts, that is, they are delivered in a form where the active ingredient is ionized (negatively if the ‘parent’ drug is an acid, or positively if the parent drug is a base). To make a compound that is overall neutral, there is a ‘salt’ component with the opposite charge. These simple concepts are actually quite difficult to write down, and I can think of many exceptions, but for now this will do. So acidic drugs are often salts with something like a sodium ion, and basic drugs with something like a chloride ion. As an example, sildenafil is the active ingredient of the drug Revatio/Viagra, but it is dosed as the citrate salt (sildenafil citrate). The citrate has nothing to do with the bioactivity and pharmacology in vivo, but may be crucial to good solubility, absorption, formulation, manufacturing, etc. So if you have sildenafil citrate as a molecule and you create an InChI key for it, you will not be able to use that key to retrieve other interesting molecules such as sildenafil itself, or other sildenafil salts such as sildenafil malate. The usual way to deal with this in databases (like ChEMBL) is to store the data against the actual tested compound (in this case sildenafil citrate), but also to ‘salt strip’ and normalize that compound to capture the active compound sildenafil. That way you’d have two InChI keys to play with at least, one for sildenafil citrate and one for sildenafil, but you would still not be able to find sildenafil malate with either of them.
Password hashing/cracking approaches offer a simple analogy for this problem, and this picks up on a thread in an earlier post where we discussed rainbow tables for very large-scale databases. It turns out that if you have a plaintext dictionary of passwords, it’s trivial to calculate the hashed forms of these and then do a reverse lookup from the hashed form to the plaintext form of the password. The practicality of this relies on users being human and choosing simple passwords such as “password1” - do a Google search on “password dictionaries” and you will see how non-random and simple they are. Although the possible search space is huge, people are lazy and don’t use passwords like “dFv4%a-<
The guys and gals from the password encryption communities aren’t usually interested in chemistry, so their use of the word salt is not the same as in our chemoinformatics community, but the two meanings are sort of similar after all. The other cool thing that the password cracking community gets its hands dirty with is specialist computing. This is one area where GPU computing has made a real impact, with quite astonishing performance on a simple, specialist, highly scaled task (hash generation for plaintext passwords).
Plaintext -> hash function -> hashed plaintext
Plaintext.salt -> hash function -> hashed plaintext.salt
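The two schemes above can be sketched in a few lines of Python. This is only an illustration of the dictionary reverse lookup idea: the function names, the toy password list and the salt value are all made up, the salt is simply appended to the plaintext as in the scheme above, and plain SHA-256 is used for brevity (real systems should use a dedicated password hashing function).

```python
import hashlib

def hash_password(plaintext: str, salt: str = "") -> str:
    """Return the hex SHA-256 digest of the plaintext followed by the salt."""
    return hashlib.sha256((plaintext + salt).encode("utf-8")).hexdigest()

def build_reverse_lookup(dictionary, salt: str = "") -> dict:
    """Map each hash back to the dictionary word that produced it."""
    return {hash_password(word, salt): word for word in dictionary}

# Hypothetical toy dictionary of common passwords.
common_passwords = ["password1", "letmein", "qwerty"]

# Unsalted case: a precomputed table recovers the plaintext directly.
unsalted_table = build_reverse_lookup(common_passwords)
recovered = unsalted_table.get(hash_password("password1"))

# Salted case: the unsalted table no longer matches...
salted_hash = hash_password("password1", salt="x9Q2")
missed = unsalted_table.get(salted_hash)

# ...but a table built with the same (known) salt does.
salted_table = build_reverse_lookup(common_passwords, salt="x9Q2")
recovered_salted = salted_table.get(salted_hash)
```

The point to carry over to chemistry is the last step: if the set of plausible salts is small and known, you can simply enumerate every plaintext and salt combination ahead of time.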
The really great thing for drugs and drug-like molecules is that although there are a huge (astronomical scale) number of potential salts for a given parent drug, the actual number used in practice is small, on the order of 50 or so (and some of these are only used for acidic drugs, some only for bases, which reduces the combinatorics even further). This is because when you develop a new drug you need to explain the activity of the entire compound, not just the parent drug, so there is a big disincentive to developing a drug with a novel exotic salt. Also, in the lab there are only a small number of salts that synthetic chemists typically make. So in the context of InChI key based searches, a useful thing to do is to enumerate all reasonable salts for a compound that is either acidic or basic and calculate the InChI keys for these. You then have a simple synonym table that captures all reasonable representations of a drug in different salt forms. Going back to our sildenafil case, we’d be able to link data for sildenafil malate and sildenafil citrate, which is a pretty cool thing to do.
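Here is a rough Python sketch of that enumeration. Everything in it is an assumption made for illustration: the counter-ion lists are tiny made-up stand-ins for a real salt dictionary, compounds are represented by their names rather than structures, and `standin_key` is just a truncated SHA-256 of the name, not a real InChI key (a real version would generate the salt structure and run it through the InChI software).

```python
import hashlib

# Hypothetical, heavily truncated counter-ion dictionaries; a real one would
# hold the 50 or so salt formers that are actually used in practice.
COUNTERIONS_FOR_BASES = ["citrate", "malate", "hydrochloride", "mesylate"]
COUNTERIONS_FOR_ACIDS = ["sodium", "potassium", "calcium"]

def standin_key(compound: str) -> str:
    """Placeholder for a real InChI key calculation (this is NOT an InChI key)."""
    return hashlib.sha256(compound.encode("utf-8")).hexdigest()[:14].upper()

def enumerate_salt_forms(parent: str, is_basic: bool) -> list:
    """Return the parent plus one form per plausible counter-ion."""
    counterions = COUNTERIONS_FOR_BASES if is_basic else COUNTERIONS_FOR_ACIDS
    return [parent] + [f"{parent} {ion}" for ion in counterions]

def build_synonym_table(parent: str, is_basic: bool, key_fn=standin_key) -> dict:
    """Map the key of every enumerated form to the key of the parent."""
    parent_key = key_fn(parent)
    return {key_fn(form): parent_key
            for form in enumerate_salt_forms(parent, is_basic)}

# Sildenafil is dosed as a citrate salt, so it is treated as a base here.
table = build_synonym_table("sildenafil", is_basic=True)
citrate_key = standin_key("sildenafil citrate")
malate_key = standin_key("sildenafil malate")
same_parent = table[citrate_key] == table[malate_key]
```

Because every salt form maps to the same parent key, two records keyed on different salts can be joined through the table without any structure handling at query time.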
As an interesting exercise for the reader, there are some cases that require careful consideration. Atorvastatin, for instance, is a mono-anion (it carries a single negative charge) and is dosed as the calcium salt. Calcium is a di-cation (i.e. it has two positive charges), so to maintain neutrality there are two atorvastatins for one calcium (even though the USAN is atorvastatin calcium and thereby doesn’t encode the stoichiometry!). But given a suitable salt dictionary, it’s possible to handle this as a simple case-branch.
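The charge balancing itself is just a least common multiple calculation. A minimal sketch in Python, assuming the formal charges of the drug and the counter-ion are already known (the function name is made up for illustration):

```python
from math import gcd

def neutral_stoichiometry(drug_charge: int, counterion_charge: int) -> tuple:
    """Return (n_drug, n_counterion) such that the overall charge is zero."""
    if drug_charge * counterion_charge >= 0:
        raise ValueError("charges must be non-zero and of opposite sign")
    a, b = abs(drug_charge), abs(counterion_charge)
    common = a * b // gcd(a, b)  # least common multiple of the two magnitudes
    return common // a, common // b

# Atorvastatin (charge -1) with calcium (charge +2): two drugs per calcium.
atorvastatin_calcium = neutral_stoichiometry(-1, +2)

# A mono-anion with sodium (charge +1) is the simple one-to-one case.
simple_sodium_salt = neutral_stoichiometry(-1, +1)
```

In a salt enumerator, the resulting pair would decide how many copies of the parent and of the counter-ion go into the structure before the InChI key is calculated.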
The basic approach could also handle prodrugs (for cases like series of simple esters) but the semantic restrictions from this tempting hackery kick in horribly here for real world use, so do not try that at home!
It is tempting then to develop a web service, or a toolkit for local high-performance use, that has a simple fast lookup of “salted” InChI key synonyms, given either a parent compound or a specific salt; or even something that computes these salted forms in real time (but for that you’d need the InChI of course, or you’d need to have that InChI key - InChI pair already in another dictionary). Another useful application of this would be to give these InChI key equivalence tables to Google, Bing, etc.
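The lookup side of such a service could be as simple as the following in-memory Python sketch. The class name and the keys are hypothetical placeholders (deliberately not real InChI keys), and a real service would load precomputed key pairs rather than have them added by hand.

```python
from collections import defaultdict

class SaltSynonymIndex:
    """In-memory lookup from any salt-form key to all keys sharing its parent."""

    def __init__(self):
        self._parent_of = {}              # any key -> parent key
        self._members = defaultdict(set)  # parent key -> all equivalent keys

    def add(self, parent_key: str, form_key: str) -> None:
        """Register a salt-form key as equivalent to its parent key."""
        self._parent_of[parent_key] = parent_key
        self._parent_of[form_key] = parent_key
        self._members[parent_key].update({parent_key, form_key})

    def synonyms(self, key: str) -> set:
        """Return every key in the same equivalence class, or an empty set."""
        parent = self._parent_of.get(key)
        return set(self._members[parent]) if parent is not None else set()

# Placeholder keys for illustration only.
index = SaltSynonymIndex()
index.add("PARENT-KEY", "CITRATE-KEY")
index.add("PARENT-KEY", "MALATE-KEY")

from_citrate = index.synonyms("CITRATE-KEY")
from_unknown = index.synonyms("UNKNOWN-KEY")
```

Querying with the parent key or with any salt key returns the same set, which is the behaviour you’d want whether the user starts from the parent compound or from a specific salt.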