Ever had a funny dream? The InChI filesystem.
I had a dream recently. I occasionally have really technical dreams where, unlike in the real world, I'm smart, I have great insights and I solve 'big' problems. Then, frustratingly, the great insight of the dream disappears and I'm left with a half-formed memory and none of the insight. Of the hundred or so dreams like this I have ever had, a few have actually led to some interesting research, which I think has been useful: never as grand as the impact in my dream, but useful. But I also have dreams about discovering new British mushroom species, or fishing for electric eels with capacitors and resistors as lures, so to be clear, most of my dreams are simple nonsense.
The dream was set in the future, and I was running a group looking after the largest chemical database in the world (clearly a fantasy!!). The database was huge, about 10^20 molecules. We used InChIs for chemical structures, but to get over a lot of the storage problems we used the filesystem structure itself to hold the structures and the relations between them. At the tips of the filesystem were just a set of standard files containing descriptors of the molecule: ./logP, containing a logP value, alongside a bunch of other useful descriptors. In this system we treated the InChI as the complete filename, with the slash layer separators (/) acting as directory separators, so all the isomers of C3H4F2 were stored beneath that directory on the /InChI=1 root filesystem. So really this is just using the hierarchical structure of the InChI itself as a directory tree.
In the database, we used links between files to store relationships (say, from all tautomers to a standard InChI), with different types of link for isomers, salts, etc. We did this to save space: the sheer size of the data ruled out storing it in a database, and there was never any prospect of holding it in core memory. This InChI filesystem approach was very efficient and scalable. (There was something in the dream as well about having to use ZFS, which can currently scale to 16 Exabytes as a single volume; we'd optimized it, though, for the really small block sizes required by the data.) The directory and file dates stored the date of registration, which was important for patent novelty checking. Querying the database was based on 'extended' unix filesystem tools, like a pharmacophore-enabled 'find'. The duality of the filename as a location on a disk and a location on the internet, and the ubiquity and beauty of everything in unix being a file, also played their role.
Finally, there was something really really cool for drug discovery that could be done precisely because of this InChI as a filesystem model, and that's the bit that's missing from the recollection of the dream.
What a bummer!
Here's the directory containing the data for our good friend aspirin.
/InChI=1/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)
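To make the mapping concrete, here is a minimal Python sketch of how an InChI could be turned into that directory path, with a descriptor file at the tip. The function names (`inchi_to_path`, `register`) and the logP value are my own placeholders for illustration, not part of the dream system or the toy filesystem, and it assumes a POSIX-style filesystem where characters like `(`, `,` and `=` are fine in filenames.

```python
import tempfile
from pathlib import Path

# The aspirin InChI used in this post.
ASPIRIN = "InChI=1/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)"


def inchi_to_path(root: Path, inchi: str) -> Path:
    """Map an InChI onto a path under root, one directory per InChI layer."""
    if not inchi.startswith("InChI="):
        raise ValueError(f"not an InChI string: {inchi!r}")
    # The InChI layer separator is '/', so each layer becomes a directory name.
    return root.joinpath(*inchi.split("/"))


def register(root: Path, inchi: str, descriptors: dict) -> Path:
    """Create the molecule's directory and write one small file per descriptor."""
    mol_dir = inchi_to_path(root, inchi)
    mol_dir.mkdir(parents=True, exist_ok=True)
    for name, value in descriptors.items():
        (mol_dir / name).write_text(f"{value}\n")
    return mol_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # "0.0" is a placeholder, not a measured logP for aspirin.
        mol_dir = register(Path(tmp), ASPIRIN, {"logP": "0.0"})
        print(mol_dir.relative_to(tmp))
        print((mol_dir / "logP").read_text().strip())
```

Because every layer is a directory, everything sharing a formula ends up under the same `C9H8O4`-style directory without any extra indexing.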
This blogpost was based closely on a mail I sent to a friend/collaborator (you know who you are). I've also added a few more sentences to increase readability and accessibility for those outside of the InChI field. A few spurious personal references have been removed from the original text (particular people had roles, and to prevent them appearing as apparent fact in Google, I've removed them; and anyway, the people may not have enjoyed their dream roles ;) ).
I've explained the contents of the dream to a few people now, and the story usually makes people smile (always a good sign), go quiet (an even better sign) and then ask a bunch of questions that try and dismiss it as fantasy (the best sign of all).
Since that time, we've reduced the core idea to practice:
- There's a tarfile of a toy InChI filesystem (thanks Gerard), with which you can do a surprising amount of chemoinformatics using just ls and cat.
- There's some initial work comparing the efficiency and scaling of this filesystem approach with classical prefix and suffix trees (thanks Michał), although these seem to have scaling problems.
- General cheminformatics InChI-related stuff (thanks Francis).
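
The links and registration dates from the dream can be sketched in the same spirit. This is a rough Python illustration under my own assumptions, not the toy tarfile itself: a symlink named after the relation (here the hypothetical name `standard`) points from one InChI directory to another, the directory's modification time stands in for the registration date, and a small walk plays the role of a date-restricted `find`. The timestamps and the `logP` contents are placeholders, and symlinks assume a POSIX-style filesystem.

```python
import os
import tempfile
from datetime import datetime
from pathlib import Path

# Layers of the aspirin InChI from this post, without the version prefix.
ASPIRIN_LAYERS = "C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)"


def add_molecule(root: Path, inchi: str) -> Path:
    """Create the directory for an InChI with a placeholder descriptor file."""
    mol_dir = root.joinpath(*inchi.split("/"))
    mol_dir.mkdir(parents=True, exist_ok=True)
    (mol_dir / "logP").write_text("placeholder\n")
    return mol_dir


def link(source: Path, target: Path, relation: str) -> None:
    """Store a typed relationship as a symlink named after the relation."""
    (source / relation).symlink_to(target, target_is_directory=True)


def stamp(mol_dir: Path, registered: float) -> None:
    """Use the directory mtime as the registration date (call this last)."""
    os.utime(mol_dir, (registered, registered))


def registered_before(root: Path, cutoff: float) -> list:
    """Return molecule directories (those holding files) older than cutoff."""
    hits = []
    # os.walk does not descend into symlinked directories by default.
    for dirpath, _dirnames, filenames in os.walk(root):
        if filenames and os.stat(dirpath).st_mtime < cutoff:
            hits.append(Path(dirpath).relative_to(root).as_posix())
    return sorted(hits)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        standard = add_molecule(root, "InChI=1S/" + ASPIRIN_LAYERS)
        variant = add_molecule(root, "InChI=1/" + ASPIRIN_LAYERS)
        link(variant, standard, "standard")
        # Hypothetical registration dates, purely for illustration.
        stamp(standard, datetime(2005, 1, 1).timestamp())
        stamp(variant, datetime(2009, 1, 1).timestamp())
        print(registered_before(root, datetime(2007, 1, 1).timestamp()))
        print(os.readlink(variant / "standard"))
```

One thing this makes obvious: adding a file or link to a directory changes its modification time, so a real system would need to be careful about when the "registration date" gets written.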
If you'd like to know more, get in touch via the comments.