Are there around 10^19 Lipinski-like small molecules?


I'm a big fan of the work of Jean-Louis Reymond at the University of Berne, and am starting to imagine a time when the enormity of chemical space can be reasonably comprehensively mapped and explored, at least for 'fragment-sized' molecules. In the field of bioinformatics, the number of possible peptides is considered quite large - for example, for a peptide composed from the 20 natural amino acids, there are 20^10 possible distinct decapeptides (this is 10,240,000,000,000, or 1.024 x 10^13), which is a big number of course, but not that big, and a decapeptide will have an average molecular weight of about 1,100 Da. For a 500ish molecular weight natural peptide (a pentapeptide) there are only 3.2 million possibilities. However, small molecules comprehensively trash these 'biologically constrained' numbers, making cheminformatics, I think, a great frontier and challenge for HPC and "large data".
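To make the arithmetic concrete, here's a quick back-of-the-envelope sketch in Python (the ~110 Da average residue mass is my assumption, not a figure from the text):

```python
# Back-of-the-envelope peptide combinatorics.
# Assumption: ~110 Da average residue mass, plus ~18 Da for the terminal water.

N_AMINO_ACIDS = 20
AVG_RESIDUE_MASS = 110.0   # Da, assumed average for the 20 natural amino acids
WATER_MASS = 18.0          # Da, added once per peptide

def peptide_count(length):
    """Number of distinct linear peptides of a given length."""
    return N_AMINO_ACIDS ** length

def peptide_mass(length):
    """Rough average molecular weight of a peptide of a given length."""
    return length * AVG_RESIDUE_MASS + WATER_MASS

print(f"decapeptides:  {peptide_count(10):,}  (~{peptide_mass(10):,.0f} Da)")
# -> 10,240,000,000,000 (~1,118 Da), i.e. 1.024 x 10^13
print(f"pentapeptides: {peptide_count(5):,}  (~{peptide_mass(5):,.0f} Da)")
# -> 3,200,000 (~568 Da), the '500ish molecular weight' case
```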

The GDB databases give some idea of the size of drug-like chemical space. If you take the current GDB databases, and plot the size of the library as a function of the number of heavy atoms...

...you get the classic case for a log plot - essentially the largest library is so much bigger than the smaller sets that it dominates the total compound count. So on a linear scale it looks like this,

but on a log scale it's approximately linear, and a regression can readily be established against this.
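As a sketch of what that regression looks like, here's a minimal Python/numpy version. The library sizes plugged in (GDB-11 ≈ 2.6 x 10^7, GDB-13 ≈ 9.8 x 10^8, GDB-17 ≈ 1.7 x 10^11 molecules) are approximate published figures I'm treating as assumptions, and the slope you get depends on which releases you include:

```python
import numpy as np

# Approximate published GDB library sizes (assumed here, not taken from the post):
# heavy-atom limit -> number of enumerated molecules
gdb_sizes = {
    11: 2.64e7,    # GDB-11
    13: 9.77e8,    # GDB-13
    17: 1.66e11,   # GDB-17
}

heavy_atoms = np.array(sorted(gdb_sizes))
log_counts = np.log10([gdb_sizes[n] for n in heavy_atoms])

# Linear regression in log10 space: log10(count) ~ slope * n + intercept
slope, intercept = np.polyfit(heavy_atoms, log_counts, 1)
print(f"slope ~ {slope:.2f} log10 units per heavy atom, intercept ~ {intercept:.2f}")
# On a linear axis the largest library swamps the rest; in log10 space the
# points sit close to this straight line, which is what makes extrapolation tempting.
```

Extrapolating a straight line fitted to the small libraries far beyond the data is of course a leap of faith, so the 33-heavy-atom figure in the next paragraph is best read as an order-of-magnitude estimate.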


So, extrapolating to a GDB enumerated out to 33 heavy atoms (which, at an average heavy atom mass of 15 Da, corresponds to a molecular weight of around 500) gives about 10^19 to 10^20 distinct molecules. Of course, there are a bunch of assumptions behind the GDB enumeration approach (limited elements, but sensibly limited), and the fraction of Lipinski-compliant molecules within that set is an open question, but even if only 1% are, it doesn't affect the scale of this number too much.
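Two quick sanity checks on that paragraph (a sketch only; the 10^20 starting point is just the upper end of the quoted range, and the 1% compliance figure is the hypothetical above):

```python
# Molecular weight check: 33 heavy atoms at ~15 Da each lands right around
# the Lipinski 500 Da limit.
print(33 * 15)            # -> 495

# Even a pessimistic Lipinski-compliant fraction barely dents the exponent:
total = 1e20              # upper end of the extrapolated range
lipinski_fraction = 0.01  # assume only 1% of enumerated molecules comply
print(f"{total * lipinski_fraction:.0e}")   # -> 1e+18, still a vast library
```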

10^19 molecules is too big to even think about storing - as SMILES alone it is a zettabyte-scale storage problem - but smart subset sampling, and the ever-growing advances in data compression, processor power and connectivity, will no doubt start to chip away at this challenge of chemical comprehensiveness.

As an aside - a Google search shows that one of the largest storage arrays in the world at the moment is a 150 petabyte system at IBM Almaden - so 1 zettabyte is about 7,000 times the size of this.
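As a rough check on the zettabyte claim (a sketch; the ~65 bytes per canonical SMILES is my assumed average for molecules of this size, not a figure from the post):

```python
# Rough storage estimate for 10^19 SMILES strings.
n_molecules = 1e19
bytes_per_smiles = 65          # assumed average length for a ~33-heavy-atom molecule
total_bytes = n_molecules * bytes_per_smiles

zettabyte = 1e21               # bytes (decimal ZB)
petabyte = 1e15

print(f"{total_bytes / zettabyte:.2f} ZB")       # -> 0.65 ZB, i.e. zettabyte scale
print(f"{zettabyte / (150 * petabyte):,.0f}x")   # 1 ZB vs a 150 PB array -> ~6,667x
```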