A sort of H-index for the coverage of bioactivity databases


After a long time out of the office on summer holiday, I'm just sorting out my satchel, uniform and pencil case for the autumn term. I've missed being in the office, and I've missed my group and colleagues. I've had time to think of some ideas, some bad, and some potentially useful - the group are pretty good at sorting them into the relevant piles.

So here's a little idea about quantifying the coverage/diversity of the contents of a bioactivity database (like ChEMBL, but also the internal knowledge of a company in its screening and lead optimisation programs, etc). Essentially, it's applying the H-index, regularly used for citation analysis, to bioassay results. (A researcher has an H-index of h if h of their papers have at least h citations each.) There's a lot of criticism of the H-index in its use for comparing researchers, and plenty of problems in cross-field comparison, but that is not for here. However, the H-index is a pretty robust statistic for capturing the structure of a frequency-class distribution.

In the context of bioassay data, the H-index (well, let's call it the Ch-index and Ass-index from now on to avoid confusion) can capture the number of bioassay data points for a set of compounds, or the number of compounds screened across a set of bioassays. This is probably best illustrated with a series of pictures of hypothetical bioactivity matrices - a red cell indicates the presence of a measured bioactivity, the columns are assays, and the rows are compounds.

So here is a high-throughput screen - a single assay, with a large number of tested compounds.


And here is the broad profiling of a single compound.


Here is a sparse matrix, essentially full of cherry-picked bioassay data points, each for one compound in one assay. There's very little SAR data within this set (the kind that allows the exploration of differences within an assay or compound series), so building predictive models, and a whole bunch of other stuff that one would want to do, is difficult.




Imagine some more experiments (profiling) are done on this set of assays/compounds, and you end up with a matrix such as the following. You can see there are now blocks of data, and stripes (both columns and rows). A row is a compound run across multiple assays, and a column is an assay with multiple compounds tested. Of course, the axes can be ordered to maximise the 'blockiness' of the view of the data.


Here is the matrix after some more assays are run; the Ch- and Ass-index will both increase further. The data becomes more useful, since it is likely that more of the queries one would want to make are already answered by a measurement, and for the missing ones, one would assume that better predictive models could be built.


Finally, complete knowledge: everything becomes a simple lookup, assuming the data is accurate (etc.).


So as one goes through this progression of filling in the matrix (and incurring the expense), the Ch- and Ass-indices both get larger. For the above, the possible 'space' has been confined, but of course new compounds are made all the time, and new bioassays are developed all the time, so the total possible space increases.
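To make the idea concrete, here is a minimal sketch in Python, under my own reading of the definitions: the Ch-index is the largest h such that h compounds each have at least h measured assays, and the Ass-index is the largest h such that h assays each have at least h tested compounds. The function names and the toy 6x6 matrices are invented for illustration only.

```python
def h_index(counts):
    """Largest h such that at least h of the counts are >= h."""
    h = 0
    for rank, count in enumerate(sorted(counts, reverse=True), start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def ch_and_ass_index(matrix):
    """matrix: list of rows (compounds); each row holds 0/1 per assay column.

    Returns (Ch-index, Ass-index) under the assumed definitions above.
    """
    per_compound = [sum(row) for row in matrix]        # assays measured per compound
    per_assay = [sum(col) for col in zip(*matrix)]     # compounds tested per assay
    return h_index(per_compound), h_index(per_assay)


n = 6  # toy size, purely illustrative

# One assay, every compound tested (the high-throughput screen picture).
hts = [[1] + [0] * (n - 1) for _ in range(n)]

# One compound, every assay run (the broad-profiling picture).
profile = [[1] * n] + [[0] * n for _ in range(n - 1)]

# One data point per compound, each in a different assay (the sparse picture).
sparse = [[1 if i == j else 0 for j in range(n)] for i in range(n)]

# A 3x3 block of profiling data inside the otherwise empty matrix.
block = [[1 if i < 3 and j < 3 else 0 for j in range(n)] for i in range(n)]

# Complete knowledge.
full = [[1] * n for _ in range(n)]

for name, m in [("hts", hts), ("profile", profile), ("sparse", sparse),
                ("block", block), ("full", full)]:
    print(name, ch_and_ass_index(m))
```

Under these definitions the single screen, the single profile and the sparse matrix all score (1, 1), the 3x3 block scores (3, 3), and the full matrix scores (6, 6), which matches the intuition that blocks of profiling data, rather than lone stripes, are what push the indices up.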

There are also some interesting features to the above; imagine collapsing data across assays for orthologues - if you assume that mouse, dog, human and zebra activities for a given target are all pretty much the same, you don't really 'value' an extra species added to the matrix. You can however go further, and collapse across protein families (for example Pfam domains), to get an idea of total target class diversity. Similarly, it's possible to index/cluster compounds by shared scaffolds/chemotypes, and one can imagine exploring a series of 'lenses' that allow one to view coverage from a set of different perspectives.
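A rough sketch of the orthologue 'lens', again in Python and again with made-up names and data: each assay column gets a group label (here a hypothetical target identifier shared by its orthologues), columns with the same label are merged with a logical OR, and the per-compound index is recomputed on the collapsed matrix. The same function could be applied to the rows with scaffold labels instead.

```python
def h_index(counts):
    """Largest h such that at least h of the counts are >= h."""
    h = 0
    for rank, count in enumerate(sorted(counts, reverse=True), start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def collapse_assays(matrix, labels):
    """Merge assay columns that share a label; a merged cell is 1 if any
    of its member columns holds a measurement for that compound."""
    groups = list(dict.fromkeys(labels))  # unique labels, original order kept
    return [
        [int(any(cell for cell, lab in zip(row, labels) if lab == g))
         for g in groups]
        for row in matrix
    ]


# Hypothetical example: four assays covering two targets in two species each.
labels = ["T1", "T1", "T2", "T2"]   # e.g. T1 human, T1 mouse, T2 human, T2 dog
matrix = [
    [1, 1, 0, 0],   # compound tested on both T1 orthologues
    [1, 1, 0, 0],
    [0, 0, 1, 1],   # compound tested on both T2 orthologues
    [0, 0, 0, 0],   # compound with no data
]

collapsed = collapse_assays(matrix, labels)

ch_before = h_index([sum(row) for row in matrix])
ch_after = h_index([sum(row) for row in collapsed])
print(ch_before, ch_after)
```

In this toy case the per-compound index drops from 2 to 1 once the orthologues are merged, which is the point of the lens: the second species added data points but no new target coverage.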

So how does this map on to the real world?

ChEMBL classic (for want of a better phrase) is like the 4th matrix in the series above: largely a set of stripe-like and some block-like structures around chemical series (since chemists typically explore the chemical space around a lead in optimisation, and screen against related targets). ChEMBL depositions, such as the great GSK PKIS sets, are larger blocks, with more comprehensive profiling of a set of compounds across a set of assays. PubChem BioAssay - specifically the output of the NIH MLP - is 'complete' for a relatively small set of assays, though the compounds within that set are diverse.

Finally - Ass-index is probably not the best acronym.

jpo