Privileged Scaffold Identification from StARlite data
A common requirement in hit discovery is the selection of a set of compounds likely to be bioactive against a particular target or target family (for example, a protein kinase focussed library). There are many ways of doing this, and when screening a new target this sort of approach is often very cost-effective. Usually there is the concept of a set of active series, or chemotypes, for a target family, and this is often expressed as a 'scaffold': the chemical core, often a rigid part of a molecule with all the surface 'decoration' removed. More particularly, this 'scaffold' is often associated with synthetic accessibility concepts for use in library synthesis, parallel chemistry and so forth. Anyway, setting aside some of the details and complexities of deciding 'what is a series', 'what is a scaffold', etc., here is an annotated workflow.
In practice, the active and pseudo-inactive sets can be used to build far more sophisticated activity classifiers, pharmacophores, etc.
Let's look at a toy example.
We ran a BLAST search on StARlite using the sequence of human MCH-1 as the query. This pulls out a list of targets ranked by similarity to MCH-1. We selected a set of similar sequences as our MCH-1-like active set, and extracted all molecules and assays assigned to this set of sequences. We then looked at the frequency rank of each scaffold within the active set (dark blue) compared to its rank in the entire database (light blue). Next, we computed the enrichment ratio of the active set to the entire database (Fragment Specificity, in red). A non-enriched scaffold will show a value close to zero; the higher the degree of association, the higher the score. Finally, we 'normalised' the relative enrichment by the molecular weight of the scaffold (Fragment Elegance, light green).
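As a rough illustration of the two scores, here is a minimal Python sketch. It is not the code we used, and it makes several assumptions: scaffolds are already assigned to each molecule as plain string labels, Fragment Specificity is taken as the natural log of the frequency ratio (chosen only because a non-enriched scaffold should then score close to zero), and Fragment Elegance is taken as specificity divided by scaffold molecular weight. The function name `scaffold_scores` and the toy data are hypothetical.

```python
import math
from collections import Counter


def scaffold_scores(active_scaffolds, database_scaffolds, scaffold_mw):
    """Return {scaffold: (specificity, elegance)} for scaffolds in the active set.

    specificity: log of (frequency in active set / frequency in database).
                 ASSUMPTION: a log ratio, so a non-enriched scaffold scores 0.
    elegance:    specificity divided by scaffold molecular weight.
                 ASSUMPTION: one simple reading of 'normalised by weight'.
    """
    active_counts = Counter(active_scaffolds)
    db_counts = Counter(database_scaffolds)
    n_active = len(active_scaffolds)
    n_db = len(database_scaffolds)

    scores = {}
    for scaffold, count in active_counts.items():
        db_count = db_counts.get(scaffold, 0)
        if db_count == 0:
            # The active set should be a subset of the database;
            # skip anything that is not, rather than divide by zero.
            continue
        active_freq = count / n_active
        db_freq = db_count / n_db
        specificity = math.log(active_freq / db_freq)
        elegance = specificity / scaffold_mw[scaffold]
        scores[scaffold] = (specificity, elegance)
    return scores


# Toy data: one scaffold label per molecule (not real StARlite content).
database = ["benzene"] * 6 + ["indole"] * 3 + ["other"] * 1
active = ["benzene"] * 3 + ["indole"] * 2
weights = {"benzene": 78.11, "indole": 117.15}  # approximate g/mol

scores = scaffold_scores(active, database, weights)

# Rank scaffolds by elegance, highest first.
ranked = sorted(scores, key=lambda s: scores[s][1], reverse=True)
for name in ranked:
    spec, eleg = scores[name]
    print(f"{name}: specificity={spec:.3f}, elegance={eleg:.5f}")
```

In this toy set 'benzene' is equally common in the active set and the database, so its specificity is zero, while 'indole' is over-represented among the actives and scores positively. With real data the scaffold labels would come from a scaffold-extraction step, and rare scaffolds would probably need a minimum count threshold before their ratios mean much.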
So overall, we identify a set of scaffolds that are biased towards a particular activity class. In this case, scaffolds 6 and 7 (dark blue) are probably the best, whilst scaffold 8, although more strongly associated with MCH-1-like activity (a higher red score), is less attractive due to its inherently higher molecular weight (and synthetic access difficulties).