ChEMBL blog

SMR Meeting: Optimising Locally Delivered Drugs
06 May 2009
The Society for Medicines Research (SMR) is a UK-based society for those interested in drug research, they hold a regular series of meetings themed around particular areas.
When I got home yesterday, in my letterbox,I found the flyer for their next meeting, which looks good, so I thought I would post details here, and potentially encourage people to either go, or maybe even join the SMR.
The next meeting focussing on site-specific delivery of particular therapeutic agents and principles of local drug delivery is on the 11th June and is at Horsham in West Sussex. The program and registration form can be found here.
The picture above I found on the web, It is of a 'letterbox' letterbox, sort of recursive, eh?
StARlite schema walkthrough
04 May 2009

Now that is one cool tattoo! Hey, "StARlite" has eight letters too, that gives me a good idea for a prize in a future competition....
The next StARlite schema walkthrough will be held on Friday 8th May, at 11am BST. It will last about 45 minutes, and require dialling a UK phone number for audio. We now have a reliable web desktop sharing system (we have settled on the wonderful Open Source webhuddle, also on SourceForge) and a reliable conference phone system, so there should be no technical issues ;), this time.
If you wish to take part, please click this link for further details....
Some Queries For Data Retrieval From StARlite
04 May 2009
It is a public holiday in the UK today (we call these bank holidays, for reasons that seem obscure nowadays). The weather is traditionally bad on such days, and today is no exception, at least the remainder of the week, when we return to work, will be fine and sunny.
We will run a further StARlite schema and query walkthrough webinar shortly, but in the meantime here are some skeleton sql queries, that perform a set of related queries retrieving compounds/bioactivities for a given target. In this case the target is human PDE4A (for which the tid is 3), and human PDE5A (for which the tid is 276). We will walk through getting these unique target identifiers (or tids) on another occasion, but suffice it to say, that this is easy, especially programmatically, using blastp.
Firstly, retrieving a set of potent inhibitors of human PDE4A or PDE5A. There are a number of parameters one needs to set to actually do this (the end-point, the affinity cutoff, etc. Specifically here we have selected high confidence assay to target assignments (the a2t.confidence=7 bit), and where the potency is better than 1000nM for an IC50 measurement. This is a pretty generic query, and piping in the target tid to this covers a surprisingly frequent use the the data.
```
select  act.molregno, act.activity_type, act.relation as operator, act.standard_value, act.standard_units, 
   td.pref_name, td.organism, 
   a.description as assay_description, 
   docs.journal, docs.year, docs.volume, docs.first_page, docs.pubmed_id, cr.compound_key
from  target_dictionary td, 
   assay2target a2t,    
   assays a, 
   activities act, 
   docs, 
   compound_records cr
where  td.tid in (3,276)
and  td.tid = a2t.tid
and  a2t.confidence = 7
and  a2t.assay_id = a.assay_id
and  a2t.assay_id = act.assay_id
and  act.doc_id = docs.doc_id
and  act.record_id = cr.record_id
and  act.activity_type = 'IC50'
and  act.relation in( '=', '<')
and  act.standard_units = 'nM'
and  act.standard_value <=1000
and  a.assay_type = 'B';
```
Here is a modified form to retrieve just the compound identifiers (molregno)
```
select  distinct act.molregno
from  target_dictionary td, 
  assay2target a2t,    
  assays a,  
  activities act
where  td.tid in (3,276)
and  td.tid = a2t.tid
and  a2t.confidence = 7
and  a2t.assay_id = a.assay_id
and  a2t.assay_id = act.assay_id
and  act.activity_type = 'IC50'
and  act.relation in( '=', '<')
and  act.standard_units = 'nM'
and  act.standard_value <=1000
and  a.assay_type = 'B';
```
Also a common requirement is to get the associated molecule structures from the database - here the syntax is for an sdf format output and the query does not rely on any fancy chemical cartridge manipulation (since we store the molfiles in a clob called molfile in the COMPOUNDS table). The query here simply retrieves the structures, and not the associated bioactivity data. The goofy looking concatenations (||) and newlines (chr(10)) just make sure that a validly formatted sdf file emerges at the end.
```
select  c.molfile || chr(10) || '> ' ||chr(10)|| c.molregno||chr(10)||chr(10)||'$$$$'||chr(10)
from  compounds c, 
  (select distinct act.molregno
  from  target_dictionary td, 
    assay2target a2t,    
    assays a, 
   activities act
 where  td.tid in (3,276)
 and  td.tid = a2t.tid
 and  a2t.confidence = 7
 and  a2t.assay_id = a.assay_id
 and  a2t.assay_id = act.assay_id
 and  act.activity_type = 'IC50'
 and  act.relation in( '=', '<')
 and  act.standard_units = 'nM'
 and  act.standard_value <=1000
 and  a.assay_type = 'B') t1
where  t1.molregno = c.molregno;
```
New Staff In The Group
24 Apr 2009
We are coming to the end of our current round of recruitment, and now have new starters regularly starting, laying claim to their part of the caravan, etc.. Here is a link to the group members.
Comparative Screening File Analysis
24 Apr 2009
Here is another oldie of ours. The basic idea is to use the scaffolds in StARlite (which are all known bioactives) to select a physical compound set for screening. Of course, there are important questions over diversity, the number of compounds, but all of that detail is for another day.
Indexing of a set of 2D structures with their scaffolds is a straightforward thing to do, although there are many methods to actually do this decomposition. We did this for StARlite, and you can then look at the frequency composition of scaffolds (a power law again, quel surprise) and then we came across a similar list of ranked scaffolds from Novartis (analysis by Ertl, Koch and Roggio) I haven't seen this in the primary literature, but it was in a lovely book I picked up in the foyer of Novartis on one visit). Why not compare these lists of scaffolds? This would show the enrichment or depletion of the most common scaffolds, and one could imagine using this to select compounds for future purchase to achieve file 'balancing', or to identify areas of biology/pharmacology/targets that the compound file is particularly suitable for. For the most frequent scaffolds, they tend to be simple rings, but further down the frequency list are more complicated systems. Many, many more applications can be thought of that are related to this basic underlying comparison.
The ranked scaffold list looks like this (so benzene is the most common, then pyridine, then piperidine, down to isoquinoline in rank position 35).

The comparison of StARlite and the Novartis screening file looks like this....

So, the Novartis file has more pyrimidines, morpholines and pyrazoles compared to StARlite, and is depleted in pyrrolidines, tetrahydrofurans, purines and tetrazoles. StARlite is probably a pretty good representation of the composition of the scaffold distribution for the entire industry.
PhD Studentship within ChEMBL
16 Apr 2009
The Chemogenomics (ChEMBL) group at the European Bioinformatics Institute (EMBL-EBI) in Hinxton, Cambridge, United Kingdom, has an opening for a PhD position. The EMBL-EBI is one of four outstations of the European Molecular Biology Laboratory (EMBL) and is a great place to do research in biology, chemistry, and the interface with drug discovery. The EMBL is a unique environment for such interdisciplinary research, in a highly multi-cultural and international setting.
The ChEMBL group does research in target discovery and selection, lead discovery and optimisation and explores the role of genomics techniques in drug discovery, as such studies will involve large-scale data analysis, computing and programming, machine learning, protein structure and modelling, and broad coverage of drug discovery strategies. The successful candidate is free to choose from a range of topics (see below) or suggest ideas for their own project.
Applications need to be submitted by 15 July 2009 through the EMBL PhD programme and ideally the candidate should clearly indicate that they have a preference to work with us here at the EMBL-EBI. We would also consider favorably applicants from non-traditional academic backgrounds (re-training, career-swtiching, etc.)
Here are a few suggestions for potential projects (click here for some more details on some of these).
1. Non-human Secreted Proteins as New Therapies.
2. Monoclonal Antibody Drug ‘Rescue’.
3. Automated Drug Design (‘Robot-Chemist’).
4. Drug Design Strategies for Robustness to Acquired Resistance.
5. Drug and Target Annotation of Novel Genomes.
There is also studentships available within the excellent Steinbeck group at the EMBL-EBI (and I greatly thank Christoph for prompting me to place this posting in the ChEMBL blog, and also apologise for my shameless lifting of his text ;) )
Bioactive Peptides
15 Apr 2009
Peptides (short polymers built from amide-linked alpha-amino acids) are one of the largest classes of bioactive compounds. Many drugs are peptides, or peptide derivatives; furthermore the ready accessibility of amino-acid monomers and chemistry for reliable assembly have led to very extensive characterisation of peptides as bioactives. An additional very useful property of peptide derivatives is that due to their modular nature, and the conformational constraints enforced by stereochemistry and the peptide backbone geometry, it is often possible to get some pretty good QSARs derived from amino-acid sidechain properties. It is also possible to automatically classify peptides into various subclasses (natural, non-natural, N-capped, C-capped, cyclic, etc,) in some sort of ontology-based classification.
The following is a brief overview of the peptide content of StARlite (release 31). In total, 41,128 compounds contain the simplest possible dipeptide substructure (di-glycine), this corresponds to about 9% of StARlite; so as a first approximation it is possible to say that 9% of StARlite is peptidic in nature (this also happens to be the largest single non-trivial structural class in StARlite). A table was then built of all distinct peptide units of a given length (up to 10 amino-acids in this case). The data is as follows....

peptide length # length or longer # exact length

2 41,128 16,512

3 24,616 6,079

4 18,537 3,228

5 15,309 2,397

6 12,912 2,369

7 10,543 1,769

8 8,774 1,905

9 6,869 1,302

10 5,567 1,242

Considering all possible natural amino-acid dipeptides gives 400 distinct dipeptides (20^2), this compares to the 16,512 dipeptides found in StARlite, implying a very diverse and expanded set of amino-acids. It would be pretty interesting to find out what fraction of the 400 possible natural dipeptides are actually sampled. Of course, much of the variation of the dipeptides will come from groups attached N- and C-terminal to the dipeptide, but even so, the sampled variation of sidechains is pretty good. There are 8,000 distinct tri-peptides (20^3) constructed from the 20 natural amino-acids, it is clear that, even assuming the tripeptides are all simple and unelaborated) that there is poor coverage of tripeptides (6,079 vs 8,000) - chemical diversity scales very poorly! It is also pretty clear that there is a pseudo-power law distribution to the observed peptide length distribution (see below).
Here is a graph, I know it is bad practice not having labelled axes, but the x-axis is the peptide length, and the y-axis is the frequency of that class. Green is the class of that length or more, and yellow is of that exact length class.

I have also pulled back some ligand efficiency data for this peptide set, at first glance, it looks very interesting..... More later.
Books and Papers - 10 - A Travel Guide To Scientific Sites Of The British Isles
13 Apr 2009
Oh my goodness! My head is splitting - I acquired a headcold on Friday, and today am still laid low. Other than interviews tomorrow, I am sorely tempted to spend the following day with Mr. Duvet and Mrs. Pillow.....
Anyway, a book, if the preface is not a call to arms, I don't know what is, this really is an essential book for every scientist (who lives in or visits the British Isles). It is also a book that when I reach for it from the shelves, the kids run to tidy up, or state they have homework. regardless, I will quote from the first couple of sentences from the preface....
The Population of the British Isles is less than 0.2% that of the entire earth (sic); yet this tiny fraction of human society is responsible for an enormous number of cultural advances in both the arts and sciences. Public appreciation for the men and women of Britain and Ireland who wrote, painted, composed music, etc. is evident wherever one looks, but the recognition of explorers of nature are harder to find.'

For example, did you know that the Occam of Occam's Razor, is derived from William of Ockham in Surrey! Cool!
```
%T A Travel Guide To Scientific Sites Of The British Isles
%A Charles Tanford
%A Jacqueline Reynolds
%D 1995
%I John Wiley & Sons
%O ISBN 0-471-95070-2
```

peptide length	# length or longer	# exact length
2	41,128	16,512
3	24,616	6,079
4	18,537	3,228
5	15,309	2,397
6	12,912	2,369
7	10,543	1,769
8	8,774	1,905
9	6,869	1,302
10	5,567	1,242