ChEMBL blog

Are there around 10¹⁹ Lipinski-like small molecules?
06 May 2012

I'm a big fan of the work of Jean-Louis Reymond at the University of Berne, and am starting to imagine a time when the enormity of chemical space can be reasonably comprehensively mapped and explored, at least for 'fragment-sized' molecules. In the field of bioinformatics, the number of possible peptides is considered quite large. For example, for a peptide composed from the 20 natural amino acids, there are 20¹⁰ possible distinct decapeptides. This is 10,240,000,000,000, or 1.024 x 10¹³, which is a big number of course, but not that big, and a decapeptide will have an average molecular weight of about 1,100 Da. For a 500ish molecular weight natural peptide (a pentapeptide) there are only 20⁵, or 3.2 million, possibilities. However, small molecules comprehensively trash these 'biologically constrained' numbers, making cheminformatics, I think, a great frontier and challenge for HPC and "large data".
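As a quick check on the peptide numbers above, here is a minimal Python sketch. The average residue mass of 110 Da is an assumption chosen to match the "about 1,100 Da" figure for a decapeptide; the function names are just for illustration.

```python
# Count distinct linear peptides built from the 20 natural amino acids:
# one independent choice of residue per position.
N_AMINO_ACIDS = 20
AVERAGE_RESIDUE_MASS_DA = 110  # assumption: rough average residue mass


def peptide_count(length: int) -> int:
    """Number of distinct sequences of the given length."""
    return N_AMINO_ACIDS ** length


def approx_mass_da(length: int) -> int:
    """Very rough molecular weight estimate for a peptide of this length."""
    return length * AVERAGE_RESIDUE_MASS_DA


for length in (5, 10):
    print(length, peptide_count(length), approx_mass_da(length))
```

A pentapeptide comes out at roughly 550 Da under this assumption, which is the "500ish" case in the text.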

The GDB databases give some idea of the size of drug-like chemical space. If you take the current GDB databases, and plot the size of the library as a function of the number of heavy atoms...

...you get a classic log plot: the largest library is so much bigger than the smaller sets that it dominates the number of compounds. So on a linear scale plot it looks like this,

but on a log scale, it's approximately linear, and a regression can be readily fitted to it.

So, extrapolating to a GDB containing 33 heavy atoms (which, at an average heavy atom mass of 15 Da, corresponds to a molecular weight of around 500) gives about 10¹⁹ to 10²⁰ distinct molecules. Of course, there are a bunch of assumptions behind the GDB enumeration approach (limited elements, but sensibly limited). The fraction of Lipinski-compliant molecules within that set is an open question, but even if only 1% are, then it doesn't affect this number too much.
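The extrapolation can be sketched in a few lines of Python: fit a straight line to log10(library size) against heavy atom count, then read off the value at 33 atoms. The data below are synthetic placeholders, not real GDB counts, so the printed prediction is only a demonstration of the method and not the estimate quoted in this post. Function names are illustrative.

```python
import math


def fit_log10_linear(heavy_atoms, library_sizes):
    """Least-squares fit of log10(size) = slope * n_atoms + intercept."""
    xs = list(heavy_atoms)
    ys = [math.log10(size) for size in library_sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_log10_size(n_heavy_atoms, slope, intercept):
    """Extrapolated log10(library size) at a given heavy atom count."""
    return slope * n_heavy_atoms + intercept


# Synthetic placeholder data, NOT real GDB counts: sizes that grow by
# exactly one order of magnitude for every two heavy atoms.
demo_atoms = [8, 10, 12, 14]
demo_sizes = [10 ** (0.5 * n) for n in demo_atoms]

slope, intercept = fit_log10_linear(demo_atoms, demo_sizes)
print("slope per heavy atom:", round(slope, 3))
print("log10(size) at 33 atoms:", round(predict_log10_size(33, slope, intercept), 2))
print("approx. molecular weight at 33 atoms:", 33 * 15, "Da")
```

To repeat the analysis properly, replace `demo_atoms` and `demo_sizes` with the published sizes of the GDB libraries.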

10¹⁹ is too big to even think about storing - as SMILES alone it is a zettabyte-scale storage problem - but smart subset sampling, and the ever-growing advances in data compression, processor power and connectivity, will no doubt start to chip away at this challenge of chemical comprehensiveness.

As an aside - a Google search shows that one of the largest storage arrays in the world at the moment is a 150 petabyte system at IBM Almaden - so 1 zettabyte is about 7,000 times the size of this.
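The storage arithmetic can be checked with a short Python sketch. The figure of 50 bytes per SMILES string is an assumption (a rough guess for a molecule of around 33 heavy atoms, uncompressed); the other numbers come from the text.

```python
ZETTABYTE = 10 ** 21  # bytes, decimal SI prefix
PETABYTE = 10 ** 15   # bytes

n_molecules = 10 ** 19
bytes_per_smiles = 50  # assumption: rough uncompressed size of one SMILES string

# Total uncompressed size of the library as plain SMILES.
total_bytes = n_molecules * bytes_per_smiles
print("library size in zettabytes:", total_bytes / ZETTABYTE)

# How many 150 petabyte arrays would be needed to hold 1 zettabyte.
array_bytes = 150 * PETABYTE
print("150 PB arrays per zettabyte:", round(ZETTABYTE / array_bytes))
```

Under this assumption the library lands within an order of magnitude of a zettabyte, which is all the "zettabyte scale" claim needs.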
PPI Library - Part 3
03 May 2012

It turns out that scientists and the rest of the world interpret 'PPIs' as very different acronyms - as the amount of Payment Protection Insurance comment spam I’ve had to delete shows. Anyway, life got in the way of science for a few weeks for me ( :( ), but some more of the PPI work is described here.

A very simple algorithm was applied to build a library of experimental peptide conformers. Firstly, every tetra-peptide from a protein structure was extracted; one of these peptides was then taken as a seed for a conformational cluster, and subsequent tetra-peptides were fitted to this fragment. If the RMSD for the main-chain atoms was lower than a cutoff parameter, the original 'seed' fragment was taken as representative of that cluster. If the RMSD was greater than the cutoff, then a new cluster was established, and any subsequent tetra-peptides were fitted to both cluster representatives, and so forth. As more unique peptide conformers are seen (defined according to the RMSD cutoff) the number of clusters increases. Of course, the population of each cluster is stored - some conformers are really common (alpha-helix and beta-strand fragments) and others are rare/experimental errors.

At a large cutoff parameter, all tetra-peptides would cluster in the same set as the initial seed, and at a sufficiently small cutoff, then every tetra-peptide would be unique.
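The clustering scheme described above can be sketched in Python as follows. This is not the code used for the analysis. Two assumptions are made: a fragment joins the first representative that falls within the cutoff, and, to keep the example free of dependencies, a distance-matrix RMSD stands in for the fitted main-chain RMSD. The distance-matrix version needs no superposition, but unlike a fitted RMSD it cannot tell mirror images apart.

```python
import math
from itertools import combinations


def distance_rmsd(a, b):
    """RMSD between the internal distance matrices of two coordinate lists.

    Swapped in for a superposed RMSD; invariant to rotation and translation.
    """
    pairs = list(combinations(range(len(a)), 2))
    total = 0.0
    for i, j in pairs:
        total += (math.dist(a[i], a[j]) - math.dist(b[i], b[j])) ** 2
    return math.sqrt(total / len(pairs))


def leader_cluster(fragments, cutoff, distance=distance_rmsd):
    """Greedy clustering: join the first representative within `cutoff`,
    otherwise become the seed of a new cluster."""
    representatives = []
    populations = []
    for frag in fragments:
        for k, rep in enumerate(representatives):
            if distance(frag, rep) < cutoff:
                populations[k] += 1
                break
        else:
            representatives.append(frag)
            populations.append(1)
    return representatives, populations


# Toy four-point "fragments" (hypothetical coordinates, not real peptides).
straight = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]
shifted = [(5, 5, 5), (6, 5, 5), (7, 5, 5), (8, 5, 5)]  # same shape, translated
bent = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]

reps, pops = leader_cluster([straight, shifted, bent], cutoff=0.5)
print(len(reps), pops)
```

Passing a different `distance` function (for example a Kabsch-fitted RMSD from a structural biology library) would bring the sketch closer to the procedure in the text.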

When applied to 2ptn (bovine trypsin, for deeply rooted reasons my favorite PDB entry ever, and one that contains most features of globular proteins: secondary and super-secondary structure, turns, etc.), the following numbers of representative clusters were found, shown as a function of the RMSD cutoff. One way of thinking of this approach is that the library can be thought of as containing every possible peptide conformation, at a given error/variation/resolution. So, it’s a sort of variable ‘resolution’ library. For 2ptn, you can see that the library complexity takes off below about 0.7 Angstrom RMSD. There is an asymptote at around 220, since this is about the number of residues in 2ptn.
There are a few tricks that need handling in the code, primarily in the treatment of peptides that span chain breaks in the protein structure - for this analysis, the four residues needed to be covalently contiguous (i.e. no internal chain breaks).
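One way to handle the chain-break requirement is to test the C to N distance between neighbouring residues before accepting a window. The Python sketch below assumes each residue is a dictionary holding its backbone "N" and "C" coordinates, and uses a 1.5 Angstrom limit for a peptide bond; both the data layout and the limit are assumptions for illustration, not details from the original code.

```python
import math

PEPTIDE_BOND_MAX = 1.5  # assumption: longer C-N distances (Angstrom) are chain breaks


def is_bonded(prev_res, next_res, max_dist=PEPTIDE_BOND_MAX):
    """True if the carbonyl C of one residue is bonded to the N of the next."""
    return math.dist(prev_res["C"], next_res["N"]) <= max_dist


def contiguous_windows(residues, size=4):
    """Yield every run of `size` consecutive residues with no chain break."""
    for start in range(len(residues) - size + 1):
        window = residues[start:start + size]
        if all(is_bonded(a, b) for a, b in zip(window, window[1:])):
            yield window


def toy_residue(x):
    """Hypothetical residue laid out along the x axis, for the demo only."""
    return {"N": (x, 0.0, 0.0), "C": (x + 1.7, 0.0, 0.0)}


# Five bonded residues, a gap, then four more bonded residues.
chain = [toy_residue(3.0 * i) for i in range(5)]
chain += [toy_residue(3.0 * i + 5.0) for i in range(5, 9)]

windows = list(contiguous_windows(chain))
print(len(windows), "tetra-peptides out of", len(chain) - 3, "possible windows")
```

Checking geometry rather than residue numbers matters for entries whose numbering has gaps or insertion codes even though the chain is unbroken.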

So, we now have a way of building a representative library of peptide conformers that we can think about using as scaffolds for mimicking in our PPI library (as well as the mainchain donor/acceptor positions, we also have the C-alpha to C-beta vectors).

The next step is to extend this approach to a larger, more representative library of protein structures. Let's use a validated (but ancient) paper for this.
```
%A U. Hobohm
%A C. Sander
%T Enlarged representative set of protein structures
%J Protein Science
%V 3
%P 522-524
%D 1994
```
Trivia: The photo above is of one of my sons, on mayoral voting day 2012, in a very wet London. You are never too young to learn about politics!

Update: Sorry the figures got barfed by the blogger software with a bad url, and got lost, so I've replaced them.
Deadline Approaching for Computational Drug Discovery Course
03 May 2012

The deadline of 7th May is quickly approaching to register for the course we are hosting, here in Hinxton - "Joint EMBL-EBI and Wellcome Trust Resources for Computational Drug Discovery". This joint EBI-Wellcome Trust course aims to teach participants the principles of chemical biology and how to use computational methods to probe, explore and modulate biological systems using chemical tools. The course will consist of a mixture of lectures and hands-on components. The conceptual framework will be covered, as well as direct practical experience of retrieving and analysing chemogenomics data. Participants will be able to do their own target analysis and identify appropriate chemical tools for probing biological systems of interest to them.

Check out more details on the link, above.
How far behind the patent literature is the primary literature?
02 May 2012

I can't believe that people haven't looked at this before, and I should have looked on 'The Interwebs', so I'm not claiming this is world leading or anything; but here's a little bit of analysis on joining ChEMBL with some of the recently released patent data - addressing the question - 'how far does the published literature lag behind the patent literature?'. The basic workflow is to identify a set of compounds in ChEMBL for which we have patent data, then get dates for the patents, and for each molecule in both sets calculate the difference between the earliest literature date and the earliest patent date. This is what the distribution looks like....

So, on average it looks like the literature is about two to three years behind the patent literature, which is closer than I thought. The eagle-eyed will of course note that there are quite a few negative values here - so a patent was filed containing the compound structure after a literature publication. A key point though is that there is no distinction between the compound being in the claims of the patent as opposed to just mentioned.
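For anyone wanting to try something similar, the date comparison can be sketched in Python like this. The compound identifiers and dates are made up for illustration, and the record layout of `(compound_id, date)` pairs is an assumption; the actual data sources and joins are not shown here.

```python
from datetime import date


def earliest_dates(records):
    """Map each compound id to its earliest date in (compound_id, date) records."""
    earliest = {}
    for compound_id, when in records:
        if compound_id not in earliest or when < earliest[compound_id]:
            earliest[compound_id] = when
    return earliest


def literature_lag_years(literature_records, patent_records):
    """Years from earliest patent to earliest paper, for compounds in both sets.

    Negative values mean the paper came before the patent.
    """
    lit = earliest_dates(literature_records)
    pat = earliest_dates(patent_records)
    lags = {}
    for compound_id in lit.keys() & pat.keys():
        lags[compound_id] = (lit[compound_id] - pat[compound_id]).days / 365.25
    return lags


# Hypothetical records, for illustration only.
literature = [
    ("CPD_A", date(2005, 6, 1)),
    ("CPD_A", date(2007, 1, 1)),
    ("CPD_B", date(2001, 3, 1)),
    ("CPD_C", date(2010, 1, 1)),  # no patent record, so it is skipped
]
patents = [
    ("CPD_A", date(2003, 6, 1)),
    ("CPD_B", date(2002, 3, 1)),
]

lags = literature_lag_years(literature, patents)
for compound_id in sorted(lags):
    print(compound_id, round(lags[compound_id], 2))
```

A histogram of the values in `lags` would give a distribution of the kind discussed above.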

More to do on this, but it's an interesting start. If there's interest in exactly what was done, source of data, etc., I can go into that in the comments section....

Thanks to George for pulling together the data!
New Drug Approvals 2012 - Pt. X - Avanafil (Stendra™)
01 May 2012

ATC code: G04BE (partial) Wikipedia: Avanafil

On April 27th, the FDA approved Avanafil (tradename: Stendra; Research Code: TA-1790), a phosphodiesterase 5 (PDE5) inhibitor for the treatment of erectile dysfunction (ED). ED is a sexual dysfunction characterized by the inability to produce an erection of the penis. The physiologic mechanism of penile erection involves the release of nitric oxide in the corpus cavernosum during sexual stimulation, which in turn activates the enzyme guanylate cyclase, resulting in increased levels of cyclic guanosine monophosphate (cGMP). cGMP produces relaxation of smooth muscle tissues, which in the corpus cavernosum results in vasodilation and increased blood flow. Avanafil (PubChem: CID9869929, ChemSpider: 8045620) enhances the relaxant effects of cGMP by selectively inhibiting PDE5 (ChEMBL: CHEMBL1827; Uniprot: O76074), an enzyme responsible for the degradation of cGMP.

Other PDE5 inhibitors are already available on the market and these include Sildenafil (approved in 1998; tradename: Viagra, Revatio; ChEMBL: CHEMBL192), Tadalafil (approved in 2003; tradename: Cialis; ChEMBL: CHEMBL779) and Vardenafil (approved in 2003; tradename: Levitra; ChEMBL: CHEMBL1520). These other PDE5 inhibitors are also approved for the treatment of pulmonary arterial hypertension (PAH).

PDE5 is an 875 amino acid-long enzyme (EC=3.1.4.35), belonging to the cyclic nucleotide phosphodiesterase family (PFAM: PF00233).
```
>PDE5A_HUMAN cGMP-specific 3',5'-cyclic phosphodiesterase
MERAGPSFGQQRQQQQPQQQKQQQRDQDSVEAWLDDHWDFTFSYFVRKATREMVNAWFAE
RVHTIPVCKEGIRGHTESCSCPLQQSPRADNSAPGTPTRKISASEFDRPLRPIVVKDSEG
TVSFLSDSEKKEQMPLTPPRFDHDEGDQCSRLLELVKDISSHLDVTALCHKIFLHIHGLI
SADRYSLFLVCEDSSNDKFLISRLFDVAEGSTLEEVSNNCIRLEWNKGIVGHVAALGEPL
NIKDAYEDPRFNAEVDQITGYKTQSILCMPIKNHREEVVGVAQAINKKSGNGGTFTEKDE
KDFAAYLAFCGIVLHNAQLYETSLLENKRNQVLLDLASLIFEEQQSLEVILKKIAATIIS
FMQVQKCTIFIVDEDCSDSFSSVFHMECEELEKSSDTLTREHDANKINYMYAQYVKNTME
PLNIPDVSKDKRFPWTTENTGNVNQQCIRSLLCTPIKNGKKNKVIGVCQLVNKMEENTGK
VKPFNRNDEQFLEAFVIFCGLGIQNTQMYEAVERAMAKQMVTLEVLSYHASAAEEETREL
QSLAAAVVPSAQTLKITDFSFSDFELSDLETALCTIRMFTDLNLVQNFQMKHEVLCRWIL
SVKKNYRKNVAYHNWRHAFNTAQCMFAALKAGKIQNKLTDLEILALLIAALSHDLDHRGV
NNSYIQRSEHPLAQLYCHSIMEHHHFDQCLMILNSPGNQILSGLSIEEYKTTLKIIKQAI
LATDLALYIKRRGEFFELIRKNQFNLEDPHQKELFLAMLMTACDLSAITKPWPIQQRIAE
LVATEFFDQGDRERKELNIEPTDLMNREKKNKIPSMQVGFIDAICLQLYEALTHVSEDCF
PLLDGCRKNRQKWQALAEQQEKMLINGESGQAKRN
```
Several crystal structures of PDE5 are now available. The catalytic domain of human PDE5 complexed with sildenafil is shown below (PDBe: 1tbf).

Preclinical studies have shown that Avanafil strongly inhibits PDE5 (half maximal inhibitory concentration = 5.2 nM) in a competitive manner and is 100-fold more potent for PDE5 than PDE6, which is found in the retina and is responsible for phototransduction. Also, Avanafil has shown higher selectivity (120-fold) against PDE6 than Sildenafil (16-fold) and Vardenafil (21-fold), and high selectivity (>10 000-fold) against PDE1 compared with Sildenafil (380-fold) and Vardenafil (1000-fold).

Avanafil has also been reported to be a faster-acting drug than Sildenafil, with an onset of action as little as 15 minutes as opposed to 30 minutes for the other drugs.

Avanafil is a synthetic small molecule with one chiral center. It has a molecular weight of 483.95 Da, an ALogP of 2.16, 3 hydrogen bond donors and 9 hydrogen bond acceptors, and is thus fully rule-of-five compliant. (IUPAC: 4-[(3-chloro-4-methoxyphenyl)methylamino]-2-[(2S)-2-(hydroxymethyl)-pyrrolidin-1-yl]-N-(pyrimidin-2-ylmethyl)pyrimidine-5-carboxamide; Canonical Smiles: COC1=C(C=C(C=C1)CNC2=NC(=NC=C2C(=O)NCC3=NC=CC=N3)N4CCC[C@H]4CO)Cl; InChI: InChI=1S/C23H26ClN7O3/c1-34-19-6-5-15(10-18(19)24)11-27-21-17(22(33)28-13-20-25-7-3-8-26-20)12-29-23(30-21)31-9-2-4-16(31)14-32/h3,5-8,10,12,16,32H,2,4,9,11,13-14H2,1H3,(H,28,33)(H,27,29,30)/t16-/m0/s1)
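The rule-of-five check on these properties is simple enough to write out. The Python sketch below uses the usual Lipinski thresholds (molecular weight at most 500 Da, logP at most 5, at most 5 hydrogen bond donors and at most 10 acceptors) and takes the ALogP quoted above in place of a calculated logP, which is an assumption about which logP estimate to use.

```python
def rule_of_five_violations(mol_weight, logp, h_donors, h_acceptors):
    """Return a list describing each Lipinski rule the molecule breaks."""
    violations = []
    if mol_weight > 500:
        violations.append("molecular weight > 500 Da")
    if logp > 5:
        violations.append("logP > 5")
    if h_donors > 5:
        violations.append("more than 5 hydrogen bond donors")
    if h_acceptors > 10:
        violations.append("more than 10 hydrogen bond acceptors")
    return violations


# Property values for Avanafil as quoted in the text.
avanafil = {"mol_weight": 483.95, "logp": 2.16, "h_donors": 3, "h_acceptors": 9}

violations = rule_of_five_violations(**avanafil)
print("violations:", violations)
print("rule-of-five compliant:", not violations)
```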

The recommended starting dose of Avanafil is 100 mg and should be taken orally as needed approximately 30 minutes before sexual activity. Depending on individual efficacy and tolerability, the dose can be varied to a maximum dose of 200 mg or decreased to 50 mg. The lowest dose that provides efficacy should be used. The maximum recommended dosing frequency is once per day.

Avanafil is rapidly absorbed after oral administration, with a median T_max of 30 to 45 minutes in the fasted state and 1.12 to 1.25 hours when taken with a high fat meal. Avanafil is approximately 99% bound to plasma proteins and has been found not to accumulate in plasma. It is predominantly cleared by hepatic metabolism, mainly by the CYP3A4 enzyme and to a minor extent by the CYP2C isoform. The plasma concentrations of the major metabolites, M4 and M16, are approximately 23% and 29% of that of the parent compound, respectively. The M4 metabolite accounts for approximately 4% of the pharmacologic activity of Avanafil, with an in vitro inhibitory potency for PDE5 of 18% of that of Avanafil. The M16 metabolite has been found inactive against PDE5. After oral administration, Avanafil is excreted as metabolites mainly in the feces (approximately 62% of the administered dose) and to a lesser extent in the urine (approximately 21% of the administered dose). Avanafil has a terminal elimination half-life (t_1/2) of approximately 5 hours, which is comparable to that of Sildenafil (3-4h) and Vardenafil (4-5h), but very short relative to the very long half-life of Tadalafil (17.5h).
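To get a feel for what these half-lives mean with once-daily dosing, here is a rough Python sketch. It assumes simple first-order elimination governed by the terminal half-life alone, which is a deliberate simplification and not a pharmacokinetic model of any of these drugs.

```python
def fraction_remaining(hours, half_life_hours):
    """Fraction of drug left after `hours`, assuming first-order elimination."""
    return 0.5 ** (hours / half_life_hours)


# Half-lives in hours, as quoted in the text.
half_lives = {"Avanafil": 5.0, "Tadalafil": 17.5}

for name, t_half in half_lives.items():
    left = fraction_remaining(24, t_half)
    print(f"{name}: {left:.1%} remaining after 24 h")
```

Under this simplification only a few percent of an Avanafil dose remains a day later, whereas a substantial fraction of a Tadalafil dose would still be present.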

The full prescribing information of Avanafil can be found here.

The license holder is Vivus, Inc.
ChEMBL Webinar on 30th May in Japanese only
01 May 2012

For Japanese ChEMBLers,

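We will hold a ChEMBL online seminar (webinar) on 30th May at 5 pm Japan time (9 am UK time). It will give an overview of ChEMBL and show how to search using the interface. On the day, the webinar will be given in Japanese by a Japanese member of staff. Questions are of course welcome. Anyone can take part.

Taking part is very simple (a browser, plus a telephone line for audio). To register, use the Doodle poll (enter your name in English and your email address) or contact the organiser, Ikeda.

The schedule for the other webinars is here (note that times are UK time). If there is demand, we will consider running further webinars at times that suit Japan. Please feel free to get in touch.

Also, there will be a presentation on the ChEMBL database in Japan on 11th May. If you are interested, please see here.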
Last chance to sign up for the webinar on Web Services - 2nd May
30 Apr 2012

This is a last chance call for people who want to sign up for the "Web Services" webinar that will be hosted this Wednesday 2nd May at 3.30pm (GMT+1).

It will be a 45 minute webinar that will take you through the ChEMBL web services.

Remember to register your interest in our webinars on the Doodle Poll. Make sure that you leave your email address as well as your name so that we can send the connection details to you. Any problems, please contact chembl-help@ebi.ac.uk.

The poll will be closed tomorrow, 1st May to allow us to send out the connection details to the attendees.
Bio-IT World, Boston
24 Apr 2012

Today, I (Louisa) gave a workshop talk at the Bio-IT World Expo in Boston. The title of the talk was 'Curating and Mapping the Drug Name Space in ChEMBL'. This expo is being held at the Boston World Trade Center, which is a great location right on the waterfront. It is the 10th anniversary of the expo, so I am expecting the rest of the conference to be very interesting and I have already tagged some talks that I definitely want to go to.