• Crowdsourced Validation of Research Code Compounds


    I love the drug discovery process and how it works: the progression from 'secret' to 'public' as the health benefits of a potential new drug become clearer. And there is the associated excitement of seeing a compound or target validation result for the first time - these things can really change your world view and progress your own ideas. This free and liquid exchange of accurate, reliable data is a fundamental part of invention, discovery and the progress of science.

    One part of this process is the Research Code of a compound - the name that looks like a few characters, then some integers (usually), e.g. UK-92,480. There are some other posts on research codes on the ChEMBL-og, and there are also a few tables in the ChEMBL database that contain some useful metadata connected with these.

    Anyway, making a reliable link between a research code and a 2-D structure or sequence is an important thing, and should be easier than it is. So, over a dinner the other evening, in Vienna, with Tudor Oprea, we stumbled across the following idea, and I think it's a reasonable one, worth sharing, and following up on.

    There are sometimes errors in the structures associated with research codes, in the literature, on the web, etc. The people who know the compound structures best are the companies themselves. Given that they have disclosed the structures themselves, whether in some form of scientific publication, at a conference, or in regulatory filings, this is not secret or confidential data. It's public, pure and simple.

    So, here's the proposal - why don't we bundle up, for each company, the compounds for which we have both a published research code and a 2D structure, and then get them validated by the company that originally synthesized them? Ship a simple list of research code and InChI pairs, and get them checked, and potentially corrected. This 'blessing' of the name provides a source of provenance, and of high quality, to other resources that may inherit and use them. It also provides a mapping between the name and an InChI, and I'm sure you all know the endless possibilities that open up once you have an InChI ;)
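
    As a rough illustration of what the checking step could look like, here is a minimal Python sketch that reconciles a list we ship out against the list a company sends back. Everything in it is an assumption made for illustration: the function and class names, the three status labels, the way research codes are normalised, and the placeholder codes (XX-0001 and so on, paired with InChIs of trivial molecules). It is not an agreed exchange format.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CheckResult:
    research_code: str
    status: str  # "confirmed", "corrected" or "unchecked"
    inchi: str   # the InChI to carry forward after the check


def normalise_code(code: str) -> str:
    """Collapse formatting variants, so 'XX-0001' and 'xx 0001' compare equal."""
    return "".join(ch for ch in code.upper() if ch.isalnum())


def reconcile(submitted: Dict[str, str], validated: Dict[str, str]) -> List[CheckResult]:
    """Compare shipped (code -> InChI) pairs with the company's returned pairs."""
    validated_by_key = {normalise_code(c): inchi for c, inchi in validated.items()}
    results = []
    for code, inchi in submitted.items():
        key = normalise_code(code)
        if key not in validated_by_key:
            # The company did not comment on this code; keep our structure as-is.
            results.append(CheckResult(code, "unchecked", inchi))
        elif validated_by_key[key] == inchi:
            results.append(CheckResult(code, "confirmed", inchi))
        else:
            # The company supplied a different structure; prefer theirs.
            results.append(CheckResult(code, "corrected", validated_by_key[key]))
    return results


# Placeholder data only: hypothetical codes paired with InChIs of simple molecules.
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,1-2H3"
METHANOL = "InChI=1S/CH4O/c1-2/h2H,1H3"
BENZENE = "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H"

submitted = {"XX-0001": ETHANOL, "XX-0002": METHANOL, "XX-0003": BENZENE}
validated = {"xx 0001": ETHANOL, "XX-0002": ETHANOL}

results = reconcile(submitted, validated)
for r in results:
    print(r.research_code, r.status, r.inchi)
```

    Comparing InChI strings for exact equality is the simplest possible test; a real exercise would probably also want to record who confirmed each pair and when, so that the provenance travels with the data.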

    We have done something very similar to this idea already with a large pharma company, but as one way of estimating the errors in ChEMBL. This had really interesting results (I'll try and get permission to share the outcome of this, but it was definitely worth doing). I floated this idea with a few Pharma people at a recent meeting, and the response was favourable. So we'll try and progress this as a pre-competitive project, and ensure that the data ends up in the public domain - if anyone else wants to join these efforts, get in touch....

    Notes:


    • Of course, this will involve some effort on the part of the companies doing the validation, and is not intended as a way to 'fish' for non-disclosed, or 'borderline' disclosed structures. So restricting it to cases where there is a PubMed source, which ensures that any compounds really are definitively public, may be a good place to start.
    • I find some of the best sources of newly disclosed structures are far-east chemical suppliers; they are great at getting products available really quickly. I would like to find out what websites they use to get the info so fast.
    • It's also potentially a little bit like the trust certificates assigned to web sites by certification authorities, so that when you use a secure web resource like ChEMBL, or Facebook, over https you know you're getting the real deal.

  • Late Stage Kinase Inhibitors Over Time


    A bunch of new USANs have just been defined, including some for protein kinase inhibitors; so this prompted me to update the kinase data, and plot the number of USANs (which is a pretty good, unambiguous proxy of late stage clinical compounds) as a function of time. So far it looks like the rate of development is slowing a little for 2012 (but this can change of course, depending on what happens in the remainder of the year).

    An interesting statistic is that 20% of the protein kinase inhibitors (including the mTOR inhibitors) that have had a USAN assigned have been approved (and overall 5.4% of all current clinical stage kinase inhibitors). Of course, these are not steady state figures, and there are still compounds coming through, so these are approximate lower bounds - but the percentages are unlikely to drift much higher.

    Update: Oh, the axes! - vertical is cumulative number of USANs, horizontal is year (so 87 is 1987, and 12 is 2012).

  • ACS Philadelphia (posted by Louisa)


    I (Louisa) will be attending the ACS national meeting in Philadelphia, which opens this Sunday afternoon (19th August).
    If anyone would like to meet with me to discuss ChEMBL or learn more about what we do, please feel free to email me at chembl-help@ebi.ac.uk. Providing that my hotel has internet, I shall be able to respond and organise something.
    Additionally, if you see me wandering around the conference, probably picking up free pens, please stop me for a chat. It's always nice to meet our ChEMBL users.

  • Paper: Annotating Human P-Glycoprotein Bioassay Data


    A paper on the extraction, curation and comparison of literature data for a key transporter in drug discovery, P-glycoprotein (UniProt: P08183; aka MDR1, pgp, etc). It details some interesting comparisons of different assay formats, readouts, cell backgrounds and so forth.

    Here's a link to the paper (It should be Open Access, but isn't at the moment :( ).

    %T Annotating Human P-Glycoprotein Bioassay Data
    %J Mol. Informatics
    %A B. Zdrazil 
    %A M. Pinto
    %A P. Vasanthanathan
    %A A.J. Williams
    %A L.Z. Balderud
    %A O. Engkvist 
    %A C. Chichester
    %A A. Hersey 
    %A J.P. Overington
    %A G.F. Ecker
    %D 2012
    %V 31
    %O DOI: 10.1002/minf.201200059
    

  • Webinar: Accessing ChEMBL Web Services via Workflow Tools.


    This is a gentle reminder for the forthcoming webinar on accessing ChEMBL REST web services via KNIME and Pipeline Pilot at 3.30 pm BST on Wednesday, 8th August 2012 (aka tomorrow). Please email me in advance in order to register for this.

  • Kinase Inhibitors and Companies Interested In Them




    Day 324 of my holiday (or that's what it feels like) and I'm missing making graphs - so I revisited Numbers, the Apple spreadsheet program, trying to get some pivot table sort of plots from a spreadsheet - it was a pain, but here is a view of kinase inhibitors that have reached clinical trials, grouped by company. Click the image above to see it larger. The labels should be read from left to right, and so Pfizer (and their acquired companies) has developed the most kinase inhibitors, then Roche (and their acquired companies) the next most, etc. It's interesting to see the ubiquitous power law in operation here, with nine companies accounting for over half of the kinase inhibitors to have entered clinical trials.

    Would be interesting to normalise these values by total R&D spend, and so reflect companies that are kinase specialists (I don't have this data), but a few standouts are Exelixis, Takeda, Array and Astex Therapeutics.

    Oh, the data also includes the '-rolimus' mTOR inhibitors....

    Update - Since making the figure, I came across another compound from Lilly, and then by looking at their pipeline website (which is now new and fancy looking, but not so good for scraping) found some more Lilly compounds. So the total number of kinase inhibitors used to generate this data is 379, and Lilly have jumped in position (the figure has been replaced with the new version).
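
    For anyone who would rather skip the spreadsheet pain, the "how many companies account for over half" figure is easy to recompute in a few lines of Python. This is only a sketch: the function names are mine, and the company labels and counts below are made-up placeholders, not the 379-compound data set behind the figure.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def rank_companies(companies: Iterable[str]) -> List[Tuple[str, int]]:
    """Count compounds per company and sort from most to fewest."""
    return Counter(companies).most_common()


def companies_covering(ranked: List[Tuple[str, int]], fraction: float = 0.5) -> int:
    """Smallest number of top-ranked companies holding more than `fraction` of all compounds."""
    total = sum(count for _, count in ranked)
    running = 0
    for position, (_, count) in enumerate(ranked, start=1):
        running += count
        if running > fraction * total:
            return position
    return len(ranked)


# Placeholder input: one company label per clinical-stage compound.
compound_owners = ["A"] * 5 + ["B"] * 3 + ["C"] * 2 + ["D"] + ["E"]

ranked = rank_companies(compound_owners)
top_n = companies_covering(ranked)
print(ranked)
print(top_n, "companies account for over half of the compounds")
```

    The input is just one company name per compound, with acquired companies already rolled up into their parent, which is the same grouping the figure uses.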

  • ChEMBL PostgreSQL



    With the aim of providing more options to access the ChEMBL database, a PostgreSQL version of the most recent ChEMBL release is now available on the ChEMBL FTP site (thanks to the Ora2Pg project for making the conversion process relatively painless).

    The main goal of this project is to make it easier for users to integrate the chemical data in the ChEMBL database with freely available chemical cartridges, such as the excellent RDKit and Bingo. Now that we have the PostgreSQL version of the database available, we are in the process of benchmarking these cartridges, and we will report back soon on the results. We are also looking to build and release a virtual machine, which will come preloaded with ChEMBL, PostgreSQL, RDKit and/or Bingo. When we have more details on this we will let you know.

    Right now everyone has the opportunity to download and install the PostgreSQL version of the ChEMBL database and optionally install a chemical cartridge. We hope this will help as many projects as possible and any comments or feedback will be very much appreciated. Enjoy it :)

    (You can download the ChEMBL_14 PostgreSQL here. The tarball also comes with some basic install instructions, but does assume you have a PostgreSQL instance up and running).
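
    To give a flavour of what the cartridge route looks like once the database is loaded, here is a hedged Python sketch that builds an RDKit molecule table from the ChEMBL compound_structures table and runs a substructure search. It assumes the RDKit cartridge is already installed in your PostgreSQL instance, that compound_structures has molregno and canonical_smiles columns, and that you pass in a DB-API cursor using %s placeholders (psycopg2 style). The table name rdk_mols, the index name and the Python function names are my own choices for illustration, and this is not the setup used in our benchmarking.

```python
from typing import Any, List

# Assumption: RDKit cartridge installed; ChEMBL compound_structures table loaded.
SETUP_STATEMENTS = [
    "CREATE EXTENSION IF NOT EXISTS rdkit",
    # Side table of RDKit molecules; SMILES that fail to parse give NULL and are skipped.
    "SELECT * INTO rdk_mols FROM ("
    "SELECT molregno, mol_from_smiles(canonical_smiles::cstring) AS m "
    "FROM compound_structures) tmp WHERE m IS NOT NULL",
    # GiST index so substructure searches do not scan every row.
    "CREATE INDEX rdk_mols_idx ON rdk_mols USING gist(m)",
]

# '@>' is the cartridge's "contains substructure" operator.
SUBSTRUCTURE_QUERY = (
    "SELECT molregno FROM rdk_mols "
    "WHERE m @> mol_from_smiles(%s::cstring) LIMIT %s"
)


def build_rdkit_table(cursor: Any) -> None:
    """Run the one-off setup statements on an open DB-API cursor."""
    for statement in SETUP_STATEMENTS:
        cursor.execute(statement)


def substructure_search(cursor: Any, smiles: str, limit: int = 10) -> List[int]:
    """Return up to `limit` molregnos whose structure contains `smiles`."""
    cursor.execute(SUBSTRUCTURE_QUERY, (smiles, limit))
    return [row[0] for row in cursor.fetchall()]
```

    With psycopg2 you would open a connection to your ChEMBL database, call build_rdkit_table(conn.cursor()) once and commit, and then call something like substructure_search(cursor, "c1ccccc1") for a benzene ring. Building the molecule table over the whole of ChEMBL will take a while.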

  • Books: Assay Guidance Manual



    Came across this great online book, developed jointly by researchers from Eli Lilly and staff from the NIH - it's a thorough overview of the factors and methodology of HTS and SAR measurements, and provides a useful orientation for new researchers interested in the field.

    Here's the text from the book web page....

    The collection of chapters in this eBook is written to provide guidance to investigators who are interested in developing assays useful for the evaluation of collections of molecules to identify probes that modulate the activity of biological targets, pathways, and cellular phenotypes. These probes may be candidates for further optimization and investigation in drug discovery and development.
    Originally written as a guide for therapeutic project teams within a major pharmaceutical company, this manual has been adapted to provide guidelines for scientists in academic, non-profit, government and industrial research laboratories to develop potential assay formats compatible with High Throughput Screening (HTS) and Structure Activity Relationship (SAR) measurements of new and known molecular entities. Topics addressed in this manual include:
    • Development of optimal assay reagents.
    • Optimization of assay protocols with respect to sensitivity, dynamic range, signal intensity and stability.
    • Adopting screening assays from bench scale assays to automation and scale up in microtiter plate formats.
    • Statistical concepts and tools for validation of assay performance parameters.
    • Secondary follow up assay development for chemical probe validation and SAR refinement.
    • Data standards to be followed in reporting screening and SAR assay results.
    • Glossaries and definitions.
    This manual will be continuously updated with contributions from experienced scientists from multiple disciplines working in drug discovery & development worldwide. An open submission and review process will be implemented in the near future on this eBook website, hosted by the National Library of Medicine with content management by the National Center for Advancing Translational Sciences (NCATS, http://ncats.nih.gov/), the newest component of the National Institutes of Health (NIH).

    %T Assay Guidance Manual
    %D 2012
    %E G.S. Sittampalam
    %E N. Gal-Edd
    %E J. Weidner
    %E D. Auld
    %E M. Glicksman
    %E M. Arkin
    %E A. Napper
    %E J. Inglese
    %I Eli Lilly & Company and the National Center for Advancing Translational Sciences
    %O http://www.ncbi.nlm.nih.gov/books/NBK53196/
    %O Bookshelf ID: NBK53196
    %O PMID: 22553861