• Conference - NCRI-NCIN Joint meeting presentation

    I (jpo) am speaking on the ChEMBL databases at an NCRI (National Cancer Research Institute) Informatics - NCIN (National Cancer Intelligence Network) meeting in London on the 12th February.

  • ChEMBL Postdoctoral Position Now Online

    Another post for the group is now online. This is primarily a research role, developing data-mining and KDD methods against the ChEMBL databases and leading towards the development of automated lead optimisation approaches (a 'Robot-Chemist'). Please mail me at jpo (at) ebi.ac.uk if you would like to know more about the role or project.

  • Books and Papers - 2 - The Tufte For A New Generation

    This year, Santa Claus delivered a book (I always thought he lived at the North Pole, but clearly he is now based in The Amazon) that I had seen advertised in a few places: 'Information Dashboard Design: The Effective Visual Communication Of Data', by Stephen Few. It is one of those compelling books whose fundamental message is simple and arguably obvious, yet it is a delight to read, and I learnt a lot from it. The basic theme of the book is the design of intuitive interfaces and the features they need, in particular interfaces that display quantitative and comparative numerical data. My first contact with books of this type was with the classics of Edward Tufte, which remain timeless, but are complemented by this book addressing HCI issues.

    %D 2008
    %A Stephen Few
    %T Information Dashboard Design: The Effective Visual Communication Of Data
    %I O'Reilly
    %O ISBN 978-0596100162
    
    

  • Drug Target Assessment

    One of the long-term projects we have been involved with is the prediction or assessment of the likelihood of success of a particular drug discovery program: the elusive, but seductive, concept of druggability (or drugability if you prefer). We have a specific focus on using sequence (and other related data such as 3-D structure, previous screening and pharmacology data, etc.) to analyse and score a particular target or set of targets. One of the projects we shall run at the EBI is a 'druggability portal', which will provide a per-sequence view of a target, and also provide some tools to perform multiparametric scoring and ranking across a set of targets, up to genome scale. The intent is that this will be a 'learning' system, responsive to new data, disclosure and progress in the field. Hopefully, we will also be able to code various drug discovery strategies into the approach (biotherapeutics, prodrugs, orals, parenterals, fast-follower, HTS, etc.). Finally, we will aim to include 'proprietary' knowledge in the system, so that researchers can modify the scores for particular targets/genomes based on their own experiences or technological biases.

    Continuing the use of (very) strained puns in project names, we will call this project drugEBIlity (I know the upper-case I and lower-case l look the same, but in the logo it looks OK).

    As ever, anyone interested in collaborating or contributing to this program of work is very welcome. Finally, one of the web developer roles we have available at the moment in the group will optimise existing methods, and implement new approaches to target analysis within the web portal.

  • ChEMBL Target Dictionary

    Here is a link to the ChEMBL databases target dictionary. This contains the sequences of the targets contained within the entire set of ChEMBL databases, with a few exceptions (primarily around CandiStore entries). The vast majority of these are from the StARlite medicinal chemistry database, but not all of them currently are, so caveat emptor.

    The file is around 2.4MB in size and is in fasta format. The identifiers are simply the internal database identifiers (tids), but the headers also carry organism and trivial protein names. Linking these through to UniProt, RefSeq, etc. is left, as they often say, as an exercise for the reader (for now). However, the file should give some idea of the diversity and distribution of sequences within the databases.
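    For anyone wanting a quick look at the distribution of sequences, here is a minimal Python sketch that reads fasta-formatted lines and reports the length of each sequence keyed by tid. The header layout is an assumption (the tid is taken to be the first whitespace-delimited token after the '>'), and the example headers and sequences are made up for illustration; check the real file before relying on it.

```python
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from fasta-formatted lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def sequence_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Map each identifier to its sequence length.

    Assumption: the identifier (tid) is the first token of the header.
    """
    return {header.split()[0]: len(seq) for header, seq in read_fasta(lines)}


# Made-up records, only to show the expected shape of the input.
EXAMPLE = """>101 Homo sapiens Example protein A
MKTAYIAK
QRQISFVK
>102 Mus musculus Example protein B
MSEQ
"""

lengths = sequence_lengths(EXAMPLE.splitlines())
print(lengths)
```

    With the real file, pass an open file handle to sequence_lengths instead of the example lines.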

  • Books and Papers - 1 - Walter Sneader

    Some of the best books on the history of drug discovery have been written by Walter Sneader of the University of Strathclyde. Recently I came across a copy of The Evolution of Modern Medicines for only $1 in the excellent Strand Books on Broadway, NYC. Of particular interest is the phylogeny of key drug series.

    Here are the references in the old-school refer format.

    %D 1985
    %A Walter Sneader
    %B The Evolution of Modern Medicines
    %I John Wiley & Sons
    %O ISBN 978-0471904717
    
    %D 1986
    %A Walter Sneader
    %B Drug Development: From Laboratory to Clinic
    %I John Wiley & Sons
    %O ISBN 0-471-91116-X
    
    %D 1996 
    %A Walter Sneader 
    %B Drug Prototypes And Their Exploitation
    %I John Wiley & Sons
    %O ISBN 978-0471948476
    
    %D 2005 
    %A Walter Sneader
    %B Drug Discovery: A History 
    %I WileyBlackwell
    %O ISBN 978-0471899808
    

  • Details for recruitment of ChEMBL positions available online

    Some job adverts are available on the EMBL website. I will expand this post a little more later with links to each of the individual ChEMBL positions.

    • Group Coordinator
    • Senior Data Integration and Development Officer
    • Scientific Application Developer
    • Chemical Content Curator
    • 2 Web Application Developers

    Closing dates for application are in the links above.

  • StARlite: Compounds, Targets and Publications over time

    Given that StARlite lives, I should update my general slides, put together some new views, and so on. So here are a couple of views on the publication rate of compounds. We started by abstracting data from 'core' journals published from 1980 onwards, and then added ten additional journals from 2007 onwards, hence the spike.

    Firstly, compound records per year.

    The SQL for this is trivial:

    -- distinct compound records per publication year
    select d.year, count(distinct act.record_id) as comp_record
    from docs d
    join activities act on act.doc_id = d.doc_id
    group by d.year
    order by d.year;

    Secondly, cumulative number of compound records.
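    The cumulative view is just a running total of the per-year counts. A minimal Python sketch follows, assuming the per-year query results have been loaded into a dict of year to count, and that each record_id falls in a single publication year (otherwise the running total would double count). The numbers are made up.

```python
from itertools import accumulate
from typing import Dict, List, Tuple


def cumulative_by_year(per_year: Dict[int, int]) -> List[Tuple[int, int]]:
    """Turn {year: count} into a year-sorted list of (year, running_total)."""
    years = sorted(per_year)
    totals = accumulate(per_year[y] for y in years)
    return list(zip(years, totals))


# Made-up per-year counts, deliberately out of order.
example_counts = {1980: 5, 1982: 2, 1981: 3}
running = cumulative_by_year(example_counts)
print(running)  # [(1980, 5), (1981, 8), (1982, 10)]
```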

    Thirdly, targets per year. However, note that this is not new targets per year, just the number of distinct targets that were published within a given year.

    Fourthly, cumulative targets covered over time. This, of course, counts unique targets.
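    The difference between the last two views is easy to get wrong, so here is a small Python sketch of both. It assumes the data has already been pulled out as (year, tid) pairs; how those pairs are obtained from the schema is not shown here, and the example pairs are made up. The per-year figure counts distinct targets published in that year, while the cumulative figure only grows when a target is seen for the first time.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def targets_over_time(pairs: Iterable[Tuple[int, str]]) -> List[Tuple[int, int, int]]:
    """Return (year, distinct_targets_in_year, cumulative_unique_targets) rows."""
    by_year = defaultdict(set)
    for year, tid in pairs:
        by_year[year].add(tid)

    seen = set()  # every target observed up to and including the current year
    rows = []
    for year in sorted(by_year):
        seen |= by_year[year]
        rows.append((year, len(by_year[year]), len(seen)))
    return rows


# Made-up (year, tid) pairs; t1 and t2 reappear in later years.
example_pairs = [
    (1980, "t1"), (1980, "t2"),
    (1981, "t2"), (1981, "t3"),
    (1982, "t1"),
]
rows = targets_over_time(example_pairs)
print(rows)  # [(1980, 2, 2), (1981, 2, 3), (1982, 1, 3)]
```

    In the made-up data, 1982 publishes one target but adds nothing to the cumulative count, because t1 was already covered in 1980.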

    Finally, the papers abstracted per year.