Provenance is not a region in the south of France

29 Jun 2012

We're starting to think more seriously about issues such as deposition identifiers for ChEMBL, since we are getting significant interest in direct depositions, often as a sort of supplementary material associated with a publication. It is great that we're seeing this interest from two communities close to our hearts: 1) large pharma actually walking the walk on data sharing, as opposed to talking the talk with little follow-through (we'll be able to make an announcement on this fairly soon), and 2) the neglected disease community (again, an exciting announcement soon; I feel like such a tease, sorry).

Some of the other resources at the EBI are starting to assign DOIs for these sorts of citable depositions (e.g. IntAct), so your thoughts on the positives and negatives of this approach would be appreciated.

This leads on to the broader issue of data provenance. I would like to think that the data in ChEMBL has reasonable provenance: there are pointers to the original publications and depositions, and database identifiers are stable (so if we fix a target or compound structure, we assign a new identifier; others not doing this is a big pain for our integration, and is one of the reasons we built UniChem). The data is also versioned, and there is a properly defined and stable license for the data (CC-BY-SA 3.0).

One issue we have with support, though, is with derivative works. Lots of resources have integrated ChEMBL now, which is fabulous for us, sometimes as a one-time load for an analysis and sometimes as a regularly updated copy. However, some resources are woefully behind in keeping their copy of ChEMBL up to date. Imagine those poor users losing out on all the fixing and curation the annotation pixies here in Hinxton have been up to. Most sites don't even maintain an inventory of their feeder sources, let alone versioning, release info and dates.
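To make the idea of a feeder-source inventory concrete, here is a minimal sketch of what one record might hold and how a stale copy could be flagged. Everything in it is an assumption made for illustration: the `FeederSource` class, the `is_out_of_date` helper, and the placeholder release labels and dates are hypothetical and do not describe any real ChEMBL release or any existing tool.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FeederSource:
    """One upstream dataset that a derived resource has loaded (hypothetical record)."""
    name: str            # upstream resource, e.g. "ChEMBL"
    version: str         # release label as published upstream
    release_date: date   # when upstream published that release
    loaded_on: date      # when the local copy was loaded
    license: str         # licence the data was obtained under


def is_out_of_date(source: FeederSource, latest_version: str) -> bool:
    """Flag a local copy whose version differs from the latest upstream release."""
    return source.version != latest_version


# Placeholder values only; these are not real release labels or dates.
inventory = [
    FeederSource("ChEMBL", "release-A", date(2012, 1, 1), date(2012, 1, 15), "CC-BY-SA 3.0"),
]

# Hypothetical lookup of the current upstream version for each source.
latest = {"ChEMBL": "release-B"}

stale = [s.name for s in inventory if is_out_of_date(s, latest[s.name])]
print(stale)
```

Even a table this small, published alongside a derived resource, would let users see at a glance which release of each source they are actually looking at.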

Is there any best practice in this area? We could, for example, publicise third-party loads of ChEMBL here on the ChEMBL-og.

Anyway, these sorts of data provenance issues are one of the areas being studied in the fabulous IMI OpenPHACTS project.

As always, comments or an email would be most welcome (I must admit, I get by far the most feedback by mail, and feel a little bit guilty in posting this publicly - so be brave, post in the comments!).