ChEMBL and Handling of Retracted Papers
There is much attention paid to retracted data and errors in the literature, and also to resources that build knowledge on top of published papers (for example ChEMBL). Sometimes there is a deliberate intent to deceive, and other times it is an accident of data processing or interpretation. The Retraction Watch blog is great reading on a long train journey if you want to see some of the pre-formal retraction discussion. With advances in text mining (in the broadest sense, so including images as text, etc.) and with more publications becoming Open Access, it is easier to find and flag these errors; see, for example, some of the pioneering work and ideas of Peter Murray-Rust. We find errors and inconsistencies in the literature really frequently - units that don't make sense, an end point inconsistent with the reported assay, and so on. We either fix or flag these sorts of inconsistencies when we curate data (a toy example of such a check is sketched below). In general, science is pretty robust to these errors, and most of them, to be frank, have little impact on many/most realistic applications of the data (and consequently on literature-derived datasources). What we don't do is contact the editors of the journal or the original authors - and maybe this is something we should start doing.
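To give a flavour of the kind of automated sanity check meant here, this is a minimal illustrative sketch, not the actual ChEMBL curation pipeline; the record layout, field names and thresholds are assumptions made up for the example:

```python
# Illustrative sketch of a unit/range sanity check on an activity record.
# The field names and plausible ranges are assumptions, not ChEMBL's real rules.

PLAUSIBLE_IC50_RANGES = {
    # unit: (min, max) values considered plausible for an IC50
    "nM": (1e-3, 1e9),
    "uM": (1e-6, 1e6),
    "M":  (1e-12, 1.0),
}

def flag_suspect_activity(record):
    """Return a list of human-readable flags for one activity record."""
    flags = []
    unit = record.get("units")
    value = record.get("value")
    endpoint = record.get("endpoint")

    if unit not in PLAUSIBLE_IC50_RANGES:
        flags.append(f"unrecognised unit: {unit!r}")
    elif value is not None:
        lo, hi = PLAUSIBLE_IC50_RANGES[unit]
        if not (lo <= value <= hi):
            flags.append(f"value {value} {unit} outside plausible range")

    # e.g. a percentage-inhibition endpoint reported with concentration units
    if endpoint == "% inhibition" and unit in PLAUSIBLE_IC50_RANGES:
        flags.append("endpoint inconsistent with concentration units")

    return flags

if __name__ == "__main__":
    example = {"endpoint": "IC50", "value": 5.0e10, "units": "nM"}
    print(flag_suspect_activity(example))
```

Checks like this only flag records for a curator to look at; deciding whether to fix or annotate still needs a human and, ideally, the original paper.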
Given that we are now running SureChEMBL, which is completely automated in its operation, we are thinking carefully about errors, and how to flag or mark them in some way (for example, the accuracy of text-extracted chemical structures). I think research into the processing and filtering of such 'big data' is going to be a very active and important field in the near future - and it is core to the reproducibility of analyses. I've looked a little at some cases where ChEMBL structures are the odd one out compared to other public chemistry resources - sometimes we've been wrong, and by comparison with other sources we've then fixed things; sometimes, though, we're right and the rest of the community is 'wrong'. For me this is the way that resources like ChEMBL improve, by verifying the data we hold in whatever reasonable way is available to us. Using simple consensus or voting approaches to validate data (a toy example is sketched below) is often right, and often wrong - the most insidious case is where wrong data is propagated without provenance, and this is especially problematic in integration resources that merge data from many sources. I have a draft blog post on some of the analyses I've done, but this is currently unfinished work, and I will contact the other data providers first to feed back the differences.
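As an illustration of why naive voting can mislead, here is a small sketch (the source names and InChIKeys are made up) that takes the structure each resource reports for the 'same' compound and flags the minority view:

```python
from collections import Counter

# Hypothetical example: the standard InChIKey each resource reports for
# what is nominally the same compound. The keys below are invented.
reported_structures = {
    "ChEMBL":   "AAAAAAAAAAAAAA-BBBBBBBBBB-N",
    "Source_A": "AAAAAAAAAAAAAA-BBBBBBBBBB-N",
    "Source_B": "CCCCCCCCCCCCCC-DDDDDDDDDD-N",
    "Source_C": "AAAAAAAAAAAAAA-BBBBBBBBBB-N",
}

def consensus_report(structures):
    """Return the majority structure and the sources that disagree with it."""
    counts = Counter(structures.values())
    majority_key, _ = counts.most_common(1)[0]
    dissenters = [src for src, key in structures.items() if key != majority_key]
    return majority_key, dissenters

majority, outliers = consensus_report(reported_structures)
print("majority structure:", majority)
print("sources that disagree:", outliers)
```

A vote like this tells you where to look, not who is right: if several sources copied the structure from each other without provenance, the majority can still be wrong, which is exactly the propagation problem described above.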
There is one particular type of error, though, that can be captured semi-automatically and then included - formal retractions of papers. The PubMed search above (in the picture) shows the retractions recorded for J. Med. Chem. - a small number, you'd agree - and it's then straightforward to identify the source papers and flag these in some way (a sketch of this sort of query is below). Based on what I've found so far, the issue for the literature extracted into ChEMBL is very minor, but it is still important if you are basing work on analyses that rely on these particular data.
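For anyone wanting to reproduce that kind of search programmatically, here is a minimal sketch against the NCBI E-utilities esearch endpoint; the journal abbreviation, publication-type filter and retmax value are simply the ones I would try first, so treat them as assumptions rather than a recipe:

```python
import requests

# Minimal sketch: find PubMed IDs of J. Med. Chem. papers flagged as
# "Retracted Publication" using the NCBI E-utilities esearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def retracted_pmids(journal="J Med Chem", retmax=200):
    """Return PubMed IDs of retracted papers for the given journal."""
    params = {
        "db": "pubmed",
        "term": f'"{journal}"[Journal] AND "retracted publication"[Publication Type]',
        "retmode": "json",
        "retmax": retmax,
    }
    response = requests.get(ESEARCH_URL, params=params, timeout=30)
    response.raise_for_status()
    return response.json()["esearchresult"]["idlist"]

if __name__ == "__main__":
    pmids = retracted_pmids()
    print(f"{len(pmids)} retracted J. Med. Chem. papers found")
    print(pmids[:10])
```

The returned PMIDs can then be matched against the document identifiers stored for each ChEMBL entry to flag the affected records.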
We're still deciding what to do in ChEMBL, but once we've settled on an approach, we'll process the data to correct for these retractions and corrections.
I must acknowledge the fantastic Laura Furlong at IMIM for help with this problem - Laura responded to a Twitter post I wrote asking how to link retractions to their original papers - so social media does work in science.