Recovering useful data from images of graphs

Lots of data is only ever reported as graphs, and for human readers this is a pretty sensible thing to do - we're pretty good at pattern matching and making sense of what we see. But to reuse data, to fit it to a different interpretative model, or even just to compare two plots, we need the underlying numbers. We're particularly interested in PK data at the moment, and one case where data is usually only available as an image is drug plasma concentration time-course data.

So, it turns out it's pretty quick to build a data table from time-course data like this, and then play around with fitting new functions to the data, etc. Above is the curve for a typical drug - the time taken from opening the publication to having a data set is about 7 minutes (for me; you may be quicker or slower than this). That works out at about six curves an hour (with some YouTube kitten video time thrown in to relax the eyes), and so about 50 a working day - so in a month it would be possible for one person to capture the published time-course data for all approved oral drugs. Equivalently, four people would be able to do this in a week, and twenty people in a day. That's pretty cool, don't you think?
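
As a quick sanity check on that arithmetic, here is a minimal sketch showing that the three staffing scenarios come out to the same number of curves. The 20 working days per month and 5 per week are assumptions, not figures from the estimate above, and the resulting total is only what the estimate implies, not an actual count of approved oral drugs.

```python
# Throughput figure quoted in the text: about 50 curves per person per working day.
CURVES_PER_DAY = 50


def curves_captured(people, working_days, per_day=CURVES_PER_DAY):
    """Total curves digitised by a team over a number of working days."""
    return people * working_days * per_day


# Assumed calendar: 20 working days in a month, 5 in a week.
scenarios = {
    "1 person, 1 month": (1, 20),
    "4 people, 1 week": (4, 5),
    "20 people, 1 day": (20, 1),
}

totals = {label: curves_captured(p, d) for label, (p, d) in scenarios.items()}
for label, total in totals.items():
    print(f"{label}: {total} curves")
```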

Here's the digitised data from the graph above. The first column is time in hours, and the second is plasma concentration in ng/ml. The data relate to a single 100 mg dose of DEIYFTQMQPDXOT-UHFFFAOYSA-N.

0.0    0.0
0.36  56.7
0.60 327.0
1.1  445.0
1.6  413.0
2.0  355.0
3.0  250.0
4.0  178.0
6.0   80.6
8.0   50.8
10.0  35.9
12.0  22.8
18.0  12.3
24.0   9.3
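
Once the points are in a table like this, a few standard summary numbers fall out almost for free. The following is a minimal sketch (plain Python, no libraries) that reads off the observed Cmax and Tmax and estimates the area under the curve from 0 to 24 h with the linear trapezoidal rule. The function names are made up for illustration, and the trapezoidal rule is just one common choice for AUC, not necessarily the one used in the original publication.

```python
# Digitised (time in h, plasma concentration in ng/ml) pairs from the table above.
DATA = [
    (0.0, 0.0), (0.36, 56.7), (0.60, 327.0), (1.1, 445.0), (1.6, 413.0),
    (2.0, 355.0), (3.0, 250.0), (4.0, 178.0), (6.0, 80.6), (8.0, 50.8),
    (10.0, 35.9), (12.0, 22.8), (18.0, 12.3), (24.0, 9.3),
]


def cmax_tmax(points):
    """Return (Cmax, Tmax): the highest observed concentration and its time."""
    t, c = max(points, key=lambda p: p[1])
    return c, t


def auc_trapezoid(points):
    """Area under the curve by the linear trapezoidal rule, in ng.h/ml."""
    pts = sorted(points)  # make sure the points are in time order
    return sum(
        (t2 - t1) * (c1 + c2) / 2.0
        for (t1, c1), (t2, c2) in zip(pts, pts[1:])
    )


cmax, tmax = cmax_tmax(DATA)
auc_0_24 = auc_trapezoid(DATA)

print(f"Cmax = {cmax} ng/ml at Tmax = {tmax} h")
print(f"AUC(0-24 h) = {auc_0_24:.1f} ng.h/ml")
```

Note that these are observed values at the digitised points only; the true peak may sit between two points, and the AUC here stops at 24 h rather than extrapolating to infinity.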

Of course, there are big issues around single vs multiple dosing, dose accumulation, population variance and confidence intervals. There are also some errors introduced by the digitizing process, but there are ways to estimate these, and in this case they are probably smaller than the variance in the underlying data.

Would anyone want to help me with this task? We can pool the datasets and no doubt find one or two useful things for publications (e.g. what is the distribution of Cmax, in uM, for drugs dosed once a day?). All the data would need to end up in the public domain, of course.
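
To pool Cmax values across drugs in uM, each digitised concentration needs converting from mass units using the compound's molecular weight. A minimal sketch of that conversion follows; the function name is hypothetical, and the molecular weight of 300 g/mol is a placeholder for illustration only, not the real value for the compound above.

```python
def ng_per_ml_to_uM(conc_ng_per_ml, mol_weight_g_per_mol):
    """Convert a concentration from ng/ml to micromolar (uM).

    1 ng/ml is the same as 1 ug/l, and ug/l divided by g/mol gives umol/l,
    so the conversion is a single division by the molecular weight.
    """
    if mol_weight_g_per_mol <= 0:
        raise ValueError("molecular weight must be positive")
    return conc_ng_per_ml / mol_weight_g_per_mol


# Placeholder molecular weight, NOT the real value for any particular drug.
EXAMPLE_MW = 300.0

# Observed Cmax from the digitised table, in ng/ml.
cmax_uM = ng_per_ml_to_uM(445.0, EXAMPLE_MW)
print(f"Cmax = {cmax_uM:.2f} uM (assuming MW = {EXAMPLE_MW} g/mol)")
```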

Yay crowdsourcing!

jpo

Update - with the data above, it's now possible to feed it to sites like the excellent http://sbpkpd.org and explore the fit against a number of canonical models.
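
For a rough idea of what fitting involves without any web tool, here is a minimal sketch of the simplest case: a straight-line fit to ln(concentration) against time over the tail of the curve, which gives an apparent elimination rate constant and half-life. This is a much cruder stand-in for a proper compartmental model fit, it is not a description of what sbpkpd.org does, and using only the last three points as the "terminal phase" is an arbitrary assumption.

```python
import math

# Last three digitised points (time in h, concentration in ng/ml).
# Treating these as the terminal phase is an assumption made for illustration.
TERMINAL = [(12.0, 22.8), (18.0, 12.3), (24.0, 9.3)]


def log_linear_fit(points):
    """Ordinary least-squares fit of ln(C) = intercept + slope * t."""
    ts = [t for t, _ in points]
    ys = [math.log(c) for _, c in points]
    n = len(points)
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    sxx = sum((t - t_mean) ** 2 for t in ts)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    return slope, intercept


slope, intercept = log_linear_fit(TERMINAL)
k_el = -slope                    # apparent elimination rate constant, 1/h
half_life = math.log(2) / k_el   # apparent terminal half-life, h

print(f"k_el = {k_el:.4f} 1/h, half-life = {half_life:.1f} h")
```

The answer is sensitive to which points are included, so it is worth trying a few different windows before trusting the number.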