How far behind the patent literature is the primary literature?


I can't believe that people haven't looked at this before, and I should have looked on 'The Interwebs', so I'm not claiming this is world leading or anything; but here's a little bit of analysis joining ChEMBL with some of the recently released patent data, addressing the question 'how far does the published literature lag behind the patent literature?'. The basic workflow is to identify a set of compounds in ChEMBL for which we have patent data, get the dates for the patents, and then, for each molecule in both sets, calculate the difference between the earliest literature date and the earliest patent date. This is what the distribution looks like....


So, on average it looks like the primary literature is about two to three years behind the patent literature, which is closer than I thought. The eagle-eyed will of course note that there are quite a few negative differences here - cases where a patent containing the compound structure was filed after a literature publication. A key caveat, though, is that there is no distinction between a compound being in the claims of a patent and one that is just mentioned in it.
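
For anyone who wants to try this sort of comparison themselves, here's a minimal sketch of the calculation in Python. It is not the code used for the analysis above; the compound identifiers, the dates and the function names (earliest_by_compound, lag_years) are all made up for illustration, and I'm assuming the difference is taken as earliest literature date minus earliest patent date, so a positive number means the literature came later and a negative one means the patent came later.

```python
from datetime import date
from statistics import median

# Hypothetical toy records as (compound_id, date) pairs.
# These are invented values, not real ChEMBL or patent data.
literature = [
    ("CPD-1", date(2005, 6, 1)),
    ("CPD-1", date(2007, 1, 15)),
    ("CPD-2", date(2010, 3, 1)),
    ("CPD-3", date(2001, 9, 1)),   # no patent record, so it drops out
]
patents = [
    ("CPD-1", date(2003, 6, 1)),
    ("CPD-2", date(2011, 3, 1)),
    ("CPD-2", date(2012, 5, 1)),
    ("CPD-4", date(1999, 1, 1)),   # no literature record, so it drops out
]


def earliest_by_compound(records):
    """Map each compound id to the earliest date seen for it."""
    earliest = {}
    for compound_id, when in records:
        if compound_id not in earliest or when < earliest[compound_id]:
            earliest[compound_id] = when
    return earliest


def lag_years(literature, patents):
    """Earliest literature date minus earliest patent date, in years,
    for compounds present in both sets (assumed sign convention)."""
    lit = earliest_by_compound(literature)
    pat = earliest_by_compound(patents)
    shared = sorted(lit.keys() & pat.keys())
    return {cid: (lit[cid] - pat[cid]).days / 365.25 for cid in shared}


lags = lag_years(literature, patents)
for compound_id, years in lags.items():
    print(f"{compound_id}: {years:+.2f} years")
print(f"median lag: {median(lags.values()):+.2f} years")
```

Only compounds found in both sets contribute, and using the earliest date on each side means later papers or patents on the same compound don't affect its lag. The sketch inherits the caveat above: it can't tell a claimed compound from one that is merely mentioned.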

More to do on this, but it's an interesting start. If there's interest in exactly what was done, where the data came from, etc., I can go into that in the comments section.

Thanks to George for pulling together the data!