Research Code Distribution Across the Literature
As you will see from many previous blog posts, we like compound research codes (also known as company codes). These are often the first and primary identifier for a compound in the literature, and reports of activity often pre-date disclosure of the structure. They are of key importance for keeping a view of the state of the art for progress against a target, and are often the only way to search for bioactivity across a broad set of literature sources. They also have many other applications, such as competitive intelligence and investor/analyst data mining.
Above is a plot of the frequency of occurrence of GSK codes in the full text of 808,378 Open Access EuropePMC articles - I’ve canonicalised them to account for differences in punctuation (“GSK-123456”, “GSK123456” and “GSK 123456”). If you do a Google search for the most common ones, you will easily see the importance of these compounds across drug discovery and pharmacology. For example, GSK-1521498 is a mu opioid receptor antagonist with (as of today) 158,000 hits in Google.
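The canonicalisation step can be sketched roughly as below. This is a minimal illustration, not the code used for the plot: the regular expression, the function names `canonical_codes` and `count_codes`, and the choice of “GSK-” plus digits as the canonical form are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumed pattern: "GSK", then an optional ASCII hyphen or whitespace, then digits.
# Other dash characters (en-dash etc.) are deliberately not handled here.
GSK_CODE = re.compile(r"\bGSK[-\s]?(\d+)")

def canonical_codes(text):
    """Return every GSK code found in text, rewritten as 'GSK-<digits>'."""
    return ["GSK-" + match.group(1) for match in GSK_CODE.finditer(text)]

def count_codes(documents):
    """Count canonicalised GSK codes over an iterable of full-text strings."""
    counts = Counter()
    for doc in documents:
        counts.update(canonical_codes(doc))
    return counts

if __name__ == "__main__":
    sample = "GSK-123456, GSK123456 and GSK 123456"
    print(canonical_codes(sample))  # three copies of 'GSK-123456'
```

Because only the digits are captured, a trailing letter (as in GSK-123456A) is simply ignored by this sketch. No range filtering is done, so implausibly short or long numbers pass straight through.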
I chose GSK codes for a few reasons: 1) GSK is a large, ethical research company with a commitment to publication of research; 2) they are local to us here in Hinxton, and it’s easy to ask questions of our local friends; 3) GSK don’t switch early company codes to development stage codes, or use a duality of names, either of which complicates simple analysis - a nice example here is the recently approved kinase inhibitor Alectinib, which goes under the following codes: RG-7853, CH-5424802, AF-802 and Ro-5424802. In fact there are some complexities - for example, on the GSK pipeline web page they don't themselves use the GSK prefix, just integers; but in that context it is implicit that they are GSK codes.
Oh, I also stripped salt/batch codes (so GSK-123456A became GSK-123456). There are quite a few out-of-range values (too small, or unbelievably big), and then there are various ‘mutations’ that occur, either in writing the manuscript or in typesetting, etc. For example, there are 4 occurrences of GSK-112012, which is a small typo away from the far more frequent (221 times) GSK-1120212, and it is easy to see how simply dropping a single digit could have caused this. To be clear, there will be a GSK-112012, it’s a valid name, but the likelihood is that references to it in the literature, unless supported by other evidence, are in fact about GSK-1120212. Interestingly, the occurrence rate of these mutants is about 10% of all unique GSK numbers (and this is a lower estimate - my first-pass attempt at finding these relied on the first few digits being correct; edit-distance based clustering would of course be the place to start here). There do seem to be some more common transcription errors, as one would expect for strings containing mostly numbers (so GSK-123456 -> GSK-12346 is a lot more likely to happen than GSK-123456 becoming GSK-123q56). It’s likely that a set of ‘real world’ typos can readily be assembled to build modified ‘edit distances’ useful in cleaning up data. With such a high potential error rate, this could become critical in real-world use. Interestingly, these errors then propagate from paper to paper as they are copied from one source to another.
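As a rough illustration of the edit-distance idea, the sketch below pairs each rarely seen code with a much more frequent code that is one edit away. It is not the first-pass method described above (which relied on matching leading digits); the function names and the `min_ratio` and `max_distance` thresholds are hypothetical choices, and the codes are assumed to be already canonicalised with salt/batch suffixes removed.

```python
from collections import Counter

def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete, substitute all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def likely_typos(counts, min_ratio=20, max_distance=1):
    """Map each rare code to the most frequent code within max_distance edits.

    A code is only a candidate 'parent' if it is mentioned at least
    min_ratio times as often as the rare code (an assumed threshold).
    """
    suspects = {}
    for rare, n_rare in counts.items():
        best = None
        for common, n_common in counts.items():
            if n_common < min_ratio * n_rare:
                continue
            if edit_distance(rare, common) <= max_distance:
                if best is None or n_common > counts[best]:
                    best = common
        if best is not None:
            suspects[rare] = best
    return suspects

if __name__ == "__main__":
    counts = Counter({"GSK-1120212": 221, "GSK-112012": 4})
    print(likely_typos(counts))  # {'GSK-112012': 'GSK-1120212'}
```

This compares every pair of codes, so it is quadratic in the number of unique codes, and it weights all edits equally. A ‘real world’ version would presumably give a dropped digit a lower cost than a digit-to-letter substitution.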
Below is the frequency distribution of the cleaned-up GSK codes. It shows the classic log-normal/power law shape - some compounds, very likely the most interesting, have the most data. They are also likely to be the most progressed towards becoming a real drug. The long tail is there too, and one would expect this long tail to be more likely to be full of errors than the commonly referred to compounds.
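For anyone wanting to reproduce this kind of distribution from their own counts, a minimal sketch follows. The helper names `rank_frequency` and `singleton_share` are made up for illustration, and the input is assumed to be a mapping from canonical code to number of mentions.

```python
from collections import Counter

def rank_frequency(counts):
    """Return (rank, code, mentions) tuples, most mentioned code first."""
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, code, n) for rank, (code, n) in enumerate(ordered, start=1)]

def singleton_share(counts):
    """Fraction of distinct codes mentioned exactly once (the end of the long tail)."""
    if not counts:
        return 0.0
    return sum(1 for n in counts.values() if n == 1) / len(counts)

if __name__ == "__main__":
    # Toy counts, not real data.
    toy = Counter({"GSK-1": 5, "GSK-2": 1, "GSK-3": 1, "GSK-4": 3})
    for row in rank_frequency(toy):
        print(row)
    print(singleton_share(toy))  # 0.5
```

Plotting mentions against rank on log-log axes is the usual first look at whether the head and the tail behave as described.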
And here is a frequency scatterplot. Many compounds are mentioned only once in the literature. This dual-domain ‘frequency spectrum’ (1 - order in time, taken from the ordinal number in the research code, though not linear! and 2 - frequency of mentions in the literature) is really interesting and useful, as future posts will outline. There is also another time domain at work here - the time of disclosure/publication.
This initial analysis is just for EuropePMC full text content, but of course a similar analysis can be done across ChEMBL, SureChEMBL (for patents) and the internet (both via search engine indexes and, with more complexity and difficulty, across the dark web). Of course, this can also be combined with the list of research codes, and the tracking across company mergers, that is part of ChEMBL.
Toodle-pip for now!
jpo and Jee-Hyub Kim (McEntyre group)