Finding key compounds in med. chemistry patents: The open way


A couple of us attended the 3rd RDKit UGM, hosted by Merck in Darmstadt this year. It was an excellent opportunity to catch up with RDKit developments and applications and meet up with other loyal "RDKitters".

I presented a talk-torial there and went through an IPython Notebook, which some of you may find useful. It uses patent chemistry data extracted from SureChEMBL and after a series of filtering steps, it follows a few "traditional" chemoinformatics approaches with a set of claimed compounds. My ultimate aim was to identify "key compounds" in patents using compound information alone, inspired by papers such as this and this. The crucial difference is that these authors used commercial data and software, where in this implementation everything is free and open. At the same time, I wanted to show off what the combination of pandas, scikit-learn, mpld3, Beaker, RDKit, IPython Notebook and SureChEMBL can do nowadays (hint: a lot). 

So, here is the Notebook and here are the associated slides which give a bit of background and context. 

Obviously, the logic and steps can be reimplemented with other toolkits or workflow tools, such as KNIME


George