Clustering of a few chemical databases

Above is a PCA plot of 17 of the databases contained within UniChem. The plot captures only about 30% of the variance, so it is certainly not the complete picture, and that is pretty close to the lower limit of what is respectable to plot in 2D! See UniChem for more details on the databases and versions. The distance metric reflects the fraction of standard InChI keys shared between each pair of databases.
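For anyone wanting to try something similar, here is a minimal sketch of how a pairwise distance based on shared InChI keys could be computed. The exact formula behind the plot is not given here, so the Jaccard distance below is an assumption, and the database names (`db_a`, `db_b`, `db_c`) and keys (`KEY1`, ...) are toy placeholders, not real UniChem data.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b|: 0.0 for identical sets, 1.0 for disjoint ones.

    This is an assumed stand-in for "fraction of shared standard InChI keys".
    """
    union = a | b
    if not union:
        # Two empty sets are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


def distance_matrix(databases):
    """Build a symmetric distance lookup from {database name: set of InChI keys}."""
    names = sorted(databases)
    dist = {name: {name: 0.0} for name in names}
    for x, y in combinations(names, 2):
        d = jaccard_distance(databases[x], databases[y])
        dist[x][y] = d
        dist[y][x] = d
    return names, dist


# Toy placeholder data: each set stands in for the standard InChI keys of one database.
toy = {
    "db_a": {"KEY1", "KEY2", "KEY3", "KEY4"},
    "db_b": {"KEY3", "KEY4", "KEY5"},
    "db_c": {"KEY6"},
}

names, dist = distance_matrix(toy)
for name in names:
    print(name, [round(dist[name][other], 2) for other in names])
```

A matrix like this could then be handed to a PCA or multidimensional scaling routine to get 2D coordinates, and the share of variance in the first two components tells you how much of the picture the plot keeps.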

Anyway, it's pretty interesting: there are arguably three arms to this...

  • Top left - screening compound databases (zinc, euos and emolecules)
  • Bottom centre - patent databases (patents, ibm and surechem)
  • Centre right - drug databases (drugbank, et al)

In the middle is ChEMBL itself, which sort of makes sense, as it will overlap to some extent with all three groups: patent databases, available compounds, and drugs.