Challenges of data integration

Here is an example of why data integration across resources is hard - and hard at multiple levels. It's presented as an example of where things can trip up the unwary. In 1999, the structures of carbonic anhydrases II and IV complexed with the approved drug brinzolamide were published. These structures gave insight into how the drug binds its targets, and into its selectivity. The coordinate set we're specifically interested in here is 1a42. 3-D complexes of ligands with proteins are of great importance in many areas, not least in the development of docking methods and scoring functions.

In the original paper the structure is drawn as....


which is correct, and the same as found in Wikipedia (Brinzolamide) and ChEMBL (CHEMBL220491). The structure in the PDB file, however, is something else: an isomer of brinzolamide. The IUPAC name of the ligand in the coordinate set is

(4R)-2-(2-ETHOXYETHYL)-4-(ETHYLAMINO)-3,4-DIHYDRO-2H-THIENO[3,2-E][1,2]THIAZINE-6-SULFONAMIDE 1,1-DIOXIDE


So instead of a methoxy at the chain terminus there is an ethoxy, and the chain between the oxygen and the ring is one atom shorter - the oxygen has migrated one atom along. This is not brinzolamide, yet it is called brinzolamide both in the PDB entry (where the structure is wrong) and in the paper (where the structure is right). The most likely explanation is that when the crystallographers built the topology file for the ligand they mistyped an atom name or element - building these files was a pain back then.

As a first question - what should PDB curators do here? Spot the error and fix the data? Probably not, that's not the way that PDB works, but this kind of post-loading fixing and futzing around with the original data is common in other databases (e.g. ChEMBL). For me, this is the difference between an archive and a curated resource.

The next level of ambiguity is where people try to extract the chemical structure from the PDB entry (for historical reasons there is no connection table in the PDB file). There are two general ways of doing this: 1) from the IUPAC name in the header, and 2) from the coordinates. Most workers have tackled the latter method, but working out the bonding from the coordinates is surprisingly hard to do completely correctly.
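To give a feel for why the coordinate route is hard, here is a minimal sketch of the simplest kind of bond perception: call two atoms bonded if they sit closer than the sum of their covalent radii plus some slack. This is not the method used by any of the resources mentioned here; the radii are approximate single-bond values, the tolerance is an assumed value, and the coordinates are made up rather than taken from 1a42.

```python
import math
from itertools import combinations

# Approximate single-bond covalent radii in angstroms.
COVALENT_RADIUS = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66, "S": 1.05}
TOLERANCE = 0.45  # slack in angstroms; an assumed value, tune as needed


def perceive_bonds(atoms):
    """Return index pairs of atoms close enough to be covalently bonded.

    `atoms` is a list of (element, (x, y, z)) tuples.
    """
    bonds = []
    for (i, (el_i, xyz_i)), (j, (el_j, xyz_j)) in combinations(enumerate(atoms), 2):
        cutoff = COVALENT_RADIUS[el_i] + COVALENT_RADIUS[el_j] + TOLERANCE
        if math.dist(xyz_i, xyz_j) <= cutoff:
            bonds.append((i, j))
    return bonds


# Hypothetical coordinates: a carbon pair 1.36 A apart (double-bond-like)
# and the same pair at 1.52 A (single-bond-like).
short_pair = [("C", (0.0, 0.0, 0.0)), ("C", (1.36, 0.0, 0.0))]
long_pair = [("C", (0.0, 0.0, 0.0)), ("C", (1.52, 0.0, 0.0))]

print(perceive_bonds(short_pair))
print(perceive_bonds(long_pair))
```

Both pairs come back as "bonded", and that is all this approach can say. The bond order has to be inferred afterwards from small differences in bond length and local geometry, which is where a double bond can quietly go missing.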

So what happens in this case? Well the AffinDB resource has this (AffinDB is a great resource for structure-based design and dockers).

So, two problems here: loss of the stereochemistry of the substituent off the ring (it is unambiguous in the 3D coordinates), and loss of the double bond in the thiophene. The latter has the side effect of introducing two new chiral centres into the molecule (so eight possible stereoisomers, from the one defined structure used in the experiment). So the ligand, if converted to 3D from the above structure, could not recover the geometry as found in the database. Also the sulphonamide, which binds to the zinc in carbonic anhydrase, will no longer be acidic (aryl sulphonamides are weakly acidic), so the predicted inhibitor properties will be very different.
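The count of eight is just 2^3: one ring centre whose configuration has been dropped, plus the two centres created by the missing double bond. A small sketch that enumerates the combinations, assuming all three centres are independent and there is no symmetry that would merge any of them (the centre labels are hypothetical names for illustration):

```python
from itertools import product

# Three stereocentres left undefined in the depiction described above.
# The labels are hypothetical, for illustration only.
undefined_centres = ["ring_centre", "new_centre_1", "new_centre_2"]

# Each undefined centre can be R or S, so enumerate every combination.
assignments = list(product("RS", repeat=len(undefined_centres)))

print(len(assignments))
for combo in assignments:
    print(dict(zip(undefined_centres, combo)))
```

Only one of those combinations can correspond to what was actually in the crystal, and a structure with the double bond removed cannot correspond to it at all.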

And what about PDBe? Well the structure there is this....


Which gets the double bond in the thiophene right, but introduces a spurious chiral centre at the sulphonamide nitrogen. This is quite a subtle case, since the nitrogen in a sulphonamide has a lot of sp3 character, and maybe one configuration is trapped in the crystal complex - but in solution, it will very rapidly invert and equilibrate - it is not a chiral centre.

In summary, this chain of events makes the data integration problem a hard one (for example, if one wanted to query across ChEMBL, PDBe and AffinDB at least): there are conflicting statements about the identity of a particular molecule, and taking the PDB entry at face value would be misleading. So, data integration is hard! 'Trust and Verify' is the mantra, but trust and verify names and synonyms even more.
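One way to put 'verify' into practice is to group records by name and check whether the sources agree on the structure behind that name. The sketch below is illustrative only: the records are hypothetical, it assumes each resource labels its entry brinzolamide, and the structure keys are placeholders standing in for whatever canonical structure identifier you compute (a standard InChIKey, say).

```python
from collections import defaultdict

# Hypothetical records; KEY-A..KEY-D are placeholders, not real identifiers.
records = [
    {"source": "ChEMBL", "name": "brinzolamide", "structure_key": "KEY-A"},
    {"source": "PDB 1a42", "name": "brinzolamide", "structure_key": "KEY-B"},
    {"source": "AffinDB", "name": "brinzolamide", "structure_key": "KEY-C"},
    {"source": "PDBe", "name": "Brinzolamide", "structure_key": "KEY-D"},
]


def find_name_conflicts(records):
    """Return names whose sources disagree on structure.

    The result maps each conflicting name (lower-cased) to a dict of
    structure key -> list of sources reporting that key.
    """
    keys_by_name = defaultdict(lambda: defaultdict(list))
    for rec in records:
        name = rec["name"].lower()
        keys_by_name[name][rec["structure_key"]].append(rec["source"])
    return {
        name: dict(by_key)
        for name, by_key in keys_by_name.items()
        if len(by_key) > 1
    }


conflicts = find_name_conflicts(records)
for name, by_key in conflicts.items():
    print(f"{name}: {len(by_key)} distinct structures")
    for key, sources in by_key.items():
        print(f"  {key}: {', '.join(sources)}")
```

A check like this only flags the disagreement; deciding which structure is right still takes a person going back to the paper.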