add_chembl_compound_properties module

Add ChEMBL compound properties to the dataset.

add_chembl_compound_properties.add_all_chembl_compound_properties(dataset: Dataset, chembl_con: Connection, limit_to_literature: bool)

Add ChEMBL-based compound properties to the given compound-target pairs, specifically:

  • the first publication date of a compound (first_publication_cpd)

  • ChEMBL compound properties

  • InChI, InChI key and canonical SMILES

  • ligand efficiency metrics

  • ATC classifications

Parameters:
  • dataset (Dataset) – Dataset with compound-target pairs. Will be updated to include compound properties.

  • chembl_con (sqlite3.Connection) – Sqlite3 connection to ChEMBL database.

  • limit_to_literature (bool) – Base first_publication_cpd on literature sources only if True. Base it on all available sources otherwise.

add_chembl_compound_properties.calculate_ligand_efficiency_metrics(dataset: Dataset)

Calculate and add the ligand efficiency metrics for the compounds based on the mean pchembl values for a compound-target pair and the following ligand efficiency (LE) formulas:

\[
\begin{aligned}
LE &= \frac{\Delta G}{HA} \qquad \text{where } \Delta G = -RT\ln(K_d) \text{, } -RT\ln(K_i) \text{, or } -RT\ln(IC_{50}) \\
LE &= \frac{2.303 \cdot 298 \cdot 0.00199 \cdot pchembl\_value}{heavy\_atoms} \\
BEI &= \frac{pchembl\_mean \cdot 1000}{mw\_freebase} \\
SEI &= \frac{pchembl\_mean \cdot 100}{PSA} \\
LLE &= pchembl\_mean - ALOGP
\end{aligned}
\]

Since LE metrics are based on pchembl values, they are calculated twice: once for the pchembl values based on binding + functional assays (BF), and once for the pchembl values based on binding assays only (B).

Parameters:

dataset (Dataset) – Dataset with compound-target pairs. Will be updated to include ligand efficiency metrics.
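The formulas above can be sketched in plain Python for a single compound-target pair. This is a minimal illustration, not the module's implementation: the helper name `ligand_efficiency_metrics` and its argument names are hypothetical, and the input numbers are toy values rather than ChEMBL data.

```python
def ligand_efficiency_metrics(pchembl_mean, heavy_atoms, mw_freebase, psa, alogp):
    """Compute LE, BEI, SEI and LLE for one compound-target pair.

    Hypothetical helper for illustration; argument names mirror the
    symbols used in the formulas.
    """
    return {
        # 2.303 converts log10 to ln, 298 is the temperature in K,
        # 0.00199 is the gas constant in kcal/(mol*K).
        "LE": 2.303 * 298 * 0.00199 * pchembl_mean / heavy_atoms,
        "BEI": pchembl_mean * 1000 / mw_freebase,
        "SEI": pchembl_mean * 100 / psa,
        "LLE": pchembl_mean - alogp,
    }


# Toy input values, not taken from ChEMBL.
metrics = ligand_efficiency_metrics(
    pchembl_mean=8.0, heavy_atoms=25, mw_freebase=400.0, psa=80.0, alogp=3.0
)
print(metrics)
```

To mirror the BF and B variants, the same function would be called once per pchembl mean (binding + functional, and binding only). A compound with a polar surface area or heavy atom count of zero would need separate handling, since this sketch divides by those values directly.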

add_chembl_compound_properties.get_atc_classification(chembl_con: Connection) → DataFrame

Query ATC classifications (level 1) from the atc_classification and molecule_atc_classification tables. If a parent_molregno has several ATC level 1 annotations, their descriptions are sorted alphabetically and joined into a single string with ' | ' as the separator.

Parameters:

chembl_con (sqlite3.Connection) – Sqlite3 connection to ChEMBL database.

Returns:

Pandas DataFrame with ATC annotations in ChEMBL.

Return type:

pd.DataFrame
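The combination step can be sketched with the standard library as follows. This is an illustration under stated assumptions: the toy tables contain only the columns needed here, molregno stands in for parent_molregno (the real mapping to parent compounds is not shown), and the result is a plain dict rather than the DataFrame the function returns.

```python
import sqlite3
from collections import defaultdict

# In-memory stand-in for the ChEMBL database with reduced toy tables.
con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE atc_classification (
        level5 TEXT, level1 TEXT, level1_description TEXT);
    CREATE TABLE molecule_atc_classification (molregno INTEGER, level5 TEXT);
    INSERT INTO atc_classification VALUES
        ('N02BA01', 'N', 'NERVOUS SYSTEM'),
        ('B01AC06', 'B', 'BLOOD AND BLOOD FORMING ORGANS'),
        ('C10AA05', 'C', 'CARDIOVASCULAR SYSTEM');
    INSERT INTO molecule_atc_classification VALUES
        (1, 'N02BA01'), (1, 'B01AC06'), (2, 'C10AA05');
    """
)

rows = con.execute(
    """
    SELECT DISTINCT mac.molregno, atc.level1_description
    FROM molecule_atc_classification AS mac
    JOIN atc_classification AS atc ON mac.level5 = atc.level5
    """
).fetchall()
con.close()

# Collect the distinct level 1 descriptions per compound.
descriptions = defaultdict(set)
for molregno, description in rows:
    descriptions[molregno].add(description)

# Sort alphabetically and join with ' | ' as the separator.
atc_by_compound = {
    molregno: " | ".join(sorted(values))
    for molregno, values in descriptions.items()
}
print(atc_by_compound)
```

Sorting before joining makes the combined string independent of the row order returned by the query.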

add_chembl_compound_properties.get_chembl_properties_and_structures(chembl_con: Connection) → DataFrame

Get compound properties from the compound_properties table (e.g., alogp and the number of hydrogen bond acceptors and donors), together with the InChI, InChI key and canonical SMILES of each compound.

Parameters:

chembl_con (sqlite3.Connection) – Sqlite3 connection to ChEMBL database.

Returns:

Pandas DataFrame with compound properties and structures for all compound ids in ChEMBL.

Return type:

pd.DataFrame
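A query of this kind might look like the sketch below, run against an in-memory stand-in for the database. The exact query used by the module is not documented here, so the join type and column selection are assumptions; the toy tables hold only a subset of columns, and the property values are placeholders rather than ChEMBL data.

```python
import sqlite3

# In-memory stand-in for the ChEMBL database with reduced toy tables.
con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE compound_properties (
        molregno INTEGER, alogp REAL, hba INTEGER, hbd INTEGER);
    CREATE TABLE compound_structures (
        molregno INTEGER, standard_inchi TEXT,
        standard_inchi_key TEXT, canonical_smiles TEXT);
    -- Placeholder property values, not real ChEMBL entries.
    INSERT INTO compound_properties VALUES (1, 0.6, 0, 0), (2, -0.8, 1, 2);
    INSERT INTO compound_structures VALUES
        (1, 'InChI=1S/CH4/h1H4', 'VNWKTOKETHGBQD-UHFFFAOYSA-N', 'C');
    """
)

# Assumption: a LEFT JOIN, so compounds without a structure entry are kept.
rows = con.execute(
    """
    SELECT cp.molregno, cp.alogp, cp.hba, cp.hbd,
           cs.standard_inchi, cs.standard_inchi_key, cs.canonical_smiles
    FROM compound_properties AS cp
    LEFT JOIN compound_structures AS cs ON cp.molregno = cs.molregno
    ORDER BY cp.molregno
    """
).fetchall()
con.close()

for row in rows:
    print(row)
```

With a LEFT JOIN, compound 2 appears with empty structure fields; an INNER JOIN would drop it instead.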

add_chembl_compound_properties.get_first_publication_cpd_date(chembl_con: Connection, limit_to_literature: bool) → DataFrame

Query and calculate the first publication of a compound based on ChEMBL data (column name: first_publication_cpd). If limit_to_literature is True, this corresponds to the first appearance of the compound in the literature according to ChEMBL. Otherwise this is the first appearance in any source in ChEMBL.

Parameters:
  • chembl_con (sqlite3.Connection) – Sqlite3 connection to ChEMBL database.

  • limit_to_literature (bool) – Base first_publication_cpd on literature sources only if True.

Returns:

Pandas DataFrame with parent_molregno and first_publication_cpd from ChEMBL.

Return type:

pd.DataFrame
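The effect of limit_to_literature can be sketched as a grouped minimum over publication years. Everything in this example is an assumption made for illustration: the flattened toy table `compound_docs`, the helper `first_publication`, and the use of `src_id = 1` to mark literature sources. The real function queries the ChEMBL tables and returns a DataFrame.

```python
import sqlite3


def first_publication(con, limit_to_literature):
    """Return (parent_molregno, first_publication_cpd) rows.

    Hypothetical helper operating on a flattened toy table.
    """
    query = (
        "SELECT parent_molregno, MIN(year) AS first_publication_cpd "
        "FROM compound_docs WHERE year IS NOT NULL"
    )
    if limit_to_literature:
        # Assumption: literature sources are marked with src_id = 1.
        query += " AND src_id = 1"
    query += " GROUP BY parent_molregno ORDER BY parent_molregno"
    return con.execute(query).fetchall()


con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE compound_docs (
        parent_molregno INTEGER, year INTEGER, src_id INTEGER);
    -- Toy rows: compound 1 appears in a non-literature source first.
    INSERT INTO compound_docs VALUES
        (1, 2005, 1), (1, 1998, 2), (2, 2010, 1);
    """
)

literature_only = first_publication(con, limit_to_literature=True)
all_sources = first_publication(con, limit_to_literature=False)
con.close()

print(literature_only)
print(all_sources)
```

For compound 1, the literature-only result is later than the all-sources result because its earliest record comes from a non-literature source.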