add_chembl_compound_properties module
Add ChEMBL compound properties to the dataset.
- add_chembl_compound_properties.add_all_chembl_compound_properties(dataset: Dataset, chembl_con: Connection, limit_to_literature: bool)
Add ChEMBL-based compound properties to the given compound-target pairs, specifically:
the first publication date of a compound (first_publication_cpd)
ChEMBL compound properties
InChI, InChI key and canonical SMILES
ligand efficiency metrics
ATC classifications
- Parameters:
dataset (Dataset) – Dataset with compound-target pairs. Will be updated to include compound properties.
chembl_con (sqlite3.Connection) – SQLite connection to the ChEMBL database.
limit_to_literature (bool) – Base first_publication_cpd on literature sources only if True. Base it on all available sources otherwise.
- add_chembl_compound_properties.calculate_ligand_efficiency_metrics(dataset: Dataset)
Calculate and add the ligand efficiency metrics for the compounds based on the mean pchembl values for a compound-target pair and the following ligand efficiency (LE) formulas:
\[ \begin{aligned}
LE &= \frac{\Delta G}{HA} \qquad \text{where } \Delta G = -RT \ln(K_d) \text{, } -RT \ln(K_i) \text{, or } -RT \ln(IC_{50}) \\
LE &= \frac{2.303 \cdot 298 \cdot 0.00199 \cdot pchembl\_value}{heavy\_atoms} \\
BEI &= \frac{pchembl\_mean \cdot 1000}{mw\_freebase} \\
SEI &= \frac{pchembl\_mean \cdot 100}{PSA} \\
LLE &= pchembl\_mean - ALOGP
\end{aligned} \]
Since LE metrics are based on pchembl values, they are calculated twice: once for the pchembl values based on binding + functional assays (BF) and once for the pchembl values based on binding assays only (B).
- Parameters:
dataset (Dataset) – Dataset with compound-target pairs. Will be updated to include ligand efficiency metrics.
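The formulas above can be sketched in plain Python as follows. This is a minimal illustration rather than the module's implementation: the function name `ligand_efficiency_metrics` and its argument names are assumptions chosen for the example, and it works on single values instead of the Dataset used by the module.

```python
def ligand_efficiency_metrics(pchembl_mean, heavy_atoms, mw_freebase, psa, alogp):
    """Compute LE, BEI, SEI and LLE for one compound-target pair (hypothetical helper)."""
    # 2.303 * T * R with T = 298 K and R = 0.00199 kcal/(mol*K), as in the LE formula
    le_factor = 2.303 * 298 * 0.00199
    return {
        "LE": le_factor * pchembl_mean / heavy_atoms,
        "BEI": pchembl_mean * 1000 / mw_freebase,
        "SEI": pchembl_mean * 100 / psa,
        "LLE": pchembl_mean - alogp,
    }


# Toy input values, not taken from ChEMBL
metrics = ligand_efficiency_metrics(
    pchembl_mean=8.0, heavy_atoms=20, mw_freebase=400.0, psa=80.0, alogp=3.0
)
```

To mirror the BF and B variants described above, the same function would be called twice, once with each mean pchembl value.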
- add_chembl_compound_properties.get_atc_classification(chembl_con: Connection) → DataFrame
Query ATC classifications (level 1) from the atc_classification and molecule_atc_classification tables. The ATC annotations for the same parent_molregno are sorted alphabetically and concatenated into a single description string, with ' | ' as the separator.
- Parameters:
chembl_con (sqlite3.Connection) – SQLite connection to the ChEMBL database.
- Returns:
Pandas DataFrame with ATC annotations in ChEMBL.
- Return type:
pd.DataFrame
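The aggregation step described for get_atc_classification can be sketched as below. This is only an illustration in plain Python: the helper name `combine_atc_descriptions` and the toy rows are hypothetical, and dropping duplicate descriptions before joining is an assumption, not something the documentation states.

```python
from collections import defaultdict


def combine_atc_descriptions(rows):
    """Combine ATC descriptions per parent_molregno into one ' | '-separated string.

    rows: iterable of (parent_molregno, description) tuples (hypothetical input shape).
    """
    grouped = defaultdict(set)  # a set drops duplicates (assumption)
    for parent_molregno, description in rows:
        grouped[parent_molregno].add(description)
    # Sort alphabetically, then join with ' | ' as the separator
    return {
        molregno: " | ".join(sorted(descriptions))
        for molregno, descriptions in grouped.items()
    }


# Toy rows, not taken from ChEMBL
toy_rows = [
    (1, "NERVOUS SYSTEM"),
    (1, "CARDIOVASCULAR SYSTEM"),
    (2, "DERMATOLOGICALS"),
    (1, "NERVOUS SYSTEM"),
]
combined = combine_atc_descriptions(toy_rows)
```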
- add_chembl_compound_properties.get_chembl_properties_and_structures(chembl_con: Connection) → DataFrame
Get compound properties from the compound_properties table (e.g., alogp, number of hydrogen bond acceptors and donors). Also get the InChI, InChI key and canonical SMILES.
- Parameters:
chembl_con (sqlite3.Connection) – SQLite connection to the ChEMBL database.
- Returns:
Pandas DataFrame with compound properties and structures for all compound IDs in ChEMBL.
- Return type:
pd.DataFrame
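A query of this kind might look like the sketch below, run against an in-memory SQLite database with toy tables. The compound_structures table name, the selected column names, and the use of a LEFT JOIN are assumptions made for the example and are not confirmed by this page. The values are illustrative only.

```python
import sqlite3

# In-memory stand-in for the ChEMBL database, with toy rows only
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE compound_properties "
    "(molregno INTEGER, alogp REAL, hba INTEGER, hbd INTEGER)"
)
con.execute(
    "CREATE TABLE compound_structures "
    "(molregno INTEGER, standard_inchi TEXT, standard_inchi_key TEXT, canonical_smiles TEXT)"
)
con.execute("INSERT INTO compound_properties VALUES (1, 1.31, 3, 1)")
con.execute("INSERT INTO compound_properties VALUES (2, 0.50, 2, 2)")
con.execute(
    "INSERT INTO compound_structures VALUES (?, ?, ?, ?)",
    (
        1,
        "InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)",
        "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
        "CC(=O)Oc1ccccc1C(=O)O",
    ),
)

# LEFT JOIN keeps compounds that have properties but no structure entry (assumption)
sql = """
    SELECT cp.molregno, cp.alogp, cp.hba, cp.hbd,
           cs.standard_inchi, cs.standard_inchi_key, cs.canonical_smiles
    FROM compound_properties cp
    LEFT JOIN compound_structures cs ON cp.molregno = cs.molregno
    ORDER BY cp.molregno
"""
rows = con.execute(sql).fetchall()
```

With pandas installed, the same SQL string could be passed to `pd.read_sql_query(sql, con)` to obtain a DataFrame.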
- add_chembl_compound_properties.get_first_publication_cpd_date(chembl_con: Connection, limit_to_literature: bool) → DataFrame
Query and calculate the first publication of a compound based on ChEMBL data (column name: first_publication_cpd). If limit_to_literature is True, this corresponds to the first appearance of the compound in the literature according to ChEMBL. Otherwise this is the first appearance in any source in ChEMBL.
- Parameters:
chembl_con (sqlite3.Connection) – SQLite connection to the ChEMBL database.
limit_to_literature (bool) – Base first_publication_cpd on literature sources only if True. Base it on all available sources otherwise.
- Returns:
Pandas DataFrame with parent_molregno and first_publication_cpd from ChEMBL.
- Return type:
pd.DataFrame
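The effect of limit_to_literature can be illustrated with the sketch below. It uses a single hypothetical table, `toy_records`, instead of the real ChEMBL tables, and it assumes that literature sources are identified by `src_id = 1`. Both are assumptions for the example, as is the helper name `first_publication_sketch`.

```python
import sqlite3


def first_publication_sketch(con, limit_to_literature):
    """Return (parent_molregno, first_publication_cpd) rows from the toy table."""
    # Restrict to literature sources only when requested (src_id = 1 is an assumption)
    where = "WHERE src_id = 1" if limit_to_literature else ""
    sql = (
        "SELECT parent_molregno, MIN(year) AS first_publication_cpd "
        f"FROM toy_records {where} "
        "GROUP BY parent_molregno ORDER BY parent_molregno"
    )
    return con.execute(sql).fetchall()


# Toy data: compound 1 appears in a non-literature source before the literature
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE toy_records (parent_molregno INTEGER, year INTEGER, src_id INTEGER)")
con.executemany(
    "INSERT INTO toy_records VALUES (?, ?, ?)",
    [(1, 2005, 1), (1, 1999, 2), (2, 2010, 1)],
)

literature_only = first_publication_sketch(con, limit_to_literature=True)
all_sources = first_publication_sketch(con, limit_to_literature=False)
```

In this toy data, compound 1 gets an earlier first_publication_cpd when all sources are considered than when only literature sources are used.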