Columns in the Final Dataset

This page provides explanations for all columns available in the final dataset.

More information on ChEMBL-based columns can be found in the respective ChEMBL schema documentation. The information on this page mostly corresponds to the ChEMBL 32 schema documentation.

Initial Query

PChEMBL Values

The pchembl_value is later aggregated into mean, max and median per compound-target pair and dropped.

Column Name	Type	Info Re.	Based On	Description / Notes
pchembl_value	Float	Compound-Target Pair	ChEMBL: activities	Negative log of selected concentration-response activity values (IC50 / EC50 / XC50 / AC50 / Ki / Kd / Potency).
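
As a small illustration of the "negative log" in the description above, the following sketch converts a concentration in nM to the corresponding negative log10 of the molar value. The function name `pchembl_from_nm` is hypothetical and not part of ChEMBL or this dataset; ChEMBL supplies the pchembl_value directly.

```python
import math


def pchembl_from_nm(value_nm: float) -> float:
    """Return -log10 of a concentration given in nM (converted to molar first)."""
    if value_nm <= 0:
        raise ValueError("concentration must be positive")
    return -math.log10(value_nm * 1e-9)


# 10 nM is 1e-8 M, so the negative log10 is 8
print(round(pchembl_from_nm(10.0), 2))
```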

Compound Information

Column Name	Type	Info Re.	Based On	Description / Notes
parent_molregno	Int	Compound	ChEMBL: molecule_dictionary	Internal Primary Key for the molecule
parent_chemblid	String	Compound	“	ChEMBL identifier for this compound (for use on web interface etc)
parent_pref_name	String	Compound	“	Preferred name for the molecule
max_phase	Float	Compound	“	Maximum phase of development reached for the compound across all indications [1]
first_approval	Int	Compound	“	Earliest known approval year for the drug (NULL is the default value)
usan_year	Int	Compound	“	The year in which the application for a USAN/INN name was granted. (NULL is the default value)
black_box_warning	Int	Compound	“	Indicates that the drug has a black box warning (1 = yes, 0 = default value)
prodrug	Int	Compound	“	Indicates that the drug is a pro-drug (1 = yes, 0 = no, -1 = preclinical compound, i.e., not a drug)
oral	Int	Compound	“	Indicates whether the drug is known to be administered orally (1 = yes, 0 = default value)
parenteral	Int	Compound	“	Indicates whether the drug is known to be administered parenterally (1 = yes, 0 = default value)
topical	Int	Compound	“	Indicates whether the drug is known to be administered topically (1 = yes, 0 = default value)

Target Information

Column Name	Type	Info Re.	Based On	Description / Notes
tid	Int	Target	ChEMBL: assays	Unique ID for the target
mutation	String	Target	ChEMBL: variant_sequences	Details of variant(s) used, with residue positions adjusted to match provided sequence.
target_chembl_id	String	Target	ChEMBL: target_dictionary	ChEMBL identifier for this target (for use on web interface etc)
target_pref_name	String	Target	“	Preferred target name: manually curated
target_type	String	Target	“	Describes whether target is a protein, an organism, a tissue etc.
organism	String	Target	“	Source organism of molecular target or tissue, or the target organism if compound activity is reported in an organism rather than a protein or tissue

Helper Columns

These columns are combinations of other columns and make the dataset easier to process.

Column Name	Type	Info Re.	Based On	Description / Notes
tid_mutation	String	Target	tid + ‘_’ + mutation	Helper column
cpd_target_pair	String	Compound-Target Pair	parent_molregno + ‘_’ + tid	Helper column
cpd_target_pair_mutation	String	Compound-Target Pair	parent_molregno + ‘_’ + tid_mutation	Helper column
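
The concatenations in the table can be sketched as follows. The function name `helper_columns` is hypothetical, and the handling of targets without a mutation annotation (keeping the plain tid) is an assumption; the table does not specify it.

```python
def helper_columns(parent_molregno, tid, mutation=None):
    """Build the three helper identifiers from their source columns."""
    # Assumption: without a mutation annotation, tid_mutation is just the tid.
    tid_mutation = f"{tid}_{mutation}" if mutation else str(tid)
    return {
        "tid_mutation": tid_mutation,
        "cpd_target_pair": f"{parent_molregno}_{tid}",
        "cpd_target_pair_mutation": f"{parent_molregno}_{tid_mutation}",
    }


wild_type = helper_columns(123, 45)
mutant = helper_columns(123, 45, "T790M")
print(wild_type["cpd_target_pair_mutation"])  # 123_45
print(mutant["cpd_target_pair_mutation"])     # 123_45_T790M
```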

Aggregated Values

Aggregated per compound-target pair using parent_molregno and tid_mutation.

Column Name	Type	Info Re.	Description / Notes
pchembl_value_mean_BF / _B	Float	Compound-Target Pair	Mean pchembl_value for the compound-target pair
pchembl_value_max_BF / _B	Float	Compound-Target Pair	Maximum pchembl_value for the compound-target pair
pchembl_value_median_BF / _B	Float	Compound-Target Pair	Median pchembl_value for the compound-target pair
first_publication_cpd_target_pair_BF / _B	Int	Compound-Target Pair	First publication in ChEMBL with this compound-target pair
first_publication_cpd_target_pair_w_pchembl_BF / _B	Int	Compound-Target Pair	First publication in ChEMBL with this compound-target pair and an associated pchembl value

Naming Convention: B vs. BF

These values are aggregated based on different subsets of the full dataset. The corresponding columns in the final dataset have a suffix that corresponds to the assay types the value is based on:

_BF: based on binding + functional assays
_B: based on binding assays
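
The aggregation and the suffix convention can be sketched with the standard library as shown below. This is a minimal illustration, not the dataset's actual code: the function name `aggregate_pchembl`, the row layout, and the `assay_type` field with the codes "B" (binding) and "F" (functional) are assumptions.

```python
from collections import defaultdict
from statistics import mean, median


def aggregate_pchembl(rows, assay_types, suffix):
    """Aggregate pchembl_value per (parent_molregno, tid_mutation) pair."""
    grouped = defaultdict(list)
    for row in rows:
        if row["assay_type"] in assay_types and row["pchembl_value"] is not None:
            key = (row["parent_molregno"], row["tid_mutation"])
            grouped[key].append(row["pchembl_value"])
    return {
        key: {
            f"pchembl_value_mean_{suffix}": mean(values),
            f"pchembl_value_max_{suffix}": max(values),
            f"pchembl_value_median_{suffix}": median(values),
        }
        for key, values in grouped.items()
    }


rows = [
    {"parent_molregno": 1, "tid_mutation": "10", "assay_type": "B", "pchembl_value": 6.0},
    {"parent_molregno": 1, "tid_mutation": "10", "assay_type": "F", "pchembl_value": 8.0},
    {"parent_molregno": 1, "tid_mutation": "10", "assay_type": "B", "pchembl_value": 7.0},
]
agg_bf = aggregate_pchembl(rows, {"B", "F"}, "BF")  # binding + functional
agg_b = aggregate_pchembl(rows, {"B"}, "B")         # binding only
```

In this toy input, the functional measurement contributes to the _BF values only, so the _BF and _B aggregates of the same pair differ.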

DTI (Drug-Target Interaction) Annotations

Based on cpd_target_pair; mutation information is not taken into account.

Column Name	Type	Info Re.	Based On	Description / Notes
therapeutic_target	Bool	Target	ChEMBL: drug_mechanism table	Is the target in the drug mechanism table?
DTI	String	Compound-Target Pair	Assigned as below	Drug target interaction (DTI) annotation

Mechanism to Assign DTI

In DM Table? [2]	max_phase? [3]	Th. Target? [4]	DTI	Explanation
Yes	4	–	D_DT	Drug - drug target
Yes	3	–	C3_DT	Clinical candidate in phase 3 - drug target
Yes	2	–	C2_DT	Clinical candidate in phase 2 - drug target
Yes	1	–	C1_DT	Clinical candidate in phase 1 - drug target
Yes	< 1	–	C0_DT	Compound in unknown clinical phase [5] - drug target
No	–	Yes	DT	Drug target
No	–	No	NDT	Not drug target
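
The decision table above can be read as the following function. This is a sketch under stated assumptions: the name `assign_dti` is hypothetical, and returning None for a pair in the drug mechanism table whose max_phase is missing or not covered by the table is an assumption, since the table does not list that case.

```python
import math


def assign_dti(pair_in_dm_table, max_phase, therapeutic_target):
    """Map the three inputs of the decision table to a DTI annotation."""
    if pair_in_dm_table:
        missing = max_phase is None or (
            isinstance(max_phase, float) and math.isnan(max_phase)
        )
        if missing:
            return None  # assumption: case not covered by the table
        phase_labels = {4: "D_DT", 3: "C3_DT", 2: "C2_DT", 1: "C1_DT"}
        if max_phase in phase_labels:
            return phase_labels[max_phase]
        if max_phase < 1:
            return "C0_DT"
        return None  # assumption: unexpected max_phase value
    # Pair is not in the drug mechanism table: only the target matters.
    return "DT" if therapeutic_target else "NDT"


print(assign_dti(True, 4.0, True))     # D_DT
print(assign_dti(True, 0.5, True))     # C0_DT
print(assign_dti(False, None, True))   # DT
print(assign_dti(False, None, False))  # NDT
```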

MAX_PHASE in ChEMBL

Before ChEMBL 32, compounds with a max_phase not between 1 and 4 were assigned a max_phase of 0.

From ChEMBL 32 onwards, compounds with a max_phase not between 1 and 4 can have three possible values:

- 0.5 = early phase 1 clinical trials
- -1 = clinical phase unknown for drug or clinical candidate drug, i.e., where ChEMBL cannot assign a clinical phase
- NULL = preclinical compounds with bioactivity data
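
When working with ChEMBL 32 or later, the three special values need to be distinguished explicitly, because NULL usually arrives as None or NaN and fails any numeric comparison. A minimal sketch, with the hypothetical helper name `describe_max_phase`:

```python
import math


def describe_max_phase(max_phase):
    """Return a short label for a max_phase value from ChEMBL 32 onwards."""
    if max_phase is None or (isinstance(max_phase, float) and math.isnan(max_phase)):
        return "preclinical compound with bioactivity data"
    if max_phase == 0.5:
        return "early phase 1 clinical trials"
    if max_phase == -1:
        return "clinical phase unknown"
    if max_phase in (1, 2, 3, 4):
        return f"max_phase {int(max_phase)}"
    raise ValueError(f"unexpected max_phase: {max_phase!r}")


print(describe_max_phase(float("nan")))  # preclinical compound with bioactivity data
print(describe_max_phase(-1.0))          # clinical phase unknown
```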

Compound and Target Properties Based on ChEMBL Data

First publication

In contrast to the aggregated time-related fields, this field takes all of ChEMBL and not just the time-related data within the dataset into account.

Column Name	Type	Info Re.	Based On	Description / Notes
first_publication_cpd	Int	Compound	ChEMBL: docs	First appearance of the compound in the literature

Compound Properties

Column Name	Type	Info Re.	Based On	Description / Notes
mw_freebase	Float	Compound	ChEMBL: compound_properties	Molecular weight of parent compound
alogp	Float	Compound	“	Calculated ALogP
hba	Int	Compound	“	Number of hydrogen bond acceptors
hbd	Int	Compound	“	Number of hydrogen bond donors
psa	Float	Compound	“	Polar surface area
rtb	Int	Compound	“	Number of rotatable bonds
ro3_pass	String	Compound	“	Indicates whether the compound passes the rule-of-three (mw < 300, logP < 3 etc)
num_ro5_violations	Int	Compound	“	Number of violations of Lipinski’s rule-of-five, using HBA and HBD definitions
full_mwt	Float	Compound	“	Molecular weight of the full compound including any salts
aromatic_rings	Int	Compound	“	Number of aromatic rings
heavy_atoms	Int	Compound	“	Number of heavy (non-hydrogen) atoms
qed_weighted	Float	Compound	“	Weighted quantitative estimate of drug likeness (as defined by Bickerton et al., Nature Chem 2012)
full_molformula	String	Compound	“	Molecular formula for the full compound (including any salt)

Compound Structures

Column Name	Type	Info Re.	Based On	Description / Notes
standard_inchi	String	Compound	ChEMBL: compound_structures	IUPAC standard InChI for the compound
standard_inchi_key	String	Compound	“	IUPAC standard InChI key for the compound
canonical_smiles	String	Compound	“	Canonical SMILES, generated using RDKit

ATC and Target Class

Column Name	Type	Info Re.	Based On	Description / Notes
atc_level1	String	Compound	ChEMBL: atc_classification, molecule_atc_classification	Anatomical Therapeutic Chemical (ATC) classification, level 1
target_class_l1	String	Target	ChEMBL: protein_classification, protein_family_classification	Target class, level 1 (more general)
target_class_l2	String	Target	“	Target class, level 2 (more detailed)

Ligand Efficiency Metrics

Calculated based on pchembl_value_mean.

Since LE metrics are based on pChEMBL values, they are calculated twice: once for the pChEMBL values based on binding and functional assays (suffix _BF) and once for the pChEMBL values based on binding assays only (suffix _B).

Column Name	Type	Info Re.	Description / Notes
LE_BF / LE_B	Float	Compound	Ligand efficiency
BEI_BF / BEI_B	Float	Compound	Binding efficiency index
SEI_BF / SEI_B	Float	Compound	Surface efficiency index
LLE_BF / LLE_B	Float	Compound	Lipophilic ligand efficiency

Equations

\begin{flalign*}
LE &= \frac{2.303 \cdot 298 \cdot 0.00199 \cdot pchembl\_mean}{heavy\_atoms} \\
BEI &= \frac{pchembl\_mean \cdot 1000}{mw\_freebase} \\
SEI &= \frac{pchembl\_mean \cdot 100}{PSA} \\
LLE &= pchembl\_mean - ALogP
\end{flalign*}
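
The equations translate directly into code. The sketch below uses the mean pChEMBL value for all four metrics, as stated at the top of this section. The function name `ligand_efficiency_metrics` is hypothetical, and returning None when a denominator is zero or missing (for example a PSA of 0) is an assumption, not documented behaviour.

```python
def ligand_efficiency_metrics(pchembl_mean, heavy_atoms, mw_freebase, psa, alogp):
    """Compute LE, BEI, SEI and LLE from a mean pChEMBL value."""

    def safe_div(numerator, denominator):
        # Assumption: undefined ratios are reported as None.
        return numerator / denominator if denominator else None

    return {
        "LE": safe_div(2.303 * 298 * 0.00199 * pchembl_mean, heavy_atoms),
        "BEI": safe_div(pchembl_mean * 1000, mw_freebase),
        "SEI": safe_div(pchembl_mean * 100, psa),
        "LLE": pchembl_mean - alogp if alogp is not None else None,
    }


metrics = ligand_efficiency_metrics(
    pchembl_mean=8.0, heavy_atoms=20, mw_freebase=400.0, psa=80.0, alogp=3.0
)
print(metrics["BEI"], metrics["SEI"], metrics["LLE"])  # 20.0 10.0 5.0
```

Calling the function once with pchembl_value_mean_BF and once with pchembl_value_mean_B gives the _BF and _B variants of each metric.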

RDKit-Based Compound Descriptors

Built-in Methods

These compound descriptors are calculated using built-in RDKit methods from Descriptors and rdMolDescriptors.

Column Name	Type	Info Re.	Based On	Description / Notes
fraction_csp3	Float	Compound	canonical_smiles + built-in RDKit methods	Fraction of C atoms that are SP3 hybridized (rdkit.Chem.Descriptors.FractionCSP3)
ring_count	Int	Compound	“	Number of rings (rdkit.Chem.Descriptors.RingCount)
num_aliphatic_rings	Int	Compound	“	Number of aliphatic (containing at least one non-aromatic bond) rings (rdkit.Chem.Descriptors.NumAliphaticRings)
num_aliphatic_carbocycles	Int	Compound	“	Number of aliphatic (containing at least one non-aromatic bond) carbocycles (rdkit.Chem.Descriptors.NumAliphaticCarbocycles)
num_aliphatic_heterocycles	Int	Compound	“	Number of aliphatic (containing at least one non-aromatic bond) heterocycles (rdkit.Chem.Descriptors.NumAliphaticHeterocycles)
num_aromatic_rings	Int	Compound	“	Number of aromatic rings (rdkit.Chem.Descriptors.NumAromaticRings)
num_aromatic_carbocycles	Int	Compound	“	Number of aromatic carbocycles (rdkit.Chem.Descriptors.NumAromaticCarbocycles)
num_aromatic_heterocycles	Int	Compound	“	Number of aromatic heterocycles (rdkit.Chem.Descriptors.NumAromaticHeterocycles)
num_saturated_rings	Int	Compound	“	Number of saturated rings (rdkit.Chem.Descriptors.NumSaturatedRings)
num_saturated_carbocycles	Int	Compound	“	Number of saturated carbocycles (rdkit.Chem.Descriptors.NumSaturatedCarbocycles)
num_saturated_heterocycles	Int	Compound	“	Number of saturated heterocycles (rdkit.Chem.Descriptors.NumSaturatedHeterocycles)
num_stereocentres	Int	Compound	“	Number of atomic stereocenters (specified and unspecified) (rdkit.Chem.rdMolDescriptors.CalcNumAtomStereoCenters)
num_heteroatoms	Int	Compound	“	Number of heteroatoms (rdkit.Chem.Descriptors.NumHeteroatoms)
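
The RDKit functions named in the table can be called on a molecule parsed from canonical_smiles, as in the sketch below. It requires RDKit to be installed; the wrapper name `builtin_descriptors` is hypothetical, and only a subset of the columns is shown.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, rdMolDescriptors


def builtin_descriptors(smiles: str) -> dict:
    """Compute a subset of the built-in RDKit descriptors for one SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    return {
        "fraction_csp3": Descriptors.FractionCSP3(mol),
        "ring_count": Descriptors.RingCount(mol),
        "num_aliphatic_rings": Descriptors.NumAliphaticRings(mol),
        "num_aromatic_rings": Descriptors.NumAromaticRings(mol),
        "num_saturated_rings": Descriptors.NumSaturatedRings(mol),
        "num_stereocentres": rdMolDescriptors.CalcNumAtomStereoCenters(mol),
        "num_heteroatoms": Descriptors.NumHeteroatoms(mol),
    }


# Aspirin: one aromatic ring, four oxygen atoms, no stereocenters
aspirin = builtin_descriptors("CC(=O)Oc1ccccc1C(=O)O")
print(aspirin)
```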

Bespoke Methods

These compound descriptors are calculated using custom RDKit-based methods.

Column Name	Type	Info Re.	Based On	Description / Notes
aromatic_atoms	Int	Compound	canonical_smiles + RDKit-based methods	Number of aromatic atoms
aromatic_c	Int	Compound	“	Number of aromatic C
aromatic_n	Int	Compound	“	Number of aromatic N
aromatic_hetero	Int	Compound	“	Number of aromatic hetero atoms
scaffold_w_stereo	String	Compound	“	Scaffold SMILES, including stereochemistry information
scaffold_wo_stereo	String	Compound	“	Scaffold SMILES of the molecule after removing stereochemistry information
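
The custom methods themselves are not reproduced on this page. The following sketch shows one plausible way to obtain comparable values with RDKit. It rests on assumptions: that "hetero atoms" means aromatic atoms other than carbon, and that the scaffold is the Bemis-Murcko scaffold from rdkit.Chem.Scaffolds.MurckoScaffold. The function names `aromatic_counts` and `scaffolds` are hypothetical.

```python
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold


def aromatic_counts(mol) -> dict:
    """Count aromatic atoms overall and by element class."""
    aromatic = [atom for atom in mol.GetAtoms() if atom.GetIsAromatic()]
    return {
        "aromatic_atoms": len(aromatic),
        "aromatic_c": sum(atom.GetSymbol() == "C" for atom in aromatic),
        "aromatic_n": sum(atom.GetSymbol() == "N" for atom in aromatic),
        # Assumption: every aromatic atom that is not carbon is a hetero atom.
        "aromatic_hetero": sum(atom.GetSymbol() != "C" for atom in aromatic),
    }


def scaffolds(mol):
    """Return (scaffold_w_stereo, scaffold_wo_stereo) as SMILES strings."""
    with_stereo = Chem.MolToSmiles(MurckoScaffold.GetScaffoldForMol(mol))
    flat = Chem.Mol(mol)              # work on a copy
    Chem.RemoveStereochemistry(flat)  # strips stereo information in place
    without_stereo = Chem.MolToSmiles(MurckoScaffold.GetScaffoldForMol(flat))
    return with_stereo, without_stereo


mol = Chem.MolFromSmiles("Cc1ccncc1")  # 4-methylpyridine
counts = aromatic_counts(mol)
scaffold_w, scaffold_wo = scaffolds(mol)
```

For an achiral molecule such as this one, both scaffold strings are identical; they can differ only when the scaffold itself carries stereochemistry.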

Annotations for Filtering

These columns are only available in the full dataset, where they facilitate filtering into subsets.

Helper Columns

pair_mutation_in_dm_table and pair_in_dm_table are similar fields. They differ in whether mutation information is taken into account, reflecting that mutation information is only sometimes taken into account when calculating fields and adding rows to the dataset.

pair_mutation_in_dm_table:
Is the compound-target pair in the drug_mechanism table when taking mutation information into account? Mutation information IS taken into account when pairs are added to the dataset because they appear in the drug_mechanism table. For example, the drug_mechanism pair (cpd A, target B without mutation) is added to the set of existing compound-target pairs with pChEMBL values if there is a pChEMBL value for (cpd A, target B with mutation C) but none for (cpd A, target B without mutation). This field is used to determine keep_for_binding, which in turn is used to determine the B subset of data based on binding assays.

pair_in_dm_table:
Is the compound-target pair in the drug_mechanism table when ignoring mutation information? Mutation information is NOT taken into account when assigning DTI values.

Column Name	Type	Info Re.	Description / Notes
pair_mutation_in_dm_table	Bool	Compound-Target Pair	Is the compound-target pair (taking mutation annotation into account) in the drug mechanism table?
pair_in_dm_table	Bool	Compound-Target Pair	Is the compound-target pair (ignoring mutation annotation) in the drug mechanism table?
keep_for_binding	Bool	Compound-Target Pair	Rows to keep if interested in information based only on binding assays + the drug_mechanism table. True if pchembl_value_mean_B (based on binding assays) exists or if pair_mutation_in_dm_table == True, i.e., the pair (including mutation information) is in the drug mechanism table.
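
The rule for keep_for_binding in the last row can be written as a small predicate. This is a sketch; the function name `keep_for_binding` mirrors the column, and treating both None and NaN as "no value" is an assumption about how missing pchembl_value_mean_B entries are stored.

```python
import math


def keep_for_binding(pchembl_value_mean_b, pair_mutation_in_dm_table):
    """True if a binding-based mean exists or the pair is in the drug mechanism table."""
    has_binding_value = pchembl_value_mean_b is not None and not (
        isinstance(pchembl_value_mean_b, float) and math.isnan(pchembl_value_mean_b)
    )
    return has_binding_value or bool(pair_mutation_in_dm_table)


print(keep_for_binding(7.2, False))           # True: binding value exists
print(keep_for_binding(float("nan"), True))   # True: pair is in the drug mechanism table
print(keep_for_binding(float("nan"), False))  # False: neither condition holds
```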

Filtering Columns

Column Name	Type	Info Re.	Assays	#Comparators [6]	Other
BF_100	Bool	Compound-Target Pair	binding + functional	>= 100
BF_100_c_dt_d_dt	Bool	Compound-Target Pair	binding + functional	>= 100	at least one compound with an annotation of D_DT or C<p>_DT (C0_DT, C1_DT, C2_DT, C3_DT) per target
BF_100_d_dt	Bool	Compound-Target Pair	binding + functional	>= 100	at least one compound with an annotation of D_DT per target
B_100	Bool	Compound-Target Pair	binding	>= 100
B_100_c_dt_d_dt	Bool	Compound-Target Pair	binding	>= 100	at least one compound with an annotation of D_DT or C<p>_DT (C0_DT, C1_DT, C2_DT, C3_DT) per target
B_100_d_dt	Bool	Compound-Target Pair	binding	>= 100	at least one compound with an annotation of D_DT per target