Columns in the Final Dataset
This page provides explanations for all columns available in the final dataset.
More information on ChEMBL-based columns can be found in the respective ChEMBL schema documentation. The information on this page mostly corresponds to the ChEMBL 32 schema documentation.
Initial Query
PChEMBL Values
The pchembl_value is later aggregated into mean, max and median values per compound-target pair; the raw column is then dropped.
Column Name | Type | Info Re. | Based On | Description / Notes |
---|---|---|---|---|
pchembl_value | Float | Compound-Target Pair | ChEMBL: activities | Negative log of selected concentration-response activity values (IC50 / EC50 / XC50 / AC50 / Ki / Kd / Potency). |
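As an illustration of the "negative log" in the description: ChEMBL defines pChEMBL as the negative base-10 logarithm of the activity value on a molar scale. The sketch below assumes the input concentration is given in nM; the function name is hypothetical and not part of the dataset code.

```python
import math


def pchembl_from_nm(value_nm: float) -> float:
    """Return -log10 of a concentration given in nM (converted to mol/L first)."""
    if value_nm <= 0:
        raise ValueError("concentration must be positive")
    molar = value_nm * 1e-9  # nM -> mol/L
    return -math.log10(molar)


# A 10 nM IC50 corresponds to a value of about 8; 1000 nM (1 uM) to about 6.
example_10nm = pchembl_from_nm(10.0)
example_1um = pchembl_from_nm(1000.0)
```

Higher values therefore indicate more potent compounds, which is why the maximum is a meaningful aggregate alongside mean and median.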
Compound Information
Column Name | Type | Info Re. | Based On | Description / Notes |
---|---|---|---|---|
parent_molregno | Int | Compound | ChEMBL: molecule_dictionary | Internal primary key for the molecule |
parent_chemblid | String | Compound | “ | ChEMBL identifier for this compound (for use on web interface etc.) |
parent_pref_name | String | Compound | “ | Preferred name for the molecule |
max_phase | Float | Compound | “ | Maximum phase of development reached for the compound across all indications [1] |
first_approval | Int | Compound | “ | Earliest known approval year for the drug (NULL is the default value) |
usan_year | Int | Compound | “ | The year in which the application for a USAN/INN name was granted (NULL is the default value) |
black_box_warning | Int | Compound | “ | Indicates that the drug has a black box warning (1 = yes, 0 = default value) |
prodrug | Int | Compound | “ | Indicates that the drug is a pro-drug (1 = yes, 0 = no, -1 = preclinical compound, i.e., not a drug) |
oral | Int | Compound | “ | Indicates whether the drug is known to be administered orally (1 = yes, 0 = default value) |
parenteral | Int | Compound | “ | Indicates whether the drug is known to be administered parenterally (1 = yes, 0 = default value) |
topical | Int | Compound | “ | Indicates whether the drug is known to be administered topically (1 = yes, 0 = default value) |
Target Information
Column Name | Type | Info Re. | Based On | Description / Notes |
---|---|---|---|---|
tid | Int | Target | ChEMBL: assays | Unique ID for the target |
mutation | String | Target | ChEMBL: variant_sequences | Details of variant(s) used, with residue positions adjusted to match provided sequence. |
target_chembl_id | String | Target | ChEMBL: target_dictionary | ChEMBL identifier for this target (for use on web interface etc.) |
target_pref_name | String | Target | “ | Preferred target name: manually curated |
target_type | String | Target | “ | Describes whether target is a protein, an organism, a tissue etc. |
organism | String | Target | “ | Source organism of molecular target or tissue, or the target organism if compound activity is reported in an organism rather than a protein or tissue |
Helper Columns
These columns are combinations of other columns, provided for easier processing of the dataset.
Column Name | Type | Info Re. | Based On | Description / Notes |
---|---|---|---|---|
tid_mutation | String | Target | tid + '_' + mutation | Helper column |
cpd_target_pair | String | Compound-Target Pair | parent_molregno + '_' + tid | Helper column |
cpd_target_pair_mutation | String | Compound-Target Pair | parent_molregno + '_' + tid_mutation | Helper column |
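The concatenations in the "Based On" column can be sketched as follows. This is a minimal illustration, not the pipeline's code: the function name is hypothetical, and rendering a missing mutation as the string "None" is an assumption, since this page does not say how missing mutations are represented.

```python
from typing import Dict, Optional


def make_helper_keys(
    parent_molregno: int,
    tid: int,
    mutation: Optional[str],
    missing: str = "None",  # assumption: placeholder used when no mutation is annotated
) -> Dict[str, str]:
    """Build the three helper keys by joining the source columns with '_'."""
    mutation_str = missing if mutation is None else mutation
    tid_mutation = f"{tid}_{mutation_str}"
    return {
        "tid_mutation": tid_mutation,
        "cpd_target_pair": f"{parent_molregno}_{tid}",
        "cpd_target_pair_mutation": f"{parent_molregno}_{tid_mutation}",
    }


keys = make_helper_keys(1234, 56, "T790M")
```

Because cpd_target_pair omits the mutation, several cpd_target_pair_mutation values can map to the same cpd_target_pair.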
Aggregated Values
Aggregated per compound-target pair using parent_molregno and tid_mutation.
Column Name | Type | Info Re. | Description / Notes |
---|---|---|---|
pchembl_value_mean_BF / _B | Float | Compound-Target Pair | Mean pchembl_value for the compound-target pair |
pchembl_value_max_BF / _B | Float | Compound-Target Pair | Maximum pchembl_value for the compound-target pair |
pchembl_value_median_BF / _B | Float | Compound-Target Pair | Median pchembl_value for the compound-target pair |
first_publication_cpd_target_pair_BF / _B | Int | Compound-Target Pair | First publication in ChEMBL with this compound-target pair |
first_publication_cpd_target_pair_w_pchembl_BF / _B | Int | Compound-Target Pair | First publication in ChEMBL with this compound-target pair and an associated pchembl value |
Naming Convention: B vs. BF
These values are aggregated based on different subsets of the full dataset. The corresponding columns in the final dataset have a suffix that corresponds to the assay types the value is based on:
- _BF: based on binding + functional assays
- _B: based on binding assays
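The aggregation and the suffix convention can be sketched with the standard library. The row layout and the function name are hypothetical; the assay type codes "B" (binding) and "F" (functional) follow ChEMBL's assay_type field, and grouping uses parent_molregno and tid_mutation as stated above.

```python
from collections import defaultdict
from statistics import mean, median


def aggregate_pchembl(rows, assay_types, suffix):
    """Aggregate pchembl_value per (parent_molregno, tid_mutation) for the given assay types."""
    grouped = defaultdict(list)
    for row in rows:
        if row["assay_type"] in assay_types:
            key = (row["parent_molregno"], row["tid_mutation"])
            grouped[key].append(row["pchembl_value"])
    return {
        key: {
            f"pchembl_value_mean_{suffix}": mean(values),
            f"pchembl_value_max_{suffix}": max(values),
            f"pchembl_value_median_{suffix}": median(values),
        }
        for key, values in grouped.items()
    }


rows = [
    {"parent_molregno": 1, "tid_mutation": "10_None", "assay_type": "B", "pchembl_value": 6.0},
    {"parent_molregno": 1, "tid_mutation": "10_None", "assay_type": "F", "pchembl_value": 8.0},
    {"parent_molregno": 1, "tid_mutation": "10_None", "assay_type": "B", "pchembl_value": 7.0},
]
agg_bf = aggregate_pchembl(rows, {"B", "F"}, "BF")  # binding + functional
agg_b = aggregate_pchembl(rows, {"B"}, "B")         # binding only
```

A pair that only has functional measurements would appear in the _BF result but not in the _B result, which is why the _B columns can be empty for rows that have _BF values.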
DTI (Drug-Target Interaction) Annotations
Based on cpd_target_pair; mutation information is not taken into account.
Column Name | Type | Info Re. | Based On | Description / Notes |
---|---|---|---|---|
therapeutic_target | Bool | Target | ChEMBL: drug_mechanism table | Is the target in the drug mechanism table? |
DTI | String | Compound-Target Pair | Assigned as below | Drug target interaction (DTI) annotation |
Mechanism to Assign DTI
In DM Table? [2] | max_phase? [3] | Th. Target? [4] | DTI | Explanation |
---|---|---|---|---|
Yes | 4 | – | D_DT | Drug - drug target |
Yes | 3 | – | C3_DT | Clinical candidate in phase 3 - drug target |
Yes | 2 | – | C2_DT | Clinical candidate in phase 2 - drug target |
Yes | 1 | – | C1_DT | Clinical candidate in phase 1 - drug target |
Yes | < 1 | – | C0_DT | Compound in unknown clinical phase [5] - drug target |
No | – | Yes | DT | Drug target |
No | – | No | NDT | Not drug target |
[2] Is the compound-target pair in the drug_mechanism table? = Is it a known relevant compound-target interaction?
[3] What is the max_phase of the compound? = Is it a drug / clinical compound?
[4] Is the target in the drug_mechanism table? = Is it a therapeutic target?
[5] There have been changes to the max_phase field in ChEMBL with version 32. C0_DT groups together all compounds with a max_phase not between 1 and 4. See MAX_PHASE in ChEMBL.
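The decision table above can be read as a small function. This is a sketch of the table's logic, not the pipeline's implementation; the function and parameter names are hypothetical. Any max_phase that is not exactly 1, 2, 3 or 4 (including missing values) falls into C0_DT, as described in the C0_DT note.

```python
from typing import Optional


def assign_dti(
    pair_in_dm_table: bool,
    max_phase: Optional[float],
    therapeutic_target: bool,
) -> str:
    """Return the DTI annotation for one compound-target pair."""
    if pair_in_dm_table:
        if max_phase in (1, 2, 3, 4):
            # Phase 4 compounds are drugs; phases 1-3 are clinical candidates.
            return "D_DT" if max_phase == 4 else f"C{int(max_phase)}_DT"
        # Anything else (e.g. values below 1 or a missing value) is grouped as C0_DT.
        return "C0_DT"
    # The pair itself is not a known mechanism; only the target status matters.
    return "DT" if therapeutic_target else "NDT"


labels = [
    assign_dti(True, 4.0, True),     # drug and its target
    assign_dti(True, 2.0, True),     # phase 2 candidate and its target
    assign_dti(True, None, True),    # mechanism known, clinical phase unknown
    assign_dti(False, 4.0, True),    # target is therapeutic, pair is not a known mechanism
    assign_dti(False, None, False),  # neither
]
```

Note that max_phase is only consulted when the pair is in the drug mechanism table, matching the "–" entries in the table.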
MAX_PHASE in ChEMBL
Before ChEMBL 32, compounds with a max_phase not between 1 and 4 were assigned a max_phase of 0.
Compound and Target Properties Based on ChEMBL Data
First publication
In contrast to the aggregated time-related fields, this field takes all of ChEMBL into account, not just the time-related data within the dataset.
Column Name | Type | Info Re. | Based On | Description / Notes |
---|---|---|---|---|
first_publication_cpd | Int | Compound | ChEMBL: docs | First appearance of the compound in the literature |
Compound Properties
Column Name | Type | Info Re. | Based On | Description / Notes |
---|---|---|---|---|
mw_freebase | Float | Compound | ChEMBL: compound_properties | Molecular weight of parent compound |
alogp | Float | Compound | “ | Calculated ALogP |
hba | Int | Compound | “ | Number of hydrogen bond acceptors |
hbd | Int | Compound | “ | Number of hydrogen bond donors |
psa | Float | Compound | “ | Polar surface area |
rtb | Int | Compound | “ | Number of rotatable bonds |
ro3_pass | String | Compound | “ | Indicates whether the compound passes the rule-of-three (mw < 300, logP < 3 etc.) |
num_ro5_violations | Int | Compound | “ | Number of violations of Lipinski’s rule-of-five, using HBA and HBD definitions |
cx_most_apka | Float | Compound | “ | The most acidic pKa calculated using ChemAxon |
cx_most_bpka | Float | Compound | “ | The most basic pKa calculated using ChemAxon |
cx_logp | Float | Compound | “ | The calculated octanol/water partition coefficient using ChemAxon |
cx_logd | Float | Compound | “ | The calculated octanol/water distribution coefficient at pH 7.4 using ChemAxon |
molecular_species | String | Compound | “ | Indicates whether the compound is an acid, a base or neutral |
full_mwt | Float | Compound | “ | Molecular weight of the full compound including any salts |
aromatic_rings | Int | Compound | “ | Number of aromatic rings |
heavy_atoms | Int | Compound | “ | Number of heavy (non-hydrogen) atoms |
qed_weighted | Float | Compound | “ | Weighted quantitative estimate of drug likeness (as defined by Bickerton et al., Nature Chem 2012) |
mw_monoisotopic | Float | Compound | “ | Monoisotopic parent molecular weight |
full_molformula | String | Compound | “ | Molecular formula for the full compound (including any salt) |
hba_lipinski | Int | Compound | “ | Number of hydrogen bond acceptors calculated according to Lipinski’s original rules (i.e., N + O count) |
hbd_lipinski | Int | Compound | “ | Number of hydrogen bond donors calculated according to Lipinski’s original rules (i.e., NH + OH count) |
num_lipinski_ro5_violations | Int | Compound | “ | Number of violations of Lipinski’s rule of five using HBA_LIPINSKI and HBD_LIPINSKI counts |
Compound Structures
Column Name | Type | Info Re. | Based On | Description / Notes |
---|---|---|---|---|
standard_inchi | String | Compound | ChEMBL: compound_structures | IUPAC standard InChI for the compound |
standard_inchi_key | String | Compound | “ | IUPAC standard InChI key for the compound |
canonical_smiles | String | Compound | “ | Canonical SMILES, generated using RDKit |
ATC and Target Class
Column Name | Type | Info Re. | Based On | Description / Notes |
---|---|---|---|---|
atc_level1 | String | Compound | ChEMBL: atc_classification, molecule_atc_classification | Anatomical Therapeutic Chemical (ATC) classification, level 1 |
target_class_l1 | String | Target | ChEMBL: protein_classification, protein_family_classification | Target class, level 1 (more general) |
target_class_l2 | String | Target | “ | Target class, level 2 (more detailed) |
Ligand Efficiency Metrics
Calculated based on pchembl_value_mean.
Since LE metrics are based on pChEMBL values, they are calculated twice: once for the pChEMBL values based on binding and functional assays (suffix _BF) and once for the pChEMBL values based on binding assays only (suffix _B).
Column Name | Type | Info Re. | Description / Notes |
---|---|---|---|
LE_BF / LE_B | Float | Compound | Ligand efficiency |
BEI_BF / BEI_B | Float | Compound | Binding efficiency index |
SEI_BF / SEI_B | Float | Compound | Surface efficiency index |
LLE_BF / LLE_B | Float | Compound | Lipophilic ligand efficiency |
Equations
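The equations themselves are not reproduced in this text. As a hedged sketch, the code below uses the commonly cited definitions of these four metrics, applied to the mean pChEMBL value. Treat the constants and the choice of input columns (heavy_atoms, mw_freebase, psa, alogp) as assumptions rather than as the dataset's confirmed formulas; the function name is hypothetical.

```python
from typing import Dict, Optional


def ligand_efficiency_metrics(
    pchembl_mean: float,
    heavy_atoms: int,
    mw_freebase: float,
    psa: float,
    alogp: float,
) -> Dict[str, Optional[float]]:
    """Compute LE, BEI, SEI and LLE from a mean pChEMBL value (assumed definitions)."""

    def safe_div(numerator: float, denominator: float) -> Optional[float]:
        # Avoid division by zero, e.g. for compounds with a polar surface area of 0.
        return numerator / denominator if denominator else None

    return {
        # LE = -deltaG / heavy atoms, with deltaG = -2.303 * R * T * pChEMBL,
        # R = 0.00199 kcal/(mol*K) and T = 298 K (assumed).
        "LE": safe_div(2.303 * 298 * 0.00199 * pchembl_mean, heavy_atoms),
        # BEI = pChEMBL / molecular weight in kDa.
        "BEI": safe_div(pchembl_mean * 1000, mw_freebase),
        # SEI = pChEMBL / (PSA / 100 A^2).
        "SEI": safe_div(pchembl_mean * 100, psa),
        # LLE = pChEMBL - logP.
        "LLE": pchembl_mean - alogp,
    }


metrics = ligand_efficiency_metrics(
    pchembl_mean=8.0, heavy_atoms=25, mw_freebase=400.0, psa=80.0, alogp=3.0
)
```

Running the same function once with pchembl_value_mean_BF and once with pchembl_value_mean_B would give the _BF and _B variants of each column.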
RDKit-Based Compound Descriptors
Built-in Methods
These compound descriptors are calculated using built-in RDKit methods from Descriptors and rdMolDescriptors.
Column Name | Type | Info Re. | Based On | Description / Notes |
---|---|---|---|---|
fraction_csp3 | Float | Compound | canonical_smiles + built-in RDKit methods | Fraction of C atoms that are SP3 hybridized (rdkit.Chem.Descriptors.FractionCSP3) |
ring_count | Int | Compound | “ | Number of rings (rdkit.Chem.Descriptors.RingCount) |
num_aliphatic_rings | Int | Compound | “ | Number of aliphatic (containing at least one non-aromatic bond) rings (rdkit.Chem.Descriptors.NumAliphaticRings) |
num_aliphatic_carbocycles | Int | Compound | “ | Number of aliphatic (containing at least one non-aromatic bond) carbocycles (rdkit.Chem.Descriptors.NumAliphaticCarbocycles) |
num_aliphatic_heterocycles | Int | Compound | “ | Number of aliphatic (containing at least one non-aromatic bond) heterocycles (rdkit.Chem.Descriptors.NumAliphaticHeterocycles) |
num_aromatic_rings | Int | Compound | “ | Number of aromatic rings (rdkit.Chem.Descriptors.NumAromaticRings) |
num_aromatic_carbocycles | Int | Compound | “ | Number of aromatic carbocycles (rdkit.Chem.Descriptors.NumAromaticCarbocycles) |
num_aromatic_heterocycles | Int | Compound | “ | Number of aromatic heterocycles (rdkit.Chem.Descriptors.NumAromaticHeterocycles) |
num_saturated_rings | Int | Compound | “ | Number of saturated rings (rdkit.Chem.Descriptors.NumSaturatedRings) |
num_saturated_carbocycles | Int | Compound | “ | Number of saturated carbocycles (rdkit.Chem.Descriptors.NumSaturatedCarbocycles) |
num_saturated_heterocycles | Int | Compound | “ | Number of saturated heterocycles (rdkit.Chem.Descriptors.NumSaturatedHeterocycles) |
num_stereocentres | Int | Compound | “ | Number of atomic stereocenters (specified and unspecified) (rdkit.Chem.rdMolDescriptors.CalcNumAtomStereoCenters) |
num_heteroatoms | Int | Compound | “ | Number of heteroatoms (rdkit.Chem.Descriptors.NumHeteroatoms) |
Bespoke Methods
These compound descriptors are calculated using custom RDKit-based methods.
Column Name | Type | Info Re. | Based On | Description / Notes |
---|---|---|---|---|
aromatic_atoms | Int | Compound | canonical_smiles + RDKit-based methods | Number of aromatic atoms |
aromatic_c | Int | Compound | “ | Number of aromatic C |
aromatic_n | Int | Compound | “ | Number of aromatic N |
aromatic_hetero | Int | Compound | “ | Number of aromatic heteroatoms |
scaffold_w_stereo | String | Compound | “ | Scaffold SMILES, including stereochemistry information |
scaffold_wo_stereo | String | Compound | “ | Scaffold SMILES of the molecule after removing stereochemistry information |
Annotations for Filtering
These columns are only available in the full dataset, where they facilitate filtering into subsets.
Helper Columns
pair_mutation_in_dm_table and pair_in_dm_table are similar fields. They differ in whether mutation information is taken into account, reflecting that mutation information is used in some steps (adding rows to the dataset) but not in others (assigning DTI values).
- pair_mutation_in_dm_table:
Is the compound-target pair in the drug_mechanism table when taking mutation information into account? Mutation information IS taken into account when pairs are added to the dataset because they appear in the drug_mechanism table. For example, (cpd A, target B without mutation) will be added to the set of existing compound-target pairs with pChEMBL values if there is a pair with a pChEMBL value for (cpd A, target B with mutation C) but no pair with a pChEMBL value for (cpd A, target B without mutation). This field is used to determine keep_for_binding, which in turn is used to determine the B subset of data based on binding assays.
- pair_in_dm_table:
Is the compound-target pair in the drug_mechanism table when ignoring mutation information? Mutation information is NOT taken into account when assigning DTI values.
Column Name | Type | Info Re. | Description / Notes |
---|---|---|---|
pair_mutation_in_dm_table | Bool | Compound-Target Pair | Is the compound-target pair (taking mutation annotation into account) in the drug mechanism table? |
pair_in_dm_table | Bool | Compound-Target Pair | Is the compound-target pair (ignoring mutation annotation) in the drug mechanism table? |
keep_for_binding | Bool | Compound-Target Pair | Rows to keep if interested in information based only on binding assays + the drug_mechanism table. True if pchembl_value_mean_B (based on binding assays) exists or if pair_mutation_in_dm_table == True, i.e., the pair (including mutation information) is in the drug mechanism table. |
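The keep_for_binding rule from the table can be written as a one-line predicate. The sketch below is illustrative only: the function name is hypothetical, and treating NaN as "no value" is an assumption about how missing pChEMBL means are stored.

```python
import math
from typing import Optional


def keep_for_binding(
    pchembl_value_mean_b: Optional[float],
    pair_mutation_in_dm_table: bool,
) -> bool:
    """True if a binding-based mean exists or the pair (with mutation) is in drug_mechanism."""
    has_binding_value = pchembl_value_mean_b is not None and not math.isnan(
        pchembl_value_mean_b
    )
    return has_binding_value or pair_mutation_in_dm_table


flags = [
    keep_for_binding(7.2, False),           # binding data only
    keep_for_binding(None, True),           # drug_mechanism pair without binding data
    keep_for_binding(float("nan"), False),  # functional-only pair, not in drug_mechanism
]
```

Rows for which this predicate is False are those that are in the dataset only because of functional assay data.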
Filtering Columns
Column Name | Type | Info Re. | Assays | #Comparators [6] | Other |
---|---|---|---|---|---|
BF_100 | Bool | Compound-Target Pair | binding + functional | >= 100 | |
BF_100_c_dt_d_dt | Bool | Compound-Target Pair | binding + functional | >= 100 | at least one compound with an annotation of D_DT or C<p>_DT (C0_DT, C1_DT, C2_DT, C3_DT) per target |
BF_100_d_dt | Bool | Compound-Target Pair | binding + functional | >= 100 | at least one compound with an annotation of D_DT per target |
B_100 | Bool | Compound-Target Pair | binding | >= 100 | |
B_100_c_dt_d_dt | Bool | Compound-Target Pair | binding | >= 100 | at least one compound with an annotation of D_DT or C<p>_DT (C0_DT, C1_DT, C2_DT, C3_DT) per target |
B_100_d_dt | Bool | Compound-Target Pair | binding | >= 100 | at least one compound with an annotation of D_DT per target |
[6] Comparator compounds in this context are all compounds with a pchembl_value_mean_BF / _B. This includes compounds with a DTI of D_DT or C<p>_DT.
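The filtering criteria can be sketched as a per-target check. This is an illustration under stated assumptions, not the pipeline's code: the function name and row layout are hypothetical, targets are keyed by tid here, and the DTI requirement is checked on any row of the target regardless of whether that row has a pChEMBL value. The toy example lowers the threshold from 100 to 2 to keep the data small.

```python
from collections import defaultdict


def targets_passing(rows, value_column, min_comparators=100, required_dti=None):
    """Return the targets with enough comparator compounds and, optionally, a required DTI."""
    comparators = defaultdict(set)  # target -> compounds with a value in value_column
    annotated = set()               # targets with at least one required DTI annotation
    for row in rows:
        if row.get(value_column) is not None:
            comparators[row["tid"]].add(row["parent_molregno"])
        if required_dti and row["DTI"] in required_dti:
            annotated.add(row["tid"])
    passing = {
        tid for tid, cpds in comparators.items() if len(cpds) >= min_comparators
    }
    return passing & annotated if required_dti else passing


col = "pchembl_value_mean_BF"
rows = [
    {"tid": 1, "parent_molregno": 1, col: 7.0, "DTI": "D_DT"},
    {"tid": 1, "parent_molregno": 2, col: 6.0, "DTI": "DT"},
    {"tid": 2, "parent_molregno": 3, col: 5.0, "DTI": "NDT"},
    {"tid": 2, "parent_molregno": 4, col: 5.5, "DTI": "NDT"},
    {"tid": 3, "parent_molregno": 5, col: 6.0, "DTI": "C2_DT"},
]
clinical = {"D_DT", "C0_DT", "C1_DT", "C2_DT", "C3_DT"}
bf_n = targets_passing(rows, col, min_comparators=2)
bf_n_d_dt = targets_passing(rows, col, min_comparators=2, required_dti={"D_DT"})
bf_n_c_dt_d_dt = targets_passing(rows, col, min_comparators=2, required_dti=clinical)
```

Passing pchembl_value_mean_B as value_column would give the corresponding B_100 style subsets.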