Columns in the Final Dataset

This page provides explanations for all columns available in the final dataset.

More information on ChEMBL-based columns can be found in the respective ChEMBL schema documentation. The information on this page mostly corresponds to the ChEMBL 32 schema documentation.

Initial Query

PChEMBL Values

The pchembl_value column is later aggregated into mean, maximum, and median values per compound-target pair and is then dropped.

Column Name

Type

Info Re.

Based On

Description / Notes

pchembl_value

Float

Compound-Target Pair

ChEMBL: activities

Negative log of selected concentration-response activity values (IC50 / EC50 / XC50 / AC50 / Ki / Kd / Potency).
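As a quick illustration of the definition above, a pChEMBL value is the negative base-10 logarithm of the activity expressed in molar units. The sketch below assumes the activity is given in nM; the function name pchembl_from_nm is hypothetical and not part of the dataset code.

```python
import math


def pchembl_from_nm(value_nm: float) -> float:
    """Convert an activity in nM (e.g., an IC50) to a pChEMBL-style value.

    -log10(value_nm * 1e-9) is rewritten as 9 - log10(value_nm)
    to avoid floating-point noise from the unit conversion.
    """
    if value_nm <= 0:
        raise ValueError("activity must be a positive concentration")
    return 9.0 - math.log10(value_nm)


# A 10 nM activity corresponds to a value of 8, a 1 uM activity to 6.
print(pchembl_from_nm(10.0))
print(pchembl_from_nm(1000.0))
```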

Compound Information

Column Name

Type

Info Re.

Based On

Description / Notes

parent_molregno

Int

Compound

ChEMBL: molecule_dictionary

Internal Primary Key for the molecule

parent_chemblid

String

Compound

ChEMBL identifier for this compound (for use on web interface etc)

parent_pref_name

String

Compound

Preferred name for the molecule

max_phase

Float

Compound

Maximum phase of development reached for the compound across all indications [1]

first_approval

Int

Compound

Earliest known approval year for the drug (NULL is the default value)

usan_year

Int

Compound

The year in which the application for a USAN/INN name was granted. (NULL is the default value)

black_box_warning

Int

Compound

Indicates that the drug has a black box warning (1 = yes, 0 = default value)

prodrug

Int

Compound

Indicates whether the drug is a prodrug (1 = yes, 0 = no, -1 = preclinical compound, i.e., not a drug)

oral

Int

Compound

Indicates whether the drug is known to be administered orally (1 = yes, 0 = default value)

parenteral

Int

Compound

Indicates whether the drug is known to be administered parenterally (1 = yes, 0 = default value)

topical

Int

Compound

Indicates whether the drug is known to be administered topically (1 = yes, 0 = default value)

Target Information

Column Name

Type

Info Re.

Based On

Description / Notes

tid

Int

Target

ChEMBL: assays

Unique ID for the target

mutation

String

Target

ChEMBL: variant_sequences

Details of variant(s) used, with residue positions adjusted to match provided sequence.

target_chembl_id

String

Target

ChEMBL: target_dictionary

ChEMBL identifier for this target (for use on web interface etc)

target_pref_name

String

Target

Preferred target name: manually curated

target_type

String

Target

Describes whether target is a protein, an organism, a tissue etc.

organism

String

Target

Source organism of the molecular target or tissue, or the target organism if compound activity is reported in an organism rather than a protein or tissue

Helper Columns

These columns are combinations of other columns, used for easier processing of the dataset.

Column Name

Type

Info Re.

Based On

Description / Notes

tid_mutation

String

Target

tid + '_' + mutation

Helper column

cpd_target_pair

String

Compound-Target Pair

parent_molregno + '_' + tid

Helper column

cpd_target_pair_mutation

String

Compound-Target Pair

parent_molregno + '_' + tid_mutation

Helper column
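The string concatenations in the table above can be sketched as follows. This is only an illustration: the function name and the example values are hypothetical, and the sketch does not address how missing mutation values are represented in the actual dataset.

```python
def add_helper_columns(row: dict) -> dict:
    """Return a copy of `row` with the three helper columns added."""
    out = dict(row)
    # tid + '_' + mutation
    out["tid_mutation"] = f"{row['tid']}_{row['mutation']}"
    # parent_molregno + '_' + tid
    out["cpd_target_pair"] = f"{row['parent_molregno']}_{row['tid']}"
    # parent_molregno + '_' + tid_mutation
    out["cpd_target_pair_mutation"] = f"{row['parent_molregno']}_{out['tid_mutation']}"
    return out


# Hypothetical identifiers, chosen only to show the resulting strings.
example = add_helper_columns({"parent_molregno": 1234, "tid": 56, "mutation": "T790M"})
print(example["tid_mutation"])
print(example["cpd_target_pair"])
print(example["cpd_target_pair_mutation"])
```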

Aggregated Values

Aggregated per compound-target pair using parent_molregno and tid_mutation.

Column Name

Type

Info Re.

Description / Notes

pchembl_value_mean_BF / _B

Float

Compound-Target Pair

Mean pchembl_value for the compound-target pair

pchembl_value_max_BF / _B

Float

Compound-Target Pair

Maximum pchembl_value for the compound-target pair

pchembl_value_median_BF / _B

Float

Compound-Target Pair

Median pchembl_value for the compound-target pair

first_publication_cpd_target_pair_BF / _B

Int

Compound-Target Pair

First publication in ChEMBL with this compound-target pair

first_publication_cpd_target_pair_w_pchembl_BF / _B

Int

Compound-Target Pair

First publication in ChEMBL with this compound-target pair and an associated pchembl value

Naming Convention: B vs. BF

These values are aggregated based on different subsets of the full dataset. The corresponding columns in the final dataset have a suffix that corresponds to the assay types the value is based on:

  • _BF: based on binding + functional assays

  • _B: based on binding assays
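The aggregation and the suffix convention described above can be sketched with the standard library. The sketch assumes each activity row carries an assay_type field with the values "B" (binding) and "F" (functional); that field name, the function name, and the example rows are assumptions made for illustration and are not taken from the dataset code.

```python
from collections import defaultdict
from statistics import mean, median


def aggregate_pchembl(rows, suffix):
    """Aggregate pchembl_value per (parent_molregno, tid_mutation) pair."""
    grouped = defaultdict(list)
    for row in rows:
        key = (row["parent_molregno"], row["tid_mutation"])
        grouped[key].append(row["pchembl_value"])
    return {
        key: {
            f"pchembl_value_mean_{suffix}": mean(values),
            f"pchembl_value_max_{suffix}": max(values),
            f"pchembl_value_median_{suffix}": median(values),
        }
        for key, values in grouped.items()
    }


# Hypothetical activity rows for a single compound-target pair.
rows = [
    {"parent_molregno": 1, "tid_mutation": "10_L858R", "assay_type": "B", "pchembl_value": 6.0},
    {"parent_molregno": 1, "tid_mutation": "10_L858R", "assay_type": "B", "pchembl_value": 7.0},
    {"parent_molregno": 1, "tid_mutation": "10_L858R", "assay_type": "F", "pchembl_value": 8.0},
]

# _BF: binding + functional assays; _B: binding assays only.
bf = aggregate_pchembl([r for r in rows if r["assay_type"] in {"B", "F"}], "BF")
b = aggregate_pchembl([r for r in rows if r["assay_type"] == "B"], "B")
print(bf[(1, "10_L858R")])
print(b[(1, "10_L858R")])
```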

DTI (Drug-Target Interaction) Annotations

Based on cpd_target_pair; mutation information is not included.

Column Name

Type

Info Re.

Based On

Description / Notes

therapeutic_target

Bool

Target

ChEMBL: drug_mechanism table

Is the target in the drug mechanism table?

DTI

String

Compound-Target Pair

Assigned as below

Drug target interaction (DTI) annotation

Mechanism to Assign DTI

In DM Table? [2]  max_phase? [3]  Th. Target? [4]  DTI    Explanation
----------------  --------------  ---------------  -----  ----------------------------------------------------
Yes               4                                D_DT   Drug - drug target
Yes               3                                C3_DT  Clinical candidate in phase 3 - drug target
Yes               2                                C2_DT  Clinical candidate in phase 2 - drug target
Yes               1                                C1_DT  Clinical candidate in phase 1 - drug target
Yes               < 1                              C0_DT  Compound in unknown clinical phase [5] - drug target
No                                Yes              DT     Drug target
No                                No               NDT    Not drug target
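The assignment rules above can be read as a small decision function. The following is a minimal sketch, not the project's implementation: the function name is hypothetical, and treating every max_phase value other than 1 to 4 (including 0.5, -1 and NULL/None) as "< 1" is an assumption.

```python
from typing import Optional


def assign_dti(pair_in_dm_table: bool,
               max_phase: Optional[float],
               therapeutic_target: bool) -> str:
    """Assign a DTI annotation to a compound-target pair."""
    if pair_in_dm_table:
        phase_labels = {4: "D_DT", 3: "C3_DT", 2: "C2_DT", 1: "C1_DT"}
        # Assumption: any other value (0.5, -1, None) counts as "< 1".
        return phase_labels.get(max_phase, "C0_DT")
    # Pair not in the drug_mechanism table: only the target status matters.
    return "DT" if therapeutic_target else "NDT"


print(assign_dti(True, 4.0, True))
print(assign_dti(True, 0.5, True))
print(assign_dti(False, None, True))
print(assign_dti(False, None, False))
```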

MAX_PHASE in ChEMBL

Before ChEMBL 32, compounds with a max_phase not between 1 and 4 were assigned a max_phase of 0.

From ChEMBL 32 onwards, compounds with a max_phase not between 1 and 4 can have three possible values:
- 0.5 = early phase 1 clinical trials
- -1 = clinical phase unknown for drug or clinical candidate drug, i.e., where ChEMBL cannot assign a clinical phase
- NULL = preclinical compounds with bioactivity data

Compound and Target Properties Based on ChEMBL Data

First publication

In contrast to the aggregated time-related fields, this field takes all of ChEMBL into account, not just the time-related data within the dataset.

Column Name

Type

Info Re.

Based On

Description / Notes

first_publication_cpd

Int

Compound

ChEMBL: docs

First appearance of the compound in the literature

Compound Properties

Column Name

Type

Info Re.

Based On

Description / Notes

mw_freebase

Float

Compound

ChEMBL: compound_properties

Molecular weight of parent compound

alogp

Float

Compound

Calculated ALogP

hba

Int

Compound

Number of hydrogen bond acceptors

hbd

Int

Compound

Number of hydrogen bond donors

psa

Float

Compound

Polar surface area

rtb

Int

Compound

Number of rotatable bonds

ro3_pass

String

Compound

Indicates whether the compound passes the rule-of-three (mw < 300, logP < 3 etc)

num_ro5_violations

Int

Compound

Number of violations of Lipinski’s rule-of-five, using HBA and HBD definitions

cx_most_apka

Float

Compound

The most acidic pKa calculated using ChemAxon

cx_most_bpka

Float

Compound

The most basic pKa calculated using ChemAxon

cx_logp

Float

Compound

The calculated octanol/water partition coefficient using ChemAxon

cx_logd

Float

Compound

The calculated octanol/water distribution coefficient at pH 7.4 using ChemAxon

molecular_species

String

Compound

Indicates whether the compound is an acid, a base, or neutral

full_mwt

Float

Compound

Molecular weight of the full compound including any salts

aromatic_rings

Int

Compound

Number of aromatic rings

heavy_atoms

Int

Compound

Number of heavy (non-hydrogen) atoms

qed_weighted

Float

Compound

Weighted quantitative estimate of drug likeness (as defined by Bickerton et al., Nature Chem 2012)

mw_monoisotopic

Float

Compound

Monoisotopic parent molecular weight

full_molformula

String

Compound

Molecular formula for the full compound (including any salt)

hba_lipinski

Int

Compound

Number of hydrogen bond acceptors calculated according to Lipinski’s original rules (i.e., N + O count)

hbd_lipinski

Int

Compound

Number of hydrogen bond donors calculated according to Lipinski’s original rules (i.e., NH + OH count)

num_lipinski_ro5_violations

Int

Compound

Number of violations of Lipinski’s rule of five using HBA_LIPINSKI and HBD_LIPINSKI counts

Compound Structures

Column Name

Type

Info Re.

Based On

Description / Notes

standard_inchi

String

Compound

ChEMBL: compound_structures

IUPAC standard InChI for the compound

standard_inchi_key

String

Compound

IUPAC standard InChI key for the compound

canonical_smiles

String

Compound

Canonical SMILES, generated using RDKit

ATC and Target Class

Column Name

Type

Info Re.

Based On

Description / Notes

atc_level1

String

Compound

ChEMBL: atc_classification, molecule_atc_classification

Anatomical Therapeutic Chemical (ATC) classification, level 1

target_class_l1

String

Target

ChEMBL: protein_classification, protein_family_classification

Target class, level 1 (more general)

target_class_l2

String

Target

Target class, level 2 (more detailed)

Ligand Efficiency Metrics

Calculated based on pchembl_value_mean.

Since LE metrics are based on pChEMBL values, they are calculated twice: once for the pChEMBL values based on binding and functional assays (suffix _BF) and once for the pChEMBL values based on binding assays only (suffix _B).

Column Name

Type

Info Re.

Description / Notes

LE_BF / LE_B

Float

Compound

Ligand efficiency

BEI_BF / BEI_B

Float

Compound

Binding efficiency index

SEI_BF / SEI_B

Float

Compound

Surface efficiency index

LLE_BF / LLE_B

Float

Compound

Lipophilic ligand efficiency

Equations

\begin{flalign*}
LE &= \frac{2.303 \cdot 298 \cdot 0.00199 \cdot pchembl\_mean}{heavy\_atoms} \\
BEI &= \frac{pchembl\_mean \cdot 1000}{mw\_freebase} \\
SEI &= \frac{pchembl\_mean \cdot 100}{PSA} \\
LLE &= pchembl\_mean - ALogP
\end{flalign*}
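The equations above translate directly into code. The sketch below is illustrative only: the function name and the example input values are hypothetical, and it does not handle zero or missing denominators, which the actual pipeline would need to treat in some way.

```python
def ligand_efficiency_metrics(pchembl_mean: float,
                              heavy_atoms: int,
                              mw_freebase: float,
                              psa: float,
                              alogp: float) -> dict:
    """Compute LE, BEI, SEI and LLE from a mean pChEMBL value."""
    return {
        # 2.303 * T (298 K) * R (0.00199 kcal/(mol*K)) * pchembl / heavy atoms
        "LE": 2.303 * 298 * 0.00199 * pchembl_mean / heavy_atoms,
        "BEI": pchembl_mean * 1000 / mw_freebase,
        "SEI": pchembl_mean * 100 / psa,
        "LLE": pchembl_mean - alogp,
    }


# Hypothetical compound properties, chosen only to exercise the formulas.
metrics = ligand_efficiency_metrics(
    pchembl_mean=8.0, heavy_atoms=25, mw_freebase=350.0, psa=80.0, alogp=3.0
)
print(metrics)
```

The same function would be called once with pchembl_value_mean_BF and once with pchembl_value_mean_B to obtain the _BF and _B variants.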

RDKit-Based Compound Descriptors

Built-in Methods

These compound descriptors are calculated using built-in RDKit methods from Descriptors and rdMolDescriptors.

Column Name

Type

Info Re.

Based On

Description / Notes

fraction_csp3

Float

Compound

canonical_smiles + built-in RDKit methods

Fraction of C atoms that are SP3 hybridized (rdkit.Chem.Descriptors.FractionCSP3)

ring_count

Int

Compound

Number of rings (rdkit.Chem.Descriptors.RingCount)

num_aliphatic_rings

Int

Compound

Number of aliphatic (containing at least one non-aromatic bond) rings (rdkit.Chem.Descriptors.NumAliphaticRings)

num_aliphatic_carbocycles

Int

Compound

Number of aliphatic (containing at least one non-aromatic bond) carbocycles (rdkit.Chem.Descriptors.NumAliphaticCarbocycles)

num_aliphatic_heterocycles

Int

Compound

Number of aliphatic (containing at least one non-aromatic bond) heterocycles (rdkit.Chem.Descriptors.NumAliphaticHeterocycles)

num_aromatic_rings

Int

Compound

Number of aromatic rings (rdkit.Chem.Descriptors.NumAromaticRings)

num_aromatic_carbocycles

Int

Compound

Number of aromatic carbocycles (rdkit.Chem.Descriptors.NumAromaticCarbocycles)

num_aromatic_heterocycles

Int

Compound

Number of aromatic heterocycles (rdkit.Chem.Descriptors.NumAromaticHeterocycles)

num_saturated_rings

Int

Compound

Number of saturated rings (rdkit.Chem.Descriptors.NumSaturatedRings)

num_saturated_carbocycles

Int

Compound

Number of saturated carbocycles (rdkit.Chem.Descriptors.NumSaturatedCarbocycles)

num_saturated_heterocycles

Int

Compound

Number of saturated heterocycles (rdkit.Chem.Descriptors.NumSaturatedHeterocycles)

num_stereocentres

Int

Compound

Number of atomic stereocenters (specified and unspecified) (rdkit.Chem.rdMolDescriptors.CalcNumAtomStereoCenters)

num_heteroatoms

Int

Compound

Number of heteroatoms (rdkit.Chem.Descriptors.NumHeteroatoms)

Bespoke Methods

These compound descriptors are calculated using custom RDKit-based methods.

Column Name

Type

Info Re.

Based On

Description / Notes

aromatic_atoms

Int

Compound

canonical_smiles + RDKit-based methods

Number of aromatic atoms

aromatic_c

Int

Compound

Number of aromatic C

aromatic_n

Int

Compound

Number of aromatic N

aromatic_hetero

Int

Compound

Number of aromatic hetero atoms

scaffold_w_stereo

String

Compound

Scaffold SMILES, including stereochemistry information

scaffold_wo_stereo

String

Compound

Scaffold SMILES of the molecule after removing stereochemistry information

Annotations for Filtering

These columns are only available in the full dataset, where they facilitate filtering into subsets.

Helper Columns

pair_mutation_in_dm_table and pair_in_dm_table are similar fields. They differ in whether mutation information is taken into account, reflecting that mutation information is only sometimes taken into account when calculating fields and adding rows to the dataset.

  • pair_mutation_in_dm_table:

    Is the compound-target pair in the drug_mechanism table when taking mutation information into account? Mutation information IS taken into account when pairs are added to the dataset because they appear in the drug_mechanism table. For example, (cpd A, target B without mutation) will be added to the set of existing compound-target pairs with pChEMBL values if there is a pair with a pChEMBL value for (cpd A, target B with mutation C) but no pair with a pChEMBL value for (cpd A, target B without mutation). This field is used to determine keep_for_binding, which in turn is used to determine the B subset of data based on binding assays.

  • pair_in_dm_table:

    Is the compound-target pair in the drug_mechanism table when ignoring mutation information? Mutation information is NOT taken into account when assigning DTI values.

Column Name

Type

Info Re.

Description / Notes

pair_mutation_in_dm_table

Bool

Compound-Target Pair

Is the compound-target pair (taking mutation annotation into account) in the drug mechanism table?

pair_in_dm_table

Bool

Compound-Target Pair

Is the compound-target pair (ignoring mutation annotation) in the drug mechanism table?

keep_for_binding

Bool

Compound-Target Pair

Rows to keep if interested in information based only on binding assays + the drug_mechanism table. True if pchembl_value_mean_B (based on binding assays) exists or if pair_mutation_in_dm_table == True, i.e., the pair (including mutation information) is in the drug mechanism table.
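The rule for keep_for_binding can be sketched as a simple predicate. This is a minimal sketch under the assumption that a missing pchembl_value_mean_B is represented as None or NaN; the function name is hypothetical.

```python
import math
from typing import Optional


def keep_for_binding(pchembl_value_mean_b: Optional[float],
                     pair_mutation_in_dm_table: bool) -> bool:
    """True if a binding-based mean exists or the pair is in the drug_mechanism table."""
    # Assumption: "exists" means the value is neither None nor NaN.
    has_binding_value = (
        pchembl_value_mean_b is not None
        and not math.isnan(pchembl_value_mean_b)
    )
    return has_binding_value or pair_mutation_in_dm_table


print(keep_for_binding(7.2, False))           # binding value present
print(keep_for_binding(None, True))           # only in drug_mechanism table
print(keep_for_binding(float("nan"), False))  # neither condition holds
```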

Filtering Columns

Column Name       Type  Info Re.              Assays                #Comparators [6]  Other
----------------  ----  --------------------  --------------------  ----------------  -----
BF_100            Bool  Compound-Target Pair  binding + functional  >= 100
BF_100_c_dt_d_dt  Bool  Compound-Target Pair  binding + functional  >= 100            at least one compound with an annotation of D_DT or C<p>_DT (C0_DT, C1_DT, C2_DT, C3_DT) per target
BF_100_d_dt       Bool  Compound-Target Pair  binding + functional  >= 100            at least one compound with an annotation of D_DT per target
B_100             Bool  Compound-Target Pair  binding               >= 100
B_100_c_dt_d_dt   Bool  Compound-Target Pair  binding               >= 100            at least one compound with an annotation of D_DT or C<p>_DT (C0_DT, C1_DT, C2_DT, C3_DT) per target
B_100_d_dt        Bool  Compound-Target Pair  binding               >= 100            at least one compound with an annotation of D_DT per target