User Guide

The default version of the dataset (the full dataset as a CSV file based on the newest ChEMBL version) can be generated by calling

python main.py -o <output_path>

with further options explained in Arguments.

An overview of the available arguments can also be displayed by calling

python main.py --help

The output will always contain the full dataset as a CSV file. The arguments only control whether additional files are written and how the full dataset is extracted.
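As a rough illustration of consuming the CSV output, the following sketch reads a delimited file with Python's standard csv module. It assumes the default `;` delimiter listed under Arguments. The function name `read_dataset` is hypothetical and not part of the project, and no assumption is made about the column names in the file.

```python
import csv
from pathlib import Path


def read_dataset(csv_path, delimiter=";"):
    """Read a delimited output file into a list of dicts keyed by column name.

    Hypothetical helper for illustration; `delimiter` should match the value
    passed to --delimiter when the dataset was generated (default: ';').
    """
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter=delimiter))
```

If the dataset was generated with a different `--delimiter`, the same value would need to be passed as `delimiter` here.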

Arguments

Parameter	Required	Flag	Default	Explanation
--chembl, -c	No	No	None	ChEMBL version. The latest available ChEMBL version is used if this is not set.
--sqlite, -s	No	No	None	Path to SQLite database. If this is not set, ChEMBL is downloaded as an SQLite database and handled using the chembl_downloader package.
--output, -o	Yes	No	None	Path to write the output file(s) to.
--delimiter, -d	No	No	;	Delimiter used in the output CSV files.
--all_sources	No	Yes	n/a	Include all sources if this is set. By default, this is not set, and the dataset is calculated based on only literature sources.
--rdkit	No	Yes	n/a	Calculate RDKit-based compound properties if this is set.
--excel	No	Yes	n/a	Also write the results to Excel. Note: this may fail if the output is too large. The results are always written to CSV.
--BF	No	Yes	n/a	Write the subsets based on binding and functional assays.
--B	No	Yes	n/a	Write the subsets based on binding assays.
--debug	No	Yes	n/a	Log additional debugging information.
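
To make the table concrete, here is a minimal sketch of how such a command-line interface could be declared with Python's standard argparse module. It is reconstructed from the table only and is not the project's actual parser: the function name `build_parser`, the help strings, and the choice to leave `--chembl` as a plain string are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the argument table above."""
    parser = argparse.ArgumentParser(prog="main.py")
    # Options that take a value
    parser.add_argument("--chembl", "-c", default=None,
                        help="ChEMBL version; latest is used if unset")
    parser.add_argument("--sqlite", "-s", default=None,
                        help="Path to an SQLite database")
    parser.add_argument("--output", "-o", required=True,
                        help="Path to write the output file(s) to")
    parser.add_argument("--delimiter", "-d", default=";",
                        help="Delimiter used in the output CSV files")
    # Flags: False unless given on the command line
    for flag in ("--all_sources", "--rdkit", "--excel",
                 "--BF", "--B", "--debug"):
        parser.add_argument(flag, action="store_true")
    return parser


if __name__ == "__main__":
    # Example invocation with a placeholder output path
    args = build_parser().parse_args(["-o", "results", "--B"])
    print(args.output, args.delimiter, args.B, args.BF)
```

In this sketch, flags map to booleans (for example `args.all_sources`), while `--output` is the only argument that must be supplied.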

Accessing ChEMBL

ChEMBL is accessed either through a given path to an SQLite database download or through the chembl_downloader package. In both cases, SQLite is used to query ChEMBL. Some of the earlier ChEMBL versions are missing tables or fields required to calculate the dataset. Therefore, the earliest ChEMBL version for which the dataset can be calculated is ChEMBL 26.
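
The version constraint above could be checked before running the extraction. The sketch below uses only Python's standard sqlite3 module and assumes that the ChEMBL SQLite database exposes a `version` table whose `name` column holds a string such as `ChEMBL_26`; that schema detail, along with the helper names `chembl_version` and `is_supported`, is an assumption and not taken from this guide.

```python
import re
import sqlite3

MIN_CHEMBL_VERSION = 26  # earliest version supported, per the text above


def chembl_version(connection):
    """Return the major ChEMBL version number stored in the database.

    Assumes a `version` table with a `name` column like 'ChEMBL_26'.
    """
    row = connection.execute("SELECT name FROM version").fetchone()
    if row is None:
        raise ValueError("version table is empty")
    match = re.search(r"(\d+)", row[0])
    if match is None:
        raise ValueError(f"cannot parse a version from {row[0]!r}")
    return int(match.group(1))


def is_supported(version):
    """True if the dataset can be calculated for this ChEMBL version."""
    return version >= MIN_CHEMBL_VERSION


if __name__ == "__main__":
    # Stand-in database for demonstration; a real run would instead use
    # sqlite3.connect(<path given to --sqlite>).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE version (name TEXT)")
    conn.execute("INSERT INTO version VALUES ('ChEMBL_26')")
    print(is_supported(chembl_version(conn)))
```

A check like this would let a script fail early with a clear message instead of hitting a missing table or field partway through the extraction.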