User Guide

The default version of the dataset (the full dataset as a CSV file based on the newest ChEMBL version) can be generated by calling

python main.py -o <output_path>

with further options explained in Arguments.

An overview of the available arguments can also be printed by calling

python main.py --help

The output will always contain the full dataset as a CSV file. The arguments only add further output files or change how the full dataset is extracted.
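As a minimal sketch of how the generated CSV file could be loaded afterwards, the helper below uses Python's standard csv module. The function name read_dataset is hypothetical and not part of the tool. The delimiter defaults to ";" to match the default listed under Arguments, and no column names are assumed.

```python
import csv


def read_dataset(path, delimiter=";"):
    """Read a generated CSV file into a list of dicts keyed by column name.

    Hypothetical helper for illustration; it is not part of main.py.
    """
    # newline="" lets the csv module handle line endings itself.
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter=delimiter))
```

If a different delimiter was passed with -d when generating the dataset, the same value has to be passed to read_dataset.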

Arguments

| Parameter | Required | Flag | Default | Explanation |
|---|---|---|---|---|
| --chembl, -c | No | No | None | ChEMBL version. The latest available ChEMBL version is used if this is not set. |
| --sqlite, -s | No | No | None | Path to SQLite database. If this is not set, ChEMBL is downloaded as an SQLite database and handled using the chembl_downloader package. |
| --output, -o | Yes | No | None | Path to write the output file(s) to. |
| --delimiter, -d | No | No | ; | Delimiter used in the output CSV files. |
| --all_sources | No | Yes | n/a | Include all sources if this is set. By default, this is not set, and the dataset is calculated based on only literature sources. |
| --rdkit | No | Yes | n/a | Calculate RDKit-based compound properties if this is set. |
| --excel | No | Yes | n/a | Write the results to Excel. Note: this may fail if the output is too large. The results will always be written to CSV. |
| --BF | No | Yes | n/a | Write the subsets based on binding and functional assays. |
| --B | No | Yes | n/a | Write the subsets based on binding assays. |
| --debug | No | Yes | n/a | Log additional debugging information. |
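To make the difference between value arguments and flags concrete, the following sketch declares the documented arguments with Python's argparse module. It is an illustration under the assumption that the options behave as described above. The actual parser in main.py may be written differently, and the function name build_parser is hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented arguments (illustrative only)."""
    parser = argparse.ArgumentParser(prog="main.py")
    # Arguments that take a value.
    parser.add_argument("--chembl", "-c", default=None,
                        help="ChEMBL version; the latest version is used if unset")
    parser.add_argument("--sqlite", "-s", default=None,
                        help="Path to an SQLite database")
    parser.add_argument("--output", "-o", required=True,
                        help="Path to write the output file(s) to")
    parser.add_argument("--delimiter", "-d", default=";",
                        help="Delimiter used in the output CSV files")
    # Flags take no value: each is False unless given on the command line.
    for flag in ("--all_sources", "--rdkit", "--excel", "--BF", "--B", "--debug"):
        parser.add_argument(flag, action="store_true")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Only --output is required here, so calling the sketch with just -o <output_path> leaves every flag unset and the delimiter at ";".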

Accessing ChEMBL

ChEMBL is accessed either through a given path to an SQLite database download or through the chembl_downloader package. In both cases, SQLite is used to query ChEMBL. Some of the earlier ChEMBL versions are missing tables or fields required to calculate the dataset. Therefore, the earliest ChEMBL version for which the dataset can be calculated is ChEMBL 26.
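The two access paths and the version limit can be sketched as follows. This is not the tool's own code: the names open_chembl and check_version are hypothetical, and the sketch assumes that chembl_downloader.download_extract_sqlite returns the path of the extracted database and falls back to the latest release when no version is given.

```python
import sqlite3

MIN_CHEMBL_VERSION = 26  # earliest supported version, as stated above


def check_version(version):
    """Raise ValueError if the requested ChEMBL version is older than 26."""
    if version is None:
        return  # no version requested, so the latest release is used
    # Versions such as "24_1" carry a minor suffix; compare the major part only.
    major = int(str(version).split("_")[0])
    if major < MIN_CHEMBL_VERSION:
        raise ValueError(
            f"ChEMBL {version} is not supported; use {MIN_CHEMBL_VERSION} or later"
        )


def open_chembl(sqlite_path=None, version=None):
    """Return an sqlite3 connection to a local or downloaded ChEMBL database."""
    check_version(version)
    if sqlite_path is None:
        # Imported lazily so a local database works without the package installed.
        import chembl_downloader

        # Assumption: returns the path to the extracted SQLite file.
        sqlite_path = chembl_downloader.download_extract_sqlite(version=version)
    return sqlite3.connect(str(sqlite_path))
```

In both branches the result is a plain sqlite3 connection, so the same SQL queries can be run regardless of where the database came from.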