clean_dataset module

Methods related to cleaning the dataset.

clean_dataset.clean_dataset(dataset: Dataset, calculate_rdkit: bool) → DataFrame

Clean the dataset by:

changing NaN values and empty strings to None
setting the type of relevant columns to Int64
rounding floats to 4 decimal places (except max_phase, which is not rounded)
reordering columns
sorting rows by cpd_target_pair_mutation

Parameters:

dataset (Dataset) – Dataset with compound-target pairs. Will be updated to the cleaned version, with the changes described above.
calculate_rdkit (bool) – True if the DataFrame contains RDKit-based compound properties
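
The last two steps listed above (reordering columns and sorting rows) can be illustrated with plain pandas. This is a minimal sketch, not the module's implementation: the helper name reorder_and_sort, its leading_columns argument, and the index reset are assumptions made for illustration. Only the sort column cpd_target_pair_mutation comes from the description above.

```python
import pandas as pd


def reorder_and_sort(df, leading_columns, sort_column="cpd_target_pair_mutation"):
    """Hypothetical helper: move selected columns to the front, then sort rows."""
    # Keep only the requested leading columns that actually exist.
    leading = [c for c in leading_columns if c in df.columns]
    # All other columns follow in their original order.
    remaining = [c for c in df.columns if c not in leading]
    reordered = df[leading + remaining]
    # Sort rows and renumber the index (the index reset is an assumption).
    return reordered.sort_values(sort_column).reset_index(drop=True)


example = pd.DataFrame(
    {"b": [1, 2], "cpd_target_pair_mutation": ["y", "x"], "a": [3, 4]}
)
result = reorder_and_sort(example, ["cpd_target_pair_mutation", "a"])
```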

clean_dataset.clean_none_values(dataset: Dataset): Change NaN values and empty strings to None for consistency.
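
As an illustration of what replacing NaN values and empty strings with None can look like on a pandas DataFrame, here is a minimal sketch. The function name replace_missing_with_none is hypothetical and the approach shown is an assumption; the module may do this differently. Columns are stored with dtype object because a float column would otherwise turn None back into NaN.

```python
import pandas as pd


def replace_missing_with_none(df):
    """Hypothetical helper: return a copy with NaN and '' replaced by None."""
    cleaned = pd.DataFrame(index=df.index)
    for col in df.columns:
        # Map every missing value or empty string to None.
        values = [None if (pd.isna(v) or v == "") else v for v in df[col]]
        # dtype=object keeps None as None instead of coercing it to NaN.
        cleaned[col] = pd.Series(values, index=df.index, dtype=object)
    return cleaned


example = pd.DataFrame({"a": [1.5, float("nan")], "b": ["x", ""]})
result = replace_missing_with_none(example)
```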

clean_dataset.remove_compounds_without_smiles_and_mixtures(dataset: Dataset, chembl_con: Connection)

Remove compounds without a SMILES and compounds whose SMILES contains a dot (mixtures).

Since compound information is aggregated for the parents of salts, the number of SMILES with a dot is relatively low.

Parameters:

dataset (Dataset) – Dataset with compound-target pairs. Will be updated to only include compound-target pairs with a SMILES that does not contain a ‘.’.
chembl_con (sqlite3.Connection) – Sqlite3 connection to ChEMBL database.
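
The filtering rule described here (keep only rows that have a SMILES without a ‘.’) can be sketched on a plain DataFrame as follows. This is an illustrative sketch only: the helper name, the column name canonical_smiles, and the index reset are assumptions, and the sketch does not use the ChEMBL connection that the real function takes.

```python
import pandas as pd


def drop_missing_and_mixture_smiles(df, smiles_col="canonical_smiles"):
    """Hypothetical helper: keep rows with a SMILES that contains no dot."""
    # A usable SMILES is neither missing nor an empty string.
    has_smiles = df[smiles_col].notna() & (df[smiles_col] != "")
    # regex=False matches a literal dot; na=False treats missing values as no match.
    is_mixture = df[smiles_col].str.contains(".", regex=False, na=False)
    return df[has_smiles & ~is_mixture].reset_index(drop=True)


example = pd.DataFrame(
    {"canonical_smiles": ["CCO", None, "[Na+].[Cl-]", ""], "id": [1, 2, 3, 4]}
)
result = drop_missing_and_mixture_smiles(example)
```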

clean_dataset.reorder_columns(dataset, calculate_rdkit): Reorder the columns in the DataFrame.

clean_dataset.round_floats(dataset, decimal_places=4): Round float columns to <decimal_places> decimal places. This does not apply to max_phase.
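
A minimal sketch of rounding every float column except max_phase, assuming a plain pandas DataFrame. The helper name round_float_columns, its skip argument, and the example column names value and n are hypothetical; only max_phase and the default of 4 decimal places come from the description above.

```python
import pandas as pd


def round_float_columns(df, decimal_places=4, skip=("max_phase",)):
    """Hypothetical helper: round float columns, leaving skipped ones untouched."""
    rounded = df.copy()
    # Select float columns, then drop the ones that must keep full precision.
    float_cols = [
        c for c in rounded.select_dtypes(include="float").columns if c not in skip
    ]
    for col in float_cols:
        rounded[col] = rounded[col].round(decimal_places)
    return rounded


example = pd.DataFrame({"value": [6.123456], "max_phase": [0.123456], "n": [3]})
result = round_float_columns(example)
```

Working on a copy keeps the input unchanged, which makes the step easy to check in isolation.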

clean_dataset.set_types_to_int(dataset, calculate_rdkit): Set the type of relevant columns to Int64.
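
Int64 (capital I) is the pandas nullable integer type. Unlike NumPy's int64, it can hold missing values, so integer columns with gaps do not have to be stored as floats. The sketch below shows the conversion on a plain DataFrame; the helper name to_nullable_int and the example column year are hypothetical, and which columns the module treats as relevant is not shown here.

```python
import pandas as pd


def to_nullable_int(df, columns):
    """Hypothetical helper: cast the given columns to the nullable Int64 dtype."""
    converted = df.copy()
    for col in columns:
        # Works for float columns whose non-missing values are whole numbers.
        converted[col] = converted[col].astype("Int64")
    return converted


example = pd.DataFrame({"year": [1999.0, float("nan"), 2005.0]})
result = to_nullable_int(example, ["year"])
```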