clean_dataset module
Methods related to cleaning the dataset.
- clean_dataset.clean_dataset(dataset: Dataset, calculate_rdkit: bool) → DataFrame
Clean the dataset by:
  - changing NaN values and empty strings to None
  - setting the type of relevant columns to Int64
  - rounding floats to 4 decimal places (except max_phase, which is not rounded)
  - reordering columns
  - sorting rows by cpd_target_pair_mutation
- Parameters:
dataset (Dataset) – Dataset with compound-target pairs. Will be updated to the cleaned version with the changes described above.
calculate_rdkit (bool) – True if the DataFrame contains RDKit-based compound properties
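The Int64 conversion, rounding, and sorting steps listed above might look roughly like the following pandas sketch. It is not the module's implementation: the helper `round_and_sort`, its `int_columns` argument, and the example columns `first_publication_cpd` and `pchembl_value_mean` are assumptions made for illustration. Only `max_phase` and `cpd_target_pair_mutation` come from the description above.

```python
import pandas as pd


def round_and_sort(df, int_columns, unrounded=("max_phase",)):
    """Hypothetical helper illustrating three of the cleaning steps."""
    out = df.copy()
    # The nullable Int64 dtype keeps missing values as <NA> instead of
    # forcing the whole column to float.
    for col in int_columns:
        out[col] = out[col].astype("Int64")
    # Round the remaining float columns to 4 decimals, skipping exclusions.
    float_cols = [
        col
        for col in out.select_dtypes(include="float").columns
        if col not in unrounded
    ]
    out = out.round({col: 4 for col in float_cols})
    # Sort rows by the pair identifier and renumber the index.
    return out.sort_values("cpd_target_pair_mutation").reset_index(drop=True)


# Illustrative data; every column except max_phase and
# cpd_target_pair_mutation is an assumed name.
example = pd.DataFrame(
    {
        "cpd_target_pair_mutation": ["C2_T1", "C1_T1"],
        "first_publication_cpd": [2001.0, None],
        "pchembl_value_mean": [6.123456, 7.5],
        "max_phase": [0.55555, 4.0],
    }
)
cleaned = round_and_sort(example, int_columns=["first_publication_cpd"])
```

In this sketch `max_phase` keeps its full precision, while the other float column is rounded to 4 decimal places.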
- clean_dataset.clean_none_values(dataset: Dataset)
Change NaN values and empty strings to None for consistency.
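As a rough illustration of this normalisation, the sketch below rebuilds a DataFrame with object columns so that None is stored as-is. The helper names `is_missing` and `none_for_missing` and the example columns are hypothetical, and the module may implement this step differently.

```python
import numpy as np
import pandas as pd


def is_missing(value):
    # Treat the empty string and NaN/None/NaT as missing.
    if isinstance(value, str):
        return value == ""
    return pd.isna(value)


def none_for_missing(df):
    """Hypothetical helper: return a copy with missing entries set to None."""
    cleaned = {
        col: [None if is_missing(value) else value for value in df[col]]
        for col in df.columns
    }
    # dtype=object prevents pandas from converting None back to NaN.
    return pd.DataFrame(cleaned, index=df.index, dtype=object)


# Assumed column names, for illustration only.
example = pd.DataFrame({"smiles": ["CCO", ""], "pchembl": [np.nan, 6.5]})
result = none_for_missing(example)
```

The trade-off of this approach is that every column ends up with object dtype, so any numeric typing would have to be applied afterwards.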
- clean_dataset.remove_compounds_without_smiles_and_mixtures(dataset: Dataset, chembl_con: Connection)
Remove:
  - compounds without a SMILES string
  - compounds whose SMILES string contains a dot (mixtures and salts).
Since compound information is aggregated for the parents of salts, the number of SMILES strings with a dot is relatively low.
- Parameters:
dataset (Dataset) – Dataset with compound-target pairs. Will be updated to only include compound-target pairs with a SMILES string that does not contain a ‘.’
chembl_con (sqlite3.Connection) – sqlite3 connection to the ChEMBL database.
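The filtering rule itself can be sketched in pandas as shown below. This is a simplified illustration that works on a plain DataFrame and leaves out the ChEMBL connection entirely. The helper `keep_single_component_smiles` and the column name `canonical_smiles` are assumptions, not part of the documented interface.

```python
import pandas as pd


def keep_single_component_smiles(df, smiles_col="canonical_smiles"):
    """Hypothetical helper: keep rows with a SMILES string without a dot."""
    has_smiles = df[smiles_col].notna() & (df[smiles_col] != "")
    # regex=False matches "." literally; na=False treats missing as no match.
    is_mixture = df[smiles_col].str.contains(".", regex=False, na=False)
    return df[has_smiles & ~is_mixture]


example = pd.DataFrame(
    {"canonical_smiles": ["CCO", None, "CC(=O)[O-].[Na+]", "c1ccccc1"]}
)
kept = keep_single_component_smiles(example)
```

Here the row without a SMILES string and the sodium acetate salt are dropped, and the two single-component compounds remain.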
- clean_dataset.reorder_columns(dataset, calculate_rdkit)
Reorder the columns in the DataFrame.
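The column order used by the module is not listed here, so the following is only a generic sketch of one way to reorder a DataFrame. The helper `reorder` and the column names in `preferred_order` are hypothetical. Skipping names that are absent is one way optional columns, such as RDKit-based properties, could be handled.

```python
import pandas as pd


def reorder(df, preferred_order):
    """Hypothetical helper: preferred columns first, then all others."""
    # Ignore preferred names that are absent, e.g. optional property columns.
    leading = [col for col in preferred_order if col in df.columns]
    remaining = [col for col in df.columns if col not in leading]
    return df[leading + remaining]


# Assumed column names, for illustration only.
example = pd.DataFrame({"b": [1], "extra": [2], "a": [3]})
reordered = reorder(example, preferred_order=["a", "b", "not_present"])
```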