clean_dataset module

Methods related to cleaning the dataset.

clean_dataset.clean_dataset(dataset: Dataset, calculate_rdkit: bool) → DataFrame

Clean the dataset by:

  • changing NaN values and empty strings to None

  • setting the type of relevant columns to Int64

  • rounding floats to 4 decimal places (except max_phase, which is not rounded)

  • reordering columns

  • sorting rows by cpd_target_pair_mutation

Parameters:
  • dataset (Dataset) – Dataset with compound-target pairs. It is updated to the cleaned version described above.

  • calculate_rdkit (bool) – True if the DataFrame contains RDKit-based compound properties
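The final sorting step can be illustrated on a plain pandas DataFrame. This is a minimal sketch, not the module's implementation: the helper name sort_pairs, the toy rows, and the format of the key values are made up for illustration. Only the column name cpd_target_pair_mutation comes from the description above, and resetting the index afterwards is an assumption.

```python
import pandas as pd


def sort_pairs(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: sort rows by the pair-mutation key."""
    # kind="stable" keeps the original order of rows that share a key.
    # Resetting the index is an assumption, not documented behaviour.
    return df.sort_values("cpd_target_pair_mutation", kind="stable").reset_index(drop=True)


# Toy data; the key format shown here is invented for the example.
toy = pd.DataFrame(
    {
        "cpd_target_pair_mutation": ["2_11_None", "1_10_None", "1_10_V600E"],
        "pchembl_value_mean": [6.1, 7.3, 5.9],
    }
)
sorted_toy = sort_pairs(toy)
print(sorted_toy)
```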

clean_dataset.clean_none_values(dataset: Dataset)

Change NaN values and empty strings to None for consistency.
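One way to get this behaviour with pandas is sketched below. It is an illustration under stated assumptions rather than the module's code: the helper name replace_missing_with_none and the toy column names are hypothetical, and the sketch works on a plain DataFrame instead of a Dataset object.

```python
import numpy as np
import pandas as pd


def replace_missing_with_none(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: map NaN/NA and empty strings to None."""
    # Cast to object first; in a float column pandas would turn None back into NaN.
    cleaned = df.astype(object)
    for col in cleaned.columns:
        values = [None if (pd.isna(v) or v == "") else v for v in cleaned[col]]
        # dtype=object stops pandas from re-inferring a float dtype.
        cleaned[col] = pd.Series(values, index=cleaned.index, dtype=object)
    return cleaned


# Toy data with one empty string, one None and one NaN.
toy = pd.DataFrame(
    {
        "smiles": ["CCO", "", None],
        "pchembl_value_mean": [6.5, np.nan, 7.0],
    }
)
cleaned_toy = replace_missing_with_none(toy)
print(cleaned_toy)
```

The cast to object is the key design choice here: it trades numeric dtypes for a single, uniform missing-value marker.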

clean_dataset.remove_compounds_without_smiles_and_mixtures(dataset: Dataset, chembl_con: Connection)

Remove

  • compounds without a SMILES string

  • compounds whose SMILES contains a dot (mixtures and salts).

Because compound information is aggregated for the parents of salts, relatively few SMILES contain a dot.

Parameters:
  • dataset (Dataset) – Dataset with compound-target pairs. Will be updated to only include compound-target pairs that have a SMILES string without a ‘.’

  • chembl_con (sqlite3.Connection) – sqlite3 connection to the ChEMBL database.
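The filtering rule itself can be sketched on a plain DataFrame. This is a simplified illustration, not the module's implementation: it assumes the SMILES strings are already present in a column (hypothetically named canonical_smiles here), so it does not use the ChEMBL connection, and the helper name is made up.

```python
import pandas as pd


def drop_missing_and_dotted_smiles(
    df: pd.DataFrame, smiles_col: str = "canonical_smiles"
) -> pd.DataFrame:
    """Hypothetical helper: keep rows with a SMILES that contains no '.'."""
    # Treat missing values like empty strings so one mask covers both cases.
    smiles = df[smiles_col].fillna("")
    has_smiles = smiles != ""
    # regex=False matches a literal dot rather than "any character".
    has_dot = smiles.str.contains(".", regex=False)
    return df[has_smiles & ~has_dot].reset_index(drop=True)


# Toy data: one valid compound, one missing SMILES, one salt, one empty string.
toy = pd.DataFrame(
    {
        "canonical_smiles": ["CCO", None, "CC(=O)[O-].[Na+]", ""],
        "tid": [1, 2, 3, 4],
    }
)
filtered_toy = drop_missing_and_dotted_smiles(toy)
print(filtered_toy)
```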

clean_dataset.reorder_columns(dataset, calculate_rdkit)

Reorder the columns in the DataFrame.
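The target column order is not listed here. As a generic sketch of column reordering in pandas, the example below takes the desired leading columns as an argument; the helper name, the argument, and the toy column names are assumptions for illustration only.

```python
import pandas as pd


def move_columns_to_front(df: pd.DataFrame, first_columns: list[str]) -> pd.DataFrame:
    """Hypothetical helper: put the requested columns first, keep the rest in order."""
    # Ignore requested columns that are absent instead of raising a KeyError.
    leading = [c for c in first_columns if c in df.columns]
    remaining = [c for c in df.columns if c not in leading]
    return df[leading + remaining]


# Toy data with made-up column names.
toy = pd.DataFrame({"max_phase": [4.0], "tid": [10], "parent_molregno": [1]})
reordered_toy = move_columns_to_front(toy, ["parent_molregno", "tid", "not_present"])
print(list(reordered_toy.columns))
```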

clean_dataset.round_floats(dataset, decimal_places=4)

Round float columns to <decimal_places> decimal places. This does not apply to max_phase.
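A minimal sketch of this rounding rule on a plain DataFrame follows. The helper name, the skip argument, and the toy column pchembl_value_mean are assumptions; only max_phase and the default of 4 decimal places come from the description above.

```python
import pandas as pd


def round_float_columns(
    df: pd.DataFrame, decimal_places: int = 4, skip: tuple[str, ...] = ("max_phase",)
) -> pd.DataFrame:
    """Hypothetical helper: round float columns, leaving the skipped ones untouched."""
    rounded = df.copy()
    float_cols = [
        c for c in rounded.select_dtypes(include="float").columns if c not in skip
    ]
    if float_cols:
        rounded[float_cols] = rounded[float_cols].round(decimal_places)
    return rounded


# Toy data: one float column to round, max_phase to keep, one integer column.
toy = pd.DataFrame(
    {
        "pchembl_value_mean": [6.123456, 7.0],
        "max_phase": [0.5, 4.0],
        "tid": [1, 2],
    }
)
rounded_toy = round_float_columns(toy)
print(rounded_toy)
```

Excluding max_phase by name keeps the rule explicit; integer columns are left alone because only float dtypes are selected.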

clean_dataset.set_types_to_int(dataset, calculate_rdkit)

Set the type of relevant columns to Int64.
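Int64 (capital I) is pandas' nullable integer dtype, which can hold missing values without falling back to float64. The sketch below shows the conversion on a toy column. The helper name and the column name first_approval are assumptions, and the list of columns the module treats as relevant is not given here.

```python
import numpy as np
import pandas as pd


def to_nullable_int(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Hypothetical helper: cast the given columns to the nullable Int64 dtype."""
    converted = df.copy()
    for col in columns:
        # Whole-number floats such as 1998.0 become 1998; NaN becomes <NA>.
        converted[col] = converted[col].astype("Int64")
    return converted


# A missing value forces float64 on a plain integer column.
toy = pd.DataFrame({"first_approval": [1998.0, np.nan, 2011.0]})
converted_toy = to_nullable_int(toy, ["first_approval"])
print(converted_toy.dtypes)
```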