sanity_checks module

Perform sanity checks on the dataset.

sanity_checks.check_atc(df_result: DataFrame, atc_levels: DataFrame)[source]: Check that atc_level1 information is only null if the parent_molregno is not in the respective table.
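As a rough illustration of the rule described for check_atc, the sketch below flags a null atc_level1 only when the compound does have an entry in the ATC table. It is not the module's implementation: the helper name atc_nulls_are_justified is hypothetical, and the assumption that both DataFrames carry a parent_molregno column is inferred from the description above.

```python
import pandas as pd


def atc_nulls_are_justified(df_result: pd.DataFrame, atc_levels: pd.DataFrame) -> bool:
    """Hypothetical helper: True if atc_level1 is null only for compounds
    whose parent_molregno does not occur in atc_levels."""
    # Assumption: both frames have a parent_molregno column.
    has_atc_entry = df_result["parent_molregno"].isin(atc_levels["parent_molregno"])
    missing_atc = df_result["atc_level1"].isna()
    # A violation is a null atc_level1 for a compound that is in atc_levels.
    return not (missing_atc & has_atc_entry).any()


df_result = pd.DataFrame(
    {"parent_molregno": [1, 2, 3], "atc_level1": ["A", None, "C"]}
)
atc_levels = pd.DataFrame({"parent_molregno": [1, 3], "atc_level1": ["A", "C"]})

# Compound 2 has no ATC entry, so its null atc_level1 is acceptable.
print(atc_nulls_are_justified(df_result, atc_levels))
```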

sanity_checks.check_compound_props(df_result: DataFrame, df_cpd_props: DataFrame)[source]

Check that compound props are only null if the parent_molregno is not in df_cpd_props or if the value in the compound props table is null.
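The two permitted reasons for a null compound property can be checked with a left merge, as in the sketch below. This is an illustration under assumptions rather than the module's code: the helper unexplained_null_props is hypothetical, mw_freebase is used only as an example property column, and df_cpd_props is assumed to hold one row per parent_molregno.

```python
import pandas as pd


def unexplained_null_props(
    df_result: pd.DataFrame, df_cpd_props: pd.DataFrame, prop: str
) -> pd.DataFrame:
    """Hypothetical helper: rows where prop is null in df_result although
    df_cpd_props holds a non-null value for the same parent_molregno."""
    merged = df_result[["parent_molregno", prop]].merge(
        df_cpd_props[["parent_molregno", prop]],
        on="parent_molregno",
        how="left",
        suffixes=("", "_source"),
    )
    # Compounds absent from df_cpd_props get a null *_source value after the
    # left merge, so they are never reported here.
    mask = merged[prop].isna() & merged[f"{prop}_source"].notna()
    return merged[mask]


df_result = pd.DataFrame(
    {"parent_molregno": [1, 2, 3], "mw_freebase": [300.0, None, None]}
)
# Compound 2 is null in the source table; compound 3 is not in it at all.
df_cpd_props = pd.DataFrame(
    {"parent_molregno": [1, 2], "mw_freebase": [300.0, None]}
)

violations = unexplained_null_props(df_result, df_cpd_props, "mw_freebase")
print(violations.empty)
```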

sanity_checks.check_for_mixed_types(df_result: DataFrame)[source]: Check that there are no mixed types in columns with dtype=object.
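One way to detect mixed types in object columns is to count the distinct Python types among the non-null values of each column. The sketch below does that. The helper find_mixed_type_columns is hypothetical and may differ from what check_for_mixed_types does internally.

```python
import pandas as pd


def find_mixed_type_columns(df: pd.DataFrame) -> list[str]:
    """Hypothetical helper: names of object-dtype columns whose non-null
    values have more than one Python type."""
    mixed = []
    for col in df.select_dtypes(include="object").columns:
        # Nulls are dropped first so that None/NaN does not count as a type.
        n_types = df[col].dropna().map(type).nunique()
        if n_types > 1:
            mixed.append(col)
    return mixed


df = pd.DataFrame(
    {
        "a": ["x", 1, None],  # str and int mixed
        "b": ["x", "y", None],  # only str
        "c": [1, 2, 3],  # numeric column, not inspected
    }
)
print(find_mixed_type_columns(df))
```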

sanity_checks.check_ligand_efficiency_metrics(df_result: DataFrame)[source]: Check that ligand efficiency metrics are only null when at least one of the values used to calculate them is null.
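The condition for ligand efficiency metrics can be expressed generically: a metric may be null only in rows where at least one of its inputs is null. The sketch below is a minimal illustration. The helper name and the column names LE, pchembl_value and heavy_atoms are hypothetical placeholders, not the columns the module necessarily uses.

```python
import pandas as pd


def metric_null_only_if_input_null(
    df: pd.DataFrame, metric: str, inputs: list[str]
) -> bool:
    """Hypothetical helper: True if metric is null only in rows where at
    least one of the input columns is null."""
    metric_missing = df[metric].isna()
    any_input_missing = df[inputs].isna().any(axis=1)
    # A violation is a null metric although all inputs are present.
    return not (metric_missing & ~any_input_missing).any()


# Column names are placeholders chosen for this example.
df_ok = pd.DataFrame(
    {
        "pchembl_value": [6.0, None, 7.0],
        "heavy_atoms": [20, 25, 30],
        "LE": [0.41, None, 0.32],
    }
)
# Same data, but the metric is missing in the last row despite complete inputs.
df_bad = df_ok.assign(LE=[0.41, None, None])

print(metric_null_only_if_input_null(df_ok, "LE", ["pchembl_value", "heavy_atoms"]))
print(metric_null_only_if_input_null(df_bad, "LE", ["pchembl_value", "heavy_atoms"]))
```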

sanity_checks.check_null_values(df_result: DataFrame)[source]: Check if any columns contain nan or null which aren’t recognised as null values.
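One plausible reading of this check is that it looks for placeholder strings such as "nan" or "null" that pandas stores as ordinary text rather than as missing values. The sketch below follows that reading. The helper name and the set of placeholder strings are assumptions, and the actual check may use different criteria.

```python
import pandas as pd

# Assumed placeholder spellings; the real check may look for other values.
PLACEHOLDERS = frozenset({"nan", "null", "none", ""})


def columns_with_unrecognised_nulls(df: pd.DataFrame) -> list[str]:
    """Hypothetical helper: columns containing strings that look like null
    markers but are not treated as missing by pandas."""
    flagged = []
    for col in df.columns:
        values = df[col].dropna()  # real nulls are fine and are skipped
        looks_null = values.map(
            lambda v: isinstance(v, str) and v.strip().lower() in PLACEHOLDERS
        )
        if bool(looks_null.any()):
            flagged.append(col)
    return flagged


df = pd.DataFrame(
    {
        "a": ["x", "NaN", None],  # "NaN" is a string, not a missing value
        "b": [1.0, None, 2.0],  # a proper missing value
        "c": ["y", "z", "w"],
    }
)
print(columns_with_unrecognised_nulls(df))
```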

sanity_checks.check_pairs_without_pchembl_are_in_drug_mechanisms(df_result: DataFrame)[source]: Check that all rows without a pchembl value based on binding+functional assays (pchembl_x_BF) are in the drug_mechanism table. This does not hold for the pchembl_x_B columns, which are based on binding data only: a row may lack a binding-based pchembl value because it has data from functional assays but none from binding assays.

sanity_checks.check_rdkit_props(df_result: DataFrame)[source]: Check that columns set by the RDKit are only null if there is no canonical SMILES for the molecule. Scaffolds are excluded from this test because they can be None if the molecule is acyclic.

sanity_checks.check_target_classes(df_result: DataFrame, target_classes_level1: DataFrame, target_classes_level2: DataFrame)[source]: Check that target class information is only null if the target id is not in the respective table.

sanity_checks.sanity_checks(dataset: Dataset)[source]

Check basic assumptions about the finished dataset, specifically:

no columns contain nan or null values which aren’t recognised as null values
there are no mixed types in columns with dtype=object

Parameters:

dataset (Dataset) – Dataset with compound-target pairs.
calculate_rdkit (bool) – True if the DataFrame contains RDKit-based compound properties

sanity_checks.test_equality(current_df: DataFrame, read_file_name: str, assay_type: str, file_type_list: list[str], calculate_rdkit: bool)[source]

Check that the file that was written to <read_file_name> is identical to the DataFrame <current_df> it was based on.

Parameters:

current_df (pd.DataFrame) – Pandas DataFrame that was written to read_file_name
read_file_name (str) – Name of the file current_df was written to
assay_type (str) – Types of assays current_df contains information about. Options: “BF” (binding+functional), “B” (binding), “all” (contains both BF and B information)
file_type_list (list[str]) – List of file extensions used with read_file_name. Options: csv, xlsx
calculate_rdkit (bool) – If True, current_df contains RDKit-based columns
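The idea behind test_equality, reading a written file back and comparing it with the DataFrame it came from, can be sketched for the csv case as follows. This is a simplified illustration, not the module's implementation: the helper csv_matches_dataframe is hypothetical, it ignores assay_type and calculate_rdkit, and it does not cover xlsx files. The column names in the example data are placeholders.

```python
import tempfile
from pathlib import Path

import pandas as pd


def csv_matches_dataframe(current_df: pd.DataFrame, read_file_name: Path) -> bool:
    """Hypothetical helper: True if the csv file at read_file_name reads back
    as a DataFrame equal to current_df."""
    read_df = pd.read_csv(read_file_name)
    try:
        # Compares shape, column names, dtypes and values.
        pd.testing.assert_frame_equal(current_df, read_df)
    except AssertionError:
        return False
    return True


# Placeholder data for the example.
df = pd.DataFrame(
    {
        "parent_molregno": [1, 2],
        "pchembl_value_mean_BF": [6.5, None],
        "pref_name": ["ASPIRIN", "IBUPROFEN"],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    file_name = Path(tmp_dir) / "pairs.csv"

    df.to_csv(file_name, index=False)
    full_file_matches = csv_matches_dataframe(df, file_name)

    # Overwrite the file with only the first row to simulate a faulty write.
    df.iloc[:1].to_csv(file_name, index=False)
    truncated_file_matches = csv_matches_dataframe(df, file_name)

print(full_file_matches, truncated_file_matches)
```

A csv round trip does not preserve every dtype (for example, datetime or categorical columns come back as plain text), so a real comparison may need to convert columns before checking equality.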