get_stats module

Get statistics of dataset for final results and debugging.

get_stats.add_dataset_sizes(dataset: Dataset, df: DataFrame, label: str)

Calculate representative counts for df and add them to the dataset for debugging.

Parameters:
  • dataset (Dataset) – Dataset with compound-target pairs and debugging sizes.

  • df (pd.DataFrame) – Pandas DataFrame with current compound-target pairs.

  • label (str) – Description of pipeline step (e.g., initial query).

get_stats.add_debugging_info(dataset: Dataset, df: DataFrame, label: str)

Wrapper for add_dataset_sizes. Handles logging level.
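
The description above does not say how the logging level is handled. One plausible reading is that the counts are only calculated when debug logging is enabled. The following is a minimal sketch under that assumption; the logger name, the function name add_debugging_info_sketch and its simplified arguments are hypothetical and are not part of get_stats.

```python
import logging

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("get_stats_sketch")


def add_debugging_info_sketch(sizes: list, n_rows: int, label: str) -> None:
    """Record a (label, n_rows) entry only when DEBUG logging is enabled."""
    # Assumption: "handles logging level" means skipping the work
    # unless the logger would actually emit DEBUG messages.
    if logger.isEnabledFor(logging.DEBUG):
        sizes.append((label, n_rows))


sizes = []
logger.setLevel(logging.INFO)
add_debugging_info_sketch(sizes, 100, "initial query")  # skipped: level is INFO
logger.setLevel(logging.DEBUG)
add_debugging_info_sketch(sizes, 80, "after filtering")  # recorded: level is DEBUG
```

Guarding the call this way keeps potentially expensive counting out of normal runs while leaving the call sites in the pipeline unchanged.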

get_stats.get_dataset_sizes(df: DataFrame, label: str) → DataFrame

Calculate the number of unique compounds, targets and pairs for df and for the subset of df limited to drugs.

Parameters:
  • df (pd.DataFrame) – Pandas DataFrame for which the dataset sizes should be calculated.

  • label (str) – Description of pipeline step (e.g., initial query).

Returns:

Pandas DataFrame with calculated unique counts.

Return type:

pd.DataFrame
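
The kind of calculation described above can be sketched with pandas as follows. This is an illustration, not the module's implementation: the function name dataset_sizes_sketch, the column names compound_id, target_id and is_drug, and the layout of the returned DataFrame are all assumptions.

```python
import pandas as pd


def dataset_sizes_sketch(df: pd.DataFrame, label: str) -> pd.DataFrame:
    """Count unique compounds, targets and pairs for df and for drugs only."""
    # Hypothetical column names; the real pipeline may use different ones.
    drugs = df[df["is_drug"]]
    rows = []
    for subset_name, subset in [("all", df), ("drugs", drugs)]:
        # A pair is a unique (compound, target) combination.
        pairs = subset[["compound_id", "target_id"]].drop_duplicates()
        rows.append(
            {
                "label": label,
                "subset": subset_name,
                "compounds": subset["compound_id"].nunique(),
                "targets": subset["target_id"].nunique(),
                "pairs": len(pairs),
            }
        )
    return pd.DataFrame(rows)


example = pd.DataFrame(
    {
        "compound_id": [1, 1, 2, 3],
        "target_id": [10, 10, 10, 20],
        "is_drug": [True, True, False, True],
    }
)
sizes_df = dataset_sizes_sketch(example, "initial query")
```

For the example data, the "all" row holds 3 compounds, 2 targets and 3 pairs, and the "drugs" row holds 2 compounds, 2 targets and 2 pairs.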

get_stats.get_stats_columns() → tuple[list[str], list[str]]

Get the relevant columns for which stats should be calculated and a list of descriptions corresponding to the columns.

get_stats.get_stats_for_column(df: DataFrame, column: str, columns_desc: str) → list[list[str, str, int]]

Calculate the number of unique values in df[column], both for the whole of df and for various subsets of it.

Parameters:
  • df (pd.DataFrame) – Pandas DataFrame for which the number of unique values should be calculated.

  • column (str) – Column of df that the values should be calculated for.

  • columns_desc (str) – Description of the column.

Returns:

List of results in the format [column_name, subset_type, size].

Return type:

list[list[str, str, int]]
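
To illustrate the return format, here is a minimal sketch that produces one [column_name, subset_type, size] entry per subset. The subsets used here ("all" and "drugs"), the is_drug column, the function name stats_for_column_sketch and the use of columns_desc as the first element of each entry are assumptions made for this example; the subsets actually used by get_stats are not listed in this section.

```python
import pandas as pd


def stats_for_column_sketch(
    df: pd.DataFrame, column: str, columns_desc: str
) -> list:
    """Return [columns_desc, subset_type, size] for each subset of df."""
    # Hypothetical subsets; the real function may use other ones.
    subsets = {
        "all": df,
        "drugs": df[df["is_drug"]],
    }
    return [
        [columns_desc, subset_type, subset[column].nunique()]
        for subset_type, subset in subsets.items()
    ]


example = pd.DataFrame(
    {
        "target_class": ["kinase", "kinase", "GPCR", "ion channel"],
        "is_drug": [True, False, True, False],
    }
)
stats = stats_for_column_sketch(example, "target_class", "Target class")
```

Here stats contains one entry for the full table (3 unique target classes) and one for the drug subset (2 unique target classes).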