get_stats module
Get statistics of dataset for final results and debugging.
- get_stats.add_dataset_sizes(dataset: Dataset, df: DataFrame, label: str)
Compute representative counts for df and add them to the dataset for debugging.
- Parameters:
dataset (Dataset) – Dataset with compound-target pairs and debugging sizes.
df (pd.DataFrame) – Pandas DataFrame with current compound-target pairs.
label (str) – Description of pipeline step (e.g., initial query).
- get_stats.add_debugging_info(dataset: Dataset, df: DataFrame, label: str)
Wrapper for add_dataset_sizes. Handles logging level.
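One plausible reading of "handles logging level" is that the sizes are only computed when debug-level logging is enabled. The sketch below illustrates that pattern under this assumption; `record_sizes`, `sketch_add_debugging_info` and the logger name are hypothetical stand-ins, not the module's actual implementation.

```python
import logging

calls = []


def record_sizes(label: str) -> None:
    """Hypothetical stand-in for add_dataset_sizes; it only records the label."""
    calls.append(label)


def sketch_add_debugging_info(label: str, logger: logging.Logger) -> None:
    # Assumption: the counting step is skipped unless debug logging is enabled.
    if logger.isEnabledFor(logging.DEBUG):
        record_sizes(label)


logger = logging.getLogger("get_stats_sketch")  # hypothetical logger name

logger.setLevel(logging.INFO)
sketch_add_debugging_info("initial query", logger)  # skipped at INFO level

logger.setLevel(logging.DEBUG)
sketch_add_debugging_info("after filtering", logger)  # recorded at DEBUG level
```

Gating on the logger level keeps the extra counting work out of normal runs while leaving it available for debugging.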
- get_stats.get_dataset_sizes(df: DataFrame, label: str) → DataFrame
Calculate the number of unique compounds, targets and pairs for df and for the subset of df limited to drugs.
- Parameters:
df (pd.DataFrame) – Pandas DataFrame for which the dataset sizes should be calculated.
label (str) – Description of pipeline step (e.g., initial query).
- Returns:
Pandas DataFrame with calculated unique counts.
- Return type:
pd.DataFrame
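As a rough illustration of this kind of calculation, the sketch below counts unique compounds, targets and pairs for a small DataFrame and for its drug subset. The column names `compound_id`, `target_id` and `is_drug`, the output column names, and the function name `sketch_dataset_sizes` are assumptions made for the example; the real module may use different columns and a different output layout.

```python
import pandas as pd


def sketch_dataset_sizes(df: pd.DataFrame, label: str) -> pd.DataFrame:
    """Count unique compounds, targets and pairs for df and its drug subset."""
    # Hypothetical column names; the real pipeline may name these differently.
    drugs = df[df["is_drug"]]
    pairs_all = df[["compound_id", "target_id"]].drop_duplicates()
    pairs_drugs = drugs[["compound_id", "target_id"]].drop_duplicates()
    return pd.DataFrame(
        [
            {
                "label": label,
                "compounds_all": df["compound_id"].nunique(),
                "compounds_drugs": drugs["compound_id"].nunique(),
                "targets_all": df["target_id"].nunique(),
                "targets_drugs": drugs["target_id"].nunique(),
                "pairs_all": len(pairs_all),
                "pairs_drugs": len(pairs_drugs),
            }
        ]
    )


example = pd.DataFrame(
    {
        "compound_id": [1, 1, 2, 3],
        "target_id": ["A", "B", "A", "A"],
        "is_drug": [True, True, False, False],
    }
)
sizes = sketch_dataset_sizes(example, "initial query")
```

Returning a one-row DataFrame per pipeline step makes it straightforward to concatenate the rows from successive steps into a single overview table.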
- get_stats.get_stats_columns() → tuple[list[str], list[str]]
Get the relevant columns for which stats should be calculated and a list of descriptions corresponding to the columns.
- get_stats.get_stats_for_column(df: DataFrame, column: str, columns_desc: str) → list[list[str, str, int]]
Calculate the number of unique values in df[column], for df as a whole and for various subsets of df.
- Parameters:
df (pd.DataFrame) – Pandas DataFrame for which the number of unique values should be calculated.
column (str) – Column of df for which the values should be calculated.
columns_desc (str) – Description of the column.
- Returns:
List of results in the format [column_name, subset_type, size]
- Return type:
list[list[str, str, int]]
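A minimal sketch of the documented result format, assuming only two subsets (all rows and rows flagged as drugs) and a hypothetical boolean column `is_drug`. The actual subsets used by the module are not listed here, so the subset names, the function name `sketch_stats_for_column`, and the choice to put the column description in the first position are illustrative assumptions.

```python
import pandas as pd


def sketch_stats_for_column(df: pd.DataFrame, column: str, column_desc: str) -> list:
    """Return [column description, subset_type, size] rows for a few subsets."""
    # Assumption: the subsets are "all rows" and "rows flagged as drugs".
    subsets = {
        "all": df,
        "drugs": df[df["is_drug"]],
    }
    return [
        # nunique() ignores missing values by default.
        [column_desc, subset_type, int(subset[column].nunique())]
        for subset_type, subset in subsets.items()
    ]


example = pd.DataFrame(
    {
        "target_id": ["A", "B", "A", "A"],
        "is_drug": [True, False, False, True],
    }
)
stats = sketch_stats_for_column(example, "target_id", "target")
```

Each inner list corresponds to one subset, so results for several columns can be collected by extending a single list and converting it to a DataFrame afterwards.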