snowmobile.core.snowframe

DataFrame extensions; primarily includes comparison operators.

Module Contents

Classes

SnowFrame

Extends a DataFrame with a .snf entry point.

class snowmobile.core.snowframe.SnowFrame(df: pandas.DataFrame)

Bases: snowmobile.core.Generic

Extends a DataFrame with a .snf entry point.

shared_cols(self, df2: pandas.DataFrame) → List[Tuple[pd.Series, pd.Series]]

Returns list of tuples containing column pairs that are common between two DataFrames.
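The pairing described above can be sketched with plain pandas. This is only an illustration of the idea, not the library's implementation; the helper name shared_col_pairs and the sample frames are hypothetical, and keeping the first frame's column order is an assumption.

```python
from typing import List, Tuple

import pandas as pd


def shared_col_pairs(
    df1: pd.DataFrame, df2: pd.DataFrame
) -> List[Tuple[pd.Series, pd.Series]]:
    """Pair up the columns whose names appear in both frames (hypothetical helper)."""
    # Assumption: pairs follow the column order of the first frame.
    common = [c for c in df1.columns if c in df2.columns]
    return [(df1[c], df2[c]) for c in common]


df1 = pd.DataFrame({"id": [1, 2], "amt": [10.0, 20.0], "only_in_1": ["a", "b"]})
df2 = pd.DataFrame({"id": [1, 2], "amt": [10.5, 19.0], "only_in_2": ["x", "y"]})

pairs = shared_col_pairs(df1, df2)
pair_names = [left.name for left, _ in pairs]
```

Each tuple holds the same-named column from either frame, which is the shape the series-level comparison methods below expect as col1 and col2.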

static series_max_diff_abs(col1: pandas.Series, col2: pandas.Series, tolerance: float) → bool

Determines if the max absolute difference between two pandas.Series is within a tolerance level.

static series_max_diff_rel(col1: pandas.Series, col2: pandas.Series, tolerance: float) → bool

Determines if the maximum relative difference between two pandas.Series is within a tolerance level.
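A minimal sketch of the two series-level checks, written against plain pandas. The function names are hypothetical, and two details are assumptions rather than documented behaviour: the relative difference is taken as |col1 - col2| / |col2|, and "within" is treated as less than or equal to the tolerance.

```python
import pandas as pd


def max_diff_abs(col1: pd.Series, col2: pd.Series, tolerance: float) -> bool:
    """True if the largest |col1 - col2| is within `tolerance`."""
    return bool((col1 - col2).abs().max() <= tolerance)


def max_diff_rel(col1: pd.Series, col2: pd.Series, tolerance: float) -> bool:
    """True if the largest |col1 - col2| / |col2| is within `tolerance`.

    Assumption: the difference is scaled by col2; the library may scale differently.
    """
    return bool(((col1 - col2).abs() / col2.abs()).max() <= tolerance)


a = pd.Series([100.0, 200.0, 300.0])
b = pd.Series([101.0, 198.0, 300.0])

abs_ok = max_diff_abs(a, b, tolerance=2.0)    # largest absolute gap is 2.0
rel_ok = max_diff_rel(a, b, tolerance=0.005)  # largest relative gap is 2/198
```

With these inputs the absolute check passes and the relative check fails. Zeros in col2 would produce infinite or undefined ratios in this sketch and would need separate handling.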

df_max_diff_abs(self, df2: pandas.DataFrame, tolerance: float) → bool

Determines if the maximum absolute difference between any value in the shared columns of 2 DataFrames is within a tolerance level.

df_max_diff_rel(self, df2: pandas.DataFrame, tolerance: float) → bool

Determines if the maximum relative difference between any value in the shared columns of 2 DataFrames is within a tolerance level.

df_diff(self, df2: pandas.DataFrame, abs_tol: Optional[float] = None, rel_tol: Optional[float] = None) → bool

Determines if the column-wise difference between two DataFrames is within a relative or absolute tolerance level.

Note

  • df1 (the DataFrame wrapped by this SnowFrame) and df2 are assumed to have a shared, pre-defined index.

  • Exactly one of abs_tol and rel_tol is expected to be a valid float; the other is expected to be None.

  • If valid float values are provided for both abs_tol and rel_tol, the outcome of the maximum absolute difference with respect to abs_tol will be returned regardless of the value of rel_tol.

Parameters
  • df2 (pd.DataFrame) – 2nd DataFrame for comparison.

  • abs_tol (float) – Absolute tolerance; default is None.

  • rel_tol (float) – Relative tolerance; default is None.

Returns (bool):

Boolean indicating whether or not difference is within tolerance.
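The precedence rule in the note above (abs_tol takes priority when both tolerances are given) can be sketched as follows. The helper df_diff_sketch is hypothetical, the relative difference is assumed to be scaled by df2, and raising when neither tolerance is supplied is an assumption that the documentation does not state.

```python
from typing import Optional

import pandas as pd


def df_diff_sketch(
    df1: pd.DataFrame,
    df2: pd.DataFrame,
    abs_tol: Optional[float] = None,
    rel_tol: Optional[float] = None,
) -> bool:
    """Compare the shared columns of two identically indexed frames."""
    if abs_tol is None and rel_tol is None:
        # Assumed behaviour; the documented method may handle this differently.
        raise ValueError("provide abs_tol or rel_tol")
    shared = [c for c in df1.columns if c in df2.columns]
    gap = (df1[shared] - df2[shared]).abs()  # aligned on the shared index
    if abs_tol is not None:
        # abs_tol wins when both tolerances are given.
        return bool(gap.max().max() <= abs_tol)
    return bool((gap / df2[shared].abs()).max().max() <= rel_tol)


df1 = pd.DataFrame({"x": [1.0, 2.0], "y": [10.0, 20.0]}, index=["a", "b"])
df2 = pd.DataFrame({"x": [1.1, 2.0], "y": [10.0, 21.0]}, index=["a", "b"])

within_abs = df_diff_sketch(df1, df2, abs_tol=1.5)
within_rel = df_diff_sketch(df1, df2, rel_tol=0.01)
both_given = df_diff_sketch(df1, df2, abs_tol=1.5, rel_tol=0.01)
```

Here the frames pass the absolute check and fail the relative one, so supplying both tolerances returns the absolute result.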

partitions(self, on: str) → Dict[str, pd.DataFrame]

Returns a dictionary of DataFrames given a DataFrame and a partition column.

Note

  • The number of partitions returned equals the number of distinct values in the on column.

  • The on column is dropped from the partitions that are returned.

  • The depth of a vertical concatenation of all partitions should equal the depth of the original DataFrame.

Parameters

on (str) – The column name to use for partitioning the data.

Returns (Dict[str, pd.DataFrame]):

Dictionary of {(str) partition_value: (pd.DataFrame) associated subset of df}
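The partitioning behaviour described above resembles a pandas groupby split. The sketch below is an illustration under that assumption, with a hypothetical helper name and sample data; it is not taken from the library's source.

```python
from typing import Dict

import pandas as pd


def partition_sketch(df: pd.DataFrame, on: str) -> Dict[str, pd.DataFrame]:
    """Split `df` into one frame per distinct value of `on` (hypothetical helper)."""
    return {
        str(value): subset.drop(columns=[on])  # the partition column is dropped
        for value, subset in df.groupby(on)
    }


df = pd.DataFrame({"region": ["east", "west", "east"], "sales": [10, 20, 30]})
parts = partition_sketch(df, on="region")

# Row counts across all partitions should add up to the original depth.
total_rows = sum(len(p) for p in parts.values())
```

Note that groupby skips rows whose key is missing by default, so in this sketch the depth property only holds when the partition column has no null values.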

ddl(self, table: str) → str

Returns a string containing ‘create table’ DDL given a table name.

lower(self, col: Optional[str] = None) → pandas.DataFrame

Lower cases all column names, or all values within col if provided.

upper(self, col: Optional[str] = None) → pandas.DataFrame

Upper cases all column names, or all values within col if provided.

reformat(self)

Re-formats DataFrame’s columns via Column.reformat().

append_dupe_suffix(self)

Adds a trailing index number ‘_i’ to duplicate column names.
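One way to picture the renaming is the sketch below, which works on a plain list of column names. The helper is hypothetical, and the numbering scheme (starting at 1 and applied to every occurrence of a repeated name) is an assumption; the documented method may number duplicates differently.

```python
from collections import Counter
from typing import List


def suffix_dupes(names: List[str]) -> List[str]:
    """Append '_1', '_2', ... to names that occur more than once."""
    totals = Counter(names)      # how often each name occurs overall
    seen: Counter = Counter()    # how many times each name has been emitted so far
    out = []
    for name in names:
        if totals[name] > 1:
            seen[name] += 1
            out.append(f"{name}_{seen[name]}")
        else:
            out.append(name)     # unique names are left untouched
    return out


renamed = suffix_dupes(["id", "amt", "id", "dt", "amt"])
```

After this pass every name is unique, which is the property that has_dupes would report on.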

to_list(self, col: Optional[str] = None, n: Optional[int] = None) → List

Succinctly retrieves a column as a list.

Parameters
  • col (str) – Name of column.

  • n (int) – Number of records to return; defaults to full depth of column.

add_tmstmp(self, col_nm: Optional[str] = None) → pandas.DataFrame

Adds a column containing the current timestamp to a DataFrame.

Parameters

col_nm (str) – Name for column; defaults to LOADED_TMSTMP.

property original(self) → pandas.DataFrame

Returns the DataFrame in its original form (drops columns added by SnowFrame and reverts to original column names).

property has_dupes(self) → bool

Indicates whether the DataFrame has duplicate column names.

cols_matching(self, patterns: List[str], ignore_patterns: List[str] = None) → List[str]

Returns a list of columns given a list of patterns to find.

Parameters
  • patterns (List[str]) – List of regex patterns to match columns on.

  • ignore_patterns (List[str]) – Optional list of regex patterns to exclude.

Returns (List[str]):

List of column names that match patterns and do not match ignore_patterns.
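The include/exclude filtering can be sketched with the standard re module over a list of column names. The helper name is hypothetical, and the use of re.search (a match anywhere in the name, case-sensitive) is an assumption about matching semantics that the documentation does not specify.

```python
import re
from typing import List, Optional


def match_cols(
    columns: List[str],
    patterns: List[str],
    ignore_patterns: Optional[List[str]] = None,
) -> List[str]:
    """Keep columns matching any pattern and none of the ignore patterns."""
    ignore_patterns = ignore_patterns or []
    return [
        col
        for col in columns
        if any(re.search(p, col) for p in patterns)
        and not any(re.search(p, col) for p in ignore_patterns)
    ]


columns = ["order_id", "order_dt", "customer_id", "src_description"]
ids = match_cols(columns, patterns=[r"_id$"])
orders = match_cols(columns, patterns=[r"^order"], ignore_patterns=[r"_dt$"])
```

The first call keeps the two columns ending in _id; the second keeps columns starting with order and then drops the one ending in _dt.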

cols_ending(self, nm: str, ignore_patterns: Optional[List] = None) → List[str]

Returns all columns up to nm in a DataFrame.

Parameters
  • nm (str) – Name of column to end index at.

  • ignore_patterns (List[str]) – Optional list of regex patterns to exclude from the list that’s returned; primarily used for getting an end-index-at list while excluding src_description.

Returns (List[str]):

List of column names matching criterion.
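As a rough illustration, selecting columns up to nm amounts to slicing the column list at that name and then filtering. The helper below is hypothetical; whether nm itself is included in the result is not stated above, so the inclusive slice here is an assumption.

```python
import re
from typing import List, Optional


def cols_up_to(
    columns: List[str],
    nm: str,
    ignore_patterns: Optional[List[str]] = None,
) -> List[str]:
    """Return columns from the first one through `nm`, minus ignored ones."""
    end = columns.index(nm) + 1  # assumption: the slice includes `nm`
    ignore_patterns = ignore_patterns or []
    return [
        col
        for col in columns[:end]
        if not any(re.search(p, col) for p in ignore_patterns)
    ]


columns = ["id", "src_description", "amt", "dt", "loaded_tmstmp"]
selected = cols_up_to(columns, nm="dt", ignore_patterns=[r"^src_description$"])
```

In this sketch a name that is not present raises ValueError from list.index; how the documented method treats a missing nm is not specified.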