`snowmobile.core.snowframe`¶

DataFrame extensions; primarily includes comparison operators.

Module Contents¶

Classes¶

SnowFrame

Extends a DataFrame with a .snf entry point.

class snowmobile.core.snowframe.SnowFrame(df: pandas.DataFrame)¶

Bases: snowmobile.core.Generic

Extends a DataFrame with a .snf entry point.

shared_cols(self, df2: pandas.DataFrame) → List[Tuple[pd.Series, pd.Series]]¶: Returns a list of tuples containing the column pairs that are common to both DataFrames.

static series_max_diff_abs(col1: pandas.Series, col2: pandas.Series, tolerance: float) → bool ¶: Determines if the max absolute difference between two pandas.Series is within a tolerance level.

static series_max_diff_rel(col1: pandas.Series, col2: pandas.Series, tolerance: float) → bool ¶: Determines if the maximum relative difference between two pandas.Series is within a tolerance level.

df_max_diff_abs(self, df2: pandas.DataFrame, tolerance: float) → bool ¶: Determines if the maximum absolute difference between any value in the shared columns of 2 DataFrames is within a tolerance level.

df_max_diff_rel(self, df2: pandas.DataFrame, tolerance: float) → bool ¶: Determines if the maximum relative difference between any value in the shared columns of 2 DataFrames is within a tolerance level.
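The series-level checks above can be illustrated with a minimal pandas sketch. It is not the library's implementation: the standalone functions only mirror the documented signatures, and both the definition of "relative" (difference divided by the magnitude of col1) and the inclusive comparison against the tolerance are assumptions.

```python
import pandas as pd


def series_max_diff_abs(col1: pd.Series, col2: pd.Series, tolerance: float) -> bool:
    """True if the largest absolute element-wise difference is within tolerance."""
    return bool((col1 - col2).abs().max() <= tolerance)


def series_max_diff_rel(col1: pd.Series, col2: pd.Series, tolerance: float) -> bool:
    """True if the largest relative element-wise difference is within tolerance.

    Assumption: 'relative' means |col1 - col2| / |col1|.
    """
    rel = (col1 - col2).abs() / col1.abs()
    return bool(rel.max() <= tolerance)


a = pd.Series([100.0, 200.0, 300.0])
b = pd.Series([101.0, 198.0, 300.0])

abs_ok = series_max_diff_abs(a, b, tolerance=2.5)    # largest gap is 2.0
rel_ok = series_max_diff_rel(a, b, tolerance=0.005)  # largest ratio is about 0.01
```

Here the absolute check passes while the relative check fails, which shows why the two tolerances are not interchangeable.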

df_diff(self, df2: pandas.DataFrame, abs_tol: Optional[float] = None, rel_tol: Optional[float] = None) → bool ¶

Determines if the column-wise difference between two DataFrames is within a relative or absolute tolerance level.

Note

df1 and df2 are assumed to have a shared, pre-defined index.
Exactly one of abs_tol and rel_tol is expected to be a valid float; the other is expected to be None.
If valid float values are provided for both abs_tol and rel_tol, the outcome of the maximum absolute difference with respect to abs_tol will be returned regardless of the value of rel_tol.

Parameters

df2 (pd.DataFrame) – 2nd DataFrame for comparison.
abs_tol (float) – Absolute tolerance; default is None.
rel_tol (float) – Relative tolerance; default is None.

Returns (bool):: Boolean indicating whether or not difference is within tolerance.
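The precedence rule in the note above (abs_tol wins when both tolerances are given) can be sketched as follows. This is a hypothetical standalone function, not the method itself: it assumes numeric shared columns, an aligned index, the same relative-difference definition as |df1 - df2| / |df1|, and it raises ValueError when neither tolerance is given, which the documentation does not specify.

```python
from typing import Optional

import pandas as pd


def df_diff(
    df1: pd.DataFrame,
    df2: pd.DataFrame,
    abs_tol: Optional[float] = None,
    rel_tol: Optional[float] = None,
) -> bool:
    # Only columns present in both frames are compared.
    shared = [c for c in df1.columns if c in df2.columns]
    if abs_tol is not None:
        # abs_tol takes precedence, so rel_tol is ignored on this path.
        return all((df1[c] - df2[c]).abs().max() <= abs_tol for c in shared)
    if rel_tol is not None:
        return all(
            ((df1[c] - df2[c]).abs() / df1[c].abs()).max() <= rel_tol
            for c in shared
        )
    # Assumption: behavior with no tolerance is undocumented; fail loudly here.
    raise ValueError("one of abs_tol or rel_tol is required")


df1 = pd.DataFrame({"id": [1, 2], "x": [10.0, 20.0]})
df2 = pd.DataFrame({"x": [10.5, 20.0], "y": [0, 0]})

within_abs = df_diff(df1, df2, abs_tol=1.0)                # only "x" is shared
within_rel = df_diff(df1, df2, rel_tol=0.01)               # 0.5 / 10.0 exceeds 0.01
within_both = df_diff(df1, df2, abs_tol=1.0, rel_tol=0.01)  # abs_tol decides
```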

partitions(self, on: str) → Dict[str, pd.DataFrame]¶

Returns a dictionary of DataFrames given a DataFrame and a partition column.

Note

The number of distinct values within the on column will be 1:1 with the number of partitions that are returned.
The on column is dropped from the partitions that are returned.
The depth of a vertical concatenation of all partitions should equal the depth of the original DataFrame.

Parameters: on (str) – The column name to use for partitioning the data.

Returns (Dict[str, pd.DataFrame]):: Dictionary of {(str) partition_value: (pd.DataFrame) associated subset of df}
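The three properties in the note above can be reproduced with a plain pandas groupby. The sketch below is an illustration under assumptions, not the library's code: the standalone partitions function and the sample data are hypothetical, and partition values are cast to str to match the documented return type.

```python
from typing import Dict

import pandas as pd


def partitions(df: pd.DataFrame, on: str) -> Dict[str, pd.DataFrame]:
    # One entry per distinct value of `on`; the partition column is dropped.
    return {
        str(value): part.drop(columns=[on])
        for value, part in df.groupby(on)
    }


df = pd.DataFrame({"region": ["east", "west", "east"], "sales": [1, 2, 3]})
parts = partitions(df, on="region")

total_rows = sum(len(part) for part in parts.values())
```

Note that groupby skips rows whose key is missing by default, so in this sketch the row-count property only holds when the partition column has no null values.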

ddl(self, table: str) → str ¶: Returns a string containing ‘create table’ DDL given a table name.

lower(self, col: Optional[str] = None) → pandas.DataFrame ¶: Lower cases all column names or all values within col if provided.

upper(self, col: Optional[str] = None) → pandas.DataFrame ¶: Upper cases all column names or all values within col if provided.

reformat(self)¶: Re-formats DataFrame’s columns via Column.reformat().

append_dupe_suffix(self)¶: Adds a trailing index number ‘_i’ to duplicate column names.
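One way to picture the duplicate-suffix behavior is the sketch below, which works on a plain list of column names. The numbering scheme is an assumption: the first occurrence is left unchanged and later occurrences receive _1, _2, and so on. The documentation only states that a trailing ‘_i’ is added, so the actual scheme may differ.

```python
from collections import Counter
from typing import List


def append_dupe_suffix(names: List[str]) -> List[str]:
    seen: Counter = Counter()
    renamed = []
    for name in names:
        seen[name] += 1
        # Assumption: the first occurrence keeps its name; repeats get _1, _2, ...
        renamed.append(name if seen[name] == 1 else f"{name}_{seen[name] - 1}")
    return renamed


renamed = append_dupe_suffix(["id", "amt", "id", "id"])
# renamed == ["id", "amt", "id_1", "id_2"]
```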

to_list(self, col: Optional[str] = None, n: Optional[int] = None) → List¶

Succinctly retrieves a column as a list.

Parameters

col (str) – Name of column.
n (int) – Number of records to return; defaults to full depth of column.

add_tmstmp(self, col_nm: Optional[str] = None) → pandas.DataFrame ¶

Adds a column containing the current timestamp to a DataFrame.

Parameters: col_nm (str) – Name for column; defaults to LOADED_TMSTMP.

property original(self) → pandas.DataFrame ¶: Returns the DataFrame in its original form (drops columns added by SnowFrame and reverts to original column names).

property has_dupes(self) → bool ¶: DataFrame has duplicate column names.

cols_matching(self, patterns: List[str], ignore_patterns: List[str] = None) → List[str]¶

Returns a list of columns given a list of patterns to find.

Parameters

patterns (List[str]) – List of regex patterns to match columns on.
ignore_patterns (List[str]) – Optional list of regex patterns to exclude.

Returns (List[str]):: List of columns found/excluded.
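The include/exclude logic can be sketched with the standard re module over a list of column names. This is an illustration only: the use of re.search (rather than re.match) and case-sensitive matching are assumptions, and the column names are made up.

```python
import re
from typing import List, Optional


def cols_matching(
    columns: List[str],
    patterns: List[str],
    ignore_patterns: Optional[List[str]] = None,
) -> List[str]:
    ignore_patterns = ignore_patterns or []
    return [
        col
        for col in columns
        # Keep a column if any pattern matches and no ignore pattern does.
        if any(re.search(p, col) for p in patterns)
        and not any(re.search(p, col) for p in ignore_patterns)
    ]


cols = ["order_id", "order_dt", "src_description", "customer_id"]
found = cols_matching(cols, patterns=["_id$", "^src"], ignore_patterns=["description"])
# found == ["order_id", "customer_id"]
```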

cols_ending(self, nm: str, ignore_patterns: Optional[List] = None) → List[str]¶

Returns all columns up to nm in a DataFrame.

Parameters

nm (str) – Name of column to end index at.
ignore_patterns (List[str]) – Optional list of regex patterns to exclude from the list that’s returned; primarily used for getting an end-index-at list while excluding src_description.

Returns (List[str]):: List of column names matching criterion.
