Summary
Implement compare_df_cols() and compare_df_cols_same() from R janitor to help users compare DataFrame columns before row-binding operations.
Background
When combining multiple DataFrames with pd.concat() or similar operations, mismatched column dtypes are silently coerced to a common dtype (for example, int64 combined with object yields object), which can cause unexpected behavior downstream. R janitor provides utilities to detect these mismatches before binding.
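To make the problem concrete, here is a minimal sketch of the silent coercion described above. It uses only existing pandas calls; the frame names are illustrative.

```python
import pandas as pd

# Column "A" holds integers in one frame and strings in the other.
left = pd.DataFrame({"A": [1, 2], "B": ["x", "y"]})
right = pd.DataFrame({"A": ["a", "b"], "B": ["c", "d"]})

# pd.concat raises no error or warning; it falls back to a common dtype.
combined = pd.concat([left, right], ignore_index=True)

print(left["A"].dtype)      # integer dtype before binding
print(combined["A"].dtype)  # object after binding: ints and strings now share a column
```

Nothing in the call signals that "A" changed type, which is the gap the proposed functions are meant to close.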
Functions to Implement
1. compare_df_cols(*dfs, return="all", bind_method="bind_rows", strict_description=False)
Compare column types across multiple DataFrames.
import pandas as pd
import janitor
df1 = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})
df2 = pd.DataFrame({'A': [3.0, 4.0], 'B': ['z', 'w'], 'C': [True, False]})
df3 = pd.DataFrame({'A': ['a', 'b'], 'B': ['c', 'd']})
janitor.compare_df_cols(df1, df2, df3)
#   column_name     df1      df2     df3
# 0           A   int64  float64  object
# 1           B  object   object  object
# 2           C     NaN     bool     NaN
# With named arguments
janitor.compare_df_cols(train=df1, test=df2)
#   column_name   train     test
# 0           A   int64  float64
# 1           B  object   object
# 2           C     NaN     bool
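# Return only mismatching columns
# (`return` is a reserved word in Python, so the parameter needs another name, e.g. `return_`)
janitor.compare_df_cols(df1, df2, df3, return_="mismatch")
#   column_name    df1      df2     df3
# 0           A  int64  float64  object
# (C is absent from df1 and df3, which bind_method="bind_rows" treats as a match)
2. compare_df_cols_same(*dfs, bind_method="bind_rows", verbose=True)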
Return True if the DataFrames can be safely bound, False otherwise.
janitor.compare_df_cols_same(df1, df2)
# Prints mismatch info if verbose=True
# Returns: False
janitor.compare_df_cols_same(df1, df1)
# Returns: True
3. describe_class(series, strict_description=True)
Describe the class of a pandas Series, with special handling for categoricals.
janitor.describe_class(pd.Series([1, 2, 3]))
# 'int64'
janitor.describe_class(pd.Categorical(['a', 'b', 'a']))
# 'category(levels=["a", "b"])' # when strict_description=True
# 'category' # when strict_description=False
R janitor Reference
From compare_df_cols.R:
Key features:
- return parameter: "all", "match", or "mismatch"
- bind_method parameter: "bind_rows" (missing cols OK) vs "rbind" (missing cols = mismatch)
- strict_description: Whether to include factor levels in comparison
- Accepts both individual DataFrames and lists of DataFrames
- Named arguments become column names in output
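As a rough illustration of the strict_description feature listed above, the following sketch shows one way describe_class could behave. The function does not exist in pyjanitor yet; the body and the output format (copied from the examples in this issue) are assumptions, not a settled design.

```python
import pandas as pd


def describe_class(series, strict_description=True):
    """Hypothetical sketch: describe a column's dtype as a string."""
    # Accept a Series or an array-like such as pd.Categorical.
    series = pd.Series(series)
    dtype = series.dtype
    if isinstance(dtype, pd.CategoricalDtype):
        if strict_description:
            # Include the categories so differing level sets count as different.
            levels = ", ".join(f'"{c}"' for c in dtype.categories)
            return f"category(levels=[{levels}])"
        return "category"
    return str(dtype)


cat = pd.Categorical(["a", "b", "a"])
print(describe_class(cat))                            # category(levels=["a", "b"])
print(describe_class(cat, strict_description=False))  # category
```

With strict_description=True, two categorical columns with different category sets produce different descriptions, mirroring how R janitor treats factor levels.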
Use Cases
- Before concatenating train/test splits: Ensure column types match
- Validating data pipelines: Check that transformed data matches expected schema
- Debugging type coercion issues: Find where types diverge across datasets
- Data quality checks: Identify schema drift in data sources
Implementation Notes
- Should accept *args for unnamed DataFrames and **kwargs for named ones
- Use inspect module or similar to get variable names when not provided
- Consider supporting lists of DataFrames as a single argument
- For categoricals with strict_description=True, include the categories in the description
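The notes above could be sketched roughly as follows. This is a starting point under stated assumptions, not a proposed final implementation: the parameter is named return_ because return is a reserved word in Python, unnamed frames are labelled df1, df2, ... by position instead of being looked up with inspect, and strict_description is left out for brevity.

```python
import pandas as pd


def compare_df_cols(*dfs, return_="all", bind_method="bind_rows", **named_dfs):
    """Hypothetical sketch: tabulate column dtypes across DataFrames."""
    frames = {}
    for item in dfs:
        # A list or tuple of DataFrames passed as one argument is flattened.
        members = item if isinstance(item, (list, tuple)) else [item]
        for df in members:
            frames[f"df{len(frames) + 1}"] = df  # positional label (assumption)
    frames.update(named_dfs)  # keyword names become output column names
    if not frames:
        raise ValueError("at least one DataFrame is required")

    # One output column per frame; columns absent from a frame become NaN.
    table = pd.DataFrame(
        {label: df.dtypes.astype(str) for label, df in frames.items()}
    )

    def is_mismatch(row):
        present = row.dropna()
        if bind_method == "rbind" and len(present) < len(row):
            return True  # rbind treats a missing column as a mismatch
        return present.nunique() > 1  # more than one distinct dtype

    mismatch = table.apply(is_mismatch, axis=1)
    if return_ == "mismatch":
        table = table[mismatch]
    elif return_ == "match":
        table = table[~mismatch]
    elif return_ != "all":
        raise ValueError("return_ must be 'all', 'match', or 'mismatch'")
    return table.rename_axis("column_name").reset_index()


def compare_df_cols_same(*dfs, bind_method="bind_rows", verbose=True, **named_dfs):
    """Hypothetical sketch: True when no column dtypes conflict."""
    mismatches = compare_df_cols(
        *dfs, return_="mismatch", bind_method=bind_method, **named_dfs
    )
    if verbose and not mismatches.empty:
        print(mismatches)
    return mismatches.empty


df1 = pd.DataFrame({"A": [1, 2], "B": ["x", "y"]})
df2 = pd.DataFrame({"A": [3.0, 4.0], "B": ["z", "w"], "C": [True, False]})
print(compare_df_cols(train=df1, test=df2))
print(compare_df_cols_same(df1, df2, verbose=False))  # False: A differs in dtype
```

Building the table from each frame's dtypes Series lets pandas align on column names, so columns missing from a frame appear as NaN without extra bookkeeping.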
Labels
- enhancement
- good first issue
- help wanted