Feature: Implement compare_df_cols() for DataFrame comparison before binding #1558

@ericmjl

Summary

Implement compare_df_cols() and compare_df_cols_same() from R janitor to help users compare DataFrame columns before row-binding operations.

Background

When combining multiple DataFrames with pd.concat() or similar operations, mismatched column types are coerced without warning (for example int64 to float64, or numeric to object), which can corrupt data or cause unexpected behavior downstream. R janitor provides utilities to detect these mismatches before binding.
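
As a small illustration of the problem, the snippet below concatenates frames whose column A has different types. The frame names are made up for this example; only pd.concat() itself comes from pandas.

```python
import pandas as pd

ints = pd.DataFrame({"A": [1, 2]})
floats = pd.DataFrame({"A": [3.5, 4.5]})
strings = pd.DataFrame({"A": ["a", "b"]})

# int64 + float64: the integers are upcast to float64 without any warning.
numeric = pd.concat([ints, floats], ignore_index=True)
numeric_dtype = str(numeric["A"].dtype)

# int64 + strings: the result is an object column holding mixed Python types.
mixed = pd.concat([ints, strings], ignore_index=True)
mixed_dtype = str(mixed["A"].dtype)
value_types = {type(value).__name__ for value in mixed["A"]}

print(numeric_dtype, mixed_dtype, sorted(value_types))
```

Neither call raises or warns, which is why a check before binding is useful.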

Functions to Implement

1. compare_df_cols(*dfs, return_="all", bind_method="bind_rows", strict_description=False)

Compare column types across multiple DataFrames. R janitor names the first option return, but return is a reserved word in Python and cannot be used as a parameter name, so it is written return_ here as a placeholder; the final name is open for discussion.

import pandas as pd
import janitor

df1 = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})
df2 = pd.DataFrame({'A': [3.0, 4.0], 'B': ['z', 'w'], 'C': [True, False]})
df3 = pd.DataFrame({'A': ['a', 'b'], 'B': ['c', 'd']})

janitor.compare_df_cols(df1, df2, df3)
#   column_name      df1      df2      df3
# 0           A    int64  float64   object
# 1           B   object   object   object
# 2           C      NaN     bool      NaN

# With named arguments
janitor.compare_df_cols(train=df1, test=df2)
#   column_name    train     test
# 0           A    int64  float64
# 1           B   object   object
# 2           C      NaN     bool

# Return only mismatching columns
janitor.compare_df_cols(df1, df2, df3, return_="mismatch")
#   column_name      df1      df2      df3
# 0           A    int64  float64   object
# (C is not listed: with bind_method="bind_rows", a column missing from some DataFrames still counts as a match)
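
A minimal sketch of how the comparison could work, not a proposed final implementation. Assumptions: the option is spelled return_ because return is a reserved word in Python, unnamed frames get placeholder labels df_1, df_2, ... instead of inferred variable names, and strict_description is left out. The match rule follows the bind_method description in the R janitor reference below.

```python
import pandas as pd


def compare_df_cols(*dfs, return_="all", bind_method="bind_rows", **named_dfs):
    """Tabulate the dtype of every column in every frame (illustrative sketch)."""
    if return_ not in ("all", "match", "mismatch"):
        raise ValueError("return_ must be 'all', 'match', or 'mismatch'")
    if bind_method not in ("bind_rows", "rbind"):
        raise ValueError("bind_method must be 'bind_rows' or 'rbind'")

    # Placeholder labels for unnamed frames; keyword names are kept as given.
    frames = {f"df_{i}": df for i, df in enumerate(dfs, start=1)}
    frames.update(named_dfs)

    # Union of column names in first-seen order.
    column_names = list(
        dict.fromkeys(col for df in frames.values() for col in df.columns)
    )

    rows = []
    for name in column_names:
        found = {
            label: str(df[name].dtype) if name in df.columns else None
            for label, df in frames.items()
        }
        present = {dtype for dtype in found.values() if dtype is not None}
        has_gap = None in found.values()
        # A gap only counts as a mismatch under the stricter "rbind" rule.
        is_match = len(present) == 1 and not (has_gap and bind_method == "rbind")
        if return_ == "all" or (return_ == "match") == is_match:
            rows.append({"column_name": name, **found})

    return pd.DataFrame(rows, columns=["column_name", *frames])


df1 = pd.DataFrame({"A": [1, 2], "B": [0.5, 1.5]})
df2 = pd.DataFrame({"A": [3.0, 4.0], "B": [2.5, 3.5], "C": [True, False]})

everything = compare_df_cols(train=df1, test=df2)
mismatched = compare_df_cols(train=df1, test=df2, return_="mismatch")
strict = compare_df_cols(train=df1, test=df2, return_="mismatch", bind_method="rbind")
```

Here mismatched contains only A, while strict also contains C because "rbind" treats a missing column as a mismatch.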

2. compare_df_cols_same(*dfs, bind_method="bind_rows", verbose=True)

Return True if the DataFrames can be row-bound without type mismatches, False otherwise.

janitor.compare_df_cols_same(df1, df2)
# Prints mismatch info if verbose=True
# Returns: False

janitor.compare_df_cols_same(df1, df1)
# Returns: True
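
A rough sketch of the boolean check, written standalone so it does not depend on any other helper. The printed message format is an assumption, and strict_description handling is omitted.

```python
import pandas as pd


def compare_df_cols_same(*dfs, bind_method="bind_rows", verbose=True):
    """Return True when no column has conflicting dtypes (illustrative sketch)."""
    if bind_method not in ("bind_rows", "rbind"):
        raise ValueError("bind_method must be 'bind_rows' or 'rbind'")

    # Union of column names in first-seen order.
    column_names = list(dict.fromkeys(col for df in dfs for col in df.columns))

    mismatches = {}
    for name in column_names:
        found = [str(df[name].dtype) if name in df.columns else None for df in dfs]
        present = {dtype for dtype in found if dtype is not None}
        has_gap = None in found
        # Conflicting dtypes always fail; gaps fail only under "rbind".
        if len(present) > 1 or (has_gap and bind_method == "rbind"):
            mismatches[name] = found

    if verbose:
        for name, found in mismatches.items():
            print(f"column {name!r} differs: {found}")
    return not mismatches


df1 = pd.DataFrame({"A": [1, 2]})
df2 = pd.DataFrame({"A": [3.0, 4.0], "C": [True, False]})

same_self = compare_df_cols_same(df1, df1, verbose=False)
same_types = compare_df_cols_same(df1, df2, verbose=False)
```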

3. describe_class(series, strict_description=True)

Describe the class of a pandas Series, with special handling for categoricals.

janitor.describe_class(pd.Series([1, 2, 3]))
# 'int64'

janitor.describe_class(pd.Series(['a', 'b', 'a'], dtype='category'))
# 'category(levels=["a", "b"])'  # when strict_description=True
# 'category'  # when strict_description=False
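
One possible shape for describe_class, assuming the output format shown above. It reads only the dtype attribute, so it would work for a Series or a bare pd.Categorical; ordered categoricals are not distinguished in this sketch.

```python
import pandas as pd


def describe_class(values, strict_description=True):
    """Describe a column's dtype as a string (illustrative sketch)."""
    dtype = values.dtype
    if isinstance(dtype, pd.CategoricalDtype):
        if not strict_description:
            return "category"
        # Include the categories so frames with different levels compare unequal.
        levels = ", ".join(f'"{level}"' for level in dtype.categories)
        return f"category(levels=[{levels}])"
    return str(dtype)


letters = pd.Series(["a", "b", "a"], dtype="category")
strict_label = describe_class(letters)
loose_label = describe_class(letters, strict_description=False)
int_label = describe_class(pd.Series([1, 2, 3]))
```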

R janitor Reference

From compare_df_cols.R:

Key features:

  • return parameter: "all", "match", or "mismatch"
  • bind_method parameter: "bind_rows" (missing cols OK) vs "rbind" (missing cols = mismatch)
  • strict_description: Whether to include factor levels in comparison
  • Accepts both individual DataFrames and lists of DataFrames
  • Named arguments become column names in output

Use Cases

  1. Before concatenating train/test splits: Ensure column types match
  2. Validating data pipelines: Check that transformed data matches expected schema
  3. Debugging type coercion issues: Find where types diverge across datasets
  4. Data quality checks: Identify schema drift in data sources

Implementation Notes

  1. Should accept *args for unnamed DataFrames and **kwargs for named ones
  2. Use inspect module or similar to get variable names when not provided
  3. Consider supporting lists of DataFrames as a single argument
  4. For categoricals with strict_description=True, include the categories in the description
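
Notes 1 to 3 could be combined in a single input-normalising helper. The sketch below is only one way to do it: label_frames is a hypothetical name, the variable-name lookup matches objects by identity in the caller's frame (fragile when one frame is bound to several names or passed as an expression), and the df_N fallback labels are an assumption.

```python
import inspect

import pandas as pd


def label_frames(*dfs, **named_dfs):
    """Map a label to each DataFrame passed in (hypothetical helper)."""
    # Note 3: a list or tuple of frames is flattened into the positional frames.
    flat = []
    for item in dfs:
        if isinstance(item, (list, tuple)):
            flat.extend(item)
        else:
            flat.append(item)

    # Note 2: look up the caller's variables to recover names by identity.
    caller = inspect.currentframe().f_back
    caller_vars = dict(caller.f_globals)
    caller_vars.update(caller.f_locals)

    labelled = {}
    for position, df in enumerate(flat, start=1):
        names = [
            name
            for name, value in caller_vars.items()
            if value is df and not name.startswith("_")
        ]
        # Fall back to a positional label when the name is missing or ambiguous.
        label = names[0] if len(names) == 1 else f"df_{position}"
        if label in labelled or label in named_dfs:
            label = f"df_{position}"
        labelled[label] = df

    # Note 1: keyword arguments keep the names the user gave them.
    labelled.update(named_dfs)
    return labelled


left = pd.DataFrame({"A": [1]})
right = pd.DataFrame({"A": [2.0]})

inferred = list(label_frames(left, right))
from_list = list(label_frames([pd.DataFrame(), pd.DataFrame()]))
explicit = list(label_frames(train=left, test=right))
```

In this example inferred holds the variable names left and right, while frames with no recoverable name fall back to df_1 and df_2.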

Labels

  • enhancement
  • good first issue
  • help wanted
