Feature: Implement utility functions from R janitor #1560

@ericmjl

Summary

Implement several utility functions from R janitor that provide helpful data validation and manipulation capabilities.

Functions to Implement

1. single_value(x, missing=None, warn_if_all_missing=False, info=None)

Return the single unique value in a vector/Series (ignoring missing values), raising an error if more than one distinct value is found.

import janitor

# Basic usage
janitor.single_value([1, 1, 1, None])
# 1

janitor.single_value([None, "a"], missing=[None, "a"])
# None (the first entry of `missing`, returned when every value is missing)

# Useful in groupby operations
df.groupby('A').agg(B=('B', janitor.single_value))

# With info for better error messages
df.groupby('A').apply(
    lambda g: janitor.single_value(g['B'], info=f"group A={g['A'].iloc[0]}")
)

# Raises error if multiple values found:
janitor.single_value([1, 2, 3])
# ValueError: More than one (3) value found (1, 2, 3)

Use case: Validate that a column has a single value within each group, common in data cleaning.
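
As a possible starting point, here is a minimal sketch of the behaviour described above. It is not the R implementation. It assumes that `missing` may be a scalar or a list, that None/NaN/pd.NA are treated as one "NA" value, and that `info` is simply appended to the message.

```python
import warnings

import pandas as pd


def single_value(x, missing=None, warn_if_all_missing=False, info=None):
    """Return the one non-missing value in x, or raise ValueError (sketch)."""
    if isinstance(missing, (list, tuple, set)):
        missing_values = list(missing)
    else:
        missing_values = [missing]
    # Split the missing markers into "NA-like" and ordinary values.
    na_is_missing = any(pd.isna(m) for m in missing_values)
    concrete_missing = [m for m in missing_values if not pd.isna(m)]

    found = []
    for value in x:
        if pd.isna(value):
            is_missing = na_is_missing
            key = None  # assumption: collapse None/NaN/pd.NA into one value
        else:
            is_missing = value in concrete_missing
            key = value
        if not is_missing and key not in found:
            found.append(key)

    suffix = f": {info}" if info else ""
    if len(found) > 1:
        listed = ", ".join(str(v) for v in found)
        raise ValueError(
            f"More than one ({len(found)}) value found ({listed}){suffix}"
        )
    if not found:
        if warn_if_all_missing:
            warnings.warn(f"All values are missing{suffix}")
        # Nothing but missing values: hand back the first missing marker.
        return missing_values[0]
    return found[0]
```

Because it only iterates over `x`, the same function accepts a list or a pandas Series, so it can be passed directly as a groupby aggregator.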

2. get_one_to_one(df)

Find columns that have a 1:1 mapping to each other (useful for identifying redundant columns or validating relationships).

import pandas as pd

df = pd.DataFrame({
    'Lab_Test_Long': ['Cholesterol, LDL', 'Cholesterol, LDL', 'Glucose'],
    'Lab_Test_Short': ['CLDL', 'CLDL', 'GLUC'],
    'LOINC': [12345, 12345, 54321],
    'Person': ['Sam', 'Bill', 'Sam']
})

janitor.get_one_to_one(df)
# [['Lab_Test_Long', 'Lab_Test_Short', 'LOINC']]
# These three columns all map 1:1 to each other

Use case: Data validation, identifying lookup table candidates, detecting redundant columns.
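
One way this could work, sketched under the assumption that two columns are 1:1 exactly when their values, relabelled by order of first appearance, give identical code sequences. The double `pd.factorize` call is a choice made here so that NaN counts as an ordinary value.

```python
import pandas as pd


def get_one_to_one(df):
    """Group columns whose values map 1:1 onto each other (sketch)."""
    groups = {}
    for col in df.columns:
        # First pass: label values by first appearance (NaN becomes -1).
        codes = pd.factorize(df[col])[0]
        # Second pass: relabel so that -1 is treated like any other value.
        codes = pd.factorize(codes)[0]
        groups.setdefault(tuple(codes.tolist()), []).append(col)
    # Only groups with at least two columns describe a 1:1 relationship.
    return [cols for cols in groups.values() if len(cols) > 1]
```

This makes a single pass per column and compares columns through a dictionary key, so it avoids checking every pair of columns explicitly.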

3. round_half_up(x, digits=0)

Round numeric values with "half up" rounding (like Excel), where ties round away from zero: 0.5 rounds to 1 and -0.5 rounds to -1.

import janitor

janitor.round_half_up(12.5)  # 13 (not 12 like Python's default)
janitor.round_half_up(1.125, digits=2)  # 1.13
janitor.round_half_up(-0.5)  # -1 (rounds away from zero)

# Compare to Python default:
round(12.5)  # 12 (banker's rounding - round half to even)
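
A scalar-only sketch of the rounding rule, for discussion. The small tolerance `eps` is an assumption added here so that values stored slightly below a tie (such as 2.675) still round up. It also means inputs within about 1.5e-8 of a tie are treated as ties.

```python
import math
import sys


def round_half_up(x, digits=0):
    """Round a scalar so that ties move away from zero (sketch only)."""
    scale = 10 ** digits
    # Assumed tolerance to absorb binary floating-point representation error.
    eps = math.sqrt(sys.float_info.epsilon)
    # Work on the magnitude, then restore the sign of the input.
    magnitude = math.floor(abs(x) * scale + 0.5 + eps) / scale
    return math.copysign(magnitude, x)


print(round_half_up(12.5))       # 13.0
print(round_half_up(1.125, 2))   # 1.13
print(round_half_up(-0.5))       # -1.0
```

A vectorised version for Series or arrays would need the same logic expressed with NumPy operations, which is left out here.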

4. signif_half_up(x, digits=6)

Round to significant digits with "half up" rounding.

janitor.signif_half_up(12.5, digits=2)  # 13
janitor.signif_half_up(1.125, digits=3)  # 1.13
janitor.signif_half_up(-2.5, digits=1)  # -3
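
Rounding to significant digits can be reduced to rounding to decimal places once the magnitude of the number is known. The sketch below is scalar-only and repeats a `round_half_up` helper so it runs on its own. Returning 0, NaN and infinity unchanged is an assumption made here.

```python
import math
import sys


def round_half_up(x, digits=0):
    """Helper: round a scalar with ties going away from zero (sketch)."""
    scale = 10 ** digits
    eps = math.sqrt(sys.float_info.epsilon)  # assumed tolerance
    return math.copysign(math.floor(abs(x) * scale + 0.5 + eps) / scale, x)


def signif_half_up(x, digits=6):
    """Round a scalar to `digits` significant digits, ties away from zero."""
    if x == 0 or math.isnan(x) or math.isinf(x):
        return x
    # Position of the leading digit: 12.5 -> 1, 1.125 -> 0, 0.0123 -> -2.
    exponent = math.floor(math.log10(abs(x)))
    # Number of decimal places that keeps `digits` significant digits.
    return round_half_up(x, digits - 1 - exponent)
```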

5. paste_skip_na(*args, sep=" ", collapse=None)

Concatenate strings while skipping NA values (like paste() in R, except that NA values are dropped instead of being converted to the string "NA").

janitor.paste_skip_na("A", None)
# "A"

janitor.paste_skip_na("A", None, ["B", None], sep=",")
# ["A,B", "A"]

janitor.paste_skip_na(None, None, None)
# None (preserves NA when all values are NA)

Use case: Building composite strings from multiple columns where some may be missing.
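
A rough sketch of the element-wise behaviour shown in the examples. The broadcasting rule (length-1 inputs are recycled, other lengths must match) and the handling of `collapse` (join the non-NA results into one string) are assumptions, not settled API.

```python
import pandas as pd


def paste_skip_na(*args, sep=" ", collapse=None):
    """Join values element-wise, dropping NA entries (sketch)."""
    vector_types = (list, tuple, pd.Series)
    all_scalar = not any(isinstance(a, vector_types) for a in args)
    columns = [list(a) if isinstance(a, vector_types) else [a] for a in args]
    length = max((len(c) for c in columns), default=0)
    for c in columns:
        if len(c) not in (1, length):
            raise ValueError("inputs must have length 1 or a common length")

    rows = []
    for i in range(length):
        # Recycle length-1 inputs across every row.
        parts = [c[i] if len(c) > 1 else c[0] for c in columns]
        kept = [str(p) for p in parts if not pd.isna(p)]
        # A row made only of NA values stays NA (None).
        rows.append(sep.join(kept) if kept else None)

    if collapse is not None:
        kept_rows = [r for r in rows if r is not None]
        return collapse.join(kept_rows) if kept_rows else None
    # Scalar inputs give a scalar result; otherwise return a list.
    return rows[0] if all_scalar and rows else rows
```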

6. top_levels(series, n=2, show_na=False)

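Generate a frequency table of a categorical/factor variable grouped into top-n, bottom-n, and middle levels, where "top" and "bottom" refer to the first and last n levels in category order rather than the most or least frequent ones.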

import pandas as pd

s = pd.Categorical(['A', 'B', 'C', 'D', 'E', 'A', 'B', 'C'])
janitor.top_levels(s, n=2)
#             level  n  percent
# 0            A, B  4     0.50
# 1  <<Middle (1)>>  2     0.25
# 2            D, E  2     0.25

Use case: Summarizing high-cardinality categorical variables.
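
A sketch of how the grouping could be computed, assuming that levels are taken in category order and that at least one middle level must remain. The `<<Middle (k)>>` label format follows the example above. The "NA" row for `show_na=True` is a guess at the intended output.

```python
import numpy as np
import pandas as pd


def top_levels(series, n=2, show_na=False):
    """Frequency table of the first n, middle, and last n levels (sketch)."""
    cat = pd.Categorical(series)
    levels = [str(level) for level in cat.categories]
    if n < 1 or 2 * n >= len(levels):
        raise ValueError("n must leave at least one middle level")

    # Count occurrences per level; code -1 marks missing values.
    codes = cat.codes
    counts = np.bincount(codes[codes >= 0], minlength=len(levels))

    rows = [
        (", ".join(levels[:n]), counts[:n].sum()),
        (f"<<Middle ({len(levels) - 2 * n})>>", counts[n:-n].sum()),
        (", ".join(levels[-n:]), counts[-n:].sum()),
    ]
    if show_na:
        rows.append(("NA", (codes == -1).sum()))

    out = pd.DataFrame(rows, columns=["level", "n"])
    out["n"] = out["n"].astype(int)
    out["percent"] = out["n"] / out["n"].sum()
    return out
```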

R janitor Reference

Implementation Notes

  1. single_value should work with both lists and pandas Series
  2. get_one_to_one returns a list of lists (groups of 1:1 columns)
  3. round_half_up implementation from StackOverflow
  4. These can be standalone functions in janitor/functions/ or janitor/utils.py

Labels

  • enhancement
  • good first issue
  • help wanted
