Description
Summary
Implement several utility functions from R janitor that provide helpful data validation and manipulation capabilities.
Functions to Implement
1. single_value(x, missing=None, warn_if_all_missing=False, info=None)
Extract and validate that a vector/Series contains only a single unique value (excluding missing values).
import janitor
# Basic usage
janitor.single_value([1, 1, 1, None])
# 1
janitor.single_value([None, "a"], missing=[None, "a"])
# None (the first entry of `missing` is returned when all values are missing)
# Useful in groupby operations
df.groupby('A').agg(B=('B', janitor.single_value))
# With info for better error messages
df.groupby('A').apply(
lambda g: janitor.single_value(g['B'], info=f"group A={g['A'].iloc[0]}")
)
# Raises error if multiple values found:
janitor.single_value([1, 2, 3])
# ValueError: More than one (3) value found (1, 2, 3)
Use case: Validate that a column has a single value within each group, common in data cleaning.
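A minimal sketch of how single_value could behave, written against the standard library only so that it accepts lists and (by plain iteration) pandas Series. The helper `_is_na`, the reading of `missing=None` as "treat None/NaN as missing", and the exact warning text are assumptions, not part of the proposal:

```python
import math
import warnings


def _is_na(value):
    """Hypothetical helper: True for None and float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def single_value(x, missing=None, warn_if_all_missing=False, info=None):
    """Return the single unique non-missing value in x, else raise ValueError."""
    # Assumption: missing=None means "only None/NaN count as missing".
    missing_values = [None] if missing is None else list(missing)

    def is_missing(value):
        # NA-like entries of `missing` match any NA; others match by equality.
        return any(
            _is_na(value) if _is_na(m) else value == m for m in missing_values
        )

    # Collect unique non-missing values in order of first appearance.
    found = []
    for value in x:
        if not is_missing(value) and value not in found:
            found.append(value)

    suffix = f" ({info})" if info else ""
    if not found:
        if warn_if_all_missing:
            warnings.warn(f"All values are missing{suffix}")
        # Everything was missing: fall back to the first entry of `missing`.
        return missing_values[0]
    if len(found) > 1:
        shown = ", ".join(map(str, found))
        raise ValueError(
            f"More than one ({len(found)}) value found ({shown}){suffix}"
        )
    return found[0]
```

Because the function only iterates over its input, it can be passed directly as an aggregation callable in a groupby, as in the examples above.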
2. get_one_to_one(df)
Find columns that have a 1:1 mapping to each other (useful for identifying redundant columns or validating relationships).
df = pd.DataFrame({
'Lab_Test_Long': ['Cholesterol, LDL', 'Cholesterol, LDL', 'Glucose'],
'Lab_Test_Short': ['CLDL', 'CLDL', 'GLUC'],
'LOINC': [12345, 12345, 54321],
'Person': ['Sam', 'Bill', 'Sam']
})
janitor.get_one_to_one(df)
# [['Lab_Test_Long', 'Lab_Test_Short', 'LOINC']]
# These three columns all map 1:1 to each other
Use case: Data validation, identifying lookup table candidates, detecting redundant columns.
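One way to detect 1:1 columns is to recode every column by the order in which its values first appear; two columns map 1:1 exactly when their recoded sequences are identical. The sketch below assumes only that the input exposes `.items()` yielding (name, values) pairs, which holds for a dict of lists and for a pandas DataFrame. Missing values get no special treatment here, which is a simplification:

```python
def get_one_to_one(data):
    """Group column names whose values map 1:1 onto each other.

    `data` is anything with .items() yielding (column name, iterable of values),
    e.g. a dict of lists or a pandas DataFrame.
    """
    codes_to_columns = {}
    for name, column in data.items():
        # Recode values as 0, 1, 2, ... in order of first appearance.
        first_seen = {}
        codes = tuple(first_seen.setdefault(v, len(first_seen)) for v in column)
        # Columns with identical code sequences are 1:1 with each other.
        codes_to_columns.setdefault(codes, []).append(name)
    # Only report groups that contain at least two columns.
    return [columns for columns in codes_to_columns.values() if len(columns) > 1]
```

Grouping by the recoded tuple avoids comparing every pair of columns, so the cost is one pass per column.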
3. round_half_up(x, digits=0)
Round numeric values with "half up" rounding (like Excel), where ties round away from zero: 0.5 rounds to 1 and -0.5 rounds to -1.
import janitor
janitor.round_half_up(12.5) # 13 (not 12 like Python's default)
janitor.round_half_up(1.125, digits=2) # 1.13
janitor.round_half_up(-0.5) # -1 (rounds away from zero)
# Compare to Python default:
round(12.5) # 12 (banker's rounding - round half to even)
4. signif_half_up(x, digits=6)
Round to significant digits with "half up" rounding.
janitor.signif_half_up(12.5, digits=2) # 13
janitor.signif_half_up(1.125, digits=3) # 1.13
janitor.signif_half_up(-2.5, digits=1) # -3
5. paste_skip_na(*args, sep=" ", collapse=None)
Concatenate strings while skipping NA values (like paste() in R, except that NA values are dropped instead of being pasted as the string "NA").
janitor.paste_skip_na("A", None)
# "A"
janitor.paste_skip_na("A", None, ["B", None], sep=",")
# ["A,B", "A"]
janitor.paste_skip_na(None, None, None)
# None (preserves NA when all values are NA)
Use case: Building composite strings from multiple columns where some may be missing.
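The behaviour described above could be sketched as follows. This is an illustration under stated assumptions: only lists and tuples are treated as vectors (a pandas Series would need converting first), length-1 arguments are recycled, and `collapse` joins the non-NA results into one string. The helper `_is_na` is hypothetical:

```python
import math


def _is_na(value):
    """Hypothetical helper: True for None and float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def paste_skip_na(*args, sep=" ", collapse=None):
    """Join arguments element-wise with `sep`, dropping NA values."""
    # Wrap scalars so every argument can be indexed like a vector.
    vectors = [list(a) if isinstance(a, (list, tuple)) else [a] for a in args]
    length = max((len(v) for v in vectors), default=0)
    if any(len(v) not in (1, length) for v in vectors):
        raise ValueError("arguments must have length 1 or a common length")

    rows = []
    for i in range(length):
        # Length-1 vectors are recycled via the modulo index.
        parts = [str(v[i % len(v)]) for v in vectors if not _is_na(v[i % len(v)])]
        # A position where every value is NA stays NA (None).
        rows.append(sep.join(parts) if parts else None)

    if collapse is not None:
        kept = [row for row in rows if row is not None]
        return collapse.join(kept) if kept else None
    if any(isinstance(a, (list, tuple)) for a in args):
        return rows
    return rows[0] if rows else None
```

Returning a scalar when all inputs are scalars and a list otherwise mirrors the three examples above.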
6. top_levels(series, n=2, show_na=False)
Generate a frequency table of a categorical/factor variable grouped into top-n, bottom-n, and middle levels.
s = pd.Categorical(['A', 'B', 'C', 'D', 'E', 'A', 'B', 'C'])
janitor.top_levels(s, n=2)
#             level  n  percent
# 0            A, B  4     0.50
# 1  <<Middle (1)>>  2     0.25
# 2            D, E  2     0.25
Use case: Summarizing high-cardinality categorical variables.
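A dependency-free sketch of the grouping logic, returning (level, n, percent) tuples rather than a DataFrame. Several choices here are assumptions: levels are ordered by sorting the observed values (as `pd.Categorical` does by default for unordered input), the middle row is omitted when it would be empty, the middle label follows the format shown above, and `show_na=True` appends an "NA" row that is included in the percentages:

```python
from collections import Counter


def _is_na(value):
    """Hypothetical helper: None, or NaN (the only value unequal to itself)."""
    return value is None or value != value


def top_levels(values, n=2, show_na=False):
    """Count values grouped into the top-n, middle, and bottom-n levels."""
    values = list(values)
    present = [v for v in values if not _is_na(v)]
    levels = sorted(set(present))  # assumed level order
    if n < 1 or 2 * n > len(levels):
        raise ValueError("n must be at least 1 and at most half the number of levels")

    counts = Counter(present)
    groups = [(", ".join(map(str, levels[:n])), levels[:n])]
    middle = levels[n:len(levels) - n]
    if middle:
        groups.append((f"<<Middle ({len(middle)})>>", middle))
    groups.append((", ".join(map(str, levels[-n:])), levels[-n:]))

    # Sum the per-level counts within each group.
    table = [(label, sum(counts[lv] for lv in members)) for label, members in groups]
    if show_na:
        table.append(("NA", len(values) - len(present)))
    total = sum(count for _, count in table)
    return [(label, count, count / total) for label, count in table]
```

The rows can be turned into the tabular form shown above with `pd.DataFrame(rows, columns=["level", "n", "percent"])`.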
R janitor Reference
Implementation Notes
- single_value should work with both lists and pandas Series
- get_one_to_one returns a list of lists (groups of 1:1 columns)
- round_half_up implementation from StackOverflow
- These can be standalone functions in janitor/functions/ or janitor/utils.py
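For the two rounding helpers, one possible approach uses the standard-library decimal module, whose ROUND_HALF_UP mode rounds ties away from zero. This is a sketch for finite scalars only and is not necessarily the StackOverflow implementation mentioned above. Converting through `str(x)` is a deliberate assumption, so that a float such as 1.005 is rounded as the decimal it prints as:

```python
from decimal import ROUND_HALF_UP, Decimal


def round_half_up(x, digits=0):
    """Round x to `digits` decimal places, with ties going away from zero."""
    quantum = Decimal(1).scaleb(-digits)  # e.g. digits=2 -> 0.01
    return float(Decimal(str(x)).quantize(quantum, rounding=ROUND_HALF_UP))


def signif_half_up(x, digits=6):
    """Round x to `digits` significant digits, with ties going away from zero."""
    value = Decimal(str(x))
    if value == 0:
        return 0.0
    # adjusted() is the exponent of the most significant digit.
    exponent = value.adjusted() - digits + 1
    quantum = Decimal(1).scaleb(exponent)
    return float(value.quantize(quantum, rounding=ROUND_HALF_UP))
```

Both functions return floats; a vectorised version for Series or arrays would need a different strategy, since Decimal works on one scalar at a time.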
Labels
- enhancement
- good first issue
- help wanted