-
Notifications
You must be signed in to change notification settings - Fork 180
Open
Description
Summary
Implement convert_to_date() and convert_to_datetime() from R janitor to handle mixed-format date parsing, especially for data exported from Microsoft Excel.
Background
Real-world data often contains dates in mixed formats - some as Excel numeric dates, some as date strings, and some as proper datetime objects. R janitor provides utilities that intelligently parse these mixed formats into consistent date/datetime objects.
Functions to Implement
1. convert_to_date(x, character_fun=None, string_conversion_failure="error")
Parse mixed date formats into Python date objects.
import janitor
# Mixed format input
dates = ["2009-07-06", "40000", "40000.1", None]
janitor.convert_to_date(dates)
# [datetime.date(2009, 7, 6), datetime.date(2009, 7, 6), datetime.date(2009, 7, 6), None]
# Also works on DataFrames
df['date_col'] = df['date_col'].apply(janitor.convert_to_date)
# Or as a method:
df.transform_column('date_col', janitor.convert_to_date)2. convert_to_datetime(x, tz="UTC", character_fun=None, string_conversion_failure="error")
Parse mixed datetime formats into Python datetime objects.
datetimes = ["2009-07-06", "40000.1", "40000", None]
janitor.convert_to_datetime(datetimes, tz="UTC")
# [datetime(2009, 7, 6, 0, 0), datetime(2009, 7, 6, 2, 24), datetime(2009, 7, 6, 0, 0), None]3. sas_numeric_to_date(date_num=None, datetime_num=None, time_num=None, tz="UTC")
Convert SAS date/datetime/time numbers to Python objects.
janitor.sas_numeric_to_date(date_num=15639)
# datetime.date(2002, 10, 26)
janitor.sas_numeric_to_date(datetime_num=1217083532, tz="UTC")
# datetime(1998, 7, 26, 14, 45, 32, tzinfo=UTC)
janitor.sas_numeric_to_date(time_num=3600)
# datetime.time(1, 0, 0)4. excel_time_to_numeric(time_value, round_seconds=True)
Convert Excel time representations to numeric seconds.
janitor.excel_time_to_numeric("0.5") # Noon as Excel fraction
# 43200
janitor.excel_time_to_numeric("12:30 PM")
# 45000
janitor.excel_time_to_numeric("14:30:15")
# 52215R janitor Reference
Parsing Logic
The functions should detect and handle:
- Numeric values (0-1 range): Excel time fractions
- 5-digit integers/floats: Excel date numbers (days since 1899-12-30 or 1904-01-01)
- Date strings: Parse using pandas or dateutil
- Time strings: "12:30 PM", "14:30:15", etc.
- Mixed formats: Intelligently route each value to the appropriate parser
Relationship to Existing Functions
pyjanitor already has:
convert_excel_date()- Handles Excel numeric datesconvert_matlab_date()- Handles MATLAB datesconvert_unix_date()- Handles Unix timestamps (deprecated)
The new functions would:
- Build on
convert_excel_date()for numeric detection - Add string parsing capabilities
- Handle mixed-format columns automatically
- Add SAS date support
Implementation Notes
- Use pandas' built-in date parsing where possible
- Detect Excel numeric dates with regex:
^[0-9]{5}(?:\.[0-9]*)?$ - The
character_funparameter allows custom string parsing functions string_conversion_failurecontrols error vs warning on parse failures- Consider vectorized implementations for performance
Labels
- enhancement
- help wanted
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
No labels