Skip to content

Feature: Implement convert_to_date() and convert_to_datetime() for mixed format parsing #1559

@ericmjl

Description

@ericmjl

Summary

Implement convert_to_date() and convert_to_datetime() from R janitor to handle mixed-format date parsing, especially for data exported from Microsoft Excel.

Background

Real-world data often contains dates in mixed formats - some as Excel numeric dates, some as date strings, and some as proper datetime objects. R janitor provides utilities that intelligently parse these mixed formats into consistent date/datetime objects.

Functions to Implement

1. convert_to_date(x, character_fun=None, string_conversion_failure="error")

Parse mixed date formats into Python date objects.

import janitor

# Mixed format input
dates = ["2009-07-06", "40000", "40000.1", None]
janitor.convert_to_date(dates)
# [datetime.date(2009, 7, 6), datetime.date(2009, 7, 6), datetime.date(2009, 7, 6), None]

# Also works on DataFrames
df['date_col'] = df['date_col'].apply(janitor.convert_to_date)
# Or as a method:
df.transform_column('date_col', janitor.convert_to_date)

2. convert_to_datetime(x, tz="UTC", character_fun=None, string_conversion_failure="error")

Parse mixed datetime formats into Python datetime objects.

datetimes = ["2009-07-06", "40000.1", "40000", None]
janitor.convert_to_datetime(datetimes, tz="UTC")
# [datetime(2009, 7, 6, 0, 0), datetime(2009, 7, 6, 2, 24), datetime(2009, 7, 6, 0, 0), None]

3. sas_numeric_to_date(date_num=None, datetime_num=None, time_num=None, tz="UTC")

Convert SAS date/datetime/time numbers to Python objects.

janitor.sas_numeric_to_date(date_num=15639)
# datetime.date(2002, 10, 26)

janitor.sas_numeric_to_date(datetime_num=1217083532, tz="UTC")
# datetime(1998, 7, 26, 14, 45, 32, tzinfo=UTC)

janitor.sas_numeric_to_date(time_num=3600)
# datetime.time(1, 0, 0)

4. excel_time_to_numeric(time_value, round_seconds=True)

Convert Excel time representations to numeric seconds.

janitor.excel_time_to_numeric("0.5")  # Noon as Excel fraction
# 43200

janitor.excel_time_to_numeric("12:30 PM")
# 45000

janitor.excel_time_to_numeric("14:30:15")
# 52215

R janitor Reference

Parsing Logic

The functions should detect and handle:

  1. Numeric values (0-1 range): Excel time fractions
  2. 5-digit integers/floats: Excel date numbers (days since 1899-12-30 or 1904-01-01)
  3. Date strings: Parse using pandas or dateutil
  4. Time strings: "12:30 PM", "14:30:15", etc.
  5. Mixed formats: Intelligently route each value to the appropriate parser

Relationship to Existing Functions

pyjanitor already has:

  • convert_excel_date() - Handles Excel numeric dates
  • convert_matlab_date() - Handles MATLAB dates
  • convert_unix_date() - Handles Unix timestamps (deprecated)

The new functions would:

  • Build on convert_excel_date() for numeric detection
  • Add string parsing capabilities
  • Handle mixed-format columns automatically
  • Add SAS date support

Implementation Notes

  1. Use pandas' built-in date parsing where possible
  2. Detect Excel numeric dates with regex: ^[0-9]{5}(?:\.[0-9]*)?$
  3. The character_fun parameter allows custom string parsing functions
  4. string_conversion_failure controls error vs warning on parse failures
  5. Consider vectorized implementations for performance

Labels

  • enhancement
  • help wanted

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions