Skip to content

feat: percentile and median ops#6153

Open
aaron-ang wants to merge 2 commits intoEventual-Inc:mainfrom
aaron-ang:median
Open

feat: percentile and median ops#6153
aaron-ang wants to merge 2 commits intoEventual-Inc:mainfrom
aaron-ang:median

Conversation

@aaron-ang
Copy link
Contributor

@aaron-ang aaron-ang commented Feb 10, 2026

Changes Made

  • median = 0.5 percentile.

Related Issues

Closes #3491.

@github-actions github-actions bot added the feat label Feb 10, 2026
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Feb 10, 2026

Greptile Overview

Greptile Summary

This PR implements exact percentile and median aggregation operations for Daft, addressing issue #3491.

Implementation Overview:

  • Added percentile(percentage) and median() aggregation functions that compute exact percentiles using linear interpolation
  • Median is implemented as a special case of percentile at 0.5
  • Returns Float64 for all numeric input types
  • Properly handles null values, empty inputs, and grouped aggregations

Key Changes:

  • Core implementation (src/daft-core/src/array/ops/percentile.rs, src/daft-core/src/utils/stats.rs): Implements exact percentile calculation using sort + linear interpolation between adjacent values
  • Multi-stage aggregation (src/daft-local-plan/src/agg.rs): Uses list collection in first stage, percentile computation in second stage for distributed execution
  • Python API (daft/expressions/expressions.py, daft/functions/agg.py): Adds Expression.percentile(), Expression.median(), daft.functions.percentile(), and daft.functions.median() following the project's API consistency rule
  • SQL support (src/daft-sql/src/modules/aggs.rs): Adds PERCENTILE(column, percentage) and MEDIAN(column) SQL functions with validation that percentage is a float literal between 0.0 and 1.0
  • Comprehensive tests (tests/dataframe/test_percentile.py): Covers global/grouped aggregation, null handling, empty inputs, and SQL validation

Previous Thread Issues:

  • ✅ Both daft.functions.percentile and daft.functions.median now exist and are properly exported
  • ✅ Median returns scalar Float64, not a list (verified in implementation)

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The implementation is well-structured, comprehensive, and follows established patterns in the codebase. It includes proper validation, null handling, comprehensive test coverage, and addresses both previous thread concerns. The percentile algorithm uses standard linear interpolation, type inference is correct (Float64), and the multi-stage aggregation strategy mirrors existing aggregations like mean and stddev.
  • No files require special attention

Important Files Changed

Filename Overview
src/daft-core/src/array/ops/percentile.rs New file implementing exact percentile calculation for Float64Array and ListArray with proper null handling
src/daft-core/src/utils/stats.rs Adds exact_percentile function using linear interpolation and percentile validation helper
src/daft-dsl/src/expr/mod.rs Adds Percentile and Median aggregation expressions with proper type inference to Float64
daft/expressions/expressions.py Adds percentile() and median() methods to Expression class with proper documentation
daft/functions/agg.py Adds percentile() and median() aggregation functions matching the Expression API
src/daft-local-plan/src/agg.rs Implements multi-stage aggregation strategy for percentile/median using list collection
src/daft-sql/src/modules/aggs.rs Adds SQL support for PERCENTILE and MEDIAN functions with validation and documentation
tests/dataframe/test_percentile.py Comprehensive tests covering global/grouped aggregation, null handling, and SQL validation

Sequence Diagram

sequenceDiagram
    participant User
    participant Python API
    participant DSL
    participant LocalPlan
    participant RecordBatch
    participant Series
    participant Array
    participant Stats

    User->>Python API: df.agg(col("x").percentile(0.5))
    Python API->>DSL: percentile(percentage)
    DSL->>DSL: AggExpr::Percentile(expr, percentage)
    DSL->>LocalPlan: populate_aggregation_stages
    LocalPlan->>LocalPlan: Stage 1: List(expr) - collect values
    LocalPlan->>LocalPlan: Stage 2: Percentile(list_col, percentage)
    LocalPlan->>RecordBatch: eval_expression(AggExpr::Percentile)
    RecordBatch->>Series: percentile(groups, percentage)
    Series->>Series: cast to Float64
    Series->>Array: Float64Array.percentile() or ListArray.percentile()
    Array->>Stats: exact_percentile(values, percentage)
    Stats->>Stats: filter nulls, sort, linear interpolation
    Stats-->>Array: Option<f64>
    Array-->>Series: Float64Array
    Series-->>RecordBatch: Series
    RecordBatch-->>LocalPlan: Result
    LocalPlan-->>Python API: Result
    Python API-->>User: DataFrame with percentile
Loading

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

8 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

@codecov
Copy link

codecov bot commented Feb 10, 2026

Codecov Report

❌ Patch coverage is 67.50000% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.41%. Comparing base (10fe495) to head (6ad8a32).
⚠️ Report is 9 commits behind head on main.

Files with missing lines Patch % Lines
src/daft-dsl/src/expr/mod.rs 66.66% 6 Missing ⚠️
src/daft-recordbatch/src/lib.rs 0.00% 4 Missing ⚠️
src/daft-logical-plan/src/ops/project.rs 0.00% 3 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff             @@
##             main    #6153       +/-   ##
===========================================
+ Coverage   43.43%   73.41%   +29.98%     
===========================================
  Files         924      993       +69     
  Lines      114984   129196    +14212     
===========================================
+ Hits        49938    94844    +44906     
+ Misses      65046    34352    -30694     
Files with missing lines Coverage Δ
daft/expressions/expressions.py 95.36% <100.00%> (+0.01%) ⬆️
src/daft-dsl/src/expr/agg.rs 76.92% <100.00%> (+76.92%) ⬆️
src/daft-local-plan/src/agg.rs 94.24% <100.00%> (+94.24%) ⬆️
src/daft-sql/src/modules/aggs.rs 70.76% <100.00%> (+31.88%) ⬆️
src/daft-logical-plan/src/ops/project.rs 56.62% <0.00%> (+10.56%) ⬆️
src/daft-recordbatch/src/lib.rs 74.46% <0.00%> (+52.99%) ⬆️
src/daft-dsl/src/expr/mod.rs 75.06% <66.66%> (+37.82%) ⬆️

... and 680 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Collaborator

@desmondcheongzx desmondcheongzx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @aaron-ang thank you for the PR, it's been awesome to see all your recent contributions!

I wanted us to consider a different approach for this PR. Most data engines use an exact median for median() rather than an approximate one. For reference, Spark's median() delegates to exact Percentile(child, 0.5): (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/percentiles.scala)

If we wanted to add syntactic sugar for approx_percentile(0.5) as approx_median() that's fine, though I'd argue that's not as useful or common a sugar as median() is to percentile(0.5) so I'd prefer if we didn't.

I think the more impactful feature would be an exact median() backed by percentile(0.5). Would you be interested in taking that on instead? We would need exact percentile support first.

Tip: See Also
[`daft.functions.median`](https://docs.daft.ai/en/stable/api/functions/median/)
"""
return self.approx_percentiles(0.5)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we should wire the Python side median() through AggExpr::Median instead of shortcutting to approx_percentiles(0.5), so both SQL and Python will always share the same implementation of Median.

@aaron-ang
Copy link
Contributor Author

@desmondcheongzx ok, I will implement exact percentile.

@aaron-ang aaron-ang changed the title feat: median in daft sql and Expression feat: percentile and median ops Feb 12, 2026
@aaron-ang
Copy link
Contributor Author

@greptile-apps re-review.

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

22 files reviewed, no comments

Edit Code Review Agent Settings | Greptile

@aaron-ang aaron-ang force-pushed the median branch 2 times, most recently from 488b089 to 0573570 Compare February 19, 2026 03:11
@aaron-ang aaron-ang requested a review from a team as a code owner February 26, 2026 06:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add median function

2 participants