feat: percentile and median ops by aaron-ang · Pull Request #6153 · Eventual-Inc/Daft

aaron-ang · 2026-02-10T00:57:31Z

Changes Made

median = 0.5 percentile.

Related Issues

Closes #3491.

greptile-apps · 2026-02-10T01:01:56Z

Greptile Overview

Greptile Summary

This PR implements exact percentile and median aggregation operations for Daft, addressing issue #3491.

Implementation Overview:

Added percentile(percentage) and median() aggregation functions that compute exact percentiles using linear interpolation
Median is implemented as a special case of percentile at 0.5
Returns Float64 for all numeric input types
Properly handles null values, empty inputs, and grouped aggregations

Key Changes:

Core implementation (src/daft-core/src/array/ops/percentile.rs, src/daft-core/src/utils/stats.rs): Implements exact percentile calculation using sort + linear interpolation between adjacent values
Multi-stage aggregation (src/daft-local-plan/src/agg.rs): Uses list collection in first stage, percentile computation in second stage for distributed execution
Python API (daft/expressions/expressions.py, daft/functions/agg.py): Adds Expression.percentile(), Expression.median(), daft.functions.percentile(), and daft.functions.median() following the project's API consistency rule
SQL support (src/daft-sql/src/modules/aggs.rs): Adds PERCENTILE(column, percentage) and MEDIAN(column) SQL functions with validation that percentage is a float literal between 0.0 and 1.0
Comprehensive tests (tests/dataframe/test_percentile.py): Covers global/grouped aggregation, null handling, empty inputs, and SQL validation

Previous Thread Issues:

✅ Both daft.functions.percentile and daft.functions.median now exist and are properly exported
✅ Median returns scalar Float64, not a list (verified in implementation)

Confidence Score: 5/5

This PR is safe to merge with minimal risk
The implementation is well-structured, comprehensive, and follows established patterns in the codebase. It includes proper validation, null handling, comprehensive test coverage, and addresses both previous thread concerns. The percentile algorithm uses standard linear interpolation, type inference is correct (Float64), and the multi-stage aggregation strategy mirrors existing aggregations like mean and stddev.
No files require special attention

Important Files Changed

Filename	Overview
src/daft-core/src/array/ops/percentile.rs	New file implementing exact percentile calculation for Float64Array and ListArray with proper null handling
src/daft-core/src/utils/stats.rs	Adds exact_percentile function using linear interpolation and percentile validation helper
src/daft-dsl/src/expr/mod.rs	Adds Percentile and Median aggregation expressions with proper type inference to Float64
daft/expressions/expressions.py	Adds percentile() and median() methods to Expression class with proper documentation
daft/functions/agg.py	Adds percentile() and median() aggregation functions matching the Expression API
src/daft-local-plan/src/agg.rs	Implements multi-stage aggregation strategy for percentile/median using list collection
src/daft-sql/src/modules/aggs.rs	Adds SQL support for PERCENTILE and MEDIAN functions with validation and documentation
tests/dataframe/test_percentile.py	Comprehensive tests covering global/grouped aggregation, null handling, and SQL validation

Sequence Diagram

sequenceDiagram
    participant User
    participant Python API
    participant DSL
    participant LocalPlan
    participant RecordBatch
    participant Series
    participant Array
    participant Stats

    User->>Python API: df.agg(col("x").percentile(0.5))
    Python API->>DSL: percentile(percentage)
    DSL->>DSL: AggExpr::Percentile(expr, percentage)
    DSL->>LocalPlan: populate_aggregation_stages
    LocalPlan->>LocalPlan: Stage 1: List(expr) - collect values
    LocalPlan->>LocalPlan: Stage 2: Percentile(list_col, percentage)
    LocalPlan->>RecordBatch: eval_expression(AggExpr::Percentile)
    RecordBatch->>Series: percentile(groups, percentage)
    Series->>Series: cast to Float64
    Series->>Array: Float64Array.percentile() or ListArray.percentile()
    Array->>Stats: exact_percentile(values, percentage)
    Stats->>Stats: filter nulls, sort, linear interpolation
    Stats-->>Array: Option<f64>
    Array-->>Series: Float64Array
    Series-->>RecordBatch: Series
    RecordBatch-->>LocalPlan: Result
    LocalPlan-->>Python API: Result
    Python API-->>User: DataFrame with percentile

greptile-apps

_{8 files reviewed, 2 comments}

_{Edit Code Review Agent Settings | Greptile}

daft/expressions/expressions.py

src/daft-local-plan/src/agg.rs

codecov · 2026-02-10T01:28:49Z

Codecov Report

❌ Patch coverage is 67.50000% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.41%. Comparing base (10fe495) to head (6ad8a32).
⚠️ Report is 9 commits behind head on main.

Files with missing lines	Patch %	Lines
src/daft-dsl/src/expr/mod.rs	66.66%	6 Missing ⚠️
src/daft-recordbatch/src/lib.rs	0.00%	4 Missing ⚠️
src/daft-logical-plan/src/ops/project.rs	0.00%	3 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##             main    #6153       +/-   ##
===========================================
+ Coverage   43.43%   73.41%   +29.98%     
===========================================
  Files         924      993       +69     
  Lines      114984   129196    +14212     
===========================================
+ Hits        49938    94844    +44906     
+ Misses      65046    34352    -30694

Files with missing lines	Coverage Δ
daft/expressions/expressions.py	`95.36% <100.00%> (+0.01%)`	⬆️
src/daft-dsl/src/expr/agg.rs	`76.92% <100.00%> (+76.92%)`	⬆️
src/daft-local-plan/src/agg.rs	`94.24% <100.00%> (+94.24%)`	⬆️
src/daft-sql/src/modules/aggs.rs	`70.76% <100.00%> (+31.88%)`	⬆️
src/daft-logical-plan/src/ops/project.rs	`56.62% <0.00%> (+10.56%)`	⬆️
src/daft-recordbatch/src/lib.rs	`74.46% <0.00%> (+52.99%)`	⬆️
src/daft-dsl/src/expr/mod.rs	`75.06% <66.66%> (+37.82%)`	⬆️

... and 680 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

desmondcheongzx

Hey @aaron-ang thank you for the PR, it's been awesome to see all your recent contributions!

I wanted us to consider a different approach for this PR. Most data engines use an exact median for median() rather than an approximate one. For reference, Spark's median() delegates to exact Percentile(child, 0.5): (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/percentiles.scala)

If we wanted to add syntactic sugar for approx_percentile(0.5) as approx_median() that's fine, though I'd argue that's not as useful or common a sugar as median() is to percentile(0.5) so I'd prefer if we didn't.

I think the more impactful feature would be an exact median() backed by percentile(0.5). Would you be interested in taking that on instead? We would need exact percentile support first.

desmondcheongzx · 2026-02-10T01:45:10Z

daft/expressions/expressions.py

+        Tip: See Also
+            [`daft.functions.median`](https://docs.daft.ai/en/stable/api/functions/median/)
+        """
+        return self.approx_percentiles(0.5)


I believe we should wire the Python side median() through AggExpr::Median instead of shortcutting to approx_percentiles(0.5), so both SQL and Python will always share the same implementation of Median.

aaron-ang · 2026-02-10T03:02:44Z

@desmondcheongzx ok, I will implement exact percentile.

aaron-ang · 2026-02-12T15:59:51Z

@greptile-apps re-review.

greptile-apps

_{22 files reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

github-actions bot added the feat label Feb 10, 2026

greptile-apps bot reviewed Feb 10, 2026

View reviewed changes

daft/expressions/expressions.py Outdated Show resolved Hide resolved

src/daft-local-plan/src/agg.rs Show resolved Hide resolved

desmondcheongzx reviewed Feb 10, 2026

View reviewed changes

aaron-ang changed the title ~~feat: median in daft sql and Expression~~ feat: percentile and median ops Feb 12, 2026

aaron-ang force-pushed the median branch from 6ad8a32 to ebafe5f Compare February 12, 2026 15:59

greptile-apps bot reviewed Feb 12, 2026

View reviewed changes

aaron-ang requested a review from desmondcheongzx February 12, 2026 16:26

aaron-ang force-pushed the median branch 2 times, most recently from 488b089 to 0573570 Compare February 19, 2026 03:11

aaron-ang force-pushed the median branch from 0573570 to 180d6b8 Compare February 26, 2026 06:43

aaron-ang requested a review from a team as a code owner February 26, 2026 06:43

aaron-ang added 2 commits February 26, 2026 21:52

feat: percentile and median ops

0258a07

fix: use first stage schema when shuffling dataset

0883101

aaron-ang force-pushed the median branch from 180d6b8 to 0883101 Compare February 27, 2026 02:52

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: percentile and median ops#6153

feat: percentile and median ops#6153
aaron-ang wants to merge 2 commits intoEventual-Inc:mainfrom
aaron-ang:median

aaron-ang commented Feb 10, 2026 •

edited

Loading

Uh oh!

greptile-apps bot commented Feb 10, 2026 •

edited

Loading

Uh oh!

greptile-apps bot left a comment

Uh oh!

Uh oh!

Uh oh!

codecov bot commented Feb 10, 2026

Uh oh!

desmondcheongzx left a comment

Uh oh!

desmondcheongzx Feb 10, 2026

Uh oh!

aaron-ang commented Feb 10, 2026

Uh oh!

aaron-ang commented Feb 12, 2026

Uh oh!

greptile-apps bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

aaron-ang commented Feb 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changes Made

Related Issues

Uh oh!

greptile-apps bot commented Feb 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Overview

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

codecov bot commented Feb 10, 2026

Codecov Report

Uh oh!

desmondcheongzx left a comment

Choose a reason for hiding this comment

Uh oh!

desmondcheongzx Feb 10, 2026

Choose a reason for hiding this comment

Uh oh!

aaron-ang commented Feb 10, 2026

Uh oh!

aaron-ang commented Feb 12, 2026

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

aaron-ang commented Feb 10, 2026 •

edited

Loading

greptile-apps bot commented Feb 10, 2026 •

edited

Loading