feat: percentile and median ops#6153
Conversation
Greptile OverviewGreptile SummaryThis PR implements exact percentile and median aggregation operations for Daft, addressing issue #3491. Implementation Overview:
Key Changes:
Previous Thread Issues:
Confidence Score: 5/5
Important Files Changed
Sequence DiagramsequenceDiagram
participant User
participant Python API
participant DSL
participant LocalPlan
participant RecordBatch
participant Series
participant Array
participant Stats
User->>Python API: df.agg(col("x").percentile(0.5))
Python API->>DSL: percentile(percentage)
DSL->>DSL: AggExpr::Percentile(expr, percentage)
DSL->>LocalPlan: populate_aggregation_stages
LocalPlan->>LocalPlan: Stage 1: List(expr) - collect values
LocalPlan->>LocalPlan: Stage 2: Percentile(list_col, percentage)
LocalPlan->>RecordBatch: eval_expression(AggExpr::Percentile)
RecordBatch->>Series: percentile(groups, percentage)
Series->>Series: cast to Float64
Series->>Array: Float64Array.percentile() or ListArray.percentile()
Array->>Stats: exact_percentile(values, percentage)
Stats->>Stats: filter nulls, sort, linear interpolation
Stats-->>Array: Option<f64>
Array-->>Series: Float64Array
Series-->>RecordBatch: Series
RecordBatch-->>LocalPlan: Result
LocalPlan-->>Python API: Result
Python API-->>User: DataFrame with percentile
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #6153 +/- ##
===========================================
+ Coverage 43.43% 73.41% +29.98%
===========================================
Files 924 993 +69
Lines 114984 129196 +14212
===========================================
+ Hits 49938 94844 +44906
+ Misses 65046 34352 -30694
🚀 New features to boost your workflow:
|
desmondcheongzx
left a comment
There was a problem hiding this comment.
Hey @aaron-ang thank you for the PR, it's been awesome to see all your recent contributions!
I wanted us to consider a different approach for this PR. Most data engines use an exact median for median() rather than an approximate one. For reference, Spark's median() delegates to exact Percentile(child, 0.5): (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/percentiles.scala)
If we wanted to add syntactic sugar for approx_percentile(0.5) as approx_median() that's fine, though I'd argue that's not as useful or common a sugar as median() is to percentile(0.5) so I'd prefer if we didn't.
I think the more impactful feature would be an exact median() backed by percentile(0.5). Would you be interested in taking that on instead? We would need exact percentile support first.
daft/expressions/expressions.py
Outdated
| Tip: See Also | ||
| [`daft.functions.median`](https://docs.daft.ai/en/stable/api/functions/median/) | ||
| """ | ||
| return self.approx_percentiles(0.5) |
There was a problem hiding this comment.
I believe we should wire the Python side median() through AggExpr::Median instead of shortcutting to approx_percentiles(0.5), so both SQL and Python will always share the same implementation of Median.
|
@desmondcheongzx ok, I will implement exact percentile. |
|
@greptile-apps re-review. |
488b089 to
0573570
Compare
Changes Made
Related Issues
Closes #3491.