Query Information
PPL Command/Query:
source=bounty-numbers | rename value as val | dedup category | fields category, val
Expected Result:
val should retain original value data after rename + dedup:
{
"datarows": [["A", 10.5], ["B", 100.0], ["C", 50.0]]
}
Actual Result:
All val values are null. The dedup correctly filters by category but nullifies the renamed field:
{
"datarows": [["A", null], ["B", null], ["C", null]]
}
Dataset Information
Dataset/Schema Type
Index Mapping
{
"mappings": {
"properties": {
"count": {"type": "integer"},
"category": {"type": "keyword"},
"subcategory": {"type": "keyword"},
"value": {"type": "double"},
"ts": {"type": "date"}
}
}
}
Sample Data
{"count":1,"category":"A","subcategory":"X","value":10.5,"ts":"2024-01-01"}
{"count":2,"category":"A","subcategory":"Y","value":20.3,"ts":"2024-01-02"}
{"count":10,"category":"B","subcategory":"X","value":100.0,"ts":"2024-01-03"}
{"count":5,"category":"C","subcategory":"Z","value":50.0,"ts":"2024-01-04"}
Bug Description
Issue Summary:
When a field is renamed via the rename command and a subsequent dedup command operates on a different field, the renamed field returns null values. More broadly, ANY field whose project name differs from the original index field name (via rename or eval column reference) becomes null after dedup aggregation pushdown. The dedup correctly deduplicates by the specified field, but all non-dedup fields with mismatched names produce null.
This bug specifically affects the aggregation-based DedupPushdownRule path. The collapse-based pushdown path was fixed by PR #4670 (which resolved #4563 and #4664), but the aggregation path was not addressed.
Steps to Reproduce:
- Create the index:
curl -XPUT 'http://localhost:9200/bounty-numbers' -H 'Content-Type: application/json' -d '{
"mappings": {
"properties": {
"count": {"type": "integer"},
"category": {"type": "keyword"},
"subcategory": {"type": "keyword"},
"value": {"type": "double"},
"ts": {"type": "date"}
}
}
}'
- Insert sample data:
curl -XPOST 'http://localhost:9200/bounty-numbers/_bulk' -H 'Content-Type: application/x-ndjson' -d '
{"index":{}}
{"count":1,"category":"A","subcategory":"X","value":10.5,"ts":"2024-01-01"}
{"index":{}}
{"count":2,"category":"A","subcategory":"Y","value":20.3,"ts":"2024-01-02"}
{"index":{}}
{"count":10,"category":"B","subcategory":"X","value":100.0,"ts":"2024-01-03"}
{"index":{}}
{"count":5,"category":"C","subcategory":"Z","value":50.0,"ts":"2024-01-04"}
'
- Run the PPL query:
curl -XPOST 'http://localhost:9200/_plugins/_ppl' -H 'Content-Type: application/json' \
-d '{"query": "source=bounty-numbers | rename value as val | dedup category | fields category, val"}'
- Observe that
val is null in all rows while category is correct.
Differential Testing Results:
| Query |
Result |
rename value as val | fields category, val |
val has correct values (works) |
dedup category | fields category, value |
value has correct values (works) |
rename value as val | sort val | fields category, val |
val correct and sorted (works) |
rename value as val | dedup category | fields category, val |
val = null (BUG) |
rename category as cat | dedup cat | fields cat, value |
cat and value both correct (works) |
eval newval = value + 1 | dedup category | fields category, newval |
newval correct (works -- script expr) |
rename value as val, category as cat | dedup cat | fields cat, val |
cat correct, val = null (BUG) |
rename name as employee, salary as pay | dedup dept_id |
employee=null, pay=null (BUG -- multiple renames) |
eval pay2 = salary | rename salary as pay | dedup dept_id |
pay=null, pay2=null (BUG -- eval column ref also null) |
eval extra = 42 | rename name as employee | dedup dept_id |
employee=null, extra=42, salary=correct (eval constant survives) |
rename value as val | eval doubled = val * 2 | dedup category | fields category, val, doubled |
val=null, doubled=correct (eval of renamed field survives as RexCall) |
Pattern summary:
- Renamed fields -> null after dedup (unless dedup is on that renamed field itself)
- Eval'd column references (e.g.,
eval pay2 = salary) -> null after dedup when resolved as RexInputRef
- Eval'd constants (e.g.,
eval extra = 42) -> survive (added as ScriptField with correct name)
- Eval'd expressions (e.g.,
eval newval = value + 1) -> survive (added as ScriptField with correct name)
- Unrenamed original fields -> survive (project name == index name)
- The dedup key field itself -> always survives (becomes composite bucket key with correct name)
Impact:
This bug makes it impossible to use rename (or eval column references) with dedup on a different field in the same query pipeline when the aggregation pushdown path is taken. This is a common data transformation pattern.
Root Cause Analysis
The bug is in the DedupPushdownRule aggregation pushdown path. When dedup is pushed down as a composite aggregation with top_hits:
-
DedupPushdownRule.apply() (opensearch/.../planner/rules/DedupPushdownRule.java:51-127): Creates an aggregate with LITERAL_AGG and calls scan.pushDownAggregate(aggregate, targetChildProject). targetChildProject has the renamed field names (e.g., val).
-
AggregateAnalyzer.createTopHitsBuilder() (opensearch/.../request/AggregateAnalyzer.java:803-868): For the LITERAL_AGG case, iterates over all project args. For RexInputRef fields, calls helper.inferNamedField(rex.getKey()) which resolves using the original scan's rowType -- gets original index field name value. Correctly builds top_hits with fetchField("value") / _source.includes: ["category", "value"].
-
OpenSearch response returns top_hits with _source: {"category": "A", "value": 10.5} -- using original field names.
-
TopHitsParser.parse() (opensearch/.../response/agg/TopHitsParser.java:123-134): Extracts hit source as-is: Map<String, Object> with key "value" (original name).
-
OpenSearchResponse.handleAggregationResponse() (opensearch/.../response/OpenSearchResponse.java:251-268): Creates ExprTupleValue with key "value" from the response.
-
OpenSearchIndexEnumerator.current() (opensearch/.../storage/scan/OpenSearchIndexEnumerator.java:84-92): Uses getRowType().getFieldNames() which has the renamed names from copyWithNewSchema(dedup.getRowType()). Tries to resolve "val" from the tuple that only has "value" -> returns null.
The disconnect: The schema expects val (renamed name), but the response tuple contains value (original index name). No rename mapping is applied during response parsing.
Why renaming the dedup field itself works: The dedup field becomes the composite aggregation bucket key. The bucket key name comes from outputFields which uses the renamed name. So the bucket key is correctly named.
Why eval expressions survive but eval column refs don't: eval newval = value + 1 creates a RexCall (script expression), and createTopHitsBuilder() adds it as a ScriptField with name newval. The response fields section has the correct name. However, eval pay2 = salary creates a RexInputRef (just a column reference), which is resolved to fetchField("salary") -- the original index name. The enumerator then tries to resolve "pay2" from a tuple that only has "salary" -> null.
Explain API Output
{
"calcite": {
"logical": "LogicalSystemLimit(fetch=[10000], type=[QUERY_SIZE_LIMIT])\n LogicalProject(category=[$1], val=[$3])\n LogicalFilter(condition=[<=($11, 1)])\n LogicalProject(..., val=[$3], ..., _row_number_dedup_=[ROW_NUMBER() OVER (PARTITION BY $1)])\n LogicalFilter(condition=[IS NOT NULL($1)])\n LogicalProject(count=[$0], category=[$1], subcategory=[$2], val=[$3], ts=[$4], ...)\n CalciteLogicalIndexScan(table=[[OpenSearch, bounty-numbers]])\n",
"physical": "CalciteEnumerableIndexScan(table=[[OpenSearch, bounty-numbers]], PushDownContext=[[AGGREGATION->..., LIMIT->10000], OpenSearchRequestBuilder(sourceBuilder={...\"_source\":{\"includes\":[\"category\",\"value\"],\"excludes\":[]},\"script_fields\":{}}...)])"
}
}
Note: The logical plan correctly shows val=[$3] (renamed), but the physical plan's _source.includes uses ["category","value"] (original names). The response parser does not map value back to val.
Execute API Output
{
"schema": [
{"name": "category", "type": "string"},
{"name": "val", "type": "double"}
],
"datarows": [["A", null], ["B", null], ["C", null]],
"total": 3,
"size": 3
}
Environment Information
OpenSearch Version: OpenSearch + SQL Plugin (main branch, Calcite enabled)
Suggested Fix
The DedupPushdownRule needs to account for field renames when building the response mapping. Possible approaches:
-
Build a rename mapping in TopHitsParser: When LITERAL_AGG is used for dedup, pass the project's field name mapping (original -> renamed) to the parser. The TopHitsParser can then rename response keys before returning.
-
Add a post-processing project: After the aggregation scan returns results with original field names, insert a LogicalProject that renames fields to match the expected schema.
-
Use fetchField with alias support: Instead of builder.fetchField(originalName), map the response fields using the project's named expressions.
Approach 1 is the most targeted fix -- pass the Map<String, String> of {originalName -> renamedName} to TopHitsParser and apply it when constructing the result maps.
Related Issues
Query Information
PPL Command/Query:
Expected Result:
valshould retain originalvaluedata after rename + dedup:{ "datarows": [["A", 10.5], ["B", 100.0], ["C", 50.0]] }Actual Result:
All
valvalues are null. The dedup correctly filters by category but nullifies the renamed field:{ "datarows": [["A", null], ["B", null], ["C", null]] }Dataset Information
Dataset/Schema Type
Index Mapping
{ "mappings": { "properties": { "count": {"type": "integer"}, "category": {"type": "keyword"}, "subcategory": {"type": "keyword"}, "value": {"type": "double"}, "ts": {"type": "date"} } } }Sample Data
{"count":1,"category":"A","subcategory":"X","value":10.5,"ts":"2024-01-01"} {"count":2,"category":"A","subcategory":"Y","value":20.3,"ts":"2024-01-02"} {"count":10,"category":"B","subcategory":"X","value":100.0,"ts":"2024-01-03"} {"count":5,"category":"C","subcategory":"Z","value":50.0,"ts":"2024-01-04"}Bug Description
Issue Summary:
When a field is renamed via the
renamecommand and a subsequentdedupcommand operates on a different field, the renamed field returns null values. More broadly, ANY field whose project name differs from the original index field name (viarenameorevalcolumn reference) becomes null after dedup aggregation pushdown. The dedup correctly deduplicates by the specified field, but all non-dedup fields with mismatched names produce null.This bug specifically affects the aggregation-based
DedupPushdownRulepath. The collapse-based pushdown path was fixed by PR #4670 (which resolved #4563 and #4664), but the aggregation path was not addressed.Steps to Reproduce:
valis null in all rows whilecategoryis correct.Differential Testing Results:
rename value as val | fields category, valdedup category | fields category, valuerename value as val | sort val | fields category, valrename value as val | dedup category | fields category, valrename category as cat | dedup cat | fields cat, valueeval newval = value + 1 | dedup category | fields category, newvalrename value as val, category as cat | dedup cat | fields cat, valrename name as employee, salary as pay | dedup dept_ideval pay2 = salary | rename salary as pay | dedup dept_ideval extra = 42 | rename name as employee | dedup dept_idrename value as val | eval doubled = val * 2 | dedup category | fields category, val, doubledPattern summary:
eval pay2 = salary) -> null after dedup when resolved asRexInputRefeval extra = 42) -> survive (added asScriptFieldwith correct name)eval newval = value + 1) -> survive (added asScriptFieldwith correct name)Impact:
This bug makes it impossible to use
rename(or eval column references) withdedupon a different field in the same query pipeline when the aggregation pushdown path is taken. This is a common data transformation pattern.Root Cause Analysis
The bug is in the
DedupPushdownRuleaggregation pushdown path. When dedup is pushed down as a composite aggregation withtop_hits:DedupPushdownRule.apply()(opensearch/.../planner/rules/DedupPushdownRule.java:51-127): Creates an aggregate withLITERAL_AGGand callsscan.pushDownAggregate(aggregate, targetChildProject).targetChildProjecthas the renamed field names (e.g.,val).AggregateAnalyzer.createTopHitsBuilder()(opensearch/.../request/AggregateAnalyzer.java:803-868): For theLITERAL_AGGcase, iterates over all project args. ForRexInputReffields, callshelper.inferNamedField(rex.getKey())which resolves using the original scan's rowType -- gets original index field namevalue. Correctly buildstop_hitswithfetchField("value")/_source.includes: ["category", "value"].OpenSearch response returns
top_hitswith_source: {"category": "A", "value": 10.5}-- using original field names.TopHitsParser.parse()(opensearch/.../response/agg/TopHitsParser.java:123-134): Extracts hit source as-is:Map<String, Object>with key"value"(original name).OpenSearchResponse.handleAggregationResponse()(opensearch/.../response/OpenSearchResponse.java:251-268): CreatesExprTupleValuewith key"value"from the response.OpenSearchIndexEnumerator.current()(opensearch/.../storage/scan/OpenSearchIndexEnumerator.java:84-92): UsesgetRowType().getFieldNames()which has the renamed names fromcopyWithNewSchema(dedup.getRowType()). Tries to resolve"val"from the tuple that only has"value"-> returns null.The disconnect: The schema expects
val(renamed name), but the response tuple containsvalue(original index name). No rename mapping is applied during response parsing.Why renaming the dedup field itself works: The dedup field becomes the composite aggregation bucket key. The bucket key name comes from
outputFieldswhich uses the renamed name. So the bucket key is correctly named.Why
evalexpressions survive butevalcolumn refs don't:eval newval = value + 1creates aRexCall(script expression), andcreateTopHitsBuilder()adds it as aScriptFieldwith namenewval. The responsefieldssection has the correct name. However,eval pay2 = salarycreates aRexInputRef(just a column reference), which is resolved tofetchField("salary")-- the original index name. The enumerator then tries to resolve"pay2"from a tuple that only has"salary"-> null.Explain API Output
{ "calcite": { "logical": "LogicalSystemLimit(fetch=[10000], type=[QUERY_SIZE_LIMIT])\n LogicalProject(category=[$1], val=[$3])\n LogicalFilter(condition=[<=($11, 1)])\n LogicalProject(..., val=[$3], ..., _row_number_dedup_=[ROW_NUMBER() OVER (PARTITION BY $1)])\n LogicalFilter(condition=[IS NOT NULL($1)])\n LogicalProject(count=[$0], category=[$1], subcategory=[$2], val=[$3], ts=[$4], ...)\n CalciteLogicalIndexScan(table=[[OpenSearch, bounty-numbers]])\n", "physical": "CalciteEnumerableIndexScan(table=[[OpenSearch, bounty-numbers]], PushDownContext=[[AGGREGATION->..., LIMIT->10000], OpenSearchRequestBuilder(sourceBuilder={...\"_source\":{\"includes\":[\"category\",\"value\"],\"excludes\":[]},\"script_fields\":{}}...)])" } }Note: The logical plan correctly shows
val=[$3](renamed), but the physical plan's_source.includesuses["category","value"](original names). The response parser does not mapvalueback toval.Execute API Output
{ "schema": [ {"name": "category", "type": "string"}, {"name": "val", "type": "double"} ], "datarows": [["A", null], ["B", null], ["C", null]], "total": 3, "size": 3 }Environment Information
OpenSearch Version: OpenSearch + SQL Plugin (main branch, Calcite enabled)
Suggested Fix
The
DedupPushdownRuleneeds to account for field renames when building the response mapping. Possible approaches:Build a rename mapping in TopHitsParser: When
LITERAL_AGGis used for dedup, pass the project's field name mapping (original -> renamed) to the parser. TheTopHitsParsercan then rename response keys before returning.Add a post-processing project: After the aggregation scan returns results with original field names, insert a
LogicalProjectthat renames fields to match the expected schema.Use fetchField with alias support: Instead of
builder.fetchField(originalName), map the response fields using the project's named expressions.Approach 1 is the most targeted fix -- pass the
Map<String, String>of{originalName -> renamedName}toTopHitsParserand apply it when constructing the result maps.Related Issues
renamecommand breaksdedupoperation - returns null values #4563 (closed) -renamebreaksdedupwhen the dedup field itself is renamed (fixed by PR Use table scan rowType in filter pushdown could fix rename issue #4670 -- collapse-based pushdown)renamecommand breaksWHEREclause with functions - NullPointerException in Calcite filter pushdown #4664 (closed) -renamebreakswherewith functions (fixed by PR Use table scan rowType in filter pushdown could fix rename issue #4670 -- filter pushdown rowType issue)DedupPushdownRulepath was not addressed