
[BUG] Dedup aggregation pushdown nullifies renamed non-dedup fields via top_hits response mapping #5150

@penghuo

Description


Query Information

PPL Command/Query:

source=bounty-numbers | rename value as val | dedup category | fields category, val

Expected Result:
val should retain original value data after rename + dedup:

{
  "datarows": [["A", 10.5], ["B", 100.0], ["C", 50.0]]
}

Actual Result:
All val values are null. The dedup correctly filters by category but nullifies the renamed field:

{
  "datarows": [["A", null], ["B", null], ["C", null]]
}

Dataset Information

Dataset/Schema Type

  • OpenTelemetry (OTEL)
  • Simple Schema for Observability (SS4O)
  • Open Cybersecurity Schema Framework (OCSF)
  • Custom (details below) -- selected

Index Mapping

{
  "mappings": {
    "properties": {
      "count": {"type": "integer"},
      "category": {"type": "keyword"},
      "subcategory": {"type": "keyword"},
      "value": {"type": "double"},
      "ts": {"type": "date"}
    }
  }
}

Sample Data

{"count":1,"category":"A","subcategory":"X","value":10.5,"ts":"2024-01-01"}
{"count":2,"category":"A","subcategory":"Y","value":20.3,"ts":"2024-01-02"}
{"count":10,"category":"B","subcategory":"X","value":100.0,"ts":"2024-01-03"}
{"count":5,"category":"C","subcategory":"Z","value":50.0,"ts":"2024-01-04"}

Bug Description

Issue Summary:
When a field is renamed via the rename command and a subsequent dedup command operates on a different field, the renamed field returns null values. More broadly, ANY field whose project name differs from the original index field name (via rename or eval column reference) becomes null after dedup aggregation pushdown. The dedup correctly deduplicates by the specified field, but all non-dedup fields with mismatched names produce null.

This bug specifically affects the aggregation-based DedupPushdownRule path. The collapse-based pushdown path was fixed by PR #4670 (which resolved #4563 and #4664), but the aggregation path was not addressed.

Steps to Reproduce:

  1. Create the index:
curl -XPUT 'http://localhost:9200/bounty-numbers' -H 'Content-Type: application/json' -d '{
  "mappings": {
    "properties": {
      "count": {"type": "integer"},
      "category": {"type": "keyword"},
      "subcategory": {"type": "keyword"},
      "value": {"type": "double"},
      "ts": {"type": "date"}
    }
  }
}'
  2. Insert sample data:
curl -XPOST 'http://localhost:9200/bounty-numbers/_bulk' -H 'Content-Type: application/x-ndjson' -d '
{"index":{}}
{"count":1,"category":"A","subcategory":"X","value":10.5,"ts":"2024-01-01"}
{"index":{}}
{"count":2,"category":"A","subcategory":"Y","value":20.3,"ts":"2024-01-02"}
{"index":{}}
{"count":10,"category":"B","subcategory":"X","value":100.0,"ts":"2024-01-03"}
{"index":{}}
{"count":5,"category":"C","subcategory":"Z","value":50.0,"ts":"2024-01-04"}
'
  3. Run the PPL query:
curl -XPOST 'http://localhost:9200/_plugins/_ppl' -H 'Content-Type: application/json' \
  -d '{"query": "source=bounty-numbers | rename value as val | dedup category | fields category, val"}'
  4. Observe that val is null in all rows while category is correct.

Differential Testing Results:

  • rename value as val | fields category, val -> val has correct values (works)
  • dedup category | fields category, value -> value has correct values (works)
  • rename value as val | sort val | fields category, val -> val correct and sorted (works)
  • rename value as val | dedup category | fields category, val -> val = null (BUG)
  • rename category as cat | dedup cat | fields cat, value -> cat and value both correct (works)
  • eval newval = value + 1 | dedup category | fields category, newval -> newval correct (works -- script expr)
  • rename value as val, category as cat | dedup cat | fields cat, val -> cat correct, val = null (BUG)
  • rename name as employee, salary as pay | dedup dept_id -> employee=null, pay=null (BUG -- multiple renames)
  • eval pay2 = salary | rename salary as pay | dedup dept_id -> pay=null, pay2=null (BUG -- eval column ref also null)
  • eval extra = 42 | rename name as employee | dedup dept_id -> employee=null, extra=42, salary=correct (eval constant survives)
  • rename value as val | eval doubled = val * 2 | dedup category | fields category, val, doubled -> val=null, doubled=correct (eval of renamed field survives as RexCall)

Pattern summary:

  • Renamed fields -> null after dedup (unless dedup is on that renamed field itself)
  • Eval'd column references (e.g., eval pay2 = salary) -> null after dedup when resolved as RexInputRef
  • Eval'd constants (e.g., eval extra = 42) -> survive (added as ScriptField with correct name)
  • Eval'd expressions (e.g., eval newval = value + 1) -> survive (added as ScriptField with correct name)
  • Unrenamed original fields -> survive (project name == index name)
  • The dedup key field itself -> always survives (becomes composite bucket key with correct name)
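The name disconnect behind all of these patterns can be modeled in a few lines. This is a simplified sketch, not the plugin's actual code (the class and method names here are hypothetical): the enumerator resolves the projected schema names against a response tuple that is keyed by the original index field names.

```java
import java.util.List;
import java.util.Map;

// Simplified model of the failure: the enumerator walks the projected
// schema names ("val"), but the top_hits tuple is keyed by original
// index names ("value"), so the lookup misses and yields null.
class DedupNullSketch {
    // Projected schema after `rename value as val | dedup category | fields category, val`.
    static final List<String> SCHEMA = List.of("category", "val");

    // Resolve a row the way the enumerator does: one map lookup per schema name.
    static Object[] currentRow(Map<String, Object> tuple) {
        return SCHEMA.stream().map(tuple::get).toArray();
    }
}
```

With a response tuple of {"category": "A", "value": 10.5}, currentRow returns ["A", null], matching the observed datarows.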

Impact:
This bug makes it impossible to use rename (or eval column references) with dedup on a different field in the same query pipeline when the aggregation pushdown path is taken. This is a common data transformation pattern.

Root Cause Analysis

The bug is in the DedupPushdownRule aggregation pushdown path. When dedup is pushed down as a composite aggregation with top_hits:

  1. DedupPushdownRule.apply() (opensearch/.../planner/rules/DedupPushdownRule.java:51-127): Creates an aggregate with LITERAL_AGG and calls scan.pushDownAggregate(aggregate, targetChildProject). targetChildProject has the renamed field names (e.g., val).

  2. AggregateAnalyzer.createTopHitsBuilder() (opensearch/.../request/AggregateAnalyzer.java:803-868): For the LITERAL_AGG case, iterates over all project args. For RexInputRef fields, calls helper.inferNamedField(rex.getKey()) which resolves using the original scan's rowType -- gets original index field name value. Correctly builds top_hits with fetchField("value") / _source.includes: ["category", "value"].

  3. OpenSearch response returns top_hits with _source: {"category": "A", "value": 10.5} -- using original field names.

  4. TopHitsParser.parse() (opensearch/.../response/agg/TopHitsParser.java:123-134): Extracts hit source as-is: Map<String, Object> with key "value" (original name).

  5. OpenSearchResponse.handleAggregationResponse() (opensearch/.../response/OpenSearchResponse.java:251-268): Creates ExprTupleValue with key "value" from the response.

  6. OpenSearchIndexEnumerator.current() (opensearch/.../storage/scan/OpenSearchIndexEnumerator.java:84-92): Uses getRowType().getFieldNames() which has the renamed names from copyWithNewSchema(dedup.getRowType()). Tries to resolve "val" from the tuple that only has "value" -> returns null.

The disconnect: The schema expects val (renamed name), but the response tuple contains value (original index name). No rename mapping is applied during response parsing.

Why renaming the dedup field itself works: The dedup field becomes the composite aggregation bucket key. The bucket key name comes from outputFields which uses the renamed name. So the bucket key is correctly named.

Why eval expressions survive but eval column refs don't: eval newval = value + 1 creates a RexCall (script expression), and createTopHitsBuilder() adds it as a ScriptField with name newval. The response fields section has the correct name. However, eval pay2 = salary creates a RexInputRef (just a column reference), which is resolved to fetchField("salary") -- the original index name. The enumerator then tries to resolve "pay2" from a tuple that only has "salary" -> null.
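The asymmetry between RexCall and RexInputRef projections comes down to which name ends up keying the response entry. A minimal sketch of that rule (a hypothetical helper, not the plugin's actual API):

```java
// Sketch of the naming asymmetry described above (hypothetical helper,
// not the plugin's API): script expressions keep the projected name,
// while plain column references fall back to the original index name.
class ResponseKeySketch {
    static String responseKey(String projectedName, String originalName,
                              boolean isScriptExpression) {
        // RexCall -> ScriptField named after the projection (e.g. "newval");
        // RexInputRef -> fetchField on the original name (e.g. "salary").
        return isScriptExpression ? projectedName : originalName;
    }
}
```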

Explain API Output

{
  "calcite": {
    "logical": "LogicalSystemLimit(fetch=[10000], type=[QUERY_SIZE_LIMIT])\n  LogicalProject(category=[$1], val=[$3])\n    LogicalFilter(condition=[<=($11, 1)])\n      LogicalProject(..., val=[$3], ..., _row_number_dedup_=[ROW_NUMBER() OVER (PARTITION BY $1)])\n        LogicalFilter(condition=[IS NOT NULL($1)])\n          LogicalProject(count=[$0], category=[$1], subcategory=[$2], val=[$3], ts=[$4], ...)\n            CalciteLogicalIndexScan(table=[[OpenSearch, bounty-numbers]])\n",
    "physical": "CalciteEnumerableIndexScan(table=[[OpenSearch, bounty-numbers]], PushDownContext=[[AGGREGATION->..., LIMIT->10000], OpenSearchRequestBuilder(sourceBuilder={...\"_source\":{\"includes\":[\"category\",\"value\"],\"excludes\":[]},\"script_fields\":{}}...)])"
  }
}

Note: The logical plan correctly shows val=[$3] (renamed), but the physical plan's _source.includes uses ["category","value"] (original names). The response parser does not map value back to val.

Execute API Output

{
  "schema": [
    {"name": "category", "type": "string"},
    {"name": "val", "type": "double"}
  ],
  "datarows": [["A", null], ["B", null], ["C", null]],
  "total": 3,
  "size": 3
}

Environment Information

OpenSearch Version: OpenSearch + SQL Plugin (main branch, Calcite enabled)

Suggested Fix

The DedupPushdownRule needs to account for field renames when building the response mapping. Possible approaches:

  1. Build a rename mapping in TopHitsParser: When LITERAL_AGG is used for dedup, pass the project's field name mapping (original -> renamed) to the parser. The TopHitsParser can then rename response keys before returning.

  2. Add a post-processing project: After the aggregation scan returns results with original field names, insert a LogicalProject that renames fields to match the expected schema.

  3. Use fetchField with alias support: Instead of builder.fetchField(originalName), map the response fields using the project's named expressions.

Approach 1 is the most targeted fix -- pass the Map<String, String> of {originalName -> renamedName} to TopHitsParser and apply it when constructing the result maps.
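A minimal sketch of Approach 1 (the class and method names are hypothetical, not the plugin's API): apply the {originalName -> renamedName} map to the hit source keys while parsing the top_hits response, before the tuple reaches the enumerator.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of Approach 1 (hypothetical helper, not the plugin's API):
// rewrite response keys from original index names to projected names
// before the tuple is handed to the enumerator.
class TopHitsRenameSketch {
    static Map<String, Object> applyRenames(Map<String, Object> source,
                                            Map<String, String> renames) {
        Map<String, Object> out = new LinkedHashMap<>();
        source.forEach((key, value) ->
                out.put(renames.getOrDefault(key, key), value));
        return out;
    }
}
```

Applied with renames = {"value" -> "val"}, the tuple {"category": "A", "value": 10.5} becomes {"category": "A", "val": 10.5}, which the enumerator can then resolve against the renamed schema.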
