
[Feature] PPL Highlight Support#5234

Merged
RyanL1997 merged 25 commits into opensearch-project:main from RyanL1997:ppl-highlight-cmd
Mar 19, 2026

Conversation


@RyanL1997 (Collaborator) commented Mar 14, 2026

Description

Preview of the updated documentation at my fork:
https://github.com/RyanL1997/sql/blob/ppl-highlight-cmd/docs/user/ppl/interfaces/endpoint.md#highlight

This PR replaces the request-level ThreadLocal approach from #5141. Instead of passing highlight config through thread-local state, we thread a HighlightConfig record through the AST and Calcite plan — following the same pattern as fetch_size (#5109). The highlight config flows from the API request into the AST via StatementBuilderContext, becomes a LogicalHighlight Calcite node, and gets pushed down to the OpenSearch scan via an optimizer rule. This avoids the thread-safety concerns and implicit coupling of ThreadLocal and keeps the data flow explicit and traceable through the plan.
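The explicit-threading approach can be sketched as a small immutable record passed down the pipeline instead of ThreadLocal state. This is a hypothetical sketch based on the description above — the field names and the simple-array construction path are assumptions, not the PR's actual code:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical shape of the HighlightConfig record described in the PR text.
// Carrying it as an explicit value through the plan keeps the data flow
// traceable, unlike hidden ThreadLocal state.
record HighlightConfig(
    Map<String, Map<String, Object>> fields, // per-field options, e.g. {"*": {}}
    List<String> preTags,
    List<String> postTags,
    Integer fragmentSize) {

  // Simple array format ("highlight": ["*"]): each name gets default options.
  static HighlightConfig ofFieldNames(List<String> fieldNames) {
    Map<String, Map<String, Object>> fieldMap = new LinkedHashMap<>();
    for (String name : fieldNames) {
      fieldMap.put(name, Map.of());
    }
    return new HighlightConfig(fieldMap, null, null, null);
  }
}
```

A record is a good fit here because the config never changes after the request is parsed; each pipeline stage only reads it.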

Updated design according to review comment

Done. Dropped LogicalHighlight, HighlightIndexScanRule, and the Highlight AST node (3 classes deleted). Highlight config now threads through Query → QueryPlan → QueryService → CalcitePlanContext, and pushDownHighlight() is called eagerly in visitRelation() before alias wrapping — same consume-once pattern as other scan hints. Introduced a HighlightPushDown interface in core to bridge the module boundary with opensearch (since CalciteRelNodeVisitor can't reference CalciteLogicalIndexScan directly). Config is cleared immediately after pushdown so it's applied exactly once.
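The consume-once pattern described above can be sketched in miniature. The interface name HighlightPushDown matches the PR; the context class and everything else here are simplified stand-ins, not the actual implementation:

```java
import java.util.Map;

// Bridge interface so core code can push highlight config into a scan
// without referencing the concrete OpenSearch scan class directly.
interface HighlightPushDown {
  Object pushDownHighlight(Map<String, Object> config);
}

// Stand-in for the plan context that owns the pending highlight config.
class PlanContext {
  private Map<String, Object> highlightConfig;

  PlanContext(Map<String, Object> config) {
    this.highlightConfig = config;
  }

  // Called while visiting a Relation node: apply the config to the first
  // capable scan, then clear it so it is applied exactly once.
  Object consumeHighlight(Object scan) {
    if (highlightConfig != null && scan instanceof HighlightPushDown pushable) {
      Object newScan = pushable.pushDownHighlight(highlightConfig);
      highlightConfig = null; // consumed; later relations see no config
      return newScan;
    }
    return scan;
  }
}
```

Clearing the field inside the same guarded branch is what makes the hint single-use: a second relation in the same query sees a null config and is returned unchanged.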

  • Add PPL search result highlighting via the highlight API parameter (not as a PPL command)
  • Support both the simple array format (["*"]) and the rich OpenSearch Dashboards object format with pre_tags, post_tags, fields, and fragment_size
  • Introduce HighlightConfig record to carry highlight options through the full pipeline: PPLQueryRequest → AstStatementBuilder → Highlight AST node → CalciteRelNodeVisitor → LogicalHighlight → HighlightIndexScanRule pushdown → OpenSearch HighlightBuilder
  • Response includes a top-level highlights array parallel to datarows, containing per-field highlighted fragments
  • Add endpoint documentation with examples and expected output
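The two accepted shapes of the highlight parameter can be told apart by whether the JSON value is an array or an object. This is a hypothetical dispatch sketch to illustrate the distinction — the real parser uses a JSON library, not string inspection:

```java
// Hypothetical dispatcher for the two highlight formats described above:
//   simple array -> "highlight": ["*"]
//   rich object  -> "highlight": {"fields": {...}, "pre_tags": [...], ...}
class HighlightFormat {
  static String detect(String jsonValue) {
    String trimmed = jsonValue.trim();
    if (trimmed.startsWith("[")) {
      return "simple-array"; // list of field names with default options
    }
    if (trimmed.startsWith("{")) {
      return "rich-object"; // OSD-style object with tags and fragment_size
    }
    throw new IllegalArgumentException("highlight must be a JSON array or object");
  }
}
```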

Related Issues

Sample Request / Response

curl -s -X POST "localhost:9200/_plugins/_ppl" -H 'Content-Type: application/json' -d '{"query":"search source=accounts \"Holmes\"","highlight":{"fields":{"*":{}},"pre_tags":["@opensearch-dashboards-highlighted-field@"],"post_tags":["@/opensearch-dashboards-highlighted-field@"],"fragment_size":2147483647}}' | jq
{
  "schema": [
    {
      "name": "account_number",
      "type": "bigint"
    },
    {
      "name": "firstname",
      "type": "string"
    },
    {
      "name": "address",
      "type": "string"
    },
    {
      "name": "balance",
      "type": "bigint"
    },
    {
      "name": "gender",
      "type": "string"
    },
    {
      "name": "city",
      "type": "string"
    },
    {
      "name": "employer",
      "type": "string"
    },
    {
      "name": "state",
      "type": "string"
    },
    {
      "name": "age",
      "type": "bigint"
    },
    {
      "name": "email",
      "type": "string"
    },
    {
      "name": "lastname",
      "type": "string"
    }
  ],
  "datarows": [
    [
      578,
      "Holmes",
      "969 Metropolitan Avenue",
      34259,
      "M",
      "Aguila",
      "Cubicide",
      "PA",
      37,
      "holmesmcknight@cubicide.com",
      "Mcknight"
    ],
    [
      828,
      "Blanche",
      "605 Stryker Court",
      44890,
      "F",
      "Loomis",
      "Motovate",
      "KS",
      33,
      "blancheholmes@motovate.com",
      "Holmes"
    ],
    [
      1,
      "Amber",
      "880 Holmes Lane",
      39225,
      "M",
      "Brogan",
      "Pyrami",
      "IL",
      32,
      "amberduke@pyrami.com",
      "Duke"
    ]
  ],
  "highlights": [
    {
      "firstname": [
        "@opensearch-dashboards-highlighted-field@Holmes@/opensearch-dashboards-highlighted-field@"
      ],
      "firstname.keyword": [
        "@opensearch-dashboards-highlighted-field@Holmes@/opensearch-dashboards-highlighted-field@"
      ]
    },
    {
      "lastname.keyword": [
        "@opensearch-dashboards-highlighted-field@Holmes@/opensearch-dashboards-highlighted-field@"
      ],
      "lastname": [
        "@opensearch-dashboards-highlighted-field@Holmes@/opensearch-dashboards-highlighted-field@"
      ]
    },
    {
      "address": [
        "880 @opensearch-dashboards-highlighted-field@Holmes@/opensearch-dashboards-highlighted-field@ Lane"
      ]
    }
  ],
  "total": 3,
  "size": 3
}

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • New PPL command checklist all confirmed.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff or -s.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit 5891541.

Path | Line | Severity | Description
opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/AbstractCalciteIndexScan.java | 516 | low | User-supplied highlight terms are interpolated directly into a Lucene query_string expression with only double-quote wrapping and no further sanitization. A crafted term containing backslash escapes or unbalanced quotes could malform or expand the generated query. Impact is limited to highlighting behavior (not document access control), so this is likely an oversight rather than malicious intent.

The table above displays the top 10 most important findings.

Total: 1 | Critical: 0 | High: 0 | Medium: 0 | Low: 1
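The low-severity finding above suggests escaping before quoting. A minimal sketch of that idea follows — this is not the PR's code, and a Lucene helper such as QueryParser.escape covers the full metacharacter set, where this only neutralizes the two characters the finding calls out:

```java
// Minimal sketch: escape backslashes and double quotes in a user-supplied
// term before wrapping it in quotes for a query_string expression, so a
// crafted term cannot break out of the quoted literal.
class QueryStringEscaper {
  static String quoteTerm(String term) {
    // Escape backslashes first so the quote escapes are not double-escaped.
    String escaped = term.replace("\\", "\\\\").replace("\"", "\\\"");
    return "\"" + escaped + "\"";
  }
}
```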


Pull Requests Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.


⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.


github-actions bot commented Mar 14, 2026

PR Reviewer Guide 🔍

(Review updated until commit 08d3c20)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: HighlightConfig data model and AST/request parsing

Relevant files:

  • core/src/main/java/org/opensearch/sql/ast/tree/HighlightConfig.java
  • core/src/main/java/org/opensearch/sql/ast/statement/Query.java
  • ppl/src/main/java/org/opensearch/sql/ppl/domain/PPLQueryRequest.java
  • ppl/src/main/java/org/opensearch/sql/ppl/parser/AstStatementBuilder.java
  • core/src/main/java/org/opensearch/sql/calcite/CalcitePlanContext.java
  • ppl/src/test/java/org/opensearch/sql/ppl/domain/PPLQueryRequestTest.java
  • ppl/src/test/java/org/opensearch/sql/ppl/parser/AstStatementBuilderTest.java

Sub-PR theme: Highlight pushdown into Calcite scan and OpenSearch request builder

Relevant files:

  • core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java
  • core/src/main/java/org/opensearch/sql/calcite/plan/HighlightPushDown.java
  • opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/CalciteLogicalIndexScan.java
  • opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/AbstractCalciteIndexScan.java
  • opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/OpenSearchIndexEnumerator.java
  • opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/context/PushDownType.java
  • core/src/main/java/org/opensearch/sql/expression/HighlightExpression.java
  • opensearch/src/test/java/org/opensearch/sql/opensearch/storage/scan/CalciteIndexScanCostTest.java

Sub-PR theme: Highlight config threading through execution pipeline and response formatting

Relevant files:

  • core/src/main/java/org/opensearch/sql/executor/QueryService.java
  • core/src/main/java/org/opensearch/sql/executor/execution/QueryPlan.java
  • core/src/main/java/org/opensearch/sql/executor/execution/QueryPlanFactory.java
  • ppl/src/main/java/org/opensearch/sql/ppl/PPLService.java
  • protocol/src/main/java/org/opensearch/sql/protocol/response/QueryResult.java
  • protocol/src/main/java/org/opensearch/sql/protocol/response/format/SimpleJsonResponseFormatter.java
  • ppl/src/test/java/org/opensearch/sql/ppl/PPLServiceTest.java
  • core/src/test/java/org/opensearch/sql/executor/execution/QueryPlanTest.java
  • protocol/src/test/java/org/opensearch/sql/protocol/response/QueryResultTest.java
  • protocol/src/test/java/org/opensearch/sql/protocol/response/format/SimpleJsonResponseFormatterTest.java

⚡ Recommended focus areas for review

Iterator Change

The iterator() method was refactored to use tuple.entrySet().stream() instead of Map::values. This changes iteration order to be based on the tuple's entry set order rather than the schema column order. If the tuple map's key order doesn't match the schema column order, data rows will be misaligned with schema columns. The old approach used convertExprValuesToValues which iterated over Map::values — the new approach iterates over entrySet() which may produce a different ordering for some map implementations.

return exprValues.stream()
    .map(ExprValueUtils::getTupleValue)
    .map(
        tuple ->
            tuple.entrySet().stream().map(e -> e.getValue().value()).toArray(Object[]::new))
    .iterator();
Highlight Consumed Early

The highlight config is consumed (set to null) immediately after the first Relation node is visited. If a query involves multiple relations (e.g., JOINs or subqueries), only the first scan will receive the highlight config. This may be intentional for single-table PPL queries, but should be validated for multi-table scenarios.

if (context.getHighlightConfig() != null && scan instanceof HighlightPushDown) {
  RelNode newScan = ((HighlightPushDown) scan).pushDownHighlight(context.getHighlightConfig());
  context.relBuilder.build(); // pop old scan
  context.relBuilder.push(newScan);
  scan = newScan;
  context.setHighlightConfig(null); // consumed
}
Highlight Condition

The highlight config is only set on the Query statement when fields() is non-null and non-empty. However, a HighlightConfig constructed with a non-empty field list via the simple array constructor will always have non-empty fields. The condition may silently drop a valid highlight config if fields() returns empty for some edge case (e.g., a config constructed with an empty list). Consider whether this guard is necessary or if it should throw instead.

if (context.getHighlightConfig() != null
    && context.getHighlightConfig().fields() != null
    && !context.getHighlightConfig().fields().isEmpty()) {
  query.setHighlightConfig(context.getHighlightConfig());
}
Fragment Size Validation

The fragment_size validation only applies to the rich object format. In the per-field options parsed by parsePerFieldOptions, a negative fragment_size is not validated and would be passed through to OpenSearch without error. Consider adding validation in parsePerFieldOptions as well.

if (fieldObj.has("fragment_size")) {
  opts.put("fragment_size", fieldObj.getInt("fragment_size"));
}
Project Filter

In pushDownProject, the _highlight field is filtered out from the projected field names sent to OpenSearch via pushDownProjectStream. However, the _highlight field is still included in the newSchema passed to the scan. This means the schema has _highlight but the source builder does not request it — which is correct since highlights come from the highlight builder, not _source. This implicit coupling should be documented or made more explicit to avoid future regressions.

  action =
      (OSRequestBuilderAction)
          requestBuilder ->
              requestBuilder.pushDownProjectStream(
                  newSchema.getFieldNames().stream()
                      .filter(f -> !HighlightExpression.HIGHLIGHT_FIELD.equals(f)));
}


github-actions bot commented Mar 14, 2026

PR Code Suggestions ✨

Latest suggestions up to 08d3c20

Explore these optional code suggestions:

Category | Suggestion | Impact
Possible issue
Align row values with schema column order by name

The new iterator implementation iterates over the tuple's entry set directly, which
relies on the tuple's map iteration order matching the schema column order. If the
ExprTupleValue map order ever diverges from the schema column order (e.g., when
_highlight is appended to the schema but the tuple was built without it), the column
values will be misaligned with the schema. The previous implementation using
Map::values had the same issue, but the change makes it more explicit. Consider
iterating over schema columns and looking up each value by name to guarantee
alignment.

protocol/src/main/java/org/opensearch/sql/protocol/response/QueryResult.java [84-89]

+    List<String> columnNames = schema.getColumns().stream()
+        .map(col -> col.getAlias() != null ? col.getAlias() : col.getName())
+        .collect(java.util.stream.Collectors.toList());
     return exprValues.stream()
         .map(ExprValueUtils::getTupleValue)
-        .map(
-            tuple ->
-                tuple.entrySet().stream().map(e -> e.getValue().value()).toArray(Object[]::new))
+        .map(tuple -> columnNames.stream()
+            .map(name -> {
+                ExprValue v = tuple.get(name);
+                return v != null ? v.value() : null;
+            })
+            .toArray(Object[]::new))
         .iterator();
Suggestion importance[1-10]: 6


Why: The suggestion identifies a real potential issue: if the ExprTupleValue map order diverges from the schema column order (especially with the new _highlight column), values could be misaligned. Iterating by schema column names guarantees correct alignment and is a meaningful correctness improvement.

Low
Prevent integer overflow when parsing fragment_size

The validation rejects fragment_size of 0 and negative values, but the OSD object
format commonly sends fragment_size: 2147483647 (Integer.MAX_VALUE) to disable
fragmentation. However, the real issue is that Integer.MAX_VALUE is a valid value,
but the check fragmentSize <= 0 is correct. The problem is that
obj.getInt("fragment_size") can throw a JSONException if the value overflows an int
(e.g., values larger than Integer.MAX_VALUE). Consider using obj.getLong and then
range-checking before casting to int to avoid silent overflow or exceptions.

ppl/src/main/java/org/opensearch/sql/ppl/domain/PPLQueryRequest.java [169-174]

     if (obj.has("fragment_size")) {
-      fragmentSize = obj.getInt("fragment_size");
-      if (fragmentSize <= 0) {
-        throw new IllegalArgumentException("highlight fragment_size must be a positive integer");
+      long fragmentSizeLong = obj.getLong("fragment_size");
+      if (fragmentSizeLong <= 0 || fragmentSizeLong > Integer.MAX_VALUE) {
+        throw new IllegalArgumentException("highlight fragment_size must be a positive integer not exceeding " + Integer.MAX_VALUE);
       }
+      fragmentSize = (int) fragmentSizeLong;
     }
Suggestion importance[1-10]: 5


Why: The suggestion is valid - using getLong instead of getInt prevents potential overflow for values near Integer.MAX_VALUE. However, since OSD commonly sends exactly Integer.MAX_VALUE (2147483647), which fits in an int, this is a low-probability edge case. The improved code is accurate and addresses a real but minor robustness concern.

Low
General
Avoid silently dropping highlight config with empty fields

The condition skips setting the highlight config when fields is null or empty, but a
HighlightConfig constructed via the simple array constructor new
HighlightConfig(List.of("*")) will have a non-null, non-empty fields map. However,
if someone constructs a HighlightConfig with an empty fields map (e.g., object
format with no fields key), the highlight config will be silently dropped without
any error or warning. This could lead to confusing behavior where a user provides a
highlight config but gets no highlighting. The guard should either propagate the
config regardless or throw an error for empty fields.

ppl/src/main/java/org/opensearch/sql/ppl/parser/AstStatementBuilder.java [41-45]

-    if (context.getHighlightConfig() != null
-        && context.getHighlightConfig().fields() != null
-        && !context.getHighlightConfig().fields().isEmpty()) {
+    if (context.getHighlightConfig() != null) {
       query.setHighlightConfig(context.getHighlightConfig());
     }
Suggestion importance[1-10]: 5


Why: The suggestion correctly identifies that a HighlightConfig with empty fields (e.g., object format with no fields key) would be silently dropped. Removing the extra guards and always propagating a non-null config is a reasonable improvement, though the practical impact depends on whether empty-fields configs are a real use case.

Low
Guard unsafe casts in per-field options application

The @SuppressWarnings("unchecked") annotation suppresses all unchecked cast warnings
for the entire method, but the casts to List and Boolean are unsafe if the map
values are not of the expected types. Since the values come from JSON parsing in
parsePerFieldOptions, a ClassCastException at runtime is possible if the JSON
contains unexpected types (e.g., a number where a boolean is expected). Each cast
should be individually guarded or the annotation scope should be narrowed.

opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/AbstractCalciteIndexScan.java [532-563]

-    @SuppressWarnings(
-        "unchecked") // values are trusted types from PPLQueryRequest.parsePerFieldOptions
-  private static void applyPerFieldOptions(
-      HighlightBuilder.Field field, Map<String, Object> options) {
-    ...
-    if (options.containsKey("pre_tags")) {
-      field.preTags(((List<String>) options.get("pre_tags")).toArray(new String[0]));
+    private static void applyPerFieldOptions(
+        HighlightBuilder.Field field, Map<String, Object> options) {
+      if (options == null || options.isEmpty()) {
+        return;
+      }
+      if (options.containsKey("fragment_size")) {
+        field.fragmentSize(((Number) options.get("fragment_size")).intValue());
+      }
+      if (options.containsKey("number_of_fragments")) {
+        field.numOfFragments(((Number) options.get("number_of_fragments")).intValue());
+      }
+      if (options.containsKey("type")) {
+        field.highlighterType((String) options.get("type"));
+      }
+      if (options.containsKey("pre_tags")) {
+        Object val = options.get("pre_tags");
+        if (val instanceof List<?> list) {
+          field.preTags(list.stream().map(Object::toString).toArray(String[]::new));
+        }
+      }
+      if (options.containsKey("post_tags")) {
+        Object val = options.get("post_tags");
+        if (val instanceof List<?> list) {
+          field.postTags(list.stream().map(Object::toString).toArray(String[]::new));
+        }
+      }
+      if (options.containsKey("require_field_match")) {
+        Object val = options.get("require_field_match");
+        if (val instanceof Boolean b) {
+          field.requireFieldMatch(b);
+        }
+      }
+      if (options.containsKey("no_match_size")) {
+        field.noMatchSize(((Number) options.get("no_match_size")).intValue());
+      }
+      if (options.containsKey("order")) {
+        field.order((String) options.get("order"));
+      }
     }
-    if (options.containsKey("post_tags")) {
-      field.postTags(((List<String>) options.get("post_tags")).toArray(new String[0]));
-    }
-    if (options.containsKey("require_field_match")) {
-      field.requireFieldMatch((Boolean) options.get("require_field_match"));
-    }
-  </code>
Suggestion importance[1-10]: 3


Why: The suggestion is valid in principle - the @SuppressWarnings("unchecked") suppresses all warnings for the method. However, since the values are populated by parsePerFieldOptions which explicitly uses typed JSON getters, the actual risk of ClassCastException is low. The existing code is also not accurately represented in the existing_code snippet (it contains ... and </code> artifacts), making the suggestion less precise.

Low

Previous suggestions

Suggestions up to commit d289025
Category | Suggestion | Impact
Possible issue
Fix incorrect highlight data access path in test

The test accesses result.getJSONArray("highlights"), but based on the rest of the PR
the highlight data is returned as a _highlight column inside datarows, not as a
top-level highlights array. This will cause a JSONException at runtime when the key
is not found. The test should instead look up the _highlight column index from
schema and read it from datarows.

integ-test/src/test/java/org/opensearch/sql/security/CalciteCrossClusterSearchIT.java [490-495]

-var highlights = result.getJSONArray("highlights");
-var highlight = highlights.getJSONObject(0);
+JSONArray schema = result.getJSONArray("schema");
+int hlIndex = -1;
+for (int i = 0; i < schema.length(); i++) {
+  if ("_highlight".equals(schema.getJSONObject(i).getString("name"))) {
+    hlIndex = i;
+    break;
+  }
+}
+Assert.assertTrue("_highlight column should exist", hlIndex >= 0);
+JSONObject highlight = result.getJSONArray("datarows").getJSONArray(0).getJSONObject(hlIndex);
 Assert.assertTrue(
     "Highlight should contain <em>Hattie</em>",
     highlight.getJSONArray("firstname").getString(0).contains("<em>Hattie</em>"));
Suggestion importance[1-10]: 8


Why: The test accesses result.getJSONArray("highlights") which does not exist in the response format; highlight data is stored as a _highlight column inside datarows. This will cause a JSONException at runtime, making the test fail. The improved code correctly reads the _highlight column index from schema and retrieves the value from datarows.

Medium
Ensure row values align with schema column order

The new iterator implementation iterates over the tuple's entry set directly, which
relies on the tuple's map iteration order matching the schema column order. If the
ExprTupleValue map order ever diverges from the schema column order (e.g., when
highlight adds _highlight to the tuple but the schema lists it last), the column
values will be misaligned. The previous approach using Map::values had the same
issue, but the new code makes it more explicit. Consider iterating over schema
columns and looking up each value by name to guarantee alignment.

protocol/src/main/java/org/opensearch/sql/protocol/response/QueryResult.java [84-89]

+// Iterate in schema column order to guarantee alignment
+List<String> columnNames = schema.getColumns().stream()
+    .map(col -> col.getAlias() != null ? col.getAlias() : col.getName())
+    .collect(java.util.stream.Collectors.toList());
 return exprValues.stream()
     .map(ExprValueUtils::getTupleValue)
     .map(
         tuple ->
-            tuple.entrySet().stream().map(e -> e.getValue().value()).toArray(Object[]::new))
+            columnNames.stream()
+                .map(name -> {
+                    ExprValue v = tuple.get(name);
+                    return v != null ? v.value() : null;
+                })
+                .toArray(Object[]::new))
     .iterator();
Suggestion importance[1-10]: 7


Why: The new iterator iterates over the tuple's entry set order rather than the schema column order, which could cause column misalignment if the _highlight field or other fields are stored in a different order than the schema. The suggested fix of iterating by schema column names is more robust and directly addresses a real correctness risk introduced by this PR.

Medium
General
Avoid silently dropping valid highlight configs with empty fields

The condition skips setting the highlight config when fields is null or empty, but a
HighlightConfig constructed via the simple array constructor (new
HighlightConfig(List.of("*"))) will always have a non-null, non-empty fields map.
However, if someone constructs a HighlightConfig with an empty fields map (e.g.,
object format with no fields key), the config will be silently dropped. This could
cause unexpected behavior where a valid highlight config is ignored. Consider only
checking for null config.

ppl/src/main/java/org/opensearch/sql/ppl/parser/AstStatementBuilder.java [41-45]

-if (context.getHighlightConfig() != null
-    && context.getHighlightConfig().fields() != null
-    && !context.getHighlightConfig().fields().isEmpty()) {
+if (context.getHighlightConfig() != null) {
   query.setHighlightConfig(context.getHighlightConfig());
 }
Suggestion importance[1-10]: 5


Why: The extra fields null/empty check in AstStatementBuilder could silently drop a valid HighlightConfig constructed with an empty fields map (e.g., object format with no fields key). Simplifying to only check for null config is more correct and consistent with how the config is constructed.

Low
Align validation condition with error message wording

The validation rejects fragment_size of 0 and negative values, but the test
testHighlightOsdObjectFormatAllFields passes 2147483647 (Integer.MAX_VALUE) which is
valid. However, the OSD format commonly uses Integer.MAX_VALUE to mean "no
fragmentation". The current check is correct, but the error message says "positive
integer" while 0 is also rejected — consider clarifying the message to say "must be
greater than 0" or use fragmentSize < 1 for consistency with the message.

ppl/src/main/java/org/opensearch/sql/ppl/domain/PPLQueryRequest.java [171-173]

-if (fragmentSize <= 0) {
-  throw new IllegalArgumentException("highlight fragment_size must be a positive integer");
+if (fragmentSize < 1) {
+  throw new IllegalArgumentException("highlight fragment_size must be greater than 0");
 }
Suggestion importance[1-10]: 2


Why: The functional difference between <= 0 and < 1 is identical for integers, so this is purely a cosmetic/wording improvement. The existing code is correct and the test already covers both 0 and -1 cases.

Low
Suggestions up to commit e2ef1b2
Category | Suggestion | Impact
Possible issue
Guard against non-collection highlight field values

Calling entry.getValue().collectionValue() will throw an exception if the highlight
value for a field is not a collection (e.g., it's a string or missing). This can
cause a runtime failure when processing highlight data that doesn't conform to the
expected list-of-strings structure. Add a null/type check before calling
collectionValue().

core/src/main/java/org/opensearch/sql/protocol/response/QueryResult.java [115-123]

 Map<String, Object> hlMap = new LinkedHashMap<>();
 for (Map.Entry<String, ExprValue> entry : hl.tupleValue().entrySet()) {
+  ExprValue fieldVal = entry.getValue();
+  if (fieldVal == null || fieldVal.isMissing() || fieldVal.isNull()) {
+    continue;
+  }
   hlMap.put(
       entry.getKey(),
-      entry.getValue().collectionValue().stream()
+      fieldVal.collectionValue().stream()
           .map(ExprValue::stringValue)
           .collect(Collectors.toList()));
 }
-return (Map<String, Object>) hlMap;
+return hlMap;
Suggestion importance[1-10]: 5


Why: The suggestion correctly identifies a potential runtime exception if a highlight field value is not a collection. The improved code adds null/missing/null checks before calling collectionValue(), which is a valid defensive improvement. The unchecked cast (Map<String, Object>) hlMap is also removed in the improved code, which is a minor cleanup.

Low
Guard against non-integer fragment_size values

The validation rejects fragment_size of 0 and negative values, but the OSD format
commonly sends fragment_size: 2147483647 (Integer.MAX_VALUE) to disable
fragmentation. However, the real issue is that Integer.MAX_VALUE itself is valid,
but the check <= 0 is correct. The bigger concern is that
obj.getInt("fragment_size") will throw a JSONException (not
IllegalArgumentException) if the value is not an integer (e.g., a float like 100.5),
causing an unhandled exception to propagate. Use optInt with a sentinel or wrap in a
try-catch to provide a consistent error message.

ppl/src/main/java/org/opensearch/sql/ppl/domain/PPLQueryRequest.java [170-173]

-if (fragmentSize <= 0) {
+if (obj.has("fragment_size")) {
+  Object fsRaw = obj.get("fragment_size");
+  if (!(fsRaw instanceof Integer) && !(fsRaw instanceof Long)) {
     throw new IllegalArgumentException("highlight fragment_size must be a positive integer");
+  }
+  fragmentSize = ((Number) fsRaw).intValue();
+  if (fragmentSize <= 0) {
+    throw new IllegalArgumentException("highlight fragment_size must be a positive integer");
+  }
 }
Suggestion importance[1-10]: 4


Why: The suggestion correctly identifies that obj.getInt("fragment_size") can throw a JSONException for non-integer values, but in practice the OSD format always sends integer values, and the JSONException would still surface as an error to the caller. The improvement is minor and the existing code handles the common cases correctly.

Low
General
Use consistent non-empty check for highlight config

The simple array format constructs a HighlightConfig via new HighlightConfig(List),
which calls toFieldMap. If the field list is non-null but empty, fields() will be an
empty map, and the highlight config will be silently dropped here. This means a
request with "highlight": [] would not propagate the config at all, which may be the
desired behavior, but it also means a HighlightConfig created with a non-empty field
list but whose fields() map is somehow empty would be dropped. Consider also
checking fieldNames() for the simple-array path to be consistent.

ppl/src/main/java/org/opensearch/sql/ppl/parser/AstStatementBuilder.java [41-45]

 if (context.getHighlightConfig() != null
-    && context.getHighlightConfig().fields() != null
-    && !context.getHighlightConfig().fields().isEmpty()) {
+    && !context.getHighlightConfig().fieldNames().isEmpty()) {
   query.setHighlightConfig(context.getHighlightConfig());
 }
Suggestion importance[1-10]: 4


Why: The suggestion simplifies the null/empty check by using fieldNames() instead of checking fields() directly, which is more consistent with the HighlightConfig API. However, the behavior difference for an empty array [] is intentional (silently dropping it), so this is a minor style/consistency improvement rather than a bug fix.

Low
Prevent duplicate highlight field in projections

The _highlight field is unconditionally appended to expandedFields whenever it
exists in the schema, even if the user's explicit | fields projection already
included it. This could result in duplicate _highlight columns in the projection.
Check whether the field is already present in expandedFields before adding it.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [435-439]

 // Include _highlight in projections when the highlight column is present in the schema
 int hlIndex = currentFields.indexOf(HighlightExpression.HIGHLIGHT_FIELD);
-if (hlIndex >= 0) {
+boolean alreadyIncluded = expandedFields.stream()
+    .anyMatch(rex -> rex.equals(context.relBuilder.field(hlIndex >= 0 ? hlIndex : 0)) && hlIndex >= 0);
+if (hlIndex >= 0 && !alreadyIncluded) {
   expandedFields.add(context.relBuilder.field(hlIndex));
 }
Suggestion importance[1-10]: 3

__

Why: The concern about duplicate _highlight columns is valid in theory, but the improved_code is awkward and potentially incorrect (it evaluates context.relBuilder.field(0) when hlIndex < 0). The existing code appends _highlight only when it's in the schema, and typical user projections via | fields would not include _highlight explicitly, making duplicates unlikely in practice.

Low
Suggestions up to commit 1bd30fa
Category | Suggestion | Impact
Possible issue
Guard against null highlight field values

Calling entry.getValue().collectionValue() will throw an exception if the highlight
value for a field is not a collection (e.g., it's a string or null). This can cause
a runtime failure when processing highlight data that doesn't conform to the
expected array-of-strings structure. Add a null/type check before calling
collectionValue().

core/src/main/java/org/opensearch/sql/protocol/response/QueryResult.java [115-123]

 Map<String, Object> hlMap = new LinkedHashMap<>();
 for (Map.Entry<String, ExprValue> entry : hl.tupleValue().entrySet()) {
+  ExprValue fieldVal = entry.getValue();
+  if (fieldVal == null || fieldVal.isMissing() || fieldVal.isNull()) {
+    continue;
+  }
   hlMap.put(
       entry.getKey(),
-      entry.getValue().collectionValue().stream()
+      fieldVal.collectionValue().stream()
           .map(ExprValue::stringValue)
           .collect(Collectors.toList()));
 }
-return (Map<String, Object>) hlMap;
+return hlMap;
Suggestion importance[1-10]: 5

__

Why: The suggestion correctly identifies a potential runtime exception if a highlight field value is not a collection type. Adding null/missing/null checks before calling collectionValue() improves robustness, and the unchecked cast (Map<String, Object>) hlMap is also removed in the improved code.

Low
Guard against out-of-range fragment_size values

The <= 0 check correctly accepts a fragment_size of 2147483647 (Integer.MAX_VALUE), the
value OpenSearch Dashboards explicitly sends, as shown in the tests and documentation,
so positive values are handled. The real issue is that obj.getInt("fragment_size") will
throw a JSONException if the value exceeds Integer.MAX_VALUE (e.g., a long value), and
there's no guard against that. Consider wrapping the parse in a try-catch or using
optLong with a range check to handle out-of-range values gracefully.

ppl/src/main/java/org/opensearch/sql/ppl/domain/PPLQueryRequest.java [169-173]

-if (fragmentSize <= 0) {
-    throw new IllegalArgumentException("highlight fragment_size must be a positive integer");
+if (obj.has("fragment_size")) {
+  long rawFragmentSize = obj.optLong("fragment_size", -1L);
+  if (rawFragmentSize <= 0 || rawFragmentSize > Integer.MAX_VALUE) {
+    throw new IllegalArgumentException(
+        "highlight fragment_size must be a positive integer not exceeding " + Integer.MAX_VALUE);
+  }
+  fragmentSize = (int) rawFragmentSize;
 }
Suggestion importance[1-10]: 4

__

Why: The existing <= 0 check is correct and Integer.MAX_VALUE (2147483647) is a valid positive integer that passes the check. The obj.getInt() method in org.json does handle large integers within int range, and 2147483647 is exactly Integer.MAX_VALUE, so no overflow occurs. The suggestion adds marginal safety for truly out-of-range long values, but this is a low-probability edge case.

Low
General
Prevent duplicate highlight field in projections

The _highlight field is unconditionally appended to every explicit projection when
it exists in the schema, even when the user has explicitly excluded it or when it's
already included in expandedFields via expandProjectFields. This could result in
duplicate _highlight entries in the projection list, potentially causing plan errors
or unexpected behavior.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [435-439]

 // Include _highlight in projections when the highlight column is present in the schema
+// and not already included by the user's explicit projection
 int hlIndex = currentFields.indexOf(HighlightExpression.HIGHLIGHT_FIELD);
-if (hlIndex >= 0) {
+boolean alreadyIncluded = expandedFields.stream()
+    .anyMatch(rex -> rex.toString().contains(HighlightExpression.HIGHLIGHT_FIELD));
+if (hlIndex >= 0 && !alreadyIncluded) {
   expandedFields.add(context.relBuilder.field(hlIndex));
 }
Suggestion importance[1-10]: 5

__

Why: The concern about duplicate _highlight entries is valid — if expandProjectFields already includes _highlight (e.g., user explicitly projects it), appending it again could cause plan errors. However, the improved_code uses a string-based check (rex.toString().contains(...)) which is fragile; a more robust approach would compare field indices or names directly.

Low
Unify highlight config non-empty check

The highlight config is only set on the query when the object-format fields map is
non-empty. However, when the simple array format is used (e.g., ["*"]),
HighlightConfig is constructed via new HighlightConfig(List) which calls toFieldMap,
so fields() will be non-null and non-empty — this works. But if someone passes an
object format with an empty fields object {}, the highlight config will be silently
dropped. This may be intentional, but it's worth verifying the fieldNames() method
is used consistently for the simple-array path.

core/src/main/java/org/opensearch/sql/ppl/parser/AstStatementBuilder.java [41-45]

 if (context.getHighlightConfig() != null
-    && context.getHighlightConfig().fields() != null
-    && !context.getHighlightConfig().fields().isEmpty()) {
+    && !context.getHighlightConfig().fieldNames().isEmpty()) {
   query.setHighlightConfig(context.getHighlightConfig());
 }
Suggestion importance[1-10]: 4

__

Why: The suggestion simplifies the condition by using fieldNames() instead of checking fields() != null && !fields().isEmpty(). This is a minor readability improvement, but the behavior difference (silently dropping empty object-format configs) is a valid concern worth addressing.

Low
Suggestions up to commit 1b07536
Category | Suggestion | Impact
Possible issue
Avoid adding highlight field to exclusion projections

The _highlight field is unconditionally appended to every non-wildcard projection
when it exists in the schema. This means that even when the user explicitly projects
specific fields (e.g., | fields firstname, age), the _highlight field is silently
added to expandedFields. This could cause issues if expandedFields is later used for
exclusion (projectExcept) since _highlight would be excluded rather than included.
The highlight field should only be added when the projection is not an exclusion
projection.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [435-439]

 // Include _highlight in projections when the highlight column is present in the schema
 int hlIndex = currentFields.indexOf(HighlightExpression.HIGHLIGHT_FIELD);
-if (hlIndex >= 0) {
+if (hlIndex >= 0 && !node.isExcluded()) {
   expandedFields.add(context.relBuilder.field(hlIndex));
 }
Suggestion importance[1-10]: 7

__

Why: This is a valid bug: when node.isExcluded() is true, expandedFields is passed to projectExcept, which would exclude _highlight rather than include it. Adding the !node.isExcluded() guard prevents this incorrect behavior.

Medium
Guard against non-collection highlight field values

The code calls entry.getValue().collectionValue() without checking whether the value
is actually a collection type. If a highlight field value is not a collection (e.g.,
it's a string or null), this will throw a runtime exception. Add a null/type check
before calling collectionValue().

protocol/src/main/java/org/opensearch/sql/protocol/response/QueryResult.java [115-123]

 Map<String, Object> hlMap = new LinkedHashMap<>();
 for (Map.Entry<String, ExprValue> entry : hl.tupleValue().entrySet()) {
+  ExprValue fieldVal = entry.getValue();
+  if (fieldVal == null || fieldVal.isMissing() || fieldVal.isNull()) {
+    continue;
+  }
   hlMap.put(
       entry.getKey(),
-      entry.getValue().collectionValue().stream()
+      fieldVal.collectionValue().stream()
           .map(ExprValue::stringValue)
           .collect(Collectors.toList()));
 }
 return (Map<String, Object>) hlMap;
Suggestion importance[1-10]: 5

__

Why: Calling collectionValue() on a non-collection ExprValue could throw a runtime exception. The suggested null/type check adds defensive handling, though in practice highlight values from OpenSearch are always arrays.

Low
General
Avoid silently dropping highlight config with empty fields

The highlight config is only applied when the fields map is non-null and non-empty.
However, a HighlightConfig constructed via the simple array constructor (new
HighlightConfig(List.of("*"))) will have a non-null, non-empty fields map, so that
case is handled. But if someone constructs a HighlightConfig with an empty fields
map (e.g., object format with no fields key), the config will be silently dropped
without any error or warning. This silent discard could be confusing. Consider
applying the config whenever it is non-null, or throwing a validation error earlier
in PPLQueryRequest.getHighlightConfig() when fields are empty.

ppl/src/main/java/org/opensearch/sql/ppl/parser/AstStatementBuilder.java [41-45]

-if (context.getHighlightConfig() != null
-    && context.getHighlightConfig().fields() != null
-    && !context.getHighlightConfig().fields().isEmpty()) {
+if (context.getHighlightConfig() != null) {
   query.setHighlightConfig(context.getHighlightConfig());
 }
Suggestion importance[1-10]: 4

__

Why: The current guard silently drops a HighlightConfig with an empty fields map, which could be confusing. Simplifying to just a null check is cleaner, though the empty-fields case is already rejected earlier in PPLQueryRequest.getHighlightConfig() since fieldsObj with no keys would produce an empty map that gets passed through.

Low
Validate fragment_size upper bound to prevent overflow

The validation rejects fragment_size values of 0 or negative via the <= 0 check. The
OpenSearch Dashboards integration explicitly sends fragment_size: 2147483647
(Integer.MAX_VALUE) to disable fragmentation, and that value passes the check since
Integer.MAX_VALUE is positive. Per the OpenSearch docs, fragment_size: 0 is invalid, so
<= 0 is the correct lower bound; consider also validating an upper bound to prevent
overflow issues when the value is passed to the builder.

ppl/src/main/java/org/opensearch/sql/ppl/domain/PPLQueryRequest.java [169-174]

+if (obj.has("fragment_size")) {
+  fragmentSize = obj.getInt("fragment_size");
+  if (fragmentSize <= 0) {
+    throw new IllegalArgumentException("highlight fragment_size must be a positive integer");
+  }
+}
 
-
Suggestion importance[1-10]: 2

__

Why: The existing_code and improved_code are identical, meaning no actual change is proposed. The suggestion only discusses whether the validation is correct without providing a concrete fix.

Low
Suggestions up to commit 87bb284
Category | Suggestion | Impact
Possible issue
Avoid duplicate highlight column in scan schema

pushDownHighlight unconditionally adds the _highlight field to the schema even if it
already exists (e.g., when the rule fires a second time or the field was already
added by LogicalHighlight.create). This can result in duplicate _highlight columns
in the row type, causing index-out-of-bounds or incorrect field resolution
downstream.

opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/CalciteLogicalIndexScan.java [89-95]

 public RelNode pushDownHighlight(HighlightConfig highlightConfig) {
   RelDataTypeFactory.Builder schemaBuilder = getCluster().getTypeFactory().builder();
   schemaBuilder.addAll(getRowType().getFieldList());
-  schemaBuilder.add(
-      HighlightExpression.HIGHLIGHT_FIELD,
-      getCluster().getTypeFactory().createSqlType(SqlTypeName.ANY));
+  if (!getRowType().getFieldNames().contains(HighlightExpression.HIGHLIGHT_FIELD)) {
+    schemaBuilder.add(
+        HighlightExpression.HIGHLIGHT_FIELD,
+        getCluster().getTypeFactory().createSqlType(SqlTypeName.ANY));
+  }
Suggestion importance[1-10]: 7

__

Why: This is a real correctness issue — pushDownHighlight unconditionally adds _highlight without checking if it already exists, which could cause duplicate columns. The LogicalHighlight.create already guards against this, but pushDownHighlight does not, making it inconsistent and potentially buggy.

Medium
Guard against unexpected highlight value types

entry.getValue().collectionValue() will throw an exception if the highlight value
for a field is not a collection (e.g., it is a plain string or missing). This can
cause the entire response serialization to fail for rows where highlight data has an
unexpected shape. A null/missing check or a try-catch should guard this call.

protocol/src/main/java/org/opensearch/sql/protocol/response/QueryResult.java [115-123]

 Map<String, Object> hlMap = new LinkedHashMap<>();
 for (Map.Entry<String, ExprValue> entry : hl.tupleValue().entrySet()) {
+  ExprValue fieldVal = entry.getValue();
+  if (fieldVal == null || fieldVal.isMissing() || fieldVal.isNull()) {
+    continue;
+  }
   hlMap.put(
       entry.getKey(),
-      entry.getValue().collectionValue().stream()
+      fieldVal.collectionValue().stream()
           .map(ExprValue::stringValue)
           .collect(Collectors.toList()));
 }
-return (Map<String, Object>) hlMap;
+return hlMap;
Suggestion importance[1-10]: 6

__

Why: Calling collectionValue() on a non-collection ExprValue could throw an exception and break response serialization. Adding a null/missing check is a defensive improvement that prevents potential runtime failures for unexpected highlight data shapes.

Low
Prevent highlight config leaking across queries

The visitHighlight method sets the highlight config on the context and then visits
children, but the config is consumed and cleared inside visitRelation. If the query
has no Relation node (e.g., a Values node or subquery), the highlight config will
remain set on the context and could leak into subsequent queries sharing the same
context. The config should be cleared after visitChildren if it was not consumed.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [3202-3207]

 @Override
 public RelNode visitHighlight(Highlight node, CalcitePlanContext context) {
   context.setHighlightConfig(node.getHighlightConfig());
   visitChildren(node, context);
+  // Clear in case visitRelation was not reached (e.g. non-scan source)
+  context.setHighlightConfig(null);
   return context.relBuilder.peek();
 }
Suggestion importance[1-10]: 5

__

Why: The concern is valid — if visitRelation is not reached, the highlightConfig stays set on the context. However, the config is already cleared inside visitRelation with context.setHighlightConfig(null), and the CalcitePlanContext is typically per-query, reducing the actual risk. The fix is reasonable but the impact is limited.

Low
General
Return null for highlight config with no fields

When the fields object is absent or empty in the OSD object format, fields will be
an empty list, and the HighlightConfig will be created with no fields. This will
silently produce a no-op highlight (the check in AstStatementBuilder skips injection
when fields are empty), but the caller receives a non-null config with empty fields
rather than null, which may be confusing. More critically, if fields is empty,
applyHighlightPushDown returns early without configuring the highlight builder, so
pre/post tags and fragment_size from the OSD object are also silently dropped.
Consider returning null when fields are empty.

ppl/src/main/java/org/opensearch/sql/ppl/domain/PPLQueryRequest.java [149-154]

 JSONObject fieldsObj = obj.optJSONObject("fields");
 if (fieldsObj != null) {
   for (String key : fieldsObj.keySet()) {
     fields.add(key);
   }
 }
 
+if (fields.isEmpty()) {
+  return null;
+}
+
Suggestion importance[1-10]: 5

__

Why: When the OSD object format has no fields, applyHighlightPushDown returns early and silently drops pre_tags, post_tags, and fragment_size. Returning null instead of an empty-fields config makes the behavior explicit and consistent with the AstStatementBuilder check.

Low

@RyanL1997 RyanL1997 added PPL Piped processing language v3.6.0 Issues targeting release v3.6.0 feature labels Mar 14, 2026
@github-actions
Contributor

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit f0dacc0.

Path | Line | Severity | Description
opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/AbstractCalciteIndexScan.java | 519 | low | User-supplied highlight terms are wrapped in double quotes but not escaped before being passed to queryStringQuery(). A term containing a literal '"' character could partially break out of the phrase context and alter the Lucene query structure. Because PPL users are already authenticated and can issue arbitrary search queries, the practical exploitability is limited, but proper escaping of the term string (e.g., via QueryParser.escape) would be safer.

The table above displays the top 10 most important findings.

Total: 1 | Critical: 0 | High: 0 | Medium: 0 | Low: 1


Pull Requests Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.


⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.

@github-actions
Contributor

Persistent review updated to latest commit f0dacc0

@github-actions
Contributor

Persistent review updated to latest commit ab37b5b

@github-actions
Contributor

Persistent review updated to latest commit a7c44d8

@github-actions
Contributor

Persistent review updated to latest commit 51356cd

@github-actions
Contributor

Persistent review updated to latest commit b2d6afe

@github-actions
Contributor

Persistent review updated to latest commit 7bb526a

@github-actions
Contributor

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit e3b9c75.

Path | Line | Severity | Description
opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/AbstractCalciteIndexScan.java | 514 | medium | User-supplied field names from the API request's 'highlight' parameter are directly interpolated into a Lucene query string with only double-quote wrapping ('"' + term + '"'). While this helps with phrase matching, it does not sanitize Lucene special characters or escape sequences, creating a potential query string injection vector. A crafted field value like '* OR sensitive_field:*' could alter query semantics. This appears to be an unintentional coding oversight rather than malicious intent, as the pattern is consistent with the rest of the feature and the terms are at least wrapped in quotes.

The table above displays the top 10 most important findings.

Total: 1 | Critical: 0 | High: 0 | Medium: 1 | Low: 0


Pull Requests Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.


⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.
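For reference, the escaping the analyzer suggests can be approximated as below. This is a sketch: `escapeLucene` is a hypothetical helper modeled on Lucene's `QueryParser.escape`, which backslash-escapes the characters Lucene treats as query syntax so a user-supplied term stays inert inside the quoted phrase.

```java
public class LuceneEscape {
    // Characters Lucene treats as query syntax; escaping them keeps
    // user-supplied highlight terms from altering the query structure.
    private static final String SPECIALS = "+-&|!(){}[]^\"~*?:\\/";

    public static String escapeLucene(String term) {
        StringBuilder sb = new StringBuilder(term.length());
        for (char c : term.toCharArray()) {
            if (SPECIALS.indexOf(c) >= 0) {
                sb.append('\\');  // prefix each special character
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A term trying to break out of the quoted phrase is neutralized.
        System.out.println(escapeLucene("\" OR sensitive_field:*"));
        // prints: \" OR sensitive_field\:\*
    }
}
```

In the production code this would run on the term before the `'"' + term + '"'` wrapping, so the quotes added by the scan remain the only unescaped quotes in the query string.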

@github-actions
Contributor

Persistent review updated to latest commit e3b9c75

Signed-off-by: Jialiang Liang <jiallian@amazon.com>
…command

Signed-off-by: Jialiang Liang <jiallian@amazon.com>
@github-actions
Contributor

Persistent review updated to latest commit 3cfbe9a

Signed-off-by: Jialiang Liang <jiallian@amazon.com>
@github-actions
Contributor

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit 87bb284.

Path | Line | Severity | Description
integ-test/src/test/java/org/opensearch/sql/ppl/PPLIntegTestCase.java | 172 | low | Test helper `buildRequestWithHighlight` uses String.format to embed the `query` parameter directly into a JSON string without escaping. A query containing double-quotes or backslashes could produce malformed or injected JSON. This is test code and the inputs are controlled, so the risk is minimal, but it represents a JSON injection pattern.
integ-test/src/test/java/org/opensearch/sql/security/CalciteCrossClusterSearchIT.java | 499 | low | The private `executeQueryWithHighlight` helper likewise uses String.format to inline both `query` and `highlightJson` directly into a JSON literal without escaping. Same JSON-injection concern as above; confined to test/integration code with controlled inputs.

The table above displays the top 10 most important findings.

Total: 2 | Critical: 0 | High: 0 | Medium: 0 | Low: 2


Pull Requests Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.


⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.
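One way to address the test-helper findings, assuming the helpers simply String.format the request body: escape the query before interpolation. `escapeJson` and the `buildRequestWithHighlight` shown here are illustrative stand-ins, not the actual helpers in PPLIntegTestCase.

```java
public class JsonEscapeDemo {
    // Minimal JSON string escaping: backslashes, double quotes, and the
    // control characters most likely to appear in PPL query strings.
    public static String escapeJson(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (char c : s.toCharArray()) {
            switch (c) {
                case '\\': sb.append("\\\\"); break;
                case '"':  sb.append("\\\""); break;
                case '\n': sb.append("\\n");  break;
                case '\t': sb.append("\\t");  break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    // Hypothetical safe variant of the test helper: escape before formatting.
    public static String buildRequestWithHighlight(String query, String highlightJson) {
        return String.format(
            "{\"query\": \"%s\", \"highlight\": %s}", escapeJson(query), highlightJson);
    }

    public static void main(String[] args) {
        // PPL full-text queries routinely contain embedded double quotes;
        // escaping keeps the resulting request body valid JSON.
        System.out.println(
            buildRequestWithHighlight("source=accounts \"Holmes\"", "[\"*\"]"));
        // prints: {"query": "source=accounts \"Holmes\"", "highlight": ["*"]}
    }
}
```

A JSON library (e.g., building the body with an object mapper) would be the more robust fix, but even this character-level escaping closes the injection pattern the analyzer flags.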

@opensearch-project opensearch-project deleted a comment from github-actions bot Mar 17, 2026
@opensearch-project opensearch-project deleted a comment from github-actions bot Mar 17, 2026
@opensearch-project opensearch-project deleted a comment from github-actions bot Mar 17, 2026
@opensearch-project opensearch-project deleted a comment from github-actions bot Mar 17, 2026
@opensearch-project opensearch-project deleted a comment from github-actions bot Mar 17, 2026
@opensearch-project opensearch-project deleted a comment from github-actions bot Mar 17, 2026
@penghuo
Collaborator

penghuo commented Mar 17, 2026

Highlight is a scan hint, not a relational operator. Consider dropping LogicalHighlight, HighlightIndexScanRule, and the Highlight AST node. Instead, pass HighlightConfig through CalcitePlanContext and call pushDownHighlight() eagerly in visitRelation() — same pattern as existing PushDownContext operations. Simpler, fewer classes, same result.

cc: @dai-chen

"lastname.keyword": ["@opensearch-dashboards-highlighted-field@Holmes@/opensearch-dashboards-highlighted-field@"]
},
{
"address": ["880 @opensearch-dashboards-highlighted-field@Holmes@/opensearch-dashboards-highlighted-field@ Lane"]
Collaborator

Datarows does not include address; should address have been highlighted?

Collaborator Author

Yes, this is intentional and aligns with OpenSearch DSL behavior. In OpenSearch's _search endpoint, the highlight response is independent of _source filtering — even if you use "_source": ["firstname", "lastname"], the highlight section will still return fragments for any field that matched the query, including address. Our PPL highlight follows the same semantics: "fields": {"*": {}} highlights all fields that matched the search term, regardless of which fields are projected in the schema/datarows. Added a note in the docs to clarify this.

Verified against OpenSearch DSL — _source filtered but highlight still includes address
curl -X POST "localhost:9200/accounts/_search" \
  -H "Content-Type: application/json" \
  -d '{
    "query": {"query_string": {"query": "Holmes"}},
    "_source": ["firstname", "lastname"],
    "highlight": {
      "fields": {"*": {}},
      "pre_tags": ["<b>"],
      "post_tags": ["</b>"]
    }
  }'
{
  "hits": {
    "hits": [
      {
        "_id": "578",
        "_source": { "firstname": "Holmes", "lastname": "Mcknight" },
        "highlight": {
          "firstname": ["<b>Holmes</b>"],
          "firstname.keyword": ["<b>Holmes</b>"]
        }
      },
      {
        "_id": "828",
        "_source": { "firstname": "Blanche", "lastname": "Holmes" },
        "highlight": {
          "lastname": ["<b>Holmes</b>"],
          "lastname.keyword": ["<b>Holmes</b>"]
        }
      },
      {
        "_id": "1",
        "_source": { "firstname": "Amber", "lastname": "Duke" },
        "highlight": {
          "address": ["880 <b>Holmes</b> Lane"]
        }
      }
    ]
  }
}

_source is filtered to only firstname/lastname, but highlight still returns address for account 1.

Signed-off-by: Jialiang Liang <jiallian@amazon.com>
@github-actions
Copy link
Contributor

Persistent review updated to latest commit 1b07536

@RyanL1997
Collaborator Author

Highlight is a scan hint, not a relational operator. Consider dropping LogicalHighlight, HighlightIndexScanRule, and the Highlight AST node. Instead, pass HighlightConfig through CalcitePlanContext and call pushDownHighlight() eagerly in visitRelation() — same pattern as existing PushDownContext operations. Simpler, fewer classes, same result.

Done. Dropped LogicalHighlight, HighlightIndexScanRule, and the Highlight AST node (3 classes deleted). Highlight config now threads through Query → QueryPlan → QueryService → CalcitePlanContext, and pushDownHighlight() is called eagerly in visitRelation() before alias wrapping — same consume-once pattern as other scan hints. Introduced a HighlightPushDown interface in core to bridge the module boundary with opensearch (since CalciteRelNodeVisitor can't reference CalciteLogicalIndexScan directly). Config is cleared immediately after pushdown so it's applied exactly once.

cc @penghuo @dai-chen
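The consume-once pattern described above can be sketched as follows. This is an illustrative stand-in, not the actual CalcitePlanContext/CalciteRelNodeVisitor code: `PlanContext`, `consumeHighlightConfig`, and the simplified `visitRelation` are names invented for the sketch, and HighlightConfig is reduced to a plain Object.

```java
import java.util.Optional;

// Stand-in for CalcitePlanContext: the scan hint is set before the plan is
// visited, applied at the first scan, and cleared so it cannot be applied
// twice (e.g., to a second relation in a join).
class PlanContext {
    private Object highlightConfig; // stands in for HighlightConfig

    public void setHighlightConfig(Object config) { this.highlightConfig = config; }

    // Return the config and clear it in one step (consume-once).
    public Optional<Object> consumeHighlightConfig() {
        Optional<Object> config = Optional.ofNullable(highlightConfig);
        highlightConfig = null;
        return config;
    }
}

public class ConsumeOnceDemo {
    static int pushDownCount = 0;

    // Stand-in for visitRelation(): push the hint down only if still present.
    public static void visitRelation(PlanContext ctx) {
        ctx.consumeHighlightConfig().ifPresent(cfg -> pushDownCount++);
    }

    public static void main(String[] args) {
        PlanContext ctx = new PlanContext();
        ctx.setHighlightConfig(new Object());
        visitRelation(ctx); // first scan: hint applied
        visitRelation(ctx); // second scan: no-op, already consumed
        System.out.println("pushdown applied " + pushDownCount + " time(s)");
        // prints: pushdown applied 1 time(s)
    }
}
```

Clearing inside the consume call is what makes the "applied exactly once" guarantee local to the context rather than depending on every visitor call site remembering to reset it.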

Signed-off-by: Jialiang Liang <jiallian@amazon.com>
@github-actions
Contributor

Persistent review updated to latest commit 1bd30fa

penghuo
penghuo previously approved these changes Mar 19, 2026
public void execute(
UnresolvedPlan plan,
QueryType queryType,
HighlightConfig highlightConfig,
Collaborator

nit: If a third request-level parameter is added (timeout, routing, preference), that's the time to refactor to QueryOptions — the pattern exists (ExecutionContext, PlanContext), and it's a straightforward migration. No need to speculate now.

Signed-off-by: Jialiang Liang <jiallian@amazon.com>
@github-actions
Contributor

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit e2ef1b2.

Path | Line | Severity | Description
integ-test/src/test/java/org/opensearch/sql/ppl/PPLIntegTestCase.java | 165 | low | Test helper uses String.format to construct JSON with user-supplied query and highlightJson values without escaping. If a query string contains double-quote characters, this could produce malformed or injected JSON. Isolated to test code with controlled inputs so impact is minimal, but the pattern is unsafe if reused in production.
ppl/src/main/java/org/opensearch/sql/ppl/domain/PPLQueryRequest.java | 162 | low | validateFieldName only checks for null/empty; it does not restrict special characters (e.g., script injection tokens) in field names before passing them to the OpenSearch HighlightBuilder. While OpenSearch itself is expected to handle this safely, the lack of any allowlist or character-level validation is an anomaly worth noting in a security context.

The table above displays the top 10 most important findings.

Total: 2 | Critical: 0 | High: 0 | Medium: 0 | Low: 2


Pull Requests Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.


⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.
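A character-level allowlist along the lines the analyzer suggests could look like this. The regex and the `validateFieldName` signature here are assumptions for illustration, not the actual PPLQueryRequest code; the allowed character set would need to match whatever field-name conventions the plugin actually supports.

```java
import java.util.regex.Pattern;

public class FieldNameCheck {
    // Allow letters, digits, underscores, wildcard segments, and dots for
    // sub-fields like lastname.keyword; reject everything else (spaces,
    // colons, quotes) before the name reaches the HighlightBuilder.
    private static final Pattern FIELD_NAME =
        Pattern.compile("[A-Za-z0-9_*]+(\\.[A-Za-z0-9_*]+)*");

    public static void validateFieldName(String name) {
        if (name == null || name.isEmpty() || !FIELD_NAME.matcher(name).matches()) {
            throw new IllegalArgumentException("invalid highlight field name: " + name);
        }
    }

    public static void main(String[] args) {
        validateFieldName("*");                // wildcard: highlight all fields
        validateFieldName("lastname.keyword"); // sub-field access
        try {
            validateFieldName("* OR sensitive_field:*"); // injection-style name
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

An allowlist like this complements, rather than replaces, escaping at the query-building layer: validation rejects obviously malformed names early, while escaping guards whatever still reaches the Lucene query string.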

@github-actions
Contributor

Persistent review updated to latest commit e2ef1b2

@RyanL1997
Collaborator Author

RyanL1997 commented Mar 19, 2026

Summarizing an offline conversation with @penghuo and @dai-chen:
For the current response in commit e2ef1b2:

Details
curl -sS -X POST "localhost:9200/_plugins/_ppl" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "source=accounts \"Holmes\"",
    "highlight": {
      "fields": {
        "address": {"fragment_size": 200},
        "firstname": {"number_of_fragments": 3},
        "lastname": {}
      },
      "pre_tags": ["<b>"],
      "post_tags": ["</b>"]
    }
  }' | python3 -m json.tool
{
    "schema": [
        {
            "name": "account_number",
            "type": "bigint"
        },
        {
            "name": "firstname",
            "type": "string"
        },
        {
            "name": "address",
            "type": "string"
        },
        {
            "name": "balance",
            "type": "bigint"
        },
        {
            "name": "gender",
            "type": "string"
        },
        {
            "name": "city",
            "type": "string"
        },
        {
            "name": "employer",
            "type": "string"
        },
        {
            "name": "state",
            "type": "string"
        },
        {
            "name": "age",
            "type": "bigint"
        },
        {
            "name": "email",
            "type": "string"
        },
        {
            "name": "lastname",
            "type": "string"
        }
    ],
    "datarows": [
        [
            578,
            "Holmes",
            "969 Metropolitan Avenue",
            34259,
            "M",
            "Aguila",
            "Cubicide",
            "PA",
            37,
            "holmesmcknight@cubicide.com",
            "Mcknight"
        ],
        [
            828,
            "Blanche",
            "605 Stryker Court",
            44890,
            "F",
            "Loomis",
            "Motovate",
            "KS",
            33,
            "blancheholmes@motovate.com",
            "Holmes"
        ],
        [
            1,
            "Amber",
            "880 Holmes Lane",
            39225,
            "M",
            "Brogan",
            "Pyrami",
            "IL",
            32,
            "amberduke@pyrami.com",
            "Duke"
        ]
    ],
    "highlights": [
        {
            "firstname": [
                "<b>Holmes</b>"
            ]
        },
        {
            "lastname": [
                "<b>Holmes</b>"
            ]
        },
        {
            "address": [
                "880 <b>Holmes</b> Lane"
            ]
        }
    ],
    "total": 3,
    "size": 3
}
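In this format the client has to correlate the top-level `highlights` array with `datarows` by position. A minimal client-side sketch of that correlation, using a trimmed-down version of the response above (the `response` dict and field subset are illustrative, not part of the API):

```python
# Illustrative sketch: merge the positional "highlights" array back into
# the rows of a PPL response. The response below is a trimmed copy of the
# example in this thread (three columns only).
response = {
    "schema": [{"name": "firstname"}, {"name": "lastname"}, {"name": "address"}],
    "datarows": [
        ["Holmes", "Mcknight", "969 Metropolitan Avenue"],
        ["Blanche", "Holmes", "605 Stryker Court"],
        ["Amber", "Duke", "880 Holmes Lane"],
    ],
    "highlights": [
        {"firstname": ["<b>Holmes</b>"]},
        {"lastname": ["<b>Holmes</b>"]},
        {"address": ["880 <b>Holmes</b> Lane"]},
    ],
}

columns = [col["name"] for col in response["schema"]]
merged = []
# "highlights" is aligned with "datarows" by index, so zip pairs each row
# with its highlight map.
for row, highlight in zip(response["datarows"], response["highlights"]):
    record = dict(zip(columns, row))
    # Replace the plain value with its first highlighted fragment where present.
    for field, fragments in highlight.items():
        record[field] = fragments[0]
    merged.append(record)

print(merged[2]["address"])  # 880 <b>Holmes</b> Lane
```

The implicit positional coupling between the two arrays is the weakness of this shape: a client that sorts, filters, or paginates rows must carry the `highlights` entries along by hand.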

We propose that, instead of isolating highlights into a separate top-level JSON section via a dedicated response formatter, we keep them inline with the original schema and data rows as a `_highlight` struct column, and let the frontend handle the transformation for rendering:

curl -sS -X POST "localhost:9200/_plugins/_ppl" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "source=accounts \"Holmes\"",
    "highlight": {
      "fields": {
        "address": {"fragment_size": 200},
        "firstname": {"number_of_fragments": 3},
        "lastname": {}
      },
      "pre_tags": ["<b>"],
      "post_tags": ["</b>"]
    }
  }' | python3 -m json.tool
{
    "schema": [
        {
            "name": "account_number",
            "type": "bigint"
        },
        {
            "name": "firstname",
            "type": "string"
        },
        {
            "name": "address",
            "type": "string"
        },
        {
            "name": "balance",
            "type": "bigint"
        },
        {
            "name": "gender",
            "type": "string"
        },
        {
            "name": "city",
            "type": "string"
        },
        {
            "name": "employer",
            "type": "string"
        },
        {
            "name": "state",
            "type": "string"
        },
        {
            "name": "age",
            "type": "bigint"
        },
        {
            "name": "email",
            "type": "string"
        },
        {
            "name": "lastname",
            "type": "string"
        },
        {
            "name": "_highlight",
            "type": "struct"
        }
    ],
    "datarows": [
        [
            578,
            "Holmes",
            "969 Metropolitan Avenue",
            34259,
            "M",
            "Aguila",
            "Cubicide",
            "PA",
            37,
            "holmesmcknight@cubicide.com",
            "Mcknight",
            {
                "firstname": [
                    "<b>Holmes</b>"
                ]
            }
        ],
        [
            828,
            "Blanche",
            "605 Stryker Court",
            44890,
            "F",
            "Loomis",
            "Motovate",
            "KS",
            33,
            "blancheholmes@motovate.com",
            "Holmes",
            {
                "lastname": [
                    "<b>Holmes</b>"
                ]
            }
        ],
        [
            1,
            "Amber",
            "880 Holmes Lane",
            39225,
            "M",
            "Brogan",
            "Pyrami",
            "IL",
            32,
            "amberduke@pyrami.com",
            "Duke",
            {
                "address": [
                    "880 <b>Holmes</b> Lane"
                ]
            }
        ]
    ],
    "total": 3,
    "size": 3
}
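With the proposed shape, each row carries its own `_highlight` struct, so no positional correlation is needed and rows can be reordered or filtered freely. A minimal rendering sketch under that assumption (the `render_row` helper and the trimmed four-column row are illustrative, not part of the API):

```python
# Illustrative sketch: render a row from the proposed response shape, where
# the last column is a "_highlight" struct keyed by field name. The schema
# and row below are a trimmed copy of the example in this thread.
schema = ["firstname", "lastname", "address", "_highlight"]
datarows = [
    ["Amber", "Duke", "880 Holmes Lane", {"address": ["880 <b>Holmes</b> Lane"]}],
]

def render_row(row):
    record = dict(zip(schema, row))
    highlight = record.pop("_highlight") or {}
    # Prefer the first highlighted fragment when the field has one,
    # otherwise fall back to the plain value.
    return {field: highlight.get(field, [value])[0]
            for field, value in record.items()}

print(render_row(datarows[0])["address"])  # 880 <b>Holmes</b> Lane
```

Because `_highlight` is just another struct column in the schema, existing clients that ignore unknown columns keep working, and highlight-aware frontends can consume it per row.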

Signed-off-by: Jialiang Liang <jiallian@amazon.com>
@github-actions
Contributor

Persistent review updated to latest commit d289025

Signed-off-by: Jialiang Liang <jiallian@amazon.com>
@github-actions
Contributor

Persistent review updated to latest commit 08d3c20

@RyanL1997 RyanL1997 merged commit 94649fb into opensearch-project:main Mar 19, 2026
38 checks passed

Labels

feature · PPL (Piped processing language) · v3.6.0 (issues targeting release v3.6.0)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEATURE RFC] PPL Search Result Highlighting
[FEATURE] Supporting Query Highlight Feature into PPL API

3 participants