Skip to content

ESQL: Pushdown constructs doing case-insensitive regexes #128393

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

bpintea
Copy link
Contributor

@bpintea bpintea commented May 23, 2025

This introduces an optimization to pushdown to Lucense those language constructs that aim at case-insensitive regular expression matching, used with LIKE and RLIKE operators, such as:

  • | WHERE TO_LOWER(field) LIKE "abc*"
  • | WHERE TO_UPPER(field) RLIKE "ABC.*"

These are now pushed as case-insensitive wildcard and regexp respectively queries down to Lucene.

Closes #127479

This introduces an optimization to pushdown to Lucense those language
constructs that aim case-insensitive regular expression matching, used
with LIKE and RLIKE operators, such as:
* `| WHERE TO_LOWER(field) LIKE "abc*"`
* `| WHERE TO_UPPER(field) RLIKE `ABC.*`
These are now pushed as case-insensitive `regexp` and `wildcard`
respectively queries down to Lucene.
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label May 23, 2025
@elasticsearchmachine
Copy link
Collaborator

Hi @bpintea, I've created a changelog YAML for you.

Comment on lines +85 to +93
if (caseInsensitive() && out.getTransportVersion().before(TransportVersions.ESQL_REGEX_MATCH_WITH_CASE_INSENSITIVITY)) {
// The plan has been optimized to run a case-insensitive match, which the remote peer cannot be notified of. Simply avoiding
// the serialization of the boolean would result in wrong results.
throw new EsqlIllegalArgumentException(
NAME + " with case insensitivity is not supported in peer node's version [{}]. Upgrade to version [{}] or newer.",
out.getTransportVersion(),
TransportVersions.ESQL_REGEX_MATCH_WITH_CASE_INSENSITIVITY
);
}
Copy link
Contributor Author

@bpintea bpintea May 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pointing out this fast-failure here // bwc issue.

We haven't been serialising the case-insensitivity of the regexp operators, so far. We now need to.

Not sure we have an alternative to failing the query in this case, without introducing some compensating mechanisms like rerouting queries to old nodes for planning. The planner isn't currently aware of the versions it should plan for.

Introducing new operators (RLIKE2?) would still fail on old nodes; but it would maybe allow the user to opt out of the optimisation, so that the query on mixed-versions won't fail?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another case for us needing to know if we can enable the optimization based on the minimum node version, or something like it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't make these decisions on the local node only?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case we could be safe, see my comment below, but in general I agree that having minimum node version (or node features, or whatever) at planning level would help a lot.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't make these decisions on the local node only?

We probably should be able to pull that into a rule that applies locally-only, yes.. Thanks!
That would allows to apply this optimisation, tough the serialisation issue itself will remain.

@luigidellaquila luigidellaquila self-requested a review May 23, 2025 15:22
Copy link
Contributor

@luigidellaquila luigidellaquila left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a quick look and left a first round of comments, the implementation looks good in general.

My real concern is on the breaking change: TO_LOWER and TO_UPPER are the most used functions in ES|QL, it means that introducing a breaking change here would impact most of our users.
Probably we have an escape hatch though, see my comment below.

FROM employees
| KEEP emp_no, first_name
| SORT emp_no
| WHERE TO_UPPER(first_name) LIKE "GEOR*"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add a couple of negative tests, eg. where TO_UPPER tries to match a lowercase pattern

FROM employees
| KEEP emp_no, first_name
| SORT emp_no
| WHERE TO_UPPER(first_name) LIKE "geor*"

10055 |Georgy
;

likeWithLower
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we decide to accept the breaking change, you'll need at least a new capability, otherwise these tests will fail on mixed clusters.

@@ -78,6 +78,8 @@
import org.elasticsearch.xpack.esql.expression.function.scalar.multivalue.MvSum;
import org.elasticsearch.xpack.esql.expression.function.scalar.nulls.Coalesce;
import org.elasticsearch.xpack.esql.expression.function.scalar.string.Concat;
import org.elasticsearch.xpack.esql.expression.function.scalar.string.RLike;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we'll also need more unit tests, ie. a few more cases in WildcardLikeTests and RLikeTests

}

@Override
public void writeTo(StreamOutput out) throws IOException {
source().writeTo(out);
out.writeNamedWriteable(field());
out.writeString(pattern().asJavaRegex());
if (caseInsensitive() && out.getTransportVersion().before(TransportVersions.ESQL_REGEX_MATCH_WITH_CASE_INSENSITIVITY)) {
Copy link
Contributor

@luigidellaquila luigidellaquila May 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we just re-add the to_upper/lower() function on the fly if we are on an old transport version?

We'll have to be careful with layouts and NameIDs, but probably it's not impossible.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would also give us a bwc advantage if we decide to add case insensitive operators to the grammar.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clever idea: serialize foo rlike "aAa" to --> to_lower(foo) rlike to_lower "aAa".
This is best done in the planner but since we don't have the versioning in place, we can do this locally for rlike and like.

Another suggestion that might be cleaner would be to perform the optimization on the data nodes only, not on the coordinator to avoid the difference in serialization.

Copy link
Member

@costin costin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for picking this one up! LGTM to me once the serialization/compatibility stuff gets sorted out.

@@ -21,9 +21,10 @@ public RLikePattern(String regexpPattern) {
}

@Override
public Automaton createAutomaton() {
public Automaton createAutomaton(boolean ignoreCase) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Expose ignoreCase as a property in StringPattern since it affects both the Automaton and javaRegex. The former can contain the mode but the latter doesn't so we need a way to bubble it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above - make the parameter a class property.

public static String luceneWildcardToRegExp(String wildcard) {
StringBuilder regex = new StringBuilder();

for (int i = 0, wcLen = wildcard.length(); i < wcLen; i++) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

}

@Override
public void writeTo(StreamOutput out) throws IOException {
source().writeTo(out);
out.writeNamedWriteable(field());
out.writeString(pattern().asJavaRegex());
if (caseInsensitive() && out.getTransportVersion().before(TransportVersions.ESQL_REGEX_MATCH_WITH_CASE_INSENSITIVITY)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clever idea: serialize foo rlike "aAa" to --> to_lower(foo) rlike to_lower "aAa".
This is best done in the planner but since we don't have the versioning in place, we can do this locally for rlike and like.

Another suggestion that might be cleaner would be to perform the optimization on the data nodes only, not on the coordinator to avoid the difference in serialization.

@costin costin self-requested a review May 23, 2025 21:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/ES|QL AKA ESQL >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v9.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ESQL: pushdown to_upper/to_lower in like/rlike
5 participants