@lyne7-sc commented Jan 1, 2026

Which issue does this PR close?

Rationale for this change

This PR improves the performance of the substring_index function by optimizing
delimiter search and substring extraction:

  • Single-byte fast path: introduces a specialized byte-based search for single-byte delimiters (e.g. "." or ","), avoiding UTF-8 pattern matching overhead.
  • Efficient index discovery: replaces the split-and-sum-length approach with direct index location using match_indices / rmatch_indices.
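
For readers skimming the PR, the index-discovery idea can be sketched roughly as below. This is a minimal illustration, not the code in this PR: the function name `substring_index` and its signature are hypothetical, and it assumes the usual SUBSTRING_INDEX semantics (a positive count keeps everything left of the count-th delimiter, a negative count keeps everything right of the count-th delimiter from the end, and the whole string is returned when there are too few delimiters).

```rust
/// Hypothetical sketch of direct index discovery; not the PR's implementation.
/// Assumes `count == 0` or an empty delimiter yields an empty string.
fn substring_index<'a>(s: &'a str, delim: &str, count: i64) -> &'a str {
    if count == 0 || delim.is_empty() {
        return "";
    }
    // `nth` is zero-based, so the count-th match sits at index count - 1.
    let n = (count.unsigned_abs() - 1) as usize;
    if count > 0 {
        // Scan from the left and cut just before the n-th match.
        match s.match_indices(delim).nth(n) {
            Some((idx, _)) => &s[..idx],
            None => s, // fewer matches than requested: keep the whole string
        }
    } else {
        // Scan from the right and cut just after the n-th match.
        match s.rmatch_indices(delim).nth(n) {
            Some((idx, _)) => &s[idx + delim.len()..],
            None => s,
        }
    }
}
```

The point of the sketch is that the byte offset comes straight out of the iterator, so no intermediate pieces need to be collected or have their lengths summed.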

What changes are included in this PR?

  • Added a fast path for delimiter.len() == 1 using byte-based search.
  • Refactored the general path to use match_indices and rmatch_indices for more efficient positioning.
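
A rough sketch of what a single-byte fast path could look like follows. The name `substring_index_byte` is hypothetical and the body is an illustration rather than the code in this PR. It relies on the fact that a one-byte `&str` is always ASCII, and an ASCII byte never occurs inside a multi-byte UTF-8 sequence, so slicing at a matched position always lands on a character boundary.

```rust
/// Hypothetical sketch of a byte-based fast path; not the PR's implementation.
/// `delim` is assumed to be the single ASCII byte of a one-byte delimiter.
fn substring_index_byte(s: &str, delim: u8, count: i64) -> &str {
    debug_assert!(delim.is_ascii());
    if count == 0 {
        return "";
    }
    // `nth` is zero-based, so the count-th match sits at index count - 1.
    let n = (count.unsigned_abs() - 1) as usize;
    let bytes = s.as_bytes();
    if count > 0 {
        // Compare raw bytes from the left; no UTF-8 pattern matching needed.
        let hit = bytes
            .iter()
            .enumerate()
            .filter(|&(_, &b)| b == delim)
            .map(|(i, _)| i)
            .nth(n);
        match hit {
            Some(i) => &s[..i],
            None => s,
        }
    } else {
        // Same scan, walking from the right.
        let hit = bytes
            .iter()
            .enumerate()
            .rev()
            .filter(|&(_, &b)| b == delim)
            .map(|(i, _)| i)
            .nth(n);
        match hit {
            Some(i) => &s[i + 1..],
            None => s,
        }
    }
}
```

A caller would presumably dispatch to this only when `delimiter.len() == 1`, passing `delimiter.as_bytes()[0]`, and fall back to the general path otherwise.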

Benchmarks

  • Single-byte delimiter benchmarks show a ~2.1–3.5× speedup across batch sizes.
  • Multi-byte delimiters improve by ~12% at batch sizes 1000 and 10000; at batch size 100 the difference (~1%) is within noise.
group                                               main_substrindex                       perf_substrindex
-----                                               ----------------                       ----------------
substr_index/substr_index_10000_long_delimiter      1.12   548.2±18.39µs        ? ?/sec    1.00   488.4±15.62µs        ? ?/sec
substr_index/substr_index_10000_single_delimiter    2.14   543.4±15.12µs        ? ?/sec    1.00    254.0±7.80µs        ? ?/sec
substr_index/substr_index_1000_long_delimiter       1.12     43.1±1.63µs        ? ?/sec    1.00     38.6±2.03µs        ? ?/sec
substr_index/substr_index_1000_single_delimiter     3.51     46.4±2.21µs        ? ?/sec    1.00     13.2±0.99µs        ? ?/sec
substr_index/substr_index_100_long_delimiter        1.01      3.7±0.18µs        ? ?/sec    1.00      3.7±0.20µs        ? ?/sec
substr_index/substr_index_100_single_delimiter      2.15      3.6±0.14µs        ? ?/sec    1.00  1675.9±79.15ns        ? ?/sec

Are these changes tested?

  • Yes, existing unit tests pass.
  • New benchmarks added to verify performance improvement.

Are there any user-facing changes?

No.

@github-actions (bot) added the functions label (Changes to functions implementation) on Jan 1, 2026
@uzqw left a comment

Hi! I noticed some Clippy warnings when running locally:
cargo clippy -p datafusion-functions --all-targets --all-features

match string.get(string.len().saturating_sub(length)..) {
Some(substring) => builder.append_value(substring),
None => builder.append_null(),
if n > 0 {

this else { if .. } block can be collapsed
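
For context, this looks like Clippy's `collapsible_else_if` lint. A small, self-contained illustration of the before and after shapes is below; the function names and return values are made up for the example and are not taken from this PR.

```rust
// Shape that the lint flags: an `else` block containing only an `if`.
#[allow(clippy::collapsible_else_if)]
fn direction_nested(n: i64) -> &'static str {
    if n > 0 {
        "from_left"
    } else {
        if n < 0 {
            "from_right"
        } else {
            "empty"
        }
    }
}

// Collapsed form: same behavior, one less level of nesting.
fn direction_collapsed(n: i64) -> &'static str {
    if n > 0 {
        "from_left"
    } else if n < 0 {
        "from_right"
    } else {
        "empty"
    }
}
```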


for batch_size in batch_sizes {
group.bench_function(
&format!("substr_index_{}_single_delimiter", batch_size),

variables can be used directly in the format! string
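
This appears to be Clippy's `uninlined_format_args` lint. A minimal sketch of the inlined form follows; the helper `bench_name` and its `kind` parameter are hypothetical and only serve to make the example self-contained.

```rust
// Hypothetical helper showing captured identifiers inside the format string.
fn bench_name(batch_size: usize, kind: &str) -> String {
    // Instead of format!("substr_index_{}_{}_delimiter", batch_size, kind)
    format!("substr_index_{batch_size}_{kind}_delimiter")
}
```

Applied to the benchmark code, the first call would become `&format!("substr_index_{batch_size}_single_delimiter")`.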

);

group.bench_function(
&format!("substr_index_{}_long_delimiter", batch_size),

variables can be used directly in the format! string
