nucleo-matcher documentation: Please clarify what matching "indices" actually are. #85

@markus-bauer

The documentation reads as if these were character indices: "All .._indices functions will also compute the indices of the matched characters."
The example code below shows that this is not the case.

Instead, they are indices into nucleo's Utf32Str, which is built by taking the first character of each grapheme (cluster):

nucleo/matcher/src/chars.rs

Lines 185 to 207 in 5b74652

/// Nucleo cannot match graphemes as single units. To work around
/// that we only use the first codepoint of each grapheme. This
/// iterator returns the first character of each unicode grapheme
/// in a string and is used for constructing `Utf32Str(ing)`.
pub fn graphemes(text: &str) -> impl Iterator<Item = char> + '_ {
    #[cfg(feature = "unicode-segmentation")]
    let res = text.graphemes(true).map(|grapheme| {
        // we need to special-case this check since `\r\n` is a single grapheme and is
        // therefore the exception to the rule that normalization of a grapheme should
        // map to the first character.
        if grapheme == "\r\n" {
            '\n'
        } else {
            grapheme
                .chars()
                .next()
                .expect("graphemes must be non-empty")
        }
    });
    #[cfg(not(feature = "unicode-segmentation"))]
    let res = text.chars();
    res
}

This only applies if the unicode-segmentation feature is active (it is on by default); otherwise they really are character indices.
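For the case without grapheme segmentation, a minimal std-only sketch of mapping character indices back to byte ranges in the haystack could look like the following. The function name char_byte_ranges is hypothetical and not part of nucleo, the u32 index type is inferred from the cast in the example further down, and the index 3 used in main is an assumed result, not output captured from nucleo.

```rust
use std::ops::Range;

/// Hypothetical helper (not part of nucleo): maps character indices to the
/// byte ranges those characters occupy in `haystack`.
fn char_byte_ranges(haystack: &str, indices: &[u32]) -> Vec<Range<usize>> {
    // Byte range of every `char`, in order of appearance.
    let spans: Vec<Range<usize>> = haystack
        .char_indices()
        .map(|(start, c)| start..start + c.len_utf8())
        .collect();

    // Out-of-range indices are silently skipped.
    indices
        .iter()
        .filter_map(|&i| spans.get(i as usize).cloned())
        .collect()
}

fn main() {
    let haystack = "g\u{308}bx";
    // Assumption: counted in characters, 'x' sits at index 3
    // ('g', U+0308, 'b', 'x').
    for range in char_byte_ranges(haystack, &[3]) {
        println!("{:?} -> {:?}", range, &haystack[range.clone()]);
    }
}
```

The resulting byte ranges can be used to slice the original string directly, which is usually what a highlighter needs.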

helix, which highlights matches in its picker, for example, also treats them as grapheme "indices":

https://github.com/helix-editor/helix/blob/cfb5158cd1b3e5e1962eda66e673c0c35b786046/helix-term/src/ui/picker.rs#L786-L790

Can you please clarify this in the docs, and perhaps provide an example showing how to use those indices with the original haystack string?
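To illustrate the kind of example I mean, here is a rough sketch of how grapheme indices might be mapped back to byte ranges in the original haystack. The helper grapheme_byte_ranges is hypothetical and not part of nucleo. To keep the sketch dependency-free, the grapheme segmentation is written out by hand; in real code it would presumably come from UnicodeSegmentation::graphemes(haystack, true). The u32 index type is an assumption based on the cast in the example further down.

```rust
use std::ops::Range;

/// Hypothetical helper (not part of nucleo): maps grapheme indices to byte
/// ranges in the haystack. `graphemes` must yield the haystack's grapheme
/// clusters in order, so that their lengths add up to byte offsets.
fn grapheme_byte_ranges<'a>(
    graphemes: impl IntoIterator<Item = &'a str>,
    indices: &[u32],
) -> Vec<Range<usize>> {
    // Accumulate the byte range covered by each grapheme.
    let mut offset = 0;
    let spans: Vec<Range<usize>> = graphemes
        .into_iter()
        .map(|g| {
            let start = offset;
            offset += g.len();
            start..offset
        })
        .collect();

    // Out-of-range indices are silently skipped.
    indices
        .iter()
        .filter_map(|&i| spans.get(i as usize).cloned())
        .collect()
}

fn main() {
    let haystack = "g\u{308}bx";
    // Hand-written segmentation, standing in for a real grapheme iterator.
    let graphemes = ["g\u{308}", "b", "x"];
    // Index 2 is the third grapheme, "x".
    for range in grapheme_byte_ranges(graphemes, &[2]) {
        println!("{:?} -> {:?}", range, &haystack[range.clone()]);
    }
}
```

Working with byte ranges rather than single positions keeps the whole cluster together, so a combining mark such as U+0308 is highlighted along with its base character.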

A further note about the example below: it was compiled against the newest version from GitHub. The crates.io version doesn't work at all and produces weird grapheme segmentation, so please push a new release.

use nucleo_matcher::pattern::{Atom, AtomKind, CaseMatching, Normalization};
use nucleo_matcher::Utf32Str;
use unicode_segmentation::UnicodeSegmentation;

fn test(haystack: &str, needle: &str) {
    let mut matcher = nucleo_matcher::Matcher::new(nucleo_matcher::Config::DEFAULT);

    let atom = Atom::new(
        needle,
        CaseMatching::default(),
        Normalization::default(),
        AtomKind::Substring,
        false,
    );

    let mut buf = Vec::new();
    let nucleo_string = Utf32Str::new(haystack, &mut buf);

    let characters = haystack.chars().collect::<Vec<_>>();
    let graphemes = UnicodeSegmentation::graphemes(haystack, true).collect::<Vec<_>>();
    let nucleo_chars = nucleo_string.chars().collect::<Vec<_>>();

    let matches = {
        let mut m = Vec::new();
        atom.indices(nucleo_string, &mut matcher, &mut m);
        m.into_iter().map(|a| a as usize).collect::<Vec<_>>()
    };

    println!("haystack: {}", haystack);
    println!("needle  : {}", needle);

    println!("characters  : {:?}", characters);
    println!("graphemes   : {:?}", graphemes);
    println!("nucleo chars: {:?}", nucleo_chars);

    println!("matching indices: {:?}", matches);
    println!("matching character: {:?}", characters.get(matches[0]));
    println!("matching grapheme: {:?}", graphemes.get(matches[0]));
    println!("matching nucleo chars: {:?}", nucleo_chars.get(matches[0]));
}

fn main() {
    test("abx", "x");
    println!();
    test("g̈bx", "x");
}

Output:

haystack: abx
needle  : x
characters  : ['a', 'b', 'x']
graphemes   : ["a", "b", "x"]
nucleo chars: ['a', 'b', 'x']
matching indices: [2]
matching character: Some('x')
matching grapheme: Some("x")
matching nucleo chars: Some('x')

haystack: g̈bx
needle  : x
characters  : ['g', '\u{308}', 'b', 'x']
graphemes   : ["g\u{308}", "b", "x"]
nucleo chars: ['g', 'b', 'x']
matching indices: [2]
matching character: Some('b')
matching grapheme: Some("x")
matching nucleo chars: Some('x')
