The documentation reads as if the returned indices were character indices: "All .._indices functions will also compute the indices of the matched characters."
The example code below shows that this is not the case.
Instead, they are indices into nucleo's Utf32Str, which is built by taking the first character of each grapheme (cluster):
Lines 185 to 207 in 5b74652
/// Nucleo cannot match graphemes as single units. To work around
/// that we only use the first codepoint of each grapheme. This
/// iterator returns the first character of each unicode grapheme
/// in a string and is used for constructing `Utf32Str(ing)`.
pub fn graphemes(text: &str) -> impl Iterator<Item = char> + '_ {
    #[cfg(feature = "unicode-segmentation")]
    let res = text.graphemes(true).map(|grapheme| {
        // we need to special-case this check since `\r\n` is a single grapheme and is
        // therefore the exception to the rule that normalization of a grapheme should
        // map to the first character.
        if grapheme == "\r\n" {
            '\n'
        } else {
            grapheme
                .chars()
                .next()
                .expect("graphemes must be non-empty")
        }
    });
    #[cfg(not(feature = "unicode-segmentation"))]
    let res = text.chars();
    res
}
This only applies if the unicode-segmentation feature is active (it is on by default). Without it, the indices really are character indices.
Helix (which highlights matches in the picker, for example) also treats them as grapheme "indices".
Can you please clarify this in the docs, and perhaps provide an example showing how to use those indices with the original haystack string?
A further note about the example below: it was compiled against the latest version from GitHub. The crates.io version does not work at all here and produces weird grapheme segmentation, so please publish a new release.
use nucleo_matcher::pattern::{Atom, AtomKind, CaseMatching, Normalization};
use nucleo_matcher::Utf32Str;
use unicode_segmentation::UnicodeSegmentation;

fn test(haystack: &str, needle: &str) {
    let mut matcher = nucleo_matcher::Matcher::new(nucleo_matcher::Config::DEFAULT);
    let atom = Atom::new(
        needle,
        CaseMatching::default(),
        Normalization::default(),
        AtomKind::Substring,
        false,
    );
    let mut buf = Vec::new();
    let nucleo_string = Utf32Str::new(haystack, &mut buf);
    let characters = haystack.chars().collect::<Vec<_>>();
    let graphemes = UnicodeSegmentation::graphemes(haystack, true).collect::<Vec<_>>();
    let nucleo_chars = nucleo_string.chars().collect::<Vec<_>>();
    let matches = {
        let mut m = Vec::new();
        atom.indices(nucleo_string, &mut matcher, &mut m);
        m.into_iter().map(|a| a as usize).collect::<Vec<_>>()
    };
    println!("haystack: {}", haystack);
    println!("needle : {}", needle);
    println!("characters : {:?}", characters);
    println!("graphemes : {:?}", graphemes);
    println!("nucleo chars: {:?}", nucleo_chars);
    println!("matching indices: {:?}", matches);
    println!("matching character: {:?}", characters.get(matches[0]));
    println!("matching grapheme: {:?}", graphemes.get(matches[0]));
    println!("matching nucleo chars: {:?}", nucleo_chars.get(matches[0]));
}

fn main() {
    test("abx", "x");
    println!();
    test("g̈bx", "x");
}

Output:

haystack: abx
needle : x
characters : ['a', 'b', 'x']
graphemes : ["a", "b", "x"]
nucleo chars: ['a', 'b', 'x']
matching indices: [2]
matching character: Some('x')
matching grapheme: Some("x")
matching nucleo chars: Some('x')

haystack: g̈bx
needle : x
characters : ['g', '\u{308}', 'b', 'x']
graphemes : ["g\u{308}", "b", "x"]
nucleo chars: ['g', 'b', 'x']
matching indices: [2]
matching character: Some('b')
matching grapheme: Some("x")
matching nucleo chars: Some('x')
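
As for the kind of doc example I have in mind, here is a rough sketch of mapping the reported indices back onto the original haystack as byte ranges. It is not nucleo's API: `char_starts` and `byte_ranges` are hypothetical helper names, and the grapheme start offsets are hard-coded so the sketch compiles with std only. With the unicode-segmentation crate they would presumably come from `haystack.grapheme_indices(true)`.

```rust
use std::ops::Range;

/// Byte offset at which each unit starts when there is one unit per `char`
/// (the case where the `unicode-segmentation` feature is disabled).
fn char_starts(haystack: &str) -> Vec<usize> {
    haystack.char_indices().map(|(offset, _)| offset).collect()
}

/// Translate unit indices (as returned by the `indices` functions) into byte
/// ranges of the original haystack. `unit_starts` holds the byte offset of
/// every unit; out-of-range indices are skipped.
fn byte_ranges(unit_starts: &[usize], haystack_len: usize, indices: &[u32]) -> Vec<Range<usize>> {
    indices
        .iter()
        .filter_map(|&i| {
            let i = i as usize;
            let start = *unit_starts.get(i)?;
            // A unit ends where the next one starts, or at the end of the string.
            let end = unit_starts.get(i + 1).copied().unwrap_or(haystack_len);
            Some(start..end)
        })
        .collect()
}

fn main() {
    let haystack = "g\u{308}bx";
    let matched = [2u32]; // index reported for needle "x" in the output above

    // Assumption: grapheme starts written out by hand for this haystack
    // ("g\u{308}" occupies bytes 0..3, "b" 3..4, "x" 4..5).
    let grapheme_starts = [0usize, 3, 4];
    for range in byte_ranges(&grapheme_starts, haystack.len(), &matched) {
        println!("as grapheme index: {:?}", &haystack[range]);
    }

    // Treating the same index as a char index picks the wrong text.
    for range in byte_ranges(&char_starts(haystack), haystack.len(), &matched) {
        println!("as char index: {:?}", &haystack[range]);
    }
}
```

Interpreting index 2 as a grapheme index selects "x", while interpreting it as a char index selects "b", which is the mismatch shown in the output above.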