Different tokenisation results between oniguruma and RE2 #225

Right now enry uses a regex-based tokeniser (until #218 at least).

We have 2 drop-in replacement regex engines: RE2, the default in Go, and [oniguruma](https://github.com/kkos/oniguruma), the default in Ruby.

Recent improvements done as part of the Go module migration in #219 surfaced a new issue: it seems that the tokeniser produces slightly different results depending on which regex engine is used :/

More specifically, the token frequencies built from the linguist samples differ. The high-level code-generator test catches this by comparing against a fixture (pre-generated with RE2) and fails on oniguruma profiles, like this: https://github.com/src-d/enry/pull/219#issuecomment-482525632

We need to find the exact reason and, depending on it, decide whether we want to support 2 versions of the fixtures or change something so that there is no difference in output.
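
One way to start narrowing this down might be to diff the two token-frequency tables directly and look at which tokens diverge. Below is a minimal sketch in Go, assuming the frequencies from each engine have already been collected into plain `map[string]int` values. The names `freqDiff` and `diffTokenFreqs` and the sample data in `main` are hypothetical and not part of enry.

```go
package main

import (
	"fmt"
	"sort"
)

// freqDiff records one token whose count differs between the two runs.
// Hypothetical type, for illustration only.
type freqDiff struct {
	Token     string
	RE2       int // count seen with RE2 (0 if the token is absent)
	Oniguruma int // count seen with oniguruma (0 if the token is absent)
}

// diffTokenFreqs returns every token whose frequency differs between
// the two maps, sorted by token so the report is stable across runs.
func diffTokenFreqs(re2, onig map[string]int) []freqDiff {
	var diffs []freqDiff
	// Tokens present in the RE2 table: compare against oniguruma,
	// where a missing key reads as 0.
	for tok, n := range re2 {
		if m := onig[tok]; m != n {
			diffs = append(diffs, freqDiff{Token: tok, RE2: n, Oniguruma: m})
		}
	}
	// Tokens that only oniguruma produced.
	for tok, m := range onig {
		if _, seen := re2[tok]; !seen && m != 0 {
			diffs = append(diffs, freqDiff{Token: tok, RE2: 0, Oniguruma: m})
		}
	}
	sort.Slice(diffs, func(i, j int) bool { return diffs[i].Token < diffs[j].Token })
	return diffs
}

func main() {
	// Made-up sample data, not real enry output.
	re2 := map[string]int{"func": 3, "return": 2}
	onig := map[string]int{"func": 3, "return": 1, "ret": 1}

	for _, d := range diffTokenFreqs(re2, onig) {
		fmt.Printf("%q: RE2=%d oniguruma=%d\n", d.Token, d.RE2, d.Oniguruma)
	}
}
```

Running something like this per linguist sample, rather than on the aggregated frequencies, would probably point at the specific inputs where the two engines split tokens differently.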

This also potentially affects #194 
