Different tokenisation results between oniguruma and RE2 #225

Closed
@bzz

Right now enry uses a regex-based tokeniser (until #218 at least).

We have two drop-in replacement regex engines: RE2, the default in Go, and oniguruma, the default in Ruby.

Recent improvements made during the Go module migration (#219) surfaced a new issue: the tokeniser seems to produce slightly different results depending on which regex engine is used :/

More specifically, the token frequencies built from the linguist samples differ between engines. The high-level code-generator test catches this by comparing the output with a fixture (pre-generated with RE2), and it fails on oniguruma profiles, as in this #219 (comment).

We need to find the exact cause and, depending on it, decide whether we want to support two versions of the fixtures or change something so that the output no longer differs.
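To narrow down the cause, it may help to diff the token frequencies produced under each engine rather than only failing on the fixture mismatch. Below is a minimal sketch of such a comparison. The names `countTokens`, `diffFreqs` and `FreqDiff` are hypothetical helpers and not part of the existing code base, and the two token lists in `main` are made-up stand-ins for real tokeniser output from each build.

```go
package main

import (
	"fmt"
	"sort"
)

// FreqDiff records one token whose count differs between two tokeniser runs.
// Hypothetical type, introduced only for this illustration.
type FreqDiff struct {
	Token string
	A, B  int // counts from the first and second run
}

// countTokens builds a token -> frequency map from a token list.
func countTokens(tokens []string) map[string]int {
	freq := make(map[string]int, len(tokens))
	for _, t := range tokens {
		freq[t]++
	}
	return freq
}

// diffFreqs returns every token whose frequency differs between a and b,
// sorted by token so the report is stable across runs.
func diffFreqs(a, b map[string]int) []FreqDiff {
	var diffs []FreqDiff
	for tok, n := range a {
		if m := b[tok]; m != n {
			diffs = append(diffs, FreqDiff{Token: tok, A: n, B: m})
		}
	}
	// Tokens that appear only in b were not visited by the loop above.
	for tok, m := range b {
		if _, ok := a[tok]; !ok && m != 0 {
			diffs = append(diffs, FreqDiff{Token: tok, A: 0, B: m})
		}
	}
	sort.Slice(diffs, func(i, j int) bool { return diffs[i].Token < diffs[j].Token })
	return diffs
}

func main() {
	// Made-up sample data: in practice these would be the tokens emitted
	// for the same input by the RE2 build and by the oniguruma build.
	re2Tokens := []string{"func", "main", "(", ")", "func"}
	onigTokens := []string{"func", "main", "(", ")", "fu", "nc"}

	for _, d := range diffFreqs(countTokens(re2Tokens), countTokens(onigTokens)) {
		fmt.Printf("%q: RE2=%d oniguruma=%d\n", d.Token, d.A, d.B)
	}
}
```

Running a comparison like this per sample file would point at the specific tokens, and therefore the specific inputs, on which the two engines disagree.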

This potentially also affects #194.
