-
Notifications
You must be signed in to change notification settings - Fork 13.1k
Fix and Enhance Scanning Involving & and - in Regular Expressions in Unicode Sets Mode
#62716
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Fix and Enhance Scanning Involving & and - in Regular Expressions in Unicode Sets Mode
#62716
Conversation
… in Unicode Sets Mode - Removes the incorrect errors on individual `&`s (not `&&`s) at certain positions - Flags `-`s at the beginning and the end of class sets - Improves error recovery and makes diagnostic messages more intuitive and user-friendly when there are consecutive `&`s or `-`s or when operators are mixed within the same class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This pull request improves the error recovery and diagnostic messages for regular expression Unicode sets mode (/v flag) in TypeScript. The changes focus on handling complex operator combinations (&& for intersection and -- for subtraction) in character classes, providing more user-friendly error messages and better error recovery for malformed regex patterns.
Key changes:
- Enhanced scanner logic to better detect and report errors in Unicode sets mode regular expressions
- Added two new diagnostic messages (error codes 1547 and 1548) for clearer error reporting
- Comprehensive test coverage for various edge cases involving
&,-,&&, and--operators
Reviewed Changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
tests/cases/compiler/regularExpressionUnicodeSets.ts |
New comprehensive test file covering 188 regex patterns with various operator combinations |
tests/baselines/reference/regularExpressionUnicodeSets.types |
Baseline file capturing type information for the test cases |
tests/baselines/reference/regularExpressionUnicodeSets.symbols |
Baseline file capturing symbol information for the test cases |
tests/baselines/reference/regularExpressionUnicodeSets.js |
Baseline file showing compiled JavaScript output |
tests/baselines/reference/regularExpressionUnicodeSets.errors.txt |
Baseline file documenting the 226 expected error messages |
tests/baselines/reference/regularExpressionScanning(target=esnext).errors.txt |
Updated baseline to reflect new error messages |
tests/baselines/reference/regularExpressionScanning(target=es5).errors.txt |
Updated baseline to reflect new error messages |
tests/baselines/reference/regularExpressionScanning(target=es2015).errors.txt |
Updated baseline to reflect new error messages |
src/compiler/scanner.ts |
Enhanced regex scanner with improved error recovery for Unicode sets mode |
src/compiler/diagnosticMessages.json |
Added two new error messages (codes 1547 and 1548) |
| // If we see '-', parse the expression as ClassUnion. | ||
| // An exception is made for '-&&', in which case it is more intuitive to parse as ClassIntersection and scan '-' as an invalid operand | ||
| else if ( | ||
| ch === CharacterCodes.minus && ( | ||
| charCodeChecked(pos + 1) !== CharacterCodes.ampersand || charCodeChecked(pos + 2) !== CharacterCodes.ampersand | ||
| ) | ||
| ) { | ||
| // '-' can't be an operand in a ClassUnion (a class in Unicode sets mode can't start with '-') | ||
| error(Diagnostics.Incomplete_character_class_range_Expected_a_class_set_character); | ||
| mayContainStrings = false; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was struggling whether to treat - as an unexpected character ("Unexpected '-'. Did you mean to escape it with backslash?") or an incomplete character class with a zero-width message in cases like /[-a-b-]/v
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And temporarily went with the former.
| "Incomplete character class range. Expected a class set character.": { | ||
| "category": "Error", | ||
| "code": 1547 | ||
| }, | ||
| "Missing '{0}' in between class set operands. To form a union of these characters, wrap them in a nested class instead.": { | ||
| "category": "Error", | ||
| "code": 1548 | ||
| }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would appreciate any better suggestions regarding the wording of the messages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also wonder if old diagnostic messages are amendable. Precisely, shall I alter the message of TS1519 from "Wrap it" to "Wrap them"?
| if (ch === charCodeChecked(pos + 1) && (ch === CharacterCodes.minus || ch === CharacterCodes.ampersand)) { | ||
| error(Diagnostics.Operators_must_not_be_mixed_within_a_character_class_Wrap_it_in_a_nested_class_instead, pos, 2); | ||
| pos += 2; | ||
| operand = text.slice(start, pos); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be even more intuitive if we were to rescan the class set as intersection/subtraction when we hit any -- or &&, but it is infeasible without a two-pass scan, the first of which must not emit errors. Unfortunately, I need to give up as it would overcomplicate the implementation.
| operand = scanClassSetOperand(); | ||
| } | ||
| // Use the first operator to determine the expression type | ||
| switch (charCodeChecked(pos)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it might not have functional difference right now but to future-proof this I'd still include this:
| switch (charCodeChecked(pos)) { | |
| switch (ch = charCodeChecked(pos)) { |
The reason why this should be reassigned here is that the above call to scanClassSetOperand could move the pos cursor
&s (not&&s) at certain positions-(class set range operator) at the end of class sets are not flagged as errors&s or-s, when operators are mixed within the same class, or when there are consecutive operands in class intersections and subtractionsFixes #62707