Open
Labels
C-segmentation (Component: Segmentation)
S-epic (Size: Major project; create smaller child issues)
T-bug (Type: Bad behavior, security, privacy)
Milestone
Description
The segmentation models in ICU4X (and ICU4C) are each trained on the most widely used language of their script (Han, Thai, Khmer, Lao, and Myanmar). As a result, they do not work well for digitally disadvantaged languages (DDLs) that share those scripts, such as Cantonese (Han script), So (Thai script), and Shan (Myanmar script).
Now that we can use ML models that carry context across an entire string, it should be possible to train a model that accurately finds breakpoints in an arbitrary string in a given script, regardless of which language it is written in. In effect, the segmentation model would learn to do language detection implicitly while it segments.
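To make "finding breakpoints" concrete, here is a minimal sketch of the last step only: turning per-character labels from a model into break positions. It assumes the model emits one BIES tag per character (B = begin, I = inside, E = end, S = single). That output format and the function name `breakpoints_from_tags` are assumptions for illustration and are not specified in this issue. The model itself is out of scope here.

```python
def breakpoints_from_tags(text: str, tags: str) -> list[int]:
    """Convert per-character BIES tags into segment break positions.

    Hypothetical helper: assumes one tag per code point in `text`.
    Returns code point offsets, always including 0 and len(text)
    for non-empty input.
    """
    if len(text) != len(tags):
        raise ValueError("expected exactly one tag per character")
    if not text:
        return []

    breaks = [0]
    for i, tag in enumerate(tags):
        # A segment closes after an End tag or a Single tag.
        if tag in ("E", "S"):
            breaks.append(i + 1)

    # Guard against a tag sequence that does not close its last segment.
    if breaks[-1] != len(text):
        breaks.append(len(text))
    return breaks


# Placeholder input: tags describe the segments "abc", "de", "f".
print(breakpoints_from_tags("abcdef", "BIEBES"))  # [0, 3, 5, 6]
```

Under this framing, a language-agnostic model for a script only has to predict the right tags from context. Nothing in the decoding step depends on knowing the language.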
younies