Support DDLs in complex script segmentation models #3411

@sffc

Description

The segmentation models in ICU4X (and ICU4C) are trained on the most widely used language in each script (Han, Thai, Khmer, Lao, and Myanmar). They do not work very well for digitally disadvantaged languages (DDLs) that share those same scripts, such as Cantonese (Han script), So (Thai script), and Shan (Myanmar script).

Since we can now use ML models that carry context across an entire string, it should be possible to train a model that accurately finds breakpoints in an arbitrary string in a given script, whatever its language. In effect, the segmentation model would learn to do language detection at the same time.

Metadata

Labels

C-segmentation (Component: Segmentation)

S-epic (Size: Major project; create smaller child issues)

T-bug (Type: Bad behavior, security, privacy)
