Support DDLs in complex script segmentation models

The segmentation models in ICU4X (and ICU4C) are trained on the most widely used language in each script (Han, Thai, Khmer, Lao, and Myanmar). As a result, they do not work well for digitally disadvantaged languages (DDLs) that share those scripts, such as Cantonese (Han script), So (Thai script), and Shan (Myanmar script).

Because we can now use ML models that carry context across an entire string, it should be possible to train a model that accurately finds breakpoints in any string in a given script, regardless of its language. In effect, the segmentation model would learn language detection at the same time.
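
To make the idea concrete, here is a minimal sketch of how per-position model output could be turned into breakpoints. It assumes a hypothetical model that emits, for each code point in the string, a probability that a segment starts there; the function names, the `scores` input, and the 0.5 threshold are illustrative assumptions, not the ICU4X or ICU4C API or model format.

```python
from typing import List, Sequence


def breakpoints_from_scores(
    text: str, scores: Sequence[float], threshold: float = 0.5
) -> List[int]:
    """Convert per-code-point break scores into breakpoint offsets.

    scores[i] is assumed to be a (hypothetical) model's probability that a
    segment starts at code point i. Offsets 0 and len(text) are always
    breakpoints for non-empty text.
    """
    if len(scores) != len(text):
        raise ValueError("expected one score per code point")
    breaks = [0]
    for i in range(1, len(text)):
        if scores[i] >= threshold:
            breaks.append(i)
    if text:
        breaks.append(len(text))
    return breaks


def split_at(text: str, breaks: Sequence[int]) -> List[str]:
    """Slice text into the segments delimited by consecutive breakpoints."""
    return [text[a:b] for a, b in zip(breaks, breaks[1:])]
```

Because the decoding step depends only on the scores, nothing in it is language-specific: any language detection would have to happen implicitly inside the model that produces `scores`, which is the point of the proposal.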

Support DDLs in complex script segmentation models #3411

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Support DDLs in complex script segmentation models #3411

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions