Reduce resources to load language models #121

Open
pemistahl opened this issue Nov 5, 2022 · 5 comments · May be fixed by #458
Labels
enhancement New feature or request

Comments

pemistahl (Owner) commented Nov 5, 2022

Currently, the language models are parsed from JSON files and loaded into simple maps at runtime. Even though accessing the maps is pretty fast, they consume a significant amount of memory. The goal is to investigate whether more suitable data structures are available that require less memory, something like what NumPy provides for Python.

One promising candidate could be ndarray.
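
For illustration, a minimal sketch of that current style of loading, assuming the models deserialize into flat ngram-to-probability maps (the actual JSON layout of the model files may differ):

    use std::collections::HashMap;

    // Minimal sketch of the current approach, under the assumption that a
    // model file deserializes into a flat ngram -> probability map.
    fn load_model(json: &str) -> serde_json::Result<HashMap<String, f64>> {
        // Every ngram key becomes an owned, heap-allocated String, which is
        // where most of the memory goes for millions of short keys.
        serde_json::from_str(json)
    }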

pemistahl added the enhancement label Nov 5, 2022
pemistahl added this to the Lingua 1.5.0 milestone Nov 5, 2022
ghost commented Dec 17, 2022

Which files? If you need the processing in Python or in JavaScript (Node), I can work on a Google Protocol Buffers format; I am quite sure the persisted model would be much lighter. Whether the processing would also be faster, I do not know.
Anyway, I'm glad to help.
I'm also happy that you provide a JS binding, as I'm looking for fast language detection that runs on Node.
Thanks

ghost commented Dec 17, 2022

I know this only goes half of the way, as you were also asking for a better structure to improve processing time. But for large models in memory, here is a solution:

I changed the format a little, from a regular Map<string, string> to Map<number[], string[]>. I guess you treat it that way internally anyway, so hopefully this is not a problem.

Here is a working example in JavaScript/Node: https://github.com/bacloud23/lingua-rs-bigrams

So here is how it goes:

  • You encode/persist the JSON once, back and forth, into a lightweight binary file.
  • With proto buffers, you can definitely load the binary file with the defined format and decode the entire object (it would be much lighter).
  • But if you work with the values of the ngrams key iteratively rather than cumulatively (I guess so), you can (I guess) load one pair at a time inside a loop. I think that comes with a processing cost, though (again, if it is even possible).
  • You can do the same in Rust or in Python with the same proto model and the new encoding.

Drawback: a new protobufjs dependency.
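
For comparison, here is a rough Rust counterpart of that grouped encoding, sketched with the prost crate; the message name, field layout, and helper functions are illustrative assumptions, not the format used in the linked repository:

    use std::collections::HashMap;

    use prost::Message;

    /// Hypothetical grouped model: one entry per distinct fraction, with
    /// all ngrams sharing that fraction stored together.
    #[derive(Clone, PartialEq, Message)]
    struct GroupedModel {
        /// Flattened (numerator, denominator) pairs: two entries per group.
        #[prost(uint32, repeated, tag = "1")]
        fractions: Vec<u32>,
        /// The ngrams of each group, joined by spaces: one string per group.
        #[prost(string, repeated, tag = "2")]
        groups: Vec<String>,
    }

    /// Encodes the grouped map into the protobuf wire format.
    fn encode(model: &HashMap<(u32, u32), Vec<String>>) -> Vec<u8> {
        let mut out = GroupedModel::default();
        for ((numerator, denominator), ngrams) in model {
            out.fractions.extend([*numerator, *denominator]);
            out.groups.push(ngrams.join(" "));
        }
        out.encode_to_vec()
    }

    fn decode(bytes: &[u8]) -> Result<GroupedModel, prost::DecodeError> {
        GroupedModel::decode(bytes)
    }

The size reduction over JSON comes from the wire format itself: the numbers are stored as varints and the strings are length-prefixed, with no per-entry key or quoting overhead.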

getreu commented Apr 27, 2023

@ghost: By how much does your solution reduce the binary size?

pemistahl modified the milestones: Lingua 1.5.0, Lingua 1.6.0 May 29, 2023
pemistahl modified the milestones: Lingua 1.6.0, Lingua 1.7.0 Oct 30, 2023
pemistahl removed this from the Lingua 1.7.0 milestone Aug 14, 2024
adamreichold commented Mar 23, 2025

I am not sure ndarray (or any n-dimensional array data structure for that matter) can help with associative look-ups of ngram keys. If I understand things correctly, then the main issue is the memory required to store the keys, i.e. the ngrams.

If so, I think one promising candidate would be finite state transducers (FSTs), cf. https://burntsushi.net/transducers/, via the fst crate. An FST compresses the key space and can also be memory-mapped directly from disk. The downside is that values are limited to u64, but I think that would actually suffice: frequencies fit into a u32, and probabilities can be stored as fractions of two u32 values.

Ideally, I think, all models for a given language would also be serialized into, and thereby memory-mapped as, a single FST by encoding the model type and the ngram itself into the key, e.g. the most common fivegram "nicht" could be encoded as the byte string mapping

[1 /* = most common */, 5 /* fivegrams */, b'n', b'i', b'c', b'h', b't'] => 0

(where the value does not really matter if we only need set membership) or the bigram probability 338628/8845267 of "ni" could be encoded as the byte string mapping

[2 /* = probabilities */, 2 /* = bigrams */, b'n', b'i'] => (338628 << 32) | 8845267

The main caveat I see is that FSTs are meant to store longer tokens, and by design there is little sequential redundancy in bigrams, so the compression effect could be quite limited. The upside of memory-mapping a single look-up-efficient data structure per language should still be an improvement, though.
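
A hedged sketch of that single-FST-per-language layout with the fst crate; the tag constants and helper functions are illustrative assumptions:

    use fst::MapBuilder;

    // Illustrative model-type tags mirroring the encoding sketched above.
    const MOST_COMMON: u8 = 1;
    const PROBABILITIES: u8 = 2;

    /// Builds a key of the form [model type, ngram length, ngram bytes...].
    fn key(model_type: u8, ngram: &str) -> Vec<u8> {
        let mut k = vec![model_type, ngram.len() as u8];
        k.extend_from_slice(ngram.as_bytes());
        k
    }

    /// Packs a numerator/denominator pair into the single u64 value slot.
    fn pack(numerator: u32, denominator: u32) -> u64 {
        ((numerator as u64) << 32) | denominator as u64
    }

    fn unpack(value: u64) -> (u32, u32) {
        ((value >> 32) as u32, value as u32)
    }

    fn main() -> Result<(), fst::Error> {
        // fst requires keys to be inserted in lexicographic byte order.
        let mut builder = MapBuilder::memory();
        builder.insert(key(MOST_COMMON, "nicht"), 0)?;
        builder.insert(key(PROBABILITIES, "ni"), pack(338_628, 8_845_267))?;
        let map = builder.into_map();

        // Set membership for the most-common model, packed fraction
        // look-up for the probability model.
        assert!(map.get(key(MOST_COMMON, "nicht")).is_some());
        let (num, den) = unpack(map.get(key(PROBABILITIES, "ni")).unwrap());
        assert_eq!((num, den), (338_628, 8_845_267));
        Ok(())
    }

The same builder could write to a file instead of memory, after which fst's memory-mapping would make loading the model essentially free.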

adamreichold commented

Thinking about it a bit, ngram length does not need to be encoded at all as it is self-describing in that approach. Also it might be preferable to store the model type at the end to simplify exploiting common suffixes in fst's approximate construction algorithm.
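
Under that refinement, the key helper from the sketch above might become (again only an illustration):

    /// Revised key layout: ngram bytes first, model type last. The ngram
    /// length is implicit in the key length, and shared ngram prefixes
    /// line up at the front of the keys, where an FST shares transitions.
    fn key(ngram: &str, model_type: u8) -> Vec<u8> {
        let mut k = ngram.as_bytes().to_vec();
        k.push(model_type);
        k
    }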
