Reduce resources to load language models #121
Currently, the language models are parsed from JSON files and loaded into simple maps at runtime. Even though accessing the maps is pretty fast, they consume a significant amount of memory. The goal is to investigate whether there are more suitable data structures available that require less memory, similar to what NumPy provides for Python. One promising candidate could be the ndarray crate.
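As a hedged sketch of what an ndarray-backed model might look like for bigrams, assuming a fixed, purely illustrative lowercase ASCII alphabet (the real models cover much larger character sets, and the indexing scheme here is not from the lingua codebase):

```rust
use ndarray::Array2;

// Illustrative alphabet; the real models cover far larger character sets.
const ALPHABET: &[u8] = b"abcdefghijklmnopqrstuvwxyz";

fn index_of(c: u8) -> Option<usize> {
    ALPHABET.iter().position(|&a| a == c)
}

fn main() {
    let n = ALPHABET.len();
    // Bigram probabilities live in a dense 26 x 26 array of f64
    // instead of a HashMap<String, f64>.
    let mut probs = Array2::<f64>::zeros((n, n));

    // Store the bigram probability 338628/8845267 of "ni"
    // (the figure quoted in the discussion below).
    let (i, j) = (index_of(b'n').unwrap(), index_of(b'i').unwrap());
    probs[[i, j]] = 338628.0 / 8845267.0;

    // Lookup is two index computations, with no string hashing or allocation.
    println!("p(ni) = {}", probs[[i, j]]);
}
```

Whether a dense array actually saves memory depends on sparsity: it pays for every possible ngram, while a map only pays for the observed ones, and for higher ngram orders the dense form grows exponentially.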
Comments
Which files? If you need the processing in Python or in JavaScript (Node), I can work on a Google Protocol Buffers format; I am quite sure the persisted model would be much lighter. Maybe the processing would be faster too, I do not know.
I know this is only halfway there, as you were asking for a better structure to gain processing time. But for a big model in memory, here is a solution: I changed the format a little bit from the regular JSON. Here is a working example in JavaScript/Node: https://github.com/bacloud23/lingua-rs-bigrams
Drawback: a new protobufjs dependency.
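As a sketch of what a comparable Protocol Buffers encoding might look like on the Rust side (the linked example uses protobufjs in Node), here is a hypothetical message layout using the prost crate; the message and field names are assumptions, not the format of the linked repository:

```rust
use prost::Message;

// Hypothetical message layout; not the schema of the linked repository.
#[derive(Clone, PartialEq, Message)]
pub struct NgramEntry {
    #[prost(string, tag = "1")]
    pub ngram: String,
    #[prost(uint32, tag = "2")]
    pub numerator: u32,
    #[prost(uint32, tag = "3")]
    pub denominator: u32,
}

#[derive(Clone, PartialEq, Message)]
pub struct NgramModel {
    #[prost(message, repeated, tag = "1")]
    pub entries: Vec<NgramEntry>,
}

fn main() {
    let model = NgramModel {
        entries: vec![NgramEntry {
            ngram: "ni".to_string(),
            numerator: 338_628,
            denominator: 8_845_267,
        }],
    };
    // Varint-encoded binary output; typically smaller than the JSON text form.
    let bytes = model.encode_to_vec();
    let decoded = NgramModel::decode(bytes.as_slice()).unwrap();
    assert_eq!(model, decoded);
    println!("{} bytes", bytes.len());
}
```

Varint encoding alone usually makes the numeric fields smaller than their JSON text representation; the actual savings would need to be measured, which is what the next question asks.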
@ghost: By how much does your solution reduce the binary size?
I am not sure. I think one promising candidate would be finite state transducers (FSTs), cf. https://burntsushi.net/transducers/, via the fst crate.

I think ideally, all models for a given language would also be serialized, and thereby memory-mapped, as a single FST by encoding the model type and the ngram name into the key itself. E.g. the most common fivegram "nicht" could be encoded as the byte string mapping

`[1 /* = most common */, 5 /* = fivegrams */, b'n', b'i', b'c', b'h', b't'] => 0`

(where the value does not really matter if we only need set membership), or the bigram probability 338628/8845267 of "ni" could be encoded as the byte string mapping

`[2 /* = probabilities */, 2 /* = bigrams */, b'n', b'i'] => (338628 << 32) | 8845267`

The main caveat I see is that FSTs are meant to store longer tokens, and there is by design little sequential redundancy in bigrams, so the compression effect could be quite limited. The upside of memory-mapping a single lookup-efficient data structure per language should still be an improvement, though.
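To make that concrete, here is a minimal sketch of building and querying such a map with the fst crate. The model-type constants, the length byte, and the packing of the fraction into the u64 value slot follow the two example mappings above; the helper function and everything else are illustrative assumptions.

```rust
use fst::{Map, MapBuilder};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Key layout from the examples above: [model_type, ngram_len, ngram bytes].
    fn key(model_type: u8, ngram: &str) -> Vec<u8> {
        let mut k = vec![model_type, ngram.len() as u8];
        k.extend_from_slice(ngram.as_bytes());
        k
    }

    let mut pairs = vec![
        // Most common fivegram "nicht"; the value is unused for set membership.
        (key(1, "nicht"), 0u64),
        // Bigram probability 338628/8845267 of "ni", packed into the value.
        (key(2, "ni"), (338_628u64 << 32) | 8_845_267),
    ];
    // An FST must be built from keys in lexicographic order.
    pairs.sort();

    let mut builder = MapBuilder::memory();
    for (k, v) in &pairs {
        builder.insert(k, *v)?;
    }
    // In production this byte buffer would be written to disk and memory-mapped.
    let map = Map::new(builder.into_inner()?)?;

    if let Some(packed) = map.get(key(2, "ni")) {
        println!("p(ni) = {}/{}", packed >> 32, packed & 0xFFFF_FFFF);
    }
    Ok(())
}
```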
Thinking about it a bit, the ngram length does not need to be encoded at all, as it is self-describing in that approach. Also, it might be preferable to store the model type at the end of the key to simplify exploiting common suffixes in the FST.
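A sketch of that revised layout, under the same illustrative assumptions as above:

```rust
// Revised key layout: ngram bytes first, model type last, no length byte;
// the ngram length is implied by the length of the key itself.
fn encode_key(ngram: &str, model_type: u8) -> Vec<u8> {
    let mut key = Vec::with_capacity(ngram.len() + 1);
    key.extend_from_slice(ngram.as_bytes());
    key.push(model_type);
    key
}

fn main() {
    // "nicht" under model type 1 (= most common), cf. the mappings above.
    assert_eq!(encode_key("nicht", 1), [b'n', b'i', b'c', b'h', b't', 1]);
}
```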