The inferred rules should be accessible from C++ code so that an incoming lemma can be matched against them (one possible implementation is a RuleInflector invoked after DictionaryInflector). Words that match the rules but are not in the dictionary could then be inflected automatically. Additional processing may be necessary to fix corner cases, but that would likely be easier than developing a full set of rules from scratch.
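A minimal sketch of that fallback order, assuming a shared interface that returns no value when a stage cannot handle the lemma. Every name here except RuleInflector and DictionaryInflector is hypothetical, the method signatures are invented for illustration, and the English data in the usage note is placeholder data rather than anything from the generated rule set.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// All names below are illustrative; they are not the library's actual API.
struct Inflector {
    virtual ~Inflector() = default;
    // Returns the inflected form, or std::nullopt when this stage cannot
    // handle the lemma.
    virtual std::optional<std::string> inflect(const std::string& lemma,
                                               const std::string& grammeme) const = 0;
};

// Stage 1: exact lookup of (lemma, grammeme) in a dictionary.
class DictionaryInflector : public Inflector {
public:
    using Forms = std::map<std::pair<std::string, std::string>, std::string>;
    explicit DictionaryInflector(Forms forms) : forms_(std::move(forms)) {}

    std::optional<std::string> inflect(const std::string& lemma,
                                       const std::string& grammeme) const override {
        auto it = forms_.find(std::make_pair(lemma, grammeme));
        if (it == forms_.end()) return std::nullopt;
        return it->second;
    }

private:
    Forms forms_;
};

// A rule rewrites a lemma suffix for a given grammeme.
struct SuffixRule {
    std::string grammeme;
    std::string from;
    std::string to;
};

// Stage 2: apply the first rule whose grammeme and suffix both match.
class RuleInflector : public Inflector {
public:
    explicit RuleInflector(std::vector<SuffixRule> rules) : rules_(std::move(rules)) {}

    std::optional<std::string> inflect(const std::string& lemma,
                                       const std::string& grammeme) const override {
        for (const SuffixRule& rule : rules_) {
            const bool suffixMatches =
                lemma.size() >= rule.from.size() &&
                lemma.compare(lemma.size() - rule.from.size(), rule.from.size(),
                              rule.from) == 0;
            if (rule.grammeme == grammeme && suffixMatches) {
                return lemma.substr(0, lemma.size() - rule.from.size()) + rule.to;
            }
        }
        return std::nullopt;
    }

private:
    std::vector<SuffixRule> rules_;
};

// Tries each stage in order and returns the first result it gets.
class ChainInflector : public Inflector {
public:
    explicit ChainInflector(std::vector<std::shared_ptr<const Inflector>> stages)
        : stages_(std::move(stages)) {}

    std::optional<std::string> inflect(const std::string& lemma,
                                       const std::string& grammeme) const override {
        for (const auto& stage : stages_) {
            if (auto result = stage->inflect(lemma, grammeme)) return result;
        }
        return std::nullopt;
    }

private:
    std::vector<std::shared_ptr<const Inflector>> stages_;
};
```

With a dictionary that holds only child/children and a single placeholder rule rewriting -y to -ies for the plural, the chain would answer "child" from the dictionary, "city" from the rule, and return nothing for "dog". Irregular words stay correct because the dictionary is consulted first; a naive rule like this one would still need the corner-case processing mentioned above.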
Current use case:
The words уранак and пашњак are not in the dictionary, but they belong to the same inflection group (see below) as the word пропланак, which is.
Applying the rules for group f would produce correct results for both of them without the need for extra logic.
NOTE(George): Groups c and f encode the same rules (suffix is k and ak) - is that a bug in dictionary-parser?
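If overlapping groups such as c and f are kept, a matcher needs some way to choose between them. One option, sketched here as an assumption rather than as existing behavior, is to prefer the group with the longest matching suffix. The RuleGroup struct and the matchGroup function are hypothetical, and the sketch assumes the group suffixes are stored in Cyrillic as UTF-8.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical shape of an inferred inflection group: a label plus the lemma
// suffix the group applies to. The real generated data may look different.
struct RuleGroup {
    std::string label;
    std::string lemmaSuffix;
};

// Byte-wise suffix test. This is safe for UTF-8 text because a valid UTF-8
// suffix can only match at a character boundary.
bool endsWith(const std::string& text, const std::string& suffix) {
    return text.size() >= suffix.size() &&
           text.compare(text.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Returns the label of the matching group with the longest suffix, or
// std::nullopt when no group matches the lemma.
std::optional<std::string> matchGroup(const std::vector<RuleGroup>& groups,
                                      const std::string& lemma) {
    const RuleGroup* best = nullptr;
    for (const RuleGroup& group : groups) {
        if (!endsWith(lemma, group.lemmaSuffix)) continue;
        if (best == nullptr ||
            group.lemmaSuffix.size() > best->lemmaSuffix.size()) {
            best = &group;
        }
    }
    if (best == nullptr) return std::nullopt;
    return best->label;
}
```

Given group c with suffix к and group f with suffix ак, уранак and пашњак would both resolve to f, while a lemma ending in к but not in ак would fall through to c. Suffix length is compared in bytes, which is enough to rank a suffix against its own longer extension.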
Reasoning
In our initial discussion on how to implement the inflection library, we mentioned a few possible solutions:
Dictionary based
Rule based
ML based
These options can also be combined into a hybrid solution, e.g. a rule and dictionary approach with a fallback to ML for more complex languages.
Our current implementation is based on dictionary lookup, with language-specific tailorings written in C++, e.g. guessSingularInflection for English. During Wikidata ingestion we also produce inflection rules for various parts of speech, including nouns, proper nouns and adjectives. Those rules are currently applied only to words already in the dictionary.
I feel that many language-specific tailorings could be avoided, reducing the complexity and time needed to implement language support, by reusing those rules for words outside the dictionary that follow the rule patterns.
Benefits of having a rule-based inflector:
The dictionary can be sparse, helping with size
We can launch more languages with sparse Wikidata (see language status)
Writing inflection rules by hand is hard. Take a look at a relatively simple list of rules for Serbian:
Masculine nouns ending with -∅, -о and -е, and neuter nouns ending with -о and -е, where the stem stays the same.
Neuter nouns ending with -е, where the stem gets expanded with the consonants н or т in most cases.
Nouns where the stem ends with -а (both masculine and feminine).
Feminine nouns ending with -∅, if the adjacent adjective is also in the feminine form.
Implementing that in Pynini, a system optimized for quick rule matching, is not trivial, but doing it in C++ is harder still and does not scale as well.