Using derived rules to inflect nouns #112

nciric opened this issue May 1, 2025 · 1 comment

nciric commented May 1, 2025

Feature request

We should have access to the inferred rules from C++ code and be able to match an incoming lemma against them (a potential implementation could be a RuleInflector invoked after DictionaryInflector). Words that match the rules but are not in the dictionary could then be inflected automatically. Additional processing may be necessary to fix corner cases, but that would likely be easier than developing a full set of rules from scratch.

Current use case:

  1. The words уранак and пашњак are not in the dictionary, but they belong to the same inflection group (see below) as the word пропланак, which is.
  2. Applying the rules for group f would produce correct results for both of them without the need for extra logic.
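The use case above can be sketched as a dictionary lookup followed by a suffix-rule fallback. This is a minimal, standalone sketch and not the library's API: `SuffixRule`, `findRule` and `inflect` are hypothetical names, and the single rule used in the usage note (lemma suffix ак becomes ка for the genitive singular) is hand-written for illustration rather than taken from the generated rule groups.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical rule shape: strip lemmaSuffix from the lemma, then append formSuffix.
struct SuffixRule {
    std::u16string lemmaSuffix;
    std::u16string formSuffix;
};

// Returns the rule with the longest lemmaSuffix that the lemma ends with,
// or nullptr when no rule matches.
inline const SuffixRule* findRule(const std::vector<SuffixRule>& rules,
                                  std::u16string_view lemma) {
    const SuffixRule* best = nullptr;
    for (const auto& rule : rules) {
        const std::u16string_view suffix(rule.lemmaSuffix);
        const bool matches = lemma.size() >= suffix.size()
            && lemma.substr(lemma.size() - suffix.size()) == suffix;
        if (matches && (best == nullptr || suffix.size() > best->lemmaSuffix.size())) {
            best = &rule;
        }
    }
    return best;
}

// Dictionary first, rules second. The map stands in for the real dictionary
// and holds a single inflected form per lemma to keep the sketch short.
inline std::optional<std::u16string> inflect(
        const std::map<std::u16string, std::u16string>& dictionary,
        const std::vector<SuffixRule>& rules,
        std::u16string_view lemma) {
    const auto entry = dictionary.find(std::u16string(lemma));
    if (entry != dictionary.end()) {
        return entry->second;  // the dictionary always wins
    }
    const SuffixRule* rule = findRule(rules, lemma);
    if (rule == nullptr) {
        return std::nullopt;  // neither the dictionary nor the rules know this word
    }
    assert(lemma.size() >= rule->lemmaSuffix.size());
    std::u16string result(lemma.substr(0, lemma.size() - rule->lemmaSuffix.size()));
    result += rule->formSuffix;
    return result;
}
```

With a dictionary that only contains пропланак and the one illustrative rule, `inflect` returns the stored form for пропланак, builds уранка for уранак through the fallback, and returns an empty optional for a word such as мост that matches no rule. Preferring the longest matching suffix is a design assumption that lets a more specific rule override a more general one.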

NOTE(George): Groups c and f encode the same rules (suffix is k and ak) - is that a bug in dictionary-parser?

Reasoning

In our initial discussion on how to implement the inflection library, we mentioned a couple of solutions:

  1. Dictionary based
  2. Rule based
  3. ML based

The options above can also be mixed into a hybrid solution, e.g. a rule and dictionary approach with a fallback to ML for more complex languages.

Our current implementation is based on dictionary lookup, with language-specific tailorings written in C++, e.g. English guessSingularInflection. During Wikidata ingestion we also produce inflection rules for various parts of speech, including nouns, proper nouns and adjectives. Those rules are currently applied only to words that are already in the dictionary.
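To make the idea of a derived rule concrete, here is a minimal sketch of one way a suffix rule could be inferred from a single dictionary pair: strip the longest common prefix of the lemma and the inflected form, and keep what remains on each side. This is an assumption for illustration only and does not describe how the Wikidata ingestion actually produces its rules; `DerivedRule` and `deriveRule` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical result type: replacing lemmaSuffix with formSuffix
// turns the lemma into the inflected form.
struct DerivedRule {
    std::u16string lemmaSuffix;
    std::u16string formSuffix;
};

// Derives a suffix rule by removing the longest common prefix of the two words.
inline DerivedRule deriveRule(std::u16string_view lemma, std::u16string_view form) {
    std::size_t common = 0;
    while (common < lemma.size() && common < form.size()
           && lemma[common] == form[common]) {
        ++common;
    }
    assert(common <= lemma.size() && common <= form.size());
    return DerivedRule{std::u16string(lemma.substr(common)),
                       std::u16string(form.substr(common))};
}
```

For the pair пропланак and пропланка, the shared prefix is проплан, so the derived rule replaces the suffix ак with ка. A real implementation would presumably also record the grammatical features of the form and merge rules that many lemmas share.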

I feel that many language-specific tailorings could be avoided by reusing those rules for words that are outside of the dictionary but follow the rule patterns. That would reduce both the complexity and the time needed to implement language support.

Benefits of having a rule based inflector:

  1. The dictionary can be sparse, helping with size
  2. We can launch more languages with sparse Wikidata (see language status)

Writing inflection rules by hand is hard. Take a look at a somewhat simple list of rules in Serbian:

  1. Masculine nouns ending with -∅, -о and -е, and neuter nouns ending with -о and -е, where the stem stays the same.
  2. Neuter nouns ending with -е, where the stem gets expanded with the consonants н or т in most cases.
  3. Nouns where the stem ends with -а (both masculine and feminine).
  4. Feminine nouns ending with -∅ if the adjacent adjective is also expressed in the feminine form.

Implementing that in Pynini, which is a system optimized for quick rule matching, is not trivial, but doing it in C++ is a harder problem that doesn't scale as well.

@nciric nciric added the enhancement New feature or request label May 1, 2025
@nciric nciric self-assigned this May 1, 2025

grhoten commented May 13, 2025

Here's some sample code that may help.

    // Look up the Serbian dictionary inflector once.
    const auto& inflector(::inflection::dictionary::Inflector::getInflector(::inflection::util::LocaleUtils::SERBIAN()));
    ::std::vector<inflection::dictionary::Inflector_InflectionPattern> inflectionPatterns;
    // Maps the last two characters of a known word to its first inflection pattern.
    std::map<std::u16string, inflection::dictionary::Inflector_InflectionPattern> suffixToPattern;
    for (const auto* str : {u"кафана", u"мост"}) {
        std::u16string_view word(str);
        inflectionPatterns.clear();
        inflector.getInflectionPatternsForWord(word, inflectionPatterns);
        // Skip words without patterns, and words too short to have a two-character suffix.
        if (!inflectionPatterns.empty() && word.length() >= 2) {
            std::u16string suffix(word.substr(word.length() - 2));
            suffixToPattern.emplace(suffix, inflectionPatterns.front());
        }
    }
    // Print each suffix with the identifier of the pattern it maps to.
    for (const auto& [suffix, inflectionPattern] : suffixToPattern) {
        std::cout << inflection::util::StringViewUtils::to_string(suffix) << ": " << inflection::util::StringViewUtils::to_string(inflectionPattern.getIdentifier()) << std::endl;
    }

Here are the results:

на: 2
ст: 11

You can use the inflection pattern directly on the relevant words, and you don't need to look up the patterns by name.
