Integrate fr Wikidata into Unicode Inflection #51

grhoten opened this issue Jan 21, 2025 · 1 comment

grhoten commented Jan 21, 2025

The revised dictionary-parser can parse Wikidata, but some issues need to be resolved.

The initial issues include:

  • The dictionary-parser output needs to be addressed.
  • The unit tests need to be fixed.

Tool output that needs to be addressed:

Line 2415: Q1050744 is not a known part of speech grammeme for L19397(duquel)
Line 89172: Q10343770 is not a known grammeme for L738468(IP)
Line 167868: Q82955 is not a known part of speech grammeme for L1373953(Raymond Lemieux)
Line 345406: Q2824480 is not a known part of speech grammeme for L9203(ce)
Line 345414: Q420020 is not a known grammeme for L9288(nous)
Line 345625: Q1050744 is not a known part of speech grammeme for L11158(lequel)
Line 346100: Q3618903 is not a known part of speech grammeme for L15026(aucun)
Line 432895: Q10343770 is not a known grammeme for L738472(ADSL)
Line 522421: Q11655558 is not a known part of speech grammeme for L57947(lors même que)
Line 687814: Q650250 is not a known grammeme for L9094(je)
Line 687940: Q3618903 is not a known part of speech grammeme for L10023(chaque)
Line 859126: Q650250 is not a known grammeme for L9096(tu)
Line 860426: Q1050744 is not a known part of speech grammeme for L19396(auquel)
Line 1030832: Q3618903 is not a known part of speech grammeme for L9275(quelque)
Line 1179165: Q4116295 is not a known part of speech grammeme for L1232738(Canapé)
Line 1201046: Q114092330 is not a known grammeme for L2770(le)
Line 1201584: Q114092330 is not a known grammeme for L7026(beau)
Line 1201947: Q3618903 is not a known part of speech grammeme for L10017(tout)
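
Since several Q-identifiers repeat across lexemes, it may help to group the output by identifier before deciding what to map or ignore. The following is only a rough sketch: it assumes the log lines follow exactly the format shown above, and the names `LOG_LINE`, `summarize`, and `SAMPLE` are made up for illustration and are not part of the dictionary-parser.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   Line 2415: Q1050744 is not a known part of speech grammeme for L19397(duquel)
#   Line 89172: Q10343770 is not a known grammeme for L738468(IP)
LOG_LINE = re.compile(
    r"^Line (?P<line>\d+): (?P<qid>Q\d+) is not a known "
    r"(?P<pos>part of speech )?grammeme for (?P<lexeme>L\d+)\((?P<lemma>.+)\)$"
)


def summarize(log_text):
    """Group lemmas by (QID, kind), where kind is 'part of speech' or 'grammeme'."""
    groups = defaultdict(list)
    for raw in log_text.splitlines():
        match = LOG_LINE.match(raw.strip())
        if match is None:
            continue  # skip anything that is not a warning line
        kind = "part of speech" if match["pos"] else "grammeme"
        groups[(match["qid"], kind)].append(match["lemma"])
    return dict(groups)


# A few lines copied from the tool output above.
SAMPLE = """\
Line 2415: Q1050744 is not a known part of speech grammeme for L19397(duquel)
Line 89172: Q10343770 is not a known grammeme for L738468(IP)
Line 345625: Q1050744 is not a known part of speech grammeme for L11158(lequel)
Line 687814: Q650250 is not a known grammeme for L9094(je)
"""

if __name__ == "__main__":
    for (qid, kind), lemmas in sorted(summarize(SAMPLE).items()):
        print(f"{qid} ({kind}): {', '.join(lemmas)}")
```

Grouping this way makes it easier to see which identifiers affect many lexemes and which are one-off entries.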

Here are the currently generated lexical dictionary files for debugging the test failures.
fr.zip

grhoten added this to the 0.1 milestone Jan 21, 2025

grhoten commented Jan 28, 2025

To get this to work, the appleproduct usage in FrGrammarSynthesizer_CountLookupFunction needs to be removed. The options are:

  1. Add the words to Wikidata and mark them as singular. This might be hard to do in Wikidata for these trademarked names.
  2. Remove the code and the tests. This would be a loss of functionality.
  3. Find an alternative way to represent such named entities.
  4. Add customized data, separate from Wikidata, similar to the vowel-start/consonant-start data in English. It's probably a good idea to have some sort of official way to override the defaults anyway.
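
To make option 4 a bit more concrete, here is a minimal sketch of layering a hand-maintained override table on top of the Wikidata-derived entries. Everything in it is hypothetical: the `apply_overrides` function, the dictionary-of-sets representation, and the `ExampleProduct` entry are assumptions for illustration and do not reflect the actual dictionary format or any existing tool option. It also assumes that overrides add grammemes rather than replace them, which is only one possible policy.

```python
def apply_overrides(wikidata_entries, overrides):
    """Return a new mapping of lemma -> grammemes with overrides layered on top.

    Both arguments map a lemma (str) to a set of grammeme names (str).
    Overrides are unioned into existing entries and create missing ones.
    The inputs are left unmodified.
    """
    merged = {lemma: set(grammemes) for lemma, grammemes in wikidata_entries.items()}
    for lemma, grammemes in overrides.items():
        merged.setdefault(lemma, set()).update(grammemes)
    return merged


if __name__ == "__main__":
    # Hypothetical data: one entry derived from Wikidata, one supplied by hand.
    from_wikidata = {"beau": {"adjective", "masculine"}}
    custom = {"ExampleProduct": {"noun", "singular"}}
    for lemma, grammemes in sorted(apply_overrides(from_wikidata, custom).items()):
        print(lemma, sorted(grammemes))
```

Keeping the overrides in a separate table would let the Wikidata-derived data be regenerated at any time without losing the hand-maintained entries.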

These are the options used to generate this data.

--language fr --inflection-types noun,adjective,verb,pronoun,adverb,article --add-sound consonant-start,vowel-start --ignore-property Q420020 --ignore-entries-with-grammemes common

BrunoCartoni self-assigned this Feb 4, 2025