Integrate ar Wikidata into Unicode Inflection #62

Open
grhoten opened this issue Jan 22, 2025 · 7 comments

@grhoten
Member

grhoten commented Jan 22, 2025

The revised dictionary-parser can parse Wikidata, but some issues need to be resolved.

The initial issues include:

  • The dictionary-parser output needs to be addressed.
  • The unit tests need to be fixed.

Tool output that needs to be addressed:

Line 1516: Q124351233 is not a known grammeme for L12125(ٱِخْضَرَّ)
Line 6472: Q124351233 is not a known grammeme for L52282(فصَلَ)
Line 140303: Q1969448 is not a known part of speech grammeme for L1152311(صاحِبُ الزَّاد)
Line 140312: Q1969448 is not a known part of speech grammeme for L1152377(قَبُو)
Line 140475: Q51929154 is not a known grammeme for L1153733(حزين)
Line 140885: Q12230930 is not a known grammeme for L1157035(عَشِقَ)
Line 140893: Q120867784 is not a known grammeme for L1157102(سَقَى)
Line 140894: Q1969448 is not a known part of speech grammeme for L1157117(الحامِل)
Line 140937: Q1969448 is not a known part of speech grammeme for L1157456(صاحب المدينة)
Line 144897: Q9081 is not a known grammeme for L1188684(بُلْغَة)
Line 153128: Q124351233 is not a known grammeme for L1255342(تَماشَى)
Line 160864: Q1555419 is not a known grammeme for L1316914(صَافِي)
Line 167994: Q124351233 is not a known grammeme for L1374985(سَلَّمَ)
Line 173113: Q124351233 is not a known grammeme for L7882(ذَهَبَ)
Line 198499: Q20386151 is not a known grammeme for L222483(رَجَاةٌ)
Line 313193: Q120867784 is not a known grammeme for L1157058(الشَّبَّابة)
Line 313203: Q51929154 is not a known grammeme for L1157152(نخاع)
Line 313241: Q1969448 is not a known part of speech grammeme for L1157502(صاحب الصلاة)
Line 322570: Q124312584 is not a known part of speech grammeme for L1233453(إِلَّا)
Line 343753: Q7075064 is not a known part of speech grammeme for L1404925(هُ)
Line 345741: Q775724 is not a known grammeme for L12114(قطع)
Line 484419: Q175026 is not a known grammeme for L1157105(زَمَّرَ)
Line 484421: Q1969448 is not a known part of speech grammeme for L1157111(ماء الزهر)
Line 484663: Q22928968 is not a known grammeme for L1159040(قَنْطَرَة)
Line 484664: Q120867784 is not a known grammeme for L1159042(عَيْب)
Line 484860: Q1969448 is not a known part of speech grammeme for L1160500(الحَبة الحلوة)
Line 484868: Q1969448 is not a known part of speech grammeme for L1160598(بَابُوج)
Line 484870: Q1969448 is not a known part of speech grammeme for L1160612(الْوَشَق)
Line 484872: Q1969448 is not a known part of speech grammeme for L1160623(الرقروق)
Line 484886: Q82799 is not a known part of speech grammeme for L1160741(حُقَّة)
Line 487303: Q1969448 is not a known part of speech grammeme for L1180739(البَراءة)
Line 493768: Q2339337 is not a known part of speech grammeme for L1233455(لَ)
Line 493977: Q124351233 is not a known grammeme for L1235177(كَتَّبَ)
Line 507348: Q124351233 is not a known grammeme for L1343443(صَبَّ)
Line 520457: Q124351233 is not a known grammeme for L41672(قَتَلَ)
Line 655406: Q1969448 is not a known part of speech grammeme for L1152272(دار صناعة)
Line 655411: Q175026 is not a known grammeme for L1152301(العَافِيَة)
Line 656008: Q82990 is not a known part of speech grammeme for L1157099(مِئَة)
Line 656013: Q1098772 is not a known grammeme for L1157116(حَائِك)
Line 656255: Q1969448 is not a known part of speech grammeme for L1159039(القَطِيفَة)
Line 656256: Q1969448 is not a known part of speech grammeme for L1159041(الْقَنَاة)
Line 656266: Q1969448 is not a known grammeme for L1159138(قِنطار)
Line 656462: Q1969448 is not a known part of speech grammeme for L1160588(حب الرأس)
Line 656556: Q1098772 is not a known grammeme for L1161301(مَطْرَح)
Line 656853: Q20402133 is not a known grammeme for L1163488(سرنديب)
Line 665222: Q124351233 is not a known grammeme for L1232062(بَعَثَ)
Line 686302: Q124351233 is not a known grammeme for L1404930(صَلَّى)
Line 686303: Q65279776 is not a known part of speech grammeme for L1404933(لَا)
Line 826263: Q28640 is not a known grammeme for L1152274(الصَّقْل)
Line 826895: Q1969448 is not a known part of speech grammeme for L1157095(مَاسُورَة)
Line 827338: Q1969448 is not a known part of speech grammeme for L1160590(إِفْرَنْج)
Line 827341: Q1098772 is not a known grammeme for L1160617(سَيْر)
Line 827682: Q82799 is not a known part of speech grammeme for L1163485(صُفَّة)
Line 829788: Q1969448 is not a known part of speech grammeme for L1180740(البَرَص)
Line 834902: Q430880 is not a known grammeme for L1222151(ضرب)
Line 839632: Q361669 is not a known part of speech grammeme for L1259879(ن)
Line 847300: Q124351233 is not a known grammeme for L1321118(سَمَرَ)
Line 854298: Q28833099 is not a known part of speech grammeme for L1378299(و)
Line 858642: Q124351233 is not a known grammeme for L4976(بَكَى)
Line 859492: Q775724 is not a known grammeme for L12113(فرح)
Line 998600: Q853614 is not a known grammeme for L1157530(الْقِنَّة)
Line 998906: Q853614 is not a known grammeme for L1160165(دَلِيل)
Line 998947: Q1969448 is not a known part of speech grammeme for L1160559(الْعِفَّة)
Line 998952: Q1969448 is not a known part of speech grammeme for L1160601(القَصِيل)
Line 999635: Q1969448 is not a known part of speech grammeme for L1165953(صافن)
Line 1002438: Q9081 is not a known grammeme for L1187855(بيَاض)
Line 1002439: Q9081 is not a known grammeme for L1187856(بُندُقَة)
Line 1007881: Q124288191 is not a known part of speech grammeme for L1232402(إِذاً)
Line 1029791: Q124351233 is not a known grammeme for L465(دَخَلَ)
Line 1034053: Q28833099 is not a known part of speech grammeme for L35600(أَمْ)
Line 1166702: Q118465097 is not a known grammeme for L1131459(ماهِر)
Line 1169805: Q175026 is not a known grammeme for L1157096(مَطْرُوق)
Line 1169808: Q1969448 is not a known part of speech grammeme for L1157112(زَهْرُ النَّرْد)
Line 1169851: Q82799 is not a known grammeme for L1157384(خدد)
Line 1170018: Q51929154 is not a known grammeme for L1158715(رُكن)
Line 1170070: Q6499736 is not a known grammeme for L1159032(عَرَبِيَّة)
Line 1170276: Q1969448 is not a known part of speech grammeme for L1160604(الْقِبْطِيَّة)
Line 1170941: Q82799 is not a known part of speech grammeme for L1165959(سكة)
Line 1173734: Q22928968 is not a known grammeme for L1188600(مُشَمَّع)
Line 1179240: Q2146100 is not a known part of speech grammeme for L1233454(ٱلَّذِي)
Line 1189784: Q12185455 is not a known grammeme for L1319752(كُلّ)
Line 1196824: Q118106334 is not a known grammeme for L1377880(أُصُلًا)
Line 1201316: Q124351233 is not a known grammeme for L4979(تَبَاكَى)
Line 1202214: Q124351233 is not a known grammeme for L12100(كَتَبَ)
Line 1204950: Q12230930 is not a known grammeme for L34391(أَعْطَى)
Line 1336704: Q24238356 is not a known part of speech grammeme for L1118970(ارحلوا)
Line 1341292: Q12230930 is not a known grammeme for L1156941(قيراط)
Line 1341352: Q1969448 is not a known part of speech grammeme for L1157518(صاحب السوق)
Line 1341721: Q1969448 is not a known part of speech grammeme for L1160560(الْدَّار)
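
To see which items account for most of these messages, the output can be tallied by Q-ID. The following is a minimal sketch, assuming the messages have been saved as text in exactly the format shown above. The name `tally_unknown_grammemes` is hypothetical and is not part of the dictionary-parser.

```python
import re
from collections import Counter

# Matches lines such as:
#   Line 1516: Q124351233 is not a known grammeme for L12125(...)
#   Line 140303: Q1969448 is not a known part of speech grammeme for L1152311(...)
MESSAGE = re.compile(
    r"^Line (?P<line>\d+): (?P<qid>Q\d+) is not a known "
    r"(?P<pos>part of speech )?grammeme for (?P<lexeme>L\d+)\((?P<lemma>.*)\)\s*$"
)


def tally_unknown_grammemes(lines):
    """Count messages per (Q-ID, kind); kind is 'part of speech' or 'grammeme'."""
    counts = Counter()
    for text in lines:
        match = MESSAGE.match(text.strip())
        if match is None:
            continue  # skip anything that is not a parser message
        kind = "part of speech" if match.group("pos") else "grammeme"
        counts[(match.group("qid"), kind)] += 1
    return counts
```

Passing an open file handle (or any iterable of lines) and calling `.most_common()` on the result lists the most frequent items first, which helps decide which Q-IDs to investigate before the others.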

Here are the currently generated lexical dictionary files for debugging the test failures.

ar.zip

@grhoten grhoten changed the title from "Integrate he Wikidata into Unicode Inflection" to "Integrate ar Wikidata into Unicode Inflection" Jan 22, 2025
@grhoten grhoten added this to the 0.1 milestone Jan 22, 2025
@nciric
Contributor

nciric commented Jan 24, 2025

There are 583 Arabic nouns in Wikidata. See this query.

@grhoten
Member Author

grhoten commented Jan 28, 2025

There are a fair number of missing words. The ones causing the most issues at the moment are the missing gendered determiners for this/that/these/those.

@grhoten
Member Author

grhoten commented Feb 24, 2025

These options can be used to derive the data. The code currently expects the pronouns to be determiners. If "this" changes from a pronoun to a determiner, we can change the inflection type to mirror the change.

--language ar --inflection-types noun,adjective,verb,determiner --ignore-entries-with-grammemes definite

@younies
Member

younies commented Mar 5, 2025

Most of the Arabic words in the list are legitimate words.

@baha-bouali

baha-bouali commented Apr 15, 2025

  • Term (Q1969448)

Line 140303: Q1969448 is not a known part of speech grammeme for L1152311(صاحِبُ الزَّاد)

In Wikidata, this is a noun phrase, and none of its forms are marked as Term.

Line 484860: Q1969448 is not a known part of speech grammeme for L1160500(الحَبة الحلوة)

This is a noun phrase, but one of its forms (L1160500-F1) is marked as Term (Q1969448).

Line 140894: Q1969448 is not a known part of speech grammeme for L1157117(الحامِل)

This is a noun, and none of its forms are marked as Term.


  • Plural Person (Q51929154)

Line 140475: Q51929154 is not a known grammeme for L1153733(حزين)

Lexical category: adjective
One of its forms (L1153733-F3) is marked as Plural Person.

Line 313203: Q51929154 is not a known grammeme for L1157152(نخاع)

Lexical category: noun
No form is marked as plural person (just plural).

Line 1170018: Q51929154 is not a known grammeme for L1158715(رُكن)

Lexical category: noun
No form is marked as plural person (just plural).


  • fi'il muḍāri' (Q12230930)

Line 140885: Q12230930 is not a known grammeme for L1157035(عَشِقَ)

Lexical category: verb
Grammatical features: first person, singular, jussive, active, and most importantly fi'il muḍāri'

Line 1204950: Q12230930 is not a known grammeme for L34391(أَعْطَى)

Lexical category: verb
No forms or senses include fi'il muḍāri'.

Line 1341292: Q12230930 is not a known grammeme for L1156941(قيراط)

Lexical category: noun
One of its forms includes fi'il muḍāri' (L1156941-F4).
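
The manual checks above can be repeated for any lexeme and item. Here is a minimal sketch, assuming the JSON that Wikidata serves from `Special:EntityData`, where each form lists its `grammaticalFeatures` and the lexeme has a `lexicalCategory`. The function names and the `User-Agent` string are hypothetical. The sketch only looks at those two fields; the dictionary-parser may read other data as well.

```python
import json
import urllib.request


def fetch_lexeme(lexeme_id):
    """Download one lexeme as JSON from Wikidata's Special:EntityData endpoint."""
    url = f"https://www.wikidata.org/wiki/Special:EntityData/{lexeme_id}.json"
    request = urllib.request.Request(
        url, headers={"User-Agent": "grammeme-check-sketch/0.1"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["entities"][lexeme_id]


def forms_with_feature(lexeme, qid):
    """Return the IDs of the forms whose grammatical features include qid."""
    return [
        form["id"]
        for form in lexeme.get("forms", [])
        if qid in form.get("grammaticalFeatures", [])
    ]


def uses_item(lexeme, qid):
    """Report where qid appears: as the lexical category and/or on individual forms."""
    return {
        "lexical_category": lexeme.get("lexicalCategory") == qid,
        "forms": forms_with_feature(lexeme, qid),
    }
```

For example, `uses_item(fetch_lexeme("L1153733"), "Q51929154")` would show whether Plural Person is attached to any form of that lexeme at the time of the request. An empty result for a lexeme that was flagged may simply mean the entry was edited after the dump was taken.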


What I wanted to say is that for some of these entries the corresponding grammemes do exist in Wikidata, while for others they don’t.

So my question is:
Even when the grammeme is present on Wikidata, why does the tool still flag it as missing?

To resolve this, should these grammemes/POS be added or corrected on Wikidata, or is there something that needs to be adjusted on the Unicode Inflection side instead? I’m not entirely sure which direction to take.

Thank you!

@baha-bouali

baha-bouali commented Apr 15, 2025

For Arabic, I suspect that a lot of the issues involve missing words. You can either create the missing words manually, or you can use one of these templates. The templates are more convenient so that you don't have to type so many grammatical properties.

Hey @grhoten, could you please clarify where to find these missing words or what kind of list I should look for, so I can add the missing ones?

@grhoten
Member Author

grhoten commented Apr 15, 2025

  • Term (Q1969448)
  • Plural Person (Q51929154)
  • fi'il muḍāri' (Q12230930)

What I wanted to say is that: some of the corresponding grammemes do exist in Wikidata, while in others they don’t.

So my question is: Even when the grammeme is present on Wikidata, why does the tool still flag it as missing?

To resolve this, should these grammemes/POS be added or corrected on Wikidata, or is there something that needs to be adjusted on the Unicode Inflection side instead? I’m not entirely sure which direction to take.

To resolve these issues, Grammar.java needs to be updated to either map them to something meaningful or ignore them. This is a new tool, and Arabic is using some grammemes that are not used in other languages.

The term (Q1969448) property is likely incorrect. It is not a part of speech in any other language. If the entry is a noun phrase (a combination of words), it should be switched from term to noun phrase, or to another type of phrase with an explicit part of speech. Using term as a part of speech is confusing because it does not say how the entry functions grammatically. For example, "red herring" should be considered a noun phrase: it is an adjective with a noun, and each part can be inflected separately. Something like "businessman" is a single word and should be marked as a noun.

Plural Person (Q51929154) is a hard one; I have to look at the context. If it marks a possessive pronoun attached to a word, then the entire form should be ignored, which can be marked in Grammar.java. Based on what you've said, it sounds like the wrong plural may have been used. If the plural forms are incorrectly marked as plural person, switching them to plural would resolve the issue, but that needs to be confirmed first.

fi'il muḍāri' (Q12230930) is hard too. I suspect that it could be mapped to imperfective and non-past in Grammar.java, but I'm not sure that's correct. Some research is needed.
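
The three outcomes discussed above (map an item to known grammemes, ignore the item, or ignore the whole form) can be sketched as a small lookup table. This is only an illustration of the decision logic, written in Python; it is not the Grammar.java API. The names `Action`, `RULES`, and `resolve_form` are hypothetical, and the two entries are the unconfirmed guesses from this comment.

```python
from enum import Enum


class Action(Enum):
    MAP = "map"                      # replace the item with known grammemes
    IGNORE_PROPERTY = "ignore item"  # drop just this item, keep the form
    IGNORE_FORM = "ignore form"      # drop the whole form


# Hypothetical rules; each one still needs the research mentioned above.
RULES = {
    # fi'il muḍāri': unconfirmed guess
    "Q12230930": (Action.MAP, ("imperfective", "non-past")),
    # Plural Person: only appropriate if it marks an attached possessive pronoun
    "Q51929154": (Action.IGNORE_FORM, ()),
}


def resolve_form(features, rules=RULES):
    """Return (grammemes, unknown) for one form, or None if the form is ignored.

    Items without a rule are collected in `unknown` so they can still be reported.
    """
    grammemes, unknown = [], []
    for qid in features:
        if qid not in rules:
            unknown.append(qid)
            continue
        action, mapped = rules[qid]
        if action is Action.IGNORE_FORM:
            return None
        if action is Action.MAP:
            grammemes.extend(mapped)
        # Action.IGNORE_PROPERTY contributes nothing to the result
    return grammemes, unknown
```

Keeping the unresolved items in a separate list means a message like the ones in the tool output is still produced for anything that has no rule yet.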

could you please clarify where to find these missing words or what kind of list I should look for, so I can add the missing ones?

You can find the missing words by running the tests. The readme has instructions for how to run the tests.

If you're more interested in the test data, see these files.

https://github.com/unicode-org/inflection/blob/main/inflection/test/resources/inflection/dialog/inflection/ar.xml
https://github.com/unicode-org/inflection/blob/main/inflection/test/src/inflection/grammar/synthesis/QuantifyTest.cpp#L107 (QuantifyTest#testArabic)
