Integrate ar Wikidata into Unicode Inflection #62
There are 583 nouns in Arabic. See this query.
There are a fair number of missing words. The ones causing the most issues at the moment are the missing gendered determiners for this/that/these/those.
These options can be used to derive the data. The code currently expects the pronouns to be determiners. If "this" changes from a pronoun to a determiner, we can change the inflection type to mirror the change.
Most of the Arabic words in the list are legitimate words.
In Wikidata, this is a noun phrase, and none of its forms are marked as Term.
This is a noun phrase, but one of its forms (L1160500-F1) is marked as Term (Q1969448).
This is a noun, and none of its forms are marked as Term.
Lexical category: adjective but it has
Lexical category: noun
Lexical category: noun
Lexical category: verb
Lexical category: verb
Lexical category: noun
What I wanted to say is that some of the corresponding grammemes exist in Wikidata, while others don't. So my question is: to resolve this, should these grammemes/POS be added or corrected on Wikidata, or does something need to be adjusted on the Unicode Inflection side instead? I'm not entirely sure which direction to take. Thank you!
Hey @grhoten, could you please clarify where to find these missing words or what kind of list I should look for, so I can add the missing ones?
To resolve these issues, Grammar.java needs to be updated to map them to something meaningful or to ignore them. This is a new tool, and Arabic is using some grammemes that are not used in other languages.
The term (Q1969448) property is likely incorrect. It's not a part of speech in any other language. If it's a noun phrase (a combination of words), then it should be switched from term to noun phrase or another type of phrase with an explicit part of speech. Using term as a part of speech is confusing, because it does not make the word's function as a part of speech explicit. For example, "red herring" should be considered a noun phrase. It's an adjective with a noun, and each part can be inflected separately. If it's something like "businessman", then that's a single word, and it should be marked as a noun.
The Plural Person (Q51929154) is a hard one. I have to look at the context. If it's dealing with a possessive pronoun attached to a word, then the entire form should be ignored, which is possible to mark in Grammar.java. Based on what you've said, it sounds like perhaps the wrong plural was being used. If the plural forms are incorrectly marked as plural person, then switching them to plural instead of plural person would resolve the issue. Whether that is the actual problem still needs to be confirmed.
The fi'il muḍāri (Q12230930) is hard too. I suspect that it should be possible to map it to imperfective and non-past in Grammar.java, but I'm not sure if that's correct. Some research is needed.
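To make the "map or ignore" idea concrete, here is a minimal, self-contained sketch. It is not the actual Grammar.java code: the class name `GrammemeMappingSketch`, the `Rule` type, and the `resolve` method are hypothetical and exist only for illustration. The mapping of Q12230930 to imperfective and non-past is the unconfirmed guess from the comment above, and treating Q51929154 as "ignore the whole form" is only one of the options discussed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical illustration only; this is not the real Grammar.java API. */
final class GrammemeMappingSketch {

    /** One rule per Wikidata feature item: map it to grammemes, or drop the form. */
    static final class Rule {
        final boolean ignoreForm;
        final List<String> grammemes;

        private Rule(boolean ignoreForm, List<String> grammemes) {
            this.ignoreForm = ignoreForm;
            this.grammemes = grammemes;
        }

        static Rule mapTo(String... grammemes) {
            return new Rule(false, List.of(grammemes));
        }

        static Rule ignoreWholeForm() {
            return new Rule(true, List.of());
        }
    }

    // Assumed rules taken from the discussion; both still need confirmation.
    private static final Map<String, Rule> RULES = Map.of(
        "Q12230930", Rule.mapTo("imperfective", "non-past"), // Arabic imperfect verb feature
        "Q51929154", Rule.ignoreWholeForm()                  // Plural Person
    );

    /**
     * Resolves the feature items of a single form.
     * Returns Optional.empty() when any rule says the form should be ignored,
     * and throws for items that have no rule yet so they can be reviewed.
     */
    public static Optional<List<String>> resolve(List<String> featureIds) {
        List<String> result = new ArrayList<>();
        for (String id : featureIds) {
            Rule rule = RULES.get(id);
            if (rule == null) {
                throw new IllegalArgumentException("Unmapped grammeme: " + id);
            }
            if (rule.ignoreForm) {
                return Optional.empty();
            }
            result.addAll(rule.grammemes);
        }
        return Optional.of(List.copyOf(result));
    }

    public static void main(String[] args) {
        System.out.println(resolve(List.of("Q12230930")));
        System.out.println(resolve(List.of("Q51929154", "Q12230930")));
    }
}
```

Failing loudly on unmapped items is a design choice for this sketch: it surfaces every grammeme that still needs a decision, either a fix on the Wikidata side or a new rule on the tool side.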
You can find the missing words by running the tests. The readme has instructions for how to run them. If you're more interested in the test data, see this file: https://github.com/unicode-org/inflection/blob/main/inflection/test/resources/inflection/dialog/inflection/ar.xml
The revised dictionary-parser can parse Wikidata, but some issues need to be resolved.
The initial issues include:
Tool output that needs to be addressed:
Here are the currently generated lexical dictionary files for debugging the test failures.
ar.zip