
GRAMMAR

@Balyozxane: Yes, I would have done it, but I forgot: see [1].

Verbs

For verbs, something like this is needed, similar to Template:Gender:

{{lastletter:$1|Text for consonant letter|Text for vowel letter|Optional text}}.

Example: Ji bîr neke ku ev tenê pêşdîtineke [[:$1]] {{lastletter:$1|e|ye|(y)e}}.
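
The behaviour proposed for {{lastletter:...}} could be sketched as follows. This is only an illustration in Python, not an existing MediaWiki function: the name `lastletter`, the vowel list, and the fallback to the optional text when no final letter can be inspected are all assumptions.

```python
# Vowels of the Kurmancî Latin alphabet; any other letter counts as a consonant.
KURMANJI_VOWELS = set("aeêiîouû")


def lastletter(word: str, consonant_text: str, vowel_text: str,
               optional_text: str = "") -> str:
    """Hypothetical equivalent of {{lastletter:$1|consonant|vowel|optional}}."""
    stripped = word.strip()
    # Assumption: fall back to the optional text when there is no final letter.
    if not stripped or not stripped[-1].isalpha():
        return optional_text
    if stripped[-1].lower() in KURMANJI_VOWELS:
        return vowel_text
    return consonant_text


for name in ("Gund", "Wîkîpediya", "2022"):
    print(name, lastletter(name, "e", "ye", "(y)e"))
```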

Nouns

For nouns, knowing the "last letter" and the "gender" is also necessary (see the Wiktionary module):

See [2].

Lîsta $1(y)ê hat jêbirin.: here the gender of the parameter $1 is needed, because for a masculine noun it is "$1yî",

as in Lîsta $1{{Gender:$1:|(y)î|(y)ê}} hat jêbirin.
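
A minimal sketch of how gender and last letter could be combined for this message, in Python. The suffixes "î" (masculine) and "ê" (feminine) come from the message above; reading "(y)" as a buffer letter inserted only after a final vowel is an assumption, and `attach_suffix` and `list_deleted` are hypothetical names.

```python
KURMANJI_VOWELS = set("aeêiîouû")

# Suffixes taken from the message above: masculine "(y)î", feminine "(y)ê".
GENDER_SUFFIX = {"m": "î", "f": "ê"}


def attach_suffix(noun: str, gender: str) -> str:
    """Append the gendered suffix, inserting "y" after a final vowel."""
    if gender not in GENDER_SUFFIX:
        raise ValueError(f"gender must be 'm' or 'f', got {gender!r}")
    if not noun:
        raise ValueError("empty noun")
    # Assumption: "(y)" means a buffer "y" that appears only after a vowel.
    buffer = "y" if noun[-1].lower() in KURMANJI_VOWELS else ""
    return noun + buffer + GENDER_SUFFIX[gender]


def list_deleted(noun: str, gender: str) -> str:
    return f"Lîsta {attach_suffix(noun, gender)} hat jêbirin."


print(list_deleted("gund", "m"))
print(list_deleted("Wîkîpediya", "f"))
```

The function refuses an unknown gender instead of guessing, which mirrors the point made here: without the gender of $1 the correct form cannot be chosen.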

For nouns, on phab: creating/fixing $wgGrammarForms (Grammar) is needed, see:

Conclusion

A template like this is needed:

{{Grammar:def/indef-case|$1}}

  • def: definite (en)/binavkirî; indef: indefinite (en)/nebinavkirî
  • case (en)/rewş: nominative/navkî, construct/îzafe, oblique/çemandî, vocative/bangkirin

Examples:

  • ($1: feminine, singular) {{Grammar:indef-construct|sêv}} --> sêveke
  • ($1: masculine, singular) {{Grammar:def-oblique|gund}} --> gundî
  • ($1: masculine, plural) {{Grammar:def-oblique|gund}} --> gundan
  • ($1: masculine, singular) {{Grammar:indef-oblique|gund}} --> gundekî
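
The four examples above can be reproduced with a small lookup table. The Python sketch below is deliberately limited to those four forms and to consonant-final words; it is not a full Kurmancî declension table, and the function name `grammar` and its extra `gender` and `number` arguments are assumptions, since the proposed syntax does not say where that information would come from.

```python
# Suffix table limited to the four forms listed in the examples above.
SUFFIXES = {
    ("indef", "construct", "f", "sg"): "eke",
    ("def", "oblique", "m", "sg"): "î",
    ("def", "oblique", "m", "pl"): "an",
    ("indef", "oblique", "m", "sg"): "ekî",
}


def grammar(form: str, word: str, gender: str, number: str) -> str:
    """Hypothetical {{Grammar:def/indef-case|$1}} with explicit gender and number."""
    definiteness, case = form.split("-", 1)
    key = (definiteness, case, gender, number)
    if key not in SUFFIXES:
        # Forms not shown in the examples are intentionally left undefined.
        raise KeyError(f"no suffix known for {key}")
    return word + SUFFIXES[key]


print(grammar("indef-construct", "sêv", "f", "sg"))
print(grammar("def-oblique", "gund", "m", "pl"))
```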

Problem

Are the gender and number of the parameters ($1, $2, ...) known?

Ghybu (talk)17:54, 22 December 2022
Edited by author.
Last edit: 21:09, 22 December 2022

It's a common case in translations that we don't provide the grammatical gender or grammatical number of translated items, along with their translations (the grammatical number may be different from what was in the original source in English).

When they are later reused in a variable to compose other messages, nothing is transmitted (this does not occur with the user's gender, which has specific support, provided that these users have registered a local account with their stored preferences).

A similar problem occurs when translated items have special beginnings or endings: when they are used to compose other messages in sentences, we may need to contract the surrounding words, possibly remove spaces when the contraction needs an apostrophe (e.g. in French or Italian, but sometimes in English as well), or even mutate a part of the term given in the value of a parameter.

More generally we have no way to translate a term not just into another term, but to a term with additional properties (gender, plural, case, initials/finals treated specially), or to a set of contextually selectable variants (that the "GRAMMAR:" parser function could recognize to select or generate the appropriate form). Those additional properties (or specially tagged translations having several contextual alternatives required) are not transmitted: all translated items are currently "opaque". We have no way to add variants or properties that a true linguistic engine would need.

The assumption that a translation of any term from language A to language B is necessarily unique is false. For now we instead need to ask each project to add more translation units to be translated separately, hoping that the target software will manage them correctly (this is already the case even for user gender, and even when using numbers with plural rules).

A generic solution would be to allow, for each translation unit and each translation language, the definition of additional properties or variants, and to tag them properly so that they become selectable (the current support with "PLURAL:" and "GENDER:" is too limited, and with "GRAMMAR:" it is often impossible to implement correctly without making false guesses, or using a possibly large dictionary that will be difficult to maintain).


That's a problem that will be interesting to solve with the future multilingual Wikipedia (with the help of new modules in Wikifunctions, or with Lua modules for Scribunto in specific wikis), which will offer such a possibility. But that should also be integrated into the "Translate" extension of MediaWiki. For this to work, we need to be able to define suitable "tags" in each language for selectors of variants. The set of tags may be open (not the same from one language to another); such tags exist and are used in some Wiktionaries for identifying their entries. Something similar has been attempted separately in Wikidata for tagging and identifying the various "forms" that a "lexeme" can take.

And we should exchange comments and solutions about this frequent problem with the Unicode CLDR project for internationalization libraries (CLDR has exactly the same problem but proposes no standard solution at all for now; this would require updating some LDML specifications and extending its own data model). For now only Wikidata proposes something that could be integrated (but for deployment in applications, they need stability of the "tags" they use, because they won't query Wikidata each time: we need a set of stable tags for each language that can be used to select variants, whereas translated message bundles will possibly contain multiple selectable variants for the same message).

Some projects may also need to provide their own subset of tags, but will need to document them, so that translators will know how to tag the needed variants, and whether variants dependent on some tags are required, or optional and can be inferred from other defined variants (this is already the case for plural forms, which is just a more specific subproblem, but one that is only partially solved!). They should also add the possibility for each variant to return not just the text but also some additional tags, to allow safe reuse.

Note that I have already submitted such concepts to other projects, including the future multilingual Wikipedia and Wikifunctions, CLDR, Wikidata with its new lexemes, and Wiktionary. Defining a common set of tags (and maintaining a registry of these tags to preserve their reusability) is something important that should be considered in the future for internationalization and translation.

These linguistic "tags" may also have uses other than purely grammatical ones. They could carry contextual information, as well as user preferences (such as the "formal" or "informal" language register, or a "scientific" vs. "vernacular", "simplified", "slang", or "obsolescent/old" style, with more specific and possibly ordered preferences depending on the targeted audience/age), or other stylistic information (e.g. to avoid the repetition of some words or long expressions, by allowing them to be replaced by pronouns or shorter expressions, possibly abbreviations, or to allow the use of alternative but equivalent terms, for example to avoid alliterations and get a more "fluid" and "natural" language that a human would prefer to use and read). And in fact, these tags would also provide some additional documentation to the source messages about their intended use (e.g. as a verb for an action to be taken, or as a noun) and some other technical restrictions (such as max length, format, HTML allowed/disallowed, restricted subsets of characters, for example in identifiers)... All these use cases require being able to add variants, and being able to select them contextually and coherently.

We humans do that naturally and constantly in our languages, without thinking much about it, but these choices follow some language-specific logic that CAN be computed using productive rules (if "tags" are standardized and maintained in some registry like Wikidata). And in fact, if you've followed the proposals for the future multilingual Wikipedia, the proposed syntax will require using a large set of "functions" that will need to be named, or to use such named tags in parameters, and that will be able to handle and reduce possibly large sets of possible variants (including automatically derived terms and exceptions): the source text will not be English, but written in a metalanguage entirely made of such metadata tags, and the first thing that a translation engine usually does is to try guessing that metadata form. Human linguists do the same thing when they analyse a text and "tag" it with semantic or grammatical meta-information.

And we all do that kind of analysis (most often unconsciously in our "native" languages) to judge the "quality" or "beauty" of any spoken or written text (poets, famous writers, humorists, singers and other artists are experts at doing this analysis consciously, with hard work when trying to improve their texts and carry their intent, emotion, or perception/view of any topic; politicians, merchants, advertisers, and recruiters are experts at doing this for their own goals).

Verdy p (talk)20:01, 22 December 2022

Yes that's what I thought: gender and number are not always given.

On the other hand, I think it is possible to create a model like "Gender" (a new magic word?) that tells you whether the last letter of a word is a consonant or a vowel, and to act accordingly.

Ghybu (talk)21:00, 22 December 2022
  • The magic keyword "GENDER:" is already taken, but for a limited usage: its first parameter is a username on the local wiki (and by default it is the user reading the page, whose human gender may be unknown or voluntarily set by that user to neutral). It is completely unrelated to grammatical gender (which may even follow other rules, e.g. for non-humans or inanimate concepts, or for reasons of style/politeness/respect or disrespect, or which sometimes changes depending on plurals or uncountability).
  • We have a keyword "GRAMMAR:" whose first parameter is already a "tag", but it does not accept a list of tags (some required, others optional to alter the rendered variant).

But we have NO way to return BOTH the text of the translated variant, AND classification tags on output; only one input tag is accepted.

  • The magic keyword "PLURAL:" has no input tags, it only takes a numerical value on input.

Returning tags only on output will be inefficient: it requires returning all possible variants, each one with its own tags for external selection, or returning some functional object that will be able to generate the selection without having to enumerate all possible variants. For common cases, like derived terms in the conjugation of verbs, or grammatical cases, and more complex cases like German verbs with their "detachable" particles, or pronominal verbs, we need input tags. Output tags should be used for the generated sets of variants. Then we'll need some orchestrator to reduce the sets of variants and generate the function calls that will make the selections. But for translators this is a complex task: we need to allow them to simply add variants and complement each output with one or more tags that they can easily select from a set of known tags; and these tags must make sense to human translators (so tag names must follow some known and documented convention). That's not something easy to design in a single basic "magic keyword" in MediaWiki, except for very limited use cases.


IMHO, using tags on input rather than output will be more powerful and will offer a simpler interface. Putting all tags on input requires using a "functional" syntax, but with possibly unordered parameters (the input would just be some unordered set of tags). Adding variants for a given message to translate just requires each added variant to have a distinct set of tags, including the empty set, which would be the default translation (that should always be usable in isolation, e.g. as an item in a bulleted list or in an index, but not as a "title" or section header, as that adds some specific usage context requiring its own generative tag, for example for its capitalization).

If we accept the fact that "tags" cannot be named with any space in them (only alphanumeric or hyphens), then such solution is implementable in MediaWiki using the existing "GRAMMAR:" magic keyword as:

{{GRAMMAR: |tag1 tag2 = variant1 |tag1 tag3 = variant2 | ...}}

and it would be important to remember that the order of those tags is not significant (so "tag1 tag2" and "tag2 tag1" are equivalent). Optionally some wildcard may be used, but it would complicate the task for translators. Tags should be well known and documented, not invented at will, so that each one can be validated.
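
A minimal sketch of how such a tag-keyed variant list could be parsed and queried, in Python. Only the unordered tags, the alphanumeric-or-hyphen tag names and the empty set as default come from the proposal above; the exact-match lookup with fallback to the default, and the names `parse_variants` and `select`, are assumptions. Variant texts containing "|" or "=" are not handled.

```python
import re


def parse_variants(body: str) -> dict:
    """Parse "tag1 tag2 = variant1 | tag1 tag3 = variant2 | default"."""
    variants = {}
    for part in body.split("|"):
        part = part.strip()
        if not part:
            continue
        if "=" in part:
            tag_text, text = part.split("=", 1)
            tags = frozenset(tag_text.split())
        else:
            # No tags: this is the default variant (the empty tag set).
            tags, text = frozenset(), part
        for tag in tags:
            if not re.fullmatch(r"[A-Za-z0-9-]+", tag):
                raise ValueError(f"invalid tag name: {tag!r}")
        if tags in variants:
            raise ValueError(f"duplicate tag set: {sorted(tags)}")
        variants[tags] = text.strip()
    return variants


def select(variants: dict, requested) -> str:
    """Exact match on the unordered tag set, else the default variant."""
    return variants.get(frozenset(requested), variants.get(frozenset(), ""))


forms = parse_variants("| m sg = gundî | m pl = gundan | gund")
print(select(forms, ["sg", "m"]))  # the order of the tags does not matter
```

Using a frozenset as the key is what makes "tag1 tag2" and "tag2 tag1" equivalent without any explicit sorting.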

But this new syntax (only to be used in the middle of a translated message) may as well use another separate keyword like "SELECT:", "VARIANTS:" and so on.

For the case of variants making up the whole content of the message, we don't necessarily need any syntax to be exposed to translators, as they may just have a "+" button to add variants and an extra field where they can add suitable tags (the validator would just have to check that the sets of tags are distinct for each added variant of the message, and would store them in some "canonical" order). And the UI may offer facilities to easily select "known" tags from a repository appropriate for each language, so that translators have less difficulty selecting and using them. That syntax would be hidden.
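
The validator described here could look roughly like the Python sketch below. The tag repository `KNOWN_TAGS`, its contents, the choice of alphabetical sorting as the "canonical" order and the function names are all hypothetical; nothing like this exists in the Translate extension as far as this thread states.

```python
# Hypothetical per-language repository of known tags.
KNOWN_TAGS = {
    "ku-latn": {"def", "indef", "m", "f", "sg", "pl", "oblique", "construct"},
}


def canonical(tags) -> tuple:
    """Assumed canonical order: unique tags sorted alphabetically."""
    return tuple(sorted(set(tags)))


def validate_variants(language: str, variants) -> dict:
    """Check (tags, text) pairs entered in the UI and return them keyed canonically."""
    known = KNOWN_TAGS.get(language, set())
    stored = {}
    for tags, text in variants:
        unknown = set(tags) - known
        if unknown:
            raise ValueError(f"unknown tags for {language}: {sorted(unknown)}")
        key = canonical(tags)
        if key in stored:
            raise ValueError(f"tag set used twice: {key}")
        stored[key] = text
    return stored


print(validate_variants("ku-latn", [(["sg", "m"], "gundî"), ([], "gund")]))
```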

Any translated message (even if it has a single variant listed) could have one or more tags added to it (including the single default variant): those tags would then be able to provide the needed semantics that also allow generators (like conjugators for verbs) to work with them. Internally each stored variant would be used as a pure functional object, capable of using the tags given on input and the defined text of the variant to generate the simple text output. Basic i18n libraries or applications would ignore the input tags and would work only with the defined text (but would lose the semantics of that text if they use it to generate other texts).

The main difference is that stored translations would no longer be opaque texts.

Verdy p (talk)22:48, 22 December 2022
 

@Ghybu: Yes, that is possible, I have done so for genitive GRAMMAR in Norwegian; when words are put in the genitive in Norwegian, we just add an "s", unless the word ends in s, x or z, in which case we add an apostrophe instead. It should be easy to do something similar for Kurdish, where the output depends on whether the final letter is a vowel or not.
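
The Norwegian rule as stated above fits in a few lines. This Python sketch only illustrates the rule described in this post; it is not the actual MediaWiki implementation, and the function name is made up.

```python
def norwegian_genitive(word: str) -> str:
    """Add "s", or an apostrophe when the word already ends in s, x or z."""
    if not word:
        return word
    if word[-1].lower() in "sxz":
        return word + "'"
    return word + "s"


for name in ("Wikipedia", "Jens", "Max"):
    print(norwegian_genitive(name))
```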

Jon Harald Søby (talk)12:02, 23 December 2022

There are some complexities: the content you test for a final s may be formatted (so at the end of the content you may find HTML tags around the text, or some image, or some existing apostrophe-quotes for the MediaWiki syntax of bold/italic styles). The last displayed character in the content may also be some punctuation, or it may be hidden because that content was generated or formatted with a template transclusion or function call in a module (and that template or function may also not provide the metadata for the grammatical or lexical semantics of what it returns with the same template call: the template or function would have to make that change itself, otherwise the parsing will be complex and may be faulty).

Here again we fall back on the assumption that a single (wiki)text result is sufficient. But then where do we store and return the necessary metadata that allows further processing to be done correctly? This could be in the same (wiki)text; however, this requires defining an encoding syntax for it (it could be some hidden tags that get stripped at the end of processing, or some JSON or XML syntax, or a magic keyword, along with some escaping mechanisms for safer encapsulation... as long as further processes can handle it).
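
To illustrate these complexities, here is a rough Python sketch of what "find the last displayed letter" would have to do before any final-letter test. It handles only HTML tags, runs of apostrophes used for bold/italic, and trailing punctuation, and it gives up on unexpanded template or parser function calls; the function name and the decision to return None in that case are assumptions.

```python
import re
from typing import Optional


def last_visible_letter(wikitext: str) -> Optional[str]:
    """Best-effort guess of the last displayed letter, or None if unknown."""
    if "{{" in wikitext:
        # An unexpanded transclusion hides the real content: refuse to guess.
        return None
    text = re.sub(r"<[^>]*>", "", wikitext)  # drop HTML tags
    text = re.sub(r"'{2,}", "", text)        # drop bold/italic quote runs
    for char in reversed(text):
        if char.isalpha():                   # skip punctuation, digits, spaces
            return char
    return None


print(last_visible_letter("'''Jens'''"))
print(last_visible_letter("<b>Max</b>."))
print(last_visible_letter("{{SITENAME}}"))
```

Even this small helper already needs three special cases, which supports the point that a single text result without metadata makes such rules fragile.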

Verdy p (talk)13:21, 23 December 2022

@User:Verdy p Are you saying template Grammar shouldn't be used? We could make do with something like MediaWiki:Aboutsite/fi.

@ User:Jon Harald Søby Is there a know-how for creating something similar to what Finnish uses?

Balyozxane (talk)16:25, 25 December 2022

The Grammar documentation page is not about any "template". It's a parser function of MediaWiki (note the required presence of the colon after the keyword "GRAMMAR", before all other pipe-separated parameters, instead of the pipe if this was a template named "GRAMMAR").

In some cases we could also use a template syntax on MediaWiki, as a possible wrapper (or workaround) for the parser function. Some non-MediaWiki projects may also use their own syntax, or a syntax similar to these two in MediaWiki. It does not matter; I have not said that either syntax "shouldn't" be used.

But the usable syntax must be clear, because the keyword used is significant. It may or may not be case-sensitive: the names of parser functions are not case-sensitive, and may be translated with some known aliases, but in translatable messages we should always use the untranslated name, which should still be usable on non-English wikis. For MediaWiki template names, however, there are no English aliases unless they are defined as redirect pages (and to avoid the deletion of these redirects, which may not be used in wiki pages but only in translatable messages, they should be documented as aliases/synonyms with an internal comment on the redirect page, or a categorization saying that they are needed for translatable MediaWiki messages).

The other problem then is the syntax of parameters, the parameter names if they are named, and their order if it is significant; then comes the syntax and semantics of their values (what is permitted, and the interpretation of values if there are restrictions).

Using the parser function syntax may add more restrictions/limitations than using the template syntax, which could also apply some transforms or add supplementary features, including support for aliased values, fallbacks and so on, whilst being written either entirely as a template, or using MediaWiki parser functions, or invoking a function in a Lua module supported by the Scribunto parser function, or a mix of all that.

The syntax used in the translatable message does not mean that the project using that message must absolutely be using MediaWiki, but that syntax should be minimalist (with minimal technical tricks, which will also simplify the work performed by the local message validator used in translatewiki.net), so that any implementation may be easily plugged into the project via its own i18n library, or easily converted to the syntax that will finally be used in the runtime of the deployed application/project (such conversion would occur when exporting messages from translatewiki.net to the project's repository via its own import tool). And to help translators, the project page in translatewiki.net should have a link to the documentation of the syntax supported for such a "grammar" extension.

Note that if the project uses a template-like syntax, it may also be possible to provide a syntax helper in translatewiki.net, as long as the template name used is specific enough and related to the project using it. That syntax helper may effectively render a link on translatewiki.net to the documentation page for that project. And IMHO, if the syntax used is for a non-MediaWiki-based project (for example a project in Python, JavaScript, or C/C++), the name chosen should include some project-specific prefix, instead of using a bare "GRAMMAR" name, which should be reserved for MediaWiki-based projects. The MediaWiki syntax for transcluding templates with parameters is flexible enough to be easily supported by many non-MediaWiki implementations; the syntax with parser functions is less flexible and adds further complexities for message validation in translatewiki.net: see for example the complexities caused by "PLURAL" syntaxes.

Verdy p (talk)18:07, 25 December 2022
 

For common nouns in Kurmancî Kurdish, you need to know the gender of the words (masculine or feminine) for the declension. However, other than usernames, gender is not specified.

So if we create it, it will only be useful for usernames.

But I think it is necessary to create a model that tells us whether the last letter of a word is a vowel or a consonant to solve this kind of problem.

Then by combining the "Gender" template and knowing the last letter of the username, we can even decline it...

And I think it's possible to do it, you just have to find the time to make the request on "Phabricator" :)

Ghybu (talk)18:29, 25 December 2022

In some translated messages, it may be possible to find a formulation that does not depend on the grammatical gender (or sometimes plural) of the entity, and does not require any conjugation. For example, instead of translating "$1 sent a message to $2", you would translate it as if it was "message sent (sender: $1; recipient(s): $2)". Such a trick is used in Russian, for example, but often requires adding some punctuation and some reordering, so that variables can be used in their isolated form. However, the result may not always be as easy to understand as just keeping a single form assuming a default gender, a default plural form, a default grammatical case or a default article.

Handling contractions and internal mutations can be very tricky with a basic transformation rule that would ignore a lot of exceptions and common usages.

The same is true when you attempt to derive an adjectival form from a language name, because the language name is sometimes plural and may also already include the term translating "language" that you'd want to use with the translated adjectival form of the language name: this cannot be done if the translation is just a single text and does not itself contain any metadata for correctly reducing its set of possible derivations, and finally selecting the best one at the end of the text generation.

If multiple choices still remain, each one of these choices should have at least one distinctive piece of metadata, or should be sorted by order of occurrence in the set, or better, be distinguished by some given or computable "score" metadata; one such computable score could be to use the shortest formulation, but an actual metadata score may contain some tag indicating a more common modern usage, or tags for politeness or level of formality.
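
One way to combine these tie-breakers is sketched below in Python. The ordering chosen here (an explicit score first, then the shortest text, then order of occurrence) is only one possible reading of the paragraph above, and the `Variant` class and `pick_variant` function are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Variant:
    text: str
    score: Optional[float] = None  # hypothetical "score" metadata; higher wins


def pick_variant(variants: Sequence[Variant]) -> Variant:
    """Prefer scored variants, then shorter text, then earlier occurrence."""
    if not variants:
        raise ValueError("no variant to choose from")

    def rank(item):
        index, variant = item
        if variant.score is not None:
            return (0, -variant.score, len(variant.text), index)
        return (1, 0.0, len(variant.text), index)

    return min(enumerate(variants), key=rank)[1]


candidates = [Variant("the Kurdish language"), Variant("Kurdish")]
print(pick_variant(candidates).text)
```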

If the text output is in MediaWiki or HTML or some other rich text format, it may even be possible not to make any predetermined choice, by using a rendering form based on dynamic features (provided by HTML or JavaScript or similar), which could take into account user preferences, so that the actual displayed text may be derived on the client side. There are some examples in Wikipedia, where the generated text contains some microtagging semantic features, allowing such client-side modifications of the visible contents; this occurs for example with dates or with some accessibility features. However, the server or application using those "rich messages" must be prepared to deliver the accessibility tool, or should document and standardize the microtagging system it uses. And such a thing is not just needed for translation; it could be useful as well in any monolingual content (even in English only). So this is not just a problem of "internationalisation" (i18n) or "regionalisation" (r13n) but more generally a problem of "localisation" (l10n). You have a good example with the spelling of the quoted words in the last sentence (is it an "s" or a "z" in English?).

Verdy p (talk)21:34, 25 December 2022

Personally, I don't see a real problem. A person using the site (origin of messages) will know how to use these templates correctly. If there is an error, the translator will come back here to correct it... And if he's nice :) he will complete the documentation ("Information about message") to help others so that they don't make the same error!

Ghybu (talk)06:19, 26 December 2022

Yes, but a template-based solution will not work on all wikis if these are generic MediaWiki messages (e.g. to be deployed to Miraheze, or elsewhere). The documentation may however be linked and shared on the MediaWiki wiki. Such a thing cannot be decided in isolation.

Verdy p (talk)13:13, 26 December 2022

It's already decided. There is a Grammar parser function and we want to utilize it only for SITENAME and usernames. Why are you complicating this even more? "message sent (sender: $1; recipient(s): $2)" is not always possible to implement and we need declensions for SITENAME. For example we have to omit SITENAME on MediaWiki:Searchsuggest-search/ku-latn or use "SITENAME: Lê bigere" which is not ideal at all. We encounter SITENAME in translations often enough to justify utilizing something that already exists. Can we please stop discussing general localization issues and start doing something about our problem?

Balyozxane (talk)19:33, 26 December 2022