Problem saving links with hardcoded diacritics via the translation interface

Hello. I'd like to report a problem when saving links with hardcoded diacritics (č, ž, š) via the translation interface. The error provided was as follows: "The following parameter is unknown: %8". I have been able to save the link https://sl.wikipedia.org/wiki/Matemati%C4%8Dna_konstanta by editing the source of the message directly. The error occurred in Blockly:MATH CONSTANT HELPURL/sl. Should I file a bug on Phabricator? Thank you for your feedback.

Eleassar (talk)12:26, 20 March 2021
Edited by author.
Last edit: 21:05, 3 April 2021

URLs in basic format are not "hardcoded"; they are usually "URL-encoded", by transforming each byte of the UTF-8 sequence of a non-ASCII character into 3 ASCII bytes ("%nn" in hexadecimal). The leading byte is URL-encoded in "%C2".."%F4", the following 1 to 3 trailing byte(s) are in "%80".."%BF" (the number of encoded trailing bytes depends on the value of the encoded leading byte), so this bug affects any Unicode character that contains any trailing byte URL-encoded in "%80".."%9F", i.e. at least one half of all non-ASCII characters, because the Translatewiki.net UI incorrectly assumes that "%80".."%89" and "%90".."%99" (or "%8" and "%9" for actual bytes URL-encoded as "%8A".."%8F" or "%9A".."%9F") are incomplete scanf/printf placeholders, not correctly terminated by a valid type/format letter (so yes, this is a serious bug).
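As a small illustration of the encoding described above (a sketch in Python, not anything taken from Translatewiki.net itself), the standard `urllib.parse.quote` function percent-encodes the UTF-8 bytes of the title from the link in question. The helper `risky_escapes` is a hypothetical name introduced here; it only lists the escapes whose first hex digit is 8 or 9, i.e. the ones said above to trip the check.

```python
import re
from urllib.parse import quote

title = "Matematična_konstanta"

# quote() encodes non-ASCII characters as UTF-8 bytes, each written as %nn.
encoded = quote(title)
print(encoded)  # Matemati%C4%8Dna_konstanta

def risky_escapes(text):
    """Hypothetical helper: escapes that start with %8 or %9."""
    return re.findall(r"%[89][0-9A-Fa-f]", text)

print(risky_escapes(encoded))             # ['%8D']
print(quote("č"), quote("ž"), quote("š"))  # %C4%8D %C5%BE %C5%A1
```

Of the three letters mentioned in the report, only č has a trailing byte in the affected range, which is consistent with the error message naming "%8".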

When trailing bytes are in "%A0".."%BF", this bug does not occur because "%A" or "%B", in capitals, are not recognized as valid placeholders for C/C++, but the bug would occur if these bytes were encoded using non-standard lowercase in "%a0".."%bf" (because "%a" and "%b" could possibly be placeholders in some C/C++ libraries using extended scanf/printf formatters).

Translatewiki.net actually does not properly check whether these are valid C/C++ placeholders for printf/scanf. It does not even know whether the message will be used with a C/C++ library, and it also supposes that these could be placeholders for other languages (such as DOS-Command or NT-CMD shell variables, or libraries for Java, JavaScript, PHP, Python...).

Here you tried "%8D", which superficially looks like a printf/scanf placeholder with a field width of 8 but an unknown type/format letter "D"; as this is not recognized, Translatewiki.net assumes it could also be a basic "%n" placeholder for another syntax (a shell syntax for command.com/cmd.exe, or some other programming language or library), and reports an error because there is no such "%8" placeholder in the original string to translate. Translatewiki.net is just making superficial guesses, frequently based on false assumptions.
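To make the guess described above concrete, here is a minimal Python sketch of a naive "numbered placeholder" check. It is only an assumption about the general kind of check involved, not the actual Translatewiki.net code: the pattern, the function name `unknown_placeholders`, and the English source URL are all made up for illustration.

```python
import re

# Hypothetical pattern: "%" followed by one or more digits, e.g. "%1" or "%8".
NUMBERED_PLACEHOLDER = re.compile(r"%\d+")

def unknown_placeholders(source, translation):
    """Return placeholders found in the translation but not in the source."""
    known = set(NUMBERED_PLACEHOLDER.findall(source))
    found = set(NUMBERED_PLACEHOLDER.findall(translation))
    return sorted(found - known)

# Assumed source string, for illustration only.
source = "https://en.wikipedia.org/wiki/Mathematical_constant"
translation = "https://sl.wikipedia.org/wiki/Matemati%C4%8Dna_konstanta"

print(unknown_placeholders(source, translation))  # ['%8']
```

The "%C4" escape is ignored because "C" is not a digit, while the "%8" at the start of "%8D" is picked up as a parameter that does not exist in the source.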

  • One possible workaround could be to link to another redirecting article that does not use these characters in their title (e.g. redirecting the English title).
  • Another alternative would be to not URL-encode non-ASCII characters in the URL, using the IRI syntax supported by modern browsers (and supported also by the MediaWiki parser).
  • You've found the workaround of using the MediaWiki editor instead of the Translate UI (but note that even if you've edited it there, the text cannot be validated in the "review" UI of Translatewiki.net, and the problem will reappear if one needs to point to another URL on the target wiki).
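For the second workaround in the list above, the IRI form of the link can be obtained by decoding the percent-escapes. A minimal Python sketch using the standard `urllib.parse.unquote` (the variable names are only illustrative):

```python
from urllib.parse import unquote

encoded = "https://sl.wikipedia.org/wiki/Matemati%C4%8Dna_konstanta"

# unquote() decodes %nn escapes and interprets the bytes as UTF-8 by default.
iri = unquote(encoded)
print(iri)         # https://sl.wikipedia.org/wiki/Matematična_konstanta
print("%" in iri)  # False
```

With no "%" left in the string, there is nothing for a placeholder check to misread.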

This is a bug of the Translate interface itself: the solution would be to mark the resource to translate as not using the Mediawiki syntax (if this is the case) and containing no C/C++ "printf/scanf-like" placeholder when importing the resource on Translatewiki.net, so that its UI won't perform an incorrect validation check.

As far as I know, the UI of Translatewiki.net still does not support such flagging (unlike other common UIs for translating .po/.pot files for gettext) and just makes some "magic" guess of the format used, assuming it is either the MediaWiki syntax (where such URL-encoding of URLs is not needed, but where "{}" and "[]" are special-cased for MediaWiki's parser, as well as "$name" for Translatewiki's syntax itself and supported in the MediaWiki API internal libraries, and a few other syntaxes) or the placeholders in C/C++ "printf/scanf" format strings.

Translatewiki.net should not perform such magic guessing of the expected format. It creates false positives and does not properly validate everything that projects may need. It should support resource flagging like in .po/.pot translation tools (where the uncompiled ".po" format uses special comment lines starting with "#," before the resource texts starting with "msgid" or "msgstr"), using some metadata separated from the original string, possibly stored in the "/qqq" doc subpage, or in a dedicated "/qqx" subpage, or using some hidden tag in the source page (like TWN already does when prepending hidden "!!FUZZY" markers in translated subpages that need updates).
I would suggest prepending "!!FORMAT(...)" or "!!{...}" markers in source pages to store that metadata, and possibly in the translated pages as well (to make sure that these special markers are in sync with the original).
As well, a project should be able to store such global metadata for its whole message group (to provide some defaults that could be overridden in specific resources using the special marker), notably the programming language that will use these resources, so that TWN will know if printf/scanf format strings are really needed, or if MediaWiki or HTML syntax is expected. This means creating a special page for each message group, containing the group description, contact info, supported formats (including supported plural forms), URL to help/support pages, URLs to their bug tracker and relevant discussions or terminologies for specific languages...
These changes in TWN would make this wiki more universal and let it better support more projects. The alternative for these projects is to use online translation tools other than Translatewiki.net (most of them are built for .po resources with gettext)...

The alternative would be for Blockly to accept this string in IRI format, natively in UTF-8 (like in the address bar of modern web browsers, or in the MediaWiki parser), documenting that the URL should not be URL-encoded (this would require Blockly to URL-encode that IRI itself, either in its runtime code or during the import of translated resources from TWN to their code repository) and documenting that info in the "/qqq" doc subpage on Translatewiki.net. However, that last alternative requires involving Blockly developers, notifying them, and making them aware of what they can do when importing/exporting their project with Translatewiki.net (only these developers can instruct translators about what they can do).
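The encoding step that Blockly would then have to do on its side is small. Here is a rough Python sketch of what such a step could look like during import; `iri_to_uri` is a hypothetical helper name, and this is not Blockly's actual import code (Blockly itself is JavaScript, Python is used here only to illustrate):

```python
from urllib.parse import quote

def iri_to_uri(iri):
    """Percent-encode non-ASCII characters, leaving URL syntax untouched.

    "%" is kept as safe so that already-encoded links are not encoded twice.
    """
    return quote(iri, safe=":/?#[]@!$&'()*+,;=%")

iri = "https://sl.wikipedia.org/wiki/Matematična_konstanta"
print(iri_to_uri(iri))
# https://sl.wikipedia.org/wiki/Matemati%C4%8Dna_konstanta
```

Translators would then only ever see and edit the readable IRI form, and the encoded form would exist only in the project's repository or at runtime.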

Verdy p (talk)19:32, 20 March 2021

Hi, Verdy. Thank you a lot for your detailed and clear explanation. The last option would certainly be best but would require someone with the relevant technical knowledge to communicate with developers. Would you be willing to take this on? Otherwise, I'll ask a colleague or will try to do it myself. For the time being, I'm staying with the workaround. --Eleassar (talk) 08:14, 21 March 2021 (UTC)

Eleassar (talk)08:14, 21 March 2021

Feel free to link this talk thread. I would be pleased to see such suggestions implemented somewhere to improve Translatewiki.net and better support the hosted projects, with better communication with translators and improved checking and validation of translations.

Verdy p (talk)22:32, 3 April 2021
 
 

I have disabled variables check on this message.

Nike (talk)17:22, 14 April 2021