TheWikipediaLibrary - json-based i18n files in addition to existing gettext

Fragment of a discussion from Support

I agree that using plain UTF-8 would be better. The "\uNNNN" escape can cause problems: it requires encoding every character from the supplementary planes as a surrogate pair, and the convention is not universal, as some tools require the long form "\U00NNNNNN" instead. That form is overlong, and some tools may not accept the leading zeroes, generating ambiguities unless they use a strictly conforming JSON parser.

There is no such problem with UTF-8: the only escapes needed are for quotation marks and backslashes inside the actual text, plus a few ASCII controls like TAB and NEWLINE. This is much simpler to parse anyway, even with those few ASCII exceptions.
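The surrogate-pair behaviour can be seen with Python's json module (Python is used here purely for illustration): the default ensure_ascii=True escapes a supplementary-plane character as a \uNNNN surrogate pair, while ensure_ascii=False emits plain UTF-8.

```python
import json

# U+1D11E MUSICAL SYMBOL G CLEF lives in a supplementary plane.
clef = "\U0001D11E"

# Default behaviour: the character is escaped as a UTF-16 surrogate pair.
escaped = json.dumps({"msg": clef})
print(escaped)  # {"msg": "\ud834\udd1e"}

# ensure_ascii=False keeps the text as plain UTF-8 -- no escape needed.
plain = json.dumps({"msg": clef}, ensure_ascii=False)
print(plain)    # {"msg": "𝄞"}

# Both forms round-trip to the same string under a conforming parser.
assert json.loads(escaped) == json.loads(plain) == {"msg": clef}
```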

You should ensure that the generated JSON is strictly valid per the JSON standard (and not dependent on the old JavaScript implementations that had these caveats). You may be lenient when parsing input JSON, as long as there is no possible ambiguity.

Verdy p (talk)17:20, 30 March 2021

This is all extremely useful stuff. We should be able to use UTF-8 in our initial output and add a JSON linter to our CI/CD process to make sure that we can parse the input.

On the creation of files: It's actually easiest for us to create them all no matter what, but we can update to cull the empty ones. We do have a couple of existing translations currently, so it sounds like the ideal process might be to create qqq, en, and then any translations that we do have?

Jsn.sherman (talk)17:46, 30 March 2021

Hi Jsn.sherman

The JSON structure appears to be good and we should be able to process it on translatewiki.net. It will require a small change on our end to read the new JSON file. This change will also be needed if more JSON files are added in the future. You can also use the banana-i18n format for strings in the JSON.

Could we skip the keys that don't have any definition such as: "24_description", "78_description"? These keys appear to be machine generated, can we assume that these keys will not change?
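Skipping those undefined keys when the file is generated could look like this (a sketch; the source dictionary and helper name are hypothetical, not taken from the TWLight code):

```python
def build_messages(source):
    """Keep only keys whose message text is non-empty after stripping whitespace."""
    return {key: text for key, text in source.items() if text and text.strip()}

raw = {
    "24_description": "",  # machine-generated key with no definition: dropped
    "78_description": "",
    "example_welcome": "Welcome to the library",  # illustrative message
}
print(build_messages(raw))  # {'example_welcome': 'Welcome to the library'}
```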

I see that (almost) all the strings in the English language file end with "\n". Can this be added programmatically? It may be difficult for translators to recognize and add this at the end of every translation that they make.

Regarding creation of the language / translation files, only create the language files for which you have translations. When importing the messages translatewiki.net will take care of reading them.

One more thing, translatewiki.net will add a @metadata key to the JSON file. Example: https://github.com/wikimedia/mediawiki-extensions-TwoColConflict/blob/master/i18n/af.json#L2. I hope that this is not an issue.
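For reference, the added key sits at the top of each language file and looks roughly like this (the field values here are illustrative, modelled on the linked af.json, and the message key is a placeholder):

```json
{
    "@metadata": {
        "authors": [
            "TranslatorName"
        ]
    },
    "example_message_key": "..."
}
```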

Regards,

Abijeet Patro (talk)09:03, 31 March 2021

Yeah, I can see that we need to normalize those trailing newlines. Would stripping them out everywhere (instead of adding them) be a good solution?

We can add that metadata key to start with.

And yes, those keys will not change. We may add new ones over time, but won't change what's there.

Jsn.sherman (talk)14:23, 31 March 2021

Would stripping them out everywhere (instead of adding them) be a good solution?

Yea, I'd just strip them out.
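Stripping them could be a single normalization pass over each file's message values (a sketch, assuming a dict loaded from one of the JSON files; keys starting with "@" are skipped so the @metadata object is left untouched):

```python
def strip_trailing_newlines(messages):
    """Remove trailing newline characters from every message string."""
    return {
        key: value.rstrip("\n")
        if isinstance(value, str) and not key.startswith("@")
        else value
        for key, value in messages.items()
    }

messages = {"greeting": "Hello\n", "farewell": "Goodbye"}
print(strip_trailing_newlines(messages))  # {'greeting': 'Hello', 'farewell': 'Goodbye'}
```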

I also want to draw attention to this:

Could we skip the keys that don't have any definition such as: "24_description", "78_description"? These keys appear to be machine generated, can we assume that these keys will not change?

Regards,

Abijeet Patro (talk)05:21, 1 April 2021

Thanks for the feedback! We have updated our GitHub PR with all the suggestions you made: https://github.com/WikipediaLibrary/TWLight/pull/654. Let us know if you have any additional feedback.

Suecarmol (talk)17:06, 1 April 2021
 

Okay, we've merged this in. It's ready to be included for translation so far as we're concerned. Let us know if there are any problems or unanswered questions.

Thanks!

Jsn.sherman (talk)17:49, 5 April 2021