TheWikipediaLibrary - json-based i18n files in addition to existing gettext


Hi there,

We're working to address a longstanding issue that prevented some of our content from being translatable via translatewiki. We're moving messages from the database into JSON files over the course of a few PRs, with the idea that these would be picked up by translatewiki. Our first take on this is a JSON file per message type and language, and we've started with the descriptions of the partners who provide resources for the library. You can see here that we've got the descriptions for all partners together in a file for each language. If we move forward with this, we'll add new files for other fields as we take that content out of the database. Before we proceed any further, I'd like to check whether this approach is workable on your end, or if we need to organize the data differently. Here's our first PR to make this move, so you can see what it looks like currently:

Feel free to ignore the python, which does the job of extracting the info and creating those initial json files, and just focus on the json.

Feedback is welcome!

Jsn.sherman (talk) 16:47, 30 March 2021

Thanks a lot! JSON is generally easier to manage than PO.

One issue I immediately see is that you output non-ASCII characters as escapes, such as "101_short_description": "<p>\u0627\u0644\u0645\u0646\u0647\u0644", etc. According to the convention documented at mw:Localisation file format, it's preferable to use the actual letters and not escapes. Will it be possible to do it? Since humans won't have to deal with these files very often, it's probably not a blocker, although User:Abijeet Patro and User:Nike should correct me if I'm wrong. That said, humans may have to deal with it; for example, sometimes it's too difficult to find things in translatewiki itself and it's more convenient to grep through the source.
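To illustrate the difference: Python's json module escapes non-ASCII characters by default, and passing ensure_ascii=False emits the actual letters instead. A minimal sketch using the key from the example above (the surrounding markup is shortened for illustration):

```python
import json

# Hypothetical message entry; the Arabic text is the word from the escaped example above.
data = {"101_short_description": "<p>المنهل</p>"}

print(json.dumps(data))                      # escapes: "\u0627\u0644..."
print(json.dumps(data, ensure_ascii=False))  # actual letters: "المنهل"
```

Both forms decode to the same string, so the change is purely about readability of the files on disk.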

Moreover, perhaps you don't need to convert the translated files at all: maybe it will be enough to add en.json and qqq.json and let the translatewiki import/export scripts generate the JSON files with the translations.
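For reference, a sketch of what that en.json might look like, using a key from this thread (the English text and the empty authors list are placeholders):

```json
{
    "@metadata": {
        "authors": []
    },
    "101_short_description": "<p>Short description of the partner in English.</p>"
}
```

A matching qqq.json would carry the same keys, with each value documenting the message for translators; the import/export scripts can then maintain the per-language files from there.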

Amir E. Aharoni (talk) 16:58, 30 March 2021

I agree that using plain UTF-8 would be better. The "\uNNNN" escape can cause problems: every character in the supplementary planes has to be encoded as a pair of surrogate escapes, and the convention is not universal, as some tools expect the long form "\U00NNNNNN" instead. That form is also overlong, and some tools may not accept the leading zeroes, generating ambiguities, unless they use a strictly conforming JSON parser.

There's no such problem with UTF-8: the only escapes needed are for quotation marks and backslashes inside the actual text, plus a few ASCII controls like TAB and NEWLINE. That is much simpler to parse, even with those few ASCII exceptions.
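To make the surrogate-pair point concrete: in Python, the default ASCII-only encoder has to split any supplementary-plane character into two \uNNNN escapes (U+1F600 here is chosen purely for illustration):

```python
import json

# U+1F600 lies outside the Basic Multilingual Plane, so the ASCII-only
# encoder must emit it as a UTF-16 surrogate pair.
escaped = json.dumps("\U0001F600")
plain = json.dumps("\U0001F600", ensure_ascii=False)

print(escaped)  # "\ud83d\ude00"  (two escapes for one character)
print(plain)    # "😀"            (one character, as UTF-8)
```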

You should ensure that the generated JSON is strictly valid per the JSON standard (and not dependent on old JavaScript implementations that had these caveats). You can be lenient when parsing input JSON, as long as there is no possible ambiguity.

Verdy p (talk) 17:20, 30 March 2021

This is all extremely useful. We should be able to emit plain UTF-8 in our initial output and add a JSON linter to our CI/CD process to make sure the files we generate parse as strictly valid JSON.
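A linter along those lines can be a few lines of Python, since the stdlib json parser is already strict; the locale directory name here is an assumption about the repository layout:

```python
import json
import pathlib

def lint_json_tree(root):
    """Collect 'path: error' strings for files that fail strict JSON parsing."""
    problems = []
    for path in sorted(pathlib.Path(root).rglob("*.json")):
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except (UnicodeDecodeError, json.JSONDecodeError) as err:
            problems.append(f"{path}: {err}")
    return problems

# In CI, print the problems and exit non-zero when any are found, e.g.:
#     sys.exit(1 if lint_json_tree("locale") else 0)
```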

On the creation of files: it's actually easiest for us to create them all regardless, but we can update the script to cull the empty ones. We do have a couple of existing translations, so it sounds like the ideal process might be to create qqq, en, and then any translations that we already have?

Jsn.sherman (talk) 17:46, 30 March 2021


Yeah, I can see that we need to normalize those trailing newlines. Would stripping them out everywhere (instead of adding them) be a good solution?

We can add that metadata key to start with.

And yes, those keys will not change. We may add new ones over time, but won't change what's there.

Jsn.sherman (talk) 14:23, 31 March 2021

Would stripping them out everywhere (instead of adding them) be a good solution?

Yea, I'd just strip them out.
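A sketch of that stripping step, applied at generation time (the function and field names are illustrative; non-string values such as the @metadata object are left untouched):

```python
def strip_trailing_newlines(messages):
    """Strip trailing newlines from string values, leaving non-strings alone."""
    return {
        key: value.rstrip("\n") if isinstance(value, str) else value
        for key, value in messages.items()
    }

strip_trailing_newlines({"101_short_description": "<p>...</p>\n\n"})
# {'101_short_description': '<p>...</p>'}
```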

I also want to draw attention to this:

Could we skip the keys that don't have any definition, such as "24_description" and "78_description"? These keys appear to be machine-generated; can we assume that they will not change?
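Skipping the undefined keys on the exporting side could look like this (a sketch, assuming empty definitions are empty or whitespace-only strings, and keeping the @metadata object regardless):

```python
def drop_empty_definitions(messages):
    """Drop keys whose definition is blank, keeping @metadata and non-blank messages."""
    return {
        key: value
        for key, value in messages.items()
        if key == "@metadata" or (isinstance(value, str) and value.strip())
    }

drop_empty_definitions({"24_description": "", "78_description": "  ", "1_description": "text"})
# {'1_description': 'text'}
```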


Abijeet Patro (talk) 05:21, 1 April 2021

Thanks for the feedback! We have updated our GitHub PR with all the suggestions you made. Let us know if you have any additional feedback.

Suecarmol (talk) 17:06, 1 April 2021

Okay, we've merged this in. It's ready to be included for translation so far as we're concerned. Let us know if there are any problems or unanswered questions.


Jsn.sherman (talk) 17:49, 5 April 2021

Opened a task to track this.


Abijeet Patro (talk) 14:36, 7 April 2021

Our team was just discussing this some more today and we realized that we may be able to accommodate the content covered by this change with our existing ugettext workflow. Feel free to de-prioritize this task while we verify. We should know in a week or so. I cross-posted this to the phab task as well.

Jsn.sherman (talk) 14:48, 7 April 2021