Repository management

From translatewiki.net

This page documents the tools and processes for synchronization of translations in source code repositories for translation admins.

High level summary

Translation admins synchronize translations made in translatewiki.net with over a thousand source code repositories. This process is automated to a large extent. It can be divided into two parts: import and export. Each part has multiple steps.

Import means that we update messages in translatewiki.net based on changes made in the source code repositories. Checkouts of all the source code repositories of all the supported projects are in the directory /resources/projects on the server. These checkouts use anonymous read access and a dedicated user account l10n-bot so that translation admins and scheduled scripts can update the checkouts without having to deal with access and permissions. The main synchronization scripts automatically take care of the permissions.

Import consists of the following steps:

  1. Update the source code repository checkouts for import to the newest version.
  2. Run the script processMessageChanges.php from the Translate extension. The script compares the messages and translations in the repository to those in the wiki. The script creates a list of changes based on the type of the changes in the source code repositories. For example, a rename of a message must be handled differently from a change to a translation. If there are changes, the message group is locked to prevent imports and exports happening in an inconsistent state.
  3. Process the list of changes. Some changes are imported automatically. Translation admins process the remaining changes using Special:ManageMessageGroups.
  4. A systemd script checks every 5 minutes if changes are synchronized properly and unlocks such message groups.
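
The comparison in step 2 can be sketched roughly as follows. This is a simplified illustration in Python, not the logic of processMessageChanges.php itself: the function name classify_changes and the sample message keys are made up for the example, and the real script also has to handle translations in every language.

```python
def classify_changes(wiki, repo):
    """Compare two message key -> text mappings and group the differences by type."""
    added = {key for key in repo if key not in wiki}
    deleted = {key for key in wiki if key not in repo}
    changed = sorted(key for key in repo if key in wiki and repo[key] != wiki[key])

    # A deleted key and an added key with identical text are treated as a rename.
    renamed = []
    for old_key in sorted(deleted):
        new_key = next((k for k in sorted(added) if repo[k] == wiki[old_key]), None)
        if new_key is not None:
            renamed.append((old_key, new_key))
            added.remove(new_key)
    deleted -= {old_key for old_key, _ in renamed}

    return {
        "added": sorted(added),
        "deleted": sorted(deleted),
        "changed": changed,
        "renamed": renamed,
    }


# Hypothetical state of one message group in the wiki and in the repository.
wiki_state = {"greeting": "Hello", "farewell": "Bye", "title": "Home"}
repo_state = {"greeting": "Hello!", "goodbye": "Bye", "subtitle": "Welcome"}
changes = classify_changes(wiki_state, repo_state)
print(changes)
```

Keeping renames apart from plain additions and deletions matters because a rename should carry the existing translations over to the new key instead of discarding them.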

Export means that all new, updated and deleted translations in translatewiki.net are sent to the source code repositories.

Export consists of the following steps:

  1. Update the source code repository checkouts for export to the same version as the source code repository checkouts for import, or to the latest version if repository state synchronization is disabled. See #Repository state synchronization.
  2. Export translations using the script export.php from the Translate extension.
  3. Create commits for the changes and push those commits to update the canonical version of the source code repositories.
  4. Do additional steps like creating pull/merge requests or voting in Gerrit.

The automated steps for import and export actions are captured by the following commands. Each command has two versions. One version does the action for all MediaWiki related source code repositories hosted in Wikimedia Gerrit and the other version does the action for all other source code repositories. These two sets of projects are managed by different translation admins and use different schedules.

  • autoimport does import for all supported projects (except MediaWiki in Wikimedia Gerrit)
  • autoimport-mediawiki does import for all MediaWiki code (core, extensions, skins hosted in Wikimedia Gerrit)
  • autoexport does export for all supported projects (except MediaWiki in Wikimedia Gerrit)
  • autoexport-mediawiki does export for all MediaWiki code (core, extensions, skins hosted in Wikimedia Gerrit)

The autoimport* scripts are automatically run by systemd multiple times a day. They announce any changes that need manual processing in the #translatewiki IRC channel on Libera.Chat. Those scripts can also be run manually (but the output is still in IRC).

The autoexport* scripts can only be used if you are a member of the l10n-bot group.

The import scripts usually take only a few minutes to complete. The export scripts take around 20 to 60 minutes to complete, and likely longer if you are doing exports for the first time.

Subscribe to Phab:T208190 to stay aware of known export failures and add new ones you notice.

Repository state synchronization

Repository state synchronization prevents the export scripts from accidentally undoing changes that have been made to the source code repositories after the last import. It is enabled by default if state-directory is configured in #repoconfig.yaml.

State synchronization is applied during exports. When updating a repository for translation exports, the commands check which version the read-only checkouts are at, and update the export repository to the same version. With git repositories, after the translation updates have been exported to the file system, a git rebase is applied to bring the changes on top of the latest upstream version. If the rebase fails, translation updates for that repository fail until the read-only checkouts are updated.

Here is an example of a sequence where accidental overwrites can happen without state synchronization:

  1. We update read-only checkout of a project to latest version A.
  2. We process changes automatically or manually using Special:ManageMessageGroups.
  3. Upstream modifies English (source language) and message documentation files to create version B.
  4. We export and commit translations on top of A to create version C.

The upstream changes to message documentation in B are lost. Do note that changes in English files are not usually exported, as it is usually the source language. Hence those changes would not get overwritten in this scenario.
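
The difference can be illustrated with a toy model in Python. This is only a sketch under simplifying assumptions: files are modelled as whole strings, the file names and helper functions are invented for the example, and a real git rebase works on line-level diffs and can fail with conflicts.

```python
def export_translations(checkout, wiki_files):
    """Return the checkout with every exported file overwritten by the wiki's copy."""
    result = dict(checkout)
    result.update(wiki_files)
    return result


def changed_files(base, new):
    """Files whose content differs from the base version (a toy 'commit')."""
    return {path: text for path, text in new.items() if base.get(path) != text}


def replay(changes, onto):
    """Apply a toy commit on top of another version (a toy 'rebase')."""
    result = dict(onto)
    result.update(changes)
    return result


# Version A was imported into the wiki; upstream then created version B.
version_a = {"en.json": "Hello", "qqq.json": "Greeting", "fi.json": "Hei"}
version_b = {**version_a, "en.json": "Hello!", "qqq.json": "Greeting on the main page"}

# The wiki still reflects A, plus one updated Finnish translation.
# English is the source language and is not exported.
wiki_files = {"qqq.json": "Greeting", "fi.json": "Moi"}

# Without state synchronization: wiki content based on A overwrites B.
without_sync = export_translations(version_b, wiki_files)

# With state synchronization: export on top of A, then replay only the
# resulting changes on top of B.
commit = changed_files(version_a, export_translations(version_a, wiki_files))
with_sync = replay(commit, version_b)

print(without_sync["qqq.json"])  # upstream documentation change is lost
print(with_sync["qqq.json"])     # upstream documentation change survives
```

In the synchronized case the commit only contains the Finnish update, because the exported message documentation is identical to version A, so replaying it on B leaves the upstream documentation change intact.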

Message group synchronization lock (Strong synchronization)

Here is an example where accidental overwrites could happen without the "strong synchronization" feature:

  1. Upstream modifies all translation files: for example, by updating a copyright date in a string and removing some unused strings (basically, any changes that prevent --safe-import from processing changes automatically).
  2. We update read-only checkout from version A to latest version B.
  3. We do not process changes using Special:ManageMessageGroups.
  4. We export and commit translations on top of B (but based on A) to create version C.

Upstream changes in version B are lost. For the next import we automatically update our read-only checkout to version C, so we never see or process changes in B.

Strong synchronization prevents imports and exports for message groups that are currently in synchronization or failed synchronization state.

Helper commands

Sometimes you don't want to deal with all repositories, for example while doing one-off exports or setting things up. You should use the repo command when working with the read-only checkouts. You should use the repomulti command when working with write checkouts. These commands set up appropriate permissions for you.

Each of the commands documented in this section will crawl up the directory tree until it finds the repoconfig.yaml file. This means you can run these commands in the project subdirectories as well.
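
The lookup described above can be pictured with a short Python sketch. The function name find_repoconfig is made up for illustration, and the real commands may implement the search differently.

```python
from pathlib import Path
from typing import Optional


def find_repoconfig(start: Path, name: str = "repoconfig.yaml") -> Optional[Path]:
    """Walk from `start` towards the filesystem root and return the first config found."""
    current = start.resolve()
    for directory in (current, *current.parents):
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None


# Search upwards from the current working directory.
print(find_repoconfig(Path.cwd()))
```

Because the search starts in the current directory and moves upwards, running a command inside a project subdirectory finds the same configuration file as running it next to repoconfig.yaml.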

The repo command automatically uses the l10n-bot user and the /resources/projects read-only repositories using the /resources/projects/repoconfig.yaml (symlink) configuration. It takes two arguments:

  • command: update (export and commit can be used too, but they don't make any sense!)
  • project name as above

You need to be in the l10n-bot shell user group to be able to use this command.

repomulti is a versatile command, which takes two arguments:

  • command: status (default), update, export or commit
  • project selector regular expression, defaults to all projects

Warning: Unlike all other commands, this command takes a regular expression for the project selector!

You need to be in l10n-bot shell user group to be able to use this command.

For example, if you haven't done exports in a while, you can run repomulti update first, so that when you run autoexport later it will be faster. Or if you need to do manual exports to specific projects, you can do:

sudo -su l10n-bot
repomulti update 'mw.*' # matches mwgitlab, mwgithub and mwgerrit, but not mediawiki-extensions
repomulti export 'mw.*' 
repomulti status 'mw.*' # you can check what has changed, prints git/svn status information
repomulti commit 'mw.*'
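
To get a feel for what a selector will match before running repomulti, the pattern can be tried against a list of project names. The Python sketch below assumes that the selector is matched from the start of the project name; how repomulti actually anchors the expression is not documented here, so treat the result as an approximation. The function name select_projects is made up for the example.

```python
import re


def select_projects(selector, projects):
    """Return the project names matched by the regular expression (anchored at the start)."""
    pattern = re.compile(selector)
    return [name for name in projects if pattern.match(name)]


projects = ["mwgitlab", "mwgithub", "mwgerrit", "mediawiki-extensions", "freecol"]

print(select_projects("mw.*", projects))
# A plain project name is also a regular expression and may match more than one project.
print(select_projects("mediawiki", ["mediawiki", "mediawiki-extensions"]))
```

The second call shows why the warning above matters: under this assumption a selector such as mediawiki also selects mediawiki-extensions unless the expression is made more specific.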

Finally, for scripting purposes, we have three commands:

  • repoupdate updates the repositories.
  • repoexport exports the translations.
  • repocommit creates commits and pushes them out.

Each command takes a project as the second argument, e.g. freecol, mediawiki. These commands do not set up permissions, so you need to do it manually using sudo -u l10n-bot as well as the l10n-bot wrapper for ssh key access.

RepoNG

The above repo* commands are only thin wrappers around the repong.php script. This script does most of the actual work, although it relies on export.php from Translate and on the clupdate-X-repo scripts, where X is one of git, svn or bzr, to support different version control systems and authentication. Authentication should be separated from the version control in the future.

RepoNG has some nice features, such as doing things in parallel using multiple threads to speed things up. It also handles state synchronization between the read-only and write checkouts so that we do not accidentally overwrite changes we haven't processed yet.

This command takes two arguments:

  • command: One of update, export, commit
  • project name as above

It also has two switches:

  • -v makes it print out the commands it executes. Useful for debugging. By default the script is very quiet.
  • --variant can be used to choose a variant from the config (currently only export is supported). The default is taken from the REPONG-VARIANT file that is created alongside the repoconfig.yaml file. If neither is given, it falls back to the default variant used for read-only checkouts.
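
The order in which the variant is chosen can be sketched as below. This is an illustration only: it assumes that the REPONG-VARIANT file contains just the variant name, and the string "default" is a placeholder for the unnamed default variant. Neither detail is confirmed by this page.

```python
from pathlib import Path
from typing import Optional


def choose_variant(cli_variant: Optional[str], config_dir: Path) -> str:
    """Pick the variant: the --variant switch first, then REPONG-VARIANT, then the default."""
    if cli_variant:
        return cli_variant
    variant_file = config_dir / "REPONG-VARIANT"
    if variant_file.is_file():
        # Assumption: the file holds only the variant name.
        stored = variant_file.read_text(encoding="utf-8").strip()
        if stored:
            return stored
    # Placeholder name for the variant used by the read-only checkouts.
    return "default"


print(choose_variant(None, Path.cwd()))
```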

repoconfig.yaml

repoconfig.yaml is the configuration file that acts as a list of managed repositories for RepoNG. It is a YAML file that contains one or more projects. The basic structure of a project is as follows:

project name:
  # Project contains project properties
  group: example
  repos:
    checkout-path:
      # Repos contain repo properties
      type: github
      url: https://github.com/translatewiki/example.git

The same config file can be used in multiple contexts using variants. The most common case is a shared unauthenticated read-only checkout of all repositories and one or more authenticated checkouts for pushing translation updates. Any property can vary, but it's recommended to only vary scalar values. Variants are specified by appending | to the key followed by the variant name. The examples below use export as the variant name.
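
One way to picture how variant keys could be resolved is the Python sketch below. It is not the code used by repong.php: the function name resolve_variant is invented, and the rule that a matching key|variant entry replaces the plain key is an assumption based on the description above. The sample values are taken from the fudforum example later on this page.

```python
from typing import Optional


def resolve_variant(config: dict, variant: Optional[str] = None) -> dict:
    """Return a copy of the config where 'key|variant' entries replace plain keys."""
    resolved = {}
    for key, value in config.items():
        base_key, _, key_variant = key.partition("|")
        if isinstance(value, dict):
            value = resolve_variant(value, variant)
        if key_variant == "":
            # Plain key: keep it unless a variant-specific value was already chosen.
            resolved.setdefault(base_key, value)
        elif key_variant == variant:
            # Variant-specific key for the active variant takes priority.
            resolved[base_key] = value
        # Keys for other variants are ignored.
    return resolved


repo = {
    "type": "svn",
    "url": "svn://svn.code.sf.net/p/fudforum/code/trunk/install/forum_data/thm/default/i18n",
    "url|export": "svn+ssh://translatewiki@svn.code.sf.net/p/fudforum/code/trunk/install/forum_data/thm/default/i18n",
}

print(resolve_variant(repo)["url"])            # read-only URL
print(resolve_variant(repo, "export")["url"])  # authenticated URL for exports
```

Under this reading, a key that only exists in variant form, such as state-directory|export, is simply absent when another variant is active.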

Global properties

Global properties are given under a virtual project name @meta. Mandatory keys are bolded.

Property key Description
expand Command line to Translate's expand-groupspec.php in the MediaWiki installation.
export Command line to Translate's export.php.
Note that you can add the --wiki parameter to expand and export to choose a wiki in a multi-wiki setup.
state-directory Path to your read-only checkouts. See #Repository state synchronization.
Example
'@meta':
  export: php /srv/mediawiki/targets/production/extensions/Translate/scripts/export.php
  expand: php /srv/mediawiki/targets/production/extensions/Translate/scripts/expand-groupspec.php --exportable
  state-directory|export: /resources/projects

Project properties

Project properties apply to all repositories under a project. Mandatory keys are bolded.

Property key Description
group Which message groups are connected to this project. This accepts a GroupSpec: comma separated values with support for wildcards * and ?.
repos List of repositories connected to this project. This is a hash where the key is the filesystem path, relative to repoconfig.yaml, where the repository is checked out, and the value is a hash of repository properties. The exception is the key @generator, which takes a command line to a script that must output the repositories as a JSON string. If this method is used, no repositories can be specified directly for this project in the repoconfig.yaml file.
export-threshold This controls the --threshold parameter for export.php. Languages where less than the given percentage of messages is translated are not exported. The threshold is checked independently for each message group, even if the project consists of many. Default value is 25.

no-export-languages This controls the --skip parameter for export.php. Accepts a comma separated string of language codes. These languages are never exported. By default en is not exported.
always-export-languages These languages are always exported even if they do not pass export-threshold. Useful for language variants which are not expected to reach 100%, such as British English. Default value is qqq, which must be included if this value is overridden.

auto-merge If set, after pushing updates to Wikimedia Gerrit it will give CR+2 to all of them. Only applicable for repository type wmgerrit. To use patterns, the string must start with ^. For example ^mediawiki/extensions.* or ^project|another-project.
Example
mediawiki-extensions:
  always-export-languages: en-ca,en-gb,es-formal,de-formal,de-at,de-ch,hu-formal,nl-informal,zh-hk
  no-export-languages: test,aeb,ais,be-x-old,crh,dk,en,fiu-vro,gan,gom,hif,kbd,kk,kk-cn,iu,kk-kz,kk-tr,ko-kp,ku,ku-arab,no,ruq,simple,sr,tg,tp,tt,ug,zh,zh-classical,zh-cn,zh-sg,zh-min-nan,zh-mo,zh-my,zh-tw,zh-yue,bbc,ady
  export-threshold: 0
  group: ext-*
  auto-merge: ^mediawiki/extensions/.*
  repos:
    '@generator': php ../groups/MediaWiki/repong-generator.php extensions
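
How the three language-related properties interact can be sketched in Python. This is a simplified model, not the behaviour of export.php: the function name languages_to_export and the completion figures are invented, and the assumption that no-export-languages takes priority over always-export-languages is not confirmed by this page.

```python
def languages_to_export(completion, threshold=25, skip=("en",), always=("qqq",)):
    """Select language codes for export from a code -> percent-translated mapping."""
    selected = []
    for code, percent in sorted(completion.items()):
        if code in skip:
            # Assumption: skipped languages are never exported, even if listed in `always`.
            continue
        if code in always or percent >= threshold:
            selected.append(code)
    return selected


# Hypothetical completion percentages for one message group.
completion = {"en": 100, "qqq": 10, "fi": 80, "sv": 24, "de": 25, "en-gb": 5}

print(languages_to_export(completion))
print(languages_to_export(completion, threshold=0, always=("qqq", "en-gb")))
```

With the defaults, sv is left out because it is below 25 percent, while qqq is exported regardless of its completion. Setting the threshold to 0, as in the mediawiki-extensions example, exports every language that is not explicitly skipped.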

Repository properties

Repository properties only apply to one repository. Mandatory keys are bolded.

Property key Description
type Supported values are:
  • git: plain clone of a git repository
  • github: plain clone of a git repository and setting name and email for l10n-bot
  • wmgerrit: plain clone of a git repository and setting name and email for l10n-bot and setting up git review for Wikimedia Gerrit
  • svn: checkout of Subversion repository
  • bzr: checkout of Bazaar repository
url URL of the repository. This usually varies with variant configuration.
branch Which branch to use. Ignored for repository type svn. This can vary with variant configuration, in which case state sync is automatically disabled. Commits will be added to the remote branch without rebasing and rewriting, which means they will get out of sync with the source branch over time. Do not vary this when using push-branch or pull-branch. Default value is master.

push-branch Which branch translation updates are pushed to. This will do a force push to create or update the remote branch. Only applicable for repository types git and github.
pull-branch Same as push-branch, but additionally updates or creates a pull request. Only applicable for repository type github.
no-state-sync Disables repository state synchronization. It is automatically disabled if branch varies. See #Repository state synchronization.
svn-add-options Additional options for Subversion to be applied for newly added files.
Example
fudforum:
  group: out-fudforum
  repos:
    fudforum:
      type: svn
      url: svn://svn.code.sf.net/p/fudforum/code/trunk/install/forum_data/thm/default/i18n
      svn-add-options: config:auto-props:msg=svn:mime-type=text/plain;svn:eol-style=native
      url|export: svn+ssh://translatewiki@svn.code.sf.net/p/fudforum/code/trunk/install/forum_data/thm/default/i18n

How to process external message changes

When running autoimport or autoimport-mediawiki or processMessageChanges.php directly, one gets a link to Special:ManageMessageGroups. On this page one does a sanity check of the changes before "accepting" them to translatewiki.net.

The page consists of diffs, where the external state (files in repositories) is in the first column and the wiki state is in the second column. Changes seen on this page usually fall into the following categories:

New messages in source language.
There is usually nothing to check for these, and in fact autoimport will accept all new messages for a message group if there are no other changes. If there is something that doesn't look translatable (empty messages, URLs with no translations, symbols) one should update message group configuration to list these messages either as ignored or optional as appropriate.
Messages or translations deleted.
Again, these can usually be safely accepted. If there is a large number of unexpected deletions, there might be a syntax error in the source file, which should be fixed before proceeding. We don't delete translations that go unused from the wiki.
Changed messages in source language.
It is normal for the messages in the source language to change. In this case one should see if the change is something that doesn't require fixes in translations (usually only spelling mistakes meet this criterion) and in that case choose the option to not mark translations as outdated.
Changes in translations.
Our exports are not yet fully atomic. Changes in translations should be checked carefully, because the system might try to overwrite a very recent translation with the previous one. It might also be an external change, in which case one should use one's best judgement to decide which version to choose.
Renamed messages.
Message renames can also happen, although they are discouraged. In this case you would see a message deleted and a new message with exactly the same content. Translations might or might not be renamed externally.

If message keys are renamed while the content is exactly the same, Translatewiki.net will match the messages automatically and display them. The matching can be broken by selecting the Add as new menu option next to the matched renames.

Sometimes a rename and a content change are done at the same time. In such cases Translatewiki.net will not be able to match the messages automatically. If such messages are spotted, they can be manually matched using the Add as a rename menu option.

The dialog box that appears when renaming displays the list of messages that Translatewiki.net detects as missing in this import for that group. It also displays the similarity percentage between the message for which we are trying to find a rename and each possible rename. The actual renaming is performed once the page is submitted and happens in the background via the JobQueue.
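
To illustrate the idea of ranking possible renames by similarity, here is a small Python sketch. It uses difflib.SequenceMatcher from the Python standard library as a stand-in; the similarity measure that the Translate extension actually uses is not described on this page and may give different percentages. The message keys and texts are invented.

```python
from difflib import SequenceMatcher


def rank_rename_candidates(new_text, missing_messages):
    """Return (key, similarity percent) pairs for missing messages, most similar first."""
    scored = [
        (key, round(SequenceMatcher(None, new_text, text).ratio() * 100))
        for key, text in missing_messages.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Messages that disappeared in this import (hypothetical keys and texts).
missing = {
    "welcome-title": "Welcome to the wiki",
    "logout-button": "Log out",
}

# Text of a newly added message that was renamed and slightly edited.
ranked = rank_rename_candidates("Welcome to the wiki!", missing)
print(ranked)
```

The candidate with the highest percentage is the most likely original of the renamed message, but the final choice is still made by the translation admin.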

Using Special:ReplaceText to rename messages: We can still rename messages using Special:ReplaceText. Copy the old and new key to Special:ReplaceText without namespace and language code, but include the trailing /. Regular expressions can be used to rename multiple similar messages at once, but then / might need to be written as \/. Uncheck all namespaces and check the namespace in question and its talk namespace. Uncheck replace in content and check replace in titles. On the confirmation page check that all pages can be renamed; it might show that the source page cannot be renamed if you don't have sufficient permissions. Once you have done the renames, wait a bit for the JobQueue to process them, and then re-run the script to re-generate the diffs and proceed as usual. After you accept all changes, you might get a page with a heading but no diffs. This is okay, and it will disappear after re-generating the diffs next time.

Sometimes the diffs can be messy, for example if people duplicate messages, or do renames and change content at the same time. These situations need special care to get them right (or sometimes it is just too difficult and we make translators re-translate using translation memory). In some cases there are changes to the source message that could be programmatically applied to the translation using Special:ReplaceText or similar. The issue with these tools is that they do not preserve the outdated status of translations, so one should make sure they are either automatically or manually marked outdated after the automatic replacements. For this reason it is not usually worth trying to do those changes programmatically.

Project maintainers are advised to inform translation admins if there are major message renames, so check with the translation admins whenever you are reviewing a large number of renames, as they may have further information about it.

Troubleshooting

Sometimes the cache can go out of sync and does not show changes that should be there.

One case where this happens is when adding a prefix to a message that has already been imported. In this case you can delete /resources/caches/translatewiki.net/translate_groupcache-*groupid*/*languagecode*.cdb for appropriate languages (usually en or qqq) and re-run the import.

Another case where this happens is when replacing a regular message group with an aggregate message group. This can be observed by a warning such as AggregateMessageGroup ext-scribunto cannot be primary owner of key scribunto-lua-error. In this case you can safely remove the whole directory for that group and run createmi to verify.


Sometimes for debugging it is useful to go very low level to figure out why access to some repository fails. Some examples:

# Check whether you can access ssh-agent
l10n-bot ssh-add -L
# Execute git command with verbose mode enabled for ssh:
GIT_SSH_COMMAND="git-ssh-wrapper -v" l10n-bot git fetch