CLDR

From translatewiki.net

The Common Locale Data Repository (CLDR) is an open-source project run by the Unicode Consortium, a non-profit organisation which develops standards for software localisation in any language. CLDR maintains and continually expands a database of localisation data such as localised language names, characters, numbers, localised country names, currencies and units.

At present (as of September 2012), MediaWiki and translatewiki.net import and use most language names and plural rules from CLDR. This data is imported via the MediaWiki extension CLDR, which is updated whenever CLDR issues a new release.

Contribute to an existing locale

The easiest thing to do is to complete the translations of an existing locale. "Only" about 70 locales/languages are 98+ % complete for the needs of modern software, so there is probably plenty you can help with in your language. New languages, major bugs in the data, plural rules, and the like can be requested at any time on the tracker. Localisation data for words, punctuation, time formats and the like is entered by contributors who have an account on the CLDR survey tool. The CLDR survey tool is only open for a few months each year.

Since 2014, Wikimedia has had the status of a contributing organisation on CLDR and can get survey tool accounts for translatewiki.net translators. These accounts carry more votes than a normal "guest" account. To get an account on the CLDR survey tool, just ask Nemo, who is Wikimedia's survey tool manager. Then use your account to translate and correct translations of language names and other things when the survey tool is open.

See how to translate language names:

Localised language names

Localised language names are used by MediaWiki in a number of places, for instance in the titles of interwiki links. To add or change them, you should go to CLDR and request corrections or additions to the language name data held there for your language. The name of a language is translatable if and only if the language is included in the English locale's list of language names. Even if a language doesn't have a CLDR locale, you can ask for its name to be added.

The simplest way to request something small on CLDR is to open a bug request, for example "Cebuano" in Catalan, although new data should normally be added using the survey tool (see above). When requesting a change, you should provide references, preferably to on-line sources in English, so that the CLDR staff can easily verify your request. If you can't provide a reliable source, you may have to wait a very long time for your request to be processed. Once fixed, the new translation is published in the next CLDR release. New releases happen twice a year, although major bugs are fixed in interim releases.

If a language name has not been translated into your language yet on CLDR, then the CLDR data file will contain the language name in the fallback language used in CLDR for your language; see this discussion on Sorbian.
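The fallback behaviour described above can be sketched as follows. This is only a minimal illustration, not how CLDR or MediaWiki actually implement it: the locale code `xx`, its fallback to `en`, the sample names and the function `lookup_language_name` are all hypothetical.

```python
from typing import Optional, Tuple

# Hypothetical sample data; real CLDR data files are far larger.
NAMES = {
    "en": {"de": "German", "ceb": "Cebuano"},
    "xx": {"de": "Xx-name-for-German"},  # "xx" is a made-up locale
}

# Hypothetical fallback chain: locale -> locale it falls back to.
FALLBACK = {"xx": "en"}


def lookup_language_name(locale: str, code: str) -> Optional[Tuple[str, str]]:
    """Return (name, locale that supplied it), or None if no locale in the chain has it."""
    seen = set()
    current: Optional[str] = locale
    while current is not None and current not in seen:
        seen.add(current)  # guards against a circular fallback chain
        name = NAMES.get(current, {}).get(code)
        if name is not None:
            return name, current
        current = FALLBACK.get(current)
    return None


print(lookup_language_name("xx", "de"))   # translated in "xx" itself
print(lookup_language_name("xx", "ceb"))  # not translated, taken from the fallback
```

Returning the supplying locale alongside the name makes it easy to spot which names are still untranslated and only appear in the fallback language.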

If CLDR does not yet have a locale for your language, you could request one at CLDR (see below), so that other websites and programs can offer an interface in your language. However, if you cannot get a locale added to CLDR, or CLDR does not support the language name you wish to translate, you can request language name support for translatewiki.net and MediaWiki at Support on translatewiki.net.

CLDR workflow

Briefly, CLDR has a published[1] schedule reaching about one year into the future. It is divided into periods in which the CLDR team does internal work, lets people enter new locale data via the survey tool, lets people vet newly entered data, prepares a new release, and so on. Scheduled times are occasionally adjusted by a few weeks to match actual workload and progress. No more than two major releases are made per year. Thus there are at most two periods of time per year during which approved people can add to, or amend, locales. Some basic data of each locale remains stable until a new release is out. That means that the old base data is used, for instance, for validity checks and warnings, while new data, including new base data, is being collected. The general schedule introduced in 2013 should result in longer periods of access to the survey tool.

More details can be found at the CLDR process documentation.

Relevant for us is also the guideline Requesting Additions/Updates to CLDR Language/Population Data.

Creating a new locale

In short: provide the required core data in a new ticket.

If you plan to request a new locale at the CLDR and work on it, consider these hints first:

  1. If you are unfamiliar with the language, script, or writing forms and formats of the region, you will need several hours of assistance from a knowledgeable advisor; do not try it on your own!
  2. Read CLDR's hints: the section Adding New Locales and Picking the Right Language Identifier and List of script codes - if you are a techie, a checklist about adding a new locale to CLDR for CLDR developers may also interest you. There is a section "What is a Locale?" in the document describing the XML data format used to store locales. Take the survey tool walkthrough and try to get open questions answered; you are likely to have some. You can use the CLDR survey tool as a visitor without a login; do that, and play around inspecting locales that you have some understanding of. Scroll down the pages and read the explanations hidden there. They are often terse, missing or somewhat out of context. It looks worse at first sight than it really is, once you have got used to it!
  3. Request the new locale via the tracker, asking for the things listed below to be preset and approved. (If you don't, you may have to wait up to a year before you have them, and it will hamper or even prohibit your work on some data.)
    Fallback locale
    State which locale the new locale is to fall back to for data that it does not define itself. The target must be an existing locale. If you must fall back to a missing locale, you can choose to create it first, or use the future fallback of the missing locale. Remember to change it once the missing locale becomes available. The fallback locale setting can be altered in the survey tool.
    Plural rules
    These are quite central, and you do not want to miss them. See the page Language Plural Rules for how CLDR wants the possible choices of grammatical number described. The description is basically very easy. You may find a good grammar book and look it up, but you should check the special cases of zero and negative numbers, since these are often not covered in classical grammar books. If you have no better source, find out yourself, maybe with the help of friends. You must know your language well. You may choose from (only a few!) of these: singular, dual, trial, quadral, paral, paucal, plural, distributive plural, singulative, collective, various modulo-based rules, and special cases for zero and/or negative numbers. Try to find sample sentences using a variety of types of words, such as varying grammatical and natural genders, countable versus uncountable, separable versus indivisible, etc. Try each sentence with varying numbers, including zero and some negatives. Take special care to check whether you sometimes must, or had better, say "nothing", or "no rice", or "not (even) any …", or something else, rather than "0 pieces of …" - usually, finding a single (unavoidable) exception already warrants an extra case. Note, however, that you only need to keep the distinctions that are based on numbers, nothing else. For example, in a typical language having only the "n is 0", "n is 1" and "everything else" choices, whether "everything else" is going to use plural, distributive plural, or collective, if those exist, depends on the words and maybe the context rather than on the specific number. For those languages, you can safely skip such distinctions, as well as the type of declension that words are using.
    Character set
    Include every foreign character that may appear in foreign place, country or language names, even if your language usually does not use them. Say São Tomé & Príncipe, with its three accented Latin characters and the ampersand (&), is the correct country name for you. Even if your language itself never uses any of these characters, you must include all four and have them approved prior to your work. If you don't, you can enter them later, but you will suffer badly from error messages and rejections until your additions to the character set have been approved. Approval is not going to happen before the survey tool is closed and a new CLDR data release is made (see CLDR workflow, above).
    Collation
    Collation cannot be handled by the survey tool; you must submit it via the tracker. If you have the data at hand, supply it now. Collation is a tricky subject matter. If you do not have the necessary information, you can leave it until later, most likely without getting into trouble. There is help on collation available. You might find further documentation and ICU's guide useful for translating existing collation names.
    Transliteration
    If your locale does not use the Latin script, submit a transliteration to Latin. You might be able to use one that already exists at CLDR; otherwise you must ask them to create one from the data that you supply. Usually, existing resources can be found on the web with search engines, or you can ask a librarian familiar with your script.
    Day period categories
    Day period categories are the clock times during which people in your locale's region say, for example, "good morning" versus "good afternoon", or may append "in the morning" versus "at night" to clock times. If you do not set them, you will likely end up with " 00:00 ≤ am < 12:00 ≤ pm ≤ 24:00 " and be unable to enter correct clock-related strings for your locale in the survey tool. Day period categories can be neither set nor corrected using the survey tool. There is a sample request for day period categories, which you can follow.
    Territory language information
    The new locale is bound to be missing from the most recent supplemental list of Territory Language Information - supply the necessary information for all territories where the language is commonly used by locals.
    Likely Subtags
    Find the Likely Subtags for the new locale and ask to have them included in the list.
    Language list information
    Submit the language identifiers of ISO 639-1, ISO 639-2 (or ISO 639-2T and ISO 639-2B, if they happen to differ), ISO 639-3, native name(s), English name(s), French name(s). Also identify the writing system(s), that is the script(s), record the hierarchy of language families the language belongs to, plus the place, areas, regions, or states where spoken, and the estimated or known number of active and passive speakers. There is a sample list request, which you can follow.
    • (This list may be incomplete)
  4. Once the new locale exists and you have got the appropriate access rights, you can edit locale data as soon as the survey tool is open for submissions again. Do so. If you have questions of a technical nature as to how you should fill in a certain field, you may occasionally be able to deduce something from the description of the Unicode Locale Data Markup Language (LDML).
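To make the plural rules item above more concrete, here is a minimal sketch for a language that only distinguishes "n is 0", "n is 1" and "everything else". The category keywords `zero`, `one` and `other` are the ones CLDR uses; the English sample strings, the function names and the decision to let negative numbers fall into `other` are assumptions made for illustration only, and your language may well need something different.

```python
def plural_category(n: int) -> str:
    """Map a number to a category for a hypothetical three-way language."""
    if n == 0:
        return "zero"
    if n == 1:
        return "one"
    # Assumption: negatives and all remaining numbers share one form.
    return "other"


# Hypothetical sample messages, one per category.
MESSAGES = {
    "zero": "no files",
    "one": "{n} file",
    "other": "{n} files",
}


def format_count(n: int) -> str:
    """Pick the message for n's category and fill in the number."""
    return MESSAGES[plural_category(n)].format(n=n)


# Try each sentence with varying numbers, including zero and some negatives.
for n in (0, 1, 2, 5, -1, -2):
    print(n, "->", format_count(n))
```

Running such a loop over your own sample sentences is a quick way to notice a number that needs a wording none of your categories provides, which usually means an extra case is needed.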

Additional checks that you should or could make, at the latest once the locale exists:

  • If your locale refers to a comparatively small area, check that CLDR's UN M.49 data describing it is present and correct. Make a new bug request to have it added or corrected, if necessary.
  • If your locale code (or any of its subtags) has aliased or older equivalent codes (or subtags), check that those appear in the supplemental Aliases list. Create a bug request to have them added if you spot omissions.
  • Maybe more to come ...
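One more check you could make yourself, before filing the ticket, concerns the character set item described earlier: find out which characters in the names you intend to enter are not yet covered by the character set you plan to request. The sketch below is only an illustration. The function `missing_characters` and the plain a-z character set are hypothetical, and it assumes that whitespace need not be listed and that listing lower-case letters is enough to cover their upper-case forms.

```python
import unicodedata


def missing_characters(names, charset):
    """Return the sorted characters used in names that charset does not cover."""
    missing = set()
    for name in names:
        # Normalise so that accented letters are single code points.
        for ch in unicodedata.normalize("NFC", name):
            if ch.isspace():
                continue  # assumption: whitespace is not part of the set
            lower = ch.lower()
            if lower not in charset:
                missing.add(lower)
    return sorted(missing)


# Hypothetical character set of a language that only uses a-z.
charset = set("abcdefghijklmnopqrstuvwxyz")

print(missing_characters(["São Tomé & Príncipe"], charset))
```

For the example name, the result lists the three accented letters and the ampersand, which are the four characters that would have to be requested along with the locale.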

See also

Where to ask questions

For CLDR-related issues, bugs, and suggestions, feel free to open a ticket on CLDR's request tracker. You can do so without having to log in, but make sure to enter your e-mail address. As of 2014, requests are usually processed weekly.

For translatewiki.net-related issues, and generally anything else, you can ask on Support.

There is also the CLDR mailing list, where you can ask for support, browse the archive, and so on.

Footnotes

  1. For instance in the left column on their home page, below a box of various internal links
  2. See page http://cldr.unicode.org/index/bug-reports