CLDR

From translatewiki.net

The Common Locale Data Repository (CLDR) is an open-source project run by the Unicode Consortium, a non-profit organisation which develops standards for software localisation in any language. CLDR maintains and continually expands a database of localisation data such as localised language names, characters, numbers, localised country names, currencies and units.

At present (as of September 2012) MediaWiki and translatewiki.net import and use most language names and plural rules from CLDR. This data is imported via the MediaWiki extension CLDR, which is updated whenever CLDR issues a new release.

Localised language names

To add or change localised language names for translatewiki.net and the MediaWiki software, go to CLDR and request corrections or additions to the language name data held there for your language. The simplest way to request something on CLDR is to open a bug request, for example "Cebuano" in Catalan, although new data should apparently be added using the survey tool. New locale bug requests can be raised from the tracking system home page, but first search the CLDR tracking system, using its search box, for existing bug reports about your language. When requesting a change, provide references, preferably to online sources in English, so that the CLDR staff can easily verify your request. Without a reliable source, you may have to wait a very long time for your request to be processed. Once fixed, the new language name translation is published in the next release of CLDR. New CLDR releases only happen about twice a year, although major bugs are fixed in interim releases.

If a language name has not been translated into your language yet on CLDR, then the CLDR data file will contain the language name in the fallback language used in CLDR for your language; see this discussion on Sorbian.
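
The fallback behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tables `LANGUAGE_NAMES` and `FALLBACK` and the function `language_name` are hypothetical, real CLDR data is stored in XML files, and the choice of German as the fallback for Upper Sorbian (hsb) is assumed here purely for the example.

```python
# Hypothetical miniature name tables; real CLDR data is XML and far larger.
LANGUAGE_NAMES = {
    "de": {"fr": "Französisch", "hsb": "Obersorbisch"},
    "hsb": {"hsb": "hornjoserbšćina"},
}

# Assumed fallback chain, for illustration only.
FALLBACK = {"hsb": "de"}


def language_name(code, locale):
    """Return (name, supplying_locale), following the fallback chain."""
    seen = set()
    current = locale
    while current is not None and current not in seen:
        seen.add(current)  # guards against accidental fallback loops
        name = LANGUAGE_NAMES.get(current, {}).get(code)
        if name is not None:
            return name, current
        current = FALLBACK.get(current)
    # Nothing found anywhere in the chain: show the bare code.
    return code, None


print(language_name("fr", "hsb"))
print(language_name("hsb", "hsb"))
```

In this sketch the name of French is missing from the hsb table, so the lookup returns the German name together with the locale that supplied it, which is how an untranslated name ends up being shown in the fallback language.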

If CLDR does not yet have a locale for your language, you can request one at CLDR (see below), so that other websites and programs can offer an interface in your language. However, if you cannot get a locale added to CLDR, or CLDR does not support the language name you wish to translate, you can request language name support for translatewiki.net and MediaWiki at Support on translatewiki.net.

CLDR workflow

Briefly, CLDR has a published[1] schedule reaching about one year into the future. It is divided into periods in which they do internal work, let people enter new locale data via the survey tool, let people vet newly entered data, prepare a new release, and so on. Scheduled times are occasionally adjusted by a few weeks to match actual workload and progress. No more than two major releases are made per year, so there are at most two periods per year during which approved people can add to, or amend, locales. Some basic data of each locale remains stable until a new release is out: the old base data continues to be used, for instance for validity checks and warnings, while new data, including new base data, is being collected. The general schedule introduced in 2013 should result in longer periods of access to the survey tool.

More details can be found at the CLDR process documentation.

Creating a new locale

If you plan to request a new locale at the CLDR and work on it, consider these hints first:

  1. If you are unfamiliar with the language, script, or writing forms and formats of the region, you will need several hours of assistance from a knowledgeable advisor. Do not try it on your own!
  2. Read CLDR's hints: the sections Adding New Locales, Picking the Right Language Identifier, and List of script codes. If you are a techie, a checklist about adding a new locale to CLDR, written for CLDR developers, may also interest you. There is a section "What is a Locale?" in the document describing the XML data format used to store locales. Take the survey tool walkthrough and try to get open questions answered; you are likely to have some. You can use the CLDR survey tool as a visitor without a login: do that, and play around inspecting locales that you have some understanding of. Scroll down the pages and read the explanations hidden there. They are often terse, missing, or somewhat out of context. Once you are used to it, it is not as bad as it looks at first sight!
  3. Request access to the survey tool via the CLDR tracker.
    • Ask for edit rights for the new locale in the survey tool. State that you will at least supply enough data for "minimal coverage" or "basic coverage"[2] - you should know what that means.
    • If you are multilingual, you may wish to ask for appropriate edit rights regarding other locales as well.
  4. Request the new locale via the tracker, asking for the things listed below to be preset and approved. (If you don't, you may have to wait up to a year before you have them, and it will hamper or even prevent your work on some data.)
    Fallback locale 
    State which locale the new locale is to fall back to for data that the new locale does not define. The target must be an existing locale. If you need to fall back to a locale that does not exist yet, you can either create it first, or use that missing locale's own future fallback; remember to change this once the missing locale becomes available. The fallback locale setting can be altered in the survey tool.
    Plural rules 
    These are quite central, and you do not want to miss them. See the page Language Plural Rules for how CLDR wants the possible choices of grammatical number described. The description itself is basically very easy. You may find a good grammar book and look it up, but you should check the special cases of zero and negative numbers, since these are often not covered in classical grammar books. If you have no better source, find out yourself, maybe with the help of friends. You must know your language well. You can choose (only a few!) of these: singular, dual, trial, quadral, paral, paucal, plural, distributive plural, singulative, collective, various modulo-based rules, and special cases for zero and/or negative numbers. Try to find sample sentences using a variety of word types, such as varying grammatical and natural genders, countable versus uncountable, separable versus indivisible, etc. Try each sentence with varying numbers, including zero and some negatives. Take special care over whether you sometimes must, or had better, say "nothing", or "no rice", or "not (even) any …", or something else, rather than "0 pieces of …"; usually a single (unavoidable) exception already warrants an extra case. Note, however, that you only need to keep distinctions that depend on the number, nothing else. For example, in a typical language having only the choices "n is 0", "n is 1", and "everything else", whether "everything else" uses plural, distributive plural, or collective, if those exist, depends on the words and maybe the context rather than on the specific number. For such languages, you can safely skip those distinctions, as well as the types of declension that words use.
    Character set 
    Include every foreign character that may appear in foreign place, country, or language names, even if your language does not usually use it. Say São Tomé & Príncipe, with its three accented Latin characters and the ampersand (&), is the correct country name for you: even though your language itself never uses any of these characters, you must include all four and have them approved prior to your work. If you don't, you can enter them later, but you will suffer badly from error messages and rejections until your additions to the character set have been approved. Approval is not going to happen before the survey tool is closed and a new CLDR data release is made (see CLDR workflow, above).
    Collation 
    Collation cannot be handled by the survey tool; you must submit it via the tracker. If you have the data at hand, supply it now. Collation is a tricky subject: if you do not have the necessary information, you can leave it until later, most likely without getting into trouble. Help on collation is available.
    Transliteration 
    If your locale does not use the Latin script, submit a transliteration to Latin. You might be able to use one that already exists at CLDR; otherwise you must ask them to create one from the data that you supply. Usually, existing resources can be found on the web with search engines, or you can ask a librarian familiar with your script.
    Day period categories 
    Day period categories are the clock time ranges during which people in your locale's region say, for example, "good morning" versus "good afternoon", or may append "in the morning" versus "at night" to clock times. If you do not set them, you will likely end up with "00:00 ≤ am < 12:00 ≤ pm ≤ 24:00" and be unable to enter correct clock-related strings for your locale in the survey tool. Day period categories can be neither set nor corrected using the survey tool. There is a sample request for day period categories, which you can follow.
    Territory language information 
    The new locale is bound to be missing from the most recent supplemental list of Territory Language Information. Supply the necessary information for all territories where it is commonly used by locals.
    Likely Subtags 
    Find the Likely Subtags for the new locale and ask to have them included in the list.
    Language list information 
    Submit the language identifiers of ISO 639-1, ISO 639-2 (or ISO 639-2T and ISO 639-2B, if they happen to differ), ISO 639-3, native name(s), English name(s), French name(s). Also identify the writing system(s), that is the script(s), record the hierarchy of language families the language belongs to, plus the place, areas, regions, or states where spoken, and the estimated or known number of active and passive speakers. There is a sample list request, which you can follow.
    • (This list may be incomplete)
  5. Once the new locale exists and you have got the appropriate access rights, you can edit locale data as soon as the survey tool is open for submissions again. Do so. If you have technical questions about how to fill in a certain field, you may occasionally be able to deduce something from the description of the Unicode Locale Data Markup Language (LDML).
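
The plural rules hint in step 4 can be tried out before filing a request. The sketch below is a hedged illustration, not CLDR's rule syntax: it assumes a hypothetical language with only the choices "n is 0", "n is 1", and "everything else", and the names `plural_category`, `FORMS`, and `render`, as well as the English-like sample strings, are made up for the example. The category names zero, one, and other are among those CLDR uses.

```python
def plural_category(n):
    """Hypothetical rule set: 'n is 0', 'n is 1', everything else."""
    if n == 0:
        return "zero"
    if n == 1:
        return "one"
    return "other"


# Made-up sample forms; replace them with sentences in your own language.
FORMS = {
    "zero": "no files",
    "one": "{n} file",
    "other": "{n} files",
}


def render(n):
    """Pick the form for n and fill in the number."""
    return FORMS[plural_category(n)].format(n=n)


# Try zero, small and large numbers, and a negative, as suggested in step 4.
for n in (0, 1, 2, 5, 21, -1):
    print(n, "->", render(n))
```

Reading the printed sentences aloud with a native speaker is a quick way to spot a number that needs an extra case of its own.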

Additional checks that you should or could make, at the latest once the locale exists:

  • If your locale refers to a comparatively small area, check that CLDR's UN M.49 data describing it is present and correct. File a new bug request to have it added or corrected, if necessary.
  • If your locale code (or any of its subtags) has aliased or older equivalent codes (or subtags), check that those appear in the supplemental Aliases list. Create a bug request to have them added, if you spot omissions.
  • Maybe more to come ...
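
One more check that can be scripted is the character set hint under "Creating a new locale": list the characters that occur in the names you intend to enter but are absent from the character set you plan to request. This is a minimal sketch under stated assumptions. The function `missing_characters` and the plain a to z set are hypothetical, and comparing lowercase forms is an assumption made for this example.

```python
import unicodedata


def missing_characters(names, exemplar):
    """Return the characters used in names that are not in the exemplar set."""
    used = set()
    for name in names:
        # Normalise to NFC so that accented letters are single code points.
        for ch in unicodedata.normalize("NFC", name):
            if not ch.isspace():
                used.add(ch.lower())  # assumption: compare lowercase forms
    return sorted(used - set(exemplar))


# Hypothetical character set of a language that uses only plain a to z.
exemplar = set("abcdefghijklmnopqrstuvwxyz")
names = ["São Tomé & Príncipe"]

print(missing_characters(names, exemplar))  # ['&', 'ã', 'é', 'í']
```

The four characters reported are the ones that would have to be included in the requested character set before the country name can be entered without rejections.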

See also

Where to ask questions

For CLDR-related issues, bugs, and suggestions, feel free to open a ticket on CLDR's request tracker. You can do so without having to log in, but make sure to enter your e-mail address. As of 2014, requests are usually processed weekly.

For translatewiki.net-related issues, and anything else in general, you can ask on Support. If you do not get a sufficient reply in a reasonable amount of time, you are welcome to put a section link to your question on the talk page of Purodha. He maintains a locale at CLDR and likes to help, but does not necessarily read all support requests all the time.

There is also the CLDR mailing list, where you can ask for support, browse the archive, and so on.

Footnotes

  1. For instance, in the left column on their home page, below a box of various internal links
  2. See page http://cldr.unicode.org/index/bug-reports