Remove 'sh' (Serbo-Croatian) language?

bump...

-- Damien 17:05, 11 July 2022

sh works. sc is an alternative. Sardinian cannot be sc.

Obsuser (talk) 14:45, 16 October 2022

You are wrong: "sc" is the standard code for Sardinian in ISO 639-1, in BCP 47, on Wikimedia wikis and in MediaWiki translations (do not assume it is usable for Serbo-Croatian).

"sh" has been removed from ISO 639-1 only (but "hbs" has not been removed from ISO 639-3!), and not from BCP 47 (where "hbs" was mapped to be the same as "sh"), which more or less still treats it as an alias of "sh-Latn" (with the Latin script implied by default). It has been kept in ISO 639-3 as a macrolanguage (containing "hr", "cnr" aliased by default to "cnr-Latn", and "bs" aliased by default to "bs-Latn"). It is special compared to other macrolanguages because of its implied default script, but "sh-Cyrl" is also valid (and comprises "bs-Cyrl", "cnr-Cyrl", and "sr" aliased to "sr-Cyrl").

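To make the implied-script idea concrete, here is a minimal Python sketch. The table only restates the defaults described in this post ("sh", "cnr" and "bs" implying Latin, "sr" implying Cyrillic); it is not read from the IANA registry, the name `with_implied_script` is made up for illustration, and extlang subtags are ignored.

```python
# Assumption: implied scripts exactly as described in the post above,
# not taken from the IANA language subtag registry.
IMPLIED_SCRIPT = {"sh": "Latn", "cnr": "Latn", "bs": "Latn", "sr": "Cyrl"}


def with_implied_script(tag: str) -> str:
    """Make the implied script explicit when the tag carries none."""
    parts = tag.split("-")
    lang = parts[0].lower()
    # A script subtag is four letters placed right after the language subtag
    # (this sketch ignores extlang subtags).
    has_script = len(parts) > 1 and len(parts[1]) == 4 and parts[1].isalpha()
    if has_script or lang not in IMPLIED_SCRIPT:
        return tag
    return "-".join([lang, IMPLIED_SCRIPT[lang]] + parts[1:])
```

With these assumptions, "sh" becomes "sh-Latn" and "sr" becomes "sr-Cyrl", while "sh-Cyrl" and codes outside the table are returned unchanged.
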
The old decision taken in ISO 639-1 only is very unfortunate, given that "hbs" has been kept in ISO 639-2 and ISO 639-3 (including in its later revision, where it was assigned the "scope" of a "macrolanguage"). That code may have been retired from ISO 639-1 not for technical or translation reasons, but probably with only bibliographic use in mind (yet many public libraries in the world have not removed that classification from their book archives, notably not for books published in the former Yugoslavia!). That decision was also taken before ISO 639-3 was released and defined the concept of "macrolanguages", and before the revision of BCP 47. At that time ISO 639-1 and ISO 639-2 were a mess: they were unstable, and most applications chose to ignore ISO 639 and developed their standards to reference BCP 47 and its related IANA database for language tags instead. ISO 639-3 attempted a more comprehensive codification and had to fix some codes by defining their scope; BCP 47 was revised to include "grandfathered" tags and to ensure stability and backward compatibility. Today nobody wants to make any normative reference to ISO 639, except for bibliographic purposes with simplified classifications that are not usable at all for translations and technical applications. This is the case for all IETF, W3C and ITU standards, as well as other ISO/IEC/ECMA standards and many national standards bodies, even if they sometimes decided to preserve their own legacy codes and made specific requests to ISO and the IETF to maintain stability.

ISO 639-1 still has very bad codes like "bh" (which is not even a macrolanguage but a family, mapped not to ISO 639-2 or -3 but to ISO 639-5 as "bih"; note that ISO 639-5 is still very incomplete for classifying language families). Wikimedia also still has its own legacy codes that violate BCP 47 (but they are slowly being retired and replaced). Wikimedia privately uses "bh" in its wiki domain names on the assumption that it refers only to "bho" (Bhojpuri), one of the languages of that family, but it uses conforming ISO 639-3 codes for the other languages of that family, so that is not blocking any project.

"sh" (Serbo-Croatian) is still valid in BCP 47, and many linguists (as well as many native speakers) also consider it a single macrolanguage comprising "hr" (Croatian), "bs" (Bosnian), "cnr" (Montenegrin), and "sr" (Serbian), independently of the "Latn" or "Cyrl" script they may use, even though the languages were separated. They are basically dialects/variants of each other with excellent mutual intelligibility, just some preferred forms in each of them, and minor orthographic differences between locations. The orthographies in the two scripts are mutually interchangeable, which is why Wikimedia wikis provide an automated transliterator for reading/writing them in either script, purely as a matter of user preference. Very few words are in fact localized specifically between these 4 languages, and none between the two scripts.

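As an illustration of that grouping, the following Python sketch checks whether a language tag falls under "sh". The member set simply restates the four codes listed in this post, and the helper names `primary_subtag` and `in_serbo_croatian` are hypothetical.

```python
# Assumption: members of the "sh" macrolanguage exactly as listed in this post.
SH_MEMBERS = {"hr", "bs", "cnr", "sr"}


def primary_subtag(tag: str) -> str:
    """Return the lowercased primary language subtag of a BCP 47 style tag."""
    return tag.split("-")[0].lower()


def in_serbo_croatian(tag: str) -> bool:
    """True if the tag is "sh" itself or one of its member languages."""
    lang = primary_subtag(tag)
    return lang == "sh" or lang in SH_MEMBERS
```

The script subtag plays no role here, so "sr-Cyrl" and "sr-Latn" are both treated as members.
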
(Note that Wikimedia still uses the incorrect legacy codes "sr-ec" and "sr-el" instead of the standard "sr-Cyrl" and "sr-Latn". This is only for its domain names and interwikis, not for HTML language tagging, which uses standard BCP 47 codes. Some properties in Wikidata also still use these legacy codes, but they are deprecated and should be replaced by BCP 47 codes as well; this does not apply to sitelinks, whose usage in domain names (for wikis) and in interwiki prefixes does NOT violate the HTML standard. Wikimedia still has to clean up its local use of "nrm" instead of "nrf" for Norman, which severely conflicts with ISO 639-3 and BCP 47, as it blocks any attempt to translate to "Narom", a completely unrelated South-East Asian language.)
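
A small Python sketch of that legacy-to-standard mapping follows. It covers only the three codes named above and is not MediaWiki's actual conversion table; the names `LEGACY_TO_BCP47` and `to_bcp47` are invented for illustration.

```python
# Hypothetical table, limited to the legacy codes named in this thread.
LEGACY_TO_BCP47 = {
    "sr-ec": "sr-Cyrl",
    "sr-el": "sr-Latn",
    "nrm": "nrf",  # Wikimedia's "nrm" means Norman, not Narom
}


def to_bcp47(wiki_code: str) -> str:
    """Return the BCP 47 tag for a known legacy code, else the code unchanged."""
    return LEGACY_TO_BCP47.get(wiki_code.lower(), wiki_code)
```

Codes that are not in the table, such as "hr", pass through untouched, so the function can be applied to every wiki code before it is written into an HTML `lang` attribute.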

The best working standard for encoding languages used in translations is BCP 47 (i.e. RFC 5646 in its latest release, with its related IANA database). Let's forget ISO 639-1/2 completely! It will remain in the limbo of some public libraries with their old classification systems, but many have converted their catalogs to use BCP 47 instead for language identification, possibly plus ISO 639-5 for a very weak classification of language families in book collections. If they need more precise classifications today, they can use BCP 47 "private-use" codes; they can also use ISO 15924, including script variants for written documents and artworks, even if these variants are unified in Unicode.

Finally, note that for translations we don't care at all about ISO 639 (and its many past deficiencies), only about BCP 47 (where ISO 639 is only a partial and unstable source). This is not just for this wiki, or MediaWiki, or Wikimedia: it is a standard used everywhere on the web (part of HTML for example, as well as almost all i18n libraries and the programming languages using them). Many things have disappeared or changed unilaterally in ISO 639, or have been rejected for use in BCP 47, which is a much more usable and more precise standard, where stability for language identification was part of the design and is kept forever as much as possible. Some BCP 47 tags or subtags may become insufficient and may need to be requalified in newer documents, but any translation made with a valid BCP 47 tag will remain valid in any later update, except if there was a severe error and the tag is exceptionally marked as deprecated in the IANA database, which may or may not suggest a preferred replacement when one is the most likely for common use cases. ISO 639 was only defined for broad use by librarians, for classification and searches in their catalogs, or for managing copyrights in large categories according to their current practices for data exchange. Later, ISO 639 was partly updated to add "technical use", motivated by an attempt at compatibility with BCP 47 (but only with an old version, and it was never updated afterwards). ISO 639 then broke that technical compatibility multiple times. The BCP 47 data sources for adding entries to the IANA database have been publicly documented in multiple RFCs with rationales. That is not the case for ISO 639, whose decisions are closed and limited by copyright issues, with no details about how they were vetted; so ISO 639 is not a "best practice", only something endorsed by a few national ISO vetters for use in their national catalogs, who actually don't care at all about using precise and distinctive terminology.
Don't refer to ISO 639: it's not a normative reference, just an informal, informative one! Note also that various countries have stopped supporting ISO 639 (which they never approved themselves in the ISO TC) in their public catalogs and media libraries and have adopted BCP 47 instead; the same is true for many publishers, media creators and vendors.

Verdy p (talk) 15:15, 16 October 2022