<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!ENTITY RFC4646 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.4646.xml">
<!ENTITY rfc5646 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5646.xml">
]>
<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
<?rfc strict="yes" ?>
<?rfc toc="yes"?>
<?rfc tocdepth="4"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes" ?>
<?rfc compact="yes" ?>
<?rfc subcompact="no" ?>
<rfc category="info" docName="draft-davis-u-langtag-ext-04" ipr="trust200902" submissionType="independent">
<front>
<title abbrev="BCP 47 Unicode Locale Extension">BCP 47 Extension U</title>
<author fullname="Mark Davis" initials="M.E." surname="Davis">
<organization>Google</organization>
<address>
<email>mark@macchiato.com</email>
</address>
</author>
<author fullname="Addison Phillips" initials="A" surname="Phillips">
<organization>Lab126</organization>
<address><email>addison@lab126.com</email></address>
</author>
<author initials="Y" surname="Umaoka" fullname="Yoshito Umaoka">
<organization abbrev="IBM">IBM</organization>
<address><email>yoshito_umaoka@us.ibm.com</email></address>
</author>
<date month="August" year="2010" day="26"/>
<!-- Meta-data Declarations -->
<area>General</area>
<workgroup>Internet Engineering Task Force</workgroup>
<keyword>locale</keyword><keyword>bcp 47</keyword>
<!-- Keywords will be incorporated into HTML output
files in a meta tag but they have no effect on text or nroff
output. If you submit your draft to the RFC Editor, the
keywords will be used for the search engine. -->
<abstract>
<t>This document specifies an Extension to BCP 47
which provides subtags that specify language and/or locale-based behavior or refinements to language tags, according
to work done by the Unicode Consortium.</t>
</abstract>
</front>
<middle>
<section title="Introduction">
<t> <xref target="BCP47"></xref> permits the definition and registration of language tag extensions "that
contain a language component and are compatible with applications that
understand language tags". This document defines an extension for identifying Unicode locale-based variations using language tags. The "singleton" identifier for this extension is 'u'.</t>
<section title="Requirements Language">
<t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119.</t>
</section>
</section>
<?rfc needLines="8" ?>
<section title="BCP 47 Required Information">
<t>Language tags, as defined by <xref target="BCP47"></xref>, are useful for identifying the language of content. They are also used as locale identifiers (or can be mapped to locales) in many operating environments and APIs. However, many locale identifiers also require additional "tailorings" or options for specific values within a language, culture, region, or other variation. This extension provides a mechanism for using these additional tailorings within language tags for general interchange.</t><t>The Unicode Consortium defines a standardized, structured set of locale data and identifiers for locale data in the "Common Locale Data Repository" or "CLDR". The maintaining authority for the extension defined by this document is the Unicode
Consortium:</t>
<texttable>
<ttcol>Item</ttcol>
<ttcol>Value</ttcol>
<c>Name</c>
<c>Unicode Consortium</c>
<c>Contact Email</c>
<c>cldr-contact@unicode.org</c>
<c>Discussion List Email</c>
<c>cldr-users@unicode.org</c>
<c>URL Location</c>
<c>cldr.unicode.org</c>
<c>Specification</c>
<c>Unicode Technical Standard #35 Unicode Locale Data Markup Language
(LDML), http://unicode.org/reports/tr35/</c>
<c>Section</c>
<c>Section 3 Unicode Language and Locale Identifiers</c>
</texttable>
<t>The specification of extension subtags is provided by <eref target="http://unicode.org/reports/tr35/#Key_Type_Definitions">Section 3, Key Type Definitions</eref> of <xref target="UTS35">Unicode
Technical Standard #35: Unicode Locale Data Markup Language</xref>. As required by BCP 47, subtags follow the language tag ABNF and
other rules for the formation of language tags and subtags, are restricted to the ASCII letters and digits, are not case sensitive, and do not exceed eight characters in length. Note that any "well-formed" language tag (see <xref target="BCP47">RFC 5646, Section 2.2.9</xref>) is also a well-formed locale identifier.</t>
<t><xref target="UTS35">LDML</xref> specifies a canonical
representation. LDML is available over the Internet and at no cost, and
is available via a royalty-free license at
http://unicode.org/copyright.html. LDML is versioned, and each
version of LDML is numbered, dated, and stable. Extension subtags, once
defined by LDML, are never retracted or substantially changed in
meaning.</t><t>The structure of the Unicode locale extension is determined by the Unicode CLDR Technical Committee, in accordance with the policies and procedures in
http://www.unicode.org/consortium/tc-procedures.html, and subject to the Unicode Consortium Policies on http://www.unicode.org/policies/policies.html.</t><t>Changes that can be made by successive versions of <xref target="UTS35">LDML</xref> by the Unicode Consortium without requiring a new RFC include: the allocation of new attributes, keywords, and types; clarifications or non-material changes to an existing attribute, keyword, or type; and compatible extensions to the overall syntactic structure of attributes, keywords, and types. A new RFC would be required for material changes to an existing attribute, keyword, or type, or an incompatible change to the overall syntactic structure of attributes, keywords, and types; however, such a change would be contrary to the policies of the Unicode Consortium, and thus is not anticipated.</t>
<section title="Summary">
<t>The subtags available for use in the 'u' extension consist of a set of attributes, keys, and types. Attributes, keys, types, and their respective meanings are defined in <eref target="http://unicode.org/reports/tr35/#Unicode_Language_and_Locale_Identifiers">Section 3 (Unicode Language and Locale Identifiers)</eref> of <xref target="UTS35"></xref>. The following is a summary of that definition:<list style="symbols"><t>An 'attribute' is a subtag with a length of three to eight characters following the singleton and preceding any 'keyword' sequences. No attributes were defined at the time of this document's publication.</t><t>A 'keyword' is a sequence of subtags consisting of a 'key' subtag, followed by zero or more 'type' subtags (so a 'key' might appear alone and not be accompanied by a 'type' subtag). A 'key' MUST NOT appear more than once in a language tag's extension string. The order of the 'type' subtags within a 'keyword' is sometimes significant to their interpretation. <list style="letters"><t>A 'key' is a subtag with a length of exactly two characters. Each 'key' is followed by zero or more 'type' subtags. </t><t>A 'type' is a subtag with a length of three to eight characters following a key. 
'Type' subtags are specific to a particular 'key' and the order of the 'type' subtags MAY be significant to the interpretation of the 'keyword'.</t></list></t></list></t>
<t>For example, the language tag "de-DE-u-attr-co-phonebk" consists of:<list style="symbols">
<t>The base language tag "de-DE" (German as used in Germany), exactly as defined by <xref target="BCP47"></xref> using subtags from the IANA Language Subtag Registry.</t>
<t>The singleton 'u', identifying this extension.</t>
<t>The attribute 'attr', which is an example for illustration (no attributes were defined at the time this document was published).</t>
<t>The keyword 'co-phonebk', consisting of the key 'co' (Collation) and the type 'phonebk' (Phonebook collation order).</t></list></t>
<t>Only the first occurrence of an attribute or key conveys meaning in a language tag. When interpreting tags containing the Unicode locale extension, duplicate attributes or keywords are ignored in the following way: ignore any attribute that has already appeared in the tag and ignore any keyword whose key has already occurred in the tag.</t>
<t>Successive versions of <xref target="UTS35"></xref> could define additional attributes, keys, and types. Once defined, attributes, keys, and types will never be removed.</t>
<t>Beginning with CLDR version 1.7.2, machine-readable files are available listing the valid attributes, keys, and types for each successive version of <xref target="UTS35"></xref>. These releases are listed on <eref target="http://cldr.unicode.org/index/downloads">http://cldr.unicode.org/index/downloads</eref>. Each release has an associated data directory of the form "http://unicode.org/Public/cldr/&lt;version&gt;", where "&lt;version&gt;" is replaced by the release number. For example, for version 1.7.2, the "core.zip" file is located at <eref target="http://unicode.org/Public/cldr/1.7.2/">http://unicode.org/Public/cldr/1.7.2/core.zip</eref>.
Inside the "core.zip" file, the path "common/bcp47" contains the data files defining the valid attributes, keys, and types. The most recent version is always identified by the version "latest" and can be accessed by the URL in <xref target="regform"></xref>.</t><t>To get the version information in XML when working with the data files, the XML parser must be validating. When the 'core.zip' file is unzipped, the 'dtd' directory will be at the same level as the 'bcp47' directory; that is required for correct validation. For each release after CLDR 1.8, types introduced in that release are also marked in the data files by the XML attribute "since", such as in the following example:<figure><artwork>&lt;type name="adp" since="1.9"/&gt; </artwork></figure></t><t>The data is also currently maintained in a source code repository, with each release tagged, for viewing directly without unzipping. For example, see:<list style="symbols"><t>http://unicode.org/repos/cldr/tags/release-1-7-2/common/bcp47/</t><t>http://unicode.org/repos/cldr/tags/release-1-8/common/bcp47/</t></list></t><t>Some data in the CLDR data files might require reference to <xref target="UTS35">LDML</xref>. For specific information, see <eref target="http://unicode.org/reports/tr35/#Locale_Extension_Key_and_Type_Data">Appendix Q</eref> in that document. For example, LDML reserves the type 'codepoints' to define specific code point ranges in Unicode for specific purposes.</t><section title="Canonicalization" anchor="canonicalization"><t>As required by <xref target="BCP47"></xref>, the use of uppercase or lowercase letters is not significant in the subtags used in this extension. The canonical form for all subtags in the extension is lowercase. The canonical order of attributes is in <xref target="US-ASCII"></xref> order (that is, numbers before letters, with letters sorted as lowercase US-ASCII code points). The canonical order of keywords is in <xref target="US-ASCII"></xref> order by key. 
The order of subtags within a keyword is significant; the meaning of this extension is altered if those subtags are rearranged. Thus, the canonical form of the extension never reorders the subtags within a keyword.</t>
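<t>The following non-normative sketch illustrates how the rules in this section might be applied: it splits the 'u' extension into attributes and keywords, ignores duplicate attributes and keys, and emits the canonical form. It is written in Python, assumes a well-formed language tag as input, and does not check subtags against the CLDR data files. The function names "parse_u_extension" and "canonical_u_extension" are hypothetical and are not defined by LDML or BCP 47.</t>
<figure><artwork><![CDATA[
```python
def parse_u_extension(tag):
    """Split the 'u' extension of a well-formed tag.

    Returns (attributes, keywords): a list of attributes and a
    dict mapping each key to its ordered list of types.
    """
    subtags = tag.lower().split("-")  # case is not significant
    start = None
    for i, s in enumerate(subtags):
        if s == "x":       # everything after 'x' is private use
            break
        if s == "u":       # singleton identifying this extension
            start = i + 1
            break
    attributes, keywords = [], {}
    if start is None:
        return attributes, keywords
    current = None         # type list of the keyword being read
    in_keywords = False
    for s in subtags[start:]:
        if len(s) == 1:    # next singleton ends this extension
            break
        if len(s) == 2:    # a key is exactly two characters
            in_keywords = True
            if s in keywords:
                current = None   # duplicate key: ignore keyword
            else:
                current = keywords[s] = []
        elif not in_keywords:
            if s not in attributes:  # ignore duplicate attribute
                attributes.append(s)
        elif current is not None:
            current.append(s)        # keep the order of types
    return attributes, keywords


def canonical_u_extension(tag):
    """Return the canonical 'u' extension string, or ''."""
    attributes, keywords = parse_u_extension(tag)
    parts = sorted(attributes)       # US-ASCII order
    for key in sorted(keywords):     # keywords ordered by key
        parts.append(key)
        parts.extend(keywords[key])  # types are never reordered
    return "-".join(["u"] + parts) if parts else ""
```
]]></artwork></figure>
<t>Under these assumptions, canonical_u_extension("de-DE-u-attr-co-phonebk") returns "u-attr-co-phonebk". Keywords are kept in a mapping keyed by 'key' so that a repeated key can be detected and skipped, while the types of each keyword stay in a list because their order can be significant.</t>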
</section></section>
<section title="Registration Form" anchor="regform"><t>Per <xref target="BCP47">RFC 5646, Section 3.7</xref>:</t>
<figure><artwork>%%
Identifier: u
Description: Unicode Locale
Comments: Subtags for the identification of language and cultural
variations. Used to set behavior in locale APIs. Data is
located in the "common/bcp47" directory inside the referenced
URL. Unicode Technical Standard #35 (LDML) provides additional
reference material defining the keys and values.
Added: 2010-mm-dd
RFC: [TBD]
Authority: Unicode Consortium
Contact_Email: cldr-contact@unicode.org
Mailing_List: cldr-users@unicode.org
URL: http://www.unicode.org/Public/cldr/latest/core.zip
%%</artwork></figure>
</section>
</section>
<section anchor="Acknowledgements" title="Acknowledgements">
<t>Thanks to John Emmons and the rest of the Unicode
CLDR Technical Committee for their work in developing the BCP 47 subtags
for LDML.</t><t>Thanks also to Doug Ewell, for his many suggestions for improvements to this document.</t>
</section>
<section anchor="IANA" title="IANA Considerations">
<t>This document will require IANA to insert the record in <xref target="regform"></xref> into the Language Extensions Registry, according to
Section 3.7, "Extensions and the Extensions Registry", of "Tags for
Identifying Languages" in <xref target="BCP47"></xref>. Per Section 5.2 of <xref target="BCP47"></xref>, there might be occasional (rare) requests by the Unicode Consortium (the "Authority" listed in the record) for maintenance of this record. Changes that can be submitted to IANA without the publication of a new RFC are limited to modification of the Comments, Contact_Email, Mailing_List, and URL fields. Any such requested changes MUST use the domain 'unicode.org' in any new addresses or URIs, MUST explicitly cite this document (so that IANA can reference these requirements), and MUST originate from the 'unicode.org' domain. The domain or authority can only be changed via a new RFC.</t><t>This document does not require IANA to create or maintain a new registry or otherwise impact IANA.</t>
</section>
<section anchor="Security" title="Security Considerations">
<t>The security considerations for this extension are the same as those
for <xref target="BCP47"></xref>. See <xref target="BCP47">RFC 5646, Section 6, Security Considerations</xref>.</t>
</section>
</middle>
<back>
<references title="Normative References">
<reference anchor="UTS35" target="http://www.unicode.org/reports/tr35/">
<front>
<title abbrev="LDML">
Unicode Technical Standard #35: Locale Data Markup Language (LDML)
</title>
<author initials="M" surname="Davis" fullname="Mark Davis">
<organization>Unicode Consortium</organization>
</author>
<date day="21" month="December" year="2007"/>
</front>
<annotation>Section 3: http://unicode.org/reports/tr35/#Unicode_Language_and_Locale_Identifiers</annotation>
<annotation>Appendix Q: http://unicode.org/reports/tr35/#Locale_Extension_Key_and_Type_Data</annotation>
</reference>
<reference anchor="BCP47">
<front>
<title abbrev="BCP47">Tags for the Identification of Language (BCP47)</title>
<author initials="M.E." surname="Davis" fullname="Mark Davis" role="editor">
<organization></organization>
</author>
<date month="September" year="2009"/>
</front>
</reference>
<reference anchor="US-ASCII">
<front>
<title>ISO/IEC 646:1991, Information technology -- ISO 7-bit coded character set for information interchange. </title>
<author>
<organization>International Organization for Standardization</organization>
</author>
<date year="1991"/>
<abstract>
<t>This standard defines an International Reference Version (IRV) which corresponds exactly to what is widely known as ASCII or US-ASCII. ISO/IEC 646 was based on the earlier standard ECMA-6. ECMA has maintained its standard up to date with respect to ISO/IEC 646 and makes an electronic copy available at http://www.ecma-international.org/publications/standards/Ecma-006.htm. ISO/IEC 646 JTC 1/SC 2</t>
</abstract>
</front>
</reference></references><references title="Informative References"><reference anchor="ldml-registry"><front><title>Registry for Common Locale Data Repository tag elements</title><author fullname="Unicode Consortium"><organization><?xm-replace_text {organization}?></organization></author><date year="2009" month="September"/></front></reference></references>
</back>
</rfc>