| <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> |
| <html> |
| <head> |
| |
| <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> |
| |
| <meta http-equiv="Content-Language" content="en-us"> |
| |
| <meta name="VI60_defaultClientScript" content="JavaScript"> |
| |
| <meta name="GENERATOR" content="Microsoft FrontPage 6.0"> |
| |
| <meta name="keywords" content="Unicode, common locale data repository"> |
| |
| <meta name="ProgId" content="FrontPage.Editor.Document"> |
| |
| |
| <title>Unicode CLDR Bug Reports</title> |
| <link rel="stylesheet" type="text/css" href="http://www.unicode.org/webscripts/standard_styles.css"> |
| |
| <style type="text/css"> |
| <!-- |
| .e{margin-left:1em;text-indent:-1em;margin-right:1em} |
| .tx{font-weight:bold} |
| --> |
| </style> |
| </head> |
| |
| |
| |
| <body text="#330000"> |
| |
| |
| <table border="0" cellpadding="0" cellspacing="0" width="100%"> |
| |
| <tbody> |
| <tr> |
| |
| <td colspan="2"> |
| |
| <table border="0" cellpadding="0" cellspacing="0" width="100%"> |
| |
| <tbody> |
| <tr> |
| |
| <td class="icon"><a href="http://www.unicode.org/"> |
| <img src="http://www.unicode.org/webscripts/logo60s2.gif" alt="[Unicode]" align="middle" border="0" height="33" width="34"></a> |
| <a class="bar" href="index.html"><font size="3">Common Locale Data Repository</font></a></td> |
| |
| <td class="bar"><a href="http://www.unicode.org" class="bar">Home</a> | <a href="http://www.unicode.org/sitemap/" class="bar">Site Map</a> | |
| <a href="http://www.unicode.org/search/" class="bar">Search</a></td> |
| |
| </tr> |
| |
| |
| </tbody> |
| </table> |
| |
| </td> |
| |
| </tr> |
| |
| <tr> |
| |
| <td colspan="2" class="gray"> </td> |
| |
| </tr> |
| |
| <tr> |
| |
| <td class="navCol" valign="top" width="25%"> |
| |
| <table class="navColTable" border="0" cellpadding="0" cellspacing="4" width="100%"> |
| |
| <tbody> |
| <tr> |
| |
| <td class="navColTitle">Contents</td> |
| |
| </tr> |
| |
| <tr> |
| |
| <td class="navColCell" valign="top"><a href="#Collation_Bugs">Collation Bugs</a></td> |
| |
| </tr> |
| |
| <tr> |
| |
| <td class="navColCell" valign="top"><a href="#Possible_Comparison_Sources">Sources</a></td> |
| |
| </tr> |
| |
| <tr> |
| |
| <td class="navColTitle">Unicode CLDR</td> |
| |
| </tr> |
| |
| <tr> |
| |
| <td class="navColCell" valign="top"><a href="index.html">CLDR Project</a></td> |
| |
| </tr> |
| |
| <tr> |
| |
| <td class="navColCell" valign="top"><a href="repository_access.html">CLDR Releases (Downloads)</a></td> |
| |
| </tr> |
| |
| <tr> |
| |
| <td class="navColCell" valign="top"><a href="survey_tool.html">CLDR Survey Tool</a></td> |
| |
| </tr> |
| |
| <tr> |
| |
| <td class="navColCell" valign="top"><a href="filing_bug_reports.html">CLDR Bug Reports</a></td> |
| |
| </tr> |
| |
| <tr> |
| |
| <td class="navColCell" valign="top"><a href="comparison_charts.html">CLDR Charts</a></td> |
| |
| </tr> |
| |
| <tr> |
| |
| <td class="navColCell" valign="top"><a href="process.html">CLDR Process</a></td> |
| |
| </tr> |
| |
| <tr> |
| |
| <td class="navColCell" valign="top"><a href="http://www.unicode.org/reports/tr35/">UTS #35: Locale Data Markup Language (LDML)</a></td> |
| |
| </tr> |
| |
| <tr> |
| |
| <td class="navColTitle">Related Links</td> |
| |
| </tr> |
| |
| <tr> |
| |
| <td class="navColCell" valign="top">Join the <a href="http://www.unicode.org/consortium/consort.html">Unicode Consortium</a></td> |
| |
| </tr> |
| |
| <tr> |
| |
| <td class="navColCell" valign="top"><a href="http://www.unicode.org/reports/">Unicode Technical Reports</a></td> |
| |
| </tr> |
| |
| <tr> |
| |
| <td class="navColCell" valign="top"><a href="http://www.unicode.org/faq/reports_process.html">Technical Reports Development and Maintenance Process</a></td> |
| |
| </tr> |
| |
| <tr> |
| |
| <td class="navColCell" valign="top"><a href="http://www.unicode.org/consortium/utc.html">Unicode Technical Committee</a></td> |
| |
| </tr> |
| |
| <tr> |
| |
| <td class="navColCell" valign="top"><a href="http://www.unicode.org/versions/">Versions of the Unicode Standard</a></td> |
| |
| </tr> |
| |
| <tr> |
| |
| <td class="navColTitle">Other Publications</td> |
| |
| </tr> |
| |
| <tr> |
| |
| <td class="navColCell" valign="top"><a href="http://www.unicode.org/standard/standard.html">The Unicode Standard</a></td> |
| |
| </tr> |
| |
| <tr> |
| |
| <td class="navColCell" valign="top"><a href="http://www.unicode.org/notes/">Unicode Technical Notes</a></td> |
| |
| </tr> |
| |
| |
| </tbody> |
| </table> |
| |
| <!-- BEGIN CONTENTS --></td> |
| |
| <td> |
| |
| <table> |
| |
| <tbody> |
| <tr> |
| |
| <td class="contents" valign="top"> |
| |
| <div class="body"> |
| |
| <h1>Unicode CLDR Bug Reports</h1> |
| |
| |
| <p><span class="changed">Most proposed data (new or corrections) should be entered via the </span><a href="survey_tool.html">CLDR Survey Tool</a><span class="changed">. |
| </span></p> |
| |
| |
| <p>Bugs may be filed for defects in the survey tool, for |
| adding or changing non-language data (such as currency usage), for |
| additions or changes to data that is not yet handled by the survey tool |
| (collation, segmentation, and transliteration), and for feature |
| requests in CLDR or <a href="http://www.unicode.org/reports/tr35/">UTS #35: Locale Data Markup Language (LDML)</a>.</p> |
| |
| |
| <p>To file such a bug, go to <a href="http://www.unicode.org/cldr/bugs/locale-bugs">Locale Bugs</a>. |
| Try to give as much information as possible to help address the issue, |
| and please group related bugs (such as a list of problems with the LDML |
| specification) into a single bug report. Some specific cases are |
| covered below.</p> |
| |
| |
| <h2><a name="Collation_Bugs">Collation Bugs</a></h2> |
| |
| |
| <p>The exact collation sequence for a given language may be |
| difficult to determine. The base ordering of characters can be fairly |
| straightforward, but there are quite a few other complications |
| involved. </p> |
| |
| |
| <p><span>Most standards that specify collation, such as DIN |
| or CS, are not targeted at algorithmic sorting, and are not complete |
| algorithmic specifications. For example, CSN 97 6030 requires |
| transliteration of foreign scripts, but there are many choices as to |
| how to transliterate, and the exact mechanism is not specified. It also |
| specifies that geometric shapes are sorted by the number of vertices |
| and edges, which is difficult to determine at best and is |
| subject to variation in glyphs. </span><span>The CLDR goals are to match the sorting of exemplar letters |
| and common punctuation, and to |
| leave everything else to the standard UCA ordering. </span>For more information, see |
| <a href="http://www.unicode.org/reports/tr10/#Introduction">UTS #10: Unicode Collation Algorithm</a> (UCA).</p> |
| |
| |
| <p>For readability, the rules are presented here in |
| Java/ICU rule format, rather than XML; for the same reason, we prefer |
| the bug reports to also use that format, even though the end result |
| will be in XML. For more information, see <a href="http://icu.sourceforge.net/userguide/Collate_Customization.html">ICU Collation Customization</a>.</p> |
| |
| |
| <p>Please supply some short test cases that illustrate the |
| correct sorting behavior as a list of lines in sorted order. Try to |
| include cases that show the boundary behavior by including high |
| suffixes, such as the following:</p> |
| |
| |
| <ul> |
| |
| <li><i>Rules:</i> |
| |
| <ul> |
| |
| <li><i>& c < cs</i></li> |
| |
| <li>& cs <<< ccs / cs</li> |
| |
| |
| </ul> |
| |
| </li> |
| |
| <li><i>Test Data:</i> |
| |
| <ul> |
| |
| <li><i>c<br> |
| |
| cy<br> |
| |
| cs<br> |
| |
| cscs<br> |
| |
| ccs<br> |
| |
| cscsy<br> |
| |
| ccsy<br> |
| |
| csy<br> |
| |
| d</i></li> |
| |
| |
| </ul> |
| |
| </li> |
| |
| |
| </ul> |
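| <p>The sample rules and test data above can be sanity-checked with a small script before the real test. The following Python sketch is a toy model only, not ICU or the UCA: it assumes lowercase a-z input, models just the primary and tertiary levels, and hard-codes the two sample rules. The names PRIMARY_ORDER, EXPANSIONS, and sort_key are hypothetical.</p> |

```python
# Toy model only: a unit's primary weight is its position in this list, so the
# contraction "cs" is a letter of its own between "c" and "d" (& c < cs).
PRIMARY_ORDER = ["a", "b", "c", "cs", "d"] + list("efghijklmnopqrstuvwxyz")
PRIMARY = {unit: weight for weight, unit in enumerate(PRIMARY_ORDER)}

# "ccs" behaves like "cscs" but is tertiary-greater (& cs <<< ccs / cs).
# Each entry maps a unit to (units it expands to, tertiary weight per unit).
EXPANSIONS = {"ccs": (("cs", "cs"), (1, 0))}


def sort_key(text):
    """Return (primary weights, tertiary weights) for a lowercase a-z string."""
    primaries, tertiaries = [], []
    i = 0
    while i < len(text):
        # Try the longest unit first so contractions win over single letters.
        for length in (3, 2, 1):
            unit = text[i:i + length]
            if len(unit) < length:
                continue
            if unit in EXPANSIONS:
                units, weights = EXPANSIONS[unit]
                primaries.extend(PRIMARY[u] for u in units)
                tertiaries.extend(weights)
                break
            if unit in PRIMARY:
                primaries.append(PRIMARY[unit])
                tertiaries.append(0)
                break
        else:
            raise ValueError("character not covered by the toy model: " + text[i])
        i += len(unit)
    # Tuples compare primaries first; tertiaries only break primary ties.
    return tuple(primaries), tuple(tertiaries)


EXPECTED = ["c", "cy", "cs", "cscs", "ccs", "cscsy", "ccsy", "csy", "d"]
# Sorting a scrambled copy should reproduce the expected order.
print(sorted(reversed(EXPECTED), key=sort_key) == EXPECTED)
```

| <p>A check like this only confirms that the test data is internally consistent with the intended ordering; it does not replace testing the actual rules.</p> |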
| |
| |
| <p>Please test out any suggested rules before filing a bug, using Locale Explorer:</p> |
| |
| |
| <ol> |
| |
| <li>Go to the <a href="http://ibm.com/software/globalization/icu/demo/locales">ICU Locale Explorer</a></li> |
| |
| <li>Pick the appropriate locale</li> |
| |
| <li>Follow the instructions at the bottom to use your suggested rules on your suggested test data.</li> |
| |
| <li>Verify that the proper order results.</li> |
| |
| |
| </ol> |
| |
| |
| <h3>Pitfalls</h3> |
| |
| |
| <p>There are a number of pitfalls with collation, so be |
| careful. In some cases, such as Hungarian or Japanese, the rules can be |
| fairly complicated (of course, reflecting that the sorting sequence for |
| those languages is complicated).</p> |
| |
| |
| <ol> |
| |
| <li><b>Only tailor expected data. </b>We focus on the required collation sequence for a given language with normal data. So we don't include |
| full-width characters for a European collation sequence, such as |
| |
| <ul> |
| |
| <li>... CSCS <<< ＣＳＣＳ ...</li> |
| |
| <li>... CSCS <<< \uFF23\uFF33\uFF23\uFF33 ... (equivalently)</li> |
| |
| |
| </ul> |
| |
| </li> |
| |
| <li><b>Tailor trailing contractions. </b>If a sequence of characters is treated as a unit for collation, it should be entered as a contraction. |
| |
| <p>& c < ch</p> |
| |
| |
| <p>One might think that a sequence like "dz" doesn't |
| require that, since it would always come after "d" followed by any |
| other letter; it is a "trailing contraction". But in unusual cases, |
| that wouldn't be true; if "dz" is a unit sorted as if it were a |
| distinct letter after "d", one should get the ordering "dα" < "dz". This will only happen if "dz" is a contraction, such as</p> |
| |
| |
| <p><font size="3">& d < dz</font></p> |
| |
| </li> |
| |
| <li><b>Watch out for Expansions.</b> If you have a rule like &cs < d, and "cs" has not occurred in a previous rule as a contraction, then |
| this is automatically considered to be the same as &c < d / s; that is, the d <i>expands</i> as if it were a "cs" (actually, primary greater |
| than a "cs", since we wrote "<"). This expansion takes effect until the next primary difference. |
| |
| <p>So suppose that "ccs" is to behave as if it were |
| "cscs", and take case differences into account. You might try to do |
| this with the rules on the left:</p> |
| |
| |
| <table id="table3" border="1" cellpadding="4" cellspacing="0"> |
| |
| <tbody> |
| <tr> |
| |
| <th align="left" width="50%">Rules (Wrong)</th> |
| |
| <th align="left" width="50%">Actual Effect</th> |
| |
| </tr> |
| |
| <tr> |
| |
| <td width="50%">& C < cs <<< Cs <<< CS<br> |
| |
| & cscs <<< ccs<br> |
| |
| <<< Cscs <<< Ccs<br> |
| |
| <<< CSCS <<< CCS</td> |
| |
| <td width="50%">& C < cs <<< Cs <<< CS<br> |
| |
| & cs <<< ccs / cs<br> |
| |
| <<< Cscs / cs <<< Ccs / cs<br> |
| |
| <<< CSCS / cs <<< CCS / cs</td> |
| |
| </tr> |
| |
| |
| </tbody> |
| </table> |
| |
| |
| <p>But since <u>cscs</u> has not been made a contraction in previous rules, this produces an automatic expansion, one that continues |
| through the entire sequence of non-primary differences, as shown on the right. This is <i>not</i> what is wanted: each item acts like it |
| expands compared to the previous item. So CCS, for example, will act like it expands to CSCScs!</p> |
| |
| |
| <p>What you actually want is the following:</p> |
| |
| |
| <table id="table4" border="1" cellpadding="4" cellspacing="0"> |
| |
| <tbody> |
| <tr> |
| |
| <th align="left" width="50%">Rules (Right)</th> |
| |
| <th align="left" width="50%">Actual Effect</th> |
| |
| </tr> |
| |
| <tr> |
| |
| <td width="50%">& C < cs <<< Cs <<< CS<br> |
| |
| & cscs <<< ccs<br> |
| |
| & Cscs <<< Ccs<br> |
| |
| & CSCS <<< CCS</td> |
| |
| <td width="50%">& C < cs <<< Cs <<< CS<br> |
| |
| & cs <<< ccs / cs<br> |
| |
| & Cs <<< Ccs / cs<br> |
| |
| & CS <<< CCS / CS</td> |
| |
| </tr> |
| |
| |
| </tbody> |
| </table> |
| |
| |
| <p>In short, when you have expansions, it is always |
| safer and clearer to express them with separate resets. There are only |
| a few exceptions to this, notably when CJK characters are interleaved |
| with Hangul Syllables.</p> |
| |
| </li> |
| |
| <li><b>Don't tailor what you don't have to. </b>Example: Maltese was sorting character sequences <i>before</i> a base character using the |
| following style: |
| |
| <p>& B<br> |
| |
| < ċ<br> |
| |
| <<<Ċ<br> |
| |
| < c<br> |
| |
| <<<C</p> |
| |
| |
| <p>This works, but is sub-optimal for two reasons. </p> |
| |
| |
| <ol> |
| |
| <li>It tailors c/C when it doesn't need to; any extra tailoring generally makes for longer sort keys.</li> |
| |
| <li>By tailoring c/C, it moves the other characters that sort after b/B so that they sort after c/C instead. See |
| <a href="http://www.unicode.org/charts/collation/">http://www.unicode.org/charts/collation/</a> for examples.</li> |
| |
| |
| </ol> |
| |
| |
| <p>The correct rules should be:</p> |
| |
| |
| <p>& [before 1] c < ċ <<< Ċ</p> |
| |
| |
| <p>This finds the highest primary (that's what the 1 is |
| for) character less than c, and uses that as the reset point. For |
| Maltese, the same technique needs to be used for ġ and ż.</p> |
| |
| </li> |
| |
| <li>Contractions can be blocked with CGJ, as described in the Unicode Standard and in the |
| <a href="http://www.unicode.org/faq/char_combmark.html">Characters and Combining Marks FAQ</a>.</li> |
| |
| <li>Normally all combinations of case need to be supplied for contractions. That is, if <i>ch</i> |
| is a contraction, then you would have the rules ... ch <<< cH <<< Ch |
| <<< CH. The reason for this is so that all case variants sort at the |
| same primary level: thus lowercasing a string will not affect its |
| primary order. Cases such as <i>McHugh</i> are handled like other instances where contractions should be blocked.</li> |
| |
| |
| </ol> |
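| <p>The case variants of a contraction, as discussed in the last item above, can be listed mechanically. The following Python sketch is an illustration only, not part of CLDR or ICU: the helper name case_variant_rule is hypothetical, it assumes each character has a simple one-to-one upper/lower mapping, and it joins the variants with the tertiary operator by default so that they stay at the same primary level.</p> |

```python
from itertools import product


def case_variant_rule(contraction, operator="<<<"):
    """Chain every upper/lower-case variant of a contraction with `operator`."""
    # For each character, offer its lowercase form first, then its uppercase form.
    choices = [(ch.lower(), ch.upper()) for ch in contraction]
    # product() varies the last character fastest: ch, cH, Ch, CH.
    variants = ["".join(combo) for combo in product(*choices)]
    return f" {operator} ".join(variants)


print(case_variant_rule("ch"))  # ch <<< cH <<< Ch <<< CH
```

| <p>A contraction of n letters yields 2<sup>n</sup> variants, so the output for longer contractions is worth reviewing by hand before it goes into a bug report.</p> |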
| |
| |
| <h2><a name="Possible_Comparison_Sources">Possible Comparison Sources</a></h2> |
| |
| |
| <p>Sources and references may be standards, dictionaries, journal style guides (such as <i>The Economist Style Guide for English</i>), |
| and other available sources that provide guidance as to common |
| practice. Online sources are preferred where available, since they can |
| be more easily checked.</p> |
| |
| |
| <p>The goal is to follow common, customary practice. For |
| example, language or territory display names should use the most |
| recognizable name in common usage. This is generally not the official |
| name. For example, one would use "Switzerland" not "Swiss |
| Confederation".</p> |
| |
| |
| <p>Here are some possible resources for comparison of locale data. <i>This is <b>not</b> an endorsement of the sources, merely a collation of |
| possibly-useful links. </i>To suggest additions, |
| file a <a href="filing_bug_reports.html">Bug Report</a>.</p> |
| |
| |
| <h3>Territory names; Language names; Gregorian/non-Gregorian month names; Day names; Exemplar characters, and Collation</h3> |
| |
| |
| <ul> |
| |
| <li><a href="http://www.geonames.de/">http://www.geonames.de/</a></li> |
| |
| |
| </ul> |
| |
| |
| <h3><i>The Economist Style Guide</i> (unfortunately only hard copy): Currencies, Display Names, Formatting for English:</h3> |
| |
| |
| <ul> |
| |
| <li><a href="http://www.amazon.com/exec/obidos/tg/detail/-/186197535X">http://www.amazon.com/exec/obidos/tg/detail/-/186197535X</a> </li> |
| |
| |
| </ul> |
| |
| |
| <h3><a name="Exemplar_Characters">Exemplar Characters</a></h3> |
| |
| |
| <ul> |
| |
| <li><a href="http://www.eki.ee/letter/">http://www.eki.ee/letter/</a> </li> |
| |
| <li><a href="http://europa.eu.int/comm/eurostat/research/index.htm?http://europa.eu.int/en/comm/eurostat/research/isi/special/&1">http://europa.eu.int/comm/eurostat/research/index.htm</a></li> |
| |
| <li><a href="http://en.wikipedia.org/wiki/Alphabets_derived_from_the_Latin"><span>http://en.wikipedia.org/wiki/Alphabets_derived_from_the_Latin</span></a><span> |
| </span></li> |
| |
| <li><a href="http://www.omniglot.com/writing/alphabets.htm"> |
| http://www.omniglot.com/writing/alphabets.htm</a> </li> |
| |
| <li><a href="http://www.geonames.de/">http://www.geonames.de/</a></li> |
| |
| |
| </ul> |
| |
| |
| <h3>Territory Names</h3> |
| |
| |
| <ul> |
| |
| <li><a href="http://www.world-gazetteer.com/pronun.htm">http://www.world-gazetteer.com/pronun.htm</a></li> |
| |
| <li><a href="http://www.eki.ee/knn/lingid2.htm#WRLD">http://www.eki.ee/knn/lingid2.htm#WRLD</a> </li> |
| |
| <li><a href="http://www.p.lodz.pl/I35/personal/jw37/EUROPE/europe.html">http://www.p.lodz.pl/I35/personal/jw37/EUROPE/europe.html</a> |
| </li> |
| |
| |
| </ul> |
| |
| |
| <h3>Currency names; Territory names (Replace es with desired language code) </h3> |
| |
| |
| <ul> |
| |
| <li><a href="http://publications.eu.int/code/es/es-5000500.htm">http://publications.eu.int/code/es/es-5000500.htm</a> <br> |
| |
| <a href="http://publications.eu.int/code/es/es-5000700.htm">http://publications.eu.int/code/es/es-5000700.htm</a> <br> |
| |
| <a href="http://publications.eu.int/">http://publications.eu.int/</a> </li> |
| |
| |
| </ul> |
| |
| |
| <h3>Territory & Region names (Use the links at the top to switch languages)</h3> |
| |
| |
| <ul> |
| |
| <li><a href="http://www.worldlanguage.com/Arabic/Countries/">http://www.worldlanguage.com/Arabic/Countries/</a> </li> |
| |
| |
| </ul> |
| |
| |
| <h3>Exemplar/collation information</h3> |
| |
| |
| <ul> |
| |
| <li><a href="http://www.omniglot.com/writing/">http://www.omniglot.com/writing/</a><br> |
| |
| <a href="http://www.alphabets-world.com/">http://www.alphabets-world.com/</a> <br> |
| |
| <a href="http://developer.mimer.com/collations/charts/">http://developer.mimer.com/collations/charts/</a> </li> |
| |
| |
| </ul> |
| |
| |
| <h3>Simple Translations</h3> |
| |
| |
| <ul> |
| |
| <li><a href="http://world.altavista.com/">http://world.altavista.com/</a></li> |
| |
| <li><a href="http://www.google.com/language_tools">http://www.google.com/language_tools</a> </li> |
| |
| |
| </ul> |
| |
| |
| <h3>List of date/time formatting for Windows</h3> |
| |
| |
| <ul> |
| |
| <li><a href="http://www.microsoft.com/globaldev/nlsweb/">http://www.microsoft.com/globaldev/nlsweb/</a> </li> |
| |
| |
| </ul> |
| |
| |
| <h3>Exemplar Characters; Transliteration</h3> |
| |
| |
| <ul> |
| |
| <li><a href="http://www.eki.ee/wgrs/">UNGEGN: Working Group on Romanization Systems</a> </li> |
| |
| <li><a href="http://ee.www.ee/transliteration/">Transliteration of Non-Roman Alphabets and Scripts (Søren Binks)</a> </li> |
| |
| <li><a href="http://www.archivists.org/catalog/stds99/chapter8.html">Standards for Archival Description: Romanization</a> </li> |
| |
| <li><a href="http://ee.www.ee/transliteration/pdf/Hindi-Marathi-Nepali.pdf">ISO-15915 (Hindi)</a> </li> |
| |
| <li><a href="http://ee.www.ee/transliteration/pdf/Gujarati.pdf">ISO-15915 (Gujarati) </a></li> |
| |
| <li><a href="http://ee.www.ee/transliteration/pdf/Kannada.pdf">ISO-15915 (Kannada) </a></li> |
| |
| <li><a href="http://www.cdacindia.com/html/gist/down/iscii_d.asp">ISCII-91</a> </li> |
| |
| |
| </ul> |
| |
| |
| <h3>Geographical Names</h3> |
| |
| |
| <ul> |
| |
| <li><a href="http://unstats.un.org/unsd/geoinfo/">http://unstats.un.org/unsd/geoinfo/</a> </li> |
| |
| |
| </ul> |
| |
| |
| <h3><span>Currencies</span></h3> |
| |
| |
| <ul> |
| |
| <li><a href="http://www.globalfindata.com/gh/index.html"><span>http://www.globalfindata.com/gh/index.html</span></a><span> </span></li> |
| |
| |
| </ul> |
| |
| |
| <h3>General</h3> |
| |
| |
| <ul> |
| |
| <li><a href="http://www.cia.gov/cia/publications/factbook/">http://www.cia.gov/cia/publications/factbook/</a> </li> |
| |
| <li><a href="http://www.microsoft.com/mspress/books/5717.asp">http://www.microsoft.com/mspress/books/5717.asp</a>: a very complete set of information, |
| such as postal information, currency symbols, date/time formats, and calendars.</li> |
| |
| |
| </ul> |
| |
| |
| |
| </div> |
| |
| </td> |
| |
| </tr> |
| |
| <tr> |
| |
| <td class="contents" valign="top"> </td> |
| |
| </tr> |
| |
| |
| </tbody> |
| </table> |
| |
| |
| <hr width="50%"> |
| |
| <div align="center"> |
| |
| <center> |
| |
| <table border="0" cellpadding="0" cellspacing="0"> |
| |
| <tbody> |
| <tr> |
| |
| <td><a href="http://www.unicode.org/copyright.html"> |
| <img src="http://www.unicode.org/img/hb_notice.gif" alt="Access to Copyright and terms of use" border="0" height="50" width="216"></a></td> |
| |
| </tr> |
| |
| |
| </tbody> |
| </table> |
| |
| |
| <script language="Javascript" type="text/javascript" src="http://www.unicode.org/webscripts/lastModified.js"> |
| </script> |
| </center> |
| </div> |
| |
| </td> |
| |
| </tr> |
| |
| </tbody> |
| </table> |
| |
| |
| </body> |
| </html> |