| # README.txt for regex files used by org.unicode.cldr.icu.NewLdml2IcuConverter. |
| # |
| # Copyright © 2012-2014 Unicode, Inc. |
| # CLDR data files are interpreted according to the LDML specification (http://unicode.org/reports/tr35/) |
| # For terms of use, see http://www.unicode.org/copyright.html |
| |
| ====== |
| Basics |
| ====== |
| |
| NewLdml2IcuConverter maps CLDR XPaths to ICU Resource Bundle paths based on |
| regexes in special text files (ldml2icu.txt for locale data and |
| ldml2icu_supplemental.txt for supplemental data). A Resource Bundle path is |
| composed of the ordered list of labels leading to a value, e.g. the |
| following ICU text file would have the paths /a/b/c and /a/b/d: |
| |
| rootLabel{ |
| a{ |
| b{ |
| c{ |
| "x", |
| "y" |
| } |
| d{3} |
| } |
| } |
| } |
| |
| Root labels are not included in Resource Bundle paths for brevity. |
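
As an illustration only, the Python sketch below walks a nested dict standing
in for the bundle above and lists the path to each value. The dict layout and
the name resource_paths are assumptions made for this example; they are not
part of the converter.

```python
def resource_paths(tree, prefix=""):
    """Yield the path to every leaf value in a nested dict of labels."""
    for label, child in tree.items():
        path = f"{prefix}/{label}"
        if isinstance(child, dict):
            # Intermediate label: keep descending.
            yield from resource_paths(child, path)
        else:
            # Leaf value (string array, integer, ...): the path ends here.
            yield path


# Mirrors the example bundle; the root label is dropped before walking.
bundle = {"rootLabel": {"a": {"b": {"c": ["x", "y"], "d": 3}}}}
paths = list(resource_paths(bundle["rootLabel"]))
print(paths)  # ['/a/b/c', '/a/b/d']
```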
| |
| The format of a line in the regex file is: |
| ldmlRegex ; icuPath (; specialInstructions)* |
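
A minimal Python sketch of reading one such line is shown below. It assumes
that neither the regex nor the instructions contain a literal semicolon; the
function name parse_rule is hypothetical and not taken from the converter.

```python
def parse_rule(line):
    """Split 'ldmlRegex ; icuPath (; specialInstructions)*' into its parts."""
    fields = [field.strip() for field in line.split(";")]
    if len(fields) < 2:
        raise ValueError(f"expected at least 'regex ; path', got: {line!r}")
    # First field is the regex, second the ICU path, the rest are instructions.
    regex, icu_path, *instructions = fields
    return regex, icu_path, instructions


parsed = parse_rule("//ldml/(.*)/b ; /x/$1 ; values=$1")
print(parsed)  # ('//ldml/(.*)/b', '/x/$1', ['values=$1'])
```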
| |
| All regexes in a file were originally assumed to be mutually exclusive and |
| unique, so that order was not important and lines could be grouped logically. |
| NOTE: This assumption no longer holds as of CLDR 26. |
| |
| If an LDML xpath results in two or more ICU paths, they should be split into |
| multiple lines, like this: |
| ldmlRegex |
| ; rbPath1 (; specialInstructions)* |
| ; rbPath2 (; specialInstructions)* |
| ... |
| |
| rbPath (the icuPath field above) is an ICU path replacement expression that is |
| filled out with arguments from ldmlRegex, e.g. given the line: |
| //ldml/(.*)/b ; /x/$1 |
| matching //ldml/a/b would result in the ICU path /x/a. |
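
The substitution can be sketched in Python as follows. This uses Python's re
module rather than the Java regex engine the converter runs on, and the name
map_xpath is an assumption for this example only.

```python
import re


def map_xpath(ldml_regex, rb_path, xpath):
    """Return rb_path with $N replaced by capture group N, or None if no match."""
    match = re.fullmatch(ldml_regex, xpath)
    if match is None:
        return None
    # Replace each $N in the path expression with the N-th captured argument.
    return re.sub(r"\$(\d+)", lambda m: match.group(int(m.group(1))), rb_path)


result = map_xpath("//ldml/(.*)/b", "/x/$1", "//ldml/a/b")
print(result)  # /x/a
```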
| |
| Each regex pattern must match an xpath completely to have a successful match. |
| Square brackets that are part of the xpath will be preceded by a backslash |
| during processing, so there's no need to do it here. |
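
The sketch below shows both rules in Python: brackets written literally in a
rule are escaped before compiling, and a match only counts if it covers the
whole xpath. The rule and xpaths are made-up examples, and escape_brackets is
a hypothetical helper, not the converter's own code.

```python
import re


def escape_brackets(pattern):
    """Backslash-escape every '[' and ']' so they match literally."""
    return pattern.replace("[", r"\[").replace("]", r"\]")


# Hypothetical rule, written with unescaped brackets as in the regex file.
rule = '//ldml/a[@type="(.*)"]/b'
compiled = re.compile(escape_brackets(rule))

full = compiled.fullmatch('//ldml/a[@type="x"]/b')
partial = compiled.fullmatch('//ldml/a[@type="x"]/b/c')

print(full.group(1))  # x
print(partial)        # None, because the trailing /c is not covered by the regex
```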
| |
| When an xpath is matched, the value of the mapped ICU path will be the |
| unsplit value in the CLDRFile. This can be overridden by using the special |
| instruction "values=". |
| |
| All values written to ICU files are treated as strings and enclosed with quotes |
| in the output by default. To force values to be written without quotes, the |
| corresponding RB path should have :int (for single values) or :intvector |
| (for arrays) appended to it, e.g. /contextTransforms/$1:intvector. |
| |
| There may be zero or more semicolon-delimited specialInstructions per line. See |
| the section "Special Instructions" for details. |
| |
| |
| ========= |
| Variables |
| ========= |
| |
| Variables are substituted into all the regexes in a file and should be declared |
| at the top of the file. The value of the variable can either be a regex or a |
| CLDR XPath that is present in the file being converted. |
| All variable names should consist of '%' followed by a single character. |
| Any regex being used that contains square brackets should be made a variable, |
| since the converter will automatically escape all brackets that occur in the |
| XPath regex unless they are represented by variables. |
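
One way to picture why bracketed regexes belong in variables is the Python
sketch below: literal brackets are escaped first, and variables are expanded
afterwards, so brackets inside a variable's value keep their regex meaning.
The ordering, the variable %A with its value, and the name expand_rule are
all assumptions for illustration.

```python
import re


def expand_rule(rule_regex, variables):
    """Escape literal brackets, then substitute variables into the regex."""
    escaped = rule_regex.replace("[", r"\[").replace("]", r"\]")
    for name, value in variables.items():
        # Values are inserted after escaping, so their brackets stay intact.
        escaped = escaped.replace(name, value)
    return escaped


variables = {"%A": '[^"]+'}  # hypothetical variable: any run of non-quote chars
pattern = expand_rule('//ldml/a[@type="(%A)"]/b', variables)
match = re.fullmatch(pattern, '//ldml/a[@type="gregorian"]/b')

print(pattern)         # //ldml/a\[@type="([^"]+)"\]/b
print(match.group(1))  # gregorian
```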
| |
| |
| ==================== |
| Special Instructions |
| ==================== |
| |
| There may be zero or more semicolon-separated special instructions per line, each |
| of which is performed whenever the regex in that line is matched. |
| Each instruction may be one of the following: |
| |
| * values=<space-delimited-values> |
| The values instruction explicitly specifies the values to be mapped to the |
| generated rbPath instead of using the default values in CLDR. |
| |
| <space-delimited-values> may contain hardcoded values, replacement arguments |
| from the xpath regex, or the keyword {value} which represents the value of the |
| xpath in the CLDR data file. |
| |
| * fallback=<space-delimited-values> |
| This instruction specifies that if the generated rbPath does not contain a value |
| from an xpath that matches the current regex, the specified fallback values |
| will be added to the rbPath instead. |
| |
| <space-delimited-values> may contain hardcoded values, replacement arguments |
| from the xpath regex, or the XPath of another value in the CLDR file currently |
| being converted. |
| |
| * group (no arguments) |
| If present, this instruction indicates that the values created using this xpath |
| should be grouped together in their own sub-array in the ICU data file. This |
| instruction should be used instead of hidden labels if the array is in an array |
| of values, e.g. the example below is generated by putting "EEEE, d בMMMM y" and |
| "hebr" in the same group: |
| DateTimePatterns { |
| "HH:mm", |
| { |
| "EEEE, d בMMMM y", |
| "hebr", |
| } |
| ... |
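
As a rough picture of the fallback instruction described above, the Python
sketch below treats the converted data as a dict from rbPath to a list of
values and adds the fallback only when nothing was mapped to that path. The
dict representation and the name apply_fallback are assumptions; the
converter's internal data structures may differ.

```python
def apply_fallback(bundle, rb_path, fallback_values):
    """Add fallback_values under rb_path only when no value was mapped there."""
    if not bundle.get(rb_path):
        bundle[rb_path] = list(fallback_values)
    return bundle


bundle = {"/x/a": ["1"]}
apply_fallback(bundle, "/x/a", ["0"])  # already has a value: left unchanged
apply_fallback(bundle, "/x/b", ["0"])  # no value mapped: fallback is added
print(bundle)  # {'/x/a': ['1'], '/x/b': ['0']}
```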
| |
| |
| ========= |
| Functions |
| ========= |
| |
| Values sometimes need some further processing to be suitable for ICU. In such |
| cases, functions can be written to process these values and their use is |
| specified in the regex file like this: |
| values=&date($1) |
| |
| This helps to put the special-case code in one place to minimize hackiness. |
| |
| Currently available functions: |
| |
| * &date(<arg>) |
| Converts a formatted date to a pair of integers. |
| |
| * &algorithm(<arg>) |
| Converts aliases for algorithmic numbering system descriptions to an |
| ICU-compatible format. |
| |
| |
| ================ |
| Space Delimiting |
| ================ |
| |
| All matched arguments from xpaths in the rbPath or specialInstructions will be |
| split by whitespace, e.g. if the special instruction values=$1 is given an |
| argument of "New York", it will be split into "New" and "York" in the output. |
| In the case of rbPaths, multiple ICU paths will be created from the split argument, |
| each containing one part of the argument. |
| |
| To prevent splitting, enclose the corresponding argument in the replacement |
| string with quotes, e.g. "$1" or "1999-01-01 9:00" will not be split. |
| |
| This trick can be used to split default values; the instruction values={value} |
| will cause the CLDR value to be split by whitespace, while values="{value}" has |
| the same effect as not including the instruction at all. |
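
The splitting rules can be sketched in Python as below. The sketch fills in
$N arguments and {value}, then splits on whitespace while keeping
double-quoted runs together. It assumes arguments contain no double quotes
themselves; fill_and_split is a hypothetical name, not the converter's API.

```python
import re

# A token is either a double-quoted run or a run of non-whitespace characters.
TOKEN = re.compile(r'"([^"]*)"|(\S+)')


def fill_and_split(template, args, cldr_value=""):
    """Substitute $N and {value} into template, then split on whitespace."""
    def replace(m):
        # group(1) is the digit string of $N; it is None for {value}.
        return cldr_value if m.group(1) is None else args[int(m.group(1)) - 1]

    filled = re.sub(r"\$(\d+)|\{value\}", replace, template)
    return [bare or quoted for quoted, bare in TOKEN.findall(filled)]


print(fill_and_split("$1", ["New York"]))           # ['New', 'York']
print(fill_and_split('"$1"', ["New York"]))         # ['New York']
print(fill_and_split("{value}", [], "a b"))         # ['a', 'b']
print(fill_and_split('"{value}"', [], "a b"))       # ['a b']
```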
| |
| |
| ============= |
| Hidden Labels |
| ============= |
| |
| Some ICU paths will contain labels which are enclosed in angle brackets, |
| e.g. /telephoneCodeData/$1/<$2>/code. They are processed and sorted like any |
| other ICU path, except that labels enclosed in angle brackets will not be |
| written to the ICU file. This is used to create arrays without labels where |
| necessary, e.g. the RB path mentioned earlier would result in the following |
| output: |
| telephoneCodeData{ |
| 001{ |
| { |
| code{"388"} |
| } |
| { |
| code{"800"} |
| } |
| ... |
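
A small Python sketch of the idea: hidden labels still occupy a nesting level
in the path, but are written out with an empty name. The helper name
written_labels and the use of an empty string for an unlabeled level are
assumptions for this example.

```python
import re

# A hidden label is a whole path segment enclosed in angle brackets.
HIDDEN = re.compile(r"<[^<>]*>")


def written_labels(rb_path):
    """Return the label written at each level; hidden labels become ''."""
    labels = rb_path.strip("/").split("/")
    return ["" if HIDDEN.fullmatch(label) else label for label in labels]


levels = written_labels("/telephoneCodeData/001/<0>/code")
print(levels)  # ['telephoneCodeData', '001', '', 'code']
```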
| |
| <FIFO> is a special hidden label for ldml2icu_supplemental.txt only: using this |
| label will cause the values from a single xpath to be grouped together in the |
| same array, and the arrays will be written in the same order that they were read |
| from the CLDR file. |