# README.txt for regex files used by org.unicode.cldr.icu.NewLdml2IcuConverter.
#
# Copyright © 2012-2014 Unicode, Inc.
# CLDR data files are interpreted according to the LDML specification (http://unicode.org/reports/tr35/)
# For terms of use, see http://www.unicode.org/copyright.html
======
Basics
======
NewLdml2IcuConverter maps CLDR XPaths to ICU Resource Bundle paths based on
regexes in special text files (ldml2icu.txt for locale data and
ldml2icu_supplemental.txt for supplemental data). A Resource Bundle path
consists of the ordered list of labels that lead to a value, e.g. the
following ICU text file would have the paths /a/b/c and /a/b/d:
rootLabel{
    a{
        b{
            c{
                "x",
                "y"
            }
            d{3}
        }
    }
}
Root labels are not included in Resource Bundle paths for brevity.
The format of a line in the regex file is:
ldmlRegex ; icuPath (; specialInstructions)*
All regexes in a file are assumed to be mutually exclusive and unique, so order
is not important and the lines can be grouped logically.
NOTE: The above is no longer true in CLDR 26.
If an LDML xpath results in two or more ICU paths, they should be split into
multiple lines, like this:
ldmlRegex
    ; rbPath1 (; specialInstructions)*
    ; rbPath2 (; specialInstructions)*
    ...
rbPath is an ICU path replacement expression that is filled out with arguments
from ldmlRegex, e.g. given the line:
//ldml/(.*)/b ; /x/$1
matching //ldml/a/b would result in the ICU path /x/a.
Each regex pattern must match an xpath completely to have a successful match.
Square brackets that are part of the xpath are escaped with a backslash
automatically during processing, so there is no need to escape them in the
regex file.
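
The matching and replacement rules above can be sketched as follows. This is
an illustration in Python, not the converter's own Java code, and map_xpath is
a hypothetical helper name. It assumes that $N in the rbPath refers to the N-th
capture group of ldmlRegex, and it leaves out bracket escaping and variables.

```python
import re

def map_xpath(ldml_regex, rb_path, xpath):
    """Return the ICU path for xpath, or None if the regex does not match.

    Hypothetical helper for illustration only.
    """
    # fullmatch: the pattern must cover the entire xpath, not just a prefix.
    match = re.fullmatch(ldml_regex, xpath)
    if match is None:
        return None
    # Replace each $N in the rbPath with the N-th captured group.
    return re.sub(r"\$(\d+)", lambda m: match.group(int(m.group(1))), rb_path)

print(map_xpath("//ldml/(.*)/b", "/x/$1", "//ldml/a/b"))    # /x/a
print(map_xpath("//ldml/(.*)/b", "/x/$1", "//ldml/a/b/c"))  # None
```

The second call returns None because the regex covers only part of the xpath.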
When an xpath is matched, the value of the mapped ICU path will be the
unsplit value in the CLDRFile. This can be overridden by using the special
instruction "values=".
All values written to ICU files are treated as strings and enclosed with quotes
in the output by default. To force values to be written without quotes, the
corresponding RB path should have :int (for single values) or :intvector
(for arrays) appended to it, e.g. /contextTransforms/$1:intvector.
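
A rough sketch of the quoting rule, in Python rather than the converter's Java.
render_leaf is a hypothetical helper, and the exact spacing of the real output
may differ; only the quoted versus unquoted distinction is modelled.

```python
def render_leaf(label, values):
    """Render one label and its values as a single line of ICU text.

    Hypothetical helper: label is the last element of the rbPath and may
    carry a type suffix, e.g. "from:intvector".
    """
    _, _, suffix = label.partition(":")
    if suffix in ("int", "intvector"):
        body = ", ".join(values)                      # written without quotes
    else:
        body = ", ".join('"%s"' % v for v in values)  # default: quoted strings
    return "%s{%s}" % (label, body)

print(render_leaf("c", ["x", "y"]))
print(render_leaf("n:intvector", ["1", "2"]))
```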
There may be zero or more semicolon-delimited specialInstructions per line. See
the section "Special Instructions" for details.
=========
Variables
=========
Variables are substituted into all the regexes in a file and should be declared
at the top of the file. The value of the variable can either be a regex or a
CLDR XPath that is present in the file being converted.
All variable names should consist of '%' followed by a single character,
e.g. %A.
Any regex being used that contains square brackets should be made a variable,
since the converter will automatically escape all brackets that occur in the
XPath regex unless they are represented by variables.
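
Why bracketed regexes belong in variables can be seen in the sketch below
(Python, for illustration only). The helper name expand, the variable %A and
its value are assumptions; the point is only that escaping happens before
variables are substituted, so brackets inside a variable's value keep their
regex meaning.

```python
import re

def expand(regex_line, variables):
    """Escape literal brackets in an xpath regex, then substitute variables.

    Hypothetical helper for illustration only.
    """
    escaped = regex_line.replace("[", r"\[").replace("]", r"\]")
    for name, value in variables.items():
        escaped = escaped.replace(name, value)
    return escaped

# Assumed example variable: any run of characters other than a double quote.
variables = {"%A": '[^"]+'}
pattern = expand('//ldml/a[@type="(%A)"]/b', variables)
print(pattern)
print(re.fullmatch(pattern, '//ldml/a[@type="x y"]/b').group(1))
```

Written inline, the character class [^"]+ would have been escaped along with
the xpath's own brackets and would no longer act as a character class.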
====================
Special Instructions
====================
There may be zero or more semicolon-separated special instructions per line,
each of which is performed whenever the regex in that line is matched.
Each instruction may be one of the following:
* values=<space-delimited-values>
The values instruction explicitly specifies the values to be mapped to the
generated rbPath instead of using the default values in CLDR.
<space-delimited-values> may contain hardcoded values, replacement arguments
from the xpath regex, or the keyword {value} which represents the value of the
xpath in the CLDR data file.
* fallback=<space-delimited-values>
This instruction specifies that if the generated rbPath does not contain a value
from an xpath that matches the current regex, the specified fallback values
will be added to the rbPath instead.
<space-delimited-values> may contain hardcoded values, replacement arguments
from the xpath regex, or the XPath of another value in the CLDR file currently
being converted.
* group (no arguments)
If present, this instruction indicates that the values created using this xpath
should be grouped together in their own sub-array in the ICU data file. This
instruction should be used instead of hidden labels if the array is in an array
of values, e.g. the example below is generated by putting "EEEE, d בMMMM y" and
"hebr" in the same group:
DateTimePatterns {
    "HH:mm",
    {
        "EEEE, d בMMMM y",
        "hebr",
    }
    ...
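
How a values= template from this section might be filled in can be sketched
as below (Python, for illustration only). expand_values is a hypothetical
helper; it assumes $N refers to the N-th capture group of the xpath regex and
does not model the whitespace splitting that happens afterwards.

```python
import re

def expand_values(template, groups, cldr_value):
    """Fill a values= template with regex groups and the CLDR value.

    Hypothetical helper for illustration only.
    """
    def repl(m):
        if m.group(1) is None:                # matched the {value} keyword
            return cldr_value
        return groups[int(m.group(1)) - 1]    # matched $N
    # One pass, so text inserted for {value} or $N is never expanded again.
    return re.sub(r"\$(\d+)|\{value\}", repl, template)

print(expand_values("$1 {value} fixed", ["a"], "v"))  # a v fixed
```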
=========
Functions
=========
Values sometimes need some further processing to be suitable for ICU. In such
cases, functions can be written to process these values and their use is
specified in the regex file like this:
values=&date($1)
This helps to put the special-case code in one place to minimize hackiness.
Currently available functions:
* &date(<arg>)
Converts a formatted date to a pair of integers.
* &algorithm(<arg>)
Converts aliases for algorithmic numbering system descriptions to an
ICU-compatible format.
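
This file does not say how &date encodes its pair of integers. The sketch
below (Python, for illustration only) shows one plausible encoding under
stated assumptions: the input is a yyyy-MM-dd date, and the pair is the high
and low 32-bit halves of the 64-bit count of milliseconds since 1970-01-01
UTC. date_to_int_pair is a hypothetical name and is not the converter's code.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def date_to_int_pair(text):
    """Split a yyyy-MM-dd date into (high, low) signed 32-bit integers.

    Assumed encoding, for illustration only: halves of a 64-bit millisecond
    timestamp relative to 1970-01-01 UTC.
    """
    day = datetime.strptime(text, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    delta = day - EPOCH
    millis = delta.days * 86_400_000 + delta.seconds * 1000
    high = millis >> 32          # arithmetic shift keeps the sign
    low = millis & 0xFFFFFFFF    # unsigned low 32 bits
    if low >= 0x80000000:        # reinterpret as a signed 32-bit value
        low -= 0x100000000
    return high, low

print(date_to_int_pair("1999-01-01"))
```

Values of this kind would be written with :intvector so that they are not
quoted in the output.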
================
Space Delimiting
================
All matched arguments from xpaths in the rbPath or specialInstructions will be
split by whitespace, e.g. if the special instruction values=$1 is given an
argument of "New York", it will be split into "New" and "York" in the output.
In the case of rbPaths, multiple ICU paths will be created from the split argument,
each containing one part of the argument.
To prevent splitting, enclose the corresponding argument in the replacement
string in quotes, e.g. "$1" or "1999-01-01 9:00" will not be split.
This trick can be used to split default values; the instruction values={value}
will cause the CLDR value to be split by whitespace, while values="{value}" has
the same effect as not including the instruction at all.
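
The splitting rule can be sketched as follows (Python, for illustration
only). split_values is a hypothetical helper, and dropping the quotes from
the result is an assumption about the output.

```python
import re

# Either a double-quoted run (kept together) or a run of non-whitespace.
_TOKEN = re.compile(r'"([^"]*)"|(\S+)')

def split_values(text):
    """Split on whitespace, keeping double-quoted runs together.

    Hypothetical helper for illustration only.
    """
    parts = []
    for m in _TOKEN.finditer(text):
        quoted = m.group(1)
        parts.append(quoted if quoted is not None else m.group(2))
    return parts

print(split_values('New York'))            # ['New', 'York']
print(split_values('"New York"'))          # ['New York']
print(split_values('"1999-01-01 9:00" x')) # ['1999-01-01 9:00', 'x']
```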
=============
Hidden Labels
=============
Some ICU paths will contain labels which are enclosed in angle brackets,
e.g. /telephoneCodeData/$1/<$2>/code. They are processed and sorted like any
other ICU path, except that labels enclosed in angle brackets will not be
written to the ICU file. This is used to create arrays without labels where
necessary, e.g. the RB path mentioned earlier would result in the following
output:
telephoneCodeData{
    001{
        {
            code{"388"}
        }
        {
            code{"800"}
        }
        ...
<FIFO> is a special hidden label for ldml2icu_supplemental.txt only: using this
label will cause the values from a single xpath to be grouped together in the
same array, and the arrays will be written in the same order that they were read
from the CLDR file.
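
The effect of ordinary hidden labels (not <FIFO>) can be sketched as below
(Python, for illustration only). build_tree and render are hypothetical
helpers, and plain string sorting of labels is an assumption; the real
converter's ordering rules may differ.

```python
def build_tree(entries):
    """Group (rbPath, value) pairs into nested dicts keyed by label."""
    root = {}
    for path, value in entries:
        labels = path.strip("/").split("/")
        node = root
        for label in labels[:-1]:
            node = node.setdefault(label, {})
        node.setdefault(labels[-1], []).append(value)
    return root

def render(node, indent=0):
    """Return ICU-style text lines, omitting labels in angle brackets."""
    lines = []
    pad = "    " * indent
    for label in sorted(node):
        child = node[label]
        hidden = label.startswith("<") and label.endswith(">")
        shown = "" if hidden else label   # hidden labels sort but are not written
        if isinstance(child, dict):
            lines.append(pad + shown + "{")
            lines.extend(render(child, indent + 1))
            lines.append(pad + "}")
        else:
            quoted = ", ".join('"%s"' % v for v in child)
            lines.append(pad + shown + "{" + quoted + "}")
    return lines

entries = [
    ("/telephoneCodeData/001/<388>/code", "388"),
    ("/telephoneCodeData/001/<800>/code", "800"),
]
print("\n".join(render(build_tree(entries))))
```

The two hidden labels keep the entries apart while the tree is built, yet
each is written as an unlabeled block.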