== Notes on {kddi,docomo,softbank}-*.ucm mappings.
kddi-jisx-208 and softbank-jisx-208 are variants of JIS X 0208 used by
KDDI and SoftBank, two Japanese cell phone carriers.
kddi-shift_jis, docomo-shift_jis, and softbank-shift_jis are variants
of Shift_JIS used by KDDI, DoCoMo and SoftBank.
- kddi-jisx-208 contains Emoji (emoticon) code points in
0x75xx, 0x76xx, 0x77xx, 0x78xx, 0x79xx, 0x7Axx, 0x7Bxx,
where xx means 21-7E.
- kddi-shift_jis contains Emoji code points in
0xEBxx, 0xECxx, 0xEDxx, 0xEExx, 0xF3xx, 0xF4xx, 0xF6xx, and 0xF7xx,
where xx means 40-7E, 80-FC.
- docomo-shift_jis contains Emoji code points in
0xF8xx and 0xF9xx, where xx means 40-7E, 80-FC.
- softbank-shift_jis contains Emoji code points in
0xF7xx, 0xF9xx, and 0xFBxx, where xx means 40-7E, 80-FC.
- softbank-jisx-208 contains Emoji code points in
0x75xx, 0x76xx, 0x77xx, 0x78xx, 0x79xx, 0x7Axx, 0x7Bxx, and 0x7Dxx,
where xx means 21-7E.
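
The ranges above can be written down as a small lookup, for example to
check whether a two-byte code falls into a carrier's Emoji area. This is
only a sketch: EMOJI_RANGES and is_emoji_code are hypothetical names, not
part of ICU or emoji4unicode, and a code inside an Emoji area is not
necessarily assigned in the actual .ucm table.

```python
# Trail-byte ranges ("xx") as stated in the list above.
SJIS_TRAIL = [(0x40, 0x7E), (0x80, 0xFC)]
JIS_TRAIL = [(0x21, 0x7E)]

# Lead bytes of the Emoji areas, transcribed from the list above.
EMOJI_RANGES = {
    "kddi-jisx-208": ({0x75, 0x76, 0x77, 0x78, 0x79, 0x7A, 0x7B}, JIS_TRAIL),
    "kddi-shift_jis": ({0xEB, 0xEC, 0xED, 0xEE, 0xF3, 0xF4, 0xF6, 0xF7}, SJIS_TRAIL),
    "docomo-shift_jis": ({0xF8, 0xF9}, SJIS_TRAIL),
    "softbank-shift_jis": ({0xF7, 0xF9, 0xFB}, SJIS_TRAIL),
    "softbank-jisx-208": ({0x75, 0x76, 0x77, 0x78, 0x79, 0x7A, 0x7B, 0x7D}, JIS_TRAIL),
}


def is_emoji_code(charset, code):
    """Return True if the two-byte code lies in the charset's Emoji area."""
    lead_bytes, trail_ranges = EMOJI_RANGES[charset]
    lead, trail = code >> 8, code & 0xFF
    return lead in lead_bytes and any(lo <= trail <= hi for lo, hi in trail_ranges)


if __name__ == "__main__":
    print(is_emoji_code("docomo-shift_jis", 0xF89F))   # lead F8, trail 9F
    print(is_emoji_code("docomo-shift_jis", 0xF87F))   # trail 7F is outside 40-7E, 80-FC
```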
== How the -2012.ucm tables were modified in April 2013
The updated -2012 versions were generated by
http://code.google.com/p/emoji4unicode/source/browse/trunk/src/gen_conversion_files.py
using each of the older (September 2012) -2012 versions as the base table file
to avoid non-Emoji changes:
# gen_google_ucm.sh
icu_mappings=/google/src/cloud/mscherer/icubranch/google_vendor_src_branch/icu/source/data/mappings
dest=/home/mscherer/www/no_crawl/emoji
./gen_conversion_files.py $icu_mappings/docomo-shift_jis-2012.ucm
cp ../generated/docomo-shift_jis-2012.ucm $dest
./gen_conversion_files.py $icu_mappings/kddi-shift_jis-2012.ucm
cp ../generated/kddi-shift_jis-2012.ucm $dest
./gen_conversion_files.py $icu_mappings/softbank-shift_jis-2012.ucm
cp ../generated/softbank-shift_jis-2012.ucm $dest
./gen_conversion_files.py
The only differences from the September 2012 versions are in mappings for
symbols that have Unicode Variation Selector (VS) sequences.
The older tables relied on a hack in the ICU conversion code that
ignored the "use fallback" flag for fallbacks from sequences with VS.
The new tables rely on a new feature in ICU4C 51:
For the relevant symbols that have roundtrip mappings,
- the mappings with Emoji Variation Selector
use the |0 roundtrip precision
- the other mappings (no VS & text VS)
use the |4 "good one-way" precision
See http://bugs.icu-project.org/trac/ticket/9602
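
The rule above can be sketched as a small function. This is an
illustration only, not the code in gen_conversion_files.py; precision_for
is a hypothetical name, and the function assumes it is called only for
symbols that have a roundtrip mapping and VS sequences.

```python
EMOJI_VS = 0xFE0F  # VARIATION SELECTOR-16, requests emoji presentation
TEXT_VS = 0xFE0E   # VARIATION SELECTOR-15, requests text presentation


def precision_for(code_points):
    """Pick the .ucm precision flag for one Unicode sequence of a symbol.

    code_points is the sequence on the Unicode side of the mapping:
    the bare symbol, symbol + text VS, or symbol + emoji VS.
    """
    if code_points[-1] == EMOJI_VS:
        return 0  # |0 roundtrip
    return 4      # |4 "good one-way" (no VS, or text VS)


if __name__ == "__main__":
    # U+2600 is used here as an example of a symbol with VS sequences.
    for seq in ([0x2600, EMOJI_VS], [0x2600, TEXT_VS], [0x2600]):
        print(" ".join("U+%04X" % cp for cp in seq), "->", "|%d" % precision_for(seq))
```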
== How the -2012.ucm tables were created in September 2012
The 2012 versions were created by
http://code.google.com/p/emoji4unicode/source/browse/trunk/src/gen_conversion_files.py
using each of the 2007 versions as the base table files
to avoid non-Emoji changes:
icu_mappings=~/p4/emoji/google_vendor_src_branch/icu/source/data/mappings
dest=~/www/no_crawl/emoji
./gen_conversion_files.py $icu_mappings/docomo-shift_jis-2007.ucm
cp ../generated/docomo-shift_jis-2012.ucm $dest
./gen_conversion_files.py $icu_mappings/kddi-shift_jis-2007.ucm
cp ../generated/kddi-shift_jis-2012.ucm $dest
./gen_conversion_files.py $icu_mappings/softbank-shift_jis-2007.ucm
cp ../generated/softbank-shift_jis-2012.ucm $dest
./gen_conversion_files.py
The emoji4unicode code uses the mappings that were established during the
Unicode Emoji standardization process.
The new conversion tables round-trip carrier Emoji symbol codes
to and from Unicode 6 standard code points
and also include fallback mappings from the Google PUA code points
to the carrier codes.
The trailing "|0" etc. on the mapping table lines specify the mapping type:
|0 round-trip Unicode <-> charset
|1 fallback Unicode -> charset
|3 "reverse fallback" Unicode <- charset
For details about the .ucm file format see
http://userguide.icu-project.org/conversion/data#TOC-.ucm-File-Format
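
A minimal sketch of reading one mapping line and its type flag follows.
It is not the ICU parser: parse_mapping_line is a hypothetical helper, it
assumes lines of the form "<Uxxxx> \xNN\xNN |n" (with an optional "+"
between code points), and it returns None for anything else, including
lines without an explicit flag. The sample line is illustrative.

```python
import re

# Mapping types as listed above.
MAPPING_TYPES = {
    0: "round-trip Unicode <-> charset",
    1: "fallback Unicode -> charset",
    3: "reverse fallback Unicode <- charset",
}

_LINE = re.compile(
    r"^((?:<U[0-9A-Fa-f]{4,6}>\+?)+)"   # one or more <Uxxxx> code points
    r"\s+((?:\\x[0-9A-Fa-f]{2})+)"      # one or more \xNN charset bytes
    r"\s*\|(\d)"                        # the trailing |n flag
)


def parse_mapping_line(line):
    """Return (code_points, charset_bytes, flag), or None if not a mapping line."""
    m = _LINE.match(line.strip())
    if m is None:
        return None
    code_points = [int(h, 16) for h in re.findall(r"<U([0-9A-Fa-f]{4,6})>", m.group(1))]
    charset_bytes = bytes(int(h, 16) for h in re.findall(r"\\x([0-9A-Fa-f]{2})", m.group(2)))
    return code_points, charset_bytes, int(m.group(3))


if __name__ == "__main__":
    parsed = parse_mapping_line(r"<U2460> \x87\x40 |0")
    if parsed is not None:
        code_points, charset_bytes, flag = parsed
        print(code_points, charset_bytes.hex(), MAPPING_TYPES.get(flag, "other"))
```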
== How the -2007.ucm tables were created
So far, we haven't obtained "official" conversion tables from the cell
phone carriers. However, we empirically know their clients support the
vendor-defined characters (VDCs) in MS932, such as U+2460 (CIRCLED DIGIT
ONE). Hence we use MS932 as the base table for them.
kddi-jisx-208-2007.ucm is based on jisx-208.ucm in this directory.
The original table's mappings to codes 0x75xx to 0x7Bxx are excluded
to avoid collisions with emoji.
kddi-shift_jis-2007.ucm is based on windows-932-2000.ucm.
The original table's mappings to codes 0xEBxx to 0xEExx, and 0xF0xx to
0xF9xx (EUDC block), are excluded to avoid collisions with emoji.
docomo-shift_jis-2007.ucm is based on windows-932-2000.ucm.
The original table's mappings to codes 0xF0xx to 0xF9xx (EUDC block)
are excluded to avoid collisions with emoji.
softbank-shift_jis-2007.ucm is based on windows-932-2000.ucm.
The original table's mappings to codes 0xF0xx to 0xF9xx (EUDC block),
and 0xFBxx, are excluded to avoid collisions with emoji.
softbank-jisx-208-2007.ucm is based on jisx-208.ucm in this directory.
The original table's mappings to codes 0x75xx to 0x7Bxx, and 0x7Dxx
are excluded to avoid collisions with emoji.
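
The exclusion step described in this section can be sketched as a filter
over the base table. This is an assumption-laden illustration, not the
procedure actually used: EXCLUDED_LEAD_BYTES and strip_emoji_area are
hypothetical names, mappings are modeled as (code point, bytes) pairs,
and the upper bound of the EUDC block is read as lead byte 0xF9.

```python
# Lead bytes whose mappings are dropped from each base table,
# transcribed from the descriptions above.
EXCLUDED_LEAD_BYTES = {
    "kddi-jisx-208": set(range(0x75, 0x7C)),
    "kddi-shift_jis": set(range(0xEB, 0xEF)) | set(range(0xF0, 0xFA)),
    "docomo-shift_jis": set(range(0xF0, 0xFA)),
    "softbank-shift_jis": set(range(0xF0, 0xFA)) | {0xFB},
    "softbank-jisx-208": set(range(0x75, 0x7C)) | {0x7D},
}


def strip_emoji_area(charset, base_mappings):
    """Drop two-byte base-table mappings whose lead byte is in an excluded row."""
    excluded = EXCLUDED_LEAD_BYTES[charset]
    return [
        (code_point, charset_bytes)
        for code_point, charset_bytes in base_mappings
        if not (len(charset_bytes) == 2 and charset_bytes[0] in excluded)
    ]


if __name__ == "__main__":
    # Illustrative entries in the style of windows-932-2000.ucm:
    # a vendor-defined character and an EUDC (private use) mapping.
    base = [(0x2460, b"\x87\x40"), (0xE000, b"\xF0\x40")]
    print(strip_emoji_area("docomo-shift_jis", base))  # only the 0x8740 entry remains
```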
== Google Standard Emoji Unicode Mapping
The Google standard emoji Unicode mapping can be found at:
/home/build/google3/i18n/encodings/emoji/emoji_unicode_mapping.txt
TODO(mscherer): Use <icu:base> to share most standard JIS mappings
among *-shift_jis-2007.ucm files.