| Unicode ident | 
 | ============= | 
 |  | 
 | [<img alt="github" src="https://img.shields.io/badge/github-dtolnay/unicode--ident-8da0cb?style=for-the-badge&labelColor=555555&logo=github" height="20">](https://github.com/dtolnay/unicode-ident) | 
 | [<img alt="crates.io" src="https://img.shields.io/crates/v/unicode-ident.svg?style=for-the-badge&color=fc8d62&logo=rust" height="20">](https://crates.io/crates/unicode-ident) | 
 | [<img alt="docs.rs" src="https://img.shields.io/badge/docs.rs-unicode--ident-66c2a5?style=for-the-badge&labelColor=555555&logo=docs.rs" height="20">](https://docs.rs/unicode-ident) | 
 | [<img alt="build status" src="https://img.shields.io/github/actions/workflow/status/dtolnay/unicode-ident/ci.yml?branch=master&style=for-the-badge" height="20">](https://github.com/dtolnay/unicode-ident/actions?query=branch%3Amaster) | 
 |  | 
 | Implementation of [Unicode Standard Annex #31][tr31] for determining which | 
 | `char` values are valid in programming language identifiers. | 
 |  | 
 | [tr31]: https://www.unicode.org/reports/tr31/ | 
 |  | 
 | This crate is a better optimized implementation of the older `unicode-xid` | 
 | crate. This crate uses less static storage, and is able to classify both ASCII | 
 | and non-ASCII codepoints with better performance, 2–10× faster than | 
 | `unicode-xid`. | 
 |  | 
 | <br> | 
 |  | 
 | ## Comparison of performance | 
 |  | 
 | The following table shows a comparison between five Unicode identifier | 
 | implementations. | 
 |  | 
 | - `unicode-ident` is this crate; | 
 | - [`unicode-xid`] is a widely used crate run by the "unicode-rs" org; | 
 | - `ucd-trie` and `fst` are two data structures supported by the [`ucd-generate`] tool; | 
 | - [`roaring`] is a Rust implementation of Roaring bitmap. | 
 |  | 
 | The *static storage* column shows the total size of `static` tables that the | 
 | crate bakes into your binary, measured in 1000s of bytes. | 
 |  | 
 | The remaining columns show the **cost per call** to evaluate whether a single | 
 | `char` has the XID\_Start or XID\_Continue Unicode property, comparing across | 
 | different ratios of ASCII to non-ASCII codepoints in the input data. | 
 |  | 
 | [`unicode-xid`]: https://github.com/unicode-rs/unicode-xid | 
 | [`ucd-generate`]: https://github.com/BurntSushi/ucd-generate | 
 | [`roaring`]: https://github.com/RoaringBitmap/roaring-rs | 
 |  | 
 | | | static storage | 0% nonascii | 1% | 10% | 100% nonascii | | 
 | |---|---|---|---|---|---| | 
 | | **`unicode-ident`** | 10.0 K | 0.96 ns | 0.95 ns | 1.09 ns | 1.55 ns | | 
 | | **`unicode-xid`** | 11.5 K | 1.88 ns | 2.14 ns | 3.48 ns | 15.63 ns | | 
 | | **`ucd-trie`** | 10.2 K | 1.29 ns | 1.28 ns | 1.36 ns | 2.15 ns | | 
 | | **`fst`** | 138 K | 55.1 ns | 54.9 ns | 53.2 ns | 28.5 ns | | 
 | | **`roaring`** | 66.1 K | 2.78 ns | 3.09 ns | 3.37 ns | 4.70 ns | | 
 |  | 
 | Source code for the benchmark is provided in the *bench* directory of this repo | 
 | and may be repeated by running `cargo criterion`. | 
 |  | 
 | <br> | 
 |  | 
 | ## Comparison of data structures | 
 |  | 
 | #### unicode-xid | 
 |  | 
 | They use a sorted array of character ranges, and do a binary search to look up | 
 | whether a given character lands inside one of those ranges. | 
 |  | 
 | ```rust | 
 | static XID_Continue_table: [(char, char); 763] = [ | 
 |     ('\u{30}', '\u{39}'),  // 0-9 | 
 |     ('\u{41}', '\u{5a}'),  // A-Z | 
 |     … | 
 |     ('\u{e0100}', '\u{e01ef}'), | 
 | ]; | 
 | ``` | 
 |  | 
 | The static storage used by this data structure scales with the number of | 
 | contiguous ranges of identifier codepoints in Unicode. Every table entry | 
 | consumes 8 bytes, because it consists of a pair of 32-bit `char` values. | 
 |  | 
 | In some ranges of the Unicode codepoint space, this is quite a sparse | 
 | representation – there are some ranges where tens of thousands of adjacent | 
 | codepoints are all valid identifier characters. In other places, the | 
 | representation is quite inefficient. A characater like `µ` (U+00B5) which is | 
 | surrounded by non-identifier codepoints consumes 64 bits in the table, while it | 
 | would be just 1 bit in a dense bitmap. | 
 |  | 
 | On a system with 64-byte cache lines, binary searching the table touches 7 cache | 
 | lines on average. Each cache line fits only 8 table entries. Additionally, the | 
 | branching performed during the binary search is probably mostly unpredictable to | 
 | the branch predictor. | 
 |  | 
 | Overall, the crate ends up being about 10× slower on non-ASCII input | 
 | compared to the fastest crate. | 
 |  | 
 | A potential improvement would be to pack the table entries more compactly. | 
 | Rust's `char` type is a 21-bit integer padded to 32 bits, which means every | 
 | table entry is holding 22 bits of wasted space, adding up to 3.9 K. They could | 
 | instead fit every table entry into 6 bytes, leaving out some of the padding, for | 
 | a 25% improvement in space used. With some cleverness it may be possible to fit | 
 | in 5 bytes or even 4 bytes by storing a low char and an extent, instead of low | 
 | char and high char. I don't expect that performance would improve much but this | 
 | could be the most efficient for space across all the libraries, needing only | 
 | about 7 K to store. | 
 |  | 
 | #### ucd-trie | 
 |  | 
 | Their data structure is a compressed trie set specifically tailored for Unicode | 
 | codepoints. The design is credited to Raph Levien in [rust-lang/rust#33098]. | 
 |  | 
 | [rust-lang/rust#33098]: https://github.com/rust-lang/rust/pull/33098 | 
 |  | 
 | ```rust | 
 | pub struct TrieSet { | 
 |     tree1_level1: &'static [u64; 32], | 
 |     tree2_level1: &'static [u8; 992], | 
 |     tree2_level2: &'static [u64], | 
 |     tree3_level1: &'static [u8; 256], | 
 |     tree3_level2: &'static [u8], | 
 |     tree3_level3: &'static [u64], | 
 | } | 
 | ``` | 
 |  | 
 | It represents codepoint sets using a trie to achieve prefix compression. The | 
 | final states of the trie are embedded in leaves or "chunks", where each chunk is | 
 | a 64-bit integer. Each bit position of the integer corresponds to whether a | 
 | particular codepoint is in the set or not. These chunks are not just a compact | 
 | representation of the final states of the trie, but are also a form of suffix | 
 | compression. In particular, if multiple ranges of 64 contiguous codepoints have | 
 | the same Unicode properties, then they all map to the same chunk in the final | 
 | level of the trie. | 
 |  | 
 | Being tailored for Unicode codepoints, this trie is partitioned into three | 
 | disjoint sets: tree1, tree2, tree3. The first set corresponds to codepoints \[0, | 
 | 0x800), the second \[0x800, 0x10000) and the third \[0x10000, 0x110000). These | 
 | partitions conveniently correspond to the space of 1 or 2 byte UTF-8 encoded | 
 | codepoints, 3 byte UTF-8 encoded codepoints and 4 byte UTF-8 encoded codepoints, | 
 | respectively. | 
 |  | 
 | Lookups in this data structure are significantly more efficient than binary | 
 | search. A lookup touches either 1, 2, or 3 cache lines based on which of the | 
 | trie partitions is being accessed. | 
 |  | 
 | One possible performance improvement would be for this crate to expose a way to | 
 | query based on a UTF-8 encoded string, returning the Unicode property | 
 | corresponding to the first character in the string. Without such an API, the | 
 | caller is required to tokenize their UTF-8 encoded input data into `char`, hand | 
 | the `char` into `ucd-trie`, only for `ucd-trie` to undo that work by converting | 
 | back into the variable-length representation for trie traversal. | 
 |  | 
 | #### fst | 
 |  | 
 | Uses a [finite state transducer][fst]. This representation is built into | 
 | [ucd-generate] but I am not aware of any advantage over the `ucd-trie` | 
 | representation. In particular `ucd-trie` is optimized for storing Unicode | 
 | properties while `fst` is not. | 
 |  | 
 | [fst]: https://github.com/BurntSushi/fst | 
 | [ucd-generate]: https://github.com/BurntSushi/ucd-generate | 
 |  | 
 | As far as I can tell, the main thing that causes `fst` to have large size and | 
 | slow lookups for this use case relative to `ucd-trie` is that it does not | 
 | specialize for the fact that only 21 of the 32 bits in a `char` are meaningful. | 
 | There are some dense arrays in the structure with large ranges that could never | 
 | possibly be used. | 
 |  | 
 | #### roaring | 
 |  | 
 | This crate is a pure-Rust implementation of [Roaring Bitmap], a data structure | 
 | designed for storing sets of 32-bit unsigned integers. | 
 |  | 
 | [Roaring Bitmap]: https://roaringbitmap.org/about/ | 
 |  | 
 | Roaring bitmaps are compressed bitmaps which tend to outperform conventional | 
 | compressed bitmaps such as WAH, EWAH or Concise. In some instances, they can be | 
 | hundreds of times faster and they often offer significantly better compression. | 
 |  | 
 | In this use case the performance was reasonably competitive but still | 
 | substantially slower than the Unicode-optimized crates. Meanwhile the | 
 | compression was significantly worse, requiring 6× as much storage for the | 
 | data structure. | 
 |  | 
 | I also benchmarked the [`croaring`] crate which is an FFI wrapper around the C | 
 | reference implementation of Roaring Bitmap. This crate was consistently about | 
 | 15% slower than pure-Rust `roaring`, which could just be FFI overhead. I did not | 
 | investigate further. | 
 |  | 
 | [`croaring`]: https://crates.io/crates/croaring | 
 |  | 
 | #### unicode-ident | 
 |  | 
 | This crate is most similar to the `ucd-trie` library, in that it's based on | 
 | bitmaps stored in the leafs of a trie representation, achieving both prefix | 
 | compression and suffix compression. | 
 |  | 
 | The key differences are: | 
 |  | 
 | - Uses a single 2-level trie, rather than 3 disjoint partitions of different | 
 |   depth each. | 
 | - Uses significantly larger chunks: 512 bits rather than 64 bits. | 
 | - Compresses the XID\_Start and XID\_Continue properties together | 
 |   simultaneously, rather than duplicating identical trie leaf chunks across the | 
 |   two. | 
 |  | 
 | The following diagram show the XID\_Start and XID\_Continue Unicode boolean | 
 | properties in uncompressed form, in row-major order: | 
 |  | 
 | <table> | 
 | <tr><th>XID_Start</th><th>XID_Continue</th></tr> | 
 | <tr> | 
 | <td><img alt="XID_Start bitmap" width="256" src="https://user-images.githubusercontent.com/1940490/168647353-c6eeb922-afec-49b2-9ef5-c03e9d1e0760.png"></td> | 
 | <td><img alt="XID_Continue bitmap" width="256" src="https://user-images.githubusercontent.com/1940490/168647367-f447cca7-2362-4d7d-8cd7-d21c011d329b.png"></td> | 
 | </tr> | 
 | </table> | 
 |  | 
 | Uncompressed, these would take 140 K to store, which is beyond what would be | 
 | reasonable. However, as you can see there is a large degree of similarity | 
 | between the two bitmaps and across the rows, which lends well to compression. | 
 |  | 
 | This crate stores one 512-bit "row" of the above bitmaps in the leaf level of a | 
 | trie, and a single additional level to index into the leafs. It turns out there | 
 | are 124 unique 512-bit chunks across the two bitmaps so 7 bits are sufficient to | 
 | index them. | 
 |  | 
 | The chunk size of 512 bits is selected as the size that minimizes the total size | 
 | of the data structure. A smaller chunk, like 256 or 128 bits, would achieve | 
 | better deduplication but require a larger index. A larger chunk would increase | 
 | redundancy in the leaf bitmaps. 512 bit chunks are the optimum for total size of | 
 | the index plus leaf bitmaps. | 
 |  | 
 | In fact since there are only 124 unique chunks, we can use an 8-bit index with a | 
 | spare bit to index at the half-chunk level. This achieves an additional 8.5% | 
 | compression by eliminating redundancies between the second half of any chunk and | 
 | the first half of any other chunk. Note that this is not the same as using | 
 | chunks which are half the size, because it does not necessitate raising the size | 
 | of the trie's first level. | 
 |  | 
 | In contrast to binary search or the `ucd-trie` crate, performing lookups in this | 
 | data structure is straight-line code with no need for branching. | 
 |  | 
 | ```asm | 
 | is_xid_start: | 
 | 	mov eax, edi | 
 | 	shr eax, 9 | 
 | 	lea rcx, [rip + unicode_ident::tables::TRIE_START] | 
 | 	add rcx, rax | 
 | 	xor eax, eax | 
 | 	cmp edi, 201728 | 
 | 	cmovb rax, rcx | 
 | 	test rax, rax | 
 | 	lea rcx, [rip + .L__unnamed_1] | 
 | 	cmovne rcx, rax | 
 | 	movzx eax, byte ptr [rcx] | 
 | 	shl rax, 5 | 
 | 	mov ecx, edi | 
 | 	shr ecx, 3 | 
 | 	and ecx, 63 | 
 | 	add rcx, rax | 
 | 	lea rax, [rip + unicode_ident::tables::LEAF] | 
 | 	mov al, byte ptr [rax + rcx] | 
 | 	and dil, 7 | 
 | 	mov ecx, edi | 
 | 	shr al, cl | 
 | 	and al, 1 | 
 | 	ret | 
 | ``` | 
 |  | 
 | <br> | 
 |  | 
 | ## License | 
 |  | 
 | Use of the Unicode Character Database, as this crate does, is governed by the <a | 
 | href="LICENSE-UNICODE">Unicode License Agreement – Data Files and Software | 
 | (2016)</a>. | 
 |  | 
 | All intellectual property within this crate that is **not generated** using the | 
 | Unicode Character Database as input is licensed under either of <a | 
 | href="LICENSE-APACHE">Apache License, Version 2.0</a> or <a | 
 | href="LICENSE-MIT">MIT license</a> at your option. | 
 |  | 
 | The **generated** files incorporate tabular data derived from the Unicode | 
 | Character Database, together with intellectual property from the original source | 
 | code content of the crate. One must comply with the terms of both the Unicode | 
 | License Agreement and either of the Apache license or MIT license when those | 
 | generated files are involved. | 
 |  | 
 | Unless you explicitly state otherwise, any contribution intentionally submitted | 
 | for inclusion in this crate by you, as defined in the Apache-2.0 license, shall | 
 | be licensed as just described, without any additional terms or conditions. |