tag | 4d96679f4481273b2b997c6084a7d843a3968d58 | |
---|---|---|
tagger | The Android Open Source Project <initial-contribution@android.com> | Sat Dec 10 02:57:47 2022 -0800 |
object | 3833d03eab641aba628a452cb264c6cbb921f6d8 |
aml_uwb_331115000 (9323957,com.google.android.uwb)
commit | 3833d03eab641aba628a452cb264c6cbb921f6d8 | |
---|---|---|
author | Joel Galenson <jgalenson@google.com> | Wed Dec 15 18:07:12 2021 +0000 |
committer | Automerger Merge Worker <android-build-automerger-merge-worker@system.gserviceaccount.com> | Wed Dec 15 18:07:12 2021 +0000 |
tree | 04746dbf501ec3ecc5e9a87964764df52e5fe7e3 | |
parent | 6ea3adb0863cee5de579e459295c12febfecb77f | |
parent | eb1a942d3a86aaa2d85f7a8ae749b7a1d73b488a | |
Merge "Refresh Android.bp, cargo2android.json, TEST_MAPPING." am: bfa294896c am: 6d683916f1 am: 8b65176931 am: eb1a942d3a

Original change: https://android-review.googlesource.com/c/platform/external/rust/crates/regex-automata/+/1912439

Change-Id: I7d11715091c6be8a516233c41a0ee609098bd4bd

A low level regular expression library that uses deterministic finite automata. It supports a rich syntax with Unicode support, has extensive options for configuring the best space vs. time trade-off for your use case, and provides support for cheap deserialization of automata for use in `no_std` environments.
Dual-licensed under MIT or the UNLICENSE.
https://docs.rs/regex-automata
Add this to your `Cargo.toml`:

    [dependencies]
    regex-automata = "0.1"

and this to your crate root (if you're using Rust 2015):

    extern crate regex_automata;

This example shows how to compile a regex using the default configuration and then use it to find matches in a byte string:

    use regex_automata::Regex;

    let re = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}").unwrap();
    let text = b"2018-12-24 2016-10-08";
    let matches: Vec<(usize, usize)> = re.find_iter(text).collect();
    assert_eq!(matches, vec![(0, 10), (11, 21)]);

For more examples and information about the various knobs that can be turned, please see the docs.
no_std
This crate comes with a `std` feature that is enabled by default. When the `std` feature is enabled, the API of this crate will include the facilities necessary for compiling, serializing, deserializing and searching with regular expressions. When the `std` feature is disabled, the API of this crate will shrink such that it only includes the facilities necessary for deserializing and searching with regular expressions.
The intended workflow for `no_std` environments is thus as follows:

1. Write a program with the `std` feature that compiles and serializes a regular expression. Serialization should only happen after first converting the DFAs to use a fixed size state identifier instead of the default `usize`. You may also need to serialize both little and big endian versions of each DFA. (A regex uses two DFAs, one to find the end of a match and one to find its start, so that's 4 DFAs in total for each regex.)
2. In your `no_std` environment, follow the deserialization examples in the docs to turn your previously serialized DFAs into regexes. You can then search with them as you would any regex.

Deserialization can happen anywhere, for example from bytes embedded into a binary or from a file memory mapped at runtime.

Note that the `ucd-generate` tool will do the first step for you with its `dfa` or `regex` sub-commands.
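
The advice in the first step of the workflow (fixed size state identifiers, plus one serialized copy per byte order) can be illustrated with a toy sketch. This is not the byte format used by regex-automata, and the helper names `encode_le`, `encode_be` and `decode_native` are hypothetical. It only assumes that a DFA's transition table is, at heart, an array of state identifiers that the target reads back in its own byte order.

```rust
// Toy illustration only: this is NOT the serialization format of regex-automata.

/// Encode a table of fixed-width (u16) state identifiers as little-endian bytes.
fn encode_le(table: &[u16]) -> Vec<u8> {
    table.iter().flat_map(|id| id.to_le_bytes()).collect()
}

/// Encode the same table as big-endian bytes.
fn encode_be(table: &[u16]) -> Vec<u8> {
    table.iter().flat_map(|id| id.to_be_bytes()).collect()
}

/// Decode using the byte order of the machine running this code, which is
/// what a cheap "read the bytes as they are" deserializer effectively does.
fn decode_native(bytes: &[u8]) -> Vec<u16> {
    bytes
        .chunks_exact(2)
        .map(|pair| u16::from_ne_bytes([pair[0], pair[1]]))
        .collect()
}

fn main() {
    let table: Vec<u16> = vec![0, 1, 258, 513];
    let (le, be) = (encode_le(&table), encode_be(&table));

    // The two encodings differ, so each target needs the matching artifact.
    assert_ne!(le, be);

    let (matching, mismatched) = if cfg!(target_endian = "little") {
        (&le, &be)
    } else {
        (&be, &le)
    };
    assert_eq!(decode_native(matching), table);
    assert_ne!(decode_native(mismatched), table);

    // Unlike u16, the width of usize depends on the platform, which is why a
    // fixed-width identifier is the safer choice for bytes that move between machines.
    println!(
        "u16 is {} bytes, usize is {} bytes on this machine",
        std::mem::size_of::<u16>(),
        std::mem::size_of::<usize>()
    );
}
```

Shipping both encodings and selecting one at build time (for example with `cfg(target_endian = ...)`) is one way to keep deserialization free of byte swapping.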
* `std` - Enabled by default. This enables the ability to compile finite automata. This requires the `regex-syntax` dependency. Without this feature enabled, finite automata can only be used for searching (using the approach described above).
* `transducer` - Disabled by default. This provides implementations of the `Automaton` trait found in the `fst` crate. This permits using finite automata generated by this crate to search finite state transducers. This requires the `fst` dependency.

The main goal of the `regex` crate is to serve as a general purpose regular expression engine. It aims to automatically balance low compile times, fast search times and low memory usage, while also providing a convenient API for users. In contrast, this crate provides a lower level regular expression interface that is a bit less convenient while providing more explicit control over memory usage and search times.

Here are some specific negative differences:

* The size of a compiled DFA can be exponential in the size of the pattern. For example, `[01]*1[01]{N}` will build a DFA with `2^(N+1)` states. For this reason, untrusted patterns should not be compiled with this library. (In the future, the API may expose an option to return an error if the DFA gets too big.)
* Regexes that contain Unicode character classes can be slow to compile. For example, compiling `\w{3}` with byte classes enabled takes just over 1 second and almost 5MB of memory! (Compiling a sparse regex takes about the same time but only uses about 500KB of memory.) Conversely, compiling the same regex without Unicode support, e.g., `(?-u)\w{3}`, takes under 1 millisecond and less than 5KB of memory. For this reason, you should only use Unicode character classes if you absolutely need them!
* This crate does not support zero-width assertions such as `^`, `$`, `\b` or `\B`.
* There is no `&str` API like in the regex crate. In this crate, all APIs operate on `&[u8]`. By default, match indices are guaranteed to fall on UTF-8 boundaries, unless `RegexBuilder::allow_invalid_utf8` is enabled.

With some of the downsides out of the way, here are some positive differences:

* Searching with a DFA has minimal runtime requirements and can therefore be done in `no_std` environments. While `no_std` environments cannot compile regexes, they can deserialize pre-compiled regexes.
* This crate will generally out-perform the `regex` crate on equivalent tasks. The performance difference is likely not large. However, because of a complex set of optimizations in the regex crate (like literal optimizations), an accurate performance comparison may be difficult to do.
* This crate exposes DFAs directly, such as `DenseDFA` and `SparseDFA`, which enables one to do less work in some cases. For example, if you only need the end of a match and not the start of a match, then you can use a DFA directly without building a `Regex`, which always requires a second DFA to find the start of a match.
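
The `2^(N+1)` state count quoted among the downsides above can be checked with a small, self-contained sketch. It does not use this crate. It assumes a hand-built NFA for `[01]*1[01]{N}` over the two-symbol alphabet `{0, 1}` and runs a textbook subset construction on it. The function name `dfa_state_count` is hypothetical, and the DFAs that regex-automata actually builds may differ in detail.

```rust
use std::collections::HashSet;

/// Count the DFA states that subset construction reaches for `[01]*1[01]{n}`.
///
/// NFA states are bits of a u64: bit 0 is the looping start state, and
/// bit i (1..=n+1) means "the mandatory 1 was seen, followed by i-1 more
/// symbols". Bit n+1 is the accepting state.
fn dfa_state_count(n: u32) -> usize {
    assert!(n < 62, "state sets are stored in a u64");
    let mask: u64 = (1u64 << (n + 2)) - 1;

    // One step of the subset construction on symbol `sym` (0 or 1).
    let step = |set: u64, sym: u8| -> u64 {
        // Every non-start state advances by one; the accepting state falls off.
        let advanced = ((set & !1) << 1) & mask;
        // The start state loops on both symbols and also enters state 1 on a 1.
        let from_start = if sym == 1 { 0b11 } else { 0b01 };
        advanced | from_start
    };

    // Explore all reachable sets of NFA states, starting from {start}.
    let mut seen: HashSet<u64> = HashSet::new();
    let mut stack = vec![1u64];
    seen.insert(1);
    while let Some(set) = stack.pop() {
        for sym in [0u8, 1u8] {
            let next = step(set, sym);
            if seen.insert(next) {
                stack.push(next);
            }
        }
    }
    seen.len()
}

fn main() {
    for n in [0u32, 1, 2, 3, 10] {
        println!("N = {:2}: {} DFA states", n, dfa_state_count(n));
    }
}
```

Each DFA state has to remember which of the last `N+1` symbols were a `1`, which is where the doubling per unit of `N` comes from.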