| .. _module-pw_tokenizer-design: |
| |
| ====== |
| Design |
| ====== |
| .. pigweed-module-subpage:: |
| :name: pw_tokenizer |
| :tagline: Cut your log sizes in half |
| :nav: |
| getting started: module-pw_tokenizer-get-started |
| design: module-pw_tokenizer-design |
| api: module-pw_tokenizer-api |
| cli: module-pw_tokenizer-cli |
| |
| There are two sides to ``pw_tokenizer``, which we call tokenization and |
| detokenization. |
| |
| * **Tokenization** converts string literals in the source code to binary tokens |
| at compile time. If the string has printf-style arguments, these are encoded |
| to compact binary form at runtime. |
| * **Detokenization** converts tokenized strings back to the original |
| human-readable strings. |
| |
| Here's an overview of what happens when ``pw_tokenizer`` is used: |
| |
| 1. During compilation, the ``pw_tokenizer`` module hashes string literals to |
| generate stable 32-bit tokens. |
| 2. The tokenization macro removes these strings by declaring them in an ELF |
| section that is excluded from the final binary. |
| 3. After compilation, strings are extracted from the ELF to build a database of |
| tokenized strings for use by the detokenizer. The ELF file may also be used |
| directly. |
| 4. During operation, the device encodes the string token and its arguments, if |
| any. |
| 5. The encoded tokenized strings are sent off-device or stored. |
| 6. Off-device, the detokenizer tools use the token database to decode the |
| strings to human-readable form. |
| |
| .. _module-pw_tokenizer-design-example: |
| |
| -------------------------- |
| Example: tokenized logging |
| -------------------------- |
| This example demonstrates using ``pw_tokenizer`` for logging. In this example, |
| tokenized logging saves ~90% in binary size (41 → 4 bytes) and 70% in encoded |
| size (49 → 15 bytes). |
| |
| **Before**: plain text logging |
| |
| +------------------+-------------------------------------------+---------------+ |
| | Location | Logging Content | Size in bytes | |
| +==================+===========================================+===============+ |
| | Source contains | ``LOG("Battery state: %s; battery | | |
| | | voltage: %d mV", state, voltage);`` | | |
| +------------------+-------------------------------------------+---------------+ |
| | Binary contains | ``"Battery state: %s; battery | 41 | |
| | | voltage: %d mV"`` | | |
| +------------------+-------------------------------------------+---------------+ |
| | | (log statement is called with | | |
| | | ``"CHARGING"`` and ``3989`` as arguments) | | |
| +------------------+-------------------------------------------+---------------+ |
| | Device transmits | ``"Battery state: CHARGING; battery | 49 | |
| | | voltage: 3989 mV"`` | | |
| +------------------+-------------------------------------------+---------------+ |
| | When viewed | ``"Battery state: CHARGING; battery | | |
| | | voltage: 3989 mV"`` | | |
| +------------------+-------------------------------------------+---------------+ |
| |
| **After**: tokenized logging |
| |
| +------------------+-----------------------------------------------------------+---------+ |
| | Location | Logging Content | Size in | |
| | | | bytes | |
| +==================+===========================================================+=========+ |
| | Source contains | ``LOG("Battery state: %s; battery | | |
| | | voltage: %d mV", state, voltage);`` | | |
| +------------------+-----------------------------------------------------------+---------+ |
| | Binary contains | ``d9 28 47 8e`` (0x8e4728d9) | 4 | |
| +------------------+-----------------------------------------------------------+---------+ |
| | | (log statement is called with | | |
| | | ``"CHARGING"`` and ``3989`` as arguments) | | |
| +------------------+-----------------------------------------------------------+---------+ |
| | Device transmits | =============== ============================== ========== | 15 | |
| | | ``d9 28 47 8e`` ``08 43 48 41 52 47 49 4E 47`` ``aa 3e`` | | |
| | | --------------- ------------------------------ ---------- | | |
| | | Token ``"CHARGING"`` argument ``3989``, | | |
| | | as | | |
| | | varint | | |
| | | =============== ============================== ========== | | |
| +------------------+-----------------------------------------------------------+---------+ |
| | When viewed | ``"Battery state: CHARGING; battery voltage: 3989 mV"`` | | |
| +------------------+-----------------------------------------------------------+---------+ |
| |
| .. _module-pw_tokenizer-base64-format: |
| |
| ------------- |
| Base64 format |
| ------------- |
| The tokenizer encodes messages to a compact binary representation. Applications |
| may desire a textual representation of tokenized strings. This makes it easy to |
| use tokenized messages alongside plain text messages, but comes at a small |
| efficiency cost: encoded Base64 messages occupy about 4/3 (133%) as much memory |
| as binary messages. |
| |
The Base64 format consists of a ``$`` character followed by the
| Base64-encoded contents of the tokenized message. For example, consider |
| tokenizing the string ``This is an example: %d!`` with the argument -1. The |
| string's token is 0x4b016e66. |
| |
| .. code-block:: text |
| |
| Source code: PW_LOG("This is an example: %d!", -1); |
| |
| Plain text: This is an example: -1! [23 bytes] |
| |
| Binary: 66 6e 01 4b 01 [ 5 bytes] |
| |
| Base64: $Zm4BSwE= [ 9 bytes] |
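
The conversion shown above is easy to reproduce: the prefixed Base64 format is
simply standard Base64 of the binary message with a leading ``$``. A minimal
sketch:

.. code-block:: python

   import base64

   def prefixed_base64_encode(payload: bytes) -> str:
       """Wrap a binary tokenized message in the $-prefixed Base64 format."""
       return '$' + base64.b64encode(payload).decode()

   def prefixed_base64_decode(message: str) -> bytes:
       """Recover the binary payload from a $-prefixed Base64 message."""
       if not message.startswith('$'):
           raise ValueError('Not a prefixed Base64 message')
       return base64.b64decode(message[1:])

   # The 5-byte binary message from the example above.
   binary = bytes.fromhex('666e014b01')
   assert prefixed_base64_encode(binary) == '$Zm4BSwE='
   assert prefixed_base64_decode('$Zm4BSwE=') == binary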
| |
| See :ref:`module-pw_tokenizer-base64-guides` for guidance on encoding and |
| decoding Base64 messages. |
| |
| .. _module-pw_tokenizer-token-databases: |
| |
| --------------- |
| Token databases |
| --------------- |
| Token databases store a mapping of tokens to the strings they represent. An ELF |
| file can be used as a token database, but it only contains the strings for its |
| exact build. A token database file aggregates tokens from multiple ELF files, so |
| that a single database can decode tokenized strings from any known ELF. |
| |
| Token databases contain the token, removal date (if any), and string for each |
| tokenized string. |
| |
| For help with using token databases, see |
| :ref:`module-pw_tokenizer-managing-token-databases`. |
| |
| Token database formats |
| ====================== |
| Three token database formats are supported: CSV, binary, and directory. Tokens |
| may also be read from ELF files or ``.a`` archives, but cannot be written to |
| these formats. |
| |
| CSV database format |
| ------------------- |
| The CSV database format has three columns: the token in hexadecimal, the removal |
| date (if any) in year-month-day format, and the string literal, surrounded by |
| quotes. Quote characters within the string are represented as two quote |
| characters. |
| |
| This example database contains six strings, three of which have removal dates. |
| |
| .. code-block:: |
| |
| 141c35d5, ,"The answer: ""%s""" |
| 2e668cd6,2019-12-25,"Jello, world!" |
| 7b940e2a, ,"Hello %s! %hd %e" |
| 851beeb6, ,"%u %d" |
| 881436a0,2020-01-01,"The answer is: %s" |
| e13b0f94,2020-04-01,"%llu" |
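
Because quotes are doubled rather than backslash-escaped, this format matches
what Python's standard ``csv`` module produces and consumes by default. A
sketch of parsing a database like the one above (illustrative, not the actual
implementation):

.. code-block:: python

   import csv
   import io

   CSV_DATABASE = (
       '141c35d5, ,"The answer: ""%s"""\n'
       '2e668cd6,2019-12-25,"Jello, world!"\n'
       '7b940e2a, ,"Hello %s! %hd %e"\n'
   )

   def parse_csv_database(text):
       """Parse token,date,string rows into a {token: (date, string)} dict."""
       database = {}
       for token_hex, date, string in csv.reader(io.StringIO(text)):
           date = date.strip() or None  # A blank date means 'not removed'.
           database[int(token_hex, 16)] = (date, string)
       return database

   database = parse_csv_database(CSV_DATABASE)
   assert database[0x141C35D5] == (None, 'The answer: "%s"')
   assert database[0x2E668CD6] == ('2019-12-25', 'Jello, world!')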
| |
| Binary database format |
| ---------------------- |
The binary database format consists of a 16-byte header followed by a series
of 8-byte entries. Each entry stores the token and the removal date, which is
0xFFFFFFFF if there is none. The string literals follow, null-terminated, in
the same order as the entries. See
| `token_database.h <https://pigweed.googlesource.com/pigweed/pigweed/+/HEAD/pw_tokenizer/public/pw_tokenizer/token_database.h>`_ |
| for full details. |
| |
| The binary form of the CSV database is shown below. It contains the same |
| information, but in a more compact and easily processed form. It takes 141 B |
| compared with the CSV database's 211 B. |
| |
| .. code-block:: text |
| |
| [header] |
| 0x00: 454b4f54 0000534e TOKENS.. |
| 0x08: 00000006 00000000 ........ |
| |
| [entries] |
| 0x10: 141c35d5 ffffffff .5...... |
| 0x18: 2e668cd6 07e30c19 ..f..... |
| 0x20: 7b940e2a ffffffff *..{.... |
| 0x28: 851beeb6 ffffffff ........ |
| 0x30: 881436a0 07e40101 .6...... |
| 0x38: e13b0f94 07e40401 ..;..... |
| |
| [string table] |
| 0x40: 54 68 65 20 61 6e 73 77 65 72 3a 20 22 25 73 22 The answer: "%s" |
| 0x50: 00 4a 65 6c 6c 6f 2c 20 77 6f 72 6c 64 21 00 48 .Jello, world!.H |
| 0x60: 65 6c 6c 6f 20 25 73 21 20 25 68 64 20 25 65 00 ello %s! %hd %e. |
| 0x70: 25 75 20 25 64 00 54 68 65 20 61 6e 73 77 65 72 %u %d.The answer |
| 0x80: 20 69 73 3a 20 25 73 00 25 6c 6c 75 00 is: %s.%llu. |
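
The layout above can be parsed with a few ``struct`` calls. The header fields
here (an 8-byte ``TOKENS`` magic string and a 32-bit entry count) are inferred
from the hex dump, and the removal date appears to be packed as
``0xYYYYMMDD``; see ``token_database.h`` for the authoritative definition. A
sketch:

.. code-block:: python

   import struct

   def parse_binary_database(data: bytes):
       """Parse (token, date, string) entries per the layout shown above."""
       magic, count = struct.unpack_from('<8sI', data, 0)
       assert magic.startswith(b'TOKENS')
       entries = []
       offset = 16  # Entries begin after the 16-byte header.
       for _ in range(count):
           token, date_raw = struct.unpack_from('<II', data, offset)
           offset += 8
           if date_raw == 0xFFFFFFFF:  # 0xFFFFFFFF means no removal date.
               date = None
           else:  # Dates are packed as 0xYYYYMMDD.
               date = (date_raw >> 16, (date_raw >> 8) & 0xFF, date_raw & 0xFF)
           entries.append((token, date))
       # Null-terminated strings follow, in the same order as the entries.
       strings = data[offset:].split(b'\x00')
       return [(token, date, string.decode())
               for (token, date), string in zip(entries, strings)]

   # Build a two-entry database following the documented layout.
   header = b'TOKENS\x00\x00' + struct.pack('<II', 2, 0)
   entries = struct.pack('<II', 0x141C35D5, 0xFFFFFFFF)
   entries += struct.pack('<II', 0x2E668CD6, 0x07E30C19)  # 2019-12-25
   strings = b'The answer: "%s"\x00Jello, world!\x00'
   database = parse_binary_database(header + entries + strings)
   assert database[0] == (0x141C35D5, None, 'The answer: "%s"')
   assert database[1] == (0x2E668CD6, (2019, 12, 25), 'Jello, world!')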
| |
| .. _module-pw_tokenizer-directory-database-format: |
| |
| Directory database format |
| ------------------------- |
``pw_tokenizer`` can consume directories of CSV databases. A directory
database is searched recursively for files with a ``.pw_tokenizer.csv``
suffix, all of which are used for subsequent detokenization lookups.
| |
| An example directory database might look something like this: |
| |
| .. code-block:: text |
| |
| token_database |
| ├── chuck_e_cheese.pw_tokenizer.csv |
| ├── fungi_ble.pw_tokenizer.csv |
| └── some_more |
| └── arcade.pw_tokenizer.csv |
| |
| This format is optimized for storage in a Git repository alongside source code. |
| The token database commands randomly generate unique file names for the CSVs in |
| the database to prevent merge conflicts. Running ``mark_removed`` or ``purge`` |
| commands in the database CLI consolidates the files to a single CSV. |
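
Discovering the CSV files in a directory database amounts to a recursive
glob. A minimal sketch of the discovery step (illustrative, not the actual
implementation):

.. code-block:: python

   import pathlib
   import tempfile

   def find_database_csvs(root):
       """Recursively find all .pw_tokenizer.csv files under a directory."""
       return sorted(pathlib.Path(root).rglob('*.pw_tokenizer.csv'))

   # Recreate the example directory layout in a temporary directory.
   with tempfile.TemporaryDirectory() as tmp:
       root = pathlib.Path(tmp, 'token_database')
       (root / 'some_more').mkdir(parents=True)
       for name in ('chuck_e_cheese.pw_tokenizer.csv',
                    'fungi_ble.pw_tokenizer.csv',
                    'some_more/arcade.pw_tokenizer.csv'):
           (root / name).touch()
       found = [path.name for path in find_database_csvs(root)]
   assert found == ['chuck_e_cheese.pw_tokenizer.csv',
                    'fungi_ble.pw_tokenizer.csv',
                    'arcade.pw_tokenizer.csv']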
| |
The database command line tool supports a ``--discard-temporary
<upstream_commit>`` option for ``add``. In this mode, the tool attempts to
discard temporary tokens: it identifies the latest CSV not present in the
provided ``<upstream_commit>``, and discards any tokens in that CSV that are
not among the newly added tokens. This helps keep temporary tokens (e.g. from
debug logs) out of the database.
| |
| JSON support |
| ============ |
While ``pw_tokenizer`` doesn't specify a JSON database format, a token database can
| be created from a JSON formatted array of strings. This is useful for side-band |
| token database generation for strings that are not embedded as parsable tokens |
| in compiled binaries. See :ref:`module-pw_tokenizer-database-creation` for |
| instructions on generating a token database from a JSON file. |
| |
| .. _module-pw_tokenizer-collisions: |
| |
| ---------------- |
| Token collisions |
| ---------------- |
| Tokens are calculated with a hash function. It is possible for different |
| strings to hash to the same token. When this happens, multiple strings will have |
| the same token in the database, and it may not be possible to unambiguously |
| decode a token. |
| |
The detokenization tools attempt to resolve collisions automatically. Collisions
are resolved based on two factors:

- whether the tokenized data matches the string's arguments (if any), and
- if and when the string was marked as having been removed from the database.
| |
| See :ref:`module-pw_tokenizer-collisions-guide` for guidance on how to fix |
| collisions. |
| |
| Probability of collisions |
| ========================= |
Hashes of any size have a collision risk. The probability of at least one
collision occurring for a given number of strings is unintuitively high
| (this is known as the `birthday problem |
| <https://en.wikipedia.org/wiki/Birthday_problem>`_). If fewer than 32 bits are |
| used for tokens, the probability of collisions increases substantially. |
| |
| This table shows the approximate number of strings that can be hashed to have a |
| 1% or 50% probability of at least one collision (assuming a uniform, random |
| hash). |
| |
| +-------+---------------------------------------+ |
| | Token | Collision probability by string count | |
| | bits +--------------------+------------------+ |
| | | 50% | 1% | |
| +=======+====================+==================+ |
| | 32 | 77000 | 9300 | |
| +-------+--------------------+------------------+ |
| | 31 | 54000 | 6600 | |
| +-------+--------------------+------------------+ |
| | 24 | 4800 | 580 | |
| +-------+--------------------+------------------+ |
| | 16 | 300 | 36 | |
| +-------+--------------------+------------------+ |
| | 8 | 19 | 3 | |
| +-------+--------------------+------------------+ |
| |
| Keep this table in mind when masking tokens (see |
| :ref:`module-pw_tokenizer-masks`). 16 bits might be acceptable when |
| tokenizing a small set of strings, such as module names, but won't be suitable |
| for large sets of strings, like log messages. |
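
The thresholds in the table follow from the standard birthday-problem
approximation, P = 1 - exp(-n(n-1) / (2m)), where n is the number of strings
and m = 2^bits is the number of possible tokens. A quick check:

.. code-block:: python

   import math

   def collision_probability(strings: int, token_bits: int) -> float:
       """Approximate probability of at least one token collision."""
       outcomes = 2 ** token_bits
       return 1.0 - math.exp(-strings * (strings - 1) / (2 * outcomes))

   # Roughly reproduces the table above.
   assert abs(collision_probability(77000, 32) - 0.50) < 0.01
   assert abs(collision_probability(9300, 32) - 0.01) < 0.001
   assert abs(collision_probability(300, 16) - 0.50) < 0.01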
| |
| .. _module-pw_tokenizer-detokenization: |
| |
| -------------- |
| Detokenization |
| -------------- |
| Detokenization is the process of expanding a token to the string it represents |
and decoding its arguments. ``pw_tokenizer`` provides Python, C++, and
| TypeScript detokenization libraries. |
| |
| **Example: decoding tokenized logs** |
| |
| A project might tokenize its log messages with the |
| :ref:`module-pw_tokenizer-base64-format`. Consider the following log file, which |
| has four tokenized logs and one plain text log: |
| |
| .. code-block:: text |
| |
| 20200229 14:38:58 INF $HL2VHA== |
| 20200229 14:39:00 DBG $5IhTKg== |
| 20200229 14:39:20 DBG Crunching numbers to calculate probability of success |
| 20200229 14:39:21 INF $EgFj8lVVAUI= |
| 20200229 14:39:23 ERR $DFRDNwlOT1RfUkVBRFk= |
| |
| The project's log strings are stored in a database like the following: |
| |
| .. code-block:: |
| |
| 1c95bd1c, ,"Initiating retrieval process for recovery object" |
| 2a5388e4, ,"Determining optimal approach and coordinating vectors" |
| 3743540c, ,"Recovery object retrieval failed with status %s" |
| f2630112, ,"Calculated acceptable probability of success (%.2f%%)" |
| |
| Using the detokenizing tools with the database, the logs can be decoded: |
| |
| .. code-block:: text |
| |
| 20200229 14:38:58 INF Initiating retrieval process for recovery object |
20200229 14:39:00 DBG Determining optimal approach and coordinating vectors
| 20200229 14:39:20 DBG Crunching numbers to calculate probability of success |
| 20200229 14:39:21 INF Calculated acceptable probability of success (32.33%) |
| 20200229 14:39:23 ERR Recovery object retrieval failed with status NOT_READY |
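
A minimal detokenizer for this example can be sketched in Python. It assumes
the encodings described earlier: a ``$``-prefixed Base64 wrapper, a
little-endian 32-bit token, length-prefixed ``%s`` arguments, ZigZag-encoded
varints for integers, and 4-byte little-endian floats (the float encoding is
an assumption here). The real detokenization libraries handle many more
details, including collision resolution.

.. code-block:: python

   import base64
   import re
   import struct

   DATABASE = {
       0x1C95BD1C: 'Initiating retrieval process for recovery object',
       0x2A5388E4: 'Determining optimal approach and coordinating vectors',
       0x3743540C: 'Recovery object retrieval failed with status %s',
       0xF2630112: 'Calculated acceptable probability of success (%.2f%%)',
   }

   def decode_varint(data: bytes, offset: int):
       """Decode a little-endian base-128 varint; return (value, offset)."""
       value = shift = 0
       while True:
           byte = data[offset]
           offset += 1
           value |= (byte & 0x7F) << shift
           shift += 7
           if not byte & 0x80:
               return value, offset

   def detokenize(message: str) -> str:
       payload = base64.b64decode(message[1:])  # Strip the '$' prefix.
       token, = struct.unpack_from('<I', payload, 0)
       format_string = DATABASE[token]
       args, offset = [], 4
       # %% is a literal percent; all other specifiers consume an argument.
       specifiers = re.findall(
           r'%[-+ #0-9.*]*(?:hh|h|ll|l|j|z|t|L)?([%sdiufFeEgG])',
           format_string)
       for specifier in specifiers:
           if specifier == '%':
               continue  # Literal '%%'; no argument.
           if specifier == 's':  # Length-prefixed string.
               length = payload[offset]
               args.append(payload[offset + 1:offset + 1 + length].decode())
               offset += 1 + length
           elif specifier in 'diu':  # ZigZag-encoded varint.
               raw, offset = decode_varint(payload, offset)
               args.append((raw >> 1) ^ -(raw & 1))
           else:  # Floating point: 4-byte little-endian float.
               value, = struct.unpack_from('<f', payload, offset)
               args.append(value)
               offset += 4
       return format_string % tuple(args)

   assert detokenize('$HL2VHA==') == (
       'Initiating retrieval process for recovery object')
   assert detokenize('$EgFj8lVVAUI=') == (
       'Calculated acceptable probability of success (32.33%)')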
| |
| .. note:: |
| |
| This example uses the :ref:`module-pw_tokenizer-base64-format`, which |
| occupies about 4/3 (133%) as much space as the default binary format when |
encoded. For projects that wish to interleave tokenized messages with plain
text, using Base64 is a worthwhile tradeoff.
| |
| See :ref:`module-pw_tokenizer-detokenization-guides` for detailed instructions |
| on how to do detokenization in different programming languages. |
| |
| ------------- |
| Compatibility |
| ------------- |
| * C11 |
| * C++14 |
| * Python 3 |
| |
| ------------ |
| Dependencies |
| ------------ |
| * ``pw_varint`` module |
| * ``pw_preprocessor`` module |
| * ``pw_span`` module |
| |
| --------------------------- |
| Limitations and future work |
| --------------------------- |
| |
| GCC bug: tokenization in template functions |
| =========================================== |
| GCC incorrectly ignores the section attribute for template `functions |
| <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70435>`_ and `variables |
| <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88061>`_. For example, the |
| following won't work when compiling with GCC and tokenized logging: |
| |
| .. code-block:: cpp |
| |
| template <...> |
| void DoThings() { |
| int value = GetValue(); |
| // This log won't work with tokenized logs due to the templated context. |
| PW_LOG_INFO("Got value: %d", value); |
| ... |
| } |
| |
| The bug causes tokenized strings in template functions to be emitted into |
| ``.rodata`` instead of the special tokenized string section. This causes two |
| problems: |
| |
| 1. Tokenized strings will not be discovered by the token database tools. |
| 2. Tokenized strings may not be removed from the final binary. |
| |
| There are two workarounds. |
| |
| #. **Use Clang.** Clang puts the string data in the requested section, as |
| expected. No extra steps are required. |
| |
| #. **Move tokenization calls to a non-templated context.** Creating a separate |
| non-templated function and invoking it from the template resolves the issue. |
| This enables tokenizing in most cases encountered in practice with |
| templates. |
| |
| .. code-block:: cpp |
| |
| // In .h file: |
void LogThings(int value);
| |
| template <...> |
| void DoThings() { |
| int value = GetValue(); |
| // This log will work: calls non-templated helper. |
| LogThings(value); |
| ... |
| } |
| |
| // In .cc file: |
| void LogThings(int value) { |
| // Tokenized logging works as expected in this non-templated context. |
| PW_LOG_INFO("Got value %d", value); |
| } |
| |
A third option, which isn't implemented yet, is to compile the binary twice:
once to extract the tokens, and once for the production binary (without
tokens). If this approach is interesting to you, please get in touch.
| |
| 64-bit tokenization |
| =================== |
| The Python and C++ detokenizing libraries currently assume that strings were |
| tokenized on a system with 32-bit ``long``, ``size_t``, ``intptr_t``, and |
| ``ptrdiff_t``. Decoding may not work correctly for these types if a 64-bit |
| device performed the tokenization. |
| |
| Supporting detokenization of strings tokenized on 64-bit targets would be |
| simple. This could be done by adding an option to switch the 32-bit types to |
| 64-bit. The tokenizer stores the sizes of these types in the |
| ``.pw_tokenizer.info`` ELF section, so the sizes of these types can be verified |
| by checking the ELF file, if necessary. |
| |
| Tokenization in headers |
| ======================= |
| Tokenizing code in header files (inline functions or templates) may trigger |
| warnings such as ``-Wlto-type-mismatch`` under certain conditions. That |
| is because tokenization requires declaring a character array for each tokenized |
| string. If the tokenized string includes macros that change value, the size of |
| this character array changes, which means the same static variable is defined |
| with different sizes. It should be safe to suppress these warnings, but, when |
| possible, code that tokenizes strings with macros that can change value should |
| be moved to source files rather than headers. |
| |
| .. _module-pw_tokenizer-tokenized-strings-as-args: |
| |
| Tokenized strings as ``%s`` arguments |
| ===================================== |
| Encoding ``%s`` string arguments is inefficient, since ``%s`` strings are |
| encoded 1:1, with no tokenization. It would be better to send a tokenized string |
| literal as an integer instead of a string argument, but this is not yet |
| supported. |
| |
A string token could be sent by marking an integer argument in a way
recognized by the detokenization tools. The detokenizer would expand the
argument to the string represented by the integer.
| |
| .. code-block:: cpp |
| |
| #define PW_TOKEN_ARG PRIx32 "<PW_TOKEN]" |
| |
| constexpr uint32_t answer_token = PW_TOKENIZE_STRING("Uh, who is there"); |
| |
| PW_TOKENIZE_STRING("Knock knock: %" PW_TOKEN_ARG "?", answer_token); |
| |
Strings with arguments could be encoded to a buffer, but since printf strings
are null-terminated, a binary encoding would not work. These strings could
instead be encoded in the prefixed Base64 format and sent as ``%s``. See
:ref:`module-pw_tokenizer-base64-format`.
| |
Another possibility: encode strings with arguments to a ``uint64_t`` and send
them as an integer. This would be efficient and simple, but would only support
a small number of arguments.
| |
| ---------------------------------- |
| C99 ``printf`` Compatibility Notes |
| ---------------------------------- |
| This implementation is designed to align with the |
| `C99 specification, section 7.19.6 |
| <https://www.dii.uchile.cl/~daespino/files/Iso_C_1999_definition.pdf>`_. |
Notably, this specification differs slightly from what most compilers
implement, since each compiler interprets undefined behavior in slightly
different ways. Treat the following description as the source of truth.
| |
| This implementation supports: |
| |
| - Overall Format: ``%[flags][width][.precision][length][specifier]`` |
| - Flags (Zero or More) |
- ``-``: Left-justify within the given field width; right justification is
the default (see the width modifier).
- ``+``: Forces the result to be preceded with a plus or minus sign (``+``
or ``-``), even for positive numbers. By default, only negative numbers are
preceded with a ``-`` sign.
| - (space): If no sign is going to be written, a blank space is inserted |
| before the value. |
- ``#``: Specifies that an alternative print syntax should be used.
- When used with ``o``, ``x``, or ``X`` specifiers, the value is preceded
with ``0``, ``0x``, or ``0X``, respectively, for values different from
zero.
- When used with ``a``, ``A``, ``e``, ``E``, ``f``, ``F``, ``g``, or ``G``,
it forces the written output to contain a decimal point even if no more
digits follow. By default, if no digits follow, no decimal point is
written.
| - ``0``: Left-pads the number with zeroes (``0``) instead of spaces when |
| padding is specified (see width sub-specifier). |
| - Width (Optional) |
| - ``(number)``: Minimum number of characters to be printed. If the value to |
| be printed is shorter than this number, the result is padded with blank |
| spaces or ``0`` if the ``0`` flag is present. The value is not truncated |
| even if the result is larger. If the value is negative and the ``0`` flag |
| is present, the ``0``\s are padded after the ``-`` symbol. |
| - ``*``: The width is not specified in the format string, but as an |
| additional integer value argument preceding the argument that has to be |
| formatted. |
| - Precision (Optional) |
| - ``.(number)`` |
| - For ``d``, ``i``, ``o``, ``u``, ``x``, ``X``, specifies the minimum |
| number of digits to be written. If the value to be written is shorter |
| than this number, the result is padded with leading zeros. The value is |
| not truncated even if the result is longer. |
| |
| - A precision of ``0`` means that no character is written for the value |
| ``0``. |
| |
| - For ``a``, ``A``, ``e``, ``E``, ``f``, and ``F``, specifies the number |
| of digits to be printed after the decimal point. By default, this is |
| ``6``. |
| |
| - For ``g`` and ``G``, specifies the maximum number of significant digits |
| to be printed. |
| |
| - For ``s``, specifies the maximum number of characters to be printed. By |
| default all characters are printed until the ending null character is |
| encountered. |
| |
| - If the period is specified without an explicit value for precision, |
| ``0`` is assumed. |
| - ``.*``: The precision is not specified in the format string, but as an |
| additional integer value argument preceding the argument that has to be |
| formatted. |
| - Length (Optional) |
- ``hh``: Usable with ``d``, ``i``, ``o``, ``u``, ``x``, or ``X`` specifiers
to convey the argument will be a ``signed char`` or ``unsigned char``.
However, this is largely ignored in the implementation, since it is not
needed for argument decoding (the argument is always encoded as at least a
32-bit integer).
- ``h``: Usable with ``d``, ``i``, ``o``, ``u``, ``x``, or ``X`` specifiers
to convey the argument will be a ``signed short int`` or
``unsigned short int``. However, this is largely ignored in the
implementation, since it is not needed for argument decoding (the argument
is always encoded as at least a 32-bit integer).
- ``l``: Usable with ``d``, ``i``, ``o``, ``u``, ``x``, or ``X`` specifiers
to convey the argument will be a ``signed long int`` or
``unsigned long int``. It is also usable with ``c`` and ``s`` to specify
that the arguments will be encoded with ``wchar_t`` values (which isn't
different from normal ``char`` values). However, this is largely ignored in
the implementation, since it is not needed for argument decoding (the
argument is always encoded as at least a 32-bit integer).
| - ``ll``: Usable with ``d``, ``i``, ``o``, ``u``, ``x``, or ``X`` specifiers |
| to convey the argument will be a ``signed long long int`` or |
| ``unsigned long long int``. This is required to properly decode the |
| argument as a 64-bit integer. |
- ``L``: Usable with ``a``, ``A``, ``e``, ``E``, ``f``, ``F``, ``g``, or
``G`` conversion specifiers to convey the argument will be a
``long double``. However, this is ignored in the implementation, since the
floating-point value encoding is unaffected by bit width.
- ``j``: Usable with ``d``, ``i``, ``o``, ``u``, ``x``, or ``X`` specifiers
to convey the argument will be an ``intmax_t`` or ``uintmax_t``.
| - ``z``: Usable with ``d``, ``i``, ``o``, ``u``, ``x``, or ``X`` specifiers |
| to convey the argument will be a ``size_t``. This will force the argument |
| to be decoded as an unsigned integer. |
| - ``t``: Usable with ``d``, ``i``, ``o``, ``u``, ``x``, or ``X`` specifiers |
| to convey the argument will be a ``ptrdiff_t``. |
| - If a length modifier is provided for an incorrect specifier, it is ignored. |
| - Specifier (Required) |
| - ``d`` / ``i``: Used for signed decimal integers. |
| |
| - ``u``: Used for unsigned decimal integers. |
| |
- ``o``: Used for unsigned integers and specifies formatting as an octal
number.

- ``x``: Used for unsigned integers and specifies formatting as a
hexadecimal number using all lowercase letters.

- ``X``: Used for unsigned integers and specifies formatting as a
hexadecimal number using all uppercase letters.
| |
| - ``f``: Used for floating-point values and specifies to use lowercase, |
| decimal floating point formatting. |
| |
| - Default precision is ``6`` decimal places unless explicitly specified. |
| |
| - ``F``: Used for floating-point values and specifies to use uppercase, |
| decimal floating point formatting. |
| |
| - Default precision is ``6`` decimal places unless explicitly specified. |
| |
| - ``e``: Used for floating-point values and specifies to use lowercase, |
| exponential (scientific) formatting. |
| |
| - Default precision is ``6`` decimal places unless explicitly specified. |
| |
| - ``E``: Used for floating-point values and specifies to use uppercase, |
| exponential (scientific) formatting. |
| |
| - Default precision is ``6`` decimal places unless explicitly specified. |
| |
- ``g``: Used for floating-point values and specifies to use ``f`` or ``e``
formatting, depending on which would be the shortest representation.
| |
| - Precision specifies the number of significant digits, not just digits |
| after the decimal place. |
| |
| - If the precision is specified as ``0``, it is interpreted to mean ``1``. |
| |
- ``e`` formatting is used if the exponent would be less than ``-4`` or is
greater than or equal to the precision.
| |
| - Trailing zeros are removed unless the ``#`` flag is set. |
| |
| - A decimal point only appears if it is followed by a digit. |
| |
| - ``NaN`` or infinities always follow ``f`` formatting. |
| |
- ``G``: Used for floating-point values and specifies to use ``F`` or ``E``
formatting, depending on which would be the shortest representation.
| |
| - Precision specifies the number of significant digits, not just digits |
| after the decimal place. |
| |
| - If the precision is specified as ``0``, it is interpreted to mean ``1``. |
| |
- ``E`` formatting is used if the exponent would be less than ``-4`` or is
greater than or equal to the precision.
| |
| - Trailing zeros are removed unless the ``#`` flag is set. |
| |
| - A decimal point only appears if it is followed by a digit. |
| |
| - ``NaN`` or infinities always follow ``F`` formatting. |
| |
| - ``c``: Used for formatting a ``char`` value. |
| |
| - ``s``: Used for formatting a string of ``char`` values. |
| |
| - If width is specified, the null terminator character is included as a |
| character for width count. |
| |
| - If precision is specified, no more ``char``\s than that value will be |
| written from the string (padding is used to fill additional width). |
| |
| - ``p``: Used for formatting a pointer address. |
| |
| - ``%``: Prints a single ``%``. Only valid as ``%%`` (supports no flags, |
| width, precision, or length modifiers). |
| |
| Underspecified details: |
| |
| - If both ``+`` and (space) flags appear, the (space) is ignored. |
| - The ``+`` and (space) flags will error if used with ``c`` or ``s``. |
| - The ``#`` flag will error if used with ``d``, ``i``, ``u``, ``c``, ``s``, or |
| ``p``. |
| - The ``0`` flag will error if used with ``c``, ``s``, or ``p``. |
| - Both ``+`` and (space) can work with the unsigned integer specifiers ``u``, |
| ``o``, ``x``, and ``X``. |
| - If a length modifier is provided for an incorrect specifier, it is ignored. |
- The ``z`` length modifier will decode arguments as signed as long as ``d``
or ``i`` is used.
| - ``p`` is implementation defined. |
| |
- For this implementation, it prints a ``0x`` prefix followed by the pointer
value, formatted with ``%08X``.
| |
| - ``p`` supports the ``+``, ``-``, and (space) flags, but not the ``#`` or |
| ``0`` flags. |
| |
| - None of the length modifiers are usable with ``p``. |
| |
| - This implementation will try to adhere to user-specified width (assuming the |
| width provided is larger than the guaranteed minimum of ``10``). |
| |
| - Specifying precision for ``p`` is considered an error. |
- Only ``%%`` is allowed with no other modifiers. Things like ``%+%`` will
fail to decode. Some C stdlib implementations accept modifiers between the
two ``%`` characters but ignore them in the output.
| - If a width is specified with the ``0`` flag for a negative value, the padded |
| ``0``\s will appear after the ``-`` symbol. |
| - A precision of ``0`` for ``d``, ``i``, ``u``, ``o``, ``x``, or ``X`` means |
| that no character is written for the value ``0``. |
| - Precision cannot be specified for ``c``. |
- Using ``*`` or fixed precision with the ``s`` specifier still requires the
string argument to be null-terminated. This is because argument encoding
happens on the C/C++ side, while the precision value is not read or
otherwise used until decoding happens in the Python code.
| |
| Non-conformant details: |
| |
- ``n`` specifier: The ``n`` specifier is not supported. Decoding happens
long after the device sent the message, usually on a separate processing
device entirely, so it is impossible to retroactively tell the original
program how many characters have been printed.
| |
| -------------------- |
| Deployment war story |
| -------------------- |
| The tokenizer module was developed to bring tokenized logging to an |
| in-development product. The product already had an established text-based |
| logging system. Deploying tokenization was straightforward and had substantial |
| benefits. |
| |
| Results |
| ======= |
* Log contents shrank by over 50%, even with Base64 encoding.
| |
| * Significant size savings for encoded logs, even using the less-efficient |
| Base64 encoding required for compatibility with the existing log system. |
| * Freed valuable communication bandwidth. |
| * Allowed storing many more logs in crash dumps. |
| |
| * Substantial flash savings. |
| |
* Reduced the size of firmware images by up to 18%.
| |
| * Simpler logging code. |
| |
| * Removed CPU-heavy ``snprintf`` calls. |
| * Removed complex code for forwarding log arguments to a low-priority task. |
| |
| This section describes the tokenizer deployment process and highlights key |
| insights. |
| |
| Firmware deployment |
| =================== |
| * In the project's logging macro, calls to the underlying logging function were |
| replaced with a tokenized log macro invocation. |
| * The log level was passed as the payload argument to facilitate runtime log |
| level control. |
* For this project, it was necessary to encode the log messages as text. In
the handler function, the log messages were encoded in the ``$``-prefixed
:ref:`module-pw_tokenizer-base64-format`, then dispatched as normal log
messages.
* Asserts were tokenized using a callback-based API that has since been
removed (a :ref:`custom macro <module-pw_tokenizer-custom-macro>` is a
better alternative).
| |
| .. attention:: |
| Do not encode line numbers in tokenized strings. This results in a huge |
| number of lines being added to the database, since every time code moves, |
| new strings are tokenized. If :ref:`module-pw_log_tokenized` is used, line |
numbers are encoded in the log metadata. Line numbers may also be included
by adding ``"%d"`` to the format string and passing ``__LINE__``.
| |
| .. _module-pw_tokenizer-database-management: |
| |
| Database management |
| =================== |
| * The token database was stored as a CSV file in the project's Git repo. |
| * The token database was automatically updated as part of the build, and |
| developers were expected to check in the database changes alongside their code |
| changes. |
| * A presubmit check verified that all strings added by a change were added to |
| the token database. |
| * The token database included logs and asserts for all firmware images in the |
| project. |
| * No strings were purged from the token database. |
| |
| .. tip:: |
| Merge conflicts may be a frequent occurrence with an in-source CSV database. |
| Use the :ref:`module-pw_tokenizer-directory-database-format` instead. |
| |
| Decoding tooling deployment |
| =========================== |
| * The Python detokenizer in ``pw_tokenizer`` was deployed to two places: |
| |
| * Product-specific Python command line tools, using |
| ``pw_tokenizer.Detokenizer``. |
| * Standalone script for decoding prefixed Base64 tokens in files or |
| live output (e.g. from ``adb``), using ``detokenize.py``'s command line |
| interface. |
| |
| * The C++ detokenizer library was deployed to two Android apps with a Java |
| Native Interface (JNI) layer. |
| |
| * The binary token database was included as a raw resource in the APK. |
| * In one app, the built-in token database could be overridden by copying a |
| file to the phone. |
| |
| .. tip:: |
| Make the tokenized logging tools simple to use for your project. |
| |
| * Provide simple wrapper shell scripts that fill in arguments for the |
| project. For example, point ``detokenize.py`` to the project's token |
| databases. |
| * Use ``pw_tokenizer.AutoUpdatingDetokenizer`` to decode in |
| continuously-running tools, so that users don't have to restart the tool |
| when the token database updates. |
| * Integrate detokenization everywhere it is needed. Integrating the tools |
| takes just a few lines of code, and token databases can be embedded in APKs |
| or binaries. |