| LZ4 Block Format Description |
| ============================ |
| Last revised: 2022-07-31 . |
| Author : Yann Collet |
| |
| |
| This specification is intended for developers willing to |
| produce or read LZ4 compressed data blocks |
| using any programming language of their choice. |
| |
| LZ4 is an LZ77-type compressor with a fixed byte-oriented encoding format. |
| There is no entropy encoder back-end nor framing layer. |
| The latter is assumed to be handled by other parts of the system |
| (see [LZ4 Frame format]). |
| This design is assumed to favor simplicity and speed. |
| |
| This document describes only the Block Format, |
| not how the compressor nor decompressor actually work. |
| For more details on such topics, see later section "Implementation Notes". |
| |
| [LZ4 Frame format]: lz4_Frame_format.md |
| |
| |
| |
| Compressed block format |
| ----------------------- |
| An LZ4 compressed block is composed of sequences. |
| A sequence is a suite of literals (not-compressed bytes), |
| followed by a match copy operation. |
| |
| Each sequence starts with a `token`. |
| The `token` is a one byte value, separated into two 4-bits fields. |
| Therefore each field ranges from 0 to 15. |
| |
| |
| The first field uses the 4 high-bits of the token. |
| It provides the length of literals to follow. |
| |
| If the field value is smaller than 15, |
| then it represents the total nb of literals present in the sequence, |
| including 0, in which case there is no literal. |
| |
| The value 15 is a special case: more bytes are required to indicate the full length. |
| Each additional byte then represents a value from 0 to 255, |
| which is added to the previous value to produce a total length. |
| When the byte value is 255, another byte must be read and added, and so on. |
| There can be any number of bytes of value `255` following `token`. |
| The Block Format does not define any "size limit", |
| though real implementations may feature some practical limits |
| (see more details in later chapter "Implementation Notes"). |
| |
| Note : this format explains why a non-compressible input block is expanded by 0.4%. |
| |
| Example 1 : A literal length of 48 will be represented as : |
| |
| - 15 : value for the 4-bits High field |
| - 33 : (=48-15) remaining length to reach 48 |
| |
| Example 2 : A literal length of 280 will be represented as : |
| |
| - 15 : value for the 4-bits High field |
| - 255 : following byte is maxed, since 280-15 >= 255 |
| - 10 : (=280 - 15 - 255) remaining length to reach 280 |
| |
| Example 3 : A literal length of 15 will be represented as : |
| |
| - 15 : value for the 4-bits High field |
| - 0 : (=15-15) yes, the zero must be output |
| |
| Following `token` and optional length bytes, are the literals themselves. |
| They are exactly as numerous as just decoded (length of literals). |
| Reminder: it's possible that there are zero literals. |
| |
| |
| Following the literals is the match copy operation. |
| |
| It starts by the `offset` value. |
| This is a 2 bytes value, in little endian format |
| (the 1st byte is the "low" byte, the 2nd one is the "high" byte). |
| |
| The `offset` represents the position of the match to be copied from the past. |
| For example, 1 means "current position - 1 byte". |
| The maximum `offset` value is 65535. 65536 and beyond cannot be coded. |
| Note that 0 is an invalid `offset` value. |
| The presence of a 0 `offset` value denotes an invalid (corrupted) block. |
| |
| Then the `matchlength` can be extracted. |
| For this, we use the second `token` field, the low 4-bits. |
| Such a value, obviously, ranges from 0 to 15. |
| However here, 0 means that the copy operation is minimal. |
| The minimum length of a match, called `minmatch`, is 4. |
| As a consequence, a 0 value means 4 bytes. |
| Similarly to literal length, any value smaller than 15 represents a length, |
| to which 4 (`minmatch`) must be added, thus ranging from 4 to 18. |
| A value of 15 is special, meaning 19+ bytes, |
| to which one must read additional bytes, one at a time, |
| with each byte value ranging from 0 to 255. |
| They are added to total to provide the final match length. |
| A 255 value means there is another byte to read and add. |
| There is no limit to the number of optional `255` bytes that can be present, |
| and therefore no limit to representable match length, |
| though real-life implementations are likely going to enforce limits for practical reasons (see more details in "Implementation Notes" section below). |
| |
| Note: this format has a maximum achievable compression ratio of about ~250. |
| |
| Decoding the `matchlength` reaches the end of current sequence. |
| Next byte will be the start of another sequence, and therefore a new `token`. |
| |
| |
| End of block conditions |
| ------------------------- |
| There are specific restrictions required to terminate an LZ4 block. |
| |
| 1. The last sequence contains only literals. |
| The block ends right after the literals (no `offset` field). |
| 2. The last 5 bytes of input are always literals. |
| Therefore, the last sequence contains at least 5 bytes. |
| - Special : if input is smaller than 5 bytes, |
| there is only one sequence, it contains the whole input as literals. |
| Even empty input can be represented, using a zero byte, |
| interpreted as a final token without literal and without a match. |
| 3. The last match must start at least 12 bytes before the end of block. |
| The last match is part of the _penultimate_ sequence. |
| It is followed by the last sequence, which contains _only_ literals. |
| - Note that, as a consequence, |
| blocks < 12 bytes cannot be compressed. |
| And as an extension, _independent_ blocks < 13 bytes cannot be compressed, |
| because they must start by at least one literal, |
| that the match can then copy afterwards. |
| |
| When a block does not respect these end conditions, |
| a conformant decoder is allowed to reject the block as incorrect. |
| |
| These rules are in place to ensure compatibility with |
| a wide range of historical decoders |
| which rely on these conditions for their speed-oriented design. |
| |
| Implementation notes |
| ----------------------- |
| The LZ4 Block Format only defines the compressed format, |
| it does not tell how to create a decoder or an encoder, |
| which design is left free to the imagination of the implementer. |
| |
| However, thanks to experience, there are a number of typical topics that |
| most implementations will have to consider. |
| This section tries to provide a few guidelines. |
| |
| #### Metadata |
| |
| An LZ4-compressed Block requires additional metadata for proper decoding. |
| Typically, a decoder will require the compressed block's size, |
| and an upper bound of decompressed size. |
| Other variants exist, such as knowing the decompressed size, |
| and having an upper bound of the input size. |
| The Block Format does not specify how to transmit such information, |
| which is considered an out-of-band information channel. |
| That's because in many cases, the information is present in the environment. |
| For example, databases must store the size of their compressed block for indexing, |
| and know that their decompressed block can't be larger than a certain threshold. |
| |
| If you need a format which is "self-contained", |
| and also transports the necessary metadata for proper decoding on any platform, |
| consider employing the [LZ4 Frame format] instead. |
| |
| #### Large lengths |
| |
| While the Block Format does not define any maximum value for length fields, |
| in practice, most implementations will feature some form of limit, |
| since it's expected for such values to be stored into registers of fixed bit width. |
| |
| If length fields use 64-bit registers, |
| then it can be assumed that there is no practical limit, |
| as it would require a single continuous block of multiple petabytes to reach it, |
| which is unreasonable by today's standard. |
| |
| If length fields use 32-bit registers, then it can be overflowed, |
| but requires a compressed block of size > 16 MB. |
| Therefore, implementations that do not deal with compressed blocks > 16 MB are safe. |
| However, if such a case is allowed, |
| then it's recommended to check that no large length overflows the register. |
| |
| If length fields use 16-bit registers, |
| then it's definitely possible to overflow such register, |
| with less than < 300 bytes of compressed data. |
| |
| A conformant decoder should be able to detect length overflows when it's possible, |
| and simply error out when that happens. |
| The input block might not be invalid, |
| it's just not decodable by the local decoder implementation. |
| |
| Note that, in order to be compatible with the larger LZ4 ecosystem, |
| it's recommended to be able to read and represent lengths of up to 4 MB, |
| and to accept blocks of size up to 4 MB. |
| Such limits are compatible with 32-bit length registers, |
| and prevent overflow of 32-bit registers. |
| |
| #### Safe decoding |
| |
| If a decoder receives compressed data from any external source, |
| it is recommended to ensure that the decoder is resilient to corrupted input, |
| and made safe from buffer overflow manipulations. |
| Always ensure that read and write operations |
| remain within the limits of provided buffers. |
| |
| Of particular importance, ensure that the nb of bytes instructed to copy |
| does not overflow neither the input nor the output buffers. |
| Ensure also, when reading an offset value, that the resulting position to copy |
| does not reach beyond the beginning of the buffer. |
| Such a situation can happen during the first 64 KB of decoded data. |
| |
| For more safety, test the decoder with fuzzers |
| to ensure it's resilient to improbable sequences of conditions. |
| Combine them with sanitizers, in order to catch overflows (asan) |
| or initialization issues (msan). |
| |
| Pay some attention to offset 0 scenario, which is invalid, |
| and therefore must not be blindly decoded: |
| a naive implementation could preserve destination buffer content, |
| which could then result in information disclosure |
| if such buffer was uninitialized and still containing private data. |
| For reference, in such a scenario, the reference LZ4 decoder |
| clears the match segment with `0` bytes, |
| though other solutions are certainly possible. |
| |
| Finally, pay attention to the "overlap match" scenario, |
| when `matchlength` is larger than `offset`. |
| In which case, since `match_pos + matchlength > current_pos`, |
| some of the later bytes to copy do not exist yet, |
| and will be generated during the early stage of match copy operation. |
| Such scenario must be handled with special care. |
| A common case is an offset of 1, |
| meaning the last byte is repeated `matchlength` times. |
| |
| #### Compression techniques |
| |
| The core of a LZ4 compressor is to detect duplicated data across past 64 KB. |
| The format makes no assumption nor limits to the way a compressor |
| searches and selects matches within the source data block. |
| For example, an upper compression limit can be reached, |
| using a technique called "full optimal parsing", at high cpu and memory cost. |
| But multiple other techniques can be considered, |
| featuring distinct time / performance trade-offs. |
| As long as the specified format is respected, |
| the result will be compatible with and decodable by any compliant decoder. |