| # Puffin: A deterministic deflate re-compressor (for patching purposes) |
| |
| ## Table of Contents |
| |
| [TOC] |
| |
| ## Puffin |
| |
| Puffin is a deterministic deflate recompressor. It is mainly used for patching |
| deflate compressed images (zip, gzip, etc.) because patches created from deflate |
| files/streams are usually large (deflate has a bit-aligned format, hence, |
| changing one byte in the raw data can cause the entire deflate stream to change |
| drastically.) |
| |
| Puffin has two tools (operations): `puffdiff` and `puffpatch` (shown |
| [here](puffin).) The purpose of `puffdiff` operation is to create a patch |
| between a source and a target file with one or both of them having some deflate |
| streams. This patch is used in the `puffpatch` operation to generate the target |
| file from the source file deterministically. The patch itself is created by |
| `bsdiff` library (but can be replaced by any other diffing mechanism). But, |
| before it uses `bsdiff` to create the patch, we have to transform both the |
| source and target files into a specific format so `bsdiff` can produce smaller |
| patches. We define `puff` operation to perform such a transformation. The |
| resulting stream is called a `puff` stream. The reverse transformation (from |
| `puff` stream to deflate stream) is called a `huff` operation. `huff` is used in |
| the client to transform the `puff` stream back into its original deflate stream |
| deterministically. |
| |
|  |
| |
| For information about deflate format see [RFC 1951]. |
| |
| ## puff and huff |
| |
| `puff` is an operation that decompresses only the Huffman part of the deflate |
| stream and keeps the structure of the LZ77 coding unchanged. This is roughly |
| equivalent of decompressing ‘half way’. |
| |
| `huff` is the exact opposite of `puff` and it deterministically converts the |
| `puff` stream back to its original deflate stream. This deterministic conversion |
| is based on two facts: |
| |
| * There is no need to perform LZ77 algorithm. This means the deflate stream |
| could be built by any LZ77 algorithm. |
| * The dynamic Huffman tables can be recreated uniquely from the code length |
| array stored inside the `puff` stream. This means the deflate stream can be |
| reconstructed deterministically using the Huffman tables. |
| |
| The inclusion of Huffman tables in the `puff` stream has minuscule space burden |
| (on average maximum 320 bytes for each block. There is roughly one block per |
| 32KB of uncompressed data). |
| |
| `bsdiff` of two `puffed` streams has much smaller patch in comparison to their |
| deflate streams, but it is larger than uncompressed streams. |
| |
| **The major benefits** |
| |
| * Size of the patch is smaller than deflate stream’s patch. |
| * `puff` and `huff` are deterministic operations. |
| * `huff` is in order of 10X faster than full recompression. It is even faster |
| than Huffman algorithm because it already has the Huffman tables and does |
| not need to reconstruct it. |
| * Both algorithms have very low memory footprint. |
| * The underlying LZ77 algorithm can be anything (as long as it is deflate |
| compatible). This includes google’s [zopfli] |
| |
| **The drawbacks** |
| |
| * The need to define a file format for the puffed stream and stay with this |
| format forever. If format needs to be changed in the future, then some |
| versioning mechanism should be there to handle it and backward compatibility |
| should be maintained. |
| * The need to define a payload format and stay with it forever. Similarly |
| needs to be versioned if required later change. |
| * Does not reduces the patch size as full recompression. |
| |
| |
| ## puffdiff and puffpatch |
| |
| A deflate compressed file (gzip, zip, etc.) contains multiple deflate streams at |
| different locations. When this file is puffed, the resulting `puff` stream |
| contains both puffs and the raw data that existed between the deflates in the |
| compressed file. When performing `huff` operation, the location of puffs in the |
| `puff` stream and deflates in the deflate stream should be passed upon. This is |
| necessary as `huff` operation has to know exactly where the locations of both |
| puffs and deflates are. (See the following [image](puffin-stream)) |
| |
|  |
| |
| Similarly `puffpatch` requires deflates and puffs locations in both the source |
| and target streams. These location information is saved using Protobufs in the |
| patch generated by `bsdiff`. One requirement for these two operations are that |
| `puffpatch` has to be efficient in the client. Client devices have limited |
| memory and CPU bandwidth and it is necessary that each of these operations are |
| performed with the most efficiency available. In order to achieve this |
| efficiency a new operation can be added to `bspatch`, that reads and writes into |
| a `puff` streams using special interfaces for puffing and huffing on the fly. |
| |
| |
| ## Puffin Patch Format |
| |
| * Magic (4 bytes) - The string literal "PUF1". |
| * Header Length (4 bytes) - The size of the header (length of the generated |
| Protobuf). |
| * Header - Lengths and locations of deflate and puffs streams in source and |
| target files in a Protobuf format. See [puffin.proto]. |
| * Patch - This is a binary array directly generated by the `bsdiff` program. |
| |
| ## Puffin Stream Format |
| |
| * [Block Header](#block-header-format) (3+ bytes) - Defines the type of the |
| block. |
| * Data - A mix of literals list and copy length/distances. |
| * End of Block (2 bytes) - The end of block symbol. Similar to the symbols for |
| copy length/distance but without the distance bytes. The actual copy length |
| value is 259 (0x81FF). |
| |
| ### Block Header Format |
| |
| * Length (2 Bytes) - The size of the block header excluding the two Length |
| bytes itself - 1. |
| * Final Block (1 Bit) - Whether the block is the last one or not. |
| * 1 - Final block |
| * 0 - Middle block |
| * Type (2 Bits) |
| * 0 - Uncompressed. Immediately after the header, zero or one literals |
| list is present which defines the content of the uncompressed blocks. |
| * 1 - Fixed Huffman table. The list of literals and length/distances comes |
| immediately after the header. Fixed Huffman table is defined the deflate |
| specification and will not change. |
| * 2 - Dynamic Huffman table. The dynamic Huffman table comes at the end of |
| the block header. |
| * 3 - Undefined. This is an error. |
| * Skip Bits (5 Bits) - Used only for uncompressed blocks (For other types its |
| value is zero). In an uncompressed block, the [RFC 1951] skips any bits |
| after reading final block and type bits until the byte boundary in the |
| input. However, it does not define what the value of these bits should |
| be. Most implementations assume 0, but some implementations may use any |
| value. This segment contains those bits as a five-bits integer. When writing |
| the block header back to the deflate format, the actual number of bits which |
| where skipped will be calculated easily. |
| * Huffman Table - It only comes for dynamic Huffman table. |
| |
| #### Dynamic Huffman Table Format |
| |
| Depending on the scheme for storing the Huffman tables, the payload size can |
| change. We discovered that the following scheme does not produce the smallest |
| payload, but it is the most deterministic one. In a deflate stream, Huffman |
| tables for literals/length and distances are themselves Huffman coded. In this |
| format, we also `puff` the Huffman tables into the `puff` stream instead of |
| completely decompressing them. |
| |
| There are three tables stored in this structure very similar to the one defined |
| in [RFC 1951]. A Huffman table can be defined as an array of unsigned integer |
| code length values. Three Puffed Huffman tables appear like the following |
| scheme. The first table (codes) is the Huffman table used to decode the next two |
| Huffman tables. The second Huffman table is used to decode literals and lengths, |
| and the third Huffman table is used to decode distances. |
| |
| |
| * Literal/Length Count (1 byte) - Number of alphabets used for literal/length |
| Huffman codes - 257. |
| * Distance Count (1 byte) - Number of alphabets used for distance Huffman |
| codes - 1. |
| * Alphabet Count (1 byte) - Number of alphabets for coding two previous |
| Huffman tables - 4. |
| * Code Lengths ((Alphabet Count + 1) / 2 bytes) - A list of codes for reading |
| the next two Huffman tables. Each byte contains two codes. If the number of |
| codes is odd, the last four Bits will be zero. |
| * Literal/Length/Distance Code Lengths - List of code lengths for encoding |
| literals/lengths/distance followed The encoding is as follows: |
| * [0..15] - Represent code [0..15] |
| * [16..19] - Copy the last code length [3..6] times. |
| * [20..27] - Repeat code length of 0 for [3..10] times. |
| * [28..155] - Repeat code length of 0 for [11..138] times. |
| |
| ### Literals List |
| Literals lists are constructed by a “length” value followed by “length” bytes of |
| literals. The Puffer should try to merge adjacent literals lists as much as |
| possible into one literals list in the `puff` stream. This Is a length value |
| followed by length bytes of literals (Even if there is only one literal.) |
| |
| * Tag (1 Bit) - The value is 0 indicating that this is a list of literals (not |
| a copy length/distance). |
| * Length - The number of literals that would follow in the list. |
| * (7 Bits) Length = [1 .. 127], The value is: Length - 1 |
| * (7 Bits + 2 Bytes) Length = [128 .. 65663], The values are: 127, |
| Length - 127. |
| Conserves size by using only one byte if the number of upcoming |
| literals is smaller or equal to 127 (About 99% of literals length in a |
| normal deflate stream fit into this category.) We should never have zero |
| length literals. Otherwise it will use three bytes. |
| * Literals (Length Bytes) - A sequence of Length number of literals. |
| |
| ### Copy Length/Distance |
| |
| This Is a Length value followed by a Distance value. |
| |
| * Tag (1 Bit) - The value is 1 indicating that this is a copy length/distance |
| field. |
| * Length - Conserves size by using only one byte if the length value is |
| smaller or equal to 129. Otherwise two bytes are used. |
| * (7 Bits) - Length = [3 .. 129], The value is: Length - 3 |
| * (7 Bites + 1 Byte) Length = [130 .. 258], The value is: 127, Length - |
| 130: |
| * Distance (2 Bytes) - Distance = [1 .. 32768], The value is: |
| Distance - 1. The distance value as an unsigned integer. |
| |
| ## Building Puffin |
| Currently Puffin is being used in both Android and Chrome OS and is built |
| differently in each of them. There is also a Makefile build, but it is not |
| comprehensive. |
| |
| ## Usage |
| |
| Puffin builds an executable `puffin` which can be used to perform diffing and |
| patching algorithms using Puffin format. To get the list of options available |
| run: |
| |
| ```shell |
| puffin --help |
| ``` |
| |
| It can also be used as a library (currently used by update_engine) that provides |
| different APIs. |
| |
| ## References |
| |
| [RFC 1951]: https://www.ietf.org/rfc/rfc1951.txt |
| [puffin.proto]: /src/puffin.proto |
| [zopfli]: https://en.wikipedia.org/wiki/Zopfli |