| =================================== |
| Technical "whitepaper" for afl-fuzz |
| =================================== |
| |
| This document provides a quick overview of the guts of American Fuzzy Lop. |
| See README for the general instruction manual; and for a discussion of |
| motivations and design goals behind AFL, see historical_notes.txt. |
| |
| 0) Design statement |
| ------------------- |
| |
| American Fuzzy Lop does its best not to focus on any singular principle of |
| operation and not to be a proof-of-concept for any specific theory. The tool |
| can be thought of as a collection of hacks that have been tested in practice, |
| found to be surprisingly effective, and implemented in the simplest, most |
| robust way I could think of at the time. |
| |
| Many of the resulting features are made possible thanks to the availability of |
| lightweight instrumentation that served as a foundation for the tool, but this |
| mechanism should be thought of merely as a means to an end. The only true |
| governing principles are speed, reliability, and ease of use. |
| |
| 1) Coverage measurements |
| ------------------------ |
| |
| The instrumentation injected into compiled programs captures branch (edge) |
| coverage, along with coarse branch-taken hit counts. The code injected at |
| branch points is essentially equivalent to: |
| |
| cur_location = <COMPILE_TIME_RANDOM>; |
| shared_mem[cur_location ^ prev_location]++; |
| prev_location = cur_location >> 1; |
| |
| The cur_location value is generated randomly to simplify the process of |
| linking complex projects and keep the XOR output distributed uniformly. |
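| |
| As a quick illustration, the update rule can be modeled in Python (a toy |
| sketch with made-up block names and random IDs, not the actual C |
| implementation): |
| |
```python
import random

MAP_SIZE = 1 << 16  # the 64 kB shared-memory region

# Stand-ins for the compile-time random IDs, one per branch point.
random.seed(1)
location_id = {block: random.randrange(MAP_SIZE) for block in "ABCDE"}

def run_trace(path):
    """Replay the injected pseudocode for a sequence of basic blocks."""
    shared_mem = [0] * MAP_SIZE
    prev_location = 0
    for block in path:
        cur_location = location_id[block]
        shared_mem[cur_location ^ prev_location] += 1
        prev_location = cur_location >> 1
    return shared_mem

# Identical block sets, different order: the maps light up different
# bytes, because the map records (branch_src, branch_dst) tuples.
trace1 = run_trace("ABCDE")
trace2 = run_trace("ABDCE")
```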
| |
| The shared_mem[] array is a 64 kB SHM region passed to the instrumented binary |
| by the caller. Every byte set in the output map can be thought of as a hit for |
| a particular (branch_src, branch_dst) tuple in the instrumented code. |
| |
| The size of the map is chosen so that collisions are sporadic with almost all |
| of the intended targets, which usually sport between 2k and 10k discoverable |
| branch points: |
| |
| Branch cnt | Colliding tuples | Example targets |
| ------------+------------------+----------------- |
| 1,000 | 0.75% | giflib, lzo |
| 2,000 | 1.5% | zlib, tar, xz |
| 5,000 | 3.5% | libpng, libwebp |
| 10,000 | 7% | libxml |
| 20,000 | 14% | sqlite |
| 50,000 | 30% | - |
| |
| At the same time, its size is small enough to allow the map to be analyzed |
| in a matter of microseconds on the receiving end, and to effortlessly fit |
| within L2 cache. |
| |
| This form of coverage provides considerably more insight into the execution |
| path of the program than simple block coverage. In particular, it trivially |
| distinguishes between the following execution traces: |
| |
| A -> B -> C -> D -> E (tuples: AB, BC, CD, DE) |
| A -> B -> D -> C -> E (tuples: AB, BD, DC, CE) |
| |
| This aids the discovery of subtle fault conditions in the underlying code, |
| because security vulnerabilities are more often associated with unexpected |
| or incorrect state transitions than with merely reaching a new basic block. |
| |
| The reason for the shift operation in the last line of the pseudocode shown |
| earlier in this section is to preserve the directionality of tuples (without |
| this, A ^ B would be indistinguishable from B ^ A) and to retain the identity |
| of tight loops (otherwise, A ^ A would be obviously equal to B ^ B). |
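| |
| A minimal sketch of both effects, using two made-up 16-bit branch IDs: |
| |
```python
def edge_index(src, dst, shift):
    """Map index for the (src, dst) tuple: dst ^ (src >> shift)."""
    return dst ^ (src >> shift)

A, B = 0x1234, 0x5678  # hypothetical compile-time branch IDs

# Without the shift, direction and loop identity are lost:
assert edge_index(A, B, 0) == edge_index(B, A, 0)       # A->B same as B->A
assert edge_index(A, A, 0) == edge_index(B, B, 0) == 0  # all self-loops collide

# With the one-bit shift, both properties are preserved for these IDs:
assert edge_index(A, B, 1) != edge_index(B, A, 1)
assert edge_index(A, A, 1) != edge_index(B, B, 1)
```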
| |
| The absence of simple saturating arithmetic opcodes on Intel CPUs means that |
| the hit counters can sometimes wrap around to zero. Since this is a fairly |
| unlikely and localized event, it's seen as an acceptable performance trade-off. |
| |
| 2) Detecting new behaviors |
| -------------------------- |
| |
| The fuzzer maintains a global map of tuples seen in previous executions; this |
| data can be rapidly compared with individual traces and updated in just a couple |
| of dword- or qword-wide instructions and a simple loop. |
| |
| When a mutated input produces an execution trace containing new tuples, the |
| corresponding input file is preserved and routed for additional processing |
| later on (see section #3). Inputs that do not trigger new local-scale state |
| transitions in the execution trace (i.e., produce no new tuples) are discarded, |
| even if their overall control flow sequence is unique. |
| |
| This approach allows for a very fine-grained and long-term exploration of |
| program state while not having to perform any computationally intensive and |
| fragile global comparisons of complex execution traces, and while avoiding the |
| scourge of path explosion. |
| |
| To illustrate the properties of the algorithm, note that the second trace |
| shown below would be considered substantially new because of the presence of |
| new tuples (CA, AE): |
| |
| #1: A -> B -> C -> D -> E |
| #2: A -> B -> C -> A -> E |
| |
| At the same time, with #2 processed, the following pattern will not be seen |
| as unique, despite having a markedly different overall execution path: |
| |
| #3: A -> B -> C -> A -> B -> C -> A -> B -> C -> D -> E |
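| |
| The global-map bookkeeping amounts to simple set operations; a toy Python |
| sketch, replaying the three traces above, confirms that #2 is flagged as new |
| while #3 is not: |
| |
```python
def tuples(path):
    """Ordered (src, dst) pairs for a block-level execution path."""
    return {(a, b) for a, b in zip(path, path[1:])}

seen = set()  # global map of tuples observed in previous executions

def has_new_tuples(path):
    new = tuples(path) - seen
    seen.update(tuples(path))
    return bool(new)

r1 = has_new_tuples("ABCDE")        # trace #1: everything is new
r2 = has_new_tuples("ABCAE")        # trace #2: adds CA and AE
r3 = has_new_tuples("ABCABCABCDE")  # trace #3: no new tuples
```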
| |
| In addition to detecting new tuples, the fuzzer also considers coarse tuple |
| hit counts. These are divided into several buckets: |
| |
| 1, 2, 3, 4-7, 8-15, 16-31, 32-127, 128+ |
| |
| To some extent, the number of buckets is an implementation artifact: it allows |
| an in-place mapping of an 8-bit counter generated by the instrumentation to |
| an 8-position bitmap relied on by the fuzzer executable to keep track of the |
| already-seen execution counts for each tuple. |
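| |
| Functionally, the bucketing works out to the mapping below (a sketch; the |
| real implementation does this in place with a lookup table): |
| |
```python
# Upper bounds of the eight buckets listed above (8-bit counters).
BOUNDS = [1, 2, 3, 7, 15, 31, 127, 255]

def bucket(hits):
    """Map a raw hit count (1-255) to a one-hot bit in an 8-bit bitmap."""
    for i, hi in enumerate(BOUNDS):
        if hits <= hi:
            return 1 << i

# Changes within a bucket are invisible; bucket transitions are flagged.
assert bucket(47) == bucket(48)  # both fall into the 32-127 bucket
assert bucket(1) != bucket(2)    # crossing from bucket 1 to bucket 2
```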
| |
| Changes within the range of a single bucket are ignored; transition from one |
| bucket to another is flagged as an interesting change in program control flow, |
| and is routed to the evolutionary process outlined in the section below. |
| |
| The hit count behavior provides a way to distinguish between potentially |
| interesting control flow changes, such as a block of code being executed |
| twice when it was normally hit only once. At the same time, it is fairly |
| insensitive to empirically less notable changes, such as a loop going from |
| 47 cycles to 48. The counters also provide some degree of "accidental" |
| immunity against tuple collisions in dense trace maps. |
| |
| The execution is policed fairly heavily through memory and execution time |
| limits; by default, the timeout is set at 5x the initially-calibrated |
| execution speed, rounded up to 20 ms. The aggressive timeouts are meant to |
| prevent dramatic fuzzer performance degradation by descending into tarpits |
| that, say, improve coverage by 1% while being 100x slower; we pragmatically |
| reject them and hope that the fuzzer will find a less expensive way to reach |
| the same code. Empirical testing strongly suggests that more generous time |
| limits are not worth the cost. |
| |
| 3) Evolving the input queue |
| --------------------------- |
| |
| Mutated test cases that produced new state transitions within the program are |
| added to the input queue and used as a starting point for future rounds of |
| fuzzing. They supplement, but do not automatically replace, existing finds. |
| |
| In contrast to more greedy genetic algorithms, this approach allows the tool |
| to progressively explore various disjoint and possibly mutually incompatible |
| features of the underlying data format, as shown in this image: |
| |
| http://lcamtuf.coredump.cx/afl/afl_gzip.png |
| |
| Several practical examples of the results of this algorithm are discussed |
| here: |
| |
| http://lcamtuf.blogspot.com/2014/11/pulling-jpegs-out-of-thin-air.html |
| http://lcamtuf.blogspot.com/2014/11/afl-fuzz-nobody-expects-cdata-sections.html |
| |
| The synthetic corpus produced by this process is essentially a compact |
| collection of "hmm, this does something new!" input files, and can be used to |
| seed any other testing processes down the line (for example, to manually |
| stress-test resource-intensive desktop apps). |
| |
| With this approach, the queue for most targets grows to somewhere between 1k |
| and 10k entries; approximately 10-30% of this is attributable to the discovery |
| of new tuples, and the remainder is associated with changes in hit counts. |
| |
| The following table compares the relative ability to discover file syntax and |
| explore program states when using several different approaches to guided |
| fuzzing. The instrumented target was GNU patch 2.7.3 compiled with -O3 and |
| seeded with a dummy text file; the session consisted of a single pass over the |
| input queue with afl-fuzz: |
| |
| Fuzzer guidance | Blocks | Edges | Edge hit | Highest-coverage |
| strategy used | reached | reached | cnt var | test case generated |
| ------------------+---------+---------+----------+--------------------------- |
| (Initial file) | 156 | 163 | 1.00 | (none) |
| | | | | |
| Blind fuzzing S | 182 | 205 | 2.23 | First 2 B of RCS diff |
| Blind fuzzing L | 228 | 265 | 2.23 | First 4 B of -c mode diff |
| Block coverage | 855 | 1,130 | 1.57 | Almost-valid RCS diff |
| Edge coverage | 1,452 | 2,070 | 2.18 | One-chunk -c mode diff |
| AFL model | 1,765 | 2,597 | 4.99 | Four-chunk -c mode diff |
| |
| The first entry for blind fuzzing ("S") corresponds to executing just a single |
| round of testing; the second set of figures ("L") shows the fuzzer running in a |
| loop for a number of execution cycles comparable with that of the instrumented |
| runs, which required more time to fully process the growing queue. |
| |
| Roughly similar results have been obtained in a separate experiment where the |
| fuzzer was modified to compile out all the random fuzzing stages and leave just |
| a series of rudimentary, sequential operations such as walking bit flips. |
| Because this mode would be incapable of altering the size of the input file, |
| the sessions were seeded with a valid unified diff: |
| |
| Queue extension | Blocks | Edges | Edge hit | Number of unique |
| strategy used | reached | reached | cnt var | crashes found |
| ------------------+---------+---------+----------+------------------ |
| (Initial file) | 624 | 717 | 1.00 | - |
| | | | | |
| Blind fuzzing | 1,101 | 1,409 | 1.60 | 0 |
| Block coverage | 1,255 | 1,649 | 1.48 | 0 |
| Edge coverage | 1,259 | 1,734 | 1.72 | 0 |
| AFL model | 1,452 | 2,040 | 3.16 | 1 |
| |
| As noted earlier on, some of the prior work on genetic fuzzing relied on |
| maintaining a single test case and evolving it to maximize coverage. At least |
| in the tests described above, this "greedy" approach appears to confer no |
| substantial benefits over blind fuzzing strategies. |
| |
| 4) Culling the corpus |
| --------------------- |
| |
| The progressive state exploration approach outlined above means that some of |
| the test cases synthesized later on in the game may have edge coverage that |
| is a strict superset of the coverage provided by their ancestors. |
| |
| To optimize the fuzzing effort, AFL periodically re-evaluates the queue using a |
| fast algorithm that selects a smaller subset of test cases that still cover |
| every tuple seen so far, and whose characteristics make them particularly |
| favorable to the tool. |
| |
| The algorithm works by assigning every queue entry a score proportional to its |
| execution latency and file size; and then selecting lowest-scoring candidates |
| for each tuple. |
| |
| The tuples are then processed sequentially using a simple workflow: |
| |
| 1) Find next tuple not yet in the temporary working set, |
| |
| 2) Locate the winning queue entry for this tuple, |
| |
| 3) Register *all* tuples present in that entry's trace in the working set, |
| |
| 4) Go to #1 if there are any missing tuples in the set. |
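| |
| A compact Python model of this workflow (the entry scores and tuple sets |
| below are hypothetical): |
| |
```python
def cull(queue):
    """Pick 'favored' entries that still cover every tuple seen so far.

    Each queue entry is (score, tuple_set), where a lower score
    (execution latency x file size) makes the entry a better winner.
    """
    # Lowest-scoring candidate for each tuple.
    winner = {}
    for score, tuple_set in sorted(queue, key=lambda e: e[0]):
        for t in tuple_set:
            winner.setdefault(t, (score, tuple_set))

    favored, working_set = [], set()
    for t in winner:                    # 1) next tuple not yet covered
        if t in working_set:
            continue
        score, tuple_set = winner[t]    # 2) winning entry for this tuple
        favored.append((score, tuple_set))
        working_set |= tuple_set        # 3) register *all* its tuples
    return favored                      # 4) done once nothing is missing

queue = [(10, {"AB", "BC"}), (3, {"AB", "BC", "CD"}), (7, {"DE"})]
fav = cull(queue)  # the score-10 entry is redundant and never selected
```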
| |
| The generated corpus of "favored" entries is usually 5-10x smaller than the |
| starting data set. Non-favored entries are not discarded, but they are skipped |
| with varying probabilities when encountered in the queue: |
| |
| - If there are new, yet-to-be-fuzzed favorites present in the queue, 99% |
| of non-favored entries will be skipped to get to the favored ones. |
| |
| - If there are no new favorites: |
| |
| - If the current non-favored entry was fuzzed before, it will be skipped |
| 95% of the time. |
| |
| - If it hasn't gone through any fuzzing rounds yet, the odds of skipping |
| drop down to 75%. |
| |
| Based on empirical testing, this provides a reasonable balance between queue |
| cycling speed and test case diversity. |
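| |
| Expressed as a probability function (a sketch; the percentages are the ones |
| quoted above): |
| |
```python
def skip_probability(is_favored, was_fuzzed, pending_favorites):
    """Odds of skipping a queue entry, per the rules above."""
    if is_favored:
        return 0.0   # favored entries are not skipped on these grounds
    if pending_favorites:
        return 0.99  # rush toward the not-yet-fuzzed favorites
    return 0.95 if was_fuzzed else 0.75
```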
| |
| Slightly more sophisticated but much slower culling can be performed on input |
| or output corpora with afl-cmin. This tool permanently discards the redundant |
| entries and produces a smaller corpus suitable for use with afl-fuzz or |
| external tools. |
| |
| 5) Trimming input files |
| ----------------------- |
| |
| File size has a dramatic impact on fuzzing performance, both because large |
| files make the target binary slower, and because they reduce the likelihood |
| that a mutation would touch important format control structures, rather than |
| redundant data blocks. This is discussed in more detail in perf_tips.txt. |
| |
| The possibility that the user will provide a low-quality starting corpus aside, |
| some types of mutations can have the effect of iteratively increasing the size |
| of the generated files, so it is important to counter this trend. |
| |
| Luckily, the instrumentation feedback provides a simple way to automatically |
| trim down input files while ensuring that the changes made to the files have no |
| impact on the execution path. |
| |
| The built-in trimmer in afl-fuzz attempts to sequentially remove blocks of data |
| with variable length and stepover; any deletion that doesn't affect the checksum |
| of the trace map is committed to disk. The trimmer is not designed to be |
| particularly thorough; instead, it tries to strike a balance between precision |
| and the number of execve() calls spent on the process, selecting the block size |
| and stepover to match. The average per-file gains are around 5-20%. |
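| |
| The commit-if-checksum-unchanged loop can be sketched like this; the toy |
| trace_checksum() stands in for running the target and hashing its trace map: |
| |
```python
def trace_checksum(data):
    # Toy target: only the 4-byte magic and a trailing marker
    # influence the execution path.
    return (data[:4] == b"MAGC", b"END" in data)

def trim(data, checksum):
    """Delete blocks at decreasing sizes and stepovers, committing any
    deletion that leaves the trace-map checksum unchanged."""
    want = checksum(data)
    block = max(len(data) // 2, 1)
    while block >= 1:
        pos = 0
        while pos < len(data):
            candidate = data[:pos] + data[pos + block:]
            if checksum(candidate) == want:
                data = candidate   # commit; retry at the same offset
            else:
                pos += block       # step over this block
        block //= 2
    return data

seed = b"MAGC" + b"x" * 100 + b"END"
trimmed = trim(seed, trace_checksum)
```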
| |
| The standalone afl-tmin tool uses a more exhaustive, iterative algorithm, and |
| also attempts to perform alphabet normalization on the trimmed files. The |
| operation of afl-tmin is as follows. |
| |
| First, the tool automatically selects the operating mode. If the initial input |
| crashes the target binary, afl-tmin will run in non-instrumented mode, simply |
| keeping any tweaks that produce a simpler file but still crash the target. If |
| the target is non-crashing, the tool uses an instrumented mode and keeps only |
| the tweaks that produce exactly the same execution path. |
| |
| The actual minimization algorithm is: |
| |
| 1) Attempt to zero large blocks of data with large stepovers. Empirically, |
| this is shown to reduce the number of execs by preempting finer-grained |
| efforts later on. |
| |
| 2) Perform a block deletion pass with decreasing block sizes and stepovers, |
| binary-search-style. |
| |
| 3) Perform alphabet normalization by counting unique characters and trying |
| to bulk-replace each with a zero value. |
| |
| 4) As a last resort, perform byte-by-byte normalization on non-zero bytes. |
| |
| Instead of zeroing with a 0x00 byte, afl-tmin uses the ASCII digit '0'. This |
| is done because such a modification is much less likely to interfere with |
| text parsing, so it is more likely to result in successful minimization of |
| text files. |
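| |
| The bulk-replacement step can be sketched as follows; same_path() is a |
| stand-in for re-running the target and comparing execution paths, and the |
| toy oracle below is made up for the example: |
| |
```python
ZERO = ord("0")  # the ASCII digit '0', not a 0x00 byte

def normalize(data, same_path):
    """Bulk-replace each unique byte value with '0', keeping only the
    replacements that leave the execution path unchanged (step 3)."""
    for value in sorted(set(data)):
        if value == ZERO:
            continue
        candidate = bytes(ZERO if b == value else b for b in data)
        if same_path(candidate, data):
            data = candidate
    return data

# Toy oracle: the path depends only on the '<' and '>' structure.
same_path = lambda a, b: [c for c in a if c in b"<>"] == [c for c in b if c in b"<>"]
out = normalize(b"<abc><xyz>", same_path)  # -> b"<000><000>"
```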
| |
| The algorithm used here is less involved than some other test case |
| minimization approaches proposed in academic work, but requires far fewer |
| executions and tends to produce comparable results in most real-world |
| applications. |
| |
| 6) Fuzzing strategies |
| --------------------- |
| |
| The feedback provided by the instrumentation makes it easy to understand the |
| value of various fuzzing strategies and optimize their parameters so that they |
| work equally well across a wide range of file types. The strategies used by |
| afl-fuzz are generally format-agnostic and are discussed in more detail here: |
| |
| http://lcamtuf.blogspot.com/2014/08/binary-fuzzing-strategies-what-works.html |
| |
| It is somewhat notable that especially early on, most of the work done by |
| afl-fuzz is actually highly deterministic, and progresses to random stacked |
| modifications and test case splicing only at a later stage. The deterministic |
| strategies include: |
| |
| - Sequential bit flips with varying lengths and stepovers, |
| |
| - Sequential addition and subtraction of small integers, |
| |
| - Sequential insertion of known interesting integers (0, 1, INT_MAX, etc). |
| |
| The purpose of opening with deterministic steps is related to their tendency to |
| produce compact test cases and small diffs between the non-crashing and crashing |
| inputs. |
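| |
| For instance, the first of these stages - walking bit flips - boils down to |
| the following (a sketch of the general idea, not AFL's exact code): |
| |
```python
def walking_bitflips(data, length=1, stepover=1):
    """Yield variants of `data` with `length` consecutive bits flipped,
    advancing the flip position by `stepover` bits each time."""
    bits = len(data) * 8
    for pos in range(0, bits - length + 1, stepover):
        out = bytearray(data)
        for bit in range(pos, pos + length):
            out[bit // 8] ^= 0x80 >> (bit % 8)  # flip one bit, MSB first
        yield bytes(out)

variants = list(walking_bitflips(b"\x00"))  # 8 single-bit variants
```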
| |
| With deterministic fuzzing out of the way, the non-deterministic steps include |
| stacked bit flips, insertions, deletions, arithmetics, and splicing of different |
| test cases. |
| |
| The relative yields and execve() costs of all these strategies have been |
| investigated and are discussed in the aforementioned blog post. |
| |
| For the reasons discussed in historical_notes.txt (chiefly, performance, |
| simplicity, and reliability), AFL generally does not try to reason about the |
| relationship between specific mutations and program states; the fuzzing steps |
| are nominally blind, and are guided only by the evolutionary design of the |
| input queue. |
| |
| That said, there is one (trivial) exception to this rule: when a new queue |
| entry goes through the initial set of deterministic fuzzing steps, and tweaks to |
| some regions in the file are observed to have no effect on the checksum of the |
| execution path, they may be excluded from the remaining phases of |
| deterministic fuzzing - and the fuzzer may proceed straight to random tweaks. |
| Especially for verbose, human-readable data formats, this can reduce the number |
| of execs by 10-40% or so without an appreciable drop in coverage. In extreme |
| cases, such as normally block-aligned tar archives, the gains can be as high as |
| 90%. |
| |
| Because the underlying "effector maps" are local to every queue entry and remain |
| in force only during deterministic stages that do not alter the size or the |
| general layout of the underlying file, this mechanism appears to work very |
| reliably and proved to be simple to implement. |
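| |
| A sketch of how such a map can be derived; path_checksum() stands in for |
| executing the target and hashing its trace, and the toy target is made up: |
| |
```python
def effector_map(data, path_checksum):
    """True for bytes whose flips alter the path checksum; bytes marked
    False can be skipped by the later deterministic stages."""
    base = path_checksum(data)
    emap = []
    for i in range(len(data)):
        flipped = bytearray(data)
        flipped[i] ^= 0xFF
        emap.append(path_checksum(bytes(flipped)) != base)
    return emap

# Toy target: only the first four "header" bytes affect the path.
path_checksum = lambda d: bytes(d[:4])
emap = effector_map(b"HDR!payloadpayload", path_checksum)
```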
| |
| 7) Dictionaries |
| --------------- |
| |
| The feedback provided by the instrumentation makes it easy to automatically |
| identify syntax tokens in some types of input files, and to detect that certain |
| combinations of predefined or auto-detected dictionary terms constitute a |
| valid grammar for the tested parser. |
| |
| A discussion of how these features are implemented within afl-fuzz can be found |
| here: |
| |
| http://lcamtuf.blogspot.com/2015/01/afl-fuzz-making-up-grammar-with.html |
| |
| In essence, when basic, typically easily-obtained syntax tokens are combined |
| together in a purely random manner, the instrumentation and the evolutionary |
| design of the queue together provide a feedback mechanism to differentiate |
| between meaningless mutations and ones that trigger new behaviors in the |
| instrumented code - and to incrementally build more complex syntax on top of |
| this discovery. |
| |
| The dictionaries have been shown to enable the fuzzer to rapidly reconstruct |
| the grammar of highly verbose and complex languages such as JavaScript, SQL, |
| or XML; several examples of generated SQL statements are given in the blog |
| post mentioned above. |
| |
| Interestingly, the AFL instrumentation also allows the fuzzer to automatically |
| isolate syntax tokens already present in an input file. It can do so by looking |
| for runs of bytes that, when flipped, produce a consistent change to the |
| program's execution path; this is suggestive of an underlying atomic comparison |
| to a predefined value baked into the code. The fuzzer relies on this signal |
| to build compact "auto dictionaries" that are then used in conjunction with |
| other fuzzing strategies. |
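| |
| The heuristic can be sketched as: a run of bytes whose individual flips all |
| divert execution to the same (non-baseline) path is reported as a token. A |
| toy model, with a made-up target: |
| |
```python
def find_tokens(data, path_checksum, min_len=3):
    """Report runs of bytes that, when flipped one at a time, all divert
    the path the same way - the signature of an atomic comparison
    against a magic value (a sketch, not the real heuristic)."""
    base = path_checksum(data)
    effects = []
    for i in range(len(data)):
        flipped = bytearray(data)
        flipped[i] ^= 0xFF
        effects.append(path_checksum(bytes(flipped)))

    tokens, start = [], 0
    for i in range(1, len(data) + 1):
        if i == len(data) or effects[i] != effects[start]:
            if i - start >= min_len and effects[start] != base:
                tokens.append(data[start:i])
            start = i
    return tokens

# Toy target: bails out early unless the input starts with "%PDF".
path_checksum = lambda d: d.startswith(b"%PDF")
tokens = find_tokens(b"%PDFrest", path_checksum)  # -> [b"%PDF"]
```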
| |
| 8) De-duping crashes |
| -------------------- |
| |
| De-duplication of crashes is one of the more important problems for any |
| competent fuzzing tool. Many of the naive approaches run into problems; in |
| particular, looking just at the faulting address may lead to completely |
| unrelated issues being clustered together if the fault happens in a common |
| library function (say, strcmp, strcpy); while checksumming call stack |
| backtraces can lead to extreme crash count inflation if the fault can be |
| reached through a number of different, possibly recursive code paths. |
| |
| The solution implemented in afl-fuzz considers a crash unique if either of |
| the following two conditions is met: |
| |
| - The crash trace includes a tuple not seen in any of the previous crashes, |
| |
| - The crash trace is missing a tuple that was always present in earlier |
| faults. |
| |
| The approach is vulnerable to some path count inflation early on, but exhibits |
| a very strong self-limiting effect, similar to the execution path analysis |
| logic that is the cornerstone of afl-fuzz. |
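| |
| The two rules map to a few lines of set bookkeeping (a sketch, with |
| hypothetical crash traces): |
| |
```python
class CrashDeduper:
    """A crash is unique if its trace contains a tuple never seen in a
    prior crash, or is missing a tuple seen in *all* prior crashes."""

    def __init__(self):
        self.seen = set()    # tuples seen in any crash so far
        self.always = None   # tuples present in every crash so far

    def is_unique(self, trace):
        trace = set(trace)
        unique = (self.always is None
                  or bool(trace - self.seen)
                  or bool(self.always - trace))
        self.seen |= trace
        self.always = trace if self.always is None else self.always & trace
        return unique

d = CrashDeduper()
results = [d.is_unique(t) for t in ({"AB", "BC"},        # first: unique
                                    {"AB", "BC", "CD"},  # new tuple CD
                                    {"AB", "BC", "CD"},  # duplicate
                                    {"AB"})]             # BC now missing
```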
| |
| 9) Investigating crashes |
| ------------------------ |
| |
| The exploitability of many types of crashes can be ambiguous; afl-fuzz tries |
| to address this by providing a crash exploration mode where a known-faulting |
| test case is fuzzed in a manner very similar to the normal operation of the |
| fuzzer, but with a constraint that causes any non-crashing mutations to be |
| thrown away. |
| |
| A detailed discussion of the value of this approach can be found here: |
| |
| http://lcamtuf.blogspot.com/2014/11/afl-fuzz-crash-exploration-mode.html |
| |
| The method uses instrumentation feedback to explore the state of the crashing |
| program to get past the ambiguous faulting condition and then isolate the |
| newly-found inputs for human review. |
| |
| On the subject of crashes, it is worth noting that in contrast to normal |
| queue entries, crashing inputs are *not* trimmed; they are kept exactly as |
| discovered to make it easier to compare them to the parent, non-crashing entry |
| in the queue. That said, afl-tmin can be used to shrink them at will. |
| |
| 10) The fork server |
| ------------------- |
| |
| To improve performance, afl-fuzz uses a "fork server", where the fuzzed process |
| goes through execve(), linking, and libc initialization only once, and is then |
| cloned from a stopped process image by leveraging copy-on-write. The |
| implementation is described in more detail here: |
| |
| http://lcamtuf.blogspot.com/2014/10/fuzzing-binaries-without-execve.html |
| |
| The fork server is an integral aspect of the injected instrumentation and |
| simply stops at the first instrumented function to await commands from |
| afl-fuzz. |
| |
| With fast targets, the fork server can offer considerable performance gains, |
| usually between 1.5x and 2x. It is also possible to: |
| |
| - Use the fork server in manual ("deferred") mode, skipping over larger, |
| user-selected chunks of initialization code. It requires very modest |
| code changes to the targeted program, and with some targets, can |
| produce 10x+ performance gains. |
| |
| - Enable "persistent" mode, where a single process is used to try out |
| multiple inputs, greatly limiting the overhead of repetitive fork() |
| calls. This generally requires some code changes to the targeted program, |
| but can improve the performance of fast targets by a factor of 5 or more |
| - approximating the benefits of in-process fuzzing jobs while still |
| maintaining very robust isolation between the fuzzer process and the |
| targeted binary. |
| |
| 11) Parallelization |
| ------------------- |
| |
| The parallelization mechanism relies on periodically examining the queues |
| produced by independently-running instances on other CPU cores or on remote |
| machines, and then selectively pulling in the test cases that, when tried |
| out locally, produce behaviors not yet seen by the fuzzer at hand. |
| |
| This allows for extreme flexibility in fuzzer setup, including running synced |
| instances against different parsers of a common data format, often with |
| synergistic effects. |
| |
| For more information about this design, see parallel_fuzzing.txt. |
| |
| 12) Binary-only instrumentation |
| ------------------------------- |
| |
| Instrumentation of black-box, binary-only targets is accomplished with the |
| help of a separately-built version of QEMU in "user emulation" mode. This also |
| allows the execution of cross-architecture code - say, ARM binaries on x86. |
| |
| QEMU uses basic blocks as translation units; the instrumentation is implemented |
| on top of this and uses a model roughly analogous to the compile-time hooks: |
| |
| if (block_address > elf_text_start && block_address < elf_text_end) { |
| |
| cur_location = (block_address >> 4) ^ (block_address << 8); |
| shared_mem[cur_location ^ prev_location]++; |
| prev_location = cur_location >> 1; |
| |
| } |
| |
| The shift-and-XOR-based scrambling in the second line is used to mask the |
| effects of instruction alignment. |
| |
| The start-up of binary translators such as QEMU, DynamoRIO, and PIN is fairly |
| slow; to counter this, the QEMU mode leverages a fork server similar to that |
| used for compiler-instrumented code, effectively spawning copies of an |
| already-initialized process paused at _start. |
| |
| First-time translation of a new basic block also incurs substantial latency. To |
| eliminate this problem, the AFL fork server is extended by providing a channel |
| between the running emulator and the parent process. The channel is used |
| to notify the parent about the addresses of any newly-encountered blocks and to |
| add them to the translation cache that will be replicated for future child |
| processes. |
| |
| As a result of these two optimizations, the overhead of the QEMU mode is |
| roughly 2-5x, compared to 100x+ for PIN. |
| |
| 13) The afl-analyze tool |
| ------------------------ |
| |
| The file format analyzer is a simple extension of the minimization algorithm |
| discussed earlier on; instead of attempting to remove no-op blocks, the tool |
| performs a series of walking byte flips and then annotates runs of bytes |
| in the input file. |
| |
| It uses the following classification scheme: |
| |
| - "No-op blocks" - segments where bit flips cause no apparent changes to |
| control flow. Common examples may be comment sections, pixel data within |
| a bitmap file, etc. |
| |
| - "Superficial content" - segments where some, but not all, bitflips |
| produce some control flow changes. Examples may include strings in rich |
| documents (e.g., XML, RTF). |
| |
| - "Critical stream" - a sequence of bytes where all bit flips alter control |
| flow in different but correlated ways. This may be compressed data, |
| non-atomically compared keywords or magic values, etc. |
| |
| - "Suspected length field" - small, atomic integer that, when touched in |
| any way, causes a consistent change to program control flow, suggestive |
| of a failed length check. |
| |
| - "Suspected cksum or magic int" - an integer that behaves similarly to a |
| length field, but has a numerical value that makes the length explanation |
| unlikely. This is suggestive of a checksum or other "magic" integer. |
| |
| - "Suspected checksummed block" - a long block of data where any change |
| always triggers the same new execution path. Likely caused by failing |
| a checksum or a similar integrity check before any subsequent parsing |
| takes place. |
| |
| - "Magic value section" - a generic token where changes cause the type |
| of binary behavior outlined earlier, but that doesn't meet any of the |
| other criteria. May be an atomically compared keyword or so. |