| # WebM Parser {#mainpage} |
| |
| # Introduction |
| |
| This WebM parser is a C++11-based parser that aims to be a safe and complete |
| parser for WebM. It supports all WebM elements (from the old deprecated ones to |
| the newest ones like `Colour`), including recursive elements like `ChapterAtom` |
| and `SimpleTag`. It supports incremental parsing; parsing may be stopped at any |
| point and resumed later as needed. It also supports starting at an arbitrary |
| WebM element, so parsing need not start from the beginning of the file. |
| |
| The parser (`WebmParser`) works by being fed input data from a data source (an |
| instance of `Reader`) that represents a WebM file. The parser will parse the |
| WebM data into various data structures that represent the encoded WebM elements, |
| and then call corresponding `Callback` event methods as the data structures are |
| parsed. |
| |
| # Building |
| |
| CMake support has been added to the root libwebm `CMakeLists.txt` file. Simply |
| enable the `ENABLE_WEBM_PARSER` feature if using the interactive CMake builder, |
| or alternatively pass the `-DENABLE_WEBM_PARSER:BOOL=ON` flag from the command |
| line. By default, this parser is not enabled when building libwebm, so you must |
| explicitly enable it. |
| |
| Alternatively, the following illustrates the minimal commands necessary to |
| compile the code into a static library without CMake: |
| |
| ```.sh |
| c++ -Iinclude -I. -std=c++11 -c src/*.cc |
| ar rcs libwebm.a *.o |
| ``` |
| |
| # Using the parser |
| |
| There are 3 basic components in the parser that are used: `Reader`, `Callback`, |
| and `WebmParser`. |
| |
| ## `Reader` |
| |
| The `Reader` interface acts as a data source for the parser. You may subclass it |
| and implement your own data source if you wish. Alternatively, use the |
| `FileReader`, `IstreamReader`, or `BufferReader` if you wish to read from a |
| `FILE*`, `std::istream`, or `std::vector<std::uint8_t>`, respectively. |
| |
| The parser supports `Reader` implementations that do short reads. If |
| `Reader::Skip()` or `Reader::Read()` do a partial read (returning |
| `Status::kOkPartial`), the parser will call them again in an attempt to read |
| more data. If no data is available, the `Reader` may return some other status |
| (like `Status::kWouldBlock`) to indicate that no data is available. In this |
| situation, the parser will stop parsing and return the status it received. |
| Parsing may be resumed later when more data is available. |
| |
| When the `Reader` has reached the end of the WebM document and no more data is |
| available, it should return `Status::kEndOfFile`. This will cause parsing to |
| stop. If the file ends at a valid location (that is, there aren't any elements |
| that have specified a size that indicates the file ended prematurely), the |
| parser will translate `Status::kEndOfFile` into `Status::kOkCompleted` and |
| return it. If the file ends prematurely, the parser will return |
| `Status::kEndOfFile` to indicate that. |
| |
| Note that if the WebM file contains elements that have an unknown size (or a |
| seek has been performed and the parser doesn't know the size of the root |
| element(s)), and the parser is parsing them and hits end-of-file, the parser may |
| still call `Reader::Read()`/`Reader::Skip()` multiple times (even though they've |
| already reported `Status::kEndOfFile`) as nested parsers terminate parsing. |
| Because of this, `Reader::Read()`/`Reader::Skip()` implementations should be |
| able to handle being called multiple times after the file's end has been |
| reached, and they should consistently return `Status::kEndOfFile`. |
| |
| The three provided readers (`FileReader`, `IstreamReader`, and `BufferReader`) |
| are blocking implementations (they won't return `Status::kWouldBlock`), so if |
| you're using them the parser will run until it entirely consumes all their data |
| (unless, of course, you request the parser to stop via `Callback`... see the |
| next section). |
| |
| ## `Callback` |
| |
| As the parser progresses through the file, it builds objects (see |
| `webm/dom_types.h`) that represent parsed data structures. The parser then |
| notifies the `Callback` implementation as objects complete parsing. For some |
| data structures (like frames or Void elements), the parser notifies the |
| `Callback` and requests it to consume the data directly from the `Reader` (this |
| is done for structures that can be large/frequent binary blobs in order to allow |
| you to read the data directly into the object/type of your choice, rather than |
| just reading them into a `std::vector<std::uint8_t>` and making you copy it into |
| a different object if you wanted to work with something other than |
| `std::vector<std::uint8_t>`). |
| |
| The parser was designed to parse the data into objects that are small enough |
| that the `Callback` can be quickly and frequently notified as soon as the object |
| is ready, but large enough that the objects received by the `Callback` are still |
| useful. Having `Callback` events for every tiny integer/float/string/etc. |
| element would require too much assembly and work to be useful to most users, and |
| pasing the file into a single DOM tree (or a small handful of large conglomerate |
| structures) would unnecessarily delay video playback or consume too much memory |
| on smaller devices. |
| |
| The parser may call the following methods while nearly anywhere in the file: |
| |
| - `Callback::OnElementBegin()`: This is called for every element that the |
| parser encounters. This is primarily useful if you want to skip some |
| elements or build a map of every element in the file. |
| - `Callback::OnUnknownElement()`: This is called when an element is either not |
| a valid/recognized WebM element, or it is a WebM element but is improperly |
| nested (e.g. an EBMLVersion element inside of a Segment element). The parser |
| doesn't know how to handle the element; it could just skip it but instead |
| defers to the `Callback` to decide how it should be handled. The default |
| implementation just skips the element. |
| - `Callback::OnVoid()`: Void elements can appear anywhere in any master |
| element. This method will be called to handle the Void element. |
| |
| The parser may call the following methods in the proper nesting order, as shown |
| in the list. A `*Begin()` method will always be matched up with its |
| corresponding `*End()` method (unless a seek has been performed). The parser |
| will only call the methods in the proper nesting order as specified in the WebM |
| DOM. For example, `Callback::OnEbml()` will never be called in between |
| `Callback::OnSegmentBegin()`/`Callback::OnSegmentEnd()` (since the EBML element |
| is not a child of the Segment element), and `Callback::OnTrackEntry()` will only |
| ever be called in between |
| `Callback::OnSegmentBegin()`/`Callback::OnSegmentEnd()` (since the TrackEntry |
| element is a (grand-)child of the Segment element and must be contained by a |
| Segment element). `Callback::OnFrame()` is listed twice because it will be |
| called to handle frames contained in both SimpleBlock and Block elements. |
| |
| - `Callback::OnEbml()` |
| - `Callback::OnSegmentBegin()` |
| - `Callback::OnSeek()` |
| - `Callback::OnInfo()` |
| - `Callback::OnClusterBegin()` |
| - `Callback::OnSimpleBlockBegin()` |
| - `Callback::OnFrame()` |
| - `Callback::OnSimpleBlockEnd()` |
| - `Callback::OnBlockGroupBegin()` |
| - `Callback::OnBlockBegin()` |
| - `Callback::OnFrame()` |
| - `Callback::OnBlockEnd()` |
| - `Callback::OnBlockGroupEnd()` |
| - `Callback::OnClusterEnd()` |
| - `Callback::OnTrackEntry()` |
| - `Callback::OnCuePoint()` |
| - `Callback::OnEditionEntry()` |
| - `Callback::OnTag()` |
| - `Callback::OnSegmentEnd()` |
| |
| Only `Callback::OnFrame()` (and no other `Callback` methods) will be called in |
| between `Callback::OnSimpleBlockBegin()`/`Callback::OnSimpleBlockEnd()` or |
| `Callback::OnBlockBegin()`/`Callback::OnBlockEnd()`, since the SimpleBlock and |
| Block elements are not master elements only contain frames. |
| |
| Note that seeking into the middle of the file may cause the parser to skip some |
| `*Begin()` methods. For example, if a seek is performed to a SimpleBlock |
| element, `Callback::OnSegmentBegin()` and `Callback::OnClusterBegin()` will not |
| be called. In this situation, the full sequence of callback events would be |
| (assuming the file ended after the SimpleBlock): |
| `Callback::OnSimpleBlockBegin()`, `Callback::OnFrame()` (for every frame in the |
| SimpleBlock), `Callback::OnSimpleBlockEnd()`, `Callback::OnClusterEnd()`, and |
| `Callback::OnSegmentEnd()`. Since the Cluster and Segment elements were skipped, |
| the `Cluster` DOM object may have some members marked as absent, and the |
| `*End()` events for the Cluster and Segment elements will have metadata with |
| unknown header position, header length, and body size (see `kUnknownHeaderSize`, |
| `kUnknownElementSize`, and `kUnknownElementPosition`). |
| |
| When a `Callback` method has completed, it should return `Status::kOkCompleted` |
| to allow parsing to continue. If you would like parsing to stop, return any |
| other status code (except `Status::kEndOfFile`, since that's treated somewhat |
| specially and is intended for `Reader`s to use), which the parser will return. |
| If you return a non-parsing-error status code (.e.g. `Status::kOkPartial`, |
| `Status::kWouldBlock`, etc. or your own status code with a value > 0), parsing |
| may be resumed again. When parsing is resumed, the parser will call the same |
| callback method again (and once again, you may return `Status::kOkCompleted` to |
| let parsing continue or some other value to stop parsing). |
| |
| You may subclass the `Callback` element and override methods which you are |
| interested in receiving events for. By default, methods taking an `Action` |
| parameter will set it to `Action::kRead` so the entire file is parsed. The |
| `Callback::OnFrame()` method will just skip over the frame bytes by default. |
| |
| ## `WebmParser` |
| |
| The actual parsing work is done with `WebmParser`. Simply construct a |
| `WebmParser` and call `WebmParser::Feed()` (providing it a `Callback` and |
| `Reader` instance) to parse a file. It will return `Status::kOkCompleted` when |
| the entire file has been successfully parsed. `WebmParser::Feed()` doesn't store |
| any internal references to the `Callback` or `Reader`. |
| |
| If you wish to start parsing from the middle of a file, call |
| `WebmParser::DidSeek()` before calling `WebmParser::Feed()` to prepare the |
| parser to receive data starting at an arbitrary point in the file. When seeking, |
| you should seek to the beginning of a WebM element; seeking to a location that |
| is not the start of a WebM element (e.g. seeking to a frame, rather than its |
| containing SimpleBlock/Block element) will cause parsing to fail. Calling |
| `WebmParser::DidSeek()` will reset the state of the parser and clear any |
| internal errors, so a `WebmParser` instance may be reused (even if it has |
| previously failed to parse a file). |
| |
| ## Building your program |
| |
| The following program is a small program that completely parses a file from |
| stdin: |
| |
| ```.cc |
| #include <webm/callback.h> |
| #include <webm/file_reader.h> |
| #include <webm/webm_parser.h> |
| |
| int main() { |
| webm::Callback callback; |
| webm::FileReader reader(std::freopen(nullptr, "rb", stdin)); |
| webm::WebmParser parser; |
| parser.Feed(&callback, &reader); |
| } |
| ``` |
| |
| It completely parses the input file, but we need to make a new class that |
| derives from `Callback` if we want to receive any parsing events. So if we |
| change it to: |
| |
| ```.cc |
| #include <iomanip> |
| #include <iostream> |
| |
| #include <webm/callback.h> |
| #include <webm/file_reader.h> |
| #include <webm/status.h> |
| #include <webm/webm_parser.h> |
| |
| class MyCallback : public webm::Callback { |
| public: |
| webm::Status OnElementBegin(const webm::ElementMetadata& metadata, |
| webm::Action* action) override { |
| std::cout << "Element ID = 0x" |
| << std::hex << static_cast<std::uint32_t>(metadata.id); |
| std::cout << std::dec; // Reset to decimal mode. |
| std::cout << " at position "; |
| if (metadata.position == webm::kUnknownElementPosition) { |
| // The position will only be unknown if we've done a seek. But since we |
| // aren't seeking in this demo, this will never be the case. However, this |
| // if-statement is included for completeness. |
| std::cout << "<unknown>"; |
| } else { |
| std::cout << metadata.position; |
| } |
| std::cout << " with header size "; |
| if (metadata.header_size == webm::kUnknownHeaderSize) { |
| // The header size will only be unknown if we've done a seek. But since we |
| // aren't seeking in this demo, this will never be the case. However, this |
| // if-statement is included for completeness. |
| std::cout << "<unknown>"; |
| } else { |
| std::cout << metadata.header_size; |
| } |
| std::cout << " and body size "; |
| if (metadata.size == webm::kUnknownElementSize) { |
| // WebM master elements may have an unknown size, though this is rare. |
| std::cout << "<unknown>"; |
| } else { |
| std::cout << metadata.size; |
| } |
| std::cout << '\n'; |
| |
| *action = webm::Action::kRead; |
| return webm::Status(webm::Status::kOkCompleted); |
| } |
| }; |
| |
| int main() { |
| MyCallback callback; |
| webm::FileReader reader(std::freopen(nullptr, "rb", stdin)); |
| webm::WebmParser parser; |
| webm::Status status = parser.Feed(&callback, &reader); |
| if (status.completed_ok()) { |
| std::cout << "Parsing successfully completed\n"; |
| } else { |
| std::cout << "Parsing failed with status code: " << status.code << '\n'; |
| } |
| } |
| ``` |
| |
| This will output information about every element in the entire file: it's ID, |
| position, header size, and body size. The status of the parse is also checked |
| and reported. |
| |
| For a more complete example, see `demo/demo.cc`, which parses an entire file and |
| prints out all of its information. That example overrides every `Callback` |
| method to show exactly what information is available while parsing and how to |
| access it. The example is verbose, but that's primarily due to pretty-printing |
| and string formatting operations. |
| |
| When compiling your program, add the `include` directory to your compiler's |
| header search paths and link to the compiled library. Be sure your compiler has |
| C++11 mode enabled (`-std=c++11` in clang++ or g++). |
| |
| # Testing |
| |
| Unit tests are located in the `tests` directory. Google Test and Google Mock are |
| used as testing frameworks. Building and running the tests will be supported in |
| the upcoming CMake scripts, but they can currently be built and run by manually |
| compiling them (and linking to Google Test and Google Mock). |
| |
| # Fuzzing |
| |
| The parser has been fuzzed with [AFL](http://lcamtuf.coredump.cx/afl/) and |
| [libFuzzer](http://llvm.org/docs/LibFuzzer.html). If you wish to fuzz the parser |
| with AFL or libFuzzer but don't want to write an executable that exercises the |
| parsing API, you may use `fuzzing/webm_fuzzer.cc`. |
| |
| When compiling for fuzzing, define the macro |
| `WEBM_FUZZER_BYTE_ELEMENT_SIZE_LIMIT` to be some integer in order to limit the |
| maximum size of ASCII/UTF-8/binary elements. It's too easy for the fuzzer to |
| generate elements that claim to have a ridiculously massive size, which will |
| cause allocations to fail or the program to allocate too much memory. AFL will |
| terminate the process if it allocates too much memory (by default, 50 MB), and |
| the [Address Sanitizer doesn't throw `std::bad_alloc` when an allocation fails] |
| (https://github.com/google/sanitizers/issues/295). Defining |
| `WEBM_FUZZER_BYTE_ELEMENT_SIZE_LIMIT` to a low number (say, 1024) will cause the |
| ASCII/UTF-8/binary element parsers to return `Status::kNotEnoughMemory` if the |
| element's size exceeds `WEBM_FUZZER_BYTE_ELEMENT_SIZE_LIMIT`, which will avoid |
| false positives when fuzzing. The parser expects `std::string` and `std::vector` |
| to throw `std::bad_alloc` when an allocation fails, which doesn't necessarily |
| happen due to the fuzzers' limitations. |
| |
| You may also define the macro `WEBM_FUZZER_SEEK_FIRST` to have |
| `fuzzing/webm_fuzzer.cc` call `WebmParser::DidSeek()` before doing any parsing. |
| This will test the seeking code paths. |