 <?xml version='1.0' encoding='utf-8'?>
 <!DOCTYPE html5>
 <html lang="en" xmlns="http://www.w3.org/1999/xhtml">
   <!-- N.B. Sentences in this document are double-spaced so that Emacs
        sentence-editing functions work more reliably. -->
   <head>
     <title>DCTV trace analysis system</title>
     <link rel="stylesheet" href="reset.css" />
     <link rel="stylesheet" href="styles.css" />
   </head>
   <body>
     <div id="container">
       <header id="header">DCTV</header>
       <nav id="sidebar">
         <ul>
           <li>
             <a href="#introduction">Introduction</a>
             <ul>
               <li><a href="#quickstart">Quick start</a></li>
               <li><a href="#background">Background</a></li>
               <li><a href="#conventions">Document conventions</a></li>
               <li><a href="#datamodel">Data model</a></li>
               <li><a href="#differences">Differences from standard SQL</a></li>
             </ul>
           </li>
           <li><a href="#syntax">Syntax reference</a></li>
           <li><a href="#standard_library">Standard library</a></li>
           <li><a href="#example">Worked example</a></li>
         </ul>
       </nav>
       <main id="manual">
         <h1><a name="introduction">Introduction</a></h1>
         <p>DCTV is a data exploration toolkit designed for both
         interactive and batch analysis of trace files and other
         heterogeneous time series data.  It's designed to answer
         complex questions about the sort of data that one frequently
         finds in records of system activity.</p>
         <p>Important features of DCTV are:</p>
         <ul>
           <li>SQL:1999 querying of trace files</li>
           <li>specialized relational algebra and SQL syntax for time series</li>
           <li>comprehensive dimensional analysis for unit conversion
           and error detection</li>
           <li>support for analyzing very large (larger than memory)
           trace files</li>
           <li>powerful GUI for interactive trace exploration</li>
         </ul>
         <p>Use cases include:</p>
         <ul>
           <li>examining CPU time spent by a particular application</li>
           <li>examining CPU time spent in <emph>part</emph> of an
           application</li>
           <li>examining memory activity of the whole system to determine
           what caused a game to miss a frame deadline</li>
           <li>finding which functions cause the most page faults
           during app startup</li>
           <li>tracking down slow memory leaks</li>
           <li>finding why a real-time thread took too long to run and
           poll a device</li>
           <li>bulk analysis of traces from production to extract metrics for a
           dashboard</li>
         </ul>
         <p>DCTV is a "power user" tool: using it effectively requires
         an understanding of both the system components that generate
         the trace events being queried and an understanding of
         SQL-like declarative query systems.  This document aims to
         describe and document DCTV's functionality, walk through a few
         examples of trace analysis, and invite the reader to
         investigate further.</p>
         <h2><a name="quickstart">Quick start</a></h2>
         <aside class="warning">
           DCTV is under active development and is not yet stable.
           It also currently runs on Linux systems only; a port to
           macOS is underway.  See the <a href="go/dctv-db-design">DB
           design document</a> for further information on checking out and
           building the source code.
         </aside>
         <h3>Getting DCTV</h3>
         <ol>
           <li>Be running gLinux (we'll port eventually)</li>
           <li><code>git clone sso://team/dctv/dctv</code></li>
           <li><code>make dev</code></li>
           <li>follow prompts; install dependencies</li>
           <li>while the build is broken, complain to dancol@, goto 2</li>
           <li><code>./dctv</code></li>
         </ol>
         <h3>Hello world</h3>
         <code class="blockquote"><![CDATA[$ ./dctv repl mytrace=mytrace.ftrace
 Type .help for help.
 DCTV> SELECT COUNT(*) FROM mytrace.scheduler.timeslices_p_cpu;
 COUNT()
 -------
   32362
   ]]></code>
         <p>
         </p>
         <h2><a name="background">Background</a></h2>
         <blockquote>
           Life is just one damned thing after another.
           <cite>Arnold J. Toynbee</cite>
         </blockquote>
         <h3>Purpose of DCTV</h3>
         <p>A trace file by itself is of limited utility: it's
         gigabytes of detailed, low-level records of system activity.
         When we analyze a trace file, what we really want to do is
         <emph>pose questions</emph> to that trace file and get back
         meaningful answers.  The information we want lies in the
         non-trivial <emph>relationships</emph> between trace events,
         the relationships between relationships, and so on, in a way
         that puts limits on the kind of trace analysis that it's
         possible to do using ad-hoc analysis of trace
         events themselves.</p>
         <p>After we pose questions to a trace file and get answers, we
         frequently want to use these answers as the basis for further
         questions.  In this way, we gradually increase the level of
         abstraction of our analysis, moving from questions posed in
         terms of raw trace events to ones posed in terms of the
         problem we're actually trying to solve.</p>
         <p>DCTV is a question-answering machine.  By incrementally
         constructing queries and then querying against them (for
         example, using the <code>WITH</code> construction), users
         extract increasingly abstract data from trace files, data not
         directly represented by discrete and specific low-level events
         in a trace.  The SQL REPL and the GUI both provide
         information-querying capabilities.</p>
         <p>DCTV also provides a <a href="#standard_library">standard
         library</a> of ready-made building blocks that users can query
         during trace analysis.</p>
         <h3>Other trace analysis tools</h3>
         <p>DCTV is not the first such tool for trace analysis.
         It integrates the best parts of WPA, LISA, and Perfetto's
         trace analysis models.</p>
         <code>TODO(dancol): flesh out this section</code>
         <h2><a name="conventions">Document conventions</a></h2>
         <p>This document currently assumes the reader is familiar with
         the basics of SQL and the basics of trace processing, focusing
         on DCTV's specific features in this area.</p>
         <h3>Time tables</h3>
         <p>Some figures below are "time tables" (they have "Time ▶" in
         the upper-left).  They represent timelines, where each row in
         the table is a separate and independent data series.
         Some tables represent operands and results; in this case, a
         thick black line separates the input rows and output rows.</p>

         <h3>Function signatures</h3>
         <p>Table-valued function signatures are given in Python
         syntax, with a bare <code>*</code> signifying that all
         arguments following the <code>*</code> are keyword-only and
         cannot be specified positionally.  (That is, if a function
         signature is <code>foo(*, bar=7)</code>, then you have to
         write either <code>foo()</code> (using <code>bar</code>'s
         default value) or <code>foo(bar=&gt;5)</code> (specifying an
         explicit value of the keyword argument), and you can't write
         <code>foo(1)</code> (because <code>bar</code> can't be
         specified positionally).)</p>
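         <p>Because signatures are written in Python syntax, plain
         Python can illustrate the rule.  The following is a minimal
         sketch, not DCTV code: <code>foo</code> is the hypothetical
         function from the paragraph above, and Python spells the
         keyword call <code>foo(bar=5)</code> where a DCTV query
         would write <code>foo(bar=&gt;5)</code>.</p>
         <code class="blockquote"><![CDATA[
```python
def foo(*, bar=7):
    """Hypothetical function whose only parameter is keyword-only."""
    return bar


print(foo())       # bar takes its default value
print(foo(bar=5))  # bar is passed by keyword

try:
    foo(1)         # positional use of a keyword-only parameter
except TypeError as exc:
    print("rejected:", exc)
```
]]></code>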
         <h2><a name="datamodel">Data model</a></h2>
         <p>DCTV is designed around querying one or more trace files
         using SQL queries.  DCTV performs no hardcoded pre-processing
         of trace files: we model each event in a trace file as a row
         of the "raw events" table corresponding to that event's type.
         Each field in an event is a column in that event's table;
         users extract higher-level information from these low-level
         events by defining views in terms of these low-level events.
         By querying the views, users can extract higher-level trace
         events; users can also define views in terms of other views to
         answer more abstract questions.</p>
         <h3>Table types</h3>
         <p>DCTV's query engine provides the tables and set functions
         that any SQL system provides, but extends these facilities
         with a set of operators and functions dedicated to working
         with heterogeneous time series.  Tables in DCTV are
         first-class <emph>typed</emph> objects: tables are either
         regular tables, span tables, or event tables.  Each type of
         table has a set of query operations that it supports; DCTV
         provides functions to convert one type of table to another as
         needed.</p>

         <aside class="note">It's always possible to "view" one of
         DCTV's special table types as a regular table by just using
         regular table operations (like the non-<code>SPAN</code>
         variant of <code>SELECT</code>) on it.  The result of any of
         these non-special operations is itself a regular
         table.</aside>

         <p>This table summarizes the special operations DCTV supports.
         Don't worry if you don't recognize some of these terms (like
         "partitioned span table"): they're defined below.</p>

         <table class="general">
           <tr>
             <th>Operation</th>
             <th>Left operand</th>
             <th>Right operand</th>
             <th>Result</th>
           </tr>
           <tr>
             <td>SELECT</td>
             <td>Regular table</td>
             <td>N/A</td>
             <td>Regular table</td>
           </tr>
           <tr>
             <td>SELECT</td>
             <td>Span table</td>
             <td>N/A</td>
             <td>Regular table</td>
           </tr>
           <tr>
             <td>SELECT SPAN</td>
             <td>Span table</td>
             <td>N/A</td>
             <td>Span table</td>
           </tr>
           <tr>
             <td>SPAN JOIN</td>
             <td>Unpartitioned span table</td>
             <td>Unpartitioned span table</td>
             <td>Unpartitioned span table</td>
           </tr>
           <tr>
             <td>SPAN BROADCAST INTO</td>
             <td>Unpartitioned span table</td>
             <td>Partitioned span table</td>
             <td>Partitioned span table</td>
           </tr>
           <tr>
             <td>SPAN BROADCAST FROM</td>
             <td>Partitioned span table</td>
             <td>Unpartitioned span table</td>
             <td>Partitioned span table</td>
           </tr>
           <tr>
             <td>GROUP USING PARTITION</td>
             <td>Partitioned span table</td>
             <td>N/A</td>
             <td>Unpartitioned span table</td>
           </tr>
           <tr>
             <td>GROUP USING SPANS FROM</td>
             <td>Partitioned span table</td>
             <td>Unpartitioned span table</td>
             <td>Partitioned span table</td>
           </tr>
           <tr>
             <td>GROUP USING SPANS FROM</td>
             <td>Unpartitioned span table</td>
             <td>Unpartitioned span table</td>
             <td>Unpartitioned span table</td>
           </tr>
         </table>

         <p>A <dfn>regular SQL table</dfn> is essentially a list of
         points in high-dimensional space, with each column in the
         table representing one dimension along which a point can
         vary.</p>

         <p>A <dfn>span table</dfn> represents data that vary over the
         time dimension.  An interval of time over which the data in a
         span table remain the same is called a <dfn>span</dfn>.
         The collection of time-varying data described by a span table
         is the <dfn>payload</dfn> of that span table.</p>
         <!-- TODO(dancol): talk about different time basis? -->
         <p>All span tables have two special columns:
         <dfn><code>_ts</code></dfn> and
         <dfn><code>_duration</code></dfn>.  <code>_ts</code> is an
         <code>INT64</code> timestamp, in nanoseconds since the start
         of the trace.  <code>_duration</code> is a non-zero
         <code>INT64</code> number of nanoseconds that the span covers.
         (That is, the span describes the half-open region of time
         [<code>_ts</code>, <code>_ts</code> +
         <code>_duration</code>).)</p>

         <p><code>_ts</code> and <code>_duration</code> are always
         non-<code>NULL</code>, and a span table is always ordered by
         increasing values of <code>_ts</code>.  Spans in a span table
         cannot "overlap": a span must end either before or at exactly
         the same time as the next span begins.  (Spans from different
         partitions may overlap, however: see immediately below.)  A
         span table need not be contiguous: that is, it's legal for
         gaps to exist between spans.</p>

         <p>For example, imagine that you're looking at a Christmas
         tree light that changes color in time with music.  We might
         describe the color of the light using spans.  The following
         diagram depicts how we might use spans to describe the light's
         state.  Each pair of numbers (one above the table, one below)
         indicates the time corresponding to the vertical line
         connecting them.</p>

         <table class="spanop">
           <caption>Light color</caption>
           <tr class="times">
             <td>Time ▶</td>
             <td><span>1</span></td>
             <td><span>2</span></td>
             <td><span>3</span></td>
             <td><span>4</span><span>5</span></td>
           </tr>
           <tr>
             <td>Color</td>
             <td colspan="2">Red</td>
             <td colspan="1" class="empty" />
             <td colspan="1">Green</td>
           </tr>
           <tr class="times">
             <td>Time ▶</td>
             <td><span>1</span></td>
             <td><span>2</span></td>
             <td><span>3</span></td>
             <td><span>4</span><span>5</span></td>
           </tr>
         </table>

         <p>Here, the light was red from time one to time three and
         then green from time four to time five, inclusive.  (From time
         three to time four, the light was off; we're choosing to
         represent "off" as the absence of a span, but an equally valid
         choice would be to make a span with a special "Off" value for
         the color.)</p>

         <p>It's useful to look at the physical table representation
         of the above set of spans.</p>

         <table class="general spantable">
           <caption>Light color (span table representation)</caption>
           <tr>
             <th>_ts</th>
             <th>_duration</th>
             <th>color</th>
           </tr>
           <tr>
             <td>1</td>
             <td>2</td>
             <td>red</td>
           </tr>
           <tr>
             <td>4</td>
             <td>1</td>
             <td>green</td>
           </tr>
         </table>

         <p>Note that one row in the physical table representation of a
         span table corresponds to one <emph>logical</emph> span.</p>
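         <p>The span table rules above (ordering by <code>_ts</code>,
         no overlap, gaps allowed) can be sketched as a small check
         over the rows of the light table.  This is an illustration
         only, not DCTV's implementation: <code>Span</code> and
         <code>check_span_table</code> are hypothetical names, and
         the sketch assumes that a valid <code>_duration</code> is
         positive and that a span ends at <code>_ts</code> +
         <code>_duration</code>.</p>
         <code class="blockquote"><![CDATA[
```python
from typing import NamedTuple


class Span(NamedTuple):
    ts: int        # _ts: start time in nanoseconds
    duration: int  # _duration: length in nanoseconds
    color: str     # payload column


def check_span_table(rows):
    """Raise ValueError if rows break the span table rules."""
    prev_end = None
    for row in rows:
        # Assumption: a valid duration is strictly positive.
        if row.duration <= 0:
            raise ValueError(f"non-positive duration: {row}")
        # A span may start exactly where the previous one ends,
        # or later (a gap), but never earlier.
        if prev_end is not None and row.ts < prev_end:
            raise ValueError(f"overlapping or out-of-order span: {row}")
        prev_end = row.ts + row.duration


light = [Span(1, 2, "red"), Span(4, 1, "green")]
check_span_table(light)  # passes: red ends at 3, green starts at 4
```
]]></code>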

         <aside class="note">
           It's because span tables are always ordered by
           <code>_ts</code> that DCTV disallows queries of the form
           <code>SELECT SPAN ... ORDER BY ...</code>.  Re-ordering a
           span table makes no sense.  If you want to
           <code>SELECT</code> from a span table without making the
           result a span table, you can instead view the span table as
           a regular table by using the non-<code>SPAN</code> variant
           of <code>SELECT</code> (<code>SELECT * FROM my_span_table</code>);
           in this mode, <code>SELECT</code> lets you order the result
           set however you want.
         </aside>
         <p>An <dfn>event table</dfn> is like a span table, but without
         the <code>_duration</code> column.  It represents a sequence
         of "points" in time.  The advantage of using an event table
         over a regular SQL table to represent points is automatic
         integration of the event table into time-based operations
         on spans.</p>

         <h3>Partitions</h3>

         <p>A span table is either a <dfn>partitioned span table</dfn>
         or a <dfn>non-partitioned span table</dfn>.  A non-partitioned
         span table is just the kind of span table described above.
         A partitioned span table, by contrast, has an additional
         special column, the <dfn>partition column</dfn>.
         A partitioned span table is basically a bundle of logical
         partition tables all combined into a single table under a
         single name.  Each distinct <emph>value</emph> of the
         partition column, which is called a <dfn>partition</dfn>,
         defines one independent sequence of spans.</p>

         <p>All of DCTV's operations on span tables know about
         partitioned span tables (the partition column is part of the
         span table's type) and operate on each partition within a span
         table independently.  There are also operations that transform
         a partitioned span table into a non-partitioned span table
         through the use of SQL grouping operators.</p>

         <p>It's useful to store sequences of spans this way instead of
         putting each in its own table: this way, using a partitioned span
         table, we can operate on groups of related time series
         uniformly without having to change our queries depending on
         how many different time series we have: for example, a
         CPU-related query should look the same on any system no matter
         how many CPUs it has!</p>

         <p>DCTV currently allows a span table to have either zero or
         one partition column, but not more.  This limit is just an
         implementation limit, and in the future, DCTV will allow
         partitioning by more than one column.</p>

         <p>Let's look at our Christmas tree light example, but with
         partitions.  Here, we're looking at two lights, one called
         "light#0" and another called "light#1".  We use a sequence of
         spans to describe each light's state.  It's critical to
         understand that each light has a distinct state history, but
         that we store all of these histories in the same physical
         table, using a column to describe the specific light that a
         specific row describes.</p>

         <aside class="note">For the remainder of this document, when
         the character "#" appears in a span row label, it refers to a
         specific partition of a partitioned span table.</aside>

         <table class="spanop">
           <caption>Colors of two lights</caption>
           <tr class="times">
             <td>Time ▶</td>
             <td><span>1</span></td>
             <td><span>2</span></td>
             <td><span>3</span></td>
             <td><span>4</span><span>5</span></td>
           </tr>
           <tr>
             <td>Light#0</td>
             <td colspan="2">Red</td>
             <td colspan="1" class="empty" />
             <td colspan="1">Green</td>
           </tr>
           <tr>
             <td>Light#1</td>
             <td colspan="1">Green</td>
             <td colspan="3">Red</td>
           </tr>
           <tr class="times">
             <td>Time ▶</td>
             <td><span>1</span></td>
             <td><span>2</span></td>
             <td><span>3</span></td>
             <td><span>4</span><span>5</span></td>
           </tr>
         </table>

         <p>Here's the physical partitioned span table representation
         of the logical spans from the above diagram.</p>

         <table class="general spantable">
           <caption>Colors of two lights (span table representation)</caption>
           <tr>
             <th>_ts</th>
             <th>_duration</th>
             <th>lightno</th>
             <th>color</th>
           </tr>
           <tr>
             <td>1</td>
             <td>2</td>
             <td>0</td>
             <td>red</td>
           </tr>
           <tr>
             <td>1</td>
             <td>1</td>
             <td>1</td>
             <td>green</td>
           </tr>
           <tr>
             <td>2</td>
             <td>3</td>
             <td>1</td>
             <td>red</td>
           </tr>
           <tr>
             <td>4</td>
             <td>1</td>
             <td>0</td>
             <td>green</td>
           </tr>
         </table>

         <p>Like an unpartitioned span table, a partitioned span table
         is ordered strictly by increasing <code>_ts</code>.  If spans
         from two different partitions begin at the same time, the
         ordering of those with the same <code>_ts</code> value is
         unspecified. </p>
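         <p>To make the "bundle of independent sequences" idea
         concrete, here is a minimal Python sketch that splits the
         two-light table into its partitions.  It is not how DCTV
         stores or processes partitions; the helper names are
         hypothetical, and rows are plain tuples in the column order
         of the table above.</p>
         <code class="blockquote"><![CDATA[
```python
from collections import defaultdict

# (_ts, _duration, lightno, color) rows from the table above.
rows = [
    (1, 2, 0, "red"),
    (1, 1, 1, "green"),
    (2, 3, 1, "red"),
    (4, 1, 0, "green"),
]


def split_by_partition(rows, partition_index):
    """Group rows by partition value, keeping their _ts order."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_index]].append(row)
    return dict(partitions)


def has_overlap(spans):
    """Return True if any span starts before the previous one ends."""
    return any(
        nxt[0] < cur[0] + cur[1] for cur, nxt in zip(spans, spans[1:])
    )


by_light = split_by_partition(rows, partition_index=2)
for lightno, spans in by_light.items():
    print(lightno, spans, has_overlap(spans))

# Taken as one flat sequence, the rows do overlap; only the
# per-partition sequences obey the no-overlap rule.
print(has_overlap(rows))
```
]]></code>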

         <aside class="example">
           A real-world use of spans is analyzing CPU-specific data.
           On a multi-CPU system, each CPU has its own frequency.
           A CPU might change from 800MHz to 1GHz and then down to
           600MHz, while another CPU, at the same time, might change
           its frequency from 600MHz to 800MHz and then up to 1GHz.
           Each of the two time series (the first CPU's frequency
           history and the second CPU's frequency history) is an
           independent time series.
         </aside>

         <h3>Span operations</h3>
         <p>While we can apply normal SQL querying operations to span
         tables, we can answer certain questions much more conveniently
         by using DCTV's special span operations, which are designed to
         make it easy to work with real-world time series data.</p>

         <h4>Span join</h4>

         <p>The <dfn>span join</dfn> family of operations merges spans
         together in a timewise-correct way and generates new spans
         divided on the common boundaries of the spans that flow as
         input into the span join.</p>

         <p>It's easiest to demonstrate a span join visually.</p>
         <!-- TODO(dancol): can we make this diagram more fun? -->
         <table class="spanop">
           <tr class="times">
             <td>Time ▶</td>
             <td><span>1</span></td>
             <td><span>2</span></td>
             <td><span>3</span><span>4</span></td>
           </tr>
           <tr>
             <td>Size</td>
             <td colspan="2">tiny</td>
             <td colspan="1">giant</td>
           </tr>
           <tr>
             <td>Species</td>
             <td colspan="1">fish</td>
             <td colspan="2">squirrel</td>
           </tr>
           <tr class="result-divider"><td>SPAN JOIN</td></tr>
           <tr>
             <td>Phenotype</td>
             <td>tiny fish</td>
             <td>tiny squirrel</td>
             <td>giant squirrel</td>
           </tr>
           <tr class="times">
             <td>Time ▶</td>
             <td><span>1</span></td>
             <td><span>2</span></td>
             <td><span>3</span><span>4</span></td>
           </tr>
         </table>

         <p>Here, we're joining two hypothetical time series (as
         represented by span tables), a time series of sizes and a time
         series of animal types.  (Imagine we're trying to reconstruct
         the state of an animal given a record of the transmutation
         spells some novice sorcerer's apprentice might have haphazardly
         cast.)</p>

         <p>In this trace, the "make the animal tiny" spell was in
         effect from timestamp one to timestamp three (inclusive), and
         the "make the animal giant" spell was in effect from timestamp
         three onward.  Likewise, the "make the animal a fish" spell was in
         effect from timestamp one to timestamp two (inclusive) and the
         "make the animal a squirrel" spell was in effect from
         timestamp two onward.  The first row depicts the result of the
         size spells, and the second row depicts the effect of the
         animal-type spell.  (We imagine that each spell cancels the
         effect of the last spell of the same type.)</p>

         <p>The last row, "phenotype", represents a span table giving
         the type of animal that we observe at each moment, inferred
         from the effects of the previous two rows.  Note that the
         result span table has a span division wherever any of the
         inputs has a span division.  We ensure that all the properties
         of any of the input spans stay constant "within" any of the
         output spans, allowing for correct future computation
         involving these values.
         </p>

         <p>It may be informative to look at the row-wise representation
         of the above span tables:</p>

         <table class="general spantable">
           <caption>Size</caption>
           <tr>
             <th>_ts</th>
             <th>_duration</th>
             <th>size</th>
           </tr>
           <tr>
             <td>1</td>
             <td>2</td>
             <td>tiny</td>
           </tr>
           <tr>
             <td>3</td>
             <td>1</td>
             <td>giant</td>
           </tr>
         </table>

         <table class="general spantable">
           <caption>Species</caption>
           <tr>
             <th>_ts</th>
             <th>_duration</th>
             <th>species</th>
           </tr>
           <tr>
             <td>1</td>
             <td>1</td>
             <td>fish</td>
           </tr>
           <tr>
             <td>2</td>
             <td>2</td>
             <td>squirrel</td>
           </tr>
         </table>

         <table class="general spantable">
           <caption>Phenotype</caption>
           <tr>
             <th>_ts</th>
             <th>_duration</th>
             <th>size</th>
             <th>species</th>
           </tr>
           <tr>
             <td>1</td>
             <td>1</td>
             <td>tiny</td>
             <td>fish</td>
           </tr>
           <tr>
             <td>2</td>
             <td>1</td>
             <td>tiny</td>
             <td>squirrel</td>
           </tr>
           <tr>
             <td>3</td>
             <td>1</td>
             <td>giant</td>
             <td>squirrel</td>
           </tr>
         </table>
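         <p>The way a span join cuts its output on every input
         boundary can be sketched in a few lines of Python.
         This is only an illustration of the idea, not DCTV's
         algorithm: <code>span_join</code> here is a hypothetical
         helper, each table is a list of
         (<code>_ts</code>, <code>_duration</code>, value) tuples
         taken from the diagram above, and both inputs are assumed
         to be unpartitioned and already ordered by
         <code>_ts</code>.</p>
         <code class="blockquote"><![CDATA[
```python
def span_join(left, right):
    """Join two span tables given as (_ts, _duration, value) tuples."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        l_ts, l_dur, l_val = left[i]
        r_ts, r_dur, r_val = right[j]
        l_end, r_end = l_ts + l_dur, r_ts + r_dur
        start, end = max(l_ts, r_ts), min(l_end, r_end)
        if start < end:  # the two spans share the time from start to end
            out.append((start, end - start, l_val, r_val))
        # Advance whichever span ends first (both on a tie).
        if l_end <= r_end:
            i += 1
        if r_end <= l_end:
            j += 1
    return out


size = [(1, 2, "tiny"), (3, 1, "giant")]
species = [(1, 1, "fish"), (2, 2, "squirrel")]
for row in span_join(size, species):
    print(row)
```
]]></code>
         <p>Each printed row is one output span, in the column order
         <code>_ts</code>, <code>_duration</code>, <code>size</code>,
         <code>species</code>.</p>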

         <h4>Span join: inner and outer</h4>
         <p>What happens when spans don't line up exactly?</p>
         <p>Span joins come in two varieties, named after the varieties
         of regular SQL joins: <dfn>inner span join</dfn> and
         <dfn>outer span join</dfn>.  When all the inputs to a span
         join cover the same period of time, the difference doesn't
         matter.  But when there are gaps in one sequence or another,
         the difference becomes important.  Just as in the previous
         section, we'll start with a diagram.</p>

         <table class="spanop">
           <caption>Sample inputs</caption>
           <tr class="times">
             <td>Time ▶</td>
             <td><span>1</span></td>
             <td><span>2</span></td>
             <td><span>3</span><span>4</span></td>
           </tr>
           <tr>
             <td>Breath</td>
             <td colspan="1">fire</td>
             <td class="empty"/>
             <td colspan="1">ice</td>
           </tr>
           <tr>
             <td>Color</td>
             <td colspan="1">red</td>
             <td colspan="2">green</td>
           </tr>
           <tr class="times">
             <td>Time ▶</td>
             <td><span>1</span></td>
             <td><span>2</span></td>
             <td><span>3</span><span>4</span></td>
           </tr>
         </table>

         <p>Here, we see that there is no magic breath spell in effect
         from time two to time three, inclusive.  What happens when we
         perform a span join on these span tables?  It depends on the
         kind of span join.</p>

         <table class="spanop">
           <caption>Span inner join</caption>
           <tr class="times">
             <td>Time ▶</td>
             <td><span>1</span></td>
             <td><span>2</span></td>
             <td><span>3</span><span>4</span></td>
           </tr>
           <tr>
             <td>Breath</td>
             <td colspan="1">fire</td>
             <td class="empty"/>
             <td colspan="1">ice</td>
           </tr>
           <tr>
             <td>Color</td>
             <td colspan="1">red</td>
             <td colspan="2">green</td>
           </tr>
           <tr class="result-divider">
             <td>Span inner join</td>
           </tr>
           <tr>
             <td>Phenotype</td>
             <td colspan="1">fire-breathing red</td>
             <td class="empty"/>
             <td colspan="1">ice-breathing green</td>
           </tr>
           <tr class="times">
             <td>Time ▶</td>
             <td><span>1</span></td>
             <td><span>2</span></td>
             <td><span>3</span><span>4</span></td>
           </tr>
         </table>

         <table class="spanop">
           <caption>Span outer join</caption>
           <tr class="times">
             <td>Time ▶</td>
             <td><span>1</span></td>
             <td><span>2</span></td>
             <td><span>3</span><span>4</span></td>
           </tr>
           <tr>
             <td>Breath</td>
             <td colspan="1">fire</td>
             <td class="empty"/>
             <td colspan="1">ice</td>
           </tr>
           <tr>
             <td>Color</td>
             <td colspan="1">red</td>
             <td colspan="2">green</td>
           </tr>
           <tr class="result-divider">
             <td>Span outer join</td>
           </tr>
           <tr>
             <td>Phenotype</td>
             <td colspan="1">fire-breathing red</td>
             <td colspan="1"><code>NULL</code>-breathing green</td>
             <td colspan="1">ice-breathing green</td>
           </tr>
           <tr class="times">
             <td>Time ▶</td>
             <td><span>1</span></td>
             <td><span>2</span></td>
             <td><span>3</span><span>4</span></td>
           </tr>
         </table>

         <p>In the span inner join case, we emit an output span only
         when <i>all</i> input spans cover a time interval.  In the
         span outer join case, we emit an output span when <i>any</i>
         input span covers a specific time region, providing NULL for
         the value of any payload column not provided by a span for
         that region.</p>

         <p>The table representations of the two result span tables may
         make the result clearer.</p>

         <table class="general spantable">
           <caption>Span inner join result (table view)</caption>
           <tr>
             <th>_ts</th>
             <th>_duration</th>
             <th>breath</th>
             <th>color</th>
           </tr>
           <tr>
             <td>1</td>
             <td>1</td>
             <td>fire</td>
             <td>red</td>
           </tr>
           <tr>
             <td>3</td>
             <td>1</td>
             <td>ice</td>
             <td>green</td>
           </tr>
         </table>

         <table class="general spantable">
           <caption>Span outer join result (table view)</caption>
           <tr>
             <th>_ts</th>
             <th>_duration</th>
             <th>breath</th>
             <th>color</th>
           </tr>
           <tr>
             <td>1</td>
             <td>1</td>
             <td>fire</td>
             <td>red</td>
           </tr>
           <tr>
             <td>2</td>
             <td>1</td>
             <td><code>NULL</code></td>
             <td>green</td>
           </tr>
           <tr>
             <td>3</td>
             <td>1</td>
             <td>ice</td>
             <td>green</td>
           </tr>
         </table>

         <p>Note that even a span outer join won't produce a result
         span that covers a period of time that no input span covered,
         as the following diagram indicates.</p>

         <table class="spanop">
           <caption>Holes in span outer join</caption>
           <tr class="times">
             <td>Time ▶</td>
             <td><span>1</span></td>
             <td><span>2</span></td>
             <td><span>3</span><span>4</span></td>
           </tr>
           <tr>
             <td>Breath</td>
             <td colspan="1" class="empty"/>
             <td colspan="1" class="empty"/>
             <td colspan="1">ice</td>
           </tr>
           <tr>
             <td>Color</td>
             <td class="empty"/>
             <td colspan="1">red</td>
             <td colspan="1">green</td>
           </tr>
           <tr class="result-divider">
             <td>Span outer join </td>
           </tr>
           <tr>
             <td>Phenotype</td>
             <td class="empty"/>
             <td colspan="1"><code>NULL</code>-breathing red</td>
             <td colspan="1">ice-breathing green</td>
           </tr>
           <tr class="times">
             <td>Time ▶</td>
             <td><span>1</span></td>
             <td><span>2</span></td>
             <td><span>3</span><span>4</span></td>
           </tr>
         </table>
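         <p>The join semantics above can be sketched in a few lines of
         Python.  This is only an illustrative model, not DCTV's
         implementation: the function name <code>span_join</code> and
         the representation of a span as a <code>(ts, duration,
         value)</code> tuple are assumptions made for the example.  The
         inputs are the "Breath" and "Color" rows from the preceding
         diagram.</p>

```python
def span_join(left, right, outer=False):
    """Join two span tables given as lists of (ts, duration, value).

    Hypothetical model: spans within one table must not overlap.
    Returns a list of (ts, duration, left_value, right_value) tuples,
    using None where DCTV would supply NULL.
    """
    # Every span boundary is a point where the output may change.
    cuts = sorted({t for ts, dur, _ in left + right for t in (ts, ts + dur)})

    def value_at(table, t):
        # Return (covered, value) for the span covering time t, if any.
        for ts, dur, value in table:
            if t >= ts and ts + dur > t:
                return True, value
        return False, None

    out = []
    for start, end in zip(cuts, cuts[1:]):
        in_left, lval = value_at(left, start)
        in_right, rval = value_at(right, start)
        # Inner join: all inputs must cover the region.  Outer join: any.
        keep = (in_left or in_right) if outer else (in_left and in_right)
        if keep:
            out.append((start, end - start, lval, rval))
    return out


breath = [(3, 1, "ice")]
color = [(2, 1, "red"), (3, 1, "green")]

inner = span_join(breath, color)
outer = span_join(breath, color, outer=True)
print(inner)  # [(3, 1, 'ice', 'green')]
print(outer)  # [(2, 1, None, 'red'), (3, 1, 'ice', 'green')]
```

         <p>Neither result contains a span starting at time one, since
         no input span covers that region.</p>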

         <h4>Span broadcast</h4>

         <p>A <dfn>span broadcast</dfn> is a special kind of span join
         that operates on two span tables, one partitioned and one not.
         Normally, DCTV treats each partition within a partitioned span
         table as a separate time series and operates on each
         independently; DCTV refuses to perform span operations on span
         tables partitioned by different columns or between partitioned
         and non-partitioned span tables, since the desired operation
         isn't obvious.</p>

         <p>With a span broadcast, we can tell DCTV to perform a
         special kind of span join between a partitioned and
         non-partitioned table, "broadcasting" the non-partitioned span
         into every partition in the partitioned span table in such a
         way that the result has useful properties.</p>

         <p>The overall result is <emph>almost</emph> as if we copied
         the non-partitioned span table N times, once for each of the N
         partitions, into a new partitioned span table, and then joined
         that new partitioned span table with the partitioned span
         table that we started with.  The difference between this
         hypothetical operation and span broadcast is that span
         broadcast doesn't generate any output spans for regions not
         covered by any span in the partitioned span table, even if
         that region is covered by the non-partitioned span table.</p>

         <p>Another way to think of it is that span broadcast "labels"
         each span in a partitioned span table with the payload of the
         non-partitioned table.  The output of a span broadcast
         operation is partitioned in the same way as its partitioned
         input.</p>

         <p>As usual, a diagram may be illustrative.  Here, "Size#0"
         and "Size#1" indicate two partitions of the same span table,
         "Size" (let's suppose animals 0 and 1 have different size
         spells cast on them).  "Color" is the non-partitioned input
         span table (let's suppose color spells affect all animals at
         the same time).</p>

         <table class="spanop">
           <caption>Sample inputs</caption>
           <tr class="times">
             <td>Time ▶</td>
             <td><span>1</span></td>
             <td><span>2</span></td>
             <td><span>3</span></td>
             <td><span>4</span><span>5</span></td>
           </tr>
           <tr>
             <td>Size #0</td>
             <td colspan="1">tiny</td>
             <td colspan="2">giant</td>
           </tr>
           <tr>
             <td>Size #1</td>
             <td colspan="3">tiny</td>
           </tr>
           <tr>
             <td>Color</td>
             <td colspan="1">red</td>
             <td colspan="1" class="empty" />
             <td colspan="2">green</td>
           </tr>
           <tr class="times">
             <td>Time ▶</td>
             <td><span>1</span></td>
             <td><span>2</span></td>
             <td><span>3</span></td>
             <td><span>4</span><span>5</span></td>
           </tr>
         </table>

         <p>Just like regular span joins, span broadcasts come in
         <dfn>span inner broadcast</dfn> and <dfn>span outer
         broadcast</dfn> varieties, depicted below.  Note that the time
         period from four to five doesn't appear in the result span
         tables, since from time four to time five, we have a span from
         color, the non-partitioned span table, but no spans from size,
         the partitioned span table.</p>

         <table class="spanop">
           <caption>Inner broadcast of color into size</caption>
           <tr class="times">
             <td>Time ▶</td>
             <td><span>1</span></td>
             <td><span>2</span></td>
             <td><span>3</span></td>
             <td><span>4</span><span>5</span></td>
           </tr>
           <tr>
             <td>Size #0</td>
             <td colspan="1">tiny</td>
             <td colspan="2">giant</td>
           </tr>
           <tr>
             <td>Size #1</td>
             <td colspan="3">tiny</td>
           </tr>
           <tr>
             <td>Color</td>
             <td colspan="1">red</td>
             <td colspan="1" class="empty" />
             <td colspan="2">green</td>
           </tr>
           <tr class="result-divider">
             <td>Inner broadcast</td>
           </tr>
           <tr>
             <td>Result#0</td>
             <td colspan="1">tiny red</td>
             <td colspan="1" class="empty" />
             <td colspan="1">giant green</td>
           </tr>
           <tr>
             <td>Result#1</td>
             <td colspan="1">tiny red</td>
             <td colspan="1" class="empty" />
             <td colspan="1">tiny green</td>
           </tr>
           <tr class="times">
             <td>Time ▶</td>
             <td><span>1</span></td>
             <td><span>2</span></td>
             <td><span>3</span></td>
             <td><span>4</span><span>5</span></td>
           </tr>
         </table>

         <table class="spanop">
           <caption>Outer broadcast of color into size</caption>
           <tr class="times">
             <td>Time ▶</td>
             <td><span>1</span></td>
             <td><span>2</span></td>
             <td><span>3</span></td>
             <td><span>4</span><span>5</span></td>
           </tr>
           <tr>
             <td>Size #0</td>
             <td colspan="1">tiny</td>
             <td colspan="2">giant</td>
           </tr>
           <tr>
             <td>Size #1</td>
             <td colspan="3">tiny</td>
           </tr>
           <tr>
             <td>Color</td>
             <td colspan="1">red</td>
             <td colspan="1" class="empty" />
             <td colspan="2">green</td>
           </tr>
           <tr class="result-divider">
             <td>Outer broadcast</td>
           </tr>
           <tr>
             <td>Result#0</td>
             <td colspan="1">tiny red</td>
             <td colspan="1"><code>NULL</code>-colored giant</td>
             <td colspan="1">giant green</td>
           </tr>
           <tr>
             <td>Result#1</td>
             <td colspan="1">tiny red</td>
             <td colspan="1"><code>NULL</code>-colored tiny</td>
             <td colspan="1">tiny green</td>
           </tr>
           <tr class="times">
             <td>Time ▶</td>
             <td><span>1</span></td>
             <td><span>2</span></td>
             <td><span>3</span></td>
             <td><span>4</span><span>5</span></td>
           </tr>
         </table>
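         <p>The two broadcast diagrams can be reproduced with a small
         Python model.  As before, this is a sketch under stated
         assumptions rather than DCTV's implementation:
         <code>span_broadcast</code> is a hypothetical name, a
         partitioned span table is modeled as a dictionary from
         partition key to a list of <code>(ts, duration, value)</code>
         tuples, and <code>None</code> stands in for
         <code>NULL</code>.</p>

```python
def span_broadcast(partitioned, broadcast, outer=False):
    """Broadcast a non-partitioned span table into each partition.

    partitioned: dict mapping partition key to a list of (ts, duration, value)
    broadcast:   list of (ts, duration, value), non-overlapping
    Returns a dict with the same keys; each value is a list of
    (ts, duration, partition_value, broadcast_value) tuples.
    """
    def clip(span, lo, hi):
        # Intersect a span with [lo, hi); None when they don't overlap.
        ts, dur, value = span
        start, end = max(ts, lo), min(ts + dur, hi)
        return (start, end, value) if end > start else None

    result = {}
    for key, spans in partitioned.items():
        rows = []
        for ts, dur, pval in spans:
            cursor = ts  # start of the not-yet-emitted part of this span
            for b in sorted(broadcast):
                piece = clip(b, ts, ts + dur)
                if piece is None:
                    continue
                start, end, bval = piece
                if outer and start > cursor:
                    # Region covered only by the partitioned table.
                    rows.append((cursor, start - cursor, pval, None))
                rows.append((start, end - start, pval, bval))
                cursor = end
            if outer and ts + dur > cursor:
                rows.append((cursor, ts + dur - cursor, pval, None))
        result[key] = rows
    return result


size = {0: [(1, 1, "tiny"), (2, 2, "giant")], 1: [(1, 3, "tiny")]}
color = [(1, 1, "red"), (3, 2, "green")]

inner = span_broadcast(size, color)
outer = span_broadcast(size, color, outer=True)
print(inner[0])  # [(1, 1, 'tiny', 'red'), (3, 1, 'giant', 'green')]
print(outer[1])  # [(1, 1, 'tiny', 'red'), (2, 1, 'tiny', None), (3, 1, 'tiny', 'green')]
```

         <p>Because the loop only ever walks spans of the partitioned
         table, the region from four to five, which is covered by color
         alone, never reaches the output in either mode.</p>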

         <p>In general, we use a span broadcast when we have a number
         of different things happening at the same time (each
         represented by one partition of a span table) and we want to
         "mix into" this span table knowledge of something that affects
         the environment as a whole.</p>
         <aside class="example">We might denote a period of a few
         seconds during which the app com.flashlightco.myflashlight
         starts up in response to a launcher tap.  (This is not an
         efficient flashlight app.)  If we have a table of process
         activity, partitioned by CPU, we can apply a span inner
         broadcast to the process activity table and narrow our view
         of that table to the interval during which the flashlight app
         was starting, but keep the result partitioned by CPU.
         </aside>

         <h4>Span group</h4>

         <p>A <dfn>span group</dfn> operation is the opposite of a span
         join, in a sense.  It merges spans together and applies SQL
         set functions (like <code>MAX</code> and <code>SUM</code>) to
         the payloads of the merged spans, forming for each payload a
         combined value determined through the usual SQL aggregation
         operation.</p>

         <p>Here's a diagram.</p>

         <table class="spanop">
           <tr class="times">
             <td>Time ▶</td>
             <td><span>1</span></td>
             <td><span>2</span></td>
             <td><span>3</span></td>
             <td><span>4</span></td>
             <td><span>5</span></td>
             <td><span>6</span></td>
             <td><span>7</span></td>
             <td><span>8</span><span>9</span></td>
           </tr>
           <tr>
             <td>Number arms</td>
             <td>2</td>
             <td>5</td>
             <td>0</td>
             <td>7</td>
             <td>2</td>
             <td>4</td>
             <td>9</td>
             <td>0</td>
           </tr>
           <tr>
             <td>Periods</td>
             <td colspan="2">A</td>
             <td colspan="2">B</td>
             <td colspan="2">C</td>
             <td colspan="2">D</td>
           </tr>
           <tr class="result-divider"><td>Span group</td></tr>
           <tr>
             <td><code>MAX(arms)</code></td>
             <td colspan="2">5</td>
             <td colspan="2">7</td>
             <td colspan="2">4</td>
             <td colspan="2">9</td>
           </tr>
           <tr>
             <td><code>MIN(arms)</code></td>
             <td colspan="2">2</td>
             <td colspan="2">0</td>
             <td colspan="2">2</td>
             <td colspan="2">0</td>
           </tr>
           <tr class="times">
             <td>Time ▶</td>
             <td><span>1</span></td>
             <td><span>2</span></td>
             <td><span>3</span></td>
             <td><span>4</span></td>
             <td><span>5</span></td>
             <td><span>6</span></td>
             <td><span>7</span></td>
             <td><span>8</span><span>9</span></td>
           </tr>
         </table>

         <p>Here, our hapless sorcerer repeatedly changed the number
         of arms that our poor animal had.  We want to
         determine, based on the record of arm-number changes, for each
         relatively broad interval A, B, C, and D, the minimum and
         maximum number of arms our animal had during that
         interval.</p>

         <p>A span group operation involves two span tables: the
         <dfn>grouped</dfn> table and the <dfn>grouper</dfn> table.
         The grouped table ("number of arms", in our example) supplies
         the source data for the grouping operations; the grouper table
         (here, "periods") supplies spans describing the groups that
         form the output value.  The grouped table may or may not be
         partitioned; if it is partitioned, DCTV applies grouping to
         each partition individually.  The grouper table may not
         currently be partitioned.</p>

         <p>A span group operation always emits one output span for
         each span in its grouper input span table.  If no grouped span
         overlaps with a given grouper span, all its aggregate values
         end up being <code>NULL</code>.  An example follows.</p>

         <table class="spanop">
           <caption>Illustration of span group behavior with missing
           grouped values</caption>
           <tr class="times">
             <td>Time ▶</td>
             <td><span>1</span></td>
             <td><span>2</span></td>
             <td><span>3</span></td>
             <td><span>4</span></td>
             <td><span>5</span></td>
             <td><span>6</span></td>
             <td><span>7</span></td>
             <td><span>8</span><span>9</span></td>
           </tr>
           <tr>
             <td>Number arms</td>
             <td>2</td>
             <td>5</td>
             <td>0</td>
             <td class="empty" />
             <td class="empty" />
             <td class="empty" />
             <td>9</td>
             <td>0</td>
           </tr>
           <tr>
             <td>Periods</td>
             <td colspan="2">A</td>
             <td colspan="2">B</td>
             <td colspan="2">C</td>
             <td colspan="2">D</td>
           </tr>
           <tr class="result-divider"><td>Span group</td></tr>
           <tr>
             <td><code>MAX(arms)</code></td>
             <td colspan="2">5</td>
             <td colspan="2">0</td>
             <td colspan="2"><code>NULL</code></td>
             <td colspan="2">9</td>
           </tr>
           <tr>
             <td><code>MIN(arms)</code></td>
             <td colspan="2">2</td>
             <td colspan="2">0</td>
             <td colspan="2"><code>NULL</code></td>
             <td colspan="2">0</td>
           </tr>
           <tr class="times">
             <td>Time ▶</td>
             <td><span>1</span></td>
             <td><span>2</span></td>
             <td><span>3</span></td>
             <td><span>4</span></td>
             <td><span>5</span></td>
             <td><span>6</span></td>
             <td><span>7</span></td>
             <td><span>8</span><span>9</span></td>
           </tr>
         </table>
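         <p>A minimal Python model of span group, covering the
         missing-data case just shown.  The function name
         <code>span_group</code>, the tuple representation of spans,
         and the output column names are assumptions made for
         illustration; the sketch handles only a non-partitioned
         grouped table and treats any overlap between a grouped span
         and a grouper span as membership in that group.</p>

```python
def span_group(grouped, grouper, aggregates):
    """Aggregate grouped-span payloads over each grouper span.

    grouped:    list of (ts, duration, value)
    grouper:    list of (ts, duration, label)
    aggregates: dict mapping an output name to a function over a list
    Emits exactly one row per grouper span; aggregate values are None
    (standing in for NULL) when no grouped span overlaps the group.
    """
    out = []
    for g_ts, g_dur, label in grouper:
        g_end = g_ts + g_dur
        # Collect the payload of every grouped span overlapping this group.
        values = [value for ts, dur, value in grouped
                  if g_end > ts and ts + dur > g_ts]
        row = {"_ts": g_ts, "_duration": g_dur, "label": label}
        for name, fn in aggregates.items():
            row[name] = fn(values) if values else None
        out.append(row)
    return out


arms = [(1, 1, 2), (2, 1, 5), (3, 1, 0), (7, 1, 9), (8, 1, 0)]
periods = [(1, 2, "A"), (3, 2, "B"), (5, 2, "C"), (7, 2, "D")]

rows = span_group(arms, periods, {"max_arms": max, "min_arms": min})
for row in rows:
    print(row["label"], row["max_arms"], row["min_arms"])
# A 5 2
# B 0 0
# C None None
# D 9 0
```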

         <!-- TODO(dancol): make this paragraph more clear -->
         <p>Span group operations have two flavors: <dfn>span group and
         intersect</dfn> and <dfn>span group and union</dfn>.
         The difference matters only when multiple partitions are
         involved.  In the former case, we include payloads from the
         grouped span table only when all partitions are present in a
         given interval; in the latter case, we include the grouped
         span table in the output spans when any input grouped
         partition is present.</p>

         <aside class="note">If we want the output of a span group to
         include only the regions of time covered by the grouped span
         table, first span join the grouper with the grouped span
         table, then use the result as the grouper span table in the
         span group.</aside>

         <h4>Span departition</h4>

         <p>A <dfn>span departition</dfn> operation transforms a
         partitioned span table into a non-partitioned span table by
         combining the partition payloads with SQL set functions.
         This operation is useful mainly when we have a "split up" view
         of activity on the system and want to derive a whole-system
         view by matching up all the partitions.</p>

         <p>To return to our magical forensics example, imagine our
         apprentice cast some very expensive add-arms-to-animals spells
         on a number of different animals.  We're billed for arms based
         on the total number we're using at any one time (there's a
         license server and everything), so we want to reconstruct,
         based on a record of each animal's arm count, the number of
         arms we were using in total at a particular moment.  In the
         following table, "Arms#0", "Arms#1", and so on denote the
         partitions of a single "Arms" span table.</p>

         <table class="spanop">
           <tr class="times">
             <td>Time ▶</td>
             <td><span>1</span></td>
             <td><span>2</span></td>
             <td><span>3</span></td>
             <td><span>4</span></td>
             <td><span>5</span></td>
             <td><span>6</span></td>
             <td><span>7</span></td>
             <td><span>8</span><span>9</span></td>
           </tr>
           <tr>
             <td>Arms#0</td>
             <td colspan="3">2</td>
             <td colspan="2">7</td>
             <td>4</td>
             <td>9</td>
             <td>0</td>
           </tr>
           <tr>
             <td>Arms#1</td>
             <td colspan="5">2</td>
             <td colspan="3">4</td>
           </tr>
           <tr class="result-divider"><td>Departition</td></tr>
           <tr>
             <td><code>SUM(arms)</code></td>
             <td colspan="3">4</td>
             <td colspan="2">9</td>
             <td colspan="1">8</td>
             <td colspan="1">13</td>
             <td colspan="1">4</td>
           </tr>
           <tr class="times">
             <td>Time ▶</td>
             <td><span>1</span></td>
             <td><span>2</span></td>
             <td><span>3</span></td>
             <td><span>4</span></td>
             <td><span>5</span></td>
             <td><span>6</span></td>
             <td><span>7</span></td>
             <td><span>8</span><span>9</span></td>
           </tr>
         </table>

         <p>A span departition operation resembles a span join of the
         individual partitions followed by an aggregation across them,
         but it's specified separately so that we can work with
         partitioned span tables without knowing in advance how many
         partitions we have or having to expand our queries to work
         with each partition separately.</p>

         <p>Span departitions come in two varieties, the <dfn>span
         departition and union</dfn> and <dfn>span departition and
         intersect</dfn> operations, with the difference concerning the
         treatment of missing data.  The following table gives the
         differences between these approaches.</p>

         <table class="spanop">
           <caption>Arm history with missing data</caption>
           <tr class="times">
             <td>Time ▶</td>
             <td><span>1</span></td>
             <td><span>2</span></td>
             <td><span>3</span></td>
             <td><span>4</span></td>
             <td><span>5</span></td>
             <td><span>6</span></td>
             <td><span>7</span></td>
             <td><span>8</span><span>9</span></td>
           </tr>
           <tr>
             <td>Arms#0</td>
             <td colspan="3" class="empty" />
             <td colspan="2">7</td>
             <td>4</td>
             <td>9</td>
             <td>0</td>
           </tr>
           <tr>
             <td>Arms#1</td>
             <td colspan="5">2</td>
             <td colspan="3" class="empty" />
           </tr>
           <tr class="times">
             <td>Time ▶</td>
             <td><span>1</span></td>
             <td><span>2</span></td>
             <td><span>3</span></td>
             <td><span>4</span></td>
             <td><span>5</span></td>
             <td><span>6</span></td>
             <td><span>7</span></td>
             <td><span>8</span><span>9</span></td>
           </tr>
         </table>

         <p>In intersect mode, we generate an output span for a region
         of time only when <emph>all</emph> partitions have a span
         covering that period.</p>

         <table class="spanop">
           <caption>Departition intersect result</caption>
           <tr class="times">
             <td>Time ▶</td>
             <td><span>1</span></td>
             <td><span>2</span></td>
             <td><span>3</span></td>
             <td><span>4</span></td>
             <td><span>5</span></td>
             <td><span>6</span></td>
             <td><span>7</span></td>
             <td><span>8</span><span>9</span></td>
           </tr>
           <tr>
             <td>Arms#0</td>
             <td colspan="3" class="empty" />
             <td colspan="2">7</td>
             <td>4</td>
             <td>9</td>
             <td>0</td>
           </tr>
           <tr>
             <td>Arms#1</td>
             <td colspan="5">2</td>
             <td colspan="3" class="empty" />
           </tr>
           <tr class="result-divider">
             <td>Departition intersect</td>
           </tr>
           <tr>
             <td><code>SUM(arms)</code></td>
             <td colspan="3" class="empty" />
             <td colspan="2">9</td>
             <td colspan="3" class="empty" />
           </tr>
           <tr class="times">
             <td>Time ▶</td>
             <td><span>1</span></td>
             <td><span>2</span></td>
             <td><span>3</span></td>
             <td><span>4</span></td>
             <td><span>5</span></td>
             <td><span>6</span></td>
             <td><span>7</span></td>
             <td><span>8</span><span>9</span></td>
           </tr>
         </table>

         <p>By contrast, in union mode, we generate an output span when
         <emph>any</emph> partition covers a unit in time.  We treat
         any missing partitions as contributing <code>NULL</code> to
         the output aggregation for each span.  Note that SQL
         aggregation functions just skip <code>NULL</code> values, so
         the sums below are correct.</p>

         <table class="spanop">
           <caption>Departition union result</caption>
           <tr class="times">
             <td>Time ▶</td>
             <td><span>1</span></td>
             <td><span>2</span></td>
             <td><span>3</span></td>
             <td><span>4</span></td>
             <td><span>5</span></td>
             <td><span>6</span></td>
             <td><span>7</span></td>
             <td><span>8</span><span>9</span></td>
           </tr>
           <tr>
             <td>Arms#0</td>
             <td colspan="3" class="empty" />
             <td colspan="2">7</td>
             <td>4</td>
             <td>9</td>
             <td>0</td>
           </tr>
           <tr>
             <td>Arms#1</td>
             <td colspan="5">2</td>
             <td colspan="3" class="empty" />
           </tr>
           <tr class="result-divider">
             <td>Departition union</td>
           </tr>
           <tr>
             <td><code>SUM(arms)</code></td>
             <td colspan="3">2</td>
             <td colspan="2">9</td>
             <td colspan="1">4</td>
             <td colspan="1">9</td>
             <td colspan="1">0</td>
           </tr>
           <tr class="times">
             <td>Time ▶</td>
             <td><span>1</span></td>
             <td><span>2</span></td>
             <td><span>3</span></td>
             <td><span>4</span></td>
             <td><span>5</span></td>
             <td><span>6</span></td>
             <td><span>7</span></td>
             <td><span>8</span><span>9</span></td>
           </tr>
         </table>
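         <p>The two departition modes can be modeled as follows, using
         the arm history with missing data as input.  This is a
         hedged sketch: <code>span_departition</code> and its
         <code>mode</code> parameter are hypothetical names chosen for
         the example, and skipping absent partitions is how the sketch
         mimics SQL aggregates ignoring <code>NULL</code>.</p>

```python
def span_departition(partitions, aggregate, mode="union"):
    """Collapse a partitioned span table into a non-partitioned one.

    partitions: dict mapping partition key to a list of (ts, duration, value)
    aggregate:  function applied to the payloads present in a region
    mode:       "union" emits a span when any partition covers a region,
                "intersect" only when every partition does
    Returns a list of (ts, duration, aggregated_value) tuples.
    """
    # Any span boundary in any partition may change the aggregate.
    cuts = sorted({t for spans in partitions.values()
                   for ts, dur, _ in spans for t in (ts, ts + dur)})
    needed = len(partitions) if mode == "intersect" else 1

    out = []
    for start, end in zip(cuts, cuts[1:]):
        # Payloads of the partitions that cover [start, end).  Absent
        # partitions contribute nothing, like NULL in a SQL aggregate.
        values = [value for spans in partitions.values()
                  for ts, dur, value in spans
                  if start >= ts and ts + dur > start]
        if len(values) >= needed:
            out.append((start, end - start, aggregate(values)))
    return out


arms = {
    0: [(4, 2, 7), (6, 1, 4), (7, 1, 9), (8, 1, 0)],
    1: [(1, 5, 2)],
}

print(span_departition(arms, sum, mode="intersect"))
# [(4, 2, 9)]
print(span_departition(arms, sum, mode="union"))
# [(1, 3, 2), (4, 2, 9), (6, 1, 4), (7, 1, 9), (8, 1, 0)]
```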

         <h3>Trace processing intrinsic functions</h3>

         <p>DCTV aims to be a general-purpose time series analysis
         program, one that just happens to be especially useful for
         processing Android system traces.  Its general approach is to
         avoid system- and metric-specific data processing routines and
         provide general-purpose operators that users can combine to
         analyze data in particular situations.</p>

         <p>The previous section describes operations that DCTV
         provides in the form of query operators.  DCTV also provides
         some operations, usually less common ones, in the form of
         table-valued functions.</p>

         <h4><a name="time_series_to_span_conversion">Time series to span conversion</a></h4>

         <p>Recall that DCTV exposes events from trace files as raw
         data points, in event tables.  We have to build span tables
         from these raw data somehow, and the <a
         href="#time_series_to_spans"><code>time_series_to_spans</code></a>
         table-valued function does exactly that.</p>

         <p><code>time_series_to_spans</code> takes as input a set of
         event sources and a set of output column descriptors and
         produces a span table as output.  Logically, it consumes
         events from the given sources, in time order, and constructs
         spans by watching for "start" and "stop" events as denoted by
         the input sources.  Payload values attached to the event
         sources become payload columns of the output span table
         according to each output column's specification.</p>

         <p>Each source is either a "start-start" source or a "stop"
         source.  The former case models a set of events that divide a
         timeline up into discrete chunks.</p>

         <p>Returning for a moment to our hypothetical wizardly
         apprentice, we recall that an animal's size might change as
         our apprentice casts various "change size" spells on it.
         The raw, event-by-event, record of spells cast by our
         apprentice might look like this.</p>

         <table class="general spantable">
           <caption>Raw size spell record</caption>
           <tr>
             <th>_ts</th>
             <th>size</th>
           </tr>
           <tr>
             <td>1</td>
             <td>tiny</td>
           </tr>
           <tr>
             <td>3</td>
             <td>huge</td>
           </tr>
           <tr>
             <td>4</td>
             <td>large</td>
           </tr>
           <tr>
             <td>6</td>
             <td>huge</td>
           </tr>
         </table>

         <p>Processing this raw event table into spans using
         <code>time_series_to_spans</code>, we end up with a span table
         that looks like this.  (The time scale goes to seven for
         easier comparison with the next example.)</p>

         <table class="spanop">
           <tr class="times">
             <td>Time ▶</td>
             <td><span>1</span></td>
             <td><span>2</span></td>
             <td><span>3</span></td>
             <td><span>4</span></td>
             <td><span>5</span></td>
             <td><span>6</span><span>7</span></td>
           </tr>
           <tr>
             <td>Size</td>
             <td colspan="2">tiny</td>
             <td colspan="1">huge</td>
             <td colspan="2">large</td>
           </tr>
           <tr class="times">
             <td>Time ▶</td>
             <td><span>1</span></td>
             <td><span>2</span></td>
             <td><span>3</span></td>
             <td><span>4</span></td>
             <td><span>5</span></td>
             <td><span>6</span><span>7</span></td>
           </tr>
         </table>

         <aside class="note">The final "huge" spell isn't reflected in
         the output span table, because
         <code>time_series_to_spans</code> ignores spans left "open"
         (i.e., unclosed) at the end of processing.  The intent of this
         feature is to work with span inner join operations to
         automatically ignore noisy partial-data "junk intervals" at
         the beginning and end of traces.  If a need arises,
         <code>time_series_to_spans</code> could be extended in the
         future to automatically close open spans.</aside>

         <p><code>time_series_to_spans</code> also supports "stop"
         events.  These events don't start new spans, but do indicate
         that any open span active at the time of the stop event should
         be finished.  In an operating system context, if
         <code>sched_switch</code> is a start-start event, a CPU
         hotplug off event might be a "stop" event, since it would
         indicate that a CPU has stopped running its current task
         without starting a new one.</p>

         <p>To return to our unfortunate apprentice example, suppose we
         have an additional table of "size reset" spells that we know
         were cast during the sequence of size change spells.  A size
         reset spell just returns a creature to whatever size it had
         without any magical augmentation.  The raw table might look
         something like this.</p>

         <table class="general spantable">
           <caption>Raw size-reset spell record</caption>
           <tr>
             <th>_ts</th>
           </tr>
           <tr>
             <td>5</td>
           </tr>
           <tr>
             <td>7</td>
           </tr>
         </table>

         <p>If we feed both our original size spell record event table
         <emph>and</emph> our size-reset spell table into
         <code>time_series_to_spans</code>, we end up with a span table
         that looks like this.</p>

         <table class="spanop">
           <tr class="times">
             <td>Time ▶</td>
             <td><span>1</span></td>
             <td><span>2</span></td>
             <td><span>3</span></td>
             <td><span>4</span></td>
             <td><span>5</span></td>
             <td><span>6</span><span>7</span></td>
           </tr>
           <tr>
             <td>Size</td>
             <td colspan="2">tiny</td>
             <td colspan="1">huge</td>
             <td colspan="1">large</td>
             <td colspan="1" class="empty" />
             <td colspan="1">huge</td>
           </tr>
           <tr class="times">
             <td>Time ▶</td>
             <td><span>1</span></td>
             <td><span>2</span></td>
             <td><span>3</span></td>
             <td><span>4</span></td>
             <td><span>5</span></td>
             <td><span>6</span><span>7</span></td>
           </tr>
         </table>

         <p>Note the differences: first, we now have a "hole" between
         times five and six, because the stop table told us that we
         stopped changing our poor confused creature's size at time
         five and didn't start changing it again until time six.
         Second, we have a "huge" span from time six to seven, because
         the stop event at time seven closes the span beginning at time
         six, so that span is no longer left open when
         <code>time_series_to_spans</code> finishes.</p>

         <aside class="note">If you want a span table that substitutes
         a concrete value (say, "normal") for the hole, you can combine
         a span outer join of the whole-trace span with
         <code>COALESCE</code> on the payload column to make
         one.</aside>
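         <p>The start-start and stop behavior described above can be
         modeled in Python as follows.  This is a sketch of the
         described semantics only, not the real
         <code>time_series_to_spans</code>, whose actual argument
         syntax is not shown here; <code>events_to_spans</code> and its
         parameters are hypothetical, and the ordering of events that
         share a timestamp is left unspecified.</p>

```python
def events_to_spans(starts, stops=()):
    """Build (ts, duration, value) spans from start-start and stop events.

    starts: list of (ts, value); each closes any open span and opens a new one
    stops:  iterable of ts; each closes any open span without opening one
    A span still open when the events run out is dropped.
    """
    # Tag each event so both sources can be merged into one timeline.
    timeline = sorted([(ts, "start", value) for ts, value in starts] +
                      [(ts, "stop", None) for ts in stops])
    spans = []
    open_span = None  # (ts, value) of the currently open span, if any
    for ts, kind, value in timeline:
        if open_span is not None:
            spans.append((open_span[0], ts - open_span[0], open_span[1]))
        open_span = (ts, value) if kind == "start" else None
    return spans


size_spells = [(1, "tiny"), (3, "huge"), (4, "large"), (6, "huge")]
size_resets = [5, 7]

print(events_to_spans(size_spells))
# [(1, 2, 'tiny'), (3, 1, 'huge'), (4, 2, 'large')]
print(events_to_spans(size_spells, size_resets))
# [(1, 2, 'tiny'), (3, 1, 'huge'), (4, 1, 'large'), (6, 1, 'huge')]
```

         <p>The first call drops the final "huge" spell because nothing
         closes it; the second keeps it and leaves a hole from five to
         six.</p>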

         <p>Each payload column that <code>time_series_to_spans</code>
         generates is described by a "source specification".
         The specification describes, for each output column, the
         source event table from which we get the column's value and
         the "edge" from which we draw the value.  (The edge defaults
         to "rising".)  Using the "rising" edge means that we draw the
         output payload column for a span from the event that started
         the span; using "falling" instead tells
         <code>time_series_to_spans</code> to draw the payload column
         value from the <emph>closing</emph> event.  We typically stick
         with "rising" except in special cases.</p>
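         <p>To make the edge choice concrete, here is a small Python
         sketch that builds spans from start-start events only and
         picks the payload from either the opening or the closing
         event.  The function name <code>spans_with_edge</code> and the
         <code>edge</code> parameter are assumptions for illustration,
         and the sketch reflects one reading of the description above,
         not verified DCTV behavior.</p>

```python
def spans_with_edge(events, edge="rising"):
    """Build (ts, duration, payload) spans from (ts, value) events.

    edge="rising" takes the payload from the event that opens the span;
    edge="falling" takes it from the event that closes the span.
    The span opened by the last event is never closed and is dropped.
    """
    ordered = sorted(events)
    spans = []
    for (ts, opening), (next_ts, closing) in zip(ordered, ordered[1:]):
        payload = opening if edge == "rising" else closing
        spans.append((ts, next_ts - ts, payload))
    return spans


size_spells = [(1, "tiny"), (3, "huge"), (4, "large"), (6, "huge")]

print(spans_with_edge(size_spells, edge="rising"))
# [(1, 2, 'tiny'), (3, 1, 'huge'), (4, 2, 'large')]
print(spans_with_edge(size_spells, edge="falling"))
# [(1, 2, 'huge'), (3, 1, 'large'), (4, 2, 'huge')]
```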

         <p><code>time_series_to_spans</code> supports creating
         partitioned span tables as well; each source specification can
         be associated with a partition column in that source table.
         All sources for a given call to
         <code>time_series_to_spans</code> must be partitioned the same
         way.</p>

         <h4><a name="stackification">Stackification</a></h4>

         <p>Not all raw input events look like a series of start and
         stop events on a timeline.  Another common pattern in raw input is
         the "start-stop stack", in which a series of nested and
         balanced start and stop events describe the erection and
         demolition of a stack of some kind of thing.</p>

         <p>Stacks can be anything: examples include procedure call
         stacks, Android synchronous atrace regions, and nested
         interrupt handlers.  To keep with our hapless-apprentice
         example theme, we'll imagine that spells are prepared by
         simultaneous chanting, waving, and stirring, and that we have
         distinct "start" and "stop" records for each activity.</p>

         <p>Suppose we know at what time our apprentice starts a given
         activity and know at what time an activity ends.  Suppose also
         that our apprentice at least paid enough attention in class to
         understand that one always stops the magical activity one most
         recently started.</p>

         <p>(Note that at time five, a second chant begins even though
         a chant was already ongoing.  A friend must have joined
         in.)</p>

         <table class="general spantable">
           <caption>Spell starts</caption>
           <tr>
             <th>_ts</th>
             <th>activity</th>
           </tr>
           <tr>
             <td>1</td>
             <td>stir</td>
           </tr>
           <tr>
             <td>3</td>
             <td>wave</td>
           </tr>
           <tr>
             <td>4</td>
             <td>chant</td>
           </tr>
           <tr>
             <td>5</td>
             <td>chant</td>
           </tr>
         </table>

         <table class="general spantable">
           <caption>Spell stops</caption>
           <tr>
             <th>_ts</th>
           </tr>
           <tr>
             <td>2</td>
           </tr>
           <tr>
             <td>7</td>
           </tr>
           <tr>
             <td>7</td>
           </tr>
           <tr>
             <td>7</td>
           </tr>
         </table>

         <p>What happens if we rearrange these data into spans?</p>

         <table class="spanop">
           <caption>Notional stackified spells</caption>
           <tr class="times">
             <td>Time ▶</td>
             <td><span>1</span></td>
             <td><span>2</span></td>
             <td><span>3</span></td>
             <td><span>4</span></td>
             <td><span>5</span></td>
             <td><span>6</span><span>7</span></td>
           </tr>
           <tr>
             <td>Effects</td>
             <td colspan="1">[stir]</td>
             <td colspan="1" class="empty" />
             <td colspan="1">[wave]</td>
             <td colspan="1">[wave, chant]</td>
             <td colspan="2">[wave, chant, chant]</td>
           </tr>
           <tr class="times">
             <td>Time ▶</td>
             <td><span>1</span></td>
             <td><span>2</span></td>
             <td><span>3</span></td>
             <td><span>4</span></td>
             <td><span>5</span></td>
             <td><span>6</span><span>7</span></td>
           </tr>
         </table>

         <p>This arrangement makes logical sense, but it isn't quite
         compatible with DCTV's data model.  Note that the value of
         each cell is actually a list!  Unlike some databases, DCTV
         does not support composite (multi-part) values as column
         values.  But here we apparently have composite values in the
         cells.  How do we represent these spans as tables?  By <a
         href="https://en.wikipedia.org/wiki/Database_normalization">
         normalization</a>.</p>

         <table class="general spantable">
           <caption>Stack contents</caption>
           <tr>
             <th>stack_id</th>
             <th>depth</th>
             <th>token</th>
           </tr>
           <tr>
             <td>1</td>
             <td>0</td>
             <td>stir</td>
           </tr>
           <tr>
             <td>2</td>
             <td>0</td>
             <td>wave</td>
           </tr>
           <tr>
             <td>3</td>
             <td>0</td>
             <td>wave</td>
           </tr>
           <tr>
             <td>3</td>
             <td>1</td>
             <td>chant</td>
           </tr>
           <tr>
             <td>4</td>
             <td>0</td>
             <td>wave</td>
           </tr>
           <tr>
             <td>4</td>
             <td>1</td>
             <td>chant</td>
           </tr>
           <tr>
             <td>4</td>
             <td>2</td>
             <td>chant</td>
           </tr>
         </table>

         <table class="spanop">
           <caption>Normalized stackified spells</caption>
           <tr class="times">
             <td>Time ▶</td>
             <td><span>1</span></td>
             <td><span>2</span></td>
             <td><span>3</span></td>
             <td><span>4</span></td>
             <td><span>5</span></td>
             <td><span>6</span><span>7</span></td>
           </tr>
           <tr>
             <td>Stack Id</td>
             <td colspan="1">1</td>
             <td colspan="1" class="empty" />
             <td colspan="1">2</td>
             <td colspan="1">3</td>
             <td colspan="2">4</td>
           </tr>
           <tr class="times">
             <td>Time ▶</td>
             <td><span>1</span></td>
             <td><span>2</span></td>
             <td><span>3</span></td>
             <td><span>4</span></td>
             <td><span>5</span></td>
             <td><span>6</span><span>7</span></td>
           </tr>
         </table>

         <p>Now, we can look up the stack corresponding to each span by
         looking at that span's stack id payload and joining it against
         the stack contents table.  The stackify DCTV intrinsic
         processes any kind of stack into these two tables (the stack
         contents regular table and the "stack history" span
         table).</p>
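
         <p>As an illustration only, the following Python sketch shows
         one way to derive the two tables above from the start and stop
         records.  It is not DCTV's implementation: the function name
         <code>stackify_sketch</code>, the tuple layouts, the rule that
         stops sort before starts at equal timestamps, and the choice
         to skip zero-length spans are all assumptions made for this
         example.</p>

         <aside class="example code-example"><![CDATA[def stackify_sketch(starts, stops):
    """starts: list of (ts, token); stops: list of ts.

    Returns (contents, history), where contents holds
    (stack_id, depth, token) rows and history holds
    (_ts, _duration, stack_id) rows.
    """
    # Merge both inputs into one stream ordered by time.
    # Assumption: at equal timestamps, stops (0) sort before starts (1).
    events = [(ts, 1, token) for ts, token in starts]
    events += [(ts, 0, None) for ts in stops]
    events.sort(key=lambda event: (event[0], event[1]))

    stack = []       # tokens currently on the stack, bottom first
    ids = {}         # tuple(stack) -> stack_id, assigned on first use
    contents = []
    history = []
    prev_ts = None
    for ts, is_start, token in events:
        # Emit a span for the interval since the previous event, but only
        # if the stack was non-empty and the interval has positive length.
        if stack and prev_ts is not None and ts > prev_ts:
            key = tuple(stack)
            if key not in ids:
                ids[key] = len(ids) + 1
                contents.extend(
                    (ids[key], depth, tok) for depth, tok in enumerate(stack))
            history.append((prev_ts, ts - prev_ts, ids[key]))
        if is_start:
            stack.append(token)
        else:
            stack.pop()  # raises IndexError if the input is unbalanced
        prev_ts = ts
    return contents, history


starts = [(1, "stir"), (3, "wave"), (4, "chant"), (5, "chant")]
stops = [2, 7, 7, 7]
contents, history = stackify_sketch(starts, stops)
print(history)  # [(1, 1, 1), (3, 1, 2), (4, 1, 3), (5, 2, 4)]]]></aside>

         <p>For the spell tables above, the <code>contents</code> rows
         match the "Stack contents" table and the <code>history</code>
         rows match the spans in the "Normalized stackified spells"
         diagram.</p>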

         <aside class="note">This approach is admittedly pretty ugly,
         but it works.  If it turns out to be a big enough problem, we
         may just implement support for composite values (which is
         really just this approach under the hood).</aside>

         <h4><a name="span_generation">Generating span tables from thin air</a></h4>

         <p>There are general utility functions to generate specialized
         span tables useful for composing with others.
         The <code>generate_sequential_spans</code> table-valued
         function generates a sequence of spans according to the start
         time, stop time, and duration specified in the call.
         It's useful for generating spans to quantize the timeline into
         discrete intervals and for generating "whole trace" spans that
         act as inputs to span joins.</p>
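
         <p>The idea of quantizing a timeline into sequential spans can
         be sketched in a few lines of Python.  This is a hedged
         illustration, not the signature or behavior of
         <code>generate_sequential_spans</code>: the function name
         <code>sequential_spans_sketch</code>, its parameter names, the
         clipping of a final partial interval, and the treatment of a
         missing duration as "one span for the whole range" are
         assumptions.</p>

         <aside class="example code-example"><![CDATA[def sequential_spans_sketch(start, stop, duration=None):
    """Return (_ts, _duration) pairs covering the range [start, stop)."""
    if duration is None:
        # Assumption: no duration means a single "whole range" span.
        return [(start, stop - start)]
    if duration <= 0:
        raise ValueError("duration must be positive")
    spans = []
    ts = start
    while ts < stop:
        # Assumption: a final partial interval is clipped to the stop time.
        spans.append((ts, min(duration, stop - ts)))
        ts += duration
    return spans


print(sequential_spans_sketch(0, 10, 4))  # [(0, 4), (4, 4), (8, 2)]
print(sequential_spans_sketch(0, 10))     # [(0, 10)]]]></aside>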

         <p>Each trace namespace has a few convenience functions for
         succinctly generating, using
         <code>generate_sequential_spans</code>, certain kinds of span
         tables.  See the <a href="#standard_library">"standard
         library"</a> reference below.</p>

         <h3>Dimensional analysis</h3>

         <p>DCTV provides a dimensional analysis feature.  Each
         quantity in a query is associated with a <dfn>unit</dfn>, and
         these units propagate through the query as it is processed.
         Quantities with different units combine according to the rules
         of <a
         href="https://en.wikipedia.org/wiki/Dimensional_analysis">dimensional
         analysis</a>.  DCTV also knows how to convert from one
         compatible unit to another.  DCTV will signal errors rather
         than produce results that are dimensional nonsense.
         The overall goal of the dimensional analysis feature is to
         make it easy and natural to query traces using
         naturally-specified values and to avoid errors that can arise
         from accidental nonsensical combinations of incompatible
         units.</p>

         <h4>Unit specification</h4>

         <p>Units come from two sources:</p>
         <ol>
           <li>intrinsic tagging of
           quantities with units during trace parsing, and</li>
           <li>explicit
           tagging of quantities with units in query syntax.</li>
         </ol>

         <p>The syntax for specifying a unit is just adding the name of
         the unit after a numeric literal.  For simple alphanumeric
         unit names, a bare word is sufficient, e.g., <code>4ns</code>.
         For more complicated units that contain operators that SQL
         would otherwise interpret as part of expressions, the unit
         name needs to be quoted with backticks, as in
         <code>4`miles/hour`</code>.  Without the backticks, SQL would
         interpret <code>4miles/hour</code> as an attempt to divide the
         quantity <code>4miles</code> by the column <code>hour</code>,
         which is probably not what we want.</p>

         <h4>Unit names</h4>

         <!-- TODO(dancol): include the unit name list here -->
         <p>DCTV understands both common and abbreviated names for
         units.  This document will eventually list all understood unit
         names; for the moment, see <a
         href="https://team.git.corp.google.com/dctv/dctv/+/master/src/dctv/units.txt">units.txt</a>
         in the DCTV source code.</p>

         <h4>Unit conversion</h4>

         <p>Queries can explicitly convert units from one type to
         another using the <code>IN</code> operator. <!-- TODO(dancol):
         link to syntax --></p>

         <aside class="example code-example"><![CDATA[DCTV> SELECT 4`inches` IN cm;
 4 IN cm [cm]
 ------------
        10.16]]></aside>

         <p>In the DCTV REPL, a column header that denotes a quantity
         with a unit lists that unit in square brackets after the
         column name.  Above, we see <code>[cm]</code> at the end of the
         column name, indicating that the <code>10.16</code> is
         specified in terms of centimeters.</p>

         <p>DCTV's unit analysis also understands rates.  In the
         example below, DCTV gives a unit in terms of miles, because
         we're multiplying a rate, in miles per hour, by a unit of
         time.  The time unit here need not be the literal unit used in
         the rate: DCTV will convert units as needed.</p>

         <aside class="example code-example"><![CDATA[DCTV> SELECT 4`miles/hour` * 2`days`;
 (4 * 2) [mi]
 ------------
          192]]></aside>
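
         <p>The arithmetic behind the two REPL examples above can be
         reproduced with a small Python sketch of unit propagation.
         This is not how DCTV implements units: the
         <code>UNITS</code> table and the <code>Quantity</code> class
         are hypothetical names invented for the example, and only a
         handful of units are included.  Each quantity stores its
         magnitude in base units (meters and seconds) along with the
         exponent of each dimension.</p>

         <aside class="example code-example"><![CDATA[from fractions import Fraction

# Hypothetical unit table: name -> (scale to base units, dimension exponents).
UNITS = {
    "cm": (Fraction("0.01"), {"length": 1}),
    "inches": (Fraction("0.0254"), {"length": 1}),
    "miles": (Fraction("1609.344"), {"length": 1}),
    "hour": (Fraction(3600), {"time": 1}),
    "days": (Fraction(86400), {"time": 1}),
}


class Quantity:
    def __init__(self, magnitude, dims):
        self.magnitude = Fraction(magnitude)  # always in base units
        # Drop dimensions whose exponent cancelled to zero.
        self.dims = {name: exp for name, exp in dims.items() if exp != 0}

    @classmethod
    def of(cls, value, unit):
        scale, dims = UNITS[unit]
        return cls(Fraction(value) * scale, dims)

    def _combine(self, other, sign):
        dims = dict(self.dims)
        for name, exp in other.dims.items():
            dims[name] = dims.get(name, 0) + sign * exp
        return dims

    def __mul__(self, other):
        return Quantity(self.magnitude * other.magnitude,
                        self._combine(other, +1))

    def __truediv__(self, other):
        return Quantity(self.magnitude / other.magnitude,
                        self._combine(other, -1))

    def __add__(self, other):
        if self.dims != other.dims:
            raise TypeError("incompatible units")
        return Quantity(self.magnitude + other.magnitude, self.dims)

    def to(self, unit):
        """Return the magnitude expressed in the named unit."""
        scale, dims = UNITS[unit]
        if self.dims != dims:
            raise TypeError("incompatible units")
        return self.magnitude / scale


print(float(Quantity.of(4, "inches").to("cm")))             # 10.16
speed = Quantity.of(4, "miles") / Quantity.of(1, "hour")
print(float((speed * Quantity.of(2, "days")).to("miles")))  # 192.0]]></aside>

         <p>In this sketch, adding a length to a time raises an error
         instead of returning a number, which mirrors the stated goal
         of refusing to produce dimensional nonsense.</p>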


         <h2><a name="differences">Differences from standard SQL</a></h2>

         <h3>Nested namespaces</h3>
         <p>Standard SQL provides a two-level namespace for tables:
         each table is named by an optional schema (followed by a dot),
         and then a table name.  DCTV, by contrast, allows for
         arbitrarily deep nesting of namespaces, with each namespace
         component separated by a period.  (SQL's standard syntax is a
         special case.)  We use the nested namespace syntax to talk
         about specific tables and views embedded in a "trace
         sub-namespace", which we form when we mount a trace into the
         global SQL namespace.</p>

         <h3>Keyword arguments</h3>

         <p>Normal SQL allows only positional arguments to function
         calls.  DCTV allows for Python-style keyword arguments as
         well, with each keyword separated from its value by the
         "=&gt;" token.  See the <a href="#syntax">syntax reference</a>
         for details.</p>

         <h3>Extended table-valued-function-call syntax</h3>
         <p>DCTV exposes some facilities as table-valued functions.
         The arguments to these functions are evaluated in a context
         different from normal SQL expression evaluation, and in this
         context, DCTV supports extended syntax, including the use of
         list and dictionary literals.  (Table-valued functions are
         Python functions and these list and dictionary literals become
         list and dict values inside calls.)  See the syntax reference
         for details.</p>

         <h3>Miscellaneous syntax extensions</h3>

         <p>DCTV is designed to minimize the time users spend fighting
         with the syntax.  Wherever SQL requires a list of something to be
         comma-separated, DCTV allows and ignores a trailing comma.
         Where SQL requires a list terminator (e.g., semicolons after
         each query statement), DCTV allows users to omit the list
         terminator.</p>

         <p>DCTV recognizes <code>&lt;&gt;</code> and the C-style
         <code>!=</code> operators as equivalent.</p>

         <p>DCTV provides the "spaceship" and "anti-spaceship"
         operators <code>&lt;=&gt;</code> and <code>&lt;!=&gt;</code>,
         respectively, which act like <code>==</code> and
         <code>!=</code>, except that they treat <code>NULL</code> as
         being equal to itself.  (MySQL calls these operators "null
         safe comparison operators".)</p>
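
         <p>The difference between ordinary and NULL-safe comparison
         can be sketched in Python, with <code>None</code> standing in
         for <code>NULL</code>.  The function names are hypothetical,
         and the sketch assumes that ordinary equality follows the
         standard SQL rule that any comparison involving
         <code>NULL</code> yields <code>NULL</code>.</p>

         <aside class="example code-example"><![CDATA[def sql_equals(a, b):
    """Ordinary SQL equality: a comparison involving NULL yields NULL."""
    if a is None or b is None:
        return None
    return a == b


def null_safe_equals(a, b):
    """Sketch of <=>: NULL equals NULL and differs from every value."""
    if a is None or b is None:
        return a is None and b is None
    return a == b


def null_safe_not_equals(a, b):
    """Sketch of <!=>: the negation of <=>, so it never yields NULL."""
    return not null_safe_equals(a, b)


print(sql_equals(None, None))         # None
print(null_safe_equals(None, None))   # True
print(null_safe_equals(None, 1))      # False
print(null_safe_not_equals(None, 1))  # True]]></aside>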

         <p>In addition to the standard SQL <code>-- </code> comment
         prefix, DCTV allows the use of <code>#</code> as a
         Python-style comment prefix and the use of <code>/*</code> and
         <code>*/</code> for C-style block comments.</p>

         <h3>Missing features</h3>

         <p>DCTV does not implement some features of more traditional
         databases.  The following table summarizes the features not
         provided, whether we plan to provide them, and any additional
         relevant information.</p>

         <table class="general">
           <tr>
             <th>Feature</th>
             <th>Status</th>
             <th>Comment</th>
           </tr>
           <tr>
             <td>INSERT/UPDATE/DELETE</td>
             <td>Not planned</td>
             <td>DCTV is immutable</td>
           </tr>
           <tr>
             <td>SQL1999 window functions</td>
             <td>Planned</td>
             <td></td>
           </tr>
           <tr>
             <td>SQL/PL</td>
             <td>Planned</td>
             <td>Will be accelerated</td>
           </tr>
           <tr>
             <td>Recursive CTEs</td>
             <td>Planned</td>
             <td></td>
           </tr>
           <tr>
             <td>Correlated subqueries</td>
             <td>Planned</td>
             <td></td>
           </tr>
         </table>
         <h1><a name="syntax">Syntax reference</a></h1>
         <h2>SQL Statement list</h2>
         <p>The REPL accepts statement lists as top-level input.</p>
         <object class="sytax" data="sql-stmt-list.syntax.svg" type="image/svg+xml" />
         <h2>SQL statement</h2>
         <p>A given SQL statement is either a SELECT, which performs a
         query, or one of a few miscellaneous types of data management
         operation.</p>
         <object class="sytax" data="sql-stmt.syntax.svg" type="image/svg+xml" />
         <h2>SELECT</h2>
         <p>A SELECT is a combination of one or more "select core"
         statements (combined together with operators like
         <code>UNION</code>), all sorted and windowed.</p>
         <p>Note that span tables cannot be combined using
         SQL compound operators.</p>
         <p>Common table expressions are "local" views that exist only
         in the context of the following SELECT and
         subsequently-defined common table expressions.  (That is, the
         common table expressions have lexical scope, and the names are
         bound as with <code>let*</code> in Lisp.)</p>
         <object class="sytax" data="select-stmt.syntax.svg" type="image/svg+xml" />
         <h2>Regular select core</h2>

         <p>This diagram shows the syntax for the main body of a
         <code>SELECT</code> statement.  If the keyword
         <code>SPAN</code> appears after the <code>SELECT</code>, the
         "type" of the result of the <code>SELECT</code> is a span
         table; otherwise, it's a regular table.</p>

         <p>In <code>SPAN</code> mode, <code>SELECT</code> always
         includes the special columns <code>_ts</code>,
         <code>_duration</code>, and (if partitioned) the partition
         column in the selected column set.  <code>SELECT SPAN FROM
         ...</code> (with no column list between the <code>SPAN</code>
         and the <code>FROM</code>) indicates selecting
         <emph>only</emph> these special columns.  These special
         columns may not be specified "by hand" in the result-column
         list.</p>

         <p>The <code>table-or-join</code> clause describes the syntax
         for span join operators.  The <code>GROUP...USING SPANS</code>
         syntax describes a span group operation from the data model
         section; the <code>GROUP...USING PARTITION</code> syntax
         describes a span departition operation.  <code>GROUP BY</code>
         works exactly the same way it does in standard SQL.</p>
         <p><code>HAVING</code> in span mode always filters the
         <emph>generated</emph> spans; <code>WHERE</code> filters the
         <emph>inputs</emph> to any span join and grouping operations,
         analogously to the distinction between <code>WHERE</code> and
         <code>HAVING</code> in standard SQL.</p>
         <object class="sytax" data="regular-select.syntax.svg" type="image/svg+xml" />
         <h2>Result column</h2>
         <object class="sytax" data="result-column.syntax.svg" type="image/svg+xml" />
         <h2>Table or join specification</h2>
         <p>This element describes a "column source" from which a
         <code>SELECT</code> draws columns.  It can be a simple table
         name, a call to a table-valued function, a subquery (of which
         <code>VALUES</code> is a special case), or a join operation of
         other column sources.</p>
         <p>A comma joining two table specifications is equivalent to
         <code>INNER JOIN</code>.</p>
         <p><code>AS</code> assigns a local alias to one of these
         column sources, the alias being useful in expressions in the
         result-column clauses.  If a column list comes after the
         <code>AS</code>, the columns of the thing named with
         <code>AS</code> are renamed to match the columns in the column
         list that follows the <code>AS</code>, which must have the
         same length as the set of columns in the named thing.</p>
         <object class="sytax" data="table-or-join.syntax.svg" type="image/svg+xml" />
         <h2>Conventional join</h2>
         <p>A normal SQL join.</p>
         <object class="sytax" data="conventional-join.syntax.svg" type="image/svg+xml" />
         <h2>Span join</h2>
         <p>Describes a span join operation.  The <code>PARTITION
         AS</code> clause provides the name of the partition column in
         the resulting span table, which must be specified if the left
         and right span tables have partition columns with
         different names.</p>
         <object class="sytax" data="span-join.syntax.svg" type="image/svg+xml" />
         <h2>Span broadcast</h2>
         <p>A span broadcast operation.  In the
         <code>BROADCAST...INTO</code> variant, the unpartitioned span
         table is on the left, whereas in the
         <code>BROADCAST...FROM</code> variant, the unpartitioned span
         table is on the right.</p>
         <object class="sytax" data="span-broadcast.syntax.svg" type="image/svg+xml" />
         <h2>Table specification</h2>
         <p>Names a single table, either as a name of a table
         in the table namespace, a call to a table-valued
         function in the table namespace, or a subquery.</p>
         <object class="sytax" data="table-spec.syntax.svg" type="image/svg+xml" />
         <h2>Table-valued function arglist</h2>
         <p>The argument list for a table-valued function call.  Note
         the keyword arguments.</p>
         <object class="sytax" data="tvf-arglist.syntax.svg" type="image/svg+xml" />
         <h2>Table-valued function (TVF) expression</h2>
         <p>The syntax for an expression in TVF context.  Note the dict
         and list literal syntax.  A subquery is also a valid argument
         to a table-valued function!</p>
         <object class="sytax" data="tvf-expr.syntax.svg" type="image/svg+xml" />
         <h2>SQL expression</h2>
         <p>Syntax for expressions that can occur in a query
         outside TVF context.</p>
         <object class="sytax" data="expr.syntax.svg" type="image/svg+xml" />
         <h2>Function call argument list</h2>
         <p>The argument list for a call to a function in SQL
         expression context.  Note the keyword arguments and
         the optional <code>DISTINCT</code> keyword.</p>
         <object class="sytax" data="function-arglist.syntax.svg" type="image/svg+xml" />
         <h2>Data type names</h2>
         <p>List of allowed data type names.</p>
         <object class="sytax" data="type-name.syntax.svg" type="image/svg+xml" />
         <h2>Literal value syntax</h2>
         <p>Literal values can appear either as regular SQL expressions
         or as TVF expressions.  In TVF (i.e., Python) context,
         <code>TRUE</code> becomes <code>True</code>,
         <code>FALSE</code> becomes <code>False</code>, and
         <code>NULL</code> becomes <code>None</code>.</p>
         <object class="sytax" data="literal-value.syntax.svg" type="image/svg+xml" />
         <h2>Bind parameters</h2>
         <p>Represents a parameter substitution in a query.
         Positional <code>?</code> substitutions without explicit
         numbers are numbered automatically.  The numbering starts
         from zero and proceeds left-to-right during parsing, with each
         unnumbered positional substitution receiving a number one
         greater than the previous unnumbered one.  Explicitly numbered
         substitutions do not affect this automatic numbering.</p>
         <object class="sytax" data="bind-parameter.syntax.svg" type="image/svg+xml" />
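         <p>The automatic numbering rule can be sketched in Python as
         follows.  This is an illustration, not DCTV's parser: the
         function name is hypothetical, the sketch assumes that an
         explicitly numbered substitution is written as
         <code>?</code> followed by digits, and it does not skip
         question marks inside string literals or comments.</p>

         <aside class="example code-example"><![CDATA[import re


def number_parameters_sketch(query):
    """Return the substitution number of each ? in left-to-right order."""
    next_auto = 0  # automatic numbering starts from zero
    numbers = []
    for match in re.finditer(r"\?(\d*)", query):
        explicit = match.group(1)
        if explicit:
            # Explicit numbers are used as written and do not advance
            # the automatic counter.
            numbers.append(int(explicit))
        else:
            numbers.append(next_auto)
            next_auto += 1
    return numbers


print(number_parameters_sketch("SELECT ? + ?5 + ?"))  # [0, 5, 1]]]></aside>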
         <h2>Numeric literal</h2>
         <object class="sytax" data="numeric-literal.syntax.svg" type="image/svg+xml" />
         <h2>VALUES list</h2>
         <p><code>VALUES</code> works just as it does in standard SQL
         and allows query authors to include data inline.  DCTV does
         not provide a <code>CREATE TABLE</code> function, but one can
         achieve a similar effect by using <code>CREATE VIEW</code>
         with a <code>VALUES</code> select-part.</p>
         <object class="sytax" data="values-list.syntax.svg" type="image/svg+xml" />
         <div />
         <object class="sytax" data="values-list-row.syntax.svg" type="image/svg+xml" />
         <h2>Common table expression</h2>
         <p>The common table expression part of a SQL query.
         The optional column list in the name performs the same
         column-renaming operation that the optional column list after
         <code>AS</code> does.</p>
         <object class="sytax" data="common-table-expression.syntax.svg" type="image/svg+xml" />
         <h2>Namespace prefix</h2>
         <p>Names a part of the DCTV namespace.</p>
         <object class="sytax" data="ns-prefix.syntax.svg" type="image/svg+xml" />
         <h2>Table namespace name</h2>
         <p>Describes a table in the table namespace.</p>
         <object class="sytax" data="table-ns-name.syntax.svg" type="image/svg+xml" />
         <h2>Table-valued-function name</h2>
         <p>Describes a table-valued-function in the table namespace.</p>
         <object class="sytax" data="tvf-ns-name.syntax.svg" type="image/svg+xml" />
         <h2>SQL function name</h2>
         <p>Describes a SQL function name in the function namespace.
         Note that the function namespace is distinct from the table
         namespace.</p>
         <object class="sytax" data="function-name.syntax.svg" type="image/svg+xml" />
         <h2>CREATE VIEW</h2>
         <p><code>CREATE VIEW</code> works just like it does in
         standard SQL.</p>
         <object class="sytax" data="create-view-stmt.syntax.svg" type="image/svg+xml" />
         <h2>DROP VIEW</h2>
         <p><code>DROP VIEW</code> works just like it does in
         standard SQL.</p>
         <object class="sytax" data="drop-view-stmt.syntax.svg" type="image/svg+xml" />
         <h2>DROP ALL</h2>
         <p><code>DROP ALL</code> drops everything from a prefix
         of the DCTV namespace.  It is useful for "unmounting"
         traces by detaching a trace sub-namespace from the global
         namespace.</p>
         <object class="sytax" data="drop-all-stmt.syntax.svg" type="image/svg+xml" />
         <h2>MOUNT TRACE</h2>
         <p><code>MOUNT TRACE</code> "mounts" a trace file at a prefix
         of the trace namespace.  See the standard library section of
         this manual for a description of what's available under the
         mount prefix.</p>
         <object class="sytax" data="mount-trace-stmt.syntax.svg" type="image/svg+xml" />
         <h2>Ordering term</h2>
         <p>Describes an ordering term in a <code>SELECT</code>.</p>
         <object class="sytax" data="ordering-term.syntax.svg" type="image/svg+xml" />
         <h2>SQL compound operators</h2>
         <p>Ways of combining multiple <code>SELECT</code> "cores".</p>
         <p>Not applicable to span tables.</p>
         <object class="sytax" data="compound-operator-name.syntax.svg" type="image/svg+xml" />
         <h2>Comment syntax</h2>
         <object class="sytax" data="comment-syntax.syntax.svg" type="image/svg+xml" />
         <h2>Operators</h2>
         <p>The following table gives the precedence
         of each operator.  Operators on the same row have
         the same precedence.</p>

         <table class="general" style="width:50%">
           <caption>Operator precedence</caption>
           <tr>
             <td>
               *
               /
               %
               //
             </td>
           </tr>
           <tr>
             <td>
               +
               -
             </td>
           </tr>
           <tr>
             <td>
               &lt;&lt;
               &gt;&gt;
             </td>
           </tr>
           <tr>
             <td>
               &amp;
             </td>
           </tr>
           <tr>
             <td>
               |
             </td>
           </tr>
           <tr>
             <td>
               BETWEEN
             </td>
           </tr>
           <tr>
             <td>
               =
               &lt;=&gt;
               &lt;!=&gt;
               &gt;=
               &gt;
               &lt;=
               &lt;
               !=
               &lt;&gt;
               ==
               IS
             </td>
           </tr>
           <tr>
             <td>
               NOT
             </td>
           </tr>
           <tr>
             <td>
               AND
             </td>
           </tr>
           <tr>
             <td>
               OR
             </td>
           </tr>
         </table>

         <table class="general" style="width:50%">
           <caption>Operator descriptions</caption>
           <tr>
             <th>Operator</th>
             <th>Description</th>
           </tr>
           <tr>
             <td>*</td>
             <td>Multiply</td>
           </tr>
           <tr>
             <td>/</td>
             <td>True division (yields a float)</td>
           </tr>
           <tr>
             <td>%</td>
             <td>Modulus</td>
           </tr>
           <tr>
             <td>//</td>
             <td>Floor division (truncates toward zero)</td>
           </tr>
           <tr>
             <td>+</td>
             <td>Addition</td>
           </tr>
           <tr>
             <td>-</td>
             <td>Subtraction</td>
           </tr>
           <tr>
             <td>&lt;&lt;</td>
             <td>Left shift</td>
           </tr>
           <tr>
             <td>&gt;&gt;</td>
             <td>Right shift</td>
           </tr>
           <tr>
             <td>&amp;</td>
             <td>Bitwise AND</td>
           </tr>
           <tr>
             <td>|</td>
             <td>Bitwise OR</td>
           </tr>
           <tr>
             <td>BETWEEN</td>
             <td>Standard SQL</td>
           </tr>
           <tr>
             <td>= ==</td>
             <td>Equality</td>
           </tr>
           <tr>
             <td>&lt;=&gt; IS</td>
             <td>NULL-safe equality</td>
           </tr>
           <tr>
             <td>&lt;!=&gt; IS NOT</td>
             <td>NULL-safe inequality</td>
           </tr>
           <tr>
             <td>&gt;=</td>
             <td>Greater than or equal</td>
           </tr>
           <tr>
             <td>&gt;</td>
             <td>Greater than</td>
           </tr>
           <tr>
             <td>&lt;=</td>
             <td>Less than or equal</td>
           </tr>
           <tr>
             <td>&lt;</td>
             <td>Less than</td>
           </tr>
           <tr>
             <td>!= &lt;&gt;</td>
             <td>Inequality</td>
           </tr>
           <tr>
             <td>NOT</td>
             <td>Logical negation</td>
           </tr>
           <tr>
             <td>AND</td>
             <td>Logical conjunction</td>
           </tr>
           <tr>
             <td>OR</td>
             <td>Logical disjunction</td>
           </tr>
         </table>

         <h1><a name="standard_library">Trace standard library</a></h1>
         <!-- TODO(dancol): document columns -->
         <p>DCTV queries operate on a single shared per-session
         namespace.  (Each leaf level of the namespace has distinct
         mappings for table-valued things and for SQL functions, but
         the non-leaf nodes are shared between the namespaces.)</p>
         <h2>Per-trace names</h2>
         <p>DCTV makes a trace available by "mounting" it under a
         namespace prefix.  Names beginning with this prefix then refer
         to the trace that was mounted.  Multiple traces can be mounted
         in the same session, and a single query can pull data from
         multiple traces.</p>
         <p>In the description below, <code>mytrace.</code>
         refers to an arbitrary trace mountpoint.</p>
         <dl>
           <dt><code>mytrace.raw_events.*</code></dt>
           <dd>
             <p>These tables provide access to the raw events embedded
             in the trace.  For example,
             <code>mytrace.raw_events.sched_switch</code> is a table of
             <code>sched_switch</code> events, with one column for each
             field in the ftrace event.</p>
             <p>The DCTV event parser has a special case for
             <code>trace_marker_write</code> events: we put each one in
             a table formed by the concatenation of the event name and
             the first part of the event payload.  For example,
             <code>`print|B`</code> refers to those
             <code>trace_marker_write</code> events that begin with the
             prefix <code>B|</code>, indicating the start of a
             synchronous application-defined trace event.  We need to
             write <code>mytrace.raw_events.`print|B`</code> instead of
             <code>mytrace.raw_events.print|B</code> because
             <code>|</code> is normally an operator, so to treat it as
             part of a table name, we need to escape it with
             backticks.</p>
             <aside class="note">This special case for
             <code>trace_marker_write</code> is an ugly hack that exists
             so that we can give a different "schema" (set of columns) to
             each kind of <code>trace_marker_write</code> event depending
             on its payload.</aside>
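             <p>As a minimal sketch, the queries below read raw events
             directly.  <code>mytrace</code> is the placeholder
             mountpoint used throughout this section, and both queries
             assume a trace that actually contains these event
             types.</p>
             <aside class="example code-example"><![CDATA[SELECT * FROM mytrace.raw_events.sched_switch]]></aside>
             <aside class="example code-example"><![CDATA[SELECT * FROM mytrace.raw_events.`print|B`]]></aside>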
           </dd>
           <dt><code>mytrace.scheduler.timeslices_p_cpu</code></dt>
           <dd>
             <p>This table is a span table partitioned by CPU
             representing the scheduler activity of the system.</p>
           </dd>
           <dt><code>mytrace.scheduler.cpufreq_p_cpu</code></dt>
           <dd>
             <p>This table is a span table partitioned by CPU
             representing the CPU frequency that each CPU is
             known to have.</p>
           </dd>
           <dt><code>mytrace.last_ts</code></dt>
           <dd>
             <p>Single-column, single-value event table giving the
             largest timestamp found in the trace.  It's useful for
             building spans that cover the whole trace, but see
             <code>quantize</code> immediately below.</p>
           </dd>
           <dt><code>mytrace.quantize(interval=&gt;NULL)</code></dt>
           <dd>
             <p>This table-valued function generates a payloadless span
             table that divides the trace timeline into fixed-size
             spans of duration <code>interval</code>.  This table is
             useful for quantizing the trace timeline into fixed-size
             blocks for display or analysis, and is designed to work
             with span group operations.</p>
             <p>If <code>interval</code> is <code>NULL</code>, the
             function generates a span table with a single span covering
             the whole trace.</p>
             <aside class="example code-example"><![CDATA[SELECT SPAN SUM(_duration)/5s AS non_idle_ratio
 FROM (SELECT SPAN * FROM my_cpu_timeslices WHERE pid != 0)
 GROUP USING SPANS FROM mytrace.quantize(5s) ]]></aside>
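             <p>The <code>NULL</code> case can be sketched the same way.
             Assuming the same hypothetical
             <code>my_cpu_timeslices</code> span table as in the example
             above, a query along these lines would total non-idle time
             over the whole trace rather than per five-second
             block:</p>
             <aside class="example code-example"><![CDATA[SELECT SPAN SUM(_duration) AS non_idle_time
 FROM (SELECT SPAN * FROM my_cpu_timeslices WHERE pid != 0)
 GROUP USING SPANS FROM mytrace.quantize(interval=>NULL) ]]></aside>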
           </dd>
         </dl>

         <h2>The DCTV namespace</h2>

         <p>DCTV-specific query functions live under the
         <code>dctv.</code> namespace prefix.</p>

         <dl>
           <dt><code><dfn><a name="time_series_to_spans">
             dctv.time_series_to_spans(*, sources, columns, partition=&gt;NULL)
           </a></dfn></code></dt>
           <dd>
             <p>This function implements the time series to span
             conversion operation described <a
             href="#time_series_to_span_conversion">above</a>.</p>
             <p><code>sources</code> is a list of source
             specifications.  Each source specification is a dict with
             the following entries; entries are optional unless
             otherwise indicated.  As a convenience, a source
             specification can also be a list, the elements of which
             are turned into dict elements in the order given below.
             If a source specification is neither a dict nor a list, it
             is treated as if it were a dict with only the source
             element provided.  (This way, a bare table is a valid
             event source.)</p>
             <dl>
               <dt>source</dt>
               <dd>The event table providing the raw events that this routine
               turns into spans. Mandatory.</dd>
               <dt>role</dt>
               <dd>Either <code>"start"</code> or <code>"stop"</code>, defaulting to
               <code>"start"</code>.  A <code>"start"</code> source both starts and
               separates output spans; a <code>"stop"</code> source only ends spans
               that have already been started.</dd>
               <dt>partition</dt>
               <dd>Either a string naming the column by which this
               source is partitioned or <code>NULL</code>, indicating
               that the source is unpartitioned.  Defaults to
               <code>NULL</code>.</dd>
               <dt>timestamp</dt>
               <dd>The name of the column in the source providing the timestamp.
               Defaults to <code>"_ts"</code>.</dd>
               <dt>nickname</dt>
               <dd>An optional string assigning a name to this source that column
               specifications in <code>columns</code> can reference.</dd>
             </dl>
             <p><code>columns</code> is a list of column
             specifications, each representing one payload column in
             the generated span table.</p>

             <p>Each column specification is a dict with the elements
             below.  As a convenience, a column specification can also
             be a list, the elements of which are turned into dict
             elements in the order given below.  If a column
             specification is neither a dict nor a list, it must be a
             string, and it is treated as if it were a dict with only
             the column element set.  (This way, a simple string is a
             valid column descriptor in the case that we have only one
             source.)</p>

             <dl>
               <dt>column</dt>
               <dd>String naming the output column in the generated
               span table.  Mandatory.</dd>
               <dt>source</dt>
               <dd>Identifies the source that supplies this output
               column.  May be omitted when only one source is given to
               the call; otherwise, must either be a number (naming a
               source positionally) or a string (matching the nickname
               given to a source in its specification).</dd>
               <dt>source_column</dt>
               <dd>Name of the column in the source event table that
               supplies the value of the corresponding column in the
               output table.  Defaults to the name of the output
               column.</dd>
               <dt>edge</dt>
               <dd>Either <code>"rising"</code> or
               <code>"falling"</code>, defaulting to the former.
               Determines which event supplies the value of the column
               in the output table: the event that starts a span or the
               event that ends a span.</dd>
             </dl>
             <p><code>partition</code> is the name of the partition
             column in the output span table.  If it is specified, all
             sources must have their own partitions specified.  If it
             is not, then no source may be partitioned.</p>

             <aside class="example code-example"><![CDATA[SELECT SPAN * FROM dctv.time_series_to_spans(
   sources=>[my_raw_events_table],
   columns=>["foo", "bar", "qux"],
   )]]></aside>

             <aside class="example code-example"><![CDATA[SELECT SPAN * FROM dctv.time_series_to_spans(
   sources=>[{source=>table1, partition=>"cpu", nickname=>"foo"},
             {source=>table2, partition=>"cpu", role=>"stop"}],
   columns=>[{column=>"total_things",
              source=>"foo",
              source_column=>"last_things",
              edge=>"falling"}],
   partition=>"cpu")]]></aside>
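             <p>The list shorthand described above can be sketched as
             follows.  This call mirrors the dict-form example, assuming
             list elements are assigned in the documented order
             (<code>source</code>, <code>role</code>,
             <code>partition</code>, <code>timestamp</code>,
             <code>nickname</code> for sources; <code>column</code>,
             <code>source</code>, <code>source_column</code>,
             <code>edge</code> for columns).  <code>table1</code> and
             <code>table2</code> are placeholder event tables, and
             <code>"_ts"</code> is written out only so that the nickname
             can be given positionally.</p>
             <aside class="example code-example"><![CDATA[SELECT SPAN * FROM dctv.time_series_to_spans(
   sources=>[[table1, "start", "cpu", "_ts", "foo"],
             [table2, "stop", "cpu"]],
   columns=>[["total_things", "foo", "last_things", "falling"]],
   partition=>"cpu")]]></aside>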
             <p>This routine looks pretty ugly when called.  Most of
             the time, you want to use one of the pre-defined span
             tables in the <a href="#standard_library">standard
             library</a>, which call <code>time_series_to_spans</code>
             for you.</p>
           </dd>
           <dt><dfn><code>dctv.stack_history()</code></dfn></dt>
           <dd>
             <p>This table-valued function understands "nested" events,
             turning them into stacks for further analysis.  It
             generates a span table mapping time intervals to stack
             IDs.</p>
             <p>See the <a href="#stackification">stackification</a>
             sub-section of the data model section.</p>
             <!-- TODO(dancol): document argument list -->
           </dd>
           <dt><dfn><code>dctv.stack_contents()</code></dfn></dt>
           <dd>
             <p>This table-valued function generates the
             <em>contents</em> of the stack IDs generated by
             <code>dctv.stack_history()</code>.</p>
             <p>See the <a href="#stackification">stackification</a>
             sub-section of the data model section.</p>
           </dd>
           <dt><dfn><code>dctv.generate_sequential_spans(start, stop, duration)</code></dfn></dt>
           <dd>
             <p>This table-valued function generates "synthetic" spans
             useful for a variety of purposes.  See the <a
             href="#span_generation">span generation</a> sub-section of
             the data model section above.</p>
             <p><code>start</code> is a timestamp at which the spans
             should start.  <code>stop</code> is the time at which the
             last span should end.  <code>duration</code> is the length
             of each generated span.  Output spans are generated with
             no gaps.</p>
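             <p>As a rough sketch, the call below would produce ten
             back-to-back one-second spans covering the first ten
             seconds of the timeline.  It assumes that the arguments can
             be passed positionally and that timestamps accept the same
             unit-suffixed literals used for durations elsewhere in this
             manual; neither is confirmed here.</p>
             <aside class="example code-example"><![CDATA[SELECT SPAN * FROM dctv.generate_sequential_spans(0s, 10s, 1s)]]></aside>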
           </dd>
         </dl>

         <h1><a name="example">Worked example</a></h1>
         <p>If you have read the manual above, the following query should make sense.</p>
         <p><code>TODO(dancol):</code> expand this section.</p>
         <ol>
           <li>Extract from “print|B” a list of frame-start events.</li>
           <li>Take these events and, using the “start-start” mode of
           <code>time_series_to_spans</code>, assemble a set of spans
           partitioning the trace timeline into frames.</li>
           <li>Select those frame-spans that lasted longer than 17ms,
           i.e., that took a long time to render.</li>
           <li>Intersect this bad-frame span set with the per-processor
           span table describing what the system is actually
           doing. Don’t consider the idle process.</li>
         </ol>
         <code class="blockquote"><![CDATA[WITH frames AS (SELECT SPAN * FROM dctv.time_series_to_spans(
                       sources=>[{source=>(SELECT * FROM trace.raw_events.`print|B`
                                           WHERE name='eglBeginFrame'),
                                  timestamp=>'ts'}],
                       columns=>[])),
            bad_frames AS (SELECT SPAN * FROM frames WHERE _duration > 17ms),
            bad_timeslices AS (SELECT SPAN * FROM trace.scheduler.timeslices_p_cpu
                                        SPAN BROADCAST FROM bad_frames)
       SELECT comm, cpu, SUM(_duration) AS totdur FROM bad_timeslices
       WHERE pid != 0
       GROUP BY comm, cpu
       ORDER BY totdur DESC
       LIMIT 20
         ]]></code>
       </main>
     </div>
   </body>
 </html>