| <?xml version='1.0' encoding='utf-8'?> |
| <!DOCTYPE html5> |
| <html lang="en" xmlns="http://www.w3.org/1999/xhtml"> |
| <!-- N.B. Sentences in this document are double-spaced so that Emacs |
| sentence-editing functions work more reliably. --> |
| <head> |
| <title>DCTV trace analysis system</title> |
| <link rel="stylesheet" href="reset.css" /> |
| <link rel="stylesheet" href="styles.css" /> |
| </head> |
| <body> |
| <div id="container"> |
| <header id="header">DCTV</header> |
| <nav id="sidebar"> |
| <ul> |
| <li> |
| <a href="#introduction">Introduction</a> |
| <ul> |
| <li><a href="#quickstart">Quick start</a></li> |
| <li><a href="#background">Background</a></li> |
| <li><a href="#conventions">Document conventions</a></li> |
| <li><a href="#datamodel">Data model</a></li> |
| <li><a href="#differences">Differences from standard SQL</a></li> |
| </ul> |
| </li> |
| <li><a href="#syntax">Syntax reference</a></li> |
| <li><a href="#standard_library">Standard library</a></li> |
| <li><a href="#example">Worked example</a></li> |
| </ul> |
| </nav> |
| <main id="manual"> |
| <h1><a name="introduction">Introduction</a></h1> |
| <p>DCTV is a data exploration toolkit designed for both |
| interactive and batch analysis of trace files and other |
| heterogeneous time series data. It's designed to answer |
| complex questions about the sort of data that one frequently finds in |
| records of system activity.</p> |
| <p>Important features of DCTV are:</p> |
| <ul> |
| <li>SQL1999 querying of trace files</li> |
| <li>specialized relational algebra and SQL syntax for time series</li> |
| <li>comprehensive dimensional analysis for unit conversion |
| and error detection</li> |
| <li>support for analyzing very large (larger than memory) |
| trace files</li> |
| <li>powerful GUI for interactive trace exploration</li> |
| </ul> |
| <p>Use cases include:</p> |
| <ul> |
| <li>examining CPU time spent by a particular application</li> |
| <li>examining CPU time spent in <emph>part</emph> of an |
| application</li> |
| <li>examining memory activity of the whole system to determine |
| what caused a game to miss a frame deadline</li> |
| <li>finding which functions cause the most page faults |
| during app startup</li> |
| <li>tracking down slow memory leaks</li> |
| <li>finding why a real-time thread took too long to run and |
| poll a device</li> |
| <li>bulk analysis of traces from production to extract metrics for a |
| dashboard</li> |
| </ul> |
| <p>DCTV is a "power user" tool: using it effectively requires |
| an understanding of both the system components that generate |
| the trace events being queried and SQL-like declarative |
| query systems. This document aims to |
| describe and document DCTV's functionality, walk through a few |
| examples of trace analysis, and invite the reader to |
| investigate further.</p> |
| <h2><a name="quickstart">Quick start</a></h2> |
| <aside class="warning"> |
| DCTV is under active development and is not yet stable. |
| It also currently runs on Linux systems only; a port to |
| macOS is underway. See the <a href="go/dctv-db-design">DB |
| design document</a> for further information on checking out and |
| building the source code. |
| </aside> |
| <h3>Getting DCTV</h3> |
| <ol> |
| <li>Be running gLinux (we'll port eventually)</li> |
| <li><code>git clone sso://team/dctv/dctv</code></li> |
| <li><code>make dev</code></li> |
| <li>follow prompts; install dependencies</li> |
| <li>while the build is broken, complain to dancol@, goto 2</li> |
| <li><code>./dctv</code></li> |
| </ol> |
| <h3>Hello world</h3> |
| <code class="blockquote"><![CDATA[$ ./dctv repl mytrace=mytrace.ftrace |
| Type .help for help. |
| DCTV> SELECT COUNT(*) FROM mytrace.scheduler.timeslices_p_cpu; |
| COUNT() |
| ------- |
| 32362 |
| ]]></code> |
| <h2><a name="background">Background</a></h2> |
| <blockquote> |
| Life is just one damned thing after another. |
| <cite>Arnold J. Toynbee</cite> |
| </blockquote> |
| <h3>Purpose of DCTV</h3> |
| <p>A trace file by itself is of limited utility: it's |
| gigabytes of detailed, low-level records of system activity. |
| When we analyze a trace file, what we really want to do is |
| <emph>pose questions</emph> to that trace file and get back |
| meaningful answers. The information we want lies in the |
| non-trivial <emph>relationships</emph> between trace events, |
| the relationships between relationships, and so on, in a way |
| that puts limits on the kind of trace analysis that it's |
| possible to do using ad-hoc analysis of trace |
| events themselves.</p> |
| <p>After we pose questions to a trace file and get answers, we |
| frequently want to use these answers as the basis for further |
| questions. In this way, we gradually increase the level of |
| abstraction of our analysis, moving from questions posed in |
| terms of raw trace events to ones posed in terms of the |
| problem we're actually trying to solve.</p> |
| <p>DCTV is a question-answering machine. By incrementally |
| constructing queries and then querying against them (for |
| example, using the <code>WITH</code> construction), users |
| extract increasingly abstract data from trace files, data not |
| directly represented by discrete and specific low-level events |
| in a trace. The SQL REPL and the GUI both provide |
| information-querying capabilities.</p> |
| <p>DCTV also provides a <a href="#standard_library">standard |
| library</a> of ready-made building blocks that users can query |
| during trace analysis.</p> |
| <h3>Other trace analysis tools</h3> |
| <p>DCTV is not the first such tool for trace analysis. |
| It integrates the best parts of WPA, LISA, and Perfetto's |
| trace analysis models.</p> |
| <code>TODO(dancol): flesh out this section</code> |
| <h2><a name="conventions">Document conventions</a></h2> |
| <p>This document currently assumes the reader is familiar with |
| the basics of SQL and the basics of trace processing, focusing |
| on DCTV's specific features in this area.</p> |
| <h3>Time tables</h3> |
| <p>Some figures below are "time tables" (they have "Time ▶" in |
| the upper-left). They represent timelines, where each row in |
| the table is a separate and independent data series. |
| Some tables represent operands and results; in this case, a |
| thick black line separates the input rows and output rows.</p> |
| |
| <h3>Function signatures</h3> |
| <p>Table-valued function signatures are given in Python |
| syntax, with a bare <code>*</code> signifying that all |
| arguments following the <code>*</code> are keyword-only and |
| cannot be specified positionally. (That is, if a function |
| signature is <code>foo(*, bar=7)</code>, then you have to |
| write either <code>foo()</code> (using <code>bar</code>'s |
| default value) or <code>foo(bar=>5)</code> (specifying an |
| explicit value for the keyword argument); you can't write |
| <code>foo(1)</code>, because <code>bar</code> can't be |
| specified positionally.)</p> |
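| <p>Because these signatures borrow Python's notation, the rule can |
| be tried out directly in Python. The sketch below defines a |
| hypothetical <code>foo</code> with one keyword-only argument; it is |
| an illustration only, not part of DCTV. Note that Python itself |
| spells the keyword form <code>bar=5</code>, whereas the DCTV form |
| described above is <code>bar=>5</code>.</p> |

```python
def foo(*, bar=7):
    """Hypothetical function with one keyword-only argument."""
    return bar


default_result = foo()        # uses bar's default value
explicit_result = foo(bar=5)  # Python writes the keyword form as bar=5

try:
    foo(1)                    # positional use is rejected
    positional_error = None
except TypeError as exc:
    positional_error = exc
```

| <p>Here the positional call raises <code>TypeError</code>, which is |
| Python's way of enforcing the same restriction.</p> |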
| <h2><a name="datamodel">Data model</a></h2> |
| <p>DCTV is designed around querying one or more trace files |
| using SQL queries. DCTV performs no hardcoded pre-processing |
| of trace files: we model each event in a trace file as a row |
| of the "raw events" table corresponding to that event's type. |
| Each field in an event is a column in that event's table; |
| users extract higher-level information from these low-level |
| events by defining views in terms of these low-level events. |
| By querying the views, users can extract higher-level trace |
| events; users can also define views in terms of other views to |
| answer more abstract questions.</p> |
| <h3>Table types</h3> |
| <p>DCTV's query engine provides the tables and set functions |
| that any SQL system provides, but extends these facilities |
| with a set of operators and functions dedicated to working |
| with heterogeneous time series. Tables in DCTV are |
| first-class <emph>typed</emph> objects: tables are either |
| regular tables, span tables, or event tables. Each type of |
| table has a set of query operations that it supports; DCTV |
| provides functions to convert one type of table to another as |
| needed.</p> |
| |
| <aside class="note">It's always possible to "view" one of |
| DCTV's special table types as a regular table by just using |
| regular table operations (like the non-<code>SPAN</code> |
| variant of <code>SELECT</code>) on it. The result of any of |
| these non-special operations is itself a regular |
| table.</aside> |
| |
| <p>This table summarizes the special operations DCTV supports. |
| Don't worry if you don't recognize some of these terms (like |
| "partitioned span table"): they're defined below.</p> |
| |
| <table class="general"> |
| <tr> |
| <th>Operation</th> |
| <th>Left operand</th> |
| <th>Right operand</th> |
| <th>Result</th> |
| </tr> |
| <tr> |
| <td>SELECT</td> |
| <td>Regular table</td> |
| <td>N/A</td> |
| <td>Regular table</td> |
| </tr> |
| <tr> |
| <td>SELECT</td> |
| <td>Span table</td> |
| <td>N/A</td> |
| <td>Regular table</td> |
| </tr> |
| <tr> |
| <td>SELECT SPAN</td> |
| <td>Span table</td> |
| <td>N/A</td> |
| <td>Span table</td> |
| </tr> |
| <tr> |
| <td>SPAN JOIN</td> |
| <td>Unpartitioned span table</td> |
| <td>Unpartitioned span table</td> |
| <td>Unpartitioned span table</td> |
| </tr> |
| <tr> |
| <td>SPAN BROADCAST INTO</td> |
| <td>Unpartitioned span table</td> |
| <td>Partitioned span table</td> |
| <td>Partitioned span table</td> |
| </tr> |
| <tr> |
| <td>SPAN BROADCAST FROM</td> |
| <td>Partitioned span table</td> |
| <td>Unpartitioned span table</td> |
| <td>Partitioned span table</td> |
| </tr> |
| <tr> |
| <td>GROUP USING PARTITION</td> |
| <td>Partitioned span table</td> |
| <td>N/A</td> |
| <td>Unpartitioned span table</td> |
| </tr> |
| <tr> |
| <td>GROUP USING SPANS FROM</td> |
| <td>Partitioned span table</td> |
| <td>Unpartitioned span table</td> |
| <td>Partitioned span table</td> |
| </tr> |
| <tr> |
| <td>GROUP USING SPANS FROM</td> |
| <td>Unpartitioned span table</td> |
| <td>Unpartitioned span table</td> |
| <td>Unpartitioned span table</td> |
| </tr> |
| </table> |
| |
| <p>A <dfn>regular SQL table</dfn> is essentially a list of |
| points in high-dimensional space, with each column in the |
| table representing one dimension along which a point can |
| vary.</p> |
| |
| <p>A <dfn>span table</dfn> represents data that vary over the |
| time dimension. An interval of time over which the data in a |
| span table remain the same is called a <dfn>span</dfn>. |
| The collection of time-varying data described by a span table |
| is the <dfn>payload</dfn> of that span table.</p> |
| <!-- TODO(dancol): talk about different time basis? --> |
| <p>All span tables have two special columns: |
| <dfn><code>_ts</code></dfn> and |
| <dfn><code>_duration</code></dfn>. <code>_ts</code> is an |
| <code>INT64</code> timestamp, in nanoseconds since the start |
| of the trace. <code>_duration</code> is a non-zero |
| <code>INT64</code> number of nanoseconds that the span covers. |
| (That is, the span describes the region of time |
| [<code>_ts</code>, <code>_ts</code> + |
| <code>_duration</code>].)</p> |
| |
| <p><code>_ts</code> and <code>_duration</code> are always |
| non-<code>NULL</code>, and a span table is always ordered by |
| increasing values of <code>_ts</code>. Spans in a span table |
| cannot "overlap": a span must end either before or at exactly |
| the same time as the next span begins. (Spans from different |
| partitions may overlap, however: see immediately below.) A |
| span table need not be contiguous: that is, it's legal for |
| gaps to exist between spans.</p> |
| |
| <p>For example, imagine that you're looking at a Christmas |
| tree light that changes color in time with music. We might |
| describe the color of the light using spans. The following |
| diagram depicts how we might use spans to describe the light's |
| state. Each pair of numbers (one above the table, one below) |
| indicates the time corresponding to the vertical line connecting |
| them.</p> |
| |
| <table class="spanop"> |
| <caption>Light color</caption> |
| <tr class="times"> |
| <td>Time ▶</td> |
| <td><span>1</span></td> |
| <td><span>2</span></td> |
| <td><span>3</span></td> |
| <td><span>4</span><span>5</span></td> |
| </tr> |
| <tr> |
| <td>Color</td> |
| <td colspan="2">Red</td> |
| <td colspan="1" class="empty" /> |
| <td colspan="1">Green</td> |
| </tr> |
| <tr class="times"> |
| <td>Time ▶</td> |
| <td><span>1</span></td> |
| <td><span>2</span></td> |
| <td><span>3</span></td> |
| <td><span>4</span><span>5</span></td> |
| </tr> |
| </table> |
| |
| <p>Here, the light was red from time one to time three and |
| then green from time four to time five, inclusive. (From time |
| three to time four, the light was off; we're choosing to |
| represent "off" as the absence of a span, but an equally valid |
| choice would be to make a span with a special "Off" value for |
| the color.)</p> |
| |
| <p>It's useful to look at the physical table representation |
| of the above set of spans.</p> |
| |
| <table class="general spantable"> |
| <caption>Light color (span table representation)</caption> |
| <tr> |
| <th>_ts</th> |
| <th>_duration</th> |
| <th>color</th> |
| </tr> |
| <tr> |
| <td>1</td> |
| <td>2</td> |
| <td>red</td> |
| </tr> |
| <tr> |
| <td>4</td> |
| <td>1</td> |
| <td>green</td> |
| </tr> |
| </table> |
| |
| <p>Note that one row in the physical table representation of a |
| span table corresponds to one <emph>logical</emph> span.</p> |
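| <p>The ordering and non-overlap rules described above can be |
| expressed as a small check over the physical rows. This is a |
| minimal Python sketch, not DCTV code: the function name |
| <code>check_span_table</code> is hypothetical, and it assumes that |
| each span covers the half-open interval starting at |
| <code>_ts</code> and ending just before <code>_ts</code> + |
| <code>_duration</code>, and that durations are positive.</p> |

```python
def check_span_table(rows):
    """Return a list of problems found in rows of (_ts, _duration, payload).

    Assumption: each span covers the half-open interval
    [_ts, _ts + _duration), and _duration must be positive.
    """
    problems = []
    previous_ts = None
    previous_end = None
    for index, (ts, duration, _payload) in enumerate(rows):
        if ts is None or duration is None:
            problems.append(f"row {index}: NULL _ts or _duration")
            continue
        if 0 >= duration:
            problems.append(f"row {index}: _duration must be positive")
        if previous_ts is not None and previous_ts > ts:
            problems.append(f"row {index}: not ordered by _ts")
        if previous_end is not None and previous_end > ts:
            problems.append(f"row {index}: overlaps the previous span")
        previous_ts = ts
        previous_end = ts + duration
    return problems


# The light-color rows from the table above: red, a gap, then green.
light = [(1, 2, "red"), (4, 1, "green")]

# A deliberately broken table: the second span starts before the first ends.
overlapping = [(1, 3, "red"), (3, 1, "green")]
```

| <p>The gap between the red and green spans is not reported, since |
| span tables need not be contiguous.</p> |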
| |
| <aside class="note"> |
| It's because span tables are always ordered by |
| <code>_ts</code> that DCTV disallows queries of the form |
| <code>SELECT SPAN ... ORDER BY ...</code>. Re-ordering a |
| span table makes no sense. If you don't need the result of a |
| <code>SELECT</code> from a span table to be a span table itself, |
| you can instead view the span table as |
| a regular table by using the non-<code>SPAN</code> variant |
| of <code>SELECT</code> (<code>SELECT * FROM my_span_table</code>). |
| In this mode, <code>SELECT</code> lets you order |
| the result set however you want. |
| </aside> |
| <p>An <dfn>event table</dfn> is like a span table, but without |
| the <code>_duration</code> column. It represents a sequence |
| of "points" in time. The advantage of using an event table |
| over a regular SQL table to represent points is automatic |
| integration of the event table into time-based operations |
| on spans.</p> |
| |
| <h3>Partitions</h3> |
| |
| <p>A span table is either a <dfn>partitioned span table</dfn> |
| or a <dfn>non-partitioned span table</dfn>. A non-partitioned |
| span table is just the kind of span table described above. |
| A partitioned span table, by contrast, has an additional |
| special column, the <dfn>partition column</dfn>. |
| A partitioned span table is basically a bundle of logical |
| partition tables all combined into a single table under a |
| single name. Each distinct <emph>value</emph> of the |
| partition column, which is called a <dfn>partition</dfn>, |
| defines one independent sequence of spans.</p> |
| |
| <p>All of DCTV's operations on span tables know about |
| partitioned span tables (the partition column is part of the |
| span table's type) and operate on each partition within a span |
| table independently. There are also operations that transform |
| a partitioned span table into a non-partitioned span table |
| through the use of SQL grouping operators.</p> |
| |
| <p>It's useful to bundle sequences of spans this way instead of |
| putting each in its own table. Using a partitioned span |
| table, we can operate on groups of related time series |
| uniformly without having to change our queries depending on |
| how many different time series we have. For example, a |
| CPU-related query should look the same on any system no matter |
| how many CPUs it has!</p> |
| |
| <p>DCTV currently allows a span table to have either zero or |
| one partition column, but not more. This limit is just an |
| implementation limit, and in the future, DCTV will allow |
| partitioning by more than one column.</p> |
| |
| <p>Let's look at our Christmas tree light example, but with |
| partitions. Here, we're looking at two lights, one called |
| "light#0" and another called "light#1". We use a sequence of |
| spans to describe each light's state. It's critical to |
| understand that each light has a distinct state history, but |
| that we store all of these histories in the same physical |
| table, using a column to describe the specific light that a |
| specific row describes.</p> |
| |
| <aside class="note">For the remainder of this document, when |
| the character "#" appears in a span row label, it refers to a |
| specific partition of a partitioned span table.</aside> |
| |
| <table class="spanop"> |
| <caption>Colors of two lights</caption> |
| <tr class="times"> |
| <td>Time ▶</td> |
| <td><span>1</span></td> |
| <td><span>2</span></td> |
| <td><span>3</span></td> |
| <td><span>4</span><span>5</span></td> |
| </tr> |
| <tr> |
| <td>Light#0</td> |
| <td colspan="2">Red</td> |
| <td colspan="1" class="empty" /> |
| <td colspan="1">Green</td> |
| </tr> |
| <tr> |
| <td>Light#1</td> |
| <td colspan="1">Green</td> |
| <td colspan="3">Red</td> |
| </tr> |
| <tr class="times"> |
| <td>Time ▶</td> |
| <td><span>1</span></td> |
| <td><span>2</span></td> |
| <td><span>3</span></td> |
| <td><span>4</span><span>5</span></td> |
| </tr> |
| </table> |
| |
| <p>Here's the physical partitioned span table representation |
| of the logical spans from the above diagram.</p> |
| |
| <table class="general spantable"> |
| <caption>Colors of two lights (span table representation)</caption> |
| <tr> |
| <th>_ts</th> |
| <th>_duration</th> |
| <th>lightno</th> |
| <th>color</th> |
| </tr> |
| <tr> |
| <td>1</td> |
| <td>2</td> |
| <td>0</td> |
| <td>red</td> |
| </tr> |
| <tr> |
| <td>1</td> |
| <td>1</td> |
| <td>1</td> |
| <td>green</td> |
| </tr> |
| <tr> |
| <td>2</td> |
| <td>3</td> |
| <td>1</td> |
| <td>red</td> |
| </tr> |
| <tr> |
| <td>4</td> |
| <td>1</td> |
| <td>0</td> |
| <td>green</td> |
| </tr> |
| </table> |
| |
| <p>Like an unpartitioned span table, a partitioned span table |
| is ordered by increasing <code>_ts</code>. If spans |
| from two different partitions begin at the same time, the |
| relative order of the rows that share that <code>_ts</code> value is |
| unspecified.</p> |
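| <p>To make the idea of one physical table holding several |
| independent span sequences concrete, here is a short Python sketch |
| that splits the two-lights rows above into one sequence per |
| partition. The helper <code>split_by_partition</code> is |
| hypothetical and only illustrates the data model; DCTV itself |
| tracks the partition column as part of the table's type.</p> |

```python
from collections import defaultdict

# Rows are (_ts, _duration, lightno, color), copied from the table above.
lights = [
    (1, 2, 0, "red"),
    (1, 1, 1, "green"),
    (2, 3, 1, "red"),
    (4, 1, 0, "green"),
]


def split_by_partition(rows, partition_index):
    """Group rows into one span sequence per partition value.

    The partition column is removed from each row, leaving the
    (_ts, _duration, payload...) shape of an unpartitioned span table.
    """
    sequences = defaultdict(list)
    for row in rows:
        key = row[partition_index]
        rest = row[:partition_index] + row[partition_index + 1:]
        sequences[key].append(rest)
    return dict(sequences)


by_light = split_by_partition(lights, partition_index=2)
```

| <p>Each value in <code>by_light</code> is then an ordinary, |
| non-overlapping sequence of spans for a single light.</p> |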
| |
| <aside class="example"> |
| A real world use of spans is analyzing CPU-specific data. |
| On a multi-CPU system, each CPU has its own frequency. |
| A CPU might change from 800MHz to 1GHz and then down to |
| 600MHz, while another CPU, at the same time, might change |
| its frequency from 600MHz to 800MHz and then up to 1GHz. |
| Each of the two time series (the first CPU's frequency |
| history and the second CPU's frequency history) is an |
| independent time series. |
| </aside> |
| |
| <h3>Span operations</h3> |
| <p>While we can apply normal SQL querying operations to span |
| tables, we can answer certain questions much more conveniently |
| by using DCTV's special span operations, which are designed to |
| make it easy to work with real-world time series data.</p> |
| |
| <h4>Span join</h4> |
| |
| <p>The <dfn>span join</dfn> family of operations merges spans |
| together in a timewise-correct way and generates new spans |
| divided on the common boundaries of the spans given as |
| input to the span join.</p> |
| |
| <p>It's easiest to demonstrate a span join visually.</p> |
| <!-- TODO(dancol): can we make this diagram more fun? --> |
| <table class="spanop"> |
| <tr class="times"> |
| <td>Time ▶</td> |
| <td><span>1</span></td> |
| <td><span>2</span></td> |
| <td><span>3</span><span>4</span></td> |
| </tr> |
| <tr> |
| <td>Size</td> |
| <td colspan="2">tiny</td> |
| <td colspan="1">giant</td> |
| </tr> |
| <tr> |
| <td>Species</td> |
| <td colspan="1">fish</td> |
| <td colspan="2">squirrel</td> |
| </tr> |
| <tr class="result-divider"><td>SPAN JOIN</td></tr> |
| <tr> |
| <td>Phenotype</td> |
| <td>tiny fish</td> |
| <td>tiny squirrel</td> |
| <td>giant squirrel</td> |
| </tr> |
| <tr class="times"> |
| <td>Time ▶</td> |
| <td><span>1</span></td> |
| <td><span>2</span></td> |
| <td><span>3</span><span>4</span></td> |
| </tr> |
| </table> |
| |
| <p>Here, we're joining two hypothetical time series (as |
| represented by span tables), a time series of sizes and a time |
| series of animal types. (Imagine we're trying to reconstruct |
| the state of an animal given a record of the transmutation |
| spells some novice sorcerer's apprentice might have haphazardly |
| cast.)</p> |
| |
| <p>In this trace, the "make the animal tiny" spell was in |
| effect from timestamp one to timestamp three (inclusive), and |
| the "make the animal giant" spell was in effect from timestamp |
| three onward. Likewise, the "make the animal a fish" spell was in |
| effect from timestamp one to timestamp two (inclusive) and the |
| "make the animal a squirrel" spell was in effect from |
| timestamp two onward. The first row depicts the result of the |
| size spells, and the second row depicts the effect of the |
| animal-type spell. (We imagine that each spell cancels the |
| effect of the last spell of the same type.)</p> |
| |
| <p>The last row, "phenotype", represents a span table giving |
| the type of animal that we observe at each moment, inferred |
| from the effects of the previous two rows. Note that the |
| result span table has a span division wherever any of the |
| inputs has a span division. We ensure that all the properties |
| of any of the input spans stay constant "within" any of the |
| output spans, allowing for correct future computation |
| involving these values. |
| </p> |
| |
| <p>It may be informative to look at the row-wise representation |
| of the above span tables:</p> |
| |
| <table class="general spantable"> |
| <caption>Size</caption> |
| <tr> |
| <th>_ts</th> |
| <th>_duration</th> |
| <th>size</th> |
| </tr> |
| <tr> |
| <td>1</td> |
| <td>2</td> |
| <td>tiny</td> |
| </tr> |
| <tr> |
| <td>3</td> |
| <td>1</td> |
| <td>giant</td> |
| </tr> |
| </table> |
| |
| <table class="general spantable"> |
| <caption>Species</caption> |
| <tr> |
| <th>_ts</th> |
| <th>_duration</th> |
| <th>species</th> |
| </tr> |
| <tr> |
| <td>1</td> |
| <td>1</td> |
| <td>fish</td> |
| </tr> |
| <tr> |
| <td>2</td> |
| <td>2</td> |
| <td>squirrel</td> |
| </tr> |
| </table> |
| |
| <table class="general spantable"> |
| <caption>Phenotype</caption> |
| <tr> |
| <th>_ts</th> |
| <th>_duration</th> |
| <th>size</th> |
| <th>species</th> |
| </tr> |
| <tr> |
| <td>1</td> |
| <td>1</td> |
| <td>tiny</td> |
| <td>fish</td> |
| </tr> |
| <tr> |
| <td>2</td> |
| <td>1</td> |
| <td>tiny</td> |
| <td>squirrel</td> |
| </tr> |
| <tr> |
| <td>3</td> |
| <td>1</td> |
| <td>giant</td> |
| <td>squirrel</td> |
| </tr> |
| </table> |
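| <p>The following Python sketch shows one way the span join in this |
| example could be computed. It is not DCTV's implementation: the |
| function <code>span_inner_join</code> is hypothetical, it handles |
| only two unpartitioned inputs, and it assumes each span covers the |
| half-open interval from <code>_ts</code> up to (but not including) |
| <code>_ts</code> + <code>_duration</code>. The size rows follow |
| the diagram, in which "giant" begins at time 3.</p> |

```python
def span_inner_join(left, right):
    """Join two unpartitioned span lists of (_ts, _duration, value).

    Assumption: spans are half-open [_ts, _ts + _duration), sorted by _ts,
    and non-overlapping within each input.
    Returns rows of (_ts, _duration, left_value, right_value).
    """
    result = []
    i = j = 0
    while len(left) > i and len(right) > j:
        l_ts, l_dur, l_val = left[i]
        r_ts, r_dur, r_val = right[j]
        l_end, r_end = l_ts + l_dur, r_ts + r_dur
        # The output span is the overlap of the two current input spans.
        start = max(l_ts, r_ts)
        end = min(l_end, r_end)
        if end > start:
            result.append((start, end - start, l_val, r_val))
        # Advance whichever input span finishes first (both on a tie).
        if r_end >= l_end:
            i += 1
        if l_end >= r_end:
            j += 1
    return result


size = [(1, 2, "tiny"), (3, 1, "giant")]
species = [(1, 1, "fish"), (2, 2, "squirrel")]
phenotype = span_inner_join(size, species)
```

| <p>The rows in <code>phenotype</code> correspond to the Phenotype |
| table above: an output span boundary appears wherever either |
| input has one.</p> |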
| |
| <h4>Span join: inner and outer</h4> |
| <p>What happens when spans don't line up exactly?</p> |
| <p>Span joins come in two varieties, named after the varieties |
| of regular SQL joins: <dfn>inner span join</dfn> and |
| <dfn>outer span join</dfn>. When all the inputs to a span |
| join cover the same period of time, the difference doesn't |
| matter. But when there are gaps in one sequence or another, |
| the difference becomes important. Just as in the previous |
| section, we'll start with a diagram.</p> |
| |
| <table class="spanop"> |
| <caption>Sample inputs</caption> |
| <tr class="times"> |
| <td>Time ▶</td> |
| <td><span>1</span></td> |
| <td><span>2</span></td> |
| <td><span>3</span><span>4</span></td> |
| </tr> |
| <tr> |
| <td>Breath</td> |
| <td colspan="1">fire</td> |
| <td class="empty"/> |
| <td colspan="1">ice</td> |
| </tr> |
| <tr> |
| <td>Color</td> |
| <td colspan="1">red</td> |
| <td colspan="2">green</td> |
| </tr> |
| <tr class="times"> |
| <td>Time ▶</td> |
| <td><span>1</span></td> |
| <td><span>2</span></td> |
| <td><span>3</span><span>4</span></td> |
| </tr> |
| </table> |
| |
| <p>Here, we see that there is no magic breath spell in effect |
| from time two to time three, inclusive. What happens when we |
| perform a span join on these span tables? It depends on the |
| kind of span join.</p> |
| |
| <table class="spanop"> |
| <caption>Span inner join</caption> |
| <tr class="times"> |
| <td>Time ▶</td> |
| <td><span>1</span></td> |
| <td><span>2</span></td> |
| <td><span>3</span><span>4</span></td> |
| </tr> |
| <tr> |
| <td>Breath</td> |
| <td colspan="1">fire</td> |
| <td class="empty"/> |
| <td colspan="1">ice</td> |
| </tr> |
| <tr> |
| <td>Color</td> |
| <td colspan="1">red</td> |
| <td colspan="2">green</td> |
| </tr> |
| <tr class="result-divider"> |
| <td>Span inner join</td> |
| </tr> |
| <tr> |
| <td>Phenotype</td> |
| <td colspan="1">fire-breathing red</td> |
| <td class="empty"/> |
| <td colspan="1">ice-breathing green</td> |
| </tr> |
| <tr class="times"> |
| <td>Time ▶</td> |
| <td><span>1</span></td> |
| <td><span>2</span></td> |
| <td><span>3</span><span>4</span></td> |
| </tr> |
| </table> |
| |
| <table class="spanop"> |
| <caption>Span outer join</caption> |
| <tr class="times"> |
| <td>Time ▶</td> |
| <td><span>1</span></td> |
| <td><span>2</span></td> |
| <td><span>3</span><span>4</span></td> |
| </tr> |
| <tr> |
| <td>Breath</td> |
| <td colspan="1">fire</td> |
| <td class="empty"/> |
| <td colspan="1">ice</td> |
| </tr> |
| <tr> |
| <td>Color</td> |
| <td colspan="1">red</td> |
| <td colspan="2">green</td> |
| </tr> |
| <tr class="result-divider"> |
| <td>Span outer join</td> |
| </tr> |
| <tr> |
| <td>Phenotype</td> |
| <td colspan="1">fire-breathing red</td> |
| <td colspan="1"><code>NULL</code>-breathing green</td> |
| <td colspan="1">ice-breathing green</td> |
| </tr> |
| <tr class="times"> |
| <td>Time ▶</td> |
| <td><span>1</span></td> |
| <td><span>2</span></td> |
| <td><span>3</span><span>4</span></td> |
| </tr> |
| </table> |
| |
| <p>In the span inner join case, we emit an output span only |
| when <i>all</i> input spans cover a time interval. In the |
| span outer join case, we emit an output span when <i>any</i> |
| input span covers a specific time region, providing NULL for |
| the value of any payload column not provided by a span for |
| that region.</p> |
| |
| <p>The table representations of the two result span tables may |
| make the result more clear.</p> |
| |
| <table class="general spantable"> |
| <caption>Span inner join result (table view)</caption> |
| <tr> |
| <th>_ts</th> |
| <th>_duration</th> |
| <th>breath</th> |
| <th>color</th> |
| </tr> |
| <tr> |
| <td>1</td> |
| <td>1</td> |
| <td>fire</td> |
| <td>red</td> |
| </tr> |
| <tr> |
| <td>3</td> |
| <td>1</td> |
| <td>ice</td> |
| <td>green</td> |
| </tr> |
| </table> |
| |
| <table class="general spantable"> |
| <caption>Span outer join result (table view)</caption> |
| <tr> |
| <th>_ts</th> |
| <th>_duration</th> |
| <th>breath</th> |
| <th>color</th> |
| </tr> |
| <tr> |
| <td>1</td> |
| <td>1</td> |
| <td>fire</td> |
| <td>red</td> |
| </tr> |
| <tr> |
| <td>2</td> |
| <td>1</td> |
| <td><code>NULL</code></td> |
| <td>green</td> |
| </tr> |
| <tr> |
| <td>3</td> |
| <td>1</td> |
| <td>ice</td> |
| <td>green</td> |
| </tr> |
| </table> |
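| <p>The outer variant can be sketched in Python as a sweep over |
| every span boundary in either input. As before, this is an |
| illustrative sketch under stated assumptions rather than DCTV's |
| implementation: <code>covering</code> and |
| <code>span_outer_join</code> are hypothetical names, spans are |
| treated as half-open intervals, and Python's <code>None</code> |
| stands in for <code>NULL</code>.</p> |

```python
def covering(spans, t):
    """Return the span that covers time t, or None if no span does."""
    for span in spans:
        ts, duration, _value = span
        if t >= ts and ts + duration > t:
            return span
    return None


def span_outer_join(left, right):
    """Outer-join two span lists of (_ts, _duration, value).

    Assumption: spans are half-open [_ts, _ts + _duration).
    Returns rows of (_ts, _duration, left_value, right_value).
    """
    # Every span start or end in either input is a candidate boundary.
    boundaries = sorted(
        {edge for ts, dur, _ in left + right for edge in (ts, ts + dur)}
    )
    result = []
    for start, end in zip(boundaries, boundaries[1:]):
        l_span, r_span = covering(left, start), covering(right, start)
        if l_span is None and r_span is None:
            continue  # no input covers this region, so emit nothing
        l_val = l_span[2] if l_span else None
        r_val = r_span[2] if r_span else None
        result.append((start, end - start, l_val, r_val))
    return result


breath = [(1, 1, "fire"), (3, 1, "ice")]
color = [(1, 1, "red"), (2, 2, "green")]
joined = span_outer_join(breath, color)
```

| <p>Dropping every row that contains <code>None</code> from |
| <code>joined</code> gives the inner join result for the same |
| inputs.</p> |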
| |
| <p>Note that even a span outer join won't produce a result |
| span that covers a period of time that no input span covered, |
| as the following diagram indicates.</p> |
| |
| <table class="spanop"> |
| <caption>Holes in span outer join</caption> |
| <tr class="times"> |
| <td>Time ▶</td> |
| <td><span>1</span></td> |
| <td><span>2</span></td> |
| <td><span>3</span><span>4</span></td> |
| </tr> |
| <tr> |
| <td>Breath</td> |
| <td colspan="1" class="empty"/> |
| <td colspan="1" class="empty"/> |
| <td colspan="1">ice</td> |
| </tr> |
| <tr> |
| <td>Color</td> |
| <td class="empty"/> |
| <td colspan="1">red</td> |
| <td colspan="1">green</td> |
| </tr> |
| <tr class="result-divider"> |
| <td>Span outer join </td> |
| </tr> |
| <tr> |
| <td>Phenotype</td> |
| <td class="empty"/> |
| <td colspan="1"><code>NULL</code>-breathing red</td> |
| <td colspan="1">ice-breathing green</td> |
| </tr> |
| <tr class="times"> |
| <td>Time ▶</td> |
| <td><span>1</span></td> |
| <td><span>2</span></td> |
| <td><span>3</span><span>4</span></td> |
| </tr> |
| </table> |
| |
| <h4>Span broadcast</h4> |
| |
| <p>A <dfn>span broadcast</dfn> is a special kind of span join |
| that operates on two span tables, one partitioned and one not. |
| Normally, DCTV treats each partition within a partitioned span |
| table as a separate time series and operates on each |
| independently; DCTV refuses to perform span operations on span |
| tables partitioned by different columns or between partitioned |
| and non-partitioned span tables, since the desired operation |
| isn't obvious.</p> |
| |
| <p>With a span broadcast, we can tell DCTV to perform a |
| special kind of span join between a partitioned and |
| non-partitioned table, "broadcasting" the non-partitioned span |
| table into every partition in the partitioned span table in such a |
| way that the result has useful properties.</p> |
| |
| <p>The overall result is <emph>almost</emph> as if we copied |
| the non-partitioned span table N times, once for each of the N |
| partitions, into a new partitioned span table, and then joined |
| that new partitioned span table with the other partitioned |
| span table that we had when we started. The difference |
| between this hypothetical operation and span broadcast is that |
| span broadcast doesn't generate any output spans for regions |
| not covered by any span in the partitioned span table, even if |
| that region is covered by the non-partitioned span table.</p> |
| |
| <p>Another way to think of it is that span broadcast "labels" |
| each span in a partitioned span table with the payload of the |
| non-partitioned table. The output of a span broadcast |
| operation is partitioned in the same way as its partitioned |
| input.</p> |
| |
| <p>As usual, a diagram may be illustrative. Here, "Size#0" |
| and "Size#1" indicate two partitions of the same partitioned span |
| table, "Size" (let's suppose animals 0 and 1 have different size |
| spells cast on them). "Color" is the input non-partitioned span |
| table (let's suppose color spells affect all animals at the |
| same time).</p> |
| |
| <table class="spanop"> |
| <caption>Sample inputs</caption> |
| <tr class="times"> |
| <td>Time ▶</td> |
| <td><span>1</span></td> |
| <td><span>2</span></td> |
| <td><span>3</span></td> |
| <td><span>4</span><span>5</span></td> |
| </tr> |
| <tr> |
| <td>Size #0</td> |
| <td colspan="1">tiny</td> |
| <td colspan="2">giant</td> |
| </tr> |
| <tr> |
| <td>Size #1</td> |
| <td colspan="3">tiny</td> |
| </tr> |
| <tr> |
| <td>Color</td> |
| <td colspan="1">red</td> |
| <td colspan="1" class="empty" /> |
| <td colspan="2">green</td> |
| </tr> |
| <tr class="times"> |
| <td>Time ▶</td> |
| <td><span>1</span></td> |
| <td><span>2</span></td> |
| <td><span>3</span></td> |
| <td><span>4</span><span>5</span></td> |
| </tr> |
| </table> |
| |
| <p>Just like regular span joins, span broadcasts come in |
| <dfn>span inner broadcast</dfn> and <dfn>span outer |
| broadcast</dfn> varieties, depicted below. Note that the time |
| period from four to five doesn't appear in the result span |
| tables, since from time four to time five, we had a color span |
| from the non-partitioned span table, but no spans from size, the |
| partitioned span table.</p> |
| |
| <table class="spanop"> |
| <caption>Inner broadcast of color into size</caption> |
| <tr class="times"> |
| <td>Time ▶</td> |
| <td><span>1</span></td> |
| <td><span>2</span></td> |
| <td><span>3</span></td> |
| <td><span>4</span><span>5</span></td> |
| </tr> |
| <tr> |
| <td>Size #0</td> |
| <td colspan="1">tiny</td> |
| <td colspan="2">giant</td> |
| </tr> |
| <tr> |
| <td>Size #1</td> |
| <td colspan="3">tiny</td> |
| </tr> |
| <tr> |
| <td>Color</td> |
| <td colspan="1">red</td> |
| <td colspan="1" class="empty" /> |
| <td colspan="2">green</td> |
| </tr> |
| <tr class="result-divider"> |
| <td>Inner broadcast</td> |
| </tr> |
| <tr> |
| <td>Result#0</td> |
| <td colspan="1">tiny red</td> |
| <td colspan="1" class="empty" /> |
| <td colspan="1">giant green</td> |
| </tr> |
| <tr> |
| <td>Result#1</td> |
| <td colspan="1">tiny red</td> |
| <td colspan="1" class="empty" /> |
| <td colspan="1">tiny green</td> |
| </tr> |
| <tr class="times"> |
| <td>Time ▶</td> |
| <td><span>1</span></td> |
| <td><span>2</span></td> |
| <td><span>3</span></td> |
| <td><span>4</span><span>5</span></td> |
| </tr> |
| </table> |
| |
| <table class="spanop"> |
| <caption>Outer broadcast of color into size</caption> |
| <tr class="times"> |
| <td>Time ▶</td> |
| <td><span>1</span></td> |
| <td><span>2</span></td> |
| <td><span>3</span></td> |
| <td><span>4</span><span>5</span></td> |
| </tr> |
| <tr> |
| <td>Size #0</td> |
| <td colspan="1">tiny</td> |
| <td colspan="2">giant</td> |
| </tr> |
| <tr> |
| <td>Size #1</td> |
| <td colspan="3">tiny</td> |
| </tr> |
| <tr> |
| <td>Color</td> |
| <td colspan="1">red</td> |
| <td colspan="1" class="empty" /> |
| <td colspan="2">green</td> |
| </tr> |
| <tr class="result-divider"> |
| <td>Outer broadcast</td> |
| </tr> |
| <tr> |
| <td>Result#0</td> |
| <td colspan="1">tiny red</td> |
| <td colspan="1"><code>NULL</code>-colored giant</td> |
| <td colspan="1">giant green</td> |
| </tr> |
| <tr> |
| <td>Result#1</td> |
| <td colspan="1">tiny red</td> |
| <td colspan="1"><code>NULL</code>-colored tiny</td> |
| <td colspan="1">tiny green</td> |
| </tr> |
| <tr class="times"> |
| <td>Time ▶</td> |
| <td><span>1</span></td> |
| <td><span>2</span></td> |
| <td><span>3</span></td> |
| <td><span>4</span><span>5</span></td> |
| </tr> |
| </table> |
| |
| <p>In general, we use a span broadcast when we have a number |
| of different things happening at the same time (each |
| represented by one partition of a span table) and we want to |
| "mix into" this span table knowledge of something that affects |
| the environment as a whole.</p> |
| <aside class="example">We might denote a period of a few |
| seconds during which the app |
| com.flashlightco.myflashlight starts up in response to a |
| launcher tap. (This is not an efficient flashlight app.) If |
| we have a table of process activity, partitioned by CPU, we |
| can apply a span inner broadcast to the process activity table |
| and narrow our view of that table to the interval during which |
| the flashlight app was starting, but keep the result |
| partitioned by CPU. |
| </aside> |
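| |
| <p>The broadcast semantics pictured above can be sketched in a |
| few lines of Python. This is only an illustrative model, not |
| DCTV's implementation: the function name |
| <code>span_broadcast</code>, the tuple layout, and the use of |
| half-open <code>[start, end)</code> intervals are all |
| assumptions made for the example.</p> |

```python
# Spans are (start, end, payload) tuples covering the half-open interval
# [start, end).  A partitioned span table is modeled as a dict mapping a
# partition key to a list of non-overlapping spans.  (Hypothetical model.)

def span_broadcast(partitioned, broadcast, outer=False):
    """Mix a non-partitioned span table into every partition."""
    result = {}
    for key, spans in partitioned.items():
        out = []
        for start, end, payload in spans:
            cursor = start  # everything before cursor is already emitted
            for b_start, b_end, b_payload in sorted(broadcast):
                lo, hi = max(start, b_start), min(end, b_end)
                if lo >= hi:
                    continue  # no overlap with this broadcast span
                if outer and cursor < lo:
                    # Outer mode keeps uncovered time with a NULL payload.
                    out.append((cursor, lo, payload, None))
                out.append((lo, hi, payload, b_payload))
                cursor = hi
            if outer and cursor < end:
                out.append((cursor, end, payload, None))
        result[key] = out
    return result


# The "Size" and "Color" tables from the diagrams above.
size = {0: [(1, 2, "tiny"), (2, 4, "giant")], 1: [(1, 4, "tiny")]}
color = [(1, 2, "red"), (3, 5, "green")]

inner = span_broadcast(size, color)
outer = span_broadcast(size, color, outer=True)
```

| <p>In this model, the result keeps the partition keys of |
| <code>size</code>, and the color span from four to five never |
| appears because no size span covers that period.</p> |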
| |
| <h4>Span group</h4> |
| |
| <p>A <dfn>span group</dfn> operation is the opposite of a span |
| join, in a sense. It merges spans together and applies SQL |
| set functions (like <code>MAX</code> and <code>SUM</code>) to |
| the payloads of the merged spans, forming for each payload a |
| combined value determined through the usual SQL aggregation |
| operation.</p> |
| |
| <p>Here's a diagram.</p> |
| |
| <table class="spanop"> |
| <tr class="times"> |
| <td>Time ▶</td> |
| <td><span>1</span></td> |
| <td><span>2</span></td> |
| <td><span>3</span></td> |
| <td><span>4</span></td> |
| <td><span>5</span></td> |
| <td><span>6</span></td> |
| <td><span>7</span></td> |
| <td><span>8</span><span>9</span></td> |
| </tr> |
| <tr> |
| <td>Number arms</td> |
| <td>2</td> |
| <td>5</td> |
| <td>0</td> |
| <td>7</td> |
| <td>2</td> |
| <td>4</td> |
| <td>9</td> |
| <td>0</td> |
| </tr> |
| <tr> |
| <td>Periods</td> |
| <td colspan="2">A</td> |
| <td colspan="2">B</td> |
| <td colspan="2">C</td> |
| <td colspan="2">D</td> |
| </tr> |
| <tr class="result-divider"><td>Span group</td></tr> |
| <tr> |
| <td><code>MAX(arms)</code></td> |
| <td colspan="2">5</td> |
| <td colspan="2">7</td> |
| <td colspan="2">4</td> |
| <td colspan="2">9</td> |
| </tr> |
| <tr> |
| <td><code>MIN(arms)</code></td> |
| <td colspan="2">2</td> |
| <td colspan="2">0</td> |
| <td colspan="2">2</td> |
| <td colspan="2">0</td> |
| </tr> |
| <tr class="times"> |
| <td>Time ▶</td> |
| <td><span>1</span></td> |
| <td><span>2</span></td> |
| <td><span>3</span></td> |
| <td><span>4</span></td> |
| <td><span>5</span></td> |
| <td><span>6</span></td> |
| <td><span>7</span></td> |
| <td><span>8</span><span>9</span></td> |
| </tr> |
| </table> |
| |
| <p>Here, our hapless sorcerer repeatedly changed the number |
| of arms that our poor animal had at any time. We want to |
| determine, based on the record of arm-number changes, for each |
| relatively broad interval A, B, C, and D, the minimum and |
| maximum number of arms our animal had during that |
| interval.</p> |
| |
| <p>A span group operation involves two span tables: the |
| <dfn>grouped</dfn> table and the <dfn>grouper</dfn> table. |
| The grouped table ("number of arms", in our example) supplies |
| the source data for the grouping operations; the grouper table |
| (here, "periods") supplies spans describing the groups that |
| form the output value. The grouped table may or may not be |
| partitioned; if it is partitioned, DCTV applies grouping to |
| each partition individually. The grouper table may not |
| currently be partitioned.</p> |
| |
| <p>A span group operation always emits one output span for |
| each span in its grouper input span table. If no grouped span |
| overlaps with a given grouper span, all its aggregate values |
| end up being <code>NULL</code>. An example follows.</p> |
| |
| <table class="spanop"> |
| <caption>Illustration of span group behavior with missing |
| grouped values</caption> |
| <tr class="times"> |
| <td>Time ▶</td> |
| <td><span>1</span></td> |
| <td><span>2</span></td> |
| <td><span>3</span></td> |
| <td><span>4</span></td> |
| <td><span>5</span></td> |
| <td><span>6</span></td> |
| <td><span>7</span></td> |
| <td><span>8</span><span>9</span></td> |
| </tr> |
| <tr> |
| <td>Number arms</td> |
| <td>2</td> |
| <td>5</td> |
| <td>0</td> |
| <td class="empty" /> |
| <td class="empty" /> |
| <td class="empty" /> |
| <td>9</td> |
| <td>0</td> |
| </tr> |
| <tr> |
| <td>Periods</td> |
| <td colspan="2">A</td> |
| <td colspan="2">B</td> |
| <td colspan="2">C</td> |
| <td colspan="2">D</td> |
| </tr> |
| <tr class="result-divider"><td>Span group</td></tr> |
| <tr> |
| <td><code>MAX(arms)</code></td> |
| <td colspan="2">5</td> |
| <td colspan="2">0</td> |
| <td colspan="2"><code>NULL</code></td> |
| <td colspan="2">9</td> |
| </tr> |
| <tr> |
| <td><code>MIN(arms)</code></td> |
| <td colspan="2">2</td> |
| <td colspan="2">0</td> |
| <td colspan="2"><code>NULL</code></td> |
| <td colspan="2">0</td> |
| </tr> |
| <tr class="times"> |
| <td>Time ▶</td> |
| <td><span>1</span></td> |
| <td><span>2</span></td> |
| <td><span>3</span></td> |
| <td><span>4</span></td> |
| <td><span>5</span></td> |
| <td><span>6</span></td> |
| <td><span>7</span></td> |
| <td><span>8</span><span>9</span></td> |
| </tr> |
| </table> |
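| |
| <p>As a rough illustration of the rule "one output span per |
| grouper span, <code>NULL</code> when nothing overlaps", here |
| is a small Python model of a span group over a non-partitioned |
| grouped table. The name <code>span_group</code> and the data |
| layout are assumptions for the sketch, not DCTV API.</p> |

```python
# Spans are (start, end, payload) tuples over half-open intervals.
# This is a hypothetical model of the behavior described above.

def span_group(grouped, grouper, aggregates):
    """Emit one row per grouper span, aggregating overlapping payloads."""
    out = []
    for g_start, g_end, label in grouper:
        # Collect payloads of grouped spans that overlap this grouper span.
        values = [
            value
            for start, end, value in grouped
            if start < g_end and g_start < end
        ]
        # With no overlapping grouped span, every aggregate is NULL (None).
        row = {
            name: (fn(values) if values else None)
            for name, fn in aggregates.items()
        }
        out.append((g_start, g_end, label, row))
    return out


# "Number arms" with missing data from time 4 to time 7, as in the table.
arms = [(1, 2, 2), (2, 3, 5), (3, 4, 0), (7, 8, 9), (8, 9, 0)]
periods = [(1, 3, "A"), (3, 5, "B"), (5, 7, "C"), (7, 9, "D")]

rows = span_group(arms, periods, {"max": max, "min": min})
```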
| |
| <!-- TODO(dancol): make this paragraph more clear --> |
| <p>Span group operations have two flavors: <dfn>span group and |
| intersect</dfn> and <dfn>span group and union</dfn>. |
| The difference matters only when multiple partitions are |
| involved. In the former case, we include payloads from the |
| grouped span table only when all partitions are present in a |
| given interval; in the latter case, we include the grouped |
| span table in the output spans when any input grouped |
| partition is present.</p> |
| |
| <aside class="note">If we want the output of a span group to |
| include only the regions of time covered by the grouped span |
| table, first span join the grouper with the grouped span |
| table, then use the result as the grouper span table in the |
| span group.</aside> |
| |
| <h4>Span departition</h4> |
| |
| <p>A <dfn>span departition</dfn> operation transforms a |
| partitioned span table into a non-partitioned span table by |
| grouping the partition payloads with SQL set functions. |
| This operation is useful mainly when we have a "split up" view |
| of activity on the system and want to derive a whole-system |
| view by matching up all the partitions.</p> |
| |
| <p>To return to our magical forensics example, imagine our |
| apprentice cast some very expensive add-arms-to-animals spells |
| on a number of different animals. We're billed for arms based |
| on the total number we're using at any one time (there's a |
| license server and everything), so we want to reconstruct, |
| based on a record of each animal's arm count, the number of |
| arms we were using in total at a particular moment. In the |
| following table, "Arms#0", "Arms#1", and so on denote the |
| partitions of a single "Arms" span table.</p> |
| |
| <table class="spanop"> |
| <tr class="times"> |
| <td>Time ▶</td> |
| <td><span>1</span></td> |
| <td><span>2</span></td> |
| <td><span>3</span></td> |
| <td><span>4</span></td> |
| <td><span>5</span></td> |
| <td><span>6</span></td> |
| <td><span>7</span></td> |
| <td><span>8</span><span>9</span></td> |
| </tr> |
| <tr> |
| <td>Arms#0</td> |
| <td colspan="3">2</td> |
| <td colspan="2">7</td> |
| <td>4</td> |
| <td>9</td> |
| <td>0</td> |
| </tr> |
| <tr> |
| <td>Arms#1</td> |
| <td colspan="5">2</td> |
| <td colspan="3">4</td> |
| </tr> |
| <tr class="result-divider"><td>Departition</td></tr> |
| <tr> |
| <td><code>SUM(arms)</code></td> |
| <td colspan="3">4</td> |
| <td colspan="2">9</td> |
| <td colspan="1">8</td> |
| <td colspan="1">13</td> |
| <td colspan="1">4</td> |
| </tr> |
| <tr class="times"> |
| <td>Time ▶</td> |
| <td><span>1</span></td> |
| <td><span>2</span></td> |
| <td><span>3</span></td> |
| <td><span>4</span></td> |
| <td><span>5</span></td> |
| <td><span>6</span></td> |
| <td><span>7</span></td> |
| <td><span>8</span><span>9</span></td> |
| </tr> |
| </table> |
| |
| <p>A span departition operation resembles a span join |
| followed by a span group operation, but it's specified |
| separately so that we can work with partitioned span tables |
| without knowing in advance how many partitions we have or |
| having to expand our queries to work with each partition |
| separately.</p> |
| |
| <p>Span departitions come in two varieties, the <dfn>span |
| departition and union</dfn> and <dfn>span departition and |
| intersect</dfn> operations, with the difference concerning the |
| treatment of missing data. The following table gives the |
| differences between these approaches.</p> |
| |
| <table class="spanop"> |
| <caption>Arm history with missing data</caption> |
| <tr class="times"> |
| <td>Time ▶</td> |
| <td><span>1</span></td> |
| <td><span>2</span></td> |
| <td><span>3</span></td> |
| <td><span>4</span></td> |
| <td><span>5</span></td> |
| <td><span>6</span></td> |
| <td><span>7</span></td> |
| <td><span>8</span><span>9</span></td> |
| </tr> |
| <tr> |
| <td>Arms#0</td> |
| <td colspan="3" class="empty" /> |
| <td colspan="2">7</td> |
| <td>4</td> |
| <td>9</td> |
| <td>0</td> |
| </tr> |
| <tr> |
| <td>Arms#1</td> |
| <td colspan="5">2</td> |
| <td colspan="3" class="empty" /> |
| </tr> |
| <tr class="times"> |
| <td>Time ▶</td> |
| <td><span>1</span></td> |
| <td><span>2</span></td> |
| <td><span>3</span></td> |
| <td><span>4</span></td> |
| <td><span>5</span></td> |
| <td><span>6</span></td> |
| <td><span>7</span></td> |
| <td><span>8</span><span>9</span></td> |
| </tr> |
| </table> |
| |
| <p>In intersect mode, we generate an output span for a region |
| of time only when <emph>all</emph> partitions have a span |
| covering that period.</p> |
| |
| <table class="spanop"> |
| <caption>Departition intersect result</caption> |
| <tr class="times"> |
| <td>Time ▶</td> |
| <td><span>1</span></td> |
| <td><span>2</span></td> |
| <td><span>3</span></td> |
| <td><span>4</span></td> |
| <td><span>5</span></td> |
| <td><span>6</span></td> |
| <td><span>7</span></td> |
| <td><span>8</span><span>9</span></td> |
| </tr> |
| <tr> |
| <td>Arms#0</td> |
| <td colspan="3" class="empty" /> |
| <td colspan="2">7</td> |
| <td>4</td> |
| <td>9</td> |
| <td>0</td> |
| </tr> |
| <tr> |
| <td>Arms#1</td> |
| <td colspan="5">2</td> |
| <td colspan="3" class="empty" /> |
| </tr> |
| <tr class="result-divider"> |
| <td>Departition intersect</td> |
| </tr> |
| <tr> |
| <td><code>SUM(arms)</code></td> |
| <td colspan="3" class="empty" /> |
| <td colspan="2">9</td> |
| <td colspan="3" class="empty" /> |
| </tr> |
| <tr class="times"> |
| <td>Time ▶</td> |
| <td><span>1</span></td> |
| <td><span>2</span></td> |
| <td><span>3</span></td> |
| <td><span>4</span></td> |
| <td><span>5</span></td> |
| <td><span>6</span></td> |
| <td><span>7</span></td> |
| <td><span>8</span><span>9</span></td> |
| </tr> |
| </table> |
| |
| <p>By contrast, in union mode, we generate an output span when |
| <emph>any</emph> partition covers that period. We treat |
| any missing partitions as contributing <code>NULL</code> to |
| the output aggregation for each span. Note that SQL |
| aggregation functions just skip <code>NULL</code> values, so |
| the sums below are correct.</p> |
| |
| <table class="spanop"> |
| <caption>Departition union result</caption> |
| <tr class="times"> |
| <td>Time ▶</td> |
| <td><span>1</span></td> |
| <td><span>2</span></td> |
| <td><span>3</span></td> |
| <td><span>4</span></td> |
| <td><span>5</span></td> |
| <td><span>6</span></td> |
| <td><span>7</span></td> |
| <td><span>8</span><span>9</span></td> |
| </tr> |
| <tr> |
| <td>Arms#0</td> |
| <td colspan="3" class="empty" /> |
| <td colspan="2">7</td> |
| <td>4</td> |
| <td>9</td> |
| <td>0</td> |
| </tr> |
| <tr> |
| <td>Arms#1</td> |
| <td colspan="5">2</td> |
| <td colspan="3" class="empty" /> |
| </tr> |
| <tr class="result-divider"> |
| <td>Departition union</td> |
| </tr> |
| <tr> |
| <td><code>SUM(arms)</code></td> |
| <td colspan="3">2</td> |
| <td colspan="2">9</td> |
| <td colspan="1">4</td> |
| <td colspan="1">9</td> |
| <td colspan="1">0</td> |
| </tr> |
| <tr class="times"> |
| <td>Time ▶</td> |
| <td><span>1</span></td> |
| <td><span>2</span></td> |
| <td><span>3</span></td> |
| <td><span>4</span></td> |
| <td><span>5</span></td> |
| <td><span>6</span></td> |
| <td><span>7</span></td> |
| <td><span>8</span><span>9</span></td> |
| </tr> |
| </table> |
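| |
| <p>The difference between the two departition modes can be |
| modeled with a short Python sketch. It cuts the timeline at |
| every span boundary and aggregates whatever partitions cover |
| each piece. The function name <code>departition</code>, the |
| <code>mode</code> argument, and the data layout are |
| assumptions for illustration, not DCTV's actual interface.</p> |

```python
# Spans are (start, end, payload) tuples over half-open intervals; a
# partitioned table is a dict of partition key -> non-overlapping spans.
# This is a hypothetical model of the behavior described above.

def departition(partitions, aggregate, mode="intersect"):
    """Collapse a partitioned span table into a non-partitioned one."""
    # Every span boundary is a potential boundary of an output span.
    edges = sorted(
        {t for spans in partitions.values() for s, e, _ in spans for t in (s, e)}
    )
    out = []
    for lo, hi in zip(edges, edges[1:]):
        # Payloads of the partitions that cover the whole piece [lo, hi).
        values = [
            value
            for spans in partitions.values()
            for s, e, value in spans
            if s <= lo and hi <= e
        ]
        if not values:
            continue  # nobody covers this piece
        if mode == "intersect" and len(values) < len(partitions):
            continue  # intersect mode needs every partition present
        # Union mode simply skips missing partitions, as SQL skips NULLs.
        out.append((lo, hi, aggregate(values)))
    return out


# "Arm history with missing data" from the table above.
arms = {
    0: [(4, 6, 7), (6, 7, 4), (7, 8, 9), (8, 9, 0)],
    1: [(1, 6, 2)],
}

intersect_result = departition(arms, sum, mode="intersect")
union_result = departition(arms, sum, mode="union")
```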
| |
| <h3>Trace processing intrinsic functions</h3> |
| |
| <p>DCTV aims to be a general-purpose time series analysis |
| program, one that just happens to be especially useful for |
| processing Android system traces. Its general approach is to |
| avoid system- and metric-specific data processing routines and |
| provide general-purpose operators that users can combine to |
| analyze data in particular situations.</p> |
| |
| <p>The previous section describes operations that DCTV |
| provides in the form of query operators. DCTV also provides |
| some operations, usually less common ones, in the form of |
| table-valued functions.</p> |
| |
| <h4><a name="time_series_to_span_conversion">Time series to span conversion</a></h4> |
| |
| <p>Recall that DCTV exposes events from trace files as raw |
| data points, in event tables. We have to build span tables |
| from these raw data somehow, and the <a |
| href="#time_series_to_spans"><code>time_series_to_spans</code></a> |
| table-valued function does exactly that.</p> |
| |
| <p><code>time_series_to_spans</code> takes as input a set of |
| event sources and a set of output column descriptors and |
| produces a span table as output. Logically, it consumes |
| events from the given sources, in time order, and constructs |
| spans by watching for "start" and "stop" events as denoted by |
| the input sources. Payload values attached to the event |
| sources become payload columns of the output span table |
| according to each output column's source |
| specification.</p> |
| |
| <p>Each source is either a "start-start" source or a "stop" |
| source. The former case models a set of events that divide a |
| timeline up into discrete chunks.</p> |
| |
| <p>Returning for a moment to our hypothetical wizardly |
| apprentice, we recall that an animal's size might change as |
| our apprentice casts various "change size" spells on it. |
| The raw, event-by-event, record of spells cast by our |
| apprentice might look like this.</p> |
| |
| <table class="general spantable"> |
| <caption>Raw size spell record</caption> |
| <tr> |
| <th>_ts</th> |
| <th>size</th> |
| </tr> |
| <tr> |
| <td>1</td> |
| <td>tiny</td> |
| </tr> |
| <tr> |
| <td>3</td> |
| <td>huge</td> |
| </tr> |
| <tr> |
| <td>4</td> |
| <td>large</td> |
| </tr> |
| <tr> |
| <td>6</td> |
| <td>huge</td> |
| </tr> |
| </table> |
| |
| <p>Processing this raw event table into spans using |
| <code>time_series_to_spans</code>, we end up with a span table |
| that looks like this. (The time scale goes to seven for |
| easier comparison with the next example.)</p> |
| |
| <table class="spanop"> |
| <tr class="times"> |
| <td>Time ▶</td> |
| <td><span>1</span></td> |
| <td><span>2</span></td> |
| <td><span>3</span></td> |
| <td><span>4</span></td> |
| <td><span>5</span></td> |
| <td><span>6</span><span>7</span></td> |
| </tr> |
| <tr> |
| <td>Size</td> |
| <td colspan="2">tiny</td> |
| <td colspan="1">huge</td> |
| <td colspan="2">large</td> |
| </tr> |
| <tr class="times"> |
| <td>Time ▶</td> |
| <td><span>1</span></td> |
| <td><span>2</span></td> |
| <td><span>3</span></td> |
| <td><span>4</span></td> |
| <td><span>5</span></td> |
| <td><span>6</span><span>7</span></td> |
| </tr> |
| </table> |
| |
| <aside class="note">The final "huge" spell isn't reflected in |
| the output span table, because |
| <code>time_series_to_spans</code> ignores spans left "open" |
| (i.e., unclosed) at the end of processing. The intent of this |
| feature is to work with span inner join operations to |
| automatically ignore noisy partial-data "junk intervals" at |
| the beginning and end of traces. If a need arises, |
| <code>time_series_to_spans</code> could be extended in the |
| future to automatically close open spans.</aside> |
| |
| <p><code>time_series_to_spans</code> also supports "stop" |
| events. These events don't start new spans, but do indicate |
| that any open span active at the time of the stop event should |
| be finished. In an operating system context, if |
| <code>sched_switch</code> is a start-start event, a CPU |
| hotplug off event might be a "stop" event, since it would |
| indicate that a CPU has stopped processing traces without |
| producing any new ones.</p> |
| |
| <p>To return to our unfortunate apprentice example, suppose we |
| have an additional table of "size reset" spells that we know |
| were cast during the sequence of size change spells. A size |
| reset spell just returns a creature to whatever size it had |
| without any magical augmentation. The raw table might look |
| something like this.</p> |
| |
| <table class="general spantable"> |
| <caption>Raw size-reset spell record</caption> |
| <tr> |
| <th>_ts</th> |
| </tr> |
| <tr> |
| <td>5</td> |
| </tr> |
| <tr> |
| <td>7</td> |
| </tr> |
| </table> |
| |
| <p>If we feed both our original size spell record event table |
| <emph>and</emph> our size-reset spell table into |
| <code>time_series_to_spans</code>, we end up with a span table |
| that looks like this.</p> |
| |
| <table class="spanop"> |
| <tr class="times"> |
| <td>Time ▶</td> |
| <td><span>1</span></td> |
| <td><span>2</span></td> |
| <td><span>3</span></td> |
| <td><span>4</span></td> |
| <td><span>5</span></td> |
| <td><span>6</span><span>7</span></td> |
| </tr> |
| <tr> |
| <td>Size</td> |
| <td colspan="2">tiny</td> |
| <td colspan="1">huge</td> |
| <td colspan="1">large</td> |
| <td colspan="1" class="empty" /> |
| <td colspan="1">huge</td> |
| </tr> |
| <tr class="times"> |
| <td>Time ▶</td> |
| <td><span>1</span></td> |
| <td><span>2</span></td> |
| <td><span>3</span></td> |
| <td><span>4</span></td> |
| <td><span>5</span></td> |
| <td><span>6</span><span>7</span></td> |
| </tr> |
| </table> |
| |
| <p>Note the differences: first, we now have a "hole" between |
| times five and six, because the stop table told us that we |
| stopped changing our poor confused creature's size at time |
| five and didn't start changing it again until time six. |
| Second, we have a "huge" span from time six to seven, because |
| the span beginning at time six is no longer left open after |
| <code>time_series_to_spans</code> ends.</p> |
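| |
| <p>The start-start and stop behavior described above can be |
| modeled in Python as follows. This is a simplified sketch for |
| a single, non-partitioned source with one payload column. |
| The helper name <code>to_spans</code> is hypothetical, and |
| ordering a stop before a start that shares its timestamp is an |
| assumption of the sketch rather than documented behavior.</p> |

```python
# Hypothetical model of span construction from raw events.  Start events
# are (timestamp, payload) pairs; stop events are bare timestamps.

def to_spans(starts, stops=()):
    """Build (start, end, payload) spans from start and stop events."""
    events = [(ts, 1, payload) for ts, payload in starts]
    events += [(ts, 0, None) for ts in stops]
    # Process in time order; at equal timestamps, stops go first
    # (an assumption made for this sketch).
    events.sort(key=lambda event: (event[0], event[1]))

    spans = []
    open_span = None  # (start time, payload) of the span being built
    for ts, is_start, payload in events:
        if open_span is not None:
            opened_at, open_payload = open_span
            if ts > opened_at:
                spans.append((opened_at, ts, open_payload))
            open_span = None
        if is_start:
            open_span = (ts, payload)
    # A span still open at the end of input is dropped, not emitted.
    return spans


size_spells = [(1, "tiny"), (3, "huge"), (4, "large"), (6, "huge")]
size_resets = [5, 7]

without_stops = to_spans(size_spells)
with_stops = to_spans(size_spells, size_resets)
```

| <p>Under these assumptions, <code>without_stops</code> drops |
| the final "huge" spell, while <code>with_stops</code> has a |
| hole from five to six and a closed "huge" span from six to |
| seven, matching the two diagrams.</p> |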
| |
| <aside class="note">If you want a span table that substitutes |
| a concrete value (say, "normal") for the hole, you can combine |
| a span outer join of the whole-trace span with |
| <code>COALESCE</code> on the payload column to make |
| one.</aside> |
| |
| <p>Each payload column that <code>time_series_to_spans</code> |
| generates is described by a "source specification". |
| The specification describes, for each output column, the |
| source event table from which we get the column's value and |
| the "edge" from which we draw the value. (The edge defaults |
| to "rising".) Using the "rising" edge means that we draw the |
| output payload column for a span from the event that started |
| the span; using "falling" instead tells |
| <code>time_series_to_spans</code> to draw the payload column |
| value from the <emph>closing</emph> event. We typically stick |
| with "rising" except in special cases.</p> |
| |
| <p><code>time_series_to_spans</code> supports creating |
| partitioned span tables as well; each source specification can |
| be associated with a partition column in that source table. |
| All sources for a given call to |
| <code>time_series_to_spans</code> must be partitioned the same |
| way.</p> |
| |
| <h4><a name="stackification">Stackification</a></h4> |
| |
| <p>Not all raw input events look like a series of starts and |
| stops on a timeline. Another common pattern in raw input is |
| the "start-stop stack", in which a series of nested and |
| balanced start and stop events describe the erection and |
| demolition of a stack of some kind of thing.</p> |
| |
| <p>Stacks can be anything: examples include procedure call |
| stacks, Android synchronous atrace regions, and nested |
| interrupt handlers. To keep with our hapless-apprentice |
| example theme, we'll imagine that spells are prepared by |
| simultaneous chanting, waving, and stirring, and that we have |
| distinct "start" and "stop" records for each activity.</p> |
| |
| <p>Suppose we know at what time our apprentice starts a given |
| activity and know at what time an activity ends. Suppose also |
| that our apprentice at least paid enough attention in class to |
| understand that one always stops the magical activity one most |
| recently started.</p> |
| |
| <p>(Note that at time five, a second chant begins even though |
| a chant was already ongoing. A friend must have joined |
| in.)</p> |
| |
| <table class="general spantable"> |
| <caption>Spell starts</caption> |
| <tr> |
| <th>_ts</th> |
| <th>activity</th> |
| </tr> |
| <tr> |
| <td>1</td> |
| <td>stir</td> |
| </tr> |
| <tr> |
| <td>3</td> |
| <td>wave</td> |
| </tr> |
| <tr> |
| <td>4</td> |
| <td>chant</td> |
| </tr> |
| <tr> |
| <td>5</td> |
| <td>chant</td> |
| </tr> |
| </table> |
| |
| <table class="general spantable"> |
| <caption>Spell stops</caption> |
| <tr> |
| <th>_ts</th> |
| </tr> |
| <tr> |
| <td>2</td> |
| </tr> |
| <tr> |
| <td>7</td> |
| </tr> |
| <tr> |
| <td>7</td> |
| </tr> |
| <tr> |
| <td>7</td> |
| </tr> |
| </table> |
| |
| <p>What happens if we rearrange these data into spans?</p> |
| |
| <table class="spanop"> |
| <caption>Notional stackified spells</caption> |
| <tr class="times"> |
| <td>Time ▶</td> |
| <td><span>1</span></td> |
| <td><span>2</span></td> |
| <td><span>3</span></td> |
| <td><span>4</span></td> |
| <td><span>5</span></td> |
| <td><span>6</span><span>7</span></td> |
| </tr> |
| <tr> |
| <td>Effects</td> |
| <td colspan="1">[stir]</td> |
| <td colspan="1" class="empty" /> |
| <td colspan="1">[wave]</td> |
| <td colspan="1">[wave, chant]</td> |
| <td colspan="2">[wave, chant, chant]</td> |
| </tr> |
| <tr class="times"> |
| <td>Time ▶</td> |
| <td><span>1</span></td> |
| <td><span>2</span></td> |
| <td><span>3</span></td> |
| <td><span>4</span></td> |
| <td><span>5</span></td> |
| <td><span>6</span><span>7</span></td> |
| </tr> |
| </table> |
| |
| <p>This arrangement makes logical sense, but it isn't quite |
| compatible with DCTV's data model. Note that the value of |
| each cell is actually a list! Unlike some databases, DCTV |
| does not support composite (multi-part) values as column |
| values. But here we apparently have composite values in the |
| cells. How do we represent these spans as tables? By <a |
| href="https://en.wikipedia.org/wiki/Database_normalization"> |
| normalization</a>.</p> |
| |
| <table class="general spantable"> |
| <caption>Stack contents</caption> |
| <tr> |
| <th>stack_id</th> |
| <th>depth</th> |
| <th>token</th> |
| </tr> |
| <tr> |
| <td>1</td> |
| <td>0</td> |
| <td>stir</td> |
| </tr> |
| <tr> |
| <td>2</td> |
| <td>0</td> |
| <td>wave</td> |
| </tr> |
| <tr> |
| <td>3</td> |
| <td>0</td> |
| <td>wave</td> |
| </tr> |
| <tr> |
| <td>3</td> |
| <td>1</td> |
| <td>chant</td> |
| </tr> |
| <tr> |
| <td>4</td> |
| <td>0</td> |
| <td>wave</td> |
| </tr> |
| <tr> |
| <td>4</td> |
| <td>1</td> |
| <td>chant</td> |
| </tr> |
| <tr> |
| <td>4</td> |
| <td>2</td> |
| <td>chant</td> |
| </tr> |
| </table> |
| |
| <table class="spanop"> |
| <caption>Normalized stackified spells</caption> |
| <tr class="times"> |
| <td>Time ▶</td> |
| <td><span>1</span></td> |
| <td><span>2</span></td> |
| <td><span>3</span></td> |
| <td><span>4</span></td> |
| <td><span>5</span></td> |
| <td><span>6</span><span>7</span></td> |
| </tr> |
| <tr> |
| <td>Stack Id</td> |
| <td colspan="1">1</td> |
| <td colspan="1" class="empty" /> |
| <td colspan="1">2</td> |
| <td colspan="1">3</td> |
| <td colspan="2">4</td> |
| </tr> |
| <tr class="times"> |
| <td>Time ▶</td> |
| <td><span>1</span></td> |
| <td><span>2</span></td> |
| <td><span>3</span></td> |
| <td><span>4</span></td> |
| <td><span>5</span></td> |
| <td><span>6</span><span>7</span></td> |
| </tr> |
| </table> |
| |
| <p>Now, we can look up the stack corresponding to each span by |
| looking at that span's stack id payload and joining it against |
| the stack contents table. The stackify DCTV intrinsic |
| processes any kind of stack into these two tables (the stack |
| contents regular table and the "stack history" span |
| table).</p> |
| |
| <aside class="note">This approach is admittedly pretty ugly, |
| but it works. If it turns out to be a big enough |
| problem, we may just implement support for composite values |
| (which is really just this approach under the hood).</aside> |
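| |
| <p>To make the normalization concrete, the following Python |
| sketch turns balanced start and stop events into the two |
| tables shown above. It is a model of the idea only: the |
| function name <code>stackify</code> is borrowed from the |
| intrinsic, but the signature, the id numbering, and the choice |
| to apply all events sharing a timestamp before opening a new |
| span are assumptions.</p> |

```python
# Hypothetical model of stackification.  Start events are (timestamp,
# token) pairs; stop events are bare timestamps and must be balanced.

def stackify(starts, stops):
    """Return (stack contents rows, stack history spans)."""
    events = [(ts, 1, token) for ts, token in starts]
    events += [(ts, 0, None) for ts in stops]
    events.sort(key=lambda event: (event[0], event[1]))  # stable sort

    contents = []  # rows of (stack_id, depth, token)
    history = []   # spans of (start, end, stack_id)
    stack = []
    open_id, open_since, next_id = None, None, 1

    i = 0
    while i < len(events):
        ts = events[i][0]
        # Close the span describing the stack as it was before this instant.
        if open_id is not None:
            history.append((open_since, ts, open_id))
            open_id = None
        # Apply every event that happens at this timestamp.
        while i < len(events) and events[i][0] == ts:
            _, is_start, token = events[i]
            if is_start:
                stack.append(token)
            else:
                stack.pop()  # stop the most recently started activity
            i += 1
        # A non-empty stack gets a fresh id and a snapshot of its contents.
        if stack:
            open_id, open_since = next_id, ts
            next_id += 1
            contents.extend(
                (open_id, depth, tok) for depth, tok in enumerate(stack)
            )
    return contents, history


spell_starts = [(1, "stir"), (3, "wave"), (4, "chant"), (5, "chant")]
spell_stops = [2, 7, 7, 7]

stack_contents, stack_history = stackify(spell_starts, spell_stops)
```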
| |
| <h4><a name="span_generation">Generating span tables from thin air</a></h4> |
| |
| <p>There are general utility functions to generate specialized |
| span tables useful for composing with others. |
| The <code>generate_sequential_spans</code> table-valued |
| function generates a sequence of spans according to the start |
| time, stop time, and duration specified in the call. |
| It's useful for generating spans to quantize the timeline into |
| discrete intervals and for generating "whole trace" spans that |
| act as inputs to span joins.</p> |
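| |
| <p>The quantizing behavior can be pictured with a small Python |
| model. The helper <code>sequential_spans</code> below is |
| hypothetical and is not the DCTV function; in particular, |
| clipping the last span to the stop time is an assumption of |
| the sketch.</p> |

```python
# Hypothetical model of generating back-to-back spans over a time range.

def sequential_spans(start, stop, duration):
    """Return (start, end) spans of the given duration covering [start, stop)."""
    if duration <= 0:
        raise ValueError("duration must be positive")
    spans = []
    t = start
    while t < stop:
        # Assumption: a trailing partial interval is clipped to `stop`.
        spans.append((t, min(t + duration, stop)))
        t += duration
    return spans


buckets = sequential_spans(0, 10, 4)       # quantize the timeline
whole_trace = sequential_spans(0, 10, 10)  # a single "whole trace" span
```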
| |
| <p>Each trace namespace has a few convenience functions for |
| succinctly generating, using |
| <code>generate_sequential_spans</code>, certain kinds of span |
| tables. See the <a href="#standard_library">"standard |
| library"</a> reference below.</p> |
| |
| <h3>Dimensional analysis</h3> |
| |
| <p>DCTV provides a dimensional analysis feature to make it |
| easy and natural to query traces using naturally-specified |
| values and to avoid errors that can arise from accidental |
| nonsensical combinations of incompatible units. Each quantity |
| in a query is associated with a <dfn>unit</dfn> and these |
| units propagate through the query as it is processed. |
| Quantities with different units combine according to the rules |
| of <a |
| href="https://en.wikipedia.org/wiki/Dimensional_analysis">dimensional |
| analysis</a>. DCTV also knows how to convert from one |
| compatible unit to another. DCTV will signal errors rather |
| than produce results that are dimensional nonsense.</p> |
| |
| <h4>Unit specification</h4> |
| |
| <p>Units come from two sources:</p> |
| <ol> |
| <li>intrinsic tagging of |
| quantities with units during trace parsing, and</li> |
| <li>explicit |
| tagging of quantities with units in query syntax.</li> |
| </ol> |
| |
| <p>The syntax for specifying a unit is just adding the name of |
| the unit after a numeric literal. For simple alphanumeric |
| unit names, a bare word is sufficient, e.g., <code>4ns</code>. |
| For more complicated units that contain operators that SQL |
| would otherwise interpret as part of expressions, the unit |
| name needs to be quoted with backticks, as in |
| <code>4`miles/hour`</code>. Without the backticks, SQL would |
| interpret <code>4miles/hour</code> as an attempt to divide the |
| quantity <code>4</code> by the column <code>hour</code>, which |
| is probably not what we want.</p> |
| |
| <h4>Unit names</h4> |
| |
| <!-- TODO(dancol): include the unit name list here --> |
| <p>DCTV understands both common and abbreviated names for |
| units. This document will eventually list all understood unit |
| names; for the moment, see <a |
| href="https://team.git.corp.google.com/dctv/dctv/+/master/src/dctv/units.txt">units.txt</a> |
| in the DCTV source code.</p> |
| |
| <h4>Unit conversion</h4> |
| |
| <p>Queries can explicitly convert units from one type to |
| another using the <code>IN</code> operator. <!-- TODO(dancol): |
| link to syntax --></p> |
| |
| <aside class="example code-example"><![CDATA[DCTV> SELECT 4`inches` IN cm; |
| 4 IN cm [cm] |
| ------------ |
| 10.16]]></aside> |
| |
| <p>In the DCTV REPL, column headers that denote a quantity |
| with a unit list that unit in square brackets after the column |
| name. Above, we see <code>[cm]</code> at the end of the |
| column name, indicating that the <code>10.16</code> is |
| specified in terms of centimeters.</p> |
| |
| <p>DCTV's unit analysis also understands rates. In the |
| example below, DCTV gives a unit in terms of miles, because |
| we're multiplying a rate, in miles per hour, by a unit of |
| time. The time unit here need not be the literal unit used in |
| the rate: DCTV will convert units as needed.</p> |
| |
| <aside class="example code-example"><![CDATA[DCTV> SELECT 4`miles/hour` * 2`days`; |
| (4 * 2) [mi] |
| ------------ |
| 192]]></aside> |
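| |
| <p>For readers who want to see the arithmetic behind the two |
| REPL examples, here is a minimal Python sketch of dimensional |
| analysis. It is not how DCTV is implemented: the |
| <code>Quantity</code> class, the <code>quantity</code> helper, |
| and the tiny unit table are assumptions invented for the |
| example, and the conversion factors are the standard |
| definitions of the inch and the mile.</p> |

```python
from fractions import Fraction

# Hypothetical unit table: name -> (factor to the base unit, dimensions).
# Base units here are meters for length and seconds for time.
UNITS = {
    "inches": (Fraction(254, 10000), {"length": 1}),
    "cm": (Fraction(1, 100), {"length": 1}),
    "miles": (Fraction(1609344, 1000), {"length": 1}),
    "hour": (Fraction(3600), {"time": 1}),
    "days": (Fraction(86400), {"time": 1}),
}


class Quantity:
    """A value in base units plus a map of dimension exponents."""

    def __init__(self, value, dims):
        self.value = Fraction(value)
        self.dims = {k: v for k, v in dims.items() if v != 0}

    def _combine(self, other, sign):
        dims = dict(self.dims)
        for name, power in other.dims.items():
            dims[name] = dims.get(name, 0) + sign * power
        return dims

    def __mul__(self, other):
        return Quantity(self.value * other.value, self._combine(other, 1))

    def __truediv__(self, other):
        return Quantity(self.value / other.value, self._combine(other, -1))

    def to(self, unit):
        """Express this quantity in `unit`, or fail if dimensions differ."""
        factor, dims = UNITS[unit]
        if dims != self.dims:
            raise ValueError(f"cannot convert {self.dims} to {unit}")
        return self.value / factor


def quantity(value, unit):
    factor, dims = UNITS[unit]
    return Quantity(Fraction(value) * factor, dims)


in_cm = quantity(4, "inches").to("cm")
speed = quantity(4, "miles") / quantity(1, "hour")
distance = (speed * quantity(2, "days")).to("miles")
```

| <p>In this model <code>float(in_cm)</code> is |
| <code>10.16</code> and <code>distance</code> is |
| <code>192</code>, while asking for a duration in centimeters |
| raises an error instead of producing a number.</p> |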
| |
| |
| <h2><a name="differences">Differences from standard SQL</a></h2> |
| |
| <h3>Nested namespaces</h3> |
| <p>Standard SQL provides a two-level namespace for tables: |
| each table is named by an optional schema (followed by a dot), |
| and then a table name. DCTV, by contrast, allows for |
| arbitrarily deep nesting of namespaces, with each namespace |
| component separated by a period. (SQL's standard syntax is a |
| special case.) We use the nested namespace syntax to talk |
| about specific tables and views embedded in a "trace |
| sub-namespace", which we form when we mount a trace into the |
| global SQL namespace.</p> |
| |
| <h3>Keyword arguments</h3> |
| |
| <p>Normal SQL allows only positional arguments to function |
| calls. DCTV allows for Python-style keyword arguments as |
| well, with each keyword-argument pair separated by the "=>" |
| token. See the <a href="#syntax">syntax reference</a> for |
| details.</p> |
| |
| <h3>Extended table-valued-function-call syntax</h3> |
| <p>DCTV exposes some facilities as table-valued functions. |
| The arguments to these functions are evaluated in a context |
| different from normal SQL expression evaluation, and in this |
| context, DCTV supports extended syntax, including the use of |
| list and dictionary literals. (Table-valued functions are |
| Python functions and these list and dictionary literals become |
| list and dict values inside calls.) See the syntax reference |
| for details.</p> |
| |
| <h3>Miscellaneous syntax extensions</h3> |
| |
| <p>DCTV is designed so that users rarely have to fight the |
| syntax. Wherever SQL requires a list of something to be |
| comma-separated, DCTV allows and ignores a trailing comma. |
| Where SQL requires a list terminator (e.g., semicolons after |
| each query statement), DCTV allows users to omit the list |
| terminator.</p> |
| |
| <p>DCTV recognizes <code><></code> and the C-style |
| <code>!=</code> operators as equivalent.</p> |
| |
| <p>DCTV provides the "spaceship" and "anti-spaceship" |
| operators <code><=></code> and <code><!=></code>, |
| respectively, which act like <code>==</code> and |
| <code>!=</code>, except that they treat <code>NULL</code> as |
| being equal to itself. (MySQL calls these operators "null |
| safe comparison operators".)</p> |
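| <p>The difference between ordinary and null-safe comparison can be modeled in a few lines of Python. This is a sketch of the semantics only, using <code>None</code> to stand in for <code>NULL</code>; the function names are hypothetical, and the ordinary comparison is assumed to follow the standard SQL rule that a comparison involving <code>NULL</code> yields <code>NULL</code>.</p> |
| <aside class="example code-example"><![CDATA[ |
```python
def sql_eq(a, b):
    """Ordinary SQL equality: any comparison involving NULL yields NULL."""
    if a is None or b is None:
        return None
    return a == b


def null_safe_eq(a, b):
    """Model of <=>: NULL equals NULL and differs from every other value."""
    if a is None or b is None:
        return a is None and b is None
    return a == b


def null_safe_ne(a, b):
    """Model of <!=>: the negation of null-safe equality."""
    return not null_safe_eq(a, b)
```
| ]]></aside> |
| <p>Unlike the ordinary comparison, the null-safe forms in this model always return a definite true or false, which makes them convenient in join and filter conditions over columns that may contain <code>NULL</code>.</p> |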
| |
| <p>In addition to the standard SQL <code>-- </code> comment |
| prefix, DCTV allows the use of <code>#</code> as a |
| Python-style comment prefix and the use of <code>/*</code> and |
| <code>*/</code> for C-style block comments.</p> |
| |
| <h3>Missing features</h3> |
| |
| <p>DCTV does not implement some features of more traditional |
| databases. The following table summarizes the features not |
| provided, whether we plan to provide them, and any additional |
| relevant information.</p> |
| |
| <table class="general"> |
| <tr> |
| <th>Feature</th> |
| <th>Status</th> |
| <th>Comment</th> |
| </tr> |
| <tr> |
| <td>INSERT/UPDATE/DELETE</td> |
| <td>Not planned</td> |
| <td>DCTV is immutable</td> |
| </tr> |
| <tr> |
| <td>SQL1999 window functions</td> |
| <td>Planned</td> |
| <td></td> |
| </tr> |
| <tr> |
| <td>SQL/PL</td> |
| <td>Planned</td> |
| <td>Will be accelerated</td> |
| </tr> |
| <tr> |
| <td>Recursive CTEs</td> |
| <td>Planned</td> |
| <td></td> |
| </tr> |
| <tr> |
| <td>Correlated subqueries</td> |
| <td>Planned</td> |
| <td></td> |
| </tr> |
| </table> |
| <h1><a name="syntax">Syntax reference</a></h1> |
| <h2>SQL Statement list</h2> |
| <p>The REPL accepts statement lists as top-level input.</p> |
| <object class="sytax" data="sql-stmt-list.syntax.svg" type="image/svg+xml" /> |
| <h2>SQL statement</h2> |
| <p>A given SQL statement is either a SELECT, which performs a |
| query, or one of a few miscellaneous types of data management |
| operation.</p> |
| <object class="sytax" data="sql-stmt.syntax.svg" type="image/svg+xml" /> |
| <h2>SELECT</h2> |
| <p>A SELECT is a combination of one or more "select core" |
| statements (combined together with operators like |
| <code>UNION</code>), all sorted and windowed.</p> |
| <p>Note that span tables cannot be combined using |
| SQL compound operators.</p> |
| <p>Common table expressions are "local" views that exist only |
| in the context of the following SELECT and |
| subsequently-defined common table expressions. (That is, the |
| common table expressions have lexical scope, and the names are |
| bound as with <code>let*</code> in Lisp.)</p> |
| <object class="sytax" data="select-stmt.syntax.svg" type="image/svg+xml" /> |
| <h2>Regular select core</h2> |
| |
| <p>This diagram shows the syntax for the main body of a |
| <code>SELECT</code> statement. If the keyword |
| <code>SPAN</code> appears after the <code>SELECT</code>, the |
| "type" of the result of the <code>SELECT</code> is a span |
| table; otherwise, it's a regular table.</p> |
| |
| <p>In <code>SPAN</code> mode, <code>SELECT</code> always |
| includes the special columns <code>_ts</code>, |
| <code>_duration</code>, and (if partitioned) the partition |
| column in the selected column set. <code>SELECT SPAN FROM |
| ...</code> (with no column list between the <code>SPAN</code> |
| and the <code>FROM</code>) indicates selecting |
| <emph>only</emph> these special columns. These special |
| columns may not be specified "by hand" in the result-column |
| list.</p> |
| |
| <p>The <code>table-or-join</code> clause describes the syntax |
| for span join operators. The <code>GROUP...USING SPANS</code> |
| syntax describes a span group operation from the data model |
| section; the <code>GROUP...USING PARTITION</code> syntax |
| describes a span departition operation. <code>GROUP BY</code> |
| works exactly the same way it does in standard SQL.</p> |
| <p><code>HAVING</code> in span mode always filters the |
| <emph>generated</emph> spans; <code>WHERE</code> filters the |
| <emph>inputs</emph> to any span join and grouping operations, |
| analogously to the distinction between <code>WHERE</code> and |
| <code>HAVING</code> in standard SQL.</p> |
| <object class="sytax" data="regular-select.syntax.svg" type="image/svg+xml" /> |
| <h2>Result column</h2> |
| <object class="sytax" data="result-column.syntax.svg" type="image/svg+xml" /> |
| <h2>Table or join specification</h2> |
| <p>This element describes a "column source" from which a |
| <code>SELECT</code> draws columns. It can be a simple table |
| name, a call to a table-valued function, a subquery (of which |
| <code>VALUES</code> is a special case), or a join operation of |
| other column sources.</p> |
| <p>A comma joining two table specifications is equivalent to |
| <code>INNER JOIN</code>.</p> |
| <p><code>AS</code> assigns a local alias to one of these |
| column sources, the alias being useful in expressions in the |
| result-column clauses. If a column list comes after the |
| <code>AS</code>, the columns of the thing named with |
| <code>AS</code> are renamed to match the columns in the column |
| list that follows the <code>AS</code>, which must have the |
| same length as the set of columns in the named thing.</p> |
| <object class="sytax" data="table-or-join.syntax.svg" type="image/svg+xml" /> |
| <h2>Conventional join</h2> |
| <p>A normal SQL join.</p> |
| <object class="sytax" data="conventional-join.syntax.svg" type="image/svg+xml" /> |
| <h2>Span join</h2> |
| <p>Describes a span join operation. The <code>PARTITION |
| AS</code> clause provides the name of the partition column in |
| the resulting span table, which must be specified if the left |
| and right span tables have partition columns with |
| different names.</p> |
| <object class="sytax" data="span-join.syntax.svg" type="image/svg+xml" /> |
| <h2>Span broadcast</h2> |
| <p>A span broadcast operation. In the |
| <code>BROADCAST...INTO</code> variant, the unpartitioned span |
| table is on the left, whereas in the |
| <code>BROADCAST...FROM</code> variant, the unpartitioned span |
| table is on the right.</p> |
| <object class="sytax" data="span-broadcast.syntax.svg" type="image/svg+xml" /> |
| <h2>Table specification</h2> |
| <p>Names a single table, either as a name of a table |
| in the table namespace, a call to a table-valued |
| function in the table namespace, or a subquery.</p> |
| <object class="sytax" data="table-spec.syntax.svg" type="image/svg+xml" /> |
| <h2>Table-valued function arglist</h2> |
| <p>The argument list for a table-valued function call. Note |
| the keyword arguments.</p> |
| <object class="sytax" data="tvf-arglist.syntax.svg" type="image/svg+xml" /> |
| <h2>Table-valued function (TVF) expression</h2> |
| <p>The syntax for an expression in TVF context. Note the dict |
| and list literal syntax. A subquery is also a valid argument |
| to a table-valued function!</p> |
| <object class="sytax" data="tvf-expr.syntax.svg" type="image/svg+xml" /> |
| <h2>SQL expression</h2> |
| <p>Syntax for expressions that can occur in a query |
| outside TVF context.</p> |
| <object class="sytax" data="expr.syntax.svg" type="image/svg+xml" /> |
| <h2>Function call argument list</h2> |
| <p>The argument list for a call to a function in SQL |
| expression context. Note the keyword arguments and |
| the optional <code>DISTINCT</code> keyword.</p> |
| <object class="sytax" data="function-arglist.syntax.svg" type="image/svg+xml" /> |
| <h2>Data type names</h2> |
| <p>List of allowed data type names.</p> |
| <object class="sytax" data="type-name.syntax.svg" type="image/svg+xml" /> |
| <h2>Literal value syntax</h2> |
| <p>Literal values can appear either as regular SQL expressions |
| or as TVF expressions. In TVF (i.e., Python) context, |
| <code>TRUE</code> becomes <code>True</code>, |
| <code>FALSE</code> <code>False</code>, and <code>NULL</code> |
| <code>None</code>.</p> |
| <object class="sytax" data="literal-value.syntax.svg" type="image/svg+xml" /> |
| <h2>Bind parameters</h2> |
| <p>Represents a parameter substitution in a query. |
| Positional <code>?</code> substitutions without explicit |
| numbers are numbered automatically. Numbering starts from |
| zero and proceeds left-to-right during parsing, with each |
| unnumbered substitution receiving the next number in |
| sequence. Explicitly numbered substitutions do not affect |
| this automatic numbering.</p> |
| <object class="sytax" data="bind-parameter.syntax.svg" type="image/svg+xml" /> |
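| <p>The numbering rule can be illustrated with a short Python sketch. It is not the DCTV parser: it assumes that an explicitly numbered substitution is spelled as a question mark followed by digits (for example <code>?5</code>), and it scans the query text naively, without skipping string literals or comments.</p> |
| <aside class="example code-example"><![CDATA[ |
```python
import re

# Assumed spelling: "?" for an unnumbered substitution, "?N" for an explicit one.
PLACEHOLDER = re.compile(r"\?(\d*)")


def number_placeholders(query):
    """Return the substitution number of each placeholder, left to right."""
    next_auto = 0
    numbers = []
    for match in PLACEHOLDER.finditer(query):
        explicit = match.group(1)
        if explicit:
            # Explicit numbers are used as written and leave the counter alone.
            numbers.append(int(explicit))
        else:
            numbers.append(next_auto)
            next_auto += 1
    return numbers


numbers = number_placeholders("SELECT ? + ?5 + ?")
```
| ]]></aside> |
| <p>Here the two unnumbered substitutions receive 0 and 1, and the explicit <code>?5</code> in between keeps its own number.</p> |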
| <h2>Numeric literal</h2> |
| <object class="sytax" data="numeric-literal.syntax.svg" type="image/svg+xml" /> |
| <h2>VALUES list</h2> |
| <p><code>VALUES</code> works just as it does in standard SQL |
| and allows query authors to include data inline. DCTV does |
| not provide a <code>CREATE TABLE</code> statement, but one can |
| achieve a similar effect by using <code>CREATE VIEW</code> |
| with a <code>VALUES</code> select-part.</p> |
| <object class="sytax" data="values-list.syntax.svg" type="image/svg+xml" /> |
| <div /> |
| <object class="sytax" data="values-list-row.syntax.svg" type="image/svg+xml" /> |
| <h2>Common table expression</h2> |
| <p>The common table expression part of a SQL query. |
| The optional column list in the name performs the same |
| column-renaming operation that the optional column list after |
| <code>AS</code> does.</p> |
| <object class="sytax" data="common-table-expression.syntax.svg" type="image/svg+xml" /> |
| <h2>Namespace prefix</h2> |
| <p>Names a part of the DCTV namespace.</p> |
| <object class="sytax" data="ns-prefix.syntax.svg" type="image/svg+xml" /> |
| <h2>Table namespace name</h2> |
| <p>Describes a table in the table namespace.</p> |
| <object class="sytax" data="table-ns-name.syntax.svg" type="image/svg+xml" /> |
| <h2>Table-valued-function name</h2> |
| <p>Describes a table-valued-function in the table namespace.</p> |
| <object class="sytax" data="tvf-ns-name.syntax.svg" type="image/svg+xml" /> |
| <h2>SQL function name</h2> |
| <p>Describes a SQL function name in the function namespace. |
| Note that the function namespace is distinct from the table |
| namespace.</p> |
| <object class="sytax" data="function-name.syntax.svg" type="image/svg+xml" /> |
| <h2>CREATE VIEW</h2> |
| <p><code>CREATE VIEW</code> works just like it does in |
| standard SQL.</p> |
| <object class="sytax" data="create-view-stmt.syntax.svg" type="image/svg+xml" /> |
| <h2>DROP VIEW</h2> |
| <p><code>DROP VIEW</code> works just like it does in |
| standard SQL.</p> |
| <object class="sytax" data="drop-view-stmt.syntax.svg" type="image/svg+xml" /> |
| <h2>DROP ALL</h2> |
| <p><code>DROP ALL</code> drops everything from a prefix |
| of the DCTV namespace. It is useful for "unmounting" |
| traces by detaching a trace sub-namespace from the global |
| namespace.</p> |
| <object class="sytax" data="drop-all-stmt.syntax.svg" type="image/svg+xml" /> |
| <h2>MOUNT TRACE</h2> |
| <p>Mount trace "mounts" a trace file at a prefix of the trace |
| namespace. See the standard library section of this manual |
| for a description of what's available under the mount |
| prefix.</p> |
| <object class="sytax" data="mount-trace-stmt.syntax.svg" type="image/svg+xml" /> |
| <h2>Ordering term</h2> |
| <p>Describes one ordering term of a <code>SELECT</code>.</p> |
| <object class="sytax" data="ordering-term.syntax.svg" type="image/svg+xml" /> |
| <h2>SQL compound operators</h2> |
| <p>Ways of combining multiple <code>SELECT</code> "cores".</p> |
| <p>Not applicable to span tables.</p> |
| <object class="sytax" data="compound-operator-name.syntax.svg" type="image/svg+xml" /> |
| <h2>Comment syntax</h2> |
| <object class="sytax" data="comment-syntax.syntax.svg" type="image/svg+xml" /> |
| <h2>Operators</h2> |
| <p>The following table gives the precedence |
| of each operator. Operators on the same row have |
| the same precedence.</p> |
| |
| <table class="general" style="width:50%"> |
| <caption>Operator precedence</caption> |
| <tr> |
| <td> |
| * |
| / |
| % |
| // |
| </td> |
| </tr> |
| <tr> |
| <td> |
| + |
| - |
| </td> |
| </tr> |
| <tr> |
| <td> |
| << |
| >> |
| </td> |
| </tr> |
| <tr> |
| <td> |
| & |
| </td> |
| </tr> |
| <tr> |
| <td> |
| | |
| </td> |
| </tr> |
| <tr> |
| <td> |
| BETWEEN |
| </td> |
| </tr> |
| <tr> |
| <td> |
| = |
| <=> |
| <!=> |
| >= |
| > |
| <= |
| < |
| != |
| <> |
| == |
| IS |
| </td> |
| </tr> |
| <tr> |
| <td> |
| NOT |
| </td> |
| </tr> |
| <tr> |
| <td> |
| AND |
| </td> |
| </tr> |
| <tr> |
| <td> |
| OR |
| </td> |
| </tr> |
| </table> |
| |
| <table class="general" style="width:50%"> |
| <caption>Operator descriptions</caption> |
| <tr> |
| <th>Operator</th> |
| <th>Description</th> |
| </tr> |
| <tr> |
| <td>*</td> |
| <td>Multiply</td> |
| </tr> |
| <tr> |
| <td>/</td> |
| <td>True division (yields a float)</td> |
| </tr> |
| <tr> |
| <td>%</td> |
| <td>Modulus</td> |
| </tr> |
| <tr> |
| <td>//</td> |
| <td>Floor division (truncates toward zero)</td> |
| </tr> |
| <tr> |
| <td>+</td> |
| <td>Addition</td> |
| </tr> |
| <tr> |
| <td>-</td> |
| <td>Subtraction</td> |
| </tr> |
| <tr> |
| <td><<</td> |
| <td>Left shift</td> |
| </tr> |
| <tr> |
| <td>>></td> |
| <td>Right shift</td> |
| </tr> |
| <tr> |
| <td>&</td> |
| <td>Bitwise AND</td> |
| </tr> |
| <tr> |
| <td>|</td> |
| <td>Bitwise OR</td> |
| </tr> |
| <tr> |
| <td>BETWEEN</td> |
| <td>Standard SQL</td> |
| </tr> |
| <tr> |
| <td>= ==</td> |
| <td>Equality</td> |
| </tr> |
| <tr> |
| <td><=> IS</td> |
| <td>NULL-safe equality</td> |
| </tr> |
| <tr> |
| <td><!=> IS NOT</td> |
| <td>NULL-safe inequality</td> |
| </tr> |
| <tr> |
| <td>>=</td> |
| <td>Greater than or equal</td> |
| </tr> |
| <tr> |
| <td>></td> |
| <td>Greater than</td> |
| </tr> |
| <tr> |
| <td><=</td> |
| <td>Less than or equal</td> |
| </tr> |
| <tr> |
| <td><</td> |
| <td>Less than</td> |
| </tr> |
| <tr> |
| <td>!= <></td> |
| <td>Inequality</td> |
| </tr> |
| <tr> |
| <td>NOT</td> |
| <td>Logical negation</td> |
| </tr> |
| <tr> |
| <td>AND</td> |
| <td>Logical conjunction</td> |
| </tr> |
| <tr> |
| <td>OR</td> |
| <td>Logical disjunction</td> |
| </tr> |
| </table> |
| |
| <h1><a name="standard_library">Trace standard library</a></h1> |
| <!-- TODO(dancol): document columns --> |
| <p>DCTV queries operate on a single shared per-session |
| namespace. (Each leaf level of the namespace has distinct |
| mappings for table-valued things and for SQL functions, but |
| the non-leaf nodes are shared between the namespaces.)</p> |
| <h2>Per-trace names</h2> |
| <p>DCTV makes a trace available by "mounting" it under a |
| namespace prefix. Names beginning with this prefix then refer |
| to the trace that was mounted. Multiple traces can be mounted |
| in the same session, and a single query can pull data from |
| multiple traces.</p> |
| <p>In the description below, <code>mytrace.</code> |
| refers to an arbitrary trace mountpoint.</p> |
| <dl> |
| <dt><code>mytrace.raw_events.*</code></dt> |
| <dd> |
| <p>These tables provide access to the raw events embedded |
| in the trace. For example, |
| <code>mytrace.raw_events.sched_switch</code> is a table of |
| <code>sched_switch</code> events, with one column for each |
| field in the ftrace event.</p> |
| <p>The DCTV event parser has a special case for |
| <code>trace_marker_write</code> events: we put each one in |
| a table formed by the concatenation of the event name and |
| the first part of the event payload. For example, |
| <code>`print|B`</code> refers to those |
| <code>trace_marker_write</code> events that begin with the |
| prefix <code>B|</code>, indicating the start of a |
| synchronous application-defined trace event. We need to |
| write <code>mytrace.raw_events.`print|B`</code> instead of |
| <code>mytrace.raw_events.print|B</code> because |
| <code>|</code> is normally an operator, so to treat it as |
| part of a table name, we need to escape it with |
| backticks.</p> |
| <aside class="note">This special case for |
| <code>trace_marker_write</code> is an ugly hack that exists so that we |
| can give a different "schema" (set of columns) to each |
| different kind of write event depending on its |
| payload.</aside> |
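| <p>To make the quoting rule concrete, the following Python sketch splits a dotted name into namespace components while keeping backtick-quoted components whole. It is an illustration rather than DCTV's actual lexer; the function name is hypothetical, and the sketch assumes that a quoted component cannot itself contain a backtick.</p> |
| <aside class="example code-example"><![CDATA[ |
```python
def split_name(name):
    """Split a dotted name into components, keeping quoted parts whole."""
    parts = []
    current = []
    quoted = False
    for ch in name:
        if ch == "`":
            # Toggle quoting; the backtick itself is not part of the name.
            quoted = not quoted
        elif ch == "." and not quoted:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    if quoted:
        raise ValueError("unterminated backtick in name")
    parts.append("".join(current))
    return parts


parts = split_name("mytrace.raw_events.`print|B`")
```
| ]]></aside> |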
| </dd> |
| <dt><code>mytrace.scheduler.timeslices_p_cpu</code></dt> |
| <dd> |
| <p>This table is a span table partitioned by CPU |
| representing the scheduler activity of the system.</p> |
| </dd> |
| <dt><code>mytrace.scheduler.cpufreq_p_cpu</code></dt> |
| <dd> |
| <p>This table is a span table partitioned by CPU |
| representing the CPU frequency that each CPU is |
| known to have.</p> |
| </dd> |
| <dt><code>mytrace.last_ts</code></dt> |
| <dd> |
| <p>Single-column, single-value event table giving the |
| largest timestamp found in the trace. It's useful for |
| building spans that cover the whole trace, but see |
| <code>quantize</code> immediately below.</p> |
| </dd> |
| <dt><code>mytrace.quantize(interval=>NULL)</code></dt> |
| <dd> |
| <p>This table-valued function generates a payloadless span |
| table that divides the trace timeline into fixed-size |
| spans of duration <code>interval</code>. This table is |
| useful for quantizing the trace timeline into fixed-size |
| blocks for display or analysis, and is designed to work |
| with span group operations.</p> |
| <p>If <code>interval</code> is <code>NULL</code>, |
| generates a span table with one huge span covering the |
| whole trace.</p> |
| <aside class="example code-example"><![CDATA[SELECT SPAN SUM(_duration)/5s AS non_idle_ratio |
| FROM (SELECT SPAN * FROM my_cpu_timeslices WHERE pid != 0) |
| GROUP USING SPANS FROM mytrace.quantize(5s) ]]></aside> |
| </dd> |
| </dl> |
| |
| <h2>The DCTV namespace</h2> |
| |
| <p>DCTV-specific query functions live under the |
| <code>dctv.</code> namespace prefix.</p> |
| |
| <dl> |
| <dt><code><dfn><a name="time_series_to_spans"> |
| dctv.time_series_to_spans(*, sources, columns, partition=>NULL) |
| </a></dfn></code></dt> |
| <dd> |
| <p>This function implements the time series to span |
| conversion operation described <a |
| href="#time_series_to_span_conversion"> above</a>.</p> |
| <p><code>sources</code> is a list of source |
| specifications. Each source specification is a dict with |
| the following entries; entries are optional unless |
| otherwise indicated. As a convenience, a source |
| specification can also be a list, the elements of which |
| are turned into dict elements in the order given below. |
| If a source specification is neither a dict nor a list, it |
| is treated as if it were a dict with only the source |
| element provided. (This way, a bare table is a valid |
| event source.)</p> |
| <dl> |
| <dt>source</dt> |
| <dd>The event table providing the raw events that this routine |
| turns into spans. Mandatory.</dd> |
| <dt>role</dt> |
| <dd>Either <code>"start"</code> or <code>"stop"</code>, defaulting to |
| <code>"start"</code>. Indicates whether the given source starts and |
| separates output spans (in the former case) or whether it stops only |
| started spans (the latter case).</dd> |
| <dt>partition</dt> |
| <dd>Either a string naming the column by which this |
| source is partitioned or <code>NULL</code>, indicating |
| that the source is unpartitioned. Defaults to |
| <code>NULL</code>.</dd> |
| <dt>timestamp</dt> |
| <dd>The name of the column in the source providing the timestamp. |
| Defaults to <code>"_ts"</code>.</dd> |
| <dt>nickname</dt> |
| <dd>An optional string assigning a name to this source that column |
| specifications in <code>columns</code> can reference.</dd> |
| </dl> |
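| <p>The three accepted forms of a source specification can be reduced to one canonical dict. The Python sketch below shows one way to do that, following the entry order and defaults listed above. It is not DCTV's implementation: the function name, the error handling, and the <code>None</code> default for <code>nickname</code> are assumptions.</p> |
| <aside class="example code-example"><![CDATA[ |
```python
# Entry order used when a specification is given as a list.
SOURCE_KEYS = ["source", "role", "partition", "timestamp", "nickname"]
SOURCE_DEFAULTS = {
    "role": "start",
    "partition": None,
    "timestamp": "_ts",
    "nickname": None,  # assumed default; the entry is simply optional
}


def normalize_source(spec):
    """Turn a dict, list, or bare source into a fully populated dict."""
    if isinstance(spec, dict):
        given = dict(spec)
    elif isinstance(spec, (list, tuple)):
        if len(spec) > len(SOURCE_KEYS):
            raise ValueError("too many elements in source specification")
        given = dict(zip(SOURCE_KEYS, spec))
    else:
        # Anything else is treated as the source itself (e.g. a bare table).
        given = {"source": spec}
    unknown = set(given) - set(SOURCE_KEYS)
    if unknown:
        raise ValueError("unknown entries: %s" % sorted(unknown))
    if "source" not in given:
        raise ValueError("the source entry is mandatory")
    return {**SOURCE_DEFAULTS, **given}
```
| ]]></aside> |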
| <p><code>columns</code> is a list of column |
| specifications, each representing one payload column in |
| the generated span table.</p> |
| |
| <p>Each column specification is a dict with the elements |
| below. As a convenience, a column specification can also |
| be a list, the elements of which are turned into dict |
| elements in the order given below. If a column |
| specification is neither a dict nor a list, it must be a |
| string, and it is treated as if it were a dict with only |
| the column element set. (This way, a simple string is a |
| valid column descriptor in the case that we have only one |
| source.)</p> |
| |
| <dl> |
| <dt>column</dt> |
| <dd>String naming the output column in the generated |
| span table. Mandatory.</dd> |
| <dt>source</dt> |
| <dd>Identifies the source that supplies this output |
| column. May be omitted when only one source is given to |
| the call; otherwise, must either be a number (naming a |
| source positionally) or a string (matching the nickname |
| given to a source in its specification).</dd> |
| <dt>source_column</dt> |
| <dd>Name of the column in the source event table that |
| supplies the value of the corresponding column in the |
| output table. Defaults to the name of the output |
| column.</dd> |
| <dt>edge</dt> |
| <dd>Either <code>"rising"</code> or |
| <code>"falling"</code>, defaulting to the former. |
| Determines which event supplies the value of the column |
| in the output table: the event that starts a span or the |
| event that ends a span.</dd> |
| </dl> |
| <p><code>partition</code> is the name of the partition |
| column in the output span table. If it is specified, all |
| sources must have their own partitions specified. If it |
| is not set, then no source may be partitioned.</p> |
| |
| <aside class="example code-example"><![CDATA[SELECT SPAN * FROM dctv.time_series_to_spans( |
| sources=>[my_raw_events_table], |
| columns=>["foo", "bar", "qux"], |
| )]]></aside> |
| |
| <aside class="example code-example"><![CDATA[SELECT SPAN * FROM dctv.time_series_to_spans( |
| sources=>[{source=>table1, partition=>"cpu", nickname=>"foo"}, |
| {source=>table2, partition=>"cpu", role=>"stop"}], |
| columns=>[{column=>"total_things", |
| source=>"foo", |
| source_column=>"last_things", |
| edge=>"falling"}], |
| partition=>"cpu" |
| )]]></aside> |
| <p>This routine looks pretty ugly when called. Most of |
| the time, you want to use one of the pre-defined span |
| tables in the <a href="#standard_library">standard |
| library</a>, which call <code>time_series_to_spans</code> |
| for you.</p> |
| </dd> |
| <dt><dfn><code>dctv.stack_history()</code></dfn></dt> |
| <dd> |
| <p>This table-valued function understands "nested" events, |
| turning them into stacks for further analysis. |
| It generates a span table mapping |
| time intervals to stack IDs.</p> |
| <p>See the <a href="#stackification">stackification</a> |
| sub-section of the data model section.</p> |
| <!-- TODO(dancol): document argument list --> |
| </dd> |
| <dt><dfn><code>dctv.stack_contents()</code></dfn></dt> |
| <dd> |
| <p>This table-valued function generates the |
| <emph>contents</emph> of the stack IDs generated by the |
| previous function.</p> |
| <p>See the <a href="#stackification">stackification</a> |
| sub-section of the data model section.</p> |
| </dd> |
| <dt><dfn><code>dctv.generate_sequential_spans(start, stop, duration)</code></dfn></dt> |
| <dd> |
| <p>This table-valued function generates "synthetic" spans |
| useful for a variety of purposes. See the <a |
| href="#span_generation">span generation</a> sub-section of |
| the data model section above.</p> |
| <p><code>start</code> is a timestamp at which the spans |
| should start. <code>stop</code> is the time at which the |
| last span should end. <code>duration</code> is the length |
| of each generated span. Output spans are generated with |
| no gaps.</p> |
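| <p>As a rough model of the output, the Python sketch below yields timestamp and duration pairs that tile the requested interval. It only illustrates the "no gaps" property. Clipping the final span so that it ends exactly at <code>stop</code> is an assumption, since the text above does not say how a partial last span is handled.</p> |
| <aside class="example code-example"><![CDATA[ |
```python
def generate_sequential_spans(start, stop, duration):
    """Yield (_ts, _duration) pairs that tile [start, stop) with no gaps."""
    if duration <= 0:
        raise ValueError("duration must be positive")
    ts = start
    while ts < stop:
        # Assumption: the final span is clipped so that it ends at stop.
        yield ts, min(duration, stop - ts)
        ts += duration


spans = list(generate_sequential_spans(0, 25, 10))
```
| ]]></aside> |
| <p>Each span begins where the previous one ends, so the start of every span equals the start plus the duration of the span before it.</p> |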
| </dd> |
| </dl> |
| |
| <h1><a name="example">Worked example</a></h1> |
| <p>Having read the manual above, you should be able to make sense of the following query.</p> |
| <p><code>TODO(dancol):</code> expand this section.</p> |
| <ol> |
| <li>Extract from “print|B” a list of frame-start events.</li> |
| <li>Take these events and, using the “start-start” mode of |
| time_series_to_spans, assemble a set of spans partitioning the |
| trace timeline into frames.</li> |
| <li>Select those frame-spans that lasted longer than 17ms, |
| i.e., that took a long time to render.</li> |
| <li>Intersect this bad-frame span set with the per-processor |
| span table describing what the system is actually |
| doing. Don’t consider the idle process.</li> |
| </ol> |
| <code class="blockquote"><![CDATA[WITH frames AS (SELECT SPAN * FROM dctv.time_series_to_spans( |
| sources=>[{source=>(SELECT * FROM trace.raw_events.`print|B` |
| WHERE name='eglBeginFrame'), |
| timestamp=>'ts'}], |
| columns=>[])), |
| bad_frames AS (SELECT SPAN * FROM frames WHERE _duration > 17ms), |
| bad_timeslices AS (SELECT SPAN * FROM trace.scheduler.timeslices_p_cpu |
| SPAN BROADCAST FROM bad_frames) |
| SELECT comm, cpu, SUM(_duration) AS totdur FROM bad_timeslices |
| WHERE pid != 0 |
| GROUP BY comm, cpu |
| ORDER BY totdur DESC |
| LIMIT 20 |
| ]]></code> |
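| <p>Steps 2 and 3 of the list above can be mimicked in plain Python to show what the intermediate span sets look like. This is a sketch on made-up data, not a replacement for the query: the helper name is hypothetical, timestamps are assumed to be integer nanoseconds, and the last start event is assumed to produce no span because nothing follows it.</p> |
| <aside class="example code-example"><![CDATA[ |
```python
def start_start_spans(timestamps):
    """Build (_ts, _duration) spans from consecutive start events."""
    ordered = sorted(timestamps)
    # Each event starts a span that the next event ends.  Nothing follows the
    # last event, so this sketch emits no span for it (an assumption).
    return [(a, b - a) for a, b in zip(ordered, ordered[1:])]


MS = 1_000_000  # nanoseconds per millisecond (assumed timestamp unit)

# Made-up frame-start timestamps standing in for the eglBeginFrame events.
frame_starts = [0 * MS, 16 * MS, 40 * MS, 56 * MS, 90 * MS]

frames = start_start_spans(frame_starts)
bad_frames = [(ts, dur) for ts, dur in frames if dur > 17 * MS]
```
| ]]></aside> |
| <p>With this data, the frames starting at 16 ms and 56 ms exceed the 17 ms budget. In the query, the broadcast and the final grouping then attribute the CPU time inside such frames to processes.</p> |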
| </main> |
| </div> |
| </body> |
| </html> |