kythe/docs/kythe-storage.txt - platform/external/kythe - Git at Google

 // Copyright 2014 The Kythe Authors. All rights reserved.
 //
 // Licensed under the Apache License, Version 2.0 (the "License");
 // you may not use this file except in compliance with the License.
 // You may obtain a copy of the License at
 //
 //   http://www.apache.org/licenses/LICENSE-2.0
 //
 // Unless required by applicable law or agreed to in writing, software
 // distributed under the License is distributed on an "AS IS" BASIS,
 // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 // See the License for the specific language governing permissions and
 // limitations under the License.

 Kythe Storage Model
 ===================
 Michael J. Fromberger <fromberger@google.com>
 v.0.1.1, 18-Feb-2016: Draft
 :toc:
 :priority: 750

 == Summary

 This document describes a persistent storage representation for Kythe analysis
 data. The main goal of this representation is to permit Kythe data to be stored
 in a compact and portable way.  This is a ``storage'' representation --
 intended for persistent storage of Kythe data -- in contrast to ``serving''
 representations intended for efficient implementation of UI-facing services.
 As such, the performance of general-purpose graph queries is not a primary
 consideration. We expect data stored in this format to be further processed
 into more efficient formats for searching or serving, _e.g._, into denormalized
 paged association tables, or graph-database triples.

 == Goals

 Simplicity::
   Each entry is a simple, tabular record that is easy to generate in an
   analyzer and convert between different physical storage formats (_e.g._,
   LevelDB files, CSV, SQLite tables, Google Cloud Datastore).

 Compactness::
   Storage tables are in 1NF (each entry represents a single atomic value; set-
   or sequence-valued fields are stored as multiple entries), and each entry is
   stored only once. Indexing for efficient queries is the responsibility of the
   consumer.

 Neutrality::
   The representation does not depend on specific features of any particular
   programming language or toolchain, allowing tools for this format to be
   written in any language.

 Portability::
   Analysis artifacts stored in this format can be easily packaged with other
   code artifacts such as object files, JARs, and symbol tables.

 Composability::
   Multiple store files can be easily combined to aggregate stores containing
   results from separate analyses into a single store representing all of them
   together.

 == Non-Goals
 Query efficiency::
   This representation is for storage, not serving, so it is not optimized for
   general-purpose searching or queries. Separate indexes should be extracted
   for those purposes.

 Schematization and validation::
   The storage representation here does not include a schema for its contents,
   apart from the structure of the data itself. A schema for what
   <<TermFact,facts>> should be stored, their exact value formats, and other
   validation constraints are outside the scope of this document. Keys and
   values are stored as strings.

 == Overview

 A *graph store* is the interface between Kythe analyzers and the tools that
 consume their outputs. Output from each (analyzer, target) combination is
 converted into the storage format described below, and the results are
 aggregated together into a combined store. Tools consume Kythe artifacts by
 looking up or extracting values from the graph store to produce indexes, or by
 processing the store in bulk.

 The desired high-level architecture is illustrated by this diagram:

 [graphviz]
 ------------------------------------------------------------------------
 digraph Pipeline {
   graph [rankdir=LR];
   node [shape=box, style=filled, fillcolor=white, fontsize=10, fontname="sans-serif"];

   build [label="Build system", shape=ellipse, fillcolor=gray];

   analysis   [shape=record, label="Cross-reference indexer\l|Static analyzer\l|other analyses ...\l"];
   dedup      [label="Remove duplicates", shape=hexagon];
   storage    [label=<<B>Graph Store</B>>, style="rounded,filled", fillcolor=mediumseagreen];
   tools      [label="Other tools", shape=ellipse, fillcolor=gray];
   synthesis  [shape=record,
     label="Cross-language linking\l|Cleanup, canonicalization\l|other syntheses ...\l",
   ];
   processing [shape=record,
     label="reverse edges\l|denormalize nodes\l|other indexes ...\l",
   ];

   build -> analysis -> dedup -> storage -> processing -> tools;
   storage -> synthesis [dir=both];
 }
 ------------------------------------------------------------------------

 Analyzers, shown on the left, consume build information from the build system
 and emit Kythe <<TermFact,facts>>. Tools, shown on the right, read facts from a
 graph store and do useful work with them. Broadly speaking, there are two
 varieties of processing that can be applied to the data in the store:

  * *Synthesis of new facts.* Some tools extend the graph store by locating
     interesting patterns of data and adding new facts to represent
     them. Examples: Cross-language linking, precomputation of transitive
     closure relations, inference of cross edges for parametric types,
     dependency injection.

  * *Extraction of indexes.* Some tools create new tables to efficiently provide
     services. Examples: Denormalized adjacency-list tables for cross-reference
     serving; graph-database triples for whole-graph analysis.

 This document also defines a RPC-style interface that a graph store
 implementation may use to provide format-independent access to its contents.  A
 graph store implementation is not _required_ to support this interface.  A key
 point of the design, however, is to permit tools to operate on the contents of
 a graph store without a need for ``thick'' client libraries.  An implementation
 that has an idiosyncratic internal representation can support this RPC
 interface, allowing Kythe tools to access their contents without baking
 knowledge of that representation into each consumer.

 == Terminology

 Throughout the rest of this document, the following terminology is used.

 === [[TermNode]] Nodes as Vectors

 As viewed from the storage layer, a `node' in a Kythe graph is essentially a
 bag of string-valued key/value properties associated with a unique name.  In
 order for this representation to make sense, each node must have a unique name,
 to distinguish it from all the other nodes in the graph.

 To solve this naming problem, Kythe adopts the view that a node is a
 _d_-dimensional `vector', where each dimension represents some scalar
 <<TermFact,fact>> about the node: The ``dimensions'' in this case are not
 numbers, but arbitrary semantic categories, _e.g._, kind, identifier, snippet,
 location, corpus, language. A particular node is defined by fixing its values
 in each dimension (where ``empty'' is also an option).

 In practice it is not necessary to model nodes as explicit vectors, but the
 ability to do so is very convenient for computation. In particular, searching,
 clustering, indexing, and other related machine-learning tasks are simple to
 express in this model: Node-vectors can be projected into weight vectors for
 computing similarities or identifying clusters, or subjected to principal
 components analysis to identify relevant features, and so on.

 === [[TermVName]] Vector-Name (*VName*)

 Taking the view that a node is essentially a vector of its properties leads to
 the naming scheme Kythe uses for nodes in its graph:

 A node _N_ can be uniquely identified relative to a universe _U_ of nodes by
 fixing any _v_-dimensional projection of the node's attributes that differs
 from all _U_ \ \{N\} under the same projection. In other words, we can choose
 a name for _N_ by picking a small basis of <<TermFact,facts>> about a node, and
 use the node's projection into the basis as its ``name''.  This works as long
 as the facts we pick are sufficient to distinguish all the nodes in our set
 _U_.

 We call a name constructed using this approach a ``Vector-Name'' or *VName* for
 the node. A VName is the primary unit of naming in the Kythe graph store.

 One important property of a VName is that it is extensible: As a collection of
 nodes grows, new nodes may arrive that differ from the existing nodes, but have
 the same VName. To maintain uniqueness, it is only necessary to add one or more
 additional dimensions to the VName projection to account for the new
 data. Updating existing VNames to a new projection is a trivial mechanical
 rewriting process, particularly when the new projection is an extension of the
 old one.

 See also link:kythe-uri-spec.html[Kythe URI Specification],
 which is essentially the same, except that a VName uses UTF-8 and
 a URI uses `pct-encoded` values.

 The initial definition of a VName includes the following 5 fields:

  * *Signature.* An opaque signature generated by the analyzer. The format of
    this string is opaque outside the analyzer itself, but informally should be
    sufficient to distinguish nodes within a corpus of the language. (However,
    VNames for built-ins should be unique across corpora of the same language).
    For example:
    `com.google.common.collect.Lists.newLinkedList<#1>()`.

  * *Corpus.* The corpus of source code this VName belongs to.  Loosely, a
    corpus is a collection of related files, such as the contents of a given
    source repository.  Corpora accessible via the Internet should generally
    prefer labels shaped like URLs or other address-like strings.
    +
    +
    Examples: "chromium", "aosp", "bitbucket.org/creachadair/stringset".
    +
    +
    We reserve corpus names prefixed with `kythe` for the Kythe
    open-source project itself.
    +
    +
    _Note:_ It is possible, though not recommended, to use a local directory
    path as a corpus label. For storage purposes, corpus labels are _not_
    treated like paths (in particular they are not "cleaned" or otherwise
    lexically normalized as described under *Path* below). Moreover, a literal
    path as a corpus label will generally not work well with corpora defined
    elsewhere, so avoid this formulation unless you don't require your data to
    interoperate with other corpora.

  * *Root.* A corpus-specific root label, typically a directory path or project
    identifier, denoting a distinct subset of the corpus. This may also be used
    to designate virtual collections like generated files.
    +
    +
    An empty Root field should signify a concrete file in the corpus
    relative to the corpus root. The interpretation for a VName with an
    empty Root corresponds to a file under version control in (one of)
    the repository(ies) being analyzed; a non-empty Root indicates a
    generated file, for which the Root is typically (part of) a prefix
    to the path of that file.
    +
    +
    _Rationale:_ Usually a corpus will comprise a single rooted tree of files,
    such as a Git repository -- in which case the Root field can be left empty.
    In some cases, though, a corpus may have more than one tree -- for example,
    if the build tool stores generated code in a separate directory structure
    during the build process. In that case, the Root field can be used to
    distinguish generated paths from checked-in source.
    +
    +
    The interpretation of the Root field is always specific to the corpus.  A
    root _may_ be shaped like a path (say, if it names a directory), but it is
    not required to; it can be an opaque label like `generated` or `branch_name`
    if that makes sense for the corpus in question.  If the Root is intended to
    denote a directory path, it should be _cleaned_ as described under *Path*
    and should not end with a "/".

  * *Path.* A path-structured label describing the ``location'' of the named
    object relative to the corpus and the root. For code, this will typically be
    the relative path of the file containing the code under analysis, such as
    `kythe/cxx/tools/kindex_tool_main.cc` in the `kythe` corpus.
    +
    +
    Paths should be normalized to be relative to a root directory of their
    corpus, and should also be normalized
    link:https://www.usenix.org/legacy/event/usenix2000/general/full_papers/pikelex/pikelex.pdf[cleaned]
    to a canonical form,
    so as to be free of relative markers such as "." and "..".  If a VName's
    path can't be represented without "escaping" from its designated root
    directory, it's usually a good sign that a separate Root label should be
    assigned.
    +
    +
    When a path is specified in a query, the query processor *may*
    normalize it before executing the query (but this is up to the
    tool); but all paths that are stored or output must be normalized.
    +
    +
    The components of a path are separated by "/"s. The path
    must not start with a "/" -- that is, it is relative to the *Root*.
    If the root is empty, it is treated as meaning the top of the
    corpus (for example, the same as "/" if the corpus maps directly to the
    file system).

  * *Language.* The language this name belongs to. The schema defines specific
    labels for each supported language, so we don't wind up with a confusion of
    names like "cxx", "cpp", "C\+\+", etc. As a rule of thumb, we will use the
    common name of the language, in lowercase ("c++", "python", "elisp",
    "objectivec").
    +
    +
    The *language* is empty (that is, "") for some nodes, such as
    link:schema/schema.html#file[file].

 Other fields can be added as necessary—for example, if a Branch or Client label
 becomes necessary. As a rule, we try to keep the number of essential VName
 dimensions as small as possible.

 ==== [[VNameComposition]] VName Composition

 The fields of a VName shall be Unicode strings, save that control characters
 (categories Cf, Cs, Co, and Cn) and surrogate pair codepoints are disallowed,
 and category Cc is restricted to TAB (9), CR (10), and LF (13). When encoding
 VName fields for transmission or storage, the encoding format will be UTF-8
 with no byte-order mark, using Normalization Form NFKC.

 ==== [[VNameOrder]] VName Ordering

 When it is necessary to order VNames, the standard order is defined by
 lexicographic comparison of the VName fields in this order:

     Corpus, Language, Path, Root, Signature

 Each field is ordered by lexicographic string comparison of its value.

 === [[TermTicket]] Ticket

 A ticket is defined as a canonical, invertible, textual (and, if practical,
 human-readable) string encoding of a <<TermVName,VName>> (or a projection of a
 VName). A ticket encoding is a rule for rendering a (partial) VName into a
 string such as a URI, JSON or similar.  We have the option to define as many
 such encodings as we may need, subject to the following restrictions:

 Canonicalization::
   If two VNames are equal under a given projection, then the tickets generated
   from those projections must also be equal.  This makes it possible to use a
   ticket as a map key, or hash it to get a sharding function.

 Invertibility::
   Given the ticket for any projection of a VName, it must be possible to
   recover the original VName; the encoding does not discard any information
   from the VName.

 Textual::
   Tickets are encoded so as to be easily manipulated by a human user in a user
   interface. In particular, a ticket may contain only printable non-whitespace
   characters, with the exception of the Unicode `SPACE` character (code 32);
   all other characters must be escaped in some way (_e.g._, base64, HTML
   entities, URL encoding).

 A ticket should be easy to copy and paste and send around in e-mail or as a URI
 or JSON string.  A ticket doesn't necessarily have to make sense to a human
 reader, though to the extent it is possible and practical, it should.

 Any encoding that satisfies the above requirements can be used as a ticket; the
 Kythe open-source tools will use the link:kythe-uri-spec.html[Kythe URI]
 encoding to create tickets.

 === [[TermFact]] Fact

 A fact is a pair (name, value) of strings, with the constraints that the fact
 name (*fact label*) must be non-empty and conform to the following EBNF grammar:

 [literal]
 name   =  "/" | 1*path
 path   =  "/" word
 word   =  1*{LETTER|DIGIT|PUNCT}
 LETTER =  // Unicode letter; see below.
 DIGIT  =  // Unicode digit; see below.
 PUNCT  =  [-.@#$%&_+:()]
 value  = 1*byte

 In other words, a fact name is a simple, path-structured label. This
 construction makes it easy to encode an extensible schema for fact names—the
 first path component can be used as a schema label, _e.g._, `/kythe` for facts
 defined by the Kythe schema. Other schemata or extensions are free to define
 different naming conventions within their own namespace. Fact names must be
 encoded as UTF-8 strings.

 In link:http://www.unicode.org/versions/Unicode6.3.0/[The Unicode Standard
 6.3], Section 4.5 "General Category" defines a set of character categories.  In
 the grammar above, `LETTER` corresponds to characters in categories Lu, Ll, Lt,
 Lm, or Lo, and `DIGIT` corresponds to those in category Nd.

 Values are stored as strings of uninterpreted bytes—it is up to the schema to
 define what values are legal for a given fact name, and how to encode
 them. Importantly, however, the graph store considers values atomic, and does
 not interpret any structure that may be stored inside them. Set- or
 sequence-valued data can be stored as multiple facts with a common prefix,
 _e.g._,

 [width="25%"]
 |=============================================
 |`/kythe/filter/kind#0`  |`DECLARES`
 |`/kythe/filter/kind#1`  |`HAS_TYPE`
 |`/kythe/filter/kind#2`  |`EATS_CABBAGE_WITH`
 |=============================================

 The graph store does not interpret fact names; the schema must define the
 specific rules that apply to their construction and relationships.

 === Entry

 An entry is a data structure that associates a single <<TermFact,fact>> with a
 graph object (a node or an edge). An entry is the primary unit of encoding in
 the graph store, and each has the following components:

 [width="50%"]
 |==============================================================
 |source [ticket] | kind | target [ticket] | fact label | value
 |==============================================================

 where

  * *Source.* The <<TermTicket,ticket>> of the source node (must not be empty).
  * *Kind.* An edge label (may be empty). The schema defines which labels are
    meaningful.
  * *Target.* The <<TermTicket,ticket>> of the target node (may be empty).
  * *Fact.* A fact label (must not be empty). The schema defines which fact labels
    are meaningful.
  * *Value.* The string value of the fact (may be empty).

 If _Kind_ and _Target_ are set, the entry denotes a fact about an edge in the
 graph; otherwise the entry denotes a fact about a node in the graph. To be
 valid, an entry must satisfy this constraint:

  * Either kind and target are both empty (a node entry), or both are non-empty
    (an edge entry).

 Implementations of specific services (_e.g.,_ cross-references) may define
 their own specific node and edge representations as needed.  The primitive
 graph store records only entries. A "node" or "edge" in the graph store is
 simply the collection of entries in the store that share a source, kind, and
 target.

 === [[TermOrder]] Ordering

 Given a particular ticket encoding, the standard entry order is defined by
 lexicographic comparison of the fields of the entry, with ties broken in
 left-to-right order: Source, Kind, Target, Fact, Value. An empty string (ø) is
 considered to precede any non-empty string. The VName valued fields Source and
 Target are compared according to <<VNameOrder,the VName ordering rule>>.

 For example, the following entries are in standard (nondecreasing) entry order;
 the underlined element in each row denotes the field that “causes” the row to
 be ordered after the previous row.

 [width="40%",cols="10%,10%,10%,60%,10%"]
 |=====================================================================================
 |A                |ø               |ø               | /                  | x
 |A                |ø               |ø               | [.underline]#/foo# | w
 |A                |[.underline]#m# |C               | /bar               | w
 |A                |m               |C               | [.underline]#/car# | w
 |A                |m               |C               | /car               | [.underline]#y#
 |A                |[.underline]#n# |B               | /                  | ø
 |A                |n               |[.underline]#C# | /bar               | ø
 |[.underline]#AB# |ø               |ø               | /                  | t
 |=====================================================================================

 Conceptually, each field is compared in isolation; in practice this can be
 implemented by a flat string comparison separating adjacent fields as if they
 by an out-of-band delimiter such as NUL.

 === [[TermDirection]] Edge Direction

 All edges in a Kythe graph are labelled and directional. The schema defines
 which edge labels are expected for various purposes, and more can be added.

 To support efficient queries, services based on Kythe data will often find it
 useful to denormalize the Kythe data, to make it easy to traverse the edges of
 the graph in either direction.  For example, if the schema defines a `calls`
 edge that connects a function call site with the function it invokes, a user
 may be interested in two obvious questions:

  * ``What function is called at this location?'' This is the question answered
    by the edge itself.

  * ``Where are the places where this function is called?''  This question is
    _not_ covered by the `calls` edge directly, but can be obtained by finding
    all the `calls` edges that end at the function of interest, and tracing them
    back to their origins.

 To answer cross-reference queries efficiently, one simple strategy is to
 pre-compute the ``mirror'' of each `calls` edge, that is to say, the edge with
 the endpoints reversed, and the relationship inverted.  Notionally, this is as
 if we had, for each edge of the form `A calls B`, a ``reverse'' edge `B
 called_by A`.

 When adding a new feature to the Kythe schema, a natural question to ask is
 which direction should be the canonical or `forward' edge, _i.e.,_ which edge
 should be emitted by the indexer, and which should be derived?

 In Kythe, we have adopted the convention that the ``forward'' edge, which is
 the one to be emitted by the indexer, should be the one that expresses some
 kind of dependency relationship between the source and the target.  In the
 example above, `A calls B` expresses a dependency between the call site and the
 callee function.  The callee, however, does not have any dependency on the call
 site; the call site could be deleted from the source without affecting the
 target function at all.

 In the rare cases where no obvious dependency relationship exists, either edge
 may be labelled as ``forward'', but we have adopted the convention that the
 forward direction is whichever is expected to have the lesser out-degree. As a
 rule of thumb, if the number of edges `X foo Y` is expected to be some constant
 multiple of the number of occurrences of X, `foo` is designated a forward
 relationship; otherwise `foo` is considered a reverse relationship.

 By convention, Kythe tools use a fixed lexical convention for generating names
 for ``reverse'' edges -- generally synthesized for serving purposes -- from the
 edges specified in the schema (refer to the
 link:https://kythe.io/schema/[Kythe graph schema]).

 === [[TermGraphStore]] Graph Store

 A graph store is any set of valid entries. A *forward* graph store is a graph
 store in which every non-empty edge label is a <<TermDirection,forward>> edge
 label as defined by the schema. Informally, forward edges are those that denote
 a dependency relationship of the source on the target (see above).

 In this model, combining two graph stores into one is accomplished by computing
 the set union of the two stores. How this is implemented depends on the
 concrete representation, but for the simple case of a flat ordered table of
 entries, it can be a simple merge-and-remove-duplicates.

 == [[TermServiceInterface]] Service Interface

 A concrete implementation of a graph store may use whatever physical
 representation it wishes. However, to permit tools to access different
 implementations of a graph store, each implementation _should_ also provide
 access to the store via an RPC interface. The operations supported by this
 interface are:

 === [[ServiceRead]] Reading

 A Read operation should be implemented with time complexity proportional to the
 size of the return set. The general form of a Read query is:

    Read(source, kind)

 This operation returns all entries with the given source <<TermTicket,ticket>>,
 subject to the following rules:

 [width="50%",options="header",cols="10%,90%"]
 |=================================================================
 |Kind   | Result
 |ø      | All entries with kind and target empty (node entries).
 |`*`    | All entries (node and edge, regardless of kind/target).
 |_kind_ | All edge entries with the given kind.
 |=================================================================

 *Rationale:* This formulation allows a client to fetch node metadata and
 implement depth-first traversals over selected parts of the graph. Because
 nodes in a forward graph store are expected to have low out-degree for any
 given edge kind, this allows transitive-closure computations and other simple
 forward-graph explorations to be performed fairly efficiently. Any
 implementation that can easily locate all the entries with a given source
 should be able to implement this query efficiently (even a flat file of entries
 in standard entry order).

 === [[ServiceWrite]] Writing

 Write operations update the contents of the store. The general form of a write
 query is:

    Write(source, updates…)

 This operation atomically inserts or updates a collection of entries into the
 store. Each update is a tuple of the form (kind, target, fact, value). For each
 such update, the entry (source, kind, target, fact, value) is written into the
 store, replacing any existing entry (source, kind, target, fact, value') that
 may exist. Note that this operation cannot delete any data from the store;
 entries are only ever inserted or updated. Apart from acting atomically, no
 other constraints are placed on the implementation.

 ****
 *Note:* Tools that generate Kythe entries should avoid producing entries with
 conflicting data.
 ****

 *Rationale:* Synthesis operations may modify existing data, or add new data,
 but data need not be deleted except to satisfy requirements external to the
 data model such as requests to obliterate sensitive data. Such cases must be
 handled by explicit modification of the underlying storage, and are not
 addressed by the tool interface.

 === [[ServiceScan]] Scanning

 A Scan is similar to a Read, but with no time complexity restrictions. The
 general form of a scan query is:

    Scan(target, kind, fact)

 This operation returns all entries with the specified target, kind, and fact
 label prefix. Each of the parameters may also be empty, in which case that
 parameter is always considered to match. Thus, the full range of possibilities
 is:

 [width="50%",options="header",cols="5%,5%,5%,85%"]
 |===========================================================================
 |target|kind|fact|result
 |ø     |ø   |ø   |All entries.
 |ø     |ø   |F   |All entries with fact label prefix F.
 |ø     |K   |ø   |All entries with kind K.
 |ø     |K   |F   |All entries with kind K and fact label prefix F.
 |T     |ø   |ø   |All entries with target T.
 |T     |ø   |F   |All entries with target T and fact label prefix F.
 |T     |K   |ø   |All entries with target T and kind K.
 |T     |K   |F   |All entries with target T, kind K, and fact label prefix F.
 |===========================================================================

 As the name implies, scans are expected and allowed to be time-consuming,
 full-table scan operations. An implementation is of course free to construct
 side-indices as needed to improve performance, but this isn't required.

 === [[ServiceShard]] Sharding

 ////
 TODO(fromberger): Include an illustration here.
 ////

 A sharding is a special case of a scan that is intended to support bulk
 manipulation of a large, complete graph store, _e.g.,_ via tools like Hadoop or
 MapReduce. The two general forms of sharding query are:

    Count(index, n)
    Shard(index, n)

 These operations nominally partition the graph store into _n > 0_ shards by
 fingerprint of `source` | `kind`, and return to the caller (the count of) all
 the entries in shard index, where _0 ≤ index < n_. Basically this operation
 allows a graph store to be adapted to a MapReduce or Hadoop Reader for sharded
 processing. A caller can invoke `Count(0, 1)` to obtain the total number of
 entries in the graph store.

 *Rationale:* Sharding by source and kind ensures a processing tool can assemble
 complete nodes and edges from the entries sent to a given shard. Since a
 forward graph is expected to have low average out degree, we have omitted more
 general features like configurable hash functions.

 == Implementation Strategies

 The Kythe graph storage format is intentionally designed to be simple and flat,
 allowing Kythe artifacts to be handled by a variety of concrete representations
 to meet different demands of scale and performance.  A few possibilities are
 described below, and more can easily be derived.

 === Single File

 The simplest reasonable graph store implementation is a single file of entries
 in standard order, such as a LevelDB file or a sequence of CSV or JSON records
 with a line index.

 Modifications in this representation are inefficient—they are best handled by
 keeping a log of small append-only tables, one for each file being analyzed,
 say, and periodically merging the results together.  Despite being unsuited for
 writing, However, a single-file format is a good and portable transport method
 for small graph stores, and has the advantage that it is easy to work with.

 This format is a reasonable choice for read-only artifacts emitted from a
 standalone command-line analyzer, where the end results are expected to be
 combined with other stores to form a larger index for serving.  The advantage
 of single-file formats is that they are easy to ship around to other tools and
 can be converted to more complex formats without too much overhead.

 === SQL

 A slightly more flexible graph store implementation can be built on top of a
 SQLite or MySQL database. A trivial implementation can be built using a single
 table of entries, which is more or less a transliteration of the single-file
 implementation into SQL. A more sophisticated approach (and more efficient for
 writing) is to split the entries into three tables -- this example uses the
 syntax supported by SQLite:

   CREATE TABLE Tickets (
     id integer primary key autoincrement,
     ticket text unique not null
   );

   CREATE TABLE Nodes (
     source integer,
     factLabel text not null,
     factValue text,
     foreign key (source) references Tickets(id)
   );

   CREATE TABLE Edges (
     source integer,
     kind text not null,
     target integer,
     factLabel text not null default "/",
     factValue text,
     foreign key (source) references Tickets(id),
     foreign key (target) references Tickets(id)
   );

 Implementing most of the service methods is fairly straightforward in this
 representation: Writes can be wrapped in transactions to ensure they are
 atomic, while reads and scans can be fairly easily turned into simple SQL
 queries. The map operations are harder in this format, however.

 == Graph Store Tools

 There are some existing tools for processing graph stores in the Kythe
 repository:

 write_entries::
   A tool that writes a stream of Kythe entries stored as protobuf messages to
   an arbitrary graph store service.
   [link:/repo/kythe/go/storage/tools/write_entries/write_entries.go[source]]

 read_entries::
   A tool that scans a Kythe graph store, printing each entry to standard output.
   [link:/repo/kythe/go/storage/tools/read_entries/read_entries.go[source]]

 triples::
   A tool to convert a stream of Kythe entries into
   link:http://en.wikipedia.org/wiki/N-Triples[triples].
   [link:/repo/kythe/go/storage/tools/triples/triples.go[source]]

 directory_indexer::
   A tool to generate Kythe entries representing a directory tree.
   [link:/repo/kythe/go/storage/tools/directory_indexer/directory_indexer.go[source]]

 leveldb::
   An implementation of a graph store using link:http://leveldb.org[LevelDB]
   (via link:http://github.com/jmhodges/levigo[levigo]).
   [link:/repo/kythe/go/storage/leveldb/leveldb.go[source]]