| // Copyright 2014 The Kythe Authors. All rights reserved. |
| // |
| // Licensed under the Apache License, Version 2.0 (the "License"); |
| // you may not use this file except in compliance with the License. |
| // You may obtain a copy of the License at |
| // |
| // http://www.apache.org/licenses/LICENSE-2.0 |
| // |
| // Unless required by applicable law or agreed to in writing, software |
| // distributed under the License is distributed on an "AS IS" BASIS, |
| // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| // See the License for the specific language governing permissions and |
| // limitations under the License. |
| |
| Kythe Storage Model |
| =================== |
| Michael J. Fromberger <fromberger@google.com> |
| v.0.1.1, 18-Feb-2016: Draft |
| :toc: |
| :priority: 750 |
| |
| == Summary |
| |
| This document describes a persistent storage representation for Kythe analysis |
| data. The main goal of this representation is to permit Kythe data to be stored |
| in a compact and portable way. This is a ``storage'' representation -- |
| intended for persistent storage of Kythe data -- in contrast to ``serving'' |
| representations intended for efficient implementation of UI-facing services. |
| As such, the performance of general-purpose graph queries is not a primary |
| consideration. We expect data stored in this format to be further processed |
| into more efficient formats for searching or serving, _e.g._, into denormalized |
| paged association tables, or graph-database triples. |
| |
| == Goals |
| |
| Simplicity:: |
| Each entry is a simple, tabular record that is easy to generate in an |
| analyzer and convert between different physical storage formats (_e.g._, |
| LevelDB files, CSV, SQLite tables, Google Cloud Datastore). |
| |
| Compactness:: |
| Storage tables are in 1NF (each entry represents a single atomic value; set- |
| or sequence-valued fields are stored as multiple entries), and each entry is |
| stored only once. Indexing for efficient queries is the responsibility of the |
| consumer. |
| |
| Neutrality:: |
| The representation does not depend on specific features of any particular |
| programming language or toolchain, allowing tools for this format to be |
| written in any language. |
| |
| Portability:: |
| Analysis artifacts stored in this format can be easily packaged with other |
| code artifacts such as object files, JARs, and symbol tables. |
| |
| Composability:: |
| Multiple store files can be easily combined to aggregate stores containing |
| results from separate analyses into a single store representing all of them |
| together. |
| |
| == Non-Goals |
| Query efficiency:: |
| This representation is for storage, not serving, so it is not optimized for |
| general-purpose searching or queries. Separate indexes should be extracted |
| for those purposes. |
| |
| Schematization and validation:: |
| The storage representation here does not include a schema for its contents, |
| apart from the structure of the data itself. A schema for what |
| <<TermFact,facts>> should be stored, their exact value formats, and other |
| validation constraints are outside the scope of this document. Keys and |
| values are stored as strings. |
| |
| == Overview |
| |
| A *graph store* is the interface between Kythe analyzers and the tools that |
| consume their outputs. Output from each (analyzer, target) combination is |
| converted into the storage format described below, and the results are |
| aggregated together into a combined store. Tools consume Kythe artifacts by |
| looking up or extracting values from the graph store to produce indexes, or by |
| processing the store in bulk. |
| |
| The desired high-level architecture is illustrated by this diagram: |
| |
| [graphviz] |
| ------------------------------------------------------------------------ |
| digraph Pipeline { |
| graph [rankdir=LR]; |
| node [shape=box, style=filled, fillcolor=white, fontsize=10, fontname="sans-serif"]; |
| |
| build [label="Build system", shape=ellipse, fillcolor=gray]; |
| |
| analysis [shape=record, label="Cross-reference indexer\l|Static analyzer\l|other analyses ...\l"]; |
| dedup [label="Remove duplicates", shape=hexagon]; |
| storage [label=<<B>Graph Store</B>>, style="rounded,filled", fillcolor=mediumseagreen]; |
| tools [label="Other tools", shape=ellipse, fillcolor=gray]; |
| synthesis [shape=record, |
| label="Cross-language linking\l|Cleanup, canonicalization\l|other syntheses ...\l", |
| ]; |
| processing [shape=record, |
| label="reverse edges\l|denormalize nodes\l|other indexes ...\l", |
| ]; |
| |
| build -> analysis -> dedup -> storage -> processing -> tools; |
| storage -> synthesis [dir=both]; |
| } |
| ------------------------------------------------------------------------ |
| |
| Analyzers, shown on the left, consume build information from the build system |
| and emit Kythe <<TermFact,facts>>. Tools, shown on the right, read facts from a |
| graph store and do useful work with them. Broadly speaking, there are two |
| varieties of processing that can be applied to the data in the store: |
| |
| * *Synthesis of new facts.* Some tools extend the graph store by locating |
| interesting patterns of data and adding new facts to represent |
| them. Examples: Cross-language linking, precomputation of transitive |
| closure relations, inference of cross edges for parametric types, |
| dependency injection. |
| |
| * *Extraction of indexes.* Some tools create new tables to efficiently provide |
| services. Examples: Denormalized adjacency-list tables for cross-reference |
| serving; graph-database triples for whole-graph analysis. |
| |
| This document also defines a RPC-style interface that a graph store |
| implementation may use to provide format-independent access to its contents. A |
| graph store implementation is not _required_ to support this interface. A key |
| point of the design, however, is to permit tools to operate on the contents of |
| a graph store without a need for ``thick'' client libraries. An implementation |
| that has an idiosyncratic internal representation can support this RPC |
| interface, allowing Kythe tools to access their contents without baking |
| knowledge of that representation into each consumer. |
| |
| == Terminology |
| |
| Throughout the rest of this document, the following terminology is used. |
| |
| === [[TermNode]] Nodes as Vectors |
| |
| As viewed from the storage layer, a `node' in a Kythe graph is essentially a |
| bag of string-valued key/value properties associated with a unique name. In |
| order for this representation to make sense, each node must have a unique name, |
| to distinguish it from all the other nodes in the graph. |
| |
| To solve this naming problem, Kythe adopts the view that a node is a |
| _d_-dimensional `vector', where each dimension represents some scalar |
| <<TermFact,fact>> about the node: The ``dimensions'' in this case are not |
| numbers, but arbitrary semantic categories, _e.g._, kind, identifier, snippet, |
| location, corpus, language. A particular node is defined by fixing its values |
| in each dimension (where ``empty'' is also an option). |
| |
| In practice it is not necessary to model nodes as explicit vectors, but the |
| ability to do so is very convenient for computation. In particular, searching, |
| clustering, indexing, and other related machine-learning tasks are simple to |
| express in this model: Node-vectors can be projected into weight vectors for |
| computing similarities or identifying clusters, or subjected to principal |
| components analysis to identify relevant features, and so on. |
| |
| === [[TermVName]] Vector-Name (*VName*) |
| |
| Taking the view that a node is essentially a vector of its properties leads to |
| the naming scheme Kythe uses for nodes in its graph: |
| |
| A node _N_ can be uniquely identified relative to a universe _U_ of nodes by |
| fixing any _v_-dimensional projection of the node's attributes that differs |
| from all _U_ \ \{N\} under the same projection. In other words, we can choose |
| a name for _N_ by picking a small basis of <<TermFact,facts>> about a node, and |
| use the node's projection into the basis as its ``name''. This works as long |
| as the facts we pick are sufficient to distinguish all the nodes in our set |
| _U_. |
| |
| We call a name constructed using this approach a ``Vector-Name'' or *VName* for |
| the node. A VName is the primary unit of naming in the Kythe graph store. |
| |
| One important property of a VName is that it is extensible: As a collection of |
| nodes grows, new nodes may arrive that differ from the existing nodes, but have |
| the same VName. To maintain uniqueness, it is only necessary to add one or more |
| additional dimensions to the VName projection to account for the new |
| data. Updating existing VNames to a new projection is a trivial mechanical |
| rewriting process, particularly when the new projection is an extension of the |
| old one. |
| |
| See also link:kythe-uri-spec.html[Kythe URI Specification], |
| which is essentially the same, except that a VName uses UTF-8 and |
| a URI uses `pct-encoded` values. |
| |
| The initial definition of a VName includes the following 5 fields: |
| |
| * *Signature.* An opaque signature generated by the analyzer. The format of |
| this string is opaque outside the analyzer itself, but informally should be |
| sufficient to distinguish nodes within a corpus of the language. (However, |
| VNames for built-ins should be unique across corpora of the same language). |
| For example: |
| `com.google.common.collect.Lists.newLinkedList<#1>()`. |
| |
| * *Corpus.* The corpus of source code this VName belongs to. Loosely, a |
| corpus is a collection of related files, such as the contents of a given |
| source repository. Corpora accessible via the Internet should generally |
| prefer labels shaped like URLs or other address-like strings. |
| + |
| + |
| Examples: "chromium", "aosp", "bitbucket.org/creachadair/stringset". |
| + |
| + |
| We reserve corpus names prefixed with `kythe` for the Kythe |
| open-source project itself. |
| + |
| + |
| _Note:_ It is possible, though not recommended, to use a local directory |
| path as a corpus label. For storage purposes, corpus labels are _not_ |
| treated like paths (in particular they are not "cleaned" or otherwise |
| lexically normalized as described under *Path* below). Moreover, a literal |
| path as a corpus label will generally not work well with corpora defined |
| elsewhere, so avoid this formulation unless you don't require your data to |
| interoperate with other corpora. |
| |
| * *Root.* A corpus-specific root label, typically a directory path or project |
| identifier, denoting a distinct subset of the corpus. This may also be used |
| to designate virtual collections like generated files. |
| + |
| + |
| An empty Root field should signify a concrete file in the corpus |
| relative to the corpus root. The interpretation for a VName with an |
| empty Root corresponds to a file under version control in (one of) |
| the repository(ies) being analyzed; a non-empty Root indicates a |
| generated file, for which the Root is typically (part of) a prefix |
| to the path of that file. |
| + |
| + |
| _Rationale:_ Usually a corpus will comprise a single rooted tree of files, |
| such as a Git repository -- in which case the Root field can be left empty. |
| In some cases, though, a corpus may have more than one tree -- for example, |
| if the build tool stores generated code in a separate directory structure |
| during the build process. In that case, the Root field can be used to |
| distinguish generated paths from checked-in source. |
| + |
| + |
| The interpretation of the Root field is always specific to the corpus. A |
| root _may_ be shaped like a path (say, if it names a directory), but it is |
| not required to; it can be an opaque label like `generated` or `branch_name` |
| if that makes sense for the corpus in question. If the Root is intended to |
| denote a directory path, it should be _cleaned_ as described under *Path* |
| and should not end with a "/". |
| |
| * *Path.* A path-structured label describing the ``location'' of the named |
| object relative to the corpus and the root. For code, this will typically be |
| the relative path of the file containing the code under analysis, such as |
| `kythe/cxx/tools/kindex_tool_main.cc` in the `kythe` corpus. |
| + |
| + |
| Paths should be normalized to be relative to a root directory of their |
| corpus, and should also be normalized |
| link:https://www.usenix.org/legacy/event/usenix2000/general/full_papers/pikelex/pikelex.pdf[cleaned] |
| to a canonical form, |
| so as to be free of relative markers such as "." and "..". If a VName's |
| path can't be represented without "escaping" from its designated root |
| directory, it's usually a good sign that a separate Root label should be |
| assigned. |
| + |
| + |
| When a path is specified in a query, the query processor *may* |
| normalize it before executing the query (but this is up to the |
| tool); but all paths that are stored or output must be normalized. |
| + |
| + |
| The components of a path are separated by "/"s. The path |
| must not start with a "/" -- that is, it is relative to the *Root*. |
| If the root is empty, it is treated as meaning the top of the |
| corpus (for example, the same as "/" if the corpus maps directly to the |
| file system). |
| |
| * *Language.* The language this name belongs to. The schema defines specific |
| labels for each supported language, so we don't wind up with a confusion of |
| names like "cxx", "cpp", "C\+\+", etc. As a rule of thumb, we will use the |
| common name of the language, in lowercase ("c++", "python", "elisp", |
| "objectivec"). |
| + |
| + |
| The *language* is empty (that is, "") for some nodes, such as |
| link:schema/schema.html#file[file]. |
| |
| Other fields can be added as necessary—for example, if a Branch or Client label |
| becomes necessary. As a rule, we try to keep the number of essential VName |
| dimensions as small as possible. |
| |
| ==== [[VNameComposition]] VName Composition |
| |
| The fields of a VName shall be Unicode strings, save that control characters |
| (categories Cf, Cs, Co, and Cn) and surrogate pair codepoints are disallowed, |
| and category Cc is restricted to TAB (9), CR (10), and LF (13). When encoding |
| VName fields for transmission or storage, the encoding format will be UTF-8 |
| with no byte-order mark, using Normalization Form NFKC. |
| |
| ==== [[VNameOrder]] VName Ordering |
| |
| When it is necessary to order VNames, the standard order is defined by |
| lexicographic comparison of the VName fields in this order: |
| |
| Corpus, Language, Path, Root, Signature |
| |
| Each field is ordered by lexicographic string comparison of its value. |
| |
| === [[TermTicket]] Ticket |
| |
| A ticket is defined as a canonical, invertible, textual (and, if practical, |
| human-readable) string encoding of a <<TermVName,VName>> (or a projection of a |
| VName). A ticket encoding is a rule for rendering a (partial) VName into a |
| string such as a URI, JSON or similar. We have the option to define as many |
| such encodings as we may need, subject to the following restrictions: |
| |
| Canonicalization:: |
| If two VNames are equal under a given projection, then the tickets generated |
| from those projections must also be equal. This makes it possible to use a |
| ticket as a map key, or hash it to get a sharding function. |
| |
| Invertibility:: |
| Given the ticket for any projection of a VName, it must be possible to |
| recover the original VName; the encoding does not discard any information |
| from the VName. |
| |
| Textual:: |
| Tickets are encoded so as to be easily manipulated by a human user in a user |
| interface. In particular, a ticket may contain only printable non-whitespace |
| characters, with the exception of the Unicode `SPACE` character (code 32); |
| all other characters must be escaped in some way (_e.g._, base64, HTML |
| entities, URL encoding). |
| |
| A ticket should be easy to copy and paste and send around in e-mail or as a URI |
| or JSON string. A ticket doesn't necessarily have to make sense to a human |
| reader, though to the extent it is possible and practical, it should. |
| |
| Any encoding that satisfies the above requirements can be used as a ticket; the |
| Kythe open-source tools will use the link:kythe-uri-spec.html[Kythe URI] |
| encoding to create tickets. |
| |
| === [[TermFact]] Fact |
| |
| A fact is a pair (name, value) of strings, with the constraints that the fact |
| name (*fact label*) must be non-empty and conform to the following EBNF grammar: |
| |
| [literal] |
| name = "/" | 1*path |
| path = "/" word |
| word = 1*{LETTER|DIGIT|PUNCT} |
| LETTER = // Unicode letter; see below. |
| DIGIT = // Unicode digit; see below. |
| PUNCT = [-.@#$%&_+:()] |
| value = 1*byte |
| |
| In other words, a fact name is a simple, path-structured label. This |
| construction makes it easy to encode an extensible schema for fact names—the |
| first path component can be used as a schema label, _e.g._, `/kythe` for facts |
| defined by the Kythe schema. Other schemata or extensions are free to define |
| different naming conventions within their own namespace. Fact names must be |
| encoded as UTF-8 strings. |
| |
| In link:http://www.unicode.org/versions/Unicode6.3.0/[The Unicode Standard |
| 6.3], Section 4.5 "General Category" defines a set of character categories. In |
| the grammar above, `LETTER` corresponds to characters in categories Lu, Ll, Lt, |
| Lm, or Lo, and `DIGIT` corresponds to those in category Nd. |
| |
| Values are stored as strings of uninterpreted bytes—it is up to the schema to |
| define what values are legal for a given fact name, and how to encode |
| them. Importantly, however, the graph store considers values atomic, and does |
| not interpret any structure that may be stored inside them. Set- or |
| sequence-valued data can be stored as multiple facts with a common prefix, |
| _e.g._, |
| |
| [width="25%"] |
| |============================================= |
| |`/kythe/filter/kind#0` |`DECLARES` |
| |`/kythe/filter/kind#1` |`HAS_TYPE` |
| |`/kythe/filter/kind#2` |`EATS_CABBAGE_WITH` |
| |============================================= |
| |
| The graph store does not interpret fact names; the schema must define the |
| specific rules that apply to their construction and relationships. |
| |
| === Entry |
| |
| An entry is a data structure that associates a single <<TermFact,fact>> with a |
| graph object (a node or an edge). An entry is the primary unit of encoding in |
| the graph store, and each has the following components: |
| |
| [width="50%"] |
| |============================================================== |
| |source [ticket] | kind | target [ticket] | fact label | value |
| |============================================================== |
| |
| where |
| |
| * *Source.* The <<TermTicket,ticket>> of the source node (must not be empty). |
| * *Kind.* An edge label (may be empty). The schema defines which labels are |
| meaningful. |
| * *Target.* The <<TermTicket,ticket>> of the target node (may be empty). |
| * *Fact.* A fact label (must not be empty). The schema defines which fact labels |
| are meaningful. |
| * *Value.* The string value of the fact (may be empty). |
| |
| If _Kind_ and _Target_ are set, the entry denotes a fact about an edge in the |
| graph; otherwise the entry denotes a fact about a node in the graph. To be |
| valid, an entry must satisfy this constraint: |
| |
| * Either kind and target are both empty (a node entry), or both are non-empty |
| (an edge entry). |
| |
| Implementations of specific services (_e.g.,_ cross-references) may define |
| their own specific node and edge representations as needed. The primitive |
| graph store records only entries. A "node" or "edge" in the graph store is |
| simply the collection of entries in the store that share a source, kind, and |
| target. |
| |
| === [[TermOrder]] Ordering |
| |
| Given a particular ticket encoding, the standard entry order is defined by |
| lexicographic comparison of the fields of the entry, with ties broken in |
| left-to-right order: Source, Kind, Target, Fact, Value. An empty string (ø) is |
| considered to precede any non-empty string. The VName valued fields Source and |
| Target are compared according to <<VNameOrder,the VName ordering rule>>. |
| |
| For example, the following entries are in standard (nondecreasing) entry order; |
| the underlined element in each row denotes the field that “causes” the row to |
| be ordered after the previous row. |
| |
| [width="40%",cols="10%,10%,10%,60%,10%"] |
| |===================================================================================== |
| |A |ø |ø | / | x |
| |A |ø |ø | [.underline]#/foo# | w |
| |A |[.underline]#m# |C | /bar | w |
| |A |m |C | [.underline]#/car# | w |
| |A |m |C | /car | [.underline]#y# |
| |A |[.underline]#n# |B | / | ø |
| |A |n |[.underline]#C# | /bar | ø |
| |[.underline]#AB# |ø |ø | / | t |
| |===================================================================================== |
| |
| Conceptually, each field is compared in isolation; in practice this can be |
| implemented by a flat string comparison separating adjacent fields as if they |
| by an out-of-band delimiter such as NUL. |
| |
| === [[TermDirection]] Edge Direction |
| |
| All edges in a Kythe graph are labelled and directional. The schema defines |
| which edge labels are expected for various purposes, and more can be added. |
| |
| To support efficient queries, services based on Kythe data will often find it |
| useful to denormalize the Kythe data, to make it easy to traverse the edges of |
| the graph in either direction. For example, if the schema defines a `calls` |
| edge that connects a function call site with the function it invokes, a user |
| may be interested in two obvious questions: |
| |
| * ``What function is called at this location?'' This is the question answered |
| by the edge itself. |
| |
| * ``Where are the places where this function is called?'' This question is |
| _not_ covered by the `calls` edge directly, but can be obtained by finding |
| all the `calls` edges that end at the function of interest, and tracing them |
| back to their origins. |
| |
| To answer cross-reference queries efficiently, one simple strategy is to |
| pre-compute the ``mirror'' of each `calls` edge, that is to say, the edge with |
| the endpoints reversed, and the relationship inverted. Notionally, this is as |
| if we had, for each edge of the form `A calls B`, a ``reverse'' edge `B |
| called_by A`. |
| |
| When adding a new feature to the Kythe schema, a natural question to ask is |
| which direction should be the canonical or `forward' edge, _i.e.,_ which edge |
| should be emitted by the indexer, and which should be derived? |
| |
| In Kythe, we have adopted the convention that the ``forward'' edge, which is |
| the one to be emitted by the indexer, should be the one that expresses some |
| kind of dependency relationship between the source and the target. In the |
| example above, `A calls B` expresses a dependency between the call site and the |
| callee function. The callee, however, does not have any dependency on the call |
| site; the call site could be deleted from the source without affecting the |
| target function at all. |
| |
| In the rare cases where no obvious dependency relationship exists, either edge |
| may be labelled as ``forward'', but we have adopted the convention that the |
| forward direction is whichever is expected to have the lesser out-degree. As a |
| rule of thumb, if the number of edges `X foo Y` is expected to be some constant |
| multiple of the number of occurrences of X, `foo` is designated a forward |
| relationship; otherwise `foo` is considered a reverse relationship. |
| |
| By convention, Kythe tools use a fixed lexical convention for generating names |
| for ``reverse'' edges -- generally synthesized for serving purposes -- from the |
| edges specified in the schema (refer to the |
| link:https://kythe.io/schema/[Kythe graph schema]). |
| |
| === [[TermGraphStore]] Graph Store |
| |
| A graph store is any set of valid entries. A *forward* graph store is a graph |
| store in which every non-empty edge label is a <<TermDirection,forward>> edge |
| label as defined by the schema. Informally, forward edges are those that denote |
| a dependency relationship of the source on the target (see above). |
| |
| In this model, combining two graph stores into one is accomplished by computing |
| the set union of the two stores. How this is implemented depends on the |
| concrete representation, but for the simple case of a flat ordered table of |
| entries, it can be a simple merge-and-remove-duplicates. |
| |
| == [[TermServiceInterface]] Service Interface |
| |
| A concrete implementation of a graph store may use whatever physical |
| representation it wishes. However, to permit tools to access different |
| implementations of a graph store, each implementation _should_ also provide |
| access to the store via an RPC interface. The operations supported by this |
| interface are: |
| |
| === [[ServiceRead]] Reading |
| |
| A Read operation should be implemented with time complexity proportional to the |
| size of the return set. The general form of a Read query is: |
| |
| Read(source, kind) |
| |
| This operation returns all entries with the given source <<TermTicket,ticket>>, |
| subject to the following rules: |
| |
| [width="50%",options="header",cols="10%,90%"] |
| |================================================================= |
| |Kind | Result |
| |ø | All entries with kind and target empty (node entries). |
| |`*` | All entries (node and edge, regardless of kind/target). |
| |_kind_ | All edge entries with the given kind. |
| |================================================================= |
| |
| *Rationale:* This formulation allows a client to fetch node metadata and |
| implement depth-first traversals over selected parts of the graph. Because |
| nodes in a forward graph store are expected to have low out-degree for any |
| given edge kind, this allows transitive-closure computations and other simple |
| forward-graph explorations to be performed fairly efficiently. Any |
| implementation that can easily locate all the entries with a given source |
| should be able to implement this query efficiently (even a flat file of entries |
| in standard entry order). |
| |
| === [[ServiceWrite]] Writing |
| |
| Write operations update the contents of the store. The general form of a write |
| query is: |
| |
| Write(source, updates…) |
| |
| This operation atomically inserts or updates a collection of entries into the |
| store. Each update is a tuple of the form (kind, target, fact, value). For each |
| such update, the entry (source, kind, target, fact, value) is written into the |
| store, replacing any existing entry (source, kind, target, fact, value') that |
| may exist. Note that this operation cannot delete any data from the store; |
| entries are only ever inserted or updated. Apart from acting atomically, no |
| other constraints are placed on the implementation. |
| |
| **** |
| *Note:* Tools that generate Kythe entries should avoid producing entries with |
| conflicting data. |
| **** |
| |
| *Rationale:* Synthesis operations may modify existing data, or add new data, |
| but data need not be deleted except to satisfy requirements external to the |
| data model such as requests to obliterate sensitive data. Such cases must be |
| handled by explicit modification of the underlying storage, and are not |
| addressed by the tool interface. |
| |
| === [[ServiceScan]] Scanning |
| |
| A Scan is similar to a Read, but with no time complexity restrictions. The |
| general form of a scan query is: |
| |
| Scan(target, kind, fact) |
| |
| This operation returns all entries with the specified target, kind, and fact |
| label prefix. Each of the parameters may also be empty, in which case that |
| parameter is always considered to match. Thus, the full range of possibilities |
| is: |
| |
| [width="50%",options="header",cols="5%,5%,5%,85%"] |
| |=========================================================================== |
| |target|kind|fact|result |
| |ø |ø |ø |All entries. |
| |ø |ø |F |All entries with fact label prefix F. |
| |ø |K |ø |All entries with kind K. |
| |ø |K |F |All entries with kind K and fact label prefix F. |
| |T |ø |ø |All entries with target T. |
| |T |ø |F |All entries with target T and fact label prefix F. |
| |T |K |ø |All entries with target T and kind K. |
| |T |K |F |All entries with target T, kind K, and fact label prefix F. |
| |=========================================================================== |
| |
| As the name implies, scans are expected and allowed to be time-consuming, |
| full-table scan operations. An implementation is of course free to construct |
| side-indices as needed to improve performance, but this isn't required. |
| |
| === [[ServiceShard]] Sharding |
| |
| //// |
| TODO(fromberger): Include an illustration here. |
| //// |
| |
| A sharding is a special case of a scan that is intended to support bulk |
| manipulation of a large, complete graph store, _e.g.,_ via tools like Hadoop or |
| MapReduce. The two general forms of sharding query are: |
| |
| Count(index, n) |
| Shard(index, n) |
| |
| These operations nominally partition the graph store into _n > 0_ shards by |
| fingerprint of `source` | `kind`, and return to the caller (the count of) all |
| the entries in shard index, where _0 ≤ index < n_. Basically this operation |
| allows a graph store to be adapted to a MapReduce or Hadoop Reader for sharded |
| processing. A caller can invoke `Count(0, 1)` to obtain the total number of |
| entries in the graph store. |
| |
| *Rationale:* Sharding by source and kind ensures a processing tool can assemble |
| complete nodes and edges from the entries sent to a given shard. Since a |
| forward graph is expected to have low average out degree, we have omitted more |
| general features like configurable hash functions. |
| |
| == Implementation Strategies |
| |
| The Kythe graph storage format is intentionally designed to be simple and flat, |
| allowing Kythe artifacts to be handled by a variety of concrete representations |
| to meet different demands of scale and performance. A few possibilities are |
| described below, and more can easily be derived. |
| |
| === Single File |
| |
| The simplest reasonable graph store implementation is a single file of entries |
| in standard order, such as a LevelDB file or a sequence of CSV or JSON records |
| with a line index. |
| |
| Modifications in this representation are inefficient—they are best handled by |
| keeping a log of small append-only tables, one for each file being analyzed, |
| say, and periodically merging the results together. Despite being unsuited for |
| writing, However, a single-file format is a good and portable transport method |
| for small graph stores, and has the advantage that it is easy to work with. |
| |
| This format is a reasonable choice for read-only artifacts emitted from a |
| standalone command-line analyzer, where the end results are expected to be |
| combined with other stores to form a larger index for serving. The advantage |
| of single-file formats is that they are easy to ship around to other tools and |
| can be converted to more complex formats without too much overhead. |
| |
| === SQL |
| |
| A slightly more flexible graph store implementation can be built on top of a |
| SQLite or MySQL database. A trivial implementation can be built using a single |
| table of entries, which is more or less a transliteration of the single-file |
| implementation into SQL. A more sophisticated approach (and more efficient for |
| writing) is to split the entries into three tables -- this example uses the |
| syntax supported by SQLite: |
| |
| CREATE TABLE Tickets ( |
| id integer primary key autoincrement, |
| ticket text unique not null |
| ); |
| |
| CREATE TABLE Nodes ( |
| source integer, |
| factLabel text not null, |
| factValue text, |
| foreign key (source) references Tickets(id) |
| ); |
| |
| CREATE TABLE Edges ( |
| source integer, |
| kind text not null, |
| target integer, |
| factLabel text not null default "/", |
| factValue text, |
| foreign key (source) references Tickets(id), |
| foreign key (target) references Tickets(id) |
| ); |
| |
| Implementing most of the service methods is fairly straightforward in this |
| representation: Writes can be wrapped in transactions to ensure they are |
| atomic, while reads and scans can be fairly easily turned into simple SQL |
| queries. The map operations are harder in this format, however. |
| |
| == Graph Store Tools |
| |
| There are some existing tools for processing graph stores in the Kythe |
| repository: |
| |
| write_entries:: |
| A tool that writes a stream of Kythe entries stored as protobuf messages to |
| an arbitrary graph store service. |
| [link:/repo/kythe/go/storage/tools/write_entries/write_entries.go[source]] |
| |
| read_entries:: |
| A tool that scans a Kythe graph store, printing each entry to standard output. |
| [link:/repo/kythe/go/storage/tools/read_entries/read_entries.go[source]] |
| |
| triples:: |
| A tool to convert a stream of Kythe entries into |
| link:http://en.wikipedia.org/wiki/N-Triples[triples]. |
| [link:/repo/kythe/go/storage/tools/triples/triples.go[source]] |
| |
| directory_indexer:: |
| A tool to generate Kythe entries representing a directory tree. |
| [link:/repo/kythe/go/storage/tools/directory_indexer/directory_indexer.go[source]] |
| |
| leveldb:: |
| An implementation of a graph store using link:http://leveldb.org[LevelDB] |
| (via link:http://github.com/jmhodges/levigo[levigo]). |
| [link:/repo/kythe/go/storage/leveldb/leveldb.go[source]] |