blob: 72e64a420b44f9cb7c9325c7e7cd1f461fe7a898 [file] [log] [blame]
// Copyright 2014 The Kythe Authors. All rights reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
An Overview of Kythe
====================
Michael J. Fromberger <fromberger@google.com>
v0.1.1, 28-Oct-2014: Draft
:toc:
:priority: 1000
== Introduction to Kythe
The Kythe project was founded to provide and support tools and standards that
encourage interoperability among programs that manipulate source code. At a
high level, the main goal of Kythe is to provide a standard, language-agnostic
interchange mechanism, allowing tools that operate on source code -- including
build systems, compilers, interpreters, static analyses, editors, code-review
applications, and more -- to share information with each other smoothly.
The remainder of this document gives a basic introduction to the ideas and
rationale behind Kythe, and provides links to other more specific documentation
about the project.
== Background
As the size and scope of a software project grows, developers rely more and
more on tools for routine tasks such as building, testing, deploying,
debugging, refactoring, and analyzing their source code. Even for a small
development team such as a startup, the toolchain used to write, test, and
deploy code can be fairly complex -- subsuming multiple languages, a variety of
source control systems, a variety of build & deployment tools, test frameworks,
and of course a great patchwork of scripts to tie it all together.
When a codebase is small and dependencies are few, it's fairly easy to do these
tasks manually: A small codebase of a few thousand files can be imported into
an IDE and manipulated directly. As the codebase grows, however, it assumes
more and more dependencies on external (``third party'') libraries and tools.
Moreover, as a developer team grows, the complexity of performing those tasks
increases to the point where it is difficult -- or even infeasible -- to build,
debug, and test the project efficiently on a single workstation.
Kythe grew out of our experience creating a large-scale semantic index of
cross-references for the enormous, multi-lingual internal codebase at Google.
We found that engineers often lose a lot of time adapting a new tool to their
project, and in the process wind up re-inventing a solution to a problem that
had already been solved by some other program (_e.g.,_ the compiler or an
analysis tool embedded in the IDE). The main reason this happens is that the
existing solution usually doesn't play well with the other tools the developers
are using. Some teams `work around' this problem by forcing everyone to use
the same tools; but in our experience, that approach scales poorly: Integrated
environments work well for the languages and tools that are integrated, but the
cost of adding new pieces is high. Developer tool-preferences are highly
diverse and idiosyncratic, and developers' productivity declines sharply when
they are forced to use tools they dislike.
The main premise of Kythe, therefore, is that programming language tools ought
to be able to talk to each other easily: Not just the tools for a given
language, but across all the languages used in your project -- and not just for
a single chosen development environment, but (potentially) for any workable
combination of tools your developers may use. Considering editors, compilers,
build systems, analysis tools, deployment, testing, continuous integration --
there are a lot of options for each of these and more; but at present
relatively few combinations actually work well together in practice.
== Goals of Kythe
The best way to view Kythe is as a ``hub'' for connecting tools for various
languages, clients and build systems. By defining *language-agnostic
protocols and data formats* for representing, accessing and querying
source code information as data, Kythe allows *language analysis and
indexing to be run as services*. This, in turn, enables lightweight
(``thin'') composition of analysis tools with client tools such as
editors, IDEs, and code browsers.
A hub-and-spoke model reduces the overall work to integrate _L_ languages, _C_
clients, and _B_ build systems from a worst-case of O(L×C×B) -- combinatorial
in the size of the ecosystem -- to O(L+C+B): Implementing Kythe compatibility
for a given compiler, editor, or build system is, roughly, a constant up-front
cost for each component, after which that component can interoperate with all
the existing pieces directly.
To make this model work, Kythe provides a language-agnostic graph structure to
capture build-system and compiler metadata, as well as semantic information
about source code such as cross-references (_e.g.,_ definitions and their
usages, type information, and cross-language associations). By design, the
Kythe graph schema is liberal and extensible -- we've defined a number of
useful subgraphs, but new node and edge kinds are structured so that the graph
can easily be extended without recourse to a central authority.
One of the basic design principles of Kythe is that interoperability should not
be `all-or-nothing': Tools should adjust gracefully to missing or incomplete
data. For many purposes, we've found that some information is almost always
better than none. At the same time, it is better to emit _incomplete_ data
than to emit _incorrect_ data. In practice, the important point is that tools
should not ``give up'' in the presence of incomplete data, as partial results
are often still useful.
== Non-goals of Kythe
Although Kythe provides interoperability for many purposes, it does not cover
every possible situation. By design, there are some specific problems we are
explicitly _not_ attempting to solve with this project:
* *Writing a compiler or optimizer.* Kythe's graph is meant to capture
high-level, cross-cutting information that has a similar character across a
variety of languages.
Low-level details like code generation and optimization are, by nature,
language-specific. While you could in principle model a compiler's internal
structures in Kythe's graph, that is not a primary goal.
* *Replacing existing IRs.* Some tools (_e.g.,_ static analyzers) already have
expressive purpose-built internal representations for code. Kythe is not
meant to be a universal replacement for such IRs -- instead, our goal is to
provide a way for such tools to capture ``interesting subsets'' of an
analysis for sharing with other tools.
*In short:* We are not interested in an http://en.wikipedia.org/wiki/UNCOL[UNCOL].
Our goal is to provide a language for sharing data between tools, and while we
find that this works well for a large class of interesting problems, there will
always be situations that are highly specific to a particular language or data
model. For such cases, it's entirely appropriate to use a representation that
is tuned to that purpose.
== What Kythe Provides
The core of the Kythe project centers around three themes, which are embodied
in our open-source tools and supported by the Kythe team at Google together
with any interested contributors:
* *Language-agnostic graph storage format*. Kythe defines a simple, flexible,
and portable graph representation that is easy to emit from an instrumented
compiler, and for clients to consume.
* *Graph schema*. Kythe provides a simple, extensible
https://kythe.io/schema/[graph schema] for a variety of
interesting semantic cross-reference data in various languages, including
C++, Java, and (soon) Go. We also provide some simple, open-source tools
that make it easy to add new elements to the schema, and test whether an
analyzer that produces those elements has met its contract.
* *Analyzers, tools and examples*. The Kythe project provides several
open-source tools for generating and manipulating Kythe data, including
indexers for C++, Java, and (soon) Go; a self-contained server that can use
Kythe data to answer cross-reference queries; and some UI example code that
shows how some of these pieces can be glued together.
== What Kythe Requires
Essentially all that is needed to participate in the Kythe ecosystem is for a
tool to consume and/or emit data in the Kythe format, and -- where appropriate
-- to follow the Kythe schema. You'll need:
*To plug in a language*::
A compiler that can be instrumented to produce an indexer that emits Kythe
data about source code in that language.
*To plug in a build system*::
A tool that can ``extract'' compilation information from the build process,
allowing a language-specific analyzer to be run on the code and its
dependencies.
*To plug in a UI tool*::
Any tool that can consume Kythe graph artifacts can use Kythe data to answer
questions about code. The only specific knowledge that needs to be baked
into the tool is the naming scheme.
*To build a service or other analysis on Kythe data*::
The Kythe data format is simple, flat, and easily portable. A tool or
service can quickly convert Kythe graph data into tabular or other structured
formats for quick serving, graph exploration, visualization, etc.
Other documents and examples cover the details of how these pieces are
implemented in practice, but the unifying principles are the common data
representation and schema.
== Related Documentatation
* link:kythe-compatible-compilers.html[Kythe-Compatible Tools]
* link:schema/[Kythe Schema Reference]