blob: 444886cf0b795d5af4e368926dfb26547f6af7da [file] [log] [blame]
// Copyright 2014 The Kythe Authors. All rights reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
Making Tools Kythe-Compatible
=============================
Michael J. Fromberger <fromberger@google.com>, Steve Yegge <stevey@google.com>
v0.1.1, 28-Oct-2014: Draft
:toc:
:priority: 998
This document outlines the work needed to make a production compiler,
interpreter, or analysis tool able to work smoothly within the Kythe ecosystem.
We won't discuss integrating a build system here -- although the details of
doing so are similar in broad outlines. The specifics differ enough, though,
that build systems are covered in a separate document.
Specific details will vary for a given tool, but there are a number of general
features that most or all tools will need to support in order to communicate
effectively with other tools.
For simplicity, we will use the word `compiler' throughout, though the same
approach applies for instrumenting other analysis tools.
See also: link:kythe-overview.html[An Overview of Kythe]
== Quick Overview
First, provide a way to link against your compiler as a library and invoke it
on some source code via a direct function call, which we'll call `analyze`.
The `analyze` function should run the compiler in a special ``analysis'' mode.
In this mode, the compiler should behave as follows:
* *don't* generate any executable code
* *do* recover from errors and process the whole input (to the extent possible)
* *don't* discard any source information you normally throw away (even comments!)
* *do* generate a rich AST, symbol table and type graph
Using this mode, Kythe tools can invoke the compiler as a reusable component
for other interesting tasks, such as locating semantic cross-references,
extracting structured documentation, linting, syntax highlighting, code
folding, and so forth.
== Adding Support for a New Language
Adding support for a new language to Kythe consists in instrumenting a compiler
for that language to emit information about the fully lexed, parsed, and
type-resolved AST in a Kythe-compliant intermediate representation. In some
cases (_e.g.,_ Haskell), a library may already exist to provide these data --
in which case, the task is much simpler. Often, however, existing tools do not
provide what we need.
The information needed to supply semantic cross-references, extract structured
documentation, perform syntax highlighting and code-folding, and other tasks on
a source file in language X is a subset of what the front-end of a compiler for
that language generates. Lexing and parsing the source is usually not that
complex (though there are notable exceptions like C++), but for more
interesting tasks you'll also need type and dependency information.
In brief, the steps to add support for a new language are:
1. *Identify a suitable compiler* or interpreter for the language.
2. *Instrument the compiler* to emit Kythe-compliant graph artifacts.
3. [optional, as needed] *Update the Kythe schema* for any new concepts that
need to be modeled to support your language.
Step (1) should be straightforward: Ideally, you should use whatever compiler
is used to produce your production binaries for the language -- that way you
can be sure your analysis results will agree with your production code. In
some cases, this may not be feasible, however -- many compilers are not
designed to be invoked as libraries. In that case you may wind up having to
compromise on using some other implementation. The important points are:
1. *Do not rewrite the compiler.* By design, the Kythe project does not write
or maintain separate compilers for the languages it supports. For each
language, choose one canonical compiler (preferably, the one that you would
use to process the language in a production environment), and use it to
build your indexer. If such a compiler does not yet exist, then finding or
creating an alternative should be the first priority.
2. *Push compiler changes upstream.* When you do have to work around
limitations in your chosen compiler, try to find solutions that can be
folded back into the compiler itself, and plan to offer your changes to the
maintainer of the compiler for a future revision. Standardizing the
features needed to produce a valuable index is an important high-level goal
of the Kythe project; owning parts of the toolchain is decidedly not.
3. *Avoid forking the compiler.* Occasionally the only way to get what you
need will be to fork an existing compiler. This should be seen as a measure
of last resort, to be avoided if at all possible. If there truly is no
other solution, try to fork only the pieces you absolutely have to change,
and maintain compatibility with as much of the remaining code as
possible. Moreover, as described in point (2), try to make your changes in
a way that will permit re-integration with the upstream project as soon as
possible.
Step (2) is where the interesting work happens. Ideally, this work would be
done by the author or maintainers of the compiler, but in practice the initial
work is sometimes performed by an outside contributor (at Google, many of our
existing language analyzers were initially written by a Kythe team member, for
example). However, we want to encourage the compiler owners to take control of
this work at some point. We have not yet met a compiler team who are opposed
to having their compiler emit high-quality metadata. For example, they may
improve support by storing documentation comments within the AST so Kythe can
add the text to its index.
Step (3) is an optional step that can be undertaken by whomever is implementing
Step (2). Often, no schema changes will be needed at all, particularly for
analyses that re-use existing schema components like cross-references and
documentation comments, for which we already have a pretty solid model. If
your language requires some new kind of data to work smoothly, though, this is
where it should be documented.
Once all three of these steps are complete, your new language is ready to plug
into the Kythe ecosystem.
== Interfacing with the Compiler
Kythe provides tools for invoking a language analyzer, which it does using
information captured from the build system. The details of build system
integration are outside the scope of this document, though in outline it is
similar to compiler integration. The important point for writing a language
analyzer is to record how to invoke the compiler front-end, given the settings
from the build system, including
* The command-line for the compiler invocation
* Environment variables
* All the input files required to process the compilation
There are several ways to hook into the compiler, and we discuss a few of the
more common models here.
=== Direct Linking
By far the most common (and flexible) approach is for the language analyzer to
statically link against the compiler as a library. The analyzer and compiler
run in the same process, and the indexer queries the compiler's data structures
via direct function or method calls.
This approach has pros and cons. It is usually the easiest path for the
compiler team to take, since it just requires exposing some previously private
functions and data structures. It also yields the richest data for the
indexer, since it's all right there in memory.
One downside is that it tends to force the analyzer to be written in the same
language as the compiler, which can sometimes limit code sharing across Kythe
indexers. Another is that it closely couples the indexer's implementation to
the compiler internals. The coupling can be ameliorated by having the compiler
expose "internal public" APIs, although the extra layer can make it harder to
get at the data the indexer actually needs.
In practice, providing a typed AST visitor and/or access to a symbol table are
usually sufficient for loose coupling and rich data availability.
=== Pre-Composed Index
Some compilers are written to support IDEs directly, such as the Eclipse JDT
(for Java) and CDT (for C++) compilers. Such compilers often produce
high-level code indexes for their own internal consumption. One way to extract
data for Kythe is to process that internal index -- either using library-level
APIs (_i.e.,_ by linking the compiler in-process) or by running the compiler
with the appropriate flags to dump out the underlying data.
The main problem we've encountered with such pre-composed indexes is that they
tend to have a public API that exports only what the designers anticipated
would be needed for their primary client, for instance an IDE. Kythe supports
many different kinds of clients, including static analysis clients and database
query engines -- so as a general rule, Kythe needs _all_ the information your
compiler generates in its semantic analysis. However, in the absence of a more
direct approach, this can be one way to get started.
== Compiler Facilities
When you're instrumenting a compiler for Kythe, you need to decide which Kythe
features you want to support. This is often a moving target, as both compilers
and the languages they implement are subject to change over time. As such,
it's usually a matter of judgement -- but since the Kythe data model is
designed to cope with missing or incomplete information, it's fine to start by
emitting any information that's easy to construct. You can go back and add
additional data later on as the need arises.
The safest way to be forward-compatible is to have your compiler expose
_everything_ up front. If no information is ever thrown away, then we won't
need to ask you to add it back in later. We recognize, however, that this is
often difficult and expensive. Kythe is designed to degrade gracefully if the
compiler is missing information.
=== Type-Resolved AST
The main feature needed to provide minimal Kythe functionality is a fully
type-resolved AST. By _type-resolved_ we mean that for each indexable entity
that has a type, Kythe should be able to find a representation of that entity's
associated type.
This core facility is used in many situations: static analysis queries, file
outlining, structural navigation (_e.g.,_ `next-function`), and others.
Ideally:
* The AST must faithfully represent the original source, with no tree rewrites.
* For languages with macros or code generation, we need ASTs for both the pre-
and post-preprocessed versions.
* All AST nodes must include their source file offset and length. The length
is node-specific but should encompass all children.
* The AST should provide a visitor interface that walks the whole tree. The
indexer can use your visitor to create a richer interface with
preorder/postorder/inorder traversals, node parent links and so on.
* An AST should be created even for files that do not compile (if possible).
The compiler should perform error recovery, and may choose to model error
nodes explicitly.
It's fine if the AST produced by the compiler is highly language-specific; the
purpose of the analyzer is to walk through this structure and emit a subset of
its data into the Kythe graph. The important part for the analyzer is to have
all the information available.
=== Symbol Table
Kythe needs access to the association between named entities and their
referents. This information is important to support features like
jump-to-definition, hover-documentation and cross-references.
Sometimes this information is attached directly to the compiler's AST, but in
other cases it's represented by a separate ``symbol table'' generated by the
compiler. For compilers that use this structure, it should be exposed so that
Kythe tools can resolve names to their corresponding type and definition data.
When possible, a compiler should provide a way to resolve a named entity even
without an explicit reference in the source -- consider the problem of
decorating references to named types embedded in documentation comments, for
example. These kinds of queries can usually be satisfied from a symbol table,
if the compiler provides one.
=== Optional AST facilities
The following AST facilities are optional but desired. Keep in mind that these
features are generally helpful to anyone making tools for your language -- not
just Kythe!
*Delimiters and keywords*::
We would prefer to be able to query the AST for information about delimiters
and keywords. This is optional if we have access to the tokenizer (see
below).
*Parent links*::
Kythe indexers often need to search up the AST from a given node, so it is
helpful if the AST contains parent links. This is optional, however, as we
can build a traversable AST during the first visitor pass.
*Serialization*::
It is very helpful if the AST can be serialized. A binary format is useful
for caching ASTs if parsing is at all slow (which for some languages it is).
A text format is useful for debugging. Either one, or both, is very helpful.
We can of course build our own AST format with a visitor pass, if necessary.
=== What Kythe Typically Records
Kythe indexes information about various code entities, including:
* named, *externally addressable* symbols such as classes, methods and fields
* symbols that are *not normally addressable* outside a function, such as
local variables and function parameters
* *anonymous* classes, anonymous structs or unions, lambda functions
* *type-system information*, including nodes to represent templates, generics,
type constructors and the like.
Every recorded entity gets a *node* in Kythe's graph. Kythe nodes comprise a
unique name (called a *VName*) that serves as the storage ``key'' for the node,
together with a bag of key-value properties describing features of that node
(such as its location, its ``kind'', and so forth), and a collection of
labelled *edges* that express its relationships to other nodes. For example, a
node representing a function parameter may have edges connecting it to the type
of the parameter, the place where the parameter is defined, and so forth.
==== What Entities Should Be Supported
Although there is some variation across languages as to what constitute objects
of interest for Kythe, there are a number of common themes:
*Variables*::
The indexer should be able to locate function parameters and local variables,
and distinguish between the two.
*Usages*::
The AST should ideally provide lists of usage locations for any symbol,
though Kythe can synthesize these if necessary.
*Types*::
Each named entity should have type information (see Type Graph below). When
appropriate, the type information may be an error or unknown type.
*Metadata*::
Each named entity should record its relevant language-specific metadata,
including but not limited to: visibility modifiers on the definition
(private, protected, etc.), access modifiers such as const or final, and
other language-specific modifiers such as static, abstract, virtual, native,
extern, volatile, etc.
*Declarations*::
Kythe must be able to distinguish definitions from declarations, if the
language differentiates them.
*Implicit definitions*::
If the compiler generates any entities that participate in the program
structure or type graph, even if they are not directly nameable or
addressable in the language, Kythe needs to be able to get them.
Examples include default constructors, declarations generated through macro
expansion, function prototypes, an implicit `this` reference, and
iterator/generator objects produced by `yield`. The general rule is that any
structure in the source should be represented in the AST, even if it does not
correspond to literal source text.
*Locations*::
Each (named) entity should have a pointer to its defining location in the
concrete syntax. Any usages should also point to their defining locations.
*Scope Chain*::
A named entity should if at all possible have a pointer to its enclosing
scope(s). Kythe should be able to determine the lexical scope and (if
different) the containing scope for the symbol's definition node. Note that
a pointer to the AST node is minimally sufficient if the AST has a visitor
interface.
*Comments*::
If the language supports documentation comments, Kythe needs to be able to
tie each doc comment to the entity it is documenting -- ideally by looking up
the comment in the symbol table.
*Suggested URLs*::
Some built-in (``intrinsic'') language constructs do not have any AST node
corresponding to their definitions. However, in such cases the compiler
should provide a URL syntax (not necessarily as a compiler facility) that
points to canonical online documentation of the built-in. An indexer may
use this to give a ``location'' to such intrinsics.
=== Working around Frugal Compilers
The Kythe graph records as much information as it can about the ``interesting''
semantic entities in a source program. In some cases, however, the compiler
for the language does not provide everything we would like. In such cases, we
generally attempt to work around the problem by doing additional analysis on
the compiler's AST.
We encounter this issue in purely dynamic languages (such as Python and
JavaScript) that lack type annotations. The AST generated for these languages
typically lacks enough information to do more than rudimentary scoped name
lookup. In some cases, we have had to implement our own type inference
(usually approximate) to get the data we want to record.
Even in statically-typed languages, though, compilers tend to discard data
unless they are immediately relevant to the task at hand. By the time Kythe
walks the AST, much of the distinguishing information may have been dropped --
for instance, the lexical or dynamic scope chains.
In an ideal world, the compiler's AST provides everything Kythe needs, so that
we can build a graph with a simple walk. We prefer this for two main reasons:
1. Any ``synthesized'' language semantics embedded in a Kythe analyzer,
separate from the compiler itself, are inherently fragile -- they can (and
do) bit-rot as the compiler and language evolve.
2. Compilers that provide these data for Kythe are also making them available
for other tools.
It can be tricky to balance data richness and compilation efficiency. Some
compilers wind up needing separate code paths for the two use cases. But we
believe that compilers should -- wherever practical -- provide introspection
facilities natively rather than relying on the tools community to reinvent the
semantic analysis done by the core compiler.
=== Other Optional Compiler Facilities
There are a number of other features we have found useful, beyond the core
cross-reference style information. Broadly speaking, the more data a compiler
can emit, the better your tools will be overall. Tools are generally the main
obstacle to adoption of a language, so it makes sense to support as many as you
can.
==== Modeling Types
A compiler should expose its representation of types and their relationships.
For complex type structures such as aliases, compound types (_e.g.,_ pointers,
unions), generics and their specializations, and so on, it should be possible
to decipher and capture the structure of the underlying types, at least in
outline.
==== Structured Documentation
Some languages (_e.g.,_ Java) define a specific format for structured
documentation along with the language; more commonly, structured documentation
is done by convention, using tools like Doxygen. To the extent possible, the
compiler should make it easy to figure out the association between source
comments, annotations, and other structured or semi-structured source metadata,
and the program structures they're attached to.
Kythe analyzers typically record comments in the index, and attempt to attach
each comment to the appropriate statement or expression.
* It helps greatly if the compiler provides a list of comments attached to
the AST.
* The compiler is likely to be smarter about attaching comments to AST nodes
than the indexer would be, so doing so is helpful.
If the language supports structured comments (_e.g.,_ JavaDoc, JSDoc), the
compiler should provide a comment parser module that can identify tags, types
and other distinguished structures within the comments. Kythe attempts to
record the structure of such comments, as well as links between the comment
structures and their associated semantic entities.
Ideally we would like a structured comment to appear as a pre-parsed,
pre-resolved AST, so that other tools can process structured documentation
after the fact without having to re-invoke the compiler. With sufficient
context, Kythe may even be able to synthesize views that the compiler does
_not_ provide explicitly, like semantic/structural file outlines and
hover-documentation, that a UI can then present without having to incorporate
specific knowledge of the underlying language.
==== Tokenizer
Although this facility is optional, it is _strongly recommended_ that you
support it. Kythe uses your compiler's lexical analyzer as an adjunct to
syntax highlighting, indentation support and several other important use cases.
It is extremely useful to have public access to the tokenizer itself, to run on
an arbitrary source text and get back a token stream. Alternately if we can
get at the tokens via an AST traversal that can usually work as well.
One common use for the tokenizer is to permit Kythe to pick up information that
was discarded from the AST, either by design or by accident.
==== Diagnostics
Compilers should provide access to compilation diagnostics, including:
* compiler configuration errors or warnings
* parse errors
* name-resolution errors
* lint warnings
Kythe records diagnostics in a language-agnostic way, so an analyzer has the
option to emit a node to represent each diagnostic message, and edges to
connect that message to other objects in the graph that are affected by that
diagnostic (_e.g.,_ a file, a variable definition, an expression).