kythe/docs/kythe-compatible-compilers.txt - platform/external/kythe - Git at Google

 // Copyright 2014 The Kythe Authors. All rights reserved.
 //
 // Licensed under the Apache License, Version 2.0 (the "License");
 // you may not use this file except in compliance with the License.
 // You may obtain a copy of the License at
 //
 //   http://www.apache.org/licenses/LICENSE-2.0
 //
 // Unless required by applicable law or agreed to in writing, software
 // distributed under the License is distributed on an "AS IS" BASIS,
 // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 // See the License for the specific language governing permissions and
 // limitations under the License.

 Making Tools Kythe-Compatible
 =============================
 Michael J. Fromberger <fromberger@google.com>, Steve Yegge <stevey@google.com>
 v0.1.1, 28-Oct-2014: Draft
 :toc:
 :priority: 998

 This document outlines the work needed to make a production compiler,
 interpreter, or analysis tool able to work smoothly within the Kythe ecosystem.
 We won't discuss integrating a build system here -- although the details of
 doing so are similar in broad outlines.  The specifics differ enough, though,
 that build systems are covered in a separate document.

 Specific details will vary for a given tool, but there are a number of general
 features that most or all tools will need to support in order to communicate
 effectively with other tools.

 For simplicity, we will use the word `compiler' throughout, though the same
 approach applies for instrumenting other analysis tools.

 See also: link:kythe-overview.html[An Overview of Kythe]

 == Quick Overview

 First, provide a way to link against your compiler as a library and invoke it
 on some source code via a direct function call, which we'll call `analyze`.

 The `analyze` function should run the compiler in a special ``analysis'' mode.
 In this mode, the compiler should behave as follows:

  * *don't* generate any executable code
  * *do* recover from errors and process the whole input (to the extent possible)
  * *don't* discard any source information you normally throw away (even comments!)
  * *do* generate a rich AST, symbol table and type graph

 Using this mode, Kythe tools can invoke the compiler as a reusable component
 for other interesting tasks, such as locating semantic cross-references,
 extracting structured documentation, linting, syntax highlighting, code
 folding, and so forth.

 == Adding Support for a New Language

 Adding support for a new language to Kythe consists in instrumenting a compiler
 for that language to emit information about the fully lexed, parsed, and
 type-resolved AST in a Kythe-compliant intermediate representation.  In some
 cases (_e.g.,_ Haskell), a library may already exist to provide these data --
 in which case, the task is much simpler.  Often, however, existing tools do not
 provide what we need.

 The information needed to supply semantic cross-references, extract structured
 documentation, perform syntax highlighting and code-folding, and other tasks on
 a source file in language X is a subset of what the front-end of a compiler for
 that language generates.  Lexing and parsing the source is usually not that
 complex (though there are notable exceptions like C++), but for more
 interesting tasks you'll also need type and dependency information.

 In brief, the steps to add support for a new language are:

  1. *Identify a suitable compiler* or interpreter for the language.

  2. *Instrument the compiler* to emit Kythe-compliant graph artifacts.

  3. [optional, as needed] *Update the Kythe schema* for any new concepts that
     need to be modeled to support your language.

 Step (1) should be straightforward: Ideally, you should use whatever compiler
 is used to produce your production binaries for the language -- that way you
 can be sure your analysis results will agree with your production code.  In
 some cases, this may not be feasible, however -- many compilers are not
 designed to be invoked as libraries.  In that case you may wind up having to
 compromise on using some other implementation. The important points are:

  1. *Do not rewrite the compiler.* By design, the Kythe project does not write
     or maintain separate compilers for the languages it supports. For each
     language, choose one canonical compiler (preferably, the one that you would
     use to process the language in a production environment), and use it to
     build your indexer. If such a compiler does not yet exist, then finding or
     creating an alternative should be the first priority.

  2. *Push compiler changes upstream.* When you do have to work around
     limitations in your chosen compiler, try to find solutions that can be
     folded back into the compiler itself, and plan to offer your changes to the
     maintainer of the compiler for a future revision. Standardizing the
     features needed to produce a valuable index is an important high-level goal
     of the Kythe project; owning parts of the toolchain is decidedly not.

  3. *Avoid forking the compiler.* Occasionally the only way to get what you
     need will be to fork an existing compiler. This should be seen as a measure
     of last resort, to be avoided if at all possible. If there truly is no
     other solution, try to fork only the pieces you absolutely have to change,
     and maintain compatibility with as much of the remaining code as
     possible. Moreover, as described in point (2), try to make your changes in
     a way that will permit re-integration with the upstream project as soon as
     possible.

 Step (2) is where the interesting work happens.  Ideally, this work would be
 done by the author or maintainers of the compiler, but in practice the initial
 work is sometimes performed by an outside contributor (at Google, many of our
 existing language analyzers were initially written by a Kythe team member, for
 example).  However, we want to encourage the compiler owners to take control of
 this work at some point.  We have not yet met a compiler team who are opposed
 to having their compiler emit high-quality metadata.  For example, they may
 improve support by storing documentation comments within the AST so Kythe can
 add the text to its index.

 Step (3) is an optional step that can be undertaken by whomever is implementing
 Step (2).  Often, no schema changes will be needed at all, particularly for
 analyses that re-use existing schema components like cross-references and
 documentation comments, for which we already have a pretty solid model.  If
 your language requires some new kind of data to work smoothly, though, this is
 where it should be documented.

 Once all three of these steps are complete, your new language is ready to plug
 into the Kythe ecosystem.

 == Interfacing with the Compiler

 Kythe provides tools for invoking a language analyzer, which it does using
 information captured from the build system.  The details of build system
 integration are outside the scope of this document, though in outline it is
 similar to compiler integration.  The important point for writing a language
 analyzer is to record how to invoke the compiler front-end, given the settings
 from the build system, including

  * The command-line for the compiler invocation
  * Environment variables
  * All the input files required to process the compilation

 There are several ways to hook into the compiler, and we discuss a few of the
 more common models here.

 === Direct Linking

 By far the most common (and flexible) approach is for the language analyzer to
 statically link against the compiler as a library.  The analyzer and compiler
 run in the same process, and the indexer queries the compiler's data structures
 via direct function or method calls.

 This approach has pros and cons.  It is usually the easiest path for the
 compiler team to take, since it just requires exposing some previously private
 functions and data structures.  It also yields the richest data for the
 indexer, since it's all right there in memory.

 One downside is that it tends to force the analyzer to be written in the same
 language as the compiler, which can sometimes limit code sharing across Kythe
 indexers.  Another is that it closely couples the indexer's implementation to
 the compiler internals.  The coupling can be ameliorated by having the compiler
 expose "internal public" APIs, although the extra layer can make it harder to
 get at the data the indexer actually needs.

 In practice, providing a typed AST visitor and/or access to a symbol table are
 usually sufficient for loose coupling and rich data availability.

 === Pre-Composed Index

 Some compilers are written to support IDEs directly, such as the Eclipse JDT
 (for Java) and CDT (for C++) compilers.  Such compilers often produce
 high-level code indexes for their own internal consumption.  One way to extract
 data for Kythe is to process that internal index -- either using library-level
 APIs (_i.e.,_ by linking the compiler in-process) or by running the compiler
 with the appropriate flags to dump out the underlying data.

 The main problem we've encountered with such pre-composed indexes is that they
 tend to have a public API that exports only what the designers anticipated
 would be needed for their primary client, for instance an IDE.  Kythe supports
 many different kinds of clients, including static analysis clients and database
 query engines -- so as a general rule, Kythe needs _all_ the information your
 compiler generates in its semantic analysis.  However, in the absence of a more
 direct approach, this can be one way to get started.

 == Compiler Facilities

 When you're instrumenting a compiler for Kythe, you need to decide which Kythe
 features you want to support.  This is often a moving target, as both compilers
 and the languages they implement are subject to change over time.  As such,
 it's usually a matter of judgement -- but since the Kythe data model is
 designed to cope with missing or incomplete information, it's fine to start by
 emitting any information that's easy to construct.  You can go back and add
 additional data later on as the need arises.

 The safest way to be forward-compatible is to have your compiler expose
 _everything_ up front.  If no information is ever thrown away, then we won't
 need to ask you to add it back in later.  We recognize, however, that this is
 often difficult and expensive.  Kythe is designed to degrade gracefully if the
 compiler is missing information.

 === Type-Resolved AST

 The main feature needed to provide minimal Kythe functionality is a fully
 type-resolved AST.  By _type-resolved_ we mean that for each indexable entity
 that has a type, Kythe should be able to find a representation of that entity's
 associated type.

 This core facility is used in many situations: static analysis queries, file
 outlining, structural navigation (_e.g.,_ `next-function`), and others.

 Ideally:

  * The AST must faithfully represent the original source, with no tree rewrites.

  * For languages with macros or code generation, we need ASTs for both the pre-
    and post-preprocessed versions.

  * All AST nodes must include their source file offset and length.  The length
    is node-specific but should encompass all children.

  * The AST should provide a visitor interface that walks the whole tree.  The
    indexer can use your visitor to create a richer interface with
    preorder/postorder/inorder traversals, node parent links and so on.

  * An AST should be created even for files that do not compile (if possible).
    The compiler should perform error recovery, and may choose to model error
    nodes explicitly.

 It's fine if the AST produced by the compiler is highly language-specific; the
 purpose of the analyzer is to walk through this structure and emit a subset of
 its data into the Kythe graph.  The important part for the analyzer is to have
 all the information available.

 === Symbol Table

 Kythe needs access to the association between named entities and their
 referents. This information is important to support features like
 jump-to-definition, hover-documentation and cross-references.

 Sometimes this information is attached directly to the compiler's AST, but in
 other cases it's represented by a separate ``symbol table'' generated by the
 compiler.  For compilers that use this structure, it should be exposed so that
 Kythe tools can resolve names to their corresponding type and definition data.

 When possible, a compiler should provide a way to resolve a named entity even
 without an explicit reference in the source -- consider the problem of
 decorating references to named types embedded in documentation comments, for
 example.  These kinds of queries can usually be satisfied from a symbol table,
 if the compiler provides one.

 === Optional AST facilities

 The following AST facilities are optional but desired.  Keep in mind that these
 features are generally helpful to anyone making tools for your language -- not
 just Kythe!

 *Delimiters and keywords*::
   We would prefer to be able to query the AST for information about delimiters
   and keywords.  This is optional if we have access to the tokenizer (see
   below).

 *Parent links*::
   Kythe indexers often need to search up the AST from a given node, so it is
   helpful if the AST contains parent links.  This is optional, however, as we
   can build a traversable AST during the first visitor pass.

 *Serialization*::
   It is very helpful if the AST can be serialized.  A binary format is useful
   for caching ASTs if parsing is at all slow (which for some languages it is).
   A text format is useful for debugging.  Either one, or both, is very helpful.
   We can of course build our own AST format with a visitor pass, if necessary.

 === What Kythe Typically Records

 Kythe indexes information about various code entities, including:

  * named, *externally addressable* symbols such as classes, methods and fields

  * symbols that are *not normally addressable* outside a function, such as
    local variables and function parameters

  * *anonymous* classes, anonymous structs or unions, lambda functions

  * *type-system information*, including nodes to represent templates, generics,
     type constructors and the like.

 Every recorded entity gets a *node* in Kythe's graph.  Kythe nodes comprise a
 unique name (called a *VName*) that serves as the storage ``key'' for the node,
 together with a bag of key-value properties describing features of that node
 (such as its location, its ``kind'', and so forth), and a collection of
 labelled *edges* that express its relationships to other nodes.  For example, a
 node representing a function parameter may have edges connecting it to the type
 of the parameter, the place where the parameter is defined, and so forth.

 ==== What Entities Should Be Supported

 Although there is some variation across languages as to what constitute objects
 of interest for Kythe, there are a number of common themes:

 *Variables*::
   The indexer should be able to locate function parameters and local variables,
   and distinguish between the two.

 *Usages*::
   The AST should ideally provide lists of usage locations for any symbol,
   though Kythe can synthesize these if necessary.

 *Types*::
   Each named entity should have type information (see Type Graph below).  When
   appropriate, the type information may be an error or unknown type.

 *Metadata*::
   Each named entity should record its relevant language-specific metadata,
   including but not limited to: visibility modifiers on the definition
   (private, protected, etc.), access modifiers such as const or final, and
   other language-specific modifiers such as static, abstract, virtual, native,
   extern, volatile, etc.

 *Declarations*::
   Kythe must be able to distinguish definitions from declarations, if the
   language differentiates them.

 *Implicit definitions*::
   If the compiler generates any entities that participate in the program
   structure or type graph, even if they are not directly nameable or
   addressable in the language, Kythe needs to be able to get them.
   Examples include default constructors, declarations generated through macro
   expansion, function prototypes, an implicit `this` reference, and
   iterator/generator objects produced by `yield`.  The general rule is that any
   structure in the source should be represented in the AST, even if it does not
   correspond to literal source text.

 *Locations*::
   Each (named) entity should have a pointer to its defining location in the
   concrete syntax.  Any usages should also point to their defining locations.

 *Scope Chain*::
   A named entity should if at all possible have a pointer to its enclosing
   scope(s).  Kythe should be able to determine the lexical scope and (if
   different) the containing scope for the symbol's definition node.  Note that
   a pointer to the AST node is minimally sufficient if the AST has a visitor
   interface.

 *Comments*::
   If the language supports documentation comments, Kythe needs to be able to
   tie each doc comment to the entity it is documenting -- ideally by looking up
   the comment in the symbol table.

 *Suggested URLs*::
   Some built-in (``intrinsic'') language constructs do not have any AST node
   corresponding to their definitions.  However, in such cases the compiler
   should provide a URL syntax (not necessarily as a compiler facility) that
   points to canonical online documentation of the built-in.  An indexer may
   use this to give a ``location'' to such intrinsics.

 === Working around Frugal Compilers

 The Kythe graph records as much information as it can about the ``interesting''
 semantic entities in a source program.  In some cases, however, the compiler
 for the language does not provide everything we would like.  In such cases, we
 generally attempt to work around the problem by doing additional analysis on
 the compiler's AST.

 We encounter this issue in purely dynamic languages (such as Python and
 JavaScript) that lack type annotations.  The AST generated for these languages
 typically lacks enough information to do more than rudimentary scoped name
 lookup.  In some cases, we have had to implement our own type inference
 (usually approximate) to get the data we want to record.

 Even in statically-typed languages, though, compilers tend to discard data
 unless they are immediately relevant to the task at hand. By the time Kythe
 walks the AST, much of the distinguishing information may have been dropped --
 for instance, the lexical or dynamic scope chains.

 In an ideal world, the compiler's AST provides everything Kythe needs, so that
 we can build a graph with a simple walk.  We prefer this for two main reasons:

  1. Any ``synthesized'' language semantics embedded in a Kythe analyzer,
     separate from the compiler itself, are inherently fragile -- they can (and
     do) bit-rot as the compiler and language evolve.

  2. Compilers that provide these data for Kythe are also making them available
     for other tools.

 It can be tricky to balance data richness and compilation efficiency.  Some
 compilers wind up needing separate code paths for the two use cases.  But we
 believe that compilers should -- wherever practical -- provide introspection
 facilities natively rather than relying on the tools community to reinvent the
 semantic analysis done by the core compiler.

 === Other Optional Compiler Facilities

 There are a number of other features we have found useful, beyond the core
 cross-reference style information.  Broadly speaking, the more data a compiler
 can emit, the better your tools will be overall.  Tools are generally the main
 obstacle to adoption of a language, so it makes sense to support as many as you
 can.

 ==== Modeling Types

 A compiler should expose its representation of types and their relationships.
 For complex type structures such as aliases, compound types (_e.g.,_ pointers,
 unions), generics and their specializations, and so on, it should be possible
 to decipher and capture the structure of the underlying types, at least in
 outline.

 ==== Structured Documentation

 Some languages (_e.g.,_ Java) define a specific format for structured
 documentation along with the language; more commonly, structured documentation
 is done by convention, using tools like Doxygen.  To the extent possible, the
 compiler should make it easy to figure out the association between source
 comments, annotations, and other structured or semi-structured source metadata,
 and the program structures they're attached to.

 Kythe analyzers typically record comments in the index, and attempt to attach
 each comment to the appropriate statement or expression.

    * It helps greatly if the compiler provides a list of comments attached to
      the AST.

    * The compiler is likely to be smarter about attaching comments to AST nodes
      than the indexer would be, so doing so is helpful.

 If the language supports structured comments (_e.g.,_ JavaDoc, JSDoc), the
 compiler should provide a comment parser module that can identify tags, types
 and other distinguished structures within the comments.  Kythe attempts to
 record the structure of such comments, as well as links between the comment
 structures and their associated semantic entities.

 Ideally we would like a structured comment to appear as a pre-parsed,
 pre-resolved AST, so that other tools can process structured documentation
 after the fact without having to re-invoke the compiler. With sufficient
 context, Kythe may even be able to synthesize views that the compiler does
 _not_ provide explicitly, like semantic/structural file outlines and
 hover-documentation, that a UI can then present without having to incorporate
 specific knowledge of the underlying language.

 ==== Tokenizer

 Although this facility is optional, it is _strongly recommended_ that you
 support it.  Kythe uses your compiler's lexical analyzer as an adjunct to
 syntax highlighting, indentation support and several other important use cases.

 It is extremely useful to have public access to the tokenizer itself, to run on
 an arbitrary source text and get back a token stream.  Alternately if we can
 get at the tokens via an AST traversal that can usually work as well.

 One common use for the tokenizer is to permit Kythe to pick up information that
 was discarded from the AST, either by design or by accident.

 ==== Diagnostics

 Compilers should provide access to compilation diagnostics, including:

    * compiler configuration errors or warnings
    * parse errors
    * name-resolution errors
    * lint warnings

 Kythe records diagnostics in a language-agnostic way, so an analyzer has the
 option to emit a node to represent each diagnostic message, and edges to
 connect that message to other objects in the graph that are affected by that
 diagnostic (_e.g.,_ a file, a variable definition, an expression).