
 // Copyright 2016 The Kythe Authors. All rights reserved.
 //
 // Licensed under the Apache License, Version 2.0 (the "License");
 // you may not use this file except in compliance with the License.
 // You may obtain a copy of the License at
 //
 //   http://www.apache.org/licenses/LICENSE-2.0
 //
 // Unless required by applicable law or agreed to in writing, software
 // distributed under the License is distributed on an "AS IS" BASIS,
 // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 // See the License for the specific language governing permissions and
 // limitations under the License.

 Kythe Compilation Database (KCD) Specification
 ==============================================
 Michael J. Fromberger <fromberger@google.com>
 v.0.1.1, 31-Aug-2016: Draft
 :toc:
 :priority: 750

 == Summary

 This document describes the Kythe compilation database, an index of build
 information used by Kythe to perform semantic analysis of source code.

 == Background & Motivation

 For Kythe to index a source file, we need to know all of the dependencies of
 that file (e.g., imports or include files), as well as any settings that
 control the compiler's behaviour in processing that file (e.g., environment
 variables, flags). Files often depend on generated code (e.g., protobuf
 wrappers, SWIG), produced as part of the build process. Thus: In order to index
 a file, Kythe usually must first _build_ that file—which we do in Kythe using
 Bazel.

 For several reasons, Kythe does not index _during_ the build. Instead, we
 capture a record of each compile action taken by the build process and store it
 for separate processing. The main reasons for this separation are:

 - **Resource constraints.** Builds often run in a special-purpose build
     environment, specialized to handle build executions and typically under
     high load.  Kythe indexers run with a CPU and output profile that isn't a
     good fit for this environment. By storing the build information, we can do
     its processing "offline" from the build, in a less-constrained environment.

 - **Reusability.** Besides Kythe indexers, there are other static analysis
     tools that require the same basic data that Kythe uses.  Rather than run
     repeated builds, capturing and storing the compilation records allows
     these tools to take advantage of the same work. It is also helpful to be
     able to replay a stored compilation for testing and repro purposes,
     without the need to re-invoke the build system.

 - **Historical data.** Maintainers of important core libraries find it helpful
     to have records of compilation data over a longer span of time, e.g., for
     analysis of API usage. Keeping an archive of compilation settings for a
     longer period of time than the build caches (order of months, vs. order of
     days) makes it easier to support this kind of exploration.

 == Kythe Compilation Database

 To address these needs, we use a compilation storage format called a
 **compilation database**. This is similar in many respects to the language-
 specific compilation databases produced by tools like Clang.

 === Overview

 A Kythe compilation database is a storage mechanism for compilation data
 captured from a build system. It consists of two parts:

 1.  The **store** is a content-addressable store of compilation records and
     file contents. Files and compilations are addressed via a lowercase
     hex-encoded SHA256 digest of their contents.

 2.  The **index** records revision information and supports efficient lookup of
     compilation units from some of their properties. This includes:

     -   A *revisions index*, recording which complete revisions (e.g., CLs,
         commit hashes) are recorded in the database, and to which corpus they
         belong.

     -   A *compilation index* of query terms for each compilation unit,
         including target name, source files, revision, corpus label, and
         language.
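
As a rough illustration of the content-addressable store described above, the
following Go sketch keeps blobs in a map keyed by the lowercase hex-encoded
SHA256 digest of their contents. The names `digestOf`, `toyStore`, `put`, and
`get` are hypothetical and chosen for this example only; they are not part of
any Kythe package, and a real store would also need persistence and
concurrency control.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// digestOf returns the lowercase hex-encoded SHA256 digest of data.
func digestOf(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:]) // EncodeToString emits lowercase hex
}

// toyStore is a hypothetical in-memory content-addressable store.
type toyStore struct {
	blobs map[string][]byte
}

func newToyStore() *toyStore {
	return &toyStore{blobs: make(map[string][]byte)}
}

// put stores a copy of data under its digest and returns that digest.
// Storing the same contents twice yields the same key, so it is idempotent.
func (s *toyStore) put(data []byte) string {
	d := digestOf(data)
	s.blobs[d] = append([]byte(nil), data...)
	return d
}

// get returns the contents stored under digest, if any.
func (s *toyStore) get(digest string) ([]byte, bool) {
	data, ok := s.blobs[digest]
	return data, ok
}

func main() {
	s := newToyStore()
	d1 := s.put([]byte("package main\n"))
	d2 := s.put([]byte("package main\n"))
	// Identical contents map to one 64-character digest and one stored blob.
	fmt.Println(d1 == d2, len(d1), len(s.blobs)) // prints "true 64 1"
}
```

Because the key is derived from the contents alone, two writers that store the
same file independently arrive at the same address without coordinating.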

 === Terminology

 *   A *compilation unit* is a record of a single action taken by the build
     system. Typically this corresponds to the invocation of a compiler with a
     particular set of flags and input files.

 *   A *corpus label* is a string that identifies a corpus of files governed by a
     source repository and build system.

 *   A *digest* is a lowercase hex-encoded digest used to identify an object in
     the content-addressable store. A *unit digest* identifies a compilation
     record ("compilation unit"), while a *file digest* identifies a file.
     +
     +
     A file digest is constructed by encoding the SHA256 digest of the file's
     content, and is the same across all compilation databases.
     +
     +
     A unit digest may be constructed the same way based on the storage format of
     the compilation record, but is not required to be the same from one database
     to another (as storage formats may differ).

 *   A *format key* is a string that provides an optional type hint for the data
     stored in a compilation unit. In Kythe we use the format key `kythe` to
     mean a `kythe.proto.CompilationUnit`.

 *   A *revision marker* is a string that identifies a revision within a corpus.
     A revision marker must be nonempty and contain no ASCII whitespace, but is
     otherwise unconstrained. A revision marker is expected to be unique among
     revisions for its corpus. In a Git repo, for example, we will use a commit
     hash.
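
The constraints on a revision marker can be checked mechanically. The sketch
below is one possible validation in Go; the function name
`checkRevisionMarker` is hypothetical, and the exact set of characters treated
as ASCII whitespace (space, tab, newline, vertical tab, form feed, carriage
return) is an assumption of this example rather than something this document
pins down.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// asciiSpace lists the characters this sketch treats as ASCII whitespace.
// The precise set is an assumption made for illustration.
const asciiSpace = " \t\n\v\f\r"

// checkRevisionMarker reports an error if marker is empty or contains any
// ASCII whitespace; any other string is accepted.
func checkRevisionMarker(marker string) error {
	if marker == "" {
		return errors.New("revision marker is empty")
	}
	if strings.ContainsAny(marker, asciiSpace) {
		return fmt.Errorf("revision marker %q contains whitespace", marker)
	}
	return nil
}

func main() {
	// The sample markers are made up for this example.
	for _, m := range []string{"a1b2c3d", "", "rev 42"} {
		fmt.Printf("%q: %v\n", m, checkRevisionMarker(m))
	}
}
```

Uniqueness within a corpus cannot be checked from the string alone; it depends
on what the revisions index already contains.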

 === Interface

 The interface to the compilation database is via the following abstract methods:

 -   `Revisions` returns the revision marker, corpus label, and timestamp for
     each indexed revision matching the query terms.

 -   `Find` returns the digests of all compilation units in the store matching
     the given query terms. The query terms supported include: *revision*,
     *language*, *corpus label*, *target name*, *source path*, and *output path*.

 -   `Units` returns the stored compilation data matching the given unit digests.
     The storage format of compilation records may differ by implementation, so
     only unit digests returned by a KCD instance's own `Find` method may be
     considered valid for that instance.

 -   `Files` returns the stored file data matching the given digests.

 -   `FilesExist` checks whether file data is stored for the given file digests.
     The method returns all the proffered file digests that exist in the store.

 -   `WriteRevision` adds or replaces a revision in the revisions index. A
     revision is specified as a *revision marker* and a *corpus*.

 -   `WriteUnit` adds a compilation unit to the content-addressable store and
     updates the compilation index. The unit digest of the stored compilation is
     returned (as by `Find`).

 -   `WriteFile` adds the contents of a file to the content-addressable store.
     The file digest of the stored file is returned.

 A read-only implementation may omit the `WriteRevision`, `WriteUnit`, and
 `WriteFile` methods, or provide stubs that always return an error.
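
To make the read-only case concrete, here is a deliberately simplified Go
sketch covering only three of the methods above. The `Reader` and `Writer`
interfaces, their signatures, and the `readOnlyDB` type are hypothetical
renderings for this example; they are not the signatures defined in
`kythe/go/platform/kcd`, which should be consulted directly.

```go
package main

import (
	"errors"
	"fmt"
)

// Reader is a hypothetical, reduced view of the read methods.
type Reader interface {
	// FilesExist returns those of the given digests present in the store.
	FilesExist(digests []string) []string
	// Files returns the stored contents for each given digest that exists.
	Files(digests []string) map[string][]byte
}

// Writer is a hypothetical, reduced view of the write methods.
type Writer interface {
	// WriteFile stores data and returns its file digest.
	WriteFile(data []byte) (string, error)
}

// errReadOnly is returned by every write stub of a read-only database.
var errReadOnly = errors.New("database is read-only")

// readOnlyDB serves a fixed set of files and rejects all writes.
type readOnlyDB struct {
	files map[string][]byte // keyed by file digest
}

func (db readOnlyDB) FilesExist(digests []string) []string {
	var found []string
	for _, d := range digests {
		if _, ok := db.files[d]; ok {
			found = append(found, d)
		}
	}
	return found
}

func (db readOnlyDB) Files(digests []string) map[string][]byte {
	out := make(map[string][]byte)
	for _, d := range digests {
		if data, ok := db.files[d]; ok {
			out[d] = data
		}
	}
	return out
}

// WriteFile is a stub: a read-only implementation always returns an error.
func (db readOnlyDB) WriteFile([]byte) (string, error) {
	return "", errReadOnly
}

// Compile-time checks that readOnlyDB satisfies both sketch interfaces.
var (
	_ Reader = readOnlyDB{}
	_ Writer = readOnlyDB{}
)

func main() {
	// "k1" stands in for a real 64-character file digest.
	db := readOnlyDB{files: map[string][]byte{"k1": []byte("contents")}}
	fmt.Println(db.FilesExist([]string{"k1", "k2"}))
	_, err := db.WriteFile([]byte("new"))
	fmt.Println(err)
}
```

Providing stubs rather than omitting the write methods lets one value satisfy
both interfaces, at the cost of deferring the failure from compile time to run
time.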

 === Implementations

 A Go description of the abstract interface, along with some support code, is
 defined in `kythe/go/platform/kcd`.

 ==== Concrete implementations:

 -   In-memory (`memdb.go`).
     Build target: `//kythe/go/platform/kcd:memdb`

 -   Unit tests for an arbitrary `kcd.ReadWriter` value can be built using
     `testutil.go`.

 The goal of this design is for clients to use the compilation database via a
 service interface, without needing a heavyweight client library for common
 tasks such as locating and analyzing compilations.