blob: 3156c1fe929fec50364beaefa7cf7780d3956e7b [file] [log] [blame]
// Copyright 2014 The Kythe Authors. All rights reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
Kythe Index Pack Format
=======================
Michael J. Fromberger <fromberger@google.com>
v.0.1.0, 24-Nov-2014: Draft
:toc:
:toclevels: 3
:priority: 500
== Summary
This document specifies a compact persistent storage representation for
compilation information, suitable for use by Kythe to generate cross-reference
data.
== Background
To generate cross-references, Kythe captures a record of each compilation that
is to be indexed (_e.g.,_ a library or binary) with enough information to
enable us to replay the compilation to the front-end of the compiler. This
record consists of a
+CompilationUnit+ link:https://developers.google.com/protocol-buffers[protobuf]
message, together with the content of all the source files and other inputs the
compiler needs to process the compilation (_e.g.,_ header files or type
snapshots from dependencies).
[[_kindex_format]]
*Deprecated in favor of link:https://www.kythe.io/docs/kythe-kzip.html[kzip]*
For a single compilation, we can store this record in a +.kindex+ file, which
contains a GZip-compressed sequence of length-prefixed binary records (length
encoded as a protobuf-style
link:https://developers.google.com/protocol-buffers/docs/encoding#varints[varint]),
organized as:
[cols="10%,90%"]
|================================================================================
|Record # |Content
|0 |link:/repo/kythe/proto/analysis.proto[`CompilationUnit`] protobuf, wire format [specifying n required inputs]
|1 |link:/repo/kythe/proto/analysis.proto[`FileData`] protobuf, wire format [file 1 contents]
|… |…
|n |… [file n contents]
|================================================================================
This file-format is adequate for a single compilation. It doesn't work well for
indexing a corpus comprising many compilations, however, because such
compilations typically share many input files in common, and storing the
complete record per file means we duplicate the contents of all those shared
input files that are shared across multiple records.
== Index Pack Format
To solve the problems described above, we use a more compact format called an
*index pack*. An index pack is a specially-structured directory (using a layout
inspired by link:http://cr.yp.to/proto/maildir.html[Maildir]):
[literal]
root/ # Any legal directory name is fine here
units/
abcd1234.unit # Compilation unit (see below for format)
… # (name is hex-coded SHA256 of uncompressed data)
files/
1a2b3c4e.data # File contents, compressed
… # (name is hex-coded SHA256 of uncompressed data)
This organization separates the compilation unit descriptions from their file
data, which are shared among multiple compilations. Transporting the index pack
is accomplished by archiving it with tar or zip or another format as desired.
=== Directory and File Layout
An index pack is a directory (called the root) containing two subdirectories, one named units and one named files.
* The `units` subdirectory may contain only unit files and temp files, as
described below.
* The `files` subdirectory may contain only data files and temp files, as
described below.
* Other files or directories inside the units or files subdirectories should
cause a tool to consider the index pack invalid.
* Other files or subdirectories in the root should be ignored by a tool
processing the index pack.
A *unit file* is a file containing a GZip-compressed compilation unit
description. The name of a unit file has the form `<digest>.unit`, where
`<digest>` denotes a lower-case hex-encoded SHA256 digest of the uncompressed
compilation unit description stored in the file.
A *data file* is a file containing a GZip-compressed blob of file data. The
name of a data file has the form `<digest>.data`, where `<digest>` denotes a
lower-case hex-encoded SHA256 digest of the uncompressed blob data stored in
the file.
A *temp file* is a file containing arbitrary data being staged into the index
pack. The name of a temp file has the form `<uuid>.new`, where `<uuid>` denotes
an ASCII-encoded Version 4 UUID as described by
link:http://www.ietf.org/rfc/rfc4122.txt[RFC 4122].
=== Compilation Unit Description Format
The contents of a unit file are a GZip-compressed compilation unit
description. The description is encoded as a single
link:http://www.json.org/[JSON] object, having the following top-level
structure:
[source,javascript]
{
"format": <format-key>,
"content": <content-object>
}
The two keys `"format"` and `"content"` are required; any additional keys must
be ignored by a reader that does not understand them (this permits forward
migration to newer versions). The values of these fields are defined as
follows:
<format-key>::
A string value describing the format of the content. This value may be
empty. A reader that discovers a `<format-key>` value it does not recognize
should skip processing the unit in that file.
<content-object>::
A JSON object carrying the compilation description. A reader that discovers a
missing `<content-object>` field, or is unable to parse the value, should
consider the unit ill-formed and return an error.
The format key is intended to support versioning of the format of compilation
units across potentially-breaking changes. Tools that write an index pack can
be changed to emit multiple unit formats into the same pack without waiting for
the consumers to all be updated. Once migration is complete, the old format
can be dropped.
==== Format Values
The following format values are currently defined:
[cols="10%,90%"]
|==================================================================
|Key |Object description
|`"kythe"` |JSON encoding of a single Kythe `CompilationUnit` message.
|==================================================================
=== Encoding and Decoding Protocol
A reader must be able to decode compilation units from JSON, which is
represented in the pseudocode below by a function with signature:
func decodeCompilationUnit(json string, unit *CompilationUnit)
This function decodes the given JSON content object into whatever compilation
unit representation the reader prefers, or returns an error.
A writer must be able to encode compilation units to JSON, which is represented
in the pseucocode below by a function with signature:
func encodeCompilationUnit(unit CompilationUnit) string
This function takes a compilation unit in whatever format the writer prefers,
and returns the preferred format key and the JSON encoding of the message (or
signaling an error).
==== Read Protocol
Multiple concurrent processes can safely read unit files and data files from an
index pack directory, even if there are also concurrent writers. The protocol
for reading a file from the index pack is described by the following
pseudocode:
[source,python]
func readFile(directory, sourceName):
fp = file(path.join(directory, sourceName), 'r')
gz = GZipDecompressor(input=fp)
data = gz.readAll()
fp.close()
return data
Using `readFile` as a primitive, data and unit files are read as follows
(modulo error-handling):
[source,python]
func readDataFile(root, digest):
sourceName = digest+'.data'
return readFile(path.join(root, 'files'), sourceName)
[source,python]
func readUnitFile(root, digest, formatKey, unit):
sourceName = digest+'.unit'
contents = readFile(root, sourceName)
wrapper = JSON.decode(contents)
if wrapper.format == formatKey:
decodeCompilationUnit(wrapper["content"], unit)
Note: While it is safe to read while there are concurrent writers (in the sense
that you will get consistent data), it is not recommended to iterate over the
contents of the index pack while it is being written, since the results may be
incomplete. This is not a problem during normal use.
==== Write Protocol
Multiple concurrent processes can safely write into a single index pack,
provided the index pack is stored in a filesystem that supports an atomic
file-rename operation. Filesystems that are POSIX-compliant support atomic
rename via the `rename(2)` call. The protocol for writing a new file into the
index pack is described by the following pseudocode:
[source,python]
func writeFile(directory, targetName, contents):
tmp = path.join(directory, toHex(UUID4()) + '.new')
fp = file(tmp, 'w')
gz = GZipCompressor(output=fp)
gz.write(contents); gz.flush(); fp.close()
rename(tmp, path.join(directory, targetName))
Using `writeFile` as a primitive, data and unit files are written as follows:
[source,python]
func writeDataFile(root, contents):
targetName = toHex(sha256(contents))+'.data'
writeFile(path.join(root, 'files'), targetName, contents)
[source,python]
func writeUnitFile(root, formatKey, unit):
content = encodeCompilationUnit(unit)
body = JSON{"format": formatKey, "content": content}.toString()
targetName = toHex(sha256(body))+'.unit'
writeFile(path.join(root, 'units'), targetName, body)
Concurrent writers may store the same data without conflict, since files are
named by a cryptographic digest of their contents. There is no provision for
deleting data from an index pack within the protocol specified here. Note that
this specification does not impose any ordering constraints between data files
and unit files—in particular, an implementation may write data files before or
after the unit file, at its option.