// Copyright 2014 The Kythe Authors. All rights reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
Kythe URI Specification
=======================
Michael J. Fromberger <fromberger@google.com>
v0.1.1, 29-Oct-2014: Draft
This document defines the schema for Kythe uniform resource identifiers ("Kythe URI").
The primary purpose of a Kythe URI is to provide a textual encoding of a Kythe
VName, which is a unique identifier for a node in the semantic graph generated
by Kythe-compatible tools. A Kythe URI may also be extended to encode simple
queries about a particular VName in a transportable format.
The identifiers described in this document are compatible with the grammar
given in http://tools.ietf.org/html/rfc3987[RFC 3987] (Internationalized
Resource Identifiers) and thereby also with the underlying grammar from
http://tools.ietf.org/html/rfc3986[RFC 3986] (Uniform Resource Identifiers).
== Scheme Label
The scheme label for Kythe URIs will be "`kythe:`".
== Character Set
A Kythe URI is a string of UCS (Unicode) characters. For storage and
transmission, a Kythe URI will be encoded as UTF-8 with no byte-order mark,
using Normalization Form NFKC.
Except as restricted by the syntax, all UCS characters are valid in a Kythe URI.
Reserved characters (e.g., "/", "?") and whitespace must be percent-escaped
per Section 2.1 of RFC 3986; for example, " " becomes "`%20`".
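
The normalization and escaping rules above can be sketched in a few lines of Python. This is only an illustration, not part of the specification: the helper name `escape_component` is hypothetical, and the sketch escapes every character outside the RFC 3986 unreserved set (including non-ASCII characters), which is more conservative than the grammar requires.

```python
import unicodedata
from urllib.parse import quote


def escape_component(text):
    """Normalize text to NFKC, then percent-escape it as UTF-8.

    Hypothetical helper: with safe="", everything except the RFC 3986
    unreserved characters (letters, digits, "-", ".", "_", "~") is escaped.
    """
    normalized = unicodedata.normalize("NFKC", text)
    return quote(normalized, safe="")


if __name__ == "__main__":
    print(escape_component(" "))    # a single space
    print(escape_component("c++"))  # "+" is escaped
    print(escape_component("a/b?")) # "/" and "?" are escaped
```

A caller that wants to keep "/" as a separator inside a path or corpus would escape each segment separately, or pass `safe="/"` to `quote`.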
== Syntax
The following grammar defines the syntax of a Kythe URI. Some productions have
provisional values and will change as the Kythe schema evolves.
----
kythe-uri    = "kythe:" [corpus] attrs ["#" signature]
corpus       = "//" label 0*{"/" path-segment}
label        = ireg-name                                -- RFC 3987
attrs        = ["?" lang-attr] ["?" path-attr] ["?" root-attr]
lang-attr    = "lang=" language
path-attr    = "path=" path-segment 0*{"/" path-segment}
root-attr    = "root=" root-segment 0*{"/" root-segment}
language     = 1*ipchar                                 -- RFC 3987
signature    = 1*ipchar                                 -- RFC 3987
root-segment = 1*ipchar                                 -- RFC 3987
path-segment = 1*{unreserved | pct-encoded | "/"}       -- RFC 3987
----
Note that the order of the attributes (the `attrs` production) is fixed, to
ensure that a Kythe URI has a canonical string encoding.
Examples (subject to change):
* Empty (no fields): `kythe:`
* Signature only: `kythe:#loc-a90320dafd60`
* Ad-hoc corpus (signature, corpus, path, language): `kythe://corpusname?lang=c%2B%2B?path=file/base/file.h#class-Foo`
* Bitbucket (corpus, path): `kythe://bitbucket.org/creachadair/stringset?path=README.md`
* Maven (corpus, path, language): `kythe://maven.org/central/org/apache/thrift?lang=java?path=libthrift/0.9.1`
* Language, path, signature: `kythe:?lang=go?path=mapreduce/go/contrib/plan.go#MR`
* Corpus, path, language: `kythe://code.google.com/p/go.tools?lang=go?path=cmd/godoc/doc.go`
* Alternate root: `kythe://chromium.org/chrome?path=openssl/crypto/bf/bf_pi.h?root=third_party/openssl/1650`
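
The fixed attribute order and the examples above can be illustrated with a minimal encoder sketch in Python. The function name `encode_kythe_uri` and its keyword parameters are hypothetical, chosen to mirror the grammar's productions. The sketch assumes the inputs are already NFKC-normalized, and it escapes more characters than `ipchar` strictly requires.

```python
from urllib.parse import quote


def encode_kythe_uri(corpus="", language="", path="", root="", signature=""):
    """Build a Kythe URI string, emitting attributes in the fixed order
    lang, path, root so that equal fields always yield the same string.

    Hypothetical helper; empty fields are omitted entirely.
    """
    uri = "kythe:"
    if corpus:
        # "/" is kept so that the corpus reads as hostname plus path.
        uri += "//" + quote(corpus, safe="/")
    if language:
        uri += "?lang=" + quote(language, safe="")
    if path:
        uri += "?path=" + quote(path, safe="/")
    if root:
        uri += "?root=" + quote(root, safe="/")
    if signature:
        uri += "#" + quote(signature, safe="")
    return uri


if __name__ == "__main__":
    print(encode_kythe_uri(corpus="corpusname", language="c++",
                           path="file/base/file.h", signature="class-Foo"))
```

Because the order of the attributes is decided by the encoder rather than by the caller, two callers that supply the same fields cannot produce different strings.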
=== Rationale
The grammar for `kythe-uri` is compatible with the generic URI syntax defined
in RFC 3986, so even a fairly naive parser should be able to split a Kythe URI
into its high-level components: the "hostname" and "path" components of the
generic URI represent the `corpus`, the "query" component captures the
`attrs`, and the "fragment" component captures the `signature`.
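
As a rough illustration of this point, the sketch below uses Python's generic `urllib.parse.urlsplit` to recover the fields of a Kythe URI. The function name `parse_kythe_uri` and the keys of the returned dictionary are hypothetical. The sketch does not enforce the fixed attribute order and performs no NFKC normalization; it only shows that a generic URI splitter does most of the work.

```python
from urllib.parse import urlsplit, unquote

# Maps attribute names from the grammar to (hypothetical) field names.
_ATTR_FIELDS = {"lang": "language", "path": "path", "root": "root"}


def parse_kythe_uri(uri):
    """Split a Kythe URI into corpus, language, path, root and signature."""
    parts = urlsplit(uri)
    if parts.scheme != "kythe":
        raise ValueError("not a Kythe URI: %r" % uri)
    fields = {
        # Generic "hostname" plus "path" together form the corpus.
        "corpus": unquote(parts.netloc + parts.path),
        "language": "",
        "path": "",
        "root": "",
        # The generic "fragment" is the signature.
        "signature": unquote(parts.fragment),
    }
    if parts.query:
        # Attributes are separated by "?" rather than the usual "&".
        for attr in parts.query.split("?"):
            key, _, value = attr.partition("=")
            if key not in _ATTR_FIELDS:
                raise ValueError("unknown attribute: %r" % key)
            fields[_ATTR_FIELDS[key]] = unquote(value)
    return fields


if __name__ == "__main__":
    print(parse_kythe_uri(
        "kythe://corpusname?lang=c%2B%2B?path=file/base/file.h#class-Foo"))
```

The generic splitter treats everything between the first "?" and the "#" as the query, so the later "?" separators survive intact and can be split in a second step.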
The meaning of the strings generated by the `corpus` production is not defined
in this specification; the intent is to allow a corpus to behave like a
hostname, so that a server providing Kythe data can use the corpus string to
locate the data for that corpus. For services that support many independent
corpora (e.g., github.com, bitbucket.org, code.google.com), the corpus field
will probably include information about the project directly (e.g.,
"code.google.com/p/go.text"). In cases where there is only a single corpus
with a body of different branches or subdivisions, some of that context may
be stored in the `root` attribute instead.
The choice between these representations depends mainly on whether the
"project" label is likely to vary. A github.com repository rarely changes its
name, so it makes sense to include the repository name as part of the corpus
and reserve the `root` field for branches. The encoding of the URI is agnostic
to this decision.