| // Copyright 2014 The Kythe Authors. All rights reserved. |
| // |
| // Licensed under the Apache License, Version 2.0 (the "License"); |
| // you may not use this file except in compliance with the License. |
| // You may obtain a copy of the License at |
| // |
| // http://www.apache.org/licenses/LICENSE-2.0 |
| // |
| // Unless required by applicable law or agreed to in writing, software |
| // distributed under the License is distributed on an "AS IS" BASIS, |
| // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| // See the License for the specific language governing permissions and |
| // limitations under the License. |
| |
| Kythe URI Specification |
| ======================= |
| Michael J. Fromberger <fromberger@google.com> |
| v0.1.1, 29-Oct-2014: Draft |
| |
| This document defines the schema for Kythe uniform resource identifiers ("Kythe URIs"). |
| |
| The primary purpose of a Kythe URI is to provide a textual encoding of a Kythe |
| VName, which is a unique identifier for a node in the semantic graph generated |
| by Kythe-compatible tools. A Kythe URI may also be extended to encode simple |
| queries about a particular VName in a transportable format. |
| |
| The identifiers described in this document are compatible with the grammar |
| given in http://tools.ietf.org/html/rfc3987[RFC 3987] (Internationalized |
| Resource Identifiers) and thereby also with the underlying grammar from |
| http://tools.ietf.org/html/rfc3986[RFC 3986] (Uniform Resource Identifiers). |
| |
| == Scheme Label |
| |
| The scheme label for Kythe URIs is "`kythe:`". |
| |
| == Character Set |
| |
| A Kythe URI is a string of UCS (Unicode) characters. For storage and |
| transmission, a Kythe URI will be encoded as UTF-8 with no byte-order mark, |
| using Normalization Form NFKC. |
| |
| Except as restricted by the syntax, all UCS characters are valid in a Kythe URI. |
| Reserved characters (e.g., "/", "?") and whitespace must be percent-escaped |
| per Section 2.1 of RFC 3986; for example, " " becomes "`%20`". |
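
As an illustration of the normalization and escaping rules above, the following
Python sketch prepares a single field value for inclusion in a Kythe URI. The
helper name `escape_component` is hypothetical and not part of this
specification. The sketch is deliberately conservative: it escapes every
character outside the RFC 3986 "unreserved" set, including non-ASCII characters
that the grammar would also accept unescaped.

```python
import unicodedata
from urllib.parse import quote


def escape_component(text: str) -> str:
    """Normalize a field value to NFKC, then percent-escape it.

    Assumption: escaping everything except RFC 3986 "unreserved"
    characters (letters, digits, "-", ".", "_", "~") is acceptable,
    even though the grammar permits more characters unescaped.
    """
    normalized = unicodedata.normalize("NFKC", text)
    # quote() encodes the string as UTF-8 and percent-escapes each byte
    # that is not an unreserved character; safe="" also escapes "/".
    return quote(normalized, safe="")


print(escape_component(" "))    # %20
print(escape_component("c++"))  # c%2B%2B
```

A field that may legitimately contain "/" separators, such as a path, would
need those separators left unescaped; this helper treats its input as one
opaque value.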
| |
| == Syntax |
| |
| The following grammar defines the syntax of a Kythe URI. Some productions have |
| provisional values and will change as the Kythe schema evolves. |
| |
| ---- |
| kythe-uri = "kythe:" [corpus] attrs ["#" signature] |
| corpus = "//" label 0*{"/" path-segment} |
| label = ireg-name -- RFC 3987 |
| attrs = ["?" lang-attr] ["?" path-attr] ["?" root-attr] |
| lang-attr = "lang=" language |
| path-attr = "path=" path-segment 0*{"/" path-segment} |
| root-attr = "root=" root-segment 0*{"/" root-segment} |
| language = 1*ipchar -- RFC 3987 |
| signature = 1*ipchar -- RFC 3987 |
| root-segment = 1*ipchar -- RFC 3987 |
| path-segment = 1*{unreserved | pct-encoded | "/"} -- RFC 3987 |
| ---- |
| |
| Note that the order of the attributes (the `attrs` production) is fixed, to |
| ensure that a Kythe URI has a canonical string encoding. |
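
The fixed attribute order can be sketched as a small encoder. The Python
function below is illustrative only: the name `encode_kythe_uri` and its
keyword parameters are assumptions made for this example, and it assumes that
empty fields are simply omitted and that "/" separators inside the corpus,
path, and root are left unescaped. It does not perform NFKC normalization.

```python
from urllib.parse import quote


def encode_kythe_uri(corpus="", language="", path="", root="", signature=""):
    """Assemble a Kythe URI string with attributes in grammar order.

    Hypothetical helper: empty fields are omitted, and the attributes
    are always written as lang, then path, then root.
    """
    parts = ["kythe:"]
    if corpus:
        # Keep "/" so that the corpus reads as label plus path segments.
        parts.append("//" + quote(corpus, safe="/"))
    if language:
        parts.append("?lang=" + quote(language, safe=""))
    if path:
        parts.append("?path=" + quote(path, safe="/"))
    if root:
        parts.append("?root=" + quote(root, safe="/"))
    if signature:
        parts.append("#" + quote(signature, safe=""))
    return "".join(parts)


print(encode_kythe_uri(corpus="corpusname", language="c++",
                       path="file/base/file.h", signature="class-Foo"))
# kythe://corpusname?lang=c%2B%2B?path=file/base/file.h#class-Foo
```

Because the order is decided inside the encoder rather than by the caller, two
callers supplying the same fields produce the same string.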
| |
| Examples (subject to change): |
| |
| * Empty (no fields): `kythe:` |
| * Signature only: `kythe:#loc-a90320dafd60` |
| * Ad-hoc corpus (signature, corpus, path, language): `kythe://corpusname?lang=c%2B%2B?path=file/base/file.h#class-Foo` |
| * Bitbucket (corpus, path): `kythe://bitbucket.org/creachadair/stringset?path=README.md` |
| * Maven (corpus, path, language): `kythe://maven.org/central/org/apache/thrift?lang=java?path=libthrift/0.9.1` |
| * Language, path, signature: `kythe:?lang=go?path=mapreduce/go/contrib/plan.go#MR` |
| * Corpus, path, language: `kythe://code.google.com/p/go.tools?lang=go?path=cmd/godoc/doc.go` |
| * Alternate root: `kythe://chromium.org/chrome?path=openssl/crypto/bf/bf_pi.h?root=third_party/openssl/1650` |
| |
| === Rationale |
| |
| The grammar for `kythe-uri` is compatible with the generic URI syntax defined |
| in RFC 3986, so even a simple generic URI parser can split a Kythe URI into |
| its high-level components: the "hostname" and "path" components of the |
| generic URI together represent the `corpus`, the "query" component captures |
| the `attrs`, and the "fragment" component captures the `signature`. |
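
To illustrate this mapping, the following Python sketch uses the standard
library's generic `urllib.parse.urlsplit` to break a Kythe URI into its
fields. The function name `split_kythe_uri` and the shape of the returned
dictionary are assumptions made for this example; the sketch does not check
that the attributes appear in the required order.

```python
from urllib.parse import unquote, urlsplit


def split_kythe_uri(uri: str) -> dict:
    """Split a Kythe URI into its fields using a generic URI parser.

    Hypothetical helper: missing fields are returned as empty strings.
    """
    parts = urlsplit(uri)
    if parts.scheme != "kythe":
        raise ValueError("not a Kythe URI: %r" % uri)

    # Generic "hostname" plus "path" together form the corpus.
    corpus = unquote(parts.netloc + parts.path)

    # The generic "query" holds the attrs, separated by "?" rather
    # than "&". Split before unescaping so that an escaped "?" inside
    # a value is not mistaken for a separator.
    attrs = {}
    if parts.query:
        for item in parts.query.split("?"):
            key, _, value = item.partition("=")
            attrs[key] = unquote(value)

    return {
        "corpus": corpus,
        "language": attrs.get("lang", ""),
        "path": attrs.get("path", ""),
        "root": attrs.get("root", ""),
        "signature": unquote(parts.fragment),  # generic "fragment"
    }


fields = split_kythe_uri(
    "kythe://bitbucket.org/creachadair/stringset?path=README.md")
print(fields["corpus"])  # bitbucket.org/creachadair/stringset
print(fields["path"])    # README.md
```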
| |
| The meaning of the strings generated by the `corpus` production is not defined |
| in this specification; the intent is to allow a corpus to behave like a |
| hostname, so that a server providing Kythe data can use the corpus string to |
| locate the data for that corpus. For services that support many independent |
| corpora (e.g., github.com, bitbucket.org, code.google.com), the corpus field |
| will probably include information about the project directly (e.g., |
| "code.google.com/p/go.text"). In cases where there is only a single corpus |
| with a body of different branches or subdivisions, some of that context may |
| be stored in the `root` attribute instead. |
| |
| The choice between these representations depends mainly on whether the |
| "project" label is likely to vary. A github.com repository rarely changes its |
| name, so it makes sense to include the repository name as part of the corpus |
| and reserve the `root` field for branches. The encoding of the URI is |
| agnostic to this decision. |