Kythe URI Specification

This document defines the schema for Kythe uniform resource identifiers ("Kythe URI").

The primary purpose of a Kythe URI is to provide a textual encoding of a Kythe VName, which is a unique identifier for a node in the semantic graph generated by Kythe-compatible tools. A Kythe URI may also be extended to encode simple queries about a particular VName in a transportable format.

The identifiers described in this document are compatible with the grammar given in RFC 3987 (Internationalized Resource Identifiers) and thereby also with the underlying grammar from RFC 3986 (Uniform Resource Identifiers).

Scheme Label

The scheme label for Kythe URIs is "kythe:".

Character Set

A Kythe URI is a string of UCS (Unicode) characters. For storage and transmission, a Kythe URI is normalized to Normalization Form NFKC and encoded as UTF-8 with no byte-order mark.

Except as restricted by the syntax, all UCS characters are valid in a Kythe URI. Reserved characters (e.g., "/", "?") and whitespace must be percent-escaped per Section 2.1 of RFC 3986, e.g., " " becomes "%20".
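
As a rough illustration of the normalization and escaping rules above, the following Python sketch prepares a single component value (for example, a signature) for inclusion in a Kythe URI. The function name escape_component and the exact set of escaped characters (the RFC 3986 reserved set, plus "%" and whitespace) are assumptions made for this example; they are not defined by this specification.

```python
import unicodedata

# Assumption: RFC 3986 gen-delims and sub-delims, plus "%" so that
# literal percent signs cannot be mistaken for escape sequences.
RESERVED = set(":/?#[]@!$&'()*+,;=%")


def escape_component(text: str) -> str:
    """Normalize a component value to NFKC and percent-escape it."""
    normalized = unicodedata.normalize("NFKC", text)
    out = []
    for ch in normalized:
        if ch in RESERVED or ch.isspace():
            # Escape each UTF-8 byte of the character as %XX.
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
        else:
            # All other UCS characters are left as they are.
            out.append(ch)
    return "".join(out)


print(escape_component("a b"))    # a%20b
print(escape_component("a/b?c"))  # a%2Fb%3Fc
```

This sketch escapes every reserved character in the value, which suits a component in which "/" has no structural meaning. A path value, where "/" separates segments, would need the separators left unescaped.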

Syntax

The following grammar defines the syntax of a Kythe URI. Some productions have provisional values and will change as the Kythe schema evolves.

kythe-uri    = "kythe:" [corpus] attrs ["#" signature]
corpus       = "//" label 0*{"/" path-segment}
label        = ireg-name -- RFC 3987
attrs        = ["?" lang-attr] ["?" path-attr] ["?" root-attr]
lang-attr    = "lang=" language
path-attr    = "path=" path-segment 0*{"/" path-segment}
root-attr    = "root=" root-segment 0*{"/" root-segment}
language     = 1*ipchar  -- RFC 3987
signature    = 1*ipchar  -- RFC 3987
root-segment = 1*ipchar  -- RFC 3987
path-segment = 1*{unreserved | pct-encoded | "/"}  -- RFC 3987

The order of the attributes in the attrs production (lang, then path, then root) is fixed, so that a Kythe URI has a single canonical string encoding.
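
To show how the fixed attribute order yields a canonical string, here is a minimal Python sketch that assembles a Kythe URI from the fields named in the grammar. The VName dataclass, its field names, and the function to_kythe_uri are hypothetical and exist only for this example. The sketch also assumes that every field value has already been normalized and percent-escaped.

```python
from dataclasses import dataclass


@dataclass
class VName:
    """Hypothetical container for the fields that the grammar encodes."""
    corpus: str = ""
    language: str = ""
    path: str = ""
    root: str = ""
    signature: str = ""


def to_kythe_uri(v: VName) -> str:
    """Assemble a Kythe URI, always emitting attributes as lang, path, root."""
    parts = ["kythe:"]
    if v.corpus:
        parts.append("//" + v.corpus)
    # The order of this tuple mirrors the attrs production.
    for key, value in (("lang", v.language), ("path", v.path), ("root", v.root)):
        if value:
            parts.append(f"?{key}={value}")
    if v.signature:
        parts.append("#" + v.signature)
    return "".join(parts)


v = VName(signature="abc", path="src/main.go", language="go",
          corpus="example.com/repo")
print(to_kythe_uri(v))  # kythe://example.com/repo?lang=go?path=src/main.go#abc
```

Because the order is fixed inside the function, the result does not depend on the order in which the fields were supplied.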

For queries, path-segment is resolved as specified in RFC 3986 Section 5.2.4 (Remove Dot Segments).
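
The dot-segment removal referenced above can be sketched in Python as follows. This is an illustrative transcription of the steps in RFC 3986 Section 5.2.4, not code taken from any Kythe tool, and the function name remove_dot_segments is chosen only for this example.

```python
def remove_dot_segments(path: str) -> str:
    """Resolve "." and ".." segments following RFC 3986 Section 5.2.4."""
    output = []  # completed segments, each keeping its leading "/"
    rest = path  # the input buffer still to be processed
    while rest:
        if rest.startswith("../"):
            rest = rest[3:]
        elif rest.startswith("./"):
            rest = rest[2:]
        elif rest.startswith("/./"):
            rest = rest[2:]  # replace the "/./" prefix with "/"
        elif rest == "/.":
            rest = "/"
        elif rest.startswith("/../"):
            rest = rest[3:]  # replace the "/../" prefix with "/"
            if output:
                output.pop()  # drop the previous segment
        elif rest == "/..":
            rest = "/"
            if output:
                output.pop()
        elif rest in (".", ".."):
            rest = ""
        else:
            # Move the first segment (with any leading "/") to the output.
            end = rest.find("/", 1)
            if end == -1:
                output.append(rest)
                rest = ""
            else:
                output.append(rest[:end])
                rest = rest[end:]
    return "".join(output)


print(remove_dot_segments("/a/b/c/./../../g"))  # /a/g
print(remove_dot_segments("a/b/../c/./d.go"))   # a/c/d.go
```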

Rationale

The grammar for kythe-uri is compatible with the generic URI syntax defined in RFC 3986, so that a fairly naive generic parser can split a Kythe URI into its high-level components: the "authority" (hostname) and "path" components of the generic URI represent the corpus, the "query" component captures the attrs, and the "fragment" component captures the signature.
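
As an illustration of this point, the sketch below uses the generic splitter in Python's standard library (urllib.parse.urlsplit) to pull a Kythe URI apart, then maps the generic components onto the Kythe fields. The function parse_kythe_uri and the keys of the dictionary it returns are assumptions made for this example. The sketch does not validate the attribute order or the character set.

```python
from urllib.parse import unquote, urlsplit


def parse_kythe_uri(uri: str) -> dict:
    """Split a Kythe URI into its fields using a generic URI parser."""
    parts = urlsplit(uri)
    if parts.scheme != "kythe":
        raise ValueError(f"not a Kythe URI: {uri!r}")
    # The generic "query" holds all attrs; Kythe separates them with "?".
    attrs = {}
    for attr in parts.query.split("?"):
        if attr:
            key, _, value = attr.partition("=")
            attrs[key] = unquote(value)
    return {
        # Generic authority plus path together form the corpus.
        "corpus": unquote(parts.netloc + parts.path),
        "language": attrs.get("lang", ""),
        "path": attrs.get("path", ""),
        "root": attrs.get("root", ""),
        # The generic fragment is the signature.
        "signature": unquote(parts.fragment),
    }


fields = parse_kythe_uri("kythe://example.com/repo?lang=go?path=src/main.go#sig%20one")
print(fields["corpus"])     # example.com/repo
print(fields["signature"])  # sig one
```

The generic splitter ends the query at "#" and does not interpret the extra "?" characters inside it, so splitting the query on "?" afterward is enough to recover the individual attributes.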

The meaning of the strings generated by the corpus production is not defined in this specification; the intent is to allow a corpus to behave like a hostname, so that a server providing Kythe data can use the corpus string to locate the data for that corpus. For services that support many independent corpora (e.g., github.com, bitbucket.org, code.google.com), the corpus field will probably include information about the project directly (e.g., "code.google.com/p/go.text"). Where a single corpus contains several branches or subdivisions, some of that context may be stored in the root attribute instead.

Which representation to choose depends mainly on whether the "project" label is likely to vary. A github.com repo rarely changes its name, so it makes sense to include the repo name as part of the corpus and reserve the root field for branches. The encoding of the URI is agnostic to this decision.