Kythe Compilation ZIP Format (.kzip)
Summary
This document specifies a compact persistent storage representation for compilation records, suitable for use by Kythe to generate cross-reference data and to apply other static analysis tools to source files.
Background
To generate cross-references, Kythe captures a record of each compilation that
is to be indexed (e.g., a library or binary) with enough information to
enable us to replay the compilation to the front-end of the compiler. This
record consists of a
CompilationUnit
protobuf
message, together with the content of all the source files and other inputs the
compiler needs to process the compilation (e.g., header files or type
snapshots from dependencies).
Kythe ZIP Format (.kzip)
To store compilation records compactly, we use a specially formatted ZIP
archive that we call a kzip file, conventionally given the file extension
.kzip
. A kzip file consists of the following directory structure:
root/ # Any valid non-empty directory name
units/
abcd1234 # Compilation unit (JSON format, see below for details)
... # (name is hex-coded SHA256 of record content)
files/
1a2b3c4e # File contents, uncompressed
... # (name is hex-coded SHA256 of uncompressed file content)
This organization separates the compilation unit descriptions from their file data, which are shared among multiple compilations.
.kzip
can offer alternate structures, in which compilation units are formatted in other encodings. Currently the only other encoding defined and supported is "proto". In this case, the directory structure will look like
root/ # Any valid non-empty directory name
pbunits/
abcd1234 # Compilation unit (proto format, see below for details)
... # (name is hex-coded SHA256 of record content)
files/
1a2b3c4e # File contents, uncompressed
... # (name is hex-coded SHA256 of uncompressed file content)
Directory and File Layout
A kzip is a ZIP file containing a top-level root directory that contains two
subdirectories, one named units
and one named files
.
-
The
units
(orpbunits
) subdirectory may contain only unit files. -
The
files
subdirectory may contain only data files. -
Other files or directories inside the
units
orfiles
subdirectories should cause a tool to consider the kzip file invalid. -
Other files or subdirectories in the root or other subdirectories should be ignored by a tool processing the kzip file.
A unit file is a file containing a compilation unit description. The name of a unit file is computed by digesting the compilation unit with SHA256, and encoding the resulting hash as a string of lowercase ASCII hexadecimal digits. This string becomes the filename of the unit file. Note that the digest should only process the CompilationUnit itself, and should not include the other contents of the wrapper message. Details on the digesting algorithm are described below.
A data file is a file containing an unstructured blob of raw (uncompressed) file data. The name name of a data file is computed by hashing the file contents with SHA256, and encoding the resulting hash as a string of lowercase ASCII hexadecimal digits. This string becomes the filename of the data file.
The root directory must be the first entry in the ZIP file, and its name must not be empty.
Compilation Unit Description Format
The content of a unit file is the canonical JSON encoding of a
kythe.proto.IndexedCompilation
protobuf message.
{ "unit": <encoded kythe.proto.CompilationUnit>, "index": { "revision": ["123", "456", "789"] } }
The "unit"
key is required, and must contain the canonical JSON encoding of a
kythe.proto.CompilationUnit
protobuf message. The "index"
key is optional,
but if set must contain the canonical JSON encoding of an Index
message.
For the proto
encoding of compilation units, the content of the unit
is the standard wire-encoding of the kythe.proto.IndexedCompilation
protobuf message.
Computing the digest for a Compilation Unit
The representative implementation can be found in
kythe/go/platform/kcd/kythe/units.go
.
Definitions
-
NULL
-
the one-byte value corresponding to ASCII NULL (
\0x00
) -
NL
-
the one-byte value corresponding to ASCII newline (
\x0a
)
All strings are emitted in UTF-8 encoding.
Canonical form
For the purposes of computing the digest, a compilation unit should be in canonical form. This is defined as:
-
the field
required_input
is deduplicated and sorted according tocu.required_input.path
-
the field
environment
is sorted byenvironment.name
-
the field
source_file
is sorted -
the field
details
is sorted bydetails.type_url
Digest computation
Let v
be a kythe.proto.VName
. Then a digest of v
is
v.signature NULL v.corpus NULL v.root NULL v.path NULL v.language NULL
Let cu
be an instance of kythe.proto.CompilationUnit
. Then a
digest of cu
is computed as the sequence
"CU" NL cu.vname NULL
followed by, for each required input:
"RI" NL cu.required_input[i].vname NULL
"IN" NL cu.required_input[i].info.path NULL cu.required_input[i].info.digest NULL
followed by:
"ARG" NL cu.argument[0] NULL cu.argument[1] NULL ...
"OUT" NL cu.output_key NULL
"SRC" NL cu.source_file[0] NULL cu.source_file[1] NULL ...
"CWD" NL cu.working_directory NULL
"CTX" NL cu.entry_context NULL
followed by, for each cu.environment
:
"ENV" NL cu.environment[i].name NULL cu.environment[i].value NULL
finally followed by, for each cu.details
:
"DET" NL cu.details[i].type_url NULL cu.details[i].value NULL
For the cu.details.value
, this is the sequence of bytes of the
wire-encoding proto representation of the value.