Summary

This document specifies a compact persistent storage representation for compilation information, suitable for use by Kythe to generate cross-reference data.

Background

To generate cross-references, Kythe captures a record of each compilation that is to be indexed (e.g., a library or binary) with enough information to enable us to replay the compilation to the front-end of the compiler. This record consists of a CompilationUnit protobuf message, together with the content of all the source files and other inputs the compiler needs to process the compilation (e.g., header files or type snapshots from dependencies).

For a single compilation, we can store this record in a .kindex file, which contains a GZip-compressed sequence of length-prefixed binary records (length encoded as a protobuf-style varint), organized as:

Record #

Content

0

CompilationUnit protobuf, wire format [specifying n required inputs]

1

FileData protobuf, wire format [file 1 contents]

n

… [file n contents]

This file-format is adequate for a single compilation. It doesn’t work well for indexing a corpus comprising many compilations, however, because such compilations typically share many input files in common, and storing the complete record per file means we duplicate the contents of all those shared input files that are shared across multiple records.

Index Pack Format

To solve the problems described above, we use a more compact format called an index pack. An index pack is a specially-structured directory (using a layout inspired by Maildir):

root/                # Any legal directory name is fine here
   units/
     abcd1234.unit   # Compilation unit (see below for format)
     …               # (name is hex-coded SHA256 of uncompressed data)
   files/
     1a2b3c4e.data   # File contents, compressed
     …               # (name is hex-coded SHA256 of uncompressed data)

This organization separates the compilation unit descriptions from their file data, which are shared among multiple compilations. Transporting the index pack is accomplished by archiving it with tar or zip or another format as desired.

Directory and File Layout

An index pack is a directory (called the root) containing two subdirectories, one named units and one named files.

  • The units subdirectory may contain only unit files and temp files, as described below.

  • The files subdirectory may contain only data files and temp files, as described below.

  • Other files or directories inside the units or files subdirectories should cause a tool to consider the index pack invalid.

  • Other files or subdirectories in the root should be ignored by a tool processing the index pack.

A unit file is a file containing a GZip-compressed compilation unit description. The name of a unit file has the form <digest>.unit, where <digest> denotes a lower-case hex-encoded SHA256 digest of the uncompressed compilation unit description stored in the file.

A data file is a file containing a GZip-compressed blob of file data. The name of a data file has the form <digest>.data, where <digest> denotes a lower-case hex-encoded SHA256 digest of the uncompressed blob data stored in the file.

A temp file is a file containing arbitrary data being staged into the index pack. The name of a temp file has the form <uuid>.new, where <uuid> denotes an ASCII-encoded Version 4 UUID as described by RFC 4122.

Compilation Unit Description Format

The contents of a unit file are a GZip-compressed compilation unit description. The description is encoded as a single JSON object, having the following top-level structure:

{
  "format":  <format-key>,
  "content": <content-object>
}

The two keys "format" and "content" are required; any additional keys must be ignored by a reader that does not understand them (this permits forward migration to newer versions). The values of these fields are defined as follows:

<format-key>

A string value describing the format of the content. This value may be empty. A reader that discovers a <format-key> value it does not recognize should skip processing the unit in that file.

<content-object>

A JSON object carrying the compilation description. A reader that discovers a missing <content-object> field, or is unable to parse the value, should consider the unit ill-formed and return an error.

The format key is intended to support versioning of the format of compilation units across potentially-breaking changes. Tools that write an index pack can be changed to emit multiple unit formats into the same pack without waiting for the consumers to all be updated. Once migration is complete, the old format can be dropped.

Format Values

The following format values are currently defined:

Key

Object description

"kythe"

JSON encoding of a single Kythe CompilationUnit message.

Encoding and Decoding Protocol

A reader must be able to decode compilation units from JSON, which is represented in the pseudocode below by a function with signature:

func decodeCompilationUnit(json string, unit *CompilationUnit)

This function decodes the given JSON content object into whatever compilation unit representation the reader prefers, or returns an error.

A writer must be able to encode compilation units to JSON, which is represented in the pseucocode below by a function with signature:

func encodeCompilationUnit(unit CompilationUnit) string

This function takes a compilation unit in whatever format the writer prefers, and returns the preferred format key and the JSON encoding of the message (or signaling an error).

Read Protocol

Multiple concurrent processes can safely read unit files and data files from an index pack directory, even if there are also concurrent writers. The protocol for reading a file from the index pack is described by the following pseudocode:

func readFile(directory, sourceName):
  fp   = file(path.join(directory, sourceName), 'r')
  gz   = GZipDecompressor(input=fp)
  data = gz.readAll()
  fp.close()
  return data

Using readFile as a primitive, data and unit files are read as follows (modulo error-handling):

func readDataFile(root, digest):
  sourceName = digest+'.data'
  return readFile(path.join(root, 'files'), sourceName)
func readUnitFile(root, digest, formatKey, unit):
  sourceName = digest+'.unit'
  contents   = readFile(root, sourceName)
  wrapper    = JSON.decode(contents)
  if wrapper.format == formatKey:
    decodeCompilationUnit(wrapper["content"], unit)

Note: While it is safe to read while there are concurrent writers (in the sense that you will get consistent data), it is not recommended to iterate over the contents of the index pack while it is being written, since the results may be incomplete. This is not a problem during normal use.

Write Protocol

Multiple concurrent processes can safely write into a single index pack, provided the index pack is stored in a filesystem that supports an atomic file-rename operation. Filesystems that are POSIX-compliant support atomic rename via the rename(2) call. The protocol for writing a new file into the index pack is described by the following pseudocode:

func writeFile(directory, targetName, contents):
  tmp = path.join(directory, toHex(UUID4()) + '.new')
  fp  = file(tmp, 'w')
  gz  = GZipCompressor(output=fp)
  gz.write(contents); gz.flush(); fp.close()
  rename(tmp, path.join(directory, targetName))

Using writeFile as a primitive, data and unit files are written as follows:

func writeDataFile(root, contents):
  targetName = toHex(sha256(contents))+'.data'
  writeFile(path.join(root, 'files'), targetName, contents)
func writeUnitFile(root, formatKey, unit):
  content    = encodeCompilationUnit(unit)
  body       = JSON{"format": formatKey, "content": content}.toString()
  targetName = toHex(sha256(body))+'.unit'
  writeFile(path.join(root, 'units'), targetName, body)

Concurrent writers may store the same data without conflict, since files are named by a cryptographic digest of their contents. There is no provision for deleting data from an index pack within the protocol specified here. Note that this specification does not impose any ordering constraints between data files and unit files—in particular, an implementation may write data files before or after the unit file, at its option.