Kythe - Kythe Compilation Database (KCD) Specification

Summary
Background & Motivation
Kythe Compilation Database

Summary

This document describes the Kythe compilation database, an index of build information used by Kythe to perform semantic analysis of source code.

Background & Motivation

For Kythe to index a source file, we need to know all of the dependencies of that file (e.g., imports or include files), as well as any settings that control the compiler’s behaviour in processing that file (e.g., environment variables, flags). Files often depend on generated code (e.g., protobuf wrappers, SWIG), produced as part of the build process. Thus: In order to index a file, Kythe usually must first build that file—which we do in Kythe using Bazel.

For several reasons, Kythe does not index during the build. Instead, we capture a record of each compile action taken by the build process and store it for separate processing. The main reasons for this separation are:

Resource constraints. Builds often run in a special-purpose build environment, specialized to handle build executions and typically under high load. Kythe indexers run with a CPU and output profile that isn’t a good fit for this environment. By storing the build information, we can process it "offline" from the build, in a less-constrained environment.
Reusability. Besides Kythe indexers, there are other static analysis tools that require the same basic data that Kythe uses. Rather than run repeated builds, capturing and storing the compilation records allows these tools to take advantage of the same work. It is also helpful to be able to replay a stored compilation for testing and repro purposes, without the need to re-invoke the build system.
Historical data. Maintainers of important core libraries find it helpful to have records of compilation data over a longer span of time, e.g., for analysis of API usage. Keeping an archive of compilation settings for a longer period of time than the build caches (order of months, vs. order of days) makes it easier to support this kind of exploration.

Kythe Compilation Database

To address these needs, we use a compilation storage format called a compilation database. This is similar in many respects to the language-specific compilation databases produced by tools like Clang.

Overview

A Kythe compilation database represents a storage mechanism for compilation data captured from a build system. It consists of two parts:

The store is a content-addressable store of compilation records and file contents. Files and compilations are addressed via a lowercase hex-encoded SHA256 digest of their contents.
The index records revision information and supports efficient lookup of compilation units from some of their properties. This includes:
- A revisions index, recording which complete revisions (e.g., CLs, commit hashes) are recorded in the database, and to which corpus they belong.
- A compilation index of query terms for each compilation unit, including target name, source files, revision, corpus label, and language.
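
As a minimal sketch of the addressing scheme described above, the following Go program computes a lowercase hex-encoded SHA256 digest of some content. The function name contentDigest is hypothetical and is not taken from the Kythe code; only the digest encoding comes from the text.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// contentDigest returns the lowercase hex-encoded SHA256 digest of data.
// The name is an illustrative assumption, not part of the Kythe API.
func contentDigest(data []byte) string {
	sum := sha256.Sum256(data)
	// hex.EncodeToString always emits lowercase hexadecimal digits.
	return hex.EncodeToString(sum[:])
}

func main() {
	fmt.Println(contentDigest([]byte("abc")))
	// prints ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad
}
```

Because the digest depends only on the bytes, two stores holding the same file content would address it by the same key.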

Terminology

A compilation unit is a record of a single action taken by the build system. Typically this corresponds to the invocation of a compiler with a particular set of flags and input files.
A corpus label is a string that identifies a corpus of files governed by a source repository and build system.
A digest is a lowercase hex-encoded digest used to identify an object in the content-addressable store. A unit digest identifies a compilation record ("compilation unit"), while a file digest identifies a file.

A file digest is constructed by encoding the SHA256 digest of the file’s content, and is the same across all compilation databases.

A unit digest may be constructed the same way based on the storage format of the compilation record, but is not required to be the same from one database to another (as storage formats may differ).
A format key is a string that provides an optional type hint for the data stored in a compilation unit. In Kythe we use the format key kythe to mean a kythe.proto.CompilationUnit.
A revision marker is a string that identifies a revision within a corpus. A revision marker must be nonempty and contain no ASCII whitespace, but is otherwise unconstrained. A revision marker is expected to be unique among revisions for its corpus. In a Git repo, for example, we will use a commit hash.
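
The constraints on a revision marker can be checked mechanically. The sketch below is one possible check, not code from Kythe: the function name validRevisionMarker is hypothetical, and the exact set of characters treated as ASCII whitespace (space, tab, newline, vertical tab, form feed, carriage return) is an assumption, since the text does not enumerate them.

```go
package main

import (
	"fmt"
	"strings"
)

// asciiSpace is the set of characters treated as ASCII whitespace here.
// This set is an assumption; the specification does not list them.
const asciiSpace = " \t\n\v\f\r"

// validRevisionMarker reports whether marker is nonempty and contains
// no ASCII whitespace. The name is hypothetical.
func validRevisionMarker(marker string) bool {
	return marker != "" && !strings.ContainsAny(marker, asciiSpace)
}

func main() {
	for _, m := range []string{"4f2a9c1", "", "rev 12"} {
		fmt.Printf("%q valid=%v\n", m, validRevisionMarker(m))
	}
}
```

Uniqueness within a corpus is not something a per-string check can establish; that would have to be enforced by whatever writes the revisions index.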

Interface

The interface to the compilation database is via the following abstract methods:

Revisions returns the revision marker, corpus label, and timestamp for each indexed revision matching the query terms.
Find returns the digests of all compilation units in the store matching the given query terms. The query terms supported include: revision, language, corpus label, target name, source path, and output path.
Units returns the stored compilation data matching the given unit digests. The storage format of compilation records may differ by implementation, so only unit digests returned by an instance's Find method may be considered valid for that KCD instance.
Files returns the stored file data matching the given digests.
FilesExist checks whether file data is stored for the given file digests. The method returns all the proffered file digests that exist in the store.
WriteRevision adds or replaces a revision in the revisions index. A revision is specified as a revision marker and a corpus.
WriteUnit adds a compilation unit to the content-addressable store and updates the compilation index. The unit digest of the stored compilation is returned (as by Find).
WriteFile adds the contents of a file to the content-addressable store. The file digest of the stored file is returned.

A read-only implementation may omit the WriteRevision, WriteUnit, and WriteFile methods, or provide stubs that always return an error.
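
To make the shape of these methods concrete, here is a toy in-memory sketch covering a subset of them (Find, FilesExist, WriteUnit, WriteFile) together with a read-only wrapper whose write methods always return an error. It is deliberately simplified and does not reproduce the signatures in kythe/go/platform/kcd: the Store interface, the Unit struct, the two-term Find filter, and every name below are assumptions made for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"sort"
)

// Unit is a hypothetical, stripped-down compilation record.
type Unit struct {
	Corpus, Language, Target string
	Data                     []byte // stored form of the record
}

// Store is a simplified, assumed subset of the abstract methods.
type Store interface {
	Find(corpus, language string) []string
	FilesExist(digests []string) []string
	WriteUnit(u Unit) (string, error)
	WriteFile(data []byte) (string, error)
}

// digest returns the lowercase hex-encoded SHA256 digest of data.
func digest(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

type memStore struct {
	files map[string][]byte
	units map[string]Unit
}

func newMemStore() *memStore {
	return &memStore{files: map[string][]byte{}, units: map[string]Unit{}}
}

// WriteFile stores data under its file digest and returns that digest.
func (m *memStore) WriteFile(data []byte) (string, error) {
	d := digest(data)
	m.files[d] = data
	return d, nil
}

// WriteUnit stores u under a digest of its stored form and returns it.
func (m *memStore) WriteUnit(u Unit) (string, error) {
	d := digest(u.Data)
	m.units[d] = u
	return d, nil
}

// Find returns the sorted digests of units matching every nonempty term.
func (m *memStore) Find(corpus, language string) []string {
	var out []string
	for d, u := range m.units {
		if (corpus == "" || u.Corpus == corpus) && (language == "" || u.Language == language) {
			out = append(out, d)
		}
	}
	sort.Strings(out)
	return out
}

// FilesExist returns those of the given digests that are in the store.
func (m *memStore) FilesExist(digests []string) []string {
	var out []string
	for _, d := range digests {
		if _, ok := m.files[d]; ok {
			out = append(out, d)
		}
	}
	return out
}

var errReadOnly = errors.New("store is read-only")

// readOnly delegates reads to the embedded Store and stubs out writes.
type readOnly struct{ Store }

func (readOnly) WriteUnit(Unit) (string, error)   { return "", errReadOnly }
func (readOnly) WriteFile([]byte) (string, error) { return "", errReadOnly }

func main() {
	db := newMemStore()
	fd, _ := db.WriteFile([]byte("package main\n"))
	ud, _ := db.WriteUnit(Unit{Corpus: "kythe", Language: "go", Target: "//foo:bar", Data: []byte("unit-1")})
	fmt.Println(db.FilesExist([]string{fd, "missing"})[0] == fd)
	fmt.Println(db.Find("kythe", "go")[0] == ud)
	_, err := readOnly{db}.WriteFile([]byte("x"))
	fmt.Println(err)
}
```

Embedding the read side and overriding only the write methods keeps the read-only variant small; omitting the write methods from the type entirely is the other option the text allows.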

Implementations

A Go description of the abstract interface, along with some support code, is defined in kythe/go/platform/kcd.

Concrete implementations:

In-memory (memdb.go). Build target: //kythe/go/platform/kcd:memdb
Unit tests for an arbitrary kcd.ReadWriter value can be built using the helpers in testutil.go.

The intended goal of this design is that clients will use the compilation database via a service interface, and will not need a heavyweight client library for common tasks such as locating and analyzing compilations.