This document provides an overview of Kythe’s configurable extraction framework and serves as a getting-started guide for adding Kythe extraction support for a new build system.


The Kythe Configurable Extraction system is designed to be a generalized solution providing support for running Kythe extraction on a diversity of build systems. The system consists of a per-repository configuration file and a collection of tools that consume that file to generate a customized extraction image tailored to that repository. An extraction image defines a host environment for hermetically building the repository’s contents (e.g., in Docker) with Kythe extractor tools injected, to generate Kythe Compilation Units. These units are consumed downstream for static analysis and indexing.

The generated extraction image is a Docker image, a standardized container format. Git is currently used for retrieving repository contents; support for other source control tools can be added as needed.

Note that this is a work in progress, not a finished product. The intent is to document the system as it evolves in order to provide early adopters with a means of trying it out and providing feedback.

Extraction Configuration Schema

An extraction configuration is used to construct a Docker image suitable for building and extracting a given repository. This schema defines a low-level configuration format. Where practical, configuration settings will be inferred automatically, but in cases where that is not possible, a user-friendly interface may be added to allow users to control the extraction behavior directly. For the time being, this intermediate configuration schema can be utilized by users who would like to get a head-start on enabling Kythe on their repositories. The configuration schema is defined within extraction_config.proto.

Extraction Configuration Usage

Instances of this configuration schema can be placed in the root directory of the repository in a file named ".kythe-extraction-config", formatted as a JSON-encoded protobuf. An example of an existing extraction configuration can be found here: mvn_config.json. The extraction image generated from the mvn_config.json file can be found here: expected_mvn_config.Dockerfile. This configuration serves as an input to the extractrepo tool, which executes the Kythe extraction process on a given repository.
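
As a rough illustration of the usage described above, the following Python sketch writes a configuration to ".kythe-extraction-config" as JSON. The field names follow the components described in this document and should be checked against extraction_config.proto. The image URI, paths, and values are hypothetical placeholders, and write_config is a helper invented for this example.

```python
import json
from pathlib import Path

# Field names mirror the components described in this document; verify them
# against extraction_config.proto. All values below are hypothetical.
config = {
    "required_image": [
        {
            "uri": "example.com/hypothetical/build-tools:latest",
            "name": "buildtools",
            "copy_spec": [{"source": "/opt/tools"}],
            "env_var": [{"name": "TOOLS_HOME", "value": "/opt/tools"}],
        }
    ],
    "run_command": [{"command": "apt-get", "arg": ["update"]}],
    "entry_point": ["/opt/tools/wrapper"],
}


def write_config(repo_root, cfg):
    """Write cfg as JSON to the config file in the repository root."""
    path = Path(repo_root) / ".kythe-extraction-config"
    path.write_text(json.dumps(cfg, indent=2) + "\n")
    return path
```

Calling write_config("path/to/repo", config) places the file where the document says it is picked up when no explicit configuration path is given.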

Extraction Configuration Components

repeated Image required_image

This field defines a set of artifacts from a base image to copy into the generated extraction image, where for each listed required_image, the Docker image will have:

 FROM <image.uri> as <image.name>
 # ...repeated...
 COPY <image.copy_spec.source> <image.copy_spec.dest>
 # ...repeated...
 ENV <image.env_var.name>=<image.env_var.value>

The Image message has the following parts:

repeated CopySpec copy_spec defines a list of artifacts to be copied from the base image into the generated extraction image.

string uri defines the URI of a base Docker image. This can refer to images in either local or online Docker container registries.

string name defines a unique name for this image, to be referenced when copying artifacts.

repeated EnvVar env_var defines environment variables within the generated extraction image related to the artifacts copied from the base image.
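
To make the mapping from required_image entries to Dockerfile lines concrete, here is a minimal sketch that renders the template shown above. It is not the generator used by the Kythe tools. It assumes the FROM alias is the image's name and the ENV key is the variable's name, and it follows the template's ordering (all FROM lines, then COPY, then ENV). The Python class and attribute names are chosen for this example only.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CopySpec:
    source: str
    dest: str


@dataclass
class EnvVar:
    name: str
    value: str


@dataclass
class Image:
    uri: str
    name: str
    copy_spec: List[CopySpec] = field(default_factory=list)
    env_var: List[EnvVar] = field(default_factory=list)


def render_required_images(images: List[Image]) -> List[str]:
    """Render Dockerfile lines in the order the template lists them."""
    # One FROM line per required image.
    lines = [f"FROM {img.uri} as {img.name}" for img in images]
    # Then one COPY line per copy_spec.
    for img in images:
        lines += [f"COPY {c.source} {c.dest}" for c in img.copy_spec]
    # Then one ENV line per env_var.
    for img in images:
        lines += [f"ENV {e.name}={e.value}" for e in img.env_var]
    return lines
```

The Dockerfile actually generated by the tools may contain more than this sketch produces; expected_mvn_config.Dockerfile is the authoritative example.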

repeated RunCommand run_command

This field configures the execution of arbitrary RUN commands during the construction of the generated extraction image. This allows required resources that have no corresponding base Docker image to be installed. For each listed run_command, the Docker image will have:

RUN <command> "<arg[0]>" "<arg[1]>" ...

repeated string entry_point

This field defines the entry point for the generated image. The entry point is the logic which is run when the generated image’s container is started. This is typically a script or binary which initiates the build and extraction process. An example entry point binary can be found here: runextractor.go. The listed entry_point values together produce a single instruction in the Docker image:

ENTRYPOINT ["<entrypoint[0]>", "<entrypoint[1]>", ...]
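
The RUN and ENTRYPOINT templates above can be sketched in the same way. This is an illustrative rendering under the assumption that arguments are wrapped in double quotes exactly as the template shows. It does not escape quotes inside arguments, and the function names are hypothetical.

```python
import json
from typing import List


def render_run(command: str, args: List[str]) -> str:
    """Render one run_command as: RUN <command> "<arg[0]>" "<arg[1]>" ..."""
    quoted = " ".join(f'"{a}"' for a in args)
    return f"RUN {command} {quoted}".rstrip()


def render_entrypoint(entry_point: List[str]) -> str:
    """Render all entry_point values as a single exec-form ENTRYPOINT."""
    # A JSON array of strings matches Docker's exec form syntax.
    return "ENTRYPOINT " + json.dumps(entry_point)
```

For example, render_entrypoint(["/bin/wrapper", "--mvn"]) yields one ENTRYPOINT line whose array holds both values, which reflects that the repeated field describes a single command line rather than several entry points.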

Extraction Image Volumes

Each generated extraction image contains default volumes for input and output during the extraction process. These utilize the Docker volume feature to specify host directories which are mounted within the running container.


/repo

This volume contains the contents of the repository to be processed by the Kythe extraction framework. It should have read and write privileges, because some build systems' configuration files must be pre-processed before extraction can succeed.


/out

This volume will contain the output artifacts of the Kythe extraction process in the form of kzip files (note: this format may change in the future). Any diagnostic output from extractors will also be written here. This directory should have read and write privileges.
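
For readers running a generated image by hand, the sketch below builds a docker run command line that bind-mounts two host directories at the /repo and /out volumes. It only constructs the argument list. It is not necessarily how extractrepo invokes Docker, and the image tag and function name are hypothetical.

```python
import os
from typing import List


def docker_run_args(image_tag: str, repo_dir: str, out_dir: str) -> List[str]:
    """Build a docker run command mounting the input and output volumes."""
    return [
        "docker", "run", "--rm",
        # Bind mounts are read-write by default, as both volumes require.
        "-v", f"{os.path.abspath(repo_dir)}:/repo",
        "-v", f"{os.path.abspath(out_dir)}:/out",
        image_tag,
    ]
```

The resulting list can be passed to subprocess.run(args, check=True) once the image has been built.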

Extraction Image Environment Variables

In addition to environment variables defined by the configuration schema, generated extraction images also contain a default set of environment variables facilitating access to input and output for extractors running within the container.


This environment variable points to the volume mount path for the /repo volume.


This environment variable points to the volume mount path for the /out volume.

Extraction Wrapper

In the process of enabling support for a new build system, it is common to implement a build system wrapper which serves as the entry point for the generated extraction image. This wrapper is responsible for any pre-processing of build configuration files which might be necessary, as well as invoking the build system with the arguments necessary to hook the extractor into the build system’s compilation step. An example of such a wrapper can be found here: runextractor.go.

A common pattern is to have the wrapper as well as any language specific extraction binaries bundled within an extraction artifacts base image for use in the extraction configuration. An example of such an artifacts base image can be found here: kythe/extractors/java/artifacts.

Extraction Tools

The Kythe project contains a collection of tools for running and testing extraction manually. The documentation for these tools can be found here:

These tools require the following two programs to be locally installed and accessible on the $PATH: Docker and Git.

The extractrepo binary provides a tool for running an extraction manually. It consumes an extraction configuration file either specified as a command line argument, or else contained within the ".kythe-extraction-config" file in the root of the repository. The binary generates the extraction image, clones the repository, and then runs the extraction image’s container to perform the Kythe extraction on its contents. The usage for the binary is as follows:

extractrepo -repo <repo_uri> -output <output_file_path> [-config <config_file_path>]
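
When scripting extractions, the usage line above can be assembled programmatically. The following sketch uses only the flags shown in this document and treats the configuration path as optional, as described above. The function name is hypothetical.

```python
from typing import List, Optional


def extractrepo_args(repo_uri: str, output_path: str,
                     config_path: Optional[str] = None) -> List[str]:
    """Build the extractrepo command line from the documented flags."""
    args = ["extractrepo", "-repo", repo_uri, "-output", output_path]
    # Without -config, the tool is described as falling back to the
    # ".kythe-extraction-config" file in the repository root.
    if config_path is not None:
        args += ["-config", config_path]
    return args
```

Passing the list to subprocess.run(args, check=True) runs the tool, provided extractrepo, Docker, and Git are on the $PATH.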

The repotester binary provides a tool which runs an extraction on a given repository, and then runs a smoke test to verify adequate file coverage on the extraction’s output. The usage for the binary is as follows:

repotester -repos <comma_delimited,repo_urls> [-config <config_file_path>] [-github_token <github_token>]
repotester -repo_list_file <file> [-config <config_file_path>] [-github_token <github_token>]
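
The two usage forms can be wrapped in one helper. This sketch assumes the two ways of supplying repositories are alternatives, so exactly one of them must be given. That constraint is inferred from the usage lines, not confirmed by the tool's documentation, and the function name is hypothetical.

```python
from typing import List, Optional, Sequence


def repotester_args(repos: Optional[Sequence[str]] = None,
                    repo_list_file: Optional[str] = None,
                    config_path: Optional[str] = None,
                    github_token: Optional[str] = None) -> List[str]:
    """Build a repotester command line from the documented flags."""
    # Assumption: -repos and -repo_list_file are mutually exclusive.
    if (repos is None) == (repo_list_file is None):
        raise ValueError("pass exactly one of repos or repo_list_file")
    args = ["repotester"]
    if repos is not None:
        # -repos takes a single comma-delimited value.
        args += ["-repos", ",".join(repos)]
    else:
        args += ["-repo_list_file", repo_list_file]
    if config_path is not None:
        args += ["-config", config_path]
    if github_token is not None:
        args += ["-github_token", github_token]
    return args
```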