This document provides an overview of Kythe’s configurable extraction framework, and serves as a de facto getting started guide for onboarding support for a new build system for Kythe extraction.
The Kythe Configurable Extraction system is designed to be a generalized solution providing support for running Kythe extraction on a diversity of build systems. The system consists of a per-repository configuration file and a collection of tools that consume that file to generate a customized extraction image tailored to that repository. An extraction image defines a host environment for hermetically building the repository’s contents (e.g., in Docker) with Kythe extractor tools injected, to generate Kythe Compilation Units. These units are consumed downstream for static analysis and indexing.
The generated extraction image is in the format of a Docker image a standardized container format. Git is currently used for retrieving repository contents, however support for other source control tools can be added as needed.
Note that this is a work in progress not a finished product. The intent is to document the system as it evolves in order provide early adopters with a means of trying it out and providing feedback.
Extraction Configuration Schema
An extraction configuration is used to construct a Docker image suitable for building and extracting a given repository. This schema defines a low-level configuration format. Where practical, configuration settings will be inferred automatically, but in cases where that is not possible, a user-friendly interface may be added to allow users to control the extraction behavior directly. For the time being, this intermediate configuration schema can be utilized by users who would like to get a head-start on enabling Kythe on their repositories. The configuration schema is defined within extraction_config.proto.
Extraction Configuration Usage
Instances of this configuration schema can be placed in the root directory of the repository in a file named: ".kythe-extraction-config", formatted as a JSON encoded protobuf. An example of an existing extraction configuration can be found here: mvn_config.json. The corresponding extraction image which gets generated from the mvn_config.json file can be found here: expected_mvn_config.Dockerfile. This configuration serves as an input to the extractrepo tool which executes the Kythe extraction process on a given repository.
Extraction Configuration Components
repeated Image required_image
This field defines a set of artifacts from a base image to copy into the
generated extraction image, where for each listed
required_image, the Docker
image will have:
FROM <image.uri> as <image.name> # ...repeated... COPY <image.copy_spec.source> <image.copy_spec.dest> # ...repeated... ENV <image.env_var.name>=<image.env_var.value>
The Image message has the following parts:
repeated CopySpec copy_spec defines a list of artifacts to be copied from the
base image into the generated extraction image.
string uri defines the URI to a base docker image. This can refer to images
defined within either local or online docker container registries.
string name defines a unique name for this image, to be referenced when
repeated EnvVar env_var defines environment variables within the generated
extraction image related to the artifacts copied from the base image.
repeated RunCommand run_command
This field configures the execution of arbitrary RUN commands during the
construction of the generated extraction image. This provides for the
installation of required resources which may not have corresponding base docker
images. For each listed
run_command, the Docker image will have:
RUN <command> "<arg>" "<arg>" ...
repeated string entry_point
This field defines the entry point for the generated image. The entry point is
the logic which is run when the generated image’s container is started. This is
typically a script or binary which intiates the build and extraction process. An
example entry point script can be found here: mvn-extract.sh.
For each listed
entry_point the Docker image will have:
ENTRYPOINT ["<entrypoint>", "<entrypoint>", ...]
Extraction Image Volumes
Each generated extraction image contains default volumes for input and output during the extraction process. These utilize the Docker volume feature to specify host directories which are mounted within the running container.
This volume contains the contents of the repository to be processed by the Kythe extraction framework. It should have read and write privileges as it is common for some build systems' configuration files to require pre-processing in order for successful extraction.
This volume will contain the output artifacts of the Kythe extraction process in the form of kindex files, (note: this format may change in the future). Any diagnostic output from extractors will also be written here. This directory should have read and write privileges.
Extraction Image Environment Variables
In addition to environment variables defined by the configuration schema, generated extraction images also contain a default set of environment variables facilitating access to input and output for extractors running within the container.
This environment variable points to the volume mount path for the /repo volume.
This environment variables points to the volume mount path for the /out volume.
In the process of enabling support for a new build system, it is common to implement a build system wrapper which serves as the entry point for the generated extraction image. This wrapper is responsible for any pre-processing of build configuration files which might be necessary, as well as invoking the build system with the arguments necessary to hook the extractor into the build system’s compilation step. An example of such a wrapper can be found here: mvn-extract.sh
A common pattern is to have the wrapper as well as any language specific extraction binaries bundled within an extraction artifacts base image for use in the extraction configuration. An example of such an artifacts base image can be found here: kythe/extractors/java/artifacts.
The Kythe project contains a collection of tools available for running and testing extraction manually. The documentation for these tools can be found here: README.md. These tools require the following to programs to be locally installed and accessible on the $PATH: Docker, Git.
The extractrepo binary provides a tool for running an extraction manually. It consumes an extraction configuration file either specified as a command line argument, or else contained within the ".kythe-extraction-config" file in the root of the repository. The binary generates the extraction image, clones the repository, and then runs the extraction image’s container to perform the Kythe extraction on its contents. The usage for the binary is as follows:
extractrepo -repo <repo_uri> -output <output_file_path> -config [config_file_path]
The repostester binary provides a tool which runs an extraction on a given repository, and then runs a smoke test to verify adequate file coverage on the extraction’s output. The usage for the binary is as follows:
repotester -repos <comma_delimited,repo_urls> [-config <config_file_path>] [-github_token <github_token>] repotester -repo_list_file <file> [-config <config_file_path>] [-github_token <github_token>]