Indexing Generated Code
Source code generators like Flex, GNU Bison, and SWIG take a high-level description of a software component and generate the code necessary to realize that component in a lower-level or general-purpose programming language. Users browsing projects that use these components usually want cross-references to take them from use sites of a generated interface to the high-level code that brought that interface into being. They do not normally want to see the generated implementation, as this is often difficult (or uninteresting) to read. This document describes how to encode information about generated code to permit cross-language links.
To make the discussion easier to follow, let’s pretend we are working with two languages: SourceLang and TargetLang. SourceLang has .source files and TargetLang has .target files. We also have a tool (Generator) that can generate a foo.target file from a foo.source file. We have the following components:
- Source Indexer - Kythe indexer that takes .source files and outputs index data.
- Target Indexer - Kythe indexer that takes .target files and outputs index data.
- Generator - tool that produces .target files from .source files.
- Post processor - Kythe tool that takes all index data produced by all indexers, processes it, and outputs the final Kythe graph that contains data for both SourceLang and TargetLang.
Now we want to teach Kythe how to create cross-references between the generated foo.target file and the original foo.source file. The main idea is simple: Generator has to output extra data that maps elements in foo.target to the original elements in foo.source. Then, when Target Indexer indexes foo.target, it uses that mapping to output generates or imputes edges. These edges connect nodes from foo.target with nodes in foo.source.
Kythe doesn’t require implementors to use one concrete approach for passing mapping metadata and outputting generates and imputes edges. Below we describe two different approaches, each with its own pros and cons. Both assume that implementors can change Generator and Target Indexer. If possible, the generates approach is preferred, as it requires less post-processing work.
Tip: You can find an example implementation at GitHub. The current sample web UI does not interpret the parts of the schema we will use; this is a work in progress.
Java To JavaScript with imputes edges
This approach is generic and works for any combination of SourceLang and TargetLang. In this example we generate JavaScript files from Java files, so SourceLang is Java and TargetLang is JavaScript. Given Color.java:
public enum Color {
  RED;
}
Generator produces color.js:
const Color = {
  RED: 0,
};
Changes to Generator
To support cross-references between color.js and Color.java we need to update Generator to output the following mapping data for the Color and RED elements.
{
  "type": "kythe0",
  "meta": [
    {
      "type": "anchor_anchor",
      "source_begin": 13,
      "source_end": 18,
      "target_begin": 6,
      "target_end": 11,
      "edge": "/kythe/edge/imputes",
      "source_vname": { "corpus": "corpus", "path": "path/to/Color.java" }
    },
    {
      "type": "anchor_anchor",
      "source_begin": 22,
      "source_end": 25,
      "target_begin": 18,
      "target_end": 21,
      "edge": "/kythe/edge/imputes",
      "source_vname": { "corpus": "corpus", "path": "path/to/Color.java" }
    }
  ]
}
This mapping has 2 meta entries: the first for Color, the second for RED. Note:
- Entries don’t contain the names of elements. Each entry contains only the positions of an element in the source (Color.java) and target (color.js) files.
- Each position is a byte offset within the file, not a line/column pair. This is required because Kythe anchors are defined using byte offsets. In this example the JavaScript indexer processes the mapping and needs to output an anchor for Color.java, but it has no access to the Color.java file (it has access only to JS files), so it cannot translate line/column to a byte offset.
- Entries contain positions rather than the VNames of elements in Color.java or color.js. VNames of nodes are internal details of each indexer and subject to change. Generator is usually a standalone tool that doesn’t know the rules for producing VNames for a specific language, so it’s impossible for Generator to output the VNames of nodes. If in your case VNames are stable and well-specified, you can use the simpler generates approach described in the Protocol Buffers section below.
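Because positions are byte offsets rather than character or line/column positions, a generator has to measure them on the encoded text. The following is a minimal sketch of that calculation, assuming UTF-8 output and assuming color.js is laid out on three lines with two-space indentation; the helper name byte_span is hypothetical and not part of Kythe.

```python
from typing import Tuple


def byte_span(source: str, identifier: str, start_char: int = 0) -> Tuple[int, int]:
    """Return (begin, end) UTF-8 byte offsets of the first occurrence of identifier."""
    char_index = source.index(identifier, start_char)
    # Count bytes, not characters: non-ASCII text before the identifier
    # shifts the byte offset away from the character index.
    begin = len(source[:char_index].encode("utf-8"))
    end = begin + len(identifier.encode("utf-8"))
    return begin, end


# Assumed layout of the generated file (three lines, two-space indent).
COLOR_JS = "const Color = {\n  RED: 0,\n};\n"

color_span = byte_span(COLOR_JS, "Color")
red_span = byte_span(COLOR_JS, "RED")
```

Under that assumed layout, the spans for Color and RED agree with the target_begin and target_end values in the mapping above.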
To pass this mapping to the JavaScript Indexer, Generator appends it as a comment on the last line of color.js:
const Color = {
  RED: 0,
};
// Kythe Indexing Metadata:
// {"type":"kythe0","meta":[{"type":"anchor_anchor","source_begin":13,...
Inlining the metadata inside color.js avoids passing extra files to Indexer. All Indexer needs to know is that some JavaScript files can contain metadata on the last line, and how to parse it.
One downside is that it adds noise to color.js, but generated files are usually invisible to developers, so it’s not a big concern.
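On the Generator side, inlining can be as small as serializing the mapping onto one line and appending it after a marker comment. This is a sketch only: the function name append_metadata is hypothetical, and the compact JSON formatting is an assumption chosen so that the payload stays on a single line.

```python
import json

METADATA_MARKER = "// Kythe Indexing Metadata:"


def append_metadata(js_source: str, metadata: dict) -> str:
    """Append the mapping as a trailing comment: marker line, then one JSON line."""
    body = js_source.rstrip("\n")
    # Compact separators keep the payload short; json.dumps never emits
    # newlines unless asked to indent, so the payload is a single line.
    encoded = json.dumps(metadata, separators=(",", ":"))
    return f"{body}\n{METADATA_MARKER}\n// {encoded}\n"


annotated = append_metadata("const Color = {\n  RED: 0,\n};\n",
                            {"type": "kythe0", "meta": []})
```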
Changes to JavaScript Indexer
On the JavaScript Indexer side we need to parse the metadata and output imputes edges. To parse the metadata, the indexer can check the last two lines of every .js file and see whether they contain the // Kythe Indexing Metadata: marker; if so, it parses the last line as JSON.
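The check described above might look like the following sketch. It assumes the marker sits on the second-to-last line and the JSON payload on the last line behind a // prefix, as in the color.js example; extract_metadata is a hypothetical name, not the real indexer's API.

```python
import json
from typing import Optional

METADATA_MARKER = "// Kythe Indexing Metadata:"


def extract_metadata(js_source: str) -> Optional[dict]:
    """Return the inlined mapping, or None if the file carries no metadata."""
    lines = js_source.rstrip("\n").split("\n")
    if len(lines) < 2 or lines[-2].strip() != METADATA_MARKER:
        return None
    payload = lines[-1].strip()
    if not payload.startswith("//"):
        return None
    try:
        # Drop the comment prefix; json.loads tolerates the leading space.
        return json.loads(payload[2:])
    except json.JSONDecodeError:
        return None  # marker present but payload is not valid JSON
```

Returning None for a malformed payload lets the indexer fall back to ordinary indexing instead of failing on a file that merely resembles annotated output.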
For each meta entry the indexer should do the following:
1. Output an anchor using source_begin and source_end. source_vname should be used as the file containing the anchor.
2. Find a JavaScript node that has a defines/binding anchor with the same target_begin/end position.
3. Output one imputes edge from the anchor created at step 1 to the node found at step 2.
Note that this only applies to meta entries with type anchor_anchor. For other types the structure might be different. See issue #3711.
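The three steps can be sketched as a single pass over the meta entries. The data shapes here are assumptions made for illustration: bindings is a hypothetical lookup from the span of each defines/binding anchor in the generated file to the node it binds, and edges are plain dictionaries rather than real Kythe entries.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (begin, end) byte offsets


def imputes_edges(metadata: dict, bindings: Dict[Span, str]) -> List[dict]:
    """Turn anchor_anchor entries into imputes edges."""
    edges = []
    for entry in metadata.get("meta", []):
        if entry.get("type") != "anchor_anchor":
            continue  # other entry types may have a different structure
        # Step 2: find the node defined at the target span.
        target_node = bindings.get((entry["target_begin"], entry["target_end"]))
        if target_node is None:
            continue  # nothing is defined at that span, so nothing to link
        # Step 1: an anchor in the source file, located via source_vname.
        source_anchor = {
            "corpus": entry["source_vname"]["corpus"],
            "path": entry["source_vname"]["path"],
            "begin": entry["source_begin"],
            "end": entry["source_end"],
        }
        # Step 3: one edge from the source anchor to the target node.
        edges.append({"source": source_anchor, "kind": entry["edge"], "target": target_node})
    return edges
```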
Here is what the JavaScript indexer outputs for the Color element using the rules above:
The output of the Java Indexer looks like this:
Post-processor
Once the Java and JavaScript Indexers have finished, their output is merged. The post-processor then finds all anchors that have both defines/binding and imputes edges and creates a generates edge:
This is the end state. Tools using the Kythe graph can now see that the Color enum in JS is generated by the Color enum in Java and act accordingly (for example, an IDE, upon a click on Color in a JS file, will go to the definition of the Color enum in the Java file).
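The post-processing rule amounts to a join on the anchor: the node an anchor binds in the source language generates every node that the same anchor imputes. The sketch below models edges as (source, kind, target) string triples, which is a simplification for illustration and not the real post-processor's data model.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

Edge = Tuple[str, str, str]  # (source, edge kind, target)

DEFINES_BINDING = "/kythe/edge/defines/binding"
IMPUTES = "/kythe/edge/imputes"
GENERATES = "/kythe/edge/generates"


def derive_generates(edges: Iterable[Edge]) -> List[Edge]:
    """For each anchor with both edge kinds, link the bound node to the imputed node."""
    defined = defaultdict(list)  # anchor -> nodes it binds (source language)
    imputed = defaultdict(list)  # anchor -> nodes it imputes (target language)
    for source, kind, target in edges:
        if kind == DEFINES_BINDING:
            defined[source].append(target)
        elif kind == IMPUTES:
            imputed[source].append(target)
    generated = []
    for anchor, source_nodes in defined.items():
        for source_node in source_nodes:
            # .get avoids creating empty entries for anchors without imputes edges.
            for target_node in imputed.get(anchor, []):
                generated.append((source_node, GENERATES, target_node))
    return generated
```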
Protocol Buffers with generates edges
This approach is easier to implement than the imputes approach described above, but it requires tighter integration between Indexer and Generator. When Generator outputs code it also adds a mapping, as in the imputes approach, but instead of mapping location to location it outputs the VNames of nodes from foo.source. This requires Generator to know exactly which VNames the Source Indexer will produce. The approach is feasible when VNames either have a simple stable form or Generator can reuse code from Source Indexer to generate VNames.
In this example we generate C++ files from Protocol Buffer definitions, so SourceLang is Protocol Buffers and TargetLang is C++.
The Kythe project uses protocol buffers for data interchange. The protoc compiler reads a domain-specific language that describes messages and synthesizes code that serializes, deserializes, and manipulates these messages. It can generate code in a number of different target languages by swapping out backend components. These accept an encoding of the message descriptions in the original source file and emit source text.
Indexing .proto definitions
.proto files are written in a domain-specific programming language for describing various properties about messages and other data. It is interesting to index these on their own, as messages in one .proto file may be used in another .proto file. Here is a very simple example of the language:
syntax = "proto3";
package kythe.examples.proto.example;
// A single proto message.
message Foo {
}
This file describes the empty message kythe.examples.proto.example.Foo using features from version 3 of the language. When run through protoc with the appropriate options set, it will generate the interface example.pb.h and the implementation example.pb.cc. These may be used to interact with Foo messages in C++.
As it turns out, protoc can be coerced into saving the descriptor that it passes to its backends. Ordinarily, this descriptor would merely be an abstract version of the .proto input file that discards syntax and records only the details necessary to generate source code. If asked, protoc will also keep track of source locations (--include_source_info) and data about the .proto files that are (transitively) imported (--include_imports).
This information is sufficient to build a Kythe graph for a given .proto definition file. It will become important later that every object that the descriptor describes has an address, like "4.0", that corresponds (roughly) to its position in the descriptor’s AST. These addresses are used as keys into the table that keeps track of source locations in the original .proto file.
This extra information is stored as a file that contains a proto2.FileDescriptorSet message, which in turn is a list of the proto2.FileDescriptorProto messages used in the course of processing .proto input. Note that this message does not contain .proto source text, so the proto_indexer must have access to the original source files.
We can add a verifier assertion to check that Foo declares a Kythe node:
syntax = "proto3";
package kythe.examples.proto.example;
// A single proto message.
//- @Foo defines/binding MessageFoo?
message Foo {
}
and see that it was unified with the appropriate VName:
MessageFoo: EVar(... = App(vname,
(4.0, kythe, "", kythe/examples/proto/example.proto, protobuf)))
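The VName printed above has five components: signature, corpus, root, path, and language. Its signature, 4.0, is the descriptor address of Foo: field number 4 of FileDescriptorProto is message_type, and 0 is the index of the first message in the file. As a sketch, and assuming the signature is simply the dotted descriptor path (which is what this output suggests, not a documented guarantee), a tool could rebuild the VName like this; proto_vname is a hypothetical helper.

```python
from typing import NamedTuple, Sequence


class VName(NamedTuple):
    signature: str
    corpus: str
    root: str
    path: str
    language: str


def proto_vname(descriptor_path: Sequence[int], corpus: str, proto_path: str) -> VName:
    """Build a VName whose signature is the dotted descriptor path (assumed scheme)."""
    signature = ".".join(str(component) for component in descriptor_path)
    return VName(signature, corpus, "", proto_path, "protobuf")


# [4, 0]: FileDescriptorProto.message_type (field 4), first message (index 0).
message_foo = proto_vname([4, 0], "kythe", "kythe/examples/proto/example.proto")
```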
Using generated source code
Imagine that we have a simple C++ user of our generated source code for Foo. Its code, with a verifier assertion, looks like this:
#include "kythe/examples/proto/example.pb.h"
//- @Foo ref CxxFooDecl?
void UseProto(kythe::examples::proto::example::Foo* foo) {
}
The Kythe pipeline for indexing our combined program looks like this:
When we use the verifier to inspect the resulting CxxFooDecl, we see that it has not been unified with the VName for Foo:
CxxFooDecl: EVar(... =
App(vname, (srl0y/pwih+G6wsjFLMTVKQPC7lLH3/9MVK2d2aJHeE=,
kythe, bazel-out/genfiles, kythe/examples/proto/example.pb.h,
c++)))
This is because the kythe::examples::proto::example::Foo type is a C++ type defined in example.pb.h. That it was defined in some original .proto file has no meaning to the C++ compiler. Furthermore, the Kythe C++ indexer has no understanding of the protobuf language or of the VNames that the Kythe proto_indexer produces.
Our goal is to add edges in the graph between CxxFooDecl and MessageFoo so that clients can take into account their relationship when displaying cross-references or answering other queries. We do not want to unify them in the same node, as they are legitimately different objects. Users may wish to navigate to the generated C++ code for CxxFooDecl or to view uses of MessageFoo in other languages. To support these different uses, we will emit a generates edge such that MessageFoo generates CxxFooDecl. Clients can choose to follow the edge or to disregard it.
Observe that the C++ indexer and the protoc backend both see the same content in the .pb.h file; therefore, both programs see the same offsets for various tokens. If the protoc backend were to link those offsets back to the objects in the FileDescriptorProto using well-known names, and if the Kythe proto_indexer guaranteed a particular mechanism for generating VNames from those well-known names, we could close the loop in the C++ indexer by emitting generates edges to the proto_indexer’s nodes whenever the C++ indexer trips over the protoc backend’s marked offsets.
In other words, if the .pb.h contained code like:
... class Foo { ...
and the protoc backend that generated it reported that the text range Foo was associated with an object in its original FileDescriptorProto at some location encoded as "4.0", and the proto_indexer guaranteed it would always emit objects with signatures based on their descriptor locations, then the C++ indexer would only need to watch for defines/binding edges starting at that text range. Should such an edge be emitted, the C++ indexer would also emit a generates edge to the proto node.
Annotations in protoc backends
We have already seen how to command the protoc frontend to emit location information for .proto source files. The frontend does not, however, know anything about the source code that its various backends emit. We must pass additional flags to these backends to get them to produce location information as proto2.GeneratedCodeInfo messages. These messages connect byte offsets in generated source code with paths in the proto2.FileDescriptorProto AST. These paths are the same ones used by the proto2.SourceCodeInfo message that the Kythe proto_indexer consumes; they are the paths we will use to link up protobuf language nodes with the nodes for generated source code.
Each protoc backend must be individually instrumented to produce proto2.GeneratedCodeInfo messages. To turn annotation on for the C++ backend, you can pass --cpp_out=annotate_headers=1:normal/output/path to protoc. In practice, you will also need to provide an annotation_pragma_name and an annotation_guard_name, so the full cpp_out value may look like annotate_headers=1,annotation_pragma_name=kythe_metadata,annotation_guard_name=KYTHE_IS_RUNNING:normal/output/path.
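Build rules that drive protoc usually assemble this value programmatically. The helper below is a small sketch that only joins the three options named above into one --cpp_out argument; cpp_out_arg is a hypothetical name, and the defaults are the example values from this section.

```python
def cpp_out_arg(output_dir: str,
                pragma_name: str = "kythe_metadata",
                guard_name: str = "KYTHE_IS_RUNNING") -> str:
    """Compose the --cpp_out value: comma-separated options, a colon, then the path."""
    options = [
        "annotate_headers=1",
        f"annotation_pragma_name={pragma_name}",
        f"annotation_guard_name={guard_name}",
    ]
    return "--cpp_out=" + ",".join(options) + ":" + output_dir


flag = cpp_out_arg("normal/output/path")
```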
When annotate_headers=1 is asserted to the C++ backend, it will generate .meta files alongside any files with annotations. For example, in the same directory as example.pb.h, you will find an example.pb.h.meta file. This file contains a serialized proto2.GeneratedCodeInfo message. This message contains a series of spans in example.pb.h, the filenames to the .proto files that caused those spans to be generated, and the AST paths in the FileDescriptorProto for those .proto files. example.pb.h explicitly depends on example.pb.h.meta using a pragma and a preprocessor symbol:
// Generated by the protocol buffer compiler. DO NOT EDIT!
// source: kythe/examples/proto/example.proto
...
#ifdef KYTHE_IS_RUNNING
#pragma kythe_metadata "kythe/examples/proto/example.pb.h.meta"
#endif  // KYTHE_IS_RUNNING
...
The Kythe C++ extractor and indexer both understand what to do with this pragma (and both define KYTHE_IS_RUNNING). The extractor will add the .meta file to the kzip it produces; the indexer will load the .meta file, translate it from protoc annotations to generic Kythe metadata, and use it to append generates edges for defines/binding edges emitted from example.pb.h.
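For a tool that is not a full C++ front end, the association between a header and its .meta file can be approximated by scanning for the pragma textually. This is a rough sketch under that assumption: it ignores the #ifdef guard and the preprocessor entirely, which the real extractor and indexer do not, and metadata_paths is a hypothetical name.

```python
import re
from typing import List


def metadata_paths(header_text: str, pragma_name: str = "kythe_metadata") -> List[str]:
    """Return the quoted paths named by `#pragma <pragma_name> "..."` lines."""
    pattern = re.compile(
        r'^[ \t]*#[ \t]*pragma[ \t]+' + re.escape(pragma_name) + r'[ \t]+"([^"]+)"',
        re.MULTILINE,
    )
    return pattern.findall(header_text)


HEADER = (
    "// Generated by the protocol buffer compiler. DO NOT EDIT!\n"
    "#ifdef KYTHE_IS_RUNNING\n"
    '#pragma kythe_metadata "kythe/examples/proto/example.pb.h.meta"\n'
    "#endif  // KYTHE_IS_RUNNING\n"
)
```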
Now we can write verifier assertions that show we have established a link between the proto source and use sites of its generated code:
#include "kythe/examples/proto/example.pb.h"
//- @Foo ref CxxFooDecl
//- MessageFoo? generates CxxFooDecl
//- vname(_, "kythe", "", "kythe/examples/proto/example.proto", "protobuf")
//-     defines/binding MessageFoo
void UseProto(kythe::examples::proto::example::Foo* foo) {
}
MessageFoo: EVar(... = App(vname,
(4.0, kythe, "", kythe/examples/proto/example.proto, protobuf)))
Of course, Kythe clients need to understand that generates edges should be followed. Solving this problem is out of this document’s scope.
Providing annotations for other languages
To generate metadata for a different language backend, you must determine or implement the following:
- The protoc backend for the language must be able to produce proto2.GeneratedCodeInfo buffers.
- There must be some way to signal to your indexer and extractor that a .meta file is associated with a different source file.
- That .meta file must be made available to the extractor during extraction. For hermetic build systems, this means that the target driving protoc must list the .meta file as an output. Any target that uses that protoc target must require the .meta file as an input.
- The indexer must read the .meta file and use it to emit generates edges that connect up to the nodes produced by the Kythe proto_indexer.
The method for annotating source code is designed such that it can be implemented purely at the output stage; for example, if you have an abstraction for emitting defines/binding edges from anchors, you can check at every edge (starting from a file with loaded metadata) whether you should emit an additional generates edge.
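An output-stage hook of this kind could be sketched as follows. The class and its data shapes are hypothetical: annotations stands for the spans loaded from a .meta file, keyed by byte range and mapped to the already-computed proto node, and edges are recorded as plain tuples instead of Kythe entries.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (begin, end) byte offsets in the generated file


class EdgeEmitter:
    """Records defines/binding edges and adds generates edges at annotated spans."""

    def __init__(self, annotations: Dict[Span, str]):
        self.annotations = annotations  # span -> proto node (from loaded metadata)
        self.edges: List[Tuple[object, str, str]] = []

    def defines_binding(self, span: Span, node: str) -> None:
        anchor = ("anchor", span[0], span[1])
        self.edges.append((anchor, "/kythe/edge/defines/binding", node))
        # The only extra work: check whether this span was marked by the generator.
        proto_node = self.annotations.get(span)
        if proto_node is not None:
            self.edges.append((proto_node, "/kythe/edge/generates", node))


emitter = EdgeEmitter({(120, 123): "proto#4.0"})
emitter.defines_binding((120, 123), "cxx#Foo")   # annotated span
emitter.defines_binding((200, 205), "cxx#Other")  # unannotated span
```

Keeping the check inside the emitter means the rest of the indexer does not need to know that metadata exists.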