Source code generators like Flex, GNU Bison, and SWIG take a high-level description of a software component and generate the code necessary to realize that component in a lower-level or general-purpose programming language. Users browsing projects that use these components usually want cross-references to take them from use sites of a generated interface to the high-level code that brought that interface into being. They do not normally want to see the generated implementation, as this is often difficult (or uninteresting) to read. This document describes how to encode information about generated code to permit cross-language links.

Tip
You can find an example implementation at GitHub. The current sample web UI does not interpret the parts of the schema we will use; this is a work in progress.

Protocol Buffers

The Kythe project uses protocol buffers for data interchange. The protoc compiler reads a domain-specific language that describes messages and synthesizes code that serializes, deserializes, and manipulates these messages. It can generate code in a number of different target languages by swapping out backend components. These accept an encoding of the message descriptions in the original source file and emit source text.

[Diagram: .proto -> protoc frontend -> descriptor -> C++ language backend -> .pb.h]

Indexing .proto definitions

.proto files are written in a domain-specific language for describing messages and other data. These files are worth indexing in their own right, since messages defined in one .proto file may be used in another. Here is a very simple example of the language:

syntax = "proto3";
package kythe.examples.proto.example;

// A single proto message.
message Foo {
}

This file describes the empty message kythe.examples.proto.example.Foo using features from version 3 of the language. When run through protoc with the appropriate options set, it will generate the interface example.pb.h and the implementation example.pb.cc. These may be used to interact with Foo messages in C++.

As it turns out, protoc can be coerced into saving the descriptor that it passes to its backends. Ordinarily, this descriptor would merely be an abstract version of the .proto input file that discards syntax and records only the details necessary to generate source code. If asked, protoc will also keep track of source locations (--include_source_info) and data about the .proto files that are (transitively) imported (--include_imports). This information is sufficient to build a Kythe graph for a given .proto definition file. It will become important later that every object that the descriptor describes has an address, like "4.0", that corresponds (roughly) to its position in the descriptor’s AST. These addresses are used as keys into the table that keeps track of source locations in the original .proto file.
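
For a sense of what these addresses mean, here is the relevant piece of protobuf's descriptor.proto, abridged. In our example, the address "4.0" is the path [4, 0]: field 4 of FileDescriptorProto is message_type, and element 0 of that repeated field is the message Foo.

// Abridged from descriptor.proto. Each Location's path addresses one object
// in the FileDescriptorProto, and its span records where that object appears
// in the original .proto source.
message SourceCodeInfo {
  repeated Location location = 1;
  message Location {
    repeated int32 path = 1 [packed = true];
    repeated int32 span = 2 [packed = true];
    // (comment-related fields omitted)
  }
}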

[Diagram: the .proto file feeds both the protoc frontend and the Kythe proto_indexer; the frontend produces a descriptor for the C++ language backend (which emits .pb.h) and a FileDescriptorSet, which the proto_indexer reads to produce Kythe entries.]

This extra information is stored as a file that contains a proto2.FileDescriptorSet message, which in turn is a list of the proto2.FileDescriptorProto messages used in the course of processing .proto input. Note that this message does not contain .proto source text, so the proto_indexer must have access to the original source files.
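
The container itself is small; abridged from descriptor.proto:

// The FileDescriptorSet written by protoc is just the list of
// FileDescriptorProto messages for the files it processed (including
// transitive imports when --include_imports is set).
message FileDescriptorSet {
  repeated FileDescriptorProto file = 1;
}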

We can add a verifier assertion to check that Foo declares a Kythe node:

syntax = "proto3";
package kythe.examples.proto.example;

// A single proto message.
//- @Foo defines/binding MessageFoo?
message Foo {
}

and see that it was unified with the appropriate VName:

Output
MessageFoo: EVar(... = App(vname,
    (4.0, kythe, "", kythe/examples/proto/example.proto, protobuf)))

Using generated source code

Imagine that we have a simple C++ user of our generated source code for Foo. Its code, with a verifier assertion, looks like this:

#include "kythe/examples/proto/example.pb.h"

//- @Foo ref CxxFooDecl?
void UseProto(kythe::examples::proto::example::Foo* foo) {
}

The Kythe pipeline for indexing our combined program looks like this:

[Diagram: proto_user.cc and the generated .pb.h feed the C++ extractor, which writes proto_user.kindex for the Kythe C++ indexer; meanwhile, the .proto file feeds the protoc frontend (whose descriptor drives the C++ language backend that emits .pb.h) and, via the FileDescriptorSet, the Kythe proto_indexer; both indexers emit Kythe entries.]

When we use the verifier to inspect the resulting CxxFooDecl, we see that it has not been unified with the VName for Foo:

Output
CxxFooDecl: EVar(... =
    App(vname, (srl0y/pwih+G6wsjFLMTVKQPC7lLH3/9MVK2d2aJHeE=,
                kythe, bazel-out/genfiles, kythe/examples/proto/example.pb.h,
                c++)))

This is because the kythe::examples::proto::example::Foo type is a C++ type defined in example.pb.h. That it was defined in some original .proto file has no meaning to the C++ compiler. Furthermore, the Kythe C++ indexer has no understanding of the protobuf language or of the VNames that the Kythe proto_indexer produces.

Our goal is to add edges in the graph between CxxFooDecl and MessageFoo so that clients can take into account their relationship when displaying cross-references or answering other queries. We do not want to unify them in the same node, as they are legitimately different objects. Users may wish to navigate to the generated C++ code for CxxFooDecl or to view uses of MessageFoo in other languages. To support these different uses, we will emit a generates edge such that MessageFoo generates CxxFooDecl. Clients can choose to follow the edge or to disregard it.

The C++ indexer and the protoc backend both see the same content in the .pb.h file; therefore, both programs agree on the offsets of its tokens. If the protoc backend were to link those offsets back to the objects in the FileDescriptorProto using well-known names—and if the Kythe proto_indexer guaranteed a particular mechanism for generating VNames from those well-known names—we could close the loop in the C++ indexer by emitting generates edges to the proto_indexer’s nodes whenever the C++ indexer trips over the protoc backend’s marked offsets.

In other words, if the .pb.h contained code like:

...
class Foo {
...

and the protoc backend that generated it reported that the text range Foo was associated with an object in its original FileDescriptorProto at some location encoded as "4.0"—and the proto_indexer guaranteed it would always emit objects with signatures based on their descriptor locations—the C++ indexer would only need to watch for defines/binding edges starting at that text range. Should such an edge be emitted, the C++ indexer would also emit a generates edge to the proto node.

Annotations in protoc backends

We have already seen how to command the protoc frontend to emit location information for .proto source files. The frontend does not, however, know anything about the source code that its various backends emit. We must pass additional flags to these backends to get them to produce location information as proto2.GeneratedCodeInfo messages. These messages connect byte offsets in generated source code with paths in the proto2.FileDescriptorProto AST. These paths are the same ones used by the proto2.SourceCodeInfo message that the Kythe proto_indexer consumes; they are the paths we will use to link up protobuf language nodes with the nodes for generated source code.
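
Abridged from descriptor.proto, a proto2.GeneratedCodeInfo message has the following shape; each Annotation ties a byte range in the generated file to a path into the FileDescriptorProto built from a particular .proto source file.

message GeneratedCodeInfo {
  repeated Annotation annotation = 1;
  message Annotation {
    repeated int32 path = 1 [packed = true];  // path into the FileDescriptorProto
    optional string source_file = 2;          // the .proto file that path refers to
    optional int32 begin = 3;                 // start of the byte range in the generated file
    optional int32 end = 4;                   // end of that byte range
  }
}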

Each protoc backend must be individually instrumented to produce proto2.GeneratedCodeInfo messages. To turn annotation on for the C++ backend, you can pass --cpp_out=annotate_headers=1:normal/output/path to protoc. In practice, you will also need to provide an annotation_pragma_name and an annotation_guard_name, so the full cpp_out value may look like annotate_headers=1,annotation_pragma_name=kythe_metadata,annotation_guard_name=KYTHE_IS_RUNNING:normal/output/path.

When annotate_headers=1 is passed to the C++ backend, it will generate .meta files alongside any files with annotations. For example, in the same directory as example.pb.h, you will find an example.pb.h.meta file. This file contains a serialized proto2.GeneratedCodeInfo message, which records a series of spans in example.pb.h, the filenames of the .proto files that caused those spans to be generated, and the AST paths into the FileDescriptorProto for those .proto files. example.pb.h explicitly depends on example.pb.h.meta using a pragma and a preprocessor symbol:

// Generated by the protocol buffer compiler.  DO NOT EDIT!
// source: kythe/examples/proto/example.proto

...

#ifdef KYTHE_IS_RUNNING
#pragma kythe_metadata "kythe/examples/proto/example.pb.h.meta"
#endif  // KYTHE_IS_RUNNING

...

The Kythe C++ extractor and indexer both understand what to do with this pragma (and both define KYTHE_IS_RUNNING). The extractor will add the .meta file to the kindex it produces; the indexer will load the .meta file, translate it from protoc annotations to generic Kythe metadata, and use it to append generates edges for defines/binding edges emitted from example.pb.h.
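
For illustration, if the proto2.GeneratedCodeInfo stored in example.pb.h.meta is printed in proto text format, the annotation covering the generated class Foo would look roughly like this (the byte offsets shown here are invented for the example):

annotation {
  path: 4      # FileDescriptorProto.message_type
  path: 0      # the first message in example.proto: Foo
  source_file: "kythe/examples/proto/example.proto"
  begin: 1234  # illustrative byte offsets of the generated identifier Foo
  end: 1237    # in example.pb.h
}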

[Diagram: as above, except that the C++ language backend now also emits a .pb.h.meta file alongside .pb.h, and the C++ extractor reads that .meta file together with proto_user.cc and .pb.h.]

Now we can write verifier assertions that show we have established a link between the proto source and use sites of its generated code:

#include "kythe/examples/proto/example.pb.h"

//- @Foo ref CxxFooDecl
//- MessageFoo? generates CxxFooDecl
//- vname(_, "kythe", "", "kythe/examples/proto/example.proto", "protobuf")
//-     defines/binding MessageFoo
void UseProto(kythe::examples::proto::example::Foo* foo) {
}

Output
MessageFoo: EVar(... = App(vname,
    (4.0, kythe, "", kythe/examples/proto/example.proto, protobuf)))

Of course, Kythe clients need to understand that generates edges should be followed. Solving this problem is out of this document’s scope.

Providing annotations for other languages

To generate metadata for a different language backend, you must determine or implement the following:

  • The protoc backend for the language must be able to produce proto2.GeneratedCodeInfo buffers.

  • There must be some way to signal to your indexer and extractor that a .meta file is associated with a different source file.

  • That .meta file must be made available to the extractor during extraction. For hermetic build systems, this means that the target driving protoc must list the .meta file as an output. Any target that uses that protoc target must require the .meta file as an input.

  • The indexer must read the .meta file and use it to emit generates edges that connect up to the nodes produced by the Kythe proto_indexer.

The method for annotating source code is designed such that it can be implemented purely at the output stage; for example, if you have an abstraction for emitting defines/binding edges from anchors, you can check at every edge (starting from a file with loaded metadata) whether you should emit an additional generates edge.
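
As a rough sketch of that approach (the types and helper names below are hypothetical and are not taken from the Kythe codebase), the output-stage hook might look like:

#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <tuple>

// A Kythe node name; fields mirror the (signature, corpus, root, path,
// language) tuples shown in the verifier output above.
struct VName { std::string signature, corpus, root, path, language; };

// A byte range in a source file.
struct Span {
  size_t begin = 0, end = 0;
  bool operator<(const Span& o) const {
    return std::tie(begin, end) < std::tie(o.begin, o.end);
  }
};

// Callback used to emit a single edge; edge kinds are abbreviated here the
// same way they are in the prose above.
using EmitEdgeFn =
    std::function<void(const VName& source, const std::string& edge_kind,
                       const VName& target)>;

struct FileMetadata {
  // Spans in the generated source mapped to the VNames the proto_indexer
  // emits for the corresponding .proto objects.
  std::map<Span, VName> generated_by;
  // Returns the proto node recorded for exactly this span, if any.
  const VName* NodeFor(const Span& s) const {
    auto it = generated_by.find(s);
    return it == generated_by.end() ? nullptr : &it->second;
  }
};

// Wraps the point at which the indexer emits a defines/binding edge from an
// anchor. If metadata for the anchor's file maps the anchor's span to a
// proto node, also emit a generates edge to the generated semantic node.
void EmitDefinesBinding(const EmitEdgeFn& emit, const Span& anchor_span,
                        const VName& anchor, const VName& semantic_node,
                        const FileMetadata* meta) {
  emit(anchor, "defines/binding", semantic_node);
  if (meta != nullptr) {
    if (const VName* proto_node = meta->NodeFor(anchor_span)) {
      emit(*proto_node, "generates", semantic_node);
    }
  }
}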