Kythe - Annotating nodes for display

Experimenting with MarkedSource
Generating MarkedSource
- Including MarkedSource by reference
Testing MarkedSource facts

Semantic nodes in a Kythe graph may stand for objects with complex structure, such as polymorphic functions bearing many type constraints. Representing these nodes in a UI for human viewers is often complicated. Displaying only the source text may omit important context (like types inferred by the compiler). On the other hand, fully expanding the node’s internal representation may result in a very long, difficult-to-read string. Semantic information may also be lost, as in the case where programmers use transparent `typedef`s in the C family of languages.

The schema provides a code fact, when attached to an arbitrary semantic node in the Kythe graph, instructs clients on how that node can be presented to users. The fact’s value is a serialized ‘MarkedSource` protocol buffer message, defined in common.proto. Unlike most facts in the Kythe graph, MarkedSource is a structured message rather than a plain string, because clients have differing requirements for the amount and level of detail they display. By including or excluding various parts of this message, clients can precisely format a node’s presentation according to their requirements. The message also offers the ability to link subspans to other nodes and to include other nodes’ code by reference. Kythe indexers are responsible for emitting MarkedSource messages.

Experimenting with `MarkedSource`

The Kythe repository contains a sample utility for rendering documentation, including any included MarkedSource messages. You can build it with:

bazel build //kythe/cxx/doc

To run it in a mode that will accept and render a ASCII MarkedSource message, use:

./bazel-bin/kythe/cxx/doc/doc --common_signatures

An empty message produces the following output (shown between double-quotes with HTML special characters escaped):

Output

      RenderSimpleIdentifier: ""
RenderSimpleQualifiedName-ID: ""
RenderSimpleQualifiedName+ID: ""

Generating `MarkedSource`

MarkedSource messages describe simplified parse trees for source code. The parse tree represented by a MarkedSource message need not correspond exactly to the surface syntax of the language, but is intended to be as similar as possible so that a reader familiar with the language will understand the structure that is represented. Each message is a node in the parse tree. Messages have kinds (distinct from the kind facts on Kythe nodes) that apply to themselves and their children, so a message with the TYPE kind applies the type nature to itself and its subtree. When tools render MarkedSource, they include or exclude parts of the parse tree by inspecting kinds. For a full listing of valid kinds, refer to the message definition in xref.proto.

Renderers traverse the tree in order. If a message is elected to be rendered, its pre_text is appended when it is first visited. Each of the message’s children is traversed. After each child is rendered, the parent’s post_child_text is appended, unless that child is the last child. Once all of the children have been traversed, the parent’s post_text is appended. For example:

kind: IDENTIFIER
pre_text: "pre"
post_child_text: "post_child"
post_text: "post"

(Here and elsewhere we show MarkedSource messages as text format protobuf messages.)

Output

      RenderSimpleIdentifier: "prepost"
RenderSimpleQualifiedName-ID: ""
RenderSimpleQualifiedName+ID: "prepost"

kind: IDENTIFIER
pre_text: "pre"
post_child_text: "post_child"
post_text: "post"
child {
  pre_text: "1"
}
child {
  pre_text: "2"
}

Output

      RenderSimpleIdentifier: "pre1post_child2post"
RenderSimpleQualifiedName-ID: ""
RenderSimpleQualifiedName+ID: "pre1post_child2post"

A MarkedSource representation of a typical C++ qualified name would be:

kind: BOX
child {
  kind: CONTEXT
  child {
    kind: IDENTIFIER
    pre_text: "std"
  }
  child {
    kind: IDENTIFIER
    pre_text: "experimental"
  }
  post_child_text: "::"
  add_final_list_token: true
}
child {
  kind: IDENTIFIER
  pre_text: "string_view"
}

Output

      RenderSimpleIdentifier: "string_view"
RenderSimpleQualifiedName-ID: "std::experimental"
RenderSimpleQualifiedName+ID: "std::experimental::string_view"

A function prototype would look like:

child { kind: TYPE pre_text: "void" }
child { pre_text: " " }
child { kind: IDENTIFIER pre_text: "foo" }
child {
  kind: PARAMETER
  child {
    child { kind: TYPE pre_text: "int" }
    child { pre_text: " " }
    child {
      kind: CONTEXT
      child { kind: IDENTIFIER pre_text: "foo" }
      post_child_text: "::"
      add_final_list_token: true
    }
    child { kind: IDENTIFIER pre_text: "x" }
  }
  child {
    child { kind: TYPE pre_text: "int" }
    child { pre_text: " " }
    child {
      kind: CONTEXT
      child { kind: IDENTIFIER pre_text: "foo" }
      post_child_text: "::"
      add_final_list_token: true
    }
    child { kind: IDENTIFIER pre_text: "y" }
  }
  pre_text: "("
  post_child_text: ", "
  post_text: ")"
}

Output

      RenderSimpleIdentifier: "foo"
          RenderSimpleParams: "x"
          RenderSimpleParams: "y"
RenderSimpleQualifiedName-ID: ""
RenderSimpleQualifiedName+ID: "foo"

Including `MarkedSource` by reference

In the function prototype example above, the MarkedSource for x and y will appear duplicated in the indexer output: once for each variable, then again in the code fact for foo. It is possible to avert this duplication by including the code of another node in the Kythe graph by using a LOOKUP message kind. For example, the prototype could have been equivalently written:

child { kind: TYPE pre_text: "void" }
child { pre_text: " " }
child { kind: IDENTIFIER pre_text: "foo" }
child {
  kind: PARAMETER_LOOKUP_BY_PARAM
  pre_text: "("
  post_child_text: ", "
  post_text: ")"
}

Warning

There is a tradeoff between size and speed in the use of the LOOKUP kinds. You should not expect more than one LOOKUP level to be dereferenced by the serving infrastructure on your behalf.

Testing `MarkedSource` facts

The verifier supports checking MarkedSource subtrees by exploding the protocol buffer into a subgraph. Because this behavior can add many facts to its database, it is disabled by default. Enable it using the --convert_marked_source flag. If some node N has a fact such that N.code is an encoded MarkedSource, that fact will be replaced with a synthesized code edge connected to the root MarkedSource node with facts that are named the same as the fields in the MarkedSource proto definition. Child messages are attached via ParentMS child.N ChildMS edges, where N is the zero-based index of the child in the parent. For example, the following test script checks the MarkedSource attached to a C++ variable:

Variable source. (C++)

//- @x defines/binding VarX
//- VarX code VXRoot
//- VXRoot child.0 VXType
//- VXType.pre_text int
//- VXType.kind "TYPE"
//- VXRoot child.1 VXSpaceBox
//- VXSpaceBox.pre_text " "
//- VXRoot child.2 VXIdentifier
//- VXIdentifier.kind "IDENTIFIER"
//- VXIdentifier.pre_text x
int x;

Experimenting with MarkedSource

Generating MarkedSource

Including MarkedSource by reference

Testing MarkedSource facts

Variable source. (C++)

Experimenting with `MarkedSource`

Generating `MarkedSource`

Including `MarkedSource` by reference

Testing `MarkedSource` facts