Kythe - Kythe Indexer Verification

Kythe is intended to be source language agnostic. This extends to the facility for writing tests to check whether Kythe is correctly processing source text. Here we describe the verifier, a standalone component that checks that a database of Kythe facts supports a set of goals.

Recommended use

For examples of tests that use the verifier, please see //kythe/cxx/indexer/cxx/testdata/basic or //kythe/javatests/com/google/devtools/kythe/analyzers/java/testdata/pkg. It is best to keep the size of a single test small and to develop assertions while frequently checking the graph produced by --annotated_graphviz.

Assuming that you have Kythe installed to /opt/kythe, you can start using the verifier on a C++ file right away:

/opt/kythe/indexers/cxx_indexer -i foo.cc -- -std=c++1y \
    | /opt/kythe/tools/verifier foo.cc --annotated_graphviz > foo.dot
xdot foo.dot

By default (and assuming no errors) you will see the Kythe graph for foo.cc. The remainder of this document discusses the kinds of assertions you can make about the nodes and edges in the graph.

There is a style guide with some suggestions on how to write readable tests.

Basic well-formedness checks

The verifier begins by checking that the input database obeys basic Kythe conventions and some uniqueness properties:

All vnames must have at least one field set.
Facts must be one of the following forms:
- (source, edge, target, /, "")
- (source, edge, target, /kythe/ordinal, base10string)
- (source, "", "", string, _)
forall s k, there may be at most one fact matching (s, "", "", k, _).
forall s e t k v, there may be at most one fact matching (s, e, t, k, v).

For the moment, these checks are done by code baked into the verifier; in the future, they may be specified in a startup script.

Specifying goals

Verification goals may be inlined into ordinary source text. Lines beginning with language-specific magic comments are treated as goals. By default, the verifier looks for goals in comments that start with //-:

//- this-is-a-goal

For languages supporting shell-style comments, like Python or text protobufs, instead use #- and tell the verifier to expect the different syntax using the '--goal_prefix=#-' flag:

#- this-is-a-goal

The verifier will ignore any combination of spaces or tabs before the comment token. Long goals may be split across multiple lines, each of which must begin with the comment token (after zero or more tabs or spaces). Within the assertion language, the verifier is whitespace insensitive.

You may also use a regular expression (using RE2 syntax) to specify which lines include goals and which parts of those lines to include. Your regex must match the entire line and must contain a single capture group. The default goal matching behavior uses the regex \s*\/\/\-(.*). To specify an alternate regex—for example, one that requires goals to be bracketed—use --goal_regex='\s*\/\/\-\s*\[(.*)]'.

You may use BCPL-style comments in the assertion language:

//- this-is-a-goal // and this is ignored.

The // token is independent of the goal prefix or regex:

#- this-is-a-goal // and this is ignored.

Literal strings

The verifier permits literal strings to be specified with C-style escapes:

"foo"
"foo \"bar\""
"foo \\"

Currently, the allowed escape codes are \" and \\. It is an error to use a disallowed escape code.

Existential variables

Any identifier beginning with a capital letter is parsed as an existential variable, or evar. Evars are initially unset. In the course of solving a goal, the verifier may attempt to find values for zero or more evars. It is possible for all goals to be satisfied with some evars left unset. All input files share the same evar namespace.

If you want to match against text beginning with a capital letter, you must surround it with quotes to form a string literal.

Important

Any mention of SomeEvar refers to the same SomeEvar, even across different rules, unless the identifier starts with an underscore. Starting a variable’s name with an underscore indicates to the verifier that you don’t care about its value.

The verifier can return an error result if any (uninspected) existential variable is mentioned only once. This usually indicates that you’ve made a typo somewhere in your goal script. To enable this behavior, pass the --check_for_singletons=true flag. This behavior will eventually become the default.

Anchor specifiers

Tests written against Kythe indexers must frequently refer to anchor nodes, as these connect the syntax of the source code with the indexer’s semantic model. It is alternately possible to write down the vname of the syntactic or semantic objects, but this makes the test depend on a particular vname encoding strategy that may be unreasonable to compute manually. The verifier provides syntactic sugar for referring to anchor nodes. When it encounters an anchor specifier, the verifier generates an evar to refer to the anchor’s vname. It also generates goals to ensure that vname refers to an anchor node with the specified source location.

Tokens

The location specifier @tok refers to the source text range for the identifier tok on the next source line. This may not be the next line in the input file, as is the case when the next line begins with a magic comment as previously described. It is an error if matching tok is ambiguous. Substring matching is used. tok may be a literal string; @text and @"text" are equivalent.

For example, the following two rules both match tokens from the same line. If @text and @more were swapped in this example, the rules would remain valid.

#- @text defines SomeNode
#- @more defines OtherNode
##text more

The following two rules match tokens from different lines:

#- @text defines SomeNode
##text
#- @more defines OtherNode
#more

If @text and @more were swapped in this example, the rules would be invalid.

The constraints generated for some @text are of the form:

#- TextEvar.node/kind anchor
#- TextEvar./kythe/loc/start starting-offset
#- TextEvar./kythe/loc/end ending-offset

These constraints notably leave out the file that the anchor is a child of. If there is concern that the anchor may match against an equivalent range in a different file, you can use an equality constraint to bind a name to the anchor’s implicit evar, then constrain this evar to be the child of some file node.

Important

Each mention of @tok generates a fresh evar.

Offset specifications

If you are only interested in the byte offset of a token or string literal, you may use an offset specification. @^tok will become the offset of the t on the next source line. @$tok will become the offset immediately after the k on the next source line. The rules for tok are the same as for anchor specifiers; it may also be a literal string.

The following snippet demonstrates the relationship between anchors and offsets. AnchorX=@x binds AnchorX to the evar generated for @x; see the section below on naming expressions for details.

#- AnchorX=@x.node/kind anchor
#- AnchorX.loc/start @^x
#- AnchorX.loc/end @$x

Note	Offset specifications are equivalent to the atomic string literal containing the decimal stringification of the relevant byte offset. They are manifested as evars that are unified before solving begins.

Anchors that refer to lines far in the future

It may not be possible to place a goal string on the line directly before the text you want to match. In this case, you can specify that the match should take place on either an absolute line number or a relative line number. For example, the existing anchor specifier @x could (almost) be written with a relative offset as:

#- @+1x defines Foo

It is easier to read as a string literal:

#- @+1"x" defines Foo

If x should appear on line 3, you can use an absolute line number:

#- @:3"x" defines Foo

Relative offsets are calculated from the line on which the anchor specifier begins. (This is the way that @x and @+1x differ: the former matches on the first subsequent non-goal line, while the latter always matches on the next line.) Text matching may only occur on lines following the one on which the match is ordered. Matches will not be resolved inside goals.

For completeness, you may use line references together with offset specifications:

#- AnchorX=@x.node/kind anchor
#- AnchorX.loc/start @^+2x
#- AnchorX.loc/end @$:4x

Ambiguity

In the usual case, it is an error if a location specifier has multiple matches:

#- @text defines SomeNode
text text

It is not always possible to rewrite a test program to ensure unambiguous matches. In these situations, you can choose a particular match by adding its ordinal directly after the @ of the specifier:

#- @#0text defines SomeNode
#- @#1text defines AnotherNode
text text

Edge goals

An edge kind, suitably intercalated between two expressions representing vnames, serves to create a goal that looks for a matching edge in the database. The edge kind is automatically given a /kythe/edge/ prefix. The goal

#- SomeAnchor defines SomeNode

will succeed if there exists any edge with the kind /kythe/edge/defines from the vname SomeAnchor to the vname SomeNode.

If the edge kind begins with a /, it is not given the automatic prefix. The line above can therefore be rewritten as:

#- SomeAnchor /kythe/edge/defines SomeNode

To match an edge with an ordinal component, add it after the edge name using a period:

#- SomeNode param.1 SomeParam

While the edge name is a literal, the ordinal is an atom: you may replace it with an evar:

#- SomeNode param.Ordinal SomeParam

Kythe internals can generate edge kinds that begin with % or #. The following edge kinds are equivalent:

#- SomeInternal %impl Target
#- SomeInternal %/kythe/edge/impl Target

#- SomeInternal #implTwo Target
#- SomeInternal #/kythe/edge/implTwo Target

Node goals

The goal evar-exp . name-literal value-exp is satisfied when evar-exp unifies with a vname that appears in a Kythe fact assigning value-exp to the fact called /kythe/name-literal. For example:

#- SomeNode.content 42

generates a goal that can be satisfied when there exists a node with a fact called /kythe/content.

Similar to internal edge kinds, the following fact names are equivalent:

#- SomeInternal.%fact value
#- SomeInternal.%/kythe/fact value

#- SomeInternal.#factTwo value
#- SomeInternal.#/kythe/factTwo value

Getting feedback

You can check the assignments made to evars (and thus examine the evidence used to support the truth of a goal) using the ? operator after any mention of that evar. A callback provided to the verifier at runtime is called once per instance of ? after all goals are successfully solved but (importantly) before the solver unwinds its state.

The default behavior for the verifier is to print inspections to standard out, eg:

#- SomeAnchor? defines SomeNode?
(...)
SomeAnchor: EVar(0x1a7b298 = App(vname, ("", "", 1, "", "")))
SomeNode: EVar(0x1a7b2c8 = App(vname, ("", "", 2, "", "")))

It may also be useful to see a dump of your goals after they have been parsed. For this, pass the --show_goals command-line flag.

If an evar is inspected, it is never considered to be a singleton.

Explicit unification

At times it might be necessary to match subparts of vnames. A vname is stored as a predicate with head atom vname applied to a 5-tuple of (signature, corpus, root, path, language). Elements of a vname that are missing are set to "".

Note	The Kythe storage specification draws no distinction between empty fields and default-valued or unset fields. As such, the verifier unifies both of these states as the empty identifier, `""`.

#- vname(Signature?, Corpus?, Root?, Path?, Language?) defines SomeNode
(...)
Signature: EVar(0x2180888 = Signature)
Corpus: EVar(0x21808b8 = Corpus)
Root: EVar(0x21808e8 = Root)
Path: EVar(0x2180918 = Path)
Language: EVar(0x2180948 = Language)

Note that this is also possible with facts (the head atom is fact), but not recommended.

Sometimes you don’t care about the value in a particular position and never expect to refer to it again. In this case, write down a _, and the parser will create an unnamed evar distinct from any other evar (including other ones instantiated for _):

#- vname(Signature?, Corpus?, Root?, _?, Language?) defines SomeNode
(...)
Signature: EVar(0x7f4948 = Signature)
Corpus: EVar(0x7f4978 = Corpus)
Root: EVar(0x7f49a8 = Root)
_: EVar(0x7f49d8 = Path)
Language: EVar(0x7f4a08 = Language)

This don’t-care rule applies to any identifier beginning with the underscore.

Naming expressions

You may wish to name the result of an expression match. For example, a node in the graph may have two out-edges with the same kind. If you want to match one of the target nodes using a specific vname, then later write specifications about that particular node, you must write down an equality constraint:

#- Node typed Type = vname("java.lang.Object",_,_,_,_)

This is distinct from these rules:

#- Node typed Type
#- Node typed vname("java.lang.Object",_,_,_,_)

In the latter case, Type may unify with a vname distinct from the one spelled in the second rule.

Take care to avoid introducing cyclic constraints. The verifier rejects tests that attempt to create graph structures through unification:

#- Bad typed Mu = vname(_,_,Mu,_,_)
#- Worse typed vname(_,_,Foo = Bar = Foo,_,_)

Checking that rules are unsatisfiable

It is sometimes necessary to assert that a fact does not appear in a Kythe database. For example, you may be testing an indexer that is meant to refrain from indexing certain objects if a flag is set. The verifier provides a mechanism for making these assertions using goal groups. This mechanism is fairly restrictive: goal groups may not be nested and are always processed last when solving. Negated goal groups begin with a bang (!). Not every goal in a negated goal group must fail, but at least one must fail. Assignments made inside a negated goal group are undone once the group is known to fail. Goal groups never cause assignments made by ungrouped goals or previously completed goals (from non-negated groups) to be undone.

#- This must Pass
#- !{ If this Passes
#-    This must Fail }
#- !{ This must AlwaysFail }

Inspections are performed once all goal groups have been resolved, when a goal group fails, or when a negated goal group succeeds.

Specifying the database

The verifier supports reading Kythe facts from standard input. It expects a sequence of (varint32 record-size, kythe::proto::Entry record) records. To request that the verifier print out debug versions of the facts it reads in, pass the --show_protos flag.

The rule file to parse can be specified as a positional argument.

For example, the following command pipes output from an indexer to the verifier, dumping the protocol buffers along the way:

${INDEXER_BIN} -i $1 | ${VERIFIER_BIN} --show_protos $1

Dumping the database as a graph

Passing the --graphviz flag will cause the verifier to ignore rule files and dump its fact database as a Kythe graph to standard out in graphviz format. Note that the database must still be well-formed according to the basic well-formedness checks described earlier.

In contrast, the --annotated_graphviz flag will still attempt to satisfy all rules. If this is successful, it will dump a graph to standard out in graphviz format. In this graph, nodes used to satisfy rules will be colored blue; their labels will also contain the names of the EVars unified with their VNames.

Reading assertions from the input graph

You can instruct the verifier to read assertions from file content stored in the graph database with the --use_file_nodes flag. In certain cases this may be more convenient than providing that content as actual source files. For example, an indexer may synthesize file nodes that never existed in files in the first place.

Making assertions about tree-valued facts

Certain facts have tree-shaped values. These are encoded in various ways defined by the schema. By default, the verifier will truncate these values to unique but unutterable symbols. If you want to expand these facts into subgraphs, use the appropriate --convert_ flag. (As of the time this was written, there is only one such flag, --convert_marked_source.)

`MarkedSource`

MarkedSource values are converted to subgraphs by creating new nodes for each MarkedSource message. These nodes are given unique but unutterable VNames. The original fact with the MarkedSource value becomes an edge with the same name. Each scalar field of the message becomes a fact on the relevant node. Links are decoded to VNames and become unordered link edges. Children are recursively converted and connected from parent to child with ordered child.N edges.

Return code

If the database passes all well-formedness checks and all rules could be satisfied, the verifier returns 0; otherwise, it returns nonzero.