| // Copyright 2014 The Kythe Authors. All rights reserved. |
| // |
| // Licensed under the Apache License, Version 2.0 (the "License"); |
| // you may not use this file except in compliance with the License. |
| // You may obtain a copy of the License at |
| // |
| // http://www.apache.org/licenses/LICENSE-2.0 |
| // |
| // Unless required by applicable law or agreed to in writing, software |
| // distributed under the License is distributed on an "AS IS" BASIS, |
| // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| // See the License for the specific language governing permissions and |
| // limitations under the License. |
| |
| Kythe Indexer Verification |
| ========================== |
| |
| Kythe is intended to be source language agnostic. This extends to the facility |
| for writing tests to check whether Kythe is correctly processing source text. |
| Here we describe the *verifier*, a standalone component that checks that a |
| *database* of Kythe facts supports a set of *goals*. |
| |
| == Recommended use |
| |
| For examples of tests that use the verifier, please see |
| https://github.com/kythe/kythe/tree/master/kythe/cxx/indexer/cxx/testdata/basic[//kythe/cxx/indexer/cxx/testdata/basic] |
| or https://github.com/kythe/kythe/tree/master/kythe/javatests/com/google/devtools/kythe/analyzers/java/testdata/pkg[//kythe/javatests/com/google/devtools/kythe/analyzers/java/testdata/pkg]. |
| It is best to keep the size of a single test small and to develop assertions |
| while frequently checking the graph produced by `--annotated_graphviz`. |
| |
| Assuming that you have Kythe installed to `/opt/kythe`, you can start using |
| the verifier on a $$C++$$ file right away: |
| |
| /opt/kythe/indexers/cxx_indexer -i foo.cc -- -std=c++1y \ |
| | /opt/kythe/tools/verifier foo.cc --annotated_graphviz > foo.dot |
| xdot foo.dot |
| |
| By default (and assuming no errors) you will see the Kythe graph for `foo.cc`. |
| The remainder of this document discusses the kinds of assertions you can make |
| about the nodes and edges in the graph. |
| |
| There is a https://kythe.io/docs/schema/verifierstyle.html[style guide] with |
| some suggestions on how to write readable tests. |
| |
| == Basic well-formedness checks |
| |
| The verifier begins by checking that the input database obeys basic Kythe |
| conventions and some uniqueness properties: |
| |
| * All vnames must have at least one field set. |
| * Facts must be one of the following forms: |
| - `(source, edge, target, /, "")` |
| - `(source, edge, target, /kythe/ordinal, base10string)` |
| - `(source, "", "", string, _)` |
| * forall `s k`, there may be at most one fact matching `(s, "", "", k, _)`. |
| * forall `s e t k v`, there may be at most one fact matching `(s, e, t, k, v)`. |
| |
| For the moment, these checks are done by code baked into the verifier; in the |
| future, they may be specified in a startup script. |
| |
| == Specifying goals |
| |
| Verification goals may be inlined into ordinary source text. Lines beginning |
| with language-specific magic comments are treated as goals. By default, the |
| verifier looks for goals in comments that start with `//-`: |
| |
| //- this-is-a-goal |
| |
| For languages supporting shell-style comments, like Python or text protobufs, |
| instead use `#-` and tell the verifier to expect the different syntax using the |
| `'--goal_prefix=#-'` flag: |
| |
| #- this-is-a-goal |
| |
| The verifier will ignore any combination of spaces or tabs before the comment |
| token. Long goals may be split across multiple lines, each of which must begin |
| with the comment token (after zero or more tabs or spaces). Within the assertion |
| language, the verifier is whitespace insensitive. |
| |
| You may also use a regular expression (using |
| https://github.com/google/re2/wiki/Syntax[RE2 syntax]) to specify which lines |
| include goals and which parts of those lines to include. Your regex must match |
| the entire line and must contain a single capture group. The default goal |
| matching behavior uses the regex `\s*\/\/\-(.*)`. To specify an alternate |
| regex--for example, one that requires goals to be bracketed--use |
| `--goal_regex='\s*\/\/\-\s*\[(.*)\]'`. |
| |
| You may use BCPL-style comments in the assertion language: |
| |
| //- this-is-a-goal // and this is ignored. |
| |
| The `//` token is independent of the goal prefix or regex: |
| |
| #- this-is-a-goal // and this is ignored. |
| |
| === Literal strings |
| |
| The verifier permits literal strings to be specified with C-style escapes: |
| |
| "foo" |
| "foo \"bar\"" |
| "foo \\" |
| |
| Currently, the allowed escape codes are `\"` and `\\`. It is an error to use |
| a disallowed escape code. |
| |
| === Existential variables |
| |
| Any identifier beginning with a capital letter is parsed as an existential |
| variable, or *evar*. Evars are initially unset. In the course of solving a goal, |
| the verifier may attempt to find values for zero or more evars. It is possible |
| for all goals to be satisfied with some evars left unset. All input files share |
| the same evar namespace. |
| |
| If you want to match against text beginning with a capital letter, you must |
| surround it with quotes to form a string literal. |
| |
| IMPORTANT: Any mention of `SomeEvar` refers to the *same* `SomeEvar`, even |
| across different rules, unless the identifier starts with an underscore. |
| Starting a variable's name with an underscore indicates to the verifier that |
| you don't care about its value. |
| |
| The verifier can return an error result if any (uninspected) existential |
| variable is mentioned only once. This usually indicates that you've made a typo |
| somewhere in your goal script. To enable this behavior, pass the |
| `--check_for_singletons=true` flag. This behavior will eventually become the |
| default. |
| |
| === Anchor specifiers |
| |
| Tests written against Kythe indexers must frequently refer to *anchor* nodes, as |
| these connect the syntax of the source code with the indexer's semantic model. |
| It is alternately possible to write down the vname of the syntactic or semantic |
| objects, but this makes the test depend on a particular vname encoding strategy |
| that may be unreasonable to compute manually. The verifier provides syntactic |
| sugar for referring to anchor nodes. When it encounters an anchor specifier, the |
| verifier generates an evar to refer to the anchor's vname. It also generates |
| goals to ensure that vname refers to an anchor node with the specified source |
| location. |
| |
| ==== Tokens |
| |
| The location specifier `@tok` refers to the source text range for the identifier |
| `tok` on the next source line. This may not be the next line in the input file, |
| as is the case when the next line begins with a magic comment as previously |
| described. It is an error if matching `tok` is ambiguous. Substring matching is |
| used. `tok` may be a literal string; `@text` and `@"text"` are equivalent. |
| |
| For example, the following two rules both match tokens from the same line. |
| If `@text` and `@more` were swapped in this example, the rules would remain |
| valid. |
| |
| #- @text defines SomeNode |
| #- @more defines OtherNode |
| ##text more |
| |
| The following two rules match tokens from different lines: |
| |
| #- @text defines SomeNode |
| ##text |
| #- @more defines OtherNode |
| #more |
| |
| If `@text` and `@more` were swapped in this example, the rules would be invalid. |
| |
| The constraints generated for some `@text` are of the form: |
| |
| #- TextEvar.node/kind anchor |
| #- TextEvar./kythe/loc/start starting-offset |
| #- TextEvar./kythe/loc/end ending-offset |
| |
| These constraints notably leave out the file that the anchor is a child of. |
| If there is concern that the anchor may match against an equivalent range in |
| a different file, you can use an equality constraint to bind a name to the |
| anchor's implicit evar, then constrain this evar to be the child of some |
| file node. |
| |
| IMPORTANT: Each mention of `@tok` generates a fresh evar. |
| |
| ==== Offset specifications |
| |
| If you are only interested in the byte offset of a token or string literal, |
| you may use an offset specification. `@^tok` will become the offset of the |
| `t` on the next source line. `@$tok` will become the offset immediately after |
| the `k` on the next source line. The rules for `tok` are the same as for |
| anchor specifiers; it may also be a literal string. |
| |
| The following snippet demonstrates the relationship between anchors and offsets. |
| `AnchorX=@x` binds `AnchorX` to the evar generated for `@x`; see the section |
| below on <<subexp,naming expressions>> for details. |
| |
| #- AnchorX=@x.node/kind anchor |
| #- AnchorX.loc/start @^x |
| #- AnchorX.loc/end @$x |
| |
| NOTE: Offset specifications are equivalent to the atomic string literal |
| containing the decimal stringification of the relevant byte offset. They are |
| manifested as evars that are unified before solving begins. |
| |
| ==== Anchors that refer to lines far in the future |
| |
| It may not be possible to place a goal string on the line directly before the |
| text you want to match. In this case, you can specify that the match should take |
| place on either an absolute line number or a relative line number. For example, |
| the existing anchor specifier `@x` could (almost) be written with a relative |
| offset as: |
| |
| #- @+1x defines Foo |
| |
| It is easier to read as a string literal: |
| |
| #- @+1"x" defines Foo |
| |
| If `x` should appear on line 3, you can use an absolute line number: |
| |
| #- @:3"x" defines Foo |
| |
| Relative offsets are calculated from the line on which the anchor specifier |
| begins. (This is the way that `@x` and `@+1x` differ: the former matches on the |
| first subsequent non-goal line, while the latter always matches on the next |
| line.) Text matching may only occur on lines following the one on which the |
| match is ordered. Matches will not be resolved inside goals. |
| |
| For completeness, you may use line references together with offset |
| specifications: |
| |
| #- AnchorX=@x.node/kind anchor |
| #- AnchorX.loc/start @^+2x |
| #- AnchorX.loc/end @$:4x |
| |
| ==== Ambiguity |
| |
| In the usual case, it is an error if a location specifier has multiple matches: |
| |
| #- @text defines SomeNode |
| text text |
| |
| It is not always possible to rewrite a test program to ensure unambiguous |
| matches. In these situations, you can choose a particular match by adding its |
| ordinal directly after the `@` of the specifier: |
| |
| #- @#0text defines SomeNode |
| #- @#1text defines AnotherNode |
| text text |
| |
| === Edge goals |
| |
| An edge kind, suitably intercalated between two expressions representing vnames, |
| serves to create a goal that looks for a matching edge in the database. The edge |
| kind is automatically given a `/kythe/edge/` prefix. The goal |
| |
| #- SomeAnchor defines SomeNode |
| |
| will succeed if there exists any edge with the kind `/kythe/edge/defines` from |
| the vname `SomeAnchor` to the vname `SomeNode`. |
| |
| If the edge kind begins with a `/`, it is not given the automatic prefix. The |
| line above can therefore be rewritten as: |
| |
| #- SomeAnchor /kythe/edge/defines SomeNode |
| |
| To match an edge with an ordinal component, add it after the edge name using a |
| period: |
| |
| #- SomeNode param.1 SomeParam |
| |
| While the edge name is a literal, the ordinal is an atom: you may replace it |
| with an evar: |
| |
| #- SomeNode param.Ordinal SomeParam |
| |
| Kythe internals can generate edge kinds that begin with $$%$$ or $$#$$. The |
| following edge kinds are equivalent: |
| |
| #- SomeInternal %impl Target |
| #- SomeInternal %/kythe/edge/impl Target |
| |
| #- SomeInternal #implTwo Target |
| #- SomeInternal #/kythe/edge/implTwo Target |
| |
| === Node goals |
| |
| The goal `evar-exp . name-literal value-exp` is satisfied when `evar-exp` |
| unifies with a vname that appears in a Kythe fact assigning `value-exp` to the |
| fact called `/kythe/name-literal`. For example: |
| |
| #- SomeNode.content 42 |
| |
| generates a goal that can be satisfied when there exists a node with a fact |
| called `/kythe/content`. |
| |
| Similar to internal edge kinds, the following fact names are equivalent: |
| |
| #- SomeInternal.%fact value |
| #- SomeInternal.%/kythe/fact value |
| |
| #- SomeInternal.#factTwo value |
| #- SomeInternal.#/kythe/factTwo value |
| |
| === Getting feedback |
| |
| You can check the assignments made to evars (and thus examine the evidence used |
| to support the truth of a goal) using the `?` operator after any mention of that |
| evar. A callback provided to the verifier at runtime is called once per instance |
| of `?` after all goals are successfully solved but (importantly) before the |
| solver unwinds its state. |
| |
| The default behavior for the verifier is to print inspections to standard out, |
| eg: |
| |
| #- SomeAnchor? defines SomeNode? |
| (...) |
| SomeAnchor: EVar(0x1a7b298 = App(vname, ("", "", 1, "", ""))) |
| SomeNode: EVar(0x1a7b2c8 = App(vname, ("", "", 2, "", ""))) |
| |
| It may also be useful to see a dump of your goals after they have been parsed. |
| For this, pass the `--show_goals` command-line flag. |
| |
| If an evar is inspected, it is never considered to be a singleton. |
| |
| === Explicit unification |
| |
| At times it might be necessary to match subparts of vnames. A vname is stored |
| as a predicate with head atom `vname` applied to a 5-tuple of `(signature, |
| corpus, root, path, language)`. Elements of a vname that are missing are set |
| to `""`. |
| |
| NOTE: The Kythe storage specification draws no distinction between empty fields |
| and default-valued or unset fields. As such, the verifier unifies both of these |
| states as the empty identifier, `""`. |
| |
| #- vname(Signature?, Corpus?, Root?, Path?, Language?) defines SomeNode |
| (...) |
| Signature: EVar(0x2180888 = Signature) |
| Corpus: EVar(0x21808b8 = Corpus) |
| Root: EVar(0x21808e8 = Root) |
| Path: EVar(0x2180918 = Path) |
| Language: EVar(0x2180948 = Language) |
| |
| Note that this is also possible with facts (the head atom is `fact`), but not |
| recommended. |
| |
| Sometimes you don't care about the value in a particular position and never |
| expect to refer to it again. In this case, write down a `_`, and the parser will |
| create an unnamed evar distinct from any other evar (including other ones |
| instantiated for `_`): |
| |
| #- vname(Signature?, Corpus?, Root?, _?, Language?) defines SomeNode |
| (...) |
| Signature: EVar(0x7f4948 = Signature) |
| Corpus: EVar(0x7f4978 = Corpus) |
| Root: EVar(0x7f49a8 = Root) |
| _: EVar(0x7f49d8 = Path) |
| Language: EVar(0x7f4a08 = Language) |
| |
| This don't-care rule applies to any identifier beginning with the underscore. |
| |
| [[subexp]] |
| === Naming expressions |
| |
| You may wish to name the result of an expression match. For example, a node in |
| the graph may have two out-edges with the same kind. If you want to match |
| one of the target nodes using a specific `vname`, then later write |
| specifications about that particular node, you must write down an equality |
| constraint: |
| |
| #- Node typed Type = vname("java.lang.Object",_,_,_,_) |
| |
| This is distinct from these rules: |
| |
| #- Node typed Type |
| #- Node typed vname("java.lang.Object",_,_,_,_) |
| |
| In the latter case, `Type` may unify with a `vname` distinct from the one spelled |
| in the second rule. |
| |
| Take care to avoid introducing cyclic constraints. The verifier rejects tests |
| that attempt to create graph structures through unification: |
| |
| #- Bad typed Mu = vname(_,_,Mu,_,_) |
| #- Worse typed vname(_,_,Foo = Bar = Foo,_,_) |
| |
| === Checking that rules are unsatisfiable |
| |
| It is sometimes necessary to assert that a fact does *not* appear in a Kythe |
| database. For example, you may be testing an indexer that is meant to refrain |
| from indexing certain objects if a flag is set. The verifier provides a |
| mechanism for making these assertions using goal groups. This mechanism is |
| fairly restrictive: goal groups may not be nested and are always processed |
| last when solving. Negated goal groups begin with a bang (`!`). Not every goal |
| in a negated goal group must fail, but at least one must fail. Assignments made |
| inside a negated goal group are undone once the group is known to fail. Goal |
| groups never cause assignments made by ungrouped goals or previously |
| completed goals (from non-negated groups) to be undone. |
| |
| #- This must Pass |
| #- !{ If this Passes |
| #- This must Fail } |
| #- !{ This must AlwaysFail } |
| |
| Inspections are performed once all goal groups have been resolved, when a |
| goal group fails, or when a negated goal group succeeds. |
| |
| === Specifying the database |
| |
| The verifier supports reading Kythe facts from standard input. It expects |
| a sequence of `(varint32 record-size, kythe::proto::Entry record)` records. |
| To request that the verifier print out debug versions of the facts it reads in, |
| pass the `--show_protos` flag. |
| |
| The rule file to parse can be specified as a positional argument. |
| |
| For example, the following command pipes output from an indexer to the verifier, |
| dumping the protocol buffers along the way: |
| |
| ${INDEXER_BIN} -i $1 | ${VERIFIER_BIN} --show_protos $1 |
| |
| === Dumping the database as a graph |
| |
| Passing the `--graphviz` flag will cause the verifier to ignore rule files |
| and dump its fact database as a Kythe graph to standard out in graphviz format. |
| Note that the database must still be well-formed according to the |
| basic well-formedness checks described earlier. |
| |
| In contrast, the `--annotated_graphviz` flag will still attempt to satisfy |
| all rules. If this is successful, it will dump a graph to standard out in |
| graphviz format. In this graph, nodes used to satisfy rules will be colored |
| blue; their labels will also contain the names of the EVars unified with their |
| VNames. |
| |
| === Reading assertions from the input graph |
| |
| You can instruct the verifier to read assertions from file content stored in |
| the graph database with the `--use_file_nodes` flag. In certain cases this may |
| be more convenient than providing that content as actual source files. For |
| example, an indexer may synthesize file nodes that never existed in files in |
| the first place. |
| |
| === Making assertions about tree-valued facts |
| |
| Certain facts have tree-shaped values. These are encoded in various ways defined |
| by the schema. By default, the verifier will truncate these values to unique but |
| unutterable symbols. If you want to expand these facts into subgraphs, use the |
| appropriate `--convert_` flag. (As of the time this was written, there is only |
| one such flag, `--convert_marked_source`.) |
| |
| ==== `MarkedSource` |
| |
| `MarkedSource` values are converted to subgraphs by creating new nodes for each |
| `MarkedSource` message. These nodes are given unique but unutterable VNames. |
| The original fact with the `MarkedSource` value becomes an edge with the same |
| name. Each scalar field of the message becomes a fact on the relevant node. |
| Links are decoded to VNames and become unordered `link` edges. Children are |
| recursively converted and connected from parent to child with ordered |
| `child.N` edges. |
| |
| === Return code |
| |
| If the database passes all well-formedness checks and all rules could be |
| satisfied, the verifier returns *0*; otherwise, it returns nonzero. |