Gauntlet: Finding Bugs in Compilers for Programmable Packet Processing

06/01/2020
by   Fabian Ruffy, et al.

Programmable packet-processing devices such as programmable switches and network interface cards are becoming mainstream. These devices are programmed in a domain-specific language such as P4, using a compiler to translate packet-processing programs into instructions for different targets. As networks with programmable devices become widespread, it is critical that these compilers are dependable. This paper considers the problem of finding bugs in compilers for packet processing in the context of P4-16. We introduce domain-specific techniques to induce both abnormal termination of the compiler (crash bugs) and miscompilation (semantic bugs). We apply these techniques to (1) the open-source P4 compiler (P4C) infrastructure, which serves as a base for different P4 back ends; (2) the P4 back end for the P4 reference software switch; and (3) the P4 back end for the Barefoot Tofino switch. Across the 3 platforms, over 4 months of bug finding, our tool Gauntlet detected 78 new and distinct bugs (47 crash and 31 semantic), which we confirmed with the respective compiler developers. 44 have been fixed (27 crash and 17 semantic); the remaining have been assigned to a developer. Our bug-finding efforts also led to 6 P4 specification changes. We have open sourced Gauntlet at https://github.com/p4gauntlet.

1 Introduction

Programmable packet-processing devices in the form of programmable switches and network interface cards (NICs) are now common. Such devices provide network flexibility, allowing operators to customize their network, researchers to experiment with new network algorithms, and equipment vendors to upgrade features rapidly in firmware rather than waiting for new hardware. At the core of this move to programmable packet processing are the domain-specific languages (DSLs) for packet processing, along with the compilers that compile DSL programs. As network programmability becomes widespread, these DSL compilers will need to be as dependable as general-purpose compilers such as GCC and LLVM. Motivated by these concerns, this paper considers the problem of finding bugs in compilers for packet processing. Because of the large open-source community around it, we ground our paper in the context of P4 [7], but our ideas also extend to similar DSLs such as NPL [8].

Bug finding in compilers is a well-studied topic, especially in the context of C [48, 26, 46, 13, 27]. Past approaches (§2) to bug finding in C compilers include fuzz testing by using randomly generated C programs [48, 26], translation validation (i.e., proving that a compiler correctly translated a given input program to an output program) [33, 36], and verification of individual compiler passes [30]. These prior approaches have to contend with many difficulties inherent to a general-purpose language like C, e.g., generating random programs that avoid undefined and unspecified behavior [48, 26], providing semantics for pointers and memory aliasing [30], and inferring loop invariants and simulation relations to successfully perform translation validation [36].

Our key insight is that, by leveraging the domain-specific nature of P4, we can avoid much of the complexity associated with bug finding for general-purpose languages. This results in simpler, and hence easier to implement, bug-finding techniques. We leverage this insight to build a compiler bug-finding tool for P4 called Gauntlet. Gauntlet uses three key ideas that each build on the prior one: random program generation, translation validation, and symbolic execution. We now describe these ideas and show how the restrictions of P4 allow them to be simpler than in prior work.

First, we use random program generation (§4) to produce syntactically correct and well-typed P4 programs that still induce P4 compiler crashes. Because P4 has very little undefined behavior [15, §7.1.6], random program generation is considerably simpler for P4 than for C [48]. This is because the generator does not have to painstakingly avoid generating programs with undefined and unspecified behavior, which can be interpreted differently across different compilers. The smaller and simpler grammar of P4 relative to C also simplifies the development of a random program generator.

Second, we use translation validation (§5) [36, 33] to find miscompilations in P4 compilers in which we can access the transformed program after every compiler pass. Traditionally, translation validation for languages like C has suffered from false alarms [33, 46], where the validator is unable to prove the correctness of translations and reports a compiler bug instead. Fundamentally, false alarms are inevitable for unrestricted C: proving program equivalence in the presence of unbounded loops is undecidable. In P4, however, translation validation is both theoretically decidable, because the language is finite by design (finite in that input and output packets and state are finite bit vectors, and loops are either bounded (parsing [15, §12]) or forbidden (control flow [15, §13])), and practically easier, because of the absence of pointers, aliasing, loops, and gotos.

Third, we use symbolic execution (§6) to generate input-output test packets for P4 programs based on the semantics we had to develop for translation validation. We use these test packet pairs to find miscompilations in black-box and proprietary P4 compilers where we can not access the transformed program after every compiler pass. Symbolic execution for general-purpose languages [11] is effective at generating inputs that provide sufficient path coverage. But without language semantics, determining the correct output for these test inputs is hard. By creating formal semantics for P4 for translation validation, we are able to generate both input and output test packets that can test the compiled output for a program.

We applied Gauntlet to 3 platforms (§7): (1) the open-source P4 compiler infrastructure (P4C) [9], which serves as a common base for different P4 compiler implementations; (2) the P4 back end for the open-source P4 behavioral model (bmv2) [6], a reference software switch for P4; and (3) the P4 back end for Barefoot Tofino, a high-speed programmable switching chip [4]. Across these 3 platforms, and over 4 months of testing, we found a total of 78 new and distinct bugs, all of which were confirmed and assigned to a compiler developer. Our efforts also led to 6 changes to the P4 specification. 44 of these bugs have already been fixed. We analyze these bugs in detail and describe where they were found, their root causes, and which commits introduced them. We conclude by describing our limitations (§8): the bugs we can not find, the compiler passes we can not handle, and the language constructs we do not cover. We have open sourced our tools at https://github.com/p4gauntlet.

2 Background and Motivation

2.1 Approaches to Testing Compilers

Levels of compiler testing. A compiler must reject incorrect programs with an appropriate error message and accurately translate correct programs. However, a program can be correct to varying levels. McKeeman [31] provides a taxonomy of these levels in the context of C (Table 1). Each level corresponds to the program passing deeper into the compiler before it is rejected (e.g., lexer, parser, type checker, optimizer, code generator). The difficulty of generating test programs also goes up with increasing input level. For instance, while general-purpose fuzzers such as AFL [49] are sufficient to stress test the lexer, more sophistication is required to generate syntactically correct and well-typed programs, which are required to test the optimizer. In the context of the P4 compiler, we observed very limited success in bug finding using a general-purpose fuzzer such as AFL, indicating that testing at the first few levels of Table 1 is already handled adequately by P4’s open-source compiler test suite [9, §3.4].

Hence, for this paper, we only consider programs at the higher levels: static, dynamic, and model-conforming. These are programs that pass the lexing, parsing, type checking, and semantic analysis phases of the compiler, but still trigger compiler bugs. Like Csmith [48], we categorize bugs into crash bugs and semantic bugs. A crash bug occurs when the compiler abnormally terminates on an input program without producing either an output program or a useful error message. Crash bugs include segmentation faults, assertion violations, incomplete error messages, and out-of-memory errors. A semantic bug occurs when the compiler produces an output executable, but its behavior is different from the input program, e.g., due to an incorrect program transformation in a compiler optimization pass. In P4, semantic bugs simply manifest as any packet output that differs from the expected result. Crash bugs correspond to level 5 in Table 1; semantic bugs correspond to levels 6 and 7.

Level  Input Class                    Example of incorrect input
1      Sequence of ASCII characters   Binary files
2      Sequence of words and spaces   Variable name beginning with $
3      Syntactically correct          Missing semicolon
4      Type correct                   Adding int to string
5      Statically conforming          Undefined variables
6      Dynamically conforming         Program throwing exceptions
7      Model-conforming               Program producing wrong outputs
Table 1: McKeeman [31]’s 7 levels of C compiler correctness.

Bug-finding strategies. We now look at how these bugs are found. A key challenge in compiler bug finding is the oracle problem. Given an input program to a compiler, the expected outcome (i.e., should it accept/reject the program and what should the output be?) is unclear unless one consults an all-knowing oracle. Below, we outline the major techniques used to approximate this oracle knowledge.

In differential testing [31], given two compilers, which both receive the same input program, if compiler A’s output (after compiling and running the program) differs from compiler B’s output, there is a bug in one of them. This works as long as there are at least two independent compiler implementations for the same language. Csmith [48] is one example of this approach; it feeds the same randomly generated C program to multiple C compilers and checks whether the outputs generated by executing the binary produced by each compiler differ. Another example is Different Optimization Levels (DOL) [13], which selectively omits compiler optimizations and compares compiler outputs with and without these optimization passes. If the end result differs after specific passes have been skipped or added, it points to a bug. This technique can be used in any compiler framework that supports selective omission of optimizations.
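To make the DOL idea concrete, here is a minimal Python sketch of such a harness; the compiler command and its pass-selection flag are hypothetical placeholders, not the interface of any particular P4 or C compiler.

# Sketch of a DOL-style differential harness (illustrative only).
# compile_and_run() stands for a hypothetical driver that compiles `prog_path`
# with the given passes disabled, runs the result on a fixed input, and
# returns the observed output.
import subprocess

def compile_and_run(prog_path, disabled_passes):
    cmd = ["my-compile-and-run", prog_path]          # hypothetical command
    if disabled_passes:
        cmd += ["--disable-passes", ",".join(disabled_passes)]
    return subprocess.run(cmd, capture_output=True, text=True).stdout

def dol_check(prog_path, passes):
    baseline = compile_and_run(prog_path, disabled_passes=[])
    for p in passes:
        variant = compile_and_run(prog_path, disabled_passes=[p])
        if variant != baseline:
            # Output changed when pass p was skipped: points to a bug around p.
            print("possible miscompilation involving pass", p)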

Metamorphic testing [14] can serve a similar role as differential testing, especially when multiple compilers are not readily available or optimization passes can not be easily disabled. Instead of feeding the same input program to different compilers, different input programs that are expected to produce the same output are fed to the same compiler. The outputs from these different input programs are then compared to determine whether there is a bug. EMI is an example of this approach [26]. Given a randomly generated C program and a random input to this program, EMI uses the path coverage tool gcov to identify dead code in the program when run on that input. EMI then prunes away this dead code to produce new programs whose outputs must agree with the original program's output on that input. EMI then compiles and runs both the original and the pruned programs to check whether they indeed produce the same output when given that input.

Translation validation is a bug-finding technique that converts the program before and after a compiler optimization pass into a logical formula and checks if both programs/formulas are equivalent using a constraint solver [33, 36, 30, 50]. A failed check indicates a semantic bug. Program equivalence is an undecidable problem for Turing-complete languages such as C, requiring manual assistance to perform translation validation. Typical examples of manual assistance are (1) simulation relations, which encode correspondences between variables in two programs; and (2) loop invariants, required to prove the equivalence of programs with loops. Both can often be learned from running programs on test suites [41, 33], but they are not guaranteed to be precise.

2.2 Motivating Gauntlet's Design

We now turn to finding bugs in P4 compilers. Our approach borrows from prior compiler bug-finding techniques, but adapts them to programmable packet processing. From EMI and Csmith, we borrow the idea of generating random programs that are lexically, syntactically, and semantically correct. Unlike EMI and Csmith, however, our random program generation is simpler because it does not have to avoid undefined behavior, which, by design, is limited in P4-16.

However, we can not directly apply either differential or metamorphic testing to P4 compilers. Differential testing requires two or more independent compiler implementations that are comparable in their output. P4-16 compilers for different hardware and software targets are not comparable because program behavior is target-dependent [9, §2.1]. On the other hand, presently there aren't multiple independent compilers for the same target. Further, code coverage tools like gcov, which metamorphic-testing approaches such as EMI require, are not part of the P4 ecosystem. At the same time, many crash bugs in P4 compilers do not require the full power of differential or metamorphic testing to tease out, because they lead to assertion violations. In other words, there is already an oracle for many crash bugs without requiring a second compiler implementation.

For semantic bugs, on the other hand, the domain-specific nature of P4 makes translation validation easier relative to general-purpose languages such as C. P4 programs are finite-state and finite-time, making program equivalence decidable at a theoretical level. P4's lack of pointers, memory aliasing, and unstructured control flow (e.g., goto) makes it practically easier to develop translation validation for the language. Additionally, an approach based on translation validation provides more completeness than randomized testing approaches such as EMI and Csmith because it exhaustively searches over all packet inputs to a P4 program to find semantic bugs.

Hence, we combine random testing with formal methods to find P4 compiler bugs. First, we generate random programs to find crash bugs. Second, we convert these random programs into Z3 [16] logic formulas and assert that the logic formulas before and after a compiler pass are equivalent. Third, we use the same logic formulas to determine test cases (i.e., input-output packet pairs) for these random programs. These test cases are then used to test the target implementation of the P4 programs for proprietary compilers.

2.3 Goals and Non-Goals

Find many, but not all bugs. Our goal is to find many crash and semantic bugs in the P4 compiler, but our tool is not exhaustive. Specifically, we are not seeking to build a fully verified compiler like CompCert [28], given the large labor and time costs associated with such an undertaking and the diversity of P4 targets. For instance, after many years of development, CompCert still only supports a restricted subset of C. Instead, our goal is to strengthen existing P4 compilers, not to write a safe replacement.

Check the compiler, not the programmer. We are not trying to verify that a particular P4 program is written correctly relative to a specification or that it is devoid of certain kinds of bugs. This problem is addressed by orthogonal work on P4 program verification [29, 37, 44, 21, 19] and P4 testing [42]. Although Gauntlet can in principle be used for program verification, we have not optimized it to scale to such use cases. Currently, we can not verify large data-plane switch programs such as switch.p4 [43] in a reasonable amount of time. The random programs we generate to find bugs in the P4 compiler are much smaller and more targeted than switch.p4, and our tool does not need to generate and efficiently solve Z3 formulas for large P4 programs to tease out compiler bugs.

Develop target-independent techniques. We are designing our tools to be as target-independent as possible and specialize them to test the front and mid end of the compiler. While we support restricted forms of back-end testing (§6.1), we do so in a way that allows us to quickly integrate and adapt to new back ends without having to understand detailed target-specific behavior. In particular, we do not cover target-specific semantics such as externs [15, §4.3].

Only test mature compilers. We only test mature compilers such as P4C and the corresponding behavioral model (both have entered "permanent beta-status" since November 2019: https://github.com/p4lang/p4c/issues/2080), as well as the commercial Tofino compiler. For example, P4C also supports other back ends such as the eBPF, uBPF, and PSA targets, which are pre-alpha-quality, preliminary compiler toolchains. Finding bugs in them is likely unhelpful for the respective compiler developers at this moment.

3 Background on P4

Figure 1: An example P4 compilation model.

P4 is a statically typed, domain-specific language designed to describe computations on network packet headers. This paper focuses on P4-16, the latest version of P4 [15]. Figure 1 shows the main P4-16 concepts, which we explain below.

Packages and targets. A P4 program consists of a set of procedures; each procedure is loaded into a programmable block of the target (e.g., a switch [4] or NIC [35]). These programmable blocks correspond to various subsystems such as the parser or the match-action pipeline. The package lists the available programmable blocks in a target. One example of a package for a target is the v1model, which models the architecture of the bmv2 [6] software switch target, referred to as "simple switch" [20]. For simplicity, we will refer to bmv2 as the target instead of simple switch.

P4 compilers. A P4-16 compiler translates a P4-16 program and the target package model into target-dependent instructions. These target instructions are combined with the non-programmable blocks (e.g., a fixed scheduler) to form the target's data plane. These instructions also specify how this data plane can be accessed and configured by the control plane (see also Figure 1). P4C [9] is the official open-source reference compiler infrastructure of the P4-16 language and implements the current state of the specification. P4C employs a nanopass design [40], which takes shape as a composable library of front- and mid-end compiler passes that perform code analysis, transformation, and optimization on input programs. We analyze these nanopasses using translation validation.

Compiler back ends. To implement a P4-16 compiler, developers write back ends that use P4C's front- and mid-end passes along with their own back-end-specific transformations to translate P4-16 code into instructions for their own target. In this paper, we focus on 2 production-grade back ends: the Tofino [4] and bmv2 [6] back ends.

Parsers and control blocks. A parser is a finite state machine that transforms an incoming byte sequence received at the target into a structured representation of header definitions. For example, incoming bytes may be parsed as packets containing Ethernet, IP, and TCP/UDP headers. A deparser converts this representation back into a byte sequence. Control blocks describe the per-packet operations that are performed on the input header. These operations are described in the form of the core primitives of the language: tables, actions, metadata, and extern objects.

Tables. Tables are objects in the control block similar to a Java map or Python dictionary. Table entries are match-action pairs inserted by the network’s control plane [32, 12]. When a table is applied to a packet traversing the control block, its header is compared against the match key of all match-action entries in the table. If any entry’s key matches the header, the action associated with the match is executed. Actions are procedures that can modify state and/or input headers.

Calling conventions. P4-16 uses "copy-in/copy-out" [15, §6.7] semantics for method calls. For any callable object in P4, the parameter direction (also known as mode [22, §8.2]) explicitly specifies which parameters are read-only and which parameters can be modified, with the modifications persisting after the function terminates. Modifiable parameters are labeled with the direction inout or out in the definition of the procedure; read-only parameters are marked in. At the start of a procedure call, the arguments are copied left-to-right into the associated parameter slots. Parameters with the out label remain uninitialized. Once the procedure has terminated, all procedure parameters with the label inout or out are copied back to the original arguments.
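To make the calling convention concrete, the following is a small Python sketch of copy-in/copy-out evaluation; the helper and its argument encoding are our own illustration, not part of the P4 specification or of Gauntlet.

# Sketch of copy-in/copy-out evaluation for a call such as f(x, y), where the
# parameter directions might be [("a", "in"), ("b", "inout")].
def call_copy_in_copy_out(body, directions, args, env):
    params = {}
    # Copy-in: arguments are evaluated and copied left to right into parameter slots.
    for (name, direction), arg in zip(directions, args):
        # out parameters remain uninitialized on entry.
        params[name] = env[arg] if direction in ("in", "inout") else None
    body(params)  # the callee only ever touches its own copies
    # Copy-out: only inout and out parameters are written back to the arguments.
    for (name, direction), arg in zip(directions, args):
        if direction in ("inout", "out"):
            env[arg] = params[name]

# Example: a callee that increments its inout parameter "b".
env = {"x": 1, "y": 2}
call_copy_in_copy_out(lambda p: p.update(b=p["b"] + 1),
                      [("a", "in"), ("b", "inout")], ["x", "y"], env)
print(env)  # {'x': 1, 'y': 3}: only the argument bound to the inout slot changed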

Metadata. Metadata is programmer-defined or target-specific data that is associated with a packet header, while it traverses the target. Examples of metadata include the packet input port, packet length, queue depth, or priority; this information is interpreted by the target according to target-specific rule sets independent of the P4 language specification. Metadata can also be modified during the execution of the control block.

Externs. Externs are an extensibility mechanism, which allows targets to describe built-in functionality; externs are object-like and have methods. Examples include calls to checksum units, hash units, counters, and meters. P4’s calling conventions allow reasoning about externs to some degree; they act like uninterpreted functions with limited side effects.

Categories of P4 compiler bugs. While we focused our bug-finding techniques on the crash and semantic bugs that we described earlier, we also tracked other types of bugs, which we did not include in our total bug count. For instance, some of our bugs resulted in changes to the P4 specification [15], which we term specification bugs. Similarly, in some cases the compiler did not output syntactically correct programs after a compiler pass; we labeled these cases as invalid transformations. These bugs aren't semantic or crash bugs, but were still considered bugs and fixed by the compiler maintainers.

4 Random Program Generation

4.1 Design

We require diverse input programs to exercise many compiler passes, and hence to trigger crash or semantic bugs in those passes. P4C already contains a sample of over 600 programs as part of its test suite, but it is unlikely that these programs trigger bugs. As part of testing, the compiler developers typically inspect the reference output of each of the compiled test programs after the front- and mid-end passes to check for regressions [9, §3.4]. While P4Fuzz [1] is a tool that can generate random P4 programs, we found that the programs generated by P4Fuzz are not complex enough to trigger new crash or semantic bugs. For instance, programs generated by P4Fuzz contain arbitrarily long and random variable names, but such programs are useful only for triggering bugs from the first few levels of Table 1. We require randomness at the level of the abstract syntax tree, not within a single token.

Instead, we developed our own generator for random P4 programs. With this generator we can exercise the majority of language constructs in P4, producing diverse test programs that cover a range of unique combinations of P4 expressions. We use these test programs to find inputs that lead to an unexpected crash in any pass of the compiler for the specific target. The amount of randomly generated code in our tool is user-configurable, allowing us to keep the size of the program under test small and targeted. This allows us to find an ample number of semantic bugs while avoiding the path explosion problems of symbolic execution on these random programs.

The design of our random program generator is influenced by Csmith [48] and follows its philosophy of generating only well-formed input programs that pass the lexer, parser, and type checker. At a high level, our generator grows an abstract syntax tree (AST) corresponding to the random program by probabilistically determining what kind of AST node to add at each step. By adjusting the probabilities of generating each AST node, we can steer the generator towards the language constructs we want to focus on. However, whereas Csmith avoids generating C programs with undefined behavior, we accommodate it. This is because we chose to provide our own semantics for undefined behavior in P4 as part of the logic formulas that we generate during translation validation. This allows us to inform compiler developers of suspicious, but not necessarily wrong, compiler transformations on P4 programs with undefined behavior when they diverge from our semantics.
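As a rough illustration of this probabilistic AST growth, the toy Python generator below expands statement and expression nodes with configurable probabilities; the node kinds, weights, and emitted strings are illustrative only and do not reflect the actual P4C-based implementation.

# Toy sketch of probabilistic AST growth for a small P4-like statement block.
import random

def gen_expression(depth=0):
    # Terminals become more likely as the expression tree deepens.
    if depth > 2 or random.random() < 0.5:
        return random.choice(["hdr.a", "hdr.b", "8w1", "8w2"])
    op = random.choice(["+", "-", "&", "|"])
    return f"({gen_expression(depth + 1)} {op} {gen_expression(depth + 1)})"

def gen_statement(depth=0):
    kind = random.choices(["assign", "if", "block"], weights=[6, 2, 1])[0]
    if kind == "assign" or depth > 3:   # cap the nesting depth
        return f"hdr.a = {gen_expression()};"
    if kind == "if":
        return (f"if ({gen_expression()} == {gen_expression()}) "
                f"{{ {gen_statement(depth + 1)} }}")
    return "{ " + " ".join(gen_statement(depth + 1)
                           for _ in range(random.randint(1, 3))) + " }"

print(gen_statement())  # e.g. hdr.a = (hdr.b & 8w2);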

4.2 Implementation

We implement our random P4 program generator as an extension to P4C. The generator uses the intermediate representation (IR) of P4C to automatically grow an abstract syntax tree (AST) by expanding branches of the tree at random. For example, a block statement may generate up to (say) 10 statements or declarations, which in turn may result in further sub-nodes. This AST is then converted into a P4 program using P4C's ToP4 module. Our random program generator can be specialized towards different compiler back ends by providing a skeleton of the back-end-specific P4 package, back-end-specific restrictions, and a specification of which package blocks are to be filled with program snippets. We have currently implemented two back ends for our random program generator, corresponding to the bmv2 [20] and Tofino [4] targets.

Programs generated by our random program generator are required to be syntactically sound and well-typed. Our aim is not to test whether the compiler's parser can correctly catch syntax errors; P4C's existing test suite already covers those. If P4C's parser and type checker (correctly) reject a generated program, we consider this to be a bug in our random program generator. For example, if an action parameter has an inout or out qualifier, only writable variables may be passed as arguments.

5 Translation Validation

Figure 2: Translation validation in Gauntlet.

5.1 Design

To detect semantic bugs, we employ translation validation [36], a classic technique from the compiler literature in which an external tool certifies that a particular compiler pass has correctly transformed a given input program. To perform translation validation for P4, we developed a symbolic interpreter for the P4-16 language to transform P4 programs into Z3 formulas [16].

Figure 2 describes P4 translation validation. To validate a P4 program, the symbolic interpreter converts the program into a Z3 formula capturing its input-output semantics. An equivalence checker then submits the Z3 formulas of a program before and after a compiler pass to the Z3 SMT solver. The solver tries to find an input that violates equivalence of these two formulas. If it finds such an input, we report it as a semantic bug to the compiler maintainers. Translation validation has two advantages over random testing. First, it can accurately detect subtle differences in program semantics without any knowledge about expected input packets or table entries. Second, if we can access intermediate P4 programs after each compiler pass, we can pinpoint the erroneous pass.

5.2 Implementation

Like our random program generator, we wrote the interpreter as an extension to P4C. We use the IR generated by the parser to determine the semantics of the P4 program. Each programmable block of a P4 package represents an independent Z3 formula. For example, the v1model package [20] of the bmv2 back end has 6 different independent programmable blocks: Parser, VerifyChecksum, Ingress, Egress, ComputeChecksum, and Deparser. For each block, we generate a new Z3 formula. We now describe the symbolic interpreter and equivalence checker in more detail.

Developing the symbolic interpreter. Overall, it took us 5 months of implementation effort until our symbolic interpreter was reliable enough to find new semantic bugs in P4 compilers, as opposed to encountering false alarms that were actually interpreter bugs. The fact that P4C contains a sizeable test suite [9, §3.4] was helpful in stress testing our interpreter. We started our development process by performing translation validation on programs in the test suite. A semantic bug reported on one of these test programs is probably a false alarm and a bug in our interpreter, because it is unlikely that the compiler miscompiles test-suite programs: the reference outputs of each test after the front- and mid-end passes are tracked as part of regression testing. We continuously consulted with the compiler developers to ensure our understanding of the language semantics was correct.

However, we quickly realized that we also needed to generate random programs to achieve coverage and truly stress test our interpreter. Subsequently, we co-evolved the interpreter with our generator. We attribute part of our success in finding bugs to this development technique, since it forced us to consider many edge cases, more than the existing P4C test suite does. The test suite for our interpreter now has over 600 tests plus an additional 100 tests that we developed.

Eventually, our interpreter became complete and reliable enough to perform translation validation for randomly generated programs. This then allowed us to identify semantic bugs in P4C. After we had detected the first semantic bug, we randomly generated around 10,000 programs every week and added the resulting compiler bugs to our backlog. Adding support for new P4 language features as part of random program generation typically led to a crash in our interpreter; after we fixed our own bug, we were frequently able to find new semantic bugs in the P4 compiler that pertained to those language features. Because any of the compiler passes may have bugs, our symbolic interpreter does not rely on any compiler pass other than the parser and the ToP4 module to produce P4 code from the IR. Hence, we designed our interpreter to handle any P4 program that successfully passed the parser, i.e., before it is desugared.

struct Hdr { bit<8> a; bit<8> b; }
control ingress(inout Hdr hdr) {
  action assign() { hdr.a = 1; }
  table t {
    key = { hdr.a : exact; }
    actions = {
      assign();
      NoAction();
    }
    default_action = NoAction();
  }
  apply {
    t.apply();
  }
}
(a) Simplified P4 program applying a table.
Input:  t_table_key, t_action, hdr
Output: hdr_out
hdr_out =
    if (hdr.a == t_table_key) :
        if (1 == t_action) : Hdr(1, hdr.b);
        otherwise : Hdr(hdr.a, hdr.b);
    otherwise : Hdr(hdr.a, hdr.b);
(b) Its semantic interpretation in Z3 shown in functional form.
Figure 3: A P4 table converted to Z3 semantics.

Converting P4 programs into Z3 formulas. We now describe briefly how we convert a P4 program into a Z3 logic formula. Figure 3 shows an example. Conceptually, our goal is to represent P4 programs in a functional form so that the input-output behavior of the functional form is identical to the input-output behavior of the P4 program. To determine function inputs and outputs, we use the parameter directions of each P4 package. Parameters with the direction inout and out make up the output Z3 data type of the function. Parameters with the in and inout direction make up the input Z3 data type of the function.

To determine the functional form, the symbolic interpreter traverses each path through the P4 program, maintaining expressions representing path conditions for branching. Once it reaches a portion of the program where execution ends, it stores an if-then-else Z3 expression with the condition set to the path condition and the return value set to a tuple consisting of the inout and out parameters at that point. Ultimately, the interpreter will return a single nested if-then-else Z3 expression, with each branch corresponding to a unique output from the program under a set of conditions.

Using this expression, we can perform operations such as equivalence checking between two Z3 formulas for translation validation, or querying Z3 to provide an output for a particular input for test-case generation. While the generated Z3 formulas could in principle be very large because of the path explosion problem, we have not yet had to optimize our formulas, given the small size of our generated programs.

Handling tables. The contents of a table are unknown at compile time. Since we want to make sure we cover any possible table content, we interpret match-action pairs in tables completely symbolically. Figure 3 describes a simplified example of how Gauntlet interprets tables within a control block. Per match-action table call, we generate one symbolic match variable (t_table_key) and one symbolic action variable (t_action), which represent a single match key and its choice of action, respectively. We compare the symbolic packet header with the symbolic match key (hdr.a == t_table_key). If the expression evaluates to true, it implies the execution of a specific action, which is chosen based on the value of the symbolic action index (t_action). We express this as a series of nested if-then-else statements, one per action available to the table. Finally, if the key does not match, the default action is selected. For instance, in Figure 3, we execute action assign (action id 1) iff the symbolic match variable (t_table_key) equals the symbolic header (hdr.a) and the symbolic action variable (t_action) equals 1. With this encoding we avoid having to use a separate symbolic match-action pair for every entry in the match-action table, which would be a prohibitively large number of symbolic variables.
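A minimal z3py sketch of this encoding for the table in Figure 3 might look as follows; the variable names mirror Figure 3, but the code is only an illustration (the actual interpreter is implemented as a P4C extension over P4C's IR).

# Sketch of the symbolic table encoding from Figure 3, using the Z3 Python API.
from z3 import Datatype, BitVecSort, BitVec, BitVecVal, If, simplify

# The Hdr struct from the program becomes a Z3 datatype with two 8-bit fields.
Hdr = Datatype("Hdr")
Hdr.declare("mk_hdr", ("a", BitVecSort(8)), ("b", BitVecSort(8)))
Hdr = Hdr.create()

hdr_a, hdr_b = BitVec("hdr.a", 8), BitVec("hdr.b", 8)
t_table_key = BitVec("t_table_key", 8)  # one symbolic match key for table t
t_action = BitVec("t_action", 2)        # symbolic choice of action for table t

assign_result = Hdr.mk_hdr(BitVecVal(1, 8), hdr_b)  # action assign(): hdr.a = 1
default_result = Hdr.mk_hdr(hdr_a, hdr_b)           # NoAction(): header unchanged

# Nested if-then-else mirroring Figure 3(b): a key match selects the action.
hdr_out = If(hdr_a == t_table_key,
             If(t_action == 1, assign_result, default_result),
             default_result)
print(simplify(hdr_out))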

Header validity. The P4-16 specification does not explicitly restrict the behavior of header validity, so we model our semantics to align with the implementation in P4C; we clarified these assumptions with the compiler and specification maintainers. If a previously invalid header is marked valid, all fields in that header are initially set to arbitrary, unknown values. If an invalid header is returned in the final output, all fields in the header are set to invalid as well.

Interpreting function calls. Any out parameter in a function call is initially set to be undefined. If the function returns a value, we also generate a new free Z3 variable for it. In our interpreter, externs are treated as a normal function call. However, each argument for a parameter that has the label inout or out is set to a new free Z3 variable, because the behavior of an extern is unknown; it acts similar to an uninterpreted function. Copy-in/copy-out semantics, albeit necessary to control side effects in extern objects, have been a persistent source of bugs in the compiler. A significant portion of the semantic bugs we identified were caused by erroneous passes that perform incorrect argument evaluation and side-effect ordering.
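As a rough sketch of this treatment, an extern call can be modeled by replacing each inout or out argument with a fresh, unconstrained Z3 variable; the helper below and its fixed 8-bit sort are our own simplification, not Gauntlet's actual code.

# Sketch: havoc the writable arguments of an extern call with fresh variables,
# treating the extern like an uninterpreted function with limited side effects.
from z3 import BitVecSort, FreshConst

def interpret_extern_call(env, args_with_directions):
    # env maps P4 variable names to their current Z3 expressions.
    for name, direction in args_with_directions:
        if direction in ("inout", "out"):
            # The extern may write anything: bind a new free 8-bit variable.
            env[name] = FreshConst(BitVecSort(8), prefix=name)
    return env

env = {"hdr.a": None}  # whatever value hdr.a held before the call
interpret_extern_call(env, [("hdr.a", "inout")])
print(env["hdr.a"])    # e.g. hdr.a!0, a fresh unconstrained bit vector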

Checking equivalence between P4 programs. We use p4test to emit a P4 program after each compiler pass. p4test is a back end used to test P4C; it does not produce any target code, but exercises all the default front- and mid-end passes. We only examine passes that actually modify the input program and ignore any emitted intermediate program that has a hash identical to its predecessor. We explicitly reparse each emitted P4 file to also catch misbehavior in the parser and the ToP4 module.

For the two versions of a P4 program before and after a compiler pass, A and B, we perform a pair-wise equivalence check. We use our interpreter to retrieve the Z3 formulas for all programmable blocks of the program package and compare each individual block of A to the corresponding block in B. The query for the Z3 solver is a simple inequality: it is satisfiable only if there is a Z3 assignment (e.g., a packet header input or table match-action entry) for which the Z3 formula of A produces a different output from that of B.

If the inequality query is satisfiable, the solver produces the assignment that leads to different results, and we save the failing passes for later analysis. With this technique we can precisely pinpoint in which pass a semantic bug may have happened, and we can also infer the packet values we need to trigger the bug. If the report turns out to be a false alarm and is not confirmed by the compiler developers, we consider this a bug in our translation validation semantics, which we fix.
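In z3py terms, the pair-wise equivalence check for a single programmable block reduces to asking the solver for an input on which the two block formulas disagree; the sketch below assumes hdr_out_A and hdr_out_B stand for the formulas produced by the interpreter before and after a pass.

# Sketch of the equivalence query for one programmable block.
from z3 import Solver, sat

def check_block_equivalence(hdr_out_A, hdr_out_B):
    s = Solver()
    # Satisfiable only if some header/table assignment yields different outputs.
    s.add(hdr_out_A != hdr_out_B)
    if s.check() == sat:
        return s.model()  # counterexample input to attach to the bug report
    return None           # the pass preserved the block's input-output behavior

# Example: compare the table formula from the earlier sketch against a
# (hypothetically miscompiled) version that always applies the default action.
# counterexample = check_block_equivalence(hdr_out, default_result)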

6 Symbolic Execution

Figure 4: Symbolic execution in Gauntlet.

6.1 Design

Our approach to translation validation is applicable only in scenarios where we have access to the IR (and hence the P4 program), because it rests on having semantics for P4. This is the case for P4C, which has a flag that allows us to emit the program after every compiler pass as a transformed P4 program [9, §3.3]. However, in the back end, a P4 compiler employs back-end-specific passes that translate P4 into its own proprietary formats. Some of these proprietary formats are undocumented, making it hard to provide semantics for them. This renders translation validation impractical.

To address the unavailability of P4 in these scenarios, we developed a bug-finding approach based on automatic test-case generation coupled with symbolic execution. We reuse our symbolic interpreter from translation validation to generate a Z3 formula of a generated input program (Figure 4).

With this representation of a P4 program, we can solve for a packet input that corresponds to a given output and craft packets that traverse unique paths in the processing pipeline. We follow a typical approach to symbolic execution [11]. Based on unique path conditions we generate a test case for each path and feed this case into the testing framework of the compiler’s target. If the framework reports a mismatch, we know that there is likely to be a bug. This test technique can identify semantic bugs without requiring access to the P4 program after every intermediate compiler pass. However, unlike the translation validation approach, it is harder to pinpoint the source of the bug.

6.2 Implementation

Symbolic execution requires a functional P4 target and a testing framework that is capable of taking input packets and producing output packets, which can be matched against an arbitrary expected output. We developed symbolic execution for two back ends: (1) the bmv2 back end, which uses the simple test framework (STF) [10] to feed packets to a software test switch, record packet capture files, and verify that the observed packet values correspond to the expected output; and (2) the Tofino back end, which uses the Packet Test Framework (PTF) [5] to inject and receive packets. We use the Tofino software simulator to check for semantic bugs in Tofino. We initially reconfirmed every semantic bug we found on the Tofino hardware target, but ultimately switched to running only the simulator for testing velocity. However, we confirmed all Tofino bugs with the compiler developers.

Undefined variables. Undefined variables are difficult to model in end-to-end tests since any back end is free to perform arbitrary operations on these variables. We were left with two choices: (1) we could avoid undefined behavior in our P4 programs; (2) alternatively, we could ascribe specific values to undefined variables and check if these values conform with the implementation of the particular target. We picked the second approach because it allows independent testing of compiler optimizations in the face of undefined language constructs.

Computing input and output for test cases. We generate a Z3 formula for a given program using our symbolic interpreter. Because we do not have control over paths that involve undefined variables (we cannot force a target to assign specific values to such variables), we add conditions that cause Z3 to only give us solutions for specific, restricted program paths. For any path we can control (for example, a branch that depends on the value of an input header), we compute all the possible input-output values that lead to a new path through the P4 program. This technique is computationally expensive, because the number of paths can be exponential in the length of the program. However, because the size of the generated programs is small, test-case generation followed by testing on a P4 program still completes quickly in practice.

For every path, we feed these conditions into the solver and retrieve a candidate model of input-output values that would cause program execution to go down that path. Because there are typically many solutions for these input-output values, we configure the Z3 solver to give us input and corresponding output values that are non-zero. In some back ends, using zero values by default may mask erroneous behavior. For example, since bmv2 initializes any undefined variable with zero, the bug in program 5(c) would not have been caught had we not asked Z3 for a non-zero input-output pair.
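A simplified sketch of this test-generation step, reusing the names from the earlier table sketch, is shown below; the non-zero constraints and the emission into STF/PTF test files are heavily simplified.

# Sketch: derive an input-output packet pair for one controllable path, asking
# Z3 for non-zero values so that zero-initialization in a target does not mask
# miscompilations.
from z3 import Solver, sat

def test_case_for_path(path_condition, input_vars, output_expr):
    s = Solver()
    s.add(path_condition)
    for var in input_vars:
        s.add(var != 0)          # prefer non-zero header and table values
    if s.check() != sat:
        return None              # path infeasible under these constraints
    m = s.model()
    concrete_in = {str(v): m.evaluate(v, model_completion=True) for v in input_vars}
    expected_out = m.evaluate(output_expr, model_completion=True)
    return concrete_in, expected_out

# Example path: the table key matches and the action `assign` is selected.
# test_case_for_path(And(hdr_a == t_table_key, t_action == 1),
#                    [hdr_a, hdr_b, t_table_key, t_action], hdr_out)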

7 Results

We now analyze the P4 compiler bugs found by Gauntlet. Our findings are summarized below.

We confirmed a total of 78 new, distinct bugs across the P4C framework and the bmv2 and Tofino P4 compilers. Of these bugs, 47 are crash bugs and 31 are semantic bugs.

Our efforts also led to 6 P4 specification changes.

We achieved this in the span of only 4 months of testing with Gauntlet, and despite only generating random programs from a limited subset of the P4-16 language.

Symbolic execution is effective enough to find semantic bugs in closed-source back ends such as the Tofino compiler, despite us not having access to the intermediate representation of the P4 program.

Bug Type Status P4C BMV2 Tofino
Crash Filed
Confirmed
Fixed
Semantic Filed
Confirmed
Fixed
Total
Table 2: Bug summary. Unfixed bugs have been assigned.
Location P4C BMV2 Tofino Total
Front End - -
Mid End - -
Back End -
Total
Table 3: Distribution of bugs in the P4 compilers.

7.1 Sources of Bugs

We distinguish three primary sources of the bugs we found: the common P4C framework, the bmv2 back end, and the Tofino compiler. Both bmv2 and Tofino use the P4C front- and mid-end passes. Hence, most bugs detected in P4C likely also apply to these back ends. Note that since the Tofino back end is closed source, we do not know exactly which passes it uses.

Distribution of Bugs. Table 3 lists where we identified bugs. The overall majority of bugs were found in the front- and mid-end framework, mainly because we concentrated on these areas. The majority of the back-end bugs were found in the Tofino compiler. There are two reasons for this: first, the Tofino back end is more complex than bmv2, as it compiles for a high-speed hardware target; and second, we did not extensively test bmv2, as it is mostly a reference switch for the compiler developers.

Bugs in the P4C infrastructure. As Table 2 shows, we were able to confirm 78 distinct bugs over 4 months. Most were uncovered in P4C, with a roughly even distribution of crash and semantic bugs. Initially, the majority of bugs that we found were crash bugs. However, after these crash bugs were fixed, and as our symbolic interpreter became reliable, the semantic bugs began to exceed the crash bugs.

In addition, several of the bugs we found led to corresponding changes in the specification, as they uncovered missing cases or ambiguous behavior where our interpretation of a specific language construct clashed with the interpretation of the compiler developers and language designers. We also continuously checked out the master branch to test the latest compiler improvements for bugs. A significant number of bugs (16) were caused by recent merges of pull requests during the months in which we used Gauntlet for testing. Gauntlet was able to quickly catch these bugs as well. Thus, we believe it would be useful for the P4 compiler developers to use Gauntlet as a continuous integration tool.

Bugs in the Tofino compiler. Symbolic execution on the proprietary compiler was also successful. We were able to confirm both crash bugs and semantic bugs in the Tofino compiler. We note that all of these bugs were completely distinct from the bugs reported to P4C. In fact, the majority of bugs present in P4C could be reproduced in the Tofino compiler as well, because it uses P4C for its front and mid end. Hence, in our Tofino bug count we do not include any front- and mid-end crash and semantic bugs present in the Tofino compiler that were already triggered as part of testing P4C. We also do not include Tofino compiler crashes that were caused by a missed transformation in the P4C front end; the Tofino back end was relying on these passes to correctly transform specific P4 expressions. We filed two of these crashes in the Tofino compiler as missed-optimization issues in P4C.

Derivative bugs. 5 of the bugs we found were crash bugs that were not directly caused by random programs generated by Gauntlet. Instead, they were caused by handcrafting specific P4-16 programs containing specific language constructs. These handcrafted P4-16 programs were inspired by discussions related to either specification changes or compiler bugs originally found by Gauntlet. We included these bugs in our count because, even though they were handcrafted, they were seeded by bug reports originating from Gauntlet. We also encountered 2 new crash bugs when manually reducing our randomly generated programs for semantic bugs. For instance, we uncovered a crash bug caused by a P4 parser loop because we removed transition expressions [15, §12.5] as part of reducing one of our randomly generated programs. However, all our semantic bugs were found directly by Gauntlet.

Fixing the bugs. Out of the 78 new bugs we filed, 44 have been fixed. The remaining bugs have been assigned to a developer, but are still open because we filed them very recently, because they required a specification change to be resolved first, or because they have been deprioritized in favor of more pressing bug reports. We have received confirmation from the Tofino compiler developers that 4 bugs have already been resolved; the remainder are targeted to be resolved by the next release.

bit<8> test(inout bit<8> x) {
  return x;
}
control ig(inout Hdr h) {
  apply {
    test(h.h.a);
  }
}
(a) A bug caused by a defective pass.
control ig(inout Hdr h, …) {
  apply {
    h.h.a = (1 << h.h.c) + 8w2;
  }
}
(b) A crash in the type checker.
control ig(inout Hdr h, …) {
  apply {
    bool tmp = 1 != 8w2[7:0];
  }
}
(c) An incorrect type checking error.
control ig(inout Hdr h, …) {
  action a(inout bit<7> val) {
    h.h.a[0:0] = 0;
  }
  apply {
    a(h.h.a[7:1]);
  }
}
(d) Incorrect deletion of an assignment.
control ig(inout Hdr h, …) {
  apply {
    h.h.setInvalid();
    h.h.a = 1;
    h.eth.src_addr = h.h.a;
    if (h.eth.src_addr != 1) {
      h.h.setValid();
      h.h.a = 1;
    }
  }
}
(e) An unsafe compiler optimization.
control ig(inout Hdr h, …) {
  action a(inout bit<16> val) {
      val = 3;
      exit;
  }
  apply {
    a(h.eth.eth_type);
  }
}
(f) Incorrect interpretation of exit statements.
Figure 5: Examples of bugs that were caught by Gauntlet.

7.2 Deep Dive into Bugs

Snowball effects. We observed that many crash bugs were caused because a preceding pass incorrectly transformed an expression or did not process it at all. For example, in program 5(a), the SimplifyDefUse [10] pass removed all variables in the caller scope because of the return statement in the function, although inout parameters continue to exist in the scope. The consequence of this optimization was a crash in a subsequent type-checking pass: all variable definitions had been cleared, and the type-checking pass was unable to find the variables corresponding to the inout parameter. In another case, the InlineFunctions [10] pass did not fully inline all function calls, causing a crash in back ends that were not able to understand function calls and expected them to have been inlined by then.

Crashes in the type checker. Many of the crashes (18) were in the type checker infrastructure. The code in 5(b) shows an expression that crashed type checking: it is not possible to shift this value, since its width is unknown at compile time. This code is illegal, but the specification did not explicitly forbid it; the type checker tried to infer a type regardless and crashed. This bug also triggered two updates to the P4-16 specification. In other cases, the type checker incorrectly forbade a valid expression. In example 5(c), the program was legal, but because the StrengthReduction [10] pass was missing a safety check, the resulting slice index was computed to be negative, which prompted the type checker to terminate with an error message.

Handling side effects. Side effects from a function are governed by the copy-in/copy-out semantics described earlier. However, these semantics, while seemingly simple, turn out to be hard to implement correctly in the compiler. A large subset of the semantic bugs we found can be traced back to incorrect handling of side effects and copy-in/copy-out. A particularly tricky case can be seen in 5(d). In the program, a slice of a variable is passed as an inout parameter. At the same time, a disjoint subset of the variable is assigned within the function. The correct behavior here is to keep the assignment and copy back the sliced portion of the variable. However, the compiler assumed that the entire variable would be assigned and removed the assignment in line 3, an incorrect optimization.

Unstable code. We also found incidents of unstable code [47], which, while conforming with the specification, may lead to stability issues in specific back-end targets. Dumitru et al. also discuss the potential safety consequences of undefined variable access in P4-16 [18].

Program 5(e) is a concrete example. The compiler collapses the assignment of line 4 into line 5, setting h.eth.src_addr, which is still part of a valid header, to 1. All of this is legal behavior, since read and write operations on invalid header values are undefined in the P4 specification, and the compiler is free to perform these optimizations. However, these changes may cause issues in specific back ends, e.g., back ends in which assignments to invalid headers are no-ops. In this case, the compiler has chosen one particular interpretation of undefined behavior, which may clash with the expectations of programmers for that back end. We raised this with the compiler developers, who agreed to print a warning.

Consequences of compiler changes. Once we started actively monitoring the master branch of P4C, we observed that many (16) of the bugs we filed were caused by recent merges into master. A notable example is a recent improvement to the Predication [10] pass, which caused at least 4 (1 crash and 3 semantic) new bugs. We caught and filed these bugs quickly during our weekly routine of random code generation. A few P4 programmers also filed bugs on this issue; their reports were considered duplicates because of our earlier reports, highlighting that the bugs we find affect P4 programmers.

Specification changes. Some of our bug reports kicked off larger discussions and changes around the P4 language specification. Our bug reports and questions have led to at least 6 distinct specification changes. For example, a concern we had about the validity of uninitialized headers led to three clarification pull requests on the specification repository and a suggestion to propose more fundamental changes for a 2.0 version of the language.

Another prominent example was caused by ambiguity in the specification. In example 5(f), the RemoveActionParameters [10] compiler pass moved the statement in line 3 after the exit statement, because the assumption was that exits called within functions ignore the copy-in/copy-out semantics. We instead interpreted exit statements to still respect copy-in/copy-out semantics and caught the discrepancy. We filed this as a concern with the open-source community, and our interpretation was deemed reasonable, which required a specification update. The corresponding compiler changes resulted in at least 3 new bugs, which we detected and filed.

Invalid transformations. Lastly, we uncovered several small issues with how P4 code is emitted and transformed across compiler passes. We detected these bugs by reparsing each P4 program after it had been emitted by the ToP4 compiler module. If the emitted program can not be parsed, it indicates a bug in either ToP4 or a compiler pass. While these bugs typically do not harm correctness, they affect compiler debugging. P4C provides the option to emit transformed programs after each pass as a valid P4 program; consequently, the compiler developers try to maintain the invariant that each compiler pass in the front and mid end produces syntactically correct P4. Additionally, we found a case where the emitted program being syntactically incorrect was a symptom of a larger bug that affected the compiler's functionality. Overall, we identified 4 such bugs of invalid intermediate P4; these 4 bugs are not included in our count of 78. All were fixed.

7.3 Lessons Learned

P4C debugging support. P4C has several facilities that turned out to be useful for our bug-finding process. The ability to dump the intermediate representation and to specify which passes to dump, as well as the ToP4 tool, which converts the P4 IR back to P4 programs, greatly accelerated our development process. In addition, the compiler has comprehensive assert instrumentation with distinct messages, which we used to identify unique crash bugs and to distinguish them from error messages. The AST visitor library in P4C allowed us to develop extensions like our random program generator and symbolic interpreter.

P4C's nanopass architecture, which factors the compiler into a large number of “thin” passes, helps significantly with bug fixing, especially in the context of semantic bugs that were narrowed down to one pass by translation validation. A different architecture with fewer “thick” passes would need more developer effort to fix semantic bugs. We also observed that almost all crash bugs were assertion violations where an invariant was violated in a particular compiler pass due to an incorrect or absent compiler transformation in a previous pass. In the absence of such assertions, these crash bugs could have easily manifested as semantic bugs, which are harder to detect.

Reporting bugs. This project would not have been possible without the responsiveness and receptiveness of the P4 community. Our questions, concerns, and bug reports were answered within a day and in great detail. The developers even dissected our initial questions and confusions into bug reports, guiding us further in our development effort. We were encouraged to participate in the language design working group that discusses changes to the P4 specification.

Likewise, when we filed bugs for the closed-source, proprietary Tofino compiler, we found the developers to be receptive and responsive. Still, the pace of bug finding and fixing with the Tofino compiler was slower than with the open-source compiler for two unavoidable reasons. First, we naturally did not have access to the company bug tracker to assess the life cycle of our bugs once they had been filed. Second, the official binary of the Tofino compiler is updated less frequently than P4C, which can be rebuilt from source after every commit. Hence, we would trigger the same bugs repeatedly in our testing until a new Tofino compiler version with a bug fix was released. Neither of these problems would manifest if our tool were used internally as part of the compiler development process for Tofino.

Combining random testing with formal methods. One broader takeaway from our results is that a combination of random testing and formal methods is effective at finding bugs in compilers. We use random testing to induce crash bugs, and we combine it with translation validation and symbolic execution based on formal Z3 semantics to detect semantic bugs using the same programs. This is a departure from prior compiler bug finding, which relies almost exclusively on either random testing [48, 26] or formal methods [23, 33, 30].
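
The random-testing half is simply compiling the generated program and watching for abnormal exits; the sketch below illustrates the formal-methods half, deriving a concrete test packet from a toy Z3 model of a program’s forwarding behavior (the metadata field, port numbers, and forwarding rule are illustrative assumptions, not Gauntlet’s actual semantics):

```python
# Toy Z3 model of a program's input/output behaviour, solved for a concrete
# input that exercises a non-drop path.
import z3

in_port = z3.BitVec("standard_metadata.ingress_port", 9)
DROP_PORT = z3.BitVecVal(511, 9)
# Toy semantics: forward packets arriving on port 1 to port 2, drop the rest.
out_port = z3.If(in_port == 1, z3.BitVecVal(2, 9), DROP_PORT)

solver = z3.Solver()
solver.add(out_port != DROP_PORT)        # request an input that produces output
if solver.check() == z3.sat:
    model = solver.model()
    print("input port:", model[in_port], "expected output port:", model.eval(out_port))
```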

8 Limitations

Limitations of symbolic execution. A key assumption in the symbolic execution approach is that generated test cases can actually be fed to the back end of the compiler under test. We have found that test cases whose input packets have certain header values can be dropped silently by the back end without generating an output packet. Effectively, there is a mismatch between the Z3 semantics, which says that a certain output packet must be produced, and the back end’s semantics, which produces no output packet (this limitation is also known as the “environment problem” [11]). In such cases, we have had to discard these test cases, reducing the breadth of coverage when testing the compiler.
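
In practice this amounts to filtering: a test case predicted by the Z3 semantics is only kept if the target actually emits an output packet. A minimal sketch follows; run_on_target is a hypothetical hook into a back-end test harness (for example, a PTF run) and is not part of Gauntlet’s actual interface:

```python
# Sketch of the workaround for the environment problem: discard test cases
# that the target silently drops instead of reporting them as failures.
from typing import Optional

def run_on_target(input_packet: bytes) -> Optional[bytes]:
    # Hypothetical hook: execute the packet on the back end and return the
    # output packet, or None if the packet was dropped.
    raise NotImplementedError("back-end specific test harness")

def keep_test_case(input_packet: bytes, expected_output: bytes) -> bool:
    observed = run_on_target(input_packet)
    if observed is None:
        # The back end dropped the packet although the Z3 semantics predicted
        # an output; discard the test case rather than raise a false alarm.
        return False
    return True
```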

Missing simulation relations. Our translation validation technique, in its current form, is not resilient to declared but undefined variables being renamed by a compiler pass. There are two potential solutions. We can leverage compiler annotations that preserve the original source name of a variable, or we can augment our equivalence checking with a simulation relation: a mapping from variable names before the pass to variable names after the pass. Early work on translation validation for C automatically infers these simulation relations [33]. We plan to investigate such techniques under the constraint of avoiding false alarms. We note, however, that for the vast majority (53 out of 57) of the compiler passes in which we tried to find semantic bugs, we did not need simulation relations.
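
A simulation relation can be as simple as a renaming map applied before the equivalence query. The sketch below shows the idea using Z3’s substitute; the variable names and formulas are illustrative assumptions:

```python
# Sketch: apply a renaming map (a trivial simulation relation) before checking
# equivalence, so that a pass-introduced rename does not trigger a false alarm.
import z3

tmp_before = z3.BitVec("tmp_0", 8)
tmp_after = z3.BitVec("tmp_1", 8)          # the pass renamed tmp_0 to tmp_1

sem_before = tmp_before + 1
sem_after = tmp_after + 1

# Simulation relation: tmp_1 (after the pass) corresponds to tmp_0 (before).
sem_after_renamed = z3.substitute(sem_after, (tmp_after, tmp_before))

solver = z3.Solver()
solver.add(sem_before != sem_after_renamed)
assert solver.check() == z3.unsat          # equivalent once the renaming is applied
```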

Unsupported language constructs. We currently do not generate programs that contain several P4-16 language features: extern definitions or calls, method overloading, type definitions, variable-length bit vectors, run-time indices, match types such as longest-prefix or ternary matches, type inference for generic types in function bodies, annotations, and various custom table properties. Neither does our symbolic interpreter provide semantics for these P4-16 language constructs. We expect that adding most of these will be conceptually straightforward, although each language construct requires a fair amount of additional engineering. One construct that we anticipate being hard to support is externs, because extern behavior is highly back-end-specific and it is difficult to develop accurate semantics for externs without detailed hardware knowledge of each target.

Unsupported bug types. We do not handle performance bugs, e.g., a program that causes the compiler to run for too long or a program that causes the compiler to generate low-performance code. We anticipate that handling performance bugs would require techniques that are conceptually very different from ours, which target correctness.

Manual test-case reduction. We have not developed an automatic test-case reduction suite (e.g., C-Reduce [38]) and still reduce programs manually, a laborious process. After our testing pipeline has identified problematic programs in a randomly generated batch, we inspect each P4 program individually. We prune the random P4 program that caused the bug until we obtain a sufficiently small program that can be attached to a bug report. We hope to automate this process.
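
A sketch of the greedy, line-level reduction loop we would like to automate is shown below; still_triggers_bug stands in for re-running the compiler (or the whole test pipeline) on the candidate program and is a hypothetical callback, not existing Gauntlet code:

```python
# Sketch of a greedy reduction loop: repeatedly drop one line of the program
# and keep the smaller version whenever the bug still reproduces.
def reduce_program(lines: list[str], still_triggers_bug) -> list[str]:
    changed = True
    while changed:
        changed = False
        for idx in range(len(lines)):
            candidate = lines[:idx] + lines[idx + 1:]   # drop one line
            if still_triggers_bug(candidate):
                lines = candidate
                changed = True
                break
    return lines
```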

9 Related Work

P4K [24] was an early effort to formalize the P4 language using the K-framework [39]. In the process of defining these semantics, the authors found several issues in the P4 specification. Like our tool, P4K supports translation validation. netdiff [17] uses symbolic execution to verify the equivalence of data planes, such as those written in P4, by converting P4 and other data-plane programs into the SEFL language [45], which in turn can be converted to Z3. The Z3 expressions corresponding to different data planes can then be compared for equality. netdiff’s equivalence-checking technique is comparable to our translation validation technique. However, neither P4K nor netdiff was explicitly designed for finding compiler bugs. To enable such bug finding, we need both a source of random P4 programs and a translation validation technique to compare intermediate versions of these programs. Further, for some back ends, such as the Tofino compiler, translation validation is insufficient, requiring us to use symbolic execution instead.

In concurrent work, Kodeswaran et al. [25] use the Ball-Larus encoding [2] to track the execution path of packets traversing the switch. By inspecting this path, a developer can verify that packets have actually taken the expected path through the P4 program. This technique is complementary to our symbolic execution approach. We are considering using it as a path coverage tool for metamorphic testing like EMI [26].

p4pktgen [34] is a P4 test-case generation tool similar to our symbolic execution technique. p4pktgen parses the JSON file generated by the BMv2 back end and converts it into a Z3 formula, which it uses to create test cases. Using p4pktgen, the authors were able to find several bugs in how BMv2 executes these JSON files. However, because it operates on the output JSON instead of the input P4, p4pktgen, unlike Gauntlet, cannot test for bugs in intermediate compiler passes.

petr4 [3] aims to be an independent reference implementation of the P4 language. It supports P4-16 and implements an interpreter that could potentially be used to perform translation validation, fuzz testing, and differential testing on the compiler. Whereas we explicitly target bug finding and have specialized our tools accordingly, petr4 strives to be a fully independent reference implementation of P4. However, the project is still in development and unreleased, which prevented us from using it for our purposes.

10 Conclusion

We presented 3 techniques for finding bugs in P4 compilers. Gauntlet, our combination of random testing and formal methods, is highly effective, uncovering 78 new, confirmed P4 compiler bugs. Ultimately, the restrictions of P4 allow us to simplify traditional compiler bug finding considerably when adapting it to packet processing. Going forward, we plan to work on three areas. First, we plan to extend our tool’s coverage of the P4-16 language. Second, we plan to develop new bug-finding techniques that target classes of bugs (e.g., performance bugs) we currently do not cover. Third, we plan to deploy our tool as part of P4C’s continuous integration infrastructure. We have open sourced Gauntlet at https://github.com/p4gauntlet.

References

  • [1] A. A. Agape and M. C. Dănceanu (2018) P4Fuzz: a compiler fuzzer for securing P4 programmable dataplanes. Master’s Thesis, Aalborg University. Cited by: §4.1.
  • [2] T. Ball and J. R. Larus (1996) Efficient path profiling. In Proc. of the IEEE MICRO, Cited by: §9.
  • [3] J. Bao, S. Bautista, N. Ni, R. Peterson, C. Sommers, and N. Foster (2019) Petr4: formal semantics for programmable networks. Note: Accessed: 2020-05-27. https://github.com/cornell-netlab/petr4 Cited by: §9.
  • [4] Barefoot Industry-first co-packaged optics Ethernet switch. Note: Accessed: 2020-05-27. https://www.barefootnetworks.com/technology/ Cited by: §1, §3, §3, §4.2.
  • [5] A. Bas PTF: Packet Testing Framework. Note: Accessed: 2020-05-27. https://github.com/p4lang/ptf Cited by: §6.2.
  • [6] A. Bas The reference P4 software switch. Note: Accessed: 2020-05-27. https://github.com/p4lang/behavioral-model Cited by: §1, §3, §3.
  • [7] P. Bosshart, D. Daly, G. Gibb, M. Izzard, N. McKeown, J. Rexford, C. Schlesinger, D. Talayco, A. Vahdat, G. Varghese, et al. (2014) P4: programming protocol-independent packet processors. ACM SIGCOMM Computer Communication Review. Cited by: §1.
  • [8] Broadcom (2019) NPL: open, high-level language for developing feature-rich solutions for programmable networking platforms. Note: Accessed: 2020-05-27. https://nplang.org/ Cited by: §1.
  • [9] M. Budiu and C. Dodd (2017) The P4-16 programming language. ACM SIGOPS Operating Systems Review. Cited by: §1, §2.1, §2.2, §3, §4.1, §5.2, §6.1.
  • [10] M. Budiu (2018) The P4-16 reference compiler implementation architecture. Note: Accessed: 2020-05-27. https://github.com/p4lang/p4c/blob/master/docs/compiler-design.pptx Cited by: §6.2, §7.2, §7.2, §7.2, §7.2.
  • [11] C. Cadar, D. Dunbar, D. R. Engler, et al. (2008) KLEE: unassisted and automatic generation of high-coverage tests for complex systems programs.. In Proc. of the USENIX OSDI, Cited by: §1, §6.1, §8.
  • [12] M. Casado, M. J. Freedman, J. Pettit, J. Luo, N. McKeown, and S. Shenker (2007) Ethane: taking control of the enterprise. ACM SIGCOMM Computer Communication Review. Cited by: §3.
  • [13] J. Chen, W. Hu, D. Hao, Y. Xiong, H. Zhang, L. Zhang, and B. Xie (2016) An empirical comparison of compiler testing techniques. In Proc. of the ACM/IEEE ICSE, Cited by: §1, §2.1.
  • [14] T. Y. Chen, S. C. Cheung, and S. M. Yiu (1998) Metamorphic testing: a new approach for generating next test cases. arXiv preprint arXiv:2002.12543. Cited by: §2.1.
  • [15] The P4 Language Consortium (2019-10) The P4-16 language specification, version 1.2.0. Note: https://p4lang.github.io/p4-spec/docs/P4-16-v1.2.0.html Cited by: §1, §2.3, §3, §3, §3, §7.1, footnote 1.
  • [16] L. De Moura and N. Bjørner (2008) Z3: an efficient SMT solver. In International conference on Tools and Algorithms for the Construction and Analysis of Systems, Cited by: §2.2, §5.1.
  • [17] D. Dumitrescu, R. Stoenescu, M. Popovici, L. Negreanu, and C. Raiciu (2019) Dataplane equivalence and its applications. In Proc. of the USENIX NSDI, Cited by: §9.
  • [18] M. V. Dumitru, D. Dumitrescu, and C. Raiciu (2020) Can we exploit buggy P4 programs?. In Proc. of the ACM SOSR, Cited by: §7.2.
  • [19] M. Eichholtz, E. Campbell, N. Foster, G. Salvaneschi, and M. Mezini (2019) How to avoid making a billion-dollar mistake: type-safe data plane programming with safep4. arXiv preprint arXiv:1906.07223. Cited by: §2.3.
  • [20] A. Fingerhut (2018) Behavioral model targets. Note: Accessed: 2020-05-27. https://github.com/p4lang/behavioral-model/blob/master/targets/README.md Cited by: §3, §4.2, §5.2.
  • [21] L. Freire, M. Neves, L. Leal, K. Levchenko, A. Schaeffer-Filho, and M. Barcellos (2018) Uncovering bugs in P4 programs with assertion-based verification. In Proc. of the ACM SOSR, Cited by: §2.3.
  • [22] J. D. Ichbiah, B. Krieg-Brueckner, B. A. Wichmann, J. G. Barnes, O. Roubine, and J. Heliard (1979) Rationale for the design of the Ada programming language. ACM SIGPLAN Notices 14 (6b), pp. 1–261. Cited by: §3.
  • [23] J. Kang, Y. Kim, Y. Song, J. Lee, S. Park, M. D. Shin, Y. Kim, S. Cho, J. Choi, C. Hur, et al. (2018) Crellvm: verified credible compilation for LLVM. In Proc. of the ACM PLDI, Cited by: §7.3.
  • [24] A. Kheradmand and G. Rosu (2018) P4K: a formal semantics of P4 and applications. arXiv preprint arXiv:1804.01468. Cited by: §9.
  • [25] S. Kodeswaran, M. T. Arashloo, P. Tammana, and J. Rexford (2020) Tracking P4 program execution in the data plane. In Proc. of the ACM SOSR, Cited by: §9.
  • [26] V. Le, M. Afshari, and Z. Su (2014) Compiler validation via equivalence modulo inputs. ACM SIGPLAN Notices. Cited by: §1, §2.1, §7.3, §9.
  • [27] V. Le, C. Sun, and Z. Su (2015) Finding deep compiler bugs via guided stochastic program mutation. In Proc. of the ACM OOPSLA, Cited by: §1.
  • [28] X. Leroy (2006) Formal certification of a compiler back-end or: programming a compiler with a proof assistant. In Proc. of the ACM POPL, Cited by: §2.3.
  • [29] J. Liu, W. Hallahan, C. Schlesinger, M. Sharif, J. Lee, R. Soulé, H. Wang, C. Caşcaval, N. McKeown, and N. Foster (2018) p4v: practical verification for programmable data planes. In Proc. of the ACM SIGCOMM, Cited by: §2.3.
  • [30] N. P. Lopes, D. Menendez, S. Nagarakatte, and J. Regehr (2015) Provably correct peephole optimizations with alive. In Proc. of the ACM PLDI, Cited by: §1, §2.1, §7.3.
  • [31] W. M. McKeeman (1998) Differential testing for software. Digital Technical Journal. Cited by: §2.1, §2.1, Table 1.
  • [32] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner (2008) OpenFlow: enabling innovation in campus networks. ACM SIGCOMM Computer Communication Review. Cited by: §3.
  • [33] G. C. Necula (2000) Translation validation for an optimizing compiler. In Proc. of the ACM PLDI, Cited by: §1, §1, §2.1, §7.3, §8.
  • [34] A. Nötzli, J. Khan, A. Fingerhut, C. Barrett, and P. Athanas (2018) P4pktgen: automated test case generation for P4 programs. In Proc. of the ACM SOSR, Cited by: §9.
  • [35] Pensando A New Way of Thinking About Next-Gen Cloud Architectures. Note: Accessed: 2020-05-27. https://p4.org/p4/pensando-joins-p4.html Cited by: §3.
  • [36] A. Pnueli, M. Siegel, and E. Singerman (1998) Translation validation. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems, Cited by: §1, §1, §2.1, §5.1.
  • [37] C. Raiciu (2019) Undefined behaviours in P4 programs: find them, fix them or exploit them. Note: EuroP4 Keynote. https://p4.org/events/2019-09-23-euro-p4-workshop/ Cited by: §2.3.
  • [38] J. Regehr, Y. Chen, P. Cuoq, E. Eide, C. Ellison, and X. Yang (2012) Test-case reduction for C compiler bugs. In Proc. of the ACM PLDI, Cited by: §8.
  • [39] G. Roșu and T. F. Șerbănută (2010) An overview of the K semantic framework. The Journal of Logic and Algebraic Programming. Cited by: §9.
  • [40] D. Sarkar, O. Waddell, and R. K. Dybvig (2004) A nanopass infrastructure for compiler education. ACM SIGPLAN Notices 39 (9), pp. 201–212. Cited by: §3.
  • [41] R. Sharma, E. Schkufza, B. Churchill, and A. Aiken (2013) Data-driven equivalence checking. In Proc. of the ACM OOPSLA, Cited by: §2.1.
  • [42] A. Shukla, K. Hudemann, Z. Vági, L. Hügerich, G. Smaragdakis, S. Schmid, A. Hecker, and A. Feldmann (2020) Towards runtime verification of programmable switches. arXiv preprint arXiv:2004.10887. Cited by: §2.3.
  • [43] A. Sivaraman, C. Kim, R. Krishnamoorthy, A. Dixit, and M. Budiu (2015) DC.P4: programming the forwarding plane of a data-center switch. In Proc. of the ACM SOSR, Cited by: §2.3.
  • [44] R. Stoenescu, D. Dumitrescu, M. Popovici, L. Negreanu, and C. Raiciu (2018) Debugging P4 programs with Vera. In Proc. of the ACM SIGCOMM, Cited by: §2.3.
  • [45] R. Stoenescu, M. Popovici, L. Negreanu, and C. Raiciu (2016) Symnet: scalable symbolic execution for modern networks. In Proc. of the ACM SIGCOMM, Cited by: §9.
  • [46] R. Tate, M. Stepp, Z. Tatlock, and S. Lerner (2009) Equality saturation: a new approach to optimization. In Proc. of the ACM POPL, Cited by: §1, §1.
  • [47] X. Wang, N. Zeldovich, M. F. Kaashoek, and A. Solar-Lezama (2013) Towards optimization-safe systems: analyzing the impact of undefined behavior. In Proc. of the ACM SOSP, Cited by: §7.2.
  • [48] X. Yang, Y. Chen, E. Eide, and J. Regehr (2011) Finding and understanding bugs in C compilers. In Proc. of the ACM PLDI, Cited by: §1, §1, §2.1, §2.1, §4.1, §7.3.
  • [49] M. Zalewski American fuzzy lop. Note: Accessed: 2020-05-27. https://lcamtuf.coredump.cx/afl/ Cited by: §2.1.
  • [50] L. Zuck, A. Pnueli, Y. Fang, and B. Goldberg (2002) VOC: a translation validator for optimizing compilers. Electronic notes in theoretical computer science. Cited by: §2.1.