Testing Compilers for Programmable Switches Through Switch Hardware Simulation

by   Michael D. Wong, et al.

Programmable switches have emerged as powerful and flexible alternatives to fixed function forwarding devices. But because of the unique hardware constraints of network switches, the design and implementation of compilers targeting these devices is tedious and error prone. Despite the important role that compilers play in software development, there is a dearth of tools for testing compilers within the software-defined networking sphere. We present Druzhba, a programmable switch simulator used for testing compilers targeting programmable packet processing substrates. We show that we can model the low-level behavior of a switch's programmable hardware. We further show how our machine model can be used by compiler developers to target Druzhba as a compiler backend. Generated machine code programs are fed into Druzhba and tested using a fuzzing-based approach that allows compiler developers to test the correctness of their compilers. Using a program-synthesis-based compiler as a case study, we demonstrate how Druzhba has been successful in testing compiler-generated machine code using our switch pipeline instruction set.




1 Introduction

Traditionally, network switches have been fixed function; switch behavior is baked into the underlying hardware itself with little to no room for modification in the field. Though programmable switches have been available (e.g. [11]), it was widely believed that fixed function switches would always be cheaper, more power efficient, and much faster. Programmable switches were failing to reach the 1 Tb/s packet forwarding speeds observed in large data centers and enterprises, leading many operators to opt against deploying these systems in their networks. However, operators need to be able to dynamically add new protocols such as MPLS [15] and support new packet processing operations whilst ensuring that the device runs at high speeds. While operators can opt to invest in new fixed function switch hardware with the functionality they require, this is a time-consuming and costly use of resources. It can take up to several years for network switch vendors to produce these new devices due to the complications of designing new software and ASIC hardware. After these switches are finally developed, it takes additional time and effort to set up and integrate them within the operators’ existing networking infrastructure.

The emerging prominence of software-defined networking (SDN) [24] has attempted to mitigate these issues. SDN is a centrally-managed approach to network management that is primarily comprised of two components: the control plane and the data plane. The control plane contains one or more controllers that are decoupled from networking hardware and implemented separately in software and is responsible for managing the high-level decision making such as routing (e.g. [5]) within a network. Protocols such as OpenFlow [17] have emerged to allow the controllers to communicate with the underlying network switches to relay these changes.

The data plane, also known as the forwarding plane, involves a more constrained and local view by solely focusing on packet forwarding itself. Programmability in the data plane has been accompanied by the advent of high speed programmable networking substrates which have drastically increased the freedom network operators have in dynamically changing the packet processing functionality of their network devices. The switching chips for these substrates ([6], [8]) have demonstrated that relative to fixed function chips, a certain level of programmability can be achieved without compromising performance within the data plane. Along with these switching chips, high level domain-specific languages for data plane programming such as Domino [23] and P4 [29] have emerged to configure computational manipulations on both packet fields and switch state alike.

Today, most programmable switching chips contain a pipeline of stages that perform packet processing computations. However, building compilers for these chips remains challenging. Programmers bear the consequences, since they rely on compiler heuristics to map their programs to machine code and an incorrect mapping could result in a binary with erroneous behavior. While the testing and development of traditional compilers has never been easy, the issue is exacerbated for compilers targeting switches. First, switching chips have restrictive budgets of hardware resources such as pipeline stages and arithmetic logic units (ALUs). Second, programmable pipelines have an all-or-nothing nature, meaning that a program either runs at line rate if it fits within a pipeline’s resources or does not run at all. Third, because pipelines require computations to be performed in a fixed order, packet processing capabilities are limited. Designing compilers that can map a wide spectrum of programs to switching chips is therefore difficult, demonstrating the need for testing. Furthermore, bugs can cause severe damage whose effects permeate an entire network, such as security vulnerabilities if ACLs aren’t correctly implemented, heightening the importance of validating compiler correctness. Testing tools have existed for years for compilers and toolchains for traditional programming languages such as C and C++ as well as for traditional instruction set architectures such as x86 and ARM. Yet, bugs continue to be discovered in these systems despite their widespread use (e.g. [25]). While tools for network switches exist for application debugging and studying algorithmic impact at the networking level (e.g. [21], [12]), tools for testing how compilers for programmable switches map programs to switch instruction sets are scarce.

We present Druzhba, a hardware simulator for testing compilers targeting high speed programmable switches. We aid compiler development for pipeline-based switches by modeling the low-level hardware primitives of the RMT (Reconfigurable Match Tables) [6] architecture. We are also in the process of implementing simulation for a network processor-based model, dRMT (Disaggregated Reconfigurable Match Tables) [8]. Druzhba’s RMT model is intended for compiler developers looking to test whether their compilers correctly map high-level programs to the instruction set of the programmable switch. To the best of our knowledge, existing hardware switch simulation and compiler testing tools (e.g. [21], [12], [3]) do not leverage low level instruction set modeling to the extent that we do to test compilers. Druzhba’s approach allows for the detection of erroneous program mappings to the switch pipeline instruction set. We accomplish this by establishing a workflow that enables Druzhba to serve as a compiler target while accurately simulating switch computational behavior. We also show how optimizations can be applied to Druzhba’s code generation component to enhance simulation performance. Druzhba’s source code can be found at https://github.com/chipmunk-project/druzhba-public.

2 Switching Chip Architecture

In this section we discuss high speed packet forwarding performed by switches. We then describe fixed function switch architectures and how they led to the RMT and dRMT programmable switch architectures that we model, followed by our RMT pipeline instruction set modeling methodology.

2.1 Overview

Switches perform high speed packet forwarding which first involves a parser to extract packet fields from an incoming bytestream. Second, they operate on packets using match+action tables. These tables are allocated using local pipeline stage memory and map matches on packet header fields to actions that perform computations on packet header fields, switch metadata, and switch state. Examples of actions include mutating a state variable, dropping a packet, or decrementing a packet’s TTL. CPUs and network processors initially come to mind as ideal candidates for these processing requirements but they do not perform these computations at high speeds. Switching chips can operate at two orders of magnitude faster than many CPUs and one order of magnitude faster than many network processors.
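The match+action abstraction described above can be sketched in Rust. This is only an illustration of the concept and is not taken from Druzhba, which does not model matching; the `Packet`, `Action`, and `MatchActionTable` names, the exact match on a destination address, and the choice of fields are all assumptions made for this example.

```rust
use std::collections::HashMap;

/// Hypothetical packet with two header fields (illustrative only).
#[derive(Debug, Clone, PartialEq)]
struct Packet {
    dst_addr: u32,
    ttl: u8,
}

/// Hypothetical actions mirroring two of the examples in the text.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Action {
    DecrementTtl,
    Drop,
}

/// An exact-match table keyed on the destination address.
struct MatchActionTable {
    entries: HashMap<u32, Action>,
}

impl MatchActionTable {
    /// Returns the processed packet, or `None` if the packet is dropped.
    /// Packets that match no entry pass through unchanged.
    fn apply(&self, mut pkt: Packet) -> Option<Packet> {
        match self.entries.get(&pkt.dst_addr) {
            Some(Action::DecrementTtl) => {
                pkt.ttl = pkt.ttl.saturating_sub(1);
                Some(pkt)
            }
            Some(Action::Drop) => None,
            None => Some(pkt),
        }
    }
}
```

A table is populated by inserting `(dst_addr, Action)` pairs into `entries`; calling `apply` on each incoming packet then performs the lookup and the associated action.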

Fig. 1: The left side represents a high level program (e.g. Domino, P4). The compiler takes in a program and maps it to the Druzhba RMT machine model on the right. Dashed lines show the machine code’s configuration of the multiplexers and ALUs.
Fig. 2: Pipeline with depth and width of 2 and PHV length of 2. Each ALU has 1 PHV container value operand. This demonstrates the pipeline stages’ connections by showing PHVs to ALU inputs and ALU outputs to PHVs.

Fixed function switching chip designs. One of the first switching chip models to employ match+action processing is SMT (Single Match Table) [6], which uses a large single match+action table. It consists of a parser that looks for header fields to match on in an incoming packet, a corresponding deparser at the end of the switch, and one large match+action table; the entries consist of ways to match on incoming packet fields and the different types of actions that can be performed on packets if a match is found. Though this provides an easy-to-understand abstraction, this model is not scalable when many packet headers are used and can lead to a wasteful use of resources. For instance, consider the case where we would like a match on a header field to occur only if a match had occurred on another header field prior to it. This leads to the table having to store the Cartesian product of both fields. This deficiency prompted the development of MMT (Multiple Match Tables) [6], which consists of pipelines of stages with each stage containing local memory to be used for multiple, smaller match+action tables. The pipelines are referred to as the ingress and egress pipelines respectively and are separated by switching fabric which determines the connections between the input and output ports. However, due to the performance requirements of line rate packet forwarding, fixed function packet processing substrates such as SMT and MMT severely limit the freedom in switch program reconfiguration. This is problematic when it comes to implementing new header fields for matching and actions for tasks such as tunneling, queue management, and traffic engineering.

Programmable pipelines. RMT improves upon MMT and also contains pipelines of match+action tables but goes further in enabling programmatic control of the data plane of the switching chip. The first contribution is that the parser is programmable, enabling new header types and fields to be defined without being restricted to pre-defined ones. Second, the size and number of match tables within the switch can be reconfigured. Third, new actions that haven’t been pre-defined can be created. Lastly, more control is given in allowing packets to be placed in specific queues. The design of RMT’s match+action tables reduces wasteful resource consumption and allows for the ability to conform to different algorithmic requirements. On the other hand, for MMT new hardware often needs to be constructed for a specific configuration that a current switch does not support.

Disaggregated processing. Meanwhile, the dRMT network processor design shifts away from the traditional pipeline paradigm and attempts to ameliorate the shortcomings of pipeline architectures while maintaining programmatic control of the switching chip. dRMT decouples the local memory from each pipeline stage into centralized memory clusters that are accessible via a crossbar. Instead of stages, it comprises a set of match+action processors, each running the packet processing program to completion. An incoming packet is sent to one processor, and each processor accesses centralized match+action tables in shared memory through the crossbar, as opposed to RMT, where every stage stores match+action tables in local memory. Due to the feedforward structure of traditional pipeline switches, a packet is confined to traversing the switch tables in a fixed order, whereas in dRMT, match and action operations on packets can be interleaved without this constraint. Also, since memory is global, it can be accessed at different points of a program’s execution. Furthermore, a pipeline introduces possibilities of wasting resources. For instance, if a specified match+action table is large, it can consume the memory of multiple stages, resulting in action units being wasted in all but one stage. dRMT’s match+action units are separate from the memories, which mitigates this issue.

2.2 Compilation to Switch Pipelines

Along with the increased freedom in programmability, compilers are responsible for ensuring that high-level programs are mapped to switch hardware primitives. Within the hardware, the parser generates packet header vectors (PHVs), which are vectors of containers each holding a packet or metadata field; metadata is data associated with each packet. Metadata fields include the number of bytes in the packet or the ingress port on which the packet arrived. Action units are implemented using configurable digital circuits which comprise arithmetic logic units (ALUs) and memories. ALUs perform computations and are either stateful or stateless; stateful ALUs can read and write their switch state values while stateless ALUs solely operate on PHVs. Switch state is data that is stored locally within an ALU, and any modification made to state must be visible to the next PHV that the ALU executes on. Compilers translate programs to machine code using the instruction set of the underlying switching chip to determine which header fields a parser should match on and place into PHVs, implement the tables and ALUs, and generate the connections between ALUs and PHVs. Figure 1 shows the compilation process of taking a high level program written to configure the behavior from the logical view of the switch to machine code. The machine code is then used to program the hardware primitives within our Druzhba machine model; the details of our model are discussed in §


2.3 RMT Instruction Set Modeling

Druzhba doesn’t directly represent the match+action tables, but models the underlying hardware primitives. First, instead of modeling packets directly, we model PHVs for lower level hardware accuracy. Second, we use ALUs to represent the switch action units. Third, we use input and output multiplexers to illustrate the connections between PHVs and ALUs. At the moment, we do not model parsing and matching.

ALU behavior is controlled through opcodes that specify the type of operations to perform and immediate values that are unsigned integer constants. PHV container values are fed into an ALU through input multiplexers with each multiplexer corresponding to an ALU operand. Once the input multiplexers have forwarded the operands to their respective ALUs, the ALUs execute and state variables are written to as needed. Each output multiplexer receives multiple ALU outputs and selects one to write to its allocated PHV container. Figure 2 shows an in-depth view of our model by illustrating Druzhba’s feedforward pipeline structure and the multiplexers that connect the PHVs and ALUs.
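The flow described above (input multiplexer, then ALU, then output multiplexer) can be sketched as a single stage with one stateful ALU. This is a minimal sketch, not Druzhba's generated code: the function names, the two-opcode ALU, and the wrapping arithmetic are assumptions chosen for illustration.

```rust
/// Input multiplexer: the machine code value `select` picks which PHV
/// container is forwarded to the ALU operand.
fn input_mux(phv: &[u32], select: usize) -> u32 {
    phv[select]
}

/// Hypothetical stateful ALU with one PHV operand and one state variable.
/// Opcode 0 adds the operand and the immediate to the state; any other
/// opcode overwrites the state with the immediate.
fn stateful_alu(state: &mut u32, operand: u32, opcode: u32, immediate: u32) -> u32 {
    if opcode == 0 {
        *state = state.wrapping_add(operand).wrapping_add(immediate);
    } else {
        *state = immediate;
    }
    *state
}

/// Output multiplexer: picks one of the ALU outputs for its PHV container.
fn output_mux(alu_outputs: &[u32], select: usize) -> u32 {
    alu_outputs[select]
}

/// One stage containing a single ALU whose result is written to container `dst`.
fn run_stage(
    phv: &[u32],
    state: &mut u32,
    in_sel: usize,
    opcode: u32,
    imm: u32,
    dst: usize,
) -> Vec<u32> {
    let operand = input_mux(phv, in_sel);
    let outputs = [stateful_alu(state, operand, opcode, imm)];
    let mut next = phv.to_vec();
    next[dst] = output_mux(&outputs, 0);
    next
}
```

Because `state` is passed by mutable reference and outlives each call, a write made while processing one PHV is visible to the next PHV that the same ALU executes on.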

Fig. 3: Overview of the grammar of the ALU DSL. Operators include relational, arithmetic, and unary operators. Additional variable declarations include ALU opcodes and immediate operands which can be added using machine code values.

3 Druzhba Design and Implementation

Our Druzhba pipeline simulation is comprised of (1) our pipeline code generator, dgen, and (2) our simulation component, dsim, which uses dgen’s generated code to initiate simulation. In this section, we delve into these details as well as how we employ optimizations to simplify the pipeline code and reduce dsim simulation runtime. Druzhba is written entirely in Rust.

3.1 Hardware Specification

We express our pipeline model by allowing dgen to take specifications of the hardware and convert them into an executable version of the pipeline given (1) the depth and width of the pipeline (i.e. number of stages and number of ALUs per stage), (2) a high-level representation of the ALU structure, and (3) machine code to determine the switch’s behavior. We accomplish this by introducing our ALU DSL to express switching chip ALU capabilities. The accompanying machine code programmatically defines the behavior of the multiplexers and ALUs. The pipeline that is generated by dgen is the design that will be simulated for compiler testing. This flexibility thus effectively allows Druzhba to act as a family of simulators, one for each possible pipeline configuration.

Expressing ALU functionality. We express the capabilities of an ALU via our ALU DSL. This DSL allows us to specify the number of input PHV container value operands and state variables, whether the ALU is stateful or stateless, and the immediates and opcodes that programmatically determine the ALU’s computations. Furthermore, it supports unary and binary expressions as well as additional multiplexers; binary expressions can use either arithmetic or relational operators. Logical operators are also supported. Figure 3 shows the ALU DSL grammar. Using this grammar, we have written 5 stateless ALUs and 6 stateful ALUs that represent the behavior of atoms in Banzai [1], a switch pipeline simulator for Domino. Atoms are Banzai’s natively supported atomic units of packet processing. Figure 4 shows one of our stateful ALUs, which models Banzai’s If Else Raw atom.

Fig. 4: Example of the If Else Raw Banzai atom written using our ALU DSL. Hole variables are comprised of additional machine code values that may be desired in addition to the existing machine code values for the other ALU computations. C() indicates a constant and Opt() indicates a 2-to-1 multiplexer that either returns 0 or its argument.
Fig. 5: Compiler testing workflow with optimizations applied to the pipeline description. Without optimizations, the RMT machine code is given to dsim. Testing is done by checking the equivalence between the 2 output packet traces.

Machine code for switch primitives. Our machine code to run on the pipeline consists of a list of string and integer pairs that specify ALUs’ control flow and computational behavior. Each machine code pair’s string corresponds to one of the pipeline’s hardware primitives. The strings are each given unique names that succinctly denote the primitive that the pair corresponds to and the primitive’s location within the pipeline. The matching value is an integer that determines the behavior of that primitive. An example is an ALU arithmetic operation which uses its machine code value to determine whether to add or subtract its two operands. Our machine code also determines the connections from PHVs to ALU inputs and from ALU outputs to PHVs by specifying the behavior of the input and output multiplexers. For instance, a 3-to-1 input multiplexer uses its machine code value to determine which of its 3 PHV container values to send to the connected ALU. Further machine code pairs are used to represent additional ALU DSL variable declarations, such as ALU opcodes and immediate operands, to specify ALU behavior.
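The two examples in this paragraph, the 3-to-1 input multiplexer and the add-or-subtract arithmetic operation, can be sketched with a hash map of machine code pairs. The key format used here (`stage_<i>_alu_<j>_...`) is a made-up naming convention for illustration and is not Druzhba's actual naming scheme; the helper names are likewise hypothetical.

```rust
use std::collections::HashMap;

/// Machine code: primitive name mapped to the integer controlling its behavior.
type MachineCode = HashMap<String, u32>;

/// Look up a machine code value, panicking with the key name if it is absent.
fn lookup(mc: &MachineCode, key: &str) -> u32 {
    *mc.get(key)
        .unwrap_or_else(|| panic!("missing machine code pair: {}", key))
}

/// 3-to-1 input multiplexer for operand `operand` of the ALU at (`stage`, `alu`).
/// Its machine code value must be 0, 1, or 2; other values panic on indexing.
fn input_mux_3to1(
    mc: &MachineCode,
    stage: usize,
    alu: usize,
    operand: usize,
    phv: &[u32; 3],
) -> u32 {
    // Hypothetical key format encoding the primitive and its location.
    let key = format!("stage_{}_alu_{}_input_mux_{}", stage, alu, operand);
    phv[lookup(mc, &key) as usize]
}

/// Arithmetic operation: machine code value 0 adds, anything else subtracts.
fn arith_op(mc: &MachineCode, stage: usize, alu: usize, a: u32, b: u32) -> u32 {
    let key = format!("stage_{}_alu_{}_arith_op", stage, alu);
    match lookup(mc, &key) {
        0 => a.wrapping_add(b),
        _ => a.wrapping_sub(b),
    }
}
```

Encoding the stage and ALU position in the key gives every primitive a unique name, which is why the pairs supplied by a compiler must follow the simulator's naming convention exactly.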

3.2 Pipeline Generation

dgen makes use of these inputs via Rust pipeline code generation, which involves the generation of code that represents the scaffolding of the pipeline and the ALUs within it. Abstract Syntax Trees (ASTs) are generated to represent the syntactic structures of the given ALU files. As these ASTs are traversed, corresponding Rust code for pipeline simulation is generated. A function is created for each ALU, and helper functions are created for multiplexers and ALU DSL expressions. These helper functions use machine code values to determine their behaviors. Additional values such as ALU opcodes and immediate operands can also be defined using additional machine code values. This process is repeated for every stateful and stateless ALU in a stage and every stage in the pipeline. Once these ALU functions and their corresponding helper functions are generated, additional code is generated to initialize a description of the pipeline using the generated ALU functions and multiplexers to show the connections between these ALUs and the PHVs they read from and write to. This initialization code ensures that the input and output multiplexers as well as the ALUs are executed in the proper order within the pipeline. Further, it utilizes a hash table of machine code pairs and passes these pairs to the proper hardware functions. For instance, it will give input multiplexer functions the machine code values needed to determine which operands to forward to their allocated ALUs. We refer to dgen’s generated code as the pipeline description.
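The AST-to-Rust step can be illustrated with a very small expression tree and an emitter that produces Rust source text. This is a sketch under assumptions: the `Expr` variants, the emitted function signature, and the function names are invented for this example and do not reproduce dgen's actual AST or output.

```rust
/// Tiny expression AST loosely modeled on constructs the text mentions
/// (constants, PHV operands, state variables, binary expressions).
enum Expr {
    Const(u32),
    Phv(usize),
    State(usize),
    Add(Box<Expr>, Box<Expr>),
    Sub(Box<Expr>, Box<Expr>),
}

/// Traverse the AST and emit the corresponding Rust expression as text.
fn emit_expr(e: &Expr) -> String {
    match e {
        Expr::Const(c) => format!("{}", c),
        Expr::Phv(i) => format!("phv[{}]", i),
        Expr::State(i) => format!("state[{}]", i),
        Expr::Add(l, r) => format!("({} + {})", emit_expr(l), emit_expr(r)),
        Expr::Sub(l, r) => format!("({} - {})", emit_expr(l), emit_expr(r)),
    }
}

/// Emit one Rust function for an ALU whose body is the given expression.
/// The signature of the emitted function is a hypothetical choice.
fn emit_alu_fn(name: &str, body: &Expr) -> String {
    format!(
        "fn {}(state: &[u32], phv: &[u32]) -> u32 {{ {} }}",
        name,
        emit_expr(body)
    )
}
```

Calling `emit_alu_fn` once per ALU per stage and concatenating the results would give the kind of generated source file that the text calls the pipeline description.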

Version 1 in Figure 6 shows a simplified sample of a pipeline description from dgen. The top-level function in the sample is a Rust function that represents an ALU defined using our ALU DSL. Its parameters comprise a vector of integers that represents the stateful values stored locally within that ALU, a hash table that maps machine code string names to values, and a vector of PHV containers. The helper functions it calls are generated from our ALU DSL’s expressions, each of which is allocated a corresponding machine code value. For simplicity and concision, the string key names are shortened; our actual machine code strings also indicate the pipeline stage and the position within that stage in which the hardware primitive for that string resides. Since the machine code string names are hardcoded in the pipeline description, it’s essential that the machine code pairs provided by the user follow the proper naming conventions.

3.3 Simulating Computations

Our simulation component, dsim, is divided into RMT dsim and dRMT dsim, which simulate the RMT and dRMT architectures respectively. RMT dsim consists of a traffic generator and enables the feedforward packet-processing behavior based on the design specified in the pipeline description. The pipeline description file is compiled with dsim and used for simulation. The traffic generator creates a sequence of PHVs where every PHV consists of random unsigned integers. In our original design, dsim also took the machine code pairs as input, and they were treated as runtime variables in dgen’s generated pipeline description. They were later made a dgen input instead, to take advantage of opportunities for optimized generation of the pipeline’s Rust code; this is discussed in §3.4.

At every simulation tick, dsim ensures that a PHV created by the traffic generator enters the pipeline and is executed by the first pipeline stage and that PHVs in subsequent stages are sent to their next respective stages. It also ensures that reads and writes are appropriately performed on PHVs and switch state. To prevent a pipeline stage from reading a PHV in the same tick that it was written to by the previous stage causing the PHV to traverse multiple stages during the same tick, dsim models a PHV in two parts: a read half and a write half. A pipeline stage writes its results to the write half of the resulting PHV while the next stage reads that PHV from the read half that holds the values that were written to it from the previous tick. During the beginning of the next simulation tick, the values in the PHV containers within the write half are moved to the read half. Following simulation, an output trace shows the modified PHVs and the state vectors. Testing can be done through fuzzing using the random PHVs generated by the traffic generator and checking the validity of the output trace. This is done by writing a high-level specification capturing the intended algorithmic behavior on both PHVs and state values and recording both the input and output trace. The input trace is then given to the specification which generates its own output trace. Assertions check the equivalence of the output traces to determine if the behaviors of the Druzhba pipeline and the specification match. Figure 5 shows this compiler testing process.
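The read-half and write-half mechanism, together with the fuzzing check against a specification, can be sketched as follows. This is a minimal sketch and not dsim itself: the `PhvSlot` type, the stage behavior (add 1 to every container), the matching `spec` function, and the small linear congruential generator standing in for the traffic generator are all assumptions made to keep the example self-contained.

```rust
/// A PHV slot in front of a stage, split into a read half and a write half.
#[derive(Clone, Default)]
struct PhvSlot {
    read: Option<Vec<u32>>,
    write: Option<Vec<u32>>,
}

/// Hypothetical stage behavior: add 1 to every container.
fn stage_fn(phv: &[u32]) -> Vec<u32> {
    phv.iter().map(|v| v.wrapping_add(1)).collect()
}

/// High-level specification of the whole pipeline: add `depth` to every container.
fn spec(phv: &[u32], depth: usize) -> Vec<u32> {
    phv.iter().map(|v| v.wrapping_add(depth as u32)).collect()
}

/// Small linear congruential generator so the sketch needs no external crate.
fn lcg(seed: &mut u64) -> u32 {
    *seed = seed
        .wrapping_mul(6364136223846793005)
        .wrapping_add(1442695040888963407);
    (*seed >> 33) as u32
}

/// Run `inputs` through `depth` stages, with one PHV entering per tick.
fn simulate(inputs: &[Vec<u32>], depth: usize) -> Vec<Vec<u32>> {
    // slots[i] feeds stage i; slots[depth] collects the pipeline output.
    let mut slots = vec![PhvSlot::default(); depth + 1];
    let mut outputs = Vec::new();
    for tick in 0..inputs.len() + depth {
        // Start of tick: values written last tick become readable.
        for slot in slots.iter_mut() {
            slot.read = slot.write.take();
        }
        // The traffic generator injects a new PHV for the first stage.
        if tick < inputs.len() {
            slots[0].read = Some(inputs[tick].clone());
        }
        // A PHV that has left the last stage is appended to the output trace.
        if let Some(done) = slots[depth].read.take() {
            outputs.push(done);
        }
        // Each stage reads its read half and writes the next slot's write half,
        // so a PHV advances by exactly one stage per tick.
        for s in 0..depth {
            if let Some(phv) = slots[s].read.take() {
                slots[s + 1].write = Some(stage_fn(&phv));
            }
        }
    }
    outputs
}
```

A test in the style the text describes generates random PHVs with `lcg`, records them as the input trace, runs `simulate`, and asserts that the resulting trace equals the one produced by applying `spec` to each input.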

Fig. 6: Simplified pipeline description sample. The ALU function takes a HashMap containing machine code pairs, a vector of PHV containers, and a vector of state variables. Version 1 is unoptimized, version 2 shows sparse conditional constant propagation, version 3 shows function inlining. In this example, the three machine code values are 0, 0, and 1.

3.4 Optimizations

Sparse Conditional Constant Propagation. Initially, machine code was given to dsim instead of dgen, which caused the pipeline description functions to treat the machine code as variables that are passed as arguments during runtime. This allowed machine code to be swapped between simulations without rerunning dgen and recompiling dsim. To enable optimizations, we instead give the machine code as input to dgen and note that (1) providing the machine code pairs during pipeline generation enables a global static mapping of names to values and (2) the functions in our pipeline description use if statements to check these values. These observations allow us to use sparse conditional constant (SCC) propagation [26], which involves constant propagation followed by the abstract interpretation of control flow. We do this by replacing machine code variable occurrences with their corresponding integer values. Then we apply constant folding, evaluating constant expressions so that the results of conditional statements can be determined. This eliminates the dead code on unused control paths, so that only single simplified expressions are emitted in place of the previous function bodies.

For instance, consider an arithmetic operation function that adds its operands if its machine code value is 0 and subtracts otherwise. During optimization, the if statement that checks the machine code value is removed and replaced with either the addition or the subtraction expression. Large machine code values can cause function behavior to branch in many different ways, initially requiring numerous conditional expressions to check every possible value, but these computations are no longer performed during simulation. The version 2 code sample in Figure 6 demonstrates SCC propagation applied to the unoptimized code in version 1. Since the machine code values are known during dgen code generation, the opcode operands for the helper functions are unnecessary and thus do not need to be passed as function arguments. We are then able to use constant folding to calculate the values of the conditional expressions, allowing us to determine the branch that will be taken. This technique is also applied to the ALU function itself when if statements are present.
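The add-or-subtract example can be worked through on a small expression tree. Note the assumptions: this sketch performs constant propagation, constant folding, and dead-branch elimination directly on a tree, which is a simplification of full SCC propagation (normally formulated over a control flow graph), and the `Expr` type and `simplify` function are hypothetical names rather than dgen's implementation.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Const(u32),
    Var(String),         // runtime operand, e.g. a PHV container value
    MachineCode(String), // named machine code value
    Add(Box<Expr>, Box<Expr>),
    Sub(Box<Expr>, Box<Expr>),
    Eq(Box<Expr>, Box<Expr>),
    If(Box<Expr>, Box<Expr>, Box<Expr>), // condition, then, else
}

/// Replace known machine code names with constants, fold constant
/// subexpressions, and keep only the branch a constant condition selects.
fn simplify(e: &Expr, mc: &HashMap<String, u32>) -> Expr {
    match e {
        Expr::Const(_) | Expr::Var(_) => e.clone(),
        Expr::MachineCode(name) => match mc.get(name) {
            Some(v) => Expr::Const(*v),
            None => e.clone(),
        },
        Expr::Add(l, r) => match (simplify(l, mc), simplify(r, mc)) {
            (Expr::Const(a), Expr::Const(b)) => Expr::Const(a.wrapping_add(b)),
            (l, r) => Expr::Add(Box::new(l), Box::new(r)),
        },
        Expr::Sub(l, r) => match (simplify(l, mc), simplify(r, mc)) {
            (Expr::Const(a), Expr::Const(b)) => Expr::Const(a.wrapping_sub(b)),
            (l, r) => Expr::Sub(Box::new(l), Box::new(r)),
        },
        Expr::Eq(l, r) => match (simplify(l, mc), simplify(r, mc)) {
            (Expr::Const(a), Expr::Const(b)) => Expr::Const((a == b) as u32),
            (l, r) => Expr::Eq(Box::new(l), Box::new(r)),
        },
        Expr::If(c, t, f) => match simplify(c, mc) {
            Expr::Const(0) => simplify(f, mc), // condition is false
            Expr::Const(_) => simplify(t, mc), // condition is true
            c => Expr::If(
                Box::new(c),
                Box::new(simplify(t, mc)),
                Box::new(simplify(f, mc)),
            ),
        },
    }
}
```

With the machine code value for the arithmetic operation fixed to 0, `if mc == 0 { a + b } else { a - b }` simplifies to `a + b`; with any other value it simplifies to `a - b`; with no machine code supplied, the conditional is left intact.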

Function Inlining. After the code reduction from our first stage of optimizations, we observe that our pipeline description uses numerous function calls that can be easily reduced. The number of calls can grow very large depending on pipeline dimensions and ALU complexity, making the pipeline description unnecessarily long and abstruse. Currently, for the stateful ALUs expressed in our ALU DSL, each ALU in a pipeline stage can generate more than 50 helper functions, resulting in over 200 function calls for a pipeline depth of 2 and a pipeline width of 2. We mitigate this by allowing dgen to remove the function calls altogether and replace them with the simplified bodies of those functions, which is possible after SCC propagation. We implement function inlining primarily to aid debugging, since the pipeline description becomes more concise and easier to read. Due to the aggressiveness of the Rust compiler’s optimizations, we don’t expect significant improvements in simulation runtime.

The version 3 code sample in Figure 6 demonstrates our use of function inlining. Since the helper functions each contain a simple return statement, emitting the functions is superfluous and adds complexity to the code. The calls to the multiplexer helper functions are first replaced with the values they return. Then the expression within the return statement of the arithmetic helper function is copied into the ALU function, with its operands replaced by those values.

Program                Pipeline depth, width  ALU name     Unoptimized (ms)  SCC propagation (ms)  + Function inlining (ms)
BLUE (decrease) [10]   4,2                    sub          986               576                   576
BLUE (increase) [10]   4,2                    pair         1,268             724                   725
Sampling [23]          2,1                    if else raw  234               167                   169
Marple new flow [16]   2,2                    pred raw     404               215                   215
Marple TCP NMO [16]    3,2                    pred raw     729               481                   480
SNAP heavy hitter [4]  1,1                    pair         143               103                   103
Stateful firewall [4]  4,5                    pred raw     1,549             703                   703
Flowlets [22]          4,5                    pred raw     1,771             983                   983
Learn filter [23]      3,5                    raw          1,911             1,162                 1,163
RCP [27]               3,3                    pred raw     1,261             793                   793
CONGA [2]              1,5                    pair         393               206                   206
Spam detection [4]     1,1                    pair         145               103                   103

TABLE I: RMT runtimes with and without optimizations. ALU names refer to Banzai [1] atoms.

4 DRMT Overview

In this section, we give an overview of our ongoing work in dRMT simulation. In contrast to our RMT instruction set modeling, we model dRMT at the level of matches and actions. To the best of our knowledge, no other dRMT simulation platform exists. We model dRMT at a higher level of abstraction than RMT in order to extend Druzhba’s usefulness to application debugging. Because of this higher level of modeling, reusing the dRMT code to raise the level of abstraction of RMT simulation would be straightforward.

4.1 Network Processor Code Generation

dgen is also used as a code generation component for dRMT and takes care of the necessary preprocessing prior to simulation. dgen takes as input a P4 file representing the algorithmic behavior specified in the context of a feed-forward pipeline. dgen converts the given P4 file into a DAG representing the match+action table dependencies [19]. This DAG, along with other parameterized data (e.g. number of cycles per match), is then sent to the dRMT scheduler [9], which determines the order and timing in which each match and action needs to be performed for optimal speed and to prevent resource contention. Additional information about the hardware constraints is also sent to the scheduler, such as the number of ticks per action unit and the number of ticks per match. The scheduling problem is NP-hard and is formulated as an Integer Linear Program (ILP) (refer to [8] for further details). Once the scheduler has completed, it returns a schedule that dictates which matches and actions are performed at which simulation ticks. Static analysis is performed on both the scheduler output and the initial P4 file to extract data about the program such as header types, packet fields, actions, matches, and other relevant data, all of which is packaged into a Rust file to be used by dsim.
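To make the dependency constraint concrete, the following sketch computes the earliest tick at which each match or action in a dependency DAG could start. It is deliberately not the dRMT scheduler: it ignores resource contention entirely and uses a simple longest-path pass instead of the ILP formulation, and the function name, the latency values, and the requirement that nodes arrive in topological order are all assumptions of this example.

```rust
/// Earliest start tick for each node of a dependency DAG.
/// `latency[i]` is how many ticks node i takes; `deps[i]` lists the nodes
/// that must finish before node i may start. Nodes are assumed to be given
/// in topological order (every dependency has a smaller index).
fn earliest_start(latency: &[u32], deps: &[Vec<usize>]) -> Vec<u32> {
    let mut start = vec![0u32; latency.len()];
    for i in 0..latency.len() {
        for &d in &deps[i] {
            assert!(d < i, "nodes must be listed in topological order");
            // Node i cannot start before dependency d has finished.
            start[i] = start[i].max(start[d] + latency[d]);
        }
    }
    start
}
```

For a chain of match, action, match, action with latencies 2, 1, 2, 1, each node starts when its predecessor finishes. A real schedule must additionally respect per-tick limits on matches and actions, which is what makes the problem hard.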

4.2 Network Processor Simulation

dRMT dsim makes use of dgen’s generated code in a similar manner to RMT dsim. In addition to that code and the number of ticks, it takes as input the number of dRMT match+action processors and a table entries file, written in our own configuration format, that specifies the entries to be added to the match+action tables. These entries populate dsim’s dRMT match+action tables prior to packet processing simulation. Unlike the RMT traffic generator, which generates PHVs, the dRMT dsim traffic generator generates packets with randomly initialized values for the packet fields specified in the P4 file. dsim first unpacks the information originally contained in the P4 file, such as the header types, the match+action table formats, and the stateful memories (e.g. registers, meters, counters), to construct a disaggregated model that corresponds to the given parameters, and then populates the match+action tables. Each table entry in the configuration format primarily consists of (1) the table that the entry will be added to, (2) the packet field to be matched on, (3) the type of match to perform (e.g. ternary, exact), and (4) the corresponding action to be executed if there is a match. At every simulation tick, a packet is generated by the traffic generator and given to a processor in a round-robin fashion. The matches and actions specified by the given table entries are then performed on the packets according to the given dRMT schedule.
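As an illustration only, the four parts of a table entry and the round-robin assignment of packets to processors might be represented as follows. The types and names here (`TableEntry`, `MatchKind`, `processor_for_tick`, and the example table, field, and action names) are assumptions made for this sketch; they do not reflect Druzhba's actual configuration format or internal data structures.

```rust
/// Kind of match, following the examples given in the text (exact, ternary).
#[derive(Debug, Clone, Copy, PartialEq)]
enum MatchKind {
    Exact { value: u32 },
    Ternary { value: u32, mask: u32 },
}

/// Hypothetical in-memory form of one table entry. The four fields mirror
/// items (1)-(4) of the description, not the real file format.
#[derive(Debug, Clone)]
struct TableEntry {
    table: String,   // (1) table the entry is added to
    field: String,   // (2) packet field to match on
    kind: MatchKind, // (3) type of match
    action: String,  // (4) action executed on a match
}

impl TableEntry {
    /// Returns true if `field_value` matches this entry.
    fn matches(&self, field_value: u32) -> bool {
        match self.kind {
            MatchKind::Exact { value } => field_value == value,
            // Only the bits selected by the mask take part in the comparison.
            MatchKind::Ternary { value, mask } => (field_value & mask) == (value & mask),
        }
    }
}

/// Round-robin choice of processor for the packet generated at `tick`.
fn processor_for_tick(tick: usize, num_processors: usize) -> usize {
    tick % num_processors
}

fn main() {
    let entry = TableEntry {
        table: "ipv4_table".to_string(),
        field: "ipv4.dst".to_string(),
        kind: MatchKind::Ternary { value: 0x0A00_0000, mask: 0xFF00_0000 },
        action: "set_nhop".to_string(),
    };
    let dst: u32 = 0x0A01_0203; // 10.1.2.3
    for tick in 0..4 {
        let p = processor_for_tick(tick, 3);
        if entry.matches(dst) {
            println!(
                "tick {}: processor {} runs {} from {} on {}",
                tick, p, entry.action, entry.table, entry.field
            );
        }
    }
}
```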

5 Evaluation

In this section we evaluate the performance of Druzhba on a number of different packet transactions. Every RMT benchmark was executed by using 50000 PHVs generated from the traffic generator.

5.1 Benchmarks

We execute our benchmarks on 12 packet processing programs, measuring the time taken to simulate 50000 PHVs for each one using Rust’s benchmark tests. For each program, we measure the time taken to run (1) the unoptimized pipeline simulation, (2) the optimized pipeline simulation using SCC propagation, and (3) the optimized pipeline simulation using both SCC propagation and function inlining. We obtained machine code programs for these algorithms using a program-synthesis-based compiler [31]. The complexity of each program and the number of PHV containers it uses dictated the pipeline dimensions needed to implement the intended algorithmic behavior.

Generally, the programs in Table I that showed the most significant improvements from our optimizations were those with the largest pipeline depths and widths, such as stateful firewall, flowlets, and learn filter. Since the amount of generated pipeline code is commensurate with pipeline size, the unoptimized runtime was much higher for larger pipeline simulations, and the optimizations affected a greater portion of their code. The ALUs used in each benchmark varied significantly in complexity and also affected pipeline generation, but we found that ALU complexity had a much lower impact on performance.

5.2 Case Study

Druzhba has proven successful as a testing tool for Chipmunk [31], a compiler for packet processing pipelines. Chipmunk uses program synthesis [20] to generate machine code in the form of constant integers from a given Domino file; these constants can be used to target Druzhba’s instruction set. We tested Chipmunk by first creating multiple Domino programs and generating the corresponding machine code. We then defined the PHV structure and algorithmic behavior of each Domino program in Rust and wrote assertion statements to check that the output PHVs and state variables of our Rust implementation were equivalent to those of the pipeline simulation running the machine code.
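The assertion-based check can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual test harness: `simulated_pipeline` is a hand-written stand-in for a pipeline simulation loaded with compiler-generated machine code, the two-container `Phv` layout and the packet-counter behavior are invented for the example, and a small xorshift generator replaces Druzhba's traffic generator so that the code needs no external crates.

```rust
/// Packet header vector with a fixed number of containers (assumed layout).
#[derive(Debug, Clone, PartialEq)]
struct Phv {
    containers: [u32; 2],
}

/// High-level specification written directly in Rust: a packet counter kept
/// in `state` whose new value is also written into container 1.
fn spec(phv: &Phv, state: &mut u32) -> Phv {
    *state += 1;
    Phv { containers: [phv.containers[0], *state] }
}

/// Stand-in for the pipeline simulation running generated machine code.
fn simulated_pipeline(phv: &Phv, state: &mut u32) -> Phv {
    let next = *state + 1;
    *state = next;
    Phv { containers: [phv.containers[0], next] }
}

/// Tiny xorshift generator; the seed must be non-zero.
fn xorshift(seed: &mut u32) -> u32 {
    *seed ^= *seed << 13;
    *seed ^= *seed >> 17;
    *seed ^= *seed << 5;
    *seed
}

/// Feeds `n` random PHVs to both implementations and returns the index of
/// the first input on which output PHVs or state variables differ.
fn first_mismatch(n: usize, mut seed: u32) -> Option<usize> {
    let (mut spec_state, mut sim_state) = (0u32, 0u32);
    for i in 0..n {
        // Container values are restricted to 10 bits for this example.
        let input = Phv { containers: [xorshift(&mut seed) % 1024, 0] };
        let expected = spec(&input, &mut spec_state);
        let actual = simulated_pipeline(&input, &mut sim_state);
        if expected != actual || spec_state != sim_state {
            return Some(i);
        }
    }
    None
}

fn main() {
    assert_eq!(first_mismatch(50_000, 0x1234_5678), None);
    println!("outputs and state agree on all generated PHVs");
}
```

Returning the index of the first mismatch, rather than only a pass/fail result, makes it easier to locate the offending input in the trace.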

Over 120 Chipmunk machine code programs were determined to be correct after testing, validating both the accuracy of Druzhba’s simulation and Chipmunk’s code generation. Testing with Druzhba also initially revealed 8 program failures, which occurred either because the provided machine code was incompatible with the pipeline or because of assertion failures between the pipeline output trace and the high-level specification’s output trace. Two of these failures were due to machine code pairs missing from the input file that were needed to program the behavior of the pipeline’s output multiplexers. The remaining failures resulted from insufficient machine code values, which caused the pipeline simulation to fail for large PHV container values (over 100). This was because the synthesis engine failed to find machine code satisfying 10-bit inputs in the allotted time and therefore returned machine code that was correct only for a limited range of values.

6 Related Work

Network simulation tools have long been used. Simulators such as Mininet [14] emulate numerous data communications devices within a network but do not focus on the specific details of the data plane. PFPSim [21] models the pipelined architecture of the programmable data plane using P4 programs and simulates match+action operations on packets in a feed-forward fashion. NS4 [12] also simulates P4 programs but goes a step further than PFPSim and allows emulation of an entire P4-enabled network. Druzhba, on the other hand, is a simulation platform that leverages low-level RMT instruction set modeling to serve as a compiler target for compiler testing. Banzai [1] is a switch simulator that serves as a compiler target for Domino, but it is less detailed and does not model the low-level hardware primitives to the same extent as Druzhba.

Compiler testing tools have also been frequently used but there is a dearth of tools for testing compilers for programmable switches in particular. LET [13] is a compiler testing tool that focuses on how to produce bug-ridden test programs to produce erroneous behavior but is only aimed at C compilers. p4pktgen [3] has successfully leveraged symbolic execution by generating table entries and test packets to detect bugs in the P4 compiler, p4c [18]. These bugs deal with p4c’s translation of high-level P4 source code constructs such as header length specification to JSON. This differs from Druzhba’s RMT simulation approach that represents the underlying instruction set and allows for the detection of erroneous mappings to pipeline switch hardware.

7 Future Work

We recognize that there is still room for improvement and further development. First, we acknowledge that our dRMT simulation does not cover the full P4-14 space; details such as packet field length are not thoroughly simulated, and further detail can be added to address this. We also want Druzhba to serve as a compiler target for multiple switching chip architectures. We plan to do this by modeling dRMT at the same low-level granularity as our RMT model, designing a new instruction set with properties similar to those of our RMT instruction set.

Further, we hope to address Druzhba’s limitations in comprehensive program validation and bug finding. Our testing framework is only capable of fuzz-testing RMT pipelines using randomly generated PHVs and only demonstrates input-output behavior. Though testers can analyze the input and output traces, this can be a tedious process, especially for complex algorithms. Thus, in addition to our current pipeline simulation, we wish to support program verification by accepting a high-level specification that contains the pipeline’s intended algorithmic behavior as well as PHV and state value constraints. This specification and the pipeline description can be transformed into SMT formulas so that equivalence can be formally proven. Though p4v [7] exists as a verification tool, it is intended for determining the correctness of P4 programs and not for testing machine code generation. We are also considering adding support for a domain-specific time travel debugger [30] for Druzhba to further aid in bug finding. This debugger would help testers reason about the behavior of the pipeline by setting breakpoints to observe PHV container and state values at different points of the simulation. Bidirectional travel in the context of programmable switches would allow testers to rewind pipeline simulation ticks to past pipeline states to trace the origins of erroneous behavior. Lastly, we hope to extend Druzhba’s usefulness beyond compiler testing. We look towards using Druzhba to evaluate the impact of new hardware designs by modeling different instruction sets or by adding hardware support for multitenancy [28].

8 Conclusion

We presented Druzhba, a programmable switch simulator that performs low-level RMT instruction set modeling. This provides a greater degree of control in compiler testing, since Druzhba acts as both a compiler target and a simulation platform. We have shown Druzhba’s usefulness in compiler testing by successfully simulating numerous machine code programs from a synthesis-based compiler. We have also shown Druzhba’s potential in modeling dRMT and anticipate that Druzhba can further aid in testing compilers not only for RMT but also for other switching chip architectures.


We are grateful to Suvinay Subramanian for his thoughtful and constructive feedback, which helped us improve this paper.


  • [1] A machine model for line-rate programmable switches. Note: https://github.com/packet-transactions/banzai Cited by: §3.1, TABLE I, §6.
  • [2] M. Alizadeh, T. Edsall, S. Dharmapurikar, R. Vaidyanathan, K. Chu, A. Fingerhut, V. T. Lam, F. Matus, R. Pan, N. Yadav, and G. Varghese (2014) CONGA: Distributed Congestion-aware Load Balancing for Datacenters. In SIGCOMM, Cited by: TABLE I.
  • [3] Andres Nötzli, Jehandad Khan, Andy Fingerhut, Clark Barrett, Peter Athanas (2018) p4pktgen: Automated Test Case Generation for P4 Programs. In SOSR, Cited by: §1, §6.
  • [4] M. T. Arashloo, Y. Koral, M. Greenberg, J. Rexford, and D. Walker (2016) SNAP: Stateful network-wide abstractions for packet processing. In SIGCOMM, Cited by: TABLE I.
  • [5] BGP Protocol. Note: https://en.wikipedia.org/wiki/Border_Gateway_Protocol Cited by: §1.
  • [6] P. Bosshart, G. Gibb, H. Kim, G. Varghese, N. McKeown, M. Izzard, F. Mujica, and M. Horowitz (2013) Forwarding Metamorphosis: Fast Programmable Match-action Processing in Hardware for SDN. In SIGCOMM, Cited by: §1, §1, §2.1.
  • [7] J. Liu, W. Hallahan, C. Schlesinger, M. Sharif, J. Lee, R. Soulé, H. Wang, C. Caşcaval, N. McKeown, and N. Foster (2018) p4v: Practical Verification for Programmable Data Planes. In SIGCOMM, Cited by: §7.
  • [8] S. Chole, A. Fingerhut, S. Ma, A. Sivaraman, S. Vargaftik, A. Berger, G. Mendelson, M. Alizadeh, S. Chuang, I. Keslassy, A. Orda, and T. Edsall (2017) dRMT: Disaggregated Programmable Switching. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication, Cited by: §1, §1, §4.1.
  • [9] dRMT Scheduler. Note: https://github.com/anirudhSK/drmt/ Cited by: §4.1.
  • [10] W. Feng, K. G. Shin, D. D. Kandlur, and D. Saha (2002) The BLUE Active Queue Management Algorithms. IEEE/ACM Transactions on Networking. Cited by: TABLE I.
  • [11] Intel IXP Network Processor. Note: https://en.wikipedia.org/wiki/IXP1200 Cited by: §1.
  • [12] Jiasong Bai, Jun Bi, Peng Kuang, Chengze Fan, Yu Zhou, Cheng Zhang (2018) NS4: Enabling Programmable Data Plane Simulation. In SOSR, Cited by: §1, §1, §6.
  • [13] Junjie Chen, Yanwei Bai, Dan Hao, Yingfei Xiong, Hongyu Zhang, Bing Xie (2017) Learning to Prioritize Test Programs for Compiler Testing. In ICSE, Cited by: §6.
  • [14] Mininet. Note: https://github.com/mininet/mininet Cited by: §6.
  • [15] MPLS Protocol. Note: https://en.wikipedia.org/wiki/Multiprotocol_Label_Switching Cited by: §1.
  • [16] S. Narayana, A. Sivaraman, V. Nathan, P. Goyal, V. Arun, M. Alizadeh, V. Jeyakumar, and C. Kim (2017) Language-Directed Hardware Design for Network Performance Monitoring. In SIGCOMM, Cited by: TABLE I.
  • [17] Nick McKeown, Tom Anderson, Hari Balakrishnan, Guru Parulkar, Larry Peterson, Jennifer Rexford, Scott Shenker, Jonathan Turner (2008) OpenFlow: Enabling innovation in campus networks. In SIGCOMM CCR, Cited by: §1.
  • [18] P4 Compiler. Note: https://github.com/p4lang/p4c Cited by: §6.
  • [19] P4-hlir. Note: https://github.com/jafingerhut/p4-hlir Cited by: §4.1.
  • [20] Program Synthesis. Note: https://en.wikipedia.org/wiki/Program_synthesis Cited by: §5.2.
  • [21] Samar Abdi, Umair Aftab, Gordon Bailey, Bochra Boughzala, Faras Dewal, Shafigh Parsazad, Eric Tremblay (2016) PFPSim: A Programmable Forwarding Plane Simulator. In ANCS, Cited by: §1, §1, §6.
  • [22] S. Sinha, S. Kandula, and D. Katabi (2004) Harnessing TCPs Burstiness using Flowlet Switching. In HotNets, Cited by: TABLE I.
  • [23] A. Sivaraman, A. Cheung, M. Budiu, C. Kim, M. Alizadeh, H. Balakrishnan, G. Varghese, N. McKeown, and S. Licking (2016) Packet Transactions: High-Level Programming for Line-Rate Switches. In SIGCOMM, Cited by: §1, TABLE I.
  • [24] Software-defined networking. Note: https://en.wikipedia.org/wiki/Software-defined_networking Cited by: §1.
  • [25] H. Song (2017) Skeletal program enumeration for rigorous compiler testing. In PLDI, Cited by: §1.
  • [26] Sparse Conditional Constant Propagation. Note: https://en.wikipedia.org/wiki/Sparse_conditional_constant_propagation Cited by: §3.4.
  • [27] C.H. Tai, J. Zhu, and N. Dukkipati (2008) Making Large Scale Deployment of RCP Practical for Real Networks. In INFOCOM, Cited by: TABLE I.
  • [28] Tao Wang, Hang Zhu, Fabian Ruffy, Xin Jin, Anirudh Sivaraman, Dan Ports, Aurojit Panda (2020) Multitenancy for fast and programmable networks in the cloud. HotCloud. Cited by: §7.
  • [29] The P4 Language Specification. Note: https://p4.org/p4-spec/p4-14/v1.0.5/tex/p4.pdf Cited by: §1.
  • [30] Time Travel Debugging. Note: https://en.wikipedia.org/wiki/Time_travel_debugging Cited by: §7.
  • [31] Xiangyu Gao, Taegyun Kim, Aatish Kishan Varma, Anirudh Sivaraman, Srinivas Narayana (2020) Autogenerating Fast Packet-Processing Code Using Program Synthesis. HotNets. Cited by: §5.1, §5.2.