Segment Routing is a network architecture based on the loose Source Routing paradigm ([filsfils2015segment, rfc8402]). The basic concepts proposed in [filsfils2015segment] have been elaborated and refined in RFC 8402 [rfc8402], which recently completed its standardization process in the IETF (July 2018). In the SR architecture, a node can include an ordered list of instructions in the packet headers. These instructions steer the forwarding and the processing of the packet along its path in the network. The individual instructions are called segments and each segment can enforce a topological requirement (e.g. pass through a node or an interface) or a service requirement (e.g. execute an operation on the packet). Each segment is encoded by a Segment IDentifier (SID).
The SR architecture is supported by two different data-plane implementations: MPLS (SR-MPLS) and IPv6 (SRv6), in which SIDs are respectively encoded as MPLS labels and IPv6 addresses. Several use cases and requirements for Segment Routing have been collected in a number of documents. In [rfc7855], the main use cases identified are: IGP based tunneling (i.e. to support VPN services), Fast ReRoute (FRR), Traffic Engineering (further classified in a number of more specific use cases). A set of Resiliency use cases is described in [rfc8355]. The Segment Routing use cases for IPv6 networks are considered in [rfc8354] with a set of exemplary deployment environments for SRv6: Small Office, Access Network, Data Center, Content Delivery Networks and Core Networks.
SR-MPLS was the first instantiation of the SR architecture to be rolled out, which made it possible to partially leverage the SR benefits ([davoli2015] and [cianfrani2016]), while recent interest and developments are focusing on SRv6. The SRv6 implementations have drawn a lot of attention from researchers in academia and industry, as witnessed by the publication of several research activities [ventre2019survey]. A strong open source ecosystem is supporting SRv6 advances. In particular, we want to mention [srnet], [srorg] and the ROSE project [rose]. Moreover, the SRv6 data plane has been implemented in many different routers, including open-source software routers such as the Linux kernel and the Vector Packet Processing (VPP) platform [fd-io-vpp], and hardware implementations from different network vendors [ietf-6man-segment-routing-header]. Finally, large scale deployments of SRv6 in production networks have been recently announced [draft-matsushima-spring-srv6-deployment-status]. The list of announced SRv6 deployments spans several network segments such as service providers and data center networking. In this work, the SRv6 data-plane is our main focus and we aim to enable the benchmarking of the different SRv6 data-plane implementations.
The introduction of SRv6 in ISP networks requires the assessment of its non-functional properties like scalability and reliability. Hence, the availability of a realistic performance evaluation framework for SRv6 is of fundamental importance. A measurement platform should allow scaling up to the current transmission line rates. Ideally, it should be available for re-use on any commodity hardware. To the best of our knowledge, there are neither open source performance measurement tools of this kind nor works that provide a complete performance evaluation for SRv6. There are some works that report partial experiments for the SRv6 implementations, like [lebrun2017reaping, lebrun2017implementing, ahmedperformance] and the report published at [csit-report]. [lebrun2017reaping] and [lebrun2017implementing] are early evaluations reporting the performance of the very first implementations of SRv6. [ahmedperformance] provides an early implementation of a performance framework for Linux, reports the performance of some SRv6 behaviors and points out a few performance issues without providing any solution to them. [csit-report] focuses on VPP forwarding in general and reports the performance of a few SRv6 behaviors. Considering the interest in performance analysis of SRv6 and the fact that these works do not provide a complete analysis of the available platforms and supported SRv6 behaviors, we advocate the need for an open source reference platform and a complete analysis of the currently implemented SRv6 behaviors.
The design of a performance evaluation tool for data-plane implementations of forwarders is a very challenging task [intel-cisco-report], as they are required to forward packets at an extremely high rate using a limited CPU budget to process each of these packets. The IETF has defined the guidelines and the tests for benchmarking forwarder implementations [rfc2544]. The tests include: throughput, latency, jitter and frame loss rate. Throughput is the most commonly used measure for forwarder implementations [rfc1242]. It is defined as the maximum rate at which all received packets are forwarded by the device and is often reported in number of packets per second (pps). Different variations of the throughput have been defined, including No-Drop Rate (NDR), Maximum Receive Rate (MRR) and Partial Drop Rate (PDR) [csit-report].
In this paper we present SRPerf, a performance evaluation framework for software and hardware implementations of SRv6. SRPerf is a modular framework supporting the performance evaluation of packet forwarding in the Linux kernel and in VPP. It can also be extended to support new forwarders. In addition to SRv6, SRPerf supports the performance evaluation of plain IP forwarding. It reports different throughput measures such as NDR, PDR and MRR. The current design relies on TRex [trex-cisco] as a traffic generator. Without loss of generality, the framework can be easily extended to support other packet generators. This work addresses the above requirements by providing a realistic performance evaluation framework. SRPerf is open-source and publicly available at [srperf]. The main contributions of this work are the following:
Realization of a performance evaluation framework for software and hardware implementations of SRv6; the current implementation supports the Linux kernel and VPP as forwarders;
Implementation of a generic PDR finder algorithm, which estimates the PDR of a system under test with a user-defined precision;
Performance evaluation of SRv6 forwarding behaviors supported both by the Linux kernel and by VPP;
Improved performance of existing cross-connect behaviors (namely End.DX6 and End.X) in the Linux kernel;
Implementation of End.DT4 in the Linux kernel.
The paper is structured as follows: Section II presents the SRv6 support in the Linux kernel and in VPP. The design of SRPerf and the evaluation methodology are described in Section III. Section IV explains the testbed and presents the experiments we have performed. We also elaborate on two use cases, showing how we have leveraged SRPerf to benchmark the implementation of a new forwarding behavior and to identify and solve performance issues of existing SRv6 functions. Finally, we report on the related works found in literature in Section V. We draw some conclusions and highlight the directions for future works in Section VI.
II SRv6 software implementations
In this section, we first provide an overview of the SRv6 network programming concepts and then we review the status of the open source software implementations, namely the Linux kernel (Sec. II-A) and the VPP router (Sec. II-B). A deeper introduction to SRv6 technology can be found in [ventre2019survey]. Table I shows the support of the SRv6 network programming concepts and their extensions [ietf-spring-srv6-network-programming, filsfils-spring-srv6-net-pgm-insertion, ietf-spring-sr-service-programming, id-srv6-mobile-uplane] in the Linux kernel and in the VPP software router.
SRv6 has drawn a lot of interest since its introduction at the IETF. This interest of the stakeholders, as well as the trend towards SDN, has led to a wide range of SRv6 support both in software forwarders, such as the Linux kernel and VPP, and in hardware routers [ietf-6man-segment-routing-header]. These implementations have been revised several times to keep up with the evolution of the SRv6 network programming concepts [ietf-spring-srv6-network-programming, filsfils-spring-srv6-net-pgm-insertion, ietf-spring-sr-service-programming] and [id-srv6-mobile-uplane].
SRv6 defines a new type of IPv6 routing extension header known as Segment Routing Header (SRH) [ietf-6man-segment-routing-header]. The SRH contains the Segment List, which implements an SR policy, as an ordered list of IPv6 addresses: each address in the list is a SID. A dedicated field, referred to as Segments Left, is used to maintain the pointer to the active SID of the Segment List.
The SRv6 Network Programming model [ietf-spring-srv6-network-programming] defines two different sets of SRv6 behaviors, known as SR policy headend and endpoint behaviors. SR policy headend behaviors steer received packets into the SRv6 policy matching the packet attributes. Each SRv6 policy has a list of SIDs to be attached to the matched packets. On the other hand, an SRv6 endpoint behavior, also known as behavior associated with a SID, represents a function to be executed on packets at a specific location in the network. Such a function can be a simple routing instruction, but also any advanced network function (e.g., firewall, NAT). SR policy headend behaviors are executed in the SR source node (also known as Headend node), while endpoint behaviors are executed in SR Segment Endpoint nodes. Endpoint behaviors are further classified as decap and no-decap, depending on whether or not they perform decapsulation of the SRH. Transit nodes can be SR-capable or SR-incapable, as they do not need to inspect the SRH since the destination address of the packet does not correspond to any locally configured segment or interface [rfc8200].
Hereafter we report a short description of the most important behaviors starting with SR policy headend ones. The H.Encaps behavior encapsulates an IPv6 packet as the inner packet of an IPv6-in-IPv6 encapsulated packet. The outer IPv6 header carries the SRH header, which includes the SIDs list. The H.Encaps.L2 behavior is the same as the H.Encaps behavior, with the difference that H.Encaps.L2 encapsulates the full received layer-2 frame rather than the IP packet (Ethernet over IPv6 encapsulation). The H.Insert behavior inserts an SRH in the original IPv6 packet, immediately after the IPv6 header and before the transport level header. The original IPv6 header is modified, in particular the IPv6 destination address is replaced with the IPv6 address of the first segment in the segment list, while the original IPv6 destination address is carried in the SRH header as the last SID of the SIDs list.
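The difference between H.Encaps and H.Insert can be made concrete with a toy packet model. The following is a minimal sketch, not the Linux or VPP implementation: the classes `SRH` and `IPv6Packet`, the helper `build_srh` and all addresses are hypothetical names chosen for illustration. It assumes the SRH wire encoding in which the Segment List is stored in reverse order (index 0 holds the last segment) and Segments Left indexes the active SID.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SRH:
    # Segment List as carried in the header: index 0 holds the LAST segment.
    segments: List[str]
    # Index of the active segment inside `segments`.
    segments_left: int


@dataclass
class IPv6Packet:
    src: str
    dst: str
    srh: Optional[SRH] = None
    payload: object = None  # transport data, or an inner IPv6Packet


def build_srh(sid_list: List[str]) -> SRH:
    """Encode a SID list (given in visiting order) as an SRH."""
    return SRH(segments=list(reversed(sid_list)),
               segments_left=len(sid_list) - 1)


def h_encaps(inner: IPv6Packet, sid_list: List[str], tunsrc: str) -> IPv6Packet:
    """IPv6-in-IPv6: the original packet is carried untouched as payload."""
    return IPv6Packet(src=tunsrc, dst=sid_list[0],
                      srh=build_srh(sid_list), payload=inner)


def h_insert(pkt: IPv6Packet, sid_list: List[str]) -> IPv6Packet:
    """Insert an SRH after the IPv6 header; the original destination
    address becomes the last SID of the list."""
    full_list = sid_list + [pkt.dst]
    return IPv6Packet(src=pkt.src, dst=sid_list[0],
                      srh=build_srh(full_list), payload=pkt.payload)


# Hypothetical addresses, for illustration only.
original = IPv6Packet(src="2001:db8:a::1", dst="2001:db8:b::1", payload="data")
sids = ["fc00::1", "fc00::2"]

encapsulated = h_encaps(original, sids, tunsrc="fc00::100")
inserted = h_insert(original, sids)
```

In this model H.Encaps leaves the original packet intact inside a new outer header, whereas H.Insert keeps a single IPv6 header and lengthens the SID list by one entry to preserve the original destination.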
The End behavior represents the most basic SRv6 function among the endpoint behaviors. It replaces the IPv6 destination address of the packet with the next SID in the SIDs list. Then, it forwards the packet by performing a lookup of the updated IPv6 Destination Address in the routing table of the node. We will refer to the lookup in the routing table as FIB lookup, where FIB stands for Forwarding Information Base. The End.T behavior is a variant of the End behavior where the FIB lookup is performed in a specific IPv6 table associated with the SID rather than in the main routing table. The End.X behavior is another variant of the End behavior where the packet is directly forwarded to a specified layer-3 adjacency bound to the SID rather than performing any FIB lookup of the IPv6 destination address.
The End.DT6 behavior pops the SRv6 encapsulation and performs a FIB lookup with the IPv6 destination address of the exposed inner packet in a specific IPv6 table associated with the SID. It is possible to associate the default IPv6 routing table with the SID; in this case the inner IPv6 packet is decapsulated and then forwarded on the basis of its IPv6 destination address according to the default routing of the node. The End.DX6 behavior removes the SRv6 encapsulation from the packet and forwards the resulting IPv6 packet to a specific layer-3 adjacency bound to the SID. End.DT4 and End.DX4 are respectively the IPv4 variants of End.DT6 and End.DX6, i.e. they are used when the encapsulated packet is an IPv4 packet. The End.DX2 behavior is used for packets encapsulated at Layer 2 (e.g. with H.Encaps.L2). It pops the SRv6 encapsulation and forwards the resulting L2 frame via an output interface associated with the SID.
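The endpoint behaviors described above can be sketched on a toy model as follows. This is an illustrative sketch only, not kernel or VPP code: the classes, the `fib_lookup` helper, the table contents and the addresses are hypothetical, and the Segment List is assumed to be stored in reverse order with Segments Left pointing to the active SID.

```python
import ipaddress
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class SRH:
    segments: List[str]   # index 0 holds the last segment
    segments_left: int    # index of the active segment


@dataclass
class IPv6Packet:
    src: str
    dst: str
    srh: Optional[SRH] = None
    payload: object = None


def fib_lookup(table: Dict[str, str], dst: str) -> str:
    """Longest-prefix match of `dst` in a {prefix: next_hop} table."""
    addr = ipaddress.ip_address(dst)
    best_len, best_nh = -1, None
    for prefix, next_hop in table.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and net.prefixlen > best_len:
            best_len, best_nh = net.prefixlen, next_hop
    if best_nh is None:
        raise LookupError(f"no route to {dst}")
    return best_nh


def end(pkt: IPv6Packet, main_table: Dict[str, str]) -> str:
    """End: advance to the next SID, then FIB lookup in the main table."""
    if pkt.srh is None or pkt.srh.segments_left == 0:
        raise ValueError("End needs an SRH whose active SID is not the last")
    pkt.srh.segments_left -= 1
    pkt.dst = pkt.srh.segments[pkt.srh.segments_left]
    return fib_lookup(main_table, pkt.dst)


def end_dt6(pkt: IPv6Packet, table: Dict[str, str]) -> Tuple[IPv6Packet, str]:
    """End.DT6: pop the outer header and look up the inner destination
    in the IPv6 table associated with the SID."""
    if pkt.srh is not None and pkt.srh.segments_left != 0:
        raise ValueError("decap behaviors expect the last SID to be active")
    inner = pkt.payload
    return inner, fib_lookup(table, inner.dst)


# Hypothetical packet produced by an H.Encaps headend with SIDs fc00::1, fc00::2.
inner = IPv6Packet("2001:db8:a::1", "2001:db8:b::10", payload="data")
outer = IPv6Packet("fc00::100", "fc00::1",
                   srh=SRH(["fc00::2", "fc00::1"], 1), payload=inner)

next_hop_after_end = end(outer, {"fc00::2/128": "fe80::2", "::/0": "fe80::ff"})
exposed, next_hop_after_decap = end_dt6(outer, {"2001:db8:b::/64": "fe80::b"})
```

In the same model, End.T would call `fib_lookup` on a table bound to the SID instead of the main one, while End.X and End.DX6 would skip the lookup and return a configured adjacency.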
Finally, two other sets of SRv6 behaviors have been defined in [ietf-spring-sr-service-programming] and [id-srv6-mobile-uplane], respectively for the support of Service Function Chaining of SRv6-unaware network functions and for mobile user plane functions. Some of these behaviors, such as End.AD and End.AM, are implemented in VPP and in an external Linux kernel module [srext-srv6-net-prog] but not in the Linux base kernel. The details and performance evaluation of the aforementioned behaviors, as well as of other SRv6 endpoint behaviors like End.B6, End.B6.Encaps and End.BM, have not been considered in this work and are left for future work. The latter endpoint behaviors (End.B6, End.B6.Encaps and End.BM) are used to steer the traffic into an SR policy by sending it to the corresponding Binding SID (BSID).
II-A SRv6 support in the Linux kernel
The Linux kernel is the main component of a Linux based operating system and it is the core interface between the hardware and the user processes. The network stack in the Linux kernel can be divided into eight main subsystems: Receive, Routing, Input, Forward, Multicast, Local, Output and Neighbor. Figure 1 shows the main subsystems of the network stack, including the Network Driver, which feeds the stack with packets, and the Transport Layer, which manages locally directed packets at higher levels.
The SRv6 implementation was merged in Linux kernel 4.10 [lebrun2017implementing]. Since then, SRv6 support has become more mature in versions 4.16 and 4.18 with the addition of new features and with refinements of the implementation. The SRH [ietf-6man-segment-routing-header] is defined through a structure, named ipv6_sr_hdr. A kernel function, named ipv6_srh_rcv(), is added as a default handler for SRv6 traffic and it is called by the Receive subsystem when an IPv6 extension header is found. The processing of received SRv6 packets is controlled through a per-interface configuration option (sysctl). Based on this per-interface option, the kernel may decide to either accept or drop a received SRv6 packet. If the packet is accepted, it is processed as described in [ietf-6man-segment-routing-header]: the SRH is processed, the packet IPv6 destination address is updated, then the kernel feeds the packet again in the Routing subsystem to be forwarded based on the new destination address.
In the Linux kernel, the SRv6 behaviors are implemented as Linux lightweight tunnels (lwtunnel). The lwtunnel is an infrastructure that was introduced in the version 4.3 of the kernel to allow for scalable flow-based encapsulation such as MPLS and VXLAN. SRv6 SIDs are configured as IPv6 FIB entries into the main routing table, or into any secondary routing table [srv6-impl]. In order to support adding SIDs associated with an SRv6 behavior, the iproute2 user-space utility has been extended [iproute2]. The SRv6 capabilities were improved in the release 4.18 [kernel4-18] (August 2018), to include the netfilter framework [netfilter-hacking] as well as the eBPF framework [lwn-ebpf].
At the time of writing, several SR policy headend behaviors are supported in the Linux kernel, including: H.Insert, H.Encaps, and H.Encaps.L2. As anticipated at the beginning of this section, endpoint behaviors are classified as no-decap and decap. Among the no-decap behaviors, End, End.T and End.X are supported. Among the decap behaviors, End.DX2, End.DT6, End.DX6 and End.DX4 are supported. Currently, the implementation of the End.DT4 behavior is missing in the Linux kernel.
II-B SRv6 support in VPP
Vector Packet Processing (VPP) is an open source virtual router [fd-io-vpp]. It implements a high-performance forwarder that can run on commodity CPUs. In addition, VPP is a very flexible and modular framework that allows the addition of new plugins without the need to change the core code. VPP often runs on top of the Data Plane Development Kit (DPDK) [dpdk], which is a platform for high speed I/O operations. DPDK maps the network interface card (NIC) directly into user space, bypassing the underlying Operating System kernel.
The packet processing architecture of VPP consists of graph nodes that are composed together. Each graph node performs one function of the processing stack such as IPv6 packets input (ip6-input), or IPv6 FIB look-up (ip6-lookup). The composition of the several graph nodes of VPP is decided at runtime. Figure 2 shows an example of a VPP packet graph. VPP also supports batch packet processing [barach2018batched], a technique that allows the processing of a batch of packets by one VPP graph node before passing them to the next node. This technique improves the packet processing performance by leveraging the CPU caches. Performance aspects of VPP are discussed in [csit-report] and [barach2018batched].
SRv6 capabilities were introduced in the 17.04 release of VPP. Most of the SRv6 endpoint behaviors defined in [ietf-spring-srv6-network-programming] are nowadays supported (e.g. End, End.X, End.DX2, End.DX4, End.DX6, End.DT4, End.DT6). These behaviors are grouped by the endpoint function type and implemented in dedicated VPP graph nodes. For example, all the decap functions share one single graph node, while the End and End.X functions are implemented in another VPP graph node. The SRv6 graph nodes perform the required SRv6 behaviors as well as the IPv6 processing (e.g. decrement Hop Limit). When an SRv6 segment is instantiated, a new IPv6 FIB entry is created for the segment address that points to the corresponding VPP graph node. An API was added to allow developers to create new SRv6 endpoint behaviors using the plugin framework. In this way, a developer can focus on the actual behavior implementation while the segment instantiation, listing and removal are performed by the existing SRv6 code.
The SR policy concept was introduced to support the SR policy headend capabilities. Traffic can be steered into an SR policy either by sending it to the corresponding BSID or by configuring a rule, called Steering Policy, that directs all traffic, for example transiting towards a particular IP prefix, into the policy. While for the SR policy headend behaviors there is parity in the capabilities offered by the Linux kernel and VPP, the same does not hold for the endpoint behaviors, where the VPP implementation exhibits a broader support of the SRv6 network programming model. Release 17.04 introduced the support for most of the behaviors, including the endpoint behaviors bound to a policy: End.B6 and End.B6.Encaps. End.T was instead introduced in the subsequent release (17.10). Finally, the SR proxy behaviors were introduced as VPP plugins in release 18.04 [ietf-spring-sr-service-programming].
III Performance evaluation framework
In this section, we illustrate the proposed performance evaluation framework (SRPerf). At first, we describe the internal design and the high level architecture of SRPerf (Section III-A); leveraging the SRPerf modular design, we have integrated the VPP platform and the Linux kernel as Forwarder. Section III-B elaborates on our evaluation methodology, which uses the Partial Drop Rate (PDR) metric to characterize the performance of a system. Finding the PDR of a given forwarding behavior is a time-consuming and error-prone process; for this reason we have developed an automatic finder procedure, which is described in Section III-C. Our algorithm performs a logarithmic search in the space of the solutions, adapts to different forwarding engines and does not require manual tuning. Moreover, it is demonstrated to be efficient.
III-A Design and architecture
We designed SRPerf following the network benchmarking guidelines defined in RFC 2544 [rfc2544]. As shown in Figure 3, the architecture of SRPerf is composed of two main building blocks: the testbed and the Orchestrator. In turn, the testbed is composed of the Tester node and the System Under Test (SUT) node. These nodes have two network interface cards (NICs) each and are connected back-to-back using both NICs. The Tester sends traffic towards the SUT through one NIC, and receives it back through the other one after it has been forwarded by the SUT. Accordingly, the Tester can easily perform all different kinds of throughput measurements as well as round-trip delay and jitter measurements. In our design, we chose the open source project TRex [trex-cisco] as Traffic Generator (TG) (supporting the transmission and the reception of packets in the Tester Node). As for the SUT Node, we currently support VPP and the Linux kernel as Forwarder.
Let us describe SRPerf using a top-down approach. Two configuration files (upper part of Figure 3) are provided as input to the Orchestrator. The first file, Experiments CFG, represents the necessary input to run the experiments. In particular, it defines: i) the type of experiment (i.e. set of SRv6 behaviors to be tested, type of tests and type of algorithm); ii) the number of runs; iii) the size and type of the packets to be sent between the traffic generator and the Forwarder. The second configuration file (Testbed CFG) defines the forwarding engine of the SUT and the information needed to establish an SSH connection with it. The SRPerf configuration files use the YAML [yaml] syntax; an example of configuration is reported in the upper-left part of Figure 3.
The Orchestrator leverages the CFG Parser to extract the configuration parameters and to initialize the experiment variables. The CFG Parser is a simple Python module which uses the PyYAML parser [pyaml] to return Python objects to the caller. The Orchestrator is responsible for the automation of the whole evaluation process. According to the input parameters, it creates an Experiment; specifically, the Orchestrator uses different algorithms for calculating the throughput. Each algorithm offers an API (the Experiment interface in Figure 3) through which the Orchestrator can run an Experiment. An example of a currently supported throughput measurement algorithm is the Partial Drop Rate (PDR), described in Section III-B. Moreover, the Orchestrator provides a mapping between the forwarding behaviors to be tested and the type of traffic required to test each behavior. For example, to test the End behavior, it is necessary to use an SRv6 packet with an SRH containing a SIDs list of at least two SIDs, and the active SID must not be the last SID. The type of packet to be replayed during the experiments has to be passed to the Experiment instance.
The Orchestrator controls the TG (deployed in the Tester node) through the high level abstraction provided by the TG Driver, which translates the calls coming from the other modules into commands to be executed on the TG. Each driver is a Python wrapper that can use native Python APIs or any other transport mechanism supported by the language. For example, the TRex driver includes the Python client of the TRex automation API [trex-client-api], which uses JSON-RPC2 [jsonrpc] over ZMQ [zmq] as transport mechanism. The Orchestrator can be deployed on the same node as the TG or on a remote node.
The CFG Manager controls the forwarding engine in the SUT. It is responsible for enforcing the required configuration in the Forwarder. The Orchestrator provides the mapping between the forwarding behaviors to be tested and the required configuration of a given forwarding engine. Hence, the Orchestrator is able to properly instruct the CFG Manager. For each forwarding engine, we implement a CFG which provides the CFG Manager with the means to enforce a required configuration. In particular, a CFG is a bash script defining a configuration procedure for each behavior to be tested. The configuration is applied using the Command Line Interface (CLI) exposed by the forwarder. For example, to test the End behavior in the Linux kernel, we implement a bash procedure called end. In this procedure, we leverage the iproute2 utility to configure the forwarding engine in the SUT with two FIB entries: 1) an SRv6 SID with the End behavior; 2) a plain IPv6 FIB entry to forward the packet once the End function has been performed. The configuration can be as simple as adding a FIB entry to forward the received packets back to the Tester, or it can be a more complex configuration that manipulates the incoming packets before forwarding them back to the Tester. The CFG Manager first pushes the CFG scripts to the SUT and then applies a given configuration by running commands over an SSH connection.
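The two FIB entries used to test the End behavior in the Linux kernel can be illustrated with the iproute2 commands they correspond to. The sketch below is not the SRPerf CFG script (which is written in bash); it is a hypothetical Python helper that only builds the command strings, and the SID, next hop and interface names are placeholder assumptions.

```python
from typing import List


def end_behavior_commands(sid: str, in_dev: str, next_sid: str,
                          next_hop: str, out_dev: str) -> List[str]:
    """Return the two iproute2 commands for an End test:
    1) an SRv6 SID bound to the End behavior (seg6local lightweight tunnel);
    2) a plain IPv6 route used to forward the packet towards the next SID.
    All argument values are supplied by the caller (hypothetical here)."""
    return [
        f"ip -6 route add {sid}/128 encap seg6local action End dev {in_dev}",
        f"ip -6 route add {next_sid}/128 via {next_hop} dev {out_dev}",
    ]


# Placeholder addresses and interface names, for illustration only.
commands = end_behavior_commands(sid="fc00::1", in_dev="eth0",
                                 next_sid="fc00::2",
                                 next_hop="2001:db8::2", out_dev="eth1")
for command in commands:
    print(command)
```

Keeping the commands as data makes it easy to push them to the SUT over SSH or to print them for inspection before an experiment.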
The SRPerf implementation is open source and available at [srperf]. SRPerf is mostly written in Python, and provides a set of tools to facilitate the deployment of the experiments: it offers an API for the automatic generation of the configuration files. Moreover, it provides different configuration scripts to set up a performance evaluation experiment using SRPerf on any commodity hardware. These scripts cover the installation and initial configuration of both the TG and the Forwarder. The framework is modular and can be expanded in different directions: it can be extended to support new traffic generators by simply creating a new driver for each. A new forwarding behavior can be added by updating the CFG Manager with the configuration required for such behavior. New algorithms for calculating throughput and delay can be developed and plugged into the Orchestrator. It can support different Forwarders in the SUT, which only requires the CFG Manager to be updated to recognize them and to implement the related CFG object. In this work we have first considered the Linux kernel networking as Forwarder and then, leveraging the framework described above, we added the support for the VPP software router.
III-B Evaluation methodology
RFC 1242 [rfc1242] defines the Throughput as the maximum rate at which all received packets are forwarded by the device. It is used as a standard measure to compare the performance of network devices from different vendors. Throughput can be reported in number of bits per second (bps) as well as in number of packets per second (pps). The FD.io CSIT Report [csit-report] defines No-Drop Rate (NDR) and Partial Drop Rate (PDR). NDR is the highest Throughput achieved without dropping packets, so it corresponds to the Throughput defined by RFC 1242. PDR is the highest Throughput achieved without dropping more traffic than a pre-defined loss ratio threshold [opnfv-nfvbench]. We use the notation PDR@X%, where X represents the loss ratio threshold. For example, we can evaluate PDR@0.1%, PDR@0.5%, PDR@1%. NDR can be described as PDR@0%, i.e. PDR with a loss threshold of 0%. Considering that Throughput can be used with wider meanings, the terminology defined in [csit-report] (e.g. No-Drop Rate) is clearer and it will be used hereafter. Hence, we can use Throughput to refer in general to the output forwarding rate of a device. In this work, we will consider only the PDR since it is more generic than the NDR.
Finding the PDR requires the scanning of a broad range of possible traffic rates. In order to explain the process, let us consider the plain IPv6 forwarding in the Linux kernel. Figure 4 plots the Throughput (i.e. the output forwarding rate) and the Delivery Ratio (DR) versus the input rate, defined and evaluated as follows. We generate traffic at a given packet rate $R$ [kpps] for a duration $D$ [s] (usually $D = 10$ s in our experiments). Let the number of packets generated by the TG node and incoming to the SUT in an interval of duration $D$ be $P_{IN}$ (Packets INcoming in the SUT). We define the number of packets transmitted by the SUT (and received by the TG) as $P_{OUT}$ (Packets OUTgoing from the SUT). The Throughput is $T = P_{OUT}/D$ [kpps]. We define the DR as $DR = P_{OUT}/P_{IN} = T/R$. Hence, the DR is the ratio between the output and the input packet rates of a device for a given forwarding behavior under analysis. It is 100% for all incoming data rates less than the device No-Drop Rate. Initially, the Throughput increases linearly with the increase in the incoming rate. This region is often referred to as the no drop region, i.e. where the DR is always 100%. If the forwarding process is CPU-limited, the CPU usage at the SUT node increases with the increase of the incoming traffic rate (i.e. the sending rate of the Tester). Ideally, the SUT node should be able to forward all received packets until its CPU becomes 100% saturated. On the other hand, in our experiments with the Linux based SUT we measured a very small but not negligible packet loss ratio in a region where we have an (almost) linear increase of the Throughput. Therefore, it is better to consider the Partial Drop Rate, and we used 0.5% as threshold. The PDR@0.5% is the highest incoming rate at which the Delivery Ratio is at least 0.995. The usefulness of the PDR is that it characterizes a given configuration of the SUT with a single scalar value, instead of considering the full relation between Throughput and Incoming rate shown in Figure 4.
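The arithmetic behind a single measurement can be sketched as follows. This is an illustrative helper, not SRPerf code: the function names and the packet counters in the example are hypothetical, and the 0.5% loss threshold is the one discussed above.

```python
def throughput_kpps(packets_out: int, duration_s: float) -> float:
    """Output forwarding rate of the SUT in kilo-packets per second."""
    return packets_out / duration_s / 1000.0


def delivery_ratio(packets_in: int, packets_out: int) -> float:
    """Fraction of the packets sent to the SUT that were forwarded back."""
    return packets_out / packets_in


def within_pdr(packets_in: int, packets_out: int,
               loss_threshold: float = 0.005) -> bool:
    """True if the measured loss does not exceed the PDR loss threshold
    (0.005 corresponds to PDR@0.5%)."""
    return delivery_ratio(packets_in, packets_out) >= 1.0 - loss_threshold


# Hypothetical run: 10 s at 1000 kpps, 40,000 packets dropped by the SUT.
sent, received, duration = 10_000_000, 9_960_000, 10.0
print(throughput_kpps(received, duration))   # output rate in kpps
print(delivery_ratio(sent, received))        # delivery ratio
print(within_pdr(sent, received))            # loss of 0.4% is within 0.5%
```

A rate passes the PDR@0.5% check as long as the delivery ratio stays at or above 0.995, so the run in the example would count as inside the partial drop region.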
The procedure for finding the PDR for a given loss threshold is described in Section III-C. The output of the finding procedure is furthermore validated to make sure that the estimated PDR value is stable. In particular, once a PDR value has been calculated, we run a number of test repetitions (e.g. 15) to evaluate the average and standard deviation of the DR across these repetitions.
III-C PDR finder algorithm
Estimating the PDR of a given forwarding behavior is a time consuming process, since it requires the scanning of a broad range of possible traffic rates. In order to automate the PDR finding process, we have designed and developed the PDR finder algorithm. It scans a range of traffic rates with the objective of estimating the PDR value. Alg. 1 reports the pseudo code of the PDR finder algorithm. It performs a logarithmic search in the space of possible solutions, which is upper limited by the line rate of the NICs. It returns an interval of traffic rates, delimited by a lower and an upper bound, that estimates the PDR value with a configurable precision. The maximum interval width is a configurable option to tune the algorithm precision. The algorithm keeps decreasing the amplitude of the searching window until it becomes less than or equal to this interval width. At each iteration, the size of the searching window is halved and the DR is evaluated for the window middle point, which is considered to be the current traffic rate. If the Delivery Ratio of the middle point is less than the threshold, the upper bound of the window is set to the current rate. Otherwise, the lower bound of the searching window is set to the current rate. This process is iterated until the exit condition is triggered: the algorithm terminates when the difference between the upper bound and the lower bound is less than or equal to the configured interval width.
The algorithm described above does not require any tuning: it can start from the full window and still be efficient. It works on rates expressed relative to the line rate of the TG NIC and performs a binary search directly on the whole interval of relative rates, which requires in the worst case 7 steps to reach the desired precision, since each step halves the window. This also means that in about 70 s we are able to estimate the PDR of a given forwarding behavior.
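The search described above can be sketched in a few lines of Python. This is a minimal sketch and not the SRPerf implementation of Alg. 1: the function `find_pdr`, the callback `measure_dr` and the synthetic SUT model are hypothetical. It assumes, for illustration only, rates expressed as a percentage of the line rate (window from 0 to 100) and a minimum interval width of 1, values under which the halving loop happens to take 7 iterations.

```python
from typing import Callable, Tuple


def find_pdr(measure_dr: Callable[[float], float],
             line_rate: float = 100.0,
             dr_threshold: float = 0.995,
             min_width: float = 1.0) -> Tuple[float, float, int]:
    """Binary search for the PDR.

    measure_dr(rate) runs one trial at `rate` and returns the Delivery Ratio.
    Returns (lower bound, upper bound, number of trials); the PDR estimate
    lies inside the returned interval, whose width is at most `min_width`.
    """
    low, high = 0.0, line_rate
    steps = 0
    while high - low > min_width:
        rate = (low + high) / 2.0          # middle point of the window
        if measure_dr(rate) < dr_threshold:
            high = rate                    # too many drops: search below
        else:
            low = rate                     # acceptable loss: search above
        steps += 1
    return low, high, steps


def synthetic_sut(rate: float, capacity: float = 37.0) -> float:
    """Toy SUT model (assumption): forwards at most `capacity` units."""
    return 1.0 if rate <= capacity else capacity / rate


low, high, steps = find_pdr(synthetic_sut)
print(low, high, steps)
```

Since every iteration halves the window, the number of trials grows only with the logarithm of the ratio between the initial window and the minimum width, which is why no per-forwarder tuning of the starting point is needed in this sketch.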
IV Performance evaluation of the SRv6 software implementations
In this section, we present an evaluation of the SRv6 software implementations, namely the Linux kernel and the VPP software router. The rationale for this evaluation is to provide an indication of the scalability of the SRv6 implementations over a set of experiments. It is not our purpose to make any comparison of the dataplane forwarding performance: the two implementations are internally too different, which makes any comparison meaningless. Section IV-A illustrates the testbed and the parameters of the experiments. We report in Section IV-B the experiment results of the Linux kernel forwarding. During the code analysis of the Linux kernel implementation, we discovered that End.DT4 was missing; Section IV-C illustrates how we have used SRPerf to benchmark the experimental implementation of End.DT4 that we have realized. Section IV-D shows how we have leveraged SRPerf to solve the performance issues we found in some endpoint behaviors. Finally, Section IV-E reports the experiment results of VPP.
IV-A Testbed and parameters of the experiments
We deployed our testbed, illustrated in the bottom part of Figure 3, on CloudLab [cloudlab]. Each testbed node (Tester and SUT) is a bare metal server equipped with an Intel Xeon E5-2630 v3 processor with 16 cores clocked at 2.40 GHz and 128 GB of RAM. Each bare metal server has two Intel 82599ES 10-Gigabit network interface cards that provide back-to-back connectivity between the testbed nodes. The Tester runs TRex in stateless mode and has the TRex Python automation libraries installed. The SUT machine runs Linux kernel 5.2 net-next (upstream branch of the Linux kernel). It has the 5.x release of iproute2 [iproute2] installed, which provides the means to program the SRv6 behaviors. In addition, ethtool (release 5.x) is installed to configure the NIC hardware capabilities such as offloading [ethtool]. Regarding VPP, we have used release 19.04.
To make the results of the experiments clear, we report hereafter the methodology we have used and some tuning parameters. The evaluation of forwarding behaviors often includes performance for both single and multiple CPU cores. Here, we show only the performance of the SRv6 behaviors for a single CPU core, since single-core measurements provide the base performance of a given behavior. Regarding the Linux kernel, in order to force all received traffic to be processed by a single CPU core, we rely on the Receive-Side Scaling (RSS) and SMP IRQ affinity features. More details about the tuning of these features are reported in our previous work [ahmedperformance]. Moreover, in order to obtain a base performance that is independent of the NIC hardware capabilities, we disabled all the NIC hardware offloading capabilities such as Large Receive Offload (LRO), Generic Receive Offload (GRO), Generic Segmentation Offload (GSO), and all checksum offloading features. Finally, we disabled the hyper-threading feature of the SUT node from the BIOS settings. Similar configurations have been applied to VPP. However, VPP is a user space router, so we only had to customize the startup configuration to use one CPU core and to disable all the DPDK offloading features. Additionally, for the Linux kernel, we enabled the seg6_flowlabel computation in all our experiments and we configured the TUNSRC for the SR policy headend behaviors that perform encapsulation. The TUNSRC sets the IPv6 source address of the outer IPv6 header. It has to be configured explicitly; otherwise, the Linux kernel will try to get the address from the interface for each packet, which degrades the performance of the encap behaviors.
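As a rough illustration of the Linux-side tuning described above, the following Python sketch only builds the command lines (it does not execute them). The interface name and the source address are placeholders, and the exact set of commands used by the authors is an assumption: the sketch relies on the standard `ethtool -K` feature switches, the `net.ipv6.seg6_flowlabel` sysctl and the iproute2 `ip sr tunsrc set` command.

```python
from typing import List


def sut_tuning_commands(iface: str, tunsrc: str) -> List[List[str]]:
    """Return the command lines for the single-core baseline tuning.

    iface and tunsrc are placeholders supplied by the caller; nothing is
    executed here, so the list can be inspected before being run as root.
    """
    # Offloads named in the text: LRO, GRO, GSO and RX/TX checksumming.
    offloads = ["lro", "gro", "gso", "rx", "tx"]
    ethtool_cmd = ["ethtool", "-K", iface]
    for feature in offloads:
        ethtool_cmd += [feature, "off"]

    return [
        ethtool_cmd,
        # Assumption: value 1 asks the kernel to compute the outer flow label.
        ["sysctl", "-w", "net.ipv6.seg6_flowlabel=1"],
        # Source address of the outer IPv6 header used by the encap behaviors.
        ["ip", "sr", "tunsrc", "set", tunsrc],
    ]


if __name__ == "__main__":
    for cmd in sut_tuning_commands("eth0", "fc00::1"):
        print(" ".join(cmd))
```

The RSS and IRQ affinity settings are deliberately left out, since their details are deferred to [ahmedperformance].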
We performed three experiments: i) SR policy headend behaviors; ii) endpoint behaviors with no decapsulation (no-decap); iii) endpoint behaviors with decapsulation (decap). The decap behaviors remove the SRv6 encapsulation from packets before forwarding them. Conversely, the no-decap behaviors forward SRv6 packets without removing the SRv6 encapsulation. In the first experiment, we use an IPv6 packet of 64 bytes (102 bytes considering the Ethernet header and overhead). In experiments 2 and 3, we use an SRv6 packet of 64 bytes plus the SRv6 encapsulation. The SRv6 encapsulation is 80 bytes: 40 bytes of outer IPv6 header and 40 bytes of SRH with two SIDs.
We use the PDR described in Section III-B as a metric. The trial period in our experiments is 10 seconds. We use bar plots to represent our results, where each bar represents the average of 10 PDR values. The estimated PDR value is further validated with 10 repetitions, in which we analyze the variability of the packets forwarded by the SUT. Tables II, III, IV and V report, for each analyzed forwarding behavior, the average, the Coefficient of Variation (CV) and the 95% Confidence Interval. It is also important to understand the numbers reported in the graphs. We report the rate in pps (packets per second), since we are interested in evaluating the forwarding performance of the implementations. This value is upper limited by the Line Rate, which depends on the Size of the packets used during the experiment. We report in Equation 1 the formula used for the calculation [cisco_lr].
$$\text{Line Rate}_{pps} = \frac{\text{Line Rate}_{bps}}{8 \times (\text{Size} + \text{Overhead})} \qquad (1)$$
Here the Overhead for the Ethernet frames is 4 bytes for the CRC, 8 bytes for the preamble/SFD and 12 bytes for the inter-frame gap. To clarify the above, we evaluated in a preliminary experiment the performance of plain IP forwarding for both the Linux kernel and VPP. Figure 5 reports the results for a frame size of 78 bytes. The IP forwarding rate is upper limited by 12255 kpps, which is the line rate for this frame size. In this test, VPP is able to forward IPv4 packets at line rate (about 12252 kpps), while the same does not hold for IPv6 traffic, for which the forwarding rate is lower (about 11327.5 kpps). The performance of the Linux kernel is lower: around 1221.06 kpps for IPv6 and 1430.38 kpps for IPv4. As with VPP, the forwarding of IPv4 traffic performs better.
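The line-rate bounds quoted in this section can be checked with a few lines of Python. This is a worked example under stated assumptions: the line rate in pps is taken to be the link bit rate divided by 8 × (Size + Overhead), the link speed is 10 Gb/s, and Size is the frame size including the 14-byte Ethernet header but excluding the CRC, which is counted in the 24-byte overhead. The function name is illustrative.

```python
ETH_HEADER = 14            # bytes of Ethernet header added to the IP packet
OVERHEAD = 4 + 8 + 12      # CRC + preamble/SFD + inter-frame gap (bytes)


def line_rate_kpps(frame_size: int, link_bps: float = 10e9) -> float:
    """Maximum packet rate in kpps for a given frame size in bytes."""
    return link_bps / (8 * (frame_size + OVERHEAD)) / 1e3


if __name__ == "__main__":
    plain = 64 + ETH_HEADER          # 78-byte frame (experiment 1, IP forwarding)
    srv6 = 64 + 80 + ETH_HEADER      # 158-byte frame (experiments 2 and 3)
    print(f"{plain} B frame: {line_rate_kpps(plain):.1f} kpps")
    print(f"{srv6} B frame: {line_rate_kpps(srv6):.1f} kpps")
```

With these assumptions the 78-byte frame yields about 12255 kpps and the 158-byte SRv6 frame about 6868 kpps, matching the line-rate values reported for the two packet formats.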
IV-B Linux kernel
We start by evaluating the performance of the SR policy headend behaviors: H.Insert, H.Encaps (considering IPv4 and IPv6 traffic), and H.Encaps.L2. The results for the Linux kernel are shown in Figure 6. H.Insert shows a better forwarding rate, 1039 kpps, compared to 978 kpps and 1029 kpps for H.Encaps.V6 and H.Encaps.V4, respectively. For the H.Encaps.L2 behavior, the SUT node is able to forward 828 kpps. The performance of H.Insert is slightly better than that of H.Encaps, since the former needs to push only an SRH while the latter needs to push an outer IPv6 header along with the SRH. As expected, the encapsulation of IPv4 traffic performs better than its IPv6 counterpart. In general, the SR policy headend behaviors have shown very stable performance, as witnessed by the low CV values shown in Table II.
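The summary statistics used in the tables (average, CV and 95% confidence interval over 10 repetitions) can be computed as in the sketch below. How the authors derive the interval is not stated in the text; the use of a Student t interval with 9 degrees of freedom (critical value 2.262) is an assumption, and the sample values in the usage example are made up purely for illustration.

```python
import math
import statistics
from typing import Sequence, Tuple

T_CRIT_95_DF9 = 2.262  # two-sided 95% Student t critical value for 10 samples


def summarize(samples: Sequence[float],
              t_crit: float = T_CRIT_95_DF9) -> Tuple[float, float, float]:
    """Return (mean, CV in percent, half-width of the 95% CI).

    The default t_crit is only valid for 10 samples; pass another value
    for a different number of repetitions.
    """
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)          # sample standard deviation
    cv = 100.0 * stdev / mean if mean else 0.0
    half_width = t_crit * stdev / math.sqrt(len(samples))
    return mean, cv, half_width


if __name__ == "__main__":
    # Hypothetical forwarding rates (kpps), not measured data.
    runs = [1038.0, 1041.0, 1039.5, 1040.0, 1037.0,
            1042.0, 1039.0, 1038.5, 1040.5, 1039.5]
    mean, cv, hw = summarize(runs)
    print(f"mean={mean:.1f} kpps  CV={cv:.2f}%  95% CI=+/-{hw:.2f} kpps")
```

A CV close to zero, as reported for the headend behaviors, means that the repetitions are tightly clustered around the average.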
Regarding the no-decap SRv6 endpoint behaviors, we evaluated the performance of the End, End.T, and End.X behaviors. With the End behavior, the SUT node is able to forward 900 kpps. End.T performs better than End, reaching 979 kpps: the routing table used for the lookup is specified by the control plane, hence the kernel saves the cost of the IP rules lookup that is executed in the case of the End behavior. For End.X, we found very poor performance, with a forwarding rate of 123 kpps. In Section IV-D we provide more insights into this low performance and we show how we have fixed the issue, obtaining results in line with the other behaviors.
Finally, the last experiment compares the performance of the SRv6 decap endpoint behaviors. The End.DX2 behavior has a throughput of 1299 kpps, which is better than the other behaviors. End.DX2 performs better even than plain IPv6 forwarding because the kernel does not need to perform a Layer-3 lookup once the packet has been decapsulated: it pushes the packet directly into the transmit queue of the interface towards the next hop. End.DX4 exhibits a rate of 929 kpps. For the endpoint behavior with lookup on a specific table, namely End.DT6, we measured 960 kpps. The performance of the endpoint behaviors is less stable than that of the SR policy headend behaviors: the values of the CV and of the confidence interval are higher, as shown in Table III.
In this experiment we also found some problems: first, End.DT4 is missing in the Linux kernel; second, End.DX6 is affected by the same problem as the End.X behavior (more details in Section IV-D). In particular, we found that the End.DX6 forwarding rate of the Linux kernel is 122 kpps. Regarding End.DT4, Section IV-C illustrates how we have used SRPerf to benchmark the experimental implementation that we have realized. Moreover, we show in the following sections that SRPerf is fundamental to support both development and testing, helping users to perform soak testing and to identify performance problems and bugs in the code at an early stage.
IV-C Adding End.DT4 behavior to the Linux kernel
In the context of the ROSE project [rose] (which aims to realize an open SRv6 ecosystem), we have realized an experimental implementation of the End.DT4 behavior and we have leveraged SRPerf to assess the correctness of the patch. The implementation is publicly available and we plan to submit the code to the Linux mainline; the source code of the End.DT4 behavior is available at [fix_repo].
First, we verified that the functionality was correctly implemented. Then, we used the same tool to stress our implementation and assess its efficiency. SRPerf is a valuable tool for both tasks. Indeed, it can be used to stress the machine for a long time by pushing packets at line rate (for example, to verify that there are no memory leaks), but also to evaluate how the new behavior affects existing code (since we are planning to submit the code to the upstream branch).
Thanks to SRPerf, we were able to confirm that the functionality was realized correctly and that no memory leaks occurred in the long runs. Unfortunately, the first implementation was not as efficient as we expected: we obtained a Partial Drop Rate (PDR) of only 600 kpps, while the IPv6 counterpart (i.e., End.DT6) was able to deliver a PDR of 950 kpps. At this point, we decided to follow a different approach, which required more coding and exposing some functionality of the Linux routing subsystem to SRv6. With this second attempt we were able to close the performance gap and obtain a PDR of 980 kpps, which is in line with the expected performance. For the sake of simplicity, the complete results of the Linux forwarding are reported in Figure 8.
IV-D Fixing a forwarding behavior in the Linux kernel
During our first evaluation, we found that the End.X and End.DX6 behaviors exhibited poor performance and less stability with respect to the other SRv6 endpoint behaviors. Figure 7 restricts the y-axis to the range from 0 to 140 kpps to show a zoomed view of their PDR values. In this case, SRPerf was useful both to identify a brittle implementation and to verify the correctness of the patch we have realized to fix the behaviors.
Coming back to the issue, the two behaviors perform different variations of an L3 cross-connect to an adjacency that is provided by the control plane. However, the current implementation of these two behaviors in Linux is not fully compliant with their specification in the SRv6 network programming document [ietf-spring-srv6-network-programming]. The IETF document specifies that SRv6 x-connect behaviors are used to cross-connect a packet to the next hop through a specific interface. Instead, the current implementation uses a next hop, provided as input by the control plane, and introduces some exceptions in the IPv6 processing workflow to emulate such a "forward to an adjacency interface" behavior. In theory, the lookup in the routing tables is not necessary and only the output interface has to be selected. However, such a behavior is not envisaged by the IPv6 routing of the kernel: typically, when a packet has to be forwarded, the routing subsystem needs to find a route in the routing tables and return an rt6_info structure as the result of the lookup. Caches are used to store this structure in order to avoid a lookup in the routing tables and further memory allocations for each packet.
In this case we do not need to perform any lookup because the next hop is already known, and the rt6_info cannot be cached because it has no reference in the Forwarding Information Base (FIB) tree. This leads to the execution of unnecessary actions (first of all the lookup) and to the allocation of a structure for each packet in order to emulate the complete lookup process. Together, these effects cause a severe performance drop. Moreover, the garbage collector is activated to avoid memory exhaustion, which further worsens the situation.
To fix the poor performance of the cross-connect behaviors, we extended their implementation in the Linux kernel to allow forwarding packets based on an outgoing interface, instead of using a next hop and then going through the complete IPv6 workflow. We implemented a new kernel function, named seg6_xcon6, which is called by End.X and End.DX6 to cross-connect the IPv6 packet to a given interface. We extended the same logic to the SRv6 End.DX4 behavior by implementing another kernel function, named seg6_xcon4, which is called by End.DX4 to cross-connect the IPv4 packet to a given interface. The source code of the fixed behaviors is available at [fix_repo]. We verified the correctness and the stability of our patch through SRPerf, obtaining 1245 kpps, 1210 kpps and 1231 kpps for End.DX4, End.DX6 and End.X, respectively. In this case too, IPv4 forwarding shows better performance. The final results of the Linux forwarding are reported in Figure 8.
IV-E VPP software router
As regards the SR policy headend behaviors, VPP performs differently from the Linux kernel: H.Insert shows the lowest performance, with a rate of 7387 kpps. H.Encaps.V6 and H.Encaps.V4 exhibit 7709 kpps and 8471 kpps, respectively, while the H.Encaps.L2 behavior is able to forward 8052 kpps. As expected, H.Encaps.V4 performs better than H.Encaps.V6. However, neither Linux nor VPP is able to forward packets at line rate with the SR policy headend behaviors. In general, Table IV shows low variability.
VPP is instead very efficient with the endpoint behaviors. The SUT node running VPP is able to execute End, End.X and End.T and forward the packets at line rate, which is equal to 6868 kpps. Moreover, the CV and the confidence interval reported in Table V are 0.0%. The same applies to the decap behaviors: VPP is able to remove the SRv6 header (including the outer Ethernet frame) and forward the packets at line rate.
V Related Works
The performance of software forwarders on commodity CPUs requires careful measurement and analysis, as such CPUs were not designed specifically for traffic forwarding. To address these needs, several frameworks have been developed. However, none of the works found in the literature has fully addressed the performance of SRv6 data-plane implementations, either in the Linux kernel or in other software routers (e.g., VPP). [ahmedperformance] is the only work that comes close to providing a full evaluation of the SRv6 implementation in the Linux kernel.
DPDK [dpdk] is the state of the art technology for accelerating the virtual forwarding elements. It bypasses the kernel processing and balances the incoming packets over all the CPU cores and processes them by batches to make a better use of the CPU cache. In [begin2018accurate], the authors presented an analytical queuing model to evaluate the performance of a DPDK-based vSwitch. The authors studied several characteristics of DPDK framework such as average queue size, average sojourn time in the system and loss rate under different arrival loads.
In [pitaev2017multi], the performance of several virtual switch implementations including Open vSwitch (OVS) [ovs], SR-IOV and VPP are investigated. The work focuses on the NFV use-cases where multiple VNFs run in x86 servers. The work shows the system throughput in a multi-VNF environment. However, this work considers only IPv4 traffic and does not address SRv6 related performance. In [pitaev2018characterizing], the previous work has been extended by replacing OVS with OVS-DPDK [ovs-dpdk], which promises to significantly increase the I/O performance for virtualized network functions. They use DPDK-enabled VNFs and show how OVS-DPDK compares from a throughput perspective to SR-IOV and VPP as the number of VNFs is increased under multiple feature configurations. However, the work still considers only plain IPv4 forwarding.
In [barach2018batched], the authors explain the main architectural principles and components of VPP, including vector processing, kernel bypass, packet batch processing, multi-loop, branch prediction and direct cache access. To validate the high-speed forwarding capabilities of VPP, the authors report some performance measurements, such as the packet forwarding rate for different vector sizes (i.e., the number of packets processed as a single batch), the impact of the multi-loop programming practice on the per-packet processing cost, as well as the variation of the packet processing rate as a function of the input workload process. However, this work analyses VPP forwarding performance only for plain IPv4 forwarding and does not consider other types of forwarding such as IPv6 and SRv6.
The Open Platform for NFV Project (OPNFV) [opnfv] is a Linux Foundation project that aims to provide a carrier-grade, integrated platform to quickly introduce new products and network services in the industry. The NFVbench [nfvbench] toolkit, developed under the OPNFV umbrella, allows developers, system integrators, testers and customers to measure and assess the L2/L3 forwarding performance of an NFV-infrastructure solution stack using a black-box approach. The toolkit is agnostic of the installer, hardware, controller or network stack used. VSPERF [vsperf] is another OPNFV project, specialized in benchmarking virtual switch performance. VSPERF has reported results for both VPP and OVS, based on a series of test cases executed daily [vsperf-results].
The FD.io project has released a technical paper [intel-cisco-report] analysing the performance of several data-plane implementations such as DPDK, VPP and OVS-DPDK. The work reports a comparison between DPDK L2 forwarding, OVS-DPDK L2 Cross-Connect, VPP L2 Cross-Connect and VPP IPv4 forwarding in terms of throughput measured in pps. The FD.io Continuous System Integration and Testing (CSIT) project released a report characterizing VPP performance [csit-report]. The report describes a methodology to test VPP forwarding performance for several test cases, including L2 forwarding, L3 IPv4 forwarding, L3 IPv6 forwarding as well as some SRv6 behaviors. Regarding the latter, the report shows the performance of the SRv6 H.Encaps, End.AD, End.AM and End.AS behaviors. However, the report does not cover the performance of the rest of the SR policy headend and endpoint behaviors.
The performance of some SRv6 behaviors is reported in [lebrun2017implementing], [lebrun2017reaping]. These works mainly focus on some SR policy headend behaviors such as H.Insert and H.Encaps. The reported results show the overhead introduced by applying the SRv6 encapsulation to IPv6 traffic. However, the reported performance can be considered outdated, as it refers to the SRv6 implementation in the Linux kernel 4.12 release. Moreover, these works do not report the performance of any SRv6 endpoint behavior, as they were not supported in the kernel at that time.
The work in [vladislavic2019throughput] presents a performance evaluation methodology for Linux kernel packet forwarding. The methodology divides the kernel forwarding operations into execution areas (EAs), which can be pinned to different CPU cores (or to the same core in the case of single-CPU measurements) and measured independently. The EAs are: i) sending; ii) forwarding; iii) receiving. The measured results consider only OVS kernel switching in the case of a single UDP flow.
In [mayer2019efficient], the authors extended the SRv6 implementation in the Linux kernel to support the SRv6 dynamic proxy (End.AD) described in [ietf-spring-sr-service-programming]. The authors named their proposal SRNK (SR-Proxy Native Kernel). The idea is to integrate the SRv6 dynamic proxy implementation directly in the Linux kernel instead of relying on an external kernel module. The work compared the performance of the SRv6 End.AD behavior in the SRNK implementation and in SREXT [srext-srv6-net-prog].
[teamsegment] presents a solution in which low-level network functions such as SRv6 encapsulation are offloaded to Intel FPGA programmable cards. In particular, the authors partially offload the SRv6 processing from the VPP software router to the NICs of the servers, increasing data-path performance and at the same time saving resources. The saved CPU cycles are made available to the VNFs or to other workloads running on the servers. The work compared the performance of the SRv6 End.AD behavior executed by a pure VPP implementation and by the accelerated solution. Test results show that, in the worst scenario, the FPGA cards bring a CPU saving of 67.5%. Moreover, the maximum throughput achievable by a pure VPP solution with 12 cores is achieved by the accelerated solution using only 6 cores.
[leeperformance] studies SRv6 as an alternative user plane protocol to GTP-U [gtpu]. First, the authors propose an implementation of the GTP-U encap/decap functions and of the SRv6 stateless translation behaviors defined in [id-srv6-mobile-uplane]. These behaviors guarantee the coexistence of the two protocols, which is crucial for a gradual roll-out. The authors used programmable data center switches to implement these data-plane functions. Since it is hard to get telemetry from a commercial traffic generator when a translation takes place, the authors injected timestamps with nanosecond resolution to measure the latency of the SRv6 behaviors. Finally, they measured throughput and packet loss under light and heavy traffic conditions in a local environment. The results show no significant performance drop due to the SRv6 translation. Moreover, the latency of the SRv6 behaviors is similar to that of the GTP-U encap/decap functions.
In [ahmedperformance], the authors reported the performance of some SR policy headend and endpoint behaviors. The work focuses on the Linux kernel and shows the performance of the SRv6 behaviors in comparison to plain IPv6 forwarding (IPv4-related behaviors were not considered). The work analysed some performance issues of the SRv6 implementation in the Linux kernel related to the cross-connect behaviors. However, it did not provide any solution to fix these issues. Moreover, [ahmedperformance] did not consider the SRv6 implementations in other software routers such as VPP. The work described in this paper extends and completes the work started in [ahmedperformance] in several directions. First, VPP has been integrated into the SRPerf platform and a performance evaluation is reported. Second, [ahmedperformance] reported the performance of the Linux kernel 4.18, while this work considers Linux kernel 5.2 net-next and also the IPv4-related SRv6 behaviors.
With respect to [ahmedperformance], this work improves the PDR finding procedure by removing the exponential search, with several advantages. First, our previous algorithm required per-forwarder tuning in order to correctly set a minimum lower bound for the rate. The new PDR finder does not require any tuning of the initial lower bound of the rate: it can start operating with a full window and still be more efficient. This is a direct consequence of the different complexity of the two algorithms. Indeed, the old algorithm is less efficient, which is particularly noticeable with forwarders that can match the line rate of the Traffic Generator. Let us briefly discuss the complexity of the algorithms through an analysis of the worst-case scenario, given an initial traffic rate and a target amplitude of the searching window. The old PDR finder first performs an exponential search: it starts from the initial rate and doubles the rate value at each iteration until it stops, leaving a searching window that still has to be narrowed. At this point the logarithmic (binary) search starts, which takes around 6 further steps to reduce the searching window to the target amplitude. The new algorithm instead performs a binary search directly on the full interval, which requires 7 steps in the worst case to reach the desired state.
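The worst-case comparison above can be reproduced with a small step-counting sketch. The parameters are assumptions chosen for illustration (rates relative to a line rate of 100, an initial rate of 1 for the old finder, a target window of 1), and the two functions only model the number of trials, not the actual SRPerf code.

```python
from typing import Callable


def binary_steps(passes: Callable[[float], bool],
                 lower: float, upper: float, min_width: float = 1.0) -> int:
    """Number of trials a binary search needs to shrink [lower, upper]."""
    steps = 0
    while upper - lower > min_width:
        mid = (lower + upper) / 2.0
        if passes(mid):
            lower = mid
        else:
            upper = mid
        steps += 1
    return steps


def old_finder_steps(passes: Callable[[float], bool],
                     line_rate: float = 100.0, start: float = 1.0,
                     min_width: float = 1.0) -> int:
    """Exponential search (rate doubling) followed by a binary search."""
    steps, lower, rate = 0, 0.0, start
    while rate < line_rate:
        steps += 1                      # one trial at the current rate
        if not passes(rate):
            break
        lower, rate = rate, rate * 2.0  # rate sustained: double it
    upper = min(rate, line_rate)
    return steps + binary_steps(passes, lower, upper, min_width)


def new_finder_steps(passes: Callable[[float], bool],
                     line_rate: float = 100.0, min_width: float = 1.0) -> int:
    """Binary search directly on the full window [0, line_rate]."""
    return binary_steps(passes, 0.0, line_rate, min_width)


if __name__ == "__main__":
    at_line_rate = lambda rate: True    # forwarder that never drops packets
    print(old_finder_steps(at_line_rate), new_finder_steps(at_line_rate))
```

For a forwarder that sustains the line rate, this model gives 7 doubling trials plus 6 binary-search trials for the old procedure against 7 trials for the new one, in agreement with the step counts discussed in the text.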
Finally, this work elaborates further on, and solves, the performance problem analyzed in [ahmedperformance] regarding End.DX6 and End.X. Moreover, this work introduces and measures the End.DT4 behavior, which is currently missing in the Linux kernel.
In this paper, we have described the design and implementation of SRPerf, a performance evaluation framework for SRv6 implementations. SRPerf has been designed to be extensible: it can support different forwarding engines, including software and hardware forwarding, and can also be extended to support different traffic generators. For example, in this work we have shown how the VPP software router has been integrated. Moreover, we have presented our evaluation methodology and our PDR finder algorithm.
We have used SRPerf to evaluate the performance of the most used SRv6 behaviors in the Linux kernel and in VPP. The results of the Linux implementation show reasonable performance, and no particular issues have been observed once we fixed the known problems [ahmedperformance]. As regards VPP, it generally achieves a higher forwarding rate, and for the endpoint behaviors it reaches the line rate. Major problems have been observed with the SR policy headend behaviors, where VPP is not able to reach the line rate and performs worst with the inline insertion (H.Insert). The difference in the results between Linux and VPP was expected, since VPP leverages DPDK to accelerate the forwarding; a comparison between VPP and Linux is therefore not fair at the moment.
SRPerf is a valuable tool that can also be used in different contexts: validation of new behaviors, testing, and detection of issues in existing implementations. It fills a gap in the space of reference evaluation platforms for testing network stacks. In this sense, we have shown throughout the paper how we have used SRPerf to identify and fix two issues found in the SRv6 implementation of the Linux kernel. First, End.DT4 was missing and we have added it. Second, we have fixed the issues we found in the x-connect behaviors. We plan to contribute these behaviors back to the upstream branch of the Linux kernel.
Finally, potential directions for future work include evaluating the performance of the SRv6 proxy behaviors such as End.AD and End.AM, and of the endpoint behaviors bound to encap policies such as End.B6 and End.B6.Encaps. Moreover, if new implementations of the SRv6 data plane arise, we plan to include them in our framework and perform an experimental analysis. In this sense, it could be interesting to explore eXpress Data Path (XDP) [xdp], which provides packet processing at the lowest point in the Linux kernel stack and thus avoids most of the overhead introduced by the Linux kernel.
This work has received funding from the Cisco University Research Program Fund.