Today’s cloud and high-performance datacenters form a crucial pillar of compute infrastructures and are growing at unprecedented speeds. At the core, they are a collection of machines connected by a fast network carrying petabits per second of internal and external traffic. Emerging online services such as video communication, streaming, and online collaboration increase the incoming and outgoing traffic volume. Furthermore, the growing deployment of specialized accelerators and general trends towards disaggregation exacerbate the quickly growing network load. Packet processing capabilities are therefore a top performance target for datacenters.
These requirements have led to a wave of modernization in datacenter networks: not only are high-bandwidth technologies of up to 200 Gbit/s gaining wide adoption, but endpoints must also be tuned to reduce packet processing overheads. Specifically, remote direct memory access (RDMA) networks move much of the packet and protocol processing to fixed-function hardware units in the network card and place data directly into user-space memory. Even though this greatly reduces packet processing overheads on the CPU, the incoming data must still be processed. A flurry of specialized technologies exists to move additional parts of this processing into network cards, e.g., FPGA virtualization support, simple P4 rewriting rules, or triggered operations.
Table 1: Existing in-network-compute solutions.
|Solution|Description|
|---|---|
|Azure AccelNet|FPGA-based NICs; flow steering|
|P4, FlexNIC|Packet steering and rewriting. FlexNIC adds memory support|
|Mellanox SHARP|Runs on switches; data aggregation and reduction|
|Portals 4, INCA|Sequences of predefined actions can be expressed with triggered operations|
|Mellanox CORE-Direct|Sequences of predefined actions can be chained|
|Cray Aries Reduction Engine|Runs on switches; data reductions (up to 64 bytes)|
|Quadrics, Myrinet|Users define threads to run on the NIC / NIC is re-programmable by users|
|SmartNICs [2, 1]|Runs full Linux stack; offloading of new code requires flashing|
|eBPF (host)|Runs user-defined code (eBPF code) in a virtual machine in the OS kernel|
|eBPF (Netronome)|eBPF programs can be offloaded to the NIC|
|DPDK|Runs in user space. Applications can poll for new raw packets from the NIC|
|StRoM|Handlers for DMA streams are implemented on an FPGA NIC|
|NICA|Binds kernels running on on-NIC accelerators to user sockets|
|sPIN|Applications define C/C++ packet handlers to map to different messages/flows|
Streaming processing in the network (sPIN) defines a unified programming model and architecture for network acceleration beyond simple RDMA. It provides a user-level interface, similar to CUDA for compute acceleration, that accounts for the specialties and constraints of low-latency line-rate packet processing. It defines a flexible and programmable network instruction set architecture (NISA) that not only lowers the barrier of entry but also supports a large set of use cases. For example, Di Girolamo et al. demonstrate up to 10x speedups for serialization and deserialization (marshalling) of non-consecutive data. While the NISA defined by sPIN can be implemented on existing SmartNICs, their microarchitecture (often standard ARM SoCs) is not optimized for packet-processing tasks. In this work, we define an open-source, high-performance, and low-power microarchitecture for sPIN network interface cards (NICs). We break first ground by developing principles for NIC microarchitectures that enable flexible packet processing at 400 Gbit/s line rate.
As core contributions in this work, we
establish principles for flexible and programmable NIC-based packet processing microarchitectures,
design and implement a fully-functional 32-core SoC for packet processing that can be added into any NIC pipeline,
analyze latencies, message rates, and bandwidths for a large set of example processing handlers, and
open-source the SoC design to benefit the community.
We implement PsPIN in synthesizable hardware description language (HDL) code. Overall, it occupies less than 20 mm² in a 22 nm FDSOI process, which is about 25x smaller than an Intel Skylake Xeon die. It achieves similar or higher throughput than the Xeon for most workloads while using at most 6.3 W.
2 In-network compute
In-network compute is the capability of an interconnection network to process, steer, and produce data according to a set of programmable actions. The exact definition of action depends on the specific in-network-compute solution: it can vary from pre-defined actions (e.g., pass or drop a packet according to a set of rules) to fully programmable packet or message handlers (e.g., sPIN handlers).
There are several advantages of computing in the network: (1) More overlap. Applications can define actions to execute on incoming data. Letting the network execute them allows applications to overlap these tasks with other useful work. (2) Lower latency. The network can promptly react to incoming data (cf. Portals 4 triggered operations, virtual functions, sPIN handlers), immediately executing actions that depend on it. Doing the same on the host requires applications to poll for new data, check for dependent actions, and then execute them. (3) Higher throughput. Some in-network-compute solutions enable stream processing of the incoming data. For example, sPIN can run packet handlers on each incoming packet, potentially improving the overall throughput. (4) Less resource contention. Running tasks in the network can reduce the volume of data moved through the PCIe bus and the memory hierarchy. This implies fewer data movements, less memory contention, and less cache pollution, potentially improving the performance of host CPU tasks.
Table 1 surveys existing in-network-compute solutions. We categorize these solutions by the location where the policies are run, the level of programmability, the granularity at which the actions are applied, and their usability.
(L) Location. Policies can be executed at different points along the path from the endpoint sending the data to the endpoint receiving it. We classify in-network-compute solutions as: running in network devices (e.g., on NICs or switches); running in network devices but not providing applications with a fast path to run their actions (e.g., SmartNICs that run a full Linux stack); or running on the host CPUs.
(P) Programmability. It defines the expressiveness of the actions. We distinguish three levels, from most to least expressive: solutions enabling fully programmable actions that can access the message/packet header and payload, access the NIC and host memory, and issue new network operations (e.g., RDMA puts or gets); solutions that provide a predefined set of actions that can be composed among themselves (e.g., P4 match-actions or Portals 4 triggered operations); and solutions providing only predefined functions.
(G) Granularity. Actions can be applied to full messages, which requires the message to be fully received first, or to single packets as they are received. Some solutions enable both types of actions.
(U) Usability. It defines which entities can install actions into the network. We distinguish in-network-compute solutions that let user applications and libraries install actions (even in multi-tenant settings) from solutions that require elevated privileges, service disruption, and/or device memory flashing to install new actions.
Among all solutions of Table 1, sPIN is the only one that runs in the network (specifically in NICs) and lets the users express per-message or per-packet functions (defined in C or C++, called handlers) from which they can access packet data, share NIC memory, and issue NIC and DMA commands. Moreover, the handlers can be defined by user applications and do not require disruptions of the NIC operation. For these reasons, this work focuses on the sPIN programming model, investigating the challenges of building a sPIN engine, and introducing PsPIN, a general and open-source sPIN implementation that can be integrated into any NIC design.
The key idea of sPIN is to extend RDMA by enabling users to define simple processing tasks, called handlers, to be executed directly on the NIC. A message sent through the network is seen as a sequence of packets: the first packet is defined as header, the last one as completion, and all the intermediate ones as payload. As the packets of a message reach their destination, the receiving NIC invokes the packet handlers for each one of them. For each message, three types of handlers are defined: the header handler, executed only on the header packet; the payload handler, executed on all the packets; and the completion handler, executed after all packets have been processed. Handlers are defined by applications running on the host and cross-compiled for the NIC microarchitecture. The programming model that sPIN proposes is similar to CUDA and OpenCL: in those frameworks, applications define kernels to be offloaded to GPUs, whereas in sPIN the kernels (i.e., handlers) are offloaded to the NIC and their execution is triggered by the arrival of packets. Figure 1 sketches the sPIN abstract machine model.
The host CPU defines the packet handlers and associates them with message descriptors. Packet handlers are optional: e.g., by specifying only a header or completion handler, a single handler will be triggered for the incoming message, either at the beginning or the end of it, respectively. Message descriptors, together with the packet handlers, are installed into the NIC. Incoming packets are matched to message descriptors, and handlers are scheduled for execution on Handler Processing Units (HPUs). Handlers can also issue new NIC commands and DMA operations to/from the host memory.
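To make the invocation order concrete, the following is a minimal Python sketch of the handler model described above. It is not the sPIN C/C++ API: the names `MessageDescriptor` and `process_message`, and the choice to pass a shared `state` dictionary to every handler, are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical handler signature: (packet bytes, per-message state).
Handler = Callable[[bytes, dict], None]


@dataclass
class MessageDescriptor:
    # All three handlers are optional, as in the model above.
    header: Optional[Handler] = None
    payload: Optional[Handler] = None
    completion: Optional[Handler] = None


def process_message(packets: List[bytes], desc: MessageDescriptor, state: dict) -> None:
    """Invoke the installed handlers in the order the model prescribes."""
    if not packets:
        return
    if desc.header:
        desc.header(packets[0], state)        # first packet only
    if desc.payload:
        for pkt in packets:                   # every packet of the message
            desc.payload(pkt, state)
    if desc.completion:
        # Simplification: the completion handler is handed the last packet.
        desc.completion(packets[-1], state)


def count_bytes(pkt: bytes, state: dict) -> None:
    state["bytes"] = state.get("bytes", 0) + len(pkt)


def mark_done(pkt: bytes, state: dict) -> None:
    state["done"] = True


state: dict = {}
desc = MessageDescriptor(payload=count_bytes, completion=mark_done)
process_message([b"hdr!", b"payload", b"end"], desc, state)
print(state)
```

The sketch runs handlers sequentially; on a real sPIN NIC, payload handlers of the same message may run in parallel on different HPUs.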
2.1.1 Architectural Specialties
The sPIN abstract machine model specifies a streaming execution model with microarchitectural requirements that are quite different from those of both classical specialized packet processing engines, which normally constrain the type of actions that can be performed or the entity that can program them, and traditional compute cores. We now outline a set of architectural properties that a sPIN implementation should provide to enable fully programmable high-performance packet processing.
S1. Highly parallel. Many payload packets can be processed in parallel. The higher the number of HPUs, the longer the handlers can run without becoming a bottleneck.
S2. Fast scheduling. Arriving packets must be scheduled to HPU cores while maintaining the ordering requirements: header handlers execute before payload handlers, which in turn execute before completion handlers.
S3. Fast explicit memory access. Packet processing has low temporal locality by definition (a packet is seen only once), hence scratchpad memories are better than caches.
S4. Local handler state. Handlers can keep state across packets of a message as well as multiple messages. If the memory is partitioned, then scheduling needs to ensure that the state is reachable/addressable.
S5. Low latency, full throughput. To minimize the time a packet stays in the NIC, the time from when the packet is seen by sPIN to when the handlers execute should be minimized. Furthermore, the sPIN unit must not obstruct line-rate.
S6. Area and power efficiency. A small and power-efficient sPIN unit is easier to integrate into a broad range of NIC architectures.
S7. Handler isolation. Handlers processing a message should not be able to access memory belonging to other messages, especially if they belong to different applications.
S8. Configurability. A sPIN unit should be easily re-configurable to be scaled to different network requirements.
PsPIN is a sPIN implementation designed to match the architectural specialties of Section 2.1.1. PsPIN builds on top of the PULP (parallel ultra-low power) platform, a silicon-proven and open architectural template for scalable and energy-efficient processing. PULP implements the RISC-V ISA and organizes the processing elements in clusters: each cluster has a fixed number of cores (32-bit, single-issue, in-order) and single-cycle-accessible scratchpad memory (S3). The system can be scaled by adding or removing clusters (S1). We have implemented all hardware components of PsPIN in synthesizable hardware description language (HDL) code.
3.1 Architecture Overview
PsPIN has a modular architecture, where the HPUs are grouped into processing clusters. The HPUs are implemented as RISC-V cores, and each cluster is equipped with a single-cycle access scratchpad memory called L1 memory. All clusters are interconnected to each other (i.e., HPUs can access data in remote L1s) and to three off-cluster memories (L2): the packet buffer, the handler memory, and the program memory. Figure 2 shows an overview of how PsPIN integrates in a generic NIC model and its architecture. We adopt a generic NIC model to identify the general building blocks of a NIC architecture. Later, in Section 3.4, we discuss how PsPIN can be integrated in existing NIC architectures.
Host applications access program and handler memories to offload handler code and data, respectively. The management of these memory regions is left to the NIC driver, which is in charge of exposing an interface that lets applications move code and data. The toolchain and the NIC driver extensions to offload handler code and data are out of the scope of this work. Once both code and data for the handlers are offloaded, the host builds an execution context, which contains: pointers to the handler functions (header, payload, and completion handlers), a pointer to the allocated handler memory, and information on how to match packets that need to be processed according to this execution context. The execution context is offloaded to the NIC, where the NIC inbound engine uses it to forward packets to PsPIN.
Receiving data. Incoming data is received by the NIC inbound engine, which is normally interfaced with the host for copying the data to host memory. In a PsPIN-NIC, the inbound engine is also interfaced to the PsPIN unit. The inbound engine must be able to distinguish packets that need to be processed by PsPIN from the ones taking the classical non-processing path. To make this distinction, the inbound engine matches packets to PsPIN execution contexts and, if a match is found, it forwards the packet to the PsPIN unit. Otherwise, the packet is copied to the host as normal. While some networks already have the concept of packet matching (e.g., RDMA NICs match packets to queue pairs), in others this concept is missing and needs to be introduced to enable packet-level processing (see Section 3.4).
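The dispatch decision made by the inbound engine can be sketched as a lookup from a packet-derived key to an execution context. The sketch below is illustrative only: the source does not specify how matching is implemented, so the `match_key` field (which could stand for, e.g., a queue-pair number) and the `route` function are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ExecutionContext:
    # Stands in for the handler pointers, handler memory pointer,
    # and matching information that a real context would carry.
    name: str


@dataclass(frozen=True)
class Packet:
    match_key: int   # hypothetical: e.g., a queue-pair number or a flow hash
    data: bytes


def route(pkt: Packet, contexts: Dict[int, ExecutionContext]) -> str:
    """Return where the inbound engine would send the packet."""
    ctx: Optional[ExecutionContext] = contexts.get(pkt.match_key)
    if ctx is not None:
        return "pspin:" + ctx.name   # processing path
    return "host"                    # classical non-processing path


contexts = {7: ExecutionContext("reduce")}
print(route(Packet(7, b"x"), contexts))   # matched: forwarded to PsPIN
print(route(Packet(9, b"y"), contexts))   # unmatched: copied to the host
```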
Packets to be processed on the NIC are copied to the L2 packet buffer. Once the copy is complete, the NIC inbound engine sends a Handler Execution Request (HER) to PsPIN’s packet scheduler. The HER contains all information necessary to schedule a handler to process the packet, namely a pointer to the packet in the L2 packet buffer and an execution context. If the packet buffer is full, the NIC inbound engine can either back-pressure the senders, send explicit congestion notifications, drop packets, or kill connections. The exact policy to adopt depends on the network in which PsPIN is integrated, and the choice is similar to the case where the host cannot consume incoming packets fast enough.
The packet scheduler selects the processing cluster that processes the new packet. The cluster-local scheduler (CSCHED) is in charge of starting a DMA copy of the packets from the L2 packet buffer to the L1 Tightly-Coupled Data Memory (TCDM) and selecting an idle HPU (H) where to run handlers for packets that are available in L1. Once the packet processing completes, a notification is sent back to the NIC to let it update its view of the packet buffer (e.g., move the head pointer in case the packet buffer is managed as a ring buffer).
Sending Data. Packet handlers, in addition to processing the packet data, can send data over the network or move data to/from host memory. To send data directly from the NIC, the sPIN API provides an RDMA-put operation: When a handler issues this operation, the PsPIN runtime translates it into a NIC command, which is sent to the NIC outbound engine. If the NIC outbound engine cannot receive new commands, the handler blocks waiting for it to become available again. The NIC outbound can send data from either the L2 packet memory, the L2 handler memory, or L1 memories, or it can specify a host memory address as data source, behaving as a host-issued command. To move data to/from the host, the handlers can issue DMA operations: These operations translate to commands that are forwarded to the off-cluster DMA engine, which writes data to host memory through PCIe.
3.2 Control path
Figure 3 shows the PsPIN control path, which includes: receiving HERs from the NIC inbound engine, scheduling packets, handling commands from the handlers, and sending completion notifications back to the NIC.
3.2.1 Inter-cluster packet scheduling
PsPIN becomes aware that there is a new packet to process when it receives an HER from the NIC inbound engine. The HER is received by the packet scheduler, which is composed of the Message Processing Queue (MPQ) engine and the task dispatcher. The MPQ engine handles scheduling dependencies between the packets. These scheduling dependencies are defined by the sPIN programming model:
the header handler is executed on the first packet of a message and no payload handler can start before its completion;
the completion handler is executed after the last packet of a message is received and all payload handlers are completed.
A message is a sequence of packets mapped to an MPQ and matched to an execution context. We let the NIC define the packets that are part of a message or flow. Once the last packet of a message arrives, the NIC marks the corresponding HER with an end-of-message flag, letting PsPIN run the completion handler when all other handlers of that MPQ complete.
To enforce scheduling dependencies, the MPQ engine organizes HERs in linked lists, one per message. If a packet is blocked, e.g., because the header handler is still running, its HER is queued in the linked list corresponding to its message. The MPQ engine then selects, in a round-robin manner, a ready message queue (i.e., non-empty and with no unsatisfied scheduling dependencies), generates a processing task from it, and forwards the task to the task dispatcher. This approach gives us fair scheduling between messages when different messages are received at the same time. We choose to organize blocked HERs in linked lists because, under normal operation, a message is not in a blocked state and its packets should be scheduled at line rate. Hence, an approach with statically allocated FIFO buffers would waste memory cells. However, to avoid the case where a message consumes all the buffer space in the MPQ engine, we statically allocate four cells for each MPQ, allowing other messages to progress even if a message blocks.
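The following Python sketch models the behavior described above under simplifying assumptions: per-message queues are plain deques rather than hardware linked lists with a shared cell pool, only the header-before-payload dependency is tracked, and completion handlers are omitted. The class and method names are hypothetical.

```python
from collections import deque


class MPQEngine:
    """Toy model of round-robin selection among ready message queues."""

    def __init__(self, n_queues: int):
        self.queues = [deque() for _ in range(n_queues)]
        # Per-MPQ header handler state: pending -> running -> done.
        self.header_state = ["pending"] * n_queues
        self.next_q = 0  # round-robin pointer

    def push(self, mpq: int, her: str) -> None:
        self.queues[mpq].append(her)

    def _ready(self, q: int) -> bool:
        # A queue is ready if it is non-empty and its header handler
        # is not currently running (payload packets must wait for it).
        return bool(self.queues[q]) and self.header_state[q] != "running"

    def pop_task(self):
        n = len(self.queues)
        for i in range(n):
            q = (self.next_q + i) % n
            if self._ready(q):
                her = self.queues[q].popleft()
                kind = "payload"
                if self.header_state[q] == "pending":
                    self.header_state[q] = "running"
                    kind = "header"
                self.next_q = (q + 1) % n
                return q, kind, her
        return None  # nothing ready: every queue is empty or blocked

    def header_completed(self, mpq: int) -> None:
        self.header_state[mpq] = "done"


eng = MPQEngine(2)
eng.push(0, "a0"); eng.push(0, "a1"); eng.push(1, "b0")
print(eng.pop_task())   # header of message 0
print(eng.pop_task())   # header of message 1 (round-robin)
print(eng.pop_task())   # message 0 is blocked on its header handler
eng.header_completed(0)
print(eng.pop_task())   # payload of message 0 can now be scheduled
```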
Task dispatcher. The task dispatcher selects the processing cluster to which a task is forwarded for execution. A task can be forwarded to a cluster if that cluster has enough space in its L1 to store the packet data. We use the message ID, which is included in the HER, to determine the home cluster of a message: the task dispatcher tries to schedule packets to their home clusters. If the home cluster cannot accept a packet, the least loaded cluster is selected. The task dispatcher blocks if no cluster can accept the task.
The rationale behind the concept of home cluster is given by the fact that handlers processing packets of the same message can share L1 memory, hence scheduling them on the same cluster avoids remote L1 accesses. Figure 4 shows the memory latency and bandwidth experienced by a single core when copying data from local or remote memories using different access types (i.e., load/stores, DMA). As each core can execute one single-word memory access at a time, the latency for accessing a chunk of data increases linearly with its size. The DMA engine, on the other hand, moves data in bursts, so multiple words can be “in-flight” concurrently.
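The home-cluster policy can be sketched as follows. Two details are assumptions not stated in the text: the home cluster is derived from the message ID by a modulo operation, and "least loaded" is interpreted as the cluster with the most free L1 packet space.

```python
from typing import List, Optional


def pick_cluster(message_id: int, packet_bytes: int,
                 free_l1_bytes: List[int]) -> Optional[int]:
    """Return the cluster to dispatch to, or None if the dispatcher must block."""
    n = len(free_l1_bytes)
    home = message_id % n                      # assumed mapping of ID to home cluster
    if free_l1_bytes[home] >= packet_bytes:
        return home                            # keeps per-message state local in L1
    # Fallback: least loaded cluster, here the one with the most free L1 space.
    best = max(range(n), key=lambda c: free_l1_bytes[c])
    if free_l1_bytes[best] >= packet_bytes:
        return best
    return None                                # no cluster can accept the task


free = [4096, 2048, 0, 512]
print(pick_cluster(5, 1024, free))   # home cluster 1 has space
print(pick_cluster(6, 1024, free))   # home cluster 2 is full, fall back to cluster 0
print(pick_cluster(6, 8192, free))   # nothing fits, the dispatcher blocks
```

Falling back to another cluster trades locality for progress: handlers scheduled away from the home cluster reach the message state through remote L1 accesses.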
Handler execution and completion notification. Within a processing cluster, task execution requests are handled by the cluster-local scheduler. We describe the details of intra-cluster handler scheduling in Section 3.2.2. During their execution, handlers can issue commands that are handled by a command unit. We define three types of commands to interact with the NIC outbound and with the off-cluster DMA engine:
NIC commands to send data over the network: a handler can forward the packet or generate new ones.
DMA commands to move data to and from host memory. The host virtual addresses where the handlers can write to or read from can be passed through application-defined data structures in handler memory.
HostDirect commands are similar to DMA commands but, instead of a source address, they carry 32 B immediate data that is written directly to the host memory address.
Command responses are used to inform the handlers of the completion of the issued commands or error conditions.
Once a handler terminates and there are no in-flight commands for which a response is still pending, a completion notification is generated. The MPQ engine uses this notification to track the state of message queues (e.g., mark a queue as ready when the header handler completes). The notification is also forwarded to the NIC inbound engine, which uses it to free sections in the L2 packet buffer.
3.2.2 Intra-cluster handler scheduling
While HPUs can access the packet data stored in the L2 packet buffer, memory accesses to L2 memories take up to 25 cycles. To save this latency, applications can specify in the execution context the number of bytes of the packet that must be made available in the L1 of the cluster where the handler is executing, enabling single-cycle access to this data. This information is propagated into the task descriptor.
New tasks are received by the cluster-local scheduler (CSCHED), which is in charge of starting a DMA transfer of (part of) the packet data from L2 to L1, as specified by the matched execution context. Tasks that are waiting for a DMA transfer to complete are buffered in a FIFO queue (the DMA engine guarantees in-order completion of the transfers). Once a transfer completes, the corresponding task is popped from the queue and scheduled to an idle HPU. HPUs are interfaced with a memory-mapped device, the HPU driver, from which they can read information about the task to execute.
The PsPIN runtime running on the HPU consists of a loop executing the following steps: (1) Read the handler function pointer from the HPU driver. If the HPU driver has no task/handler to execute, it stops the HPU by clock-gating it. When a task arrives, the HPU is re-enabled and the load completes. (2) Prepare the handler arguments (e.g., packet memory pointer). (3) Call the handler function. (4) Write to a doorbell memory location in the HPU driver to inform it that the handler execution is completed. The HPU driver sends the related completion notification as soon as there are no in-flight commands issued by that handler. To allow overlapping, the HPU driver can buffer a completed task whose completion notification cannot yet be sent and start executing another handler. The new handler blocks if it issues a command or tries to terminate while the HPU driver is still waiting to send the notification of the previous handler.
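The four steps of the runtime loop can be mirrored in software as shown below. This is a behavioral sketch only: the `HPUDriver` class is a hypothetical stand-in for the memory-mapped device, clock gating is modeled by leaving the loop when no task is available, and command handling and notification buffering are omitted.

```python
from collections import deque


class HPUDriver:
    """Hypothetical software stand-in for the memory-mapped HPU driver."""

    def __init__(self, tasks):
        self.tasks = deque(tasks)      # (task_id, handler, packet) tuples
        self.completed = []            # task IDs whose doorbell was rung

    def read_task(self):
        # In hardware this load stalls (the HPU is clock-gated) until a task
        # arrives; the model returns None instead.
        return self.tasks.popleft() if self.tasks else None

    def ring_doorbell(self, task_id):
        self.completed.append(task_id)


def runtime_loop(driver: HPUDriver) -> None:
    while True:
        task = driver.read_task()          # (1) read the handler pointer
        if task is None:
            break                          # model of the clock-gated idle state
        task_id, handler, pkt = task
        args = {"pkt": pkt}                # (2) prepare the handler arguments
        handler(**args)                    # (3) call the handler function
        driver.ring_doorbell(task_id)      # (4) signal completion


seen = []
drv = HPUDriver([(0, lambda pkt: seen.append(len(pkt)), b"abc"),
                 (1, lambda pkt: seen.append(len(pkt)), b"de")])
runtime_loop(drv)
print(drv.completed, seen)
```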
Since multiple HPU drivers can send feedback and issue commands at the same time, we use round-robin arbiters to select, at every cycle, an HPU that can send a feedback and one that can issue a command. Figure 5 shows an overview of a PsPIN processing cluster. The figure shows only the connections relevant to the scheduling processes and to the handling of handler commands. In reality, the HPUs are also interfaced to the cluster DMA engine and can issue arbitrary DMA transfers from/to the accessible L2 handler memory.
Memory accesses and protection. Handlers processing packets matched to the same execution context share the L2 handler memory region that has been allocated by the application when defining the execution context. Additionally, each message shares a statically defined scratchpad area in the L1 of the home cluster. In particular, L1 memories, which are 1 MiB each in our configuration, contain: the packet buffer (32 KiB), the runtime data structure (e.g., HPU stacks, 8 KiB), the message scratchpads (984 KiB). The size of the per-message scratchpad depends on the maximum number of messages that we allow to be in PsPIN at the same time. The current configuration allows for 512 in-flight messages, which are evenly distributed among the clusters (by the message ID), leading to a 7.6 KiB scratchpad per message.
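The L1 partitioning above can be checked with a few lines of arithmetic. The number of clusters is not stated in this paragraph; the sketch assumes 4 clusters (i.e., 32 HPUs in clusters of 8), which is an assumption made to reproduce the quoted per-message scratchpad size.

```python
KIB = 1024

L1_BYTES = 1024 * KIB          # 1 MiB of L1 per cluster
PACKET_BUFFER = 32 * KIB       # L1 packet buffer
RUNTIME_DATA = 8 * KIB         # runtime data structures (e.g., HPU stacks)
SCRATCHPADS = L1_BYTES - PACKET_BUFFER - RUNTIME_DATA

N_CLUSTERS = 4                 # assumption: not stated in this paragraph
MAX_MESSAGES = 512             # in-flight messages across all clusters

messages_per_cluster = MAX_MESSAGES // N_CLUSTERS
scratchpad_kib = SCRATCHPADS / messages_per_cluster / KIB

print(SCRATCHPADS // KIB)      # KiB left for message scratchpads
print(messages_per_cluster)    # messages homed on each cluster
print(scratchpad_kib)          # KiB of scratchpad per message
```

Under this assumption each message gets 7.6875 KiB, in line with the roughly 7.6 KiB quoted in the text.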
To protect against bad memory accesses and guarantee handler isolation (S7), the HPU driver configures the RISC-V Physical Memory Protection (PMP) unit for each task, allowing the core to access only a subset of the address space (e.g., handler code, packet memory, L1 scratchpad). The handlers are always run in user mode. In case of a memory access violation or any other exception, an interrupt is generated and handled by the PsPIN runtime. The exception handling consists of resetting the environment (e.g., stack pointer) for the next handler execution and informing the HPU driver of the error condition. The HPU driver will then send a command to the HostDirect unit to write the error condition to the execution context descriptor in host memory. A failed handler is considered as a completed one, hence it leads to the release of the occupied resources (i.e., packet buffer space).
3.2.3 Monitoring and control
While processing packets on the NIC, there are two scenarios that must be prevented to ensure correct operation: (1) Packets of a message stop coming and the end-of-message is not received. This can be due to many factors, like network failure, network congestion, or bugs in the applications or protocols. (2) Slow handlers that cannot process packets at line rate.
To detect case (1), we use a pseudo-LRU solution on active MPQs (i.e., MPQs that are receiving packets). Every time an MPQ is accessed (i.e., a packet for it is received), it is moved to the back of the LRU list. If the candidate victim does not receive packets for longer than a threshold specified in the execution context of the message that activated it, the MPQ is reset and marked as idle. This event is signaled to the host through the execution context descriptor. Case (2) is detected by the HPU drivers themselves by using a watchdog timer that generates an interrupt on the HPU and causes the runtime to reset it. The timer is configured according to a threshold specified in the execution context either by the NIC driver or by the application itself. This case is handled similarly to memory access violations by notifying the host of the error condition through the execution context descriptor.
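The stalled-message detection for case (1) can be sketched in software as follows. Note that the sketch uses an exact LRU list (a Python `OrderedDict`) rather than the pseudo-LRU scheme used in hardware, and that the class name, method names, and the abstract time unit are assumptions.

```python
from collections import OrderedDict


class MPQMonitor:
    """Exact-LRU model of stalled-message detection on active MPQs."""

    def __init__(self):
        # mpq_id -> (last_seen, timeout); the first entry is the LRU candidate.
        self.active = OrderedDict()

    def on_packet(self, mpq: int, now: int, timeout: int) -> None:
        # Keep the timeout set by the message that activated the MPQ,
        # then move the MPQ to the back of the LRU list.
        _, t = self.active.pop(mpq, (None, timeout))
        self.active[mpq] = (now, t)

    def check(self, now: int):
        """Reset the LRU candidate if it timed out; return its ID, else None."""
        if not self.active:
            return None
        mpq, (last_seen, timeout) = next(iter(self.active.items()))
        if now - last_seen > timeout:
            del self.active[mpq]       # MPQ is reset and marked as idle
            return mpq                 # the host would be notified here
        return None


mon = MPQMonitor()
mon.on_packet(0, now=0, timeout=100)
mon.on_packet(1, now=10, timeout=100)
mon.on_packet(0, now=50, timeout=100)   # MPQ 0 moves to the back
print(mon.check(now=105))   # candidate is MPQ 1, idle for 95: still alive
print(mon.check(now=120))   # MPQ 1 idle for 110: reset
print(mon.check(now=120))   # candidate is now MPQ 0, idle for 70: still alive
```

Checking only the LRU candidate keeps the per-check work constant, at the cost of detecting at most one stalled MPQ per check.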
To understand the time budget available to the handlers, Figure 6 shows the relation between handler execution time and line rate. We assume a PsPIN configuration with 32 HPUs. On the left, it shows the maximum duration handlers may have while still processing packets at line rate for different packet sizes, for 200 Gbit/s and 400 Gbit/s networks. On the right, it shows how the processing throughput is affected by handler duration for different packet sizes and network speeds.
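A first-order estimate of this budget follows from the packet arrival rate: with all HPUs busy, each handler may run for as long as it takes one packet per HPU to arrive. The sketch below computes this bound; it assumes packets are spread evenly over the HPUs and ignores scheduling, DMA, and framing overheads, so it is an idealized upper bound rather than the exact curve of Figure 6.

```python
def handler_budget_ns(n_hpus: int, packet_bytes: int, line_rate_gbps: float) -> float:
    """Maximum handler duration (ns) that still sustains line rate."""
    # bits / (Gbit/s) yields nanoseconds.
    packet_time_ns = packet_bytes * 8 / line_rate_gbps
    return n_hpus * packet_time_ns


for rate in (200, 400):
    for size in (64, 512, 1024):
        print(rate, size, round(handler_budget_ns(32, size, rate), 2))
```

For example, with 32 HPUs and 64-byte packets at 400 Gbit/s the bound is about 41 ns, which corresponds to roughly 41 cycles at a 1 GHz clock.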
3.3 Data path
We now discuss how data flows within PsPIN, explaining the design choices made to guarantee optimal bandwidth. We equip PsPIN with three interconnects: the NIC-Host interconnect, which interfaces the NIC and the host to PsPIN memories; the DMA interconnect, which interfaces the cluster-local DMA engines to both L2 packet buffer and handler memories; and the processing-elements (PE) interconnect, which allows HPUs to read from either L2 memories or remote L1s. Both NIC-Host and DMA interconnect have wide data ports (512 bit), while the PE interconnect is designed for finer granularity accesses (32 bit). Since PsPIN is clocked at 1 GHz, the offered bandwidth of these interconnects is 512 Gbit/s and 32 Gbit/s, respectively. PsPIN’s on-chip interconnects, memory controllers, and DMA engine are based on .
Figure 7 shows an overview of the PsPIN memories, interconnects, and units that can move data (in gray if they are interfaced to but not within PsPIN). We identify three critical data flows that require full bandwidth in order to not obstruct line rate and optimize PsPIN data paths to achieve this goal.
Flow 1: from NIC inbound to L2 packet buffer to clusters’ L1s. The NIC inbound writes packets to the L2 packet buffer at line rate and, in the worst case, this data is always copied to the L1s of the processing clusters by their DMA engines, before starting the handlers. The main bottleneck of this data flow can be the L2 packet buffer, which is accessed in both write and read directions.
Flow 2: from L2/L1 to host memory. Assuming all handlers copy the data to host, we have a steady flow of data towards the host memory. The data source is specified in the command issued by the handlers and can be either the L2 packet buffer, the L2 handler memory, or the clusters’ L1s. This data is moved by the off-cluster DMA engine, which interfaces to an IOMMU to translate the virtual addresses specified in the handler command to physical ones. The IOMMU is updated by the NIC driver when the host registers memory that can be accessed by the NIC.
Flow 3: from L2/L1 to NIC outbound. Similar to flow 2, but the data is moved towards the NIC outbound engine. We assume the NIC outbound has its own DMA engine, which it uses to read data.
All the identified critical flows can involve the L2 packet buffer. To avoid becoming a bottleneck, this memory must provide full bandwidth to the NIC inbound engine and to the cluster-local DMA engines (flow 1), and it must also provide full bandwidth to the system composed of the NIC outbound engine and the off-cluster DMA (flow 2 + flow 3), letting them reach up to 256 Gbit/s of read bandwidth each under full load. To achieve this goal, we implement the L2 packet buffer as a 4 MiB, two-port, full-duplex, multi-banked (32 banks), word-interleaved memory. With 512-bit words, the L2 packet buffer is better suited to wide accesses than to single (32-bit) load/store accesses from HPUs. In fact, if handlers are going to access packets frequently, their execution context can be configured to let PsPIN move packets to L1 before the handlers start. The maximum bandwidth that the L2 packet buffer can sustain is 512 Gbit/s per port, full duplex. This bandwidth is achieved only when there are no bank conflicts. One port of the L2 packet buffer is accessible through the NIC-Host interconnect, where the NIC inbound engine is connected. Only the NIC inbound engine can write through this port, hence it gets the full write bandwidth. Other units connected to the NIC-Host interconnect that can access the L2 packet buffer, namely the NIC outbound engine and the off-cluster DMA engine, share the read bandwidth. The second port is connected to both DMA and PE interconnects. This configuration supports a maximum line rate of 512 Gbit/s, making PsPIN suitable for up to 400 Gbit/s networks.
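To illustrate why word interleaving favors wide sequential accesses, the sketch below maps byte addresses to banks and counts collisions among a group of concurrent accesses. The exact address-to-bank mapping is an assumption (consecutive 512-bit words go to consecutive banks); the port bandwidth uses the 1 GHz clock stated earlier.

```python
WORD_BYTES = 64        # 512-bit words
N_BANKS = 32
CLOCK_GHZ = 1.0

# One word per cycle per port: 512 bit x 1 GHz = 512 Gbit/s.
PORT_GBPS = WORD_BYTES * 8 * CLOCK_GHZ


def bank_of(addr: int) -> int:
    """Assumed word-interleaved mapping: consecutive words hit consecutive banks."""
    return (addr // WORD_BYTES) % N_BANKS


def conflicts(addrs) -> int:
    """Accesses that collide with another access of the group on the same bank."""
    banks = [bank_of(a) for a in addrs]
    return len(banks) - len(set(banks))


base = 0x1000
sequential = [base + i * WORD_BYTES for i in range(32)]          # a 2 KiB burst
strided = [base + i * WORD_BYTES * N_BANKS for i in range(32)]   # 2 KiB stride

print(PORT_GBPS)
print(conflicts(sequential))   # every word lands on a different bank
print(conflicts(strided))      # every word lands on the same bank
```

A sequential burst, such as a packet written by the NIC inbound engine, spreads over all banks, whereas accesses with a stride equal to the interleaving period serialize on a single bank.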
L2 handler and program memory. The L2 handler memory is less bandwidth-critical than the L2 packet buffer, but not less important. In the current configuration, the handler memory is 4 MiB. The sPIN programming model allows the host to access memory regions on the NIC to, e.g., write data needed by the handlers or read data back when a message is fully processed. Host applications can allocate memory regions in this memory through the NIC driver, which manages the allocation state. The host can copy data into the handler memory before the packets that trigger the handlers using it start arriving. For example, Di Girolamo et al. use this memory to store information about MPI datatypes, deploying general handlers that process the packets according to the memory layout described in the handler memory. Differently from the packet buffer, we foresee that the handler memory can be targeted more frequently by the HPUs with 32-bit word accesses, hence we adopt 64-bit-wide banks to reduce the probability of bank conflicts. Similarly to the L2 packet buffer, the handler memory can be involved in flows 2 and 3 and offers a maximum bandwidth of 512 Gbit/s per port, full duplex.
The program memory (32 KiB) stores the handler code. It is accessed by the host to offload code and by the PE interconnect to refill the per-cluster 4 KiB instruction cache. Since this memory is not on the critical path, we implement it as single-port, half-duplex, with 64 Gbit/s bandwidth.
3.4 NIC integration
We described PsPIN within the context of the NIC model discussed in Section 3.1, but how can a PsPIN unit be integrated into existing networks? To answer this question, we identify a set of NIC capabilities, some of which are required for integrating PsPIN, and others that are optional but can provide richer handler semantics. The required capabilities are:
Message/flow matching: Packet handlers are defined per message/flow on the receiver side. The NIC must match a packet to a message/flow to identify the handler(s) to execute. We do not explicitly define messages or flows because this depends on the network into which PsPIN is integrated. For PsPIN, a message or flow is a sequence of packets targeting the same message processing queue (MPQ, see Section 3.2.3). The feedback channel to the NIC inbound engine is used to communicate when an MPQ becomes idle (i.e., does not contain enqueued HERs) and can be remapped to a new NIC-defined message or flow.
Header first: The first packet that is processed by PsPIN must carry the information characterizing the message. This requirement can be relaxed if packets carry information to identify a message or flow (e.g., TCP, UDP).
NICs can provide additional capabilities that can (1) extend the functionalities that the handlers have access to and (2) let the applications make stronger assumptions on the network behavior. Applications can query the NIC capabilities, potentially providing different handlers depending on the available capabilities. One such capability is reliability. With a reliable network layer, PsPIN is guaranteed to receive all packets of a message and to not receive duplicated packets. With this capability, applications can employ non-idempotent handlers. Otherwise, the handlers have to take into account that, e.g., they can be executed more than once on the same packet.
3.4.1 Match-action tables
PsPIN can be integrated into any NIC that provides match-action tables [39, 27, 11]. This is the most general approach. These tables allow the user to specify a set of rules against which packets are matched. If a packet matches a certain rule, a specific action can be executed on this packet. Integrating PsPIN in this context would mean introducing a new action that forwards the packet to the PsPIN unit with a given execution context. This approach would offer the greatest flexibility for packet matching, as it is not tied to a specific network protocol and allows applications to arbitrarily define their concept of message or flow.
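As an illustration of the idea, the following minimal sketch models a match-action table extended with a "forward to PsPIN" action. All names (`Rule`, `ToPsPIN`, `lookup`, the `"to_host"` default) are hypothetical and do not correspond to any existing NIC or P4 interface.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ToPsPIN:
    """Hypothetical action: forward the packet to PsPIN with this context."""
    execution_context: str


@dataclass
class Rule:
    match: Callable[[dict], bool]   # predicate over parsed header fields
    action: object                  # e.g. ToPsPIN(...) or the string "to_host"


def lookup(table, headers):
    """Return the action of the first matching rule.

    Assumption: packets that match no rule follow the regular host path.
    """
    for rule in table:
        if rule.match(headers):
            return rule.action
    return "to_host"
```

A rule such as `Rule(lambda h: h.get("udp_dst") == 9000, ToPsPIN("ctx_kv"))` would then steer one application-defined flow to PsPIN while leaving all other traffic untouched.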
3.4.2 TCP/UDP Processing Sockets
Ethernet  is one of the most dominant technologies in data center networks. Normally, protocols running on top of Ethernet (e.g., TCP or UDP) do not have the concept of a message, which is central for sPIN. Instead, they operate on packets (UDP) or streams of bytes (TCP).
To address this issue, we introduce the concept of a processing socket. A processing socket is identified by a processing attribute and is associated with a set of handlers. If the transport layer of the socket is UDP, then each packet is seen as a message and all the installed handlers are executed on each of them. In this case, the header-first capability is provided by definition (i.e., all packets are header packets). For TCP processing sockets, the header handler is executed on the first received packet (i.e., SYN) and the completion handler is executed when the connection is shut down. The NIC has to be extended in order to match packets to processing sockets. The matching semantics are determined by the socket protocol: destination port for UDP, or source IP, source port, and destination port for TCP. When creating a processing socket, the socket matching information is passed to the NIC, enabling packet matching and processing.
The data read by the applications from a processing socket is formed by the packets that the sPIN handlers forward to the host. If a payload handler terminates with a DROP return code, the packet is dropped; if it terminates with SUCCESS, the packet is forwarded to the host (see Section B.4 of Hoefler et al.  for more details). In the case of a TCP processing socket, the host TCP network stack will need to take into account that PsPIN can modify or drop packets.
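The matching rules and the DROP/SUCCESS semantics above can be sketched as follows. The class and function names are hypothetical, packets are modeled as plain dictionaries, and the choice to pass unmatched packets to the host is an assumption of this sketch.

```python
from enum import Enum


class Ret(Enum):
    SUCCESS = 0
    DROP = 1


class ProcessingSocketTable:
    """Hypothetical NIC-side table mapping packets to processing sockets."""

    def __init__(self):
        self._udp = {}   # dst_port -> payload handler
        self._tcp = {}   # (src_ip, src_port, dst_port) -> payload handler

    def add_udp(self, dst_port, handler):
        self._udp[dst_port] = handler

    def add_tcp(self, src_ip, src_port, dst_port, handler):
        self._tcp[(src_ip, src_port, dst_port)] = handler

    def match(self, pkt):
        """UDP matches on destination port; TCP on the three-field tuple."""
        if pkt["proto"] == "udp":
            return self._udp.get(pkt["dst_port"])
        if pkt["proto"] == "tcp":
            key = (pkt["src_ip"], pkt["src_port"], pkt["dst_port"])
            return self._tcp.get(key)
        return None


def deliver(table, pkt, host_queue):
    """Run the payload handler; forward to the host unless it returns DROP.

    Assumption: packets without a processing socket go straight to the host.
    """
    handler = table.match(pkt)
    if handler is None or handler(pkt) is Ret.SUCCESS:
        host_queue.append(pkt)
```

With this structure, the application only ever reads packets for which the handler returned SUCCESS, matching the description of what a processing socket exposes to the host.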
3.4.3 RDMA-Capable Networks
Remote Direct Memory Access (RDMA) networks let applications expose memory regions over the network, enabling remote processes to access them for reading or writing data. When using RDMA, applications register memory regions on the NIC, so that its IOMMU can translate virtual to physical addresses. Whenever a remote process wants to, e.g., perform a write operation, it has to specify where in the target memory the data has to be written. This memory location can be specified directly, by its target virtual memory address in the write request [26, 6], or indirectly. In the indirect case, the application not only registers the memory but also specifies a receive descriptor that can be matched by incoming remote memory access requests: e.g., in Portals 4, these descriptors are named matched list entries or list entries, according to whether or not they are associated with a set of matching bits.
In general, RDMA NICs already perform the packet matching on the NIC. In the direct case, the NIC matches the virtual address carried by the request to a physical address. In the indirect case, the NIC matches the packet to the receive descriptor, to derive the target memory location. Hence, the message matching capability required is provided; the question is: to which object do we attach the PsPIN handlers? Table 2 reports different RDMA-capable networks and objects where the PsPIN handlers can be attached. For example, associating handlers to the InfiniBand queue pair means that all packets targeting that queue pair will be processed by PsPIN.
|Network||Interface||Handler attachment object|
|InfiniBand , RoCE ||ibverbs ||Queue Pair|
|Bull BXI , Cray Slingshot ||Portals 4||Match List Entry|
|Cray Gemini , Cray Aries ||uGNI, DMAPP ||Memory Handle|
The second required capability is header first. For InfiniBand, this is given by the in-order delivery that the network already provides. For other networks that cannot guarantee in-order delivery (e.g., because of adaptive routing), the NIC must be able to buffer or discard payload packets arriving before the header packet. RDMA-capable networks already implement reliability at the network layer, hence applications can adopt non-idempotent handlers.
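One way a NIC could provide the header-first capability on a network with out-of-order delivery is to hold back early payload packets until the header arrives. The sketch below shows the buffering variant; the class name and the per-message identifier are assumptions, and a real NIC would also need to bound the buffer or fall back to discarding.

```python
from collections import defaultdict


class HeaderFirstGate:
    """Hypothetical gate that releases payload packets only after the header."""

    def __init__(self, forward):
        self._forward = forward            # callable that hands a packet to PsPIN
        self._seen_header = set()          # messages whose header was forwarded
        self._pending = defaultdict(list)  # early payloads, per message

    def on_packet(self, msg_id, is_header, pkt):
        if is_header:
            self._seen_header.add(msg_id)
            self._forward(pkt)
            # Release payloads that overtook the header, in arrival order.
            for early in self._pending.pop(msg_id, []):
                self._forward(early)
        elif msg_id in self._seen_header:
            self._forward(pkt)
        else:
            self._pending[msg_id].append(pkt)
```

On networks with in-order delivery, such as InfiniBand, the pending list stays empty and the gate reduces to a pass-through.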
3.5 Special cases and exceptions
Can PsPIN deadlock if no processing cluster can accept new tasks? In this case, the task dispatcher will block, waiting for a queue to become available again, and this will create back-pressure towards the NIC inbound engine. The system cannot deadlock because the processing clusters can keep running: they do not depend on the arrival of new HERs. The header-before-payloads dependency does not cause problems because, if payload handlers are waiting for the header, it is guaranteed that the header is already being processed (because of the header-first requirement and the in-order scheduling guaranteed by the MPQ engine on a per-message basis). If badly written handlers deadlock, the HPU driver watchdog will trigger, causing the handler to be terminated.
What if a message is not fully delivered? The completion feedback will not be triggered causing resources (e.g., message state in the MPQ engine) to not be freed. PsPIN can detect this case and force resource release (see Section 3.2.3).
4 Evaluation
Our evaluation aims to answer the following questions: (1) How big is PsPIN in terms of post-synthesis area and how does that scale with the number of HPUs? (2) In which cases can PsPIN sustain line rate? (3) Does the choice of implementing sPIN on top of a RISC-V based architecture with a flat non-coherent memory hierarchy pay off? What are the trade-offs of choosing more complex architectures for sPIN?
Simulation environment. We simulate PsPIN in a cycle-accurate testbed comprised of SystemVerilog modules. We use synthesizable modules for all PsPIN components. We develop simulation-only modules modeling the NIC inbound and outbound engines. Our inbound engine takes a trace of packets as input and injects them in PsPIN at a given rate. The outbound engine reads data from PsPIN according to the received commands, generating memory pressure. The host interface is emulated with a PCIe model (PCIe 4.0, 16 lanes), implemented as a fixed-rate data sink. Unless otherwise specified, we do not limit the packet generator injection rate in order to test the maximum throughput PsPIN can offer. Packet handlers are compiled with the PULP SDK, which contains an extended version of GCC 7.1.1 (riscv32). All handlers are compiled with full optimizations on (-O3 -flto).
4.1 Hardware Synthesis and Power
We synthesized PsPIN in GlobalFoundries’ 22 nm fully depleted silicon on insulator (FDSOI) technology using Synopsys DesignCompiler 2019.12, and we were able to close the timing of the system at 1 GHz. Including memories, the entire accelerator has a complexity on the order of 95 MGE (one gate equivalent (GE) equals 0.199 µm² in GF 22 nm FDSOI). Of the overall area, the four clusters (including their L1 memory and the intra-cluster scheduler) occupy 43 %, the L2 memory 51 %, the inter-cluster scheduler 3 %, and the inter-cluster interconnect and L2 memory controllers another 3 %. The L2 memory macros occupy a total area of 9.48 mm². Depending on the NIC architecture into which PsPIN is integrated, the L2 packet buffer could be mapped to the NIC packet buffer, saving memory area. The area of the clusters is dominated by the L1 memory macros, which take 1.65 mm² per cluster. The instruction cache and the cluster interconnect have a complexity of ca. 700 kGE per cluster, which corresponds to ca. 0.2 mm² at 70 % placement density. Each core has a complexity of ca. 50 kGE, which corresponds to ca. 0.014 mm². The total cluster area is ca. 1.99 mm². The total area of our architecture is ca. 18.5 mm² (S6). For comparison, from [31, 40] it can be inferred that a Mellanox BlueField SoC, scaled to 16 ARM A72 cores (22 nm), would occupy 51 mm².
We derive an upper bound for the power consumption of our architecture by assuming 100 % toggle rate on all logic cells and 50/50 % read/write activity at each memory macro. The overall power envelope is 6.1 W, 99.8 % of which is dynamic power (S6). The four clusters consume 62 % of the total power, ca. 3.8 W. Within each cluster, the L1 memory consumes ca. 55 % of the power. The L2 memory consumes 18 % of the total power, ca. 1.1 W. The inter-cluster scheduler consumes 8 % of the total power, ca. 0.5 W. The inter-cluster interconnect and L2 memory controllers consume 11.7 %, ca. 0.7 W. As our architecture offers 32 HPUs, the power normalized to the number of HPUs is 190 mW. The actual power consumption will be significantly smaller for most practical applications, but measuring it will only be possible once a physical prototype can be tested extensively.
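The area and power figures in this section can be cross-checked with a few lines of arithmetic. The sketch below assumes that one gate equivalent corresponds to 0.199 µm² and that logic area is obtained by dividing the raw gate area by the placement density; the variable names are illustrative.

```python
# Cross-check of the reported area and power figures.
# Assumptions: 1 GE = 0.199 um^2 (GF 22 nm FDSOI), 70 % placement density.

GE_UM2 = 0.199
PLACEMENT_DENSITY = 0.70


def logic_area_mm2(gate_equivalents, density=PLACEMENT_DENSITY):
    """Placed logic area in mm^2 for a given gate-equivalent count."""
    return gate_equivalents * GE_UM2 / density / 1e6


icache_and_interconnect_mm2 = logic_area_mm2(700e3)  # ca. 0.2 mm^2
core_mm2 = logic_area_mm2(50e3)                      # ca. 0.014 mm^2

# Area shares of the 18.5 mm^2 total.
clusters_share = 4 * 1.99 / 18.5   # four clusters of ca. 1.99 mm^2 each
l2_share = 9.48 / 18.5             # L2 memory macros

# Upper-bound power envelope normalized to the 32 HPUs.
power_per_hpu_w = 6.1 / 32
```

These values are consistent with the reported 43 % and 51 % area shares and with the 190 mW per HPU quoted in the power analysis.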
We now investigate the performance characteristics of PsPIN: we first discuss the latencies experienced by a packet when being processed by PsPIN. Then, we study the maximum packet processing throughput that PsPIN can achieve and how the complexity of the packet handlers can affect it.
4.2.1 Packet Latency
We define the packet latency as the time that elapses from when PsPIN receives an HER from the NIC inbound engine to when the completion notification for that packet is sent back to it. It does not include the time needed by the NIC inbound engine to write the packet to the L2 packet buffer. The measurements of this section are taken in an unloaded system by instrumenting the cycle-accurate simulation. Overall, we observe latencies ranging from 26 ns for 64 B packets to 40 ns for 1024 B ones. In particular, a task execution request takes 3 ns to arrive at the cluster-local scheduler (i.e., CSCHED in Figure 5). At that point, the packet is copied to the cluster L1 by the cluster-local DMA engine. This transfer has latencies varying from 12 ns for 64 B packets to 26 ns for 1024 B packets. Once the data reaches L1, the task is assigned to an HPU driver in a single cycle. The HPU runtime takes 7 ns to invoke the handler: this time is used for reading the handler function pointer, setting up the handler’s arguments, and making the jump. Once the handler completes, the runtime makes a single-cycle store to the HPU driver to inform it of the completion. The completion notification takes 1 ns to get back to the NIC inbound engine, but it can be delayed by an additional 6 ns and 2 ns if the round-robin arbiters prioritize other HPUs and clusters, respectively.
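The itemized latencies can be added up as in the sketch below, which assumes the 1 GHz clock so that one cycle equals 1 ns. The components listed in the text sum to 25 ns and 39 ns, one nanosecond below the reported totals of 26 ns and 40 ns; the sketch does not attempt to attribute that difference.

```python
# Sum of the latency components itemized in the text (1 cycle = 1 ns at 1 GHz).

DMA_TO_L1_NS = {64: 12, 1024: 26}   # reported L2-to-L1 transfer latencies


def listed_latency_ns(pkt_bytes, arbitration_delay_ns=0):
    """Latency from HER arrival to completion notification, itemized parts only."""
    return (
        3                          # HER reaches the cluster-local scheduler
        + DMA_TO_L1_NS[pkt_bytes]  # packet copy to L1
        + 1                        # task assignment to an HPU driver
        + 7                        # HPU runtime invokes the handler
        + 1                        # completion store to the HPU driver
        + 1                        # notification back to the NIC inbound engine
        + arbitration_delay_ns     # up to 6 + 2 ns under round-robin arbitration
    )
```

Passing `arbitration_delay_ns=8` gives the worst case in which both round-robin arbiters prioritize other HPUs and clusters.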
4.2.2 Packet processing throughput
In Section 3.3 we describe three critical data flows that can run over the PsPIN unit. Flow 1 (inbound flow) moves data from the NIC inbound engine to the L2 packet memory and, from there, to the L1 memory of the processing cluster to which the packet has been assigned. Moving packet data to L1 memories is not always needed. For example, a handler might only use the packet header (e.g., filtering), the packet header plus a small part of the packet payload (e.g., handlers looking at application-specific headers), or it might not need packet data at all (e.g., packet counting). Applications specify the number of bytes that handlers need for each packet. Flows 2 and 3 are the ones moving data from PsPIN to the outside interfaces, namely the NIC outbound (outbound NIC flow) and the host interface through the PCIe bridge (outbound host flow). These flows are generated by the handlers, which can issue commands to move data to the NIC or to the host. Handlers do not necessarily generate commands, as they can consume data on the NIC directly and communicate results to the host once the message processing finishes: e.g., handlers performing data reductions on the NIC, letting the completion handler write data to the host.
Inbound flow. We measure the throughput PsPIN can sustain for the inbound flow as a function of the frequency of the completion notifications received by the MPQ engine and of the packet size. Figure 8 (left) shows the throughput as a function of the number of instructions per handler (x-axis) for different packet sizes (i.e., 64 B, 512 B, and 1024 B packets). We also include the maximum throughput that the current PsPIN configuration can achieve: this is the minimum of the bandwidth offered by the interconnect and the cumulative bandwidth offered by the 32 HPUs when executing that number of instructions. For this benchmark, we let each handler execute integer arithmetic instructions, each completed in a single cycle. Hence, the x-axis can also be read as handler duration in nanoseconds. The data shows that PsPIN can always schedule packets at the maximum available bandwidth and that the HPU runtime introduces minimal overhead (i.e., 8 cycles per packet, see Section 4.2.1).
Figure 8 (right) shows the maximum number of HPUs that are utilized when running handlers of varying instruction counts on different packet sizes. PsPIN can schedule one 64 B packet per cycle. Even with empty handlers, we need 19 HPUs to process them because of the overhead necessary to invoke the handlers. With bigger packets, the time budget increases: handlers with small instruction counts can process 512 B and 1024 B packets at full throughput with a single HPU.
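The maximum-throughput bound described above can be written down as a simplified model. The sketch assumes a 1 GHz clock, one instruction per cycle, an interconnect limit of 512 Gbit/s, and that an HPU is occupied per packet for exactly the 8-cycle runtime overhead plus the handler instructions. This last assumption ignores DMA and scheduling effects, so it may understate the real occupancy and is not meant to reproduce Figure 8.

```python
# Simplified bound: min(interconnect bandwidth, cumulative HPU bandwidth).
# All constants below are assumptions taken from or derived from the text.

INTERCONNECT_GBPS = 512.0
NUM_HPUS = 32
RUNTIME_OVERHEAD_CYCLES = 8
CLOCK_GHZ = 1.0


def max_throughput_gbps(pkt_bytes, handler_instructions):
    """Upper bound on inbound throughput for a given packet size and handler."""
    cycles = RUNTIME_OVERHEAD_CYCLES + handler_instructions
    per_hpu_gbps = pkt_bytes * 8 * CLOCK_GHZ / cycles
    return min(INTERCONNECT_GBPS, NUM_HPUS * per_hpu_gbps)
```

In this model, 1024 B packets remain interconnect-bound up to roughly 500 instructions per handler, after which the bound decreases inversely with the handler length.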
Inbound + outbound flows. We now study the throughput offered when packets are received and sent out of PsPIN. In this case too, we configure the execution context to move the full packet data to L1. For the outbound NIC flow, we develop handlers implementing a UDP ping-pong communication pattern: the handler swaps the IP source and destination addresses and the UDP ports of the packet and issues a NIC command to send it back over the network. Overall, this handler consists of 27 instructions (20 for the swap and 7 for issuing the command). The handlers for the outbound host flow only issue a DMA command to move the packet to the host, without modifying it. We benchmark both cases, in which the packet is sent either from L1 or from the L2 packet buffer.
Figure 9 shows the results of this benchmark. The L2 packet buffer, with its 32 512-bit-wide banks, is optimized for wide accesses, such as those performed by the DMA engines of the involved units. The L1 TCDM is optimized for serving 32-bit word accesses from the HPUs and is organized in 64 32-bit-wide banks. This difference shows up in the throughput and is caused by a higher number of bank conflicts in the data-from-L1 case: with 64 B packets, both outbound flows barely reach 200 Gbit/s when reading from L1, while 400 Gbit/s is reached when reading data from the L2 packet buffer. For bigger packets (512 B), the time budget is large enough to allow the L1 case to reach full bandwidth as well.
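The swap performed by the ping-pong handler can be sketched on a raw buffer as follows. The sketch assumes that the buffer starts at the IPv4 header (no Ethernet header) and omits the NIC send command; the function names are illustrative and this is not the handler code used in the benchmark.

```python
import struct


def swap_udp_endpoints(pkt: bytearray) -> None:
    """Swap IPv4 source/destination addresses and UDP ports in place.

    Assumption: `pkt` starts at the IPv4 header.
    """
    ihl = (pkt[0] & 0x0F) * 4                      # IPv4 header length in bytes
    # IPv4: source address at bytes 12..15, destination at bytes 16..19.
    pkt[12:16], pkt[16:20] = pkt[16:20], pkt[12:16]
    # UDP: source port at bytes 0..1, destination port at bytes 2..3.
    udp = ihl
    pkt[udp:udp + 2], pkt[udp + 2:udp + 4] = pkt[udp + 2:udp + 4], pkt[udp:udp + 2]


def udp_ports(pkt: bytearray):
    """Return (source port, destination port) of the UDP header."""
    ihl = (pkt[0] & 0x0F) * 4
    return struct.unpack("!HH", bytes(pkt[ihl:ihl + 4]))
```

Because the Internet checksum is a one's-complement sum over 16-bit words, swapping these fields changes neither the IPv4 header checksum nor the UDP checksum, so the handler does not need to recompute them.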
4.3 Handlers Characterization
This set of experiments outlines the benefits of adopting a simple, RISC-V-based architecture over more powerful and complex ones. We select a set of use cases ranging from packet steering to full message processing and execute their packet handlers on PsPIN. We then compare the measured performance against the one obtained by running the same handlers on the following architectures:
ault is a 64-bit 2-way SMT, 4-way superscalar, Intel Skylake Xeon Gold 6154 @3 GHz. It supports out-of-order execution, and it is equipped with a 24.75 MiB L3 cache.
zynq is a Xilinx Zynq ZU9EG MPSoC featuring a quad-core ARM Cortex-A53. The Cortex-A53 is a 64-bit 2-way superscalar processor running at 1.2 GHz.
To run on these architectures, we develop a benchmark that loads a predefined list of packets in memory, spawns a set of worker threads, and statically assigns the packets to the workers. This setting can be compared to an ideal DPDK execution since the packets are already in memory and the workers do not experience any DPDK-related overhead (e.g., polling device ports, copying bursts in local buffer). If not otherwise specified, the packet size is set to 2 KiB. The selected use cases are described below.
Data reduction. Reducing data of multiple messages is a core operation of collective reductions  and one-sided accumulations . Given n messages, each carrying m data items of type t, it computes an array of m entries of type t, where entry i is the reduction (according to a given operator) of the i-th data item across the n messages. We benchmark an instance of this use case (named reduce) on 512 packets, each carrying 512 32-bit integers. Payload handlers accumulate data in L1 using the sum operator. The completion handler informs the host that the result is available with a direct host write command. Alternatively, the result can be directly DMAed to host memory.
Data aggregation. This is a common operation in, e.g., data-mining applications , and consists of accumulating the data items carried by a message. This benchmark (aggregate) uses a 1 MiB message of 32-bit integers that are summed up in L1. The completion handler copies the aggregate to host memory.
Packet filtering/rewriting. Packet filtering strategies are employed for intrusion-detection systems, traffic monitoring, and packet sniffing . For each message, this benchmark queries an application-defined hash table (in L2), using the source IP address (32-bit) as key. If a match is found, the UDP destination port is overwritten with the matched value (i.e., emulating VM-specific port redirection), and the packet is written to host memory. This benchmark (filtering) uses 512 messages and a hash table of 65,536 entries.
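The reduce use case can be sketched as a pair of handlers sharing an accumulator. The class and method names are hypothetical, the little-endian payload encoding is an assumption, and the sketch wraps sums at 32 bits to mimic unsigned integer arithmetic.

```python
import struct


class SumReduction:
    """Element-wise sum of 32-bit integers across the packets of a message."""

    def __init__(self, items_per_packet):
        self.acc = [0] * items_per_packet        # accumulator, kept in L1 on PsPIN
        self._fmt = "<%dI" % items_per_packet    # assumption: little-endian payload

    def payload_handler(self, payload: bytes):
        """Accumulate one packet: entry i adds the i-th item of the payload."""
        for i, value in enumerate(struct.unpack(self._fmt, payload)):
            self.acc[i] = (self.acc[i] + value) & 0xFFFFFFFF

    def completion_handler(self):
        """Return the reduced array (on PsPIN, this would be sent to the host)."""
        return list(self.acc)
```

In the benchmark configuration this would be instantiated with 512 items per packet and fed 512 packets before the completion handler runs.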
Key-Value cache. We implement a key-value store (kvstore) cache on the NIC. The cache is stored in L2 and is implemented as a set-associative cache to limit the number of L2 accesses needed to maintain the cache (e.g., eviction victims can be chosen only within a row). We generate a YCSB  workload of 1,000 requests (50/50 read/write ratio, =1.1). The cache associativity is set to 4 and the total number of entries is set to 500. The set is determined as the key (32-bit integer) modulo the number of sets.
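A set-associative cache with the stated parameters (4 ways, 500 entries, set chosen as key modulo the number of sets) can be sketched as below. The text does not state the replacement policy; least-recently-used within a set is an assumption of this sketch, as are the class and method names.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """4-way set-associative key-value cache; eviction stays within one set."""

    def __init__(self, total_entries=500, ways=4):
        self.ways = ways
        self.num_sets = total_entries // ways          # 125 sets for 500 entries
        self.sets = [OrderedDict() for _ in range(self.num_sets)]

    def _set(self, key):
        return self.sets[key % self.num_sets]          # set = key mod number of sets

    def get(self, key):
        s = self._set(key)
        if key in s:
            s.move_to_end(key)                         # mark as most recently used
            return s[key]
        return None

    def put(self, key, value):
        s = self._set(key)
        if key in s:
            s.move_to_end(key)
        elif len(s) >= self.ways:
            s.popitem(last=False)                      # assumed policy: evict LRU
        s[key] = value
```

Restricting victim selection to a single set bounds the number of entries inspected per request to the associativity, which is the property the text relies on to limit L2 accesses.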
Strided datatype. This benchmark (strided_ddt) sends a 1 MiB message that is copied to host memory in blocks of 256 bytes with a stride of 512 bytes. The layout description (i.e., block size and stride) is in L2.
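The address computation behind this strided layout can be sketched as follows, with a block size of 256 bytes and a stride of 512 bytes as stated in the text. The function names are illustrative, and host memory is modeled as a `bytearray` that the caller sizes large enough for the scattered message.

```python
BLOCK = 256    # bytes per contiguous block
STRIDE = 512   # distance between block starts in host memory


def host_offset(msg_offset: int) -> int:
    """Host-memory offset of the byte at `msg_offset` within the message."""
    return (msg_offset // BLOCK) * STRIDE + msg_offset % BLOCK


def scatter(payload: bytes, msg_offset: int, host_mem: bytearray) -> None:
    """Copy a packet payload to host memory, splitting it at block boundaries."""
    pos = 0
    while pos < len(payload):
        off = msg_offset + pos
        n = min(BLOCK - off % BLOCK, len(payload) - pos)  # stay inside one block
        dst = host_offset(off)
        host_mem[dst:dst + n] = payload[pos:pos + n]
        pos += n
```

Since each packet only needs its offset within the message to compute its destinations, packets can be processed independently and in any order.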
Histogram. Given a set of messages, we summarize the received data items by counting them per value. This application is common in distributed join algorithms . In our instance, we receive 512 messages, each carrying 512 integers randomly generated in the interval. The handlers count how many data items per value have been received and finally copy the histogram to the host.
4.3.1 Handler Execution Time
Figure 10 shows the handlers’ performance on zynq, ault, and PsPIN. This benchmark shows that these architectures achieve similar handler execution times and that architectural characteristics like hardware caches have a limited positive impact on packet-processing workloads. We report the handlers’ execution times, the number of executed instructions, the IPC (instructions per cycle), and the number of cache misses. On ault and zynq, we preload all packets in memory and statically partition them over a number of worker threads (i.e., emulating the HPUs). For these architectures, the handlers’ performance is measured with CPU hardware counters. To show the effects of resource contention, we run the experiments first with a single worker thread (i.e., no contention), then with four workers in parallel. For PsPIN, we do not report cache-miss boxplots because PsPIN has no hardware caches.
In most cases, the execution of the handlers on PsPIN does not take more than 2x the best case (i.e., no contention) of the other architectures. The worst case is filtering, which computes a hash function on an 8-byte value; this is a compute-intensive task, which allows ault to run this handler more than 30x faster than PsPIN. We observe that some handlers (e.g., aggregate, filtering, strided_ddt) need fewer instructions (1.2x to 1.6x fewer) on ault and zynq: e.g., on ault, the compiler optimizes aggregate by using SIMD packed integer instructions. However, even if they have fewer instructions, their IPC is limited by the number of cache misses, increasing the overall execution time. In PsPIN, the runtime transfers the packets directly into the L1 scratchpad memory of the processing cluster, enabling single-cycle access. Additionally, since PsPIN has no hardware caches, it does not suffer from cache-line ping-pong scenarios, as happens for, e.g., histogram and reduce, on other architectures. Finally, RISC-V AMOs  enable single-cycle atomic operations that can save up to 3x the instructions over other implementations (e.g., load-linked/store-conditional) for the reduce and histogram cases.
4.3.2 Handler Throughput
Figure 11 (left) shows the maximum theoretical throughput that the considered architectures can achieve while processing the handlers of Figure 10 using 32 cores. To show the maximum achievable throughput, we do not limit the processing by the network bandwidth (i.e., packets are preloaded in memory).
|Arch.||Tech.||Die area||PEs||Memory||Area/PE||Area/PE (scaled)|
|ault||14 nm||485 mm² ||18||43.3 MiB||17.978 mm²||35.956 mm²|
|zynq||16 nm||3.27 mm² ||4||1.125 MiB||0.876 mm²||1.752 mm²|
|PsPIN||22 nm||18.5 mm²||32||12 MiB||0.578 mm²||0.578 mm²|
While this experiment shows that PsPIN achieves comparable throughput in most cases, the comparison is not fair because it does not take into account the area occupied by these architectures. Table 3 summarizes the area estimates for the considered architectures. Figure 11 (right) shows the maximum throughput per area. To compute it, we divide the die area by the number of cores, obtaining the area per core (including an equivalent amount of memory), and scale it to the same production process (22 nm). Then we divide the maximum throughput by the core area. This shows that simpler architectures get the highest payoff in terms of area efficiency for packet-processing workloads. PsPIN proves to be up to 7.71x more area-efficient than zynq (minimum: 1.08x, for strided_ddt) and up to 76.6x more area-efficient than ault (minimum: 1.44x, for filtering). This analysis is made under the highly optimistic assumption that the zynq architecture can be linearly scaled to 32 cores, and the actual difference would likely be much higher.
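The area-efficiency comparison reduces to dividing throughput by the scaled area per processing element. The sketch below uses the scaled values from Table 3; the function names and the throughput arguments are illustrative, and no measured throughput values are assumed.

```python
# Scaled area per processing element (mm^2 at 22 nm), taken from Table 3.
SCALED_AREA_PER_PE_MM2 = {"ault": 35.956, "zynq": 1.752, "PsPIN": 0.578}


def throughput_per_area(throughput, arch):
    """Throughput divided by the scaled area of one processing element."""
    return throughput / SCALED_AREA_PER_PE_MM2[arch]


def area_efficiency_gain(tput_pspin, tput_other, other):
    """How many times more area-efficient PsPIN is than `other`."""
    return throughput_per_area(tput_pspin, "PsPIN") / throughput_per_area(tput_other, other)
```

At equal throughput, the area ratios alone would make PsPIN about 3x more area-efficient than zynq and about 62x more than ault; the reported per-handler ranges then follow from the measured throughput differences.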
Figure 12 shows the actual throughput achieved by the considered handlers on PsPIN for different packet sizes. Unlike Figure 11, which shows the throughput computed from the handler execution times, in this case the full system is utilized and the throughput is measured as a function of the completion notifications. In these settings, handlers experience scheduling overheads and contention on the memories, the NIC outbound engine, and the off-cluster DMA engine.
We observe that PsPIN achieves 400 Gbit/s for filtering, kvstore, and strided_ddt already for 512 B packets. In the other cases, handlers are compute-intensive, and they operate on every 32-bit word of each received packet. Nonetheless, PsPIN achieves more than 200 Gbit/s, which is the state-of-the-art network speed, from 512 B packets. Thanks to the modularity of this architecture (S8), a scenario where 400 Gbit/s must be sustained also for this type of workload can be satisfied by doubling the number of processing clusters.
5 Related Work
One of the oldest concepts related to PsPIN is Active Messages (AM). However, in the AM model, messages are atomic and can be processed only once they are fully received. In sPIN, the processing happens at the packet level, leading to lower latencies and buffer requirements.
sPIN is closely related to systems such as P4 , which allow users to define match-action rules on a per-packet basis and are supported by switch architectures such as RMT , FlexPipe , and Cavium’s XPliant. Those architectures target switches and work on packet headers, not packet data. FlexNIC  extends this idea by introducing modifiable memory and enabling fine-grained steering of DMA streams at the receiver NIC. These extensions can be used to, e.g., partition the key-space of a key-value store and steer requests to specific cores. However, the offloading of complex application-specific tasks (e.g., datatype processing ) has not been demonstrated in this programming model. In contrast, PsPIN allows offloading of arbitrary functions executed on general-purpose processing cores with small hardware extensions to increase throughput and reduce latency.
Fully programmable NICs are not new. They have been used in Quadrics QSNet to accelerate collectives  and to implement early versions of Portals . Some Myrinet NICs  allowed users to offload modules written in C to the specialized NIC cores. Modern approaches to NIC offload [46, 18] require system or cloud providers to implement offload functionality as FPGA modules, while PsPIN uses RISC-V cores that are easier to (re-)program. What differentiates sPIN is a programming model that exposes packetization and enables user-defined packet processing.
6 Conclusions
Processing data in the network is a necessary step to scale applications along with the network speeds. In this work, we define the principles and architectural characteristics of streaming packet processing NICs, which constitute the next step after RDMA acceleration. We propose PsPIN, a power and area efficient RISC-V based unit implementing the sPIN programming model, which can be integrated in existing and future NIC architectures. We evaluate PsPIN, showing that it can process packets at up to 400 Gbit/s line rate and motivate our architectural choices with a performance study of a set of example handlers over different architectures.
-  Broadcom Stingray SmartNIC. https://www.broadcom.com/products/ethernet-connectivity/smartnic. Accessed: 2020-03-18.
-  Mellanox BlueField SmartNIC. https://www.mellanox.com/products/bluefield-overview. Accessed: 2020-03-18.
-  Wikichip.org: Cortex-A53 - Microarchitectures - ARM. https://en.wikichip.org/wiki/arm_holdings/microarchitectures/cortex-a53. Accessed: 2020-04-15.
-  Wikichip.org: Skylake (server) - Microarchitectures - Intel. https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(server). Accessed: 2020-04-15.
-  XC Series GNI and DMAPP API User Guide. https://pubs.cray.com/content/S-2446/CLE%207.0.UP01/xctm-series-gni-and-dmapp-api-user-guide. Accessed: 2020-03-18.
-  Bob Alverson, Edwin Froese, Larry Kaplan, and Duncan Roweth. Cray XC series network. Cray Inc., White Paper WP-Aries 01-1112, 2012.
-  Robert Alverson, Duncan Roweth, and Larry Kaplan. The gemini system interconnect. In 2010 18th IEEE Symposium on High Performance Interconnects, pages 83–87. IEEE, 2010.
-  B. W Barrett, R. Brightwell, S. Hemmert, K. Pedretti, K. Wheeler, K. Underwood, R. Riesen, T. Hoefler, A. B Maccabe, and T. Hudson. The Portals 4.2 Network Programming Interface. Sandia National Laboratories, November 2012, Technical Report SAND2012-10087, 2018.
-  Claude Barthels, Ingo Müller, Timo Schneider, Gustavo Alonso, and Torsten Hoefler. Distributed Join Algorithms on Thousands of Cores. Proceedings of the VLDB Endowment, 10(5), 2017.
-  Tarick Bedeir. Building an RDMA-capable application with IB Verbs. Technical report, HPC Advisory Council, 2010.
-  Pat Bosshart, Dan Daly, Glen Gibb, Martin Izzard, Nick McKeown, Jennifer Rexford, Cole Schlesinger, Dan Talayco, Amin Vahdat, George Varghese, et al. P4: Programming protocol-independent packet processors. ACM SIGCOMM Computer Communication Review, 44(3):87–95, 2014.
-  Pat Bosshart, Glen Gibb, Hun-Seok Kim, George Varghese, Nick McKeown, Martin Izzard, Fernando Mujica, and Mark Horowitz. Forwarding metamorphosis: Fast programmable match-action processing in hardware for sdn. ACM SIGCOMM Computer Communication Review, 43(4):99–110, 2013.
-  Darius Buntinas, Dhabaleswar K Panda, and Ponnuswamy Sadayappan. Fast NIC-based barrier over Myrinet/GM. In Proceedings 15th International Parallel and Distributed Processing Symposium. IPDPS 2001, pages 8–pp. IEEE, 2000.
-  Brian F Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM symposium on Cloud computing, pages 143–154, 2010.
-  Luca Deri. High-speed dynamic packet filtering. Journal of Network and Systems Management, 15(3):401–415, 2007.
-  Saïd Derradji, Thibaut Palfer-Sollier, Jean-Pierre Panziera, Axel Poudes, and François Wellenreiter Atos. The BXI interconnect architecture. In 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects, pages 18–25. IEEE, 2015.
-  Salvatore Di Girolamo, Konstantin Taranov, Andreas Kurth, Michael Schaffner, Timo Schneider, Jakub Beránek, Maciej Besta, Luca Benini, Duncan Roweth, and Torsten Hoefler. Network-Accelerated Non-Contiguous Memory Transfers. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’19, New York, NY, USA, 2019. Association for Computing Machinery.
-  Haggai Eran, Lior Zeno, Maroun Tork, Gabi Malka, and Mark Silberstein. NICA: An infrastructure for inline acceleration of network applications. In 2019 USENIX Annual Technical Conference (USENIX ATC 19), pages 345–362, 2019.
-  Daniel Firestone, Andrew Putnam, Sambhrama Mundkur, Derek Chiou, Alireza Dabagh, Mike Andrewartha, Hari Angepat, Vivek Bhanu, Adrian Caulfield, Eric Chung, et al. Azure accelerated networking: SmartNICs in the public cloud. In 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18), pages 51–66, 2018.
-  Michael Gautschi, Pasquale Davide Schiavone, Andreas Traber, Igor Loi, Antonio Pullini, Davide Rossi, Eric Flamand, Frank K Gürkaynak, and Luca Benini. Near-threshold RISC-V core with DSP extensions for scalable IoT endpoint devices. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 25(10):2700–2713, 2017.
-  Richard L Graham, Devendar Bureddy, Pak Lui, Hal Rosenstock, Gilad Shainer, Gil Bloch, Dror Goldenberg, Mike Dubman, Sasha Kotchubievsky, Vladimir Koushnir, et al. Scalable hierarchical aggregation protocol (SHArP): A hardware architecture for efficient data reduction. In 2016 First International Workshop on Communication Optimizations in HPC (COMHPC), pages 1–10. IEEE, 2016.
-  Richard L Graham, Steve Poole, Pavel Shamis, Gil Bloch, Noam Bloch, Hillel Chapman, Michael Kagan, Ariel Shahar, Ishai Rabinovitz, and Gilad Shainer. ConnectX-2 InfiniBand management queues: First investigation of the new support for network offloaded collective operations. In 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, pages 53–62. IEEE, 2010.
-  Jim Handy. The cache memory book. Morgan Kaufmann, 1998.
-  Torsten Hoefler, Salvatore Di Girolamo, Konstantin Taranov, Ryan E. Grant, and Ron Brightwell. sPIN: High-Performance Streaming Processing In the Network. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’17, New York, NY, USA, 2017. Association for Computing Machinery.
-  Torsten Hoefler, James Dinan, Rajeev Thakur, Brian Barrett, Pavan Balaji, William Gropp, and Keith Underwood. Remote memory access programming in MPI-3. ACM Transactions on Parallel Computing (TOPC), 2(2):1–26, 2015.
-  InfiniBand Trade Association et al. InfiniBand Architecture Specification Release 1.2. http://www.infinibandta.org, 2000.
-  Antoine Kaufmann, Simon Peter, Naveen Kr Sharma, Thomas Anderson, and Arvind Krishnamurthy. High Performance Packet Processing with FlexNIC. In ACM SIGARCH Computer Architecture News, volume 44, pages 67–81. ACM, 2016.
-  Jakub Kicinski and Nicolaas Viljoen. eBPF Hardware Offload to SmartNICs: cls_bpf and XDP.
-  V Pradeep Kumar and RV Krishnaiah. Horizontal aggregations in SQL to prepare data sets for data mining analysis. IOSR Journal of Computer Engineering (IOSRJCE), pages 2278–0661, 2012.
-  Andreas Kurth, Wolfgang Rönninger, Thomas Benz, Matheus Cavalcante, Fabian Schuiki, Florian Zaruba, and Luca Benini. An open-source platform for high-performance non-coherent on-chip communication. arXiv preprint arXiv:2009.05334, 2020.
-  Hugh T Mair, Gordon Gammie, Alice Wang, Rolf Lagerquist, CJ Chung, Sumanth Gururajarao, Ping Kao, Anand Rajagopalan, Anirban Saha, Amit Jain, et al. 4.3 A 20nm 2.5 GHz ultra-low-power tri-cluster CPU subsystem with adaptive power allocation for optimal mobile SoC performance. In 2016 IEEE International Solid-State Circuits Conference (ISSCC), pages 76–77. IEEE, 2016.
-  Message Passing Interface Forum. MPI: A Message-Passing Interface Standard Version 3.0, September 2012. Chapter author for Collective Communication, Process Topologies, and One Sided Communications.
-  Robert M Metcalfe and David R Boggs. Ethernet: Distributed packet switching for local computer networks. Communications of the ACM, 19(7):395–404, 1976.
-  Sebastiano Miano, Matteo Bertrone, Fulvio Risso, Massimo Tumolo, and Mauricio Vásquez Bernal. Creating complex network services with eBPF: Experience and lessons learned. In 2018 IEEE 19th International Conference on High Performance Switching and Routing (HPSR), pages 1–8. IEEE, 2018.
-  John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron. Scalable parallel programming with CUDA. Queue, 6(2):40–53, 2008.
-  Recep Ozdag. Intel® Ethernet Switch FM6000 Series - Software Defined Networking. See goo.gl/AnvOvX, 5, 2012.
-  Kevin T Pedretti and Ron Brightwell. A NIC-Offload Implementation of Portals for Quadrics QsNet. In Fifth LCI International Conference on Linux Clusters, 2004.
-  Fabrizio Petrini, Salvador Coll, Eitan Frachtenberg, and Adolfy Hoisie. Hardware- and software-based collective communication on the Quadrics network. In Proceedings IEEE International Symposium on Network Computing and Applications. NCA 2001, pages 24–35. IEEE, 2001.
-  Salvatore Pontarelli, Roberto Bifulco, Marco Bonola, Carmelo Cascone, Marco Spaziani, Valerio Bruschi, Davide Sanvito, Giuseppe Siracusano, Antonio Capone, Michio Honda, et al. FlowBlaze: Stateful packet processing in hardware. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19), pages 531–548, 2019.
-  Jungyul Pyo, Youngmin Shin, Hoi-Jin Lee, Sung-il Bae, Min-su Kim, Kwangil Kim, Ken Shin, Yohan Kwon, Heungchul Oh, Jaeyoung Lim, et al. 23.1 20nm high-k metal-gate heterogeneous 64b quad-core CPUs and hexa-core GPU for high-performance and energy-efficient mobile application processor. In 2015 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers, pages 1–3. IEEE, 2015.
-  R Rajesh, Kannan Babu Ramia, and Muralidhar Kulkarni. Integration of LwIP stack over Intel® DPDK for high throughput packet delivery to applications. In 2014 Fifth International Symposium on Electronic System Design, pages 130–134. IEEE, 2014.
-  Kadangode Ramakrishnan, Sally Floyd, David Black, et al. The addition of explicit congestion notification (ECN) to IP. RFC 3168, 2001.
-  Davide Rossi, Francesco Conti, Andrea Marongiu, Antonio Pullini, Igor Loi, Michael Gautschi, Giuseppe Tagliavini, Alessandro Capotondi, Philippe Flatresse, and Luca Benini. PULP: A parallel ultra low power platform for next generation IoT applications. In 2015 IEEE Hot Chips 27 Symposium (HCS), pages 1–39. IEEE, 2015.
-  Whit Schonbein, Ryan E Grant, Matthew GF Dosanjh, and Dorian Arnold. INCA: in-network compute assistance. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–13, 2019.
-  Daniele De Sensi, Salvatore Di Girolamo, Kim H. McMahon, Duncan Roweth, and Torsten Hoefler. An in-depth analysis of the Slingshot interconnect, 2020.
-  David Sidler, Zeke Wang, Monica Chiosa, Amit Kulkarni, and Gustavo Alonso. StRoM: Smart remote memory. In Proceedings of the Fifteenth European Conference on Computer Systems, pages 1–16, 2020.
-  John E Stone, David Gohara, and Guochun Shi. OpenCL: A parallel programming standard for heterogeneous computing systems. Computing in science & engineering, 12(3):66–73, 2010.
-  Viswanath Subramanian, Michael R Krause, and Ramesh VelurEunni. Remote direct memory access (RDMA) completion, August 14 2012. US Patent 8,244,825.
-  Dan Terpstra, Heike Jagode, Haihang You, and Jack Dongarra. Collecting performance data with PAPI-C. In Tools for High Performance Computing 2009, pages 157–173. Springer, 2010.
-  Andreas Traber, Florian Zaruba, Sven Stucki, Antonio Pullini, Germain Haugou, Eric Flamand, Frank K Gürkaynak, and Luca Benini. PULPino: A small single-core RISC-V SoC. In 3rd RISC-V Workshop, 2016.
-  Thorsten Von Eicken, David E Culler, Seth Copen Goldstein, and Klaus Erik Schauser. Active messages: a mechanism for integrated communication and computation. ACM SIGARCH Computer Architecture News, 20(2):256–266, 1992.
-  Adam Wagner, Hyun-Wook Jin, Dhabaleswar K Panda, and Rolf Riesen. NIC-based offload of dynamic user-defined modules for Myrinet clusters. In 2004 IEEE International Conference on Cluster Computing (IEEE Cat. No. 04EX935), pages 205–214. IEEE, 2004.
-  Andrew Waterman, Yunsup Lee, Rimas Avizienis, David A Patterson, and Krste Asanović. The RISC-V instruction set manual, Volume II: Privileged architecture, version 1.9. EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2016-129, 2016.
-  Andrew Waterman, Yunsup Lee, David Patterson, and Krste Asanović. The RISC-V instruction set manual, Volume I: User-level ISA, version 2. 2014.
-  Weikuan Yu, Darius Buntinas, Richard L Graham, and Dhabaleswar K Panda. Efficient and scalable barrier over Quadrics and Myrinet with a new NIC-based collective message passing protocol. In 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings., page 182. IEEE, 2004.