Collie: Finding Performance Anomalies in RDMA Subsystems

04/22/2023
by Xinhao Kong, et al.

High-speed RDMA networks are being rapidly adopted in industry for their low latency and reduced CPU overhead. To verify that RDMA can be used in production, system administrators need to understand the set of application workloads that can trigger abnormal performance behaviors (e.g., unexpectedly low throughput, PFC pause frame storms). We design and implement Collie, a tool that lets users systematically uncover performance anomalies in RDMA subsystems without access to hardware internal designs. Instead of testing each hardware device (e.g., NIC, memory, PCIe) individually, Collie takes a holistic approach and constructs a comprehensive search space of application workloads. Collie then uses simulated annealing to drive RDMA-related performance and diagnostic counters toward extreme value regions, finding workloads that can trigger performance anomalies. We evaluate Collie on combinations of various RDMA NICs, CPUs, and other hardware components. Collie found 15 new performance anomalies. All of them have been acknowledged by the hardware vendors, and 7 have already been fixed since we reported them. We also present our experience in using Collie to avoid performance anomalies in an RDMA RPC library and an RDMA distributed machine learning framework.
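
To make the search idea concrete, here is a minimal sketch of simulated annealing over a workload search space, written in Python. It is not Collie's implementation. The search dimensions (`num_qps`, `msg_size`, `batch_size`), their values, and the `toy_counter` function are all hypothetical stand-ins chosen so the sketch runs without RDMA hardware; in a real setting the measurement function would run the workload and read a hardware performance or diagnostic counter.

```python
import math
import random

# Hypothetical, simplified workload search space (an assumption for this
# sketch, not the dimensions Collie actually uses).
SEARCH_SPACE = {
    "num_qps": [1, 8, 64, 512],
    "msg_size": [64, 1024, 65536, 1048576],
    "batch_size": [1, 4, 16, 64],
}


def toy_counter(workload):
    """Invented stand-in for a diagnostic counter read from real hardware."""
    score = 2.0 * math.log2(workload["num_qps"])
    score += 10.0 - abs(math.log2(workload["msg_size"]) - 10.0)
    score += math.log2(workload["batch_size"])
    return score


def neighbor(workload, rng):
    """Return a copy of the workload with one dimension changed."""
    new = dict(workload)
    key = rng.choice(sorted(SEARCH_SPACE))
    choices = [v for v in SEARCH_SPACE[key] if v != workload[key]]
    new[key] = rng.choice(choices)
    return new


def anneal(measure, steps=500, t_start=5.0, t_end=0.01, seed=0):
    """Search for the workload that maximizes measure(workload)."""
    rng = random.Random(seed)
    current = {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}
    current_score = measure(current)
    best, best_score = dict(current), current_score
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        temp = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        cand = neighbor(current, rng)
        cand_score = measure(cand)
        delta = cand_score - current_score
        # Always accept improvements; accept regressions with a
        # probability that shrinks as the temperature drops.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_score = cand, cand_score
            if current_score > best_score:
                best, best_score = dict(current), current_score
    return best, best_score


if __name__ == "__main__":
    workload, score = anneal(toy_counter)
    print("most extreme workload found:", workload, "counter value:", score)
```

Accepting occasional regressions early on lets the search escape workloads that look extreme only locally, which matters when each measurement is expensive and the counter landscape is unknown.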

