Unlocking the Power of Inline Floating-Point Operations on Programmable Switches

by Yifan Yuan, et al.

The advent of switches with programmable dataplanes has enabled the rapid development of new network functionality, as well as providing a platform for acceleration of a broad range of application-level functionality. However, existing switch hardware was not designed with application acceleration in mind, and thus applications requiring operations or datatypes not used in traditional network protocols must resort to expensive workarounds. Applications involving floating point data, including distributed training for machine learning and distributed query processing, are key examples. In this paper, we propose FPISA, a floating point representation designed to work efficiently in programmable switches. We first implement FPISA on an Intel Tofino switch, but find that it has limitations that impact throughput and accuracy. We then propose hardware changes to address these limitations based on the open-source Banzai switch architecture, and synthesize them in a 15-nm standard-cell library to demonstrate their feasibility. Finally, we use FPISA to implement accelerators for training for machine learning and for query processing, and evaluate their performance on a switch implementing our changes using emulation. We find that FPISA allows distributed training to use 25-75% fewer CPU cores and provide up to 85.9% better throughput in a CPU-constrained environment than SwitchML. For distributed query processing with floating point data, FPISA enables up to 2.7x better throughput than Spark.
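The abstract does not describe FPISA's actual format, so the following is only a rough illustration of why floating point addition is awkward on hardware that offers integer operations alone. It is a minimal sketch, assuming IEEE 754 single-precision inputs: each value is split into an exponent and a signed integer mantissa, and addition is then done with a compare, a shift, and an integer add. The function names (`decompose`, `add_decomposed`, `recompose`) are hypothetical and are not taken from the paper; infinities and NaNs are not handled.

```python
import math
import struct


def decompose(x):
    """Split a float32 into (biased_exponent, signed_mantissa) with bit operations."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31
    exp = (bits >> 23) & 0xFF
    frac = bits & 0x7FFFFF
    if exp == 0:
        # Zero or subnormal: no implicit leading 1, effective exponent is 1.
        exp, mant = 1, frac
    else:
        # Normal number: restore the implicit leading 1 of the mantissa.
        mant = frac | (1 << 23)
    return exp, (-mant if sign else mant)


def add_decomposed(a, b):
    """Add two (exponent, mantissa) pairs using only compare, shift, and integer add."""
    (ea, ma), (eb, mb) = a, b
    if ea < eb:
        (ea, ma), (eb, mb) = (eb, mb), (ea, ma)
    # Align the smaller operand to the larger exponent. The right shift
    # discards low-order bits, which is one source of accuracy loss.
    mb >>= ea - eb
    return ea, ma + mb


def recompose(exp, mant):
    """Convert back to a Python float on the host: mant * 2^(exp - 127 - 23)."""
    return math.ldexp(mant, exp - 150)


if __name__ == "__main__":
    total = add_decomposed(decompose(1.5), decompose(2.25))
    print(recompose(*total))
```

In this sketch the result mantissa is left unnormalized, so repeated additions grow it beyond 24 bits. That choice keeps the per-addition work to a shift and an add, at the cost of needing a wider accumulator and a final normalization step outside the integer-only path.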






