KPynq: A Work-Efficient Triangle-Inequality based K-means on FPGA

K-means is a popular but computation-intensive algorithm for unsupervised learning. To reduce this cost, we present KPynq, a work-efficient, triangle-inequality-based K-means implementation on FPGA for large, high-dimensional datasets. KPynq combines an algorithm-level optimization that balances performance against computation irregularity with a hardware architecture that exploits the pipelining and parallel-processing capabilities of a range of FPGAs. In our experiments, KPynq consistently outperforms a CPU-based standard K-means, with up to 4.2x speedup and up to 218x better energy efficiency.






I Introduction

K-means clustering is a widely applied unsupervised learning algorithm that proves useful in many machine learning scenarios, such as unlabeled data clustering, image segmentation, and feature learning. Despite its popularity, standard K-means often performs poorly because of its high computational complexity. Previous work on K-means hardware acceleration [1] [2] optimizes K-means for a specific dataset or a particular FPGA and therefore lacks adaptability and flexibility. KPynq, in contrast, is more scalable and highly configurable: it exposes a set of tunable parameters (e.g., degree of parallelism) that help it handle a variety of datasets.

KPynq targets the Pynq-Z1, which is based on a Xilinx Zynq SoC [3]. This SoC consists of two subsystems: the PS (Processing System) and the PL (Programmable Logic). A DMA controller and a high-performance AXIS streaming interface provide the data connection between the PS and the PL. A Python program on the PS invokes the hardware accelerator in the PL and initiates the DMA data transfers. The PL-side accelerator of KPynq, shown in Fig. 1, has two main components: the Multi-level Filters (a Point-level Filter and a Group-level Filter) and the Distance Calculator. The Multi-level Filters reduce the number of distance computations at the algorithmic level, while the Distance Calculator performs the distance computations that have not been filtered out.
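To make the filtering idea concrete, the following is a minimal software sketch of a point-level triangle-inequality filter. It is not KPynq's hardware design, whose filter internals are not given in this text. It assumes a Hamerly-style scheme: each point keeps an upper bound on the distance to its assigned centroid and one lower bound on the distance to every other centroid, and a point is skipped whenever the upper bound does not exceed the lower bound. All function and variable names are hypothetical, and no group-level filter is modeled.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def update_centroids(points, labels, centroids):
    """Mean of each cluster; an empty cluster keeps its old centroid."""
    dim = len(points[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for p, lab in zip(points, labels):
        counts[lab] += 1
        for d in range(dim):
            sums[lab][d] += p[d]
    return [
        [s / counts[j] for s in sums[j]] if counts[j] else list(centroids[j])
        for j in range(len(centroids))
    ]


def full_assign(p, centroids, known=None):
    """Return (label, nearest distance, second-nearest distance).

    known = (index, distance) reuses a distance that was already computed.
    """
    best, best_d, second_d = -1, math.inf, math.inf
    for j, c in enumerate(centroids):
        d = known[1] if known and known[0] == j else dist(p, c)
        if d < best_d:
            best, second_d, best_d = j, best_d, d
        elif d < second_d:
            second_d = d
    return best, best_d, second_d


def kmeans_filtered(points, centroids, iters):
    """Run `iters` assignment passes with an assumed point-level filter.

    Returns (labels, centroids, number of point-to-centroid distances).
    Centroid-drift distances are not counted.
    """
    n, k = len(points), len(centroids)
    labels, upper, lower = [0] * n, [0.0] * n, [0.0] * n
    for i, p in enumerate(points):  # first pass: no bounds exist yet
        labels[i], upper[i], lower[i] = full_assign(p, centroids)
    computed = n * k
    for _ in range(iters - 1):
        new = update_centroids(points, labels, centroids)
        drift = [dist(old, c) for old, c in zip(centroids, new)]
        max_drift = max(drift)
        centroids = new
        for i, p in enumerate(points):
            # Triangle inequality: a centroid that moved by `drift` changes
            # any point-to-centroid distance by at most `drift`.
            upper[i] += drift[labels[i]]
            lower[i] -= max_drift
            if upper[i] <= lower[i]:
                continue  # filtered: the assignment cannot have changed
            upper[i] = dist(p, centroids[labels[i]])  # tighten the bound
            computed += 1
            if upper[i] <= lower[i]:
                continue  # filtered after tightening
            labels[i], upper[i], lower[i] = full_assign(
                p, centroids, known=(labels[i], upper[i]))
            computed += k - 1
    return labels, centroids, computed


if __name__ == "__main__":
    random.seed(0)
    pts = [[random.random(), random.random()] for _ in range(300)]
    _, _, used = kmeans_filtered(pts, [list(p) for p in pts[:4]], 10)
    print("distances computed:", used, "of", 300 * 4 * 10, "unfiltered")
```

Because a fully recomputed point reuses its tightened distance, the sketch never computes more point-to-centroid distances than the unfiltered algorithm. The surviving computations are what a distance-calculator stage would handle, and their data-dependent pattern is one source of the irregularity mentioned above.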

Fig. 1: Overview of KPynq.

II Experiment and Conclusion

Our KPynq design is implemented with the Xilinx Vivado Design Suite v2018.2 and deployed on a Pynq-Z1 board [3]. This board is built on the ZYNQ XC7Z020-1CLG400C all-programmable SoC, which has a 650 MHz dual-core ARM Cortex-A9 processor (PS) and Artix-7 family programmable logic (PL) on the same die. Each Cortex-A9 core has a 32 KB 4-way L1 cache and shares a 512 KB L2 cache with the other core. The programmable logic provides 13,300 logic slices (each with four 6-input LUTs and 8 flip-flops), 630 KB of BRAM (280 BRAM_18K blocks), and 220 DSP slices. The auxiliary parts used by our design include a DMA controller and AXIS buses for data communication among the PS, the PL, and external DRAM. Experiments show that KPynq consistently outperforms an optimized CPU-based standard K-means implementation in both speed and energy efficiency, on average, across the six real-life datasets from [4], which cover a wide range of sizes and dimensionalities.
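As a quick consistency check on the resource figures above, the short sketch below recomputes the BRAM capacity and the LUT and flip-flop totals from the per-block and per-slice numbers. The helper `points_that_fit` is hypothetical and purely illustrative: it assumes 32-bit feature values and that the entire BRAM holds point data, neither of which is stated in the text.

```python
# Figures taken from the text.
BRAM_18K_BLOCKS = 280
LOGIC_SLICES = 13_300
LUTS_PER_SLICE = 4
FFS_PER_SLICE = 8

BITS_PER_BRAM_18K = 18 * 1024  # one BRAM_18K block stores 18 Kb


def bram_bytes():
    """Total on-chip BRAM capacity in bytes."""
    return BRAM_18K_BLOCKS * BITS_PER_BRAM_18K // 8


def total_luts():
    """Total 6-input LUTs in the programmable logic."""
    return LOGIC_SLICES * LUTS_PER_SLICE


def total_ffs():
    """Total flip-flops in the programmable logic."""
    return LOGIC_SLICES * FFS_PER_SLICE


def points_that_fit(dim, bytes_per_value=4):
    """Hypothetical upper limit on points resident in BRAM.

    Assumes 32-bit values by default and that all BRAM stores points.
    """
    return bram_bytes() // (dim * bytes_per_value)


if __name__ == "__main__":
    print("BRAM (KB):", bram_bytes() // 1024)
    print("LUTs:", total_luts())
    print("Flip-flops:", total_ffs())
    print("16-dimensional points that fit:", points_that_fit(16))
```

Under these assumptions only a limited number of points can reside on chip at once, which is consistent with the design streaming data between external DRAM and the PL through the DMA controller.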


  • [1] A. G. S. Filho, A. C. Frery, C. C. de Araujo, H. Alice, J. Cerqueira, J. A. Loureiro, M. E. de Lima, M. G. S. Oliveira, and M. M. Horta, “Hyperspectral images clustering on reconfigurable hardware using the k-means algorithm,” in 16th Symposium on Integrated Circuits and Systems Design, 2003. SBCCI 2003. Proceedings., Sep. 2003, pp. 99–104.
  • [2] H. M. Hussain, K. Benkrid, H. Seker, and A. T. Erdogan, “FPGA implementation of k-means algorithm for bioinformatics application: An accelerated approach to clustering microarray data,” in 2011 NASA/ESA Conference on Adaptive Hardware and Systems (AHS), June 2011, pp. 248–255.
  • [3] “Pynq-z1 reference manual [reference.digilentinc].” [Online]. Available:
  • [4] D. Dheeru and E. K. Taniskidou, “UCI machine learning repository,” 2017. [Online]. Available: