K-means clustering is a widely applied unsupervised learning algorithm that is used in many machine learning scenarios, such as unlabeled data clustering, image segmentation, and feature learning. Despite its popularity, standard K-means often performs poorly because of its high computational complexity. Previous work on K-means hardware acceleration optimizes K-means for a specific dataset or a particular FPGA, and therefore lacks adaptability and flexibility. In contrast, KPynq is scalable and highly configurable: it exposes a set of tunable parameters (e.g., degree of parallelism) that help it handle various datasets.

KPynq targets the Pynq-Z1, which is based on the Xilinx Zynq SoC. This SoC consists of two subsystems: the PS (Processing System) and the PL (Programmable Logic). A DMA controller and a high-performance AXIS streaming interface provide the data connection between PS and PL. A Python program running on the PS is responsible for invoking the hardware accelerator in the PL and initiating the DMA data transfer. The PL hardware accelerator of KPynq, as shown in Fig. 1, includes two main components: the Multi-level Filters (a Point-level Filter and a Group-level Filter) and the Distance Calculator. The Multi-level Filters reduce the number of distance computations at the algorithmic level, while the Distance Calculator performs the distance computations that have not been filtered out.
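The split between filtering and distance calculation can be illustrated with a small software sketch. This section does not spell out the filtering rule, so the version below assumes a triangle-inequality test: if a point's distance to its current center is at most half the distance from that center to its nearest neighboring center, no other center can be closer and the remaining distances are skipped. The names `kmeans`, `half_gap`, and `use_filter` are hypothetical, only a point-level test is shown, and the code is not the KPynq hardware design.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, init_centers, iters=10, use_filter=True):
    """Lloyd's K-means with an optional point-level filter (assumed rule).

    Requires at least two centers. Returns (centers, assignment, n_dist),
    where n_dist counts point-to-center distance computations only
    (center-to-center distances are not counted).
    """
    centers = [list(c) for c in init_centers]
    k = len(centers)
    assign = [0] * len(points)
    n_dist = 0
    for _ in range(iters):
        # Half the distance from each center to its nearest other center.
        half_gap = [
            0.5 * min(dist(centers[a], centers[j]) for j in range(k) if j != a)
            for a in range(k)
        ]
        for i, p in enumerate(points):
            a = assign[i]
            best, best_d = a, dist(p, centers[a])
            n_dist += 1
            if use_filter and best_d <= half_gap[a]:
                continue  # triangle inequality: no other center is closer
            # "Distance calculator": evaluate the distances not filtered out.
            for j in range(k):
                if j == a:
                    continue
                d = dist(p, centers[j])
                n_dist += 1
                if d < best_d:
                    best, best_d = j, d
            assign[i] = best
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assign, n_dist


if __name__ == "__main__":
    random.seed(0)
    blobs = [(0, 0), (10, 10), (-10, 10)]  # synthetic, well-separated data
    pts = [(random.gauss(cx, 1), random.gauss(cy, 1))
           for cx, cy in blobs for _ in range(50)]
    init = [pts[0], pts[50], pts[100]]
    _, _, filtered = kmeans(pts, init, use_filter=True)
    _, _, plain = kmeans(pts, init, use_filter=False)
    print("distance computations:", filtered, "filtered vs", plain, "plain")
```

Because the test is exact, the filtered and unfiltered runs produce the same clustering; only the number of evaluated distances differs. How much is saved depends on the data and on how far the centers still move.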
II Experiment and Conclusion
Our KPynq design is implemented using the Xilinx Vivado Design Suite v2018.2 and deployed on a Pynq-Z1 board. This board is built on the ZYNQ XC7Z020-1CLG400C all-programmable SoC, which has a 650 MHz dual-core ARM Cortex-A9 processor (PS) and Artix-7 family programmable logic (PL) on the same die. Each Cortex-A9 core has a 32 KB 4-way L1 cache and shares a 512 KB L2 cache with the other core. The programmable logic has 13,300 logic slices, each with four 6-input LUTs and 8 flip-flops, 630 KB of BRAM (280 BRAM_18K blocks), and 220 DSP slices. The auxiliary parts used by our design include a DMA controller and AXIS buses for data communication among the PS, the PL, and external DRAM. Experiments show that KPynq consistently outperforms an optimized CPU-based standard K-means implementation, with a speedup and better energy efficiency on average across six real-life datasets from the UCI Machine Learning Repository, which cover a wide range of sizes and dimensionalities.
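The resource figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below derives the LUT, flip-flop, and BRAM totals from the per-slice and per-block numbers in the text. The helper `max_points_in_bram` is hypothetical: it assumes 32-bit values and that every BRAM byte holds point data, so it gives only a rough upper bound and says nothing about how KPynq actually allocates memory.

```python
# Figures taken from the board description in the text.
SLICES = 13_300
LUTS_PER_SLICE = 4
FFS_PER_SLICE = 8
BRAM_18K_BLOCKS = 280
BITS_PER_BRAM_18K = 18 * 1024  # one BRAM_18K block stores 18 Kb

total_luts = SLICES * LUTS_PER_SLICE                      # 53,200 LUTs
total_ffs = SLICES * FFS_PER_SLICE                        # 106,400 flip-flops
bram_bytes = BRAM_18K_BLOCKS * BITS_PER_BRAM_18K // 8     # 645,120 bytes
bram_kbytes = bram_bytes / 1024                           # 630.0 KB


def max_points_in_bram(dims, bytes_per_value=4, capacity=bram_bytes):
    """Upper bound on how many dims-dimensional points fit in `capacity` bytes.

    Hypothetical helper: assumes fixed-width values and no other BRAM users.
    """
    return capacity // (dims * bytes_per_value)


if __name__ == "__main__":
    print(total_luts, total_ffs, bram_kbytes)
    print(max_points_in_bram(16))
```

The 280 blocks of 18 Kb do add up to the stated 630 KB, which confirms that the two BRAM figures in the text describe the same capacity.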
- A. G. S. Filho, A. C. Frery, C. C. de Araujo, H. Alice, J. Cerqueira, J. A. Loureiro, M. E. de Lima, M. G. S. Oliveira, and M. M. Horta, “Hyperspectral images clustering on reconfigurable hardware using the k-means algorithm,” in 16th Symposium on Integrated Circuits and Systems Design (SBCCI 2003), Sep. 2003, pp. 99–104.
- H. M. Hussain, K. Benkrid, H. Seker, and A. T. Erdogan, “FPGA implementation of k-means algorithm for bioinformatics application: An accelerated approach to clustering microarray data,” in 2011 NASA/ESA Conference on Adaptive Hardware and Systems (AHS), Jun. 2011, pp. 248–255.
- “Pynq-Z1 reference manual.” [Online]. Available: https://reference.digilentinc.com/reference/programmable-logic/pynq-z1/reference-manual
- D. Dheeru and E. K. Taniskidou, “UCI machine learning repository,” 2017. [Online]. Available: http://archive.ics.uci.edu/ml