Mahdi Nazemi

  • High-Performance FPGA Implementation of Equivariant Adaptive Separation via Independence Algorithm for Independent Component Analysis

    Independent Component Analysis (ICA) is a dimensionality reduction technique that can boost the efficiency of machine learning models that deal with probability density functions, e.g., Bayesian neural networks. Algorithms that implement adaptive ICA converge more slowly than their nonadaptive counterparts; however, they are capable of tracking changes in the underlying distributions of input features. This intrinsically slow convergence of adaptive methods, combined with existing hardware implementations that operate at very low clock frequencies, necessitates fundamental improvements in both algorithm and hardware design. This paper presents an algorithm that allows efficient hardware implementation of ICA. Compared to previous work, our FPGA implementation of adaptive ICA improves clock frequency by at least one order of magnitude and throughput by at least two orders of magnitude. Our proposed algorithm is not limited to ICA and can be applied to various machine learning problems that use stochastic gradient descent optimization.

    07/06/2017 ∙ by Mahdi Nazemi, et al.
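
    The abstract does not reproduce the EASI update itself; as background, here is a minimal NumPy sketch of the standard serial EASI rule (Cardoso and Laheld) that adaptive ICA implementations build on. The step size mu and the cubic nonlinearity g are illustrative choices, not the paper's hardware-friendly variant.

    ```python
    import numpy as np

    def easi_step(W, x, mu=0.005, g=lambda y: y ** 3):
        """One serial EASI update (Cardoso & Laheld):
        W <- W - mu * (y y^T - I + g(y) y^T - y g(y)^T) W, with y = W x."""
        y = W @ x
        gy = g(y)
        H = np.outer(y, y) - np.eye(len(y)) + np.outer(gy, y) - np.outer(y, gy)
        return W - mu * (H @ W)

    # Toy demo: adaptively unmix two uniform (sub-Gaussian) sources.
    rng = np.random.default_rng(0)
    A = rng.normal(size=(2, 2))                 # unknown mixing matrix
    S = rng.uniform(-1, 1, size=(2, 10000))     # independent sources
    W = np.eye(2)
    for x in (A @ S).T:                         # one streaming update per sample
        W = easi_step(W, x)
    # W @ A should now be close to a scaled permutation matrix.
    ```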

  • FFT-Based Deep Learning Deployment in Embedded Systems

    Deep learning has demonstrated its power in many application domains, especially in image and speech recognition. As the backbone of deep learning, deep neural networks (DNNs) consist of multiple layers of various types with hundreds to thousands of neurons. Embedded platforms are becoming essential for deep learning deployment due to their portability, versatility, and energy efficiency. The large model size of DNNs, while providing excellent accuracy, also burdens embedded platforms with intensive computation and storage. Researchers have investigated reducing DNN model size with negligible accuracy loss. This work proposes a Fast Fourier Transform (FFT)-based DNN training and inference model suitable for embedded platforms, with reduced asymptotic complexity of both computation and storage, distinguishing our approach from existing ones. We develop training and inference algorithms with FFT as the computing kernel and deploy the FFT-based inference model on embedded platforms, achieving extraordinary processing speed.

    12/13/2017 ∙ by Sheng Lin, et al.
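
    The abstract does not spell out the computing kernel; below is a minimal NumPy sketch of the underlying idea, assuming circulant-structured weights so that a layer's matrix-vector product reduces from O(n^2) to O(n log n) via the FFT. This illustrates the general technique, not the paper's exact training or inference algorithm.

    ```python
    import numpy as np

    def circulant_matvec_fft(w, x):
        """y = C x for a circulant matrix C with first column w, computed as a
        circular convolution via FFT: O(n log n) time, O(n) weight storage."""
        return np.fft.ifft(np.fft.fft(w) * np.fft.fft(x)).real

    # Check against the explicit O(n^2) product with the full circulant matrix.
    n = 8
    rng = np.random.default_rng(1)
    w, x = rng.normal(size=n), rng.normal(size=n)
    C = np.array([[w[(i - j) % n] for j in range(n)] for i in range(n)])
    assert np.allclose(C @ x, circulant_matvec_fft(w, x))
    ```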

  • A Hardware-Friendly Algorithm for Scalable Training and Deployment of Dimensionality Reduction Models on FPGA

    With the ever-increasing application of machine learning models in various domains such as image classification, speech recognition and synthesis, and health care, designing efficient hardware for these models has gained a lot of popularity. While the majority of research in this area focuses on efficient deployment of machine learning models (a.k.a. inference), this work concentrates on the challenges of training these models in hardware. In particular, this paper presents a high-performance, scalable, reconfigurable solution for both training and deployment of different dimensionality reduction models in hardware by introducing a hardware-friendly algorithm. Compared to state-of-the-art implementations, our proposed algorithm and its hardware realization decrease resource consumption by 50% without any degradation in accuracy.

    01/11/2018 ∙ by Mahdi Nazemi, et al.
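
    The abstract does not name the specific algorithm; as one example of the kind of streaming, hardware-friendly dimensionality-reduction training rule such a design targets, here is a sketch of Oja's rule for learning the leading principal component one sample at a time. This is a stand-in for illustration, not the paper's method; the learning rate eta is an arbitrary choice.

    ```python
    import numpy as np

    def oja_step(w, x, eta=0.01):
        """One step of Oja's rule: w converges to the leading principal
        component using only multiplies and adds, with no matrix inversion."""
        y = w @ x
        return w + eta * y * (x - y * w)

    # Toy demo: data stretched along the first axis; w should align with it.
    rng = np.random.default_rng(2)
    X = rng.normal(size=(5000, 2)) * np.array([3.0, 0.5])
    w = np.array([1.0, 1.0]) / np.sqrt(2)
    for x in X:
        w = oja_step(w, x)
    print(w / np.linalg.norm(w))   # approximately (+/-1, 0)
    ```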

  • Deploying Customized Data Representation and Approximate Computing in Machine Learning Applications

    Major advancements in building general-purpose and customized hardware have been one of the key enablers of the versatility and pervasiveness of machine learning models such as deep neural networks. To sustain this ubiquitous deployment of machine learning models and cope with their computational and storage complexity, several solutions have been employed, such as low-precision representation of model parameters in fixed-point formats and deployment of approximate arithmetic operations. Studying the potency of such solutions in different applications requires integrating them into existing machine learning frameworks for high-level simulation as well as implementing them in hardware to analyze their effects on power/energy dissipation, throughput, and chip area. Lop is a library for design space exploration that bridges the gap between machine learning and efficient hardware realization. It comprises a Python module, which can be integrated with some of the existing machine learning frameworks and implements various customizable data representations, including fixed-point and floating-point, as well as approximate arithmetic operations. Furthermore, it includes a highly parameterized Scala module, which allows synthesizing hardware based on the said data representations and arithmetic operations. Lop allows researchers and designers to quickly compare the quality of their models using various data representations and arithmetic operations in Python and to contrast the hardware cost of viable representations by synthesizing them on their target platforms (e.g., FPGA or ASIC). To the best of our knowledge, Lop is the first library that allows both software simulation and hardware realization using customized data representations and approximate computing techniques.

    06/03/2018 ∙ by Mahdi Nazemi, et al.
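
    Lop's actual API is not shown in the abstract; below is a minimal sketch of what simulating a customizable fixed-point representation looks like in plain Python. The parameter names (int_bits, frac_bits) and the round-to-nearest-with-saturation policy are assumptions for illustration, not Lop's interface.

    ```python
    import numpy as np

    def quantize_fixed(x, int_bits=4, frac_bits=4):
        """Simulate a signed fixed-point Q(int_bits).(frac_bits) format:
        scale, round to nearest, saturate to the representable range."""
        scale = 2.0 ** frac_bits
        qmin = -(2 ** (int_bits + frac_bits - 1))
        qmax = 2 ** (int_bits + frac_bits - 1) - 1
        return np.clip(np.round(np.asarray(x) * scale), qmin, qmax) / scale

    # E.g., Q4.4 (8 bits total): resolution 1/16, range [-8, 7.9375].
    print(quantize_fixed([0.1, 3.14159, 100.0]))   # [0.125, 3.125, 7.9375]
    ```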

  • NullaNet: Training Deep Neural Networks for Reduced-Memory-Access Inference

    Deep neural networks have been successfully deployed in a wide variety of applications, including computer vision and speech recognition. However, the computational and storage complexity of these models has forced the majority of computations to be performed on high-end computing platforms or on the cloud. To cope with this complexity, this paper presents a training method that enables a radically different approach to the realization of deep neural networks through Boolean logic minimization. The aforementioned realization completely removes the energy-hungry step of accessing memory for obtaining model parameters, consumes about two orders of magnitude fewer computing resources compared to realizations that use floating-point operations, and has a substantially lower latency.

    07/23/2018 ∙ by Mahdi Nazemi, et al.
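
    A minimal sketch of the core idea, assuming binarized inputs and a simple threshold-neuron model: once a trained neuron's behavior is tabulated over all Boolean input patterns, the table can be handed to standard two-level logic minimization and realized as gates, so inference fetches no weights from memory. The function name and the threshold model are illustrative, not the paper's exact formulation.

    ```python
    from itertools import product

    def neuron_truth_table(weights, threshold):
        """Tabulate a binary-input threshold neuron as a Boolean function:
        f(x) = 1 iff sum_i w_i * x_i >= threshold, for x in {0, 1}^n.
        After logic minimization, the table replaces all weight fetches."""
        n = len(weights)
        return {x: int(sum(w * b for w, b in zip(weights, x)) >= threshold)
                for x in product((0, 1), repeat=n)}

    # A 3-input neuron that behaves as a majority gate:
    table = neuron_truth_table(weights=[1, 1, 1], threshold=2)
    assert table[(1, 1, 0)] == 1 and table[(1, 0, 0)] == 0
    ```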

  • Energy-Aware Scheduling of Task Graphs with Imprecise Computations and End-to-End Deadlines

    Imprecise computations provide an avenue for scheduling algorithms developed for energy-constrained computing devices by trading off output quality against the utilization of system resources. This work proposes a method for scheduling task graphs with potentially imprecise computations, with the goal of maximizing quality of service subject to a hard deadline and an energy bound. Furthermore, to evaluate the efficacy of the proposed method, a mixed integer linear programming (MILP) formulation of the problem, which provides optimal reference scheduling solutions, is also presented. The effect of potentially imprecise inputs of tasks on their output quality is taken into account in the proposed method. Both the proposed method and the MILP formulation target multiprocessor platforms. Experiments are run on ten randomly generated task graphs. Based on the obtained results, in some cases a feasible schedule of a task graph can be achieved with an energy consumption of less than 50% of the minimum energy required to schedule all tasks in that task graph completely precisely.

    05/10/2019 ∙ by Amirhossein Esmaili, et al.
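
    Not the paper's MILP or scheduling method; a toy sketch of the imprecise-computation model it builds on, where each task has a mandatory part plus an optional part that may be partially executed, and leftover energy is spent greedily on the best quality-per-joule optional work. Linear quality functions and positive optional energies are assumed.

    ```python
    def spend_energy_on_optional(tasks, energy_budget):
        """Greedily allocate a leftover energy budget to optional task parts.
        tasks: list of (name, optional_energy, quality_gain), all
        optional_energy > 0; mandatory parts are assumed already
        scheduled within the deadline."""
        plan, quality = {}, 0.0
        for name, e_opt, gain in sorted(tasks, key=lambda t: t[2] / t[1],
                                        reverse=True):
            frac = min(1.0, energy_budget / e_opt)  # optional parts may run partially
            plan[name], quality = frac, quality + frac * gain
            energy_budget -= frac * e_opt
            if energy_budget <= 0:
                break
        return plan, quality

    plan, q = spend_energy_on_optional(
        [("t1", 2.0, 10.0), ("t2", 1.0, 8.0), ("t3", 3.0, 3.0)], energy_budget=2.5)
    # -> t2 runs fully (gain 8), t1 at 75% (gain 7.5), t3 gets no energy.
    ```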

  • Modeling Processor Idle Times in MPSoC Platforms to Enable Integrated DPM, DVFS, and Task Scheduling Subject to a Hard Deadline

    Energy efficiency is one of the most critical design criteria for modern embedded systems such as multiprocessor systems-on-chip (MPSoCs). Dynamic voltage and frequency scaling (DVFS) and dynamic power management (DPM) are two major techniques for reducing energy consumption in such embedded systems. Furthermore, MPSoCs are becoming more popular for many real-time applications. One of the challenges of integrating DPM with DVFS and task scheduling of real-time applications on MPSoCs is the modeling of idle intervals on these platforms. In this paper, we present a novel approach for modeling idle intervals in MPSoC platforms, which leads to a mixed integer linear programming (MILP) formulation integrating DPM, DVFS, and task scheduling of periodic task graphs subject to a hard deadline. We also present a heuristic approach and compare its results with those obtained from solving the MILP.

    12/19/2018 ∙ by Amirhossein Esmaili, et al.
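
    The paper's MILP is not reproduced here; below is a sketch of the break-even reasoning that makes idle-interval modeling matter for DPM: sleeping during an idle interval only saves energy when the interval is long enough to amortize the sleep/wake-up overhead. The power and transition-cost parameters in the example are illustrative values.

    ```python
    def dpm_sleep_saves_energy(idle_len, p_idle, p_sleep, e_transition,
                               t_transition):
        """Return True iff putting the processor to sleep for an idle interval
        of length idle_len consumes less energy than staying idle.

        Staying idle costs  p_idle * idle_len.
        Sleeping costs      e_transition + p_sleep * (idle_len - t_transition),
        and is only feasible when idle_len >= t_transition (sleep + wake time).
        """
        if idle_len < t_transition:
            return False
        e_idle = p_idle * idle_len
        e_sleep = e_transition + p_sleep * (idle_len - t_transition)
        return e_sleep < e_idle

    # Example: 0.5 W idle, 0.05 W asleep, 0.2 J and 0.3 s to sleep and wake.
    print(dpm_sleep_saves_energy(0.3, 0.5, 0.05, 0.2, 0.3))  # False: overhead dominates
    print(dpm_sleep_saves_energy(2.0, 0.5, 0.05, 0.2, 0.3))  # True: overhead amortized
    ```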