Xiaowen Chu


  • AutoML: A Survey of the State-of-the-Art

    Deep learning has penetrated all aspects of our lives and brought us great convenience. However, building a high-quality deep learning system for a specific task is not only time-consuming but also requires extensive resources and human expertise, which hinders the development of deep learning in both industry and academia. To alleviate this problem, a growing number of research projects focus on automated machine learning (AutoML). In this paper, we provide a comprehensive and up-to-date study of the state of the art in AutoML. First, we introduce AutoML techniques in detail according to the machine learning pipeline. Then we summarize existing Neural Architecture Search (NAS) research, one of the most popular topics in AutoML. We also compare the models generated by NAS algorithms with human-designed models. Finally, we present several open problems for future research.

    08/02/2019 ∙ by Xin He, et al.

  • Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs

    Deep learning frameworks have been widely deployed on GPU servers for deep learning applications in both academia and industry. Training deep neural networks (DNNs) involves many standard processes and algorithms, such as convolution and stochastic gradient descent (SGD), yet the running performance of different frameworks can differ even when they run the same deep model on the same GPU hardware. In this paper, we evaluate the running performance of four state-of-the-art distributed deep learning frameworks (i.e., Caffe-MPI, CNTK, MXNet and TensorFlow) over single-GPU, multi-GPU and multi-node environments. We first build performance models of the standard processes in training DNNs with SGD, then benchmark the running performance of these frameworks with three popular convolutional neural networks (i.e., AlexNet, GoogleNet and ResNet-50), and finally analyze the factors that result in the performance gap among the four frameworks. Through both analytical and experimental analysis, we identify bottlenecks and overheads that could be further optimized. The main contribution is two-fold. First, the testing results provide a reference for end users to choose the proper framework for their own scenarios. Second, the proposed performance models and the detailed analysis provide further optimization directions in both algorithmic design and system configuration.

    11/16/2017 ∙ by Shaohuai Shi, et al.

  • GPGPU Performance Estimation with Core and Memory Frequency Scaling

    Graphics Processing Units (GPUs) support dynamic voltage and frequency scaling (DVFS) in order to balance computational performance and energy consumption. However, a simple and accurate way to estimate the performance of a given GPU kernel under different frequency settings on real hardware is still lacking, even though it is important for choosing the best frequency configuration for energy saving. This paper presents a fine-grained model to estimate the execution time of GPU kernels with both core and memory frequency scaling. Over a 2.5x range of both core and memory frequencies among 12 GPU kernels, our model achieves accurate results (within 3.5%) on real hardware. Compared with cycle-level simulators, our model needs only some simple micro-benchmarks to extract a set of hardware parameters, plus performance counters of the kernels, to produce this high accuracy.

    01/19/2017 ∙ by Qiang Wang, et al.

  • Modeling and Evaluation of Synchronous Stochastic Gradient Descent in Distributed Deep Learning on Multiple GPUs

    With huge amounts of training data, deep learning has made great breakthroughs in many artificial intelligence (AI) applications. However, such large-scale data sets present computational challenges, requiring training to be distributed on a cluster equipped with accelerators like GPUs. With the fast increase of GPU computing power, data communication among GPUs has become a potential bottleneck on the overall training performance. In this paper, we first propose a general directed acyclic graph (DAG) model to describe the distributed synchronous stochastic gradient descent (S-SGD) algorithm, which has been widely used in distributed deep learning frameworks. To understand the practical impact of data communication on training performance, we conduct extensive empirical studies on four state-of-the-art distributed deep learning frameworks (i.e., Caffe-MPI, CNTK, MXNet and TensorFlow) over multi-GPU and multi-node environments with different data communication techniques, including PCIe, NVLink, 10GbE, and InfiniBand. Through both analytical and experimental studies, we identify the potential bottlenecks and overheads that could be further optimized. Finally, we make the data set of our experimental traces publicly available, which could be used to support simulation-based studies.

    05/10/2018 ∙ by Shaohuai Shi, et al.

  • Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes

    Synchronized stochastic gradient descent (SGD) optimizers with data parallelism are widely used in training large-scale deep neural networks. Although using larger mini-batch sizes can improve system scalability by reducing the communication-to-computation ratio, it may hurt the generalization ability of the models. To this end, we build a highly scalable deep learning training system for dense GPU clusters with three main contributions: (1) We propose a mixed-precision training method that significantly improves the training throughput of a single GPU without losing accuracy. (2) We propose an optimization approach for extremely large mini-batch sizes (up to 64k) that can train CNN models on the ImageNet dataset without losing accuracy. (3) We propose highly optimized all-reduce algorithms that achieve up to 3x and 11x speedup on AlexNet and ResNet-50, respectively, over NCCL-based training on a cluster with 1024 Tesla P40 GPUs. When training ResNet-50 for 90 epochs, the state-of-the-art GPU-based system with 1024 Tesla P100 GPUs spent 15 minutes and achieved 74.9% top-1 test accuracy, and another KNL-based system with 2048 Intel KNLs spent 20 minutes and achieved 75.4% accuracy. Our training system can achieve 75.8% top-1 test accuracy in only 6.6 minutes using 2048 Tesla P40 GPUs. When training AlexNet for 95 epochs, our system can achieve 58.7% top-1 test accuracy within 4 minutes, which also outperforms all other existing systems.

    07/30/2018 ∙ by Xianyan Jia, et al.

  • MG-WFBP: Efficient Data Communication for Distributed Synchronous SGD Algorithms

    Distributed synchronous stochastic gradient descent has been widely used to train deep neural networks on computer clusters. With the increase of computational power, network communication has become a limiting factor on system scalability. In this paper, we observe that many deep neural networks have a large number of layers with only a small amount of data to be communicated. Based on the fact that merging some short communication tasks into a single one may reduce the overall communication time, we formulate an optimization problem to minimize the training iteration time. We develop an optimal solution named merged-gradient WFBP (MG-WFBP) and implement it in our open-source deep learning platform B-Caffe. Our experimental results on an 8-node GPU cluster with 10GbE interconnect and trace-based simulation results on a 64-node cluster both show that the MG-WFBP algorithm can achieve much better scaling efficiency than the existing methods WFBP and SyncEASGD.

    11/27/2018 ∙ by Shaohuai Shi, et al.

  • Performance Evaluation of Deep Learning Tools in Docker Containers

    With the success of deep learning techniques in a broad range of application domains, many deep learning software frameworks have been developed and are being updated frequently to adapt to new hardware features and software libraries, which poses a big challenge for end users and system administrators. To address this problem, container techniques are widely used to simplify the deployment and management of deep learning software. However, it remains unknown whether container techniques bring any performance penalty to deep learning applications. The purpose of this work is to systematically evaluate the impact of Docker containers on the performance of deep learning applications. We first benchmark the performance of system components (IO, CPU and GPU) in a Docker container and on the host system and compare the results to see if there is any difference. According to our results, computationally intensive jobs, whether running on CPU or GPU, have small overhead, indicating that Docker containers can be applied to deep learning programs. Then we evaluate the performance of some popular deep learning tools deployed in a Docker container and on the host system. It turns out that the Docker container does not cause noticeable drawbacks while running those deep learning tools. Encapsulating a deep learning tool in a container is therefore a feasible solution.

    11/09/2017 ∙ by Pengfei Xu, et al.

  • Measurement and Analysis of the Bitcoin Networks: A View from Mining Pools

    Mining pools, the main components of the Bitcoin network, dominate the computing resources and play essential roles in network security and performance. Although many existing measurements of the Bitcoin network are available, little is known about the details of mining pool behaviors (e.g., empty blocks, mining revenue and transaction collection strategies) and their effects on Bitcoin end users (e.g., transaction fees, transaction delay and transaction acceptance rate). This paper aims to fill this gap with a systematic study of mining pools. We traced over 156 thousand blocks (including about 257 million historical transactions) from February 2016 to January 2019 and collected over 120.25 million unconfirmed transactions from March 2018 to January 2019. Then we conducted a broad range of measurements on the pool evolutions, labeled transactions (blocks) as well as real-time network traffic, and discovered new interesting observations and features. Specifically, our measurements show the following. 1) A few mining pool entities continuously control most of the computing resources of the Bitcoin network. 2) Mining pools are caught in a prisoner's dilemma where mining pools compete to increase their computing resources even though the unit profit of the computing resource decreases. 3) Mining pools are stuck in a Malthusian trap where there is a stage at which the Bitcoin incentives are inadequate for feeding the exponential growth of the computing resources. 4) The market price and transaction fees are not sensitive to the event of halving block rewards. 5) The block interval of empty blocks is significantly lower than the block interval of non-empty blocks. 6) Feerate plays a dominating role in transaction collection strategy for the top mining pools. Our measurements and analysis help to understand and improve the Bitcoin network.

    02/20/2019 ∙ by Canhui Wang, et al.

  • A Distributed Synchronous SGD Algorithm with Global Top-k Sparsification for Low Bandwidth Networks

    Distributed synchronous stochastic gradient descent (S-SGD) with data parallelism requires very high communication bandwidth between computational workers (e.g., GPUs) to exchange gradients iteratively. Recently, Top-k sparsification techniques have been proposed to reduce the volume of data to be exchanged among workers and thus alleviate the network pressure. Top-k sparsification can zero out a significant portion of gradients without impacting the model convergence. However, the sparse gradients must be transferred with their indices, and the irregular indices make aggregating the sparse gradients difficult. Current methods that use AllGather to accumulate the sparse gradients have a communication complexity of O(kP), where P is the number of workers, which is inefficient on low bandwidth networks with a large number of workers. We observe that not all Top-k gradients from P workers are needed for the model update, and therefore we propose a novel global Top-k (gTop-k) sparsification mechanism to address the difficulty of aggregating sparse gradients. Specifically, we choose the global Top-k largest absolute values of gradients from P workers, instead of accumulating all local Top-k gradients to update the model in each iteration. The gradient aggregation method based on gTop-k sparsification, namely gTopKAllReduce, reduces the communication complexity from O(kP) to O(k log_2 P). Through extensive experiments on different DNNs, we verify that gTop-k S-SGD has nearly consistent convergence performance with S-SGD. We evaluate the training efficiency of gTop-k on a cluster with 32 GPU machines which are inter-connected with 1 Gbps Ethernet. The experimental results show that our method achieves up to 2.7-12× higher scaling efficiency than S-SGD with dense gradients, and 1.1-1.7× improvement over the existing Top-k S-SGD.

    01/14/2019 ∙ by Shaohuai Shi, et al.

  • GPU Accelerated Keccak (SHA3) Algorithm

    Hash functions like SHA-1 or MD5 are among the most important cryptographic primitives, especially in the field of information integrity. As an increasing number of methods have been proposed to break these hash algorithms, the US National Institute of Standards and Technology held a competition for a new family of hash functions. Keccak was the winner and was selected to be the next-generation hash function standard, named SHA-3. We aim to implement and optimize batch-mode Keccak algorithms on the NVIDIA GPU platform. Our work considers the case of processing multiple hash tasks at once and implements this case on CPU and GPU respectively. Our experimental results show that GPU performance is significantly higher than CPU performance when processing large batches of small hash tasks.

    02/14/2019 ∙ by Canhui Wang, et al.

  • GPU Accelerated AES Algorithm

    It has been widely accepted that Graphics Processing Units (GPUs) are a promising platform for encryption acceleration; in particular, their support for integer and logical operations makes the implementation easier. However, complexities such as parallel granularity and memory allocation still impose a burden on real-world implementations. In this paper, we propose a new approach for Advanced Encryption Standard acceleration, including both encryption and decryption. Specifically, we adopt the Electronic Code Book mode for cryptographic transformation, a lookup table scheme for fast lookup, and a granularity of one state per thread for thread scheduling. Our experimental results offer researchers a good understanding of GPU architectures and software acceleration. In addition, both our source code and experimental results are freely available.

    02/14/2019 ∙ by Canhui Wang, et al.
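
The global Top-k selection described in the gTop-k abstract above can be illustrated with a small single-process sketch. It is not the authors' implementation: the function names (`local_top_k`, `merge_top_k`, `gtopk_allreduce`) are hypothetical, the workers are simulated as plain Python lists with no real network communication, and the pairwise tree reduction is an assumption about how a k-entry exchange over roughly log_2 P rounds might be organized.

```python
import heapq


def local_top_k(grad, k):
    """Keep the k entries of a dense gradient with the largest magnitude.

    Returns a sparse gradient as {index: value}.
    """
    idx = heapq.nlargest(k, range(len(grad)), key=lambda i: abs(grad[i]))
    return {i: grad[i] for i in idx}


def merge_top_k(a, b, k):
    """Sum two sparse gradients, then keep only the k largest-magnitude sums."""
    summed = dict(a)
    for i, v in b.items():
        summed[i] = summed.get(i, 0.0) + v
    keep = heapq.nlargest(k, summed, key=lambda i: abs(summed[i]))
    return {i: summed[i] for i in keep}


def gtopk_allreduce(sparse_grads, k):
    """Simulated tree reduction over P workers.

    Each round merges workers pairwise and keeps k entries, so every
    exchange carries at most k (index, value) pairs and the number of
    rounds grows like log2(P). A real system would also broadcast the
    result back to all workers; that step is omitted here.
    """
    level = list(sparse_grads)
    while len(level) > 1:
        nxt = []
        for j in range(0, len(level), 2):
            if j + 1 < len(level):
                nxt.append(merge_top_k(level[j], level[j + 1], k))
            else:
                nxt.append(level[j])  # odd worker out waits for the next round
        level = nxt
    return level[0]


# Four simulated workers, each holding a dense gradient of length 4.
worker_grads = [
    [0.5, -0.1, 0.0, 2.0],
    [0.4, 0.1, -3.0, 0.1],
    [-0.2, 1.5, 0.1, 0.0],
    [0.1, 1.0, 0.2, -0.3],
]
k = 2
sparse = [local_top_k(g, k) for g in worker_grads]
global_update = gtopk_allreduce(sparse, k)
print(global_update)
```

Because entries are discarded at every merge, the result approximates the Top-k of the fully summed gradient and need not match it exactly.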