Deep learning, a sub-field of machine learning, has driven rapid progress in artificial intelligence research, leading to breakthroughs on long-standing problems in fields such as computer vision and natural language processing. Tools powered by deep learning are changing the way movies are made and diseases are diagnosed, and they play a growing role in understanding and communicating with humans.
Such development is made possible by deep learning frameworks, such as Caffe (Jia et al., 2014), Chainer (Tokui et al., 2015), CNTK (Seide and Agarwal, 2016), Apache (incubating) MXNet (Chen et al., 2015), PyTorch (Paszke et al., 2017), TensorFlow (Abadi et al., 2016), and Theano (Bastien et al., 2012). These frameworks have been crucial in disseminating ideas in the field. Specifically, imperative tools, arguably spearheaded by Chainer, are easy to learn, read, and debug. These benefits have led to the quick adoption of imperative programming interfaces by the Gluon API of MXNet (which can be seamlessly switched to symbolic programming for high performance), PyTorch, and TensorFlow Eager.
Leveraging the imperative Gluon API of MXNet, we design and develop the GluonCV and GluonNLP (referred to as GluonCV/NLP hereinafter) toolkits for deep learning in computer vision and natural language processing. To the best of our knowledge, GluonCV/NLP are the first open source toolkits for deep learning in both computer vision and natural language processing that simultaneously i) provide modular APIs to allow customization by re-using efficient building blocks; ii) provide pre-trained state-of-the-art models, training scripts, and training logs to enable fast prototyping and promote reproducible research; iii) leverage the MXNet ecosystem so that models can be deployed in a wide variety of programming languages including C++, Clojure, Java, Julia, Perl, Python, R, and Scala.
2 Design and Features
In the following, we describe the design and features of GluonCV/NLP.
2.1 Modular APIs
GluonCV/NLP provide access to modular APIs to allow users to customize their model design, training, and inference by re-using efficient components across different models. Such common components include (but are not limited to) data processing utilities, models with individual components, initialization methods, and loss functions.
To illustrate how the modular API facilitates efficient implementation, consider the data API of GluonCV/NLP, which is used to build efficient data pipelines with popular benchmark data sets or those supplied by users. In computer vision and natural language processing tasks, inputs or labels often come in different shapes, such as images with a varying number of objects and sentences of different lengths. Thus, the data API provides a collection of utilities to sample inputs or labels and then transform them into mini-batches that can be computed efficiently. Besides, users can access a wide range of popular data sets via the data
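The sampling and batching idea described above can be sketched in plain Python. This is a minimal illustration under our own assumptions, not the GluonCV/NLP data API: the helper names `pad_batchify` and `sorted_batches` and the toy token ids are hypothetical. Sequences of similar length are grouped into the same mini-batch, and each mini-batch is padded to its longest member so that it forms a rectangular array.

```python
from typing import List, Sequence, Tuple


def pad_batchify(samples: Sequence[Sequence[int]],
                 pad_value: int = 0) -> Tuple[List[List[int]], List[int]]:
    """Pad variable-length sequences to the longest one in the batch.

    Returns the padded batch together with the original lengths, which a
    model can use to ignore the padded positions.
    """
    lengths = [len(s) for s in samples]
    max_len = max(lengths)
    padded = [list(s) + [pad_value] * (max_len - len(s)) for s in samples]
    return padded, lengths


def sorted_batches(lengths: Sequence[int], batch_size: int) -> List[List[int]]:
    """Group sample indices of similar length into the same mini-batch."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


# Hypothetical token-id sequences standing in for tokenized sentences.
sentences = [[5, 2, 9], [7], [4, 4, 4, 4, 1], [3, 8]]
batches = sorted_batches([len(s) for s in sentences], batch_size=2)
for batch_indices in batches:
    batch, valid_lengths = pad_batchify([sentences[i] for i in batch_indices])
    print(batch, valid_lengths)
```

Grouping by length before padding keeps the amount of padding per mini-batch small, which is the usual motivation for length-aware samplers.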
2.2 Model Zoo
Building upon those modular APIs, GluonCV/NLP provide pre-trained state-of-the-art models, training scripts, and training logs via the model zoo to enable fast prototyping and promote reproducible research. As of the time of writing, GluonCV/NLP have provided over 100 models for common computer vision and natural language processing tasks, such as image classification, object detection, semantic segmentation, instance segmentation, pose estimation, word embedding, language model, machine translation, sentiment analysis, natural language inference, dependency parsing, and question answering.
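One common way to expose many models behind a single entry point is a name-to-constructor registry. The sketch below illustrates that pattern only; it is not the toolkits' implementation, and every name in it (`register_model`, `get_model`, `toy_resnet18`) is a hypothetical stand-in. A dictionary takes the place of a real network so that the example runs without any deep learning framework.

```python
from typing import Callable, Dict

# Hypothetical registry mapping model names to constructor functions.
_MODEL_REGISTRY: Dict[str, Callable[..., dict]] = {}


def register_model(name: str):
    """Decorator that records a model constructor under a string name."""
    def decorator(fn):
        _MODEL_REGISTRY[name] = fn
        return fn
    return decorator


@register_model("toy_resnet18")
def toy_resnet18(classes: int = 1000) -> dict:
    # A real constructor would build a network; a dict stands in for it here.
    return {"name": "toy_resnet18", "layers": 18, "classes": classes}


def get_model(name: str, **kwargs) -> dict:
    """Look up a constructor by name and build the model."""
    if name not in _MODEL_REGISTRY:
        raise ValueError(
            f"Unknown model {name!r}; available: {sorted(_MODEL_REGISTRY)}")
    return _MODEL_REGISTRY[name](**kwargs)


model = get_model("toy_resnet18", classes=10)
print(model)
```

With such a registry, training scripts can select a model by a string argument, which makes it easy to rerun the same script across many architectures.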
2.3 Leveraging the MXNet Ecosystem
GluonCV/NLP benefit from the MXNet ecosystem by building on MXNet. At the lowest level, MXNet provides high-performance C++ implementations of operators that are leveraged by GluonCV/NLP; thus, improvements in low-level components of MXNet often result in performance gains in GluonCV/NLP. Like any other model implemented with MXNet, GluonCV/NLP models can be trained on CPU, GPU (single or multiple), and multiple machines. In sharp contrast to toolkits built upon other deep learning frameworks, GluonCV/NLP models can usually be deployed, through the unique hybridizing mechanism of MXNet (Zhang et al., 2019), with no or minimal configuration in a wide spectrum of programming languages including C++, Clojure, Java, Julia, Perl, Python, R, and Scala. There are also ongoing efforts to bring more quantization (int8 and float16 inference) benefits from MXNet to GluonCV/NLP to further accelerate model inference.
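The general idea behind switching from imperative to symbolic execution can be illustrated with a toy tracer. The sketch below is not how MXNet implements hybridization; the `Sym` class, the `forward` function, and the JSON graph layout are assumptions made purely for illustration. The same `forward` function runs eagerly on numbers and, when given a symbolic input, is traced into a language-independent graph that a separate interpreter can replay.

```python
import json
from typing import Union

Number = Union[int, float]


class Sym:
    """A node in a toy symbolic graph: a variable, a constant, or an op."""

    def __init__(self, op: str, args=()):
        self.op = op
        self.args = tuple(args)

    def __add__(self, other):
        return Sym("add", (self, _wrap(other)))

    def __mul__(self, other):
        return Sym("mul", (self, _wrap(other)))

    def to_dict(self) -> dict:
        if self.op in ("var", "const"):
            return {"op": self.op, "value": self.args[0]}
        return {"op": self.op, "args": [a.to_dict() for a in self.args]}


def _wrap(x):
    """Turn plain numbers into constant nodes; leave Sym nodes unchanged."""
    return x if isinstance(x, Sym) else Sym("const", (x,))


def forward(x):
    """Model code written once, in imperative style."""
    return x * 2 + 1


def evaluate(node: dict, inputs: dict) -> Number:
    """A tiny interpreter for the exported graph."""
    if node["op"] == "var":
        return inputs[node["value"]]
    if node["op"] == "const":
        return node["value"]
    left, right = (evaluate(a, inputs) for a in node["args"])
    return left + right if node["op"] == "add" else left * right


# Imperative execution: runs eagerly on ordinary numbers.
eager_result = forward(3)

# Symbolic execution: the same function traced into a serializable graph.
graph_json = json.dumps(forward(Sym("var", ("x",))).to_dict())
replayed_result = evaluate(json.loads(graph_json), {"x": 3})
print(eager_result, replayed_result)
```

Because the traced graph is plain data, a runtime written in any language could load and execute it, which is the property that makes deployment outside Python possible.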
The documentation of GluonCV/NLP (https://gluon-cv.mxnet.io/ and http://gluon-nlp.mxnet.io/) includes installation instructions, contribution instructions, open source repositories, an extensive API reference, and comprehensive tutorials. As another benefit of leveraging the MXNet ecosystem, the GluonCV/NLP documentation is supplemented by the interactive open source book Dive into Deep Learning (based on the Gluon API of MXNet) (Zhang et al., 2019), which provides sufficient background knowledge about GluonCV/NLP tasks, models, and building blocks. Notably, some users of Dive into Deep Learning have later become contributors to GluonCV/NLP.
2.4 Requirement, Availability, and Community
GluonCV/NLP are implemented in Python and are available for systems running Linux, macOS, and Windows since Python is platform agnostic. The minimum and open source package (e.g., MXNet) requirements are specified in the documentation. As of the time of writing, GluonCV/NLP have reached versions 0.6 and 0.4, respectively, and have been open sourced under the Apache 2.0 license. Since the initial release of the source code in April 2018, GluonCV/NLP have attracted 100 contributors worldwide. Models of GluonCV/NLP have been downloaded more than 1.6 million times in fewer than 10 months.
We demonstrate the performance of GluonCV/NLP models in various computer vision and natural language processing tasks. Specifically, we evaluate popular or state-of-the-art models on standard benchmark data sets. In the experiments, we compare model performance between GluonCV/NLP and other open source implementations in Caffe, Caffe2, Theano, and TensorFlow, including ResNet (He et al., 2016) and MobileNet (Howard et al., 2017) for image classification (ImageNet), Faster R-CNN (Girshick, 2015) for object detection (COCO), Mask R-CNN (He et al., 2017) for instance segmentation, Simple Pose (Xiao et al., 2018) for pose estimation (COCO), textCNN (Kim, 2014) for sentiment analysis (TREC), and BERT (Devlin et al., 2018) for question answering (SQuAD 1.1), sentiment analysis (SST-2), natural language inference (MNLI-m), and paraphrasing (MRPC). Table 1 shows that the GluonCV/NLP implementations match or outperform the compared open source implementations for the same model evaluated on the same data set.
|Task|Data set|Model|Measure|GluonCV/NLP|Reference|
|---|---|---|---|---|---|
|Image Classification|ImageNet|ResNet-50|top-1 acc.|79.2||
|Image Classification|ImageNet|ResNet-101|top-1 acc.|80.5||
|Image Classification|ImageNet|MobileNet 1.0|top-1 acc.|73.3||
|Object Detection|COCO|Faster R-CNN|mAP|40.1||
|Instance Segmentation|COCO|Mask R-CNN|mask AP|33.1||
|Pose Estimation|COCO|Simple Pose (f)|OKS AP|74.2|N.A.|
|Question Answering|SQuAD 1.1||F1/EM|88.5/81.0|88.5/|
|Question Answering|SQuAD 1.1||F1/EM|91.0/84.1|90.9/|
|Natural Language Inference|MNLI-m||acc.|84.6||

[a] https://github.com/KaimingHe/deep-residual-networks (in Caffe)
[b] https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet_v1.md (in TensorFlow)
[c] https://github.com/facebookresearch/Detectron (in Caffe2)
[d] https://github.com/yoonkim/CNN_sentence (in Theano)
[e] https://github.com/google-research/bert (in TensorFlow)
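For readers unfamiliar with the measures reported in Table 1, the following sketch shows how top-1 accuracy and the exact match (EM) and F1 scores used for question answering can be computed. It is a simplified illustration, not the evaluation code used in our experiments: the function names are hypothetical, and the text normalization is reduced to lowercasing and whitespace tokenization, whereas the official SQuAD evaluation also strips punctuation and articles. The detection and pose measures (mAP, mask AP, OKS AP) are omitted.

```python
from collections import Counter
from typing import Sequence, Tuple


def top1_accuracy(predicted: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of samples whose highest-scoring class equals the label."""
    correct = sum(p == y for p, y in zip(predicted, labels))
    return correct / len(labels)


def exact_match_and_f1(prediction: str, reference: str) -> Tuple[float, float]:
    """Simplified EM and token-level F1 for a single answer string."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    em = float(pred_tokens == ref_tokens)
    # If either answer is empty, F1 is 1 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return em, em
    # Multiset intersection counts tokens shared by both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return em, 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return em, 2 * precision * recall / (precision + recall)


# Hypothetical predictions and labels for illustration only.
print(top1_accuracy([1, 0, 2, 2], [1, 1, 2, 0]))
print(exact_match_and_f1("the Eiffel Tower", "Eiffel Tower"))
```

Data-set-level EM and F1 are obtained by averaging these per-answer scores over all questions.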
GluonCV/NLP provide modular APIs and the model zoo to allow users to rapidly try out new ideas or develop downstream applications in computer vision and natural language processing. GluonCV/NLP are in active development, and our future work includes further enriching the API and the model zoo and supporting deployment in more scenarios.
We would like to thank all the contributors to GluonCV and GluonNLP (the git log command can be used to list all the contributors). Specifically, we thank Xiaoting He, Heewon Jeon, Kangjian Wu, and Luyu Xia for providing part of the results in Table 1. We would also like to thank the entire MXNet community for their foundational contributions.
- Abadi et al. (2016) Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation, pages 265–283, 2016.
- Bastien et al. (2012) Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian Goodfellow, Arnaud Bergeron, Nicolas Bouchard, David Warde-Farley, and Yoshua Bengio. Theano: new features and speed improvements. arXiv preprint arXiv:1211.5590, 2012.
- Chen et al. (2015) Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- Girshick (2015) Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- He et al. (2017) Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
- Howard et al. (2017) Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
- Jia et al. (2014) Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
- Kim (2014) Yoon Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.
- Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
- Seide and Agarwal (2016) Frank Seide and Amit Agarwal. Cntk: Microsoft’s open-source deep-learning toolkit. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 2135–2135. ACM, 2016.
- Tokui et al. (2015) Seiya Tokui, Kenta Oono, Shohei Hido, and Justin Clayton. Chainer: a next-generation open source framework for deep learning. In Proceedings of workshop on machine learning systems (LearningSys) in the twenty-ninth annual conference on neural information processing systems (NIPS), volume 5, pages 1–6, 2015.
- Xiao et al. (2018) Bin Xiao, Haiping Wu, and Yichen Wei. Simple baselines for human pose estimation and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), pages 466–481, 2018.
- Zhang et al. (2019) Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander J. Smola. Dive into Deep Learning. 2019. http://www.d2l.ai.