Challenges and Pitfalls of Reproducing Machine Learning Artifacts

04/29/2019 · by Cheng Li, et al.

An increasingly complex and diverse collection of Machine Learning (ML) models as well as hardware/software stacks, collectively referred to as "ML artifacts", are being proposed, leading to a diverse landscape of ML. These proposed ML innovations have outpaced researchers' ability to analyze, study, and adapt them. This is exacerbated by the complicated and sometimes non-reproducible procedures for ML evaluation. The current practice of sharing ML artifacts is through repositories where artifact authors post ad-hoc code and some documentation. The authors often fail to reveal critical information for others to reproduce their results, and one often fails to reproduce artifact authors' claims, let alone adapt the model to one's own use. This article discusses the common challenges and pitfalls of reproducing ML artifacts, which can serve as a guideline for ML researchers when sharing or reproducing artifacts.


I Introduction

An increasingly complex and diverse collection of ML models as well as hardware/software stacks are being proposed, leading to a diverse landscape of ML. Reference [2] shows that the number of ML papers published on arXiv has outpaced Moore's law. These proposed ML innovations have outpaced researchers' ability to analyze, study, and adapt them. This is exacerbated by the complicated and sometimes non-reproducible procedures for ML evaluation.

The current practice of sharing ML artifacts is through repositories such as GitHub, where model authors post ad-hoc code and some documentation. The authors often fail to reveal critical information for others to reproduce their results. Some authors also release a Dockerfile. However, Docker only guarantees the software stack; it does not help model users examine or modify the artifact to adapt it to other environments, nor does it provide a consistent methodology or API to perform the evaluation. In short, one often fails to reproduce artifact authors' claims, let alone adapt the model to one's own use.

This paper shows that reproducibility is an issue for ML evaluation by outlining some common pitfalls model users encounter when attempting to replicate model authors' claims. These pitfalls also inform model authors of the minimal kinds of information they must reveal for others to reproduce their claims. To facilitate the adoption of ML innovations, ML evaluation must be reproducible, and a better way of sharing ML artifacts is needed. We propose a specification of model evaluation and an efficient system that consumes the specification to perform ML evaluation while maintaining reproducibility. Please refer to MLModelScope [1] for more details.

II Factors that Affect Model Evaluation

Many SW/HW configurations must work in unison within an ML workflow to replicate a model author's claims. In the process of developing MLModelScope we identified several pitfalls; below we show how each arises and provide a suggested solution. Within MLModelScope, all of these pitfalls are handled by the platform's design and the model manifest specification.

II-A Hardware

Different hardware architectures can result in varying performance and accuracy, since the ML libraries across architectures could either be different or have different implementations.

Pitfall 1: Looking only at part of the hardware rather than the whole system, e.g. assuming that inference on a Volta GPU must be faster than on a Pascal GPU.

Figure 1: ResNet_v1_50 using TensorFlow 1.13 on Volta and Pascal GPU systems with varying batch sizes

Figure 1 compares inference performance across systems. Volta (V100) is faster than Pascal (P100) in this case. One often assumes this to be always true. However, looking at only GPU or CPU compute sections when comparing performance is a common pitfall. Figure 2 shows a Pascal system can perform better than a Volta system because of a faster CPU-GPU interconnect. One therefore should consider the entire system and its end-to-end latency under different workload scenarios when reporting system performance results.
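As a concrete illustration of why end-to-end latency matters, the sketch below separates the host-to-device copy from the GPU compute. It uses PyTorch and torchvision purely for convenience (the measurements above were taken with TensorFlow and Caffe), and the model and batch size are arbitrary illustration choices.

    # A minimal sketch: compare end-to-end inference latency (input starts in CPU
    # memory and must cross the CPU-GPU interconnect) against GPU-compute-only
    # latency (input already resident on the GPU).
    import time
    import torch
    import torchvision.models as models

    model = models.resnet50(pretrained=True).eval().cuda()
    batch = torch.randn(64, 3, 224, 224)          # still in host memory

    with torch.no_grad():
        torch.cuda.synchronize()
        start = time.perf_counter()
        model(batch.cuda())                       # includes the host-to-device transfer
        torch.cuda.synchronize()
        end_to_end = time.perf_counter() - start

        gpu_batch = batch.cuda()                  # pre-stage the input on the GPU
        torch.cuda.synchronize()
        start = time.perf_counter()
        model(gpu_batch)                          # compute only
        torch.cuda.synchronize()
        compute_only = time.perf_counter() - start

    print(f"end-to-end: {end_to_end*1e3:.1f} ms, compute only: {compute_only*1e3:.1f} ms")

A system with a slower GPU but a faster interconnect can win on the first number while losing on the second, which is exactly the behavior shown in Figure 2.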

Figure 2: POWER8 with Pascal GPU and NVLink vs. x86 with Volta GPU for a "cold-start" inference using Caffe AlexNet at a fixed batch size. The color coding of layers and runtime functions signifies that they have the same kernel implementation, but does not imply that the parameters are the same.

With MLModelScope's profiling capabilities, one can discern why there is a performance difference. Figure 2 shows the layer and GPU kernel breakdown of the model inference on the two systems. We "zoom into" the longest-running layer (FC6), which is the model inference choke point; the performance difference between the two systems comes mainly from this layer. On identifying this, we looked at the Caffe source code and observed that Caffe does lazy copy, meaning the layer weights get copied from CPU to GPU only when they are needed. For FC6, a large amount of weight data (FC6 holds the majority of AlexNet's parameters) needs to be transferred. As the GPU kernel breakdown shows, even though the V100 system is faster in the SGEMM computation, the NVLink between CPU and GPU (faster than PCIe) gives the IBM POWER8 system higher memory-copy bandwidth, so it achieves a speedup for the FC6 layer.

II-B Programming Language

Core ML algorithms within frameworks are written in C/C++ for performance, and in practice low-latency inference uses C/C++. It is common for developers to use NumPy for numerical computation; NumPy arrays store their data in contiguous C buffers rather than as native Python objects. ML frameworks optimize execution for NumPy arrays and avoid memory-copy overhead when interfacing with C/C++ code.

Pitfall 2: Using Python to report bare-metal benchmark results or to deploy latency-sensitive production code.

Figure 3: Execution time (normalized to C/C++) vs. batch size of Inception-v3 inference on CPU and GPU using TensorFlow with C++, Python using NumPy data types, and Python using native lists

While no one claims Python to be as fast as C++, we find that researchers believe the glue code that binds Python to C++ takes negligible time. For example, benchmarks such as MLPerf are implemented in Python and report the latency and throughput of Python code. Figure 3 above shows that the performance difference between Python and C++ in model evaluation is not negligible, and one should use C++ for latency-sensitive production code or when reporting bare-metal benchmark results.
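The gap is easy to see even without a framework: much of the Python-side cost is in marshaling native Python objects into the contiguous buffers the C/C++ core expects. The sketch below, which uses only NumPy and an arbitrary batch shape, times that conversion.

    # Converting a nested Python list to a float32 NumPy buffer copies and type-checks
    # every element; passing an existing NumPy array of the right dtype is nearly free.
    import time
    import numpy as np

    batch_as_list = np.random.rand(64, 224, 224, 3).tolist()      # native Python lists
    batch_as_array = np.asarray(batch_as_list, dtype=np.float32)  # already a NumPy array

    start = time.perf_counter()
    np.asarray(batch_as_list, dtype=np.float32)
    list_time = time.perf_counter() - start

    start = time.perf_counter()
    np.asarray(batch_as_array, dtype=np.float32)                  # no copy needed
    array_time = time.perf_counter() - start

    print(f"from Python lists: {list_time*1e3:.1f} ms, from NumPy array: {array_time*1e3:.3f} ms")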

Figure 4: Inception v3 with RGB or BGR color mode

II-C Pre/Post-Processing

Pre-processing transforms the user input into a form that can be consumed by the model. Post-processing transforms the model output into a form that can be evaluated with metrics or visualized. The processing parameters, methods, and their order affect accuracy and performance.

Pitfall 4: Model authors typically fail to reveal some pre/post-processing details that are needed to reproduce their experiments.

Among all the factors that affect model evaluation accuracy, pre/post-processing is the one that can result in big differences. The input dimensions of a model are usually reported by the model author, since without the right input dimensions the model evaluation does not run and gives an error. Even if the input dimensions are not explicitly given, model users can inspect the model architecture to figure them out. However, there is some critical pre/post-processing information that, if not explicitly reported by the model authors, can easily lead model users into an incorrect evaluation setup and "silent errors" in accuracy: the evaluation runs but the prediction results for some cases are incorrect. These "silent errors" are difficult to debug. This section takes computer vision models as an example and discusses what model users might struggle with when reproducing others' results.

II-C1 Color Mode

Models are trained with decoded images in either RGB or BGR color mode. For legacy reasons, OpenCV decodes images in BGR mode by default, and consequently both Caffe and Caffe2 use BGR. Other frameworks such as TensorFlow, PyTorch, and MXNet use RGB mode. Figure 4 shows the Inception v3 inference results for the same image using different color modes, with everything else being the same.
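A minimal sketch of the pitfall and its fix is shown below; the image path is a placeholder, and OpenCV and PIL are used because they are the decoders most commonly mixed together.

    # OpenCV decodes to BGR; most TensorFlow/PyTorch/MXNet models expect RGB.
    # Feeding BGR to an RGB-trained model runs without error but silently hurts accuracy.
    import cv2
    import numpy as np
    from PIL import Image

    bgr = cv2.imread("input.jpg")                        # BGR channel order
    rgb_from_cv = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)   # explicit conversion to RGB
    rgb_from_pil = np.array(Image.open("input.jpg").convert("RGB"))  # already RGB

    assert rgb_from_cv.shape == rgb_from_pil.shape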

Figure 5: Inception v3 with NHWC and NCHW layouts
Figure 6: PIL vs OpenCV Implementation

II-C2 Data Layout

The data layout for a two-dimensional image (to be fed into the model as tensors) is represented by four letters:

  • N: Batch size, the number of inputs processed together by the model

  • C: Channels, e.g. the color channels for computer vision models

  • W: Width, number of pixels in horizontal dimension

  • H: Height, number of pixels in vertical dimension

Models are trained with input in either NCHW or NHWC layout. Figure 5 shows the inference results of TensorFlow Inception v3 using different layouts for the same input image. The model was trained with the NHWC layout (TensorFlow's default). As can be seen, the predictions are very different.
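Converting between the two layouts is a one-line transpose, sketched below with NumPy; the shapes are illustrative.

    # NHWC <-> NCHW conversion. The wrong layout often still "runs" (the tensor
    # sizes can match), but the channel and spatial axes are scrambled.
    import numpy as np

    nhwc = np.random.rand(1, 224, 224, 3).astype(np.float32)   # N, H, W, C
    nchw = np.transpose(nhwc, (0, 3, 1, 2))                    # N, C, H, W

    print(nhwc.shape, nchw.shape)   # (1, 224, 224, 3) (1, 3, 224, 224)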

Figure 7: PIL vs OpenCV Difference
Figure 8: Difference in the pre-processed images
Figure 9: Difference in the prediction results using TensorFlow Inception v3
Figure 10: Data format table
Figure 11: AlexNet performance difference across frameworks on Volta

II-C3 Image Decoding

It is typical for authors to use JPEG as the image data format (ImageNet, for example, is stored as JPEG images). There are different decoding methods for JPEG: one usually uses OpenCV's imread, PIL.Image.open, or tf.image.decode_jpeg to decode a JPEG image. TensorFlow uses libJPEG with either INTEGER_FAST or INTEGER_ACCURATE as the default (this varies across systems); PIL maps to the INTEGER_ACCURATE method, while OpenCV may not use libJPEG at all.

Even for the same method, ML libraries may have different implementations. For example, JPEG is stored on disk in YCrCb format, and the standard does not require bit-by-bit decoding accuracy, so the decoding behavior differs across libraries, as shown in Figure 6. Figure 7 shows the difference between decoding an image using PIL and OpenCV. We find that edge pixels (those with high or low intensity) are not decoded consistently across libraries, even though these are the more interesting pixels for vision algorithms such as object detection.
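One can quantify the decoder disagreement directly, as in the sketch below; the image path is a placeholder and only PIL and OpenCV are compared.

    # Decode the same JPEG with PIL and OpenCV and measure how much they disagree.
    import cv2
    import numpy as np
    from PIL import Image

    pil_img = np.array(Image.open("input.jpg").convert("RGB"))
    cv_img = cv2.cvtColor(cv2.imread("input.jpg"), cv2.COLOR_BGR2RGB)

    diff = np.abs(pil_img.astype(np.int16) - cv_img.astype(np.int16))
    print("max per-pixel difference:", diff.max())
    print("fraction of pixels that differ:", float((diff > 0).any(axis=-1).mean()))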

II-C4 Type Conversion and Normalization

After decoding, the image data is in bytes and is converted to FP32 (assuming an FP32 model) before being fed to the model. The image data also needs to be normalized to zero mean and unit variance, i.e. (pixel - mean) / stddev. Mathematically, float-to-byte conversion is byte = 255 × float and byte-to-float conversion is float = byte / 255; because of programming-language semantics (integer truncation and rounding), the executed behavior of these conversions is not an exact mathematical inverse.

The order of type conversion and normalization matters. Figure 8 shows image processing using different orders with meanByte = 128 and meanFloat = 0.5: (a) is the original image; (b) is the result of reading the image in bytes and normalizing with the byte mean, i.e. subtracting meanByte before converting to float; (c) is the result of converting the image to floats and then normalizing with the float mean, i.e. dividing by 255 and subtracting meanFloat; and (d) is the difference between (b) and (c). Figure 9 shows the prediction results for (b) and (c) using TensorFlow Inception v3, which are very different.
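The sketch below reconstructs the two orders with the stated meanByte = 128 and meanFloat = 0.5; the exact scaling used for the paper's figure may differ, but the mismatch it illustrates is the same.

    # (b): subtract the byte mean first, then convert to float.
    # (c): convert to float first (divide by 255), then subtract the float mean.
    # They differ because 128/255 != 0.5 and because byte arithmetic rounds/wraps.
    import numpy as np

    pixels = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)

    # Cast to int16 before subtracting: naive uint8 subtraction would wrap around,
    # another "language semantics" surprise.
    b = (pixels.astype(np.int16) - 128).astype(np.float32) / 255.0
    c = pixels.astype(np.float32) / 255.0 - 0.5

    print("max |b - c|:", np.abs(b - c).max())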

II-D Model and Data Formats

There are a variety of formats used by ML frameworks to store models and data on disk. Some frameworks define models as Protocol Buffers while others use custom formats. Different model formats can also be used for inference and for training. Figure 10 shows the model format used for inference by different frameworks. Some data formats, such as TensorFlow's TFRecord or MXNet's RecordIO, are optimized for static datasets; one blog post reports a 7x speedup from using TFRecord together with the TF Dataset Iterator API.
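For reference, a minimal tf.data pipeline over a TFRecord file looks like the sketch below (TF 1.13-era API); the file name and feature keys are placeholders, not the schema of any particular benchmark.

    # Read serialized tf.Example records from a TFRecord file, decode, batch, prefetch.
    import tensorflow as tf

    def parse_fn(serialized):
        features = {"image": tf.io.FixedLenFeature([], tf.string),
                    "label": tf.io.FixedLenFeature([], tf.int64)}
        example = tf.io.parse_single_example(serialized, features)
        image = tf.image.decode_jpeg(example["image"], channels=3)
        return image, example["label"]

    dataset = (tf.data.TFRecordDataset(["train.tfrecord"])
               .map(parse_fn, num_parallel_calls=4)
               .batch(32)
               .prefetch(1))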

Figure 12: Digging Deep into AlexNet Performance
Figure 13: Caffe install options
Figure 14: Caffe compare plot
Figure 15: TensorFlow with varying NUM_THREADS

Pitfall 3: Using an inappropriate format for inference.

II-E Software Stack

The major software components affecting reproducibility are the ML framework (TensorFlow, MXNet, PyTorch, etc.) and the underlying libraries (MKL-DNN, OpenBLAS, cuDNN, etc.). Both impact not only the performance but also the accuracy of the model.

Pitfall 5a: Assuming that if frameworks A and B use the same cuDNN and other ML libraries, they give the same performance and accuracy for the same model.

Figure 12 shows AlexNet performance across different frameworks. All the frameworks are compiled with GCC 5.5 and use the same software stack (cuDNN and other libraries). We can dig deeper into the inference process to identify the bottlenecks and overheads of each framework.

The above figure shows that ML layers across frameworks have different implementations or dispatch to different library functions. Take the conv2 layer and the following relu layer for example: in TensorRT, these two layers are merged together and mapped to two trt_volta_scudnn_128x128_relu_small_nn_v1 kernels, while in the other three frameworks the two layers are not merged. Also, conv2 in MXNet is executed very differently from the other frameworks.

II-E1 Framework Installation

Researchers usually have the choice of installing an ML framework from source or from a pre-built binary. Even though installing from a binary is much easier, binary releases of a framework may not use the CPU's vectorization instructions (e.g. AVX, AVX2). For best performance, one should install frameworks from source. For example, TensorFlow 1.13 built with vectorization enabled is about 40% faster than the pre-built binary.

II-E2 Framework Compilation

Compilation options for the framework and its underlying libraries matter. For example, we compile Caffe using GCC 5.5 with (1) the default compiler flags, referred to as Caffe-Default, and (2) the default flags plus environment variables that restrict it to a single thread and disable SIMD, referred to as Caffe-Single-Threaded-No-SIMD; the exact options are listed in Figure 13.

We then run SphereFace-20 on an Intel NUC system with both Caffe installations. As shown in Figure 14, Caffe-Default is almost 2x faster than Caffe-Single-Threaded-No-SIMD due to multi-threading and vectorization.
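The single-threaded behavior can be approximated from Python as well, since OpenBLAS and OpenMP read their thread counts from the environment at load time. The variable names below are the standard OpenBLAS/OpenMP knobs, not necessarily the exact options used for the Caffe configurations above.

    # Must be set before NumPy (or the framework) is imported, because the BLAS
    # library reads these variables when it is loaded.
    import os
    os.environ["OMP_NUM_THREADS"] = "1"
    os.environ["OPENBLAS_NUM_THREADS"] = "1"

    import time
    import numpy as np

    a = np.random.rand(4096, 4096).astype(np.float32)
    start = time.perf_counter()
    np.dot(a, a)                     # BLAS SGEMM; throughput scales with the thread count
    print(f"SGEMM time: {time.perf_counter() - start:.2f} s")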

II-F Hardware Configuration

Hardware configurations such as CPU frequency scaling, simultaneous multi-threading, and vectorization affect model evaluation performance.

Pitfall 6: Always using the defaults without tuning the system for performance.

Modern CPUs support simultaneous multi-threading (also known as SMT or Hyper-Threading), which allows multiple threads to run on the same core under the assumption that each thread will not fully utilize the ALUs. As a study, we vary the number of threads used by the framework through its NUM_THREADS-style configuration; for TensorFlow, this corresponds to the intra_op_parallelism_threads and inter_op_parallelism_threads settings. The default in TensorFlow is the number of logical CPU cores, which is effective for systems ranging from 4 to 70+ combined logical cores. Figure 15 shows Inception-v3 performance with different NUM_THREADS values. As can be seen, on the system used, which has 16 logical cores and 2-way SMT, performance varies with NUM_THREADS and the best result is achieved with 16. This may not be the case for other systems or other workloads.
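A TF 1.x sketch of such a sweep is shown below; the thread values are illustrative and the actual benchmark body is omitted.

    # Build sessions with different intra/inter-op thread settings and benchmark each.
    import tensorflow as tf

    for num_threads in [1, 4, 8, 16, 32]:
        config = tf.ConfigProto(intra_op_parallelism_threads=num_threads,
                                inter_op_parallelism_threads=num_threads)
        with tf.Session(config=config) as sess:
            pass   # ... run the Inception-v3 benchmark here and record the latency ...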

III Conclusion

This article showed some of the common pitfalls one can encounter when trying to reproduce model evaluations. More information on how we address these pitfalls can be found in our IJCAI'19 submission.

References