I-A Motivation and Related Work
Recent scientific advancements have led to a general acceptance of various classes of deep learning architectures as state of the art in machine learning, e.g. Convolutional Neural Networks (CNN). Research previously focused on image classification with a large number of classes and is currently shifting towards object detection with approaches like R-CNN, where region proposals are first extracted from the image and afterwards computed and classified utilizing CNN and SVM, respectively.
CNN are thus a core component of a wide range of computer vision algorithms and are computationally expensive, usually accelerated by GPUs or FPGAs. However, recent advancements in mobile autonomous robotics as well as the Internet of Things (IoT) have opened a wide area of highly promising applications for this kind of algorithm, in which GPUs are not available and optimization of algorithmic performance becomes essential to save energy. At the same time, those practical applications often come with the consequence that a large number of classes is not required and thus small CNN architectures are sufficient.
The first goal of this work is therefore to speed up the inference of small pretrained networks. In general the motivation for this is twofold. First, the usual meaning is a high throughput given a large set of images to be classified. Overhead due to e.g. initialization is then negligible. Second, and more important in this scope of application, is the reduction of execution time, which correlates with latency as well as energy consumption. In mobile robotic applications latency is important for near-real-time reaction to sudden changes in the environment. In this case the latency should be as low as possible, and the computational time also correlates with energy savings. While energy consumption is also a factor in mobile robotic applications, it is far more important in IoT applications, where it single-handedly defines the lifetime of such a device. The set of images that must be classified at one time is rather small and thus throughput is a suboptimal criterion for performance in this case.
The second aspect is target platform deployability. Typically the network is embedded in a framework that provides images and processes the results. Embedding machine learning frameworks like TensorFlow or Caffe requires much overhead for an inference of a pretrained network. As a consequence, tools for generating object code for inference were developed for this kind of framework, where TensorFlow XLA (https://www.tensorflow.org/xla/) and Glow (https://facebook.ai/developers/tools/glow) are currently state of the art. But their applicability to generic target platforms is limited. TensorFlow XLA generates object code that depends on TensorFlow code, limiting the cross compilation capabilities for target platforms, whereas Glow’s capabilities are currently limited to x86-64 and ARM64. It does not offer switches for other platforms, e.g. 32 bit targets, out of the box.
In this paper we propose a neural network code generator (NNCG) that generates C code from a trained CNN model. It focuses on the two relevant goals motivated previously:
Generic scope of applicability and cross compilation for various target platforms
Generation of fast executables allowing CNN inference on resource constrained systems (small robots, embedded microcontrollers etc.) on a CPU only
In contrast to common approaches which compile library code (e.g. Eigen) for operations like matrix multiplication and generate object code, we propose to generate plain ANSI C code. Since specialized code for each atomic operation (e.g. multiplication or addition) is generated and weights are included as constants, no libraries or prewritten code are needed except math.h and libmath.
If desired, target architecture dependent enhancements (e.g. SIMD instructions like SSE) can be utilized as well. As a result, the code can easily be compiled using a cross compiler or natively compiled on any target platform.
Utilizing a math library and compiling to object code relies on a good optimization performance of the compiler as well as on the efficiency of the library. However, both library and optimizer are developed for any generic mathematical case. Instead, we exploit our knowledge about CNN in general and about the specific trained model in particular to generate highly specialized code. Additionally, we intentionally choose C as output to fully benefit from the optimization capabilities of the compiler by generating code that is easy to optimize.
We identified four design principles to achieve those ideas, which we will discuss in detail in the following Sec. II:
Loop unrolling and caching
Conditional moves instead of branching
Constants wherever possible
Identification of applicable data structures for SIMD instructions
Usually the compiler should be able to handle most of these topics by itself. However, as the compiler has no background information this frequently fails in the field. It has to be noted that these design principles limit the application of NNCG to small networks, as loop unrolling and floats written in ASCII text lead to large code files. E.g. MobileNetV2 would require approx. 4 MB only for printing all weights in ASCII, which leads to C files that are difficult for a compiler to handle.
In the evaluation in Sec. III we will show the advantage of addressing these points in NNCG based on small CNN adequate for our purpose. We compare the performance of NNCG with both above mentioned tools (TF XLA and Glow), wherever possible on a PC as well as a mobile robotic platform. We are able to show speed-ups of factor 1.41 up to 11.81 depending on network size and platform. We also compare the latency of a system with and without GPU. We can show that with small networks and a small number of images to classify the latency of our executable is many times smaller.
The following Sec. II describes the conceptual details of the NNCG and its implementation. Afterwards, in Sec. III, the results of NNCG are compared to the current state of the art on various target platforms. This evaluation will focus on mobile robotics as an application area, since in general it offers a larger variety of platforms with different computational performance levels. The final Sec. IV concludes the paper and provides an outlook to future research.
II Neural Network Code Generator (NNCG)
In this section we first describe the design principles (see Sec. I) in detail and then describe how the CNN layers are realized in accordance with these principles.
The basic concept is the generation of C code from a trained Keras (https://keras.io) model during an exemplary classification of an image. We reimplemented various CNN layers (all layers required for a custom YOLO net) with a focus on simplicity.
During the calculation of each layer C code is written for all atomic operations, e.g. multiplication, addition, max operator etc. including the involved values as constants.
II-A Design Principles
II-A1 Loop unrolling and caching
In general, a loop consists of code for checking if a condition is met to continue executing the loop and a branch that repeats the loop. This has (mainly) two disadvantages: (1) Code for condition checking and branching and (2) negative effects on the pipeline of the processor resulting in a pipeline filled with wrong instructions if the processor cannot predict the condition correctly.
To mitigate this, a compiler can unroll loops, meaning the body of the loop is emitted multiple times and the condition check and branch are thus executed less often. However, for this to work efficiently the number of loop iterations must be known, or further code is required to meet the exact number of iterations.
On the other hand, unrolling results in more instructions that must be loaded from RAM, which also affects the efficiency of the CPU cache. If all loops are unrolled completely, all instructions are only executed once and thus caching cannot increase execution speed.
Thus, we organize loop unrolling in different levels so that it can be chosen depending on the cache architecture of the target platform and the structure of the CNN. At level 0 all loops are unrolled. Level 1 does not unroll the outermost loop and so forth.
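To illustrate the unrolling levels, the following minimal sketch computes the same weighted sum three times: with both loops kept, with only the outermost loop kept (level 1) and with all loops unrolled (level 0). The function names, the 2x3 shape and the weight values are hypothetical and chosen only for illustration; they are not taken from code emitted by NNCG.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 2x3 weight matrix, stored row by row. */
static const float W[6] = {0.5f, -1.0f, 2.0f, 1.5f, 0.25f, -0.5f};

/* No unrolling: both loops, with their condition checks and branches, remain. */
float weighted_sum_rolled(const float *x)
{
    float acc = 0.0f;
    assert(x != NULL);
    for (size_t i = 0; i < 2; ++i) {
        for (size_t j = 0; j < 3; ++j) {
            acc += W[i * 3 + j] * x[i * 3 + j];
        }
    }
    return acc;
}

/* Level 1: the outermost loop is kept, the inner loop is unrolled. */
float weighted_sum_level1(const float *x)
{
    float acc = 0.0f;
    assert(x != NULL);
    for (size_t i = 0; i < 2; ++i) {
        acc += W[i * 3 + 0] * x[i * 3 + 0];
        acc += W[i * 3 + 1] * x[i * 3 + 1];
        acc += W[i * 3 + 2] * x[i * 3 + 2];
    }
    return acc;
}

/* Level 0: all loops are unrolled, no loop condition or branch is left. */
float weighted_sum_level0(const float *x)
{
    float acc = 0.0f;
    assert(x != NULL);
    acc += W[0] * x[0];
    acc += W[1] * x[1];
    acc += W[2] * x[2];
    acc += W[3] * x[3];
    acc += W[4] * x[4];
    acc += W[5] * x[5];
    return acc;
}
```

All three variants perform the same additions in the same order; they differ only in the amount of loop control code and in code size, which is the trade-off against the instruction cache discussed above.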
II-A2 Conditional Moves
A typical operation is to copy a value into a register under some condition. In higher programming languages this is usually realized by a conditionally executed code block containing a copy. The block is skipped if the condition is not met, which again can result in the pipeline being cleared.
Thus, common processors implement copy instructions that are always executed but only actually copy the data if the condition is met. In the worst case the time for executing this instruction is lost, which is usually less than the time needed to refill the pipeline.
Modern compilers should be able to identify candidates for a conditional move. However, as NNCG knows the semantics it can help by using the ternary operator known in C.
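As a small sketch of this principle, the two hypothetical functions below compute the maximum of an array once with a conditionally executed block and once with the ternary operator. Whether a conditional move (or another branch-free instruction) is actually emitted depends on the compiler, its flags and the target; the ternary form only makes the intent explicit.

```c
#include <assert.h>
#include <stddef.h>

/* Branching form: the copy sits in a conditionally executed block. */
float max_branch(const float *v, size_t n)
{
    assert(v != NULL && n > 0);
    float m = v[0];
    for (size_t i = 1; i < n; ++i) {
        if (v[i] > m) {
            m = v[i];
        }
    }
    return m;
}

/* Ternary form: an assignment is always performed, only the source differs. */
float max_select(const float *v, size_t n)
{
    assert(v != NULL && n > 0);
    float m = v[0];
    for (size_t i = 1; i < n; ++i) {
        m = (v[i] > m) ? v[i] : m;
    }
    return m;
}
```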
II-A3 Constants
In common frameworks a CNN model is loaded into RAM during run-time and weights are passed to the calculation. The inference then must access these arrays using some addressing scheme. This may lead to unnecessary overhead as we can write the known constants into the corresponding line.
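The difference can be sketched as follows. The first function mimics the framework style, where weights arrive through a pointer at run time; the second mimics the generated style, where the same (hypothetical) weights 0.25, -0.5 and 1.0 are written directly into the code line. Both function names are assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Framework style: weights live in an array that is only known at run time. */
float dot_runtime(const float *w, const float *x, size_t n)
{
    float acc = 0.0f;
    assert(w != NULL && x != NULL);
    for (size_t i = 0; i < n; ++i) {
        acc += w[i] * x[i];
    }
    return acc;
}

/* Generated style: the weights appear as literals, no weight array is addressed. */
float dot_constants(const float *x)
{
    assert(x != NULL);
    return 0.25f * x[0] + -0.5f * x[1] + 1.0f * x[2];
}
```

With literals the compiler sees every weight value and can, for instance, drop multiplications by 1.0 or fold the constants into the instruction stream.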
II-A4 SIMD Instructions
Single Instruction Multiple Data (SIMD) instructions perform the same operation on multiple values and can thus speed up the inference significantly. Modern compilers support these instructions but must be able to identify possible parallel calculations. To do so, the structure of the network must be known at compile time.
During code generation the structure of the calculations (matrix multiplications etc.) and the dimensions of vectors and matrices are known. Thus, parallel structures can be identified and SIMD instructions generated.
However, SIMD instructions are platform dependent. Currently we support Intel’s SSSE3 and a general architecture without platform dependent code. Other platform specific optimizations, such as for ARM’s Cortex-M, can be integrated into the code generator as well.
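A minimal sketch of the two code paths is given below: four channel values are updated per iteration when SSE is available, and a plain C loop serves as the general architecture otherwise. The function name `axpy_channels` is hypothetical, and the sketch uses only basic SSE intrinsics from `xmmintrin.h`, which are available on any SSSE3 capable CPU; it is not the code NNCG emits.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Computes y[k] += w[k] * x for n channels; n must be a multiple of 4. */
void axpy_channels(float *y, const float *w, float x, size_t n)
{
    assert(n % 4 == 0);
#if defined(__SSE__)
    /* SIMD path: one instruction handles four channels at a time. */
    const __m128 vx = _mm_set1_ps(x);
    for (size_t k = 0; k < n; k += 4) {
        __m128 vy = _mm_loadu_ps(y + k);
        const __m128 vw = _mm_loadu_ps(w + k);
        vy = _mm_add_ps(vy, _mm_mul_ps(vw, vx));
        _mm_storeu_ps(y + k, vy);
    }
#else
    /* General architecture: plain C without platform dependent code. */
    for (size_t k = 0; k < n; ++k) {
        y[k] += w[k] * x;
    }
#endif
}
```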
We focus our work on layers required to implement a small YOLO net. The following layers are also sufficient for other small networks that are suitable for embedded systems.
Convolutional layers are the most computationally demanding layers and thus a focus of this work. We support zero-padding and strides. Possible activation functions are the softmax function and (leaky) ReLU, which we describe later.
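To support padding we set all values to zero that are out of bounds by defining

$$\hat{x}_{i,j,c} = \begin{cases} x_{i,j,c} & \text{if } 0 \le i < H \text{ and } 0 \le j < W \\ 0 & \text{otherwise} \end{cases} \tag{1}$$

where $x_{i,j,c}$ is the input of the convolution layer as a scalar at $(i,j)$ and channel $c$, $H$ the height and $W$ the width. Applying this definition our implementation of the convolution can be written as

$$y_{i,j,k} = \sum_{n=0}^{K_h-1} \sum_{m=0}^{K_w-1} \sum_{c=0}^{C_{in}-1} w_{n,m,c,k} \, \hat{x}_{i s_h + n - p_h,\; j s_w + m - p_w,\; c} \tag{2}$$

with $y_{i,j,k}$ as the output at $(i,j)$ in channel $k$, $K_h$, $K_w$ the height and width of the kernel, $C_{in}$ and $C_{out}$ the number of input/output channels, $w_{n,m,c,k}$ the kernel weight at $(n,m)$ for output channel $k$ and input channel $c$, $p_h$, $p_w$ the height and width of the padding, $s_h$, $s_w$ the step height and width and $H$, $W$ the dimensions of the input.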
We see in Eq. 2 that the calculation of a single value requires three nested loops. Furthermore, to calculate all output values three additional nested loops are required. The implementation of the first design principle is thus a trade-off between loop unrolling and code size. For the reasons explained in Sec. II-A1, unrolling all loops completely is only adequate for small networks and thus we follow a configurable approach. Currently, we support unrolling all loops, with possible exceptions for the first and second outer loop, and no unrolling.
To further specialize our code for different channel and spatial dimensions, we created multiple code versions of the convolution with different trade-offs between cache utilization and register pressure. For each layer we independently benchmark every code version and select the one with the best runtime performance.
Implementation of design principle 3 depends on unrolling. If no loop is unrolled we generate an array containing all weights as constants. If loops are unrolled, the constants can be written into the corresponding code line.
For design principle 4 we identified the output channels (the loop over the output channel index in Eq. 2) as a proper dimension for SIMD instructions. As can be seen, this loop does not affect the three inner loops, so SIMD is simple to apply along it. For SSSE3 the number of output channels should be divisible by 4, i.e. the number of filters in convolutional layers should be a multiple of 4.
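For reference, the zero-padded, strided convolution described in this section can be sketched as a plain, non-unrolled C function with three outer loops over the output and three inner loops per output value. The function names, the channel-last memory layout of the input and the (row, column, input channel, output channel) layout of the weights are assumptions of this sketch, not a description of NNCG's generated code.

```c
#include <assert.h>
#include <stddef.h>

/* Zero-padded read: returns 0 for positions outside the H x W input. */
static float padded_at(const float *x, int i, int j, int c,
                       int H, int W, int C_in)
{
    return (i >= 0 && i < H && j >= 0 && j < W)
               ? x[((size_t)i * (size_t)W + (size_t)j) * (size_t)C_in + (size_t)c]
               : 0.0f;
}

/* Output size along one dimension for the given kernel, padding and stride. */
int conv_out_dim(int in, int kernel, int pad, int stride)
{
    assert(stride > 0);
    return (in + 2 * pad - kernel) / stride + 1;
}

/* Assumed layouts: x[i][j][c], w[n][m][c][k], y[i][j][k], all stored flat. */
void conv2d(const float *x, int H, int W, int C_in,
            const float *w, int K_h, int K_w, int C_out,
            int p_h, int p_w, int s_h, int s_w, float *y)
{
    const int H_out = conv_out_dim(H, K_h, p_h, s_h);
    const int W_out = conv_out_dim(W, K_w, p_w, s_w);
    for (int i = 0; i < H_out; ++i) {
        for (int j = 0; j < W_out; ++j) {
            for (int k = 0; k < C_out; ++k) {
                float acc = 0.0f;
                /* Three inner loops: kernel rows, kernel columns, input channels. */
                for (int n = 0; n < K_h; ++n) {
                    for (int m = 0; m < K_w; ++m) {
                        for (int c = 0; c < C_in; ++c) {
                            const float wv = w[((n * K_w + m) * C_in + c) * C_out + k];
                            acc += wv * padded_at(x, i * s_h + n - p_h,
                                                  j * s_w + m - p_w, c, H, W, C_in);
                        }
                    }
                }
                y[(i * W_out + j) * C_out + k] = acc;
            }
        }
    }
}
```

Because the output channel index `k` is the fastest running index of both `w` and `y` in this layout, four consecutive output channels lie next to each other in memory, which is what makes that loop a convenient SIMD dimension.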
The max-pooling layer searches for the maximum of all values in a two-dimensional window,

$$y_{i,j,k} = \max_{0 \le n < P_h,\; 0 \le m < P_w} x_{i s_h + n,\; j s_w + m,\; k}$$

where $x$ and $y$ are the input and output of the layer, $P_h$, $P_w$ the height and width of the window and $s_h$, $s_w$ the step height and width.
This two-dimensional window requires (in a basic form) two nested loops with three additional outer loops over the values of the output feature maps. Comparable to the convolution layer, we support no unrolling and full unrolling with possible exceptions for the outermost and second outermost loop. Furthermore, SIMD instructions are applied over channels if the number of filters in the previous convolution layer is a multiple of 4 (SSSE3).
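A minimal, non-unrolled sketch of such a max-pooling layer is shown below, with three outer loops over the output and two inner loops over the window. The function name, the channel-last layout and the absence of padding are assumptions of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout: x[i][j][k] and y[i][j][k], stored flat, no padding. */
void max_pool2d(const float *x, int H, int W, int C,
                int P_h, int P_w, int s_h, int s_w, float *y)
{
    assert(P_h <= H && P_w <= W && s_h > 0 && s_w > 0);
    const int H_out = (H - P_h) / s_h + 1;
    const int W_out = (W - P_w) / s_w + 1;
    for (int i = 0; i < H_out; ++i) {
        for (int j = 0; j < W_out; ++j) {
            for (int k = 0; k < C; ++k) {
                /* Start with the top-left value of the window. */
                float best = x[((i * s_h) * W + (j * s_w)) * C + k];
                for (int n = 0; n < P_h; ++n) {
                    for (int m = 0; m < P_w; ++m) {
                        const float v = x[((i * s_h + n) * W + (j * s_w + m)) * C + k];
                        /* Ternary instead of an if block, see the conditional move principle. */
                        best = (v > best) ? v : best;
                    }
                }
                y[(i * W_out + j) * C + k] = best;
            }
        }
    }
}
```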
II-B3 (Leaky) ReLU
The ReLU layer consists of only three nested loops. We apply the same rules for unrolling as for max-pooling. The implementation of ReLU for an input value $x_{i,j,k}$ at $(i,j)$ in channel $k$ is simply,

$$y_{i,j,k} = \max(0, x_{i,j,k}).$$
A leaky ReLU layer can mathematically be described as:

$$y_{i,j,k} = \begin{cases} x_{i,j,k} & \text{if } x_{i,j,k} > 0 \\ \alpha \, x_{i,j,k} & \text{otherwise} \end{cases}$$

where $\alpha$ is a factor realizing the "leaky" feature. The implementation for SSSE3 is also a max function with additional code for $\alpha$. For a general architecture we utilize the conditional operator of the C language to implement the second design principle, supporting the compiler in utilizing conditional moves.
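The activation functions can be sketched in C as follows. The function names are hypothetical. The last variant expresses the leaky ReLU as a maximum of the input and the scaled input, which is only equivalent for a factor between 0 and 1; it is meant as a scalar illustration of the max-based formulation, not as the SSSE3 code itself.

```c
#include <assert.h>
#include <stddef.h>

/* ReLU via the conditional operator. */
void relu(float *x, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        x[i] = (x[i] > 0.0f) ? x[i] : 0.0f;
    }
}

/* Leaky ReLU via the conditional operator (general architecture). */
void leaky_relu(float *x, size_t n, float alpha)
{
    for (size_t i = 0; i < n; ++i) {
        x[i] = (x[i] > 0.0f) ? x[i] : alpha * x[i];
    }
}

/* Leaky ReLU as max(x, alpha * x); valid only for 0 <= alpha <= 1. */
void leaky_relu_max(float *x, size_t n, float alpha)
{
    assert(alpha >= 0.0f && alpha <= 1.0f);
    for (size_t i = 0; i < n; ++i) {
        const float scaled = alpha * x[i];
        x[i] = (x[i] > scaled) ? x[i] : scaled;
    }
}
```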
II-B4 Batch Normalization
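Batch Normalization was introduced to improve the performance of CNNs, as well as to stabilize the training process. The layer consists of a learnable affine transformation of the input feature map,

$$y_{i,j,k} = \gamma_k \frac{x_{i,j,k} - \mu_k}{\sqrt{\sigma_k^2 + \epsilon}} + \beta_k,$$

where $\mu_k$ and $\sigma_k^2$ are the mean and variance stored for channel $k$, $\gamma_k$ and $\beta_k$ the learned scale and offset and $\epsilon$ a small constant for numerical stability.
The calculation can be incorporated into a preceding convolutional layer by modifying the weights and bias,

$$w'_{n,m,c,k} = \frac{\gamma_k}{\sqrt{\sigma_k^2 + \epsilon}} \, w_{n,m,c,k}, \qquad b'_k = \frac{\gamma_k}{\sqrt{\sigma_k^2 + \epsilon}} \, (b_k - \mu_k) + \beta_k,$$

where $w_{n,m,c,k}$ and $b_k$ are the kernel weights and the bias of the convolution for output channel $k$.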
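Folding a batch normalization into the preceding convolution can be sketched as below. The function name and the parameter names are hypothetical. To keep the sketch free of library calls, the caller is assumed to pass `sigma[k]`, the already computed square root of variance plus stability constant for channel `k`; the weights are assumed to be stored with the output channel as the fastest running index.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Scales the convolution weights and adjusts the bias in place so that
 * convolution followed by batch normalization becomes a single convolution.
 * n_per_channel is the number of weights belonging to one output channel.
 */
void fold_batch_norm(float *w, float *b, size_t n_per_channel, size_t C_out,
                     const float *gamma, const float *beta,
                     const float *mean, const float *sigma)
{
    for (size_t k = 0; k < C_out; ++k) {
        assert(sigma[k] > 0.0f);
        const float scale = gamma[k] / sigma[k];
        for (size_t t = 0; t < n_per_channel; ++t) {
            w[t * C_out + k] *= scale;
        }
        b[k] = (b[k] - mean[k]) * scale + beta[k];
    }
}
```

Since all batch normalization parameters are fixed after training, this step can be done once at generation time, so the normalization costs nothing at inference time.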
This section evaluates the main goals of this work as mentioned in Sec. I-B: simple deployment and fast executables. We compare NNCG with both tools mentioned in the introduction that have comparable intentions: TensorFlow XLA and Glow in versions available in December 2018, 1.12 and c27b61c respectively.
Common robotic platforms are based on CPUs at different performance levels. As an example for cognitive mobile robotic applications the RoboCup Standard Platform League has been chosen. Its robot Nao by SoftBank Robotics (https://www.softbankrobotics.com/emea/en/nao) is a typical example of a small and cheap mobile robot, which intentionally lacks a GPU to save energy. But if energy consumption, heat dissipation and cost are relatively negligible, a GPU can also be integrated. We thus include various target platforms in our evaluation: a desktop processor Intel i7 8650U with Ubuntu 14.04, an energy efficient platform Intel Atom J1900 with Ubuntu 14.04, the Nao V5 by SoftBank Robotics (Intel Atom Z530) with a custom 32 bit Linux and the NVIDIA GPU GTX 1050 in a mobile system.
We evaluate both goals by presenting exemplary scenarios in simple robotic example applications: a ball detector for robot soccer, a pedestrian detector and a robot detector, all inferred on the mentioned target platforms. The CNN utilized for these purposes are described in Tab. I, Tab. II and Tab. III, respectively. Our evaluation is based on custom CNN designed to be small enough to yield C code files of acceptable size, which is also desirable in terms of inference speed on mobile platforms. For example, a MobileNet V2 leads to a 78 MB C code file. We are still able to compile and run this file. However, we suggest smaller networks and thus we evaluate utilizing the networks presented. The CNN structure is chosen such that decent classification results can be achieved and the networks are adequate for a simple application on embedded devices.
To evaluate the first goal of this work, we give a subjective and comparative overview of the simplicity and applicability of the tools to generate an executable of the mentioned pretrained networks (ball and pedestrian detector). Afterwards, the second goal is evaluated by comparing the time required to infer a single image on CPU and GPU using NNCG, TensorFlow XLA and Glow. Besides this we also show how single features of NNCG can lower the latency. We are also interested in how a GPU could perform if no overhead is present. We thus additionally evaluate the throughput of the GPU by applying a large set of images on the GPU and compare this speed per image with the tools on other platforms.
For each scenario we train the CNN presented above utilizing realistic datasets.
For ball detection, the image is first traversed along scanlines and segmented. On the resulting ball segments, multiple scanlines are created to find ball edge points. These in turn are used for circle fitting leading to a ball candidate for the presented CNN which is used for feature extraction and verification. An average of 20 ball candidates is created per image.
The size of the CNN can be very small for multiple reasons. A ball is an object with high contrast (white with black spots) and the appearance is invariant with respect to orientation.
The dataset consists of 455107 images with 125615 balls at a resolution of 16x16, see Fig. 1 for some examples. With 5% of the images for evaluation our trained CNN has an accuracy of 99.975%.
In real world scenarios pedestrian or human detection is an important application. Humans are significantly harder to detect than a ball and we selected this as an example application to compare NNCG utilizing larger CNN, see Tab. II. For training we selected the Daimler pedestrian dataset, which consists of 49000 images with 24000 images of humans at a resolution of 18x36 per image, see Fig. 2 for example images. With 10% of the images for validation we achieve an accuracy of 99.02%.
The ball detection example presented above is similar to the known R-CNN approach. For the robot detector, our third application example, we instead build a pipeline based on the YOLO V2 approach. In this paper we limit our presentation to the CNN utilized in the pipeline as described in Tab. III.
III-B Generic Deployment
In this section we present different application scenarios and utilize NNCG, TensorFlow XLA and Glow to deploy executables including required steps to compile and link for the target platform. We study if the utilized tool is applicable under the circumstances of the scenario and show the simplicity by collecting the steps required for deployment.
Native Compilation for Host Platform
This is the simplest scenario in this evaluation as all tools are able to generate code and compile natively. Additionally, source code and libraries for compilation of the tools are also available. The host is an Ubuntu 18.04.1 LTS 64 Bit.
NNCG generates a C source code file that can be compiled to an object file. There are no dependencies for the robot detection CNN except for SSSE3 intrinsics on Intel platforms. The ball and robot classification additionally depends on libmath caused by exponential functions in Softmax. Thus, any ANSI C compiler should be able to compile the C source file to an object file for a general architecture. Alternatively, it can be included in a project environment (CMake, Visual Studio etc.).
TensorFlow XLA includes the tool tfcompile to generate object files from a trained and stored CNN. It thus includes one more step than NNCG, the compilation of the code utilizing clang.
However, the object file depends on many functions and the Eigen library shipped with TensorFlow. Thus, it is advisable to link this file to an executable within the TensorFlow environment providing all dependencies.
Glow's image-classifier generates an object file utilizing clang as well. Unlike TensorFlow XLA, it does not depend on libraries, making the linking process as easy as with NNCG on this platform. However, as the output is an object file, compilation is limited to platforms supporting clang. Furthermore, Glow does not support all layers required for a CNN based on the YOLO approach, namely leaky ReLU.
Deployment on Atom (J1900) with similar OS
In this scenario the host platform for compilation is the same as above with a different target platform. Two limitations distinguish this scenario. First, the target CPU only supports a limited subset of SIMD instructions compared to the host (SSSE3). Second, Ubuntu is installed in Version 14.04.5 LTS.
The C code file generated by NNCG can be compiled natively on the target platform as it only requires a basic C compiler installation. Alternatively, it can be compiled on the host with static linkage and by specifying the target architecture.
TensorFlow XLA also supports the specification of a target platform. Static linkage is possible including the dependencies on TensorFlow and Eigen. However, a native compilation would require installing TensorFlow on the target platform and is thus not considered here.
Glow's image-classifier does not allow specifying a different target platform. Thus, the generated object file contains AVX instructions, as these are available on the host but not on the target platform, resulting in executables that do not work. Installation of Glow on the target platform was not considered here.
Deployment on Atom (Z530) with different OS
This is the platform of the Nao robot with a preinstalled OS. The CPU is more limited but supports the same SIMD extensions as above. The main difference here is the custom Linux distribution with a 32 bit kernel. It does not provide a compiler, thus native compilation is impossible.
C source generated by NNCG can be cross compiled by specifying a 32 bit target and static linkage. In contrast, the object generated by TensorFlow XLA depends on Eigen source that cannot be compiled for 32 bit targets. Glow suffers the same limitations as above and is not applicable here.
III-C Fast Executables
The previous section demonstrates the applicability for different platforms. In this section we continue the evaluation by measuring the time required for the inference. We measure the time required to classify a single image (ball or pedestrian) and to detect robots in an image. We ran small networks 100,000 times and larger networks like the robot detection 1000 times and report the mean value. For each application example the results can be seen in Tab. IV, V and VI. As described in the previous section, some approaches are not applicable and thus no time measurement is available.
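A mean runtime over many repetitions can be measured along the following lines. This is only a sketch of the general procedure using the standard `clock()` function; the function names are hypothetical, `dummy_inference` merely stands in for a generated inference function, and the actual measurement setup used for the tables is not reproduced here.

```c
#include <assert.h>
#include <time.h>

/* Keeps the compiler from optimizing the stand-in workload away. */
static volatile float sink;

/* Hypothetical stand-in for a generated inference function. */
void dummy_inference(void)
{
    float acc = 0.0f;
    for (int i = 0; i < 64; ++i) {
        acc += 0.5f * (float)i;
    }
    sink = acc;
}

/* Runs the inference `runs` times and returns the mean CPU time in microseconds. */
double mean_runtime_us(void (*infer)(void), long runs)
{
    assert(infer != NULL && runs > 0);
    const clock_t start = clock();
    for (long r = 0; r < runs; ++r) {
        infer();
    }
    const clock_t end = clock();
    return 1e6 * (double)(end - start) / (double)CLOCKS_PER_SEC / (double)runs;
}
```

Averaging over many runs is needed here because a single inference in the microsecond range is far below the resolution of typical timers.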
As can be seen, the speed-up factor of NNCG compared to TensorFlow XLA is between 1.41 and 11.81, compared to Glow 3.29. This also confirms the results of . Additionally, we evaluated the ball and pedestrian CNN on a GeForce GTX 1050 GPU by NVIDIA using an executable by TensorFlow XLA. As can be seen, the overhead to utilize a GPU is tremendous for small CNN and does not change significantly for under 100 images classified at once.
As described in Sec. II-A, two features of NNCG are configurable: the architecture dependent SIMD extensions and loop unrolling. We can therefore evaluate the acceleration due to these features by first using a general architecture without SIMD extensions where both outer loops are not unrolled. We do this for the ball classifier on the i7 platform. The compiler (clang 6.0.0) is nevertheless enabled to use SIMD extensions and perform loop unrolling. However, it can be seen in Tab. VII that the speed-up factor of applying SIMD instructions as described in Sec. II-A4 is 4.9. If NNCG unrolls all loops there is an additional speed-up of 26%. This shows that the compiler is not able to find the optimum automatically.
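Tab. IV (ball classifier)
|Platform||NNCG||Glow||TensorFlow XLA|
|Intel i7 (8650U)||2.10 µs||7.53 µs||24.81 µs|
|Intel Atom (J1900)||17.51 µs||N/A||69.12 µs|
|Intel Atom (Z530)||46.50 µs||N/A||N/A|
Tab. V (pedestrian classifier)
|Platform||NNCG||Glow||TensorFlow XLA|
|Intel i7 (8650U)||135.7 µs||N/A||191.8 µs|
|Intel Atom (J1900)||1020.3 µs||N/A||1757.2 µs|
|Intel Atom (Z530)||2938.6 µs||N/A||N/A|
Tab. VI (robot detector)
|Platform||NNCG||TensorFlow XLA|
|Intel i7 (8650U)||474 µs||2457 µs|
|Intel Atom (J1900)||1109 µs||6797 µs|
Tab. VII (NNCG features, ball classifier on Intel i7)
|General||SSSE3||SSSE3 and Full Unroll|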
IV Conclusion and Outlook
This paper presents a neural network code generator NNCG that writes ANSI C code for a trained CNN. We have shown that embedding this file or a compiled object is a simple task and allows deploying the CNN on all platforms that provide an ANSI C compiler or that can be a target platform of a cross compiler. Additionally, NNCG can exploit that the structure and weights of the CNN are known at generation time, resulting in executables up to 11.81 times faster than previous well-known approaches.
Future work will focus on GPU kernel code and more layer types to support modern widely known CNN structures. Furthermore, currently only SSSE3 is a supported SIMD instruction set. An extension of NNCG to other instruction sets like AVX or NEON can be realized rapidly.
-  Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., Zheng, X.: TensorFlow: Large-scale machine learning on heterogeneous systems (2015). URL https://www.tensorflow.org/. Software available from tensorflow.org
-  Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587 (2014)
-  Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. In: Proceedings of the 22nd ACM International Conference on Multimedia, MM ’14, pp. 675–678. ACM, New York, NY, USA (2014). DOI 10.1145/2647868.2654889. URL http://doi.acm.org/10.1145/2647868.2654889
-  Lai, L., Suda, N., Chandra, V.: CMSIS-NN: Efficient neural network kernels for Arm Cortex-M CPUs. arXiv preprint arXiv:1801.06601 (2018)
-  Munder, S., Gavrila, D.M.: An experimental study on pedestrian classification. IEEE transactions on pattern analysis and machine intelligence 28(11), 1863–1868 (2006)
-  Redmon, J., Farhadi, A.: YOLO9000: better, faster, stronger. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 6517–6525 (2017). DOI 10.1109/CVPR.2017.690. URL https://doi.org/10.1109/CVPR.2017.690
-  Rotem, N., Fix, J., Abdulrasool, S., Deng, S., Dzhabarov, R., Hegeman, J., Levenstein, R., Maher, B., Nadathur, S., Olesen, J., et al.: Glow: Graph lowering compiler techniques for neural networks. arXiv preprint arXiv:1805.00907 (2018)
-  Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115(3), 211–252 (2015). DOI 10.1007/s11263-015-0816-y
-  Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: MobileNetV2: Inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520 (2018)
-  Schwarz, I., Hofmann, M., Urbann, O., Tasse, S.: A robust and calibration-free vision system for humanoid soccer robots. In: Proceedings RoboCup 2015 International Symposium. Hefei, China (2016)