In the past decade, the advance of deep learning (DL) techniques has significantly promoted artificial intelligence (AI). Numerous AI applications, e.g., image processing, object tracking, speech recognition, and natural language processing, have raised urgent requirements to adopt DL. As a result, various libraries and frameworks, such as TensorFlow, Caffe, and CNTK, have been proposed and applied in practice.
However, developing AI applications powered by the popular DL frameworks and libraries is a non-trivial task. Usually, these frameworks and libraries are leveraged by native applications that run on heterogeneous development environments such as Windows, Linux, macOS/iOS, and Android. The applications are developed in various imperative programming languages, e.g., C/C++ on Windows, Objective-C on iOS and macOS, and Java on Android. Developing AI applications that are portable to multiple platforms is indeed not easy. The development is particularly complicated for mobile applications, as app vendors usually need to develop and maintain both iOS and Android versions. In addition, deployment is also non-trivial, as most current platforms come with application stores, some of which require manual testing of submitted applications by the store provider before publication (a process that can take several weeks), and applications can be rejected for seemingly arbitrary reasons.
Frameworks such as WebDNN, Keras.js, and Mind were proposed to support DL in browsers. In early 2018, Google released TensorFlow.js, a significant step toward promoting in-browser DL tasks.
Although the preceding efforts, along with some ongoing ones, seem to make running DL tasks in browsers possible, we so far have very little knowledge of how, where, and how well they actually work. More importantly, considering the long debate over the performance of Web applications compared to that of native applications, the same issue also exists in developing DL-powered applications. Hence, it is urgent to address this knowledge gap in terms of the feasibility and usability of running DL in Web browsers. In this paper, we present the first empirical study of DL in browsers by answering the following research questions.
RQ1: What features do existing frameworks provide to implement various kinds of DL tasks in the browser?
RQ2: How well do existing frameworks perform over different DL tasks?
RQ3: How big is the performance gap between running DL in the browser and on the native platform?
We select 7 popular frameworks that support DL in browsers and conduct a characteristic study of them. To this end, we develop a browser extension to measure the performance as well as the system resource utilization when running different DL tasks. We choose TensorFlow.js and native TensorFlow to compare the performance of DL in browsers with that on native platforms.
The key findings of our study include:
In-browser DL tasks are still at dawn. Most frameworks for DL in browsers support only a specific subset of DL tasks. Among all the frameworks, TensorFlow.js provides the largest number of functionalities to realize various kinds of DL tasks.
Support for in-browser training is not full-fledged. In most frameworks, inference has drawn more attention than training. For training tasks, the width of DL models dominates the performance variation with respect to model complexity. The browser is limited in complex matrix calculation.
Performance variation among frameworks is marginal. Different frameworks exhibit comparable performance when running various DL tasks on the same configuration. The performance difference is within one order of magnitude.
Model loading dominates the computation. For inference tasks, loading and warming up the DL model takes more time than running the inference itself. The CPU backend performs better than the GPU backend when running inference on small models.
Integrated graphics cards do help for in-browser DL tasks. For popular pre-trained models like MobileNet and Inception, TensorFlow.js has comparable performance with native TensorFlow when running inference on the standalone GPU backend, with only a 1x-2x performance difference. TensorFlow.js on the integrated graphics card backend outperforms native TensorFlow on the CPU when running the same inference task.
System resources can be further exploited for in-browser DL tasks. For TensorFlow.js, the CPU is not fully utilized (about 80%) when DL tasks run on the CPU backend. The memory allocated to WebGL is limited by the browser, leading to the crash of some DL tasks. The call stack of TensorFlow.js is much deeper than that of ConvNetJS and WebDNN, pulling down the performance of TensorFlow.js.
The remainder of this paper is organized as follows. Section 2 gives background knowledge of deep learning in browsers. Sections 3 to 5 describe the results, including the analysis of framework functionality, performance measurement, and comparison with native DL frameworks. Section 6 presents the implications and recommendations drawn from the findings. Section 7 surveys related work, and Section 8 concludes the paper with future work.
2 Background and Motivation
In this section, we give some background on deep learning and then discuss how browsers support deep learning tasks.
2.1 Deep Learning
Deep learning is a class of machine learning algorithms that uses a cascade of multiple layers of nonlinear processing units for feature extraction and transformation. Each successive layer uses the output of the previous layer as input. Deep learning has been applied to many fields such as computer vision and speech recognition.
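The layer-cascade idea can be sketched in a few lines of plain JavaScript (a toy illustration independent of any framework; the weights below are arbitrary example values): each layer applies a linear map followed by a nonlinearity to the previous layer's output.

```javascript
// Toy illustration of a layer cascade: each layer transforms the
// previous layer's output with a linear map plus a nonlinearity.
const relu = (x) => Math.max(0, x);

// One fully connected layer: out[j] = relu(sum_i in[i]*W[j][i] + b[j])
function denseLayer(input, weights, biases) {
  return weights.map((row, j) =>
    relu(row.reduce((acc, w, i) => acc + w * input[i], 0) + biases[j])
  );
}

// A two-layer cascade: the output of layer 1 feeds layer 2.
const layer1 = { W: [[1, -1], [0.5, 0.5]], b: [0, 0] };
const layer2 = { W: [[1, 1]], b: [0] };

function forward(input) {
  const h = denseLayer(input, layer1.W, layer1.b);
  return denseLayer(h, layer2.W, layer2.b);
}

console.log(forward([2, 1])); // hidden [1, 1.5] -> output [2.5]
```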
There are many types of neural networks, among which there are three basic structures. A deep neural network (DNN) is typically a feedforward network with multiple layers between the input and output layers, in which data flows from the input layer to the output layer without looping back. A convolutional neural network (CNN) uses convolution layers to extract local features from the input and is widely applied to tasks such as image processing. A recurrent neural network (RNN) has connections between nodes that form a directed graph along a sequence, allowing it to exhibit temporal dynamic behavior for a time sequence.
Deep learning consists of two phases: the training phase, where the input data are used to calculate the parameters of the model, and the inference phase, where the model outputs the value for a given input sample.
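The two phases can be illustrated with a one-parameter model y = w·x fit by gradient descent (a toy sketch, not any framework's API): training updates w from data, while inference merely evaluates the learned model.

```javascript
// Toy model y = w * x. Training adjusts w from (x, y) pairs;
// inference only evaluates the learned model on new inputs.
let w = 0;
const lr = 0.1;
const data = [[1, 2], [2, 4], [3, 6]]; // samples drawn from y = 2x

// Training phase: gradient descent on the squared error (w*x - y)^2.
for (let epoch = 0; epoch < 100; epoch++) {
  for (const [x, y] of data) {
    const grad = 2 * (w * x - y) * x; // d/dw of (w*x - y)^2
    w -= lr * grad;
  }
}

// Inference phase: apply the trained model to a new input.
const predict = (x) => w * x;
console.log(predict(5)); // close to 10, since w converges to 2
```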
2.2 Deep Learning on Browsers
In-browser deep learning lets users collect their own data and train the model right on the client machine, so no server is necessary. The benefits include, but are not limited to: accessibility and distribution regardless of end devices, reduced data transfer and latency of server-client communication, the ability to offload computation to end-user clients, privacy and security, and customization and sociability.
There are always tradeoffs to consider, of course. For example, eliminating the need to upload model input data repeatedly comes at the cost of an initial model file download. Depending on the size of the input data and the number of uses per model download, this can be a worthwhile tradeoff.
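This tradeoff can be made concrete with a back-of-envelope calculation (the file sizes below are hypothetical, purely for illustration):

```javascript
// Back-of-envelope tradeoff (hypothetical sizes): a one-time model
// download vs. repeatedly uploading input data to a server.
const modelBytes = 16 * 1024 * 1024; // e.g., a 16 MB model file
const inputBytes = 200 * 1024;       // e.g., 200 KB per input sample

// Number of uses after which the one-time download pays off.
const breakEvenUses = Math.ceil(modelBytes / inputBytes);
console.log(breakEvenUses); // 82: beyond this many uses, in-browser wins
```

Beyond the break-even point, every additional inference avoids a round trip entirely, which is where the latency and privacy benefits compound.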
3 Supported Features of Deep Learning in Browsers
In this section, we conduct a characteristic study to answer the first research question, i.e., what features do existing frameworks provide to implement various kinds of DL tasks in the browser? We first introduce the frameworks selected for the study. Then we compare the features of these frameworks from two aspects: provided functionality and developer support. For provided functionality, we mainly examine whether each framework supports basic functionalities used in the development of DL applications. For developer support, we look at factors that may affect the efficiency of developing and deploying DL applications. Table 1 summarizes the results.
|Framework||TensorFlow.js||ConvNetJS||Keras.js||WebDNN||brain.js||synaptic||Mind|
|Last Commit Date||Oct 30, 2018||Nov 25, 2016||Aug 17, 2018||Oct 25, 2018||Nov 5, 2018||Mar 25, 2018||Jul 7, 2017|
|Status||Active||Not Active||Not Active||Active||Active||Active||Not Active|
|Support for Training||Y||Y||N||N||Y||Y||Y|
|Supported Network Types (DNN)||Y||Y||Y||Y||Y||Y||Y|
|Supported Layer Types||49||7||NA||NA||7||1||1|
|Supported Activation Types||16||4||NA||NA||4||5||2|
|Supported Optimizer Types||7||3||NA||NA||1||NA||NA|
|Support for GPU Acceleration (WebGL)||Y||N||Y||Y||N||N||N|
|Documents||Y||Y||Not finished||Y||Only tutorials||Y||Y|
|Converting Model from Other Frameworks (TensorFlow)||Y||N||N||Y||N||N||N|
|API to Save/Load Model (Save)||Y||Y||N||N||Y||Y||Y|
|Support for Server Side (Node.js)||Y||Y||Y||Y||Y||Y||Y|
3.1 Selected Frameworks
Keras.js abstracts away a number of frameworks as backends. In GPU mode, computation is performed by WebGL shaders; models can also be run in Node.js in CPU mode. Keras.js supports importing pre-trained Keras models. However, this project is no longer active.
Mind is a flexible neural network library for Node.js and the browser. The core framework has only 247 lines of code and uses a matrix implementation to process training data. It supports customization of the network topology and plugins to configure pre-trained networks created by the Mind community. However, this framework is no longer active.
3.2 Provided Functionality
Support for training. Most frameworks support both training and inference in the browser. However, Keras.js and WebDNN do not support training DL models in browsers; they only support loading pre-trained models to perform inference. This is why the numbers of supported layer/activation/optimizer types are marked as not available (NA) for Keras.js and WebDNN in Table 1.
Supported network types. Some frameworks are not designed for general-purpose DL tasks, so they differ in the supported network types. Specifically, TensorFlow.js, Keras.js, and WebDNN support three network types: DNN, CNN, and RNN. However, ConvNetJS mainly supports CNN tasks and does not support RNN. brain.js and synaptic mainly support RNN tasks and do not support the convolution and pooling operations used in CNNs. Mind supports only the most basic DNN.
Supported layer types. All frameworks support building neural networks out of layers. The layers API of TensorFlow.js supports approximately 49 different layers, including dense, convolution, pooling, RNN, normalization, and so on. Other frameworks support a smaller variety of layers, which is also related to the network types they support. It should be noted that the core API of TensorFlow.js is implemented in a way similar to native TensorFlow, which combines various operations to build computational graphs. synaptic is an architecture-free neural network construction framework that supports building any type of first-order or even second-order RNN.
Supported activation/optimizer types. In general, TensorFlow.js provides developers with the most choices. For activation functions, other frameworks support only basic sigmoid or ReLU. For optimizers, other frameworks mainly support basic stochastic gradient descent (SGD).
Support for GPU acceleration (WebGL). WebGL is an API that uses the GPU to accelerate real-time rendering of graphics in the browser, and it can be used to accelerate the computation of neural networks. TensorFlow.js is the only framework that supports GPU-accelerated training tasks. TensorFlow.js, Keras.js, and WebDNN support GPU-accelerated inference tasks. WebDNN also supports a more advanced technology, WebGPU, but this technology can currently only be applied in the technology preview version of Safari and is not compatible with our devices.
3.3 Developer Support
Documentation. The documentation provided by the TensorFlow.js, ConvNetJS, WebDNN, and synaptic maintainers is complete and detailed. The documentation of Keras.js is incomplete, and brain.js offers only a few tutorials.
Demos. All the frameworks provide demos to help developers get started. TensorFlow.js offers the richest demos, covering a wide range of categories.
Converting models from other frameworks. TensorFlow.js, Keras.js, and WebDNN support importing models from frameworks in Python, and all of them provide Python scripts for converting models. TensorFlow.js supports models trained by TensorFlow and Keras. Keras.js supports Keras models. WebDNN supports importing models from TensorFlow, Keras, Caffe, and PyTorch. With support for using pre-trained models from other DL frameworks, the development effort can be significantly reduced.
API to save/load model. All frameworks that support training tasks in the browser have APIs for saving models. All frameworks have APIs for importing models.
Support for server side (Node.js). All frameworks can be used in Node.js. Such a feature makes it possible to offload computation from browsers onto remote servers.
Library size. We list the size of the library files that need to be loaded into browsers, which affects the speed at which a page loads and parses. The smallest is ConvNetJS, and the largest are TensorFlow.js and brain.js. Smaller library files are better for loading in browsers.
4 Performance of Deep Learning in Browsers
In this section, we conduct a measurement study to investigate the second research question, i.e., how well do existing frameworks perform over different DL tasks? We investigate the performance of different frameworks when running training and inference tasks.
4.1 Experiment Setup
Since the network types supported by different frameworks are not the same, as explained before, we adopt the most basic fully connected neural network as the model in the experiments. For the dataset, we use the classic MNIST handwritten digit recognition database. The model to be trained has 784 input nodes and 10 output nodes. To study the influence of model complexity on performance, we choose different configurations of the model. The parameters are 1) the number of hidden layers (depth) of the network, which ranges over [1, 2, 4, 8], and 2) the number of neurons (width) in each hidden layer, which ranges over [64, 128, 256]. In the training process, the batch size is always set to 64.
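To give a sense of how the depth/width grid above translates into model complexity, the number of trainable parameters of such a fully connected network can be counted directly (a sketch; the counting assumes standard dense layers with biases):

```javascript
// Count trainable parameters (weights + biases) of a fully connected
// network with 784 inputs, `depth` hidden layers of `width` neurons
// each, and 10 outputs -- the configuration grid of this experiment.
function paramCount(depth, width, inputs = 784, outputs = 10) {
  let params = 0;
  let prev = inputs;
  for (let i = 0; i < depth; i++) {
    params += prev * width + width; // weights + biases of a hidden layer
    prev = width;
  }
  params += prev * outputs + outputs; // output layer
  return params;
}

console.log(paramCount(1, 64));  // 50,890 (smallest configuration)
console.log(paramCount(8, 256)); // 664,074 (largest configuration)
```

The width term enters quadratically through the hidden-to-hidden weight matrices, which is one reason width dominates the performance variation observed later.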
Hardware. To study the performance difference between the CPU and GPU backends, we use a Hasee T97E laptop, which has a standalone graphics card, an Nvidia 1070 Max-Q (with 8GB GPU memory). The CPU is an Intel i7-8750H, which includes an Intel HD Graphics 630, enabling us to measure performance on the integrated graphics card. In the following, we use nGPU and iGPU to denote the backends of the standalone Nvidia graphics card and the integrated Intel graphics card, respectively.
Software. All the experiments run in the Chrome browser (version 71.0.3578.10 dev 64-bit) on Ubuntu 18.04.01 LTS (64-bit). For the frameworks, we use their latest published versions.
Performance measurement. For each DL task, we implement a Web page where the configuration of the DL model can be varied through parameters in the URL. We run each task in the Chrome browser and measure the time spent on finishing it. Since each experiment usually requires running dozens of tasks under different configurations, we developed a Chrome extension to iterate through all the pages and change the configuration after each task is performed. This browser extension is also responsible for monitoring the system resource usage of the Web page. Meanwhile, a local server records the experimental statistics uploaded by the extension.
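The URL-driven configuration described above can be sketched as follows (the parameter names `depth`, `width`, and `batch` are hypothetical; the paper does not list the actual query keys):

```javascript
// Sketch of reading a DL task configuration from URL query parameters,
// as the measurement pages do. Parameter names are hypothetical.
function parseConfig(pageUrl) {
  const params = new URL(pageUrl).searchParams;
  return {
    depth: parseInt(params.get("depth") || "1", 10),
    width: parseInt(params.get("width") || "64", 10),
    batchSize: parseInt(params.get("batch") || "64", 10), // fixed at 64
  };
}

const cfg = parseConfig("http://localhost/train.html?depth=4&width=256");
console.log(cfg); // { depth: 4, width: 256, batchSize: 64 }
```

A driving extension can then iterate the grid simply by navigating to the next URL once the current task reports completion.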
4.2 Training Performance
In general, the training time increases with the network size, since more computation is needed to complete the training process for larger networks. Comparing the training time of different frameworks on the CPU backend, we can see that ConvNetJS is the fastest among all the frameworks for all network configurations. The reason is that ConvNetJS is designed to be simple, as reflected by its small library file size. brain.js follows closely, with a performance gap of about 2x relative to ConvNetJS, and TensorFlow.js has a gap of 2x-3x. When comparing the training time ratio of ConvNetJS over TensorFlow.js, we find that the gap gradually narrows as the depth and width increase, indicating that, compared with ConvNetJS, TensorFlow.js has a relatively large overhead beyond the calculation itself. In addition, the gap narrows more as the network width increases than as the depth increases, implying that TensorFlow.js deals better with large-scale matrix calculation than ConvNetJS. synaptic has the worst performance, with a gap of dozens or even hundreds of times compared to the other frameworks, and the gap keeps growing as the network size increases.
GPU benefits. The time spent on training on the CPU backend becomes longer as the network size increases, but the results on the GPU backends are different. For both the iGPU, with its weaker computation power, and the nGPU, which can handle larger-scale matrix calculations, the training time does not increase significantly. However, between the configurations (4 hidden layers, 128 neurons per layer) and (8 hidden layers, 256 neurons per layer), the training time on the iGPU begins to increase significantly. The reason is that, under the network sizes set in this experiment, the training process does not reach the GPU's capability bottleneck. Although the matrix computation capability of the nGPU is better than that of the iGPU, the training time on the nGPU is even longer than on the iGPU. Such a result is caused by the excessive overhead of calling WebGL to access the GPU; the real computation time of the GPU should be much smaller.
TensorFlow.js shows some unexpected behavior on the CPU. In some cases, TensorFlow.js cannot maximize the utilization of a single core, and its CPU utilization is only 60.96%. Meanwhile, we find that when running training tasks on the GPU, the CPU is basically not fully loaded, and CPU utilization on the iGPU is on average 5-7% higher than on the nGPU.
4.3 Inference Performance
An inference task involves loading a pre-trained model and then, given a sample input, having the model output the result. In addition, on the GPU backend there is a warmup process, where the first sample for inference is usually used to activate the GPU processor. Therefore, we break down the process into three phases: model loading, warmup, and inference, and study the fine-grained performance of each.
Model file size. We first investigate the size of the model files used by different frameworks. As models for inference usually have to be downloaded from the network, smaller model files mean faster download time. Table 3 shows the size of the model files used in all inference experiments. Among them, WebDNN is special: its model converter converts the same Keras model into different scripts and JSON files used by four different backends. For a fair comparison, we count only the files containing the model's parameter values used by WebDNN. ConvNetJS and brain.js use similar JSON encodings, so the size of their model files is nearly the same. The model file used by synaptic is JSON-encoded as well, but its size is the largest among all the frameworks. The model files used by TensorFlow.js, Keras.js, and WebDNN are all converted from the same Keras model, so they are of the same size. Since the model converted from Keras is compressed and saved as a binary file, the size is greatly reduced, to only about 1/7 of the JSON-encoded model file.
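The size gap between JSON and binary encodings is easy to reproduce (a sketch with arbitrary synthetic weights, not any framework's actual serialization format):

```javascript
// Why binary model files are much smaller than JSON ones: a float32
// weight takes 4 bytes in binary, but its decimal text representation
// in JSON typically takes many more characters.
const weights = Array.from({ length: 1000 }, (_, i) =>
  Math.sin(i) * 0.123456 // arbitrary synthetic weight values
);

const jsonBytes = JSON.stringify(weights).length;          // text encoding
const binaryBytes = Float32Array.from(weights).byteLength; // 4 bytes each

console.log(binaryBytes); // 4000
console.log(jsonBytes);   // several times larger than the binary form
```

JSON also has to be parsed number by number at load time, which is consistent with the loading-time observations below.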
Model loading time. We then compare the time spent on loading the models of different frameworks, as shown in Table 2. For the CPU backend, the loading time of different models within the same framework is basically proportional to the model file sizes given in Table 3. However, the loading time differs significantly across frameworks. ConvNetJS is the fastest. The loading times of brain.js, TensorFlow.js, and Keras.js are of the same order of magnitude. Interestingly, the increase in loading time of ConvNetJS, brain.js, and synaptic is particularly noticeable when the width of the model, i.e., the number of neurons in each layer, increases. This result is caused by their choice of JSON to encode models. The loading of synaptic models is the slowest among all the frameworks, 100x to 1000x slower than ConvNetJS. The loading time of TensorFlow.js models is almost unchanged regardless of the model size.
The loading time of different model sizes on the GPU backend does not change much. However, the difference in loading time between frameworks is still significant. TensorFlow.js is the fastest of the three frameworks. Compared to loading models on the CPU, Keras.js speeds up the loading of large models, but the loading time of WebDNN becomes longer. Meanwhile, there is no difference in model loading speed between the iGPU and the nGPU.
Warmup time. Next, we examine the difference in warmup time on the GPU backend. As shown in Table 3, Keras.js is far ahead and completes the warmup within 3ms on all tasks. TensorFlow.js is second, and WebDNN is the worst. On the whole, the warmup time on the iGPU backend is generally shorter than on the nGPU.
Inference time. Table 4 shows the average time of inference on one sample. Within the range of model sizes we set, the powerful computation capability of the GPU does not make a difference. For all DNN model sizes, ConvNetJS takes first place, followed by WebDNN using WebAssembly on the CPU backend. The inference time of WebDNN on the GPU is longer than on the CPU. As for TensorFlow.js, the CPU backend is faster for inference on smaller models, while the GPU backend is faster on larger models. The inference times of Keras.js on the CPU and GPU are basically the same.
We can observe that when all the frameworks perform inference on the CPU, the overhead increases as the model's width and number of layers increase. In particular, when the model's width increases, the time increases sharply (the cost almost doubles as the model width doubles). As in the training case, this reflects that these frameworks do not make good use of optimizations for large-scale matrix operations during forward propagation on the CPU. TensorFlow.js and WebDNN on the GPU do not exhibit this problem, but Keras.js on the GPU still suffers from it.
Based on the above evaluation, we can see that for the small-scale fully connected neural networks that the browser is capable of, the compact and concise ConvNetJS library performs best in both training and inference. However, since ConvNetJS is no longer maintained and has fewer functional interfaces, developers may need to choose alternatives.
TensorFlow.js is the only framework that can take advantage of GPU-accelerated training. It is feature-rich, and its performance is comparable with ConvNetJS, so TensorFlow.js is a good choice for both training and inference. We do not recommend using the GPU as a backend for small models, since the advantages of the GPU's computation power are not fully exploited there.
Finally, we are interested in why ConvNetJS has the best performance among these frameworks for all the tasks. To this end, we compare the call stack of ConvNetJS with that of TensorFlow.js when performing training tasks. Surprisingly, the call stack depth of ConvNetJS is only 3, while that of TensorFlow.js is 48! One possible reason for the performance gap is thus the deep call stack, which costs a lot of computation resources.
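The effect of call-stack depth can be demonstrated with a toy experiment (a sketch, not the frameworks' actual internals): wrap the same arithmetic in 3 versus 48 layers of function calls and compare.

```javascript
// Toy demonstration of call-stack overhead: the same operation wrapped
// in 3 vs. 48 layers of function calls produces identical results,
// but each extra layer adds a stack frame on every invocation.
function wrap(fn, depth) {
  let wrapped = fn;
  for (let i = 0; i < depth; i++) {
    const inner = wrapped;
    wrapped = (x) => inner(x); // one extra stack frame per layer
  }
  return wrapped;
}

const op = (x) => x * 2 + 1;
const shallow = wrap(op, 3);  // ConvNetJS-like depth
const deep = wrap(op, 48);    // TensorFlow.js-like depth

console.log(shallow(10), deep(10)); // 21 21: same result either way

// Timing loop: the deep variant typically runs measurably slower,
// though a JIT may partially inline the wrappers.
function time(fn) {
  const start = Date.now();
  let acc = 0;
  for (let i = 0; i < 1e6; i++) acc += fn(i);
  return Date.now() - start;
}
console.log(time(shallow), time(deep));
```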
5 Comparison with Native Framework
In this section, we study the third research question, i.e., how big is the performance gap between running DL in the browser and on the native platform? To this end, we compare the performance of TensorFlow.js and native TensorFlow in Python, both of which are released and maintained by Google and have similar APIs. As shown in the last section, different in-browser DL frameworks have comparable performance, so the results on TensorFlow.js can represent the state of the art.
We study the performance gap from two aspects. On one hand, we leverage well-known pre-trained models to compare the performance of TensorFlow.js and native TensorFlow on inference tasks. On the other hand, we use decision tree analysis to distinguish the factors contributing to the performance gap. As for the experiment setup, we use the same laptop as in the experiments of the last section and install the latest TensorFlow in Python on it.
5.1 Inference Based on Popular Pre-Trained Models
We use the pre-trained models provided officially by Keras to measure the performance of TensorFlow.js and native TensorFlow on inference tasks over these classical CNN models.
5.1.1 Limitation of TensorFlow.js and browser constraints
Keras officially provides 11 pre-trained models. Although these models work with native TensorFlow, we encountered a series of errors when running them with TensorFlow.js in the browser. These errors imply the limitations of TensorFlow.js itself as well as the constraints imposed by the browser.
For the NasNet Large model, the browser throws the error message "truncatedNormal is not a valid Distribution". For the ResNet V3 model, the browser throws the error message "Unknown layer: Lambda". The reason is that TensorFlow.js is still under development and so far offers only limited support for converted models. Many user-defined operations are not supported by TensorFlow.js; e.g., models with control flow ops (e.g., RNNs) are not yet supported.
When we try to use VGG16 or VGG19, the console generates the error message "GL OUT OF MEMORY", meaning that the GPU memory is overfilled. The reason is that the VGG16 model requires more than 1GB of GPU memory. However, this should not be an issue, since the GPU memory of our experimental device is 8GB. As a result, such an error is due to restrictions imposed by the browser.
After trying all the models, we finally have 5 models that can be correctly converted and run in the browser. The information on these models is listed in Table 4. The number of trainable parameters is obtained by the built-in summary() method of tensorflow.keras, and the FLOPs (floating-point operations) are obtained by the tensorflow.profiler.profile() method.
Figure 5 shows the inference time of each model. It can be seen that on some common models, the inference time of TensorFlow.js on the nGPU is similar (1x-2x slower) to that of native TensorFlow. The most encouraging result is that using the iGPU backend for acceleration performs better than native TensorFlow on the CPU backend. This result is not surprising considering the computation capability of the iGPU versus the CPU. However, since traditional native DL frameworks do not support integrated graphics cards for acceleration, DL in browsers brings a lot of benefit in such cases, with the help of the integrated graphics cards common on current devices.
Under the real-time requirements of client-side deep learning, if users want to achieve 10 FPS, they need to consider a more powerful standalone graphics card; the MobileNet model accelerated by the iGPU can also meet this requirement. If the standard is 1 FPS, the iGPU is fully capable. But if only the CPU can be used, these common models are too heavy a burden for browsers.
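The frame-rate targets above translate directly into per-inference time budgets, which can be checked with a couple of lines (a simple arithmetic sketch; the 55 ms figure is a made-up example, not a measured value):

```javascript
// Frame-rate budget: an inference must finish within 1000/FPS ms.
const budgetMs = (fps) => 1000 / fps;

console.log(budgetMs(10)); // 100 ms per inference for 10 FPS
console.log(budgetMs(1));  // 1000 ms per inference for 1 FPS

// A model meets a frame-rate target if its inference time fits the budget.
const meetsTarget = (inferenceMs, fps) => inferenceMs <= budgetMs(fps);
console.log(meetsTarget(55, 10)); // true: 55 ms fits the 100 ms budget
```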
5.2 Decision Tree Analysis
We wonder to what extent the performance of in-browser DL frameworks differs from that of native DL frameworks, and how learning parameters contribute to the difference. To understand these conditions, we build a predictive model based on decision tree analysis.
|Model||Layer||Width|
|DNN||1, 2, 4, 8, 16||64, 128, 256, 512|
|CNN||6, 9, 15, 27||200, 400, 800|
|RNN||1, 2, 3||4, 8, 16, 32, 64, 256|
We build models on DNN, CNN, and RNN to investigate the factors contributing to the performance gap. We build the DNN and CNN models to recognize handwritten digits on the MNIST dataset, and the RNN model to perform text generation from Nietzsche's writings. The models are derived from the official TensorFlow.js examples, with slight modifications to set the parameters of the DL models.
In the analysis, each configuration is a combination of values of the factors backend, layer, width, and task listed in Table 5. In the DNN and RNN networks, model width refers to the number of neurons in each layer; in the CNN network, it means the number of kernels used in the convolutional layer. The range of model widths is selected according to the values set in the official TensorFlow.js examples. Model parameters not mentioned above remain at their defaults.
We obtain the average time per batch for training tasks and the average time per example for inference tasks on the two platforms. The ratio of the execution time in TensorFlow.js to that in native TensorFlow is used as the measurement of the performance gap in this analysis.
We run the decision tree algorithm to predict the ratio of execution time between TensorFlow.js and native TensorFlow. The decision tree depicts the relative importance of the contributing factors. Intuitively, factors close to the root of the decision tree affect the time ratio more than those near the leaves, because the decision tree splits nodes according to the entropy-based information gain criterion. In other words, it places the important factors near the root to obtain the best prediction from the splits.
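The splitting criterion mentioned above can be sketched with the standard entropy and information-gain formulas (a generic illustration with toy labels, not the paper's actual data):

```javascript
// Entropy and information gain, the criterion a decision tree uses to
// choose which factor to split on. Standard formulas, toy example data.
function entropy(labels) {
  const counts = {};
  for (const l of labels) counts[l] = (counts[l] || 0) + 1;
  return Object.values(counts).reduce((h, c) => {
    const p = c / labels.length;
    return h - p * Math.log2(p);
  }, 0);
}

// Information gain of splitting `labels` into subgroups: parent entropy
// minus the size-weighted entropy of the children.
function infoGain(labels, groups) {
  const weighted = groups.reduce(
    (sum, g) => sum + (g.length / labels.length) * entropy(g), 0);
  return entropy(labels) - weighted;
}

// Toy example: a split that perfectly separates "high" vs. "low" time
// ratios has maximal gain; a split that separates nothing has zero gain.
const all = ["high", "high", "low", "low"];
console.log(infoGain(all, [["high", "high"], ["low", "low"]])); // 1
console.log(infoGain(all, [["high", "low"], ["high", "low"]])); // 0
```

A factor like backend, whose split cleanly separates high ratios from low ones, therefore ends up near the root.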
First, we produce a fully grown, unpruned decision tree based on all the factors, so that each leaf contains only one configuration set. Then we set the depth of the tree to the number of factors, in order to prevent using a factor several times on one path. We obtain the decision trees in Figure 6(a), Figure 6(b), and Figure 6(c), based on the deep learning tasks on the different models.
The decision trees in the three figures all show that TensorFlow.js exhibits higher execution time than native TensorFlow in almost every configuration.
We find that backend is the most important factor in predicting the ratio of execution time, as it is located nearest the root of the decision trees. When using the CPU as the backend, the ratio of execution time is much higher than with the GPU backend. For example, the time ratio decreases from 44.7 to 4.4 on the training task of the same DNN model with more than 3 layers (large layer number) and a model width over 256 (large width).
The extreme case happens with the CNN model. On the CPU backend, the time ratio ranges widely, from below 5 to over 2200 (when the layer number is less than 7.5 and the width is over 600). However, when doing inference on the GPU with a layer number over 12 and a model width over 600, TensorFlow.js runs at almost the same speed as native TensorFlow. This is because the CNN can make use of the powerful computation capability of the GPU when the model is large enough, yet does not exceed the upper bound of the browser's memory.
The second most important factor is task for all three models. Performing training tasks causes a higher ratio of execution time, while the performance gap between the two frameworks on inference tasks is small. For example, on the DNN model with the CPU backend, training tasks on TensorFlow.js run 33.9 times slower than native TensorFlow on average, while inference tasks run 5.8 times slower on average.
The decision trees of the DNN and RNN models both suggest that the importance of layer number and model width depends on the backend. On the CPU, the importance of model width outweighs that of layer number, while on the GPU backend the layer number plays the more important role. For the CNN, model width matters more to the execution time than layer number.
6 Implications and Recommendations
Table 6 summarizes the findings and implications of our study. Specifically, we draw implications for three stakeholders in DL in browsers: application developers, DL-framework vendors, and browser vendors. For application developers, we give recommendations on how to choose frameworks for DL in browsers, how to optimize the model, and how to select the backend. For DL-framework vendors, we present advice on the encoding of model files and the optimization of call stacks. For browser vendors, we make suggestions on the utilization of system resources.
| # | Aspect | Finding | Implication | Stakeholder |
|---|--------|---------|-------------|-------------|
| 1 | Specific DL-task support | Frameworks supporting DL in browsers are emerging and being actively maintained. Most of them are not general-purpose and support only a specific subset of DL tasks. However, different frameworks exhibit comparable performance when running the same DL tasks on the same configuration. | It is better for developers to use general-purpose DL frameworks such as TensorFlow.js to implement their Web applications. | Application developer |
| 2 | Model complexity | Among the aspects of model complexity, the width of DL models dominates the performance variation of both training and inference tasks. | Developers should pay attention to the width of their models, and balance the width against the required performance if possible. | Application developer |
| 3 | Model loading | For inference tasks, loading and warming up the DL model takes much longer than running the inference task itself. The warmup time on an integrated graphics card is generally shorter than on a standalone graphics card. | Developers should pre-load and warm up the model before using it for inference. | Application developer |
| 4 | Benefits from GPU | For popular pre-trained models such as MobileNet and Inception, TensorFlow.js achieves performance comparable to native TensorFlow when running inference on the standalone GPU backend. | It is feasible to develop Web applications rather than native applications for these tasks. | Application developer |
| 5 | Benefits from integrated graphics card | TensorFlow.js running on the integrated graphics card outperforms native TensorFlow running on the CPU backend. | For devices without a standalone GPU, developers can use the integrated graphics card for acceleration. | Application developer |
| 6 | Model file encoding and size | A model file encoded in JSON is much larger (7x) than the same model encoded in binary, which significantly increases the model loading time. | It is better to encode DL models in binary files. | DL-framework vendor |
| 7 | Framework call stack | The call stack of TensorFlow.js is much deeper than that of ConvNetJS, pulling down its performance. | Framework vendors could leverage compiler optimization techniques to reduce the call stack when DL models are used. | DL-framework vendor |
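The size gap behind Finding 6 can be reproduced in miniature: serializing the same weight values as JSON text versus raw 32-bit floats shows why binary model files are several times smaller. The sketch below is illustrative only; the sizes it prints are for a toy weight list, not a real model file.

```python
import json
import random
import struct

# A toy "weight tensor": 10,000 pseudo-random floats.
random.seed(0)
weights = [random.uniform(-1.0, 1.0) for _ in range(10_000)]

# JSON encoding: each float becomes a long decimal string plus a separator.
json_bytes = json.dumps(weights).encode("utf-8")

# Binary encoding: each float is exactly 4 bytes (IEEE 754 single precision).
bin_bytes = struct.pack(f"{len(weights)}f", *weights)

print(f"JSON:   {len(json_bytes):>7} bytes")
print(f"binary: {len(bin_bytes):>7} bytes")
print(f"ratio:  {len(json_bytes) / len(bin_bytes):.1f}x")
```

The exact ratio depends on float precision and formatting, but binary always wins, and it also avoids the text-to-number parsing cost at load time.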
7 Related Work
To the best of our knowledge, this paper is the first study to characterize deep learning in browsers. We survey related work on client-side deep learning in general and on performance measurement of deep learning systems.
7.1 Client-side Deep Learning
Lane et al. studied typical mobile sensing tasks, such as activity recognition, using DNNs. Han et al. compressed DNNs through a three-stage method (pruning, trained quantization, and Huffman coding), achieving a considerable reduction in the storage requirements of DNNs. Mobile clients can leverage deep learning in applications by offloading the computation-intensive tasks to the cloud. Teerapittayanon et al. proposed distributed deep neural networks (DDNNs) over distributed computing hierarchies consisting of the cloud, the edge (fog), and end devices. Meeds et al. introduced MLitB, a prototype ML framework capable of performing large-scale distributed computing with heterogeneous classes of devices. Ignatov et al. studied state-of-the-art deep learning in the Android ecosystem, describing the available frameworks, the programming models, and the limitations of running AI on smartphones. Arden is a cloud-based deep learning framework for mobile devices that partitions the DNN and offloads the resource-hungry training and complex inference tasks to the cloud. Ichinose et al. proposed a pipelined method for distributed deep learning between mobile devices and the cloud, which reduces the amount of data sent to the cloud and protects user privacy. Yao et al. proposed DeepSense, a unified deep learning framework that directly addresses the customization challenges arising in mobile sensing applications. Kuhnle et al. introduced a deep-linguistic-processing method for evaluating deep learning models.
7.2 Performance Measurement of Deep Learning
Lalor et al. showed that DNN model performance is affected by item difficulty as well as training set size, using a well-established method for estimating difficulty, as opposed to heuristics, to analyze model performance. In computer vision, Liu et al. evaluated the performance of leading deep learning methods for object detection. Guignard et al. presented detailed characterization results for a set of archetypal state-of-the-art deep learning workloads on a last-generation IBM POWER8 system with NVIDIA Tesla P100 GPUs and NVLink interconnects; their goal was to identify the performance bottlenecks (i.e., the accelerable portions) and thereby guide the design of prospective acceleration platforms more effectively. Shi et al. evaluated the running performance of four state-of-the-art distributed deep learning frameworks (Caffe-MPI, CNTK, MXNet, and TensorFlow) over different GPU hardware environments. They built performance models of the standard processes in training DNNs with SGD, and then benchmarked the running performance of the four frameworks with three neural networks (AlexNet, GoogleNet, and ResNet-50). By analyzing the factors behind the performance gap among these four frameworks, they identified bottlenecks and overheads that could be further optimized.
-  brain.js. https://github.com/BrainJS, 2018.
-  Caffe. http://caffe.berkeleyvision.org/, 2018.
-  CNTK. https://www.microsoft.com/en-us/cognitive-toolkit/, 2018.
-  ConvNetJS. https://cs.stanford.edu/people/karpathy/convnetjs/, 2018.
-  Keras.js. https://github.com/transcranial/keras-js, 2018.
-  MIL WebDNN benchmark. https://mil-tokyo.github.io/webdnn/#benchmar, 2018.
-  Mind. https://github.com/stevenmiller888/mind, 2018.
-  MXNet. http://mxnet.incubator.apache.org, 2018.
-  synaptic.js. https://github.com/cazala/synaptic, 2018.
-  TensorFlow.js. https://js.tensorflow.org/, 2018.
-  WebAssembly. https://webassembly.org/, 2018.
-  WebDNN. https://github.com/mil-tokyo/webdnn, 2018.
-  WebGL. https://www.khronos.org/webgl/, 2018.
-  WebGPU. https://www.w3.org/community/gpu/, 2018.
-  M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: a system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
-  M. Auer. Real-time web gis analysis using webgl. International Journal of 3-D Information Modeling (IJ3DIM), 1(3):49–61, 2012.
-  B. Chen and Z. Xu. A framework for browser-based multiplayer online games using webgl and websocket. In Multimedia Technology (ICMT), 2011 International Conference on, pages 471–474, 2011.
-  M. Guignard, M. Schild, C. S. Bederián, N. Wolovick, and A. J. Vega. Performance characterization of state-of-the-art deep learning workloads on an IBM "Minsky" platform. In 51st Hawaii International Conference on System Sciences, HICSS 2018, Hilton Waikoloa Village, Hawaii, USA, January 3-6, 2018, 2018.
-  G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural computation, 18(7):1527–1554, 2006.
-  A. Ichinose, A. Takefusa, H. Nakada, and M. Oguchi. Performance evaluation of pipeline-based processing for the caffe deep learning framework. IEICE Transactions, 101-D:1042–1052, 2018.
-  A. Ignatov, R. Timofte, W. Chou, K. Wang, M. Wu, T. Hartley, and L. V. Gool. Ai benchmark: Running deep neural networks on android smartphones, 2018.
-  A. Kuhnle and A. Copestake. Deep learning evaluation using deep linguistic processing. In Proceedings of the Workshop on Generalization in the Age of Deep Learning, pages 17–23, 2018.
-  J. Lalor, H. Wu, T. Munkhdalai, and H. Yu. Understanding deep learning performance through an examination of test set difficulty: A psychometric case study. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4711–4716, 2018.
-  N. D. Lane and P. Georgiev. Can deep learning revolutionize mobile sensing? In Proceedings of the 16th International Workshop on Mobile Computing Systems and Applications, pages 117–122, 2015.
-  Y. LeCun et al. Generalization and network design strategies. Connectionism in perspective, pages 143–155, 1989.
-  W. Liu, J. Cao, L. Yang, L. Xu, X. Qiu, and J. Li. Appbooster: boosting the performance of interactive mobile applications with computation offloading and parameter tuning. IEEE Transactions on Parallel and Distributed Systems, 28(6):1593–1606, 2017.
-  Y. Liu, P. Sun, M. R. Highsmith, N. M. Wergeles, J. Sartwell, A. Raedeke, M. Mitchell, H. Hagy, A. D. Gilbert, B. Lubinski, and Y. Shang. Performance comparison of deep learning techniques for recognizing birds in aerial images. In 2018 IEEE Third International Conference on Data Science in Cyberspace (DSC), pages 317–324, 2018.
-  B. Malle, N. Giuliani, P. Kieseberg, and A. Holzinger. The need for speed of ai applications: Performance comparison of native vs. browser-based algorithm implementations, 2018.
-  C. Marrin. Webgl specification. Khronos WebGL Working Group, 2011.
-  E. Meeds, R. Hendriks, S. Al Faraby, M. Bruntink, and M. Welling. Mlitb: machine learning in the browser. PeerJ Computer Science, 1:e11, 2015.
-  J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. P. Singh. Action-conditional video prediction using deep networks in atari games. In NIPS, 2015.
-  D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. nature, 323(6088):533, 1986.
-  S. Shi, Q. Wang, and X. Chu. Performance modeling and evaluation of distributed deep learning frameworks on gpus. In 2018 IEEE 16th Intl Conf on Dependable, Autonomic and Secure Computing, 16th Intl Conf on Pervasive Intelligence and Computing, 4th Intl Conf on Big Data Intelligence and Computing and Cyber Science and Technology Congress, DASC/PiCom/DataCom/CyberSciTech 2018, Athens, Greece, August 12-15, 2018, pages 949–957, 2018.
-  S. Teerapittayanon, B. McDanel, and H. Kung. Distributed deep neural networks over the cloud, the edge and end devices. In Distributed Computing Systems (ICDCS), 2017 IEEE 37th International Conference on, pages 328–339, 2017.
-  A. Tucker, A. Gleave, and S. Russell. Inverse reinforcement learning for video games. arXiv preprint arXiv:1810.10593, 2018.
-  J. Wang, J. Zhang, W. Bao, X. Zhu, B. Cao, and P. S. Yu. Not just privacy: Improving performance of private deep learning in mobile cloud. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, UK, August 19-23, 2018, pages 2407–2416, 2018.
-  S. Yao, S. Hu, Y. Zhao, A. Zhang, and T. Abdelzaher. Deepsense: A unified deep learning framework for time-series mobile sensing data processing. In Proceedings of the 26th International Conference on World Wide Web, pages 351–360, 2017.