In the past decade, the amount of data generated has grown at an exponential rate. In 2018, over 2.5 quintillion bytes of data were generated every day . That equals 2.5 million terabyte hard disk drives (HDD) every single day. The emergence of the Internet of Things (IoT) is further accelerating this growth of data generation. Data is not meaningful unless processed; moreover, data often need to be processed in real-time or near real-time. For example, in self-driving cars, a large amount of data needs to be processed quickly and locally. The rate of data generation is very high, and the required processing is very heavy. Thus, they can create a backlog of unprocessed data, which prevents the system from making a proper decision. Although new metrics such as Age of Information (AoI) can help preserve the freshness of data by discarding the old data and replacing it with new, valid data , the problem with processing still persists.
With the rise in the use of machine learning algorithms, many problems can be solved at the cost of high computation. Up until recent years, the conventional method to process the massive volume of data was to send it to data centers, process, and resend the results back to the client. This method is associated with common problems such as latency, identity management, and data protection. With the considerable progress in the design of low-power processors such as ARM® A series, the edge computing has emerged as an alternative solution. Although edge computing can be beneficial for small size applications, data centers remain the backbone of heavy computation tasks, including the training of neural networks. Based on the size of the network and the number of samples, the training procedure can take weeks, even on the most powerful processing machines currently existed. Distribution of the training can significantly reduce the overall time, but obstacles to effective distribution include synchronization of the models, data transfer, and power dissipation.
In this paper, we propose a distributed in-storage processing (ISP) framework to address the issues above. The ISP technology is to process data inside the storage devices without sending the data to the host. The ISP-enabled storage devices, known as Computational Storage Devices (CSD), are equipped with an internal processing engine that is capable of processing data in-place. This technology is studied in the literature and has been shown to be beneficial for big data and HPC applications [20, 10, 17, 6]. This technology reduces the data movements, and consequently, it improves both performance and energy consumption. Additionally, in cases where the applications are connected to the storage devices via a network, data should move through this network to be processed. This long-range data movement raises some concerns regarding the security and the privacy of data and makes it vulnerable to privacy attacks. The ISP technology completely solves this problem by avoiding the data movement. The overall contributions of this paper are as follows:
a novel low-power CSD named Newport with augmented processing power;
a framework named Stannis to efficiently parallelize training tasks on clusters of CSDs;
a tuning algorithm to maximize the utilization of a heterogeneous system;
protecting privacy with a hybrid of private and public data in a storage system.
The rest of the paper is structured as follows. Section II reviews related work on neural network training and computational storage devices. Section III introduces the Newport hardware and software stack, followed by our distribution framework named Stannis. Section V presents our experimental results.
Ii Background and Related Work
This section provides a background on the background on deep neural networks and their efficient processing on novel parallel archiectures.
Ii-a Deep Neural Networks
Deep learning has achieved tremendous success and has become one of the most powerful modeling tools combined with neural network. The idea behind neural network is the imitation of human neurons and is developed based on the perceptron model. Deep learning stacks multiple layers of neural network, in which the output of the current layer is fed as the input to the next layer. As an example, the supervised learning trains the neural network by taking the training dataset as input and comparing the output with the corresponding output label. After several training iterations, the parameters in the neural network are updated and able to do the prediction using the same type of input. Thanks to the idea of chain rule, back propagation is developed to train neural network more efficiently. As the name of the algorithm suggests, the update of parameters starts from the last layer and propagate back to the previous layer based on the comparison between the output and labels. In order to optimize the training process, Stochastic Gradient Descent (SGD) is used to reduce the error. Many machine learning libraries have been developed for model developing using neural networks, including Tensorflow, Theano, and PyTorch. However, due to the size and complexity of neural networks and the huge amount of training data, the training period has grown drastically from hours to days and weeks.
Ii-B Parallelization of Training
A practical approach to reducing the training time is to parallelize it on multiple devices. Two common methods for parallelization are the model parallel and data-parallel. In the former, each processing node is responsible for training a part of the neural network; in the latter, all processing nodes have a replica of the entire network and update it locally, but since different copies of the network get different updates, a synchronization method is needed. The initial version of Tensorflow, a well known library for machine learning applications, uses a parameter server approach to synchronize worker nodes. Each worker sends parameter updates to the parameter server where all updates are accumulated, averaged, and then sent back to the workers. This method can cause either computation or communication congestion on the server-side. Pytorch uses a similar approach with a more enhanced allreduce algorithm . In 2017, Horovod  for distributed training of neural networks has drawn attention with its superior scalability compared to Tensorflow’s standard distribution framework. Horovod was developed based on Nvidia’s NCCL ring-allreduce algorithm, where each node communicates only with two peer nodes and pass updates along in a circular fashion. This method is bandwidth optimal, and each node’s communication bandwidth is independent of the number of the nodes. However, where Horovod falls short is the support for heterogeneous processors, a problem we will address in Section III.
Ii-C Hardware Accelerators
Another way to decrease the overall time is to use more powerful devices. Due to its nature of parallel architecture, GPUs are much faster than regular CPUs in floating-point operations that commonly occur in training. For instance, Nvidia Tesla V100 can run a maximum of 100 TFlops whereas an Intel®
Core™ i9-7980XE can output a maximum of 1.3 TFlops. In 2016, Google unveiled its machine-learning chip called Tensor Processing Unit (TPU), a massively paralleled Application Specific Integrated Circuit (ASIC) for tensor operations that can deliver up to 420 TFlops in its latest version. However, the downside of both GPU and TPU is that they are application-specific processors and are not as good as CPU for general operations. Field programmable gate arrays (FPGAs) are also widely used in accelerating different applications, however, they have a long design time and need to be reconfigured for executing different operations.
Ii-D In-Storage Processing
Before the era of NAND flash memories, the I/O speed on hard disk drives (HDD) was the main obstacle in accelerating the operations. Although NAND flash memories are faster than their magnetic counterparts, moving data from the storage to the DRAM for processing still remains as the bottleneck. ISP is an innovative technology that addresses this shortcoming by moving computation to the data instead of moving data to the computation. ISP allows tasks to be executed in place, i.e., in the storage devices, with no need to move huge volumes of data. Based on the processing units, the available CSDs can be categorized into two main groups: 1) CSDs that exploit the processing unit dedicated to SSD controller for ISP [11, 12], and 2) CSDs with a dedicated ISP engine [19, 10, 6]. In the first approach, despite the energy efficiency of exploiting the SSD controller’s processing units for the purpose of ISP, the gained processing power is rather limited due to their relatively low performance; moreover, doing so can also negatively impact the performance of the SSD. In the second approach, dedicated processing units such as FPGAs  or embedded processors  overcome the limited processing power by augmenting the storage units with more computing resources. Although FPGAs can be power efficient with a great performance improvement potential, they suffer from several limitations that make them unsuitable for ISP purpose, including 1) they require an RTL implementation of the tasks, whose high-performance realization can be very challenging; 2) reconfiguration of FPGAs is needed for different tasks; 3) due to the lack of a file system structure, data cannot be accessed and used as a file concept. It has been shown that deploying a dedicated processor is a viable solution for ISPs . In this paper, we propose a dedicated processor-based CSD and show the benefits of using it in the distributed training phase of learning algorithms.
Iii CSD architecture
In-storage processing technology aims at augmenting storage devices with efficient processing engines, enabling them to process data locally without transferring it to a host system. Despite the simplicity of the concept, many of the previously proposed CSD architectures fall short of providing compelling flexibility and efficiency simultaneously.
In this research, we introduce Newport, an ASIC-based CSD architecture that provides a flexible and capable ISP environment for offloading and running user applications without modification. Newport is equipped with a full-fledged OS, which can support a wide range of programming models and languages. Newport is designed to embrace the ISP technology natively, which means its controller is composed of a dedicated ISP processing engine together with other conventional modules, all implemented in a single ASIC chip. Fig. 1 shows the hardware architecture of the Newport CSD, whose internal modules are categorized into three subsystems: front-end subsystem (FE), ISP-dedicated processing engine, and back-end subsystem (BE).
Newport includes four ARM® M7 processors for running the conventional tasks of the storage device. The FE subsystem uses one of the four ARM® M7 processors and a host NVMe interface module. The interface module receives and depacketizes the NVMe commands, and the ARM® M7 processor checks the integrity of the commands and interprets them for the BE subsystem.
The CSD controller is connected to multiple NAND flash chips organized in 16 flash channels, which can do I/O operations in parallel. The BE subsystem manages all the flash channels and performs the I/O operations on the channels. The BE contains a flash memory controller unit, three ARM® M7 processors, and an error correction code (ECC) unit for restoring the original data in case of bit errors from flash access. The BE also handles flash management tasks, including wear leveling, garbage collection, and flash memory logical-to-physical address translation layer (FTL). Besides these two subsystems, Newport is equipped with a unique and dedicated ISP engine composed of a quad-core ARM® Cortex A-53 processors and a complete software stack. The quad-core processor has access to an 8GB DRAM, which is shared among the processing units inside the CSD controller. The Software stack makes it possible to send ISP commands from the host to the Newport CSD and sending the ISP results back to the host.
All the subsystems in the Newport architecture are connected by a data bus, as shown in Fig. 1. Both the FE and ISP subsystems send flash read/write commands to BE, which is responsible for communicating with the flash memory chips to complete the flash I/O operations. In other words, while the data that is transferred to the host needs to go through the FE subsystem and a complex NVMe over PCIe link, the ISP engine bypasses the FE as well as the power-consuming NVMe link to read the data. Thus, the ISP subsystem has efficient and high-speed access to the data stored in the flash memory.
Fig. 2 shows Newport software stack architecture. The ISP engine runs a full-fledged Linux operating system (OS) both for running a wide spectrum of user applications in-place as well as developing the software stack.
To provide filesystem-level access for user applications that run on the ISP engine, we have developed a block device driver, which makes an abstraction layer for flash I/O operations, so the OS on the ISP engine can mount the flash memory as a block device.
For communication purposes, a TCP/IP over PCIe tunnel is provided, which embeds TCP/IP packets inside the PCIe packets. Using this tunnel, user applications running on the host can communicate with the applications running on Newport. The TCP/IP tunnel is composed of a process running on the FE to transfer data from the ISP engine to the host and vice versa, a process running on the host which handles TCP/IP packetization on the host, and a similar process running on the Newport CSD to handle the packetization. These three processes work together to provide a TCP/IP connection between the ISP engine and the host. In the case of attaching multiple Newport CSDs to a host, the TCP/IP over PCIe tunnel spans over all the CSDs to provide a network covering all the CSDs and the host. Using the block device driver and the TCP/IP tunnel, user applications running on Newport have efficient and high-speed access to flash data, and can also communicate with the applications running on the host or other CSDs.
Since both the host and the ISP engine have concurrent access to the flash memory at the filesystem level, a synchronization mechanism should maintain the integrity of the filesystem metadata. To address this issue, we have ported Oracle Cluster FileSystem version 2 (OCFS2) on both the host and Newport CSD. OCFS2 has two agents that run on the host and the Newport CSD. These two agents communicate with each other through the TCP/IP tunnel to synchronize the filesystem metadata. We have prototyped a fully functional Newport CSD to show the feasibility of the proposed design and also to run experiments on a platform equipped with multiple Newport CSDs. Fig. 3 demonstrates the Newport prototype.
Iv Software Framework
System for TrAining of Neural Networks In Storage (Stannis) is a framework for the distribution of neural network training on homogeneous and heterogeneous systems. It overcomes Horovod’s inability to work efficiently on hetereogeneous systems by workload scaling with consideration for transmission delay and data privacy.
Horovod shows great speedup on homogeneous systems, but on a heterogeneous system, the slowest processor becomes the bottleneck due to required synchronization in training. That is, a faster processing engine must wait for the slower ones. To address this problem, we try to equalize the processing time by setting different batch sizes for each processing engine, so that all processors wait for the least amount of time. In other words, the slower processor gets smaller batch size and thus, finishes the epoch in the same elapsed time as the more powerful processors.
Algorithm 1 shows the pseudo code for the proposed approach. It starts by running a series of benchmarks on all processing engines to assess their processing speed and find the optimal batch size on each processor. Based on the results, the best batch size is selected for the slowest engine. Having the batch size for one device, we calculate the time it takes to finish one batch and find the best batch size on the other processing engines that can deliver almost the same elapsed time. We increase the batch size by a fraction () of the difference in time for the two processor, up until getting similar times. is a constant that determines how much to adjust the batch size after each test. Larger C means more fine grained batch size update. We also consider the slowdown that occurs due to the synchronization process and allocates a () margin to the final time. We determined by observing the slowdown pattern for the first number of processing nodes added to the system in our initial tests and defined it to give a fixed 20% margin to the results. Stannis also considers the access permissions of the data and assigns the private data to local ISP engine while sharing the public data with the ISP engine and the host processor. This method eliminates the transmission of private data over the network and to the host and increases the data protection level.
One potential concern with varying the training batch sizes is the loss of accuracy. In practice, however, we observed that the range of changes in batch size makes no tangible impact on accuracy. While setting batch sizes can increase speed within one epoch, an imbalance in the datasets assigned to the computing nodes can stall faster nodes at the end of each epoch. After determining the optimal batch size, Stannis runs a load balancing algorithm to assign the proper number of input data to each node so all nodes finish after the same number of steps in one epoch. Since the host has access to more data than each individual CSD, it is convenient to determine the host’s dataset size based on the batch size ratios and the slower node’s dataset size. The following equation shows how to determine dataset size on host:
In case the portion of the private data to be processed on Newport nodes are not equal, Stannis either uses more portion of the public data on the node with the smaller private database or duplicate the private data to maximize the rate of image per second. Stannis can run any network that is based on Keras or Tensorflow. Examples of such are VGG, ResNet, Inception, MobileNetV2, and DenseNet. Fig.4 shows how the Stannis distributes and process data on system.
Distributing the training process means the central model receives updates less frequently that can cause accuracy loss. However, in , Goyal et al. show that this accuracy degradation can be avoided to a good extent by applying two strategies: a) a linear scaling up of the learning rate based on the number of passed epochs, and b) a warm-up strategy that uses lower learning rates at the start of training, which helps overcoming early optimization difficulties .
V Experimental Results
There are two scenarios for using Stannis in action. The first one is to run it on a server rack in a datacenter where the user data has been transferred to storage systems using end-to-end encryption to guarantee the privacy aspects. The second case considers Newport as a stand-alone system, capable of doing edge computing in an IoT environment. Examples of such cases are the autonomous driving cars where lots of sensor data is created every second that can be processed fully or partially in the car. For now, we stick to the first case and explore the second one in our future work with the addition of the Federated learning techniques.
To evaluate Stannis in the real world, we assembled an AIC 2U-FB201-LX server with an 8-core, 16-thread Intel® Xeon® Silver 4108 CPU with 32GB of DRAM and 24 Newport CSDs, each with 32 TB flash memory, giving us a storage server with a total of 24*32= 768 TB capacity (see Fig. 5). We used an expanded TinyImageNet database  for neural network training and took 72000 images as public data and 12000 images as private distributed over all Newport CSDs. We chose MobileNetV2 with 3.47 million parameters and 56 million multiply-and-accumulate (MAC) operations as our main neural network. To compare the speedup on different neural networks, we then ran the test for NASNet, InceptionV3, and SqueezeNet. The only concern in choosing a network and the batch size is the available DRAM on the systems. Large batch size on big networks can saturate the DRAM and thus stall the entire training process. The 6 GB DRAM on Newport proved to be enough for most of the test cases. The solution for cases with DRAM saturation was to choose a smaller batch size. Since the processing speed converges to a certain number after a certain batch size, this reduction in batch size would not affect the processing speed. For instance, the images-per-second speed for MobilenetV2 on Newport is about 3 images per second for all batch sizes greater than 16. This happens as a result of the full utilization of the processing engine when the task becomes computation intensive rather than communication intensive.
|MobilenetV2||3.47M||7.16M||56M||315 / 25||31.05 / 3.08|
|NASNet||5.3M||10.74M||564M||325 / 15||47.31 / 2.80|
|InceptionV3||23.83M||47.82M||5.72G||370 / 16||30.80 / 1.85|
|squeezenet||1.25M||2.46M||861M||850 / 50||219.0 / 16.3|
V-a Performance Analysis
Stannis started with running the tuning algorithm for the MobileNetV2, which gave us the optimal batch sizes of 25 and 315 and the processing speed of 3 and 32.3 images per second for the Newport and the host, respectively. It then determined the number of input data corresponding to each node to avoid stall on individual cards and then ran the benchmarks on a mixture of host and number of cards. We ran the same test for NASNet, InceptionV3, and SqueezeNet as well. The optimized batch sizes and corresponding speed for different networks are shown in Table LABEL:tab:table_1. Fig. 6 shows the processing speed in the form of an image per second for different networks. There is a slowdown in all processing nodes in a distributed model, which is due to partial stalls when nodes are synchronizing the parameters. The slow down pace fades out, and the individual node’s performance converges to a certain speed after the number of nodes grows beyond 5-6 devices. The relative speedup for different networks is also shown in Fig. 7. The result shows that the smaller networks get better speedup than the larger ones; this is because the more parameters to update, the more time it takes to synchronize the nodes. Another important parameter is the number of MACs. As Fig. 7 shows, SqueezeNet with 2.46M Flops gets less speedup compared to MobileNetV2 with 7.16M Flops, because it has 15x more MACs.
V-B Power Analysis
The closest product to Newport that we could find on the market was the 11 TB Micron MTFDHAL11TATCW-1AR1ZAB SSD. We evaluate the power consumption of an AIC server with 24 11-TB Micron SSDs against the same system with 24 32-TB Newport CSDs. We used an off-the-shelf power meter to measure the input power to the entire server rack. Table II shows the energy per processed image for MobileNetV2. We skip the results for other networks since the power measurements were almost identical. Measurements show up to 69% saving in energy per processed image and 2x FLOPS per watt with 24 Newport CSDs compared to host alone.
|Number of CSDs||0||4||8||16||24|
|Energy per image (J)||13.10||8.30||6.84||5.05||4.02|
|Energy saving (%)||0%||37%||48%||62%||69%|
|FLOPS per watt||5.87M||7.05M||8.18M||10.37M||12.26M|
V-C Accuracy Analysis
To assess the accuracy degradation due to distribution of the training process, we ran a training session with 416,000 images on a single node and on six nodes to compare their accuracy and loss. The results show the loss increased from 1.1859 on a single node to 1.1907 on six nodes, or a 0.5% increase, and the same accuracy of 0.31 on both setups. Note that what is relevant here is not the absolute accuracy but the fact that our heterogeneous distributed setup yields the same accuracy as the centralized one.
This paper introduces a new in-storage processing solution to accelerate deep neural network training. It achieves up to 2.7x speedup while reducing energy by up to 69%. Our solution consists of novel hardware named Newport, a computational storage device (CSD) that augments processing power to the systems and eliminates the need to move data to the host for processing. The hardware runs workload that is distributed by our software framework named Stannis, which extends Horovod to take advantage of heterogeneous processing units by optimizing the batch size and controlling the portion of data allotted to each processing node. Not only is this beneficial to both power and performance but more importantly it protects privacy by never moving private data out of the storage system. We plan to extend Stannis to a general framework that can parallelize any application on heterogeneous architectures and also develop a federated learning framework for training on mobile devices.
-  (2016) Tensorflow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation OSDI, pp. 265–283. Cited by: §II-B.
-  (2014) A reliable effective terascale linear learning system. Journal of Machine Learning Research 15, pp. 1111–1133. Cited by: §II-B.
-  Becoming a data-driven CEO. External Links: Cited by: §I.
-  Bringing HPC techniques to deep learning. External Links: Cited by: §II-B.
Accurate, large minibatch SGD: training ImageNet in 1 hour. CoRR abs/1706.02677. External Links: Cited by: §IV.
-  (2016) Biscuit: a framework for near-data processing of big data workloads. In ACM SIGARCH Computer Architecture News, Vol. 44, pp. 153–165. Cited by: §I, §II-D.
-  (2015) Deep residual learning for image recognition. CoRR abs/1512.03385. External Links: Cited by: §IV.
-  (2011) Cloud hooks: security and privacy issues in cloud computing. In 2011 44th Hawaii International Conference on System Sciences, pp. 1–10. Cited by: §I.
-  (2019) Age of information in multiple sensing. External Links: Cited by: §I.
-  (2015) Bluedbm: an appliance for big data analytics. In 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), pp. 1–13. Cited by: §I, §II-D.
-  (2016) In-storage processing of database scans and joins. Information Sciences 327, pp. 183–200. Cited by: §II-D.
-  (2016) In-storage computing for Hadoop MapReduce framework: challenges and possibilities. IEEE Transactions on Computers. Cited by: §II-D.
-  (2017) Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, Cited by: §II-B.
-  (2018) Scalable multi-queue data transfer scheme for fpga-based multi-accelerators. In 36th International Conference on Computer Design (ICCD), pp. 374–380. Cited by: §II-C.
-  (2018) Horovod: fast and easy distributed deep learning in TensorFlow. CoRR abs/1802.05799. External Links: Cited by: §II-B.
-  Tiny imagenet visual recognition challenge. External Links: Cited by: §V.
-  (2019) Accelerating hpc applications using computational storage devices. In 21st International Conference on High Performance Computing and Communications, pp. 1878–1885. Cited by: §I.
-  (2018) CompStor: an in-storage computation platform for scalable distributed processing. In International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 1260–1267. Cited by: §II-D.
-  (2019) Catalina: in-storage processing acceleration for scalable big data analytics. In 27th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 430–437. Cited by: §II-D.
-  (2019) Computational storage: an efficient and scalable platform for big data and hpc applications. Journal of Big Data 6 (1), pp. 100. Cited by: §I.
-  (2019-07) Benchmarking TPU, GPU, and CPU Platforms for Deep Learning. arXiv e-prints, pp. arXiv:1907.10701. External Links: Cited by: §II-C.