Database Meets Deep Learning: Challenges and Opportunities

06/21/2019 · Wei Wang, et al. · University of Michigan · National University of Singapore · Zhejiang University · Beijing Institute of Technology

Deep learning has recently become very popular on account of its incredible success in many complex data-driven applications, such as image classification and speech recognition. The database community has worked on data-driven applications for many years, and therefore should be playing a lead role in supporting this new wave. However, databases and deep learning are different in terms of both techniques and applications. In this paper, we discuss research problems at the intersection of the two fields. In particular, we discuss possible improvements for deep learning systems from a database perspective, and analyze database applications that may benefit from deep learning techniques.


1 Introduction

In recent years, we have witnessed the success of numerous data-driven, machine-learning-based applications. This has prompted the database community to investigate the opportunities for integrating machine learning techniques into the design of database systems and applications [65]. A branch of machine learning, called deep learning [44, 29], has attracted worldwide interest in recent years due to its excellent performance in multiple areas, including speech recognition, image classification and natural language processing (NLP). The foundation of deep learning was established about twenty years ago in the form of neural networks. Its recent resurgence is mainly fueled by three factors: immense computing power, which reduces the time to train and deploy new models, e.g., Graphics Processing Units (GPUs) enable training systems to run much faster than those of the 1990s; massive (labeled) training datasets (e.g., ImageNet), which allow a more comprehensive knowledge of the domain to be acquired; and new deep learning models (e.g., AlexNet [42]), which improve the ability to capture data regularities.

Database researchers have been working on system optimization and large-scale data-driven applications since the 1970s, which relate closely to the first two factors. It is therefore natural to examine the relationship between databases and deep learning. First, are there any insights that the database community can offer to deep learning? It has been shown that larger training datasets and deeper model structures improve the accuracy of deep learning models. The side effect, however, is that training becomes more costly. Approaches have been proposed to accelerate training from both the system perspective [9, 32, 15, 62, 1] and the theory perspective [90, 20]. Since the database community has rich experience with system optimization, it is opportune to discuss the applicability of database techniques to optimizing deep learning systems. For example, distributed computing and memory management are key database technologies that are also central to deep learning.

Second, are there any deep learning techniques that can be adapted for database problems? Deep learning emerged from the machine learning and computer vision communities and has since been successfully applied to other domains, such as NLP [21]. However, few studies have applied deep learning techniques to traditional database problems. This is partially because traditional database problems, like indexing, transaction management and storage management, involve little uncertainty, whereas deep learning excels at prediction over uncertain events. Nevertheless, some database problems, such as knowledge fusion [16] and crowdsourcing [61], are probabilistic in nature, and it is possible to apply deep learning techniques to them. We discuss specific problems, such as query interfaces and knowledge fusion, in this paper.

An earlier version [83] of this paper appeared in SIGMOD Record. In this version, we extend it to cover recent developments in the field and add references to recent work.

The rest of this paper is organized as follows: Section 2 provides background information about deep learning models and training algorithms; Section 3 discusses the application of database techniques to optimizing deep learning systems; Section 4 describes database research problems where deep learning techniques may help to improve performance; and Section 5 presents some final thoughts.

2 Background

Deep learning refers to a set of machine learning models that try to learn high-level abstractions (or representations) of raw data through multiple feature transformation layers. Large training datasets and deep, complex structures [5] enhance the ability of deep learning models to learn effective representations for tasks of interest. According to the type of connection between layers [44], deep learning models fall into three popular categories, namely feedforward models (direct connections), energy models (undirected connections) and recurrent neural networks (recurrent connections). Feedforward models, including the Convolutional Neural Network (CNN), propagate input features through each layer to extract high-level features; the CNN is the state-of-the-art model for many computer vision tasks. Energy models, including the Deep Belief Network (DBN), are typically used to pre-train other models, e.g., feedforward models. The Recurrent Neural Network (RNN) is widely used for modeling sequential data; machine translation and language modeling are popular applications of RNNs.

Figure 1: Stochastic Gradient Descent.

Before a deep learning model is deployed, the parameters of its transformation layers must be trained. Training is a numeric optimization procedure that finds parameter values minimizing the discrepancy (loss function) between the expected output and the real output. Stochastic Gradient Descent (SGD) is the most widely used training algorithm. As shown in Figure 1, SGD initializes the parameters with random values, and then iteratively refines them based on the gradients computed with respect to the loss function. There are three commonly used algorithms for gradient computation, corresponding to the three model categories above: Back-Propagation (BP), Contrastive Divergence (CD) and Back-Propagation Through Time (BPTT). By regarding the layers of a neural network as nodes of a graph, these algorithms can be evaluated by traversing the graph in certain orders. For instance, the BP algorithm is illustrated in Figure 2, where a simple feedforward model is trained by traversing along the solid arrows to compute the data (features) of each layer, and along the dashed arrows to compute the gradient of each layer and of each parameter.

Figure 2: Data flow of Back-Propagation.
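To make the training procedure concrete, the following minimal sketch (plain NumPy, with a hypothetical two-layer network and synthetic data) illustrates one SGD loop as described above: a forward pass computes the loss, a backward pass computes the gradients of each layer and parameter, and the parameters are then updated against the gradient direction.

    # Minimal SGD sketch for a two-layer feedforward network (illustrative only).
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(256, 10))                 # synthetic inputs
    y = rng.normal(size=(256, 1))                  # synthetic targets

    W1 = rng.normal(scale=0.1, size=(10, 32))      # random initialization
    W2 = rng.normal(scale=0.1, size=(32, 1))
    lr, batch = 0.01, 32

    for step in range(100):
        idx = rng.choice(len(X), batch, replace=False)   # sample a mini-batch
        xb, yb = X[idx], y[idx]

        # Forward pass (solid arrows in Figure 2): compute features layer by layer.
        h = np.maximum(xb @ W1, 0.0)               # hidden layer with ReLU
        pred = h @ W2
        loss = ((pred - yb) ** 2).mean()           # squared-error loss

        # Backward pass (dashed arrows): propagate gradients to each parameter.
        g_pred = 2.0 * (pred - yb) / batch
        g_W2 = h.T @ g_pred
        g_h = g_pred @ W2.T
        g_h[h <= 0.0] = 0.0                        # ReLU gradient
        g_W1 = xb.T @ g_h

        # SGD update: move parameters against the gradient direction.
        W1 -= lr * g_W1
        W2 -= lr * g_W2

In practice, frameworks such as TensorFlow and PyTorch derive the backward pass automatically from the dataflow graph rather than requiring hand-written gradients.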

3 Databases to Deep Learning

In this section, we discuss the optimization techniques used in deep learning systems, and research opportunities from the perspective of databases.

3.1 Stand-alone Training

Currently, the most effective approach for improving the training speed of deep learning models is to use Nvidia GPUs with the cuDNN library. Researchers are also working on other hardware, e.g., FPGAs [43]. Besides exploiting advances in hardware technology, operation scheduling and memory management are two important components to consider.

3.1.1 Operation Scheduling

Training algorithms of deep learning models typically involve expensive linear algebra operations, as shown in Figure 3, where the matrices involved can be very large. Operation scheduling first detects the data dependencies among operations and then places operations without dependencies onto parallel executors, e.g., CUDA streams and CPU threads. Taking the operations in Figure 3 as an example, operations with no dependencies between them can be computed in parallel. The first step can be done statically based on the dataflow graph or dynamically [7] by analyzing the order of read and write operations. Databases face similar problems when optimizing transaction execution [89] and query plans, and those solutions should be considered for deep learning systems. For instance, databases use cost models to estimate query plans. For deep learning, we may likewise create a cost model to find an optimal operation placement strategy for the second step of operation scheduling, given fixed computing resources, including executors and memory.
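As a rough illustration of the two scheduling steps, the sketch below builds a tiny, hypothetical dataflow graph, detects which operations have all of their inputs ready, and places the independent ones onto a pool of CPU threads; a real system would target CUDA streams and use a cost model for placement.

    # Sketch of graph-based operation scheduling: ops whose inputs are ready are
    # dispatched together to parallel executors (CPU threads here; CUDA streams
    # in a real system). Operation names and shapes are hypothetical.
    from concurrent.futures import ThreadPoolExecutor
    import numpy as np

    ops = {                                     # op -> (inputs, function)
        "h1": ([], lambda: np.random.rand(512, 512)),
        "h2": ([], lambda: np.random.rand(512, 512)),
        "g1": (["h1"], lambda h1: h1 @ h1.T),   # g1 and g2 have no mutual
        "g2": (["h2"], lambda h2: h2 @ h2.T),   # dependency -> run in parallel
        "out": (["g1", "g2"], lambda g1, g2: g1 + g2),
    }

    done = {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        while len(done) < len(ops):
            # Step 1: find ops whose dependencies are satisfied and not yet run.
            ready = [name for name, (deps, _) in ops.items()
                     if name not in done and all(d in done for d in deps)]
            # Step 2: place the ready, mutually independent ops onto executors.
            futures = {name: pool.submit(ops[name][1], *[done[d] for d in ops[name][0]])
                       for name in ready}
            for name, fut in futures.items():
                done[name] = fut.result()

    print(done["out"].shape)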

Recent developments: Mirhoseini et al. [58] propose to optimize the placement of operations on heterogeneous hardware devices (e.g., CPUs and GPUs) using reinforcement learning. Jia et al. [35, 33] go beyond simple operation parallelism and consider parallelism along multiple dimensions together, including data samples and channels, operations, attributes and parameters. In addition, operation substitution has been studied in [34], which replaces the original operations with new ones that retain the semantics but lead to better overall efficiency; operation fusion is one example. A cost-based search algorithm is introduced to find optimized computation graphs. Similar fusion techniques are applied in open-source libraries, including TensorFlow [1] and PyTorch [63].

Figure 3: Sample operations from a deep learning model.

3.1.2 Memory Management

Deep learning models are becoming larger and larger, and already occupy a huge amount of memory. For example, the VGG model [68] cannot be trained on ordinary GPU cards due to memory constraints. Many approaches have been proposed to reduce memory consumption. Shorter data representations, e.g., 16-bit floating point [12], are now supported by CUDA. Memory sharing is another effective way of saving memory [7]: in Figure 3, for instance, an operation whose input and output share the same variable, and thus the same memory space, is called an 'in-place' operation. Recently, two approaches were proposed to trade computation time for memory. Swapping memory between GPU and CPU addresses the mismatch between small GPU memory and large model size by manually swapping variables out to CPU memory and swapping them back when needed [13]. Another approach drops some variables to free memory and recomputes them when necessary, based on the static dataflow graph [8].
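The drop-and-recompute idea can be sketched with PyTorch's checkpointing utility (assuming a recent PyTorch version; the layer sizes are arbitrary): activations inside the wrapped block are freed after the forward pass and recomputed during the backward pass.

    # Sketch of trading computation for memory via recomputation (gradient
    # checkpointing): intermediate activations of `block` are dropped in the
    # forward pass and recomputed during the backward pass.
    import torch
    import torch.nn as nn
    from torch.utils.checkpoint import checkpoint

    block = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(),
                          nn.Linear(1024, 1024), nn.ReLU())
    head = nn.Linear(1024, 10)

    x = torch.randn(64, 1024, requires_grad=True)
    h = checkpoint(block, x, use_reentrant=False)  # activations inside `block` are not stored
    loss = head(h).sum()
    loss.backward()                                # `block` is re-run here to obtain gradients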

Memory management is a hot topic in the database community, with a significant amount of research on in-memory databases [72, 91], including locality, paging and cache optimization. In particular, paging strategies could be useful for deciding when and which variables to swap. In addition, failure recovery in databases resembles the drop-and-recompute approach, hence database logging techniques could be considered: if all operations (and their execution times) are logged, runtime analysis can be performed without the static dataflow graph. Other techniques, including garbage collection and memory pools, would also be useful for deep learning systems, especially for GPU memory management.

Recent developments: The recomputation technique has been adopted in PyTorch [64]. Wang et al. [78] combine recomputation and swapping to optimize the memory usage of convolutional neural networks. Zhang et al. [93] propose a smart memory pool and an automatic swapping strategy for deep neural networks to replace the manual swapping in [13, 78]. Cai et al. [4] propose to slice the model to reduce memory and computational resource consumption.

3.2 Distributed Training

Distributed training is a natural solution for accelerating the training of deep learning models. The parameter server architecture [15] is typically used, in which workers compute parameter gradients and servers update the parameter values after receiving the gradients from workers. There are two basic parallelism schemes for distributed training, namely data parallelism and model parallelism. In data parallelism, each worker is assigned a data partition and a model replica; in model parallelism, each worker is assigned a partition of the model and the whole dataset. The database community has a long history of work on distributed environments, ranging from parallel databases [46] and peer-to-peer systems [76] to cloud computing [48]. We discuss some database-relevant research problems arising from distributed training in the following paragraphs.
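The following toy simulation (NumPy, with a synthetic linear model) sketches synchronous data-parallel training under a parameter server: each simulated worker computes a gradient on its own data partition, and the server averages the gradients to update the shared parameters.

    # Toy simulation of data-parallel training with a parameter server: each
    # worker holds a model replica and a data partition, computes a gradient,
    # and the server aggregates the gradients to update the global parameters.
    import numpy as np

    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(1000, 5)), rng.normal(size=(1000,))
    partitions = np.array_split(np.arange(len(X)), 4)     # one partition per worker
    w = np.zeros(5)                                       # parameters on the server
    lr = 0.1

    def worker_gradient(w_replica, idx):
        xb, yb = X[idx], y[idx]
        err = xb @ w_replica - yb                         # linear model, squared loss
        return 2.0 * xb.T @ err / len(idx)

    for step in range(50):
        # Server broadcasts w; workers return gradients on their partitions.
        grads = [worker_gradient(w.copy(), idx) for idx in partitions]
        w -= lr * np.mean(grads, axis=0)                  # synchronous update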

3.2.1 Communication and Synchronization

Given that deep learning models have large sets of parameters, the communication overhead between workers and servers is likely to be the bottleneck of a training system, especially when the workers run on GPUs, which shorten the computation time. In addition, for large clusters, the synchronization between workers also contributes to the overhead. Consequently, it is important to investigate efficient communication protocols, both for single-node multi-GPU training and for training over a large cluster. Possible research directions include: a) compressing the parameters and gradients before transfer [66]; b) organizing servers in an optimized topology to reduce the communication burden on each node, e.g., a tree structure [25] or an AllReduce structure [86] (all-to-all connection); and c) using more efficient networking hardware such as RDMA [9].
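As a sketch of direction a), the snippet below implements a simple top-k gradient sparsification in NumPy: only the k largest-magnitude gradient entries and their indices are transferred, and the receiver reconstructs a sparse gradient. Production schemes typically also accumulate the untransmitted residual locally.

    # Sketch of top-k gradient sparsification: only the k largest-magnitude
    # gradient entries (values plus indices) are sent, which can substantially
    # reduce traffic for large models.
    import numpy as np

    def compress_topk(grad, k):
        idx = np.argpartition(np.abs(grad), -k)[-k:]      # indices of the k largest entries
        return idx, grad[idx]

    def decompress(idx, values, size):
        sparse = np.zeros(size)
        sparse[idx] = values
        return sparse

    grad = np.random.default_rng(0).normal(size=1_000_000)
    idx, values = compress_topk(grad, k=10_000)           # ~1% of entries transferred
    restored = decompress(idx, values, grad.size)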

Recent developments: Gradient compression has been shown to be effective in reducing communication cost [37, 23, 73, 50]. Besides, Jiang et al. [36] propose a decentralized SGD algorithm that has a convergence rate similar to mini-batch SGD but eliminates the parameter server, thereby removing the traffic bottleneck at the server. A more popular way to resolve the bottleneck and improve communication efficiency is to replace the parameter server architecture with all-reduce communication. Various all-reduce implementations [22, 31, 57] have been used to train large-scale networks over thousands of GPUs.
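A minimal sketch of all-reduce-based synchronous SGD using torch.distributed is shown below; it assumes the process group has already been initialized (e.g., via torchrun) and that each process holds one data shard.

    # Minimal sketch of synchronous data-parallel SGD with all-reduce instead of
    # a parameter server.
    import torch
    import torch.distributed as dist

    def allreduce_step(model, loss, lr):
        loss.backward()
        world = dist.get_world_size()
        for p in model.parameters():
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # sum gradients across workers
            p.grad /= world                                # average
        with torch.no_grad():
            for p in model.parameters():
                p -= lr * p.grad
            model.zero_grad()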

                            SINGA    Caffe [32]   MXNet [7]   TensorFlow [1]   Theano [2]   Torch [11]
    1. operation scheduling   x        -            ✓           ✓                -            x
    2. memory management      d+a+p    i            d+s         p                p            -
    3. parallelism            d+m      d            d+m         d+m              -            d+m
    4. consistency            s+a+h    s/a          s+a+h       s+a+h            -            s
-: unknown. 1. ✓: available; x: not available. 2. d: dynamic; a: swap; p: memory pool; i: in-place operation; s: static.
3. d: data parallelism; m: model parallelism. 4. s: synchronous; a: asynchronous; h: hybrid.
Table 1: Summary of optimization techniques used in existing systems as of July 18, 2016.

3.2.2 Concurrency and Consistency

Concurrency and consistency are traditional research problems in databases, and they also matter for the distributed training of deep learning models. Currently, both declarative programming (e.g., Theano and TensorFlow) and imperative programming (e.g., Caffe and SINGA) have been adopted in existing systems for implementing concurrency. Most deep learning systems use threads and locks directly; other concurrency mechanisms, such as the actor model (good at failure recovery), co-routines and communicating sequential processes, have not been explored.

Sequential consistency (from synchronous training) and eventual consistency (from asynchronous training) are typically used for distributed deep learning, and both have scalability issues [80]. Recently, there have been studies on training convex models (deep learning models are non-linear and non-convex) using a value-bounded consistency model [85]. Researchers are starting to investigate the influence of consistency models on distributed training [25, 26, 6]. Much research remains to be done on how to provide flexible consistency models for distributed training, and on how each consistency model affects the scalability of the system, including its communication overhead.
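One intermediate point between the two extremes is a staleness-bounded scheme, sketched below: a fast worker is blocked whenever it runs more than a fixed number of iterations ahead of the slowest worker. The class name and interface are hypothetical.

    # Toy sketch of a staleness-bounded ("stale synchronous") consistency check,
    # a middle ground between sequential and eventual consistency.
    import threading

    class StalenessController:
        def __init__(self, num_workers, staleness):
            self.clocks = [0] * num_workers
            self.staleness = staleness
            self.cond = threading.Condition()

        def tick(self, worker_id):
            """Called by a worker at the end of each iteration."""
            with self.cond:
                self.clocks[worker_id] += 1
                self.cond.notify_all()
                # Block while this worker is too far ahead of the slowest one.
                while self.clocks[worker_id] - min(self.clocks) > self.staleness:
                    self.cond.wait()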

Recent developments: In recent papers and benchmark testing [10], synchronous training is preferred to asynchronous training because the former is more stable in terms of convergence. With warm-up, layer-wise adaptive rate scaling of the learning rate [22], label smoothing, etc., synchronous SGD can scale to over 2000 GPUs [88, 31] without sacrificing accuracy. Typically, these approaches increase the batch size gradually from a few thousand to tens of thousands. FlexPS [28] is a system that supports such multi-stage training schemes.

3.2.3 Fault Tolerance

Database systems achieve good durability via logging (e.g., command logging) and checkpointing. Current deep learning systems recover training from crashes mainly through checkpoint files [1]; however, frequent checkpointing incurs considerable overhead. In contrast to database systems, which enforce strict consistency in transactions, the SGD algorithm used for training deep learning models can tolerate a certain degree of inconsistency, so logging is not a must. How to exploit the properties of SGD and the system architecture to implement fault tolerance efficiently is an interesting problem. Considering that distributed training replicates the model status, it is possible to recover from a replica instead of from checkpoint files. Robust frameworks (or concurrency models), like the actor model, could be adopted to implement this kind of failure recovery.
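A minimal sketch of checkpoint-based recovery with PyTorch is given below; the file name and checkpoint interval are arbitrary choices, and a replica-based scheme would replace the file read with a state transfer from a healthy worker.

    # Sketch of checkpoint-based recovery: the model and optimizer state are
    # saved periodically, and training resumes from the latest checkpoint
    # after a crash.
    import os
    import torch

    def save_checkpoint(model, optimizer, epoch, path="ckpt.pt"):
        torch.save({"epoch": epoch,
                    "model": model.state_dict(),
                    "optimizer": optimizer.state_dict()}, path)

    def restore_checkpoint(model, optimizer, path="ckpt.pt"):
        if not os.path.exists(path):
            return 0                                  # start from scratch
        state = torch.load(path, map_location="cpu")
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        return state["epoch"] + 1                     # resume from the next epoch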

3.3 Optimization Techniques in Existing Systems

A summary of existing systems in terms of the above-mentioned optimization aspects is given in Table 1. Many researchers have performed ad hoc optimizations on top of Caffe, including memory swapping and communication optimization; however, the official version is not well optimized. Similarly, Torch itself provides limited support for distributed training. MXNet has optimizations for both memory and operation scheduling. Theano is typically used for stand-alone training. TensorFlow has the potential for the aforementioned static optimization based on its dataflow graph.

We are optimizing the Apache incubator SINGA system [62] starting from version 1.0. For stand-alone training, cost models are being explored for runtime operation scheduling, and memory optimizations including dropping, swapping and garbage collection with a memory pool will be implemented. OpenCL is supported so that SINGA can run on a wide range of hardware, including GPUs, FPGAs and ARM. For distributed training, SINGA (v0.3) already provides flexible parallelism and consistency, hence the focus will be on optimizing communication and fault tolerance, which are missing in almost all systems.

4 Deep Learning to Databases

Deep learning applications, such as computer vision and NLP, may appear very different from database applications. However, the core idea of deep learning, known as feature (or representation) learning, is applicable to a wide range of applications. Intuitively, once we have effective representations for entities, e.g., images, words, table rows or columns, we can compute entity similarity, perform clustering, train prediction models, retrieve data across different modalities [82, 81], and so on. Below, we highlight a few deep learning models that could be adapted for database applications.

4.1 Query Interface

Natural language query interfaces have been attempted for decades [47] because of their great desirability, particularly for non-expert database users. However, it is challenging for database systems to interpret (or understand) the semantics of natural language queries. Recently, deep learning models have achieved state-of-the-art performance on NLP tasks [21], and RNNs have been shown to be able to learn structured outputs [71, 74]. One solution is to apply RNN models to parse natural language queries into SQL queries, and then refine the generated SQL using existing database approaches. The challenge is that a large number of (labeled) training samples is required to train such a model. One possible solution is to train a baseline model with a small dataset and gradually refine it with user feedback: users could help correct the generated SQL queries, and this feedback essentially serves as labeled data for subsequent training.
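The sketch below outlines such an RNN-based parser as a plain sequence-to-sequence skeleton in PyTorch; the vocabulary sizes, token ids and class name are hypothetical, and a practical system would add attention, schema encoding and the database-side refinement discussed above.

    # Hypothetical skeleton of an RNN-based natural-language-to-SQL model: a GRU
    # encoder reads the question tokens and a GRU decoder emits SQL tokens.
    import torch
    import torch.nn as nn

    class Text2SQL(nn.Module):
        def __init__(self, nl_vocab, sql_vocab, dim=256):
            super().__init__()
            self.nl_emb = nn.Embedding(nl_vocab, dim)
            self.sql_emb = nn.Embedding(sql_vocab, dim)
            self.encoder = nn.GRU(dim, dim, batch_first=True)
            self.decoder = nn.GRU(dim, dim, batch_first=True)
            self.out = nn.Linear(dim, sql_vocab)

        def forward(self, question_ids, sql_ids):
            _, state = self.encoder(self.nl_emb(question_ids))        # encode the question
            dec_out, _ = self.decoder(self.sql_emb(sql_ids), state)   # teacher forcing
            return self.out(dec_out)                                  # scores over SQL tokens

    model = Text2SQL(nl_vocab=10_000, sql_vocab=2_000)
    question = torch.randint(0, 10_000, (4, 12))   # toy batch of question token ids
    sql = torch.randint(0, 2_000, (4, 20))         # toy batch of SQL token ids
    logits = model(question, sql)                  # shape (4, 20, 2000)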

Recent developments: Multiple annotated datasets consisting of text query and SQL query pairs have been created using templates [94, 3] and user feedback [30]. The proposed solutions [94, 30, 18] generally extend the sequence-to-sequence model to encode the text query and then generate the SQL query via the decoder, exploiting domain knowledge such as the SQL grammar.

4.2 Query Plans

Query plan optimization is a traditional database problem. Most current database systems use complex heuristics and cost models to generate query plans. According to [27], each query plan of a parametric SQL query template has an optimality region: as long as the parameters of the SQL query fall within this region, the optimal query plan does not change. In other words, query plans are insensitive to small variations of the input parameters. Therefore, we can train a query planner that learns from a set of SQL query and optimal plan pairs to generate (similar) plans for new (similar) queries. More concretely, we can learn an RNN model that accepts the SQL query elements and meta-data (such as buffer size and primary keys) as input and generates a tree structure [74] representing the query plan. Reinforcement learning (as in AlphaGo [67]) could also be incorporated to train the model online, using the execution time and memory footprint as the reward. Note that approaches based purely on deep learning models may not be very effective. First, the query plan is generated probabilistically and is therefore likely to contain grammatical errors. Second, the training dataset may not be comprehensive enough to cover all query patterns, e.g., some predicates could be missing from the training data. To address these problems, a better approach would be to combine database solutions with deep learning, e.g., using heuristics to check and correct grammatical errors.

Recent developments: Recently, there has been an increasing trend of applying deep learning techniques to optimize database systems, including query optimization by deciding the join order [41, 54], query performance prediction [55], cardinality estimation for join queries [51, 70], database configuration tuning [92], etc. Deep reinforcement learning is the key model supporting these optimizations [56]. Kraska et al. [40] propose a learned index that uses neural networks to map a key to the location of its record. SageDB [39] goes further by providing a vision of a database system that can optimize itself towards a specific application: it exploits the data and workload distribution of the application to learn models for data access and query plan optimization.
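The learned-index idea of [40] can be sketched in a few lines: fit a model from key to position over a sorted array, record the maximum prediction error, and answer lookups with a bounded local search around the prediction. A simple linear model stands in here for the neural networks used in [40].

    # Sketch of a learned index: predict the position of a key in a sorted
    # array with a fitted model, then correct with a bounded search.
    import numpy as np

    keys = np.sort(np.random.default_rng(0).uniform(0, 1e6, size=100_000))
    positions = np.arange(len(keys))

    a, b = np.polyfit(keys, positions, deg=1)             # key -> position model
    err = np.abs(a * keys + b - positions).max()          # maximum prediction error

    def lookup(key):
        guess = int(a * key + b)
        lo = max(0, guess - int(err))                     # search window bounded by err
        hi = min(len(keys), guess + int(err) + 1)
        i = lo + np.searchsorted(keys[lo:hi], key)        # local binary search
        return i if i < len(keys) and keys[i] == key else None

    print(lookup(keys[1234]))                             # -> 1234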

4.3 Crowdsourcing and Knowledge Bases

Many crowdsourcing [87] and knowledge base [16] applications involve entity extraction, disambiguation and fusion problems, where an entity could be a row of a database table, a node in a graph, etc. With the advances of deep learning models in NLP [21], it is opportune to consider deep learning for these problems. In particular, we can learn representations for entities and then perform entity relationship reasoning [69] and similarity computation over the learned representations.

Recent developments: DeepER [17]

exploit LSTM models to learn tuple embeddings for entity resolution. Deep learning models like CNN and attention modelling have been applied for concept linking 

[19, 14]. Mudgal et al.[60] evaluate four different deep learning models for entity matching problems.
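A hedged sketch of this embedding-based approach (loosely following the DeepER idea, with toy token ids and a hypothetical encoder class) is shown below: each tuple's text is encoded by an LSTM into a vector, and the cosine similarity between vectors serves as a matching signal.

    # Sketch of embedding-based entity resolution: encode each tuple with an
    # LSTM and compare the resulting embeddings.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TupleEncoder(nn.Module):
        def __init__(self, vocab, dim=128):
            super().__init__()
            self.emb = nn.Embedding(vocab, dim)
            self.lstm = nn.LSTM(dim, dim, batch_first=True)

        def forward(self, token_ids):
            _, (h, _) = self.lstm(self.emb(token_ids))
            return h[-1]                                  # last hidden state as the tuple embedding

    encoder = TupleEncoder(vocab=5000)
    t1 = torch.randint(0, 5000, (1, 16))                  # token ids of tuple 1
    t2 = torch.randint(0, 5000, (1, 16))                  # token ids of tuple 2
    score = F.cosine_similarity(encoder(t1), encoder(t2)) # high score -> likely the same entity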

4.4 Spatial and Temporal Data

Spatial and temporal data are common data types in database systems [24], and are commonly used for trend analysis, progression modeling and predictive analytics. Spatial data is typically processed by mapping moving objects into rectangular blocks. If we regard each block as a pixel of an image, then deep learning models, e.g., CNNs, could be exploited to extract the spatial locality between nearby blocks. For instance, given real-time location data (e.g., GPS data) of moving objects, we could learn a CNN model that captures the density relationships among nearby areas to predict traffic congestion at a future time point. When temporal data is modeled as features over a time matrix, deep learning models, e.g., RNNs, can be designed to model time dependencies and predict events at a future time point. A particular example is disease progression modeling [59] based on historical medical records, where doctors want to estimate the onset of a certain severity of a known disease. In fact, most healthcare data is time-series data, and thus deep learning can make a great contribution to healthcare data analysis [45, 52].
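The grid-as-image idea can be sketched with a small convolutional model in PyTorch (the grid size, number of time steps and channel widths are arbitrary): the input stacks recent density maps of the city grid, and the output is a predicted congestion score per block.

    # Sketch of treating a spatial grid as an image: a small CNN reads a grid of
    # object densities (one channel per recent time step) and predicts a
    # congestion score for every block.
    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Conv2d(in_channels=4, out_channels=16, kernel_size=3, padding=1),  # 4 past time steps
        nn.ReLU(),
        nn.Conv2d(16, 1, kernel_size=3, padding=1),       # one congestion score per block
    )

    density = torch.rand(8, 4, 32, 32)   # batch of 32x32 city grids
    pred = model(density)                # (8, 1, 32, 32) predicted congestion map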

Recent developments: Deep learning models including CNN and RNN have been applied in various spatial-temporal problems, including traffic flow prediction [53, 38], travel time estimation [77, 49, 84], driver behavior analysis [79], geospatial aggregation querying [75], etc.

5 Conclusions

In this paper, we have discussed databases and deep learning. Databases have many techniques for optimizing system performance, while deep learning is good at learning effective representations for data-driven applications. We note that these two "different" areas share some common techniques for improving system performance, such as memory optimization and parallelism. We have discussed possible improvements to deep learning systems using database techniques, and research problems in applying deep learning techniques to database applications. To make database systems more autonomic, with the ability to learn and optimize, and to support complex analytics and predictions beyond data aggregation, we foresee a seamless integration of ML/DL and database technologies. With the roll-out of 5G mobile networks, we also foresee databases, training and inference being distributed to edge devices, which will lead to further integration and adaptation of these technologies. Let us not miss the opportunity to contribute to the challenges ahead!

6 Acknowledgement

We would like to thank Divesh Srivastava for his valuable comments. This work was supported by the National Research Foundation, Prime Minister’s Office, Singapore, under its Competitive Research Programme (CRP Award No. NRF-CRP8-2011-08), and Singapore Ministry of Education Academic Research Fund Tier 3 under MOE’s official grant number MOE2017-T3-1-007. Meihui Zhang was supported by China Thousand Talents Program for Young Professionals (3070011181811).

References