Data Curation with Deep Learning [Vision]: Towards Self Driving Data Curation

03/04/2018 ∙ by Saravanan Thirumuruganathan, et al. ∙ 0

Past. Data curation - the process of discovering, integrating, and cleaning data - is one of the oldest data management problems. Unfortunately, it is still the most time-consuming and least enjoyable work of data scientists. So far, successful data curation stories are mainly ad-hoc solutions that are either domain-specific (for example, ETL rules) or task-specific (for example, entity resolution). Present. The power of current data curation solutions is not keeping up with the ever-changing data ecosystem in terms of volume, velocity, variety and veracity, mainly because of the high human cost, rather than machine cost, of providing the ad-hoc solutions mentioned above. Meanwhile, deep learning has achieved remarkable successes in areas such as image recognition, natural language processing, and speech recognition. This is largely due to its ability to understand features that are neither domain-specific nor task-specific. Future. Data curation solutions need to keep pace with the fast-changing data ecosystem, where the main hope is to devise domain-agnostic and task-agnostic solutions. To this end, we start a new research project, called AutoDC, to unleash the potential of deep learning towards self-driving data curation. We discuss how different deep learning concepts can be adapted and extended to solve various data curation problems. We showcase some low-hanging fruit from the early encounters between deep learning and data curation happening in AutoDC. We believe that the directions pointed out by this work will not only drive AutoDC towards democratizing data curation, but also serve as a cornerstone for researchers and practitioners to move to a new realm of data curation solutions.




1 Introduction

Data Curation

Data curation (DC) – the process of discovering, integrating and cleaning datasets for a specific analytics task, as shown in Figure 1 – is critical for any enterprise to extract the real business value from its data; feeding flawed, redundant or incomplete data as input will produce nonsense output or “garbage” (a.k.a. garbage in, garbage out). Due to its importance, there have been many commercial solutions [50, 28] and academic efforts [9, 31, 19, 52, 49, 6, 46] for all aspects of DC, including data discovery [10, 21], data integration [14] and data cleaning [18].

Unfortunately, data curation remains the most expensive task for humans – an oft-cited statistic is that data scientists spend 80% of their time curating their data [10]. The main reason for such a high human cost is that most DC solutions are ad-hoc, i.e., domain-specific and task-specific. Existing DC solutions cannot keep up with the ever-changing data ecosystem in terms of volume, velocity, variety and veracity (a.k.a. the four V’s of big data) – practitioners are desperate for revolutionary solutions that democratize DC, a journey that is deemed extraordinary.

Figure 1: A Data Curation Pipeline

Deep Learning

Deep learning (DL) is an emerging paradigm within the area of machine learning (ML) that has achieved incredible success in diverse areas, such as computer vision, natural language processing, genomics, and many more. The trifecta of big data, better algorithms, and faster processing power has resulted in DL achieving super human performance in many areas. Due to these successes, there has been extensive new research seeking to apply DL to other areas both inside and outside of computer science.

Data Curation and Deep Learning: Opportunities

Deep learning brings unique opportunities to leapfrog DC, which our community must seize upon. In this vision paper, we lay down a roadmap on how to make this a reality.

➼ Feature (or Representation) Learning. For either ML- or non-ML based DC solutions, domain experts are heavily involved in either feature engineering or data (or representation) understanding so as to define data quality rules or make sophisticated deductive inference.

Opportunity. One of the fundamental ideas in DL is representation learning where appropriate features for a given task are automatically learned instead of being manually crafted. Developing DC specific representation learning algorithms could dramatically alleviate many of the frustrations domain experts face when solving DC problems.

➼ Deep Learning Architectures for Data Curation. So far, no DL architecture exists that is cognizant of the characteristics of DC tasks, such as representations for tuples or columns, integrity constraints, and so on. Consequently, existing DL architectures may work well for some, but definitely not all, DC tasks.


In retrospect, computer vision blossomed because of convolutional neural networks (CNNs), and natural language processing (NLP) and speech recognition flourished because of recurrent neural networks (RNNs). Likely, new DL architectures will emerge for various DC tasks.

➼ Holistic Knowledge: An Oracle. Current DC solutions are often designed for a specific domain or task, which does not utilize other global/enterprise level information. Even when one can obtain human feedback for a task, it is often filtered through the lens of the specific problem being worked on. A key issue is that knowledge in an enterprise is often scattered across various modalities including structured data (e.g., relations), unstructured data (e.g., documents), graphical data (e.g., enterprise networks), and even videos, audios, and images. An ideal DC expert should have knowledge gleaned from a wide variety of data sources and previous tasks, which can be used to solve any DC task.

Opportunities. (i) There has been extensive work on cross-modal representation learning in DL that could be used to partially “approximate” the knowledge of a domain expert. (ii) A good domain expert seamlessly “transfers” knowledge gained from one task to another. That is, designing features for one task makes the process of designing features for a related task slightly easier. (iii) In order to avoid providing training examples for each task from scratch, the DL-inspired technique of unsupervised pre-training can be used to train a baseline model that encodes the global information, followed by supervised re-training for a specific problem.

Figure 2: Deep Learning Architectures

Contributions and Roadmap

We started a multi-year research project, called AutoDC, to exploit the huge potential offered by DL to tackle key DC challenges. Below, we outline the main elements of our vision to unleash a DL-driven revolution in DC.

  • [Deep Learning Fundamentals.] There is a Cambrian explosion of DL research with thousands of papers being published every year that can drown an unguided researcher. To this end, we provide a crash course on DL through a series of carefully selected concepts that are relevant to DC (Section 2).

  • [Deep Learning for Data Curation.] We describe several interesting research directions for extending existing DL techniques, such as learning distributed representations for databases, designing DC-aware DL architectures, learning holistic knowledge, and automatically orchestrating a DC workflow (Section 3).

  • [Deep Learning for Synthesizing Data Curation Programs.] A further discussion is about automatically generating programs (a.k.a. program synthesis) using neural architectures, where the synthesized richer, larger, and more complex programs can solve more complicated DC problems, beyond the scope of current DC solutions (Section 4).

  • [Deep Learning Low Hanging Fruits.] We show simple adaptations of DL that can be used to solve some traditional DC tasks (Section 5).

  • [Deep Learning Tricks for Data Curation.] Because DL is still a relatively new concept, especially for database applications, there are a lot of concerns that we need to be cautious about – DL, at least in its current form, is not a silver bullet. Fortunately, there are anecdotal tricks for DL learned from other domains that can mitigate these concerns and are likely to be useful for DC (Section 6).

  • [Call to Arms.] We close this paper with a conclusion (Section 7).

The AutoDC research agenda represents a promising future of DC and DL research and practice. Our primary success metric is to reduce the time and cost of performing DC tasks, so as to benefit the end-users. Within AutoDC, we target end-to-end solutions for solving real-world problems. We should be cautious that, despite its stunning success, DL is still in its infancy and the theory of DL is still being developed. To a DC or DB researcher used to surveying a well-organized garden of conferences such as VLDB/SIGMOD/ICDE, the DL research landscape might look like the wild west. We argue that database conferences must provide a safe zone in which these DL explorations are conducted in a principled manner.

2 Deep Learning Fundamentals

This section is a crash course on the fundamental DL concepts (see [23, 7, 35] for more details) needed in the later sections. While these concepts are widely discussed in the DL literature, we include them here for completeness, so that even a general audience can understand our work. Of course, readers with a DL background can safely skip this section.

Deep learning is a subfield of ML that seeks to learn meaningful representations from data that could be used to solve the task at hand effectively. The most commonly used DL models are neural networks with many hidden layers. The key insight is that successive layers in this “deep” neural network can be used to learn increasingly useful representations of the data. Intuitively, the input layer often takes in the raw features. As the data is forwarded and transformed by each layer, more and more meaningful information (representation) is extracted. This incremental manner in which increasingly sophisticated representations are learned layer-by-layer is one of the defining characteristics of DL. The number of layers in a model is referred to as its depth while the number of parameters to be learned is the model’s capacity.

In the following, we will first introduce some DL architectures (Section 2.1), which are depicted in Figure 2 (footnote 1: the figure’s style is inspired by the Neural Network Zoo by Fjodor van Veen). We then discuss distributed representations (Section 2.2).

2.1 Deep Learning Architectures

✑ Neural Networks

Neural networks are a biologically inspired ML model that mimics the human brain. Figure 2(a) shows an example of a three-layer neural network, which contains an input layer (the leftmost layer), a hidden layer (the intermediate layer), and an output layer (the rightmost layer). In a nutshell, neural networks are simply layers stacked on top of each other, with the neurons in each layer not connected among themselves. Each layer can be thought of as performing a geometric transformation of its input such that the resulting output is more meaningful for the ML task. That is, the neural network performs a complex geometric transformation (where each layer operates on the output of the previous one) of the input data space to the output data space. The transformation is parameterized by the weights of each layer, which are learned by an algorithm based on the training data. Initially, the weights are often set to random values corresponding to a random – and probably useless – transformation. The learning algorithm evaluates how far the output of the network is from the training data through loss scores. It then systematically updates the parameters such that the loss score decreases. This process is often repeated many times to obtain a neural network whose output is as close as possible to the output from the training data.
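The learning loop described above can be sketched in a few lines. The following is a minimal illustration under our own assumptions, not a reference implementation: the toy data, the network size (two inputs, four hidden neurons, one output), the learning rate and the helper names `forward`, `mean_squared_loss` and `train` are all hypothetical choices made for this example.

```python
import math
import random

def forward(params, x):
    """Input layer -> hidden layer (tanh) -> output layer (linear)."""
    h = [math.tanh(w[0] * x[0] + w[1] * x[1] + b)
         for w, b in zip(params["w1"], params["b1"])]
    out = sum(w * hj for w, hj in zip(params["w2"], h)) + params["b2"]
    return h, out

def mean_squared_loss(params, data):
    """Loss score: how far the network output is from the training targets."""
    return sum((forward(params, x)[1] - y) ** 2 for x, y in data) / len(data)

def train(data, hidden=4, learning_rate=0.05, epochs=200, seed=0):
    rng = random.Random(seed)
    # Weights start at random values: a random, and probably useless, transformation.
    params = {
        "w1": [[rng.gauss(0, 0.5), rng.gauss(0, 0.5)] for _ in range(hidden)],
        "b1": [0.0] * hidden,
        "w2": [rng.gauss(0, 0.5) for _ in range(hidden)],
        "b2": 0.0,
    }
    losses = [mean_squared_loss(params, data)]
    for _ in range(epochs):
        for x, y in data:  # one small update per training example
            h, out = forward(params, x)
            d_out = 2.0 * (out - y)  # derivative of the squared error w.r.t. the output
            for j in range(hidden):
                # Backpropagate through the tanh of hidden neuron j.
                d_pre = d_out * params["w2"][j] * (1.0 - h[j] ** 2)
                params["w2"][j] -= learning_rate * d_out * h[j]
                params["w1"][j][0] -= learning_rate * d_pre * x[0]
                params["w1"][j][1] -= learning_rate * d_pre * x[1]
                params["b1"][j] -= learning_rate * d_pre
            params["b2"] -= learning_rate * d_out
        losses.append(mean_squared_loss(params, data))
    return params, losses

# Hypothetical toy data: the target is 1.0 when both inputs share a sign, else 0.0.
data_rng = random.Random(1)
points = [(data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)) for _ in range(64)]
data = [((x1, x2), 1.0 if x1 * x2 > 0 else 0.0) for x1, x2 in points]

params, losses = train(data)
print(f"loss before training: {losses[0]:.3f}, after: {losses[-1]:.3f}")
```

The list `losses` records the loss score before training and after every pass over the data, so one can check that the repeated updates move the network output closer to the training targets.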

✑ Fully-connected Neural Networks

The most common models used in DL are neural networks with many hidden layers (see Figure 2(b)) – a series of layers where each node in a given layer is connected to every node in the next layer. Fully-connected neural networks are also called feed-forward neural networks, since they feed information from the input to the output. They can learn relationships between any two input features or intermediate representations. This generality, however, comes at a cost: one has to learn many weights/parameters, which requires a lot of training data. For example, two fully connected hidden layers with 100 nodes in each layer will require learning 10,000 weights (100 × 100, not counting bias terms).
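The arithmetic behind this parameter count can be checked with a short sketch. The helper name `dense_parameter_count` and its `include_biases` flag are our own, introduced only for illustration.

```python
def dense_parameter_count(layer_sizes, include_biases=False):
    """Count learnable parameters in a fully connected network.

    layer_sizes lists the number of nodes per layer, from input to output.
    """
    # Every node in one layer connects to every node in the next layer.
    weights = sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
    # Optionally add one bias term per node in every layer after the first.
    biases = sum(layer_sizes[1:]) if include_biases else 0
    return weights + biases

print(dense_parameter_count([100, 100]))                       # 10000
print(dense_parameter_count([100, 100], include_biases=True))  # 10100
```

The count grows with the product of adjacent layer widths, which is why wide fully connected layers need so much training data.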

✑ Convolutional Neural Networks (CNNs)

CNNs are feed-forward neural networks (see Figure 2(c)). They differ from other neural networks in that the input is fed through convolutional layers, where neurons only connect to close neighbors (instead of all neurons connecting to all neurons). Also, convolutional layers tend to shrink as they become deeper (e.g., by easily divisible factors of the input). CNNs are widely used by the computer vision community. Because every image can essentially be represented as a matrix of pixel values, instead of learning global patterns between arbitrary features, this method focuses on spatially local patterns. If a layer can recognize a certain pattern, it can recognize it irrespective of its location in the image. Furthermore, CNNs also have the ability to learn spatial hierarchies such as nose → face → human. In other words, CNNs can learn increasingly complex (resp. abstract) representations from simpler (resp. granular) representations.
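The claim that a convolutional layer recognizes a local pattern irrespective of its location can be illustrated with a small sketch. It slides a single filter over an image (the cross-correlation that DL libraries usually call convolution). The 2×2 pattern, the hand-picked kernel and the helper names are hypothetical; in a real CNN the kernel weights are learned.

```python
def cross_correlate2d(image, kernel):
    """Slide `kernel` over `image` (valid positions only) and return the response map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            # Each output value depends only on a small neighborhood of the input.
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def argmax2d(grid):
    """Return (row, col) of the largest entry."""
    return max(
        ((r, c) for r in range(len(grid)) for c in range(len(grid[0]))),
        key=lambda rc: grid[rc[0]][rc[1]],
    )

def blank_image_with_patch(size, top, left, patch):
    """Create a size x size image of zeros with `patch` pasted at (top, left)."""
    image = [[0] * size for _ in range(size)]
    for i, row in enumerate(patch):
        for j, value in enumerate(row):
            image[top + i][left + j] = value
    return image

# Hypothetical 2x2 pattern and a hand-picked kernel that responds strongly to it.
pattern = [[1, 0], [0, 1]]
kernel = [[1, -1], [-1, 1]]

image_a = blank_image_with_patch(6, top=0, left=1, patch=pattern)
image_b = blank_image_with_patch(6, top=3, left=4, patch=pattern)

print(argmax2d(cross_correlate2d(image_a, kernel)))  # (0, 1)
print(argmax2d(cross_correlate2d(image_b, kernel)))  # (3, 4)
```

The same four kernel weights detect the pattern in both images, and the strongest response simply moves with the pattern.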

✑ Recurrent Neural Networks (RNNs)

RNNs are neural networks with recurrent connections (see Figure 2(d)). In contrast to the previous DL architectures that process the input in its entirety, an RNN processes it one step at a time (for example, given the two words “data curation”, it first handles “data” and then “curation”). Hence, neurons in an RNN are fed with information not just from the previous layer but also from themselves from the previous pass. Consequently, the order in which an input is fed to an RNN matters. RNNs are widely used in NLP and speech recognition. In practice, a simple RNN often has trouble learning long range dependencies. Long Short Term Memory (LSTM) is a popular variant of RNN that can be used when the input sequence exhibits long range dependencies. LSTM provides a mechanism to “remember” past information across multiple time steps through its memory.
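The step-by-step processing and the resulting order sensitivity can be sketched with a single recurrent neuron. This is an illustration only: the weights `w_in` and `w_rec` are hypothetical fixed values (not learned), and the two numeric inputs merely stand in for two words of a sequence.

```python
import math

def run_rnn(inputs, w_in=1.0, w_rec=0.5, h0=0.0):
    """Feed `inputs` through a one-neuron RNN and return the state after each step."""
    h = h0
    states = []
    for x in inputs:
        # The new state depends on the current input and on the neuron's own previous state.
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states

forward_order = run_rnn([1.0, -0.5])
reverse_order = run_rnn([-0.5, 1.0])

print(forward_order[-1], reverse_order[-1])
```

Although both runs see the same two inputs, the final states differ because each step carries over the state left behind by the previous one.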

✑ Auto Encoders (AE)

Autoencoders are a popular DL model for unsupervised learning of efficient representations and are commonly used in feature extraction (see Figure 2(e)). An autoencoder takes a $d$-dimensional input $x$ and maps it to a hidden representation $h$ in a $d'$-dimensional space using an encoder. This hidden representation can then be reconstructed to a $d$-dimensional output $\hat{x}$ using a decoder. Of course, for the autoencoder to learn useful representations, we require that $\hat{x}$ be as close to $x$ as possible and that $d' < d$ (otherwise, the network can simply “memorize” the input). This forces the autoencoder to learn a compressed representation of the input. Intuitively, an autoencoder seeks to reconstruct its own input: given an input $x$, it reconstructs it as accurately as possible by encoding the input to a low dimensional latent space (that can capture the major structures/patterns implicit in the data) and then decoding it back to the original space.

For the purpose of DC, three variants of autoencoders are especially relevant, as introduced below.

➼ Sparse Autoencoders (SAE) (Figure 2(f)) seek to learn a compressed representation that minimizes reconstruction error while also enforcing a sparsity constraint, which ensures that the autoencoder learns a sparse representation of the input. This type of network is typically used to extract many small features from a dataset. For example, $k$-sparse autoencoders identify the $k$ largest components in the hidden representation and zero out the rest.

➼ Denoising Autoencoders (DAE) (Figure 2(g)) seek to reconstruct the input data from a noisy/corrupted version of that same input. Typically, one first stochastically corrupts the input data and uses it as an input to the autoencoder. However, instead of reconstructing this noisy version, we force it to reconstruct the original version of the data. This autoencoder has the appealing property of learning distributed representations that are often robust to corruptions.

➼ Variational Autoencoders (VAE) (Figure 2(h)) seek to minimize the reconstruction loss while also imposing additional constraints over the latent space, e.g., an approximated probability distribution of the input samples. Specifically, VAEs seek a continuous, well structured latent space such that two vectors that are close in the latent space will be reconstructed to similar vectors in the input space. For example, VAEs for images ensure that any two vectors that are close to each other in the latent space will produce highly similar images.
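Two ingredients of the variants above are simple enough to sketch directly: the top-k selection used by sparse autoencoders of the k-sparse kind, and the construction of (corrupted input, clean target) training pairs for a denoising autoencoder. The function names, the masking-noise corruption and the drop probability are our own assumptions for illustration; other corruption schemes are possible.

```python
import random

def k_sparse(hidden, k):
    """Keep the k largest activations in `hidden` and zero out the rest."""
    if k <= 0:
        return [0.0] * len(hidden)
    # Indices of the k largest values (ties are broken by position).
    order = sorted(range(len(hidden)), key=lambda i: hidden[i], reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(hidden)]

def corrupt(values, drop_probability, rng):
    """Masking noise: independently replace each entry with 0.0 with the given probability."""
    return [0.0 if rng.random() < drop_probability else v for v in values]

def denoising_pairs(dataset, drop_probability=0.3, seed=0):
    """Return (noisy_input, clean_target) pairs for training a denoising autoencoder."""
    rng = random.Random(seed)
    # The network sees the corrupted vector but is trained to output the original one.
    return [(corrupt(x, drop_probability, rng), list(x)) for x in dataset]

print(k_sparse([0.1, 0.9, 0.3, 0.7, 0.2], k=2))  # [0.0, 0.9, 0.0, 0.7, 0.0]

clean = [[0.2, 0.8, 0.5, 0.9], [0.7, 0.1, 0.4, 0.6]]
for noisy, target in denoising_pairs(clean):
    print(noisy, "->", target)
```

In both cases the reconstruction loss is still measured against the original input; only the hidden representation (sparse case) or the network input (denoising case) is altered.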

✑ Generative Adversarial Networks (GANs)

GANs [24] are a class of DL models used for unsupervised representation learning. They work as twins: two neural networks working together – a generator and a discriminator (Figure 2(i)), where the former generates content that is then judged by the latter. Similar to the decoder of an autoencoder, the generator seeks to map an input vector from the latent space to a vector in the data space. For example, let us consider the domain of (synthetic) generation of photo-realistic images. The generator tries to map a vector in the latent space to a (synthetic) image. The discriminator takes an input (either real or synthetic) and predicts whether the image is original or produced by the generator. The goal of the generator is to increase the number of mistakes made by the discriminator – which implies that the data generated are so realistic that they can fool the discriminator. [7] gives a forger-dealer analogy where a forger seeks to create fake (say) Picasso paintings while the dealer tries to distinguish between the original paintings and the fakes. While the forger is initially bad, eventually she becomes increasingly competent at imitating Picasso’s style. Of course, the dealer also becomes increasingly competent at recognizing fakes. When these competing systems converge, the results are very realistic Picasso paintings that can fool all but the very best Picasso experts.

2.2 Distributed Representations

Figure 3: Local vs. Distributed Representations

Deep learning is based on learning data representations, and the concept of distributed representations (a.k.a. embeddings) is often central to DL. In contrast to local representations where each object is represented by a single representational element (or one neuron in neural networks), distributed representations [29] represent one object by many representational elements (or many neurons). Conversely, each neuron is associated with more than one represented object. That is, the neurons represent features of objects.

Let us illustrate this concept through an example. Consider Figure 3 with different distributed representations for words such as man, woman, boy, and so on (footnote 2: the example is borrowed from an external source). Local representations are one-hot (or “$1$-of-$N$”) encodings, where all except one of the values of the vectors are zeros (see Figure 3(a)). If we try to reduce the dimensionality of the encoding, we can use distributed representations (see Figure 3(b)) where each word is represented by several dimensions with decimal values between 0 and 1.

There are certain advantages of using distributed representations. (1) The representation power is exponential in the total dimensions available, while the representation power of local representations is only linear in the total dimensions. (2) It is easier to capture semantic similarity between two different objects, e.g., girl is closer to princess than to man in two dimensions (Femininity and Youth).
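The second advantage can be made concrete with a short sketch that compares distances under both kinds of representation. The numeric values below are made up for illustration and are not taken from Figure 3; only the two dimensions (Femininity and Youth) come from the text.

```python
import math

words = ["man", "girl", "princess"]

# Local (one-hot) representation: one dimension per word.
one_hot = {
    word: [1.0 if i == j else 0.0 for j in range(len(words))]
    for i, word in enumerate(words)
}

# Distributed representation over (Femininity, Youth); hypothetical values in [0, 1].
distributed = {
    "man": [0.05, 0.30],
    "girl": [0.90, 0.90],
    "princess": [0.95, 0.80],
}

for name, table in (("one-hot", one_hot), ("distributed", distributed)):
    to_princess = math.dist(table["girl"], table["princess"])
    to_man = math.dist(table["girl"], table["man"])
    print(f"{name}: girl-princess = {to_princess:.3f}, girl-man = {to_man:.3f}")
```

Under the one-hot encoding every pair of distinct words is equally far apart, so no similarity can be read off the vectors; under the distributed encoding, girl lies much closer to princess than to man.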

✑ Distributed Representations of Words

Distributed representations of words (a.k.a. word embeddings) seek to map individual words to a vector space, which helps learning algorithms to achieve better performance in NLP tasks by grouping similar words. For word embeddings, the dimensionality is often fixed and the representation is often dense. Moreover, the representation is distributed: each word is represented by multiple components of the representation (all the non-zero values of the vector) and each component can potentially participate in representing multiple words (i.e., the $i$-th component can take a non-zero value and can be used in the representation of many words).

Word embeddings are often learned from the data in such a way that semantically related words are close to each other. In other words, the geometric relationship between words often also encodes a semantic relationship between them. Furthermore, it is also possible that certain vector operations have a semantic interpretation. An oft-quoted example shows that by adding the vector corresponding to the concept of female to the distributed representation of king, we (approximately) obtain the distributed representation of queen. Popular word embeddings such as word2vec [22] often encode hundreds if not thousands of such meaningful transformation vectors, such as singular-to-plural, symptom-disease, country-capital, and so on.
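The king/queen example can be reproduced on a toy scale. The sketch below uses hypothetical two-dimensional embeddings over (royalty, femininity) and estimates the “female” direction as woman minus man; real embeddings are learned from data and have far more dimensions, so this only illustrates the vector arithmetic.

```python
import math

# Hypothetical 2-D embeddings over (royalty, femininity), hand-set for illustration.
embeddings = {
    "king": [0.95, 0.05],
    "queen": [0.95, 0.95],
    "man": [0.05, 0.05],
    "woman": [0.05, 0.95],
}

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def subtract(u, v):
    return [a - b for a, b in zip(u, v)]

def nearest(vector, exclude=()):
    """Return the vocabulary word whose embedding is closest to `vector`."""
    candidates = (word for word in embeddings if word not in exclude)
    return min(candidates, key=lambda word: math.dist(vector, embeddings[word]))

# One way to estimate the "female" transformation vector: woman - man.
female = subtract(embeddings["woman"], embeddings["man"])
result = nearest(add(embeddings["king"], female), exclude={"king"})
print(result)  # queen
```

Excluding the query word from the candidates follows common practice in analogy lookups, since the shifted vector often remains close to its starting point.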

✑ Distributed Representations of Graphs

Distributed representations of graphs (a.k.a. network embeddings) are a popular mechanism to effectively learn appropriate features for graphs (see [8] for a survey). Typically, one seeks to represent each node in the graph as a dense fixed length high dimensional vector. The optimization objective is often neighborhood-preserving whereby two nodes that are in the same neighborhood will have similar representations. Depending on the definition of neighborhood, one can design different representations for nodes in a graph.

✲ Compositions of Distributed Representations

Most distributed representation learning algorithms are designed for atomic units – words in NLP, or nodes in a graph. One can then use these to design distributed representations for more complex units. In the case of NLP, these could be sentences (e.g., sentence2vec), paragraphs or even documents (e.g., doc2vec) [33]. In the case of graphs, these could be subgraphs or entire graphs. Analogously, for databases, the atomic unit is a cell (an attribute value of a tuple). Assuming that we can learn the distributed representations of cells, by composition, we can design representations for tuples, columns, tables, or even an entire database.

Recently, there has been increasing interest in directly learning the distributed representations based on the task at hand (i.e., bypassing the composition phase). For example, one can directly learn distributed representations of subgraphs [41] and use them for community detection.

3 Deep Learning for Data Curation

The Promised Land: Given an ocean of data and an analytic task, the holy grail is that the entire data curation pipeline (see Figure 1) can be automatically orchestrated, and the discovered datasets can be nicely integrated and cleaned, ready for the analytics task at hand.

As discussed earlier in the introduction, existing DC solutions are mostly domain-specific, task-specific, and human intensive – the very aspects that DL has excelled at addressing for many other communities. In this section, we will explore several voyages towards the promised land.

  • A fundamental issue in many DC problems (e.g., entity resolution, data normalization and violations of integrity constraints) is to reason about syntactic and semantic relationships between (multiple) attribute values, tuples, columns, and tables, when the data is dirty (or noisy). We will discuss distributed representations for DC that may ease these pain points (Section 3.1).

  • With history as a guide, many DL architectures have been designed based on the characteristics of the target applications, e.g., CNN for computer vision and RNN for NLP. There is a pressing requirement for new DL architectures tailored for DC (Section 3.2).

  • It is common knowledge that domain experts are best at performing a DC task. Ideally, we need an augmented expert (or an Oracle) who has knowledge across domains, tasks, and data formats (Section 3.3).

  • Assume that each DC task can be effectively performed. A further question is how to automatically orchestrate multiple DC tasks (Section 3.4).

3.1 Distributed Representations for Data Curation

The “matching” process is a central concept in most, if not all, DC problems: for data discovery, one needs to find datasets that match a user specification (e.g., a keyword search or an SQL query); for schema mapping, one needs to capture matched columns; for entity resolution, one needs to discover matched entities (or tuples); for outlier detection, one needs to detect anomalous data that does not match a group of values; for data imputation, one needs to guess the missing value that should match its surrounding evidence. Traditional DC solutions use a “local” interpretation for each DC task mentioned above, which is usually ad-hoc and thus not extensible. Distributed representations, which can interpret one object in many (exponentially many) different ways, may provide great help for various DC problems.

✑ Distributed Representation of Cells

A cell, which is an attribute value of a tuple, is the smallest data element in a relational database. We first discuss distributed representations for cells.

An Adapted Approach from Word Embeddings. A plausible approach inspired by word2vec [40] treats this as equivalent to the problem of learning word embeddings, where we map words to dense high dimensional vectors such that semantically related words are close to each other (see more details in Section 2.2). A naive adaptation treats each tuple as a document where the values of its attributes correspond to words. This setting will ensure that if two attribute values occur together often in a similar context, then their distributed representations will be similar. For example, if a relation has attributes Country and Capital with many tuples containing (Brazil, Brasilia), then their distributed representations would be similar.

Limitations. There are several obvious limitations when simply applying the method of learning word embeddings for learning cell embeddings.

  1. Databases are typically well normalized to reduce redundancy, e.g., a database in third normal form (3NF) or Boyce-Codd normal form (BCNF) [3], which also minimizes the frequency with which two semantically related attribute values co-occur in the same tuples.

  2. The window size $w$, which is used to scan the document by considering a word and a window of $w$ words around it, may have a dramatic impact on learning cell embeddings. Consider a relation $R(A_1, \ldots, A_m)$, where $A_i$ is Country and $A_j$ is Capital, two attributes that are clearly relevant to each other. Since tuples are treated as documents and thus some order is assumed, if $j = i + 1$, a window size of 1 can capture their co-occurrences. However, if $j = i + k$ (where $k$ is a relatively large number, say 10), then any window size smaller than $k$ will miss them.

  3. A big difference between databases and documents is that databases have many data dependencies (or integrity constraints), within tables (e.g., functional dependencies, and conditional functional dependencies [19]) or across tables (e.g., foreign keys, and matching dependencies [20]). These data dependencies are important hints between semantically related cells that should be captured by learning distributed representations of cells.
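The naive adaptation and the window-size limitation can be sketched together. The code below treats each tuple as a “document”, counts which values fall within a window of each other, and shows that related values separated by many attributes are missed. The relations, the filler attribute values and the helper name `cooccurrences` are hypothetical.

```python
from collections import Counter

def cooccurrences(tuples, window):
    """Count value pairs that appear within `window` positions of each other in a tuple."""
    counts = Counter()
    for row in tuples:
        for i, left in enumerate(row):
            # Only values at most `window` positions to the right are paired with `left`.
            for right in row[i + 1 : i + 1 + window]:
                counts[(left, right)] += 1
    return counts

# Hypothetical relation in which Country and Capital are adjacent attributes.
adjacent = [("Brazil", "Brasilia", "BRL"), ("France", "Paris", "EUR")]

# The same facts, but with 9 unrelated attributes in between, so that
# Capital sits 10 positions after Country.
filler = tuple(f"x{n}" for n in range(9))
far_apart = [("Brazil", *filler, "Brasilia"), ("France", *filler, "Paris")]

print(cooccurrences(adjacent, window=1)[("Brazil", "Brasilia")])    # 1
print(cooccurrences(far_apart, window=5)[("Brazil", "Brasilia")])   # 0
print(cooccurrences(far_apart, window=10)[("Brazil", "Brasilia")])  # 1
```

Since the attribute order in a relation is arbitrary, whether a semantically related pair is ever observed depends on an accident of schema layout, which is one motivation for the graph-based model discussed next.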

A More Natural (Sophisticated) Model for DC. In order to capture the relationships (e.g., integrity constraints) between cells, a more natural way is to treat each relation as a heterogeneous network. More specifically, each relation is modeled as a graph $G = (V, E)$, where each node $u \in V$ is a unique attribute value, and each edge $(u, v) \in E$ represents one of multiple relationships, such as “$u$ and $v$ co-occur in one tuple” or “there is a functional dependency from the attribute of $u$ to the attribute of $v$”. The edges could be either directed or undirected, and may carry different labels and weights. Intuitively, using this enriched model might provide a more meaningful distributed representation that is cognizant of both content and constraints.

A sample table and our proposed graph representation of the table are shown in Figure 4. There are four distinct Employee ID values (nodes), three distinct Employee Name values, two distinct Department ID values, and three distinct Department Name values. There are two types of edges: undirected edges indicating values appearing in the same tuple, e.g., 0001 and John Doe, and directed edges for functional dependencies, e.g., Employee ID 0001 implies Department ID 1.

Figure 4: A Heterogeneous Graph of A Table
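A minimal sketch of how such a heterogeneous graph could be built from a table is given below. The rows are hypothetical, chosen only to match the value counts described above (the actual contents of Figure 4 are not reproduced here); the functional dependency Employee ID → Department ID and the choice to identify a node by an (attribute, value) pair are likewise our assumptions.

```python
from itertools import combinations

COLUMNS = ("Employee ID", "Employee Name", "Department ID", "Department Name")

# Hypothetical rows: 4 employee IDs, 3 names, 2 department IDs, 3 department names.
ROWS = [
    ("0001", "John Doe", "1", "Human Resources"),
    ("0002", "Jane Doe", "2", "Marketing"),
    ("0003", "John Smith", "1", "HR"),
    ("0004", "John Doe", "2", "Marketing"),
]

# Functional dependencies as (determinant column, dependent column); assumed here.
FDS = [("Employee ID", "Department ID")]

def build_graph(columns, rows, fds):
    """Return (nodes, undirected co-occurrence edges, directed FD edges)."""
    nodes = set()
    cooccur_edges = set()  # each edge is a frozenset of two nodes
    fd_edges = set()       # each edge is a (source node, target node) pair
    for row in rows:
        cells = list(zip(columns, row))  # a node is an (attribute, value) pair
        nodes.update(cells)
        for u, v in combinations(cells, 2):
            cooccur_edges.add(frozenset((u, v)))
        by_column = dict(cells)
        for lhs, rhs in fds:
            fd_edges.add(((lhs, by_column[lhs]), (rhs, by_column[rhs])))
    return nodes, cooccur_edges, fd_edges

nodes, cooccur_edges, fd_edges = build_graph(COLUMNS, ROWS, FDS)
print(len(nodes), "nodes,", len(cooccur_edges), "co-occurrence edges,", len(fd_edges), "FD edges")
```

A graph embedding method could then be run on this structure so that both co-occurrence and constraint edges influence the learned cell representations; edge weights and labels are omitted here for brevity.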

✍ Research Opportunities

  • Algorithms. New algorithms are needed for learning cell embeddings that take values, integrity constraints, and other metadata (e.g., a query workload) into consideration.

  • Global Distributed Representations. We need to learn distributed representations for the cells over the entire data ocean, not only on one relation. This requires “transferring” knowledge gained from other relations to design more meaningful distributed representations.

  • Data Enrichment. There are multiple ways to enrich a relation, e.g., by joining with other tables, which may result in an enriched table that is more suitable for learning representations.

  • Handling Rare Values. Rare values, such as primary keys, should be treated fairly, so as to have meaningful representations.

✑ Compositional Distributed Representations

The idea proposed in the previous section seeks to learn the distributed representations for cells. However, many data curation tasks are performed at a higher level of granularity. The next fundamental question is how to design an algorithm that composes the distributed representations of more abstract units from these atomic units. As an example, how can one design an algorithm for tuple embeddings assuming one already has a distributed representation for each of its attribute values? A common approach is to simply average the distributed representations of all its component values. Alternatively, one can use a more sophisticated approach such as LSTM that follows a data driven approach to compose the tuple’s distributed representation while taking into account long range dependencies inherent to a relation.
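The averaging approach mentioned above can be sketched as follows. The three-dimensional cell embeddings are hypothetical placeholders for vectors that would be learned in practice, and `average_embedding` is a name introduced only for this example.

```python
def average_embedding(cell_vectors):
    """Compose a tuple embedding as the component-wise mean of its cell embeddings."""
    if not cell_vectors:
        raise ValueError("need at least one cell embedding")
    dim = len(cell_vectors[0])
    return [sum(vec[d] for vec in cell_vectors) / len(cell_vectors) for d in range(dim)]

# Hypothetical 3-dimensional embeddings for the cells of one tuple.
cell_embeddings = {
    "0001": [0.2, 0.4, 0.0],
    "John Doe": [0.6, 0.0, 0.3],
    "1": [0.1, 0.2, 0.9],
}

tuple_vector = average_embedding(list(cell_embeddings.values()))
print(tuple_vector)
```

Averaging ignores attribute order and weights every cell equally, which is part of why a learned composition (for example with an LSTM) may be preferable when dependencies between attributes matter.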

✍ Research Opportunities

  • Tuple Embeddings (Tuple2Vec): Is there any other elegant approach to compose representations for tuples?

  • Column Embeddings (Column2Vec): Many tasks such as schema matching require the ability to represent an entire column (i.e., attribute) as a distributed representation. How can one adapt the ideas described above for columns?

  • Table Embeddings (Table2Vec) or Database Embeddings (Database2Vec): Tasks such as copy detection or data discovery (finding similar relations) might require representing an entire relation or even an entire database as a single vector.

  • Direct Representation Learning: There has been a series of proposals that seek to directly learn representations for composite objects, such as sentences and paragraphs [34], graphs [42], and subgraphs [41]. An interesting direction is to adapt these ideas to directly learn the representations for tuples, columns, tables and databases.

3.2 Deep Learning Architectures for Data Curation Tasks

Another fundamental question is to design DL architectures that are cognizant of the characteristics of DC tasks, which can then be leveraged for efficient learning in terms of training time and the amount of required training data.

Recall from Section 2 that a fully connected architecture (Figure 2(b)) is the most generic one. It does not make any domain specific assumptions and hence can be used for arbitrary domains. However, this generality comes at a cost: it requires a lot of training data. One can argue that a major part of DL’s success in computer vision and NLP is due to the design of specialized DL architectures, namely CNN and RNN respectively. For example, CNN leverages the fact that image recognition tasks often exhibit spatial hierarchies of patterns where complex/global patterns are often composed of simpler/local patterns (e.g., curves → mouth → face). Similarly, RNN assumes that language often requires sequential processing. In other words, it processes an input sequence one step at a time while maintaining an internal state. This is analogous to how humans process a sentence: one word at a time, while maintaining an internal state of the sentence’s meaning based on what we have read so far. When we complete the entire sentence, we use this internal state to comprehend its meaning. These assumptions allow one to design effective neural architectures for processing images and text.
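To make the idea of step-by-step processing with an internal state concrete, the following is a minimal sketch of a plain recurrent update. The weights are hand-picked toy values (an assumption for illustration; a real RNN learns them), and the function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def rnn_step(state: Vector, x: Vector, w_state: Matrix, w_input: Matrix) -> Vector:
    """One recurrent step: new_state = tanh(W_state * state + W_input * x)."""
    new_state = []
    for i in range(len(state)):
        total = sum(w_state[i][j] * state[j] for j in range(len(state)))
        total += sum(w_input[i][k] * x[k] for k in range(len(x)))
        new_state.append(math.tanh(total))
    return new_state


def encode_sequence(inputs: List[Vector], w_state: Matrix, w_input: Matrix) -> Vector:
    """Read the inputs one at a time, carrying the internal state forward."""
    state = [0.0] * len(w_state)
    for x in inputs:
        state = rnn_step(state, x, w_state, w_input)
    return state


# Toy weights chosen by hand for illustration only.
W_STATE = [[0.5, 0.0], [0.0, 0.5]]
W_INPUT = [[1.0, 0.0], [0.0, 1.0]]
a, b = [1.0, 0.0], [0.0, 1.0]

print(encode_sequence([a, b], W_STATE, W_INPUT))
print(encode_sequence([b, a], W_STATE, W_INPUT))
```

The two calls receive the same inputs in a different order and end in different states, which is the order sensitivity that a simple average of the inputs would lose.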

✍ Research Opportunities

  • DC Specific DL Architectures. The desiderata for such architectures are to: handle long range dependencies across attributes (similar to LSTM/RNN); learn both local and global patterns; and leverage additional domain knowledge and integrity constraints.

  • Handling Heterogeneity. In both CNN and RNN, the input is homogeneous – images and text. However, a relation can contain a wide variety of data, such as categorical, ordinal, numerical, textual, and image. Moreover, integrity constraints may also be considered.

  • DL Architectures for Distributed Representations. One interesting research topic is to explore the impact of the new DC specific DL architectures on the design of distributed representations.

3.3 Holistic Knowledge for Data Curation

Enterprises often possess data across various modalities, such as structured data (e.g., relations), unstructured data (e.g., documents), graphical data (e.g., enterprise networks), and even videos, audios, and images. However, most of the prior work in DC treats the various modalities of data separately, e.g., keys for graphs [17] have nothing to do with keys for tables, even if the information stored in graphs and tables is related. Also, users need to be involved in each DC task (e.g., schema mapping and entity resolution), even for the same dataset, causing a lot of redundant work.

Limitations. From the perspective of knowledge learning and sharing, existing solutions have three major limitations.

  1. Modality Specific. Current DC solutions are designed for specific data modalities – the DC efforts paid for one data modality cannot be carried over to another data modality.

  2. Task Specific. Even for the same dataset, the knowledge learned from one DC task cannot be transferred to another task.

  3. Training from Scratch. Typically, there is no prior knowledge that can be effectively used when handling a new dataset on a new DC task.

Ideally, we want to have a holistic DC solution that is able to convert these diverse data and tasks into knowledge that can be fed into many downstream DC applications.

✍ Research Opportunities

  • Cross Modal Representation Learning. One possible way for solving limitation (1) is in the form of cross modal representation learning [30], where two entities that are similar in one or more modalities, e.g., occurring in the same relation, document, image, and so on, will have similar distributed representations.

  • Task Agnostic Representation Learning. An interesting direction to tackle limitation (2) is to use the enormous amount of data available to learn task agnostic distributed representations [53].

  • Pre-trained Deep Learning Models.

    In domains such as image recognition, there is a common tradition of training a DL model on a large dataset and then reusing it (after some tuning) for tasks on other smaller datasets. Often, one trains the model on a large, diverse and generic dataset such as ImageNet [12] (with almost 14M images over 20k categories). The spatial hierarchy learned by this network is often a proxy for modeling the visual world [7]. In other words, many of the features learned by a model trained on ImageNet, such as VGG16, prove useful for many other computer vision problems, even when the model is applied to a different dataset and even to different categories [7]. These pre-trained models can be used in two ways:

    1. feature extraction, where these models are used to extract generic features that are fed to a separate classifier for the task at hand; and

    2. fine-tuning, where one adjusts the abstract representations from the last few hidden layers of a pre-trained model to make them more relevant to a targeted task.

    Hence, a promising research avenue is the design of pre-trained DL models for DC, to ease the pain of limitation (3).

3.4 Data Curation Pipeline Orchestration

Recently, there has been increasing interest in automating and simplifying machine learning tasks. For example, Google AutoML [1] is a suite of ML products that allow developers with limited ML knowledge to train ML models for commonly used tasks such as image recognition, language translation, and classification.

✍ Research Opportunities

  • Data Curation as a Service. An appealing line of inquiry is to envision the desiderata for analogous DC as a service. That is, whether we can orchestrate a DC pipeline, where each component possibly uses some DL model, such that the input data is integrated and cleaned automatically for a user specified task.

4 Data Curation by Neural Program Synthesis

So far, our discussion has been limited to traditional DC problems with simple outputs. Oftentimes, DC tasks require more complicated output such as a program (for ETL scripts) or a graph (for data violations). However, DL models for structured output are still in their infancy. In what follows, we will discuss how to build DL models for more complex DC problems where the output is a program.

The area of program synthesis aims to automatically construct programs, in a programming or domain specific language (DSL), that are consistent with some specification, which is sometimes given formally but often through a few input-output examples. For example, given the input-output specification {(John Smith, J Smith), (Jane Doe, J Doe), ...}, Flashfill [27] can construct a program based on Excel’s macros that can perform a regular expression based string transformation. Program synthesis is often very challenging as it requires searching over the space of all possible programs in the programming language for those that satisfy the input-output examples. Recently, there has been extensive interest in applying DL for program synthesis. This often takes the form of neural program synthesis [5, 45], where a neural network is trained on input-output examples and generates a program. An alternate approach is neural program induction [32, 43], where the neural network produces outputs for new inputs by using a latent specification of the program without explicitly generating it. One can see that this area can be very promising for DC.
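The search problem described above can be illustrated with a brute-force enumerator over a tiny string DSL. This is a sketch under stated assumptions: the DSL (primitives `word(i)`, `initial(i)` and `space`) is hypothetical and far simpler than what Flashfill uses, and the exhaustive search stands in for the guided or neural search of the cited work.

```python
from itertools import product
from typing import Callable, Dict, List, Optional, Tuple

Example = Tuple[str, str]


def make_primitives(max_tokens: int = 2) -> Dict[str, Callable[[str], str]]:
    """Hypothetical toy DSL: each primitive maps the input string to a piece of output."""
    prims: Dict[str, Callable[[str], str]] = {}
    for i in range(max_tokens):
        prims[f"word({i})"] = lambda s, i=i: s.split()[i]
        prims[f"initial({i})"] = lambda s, i=i: s.split()[i][0]
    prims["space"] = lambda s: " "
    return prims


def synthesize(
    examples: List[Example], max_len: int = 3
) -> Optional[Tuple[Tuple[str, ...], Callable[[str], str]]]:
    """Return the first (shortest) concatenation of primitives consistent with all examples."""
    prims = make_primitives()
    for length in range(1, max_len + 1):
        for names in product(prims, repeat=length):
            def run(s: str, names: Tuple[str, ...] = names) -> str:
                return "".join(prims[n](s) for n in names)
            try:
                if all(run(inp) == out for inp, out in examples):
                    return names, run
            except IndexError:
                # The candidate refers to a token that some input does not have.
                continue
    return None


result = synthesize([("John Smith", "J Smith"), ("Jane Doe", "J Doe")])
if result is not None:
    names, program = result
    print(names)                   # ('initial(0)', 'space', 'word(1)')
    print(program("Alan Turing"))  # A Turing
```

Even in this toy DSL the number of candidates grows exponentially with program length, which hints at why the search is challenging for realistic languages.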

✍ Research Opportunities

  • Syntactic and Semantic Transformations. Flashfill [27] can be used for regular expression driven string transformation from input-output examples. How can one extend such a method to perform more sophisticated string transformations that are not constrained by regular expressions as a DSL? How can one further extend these to perform semantic transformations [2]? For instance, given the example pairs {(France, Paris), (Germany, Berlin), ...}, can one automatically learn that the latter is the capital city of the former?

  • Domain Specific Language for Data Curation. Program synthesis often searches for valid programs within the confines of a DSL, such as regular expressions. Is it possible to come up with a DSL that can encode common DC operations? This can dramatically simplify the coding and debugging of common data cleaning and wrangling operations.

  • Program Synthesis from ETL Scripts. Currently, program synthesis is restricted to simple DSLs such as regular expressions. Often, enterprises have a number of pre-written ETL scripts. How can program synthesis be used to generate such ETL scripts automatically? Consider an example where we wish to integrate two relations. If we give a set of input-output tuples, is it possible to identify a series of ETL operations that can generate this virtual relation automatically?

  • Interactive and Preference-driven Program Synthesis. In a number of cases, the input-output examples are provided in an interactive manner by the domain expert. How can one learn the intrinsic preferences of the domain expert and use them to identify the appropriate program for the given input-output examples? As an example, consider the entity consolidation problem [11], where a group of tuples that refer to the same underlying entity are consolidated by selecting the appropriate value for each attribute. Given conflicting values “John Smith” and “J Smith” for the attribute Name, the domain expert might prefer the former to the latter. Can one use program synthesis to identify the preferences of the domain expert so as to automatically take them into account for other conflicting tuples?

5 Deep Learning Low Hanging Fruits

In this section, we will first give some simple adaptations of DL for data discovery (Section 5.1) and entity resolution (Section 5.2). We will then discuss efforts from other domains whose contributions can be easily adapted for DC tasks (Section 5.3).

5.1 Data Discovery

Data discovery is one of the fundamental problems in DC. Large enterprises typically possess hundreds or thousands of databases and relations. Data required for analytic tasks is often scattered across multiple databases depending on who collected the data. This makes the process of finding relevant data for a particular analytic task very challenging. Usually, there is no domain expert who has complete knowledge about the entire data repository. This results in a non-optimal scenario where the data analyst patches together data from her prior knowledge or limited searches - thereby leaving out potentially useful and relevant data.

As an example of applying word embeddings for data discovery, Raul et al. [21] show how to discover semantic links between the different data units, which are materialized in the enterprise knowledge graph (EKG). An EKG is a graph structure whose nodes are data elements such as tables, attributes and reference data such as ontologies and mapping tables, and whose edges represent different relationships between nodes. These links assist in data discovery by linking tables to each other, to facilitate navigating the schemas, and by relating data to external data sources such as ontologies and dictionaries, to help explain the schema meaning. A key component of this work is a semantic matcher based on word embeddings. It introduces the new concept of coherent groups to tackle the issues of multi-word phrases and out-of-vocabulary terms. The key idea is that a group of words is similar to another group of words if the average similarity in the embeddings between all pairs of words is high. In particular, this approach was able to surface links that were previously unknown to the analysts, e.g., isoform (a type of protein) with Protein, and Pcr (polymerase chain reaction, a technique to amplify a segment of DNA) with assay. It also helped discard spurious results obtained from other syntactical and structural matchers, e.g., the link between biopsy site and site_components, because the words biopsy and components do not often appear together, i.e., they are not semantically related. More examples of how useful our method is for data discovery, including its deployment with a pharmaceutical company, can be found in the paper.
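The key idea behind coherent groups, as stated above, is an average of pairwise embedding similarities between two groups of words. A minimal sketch follows. The embedding values are invented toy numbers, cosine similarity is an assumed choice of similarity measure, and skipping out-of-vocabulary words is a simplification made here rather than a detail confirmed by [21].

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def group_similarity(group_a: List[str], group_b: List[str], emb: Dict[str, Vector]) -> float:
    """Average similarity over all cross-group word pairs that have embeddings."""
    pairs = [(a, b) for a in group_a for b in group_b if a in emb and b in emb]
    if not pairs:
        return 0.0
    return sum(cosine(emb[a], emb[b]) for a, b in pairs) / len(pairs)


# Toy embeddings (invented values for illustration only).
toy_emb: Dict[str, Vector] = {
    "isoform": [0.9, 0.1, 0.0],
    "protein": [1.0, 0.0, 0.0],
    "biopsy": [0.0, 1.0, 0.0],
    "site": [0.0, 0.0, 1.0],
    "components": [0.1, 0.0, 0.2],
}

strong = group_similarity(["isoform"], ["protein"], toy_emb)
weak = group_similarity(["biopsy", "site"], ["site", "components"], toy_emb)
print(strong > weak)  # True for these toy values
```

With these toy values, the shared word site alone is not enough to make the two multi-word names look related, because the biopsy and components pairs pull the average down.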

Alternatively, we are trying to use some of the recent advances in neural information retrieval [54] to tackle discovery in the context of DC. At its core, information retrieval involves two key steps: (a) generating good representations for the query and documents and (b) finding relevance between the query and documents. DL has been applied to both steps. We can represent appropriate data units (tuples, columns or relations) using either general word embeddings [40] or enterprise specific ones learned by some of the unsupervised representation learning approaches described previously.

Once we have both of the above steps, we can envision a Google-style search engine where the analyst can enter a textual description of the data that she is looking for. We convert the query to a distributed representation, and use a DL/non-DL model to find the relevance between the query and other result units. Once a set of candidate results (such as relations) is obtained, one can use the EKG to also simultaneously return other datasets that are thematically related. The analyst can then use these to make an informed decision.

5.2 Data Integration

Figure 5: DeepER Framework

Entity resolution (ER), a key problem in data integration, determines if two tuples refer to the same underlying real-world entity [16]. Our recent work [15], DeepER, applies DL techniques for ER. We show the overall architecture of DeepER in Figure 5. DeepER pushes the boundaries of existing ER solutions in terms of accuracy, efficiency, and ease-of-use. For accuracy, we use sophisticated composition methods, namely uni- and bi-directional recurrent neural networks (RNNs) with long short term memory (LSTM) hidden units, to convert each tuple to a distributed representation, which can in turn be used to effectively capture similarities between tuples. For efficiency, we propose a locality sensitive hashing (LSH) based approach that uses distributed representations of tuples; it takes all attributes of a tuple into consideration and produces much smaller blocks, compared with traditional methods that consider only a few attributes. For ease-of-use, our approach requires much less human labeled data and does not need feature engineering, compared with traditional machine learning based approaches, which require handcrafted features and similarity functions along with their associated thresholds. When run on multiple benchmark datasets, DeepER achieves competitive results with minimal interaction with experts.
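To illustrate LSH-based blocking over tuple embeddings, here is a minimal sketch using random hyperplanes, a standard LSH family for cosine similarity. Whether DeepER uses this particular family is not stated in this section, so treat the scheme, the function names and the toy vectors as assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Vector = List[float]


def random_hyperplanes(num_planes: int, dim: int, seed: int = 0) -> List[Vector]:
    """Draw random hyperplane normals from a standard Gaussian."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)]


def lsh_signature(vector: Vector, planes: List[Vector]) -> Tuple[int, ...]:
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(p * x for p, x in zip(plane, vector)) >= 0 else 0
        for plane in planes
    )


def build_blocks(
    tuple_vectors: Dict[str, Vector], planes: List[Vector]
) -> Dict[Tuple[int, ...], List[str]]:
    """Group tuple ids by signature; only tuples in the same block are compared."""
    blocks: Dict[Tuple[int, ...], List[str]] = defaultdict(list)
    for tuple_id, vec in tuple_vectors.items():
        blocks[lsh_signature(vec, planes)].append(tuple_id)
    return dict(blocks)


# Toy tuple embeddings (invented): t1 and t2 point in the same direction, t3 is opposite.
vectors = {"t1": [1.0, 2.0, 3.0], "t2": [2.0, 4.0, 6.0], "t3": [-1.0, -2.0, -3.0]}
planes = random_hyperplanes(num_planes=8, dim=3)
blocks = build_blocks(vectors, planes)
print(blocks)
```

Because the signature is computed from the whole tuple embedding, every attribute influences the block assignment. More hyperplanes give smaller blocks at the risk of separating true matches.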

5.3 Data Cleaning

Data Transformation is a fundamental problem in DC where one needs to transform a given column such that all its values are in a canonical form. Examples include “First Initial. Last Name” for names, the nnn-nnn-nnnn format for phone numbers, etc. Currently, this process is often manual and relies on a domain expert to write appropriate scripts to ensure that all the inputs are in the standard form. Recently, there has been some work [13] that seeks to apply DL for this purpose. Given a diverse set of inputs and a standard output, it can synthesize an appropriate DL model to produce a standardized output.

Data Imputation and Repair. Real world data often has a substantial amount of missing or incorrect data. Identifying appropriate values for missing data is a challenging research problem. A number of imputation techniques used in other areas (such as mean/median) are not applicable to DC tasks. Additional DC specific challenges include heterogeneity of data types, missingness patterns, proportions and distributions. Prior models such as missing-at-random or missing-completely-at-random are often too simple for DC scenarios. Recently, there has been a series of promising works [25] on using DL models such as denoising autoencoders for multiple imputation (where more than one cell has missing values). The key idea is to fill in missing values with plausible predicted values depending on local (tuple level) and global (relation level) patterns.

This approach could also be used to handle inconsistent data for knowledge fusion. Information integration in the presence of multiple, possibly conflicting data is very challenging. Reconciling conflicting information to identify true values so that they can be stored in a clean repository is not a solved problem. Researchers have proposed both non-probabilistic (such as minimal FD repair) and probabilistic (data fusion) based solutions. DL research has recently started scratching the surface of knowledge fusion. One could simply treat this as a missing value problem. For example, in the presence of conflicting values, treat them as missing and identify the most plausible predicted values.

Alexa/Siri/Cortana for Data Curation. Recently, there has been an explosion of interest in intelligent personal assistants such as Amazon Alexa. One can marry DL’s success in interactive spoken dialog with querying structured databases. Recent work such as EchoQuery [37] provides hands-free, dialogue based querying of databases with a personalized vocabulary. For example, it can automatically learn the terms used by domain experts to refer to certain concepts that might be different from schema elements (i.e., table and column names). Extending it to arbitrary data curation tasks is an intriguing research problem.

6 Deep Learning Tips/Tricks for Data Curation Practitioners

The moral of our vision about DC with DL is clear: DL can simplify and optimize DC tasks. So far in this paper, we have highlighted many of the potential upsides of applying DL for DC tasks. However, we would also like to caution that DL, at least in its current form, is not a silver bullet. Recently, there has been a series of papers, such as [39], that study some of the challenges and limitations of DL.

In this section, we shift gears and discuss some teething issues of DL as applied to DC tasks. We will first discuss some common DL concerns that are relevant to DC (Section 6.1), and then describe some tips for handling the lack of training data, a major obstacle to using DL for DC (Section 6.2).

6.1 Common Deep Learning Concerns

In the past five years, DL has been achieving substantial successes in many areas such as computer vision, NLP, speech recognition, and many more. As pointed out in the previous sections, DC has a number of characteristics that are quite different from prior domains where DL succeeded. We anticipate that applying DL to challenging real-world DC tasks can be messy. We now describe some of the concerns that could be raised by a pragmatic DC practitioner or a skeptical researcher.

✑ Deep Learning is Data Hungry

Indeed, this is one of the major issues in the adoption of DL for DC. Most classical DL architectures have many hidden layers with millions of parameters to learn, which requires a large amount of training data. Unfortunately, the amount of training data for DC is often small. The good news is that there exists a wide variety of techniques and tricks in DL’s arsenal that can help address this issue. These include:

  • leveraging the vast amount of unlabeled data to learn key parameters;

  • transferring knowledge gained from a related task/domain;

  • designing DC-aware DL architectures that require less data to train; and

  • obtaining cheap but possibly inaccurate labels instead of expensive but accurate labels.

We will discuss these ideas in more detail in Section 6.2.

✑ Deep Learning is Computing Heavy

Another common concern is that training DL models requires exorbitant computing resources. There are legions of tales on how training some DL models took days even on a large GPU cluster. In practice, training time often depends on the model complexity, such as the number of parameters to be learnt, and the size of training data.

We will describe some tricks later (Section 6.2) that can reduce the amount of training time. For example, a task-aware DL architecture often requires substantially fewer parameters to be learned. Alternatively, one can “transfer” knowledge from a pre-trained model from a related task/domain, and the training time will then be proportional to the amount of fine-tuning required to customize the model to the new task. For example, DeepER [15] leveraged word embeddings from GloVe (whose training can be time consuming) and built a light-weight DL model that can be trained in a matter of minutes even on a CPU. Finally, while training could be expensive, this can often be done as a pre-processing task. Prediction using DL is often very fast and comparable to that of other ML models.

✑ Data Curation Tasks are Too Messy or Too Unique

DC tasks often require substantial domain knowledge and a large dose of “common sense”. Current DL models are very narrow in the sense that they primarily learn from the correlations present in the training data. However, it is quite likely that this might not be sufficient. Unsupervised representation learning over the entire enterprise data can only partially address this issue. Current DL models are often not amenable to encoding domain knowledge, whether general or specific to DC, such as data integrity constraints. As mentioned before, a substantial amount of new research on DC-aware DL architectures is needed. However, it is likely that DL, even in its current form, can reduce the work of domain experts.

DC tasks often exhibit a skewed label distribution. For the task of entity resolution (ER), the number of non-duplicate tuple pairs is orders of magnitude larger than the number of duplicate tuple pairs. If one is not careful, DL models can provide inaccurate results. Similarly, other DC tasks often exhibit an unbalanced cost model where the cost of misclassification is not symmetric. Prior DL work utilizes a number of techniques to address these issues, such as (a) cost sensitive models, where the asymmetric misclassification costs are encoded into the objective function; and (b) sophisticated sampling approaches, where certain classes of data are under- or over-sampled. For example, DeepER [15] samples non-duplicate tuple pairs that are abundant at a higher level than duplicate tuple pairs.
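Both techniques, (a) and (b), can be sketched in a few lines. The following is a generic illustration, not DeepER's actual procedure: the class weights, the sampling ratio and the function names are all assumptions chosen for the example.

```python
import math
import random
from typing import List, Sequence, Tuple


def weighted_bce(
    y_true: Sequence[int],
    y_prob: Sequence[float],
    pos_weight: float,
    neg_weight: float = 1.0,
    eps: float = 1e-12,
) -> float:
    """(a) Cost sensitive objective: binary cross-entropy with per-class weights."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        if y == 1:
            total += -pos_weight * math.log(p)
        else:
            total += -neg_weight * math.log(1.0 - p)
    return total / len(y_true)


def undersample_negatives(
    pairs: List[str], labels: List[int], ratio: int, seed: int = 0
) -> Tuple[List[str], List[int]]:
    """(b) Sampling: keep all positives and at most ratio * (#positives) negatives."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    kept_neg = rng.sample(neg, min(len(neg), ratio * len(pos)))
    keep = sorted(pos + kept_neg)
    return [pairs[i] for i in keep], [labels[i] for i in keep]


# Missing a duplicate (label 1) is penalized ten times more than a false alarm.
missed_duplicate = weighted_bce([1], [0.1], pos_weight=10.0)
false_alarm = weighted_bce([0], [0.9], pos_weight=10.0)
print(missed_duplicate > false_alarm)  # True

# One duplicate pair among 100 candidate pairs, reduced to a 1:3 ratio.
pairs = [f"pair{i}" for i in range(100)]
labels = [1] + [0] * 99
small_pairs, small_labels = undersample_negatives(pairs, labels, ratio=3)
print(len(small_pairs), sum(small_labels))  # 4 1
```

In practice the weights and the sampling ratio would be tuned on held-out data rather than fixed in advance.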

✑ Deep Learning Predictions are Inscrutable

Yet another concern from domain experts is that the predictions of DL models are often uninterpretable. Deep learning models are often very complex, and their black-box predictions might not be explainable even by DL experts. However, explaining why a DL model made a particular data repair is very important for a domain expert. Recently, there has been intense interest in developing algorithms for explaining predictions of DL models or designing interpretable DL models in general. Please see [26] for an extensive survey. Designing algorithms that can explain the predictions of DL models for DC tasks is an intriguing research problem.

✑ Deep Learning Systems can be Easily Fooled

There exists a series of recent works which show that DL models (especially for image recognition) can be easily fooled by perturbing the images in an adversarial manner. The sub-field of adversarial DL [51, 44] studies the problem of constructing synthetic examples by slightly modifying real examples from training data such that the trained DL model (or any ML model) makes an incorrect prediction with high confidence. While this is indeed a long term concern, most DC tasks are often collaborative and limited to an enterprise. Furthermore, there is a series of research efforts that propose DL models that are more resistant to adversarial examples, such as [38].

✑ Building DL Models for DC is “Just Engineering”

Many DC researchers might look at the process of building DL models for DC and simply dismiss it as a pure engineering effort. And they are indeed correct! Despite its stunning success, DL is still in its infancy and the theory of DL is still being developed. To a DC researcher used to surveying a well organized garden of conferences such as VLDB/SIGMOD/ICDE, the DL research landscape might look like the wild west.

In the early stages, researchers might just apply an existing DL model or algorithm for a DC task. Or they might slightly tweak a previous model to achieve better results. We argue that database conferences must provide a safe zone in which these DL explorations are conducted in a principled manner. One could take inspiration from the computer vision community. They created a large dataset (ImageNet [12]) that provided a benchmark by which different DL architectures and algorithms can be compared. They also created one or more workshops focused on applying DL for specific tasks in computer vision (such as image recognition and scene understanding). The database community has its own TPC series of synthetic datasets that have been influential in benchmarking database systems. Efforts similar to TPC are essential for the success of DL-driven DC.

6.2 Taming Deep Learning’s Hunger for Data

Deep learning often requires a large amount of training data. There are a number of exciting principles routinely used in DL that can be adapted to solve many pain points of DC. Below, we highlight some of these promising ideas and discuss some potential research questions.

6.2.1 Unsupervised Representation Learning

A key issue is that the amount of supervised training data is limited. However, most enterprises have a substantial amount of unlabeled data in various relations and data lakes. One can use this unlabeled information to learn some task agnostic and generic patterns commonly observed in the data. Once these are identified, one can readily train a DL model on top of these representations using relatively little training data. Please refer to Section 3.3 for more details.

6.2.2 Data Augmentation

Data augmentation is a popular technique in DL to increase the size of labeled training data without increasing the load on domain experts. The key idea is to apply a series of label preserving transformations over the existing training data. Consider an image recognition task for distinguishing cats from dogs with limited training data. A typical trick is to apply a series of transformations such as translation (moving the location of the dog/cat within the image), rotation (changing the orientation), shearing, scaling, changing brightness/color, and so on. Note that none of these operations changes the label of the image, yet each provides additional synthetic training examples.
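The image example above can be sketched with a tiny grayscale image stored as a list of pixel rows. The helper names, the fill value and the brightness delta are hypothetical choices for illustration; real pipelines typically rely on an image library and include rotation, shearing and scaling as well.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale pixel rows with values in [0, 255]


def flip_horizontal(image: Image) -> Image:
    """Mirror each row left to right."""
    return [row[::-1] for row in image]


def translate_right(image: Image, shift: int, fill: int = 0) -> Image:
    """Shift the content right by `shift` pixels, padding the left edge with `fill`."""
    out = []
    for row in image:
        s = max(0, min(shift, len(row)))
        out.append([fill] * s + row[: len(row) - s])
    return out


def adjust_brightness(image: Image, delta: int) -> Image:
    """Add `delta` to every pixel and clamp the result to [0, 255]."""
    return [[min(255, max(0, px + delta)) for px in row] for row in image]


def augment(image: Image, label: str) -> List[Tuple[Image, str]]:
    """Return the original plus three transformed copies, all with the same label."""
    variants = [
        image,
        flip_horizontal(image),
        translate_right(image, 1),
        adjust_brightness(image, 20),
    ]
    return [(variant, label) for variant in variants]


tiny = [[10, 20, 30], [40, 50, 250]]
for variant, label in augment(tiny, "cat"):
    print(label, variant)
```

One labeled image yields four labeled examples here; the open question raised in the research opportunities is which transformations play this role for tuples, columns or relations.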

✍ Research Opportunities

  • Label Preserving Transformations for DC. Techniques are needed to perform label preserving transformations, but for DC operations.

  • Domain Knowledge Aware Augmentation. It is interesting to integrate domain knowledge and data integrity constraints for data augmentation.

  • Domain Specific Transformations. There are some recent approaches such as Tanda [48] that seek to learn domain specific transformations. Similar ideas might be extended to DC.

  • GANs for Transformation. A possible direction is to use GANs (see Figure 2(i) in Section 2) for learning transformations.

6.2.3 Synthetic Data Generation for Data Curation

A seminal event in computer vision was the construction of the ImageNet dataset [12], with millions of images over thousands of categories. This dataset was orders of magnitude larger than prior ones and served as a benchmark for image recognition for many years.

✍ Research Opportunities

  • Benchmark for DC. It is very important to create a similar benchmark to drive research in DC (both DL and non-DL). While there has been some work for data cleaning such as BART [4], it is often limited to specific scenarios. For example, BART can be used to benchmark data repair algorithms where the integrity constraints are specified as denial constraints.

  • Synthetic Datasets. If it is not possible to create an open-source dataset that has realistic data quality issues, a useful fallback is to create synthetic datasets that exhibit representative data quality issues. The family of TPC benchmarks involves a series of synthetic datasets that are somewhat realistic and widely used for benchmarking database systems. How can one use a wide variety of DL techniques to generate synthetic data? The most promising approaches are variational autoencoders (VAEs) and generative adversarial networks (GANs). Both have their own pros and cons. While the latent space of a VAE is more structured, a VAE also makes additional distributional assumptions that might not strictly apply to DC. GANs, on the other hand, are more generic but often have issues with convergence.

6.2.4 Weak Supervision

A key bottleneck in creating training data is that there is often an implicit assumption that it must be accurate. However, it is often infeasible to produce hand-labeled and accurate training data for most DC tasks. This is especially challenging for DL models that require a huge amount of training data. However, if one can relax the need for veracity of training data, its generation becomes much easier for the expert. The domain expert can specify a high level mechanism to generate training data without endeavoring to make it perfect. For example, she can say that if two tuples have the same country but different capitals, they are in error. This often requires substantially less effort than manually encoding all the exceptions.
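The country/capital rule from the example above can be written as a small labeling function that produces noisy labels for tuple pairs. This is a sketch under stated assumptions: the attribute names, the 1/0 label encoding and the toy tuples are invented for illustration, and the function does not use the API of any particular weak supervision system.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Row = Dict[str, str]


def same_country_different_capital(t1: Row, t2: Row) -> bool:
    """Rule from the text: same country but different capitals signals an error."""
    return t1["country"] == t2["country"] and t1["capital"] != t2["capital"]


def weak_labels(tuples: List[Row]) -> Dict[Tuple[int, int], int]:
    """Label every tuple pair: 1 = flagged as erroneous, 0 = not flagged."""
    labels = {}
    for (i, t1), (j, t2) in combinations(enumerate(tuples), 2):
        labels[(i, j)] = 1 if same_country_different_capital(t1, t2) else 0
    return labels


rows = [
    {"country": "France", "capital": "Paris"},
    {"country": "France", "capital": "Lyon"},
    {"country": "Germany", "capital": "Berlin"},
]
print(weak_labels(rows))  # {(0, 1): 1, (0, 2): 0, (1, 2): 0}
```

The rule is only mostly correct, since it would wrongly flag a country that legitimately has more than one capital, but it can label many pairs at almost no cost to the expert.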

✍ Research Opportunities

  • Weakly Supervised DL Models. There has been a series of research efforts (such as Snorkel [47]) that seek to weakly supervise ML models and provide a convenient programming mechanism to specify “mostly correct” training data. In fact, in many cases, such weakly labeled data can even be generated in an automated manner. It will be very useful to weakly supervise DL models for DC tasks.

6.2.5 Domain Adaptation

Another trick to handle limited training data is to “transfer” representations learned in one task to a different yet related task. For example, DL models for computer vision tasks are often trained on ImageNet [12] that is commonly used for image recognition purposes. However, these models could be used for tasks that are not necessarily image recognition and are even used for recognizing images that do not belong to the categories found in ImageNet.

✍ Research Opportunities

  • Transfer learning. Train a DL model for one task and tune the model for the new task by using the limited labeled data instead of starting from scratch.

  • Pre-trained Models. Alternatively, pre-train a DL model on a large, generic dataset and adapt the learned DL model for many other related tasks.

6.2.6 Crowdsourcing

Recently, there has been increasing interest in using crowdsourcing for a variety of tasks, including generating appropriate training data for ML models. The output of crowd workers is often noisy and hence requires sophisticated algorithms [36] for inferring true labels from noisy labels, learning the skill of workers, assigning workers to appropriate tasks, and so on.
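As a point of reference for inferring true labels from noisy ones, the simplest baseline is a majority vote per item. The sketch below is that baseline only, not one of the sophisticated algorithms surveyed in [36]. It ignores worker skill, and the tie-breaking rule (smallest label wins) is an arbitrary assumption.

```python
from collections import Counter
from typing import Dict, List


def majority_vote(votes: Dict[str, List[int]]) -> Dict[str, int]:
    """Pick the most frequent label per item; ties go to the smallest label."""
    result = {}
    for item, labels in votes.items():
        counts = Counter(labels)
        top = max(counts.values())
        result[item] = min(label for label, c in counts.items() if c == top)
    return result


# Hypothetical worker answers on whether a tuple pair is a duplicate (1) or not (0).
crowd_votes = {
    "pair1": [1, 1, 0],
    "pair2": [0, 0, 1],
    "pair3": [0, 1],
}
print(majority_vote(crowd_votes))  # {'pair1': 1, 'pair2': 0, 'pair3': 0}
```

Methods that also estimate worker reliability can outvote a careless majority, which is where the more sophisticated algorithms come in.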

✍ Research Opportunities

  • Crowdsourced DL. Combining crowdsourcing along with some of the aforementioned ideas (such as data augmentation) is an intriguing research problem to investigate.

7 Call to Arms

This vision paper is motivated by two key observations. On one hand, DC, a long standing problem that is further compounded by the increase in size, variety, veracity, and velocity of data, is at a crossroads; it needs revival and novel solutions if it is to keep up with the new data ecosystem. On the other hand, DL is gaining traction across many disciplines, both inside and outside computer science, showing successes that were unthinkable just a few years ago. The meeting of these two disciplines will unleash a series of research activities that will certainly lead to actual and usable solutions for many DC tasks.

Key components of our roadmap in the AutoDC project include research opportunities in learning distributed representations for database aware objects such as tuples or columns, designing DC-aware DL architectures, learning holistic knowledge, automatically orchestrating DC workflows, and automatically generating programs using neural architectures. We have also discussed simple adaptations of DL techniques to solve some traditional DC tasks, as our early efforts for the AutoDC project. Moreover, we have described many anecdotal tricks for DL learned from other domains that are likely to be useful for DC. All in all, besides presenting the vision and early achievements of AutoDC, this is a call to arms for the database community in general, and the DC community in particular, to seize this opportunity to leapfrog the area of DC, while keeping in mind the risks and mitigations that were also highlighted in this paper.