I Introduction
Emerging ideas in automated analytics [1] and meta-learning across many datasets [2] offer great promise for improving performance and reducing tedium in the data science pipeline. However, a major obstacle remains: such methods generally have no knowledge about what type of real-world entity (i.e. variable) generated the data they operate upon. In contrast, human analysts presented with new data often utilize this knowledge to recall previously-encountered datasets that contain the same sort of variables. Reviewing past experience with how different algorithms fared on these same variables enables a person to quickly leverage methods that work well for this type of data (e.g. removal of UserID from predictive features, missingness-modeling for user-ratings data, etc.).
Endowing a machine with this capability can be achieved via two ingredients: (1) an organized repository of different datasets annotated with informative metadata regarding the performance of various computational techniques, and (2) the ability to recognize which repository datasets (and hence the algorithms associated with them) are relevant when presented with new data. The former component is being addressed by the emergence of rapidly-growing public repositories such as OpenML [3] and kaggle.com, in which datasets from diverse domains are curated along with the performance of thousands of data science pipelines (including preprocessing & learning methods). Thus, this work aims to address the latter issue of identifying repository datasets which contain some of the same variables present in newly acquired data.
We focus on structured data tables (where rows = statistically independent observations, columns = variables), which are the most prevalent form of data among current organizations. Given a previously-unseen data table, we aim to automatically infer what type of variable was measured to generate the observations in each column. To accomplish this task, we compare the new data against a stored repository of data tables and report which variables in the repository are likely the same as the variables that generated the new data.
Note on terminology: Throughout, we use the term variable to refer to a real-world entity whose measurements form the values contained within a single column of a data table (also commonly called an attribute or feature). Although a single table of measurements (as might be found in a relational database) is often called a dataset, we carefully refer to this as a data table, reserving the term dataset for the univariate data values located within a single column of the data table (i.e. each dataset in this work is a single data column from some table, whose values stem from a single variable).
A major practical issue is the lack of informative variable-names (i.e. column labels) in most data tables, and a machine-readable description or codebook for the columns is rarely provided [4, 5, 6, 7]. It is also unlikely for different groups producing different datasets to adopt precisely the same nomenclature, even when they are making measurements of the same type. For example, the age of a person is recorded as Age in some datasets and Years in others. To ensure robustness against idiosyncratically chosen variable-names, the methodology presented here aims to match variables solely on the basis of their data values. In settings where adequate column-annotations are available, our methods can nonetheless be straightforwardly combined with standard information retrieval techniques (e.g. based on word-embedding of the column label) to leverage the information across both the variable-names and the values of their measurements (cf. [8, 9, 10]).
The main idea of our approach is to use this dataset repository to learn a vector embedding for each set of data values, such that datasets which stem from the same variable are embedded closer together than those which stem from different variables. When trained over a huge and representative data repository, our methodology can crucially adapt these embeddings to the different types of variation that naturally occur amongst real-world datasets. By pre-storing the embedding vectors for each dataset in our repository, the variables most likely to match a given new dataset can be efficiently identified via fast approximate nearest-neighbor algorithms [11].
II Related Work
[12, 13] also recently proposed methods for learning vector representations of datasets, but explore their use in generative modeling rather than our task. [5, 14, 15] consider the more related problem of classifying data from an unknown variable into a predetermined set of meaningful categories such as semantic labels. Such a multi-class classification setup may, however, undesirably restrict the number of categories that can be considered, and requires careful manual curation. In practice, there exists an extremely long-tail of different variable types encountered in natural data (and the number of types continuously grows over time), which thus is not easily reduced into a limited set of classes. It is also the case that none of the previously-recorded types of variables may apply for some new datasets.
Given just a single pair of datasets, inference of columns that stem from the same variable is also known as instance-based schema matching [8, 16, 4, 7]. Also related are the tasks of semantic labeling [17, 9], dataset search [10], and Bayesian sets [18]. However, existing methods from these communities rely on manually-derived features of the data-distribution chosen by the authors, unlike the neural network embeddings we employ, which automatically learn features that are naturally important across myriad datasets. Unlike our approach, existing tools rely entirely on the stringent assumption that different datasets from the same variable will always be drawn from identical distributions. As they do not leverage our fixed-size vector representations of datasets, these existing methods also cannot directly exploit fast approximate nearest neighbor search, and instead resort to knowledge-based blocking heuristics to remain computationally viable for a large number of high-dimensional datasets [16, 19].
III Methods
To formalize our task, suppose we have a repository $\mathcal{D} = \{X_1, \dots, X_N\}$ containing $N$ datasets, where each dataset $X_i$ stems from a single variable $v_i$. Here, we assume each individual data instance value $x \in X_i$ belongs to a common space $\mathcal{X}$ shared across datasets, and the instance-values within each $X_i$ are IID (e.g. not time-series). In this work, we consider the case where $\mathcal{X}$ is either the set of numeric scalar values or the set of (finite-length) strings. This means that formerly structured data tables have been split into separate datasets comprised of their columns, which is done to ensure we can accurately recognize each individual column within a newly acquired data table. It is also assumed that some of the datasets stem from the same type of variable, but do not contain the same set of observations ($v_i = v_j$, $X_i \neq X_j$ for some $i \neq j$). Given a previously-unseen query dataset $X_q$ generated from an unknown variable $v_q$ at test-time, our system seeks to efficiently infer which of $X_1, \dots, X_N$ likely stems from the same type of variable as $X_q$. Note that the query dataset may match more than one or none of the repository datasets.
A standard approach to determine whether two datasets stem from the same variable bases this decision on the similarity between their data distributions, or summary statistics thereof [7, 9]. One way to quantify the distance between two distributions $P_i, P_j$ is via:
$$d(P_i, P_j) = \big\| \mathbb{E}_{x \sim P_i}[\phi(x)] - \mathbb{E}_{x \sim P_j}[\phi(x)] \big\|_2 \qquad (1)$$
where the feature map $\phi$ determines how distributional-differences are reflected in the resulting distance between the summary statistics formed by average feature values. For example, a histogram representation of the distribution can be encoded by letting different dimensions of $\phi$ correspond to indicator functions for different bins. When $\phi$ corresponds to the (infinite-dimensional) feature map of a universal RKHS, this distance is known as the maximum mean discrepancy (MMD) and can capture any difference in distribution [20]. Even finite-dimensional $\phi$ can retain this property if we limit the distributions to certain parametric families.
In this work, we instead implement $\phi_\theta$ as a neural network with parameters $\theta$ and compare distributions via the squared Euclidean distance:
$$d_\theta(P_i, P_j) = \big\| \mathbb{E}_{x \sim P_i}[\phi_\theta(x)] - \mathbb{E}_{x \sim P_j}[\phi_\theta(x)] \big\|_2^2 \qquad (2)$$
We can ensure these representations are able to reflect subtle differences in distributions by employing a high-capacity network with high-dimensional outputs. If dataset $X_i$ consists of $n_i$ IID samples from distribution $P_i$, then we can empirically use the following distance between datasets:
$$D_\theta(X_i, X_j) = \Big\| \frac{1}{n_i} \sum_{x \in X_i} \phi_\theta(x) \;-\; \frac{1}{n_j} \sum_{x \in X_j} \phi_\theta(x) \Big\|_2^2 \qquad (3)$$
A fixed-size vector representation of a dataset $X$ is thus provided by the averaged output of the neural network, $\bar\phi_\theta(X) = \frac{1}{n} \sum_{x \in X} \phi_\theta(x)$, as in deep sets [21]. The computational complexity of $D_\theta$ is linear in the datasets’ sample size, unlike the quadratic-time MMD statistic, for which linear-time approximations come at the cost of a substantial variance increase [20, 22]. However, this general approach of inferring whether two datasets stem from the same variable solely based on their distributional similarity (e.g. via two-sample testing) suffers from three problems:
(1) inability to ignore natural variation between datasets containing measurements of the same type of variable;
(2) inability to distinguish different variables whose data distributions happen to be identical;
(3) computational inefficiency.
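As a concrete sketch of the dataset embedding and distance in (3), the snippet below uses a fixed random feature map as a stand-in for the trained network $\phi_\theta$ (a simplifying assumption; in our method this network is learned). It also illustrates the second problem above: two datasets drawn from the very same distribution receive a near-zero distance, whether or not they describe the same variable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed random feature map standing in for the trained network phi_theta
# (illustrative assumption; the paper's phi_theta is a learned deep network).
W = rng.normal(size=(1, 16))
bias = rng.normal(size=16)

def phi(x):
    """Per-instance feature vectors for scalar data values."""
    return np.tanh(np.atleast_2d(np.asarray(x, dtype=float)).T @ W + bias)

def embed(dataset):
    """Deep-sets dataset embedding: average of per-instance features."""
    return phi(dataset).mean(axis=0)

def D(ds_i, ds_j):
    """Squared Euclidean distance between dataset embeddings, as in (3)."""
    diff = embed(ds_i) - embed(ds_j)
    return float(diff @ diff)

a = rng.normal(loc=5.0, scale=1.0, size=1000)   # two samples of one variable
b2 = rng.normal(loc=5.0, scale=1.0, size=1000)
c = rng.normal(loc=-3.0, scale=1.0, size=1000)  # a differently distributed variable

# Same-distribution datasets embed nearby; a different distribution lands far away.
assert D(a, b2) < D(a, c)
```

The near-zero distance between `a` and `b2` would be produced for any pair of identically distributed datasets, which is exactly why distributional similarity alone cannot distinguish unrelated variables with common distributions.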
To grasp the first problem, consider a variable like temperature, which is recorded as Fahrenheit in some datasets and Celsius in others, or from different geographical regions across datasets. Clearly, the temperature measurements appearing in different datasets may not be distributed similarly at all, an issue variously described as batch effects, sampling bias, or covariate shift. By learning to ignore natural factors of variation present between different repository datasets from certain variables, the dataset embeddings proposed in our method can still recognize new datasets from these same variables even when their data distributions are dissimilar.
The second problem with matching variables via distributional similarity arises from the fact that two variables which are different may nonetheless have highly similar data distributions. Consider for example any Boolean attribute which takes values yes or no. Two such attributes that happen to share the same proportion of yes instances will have identical data distributions, even if these attributes share no relation at all. By learning from the data repository which sorts of distributions are naturally common between unrelated variables, our proposed methods remain able to distinguish new variables with these same common distributions. Finally, if each dataset contains $n$ samples, identifying the best candidates for a query dataset via distributional similarity would generally require at least $O(Nn)$ computation. This is undesirably slow for a repository containing many large datasets. Our method’s reliance on fixed-size dataset embeddings enables the use of fast approximate nearest neighbor search to circumvent the inefficiency of exhaustive pairwise comparison.
III-A Modeling Dataset Similarity
Our general strategy is to learn neural networks which infer the likelihood that two datasets stem from a shared variable via scalable stochastic gradient methods. Let $p_{ij}$ denote the probability that datasets $X_i$ and $X_j$ stem from the same attribute. We directly estimate this match probability via the parameterized form $\widehat{p}_{ij}$, where:
$$\widehat{p}_{ij} = \exp\big( -\big[\, D_\theta(X_i, X_j) + a_\gamma(X_i) + a_\gamma(X_j) \,\big] \big) \qquad (4)$$
Here, $D_\theta$ is the distance between dataset embeddings defined in (3), and $a_\gamma(X)$ is a nonnegative scalar output by another deep sets neural network (with parameters $\gamma$, instance-level stage $g_\gamma$, and post-pooling stage $h_\gamma$) that adjusts the match probability:
$$a_\gamma(X) = \Big[\, h_\gamma\Big( \frac{1}{n} \sum_{x \in X} g_\gamma(x) \Big) \Big]^2 \qquad (5)$$
To ensure a nonnegative output value, the final activation used in the output layer of this network is the square function $s(z) = z^2$. Note that when the adjustment factors $a_\gamma(X_i) = a_\gamma(X_j) = 0$, the match probability between two datasets is solely determined by the distance between their embeddings. Datasets that likely stem from the same variable share similar embeddings, while those unlikely to match are separated in the embedding-space. However, as discussed previously, the embedding distance in (3) alone is unable to distinguish between two different variables whose data happen to be identically distributed. Thus, $a_\gamma$ should learn to output large adjustment values for datasets with a distribution that is common amongst different variables in the repository, as this is the only way to produce a low match-probability between such datasets under our model. Figure 3 shows that empirical performance severely degrades without $a_\gamma$.
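A minimal numeric sketch of this match-probability model follows, assuming the exponential form of (4); the embedding and adjustment values are hypothetical constants standing in for network outputs. Zero adjustments reduce the model to pure embedding distance, while large adjustments suppress a match even between identically embedded datasets:

```python
import numpy as np

def match_probability(emb_i, emb_j, adj_i, adj_j):
    """Match probability as in (4): exp(-(||emb_i - emb_j||^2 + a_i + a_j)),
    where adj_i, adj_j >= 0 are outputs of the adjustment network."""
    diff = emb_i - emb_j
    return float(np.exp(-(diff @ diff + adj_i + adj_j)))

# Hypothetical dataset embedding (in the paper these are 300-dimensional).
emb = np.array([0.2, -0.5, 0.1])

# Identical embeddings with zero adjustment: a certain match.
assert match_probability(emb, emb, 0.0, 0.0) == 1.0

# Large adjustments (a distribution common to many unrelated variables)
# drive the match probability down even for identical embeddings.
assert match_probability(emb, emb, 3.0, 3.0) < 0.01
```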
III-B Training Procedure
Training examples presented to our models are of the basic form $(X_i, X_j, y_{ij})$, where $y_{ij}$ equals 1 if and only if $X_i$ and $X_j$ stem from the same variable ($v_i = v_j$). In this form, we can learn the neural network parameters $\theta, \gamma$ by optimizing the standard cross-entropy loss for binary classification:
$$\mathcal{L}(\theta, \gamma) = -\sum_{(i,j)} \big[\, y_{ij} \log \widehat{p}_{ij} + (1 - y_{ij}) \log(1 - \widehat{p}_{ij}) \,\big] \qquad (6)$$
Since $a_\gamma(\cdot) \ge 0$, the measure $m_{ij} := -\log \widehat{p}_{ij}$ defined via (4) satisfies all properties of a metric except for the identity of indiscernibles (recall we explicitly do not want this property, in order to match/distinguish attributes with different/identical data distributions). The match probability definition in (4) implies our overall cross-entropy loss can be equivalently written as:
$$\mathcal{L}(\theta, \gamma) = \sum_{(i,j):\, y_{ij} = 1} m_{ij} \;-\; \sum_{(i,j):\, y_{ij} = 0} \log\big( 1 - e^{-m_{ij}} \big) \qquad (7)$$
which is of a similar geometric form to the loss of [23], proposed to ensure various metric learning desiderata such as a positive margin between positive and negative pairs. Under this loss, it is more important to ensure that unmatched data are separated than that matched data share nearly the same embedding, a critical property for reducing the false positive rate when searching for variables that might match a new dataset. Networks $\phi_\theta$ and $a_\gamma$ are learned jointly via the Adam stochastic gradient method applied to this loss function [24]. Rather than sampling pairs of datasets from the repository at a time for training the embeddings, we employ triplets $(X_i, X_j, X_k)$ in which the first two datasets stem from the same variable ($y_{ij} = 1$) and the latter two do not ($y_{jk} = 0$). Triplet training, which pairs the same anchor sample with both a positive and a negative match, is a commonly utilized strategy in embedding learning [25]. To calculate stochastic gradients for updating network parameters, we simply draw such dataset-triplets at random from our repository $\mathcal{D}$. Note that this training strategy balances positive/negative ratios, which may not reflect the desired marginal distribution in a particular application. Nonetheless, the estimated match probabilities can be easily recalibrated by estimating the overall variable-match rate amongst random dataset pairs in $\mathcal{D}$ [26]. Balancing schemes like ours are known to often substantially improve performance in applications with limited positive examples [27]. We also experimented with alternative metric-based training objectives and more sophisticated sampling strategies pervasive in deep image embedding methods, but found these empirically led to increased training times and less stability without performance gains for our task (we considered the triplet/contrastive/margin losses applied directly to the embedding distance $D_\theta$, combined with semi-hard negative mining or distance-weighted sampling [28]). [29] report similar instabilities.
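The triplet sampling scheme can be sketched as follows; the repository contents and variable annotations are hypothetical, and each index stands for one dataset in $\mathcal{D}$. The loss contribution of a triplet is the cross-entropy (6) on its two labeled pairs:

```python
import math
import random

def sample_triplet(variable_of, rng=random):
    """Draw indices (i, j, k): datasets i and j stem from the same
    variable (y_ij = 1) while k stems from a different one (y_jk = 0)."""
    # Group dataset indices by the variable they stem from.
    groups = {}
    for idx, var in enumerate(variable_of):
        groups.setdefault(var, []).append(idx)
    positives = [g for g in groups.values() if len(g) >= 2]
    i, j = rng.sample(rng.choice(positives), 2)
    k = rng.choice([idx for idx, var in enumerate(variable_of)
                    if var != variable_of[i]])
    return i, j, k

def triplet_loss(p_ij, p_jk):
    """Cross-entropy loss (6) on the triplet's matched and unmatched pairs."""
    return -math.log(p_ij) - math.log(1.0 - p_jk)

# Hypothetical repository of 6 datasets annotated with their source variable.
vars_ = ["age", "age", "zipcode", "temp", "temp", "zipcode"]
i, j, k = sample_triplet(vars_, random.Random(0))
assert vars_[i] == vars_[j] and vars_[k] != vars_[i]

# Confident, correct predictions incur a smaller loss than uncertain ones.
assert triplet_loss(0.9, 0.1) < triplet_loss(0.5, 0.5)
```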
III-C Scaling to Large Datasets via Subsampling
Basing our training procedure on stochastic gradient methods ensures it can scale to large repositories that contain many datasets. However, since each training example is comprised of entire datasets, the time to process these examples via our deep sets network grows linearly with the datasets’ sample size, which may be undesirably inefficient. When $X_i$ is a large dataset, we instead rely on a smaller subset $\tilde{X}_i$ of $m$ randomly selected samples from $X_i$ to obtain our dataset representations, which are subsequently used to compute the measures in (4).
By Jensen’s inequality, our estimate in (3) of the population distance (2) between underlying distributions is slightly biased upward: our definition employs the V-statistic formulation discussed in [20] in the case of MMD, rather than a U-statistic. Thus, our stochastic gradient method effectively optimizes an upper bound on the loss that would be incurred based on the embeddings of the underlying population distributions, where the tightness of this bound depends on the datasets’ sample sizes. Theorem 1 below nonetheless ensures that our subsampled dataset embedding distances closely approximate their underlying population counterparts.
Theorem 1.
Let $\varepsilon > 0$. When $\tilde{X}_i, \tilde{X}_j$ are subsamples of size $m$ from a given pair of datasets $X_i, X_j$:
$$\Pr\Big( \big| D_\theta(\tilde{X}_i, \tilde{X}_j) - d_\theta(P_i, P_j) \big| > \varepsilon \Big) \;\le\; C_1 \exp\big( -C_2\, m\, \varepsilon^2 \big)$$
for constants $C_1, C_2 > 0$ that depend only on the bound on the network outputs.
The quantities above are defined as in (2)-(3), and the probability is taken over the sampling of the datasets from the underlying distributions as well as the random subsampling procedure. Here, we have implicitly assumed both datasets contain at least $m$ instances and that our neural network outputs are bounded. The proof of this theorem follows directly from a combination of the triangle and Hoeffding inequalities. A similar concentration bound may be derived for the differences $|a_\gamma(\tilde{X}_i) - a_\gamma(X_i)|$. Together, these bounds ensure dataset-subsampled estimates of $D_\theta$ and $\widehat{p}_{ij}$ are concentrated around the values they would take if provided with the actual embeddings of the underlying population distributions. In our applications, we train our embedding networks by drawing subsamples of 1K instance values (when the sample-size exceeds this), at which point these empirical approximations are likely quite accurate. By batching such subsamples on a GPU, we can compute the vector embedding of a subsampled dataset in a single simultaneous pass through each neural network.
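The subsampled embedding is straightforward to sketch; a toy two-dimensional feature map stands in for the trained network $\phi_\theta$ (an illustrative assumption). With a subsample of $m = 1000$ instances, the subsampled embedding already lies close to the full-data embedding, as the concentration bound of Theorem 1 suggests:

```python
import numpy as np

def subsampled_embedding(dataset, phi, m=1000, seed=0):
    """Embed a dataset via a random subsample of at most m instances,
    keeping the cost of computing (3) independent of the full sample size."""
    rng = np.random.default_rng(seed)
    data = np.asarray(dataset)
    if len(data) > m:
        data = rng.choice(data, size=m, replace=False)
    return phi(data).mean(axis=0)

# Toy feature map standing in for the trained network phi_theta.
phi = lambda x: np.stack([np.tanh(x), np.tanh(2 * x)], axis=1)

full = np.random.default_rng(1).normal(size=50_000)
approx = subsampled_embedding(full, phi, m=1000)
exact = phi(full).mean(axis=0)

# The subsampled embedding concentrates around the full-data embedding.
assert np.linalg.norm(approx - exact) < 0.1
```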
III-D Scarcity of Positive Matches
In practice, it may be difficult for human annotators to comprehensively examine all pairs of attributes in a large dataset repository. The number of annotated variable matches may thus be small relative to the possible pairings of datasets (i.e. $y_{ij}$ may be missing for many pairs). Since it is quite unlikely that a random pair of datasets would stem from the same variable, we can safely assume no match (i.e. $y_{ij} = 0$) for all pairs except those where a variable match has been indicated. Ideas from positive and unlabeled learning [30] might instead be used to recalibrate our predicted match probabilities (for example, by leveraging the proportion of matched pairs among all pairs considered by the annotator).
A limited number of positive matches also limits the quality of our learned embeddings. To obtain additional positive examples for supervising our neural network training, we employ the following training data augmentation strategy: a single dataset in the repository is randomly split into two disjoint subsets, which are treated as separate datasets whose variables are to be matched. These pairs are then added as additional positive examples to our training set. We repeat this procedure for randomly selected datasets, splitting them into randomly sized partitions, such that we can generate as many augmented samples as desired. Note that for these augmented samples, a measure of distributional similarity does in fact suffice to determine they should be matched. Thus, their inclusion in the training set encourages our embedding distance to behave like a standard two-sample statistic for data from unique variables without any observed match.
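This augmentation strategy can be sketched directly; the variable name below is hypothetical, and any repository dataset would do:

```python
import numpy as np

def augment_positive_pair(dataset, rng):
    """Split one dataset into two disjoint, randomly sized halves and
    treat them as a labeled positive (same-variable) training pair."""
    data = rng.permutation(np.asarray(dataset))
    cut = int(rng.integers(1, len(data)))   # random, nontrivial split point
    return data[:cut], data[cut:]

rng = np.random.default_rng(0)
ages = rng.integers(18, 90, size=500)       # hypothetical "age" column
left, right = augment_positive_pair(ages, rng)

# The halves are disjoint, nonempty, and jointly cover the original dataset.
assert len(left) + len(right) == len(ages)
assert len(left) > 0 and len(right) > 0
```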
III-E Neural Network Models for Different Data Types
Throughout this work, 300-dimensional vector embeddings are used as the output of $\phi_\theta$, regardless of the data type. For embedding numeric data where $\mathcal{X} = \mathbb{R}$, both $\phi_\theta$ and $a_\gamma$ are three-layer fully-connected networks with 300 units in each hidden layer, ReLU activations in $a_\gamma$, and tanh activations in $\phi_\theta$ (including for the output layer). When our repository contains a diverse collection of variables, we may need to embed numeric data across an arbitrary input range (rescaling continuous datasets is undesirable, as crucial variable information is encoded in the magnitude of their values). Unfortunately, standard neural networks perform poorly across inputs of widely-differing magnitudes [31]. Thus, rather than taking as input the raw floating-point value corresponding to each numeric data instance $x$, $\phi_\theta$ and $a_\gamma$ take as input a 32-dimensional vector of binary features encoding the 32-bit signed binary representation of $x$ instead. This binary input representation enables the network to operate more stably over diverse input ranges in different datasets, and consistently results in improved performance (Figure 3).

While supervised learning pipelines typically bin data with string fields into categories encoded as one-hot vectors, the original string values and their distribution contain important information about the data-generating variable. Our methods instead operate directly on these strings, producing a vector representation of each string that is subsequently pooled in our deep sets embedding. Note that when the number of unique strings in a dataset is far less than the sample size, our deep sets embeddings can be efficiently computed as a multinomially-weighted sum of string embeddings. Two types of string data may be encountered in practice: datasets whose entries come from natural language (e.g. names and other standard text fields), and datasets whose values are general strings (e.g. phone-numbers or domain-specific vocabulary). To embed datasets of the former type, we compute the 300-dimensional fastText vector representation of each string value [32] (pre-trained word vectors from http://fastText.cc/), which is then fed as input to fully-connected deep sets networks with the same architecture as described previously. This use of pre-trained word vectors ensures that known semantics of the values are taken into consideration when comparing datasets (e.g. that US and America refer to the same entity). For a particular string dataset, we determine whether our fastText embedding method is appropriate by ascertaining whether a majority of the strings in each data field entail words present in the fastText vocabulary. When one data field contains multiple words, we simply average their fastText vectors. For embedding other datasets where each field is simply an arbitrary finite-length string not from a known language, both $\phi_\theta$ and $a_\gamma$ are taken to be bidirectional character-level Long Short-term Memory (LSTM) recurrent networks [33], which employ two stacked layers with 128 hidden units in each direction. We concatenate the final hidden states in each direction to form a vector representation of the entire string. This same string-representation is then passed through separate fully-connected layers in $\phi_\theta$ and $a_\gamma$; the former outputs a 300-dimensional embedding with tanh activation, the latter a single nonnegative scalar value.

III-F Approximate nearest neighbor search
Given a query dataset $X_q$, we wish to quickly identify the repository datasets most likely to stem from the same variable, which correspond to those $X_i$ for which $-\log \widehat{p}(X_q, X_i)$ is smallest. Recalling the definition in (4), $-\log \widehat{p}(X_q, X_i)$ can be expressed as a squared Euclidean distance between the vectors:
$$u_q = \big[\, \bar\phi_\theta(X_q),\ \sqrt{a_\gamma(X_q)},\ 0 \,\big] \quad \text{and} \quad u_i = \big[\, \bar\phi_\theta(X_i),\ 0,\ \sqrt{a_\gamma(X_i)} \,\big]$$
where the final dimensions of these vectors are simply two numbers appended to the dataset embedding vector: a scalar derived from our probability-adjustment network and a zero value. For each repository dataset $X_i$, we compute $u_i$ and pre-store it. At test-time, $u_q$ can be efficiently computed via a forward pass through our deep sets networks.
Having re-expressed our similarity function as a simple distance between vectors that each depend on only one dataset, we can leverage approximate nearest neighbor algorithms to efficiently find the most promising candidates for a query dataset. Specifically, we employ the approach of [11], which is known to be fast and accurate in practice. Note that it is our particular formulation of the match probability in (4) which allows the learned dataset similarity to be expressed as a Euclidean distance between dataset-specific embedding vectors. Approximate nearest neighbor methods are not as easily leveraged for a general neural network architecture that jointly operates on both datasets.
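Under our formulation of (4), this re-expression can be checked numerically. In the sketch below (embedding and adjustment values are hypothetical), the two appended coordinates are chosen so that their squared differences contribute exactly the two adjustment terms:

```python
import numpy as np

def ann_vector(embedding, adjustment, is_query):
    """Append two coordinates so that the squared Euclidean distance between
    a query vector and a repository vector equals
    D(X_q, X_i) + a(X_q) + a(X_i) = -log match-probability, per (4)."""
    extra = [np.sqrt(adjustment), 0.0] if is_query else [0.0, np.sqrt(adjustment)]
    return np.concatenate([embedding, extra])

emb_q, a_q = np.array([0.1, 0.4]), 0.3   # hypothetical query embedding
emb_i, a_i = np.array([0.0, 0.5]), 0.7   # hypothetical repository embedding

u_q = ann_vector(emb_q, a_q, is_query=True)
u_i = ann_vector(emb_i, a_i, is_query=False)

dist_sq = float(np.sum((u_q - u_i) ** 2))
neg_log_p = float(np.sum((emb_q - emb_i) ** 2) + a_q + a_i)
assert abs(dist_sq - neg_log_p) < 1e-9
```

Because the repository vectors $u_i$ depend only on their own dataset, they can be indexed once by any off-the-shelf Euclidean nearest-neighbor structure.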
IV Experiments
Our experiments are based on a publicly-available large-scale source of tabular data stored in the OpenML repository (http://www.openml.org/) [3]. OpenML is well-suited as a data repository: it is not only well-organized (with the performance of many learning algorithms stored for each dataset), but is also continually growing over time as users upload new data to the platform. Each column from a data table is split into a separate dataset, within which we omit entries with missing values for simplicity. We consider all data files pertaining to either regression or classification tasks, which appeared to be the best-organized data. We partition the OpenML data into disjoint groups of numeric, language, and general-string datasets which are handled separately. This means we only consider matching numeric datasets with numeric datasets, language data with other language data, and strings with strings, and we train separate embedding models for each data type.
To programmatically identify matched variables across different columns of OpenML data tables (i.e. what we refer to as a dataset in this work), we leverage the provided column labels and employ standard techniques from schema-matching. Namely, if two column labels (or tokenized versions thereof) are present in the set of pre-trained fastText word vectors, we consider whether the cosine similarity between their vector representations exceeds a highly conservative threshold (0.9). If the column label strings cannot be tokenized into valid words with corresponding pre-trained word vectors, then we instead consider whether the Jaro-Winkler string similarity between them exceeds a separate conservative threshold (0.95), as proposed in [34]. While we rely on the provided column labels to identify true variable matches in the OpenML data, none of these labels are accessible to the different learning methods studied in our evaluations. Note that our strategy to identify variable matches based on provided names ensures columns with identical names are always matched. As this is not always desirable, we manually assemble a list of common but vague column names comprised of generic descriptors (such as: attribute, variable, column, as well as names with under 3 characters) and remove data corresponding to columns with these uninformative labels from our analysis.

The data tables of each type are separated into two groups: one whose columns form the datasets contained within a training repository $\mathcal{D}$ (representing previously-observed datasets used for training our embeddings), and another whose columns form a collection $\mathcal{D}_{\text{test}}$ of heldout datasets only used during evaluations (analogous to the “test set” in supervised learning). For a small subset of training datasets in $\mathcal{D}$, we partition their observations into two halves, one of which is removed from $\mathcal{D}$ and added as a separate dataset to another collection $\mathcal{D}_{\text{split}}$ also heldout for evaluations. Table I summarizes the OpenML data. We simply train our embedding networks on each repository as described in Section III, until the training loss in (6) no longer consistently improves (we did not note overfitting). Once our embedding models have been trained on the appropriate OpenML repository $\mathcal{D}$, we perform all evaluations in this paper with no additional training of these models. While our methods operate at the level of individual columns (what we refer to as a dataset in this work), we ensure that our analysis always compares data from different randomly-drawn OpenML tables, to reflect realistic use-cases (and avoid overrepresentation of high-dimensional tables).
Table I: Summary of the OpenML data, listing for each data type the number of annotated variable matches (numeric: 100; language: 50; string: 50), along with the number of tables, number of datasets, and sample sizes in each partition.
IV-A Alternative Methods
When presenting results, we refer to our methods as EmbedNum when applied to numeric datasets, EmbedTxt when applied to string datasets with text fields represented via fastText vectors, and EmbedStr when applied to general string data. For numeric data, we compare against alternative methods in which the dataset-distance used to determine match probabilities between datasets is instead given by:
MeanSD: Empirical mean difference plus the empirical standard deviation difference between two datasets.
KS: 1 minus the $p$-value of the Kolmogorov-Smirnov test for an arbitrary difference between distributions [35].
MMD: Maximum Mean Discrepancy with RBF kernel. For computational efficiency, we use the linear-time estimator [20].
SCF: the more powerful linear-time MMD estimator from [22]. We use the Smooth CF variant (based on randomized test frequencies).
For string data, we compare against the following alternative measures of dataset-distance (the latter methods are suited for natural language data, while Jaccard is more appropriate for arbitrary strings with uninformative fastText vectors):
Jaccard: Jaccard distance between the sets of unique string-values appearing in each dataset, as considered by many schema-matching systems [36, 37, 16].
mWordVec: Distance between the means of the fastText word vectors computed from the entries in each dataset.
pWordVec: 1 minus the $p$-value of the Hotelling $T^2$ statistic for the difference in the above mean fastText vectors, used in [10].
V Results
Using t-SNE to visualize our dataset embeddings (Figure 1) reveals that these vectors capture substantial information about the annotated variable names, despite our models never having previously encountered these data or any variable-name annotations at all. Our first quantitative evaluation involves the following binary classification problem: given a random pair of heldout datasets, determine whether or not their variables match. We consider balanced instances of this problem (with the same number of matched and unmatched test pairs), where performance is quantified via the area under the ROC curve (AUC). To evaluate across different sample-size regimes, we also apply our methods to downsampled fractions of each test dataset. Two variants of this problem are considered:
Split: the positive class of matched examples consists of pairs formed by partitioning the observations from a single test dataset in $\mathcal{D}_{\text{test}}$ into two equal-sized datasets.
Diff: the positive class of matched examples consists of pairs of two different test datasets from $\mathcal{D}_{\text{test}}$ whose variable names indicate a match.
In both variants, the negative class of unmatched examples consists of random pairs of test datasets in $\mathcal{D}_{\text{test}}$ whose variable names do not match. Note that Split is a simpler problem which can be solved by accurate measures of statistical difference between distributions (without learning to ignore natural variation between different datasets that stem from the same type of variable). Figure 2 depicts the classification results for 1000 test pairs sampled from each class. In the Diff setting, none of the alternative methods (which are based solely on distributional-similarity) performs nearly as well as our approach, validating the reasons listed in §I and §III. Even in the Split setting, where the alternative methods are appropriate, our approach slightly outperforms standard two-sample methods, having learned to adapt its embeddings to the types of numeric data that appear in practice. Figure 4 shows analogous results for the language/string data.
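The AUC used throughout this evaluation has a simple rank-based form: the probability that a randomly chosen matched pair receives a higher match score than a randomly chosen unmatched pair. A self-contained sketch with hypothetical scores:

```python
def auc(pos_scores, neg_scores):
    """Area under the ROC curve: probability that a randomly chosen
    matched pair outscores a randomly chosen unmatched pair
    (ties count one half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else (0.5 if p == n else 0.0)
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical match probabilities for matched vs. unmatched test pairs.
matched = [0.9, 0.8, 0.75, 0.4]
unmatched = [0.5, 0.3, 0.2, 0.1]
assert auc(matched, unmatched) == 0.9375   # 15 of 16 comparisons won
```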
Next, we consider the task of retrieving repository datasets whose variables match that of a given (heldout) query dataset. Repository datasets are ranked based on their match probability with the query (monotonically related to the measure of dataset distance used), and we report the (average) number of correct variable matches present among the top-ranked results for each query (more appropriate than precision/recall in our setting, where there are so few variable matches). Two variants of this problem are considered:
Split-Retrieve: the query dataset comes from $\mathcal{D}_{\text{split}}$ and matches exactly one dataset in $\mathcal{D}$ (as both datasets originate from the same column, which has been partitioned in two).
Diff-Retrieve: the query dataset comes from $\mathcal{D}_{\text{test}}$ and matches all datasets in $\mathcal{D}$ with similar variable names. We try 100 query datasets randomly chosen from the subset of datasets in $\mathcal{D}_{\text{test}}$ for which we know at least one such match in $\mathcal{D}$ exists.
SplitRetrieve is again a simpler problem that can in theory be solved by accurate two-sample measures of distributional differences. In Table II, the best performance values for each variant are highlighted in bold, indicating that our approach tends to have higher precision in these dataset-retrieval tasks. For example, when our query consists of numeric values from a column named age (part of a table annotated as survival data), the top 3 matches returned by EmbedNum are all columns with the same label age, belonging to tables respectively annotated as audiology, diabetes, and breast tumor data. Our embeddings correctly recognize that these four datasets all correspond to measurements of patient ages in medical studies.
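The retrieval protocol above can be sketched in a few lines (a simplified illustration: the container names are hypothetical, and the trivial mean-difference scorer stands in for our learned match-probability model or any of the baselines):

```python
import numpy as np

def retrieve_top_k(query_col, repository, match_prob, k=3):
    """Rank repository columns by estimated match probability with the query.

    `repository` maps column names to arrays of values; `match_prob` is any
    pairwise scorer. Returns the k column names with the highest scores."""
    scored = [(name, match_prob(query_col, vals))
              for name, vals in repository.items()]
    scored.sort(key=lambda t: t[1], reverse=True)
    return [name for name, _ in scored[:k]]

def matches_in_top_k(retrieved, true_matches):
    """Number of correct variable matches among the top-ranked results."""
    return sum(name in true_matches for name in retrieved)

# Toy repository: two age columns from different studies plus an
# unrelated income column.
rng = np.random.default_rng(1)
repo = {
    "audiology/age": rng.normal(45, 15, 500),
    "diabetes/age": rng.normal(55, 10, 500),
    "census/income": rng.lognormal(10, 1, 500),
}
query = rng.normal(50, 12, 300)  # ages from a new survival-data table

# Simplistic scorer: negative difference of column means.
top = retrieve_top_k(query, repo, lambda a, b: -abs(a.mean() - b.mean()), k=2)
print(top, matches_in_top_k(top, {"audiology/age", "diabetes/age"}))
```

The reported numbers in Table II correspond to averaging such per-query match counts over 100 queries.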
TABLE II
Recall of variable matches among the top-ranked results; the final column of each DiffRetrieve block reports calibrated-recall.

Numeric data, SplitRetrieve:
  MeanSD    0.40  0.51  0.58
  KS        0.46  0.59  0.67
  MMD       0.48  0.62  0.69
  SCF       0.49  0.65  0.72
  EmbedNum  0.48  0.66  0.74

Numeric data, DiffRetrieve:
  MeanSD    0.35  0.44  0.53  0.52
  KS        0.36  0.46  0.59  0.60
  MMD       0.33  0.45  0.52  0.60
  SCF       0.33  0.40  0.55  0.62
  EmbedNum  0.42  0.61  0.67  0.71

Language data, SplitRetrieve:
  Jaccard   0.41  0.56  0.62
  mWordVec  0.49  0.59  0.68
  pWordVec  0.51  0.63  0.70
  EmbedNum  0.54  0.66  0.71

Language data, DiffRetrieve:
  Jaccard   0.32  0.38  0.52  0.59
  mWordVec  0.40  0.46  0.59  0.67
  pWordVec  0.43  0.49  0.59  0.68
  EmbedNum  0.49  0.59  0.67  0.75

String data, SplitRetrieve:
  Jaccard   0.44  0.67  0.70
  mWordVec  0.31  0.38  0.60
  pWordVec  0.34  0.40  0.62
  EmbedNum  0.44  0.66  0.71

String data, DiffRetrieve:
  Jaccard   0.33  0.55  0.58  0.63
  mWordVec  0.28  0.53  0.56  0.59
  pWordVec  0.31  0.52  0.55  0.57
  EmbedNum  0.39  0.56  0.64  0.68
In addition to retrieving the correct matches when they exist, useful match-probability estimates should also indicate when a new dataset is unlikely to match any of the previously observed variables' data. We consider 100 query datasets randomly chosen from a subset of the held-out datasets which do not have any annotated match in the repository, and compute the maximum match-probability for each query when compared against the repository datasets. Avoiding false matches for such a query requires selecting a match-likelihood threshold, and we evaluate the calibrated-recall of each method as the fraction of annotated matches between query and repository datasets whose estimated match probability exceeds this threshold. The final column of Table II lists these calibrated-recall values, demonstrating that the predicted match-probabilities are better calibrated under our approach than under the alternative methods.
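The calibrated-recall computation can be sketched as follows (a minimal illustration with made-up probabilities; the choice of a 95% rejection quantile for setting the threshold is our own illustrative assumption, not the paper's stated rule):

```python
import numpy as np

def calibration_threshold(nonmatch_max_probs, quantile=0.95):
    """Choose a match-probability threshold tau that rejects most queries
    known to have no true match: tau is set so that (here) 95% of such
    queries' maximum match probabilities fall below it."""
    return float(np.quantile(nonmatch_max_probs, quantile))

def calibrated_recall(true_match_probs, tau):
    """Fraction of annotated query/repository matches whose estimated
    match probability exceeds the threshold tau."""
    return float((np.asarray(true_match_probs) > tau).mean())

# Illustration: a well-calibrated model assigns higher probabilities to
# true matches than to its strongest false matches.
nonmatch_max = np.array([0.20, 0.35, 0.30, 0.50, 0.25])
true_match = np.array([0.90, 0.80, 0.45, 0.95, 0.70])
tau = calibration_threshold(nonmatch_max)
print(tau, calibrated_recall(true_match, tau))
```

A method whose probabilities are poorly calibrated forces tau upward (its false matches score high), which in turn drives its calibrated-recall down.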
Among the natural language data, the repository dataset that receives the highest probability-adjustment value stems from a column where 58% of the values are true and the rest all have the value false. The OpenML language data contain many other columns also solely comprised of roughly evenly-distributed true/false values, and many of these also appear in the list of repository datasets assigned the greatest probability-adjustment values. As these boolean columns are labeled with diverse names indicating different unrelated attributes (such as whether an animal species is oviparous, in one dataset with variable-name eggs), our model learns to output a low match probability between such pairs, which is only achievable by outputting large adjustment values for unmatched datasets whose distribution occurs commonly in the repository. In contrast, the alternative methods produce small distances between all pairs of such true/false datasets and thus mistakenly predict a variable-match, demonstrating yet another downside of pure reliance on measures of distributional similarity.
V-A Schema Matching
As the previous evaluations were based on OpenML column names, which might be noisy, we now investigate how well our embedding models trained on the OpenML column names generalize to other datasets where ground-truth variable-matches are known. First, we apply our OpenML-trained dataset embedding models to the task of instance-based schema matching. We use the PUMS data considered by [7], in which the task is to match schemas between separate census data tables from New York, California, and Texas. Evaluation of proposed matchings is possible since the underlying schemas are actually shared across tables, but the methods we apply are not privy to this knowledge (we evaluate on the largest tables studied by [7]). Both numeric and language data types are found within these tables, and we use the appropriate corresponding OpenML-pretrained neural network to separately embed each column of each table (with no additional training on the schema-matching data). Subsequently, we adopt our estimated match probabilities as pairwise column-similarity scores that are fed into the schema-alignment optimization used in [7], which utilizes local minimization with two-opt switching to find a one-to-one schema alignment. Figure 5A shows our pretrained embeddings are more effective in these schema-matching applications than the two best-performing column-similarity-scoring methods proposed by [7]: FOLL and FOEP (which both rely on distributional similarity between matched variables).
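The alignment step can be illustrated with a small sketch: our own simplified rendition of local search with two-opt switching (the actual optimization in [7] may differ in details). The similarity-matrix entries would come from our estimated match probabilities:

```python
import numpy as np

def two_opt_alignment(sim):
    """One-to-one alignment between two equal-size schemas that locally
    maximizes total pairwise column similarity.

    Starts from the identity alignment and repeatedly swaps the partners
    of two columns whenever the swap increases total similarity (two-opt
    switching), until no improving swap remains. `sim[i, j]` is the
    similarity between column i of table A and column j of table B;
    returns perm with column i of A matched to column perm[i] of B."""
    n = sim.shape[0]
    perm = list(range(n))
    improved = True
    while improved:
        improved = False
        for i in range(n):
            for j in range(i + 1, n):
                # Gain from swapping the partners of columns i and j.
                gain = (sim[i, perm[j]] + sim[j, perm[i]]
                        - sim[i, perm[i]] - sim[j, perm[j]])
                if gain > 1e-12:
                    perm[i], perm[j] = perm[j], perm[i]
                    improved = True
    return perm

# Toy similarity matrix whose best alignment reverses the columns.
sim = np.array([[0.1, 0.2, 0.9],
                [0.2, 0.8, 0.3],
                [0.9, 0.1, 0.2]])
print(two_opt_alignment(sim))
```

Two-opt only guarantees a local optimum, but with well-separated similarity scores (as a well-trained match-probability model produces) it typically recovers the correct alignment.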
V-B Table Union Search
Finally, we consider the table union search benchmark of [10], which involves searching a repository of data tables from the Canadian/UK governments' Open Data portals for tables that can be unioned with a given query table (i.e. joined row-wise by merging their observations). Unionability is defined as the condition that a sufficiently large subset of their columns contain measurements of the same underlying variables. We use the appropriate OpenML-pretrained neural network (for each data type) to separately embed each column of each table (with no additional training on the table-union datasets). The top-performing method of [10] relies on a sophisticated Ensemble of multiple manually-crafted statistical similarity measures between columns. Our method produces an analogous column-similarity measure (based on our embeddings), from which overall table unionability is evaluated by aggregating similarity across columns via the same max-alignment procedure that [10] use to aggregate their Ensemble scores. Figure 5B shows our pretrained embeddings are quite effective for table union search, producing superior precision/recall compared with the top-performing method of [10]. While the schema matching and table union evaluations involved vastly different datasets, our OpenML-trained models remain able to perform well, highlighting the broad applicability of these embeddings, as well as the validity of the OpenML column names and our previous OpenML evaluations.
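The aggregation of column similarities into a table-level unionability score follows the max-alignment idea; a simplified sketch is below, using a maximum-weight bipartite matching as a stand-in for the exact alignment procedure of [10], with the similarity-matrix entries assumed to come from a pairwise column scorer such as our match probabilities:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def unionability(col_sim):
    """Aggregate pairwise column similarities into one table-level score.

    `col_sim[i, j]` is the estimated match probability between column i of
    the query table and column j of a candidate table. Columns are paired
    one-to-one so as to maximize total similarity, and the table score is
    the mean similarity over the aligned pairs."""
    rows, cols = linear_sum_assignment(col_sim, maximize=True)
    return float(col_sim[rows, cols].mean())

# A candidate table whose columns mostly match the query scores high...
good = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
# ...while an unrelated table scores low even under its best alignment.
bad = np.array([[0.1, 0.2],
                [0.3, 0.1]])
print(unionability(good), unionability(bad))
```

Ranking candidate tables by this score yields the precision/recall curves reported in Figure 5B.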
References

[1] C. Steinrücken, E. Smith, D. Janz, J. Lloyd, and Z. Ghahramani, "The automatic statistician," in Automatic Machine Learning: Methods, Systems, Challenges, F. Hutter, L. Kotthoff, and J. Vanschoren, Eds. Springer, 2018, ch. 9, pp. 175–188.
[2] J. Vanschoren, "Meta-learning," in Automatic Machine Learning: Methods, Systems, Challenges, F. Hutter, L. Kotthoff, and J. Vanschoren, Eds. Springer, 2018, ch. 2, pp. 39–68.
[3] J. Vanschoren, J. N. van Rijn, B. Bischl, and L. Torgo, "OpenML: Networked science in machine learning," SIGKDD Explorations, 2013, pp. 49–60.
[4] J. Kang and J. F. Naughton, "On schema matching with opaque column names and data values," in ACM SIGMOD International Conference on Management of Data, 2003, pp. 205–216.
[5] Z. Chen, H. Jia, J. Heflin, and B. D. Davison, "Generating schema labels through dataset content analysis," in Companion Proceedings of The Web Conference (WWW '18), 2018, pp. 1515–1522.
[6] Y. Yi, Z. Chen, J. Heflin, and B. D. Davison, "Recognizing quantity names for tabular data," in International Workshop on Data Search (ACM SIGIR '18), 2018, pp. 68–73.
[7] A. Jaiswal, D. J. Miller, and P. Mitra, "Schema matching and embedded value mapping for databases with opaque column names and mixed continuous and discrete-valued data fields," ACM Transactions on Database Systems, vol. 38, 2013.
[8] A. Doan, P. Domingos, and A. Y. Halevy, "Reconciling schemas of disparate data sources: A machine-learning approach," in ACM SIGMOD International Conference on Management of Data, vol. 30, no. 2, 2001, pp. 509–520.
[9] M. Pham, S. Alse, C. A. Knoblock, and P. Szekely, "Semantic labeling: a domain-independent approach," in International Semantic Web Conference, 2016, pp. 446–462.
[10] F. Nargesian, E. Zhu, K. Q. Pu, and R. J. Miller, "Table union search on open data," Proceedings of the VLDB Endowment, vol. 11, no. 7, pp. 813–825, 2018.
[11] M. Muja and D. G. Lowe, "Fast approximate nearest neighbors with automatic algorithm configuration," VISAPP, vol. 2, pp. 331–340, 2009.
[12] H. Edwards and A. Storkey, "Towards a neural statistician," in International Conference on Learning Representations, 2017.
[13] D. Bouchacourt, R. Tomioka, and S. Nowozin, "Multi-level variational autoencoder: Learning disentangled representations from grouped observations," in AAAI Conference on Artificial Intelligence, 2018.
[14] N. Ruemmele, Y. Tyshetskiy, and A. Collins, "Evaluating approaches for supervised semantic labeling," in Linked Data on the Web, 2018.
[15] I. Valera and Z. Ghahramani, "Automatic discovery of the statistical types of variables in a dataset," in International Conference on Machine Learning, 2017, pp. 3521–3529.
[16] F. Gregg and P. Xu, "Modern approaches to schema matching," DataMade Blog, 2017. [Online]. Available: http://datamade.us/blog/schemamatching/
[17] S. K. Ramnandan, A. Mittal, C. A. Knoblock, and P. Szekely, "Assigning semantic labels to data sources," in European Semantic Web Conference. Springer, 2015, pp. 403–417.
[18] Z. Ghahramani and K. A. Heller, "Bayesian sets," in Advances in Neural Information Processing Systems, 2006, pp. 435–442.
[19] E. Rahm and P. A. Bernstein, "A survey of approaches to automatic schema matching," The VLDB Journal, vol. 10, no. 4, pp. 334–350, 2001.
[20] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola, "A kernel two-sample test," Journal of Machine Learning Research, vol. 13, pp. 723–773, 2012.
[21] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. Salakhutdinov, and A. Smola, "Deep sets," in Advances in Neural Information Processing Systems, 2017.
[22] K. P. Chwialkowski, A. Ramdas, D. Sejdinovic, and A. Gretton, "Fast two-sample testing with analytic representations of probability measures," in Advances in Neural Information Processing Systems, 2015, pp. 1981–1989.
[23] S. Chopra, R. Hadsell, and Y. LeCun, "Learning a similarity metric discriminatively, with application to face verification," in Computer Vision and Pattern Recognition, vol. 1. IEEE, 2005, pp. 539–546.
[24] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in International Conference on Learning Representations, 2015.
[25] F. Schroff, D. Kalenichenko, and J. Philbin, "FaceNet: A unified embedding for face recognition and clustering," in Computer Vision and Pattern Recognition. IEEE, 2015, pp. 815–823.
[26] P. Latinne, M. Saerens, and C. Decaestecker, "Adjusting the outputs of a classifier to new a priori probabilities may significantly improve classification accuracy: Evidence from a multi-class problem in remote sensing," in International Conference on Machine Learning, 2001.
[27] G. E. Batista, R. C. Prati, and M. C. Monard, "A study of the behavior of several methods for balancing machine learning training data," ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pp. 20–29, 2004.
[28] C.-Y. Wu, R. Manmatha, A. J. Smola, and P. Krähenbühl, "Sampling matters in deep embedding learning," in International Conference on Computer Vision. IEEE, 2017.
[29] S. Horiguchi, D. Ikami, and K. Aizawa, "Significance of softmax-based features in comparison to distance metric learning-based features," arXiv:1712.10151, 2017.
[30] C. Elkan and K. Noto, "Learning classifiers from only positive and unlabeled data," in Knowledge Discovery and Data Mining. ACM, 2008.
[31] A. Trask, F. Hill, S. E. Reed, J. Rae, C. Dyer, and P. Blunsom, "Neural arithmetic logic units," in Advances in Neural Information Processing Systems, 2018, pp. 8046–8055.
[32] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, "Enriching word vectors with subword information," Transactions of the Association for Computational Linguistics, vol. 5, pp. 135–146, 2017.
[33] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[34] W. Cohen, P. Ravikumar, and S. Fienberg, "A comparison of string metrics for matching names and records," in KDD Workshop on Data Cleaning and Object Consolidation, 2003.
[35] W. J. Conover, Practical Nonparametric Statistics. New York: John Wiley & Sons, 1971.
[36] E. Rahm, "Towards large-scale schema and ontology matching," in Schema Matching and Mapping. Springer, 2011, pp. 3–27.
[37] J. Berlin and A. Motro, "Database schema matching using machine learning with feature selection," in Advanced Information Systems Engineering, 2002, pp. 452–466.