The motivation of this work is to develop a method for summarizing the content of tabular datasets. One can imagine the potential utility of automatically assigning a set of tags to each member of a large collection of datasets that would indicate the potential subject being addressed by the dataset. This can allow for semantic querying over the dataset collection to extract all available data pertinent to some specific task subject at scale.
We make the assumption that the dataset contains some text that is semantically descriptive of the dataset subject, whether appearing in columns, headers or some augmenting metadata. As opposed to an extractive approach that would merely select some exact words and phrases from the available text, we propose an abstractive approach that builds an internal semantic representation and produces subject tags that may not be explicitly present in the text augmenting the dataset.
The result of this work is DUKE (Dataset Understanding via Knowledge-base Embeddings), a method that employs a pretrained Knowledge Base (KB) semantic embedding to perform type recommendation within a prespecified ontology. The recommended types are then aggregated into a small collection of super types predicted to be descriptive of the dataset, exploiting the hierarchical structure of the types in the ontology. In effect, the method uses an existing KB embedding to extensionally generate a dataset2vec embedding. Using a February 2015 Wikipedia knowledge base and a corresponding DBpedia ontology to specify types, we present experimental results on open data taken from several sources (OpenML, CKAN, and data.world) to illustrate the effectiveness of the approach.
2. Related Work
The distributional semantics (Sahlgren, 2008) concept has recently been widely employed as a natural language processing (NLP) tool to embed various NLP concepts into vector spaces. This rather intuitive hypothesis states that the meaning of a word is determined by its context. By far the most pervasive application of the hypothesis has been the word2vec model (Mikolov et al., 2013) (Pennington et al., 2014), which employs neural networks on large corpora to embed contextually similar words close to each other in a high-dimensional vector space. Arithmetic operations on the elements of the vector space produce semantically meaningful results, e.g., King - Man + Woman = Queen.
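The vector arithmetic mentioned above can be illustrated with a minimal sketch. The two-dimensional vectors below are hand-picked toy values (roughly "royalty" and "gender" axes), not output from any trained model, and the helper names `cosine` and `analogy` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy embedding (assumed values, for illustration only).
toy = {
    "king": (1.0, 1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "queen": (1.0, -1.0),
    "apple": (-1.0, 0.2),
}

def analogy(a, b, c, vectors):
    """Return the word closest to vectors[a] - vectors[b] + vectors[c],
    excluding the three query words themselves."""
    target = tuple(x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

result = analogy("king", "man", "woman", toy)
print(result)
```

With real embeddings the match is only approximate, which is why the nearest neighbour of the resulting vector is taken rather than testing for equality.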
Since the original word2vec model, various incremental incarnations of it have been employed to embed sentences, paragraphs and even knowledge graphs into vector spaces via sent2vec (Pagliardini et al., 2017), paragraph2vec (Le and Mikolov, 2014), and RDF2Vec (Ristoski and Paulheim, 2016), respectively.
A topic domain is typically expressed as a manually curated ontology. A basic element of an ontology is a type, and a type assertion statement links specific entities of the knowledge graph to specific types. These statements can be used to augment a semantic embedding space with type information in order to add high level context of the graph to the embedding space. For instance, it was recently shown that one can extend a pretrained Knowledge Graph Embedding (KGE) to contain types of a specific ontology if those were not already present as entities, given a list of assertion statements (Kejriwal and Szekely, 2017). Thus, it can be assumed that a semantic embedding is typed for our purposes.
We note that the abstractive tabular dataset summarization problem is closely related to the well-studied problem of type recommendation, where the type to be predicted is a super tag, within a prespecified ontology, for all text segments in the dataset. Systems for type recommendation for individual entities, using both manually curated features (Ma et al., 2013) and automated features (van Erp and Vossen, 2017), e.g., via typed KGEs (Kejriwal and Szekely, 2017), have been previously explored. To the best of our knowledge, this is the first application of typed semantic embeddings to abstractive tabular dataset summarization.
In this subsection, we present a pair of definitions to aid orientation.
Definition (word2vec) Word2vec models utilize a large corpus of documents to build a vector space mapping words to points in a space, where proximity implies semantic similarity (Mikolov et al., 2013). This allows us to calculate distances between words in the dataset and the set of types in our ontology.
Definition (wiki2vec) When discussed in this paper, a wiki2vec model is a form of word2vec model trained on a corpus of Wikipedia KB documents (see https://github.com/idio/wiki2vec). Training on this data ensures that the types in the DBpedia ontology are included in the vocabulary of the model, and increases the likelihood that topics are discussed in context with their super-types.
Note that wiki2vec is different from a KGE, which is typically trained on relationship triples between entities in a knowledge graph (such as DBpedia) (Ristoski and Paulheim, 2016).
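To make the word-to-type distance computation from the word2vec definition concrete, the following is a minimal sketch. It assumes cosine distance as the distance measure (the text does not name one) and uses hand-written three-dimensional toy vectors in place of a trained wiki2vec model; the names `cosine_distance` and `type_distances` are hypothetical.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

# Toy vectors standing in for a trained embedding (assumed values).
type_vectors = {
    "plant": (1.0, 0.0, 0.0),
    "vehicle": (0.0, 1.0, 0.0),
    "person": (0.0, 0.0, 1.0),
}
word_vector = (0.9, 0.1, 0.0)  # hypothetical embedding of the word "oak"

def type_distances(vec, type_vectors):
    """Distance from one embedded word to every type in the ontology."""
    return {t: cosine_distance(vec, tv) for t, tv in type_vectors.items()}

distances = type_distances(word_vector, type_vectors)
nearest = min(distances, key=distances.get)
print(nearest, distances)
```

Because wiki2vec includes the ontology types in its vocabulary, the type vectors and word vectors live in the same space, which is what makes this direct comparison possible.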
3.2. Generating Type Recommendations
The method for summarizing a tabular dataset can be broken down into three distinct steps:
(1) Collect a set of types and an ontology to use for abstraction.
(2) Extract any text data from the tabular dataset and embed it into a vector space to calculate the distance to all the types in our ontology.
(3) Aggregate the distance vectors for every keyword in the dataset into a single vector of distances.
3.2.1. Type Ontology
In order to generate an abstract term to describe the dataset, we must first collect an ontology of types to select a descriptive term from. We use an ontology provided by DBpedia (downloaded from http://downloads.dbpedia.org/2015-10/dbpedia 2015-10.nt), which contains approximately 400 defined types, including everything from sound to video game and historic place. DBpedia also contains defined parent-child relationships for the types (available at http://dbpedia.org/ontology/), which we use to build a complete hierarchy of types, e.g., tree is a sub-type of plant, which is a sub-type of eukaryote.
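As a rough illustration of how parent-child relationships yield a hierarchy, the sketch below stores a child-to-parent mapping and derives both the ancestor chain of a type and the children of each parent. The three relationships are taken from the example in the text plus one invented sibling ("animal"); parsing the actual DBpedia file is not shown, and the function names are hypothetical.

```python
from collections import defaultdict

# Child -> parent links (toy subset; "animal" is an assumed extra entry).
parent_of = {
    "tree": "plant",
    "plant": "eukaryote",
    "animal": "eukaryote",
}

def ancestors(type_name, parent_of):
    """Walk up the hierarchy and list all super-types, nearest first."""
    chain = []
    while type_name in parent_of:
        type_name = parent_of[type_name]
        chain.append(type_name)
    return chain

def children_map(parent_of):
    """Invert the child -> parent links into parent -> [children]."""
    children = defaultdict(list)
    for child, parent in parent_of.items():
        children[parent].append(child)
    return dict(children)

tree_ancestors = ancestors("tree", parent_of)
children = children_map(parent_of)
print(tree_ancestors)
print(children)
```

The parent-to-children view is the one a tree aggregation step would need, since each type's score is updated from the scores of its children.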
3.2.2. Word Embedding
With the set of types collected, we extract each word from the dataset, embed it in a wiki2vec vector space and calculate the distance between that word and every type in the ontology. If a single cell in a column contains more than one word, we take the average of the corresponding embedded vectors. This results in a collection of distance vectors representing all text in the dataset. We collect the vectors according to their source within the dataset, i.e., the distance vectors for words in the same column are stacked into a single matrix of distances for that column. If column headers are provided, we treat them as an additional column in the dataset.
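The embedding step described above might look like the following sketch. It assumes cosine distance, whitespace tokenization, and that out-of-vocabulary words are silently skipped; none of these choices is specified in the text. The toy two-dimensional embedding replaces a real wiki2vec model, and all function names are hypothetical.

```python
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def embed_cell(cell, embedding):
    """Average the vectors of all known words in a cell (None if none known)."""
    vectors = [embedding[w] for w in cell.lower().split() if w in embedding]
    if not vectors:
        return None
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

def column_matrix(cells, types, embedding):
    """One row per embeddable cell, one entry per type (distance to that type)."""
    rows = []
    for cell in cells:
        vec = embed_cell(cell, embedding)
        if vec is not None:
            rows.append([cosine_distance(vec, embedding[t]) for t in types])
    return rows

# Toy embedding (assumed values); types share the vocabulary with words.
embedding = {
    "plant": (1.0, 0.0), "vehicle": (0.0, 1.0),
    "oak": (0.9, 0.1), "tree": (0.8, 0.2), "truck": (0.1, 0.9),
}
types = ["plant", "vehicle"]
columns = {"tree": ["Oak tree", "truck", "zzz"]}  # header -> cell values

matrices = {name: column_matrix(cells, types, embedding)
            for name, cells in columns.items()}
# Column headers are treated as one additional column.
matrices["<headers>"] = column_matrix(list(columns), types, embedding)
print(matrices)
```

Each matrix has as many columns as there are types, so later aggregation steps can operate on them uniformly regardless of how many cells a dataset column contains.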
3.2.3. Distance Aggregation
The previous steps produce a set of matrices containing distances between every text segment in the dataset and the set of types. The goal of this step is to reduce them to a single vector of distances.
We utilize three successive aggregations in order to compute this final vector. The first aggregation is computed across the rows of each column matrix in order to produce a single vector of distances between the column and all types. Potential functions to use are discussed below. The second aggregation is what we call the tree aggregation: we take this vector of distances for a column and use the hierarchy of types described by DBpedia to update the score for each type. For instance, we need to update the score for means of transportation based on the scores for airplane, train, and automobile. The third aggregation is performed over the set of distance vectors computed for each column, producing a single vector of distances to every defined type. We tested two simple functions for each aggregation step, mean and max, as well as a variety of more complex functions for the tree aggregation step. Tree aggregation allows for additional complexity because the updated distance for each type depends on both the original distance for that type and the vector of scores for all of its children. We found that the most successful tree aggregation functions were those that used different functions for processing the child scores and the original type score, e.g., the meanmax function of Equation 1.
3.2.4. Aggregation Function Selection
To select the best function for each aggregation, we hand-labelled a collection of datasets with types from our ontology to use as a sort of ‘training set’. Then, for each labelled dataset and each combination of aggregation functions, we computed the percentage of true labels found in the top three labels predicted by DUKE, with results shown in Figure 1. This figure clearly shows that using mean for column aggregation, meanmax tree aggregation described in equation 1, and then mean for the final dataset aggregation step produces the best results.
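The three aggregation steps and the selection procedure can be sketched together as follows. Several things here are assumptions rather than the authors' implementation: entries are treated as similarity scores (higher means closer), the `tree_meanmax` function is only one plausible reading of "meanmax" (the mean of a type's own score and the maximum of its direct children's scores, applied once and not recursively), only the row and dataset aggregators are varied, and the labelled data is a single invented dataset.

```python
from itertools import product
from statistics import mean

AGGS = {"mean": mean, "max": max}  # the two simple functions named in the text

def tree_meanmax(scores, children):
    """Hypothetical 'meanmax': average a type's own score with the best
    score among its direct children. Leaf types are left unchanged."""
    updated = dict(scores)
    for parent, kids in children.items():
        updated[parent] = mean([scores[parent], max(scores[k] for k in kids)])
    return updated

def summarize(columns, types, children, row_agg, dataset_agg):
    """Apply the three aggregations and return one score per type."""
    per_column = []
    for matrix in columns:
        vec = [row_agg(vals) for vals in zip(*matrix)]           # rows -> vector
        scores = tree_meanmax(dict(zip(types, vec)), children)   # tree step
        per_column.append([scores[t] for t in types])
    final = [dataset_agg(vals) for vals in zip(*per_column)]     # columns -> vector
    return dict(zip(types, final))

def top_k(scores, k=3):
    return sorted(scores, key=scores.get, reverse=True)[:k]

def hit_rate(true_labels, predicted):
    """Fraction of the true labels that appear among the predictions."""
    return len(set(true_labels) & set(predicted)) / len(true_labels)

types = ["transport", "airplane", "train", "plant", "tree"]
children = {"transport": ["airplane", "train"], "plant": ["tree"]}
labelled = [{  # invented scores for one dataset with two columns
    "columns": [
        [[0.2, 0.9, 0.3, 0.1, 0.1], [0.4, 0.7, 0.5, 0.1, 0.1]],
        [[0.1, 0.2, 0.1, 0.3, 0.2]],
    ],
    "labels": ["airplane", "transport"],
}]

rates = {}
for row_name, ds_name in product(AGGS, AGGS):
    per_dataset = [
        hit_rate(d["labels"], top_k(summarize(
            d["columns"], types, children, AGGS[row_name], AGGS[ds_name])))
        for d in labelled
    ]
    rates[(row_name, ds_name)] = mean(per_dataset)
print(rates)
```

If the entries were true distances instead of similarities, the ranking in `top_k` would need to pick the smallest values, and the choice of max inside the tree step would have to be revisited accordingly.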
4. Results and Discussion
The goal of this section is to illustrate the effectiveness of the proposed approach to the tabular dataset summarization problem, in the context of some widely available open datasets for which manually curated summaries (i.e., types/tags) are available to facilitate comparison and evaluation. A link to every dataset used is provided to facilitate verification by the reader. For each dataset, we generated one subject tag using the DUKE program, as described in the previous sections, and graded it manually using ‘low’ for low accuracy, ‘medium’ for medium accuracy (where the automatically generated tag is “related to”, but is not exactly one of, the manual tags) and ‘high’ for high accuracy (where the automatically generated tag is exact in the sense that it is one of the manually generated tags). Each prediction took roughly 20 seconds to perform (approximately 17 seconds of which were spent loading the wiki2vec model) on a 16 CPU 64 GB D16s v3 Azure Cloud VM executing serially.
4.1. Example 1 - CKAN Datasets
Four randomly selected CKAN datasets were used: Class Size 2016-2017 (available at https://catalogue.data.gov.bc.ca/dataset/bc-schools-class-size), 2016 Annual Survey Questions (see https://catalogue.data.gov.bc.ca/dataset/bc-public-libraries-statistics-2002-2016), BC Liquor Store Product Price List Oct 2017 (available at https://catalogue.data.gov.bc.ca/dataset/bc-liquor-store-product-price-list-historical-prices), and Coalfile Report (available at https://catalogue.data.gov.bc.ca/dataset/coalfile-database). Manually curated subject tags were available for each dataset (see Table 1). The match between the predicted tags and the manual tags for each dataset is depicted in Table 1.
For the first two datasets, DUKE predicts an exact tag. For the next two datasets, the accuracy is medium, with wine region being very close to wine and river being a common semantic theme in coal field names (examples include Elk River, Hat Creek and Peace River). Specifically, the top 5 tags returned by DUKE in decreasing order for the fourth example were river, stream, body of water, natural place and natural region, words that are semantically descriptive of the kind of names typically possessed by coal fields. Moreover, we plot the top 5 DUKE-predicted tags and the manual tags for the third example in Figure 2, demonstrating an exact match.
4.2. Example 2 - OpenML Datasets
Four simple OpenML datasets were obtained through the D3M DARPA program: the 185 baseball (available for download at http://www.openml.org/d/185), 196 autoMpg (available for download at http://www.openml.org/d/196), 30 personae (available for download at http://www.clips.ua.ac.be/datasets/personae-corpus), and 313 spectrometer (available for download at http://www.openml.org/d/313) datasets. The results for these datasets are shown in Table 2.
For the first two datasets, DUKE predicts an exact tag. Note that for the second dataset, we consider engine to be an exact tag, since the manual tags are essentially attributes of engines. For the next two datasets, the accuracy is medium, with person being very close to personality and band being descriptive of red band and blue band manual tags. To verify that bands here referred to the right context, we looked at the top 5 tags returned by DUKE, which in decreasing order are band, brown dwarf, inhabitants per square kilometer, star and celestial body, words that are fairly consistent with the context suggested by the manual tags. Moreover, we plot these 5 tags and the manual tags for this dataset in Figure 2, demonstrating that while an exact match is not attained, nontrivial subsets of both tag types are ‘very close’ to each other in the wiki2vec embedding space.
4.3. Example 3 - data.world Datasets
The names of some randomly selected data.world datasets are as follows: US terrorist origins (available for download at https://data.world/tommyblanchard/u-s-terrorist-origins), Occupational Employment Growth (available for download at https://data.world/tommyblanchard/u-s-terrorist-origins), CAFOD activity file for Haiti (available for download at https://data.world/cafod/cafod-activity-file-for-haiti) and Queensland Gambling Data (downloaded at https://data.world/queenslandgov/all-gambling-data-queensland). The results for this representative set of four data.world datasets are shown in Table 3.
For the first two datasets, DUKE achieves medium accuracy. To see the justification for this, note that the top 5 tags returned by DUKE for the first dataset, in decreasing order, are person, still image, legal case, supreme court of the USA and military person, words fairly descriptive of the dataset, which is a list of terrorists, availability of a headshot and details of their legal charges. Moreover, we plot these 5 tags and the manual tags for this dataset in Figure 2, demonstrating that while an exact match is not attained, nontrivial subsets of both tag types are ‘very close’ to each other in the wiki2vec embedding space. The second dataset provides a list of occupations, many of which are scientific, and corresponding wages at various locations, which leads us to believe that site of scientific interest is fairly descriptive of the semantics represented in the dataset. For the next two datasets, the accuracy is high, which should be self-explanatory to the reader from Table 3.
5. Conclusion and Future Work
A method for abstractive summarization of tabular datasets, under the assumption that each dataset contains some descriptive text, was presented. Results of numerical experiments on OpenML, CKAN and data.world datasets show good agreement between the manual tags and the tags automatically generated by our system, DUKE. These results could be improved by including more extensive ontologies in the model (in place of the 2015 DBpedia ontology). Additionally, retraining wiki2vec on a more complete version of DBpedia (potentially augmented using an Automatic Knowledge Base Completion, or AKBC, algorithm (Wang et al., 2015) (Groth et al., 2016)) should help improve the accuracy of our system. More sophisticated handling of multi-word phrases also needs to be explored.
Acknowledgements Work was supported by the Defense Advanced Research Projects Agency (DARPA) under Contract Number D3M (FA8750-17-C-0094). Views, opinions, and findings contained in this report are those of the authors and should not be construed as an official Department of Defense position, policy, or decision.
- Groth et al. (2016) Paul T. Groth, Sujit Pal, Darin McBeath, Brad Allen, and Ron Daniel. 2016. Applying Universal Schemas for Domain Specific Ontology Expansion. In Proceedings of the 5th Workshop on Automated Knowledge Base Construction, AKBC@NAACL-HLT 2016, San Diego, CA, USA, June 17, 2016. 81–85. http://aclweb.org/anthology/W/W16/W16-1315.pdf
- Kejriwal and Szekely (2017) Mayank Kejriwal and Pedro Szekely. 2017. Supervised Typing of Big Graphs Using Semantic Embeddings. In Proceedings of The International Workshop on Semantic Big Data (SBD ’17). ACM, New York, NY, USA, Article 3, 6 pages. https://doi.org/10.1145/3066911.3066918
- Le and Mikolov (2014) Quoc V. Le and Tomas Mikolov. 2014. Distributed Representations of Sentences and Documents. CoRR abs/1405.4053 (2014). arXiv:1405.4053 http://arxiv.org/abs/1405.4053
- Ma et al. (2013) Y. Ma, T. Tran, and V. Bicer. 2013. TYPifier: Inferring the type semantics of structured data. In 2013 IEEE 29th International Conference on Data Engineering (ICDE). 206–217. https://doi.org/10.1109/ICDE.2013.6544826
- Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In NIPS. Curran Associates, Inc., 3111–3119.
- Pagliardini et al. (2017) Matteo Pagliardini, Prakhar Gupta, and Martin Jaggi. 2017. Unsupervised Learning of Sentence Embeddings using Compositional n-Gram Features. CoRR abs/1703.02507 (2017). arXiv:1703.02507 http://arxiv.org/abs/1703.02507
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. Glove: Global Vectors for Word Representation. In EMNLP, Vol. 14. 1532–1543.
- Ristoski and Paulheim (2016) Petar Ristoski and Heiko Paulheim. 2016. RDF2Vec: RDF Graph Embeddings for Data Mining. In The Semantic Web – ISWC 2016, Paul Groth, Elena Simperl, Alasdair Gray, Marta Sabou, Markus Krötzsch, Freddy Lecue, Fabian Flöck, and Yolanda Gil (Eds.). Springer International Publishing, Cham, 498–514.
- Sahlgren (2008) Magnus Sahlgren. 2008. The distributional hypothesis. Italian Journal of Linguistics 20, 1 (2008), 33–54.
- van Erp and Vossen (2017) Marieke van Erp and Piek Vossen. 2017. Entity Typing Using Distributional Semantics and DBpedia. In Knowledge Graphs and Language Technology, Marieke van Erp, Sebastian Hellmann, John P. McCrae, Christian Chiarcos, Key-Sun Choi, Jorge Gracia, Yoshihiko Hayashi, Seiji Koide, Pablo Mendes, Heiko Paulheim, and Hideaki Takeda (Eds.). Springer International Publishing, Cham, 102–118.
- Wang et al. (2015) Quan Wang, Bin Wang, and Li Guo. 2015. Knowledge Base Completion Using Embeddings and Rules. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015. 1859–1866. http://ijcai.org/Abstract/15/264