Paying down metadata debt: learning the representation of concepts using topic models

10/09/2020
by Jiahao Chen, et al.
We introduce a data management problem called metadata debt: identifying the mapping between data concepts and their logical representations. We describe how this mapping can be learned using semisupervised topic models based on low-rank matrix factorizations that account for missing and noisy labels, coupled with sparsity penalties to improve localization and interpretability. We introduce a gauge transformation approach that lets us construct explicit associations between topics and concept labels, and thus assign meaning to topics. We also show how to use this topic model for semisupervised learning tasks such as extrapolating from known labels, evaluating possible errors in existing labels, and predicting missing features. We show results from this topic model in predicting subject tags on over 25,000 datasets from Kaggle.com, demonstrating its ability to learn semantically meaningful features.
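To make the idea of a low-rank factorization with missing labels and a sparsity penalty concrete, here is a minimal sketch in Python with NumPy. It is not the paper's algorithm: it swaps in a generic masked nonnegative matrix factorization with an L1 penalty and multiplicative updates, and it does not attempt the gauge transformation. The function name `masked_sparse_nmf`, the toy data, and all parameter values are assumptions made purely for illustration.

```python
import numpy as np

def masked_sparse_nmf(X, mask, rank, l1=0.01, n_iter=200, seed=0):
    """Factor X ~ W @ H using only the entries where mask == 1.

    Hypothetical helper: minimizes
    0.5 * ||mask * (X - W @ H)||^2 + l1 * (sum(W) + sum(H))
    over nonnegative W and H with multiplicative updates.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, rank)) + 0.1   # rows: items, columns: topics
    H = rng.random((rank, m)) + 0.1   # rows: topics, columns: features and labels
    eps = 1e-9                        # keeps denominators strictly positive
    MX = mask * X                     # masked entries never influence the fit
    losses = []
    for _ in range(n_iter):
        W *= (MX @ H.T) / ((mask * (W @ H)) @ H.T + l1 + eps)
        H *= (W.T @ MX) / (W.T @ (mask * (W @ H)) + l1 + eps)
        resid = mask * (X - W @ H)
        losses.append(0.5 * np.sum(resid ** 2) + l1 * (W.sum() + H.sum()))
    return W, H, losses

# Toy data (invented): 6 items, 4 word-count features, 2 concept labels.
features = np.array([[3, 2, 0, 0], [2, 3, 0, 0], [3, 3, 0, 0],
                     [0, 0, 2, 3], [0, 0, 3, 2], [0, 0, 3, 3]], dtype=float)
labels = np.array([[1, 0], [1, 0], [1, 0],
                   [0, 1], [0, 1], [0, 1]], dtype=float)
X = np.hstack([features, labels])

# Treat the labels of items 2 and 5 as missing by masking them out.
mask = np.ones_like(X)
mask[2, 4:] = 0
mask[5, 4:] = 0

W, H, losses = masked_sparse_nmf(X, mask, rank=2)

# Reconstructed label columns act as scores for the missing labels.
scores = (W @ H)[:, 4:]
print(np.round(scores, 2))
```

Stacking features and labels into one matrix means each learned topic (a row of `H`) carries weights over both words and concept labels, which is one simple way to read a label off a topic. The mask is what makes the setup semisupervised: items with unknown labels still shape the topics through their features.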

research
05/20/2023

Re-visiting Automated Topic Model Evaluation with Large Language Models

Topic models are used to make sense of large text collections. However, ...
research
01/12/2022

Topic Modeling on Podcast Short-Text Metadata

Podcasts have emerged as a massively consumed online content, notably du...
research
03/24/2021

Coining goldMEDAL: A New Contribution to Data Lake Generic Metadata Modeling

The rise of big data has revolutionized data exploitation practices and ...
research
02/10/2021

Memory-Associated Differential Learning

Conventional Supervised Learning approaches focus on the mapping from in...
research
10/12/2019

Prediction Focused Topic Models via Vocab Selection

Supervised topic models are often sought to balance prediction quality a...
research
05/29/2020

Automatic Generation of Topic Labels

Topic modelling is a popular unsupervised method for identifying the und...
research
09/22/2021

Automated Feature-Topic Pairing: Aligning Semantic and Embedding Spaces in Spatial Representation Learning

Automated characterization of spatial data is a kind of critical geograp...
