Explain3D: Explaining Disagreements in Disjoint Datasets

by Xiaolan Wang, et al.

Data plays an important role in applications, analytic processes, and many aspects of human activity. As data grows in size and complexity, there is a pressing need for tools that help users understand and explain data-related operations. Data management research on explanations has largely assumed that data resides in a single dataset, under one common schema. In reality, today's data is frequently un-integrated, coming from different sources with different schemas. When different datasets provide different answers to semantically similar questions, understanding the reasons for the discrepancies is challenging and cannot be handled by existing single-dataset solutions. In this paper, we propose Explain3D, a framework for explaining the disagreements across disjoint datasets (3D). Explain3D focuses on identifying the reasons for the differences in the results of two semantically similar queries operating on two datasets with potentially different schemas. Our framework leverages the queries to perform a semantic mapping across the relevant parts of their provenance; discrepancies in this mapping point to causes of the queries' differences. Exploiting the queries gives Explain3D an edge over traditional schema-matching and record-linkage techniques, which are query-agnostic.

Our work makes the following contributions: (1) We formalize the problem of deriving optimal explanations for the differences in the results of semantically similar queries over disjoint datasets. (2) We design a 3-stage framework for solving the optimal explanation problem. (3) We develop a smart-partitioning optimizer that improves the efficiency of the framework by orders of magnitude. (4) We experiment with real-world and synthetic data to demonstrate that Explain3D derives precise explanations efficiently.
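To illustrate the core idea, here is a minimal, hypothetical sketch (not Explain3D's actual algorithm): two datasets with different schemas answer the same semantic question, their results disagree, and comparing the provenance (the rows contributing to each answer) localizes the cause of the disagreement. All dataset contents, query definitions, and function names below are illustrative assumptions.

```python
# Dataset A: (name, dept, enrolled) -- one schema for student records.
ds_a = [
    ("Alice", "CS", True),
    ("Bob",   "CS", True),
    ("Carol", "EE", True),
]

# Dataset B: (student, department, status) -- a different schema,
# overlapping but not identical content.
ds_b = [
    ("Alice", "CS", "active"),
    ("Carol", "EE", "active"),
]

def query_a(rows):
    """Semantic question: enrolled students per department.
    Returns provenance: dept -> contributing student names."""
    prov = {}
    for name, dept, enrolled in rows:
        if enrolled:
            prov.setdefault(dept, []).append(name)
    return prov

def query_b(rows):
    """The semantically similar query phrased against B's schema."""
    prov = {}
    for student, department, status in rows:
        if status == "active":
            prov.setdefault(department, []).append(student)
    return prov

def explain_disagreement(prov_a, prov_b):
    """For each group where the aggregate answers differ, report the
    provenance rows present in only one dataset -- these mismatches
    are candidate explanations for the disagreement."""
    expl = {}
    for dept in set(prov_a) | set(prov_b):
        a = set(prov_a.get(dept, []))
        b = set(prov_b.get(dept, []))
        if len(a) != len(b):
            expl[dept] = {"only_in_A": sorted(a - b),
                          "only_in_B": sorted(b - a)}
    return expl

print(explain_disagreement(query_a(ds_a), query_b(ds_b)))
# The CS counts disagree (2 vs. 1); the explanation points to "Bob",
# who appears only in dataset A's provenance.
```

The point of the sketch is that the queries themselves narrow the comparison to the provenance that matters, rather than attempting a query-agnostic match of the full datasets.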

