Dataset vs Reality: Understanding Model Performance from the Perspective of Information Need
Deep learning technologies have produced many models that outperform human beings on a few benchmarks. An interesting question is: can these models solve real-world problems that share similar settings (e.g., the same input/output) with the benchmark datasets? We argue that a model is trained to answer the same information need for which its training dataset was created. Although some datasets may share high structural similarities, e.g., question-answer pairs for the question answering (QA) task and image-caption pairs for the image captioning (IC) task, not all datasets are created for the same information need. To support our argument, we conduct a comprehensive analysis of widely used benchmark datasets for both the QA and IC tasks. We compare their creation processes (e.g., crowdsourcing versus collecting data from real users or content providers) from the perspective of information need in the context of information retrieval. To show the differences between datasets, we perform both word-level and sentence-level analyses. We show that data collected from real users or content providers tend to contain richer, more diverse, and more specific words than data annotated by crowdworkers. At the sentence level, crowdsourced data share similar dependency distributions and exhibit higher similarity in sentence structure than data collected from content providers. We believe our findings could partially explain why, for similar tasks, some datasets are considered more challenging than others. They may also help guide the construction of new datasets.
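To make the word-level and sentence-level comparisons concrete, the following is a minimal sketch of the kind of analysis described above: it measures lexical diversity (type-token ratio) and dependency-label distributions for two small groups of texts. The spaCy pipeline, the helper names, and the toy example sentences are assumptions for illustration only, not the paper's actual setup.

```python
# Illustrative sketch: compare lexical diversity and dependency-label
# distributions between two groups of texts (e.g., crowdsourced vs.
# collected from real users). Toy data; not the paper's method.
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed


def type_token_ratio(texts):
    """Rough word-level diversity: unique word types / total alphabetic tokens."""
    tokens = [tok.text.lower() for doc in nlp.pipe(texts) for tok in doc if tok.is_alpha]
    return len(set(tokens)) / max(len(tokens), 1)


def dependency_distribution(texts):
    """Sentence-level view: relative frequency of dependency labels."""
    labels = Counter(tok.dep_ for doc in nlp.pipe(texts) for tok in doc)
    total = sum(labels.values())
    return {dep: count / total for dep, count in labels.items()}


if __name__ == "__main__":
    # Hypothetical examples of crowdsourced questions vs. a real-user question.
    crowdsourced = ["What color is the ball?", "How many dogs are there?"]
    real_user = ["Why does my 2014 Civic stall when idling in cold weather?"]

    print("TTR (crowdsourced):", type_token_ratio(crowdsourced))
    print("TTR (real users):  ", type_token_ratio(real_user))
    print("Dep. dist. (crowdsourced):", dependency_distribution(crowdsourced))
    print("Dep. dist. (real users):  ", dependency_distribution(real_user))
```

Comparing such distributions across dataset groups is one simple way to quantify the differences in vocabulary richness and sentence structure that the abstract refers to.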