What Makes a Good Dataset for Symbol Description Reading?

04/17/2023
by Karol Lynch, et al.

Mathematical formulas are commonly used as concise representations of a document's key ideas. Correctly interpreting these formulas, by identifying mathematical symbols and extracting their descriptions, is an important task in document understanding. This paper makes the following contributions to the mathematical identifier description reading (MIDR) task: (i) it introduces the Math Formula Question Answering Dataset (MFQuAD), which contains 7508 annotated identifier occurrences; (ii) it describes novel variations of the noun phrase ranking approach for the MIDR task; (iii) it reports experimental results for the state-of-the-art (SOTA) noun phrase ranking approach and for the novel variations, providing insights into the problem and a performance baseline; (iv) it takes a position on the features that make a dataset effective for the MIDR task.

research
08/07/2020

Textual Description for Mathematical Equations

Reading of mathematical expression or equation in the document images is...
research
04/26/2022

Symlink: A New Dataset for Scientific Symbol-Description Linking

Mathematical symbols and descriptions appear in various forms across doc...
research
04/20/2018

Phrase-Indexed Question Answering: A New Challenge for Scalable Document Comprehension

The current trend of extractive question answering (QA) heavily relies o...
research
07/19/2022

PiC: A Phrase-in-Context Dataset for Phrase Understanding and Semantic Search

Since BERT (Devlin et al., 2018), learning contextualized word embedding...
research
03/03/2023

Discovery and Recognition of Formula Concepts using Machine Learning

Citation-based Information Retrieval (IR) methods for scientific documen...
research
12/20/2019

SberQuAD – Russian Reading Comprehension Dataset: Description and Analysis

SberQuAD—a large scale analog of Stanford SQuAD in the Russian language—...
research
02/25/2021

IIE-NLP-Eyas at SemEval-2021 Task 4: Enhancing PLM for ReCAM with Special Tokens, Re-Ranking, Siamese Encoders and Back Translation

This paper introduces our systems for all three subtasks of SemEval-2021...
