Towards Math-Aware Automated Classification and Similarity Search of Scientific Publications: Methods of Mathematical Content Representations

10/08/2021
by Michal Růžička, et al.

In this paper, we investigate mathematical content representations suitable for automated classification and similarity search of STEM documents using standard machine learning algorithms: Latent Dirichlet Allocation (LDA) and Latent Semantic Indexing (LSI). The methods are evaluated on a subset of arXiv.org papers, with the Mathematics Subject Classification (MSC) as the reference classification, using the standard precision, recall, and F1-measure metrics. The results give insight into how different math representations may influence the performance of classification and similarity search in STEM repositories. Unsurprisingly, machine learning methods are able to capture distributional semantics from textual tokens. A proper selection of weighted tokens representing math may improve the quality of the results slightly. A structured math representation that applies successful text-processing techniques to math is shown to yield better results than flat TeX tokens.
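To make the evaluation metrics concrete, here is a minimal sketch of how precision, recall, and F1 could be computed when predicted MSC codes are compared against reference MSC codes. It is an illustration only, not the authors' evaluation code: the micro-averaging scheme, the function name `micro_prf`, the paper identifiers, and the label assignments are all assumptions made for this example.

```python
from typing import Dict, Set, Tuple


def micro_prf(
    reference: Dict[str, Set[str]],
    predicted: Dict[str, Set[str]],
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over multi-label documents.

    Assumption: each document may carry several top-level MSC codes, and
    counts are pooled over all documents before the ratios are taken.
    """
    tp = fp = fn = 0
    for doc_id, ref_labels in reference.items():
        pred_labels = predicted.get(doc_id, set())
        tp += len(ref_labels & pred_labels)   # codes correctly predicted
        fp += len(pred_labels - ref_labels)   # codes predicted but not in reference
        fn += len(ref_labels - pred_labels)   # reference codes that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Hypothetical documents labelled with top-level MSC codes.
reference = {
    "paper-a": {"11", "14"},
    "paper-b": {"68"},
    "paper-c": {"35", "65"},
}
predicted = {
    "paper-a": {"11"},
    "paper-b": {"68", "03"},
    "paper-c": {"35", "65"},
}

precision, recall, f1 = micro_prf(reference, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

In this toy data, four codes are predicted correctly, one is spurious, and one is missed, so all three measures come out at 0.80. A macro-averaged variant, which averages per-class scores instead of pooling counts, would weight rare MSC classes more heavily.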

research
08/24/2021

More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models

Traditionally, Latent Dirichlet Allocation (LDA) ingests words in a coll...
research
05/25/2020

AutoMSC: Automatic Assignment of Mathematics Subject Classification Labels

Authors of research papers in the fields of mathematics, and other math-...
research
09/02/2021

Towards Explaining STEM Document Classification using Mathematical Entity Linking

Document subject classification is essential for structuring (digital) l...
research
06/28/2015

Topic2Vec: Learning Distributed Representations of Topics

Latent Dirichlet Allocation (LDA) mining thematic structure of documents...
research
09/21/2017

Retrofitting Concept Vector Representations of Medical Concepts to Improve Estimates of Semantic Similarity and Relatedness

Estimation of semantic similarity and relatedness between biomedical con...
research
05/22/2020

Classification and Clustering of arXiv Documents, Sections, and Abstracts, Comparing Encodings of Natural and Mathematical Language

In this paper, we show how selecting and combining encodings of natural ...
research
10/27/2020

The Search for Equations - Learning to Identify Similarities between Mathematical Expressions

On your search for scientific articles relevant to your research questio...
