WebMIaS on Docker: Deploying Math-Aware Search in a Single Line of Code

by Dávid Lupták et al.
Masarykova univerzita

Math information retrieval (MIR) search engines are absent from widespread production use, even though documents in the STEM fields contain many mathematical formulae, which are sometimes more important than the text for understanding. We have developed and open-sourced the WebMIaS MIR search engine, which has been successfully deployed in the European Digital Mathematics Library (EuDML). However, its deployment is complex and therefore difficult to automate. Moreover, the solutions developed so far to tackle this challenge are imperfect in terms of speed, maintenance, and robustness. In this paper, we describe the virtualization of WebMIaS using Docker, which solves all three problems and allows anyone to deploy containerized WebMIaS in a single line of code. The publicly available Docker image will also help the community push the development of math-aware search engines in the ARQMath workshop series.




1 Introduction

At first glance, searching for math formulae does not appear to be a task for search engines. Text retrieval is dominant among search engines, while math-awareness is a specialized area in the field of information retrieval: Springer’s LaTeX Search, the MathWebSearch of zbMATH Open (formerly known as Zentralblatt MATH), and the Math Indexer and Searcher (MIaS) of the European Digital Mathematics Library (EuDML) are all examples of systems with math-aware search deployed in production. Our MIaS search engine [mir:MIaSNTCIR-11short] runs on the industry-grade, robust, and highly scalable full-text search engine Apache Lucene with our own preprocessing of mathematical formulae.

Figure 1: The architecture of MIaS, with the indexing and searching phases overlapping at the Lucene index. Besides standard text processing, the math input from the indexing stage (a document) and the searching stage (a query) is canonicalized, ordered, tokenized, and unified, and then returned to the indexer and searcher modules, respectively.

The text is tokenized and stemmed to unify inflected word forms, whereas math is expected to be in the MathML format, which is then canonicalized, ordered, tokenized, and unified (see Figure 1).

Figure 2: Searching text and formulae with a single mixed query in WebMIaS.

To provide a web user interface for MIaS, we have developed and open-sourced the WebMIaS [mir:webmias2014short, mir:MIaSNTCIR-11short] search engine. In WebMIaS, users can input mixed queries that combine text and math, with native support for LaTeX and MathML. Matches are conveniently highlighted in the search results. The user interface of WebMIaS is shown in Figure 2.

Although the (Web)MIaS system has already been deployed in the European Digital Mathematics Library (EuDML), its complicated deployment process might be an obstacle to more widespread deployment in other digital mathematics libraries that already use MathML markup or can be extended to use it. To solve this problem, we describe the virtualization of WebMIaS using Docker [boettiger2015introduction], which allows anyone to deploy WebMIaS in a single line of code. Whether you have an open-access repository such as DSpace, or just a number of mathematical documents, you can benefit from the math-aware search provided by WebMIaS. For testing, we also provide the MREC dataset [dml:liska2011short].

In the rest of our paper, we will describe our deployment process in Section 2, evaluate the speed and quality of WebMIaS in Section 3, and conclude in Section 4.

2 Deployment process description

All modules of the MIaS system are Java projects, so users first need to 1) install the Java environment prerequisites and then 2) build the respective system modules. The next step in the process is to 3) index a dataset of mathematical documents using the command-line interface of MIaS. Finally, the users can 4) run Apache Tomcat with the WebMIaS servlet as a user interface.

Over the years, we have attempted to automate the above steps into running a single Makefile or Jupyter Notebook. However, these solutions were slow, fragile, and hard to maintain. We propose a better solution using lightweight virtualization via Docker with instant deployment, a short but powerful Dockerfile configuration, and a complete workflow that automates all the steps of the deployment process. Moreover, GitHub Actions provide continuous integration and automate the publishing of Docker images to Docker Hub.

Figure 3: The continuous integration of WebMIaS and the build times of the respective packages: MathMLCan canonicalizes different MathML encodings of equivalent formulae. MathMLUnificator generalizes distinct mathematical formulae so that they can be structurally unified. MIaSMath adds math processing capabilities to Lucene or Solr. MIaS indexes text with math in Lucene/Solr-based full-text search engines. Finally, WebMIaS provides a web interface for MIaS.

Both MIaS and WebMIaS are containerized into separate Docker images named miratmu/mias and miratmu/webmias, respectively. This allows users to run both the indexing and the retrieval without any specific configuration of the environment. The continuous integration workflow resolves the dependencies and builds all modules (see Figure 3), so users receive Docker images with everything prebuilt. After downloading a dataset to the working directory, users can index the dataset directory into the index directory using MIaS, see Listing 1.

    $ wget https://mir.fi.muni.cz/MREC/MREC2011.4.439.tar.bz2
    $ mkdir dataset ; tar xj -f MREC2011.4.439.tar.bz2 -C dataset
    $ docker run -v "$PWD"/dataset:/dataset:ro -v "[...]
    $ docker run -v "$PWD"/dataset:/dataset:ro -v "$PWD"/index:/index:ro --rm --name webmias -d -p [...] miratmu/webmias

Listing 1: Downloading and indexing the MREC2011.4 dataset for WebMIaS (lines 1–3), and deploying WebMIaS in a single line of code (line 4).

Table 1: The linear indexing speed on the MREC dataset using 448 GB of RAM and eight Intel Xeon™ X7560 2.26 GHz CPUs.

                     Mathematical (sub)formulae      Indexing time (min)
Documents            Input          Indexed          Real (wall clock)    CPU
10,000 (2.28 %)      3,406,068      64,008,762       35.75 (2.05 %)       35.05
100,000 (22.76 %)    36,328,126     670,335,243      384.44 (22.00 %)     366.54
439,423 (100 %)      158,106,118    2,910,314,146    1,747.16 (100 %)     1,623.22

Table 2: Quality evaluation results on the NTCIR-11 Math-2 dataset. The mean average precision (MAP) and the precisions at ten (P@10) and five (P@5) are reported for queries formulated using Presentation MathML (PMath), Content MathML (CMath), a combination of both (PCMath), and LaTeX. Two different relevance judgement levels, 1 (partially relevant) and 3 (relevant), were used to compute the measures. The number between slashes (/ /) is our rank among all teams of the NTCIR-11 Math-2 Task.

Measure    Level    PMath     CMath         PCMath        LaTeX
MAP        3        0.3073    0.3630 /1/    0.3594        0.3357
P@10       3        0.3040    0.3520 /1/    0.3480        0.3380
P@5        3        0.5120    0.5680 /1/    0.5560        0.5400
P@10       1        0.5020    0.5440        0.5520 /1/    0.5400

Finally, users can deploy WebMIaS in a single line of code that mounts the dataset and index directories into a container named webmias listening on TCP port 8888 of the localhost. WebMIaS will then be available at http://localhost:8888/WebMIaS.
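Because the container is started in the background, the web application may need a moment before it answers requests. The following is a minimal sketch of how one might wait for it, assuming curl is installed; the helper name wait_for_url and its retry parameters are hypothetical and not part of WebMIaS.

```shell
# Poll a URL until it answers without an HTTP error, or give up.
# Usage: wait_for_url URL [ATTEMPTS] [DELAY_SECONDS]
# (wait_for_url is a hypothetical helper, not shipped with WebMIaS.)
wait_for_url() {
  url=$1
  attempts=${2:-30}
  delay=${3:-2}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f: fail on HTTP errors, -s: silent, -o /dev/null: discard the body
    if curl -fs -o /dev/null --max-time 5 "$url"; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}
```

For example, `wait_for_url http://localhost:8888/WebMIaS && echo "WebMIaS is up"` would print the message only once the address given above responds.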

3 Evaluation

We performed a speed evaluation of MIaS on the MREC dataset [dml:liska2011short] (see Table 1), and a quality evaluation on the NTCIR-10 Math [mir:NTCIR-10-Overview, MIR:MIRMUshort], NTCIR-11 Math-2 [NTCIR11Math2overviewshort, mir:MIaSNTCIR-11short] (see Table 2), NTCIR-12 MathIR [ZanibbiEtAl16NTCIR, RuzickaSojkaLiska16Mathshort], and ARQMath 2020 [zanibbi2020overview, novotny2020three] datasets. We also measured the time to deploy WebMIaS without Docker (see Figure 3).
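As a reminder of how the precision figures in Table 2 are read, precision at k is the fraction of the top k results judged relevant at a given level. The following is a minimal sketch, not the official NTCIR evaluation tooling: the function name precision_at_k, the one-grade-per-line input format, and the treatment of the level as a minimum grade are all assumptions made for illustration.

```shell
# precision_at_k K MIN_GRADE
# Reads one relevance grade per line from standard input, in ranked order,
# and prints the fraction of the first K results whose grade is at least
# MIN_GRADE. (Hypothetical helper; the input format is an assumption.)
precision_at_k() {
  awk -v k="$1" -v min="$2" '
    NR <= k && $1 >= min { hits++ }
    END { printf "%.4f\n", hits / k }
  '
}
```

For a ranked list with the made-up grades 4, 0, 3, 1, 0, running `printf '4\n0\n3\n1\n0\n' | precision_at_k 5 3` counts two hits out of five, while lowering the minimum grade to 1 counts three.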

The speed evaluation shows that the indexing time of our system is linear in the number of indexed documents and that the average query time is 469 ms. Additionally, the dockerization of WebMIaS reduces the deployment time from about 10 minutes to a matter of seconds. With respect to quality evaluation, MIaS has notably won the NTCIR-11 Math-2 task.
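The linearity claim can be checked against the figures in Table 1 by dividing the wall-clock indexing time by the number of documents. The sketch below does this with awk; the document counts and times are copied from Table 1, while the variable and function names are only illustrative.

```shell
# Number of documents and wall-clock indexing time in minutes (Table 1).
table1='10000 35.75
100000 384.44
439423 1747.16'

# Print the wall-clock seconds spent per document for each row.
seconds_per_document() {
  printf '%s\n' "$table1" | awk '{ printf "%d %.4f\n", $1, $2 * 60 / $1 }'
}
```

Calling `seconds_per_document` gives roughly 0.21 s per document for 10,000 documents and roughly 0.24 s per document for the full 439,423 documents, so the per-document cost stays nearly constant as the collection grows.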

4 Conclusion

An open-source environment brings reproducibility and the possibility of trying out projects of interest without limitations. However, installation instructions are often hard to follow, with many prerequisites and possible conflicts with the operating environment already in use. Automation tools, continuous integration, and package virtualization ease the development process. With this motivation and in the hope of helping the math community, we have dockerized our math-aware web search engine WebMIaS. As a result, anyone can now deploy WebMIaS in a single line of code. The software is accessible and at the fingertips of the math community, see https://github.com/MIR-MU/WebMIaS.