Technical documents typically include meta components such as figures, tables, mathematical formulas, and pseudo-code to communicate complex ideas and results effectively. We define unnatural language as blocks of lines that consist only of such components, as opposed to the body text, which is natural language.
Many capable NLP tools are available as the field has advanced. However, these tools are mostly built for input text that is natural language. Because unnatural language can badly confuse them, it is necessary to distinguish unnatural language blocks from natural language blocks. Once we salvage the natural language blocks from a document, we can apply NLP tools to the kind of input they were designed for. The problem is more pronounced in technical documents, which have a higher ratio of unnatural language than non-technical documents such as essays and novels.
Document layout analysis, which aims to identify the document format by classifying blocks into text, figures, tables, and so on, is a long-studied problem [O’Gorman1993, Simon et al.1997]. Most previous work has focused on image-based documents in PDF and OCR formats and has applied geometric analysis to pages using visual cues from their layout. This has been an important problem in many NLP and IR applications.
This work was motivated by our attempt to cluster teaching documents (e.g., lecture slides and reading materials from courses) on technical topics. We discovered that unnatural language blocks introduced significant noise into clustering, causing spurious matches between documents. For example, code consists of reserved programming keywords and variable names. Two documents can contain very different code, yet their cosine similarity is high because they share many terms by programming convention (Figure 1). [Kohlhase and Sucan2006] similarly recognized this problem when describing the main challenges of semantic search for mathematical formulas: (1) Mathematical notation is context-dependent; without the human ability to understand a formula from its context, formulas are just noise. (2) Identical presentations can stand for multiple distinct mathematical objects.
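The spurious-match effect described above can be illustrated with a minimal sketch. The two code snippets and two prose sentences below are made-up examples, not taken from our corpora, and the tokenizer (lowercased alphabetic tokens) is an assumption chosen for brevity.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Term-frequency vector over lowercased alphabetic tokens (assumed tokenizer)."""
    return Counter(re.findall(r"[a-z_]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical snippets: a swap step from a sort, and a sum of positive entries.
code_a = "for (int i = 0; i < n; i++) { if (a[i] > a[i + 1]) { int tmp = a[i]; a[i] = a[i + 1]; a[i + 1] = tmp; } }"
code_b = "for (int i = 0; i < n; i++) { if (a[i] > 0) { int total = total + a[i]; } }"
# Hypothetical prose describing the same two topics.
prose_a = "Bubble sort repeatedly swaps adjacent elements that are out of order."
prose_b = "This loop adds up the positive entries of the array."

code_sim = cosine(bag_of_words(code_a), bag_of_words(code_b))
prose_sim = cosine(bag_of_words(prose_a), bag_of_words(prose_b))
```

Here `code_sim` is much larger than `prose_sim`, even though the two snippets do different things: the shared keywords and variable names (`for`, `int`, `i`, `a`) dominate the vectors.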
This paper proposes a new approach for identifying unnatural language blocks in plain text and classifying them into four categories: (1) table, (2) code, (3) mathematical formula, and (4) miscellaneous (misc.). Text is extracted from technical documents in PDF, PPT, and HTML formats with little to no explicit visual layout information preserved. We focus on technical documents because they contain a significant amount of unnatural language (26.3% and 16% in our two corpora). Specifically, we focus on documents in slide formats, which have been relatively under-explored.
We further study how the removal of unnatural language improves two NLP tasks, document similarity and document clustering. Our experiments show that clustering documents with unnatural language removed yields consistently higher accuracy than clustering the original documents in many settings, with maximum improvements of up to 15% and 11% on two datasets, while never significantly hurting the original clustering.
2 Related Work
2.1 Table Extraction
Various efforts have been made toward table extraction using semi-supervised learning on the patterns of table layouts within ASCII text documents [Ng et al.1999], web documents [Pinto et al.2003, Lerman et al.2001, Zanibbi et al.2004], and PDF and OCR image documents [Clark and Divvala2015, Liu et al.2007]. Existing techniques exploit graphical features such as primitive geometric shapes, symbols, and lines to detect table borders. [Khusro et al.2015] introduces and compares state-of-the-art techniques for table extraction from PDF articles. However, there does not appear to be any work that attempts to process plain text extracted from richer formats, where table layouts are not preserved.
2.2 Formula Extraction
[Lin et al.2011] categorized existing approaches to mathematical formula detection as ‘character-based’ or ‘layout-based’ according to their key features. [Chan and Yeung2000] provides a comprehensive survey of mathematical formula extraction using various layout features available in image-based documents. Since we have no access to layout information, character-based approaches are more relevant to our work. They use features of mathematical symbols and operators, together with their positions and character sizes [Suzuki et al.2003, Kacem et al.2001].
2.3 Code Extraction
[Tuarob et al.2013] proposed three pseudo-code extraction methods: a rule-based method, a machine learning method, and a combined method. Their rule-based approach detects the presence of pseudo-code captions using keyword matching. The machine learning approach detects a box surrounding a sparse region and classifies whether the box is pseudo-code. They extracted four groups of features: font-style based, context based, content based, and structure based.
Table 1: Datasets used in this work.
| Purpose | Dataset | Description |
| Classification Training | T | 35 lecture slides (8,514 lines) whose components are annotated |
| | T | 35 ACL papers (25,686 lines) whose components are annotated |
| | T | Combination of T and T |
| Word Embedding Training | T | |
| Clustering | C | 128 lecture slides from ‘data structure’ and ‘algorithm’ classes |
| | C | 300 lecture slides from ‘operating system’ classes |
3 Problem Definition
Input to our task is the plain text extracted from PDF or PPT documents. The goal is to assign a class label to each line in that plain text, identifying it as natural language (regular text) or as one of four types of non-natural language components: table, code, formula, or miscellaneous text. We focus on these four types because our observations lead us to believe they are the most frequently occurring components in PPT lecture slides and PDF articles. Figures are also a very frequent component, but we do not consider them because they are commonly pictures or drawings and cannot easily be extracted as text. In this section, we briefly discuss the characteristics of each component and the challenges in identifying it from the raw text.
3.1 Table
Tables are prevalent in almost every domain of technical documents. A table is usually conveyed by its two-dimensional layout and its column and/or row headings [Khusro et al.2015]. Figure 2 shows a table in an original PDF document and the same table as it appears in the text extracted by Apache Tika (https://tika.apache.org/). Tables frequently have multiple cells merged for layout, which makes them particularly difficult to recognize as tables once they are converted to flat text.
3.2 Mathematical Formula
Mathematical formulas appear in two forms: isolated formulas on their own lines and formulas embedded within a line of text. In this work, we treat both forms as formula components. Because not all math symbols can be mapped to Unicode characters and because the extraction software may not convert them correctly, the extracted text tends to contain oddly formatted or even completely wrong characters. Superscripts and subscripts are no longer distinguishable, and the original visual layout (e.g., math symbols that span multiple lines) is lost.
3.3 Code
Articles in Computer Science and related fields often contain pseudo-code or actual program code to illustrate their algorithms. Like mathematical formulas, code appears both isolated and embedded, though most code components are isolated code blocks. As with formulas, we treat both forms as code components. Although some extraction tools preserve indentation, one of the strongest visual cues for code, we assume that it is not preserved in the extracted text, so that our approach does not depend on the behavior of a particular text extraction tool.
3.4 Miscellaneous Non-text (Misc.)
In addition to the components mentioned above, there are other types of non-natural language blocks that are left during conversion to text and that may provide spurious sub-topic matches between documents. To allow for those, we denote those components as miscellaneous text. One example of miscellaneous text is the text and caption that are part of the diagrams in slides. Figure 3 shows an example of miscellaneous text that lost its structure and meaning while being converted to text without the original diagram.
4 Dataset
4.1 Data Collection
We collected 1,561 lecture slides from various Computer Science and Electrical Engineering courses that are available online, and 5,898 academic papers from several years of the ACL/EMNLP archive (https://aclweb.org/anthology). We divided the data for several purposes: training the classification model, training the word embedding model for feature extraction, and clustering for extrinsic evaluation. The details of the datasets are summarized in Table 1. We make the data publicly available for download at http://people.cs.umass.edu/~mhjang/publications.html.
For classification, we constructed three datasets from two data sources: (1) lecture slides, (2) ACL papers, and (3) the combination of both. We chose these two data sources because they have different ratios of unnatural language components and therefore complement each other in coverage. Table 2 shows the ratio of the four components in each annotated dataset. For example, 1.4% of the lines in T are annotated as part of a table.
4.2 Text Extraction
We extracted plain text from our datasets using an open-source software package, Apache Tika. The package is available for text extraction from various formats including PDF, PPT, and HTML.
4.3 Annotation
To train a statistical model, we need ground-truth data. We created annotation guidelines for the four types of non-natural language components and annotated 35 lecture slides (7,943 lines) and 35 ACL proceedings papers (25,686 lines). We developed a simple annotation tool to support the task and to enforce that annotators follow certain rules (the guidelines and the tool are available at http://people.cs.umass.edu/~mhjang/publication.html). We hired four undergraduate annotators with knowledge of the Computer Science domain for this task.
5 Features
We find that line-based prediction is more advantageous than token-based prediction because it allows us to observe the syntactic structure of the line, how statistically common its grammatical structure is, and how its layout pattern compares to those of neighboring lines. We introduce five sets of features used to train our classifier and discuss each feature’s impact on accuracy.
5.1 N-gram (N)
Unigrams and bigrams of each line are included as features.
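As a minimal sketch of this feature set, the function below lists the unigrams and bigrams of one line. Whitespace tokenization and the function name `ngram_features` are assumptions made for illustration; the tokenizer actually used is not specified here.

```python
def ngram_features(line):
    """Return the unigrams followed by the bigrams of a line.

    Tokens are split on whitespace, which is an assumption of this sketch.
    """
    tokens = line.split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

features = ngram_features("int i = 0")
```

In a classifier, each returned string would typically become one sparse binary or count feature.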
5.2 Parsing Features (P)
Unnatural language is unlikely to form any grammatical structure. When we attempt to parse a line of unnatural language, the resulting parse tree has an unusual syntactic structure. To capture this insight, we parsed each line using the dependency parser in ClearNLP [Choi and McCallum2013] and extracted features such as the set of dependency labels, the ratio of each POS tag, and the POS tags of each dependent-head pair from each parse tree.
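The three kinds of parsing features named above can be sketched as follows. This sketch does not call ClearNLP; it assumes the parser output has already been converted to a list of `(pos, head_index, dep_label)` tuples, a hypothetical format chosen here for illustration, and the feature names are likewise made up.

```python
from collections import Counter

def parse_features(parse):
    """Build parsing features from one parsed line.

    parse: list of (pos, head_index, dep_label) tuples, one per token, where
    head_index is the 0-based index of the head token and -1 marks the root.
    This input format is an assumption of the sketch.
    """
    feats = {}
    n = len(parse)
    # Set of dependency labels that occur in the line.
    for label in {dep for _, _, dep in parse}:
        feats["dep=" + label] = 1.0
    # Ratio of each POS tag among the tokens of the line.
    for pos, count in Counter(pos for pos, _, _ in parse).items():
        feats["pos_ratio=" + pos] = count / n
    # POS tags of each dependent-head pair.
    for pos, head, _ in parse:
        head_pos = parse[head][0] if head >= 0 else "ROOT"
        feats["pair=" + pos + ">" + head_pos] = 1.0
    return feats

# Hypothetical parse of a three-token line such as "The parser works".
example = [("DT", 1, "det"), ("NN", 2, "nsubj"), ("VBZ", -1, "root")]
example_feats = parse_features(example)
```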
5.3 Table String Layout (T)
Text extracted from tables loses its visual layout as a table but still preserves an implicit layout through its string patterns. Tables tend to convey the same type of data along the same column or row. For example, if a column in a table reports numbers, numeral tokens are likely to appear at the same position in consecutive lines of the table. Hence, a block of lines is more likely to be a table if the lines share the same pattern. We encode each line by replacing each token with either S (String) or N (Numeral). We then compute the edit distance between neighboring lines, weighted by the language modeling probability computed from a table corpus (Equations 1 and 2).
where $l_i$ refers to the i-th line in a document and $t_{ij}$ refers to the j-th token in $l_i$.
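The encoding and comparison steps can be sketched as below. This is a simplified illustration: it uses plain, unweighted Levenshtein distance and omits the language-model weighting, and the numeral pattern `NUMERAL` is an assumption of the sketch.

```python
import re

# Assumed definition of a numeral token: optional sign, digits with optional
# thousands separators, optional decimal part, optional percent sign.
NUMERAL = re.compile(r"^[+-]?\d[\d,]*(\.\d+)?%?$")

def encode(line):
    """Replace each whitespace-separated token with N (numeral) or S (string)."""
    return "".join("N" if NUMERAL.match(tok) else "S" for tok in line.split())

def edit_distance(a, b):
    """Unweighted Levenshtein distance between two encoded lines."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

# Hypothetical neighboring lines: two table rows and one sentence.
row_1 = encode("W-Random 12.5 3.1")
row_2 = encode("CLM 29.09 68")
sentence = encode("We first introduce two baselines")
```

Two rows of the same table encode to the same pattern (`SNN`) and have distance 0, while a table row and a sentence are far apart.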
5.4 Word Embedding Feature (E)
We train word embeddings using word2vec [Mikolov et al.2013]. The training corpus contained 278,719 words. Since we perform line-based prediction, we need a vector that represents the line rather than each word. We consider three ways of computing a line embedding vector: (1) averaging the vectors of the words in the line, (2) computing the paragraph vector introduced in [Le and Mikolov2014], and (3) using both.
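Option (1), averaging word vectors, can be sketched in a few lines. The toy embedding table, the lowercasing, the whitespace tokenization, and the choice to skip out-of-vocabulary tokens are all assumptions of this sketch; in practice the vectors would come from the trained word2vec model.

```python
def line_vector(line, embeddings, dim):
    """Average the vectors of the in-vocabulary tokens of a line.

    embeddings: dict mapping token -> list of floats of length dim.
    Lines with no known token map to the zero vector (an assumption).
    """
    vectors = [embeddings[tok] for tok in line.lower().split() if tok in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Hypothetical two-dimensional embeddings for illustration only.
toy_embeddings = {"for": [1.0, 0.0], "loop": [0.0, 1.0]}
averaged = line_vector("For loop unknown_token", toy_embeddings, dim=2)
```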
5.5 Sequential Feature (S)
The sequential nature of the lines is also an important feature because a component most likely spans a block of contiguous lines. We train two models. The first model uses the annotated class of the previous line. We then train a second model that uses the previous line’s predicted label, which is the output of the first model.
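One way to realize this feature is to add the previous line's label to each line's feature dictionary. The sketch below shows only that step; the feature naming and the `<START>` marker for the first line are assumptions. The first model would be trained with gold labels passed as `labels`, and the second with the first model's predictions passed instead.

```python
def add_previous_label(features_per_line, labels):
    """Return copies of the per-line feature dicts with a previous-label feature.

    labels: one label per line, either annotated (first model) or predicted
    by the first model (second model).
    """
    augmented_lines = []
    previous = "<START>"  # assumed marker for the first line of a document
    for feats, label in zip(features_per_line, labels):
        augmented = dict(feats)
        augmented["prev_label=" + previous] = 1.0
        augmented_lines.append(augmented)
        previous = label
    return augmented_lines

# Hypothetical features and labels for three consecutive lines.
line_feats = [{"w=Method": 1.0}, {"w=CLM": 1.0}, {"w=We": 1.0}]
with_prev = add_previous_label(line_feats, ["TABLE", "TABLE", "TEXT"])
```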
|PC-CB [Tuarob et al.2013]*||N/A||75.95||N/A||N/A||N/A||75.95||N/A||N/A|
6 Classification Experiments
We used the Liblinear Support Vector Machine (SVM) [Chang and Lin2011] classifier for training and ran 5-fold cross-validation for evaluation. To improve the robustness of structured prediction, we applied DAgger [Ross et al.2010], a learning-to-search algorithm, to the SVM. We first introduce two baselines against which we compare the accuracy of our statistical model.
6.1 Baselines
Since no existing work is directly applicable to our scenario, we consider two straightforward baselines.
Weighted Random (W-Random)
This baseline assigns a random component class to each line. Instead of predicting uniformly at random, it makes more educated guesses using the ratio of components known from the annotated dataset (Table 2).
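A minimal sketch of this baseline follows. The class ratios in `example_ratio` are placeholder values chosen for illustration, not the ratios from Table 2, and the fixed seed is only there to make the sketch reproducible.

```python
import random

def weighted_random_baseline(n_lines, class_ratio, seed=0):
    """Draw one class per line with probability proportional to class_ratio."""
    rng = random.Random(seed)
    classes = list(class_ratio)
    weights = [class_ratio[c] for c in classes]
    return rng.choices(classes, weights=weights, k=n_lines)

# Placeholder ratios (hypothetical, not the values reported in Table 2).
example_ratio = {"TEXT": 0.80, "TABLE": 0.05, "CODE": 0.08, "FORMULA": 0.02, "MISC": 0.05}
predictions = weighted_random_baseline(1000, example_ratio, seed=1)
```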
Component Language Modeling (CLM)
We build five language models from the annotations, one for each component class (the four non-textual components and the text component). For each line, we predict the component whose language model assigns the highest probability to the line.
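The CLM decision rule can be sketched with unigram language models. The model order and the add-one smoothing are assumptions of this sketch, since the text above does not specify them, and the tiny training set is made up.

```python
import math
from collections import Counter

def train_unigram_lms(labeled_lines):
    """Count tokens per component class; returns (counts per class, vocabulary)."""
    counts, vocab = {}, set()
    for line, label in labeled_lines:
        tokens = line.split()
        counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return counts, vocab

def predict_component(line, counts, vocab):
    """Return the class whose add-one-smoothed unigram LM scores the line highest."""
    vocab_size = len(vocab) + 1  # one extra slot for unseen tokens

    def log_prob(label):
        total = sum(counts[label].values())
        return sum(math.log((counts[label][tok] + 1) / (total + vocab_size))
                   for tok in line.split())

    return max(counts, key=log_prob)

# Hypothetical annotated lines for two of the five classes.
training = [
    ("for i in range", "CODE"),
    ("while i < n", "CODE"),
    ("we propose a new method", "TEXT"),
    ("the method is simple", "TEXT"),
]
lm_counts, lm_vocab = train_unigram_lms(training)
```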
6.2 Classification Result
We first conducted single-domain classification. The annotations within each dataset, T and T, are split for training and testing using a 5-fold cross-validation scheme. Table 3 reports the F1-scores for predicting the four components in the two datasets using our method and the baselines.
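For reference, a per-component F1-score over line labels can be computed as in the sketch below. The label names and the convention of returning 0.0 when precision and recall are both undefined or zero are assumptions of this sketch.

```python
def f1_for_label(gold, predicted, label):
    """F1-score of one component class over aligned gold/predicted line labels."""
    tp = sum(g == label and p == label for g, p in zip(gold, predicted))
    fp = sum(g != label and p == label for g, p in zip(gold, predicted))
    fn = sum(g == label and p != label for g, p in zip(gold, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels for four lines.
gold_labels = ["TEXT", "CODE", "CODE", "TABLE"]
predicted_labels = ["TEXT", "CODE", "TEXT", "TABLE"]
code_f1 = f1_for_label(gold_labels, predicted_labels, "CODE")
```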
The proposed method dramatically increased prediction accuracy for all components over the baselines. The CLM baseline showed its highest accuracy on code among the four categories in both datasets. Because pseudo-code uses a more controlled vocabulary (e.g., reserved words, common variable names), the language itself becomes a distinctive characteristic. We also include the numbers reported by [Tuarob et al.2013] for comparison. Since their dataset consisted of 258 scholarly articles in PDF, T is a more comparable dataset than T, although our training set is much smaller than theirs. However, their number reported in Table 3 is not directly comparable to the other numbers because it was obtained on a different dataset.
In T, the classification F1-score for formula is relatively low at 29.09%, both compared to the other components in the same dataset and compared to formula prediction in T (80.98%). This is due to the small amount of training data (formulas make up only 0.5% of T), a limitation that is overcome in T, where formulas make up 5% of the training data (see Table 2).
With the proposed method, classification of code and misc. was significantly improved in T (around 90%), while classification of table and formula was improved in T (over 80%). This shows the complementary nature of the two datasets and suggests that combining them into T would improve classification performance. Table 4 shows the multi-domain classification results using T, in which all four categories are identified with an F1-score higher than 80%.
6.3 Feature Analysis
We conducted a feature analysis to understand the impact of individual features and their combinations. We started from single features and incrementally combined them to observe the performance (Figure 5). Features were added greedily: the feature that gives the highest accuracy when used alone is added first.
We first compare the three ways of computing line embedding vectors described in Section 5.4 (Figure 4). With only embedding features, averaging word vectors performed 9-12 times better than paragraph vectors. When both were used, there were some gains for the Code and Misc. components and losses for Table and Formula. However, when all the other features were used in addition to the embedding features, the losses were covered by the other features, so that the combined vectors ultimately gave the highest overall performance.
N-gram (N) features were the most powerful, with an F1-score of 68% when used alone. For table, the next most useful features were parsing features (P), table layout (T), and embedding features (E), in that order, while for code, embedding vectors were more effective than parsing features (Figure 5).
7 Removal Effects of Unnatural Language on NLP tools
We observe how the removal of unnatural language from documents affects the performance of two NLP tools, document similarity and document clustering. For this set of experiments, we prepared a gold standard clustering for each dataset, C and C.
7.1 Document Similarity
If two documents are similar, they must be topically relevant to each other. A good similarity measure should reflect that; two topically relevant documents should have a high similarity score. To test whether the computed similarity reflects the actual topic relevance better once the unnatural language is removed, we conducted regression analysis.
We converted the gold standard clustering to pair-wise binary relevance. If two documents are in the same ground-truth cluster, they are relevant, and otherwise irrelevant. We then fitted a log-linear model in R for predicting binary relevance from the cosine similarity of document pairs.
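The conversion from a gold standard clustering to pair-wise binary relevance can be sketched as follows. The document identifiers and cluster names are hypothetical; the regression itself, which was fitted in R, is not reproduced here.

```python
from itertools import combinations

def pairwise_relevance(cluster_of):
    """Map each unordered document pair to 1 (same gold cluster) or 0.

    cluster_of: dict mapping document id -> gold cluster id.
    """
    docs = sorted(cluster_of)
    return {(a, b): int(cluster_of[a] == cluster_of[b])
            for a, b in combinations(docs, 2)}

# Hypothetical gold clustering of three lecture slides.
gold = {"d1": "sorting", "d2": "sorting", "d3": "graphs"}
relevance = pairwise_relevance(gold)
```

Each pair's label would then serve as the response variable, with the pair's cosine similarity as the predictor.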
Regression models fitted in R are evaluated using AIC [Akaike1974]. AIC is a measure for model selection that estimates the relative quality of statistical models learned from the given data. A smaller AIC indicates a better trade-off between goodness of fit and model complexity, where complexity is the number of parameters in the model. Table 5 shows that AIC was reduced by 53 and 118, respectively, for the models trained on documents whose unnatural language blocks were removed, compared to the original documents. Because AIC does not provide a test of a model, it says nothing about model quality in an absolute sense, only about relative quality. From this result, we conclude that cosine similarity fits a better model for predicting documents’ topical relevance after unnatural language blocks have been removed.
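For reference, the standard definition of AIC makes the trade-off between fit and complexity explicit. The symbols $k$ (number of estimated parameters) and $\hat{L}$ (maximized value of the likelihood) are notation introduced here for illustration.

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L}
```

For two models with the same number of parameters, as when the same regression is fitted on original and cleaned documents, a lower AIC corresponds directly to a higher maximized likelihood.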
7.2 Document Clustering
Comparing clustering performance on two document sets is tricky because clustering performance depends on many factors, e.g., the clustering algorithm, similarity function, document representation, and parameters. To safely claim that the clustering quality of one set of documents is better than that of the other, clustering on one set should consistently outperform the other under many different settings. To validate this, we perform clustering experiments under multiple settings, varying the document vector size and the initialization scheme.
In this experiment, we consider the seeded K-means clustering algorithm [Basu et al.2002] for teaching documents. In our application scenario, users initially submit a topic list (e.g., a syllabus) for the course. Lecture slides are then grouped into the given topic clusters. Depending on the level of user interaction, we consider a semi-interactive setting, where users provide only a topic list, and a fully-interactive setting, where users provide both a topic list and an answer document for each topic cluster to further specify the intended topic.
In the semi-interactive setting, topic keywords are sparse seeds, as they usually consist of two or three words. Therefore, we expand the topic keywords by retrieving the top-1 document for the keywords and using it as the seed. For the experiments, we simulate the fully-interactive setting; instead of having an actual user pick an answer document, we use an answer document randomly chosen from the gold cluster. The seeded K-means clustering algorithm with the three interactive seeding schemes is described in Algorithm 1.
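A minimal sketch of seeded K-means in the spirit of [Basu et al.2002] is given below. It is not a transcription of Algorithm 1: it assumes dense document vectors, cosine similarity for assignment, and mean vectors as centroids, and it treats every seeding scheme the same way, as one initial vector per topic cluster.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def seeded_kmeans(doc_vectors, seed_vectors, iterations=10):
    """Cluster documents with K-means whose centroids start at the given seeds.

    seed_vectors: one vector per topic cluster, e.g. built from topic keywords,
    from the top-1 retrieved document, or from a user-selected document.
    Returns the cluster index assigned to each document.
    """
    centroids = [list(seed) for seed in seed_vectors]
    assignment = []
    for _ in range(iterations):
        # Assign each document to the most similar centroid.
        new_assignment = [
            max(range(len(centroids)), key=lambda k: cosine(doc, centroids[k]))
            for doc in doc_vectors
        ]
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        # Recompute each centroid as the mean of its members.
        for k in range(len(centroids)):
            members = [d for d, a in zip(doc_vectors, assignment) if a == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

# Hypothetical two-dimensional document vectors and one seed per topic.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
seeds = [[1.0, 0.0], [0.0, 1.0]]
clusters = seeded_kmeans(docs, seeds)
```

Because the number of clusters and their initial positions come from the seeds, the only difference between the seeding schemes in this sketch is which vectors are passed as `seed_vectors`.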
We can consider a simulated setting more realistic when the selected document would be suggested to the user as the top or a near-top choice. In our datasets, 60% of the selected documents were ranked in the top 10 in C, and 13% of the selected documents were ranked in the top 10 in C, which implies that the simulated setting in C was more realistic than in C. For top-1 document seeding, 64% and 78% of the document seeds matched the gold standard in C and C, respectively.
Figure 6 shows the clustering results for the original documents (D) and the documents whose unnatural language blocks were removed (D), with three different seeding schemes over the two lecture slide datasets. In C, D consistently outperformed with all three seeding schemes. Clustering performed best with D when the top-1 document was used as the seed. Overall, in C, clustering improved 94% of the time, with a maximum absolute gain of 14.7% and an average absolute gain of 4.6%. In the 6% of cases where it hurt, the average absolute loss was 0.8%. In C, clustering improved 73% of the time, with a maximum absolute gain of 11.4% and an average absolute gain of 3.9%. The average absolute loss was 1.7%. Our results suggest that removing unnatural language blocks can significantly improve clustering most of the time, with gains larger than the occasional losses.
8 Conclusion
In this paper, we argued that unnatural language should be distinguished from natural language in technical documents for NLP tools to work effectively. We presented an approach for identifying four types of unnatural language blocks in plain text that does not depend on the document format. The proposed method extracts five sets of line-based textual features and achieves an F1-score above 82% for the four categories of unnatural language. We showed that existing NLP tools work better on documents once unnatural language is removed. Specifically, we demonstrated that removing unnatural language improved document clustering in many settings, by up to 15% and 11%, while not significantly hurting the original clustering in any setting.
Acknowledgments
This work was supported in part by the Center for Intelligent Information Retrieval, in part by NSF grant #IIS-0910884, and in part by NSF grant #IIS-1217281. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor. The authors thank Kenneth W. Church for providing valuable comments and advice.
References
- [Akaike1974] Hirotugu Akaike. 1974. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716–723, December.
- [Basu et al.2002] Sugato Basu, Arindam Banerjee, and Raymond J. Mooney. 2002. Semi-supervised clustering by seeding. In Proceedings of the Nineteenth International Conference on Machine Learning, ICML ’02, pages 27–34, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
- [Chan and Yeung2000] Kam-Fai Chan and Dit-Yan Yeung. 2000. Mathematical expression recognition: A survey.
- [Chang and Lin2011] Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27.
- [Choi and McCallum2013] Jinho D. Choi and Andrew McCallum. 2013. Transition-based dependency parsing with selectional branching. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, ACL’13, pages 1052–1062.
- [Clark and Divvala2015] Christopher Clark and Santosh Divvala. 2015. Looking beyond text: Extracting figures, tables and captions from computer science papers. In AAAI Workshops.
- [Kacem et al.2001] Afef Kacem, Abdel Belaïd, and Mohamed Ben Ahmed. 2001. Automatic extraction of printed mathematical formulas using fuzzy logic and propagation of context. IJDAR, 4(2):97–108.
- [Khusro et al.2015] Shah Khusro, Asima Latif, and Irfan Ullah. 2015. On methods and tools of table detection, extraction and annotation in pdf documents. J. Inf. Sci., 41(1):41–57, February.
- [Kohlhase and Sucan2006] Michael Kohlhase and Ioan Sucan. 2006. A search engine for mathematical formulae. In Jacques Calmet, Tetsuo Ida, and Dongming Wang, editors, AISC, volume 4120 of Lecture Notes in Computer Science, pages 241–253. Springer.
- [Le and Mikolov2014] Quoc V. Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. CoRR, abs/1405.4053.
- [Lerman et al.2001] Kristina Lerman, Craig Knoblock, and Steven Minton. 2001. Automatic data extraction from lists and tables in web sources. In Proceedings of the Workshop on Advances in Text Extraction and Mining (IJCAI-2001), Menlo Park. AAAI Press.
- [Lin et al.2011] Xiaoyan Lin, Liangcai Gao, Zhi Tang, Xiaofan Lin, and Xuan Hu. 2011. Mathematical Formula Identification in PDF Documents. In International Conference on Document Analysis and Recognition, ICDAR, pages 1419–1423.
- [Liu et al.2007] Ying Liu, Kun Bai, Prasenjit Mitra, and C. Lee Giles. 2007. TableSeer: automatic table metadata extraction and searching in digital libraries. In Joint Conference on Digital Library, JCDL, pages 91–100.
- [Mikolov et al.2013] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.
- [Ng et al.1999] Hwee Tou Ng, Chung Yong Lim, and Jessica Li Teng Koo. 1999. Learning to recognize tables in free text. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics, ACL ’99, pages 443–450, Stroudsburg, PA, USA. Association for Computational Linguistics.
- [O’Gorman1993] L. O’Gorman. 1993. The document spectrum for page layout analysis. IEEE Trans. Pattern Anal. Mach. Intell., 15(11):1162–1173, November.
- [Pinto et al.2003] David Pinto, Andrew McCallum, Xing Wei, and W. Bruce Croft. 2003. Table extraction using conditional random fields. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’03, pages 235–242, New York, NY, USA. ACM.
- [Ross et al.2010] Stéphane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. 2010. No-regret reductions for imitation learning and structured prediction. CoRR, abs/1011.0686.
- [Simon et al.1997] Anikó Simon, Jean-Christophe Pret, and A. Peter Johnson. 1997. A fast algorithm for bottom-up document layout analysis. IEEE Trans. Pattern Anal. Mach. Intell., 19(3):273–277, March.
- [Suzuki et al.2003] Masakazu Suzuki, Fumikazu Tamari, Ryoji Fukuda, Seiichi Uchida, and Toshihiro Kanahori. 2003. INFTY: an integrated OCR system for mathematical documents. In Proceedings of ACM Symposium on Document Engineering 2003, pages 95–104. ACM Press.
- [Tuarob et al.2013] Suppawong Tuarob, Sumit Bhatia, Prasenjit Mitra, and C. Lee Giles. 2013. Automatic detection of pseudocodes in scholarly documents using machine learning. In Proceedings of the 2013 12th International Conference on Document Analysis and Recognition, ICDAR ’13, pages 738–742, Washington, DC, USA. IEEE Computer Society.
- [Zanibbi et al.2004] Richard Zanibbi, Dorothea Blostein, and R. Cordy. 2004. A survey of table recognition: Models, observations, transformations, and inferences. Int. J. Doc. Anal. Recognit., 7(1):1–16, March.