Web2Text: Deep Structured Boilerplate Removal

01/08/2018 · by Thijs Vogels, et al. · ETH Zurich

Web pages are a valuable source of information for many natural language processing and information retrieval tasks. Extracting the main content from those documents is essential for the performance of derived applications. To address this issue, we introduce a novel model that performs sequence labeling to collectively classify all text blocks in an HTML page as either boilerplate or main content. Our method uses a hidden Markov model on top of potentials derived from DOM tree features using convolutional neural networks. The proposed method sets a new state-of-the-art performance for boilerplate removal on the CleanEval benchmark. As a component of information retrieval pipelines, it improves retrieval performance on the ClueWeb12 collection.

1 Introduction

Modern methods in natural language processing and information retrieval are heavily dependent on large collections of text. The World Wide Web is an inexhaustible source of content for such applications. However, a common problem is that Web pages include not only main content, but also ads, hyperlink lists, navigation, previews of other articles, banners, etc. This boilerplate/template content has often been shown to have negative effects on the performance of derived applications [15, 24].

The task of separating main text in a Web page from the remaining content is known in the literature as “boilerplate removal”, “Web page segmentation” or “content extraction”. Established popular methods for this problem use rule-based or machine learning algorithms. The most successful approaches first perform a splitting of an input Web page into text blocks, followed by a binary labeling of each block as either main content or boilerplate.

In this paper, we propose a hidden Markov model on top of neural potentials for the task of boilerplate removal. We leverage the representational power of convolutional neural networks (CNNs) to learn unary and pairwise potentials over the blocks in a page, based on complex non-linear combinations of traditional DOM-based features. At prediction time, we find the most likely block labeling by maximizing the joint probability of a label sequence using the Viterbi algorithm [23]. The effectiveness of our method is demonstrated on standard benchmarking datasets.

The remainder of this document is structured as follows. Section 2 gives an overview of related work. Section 3 formally defines the main-content extraction problem, introduces the block segmentation procedure and details our model. Section 4 empirically demonstrates the merit of our method on several benchmark datasets for content extraction and document retrieval.

2 Related Work

Early approaches to HTML boilerplate removal use a range of heuristics and rule-based methods. Finn et al. [7] design an effective system called Body Text Extractor (BTE). It relies on the observation that the main content contains longer paragraphs of uninterrupted text, where HTML tags occur less frequently compared to the rest of the Web page. Looking at the cumulative distribution of tags as a function of the position in the document, Finn et al. identify a flat region in the middle of this distribution graph to be the main content of the page. While simple, their algorithm has two drawbacks: (1) it only makes use of the location of HTML tags and not of their structure, thus losing potentially valuable information, and (2) it can only identify one continuous stretch of main content, which is unrealistic for a considerable percentage of modern Web pages.

To address these issues, several other algorithms have been designed to operate on DOM trees, thus leveraging the semantics of the HTML structure [11, 19, 6]. The problem with these early methods is that they make intensive use of the fact that pages used to be partitioned into sections by <table> tags, which is no longer a valid assumption.

In the next line of work, the DOM structure is used to jointly process multiple pages from the same domain, relying on their structural similarities. This approach was pioneered by Yi et al. [24] and was improved by various others [22]. These methods are very suitable for detecting template content that is present in all pages of a website, but have poor performance on websites that consist of a single Web page only. In this paper we focus on single-page content extraction without exploiting the context of other pages from the same site.

Gottron [10] proposes the Document Slope Curves and Content Code Blurring methods, which are able to identify multiple disconnected content regions. The latter method parses the HTML source code as a vector of 1's, representing pieces of text, and 0's, representing tags. This vector is then smoothed iteratively until active regions emerge where text dominates (content) and inactive regions where tags dominate (boilerplate). This idea of smoothing was extended to also deal with the DOM structure [4, 21]. Chakrabarti et al. [3] assign a likelihood of being content to each leaf of the DOM tree, using isotonic smoothing to combine the likelihoods of neighbors with the same parents. In a similar direction, Sun et al. [21] use both the tag/text ratio and DOM tree information to propagate DensitySums through the tree.

Machine learning methods offer a convenient way to combine various indicators of “contentness”, automatically weighting hand-crafted features according to their relative importance. The FIASCO system by Bauer et al. [2] uses Support Vector Machines (SVMs) to classify an HTML page as a sequence of blocks that are generated through a DOM-based segmentation of the page and are represented by linguistic, structural and visual features. Similar work by Kohlschütter et al. [17] also employs SVMs to classify blocks independently. Spousta et al. [20] extend this approach by reformulating the classification problem as a case of sequence labeling in which all blocks are jointly tagged. They use conditional random fields to take advantage of correlations between the labels of neighboring content blocks. This method was the most successful in the CleanEval competition [1].

In this paper, we propose an effective set of block features that capture information from adjacent neighbors in the DOM tree. Additionally, we employ a deep learning framework to automatically learn non-linear feature combinations, giving the model an advantage over traditional linear approaches. Finally, we jointly optimize the labels for the whole Web page according to local potentials predicted by the neural networks.

3 Web2Text

Boilerplate removal is the problem of labeling sections of the text of a Web page as main content or boilerplate (anything else) [1]. In the following, we discuss the various steps of our method. The complete pipeline is also illustrated in Figure 1.

Figure 1: The Web2Text pipeline. The leaves of the Collapsed DOM tree of a Web page form an ordered sequence of blocks to be labeled. For each block, we extract a number of DOM tree-based features. Two separate convolutional networks operating on this sequence of features yield two respective sets of potentials: unary potentials for each block and pairwise potentials for each pair of neighboring blocks. These define a hidden Markov model. Using the Viterbi algorithm, we find an optimal labeling that maximizes the total sequence probability as predicted by the neural networks.

3.1 Preprocessing

We expect raw Web page input to be written in (X)HTML markup. Each document is parsed as a Document Object Model tree (DOM tree) using Jsoup [12]. We preprocess this DOM tree by i) removing empty nodes or nodes containing only whitespace, ii) removing nodes that do not have any content we can extract: e.g. <br>, <checkbox>, <head>, <hr>, <iframe>, <img>, <input>.
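As an illustration, the following Python sketch performs this cleanup with BeautifulSoup; the paper itself uses Jsoup, so the library choice, the helper name and the exact traversal below are assumptions for illustration only.

```python
# Illustrative preprocessing sketch (the paper uses Jsoup); bs4 is assumed here.
from bs4 import BeautifulSoup, Comment

NO_CONTENT_TAGS = ["br", "checkbox", "head", "hr", "iframe", "img", "input"]

def preprocess(html: str) -> BeautifulSoup:
    soup = BeautifulSoup(html, "html.parser")
    # ii) drop nodes from which no content can be extracted
    for tag in soup.find_all(NO_CONTENT_TAGS):
        tag.decompose()
    # drop comments and whitespace-only text nodes
    for node in soup.find_all(string=True):
        if isinstance(node, Comment) or not node.strip():
            node.extract()
    # i) drop elements that are empty (or became empty after the steps above);
    # reversed() visits children before their parents
    for tag in reversed(soup.find_all()):
        if not tag.get_text(strip=True):
            tag.decompose()
    return soup
```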

Figure 2: Collapsed DOM procedure example. Left: HTML source code, middle: the corresponding DOM tree, right: the corresponding Collapsed DOM.

We make use of the parent and grandparent DOM tree relations. In a raw DOM tree, however, these relationships are not always meaningful. Figure 2 shows a typical fragment of a DOM tree where two neighboring nodes share the same semantic parent (<ul>) but not the same DOM parent. To improve the expressiveness of tree-based features (such as “the number of children of a node’s parent”), we recursively merge single-child parent nodes with their respective child. We call the resulting tree structure the Collapsed DOM (CDOM).
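A minimal sketch of this collapsing step is shown below, assuming a simple recursive node type; CDOMNode and its fields are hypothetical names for illustration, not the paper's data structures.

```python
# Hypothetical CDOM node: merged tag chain, optional text, children, parent.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class CDOMNode:
    tags: List[str]                          # e.g. ["li", "a"] after a merge
    text: str = ""                           # non-empty only for text leaves
    children: List["CDOMNode"] = field(default_factory=list)
    parent: Optional["CDOMNode"] = None

def collapse(node: CDOMNode) -> CDOMNode:
    # Merge chains of single-child parents into one node, keeping the tag chain.
    while len(node.children) == 1 and not node.text:
        child = node.children[0]
        node.tags += child.tags
        node.text = child.text
        node.children = child.children
    for child in node.children:
        child.parent = node
        collapse(child)
    return node
```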

3.2 Block Segmentation

Our content extraction algorithm is based on sequence labeling. A Web page is treated as a sequence of blocks that are labeled main content or boilerplate. There are multiple ways to split a Web page into blocks, the most popular being i) lines in the HTML file, ii) DOM leaves, iii) block-level DOM leaves. We opt for the most flexible strategy, DOM leaves, described as follows. Sections on a page that require different labels are usually separated by at least one HTML tag. Therefore, it is safe to consider DOM leaves (#text nodes) as the blocks of our sequence. A potential disadvantage of this approach is that a hyperlink in a text paragraph can receive a different label than its neighboring text. In practice, however, an empirical evaluation of Web2Text shows no cases where part of a textual paragraph is wrongly labeled as boilerplate while the rest is marked as main content.
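Continuing with the hypothetical CDOMNode type from the sketch above, the block sequence is then simply the ordered list of text-carrying leaves:

```python
from typing import List

def text_blocks(root: CDOMNode) -> List[CDOMNode]:
    # Collect the CDOM leaves (#text nodes) in document order.
    blocks: List[CDOMNode] = []
    def visit(node: CDOMNode) -> None:
        if not node.children and node.text.strip():
            blocks.append(node)
        for child in node.children:
            visit(child)
    visit(root)
    return blocks
```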

3.3 Feature Extraction

Features are properties of a node that may be indicative of it being content or boilerplate. Such features can be based on the node’s text, CDOM structure or a combination thereof. We distinguish between block features and edge features.

Block features capture information on each block of text on a page. They are statistics collected based on the block’s CDOM node, parent node, grandparent node and the root of the CDOM tree. In total, we collect 128 features for each text block, e.g. “the node is a <p> element”, “average word length”, “relative position in the source code”, “the parent node’s text contains an email address”, “ratio of stopwords in the whole page”, etc.

We clip and standardize all non-binary features to be approximately Gaussian with zero mean and unit variance across the training set. For a full overview of all 128 features, please refer to Appendix 0.A.
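As a rough sketch, the normalization applied to the non-binary features looks as follows; the clipping bounds and the mean/standard deviation are computed on the training set, and the array shapes here are illustrative.

```python
import numpy as np

def standardize(features: np.ndarray, lo: np.ndarray, hi: np.ndarray,
                mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    # Clip each feature to a plausible range, then standardize with
    # training-set statistics so it has zero mean and unit variance.
    clipped = np.clip(features, lo, hi)
    return (clipped - mean) / std
```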

Edge features capture information on each pair of neighboring text blocks. We collect 25 features for each such pair. Define the tree distance of two nodes as the sum of the number of hops from both nodes to their first common ancestor. The first edge features we use are binary features corresponding to a tree distance of 2, 3, 4 and greater. Another feature signifies whether there is a line break between the nodes in an unstyled HTML page. Finally, we collect features b70–b89 from Appendix 0.A for the common ancestor CDOM node of the two text blocks.
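The tree-distance part of these features can be sketched as follows, again reusing the hypothetical CDOMNode parent pointers from Section 3.1; this is an illustration, not the paper's implementation.

```python
def tree_distance(a: CDOMNode, b: CDOMNode) -> int:
    # Hops from both leaves up to their first common ancestor.
    hops_from_a = {}
    hops, node = 0, a
    while node is not None:
        hops_from_a[id(node)] = hops
        node, hops = node.parent, hops + 1
    hops, node = 0, b
    while node is not None:
        if id(node) in hops_from_a:          # first common ancestor reached
            return hops_from_a[id(node)] + hops
        node, hops = node.parent, hops + 1
    raise ValueError("nodes do not share a common ancestor")
```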

3.4 CNN Unary and Pairwise Potentials

We assign unary potentials to each text block to be labeled and pairwise potentials to each pair of neighboring text blocks. In our case, potentials are probabilities, as explained below. The unary potentials $p_i(1)$ and $p_i(0)$ are the probabilities that the label of text block $i$ is content or boilerplate, respectively. The two potentials sum to one. The pairwise potentials $q_i(0,0)$, $q_i(0,1)$, $q_i(1,0)$ and $q_i(1,1)$ are the transition probabilities for the labels of the neighboring text blocks $i$ and $i+1$. These pairwise potentials also sum to one for each text block pair.

The two sets of potentials are modeled using CNNs with 5 layers and ReLU non-linearities between layers; the unary and pairwise networks use different per-layer filter and kernel sizes, and all filters have a stride of 1. The unary CNN receives the sequence of block features corresponding to the sequence of text blocks to be labeled and outputs unary potentials for each block. The pairwise CNN receives the sequence of edge features corresponding to the sequence of edges and outputs the pairwise potentials for each pair of neighboring blocks. We use zero padding to make sure that each layer produces a sequence of the same size as its input sequence. The outputs of the unary network are sequences of 2 values per block, normalized using a softmax. The outputs of the pairwise network are sequences of 4 values per block pair, normalized in the same way. Because the convolutions span several positions, the output for a block depends indirectly on a range of blocks around it. We employ dropout regularization with rate 0.2 and weight decay.
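For illustration, a minimal PyTorch sketch of the unary network is given below. The number of filters and the kernel sizes are placeholders, since the exact layer configuration is not reproduced here; the pairwise network would be analogous, with 25 input features and 4 output channels per position.

```python
import torch
import torch.nn as nn

class UnaryCNN(nn.Module):
    def __init__(self, n_features: int = 128, hidden: int = 50,
                 kernel: int = 3, dropout: float = 0.2):
        super().__init__()
        layers, channels = [], n_features
        for _ in range(4):                               # 4 hidden conv layers
            layers += [nn.Conv1d(channels, hidden, kernel, padding=kernel // 2),
                       nn.ReLU(),
                       nn.Dropout(dropout)]
            channels = hidden
        # final layer: 2 output channels = (boilerplate, content) potentials
        layers.append(nn.Conv1d(channels, 2, kernel, padding=kernel // 2))
        self.net = nn.Sequential(*layers)                # stride 1, zero padding

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_features, n_blocks) -> probabilities (batch, 2, n_blocks)
        return torch.softmax(self.net(x), dim=1)
```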

For the unary potentials, we minimize the cross-entropy

$$\mathcal{L}_u(\theta_u) = -\sum_{i=1}^{N} \log p_i(l_i; \theta_u) \qquad (1)$$

where $l_i$ is the true label of block $i$, $\theta_u$ are the parameters of the unary network and $N$ is the index of the last text block in the sequence.

For the pairwise network, we minimize the cross-entropy

$$\mathcal{L}_p(\theta_p) = -\sum_{i=1}^{N-1} \log q_i(l_i, l_{i+1}; \theta_p) \qquad (2)$$

where $\theta_p$ are the parameters of the pairwise network.

3.5 Inference

The joint prediction of the most likely sequence of labels given an input Web page works as follows. We denote the sequence of text blocks on the page as $(b_1, \ldots, b_N)$ and write the probability of a corresponding labeling $(l_1, \ldots, l_N)$ being the correct one as

$$p(l_1, \ldots, l_N) \propto \left[ \prod_{i=1}^{N} p_i(l_i) \right]^{\lambda} \left[ \prod_{i=1}^{N-1} q_i(l_i, l_{i+1}) \right]^{1-\lambda} \qquad (3)$$

where $\lambda \in [0, 1]$ is an interpolation factor between the unary and pairwise terms, fixed to a constant value in our experiments. This expression describes a hidden Markov model and is maximized using the Viterbi algorithm [23] to find the optimal labeling given the predicted CNN potentials.
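An illustrative Viterbi decoder over the predicted potentials is sketched below (p holds the unary probabilities, q the pairwise probabilities, and lam the interpolation factor); this is a sketch of the decoding step, not the paper's implementation.

```python
import numpy as np

def viterbi(p: np.ndarray, q: np.ndarray, lam: float) -> np.ndarray:
    # p: (N, 2) unary potentials, q: (N-1, 2, 2) pairwise potentials.
    N = p.shape[0]
    log_p, log_q = np.log(p + 1e-12), np.log(q + 1e-12)
    score = np.zeros((N, 2))                 # best log-score ending in label l
    back = np.zeros((N, 2), dtype=int)       # backpointers
    score[0] = lam * log_p[0]
    for i in range(1, N):
        # trans[l_prev, l_cur]: score of extending l_prev with l_cur
        trans = score[i - 1][:, None] + (1 - lam) * log_q[i - 1]
        back[i] = trans.argmax(axis=0)
        score[i] = trans.max(axis=0) + lam * log_p[i]
    labels = np.zeros(N, dtype=int)
    labels[-1] = score[-1].argmax()
    for i in range(N - 1, 0, -1):            # backtrace
        labels[i - 1] = back[i, labels[i]]
    return labels                            # 1 = main content, 0 = boilerplate
```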

4 Experiments

Our experiments are grouped in two stages. We begin by assessing Web2Text’s performance at boilerplate removal on a high-quality manually annotated corpus of Web pages. In a second step, we turn towards a much larger collection and investigate how improved content extraction results in superior information retrieval quality. Both experiments highlight the benefits of Web2Text over state-of-the-art alternatives.

4.1 Training Data

CleanEval 2007 [1] is the largest publicly available dataset for this task, containing 188 text blocks per Web page on average. It comes with an original split into a development set (60 pages) and a test set (676 pages). We divide the development set into a training set (55 pages) and a validation set (5 pages). Since our model has more than 10,000 parameters, the original training set is likely too small for our method. Thus, we also created a second split of CleanEval: training (531 pages), validation (58 pages) and test (148 pages).

4.1.1 Automatic Block Labeling.

To our knowledge, the existing corpora (including CleanEval) for boilerplate detection pose an additional difficulty. These datasets consist only of pairs of Web pages and corresponding cleaned text (manually extracted). As a consequence, the alignment between the source text and cleaned text, as well as block labeling, have to be recovered. Some methods (e.g. [20]) rely on expensive manual block annotations. One of our contributions is the following automatic recovery procedure of the aligned (block, label) pairs from the original (Web page, clean text) pairs. This allows us to leverage more training data compared to previous methods.

We first linearly scan the cleaned text of a Web page using windows of 10 consecutive characters. Each such snippet is checked for uniqueness in the original Web page (after trimming spaces). If a unique match is found, it is used to divide both the cleaned text and the original Web page into two parts, on which the same matching method is applied recursively in a divide-and-conquer fashion. After all unique snippets are processed, we use dynamic programming to align the remaining split parts of the clean text with the corresponding split parts of the original Web page blocks. In the end, in the rare case that the content of a block is only partially matched with the cleaned text, we mark it as content iff at least 2/3 of its text is aligned.
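The recursive anchor-matching step can be sketched as follows; whitespace normalization and the dynamic-programming alignment of the leftover fragments are omitted, and the function name and return format are illustrative only.

```python
def anchor_align(clean: str, page: str, c_off: int = 0, p_off: int = 0) -> list:
    """Return (clean_span, page_span) index pairs for unique 10-char anchors."""
    WINDOW = 10
    if len(clean) < WINDOW:
        return []                            # too short: left to the DP step
    for start in range(len(clean) - WINDOW + 1):
        snippet = clean[start:start + WINDOW]
        first = page.find(snippet)
        # accept only snippets occurring exactly once in the page text
        if first != -1 and page.find(snippet, first + 1) == -1:
            left = anchor_align(clean[:start], page[:first], c_off, p_off)
            match = [((c_off + start, c_off + start + WINDOW),
                      (p_off + first, p_off + first + WINDOW))]
            right = anchor_align(clean[start + WINDOW:], page[first + WINDOW:],
                                 c_off + start + WINDOW, p_off + first + WINDOW)
            return left + match + right
    return []                                # no unique anchor: DP step handles it
```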

4.2 Training Details

The unary and pairwise potential-predicting networks are trained separately with the Adam optimizer [14] and a fixed learning rate for 5,000 iterations. Each iteration processes a mini-batch of Web page excerpts that are 9 text blocks long. We perform early stopping, observing no improvements after this number of steps, and pick the model with the lowest error on the validation set.
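A hedged sketch of one training step for the unary network follows (the pairwise network is trained analogously on edge features); the optimizer construction and the weight-decay value in the comment are assumptions, since the exact values are not reproduced here.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, features, labels):
    # features: (batch, 128, 9) block features; labels: (batch, 9) in {0, 1}
    model.train()
    optimizer.zero_grad()
    potentials = model(features)                              # (batch, 2, 9) softmax
    loss = F.nll_loss(torch.log(potentials + 1e-12), labels)  # cross-entropy
    loss.backward()
    optimizer.step()
    return loss.item()

# e.g. optimizer = torch.optim.Adam(model.parameters(), weight_decay=1e-4)
# (the weight-decay value above is a placeholder, not the paper's rate)
```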

4.3 Baselines

We compare Web2Text to a range of methods described in the literature or deployed in popular libraries. BTE [7] and Unfluff [8] are heuristic methods. Boilerpipe [17, 16] is a popular machine learning system that offers various content extraction settings, which we used in our experiments (see Table 1); we were not able to find code for re-training this system. CRF [20] achieves one of the best results on CleanEval. This machine learning model trains a conditional random field on top of block features in order to perform block classification. However, as explained in Section 4.1.1, CRF relies on a different Web page block splitting and on expensive manual block annotations. As a consequence, we were not able to re-train it and thus only used the out-of-the-box model pre-trained on the original CleanEval split. For a fair comparison, we also train on the original CleanEval split, but note below that our neural network has many more parameters and suffers from using so few training instances.

4.3.1 Model Sizes.

The CRF model [20] contains 9,705 parameters. In comparison, our unary CNN contains 17,960 parameters and the pairwise CNN contains 12,870 parameters, for a total of 30,830 parameters in the joint structured model. This explains why the original training set is too small for our model.

Method | Original test (676 pages): Acc. / Prec. / Rec. / F1 | Our test (148 pages): Acc. / Prec. / Rec. / F1
CRF [20] (original train 55p + 5p) | 0.82 / 0.87 / 0.81 / 0.84 | 0.82 / 0.88 / 0.81 / 0.84
BTE [7] | 0.79 / 0.79 / 0.89 / 0.83 | 0.75 / 0.76 / 0.84 / 0.80
default-ext [16] | 0.80 / 0.89 / 0.75 / 0.81 | 0.79 / 0.89 / 0.74 / 0.81
article-ext [16] | 0.72 / 0.91 / 0.59 / 0.71 | 0.67 / 0.89 / 0.50 / 0.64
largest-ext [16] | 0.60 / 0.93 / 0.36 / 0.52 | 0.59 / 0.93 / 0.33 / 0.48
Unfluff [8] | 0.71 / 0.90 / 0.57 / 0.70 | 0.68 / 0.90 / 0.51 / 0.65
Web2Text (original train 55p, val 5p) | 0.84 / 0.88 / 0.85 / 0.86 | –
Web2Text (our train 531p, val 58p) | – | 0.86 / 0.87 / 0.90 / 0.88

Table 1: Boilerplate removal results on the CleanEval dataset. We use two different splits of this dataset: the original split (55p train, 5p validation, 676p test) and our split (531p train, 58p validation, 148p test). The results confirm that our method benefits from a larger training set.

4.4 Content Extraction Results

Table 1 shows the results of this experiment. All metrics are block-based, with all blocks weighted equally. Web2Text obtains state-of-the-art accuracy, recall and F1 scores compared to popular baselines, including previous CleanEval winners. Note that these numbers are obtained by evaluating each method with the same block segmentation procedure, namely the DOM leaves strategy described in Section 3.2. We additionally note that, compared to using Web2Text with the unary CNN only, the gains of the hidden Markov model are marginal in this experiment.

4.4.1 Running times.

Web2Text takes 54 ms per Web page on average: 35 ms for DOM parsing and feature extraction, and 19 ms for the neural network forward pass and the Viterbi algorithm. These measurements were taken on a MacBook with a 2.8 GHz Intel Core i5 processor.

Collection Ret. Model Method P@10 R@10 F1@10 MAP nDCG
CW12-A QL raw content 0.316 0.056 0.095 0.137 0.459
CW12-A QL CRF [20] 0.342* 0.068* 0.113* 0.147* 0.543*
CW12-A QL BTE [7] 0.301 0.048 0.083 0.128 0.435
CW12-A QL default-ext [16] 0.318 0.055 0.094 0.138 0.462
CW12-A QL article-ext [16] 0.298 0.049 0.084 0.126 0.433
CW12-A QL largest-ext [16] 0.279 0.044 0.076 0.112 0.417
CW12-A QL Unfluff [8] 0.304 0.051 0.087 0.128 0.428
CW12-A QL Web2Text 0.361* 0.079* 0.130* 0.154* 0.578*
CW12-A RM raw content 0.278 0.048 0.082 0.121 0.439
CW12-A RM CRF [20] 0.301* 0.057* 0.096* 0.138* 0.487*
CW12-A RM BTE [7] 0.262 0.041 0.071 0.110 0.409
CW12-A RM default-ext [16] 0.277 0.048 0.082 0.123 0.442
CW12-A RM article-ext [16] 0.260 0.039 0.068 0.109 0.411
CW12-A RM largest-ext [16] 0.248 0.032 0.057 0.097 0.401
CW12-A RM Unfluff [8] 0.264 0.041 0.071 0.111 0.407
CW12-A RM Web2Text 0.325* 0.069* 0.114* 0.145* 0.525*
CW12-B QL raw content 0.210 0.025 0.045 0.037 0.134
CW12-B QL CRF [20] 0.241* 0.031* 0.055* 0.048* 0.165*
CW12-B QL BTE [7] 0.193 0.019 0.035 0.030 0.121
CW12-B QL default-ext [16] 0.212 0.026 0.046 0.038 0.132
CW12-B QL article-ext [16] 0.199 0.017 0.031 0.031 0.120
CW12-B QL largest-ext [16] 0.178 0.015 0.028 0.024 0.107
CW12-B QL Unfluff [8] 0.195 0.020 0.036 0.029 0.121
CW12-B QL Web2Text 0.266* 0.038* 0.067* 0.055* 0.181*
CW12-B RM raw content 0.172 0.021 0.037 0.030 0.122
CW12-B RM CRF [20] 0.198* 0.028* 0.049* 0.041* 0.143*
CW12-B RM BTE [7] 0.158 0.015 0.027 0.022 0.111
CW12-B RM default-ext [16] 0.170 0.020 0.036 0.029 0.124
CW12-B RM article-ext [16] 0.156 0.015 0.027 0.019 0.109
CW12-B RM largest-ext [16] 0.145 0.013 0.024 0.015 0.095
CW12-B RM Unfluff [8] 0.159 0.016 0.029 0.021 0.112
CW12-B RM Web2Text 0.213* 0.032* 0.056* 0.046* 0.165*
Table 2: The effect of boilerplate removal on ad hoc retrieval performance. An asterisk (*) indicates a significant performance difference between raw and cleaned HTML content. A dagger (†) indicates that a model significantly outperforms all other text extraction methods.

4.5 Impact on Retrieval Performance

Besides the previously presented intrinsic evaluation of text extraction accuracy, we are interested in the performance gains that other derived tasks experience when operating on the output of boilerplate removal systems of varying quality. To this end, our extrinsic evaluation studies the task of ad hoc document retrieval. Search engines that index high-quality output of text extraction systems should be better able to answer a given user-formulated query than systems indexing raw HTML or naïvely cleaned content. Our experiments are based on the well-known ClueWeb12 collection of Web pages (http://lemurproject.org/clueweb12/). It is organized in two well-defined document sets: the full CW12-A corpus of 733M organic Web documents (27.3 TB of uncompressed text) and the smaller, randomly sampled subset CW12-B of 52M documents (1.95 TB of uncompressed text). The collection is indexed using the Indri search engine, and retrieval runs are conducted using two state-of-the-art probabilistic retrieval models: the query likelihood model [13] (QL) and a relevance-based language model [18] (RM). Our 50 test queries, alongside their relevance judgments, originate from the 2013 edition of the TREC Web Track [5].

Table 2 highlights the quality of each combination of retrieval model and collection when indexing either raw or cleaned Web content. Within each combination, statistical significance of performance differences between raw and cleaned HTML content is denoted by an asterisk. Models that significantly outperform all other text extraction methods are indicated by a dagger (†). We can note that, in general, retrieval systems indexing CW12-A deliver stronger results than those operating only on the CW12-B subset. Due to the random sampling process, many potentially relevant documents are missing from this smaller collection. Similarly, across all comparable settings, the query likelihood model (QL) performs significantly better than the relevance model (RM). As hypothesized earlier, text extraction can influence the quality of subsequent document retrieval. We note that low-recall methods (BTE, article-ext, largest-ext, Unfluff) cause losses in retrieval performance, as relevant pieces of content are incorrectly removed as boilerplate. At the same time, the most accurate models (CRF, Web2Text) introduce improvements across all metrics. Web2Text, in particular, significantly outperformed all baselines. We note that, for this experiment, Web2Text was trained on our CleanEval split, as explained in Section 4.1.

5 Conclusion

This paper presents Web2Text (source code publicly available at https://github.com/dalab/web2text), a novel algorithm for main content extraction from Web pages. The method combines the virtues of popular sequence labeling approaches such as CRFs [9] with deep learning methods that leverage the DOM structure as a source of information. Our experimental evaluation on CleanEval benchmarking data shows significant performance gains over all state-of-the-art methods. In a second set of experiments, we demonstrate how highly accurate boilerplate removal can significantly increase the performance of derived tasks such as ad hoc retrieval.

Acknowledgments

This research is funded by the Swiss National Science Foundation (SNSF) under grant agreement numbers 167176 and 174025.

References

  • [1] Marco Baroni, Francis Chantree, Adam Kilgarriff, and Serge Sharoff. CleanEval: a competition for cleaning web pages. In LREC, 2008.
  • [2] Daniel Bauer, Judith Degen, Xiaoye Deng, Priska Herger, Jan Gasthaus, Eugenie Giesbrecht, Lina Jansen, Christin Kalina, Thorben Kräger, Robert Märtin, Martin Schmidt, Simon Scholler, Johannes Steger, Egon Stemle, and Stefan Evert. FIASCO: Filtering the internet by automatic subtree classification, osnabruck. In Building and Exploring Web Corpora: Proceedings of the 3rd Web as Corpus Workshop, incorporating CleanEval, volume 4, pages 111–121, 2007.
  • [3] Deepayan Chakrabarti, Ravi Kumar, and Kunal Punera. Page-level template detection via isotonic smoothing. In Proceedings of the 16th international conference on World Wide Web, pages 61–70. ACM, 2007.
  • [4] Deepayan Chakrabarti, Ravi Kumar, and Kunal Punera. A graph-theoretic approach to webpage segmentation. In Proceedings of the 17th international conference on World Wide Web, pages 377–386. ACM, 2008.
  • [5] Kevyn Collins-Thompson, Paul Bennett, Fernando Diaz, Charlie Clarke, and Ellen Voorhees. Overview of the TREC 2013 web track. In Proceedings of the 22nd Text Retrieval Conference (TREC’13), 2013.
  • [6] Sandip Debnath, Prasenjit Mitra, Nirmal Pal, and C Lee Giles. Automatic identification of informative sections of web pages. IEEE transactions on knowledge and data engineering, 17(9):1233–1246, 2005.
  • [7] Aidan Finn, Nicholas Kushmerick, and Barry Smyth. Fact or fiction: Content classification for digital libraries. Unrefereed, 2001.
  • [8] Adam Geitgey. Unfluff – an automatic web page content extractor for node.js!, 2014.
  • [9] John Gibson, Ben Wellner, and Susan Lubar. Adaptive web-page content identification. In Proceedings of the 9th annual ACM international workshop on Web information and data management, pages 105–112. ACM, 2007.
  • [10] Thomas Gottron. Content code blurring: A new approach to content extraction. In Database and Expert Systems Application, 2008. DEXA’08. 19th International Workshop on, pages 29–33. IEEE, 2008.
  • [11] Suhit Gupta, Gail Kaiser, David Neistadt, and Peter Grimm. DOM-based content extraction of HTML documents. In Proceedings of the 12th international conference on World Wide Web, pages 207–214. ACM, 2003.
  • [12] Jonathan Hedley. Jsoup HTML parser, 2009.
  • [13] Rong Jin, Alex G Hauptmann, and ChengXiang Zhai. Language model for information retrieval. In Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, pages 42–48. ACM, 2002.
  • [14] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [15] Christian Kohlschütter. A densitometric analysis of web template content. In Proceedings of the 18th international conference on World wide web, pages 1165–1166. ACM, 2009.
  • [16] Christian Kohlschütter et al. Boilerpipe – boilerplate removal and fulltext extraction from HTML pages. Google Code, 2010.
  • [17] Christian Kohlschütter, Peter Fankhauser, and Wolfgang Nejdl. Boilerplate detection using shallow text features. In Proceedings of the third ACM international conference on Web search and data mining, pages 441–450. ACM, 2010.
  • [18] Victor Lavrenko and W Bruce Croft. Relevance based language models. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pages 120–127. ACM, 2001.
  • [19] Shian-Hua Lin and Jan-Ming Ho. Discovering informative content blocks from web documents. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 588–593. ACM, 2002.
  • [20] Miroslav Spousta, Michal Marek, and Pavel Pecina. Victor: the web-page cleaning tool. In 4th Web as Corpus Workshop (WAC4)-Can we beat Google, pages 12–17, 2008.
  • [21] Fei Sun, Dandan Song, and Lejian Liao. Dom based content extraction via text density. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, pages 245–254. ACM, 2011.
  • [22] Karane Vieira, Altigran S Da Silva, Nick Pinto, Edleno S De Moura, Joao Cavalcanti, and Juliana Freire. A fast and robust method for web page template detection and removal. In Proceedings of the 15th ACM international conference on Information and knowledge management, pages 258–267. ACM, 2006.
  • [23] Andrew J Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. In The Foundations Of The Digital Wireless World: Selected Works of AJ Viterbi, pages 41–50. World Scientific, 2010.
  • [24] Lan Yi, Bing Liu, and Xiaoli Li. Eliminating noisy information in web pages for data mining. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 296–305. ACM, 2003.

Appendix 0.A List of block features

ID | Name | Description
b1 | has duplicate | 1/0: is there another node with the same text?
b2 | has 10 duplicates | 1/0: are there at least 10 other nodes with the same text?
b3 | r same class path | ratio of nodes on the page with the same class path (e.g. body>div>a.link>b)
b4 | has word | 1/0: there is at least one word in the text block
b5 | log(n words) | log(number of words) (clipped between 0 and 3.5)
b6 | avg word length | average word length (clipped between 3 and 15)
b7 | has stopword | 1/0: block contains a stopword
b8 | stopword ratio | ratio of words that are in our stopword list
b9 | log(n characters) | log(number of characters) (clipped between 2.5 and 5.5)
b10 | log(punctuation ratio) | log(ratio of punctuation characters to the total) (clipped between -4 and -2.5)
b11 | has numeric | 1/0: the node contains numeric characters
b12 | numeric ratio | ratio of numeric characters to the total character count
b13 | log(avg sentence length) | log(average sentence length) (clipped between 2 and 5)
b14 | ends with punctuation | 1/0: the node ends with a punctuation character
b15 | ends with question mark | 1/0: the node ends with a question mark
b16 | contains copyright | 1/0: the node contains a copyright symbol
b17 | contains email | 1/0: the node contains an email address
b18 | contains url | 1/0: the node contains a URL
b19 | contains year | 1/0: the node contains a word consisting of 4 digits
b20 | ratio words with capital | ratio of words starting with a capital letter
b21 | ratio words with capital, squared | b20 squared
b22 | ratio words with capital, cubed | b20 to the power 3
b23 | contains punctuation | node contains a punctuation character
b24 | n punctuation | number of punctuation characters
b25 | has multiple sentences | 1/0: there is more than one sentence in the text
b26 | relative position | relative position of the start of this block in the source code
b27 | relative position, squared | b26 squared
b28 | has parent | 1/0: the CDOM leaf has a parent node
b29 | p body percentage | ratio of the source code characters that is within the parent CDOM node
b30 | p link density | ratio of characters within <a> elements to total character count
b31–b47 | parent features | b6–b22, but for the parent CDOM node
b48 | p contains form element | 1/0: the parent CDOM node contains a form element
b49–b69 | parent tag features | encoding of the parent CDOM node's HTML tags as 1's and 0's (td, div, p, tr, table, body, ul, span, li, blockquote, b, small, a, ol, ul, i, form, dl, strong, pre)
b69 | has grandparent | 1/0: the node has a grandparent CDOM node
b70–b89 | grandparent features | b29–b48, but for the grandparent CDOM node
b90–b109 | root features | b29–b48, but for the root CDOM node (body)
b110–b128 | tag features | encoding of the CDOM node's HTML tags as 1's and 0's (a, p, td, b, li, span, i, tr, div, strong, em, h3, h2, table, h4, small, sup, h1, blockquote)

Table 3: List of all block features used. 1/0 indicates a binary feature: 1 if true, 0 if false. Non-binary features are normalized to have zero mean and unit variance.