A Large Visual, Qualitative and Quantitative Dataset of Web Pages

05/15/2021
by Christian Mejia-Escobar, et al.

The World Wide Web is not only one of the most important platforms for communication and information today, but also an area of growing interest for scientific research. This interest motivates many studies and projects that require large amounts of data. However, no existing dataset integrates both the parameters and the visual appearance of Web pages, because collecting them is costly in time and effort. With the support of various software tools and programming scripts, we have created a large dataset of 49,438 Web pages. It consists of visual, textual and numerical data, includes all countries worldwide, and covers a broad range of topics such as art, entertainment, economy, business, education, government, news, media, science, and environment, reflecting different cultural characteristics and varied design preferences. In this paper, we describe the process of collecting, debugging and publishing the final product, which is freely available. To demonstrate the usefulness of our dataset, we present a binary classification model for detecting error Web pages and a multi-class model for categorizing Web pages by subject, both built with convolutional neural networks.


