Language-Agnostic Website Embedding and Classification

01/10/2022
by Sylvain Lugeon, et al.

Currently, publicly available models for website classification do not offer an embedding method and have limited support for languages beyond English. We release a dataset of more than 1M websites in 92 languages, with their corresponding labels collected from Curlie, the largest multilingual crowdsourced Web directory. The dataset contains 14 website categories aligned across languages. Alongside it, we introduce Homepage2Vec, a pre-trained machine-learning model for classifying and embedding websites based on their homepage in a language-agnostic way. Thanks to its feature set (textual content, metadata tags, and visual attributes) and recent progress in natural language representation, Homepage2Vec is language-independent by design and can generate embedding representations of websites. We show that Homepage2Vec correctly classifies websites with a macro-averaged F1-score of 0.90, with stable performance across both low- and high-resource languages. Feature analysis shows that a small subset of efficiently computable features suffices to achieve high performance even with limited computational resources. We make publicly available the curated Curlie dataset aligned across languages, the pre-trained Homepage2Vec model, and libraries.
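
To make the reported metric concrete, the following is a minimal sketch of how a macro-averaged F1-score can be computed when each website may carry several category labels. It is not the authors' evaluation code: the function name `macro_f1`, the three category names, and the toy labels are all assumptions chosen for illustration. The point it shows is that F1 is computed per category and then averaged with equal weight, so rare categories count as much as frequent ones.

```python
from typing import List, Set


def macro_f1(true_labels: List[Set[str]],
             pred_labels: List[Set[str]],
             categories: List[str]) -> float:
    """Average the per-category F1 scores, weighting every category equally."""
    f1_scores = []
    for cat in categories:
        pairs = list(zip(true_labels, pred_labels))
        tp = sum(1 for t, p in pairs if cat in t and cat in p)      # correctly assigned
        fp = sum(1 for t, p in pairs if cat not in t and cat in p)  # wrongly assigned
        fn = sum(1 for t, p in pairs if cat in t and cat not in p)  # missed
        denom = 2 * tp + fp + fn
        # Convention assumed here: a category with no true or predicted
        # instances contributes an F1 of 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy data (hypothetical): four websites, three illustrative categories.
categories = ["Arts", "Business", "Sports"]
true_labels = [{"Arts"}, {"Business", "Sports"}, {"Sports"}, {"Arts"}]
pred_labels = [{"Arts"}, {"Business"}, {"Sports", "Arts"}, {"Business"}]

score = macro_f1(true_labels, pred_labels, categories)
print(f"macro-F1 = {score:.3f}")
```

In this toy case the per-category scores are 1/2, 2/3, and 2/3, so the macro average is 11/18, or roughly 0.611.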


