Corpora Compared: The Case of the Swedish Gigaword and Wikipedia Corpora

11/06/2020
by Tosin P. Adewumi et al.

In this work, we show that the difference in performance between embeddings built from differently sourced data for a given language can be due to factors other than data size. Natural language processing (NLP) tasks usually perform better with embeddings from bigger corpora; however, the breadth of the covered domain and the level of noise can also play important roles. We evaluate embeddings based on two Swedish corpora, Gigaword and Wikipedia, using analogy (intrinsic) tests, and find that the embeddings from the Wikipedia corpus generally outperform those from the Gigaword corpus, even though Gigaword is the bigger corpus. Downstream tests will be required for a definitive evaluation.
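For context, here is a minimal sketch of the kind of analogy (intrinsic) test referred to above, written in Python with gensim. The vector file, the Swedish word pairs, and the analogy-question file are illustrative placeholders, not artifacts from the paper:

    from gensim.models import KeyedVectors

    # Load pretrained Swedish word vectors (word2vec text format assumed;
    # "swedish_vectors.vec" is a hypothetical file name).
    vectors = KeyedVectors.load_word2vec_format("swedish_vectors.vec", binary=False)

    # A single analogy by vector arithmetic: "kung" (king) - "man" (man)
    # + "kvinna" (woman) should land near "drottning" (queen).
    print(vectors.most_similar(positive=["kung", "kvinna"], negative=["man"], topn=1))

    # Batch scoring against a questions file in the standard
    # questions-words.txt format; the file name is again a placeholder.
    accuracy, sections = vectors.evaluate_word_analogies("swedish_analogies.txt")
    print(f"Overall analogy accuracy: {accuracy:.3f}")

A run over a full questions file reports per-section and overall accuracy, which is the usual way such intrinsic comparisons between corpora are scored.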

Related research

Exploring Swedish & English fastText Embeddings with the Transformer (07/23/2020)
The Challenge of Diacritics in Yoruba Embeddings (11/15/2020)
The Web Is Your Oyster – Knowledge-Intensive NLP against a Very Large Web Corpus (12/18/2021)
Wikipedia2Vec: An Optimized Tool for Learning Embeddings of Words and Entities from Wikipedia (12/15/2018)
Know thy corpus! Robust methods for digital curation of Web corpora (03/13/2020)
Word2Vec: Optimal Hyper-Parameters and Their Impact on NLP Downstream Tasks (03/23/2020)
