LScDC-new large scientific dictionary

12/14/2019
by Neslihan Suzen, et al.

In this paper, we present a scientific corpus of abstracts of academic papers in English, the Leicester Scientific Corpus (LSC). The LSC contains 1,673,824 abstracts of research articles and proceedings papers indexed by Web of Science (WoS) with publication year 2014. Each abstract is assigned to at least one of 252 subject categories. Paper metadata include these categories and the number of citations. We then develop two scientific dictionaries, the Leicester Scientific Dictionary (LScD) and the Leicester Scientific Dictionary-Core (LScDC), whose words are extracted from the LSC. The LScD is a list of 974,238 unique words (lemmas). The LScDC is a core list (sub-list) of the LScD with 104,223 lemmas, created by removing LScD words that appear in no more than 10 texts in the LSC. The LScD and LScDC are available online. Both the corpus and the dictionaries are developed for later use in the quantification of meaning in academic texts. Finally, the core list LScDC is analysed by comparing its words and word frequencies with a classic academic word list, the New Academic Word List (NAWL), which contains 963 word families and is also sampled from an academic corpus. The major sources of the corpus from which the NAWL is extracted are the Cambridge English Corpus (CEC), oral sources and textbooks. We investigate whether the two dictionaries are similar in terms of common words and the ranking of words. Our comparison leads to the main conclusion: most of the words of the NAWL (99.6%) are present in the LScDC, but the two lists differ in word ranking. This difference is measured.
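
The core-list rule described above (drop every word that appears in no more than 10 texts) amounts to a document-frequency filter. The following is a minimal sketch of that idea, not the authors' pipeline: the function names `document_frequencies` and `core_dictionary` and the parameter `max_rare_docs` are hypothetical, and lowercased whitespace splitting stands in for the lemmatisation used to build the LScD.

```python
from collections import Counter


def document_frequencies(texts):
    """Count, for each word, the number of texts it appears in."""
    df = Counter()
    for text in texts:
        # Assumption: whitespace tokens stand in for lemmas.
        # set() ensures a word is counted once per text.
        df.update(set(text.lower().split()))
    return df


def core_dictionary(texts, max_rare_docs=10):
    """Keep only words that appear in more than `max_rare_docs` texts."""
    df = document_frequencies(texts)
    return {word: n for word, n in df.items() if n > max_rare_docs}


# Toy corpus; the threshold is lowered to 1 so the filter is visible.
toy = ["protein binding assay", "protein structure", "graph structure theory"]
core = core_dictionary(toy, max_rare_docs=1)
print(sorted(core))  # ['protein', 'structure']
```

With the default `max_rare_docs=10`, the filter matches the threshold stated in the abstract. Counting texts rather than total occurrences means a word repeated many times in a single abstract is still treated as rare.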


research
07/29/2023

Automatic Extraction of the Romanian Academic Word List: Data and Methods

This paper presents the methodology and data used for the automatic extr...
research
04/13/2018

Neologisms on Facebook

In this paper, we present a study of neologisms and loan words frequentl...
research
03/05/2020

Automatic Compilation of Resources for Academic Writing and Evaluating with Informal Word Identification and Paraphrasing System

We present the first approach to automatically building resources for ac...
research
02/01/2020

Novel Language Resources for Hindi: An Aesthetics Text Corpus and a Comprehensive Stop Lemma List

This paper is an effort to complement the contributions made by research...
research
09/10/2020

The Grievance Dictionary: Understanding Threatening Language Use

This paper introduces the Grievance Dictionary, a psycholinguistic dicti...
research
04/19/2019

Recognizing the vocabulary of Brazilian popular newspapers with a free-access computational dictionary

We report an experiment to check the identification of a set of words in...
research
06/13/2022

Automatic generation of a large dictionary with concreteness/abstractness ratings based on a small human dictionary

Concrete/abstract words are used in a growing number of psychological an...
