Corpus analysis without prior linguistic knowledge - unsupervised mining of phrases and subphrase structure

02/18/2016
by Stefan Gerdjikov, et al.

When looking at the structure of natural language, "phrases" and "words" are central notions. We consider the problem of identifying such "meaningful subparts" of language, of any length, together with the underlying composition principles, in a completely corpus-based and language-independent way, without using any kind of prior linguistic knowledge. We describe unsupervised methods for identifying "phrases", mining subphrase structure, and finding words in a fully automated way. This can be considered a step towards automatically computing a "general dictionary and grammar of the corpus". We hope that, in the long run, variants of our approach will turn out to be useful for other kinds of sequence data as well, such as speech, genome sequences, or music annotation. Although immediate applications are not our primary interest, results obtained for a variety of languages show that our methods are of interest for many practical tasks in text mining, terminology extraction and lexicography, search engine technology, and related fields.


