Phrase Mining

06/28/2022
by Ellie Small, et al.

Extracting frequent words from a collection of texts is done at large scale in many fields. Extracting phrases, by contrast, is uncommon because of inherent complications, the most significant being double-counting: words or phrases are counted when they appear inside longer phrases that are themselves counted. Several papers on phrase mining describe solutions to this issue; however, they either require a list of so-called quality phrases to be available to the extraction process, or they require human interaction to identify those quality phrases during the process. We present a method that eliminates double-counting without the need to identify lists of quality phrases. In the context of a set of texts, we define a principal phrase as a phrase that does not cross punctuation marks; does not start with a stop word, except for the stop words "not" and "no"; does not end with a stop word; is frequent within those texts without being double-counted; and is meaningful to the user. Our method identifies such principal phrases without human input and enables their extraction from any texts. An R package called phm implements this method.
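
To make the structural rules in this definition concrete, here is a minimal Python sketch. It is not the algorithm implemented in phm, which the abstract does not describe; the greedy longest-first counting, the tiny stop-word list, and the names `mine_phrases`, `min_freq`, and `max_len` are all assumptions made for illustration. The sketch does not attempt to judge whether a phrase is meaningful to the user.

```python
import re
from collections import defaultdict

# Assumed, deliberately tiny stop-word list; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "is", "to", "not", "no"}
# "not" and "no" may start a phrase, but no stop word may end one.
BLOCKED_STARTS = STOP_WORDS - {"not", "no"}


def inside(span, cover):
    """True if span lies within cover (same text, same punctuation-free segment)."""
    return (span[0], span[1]) == (cover[0], cover[1]) and \
        cover[2] <= span[2] and span[3] <= cover[3]


def mine_phrases(texts, min_freq=2, max_len=3):
    """Illustrative sketch: count multi-word phrases without double-counting."""
    occurrences = defaultdict(list)
    for d, text in enumerate(texts):
        # Split on punctuation so that no phrase crosses a punctuation mark.
        for s, segment in enumerate(re.split(r"[^\w\s']+", text.lower())):
            words = segment.split()
            for i in range(len(words)):
                for j in range(i + 2, min(i + max_len, len(words)) + 1):
                    if words[i] in BLOCKED_STARTS or words[j - 1] in STOP_WORDS:
                        continue
                    occurrences[" ".join(words[i:j])].append((d, s, i, j))

    counts = {}
    covered = []  # spans already claimed by a longer retained phrase
    # Assumption: visit longer phrases first, so an occurrence of a shorter
    # phrase inside a retained longer phrase is not counted a second time.
    for phrase in sorted(occurrences, key=lambda p: (-len(p.split()), p)):
        free = [o for o in occurrences[phrase]
                if not any(inside(o, c) for c in covered)]
        if len(free) >= min_freq:
            counts[phrase] = len(free)
            covered.extend(free)
    return counts


texts = [
    "I love New York City.",
    "New York City is big.",
    "New York is a state.",
    "We visited New York.",
]
print(mine_phrases(texts))
# prints {'new york city': 2, 'new york': 2}
```

In this example "new york" occurs four times in total, but two of those occurrences sit inside "new york city", so only the two standalone occurrences are counted. "york city" never occurs outside the longer phrase and is therefore dropped.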

research
09/12/2017

Human Associations Help to Detect Conventionalized Multiword Expressions

In this paper we show that if we want to obtain human evidence about con...
research
02/15/2017

Automated Phrase Mining from Massive Text Corpora

As one of the fundamental tasks in text analysis, phrase mining aims at ...
research
06/03/2017

Task-specific Word Identification from Short Texts Using a Convolutional Neural Network

Task-specific word identification aims to choose the task-related words ...
research
02/07/2020

How do Quantifiers Affect the Quality of Requirements?

Context: Requirements quality can have a substantial impact on the effec...
research
01/26/2021

pdfPapers: shell-script utilities for frequency-based multi-word phrase extraction from PDF documents

Biomedical research is intensive in processing information in the previo...
research
04/18/2023

Approximate Nearest Neighbour Phrase Mining for Contextual Speech Recognition

This paper presents an extension to train end-to-end Context-Aware Trans...
research
04/02/2016

Discriminative Phrase Embedding for Paraphrase Identification

This work, concerning paraphrase identification task, on one hand contri...
