Computing Word Classes Using Spectral Clustering

08/16/2018
by   Effi Levi, et al.
0

Clustering a lexicon of words is a well-studied problem in natural language processing (NLP). Word clusters are used to deal with sparse data in statistical language processing, as well as features for solving various NLP tasks (text categorization, question answering, named entity recognition and others). Spectral clustering is a widely used technique in the field of image processing and speech recognition. However, it has scarcely been explored in the context of NLP; specifically, the method used in this (Meila and Shi, 2001) has never been used to cluster a general word lexicon. We apply spectral clustering to a lexicon of words, evaluating the resulting clusters by using them as features for solving two classical NLP tasks: semantic role labeling and dependency parsing. We compare performance with Brown clustering, a widely-used technique for word clustering, as well as with other clustering methods. We show that spectral clusters produce similar results to Brown clusters, and outperform other clustering methods. In addition, we quantify the overlap between spectral and Brown clusters, showing that each model captures some information which is uncaptured by the other.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
06/14/2022

An Experimental Investigation of Part-Of-Speech Taggers for Vietnamese

Part-of-speech (POS) tagging plays an important role in Natural Language...
research
11/10/2019

TENER: Adapting Transformer Encoder for Named Entity Recognition

The Bidirectional long short-term memory networks (BiLSTM) have been wid...
research
08/03/2016

Improving Quality of Hierarchical Clustering for Large Data Series

Brown clustering is a hard, hierarchical, bottom-up clustering of words ...
research
11/10/2019

TENER: Adapting Transformer Encoder for Name Entity Recognition

The Bidirectional long short-term memory networks (BiLSTM) have been wid...
research
01/27/2017

Bangla Word Clustering Based on Tri-gram, 4-gram and 5-gram Language Model

In this paper, we describe a research method that generates Bangla word ...
research
07/15/2018

Concept-Based Embeddings for Natural Language Processing

In this work, we focus on effectively leveraging and integrating informa...
research
04/15/2016

Parallelizing Word2Vec in Shared and Distributed Memory

Word2Vec is a widely used algorithm for extracting low-dimensional vecto...

Please sign up or login with your details

Forgot password? Click here to reset