CommunityFish: A Poisson-based Document Scaling With Hierarchical Clustering

08/28/2023
by   Sami Diaf, et al.

Document scaling has been a key component of text-as-data applications in the social sciences and a major field of interest for political researchers, who aim to uncover differences between speakers or parties using various probabilistic and non-probabilistic approaches. Yet most of these techniques either rest on the agnostic bag-of-words hypothesis or use prior information borrowed from external sources, which may embed significant bias in the results. While a corpus has long been treated as a collection of documents, it can also be viewed as a dense network of connected words whose structure can be clustered, based on word co-occurrences in documents, into independent groups known as communities. This paper introduces CommunityFish, an augmented version of Wordfish that applies a hierarchical clustering algorithm, namely the Louvain algorithm, to the word space to yield communities as semantically coherent, independent n-grams emerging from the corpus, and uses these communities, rather than the word space itself, as the input to the Wordfish method. This strategy improves the interpretability of the results, since communities have a non-overlapping structure and hence strong discriminative power between parties or speakers, and it also allows a faster execution of the Poisson scaling model. Beyond yielding communities, which serve as subtopic proxies, the technique outperforms the classic Wordfish model: it highlights historical developments in the U.S. State of the Union addresses and replicates the prevailing political stance in Germany when applied to the corpus of parties' legislative manifestos.
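The preprocessing stage the abstract describes can be sketched as follows: build a word co-occurrence network over the corpus, detect communities with the Louvain algorithm, and aggregate document-term counts by community so that communities, not individual words, become the input to the Poisson scaling step. This is a minimal illustration, not the authors' implementation; the toy corpus, edge-weighting scheme, and variable names are assumptions, and the final Wordfish estimation step is omitted.

```python
# Minimal sketch of the CommunityFish preprocessing pipeline:
# co-occurrence network -> Louvain communities -> community count vectors.
from collections import Counter
from itertools import combinations

import networkx as nx
from networkx.algorithms.community import louvain_communities

# Toy corpus (illustrative only; the paper uses SOTU addresses and manifestos).
docs = [
    "tax cuts for working families",
    "tax relief and economic growth",
    "climate policy and clean energy",
    "clean energy jobs and climate action",
]
tokenized = [d.split() for d in docs]

# Word co-occurrence graph: edge weight = number of documents
# in which both words appear together.
G = nx.Graph()
for tokens in tokenized:
    for w1, w2 in combinations(sorted(set(tokens)), 2):
        if G.has_edge(w1, w2):
            G[w1][w2]["weight"] += 1
        else:
            G.add_edge(w1, w2, weight=1)

# Louvain community detection on the word network
# (available in networkx >= 3.0).
communities = louvain_communities(G, weight="weight", seed=42)

# Replace the word space with the community space: each document becomes
# a vector of community frequencies, the input to the Poisson scaling model.
word2comm = {w: i for i, comm in enumerate(communities) for w in comm}
doc_comm_counts = [Counter(word2comm[w] for w in tokens) for tokens in tokenized]
```

Because Louvain yields a non-overlapping partition of the vocabulary, each word maps to exactly one community, which is what gives the resulting document-by-community matrix its interpretability.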


