Topic Modeling on User Stories using Word Mover's Distance

07/10/2020
by   Kim Julian Gülle, et al.
0

Requirements elicitation has recently been complemented with crowd-based techniques, which continuously involve large, heterogeneous groups of users who express their feedback through a variety of media. Crowd-based elicitation has great potential for engaging with (potential) users early on but also results in large sets of raw and unstructured feedback. Consolidating and analyzing this feedback is a key challenge for turning it into sensible user requirements. In this paper, we focus on topic modeling as a means to identify topics within a large set of crowd-generated user stories and compare three approaches: (1) a traditional approach based on Latent Dirichlet Allocation, (2) a combination of word embeddings and principal component analysis, and (3) a combination of word embeddings and Word Mover's Distance. We evaluate the approaches on a publicly available set of 2,966 user stories written and categorized by crowd workers. We found that a combination of word embeddings and Word Mover's Distance is most promising. Depending on the word embeddings we use in our approaches, we manage to cluster the user stories in two ways: one that is closer to the original categorization and another that allows new insights into the dataset, e.g. to find potentially new categories. Unfortunately, no measure exists to rate the quality of our results objectively. Still, our findings provide a basis for future work towards analyzing crowd-sourced user stories.

READ FULL TEXT

page 1

page 3

page 7

research
03/15/2016

Topic Modeling Using Distributed Word Embeddings

We propose a new algorithm for topic modeling, Vec2Topic, that identifie...
research
04/30/2020

Tired of Topic Models? Clusters of Pretrained Word Embeddings Make for Fast and Good Topics too!

Topic models are a useful analysis tool to uncover the underlying themes...
research
07/22/2020

Better Early than Late: Fusing Topics with Word Embeddings for Neural Question Paraphrase Identification

Question paraphrase identification is a key task in Community Question A...
research
03/03/2018

Understanding and Improving Multi-Sense Word Embeddings via Extended Robust Principal Component Analysis

Unsupervised learned representations of polysemous words generate a larg...
research
01/11/2021

Clustering Word Embeddings with Self-Organizing Maps. Application on LaRoSeDa – A Large Romanian Sentiment Data Set

Romanian is one of the understudied languages in computational linguisti...
research
01/13/2022

Compressing Word Embeddings Using Syllables

This work examines the possibility of using syllable embeddings, instead...
research
05/18/2020

Exploring Software Reusability Metrics with Q A Forum Data

Question and answer (Q A) forums contain valuable information regardin...

Please sign up or login with your details

Forgot password? Click here to reset