Topic Modeling on User Stories using Word Mover's Distance

07/10/2020 ∙ by Kim Julian Gülle, et al. ∙ Berlin Institute of Technology (Technische Universität Berlin)

Requirements elicitation has recently been complemented with crowd-based techniques, which continuously involve large, heterogeneous groups of users who express their feedback through a variety of media. Crowd-based elicitation has great potential for engaging with (potential) users early on but also results in large sets of raw and unstructured feedback. Consolidating and analyzing this feedback is a key challenge for turning it into sensible user requirements. In this paper, we focus on topic modeling as a means to identify topics within a large set of crowd-generated user stories and compare three approaches: (1) a traditional approach based on Latent Dirichlet Allocation, (2) a combination of word embeddings and principal component analysis, and (3) a combination of word embeddings and Word Mover's Distance. We evaluate the approaches on a publicly available set of 2,966 user stories written and categorized by crowd workers. We found that a combination of word embeddings and Word Mover's Distance is most promising. Depending on the word embeddings we use in our approaches, we manage to cluster the user stories in two ways: one that is closer to the original categorization and another that allows new insights into the dataset, e.g. to find potentially new categories. Unfortunately, no measure exists to rate the quality of our results objectively. Still, our findings provide a basis for future work towards analyzing crowd-sourced user stories.




I Introduction

In traditional Requirements Engineering (RE), techniques like surveys, workshops, observations, and interviews are used to gather stakeholder input and elicit software requirements [45]. Usually, these techniques are limited and can only be applied to end-users within organizational reach [33]. With the emergence of new data sources, this changes: Researchers have shown different approaches to extract requirements from feedback channels such as tweets or app store reviews [33, 40]. Another approach to gather a broad range of feedback is crowd-sourcing [12]. In 2016, Murukannaiah et al. elicited 2,966 requirements for smart home applications from crowd workers [30]. Such forms of user feedback can be used to identify, prioritize, and manage requirements [22] for software products and to increase user satisfaction [34]. However, automatic techniques are necessary to derive useful insights from the large amounts of raw data the crowd can produce [22, 31, 12]. This becomes even more apparent as the decision-making process in requirements engineering shifts towards a more data-driven approach [23]. The automatic analysis of crowd-based requirements comes with some challenges, though, as Murukannaiah et al. pointed out [31]. In this paper, we address how to summarize crowd-acquired requirements automatically. In our evaluation, these requirements are represented in the form of user stories, which seem to be an appropriate form for crowd-sourced requirements elicitation [25, 15]. Our contribution is a method to cluster requirements through the combined use of topic modeling techniques and similarity metrics based on word embeddings. The goal is to provide the basis for an automatic solution that identifies groups of requirements or features in crowd-sourced data. Existing work on automatic requirements clustering is mostly based on Latent Dirichlet Allocation (LDA), a statistical model that characterizes a requirement by a distribution over certain latent topics.
In contrast, we cluster user stories based on word embeddings and distance measures. Although an objective and quantified evaluation is not possible in our study setup due to a missing ground truth, we conclude that a clustering approach based on pretrained word embeddings and Word Mover’s Distance (WMD) as distance measure produced the most promising and interesting results in our setting.

II Background

II-A The CrowdRE Dataset

With the intention to “facilitate large scale user participation in RE”, Murukannaiah et al. [31, 30] conducted an empirical study on the Amazon Mechanical Turk platform, resulting in the CrowdRE dataset consisting of 2,966 crowd-generated requirements for smart home applications.

The dataset was generated in two phases: First, 300 crowd workers were asked to formulate requirements for smart home applications in the form of user stories (As a [role], I want [feature] so that [benefit]). The author of each requirement had to assign it to one of five domains (Energy, Entertainment, Health, Safety, or Other). Additionally, an arbitrary number of free-text tags could be added.

In the second phase, 309 additional crowd workers rated the requirements of the first phase with regard to clarity, usefulness, and novelty. For our work, the results of the first phase are the primary data source since we want to extract topics based on the textual data. Below, an exemplary requirement and its annotated domain and tags are given:

Requirement: “As a pet owner, I want my smart home to let me know when the dog uses the doggy door, so that I can keep track of the pets whereabouts.”
Domain: Safety
Tags: Pets, Cats, Dogs

II-B Latent Dirichlet Allocation (LDA)

LDA, proposed by Blei et al. [5], is a generative probabilistic model used to uncover hidden groups of similar data, called topics, within a dataset. The authors define a word as an item from a vocabulary, a document as a sequence of words, and a corpus as a collection of documents. The approach aims to find a limited number of topics that are latent in the documents of the corpus. To do so, it is assumed that each document is a mixture of a limited number of latent topics, with each topic being modeled as a probability distribution over all words in the vocabulary. Based on this generative model for a collection of documents, the LDA approach uses backtracking to find a set of topics that likely have generated the corpus. Therefore, for a new document, it is possible to infer the involved latent topics and assign a topic label [32]. However, LDA suffers from order effects [1], i.e. if the input data is shuffled, different topics can be retrieved. This leads to different results each time the algorithm computes the topics and therefore introduces new challenges for subsequent text mining algorithms. Additionally, being a probabilistic model, LDA describes the relationship between words as a statistical relationship of occurrences without considering the semantic information embedded in words [32]. Therefore, the similarity between words based on their meaning cannot be discovered [27], which, in turn, can result in overly broad topics [32].
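For reference, the generative process assumed by (smoothed) LDA can be written compactly. The notation below is the one commonly used in the literature and is an assumption of this sketch, not taken from the paper: $K$ topics, $D$ documents, Dirichlet hyper-parameters $\alpha$ and $\eta$, per-document topic proportions $\theta_d$, per-topic word distributions $\beta_k$, and a topic assignment $z_{d,i}$ for the $i$-th word $w_{d,i}$ of document $d$.

```latex
\begin{align*}
\beta_k &\sim \mathrm{Dirichlet}(\eta), & k &= 1,\dots,K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1,\dots,D \\
z_{d,i} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d) \\
w_{d,i} \mid z_{d,i}, \beta &\sim \mathrm{Categorical}(\beta_{z_{d,i}})
\end{align*}
```

Inference then runs in the opposite direction: given only the observed words $w_{d,i}$, it estimates the $\theta_d$ and $\beta_k$ that plausibly generated them.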

II-C Word Vectors and Word Embeddings

To overcome the introduced shortcomings, continuous space neural network language models can be trained to capture both the syntactic and the semantic regularities of language. A common defining feature of such models is that each word is converted into a high-dimensional real-valued vector (word vector) via learned lookup-tables. A property of these models is that similar words are likely to have similar vectors [29].

II-C1 Word2Vec

Although several architectures for the computation of word vectors exist [27, 29], according to Mikolov et al., none of these “architectures has been successfully trained on more than a few hundred of millions of words” [27], as they become computationally very expensive with larger data sets. This also applies to the previously mentioned LDA. Addressing this shortcoming, Mikolov et al. propose two optimized neural network architectures for calculating word vectors at a significantly reduced learning time: the Continuous Bag-of-Words (CBOW) model and the continuous skip-gram model [27].

The idea behind the CBOW architecture is to predict the current word based on the context, whereas the skip-gram model predicts surrounding words given the current word [27]. Both are shallow neural network architectures consisting of an input layer, a projection layer, and an output layer [27, 35]. Once the language model is trained on any of these architectures, the projection layer holds a dense representation of the word vectors, also called word embedding. These embeddings preserve the syntactic and semantic information of the words. Therefore, when displayed in vector space, it is possible to express these syntactic and semantic similarities by vector offsets, where all pairs of words sharing a particular relation are related by the same constant offset [29].
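The difference between the two architectures is easiest to see in the training examples they are fed. The following minimal sketch only generates those examples; it does not train a network. The function name `training_pairs` and the `window` parameter are hypothetical choices made for this illustration.

```python
def training_pairs(tokens, window=2):
    """Return (cbow, skipgram) training examples for one token sequence."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if not context:
            continue
        # CBOW: predict the current word from its surrounding context.
        cbow.append((tuple(context), target))
        # Skip-gram: predict each surrounding word from the current word.
        skipgram.extend((target, c) for c in context)
    return cbow, skipgram


# Toy token sequence (illustrative only).
cbow, skipgram = training_pairs(["play", "favorite", "music", "arrive"], window=1)
```

CBOW yields one example per position (context in, word out), while skip-gram yields one example per word-context pair, which is why skip-gram sees more training examples on the same text.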

II-C2 Word Mover’s Distance

While word2vec is a sophisticated approach when it comes to generating quality word embeddings, the word vectors alone are not sufficient for the task of topic modeling. Consider the two documents: “My smart home should turn on my favorite music when I come to my home.” and “My smart home shall play my most favored songs when I arrive at my place.” The sentences basically convey the same information. Plotting these sentences with word embeddings, some of their vectors will even be close, especially if word-wise similarity is given (e.g. the pairs <music, songs> and <come, arrive> are close). The closeness of the whole sentences, on the other hand, cannot be represented in the word2vec model alone. To overcome this shortcoming, Kusner et al. introduced Word Mover’s Distance (WMD) as a word-based distance measure for whole sentences [17]. Based on previously created word embeddings (for example those from word2vec), the distance between two text documents A and B is described as the minimum cumulative distance that words from document A need to travel to match exactly the point cloud of document B. Using this method, WMD reaches a high retrieval accuracy while being completely free of hyper-parameters and therefore straightforward to use.
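To make the "minimum cumulative travel distance" concrete, the sketch below computes WMD exactly for a restricted special case: two documents with the same number of tokens, where every token carries equal mass. In that case an optimal transport plan is a one-to-one matching, so trying all matchings suffices for tiny inputs. The 2-dimensional vectors in `EMB` are invented for illustration and are not real word2vec embeddings; the general case requires a transport solver and is not covered here.

```python
from itertools import permutations
from math import dist

# Hypothetical 2-dimensional "embeddings", invented for this illustration only.
EMB = {
    "music": (1.0, 0.0), "songs": (1.0, 0.2),
    "come": (0.0, 1.0), "arrive": (0.1, 1.0),
    "thermostat": (5.0, 5.0),
}


def wmd_equal_length(doc_a, doc_b, emb=EMB):
    """Exact WMD for two token lists of equal length n (brute force).

    Every token carries mass 1/n, so an optimal plan is a one-to-one
    matching of tokens; we take the cheapest matching over all permutations.
    """
    if len(doc_a) != len(doc_b):
        raise ValueError("this sketch only handles documents of equal length")
    n = len(doc_a)
    best = min(
        sum(dist(emb[a], emb[b]) for a, b in zip(doc_a, perm))
        for perm in permutations(doc_b)
    )
    return best / n


near = wmd_equal_length(["music", "come"], ["arrive", "songs"])
far = wmd_equal_length(["music", "come"], ["thermostat", "thermostat"])
```

Although the first pair of documents shares no word, its distance is small because each word has a close counterpart in the other document; the second pair is far apart.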

III Related Work

When it comes to topic modeling, the LDA algorithm is the most widely-used technique in recent approaches throughout software engineering [1]. In the following, we will introduce some LDA-based approaches and approaches based on word embeddings.

Guzman and Maalej [13] applied NLP and sentiment analysis to extract software features from user reviews together with a summary of the user opinions about each feature. To identify high-level features, they used LDA on a set of features they extracted from the reviews.

Galvis Carreño and Winbladh [11] extracted word-based topics from reviews and assigned sentiments to them through a combination of LDA and sentiment analysis. Similarly, Chen et al. [6] proposed AR-miner, a review analytics framework for summarizing informative app reviews. The tool first filters noisy and irrelevant reviews, such as ratings. Then, it summarizes and ranks the informative reviews using topic modeling (LDA and Aspect and Sentiment Unification Model (ASUM)) and heuristics from the review metadata.

Another approach is presented by Asuncion et al. [2] who propose a method that automatically records traceability links and then performs topic modeling. The topic model is learned over the artifacts and allows a semantic categorization and topical visualization of the system. The presented tools aid users to analyze the semantic nature of artifacts and the software architecture itself.

Another LDA-based approach is presented by Barua et al. [3]. They use LDA to automatically identify the main topics in the textual content of Stack Overflow discussions. They additionally quantify how these topics change over time to retrieve emerging trends and gain more detailed insights into the needs of developers. A similar approach is presented by Zhou et al. [41]. They evaluated over 200,000 Wikipedia articles and, as a second analysis, applied their LDA-based approach to a set of Twitter messages from 10,000 users. They were able to retrieve articles as well as Twitter users that cover similar content. However, due to the large amount of data, the LDA approach turned out to be computationally expensive.

While the above-mentioned approaches mainly focus on user-generated content (user reviews, Stack Overflow posts), Hindle et al. have applied LDA to extract topics from documented requirements at Microsoft [14] and found that many topics were relevant to features and development effort. Stakeholders who were familiar with the requirements documents tended to be comfortable labeling the topics and identifying behavior, but those who were not showed some resistance to the task of topic labeling.

In contrast to the above-introduced approaches, methods based on recent neural probabilistic language models [4] have been shown to address the shortcomings of the LDA-based approaches. In particular, the already introduced Word2Vec approach proposed by Mikolov et al. [27] serves as a basis for multiple other approaches. One of these approaches is presented by Qiang et al. [35], who propose an embedding-based Topic Model (ETM) that uses semantic knowledge from word embeddings to alleviate the problem of very limited word co-occurrence information in short texts. They claim that their method outperforms the state-of-the-art methods, including LDA, on two real-world datasets.

Another approach built upon word embeddings, which aims to solve the problem of sparse word co-occurrences, is presented by Li et al. [20]. They propose a method that is particularly suited to perform topic modeling on short texts by incorporating knowledge about word semantics, learned from a large number of external documents.

Wu and Li [44] present an approach called Topic Mover’s Distance (TMD), being a topic-based distance metric for documents, inspired by WMD. In their approach, each document is considered to be composed of predefined topics and each topic is denoted by a word cluster. These word clusters are then expanded to a vector space in which TMD measures how far topics need to travel from one document to another.

IV Proposed Approaches

In this section, we present the three topic modeling approaches we use. Figure 1 shows an overview of the different process steps for each approach, with A1 to A3 denoting the three approaches as detailed later in the respective sections. We published the code of the three approaches to enable replication and reuse. Before detailing the separate approaches, some common processing steps are covered.

Fig. 1: Overview of the analyzed approaches (A1, A2, and A3)

IV-A Dataset Preparation

For our topic modeling approaches, we only use the text of the requirements without any ratings or user characterization added to the data. We therefore extract the sentences of the Crowd RE dataset to construct our dataset.

Fig. 2: Distribution of domains within the user stories.

Additionally, we need a measure to compare the proposed approaches. For this purpose, we used the domains assigned to the requirements in the corpus as labels. The domains are separated into five groups: Health, Energy, Entertainment, Safety and Other. The Other category contains additional user-defined, specific domains. For our study, we focus on the five top-level domains. Figure 2 shows the distribution of domains within the user stories. The Safety domain shows the most associated requirements while the least represented domain, Other, exhibits roughly half the number of requirements. Still, the latter category contains about 400 requirements. Therefore, there is no under-representation of any of the categories. This labeling approach results in every requirement receiving a label and the number of categories remaining rather small, as to not over-complicate the supposed topics to be identified.

IV-B Natural Language Processing Pipeline

In order to condition the data for further processing, we perform several Natural Language Processing (NLP) operations. Our NLP pipeline is shown in Figure 3 with an exemplary application to a requirement from the dataset.

Fig. 3: Processing an exemplary requirement sentence through our NLP preprocessing pipeline.

As some of the requirements sentences contain special characters, an initial data cleaning is necessary. We remove all non-alphabetic characters, as they do not provide any semantic value.

We apply tokenization to separate the requirements into sequences of tokens [39]. In our application, the tokens are simply the single words separated by whitespace. Therefore, all whitespace and punctuation are removed from the requirements. The tokenization yields a list of tokens per requirement. After this step, the data exhibits 4,968 unique tokens.

Stopword removal is applied to remove words from the data that do not provide any semantic value [26]. In addition to the stopwords, we remove all template words (the predetermined template words in the Crowd RE dataset are: as, smart, home, owner, i, want, be, able) from the requirements, as these are common to all requirements and therefore do not provide information to distinguish different topics. This reduces the size of the vocabulary, i.e. the number of unique tokens, to 4,851.
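The three preprocessing steps can be sketched in a few lines. The template words are the ones listed above; the stopword list is a tiny illustrative stand-in (an assumption, since the list actually used is not given here), and lower-casing is likewise assumed. Non-alphabetic characters are replaced by a space so that neighboring words are not glued together.

```python
import re

# Template words listed in the text for the Crowd RE user stories.
TEMPLATE_WORDS = {"as", "smart", "home", "owner", "i", "want", "be", "able"}
# Tiny illustrative stopword list (assumption); a real run would use a full list.
STOPWORDS = {"a", "an", "the", "to", "so", "that", "my", "me", "when", "can", "of"}


def preprocess(requirement):
    # Data cleaning: keep alphabetic characters, turn everything else into a space.
    cleaned = re.sub(r"[^A-Za-z]+", " ", requirement)
    # Tokenization: lower-case (assumption) and split on whitespace.
    tokens = cleaned.lower().split()
    # Stopword and template-word removal.
    return [t for t in tokens if t not in STOPWORDS and t not in TEMPLATE_WORDS]


example = ("As a pet owner, I want my smart home to let me know when the dog "
           "uses the doggy door, so that I can keep track of the pets whereabouts.")
tokens = preprocess(example)
```

For the example requirement from Section II-A, only content-bearing tokens such as "pet", "dog", and "door" remain.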

Fig. 4: Final representation of requirements for each approach before visualization: (a) Approach 1, (b) Approach 2, (c) Approach 3.

IV-C Approach 1: LDA

The presented LDA approach is mainly introduced to serve as a reference to the results of the two other neural network approaches. In Figure 1, this approach is denoted as A1.

After initial preprocessing, Bag-of-Words (BoW) [27] is applied to transfer the requirements to a numerical representation. Bag-of-Words is one of the basic techniques used to simplify sentences or documents in numerical space. In this application, a BoW vector is constructed for each requirement. Each vector has the size of the vocabulary of the dataset, i.e. the number of unique tokens in the dataset. For each requirement, there is a count per token in the vector representing how often each word in the vocabulary appears in the requirement.

Subsequently, a weighting scheme is applied to these BoW vectors; specifically, we chose Term Frequency – Inverse Document Frequency (TF-IDF) [38]. The term frequency measures how often a specific term occurs in the text, as already specified by the BoW vectors. The inverse document frequency is a measure of how relevant a term is in relation to all samples within the dataset [18]. For example, if a term occurs in every sample of the dataset, it is assumed not to be very informative for differentiating samples. Very rarely occurring words are assumed to exhibit more explanatory power.
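As a minimal sketch of these two steps, the following code builds BoW count vectors and weights them with idf = log(N / df), where N is the number of documents and df the number of documents containing the term. This is one common TF-IDF variant and an assumption of this example; the exact weighting formula used in the study is not stated in the text. The toy documents are invented.

```python
from collections import Counter
from math import log


def bow_vectors(docs):
    """Return (vocabulary, count vectors), one vector per tokenized document."""
    vocab = sorted({tok for doc in docs for tok in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def tfidf_vectors(vectors):
    """Weight raw counts by idf = log(N / df), one common TF-IDF variant."""
    n_docs = len(vectors)
    n_terms = len(vectors[0])
    # Document frequency: in how many documents does each term occur?
    df = [sum(1 for vec in vectors if vec[j] > 0) for j in range(n_terms)]
    idf = [log(n_docs / df[j]) for j in range(n_terms)]
    return [[vec[j] * idf[j] for j in range(n_terms)] for vec in vectors]


# Toy corpus (illustrative only).
docs = [["music", "play", "music"], ["music", "kitchen"], ["music", "thermostat"]]
vocab, bow = bow_vectors(docs)
weighted = tfidf_vectors(bow)
```

Here "music" occurs in every document, so its weight becomes zero, while the rarer "play" keeps a positive weight. This mirrors the intuition described above.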

With every requirement represented as a weighted vector, an LDA is applied to identify the latent topics within the data. As our labeling approach from Section IV-A introduces five different labels to the requirements, we set the number of topics to be detected by the LDA to the same value. Therefore, the LDA produces vectors for the requirements containing the five probabilities that a requirement belongs to one of the identified topics. The resulting matrix is shown in Figure 4(a).

Lastly, we apply t-SNE [24], a dimensionality reduction technique, to this matrix to get a 2-dimensional representation for each requirement, which can be plotted.

IV-D Approach 2: Word Embeddings and PCA

In order for our approach to include semantic aspects of the requirements, we apply word embeddings as introduced in Section II-C. This approach is denoted in Figure 1 as A2.

For the second approach, we compare two different implementations, one with self-trained word embeddings and one with pretrained ones. Our self-trained vectors are produced via the skip-gram method proposed by Mikolov et al. [28]. We construct 50-dimensional word vectors. This size is chosen due to the limited size of the vocabulary in the dataset. Most of the 4,851 words in the vocabulary occur only very rarely in the data. We empirically determined that the best results are obtained with a minimum word occurrence of five, i.e. all words that appear fewer than five times in the data are dropped from the vocabulary and are not represented in the embedding. This results in about 24 % (1,159) of the words being incorporated in the self-trained embedding.
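The minimum-occurrence threshold can be illustrated independently of any embedding library. The sketch below only performs the vocabulary filtering; the function name `filter_vocabulary` and the toy corpus are hypothetical. Libraries such as gensim expose the same idea through a `min_count` parameter of their Word2Vec implementation, though the text does not state which library was used.

```python
from collections import Counter


def filter_vocabulary(docs, min_count=5):
    """Keep only tokens that occur at least `min_count` times in the corpus."""
    counts = Counter(tok for doc in docs for tok in doc)
    vocab = {tok for tok, c in counts.items() if c >= min_count}
    # Tokens outside the vocabulary are dropped from every document.
    filtered = [[tok for tok in doc if tok in vocab] for doc in docs]
    return vocab, filtered


# Toy corpus: "light" occurs five times, "aquarium" only once.
docs = [["light", "aquarium"], ["light"], ["light"], ["light"], ["light"]]
vocab, filtered = filter_vocabulary(docs, min_count=5)
```

A side effect visible in the toy corpus is that documents can shrink, which matters for the later step that depends on the length of the shortest requirement.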

With our dataset being relatively small and the created embedding not capturing all the semantic regularities due to the dropped words, we chose to also incorporate pretrained vectors. We use the word embedding from the Google News dataset, trained on a set of about 100 billion words. Due to the large embedding and the extensive training data, the individual vectors exhibit 300 dimensions. We thus expect the quality of these word vectors to be much higher and, therefore, to positively affect our topic modeling results, although we may lose the domain-specificity of self-trained embeddings. Of the 4,851 unique tokens in our data, 93 % (4,517 tokens) can be represented by the pretrained embedding. Tokens that are not included in the provided embedding are dropped. However, this loss in vocabulary only affects 13 % of the requirements, with the majority only missing one word.

To subsequently process the data, we first create a matrix for every requirement in the corpus by replacing each word with its vector representation. Due to the different lengths of the requirements, the resulting matrices have different dimensions. We apply PCA to reduce the different dimensions to the length of the shortest requirement in the dataset, thereby producing requirements matrices of equal dimensions. For the approach with self-trained embeddings, the shortest requirement exhibits only one token. Therefore, after this dimensionality reduction, each requirement is already represented as one 50-dimensional vector. For the approach applying pretrained embeddings, the minimal requirement length is three tokens. Therefore, each requirement is represented by three 300-dimensional vectors in this case. We subsequently combine all these matrices into a single matrix $M$.

For the pretrained representation, the result is a 3-dimensional matrix $M \in \mathbb{R}^{n \times d \times m}$, where $n$ is the total number of requirements, $d$ is the dimension of the word vectors, and $m$ is the length of the shortest sample in the dataset. To be able to later plot the results, we concatenate all word vectors per requirement to obtain one vector representing each requirement. The resulting matrix has dimensions $n \times (d \cdot m)$.

Figure 4(b) shows the form of the resulting matrix for the approaches with self-trained and pretrained embeddings. Finally, the matrices for each approach are processed via t-SNE to reduce the dimensions per requirement for plotting.

IV-E Approach 3: Word Mover’s Distance

As mentioned in Section II-C2, the document- or sentence-wise similarity cannot be captured by solely using word vectors. Therefore, the third and final approach employs word embeddings again but the subsequent processing is done with the Word Mover’s Distance (WMD). This approach is referred to in Figure 1 as A3.

As in the previous approach, we use and compare both the self-trained embedding and the pretrained one. We then apply the WMD to calculate the distances between the requirements. The result is a distance matrix $D \in \mathbb{R}^{n \times n}$, with $n$ being the total number of requirements (see Figure 4(c)). This matrix is subsequently reduced with t-SNE for plotting. The assumption is that requirements that are similar show similar distances to all other requirements and are therefore plotted closely as well.
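The structure of such a pairwise distance matrix can be sketched as follows. To keep the example self-contained, a simple Jaccard distance on token sets is used as a stand-in; it is explicitly not WMD, and any document distance function with the same signature could be plugged in instead. All names and the toy documents are hypothetical.

```python
def distance_matrix(docs, distance):
    """Symmetric n x n matrix of pairwise document distances."""
    n = len(docs)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Distances are symmetric, so only the upper triangle is computed.
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = distance(docs[i], docs[j])
    return matrix


def jaccard_distance(a, b):
    """Stand-in distance (NOT WMD): 1 - |A & B| / |A | B| on token sets."""
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b)


docs = [["music", "play"], ["music", "kitchen"], ["thermostat", "room"]]
D = distance_matrix(docs, jaccard_distance)
```

Each row of `D` describes one requirement by its distances to all others, which is the representation the assumption above relies on. Exploiting symmetry halves the number of distance computations, which is relevant because the pairwise WMD computation is the expensive part.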

V Results

We expect to find four different topics in the dataset, one for each of the predefined application domains: Energy, Entertainment, Health, and Safety. We also assume sentences categorized as Other to be visible as noise in the results, as these sentences may overlap topic-wise with the four concrete domains. To visualize the outcome of each approach, we transform the word-embedded user stories into 2-dimensional space using t-SNE. The marker colors and shapes follow the application domain the user stories are associated with: Health (purple), Entertainment (beige), Energy (teal), Safety (cherry), and Other (orange).

V-A LDA

Figure 5 shows that the LDA approach results in separable clusters. Also, some similarities between the requirements plotted next to each other can be found. However, the clusters do not show a strong overlap with the original domain labels.

Fig. 5: Results of approach A1: LDA with TF-IDF (plotted with t-SNE)

V-B Word Embeddings and PCA

As shown in Figure 6, we can identify two clusters in both plots resulting from the combination of word2vec and PCA for dimensionality reduction, independent of the choice of self-trained (Figure 6(a)) or pretrained (Figure 6(b)) word embeddings. As with most machine learning techniques, it is difficult to say why exactly our approach resulted in these two clusters [37]. At least, different settings for the perplexity and learning rate of t-SNE do not change the number of clusters. To better understand our results, we thus look into the plotted sentences and find the following:

  1. Sentences with multiple words in common are plotted close to each other.

  2. Sentences with fewer words in common are plotted further away from each other.

As a consequence of (1), we achieve good results for requirements that overlap in vocabulary (e.g. “As a home owner I want Room thermostat sensor so that The room is optimal temperature for an occupant” at (-58.033, 11.973) and “As a home occupant I want Room thermostats so that Protect the room temperature” at (-57.890, 12.276)). However, because of (2), sentences that express related requirements in different words are not clustered reliably (e.g. “As a home occupant I want music to be played when I get home so that it will help me relax” at (-24.767, 4.210) and “As a home owner I want music to play whenever I am in the kitchen so that I can be entertained while cooking or cleaning” at (53.929, 4.752)). We anticipated the latter finding because word vectors alone cannot capture the similarity of whole sentences, as mentioned in Section II-C2. Therefore, we cannot model the topics as desired. Nevertheless, this approach delivers deeper insight into the dataset and needs relatively little computation time, as the results are available within a few minutes.

Fig. 6: Results of approach A2: Word embeddings and PCA (plotted with t-SNE): (a) Self-trained word vectors, (b) Google News word vectors.

V-C Word Embeddings and Word Mover’s Distance

We achieve the best results in our third approach, using word2vec and WMD. Figure 7 shows the plotted distance matrices we created, as described in Section II-C2. Here, we can successfully distinguish clusters both spatially and content-wise. Using the self-trained word vectors, in Figure 7(a) we can see that the domains Entertainment (gray sentences around (0,0)) and Energy (stretching from (0,-45) to (10,57)) can be distinguished clearly. Also, a cluster predominantly consisting of Health requirements is apparent in the region from (0,20) to (70,45).

Fig. 7: Results of approach A3: Word embeddings and Word Mover’s Distance (plotted with t-SNE): (a) Self-trained word vectors, (b) Google News word vectors.

Judged by the domain categories only, the clustering with our self-trained word vectors seems to yield better results. But as manual inspection shows, the clustering based on the Google News vectors also brings new insights into the dataset: In Figure 7(b), we can see a much clearer demarcation between the clusters. Also, two new domains become apparent. Although the sentences seem unrelated at first (as indicated by the different label colors), the leftmost cluster, namely the area between (-65,-5) and (-40,40), mostly contains sentences related to parenting and children. Furthermore, the topmost cluster between (-30,55) and (0,70) contains requirement sentences about animals. These results show that the dataset may be partitioned into clusters other than the four domain-based clusters we initially anticipated.

VI Discussion

VI-A Limitations and Threats to Validity

The Crowd RE dataset contains requirements in the form of user stories. We assume that the structured form of user stories may facilitate any form of automated analysis (see [8, 21]). Although the formulations within the free-text parts of the user stories are quite different in terms of length and used words, we cannot say how the approaches would work when applied to unrestricted natural language requirements.

Even though the Crowd RE dataset is too large to process the requirements manually, it is relatively small for the application of automatic topic modeling techniques: LDA is a technique proven to work well on large documents. Short texts, in contrast, contain very limited word co-occurrence information. This prevents LDA from working well on short texts [36], as we have also seen in our results.

The word2vec approach is impacted by the text length as well, since the document similarity cannot be accurately measured under BoW representations due to the extreme sparseness of short texts [19]. Also, when working with word embeddings, in general more data (as opposed to simply relevant data) creates better embeddings [17]. As already mentioned, to benchmark word2vec, Mikolov et al. trained their tool on the Google News dataset with 100 billion words, so a dataset 2000 times the size of our dataset. This suggests that better results may be possible using the same techniques on a larger dataset. However, the meaning of requirements usually depends on the considered application domain [10]. Therefore, domain-specific word embeddings may lead to better results [9]. In our case, however, we achieved the best results using a pretrained general-purpose model. This may indicate that the advantages of domain-specific word embeddings are outweighed by the disadvantages of the small dataset. For the future, it may be interesting to use domain-specific word embeddings trained on larger datasets (e.g. Wikipedia or newspaper articles on home automation).

All of our approaches associate a user story with a point in a high-dimensional vector space. We applied dimensionality reduction techniques (PCA and t-SNE) to be able to compare the results of the approaches visually. Dimensionality reduction techniques provide an approximation of the original data, which may result in information being lost in the process [42]. A more precise analysis of clusters may be possible by clustering the points directly in the high-dimensional space (e.g. by applying k-means). We did that for some of our results and found that the clusters generated by k-means are similar to the visually distinguishable clusters in the 2-dimensional plots.
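Clustering directly in the original space can be sketched with a plain implementation of Lloyd's k-means algorithm. The code below is an illustrative sketch, not the implementation used in the study; the initial centroids are passed in explicitly to keep the run deterministic, and the 3-dimensional toy points stand in for the high-dimensional requirement vectors.

```python
from math import dist


def kmeans(points, centroids, iterations=20):
    """Plain Lloyd's algorithm; `centroids` are the initial centers."""
    centroids = [tuple(c) for c in centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        labels = [min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids


# Toy data: two well-separated groups (illustrative only).
points = [(0.0, 0.0, 0.1), (0.2, 0.1, 0.0), (5.0, 5.0, 5.1), (5.2, 4.9, 5.0)]
labels, centroids = kmeans(points, centroids=[points[0], points[2]])
```

The resulting labels can then be compared with the groups that are visible in the 2-dimensional t-SNE plot, as described above.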

For a proper and quantified evaluation of our results, manual work would be needed. To rate our findings, the dataset has to be labeled properly. Consider the following example:

  1. “As a home occupant I want music to be played when I get home so that it will help me relax” (Health)

  2. “As a home owner I want music to play whenever I am in the kitchen so that I can be entertained while cooking or cleaning” (Energy)

With our word2vec & WMD approach, these sentences are plotted nearby, both located inside the central Entertainment cluster in Figure 7(a). We cannot say that RE1 is surely assigned to the wrong domain; however, we consider a relationship to the Entertainment cluster to be equally valid. Also, for RE2 the Energy domain may have been selected accidentally, as the domains are next to each other in the select box of the form the crowd workers used when they created the requirements. When manually reviewing the dataset to fix the labels, our results could also be improved through generally cleaning the dataset. Cleaned datasets have a much higher impact on the training results of ML models than the optimization of hyperparameters [7, 16].

VI-B Future Work

Besides cleaning the data, future work can address cross-validation and performance improvements. Li et al. also created a classifier using WMD [19]. Using their approach, one could create clusters on the Crowd RE dataset and compare the findings with our results. Regarding performance improvements, the calculation of the WMD matrix is relatively time-consuming. Wu et al. propose a different distance measure for document clustering, which, compared to the WMD, “can achieve much lower time complexity with the same accuracy” [44]. In a new approach called Word Mover’s Embeddings, Wu et al. also use pretrained word embeddings and were able to improve accuracy and reduce the computational effort when they tested the approach on several benchmark text classification datasets [43].

Finally, our work may be used for input validation in a web service in future attempts to crowd-source user requirements. By continuously learning and storing the word vectors for new requirements, it would be possible to suggest, based on WMD, existing sentences that are similar to the one a crowd worker is about to enter. If the crowd worker sees that their submission overlaps with an existing sentence, they could up-vote the existing sentence instead of submitting their own. This would not only avoid duplication, but also help in data-driven RE to identify frequently requested requirements without the need for additional data processing.
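The suggestion step could be sketched as follows. This is a hedged illustration under stated assumptions, not an implementation of any existing service: the toy embedding table `EMB`, the helper names (`nbow`, `wmd`, `suggest_similar`), and the example stories are hypothetical. WMD is solved here as a small transportation linear program with SciPy's `linprog`; words missing from the embedding table are simply ignored.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy embeddings for illustration only; a real system would
# look words up in pretrained vectors such as word2vec.
EMB = {
    "music":  np.array([1.0, 0.0, 0.0]),
    "songs":  np.array([0.9, 0.1, 0.0]),
    "play":   np.array([0.0, 1.0, 0.0]),
    "start":  np.array([0.1, 0.9, 0.0]),
    "lights": np.array([0.0, 0.0, 1.0]),
    "dim":    np.array([0.0, 0.2, 0.9]),
}

def nbow(tokens):
    """Normalized bag-of-words over the in-vocabulary words of a document."""
    words = sorted({t for t in tokens if t in EMB})
    if not words:
        raise ValueError("document has no in-vocabulary words")
    counts = np.array([tokens.count(w) for w in words], dtype=float)
    return words, counts / counts.sum()

def wmd(doc_a, doc_b):
    """Word Mover's Distance between two token lists, solved as a linear program."""
    words_a, d_a = nbow(doc_a)
    words_b, d_b = nbow(doc_b)
    n, m = len(words_a), len(words_b)
    # Cost of moving word i to word j: Euclidean distance between embeddings.
    cost = np.array([[np.linalg.norm(EMB[a] - EMB[b]) for b in words_b]
                     for a in words_a])
    # The flow matrix is flattened row-major: rows sum to d_a, columns to d_b.
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):
        A_eq[n + j, j::m] = 1.0
    b_eq = np.concatenate([d_a, d_b])
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return float(res.fun)

def suggest_similar(draft, existing, top_k=2):
    """Return the top_k (distance, story) pairs closest to the draft under WMD."""
    scored = [(wmd(draft.lower().split(), story.lower().split()), story)
              for story in existing]
    return sorted(scored)[:top_k]

existing_stories = ["play music", "dim lights"]
suggestions = suggest_similar("start songs", existing_stories, top_k=1)
print(suggestions)
```

A web form could call `suggest_similar` while the crowd worker is typing and offer the closest stored sentences for up-voting. Since each comparison solves a linear program, a larger collection would likely need the cheaper distance measures mentioned above or some form of pre-filtering.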

VII Conclusion

Acquiring requirements and requirements-related information from crowd workers facilitates a user-centered RE process and enables engineers to consider requirements from a broad and heterogeneous set of potential users [12]. However, crowd-sourced information and information from other user feedback platforms is raw and unstructured. Automatic techniques are essential to preprocess, filter, and analyze the large amount of gathered information. In this paper, we have proposed and compared three approaches for clustering crowd-sourced requirements given in the form of user stories: a “classical” approach based on Latent Dirichlet Allocation and two approaches based on similarity measures in vector space models generated from different word embeddings. To the best of our knowledge, a combination of word embeddings with Word Mover’s Distance as distance measure has not been used for requirements clustering before.

Our main reference for evaluation was a mapping of user stories to one of five domains, which was defined by the authors of the user stories. In our evaluation, a combination of a vector space model based on a pretrained word embedding (word2vec) and WMD as distance measure yielded the most interesting results. Most interesting means that the approach produced a reasonable number of clusters with good overlap to the original domains. In some sample cases, the clustering pointed to potential misclassifications by the authors.


  • [1] A. Agrawal, W. Fu, and T. Menzies (2018) What is wrong with topic modeling? And how to fix it using search-based software engineering. Information and Software Technology 98, pp. 74 – 88. External Links: Document Cited by: §II-B, §III.
  • [2] H. U. Asuncion, A. U. Asuncion, and R. N. Taylor (2010) Software traceability with topic modeling. In ACM/IEEE International Conference on Software Engineering (ICSE), New York, NY, USA, pp. 95–104. External Links: Document Cited by: §III.
  • [3] A. Barua, S. W. Thomas, and A. E. Hassan (2014) What are developers talking about? An analysis of topics and trends in Stack Overflow. Empirical Software Engineering 19 (3), pp. 619–654. External Links: Document Cited by: §III.
  • [4] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin (2003) A neural probabilistic language model. J. Mach. Learn. Res. 3, pp. 1137–1155. Cited by: §III.
  • [5] D. M. Blei (2003) Latent Dirichlet Allocation. Journal of Machine Learning Research 3, pp. 30 (en). Cited by: §II-B.
  • [6] N. Chen, J. Lin, S. C. H. Hoi, X. Xiao, and B. Zhang (2014) AR-miner: Mining informative reviews for developers from mobile app marketplace. In International Conference on Software Engineering (ICSE), New York, NY, USA, pp. 767–778. External Links: Document Cited by: §III.
  • [7] X. Chu, I. F. Ilyas, S. Krishnan, and J. Wang (2016) Data cleaning: overview and emerging challenges. In International Conference on Management of Data (SIGMOD), pp. 2201–2206. External Links: Document Cited by: §VI-A.
  • [8] F. Dalpiaz, I. V. D. Schalk, S. Brinkkemper, F. B. Aydemir, and G. Lucassen (2019) Detecting terminological ambiguity in user stories: Tool and experimentation. Information and Software Technology (IST) 110, pp. 3–16. Cited by: §VI-A.
  • [9] A. Ferrari and A. Esuli (2019) An NLP approach for cross-domain ambiguity detection in requirements engineering. Automated Software Engineering 26 (3), pp. 559–598. External Links: Document Cited by: §VI-A.
  • [10] A. Ferrari (2018) Natural language requirements processing: from research to practice. In Proceedings of the 40th International Conference on Software Engineering Companion Proceedings, pp. 536–537. External Links: ISBN 978-1-4503-5663-3, Document Cited by: §VI-A.
  • [11] L. V. Galvis Carreño and K. Winbladh (2013) Analysis of user comments: An approach for software requirements evolution. In International Conference on Software Engineering (ICSE), pp. 582–591. Cited by: §III.
  • [12] E. C. Groen, N. Seyff, R. Ali, F. Dalpiaz, J. Doerr, E. Guzman, M. Hosseini, J. Marco, M. Oriol, A. Perini, and M. Stade (2017) The crowd in requirements engineering: the landscape and challenges. IEEE Software 34 (2), pp. 44–52. Cited by: §I, §VII.
  • [13] E. Guzman and W. Maalej (2014) How do users like this feature? a fine grained sentiment analysis of app reviews. In IEEE International Requirements Engineering Conference (RE), pp. 153–162. Cited by: §III.
  • [14] A. Hindle, C. Bird, T. Zimmermann, and N. Nagappan (2012) Relating requirements to implementation via topic analysis: do topics extracted from requirements make sense to managers and developers?. In 28th IEEE International Conference on Software Maintenance (ICSM), pp. 243–252. Cited by: §III.
  • [15] M. Z. Kolpondinos and M. Glinz (2019) GARUSO: a gamification approach for involving stakeholders outside organizational reach in requirements engineering. Requirements Engineering. Cited by: §I.
  • [16] S. Krishnan and J. Wang (2016-06) Data cleaning: A statistical perspective - overview and challenges part 2. Note: ACM SIGMOD/PODS Conference Cited by: §VI-A.
  • [17] M. J. Kusner, Y. Sun, N. I. Kolkin, and K. Q. Weinberger (2015) From word embeddings to document distances. In International Conference on Machine Learning (ICML), pp. 957–966. Cited by: §II-C2, §VI-A.
  • [18] J. Leskovec, A. Rajaraman, and J. D. Ullman (2014) Data Mining. In Mining of Massive Datasets, Cited by: §IV-C.
  • [19] C. Li, J. Ouyang, and X. Li (2019) Classifying extremely short texts by exploiting semantic centroids in word mover’s distance space. In The World Wide Web Conference (WWW), pp. 939–949. External Links: Document Cited by: §VI-A, §VI-B.
  • [20] C. Li, H. Wang, Z. Zhang, A. Sun, and Z. Ma (2016) Topic modeling for short texts with auxiliary word embeddings. In International ACM Conference on Research and Development in Information Retrieval (SIGIR), New York, NY, USA, pp. 165–174. External Links: Document Cited by: §III.
  • [21] G. Lucassen, M. Robeer, F. Dalpiaz, J. M. E. M. van der Werf, and S. Brinkkemper (2017) Extracting conceptual models from user stories with visual narrator. Requirements Engineering 22 (3), pp. 339–358. Cited by: §VI-A.
  • [22] W. Maalej, M. Nayebi, T. Johann, and G. Ruhe (2016) Toward data-driven requirements engineering. IEEE Software 33 (1), pp. 48–54. Cited by: §I.
  • [23] W. Maalej, M. Nayebi, and G. Ruhe (2019) Data-driven requirements engineering - An update. In International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), pp. 289–290. External Links: Document Cited by: §I.
  • [24] L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605. Cited by: §IV-C.
  • [25] A. Menkveld, S. Brinkkemper, and F. Dalpiaz (2019) User story writing in crowd requirements engineering: the case of a web application for sports tournament planning. In IEEE 27th International Requirements Engineering Conference Workshops (REW), pp. 174–179. Cited by: §I.
  • [26] M. Mhatre, D. Phondekar, P. Kadam, A. Chawathe, and K. Ghag (2017) Dimensionality Reduction for Sentiment Analysis using Pre-processing Techniques. In Proceedings of the IEEE 2017 International Conference on Computing Methodologies and Communication (ICCMC), pp. 16–21. External Links: ISBN 978-1-5090-4890-8 Cited by: §IV-B.
  • [27] T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013-09) Efficient Estimation of Word Representations in Vector Space. Proceedings of the International Conference on Learning Representations. Note: arXiv: 1301.3781 Cited by: §II-B, §II-C1, §II-C1, §III, §IV-C.
  • [28] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems 26. Cited by: §IV-D.
  • [29] T. Mikolov, W. Yih, and G. Zweig (2013) Linguistic regularities in continuous space word representations. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pp. 746–751. Cited by: §II-C1, §II-C1, §II-C.
  • [30] P. K. Murukannaiah, N. Ajmeri, and M. P. Singh (2016) Acquiring creative requirements from the crowd: understanding the influences of personality and creative potential in crowd RE. In IEEE International Requirements Engineering Conference (RE), pp. 176–185. External Links: Document Cited by: §I, §II-A.
  • [31] P. K. Murukannaiah, N. Ajmeri, and M. P. Singh (2017) Toward automating crowd RE. In IEEE International Requirements Engineering Conference (RE), pp. 512–515. External Links: Document Cited by: §I, §II-A.
  • [32] L. Niu and X. Dai (2015-06) Topic2Vec: Learning Distributed Representations of Topics. arXiv:1506.08422 [cs] (en). Note: arXiv: 1506.08422 Cited by: §II-B.
  • [33] M. Oriol, M. Stade, F. Fotrousi, S. Nadal, J. Varga, N. Seyff, A. Abello, X. Franch, J. Marco, and O. Schmidt (2018) FAME: Supporting continuous requirements elicitation by combining user feedback and monitoring. In IEEE International Requirements Engineering Conference (RE), pp. 217–227. External Links: Document Cited by: §I.
  • [34] F. Palomba, M. Linares-Vásquez, G. Bavota, R. Oliveto, M. Di Penta, D. Poshyvanyk, and A. De Lucia (2015) User reviews matter! Tracking crowdsourced reviews to support evolution of successful apps. In IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 291–300. External Links: Document Cited by: §I.
  • [35] J. Qiang, P. Chen, T. Wang, and X. Wu (2016-09) Topic Modeling over Short Texts by Incorporating Word Embeddings. arXiv:1609.08496 [cs] (en). Note: arXiv: 1609.08496 Cited by: §II-C1, §III.
  • [36] X. Quan, C. Kit, Y. Ge, and S. J. Pan (2015) Short and sparse text topic modeling via self-aggregation. In International Conference on Artificial Intelligence (IJCAI), pp. 2270–2276. Cited by: §VI-A.
  • [37] M. T. Ribeiro, S. Singh, and C. Guestrin (2016) ‘Why should I trust you?’: explaining the predictions of any classifier. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), New York, NY, USA, pp. 1135–1144. External Links: Document Cited by: §V-B.
  • [38] G. Salton and C. Buckley (1988) Term-weighting approaches in automatic text retrieval. Information processing & management 24 (5), pp. 513–523. Cited by: §IV-C.
  • [39] Y. A. Solangi, Z. A. Solangi, S. Aarain, A. Abro, G. A. Mallah, and A. Shah (2018-11) Review on natural language processing (NLP) and its toolkits for opinion mining and sentiment analysis. In 2018 IEEE 5th International Conference on Engineering Technologies and Applied Sciences (ICETAS), pp. 1–4. External Links: Document Cited by: §IV-B.
  • [40] C. Stanik, M. Haering, and W. Maalej (2019) Classifying multilingual user feedback using traditional machine learning and deep learning. In IEEE International Requirements Engineering Conference Workshops (REW), pp. 220–226. External Links: Document Cited by: §I.
  • [41] Z. Tong and H. Zhang (2016) A text mining research based on LDA topic modelling. In International Conference on Computer Science, Engineering and Information Technology (CCSEIT), pp. 21–22. Cited by: §III.
  • [42] S. Wold, K. Esbensen, and P. Geladi (1987) Principal component analysis. Chemometrics and Intelligent Laboratory Systems 2 (1), pp. 37–52. External Links: Document Cited by: §VI-A.
  • [43] L. Wu, I. E. Yen, K. Xu, F. Xu, A. Balakrishnan, P. Chen, P. Ravikumar, and M. J. Witbrock (2018) Word mover’s embedding: from Word2Vec to document embedding. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 4524–4534. External Links: Document Cited by: §VI-B.
  • [44] X. Wu and H. Li (2017) Topic mover’s distance based document classification. In IEEE International Conference on Communication Technology (ICCT), pp. 1998–2002. External Links: Document Cited by: §III, §VI-B.
  • [45] D. Zowghi and C. Coulin (2005) Requirements elicitation: a survey of techniques, approaches, and tools. In Engineering and Managing Software Requirements, A. Aurum and C. Wohlin (Eds.), pp. 19–46. External Links: Document Cited by: §I.