Estimating software effort is a challenging and important activity in the software development process. Other crucial aspects of a project depend on it, mainly the achievement of time and budget constraints, which directly affect the quality of the software product developed. The success of any particular software project depends heavily on how accurate its effort estimates are (Idri et al., 2016b). An accurate estimate assists in contract negotiations, scheduling, synchronization of project activities, and efficient allocation of resources.
The importance of accurate estimates has been explored by studies in the field of software engineering (SE) published in recent years, such as (Kazemifard, 2011), (Azzeh, 2011), (Bardsiri et al., 2013), (Hamdy, 2014), (Amazal et al., 2014), (Moharreri et al., 2016) and (Ionescu, 2017). These studies explore computational techniques, individually or in combination, seeking better levels of precision for effort estimation techniques.
Among the existing classifications for software effort estimation techniques, this article focuses on non-algorithmic models, which do not use predefined metrics such as function points and use case points. These models make use of Machine Learning techniques (e.g. linear regression, neural networks) and are also called models by analogy (Shepperd et al., 1996). According to the authors, this method is in some ways a systematic form of expert judgment, since experts already look for similar situations to inform their opinions. Manikavelan and Ponnusamy (Manikavelan and Ponnusamy, 2014) state that these models are built from historical project data to generate custom models.
According to Idri and Abran (Idri et al., 2016a), the accuracy of the estimate is improved when analogy is combined with Artificial Intelligence (AI) techniques to generate estimates. Fuzzy systems, decision trees, neural networks, and collaborative filtering are some examples of techniques that improve Analogy-based Software Effort Estimation (ABSEE). Idri et al. (Idri et al., 2015) reinforce that analogy-based software effort estimation models are reproducible and closely resemble human reasoning because they are based on experience gained from past projects.
ABSEE can fit either agile or traditional models, as long as the estimation approach is based on data and previous team experience. One challenge in using these techniques, especially in agile models, is the lack of data about projects and their requirements in the early stages of the development process. The basic specification of software requirements used in these models is the user story, which describes user needs and is usually written informally (Cohn, 2005).
Assigning effort estimates to software requirements, especially in the early stages of development, is critical because it depends on the empirical expertise of those involved (e.g. project managers and systems analysts), since complete records of historical data on projects and requirements are not always available. This remains a limitation for most companies, as in most cases there are only textual requirements that briefly describe user needs. This makes the task very complex and sometimes even unfeasible.
The difficulty with textual requirements (e.g. use cases, user stories) is related to the informality intrinsic to many software development processes, which makes them hard to use as a basis for predicting software costs. This limitation occurs because these texts include diverse domain-specific information, such as natural language mixed with source code and numeric data (e.g. IP addresses, artifact identification codes). A very common case is that of different words used in the same context; these should be considered similar (synonymy) because their contexts are similar. Conversely, identical words applied in different contexts (polysemy, ambiguity) should not be represented in the same way.
On the other hand, in the AI world, especially in the Natural Language Processing (NLP) field, word embedding methods mainly aim to capture the semantics of a given word in a specific context. This method allows words to be represented densely and with low dimensionality, facilitating machine learning tasks that use textual characteristics. Breakthroughs in training word embedding models for a variety of purposes began in recent years with the emergence of Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014), enabling models to learn from a very large corpus. Thus, contextual representations through embedding models have been very useful in identifying context-sensitive similarity (Huang et al., 2012), word sense disambiguation (Bordes et al., 2012), (Chen et al., 2014), word sense induction (Kågebäck et al., 2015), lexical substitution through a generic embedding model (Melamud et al., 2016b), and sentence completion (Liu et al., 2015), among others.
Some studies have been conducted specifically in the field of SE, such as recommending domain-specific topics from Stack Overflow question tags (Chen et al., 2016), recommending similar bugs (Yang et al., 2016), sentiment analysis in SE (Calefato et al., 2018), an embedding model using Word2Vec for the SE domain (Efstathiou et al., 2018), and ambiguity detection in requirements engineering (Ferrari and Esuli, 2019), among others.
Regarding the generation of software effort estimates specifically, the use of embeddings is highlighted in the studies by (Ionescu, 2017) and (Choetkiertikul et al., 2018). In the first case, Ionescu (2017) explores the use of word embeddings generated by a context-less approach (Word2Vec), aggregated with design attributes and textual metrics (e.g. TF-IDF), from which positive results were obtained. Choetkiertikul et al. (Choetkiertikul et al., 2018) seek to infer estimates from the text of user stories, which are given as input to a deep learning architecture with an embedding layer as input. However, these initiatives face two main limitations that are difficult to overcome in the specific field of SE:
Domain-specific terms change meaning according to the context in which they are used: in this article, domain-specific terms are a set of words common to a specific area (e.g. software engineering, medicine), among which there are strong semantic relationships. Some studies have sought to develop resources to facilitate textual representation in software engineering (Tian et al., 2014; Efstathiou et al., 2018), but none offers a complete solution for representing context. Moreover, textual requirements are usually short, which makes it inherently difficult to identify the context to which they refer. This makes it hard to extract significant characteristics from the analyzed texts, hampering the inference process.
Lack of domain-specific SE data to train smart models: this aspect makes it very difficult to train deep neural networks, which tend to overfit these small training data sets and fail to generalize. This reality is no different for textual software requirements and becomes more critical when these texts must be accompanied by labels representing the effort required to implement them.
To solve both problems, this work explores the use of contextualized pre-trained embedding models (e.g. BERT (Devlin et al., 2019)) to infer an estimate of software effort from textual requirements. Pre-trained models embody the concept of transfer learning (Howard and Ruder, 2018), which allows both problems to be addressed: a model can be trained on a generic dataset (addressing the lack of data) and then adjusted (addressing the specific meaning of words in different contexts).
Although there has been a lot of research on the application of word embeddings in various areas, so far only (Ionescu, 2017) and (Choetkiertikul et al., 2018) have sought to apply embeddings to software effort estimation. No research has explored the application of contextualized pre-trained embedding models to the ABSEE task.
This article therefore presents a model for inferring effort estimates by analogy, both for existing and new software requirements, using contextualized pre-trained embedding models and taking as input only textual requirements (e.g. user stories) generated in the initial stage of development. This approach was named Software Effort Estimation Embedding Model (SM). With this model, greater precision is sought for this activity, which is currently carried out based on empirical human experience and is therefore quite subjective.
As noted by Howard and Ruder (Howard and Ruder, 2018), even though deep learning models have achieved the state of the art in many NLP tasks, these models are trained from scratch, requiring large datasets and days to converge. Pre-trained embedding models make it easier to perform NLP tasks related to SE, without the need for training from scratch and at a low computational cost. According to the authors, this is possible through the fine-tuning technique, eliminating the need for a representative corpus. The fine-tuning approach introduces minimal task-specific parameters and is trained on the downstream tasks by simply fine-tuning all pre-trained parameters (Devlin et al., 2019). Most language representation models (e.g. Word2Vec, ELMo, OpenAI GPT) are context-less or unidirectional, i.e. each token considers only the words on its left or on its right as part of its context (Vaswani et al., 2017). This hampers fine-tuning approaches in which it is relevant to incorporate context bidirectionally, which is the reason for using BERT models. BERT is a contextual representation model that overcomes the unidirectionality constraint mentioned above, as further explained in Section 2.3.
Therefore, the approach also aims to infer software effort estimates using pre-trained embedding models, with and without fine-tuning on an SE-specific corpus. With the results of this article, we intend to answer the following research questions (RQ):
RQ1. Does a generically pre-trained word embedding model show results similar to those of a model pre-trained for software engineering?
RQ2. Would embedding models generated by context-less methods (i.e. Word2Vec) be as effective as models generated by contextualized methods (i.e. BERT)?
RQ3. Are pre-trained embedding models useful for text-based software effort estimation?
RQ4. Are pre-trained embedding models useful for text-based software effort estimation, both on new and existing projects?
RQ5. Are the results generalizable, so that effort estimates can be generated across projects or companies?
It is worth mentioning that, unlike the most similar approaches ((Ionescu, 2017), (Choetkiertikul et al., 2018)), the proposed approach aims to be generalizable, that is, it should allow estimates to be generated across projects and/or companies, both for new requirements and for existing ones (e.g. maintenance). In addition, similar approaches that perform the estimation process through text representation apply context-less embedding methods, which disregard the actual context of each word.
This article is organized as follows. Section 2 presents the background of software effort estimation and word embeddings (context-less and contextualized), followed by the related works in Section 3. Section 4 presents the theoretical aspects necessary for the proposed approach. Then, Section 5 presents the results of each step of the proposed method, ending with Section 6, where the initial research questions are answered before concluding and presenting future work.
In this section, we first introduce aspects related to software effort estimation (Section 2.1). We then present the concepts of word embedding and the context-less and contextualized paradigms (Sections 2.2 and 2.3).
2.1. Software Effort Estimation
Various classifications for software effort estimation models have been applied in the last decades, with small differences according to each author's point of view. According to Shivhare and Rath (Shivhare and Rath, 2014), software effort estimation models can be subdivided into algorithmic/parametric and non-algorithmic. The former use algorithmic models, applied to project attributes and/or requirements to calculate the estimate, presenting themselves as reproducible methods in place of non-algorithmic human expert methods (Kocaguneli et al., 2010). Examples of algorithmic models are COCOMO II (Boehm et al., 2000) and Function Point Analysis (Choi, 2012). Non-algorithmic models are based on Machine Learning techniques (e.g. linear regression, neural networks) and are also called models by analogy, which rely on historical data to generate custom models that can learn from this data (Manikavelan and Ponnusamy, 2014).
Chiu and Huang (Chiu and Huang, 2007) point out that ABSEE consists of identifying one or more historical projects similar to the target project and inferring the estimate from them. Following the same line of reasoning, Shepperd et al. (Shepperd et al., 1996) state that the basis of the ABSEE technique is to describe (in terms of several variables) the project to be estimated, and then to use this description to find other similar projects that have been completed. The known effort values for these completed projects can then be used to create an estimate for the new project.
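As a rough illustration of the analogy step described above, the following Python sketch retrieves the k most similar completed projects and averages their known efforts. The feature vectors, the effort values, the choice of Euclidean distance, and the function name `estimate_by_analogy` are illustrative assumptions, not the procedure of any of the cited studies.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def estimate_by_analogy(target, history, k=2):
    """Estimate effort as the mean effort of the k most similar past projects.

    `history` is a list of (feature_vector, known_effort) pairs.
    """
    if not history:
        raise ValueError("history must contain at least one completed project")
    # Rank completed projects by their distance to the target description.
    ranked = sorted(history, key=lambda item: euclidean(target, item[0]))
    nearest = ranked[:k]
    return sum(effort for _, effort in nearest) / len(nearest)


# Hypothetical completed projects: (features, known effort).
history = [
    ([1.0, 2.0], 10.0),
    ([1.5, 2.5], 14.0),
    ([8.0, 9.0], 60.0),
]
estimate = estimate_by_analogy([1.2, 2.1], history, k=2)
```

Averaging the neighbors' efforts is only one possible adaptation rule; a distance-weighted mean or the median would fit the same retrieval step.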
Therefore, ABSEE is classified as non-algorithmic. Idri and Abran (Idri et al., 2016b) also classify estimation by analogy as a machine learning technique. These authors further point out that machine learning models have gained significant attention for effort estimation purposes, as they can model the complex relationship between effort and software attributes (cost factors), especially when this relationship is not linear and does not appear to have any predetermined form. Analogy-based reasoning approaches have proven to be promising in the field of software effort estimation, and their use has increased among software researchers (Idri et al., 2016a).
Idri and Abran (Idri et al., 2016b) conducted a systematic review of the literature on ABSEE and found that these techniques outperform other prediction techniques. This conclusion is supported by most of the works selected in their mapping. Among the main advantages of ABSEE techniques is their similarity to human reasoning by analogy, which makes them easier to understand.
Thus, the estimation of software effort by analogy is perceived as a very appropriate technique when the input consists of requirements specifications in unstructured text format, as is the case in this research, since textual characteristics drawn from these specifications can be submitted as input to machine learning models (e.g. neural networks).
2.2. Word Embeddings
Unlike lexical dictionaries (e.g. WordNet), which basically consist of a thesaurus grouping words based on their meanings (Fellbaum, 2012) and which are usually built with human support, a word embedding model is made up of word representation vectors. Through these vectors, it is possible to identify semantic relationships between words in a given domain, based on the properties observed in a training corpus; the vectors are generated automatically (Mikolov et al., 2013).
Word embeddings are currently a strong trend in NLP. Word embedding models use neural network architectures to represent each word as a dense, low-dimensional vector, focusing on the relationships between words (Mikolov et al., 2013). Such vectors are used both directly, to calculate similarities between terms, and as a basis of representation for NLP tasks (e.g. text classification, entity recognition, sentiment analysis). Word embedding models emerged to overcome some limitations of the bag-of-words (BOW) model, which usually produces sparse, high-dimensional matrices.
Word2Vec, one of the most popular methods for generating word embeddings from a text corpus, is an unsupervised learning algorithm that automatically attempts to learn the relationship between words by grouping words with similar meanings into similar clusters (Raschka, 2015). In the Word2Vec model (Mikolov et al., 2013), a neural network is trained to represent each word in the vocabulary as an n-dimensional vector. The general idea is that the distance between two word vectors is smaller if the corresponding words are semantically more similar (or related). Pennington et al. (Pennington et al., 2014) add that, to generate these vectors, the method captures the distributional semantics and co-occurrence statistics for each word in the training corpus.
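To make the idea of vector closeness concrete, the sketch below compares toy word vectors with cosine similarity, a measure commonly used with word embeddings. The three-dimensional vectors are invented for illustration only; real Word2Vec vectors are learned from a corpus and typically have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Made-up vectors, NOT taken from any trained model.
toy_vectors = {
    "bug": [0.9, 0.1, 0.0],
    "defect": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9],
}

sim_related = cosine_similarity(toy_vectors["bug"], toy_vectors["defect"])
sim_unrelated = cosine_similarity(toy_vectors["bug"], toy_vectors["invoice"])
```

With these toy values, the pair meant to be semantically related scores much higher than the unrelated pair, which is the behavior a trained embedding model is expected to exhibit.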
The Word2Vec model internally implements two neural network-based approaches: Continuous Bag-of-Words (CBOW) and skip-gram. Both are word embedding models widely used in information retrieval tasks. The goal of the CBOW model is to predict the current word based on its context, while the skip-gram model predicts the surrounding words given the current word. In both cases, the model consists only of a single weight matrix (in addition to the word analyzed), which results in training capable of capturing semantic information (Mikolov et al., 2013). After training, each word is mapped to a low-dimensional vector. Results presented by the authors (Mikolov et al., 2013) show that words with similar meanings are much closer to each other than to words with different meanings. Generally speaking, the key concept of Word2Vec is to place words that share common contexts in the training corpus close to each other in the vector space.
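The difference between the two training objectives can be seen in the training examples each one derives from the same sentence. The following sketch only builds those examples; it does not train a network. The helper names and the window size of 1 are assumptions made for illustration.

```python
def context_window(tokens, index, window):
    """Words within `window` positions of tokens[index], excluding itself."""
    start = max(0, index - window)
    left = tokens[start:index]
    right = tokens[index + 1:index + 1 + window]
    return left + right


def cbow_examples(tokens, window=1):
    """CBOW: the context words are the input, the center word is the target."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]


def skipgram_examples(tokens, window=1):
    """Skip-gram: the center word is the input, each context word a target."""
    return [(tokens[i], ctx)
            for i in range(len(tokens))
            for ctx in context_window(tokens, i, window)]


tokens = ["user", "resets", "password"]
cbow = cbow_examples(tokens)
skipgram = skipgram_examples(tokens)
```

For the middle word, CBOW yields one example with two context words as input, whereas skip-gram yields two separate (center, context) pairs.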
Word2Vec, like other models (e.g. GloVe, FastText), is considered a context-less (static) method for generating pre-trained textual representations. This means that these models are limited in how they represent the context of words in a text, which makes sentence-level or even token-level fine-tuning tasks difficult. Also, according to their authors (Mikolov et al., 2013), these models are considered too shallow, as they represent each word by only one layer, and there is a limit to how much information they can capture. Finally, these models do not handle word polysemy, that is, the fact that the same word used in different contexts can have different meanings.
BERT is an innovative method, considered the state of the art in pre-trained language representation (Devlin et al., 2019). BERT models are considered contextualized or dynamic models, and have shown much-improved results in several NLP tasks (Dai and Le, 2015), (Peters et al., 2018), (Radford et al., 2018), (Howard and Ruder, 2018), such as sentiment classification, semantic textual similarity, and recognizing textual entailment.
This model originated from various ideas and initiatives aimed at textual representation that have emerged in the area of NLP in recent years, such as CoVe (McCann et al., 2017), ELMo (Peters et al., 2018), ULMFiT (Howard and Ruder, 2018), CVT (Clark et al., 2018), context2vec (Melamud et al., 2016a), the OpenAI transformer (GPT and GPT-2) (Radford et al., 2018) and the Transformer (Vaswani et al., 2017).
BERT is characterized as a dynamic method mainly because it relies on the Transformer, an attention-based architecture (Devlin et al., 2019), which allows the context of each word in a text to be analyzed individually, including checking whether each word has previously been used in the text with the same context. This allows the method to learn contextual relationships between words (or subwords) in a text.
BERT consists of several stacked Transformer layers (Vaswani et al., 2017) whose parameters are pre-trained on an unlabeled corpus such as Wikipedia and BooksCorpus (Zhu et al., 2015). It can be said that, for a given input sentence, BERT “looks left and right several times” and outputs a dense vector representation for each word. For this reason, BERT is classified as a deeply bidirectional model, because it learns two representations of each word, one from the right and one from the left, and this learning is repeated n times. These representations are concatenated to obtain a final representation to be used in future tasks.
The pre-training procedure adopted by BERT comprises two main tasks: masked language modeling (MLM) and next sentence prediction (NSP). In the MLM task, the authors argue (Devlin et al., 2019) that it is possible to predict a particular masked word from its context. For example, suppose we have the sentence "I love reading data science articles." and want to train a contextualized language model. In this case, "data" is replaced with "[MASK]", a token indicating that a word is missing, which gives "I love reading [MASK] science articles." The model is then trained to predict "data" as the missing token.
This technique aims to make the model learn the relationship between words, improving the level of learning and avoiding a possible “vicious cycle” in which the prediction of a word is based on the word itself. Devlin et al. (Devlin et al., 2019) masked 15% of the tokens.
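The masking step can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: tokens are split on whitespace rather than with BERT's WordPiece tokenizer, every selected position is replaced with "[MASK]" (the original procedure sometimes keeps or randomly replaces the selected token), and the function name `mask_tokens` is hypothetical.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace a random subset of tokens with [MASK].

    Returns the masked sequence and a dict {position: original token},
    which plays the role of the prediction targets.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    # Mask at least one token so that short sentences still yield a target.
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]
        masked[pos] = MASK
    return masked, labels


tokens = "I love reading data science articles".split()
masked, labels = mask_tokens(tokens)
```

Which position gets masked depends on the random seed; the model is trained to recover the entries of `labels` from the remaining, unmasked context.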
The NSP task is to learn the relationship between sentences. Similarly to MLM, given two sentences (A and B), the model must determine whether B is the sentence that actually follows A in the corpus or just a random sentence.
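A minimal sketch of how labeled sentence pairs for such a task could be built from an ordered list of sentences is shown below. The 50/50 split between true and random successors, the example sentences, and the function name `make_nsp_examples` are assumptions made for illustration.

```python
import random


def make_nsp_examples(sentences, seed=0):
    """Build (sentence_a, sentence_b, is_next) triples from ordered sentences."""
    rng = random.Random(seed)
    examples = []
    for i in range(len(sentences) - 1):
        # Indices that are neither sentence A itself nor its true successor.
        negatives = [j for j in range(len(sentences)) if j not in (i, i + 1)]
        if negatives and rng.random() < 0.5:
            j = rng.choice(negatives)
            examples.append((sentences[i], sentences[j], False))
        else:
            examples.append((sentences[i], sentences[i + 1], True))
    return examples


# Hypothetical corpus of consecutive sentences.
corpus = [
    "The user opens the login page.",
    "The user enters a password.",
    "The system validates the credentials.",
    "The dashboard is displayed.",
]
examples = make_nsp_examples(corpus)
```

Each triple carries a binary label, so the task reduces to sentence-pair classification.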
BERT thus combines both pre-training tasks (MLM and NSP), making it a task-independent model. Its authors provide models pre-trained on a generic corpus that can then be fine-tuned. This means that, instead of spending days on pre-training, only a few hours of fine-tuning are needed. According to the authors of BERT (Devlin et al., 2019), a new state of the art was achieved in all NLP tasks they attempted (e.g. Question Answering (QA) and Natural Language Inference (NLI)).
3. Related Works
In a systematic mapping focused on ABSEE, we found few studies that obtain the effort estimate from text, namely (Abrahamsson et al., 2011), (Hussain et al., 2013), (Moharreri et al., 2016), (Choetkiertikul et al., 2018), (Zhang et al., 2016), (Ionescu, 2017), (Ochodek, 2016) and (Ayyildiz and Koçyigit, 2016).
Most of these studies apply bag-of-words approaches that consider word-level features (e.g. tf, tf-idf, part-of-speech tags) treated individually, that is, based on quantitative and qualitative data, without employing specific knowledge about the text structure of the requirements and ignoring aspects of context. Only two studies ((Ionescu, 2017), (Choetkiertikul et al., 2018)) differ in this respect, as they apply word embedding models, but neither uses pre-trained embedding models.
Ionescu (Ionescu, 2017) proposed a machine learning-based method for estimating software development effort, using as input the text of requirements and project management metrics. The author applied an original statistical preprocessing method to obtain better results. First, a custom vocabulary is built using the standard deviation of the effort of the requirements in which each word appears in the training set. For each requirement, a percentage of its words is kept based on this statistic. The resulting requirements are concatenated with the available project metrics. A modified TF-IDF calculation is also used, producing numerical data that form a bag-of-words, which is used as input to a linear regression algorithm.
Choetkiertikul et al. (Choetkiertikul et al., 2018) proposed the use of deep learning. Two neural network models were combined into the proposed deep learning architecture: Long Short-Term Memory (LSTM) and the recurrent highway network. The model is trainable end to end from raw input data that has only gone through a preprocessing step. The model learns from the story points estimated in previous projects to predict the effort of new stories. This proposal (Choetkiertikul et al., 2018) uses context-less word embeddings as input to the LSTM layer. As input data, the title and description of the requirements report were combined into a single sentence, in which the description follows the title.
The embedding vectors generated in this first layer serve as input to the LSTM layer, which then generates a representation vector for the full sentence. It should be noted that this process of training the embedding layer and then the LSTM layer, in order to generate the embedding vectors for each sentence, is computationally expensive. For this reason, the authors pre-train these layers, and only then make these models available for use. The sentence vector is then fed into the recurrent highway network, which transforms the document vector several times before producing a final vector that represents each sentence. Finally, the sentence vectors undergo simple linear regression to predict their effort.
A possible bottleneck of this approach is the difficulty of feeding the model with new data. Such feedback would fine-tune the models, making them increasingly accurate. The difficulty arises because, with each new insertion into the dataset, the pre-training process, and consequently its cost, needs to be repeated. In addition, the method of Choetkiertikul et al. (Choetkiertikul et al., 2018) performs inter-project prediction, which is not repository independent.
Another aspect worth pointing out is that, because requirement texts do not usually have a structured form, more words may not represent more complexity and therefore greater effort (Zhang et al., 2016), (Hussain et al., 2013). Rather, it is the context that influences the effort most.
As a differential, our method aims to estimate effort in a generic way, that is, using a single requirements repository, independent of projects or repositories. A contextualized word embedding method (i.e. BERT) allows the context of each word to be learned deeply, addressing problems of polysemy and ambiguity. When the model is fed with new training data, the adjustment can be made in a few hours, which is computationally cheaper than generating a pre-trained model from scratch.
4. Construction Approach
The main objective of this research is to evaluate the efficacy of contextualized pre-trained embedding models for effort estimation based on requirement texts.
A requirement in this context can be a use case, a user story, or any software requirement, provided that the data is in text format and aligned with the target effort. In this paper, user stories serve as the input to the proposed model; each consists of its description and its effort, given in story points. The textual format of this data requires some basic preprocessing (e.g. removal of special characters and stop words) before use in the models.
It is important to note the nature of software requirement texts, which are usually written informally, that is, they do not have a standard format. In addition, these texts contain a series of very specific elements (e.g. links, numeric addresses, method names). Considering these attributes, we propose that the model automatically learn the characteristics of the original text.
For comparison purposes, pre-trained models generated from two language modeling approaches are applied: context-less and contextualized models. Figure 2 presents the steps that make up the overall architecture of the proposed model, which are described below.
Data collection and pre-processing: in this step, the data collection and preparation procedures are performed for later use during the feature learning step, which generates a context vector (i.e. numerical representation) for a given requirement text. For the proposed model, two corpora of texts containing software requirements are required: corp_SE and corpPret_SE (as shown in Table 2). One of them is the fine-tuning corpus, in which the texts are not labeled. The other corpus is used during the training and testing stages of the inference model, and each of its texts is labeled with its respective effort. The texts of both corpora go through basic pre-processing procedures (e.g. removal of special characters and stop words).
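The basic pre-processing mentioned above might look like the following Python sketch. The stop-word list is a tiny hypothetical sample, the regular expression keeps only lowercase letters, digits and whitespace, and the example user story is invented; none of these details are taken from the original pipeline.

```python
import re

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "the", "to", "of", "and", "in", "is", "as",
              "i", "so", "that", "my"}


def preprocess(text):
    """Lowercase, strip special characters, and drop stop words."""
    text = text.lower()
    # Replace anything that is not a letter, digit, or whitespace with a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return [tok for tok in text.split() if tok not in STOP_WORDS]


story = "As a user, I want to reset my password so that I can log in!"
tokens = preprocess(story)
```

Note that aggressive character stripping also breaks up links and method names, which the text identifies as typical elements of requirement texts, so the exact rules would need to be tuned to the corpus.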
Textual representation model: this step consists of applying deep learning methods to the textual characteristics (Bengio et al., 2012), aiming to generate the vector representation for each of the texts. These methods therefore require no manual activity to generate characteristics. To achieve this goal, methods to generate context-less embeddings (e.g. Word2Vec) and contextualized embeddings (e.g. BERT) are used. This step comprises the fine-tuning of the generic pre-trained models for both approaches, which are given as input to the inference model. For fine-tuning, an unlabeled corpus (corpPret_SE, according to Table 2) is applied together with generic pre-trained models for each embedding approach (word2vec_base and BERT_base, according to Table 3). As an output, two fine-tuned pre-trained models are generated: word2vec_SE and BERT_SE (according to Table 6). Then, the pre-trained and fine-tuned models are used to extract the textual representation of each requirement in the training and testing corpus, using appropriate pooling techniques (e.g. mean, sum) applied to the embeddings of each word. This textual representation is given by a matrix whose rows correspond to the samples of the training and test corpora and whose columns correspond to the dimensions of the respective embeddings (context-less and contextualized). The vector representation of each requirement is given as input to the ABSEE inference model, aiming to learn and infer new estimates.
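The pooling step described above can be sketched as follows: the word vectors of a requirement are combined by mean or sum into a single fixed-size vector, and stacking these vectors gives a samples-by-dimensions matrix. The two-dimensional embeddings and the helper names `pool` and `represent` are illustrative assumptions; in the described approach the word vectors would come from the Word2Vec or BERT models.

```python
def pool(word_vectors, mode="mean"):
    """Combine a list of equal-length word vectors into one vector."""
    if not word_vectors:
        raise ValueError("cannot pool an empty list of word vectors")
    dims = len(word_vectors[0])
    summed = [sum(vec[d] for vec in word_vectors) for d in range(dims)]
    if mode == "sum":
        return summed
    if mode == "mean":
        return [s / len(word_vectors) for s in summed]
    raise ValueError(f"unknown pooling mode: {mode}")


def represent(tokens, embeddings, mode="mean"):
    """Vector for one requirement; out-of-vocabulary tokens are skipped."""
    vectors = [embeddings[tok] for tok in tokens if tok in embeddings]
    return pool(vectors, mode)


# Made-up 2-dimensional embeddings for illustration only.
embeddings = {"reset": [1.0, 3.0], "password": [3.0, 5.0]}
corpus = [["reset", "password"], ["password"]]

# One row per requirement, one column per embedding dimension.
matrix = [represent(tokens, embeddings) for tokens in corpus]
summed = represent(["reset", "password"], embeddings, mode="sum")
```

Mean pooling keeps the vector scale independent of text length, while sum pooling lets longer texts produce larger vectors; which one is preferable is an empirical question.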
Inference model for ABSEE: the textual representation of the set of training and testing requirements is submitted as input to the inference model. This model is composed of a deep learning architecture that is quite simple when compared to VGG-type models (Simonyan and Zisserman, 2014) and ResNet (He et al., 2016), among others. It is important to note that the concept of deep learning is not related to the number of neural network layers that make up the architecture, but rather to the fact that this architecture performs deep learning of the text's characteristics through the feature learning process, which begins by representing the texts using an embedding model. The characteristics learned during this process are applied through the embedding layer of the deep learning architecture. Since the network input is a sequence of words (100 words are considered for each text), an LSTM layer processes that sequence, generating a representation for the sentence (that is, for each text). Subsequently, two dense layers with nonlinear activation functions are used (the dimensions of the hidden layers are 50 and 10, respectively), ending with a linear regression layer. This is an even smaller architecture than the one used in the work of (Choetkiertikul et al., 2018), excluding the role of recurrent networks (Graves, 2012), (Sak et al., 2014) and applying a single feedforward sequence after the representation layer. This is possible due to the feature learning methods applied in the previous step. A simplified architecture also aims not to mask the results generated from the expected inputs to the network.
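To illustrate the shape of the feedforward part of this architecture, the sketch below runs an untrained forward pass through two hidden layers of sizes 50 and 10 followed by a single linear output. It is a simplified illustration only: the embedding and LSTM layers are omitted and replaced by a fixed hypothetical sentence vector, the weights are random, the input dimension of 8 is arbitrary, and ReLU is assumed since the text specifies only "nonlinear activation functions".

```python
import random


def relu(values):
    """Element-wise ReLU, used here as an assumed nonlinear activation."""
    return [max(0.0, v) for v in values]


def dense(x, weights, bias):
    """Fully connected layer; `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def init_layer(n_in, n_out, rng):
    """Random small weights and zero biases (untrained)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_head(input_dim, hidden_dims=(50, 10), seed=0):
    """Layers: input -> 50 -> 10 -> 1 (linear regression output)."""
    rng = random.Random(seed)
    dims = [input_dim, *hidden_dims, 1]
    return [init_layer(dims[i], dims[i + 1], rng)
            for i in range(len(dims) - 1)]


def predict(head, sentence_vector):
    """Forward pass: ReLU on hidden layers, no activation on the output."""
    x = sentence_vector
    for weights, bias in head[:-1]:
        x = relu(dense(x, weights, bias))
    weights, bias = head[-1]
    return dense(x, weights, bias)[0]


head = build_head(input_dim=8)
effort = predict(head, [0.5] * 8)  # stand-in for an LSTM sentence vector
```

In practice such a head would be built and trained with a deep learning framework, with the weights learned by minimizing a regression loss on the labeled requirements.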
After inferring the estimates for each textual requirement in the training and testing sets, the performance evaluation metrics for the learning model were applied (see section 4.1) in order to analyze the feasibility of applying the developed model. This evaluation was carried out using applicable statistical and graphical techniques, always related to the real development environment.
4.1. Evaluation Metrics
The evaluation metrics measure model performance in terms of the distance between the test set values and the predicted values. The metrics selected have been recommended for the evaluation of regression-based software effort prediction models (Sarro et al., 2016), (Choetkiertikul et al., 2018), (Raschka, 2015). They are: Mean Absolute Error (MAE), Median Absolute Error (MdAE) and Mean Squared Error (MSE).
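The MAE is defined as:

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|\text{actual\_eff}_i - \text{estimated\_eff}_i\right|$$

Where N is the number of textual requirements (e.g. user stories) that make up the test set used to evaluate model performance, actual_eff_i is the actual effort measure, and estimated_eff_i is the estimated effort measure for a given textual requirement i. We also used the Median Absolute Error (MdAE), suggested as a metric that is more robust to large outliers (Choetkiertikul et al., 2018). MdAE is defined as:

$$\mathrm{MdAE} = \operatorname{median}_{1 \le i \le N}\left|\text{actual\_eff}_i - \text{estimated\_eff}_i\right|$$

The Mean Squared Error (MSE) metric was also applied:

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(\text{actual\_eff}_i - \text{estimated\_eff}_i\right)^2$$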
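The three metrics described in this section can be computed with a few lines of Python. The following is a minimal sketch using only the standard library; the function names and the sample story points are illustrative assumptions and are not taken from corp_SE.

```python
from statistics import mean, median


def absolute_errors(actual, estimated):
    """Per-requirement absolute errors |actual_i - estimated_i|."""
    if len(actual) != len(estimated):
        raise ValueError("actual and estimated must have the same length")
    return [abs(a - e) for a, e in zip(actual, estimated)]


def mae(actual, estimated):
    """Mean Absolute Error."""
    return mean(absolute_errors(actual, estimated))


def mdae(actual, estimated):
    """Median Absolute Error, less sensitive to large outliers."""
    return median(absolute_errors(actual, estimated))


def mse(actual, estimated):
    """Mean Squared Error, which penalizes large errors more strongly."""
    return mean(err ** 2 for err in absolute_errors(actual, estimated))


# Hypothetical story points for five requirements (illustration only).
actual_eff = [1, 3, 5, 8, 40]
estimated_eff = [2, 3, 3, 5, 20]

print(mae(actual_eff, estimated_eff))
print(mdae(actual_eff, estimated_eff))
print(mse(actual_eff, estimated_eff))
```

In this toy sample the single large miss (40 against 20) dominates the MAE and MSE while leaving the MdAE almost unaffected, which is the robustness property that motivates reporting MdAE.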
4.2. Data Collection and Pre-Processing
In order to obtain and prepare the data that make up the training and testing corpus and the fine-tuning corpus, the data collection and pre-processing step was performed using NLP APIs, based on models previously established in the literature. These steps precede the process of representing textual characteristics, that is, the generation of the context vector for each requirement.
Thus, in order to create the training and testing data set, a corpus specific to the software engineering context (corp_SE) was used, composed of textual requirements, more specifically user stories (Choetkiertikul et al., 2018), labeled according to their respective development efforts. It is important to highlight that, despite being referred to as user stories, the requirement texts do not have a standard structure. Regarding the effort attributed to each requirement, it is worth mentioning that no single measurement scale was adopted (e.g. Fibonacci). The corp_SE (Table 2) is considered by the authors to be the first data set where the focus is on the level of requirements (e.g. user stories) and not just on the project level, as in most data sets available for SE research.
The requirement texts, as well as the effort assigned to each of them, were obtained from large open source projects hosted in project management systems (e.g. Jira), totaling 23,313 requirements (Table 1), which were initially made available by (Porru et al., 2016). Subsequently, (Choetkiertikul et al., 2018) used the same database to carry out their research, aiming to estimate software effort by analogy. Despite the difficulty in obtaining the real effort to implement a software requirement, the authors claim to have been able to obtain the implementation time based on the situation (status) of the requirement. Thus, the effort was measured from the moment when the status was set to “in progress” until the moment when it was changed to “resolved”. (Choetkiertikul et al., 2018) then applied two statistical tests (Spearman’s and Pearson correlation) (Zwillinger and Kokoska, 1999), which suggested a correlation between the story points and the real effort. Therefore, this same database was used in the research proposed in this paper. It is known that these story points were estimated by human teams and, therefore, may contain biases and, in some cases, may not be accurate, which may cause some level of inaccuracy in the models.
|14||Talend Data Quality||1381|
Typically, user stories are measured on a scale based on the Fibonacci series (Scott and Marketos, 2014), as used in Planning Poker (e.g. 1, 2, 3, 5, 8, 13, 21, 40, 100) (Cohn, 2005). As there is no standardized use of this scale among the projects used to create corp_SE, there was no way of approximating the story points provided to a common scale. Therefore, 100 possible predictions were considered, distributed over the corp_SE, some of which do not occur, as can be seen in the histogram of Figure 4. In this way, the effort estimate was treated as a regression problem.
|corp_SE||Contains 23,313 user stories and consists of 16 large open source projects in 9 repositories (Apache, Appcelerator, DuraSpace, Atlassian, Moodle, Lsstcorp, Mulesoft, Spring and Talendforge).||Train and test||YES|
|corpPret_SE||It consists of more than 290 thousand texts of software requirements of different projects (e.g. Apache, Moodle, Mesos).||Fine-tuning||NO|
For the fine-tuning process, which is part of the proposed model, a corpus of texts specific to software engineering, the corpPret_SE (shown in Table 2), is used. It is not labeled, so the training carried out is unsupervised. It is composed of texts with requirement specifications, obtained from open source repositories according to the procedure described by (Choetkiertikul et al., 2018).
While exploring the data available in corp_SE, some relevant aspects that interfere with the inference model’s settings were observed. The histogram of Figure 3 allows for one to evaluate the maximum number of words to be considered per text. The average number of words per text, accompanied by its standard deviation, is .
Figure 4 shows the frequency of distribution of the effort size in the training and testing database (corp_SE). It can be seen that most of the samples have smaller efforts (between 1 and 8 points per story).
The k-fold cross-validation method was applied in order to partition the data set used to carry out the experiments (corp_SE). The number of equal subsets (nsplit), represented by k, was set to 10. In each iteration, the data from the corp_SE were divided into a training and validation set (90% of texts) and a test set (10% of texts). For each subset of the data, the mean and standard deviation of the performance evaluation metrics described in section 4.1 were obtained.
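The partitioning described above can be sketched in plain Python. This is a minimal illustration of 10-fold cross-validation over 23,313 samples, not the code used in the experiments; the function name, the shuffling and the seed are assumptions.

```python
import random


def k_fold_indices(n_samples, k=10, seed=42):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumption: shuffle before splitting
    # Fold sizes differ by at most one sample when n_samples % k != 0.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# corp_SE contains 23,313 requirements; each fold holds out about 10% for testing.
folds = list(k_fold_indices(23313, k=10))
for train_idx, test_idx in folds[:2]:
    print(len(train_idx), len(test_idx))
```

Every requirement appears in exactly one test fold, so the mean and standard deviation of a metric over the ten folds summarize performance on the whole corpus.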
4.3. Textual Representation Model
The purpose of the procedures described in this section is to generate models to represent the texts that make up the training and test corpus. These representation models are obtained from the generic and fine-tuned pre-trained embedding models, which must consider the diversity of existing contexts. The representation models (see Figure 2) then serve as input to the proposed sequential architecture.
To perform the experiments, two pre-trained generic word embeddings models were applied, one using the context-less approach (Word2Vec) and the other the contextualized approach (BERT).
Thus, for the context-less approach, a pre-trained model called word2vec_base was used, the specifications of which are shown in Table 3. As a contextualized model, a generic BERT model (BERT_base uncased) was used, which had previously been pre-trained and made available by its authors (Devlin et al., 2019) for free use in NLP tasks, as specified in Table 3.
|word2vec_base||Trained on a corpus from Wikipedia using the Word2Vec (Mikolov et al., 2013) algorithm. The following hyperparameters were used: number of dimensions of the hidden layer = 100; method applied to the learning task = cbow.|
|BERT_base||BERT_base uncased: 12 layers, hidden size of 768, 12 attention heads, 110 million parameters. The uncased specification means that the text is converted to lower case before WordPiece tokenization and that any accent marks are removed. This model was trained on English texts (Wikipedia) in lower case.|
The BERT_base model, as well as the other pre-trained BERT models, offers 3 components (Devlin et al., 2019):
Then, these two generic models go through a fine-tuning process, as presented in the section 4.4.
It is worth noting that fine-tuning takes an embedding model that was pre-trained in an unsupervised way on a generic dataset and adjusts it, that is, retrains it on a known data set that is specific to the area of interest. In this case, the fine-tuning was performed on the generic models word2vec_base and BERT_base, using the corpus corpPret_SE (shown in Table 2).
A fine-tuning of the generic model word2vec_base (Figure 5) was performed using specific methods for this purpose provided by the Gensim library in the Python language. This process generated the word2vec_SE model. This adjusted model was used to generate the average representation (see section 4.5) for each requirement text of the training and testing corpus (corp_SE).
The fine-tuning process of the pre-trained BERT model consists of two main steps (Devlin et al., 2019):
Preparation of data for pre-training: initially the input data is generated for pre-training. This is done by converting the input sentences into the format expected by the BERT model (using the create_pretraining_data algorithm). As BERT can receive one or two sentences as input, the model expects an input format in which special tokens mark the beginning and end of each sentence, as shown in Table 4. In addition, tokenization needs to be performed. BERT provides its own tokenizer, which generates output as shown in Table 5.
|Entry of two sentences||Entry of a sentence|
|[CLS] The man went to the store. [SEP] He bought a gallon of milk. [SEP]||[CLS] The man went to the store. [SEP]|
Table 4. Example of formatting input texts for pre-training with BERT.
|Input sentence||“Here is the sentence I want embeddings for.”|
|Text after tokenizer||['[CLS]', 'here', 'is', 'the', 'sentence', 'i', 'want', 'em', '##bed', '##ding', '##s', 'for', '.', '[SEP]']|
Table 5. Example application of tokenizer provided by BERT.
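To make the input format of Tables 4 and 5 concrete, the sketch below reproduces the Table 5 output with a greedy longest-match-first WordPiece split over a tiny hand-written vocabulary. It is an illustration only: the vocabulary `TOY_VOCAB` and both function names are assumptions, and the tokenizer shipped with BERT additionally handles punctuation splitting, casing and a vocabulary of roughly 30 thousand entries.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercased word into subwords."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no known subword covers this position
        pieces.append(piece)
        start = end
    return pieces


def format_bert_input(tokens_a, tokens_b=None):
    """Wrap one or two tokenized sentences with the [CLS] and [SEP] markers."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
    return tokens


# Hypothetical vocabulary, just large enough for the sentence in Table 5.
TOY_VOCAB = {"here", "is", "the", "sentence", "i", "want",
             "em", "##bed", "##ding", "##s", "for", "."}

words = "here is the sentence i want embeddings for .".split()
subwords = [piece for word in words for piece in wordpiece(word, TOY_VOCAB)]
bert_input = format_bert_input(subwords)
print(bert_input)
```

The word "embeddings" is absent from the vocabulary, so it is broken into known pieces instead of being mapped to an unknown token, which is how rare software engineering terms can still receive a representation.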
Application of the pre-training method: the method used for pre-training by BERT (run_pretraining) was made available by its authors. The necessary hyperparameters were informed, the most important being:
input_file: directory containing pre-formatted pre-training data (as per step 1).
output_dir: output file directory.
max_seq_length: defining the maximum size of the input texts (set at 100).
batch-size: maximum batch size (set at 32, per the usage guidance of the pre-trained model BERT_base).
bert_config_file: BERT model configuration file, supplied with the pre-trained model (bert_config.json).
init_checkpoint: files of the pre-trained model used containing the weights (bert_model.ckpt).
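The hyperparameters listed above can be assembled into a command line for the pre-training script. The sketch below only builds and prints the command; all paths are hypothetical placeholders, and the flag names `do_train` and `train_batch_size` are assumptions based on our reading of the public BERT repository (the text above calls the latter batch-size).

```python
import shlex

PRETRAINED_DIR = "uncased_L-12_H-768_A-12"  # hypothetical local folder with BERT_base

flags = {
    "input_file": "data/corpPret_SE.tfrecord",   # hypothetical output of step 1
    "output_dir": "models/BERT_SE",              # hypothetical output folder
    "do_train": True,                            # assumed flag name
    "bert_config_file": f"{PRETRAINED_DIR}/bert_config.json",
    "init_checkpoint": f"{PRETRAINED_DIR}/bert_model.ckpt",
    "max_seq_length": 100,                       # value reported in the text
    "train_batch_size": 32,                      # value reported in the text
}

# Build the argument list without executing anything.
command = ["python", "run_pretraining.py"] + [
    f"--{name}={value}" for name, value in flags.items()
]
print(shlex.join(command))
```

Keeping the flags in a dictionary makes it easy to record the exact configuration used for each fine-tuning run alongside the resulting model.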
For the generic BERT model, we opted for its version Uncased_L-12_base, here called BERT_base (Table 3). The fine-tuning process for the BERT model also used the corpPret_SE (Figure 6) and was performed as shown above.
The entire process, from data preparation to fine-tuning the BERT model, used the algorithms produced in the repository (Devlin et al., 2018), in which (Devlin et al., 2019) provides the full framework developed in the Python language.
After performing the fine-tuning, two new pre-trained models are available, as shown in Table 6, which will also compose the experiments.
It is noteworthy that the proposed model requires pre-training only for the embedding layer. This allows, for example, for this pre-trained model to be made available for other software engineering tasks, or even for different effort estimation tasks. Thus, this pre-trained model may undergo successive adjustments, according to the need of the task to which it will be applied.
|word2vec_SE||consists of the word2vec_base model after fine-tuning with the corpus corpPret_SE.|
|BERT_SE||consists of the BERT_base model after fine-tuning with the corpus corpPret_SE.|
4.5. Obtaining Characteristics
After the fine-tuning was completed, processing was performed to obtain textual representations from the four models of embeddings (word2vec_base, BERT_base, word2vec_SE and BERT_SE). For the context-less embeddings model, represented by the word2vec_base and word2vec_SE models (as shown in Figure 5), the embeddings vectors of the words contained in each text were averaged (Wieting and Gimpel, 2017; Palangi et al., 2016).
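The averaging step for the context-less models can be sketched as follows. This is a minimal illustration in plain Python with a three-dimensional toy vocabulary; the function name, the handling of out-of-vocabulary words (skipped) and the zero vector for texts without known words are assumptions, not details reported in this study.

```python
def mean_embedding(tokens, vectors, dim, max_words=100):
    """Average the vectors of the known tokens among the first max_words tokens."""
    known = [vectors[t] for t in tokens[:max_words] if t in vectors]
    if not known:
        return [0.0] * dim  # assumption: texts with no known words map to zeros
    # zip(*known) iterates over the embedding dimensions, one column at a time.
    return [sum(component) / len(known) for component in zip(*known)]


# Hypothetical 3-dimensional word vectors (the real models use 100 dimensions).
toy_vectors = {
    "create": [1.0, 0.0, 2.0],
    "job": [3.0, 2.0, 0.0],
    "registry": [2.0, 4.0, 1.0],
}
sentence = ["create", "persistent", "job", "registry"]  # "persistent" is unknown here

print(mean_embedding(sentence, toy_vectors, dim=3))
```

Applying this function to every requirement yields one fixed-length vector per text regardless of its length, which is what allows texts of different sizes to share the same input matrix.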
As for the contextualized embeddings model, the textual representations generated by the BERT model (according to Figure 6) have a different structure from those of the context-less embedding models (e.g. Word2Vec). This is primarily because the number of dimensions of the embedding vectors is not defined by the user, but by the model itself: for the model BERT_base, it is fixed at 768. In addition, each word in this model is represented by 12 layers (standard for BERT_base), which makes it necessary to pool (Lev et al., 2015) the embeddings of some of the layers for each word. The pooling strategies to be applied were defined on the basis of the article by (Devlin et al., 2019).
One of the proposed strategies was used, considering that there were no significant differences between the results obtained with the other tested strategies. For the models BERT_base and BERT_SE, the strategy chosen was to use the penultimate layer of each word to generate a vector of average embeddings for each sentence. More details on pooling strategies can be found in (Lev et al., 2015; Alammar, 2018).
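The penultimate-layer strategy can be illustrated on a toy tensor. In the sketch below, `hidden_states[layer][token]` is assumed to hold the vector of one token at one layer; the function name and the nested-list layout are illustrative assumptions, and a real BERT_base output would have 12 layers and 768 dimensions.

```python
def penultimate_mean_pool(hidden_states):
    """Average the token vectors of the second-to-last layer into one sentence vector."""
    if len(hidden_states) < 2:
        raise ValueError("at least two layers are required")
    penultimate = hidden_states[-2]  # one vector per token
    n_tokens = len(penultimate)
    return [sum(values) / n_tokens for values in zip(*penultimate)]


# Toy output: 3 layers, 2 tokens, 2 dimensions (hypothetical values).
hidden_states = [
    [[0.0, 0.0], [0.0, 0.0]],  # first layer
    [[1.0, 2.0], [3.0, 4.0]],  # penultimate layer, the one that is pooled
    [[9.0, 9.0], [9.0, 9.0]],  # last layer, ignored by this strategy
]

print(penultimate_mean_pool(hidden_states))
```

A common motivation for skipping the last layer is that it is the most specialized toward the pre-training objectives, while the penultimate layer tends to remain more general; this rationale is offered here as background rather than as a finding of this study.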
4.6. Exploratory data analysis
Before presenting the results of the effort estimation, it is first important to highlight some observable aspects regarding the vector of textual representation obtained after the embeddings layer. These sentence embeddings were generated from the models specified in Tables 3 and 6, that is, generic and fine-tuned models for both approaches (without context and contextualized).
In order to show the characteristic of the generated embeddings, the t-Distributed Stochastic Neighbor Embedding (t-SNE) algorithm was applied. This algorithm has been used to represent complex data graphically and in smaller dimensions, while preserving the relationships between neighboring words (Maaten and Hinton, 2008), which greatly facilitates their understanding.
Thus, Figure 7 shows sentence embeddings generated from the BERT_SE for each textual requirement. Only 100 instances of requirements were included for each of the 16 projects analyzed (Table 1) (represented by “idProj”). The t-SNE algorithm reduced the 768-dimension representation (BERT standard) to just two dimensions, which allowed for a visual analysis of the representations obtained and, based on these representations, some conclusions were reached.
|Text 18994||for the sqlserver and postgresql connection can not show the structure correctly.1. create a postgresql sqlserver connection set or not set the catalog parameter(doesn’t set the schema). 2. check the structure, only one schema show under each catalog. in fact i have several. please check it same issue as informix db.|
|Text 17552||sqoop - unable to create job using merge command as a user, i need to use xd sqoop module to support the merge command. currently, the sqoop runner createfinalarguments method forces the requirement for connect, username and password options which are not valid for the merge option. a check of the module type to not force these options being assigned to sqoop arg list would be preferred.|
|Text 15490||need to create a persistent-job-registry in order to hook up the to get access to all the jobs available the job registry has to be shared. currently the only implementation is the mapjobregistry. testability. the admin will need to be see all jobs created by its containers.|
In this example (Table 7), the context of the requirements presented is related to connection and database operations. Among the highlighted words (in bold), one can find, for example, “sqoop”. This word is the name of an application that transfers data between relational databases and Hadoop (Hadoop is an open source software platform for the storage and distributed processing of large databases; the services it provides include storage, processing, access, governance, security and data operations, making use of powerful hardware architectures, which are usually on loan). Therefore, the identification of similar contexts is more clearly perceived, even with very different words, as is the case of “sqoop”, “sqlserver” and “persistent”. These are different words, but part of the same context. Thus, although the groupings did not form clear clusters, either by project or by effort, they showed, to some extent, that requirements from the same context are represented similarly, even if they come from different projects and/or have different efforts.
4.7. Effort Estimation Model Settings
In this section, the stages of the SM model are presented, in which the representation vectors for each textual requirement, obtained according to the procedures presented in section 4.5, are given as input to the dense layers of the deep learning architecture, as shown in Figures 8 and 9.
The embedding() layers (Figures 8 and 9) are represented by the vector of pre-trained weights for each sentence. These vectors are processed by two dense nonlinear layers, followed by a linear regression layer. The output is the estimate of the predicted effort (e.g. points per story).
Each instance of textual requirement submitted to the Embedding() layer is represented by an average vector representing each sentence. A sentence consists of a maximum of 100 words, each of which is represented by a 100-dimensional embedding for the Word2Vec models and a 768-dimensional embedding for the BERT models, as justified below.
The maximum number of words per text was defined based on the representation of the histogram shown in Figure 3, which shows that this number would include most of the sentences used in the training and test database, with reduced data loss.
As presented in the specification of the SM architecture (section 4), the architecture after the embedding layer is very simple, consisting only of two dense non-linear layers and a linear regression layer. This simplicity is deliberate, considering that the objective of this work is to show how to infer effort estimates in software projects by analogy using pre-trained embedding models, verifying whether these models are promising for text-based software effort estimation. A more robust architecture could mask the results generated for each of the different network inputs.
A textual requirement, also referred to in this study as a sentence, is represented by an average vector of the word embeddings that compose this sentence. This average vector is generated from each of the generated models (Tables 3 and 6), considering the particularities of each applied approach (context-less and contextualized), as presented in section 4.5. Therefore, each generated embedding model is represented as a matrix, where each row represents an average embedding vector for a given textual requirement. Thus, each embedding model has the same number of samples, and what varies is the number of dimensions.
Preliminary tests were performed using the grid search method. For the context-less embedding models, tests were performed with dimensions 50, 100, and 200, with the best results obtained for dimensions 50 and 100. As there was no significant variation in the MAE between these two dimensions using the same neural network architecture, the number of embedding dimensions was fixed at 100. For the BERT transformer models, the standard dimension defined for the BERT_base model was used, that is, 768.
For the training of the learning model, it was necessary to configure some hyperparameters. The Adam optimizer (Kingma and Ba, 2014) was used with a learning rate of 0.002 and a batch size (batch_size) of 128. Twenty epochs and an early stopping mechanism were defined, in which, if the MAE value remains stable for 5 epochs, the best results are averaged.
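The early stopping rule can be sketched independently of any deep learning framework. The function below scans a sequence of per-epoch validation MAE values and stops once no improvement has been seen for 5 consecutive epochs; the function name, the strict-improvement criterion and the sample values are assumptions, and the way the best results were averaged in the original experiments is not specified here.

```python
def early_stopping_run(val_mae_per_epoch, max_epochs=20, patience=5):
    """Return (best_epoch, best_mae, last_epoch_run) under a patience-based stop."""
    best_mae = float("inf")
    best_epoch = 0
    last_epoch = 0
    stale_epochs = 0
    for epoch, val_mae in enumerate(val_mae_per_epoch[:max_epochs], start=1):
        last_epoch = epoch
        if val_mae < best_mae:  # assumption: any strict decrease counts as improvement
            best_mae, best_epoch, stale_epochs = val_mae, epoch, 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break  # MAE has not improved for `patience` epochs
    return best_epoch, best_mae, last_epoch


# Hypothetical validation MAE values, one per epoch.
history = [5.0, 4.6, 4.4, 4.3, 4.5, 4.4, 4.6, 4.5, 4.4, 4.1]
print(early_stopping_run(history))
```

With these values training halts after epoch 9, so the lower value at epoch 10 is never observed; this is the usual trade-off of a patience-based criterion.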
4.8. Experiment Settings
The results obtained in this study will be presented below in order to make a comparative analysis with the most similar study (Choetkiertikul et al., 2018) we identified, in addition to answering the following research questions defined initially.
|Experiments||Pre-trained model||Sequential network architecture|
|E1||word2vec_base||Embedding() + LSTM + Dense (non-linear) + Dense (non-linear) + Dense (linear)|
|E2||word2vec_SE||Embedding() + LSTM + Dense (non-linear) + Dense (non-linear) + Dense (linear)|
|E3||BERT_base||Embedding() + LSTM + Dense (non-linear) + Dense (non-linear) + Dense (linear)|
|E4||BERT_SE||Embedding() + LSTM + Dense (non-linear) + Dense (non-linear) + Dense (linear)|
|E5||BERT_SE||Embedding() + LSTM + Dense (non-linear) + Dense (non-linear) + Dense (softmax)|
5. Results and Discussions
The results will be presented below, to make a comparative analysis with the most similar study (Choetkiertikul et al., 2018), in addition to answering the following research questions defined initially.
RQ1. Does a generically pre-trained word embedding model show similar results to a software engineering pre-trained model?
To answer this question, experiments E1, E2, E3 and E4 were performed on the effort estimation task, using pre-trained models with and without fine-tuning, for both the context-less and contextualized approaches. The approach consists of using a pre-trained embedding model as the only source of input to the deep learning architecture used.
Table 9 presents the average values of the MAE, MSE and MdAE measurements obtained after applying 10-fold cross-validation to the proposed sequential architecture, in which each pre-trained embedding model was used as input in one of the experiments (according to Table 8).
|Approach||Model||MAE||MSE||MdAE|
|Context-less||word2vec_base||4.66 ± 0.14||100.26 ± 7.04||2.9|
|Context-less||word2vec_SE||4.36 ± 0.31||89.9 ± 14.37||2.5|
|Contextualized||BERT_base||4.52 ± 0.094||100.95 ± 7.3||2.7|
|Contextualized||BERT_SE||4.25 ± 0.17||86.15 ± 1.66||2.3|
As can be seen in Table 9, the models that underwent fine-tuning (with SE suffix), regardless of the approach used, gave better results for MAE, MSE and MdAE. Thus, it is clear that a pre-trained embedding model, in which fine-tuning is performed with a specific corpus of the task domain, presents a better performance than a pre-trained model with a generic corpus, as is the case with word2vec_base and BERT_base.
This can be seen by comparing the generic context-less embedding model (word2vec_base) to the same model after fine-tuning (word2vec_SE), where a 5% improvement is seen in the MAE value for the second model. Likewise, when applying the contextualized embedding model with fine-tuning (BERT_SE), an improvement of 7.8% for the MAE was observed in relation to the generic model BERT_base. Thus, in general, the results improved after adjusting the models with a domain-specific corpus.
If the MAE value obtained for the best model in Table 9 (BERT_SE), which was 4.25, is applied in practice, this indicates that, for a given user story whose real effort is 5 story points, the estimated effort could be between 1 and 9 story points. This is one of the reasons why human participation in the estimation process is indicated, in order to calibrate these effort values to new requirements.
In terms of MSE, the word2vec_SE model improved by 3.5% when compared to word2vec_base, and by 13.5% in terms of MdAE. When evaluating MSE and MdAE for the contextualized approach, the fine-tuned model outperforms the generic model by 14.6% and 14.8%, respectively.
Thus, we conclude that the representation of the training and test corpus (corp_SE), when generated from pre-trained and fine-tuned embedding models, improves the performance of activities such as the ABSEE. It is also expected that the greater the volume and diversity of samples in the corpus used for fine-tuning (corpPret_SE), the better the results can be. This should therefore be considered when continuing this work in future studies.
RQ2. Would embedding models generated by context-less methods (i.e. Word2Vec) be as effective as models generated by contextualized methods (i.e. BERT)?
As stated by Ruder (Howard and Ruder, 2018), “it only seems to be a question of time until pre-trained word embeddings (i.e. word2vec and similar) will be dethroned and replaced by pre-trained language models (i.e. BERT) in the toolbox of every NLP practitioner.” Thus, the objective of this study was to analyze if context-less embeddings models are effective in effort estimates, as compared to contextualized embeddings models in a specific corpus.
As can be seen in Table 9, the results obtained by the contextualized models (BERT_base and BERT_SE) surpass the results of the context-less models.
When comparing MAE values, the BERT_base model shows a 3% improvement over word2vec_base. Likewise, for the values of MdAE and MSE there was an improvement of 0.7% and 6.9%, respectively.
When comparing these metrics between the fine-tuned models (word2vec_SE and BERT_SE), the improvements of the contextualized model over the context-less model were 2.5%, 4.2% and 8% for the values of MAE, MSE and MdAE, respectively.
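The percentages for the fine-tuned models follow directly from the mean values in Table 9, as the short calculation below shows. The function name is an illustrative choice; the relative improvement is taken as the reduction of an error metric divided by its baseline value.

```python
def relative_improvement(baseline, improved):
    """Percentage reduction of an error metric relative to the baseline."""
    return 100.0 * (baseline - improved) / baseline


# Mean values reported in Table 9 for the two fine-tuned models.
word2vec_se = {"MAE": 4.36, "MSE": 89.9, "MdAE": 2.5}
bert_se = {"MAE": 4.25, "MSE": 86.15, "MdAE": 2.3}

gains = {
    metric: round(relative_improvement(word2vec_se[metric], bert_se[metric]), 1)
    for metric in word2vec_se
}
print(gains)  # {'MAE': 2.5, 'MSE': 4.2, 'MdAE': 8.0}
```

Because all three metrics are errors, a positive value means that the contextualized model is better on that metric.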
It is believed that this result is due to the fact that context-less methods (e.g. Word2Vec) represent only a single context for a given word across a set of texts, which causes a lot of information to be lost. In effort estimation based on requirement texts (e.g. user stories), a contextualized strategy is therefore expected to produce better results. Unlike the Word2Vec approach, BERT methods offer a dynamic context, allowing the actual context of each word to be represented in each text.
This aspect is important, as textual software requirements (i.e. user stories, use cases) generally have short texts and little vocabulary, which means that many words are common to the field of software engineering and are repeated in many texts. This is the importance of identifying different contexts of use for each word, in order to differentiate them. Contextualized methods like BERT guarantee this dynamic treatment of each word, addressing problems of polysemy and ambiguity in an intrinsic way.
This is possible because contextualized models use a deep representation of each word in the text (that is, 12 or 24 layers), unlike context-less embedding models, which use a shallow representation (a single layer) for a word. In addition, contextualized models use an attention model that makes it possible to verify whether the same word occurred previously in the same context or not (example in Table 10), or whether different words can share the same context (e.g. create, implement, generate), as in the example in Table 11.
|Text 22810||Add that contact as a favorite notice that the images for contacts (driven by remote url) change unexpectedly. Under the covers all that is happening, is that the data of the list view is refreshed.|
|Text 23227||Add qparam to skip retrieving metadata and graph edges if the qparam is not there, use current default behavior.|
|Text 18||create new titanium studio splash screen there is a placeholder image…|
|Text 22810||build the corporate directory app for ios…|
RQ3. Are pre-trained embedding models useful for text-based software effort estimation?
To answer this question regarding the perspectives of contextualized pre-trained models applied to ABSEE, it is necessary to observe whether MAE, MSE and MdAE obtained good results. As shown in Table 9, the best MAE value was 4.25, which means that a software effort of 6 will be predicted between 2 and 10.
When questioning whether this result is good or bad, it can be observed that, considering the small number of samples in the training and test sets, the high degree of imbalance between classes, and the high variability of the text, this result is quite adequate. It was precisely the perception that, at least currently, there is little data for estimating software development effort that motivated the investigation of pre-trained embedding models to address the proposed problem.
|Method||SM (multi-repositories)||Deep-SE (cross-repository)|
|MAE||4.25 ± 0.17||3.82 ± 1.56|
When comparing the best MAE results, obtained through the BERT_SE model, with the MAE results of the Deep-SE model (Choetkiertikul et al., 2018), which was the study found to be most similar to the present research, some aspects stand out, as shown below.
As can be seen in Table 12, the MAE obtained by BERT_SE was slightly higher than that of Deep-SE. However, one should note that, to obtain this result, (Choetkiertikul et al., 2018) inferred the effort estimates between projects (e.g. Moodle/Titanium, Mesos/Mule). Therefore, there is a large chance that two projects share a similar context, which would make prediction easier than it would be for projects in different contexts. This is reflected directly in the standard deviation of the MAE values (e.g. 5.37, 6.36, 5.55, 2.67, 4.24) for Deep-SE between projects, which is 1.56. One can see that some values of MAE are relatively low (e.g. 2.67), while others are higher (e.g. 6.36). This means that there may be a higher variation in the estimates inferred by Deep-SE, which, in the worst case, can cause a requirement whose effort is 7 story points to be estimated at 1 or 13 story points. Thus, it is suggested that if the Deep-SE model were applied using a single repository approach, as SM is, the MAE values might be even higher.
SM, on the other hand, uses a single repository approach, that is, all requirements are treated independently of project or repository, aiming to generalize the model. Thus, although the SM MAE is close to the Deep-SE value, the standard deviation obtained is much smaller (0.17), as can be seen from the MAE values obtained by the model (e.g. 3.87, 4.21, 4.15, 4.12, 3.97, 4.25, 4.01). In this sense, it can be said that the proposed method is more generic and applicable to different problems, demonstrating a greater degree of reliability than Deep-SE. This is mainly attributed to the ability of the BERT Transformer mechanism (attention mechanism) to handle long-term dependencies and the “vanishing gradient” (Hochreiter, 1991) and (Bengio et al., 1994), which are limitations of recurrent network architectures such as RHN and LSTM, used by (Choetkiertikul et al., 2018). The “vanishing gradient” leads to the loss of relevant context information used to identify the semantics of a given word in a text. The attention mechanism makes it possible to ignore irrelevant information and focus on what is relevant, connecting two related words even if they are not located one after the other.
In order to reinforce the positive trend of applying pre-trained embeddings models in the process of inferring effort estimation, the E5 experiment was performed. In this case, the same set of training and testing data was used, applying a classification layer (with softmax activation) instead of linear regression. This required some modifications to the dataset. First, there was a need to make the dataset more homogeneous and bulky as compared to the existing labels. Then the closest estimates were grouped, considering the series of Fibonacci proposed for the Planning Poker (Cohn, 2005) estimation method. As a result of this process, the data set only had 9 possible labels: 1, 2, 3, 5, 8, 13, 20, 40, 100. Therefore, compared to the regression problem, each label had its data volume increased and balanced in relation to the existing classes, mainly for the smaller labels (between 1 and 8), as shown in Table 13.
| Label (story points) | 1 | 2 | 3 | 5 | 8 | 13 | 20 | 40 | 100 | Total |
|---|---|---|---|---|---|---|---|---|---|---|
| Number of textual requirements | 4225 | 3406 | 4809 | 4725 | 3588 | 1238 | 706 | 451 | 165 | 23,313 |
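The grouping step can be sketched as follows. The text states only that the closest estimates were grouped into the Planning Poker series, so the nearest-label rule and the tie-breaking toward the smaller label used here are assumptions, and `to_label` is a hypothetical helper name. The raw values are invented for illustration.

```python
from collections import Counter

# The nine labels reported for experiment E5.
PLANNING_POKER_LABELS = (1, 2, 3, 5, 8, 13, 20, 40, 100)

def to_label(story_points, labels=PLANNING_POKER_LABELS):
    """Map a raw estimate to the nearest label.

    Assumption: ties are resolved toward the smaller label.
    """
    return min(labels, key=lambda label: (abs(label - story_points), label))

# Made-up raw estimates, only to exercise the mapping.
raw = [1, 4, 6, 10, 16, 30, 70, 100]
grouped = [to_label(points) for points in raw]
print(grouped)           # [1, 3, 5, 8, 13, 20, 40, 100]
print(Counter(grouped))  # class sizes after grouping
```

After such a mapping, counting the samples per label gives the kind of class distribution reported in Table 13.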
Figure 10 shows the confusion matrix generated for experiment E5. One can see that the greatest confusion occurs between the lowest efforts (1, 2, 3, 5 and 8). Considering the MAE value of the regression experiment (4.25), it is possible to understand this bias in the confusion matrix; after all, as explained previously, with a deviation of about 4 story points, a given requirement could be labeled anywhere from 1 to 8.
One can also observe that, similarly, the larger classes (13, 20, 40 and 100) became more confused with each other and, in very low percentages there was confusion among the smaller classes. This aspect leads us to suggest that human intervention at the end of the process is needed, with the aim of approving/modifying the estimate generated, according to the user’s working reality. Thus, considering its application for the end user, the proposed method would be classified as semi-automated.
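For readers who want to reproduce this kind of analysis, the sketch below builds a confusion matrix over the nine effort labels in plain Python and normalizes each row to percentages of the true class. The true and predicted labels are invented for illustration and are not the E5 results; `confusion_matrix` and `row_normalize` are hypothetical helper names.

```python
LABELS = (1, 2, 3, 5, 8, 13, 20, 40, 100)

def confusion_matrix(y_true, y_pred, labels=LABELS):
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, prediction in zip(y_true, y_pred):
        matrix[index[truth]][index[prediction]] += 1
    return matrix

def row_normalize(matrix):
    """Convert counts to fractions of each true class (empty rows stay 0)."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([count / total if total else 0.0 for count in row])
    return normalized

# Made-up labels: small classes confused with neighbours, large with large.
y_true = [1, 2, 3, 5, 8, 13, 20, 40]
y_pred = [2, 2, 5, 5, 5, 20, 20, 100]
cm = confusion_matrix(y_true, y_pred)
for label, row in zip(LABELS, row_normalize(cm)):
    print(label, [round(x, 2) for x in row])
```

Reading the matrix row by row makes it easy to see which neighbouring classes absorb most of the errors, which is where a human reviewer would most often need to adjust the suggested estimate.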
Another aspect to be observed in the results of Figure 10 is an indication that the larger the data set representing each label, the better the results. This study argues that effective text data augmentation techniques can improve this result. Another alternative would be to collect a larger number of real samples, complementing the pre-training and fine-tuning corpus (corpPret_SE) as well as the training and testing sets (corp_SE). The generated embeddings would then be more representative, that is, they would better represent each situation found in the requirements.
RQ4. Are pre-trained embeddings models useful to a text-based software effort estimation, both on new and existing projects?
Another aspect that can be observed is the suitability of the SM model for estimating new requirements from projects not seen during training. For this, an additional experiment was carried out with the same configuration as E4, changing only the type of data partitioning. In this experiment, cross-validation was applied by project in order to evaluate the estimates inferred for each project as if it were completely new. Thus, in each iteration one project was taken as the target for the estimates (test set), while the others served as the source (training set). Table 14 shows the MAE results obtained for each project.
| Project ID | Description | Num. requirements | Med | Std | MAE |
|---|---|---|---|---|---|
| TD | Talend Data Quality | 1381 | 5.92 | 5.19 | 4.04 |
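The partitioning by project described above can be sketched as a leave-one-project-out loop. In this sketch the project identifiers `P1` to `P3`, the requirement texts, and the efforts are invented, and the predictor is a deliberately trivial stand-in (the mean effort of the training projects), not the SM model; only the splitting logic and the MAE computation are the point.

```python
def leave_one_project_out(samples):
    """Yield (target, train, test) with one project held out per iteration.

    samples: list of (project_id, requirement_text, story_points).
    """
    projects = sorted({project for project, _, _ in samples})
    for target in projects:
        train = [s for s in samples if s[0] != target]
        test = [s for s in samples if s[0] == target]
        yield target, train, test

def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Made-up data; real experiments would use the requirements of each project.
samples = [
    ("P1", "add login form", 1), ("P1", "reset password flow", 3),
    ("P2", "export report", 5), ("P2", "migrate database schema", 8),
    ("P3", "fix tooltip text", 2), ("P3", "rewrite sync engine", 13),
]

results = {}
for target, train, test in leave_one_project_out(samples):
    # Stand-in predictor (NOT the SM model): mean effort of the source projects.
    baseline = sum(points for _, _, points in train) / len(train)
    actual = [points for _, _, points in test]
    results[target] = mean_absolute_error(actual, [baseline] * len(actual))

print(results)  # {'P1': 5.0, 'P2': 1.75, 'P3': 5.5}
```

Each held-out project never contributes to training in its own iteration, which is what allows the per-project MAE to be read as performance on a completely new project.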
When considering the cross-project estimation approach for the target projects (UG, ME, AP, TI, AS, TI, MS, MU) listed in Table 9 of the article by (Choetkiertikul et al., 2018), the authors obtained an average MAE of 3.82 for the Deep-SE model, while the SM model presented an MAE of 3.4 (Table 15). Note that this comparison was made only with the cross-repository results of (Choetkiertikul et al., 2018); for the SM results, the 15 remaining projects were all used as source projects, while the target projects are the same as those chosen by the authors.
It should also be noted that the authors of Deep-SE used one source project for training and another target project for testing; that is, the diversity of characteristics that the learning method needs to deal with is limited to the domain of a single project. In other words, the model needs to deal with less data variability, which presumably facilitates learning, since some important relationships in the data can be discovered more easily by the learning method. In contrast, the proposal presented here uses all projects except one for training, so the variety of relationships that the model must deal with is much greater than in the Deep-SE approach. From one point of view this is good, as the learning method should generalize better to any type of problem presented. On the other hand, learning is hampered by the curse of dimensionality: many important relationships become more difficult to infer automatically, mainly because the corpus is very small. The explanation is that, in a smaller data set, basic relationships are easily learned, whereas more specific relationships are often not sufficiently represented.
In the case of software projects, each project used as training and testing data is in fact a different domain (when compared to another project), which can be composed of several sub-domains, such as: application areas of the project, programming languages (e.g. Java, Python, C), databases (e.g. SQL Server, Oracle, MySQL, MongoDB), application modalities (e.g. web, mobile, desktop), and development teams with different characteristics (e.g. beginner, full), among others. Therefore, each sub-domain can be composed of different relationships, since different aspects can change the context of one project in relation to another, which leads, for example, to the use of different terms for the same purpose. Thus, there is a need for a balanced volume of representative samples from each sub-domain, so that the model can learn properly.
| Source projects | Target projects | MAE (Deep-SE) | MAE (SM) |
|---|---|---|---|
| 15 remaining projects | TI | 6.36 | 3.49 |
Although the results of this study are not very different from those obtained in the article by (Choetkiertikul et al., 2018), the proposed approach presents a much simpler and potentially more robust network architecture. This is possible because part of the feature learning process, which extracts the contextualized textual representation for new project requirements, is performed only once, during the generation of the pre-trained model (BERT_SE). This model is then employed as a parameter in the embedding layer of the sequential architecture used. This aspect is important, considering the need to feed the training database with new requirements and their respective efforts, which makes it possible to increase the precision of effort estimates for new projects.
RQ5. Are the results found generalizable, aiming to generate estimates of effort between companies?
In our approach, the results show that even a new project can have its effort predicted without any pre-existing data (as shown in RQ4). Although the MAE still does not deliver a perfect result, a good estimate of the effort can be achieved and used, in a semi-automated way, by companies, as explained in the RQ3 response.
Additionally, when considering the application of the model to multiple companies, it is necessary to consider the possibility that existing requirements are measured with different metrics (e.g. function points, story points), since the dataset would be fed by requirements from different companies. To address this, we suggest converting these metrics into a standard form; an estimate produced by the estimator in that form could then be converted into the format used by the user. Such a conversion of software effort metrics is proposed by (Pressman and Maxim, 2016).
Thus, generalizing the proposed method so that it can be used by several companies, for example in a web application, would be feasible. This claim is supported by the fact that the method is based on a single repository (i.e. grouping multiple repositories), in which the requirements of different projects are gathered, regardless of the format in which the requirements texts are recorded.
Furthermore, considering its practical application, the proposed model can be adjusted by the user. Therefore, real estimates generated and approved by specialists can be fed back, making the system more and more adjusted to a specific company, for example.
6. Threats to validity
Like several studies in software engineering that apply machine learning techniques to text, this paper, which proposes a form of analogy-based software effort estimation from requirements texts, faces difficulties regarding the availability of real data. These data should reflect the reality of software projects in different areas, levels of complexity, and uses of technologies, among other attributes. Typically, these difficulties are related to the volume of data, the language in which the data are written, and the quality of the data (for example, text formats and incomplete or nonexistent labels, among others). Thus, after searching for textual requirements databases, we decided to use the database of (Choetkiertikul et al., 2018), who provided the first requirements-level database for research in this area. Since the model proposed in this article (SM) is compared with the Deep-SE model proposed by (Choetkiertikul et al., 2018) in terms of software effort estimation results, the same database was used.
Therefore, the actions taken by (Choetkiertikul et al., 2018) to contain or reduce threats to validity were adopted and maintained in this work. They are:
- Real project requirements data were used, obtained from large and diverse open source projects.
- The story points that accompany each textual requirement were originally estimated by human teams and may therefore be inaccurate in some situations. (Choetkiertikul et al., 2018) performed two tests to mitigate this threat: one with the original story points and the other with normalized and adjusted story points. This verified that the proportions of story points attributed to each requirement were adequate.
- It was observed that project managers and analysts determine the estimate for a new requirement by comparing it with requirements already implemented in the past, and thus estimate consistently. The problem is therefore well suited to a machine learner, since the training and testing database contains the description of the requirements accompanied by their respective efforts in story points. Thus, new requirements can be estimated as long as they have these two attributes.
To perform the experiments, metrics appropriate for evaluating regression models were applied. These metrics are commonly used to evaluate software effort estimation models (Choetkiertikul et al., 2018) and support the validity of the model. In addition, different forms of data partitioning (cross-validation) were used in order to validate the results obtained.
In order to compare the results obtained in this work with those obtained by (Choetkiertikul et al., 2018), considering that our implementation may not present all the details that Deep-SE presents, our model was tested using the same data set as the authors. Thus, it was possible to state that our results are consistent.
Regarding the possibility of generalization, it should be noted that the training and testing data set is composed of 23,313 requirements from sixteen open source projects, which differ significantly in size, complexity, developer team and community (Choetkiertikul et al., 2018). Open source projects do not present the same aspects as commercial projects in general, especially in relation to the human resources involved, which requires more research. It would be prudent to test the SM model with a database of commercial projects only (with data available containing the same attributes used in this research) and then with all types of projects (commercial and open source) in order to check for significant differences.
7. Conclusions and Future Work
The main objective of this research is to evaluate if pre-trained embeddings models are promising for the inference of text-based software effort estimation, evaluating two approaches to embeddings: context-less and contextualized. The study obtained positive results for the pre-trained models for the ABSEE task, particularly the contextualized models, such as BERT. As predicted in the literature, the contextualized methods demonstrate the best performance numbers. In addition, we show that fine-tuning the embedding layers can help improve the results. All of these results can be improved, especially if trained with more data and/or using some effective data augmentation.
The researchers observed that the database was a limitation of this research, particularly because it is not very large, which prevents the results from being even better, especially when using cutting-edge NLP techniques (e.g. pre-trained models, fine-tuning and deep learning). When observing the volume of data used in fine-tuning tasks for specific domains, such as (Beltagy et al., 2019) and (Lee et al., 2020), it is clear that even domain corpora considered to be light contain billions of words. On the other hand, the domain-specific database applied in the fine-tuning of the BERT_SE model has around 800 thousand words. Thus, the results obtained with BERT could be improved with a more voluminous and diverse set of data, with which it would be possible to better adjust the model for different problems, or even to train a dedicated model from scratch, which would present an even greater level of adjustment to the domain area.
Thus, this study argues that the SM model analyzed in this article can adapt well to different project contexts (for example, agile development). Story points and other similar estimation metrics (e.g. use case points or function points) have fine granularity (i.e. they are assigned to each user requirement). However, the same inference method can be applied to estimate a coarser-grained element. An example would be a sprint in agile models (Paasivaara et al., 2009), which is estimated as the sum of the smaller tasks that compose it.
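The aggregation from requirement-level to sprint-level estimates can be sketched as below. The function `estimate_sprint` and the placeholder `dummy_estimator` are hypothetical names introduced for illustration; the placeholder is a fixed rule that stands in for any per-requirement estimator and is not the SM model. The backlog items are invented.

```python
from typing import Callable, Iterable

def estimate_sprint(requirements: Iterable[str],
                    estimate_requirement: Callable[[str], float]) -> float:
    """Coarse-grained estimate: sum of the per-requirement estimates."""
    return sum(estimate_requirement(text) for text in requirements)

def dummy_estimator(text: str) -> float:
    # Placeholder rule (NOT the SM model): short texts get 3 points, longer get 5.
    return 3.0 if len(text.split()) <= 6 else 5.0

# Made-up sprint backlog.
backlog = [
    "Add login page",
    "Export the monthly quality report as a PDF file",
    "Fix typo in footer",
]
total = estimate_sprint(backlog, dummy_estimator)
print(total)  # 11.0
```

Passing the estimator as a parameter keeps the aggregation independent of how each requirement is scored, so a trained model could replace the placeholder without changing the sprint-level logic.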
Compared with the results obtained by Choetkiertikul et al. (Choetkiertikul et al., 2018), the most similar work in terms of objectives, SM uses a pre-trained contextualized embedding layer that went through a fine-tuning process, without the need to add any noise to the texts. In addition, the proposed architecture is simpler, with just one recurrent layer. As a result of the feature learning process, which applies a contextualized embedding approach, we have a more generalized, multi-project method that can be applied even to new projects. Thus, the method is more flexible regarding the format of the input texts and the projects they come from, allowing inference to be performed on any new requirement, even during the initial stage of development.
Using the pre-trained BERT model (even the generic one), there is no need for prior training on a specific corpus, or on one that requires a large volume of data. This has the advantage of involving no pre-training cost (which takes days (Howard and Ruder, 2018)). Rather, there is only the need for fine-tuning, which takes a few hours (Devlin et al., 2019). That is, there is no need for training from scratch, as performed by (Choetkiertikul et al., 2018).
Thus, we provide the pre-trained BERT_SE model, which can be used in various software engineering tasks and allows for further adjustments if necessary. In addition, the SM model can be applied in a generic way and, in addition to being reliable, it is a cheap and computationally fast solution, since only the fine-tuning process is required.
The results demonstrated that this is a promising research field with many available resources and room for innovation. Therefore, several research possibilities are presented as future work:
- Collect more textual requirement data to balance the dataset against existing labels, and thereby increase the number of contexts; or apply data augmentation techniques to improve results.
- Update the BERT_base vocabulary, including domain-specific vocabulary, and then fine-tune.
- Perform fine-tuning with other models, such as BERT_large, and compare the results.
- Study and apply effective data augmentation techniques in order to balance the number of samples in each existing effort class, and thus obtain possible improvements in the results.
- Study and apply different combinations of the layers in the BERT model, in order to evaluate the performance of the model regarding fine-tuning and its effects on ABSEE.
- Study and apply “light” pre-trained models (e.g. ALBERT (Lan et al., 2019)) to the SM model, and evaluate their performance on ABSEE.
References
- Predicting development effort from user stories. In 2011 International Symposium on Empirical Software Engineering and Measurement, pp. 400–403. Cited by: §3.
- The illustrated BERT, ELMo, and co. (how NLP cracked transfer learning). December 3, 2018. Cited by: §4.5.
- Improving fuzzy analogy based software development effort estimation. In Software Engineering Conference (APSEC), 2014 21st Asia-Pacific, Vol. 1, pp. 247–254. Cited by: §1.
- A case study on the utilization of problem and solution domain measures for software size estimation. In 2016 42th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), pp. 108–111. Cited by: §3.
- Adjusted case-based software effort estimation using bees optimization algorithm. Knowledge-Based and Intelligent Information and Engineering Systems, pp. 315–324. Cited by: §1.
- LMES: a localized multi-estimator model to estimate software development effort. Engineering Applications of Artificial Intelligence 26 (10), pp. 2624–2640. Cited by: §1.
- SciBERT: a pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3606–3611. Cited by: §7.
- Unsupervised feature learning and deep learning: a review and new perspectives. CoRR abs/1206.5538, 2012. Cited by: item 2.
- Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks 5 (2), pp. 157–166. Cited by: §5.
- Software development cost estimation approaches—a survey. Annals of software engineering 10 (1-4), pp. 177–205. Cited by: §2.1.
- Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, pp. 135–146. Cited by: §2.2.
- Joint learning of words and meaning representations for open-text semantic parsing. In Artificial Intelligence and Statistics, pp. 127–135. Cited by: §1.
- [Journal first] sentiment polarity detection for software development. In 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE), pp. 128–128. Cited by: §1.
- Mining analogical libraries in q&a discussions–incorporating relational and categorical knowledge into word embedding. In 2016 IEEE 23rd international conference on software analysis, evolution, and reengineering (SANER), Vol. 1, pp. 338–348. Cited by: §1.
- A unified model for word sense representation and disambiguation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1025–1035. Cited by: §1.
- The adjusted analogy-based software effort estimation based on similarity distances. Journal of Systems and Software 80 (4), pp. 628–640. Cited by: §2.1.
- A deep learning model for estimating story points. IEEE Transactions on Software Engineering. Cited by: §1, §1, §1, §3, §3, §3, §3, item 3, §4.1, §4.1, §4.2, §4.2, §4.2, §4.8, Table 1, Table 12, §5, §5, §5, §5, §5, §5, 2nd item, §6, §6, §6, §6, §6, §7, §7.
- A rule-based approach for estimating software development cost using function point and goal and scenario based requirements. Expert Systems with Applications 39 (1), pp. 406–418. Cited by: §2.1.
- Semi-supervised sequence modeling with cross-view training. arXiv preprint arXiv:1809.08370. Cited by: §2.3.
- Agile estimating and planning. Pearson Education. Cited by: §1, §4.2, §5.
- Semi-supervised sequence learning. In Advances in neural information processing systems, pp. 3079–3087. Cited by: §2.3.
- Bidirectional encoder representations from transformers. Cited by: §4.4.
- Bert: pre-training of deep bidirectional transformers for language understanding. North American Association for Computational Linguistics (NAACL). Cited by: §1, §1, §2.3, §2.3, §2.3, §2.3, §2.3, §4.3, §4.3, §4.4, §4.4, §4.5, §7.
- Word embeddings for the software engineering domain. In Proceedings of the 15th International Conference on Mining Software Repositories, pp. 38–41. Cited by: item 1, §1.
- WordNet. The Encyclopedia of Applied Linguistics. Cited by: §2.2.
- An nlp approach for cross-domain ambiguity detection in requirements engineering. Automated Software Engineering 26 (3), pp. 559–598. Cited by: §1.
- Supervised sequence labelling. In Supervised sequence labelling with recurrent neural networks, pp. 5–13. Cited by: item 3.
- Genetic fuzzy system for enhancing software estimation models. International Journal of Modeling and Optimization 4 (3), pp. 227–232. Cited by: §1.
- Deep residual learning for image recognition. In , pp. 770–778. Cited by: item 3.
- Untersuchungen zu dynamischen neuronalen netzen. Diploma, Technische Universität München 91 (1). Cited by: §5.
- Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146. Cited by: §1, §1, §2.3, §2.3, §5, §7.
- Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, pp. 873–882. Cited by: §1.
- Approximation of COSMIC functional size to support early effort estimation in agile. Data & Knowledge Engineering 85, pp. 2–14. Cited by: §3, §3.
- Missing data techniques in analogy-based software development effort estimation. Journal of Systems and Software 117, pp. 595–611. Cited by: §1, §2.1.
- Analogy-based software development effort estimation: a systematic mapping and review. Information and Software Technology 58, pp. 206–230. Cited by: §1.
- Systematic literature review of ensemble effort estimation. Journal of Systems and Software 118, pp. 151–175. Cited by: §1, §2.1, §2.1.
- An approach to software development effort estimation using machine learning. In Intelligent Computer Communication and Processing (ICCP), 2017 13th IEEE International Conference on, pp. 197–203. Cited by: §1, §1, §1, §1, §3, §3, §3.
- Neural context embeddings for automatic discovery of word senses. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pp. 25–32. Cited by: §1.
- Fuzzy emotional cocomo ii software cost estimation (fecsce) using multi-agent systems. Applied Software Computing Journal 11 (12), pp. 2260–2270. Cited by: §1.
- Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.7.
- When to use data from other projects for effort estimation. In Proceedings of the IEEE/ACM international conference on Automated software engineering, pp. 321–324. Cited by: §2.1.
- Albert: a lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942. Cited by: 6th item.
- BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36 (4), pp. 1234–1240. Cited by: §7.
- In defense of word embedding for generic text representation. In International Conference on Applications of Natural Language to Information Systems, pp. 35–50. Cited by: §4.5, §4.5.
- Learning semantic word embeddings based on ordinal knowledge constraints. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Vol. 1, pp. 1501–1511. Cited by: §1.
- Visualizing data using t-sne. Journal of machine learning research 9 (Nov), pp. 2579–2605. Cited by: §4.6.
- Software cost estimation by analogy using feed forward neural network. In Information Communication and Embedded Systems (ICICES), 2014 International Conference on, pp. 1–5. Cited by: §1, §2.1.
- Learned in translation: contextualized word vectors. In Advances in Neural Information Processing Systems, pp. 6294–6305. Cited by: §2.3.
- Context2vec: learning generic context embedding with bidirectional lstm. In Proceedings of the 20th SIGNLL conference on computational natural language learning, pp. 51–61. Cited by: §2.3.
- The role of context types and dimensionality in learning word embeddings. arXiv preprint arXiv:1601.00893. Cited by: §1.
- Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. Cited by: §1, §2.2, §2.2, §2.2, §2.2, §2.2, Table 3.
- Cost-effective supervised learning models for software effort estimation in agile environments. In 2016 IEEE 40th Annual Computer Software and Applications Conference (COMPSAC), Vol. 2, pp. 135–140. Cited by: §1, §3.
- Functional size approximation based on use-case names. Information and Software Technology 80, pp. 73–88. Cited by: §3.
- Using scrum in distributed agile development: a multiple case study. In 2009 Fourth IEEE International Conference on Global Software Engineering, pp. 195–204. Cited by: §7.
- Deep sentence embedding using long short-term memory networks: analysis and application to information retrieval. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 24 (4), pp. 694–707. Cited by: §4.5.
- Glove: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543. Cited by: §1, §2.2, §2.2.
- Deep contextualized word representations. arXiv preprint arXiv:1802.05365. Cited by: §2.3, §2.3.
- Estimating story points from issue reports. In Proceedings of the The 12th International Conference on Predictive Models and Data Analytics in Software Engineering, pp. 1–10. Cited by: §4.2.
- Engenharia de software-8ª edição. McGraw Hill Brasil. Cited by: §5.
- Improving language understanding by generative pre-training. URL https://s3-us-west-2. amazonaws. com/openai-assets/researchcovers/languageunsupervised/language understanding paper. pdf. Cited by: §2.3, §2.3.
- Python machine learning. Packt Publishing Ltd. Cited by: §2.2, §4.1.
- Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Fifteenth annual conference of the international speech communication association, pp. 338–342. Cited by: item 3.
- Multi-objective software effort estimation. In 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE), pp. 619–630. Cited by: §4.1.
- On the origin of the fibonacci sequence. MacTutor History of Mathematics. Cited by: §4.2.
- Effort estimation using analogy. In Proceedings of IEEE 18th International Conference on Software Engineering, pp. 170–178. Cited by: §1, §2.1.
- Software effort estimation using machine learning techniques. In Proceedings of the 7th India Software Engineering Conference, pp. 19. Cited by: §2.1.
- Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: item 3.
- SEWordSim: software-specific word similarity database. In Companion Proceedings of the 36th International Conference on Software Engineering, pp. 568–571. Cited by: item 1.
- Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §1, §2.3, §2.3.
- Revisiting recurrent networks for paraphrastic sentence embeddings. arXiv preprint arXiv:1705.00364. Cited by: §4.5.
- Combining word embedding with information retrieval to recommend similar bug reports. In 2016 IEEE 27th International Symposium on Software Reliability Engineering (ISSRE), pp. 127–137. Cited by: §1.
- Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144. Cited by: 2nd item.
- ESSE: an early software size estimation method based on auto-extracted requirements features. In Proceedings of the 8th Asia-Pacific Symposium on Internetware, pp. 112–115. Cited by: §3, §3.
- Aligning books and movies: towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE international conference on computer vision, pp. 19–27. Cited by: §2.3.
- CRC standard probability and statistics tables and formulae. CRC Press. Cited by: §4.2.