Software engineers’ personality has been a subject of interest for researchers since the 1970s when it was first hypothesized that personality traits might influence how developers interact (weinberg1971psychologyse; shneiderman1980psychologyse). cruz2015slr (cruz2015slr) identified 90 studies published between 1970 and 2010, most of which were published after 2002; barroso2017slr (barroso2017slr) found 21 studies, published between 2003 and 2016, studying the effect of personality on professional developers. The main reasons for this growing interest in personality-focused research lie in the many practical implications existing at both individual and team level. For example, previous studies on the personality of developers involved in agile development have revealed a positive association of conscientiousness (i.e., being organized, dependable) and openness to experience with their pair programming performance (salleh2010icsemotivation; salleh2010esemmotivation); in addition, evidence suggests that testers are significantly higher on conscientiousness than other software development practitioners (kanij2015motivation) and that managers are more extroverted (smith2016personalityse). Regarding performance, teams whose members are more extroverted have been found to release higher-quality software products (acuna2009motivation; acuna2015motivation). licorish2009supporting (licorish2009supporting) have built a prototype to assist project managers in agile team formation by providing lightweight support for personality assessment. However, these studies barely scratched the surface of the potential scenarios of applying personality to software engineering, given its profound socio-technical nature (mauerer2021stc).
Most of the previous research on personality has been conducted using self-assessment questionnaires. Albeit reliable, detecting the personality through psychometric questionnaires has some drawbacks, such as low return rates—especially in the software engineering domain (smith2013sesurveys)—and the limited number of occasions (typically one) to perform data collection (yarkoni2010liwc). The drawbacks of self-assessment questionnaires can be overcome by computational personality detection, which is the task of automatically inferring personality traits from conversation transcripts and written text (argamon2005first; vinciarelli2014apr; farnadi2016cpd).
With the proliferation of collaborative development environments such as GitHub, social media like Twitter, and communication platforms like Slack, developers’ discussions have become easily accessible. Thanks to the availability of such a wealth of communication traces, software engineering researchers have begun developing solutions for automatically detecting developers’ personalities from written text, thus making it possible to study on a large scale whether and how specific aspects might influence the outcome of development activities. For example, previous research has relied on automatic tools to investigate how personality varies with the level of developers’ contribution and prominence in the Apache (rigby2007liwc-asf), Stack Overflow (bazelli2013liwc-so), and GitHub (rastogi2016liwc-gh) ecosystems; other studies have focused on clustering similar personality profiles among the developers of the Eclipse (paruma2016ibmpi; licorish2015personalitygsd) and Apache (calefato2019apache) projects.
The research on computational personality detection has resulted in the release of several general-purpose prediction tools and models (e.g., (mairesse2007pr; pennebaker2015liwc; liu2017c2w2s4pt; majumder2017deeplearning; carducci2018twitpers)). However, while there has been prior work that assessed the psychometric validity of questionnaires showing good correlations among the instruments (e.g., (schmitt2007bfi)), when it comes to studies on automatic personality detection we found no previous assessment of general-purpose tool performance used as off-the-shelf components to analyze the communication traces left in a technical domain such as software engineering. Therefore, to fill this gap, in this paper we investigate the problem of using computational personality detection tools trained on non-technical domains for software engineering research. In particular, we first ask:
RQ1 – How do off-the-shelf personality detection tools perform in the software engineering domain?
The findings of this study show that when general-purpose personality detection tools are used off-the-shelf for software engineering research, their performance is far from acceptable, as (i) they agree neither with self-reported ratings nor with each other, and (ii) their prediction accuracy is lower than that reported in prior work in non-technical domains.
The low agreement and accuracy rates of personality predictions suggest that the conclusions based on the application of these tools in the software engineering domain might be affected by the choice of one specific tool over the others. Therefore, to understand whether using different personality detection tools affects the validity of the results when employed in the same software engineering study, we ask:
RQ2 – How does the choice of a personality detection tool affect the validity of previous results in software engineering research?
We replicate two former studies and, after changing the tool used therein, find that the choice of a specific personality detection tool does affect the validity of the previously published results in software engineering research, as we generally fail to reproduce them.
From a research perspective, our study makes a first attempt at benchmarking the performance of the tools available for personality detection from technical text. The study advances the state of the art on computational personality detection for software engineering research by suggesting that, to cope with a domain-specific lexicon, we need to develop software engineering-specific tools given that the existing solutions cannot be fine-tuned because the prediction models are generally not retrainable.
From a practical perspective, this study furthers our understanding of the current limitations when reusing computational personality detection tools as off-the-shelf components. We show that tools underperform in the presence of technical text and warn that the choice of specific personality detection tools may lead to contradictory results. Another practical contribution is the replication package that we share with the research community, including both the scripts for automatically re-executing the entire experimental workflow and an anonymized gold standard consisting of an email corpus matched with developers’ self-reported personality scores.
The remainder of this paper is organized as follows. In Sect. 2, we illustrate the study background. Sect. 3 presents the two-phase research framework followed to carry out our work. In Sect. 4, we study the performance of the selected personality prediction tools. In Sect. 5, we perform the replication of two previously published, software engineering studies. The results are discussed in Sect. 6. Finally, we present the related work in Sect. 7 and conclude in Sect. 8.
In Sect. 2.1, we first provide an overview of the fundamental concepts and theories related to personality. Then, in Sect. 2.2 we review some of the instruments used for personality measurement. Finally, in Sect. 2.3, we review prior work that focused on developers’ personalities in the software engineering field.
2.1. Personality Theories
Personality is defined by psychologists as the set of all the behavioral, temperamental, emotional, and mental attributes that characterize a unique individual (ryckman2012theories). Personality has been conceptualized from a variety of theoretical perspectives and at various levels of abstractions. One level that has been often studied is personality traits (john1999big5traits). Given the complexity of its nature, psychologists have developed several taxonomies of traits, that is, descriptive models of personality useful for organizing, distinguishing, and summarizing the major individual differences among the numerous existing in human beings.
Many theories of personality traits have been proposed since the 1930s. Such theories often disagreed regarding the number of traits and their nature (goldberg1993traits). However, after decades of research and the compelling amount of empirical evidence collected, since the 1970s a general consensus was achieved on the validity of a taxonomy of five orthogonal personality traits, called the Big Five. A brief description of the five factors is reported in Table 1. The name was first proposed by Goldberg (goldberg1981bigfive) to emphasize that five dimensions are sufficient to capture the main dispositional characteristics and high-level differences of individuals. These five personality traits have been obtained through repeated studies by applying factor analyses to various lists of trait adjectives used in self-descriptions and self-rating questionnaires for personality assessment. These studies are based on the lexical hypothesis (allport1936lexical-hypothesis), a psycholinguistic conjecture according to which the most important individual characteristics and differences in personality have been encoded over time as words in the natural language, and the more important a characteristic or difference, the more likely it is for individuals to express it using associated words (john1999big5traits).
Several independent studies (see digman1990ffm (digman1990ffm) for a compendium) performed repeated factor analyses on questionnaires based on different personality taxonomies only to find consistent evidence of the existence of a latent personality structure consisting of five main factors. In other words, the extracted models showed minor differences at the higher level—i.e., the traits, albeit labeled differently, could be easily mapped onto each other (goldberg1993traits). Hence, trait psychologists combined these findings on the general ubiquity of five factors across various instruments with the results from the studies on the lexical hypothesis to argue that any personality traits model had to encompass at some level the same Big Five dimensions (goldberg1981bigfive).
Although the Big Five is often considered synonymous with the Five-Factor Model (FFM) (costa1985neo-pi; mccrae1987ffm-validation), the two models are slightly different. Big Five is a term used to refer in general to personality frameworks consisting of five high-level dimensions, and the Big Five model only describes the five traits at a broad level. Instead, the FFM is a personality framework that further refines the five high-level traits into multiple lower-level facets. For the sake of simplicity, from now on we will use Big Five and FFM interchangeably.
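Table 1. A brief description of the Big Five personality traits.
|Low scores|Trait|High scores|
|Is conservative and close-minded|Openness|Is open to novel experiences|
|Is impulsive and careless|Conscientiousness|Is responsible and dependable|
|Is reserved and solitary|Extraversion|Is outgoing and decisive|
|Is unfriendly and cold|Agreeableness|Is warm and good-natured|
|Is stable and calm|Neuroticism|Is nervous and emotionally unstable|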
2.2. Instruments for Personality Detection
Psychometrics is the development of measurement instruments and the assessment of whether these instruments are reliable and valid forms of measurement (ginty2013psychometrics).
Personality traits are usually assessed using self-assessment questionnaires, which present a variable number of items (up to hundreds) that describe common situations and behaviors. Subjects take these self-reporting tests by rating on a Likert scale the extent to which an item applies to them, where each item is either positively or negatively associated with a specific trait. Finally, a numeric score is computed for each trait by aggregating (e.g., summing) all the values assigned to its related answers.
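The scoring procedure described above can be illustrated with a minimal sketch. The item identifiers, the item-to-trait mapping, and the 1-5 Likert scale are hypothetical choices made for illustration; they do not reproduce any specific inventory.

```python
from statistics import mean

# Hypothetical questionnaire: each item maps to a trait and a keying
# direction (+1 = positively keyed, -1 = negatively keyed).
ITEMS = {
    "q1": ("extraversion", +1),
    "q2": ("extraversion", -1),
    "q3": ("conscientiousness", +1),
    "q4": ("conscientiousness", -1),
}
SCALE_MIN, SCALE_MAX = 1, 5  # assumed 5-point Likert scale


def score_traits(answers, items=ITEMS, aggregate=mean):
    """Aggregate Likert answers into one numeric score per trait."""
    per_trait = {}
    for item_id, rating in answers.items():
        trait, keying = items[item_id]
        if not SCALE_MIN <= rating <= SCALE_MAX:
            raise ValueError(f"rating out of range for {item_id}: {rating}")
        # Reverse negatively keyed items so that a higher value always
        # means a higher level of the trait.
        value = rating if keying > 0 else SCALE_MIN + SCALE_MAX - rating
        per_trait.setdefault(trait, []).append(value)
    return {trait: aggregate(values) for trait, values in per_trait.items()}


answers = {"q1": 4, "q2": 2, "q3": 5, "q4": 3}
scores = score_traits(answers)
```

Passing `aggregate=sum` yields summed rather than averaged trait scores; which aggregation applies depends on the instrument.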
There are several instruments for the self-assessment of personality, such as the Myers-Briggs Type Indicator (MBTI) (myers1998mbti) and the Keirsey Temperament Sorter (KTS) (keirsey1998kts), based on Jung’s theories. However, their psychometric validity has been questioned over the years (boyle1995reliability; abramson2010reliability).
The most popular instruments for measuring the Big Five traits are the NEO-PI (costa1985neo-pi) and the NEO-PI-R (costa2008neo-pi-r). McCrae (mccrae2001culture; mccrae2002culture) has found NEO-PI-R to be reliable even after translating and administering it across 36 countries, also showing that it is possible to use trait mean values to capture systematic differences. Another questionnaire intended to measure the five dimensions of personality is the Big Five Inventory (BFI), used by schmitt2007bfi (schmitt2007bfi) in a large study on 56 nations. The results showed a robust five-factor structure across geographical and cultural regions as well as a high cross-instrument correlation with the NEO-PI-R scales.
Because all the instruments above are proprietary, psychologists have developed and validated the International Personality Item Pool (IPIP) and its follow-up IPIP-NEO (International Personality Item Pool Representation of the NEO PI-R), two alternative and open Big Five inventories freely available to researchers (goldberg2006ipip).
In summary, given the evidence of the validity gained by the Big Five inventories in general and the lack of psychometric reliability of the other instruments, in this work, we focus only on the FFM and the related instruments.
Computational Personality Detection from Text
Self-report inventories are the most popular psychometric instruments to assess personality because of their validity and ease of use. However, in addition to its semantic content, written text is also capable of conveying information about the writer, such as cues to individual personality. Psychologists have been able to identify correlations between specific linguistic markers and personality traits (pennebaker1999linguistic). Computational personality detection (farnadi2016cpd), also referred to as automatic personality recognition (vinciarelli2014apr), is the task of inferring people’s personality from their digital footprints, such as social media content (e.g., videos, pictures, likes), and written text (e.g., conversation transcripts, blog posts, emails). A machine-learning algorithm uses features extracted from the analyzed content as cues to predict personality. The features and how they are associated with the traits vary depending on the type of corpora analyzed; for instance, audio recordings allow the use of acoustic cues, such as voice inflection, which are lacking in textual corpora where instead lexical features, such as the count of positive vs. negative words, part-of-speech tags, and n-grams, can be leveraged as personality markers. Further details on how computational personality detection works are provided later in Section 4.1, where the selected tools are introduced.
To date, there is a limited yet steadily growing amount of work on the automatic detection of personality (kaushal2018emerging). In particular, thanks to the advances in machine learning, recently developed tools (e.g., (liu2017c2w2s4pt; carducci2018twitpers)) leverage deep-learning techniques for processing several cues extracted from large text corpora. Tools can be grouped into top-down and bottom-up solutions (celli2013workshop). Top-down (or closed-vocabulary) solutions rely on external resources (e.g., psycholinguistic databases), and text is processed through such pre-determined dictionaries that specify meaningful word categories associated a priori with personality traits. Instead, bottom-up (or open-vocabulary) solutions (schwartz2013openvocabulary) do not specify in advance the relationship between features and traits; they allow linguistic cues (i.e., meaningful words and phrases) associated with personality traits to emerge from data.
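To make the top-down (closed-vocabulary) idea concrete, the following minimal sketch counts how often tokens fall into predefined word categories. The dictionary is a toy invented for illustration; real psycholinguistic resources define far larger, validated categories, and this is not the implementation of any tool discussed here.

```python
import re
from collections import Counter

# Toy closed vocabulary (an assumption made for illustration only).
CATEGORIES = {
    "positive_emotion": {"happy", "great", "thanks", "good"},
    "negative_emotion": {"bad", "broken", "hate", "fail"},
    "first_person": {"i", "me", "my"},
}


def category_frequencies(text, categories=CATEGORIES):
    """Return the percentage of tokens that fall into each category."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for token in tokens:
        for name, words in categories.items():
            if token in words:
                counts[name] += 1
    total = len(tokens) or 1  # avoid division by zero on empty text
    return {name: 100.0 * counts[name] / total for name in categories}


freqs = category_frequencies("Thanks, my patch is good but the build is broken")
```

A bottom-up approach would not start from such a fixed dictionary; it would instead learn from annotated data which words and phrases are associated with each trait.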
In Sect. 4.1, we review in detail some of the tools available as off-the-shelf solutions to automatically recognize the Big Five personality traits from text.
2.2.1. Personality Datasets
Because automatic personality recognition approaches are inherently data-driven, the availability of experimental datasets plays a crucial role. According to novikov2021survey (novikov2021survey), more than 40% of studies on personality involve the collection of new datasets that remain private. The remaining studies rely on a few shared and reusable datasets. Here we briefly review those based on textual documents.
The essay dataset (pennebaker1999linguistic) consists of nearly 2,500 essays (i.e., unedited pieces of text) written in a controlled setting by students who had also taken the BFI personality test. Originally used to validate the LIWC tool (see Sect. 4.1), it is one of the first corpora utilized in personality prediction research (argamon2005first; mairesse2007pr) and is still very much in use (mehta2020essay; salminen2020essay).
Given the growing amount of digital traces left by users in social networks, it is not surprising that most of the personality datasets collect documents from social media (e.g., (liu2016socialmedia; hall2017socialmedia; ramos2018socialmedia)). The largest dataset used in personality prediction research is myPersonality (stillwell2004mypersonality), which contains data of Facebook users who filled in a personality questionnaire. The anonymized dataset has been freely shared with researchers for non-commercial academic purposes until it was retired in 2018. Another example of personality-annotated datasets of posts from social networks is the PAN-AP-2015 corpus (rangel2015pantask), consisting of Twitter posts in English, Spanish, Italian, and Dutch from users who also took the BFI test.
To the best of our knowledge, there is no personality-annotated dataset of text documents (e.g., emails, issue reports, commit messages) collected from technical domains such as software engineering.
2.3. Personality Detection in Software Engineering
Early in the development of the software engineering field, it was recognized that, in addition to technological factors, researchers also had to consider the humans involved in the development process (weinberg1971psychologyse; shneiderman1980psychologyse). lenberg2015bse (lenberg2015bse) have proposed the term Behavioral Software Engineering to refer to the interdisciplinary study of cognitive, behavioral, and social aspects of software engineering as performed by individuals and groups.
In the following, we review the most relevant previous studies on personality in software engineering. We restrict our review to studies published since 2016—for earlier studies, refer to the SLRs reported in (cruz2015slr; barroso2017slr). Also, given the compelling amount of evidence on its validity, we focus our review only on studies that leveraged the Big Five model.
kosti2016personalityse (kosti2016personalityse) conducted a study using a clustering technique, which identified four archetypal personality profiles characterized by the levels of extraversion and conscientiousness. mellblom2019personalityse (mellblom2019personalityse) analyzed the responses from 47 participants in a survey aimed at revealing the relationship between specific personality traits and burnout in professional software developers. Through regression analysis, they uncovered a strong link between neuroticism and burnout. mendes2021decisionmaking (mendes2021decisionmaking) surveyed 63 Brazilian developers and found that the agreeableness trait is significantly associated with the variation in the decision-making style. smith2016personalityse (smith2016personalityse) analyzed the characteristics of professional developers’ personalities based on their roles. They found managers to be more conscientious and extroverted, and agile developers to be more neurotic and extroverted. akarsu2019personalityse (akarsu2019personalityse) studied the personality traits by administering the BFI to 18 agile teams. They found that high levels of agreeableness and conscientiousness were common in most teams. Moreover, they observed a low level of extraversion in isolated teams that had fewer contacts with customers. vishnubhotla2020personalityse (vishnubhotla2020personalityse) investigated the association between the Big Five traits and the factors related to team climate within eight agile teams. Through regression analysis, they found that openness has a statistically significant positive correlation with support for innovation; also, agreeableness is positively correlated with the overall team climate. Finally, an interesting attempt was made by yilmaz2017personalityse (yilmaz2017personalityse), who developed a psychometric questionnaire, based on the BFI and specifically adapted to the software engineering domain, to explore how practitioners’ personality traits are associated with effective software teams. The results indicated that effective teams are characterized by low neuroticism and high levels of agreeableness, extraversion, and conscientiousness.
All the studies reviewed above rely on questionnaires for psychometric assessment. Yet, several studies also rely on the automatic recognition of developers’ personality traits from text. In the Related Work (Sect. 7), we provide an in-depth review of such studies.
3. Research Framework
In this work, we follow a two-phase research framework, as depicted in Fig. 1.
In Phase 1, we assess the performance of personality detection tools for software engineering. Accordingly, to answer RQ1, we first select four tools built upon the Big Five model. Then, we set the ground truth by collecting the responses to a self-assessment personality questionnaire from 50 Apache Software Foundation (ASF) developers. Contextually, we build a dataset of over 1,500 emails written by the ASF developers. We apply the selected tools to the email dataset to extract the personality scores and compare their predictions against the self-ratings. Finally, we complement the previous analysis by assessing the agreement between the selected tools. As such, we make pairwise comparisons between the personality scores obtained by running the tools on the email corpus.
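The two comparisons in Phase 1 (tool predictions against self-ratings, and tools against each other) can be sketched as follows. Pearson correlation and root-mean-square error are used here as illustrative assumptions; the metrics actually adopted are those described in Sect. 4.4. The tool names and scores are made up.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def rmse(predicted, actual):
    """Root-mean-square error of predictions against reference values."""
    return sqrt(mean((p - a) ** 2 for p, a in zip(predicted, actual)))


# Made-up scores for one trait, one value per developer.
self_ratings = [3.0, 4.0, 2.0, 5.0]
tools = {
    "tool_a": [3.5, 3.5, 3.0, 4.0],
    "tool_b": [2.0, 5.0, 1.0, 4.0],
}

# Tool vs. gold standard: correlation and error for each tool.
vs_gold = {
    name: (pearson(scores, self_ratings), rmse(scores, self_ratings))
    for name, scores in tools.items()
}

# Tool vs. tool: pairwise agreement on the same corpus.
pairwise = {
    (a, b): pearson(tools[a], tools[b])
    for a, b in combinations(sorted(tools), 2)
}
```

In a full analysis, the same computation would be repeated for each of the five traits.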
In Phase 2, we study whether the choice of one personality detection tool over the others affects the validity of results from software engineering studies. Accordingly, to answer RQ2, we select two recent, large-scale studies, respectively by iyer2019github (iyer2019github) and calefato2019apache (calefato2019apache). We choose to replicate these two studies because they both used the same tool (i.e., IBM Personality Insights) to extract developers’ personality profiles from different datasets of technical content: respectively, pull-request discussions obtained from GitHub and emails retrieved from the ASF public archives. Finally, both studies provide a replication package, which we adapt to replicate them using another personality detection tool; the availability of complete replication packages allows us to minimize the risk of errors in executing the replications.
The two-phase framework is inspired by the work of jongeling2017negative (jongeling2017negative), who conducted a study on the use of sentiment analysis tools for software engineering research and arranged their research questions sequentially so that, after observing disagreement among tool predictions, they could explore the effects on conclusion validity in prior work replications after switching tools.
We provide a complete replication package for the work presented here, which allows other researchers to re-execute all the steps in the research workflow. To reinforce the replicability, we provide a script that fully automates the whole experiment pipeline. Further details and instructions are available in Appendix A.
4. Phase 1 – Assessment of Prediction Performance
This section is structured as follows. In Sect. 4.1, we describe the process followed to select the sample of tools for computational personality detection from text; we also provide details about their features, configuration, and output. In Sect. 4.2, we describe how we built a gold standard by administering a self-assessment personality questionnaire to software developers. In Sect. 4.3, we illustrate the process followed to build the dataset of emails written by the same subjects who answered the personality questionnaire. Finally, in Sections 4.4 and 4.5, respectively, we describe the evaluation metrics and report the results to answer RQ1.
|LIWC, MRC||2,479 essays||LIWC dataset||7|
|TwitPersonality||Apache 2.0||BU||SVM||Word embedding||
4.1. Tools Selection
In this section, we review in detail the personality detection tools selected for the analysis of prediction accuracy and agreement.
To build a list of candidate tools, we started from those listed in recent studies (calefato2019apache; mehta2019recenttrends) containing reviews of solutions for automatic personality recognition from text, including both commercial tools and research prototypes. Then, from the resources identified, we sought more candidates using a snowball method. At the end of the search process, we identified 19 candidates (the complete list of the tools identified, along with the exclusion criteria, is available at https://doi.org/10.6084/m9.figshare.15086391). From these, we filtered out the tools that have not been shared or made available as off-the-shelf solutions for testing the model on other datasets. Tools vary largely in terms of the type of prediction task. According to Schwartz et al. (schwartz2013openvocabulary), prediction on a continuous numeric scale (i.e., traits are measured on a given numeric range) is a more appropriate task for studies on automatic personality recognition. Therefore, we also filtered out those tools that treat personality trait prediction as a binary (e.g., yes/no) or multi-class (e.g., low/medium/high) task instead of measuring the outcome on continuous numeric scales. Eventually, we obtained four candidate tools.
To enrich the list of potential candidates, we complemented the previous search by also looking for tools on GitHub. We used four search strings obtained by combining “personality” with “prediction” (258 entries), “detection” (87), “assessment” (47), and “recognition” (38). Then, in addition to the filters applied earlier, we filtered out the repositories that (i) did not have an associated README.md file with installation and execution instructions, (ii) reported using personality models other than the FFM, or (iii) were not supplementary material to referenced research papers (to exclude student projects).
We identified only one candidate repository, which however was already included in the previous list.
Next, we review the four selected tools. An overview is available in Table 2, where we also report the dataset, techniques, and features used to develop the prediction model, as well as the instrument adopted to establish the ground truth.
Finally, while we cannot claim completeness—the systematic review of this research field is outside the scope of this study—this section provides nonetheless a valuable, up-to-date overview of the state of the art in the field of computational personality detection on continuous numeric scales.
4.1.1. LIWC
The Linguistic Inquiry and Word Count (LIWC, pronounced “Luke”) (pennebaker2015liwc) is a commercial text-analysis program that counts words in psychologically meaningful, predetermined categories. It adopts a top-down approach, analyzing the text for occurrences of predetermined linguistic cues associated with personality traits. LIWC is arguably the most well-known resource in this category, often used as an external psycholinguistic database by other tools. pennebaker1999linguistic (pennebaker1999linguistic) used LIWC to count the word categories of 2,479 essays (i.e., unedited pieces of text) written by volunteers who had also taken the BFI test as ground truth. In line with the lexical hypothesis, they found significant associations between the linguistic features of LIWC and the Big Five traits, thus providing evidence of existing connections between language use and personality (tausczik2010psychological).
Output interpretation. When analyzing a piece of text, LIWC returns word-category frequencies. We apply the formulae proposed by yarkoni2010liwc (yarkoni2010liwc), one for each of the Big Five traits, which leverage the quantified connections between personality and word use, and transform such frequencies into numerical trait scores in the range .
Tool setup. We used the standalone version of LIWC, which is also available from the web using the Receptiviti API (https://receptiviti.com). The tool now supports both the 2007 and 2015 versions of the vocabulary; we opted for the former because it was the version used in (yarkoni2010liwc) to derive the formulae. Nonetheless, despite a few differences among the categories defined, in our internal tests, the scores generated with the two dictionaries have very strong Pearson correlations (between .92 and .97).
4.1.2. IBM Personality Insights
IBM Personality Insights (IBM PI) is a commercial tool that uses an unspecified machine-learning model with a bottom-up, open-vocabulary approach. Earlier versions of the service (i.e., before December 2016) instead used the top-down, closed-vocabulary approach and relied on the LIWC dictionary. Models based on the open-vocabulary approach have been found to work well also in the presence of small amounts of text such as tweets (arnoux2017baseline). Also, as per IBM's release note (https://cloud.ibm.com/docs/personality-insights/science.html#researchPrecise), this version of IBM PI reportedly outperformed the previous LIWC-based model. In November 2020, IBM announced (https://cloud.ibm.com/docs/personality-insights?topic=personality-insights-release-notes) that the service had been deprecated and that it would be retired at the end of 2021.
Output interpretation. After analyzing a piece of text, IBM PI returns a JSON response. We parsed the JSON and retrieved five raw scores, one for each of the Big Five traits, defined in , which we rescaled in the range .
Tool setup. Software Development Kits in multiple programming languages are available for the IBM PI service. In particular, we chose the Python API.
4.1.3. Personality Recognizer
Top-down solutions make heavy use of external resources and test the correlations between those resources and personality traits. The seminal work for top-down solutions is Personality Recognizer (http://s3.amazonaws.com/mairesse/research/personality/recognizer.html), a tool developed by mairesse2007pr (mairesse2007pr), who conducted a series of experiments where multiple statistical models were benchmarked. With a supervised learning approach, they developed multiple prediction models using the same annotated dataset of essays employed for the development of LIWC. However, other than using LIWC features, they augmented the models with other dimensions from the Medical Research Council (MRC) psycholinguistic database (coltheart1981mrc).
Output interpretation. Personality Recognizer issues trait predictions on a continuous, seven-point scale. Therefore, to allow for comparison with other tools, we rescaled the output in the range .
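Rescaling tool outputs to a common range is a linear (min-max) transformation. The sketch below is a generic illustration rather than the code from the replication package; the 1-7 source bounds and the 1-5 target range used in the example calls are assumptions.

```python
def rescale(value, src_min, src_max, dst_min, dst_max):
    """Linearly map value from [src_min, src_max] onto [dst_min, dst_max]."""
    if src_max == src_min:
        raise ValueError("source range must have non-zero width")
    ratio = (value - src_min) / (src_max - src_min)
    return dst_min + ratio * (dst_max - dst_min)


# Assumed bounds: a seven-point scale (1..7) mapped onto a 1..5 range.
midpoint = rescale(4, 1, 7, 1, 5)
lowest = rescale(1, 1, 7, 1, 5)
highest = rescale(7, 1, 7, 1, 5)
```

Because the mapping is linear, it preserves the ordering of subjects and leaves Pearson correlations unchanged, which keeps cross-tool comparisons meaningful.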
Tool setup. Personality Recognizer requires the MRC database and the LIWC 2001 dictionary. In our experiment, for consistency, we used the 2007 edition of LIWC. In addition, Personality Recognizer supports different models for computing scores; we opted for the default option, i.e., Support Vector Machine (SVM) with Linear kernel (SMOreg). Finally, the tool can be optimized for the analysis of spoken or written language. Given the nature of our dataset, we chose the second option.
4.1.4. TwitPersonality
Another solution relying on the bottom-up approach is TwitPersonality (https://github.com/D2KLab/twitpersonality). To develop the tool, carducci2018twitpers (carducci2018twitpers) used a supervised learning approach. They first built a word-vector representation of Facebook posts (using the myPersonality dataset (stillwell2004mypersonality)) and then used it to train five SVM models, one for each trait. They also tested the models using a smaller corpus of tweets collected from 24 Twitter users, who also took the BFI test. TwitPersonality is the only tool among those benchmarked in this study whose model can be retrained on new data, albeit with some tinkering.
Output interpretation. TwitPersonality issues out-of-the-box trait predictions on a continuous, five-point scale. Therefore, no further transformations were necessary.
Tool setup. We used the default settings present in the source code. TwitPersonality can be used in two modes: user-wise, i.e., the written ‘documents’ of one author are aggregated and analyzed to extract the trait scores; post-wise, i.e., the trait scores are inferred from the documents individually, and then the average scores are computed. We opted for the former mode, for the sake of consistency with the other tools, albeit our internal tests showed that the differences between the two modes, when present, are negligible.
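The difference between the two modes can be sketched as follows. This is only an illustration: `toy_predict` is a hypothetical stand-in for a trained trait model and has nothing to do with the actual TwitPersonality models; only the aggregation logic is the point.

```python
from statistics import mean


def score_user_wise(documents, predict):
    """Aggregate all documents of one author, then predict a single score."""
    return predict(" ".join(documents))


def score_post_wise(documents, predict):
    """Predict a score for each document, then average the scores."""
    return mean(predict(doc) for doc in documents)


def toy_predict(text):
    # Hypothetical model: share of words longer than five characters,
    # mapped linearly onto a 1..5 scale.
    words = text.split()
    long_share = sum(len(word) > 5 for word in words) / len(words)
    return 1.0 + 4.0 * long_share


docs = ["tiny", "considerably lengthier vocabulary"]
user_wise = score_user_wise(docs, toy_predict)
post_wise = score_post_wise(docs, toy_predict)
```

With a toy model the two modes can diverge noticeably when documents differ in length, because user-wise aggregation weights each word equally while post-wise averaging weights each document equally.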
4.2. Gold Standard
In this section, we describe the process followed to build the gold standard and set the ground truth with a self-assessment questionnaire. We retrieved the publicly available mailing list archives of the projects belonging to the ASF (https://mail-archives.apache.org/mod_mbox) as of Jan. 2018. Our decision to investigate the ASF was motivated by the observation that all the projects within the ecosystem, albeit varying in size, scope, and technology stack, share the same code of conduct (www.apache.org/foundation/policies/conduct.html), which enforces a shared set of guidelines that also regulate written interaction. Accordingly, we focused on analyzing the dev mailing lists because they are intended to host developers’ discussions.
We retrieved the list of all email addresses present in the archives and randomly selected 1,000 among those who had contributed at least 10,000 words in their emails, to ensure they had contributed enough text for the analysis. We manually vetted the list to exclude the presence of emails automatically generated by bots such as the project’s version control system or the mailing-list software. Finally, we sent invitations by email to the selected developers to take a personality test.
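The selection step can be sketched as below, assuming a precomputed mapping from email address to total word count. The function name, the fixed seed, and the example addresses are hypothetical; the thresholds (10,000 words, 1,000 addresses) come from the text.

```python
import random


def select_candidates(word_counts, min_words=10_000, sample_size=1_000, seed=0):
    """Randomly pick up to `sample_size` addresses with at least `min_words` words."""
    # Sort so the draw is reproducible regardless of dictionary order.
    eligible = sorted(addr for addr, count in word_counts.items() if count >= min_words)
    rng = random.Random(seed)  # the seed is an assumption, used for reproducibility
    return rng.sample(eligible, min(sample_size, len(eligible)))


counts = {"a@example.org": 12_000, "b@example.org": 500,
          "c@example.org": 10_000, "d@example.org": 9_999}
picked = select_candidates(counts, sample_size=2)
```

Manual vetting for bot-generated addresses, as described above, would still follow this step.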
To collect the responses, we developed an electronic version of the 20-item Mini-IPIP (donnellan2006miniipip) questionnaire, one of the shortest valid personality instruments available, in an attempt to increase the notoriously low response rate of surveys in the software engineering domain (smith2013sesurveys). The form collected the responses and, following the specifications provided in (donnellan2006miniipip), transformed them into trait scores in the range [1, 5]. In addition to the test questions, we inserted a couple of attention items and also measured the time taken to complete the tests. No monetary incentives were given to the test participants.
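How Likert responses turn into trait scores can be sketched as follows, assuming the usual scheme of averaging a trait's items after flipping reverse-keyed ones. The item key below (`TOY_KEY`) is a made-up four-item example and not the actual Mini-IPIP key, which is defined in the cited specification.

```python
from statistics import mean


def score_traits(responses, key, scale_min=1, scale_max=5):
    """Average the item responses of each trait, flipping reverse-keyed items."""
    scores = {}
    for trait, items in key.items():
        values = []
        for item_no, reverse in items:
            answer = responses[item_no]
            if not scale_min <= answer <= scale_max:
                raise ValueError(f"item {item_no} is outside the response scale")
            # On a 1..5 scale a reverse-keyed answer a becomes 6 - a.
            values.append(scale_min + scale_max - answer if reverse else answer)
        scores[trait] = mean(values)
    return scores


# Hypothetical key: (item number, is_reverse_keyed) pairs per trait.
TOY_KEY = {
    "extraversion": [(1, False), (2, True)],
    "openness": [(3, False), (4, True)],
}
toy_scores = score_traits({1: 4, 2: 2, 3: 5, 4: 1}, TOY_KEY)
```

Because each trait score is a mean of responses on a five-point scale, it stays within the same 1 to 5 range.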
In the 2010s, personal data belonging to millions of Facebook users was collected without their consent by the consulting firm Cambridge Analytica and used during the 2016 US presidential campaigns for psychological targeting, i.e., the extraction of psychological profiles from social-media digital footprints to influence the attitudes, emotions, and behaviors of large groups of people. This scandal increased the public attention to the privacy risks of personal data misuse, creating a still-persistent social stigma attached to personality-related research and concerns deriving from participating in related studies (matz2020privacy). Therefore, given the sensitive nature of the data collected in the study, both the invitation emails and the website contained a detailed description of the goal of the study and its academic-only interest. We assured the developers taking the test that the results would only be presented in aggregate and that no resource that could allow third parties to match the test results to their identities would be shared. For further details on anonymity and data protection measures adopted during data collection, please refer to the replication package in Appendix A.
We received 61 responses (6% response rate), of which 50 were deemed valid. Seven responses were excluded because the respondents failed the attention checks and four because of the short time spent in taking the test (less than two minutes against an average of nearly eight). The survey respondents belong to 34 different ASF projects (the complete list of projects and respondents is available at https://doi.org/10.6084/m9.figshare.15066564), including lucene (4 developers), maven (4), couchdb (3), log4net (3), kafka (2), cassandra (2), and openmeetings (2). From Fig. 2, we observe that the participants tend to be open (mean 4.33, SD .63), exhibit average levels of conscientiousness (3.70, .75) and agreeableness (3.73, .80), and are neither very extroverted (2.73, .84) nor neurotic (2.77, .92).
4.3. Experimental Dataset
We matched the self-assessed personality profiles with a corpus of all the emails written by the developers who took the test. As a result, we were able to run the selected tools to infer the personality scores from the email corpus and compare the predictions against the self-ratings from the questionnaire (gold standard).
To build the corpus, we aggregated all the email bodies sent by each subject and applied a series of filters to clean the data. In particular, we first used the email-reply-parser library (https://pypi.org/project/email-reply-parser) to ensure that only the text written by the email author was retained while discarding the signature and any text coming from replying and forwarding. Then, we removed the lines of code using the R package NLoN, developed by mantyla2018nlon (mantyla2018nlon). NLoN was trained and tested on various email corpora, including one from Apache developers, with good results. In addition, we used polyglot (https://pypi.org/project/polyglot) to remove any non-English words. Finally, we lower-cased the text and removed all the stopwords using NLTK (www.nltk.org).
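The shape of this cleaning pipeline can be sketched with the standard library only. The sketch is a simplified stand-in, not the pipeline used in the study: the quote and signature heuristics replace email-reply-parser, `TOY_STOPWORDS` is a tiny hypothetical list replacing the NLTK stopwords, and the code-removal (NLoN) and language-filtering (polyglot) steps are omitted.

```python
import re

# Hypothetical miniature stopword list; a real run would use a full list.
TOY_STOPWORDS = {"the", "a", "an", "is", "to", "and", "of", "i"}


def strip_quotes_and_signature(body):
    """Keep only lines that look like they were written by the email author."""
    kept = []
    for line in body.splitlines():
        if line.rstrip() == "--":  # conventional signature delimiter
            break
        if line.lstrip().startswith(">"):  # quoted reply text
            continue
        if re.match(r"^On .+ wrote:$", line.strip()):  # reply header
            continue
        kept.append(line)
    return "\n".join(kept)


def clean(body, stopwords=TOY_STOPWORDS):
    """Lower-case the author's text, tokenize it, and drop stopwords."""
    text = strip_quotes_and_signature(body).lower()
    tokens = re.findall(r"[a-z']+", text)
    return [token for token in tokens if token not in stopwords]


email_body = ("I think the patch is fine.\n\n"
              "On Mon, Bob wrote:\n> please review\n-- \nAlice\nApache committer")
tokens = clean(email_body)
```

Per-author corpora would then be built by concatenating the cleaned tokens of all emails sent by the same subject.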
Overall, the 50 subjects contributed 1,543 emails, with an average of 30.86 per developer (min 15, max 55, median 33, SD 8.45). After collating the email bodies and performing the data cleaning, we found that each developer has contributed on average 1,111.04 words (min 747, max 1,764, median 1,098, SD 180.14). We notice that these dimensions are in line with those of other datasets used in training and testing the selected tools. For example, IBM PI recommends providing a minimum of between 600 and 1,200 words to enable the analysis (https://cloud.ibm.com/docs/personality-insights?topic=personality-insights-input), albeit it is not disclosed what kind of preprocessing is applied. The dataset used by LIWC and Personality Recognizer contains 2,479 essays with an average length of 652 words. TwitPersonality employed a dataset of tweets, most of which are typically a hundred characters long (https://blog.twitter.com/official/en_us/topics/product/2017/tweetingmadeeasier.html).
4.4. Evaluation Metrics
Research on the Big Five model has consistently considered personality detection as a set of five separate trait-prediction tasks. However, the formulation of the prediction tasks can differ drastically (oberlander2006tasks). They can be approached as binary classification tasks, using the mean or median as thresholds to discretize the numerical scores. Alternatively, the binary classification can be limited to the upper and lower groups after removing the observations in the middle, albeit this is less than ideal with bell-shaped distributions.
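The two discretization schemes can be sketched as follows. The function names and the explicit cut-off parameters are illustrative; prior work differs in how the middle group is defined, so the cut-offs are left to the caller.

```python
from statistics import median


def median_split(scores):
    """Label each score 1 if above the median, else 0."""
    cut = median(scores)
    return [1 if score > cut else 0 for score in scores]


def extreme_groups(scores, low_cut, high_cut):
    """Label low scores 0 and high scores 1; drop the middle (None)."""
    labels = []
    for score in scores:
        if score <= low_cut:
            labels.append(0)
        elif score >= high_cut:
            labels.append(1)
        else:
            labels.append(None)  # observation removed from the task
    return labels


trait_scores = [2.0, 3.0, 4.0, 5.0, 1.0]
split_labels = median_split(trait_scores)
extreme_labels = extreme_groups(trait_scores, low_cut=2.0, high_cut=4.0)
```

With bell-shaped score distributions, most observations sit near the median, so the second scheme discards a large share of the data.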
The choice of the metrics for evaluating performance also varies. In the case of personality predictions on a continuous scale, following (schwartz2013openvocabulary), in our analysis, we included the following performance metrics: Pearson product-moment correlation ($r$) and Spearman rank correlation ($\rho$), measured between the predicted trait scores (by each tool) and the actual scores (from the gold standard, self-assessment questionnaire); Mean Absolute Error (MAE), the average of the absolute value of the difference between the actual and predicted scores, $MAE_t = \frac{1}{n}\sum_{i=1}^{n} |y_{t,i} - \hat{y}_{t,i}|$; and Root Mean Squared Error (RMSE), the standard deviation of the residuals, i.e., the prediction errors, $RMSE_t = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_{t,i} - \hat{y}_{t,i})^2}$. In both formulas, $t$ is the trait, $n$ is the number of subjects, and $y_{t,i}$ and $\hat{y}_{t,i}$ are the ground truth and predicted scores for subject $i$.
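The four metrics can be computed with a short, dependency-free sketch. It follows the textbook definitions rather than any specific library, and returning `None` for a constant series is a design choice made here to flag that the correlation is undefined when one of the variables has zero variance.

```python
import math


def pearson(x, y):
    """Pearson correlation; None if either series has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # undefined, e.g. when a tool issues a constant prediction
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            result[order[position]] = average_rank
        start = end + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))
```

Note that the correlations are insensitive to a constant offset or scaling of the predictions, whereas MAE and RMSE depend directly on the scale of the scores.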
[Table 3: Pearson and Spearman correlation coefficients reported in prior work. Columns: Study (best results), Approach, Technique, Subjects, Dataset, Validation (ground truth), and the Pearson or Spearman coefficient. TD stands for Top-Down (closed vocabulary), BU for Bottom-Up (open vocabulary).]
Pearson correlation evaluates the linear relationship between two continuous variables. Coefficients can range from -1 (perfect negative) to +1 (perfect positive), with values close to 0 indicating the lack of correlation. Assessing the predictive performance in automatic personality detection means estimating the convergent validity, i.e., the degree to which the measures of the same construct correlate with each other. One problem with using correlations is how to interpret the results. A common interpretation of the observed correlation magnitude is: negligible, weak, moderate, strong, very strong (schober2018corr-cutoffs). However, as observed by the authors, these cutoff points are arbitrary and should be used judiciously. In particular, values in the middle are disputable and their interpretation as weak, moderate, or strong varies with the applied rule of thumb. Achieving correlations of in psychology studies is challenging: even the simple axiom according to which people’s past behavior is predictive of future actions has been found to produce a correlation coefficient of (meyer2001corr). Rather than relying on the conventional cut-off points used for interpreting correlation coefficients in other fields, meyer2001corr (meyer2001corr) and roberts2007corr (roberts2007corr) have argued that research investigating psychological constructs should use baselines in the order of magnitude of correlations independently measured in related work. In other words, they have called for adjusting the norms that researchers hold for what the strength of relationships is in psychology and related fields. For example, IBM PI reportedly achieved for English an average Pearson correlation coefficient in an internal assessment study. Some studies on psychological and behavioral constructs (e.g., (roberts2007corr)) have reported Pearson correlations with a small to medium magnitude in the range .
In a survey of over 200 papers on personality published since 2017, novikov2021survey (novikov2021survey) found that the reported Pearson correlation coefficients between predicted and self-reported personality traits are upper limited by values near . One exception is represented by the work of lynn2020baseline (lynn2020baseline), which reports a score exceeding 0.60 in the case of openness. Table 3 lists some of the Pearson correlation coefficients reported in recent prior work. We will use these values as baselines to assess the performance of the personality prediction tools involved in our study. We notice that most of the studies that reported Pearson correlation metrics adopted a top-down approach, using regression analyses for analyzing data from the myPersonality dataset.
The Spearman correlation coefficient also varies between -1 and +1. Unlike Pearson $r$, however, Spearman $\rho$ is a non-parametric measure that does not make any assumption regarding the normality, linearity, and homoscedasticity of distributions. The Spearman coefficient is based on the ranked values for each variable rather than the raw data. fowler1987spearman (fowler1987spearman) found Spearman rank correlations to be more robust than Pearson $r$ and to outperform it in cases of non-normal distributions. Only a few studies (e.g., (hall2017socialmedia)) have assessed prediction performance using Spearman $\rho$, which appears to be used more frequently with multimedia datasets. A complete analysis of correlation coefficients reported in prior work is out of the scope of this work. For more, please refer to the meta-analyses performed by azucar2018meta (azucar2018meta) and marengo2020meta (marengo2020meta).
The MAE and RMSE measures are also very popular. novikov2021survey (novikov2021survey) found that 65 out of the 218 studies analyzed report personality prediction performance using either measure. Nonetheless, these metrics are not without problems. Their interpretation is largely dependent on the scale of the data and their estimates tend to be over-optimistic since self-ratings in gold standards tend to be normally distributed, with most observations close to the mean (sumner2012rmse-mae). Given their popularity, we include these metrics for the sake of comparison with prior work. Table 4 provides an overview of the best MAE and RMSE results reported in recent and relevant studies, thus providing us with a performance baseline. We notice that the top-down and bottom-up approaches are almost equally distributed and that these studies mostly relied on regression analyses. However, while the studies reporting RMSE typically used myPersonality as the data source and ground truth, the others that relied on MAE built private datasets, using the IPIP or BFI questionnaires for validation.
[Table 4: Best MAE and RMSE results reported in prior work. Columns: Study (best results), Approach, Technique, Subjects, Dataset, Validation (ground truth), and the MAE or RMSE value. TD stands for Top-Down (closed vocabulary), BU for Bottom-Up (open vocabulary).]
Compared with the self-ratings in the gold standard, we notice that the tool predictions are far less spread out. This can also be observed from the small standard deviations reported along with other descriptive statistics in Appendix B (see Table 23). In particular, we observe that TwitPersonality issues predictions for each score that are clustered around the mean, and even constant in the case of neuroticism. The Q-Q plots in Figures 5-9 (see Appendix B) show that, while linear relationships exist in the distributions of the self-ratings as well as the tool predictions, they are not normal, thus violating one of the assumptions to apply Pearson correlation.
We assessed the level of agreement by comparing the personality scores predicted by the tools, respectively, against the self-reported ratings and between each other. The Pearson and Spearman correlation coefficients are reported in Table 5. We point out that the correlation coefficients could not be calculated for neuroticism in the case of TwitPersonality because the tool issues a constant value for all the subjects.
Overall, we notice that no pair does consistently better than any other in terms of either correlation coefficient. Regarding Pearson correlation, the highest coefficient is computed between LIWC and TwitPersonality for the extraversion trait ($r = 0.343$). However, most coefficients are approximately equal to .10 or smaller, and several are negative. Consistently, we observe that the largest coefficients between all pairs and across all traits in Table 5 () are smaller than those found in prior work and reported earlier in Table 3 (). In terms of Spearman correlation, the same observations still hold, with coefficients in the range , therefore smaller than those reported in prior work ().
Concerning the correlations between the gold standard and the tools, we cannot identify any pattern. Instead, regarding the correlations between tools, we notice that most of the largest $r$ and $\rho$ coefficients are obtained for pairs including LIWC. In the case of LIWC and Personality Recognizer, this result is not surprising since the latter uses the LIWC dictionary to build the prediction model and, therefore, the two tools share some linguistic features.
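|Pair||Pearson O||Pearson C||Pearson E||Pearson A||Pearson N||Spearman O||Spearman C||Spearman E||Spearman A||Spearman N|
|GS – LIWC||-0.063||-0.057||-0.325||-0.151||-0.090||0.030||-0.056||-0.151||0.134||-0.048|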
|GS – IBM PI||-0.050||0.016||-0.156||-0.001||-0.073||-0.048||0.140||-0.200||0.009||-0.107|
|GS – Pers.Rec.||-0.036||-0.028||0.063||-0.120||0.142||0.033||-0.125||0.034||0.041||0.083|
|GS – TwitPers.||0.016||-0.148||-0.114||0.150||-||-0.011||-0.148||-0.137||0.131||-|
|LIWC – IBM PI||-0.169||0.046||0.112||-0.144||0.173||-0.134||0.003||0.213||-0.148||0.132|
|LIWC – Pers.Rec.||0.212||-0.185||-0.025||0.310||-0.175||0.102||-0.073||0.035||0.100||-0.035|
|LIWC – TwitPers.||0.055||-0.006||0.343||-0.023||-||0.052||-0.037||0.217||-0.100||-|
|IBM PI – Pers.Rec.||-0.037||-0.072||0.191||-0.064||-0.003||-0.037||-0.099||0.159||-0.092||-0.015|
|IBM PI – TwitPers.||-0.195||-0.224||-0.050||-0.104||-||-0.183||-0.232||0.000||-0.198||-|
|Pers.Rec. – TwitPers.||-0.073||-0.043||-0.175||0.090||-||-0.077||-0.027||-0.082||0.099||-|
To complete the performance assessment, we complemented the agreement analysis with an analysis of prediction accuracy. Table 6 reports the MAE and RMSE metrics for each tool. First, we observe that no single tool outperforms all the others. LIWC does better for the extraversion and agreeableness traits; TwitPersonality performs better at estimating conscientiousness and neuroticism. However, a known limitation of MAE and RMSE is that they fail to capture the performance accurately when distributions contain observations that are mostly clustered around the mean, as in the case of TwitPersonality. Nevertheless, the best results in this study (MAE , RMSE ) are considerably larger (worse) than the values found in prior work and reported earlier in Table 4 (MAE , RMSE ).
4.6. Threats to Validity
The results of the analyses should be interpreted in light of the following limitations.
First, while using individual self-ratings as gold standards to set ground truth is the norm, psychology research considers the definition of a ‘true’ personality profile out of reach (wright2014ph1limitations). Indeed, personality is an elusive concept whose assessment makes it a complex activity for any rater, whether self, external observer, or computer. Any form of personality rating is a proxy measure and, as such, it comes with its limitations. Despite their validity, self-assessment questionnaires are subjective and biased towards social desirability (boyle2009ph1limitations); external judgments are limited by raters’ idiosyncrasies and machine learning algorithms, while free of human prejudice, reflect biases present in the data (tay2020validity). Also, albeit highly correlated, there are differences between personality constructs based on self-ratings and external observers’ judgments (mount1994ph1limitations).
In line with recent meta-reviews of personality prediction model performance (azucar2018meta; marengo2020meta), we used correlation coefficients (both Pearson and Spearman) as reference metrics. We compared studies that were trained and tested on different datasets (e.g., essays, social media posts). The heterogeneity of the data sources might have influenced the comparability of the results. However, this was intentional to assess how personality detection tools can be used across domains as off-the-shelf solutions, and specifically for software engineering research.
Another potential issue is related to the use of English as lingua franca in emails, i.e., some developers did not communicate using their native language. A limited vocabulary may have arguably prevented some lexical cues from emerging from the text, as argued in the lexical hypothesis.
Finally, we acknowledge the relatively low number of subjects (50 developers) and documents (1,543 emails) in our experimental sample. However, it is not uncommon to find previous studies on personality detection using samples of a similar size. For instance, carducci2018twitpers (carducci2018twitpers) tested their personality detection tool using a corpus of Twitter posts from 24 volunteers. Similarly, arnoux2017baseline (arnoux2017baseline) report the Big Five traits extracted from the social media posts of 55 volunteers.
Phase 1 – Summary. When off-the-shelf personality detection tools are applied out of domain, their performance is far from acceptable. Indeed, the tools neither agree with self-reported ratings nor with each other when used for software engineering research. In addition, their prediction accuracy is considerably worse as compared to the results from prior work on automatic personality detection from text.
5. Phase 2 – Implications on Earlier Studies
The results of Phase 1 suggest that, due to the limited level of agreement and accuracy of tool predictions, the choice of a personality detection tool might affect the validity of previous results. However, these disagreements do not necessarily imply that conclusions based on the application of these tools in the software engineering domain are affected by the choice of one specific tool over the others.
Accordingly, to answer RQ2, in this section we investigate whether the choice of a specific personality tool introduces threats to conclusion validity by replicating two previous studies in software engineering, which relied on computational personality detection. Because our goal is to assess whether the effects reported in previous studies still hold when a different personality detection tool is used, here we perform two exact dependent replications (shull2008replications) in which we keep the same experimental setup of the original studies while changing only the tool.
5.1. Replicated Studies
We choose to replicate two previous studies, respectively by iyer2019github (iyer2019github) and calefato2019apache (calefato2019apache), which have analyzed the personalities of open-source software developers. Both studies provide a replication package. Therefore, we can apply the chosen methodology and perform an exact replication of the studies, with the only deviation from the original work being the different personality detection tool used. Since both original studies used IBM PI, here we choose to replace it with LIWC because it is the most used psychometric tool, adopted also in previous studies on personality in software engineering (e.g., (rigby2007liwc-asf; bazelli2013liwc-so; rastogi2016liwc-gh)).
5.1.1. iyer2019github (iyer2019github)
The first study that we replicate is by iyer2019github (iyer2019github). The authors applied IBM PI to examine the influence of personality traits of developers on the pull request evaluation process in GitHub. They first extracted the Big Five personality traits of 16,935 developers from trace data on GitHub, such as commit messages, issue and pull request comments. Then, they assessed their relative importance in the pull request evaluation process as compared to other non-personality factors from past research. Overall, they evaluated 501,327 pull requests from 1,860 projects and found that the effect of personality traits is significant and comparable to technical factors (e.g., number of files changed, presence of tests), albeit social factors (e.g., prior interaction, following each other) are more influential on the likelihood of pull request acceptance. In particular, they found that pull requests authored by developers (requesters) who are more open and conscientious, but less extroverted, have a higher chance of acceptance. Furthermore, pull requests that are closed by developers (closers) who are more conscientious, extroverted, and neurotic, have a higher likelihood of acceptance. Additionally, the larger the difference in personality traits between the requester and the closer, the more positive effect it has on pull request approval.
For the re-execution, we adapted the original scripts provided in the replication package. However, since the package did not include the collection of trace data used for the analysis, we followed the description of the data collection process in the paper to recreate the dataset ourselves. This led to minor differences in the number of projects found (1,853) as compared to those reported in the original study (1,860). Finally, we applied LIWC to the reconstructed dataset to infer the developers’ personality scores.
5.1.2. calefato2019apache (calefato2019apache)
The second study that we replicate is by calefato2019apache (calefato2019apache). The authors applied IBM PI to perform an analysis at the ecosystem-level of code commits and email messages contributed by 211 developers working on ASF projects. They found that there are three common types of personality profiles among Apache developers, characterized in particular by their level of agreeableness and neuroticism. They also found that developers with higher levels of openness are more likely to become contributors to ASF projects. In addition, they confirmed that developers’ personality is stable over time and that the five traits do not vary significantly with their role, membership, and extent of contribution to the projects.
For the replication, we adapt the same scripts and use the same experimental dataset from the original study, to which we apply LIWC.
5.2. Replication Results
To distinguish from those of this work, the research questions of the original studies are formatted in italic and lower-cased (e.g., rq1, rq2, …).
5.2.1. Replication of iyer2019github (iyer2019github)
In this section, we answer the four research questions rq0-3 of the original study by iyer2019github (iyer2019github) and recreate the same tables for comparison.
rq0—Replication of base model with the reconstructed dataset.
Given the slight differences between the original and reconstructed datasets, we replicate iyer2019github’s findings of the baseline model, which includes only technical and social factors, on a more recently mined dataset to determine if the results still hold. Creating a baseline model is useful in comparing other personality models with social, technical, and personality factors. In addition, GitHub has tremendous yearly growth rates, and over 60M new repositories have been created since the original study was carried out (https://octoverse.github.com, accessed in March 2021). As such, the replication of the baseline model provides insights on the generalizability of iyer2019github’s results to a dataset extracted at a different point in time.
We use the same modeling technique as in the original study, a mixed-effects logistic regression, on the reconstructed dataset to assess the effects of social and technical factors on pull request acceptance. Table 7 provides a comparison between our results and the original ones by iyer2019github in terms of odds ratios. All factors have similar overall influences and even the model fit, both marginal () and conditional (), is similar. The only exception is the project age factor, for which we found a significant effect. We speculate that this is due to the extraction of projects that have been active for a longer time in the reconstructed dataset.
Overall, although there are small fluctuations in the odds ratio of all the features, the original results still hold. This increases the confidence in the goodness of the reconstructed dataset and that any statistically significant, different results in the replication of the other research questions are due to the different personality detection tool used rather than differences in the data.
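As a reading aid for the odds ratios in the following tables, the sketch below shows how a logistic-regression coefficient relates to an odds ratio and what an odds ratio does to a baseline acceptance probability. The function names and the baseline probability are illustrative assumptions, and the random effects of the mixed-effects model are ignored.

```python
import math


def odds_ratio(coefficient):
    """Odds ratio of a logistic-regression coefficient (log-odds scale)."""
    return math.exp(coefficient)


def shifted_probability(baseline_probability, ratio):
    """Probability obtained by multiplying the baseline odds by `ratio`."""
    baseline_odds = baseline_probability / (1 - baseline_probability)
    new_odds = baseline_odds * ratio
    return new_odds / (1 + new_odds)


# Hypothetical example: a baseline acceptance probability of 0.5 and an
# odds ratio of 2.0 for a one-unit increase of some factor.
raised = shifted_probability(0.5, 2.0)
lowered = shifted_probability(0.5, 0.5)
```

An odds ratio above 1 therefore indicates a factor associated with higher odds of acceptance, a value below 1 indicates lower odds, and a value of 1 indicates no association.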
|Factor||Original study (odds ratio)||Replication (odds ratio)|
|test_file||1.08 ***||1.07 ***|
|total_churn||0.90 ***||0.90 ***|
|social_distance||2.35 ***||2.37 ***|
|num_comments||0.68 ***||0.68 ***|
|prior_interaction||1.53 ***||1.45 ***|
|main_team_member||1.16 ***||1.19 ***|
|stars_current||0.53 ***||0.53 ***|
|test_file x num_comments||1.12 ***||1.13 ***|
|total_churn x num_comments||1.06 ***||1.07 ***|
|social_distance x num_comments||0.92 ***||0.93 ***|
|num_comments x prior_inter||1.05 ***||1.05 ***|
rq1—Does the personality of a requester affect the likelihood of the pull request being accepted?
The author of a pull request (a requester hereinafter) can be either a member of a project’s core development team or an outside contributor. iyer2019github examined the personality of requesters to understand whether specific traits lead to a higher likelihood of pull request acceptance. As in the original study, we replicate rq1 using a mixed-effects logistic regression to model the personality traits of the requester from the new dataset, along with the features already used in rq0.
Table 8 shows the comparison of the odds ratios from our replication against those from the original study. Unlike iyer2019github’s results, in our replication agreeableness and neuroticism have a significant effect, respectively positive (1.39) and negative (0.81). Regarding openness, we find that the trait has a positive and significant influence in both studies, albeit the effect is smaller in iyer2019github (1.34 vs. 1.07). Instead, we find a significant yet opposite effect for both conscientiousness (0.77 vs. 1.05) and extraversion (1.31 vs. 0.94).
|Factor||Original study: odds ratio (single run)||Original study: 95% CI (bootstrap)||Replication: odds ratio (single run)||Replication: 95% CI (bootstrap)|
|(Intercept)||2.87 ***||-||2.72 ***||-|
|test_file||1.08 ***||[1.05, 1.11]||1.06 ***||[1.02, 1.10]|
|total_churn||0.90 ***||[0.89, 0.90]||0.90 ***||[0.88, 0.91]|
|social_distance||2.35 ***||[2.59, 2.80]||2.38 ***||[2.63, 2.99]|
|num_comments||0.68 ***||[0.65, 0.69]||0.69 ***||[0.67, 0.69]|
|prior_interaction||1.52 ***||[1.51, 1.57]||1.45 ***||[1.43, 1.50]|
|followers_current||0.99||[0.95, 0.98]||0.98||[0.94, 0.99]|
|main_team_member||1.15 ***||[1.12, 1.20]||1.19 ***||[1.15, 1.22]|
|age_current||0.91||[0.83, 0.93]||1.01||[0.98, 1.05]|
|team_size||0.99||[0.85, 0.98]||0.92||[0.84, 0.97]|
|stars_current||0.54 ***||[0.45, 0.49]||0.54 ***||[0.46, 0.51]|
|openness||1.07 ***||[1.05, 1.08]||1.34 ***||[1.31, 1.57]|
|conscientiousness||1.05 ***||[1.03, 1.07]||0.77 ***||[0.68, 0.79]|
|extraversion||0.94 ***||[0.93, 0.95]||1.31 ***||[1.29, 1.46]|
|agreeableness||1.01||[1.00, 1.02]||1.39 ***||[1.45, 1.60]|
|neuroticism||0.97||[0.94, 0.98]||0.81 ***||[0.71, 0.84]|
|test_file x num_comments||1.13 ***||[1.11, 1.15]||1.12 ***||[1.10, 1.17]|
|total_churn x num_comments||1.06 ***||[1.06, 1.08]||1.07 ***||[1.06, 1.09]|
|social_connection x num_comments||0.92 ***||[0.90, 0.96]||0.93 ***||[0.90, 0.97]|
|num_comments x prior_interaction||1.05 ***||[1.06, 1.08]||1.05 ***||[1.05, 1.08]|
rq2—Does the personality of a closer affect the likelihood of the pull request being accepted?
Pull requests are closed by developers (closers) who are always part of the core team. By analyzing the closers’ personalities, iyer2019github aimed to understand whether specific traits affect the likelihood of the pull request getting accepted. As in the original study, we replicate rq2 by using the closers’ personality traits instead of the requesters’ and model them along with the factors used in rq0.
The results are reported in Table 9. Openness has a significant and positive effect on the pull request acceptance in both the replication (1.18) and the original study (1.05). We also observe a significant result for conscientiousness in both studies, albeit with an opposite direction (0.92 vs. 1.12). Instead, we find completely contrasting findings regarding extraversion, agreeableness, and neuroticism.
|Factor||Original study: odds ratio (single run)||Original study: 95% CI (bootstrap)||Replication: odds ratio (single run)||Replication: 95% CI (bootstrap)|
|(Intercept)||2.87 ***||-||2.45 ***||-|
|test_file||1.08 ***||[1.05, 1.11]||1.06 ***||[1.03, 1.10]|
|total_churn||0.90 ***||[0.89, 0.91]||0.90 ***||[0.88, 0.90]|
|social_distance||2.35 ***||[2.35, 2.83]||2.41 ***||[2.66, 3.03]|
|num_comments||0.68 ***||[0.65, 0.68]||0.69 ***||[0.67, 0.69]|
|prior_interaction||1.49 ***||[1.52, 1.56]||1.45 ***||[1.43, 1.50]|
|followers_current||0.98||[0.94, 0.99]||0.98||[0.94, 0.99]|
|main_team_member||1.16 ***||[1.12, 1.21]||1.19 ***||[1.14, 1.22]|
|age_current||0.92 ***||[0.86, 0.95]||1.01||[0.98, 1.05]|
|team_size||0.97||[0.88, 1.00]||0.97||[0.88, 1.00]|
|stars_current||0.54 ***||[0.44, 0.51]||0.54 ***||[0.46, 0.50]|
|openness||1.05 *||[1.02, 1.10]||1.18 ***||[1.15, 1.27]|
|conscientiousness||1.12 ***||[1.11, 1.18]||0.92 ***||[0.86, 0.95]|
|extraversion||1.06 ***||[1.06, 1.13]||1.03||[0.99, 1.05]|
|agreeableness||1.01||[0.99, 1.04]||1.13 ***||[1.13, 1.21]|
|neuroticism||1.08 ***||[1.06, 1.14]||1.00||[0.94, 1.06]|
|test_file x num_comments||1.12 ***||[1.08, 1.16]||1.13 ***||[1.10, 1.17]|
|total_churn x num_comments||1.06 ***||[1.05, 1.08]||1.07 ***||[1.06, 1.09]|
|social_connection x num_comments||0.92 ***||[0.88, 0.95]||0.93 ***||[0.90, 0.97]|
|num_comments x prior_interaction||1.05 ***||[1.05, 1.08]||1.05 ***||[1.05, 1.07]|
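To make the odds ratios in Table 9 easier to read, the following minimal Python sketch converts an odds ratio into the acceptance probability it implies for a given starting point. The function name and the baseline probability of 0.70 are illustrative assumptions and do not come from either study; only the two odds ratios (1.18 and 0.92, the replication values for openness and conscientiousness) are taken from the table.

```python
def shift_probability(baseline_prob: float, odds_ratio: float) -> float:
    """Return the probability obtained after multiplying the baseline odds by odds_ratio."""
    if not 0.0 < baseline_prob < 1.0:
        raise ValueError("baseline_prob must lie strictly between 0 and 1")
    baseline_odds = baseline_prob / (1.0 - baseline_prob)
    new_odds = baseline_odds * odds_ratio
    return new_odds / (1.0 + new_odds)


# Hypothetical baseline acceptance probability (an assumption, not a reported value).
baseline = 0.70

# Odds ratios for a one-unit increase in the predictor, replication columns of Table 9.
for trait, odds_ratio in [("openness", 1.18), ("conscientiousness", 0.92)]:
    print(trait, round(shift_probability(baseline, odds_ratio), 3))
```

An odds ratio above 1 moves the probability above the baseline and an odds ratio below 1 moves it below, which is why the opposite signs observed for conscientiousness in the two studies correspond to opposite practical conclusions.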
rq3—Does the difference in personality between the requester and the closer affect the likelihood of the pull request being accepted?
Finally, iyer2019github analyzed the differences in the personality traits between the requester and the closer to understand whether they hinder or facilitate pull request acceptance. As in the original study, we replicate rq3 by adding to the model the absolute differences between the personality traits of requesters and closers, along with the other socio-technical features used in rq0.
The results are reported in Table 10. Regarding openness, the effect of the difference between the requester and the closer is positive and significant (1.07) only in the replication. We observe consistent results regarding the difference in the levels of conscientiousness and extraversion, which have a positive effect in both studies. Instead, we observe contrasting results for the difference in the levels of agreeableness and neuroticism, whose effect is negative in the replication (respectively, 0.94 and 0.93) and positive in the original study (1.02 and 1.22).
|Predictor||Original study: odds ratio (single run)||Original study: 95% CI (bootstrap)||Replication: odds ratio (single run)||Replication: 95% CI (bootstrap)|
|(Intercept)||3.34 ***||-||2.68 ***||-|
|test_file||1.09 ***||[1.05, 1.12]||1.09 ***||[1.05, 1.12]|
|total_churn||0.92 ***||[0.90, 0.93]||0.90 ***||[0.89, 0.91]|
|social_distance||1.81 ***||[1.86, 2.03]||2.34 ***||[2.55, 2.91]|
|num_comments||0.66 ***||[0.65, 0.67]||0.68 ***||[0.67, 0.68]|
|prior_interaction||1.66 ***||[1.63, 1.69]||1.48 ***||[1.46, 1.53]|
|followers_current||1.07 ***||[1.06, 1.11]||1.07 ***||[1.06, 1.11]|
|main_team_member||1.27 ***||[1.23, 1.31]||1.22 ***||[1.17, 1.25]|
|age_current||0.92 ***||[0.87, 0.93]||1.01||[0.95, 1.05]|
|team_size||0.96||[0.89, 1.00]||0.94||[0.87, 0.99]|
|stars_current||0.55 ***||[0.44, 0.50]||0.53 ***||[0.46, 0.49]|
|diff_openness_abs||1.01||[1.01, 1.04]||1.07 ***||[1.06, 1.14]|
|diff_conscientiousness_abs||1.29 ***||[1.29, 1.35]||1.25 ***||[1.28, 1.34]|
|diff_extraversion_abs||1.12 ***||[1.11, 1.16]||1.31 ***||[1.32, 1.44]|
|diff_agreeableness_abs||1.02 **||[1.00, 1.04]||0.94 ***||[0.89, 0.95]|
|diff_neuroticism_abs||1.22 ***||[1.21, 1.27]||0.93 ***||[0.88, 0.94]|
|test_file x num_comments||1.11 ***||[1.09, 1.15]||1.13 ***||[1.10, 1.17]|
|total_churn x num_comments||1.06 ***||[1.05, 1.07]||1.07 ***||[1.06, 1.09]|
|social_connection x num_comments||0.93 ***||[0.89, 0.97]||0.93 ***||[0.90, 0.98]|
|num_comments x prior_interaction||1.05 ***||[1.05, 1.08]||1.05 ***||[1.05, 1.07]|
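The difference predictors in Table 10 can be illustrated with a short Python sketch. The feature names follow the row labels of the table (e.g., diff_openness_abs); the function name and the trait scores are hypothetical and serve only to show how an absolute difference per trait would be derived for one pull request.

```python
TRAITS = ("openness", "conscientiousness", "extraversion",
          "agreeableness", "neuroticism")


def abs_trait_differences(requester: dict, closer: dict) -> dict:
    """Build one diff_<trait>_abs feature per Big Five trait for a requester-closer pair."""
    return {f"diff_{trait}_abs": abs(requester[trait] - closer[trait])
            for trait in TRAITS}


# Hypothetical trait scores for one pull request (not data from the studies).
requester = {"openness": 0.75, "conscientiousness": 0.50, "extraversion": 0.25,
             "agreeableness": 0.50, "neuroticism": 0.25}
closer = {"openness": 0.25, "conscientiousness": 0.50, "extraversion": 0.75,
          "agreeableness": 0.25, "neuroticism": 0.50}

print(abs_trait_differences(requester, closer))
```

Because the difference is taken in absolute value, the model cannot tell whether the requester or the closer scores higher on a trait; it only captures how far apart the two are.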
5.2.2. Replication of calefato2019apache (calefato2019apache)
In this section, we answer the six research questions rq1-6 presented in the original study by calefato2019apache (calefato2019apache). In the replication, we recreate the related figures and tables for comparison.
calefato2019apache conducted a preliminary analysis to rule out changes in personality over time. For each of the developers in the dataset, they computed monthly-based personality scores, then split the set by date into two subsets of approximately the same size. For each trait, they averaged the scores in each subset, thus obtaining two observations for each developer (i.e., early vs. later). Finally, for each trait, they performed a Wilcoxon Signed-Rank test of the null hypothesis that the median difference between the paired observations (i.e., for each developer) is zero. We replicate the same tests after replacing the original dataset, containing personality scores obtained from IBM PI, with the new dataset, containing the scores obtained using LIWC. Table 11 reports the results from the five paired tests in both the original study and the replication. The results show no significant differences between the distributions (all adjusted p-values > 0.05 after Bonferroni correction for multiple tests), thus confirming the stability of personality traits over time with both personality tools.
|V||p-value||95% CI||V||p-value||95% CI|
|Openness||6,109||0.589||[-0.002, -0.003]||9,330||1.000||[-0.006, 0.029]|
|Conscientiousness||5,575||0.661||[-0.004, -0.003]||8,320||1.000||[-0.014, 0.014]|
|Extraversion||5,839||0.964||[-0.003, -0.003]||7,751||1.000||[-0.022, 0.008]|
|Agreeableness||5,871||0.917||[-0.003, -0.003]||7,199||0.448||[-0.028, 0.002]|
|Neuroticism||5,915||0.853||[-0.003, -0.004]||9,075||1.000||[-0.008, 0.023]|
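The data preparation behind Table 11 can be sketched in Python as follows. This is one possible reading of the procedure, assuming that the split is made per developer on chronologically ordered monthly scores; the function names and the sample values are hypothetical, and the signed-rank test itself is left to a statistics package.

```python
from statistics import mean


def early_late_means(monthly_scores):
    """Average one developer's scores for one trait over the early and the later half.

    monthly_scores: list of (month, score) pairs, with months given as 'YYYY-MM'
    strings so that lexicographic sorting is also chronological.
    """
    if len(monthly_scores) < 2:
        raise ValueError("need at least two monthly scores to form a pair")
    ordered = sorted(monthly_scores)
    half = len(ordered) // 2
    early = [score for _, score in ordered[:half]]
    later = [score for _, score in ordered[half:]]
    return mean(early), mean(later)


def bonferroni(p_values):
    """Multiply each raw p-value by the number of tests, capping the result at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical inputs, for illustration only.
scores = [("2015-03", 0.4), ("2015-01", 0.2), ("2015-02", 0.4), ("2015-04", 0.6)]
print(early_late_means(scores))
print(bonferroni([0.01, 0.2, 0.5, 0.03, 0.9]))
```

With five traits, the correction multiplies every raw p-value by five, which is why many of the adjusted p-values in the tables are capped at 1.000.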
rq1—Are there groupings of similar developers according to their personality profile?
To answer the first research question, calefato2019apache applied several techniques to reveal the presence of natural groupings of personalities within the dataset of developers. We replicate the same analyses presented in calefato2019apache (calefato2019apache) on the new dataset.
First, to ensure that the original data were suitable for structure detection, calefato2019apache computed the Kaiser-Meyer-Olkin measure (KMO = 0.5, the minimum value recommended in the literature (field2012kmo)) and Bartlett’s test of sphericity (χ² = 4088.32, p < 0.001). We obtain similar results with the new dataset (KMO = 0.5; χ² = 900, p < 0.001). Accordingly, we proceed with the analyses to uncover latent factors.
The first analysis performed was the Principal Component Analysis (PCA), a statistical procedure that converts a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables, i.e., the principal components.
The scree plots in Fig. 4 show the percentage of variance in the data explained by each of the five extracted components. In the original study, the first three components accounted for most of the variance in the data (86%, see Fig. 4a), whereas in the replication the first two account for nearly 88% (Fig. 4b). The analysis of the eigenvalues in Table 12 shows that only the first two components in each study have a value over Kaiser’s criterion of 1, the cut-off point typically used to retain principal components. Eigenvalues correspond to the amount of variation explained by each principal component; therefore, a latent component has an eigenvalue > 1 when it accounts for more variance than is accounted for by one of the original variables in the dataset. Next, we check how the traits load on the two extracted components. The loadings from the original study and the replication are reported in Table 13, from which we observe inconsistent results. In the original study, openness and neuroticism load on the first component, whereas conscientiousness, extraversion, and agreeableness load on the second. Conversely, in the replication, openness, extraversion, and agreeableness load on the first component, whereas conscientiousness and neuroticism load on the second; we also observe that openness and conscientiousness load negatively on their respective components.
|Eigenvalue||% of variance||Eigenvalue||% of variance|
|Component 1||Component 2||Component 1||Component 2|
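Kaiser’s criterion and the percentages shown in a scree plot can be illustrated with a few lines of Python. The eigenvalues below are hypothetical (they are not the values of Table 12) and are chosen to sum to 5, as is the case for a PCA on five standardized traits; the function names are likewise illustrative.

```python
def explained_variance(eigenvalues):
    """Percentage of total variance explained by each principal component."""
    total = sum(eigenvalues)
    return [100.0 * value / total for value in eigenvalues]


def kaiser_retained(eigenvalues, cutoff=1.0):
    """1-based indices of the components whose eigenvalue exceeds the cutoff."""
    return [index + 1 for index, value in enumerate(eigenvalues) if value > cutoff]


# Hypothetical eigenvalues for five standardized variables (sum = 5).
eigenvalues = [2.5, 1.5, 0.5, 0.3, 0.2]

print(explained_variance(eigenvalues))
print(kaiser_retained(eigenvalues))
```

Since each standardized variable contributes a variance of 1, a component with an eigenvalue above 1 summarizes more information than any single input trait, which is the rationale for the cut-off.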
After applying PCA, calefato2019apache applied the k-means clustering algorithm to extract clusters of developers’ personalities. We replicate the analysis on the new dataset and use the ‘elbow’ method to identify the optimal number of clusters. The elbow point corresponds to the smallest k value after which no large decrease in the within-group heterogeneity (measured using the sum of squares) is observed as the number of clusters increases. The scree plots from both studies are reported in Appendix C, Fig. 10. In the original study, calefato2019apache selected k = 3, whereas we choose k = 2. Table 14 shows the distribution of the developers across the personality clusters extracted in the two studies. The developers are fairly evenly distributed across the clusters in the original study. In the replication, the first cluster is twice the size of the second, though we observed an even larger imbalance with larger k values. The table also reports the coordinates of the centroids, i.e., the average position of the elements assigned to a cluster. All the values are z-score standardized, with positive (negative) values above (below) the overall means. Then, calefato2019apache performed five non-parametric Kruskal-Wallis tests (one per trait) to understand whether the trait distributions in the clusters are significantly different, followed by a Tukey and Kramer (Nemenyi) post hoc test for multiple pairwise comparisons to understand which pairs are indeed different. Table 15 shows the results of the Kruskal-Wallis tests, after applying Bonferroni correction to the p-values for repeated tests. In the original study, the Kruskal-Wallis test for each of the five traits was statistically significant (p < 0.001) with a relatively strong (ε² = 0.292 for agreeableness) or strong (ε² ≥ 0.375 for the other traits) effect size (rea2014epsilonsquared); however, the post hoc tests showed that only clusters 1 & 3 and clusters 2 & 3 were significantly different from each other. In the replicated study, the Kruskal-Wallis tests are also all significant (p < 0.001), with the effect size ranging between moderate (ε² = 0.073 for conscientiousness) and strong (ε² = 0.590 for openness). Because there are only two clusters, there is no need to run a post hoc test to affirm that there are significant differences among the trait distributions.
|Cluster (size)||Openness||Conscientiousness||Extraversion||Agreeableness||Neuroticism|
|Original study|
|Cluster 1 (76)||-0.74||-0.69||-0.06||0.37||-0.84|
|Cluster 2 (55)||0.90||0.86||0.99||0.45||0.81|
|Cluster 3 (80)||0.08||0.07||-0.62||-0.67||0.25|
|Replication|
|Cluster 1 (156)||0.50||0.16||-0.36||-0.32||-0.22|
|Cluster 2 (76)||-1.03||-0.33||0.77||0.65||0.45|
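The elbow method is usually applied by visual inspection of the scree plot. As a rough illustration, the Python sketch below automates it with a simple heuristic that picks the k at which the decrease in the within-cluster sum of squares slows down the most. Both the heuristic and the sum-of-squares values are assumptions made for the example; they are not the procedure or the data used in either study.

```python
def elbow_k(wcss):
    """Pick the elbow from within-cluster sums of squares, where wcss[i] is for k = i + 1.

    Heuristic: choose the k with the largest difference between the drop
    obtained when reaching k and the drop obtained when moving past k.
    """
    if len(wcss) < 3:
        raise ValueError("need the sum of squares for at least three values of k")
    drops = [wcss[i] - wcss[i + 1] for i in range(len(wcss) - 1)]
    best_k, best_slowdown = None, float("-inf")
    for i in range(len(drops) - 1):
        slowdown = drops[i] - drops[i + 1]
        if slowdown > best_slowdown:
            best_k, best_slowdown = i + 2, slowdown
    return best_k


# Hypothetical within-cluster sums of squares for k = 1..5.
print(elbow_k([1000.0, 520.0, 430.0, 390.0, 370.0]))
```

A heuristic of this kind is only a starting point, since the final choice of k also depends on how interpretable and how balanced the resulting clusters are.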
|Trait||Original study: χ²||p-value||ε²||95% CI||Replication: χ²||p-value||ε²||95% CI|
|Openness||87.836||<0.001||0.418||[0.297, 0.532]||136||<0.001||0.590||[0.505, 0.658]|
|Conscienti.||78.777||<0.001||0.375||[0.257, 0.495]||17||<0.001||0.073||[0.021, 0.149]|
|Extraversion||94.554||<0.001||0.450||[0.354, 0.547]||84||<0.001||0.362||[0.264, 0.463]|
|Agreeablen.||61.248||<0.001||0.292||[0.197, 0.401]||62||<0.001||0.270||[0.170, 0.377]|
|Neuroticism||107.560||<0.001||0.512||[0.410, 0.613]||24||<0.001||0.104||[0.041, 0.185]|
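The effect sizes in Table 15 are consistent with the usual epsilon-squared estimate for the Kruskal-Wallis test, i.e., the test statistic divided by n - 1. The Python sketch below recomputes the values for the original study under the assumption that n = 211 (the sum of the three cluster sizes in Table 14). The interpretation bands in the second function are the commonly cited ones attributed to rea2014epsilonsquared; they are reproduced from memory as an assumption and should be checked against that source.

```python
def epsilon_squared(h_statistic: float, n: int) -> float:
    """Epsilon-squared effect size for a Kruskal-Wallis statistic with n observations."""
    return h_statistic / (n - 1)


def effect_size_label(eps2: float) -> str:
    """Interpretation bands assumed from the commonly cited scheme (verify before use)."""
    bands = [(0.01, "negligible"), (0.04, "weak"), (0.16, "moderate"),
             (0.36, "relatively strong"), (0.64, "strong")]
    for upper_bound, label in bands:
        if eps2 < upper_bound:
            return label
    return "very strong"


# Kruskal-Wallis statistics of the original study, taken from Table 15.
original = {"openness": 87.836, "conscientiousness": 78.777, "extraversion": 94.554,
            "agreeableness": 61.248, "neuroticism": 107.560}

for trait, statistic in original.items():
    eps2 = epsilon_squared(statistic, n=211)
    print(trait, round(eps2, 3), effect_size_label(eps2))
```

The same computation with n = 232 (the sum of the two replication cluster sizes) yields values close to those reported for the replication, with small deviations because the replication statistics are rounded to integers in the table.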
Finally, calefato2019apache applied Archetypal Analysis. In Appendix C, Fig. 11 we report the scree plots used to identify the optimal number of archetypes to extract with the elbow criterion. The two plots show the fraction of total variance in the data explained by the number of extracted archetypes. In the original study, calefato2019apache extracted three archetypes; in the replication, the function also plateaus after three archetypes. Therefore, also for the ease of comparison with the original study, we opt for extracting three archetypes. Table 16 shows the trait coordinates, standardized for ease of comparison, for the archetypes extracted in the two studies. Looking at the trait coordinates, the archetypal analyses in the two studies do not extract similar phenotypes of developers’ personalities.
rq2—Do developers’ personality traits vary with the type of contributors (i.e., core vs. peripheral)?
In the second research question, calefato2019apache investigated whether the members of projects’ core development teams exhibit different personality traits. Accordingly, they first filtered the personality scores, retaining only the commit authors, and then split this set into two subgroups, namely peripheral developers (i.e., external contributors without commit access to the repositories) and core developers (i.e., project members with write access to the source code repository). To replicate the analysis, we perform, for each trait, a Wilcoxon Rank Sum test for unpaired group comparison on the dataset with the new personality scores.
The results from both the original study and the replication are reported in Table 17. In both studies, we observe no significant differences (i.e., all adjusted p-values > 0.05 after Bonferroni correction) across all five traits. As such, the two studies consistently find that, on average, the personalities of core developers are not significantly different from those of peripheral developers.
|W||p-value||95% CI||W||p-value||95% CI|
|Openness||1,583||1.000||[-0.009, 0.008]||2,233||0.500||[-0.009, 0.087]|
|Conscientiousness||1,625||1.000||[-0.010, 0.011]||1,989||1.000||[-0.022, 0.034]|
|Extraversion||1,575||1.000||[-0.010, 0.008]||1,902||1.000||[-0.041, 0.041]|
|Agreeableness||1,273||0.271||[-0.017, 0.000]||1,685||1.000||[-0.064, 0.018]|
|Neuroticism||2,051||0.063||[0.004, 0.027]||1,751||1.000||[-0.049, 0.021]|
rq3—Do developers’ personality traits change after becoming a core member of a project’s development team?
In the third research question, calefato2019apache investigated whether developers exhibit different personality traits after becoming members of a project’s core development team. Accordingly, for each of the core developers with write access to source code repositories, they first retrieved the date of the first commit accepted and integrated by them, as an approximation of the moment when they have become a member of a project’s core development team. Then, for any of the projects they gained membership for, they used that date to split the personality trait scores of the developers into two paired groups, i.e., before vs. after becoming a project’s core team member. We replicate the same analyses on the new dataset.
In Appendix C, Fig. 12 we report the boxplots of the five personality scores across the two groups in both the original study and the replication. Also, Table 18 reports the results of the five Wilcoxon Signed-Rank tests executed (one per trait). The tests return no significant differences in either study (all adjusted p-values > 0.05 after Bonferroni correction). As such, the two studies consistently find that developers’ personality does not change significantly after becoming a core developer.
|V||p-value||95% CI||V||p-value||95% CI|
|Openness||39||1.000||[-0.011, 0.034]||62||1.000||[-0.074, 0.141]|
|Conscientiousness||40||1.000||[-0.008, 0.031]||76||0.765||[-0.014, 0.112]|
|Extraversion||17||1.000||[-0.019, 0.019]||46||1.000||[-0.126, 0.071]|
|Agreeableness||15||1.000||[-0.038, 0.011]||49||1.000||[-0.121, 0.081]|
|Neuroticism||43||0.654||[-0.005, 0.048]||35||1.000||[-0.118, 0.018]|
rq4—Do developers’ personality traits vary with the degree of development activity?
calefato2019apache investigated whether more productive developers are characterized by specific personality trait levels. Using the same groups of core and peripheral developers created earlier for rq2, they further split the two sets according to the level of development activity. Specifically, they computed the mean number of commits authored by the developers in the peripheral group and split the group into two subsets, i.e., authored-commits-high and authored-commits-low. Similarly, they created the integrated-commits-high and integrated-commits-low subgroups considering the mean number of commits integrated (i.e., accepted) by the developers in the core group. We replicate this research question on the new dataset by applying a series of Wilcoxon Rank Sum tests to make unpaired comparisons of the median personality scores between high- and low-activity developers.
The results from the original study and the replication are shown in Table 19. The test results reveal no cases of statistically significant differences between the pairs of trait distributions in either study (i.e., all adjusted p-values > 0.05 after Bonferroni correction). As such, the two studies consistently find that developers’ personality does not vary significantly with the level of development activity.
|High vs. low||Conscientiousness||449||1.000||[-0.008, 0.024]|
|commit authors||Extraversion||383||1.000||[-0.017, 0.017]|
|(peripheral devs)||Agreeableness||341||1.000||[-0.018, 0.009]|
|High vs. low||Conscientiousness||163||1.000||[-0.029, 0.019]|
|commit integrators||Extraversion||129||1.000||[-0.028, 0.006]|
|(core devs)||Agreeableness||204||1.000||[-0.013, 0.025]|
|High vs. low||Conscientiousness||557||1.000||[-0.021, 0.078]|
|commit authors||Extraversion||520||1.000||[-0.115, 0.041]|
|(peripheral devs)||Agreeableness||489||1.000||[-0.078, 0.100]|
|High vs. low||Conscientiousness||253||1.000||[-0.035, 0.121]|
|commit integrators||Extraversion||242||1.000||[-0.050, 0.084]|
|(core devs)||Agreeableness||242||1.000||[-0.052, 0.102]|
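The mean-based split used to form the high- and low-activity subgroups can be sketched in Python as follows. The developer identifiers and commit counts are hypothetical, and assigning developers who sit exactly at the mean to the low group is an assumption, since the text does not say how ties are handled.

```python
from statistics import mean


def split_by_mean(commit_counts: dict):
    """Split developers into (high, low) activity sets around the mean commit count."""
    threshold = mean(commit_counts.values())
    high = {dev for dev, count in commit_counts.items() if count > threshold}
    low = set(commit_counts) - high  # assumption: ties at the mean count as low
    return high, low


# Hypothetical commit counts for a group of peripheral developers.
authored = {"dev_a": 1, "dev_b": 2, "dev_c": 9}
high, low = split_by_mean(authored)
print(sorted(high), sorted(low))
```

Because commit counts are typically right-skewed, a split at the mean tends to produce a small high-activity group and a larger low-activity group, which is worth keeping in mind when reading the unpaired tests.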
rq5—What personality traits are associated with the likelihood of becoming a project contributor?
To answer the fifth research question, calefato2019apache built a contribution likelihood model, that is, they fit a logistic regression model to study the associations between the personality traits of developers and the likelihood of having a contribution accepted. As the response variable, they used a dichotomous yes/no variable indicating whether a developer has authored at least one commit successfully integrated into a project repository. They also included two control variables, namely word_count, a proxy for the extent of communication and social activity of a developer in the community through the email messages from which personality traits are extracted, and project_age, measured as number of days. In the replication, we follow the same process described in the original study. The dataset of 211 developers is fairly balanced with respect to the response variable, as 118 developers have at least one commit and 93 have no commits. Before fitting the model to the new dataset, we first check for the presence of high Pearson correlations between the updated personality predictors; we drop neuroticism and extraversion because they show a high correlation with conscientiousness and agreeableness, respectively. Also, the Variance Inflation Factor (VIF) computed on the resulting model reveals no collinearity issues for the retained predictors (all values < 3). Then, we evaluate the model fit using McFadden’s pseudo-R² measure, which describes the proportion of variance in the response variable explained by the model, and the Area Under the ROC curve (AUC), to assess the classification ability of the contribution model as compared to random guessing.
Table 20 includes the results of the two logistic regression models. We observe that the control variable project_age is statistically significant and has a negative effect in both the original study (-0.420) and the replication (-0.366). In the original study, the only statistically significant predictor was openness (54.09), whereas in the replication no personality factor has a significant effect. Also, in terms of goodness of fit, the model in calefato2019apache fit the data better (McFadden’s pseudo-R² = 0.397) than the model in the replication (0.270). Finally, we replicate the assessment of the model prediction performance by computing the AUC. As in the original study, we use a stratified sampling technique to split the dataset into training (70%) and test (30%) sets. The AUC performance of the logistic model in the original study is also better than that of the replication (0.89 vs. 0.69). Overall, while the results from the original study indicated that higher openness scores are associated with better chances for developers to become project contributors, the replication suggests instead that personality traits have no effect.
|Coef. Est.||Std. Error||z-value|
|project_age (days)||-0.420 ***||0.113||-3.71|
|N=211, McFadden Pseudo-R²=0.397, AUC=0.89|
|Coef. Est.||Std. Error||z-value|
|project_age (days)||-0.366 ***||0.080||-4.55|
|N=211, McFadden Pseudo-R²=0.270, AUC=0.69|
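The two fit measures reported with Table 20 can be computed with a few lines of Python. This is a minimal sketch: the log-likelihoods, labels, and predicted scores are hypothetical, and the function names are illustrative rather than taken from any statistical package. McFadden’s measure compares the log-likelihood of the fitted model with that of an intercept-only model, while the AUC is computed here as the share of (contributor, non-contributor) pairs that the model ranks correctly.

```python
def mcfadden_r2(loglik_model: float, loglik_null: float) -> float:
    """McFadden's pseudo-R-squared: 1 - LL(fitted model) / LL(intercept-only model)."""
    return 1.0 - loglik_model / loglik_null


def auc(labels, scores) -> float:
    """Area under the ROC curve as the fraction of correctly ranked positive-negative pairs."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical log-likelihoods and held-out predictions (not values from the studies).
print(mcfadden_r2(loglik_model=-60.0, loglik_null=-100.0))
print(auc(labels=[1, 1, 0, 0], scores=[0.9, 0.4, 0.5, 0.1]))
```

An AUC of 0.5 corresponds to random guessing, so the gap between 0.89 and 0.69 reported above reflects a substantial loss in the ability to rank contributors above non-contributors.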
rq6—What personality traits are associated with a higher number of contributions successfully accepted?
To answer the last research question, calefato2019apache performed a regression analysis to evaluate the association between the personality traits of developers and the number of contributions (i.e., commits) that they got accepted (i.e., merged) into the project repository. As the independent variables, they used the same personality predictors used in the previous logistic regression analysis. Regarding the control variables, in addition to the count of words in emails and the age of projects, they added two more (is_integrator and track_record) to control for, respectively, core members and long-time contributors. The dependent variable used is the number of merged commits, i.e., the count of commits authored by a developer that have been successfully merged. Because the dependent variable takes non-negative integer values only, rather than fitting a linear model, calefato2019apache performed a count-data regression analysis, which handles non-negative observations. Here we follow the same process described in the original work. Different count data models can be used for estimations, depending on the characteristics of the data. Poisson models rely on the strong assumption of equidispersion, that is, the equality of mean and variance of the count-dependent variable. Alternatively, it is possible to use a negative binomial distribution, a generalization of the Poisson distribution with an additional parameter to accommodate overdispersion. We perform the Likelihood Ratio Test (LRT) of overdispersion and find that, as in the original study, the negative binomial model (LogLik = -1023, LRT χ² = 635) provides a better fit to the data than the Poisson model (LogLik = -1340).
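The likelihood ratio test of overdispersion can be retraced from the two log-likelihoods with the Python sketch below. It takes the Poisson log-likelihood as -1340 and the negative binomial one as -1022.984 (the value listed with Table 21), and it assumes a chi-square reference distribution with one degree of freedom for the extra dispersion parameter. The function names are illustrative.

```python
import math


def lrt_statistic(loglik_restricted: float, loglik_full: float) -> float:
    """Likelihood ratio statistic: twice the gain in log-likelihood of the fuller model."""
    return 2.0 * (loglik_full - loglik_restricted)


def chi2_sf_1df(x: float) -> float:
    """Upper-tail probability of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


loglik_poisson = -1340.0    # restricted model (no dispersion parameter)
loglik_negbin = -1022.984   # fuller model, value reported with Table 21

statistic = lrt_statistic(loglik_poisson, loglik_negbin)
p_value = chi2_sf_1df(statistic)
print(round(statistic, 3), p_value < 0.001)
```

The statistic obtained this way is about 634, close to the LRT value reported with Table 21; the small gap is compatible with the Poisson log-likelihood being rounded in the text. Because the dispersion parameter is tested on the boundary of its range, some authors halve this p-value, which would not change the conclusion here.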
Table 21 shows the results of the count-data regression analysis with the negative binomial models from the two studies. We observe that, except for word_count, all the control variables have a statistically significant effect in both studies. Instead, while in the original study none of the five predictors related to personality has a significant effect, in the replication we find that conscientiousness is significantly and positively associated with a higher number of integrated commits (0.123). Finally, both the model in the original study and the one in the replication fit the data only marginally. Overall, while the results from the original study indicated that personality traits do not affect commit productivity, the replication suggests instead that higher conscientiousness scores are associated with a higher number of accepted commits.
|Coef. Estimate||Std. Error||z value|
|project_age (days)||-0.068 *||0.044||-1.56|
|dev_track_record (days)||0.544 ***||0.033||16.21|
|N=471, LogLik=-917, LRT χ²=514|
|Coef. Estimate||Std. Error||z value|
|project_age (days)||-0.087 *||0.042||-2.06|
|dev_track_record (days)||0.566 ***||0.033||17.11|
|N=471, LogLik=-1022.984, LRT χ²=635|
5.3. Threats to Validity
Because in the replications we followed the same methodologies presented in the original studies, we have also inherited some of the threats to the validity of those papers, e.g., that the datasets used in iyer2019github (iyer2019github) and calefato2019apache (calefato2019apache) are not representative of GitHub and the Apache ecosystem as a whole, respectively. Also, although one may argue that some of the statistics applied in Sect. 5.2 are not the preferred approach, we applied them to support the comparative aspects of the replication.
Phase 2 – Summary
The choice of a personality detection tool does affect the validity of previously published results. The replication of the first study led to contrasting findings in all three original research questions aimed at assessing the effects of personality on pull-request acceptance. When replicating the second study, we were able to obtain consistent findings for only three of the six original research questions.
The challenges and potential of computational personality detection in software engineering research
There has been considerable interest in applying natural language processing (NLP) and computational linguistics to recent software engineering research. In particular, prior work on sentiment analysis (also referred to as opinion mining) has focused on analyzing corpora of technical text, such as emails, commit comments, code-review discussions, and app reviews, to detect the polarity (calefato2018sentiment; novielli2021offtheshelf), emotions (novielli2018gold; calefato2019emtk), opinions (lin2019opinionmining; uddin2019opinionmining), and intentions (disorbo2015intentionmining; huang2020intentionmining) in software developers’ interactions.
One of the reasons for such widespread interest is that sentiment analysis is a highly restricted NLP problem because, to solve it, tools do not need to fully understand the semantics of each sentence or document but only some aspects of it, i.e., positive or negative sentiments and their target entities or topics (liu2012saombook). Computational personality detection is also an NLP problem as it touches every aspect of the research field, e.g., co-reference resolution, negation handling, named-entity recognition, and word-sense disambiguation (cambria2013newavenues). However, when the NLP aspects are addressed, the related constructs are then used to support the further analyses needed to extract personality profiles, which represents a separate problem. The broad scope of the problem arguably explains the poor agreement among the currently available personality detection tools and the negative results of the replications, discussed next in this section. Also, analogously to sentiment analysis, we expect that computational personality detection in software engineering will benefit from future breakthroughs and advancements in NLP (sawant2021nlpbreakthru).
Furthermore, while sentiment analysis and personality detection are under the same umbrella of affective computing research and, therefore, adopt similar technological solutions, the ‘affective phenomena’ they study vary in duration, ranging from short-lived feelings, emotions, and opinions to long-lived, slowly changing personality characteristics (picard2000affectivecomputing). As such, the two research fields can complement each other also when applied to software engineering research, with sentiment analysis concentrating on transient feelings related to entities (e.g., others, themselves, objects, and events) and computational personality detection focusing on intrinsic, long-lasting dispositions (of developers). For example, previous work on sentiment analysis in software engineering has also looked into anger (gachechiladze2017anger) and toxicity (raman2020toxicity) detection, and identified cases of developers who lashed out at others during technical discussions. Computational personality detection could complement this research and tell us if those episodes were extemporaneous or rather the effect of a personal inclination of developers who, despite their technical skills and knowledge, might not be recommended, for example, for tasks such as mentoring newcomers. Prior research on onboarding has developed recommender systems that look at developers’ social and technical aspects to help newcomers identify mentors in OSS projects (canfora2012yoda; steinmacher2012recommending); given that personality mismatch between mentors and mentees has been identified as one of the social barriers to onboarding (balali2018newcomers), a potential follow-up study could take into account the most relevant traits for the task according to personality theories (e.g., being more agreeable and open to collaboration) to identify candidate developers as suitable for mentoring. Another potential scenario of usage concerns code review. 
As prior work has shown the impact of human factors in performing such activity (ruangwan2019codereview), one might envision the development of a recommender system that assigns developers to code-review tasks also based on their personality profiles, e.g., by preferring those who exhibit a high level of conscientiousness, a trait associated with carrying out tasks precisely and thoroughly.
Phase 1 – Performance Assessment
Because model-based predictions aim to provide an assessment of personality, it is important to establish their convergent validity with self-report scores. Accordingly, to answer RQ1 (How do off-the-shelf personality detection tools perform in the software engineering domain?), in Phase 1 of the study we have built a dataset of emails written by 50 Apache Software Foundation developers and compared the predictions of four personality detection tools against the self-ratings collected through a questionnaire. In addition, we have made further pairwise comparisons between the tool predictions.
As can be observed from Table 5, the results of the correlation analysis indicate that the tools analyzed neither agree with self-reported personality ratings nor with each other when used in the software engineering domain. The coefficients are worse than those reported in the literature (see Table 3). Consistently, as can be observed from Table 6, the performance accuracy in terms of prediction errors is considerably worse as compared to the results from prior work on automatic personality detection from text (see Table 4). These results should warn the research community about the current limitations when using general-purpose tools for personality prediction in software engineering research.
The disagreements among tools arguably explain why it is hard to synthesize the results from prior work on computational personality detection in software engineering (see the related work presented next in Sect. 7). We currently do not know the reasons for these disagreements, though. On the one hand, a manual error analysis is impracticable in our case. Even if for simplicity we resorted to using trait predictions based on discrete labels (e.g., low vs. high openness), the number of words that the tools need to reliably infer traits is in the hundreds or thousands, too large for us to analyze and reason about the root causes of misclassification. By contrast, sentiment analysis tasks such as polarity detection, which infers the positive vs. negative polarity conveyed through text, rely on machine learning and natural language processing techniques similar to those used for building personality prediction models (nanli2012sareview) but typically analyze input text at the sentence level; therefore, through manual analysis, researchers on sentiment analysis in software engineering have identified domain-specific errors limiting model accuracy due to the use of technical words such as patch and jargon like kill a process, which do not express any valence (novielli2015challenges). On the other hand, the assessed personality detection tools do not explain their outcomes; they are applied in a black-box manner and no information is provided about what steers their model predictions. The complexity and performance of machine learning models have increased over the years at the expense of interpretability. Model interpretability is not a monolithic concept and has many facets (lipton2018interpretability). 
It can be applied at the model level, to refer to algorithmic transparency, a property that is usually very low in deep learning methods, whose behavior is notoriously hard for humans to ‘mentally simulate’; alternatively, it can be applied at the local level, when models offer post hoc explanations (such as images, text, or examples) for the output generated in response to a single input instance. Although model interpretability may not be mandatory in computational personality detection, because it does not raise the ethical and accountability concerns found in other domains like healthcare and finance, this lack of transparency is nonetheless a drawback that hinders progress in the field. We argue that the research field on computational personality detection might advance and benefit from the development of transparent prediction models that allow for error analysis.
Phase 2 – Replications
Replications play a key role in empirical software engineering so that the research community can build knowledge about which results or observations hold under which conditions (shull2008replications). To answer RQ2 (How does the choice of a personality detection tool affect the validity of previous results in software engineering research?), in Phase 2 we performed two exact dependent replications in which we kept the same experimental set-up of the original studies while only replacing IBM PI with LIWC to infer the developers’ personality scores from written documents.
When replicating the first study by iyer2019github (iyer2019github), we have been unable to confirm any of the findings concerning the effect of specific personality traits on the likelihood of merging pull requests in GitHub. In replicating the second study by calefato2019apache (calefato2019apache), we have found consistent results for only three out of six research questions. In particular, we have been able to replicate those research questions that failed to find differences in the distributions of median trait scores between subgroups of developers (e.g., core vs. peripheral, high- vs. low-activity). By contrast, we have failed to replicate the other research questions, which are similar to those reported in iyer2019github (iyer2019github) and for which a couple of regression models were built to uncover the effects of specific personality traits on the likelihood of becoming a contributor and on the productivity level.
As noted by ferguson2012negres (ferguson2012negres), replicability in science cannot be meaningful without the potential acknowledgment of failed replications. Negative results may happen when experimental results fail to meet expectations due to a lack of effect rather than misaligned expectations or a lack of methodological rigor in poorly designed experiments. Negative results are uncommon in the literature and even rarer in software engineering, where only recently have specific conference tracks or journal special issues been organized to present such results (paige2017negres-si). However, negative results are fundamental in software engineering to embrace the nature of experimentation. In fact, negative results are just as useful as positive ones because, by pointing out what has not worked, they eliminate useless hypotheses and directions, thus redirecting future experimental efforts towards alternative approaches that might pay off (tichy2000negres).
Therefore, though we have failed to replicate the experiments from the two studies, we argue that these negative results can be beneficial to enhance the state of the art of computational personality detection in software engineering. Arguably, the main implication of our findings is that the validity of previous studies, including but not limited to the ones by iyer2019github (iyer2019github) and calefato2019apache (calefato2019apache), should be questioned and possibly reassessed. However, this reassessment, as well as future studies on personality detection in software engineering, is possible only after developing and testing a reliable, SE-specific personality detection tool. In fact, the tools used in both the original studies and our replications introduced a threat to validity, since all the instruments available for computational personality detection have been trained on text documents that are not specific to software engineering, such as essays and social media posts. Hence, the reassessment should ideally happen using personality detection tools specifically trained on software engineering text corpora. Previous research on sentiment analysis in software engineering has highlighted the benefits (e.g., the reduced misclassification rate of neutral and positive content as emotionally negative) deriving from the use of tools specifically trained on technical documents retrieved from sources like Stack Overflow and Jira (calefato2018sentiment).
Furthermore, in this work, we have used personality detection tools as off-the-shelf components, i.e., without any tuning or training. Recent work on sentiment analysis has shown that the fine-tuning of tools to the software engineering domain might not be enough to improve accuracy, and that retraining has the potential to adjust the model performance to the shifts in lexical semantics due to different jargon and conventions used in data sources (lin2018howfar; novielli2021offtheshelf). By contrast, we observe that recent work on computational personality detection, which relies on state-of-the-art deep learning techniques, tends to focus more on outperforming baselines (e.g., mehta2020essay; jiang2020bert; kazameini2020bert; majumder2017deeplearning) than on releasing reusable, possibly retrainable models that can be transferred to other domains.
7. Related Work
In this section, we focus on reviewing previous studies that investigated the Big Five personality model in the software engineering domain by using tools for automatically extracting personality profiles from communication traces, such as emails, Q&A posts, and code-review comments.
rigby2007liwc-asf (rigby2007liwc-asf) were the first to automatically analyze the personality traits of developers. They studied the personality traits of the four top developers of the Apache httpd project against a baseline built by applying LIWC on the entire mailing-list corpus. They found that two of the developers responsible for the major releases have similar personalities, which are also different from the baseline of all other project members.
licorish2015personalitygsd (licorish2015personalitygsd) combined social network analysis with computational personality detection. They used LIWC to analyze the communication traces of 146 practitioners from the IBM Rational Jazz projects involved in global software development activities and found that those who occupy critical roles in knowledge diffusion demonstrate more openness to experience.
rastogi2016liwc-gh (rastogi2016liwc-gh) used LIWC to analyze the personality profiles of nearly 400 GitHub developers. They found that those with different levels of contributions have different personality profiles, i.e., those with high or low levels of contributions are more neurotic. Also, the personality profiles of the most active contributors were found to change across two consecutive years, becoming more conscientious, more extroverted, and less agreeable.
paruma2016ibmpi (paruma2016ibmpi) used IBM PI to extract the personality traits from emails sent by the committers to six Eclipse projects. They found three personality clusters: the first cluster groups the committers with the highest scores in extraversion and neuroticism; the second cluster groups the committers with moderate levels of neuroticism; the third cluster groups the committers with low values in neuroticism. The three personality clusters are different from those identified by calefato2019apache (calefato2019apache) after analyzing with the same tool the emails written by the Apache Software Foundation developers.
calefato2017trust (calefato2017trust) investigated the relationship between project success and the propensity to trust, one of the agreeableness facets in the Five-Factor Model. They approximated the overall performance of two Apache Software Foundation projects with the history of successfully merged pull requests in GitHub. Using the LIWC-based version of IBM PI, they analyzed the word usage in pull request comments to extract the developers’ agreeableness scores. The results suggested that the propensity to trust of code reviewers (integrators) is an antecedent of successful pull request integration.
To the best of our knowledge, this study is the first attempt at replicating results from previous work on computational personality detection in software engineering. However, one partial exception is the work of bazelli2013liwc-so (bazelli2013liwc-so) who performed a quasi-replication of the study by rigby2007liwc-asf (rigby2007liwc-asf). Specifically, they used LIWC to infer the personality of Stack Overflow users from Q&A posts. They found that the top reputed authors on Stack Overflow are more extroverted, as compared to medium- and low-reputed users. They argued that such a personality profile is consistent with the one observed by rigby2007liwc-asf regarding the two top Apache httpd developers.
Overall, the findings from these studies show the existence of different clusters of personalities among developers and that their traits vary with their degree of contribution and reputation, while also changing over short periods. Yet, the negative results of our replications suggest that these results should also be reassessed.
In this paper, we have studied the impact of the choice of a personality detection tool when conducting software engineering studies. We have observed a decrease in performance when general-purpose tools are used out of domain, as they agree neither with each other nor with the self-reported personality scores. Also, we have observed that the disagreement among tool predictions can lead to diverging conclusions, making it impossible to replicate previously published results when different personality detection tools are used. Our results suggest a need for personality detection tools specifically targeted at the software engineering domain. We hope that sharing the complete replication package (the technical corpus annotated with self-reported personality scores and the experimental workflow scripts) can accelerate the advancement in the field.
Acknowledgements. We are grateful to Filippo Lorè, Esq. for his feedback on GDPR compliance. We also thank our CS students Saverio Telera and Marco Iannotta for their help with the replications. Part of the computational work has been executed on the IT resources of the ReCaS-Bari data center.
Appendix A: Replication Package
Data Collection and Protection
The complete replication package is available on Zenodo at https://zenodo.org/record/4679303.
We are aware of the sensitive nature of the data collected and the privacy risks that come from misusing them. As such, here we clarify all the measures taken to ensure data privacy and protection during our research work. Although the GDPR took effect in May 2018 (i.e., after the data collection conducted in January 2018), we nonetheless made efforts to comply with the directives already approved by the EU in May 2016.
To ensure that we had control over the data and that they were stored in servers located in the EU, rather than administering the survey through platforms such as Google Forms, we opted for developing in-house an electronic version of the Mini-IPIP personality survey. As such, the application was hosted on our University cloud infrastructure in Italy. The application also handled the sending of invitations to participants via email and the automatic removal of such messages after one month. Both the invitation emails and the landing page of the application identified the research team, clarified the research purpose, and contained a link to the privacy statements briefly summarized next.
In particular, we clarified that: we had retrieved their email address from the public archives of the Apache Software Foundation; there was not going to be any other follow-up email; there was no monetary compensation and study participation was voluntary; the survey responses were needed for research purposes and were going to be stored anonymously; the data and analysis results would be shared only in scientific venues, such as conferences and journals, and presented in an aggregate form, thus making it impossible to identify who participated. We also clarified that the link to the survey in the invitation email contained a randomly generated id that would match their survey responses to a corpus consisting of some of their emails (i.e., the gold standard) publicly archived by the Apache Software Foundation. However, we highlight that (i) at the end of the data collection, we erased all the emails from those who did not answer the survey; (ii) regarding the respondents, upon submission of the survey, the system replaced their plain-text email address with the hashed survey id, thus preventing us and anyone else from matching an author in the email corpus with their survey responses.
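The pseudonymization step described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not specify the hash function or the data layout, so SHA-256, the dictionary-based corpus, and the function names new_survey_id, hash_survey_id, and pseudonymize are all hypothetical.

```python
import hashlib
import secrets


def new_survey_id() -> str:
    """Generate a random id to embed in the survey link (assumed format)."""
    return secrets.token_urlsafe(16)


def hash_survey_id(survey_id: str) -> str:
    """Hash the survey id; SHA-256 is an assumption of this sketch."""
    return hashlib.sha256(survey_id.encode("utf-8")).hexdigest()


def pseudonymize(corpus: dict, email: str, survey_id: str) -> dict:
    """Re-key an author's emails from the plain-text address to the hashed id."""
    updated = dict(corpus)  # leave the caller's dictionary untouched
    updated[hash_survey_id(survey_id)] = updated.pop(email)
    return updated


# Hypothetical corpus: author address -> list of email bodies.
corpus = {"dev@example.org": ["first message", "second message"]}
survey_id = new_survey_id()
corpus = pseudonymize(corpus, "dev@example.org", survey_id)
```

After the call, the corpus no longer contains the plain-text address, and the author's emails are reachable only through the hashed id.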
Furthermore, because the Apache Software Foundation email archives are publicly available, to further ensure anonymity and prevent third parties from guessing the identity of the survey respondents from the content of their emails, we point out that the gold standard shared in the replication package was built after scrubbing from the email bodies any potentially sensitive information, such as email and mail addresses, URLs, names, pieces of code, numbers, and stop words (we used three Python libraries, clean-text, scrubadub, and NLTK).
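As a rough illustration of this kind of scrubbing, the sketch below removes URLs, email addresses, numbers, and stop words using only the Python standard library. It is a simplified stand-in, not the pipeline in the replication package, which relies on clean-text, scrubadub, and NLTK; the regular expressions, the tiny stop-word list, and the function name scrub are assumptions made for this example, and names and code fragments are not handled here.

```python
import re

# Tiny illustrative stop-word list (an assumption; not NLTK's list).
STOP_WORDS = {"the", "a", "an", "to", "of", "and", "is", "at", "me"}

# Simplified patterns (assumptions; real-world data needs more robust ones).
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
NUMBER_RE = re.compile(r"\b\d+(?:\.\d+)*\b")


def scrub(body: str) -> str:
    """Remove URLs, email addresses, numbers, and stop words from a text."""
    text = URL_RE.sub(" ", body)    # URLs first, so their parts are not re-matched
    text = EMAIL_RE.sub(" ", text)
    text = NUMBER_RE.sub(" ", text)
    tokens = [t for t in text.split() if t.lower() not in STOP_WORDS]
    return " ".join(tokens)


sample = ("Contact me at dev@example.org or see "
          "https://example.org/issue/42 about the 3 patches")
print(scrub(sample))  # prints "Contact or see about patches"
```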
In conclusion, we are confident that all the measures taken are effective in protecting the privacy and anonymity of the developers who agreed to participate in the study.
Download and Setup
First, clone the repository and its submodules from GitHub:
git clone --recursive https://github.com/collab-uniba/tosem2021-personality-rep-package.git
Then, before the first execution, run the setup.sh script. Also, make sure that the requirements are satisfied, in particular Python 3.8.3+, R 4.0.4+, and Java 1.8+.
It is possible to automatically execute the full experimental workflow by launching the repro.sh script as follows:
bash repro.sh stage all dataset full
For test purposes, instead of supplying the argument full, it is possible to use the argument test to work with a small, random subsample of the experimental dataset to shorten the execution time (see the next subsection for more):
bash repro.sh stage all dataset test
It is also possible to execute the two workflow stages independently, using either the full dataset or the test one:
bash repro.sh stage phase1 dataset full
bash repro.sh stage phase2 dataset test
The execution of the entire pipeline is quite time-consuming as, depending on the machine specifications, it takes hours—if not days—when working on the full dataset. Additional time is also necessary in case one wants to retrain the TwitPersonality models instead of using the pre-trained ones.
Table 22 compares the execution times between two machines with different hardware specifications, running the same OS (Ubuntu 20.04). We observe that the newer machine’s performance when re-executing the full pipeline on the smaller test dataset is somewhat close to the older machine’s (5m17s and 7m35s, respectively). The newer machine is also faster when it comes to model retraining (6m27s vs. 10m18s). The largest difference, however, is observed when using the full dataset, which increases the execution time to 2 days for the newer machine and 3.9 days for the older one.
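The slowdown of the older machine relative to the newer one can be worked out from the figures reported above. The short sketch below does only this arithmetic; the helper name to_seconds is introduced here for illustration.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a duration given as minutes and seconds into seconds."""
    return minutes * 60 + seconds


# Older machine time divided by newer machine time, per scenario.
test_ratio = to_seconds(7, 35) / to_seconds(5, 17)      # test dataset
retrain_ratio = to_seconds(10, 18) / to_seconds(6, 27)  # model retraining
full_ratio = 3.9 / 2.0                                  # full dataset, in days

print(f"test dataset: {test_ratio:.2f}x")   # 455 s / 317 s, about 1.44x
print(f"retraining:   {retrain_ratio:.2f}x")  # 618 s / 387 s, about 1.60x
print(f"full dataset: {full_ratio:.2f}x")   # 1.95x
```

The ratios grow from roughly 1.4 on the test dataset to almost 2 on the full dataset, which matches the observation that the gap between the two machines is largest for the full workload.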
Appendix B: Phase 1 – Additional Material