A Study on the Prevalence of Human Values in Software Engineering Publications, 2015-2018

07/18/2019
by   Harsha Perera, et al.
Monash University

Failure to account for human values in software (e.g., equality and fairness) can result in user dissatisfaction and negative socio-economic impact. Engineering these values in software, however, requires technical and methodological support throughout the development life cycle. This paper investigates to what extent software engineering (SE) research has considered human values. We investigate the prevalence of human values in recent (2015 - 2018) publications at some of the top-tier SE conferences and journals. We classify SE publications, based on their relevance to different values, against a widely used value structure adopted from social sciences. Our results show that: (a) only a small proportion of the publications directly consider values, classified as relevant publications; (b) for the majority of the values, very few or no relevant publications were found; and (c) the prevalence of the relevant publications was higher in SE conferences compared to SE journals. This paper shares these and other insights that motivate research on human values in software engineering.


1. Introduction

Ignoring human values while engineering software may result in violating those values (Mougouei et al., 2018; Ferrario et al., 2016) and subsequent dissatisfaction of users. This may lead to negative socio-economic impacts such as financial loss and reputational damage. A recent example, which made news headlines, is the price gouging on airline tickets during Hurricane Irma (Sablich, 2017). After a mandatory evacuation order, the cost of airline tickets rose sixfold because of supply-and-demand pricing systems, thus disadvantaging evacuees. Arguably, this occurred because insufficient consideration was given to compassion for those suffering in a natural disaster. A second example is software used by Amazon to determine free shipping by zip code, which turned out to discriminate against minority neighbourhoods (Gralla, 2016). Racial bias in automatic prediction of re-offenders at parole boards in the US justice system (Angwin et al., 2016) is another example where software violates human values. Indeed, the negative impacts of ignoring values can go as far as risking human life: the tragic suicide of the British teenager Molly Russell (Baker, 2019) has been partially attributed to Instagram’s personalisation algorithms, which flooded Molly’s feed with self-harm images; following public outrage, Instagram has now banned such images.

As awareness about human aspects of software grows, the public is increasingly demanding software that accounts for their values. See, for example, those accusing Facebook of taking advantage of users’ data to influence the US elections (Smith, foot). Public demand has also motivated software vendors to take preemptive measures to avoid violating human values. Google, for instance, has pledged not to use its AI tools for surveillance conflicting with human rights (Dave, 2018).

Though such initiatives are promising, we claim that software engineering research and practice currently pays insufficient attention to the majority of human values. This may be due to the lack of adequate methodological and technical support for engineering values in software (Mougouei et al., 2018). To provide evidence for this claim, as part of our broader approach to studying human values, we have investigated software engineering (SE) research papers to measure how much attention the SE field has given to values. In particular, we have classified software engineering publications in some of the top-tier SE venues (ICSE, FSE, TSE, and TOSEM), from 2015 to 2018, based on their relevance to different values. A paper was classified as directly relevant to a particular value if its main research contribution addressed how to define, refine, measure, deliver or validate this value in software. A widely adopted value structure (Figure 1), based on Schwartz’s theory of human values (Schwartz, 2012, 1992), was used as our classification scheme. Using this classification approach, we investigated the prevalence of human values in SE research, with three key research questions:

  • To what extent are SE publications relevant to values?

  • Which values are commonly considered in SE publications?

  • How are the relevant publications distributed across venues?

The results of our study showed that: (a) only 16% of publications were directly relevant to human values, referred to, henceforth, as relevant publications; (b) for 60% of human values, there were no relevant publications; (c) for the majority of values, very few relevant publications were found, and only 21% of values had an above-average number of relevant publications; and (d) 88% of relevant papers were published in SE conferences rather than journals.

2. Background

Cheng and Fleischmann summarize seven different definitions of human values as “guiding principles of what people consider important in life” (Cheng and Fleischmann, 2010). Human values with an ethical and moral import such as Equality, Privacy and Fairness have been studied in technology design and human-computer interaction for more than two decades (Friedman, 1996; Flanagan et al., 2005; Friedman and Kahn Jr, 2007). Meanwhile, the rapid popularization of artificial intelligence (AI) and its potential negative impact on society have raised the awareness of human values in AI research (Riedl and Harrison, 2016; Etzioni and Etzioni, 2017; Cath et al., 2018). Consequently, human values are getting renewed research focus.

There has been some recent (but isolated) research in software engineering such as values-based requirements engineering (Thew and Sutcliffe, 2018), values-first SE (Ferrario et al., 2016) and values-sensitive software development (Aldewereld et al., 2015). However, there has been no previous work that measures to what extent human values have been considered in SE research. Motivated by this research gap, we follow a classification approach, similar to that used in previous SE research to map topic trends (Shaw, 2003; Systä et al., 2012; Montesi and Lago, 2008), but with a different purpose, to measure values relevance. There are no current classification schemes for human values in SE. Therefore, we take inspiration from the social sciences.

Social scientists have been searching for the most useful way to conceptualize basic human values since the 1950s (Schwartz, 2007). In 1973, Rokeach captured 36 human values and organized them into 2 categories (Rokeach, 1973). In 1992, Schwartz introduced his theory of basic human values (henceforth referred to as Schwartz’s Values Structure (SVS)), which recognized 58 human values grouped into 10 value categories (Schwartz, 1992, 2005). While these two value structures remain the most well-recognized ways of representing values, there are at least ten other value classifications (Cheng and Fleischmann, 2010). In this paper, we use SVS, which is the most cited and most widely applied classification not only in the social sciences but also in other disciplines (Thew and Sutcliffe, 2018; Ferrario et al., 2014).

In SVS, Schwartz introduced 10 motivationally distinct value categories recognized across more than 30 cultures (Schwartz, 1992). Each value category has underlying distinct motivational goals (see Table 1) which relate to three fundamental needs of human existence (Schwartz, 1992). Schwartz subdivided each value category into a set of closely related values (Schwartz, 1994, 1992). These 10 value categories and 58 values are arranged in a circular motivational structure as shown in Figure 1. Value categories located close to each other are complementary, whereas value categories further apart tend to be in tension with each other. Section 3 discusses how we applied SVS in our classification study.

Value Category    Description (motivational goals)
Self-direction    Independent thought and action–choosing, creating, exploring
Stimulation       Excitement, novelty, and challenge in life
Hedonism          Pleasure or sensuous gratification for oneself
Achievement       Personal success through demonstrating competence according to social standards
Power             Social status and prestige, control or dominance over people and resources
Security          Safety, harmony, and stability of society, of relationships, and of self
Conformity        Restraint of actions, inclinations, and impulses likely to upset or harm others and violate social expectations or norms
Tradition         Respect, commitment, and acceptance of the customs and ideas that one’s culture or religion provides
Benevolence       Preserving and enhancing the welfare of those with whom one is in frequent personal contact
Universalism      Understanding, appreciation, tolerance, and protection for the welfare of all people and for nature

Table 1. Value categories and descriptions (Schwartz, 2012)

Figure 1. Schwartz Values Structure (Schwartz, 2006; Schwartz and Boehnke, 2004), adopted from Holmes et al. (2011). Words in black boxes are value categories, each subdivided into values.

3. Methodology

To investigate the prevalence of human values in SE research, we manually classified publications from top-tier SE conferences and journals based on their relevance to different values. We followed a methodology similar to that of prior classification work in SE (Shaw, 2003; Systä et al., 2012; Montesi and Lago, 2008), which mapped trends of SE research over time in terms of topic and type of study. As with prior studies, ours was based on manual classification of paper abstracts by multiple raters. Classification based on abstracts, rather than reading the full paper, is sub-optimal but strikes a balance between accuracy and time needed for the study. All papers had multiple raters and inter-rater agreement was measured using Fleiss’ Kappa (Landis and Koch, 1977). We chose to classify papers from the last four years of conferences and journals generally considered to be the top SE venues, namely, the International Conference on Software Engineering (ICSE), the ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), the IEEE Transactions on Software Engineering (TSE), and the ACM Transactions on Software Engineering and Methodology (TOSEM).

When conducting such a study, there are a number of key experimental design decisions that need to be taken, including: (i) how to define relevance to human values, given the imperfect and high-level nature of values definitions in the literature; (ii) how many raters to assign to each paper, and (iii) how to resolve disagreements between raters. To make choices about these design decisions, we first carried out a pilot study before carrying out the main study. Both the pilot and main study assumed SVS as the classification scheme. In total, we employed 7 raters (5 Male, 2 Female) with varying levels of experience in SE research, ranging from PhD students to Professors, and including one rater from outside the software engineering field. Note that this is a relatively high number of raters compared to similar studies (Bertolino et al., 2018; Vessey et al., 2002).

3.1. Pilot Study

The pilot study had three steps: (i) Paper selection and allocation of papers to raters, (ii) Paper classification, and (iii) Calibration of classification decisions made by different raters. The aim of the pilot study was not to measure relevance of papers to values; rather, we had the following objectives:

  • To test the appropriateness of SVS as the classification scheme for SE publications

  • To develop a common understanding regarding the meaning of human values in SE contexts

  • To collect insights from raters to feed into the experimental design of the main study

(i) Paper selection and allocation of papers to raters. We randomly selected 49 papers from ICSE 2018 as our pilot study dataset. These were equally allocated among the seven raters, with three raters per paper. Common practice is to assign two raters per paper (Bertolino et al., 2018; Vessey et al., 2002); three were assigned in the pilot to get a better understanding of how to map papers to values. ICSE was chosen as it has the broadest coverage of SE research (Bertolino et al., 2018). We chose the most recent ICSE proceedings – 2018 at time of writing.

(ii) Paper classification. Raters classified papers, independently, based on their title, abstract and keywords which is an approach used in similar classification studies in SE (Shaw, 2003; Glass et al., 2002; Bertolino et al., 2018). Raters were instructed to decide if a paper was “relevant” or “not relevant” to human values: relevance was deliberately left ill-defined as one of the objectives of the pilot was to influence the definition of this term in the main study. For relevant papers, raters were asked to classify the papers into one value category, and then into one value within the category. Raters were not mandated to follow the hierarchical structure of SVS: that is, they could classify a paper into value X and value category Y even if X did not belong to category Y. This was to give us a way to assess, from a software engineering perspective, the appropriateness of the hierarchy in SVS.

(iii) Calibration. After classification, all seven raters met to discuss the classification decisions. The main objective was to calibrate decisions and use this to refine the definition of values relevance. The intention was not to decide which rater picked the correct classification.

Following the pilot study, we made a number of observations which were fed into experimental design of the main study.

  • Observation 1: Raters found that almost every paper could be classified into a small number of values such as Helpfulness, Wisdom or Influence because, in general, every piece of research tries to advance knowledge. Thus, an indirect argument could almost always be made why a paper is relevant to helpfulness (e.g., a paper on testing is helpful to testers), wisdom (any paper advances knowledge, thus leading to greater wisdom), or influence (e.g., a paper on an improved software process influences how software is developed). This observation illustrated the difficulty of working with vaguely defined concepts such as values, but also the importance of a better definition of relevance.

  • Decision 1: It is beyond the scope of this paper to fully and formally define all the values; hence, it was decided in the main study to use inter-rater agreement as evidence that a value was sufficiently understood in the context of a particular paper to provide confidence in the results. The definition of relevance was, however, refined for the main study. Raters were instructed not to make indirect arguments why a paper might be relevant to a value. Instead, in the main study, classification was based on “direct relevance” – a paper is defined as directly relevant to a value if its main research contribution is to define, refine, measure, deliver or validate a particular value in software development. All other papers are classified as not relevant. Thus, a paper should only be classified as directly relevant to helpfulness if the research provides software tools or techniques to encourage people to be helpful towards each other.

  • Observation 2: Raters observed that some papers addressed values as a general concept rather than considering any specific value. An example would be a paper that presents a methodology for refining values into a software architecture. These papers should not be classified into any particular value category or value.

  • Decision 2: To facilitate classification of such papers, we introduced a new value category in the main study, named Holistic view. A paper classified under Holistic view relates to values generally without focusing on any specific value (Table 3).

  • Observation 3: Raters found that some papers should be classified under more than one value.

  • Decision 3: To accommodate such papers in the main study, raters were allowed to select up to three values. This decision is different from similar studies in SE where raters were obliged to pick just one category (Bertolino et al., 2018).

  • Observation 4: Not surprisingly, as SVS was not developed specifically for SE, there were cases where SVS was not a perfect fit. We will return to this point in Section 8, but a key point for the main study is that some raters chose a value X and value category Y even if X does not belong to Y according to SVS. A common example was the value Privacy, which from an SE perspective is clearly aligned with the category Security, and yet appears in Self-direction according to SVS (see Figure 1).

  • Decision 4: The main study maintained the decision to allow selection of values and value categories independent of the Schwartz hierarchical structure. In Section 4, we present data to show the effect this had on the results.

  • Observation 5: The pilot study gave us an opportunity to measure how long it took raters to rate papers. We found that, on average, each rater spent four minutes per abstract. Given the number of papers in the main study (1350 – see Table 2), assigning three raters per paper would be infeasible.

  • Decision 5: Out of necessity, we reduced the number of raters in the main study to two. This is consistent with the number of raters in similar studies (Bertolino et al., 2018; Vessey et al., 2002; Glass et al., 2002).
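The arithmetic behind Observation 5 and Decision 5 can be sketched as follows. This is a rough estimate only: it uses the figures quoted above (1350 papers, about four minutes per abstract, seven raters) and assumes, for illustration, that the workload is split evenly and that discussion time is excluded.

```python
# Back-of-the-envelope rating effort, using figures quoted in the text.
PAPERS = 1350              # papers in the main study (Table 2)
MINUTES_PER_ABSTRACT = 4   # average observed in the pilot study
RATERS = 7                 # raters available


def rating_effort(raters_per_paper: int) -> dict:
    """Estimate total and per-rater effort, assuming an even split of the work."""
    ratings = PAPERS * raters_per_paper
    total_hours = ratings * MINUTES_PER_ABSTRACT / 60
    return {
        "ratings": ratings,
        "total_hours": total_hours,
        "papers_per_rater": ratings / RATERS,
        "hours_per_rater": total_hours / RATERS,
    }


for k in (2, 3):
    effort = rating_effort(k)
    print(
        f"{k} raters per paper: {effort['ratings']} ratings, "
        f"{effort['total_hours']:.0f} hours in total, "
        f"about {effort['papers_per_rater']:.0f} papers per rater"
    )
```

Under these assumptions, moving from three raters per paper to two removes a third of the rating effort, and the per-rater load with two raters is close to the "around 400 papers" reported for the main study.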

3.2. Main Study

Similar to the pilot study, the main study also had three phases: (i) Paper selection and allocation of papers to raters, (ii) Paper classification and (iii) Disagreement resolution. The final stage was different to the pilot study because rather than calibrating ratings to inform experimental design, some raters met to try and reach a consensus.

(i) Paper selection and allocation of papers to raters. For the main study, we selected papers from ICSE, FSE, TSE and TOSEM over the last four years. These are the same venues used in similar paper classification studies (Bertolino et al., 2018; Glass et al., 2002). We selected all papers in TSE and TOSEM. For FSE, we used all papers from the main track, and for ICSE, we used all papers from the main track, from the Software Engineering in Practice (SEIP) track, and from the Software Engineering in Society (SEIS) track. The motivation for selecting tracks was to choose tracks which publish full research papers, not shorter papers. In total, there were 1350 papers published in the chosen venues over the years 2015–2018, at time of writing. This is a high sample size compared to similar studies (e.g., 976 in Bertolino et al. (Bertolino et al., 2018) and 369 in Glass (Glass et al., 2002)). Table 2 shows the distribution of selected papers by venue, track and year.

Venue & Track 2015 2016 2017 2018 Total
ICSE–Main Track 83 101 68 153 405
ICSE–SEIP 25 28 30 35 118
ICSE–SEIS 9 7 9 11 36
ESEC/FSE–Main Track 123 143 124 122 512
TSE 62 61 61 31 215
TOSEM 22 16 12 14 64
Total 324 356 304 366 1350
Table 2. Classified publications by venue/track and year

The papers were randomly allocated among the seven raters, two raters per paper. Each rater received around 400 papers to classify. We manually extracted links for each of the 1350 papers from digital databases, and provided a spreadsheet with these links and values and value categories for raters to select from.
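The allocation procedure itself is not described in detail, so the following is only one possible way to allocate papers at random with two raters per paper while keeping workloads balanced. The rater labels (`R1` to `R7`), the function name `allocate`, and the greedy balancing rule are all illustrative assumptions.

```python
import random
from collections import Counter


def allocate(paper_ids, raters, per_paper=2, seed=0):
    """Randomly assign `per_paper` distinct raters to each paper.

    Hypothetical balancing rule: each paper goes to the raters with the
    lowest current load, with ties broken at random.
    """
    rng = random.Random(seed)
    load = Counter({r: 0 for r in raters})
    papers = list(paper_ids)
    rng.shuffle(papers)  # randomise the order in which papers are handed out
    allocation = {}
    for paper in papers:
        order = sorted(raters, key=lambda r: (load[r], rng.random()))
        chosen = order[:per_paper]
        for r in chosen:
            load[r] += 1
        allocation[paper] = tuple(chosen)
    return allocation, load


raters = [f"R{i}" for i in range(1, 8)]           # seven hypothetical raters
allocation, load = allocate(range(1350), raters)  # 1350 papers, two raters each
print(sorted(load.values()))
```

With 1350 papers and two raters per paper there are 2700 ratings to distribute, so each of the seven raters ends up with 385 or 386 papers under this scheme.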

(ii) Paper classification. Similar to the pilot study, raters were asked to classify papers on the basis of their title, abstract and keywords. However, the main study used a different definition of relevance, as suggested by the pilot study. Raters were asked to classify papers as directly relevant or not directly relevant, where the definition of direct relevance is as given in Section 3.1. Papers found directly relevant to values were further classified into a category and then to a specific value(s). Throughout the process, raters complied with the decisions made during the calibration step in the pilot study.

(iii) Disagreement resolution. Given the subjective nature of the classification, raters sometimes disagreed. This could arise at three levels: (a) relevance level, where raters disagreed on whether a paper was directly relevant or not; (b) value category level, where raters disagreed on the choice of value category; and (c) value level, where raters disagreed on the choice of value.
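The three levels of disagreement can be made concrete with a small data structure. The sketch below is illustrative only: the `Rating` class and `disagreement_level` function are hypothetical names, and treating two value selections as agreeing only when the selected sets are identical is an assumption, since the text does not state how partial overlaps were handled.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

MAX_VALUES = 3  # raters could select up to three values (Decision 3)


@dataclass(frozen=True)
class Rating:
    """One rater's classification of one paper (hypothetical structure)."""
    relevant: bool
    category: Optional[str] = None
    values: FrozenSet[str] = frozenset()

    def __post_init__(self):
        if len(self.values) > MAX_VALUES:
            raise ValueError("a rater may select at most three values")
        if not self.relevant and (self.category or self.values):
            raise ValueError("a non-relevant paper carries no category or value")


def disagreement_level(a: Rating, b: Rating) -> Optional[str]:
    """Return the first level at which two ratings differ, or None if they agree."""
    if a.relevant != b.relevant:
        return "relevance"
    if a.category != b.category:
        return "category"
    if a.values != b.values:  # assumption: agreement means identical value sets
        return "value"
    return None


r1 = Rating(True, "Security", frozenset({"Privacy"}))
r2 = Rating(True, "Self-direction", frozenset({"Privacy"}))
print(disagreement_level(r1, r2))
```

Here the two raters agree that the paper is relevant and agree on the value Privacy, but differ on the category, so the disagreement is reported at the category level.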

To attempt to resolve these disagreements, raters met to discuss their views about why the paper in question was classified in a certain way. If the raters could not come to an agreement, a third rater was introduced as an arbiter. The arbiter facilitated a second round of discussion, sharing his or her own views, to facilitate a consensus. However, if the disagreement persisted, the arbiter did not force a decision.

Aligned with previous studies (Bertolino et al., 2018), we calculated inter-rater agreement using Fleiss’ Kappa, once attempts at resolving disagreements had taken place. The results of the Kappa measure are interpreted according to the agreement strengths introduced by Landis and Koch (Landis and Koch, 1977). We achieved almost perfect agreements on relevance level and category level with Kappa values equal to 0.92 and 0.87, respectively. The agreement of value level was found as substantial with Kappa value equal to 0.79. The results from the main study are further discussed in Section 4.
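For readers unfamiliar with the measure, the following sketch computes Fleiss’ Kappa from a table of rating counts and maps the result to the Landis and Koch agreement strengths. The toy counts are invented for illustration and are not the study data; the function names are likewise hypothetical.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' Kappa for a table where counts[i][j] is the number of raters
    who assigned paper i to category j (same number of raters for every paper)."""
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every paper needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])
    # Observed agreement per paper, then averaged over papers.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_subjects
    # Agreement expected by chance, from the overall category proportions.
    p_j = [
        sum(row[j] for row in counts) / (n_subjects * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in p_j)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return (p_bar - p_e) / (1 - p_e)


def landis_koch(kappa: float) -> str:
    """Agreement strength according to Landis and Koch (1977)."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


# Toy data (not from the study): 10 papers, 2 raters, columns = [relevant, not relevant].
toy = [[2, 0]] * 2 + [[0, 2]] * 7 + [[1, 1]]
k = fleiss_kappa(toy)
print(f"kappa = {k:.2f} ({landis_koch(k)})")

# Interpretation of the Kappa values reported above.
for level, value in [("relevance", 0.92), ("category", 0.87), ("value", 0.79)]:
    print(level, value, landis_koch(value))
```

Applied to the reported values, the mapping reproduces the interpretation given in the text: almost perfect agreement at the relevance and category levels, and substantial agreement at the value level.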

Classification              Example
Not Relevant                CafeOBJ is a language for writing formal specifications for a wide variety of software and hardware systems and for verifying their properties … we have extended CafeInMaude, a CafeOBJ interpreter implemented in Maude, with the CafeInMaude Proof Assistant (CiMPA) and the CafeInMaude Proof Generator (CiMPG) … (Riesco and Ogata, 2018)
Privacy                     Network traffic data contains a wealth of information for use in security analysis and application development. Unfortunately, it also usually contains confidential or otherwise sensitive information, … We present Privacy-Enhanced Filtering (PEF), a model-driven prototype framework that relies on declarative descriptions of protocols and a set of filter rules … (Dijk et al., 2017)
Helpful                     … However, newcomers face many barriers when making their first contribution to an OSS project, leading in many cases to dropouts. Therefore, a major challenge for OSS projects is to provide ways to support newcomers during their first contribution. In this paper, we propose and evaluate FLOSScoach, a portal created to support newcomers to OSS projects. … (Steinmacher et al., 2016)
Protecting the Environment  … The battery power limitation of mobile devices has pushed developers and researchers to search for methods to improve the energy efficiency of mobile apps. We propose a multiobjective refactoring approach to automatically improve the architecture of mobile apps, while controlling for energy efficiency… (Morales et al., 2018)
Holistic View               … The aim of this paper is to give more visibility to the interrelationship between values and SE choices. To this end, we first introduce the concept of Values-First SE and reflect on its implications for software development. Our contribution to SE is embedding the principles of values research in the SE decision making process and extracting lessons learned from practice. … (Ferrario et al., 2016)

Table 3. Examples of paper classification at different levels (relevance, value category, and value).

4. Results

This section presents the results of the main study described in Section 3.2. As a reminder, we investigate the following research questions:

  • To what extent are SE publications relevant to values?

  • Which values are commonly considered in SE publications?

  • How are the relevant publications distributed across venues?

4.1. The Prevalence of Values in SE Publications

To answer (RQ1) and (RQ2), in this section, we present the results achieved from Section 3.2 and discuss our findings on the prevalence of human values in SE Publications.

4.1.1. Answering (RQ1)

Figure 2 demonstrates the prevalence of human values in classified publications. We observed (Figure 2) that the majority of the publications (82%) were classified as Not Relevant to values, which constitutes 1105 out of 1350 papers. For those publications that did not directly relate to values, an example is given in Table 3. On the other hand, 16% of the publications (216 papers) were found to be directly relevant to values. The remaining 2% of publications (29 papers) were classified as undecided, because the two raters could not agree on a classification. To investigate if there were any trends in the prevalence of values in SE venues over time, we compared the percentages of the relevant publications from 2015 to 2018 (Figure 3): no significant trends were observed.

It is worth mentioning that even though the raters agreed that 216 papers (16% of the classified papers) were relevant to values, disagreements still remained at the value category level and value level (Section 3.2): out of 216 papers, agreements were reached for 195 papers at value category level and at the value level, agreements were reached for 115 papers.
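The percentages above follow directly from the reported counts. The short check below recomputes them; the variable names are illustrative, and the only inputs are the counts quoted in the text.

```python
# Counts reported in the text for the 1350 classified papers.
TOTAL = 1350
relevance_counts = {"not relevant": 1105, "relevant": 216, "undecided": 29}

# The three outcomes should account for every classified paper.
assert sum(relevance_counts.values()) == TOTAL

relevance_share = {k: 100 * v / TOTAL for k, v in relevance_counts.items()}
for label, pct in relevance_share.items():
    print(f"{label}: {pct:.1f}%")

# Agreement remaining among the 216 relevant papers at the finer levels.
RELEVANT = relevance_counts["relevant"]
agreed = {"value category": 195, "value": 115}
agreed_share = {k: 100 * v / RELEVANT for k, v in agreed.items()}
for level, pct in agreed_share.items():
    print(f"agreement at {level} level: {pct:.1f}% of relevant papers")
```

The derived shares (roughly 90% of relevant papers with an agreed category and roughly 53% with an agreed value) are simple ratios of the reported counts, not additional results from the study.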

Figure 2. Relevance of SE publications to human values
Figure 3. Relevant publications per year.

4.1.2. Answering (RQ2)

Which values are commonly considered?

Our results showed that for each of the 58 values in Figure 1 – on average – relevant publications were found. As shown in Figure 5, however, the frequency of the relevant publications varied significantly for different values. Figure 4 shows the level of attention given to the 58 human values in SVS.

It can be seen that, for the majority of the values, the number of relevant publications was very low, while for 60% (35 out of 58) of the values, no relevant publications were found (Figure 4). Also, for some values, e.g., Enjoying life and Honoring of parents and elders, only one relevant publication was found across all of the studied venues from 2015 to 2018 (Figure 5). It can also be seen in Figure 4 that for only 21% (12 out of 58) of the values, e.g., Helpful and Privacy, the number of relevant publications was above average.

While we are cautious about generalizing, these findings strongly suggest that the SE research community has paid negligible or limited attention to the majority of human values. Although identifying the exact cause requires broader studies, the neglect of some values in SE publications can plausibly be attributed to the lack of practical definitions for those values (Mougouei et al., 2018); this is particularly clear for values such as Forgiving and Mature love, which need to be clarified in an SE context before SE researchers and practitioners can use them.

Figure 4. The level of attention given to 58 values in the Schwartz Value Structure. Publications were classified as relevant if their main research contribution directly considered values.

In the attempt to understand which values are most commonly considered in SE research, we found (Figure 5) that the number of publications relevant to Helpful, Privacy, and Protecting the environment, were the highest among all 58 values in SVS (Figure 1). Examples of such publications are given in Table 3. With 38 relevant papers, the value Helpful was the most frequently considered value. Publications that contributed software tools or techniques to encourage people to be helpful towards each other were classified by the raters as relevant to Helpful.

Moreover, the second highest number of relevant publications was observed for Privacy (Figure 5). This category contained papers that directly considered user privacy. Also, Protecting the environment, the third most commonly found value, appeared in publications that directly considered Sustainability and Energy efficiency in software.

Figure 5. The number of relevant publications per value
Figure 6. Considering value categories in SE publications

Which value categories are commonly considered? As explained in section 3.2, the raters were given the freedom to classify the publications under different value categories regardless of SVS hierarchical structure. As a result, the raters were allowed to pick, for a publication, values and value categories that did not necessarily match in SVS. Figure 6 shows the prevalence of the publications under different value categories specified by the raters.

Figure 6 also shows how those papers would have been classified if the raters strictly followed SVS (Figure 1): a significant difference was observed for value categories Security and Self direction, where the raters classified 80 papers as relevant to Security; had the raters followed SVS for classification, only 55 papers would have been classified under Security. On the other hand, raters classified only 6 papers as relevant to Self direction. If it was based on SVS, 21 papers would have fallen under the category of Self direction (Figure 1).

Scrutinizing the publications classified under Security and Self direction revealed an interesting finding: the raters chose Security as the category of 12 papers classified as relevant to Privacy, but in Schwartz Values Structure (SVS), Privacy is under Self direction. As such, those 12 papers (annotated on the graph of Figure 6) would have been classified under Self direction if SVS had been followed (Figure 1). Though relatively small, similar differences were also observed for other value categories such as Power, Achievement, Conformity, and Hedonism. Considering the SE background of most of the raters (Section 3), this raised a major question: “do software engineers perceive values differently from social scientists?” To reflect the view of the raters in our discussions, we consistently use the categories specified by them in the rest of the paper.
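The comparison between rater-chosen categories and SVS categories can be illustrated with a small sketch. The ratings below are invented for illustration and are not the study data. The placement of Privacy under Self-direction is stated in the text; the placements of Helpful and Protecting the environment follow Schwartz's structure and are an assumption as far as this section is concerned.

```python
from collections import Counter

# Value -> category according to SVS (excerpt only).
SVS_CATEGORY = {
    "Privacy": "Self-direction",
    "Helpful": "Benevolence",
    "Protecting the environment": "Universalism",
}

# Hypothetical (value, rater-chosen category) pairs, one per paper.
ratings = [
    ("Privacy", "Security"),
    ("Privacy", "Security"),
    ("Privacy", "Self-direction"),
    ("Helpful", "Benevolence"),
    ("Protecting the environment", "Universalism"),
]

# Papers per category as chosen by the raters.
by_rater = Counter(category for _, category in ratings)
# Papers per category if the SVS hierarchy had been followed strictly.
by_svs = Counter(SVS_CATEGORY[value] for value, _ in ratings)
# Papers where the rater's category departs from SVS.
mismatches = [
    (value, category, SVS_CATEGORY[value])
    for value, category in ratings
    if category != SVS_CATEGORY[value]
]

print("rater-chosen:", dict(by_rater))
print("SVS-based:   ", dict(by_svs))
print("mismatches:  ", mismatches)
```

In this toy example, two Privacy papers move from Security to Self-direction when SVS is applied, which mirrors on a small scale the shift of 12 Privacy papers described above.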

Figure 7. Relevant publications per value category

It can be observed from Figure 7 that 80 papers (41% of the relevant publications) were classified as relevant to Security, which made Security the most prevalent value category. This was not hard to predict as Security is a well-recognized quality aspect of software, for which there is a great demand from stakeholders. The second and third most highly prevalent value categories were found to be Benevolence and Universalism, which constituted 20% and 16% of the relevant publications, respectively. On the other hand, no publications were found to be relevant to the categories Tradition, Stimulation, and Hedonism. Moreover, 8% of the relevant papers were classified under the category Holistic view, which does not exist in SVS – this category was introduced based on the raters’ feedback from the pilot study (Section 3.1) to account for publications that considered values in general.

4.2. Relevant Publications per Venue

To answer (RQ3), this section reports our findings on the distribution of values relevant to SE publications across SE venues. Figure 8 demonstrates, for each venue/track, the proportion of the relevant publications in 2015 – 2018.

The proportion of relevant publications in each venue/track. We observed (Figure 8) that the proportion of relevant publications in the SE journals, namely TOSEM (about 5%) and TSE (about 11%), is lower than the proportion of relevant publications in the main tracks of ICSE (about 18%) and FSE (about 13%), and significantly lower than the proportion of relevant papers in the SEIP (21%) and SEIS (about 81%) tracks of ICSE. In particular, the proportion of values relevant papers was significantly higher in SEIS. This is not surprising given the focus of the track.

Figure 8. Proportion of values relevant publications in SE venues/tracks. The labels on the bars denote the number of papers in each category.
Figure 9. Relevant publications per venue/track

The distribution of relevant publications by venue/track. Figure 9 demonstrates the distribution of relevant publications across the studied venues/tracks. Of all 216 publications that directly considered values (relevant publications), 58% were published in different tracks of ICSE: main track (33%), SEIS (14%), and SEIP (11%). The highest prevalence of relevant publications was seen in the main tracks of ICSE (33%) and FSE (30%). As such, it was concluded that about 88% of the publications that directly considered values were published in SE conferences: ICSE (58%) and FSE (30%). On the other hand, SE journals, TSE (11%) and TOSEM (1%), constituted only 12% of the relevant publications (Figure 9).
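The percentages quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only the rounded shares reported in this paragraph; the dictionary keys are labels chosen for illustration.

```python
# Share of relevant publications per venue/track, in percent (from the text).
share = {
    "ICSE main": 33,
    "ICSE SEIS": 14,
    "ICSE SEIP": 11,
    "FSE": 30,
    "TSE": 11,
    "TOSEM": 1,
}

icse = sum(v for k, v in share.items() if k.startswith("ICSE"))  # 58
conferences = icse + share["FSE"]                                # 88
journals = share["TSE"] + share["TOSEM"]                         # 12
total = sum(share.values())                                      # 100
```

The conference and journal shares sum to 100%, so the rounded figures are internally consistent.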

Figure 10. The distribution of publications relevant to different values by venue/track; relevant publications were found only for 23 out of 58 values in Schwartz Value Structure (Figure 1).

The distribution of relevant publications by values and venues. Figure 10 shows how the publications relevant to different values are distributed across different venues/tracks. We observed that only 23 out of 58 values in SVS (Figure 1) were considered. For some values, relevant publications were found across most venues/tracks; publications relevant to Helpful were found in 5 out of 6 venues/tracks. But for the majority of the considered values in Figure 10 (15 out of 23), the number of the venues/tracks that published papers relevant to those values did not exceed 2. For instance, publications relevant to Social justice and National security were found only in the main tracks of FSE and ICSE. Similarly, publications relevant to Enjoying life, Honoring of parents and elders, and A world at peace appeared only in the main track of ICSE. Publications relevant to certain values, e.g. Equality, Social justice, and Healthy, were present only in conference papers and not in journals. We further observed that for the majority of values (19 of 23 values in Figure 10), relevant publications were found in the main track of ICSE, while publications in TOSEM only considered Privacy.

Figure 11. Publications relevant to different value categories across SE venues/tracks.

The distribution of relevant publications by value categories and venues. Publications relevant to 7 out of 10 value categories in SVS (Figure 1) were found across different venues/tracks (Figure 11). We further found publications relevant to the category Holistic view, which was introduced based on the pilot study, as discussed in Section 3.1. Publications relevant to all these 8 value categories were found in the main tracks of FSE and ICSE (Figure 11). Also, publications relevant to Security were found in all SE venues. Moreover, publications that directly considered Benevolence and Universalism were found across most venues/tracks. Publications relevant to Universalism were more prevalent in the SEIS track of ICSE. However, publications in TOSEM considered only Security and no other value category. It was also interesting to see that, compared to other venues/tracks, the SEIS track of ICSE contained the highest proportion of publications relevant to Conformity.

4.3. Data Availability

The dataset that supports the findings of this study is available at https://figshare.com/s/7a8c55799584d8783cd6.

5. Discussion

We carried out this research to verify our hypothesis that SE research does not sufficiently consider human values. Our research findings confirm this intuition. The extent to which SE research ignores human values is significant: 1105 out of the selected 1350 papers (82%) were found not to be relevant to human values. Furthermore, out of 195 papers that do address values, 80 papers relate to Security. This is unsurprising, but it also illustrates that the lack of consideration of human values in SE is even more stark. Indeed, a majority of other human values (approximately 79%) are not adequately addressed in SE research.

The value of Helpful that relates to “preserving and enhancing the welfare of those with whom one is in frequent personal contact” (Table 1) was the highest classified among all (58) values. This suggests that SE research is often aimed at being helpful to the SE community – for example, by means of improving processes thus reducing development effort or removing development obstacles, or developing new tools and techniques to facilitate or improve certain practices or tasks. Our results indicate that only a small proportion of publications relate to individualistic-value categories (Hedonism, Achievement, Stimulation and Power) compared to group-value categories (Universalism, Conformity and Tradition). This verifies the tensions discussed in the Schwartz Values Structure about the competing and contradicting nature of these bipolar value categories (Schwartz, 1994).

It is important to note that SVS served as an appropriate yet not ideal scheme for classifying human values in SE. We discovered that SVS does not include some values commonly discussed in SE. For example, sustainability is a value that has received significant recent attention in SE, yet is not listed among the 58 SVS values. Since SVS originates in the social sciences, raters sometimes found it difficult to map certain SE values to SVS-prescribed value categories. This is likely due to the difference in meaning of values in different contexts (i.e., social sciences versus software engineering). Future work will look at how to adapt SVS to an SE context.

Without attempting to generalize, certain findings are worth mentioning here. For example, among the selected venues, ICSE has the most diverse range of values covered compared to others. In addition, there are certain values such as Wealth, Unity with nature, Social recognition, Honoring of parents and elders, Enjoying life, and A world at peace found in ICSE publications but not addressed in any other venue. It is difficult to attribute this to a trend in ICSE submissions or to the broad nature of ICSE. Similarly, for other venues, a broader and more comprehensive study is needed to discuss any trends.

6. Threats to Validity

In this section we discuss limitations of this research categorized as Internal, External and Construct validity threats.

Construct Validity: Choosing a classification scheme suited for the software engineering domain was one of the main challenges for this research. In the absence of an SE-specific scheme to classify human values, we selected the Schwartz Values Structure (SVS). SVS is a well-established theory for understanding human values in the social sciences. It has been successfully applied in Human-Computer Interaction (HCI) and Information and Communication Technologies (ICT) to study and explain human values (Thew and Sutcliffe, 2018). Using SVS as an independent classification scheme, instead of developing our own, mitigated the risk of introducing researcher bias.

Similar to Glass et al. (Glass et al., 2002), lack of mutual exclusion was a challenge for our classification scheme. It was often possible to classify a paper as relating to more than one individual value. We believe this had more to do with the ill-defined nature of human values than with a limitation of the chosen classification scheme. Still, the potential threat was mitigated by using an iterative process and conducting rater training to understand and clarify relationships between values and their categories.

In some cases, the raters found that certain papers related to human values in general rather than any particular value. Forcing such papers into a single value category would have influenced results. To mitigate this, we added a new Holistic view category. Some papers relating to Privacy were categorized under Security rather than Self direction. Raters, based on their understanding of Privacy in an SE context, considered it to be more relevant to Security than Self direction. This may have influenced the results. To mitigate this, we provided results for both rater preferred and SVS prescribed categories in Figure 6. Some common SE values were not found in SVS, such as Sustainability.

SVS may not be the ideal classification scheme for SE, and we expect further research on adapting SVS to the SE context to be useful.

Internal Validity threats for this study arise from the complexities of categorizing papers into the selected classification scheme. It is possible that the raters’ own expertise in understanding the scheme categories and definitions of values may have influenced paper classifications. This risk, however, was mitigated because the classification process enforced random assignment of each paper to two raters, and in case of a disagreement an independent arbiter was introduced to facilitate agreement. Some disagreements (2%, see Figure 2) remained even after the arbiter’s intervention. In such cases we did not force consensus.
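The double-rating procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' tooling: the function names `assign_raters` and `cohen_kappa` are hypothetical, and Cohen's kappa is shown only as one common way to quantify agreement between two raters; this section does not state which statistic was used.

```python
import random
from collections import Counter

def assign_raters(paper_ids, raters, seed=0):
    """Randomly assign two distinct raters to every paper."""
    rng = random.Random(seed)
    return {pid: tuple(rng.sample(raters, 2)) for pid in paper_ids}

def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters' label lists.

    Undefined (division by zero) when expected agreement is exactly 1.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Probability that both raters pick the same label by chance.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical example: four papers, three raters, binary relevance labels.
assignment = assign_raters(["p1", "p2", "p3", "p4"], ["r1", "r2", "r3"])
kappa = cohen_kappa(["rel", "rel", "not", "not"],
                    ["rel", "not", "not", "not"])
```

Papers on which the two assigned raters disagree would then be passed to the arbiter; that step is omitted here.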

External validity threats may arise from potential limitations of our choice of publication venues and of the time period under study (i.e., 2015-2018). The chosen venues are widely acknowledged as the top-tier venues of SE research; however, we accept that the results may be different if other, more specialist conferences/journals had been considered.

Generalizability of results based on a subset of papers is often a concern for empirical studies. In our research, this risk was mitigated by using 1350 papers published in the last 4 years which can be considered a good representation of trends in SE research as suggested in (Bertolino et al., 2018). The findings of this study, however, may be biased towards ICSE and FSE as they published more papers in the selected period compared to journals (ICSE 559, FSE 512 vs. TSE 215 and TOSEM 64).
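The size of the possible bias towards conferences can be read off the paper counts given above. The following sketch uses only those four numbers; the variable names are chosen for illustration.

```python
# Papers per venue in the selected period (from the text).
papers = {"ICSE": 559, "FSE": 512, "TSE": 215, "TOSEM": 64}

total = sum(papers.values())
conference_share = (papers["ICSE"] + papers["FSE"]) / total
journal_share = (papers["TSE"] + papers["TOSEM"]) / total

print(f"total papers: {total}")
print(f"conferences: {conference_share:.1%}, journals: {journal_share:.1%}")
```

The counts add up to the 1350 papers in the study, and roughly four in five of them come from the two conferences.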

While a detailed review of the entire papers (rather than just the abstract, title and keywords) could have provided more accurate results, we adopted a procedure similar to those used in previous studies (Shaw, 2003; Bertolino et al., 2018). The time required for reliable classification means that reading all 1350 papers is infeasible.

7. Related Work

Classification of papers has been widely adopted in the SE literature (Shaw, 2003; Systä et al., 2012; Montesi and Lago, 2008; Vessey et al., 2002) as a way of providing insights on trends and directions in SE research. Such findings, though not conclusive, can indicate the general attitude of SE researchers as well as the priorities in SE research. Paper classification helps to highlight the gaps and the needs for further research in specific SE domains. Mary Shaw (Shaw, 2003), for instance, analyzed the abstracts of research papers submitted and accepted to ICSE 2002 to identify different research types as well as the trends in research question types, contribution types and validation approaches. The author also studied the program committee discussions regarding the acceptance or rejection of the papers. Another example is the work by Vessey et al. (Vessey et al., 2002): to report their findings, the authors categorized samples of SE papers published from 1995 to 1999 in six journals based on topic, method, and approach.

However, paper classification methods rely on classification schemes that can be general or specific depending on the purpose of the classification. To classify different SE papers, Montesi and Lago (Montesi and Lago, 2008) presented a paper classification approach based on the call for papers of top-tier SE conferences and journals included in the Journal Citation Reports and the instructions to authors of relevant journals and published works. Also, Ioannidis et al. (Ioannidis et al., 2015) categorized the meta-research discipline into five main thematic fields corresponding to how to conduct, report, verify, correct and reward science. There have also been efforts to develop specific classification schemes. For instance, Wieringa et al. (Wieringa et al., 2006) developed a classification scheme to identify papers that belong to Requirements Engineering as a subdomain in SE. Sjøberg et al. (Sjøberg et al., 2005) surveyed SE papers in nine journals and three conferences from 1993 to 2002 with the aim of characterizing controlled experiments in SE in terms of the topics of the experiments and their subjects, tasks, and environments.

Moreover, some paper classifications have identified gaps in SE practice. An example is the work by Stol and Fitzgerald (Stol and Fitzgerald, 2015), where the authors observed the lack of a holistic view in SE research. The work contributed a framework for positioning a holistic set of research strategies and showed its strengths and weaknesses in relation to various research components. Also, Zelkowitz and Wallace (Zelkowitz and Wallace, 1997) classified, according to a 12-model classification scheme, around 600 SE papers published over a period of three years to provide insights on the use of experimentation within SE. They identified a gap in SE research with respect to validation and experimentation. Another example is an empirical study of SE papers performed by Zannier et al. (Zannier et al., 2006) to investigate the improvement of the quantity and quality of empirical evaluations conducted within ICSE papers over time. The authors compared a random sample of papers in two periods, 1975 – 1990 and 1991 – 2005, and found that the quantity of empirical evaluation has grown, but the soundness of evaluation has not grown at the same pace.

Last but not least, some paper classifications have provided insights on SE venues in relation to the papers published in those venues. An example is the work by Systä et al. (Systä et al., 2012) that investigated the turnover of PC compositions and paper publication in six SE conferences. The work was later extended by Vasilescu et al. (Vasilescu et al., 2014) by proposing a wider collection of metrics to assess the health of 11 SE conferences over a period of more than 10 years.

8. Conclusions and Future Work

Repeated incidents of software security and privacy violations continue to attract researchers’ attention. In this paper, however, we investigated the prevalence of a broader range of human values, including Trust, Equality and Social justice, in software engineering research. Using the Schwartz Values Structure as our classification scheme, we classified 1350 recently published (2015–2018) papers in top-tier SE conferences and journals. We conclude that only a small proportion of SE research considers human values. While Security, as a value category, and Privacy, as a specific value, stand out as the main focus in SE research, few other human values, such as Helpful, Protecting the environment and Social justice, are considered. A broad range of human values remains inadequately addressed in SE research. Finally, we found that SE conferences publish more values-relevant research than SE journals.

In future, we would like to extend this study using a machine learning approach. Manually labelled data from this study could be used for training machine learning algorithms to classify larger sets of publications with the aim to better visualize how SE research addresses human values. We also plan to utilise our manually labelled data captured from various SE contexts to develop definitions of human values that are relatively easy for practitioners to understand and implement. Finally, we plan to carry out case studies in software organizations to investigate whether SE research related to human values has actually made an impact on SE practice.
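As a rough illustration of what such a classifier could look like, the sketch below trains a small multinomial naive Bayes model on labelled titles. It is not the authors' planned approach: the choice of algorithm, the function names (`tokenize`, `train`, `predict`), and the four training titles are all assumptions made for this example, and a real study would train on the manually labelled dataset instead.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lower-case the text and keep purely alphabetic tokens."""
    return [w for w in text.lower().split() if w.isalpha()]

def train(labelled):
    """Build per-category word counts from (text, category) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in labelled:
        tokens = tokenize(text)
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, label_counts, vocab

def predict(model, text):
    """Return the category with the highest log-posterior score."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokenize(text):
            if tok in vocab:  # ignore words never seen in training
                # Laplace (add-one) smoothing avoids zero probabilities.
                score += math.log((word_counts[label][tok] + 1)
                                  / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labelled titles standing in for the manually labelled data.
model = train([
    ("detecting privacy leaks in mobile apps", "Security"),
    ("static analysis for vulnerability detection", "Security"),
    ("energy efficient refactoring for mobile apps", "Universalism"),
    ("onboarding support for newcomers in open source projects", "Benevolence"),
])
label = predict(model, "privacy vulnerability analysis")
```

With so little training text the model only demonstrates the mechanics; classification quality on real abstracts would need to be evaluated against held-out manual labels.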

References

  • Aldewereld et al. (2015) Huib Aldewereld, Virginia Dignum, and Yao-hua Tan. 2015. Design for values in software development. Handbook of Ethics, Values, and Technological Design: Sources, Theory, Values and Application Domains (2015), 831–845.
  • Angwin et al. (2016) Julia Angwin, Jeff Larson, Lauren Kirchner, and Surya Mattu. 2016. Machine Bias. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing.
  • Baker (2019) Nick Baker. 2019. Molly Russell: Instagram bans graphic self-harm images after suicide of UK teen. https://www.sbs.com.au/news/molly-russell-instagram-bans-graphic-self-harm-images-after-suicide-of-uk-teen
  • Bertolino et al. (2018) Antonia Bertolino, Antonello Calabrò, Francesca Lonetti, Eda Marchetti, and Breno Miranda. 2018. A categorization scheme for software engineering conference papers and its application. Journal of Systems and Software 137 (2018), 114–129. https://doi.org/10.1016/j.jss.2017.11.048
  • Cath et al. (2018) Corinne Cath, Sandra Wachter, Brent Mittelstadt, Mariarosaria Taddeo, and Luciano Floridi. 2018. Artificial intelligence and the ’good society’: the US, EU, and UK approach. Science and engineering ethics 24, 2 (2018), 505–528.
  • Cheng and Fleischmann (2010) An-Shou Cheng and Kenneth R Fleischmann. 2010. Developing a meta-inventory of human values. In Proceedings of the 73rd ASIS&T Annual Meeting on Navigating Streams in an Information Ecosystem-Volume 47. American Society for Information Science, 3.
  • Dave (2018) Paresh Dave. 2018. Google bars uses of its artificial intelligence tech in weapons. https://www.reuters.com/article/us-alphabet-ai/google-bars-uses-of-its-artificial-intelligence-tech-in-weapons-idUSKCN1J32M7.
  • Dijk et al. (2017) Roel van Dijk, Christophe Creeten, Jeroen van der Ham, and Jeroen van den Bos. 2017. Model-driven software engineering in practice: privacy-enhanced filtering of network traffic. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. ACM, 860–865.
  • Etzioni and Etzioni (2017) Amitai Etzioni and Oren Etzioni. 2017. Incorporating ethics into artificial intelligence. The Journal of Ethics 21, 4 (2017), 403–418.
  • Ferrario et al. (2016) Maria Angela Ferrario, Will Simm, Stephen Forshaw, Adrian Gradinar, Marcia Tavares Smith, and Ian Smith. 2016. Values-first SE: research principles in practice. In Proceedings of the 38th International Conference on Software Engineering Companion. ACM, 553–562.
  • Ferrario et al. (2014) Maria Angela Ferrario, Will Simm, Peter Newman, Stephen Forshaw, and Jon Whittle. 2014. Software Engineering for ’Social Good’: Integrating Action Research, Participatory Design, and Agile Development. In Companion Proceedings of the 36th International Conference on Software Engineering (ICSE Companion 2014). ACM, New York, NY, USA, 520–523. https://doi.org/10.1145/2591062.2591121
  • Flanagan et al. (2005) Mary Flanagan, Daniel C Howe, and Helen Nissenbaum. 2005. Values at play: Design tradeoffs in socially-oriented game design. In Proceedings of the SIGCHI conference on human factors in computing systems. ACM, 751–760.
  • Friedman (1996) Batya Friedman. 1996. Value-sensitive design. interactions 3, 6 (1996), 16–23.
  • Friedman and Kahn Jr (2007) Batya Friedman and Peter H Kahn Jr. 2007. Human values, ethics, and design. In The human-computer interaction handbook. CRC Press, 1223–1248.
  • Glass et al. (2002) Robert L. Glass, Iris Vessey, and Venkataraman Ramesh. 2002. Research in software engineering: an analysis of the literature. Information and Software technology 44, 8 (2002), 491–506.
  • Gralla (2016) Preston Gralla. 2016. Amazon Prime and the racist algorithms. https://www.computerworld.com.au/article/599661/amazon-prime-racist-algorithms.
  • Holmes et al. (2011) Tim Holmes, Elena Blackmore, Richard Hawkins, and Tom Wakeford. 2011. The common cause handbook. Public Interest Research Center.
  • Ioannidis et al. (2015) John PA Ioannidis, Daniele Fanelli, Debbie Drake Dunne, and Steven N Goodman. 2015. Meta-research: evaluation and improvement of research methods and practices. PLoS biology 13, 10 (2015), e1002264.
  • Landis and Koch (1977) J Richard Landis and Gary G Koch. 1977. The measurement of observer agreement for categorical data. biometrics (1977), 159–174.
  • Montesi and Lago (2008) Michela Montesi and Patricia Lago. 2008. Software engineering article types: An analysis of the literature. Journal of Systems and Software 81, 10 (2008), 1694–1714.
  • Morales et al. (2018) Rodrigo Morales, Rubén Saborido, Foutse Khomh, Francisco Chicano, and Giuliano Antoniol. 2018. Earmo: An energy-aware refactoring approach for mobile apps. IEEE Transactions on Software Engineering 44, 12 (2018), 1176–1206.
  • Mougouei et al. (2018) Davoud Mougouei, Harsha Perera, Waqar Hussain, Rifat Shams, and Jon Whittle. 2018. Operationalizing human values in software: a research roadmap. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering - ESEC/FSE 2018. 780–784. https://doi.org/10.1145/3236024.3264843
  • Riedl and Harrison (2016) Mark O Riedl and Brent Harrison. 2016. Using stories to teach human values to artificial agents. In Workshops at the Thirtieth AAAI Conference on Artificial Intelligence.
  • Riesco and Ogata (2018) Adrián Riesco and Kazuhiro Ogata. 2018. Prove It! Inferring Formal Proof Scripts from CafeOBJ Proof Scores. ACM Trans. Softw. Eng. Methodol. 27, 2, Article 6 (July 2018), 32 pages. https://doi.org/10.1145/3208951
  • Rokeach (1973) Milton Rokeach. 1973. The nature of human values. Free press.
  • Sablich (2017) Justin Sablich. 2017. ’Price Gouging’ and Hurricane Irma: What Happened and What to Do. https://www.nytimes.com/2017/09/17/travel/price-gouging-hurricane-irma-airlines.html.
  • Schwartz (1992) Shalom H Schwartz. 1992. Universals in the content and structure of values: Theoretical advances and empirical tests in 20 countries. In Advances in experimental social psychology. Vol. 25. Elsevier, 1–65.
  • Schwartz (1994) Shalom H Schwartz. 1994. Are there universal aspects in the structure and contents of human values? Journal of social issues 50, 4 (1994), 19–45.
  • Schwartz (2005) Shalom H Schwartz. 2005. Basic human values: Their content and structure across countries. Valores e comportamento nas organizações (2005), 21–55.
  • Schwartz (2006) Shalom H Schwartz. 2006. Les valeurs de base de la personne: théorie, mesures et applications. Revue française de sociologie 47, 4 (2006), 929–968.
  • Schwartz (2007) Shalom H Schwartz. 2007. Basic human values: Theory, methods, and application. Risorsa Uomo (2007).
  • Schwartz (2012) Shalom H. Schwartz. 2012. An Overview of the Schwartz Theory of Basic Values. Online Readings in Psychology and Culture 2, 1 (2012), 12–13. https://doi.org/10.9707/2307-0919.1116
  • Schwartz and Boehnke (2004) Shalom H Schwartz and Klaus Boehnke. 2004. Evaluating the structure of human values with confirmatory factor analysis. Journal of research in personality 38, 3 (2004), 230–255.
  • Shaw (2003) Mary Shaw. 2003. Writing good software engineering research papers. In Software Engineering, 2003. Proceedings. 25th International Conference on. IEEE, 726–736.
  • Sjøberg et al. (2005) Dag IK Sjøberg, Jo Erskine Hannay, Ove Hansen, Vigdis By Kampenes, Amela Karahasanovic, N-K Liborg, and Anette C Rekdal. 2005. A survey of controlled experiments in software engineering. IEEE transactions on software engineering 31, 9 (2005), 733–753.
  • Smith (2018) David Smith. 2018. Zuckerberg put on back foot as House grills Facebook CEO over user tracking. https://www.theguardian.com/technology/2018/apr/11/zuckerberg-hearing-facebook-tracking-questions-house-back-foot
  • Steinmacher et al. (2016) Igor Steinmacher, Tayana Uchoa Conte, Christoph Treude, and Marco Aurélio Gerosa. 2016. Overcoming Open Source Project Entry Barriers with a Portal for Newcomers. In Proceedings of the 38th International Conference on Software Engineering (ICSE ’16). ACM, New York, NY, USA, 273–284. https://doi.org/10.1145/2884781.2884806
  • Stol and Fitzgerald (2015) Klaas-Jan Stol and Brian Fitzgerald. 2015. A holistic overview of software engineering research strategies. In Proceedings of the Third International Workshop on Conducting Empirical Studies in Industry. IEEE Press, 47–54.
  • Systä et al. (2012) Tarja Systä, Maarit Harsu, and Kai Koskimies. 2012. Inbreeding in software engineering conferences.
  • Thew and Sutcliffe (2018) Sarah Thew and Alistair Sutcliffe. 2018. Value-based requirements engineering: method and experience. Requirements Engineering 23, 4 (2018), 443–464.
  • Vasilescu et al. (2014) Bogdan Vasilescu, Alexander Serebrenik, Tom Mens, Mark GJ van den Brand, and Ekaterina Pek. 2014. How healthy are software engineering conferences? Science of Computer Programming 89 (2014), 251–272.
  • Vessey et al. (2002) Iris Vessey, Venkataraman Ramesh, and Robert L Glass. 2002. Research in information systems: An empirical study of diversity in the discipline and its journals. Journal of Management Information Systems 19, 2 (2002), 129–174.
  • Wieringa et al. (2006) Roel Wieringa, Neil Maiden, Nancy Mead, and Colette Rolland. 2006. Requirements engineering paper classification and evaluation criteria: a proposal and a discussion. Requirements Engineering 11, 1 (2006), 102–107.
  • Zannier et al. (2006) Carmen Zannier, Grigori Melnik, and Frank Maurer. 2006. On the success of empirical studies in the international conference on software engineering. In Proceedings of the 28th international conference on Software engineering. ACM, 341–350.
  • Zelkowitz and Wallace (1997) Marvin V Zelkowitz and Dolores Wallace. 1997. Experimental validation in software engineering. Information and Software Technology 39, 11 (1997), 735–743.