Psychometrics in Behavioral Software Engineering: A Methodological Introduction with Guidelines

05/20/2020 ∙ by Daniel Graziotin, et al. ∙ University of Stuttgart and Chalmers University of Technology

Humans are what constitutes the most complex and complicated, yet fascinating, component of a software engineering endeavor. A meaningful and deep understanding of the human aspects of software engineering requires psychological constructs to be taken into account. We argue that psychology and statistics theory can facilitate the development and adoption of valid and reliable instruments to assess these constructs. In particular, to ensure high quality, the psychometric properties of measurement instruments need evaluation. In this paper, we provide an introduction to psychometric theory for the evaluation of measurement instruments (e.g., psychological tests and questionnaires) for software engineering researchers. We present guidelines that enable using existing instruments and developing new ones adequately. We conducted a comprehensive review of the psychology literature, including journal articles, textbooks, and society standards, framed by the Standards for Educational and Psychological Testing (American Educational Research Association et al., 2014). We detail activities used when operationalizing new psychological constructs, such as item analysis, factor analysis, standardization and normalization, reliability, validity, and fairness in testing and test bias. With this paper, we hope to encourage a culture change in software engineering research towards the adoption of established methods from social science. To improve the quality of behavioral research in software engineering, we believe that studies focusing on introducing, validating, and then using psychometric instruments need to become more common. Finally, we present an example of a psychometric evaluation based on our guidelines, for which we openly provide code and a dataset.


1. Introduction

Software is developed for people, by people. For decades we have recognized that no matter the size and importance of the technical side of software engineering, it is humans that ultimately drive the underlying processes and produce the desired artifacts (Weinberg, 1971). Software engineers are knowledge workers, with knowledge as their main capital (Swart and Kinnie, 2003). They need to construct, retrieve, model, aggregate, and present knowledge in all their analytic and creative daily activities (Rus et al., 2002). Operations related to knowledge are cognitive in nature, and cognition is influenced by characteristics of human behavior, including personality, affect, and motivation (Hilgard, 1980). It is no wonder that industry and academia have explored psychological aspects of software development and the assessment of psychological constructs at the individual, team, and organization level (Feldt et al., 2008; Lenberg et al., 2015).

Psychological assessment is the gathering of psychology-related data towards an evaluation that is accomplished through the use of tools such as tests, interviews, case studies, behavioral observation, and other procedures (Cohen et al., 1995). Here, we are primarily focused on psychological tests, i.e., quantitative research (we are interested in qualitative research as well; we have offered our proposal for guidelines for qualitative behavioral software engineering elsewhere (Lenberg et al., 2017a)), as behavioral software engineering has turned much of its attention to employing theory and measurement instruments from psychology (Capretz, 2003; Feldt et al., 2008; Lenberg et al., 2015; Graziotin et al., 2015c; Cruz et al., 2015).

Psychological tests are instruments (e.g., questionnaires) used to measure unobserved constructs (Cohen et al., 1995). We cannot assess these constructs directly, as we would when measuring the source lines of code or specific properties of a UML diagram. Hence, the underlying variables are called latent variables (Nunnally, 1994). Examples of such unobserved constructs include attitude, mood, happiness, job satisfaction, commitment, motivation, intelligence, soft skills, abilities, and performance. We need to create valid and reliable measurement instruments (in this paper, we use the terms psychological test, measurement instrument, and questionnaire interchangeably) for the assessment of such constructs. To ensure a systematic and sound development of psychological tests and their interpretation, the field of psychometrics was born already in the 1930s and 1940s, with roots going even further back (Rust, 2009; Nunnally, 1994).

Psychometrics is the development of measurement instruments and the assessment of whether these instruments are reliable and valid forms of measurement (Ginty, 2013). Psychometrics is also the branch of psychology and education which is devoted to the construction of valid and reliable psychological tests (Rust, 2009).

Psychological testing is one of the most important contributions of psychology to our society. Proper development and validation of tests results in better decisions about individuals, while, on the opposite end, improper development and validation of tests might result in invalid results, economic loss, and even harm to individuals (American Educational Research Association et al., 2014). The individual level also affects teams and, ultimately, organizations; even when these are in focus, measuring constructs that concern them is typically built up from measurements at the individual level (e.g., organizational readiness for change, at the organization level, is computed from aggregated responses of individuals who work at the organization (Shea et al., 2014)). Personality assessment is a classic example of psychological testing in personnel selection, which has been employed by companies of all kinds, including those related to information technology (Darcy and Ma, 2005; Wyrich et al., 2019). Improper development, administration, and handling of psychological tests could harm the company by hiring a non-desirable person, and it could harm the interviewee because of missed opportunities.

We see psychometrics as a key missing link for strong methodology in behavioral software engineering

We believe that solid theoretical and methodological foundations should be the first step when designing any measurement instrument. The reality, however, is that not all tests are well developed in psychology (American Educational Research Association et al., 2014), and software engineering research, especially when studying psychological constructs, is far from adopting rigorous and validated research artifacts.

1.1. Abuse and misuse of psychological tests in software engineering research

Already in 2007, McDonald and Edwards (2007) subtitled their paper "Examining the use and abuse of personality tests in software engineering." The authors anticipated the issue that we attempt to address in the present manuscript, that is, "the lack of progress in this [personality research in software engineering] field is due in part to the inappropriate use of psychological tests, frequently coupled with basic misunderstandings of personality theory by those who use them" (p. 67). While their focus was on personality, their concerns apply more broadly to other psychometric instruments and constructs.

Instances of misconduct can, for example, be observed in the results of a systematic literature review of personality research in software engineering by Cruz et al. (2015). (We believe that direct accusations bring no value to our contribution and are counterproductive to our advancement of knowledge, so we discuss resources that point to specific issues rather than specific papers or authors.) We note in the results of Cruz et al. (2015) that 48% of personality studies in software engineering have employed the Myers-Briggs Type Indicator (MBTI) questionnaire, which has been shown to possess low to no reliability and validity (Pittenger, 1993), up to the point of being called "little more than an elaborate Chinese fortune cookie" (Hogan, 2017). Feldt and Magazinius (2010) similarly pointed out deficiencies of the MBTI and proposed and used an alternative (IPIP) with more empirical support in the psychological literature.

Feldt et al. (2008) have argued in favor of systematic studies of human aspects of software engineering, and more specifically for adopting measurement instruments coming from psychology and related fields. Graziotin et al. (2015a) echoed the call seven years later but found that research on the affect of software developers had been threatened by a deep misunderstanding of the related constructs and of how to assess them. In particular, the authors noted that peers in software engineering tend to confuse affect-related psychological constructs such as emotions and moods with related, yet different, constructs such as motivation, commitment, and well-being.

Lenberg et al. (2015) have conducted a systematic literature review of studies of human aspects in software development and engineering that made use of behavioral science, calling the field behavioral software engineering. Among their results, they found that software engineering research is threatened by several knowledge gaps when performing behavioral research, and that there have been very few collaborations between software engineering and behavioral science researchers.

Graziotin et al. (2015c), meanwhile, extended their prior observations on affect to a broader view of software engineering research with a psychological perspective. Given the observation that much research in the field has misinterpreted (or even ignored) validated measurement instruments from psychology, the work offered what we can consider the starting point for the present article, that is, brief guidelines to select a theoretical framework and validated measurement instruments from psychology. Graziotin et al. (2015c) called the field "psychoempirical software engineering" but later agreed with Lenberg et al. (2015) to unify the vision under "behavioral software engineering". Hence, the present collaboration.

Our previous studies have also reported that, when a validated test from psychology is adopted by software engineering researchers, its items are often modified, which destroys its reliability and validity properties. Properly adopting an existing instrument thus includes a thorough evaluation of the psychometric properties of candidate instruments. Gren and Goldman (2016) have argued in favor of "useful statistical methods for human factors research in software engineering" (title), which include underused methods such as test-retest reliability, Cronbach's α, and exploratory factor analysis, all of which are covered in this paper. Gren (2018) has also offered a psychological test theory lens for characterizing validity and reliability in behavioral software engineering research, further reinforcing our view that software engineering research that investigates any psychological construct should maintain fair psychometric properties. We agree with Gren (2018) that we should "change the culture in software engineering research from seeing tool-constructing as the holy grail of research and instead value [psychometric] validation studies higher" (p. 3).

A mea culpa works better than a j'accuse in further building our case, so we bring a negative example from one of our previous studies. As reported in a very recent work by Ralph et al. (2020) (which we appreciate in the next paragraph), "there is no widespread consensus about how to measure developers' productivity or the main antecedents thereof. Many researchers use simple, unvalidated productivity scales" (p. 6). In one of the earliest works by the first author of the present paper (Graziotin et al., 2015b), which was published well after it was conducted, we compared the affect triggered by a software development task with the self-assessed productivity of individual programmers. While we were very careful to select a validated measurement instrument of emotions and to highlight how self-assessment of productivity converges to objective assessment of productivity, we used a single Likert item to represent productivity. This choice was made to keep the measurement instrument, which had to be administered every ten minutes, as short as possible. While the results of the study are not invalidated by this choice, the productivity scale itself was not validated, making the results, and thus our interpretation of them, less valuable from a psychometric perspective. The study was also (successfully) independently replicated twice by two ICSE papers, which suffer from the same unfortunate choice.

We wish to refrain from being overly negative. The field of software engineering does have positive cases (excluding those from the present authors) that we can showcase here. For example, Fagerholm and Pagels (2014) developed a questionnaire on lean and agile values and applied psychometric approaches to inspect the structure of value dimensions. Fagerholm (2015) has also embodied psychometric approaches in his PhD dissertation by analyzing the validity of the constructs he studied. A more recent example is Ralph et al. (2020), who analyzed through a questionnaire the effects of the COVID-19 pandemic on developers' well-being and productivity. The authors constructed their measurement instrument by incorporating psychometrically validated scales on constructs such as perceived productivity, disaster preparedness, fear and resilience, ergonomics, and organizational support. Furthermore, they employed confirmatory factor analysis (which we touch upon in the present paper) to verify that the included items do indeed cluster and converge into the factors they are claimed to converge to.

Filling the knowledge gap: introduction and guidelines to psychometric evaluation for behavioral software engineering research

Overall, we argue that one thing that is missing is an introduction to the field of psychometrics for behavioral software engineering researchers. Such an introduction can help improve the understanding of the available measurement instruments and, also, the development of new tests, allowing researchers as well as practitioners to explore the human component in the software construction process more accurately.

1.2. Objective

Our overall objective is to address the lack of understanding and use of psychometrics in behavioral software engineering research.

We also hope to increase software engineering researchers' awareness of, and respect for, theories and tools developed in established fields of behavioral science, towards stronger methodological foundations of behavioral software engineering research. While we have previously argued for a similar approach to qualitative research, we here complement it with a focus on quantitative research.

1.3. Contribution

With this paper, we contribute to the (behavioral) software engineering body of knowledge with a set of guidelines that enables a better understanding of psychological constructs in research activities. This improvement in research quality is achieved by either (1) reusing psychometrically validated measurement instruments, as well as understanding why and how they are validated, or, if no such questionnaires exist, (2) developing new psychometrically validated questionnaires that are better suited for the software engineering domain.

Our contribution is enabled by offering one theoretical deliverable and one practical companion deliverable.

  1. We offer a review and synthesis of psychometric guidelines in the form of textbooks, review papers, and empirical studies.

  2. We offer a hands-on counterpart to our review by providing a fully reproducible implementation of our guidelines as R Markdown.

The Standards for Educational and Psychological Testing (SEPT, American Educational Research Association et al. (2014)) is a set of gold standards in psychological testing jointly developed by the American Psychological Association (APA), the National Council on Measurement in Education (NCME), and the American Educational Research Association (AERA). The book defines areas and standards that should be met when developing, validating, and administering psychological tests. We adopted SEPT as a framework to guide the construction of this paper, to ensure that the standards are met and that the various other references are framed in the correct context.

Additionally, we organized the scoping of the paper by comparing related work from the field of psychology. While the present paper is not a systematic literature review or a mapping study (the discipline is so broad that entire textbooks have been written on it), we systematically framed its construction to ensure that all important topics were covered.

1.4. Scope

Several authors, e.g., Crocker and Algina (2006); Singh et al. (2016); Rust (2009), have proposed different phases for the psychometric development and evaluation of measurement instruments. Through our review, we identified the phases that we summarize visually in Figure 1 and outline as follows.

Figure 1. Steps for developing a psychological test. Phases with a dark background are uncommon in software engineering research and are covered in the present paper.
  1. Identification of the primary purpose for which the test scores will be employed.

  2. Identification of constructs, traits, and behaviors that are reflected by the purpose of the instrument.

  3. Development of a test specification, delineating the proportion of items that should focus on each type of construct, trait, and behavior of the test.

  4. Construction of an initial pool of items.

  5. Review of the items.

  6. Execution of a pilot test with the revised items.

  7. Execution of an item analysis to possibly reduce the number of items.

  8. Execution of an exploratory factor analysis to possibly reduce and group items into components or factors.

  9. Execution of a field test of the items with a larger, representative sample.

  10. Determination of statistical properties of item scores.

  11. Design and execution of reliability studies.

  12. Design and execution of validity studies.

  13. Evaluation of fairness in testing and test bias.

  14. Development of guidelines for administering, scoring, and interpreting test scores.

We focus mainly on the second half of the psychometric activities (those with a dark background in Figure 1), as they are the most challenging and are usually not covered in software engineering research. The first half of the activities, on questionnaire design, is covered by existing literature in software engineering (e.g., Ji et al., 2008; Ciolkowski et al., 2003; Molléri et al., 2016; Kitchenham and Pfleeger, 2008; Wagner et al., 2020) and psychology research (e.g., Collins, 2003; Sutton et al., 2003; Schwarz and Oyserman, 2016; Oppenheim, 1992).

We do want to note one aspect of our method that can be seen as a limitation: there is much current discussion about the statistical methods that are and/or should be applied in behavioral and social science, including psychology (Wagenmakers et al., 2018; Schad et al., 2019), as well as in the applied sciences in general (Wasserstein et al., 2019). This discussion has also affected software engineering; for example, a recent paper argued for transitioning to Bayesian statistical analysis in empirical software engineering (Furia et al., 2019). However, it is too early to base guidelines on proposals in this ongoing scholarly discussion, since there is not yet a clear consensus. Since we base our review on the current and more established literature, it is likely that future work will need to consider more powerful and up-to-date statistical methods for the creation and assessment of psychometric instruments. We thus foresee future updates to this paper that extend it with such more recent analysis methods.

As a final note, the present paper, as well as any psychometric construction of measurement instruments, is not a checklist. A psychometric evaluation does not include all elements reported in this paper, as many facets of psychometrics are influenced by the research questions, study design, and data at hand. Yet, a proper psychometric evaluation requires a consideration of all elements reported in the present paper.

1.5. Structure

After a brief introduction to the key concepts of psychometrics (section 2), which are required to understand the rest of the paper, we focus on test construction in psychometrics, namely item review and analysis (section 3), factor analysis (section 4), statistical properties (section 5), reliability (section 6), validity (section 7), and fairness in testing and test bias (section 8). The paper ends with our recommendations for further reading (section 9) and a hands-on running example (section 10) of a psychometric evaluation. We provide the R code and generated datasets openly (Graziotin et al., 2020).

2. Concepts

This section provides an overview of basic terms and concepts from psychometrics that will enable an understanding of all remaining sections. In particular, we clarify psychometric models and test types (and types of testing), as these will sometimes influence the statistical methods and lens to adopt when designing and evaluating a measurement instrument.

2.1. Building blocks

The fundamental idea behind psychological testing is that what is being assessed is not a physical quantity, such as height or weight. Rather, we are attempting to assess a construct, that is, a hypothetical entity (American Educational Research Association et al., 2014; Rust, 2009). If we assess the job satisfaction of a software developer, we are not directly measuring the satisfaction of the individual. Instead, we compare the developer's score with other developers' scores or with a set of established norms for job satisfaction. When comparing satisfaction scores between developers, we are limited to seeing how the scores differentiate between satisfied and unsatisfied developers according to the knowledge and ideas we have about satisfied and dissatisfied individuals.

There are two common models of psychometrics, namely functionalist and trait (Rust, 2009). Functionalist psychometrics often occurs in educational and occupational tests; it holds that the design of a test is determined by its application rather than by the constructs being measured (Rust, 2009; Green, 2009). For functionalist design, a good test is one that is able to distinguish between individuals who perform well and individuals who perform less well on a job or in school activities. This is also called local criterion-based validity (explained in section 7). The functionalist paradigm can be applied to most cases where a performance assessment or an evaluation is required.

Trait psychometrics attempts to address notions such as human intelligence, personality, and affect scientifically (Rust, 2009). The classic trait approach was based on the notion that, for example, intelligence is related to biological individual differences, and trait psychometric tests aimed to measure traits that would represent biological differences among people (Rust, 2009).

No matter the differences between the two schools of thought, they have several aspects in common, including test construction and validation methods (which differ in how validity is seen), and they are linked by the theory of true scores (Hambleton et al., 1991). The theory of true scores, or latent trait theory, is governed by formulas of the form:

X = T + E    (1)

where X is the observed score, T is the true score, and E is the error. There are three assumptions in the theory of true scores: (1) all errors E are random and normally distributed, (2) true scores T are uncorrelated with the errors, and (3) different measures of the observed score on the same participants are independent from each other. Besides all issues that come with the three assumptions, the theory has been criticized, the major point being that there is arguably no such thing as a true score, and that all that tests can measure are abstractions of psychological constructs (Loevinger, 1957).

Elaborations and re-interpretations of the theory of true scores have been proposed, among which is the statistical true score (Carnap, 1962). The statistical true score defines the true score as the score we would obtain by averaging an infinite number of measures from the same individual. With an infinite number of measures, the random errors cancel each other out, leaving only the true score T. The statistical form of the theory of true scores should not be completely new to readers in software engineering, as most quantitative methods in use in our field nowadays are based on it. The statistical interpretation of the theory of true scores applies both to trait and functionalist psychometrics. A difference lies in generalization: functionalist tests can only be specific to a certain context, while trait tests attempt to generalize to an overall construct present in a group of individuals.
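To make the statistical true score concrete, the following is a minimal simulation sketch in R; the true score, error spread, and number of repeated measures are all assumed values for illustration.

```r
# Minimal sketch (assumed values): repeated measures of one individual are
# generated as X = T + E; their average approaches the true score T.
set.seed(1)
true_score <- 7                              # hypothetical true happiness score
errors <- rnorm(10000, mean = 0, sd = 2)     # random, normally distributed errors
observed <- true_score + errors              # observed scores X = T + E
mean(observed)                               # close to 7: the errors cancel out on average
```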

2.2. Test types and types of testing

Items on psychological tests can be knowledge-based or person-based (Rust, 2009). Knowledge-based tests assess whether an individual performs well with regard to the knowledge of certain information, including possessing skills favoring performance or quality in knowledge-based tasks. For a software engineering example, debugging skills would be assessed by a knowledge-based test. Person-based tests, on the other hand, assess typical performance towards a construct, that is, how individuals typically behave or represent themselves. Examples of constructs in person-based tests include personality, mood, and attitudes. The personalities of programmers in pair programming settings would be assessed by a person-based test.

Knowledge-based tests are usually uni-dimensional, as they gravitate towards the notion of possessing or not possessing certain knowledge. We can also easily rank individuals on their scores and state who ranks better. Person-based tests are usually multi-dimensional and do not allow a direct ranking of individuals without some assumptions. For example, a developer could score high on extroversion. A high score on extroversion does not make a developer with a lower extroversion score a "worse" developer in any way.

A second distinction is between criterion-referenced and norm-referenced testing (Glaser, 1963). Criterion-referenced tests are constructed with reference to performance on a-priori defined values for establishing excellence (Glaser, 1963; Berger, 2013a).

Continuing with the example on debugging skills, a criterion-referenced test would assess, with a score from 0 to 10, whether a developer is able to open a debugging tool and use its ten basic functionalities. A score of 10 out of 10 would mean that the developer is able to debug software.

Norm-referenced tests lack a-priori defined scores. What constitutes a high score is in relation to how everyone else scores. A test for assessing the happiness of software developers will return scores for each participant. The test itself will have a theoretical range, say from a minimum value for strong unhappiness to a maximum value for strong happiness. When a developer scores above the midpoint of our happiness scale, all we can say is that the developer is rather happy than unhappy. If, in addition, we know the average score and standard deviation of software developers, and the developer's score lies well above that average, then we do know that the developer is quite a happy one. The development and evolution of norm-referenced testing attempts, in addition to developing valid and reliable instruments, to establish norms, that is, values for populations and sub-populations of individuals. Norm-referenced tests thus allow us to compare scores with respect to what is considered normal (Glaser, 1963; Berger, 2013b).

3. Item Review and Item Analysis

When developing a new measurement instrument, we are likely to create more items than what is really needed. Item review and item analysis are a series of methods to reduce the number of items of a measurement instrument and keep the best performing ones (Rust, 2009). This is a two-step process, as shown in Figure 2. First, it requires a review by experts; then, a pilot study and statistical calculations. During the first step (item review), experts in the domain of knowledge evaluate items one by one and argue for their presence in the test (Rust, 2009). During the second step (item analysis), the developers of the measurement instrument calculate item facility and item discrimination based on a pilot study that uses the tentative set of items.

Figure 2. Phases for item review and item analysis

We are not describing item review in detail here, because it is a straightforward process that involves familiar methods found in systematic literature reviews and qualitative studies. The experts in the domain of knowledge discuss candidate items and argue in favor of or against them, as happens when discussing the inclusion and exclusion of publications in systematic literature reviews (Kitchenham, 2007). Inter-rater reliability measures such as Cronbach's α (Cronbach, 1951) and Krippendorff's α (De Swert, 2012) can be adopted for assessing the degree of agreement among raters. After reaching an agreement on the items to be included, a pilot study is required for an analysis of the items.
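As an illustration of this step, the sketch below estimates inter-rater agreement for a hypothetical item review using the irr package; the three experts, the eight candidate items, their 1-5 relevance ratings, and the ordinal treatment of the scale are all assumptions of the example.

```r
# Minimal sketch (hypothetical ratings): raters in rows, candidate items in columns.
# install.packages("irr")
library(irr)
ratings <- rbind(
  expert1 = c(5, 4, 4, 2, 5, 3, 4, 1),
  expert2 = c(5, 5, 4, 2, 4, 3, 5, 1),
  expert3 = c(4, 4, 5, 1, 5, 2, 4, 2)
)
kripp.alpha(ratings, method = "ordinal")   # Krippendorff's alpha for ordinal ratings
```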

3.1. Item analysis

Item analysis refers to several statistical methods for the selection of items for a psychological test (Kline, 2015). The two most known techniques are item facility and item discrimination. This section explains both techniques according to the test type (see the previous section).

3.1.1. Item facility

Item facility for an item, also known under its opposite meaning of item difficulty, is a measure of the tendency to answer an item with the same score. This has different meanings according to the type of test.

Knowledge-based test

Item facility for an item (whose opposite is item difficulty) is defined as the ratio of the number of participants who provided the right answer to the number of all participants in a test (Rust, 2009; Kline, 2015). The value of item facility ranges from 0 (all respondents are wrong) to 1 (all respondents are right). In other words, item facility is the probability of obtaining the right answer for the item (Rust, 2009). Of interest for test construction is the variance of an item. An item variance is the calculated variance of a set of item scores, which is a set of zeros and ones. That is, the item variance for an item with facility 0 is 0 (all are wrong), and the item variance for an item with facility 1 is also 0 (all are right). Both these extremes would render the item rather void, as all individuals scoring the same on an item would not tell us anything interesting about the individuals. What usually happens is that some individuals will get the answer right and some will get the answer wrong. For such cases, we can compute the variance for an item by using the formula in (2):

Var(i) = p_i (1 − p_i)    (2)

where i is the item and p_i is the item facility for i. The highest possible value for Var(i) is 0.25, which is the case when items are neither very easy nor very difficult (p_i = 0.5). When Var(i) has small values, it means that most respondents tend to reply the same way for that item, making it either extremely easy or extremely difficult. A value of Var(i) near 0 does not warrant automatic exclusion of item i, but the value should solicit a review.
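A minimal sketch with made-up dichotomous scores: item facility is the proportion of right answers per item, and the variance p_i(1 − p_i) flags items on which nearly everyone answers the same way.

```r
# Minimal sketch (made-up scores): rows are respondents, columns are items
# scored 1 (right) or 0 (wrong) on a knowledge-based test.
scores <- data.frame(
  item1 = c(1, 1, 1, 1, 1, 0),
  item2 = c(1, 0, 1, 0, 1, 0),
  item3 = c(0, 0, 0, 0, 0, 0)
)
facility <- colMeans(scores)              # p_i: proportion of right answers
item_var <- facility * (1 - facility)     # Var(i) = p_i * (1 - p_i), at most 0.25
round(rbind(facility, item_var), 2)       # item3 (variance 0) should solicit a review
```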

Person-based tests

Facility suggests that participants in all tests can be either right or wrong in an answer. What about trait measurement, where participants are not exactly right or wrong but are assessed in terms of a psychological construct? We can calculate item facility for these cases as well. The issue lies only in the naming, because item facility was developed for knowledge-based tests first. Some scholars prefer to use the terms item endorsement or item location (Revelle, 2009) to better reflect how the calculations can be done for traits.

For trait measurement, it is common to have questionnaires composed of Likert items (Likert, 1932). Item facility for a Likert item can be calculated as the mean value of the item. If a Likert item maps to the values 0 (strongly disagree) to 5 (strongly agree), the extreme values for the item will be 0 and 5 instead of 0 and 1. An item with an average value of 4.8 and a variance of 0.09 is a candidate for deletion, whereas an item with an average value of 2.72 and a variance of 3.02 is deemed interesting. Items such as those worded negatively should be reverse-scored prior to item analysis, so that all items have comparable values.
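A minimal sketch with made-up Likert responses on the 0-5 scale above: a negatively worded item (the hypothetical item3_neg) is reverse-scored first, and the mean (endorsement) and variance of each item are then inspected.

```r
# Minimal sketch (made-up responses): three Likert items scored 0-5;
# item3_neg is worded negatively and must be reverse-scored before analysis.
responses <- data.frame(
  item1     = c(5, 4, 5, 5, 4, 5),
  item2     = c(2, 0, 4, 3, 5, 1),
  item3_neg = c(0, 1, 0, 2, 1, 0)
)
responses$item3_neg <- 5 - responses$item3_neg   # reverse the 0-5 scoring
endorsement <- colMeans(responses)               # item facility / endorsement
spread <- apply(responses, 2, var)               # low spread: candidate for deletion
round(rbind(endorsement, spread), 2)
```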

3.1.2. Item discrimination

Item discrimination is a technique to discover items that behave oddly with respect to how we expect participants to score on them. The meaning of item discrimination differs according to the type of test.

Knowledge-based test

Item discrimination reflects whether an item behaves oddly with respect to the rest of the test: an item on which individuals who score very high on the test as a whole tend to be wrong, while individuals who score very low tend to be right, possesses negative discrimination (Rust, 2009). Ideally, a test should not include items with zero or negative discrimination (Kline, 2015).

From a statistical point of view, if an item is uncorrelated with the overall test score, then it is almost certainly uncorrelated with the other items and makes very little contribution to the overall variance of the test (Rust, 2009; Singh et al., 2016). Therefore, we calculate item discrimination as the correlation coefficient between an item's score and the overall test score. If the computed correlation coefficient is 0 or below, we should consider removing the item.

Person-based tests

What holds for knowledge-based tests holds to a wide extent for person-based tests. Instead of assessing how well an item behaves with respect to the test score, we assess whether an item is in fact measuring the overall trait in question. By calculating the correlation coefficient between an item and the overall test score for a specific trait, we obtain an initial estimate of how well the item represents the trait in question. If the computed correlation coefficient is 0 or below, we should consider removing the item.
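Reusing the hypothetical responses data frame from the sketch above, item discrimination can be estimated as an item-total correlation; excluding the item from its own total, as done here, is a common refinement so that the item does not inflate its own estimate.

```r
# Minimal sketch: correlate each item with the total of the remaining items.
item_discrimination <- sapply(names(responses), function(item) {
  rest_total <- rowSums(responses[, setdiff(names(responses), item), drop = FALSE])
  cor(responses[[item]], rest_total)
})
round(item_discrimination, 2)    # items at or below 0 are candidates for removal
# psych::alpha(responses) reports a similar corrected item-total correlation (r.drop).
```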

The case of norm-referenced tests

The variance of an item, here calculated using the classic definition from statistics, is also interesting in the context of norm-referenced testing. Item facility also applies to norm-referenced testing, as the purpose of the test is to spread out individuals' scores as much as possible on a continuum. A larger spread is due to a larger variance, and we are interested in including items that make a contribution to the variance (Rust, 2009; Kline, 2015). Furthermore, if an item has a high correlation with other items and has a large variance, it follows that the item makes a high contribution to the total variance of the test and it will be kept in the pool of items.

The case of criterion-referenced tests

Item analysis is often seen as applicable to norm-referenced testing exclusively (Rust, 2009). With criterion-referenced testing, it is still possible to calculate item facility and item discrimination, and these calculations can be conducted, for example, before and after teaching and formative activities (which might include workshops on, say, Scrum methods at IT companies). A difference in item facility before and after the teaching activities would indicate that the item is a valid measure of the skill taught. This would turn the measure of item facility into a measure of item discrimination as well.

3.1.3. Limitations of item analysis

Item analysis, while valuable and still in use today, is part of the so-called classical test theory (CTT), which assumes that an individual's observed score is the same as a true score plus an error score (Traub, 2005). Modern replacements for CTT have been proposed, the most prominent one being item response theory (IRT) (Embretson and Reise, 2013). IRT models build upon a function (called the item response function, IRF, or item characteristic curve, ICC) that defines the probability of being right or wrong on an item (Alphen et al., 1994). IRT is outside the scope of the present paper, as CTT is still in use to this day (Rust, 2009) and explaining IRT would require a publication of its own.

Item analysis, as presented in this section, assumes that there is a single test score, meaning that a single construct is being measured. Whenever multiple constructs, or a construct with multiple factors, are being measured, item analysis needs to be accompanied by factor analysis (Singh et al., 2016).

4. Factor Analysis

Factor analysis is one of the most widely employed psychometric tools (Kline, 2015; Rust, 2009; Singh et al., 2016), and it can be applied to any dataset where the number of participants is higher than the number of item scores under observation. Factor analysis is for understanding which test items "go together" to form factors in the data that ideally should correspond to the constructs that we are aiming to assess (Rust, 2009). At the same time, factor analysis allows us to reduce the dimensionality of the problem space (i.e., reducing factors and/or associated items) and to explain the variance in the observed variables in terms of underlying latent factors (Kootstra, 2006). In case we intend to assess a single construct, factor analysis helps in identifying those items that (best) represent the construct we are interested in, so that we can exclude the other items.

Factor analysis techniques are based on the notion that the constructs we observe through our measurement instruments can be reduced to fewer latent variables, which are unobservable but share a common variance (Yong and Pearce, 2013) (see Section 2). Factor analysis starts with computed correlation coefficients as its first building block. A way to summarize correlation coefficients is through a correlation matrix, which is a matrix of items that displays the correlation coefficient between all items of the tentative test. Table 1 provides an example correlation matrix for five items a, b, c, d, and e. Given that two items correlate with each other the same no matter their order (e.g., r_ab = r_ba), and that a single item correlates with itself with a perfect correlation coefficient (r_aa = 1), the tradition is to show only the lower triangle of the matrix, omitting the repeated upper part, with the value of 1.00 on its diagonal (Rust, 2009).

items      a      b      c      d      e
a       1.00
b       0.73   1.00
c       0.03   0.20   1.00
d       0.89   0.84  -0.18   1.00
e      -0.04   0.46   0.12   0.04   1.00
Table 1. Example correlation matrix for items a, b, c, d, and e

The high correlations among certain items, in our case a with b and a with d, and b with d, indicate that these items might belong to the same factor. This approach, however, tells only part of the story. Questions such as "how do our candidate factors explain the total variance of the measurement instrument?", "to which candidate factor does an item belong more?", and "how are factors related to each other?" are better answered by further analysis.
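Assuming a hypothetical data frame items with one column per item and one row per respondent, a correlation matrix such as the one in Table 1 is this starting point; a sketch:

```r
# Minimal sketch (hypothetical `items` data frame): compute the correlation
# matrix and display only its lower triangle, as in Table 1.
r <- cor(items)
r[upper.tri(r)] <- NA
round(r, 2)
```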

There are two main factor analysis techniques, namely Exploratory Factor Analysis (EFA) and Confirmatory Factor Analysis (CFA) (Singh et al., 2016; Revelle, 2009; Kootstra, 2006). EFA attempts to uncover patterns and clusters of items by exploring predictions, while CFA attempts to confirm hypotheses on existing factors (Yong and Pearce, 2013).

4.1. Exploratory Factor Analysis

Exploratory factor analysis (EFA) is a family of factor analysis techniques aimed at reducing the number of items by retaining the items that are most relevant to certain factors (Kootstra, 2006). Strictly speaking, when developing a measurement instrument, after item analysis, it is desirable to observe whether the measurements for the items tend to cluster. These clusters are likely to represent different factors that might or might not pertain to the construct being measured (Rust, 2009; Kline, 2015). EFA provides tools to group and select items from a correlation matrix.

EFA operates on the equation in (3) for a measure X_j (Yong and Pearce, 2013; Singh et al., 2016):

X_j = λ_j1 F_1 + λ_j2 F_2 + … + λ_jm F_m + U_j + E_j    (3)

where F_1, …, F_m are the common factors grouping the items being analyzed, U_j is the factor that is unique to the measure, λ_j1, …, λ_jm are the loadings of the item on the respective factors, and E_j is the random measurement error. Factor loadings are, in practice, weights that provide us with an idea of how much an item contributes to a factor (Yong and Pearce, 2013).

From the equation we derive that the variance of the constructs being measured is explained by three parts: (1) the common factors, whose contribution is also known as the communality of a variable (Yong and Pearce, 2013; Singh et al., 2016; Fabrigar et al., 1999), (2) the influence of factors that are unique to that measure, and (3) the random error E_j.

Estimates for the communality of an item are often referred to as h². The communality h² is the calculated proportion of variance that is free of error variance and is thus shared with the other variables in a correlation matrix (Yong and Pearce, 2013; Singh et al., 2016; Fabrigar et al., 1999). Several techniques calculate the communality of a variable by summing the squared loadings associated with that variable.

Estimates for the unique variance, denoted as u², represent the proportion of variance that is not associated with the communality, that is, u² = 1 − h² (Yong and Pearce, 2013; Singh et al., 2016; Fabrigar et al., 1999). Determining a value of u² for an item allows us to find how much specific variance can be attributed to that variable.

Lastly, the random error that is associated with an item is the last component of the total variance. Random error is also often called the unreliability of variance (Yong and Pearce, 2013; Fabrigar et al., 1999).

Unique factors are never correlated with common factors, but common factors may or may not be correlated with each other (Yong and Pearce, 2013).
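A minimal numerical sketch of these quantities, using a made-up loading matrix for an orthogonal two-factor solution: the communality h² of an item is the sum of its squared loadings, and the uniqueness u² is the remainder of its standardized variance.

```r
# Minimal sketch (made-up loadings): four items, two orthogonal factors.
loadings <- matrix(c(0.80, 0.75, 0.10, 0.05,    # loadings on factor F1
                     0.05, 0.10, 0.70, 0.65),   # loadings on factor F2
                   ncol = 2,
                   dimnames = list(paste0("item", 1:4), c("F1", "F2")))
h2 <- rowSums(loadings^2)   # communalities
u2 <- 1 - h2                # uniqueness (specific variance plus error)
round(cbind(loadings, h2, u2), 2)
```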

EFA encompasses three phases (Rust, 2009; Singh et al., 2016; Revelle, 2009; Kootstra, 2006), described in Figure 3. First, we have to select a fitting procedure to estimate the factor loadings and unique variances of the model. Then, we need to define and extract a number of factors. Finally, we need to rotate the factors to be able to properly interpret the produced factor matrices.

Many statistical programs allow us to either perform all these phases separately or to perform more than one at the same time. It is not an easy task to assign a methodology to only one of the three phases below. The reader is advised that some textbooks avoid our classification of phases and simply revert to a more practical set of questions, e.g., "how to calculate factors" and "how many factors should we retain". We also note that recent studies have formulated Bayesian versions of these classical exploratory factor analysis techniques and claimed several benefits (Conti et al., 2014a).

Figure 3. Phases for exploratory factor analysis

4.1.1. Factor loading

The most common technique for estimating the loadings and variance is the standard statistical technique of principal component analysis (PCA) (Pearson, 1901). PCA assumes that the communalities for the measures are equal to 1.0. That is, all the variance for a measure is explained only by the factors that are derived by the methodology, and hence there is no error variance. PCA operates on the correlation matrix, mostly on its eigenvalues, to extract factors that correlate with each other. The eigenvalue of a factor represents the variance of the variables accounted for by that factor. The lower the eigenvalue, the less the factor contributes to the explanation of variance in the variables (Norris and Lecavalier, 2010). Factor weights are computed to extract the maximum possible variance, with successive factoring continuing until there is no further meaningful variance left. PCA is not a factor analysis method stricto sensu, as factor analysis does assume the presence of error variance rather than being able to explain all variance. Some advocates prefer to state that its output should be referred to as a series of components rather than factors. While more simplistic than other techniques to estimate factor loadings, performing a PCA is still encouraged as a first step in EFA, before performing the actual factor analysis (Rust, 2009).

Among the proper factor analysis techniques that exist, we are interested in a widely recommended technique for estimating loadings and variance named principal axis factoring (PAF) (Russell, 2016; Widaman, 1993; Kline, 2015). PAF does not operate under the assumption that the communalities are equal to 1.0, so the diagonal of the correlation matrix (e.g., the one in Table 1) is substituted with estimations of the communalities, h². PAF estimates the communalities using different techniques (e.g., the squared multiple correlation between a measure and the rest of the measures) and a covariance matrix of the items. Factors are estimated one at a time until a large enough amount of variance is accounted for in the correlation matrix. Under PAF, the ordering of the factors determines their importance in terms of fitting, e.g., the first factor accounts for as much variance as possible, followed by the second factor, and so on. Russell (2016) provides a detailed description of the underlying statistical operations of PAF, which we omit for the sake of brevity. Most statistical software provides functions that implement PAF.
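A sketch of this fitting step with the R psych package, assuming the hypothetical items data frame introduced earlier and an assumed two-factor solution: principal() provides the PCA first look, and fa() with fm = "pa" performs principal axis factoring.

```r
# Minimal sketch (hypothetical `items` data frame; two factors assumed).
# install.packages("psych")
library(psych)
pca <- principal(items, nfactors = 2, rotate = "none")
pca$values                          # eigenvalues of the correlation matrix
paf <- fa(items, nfactors = 2, fm = "pa", rotate = "none")
print(paf$loadings, cutoff = 0.3)   # unrotated principal-axis loadings
paf$communality                     # estimated communalities (h2)
```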

4.1.2. Factor extraction

Both PCA and PAF result in values assigned to candidate factors. Therefore, there has to be a strategy to extract meaningful factors. The bad news here is that there is no unique way, let alone a single proper way, to extract factors (Russell, 2016; Singh et al., 2016; Revelle, 2018b). More than one strategy should be adopted in parallel to allow a comparison of results, ending with a sense-making analysis to take an ultimate decision (Fabrigar et al., 1999). Several factor extraction techniques exist, which are mentioned in the references cited in the present section. We describe here those that are used most widely as well as those that are easier to apply and understand.

Perhaps the simplest strategy to extract factors is Kaiser's (1960) eigenvalue-greater-than-one (K1) rule. The rule simply states that factors with an eigenvalue higher than 1.0 should be retained. Kaiser's rule is quite easy to apply, but it is highly controversial (Russell, 2016; Revelle, 2018b; Fabrigar et al., 1999; Courtney, 2013). First, the rule was originally designed for PCA and not for PAF or other factor analysis methods, which might make it unsuitable for methodologies that provide estimations of communalities as the diagonals of correlation matrices (Courtney, 2013). Second, the cut-off value of 1.0 might discriminate between factors that are just above and just below 1.0 (Courtney, 2013). Third, computer simulations found that K1 tends to overestimate the number of factors (Fabrigar et al., 1999). Yet, K1 is still the default option in some statistical software suites, making it an unfortunate de-facto main method for factor extraction (Courtney, 2013).

Cattell's (1966) scree test, also based on eigenvalues, involves plotting the eigenvalues extracted from either the correlation matrix or the reduced correlation matrix (thus making it suitable for both PCA and PAF) against the factor they are associated with, in descending order. One then inspects the curved line for a break in the values (or an elbow), up to where a substantial drop in the eigenvalues cannot be observed anymore. The break is the point at which the shape of the curve becomes horizontal. The strategy is then to keep all factors before the breaking point. The three major criticisms of this approach are that it is subjective (Courtney, 2013), that more than one scree might exist (Tinsley and Tinsley, 1987), and that data often does not offer a discernible scree, so that a conceptual analysis of the candidate factors is always required (Rust, 2009).

Revelle and Rocklin (1979) proposed the Very Simple Structure (VSS) method for factor extraction, which is based on assessing how well the original correlation matrix can be reproduced by a simplified pattern matrix, in which only the highest loading of each item is retained (everything else is set to zero) (Courtney, 2013). The VSS criterion to assess how well the pattern matrix performs is a number from 0 to 1, making it a goodness-of-fit measure almost of a confirmatory nature rather than an exploratory one (Revelle, 2020). The VSS criterion is gathered for solutions involving a number of factors that goes from 1 to a user-specified value. The strategy ends with selecting the number of factors that provides the highest VSS criterion.

Finally, the method of parallel analysis (PA), introduced by Horn (1965), was found to be very robust for factor extraction (Courtney, 2013; Fabrigar et al., 1999). PA starts with the K1 concept that only factors with an eigenvalue larger than 1.0 should be kept. Horn (1965) argued that the K1 rule was developed with population statistics and is thus unsuitable for sampled data: sampling errors cause some components from uncorrelated variables to have eigenvalues higher than one in a sample (Courtney, 2013). PA takes into account the proportion of variance that results from sampling rather than from access to the population. The way it achieves this is a constant comparison of the solution with randomly generated data (Revelle, 2020). PA generates a large number of matrices from random data, in parallel with the real data. Factors are retained as long as their eigenvalues are greater than the mean eigenvalue generated from the random matrices.
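The extraction criteria above can be compared on the same hypothetical items data frame with the psych package; the maximum number of factors passed to vss() is an assumption of the example.

```r
# Minimal sketch: scree plot, parallel analysis, and VSS on the same data.
library(psych)
scree(items)                    # Cattell's scree plot of the eigenvalues
fa.parallel(items, fm = "pa")   # Horn's parallel analysis against random data
vss(items, n = 5, fm = "pa")    # Very Simple Structure criterion for 1 to 5 factors
```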

4.1.3. Factor Rotation

The last step is to rotate the factors in the dimensional space for improving our interpretation of the results (Rust, 2009).

An unrotated output, that is, the one that often results after factor extraction, maximizes the variance accounted for by the first factor, followed by the second factor, the third factor, and so on. That is, most items would load on the first factors, and many of them would load on more than one factor in a substantial way.

Rotating factors builds on the concept that there are a number of "factor solutions" that are mathematically equivalent to the solution found after factor extraction. By performing a rotation of the factors, we retain our solution but allow an easier interpretation. We rotate the factors to seek a so-called simple structure, that is, a loading pattern such that each item loads strongly on one factor only and weakly on the other factors. If the reader is interested in the mathematical foundations of factor rotation, two deep overviews are offered by Darton (1980) and Browne (2001).

There are two families of rotations, namely orthogonal and oblique (Russell, 2016). Orthogonal rotations force the assumption of independence between the factors, whereas oblique rotations allow the factors to correlate with each other. Which methodology can be used is influenced by the statistical software; for example, the R psych package (Revelle, 2019) provides "varimax", "quartimax", "bentlerT", "equamax", "varimin", "geominT", and "bifactor" for orthogonal rotations, and "promax", "oblimin", "simplimax", "bentlerQ", "geominQ", "biquartimin", and "cluster" for oblique rotations. Several rotation methodologies are summarized by Browne (2001); Abdi (2003); Russell (2016).

Perhaps the best known and most employed (Darton, 1980) orthogonal rotation method is the Varimax rotation (Kaiser, 1958). Varimax maximizes the variance (hence the name) of the squared loadings of a factor on all variables. Each factor will tend to have either large or small loadings on any particular variable. While this solution makes it rather easy to identify factors and their loadings on items, the independence condition of orthogonal rotation techniques is hard to achieve. The assumption of independence of factors, especially in the context of behavioral research, belittles the value of orthogonal rotation techniques, to the point that "we see little justification for using orthogonal rotation as a general approach to achieving solutions with simple structure" (Fabrigar et al., 1999, p. 283).

Oblique rotation is preferred for behavioral software engineering studies, because it is sensible to assume that behavioral, cognitive, and affective factors are separated only by soft walls of independence (e.g., motivation and job satisfaction) (Rust, 2009; Fabrigar et al., 1999; Russell, 2016). If anything, one would have to first conduct an investigation using oblique rotation and observe whether the solution shows little to no correlation between factors and, in that case, switch to orthogonal rotation (Fabrigar et al., 1999). The two most employed and recommended oblique rotation techniques are Direct Oblimin (and its slight variation Direct Quartimin) and Promax, both of which perform well (Fabrigar et al., 1999).

Fabrigar et al. (1999) and Russell (2016) recommend using a Promax rotation because it provides the best of both approaches. A Promax rotation first performs an orthogonal rotation (a Varimax rotation) to maximize the variance of the loadings of a factor on all the variables (Russell, 2016). Then, Promax relaxes the constraint that the factors are independent of each other, turning the rotation into an oblique one. The advantage of this technique is that it will reveal whether the factors really are uncorrelated with each other (Russell, 2016).
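A sketch of the rotation step on the same hypothetical items data frame and assumed two-factor solution: fit an oblique Promax solution first and inspect the factor inter-correlations (Phi) before deciding whether an orthogonal Varimax solution would be defensible.

```r
# Minimal sketch: oblique rotation first, orthogonal only if Phi is negligible.
library(psych)
oblique <- fa(items, nfactors = 2, fm = "pa", rotate = "promax")
print(oblique$loadings, cutoff = 0.3)   # pattern loadings; simple structure sought
round(oblique$Phi, 2)                   # factor inter-correlations
orthogonal <- fa(items, nfactors = 2, fm = "pa", rotate = "varimax")
print(orthogonal$loadings, cutoff = 0.3)
```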

4.1.4. Further recommendations

There have been several recommendations regarding the required sample size, number of measures per factor, number of factors to retain, and interpretation of loadings (Russell, 2016; Singh et al., 2016; Yong and Pearce, 2013).

The recommended overall sample size as reported by Yong and Pearce (2013) is at least 300 participants, with at least 5 to 10 observations per variable subject to factor analysis. This recommendation has, however, low empirical validation. As reported by Russell (2016), a Monte Carlo study by MacCallum et al. (1999) analyzed how well different sample sizes and communalities of the variables were able to reproduce the population factor loadings. They found that, with item communalities higher than or equal to 0.60, results were very consistent with sample sizes as low as 60 cases. Communality levels around 0.50 required samples of 100 to 200 cases. In his review, Russell (2016) also found that 39% of EFA studies involved samples of 100 or fewer cases.

On the number of measures (items) per factor, Yong and Pearce (2013) report that for something to be labeled as a factor it should have at least 3 variables, especially when factors receive a rotation treatment, where only a high correlation (coefficient higher than 0.70) with each other, and mostly no correlation with other items, would make them worthy of consideration. Generally speaking, the correlation coefficient for an item to belong to a factor should be 0.30 or higher (Tabachnick et al., 2007). Russell (2016) identifies that prior work has requested at least three items per factor; however, four or more items per factor were found to be a better heuristic to ensure an adequate identification of the factors. In his review he identified 25% of studies with three or fewer measures per factor.

We reported on the number of factors to retain in the Factor Extraction subsection, so we will not repeat ourselves here. There is no recommended number, and one would follow (possibly more than) one extraction method to identify the best number of factors according to the case. Tabachnick et al. (2007) add that cases with missing values should be deleted to prevent an overestimation of factors. Russell (2016) wrote something that is worth mentioning for the uninitiated behavioral software engineering researcher, that is, even when constructing a new measurement instrument there is already an expectation of possible factors in the mind of the researcher. The reason is that items are developed following an investigation of prior work and/or empirical data (see section 3). That number is a good starting point when conducting EFA.

Yong and Pearce (2013) spend some further explanations on the interpretation of loadings when they are produced by statistical software. There should be few item crossloadings (i.e., split loadings, when an item loads at 0.32 or higher on two or more factors), so that each factor defines a distinct cluster of interrelated variables. There are exceptions to this that require an analysis of the items; sometimes it is useful to retain an item that crossloads, under the assumption that crossloading is the latent nature of the variable. Furthermore, Tabachnick et al. (2007) report that, with an alpha level of 0.01, a rotated factor loading with a meaningful sample size would need a value of at least 0.32, as this corresponds to approximately 10% of overlapping variance.

4.2. Confirmatory Factor Analysis

Contrary to exploratory factor analysis, confirmatory factor analysis (CFA) is for confirming a-priori hypotheses and assessing how well a hypothesized factor structure fits the obtained data (Russell, 2016). A hypothesized factor structure could be derived from existing literature as well as from the data of a previous study that explored the factor structure.

Once the data is obtained to compare with the hypothesized factor structure, a goodness-of-fit test should be conducted. CFA requires statistical modeling that is outside the scope of this paper, and the estimation of goodness-of-fit in CFA is a long-lasting debate, as “there are perhaps as many fit statistics as there are psychometricians” (Revelle, 2018a, p. 31). Russell (2016); Singh et al. (2016); Rust (2009) provide several techniques for estimating the goodness-of-fit in CFA, e.g., the Chi-squared test, the root mean square residual (RMSR), the root mean square error of approximation (RMSEA), and the standardized RMSR. Statistical software implements these techniques, including the R psych package (Revelle, 2020). A widely employed technique for CFA is found in structural equation modeling (SEM), which is a family of models to fit networks of constructs (Kaplan, 2008). MacCallum and Austin (2000) provided a comprehensive review of SEM techniques in the psychological sciences, including their applications and pitfalls.
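The following is a minimal sketch of a CFA and its fit indices using the lavaan package, which is our choice here purely for illustration (the text points to the R psych and sem packages); the two-factor model and the items data frame from the EFA sketch above are hypothetical.

library(lavaan)

# Hypothesized two-factor structure: items i1-i3 load on f1, items i4-i6 on f2.
model <- '
  f1 =~ i1 + i2 + i3
  f2 =~ i4 + i5 + i6
'
fit <- cfa(model, data = items)

# Chi-squared, RMSEA, and standardized RMSR, as discussed above.
fitMeasures(fit, c("chisq", "df", "pvalue", "rmsea", "srmr"))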

Conducting both EFA and CFA is very expensive. When designing and validating a measurement instrument, and when obtaining a large enough sample of participants, it is common to split the sample, conducting EFA on one part and CFA on the remaining part (Singh et al., 2016). Most authors, however, prefer conducting EFA only (Russell, 2016) and rely on future independent studies towards a better psychometric evaluation of a tool. This is also why statistical tools, e.g., the R psych package (Revelle, 2019), provide estimates of fit for EFA as well as convenience tools to adapt the data to CFA packages, e.g., R sem (Fox, 2017).

We refer the reader to a prior work of ours in the behavioral software engineering domain (Lenberg et al., 2017b), where we conducted a CFA and described its application. We also note that, as for EFA, Bayesian methods for CFA have also been proposed (Lee, 1981a; Ansari et al., 2002).

5. Statistical properties of items

Assessing characteristics and performance of individuals poses several challenges when interpreting the resulting scores. One of them is that a raw score is not meaningful without understanding the test standardization characteristics. For example, a score of 38 on a debugging performance test is meaningless without knowing that a 38 corresponds to, say, only being able to open a debugger. Furthermore, the interpretation of the result varies wildly depending on whether developers score, on average, 400 on the test or 42. The former issue relates to criterion-referenced standardization, the latter to norm-referenced standardization (Rust, 2009; Kline, 2015).

Criterion-referenced tests assess what an individual with a given score is expected to be able to do or know. Norm-referenced standardization enables comparing an individual’s score to the ordered ranking of a population (also see section 2). We concentrate on norm-referenced standardization, as criterion-referenced standardization is unique to a test’s criteria.

A first step to norm-reference a test is to order the results of all participants and rank an individual’s score. Measures such as the median and percentiles are useful for achieving the ranking and the comparison. When we can treat our data as interval scales and it approximately follows a normal distribution, we can also use the mean and the standard deviation. The standard deviation is useful for telling us how much an individual’s score is above or below the mean score. Instead of reporting that an individual’s score is, e.g., 13 above the mean score, it is more interesting to know that the score is 1.7 standard deviations above the mean score. Hence, we norm-standardize scores using different approaches. The remainder of this section is modeled after Rust’s (2009) text.

5.1. Standardization to Z scores

Whenever a sample approximates a normal distribution, we know that a score above the average is in the upper 50% and, following the three sigma empirical rule (Pukelsheim, 1994), that roughly 68% and 95% of all scores fall within one and two standard deviations of the mean, respectively (so a score more than one or two standard deviations above the mean is roughly in the top 16% or top 2.5%). For expressing an individual’s score in terms of how distant it is from the mean score, we transform the value to its Z score (also called standard score) using the formula in 4:

Z = \frac{X - \bar{X}}{\sigma} \qquad (4)

where X is a participant’s score, \bar{X} is the mean of all participants’ scores, and \sigma is the standard deviation of all participants’ scores.
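Below is a minimal sketch of the transformation in formula 4 in R; the vector of raw scores is made up for illustration.

raw <- c(25, 38, 41, 52, 64)

z <- (raw - mean(raw)) / sd(raw)   # formula 4, applied element-wise
z_alt <- as.numeric(scale(raw))    # scale() yields the same standardization
round(z, 2)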

The ideal case would be to use the population mean and standard deviation. In software engineering research we lack studies estimating population characteristics (an example of a norm study was provided by Graziotin et al. (2017)), so we should either aggregate the results of several studies or gather more samples.

An important note is that transforming scores into Z scores does not make the scores normally distributed. This would require a normalization procedure, explained below.

5.2. Standardization to T scores

Z scores typically range between -3.00 and +3.00. This range is not always suitable for every application. A software developer could, for example, object to a Z score of -0.89 which, at first glance, might seem low (value) or negative (sign).

A T score, not to be confused with the t-statistic of Student’s t-test, is a Z score that is scaled and shifted so that it has a mean of 50 and a standard deviation of 10. T scores thus typically range between 20 and 80. For transforming a Z score into a T score, we use the formula in 5:

T = 50 + 10Z \qquad (5)

The software developer in the previous example would have a T score of 50 + 10 × (−0.89) = 41.1 from a Z score of −0.89.

5.3. Standardization to stanine and sten scores

Stanine and sten scores address the need to transform a score onto a scale from 1 to 9 (stanine) or 1 to 10 (sten), with a mean of 5 (stanine) or 5.5 (sten) and a standard deviation of 2. These scores purposely lose precision by keeping only whole-number values.

Stanine score   Z score            Sten score   Z score
1               < -1.75            1            < -2.00
2               -1.75 to -1.25     2            -2.00 to -1.50
3               -1.25 to -0.75     3            -1.50 to -1.00
4               -0.75 to -0.25     4            -1.00 to -0.50
5               -0.25 to +0.25     5            -0.50 to 0.00
6               +0.25 to +0.75     6            0.00 to +0.50
7               +0.75 to +1.25     7            +0.50 to +1.00
8               +1.25 to +1.75     8            +1.00 to +1.50
9               > +1.75            9            +1.50 to +2.00
                                   10           > +2.00
Table 2. Conversion from Z scores to Stanine and Sten scores

The conversion to stanine and sten scores follows the rules in Table 2.

The advantage of stanine and sten scores lies in their imprecision. If our underperforming developer with a Z score of -0.89 were compared with two other developers having Z scores of -0.72 and -0.94, how meaningful would such a tiny difference in scores be? Their stanine scores are 3, 4, and 3, respectively, and their sten scores are all 4. Stanine and sten scores provide clear cut-off points for easier comparisons.

There is an important difference between stanine and sten scores, besides their range. A stanine score of 5 represents an average score in a sample. An average sten score cannot be obtained, because the value of 5.5 does not belong to its possible values. A sten score of 5 represents the low average band (covering sten values from 4.5 to 5.5, that is, the half standard deviation immediately below the mean), and a score of 6 represents the high average band (covering 5.5 to 6.5, the half standard deviation immediately above the mean).
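The following sketch converts the Z scores of the three hypothetical developers above to T, stanine, and sten scores in R, using formula 5 and the bands of Table 2.

z <- c(-0.89, -0.72, -0.94)

t_score <- 50 + 10 * z   # formula 5; gives 41.1, 42.8, 40.6

# findInterval() counts how many band boundaries lie at or below each Z score,
# which maps the scores onto the stanine and sten bands of Table 2.
stanine <- findInterval(z, c(-1.75, -1.25, -0.75, -0.25, 0.25, 0.75, 1.25, 1.75)) + 1
sten    <- findInterval(z, c(-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0)) + 1

stanine   # 3 4 3
sten      # 4 4 4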

5.4. Normalization

The standardization techniques that we presented in the previous subsections carry the assumption that the sample and population approximate the normal distribution. For all other cases, it is possible to normalize the data. Examples include algebraic transformations, e.g., square-root or log transformations, as well as graphical transformations. See introductory statistical texts for more detailed explanations of and a broader set of such transformations.
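As a minimal sketch, assuming the right-skewed raw scores simulated below, a log or square-root transformation can bring the data closer to normality before standardizing it.

set.seed(1)
raw <- rlnorm(200, meanlog = 3, sdlog = 0.6)   # right-skewed scores

log_scores  <- log(raw)     # log transformation
sqrt_scores <- sqrt(raw)    # square-root transformation

# A rough check of normality before and after transforming.
shapiro.test(raw)$p.value
shapiro.test(log_scores)$p.value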

6. Reliability

Reliability can be seen either in terms of precision, that is, the consistency of test scores across replications of the testing procedure (reliability/precision), or as a coefficient, that is, the correlation between scores on two equivalent forms of the test (reliability/coefficients) (American Educational Research Association et al., 2014). For evaluating the precision of a measurement instrument, the ideal would be to have as many independent replications of the testing procedure as possible on the same very large sample. Scores are expected to generalize over alternative test forms, contexts, raters, and over time. The reliability/precision of a measurement instrument is then assessed through the range of differences of the obtained scores. The reliability/precision of an instrument should be assessed with as many sub-groups of a population as possible.

The reliability/coefficients of a measurement instrument, which we will simply call reliability from this point on, is the most common way to refer to the reliability of a test (American Educational Research Association et al., 2014). There are three categories of reliability coefficients, namely alternate-form (derived by administering alternative forms of the test), test-retest (the same test at different times), and internal-consistency (the relationship among scores derived from individual test items during a single session).

We adhere to the classification of reliability by Rust (2009) and Nunnally (1994) and provide a brief overview of reliability facets in psychometric theory in Figure 4.

Figure 4. Reliability in psychometric theory

Several factors, as defined by American Educational Research Association et al. (2014), affect the reliability of a measurement instrument, especially adding or removing items, changing wording or intervals of items, causing variations in the constructs to be measured (e.g., using a measurement instrument for happiness to assess job satisfaction of developers), and ignoring variations in the intended populations for which the test was originally developed.

We now introduce the most widely employed techniques for establishing the reliability of a test.

6.1. Test-retest reliability

Test-retest reliability, also known as test stability, is assessed by administering the measurement instrument twice to the same sample within a short interval of time. The paired set of scores for each participant is then compared with a correlation coefficient such as the Pearson product-moment correlation coefficient or Spearman’s rank-order correlation. A correlation coefficient of 1.00, while rare, would indicate perfect test-retest reliability, whereas a correlation coefficient of 0.00 would indicate no test-retest reliability at all. A negative coefficient is not good news either, and it is conventionally treated as a value of 0.00.
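A minimal sketch of a test-retest check in R follows; the paired scores of ten hypothetical participants are made up for illustration.

test_1 <- c(12, 15, 18, 22, 25, 27, 30, 31, 35, 38)
test_2 <- c(14, 14, 19, 21, 27, 26, 29, 33, 36, 37)

cor(test_1, test_2, method = "pearson")    # test-retest (stability) coefficient
cor(test_1, test_2, method = "spearman")   # rank-order alternative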

6.2. Parallel forms reliability

Test-retest reliability is not suitable for certain tests, such as those assessing knowledge or performance in general. Participants either face a learning or motivation effect from the first test session or simply improve (or worsen) their skills between sessions. For such cases, the parallel forms method is more suitable. The technique requires a systematic development of two versions of the same measurement instrument, namely two parallel tests, that assess the same construct but use different wording or items. Parallel tests for assessing debugging skills would feature the same sections and number of items, e.g., arithmetic, logic, and syntax errors. The two tests would need different source code snippets that are, however, very similar. A trivial example would be to test for unwanted assignments inside conditions in different places and with different syntax (e.g., using if (n = foo()) in version one and if (x = y + 2) in version two). As with test-retest reliability, each participant faces both tests and a correlation coefficient can be computed.

6.3. Split-half reliability

Split-half reliability is a widely adopted and more convenient alternative to parallel forms reliability. Under this technique, a measurement instrument is split into two half-size versions. The split should be as random as possible, e.g., splitting by taking odd and even numbered items. Participants face both halves of the test and, again, a correlation coefficient can be computed. The obtained coefficient, however, is not a measure of reliability yet. The reliability of the whole measurement instrument is computed with the Spearman-Brown formula in 6.

r_{test} = \frac{2 r_{half}}{1 + r_{half}} \qquad (6)

where r_{half} is the correlation between the two half tests and r_{test} is the estimated reliability of the whole test. This formula shows that the more discriminating items a test has, the higher will be its reliability.
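Below is a minimal sketch of a split-half computation with the Spearman-Brown correction of formula 6; the simulated ten-item response matrix is purely illustrative.

set.seed(7)
n <- 100
ability <- rnorm(n)
responses <- sapply(1:10, function(i) ability + rnorm(n, sd = 1))   # 10 items, 100 participants

odd_half  <- rowSums(responses[, seq(1, 10, by = 2)])   # items 1, 3, 5, 7, 9
even_half <- rowSums(responses[, seq(2, 10, by = 2)])   # items 2, 4, 6, 8, 10

r_half <- cor(odd_half, even_half)
r_test <- (2 * r_half) / (1 + r_half)   # Spearman-Brown correction, formula 6
r_test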

6.4. Inter-rater reliability

Inter-rater reliability is perhaps the form of reliability most commonly found in software engineering studies. Qualitative studies, systematic literature reviews, and mapping studies often have different raters evaluating the same items. The sets of ratings can be assessed using an agreement coefficient. Cohen’s kappa is widely used in the literature for the inter-rater agreement of two raters, together with Fleiss’ kappa for the inter-rater agreement of more than two raters. Cases have also been made for using Krippendorff’s alpha (De Swert, 2012).
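A minimal sketch of Cohen’s kappa for two raters with the psych package follows; the eight paired ratings (1 = relevant, 0 = not relevant) are made up for illustration.

library(psych)

ratings <- data.frame(
  rater_1 = c(1, 0, 1, 1, 0, 0, 1, 0),
  rater_2 = c(1, 0, 1, 0, 0, 0, 1, 1)
)

cohen.kappa(ratings)   # Cohen's kappa with confidence boundaries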

6.5. Standard error of measurement

The standard error of measurement is used for generating confidence intervals around reported scores. It is strictly related to the reliability coefficient (Rust, 2009), as shown in formula 7:

SE_m = \sqrt{\sigma^2 (1 - r_{test})} \qquad (7)

where \sigma^2 is the variance of the test scores and r_{test} is the calculated reliability coefficient of the test. The standard error of measurement also provides an idea of how errors are distributed around observed scores. The standard error of measurement is maximized—and becomes equal to the standard deviation of the observed scores—when a test is completely unreliable, and it is minimized to zero when a test is perfectly reliable.

If the assumption that errors are distributed normally is met, one can calculate the 95% confidence interval by using the z curve value of 1.96 to construct the interval X \pm 1.96 \times SE_m, where X is the observed score. Confidence intervals can also be used to compare participants’ scores: should one participant’s score fall below or above another participant’s interval, the two results would differ significantly.
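The following sketch computes the standard error of measurement of formula 7 and a 95% confidence interval around an observed score; all numbers are illustrative.

scores      <- c(25, 38, 41, 52, 64, 47, 33, 58)
reliability <- 0.85                 # e.g., a split-half or test-retest estimate

se_m <- sqrt(var(scores) * (1 - reliability))   # formula 7

observed <- 41
c(lower = observed - 1.96 * se_m, upper = observed + 1.96 * se_m)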

7. Validity

Validity in psychometrics is defined as “The degree to which evidence and theory support the interpretation of test scores for proposed uses of tests.” (American Educational Research Association et al., 2014). Psychometric validity is therefore a different (but related) concept from the study validity that software engineers are used to dealing with (Wohlin et al., 2012; Feldt and Magazinius, 2010; Siegmund et al., 2015; Petersen and Gencel, 2013). Validation in psychometric research is related to the interpretation of the test scores. For validating a test, we should gather relevant evidence that provides a sound scientific basis for the interpretation of the proposed scores.

Rust (2009) has summarized six major facets of validity in the context of psychometric tests, which we represent in Figure 5 and describe below.

Figure 5. Validity in psychometric theory

7.1. Face validity

Face validity concerns how the items of a measurement instrument are accepted by respondents. For example, software developers expect the wording of certain items to be targeted to them instead of, say, children. Similarly, if a test presents itself as being about a certain construct, such as debugging expertise, it could cause face validity issues if it contained a personality assessment.

7.2. Content validity

Content validity (sometimes called criterion validity or domain-referenced validity) concerns the extent to which a measurement instrument reflects the purpose for which the instrument is being developed. If a test was developed under the specifications of job satisfaction but measured developers’ motivation instead, it would present issues of content validity. Content validity is evaluated qualitatively most of the time (Rust, 2009), because the form of deviation matters more than the degree of deviation.

7.3. Predictive validity

Predictive validity is a statistical validity defined as the correlation between the score of a measurement instrument and a score representing the degree of success in the selected field. For example, the degree of success in debugging performance is expected to be higher with higher programming experience. Computing a score for predictive validity is as simple as calculating a correlation value (such as Pearson or Spearman). According to the commonly used acceptance criterion, a correlation higher than 0.5 could be considered acceptable predictive validity for the items. We would then feel justified in including programming experience as an item to represent the construct of debugging performance capability.

7.4. Concurrent validity

Concurrent validity is a statistical validity that is defined as the correlation of a new measurement instrument and existing measurement instruments for the same construct. A measurement instrument tailored to the personality of software developers ought to correlate with existing personality measurement instruments. While concurrent validity is a common measure for test validity in psychology, it is a weak criterion as the old measurement instrument itself could have a low validity. Nevertheless, concurrent validity is important for detecting low validity issues in measurement instruments.

7.5. Construct validity

Construct validity is a major validity criterion in psychometric tests. As constructs are not directly measurable, we observe the relationship between the test and the phenomena that the test attempts to represent. For example, a test that identifies highly communicative team members should correlate highly with observations of highly communicative people who have been labelled as such. The nature of construct validity is that it is cumulative over the number of available studies (Rust, 2009).

7.6. Differential validity

Differential validity assesses how a measurement instrument correlates with measures from which it should not differ, and how it does not correlate with measures from which it should differ. In particular, Campbell and Fiske (1959) have differentiated between two aspects of differential validity, namely convergent and discriminant validity. Rust (2009) mentions a straightforward example of both. A test of mathematical reasoning should correlate positively with a test of numerical reasoning (convergent validity). However, the mathematics test should not correlate strongly with a test of reading comprehension, because the two constructs are supposed to be different (discriminant validity). In case of low discriminant validity, there should be an investigation of whether the correlation is the result of a wider underlying trait, say, continues Rust, general intelligence. Differential validity is overall empirically demonstrated by a discrepancy between convergent validity and discriminant validity.

8. Fairness in testing and test bias

Fairness is “the quality of treating people equally or in a way that is right or reasonable” (Cambridge (2018), online). A test is fair when it reflects the same constructs for all participants and its scores have the same meaning for all individuals of the population (American Educational Research Association et al., 2014). A fair test neither advantages nor disadvantages any participant through characteristics that are unrelated to the constructs under observation. From a participant’s point of view, an unfair test leads to wrong decisions based on the test results. An example of a test that requires fairness is an attitude or skills assessment when interviewing candidates for hire in an information technology company.

American Educational Research Association et al. (2014) report on several facets of fairness. Individuals should have the opportunity to maximize how they perform with respect to the constructs being assessed. Similarly, a measurement instrument that assesses traits of participants should maximize its ability to detect the extent to which the measured constructs are present among individuals. Part of this fairness comes from how the test is administered, which should be as standardized as possible. Research articles should describe the environment and experimental settings, how the participants were instructed, which time limits were given, and so on. Fairness also comes, on the other hand, from the participants themselves. Participants should be able to access the constructs being measured without being advantaged or disadvantaged by individual characteristics. This is an issue of accessibility to a test and is also part of limiting item, test, and measurement bias.

We provide an overview of bias in psychometric theory in Figure 6.

Figure 6. Bias in psychometric theory

Rust (2009) provides an overview of item, test, and measurement bias. It almost feels unnecessary to state that a measurement instrument should be free from bias related to age, sex, gender, and race. These cases are indeed covered by legislation to ensure fairness. In general, there are three forms of bias in tests, namely item bias, intrinsic test bias, and extrinsic test bias (Rust, 2009).

8.1. Item bias

Item bias, also known as differential item functioning, refers to bias born out of individual items of the measurement instrument. A straightforward example would be to test a (non-UK) European developer about coding snippets dealing with imperial system units. A more common item bias is about the wording of items. Even among native speakers, the use of idioms such as double negatives can cause confusion. Asking a developer to mark a coding snippet that is free from logic and syntax errors is clearer than asking to mark code that does not possess neither logic nor syntax errors.

A systematic identification of item bias that goes beyond carefully checking an instrument is to carry out an item analysis with all possible groups of potential participants, for example men and women, or speakers of English of different levels. A comparison of the facility values (the proportion of correct answers) of each item can reveal potential item bias. For instruments that assess traits and characteristics of a group instead of function or skills, a strategy is to follow a checklist of questions that researchers and pilot participants can answer (Hambleton and Rodgers, 1995).
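A minimal sketch of comparing facility values across two groups follows; the simulated 0/1 response matrix and the grouping variable are hypothetical.

set.seed(3)
responses <- matrix(rbinom(200 * 5, 1, 0.7), ncol = 5,
                    dimnames = list(NULL, paste0("item", 1:5)))
group <- rep(c("A", "B"), each = 100)

# Proportion of correct answers (facility value) per item and group;
# large gaps between the two rows hint at potential item bias.
facility_by_group <- apply(responses, 2, function(item) tapply(item, group, mean))
round(facility_by_group, 2)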

Differential item functioning (DIF) is a statistical characteristic of an item that shows potential unfairness of the item among different groups that should otherwise provide the same test results (Perrone, 2006). The presence of DIF does not necessarily indicate bias but unexpected behavior on an item (American Educational Research Association et al., 2014). This is why, after the detection of DIF, it is important to review the root causes of the differences. Whenever DIF happens for many items of a test, a test construct or final score is potentially unfair among different groups that should otherwise provide the same test results. This situation is called differential test functioning (DTF) (Runnels, 2013). There are three main techniques for identifying DIF, namely the Mantel-Haenszel approach, item response theory (IRT) methods, and logistic regression (Zumbo, 2007).
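As an illustration of the logistic regression approach, the sketch below models a single item’s correctness on the total score, the group, and their interaction; the simulated data (with uniform DIF built in) are purely hypothetical.

set.seed(11)
n     <- 400
group <- rep(c(0, 1), each = n / 2)
total <- rnorm(n, mean = 20, sd = 5)

# Simulate an item that is harder for group 1 at the same total score (uniform DIF).
item <- rbinom(n, 1, plogis(-4 + 0.2 * total - 0.8 * group))

fit <- glm(item ~ total * group, family = binomial)
summary(fit)$coefficients
# A significant group term (given total) suggests uniform DIF;
# a significant total:group interaction suggests non-uniform DIF.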

8.2. Intrinsic test bias

Intrinsic test bias occurs when there are differences in the mean scores of two groups that are due to characteristics of the test itself rather than differences between the groups in the constructs being measured. Measurement invariance is the desired property; when it is lacking, intrinsic test bias occurs. If a test for assessing the knowledge of software quality is developed in English and then administered to individuals who are not fluent in English, the measure of the construct of software quality knowledge would be contaminated by a measure of English proficiency. Differential content validity (see section 7.2) is the most severe form of intrinsic test bias, as it causes lower test scores in different groups. If a measurement instrument for debugging skills has been designed with American software testers in mind, any participant who is not an American software tester will likely perform worse on the test, to different degrees. Rust (2009) reports various statistical models proposed over the last 50 years to detect intrinsic test bias which, however, present various issues, including the introduction of more unfairness near cut-off points or for certain groups of individuals. There is no recommended way to detect intrinsic test bias other than performing item bias analysis paired with sensitivity analysis.

8.3. Extrinsic test bias

Extrinsic test bias occurs whenever unfair decisions are made based on an unbiased test. These issues usually pertain to tests about demographics dealing with social, economic, and political issues, so they are unlikely to concern measurement instruments developed for the software engineering domain.

9. Further reading

The present paper only scratches the surface of psychometric theory and practice, and its aim is to be broad rather than deep. In this section, we collect what we consider to be good next steps for a better understanding and expansion of the concepts that we have presented.

The books by Rust (2009), Kline (2015), and Nunnally (1994) provide an overall overview of psychometric theory and cover all topics mentioned in the present paper, and more. We invite the reader in particular to compare how they present measurement theory and their views and classifications of validity and reliability. A natural follow-up is The Standards for Educational and Psychological Testing (SEPT, (American Educational Research Association et al., 2014)), which proposes standards that should be met in psychological testing.

While our summary breaks down fundamental concepts and presents them for the uninitiated researcher of behavioral software engineering, our writing cannot honor enough the guidelines and recommendations for factor analysis offered by Yong and Pearce (2013); Russell (2016); Singh et al. (2016); Fabrigar et al. (1999). To those we add the work of Zumbo (2007), who explored, through data simulations, the conditions that yield reliable exploratory factor analysis with samples below 50, which is unfortunately a condition we often live with in software engineering research. Furthermore, we wish to point the reader to alternatives to factor analysis, especially for confirmatory factor analysis (CFA). Flora and Curran (2004) analyzed the benefits of using Robust Weighted Least Squares (Robust WLS) regression. With a Monte Carlo simulation, they have shown that robust WLS provided accurate test statistics, parameter estimates, and standard errors even when the assumptions of CFA were not met. Bayesian alternatives for CFA have been proposed in the early 80s already (Lee, 1981b) and later expanded to cover the exploratory phase as well; see, for example, the works by Conti et al. (2014b); Muthén and Asparouhov (2012); Lu et al. (2016).

In the sections above we have pointed to several papers that can provide a modern, Bayesian statistical view of many psychometric analysis procedures. We also note that a more general treatment and overview can be found in Levy and Mislevy (2016). While it is important for an SE researcher who wants to use and develop psychometric instruments to know the key concepts and techniques of the more classical, typically frequentist, psychometric methods, one can then switch to a Bayesian view for either philosophical or practical reasons (a simpler, more unified treatment, for one).

Within the software engineering domain, Gren (2018) has offered an alternative lens on the validity and reliability of software engineering studies, also based on psychology, that we advise reading. Ralph and Tempero (2018) have offered a deep overview of construct validity in software engineering through a psychological lens.

10. Running example of psychometric evaluation

We believe that a methodology description is best complemented by a concrete example of its application. In Appendix A, we thus provide a complete scenario of the development of a fictitious measurement instrument and the establishment of its psychometric properties with the R programming language. The evaluation follows the same structure as the present paper for ease of understanding. In the spirit of open science in software engineering (Fernández et al., 2020), we also provide the running example as a replication package (Graziotin et al., 2020). We wrote the example using R Markdown, making it fully repeatable, and we share the generated dataset as well as instructions for replication with newly generated data.

11. Conclusion

The adoption and development of valid and reliable measurement instruments in software engineering research, whenever humans are to be evaluated, should benefit from psychology and statistics theory; we need not and should not ‘reinvent the wheel’. This paper provides a brief introduction to the evaluation of psychometric tests. Our guidelines will contribute to a better development of new tests as well as a justified decision-making process when selecting existing tests.

After providing basic building blocks and concepts of psychometric theory, we introduced item analysis, factor analysis, standardization and normalization, reliability, validity, and fairness in testing and test bias. In an appendix, we also provided a running example of an implementation of a psychometric evaluation and shared both its data and source code (scripts) openly to promote self-study and a basis for further exploration. We followed textbooks, method papers, and society standards for ensuring a coverage of all important steps, but we could only offer a brief introduction and invite the reader to explore our referenced material further. Each of these steps is a universe of its own, with dozens of published artifacts related to them.

Adding the steps described in this paper will increase the time required for developing measurement instruments. However, the return on investment will be considerable. Psychometric analysis and refinement of measurement instruments can improve their reliability and validity. The software engineering community must value psychometric studies more. This, however, requires a cultural change that we hope to champion with this paper.

“Spending an entire Ph.D. candidacy on the validation of one single measurement of a construct should be, not only approved, but encouraged.” (Gren, 2018) and, we believe, should also become normal.

Acknowledgements.
We acknowledge the support of the Swedish Armed Forces, the Swedish Defense Materiel Administration, and the Swedish Governmental Agency for Innovation Systems (VINNOVA) in the project number 2013-01199.

References

  • Abdi (2003) Hervé Abdi. 2003. Factor Rotations in Factor Analyses. In Encyclopedia of Social Sciences Research Methods, A. Lewis-Beck M., Bryman and Futing T (Eds.). SAGE, Thousand Oaks (CA), 792–795.
  • Alphen et al. (1994) Arnold Alphen, Ruud Halfens, Arie Hasman, and Tjaart Imbos. 1994. Likert or Rasch? Nothing is more applicable than good theory. Journal of Advanced Nursing 20, 1 (1994), 196–201.
  • American Educational Research Association et al. (2014) American Educational Research Association, American Psychological Association, National Council on Measurement in Education, and Joint Committee on Standards for Educational and Psychological Testing (U.S.). 2014. Standards for educational and psychological testing. American Educational Research Association, Washington, DC.
  • Ansari et al. (2002) Asim Ansari, Kamel Jedidi, and Laurette Dube. 2002. Heterogeneous factor analysis models: A Bayesian approach. Psychometrika 67, 1 (2002), 49–77.
  • Berger (2013a) Michael Berger. 2013a. Criterion-Referenced Testing. In Encyclopedia of Autism Spectrum Disorders. Springer New York, Springer New York, 823–823. https://doi.org/10.1007/978-1-4419-1698-3_146
  • Berger (2013b) Michael Berger. 2013b. Norm-Referenced Testing. In Encyclopedia of Autism Spectrum Disorders. Springer New York, Springer New York, 2063–2064. https://doi.org/10.1007/978-1-4419-1698-3_451
  • Browne (2001) Michael W. Browne. 2001. An Overview of Analytic Rotation in Exploratory Factor Analysis. Multivariate Behavioral Research 36, 1 (2001), 111–150.
  • Cambridge (2018) English Dictionary Cambridge. 2018. Fairness. Cambridge English Dictionary 1, 1 (2018), 1. Available: https://dictionary.cambridge.org/dictionary/english/fairness.
  • Campbell and Fiske (1959) Donald T. Campbell and Donald W. Fiske. 1959. Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin 56, 2 (1959), 81–105.
  • Capretz (2003) Luiz Fernando Capretz. 2003. Personality types in software engineering. International Journal of Human-Computer Studies 58, 2 (2003), 207–214.
  • Carnap (1962) Rudolf Carnap. 1962. Logical foundations of probability. The University of Chicago Press.
  • Cattell (1966) Raymond B. Cattell. 1966. The Scree Test For The Number Of Factors. Multivariate Behavioral Research 1, 2 (1966), 245–276.
  • Ciolkowski et al. (2003) Marcus Ciolkowski, Oliver Laitenberger, Sira Vegas, and Stefan Biffl. 2003. Practical Experiences in the Design and Conduct of Surveys in Empirical Software Engineering. In Empirical methods and studies in software engineering, Gerhard Goos, Juris Hartmanis, Jan van Leeuwen, Reidar Conradi, and Alf Inge Wang (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 104–128.
  • Cohen et al. (1995) Ronald Jay Cohen, Mark E. Swerdlik, and Suzanne M. Phillips. 1995. Psychological Testing and Assessment: An Introduction to Tests and Measurement. Mayfield Pub Co.
  • Collins (2003) D Collins. 2003. Pretesting survey instruments: an overview of cognitive methods. Qual Life Res 12, 3 (2003), 229–238.
  • Conti et al. (2014a) Gabriella Conti, Sylvia Frühwirth-Schnatter, James J Heckman, and Rémi Piatek. 2014a. Bayesian exploratory factor analysis. Journal of econometrics 183, 1 (2014), 31–57.
  • Conti et al. (2014b) G Conti, S Frühwirth-Schnatter, JJ Heckman, and R Piatek. 2014b. Bayesian Exploratory Factor Analysis. J Econom 183, 1 (2014), 31–57.
  • Courtney (2013) Matthew Gordon Rau Courtney. 2013. Determining the Number of Factors to Retain in EFA: Using the SPSS R-Menu v2.0 to Make More Judicious Estimations. Practical Assessment, Research & Evaluation 18, 8 (2013), 1–14.
  • Crocker and Algina (2006) Linda Crocker and James Algina. 2006. Introduction to Classical and Modern Test Theory. Wadsworth Pub Co.
  • Cronbach (1951) Lee J Cronbach. 1951. Coefficient alpha and the internal structure of tests. psychometrika 16, 3 (1951), 297–334.
  • Cruz et al. (2015) Shirley Cruz, Fabio Q.B. da Silva, and Luiz Fernando Capretz. 2015. Forty years of research on personality in software engineering: A mapping study. Computers in Human Behavior 46 (2015), 94–113. https://doi.org/10.1016/j.chb.2014.12.008
  • Darcy and Ma (2005) David P Darcy and Meng Ma. 2005. Exploring individual characteristics and programming performance: Implications for programmer selection. In Proceedings of the 38th Annual Hawaii International Conference on System Sciences. IEEE, 314a.
  • Darton (1980) R. A. Darton. 1980. Rotation in Factor Analysis. The Statistician 29, 3 (1980), 167.
  • De Swert (2012) Knut De Swert. 2012. Calculating inter-coder reliability in media content analysis using Krippendorff’s Alpha. Center for Politics and Communication (2012), 1–15.
  • Embretson and Reise (2013) Susan E Embretson and Steven P Reise. 2013. Item response theory. Psychology Press.
  • Fabrigar et al. (1999) Leandre R Fabrigar, Duane T Wegener, Robert C MacCallum, and Erin J Strahan. 1999. Evaluating the use of exploratory factor analysis in psychological research. Psychological methods 4, 3 (1999), 272.
  • Fagerholm (2015) Fabian Fagerholm. 2015. Software Developer Experience: Case Studies in Lean-Agile and Open Source Environments. Ph.D. Dissertation. Department of Computer Science, University of Helsinki, Helsinki.
  • Fagerholm and Pagels (2014) Fabian Fagerholm and Max Pagels. 2014. Examining the Structure of Lean and Agile Values among Software Developers. In Lecture Notes in Business Information Processing: Agile Processes in Software Engineering and Extreme Programming. Springer International Publishing, Cham, 218–233.
  • Feldt and Magazinius (2010) Robert Feldt and Ana Magazinius. 2010. Validity Threats in Empirical Software Engineering Research - An Initial Survey. In Proceedings of the 22nd International Conference on Software Engineering & Knowledge Engineering (SEKE’2010), Redwood City, San Francisco Bay, CA, USA, July 1 - July 3, 2010. 374–379. Available: http://www.cse.chalmers.se/~feldt/publications/feldt_2010_validity_threats_in_ese_initial_survey.pdf.
  • Feldt et al. (2008) Robert Feldt, Richard Torkar, Lefteris Angelis, and Maria Samuelsson. 2008. Towards individualized software engineering: empirical studies should collect psychometrics. In Proceedings of the 2008 international workshop on cooperative and human aspects of software engineering. ACM Press, New York, New York, USA, 49–52.
  • Fernández et al. (2020) Daniel Méndez Fernández, Daniel Graziotin, Stefan Wagner, and Heidi Seibold. 2020. Open science in software engineering. In Contemporary Empirical Methods in Software Engineering, Michael Felderer and Guilherme Horta Travassos (Eds.). Springer International Publishing, Cham, Switzerland, 479–504. arXiv:1712.08341 [cs.SE] In press. Available https://arxiv.org/abs/1712.08341.
  • Flora and Curran (2004) David B Flora and Patrick J Curran. 2004. An empirical evaluation of alternative methods of estimation for confirmatory factor analysis with ordinal data. Psychological methods 9, 4 (2004), 466.
  • Fox (2017) John Fox. 2017. sem: Structural Equation Models. Technical Report. The Comprehensive R Archive Network. 1–79 pages.
  • Furia et al. (2019) Carlo Alberto Furia, Robert Feldt, and Richard Torkar. 2019. Bayesian data analysis in empirical software engineering research. IEEE Transactions on Software Engineering (2019).
  • Ginty (2013) Annie T. Ginty. 2013. Psychometric Properties. In Encyclopedia of Behavioral Medicine, Marc D. Gellman and J. Rick Turner (Eds.). Springer New York, New York, NY, 1563–1564. https://doi.org/10.1007/978-1-4419-1005-9_480
  • Glaser (1963) Robert Glaser. 1963. Instructional technology and the measurement of learing outcomes: Some questions. American Psychologist 18, 8 (1963), 519–521. https://doi.org/10.1037/h0049294
  • Graziotin et al. (2017) Daniel Graziotin, Fabian Fagerholm, Xiaofeng Wang, and Pekka Abrahamsson. 2017. On the Unhappiness of Software Developers. In Proceedings of the 21st International Conference on Evaluation and Assessment in Software Engineering, Emilia Mendes, Steve Counsell, and Kai Petersen (Eds.). ACM Press, New York, New York, USA, 324–333.
  • Graziotin et al. (2020) Daniel Graziotin, Per Lenberg, Robert Feldt, and Stefan Wagner. 2020. Behavioral Software Engineering - Example of psychometric evaluation with R. https://doi.org/10.5281/zenodo.3799603
  • Graziotin et al. (2015a) Daniel Graziotin, Xiaofeng Wang, and Pekka Abrahamsson. 2015a. The Affect of Software Developers: Common Misconceptions and Measurements. In 2015 IEEE/ACM 8th International Workshop on Cooperative and Human Aspects of Software Engineering (CHASE). IEEE, 123–124.
  • Graziotin et al. (2015b) Daniel Graziotin, Xiaofeng Wang, and Pekka Abrahamsson. 2015b. Do feelings matter? On the correlation of affects and the self-assessed productivity in software engineering. Journal of Software: Evolution and Process 27, 7 (2015), 467–487.
  • Graziotin et al. (2015c) Daniel Graziotin, Xiaofeng Wang, and Pekka Abrahamsson. 2015c. Understanding the Affect of Developers: Theoretical Background and Guidelines for Psychoempirical Software Engineering. In Proceedings of the 7th International Workshop on Social Software Engineering (Bergamo, Italy) (SSE 2015). ACM, New York, NY, USA, 25–32. https://doi.org/10.1145/2804381.2804386
  • Green (2009) Christopher D. Green. 2009. Darwinian theory, functionalism, and the first American psychological revolution. American Psychologist 64, 2 (2009), 75–83. https://doi.org/10.1037/a0013338
  • Gren (2018) Lucas Gren. 2018. Standards of validity and the validity of standards in behavioral software engineering research. ACM Press, New York, New York, USA.
  • Gren and Goldman (2016) Lucas Gren and Alfredo Goldman. 2016. Useful Statistical Methods for Human Factors Research in Software Engineering: A Discussion on Validation with Quantitative Data. In Proceedings of the 9th International Workshop on Cooperative and Human Aspects of Software Engineering (Austin, Texas) (CHASE ’16). Association for Computing Machinery, New York, NY, USA, 121–124. https://doi.org/10.1145/2897586.2897588
  • Hambleton and Rodgers (1995) Ronald K Hambleton and Jane Rodgers. 1995. Item bias review. ERIC Clearinghouse on Assessment and Evaluation, the Catholic University of America, Department of Education.
  • Hambleton et al. (1991) Ronald K Hambleton, Hariharan Swaminathan, and H Jane Rogers. 1991. Fundamentals of item response theory. Sage.
  • Hilgard (1980) Ernest R Hilgard. 1980. The trilogy of mind: Cognition, affection, and conation. Journal of the History of the Behavioral Sciences 16, 2 (1980), 107–117.
  • Hogan (2017) Robert Hogan. 2017. Personality and the fate of organizations. Psychology Press.
  • Horn (1965) John L Horn. 1965. A Rationale And Test For The Number Of Factors In Factor Analysis. Psychometrika 30 (1965), 179–185.
  • Ji et al. (2008) Junzhong Ji, Jingyue Li, Reidar Conradi, Chunnian Liu, Jianqiang Ma, and Weibing Chen. 2008. Some lessons learned in conducting software engineering surveys in china. In Proceedings of the Second ACM-IEEE international symposium on Empirical software engineering and measurement. IEEE, 168–177.
  • Kaiser (1958) Henry F. Kaiser. 1958. The varimax criterion for analytic rotation in factor analysis. Psychometrika 23, 3 (1958), 187–200.
  • Kaiser (1960) Henry F. Kaiser. 1960. The Application of Electronic Computers to Factor Analysis. Educational and Psychological Measurement 20, 1 (1960), 141–151.
  • Kaplan (2008) David Kaplan. 2008. Structural equation modeling: Foundations and extensions. Vol. 10. Sage Publications.
  • Kitchenham (2007) B A Kitchenham. 2007. Guidelines for performing systematic literature reviews in software engineering. Technical Report. Keele University and University of Durham Keele and Durham, UK. 1–65 pages.
  • Kitchenham and Pfleeger (2008) Barbara A. Kitchenham and Shari L. Pfleeger. 2008. Personal Opinion Surveys. In Guide to Advanced Empirical Software Engineering, Forrest Shull, Janice Singer, and Dag I. K. Sjøberg (Eds.). Springer London, London, 63–92.
  • Kline (2015) Paul Kline. 2015. A handbook of test construction (psychology revivals): introduction to psychometric design. Routledge.
  • Kootstra (2006) GJ Kootstra. 2006. Exploratory Factor Analysis: Theory and Application. Technical Report. University of Groningen. 1–15 pages.
  • Lee (1981a) Sik-Yum Lee. 1981a. A Bayesian approach to confirmatory factor analysis. Psychometrika 46, 2 (1981), 153–160.
  • Lee (1981b) Sik-Yum Lee. 1981b. A bayesian approach to confirmatory factor analysis. Psychometrika 46, 2 (1981), 153–160.
  • Lenberg et al. (2017a) Per Lenberg, Robert Feldt, Lars Göran Wallgren Tengberg, Inga Tidefors, and Daniel Graziotin. 2017a. Behavioral software engineering - guidelines for qualitative studies. arXiv:1712.08341 [cs.SE] Available https://arxiv.org/abs/1712.08341.
  • Lenberg et al. (2015) Per Lenberg, Robert Feldt, and Lars Göran Wallgren. 2015. Behavioral software engineering: A definition and systematic literature review. Journal of Systems and Software 107 (2015), 15–37. https://doi.org/10.1016/j.jss.2015.04.084
  • Lenberg et al. (2017b) Per Lenberg, Lars Göran Wallgren Tengberg, and Robert Feldt. 2017b. An initial analysis of software engineers’ attitudes towards organizational change. Empirical Software Engineering 22, 4 (2017), 2179–2205.
  • Levy and Mislevy (2016) Roy Levy and Robert J Mislevy. 2016. Bayesian psychometric modeling. CRC Press.
  • Likert (1932) Rensis Likert. 1932. A technique for the measurement of attitudes. Archives of psychology 22, 40 (1932), 1–55.
  • Loevinger (1957) Jane Loevinger. 1957. Objective Tests as Instruments of Psychological Theory. Psychological Reports 3 (1957), 635–694. https://doi.org/10.2466/pr0.1957.3.3.635
  • Lu et al. (2016) ZH Lu, SM Chow, and E Loken. 2016. Bayesian Factor Analysis as a Variable-Selection Problem: Alternative Priors and Consequences. Multivariate Behav Res 51, 4 (2016), 519–539.
  • MacCallum and Austin (2000) RC MacCallum and JT Austin. 2000. Applications of structural equation modeling in psychological research. Annu Rev Psychol 51 (2000), 201–226.
  • MacCallum et al. (1999) Robert C. MacCallum, Keith F. Widaman, Shaobo Zhang, and Sehee Hong. 1999. Sample size in factor analysis. Psychological Methods 4, 1 (1999), 84–99.
  • McDonald and Edwards (2007) Sharon McDonald and Helen M. Edwards. 2007. Who should test whom. Commun. ACM 50, 1 (2007), 66–71.
  • Molléri et al. (2016) Jefferson Seide Molléri, Kai Petersen, and Emilia Mendes. 2016. Survey guidelines in software engineering: An annotated review. In Proceedings of the 10th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement. IEEE, 58.
  • Muthén and Asparouhov (2012) Bengt Muthén and Tihomir Asparouhov. 2012. Bayesian structural equation modeling: A more flexible representation of substantive theory. Psychological Methods 17, 3 (2012), 313–335.
  • Norris and Lecavalier (2010) Megan Norris and Luc Lecavalier. 2010. Evaluating the Use of Exploratory Factor Analysis in Developmental Disability Psychological Research. Journal of Autism and Developmental Disorders 40, 1 (01 Jan 2010), 8–20. https://doi.org/10.1007/s10803-009-0816-2
  • Nunnally (1994) Jum C Nunnally. 1994. Psychometric theory 3E. Tata McGraw-Hill Education.
  • Oppenheim (1992) A.N. Oppenheim. 1992. Questionnaire Design, Interviewing and Attitude Measurement. Pinter Pub Ltd, London, UK.
  • Pearson (1901) Karl Pearson. 1901. LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2, 11 (Nov. 1901), 559–572. https://doi.org/10.1080/14786440109462720
  • Perrone (2006) Michael Perrone. 2006. Differential item functioning and item bias: Critical considerations in test fairness. Teachers College, Columbia University Working Papers in TESOL and Applied Linguistics 6 (2006), 1–3.
  • Petersen and Gencel (2013) Kai Petersen and Cigdem Gencel. 2013. Worldviews, Research Methods, and their Relationship to Validity in Empirical Software Engineering Research. In 2013 Joint Conference of the 23rd International Workshop on Software Measurement and the 8th International Conference on Software Process and Product Measurement (IWSM-MENSURA). IEEE, 81–89.
  • Pittenger (1993) David J Pittenger. 1993. Measuring the MBTI… and coming up short. Journal of Career Planning and Employment 54, 1 (1993), 48–52.
  • Pukelsheim (1994) Friedrich Pukelsheim. 1994. The Three Sigma Rule. The American Statistician 48, 2 (1994), 88–91.
  • Ralph et al. (2020) Paul Ralph, Sebastian Baltes, Gianisa Adisaputri, Richard Torkar, Vladimir Kovalenko, Marcos Kalinowski, Nicole Novielli, Shin Yoo, Xavier Devroey, Xin Tan, Minghui Zhou, Burak Turhan, Rashina Hoda, Hideaki Hata, Gregorio Robles, Amin Milani Fard, and Rana Alkadhi. 2020. Pandemic Programming: How COVID-19 affects software developers and how their organizations can help. 32 pages. arXiv:2005.01127 [cs.SE] Available: https://arxiv.org/abs/2005.01127.
  • Ralph and Tempero (2018) Paul Ralph and Ewan Tempero. 2018. Construct Validity in Software Engineering Research and Software Metrics. ACM Press, New York, New York, USA.
  • Revelle (2009) William Revelle. 2009. An introduction to psychometric theory with applications in R. personality-project.org.
  • Revelle (2018a) William Revelle. 2018a. An introduction to the psych package: Part II Scale construction and psychometrics. Technical Report. The Comprehensive R Archive Network. 1–97 pages.
  • Revelle (2018b) William Revelle. 2018b. Using the psych package to generate and test structural models. Technical Report. The Comprehensive R Archive Network. 1–52 pages.
  • Revelle (2019) William Revelle. 2019. psych: Procedures for Psychological, Psychometric, and Personality Research. Northwestern University, Evanston, Illinois. https://cran.r-project.org/package=psych R package version 1.9.12.
  • Revelle (2020) William Revelle. 2020. How To: Use the psych package for Factor Analysis and data reduction. Technical Report. The Comprehensive R Archive Network. 1–96 pages.
  • Revelle and Rocklin (1979) William Revelle and Thomas Rocklin. 1979. Very simple structure: An alternative procedure for estimating the optimal number of interpretable factors. Multivariate Behavioral Research 14, 4 (1979), 403–414.
  • Runnels (2013) Judith Runnels. 2013. Measuring differential item and test functioning across academic disciplines. Language Testing in Asia 3, 1 (2013), 9.
  • Rus et al. (2002) Ioana Rus, Mikael Lindvall, and S Sinha. 2002. Knowledge management in software engineering. IEEE software 19, 3 (2002), 26–38.
  • Russell (2016) Daniel W. Russell. 2016. In Search of Underlying Dimensions: The Use (and Abuse) of Factor Analysis in Personality and Social Psychology Bulletin. Personality and Social Psychology Bulletin 28, 12 (2016), 1629–1646. https://doi.org/10.1177/014616702237645
  • Rust (2009) John Rust. 2009. Modern psychometrics : the science of psychological assessment. Routledge, Hove, East Sussex New York.
  • Schad et al. (2019) Daniel J Schad, Michael Betancourt, and Shravan Vasishth. 2019. Toward a principled Bayesian workflow in cognitive science. arXiv preprint arXiv:1904.12765 (2019).
  • Schwarz and Oyserman (2016) Norbert Schwarz and Daphna Oyserman. 2016. Asking Questions About Behavior: Cognition, Communication, and Questionnaire Construction. American Journal of Evaluation 22, 2 (2016), 127–160. https://doi.org/10.1177/109821400102200202
  • Shea et al. (2014) CM Shea, SR Jacobs, DA Esserman, K Bruce, and BJ Weiner. 2014. Organizational readiness for implementing change: a psychometric assessment of a new measure. Implement Sci 9 (2014), 7.
  • Siegmund et al. (2015) Janet Siegmund, Norbert Siegmund, and Sven Apel. 2015. Views on internal and external validity in empirical software engineering. In Software Engineering (ICSE), 2015 IEEE/ACM 37th IEEE International Conference on, Vol. 1. IEEE, 9–19. https://doi.org/10.1109/ICSE.2015.24
  • Singh et al. (2016) Kamlesh Singh, Mohita Junnarkar, Jasleen Kaur, Kamlesh Singh, Mohita Junnarkar, and Jasleen Kaur. 2016. Norms for Test Construction. In Measures of Positive Psychology. Springer India, New Delhi, 17–34.
  • Sutton et al. (2003) Stephen Sutton, David P. French, Susie J. Hennings, Jo Mitchell, Nicholas J. Wareham, Simon Griffin, Wendy Hardeman, and Ann Louise Kinmonth. 2003. Eliciting salient beliefs in research on the theory of planned behaviour: The effect of question wording. Current Psychology 22, 3 (2003), 234–251. https://doi.org/10.1007/s12144-003-1019-1
  • Swart and Kinnie (2003) Juani Swart and Nicholas Kinnie. 2003. Sharing knowledge in knowledge-intensive firms. Human resource management journal 13, 2 (2003), 60–75.
  • Tabachnick et al. (2007) Barbara G Tabachnick, Linda S Fidell, and Jodie B Ullman. 2007. Using multivariate statistics. Vol. 5. Pearson Boston, MA, Boston.
  • Tinsley and Tinsley (1987) Howard E. Tinsley and Diane J. Tinsley. 1987. Uses of factor analysis in counseling psychology research. Journal of Counseling Psychology 34, 4 (1987), 414–424.
  • Traub (2005) Ross E. Traub. 2005. Classical Test Theory in Historical Perspective. Educational Measurement: Issues and Practice 16, 4 (Oct. 2005), 8–14. https://doi.org/10.1111/j.1745-3992.1997.tb00603.x
  • Wagenmakers et al. (2018) Eric-Jan Wagenmakers, Maarten Marsman, Tahira Jamil, Alexander Ly, Josine Verhagen, Jonathon Love, Ravi Selker, Quentin F Gronau, Martin Šmíra, Sacha Epskamp, et al. 2018. Bayesian inference for psychology. Part I: Theoretical advantages and practical ramifications. Psychonomic bulletin & review 25, 1 (2018), 35–57.
  • Wagner et al. (2020) Stefan Wagner, Daniel Mendez, Michael Felderer, Daniel Graziotin, and Marcos Kalinowski. 2020. Challenges in survey research. In Contemporary Empirical Methods in Software Engineering, Michael Felderer and Guilherme Horta Travassos (Eds.). Springer International Publishing, Cham, Switzerland, 95–127. arXiv:1908.05899 [cs.SE] In press. Available https://arxiv.org/abs/1908.05899.
  • Wasserstein et al. (2019) Ronald L Wasserstein, Allen L Schirm, and Nicole A Lazar. 2019. Moving to a world beyond “p< 0.05”.
  • Weinberg (1971) Gerald M Weinberg. 1971. The psychology of computer programming. Vol. 932633420. Van Nostrand Reinhold New York.
  • Widaman (1993) Keith F. Widaman. 1993. Common Factor Analysis Versus Principal Component Analysis: Differential Bias in Representing Model Parameters. Multivariate Behavioral Research 28, 3 (1993), 263–311. https://doi.org/10.1207/s15327906mbr2803_1
  • Wohlin et al. (2012) Claes Wohlin, Per Runeson, Martin Höst, Magnus C. Ohlsson, Björn Regnell, and Anders Wesslén. 2012. Experimentation in Software Engineering. Springer Berlin Heidelberg, Berlin, Heidelberg.
  • Wyrich et al. (2019) Marvin Wyrich, Daniel Graziotin, and Stefan Wagner. 2019. A theory on individual characteristics of successful coding challenge solvers. PeerJ Computer Science 5 (2019), e173.
  • Yong and Pearce (2013) An Gie Yong and Sean Pearce. 2013. A Beginner’s Guide to Factor Analysis: Focusing on Exploratory Factor Analysis. Tutorials in Quantitative Methods for Psychology 9, 2 (2013), 79–94.
  • Zumbo (2007) Bruno D Zumbo. 2007. Three generations of DIF analyses: Considering where it has been, where it is now, and where it is going. Language assessment quarterly 4, 2 (2007), 223–233.

Appendix A Appendix

The following appendix (explained in section 10) is available as an always-updated replication package (Graziotin et al., 2020); see example/graziotin_et_al-bse_psychometrics_example.pdf within the package.