Software is developed for people, by people. For decades we have recognized that no matter the size and importance of the technical side of software engineering, it is humans that ultimately drive the underlying processes and produce the desired artifacts (Weinberg, 1971). Software engineers are knowledge workers and have knowledge as main capital (Swart and Kinnie, 2003). They need to construct, retrieve, model, aggregate, and present knowledge in all their analytic and creative daily activities (Rus et al., 2002). Operations related to knowledge are cognitive in nature, and cognition is influenced by characteristics of human behavior, including personality, affect, and motivation (Hilgard, 1980). It is no wonder that industry and academia have explored psychological aspects of software development and the assessment of psychological constructs at the individual, team, and organization level (Lenberg et al., 2015).
Psychological assessment is the gathering of psychology-related data towards an evaluation that is accomplished through the use of tools such as tests, interviews, case studies, behavioral observation, and other procedures (Cohen et al., 1995). We are interested in psychological tests, as behavioral software engineering has turned much attention to employing theory and measurement instruments from psychology (Lenberg et al., 2015; Graziotin et al., 2015c). (Footnote 1: We are interested in qualitative research as well. We have offered our proposal for guidelines for qualitative behavioral software engineering elsewhere (Lenberg et al., 2017a).)
Psychological tests are instruments (e.g., questionnaires) used to measure unobserved constructs (Cohen et al., 1995). We cannot assess these constructs directly, as we can when measuring the source lines of code. Hence, the related variables are called latent variables (Nunnally, 1994). Examples of unobserved constructs include attitude, mood, happiness, job satisfaction, commitment, motivation, intelligence, soft skills, abilities, and performance. We need to create valid and reliable measurement instruments to proxy the assessment of such constructs. (Footnote 2: In this paper, we use the terms psychological test, measurement instrument, and questionnaire interchangeably.) For ensuring a systematic and sound development of psychological tests and their interpretation, the field of psychometrics was born (Rust, 2009; Nunnally, 1994).
Psychometrics is the development of measurement instruments and the assessment of whether these instruments are reliable and valid forms of measurement (Ginty, 2013). Psychometrics is also the branch of statistics and psychology that is devoted to the construction of valid and reliable psychological tests (Rust, 2009).
Psychological testing is one of the most important contributions of psychology to our society. Proper development and validation of tests results in better decisions about individuals, while, on the opposite end, improper development and validation of tests might result in invalid results, economic loss, and even harm to individuals (American Educational Research Association et al., 2014). Personality assessment is a classic example of psychological testing in personnel selection, which has been employed by all companies, including those related to information technology (Darcy and Ma, 2005; Wyrich et al., 2019). Improper development, administration, and handling of psychological tests could harm the company by leading it to hire an undesirable person, and it could harm the interviewee because of missed opportunities.
We believe that solid theoretical and methodological foundations should be the first step when designing any measurement instrument. The reality, however, is that not all tests are well developed in psychology (American Educational Research Association et al., 2014), and software engineering research, especially when studying psychological constructs, is far from adopting rigorous and validated research artifacts.
1.1. Abuse and misuse of psychological tests in software engineering research
McDonald and Edwards (2007) have subtitled their paper “Examining the use and abuse of personality tests in software engineering”. The authors anticipated, thirteen years ago, the issue that we attempt to address in the present submission, that is, “the lack of progress in this [personality research in software engineering] field is due in part to the inappropriate use of psychological tests, frequently coupled with basic misunderstandings of personality theory by those who use them” (p. 67).
Instances of such misconduct can be observed in the results of a systematic literature review of personality research in software engineering by Cruz et al. (2015). (Footnote 3: We believe that direct accusations bring no value to our contribution and are counterproductive to our advancement of knowledge, so we discuss resources that point to specific issues rather than specific papers or authors.) We noted in the results of Cruz et al. (2015) that 48% of personality studies in software engineering have employed the Myers-Briggs Type Indicator (MBTI) questionnaire, which has been shown to possess little to no reliability and validity (Pittenger, 1993), up to the point of being called “little more than an elaborate Chinese fortune cookie” (Hogan, 2017).
Feldt et al. (2008) have argued in favor of systematic studies of human aspects of software engineering and, more specifically, in favor of adopting measurement instruments coming from psychology and related fields. Graziotin et al. (2015b) echoed the call seven years later but found that research on the affect of software developers had been threatened by a deep misunderstanding of related constructs and how to assess them. In particular, the authors noted that peers in software engineering tend to confuse affect-related psychological constructs such as emotions and moods with related, yet different, constructs such as motivation, commitment, and well-being.
Lenberg et al. (2015) have conducted a systematic literature review of studies of human aspects that made use of behavioral science, calling the field behavioral software engineering. Among their results, they found that software engineering research is threatened by several knowledge gaps when performing behavioral research, and that there have been very few collaborations between software engineering and behavioral science researchers.
Graziotin et al. (2015c), meanwhile, extended their prior observations on affect to a broader view of software engineering research with a psychological perspective. Given the observation that much research in the field has misinterpreted (when not ignored) validated measurement instruments from psychology, the work offered what we can consider the sentiment for the present article, that is, brief guidelines to select a theoretical framework and validated measurement instruments from psychology. Graziotin et al. (2015c) called the field “psychoempirical software engineering” but later agreed with Lenberg et al. (2015) to unify the vision under “behavioral software engineering”. Hence, the present collaboration.
Our previous studies have also reported that, when a validated test from psychology is adopted by software engineering researchers, its items often get modified, which destroys its reliability and validity properties. This includes a thorough evaluation of the psychometric properties of candidate instruments. Gren (2018) has offered a psychological test theory lens for characterizing validity and reliability in behavioral software engineering research, further reinforcing our view that software engineering research that investigates any psychological construct should maintain fair psychometric properties. We agree with Gren (2018) that we should “change the culture in software engineering research from seeing tool-constructing as the holy grail of research and instead value [psychometric] validation studies higher” (p. 3).
A mea culpa works better than a j’accuse in further building our case, so we bring a negative example from one of our previous studies. As reported in a very recent work by Ralph et al. (2020) (which we appreciate in the next paragraph), “there is no widespread consensus about how to measure developers’ productivity or the main antecedents thereof. Many researchers use simple, unvalidated productivity scales” (p. 6). In one of the earliest works by the first author of the present paper (Graziotin et al., 2015a), which was published well after the study was conducted, we compared the affect triggered by a software development task with the self-assessed productivity of individual programmers. While we were very careful to select a validated measurement instrument of emotions and to highlight how self-assessment of productivity converges to objective assessment of productivity, we used a single Likert item to represent productivity. We made this choice to keep the measurement instrument, which had to be used every ten minutes, as short as possible. While the results of the study are not invalidated by this choice, the productivity scale itself was not validated, which makes the results, and thus our interpretation of them, less valuable from a psychometric perspective. The study was also (successfully) independently replicated twice by two ICSE papers, which suffer from the same unfortunate choice.
We wish to refrain from being overly negative. The field of software engineering does have positive cases (excluding those from the present authors) that we can showcase here. For example, Fagerholm and Pagels (2014) developed a questionnaire on lean and agile values and applied psychometric approaches to inspect the structure of value dimensions. Fagerholm (2015) has also embodied psychometric approaches in his PhD dissertation by analyzing the validity of the constructs he studied. A more recent example has been offered by Ralph et al. (2020), who analyzed through a questionnaire the effects of the COVID-19 pandemic on developers’ well-being and productivity. The authors constructed their measurement instrument by incorporating psychometrically validated scales on constructs such as perceived productivity, disaster preparedness, fear and resilience, ergonomics, and organizational support. Furthermore, they employed confirmatory factor analysis (which we touch upon in the present paper) to verify that the included items do indeed cluster and converge into the factors they are claimed to converge to.
What we see missing is an introduction to the field of psychometrics for behavioral software engineering researchers. Such an introduction would improve their understanding of the available measurement instruments and, also, the development of new tests, allowing them to explore the human component in the software construction process more accurately.
Our overall objective is to address the lack of understanding and use of psychometrics in behavioral software engineering research.
We also hope to increase software engineering researchers’ awareness of, and respect for, theories and tools developed in established fields of the behavioral sciences, towards stronger methodological foundations of behavioral software engineering research.
With this paper, we contribute to the (behavioral) software engineering body of knowledge with a set of guidelines which enable a better understanding of psychological constructs in research activities. This improvement in research quality is achieved by either (1) reusing psychometrically validated measurement instruments, as well as understanding why and how they are validated, or, if no such questionnaires exist, (2) developing new psychometrically validated questionnaires that are better suited for the software engineering domain.
Our contribution is enabled by offering one theoretical deliverable and one practical companion deliverable.
We offer a review and synthesis of psychometric guidelines in the form of textbooks, review papers, and empirical studies.
We offer a hands-on counterpart to our review by providing a fully reproducible implementation of our guidelines as R Markdown.
The Standards for Educational and Psychological Testing (SEPT, American Educational Research Association et al. (2014)) is a set of gold standards in psychological testing jointly developed by the American Psychological Association (APA), the National Council on Measurement in Education (NCME), and the American Educational Research Association (AERA). The book defines areas and standards that should be met when developing, validating, and administering psychological tests. We adopted SEPT as a framework to guide the construction of the paper, to ensure that the standards are met and that the various other references are framed in the correct context.
Additionally, we organized the scoping of the paper by comparing related work from the fields of psychology research.
While the present paper is not a systematic literature review or a mapping study—the discipline is so broad that entire textbooks have been written on it—we systematically framed its construction to ensure that all important topics were covered.
Several authors, e.g., Crocker and Algina (2006); Singh et al. (2016); Rust (2009), have proposed different phases for the psychometric development and evaluation of measurement instruments. Through our review, we identified 15 phases that we summarize visually in Figure 1 and outline as follows.
Identification of the primary purpose for which the test scores will be employed.
Identification of constructs, traits, and behaviors that are reflected by the purpose of the instrument.
Development of a test specification, delineating the proportion of items that should focus on each type of construct, trait, and behavior of the test.
Construction of an initial pool of items.
Review of the items.
Conduct of a pilot test with the revised items.
Execution of an item analysis to possibly reduce the number of items.
Evaluation of an exploratory factor analysis to possibly reduce and group items into components or factors.
Execution of a field test of the items with a larger, representative sample.
Determination of statistical properties of item scores.
Design and execution of reliability studies.
Design and execution of validity studies.
Evaluation of fairness in testing and test bias.
Development of guidelines for administering, scoring, and interpreting test scores.
We focus mainly on the second half of psychometric activities (those with a dark background in Figure 1), as they are the most challenging and are usually not covered in software engineering research. The first half of the activities, on questionnaire design, are covered by existing literature in software engineering (e.g., Ji et al., 2008; Ciolkowski et al., 2003; Molléri et al., 2016; Kitchenham and Pfleeger, 2008; Wagner et al., 2020) and psychology research (e.g., Collins, 2003; Sutton et al., 2003; Schwarz and Oyserman, 2016; Oppenheim, 1992).
As a final note, the present paper, as well as any psychometric construction of measurement instruments, is not a checklist. A psychometric evaluation does not necessarily include all elements reported in this paper, as many facets of psychometrics are influenced by the research questions, study design, and data at hand. Yet, a proper psychometric evaluation requires a consideration of all elements reported in the present paper.
After a brief introduction to key concepts of psychometrics (section 2), which are required to understand the rest of the paper, we focus on the psychometrics of test construction, namely item review and analysis (section 3), factor analysis (section 4), statistical properties (section 5), reliability (section 6), validity (section 7), and fairness in testing and test bias (section 8). The paper ends with our recommendations for further reading (section 9) and a hands-on running example (section 10) of a psychometric evaluation. We provide R code and generated datasets openly (Graziotin et al., 2020).
2.1. Building blocks
The fundamental idea behind psychological testing is that what is being assessed is not a physical property, such as height or weight. We are attempting to assess a construct, that is, a hypothetical entity (American Educational Research Association et al., 2014; Rust, 2009). If we assess the job satisfaction of a software developer, we are not directly measuring the satisfaction of the individual. Instead, we compare the developer’s score with other developers’ scores or with a set of established norms for job satisfaction. When comparing the satisfaction scores between developers, we are limited to seeing how the scores differentiate between satisfied and dissatisfied developers according to the knowledge and ideas we have about satisfied and dissatisfied individuals.
There are two common models of psychometrics, namely functionalist and trait (Rust, 2009). Functionalist psychometrics often occurs in educational and occupational tests; it holds that the design of a test is determined by its application and not by the constructs being measured (Rust, 2009; Green, 2009). For functionalist design, a good test is one that is able to distinguish between individuals who perform well and individuals who perform less well on a job or in school activities. This is also called local criterion-based validity (explained in section 7). The functionalist paradigm can be applied to most cases where a performance assessment or an evaluation is required.
Trait psychometrics attempts to treat common-sense notions such as human intelligence, personality, and affect scientifically (Rust, 2009). The classic trait approach was based on the notion that intelligence is related to biological individual differences, and trait psychometric tests aimed to measure traits that would represent biological differences among people (Rust, 2009).
Despite the differences between the two schools of thought, they have several aspects in common, including test construction and validation methods. They differ in how validity is seen, and they are linked by the theory of true scores (Hambleton et al., 1991). The theory of true scores, or latent trait theory, is governed by the formula in Equation 1:
$$X = T + E \qquad (1)$$
where $X$ is the observed score, $T$ is the true score, and $E$ is the error. There are three assumptions with the theory of true scores: (1) all errors are random and normally distributed, (2) true scores are uncorrelated with the errors, and (3) different measures of the observed score on the same participants are independent from each other. Besides all issues that come with the three assumptions, the theory has been criticized, with the major point being that there is arguably no such thing as a true score and that all that tests measure are abstractions of psychological constructs (Loevinger, 1957).
Elaborations and re-interpretations of the theory of true scores have been proposed, among which is the statistical true score (Carnap, 1962). The statistical true score defines the true score as the score we would obtain by averaging an infinite number of measures from the same individual. With an infinite number of measures, the random errors cancel each other out, leaving the true score $T$. The statistical formula of the theory of true scores should not be completely new to readers from software engineering, as most quantitative methods that are in use in our field nowadays are based on it. The statistical interpretation of the theory of true scores applies both to trait and functionalist psychometrics. A difference lies in generalization. Functionalist tests can only be specific to a certain context, while trait tests attempt to generalize to an overall construct present in a group of individuals.
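The statistical reading of the true score can be illustrated with a small simulation. The following Python sketch is not part of our companion material; it assumes the additive form "observed score = true score + error" with normally distributed errors, and the true score, error spread, and function name are hypothetical values chosen only for illustration.

```python
import random
import statistics

# Hypothetical values chosen only for illustration.
TRUE_SCORE = 70.0  # the true score T of one individual
ERROR_SD = 5.0     # standard deviation of the random error E


def simulate_observed_scores(true_score, error_sd, n_measures, seed=42):
    """Return n observed scores X = T + E, with E drawn from Normal(0, error_sd)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [true_score + rng.gauss(0.0, error_sd) for _ in range(n_measures)]


few_measures = simulate_observed_scores(TRUE_SCORE, ERROR_SD, 5)
many_measures = simulate_observed_scores(TRUE_SCORE, ERROR_SD, 100_000)

# The mean of a handful of measures can be noticeably off; the mean of many
# measures approaches the true score because the random errors cancel out.
print("mean of 5 measures:      ", round(statistics.mean(few_measures), 2))
print("mean of 100,000 measures:", round(statistics.mean(many_measures), 2))
```

In practice we never have access to the true score; the simulation only shows why averaging repeated, independent measures is a reasonable way to approximate it.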
2.2. Test types and types of testing
Items on psychological tests can be knowledge-based or person-based (Rust, 2009). Knowledge-based tests assess whether an individual performs well regarding the knowledge of certain information, including possessing skills related to a certain knowledge-based construct. Debugging skills would be assessed by a knowledge-based test. Person-based tests, on the other hand, assess typical performance, or how individuals are represented, with respect to a construct. Examples of constructs related to person-based tests include personality, mood, and attitudes. Pair programming personalities would be assessed by a person-based test. Knowledge-based tests are usually uni-dimensional, as they gravitate towards the notion of possessing or not possessing a certain knowledge. We can also easily rank individuals on their scores and state who ranks better. Person-based tests are usually multi-dimensional and do not allow direct ranking of individuals without some assumptions. For example, a developer could score high on extroversion. A high score on extroversion likely implies a low score on introversion, but it does not imply that a developer scoring lower on extroversion is a “worse” developer in any way.
A second distinction is between criterion-referenced and norm-referenced testing (Glaser, 1963). Criterion-referenced tests are constructed with reference to performance on a criterion that is defined as objectively as possible (Glaser, 1963; Berger, 2013a). Continuing with the example on debugging skills, if we design a test that assesses whether a developer is able to open a debugging tool and use its ten basic functionalities, and a score of 10 out of 10 would mean being able to debug software, we are constructing a criterion-referenced test. If our test was constructed instead on knowledge of basic theory of debugging and software quality, and we could administer it to several samples, we would then obtain scores that are comparable to each other. That is, we would start assessing how much a group or population of software developers knows about software debugging. Obtaining group or population means and measures of deviation would allow us to evaluate how a developer performs with respect to the norm. We would construct a norm-referenced test, which allows comparison with the whole population of respondents (Glaser, 1963; Berger, 2013b).
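The two readings of a score can be contrasted in a few lines of Python. This is a minimal sketch under our own assumptions: the cutoff, the norm group scores, and the function names are hypothetical, and the standardized score (z-score) is only one common way to express a position relative to a norm group.

```python
import statistics


def meets_criterion(score, cutoff):
    """Criterion-referenced reading: compare a score against a fixed cutoff."""
    return score >= cutoff


def z_score(score, norm_scores):
    """Norm-referenced reading: distance from the norm group mean in SD units."""
    mean = statistics.mean(norm_scores)
    sd = statistics.stdev(norm_scores)  # sample standard deviation
    return (score - mean) / sd


# Criterion-referenced: ten basic debugger functionalities, all must be mastered.
print(meets_criterion(score=9, cutoff=10))  # False

# Norm-referenced: hypothetical scores of a norm group on a debugging-theory test.
norm_group = [50, 60, 70, 80, 90]
print(round(z_score(80, norm_group), 2))  # 0.63
```

The same raw score can therefore be read in two ways: as a pass or fail against the criterion, or as a position relative to other developers.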
3. Item Review and Item Analysis
When developing a new measurement instrument, we are likely to create more items than are really needed. Item review and item analysis are a series of methods to reduce the number of items of a measurement instrument and keep the best performing ones (Rust, 2009). Item review and item analysis form a two-step process, described in Figure 2. First, the process requires a review by experts; then, a pilot study and statistical calculations. During the first step (item review), experts in the domain of knowledge evaluate items one by one and argue for their presence in the test (Rust, 2009). During the second step (item analysis), the developers of the measurement instrument calculate item facility and item discrimination.
We are not describing item review here because it is a straightforward process that involves familiar methods found in systematic literature reviews and qualitative studies. The experts in the domain of knowledge discuss candidate items and argue in favor or against them, as it happens when discussing inclusion and exclusion of publications in systematic literature reviews (Kitchenham, 2007). Inter-rater reliability measures can be adopted for assessing the degree of agreement among raters. After reaching an agreement on the items to be included, a pilot study is required for an analysis of the items.
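As an illustration of an inter-rater reliability measure, the sketch below computes Cohen's kappa for two experts who each vote to keep or drop candidate items. The choice of Cohen's kappa is our assumption for this example (other agreement measures exist), and the votes and function name are hypothetical.

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same, non-empty set of items")
    n = len(rater_a)
    categories = set(rater_a) | set(rater_b)
    # Proportion of items on which the two raters gave the same rating.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal proportions.
    expected = sum(
        (rater_a.count(c) / n) * (rater_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both raters used a single, identical category
    return (observed - expected) / (1.0 - expected)


# Hypothetical keep/drop votes of two experts on ten candidate items.
expert_1 = ["keep", "keep", "drop", "keep", "drop",
            "keep", "drop", "drop", "keep", "keep"]
expert_2 = ["keep", "keep", "drop", "drop", "drop",
            "keep", "drop", "keep", "keep", "keep"]

kappa = cohens_kappa(expert_1, expert_2)
print(round(kappa, 2))  # 0.58
```

Here the experts agree on 8 of 10 items, but part of that agreement would be expected by chance alone, which is why kappa is lower than the raw agreement of 0.8.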
3.1. Item analysis
Item analysis refers to several statistical methods for the selection of items for a psychological test (Kline, 2015). The two best-known techniques are item facility and item discrimination.
3.1.1. Item facility
Item facility for an item, also known as item difficulty, is defined as the ratio of the number of participants who provided the right answer over the number of all participants in a test (Rust, 2009; Kline, 2015). The value of item facility is 1 if all respondents are right and 0 if all respondents are wrong. In other words, item facility for an item is the probability of obtaining the right answer for the item (Rust, 2009). Of interest for test construction is the variance of an item, which is defined with the formula in Equation 2:
$$\sigma_i^2 = p_i (1 - p_i) \qquad (2)$$
where $i$ is the item and $p_i$ is the item facility for $i$. The highest possible value for $\sigma_i^2$ is 0.25, which occurs when $p_i = 0.5$, that is, when the item is neither very easy nor very difficult. When $\sigma_i^2$ has small values, for example 0.047, it means that most respondents tend to reply the same way for that item, making it either extremely easy or extremely difficult.
The variance of an item is interesting in the context of norm-referenced testing, where item facility also applies, as the purpose of the test is to spread out individuals’ scores as much as possible on a continuum. A larger spread is due to a larger variance, and we are interested in including items that make a contribution to the variance (Rust, 2009; Kline, 2015).
Furthermore, if an item has a high correlation with other items and has a large variance, it follows that the item makes a high contribution to the total variance of a test, and it will be kept in the pool of items.
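Item facility and item variance for right/wrong items can be computed directly from the responses. The sketch below assumes items scored as 1 (right) or 0 (wrong) and uses the item variance p(1 - p), where p is the item facility; the response data and function names are hypothetical.

```python
def item_facility(responses):
    """Proportion of respondents who answered the item correctly (1 = right)."""
    return sum(responses) / len(responses)


def item_variance(facility):
    """Variance of a right/wrong item with facility p, that is p * (1 - p)."""
    return facility * (1.0 - facility)


# Hypothetical responses of 20 participants to two items.
balanced_item = [1, 0] * 10      # half of the respondents are right
very_easy_item = [1] * 19 + [0]  # almost everybody is right

for name, responses in [("balanced", balanced_item), ("very easy", very_easy_item)]:
    p = item_facility(responses)
    print(f"{name}: facility = {p:.2f}, variance = {item_variance(p):.4f}")
```

The balanced item reaches the maximum variance of 0.25, whereas the very easy item (facility 0.95) contributes a variance below 0.05 and spreads respondents out far less.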
Before moving to the next statistic, we would like to offer an explanation of the term facility. Facility suggests that participants in all tests can be either right or wrong on an answer. What about trait measurement, where participants are not exactly right or wrong but are assessed in terms of a psychological construct? We can calculate item facility for these cases as well. The issue lies only in the naming, because item facility was developed for knowledge-based tests first. Some scholars prefer to use the term item endorsement or item location (Revelle, 2009) to better reflect how calculations can be done on traits. For trait measurement, it is common to have Likert items (Likert, 1932). Item facility for Likert items can be calculated as the mean value of the item. If a Likert item maps to the values 0 (strongly disagree) to 5 (strongly agree), the extreme values for the item will be 0 and 5 instead of 0 and 1. An item with an average value of 4.8 and a variance of 0.09 is a candidate for deletion, whereas an item with an average value of 2.72 and a variance of 3.02 is deemed interesting.
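For Likert items, the same inspection reduces to computing the mean and the variance of each item. The following sketch uses hypothetical responses on a 0 to 5 scale and the population variance from Python's standard library; whether to use the population or the sample variance is a choice we make here only for illustration.

```python
import statistics

# Hypothetical responses of ten participants, 0 = strongly disagree, 5 = strongly agree.
near_ceiling_item = [5, 5, 5, 4, 5, 5, 5, 5, 4, 5]
well_spread_item = [0, 1, 5, 3, 2, 4, 5, 0, 3, 4]

for name, responses in [
    ("near ceiling", near_ceiling_item),
    ("well spread", well_spread_item),
]:
    endorsement = statistics.mean(responses)  # item facility for a Likert item
    spread = statistics.pvariance(responses)  # population variance of the item
    print(f"{name}: mean = {endorsement:.2f}, variance = {spread:.2f}")
```

The first item sits close to the upper extreme with little variance, so almost all participants answer it the same way; the second item discriminates better among participants.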
3.1.2. Item discrimination
Item discrimination reflects items that behave oddly, in the sense that individuals who tend to score very high (or very low) on a test as a whole tend to be wrong (right) on the same item (Rust, 2009). Such an item would possess a negative discrimination. Ideal for a test is to have items with positive discrimination (Kline, 2015). From a statistical point of view, if an item is uncorrelated with the overall test score, then it is almost certainly uncorrelated with the other items and makes very little contribution to the overall variance of the test (Rust, 2009; Singh et al., 2016). Therefore, we calculate item discrimination by computing the correlation coefficient between an item score and the overall test score. If the computed correlation coefficient is 0 or below, we should consider removing the item. The same holds for trait measurement. Instead of assessing how well an item behaves with respect to the test score, we assess how well an item is in fact measuring the overall trait in question.
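Item discrimination, read as the correlation between an item score and the overall test score, can be sketched as follows. The response matrix and helper names are hypothetical, and we use the Pearson correlation with the plain total score (including the item itself), which is a simplifying assumption of this example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / math.sqrt(ss_x * ss_y)


def item_discrimination(responses):
    """Correlate each item (column) with the total test score of each participant."""
    totals = [sum(row) for row in responses]
    n_items = len(responses[0])
    return [pearson([row[i] for row in responses], totals) for i in range(n_items)]


# Hypothetical data: rows are participants, columns are items (1 = right, 0 = wrong).
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 1, 0, 1],
    [1, 1, 1, 0],
]

discrimination = item_discrimination(responses)
for index, value in enumerate(discrimination, start=1):
    verdict = "consider removing" if value <= 0 else "keep"
    print(f"item {index}: r = {value:.2f} ({verdict})")
```

In this made-up dataset the fourth item is answered correctly mostly by participants with low total scores, so its discrimination is negative and it is flagged for removal.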
Item analysis, while valuable and still in use today, is part of the so-called classical test theory (CTT), which assumes that an individual’s observed score is the sum of a true score and an error score (Traub, 2005). Modern replacements for CTT have been proposed, and the most prominent one is item response theory (IRT) (Embretson and Reise, 2013). IRT models build upon a function (called item response function, IRF, or item characteristic curve, ICC) that defines the probability of being right or wrong on an item (Alphen et al., 1994). IRT is outside the scope of the present paper, as CTT is still in use to this day (Rust, 2009) and explaining IRT requires a publication of its own.
Item analysis, as presented in this section, assumes that there is a single test score, meaning that a single construct is being measured. Whenever multiple constructs or a construct of multiple factors are being measured, item analysis needs to be accompanied by factor analysis (Singh et al., 2016).
4. Factor Analysis
Factor analysis is one of the most widely employed psychometric tools (Kline, 2015; Rust, 2009; Singh et al., 2016), and it can be applied to any dataset where the number of participants is higher than the number of item scores under observation. Factor analysis is for understanding which test items “go together” to form factors in the data, that is, the constructs that we are aiming to assess (Rust, 2009). At the same time, factor analysis allows us to reduce the dimensionality of the problem space (i.e., reducing factors and/or associated items) and to explain the variance in the observed variables in terms of underlying latent factors (Kootstra, 2006). In case we intend to assess a single construct, factor analysis helps in identifying those items that represent the construct we are interested in, so that we can exclude the other items.
Factor analysis techniques are based on the notion that those constructs that we observe through our measurement instruments can be reduced to fewer latent variables which are unobservable but share a common variance (Yong and Pearce, 2013) (see Section 2). Factor analysis starts with the computed correlation coefficient as its first building block. A way to summarize correlation coefficients is through a correlation matrix, which is a matrix that displays the correlation coefficient between all pairs of items. Given that two items correlate with each other the same, no matter their order, and that one item correlates with itself with a perfect 1.0 correlation coefficient, the tradition is to show only the lower triangle of the matrix with 1.0 as its diagonal. Table 1 provides an example correlation matrix for a set of items.
A high correlation among certain items indicates that these items might belong to the same factor. This approach, however, tells only part of the story. Questions such as “how do our candidate factors explain the total variance of the measurement instrument?”, “to which candidate factor does an item belong better?”, and “how are factors related to each other?” are better answered by further analysis.
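A lower-triangle correlation matrix of the kind described above can be built with a few lines of Python. This is a minimal sketch: the item names and scores are hypothetical and do not reproduce the values of Table 1.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)


def lower_triangle(item_scores):
    """Return item names and the lower triangle of their correlation matrix."""
    names = list(item_scores)
    rows = []
    for i, row_name in enumerate(names):
        row = [pearson(item_scores[row_name], item_scores[names[j]]) for j in range(i)]
        row.append(1.0)  # an item correlates perfectly with itself
        rows.append(row)
    return names, rows


# Hypothetical Likert scores of five participants on four items.
item_scores = {
    "item_1": [1, 2, 3, 4, 5],
    "item_2": [2, 2, 3, 5, 5],
    "item_3": [5, 4, 3, 2, 2],
    "item_4": [4, 5, 3, 1, 2],
}

names, matrix = lower_triangle(item_scores)
for name, row in zip(names, matrix):
    print(name, " ".join(f"{value:6.2f}" for value in row))
```

With these made-up scores, the first two items correlate highly with each other, as do the last two, which hints at two candidate factors; the further analyses discussed in this section are still needed to confirm such a reading.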
There are two main factor analysis techniques, namely Exploratory Factor Analysis (EFA) and Confirmatory Factor Analysis (CFA) (Singh et al., 2016; Revelle, 2009; Kootstra, 2006). EFA attempts to uncover patterns and clusters of items by exploring predictions, while CFA attempts to confirm hypotheses on existing factors (Yong and Pearce, 2013).
4.1. Exploratory Factor Analysis
Exploratory factor analysis (EFA) is a family of factor analysis techniques aimed to reduce the number of items by retaining the items that are most relevant to certain factors (Kootstra, 2006). Strictly speaking, when developing a measurement instrument, after item analysis, it is desirable to observe whether the measurements for the items tend to cluster. These clusters are likely to represent different factors that might or might not pertain to the construct being measured (Rust, 2009; Kline, 2015). EFA provides tools to group and select items from a correlation matrix.
The common factor model underlying EFA expresses each measured item as a weighted combination of factors:
$$X_i = \sum_{j=1}^{m} a_{ij} F_j + U_i + e_i$$
where $F_1, \dots, F_m$ are those factors grouping the items being analyzed, $U_i$ are factors that are unique to each measure, $a_{ij}$ are the loadings of each item on the respective factors, and $e_i$ are random measurement errors. Factor loadings are, in practice, weights that provide us with an idea of how much an item contributes to a factor (Yong and Pearce, 2013).
From the equation we derive that the variance of the constructs being measured is explained by three parts: (1) the common factors, also known as communality of a variable (Yong and Pearce, 2013; Singh et al., 2016; Fabrigar et al., 1999), (2) the influence of factors that are unique to that measure, and (3) random error, or $e_i$.
Estimates for the communality of an item are often referred to as $h^2$. $h^2$ is the calculated proportion of variance that is free of error variance and is thus shared with other variables in a correlation matrix (Yong and Pearce, 2013; Singh et al., 2016; Fabrigar et al., 1999). Several techniques calculate the communality of a variable by summing the squared loadings of each factor associated with that variable.
Estimates for the unique variance, denoted as $u^2$, are the proportion of variance that is not associated with communalities, that is $u^2 = 1 - h^2$ (Yong and Pearce, 2013; Singh et al., 2016; Fabrigar et al., 1999). Determining a value for $u^2$ for an item allows us to find how much specific variance can be attributed to that variable.
Lastly, the random error that is associated with an item is the last component of the total variance. Random error is also often called the unreliability of variance (Yong and Pearce, 2013; Fabrigar et al., 1999).
Unique factors are never correlated with common factors, but common factors may or may not be correlated with each other (Yong and Pearce, 2013).
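The decomposition of variance described above can be illustrated with a small numerical sketch. The loading values below are hypothetical and chosen only for illustration. We assume the usual convention that the communality of an item is the sum of its squared loadings on the common factors and that the remaining variance (unique plus error) is one minus the communality.

```python
# Hypothetical loadings of four items on two common factors (illustrative values only).
loadings = {
    "item1": [0.80, 0.10],
    "item2": [0.75, 0.20],
    "item3": [0.15, 0.70],
    "item4": [0.05, 0.65],
}


def communality(item_loadings):
    """Sum of the squared loadings of one item across all common factors."""
    return sum(w ** 2 for w in item_loadings)


for item, item_loadings in loadings.items():
    h2 = communality(item_loadings)
    u2 = 1.0 - h2  # variance not shared with the common factors
    print(f"{item}: communality={h2:.3f}, uniqueness={u2:.3f}")
```

Items with a low communality share little variance with the candidate factors and are natural candidates for closer inspection.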
EFA encompasses three phases (Rust, 2009; Singh et al., 2016; Revelle, 2009; Kootstra, 2006), described in Figure 3. First, we have to select a fitting procedure to estimate the factor loadings and unique variances of the model. Then, we need to define and extract a number of factors. Finally, we need to rotate the factors to be able to properly interpret the produced factor matrices.
Many statistical programs allow the user either to perform all these phases separately or to perform more than one at the same time. It is not an easy task to assign a methodology to one of the three categories below. The reader is advised that some textbooks avoid our classification of phases and simply revert to a more practical set of questions, e.g., “how to calculate factors” and “how many factors should we retain”, which we attempt to answer anyway.
4.1.1. Factor loading
The most common technique for estimating the loadings and variance is called principal component analysis (PCA) (Pearson, 1901). PCA assumes that the communalities for the measures are equal to 1.0. That is, all the variance for a measure is explained only by the factors that are derived by the methodology, and hence there is no error variance. PCA operates on the correlation matrix, mostly on its eigenvalues, to extract factors from variables that correlate with each other. The eigenvalue of a factor represents the variance of the variables accounted for by that factor. The lower the eigenvalue, the less the factor contributes to explaining the variance in the variables (Norris and Lecavalier, 2010). Factor weights are computed to extract the maximum possible variance, with successive factoring continuing until there is no further meaningful variance left. PCA is not a factor analysis method stricto sensu, as factor analysis assumes the presence of error variance rather than being able to explain all variance. Some advocates prefer to state that its output should be referred to as a series of components rather than factors. While more simplistic than other techniques to estimate factor loadings, performing a PCA is still encouraged as a first step in EFA before performing the actual factor analysis (Rust, 2009).
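As a minimal sketch of how PCA operates on a correlation matrix, the following Python snippet (assuming NumPy is available) extracts eigenvalues and component loadings. The correlation matrix `R` is hypothetical and not taken from our data.

```python
import numpy as np

# Hypothetical correlation matrix for four items (illustrative values only).
R = np.array([
    [1.00, 0.60, 0.10, 0.05],
    [0.60, 1.00, 0.15, 0.10],
    [0.10, 0.15, 1.00, 0.55],
    [0.05, 0.10, 0.55, 1.00],
])

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]  # reorder components by descending eigenvalue
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# The eigenvalues of a correlation matrix sum to the number of items, so each
# eigenvalue divided by that sum is the proportion of total variance explained.
explained = eigenvalues / eigenvalues.sum()

# Component loadings: eigenvectors scaled by the square root of their eigenvalue.
component_loadings = eigenvectors * np.sqrt(eigenvalues)

print(np.round(eigenvalues, 3))
print(np.round(explained, 3))
```

When all components are retained, the loadings reproduce the correlation matrix exactly, which reflects the PCA assumption that communalities equal 1.0.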
Among the proper factor analysis techniques that exist, we are interested in a widely recommended technique for estimating loadings and variance named principal axis factoring (PAF) (Russell, 2016; Widaman, 1993; Kline, 2015). PAF does not operate under the assumption that the communalities are equal to 1.0, so the diagonal of the correlation matrix (e.g., the one in Table 1) is substituted with estimates of the communalities ($h^2$). PAF estimates the communalities using different techniques (e.g., the squared multiple correlation between a measure and the rest of the measures) and a covariance matrix of the items. Factors are estimated one at a time until a large enough amount of variance is accounted for in the correlation matrix. Under PAF, the ordering of the factors determines their importance in terms of fitting, e.g., the first factor accounts for as much variance as possible, followed by the second factor, and so on. Russell (2016) provides a detailed description of the underlying statistical operations of PAF, which we omit for the sake of brevity. Most statistical software provides functions to implement PAF.
4.1.2. Factor extraction
Both PCA and PAF result in values assigned to candidate factors. Therefore, there has to be a strategy to extract meaningful factors. The bad news here is that there is no unique way, let alone a single proper way, to extract factors (Russell, 2016; Singh et al., 2016; Revelle, 2018b). More than one strategy should be adopted at the same time, including a sense-making analysis (Fabrigar et al., 1999). Several factor extraction techniques exist, which are mentioned in the references cited in the present section. We provide here those that are used most widely as well as those that are easier to apply and understand.
Perhaps the simplest strategy to extract factors is Kaiser’s (1960) eigenvalue-greater-than-one (K1) rule. The rule simply states that factors with an eigenvalue higher than 1.0 should be retained. Kaiser’s rule is quite easy to apply but it is highly controversial (Russell, 2016; Revelle, 2018b; Fabrigar et al., 1999; Courtney, 2013). First, the rule was originally designed for PCA and not for PAF or other factor analysis methods, which might make it unsuitable for methodologies that use estimates of communalities as the diagonal of the correlation matrix (Courtney, 2013). Second, the cut-off value of 1.0 arbitrarily discriminates between factors that are just above and just below 1.0 (Courtney, 2013). Third, computer simulations found that K1 tends to overestimate the number of factors (Fabrigar et al., 1999). Yet, K1 is still the default option for some statistical software suites, making it an unfortunate de-facto main method for factor extraction (Courtney, 2013).
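To make the K1 rule concrete, the following is a minimal sketch in Python (assuming NumPy). The function name `kaiser_k1` and the correlation matrix are our own illustrative choices. Given the criticisms above, the result should be read as one input among several rather than as a final answer.

```python
import numpy as np


def kaiser_k1(corr):
    """Number of components whose eigenvalue exceeds 1.0 (K1 rule)."""
    eigenvalues = np.linalg.eigvalsh(corr)  # symmetric input, real eigenvalues
    return int(np.sum(eigenvalues > 1.0))


# Hypothetical correlation matrix for four items (illustrative values only).
R = np.array([
    [1.00, 0.60, 0.10, 0.05],
    [0.60, 1.00, 0.15, 0.10],
    [0.10, 0.15, 1.00, 0.55],
    [0.05, 0.10, 0.55, 1.00],
])

print(kaiser_k1(R))
```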
Cattell’s (1966) scree test, also based on eigenvalues, foresees the plot of the eigenvalues extracted from either the correlation matrix or the reduced correlation matrix (thus making it suitable for both PCA and PAF) against the factor they are associated with, in descending order. One then inspects the curved line for a break in the values (or an elbow) up to when a substantial drop in the eigenvalues cannot be observed anymore. The break is a point at which the shape of the curve becomes horizontal. The strategy is then to keep all factors before the breaking point. The three major criticisms of this approach are that it is subjective (Courtney, 2013), that more than one scree might exist (Tinsley and Tinsley, 1987), and that data often does not offer a discernible scree, so that a conceptual analysis of the candidate factors is always required (Rust, 2009).
Revelle and Rocklin (1979) proposed the Very Simple Structure (VSS) method for factor extraction that is based on assessing how the original correlation matrix can be reproduced by a simplified pattern matrix, for which only the highest loading of each item is retained (everything else set to zero) (Courtney, 2013). The VSS criterion to assess how well the pattern matrix performs is a number from 0 to 1, making it a goodness-of-fit measure almost of a confirmatory nature rather than an exploratory one (Revelle, 2020). The VSS criterion is gathered for solutions involving a number of factors that goes from 1 to a user-specified value. The strategy ends with selecting the number of factors that provides the highest VSS criterion.
Finally, the method of parallel analysis (PA), introduced by Horn (1965), was found to be very robust for factor extraction (Courtney, 2013; Fabrigar et al., 1999). PA starts with the K1 concept that only factors with an eigenvalue larger than 1.0 should be kept. Horn (1965) argued that the K1 rule was developed with population statistics and was thus unsuitable when sampling data. Sampling errors would then cause some components from variables that are uncorrelated in the population to have eigenvalues higher than one in the sample (Courtney, 2013). PA takes into account the proportion of variance that results from sampling rather than from having access to the population. The way it achieves this is a constant comparison of the solution with randomly generated data (Revelle, 2020). PA generates a large number of matrices from random data in parallel with the real data. Factors are retained as long as their eigenvalues are greater than the mean eigenvalue generated from the random matrices.
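The comparison with random data can be sketched as follows (Python with NumPy). This is a simplified, PCA-based variant that uses the eigenvalues of the full correlation matrix and normally distributed random data. The function name, the number of iterations, and the simulated two-factor data set are assumptions made for illustration only; established implementations offer more options.

```python
import numpy as np


def sorted_eigenvalues(data):
    """Eigenvalues of the item correlation matrix, in descending order."""
    corr = np.corrcoef(data, rowvar=False)
    return np.sort(np.linalg.eigvalsh(corr))[::-1]


def parallel_analysis(data, n_iterations=100, seed=0):
    """Count the components whose observed eigenvalue exceeds the mean
    eigenvalue of random normal data sets with the same shape."""
    rng = np.random.default_rng(seed)
    observed = sorted_eigenvalues(data)
    random_eigs = np.array([
        sorted_eigenvalues(rng.standard_normal(data.shape))
        for _ in range(n_iterations)
    ])
    threshold = random_eigs.mean(axis=0)

    # Retain components in order until one fails to beat its random counterpart.
    n_retain = 0
    for obs, thr in zip(observed, threshold):
        if obs <= thr:
            break
        n_retain += 1
    return n_retain, observed, threshold


# Simulated responses: 300 participants, six items driven by two latent factors.
rng = np.random.default_rng(42)
factors = rng.standard_normal((300, 2))
noise = rng.standard_normal((300, 6))
data = np.hstack([
    0.8 * factors[:, [0]] + 0.6 * noise[:, :3],  # items 1-3 load on factor 1
    0.8 * factors[:, [1]] + 0.6 * noise[:, 3:],  # items 4-6 load on factor 2
])

n_retain, observed, threshold = parallel_analysis(data)
print(n_retain)
```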
4.1.3. Factor Rotation
The last step is to rotate the factors in the dimensional space for improving our interpretation of the results (Rust, 2009).
An unrotated output, that is the one that often results after factor extraction, maximizes the variance accounted for by the first factor, followed by the second factor, the third factor, and so on. That is, most items would load on the first factors and many of them would load on more than one factor in a substantial way.
Rotating factors builds on the concept that there are a number of “factor solutions” that are mathematically equivalent to the solution found after factor extraction. By performing a rotation of the factors, we retain our solution but allow an easier interpretation. We rotate the factors to seek a so-called simple structure, that is, a loadings pattern such that each item loads strongly on one factor only and weakly on the other factors. If the reader is interested in the mathematical foundations of factor rotation, two deep overviews are offered by Darton (1980) and Browne (2001).
There are two families of rotations, namely orthogonal and oblique (Russell, 2016). Orthogonal rotations force the assumption of independence between the factors, whereas oblique rotations allow the factors to correlate with each other. Which methodology to use is influenced by the statistical software; for example, the R psych package (Revelle, 2019) provides “varimax”, “quartimax”, “bentlerT”, “equamax”, “varimin”, “geominT”, and “bifactor” for orthogonal rotations and “promax”, “oblimin”, “simplimax”, “bentlerQ”, “geominQ”, “biquartimin”, and “cluster” for oblique rotations. Several rotation methodologies are summarized by Browne (2001); Abdi (2003); Russell (2016).
Perhaps the best-known and most employed (Darton, 1980) orthogonal rotation method is the Varimax rotation (Kaiser, 1958). Varimax maximizes the variance (hence the name) of the squared loadings of a factor on all variables. Each factor will tend to have either large or small loadings on any particular variable. While this solution makes it rather easy to identify factors and their loadings on items, the independence condition of orthogonal rotation techniques is hard to achieve. The assumption of independence of factors, especially in the context of behavioral research, belittles the value of orthogonal rotation techniques, to the point that “we see little justification for using orthogonal rotation as a general approach to achieving solutions with simple structure” (Fabrigar et al., 1999, p. 283).
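For readers who prefer to see the mechanics, the following is a compact sketch of a Varimax rotation in Python (assuming NumPy), using a common SVD-based iterative scheme. The function name, the convergence settings, and the loading matrix are illustrative assumptions; this is not the implementation of any specific statistical package.

```python
import numpy as np


def varimax(loadings, max_iter=100, tol=1e-6):
    """Orthogonally rotate a loading matrix toward simple structure.
    Returns the rotated loadings and the rotation matrix."""
    n_items, n_factors = loadings.shape
    rotation = np.eye(n_factors)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        column_ss = np.diag(np.sum(rotated ** 2, axis=0))
        gradient = loadings.T @ (rotated ** 3 - rotated @ column_ss / n_items)
        u, s, vt = np.linalg.svd(gradient)
        rotation = u @ vt  # closest orthogonal matrix to the gradient
        new_criterion = s.sum()
        if criterion != 0.0 and new_criterion < criterion * (1 + tol):
            break
        criterion = new_criterion
    return loadings @ rotation, rotation


# Hypothetical unrotated loadings: a simple two-factor structure turned by 30 degrees.
angle = np.pi / 6
turn = np.array([[np.cos(angle), -np.sin(angle)],
                 [np.sin(angle), np.cos(angle)]])
simple = np.array([[0.8, 0.0], [0.7, 0.0], [0.0, 0.8], [0.0, 0.7]])
unrotated = simple @ turn

rotated, rotation = varimax(unrotated)
print(np.round(rotated, 2))
```

Because the rotation matrix is orthogonal, the communality of each item is unchanged; only the distribution of the loadings across factors differs.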
Oblique rotation is preferred for behavioral software engineering studies, because it is sensible to assume that behavioral, cognitive, and affective factors are separated by soft walls of independence (e.g., motivation and job satisfaction) (Rust, 2009; Fabrigar et al., 1999; Russell, 2016). If anything, one would first have to conduct an investigation using oblique rotation, observe whether the solution shows little to no correlation between factors and, in that case, switch to orthogonal rotation (Fabrigar et al., 1999). The two most employed and recommended oblique rotation techniques are Direct Oblimin (and its slight variation Direct Quartimin) and Promax, both of which perform well (Fabrigar et al., 1999).
Fabrigar et al. (1999) and Russell (2016) recommended using a Promax rotation because it provides the best of both approaches. A Promax rotation first performs an orthogonal rotation (a Varimax rotation) to maximize the variance of the loadings of a factor on all the variables (Russell, 2016). Then, Promax relaxes the constraint that the factors are independent of each other, turning the rotation into an oblique one. The advantage of this technique is that it will reveal whether factors really are uncorrelated with each other (Russell, 2016).
4.1.4. Further recommendations
There have been several recommendations regarding the required sample size, number of measures per factor, number of factors to retain, and interpretation of loadings (Russell, 2016; Singh et al., 2016; Yong and Pearce, 2013).
The recommended overall sample size as reported by Yong and Pearce (2013) is at least 300 participants, with at least 5 to 10 observations for each variable that is subject to factor analysis. This recommendation has, however, little empirical validation. As reported by Russell (2016), a Monte Carlo study by MacCallum et al. (1999) analyzed how different sample sizes and communalities of the variables were able to reproduce the population factor loadings. They found that with item communalities higher than or equal to 0.60, results were very consistent with sample sizes as low as 60 cases. Communality levels around 0.50 required samples of 100 to 200 cases. In his review, Russell (2016) also found that 39% of EFA studies involved samples of 100 or fewer cases.
On the number of measures (items) per factor, Yong and Pearce (2013) report that for something to be labeled as a factor it should have at least 3 variables, especially in cases when factors receive a rotation treatment, where only variables that are highly correlated with each other (coefficient higher than 0.70) and mostly uncorrelated with other items would be worthy of consideration. Generally speaking, the correlation coefficient for an item to belong to a factor should be 0.30 or higher (Tabachnick et al., 2007). Russell (2016) identifies that prior work has requested at least three items per factor; however, four or more items per factor was found to be a better heuristic to ensure an adequate identification of the factors. In his review he identified 25% of studies with three or fewer measures per factor.
We reported on the number of factors to retain in the Factor Extraction subsection, so we will not repeat ourselves here. There is no recommended number, and one would follow one (possibly more than one) extraction method to identify the best number of factors according to the case. Tabachnick et al. (2007) add that cases with missing values should be deleted to prevent overestimation of factors. Russell (2016) wrote something that is worth mentioning for uninitiated behavioral software engineering researchers, that is, even when constructing a new measurement instrument there is already an expectation of possible factors in the mind of the researcher. The reason is that items are developed following an investigation of prior work and/or empirical data (see section 3). That number is a good starting point when conducting EFA.
Yong and Pearce (2013) offer further explanations on the interpretation of loadings as produced by statistical software. There should be few item crossloadings (i.e., split loadings, when an item loads at 0.32 or higher on two or more factors) so that each factor defines a distinct cluster of interrelated variables. There are exceptions to this that require an analysis of the items. Sometimes it is useful to retain an item that crossloads, with the assumption that it is the latent nature of the variable. Furthermore, Tabachnick et al. (2007) report that, with an alpha level of 0.01, a rotated factor loading with a meaningful sample size would need to have a value of at least 0.32, as this would correspond to approximately 10% of overlapping variance.
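The crossloading check described above is easy to automate. The sketch below, in plain Python, flags items that load at 0.32 or higher (in absolute value) on two or more factors. The loading values and the function name are hypothetical.

```python
CROSSLOADING_THRESHOLD = 0.32  # threshold reported in the text


def crossloading_items(loadings, threshold=CROSSLOADING_THRESHOLD):
    """Return the items whose absolute loading reaches the threshold
    on two or more factors."""
    flagged = []
    for item, item_loadings in loadings.items():
        salient = [w for w in item_loadings if abs(w) >= threshold]
        if len(salient) >= 2:
            flagged.append(item)
    return flagged


# Hypothetical rotated loadings of four items on two factors.
rotated_loadings = {
    "item1": [0.71, 0.08],
    "item2": [0.45, 0.40],   # loads on both factors
    "item3": [0.12, 0.66],
    "item4": [-0.35, 0.52],  # negative loadings count as well
}

print(crossloading_items(rotated_loadings))
# A loading of 0.32 corresponds to 0.32 ** 2, roughly 10% of overlapping variance.
print(round(0.32 ** 2, 4))
```

Flagged items are not removed automatically; as noted above, a crossloading item may be retained after an analysis of its content.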
4.2. Confirmatory Factor Analysis
Contrary to exploratory factor analysis, confirmatory factor analysis (CFA) is for confirming a-priori hypotheses and assessing how well a hypothesized factor structure fits the obtained data (Russell, 2016). A hypothesized factor structure could be derived from existing literature as well as from data of a previous study that explored the factor structure.
Once the data is obtained to compare to the hypothesized factor structure, a goodness-of-fit test should be conducted. CFA requires statistical modeling that is outside the scope of this paper, and the estimation of the goodness-of-fit in CFA is a long-lasting debate as “there are perhaps as many fit statistics as there are psychometricians” (Revelle, 2018a, p. 31). Russell (2016); Singh et al. (2016); Rust (2009) provide several techniques for estimating the goodness-of-fit in CFA, e.g., Chi-squared test, root mean square residual (RMSR), root mean square error of approximation (RMSEA), and the standardized RMSR. Statistical software implements these techniques, including the R psych package (Revelle, 2020). A widely employed technique for CFA is to be found in structural equation modeling (SEM), which is a family of models to fit networks of constructs (Kaplan, 2008). MacCallum and Austin (2000) provided a comprehensive review of SEM techniques in psychological science, including their applications and pitfalls.
Conducting both EFA and CFA is very expensive. When designing and validating a measurement instrument, and when obtaining a large enough sample of participants, it is common to split the sample for conducting EFA on a part of it and CFA on the remaining part (Singh et al., 2016). Most authors, however, prefer conducting EFA only (Russell, 2016) and rely on future independent studies towards a better psychometric evaluation of a tool. This is also why statistical tools, e.g., the R psych package (Revelle, 2019), provide estimates of fit for EFA as well as convenience tools to adapt the data to CFA packages, e.g., R sem (Fox, 2017).
We refer the reader to a prior work of ours in the behavioral software engineering domain (Lenberg et al., 2017b) where we conducted a CFA and describe its application.
5. Item statistical properties
Assessing characteristics and performance of individuals poses several challenges when interpreting the resulting scores. One of them is that a raw score is not meaningful without understanding the test standardization characteristics. For example, a score of 38 on a debugging performance test is meaningless without knowing that 38 means being able to open a debugger only. Furthermore, the interpretation of the results varies wildly when knowing that, on average, developers score 400 on the test itself. The former issue is related to criterion-referenced standardization, the latter to norm-referenced standardization (Rust, 2009; Kline, 2015).
Criterion-referenced tests assess what an individual with any score is expected to be able to do or know. Norm-referenced standardization enables us to compare an individual’s score to the ordered ranking of a population (also see section 2). We concentrate on norm-referenced standardization as criterion-referenced standardization is unique to the criteria of a test.
A first step to norm-reference a test is to order the results of all participants and rank an individual’s score. Measures such as the median and percentiles are useful for ranking and comparing scores. When we can treat our data as interval scales and have it approximately following a normal distribution, we can also use the mean and the standard deviation. The standard deviation is useful for telling us how much an individual’s score is above or below the mean score. Instead of reporting that an individual’s score is, e.g., 13 above the mean score, it is more interesting to know that the score is 1.7 standard deviations above the mean score. Hence, we norm-standardize scores using different approaches. The remainder of this section is modelled after Rust’s (2009) text.
5.1. Standardization to Z scores
Whenever a sample approximates a normal distribution, we know that a score above average is in the upper 50%. By following the three sigma empirical rule (Pukelsheim, 1994), we also know that about 68% and 95% of the scores lie within one and two standard deviations from the mean, respectively, so that a score more than one or two standard deviations above the mean is in roughly the top 16% and 2.5%, respectively. For expressing an individual’s score in terms of how distant it is from the mean score, we transform the value to its Z score (also called standard score) using the formula in Equation 4:

$$Z = \frac{x - \bar{x}}{s} \tag{4}$$

where $x$ is a participant’s score, $\bar{x}$ is the mean of all participants’ scores, and $s$ is the standard deviation for all participants’ scores.
The ideal case would be to use the population mean and standard deviation. In software engineering research we lack studies estimating population characteristics (an example of norm studies was provided by Graziotin et al. (2017)), so we should either aggregate the results of some studies or gather multiple samples.
An important note is that transforming scores into Z scores does not make the scores normally distributed. This would require a normalization procedure, explained below.
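As a brief illustration, Z scores can be computed with the Python standard library alone. The raw scores below are invented, and we assume the sample standard deviation (with an n - 1 denominator) since population values are rarely available.

```python
from statistics import mean, stdev


def z_scores(scores):
    """Express each raw score as its distance from the mean in standard deviations."""
    m = mean(scores)
    s = stdev(scores)  # sample standard deviation (n - 1 denominator); an assumption
    return [(x - m) / s for x in scores]


# Hypothetical raw scores of five developers on a debugging test.
raw = [380, 400, 420, 450, 350]
for score, z in zip(raw, z_scores(raw)):
    print(f"raw={score}, z={z:+.2f}")
```

The transformation only shifts and rescales the scores, which is why it leaves the shape of their distribution unchanged.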
5.2. Standardization to T scores
Z scores typically range between -3.00 and +3.00. The range is not always suitable for its application. A software developer could, for example, object to a Z score of -0.89.
A T score, not to be confused with t-statistics of the Student’s t-test, is a standard Z score that is scaled and shifted so that it has a mean of 50 and a standard deviation of 10. T scores typically range between 20 and 80.
For transforming a Z score into a T score, we use the formula in Equation 5:

$$T = 10 Z + 50 \tag{5}$$

The software developer in the previous example would have a T score of 41.1.
5.3. Standardization to stanine and sten scores
Stanine and sten scores respond to the need of transforming a score to a scale from 1 to 9 (stanine) or 10 (sten) with a mean of 5 (stanine) or 5.5 (sten) and a standard deviation of 2. These scores purposely lose precision by keeping only integer values.
| Stanine score | Z score range (stanine) | Sten score | Z score range (sten) |
|---|---|---|---|
| 1 | < -1.75 | 1 | < -2.00 |
| 2 | -1.75 to -1.25 | 2 | -2.00 to -1.50 |
| 3 | -1.25 to -0.75 | 3 | -1.50 to -1.00 |
| 4 | -0.75 to -0.25 | 4 | -1.00 to -0.50 |
| 5 | -0.25 to +0.25 | 5 | -0.50 to 0.00 |
| 6 | +0.25 to +0.75 | 6 | 0.00 to +0.50 |
| 7 | +0.75 to +1.25 | 7 | +0.50 to +1.00 |
| 8 | +1.25 to +1.75 | 8 | +1.00 to +1.50 |
| 9 | > +1.75 | 9 | +1.50 to +2.00 |
| | | 10 | > +2.00 |
The conversion to stanine and sten scores follows the rules in Table 2.
The advantage of stanine and sten scores lies in their imprecision. If our non-performing developer with a Z score of -0.89 were compared with two other developers having scores of -0.72 and -0.94, how meaningful would such a tiny difference in scores be? Their stanine scores are 3, 4, and 3, respectively. Their sten scores would all be 4. Stanine and sten scores provide clear cut-off points for easier comparisons.
There is an important difference between stanine and sten scores, besides their range. A stanine score of 5 represents an average score in a sample. A sten score of 5.5, which would be its average value, does not belong to its possible values. One can only claim that a sten of 5 is in the low average while a sten of 6 is in the high average.
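The conversion rules can be sketched in a few lines of plain Python. The closed-form expressions below are our own shorthand for half-standard-deviation bands (stanine bands centered on Z = 0, sten bands split at Z = 0), and treating the lower bound of each band as inclusive is an assumption.

```python
import math


def stanine(z):
    """Map a Z score to a stanine (1 to 9): bands of width 0.5 centered on Z = 0."""
    return max(1, min(9, math.floor(2 * z + 5.5)))


def sten(z):
    """Map a Z score to a sten (1 to 10): bands of width 0.5 split at Z = 0."""
    return max(1, min(10, math.floor(2 * z) + 6))


# The three developers discussed in the text.
for z in (-0.89, -0.72, -0.94):
    print(f"z={z:+.2f}: stanine={stanine(z)}, sten={sten(z)}")
```

Clamping with `max` and `min` keeps extreme Z scores inside the scale, mirroring the open-ended first and last bands.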
5.4. Normalization
The standardization techniques that we presented in the previous sections carry the assumption that the sample and population approximate the normal distribution. For all other cases, it is possible to normalize the data. Examples include algebraic transformations, e.g., square-root or log transformations, as well as graphical transformations.
It is beyond the scope of the present paper to explain introductory statistical transformations.
6. Reliability
Reliability can be seen either in terms of precision, that is, the consistency of test scores across replications of the testing procedure (reliability/precision), or as a coefficient, that is, the correlation between scores on two equivalent forms of the test (reliability/coefficients) (American Educational Research Association et al., 2014). For evaluating the precision of a measurement instrument, the ideal would be to have as many independent replications of the testing procedure as possible on the same very large sample. Scores are expected to generalize over alternative test forms, contexts, raters, and over time. The reliability/precision of a measurement instrument is then assessed through the range of differences of the obtained scores. The reliability/precision of an instrument should be assessed with as many sub-groups of a population as possible.
The reliability/coefficients of a measurement instrument, which we will simply call reliability from this point on, is the most common way to refer to the reliability of a test (American Educational Research Association et al., 2014). There are three categories of reliability coefficients, namely alternate-form (derived by administering alternative forms of test), test-retest (same test on different times), and internal-consistency (relationship among scores derived from individual test items during a single session).
Several factors, as defined by American Educational Research Association et al. (2014), affect the reliability of a measurement instrument, especially adding or removing items, changing wording or intervals of items, causing variations in the constructs to be measured (e.g., using a measurement instrument for happiness to assess job satisfaction of developers), and ignoring variations in the intended populations for which the test was originally developed.
We now introduce the most widely employed techniques for establishing the reliability of a test.
6.1. Test-retest reliability
Test-retest reliability, also known as test stability, is assessed when administering the measurement instrument twice to the same sample within a short interval of time. The paired set of scores for each participant is then compared with a correlation coefficient such as the Pearson product-moment correlation coefficient or Spearman’s rank-order correlation. A correlation coefficient of 1.00, while rare, would indicate a perfect test-retest reliability, whereas a correlation coefficient of 0.00 would indicate no test-retest reliability at all. A negative coefficient is not good news either, and it is automatically considered as a value of 0.00.
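A minimal sketch of this computation in plain Python follows. The scores of the two sessions are invented, the function names are ours, and the Pearson coefficient is implemented by hand only to keep the example self-contained.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation of two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


def retest_reliability(first_session, second_session):
    """Correlation between two sessions; negative values are treated as 0.00."""
    return max(0.0, pearson_r(first_session, second_session))


# Hypothetical scores of six participants in two sessions, two weeks apart.
session_1 = [12, 15, 9, 20, 17, 11]
session_2 = [13, 14, 10, 19, 18, 10]
print(round(retest_reliability(session_1, session_2), 2))
```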
6.2. Parallel forms reliability
Test-retest reliability is not suitable for certain tests, such as those assessing knowledge or performance in general. Participants either face a learning or motivation effect from the first test session or simply improve (or worsen) their skills between sessions. For such cases, the parallel forms method is more suitable. The technique requires a systematic development of two versions of the same measurement instrument, namely two parallel tests, that assess the same construct but use different wording or items. Parallel tests for assessing debugging skills would feature the same sections and number of items, e.g., arithmetic, logic, and syntax errors. The two tests would need different source code snippets that are, however, very similar. A trivial example would be to test for unwanted assignments inside conditions in different places and with different syntax (e.g., using if (n = foo()) in version one and if (x = y + 2) in version two). As with test-retest reliability, each participant faces both tests and a correlation coefficient can be computed.
6.3. Split-half reliability
Split-half reliability is a widely adopted and more convenient alternative to parallel forms reliability. Under this technique, a measurement instrument is split into two half-size versions. The split should be as random as possible, e.g., splitting by taking odd and even numbered items. Participants face both halves of the test and, again, a correlation coefficient can be computed. The obtained coefficient, however, is not a measure of reliability yet. The reliability of the whole measurement instrument is computed with the Spearman-Brown formula in Equation 6:

$$r_{\text{full}} = \frac{2\, r_{\text{half}}}{1 + r_{\text{half}}} \tag{6}$$

where $r_{\text{half}}$ is the correlation of the split tests. This formula shows that the more discriminating items a test has, the higher will be its reliability.
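The procedure can be sketched in plain Python as follows. We assume the standard Spearman-Brown correction for a test of doubled length, an odd/even split of the items, and invented item scores; the function names are ours.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation of two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


def spearman_brown(r_half):
    """Reliability of the full test, given the correlation between its halves."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(responses):
    """responses: one list of item scores per participant."""
    odd_totals = [sum(items[0::2]) for items in responses]   # items 1, 3, 5, ...
    even_totals = [sum(items[1::2]) for items in responses]  # items 2, 4, 6, ...
    return spearman_brown(pearson_r(odd_totals, even_totals))


# Hypothetical answers of five participants to a six-item test (1 to 5 scale).
responses = [
    [4, 5, 4, 4, 5, 5],
    [2, 1, 2, 2, 1, 2],
    [3, 3, 4, 3, 3, 3],
    [5, 4, 5, 5, 4, 4],
    [1, 2, 1, 1, 2, 2],
]
print(round(split_half_reliability(responses), 2))
```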
6.4. Inter-rater reliability
Inter-rater reliability is perhaps the most common reliability that is found in software engineering studies. Qualitative studies or systematic literature reviews and mapping studies often have different raters for evaluating the same items. The sets of ratings can be assessed using a correlation coefficient. Cohen’s kappa is widely used in the literature as the inter-rater coefficient of two entities, together with Fleiss’ kappa as the inter-rater coefficient of more than two entities.
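As an illustration for the two-rater case, Cohen’s kappa compares the observed agreement with the agreement expected by chance. The plain Python sketch below uses invented inclusion decisions from a hypothetical literature review; the function name is ours.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning one category to each item."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = set(rater_a) | set(rater_b)
    # Chance agreement: product of the raters' marginal proportions per category.
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / n ** 2
    return (observed - expected) / (1 - expected)


# Hypothetical include/exclude decisions of two reviewers on eight papers.
reviewer_1 = ["in", "in", "out", "out", "in", "out", "in", "out"]
reviewer_2 = ["in", "out", "out", "out", "in", "out", "in", "in"]
print(round(cohens_kappa(reviewer_1, reviewer_2), 2))
```

Note that the coefficient is undefined when the chance agreement equals 1, e.g., when both raters use a single category throughout.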
6.5. Standard error of measurement
The standard error of measurement (SEM) of a test is computed as

$$SEM = \sqrt{\sigma^2 (1 - r)}$$

where $\sigma^2$ is the variance of the test scores and $r$ is the reliability coefficient of the test. The standard error of measurement also provides an idea of how errors are distributed around observed scores. If the assumption that errors are distributed normally is met, one can calculate the 95% confidence interval by using the z curve value of 1.96 to construct the interval $x \pm 1.96 \cdot SEM$ around an observed score $x$. Confidence intervals could also be used to compare participants’ scores. Should one participant’s score fall below or above the interval, their result would differ significantly from the norm.
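A short numerical sketch in plain Python follows. We assume the standard definition of the standard error of measurement, that is, the standard deviation of the test scores multiplied by the square root of one minus the reliability coefficient. The variance, reliability, and observed score are invented.

```python
import math


def standard_error_of_measurement(variance, reliability):
    """Square root of the score variance times (1 - reliability)."""
    return math.sqrt(variance * (1.0 - reliability))


def confidence_interval(observed_score, variance, reliability, z=1.96):
    """Interval around an observed score; z = 1.96 gives a 95% interval
    under the assumption of normally distributed errors."""
    sem = standard_error_of_measurement(variance, reliability)
    return observed_score - z * sem, observed_score + z * sem


# Hypothetical test: score variance of 100 and reliability of 0.91.
sem = standard_error_of_measurement(100.0, 0.91)
low, high = confidence_interval(50.0, 100.0, 0.91)
print(f"SEM={sem:.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

The more reliable the test, the narrower the interval around each observed score.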
7. Validity
Validity in psychometrics is defined as “The degree to which evidence and theory support the interpretation of test scores for proposed uses of tests.” (American Educational Research Association et al., 2014). Psychometric validity is therefore a different (but related) concept than the one of study validity that software engineers are used to dealing with (Wohlin et al., 2012; Feldt and Magazinius, 2010; Siegmund et al., 2015; Petersen and Gencel, 2013). Validation in psychometric research is related to the interpretation of the test scores. For validating a test, we should gather relevant evidence for providing a sound scientific basis for the interpretation of the proposed scores.
7.1. Face validity
Face validity concerns how the items of a measurement instrument are accepted by respondents. For example, software developers expect the wording of certain items to be targeted at them instead of, say, children. Similarly, if a test presents itself to be about a certain construct, such as debugging expertise, it could cause face validity issues if it contained a personality assessment.
7.2. Content validity
Content validity (sometimes called criterion validity or domain-referenced validity) concerns the extent to which a measurement instrument reflects the purpose for which the instrument is being developed. If a test was developed under the specifications of job satisfaction, but measured developers’ motivation instead, it would present issues of content validity. Content validity is evaluated qualitatively (Rust, 2009) most of the time because the form of deviation matters more than the degree of deviation.
7.3. Predictive validity
Predictive validity is a statistical validity defined as the correlation between the score of a measurement instrument and a score of the degree of success in the selected field. For example, the degree of success of debugging performance capability is expected to be higher with a higher programming experience. Computing a score for predictive validity is as simple as calculating a correlation value (such as Pearson or Spearman). According to the acceptance criterion for predictive validity, a score higher than 0.5 could be considered an acceptable predictive validity for the items. We would then feel justified in including programming experience as an item to represent the construct of debugging performance capability.
7.4. Concurrent validity
Concurrent validity is a statistical validity that is defined as the correlation of a new measurement instrument and existing measurement instruments for the same construct. A measurement instrument tailored to the personality of software developers ought to correlate with existing personality measurement instruments. While concurrent validity is a common measure for test validity in psychology, it is a weak criterion as the old measurement instrument itself could have a low validity. Nevertheless, concurrent validity is important for detecting low validity issues in measurement instruments.
7.5. Construct validity
Construct validity is a major validity criterion in psychometric tests. As constructs are not directly measurable, we observe the relationship between the test and the phenomena that the test attempts to represent. For example, a test that identifies highly communicative team members should have a high correlation with…observations of highly communicative people who have been labelled as such. The nature of construct validity is that it is cumulative over the number of available studies (Rust, 2009).
7.6. Differential validity
Differential validity assesses how a measurement instrument should correlate with measures from which it should not differ, and how it should not correlate with measures from which it should differ. In particular, Campbell and Fiske (1959) have differentiated between two aspects of differential validity, namely convergent and discriminant validity. Rust (2009) mentions a straightforward example of both. A test of mathematics reasoning should correlate positively with a test of numerical reasoning (convergent validity). However, the mathematics test should not strongly correlate positively with a test of reading comprehension, because the two constructs are supposed to be different (discriminant validity). In case of a low discriminant validity, there should be an investigation of whether the correlation is a result of a wider underlying trait, say, continues Rust, general intelligence. Differential validity is overall empirically demonstrated by a discrepancy between convergent validity and discriminant validity.
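The example by Rust (2009) can be sketched in a few lines of Python. All scores below are invented for illustration, and the variable names are our own assumptions; the point is only that the convergent correlation should be clearly higher than the discriminant one.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical scores of eight participants on three tests.
math_reasoning = [10, 12, 15, 18, 20, 22, 25, 27]
numerical_reasoning = [11, 14, 14, 19, 21, 21, 26, 28]
reading_comprehension = [20, 14, 23, 16, 22, 15, 21, 17]

# Convergent validity: related constructs should correlate strongly.
convergent = pearson(math_reasoning, numerical_reasoning)
# Discriminant validity: unrelated constructs should correlate weakly.
discriminant = pearson(math_reasoning, reading_comprehension)
# The discrepancy between the two is the empirical evidence of interest.
gap = convergent - discriminant

print(f"convergent = {convergent:.2f}, discriminant = {discriminant:.2f}, gap = {gap:.2f}")
```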
8. Fairness in testing and test bias
Fairness is “the quality of treating people equally or in a way that is right or reasonable” (Cambridge, 2018, online). A test is fair when it reflects the same constructs for all participants, and its scores have the same meaning for all individuals of the population (American Educational Research Association et al., 2014). A fair test neither advantages nor disadvantages any participant through characteristics that are not related to the constructs under observation. From a participant's point of view, an unfair test leads to a wrong decision based on the test results. An example of a test that requires fairness is an attitude or skills assessment when interviewing candidates for hire in an information technology company.
American Educational Research Association et al. (2014) report on several facets of fairness. Individuals should have the opportunity to maximize how they perform with respect to the constructs being assessed. Similarly, for a measurement instrument that assesses traits of participants, the test should maximize how it assesses that the constructs being measured are present among individuals. This fairness comes from how the test is administered, which should be as standardized as possible. Research articles should describe the environment for the experimental settings, how the participants were instructed, which time limits were given, and so on. Fairness also comes, on the other hand, from participants themselves. Participants should be able to access the constructs being measured without being advantaged or disadvantaged by individual characteristics. This is an issue of accessibility to a test and is also part of limiting item, test, and measurement bias.
We provide an overview of bias in psychometric theory in Figure 6.
Rust (2009) provides an overview of item, test, and measurement bias. It almost feels unnecessary to state that a measurement instrument should be free from bias from age, sex, gender, and race. These cases are indeed covered by legislation to ensure fairness. In general, there are three forms of bias in tests, namely item bias, intrinsic test bias, and extrinsic test bias (Rust, 2009).
8.1. Item bias
Item bias, also known as differential item functioning, refers to bias born out of individual items of the measurement instrument. A straightforward example would be to test a European developer about coding snippets dealing with imperial system units. A more common item bias is about the wording of items. Even among native speakers, the use of idioms such as double negatives can cause confusion. Asking a developer to mark a coding snippet that is free from logic and syntax errors is clearer than asking to mark code that does not possess neither logic nor syntax errors.
A systematic identification of item bias that goes beyond carefully checking an instrument is to carry out an item analysis with all possible groups of potential participants, for example men and women, or English speakers with different levels of proficiency. A comparison of the facility values (the proportion of correct answers) of each item can reveal potential item bias. For instruments that assess traits and characteristics of a group instead of function or skills, a strategy is to follow a checklist of questions that researchers and pilot participants can answer (Hambleton and Rodgers, 1995).
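A comparison of facility values can be sketched as follows in Python. The response matrices, the group labels, and the 0.4 screening threshold are hypothetical choices made for illustration only; a large gap merely flags an item for closer inspection, as it could also reflect a genuine difference between the groups.

```python
def facility_values(responses):
    """Proportion of correct answers per item.

    `responses` holds one row per respondent and one column per item,
    with 1 for a correct answer and 0 for a wrong one.
    """
    n_respondents = len(responses)
    n_items = len(responses[0])
    return [sum(row[i] for row in responses) / n_respondents for i in range(n_items)]


# Hypothetical answers of two groups of five respondents to four items.
native_speakers = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [1, 1, 1, 1],
    [0, 1, 1, 1],
]
non_native_speakers = [
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [1, 1, 0, 1],
    [0, 1, 1, 1],
    [1, 1, 0, 1],
]

fv_native = facility_values(native_speakers)
fv_non_native = facility_values(non_native_speakers)

# Difference in facility per item between the two groups.
gaps = [round(a - b, 2) for a, b in zip(fv_native, fv_non_native)]

# Arbitrary screening threshold, chosen only for this illustration.
THRESHOLD = 0.4
flagged_items = [i for i, gap in enumerate(gaps) if abs(gap) >= THRESHOLD]

print("facility (native):    ", fv_native)
print("facility (non-native):", fv_non_native)
print("flagged item indices: ", flagged_items)
```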
Differential item functioning (DIF) is a statistical characteristic of an item that shows potential unfairness of the item among different groups that should provide the same test results otherwise (Perrone, 2006). A presence of DIF does not necessarily indicate bias but unexpected behavior on an item (American Educational Research Association et al., 2014). This is why, after the detection of DIF, it is important to review the root causes of the differences. Whenever DIF happens for many items of a test, a test construct or final score is potentially unfair among different groups that should provide the same test results otherwise. This situation is called differential test functioning (DTF) (Runnels, 2013). There are three main techniques for identifying DIF, namely the Mantel-Haenszel approach, item response theory (IRT) methods, and logistic regression (Zumbo, 2007).
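As a rough illustration of the first of these techniques, the Python sketch below computes the Mantel-Haenszel common odds ratio for a single item, with respondents matched on their total test score. The records, the group labels, and the `expand` helper are hypothetical. A ratio close to 1 suggests no DIF; the sketch also reports the commonly used delta transformation, -2.35 ln(alpha), where negative values indicate that the item favors the reference group. A complete analysis would add a significance test, which we omit here.

```python
from collections import defaultdict
from math import log


def mantel_haenszel_odds_ratio(records):
    """Common odds ratio for one item across matched score strata.

    Each record is (group, total_score, correct), with group being
    "reference" or "focal" and correct being 1 or 0.
    Raises ZeroDivisionError if no stratum contains both a wrong
    reference answer and a correct focal answer.
    """
    # Per stratum: [ref correct, ref wrong, focal correct, focal wrong]
    strata = defaultdict(lambda: [0, 0, 0, 0])
    for group, total_score, correct in records:
        offset = 0 if group == "reference" else 2
        strata[total_score][offset + (0 if correct else 1)] += 1

    numerator = denominator = 0.0
    for a, b, c, d in strata.values():
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator


def expand(group, total_score, n_correct, n_wrong):
    """Hypothetical helper: build individual records from cell counts."""
    return [(group, total_score, 1)] * n_correct + [(group, total_score, 0)] * n_wrong


# Hypothetical data: two total-score strata, two groups.
records = (
    expand("reference", 1, 6, 4) + expand("focal", 1, 3, 7)
    + expand("reference", 2, 8, 2) + expand("focal", 2, 5, 5)
)

alpha = mantel_haenszel_odds_ratio(records)
delta = -2.35 * log(alpha)  # delta-scale transformation of the odds ratio
print(f"MH odds ratio = {alpha:.2f}, delta = {delta:.2f}")
```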
8.2. Intrinsic test bias
Intrinsic test bias occurs when there are differences in the mean scores of two groups that are due to the characteristics of the test itself rather than differences between the groups in the constructs being measured. Measurement invariance is the desired property; intrinsic test bias occurs when it is lacking. If a test for assessing the knowledge of software quality is developed in English and then administered to individuals who are not fluent in English, the measure for the construct of software quality knowledge would be contaminated by a measure of English proficiency. Differential content validity (see section 7.2) is the most severe form of intrinsic test bias as it causes lower test scores in different groups. If a measurement instrument for debugging skills has been designed with American software testers in mind, any participant who is not an American software tester will likely perform worse on the test, to different degrees. Rust (2009) reports various statistical models proposed over the last 50 years to detect intrinsic test bias which, however, present various issues, including the introduction of more unfairness near cut-off points or for certain groups of individuals. There is no recommended way to detect intrinsic test bias other than performing item bias analysis paired with sensitivity analysis.
8.3. Extrinsic test bias
Extrinsic test bias occurs whenever unfair decisions are made based on a non-biased test. These issues usually belong to tests about demographics dealing with social, economic, and political issues, so they are unlikely to belong to measurement instruments developed for the software engineering domain.
9. Further reading
The present paper scratches the surface of psychometric theory and practice, and its aim is to be broad rather than deep. We collect, in this section, what we consider to be next steps for a better understanding and expansion of the concepts that we have presented.
The books written by Rust (2009); Kline (2015); Nunnally (1994) provide an overview of psychometric theory, cover all topics mentioned in the present paper, and more. We invite the reader in particular to compare how they present measurement theory and their views and classifications of validity and reliability. A natural follow-up is The Standards for Educational and Psychological Testing (SEPT, (American Educational Research Association et al., 2014)), which proposes standards that should be met in psychological testing.
While our summary breaks down fundamental concepts and presents them for the uninitiated researcher of behavioral software engineering, our writing cannot honor enough the guidelines and recommendations for factor analysis offered by Yong and Pearce (2013); Russell (2016); Singh et al. (2016); Fabrigar et al. (1999). To those we add the work of Zumbo (2007), who have explored, through data simulations, the conditions that yield reliable exploratory factor analysis with samples below 50, which is unfortunately a condition we often live with in software engineering research. Furthermore, we wish to point the reader to alternatives to factor analysis, especially for confirmatory factor analysis (CFA). Flora and Curran (2004) analyzed the benefits of using Robust Weighted Least Squares (Robust WLS) regression. With a Monte Carlo simulation, they have shown that robust WLS provided accurate test statistics, parameter estimates, and standard errors even when the assumptions of CFA were not met. Bayesian alternatives for CFA were proposed as early as the 1980s (Lee, 1981) and later expanded to cover the exploratory phase as well; see, for example, the works by Conti et al. (2014); Muthén and Asparouhov (2012); Lu et al. (2016).
Within the software engineering domain, Gren (2018) has offered an alternative lens on validity and reliability of software engineering studies, also based on psychology, that we advise reading. Ralph and Tempero (2018) have offered a deep overview of construct validity in software engineering through a psychological lens.
10. Running example of psychometric evaluation
We believe that a methodology description is best complemented by a concrete example of its application. In Appendix A, we provide a complete scenario of a psychometric evaluation with the R programming language and the psych package. The evaluation follows the same structure as the present paper for ease of understanding. In the spirit of open science in software engineering (Fernández et al., 2020), we provide the running example as a replication package as well (Graziotin et al., 2020). We wrote the example using R Markdown, making it fully repeatable, and we include the generated dataset as well as instructions for replication with newly generated data.
11. Conclusion
The adoption and development of valid and reliable measurement instruments in software engineering research, whenever humans are to be evaluated, should benefit from psychology and statistics theory. This paper provides a gentle introduction to psychometric evaluation of tests, which will help the development of tests as well as a careful selection and preservation of existing tests.
After providing the basic building blocks of psychometric theory, we introduced item analysis, factor analysis, standardization and normalization, reliability, validity, and fairness in testing and test bias. We also provided an implementation of a psychometric evaluation with a running example and provided all data and source code openly. We followed textbooks, method papers, and society standards to ensure coverage of all important steps, but we could only offer a gentle introduction and invite the reader to explore our referenced material further. Each of these steps is a universe of its own, with dozens of published artifacts related to them.
Adding the steps described in this paper will increase the time required for developing measurement instruments. However, the return on investment will be immeasurable. Psychometric studies of measurement instruments will improve reliability and validity of the adopted instruments. The software engineering community must value psychometric studies more. This, however, requires a cultural change that we hope to foster with this paper.
“Spending an entire Ph.D. candidacy on the validation of one single measurement of a construct should be, not only approved, but encouraged.” (Gren, 2018) and, we believe, should also become normal.
Acknowledgements. We acknowledge the support of Swedish Armed Forces, Swedish Defense Materiel Administration and Swedish Governmental Agency for Innovation Systems (VINNOVA) in the project number 2013-01199.
References
- Factor rotations in factor analyses. In Encyclopedia of Social Sciences Research Methods, A. Lewis-Beck M. and F. T (Eds.), pp. 792–795.
- Likert or rasch? nothing is more applicable than good theory. Journal of Advanced Nursing 20 (1), pp. 196–201.
- Standards for educational and psychological testing. American Educational Research Association, Washington, DC.
- Criterion-referenced testing. In Encyclopedia of Autism Spectrum Disorders, pp. 823–823.
- Norm-referenced testing. In Encyclopedia of Autism Spectrum Disorders, pp. 2063–2064.
- An overview of analytic rotation in exploratory factor analysis. Multivariate Behavioral Research 36 (1), pp. 111–150.
- Fairness. Cambridge English Dictionary 1 (1), pp. 1. Available: https://dictionary.cambridge.org/dictionary/english/fairness
- Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin 56 (2), pp. 81–105.
- Logical foundations of probability. The University of Chicago Press.
- The scree test for the number of factors. Multivariate Behavioral Research 1 (2), pp. 245–276.
- Practical experiences in the design and conduct of surveys in empirical software engineering. In Empirical methods and studies in software engineering, G. Goos, J. Hartmanis, J. van Leeuwen, R. Conradi, and A. I. Wang (Eds.), pp. 104–128.
- Psychological testing and assessment: an introduction to tests and measurement. Mayfield Pub Co.
- Pretesting survey instruments: an overview of cognitive methods. Qual Life Res 12 (3), pp. 229–238.
- Bayesian exploratory factor analysis. J Econom 183 (1), pp. 31–57.
- Determining the number of factors to retain in efa: using the spss r-menu v2.0 to make more judicious estimations. Practical Assessment, Research & Evaluation 18 (8), pp. 1–14.
- Introduction to classical and modern test theory. Wadsworth Pub Co.
- Forty years of research on personality in software engineering: a mapping study. Computers in Human Behavior 46, pp. 94–113.
- Exploring individual characteristics and programming performance: implications for programmer selection. In Proceedings of the 38th Annual Hawaii International Conference on System Sciences, pp. 314a–314a.
- Rotation in factor analysis. The Statistician 29 (3), pp. 167.
- Item response theory. Psychology Press.
- Evaluating the use of exploratory factor analysis in psychological research. Psychological methods 4 (3), pp. 272.
- Examining the structure of lean and agile values among software developers. In Lecture Notes in Business Information Processing: Agile Processes in Software Engineering and Extreme Programming, pp. 218–233.
- Software developer experience: case studies in lean-agile and open source environments. Ph.D. Thesis, Department of Computer Science, University of Helsinki …, Helsinki.
- Validity threats in empirical software engineering research - an initial survey. In Proceedings of the 22nd International Conference on Software Engineering & Knowledge Engineering (SEKE’2010), Redwood City, San Francisco Bay, CA, USA, July 1 - July 3, 2010, pp. 374–379. Available: http://www.cse.chalmers.se/~feldt/publications/feldt_2010_validity_threats_in_ese_initial_survey.pdf
- Towards individualized software engineering: empirical studies should collect psychometrics. In the 2008 international workshop, New York, New York, USA, pp. 49–52.
- Open science in software engineering. In Contemporary Empirical Methods in Software Engineering, M. Felderer and G. H. Travassos (Eds.), pp. 479–504. In press. Available: https://arxiv.org/abs/1712.08341
- An empirical evaluation of alternative methods of estimation for confirmatory factor analysis with ordinal data. Psychological methods 9 (4), pp. 466.
- Sem: structural equation models. Technical report, The Comprehensive R Archive Network.
- Psychometric properties. In Encyclopedia of Behavioral Medicine, M. D. Gellman and J. R. Turner (Eds.), pp. 1563–1564.
- Instructional technology and the measurement of learning outcomes: some questions. American Psychologist 18 (8), pp. 519–521.
- On the unhappiness of software developers. In 21st International Conference on Evaluation and Assessment in Software Engineering, E. Mendes, S. Counsell, and K. Petersen (Eds.), New York, New York, USA, pp. 324–333.
- Behavioral Software Engineering - Example of psychometric evaluation with R. Zenodo.
- Do feelings matter? on the correlation of affects and the self-assessed productivity in software engineering. Journal of Software: Evolution and Process 27 (7), pp. 467–487.
- The affect of software developers: common misconceptions and measurements. In 2015 IEEE/ACM 8th International Workshop on Cooperative and Human Aspects of Software Engineering (CHASE), pp. 123–124.
- Understanding the affect of developers: theoretical background and guidelines for psychoempirical software engineering. In Proceedings of the 7th International Workshop on Social Software Engineering, SSE 2015, New York, NY, USA, pp. 25–32.
- Darwinian theory, functionalism, and the first american psychological revolution. American Psychologist 64 (2), pp. 75–83.
- Standards of validity and the validity of standards in behavioral software engineering research. New York, New York, USA.
- Item bias review. ERIC Clearinghouse on Assessment and Evaluation, the Catholic University of America, Department of Education.
- Fundamentals of item response theory. Sage.
- The trilogy of mind: cognition, affection, and conation. Journal of the History of the Behavioral Sciences 16 (2), pp. 107–117.
- Personality and the fate of organizations. Psychology Press.
- A rationale and test for the number of factors in factor analysis. Psychometrika 30, pp. 179–185.
- Some lessons learned in conducting software engineering surveys in china. In Proceedings of the Second ACM-IEEE international symposium on Empirical software engineering and measurement, pp. 168–177.
- The varimax criterion for analytic rotation in factor analysis. Psychometrika 23 (3), pp. 187–200.
- The application of electronic computers to factor analysis. Educational and Psychological Measurement 20 (1), pp. 141–151.
- Structural equation modeling: foundations and extensions. Vol. 10, Sage Publications.
- Guidelines for performing systematic literature reviews in software engineering. Technical report, Keele University and University of Durham, Keele and Durham, UK.
- Personal opinion surveys. In Guide to Advanced Empirical Software Engineering, F. Shull, J. Singer, and D. I. K. Sjøberg (Eds.), pp. 63–92.
- A handbook of test construction (psychology revivals): introduction to psychometric design. Routledge.
- Exploratory factor analysis: theory and application. Technical report, University of Groningen.
- A bayesian approach to confirmatory factor analysis. Psychometrika 46 (2), pp. 153–160.
- Behavioral software engineering - guidelines for qualitative studies. Available: https://arxiv.org/abs/1712.08341
- Behavioral software engineering: a definition and systematic literature review. Journal of Systems and Software 107, pp. 15–37.
- An initial analysis of software engineers’ attitudes towards organizational change. Empirical Software Engineering 22 (4), pp. 2179–2205.
- A technique for the measurement of attitudes. Archives of psychology 22 (40), pp. 1–55.
- Objective tests as instruments of psychological theory. Psychological Reports 3, pp. 635–694.
- Bayesian factor analysis as a variable-selection problem: alternative priors and consequences. Multivariate Behav Res 51 (4), pp. 519–539.
- Applications of structural equation modeling in psychological research. Annu Rev Psychol 51, pp. 201–226.
- Sample size in factor analysis. Psychological Methods 4 (1), pp. 84–99.
- Who should test whom. Communications of the ACM 50 (1), pp. 66–71.
- Survey guidelines in software engineering: an annotated review. In Proceedings of the 10th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, pp. 58.
- Bayesian structural equation modeling: a more flexible representation of substantive theory. Psychological Methods 17 (3), pp. 313–335.
- Evaluating the use of exploratory factor analysis in developmental disability psychological research. Journal of Autism and Developmental Disorders 40 (1), pp. 8–20.
- Psychometric theory 3e. Tata McGraw-Hill Education.
- Questionnaire design, interviewing and attitude measurement. Pinter Pub Ltd, London, UK.
- LIII. on lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2 (11), pp. 559–572.
- Differential item functioning and item bias: critical considerations in test fairness. Teachers College, Columbia University Working Papers in TESOL and Applied Linguistics 6, pp. 1–3.
- Worldviews, research methods, and their relationship to validity in empirical software engineering research. In 2013 Joint Conference of the 23rd International Workshop on Software Measurement and the 8th International Conference on Software Process and Product Measurement (IWSM-MENSURA), pp. 81–89.
- Measuring the mbti… and coming up short. Journal of Career Planning and Employment 54 (1), pp. 48–52.
- The three sigma rule. The American Statistician 48 (2), pp. 88–91.
- Pandemic programming: how covid-19 affects software developers and how their organizations can help. Available: https://arxiv.org/abs/2005.01127
- Construct validity in software engineering research and software metrics. New York, New York, USA.
- Very simple structure: an alternative procedure for estimating the optimal number of interpretable factors. Multivariate Behavioral Research 14 (4), pp. 403–414.
- An introduction to psychometric theory with applications in r. personality-project.org.
- An introduction to the psych package: part ii scale construction and psychometrics. Technical report, The Comprehensive R Archive Network.
- Using the psych package to generate and test structural models. Technical report, The Comprehensive R Archive Network.
- Psych: procedures for psychological, psychometric, and personality research. Northwestern University, Evanston, Illinois. R package version 1.9.12.
- How to: use the psych package for factor analysis and data reduction. Technical report, The Comprehensive R Archive Network.
- Measuring differential item and test functioning across academic disciplines. Language Testing in Asia 3 (1), pp. 9.
- Knowledge management in software engineering. IEEE software 19 (3), pp. 26–38.
- In search of underlying dimensions: the use (and abuse) of factor analysis in personality and social psychology bulletin. Personality and Social Psychology Bulletin 28 (12), pp. 1629–1646.
- Modern psychometrics: the science of psychological assessment. Routledge, Hove, East Sussex New York.
- Asking questions about behavior: cognition, communication, and questionnaire construction. American Journal of Evaluation 22 (2), pp. 127–160.
- Views on internal and external validity in empirical software engineering. In Software Engineering (ICSE), 2015 IEEE/ACM 37th IEEE International Conference on, Vol. 1, pp. 9–19.
- Norms for test construction. In Measures of Positive Psychology, pp. 17–34.
- Eliciting salient beliefs in research on the theory of planned behaviour: the effect of question wording. Current Psychology 22 (3), pp. 234–251.
- Sharing knowledge in knowledge-intensive firms. Human resource management journal 13 (2), pp. 60–75.
- Using multivariate statistics. Vol. 5, Pearson Boston, MA, Boston.
- Uses of factor analysis in counseling psychology research. Journal of Counseling Psychology 34 (4), pp. 414–424.
- Classical test theory in historical perspective. Educational Measurement: Issues and Practice 16 (4), pp. 8–14.
- Challenges in survey research. In Contemporary Empirical Methods in Software Engineering, M. Felderer and G. H. Travassos (Eds.), pp. 95–127. In press. Available: https://arxiv.org/abs/1908.05899
- The psychology of computer programming. Vol. 932633420, Van Nostrand Reinhold New York.
- Common factor analysis versus principal component analysis: differential bias in representing model parameters. Multivariate Behavioral Research 28 (3), pp. 263–311.
- Experimentation in software engineering. Springer Berlin Heidelberg, Berlin, Heidelberg.
- A theory on individual characteristics of successful coding challenge solvers. PeerJ Computer Science 5, pp. e173.
- A beginner’s guide to factor analysis: focusing on exploratory factor analysis. Tutorials in Quantitative Methods for Psychology 9 (2), pp. 79–94.
- Three generations of dif analyses: considering where it has been, where it is now, and where it is going. Language assessment quarterly 4 (2), pp. 223–233.