Inter-Coder Agreement for Improving Reliability in Software Engineering Qualitative Research

07/31/2020 ∙ by Ángel González-Prieto, et al. ∙ Universidad Politécnica de Madrid

In recent years, research on empirical software engineering that uses qualitative data analysis (e.g. thematic analysis, content analysis, and grounded theory) has been increasing. However, most of this research does not delve into the reliability and validity of its findings, specifically the reliability of coding, even though a variety of statistical techniques known as Inter-Coder Agreement (ICA) exist for analyzing consensus in team coding. This paper aims to establish a novel theoretical framework that enables a methodological approach for conducting this validity analysis. This framework is based on a set of statistics for measuring the degree of agreement that different coders achieve when judging a common matter. We analyze different reliability coefficients and provide detailed examples of calculation, with special attention to Krippendorff's α coefficients. We systematically review several variants of Krippendorff's α reported in the literature and provide a novel common mathematical framework in which all of them are unified through a universal α coefficient. Finally, this paper provides a detailed guide to the use of this theoretical framework in a large case study on DevOps culture. We explain how α coefficients are computed and interpreted using Atlas.ti, a widely used software tool for qualitative analysis.



1. Introduction

In recent years, research on empirical software engineering that uses qualitative research techniques has been on the rise [Ghanbari, STOL:2016, wohlin2012experimentation, SALLEH:2018, Storer]. Grounded theory, content analysis and thematic analysis have been established as leading procedures for conducting qualitative data analysis, as they provide methods for examining and interpreting qualitative data to understand what it represents [cruzes2011recommended, Saldana2012]. However, few studies in software engineering analyze and test the reliability and trustworthiness of their content analysis and, thus, the validity of their findings. A systematic search in the main publication repositories (namely ACM Digital Library, Science Direct and Springer) returns no more than 25 results, some of them being [example1, example2a, example2b, example3, example4]. Similar results were obtained in a systematic literature review reported by Nili et al. [Nili:2020] in information management research. Nevertheless, the number of publications that test the reliability of their findings is notably higher in other areas, especially in health sciences, social psychology, education, and business [Nili:2020]. Reliability in content analysis is particularly crucial for identifying mistakes before the codes are used in developing and testing a theory or model. In this way, it assesses the soundness and correctness of the conclusions drawn, with a view towards creating well-posed and long-lasting knowledge. Weak confidence in the data only leads to uncertainty in the subsequent analysis and generates doubts about findings and conclusions. In Krippendorff’s own words: “If the results of reliability testing are compelling, researchers may proceed with the analysis of their data. If not, doubts prevail as to what these data mean, and their analysis is hard to justify” [Krippendorff:2018].
This problem can be addressed by means of well-established statistical techniques known as Inter-Coder Agreement (ICA) analysis. These are a collection of coefficients that measure the extent of the agreement/disagreement between several judges when they subjectively interpret a common reality. In this way, these coefficients allow researchers to establish a value of reliability of the coding that will be analyzed later to infer relations and to reach conclusions. Coding is reliable if coders can be shown to agree on the categories assigned to units to an extent determined by the purposes of the study [craggs2005evaluating]. In this paper, we propose to introduce ICA analysis techniques in software engineering empirical research in order to enhance the reliability of qualitative data analysis and the soundness of the results. For that purpose, in Section 2 we review some of the coefficients that, historically, have been reported in the literature to measure the ICA. We start by discussing, in Section 2.1, some general purpose statistics, like Cronbach’s $\alpha$ [cronbach1951coefficient], Pearson’s $r$ [pearson1895vii] and Spearman’s $\rho$ [Spearman], that are often mistakenly considered suitable for ICA analysis, even though they cannot measure the degree of agreement among coders but only the much weaker concept of correlation. In a similar vein, in Section 2.2 we review some coefficients that evaluate agreement between coders, but are not suitable for measuring reliability, since they do not take into account the agreement by chance, like the percent agreement [feng2014intercoder, feng2015mistakes] and the Holsti index [holsti1969content]. We also present a third group, with representatives like Scott’s $\pi$ [scott1955reliability] (Section 2.3), Cohen’s $\kappa$ [cohen1960coefficient] (Section 2.4) and Fleiss’ $\kappa$ [fleiss1971measuring] (Section 2.5), that have been intensively used in the literature for measuring reliability, especially in social sciences.
However, as pointed out by Krippendorff in [hayes2007answering], all of these coefficients suffer from some kind of weakness that makes them non-optimal for measuring reliability. Finally, in Section 2.6 we briefly sketch Krippendorff’s proposal to overcome these flaws, the so-called Krippendorff’s $\alpha$ coefficient [Krippendorff:2018], on which we will focus throughout this paper. Despite the success and widespread use of Krippendorff’s $\alpha$, there exist in the literature plenty of variants of this coefficient, formulated quite ad hoc for very precise and particular situations, like [krippendorff1970estimating, Krippendorff:1995, yang2011coefficient, gwet2011krippendorff, de2012calculating, Krippendorff:2016, Krippendorff:2018]. The lack of a uniform treatment of these measures makes their use confusing, and the co-existence of different formulations and diffuse interpretations makes their comparison a hard task. To address this problem, Sections 3 and 4 describe a novel theoretical framework that reduces the existing variants to a unique universal $\alpha$ coefficient by means of labels that play the role of meta-codes. With this idea in mind, we focus on four of the most outstanding and widely used $\alpha$ coefficients and show how their computation can be reduced to the universal $\alpha$ by means of a simple re-labelling. This framework provides new and more precise interpretations of these coefficients that will help to detect flaws in the coding process and to correct them easily on the fly. Moreover, this clearer interpretation in terms of labels sheds light on some awkward behaviors of the coefficients that are very hard to understand otherwise. Section 5 includes a tutorial on the use and interpretation of Krippendorff’s $\alpha$ coefficients for providing reliability in software engineering case studies through the software tool Atlas.ti v8.4. This tool provides support for the different tasks that take place during qualitative data analysis, as well as for the calculation of the ICA measures.
A variety of tools for qualitative research are available on the market, like NVIVO, MaxQDA, Qualcoder, Qcoder, etc., but for its simplicity and its ability to compute Krippendorff’s $\alpha$ coefficients, throughout this tutorial we will focus on Atlas.ti. The tutorial is driven by a running example based on a real case study developed by the authors about a qualitative inquiry in the DevOps domain [devops]. Additionally, we highlight several peculiarities of Atlas.ti when dealing with large corpora and sparse relevant matter. Finally, in Section 6 we summarize the main conclusions of this paper and we provide some guidelines on how to apply this tutorial to case studies in qualitative research. We expect that the theoretical framework, the methodology, and the subsequent tutorial introduced in this paper may help empirical researchers, particularly in software engineering, to improve the quality and soundness of their studies.

2. Background

In a variety of situations, researchers have to deal with the problem of judging some data. While the observed data is objective, the perception of each researcher is deeply subjective. In many cases, the solution is to introduce several judges to reduce the amount of subjectivity by comparing their judgements. However, in this context, a new problem arises that we need to cope with: we need a method for measuring the degree of agreement that the judges achieve in their evaluations, and in the resulting codification of the raw data. Thus, researchers need to measure the reliability of the coding. Only after establishing that the reliability is sufficiently high does it make sense to proceed with the analysis of the data. It is worth mentioning that, although often used interchangeably, there is a technical distinction between the terms agreement and reliability. Inter-Coder Agreement (ICA) coefficients assess the extent to which the responses of two or more independent raters are concordant; on the other hand, inter-coder reliability evaluates the extent to which these raters consistently distinguish between different responses [gisev2013interrater]. In other words, the measurable quantity is the ICA, and using this value we can infer reliability. In the same vein, we should not confuse reliability and validity. Reliability deals with the extent to which the experiment is deterministic and independent of the coders, and in this sense it is strongly tied to reproducibility; whereas validity deals with truthfulness, with how well the claims assert the truth. Reliability is a must for validity, but does not guarantee it. Several coders may share a common interpretation of the reality, so that we have a high level of reliability, but this interpretation might be wrong and biased, so that the validity is low.
In this section, we analyze several statistics that have been reported in the literature for quantifying the Inter-Coder Agreement (ICA) and for inferring, from this value, the reliability of the codification.

2.1. Common misunderstandings for measuring ICA

Several statistical coefficients that have been applied for evaluating ICA co-exist in the literature, such as Cronbach’s $\alpha$, Pearson’s $r$ and Spearman’s $\rho$. However, these coefficients must not be confused with methods for testing inter-coder reliability, as none of them measures the degree of agreement among coders. For instance, Cronbach’s $\alpha$ [cronbach1951coefficient] is a statistic for interval or ratio level data that focuses on the consistency of judges when numerical judgments are required for a set of units. As claimed in [nili2017critical]: “It calculates the consistency by which people judge units without any aim to consider how much they agree on the units in their judgments”, and as written in [hayes2007answering] “[it is] unsuitable to assess reliability of judgments”. On the other hand, correlation coefficients, such as Pearson’s $r$ [pearson1895vii] or Spearman’s rank $\rho$ [Spearman], measure the extent to which two logically separate interval variables, say $X$ and $Y$, covary in a linear relationship of the form $Y = a + bX$. They indicate the degree to which the values of one variable predict the values of the other. Agreement coefficients, in contrast, must measure the extent to which $Y = X$. High correlation means that data approximate some regression line, whereas high agreement means that they approximate the 45-degree line [Krippendorff:2018].
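The difference between correlation and agreement can be illustrated with a minimal Python sketch. The ratings below are hypothetical and the helper names `pearson_r` and `percent_agreement` are ours, not part of any cited method: one coder is always exactly one point above the other, so the two series are perfectly correlated although the coders never agree.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def percent_agreement(x, y):
    """Fraction of units on which both coders gave exactly the same value."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

# Hypothetical ratings: coder B is always one point above coder A (Y = 1 + X).
coder_a = [1, 2, 3, 4, 5]
coder_b = [2, 3, 4, 5, 6]

r = pearson_r(coder_a, coder_b)                  # linear relationship holds exactly
agreement = percent_agreement(coder_a, coder_b)  # but Y = X never holds
print(f"Pearson r = {r:.2f}, agreement = {agreement:.2f}")
```

In this toy case the data lie on a regression line that is parallel to, but different from, the 45-degree line, which is the situation the paragraph above warns about.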

2.2. Percent agreement

This measure is computed as the ratio between the number of times the judges agreed when classifying an item (a datum to be analyzed) and the total number of items (multiplied by 100 if we want to express it as a percentage). It has been widely used in the literature due to its simplicity and straightforward calculation [feng2014intercoder, feng2015mistakes]. However, it is not a valid measure for inferring reliability in case studies that require a high degree of accuracy, since it does not take into account the agreement by chance [nili2017critical] and, according to [hayes2007answering], there is no clear interpretation for values other than $100\%$. In addition, it can be used only with two coders and only for nominal data [zhao2013assumptions]. It also presents an undesirable side effect: the larger the number of codes (categories), the harder it is to achieve high values of ICA.

Table 1 shows an illustrative example, which is part of a Systematic Literature Review (SLR) performed by some authors of this paper [perez2020systematic]. This table shows the selection of primary studies; each item corresponds to one of these studies and each judge (referred to as $J_1$ and $J_2$) determines, according to a pre-established criterion, whether the study should be promoted to an analysis phase (Y) or not (N). From these data, the percent agreement attained is $10/15 = 66.7\%$. At first sight, this seems to be a high value that would lead to a high reliability of the data. However, we are missing the fact that the judges may achieve agreement purely by chance. As pointed out by [Krippendorff:2018]: “Percent-agreement is often used to talk about the reliability between two observers, but it has no valid reliability interpretations, not even when it measures $100\%$. Without reference to the variance in the data and to chance, percent agreement is simply uninterpretable as a measure of reliability, regardless of its popularity in the literature”. Indeed, the following sections show that other ICA measures, such as Cohen’s $\kappa$ and Krippendorff’s $\alpha$, take very low values on the same data: 0.39 ($39\%$) and 0.34 ($34\%$), respectively.

Item:   #01  #02  #03  #04  #05  #06  #07  #08  #09  #10  #11  #12  #13  #14  #15
$J_1$:  N    N    N    N    N    N    Y    N    N    Y    Y    N    N    N    Y
$J_2$:  Y    Y    N    N    N    Y    Y    Y    N    Y    Y    N    Y    N    Y
Table 1. Decision table of two judges in a SLR
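As a minimal sketch of the computation just described, the following Python snippet evaluates the percent agreement for the running example. The two decision rows are reconstructed from the counts reported in Tables 5 and 6; which row belongs to which judge is an assumption, and it does not affect the result because the measure is symmetric. The function name is hypothetical.

```python
# Decisions of the two judges for items #01..#15 (Y = promote, N = discard).
# Assumption: rows reconstructed from the counts of Tables 5 and 6.
J1 = list("NNNNNNYNNYYNNNY")
J2 = list("YYNNNYYYNYYNYNY")

def percent_agreement(a, b):
    """Number of items on which both judges coincide, divided by the number of items."""
    if len(a) != len(b):
        raise ValueError("both judges must rate the same items")
    return sum(x == y for x, y in zip(a, b)) / len(a)

p = percent_agreement(J1, J2)
print(f"percent agreement = {100 * p:.1f}%")
```

Note that nothing in this computation accounts for agreement reached by chance, which is the weakness discussed above.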

It is worth mentioning that there exists a variation of the simple percent agreement called the Holsti index [holsti1969content]. It allows researchers to handle the case in which the matter to be analyzed is not pre-divided into items to be judged, so that each coder selects the matter that they consider relevant. Nevertheless, for the same reasons as the simple percent agreement, this index is not a valid measure for analyzing ICA.

2.3. Scott’s $\pi$

This index, introduced in [scott1955reliability], is an agreement coefficient for nominal data and two coders. The method corrects percent agreement by taking into account the agreement that can occur between the coders by chance. The index of Inter-Coder Agreement for Scott’s $\pi$ coefficient is computed as

\[ \pi = \frac{P_o - P_e}{1 - P_e}. \]

Here, $P_o$ (observed percent agreement) represents the percentage of judgments on which the two analysts agree when coding the same data independently; and $P_e$ is the percent agreement to be expected on the basis of chance. This latter value can be computed as

\[ P_e = \sum_{i=1}^{k} p_i^2, \]

where $k$ is the total number of categories and $p_i$ is the proportion of the entire sample which falls in the $i$-th category. For instance, for our example of Table 1, we have that $P_o = 0.667$, as computed in Section 2.2. For $P_e$, it is given by $P_e = 0.433^2 + 0.567^2 = 0.509$. Observe that, while the number of items is $15$, in the previous computation we divided by $30$. This is due to the fact that there are $15$ items, but each of them is evaluated by both judges, so there are $15$ pairs of evaluations, that is, $30$ evaluations in total; see also Table 2.

Category   $J_1$   $J_2$   $p_i$                    $p_i^2$
Y          4       9       4/30 + 9/30 = 0.433      0.188
N          11      6       11/30 + 6/30 = 0.567     0.321
Total                                               0.509
Table 2. Expected percent agreement for the data of Table 1

Therefore, Scott’s $\pi$ coefficient has a value of

\[ \pi = \frac{0.667 - 0.509}{1 - 0.509} = 0.32. \]
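The computation of Scott’s $\pi$ for the running example can be sketched in a few lines of Python. This is an illustrative sketch only: the function name is hypothetical, and the two decision rows are reconstructed from Tables 5 and 6 (the assignment of rows to judges is an assumption that does not change the value, since $\pi$ pools both judges).

```python
from collections import Counter

# Assumption: decision rows reconstructed from the counts of Tables 5 and 6.
J1 = list("NNNNNNYNNYYNNNY")
J2 = list("YYNNNYYYNYYNYNY")

def scott_pi(a, b):
    """Scott's pi for two coders and nominal categories."""
    m = len(a)
    # Observed agreement: share of items on which both coders coincide.
    p_o = sum(x == y for x, y in zip(a, b)) / m
    # Expected agreement: squared pooled proportions over the 2m evaluations.
    pooled = Counter(a) + Counter(b)
    p_e = sum((count / (2 * m)) ** 2 for count in pooled.values())
    return (p_o - p_e) / (1 - p_e)

print(f"Scott's pi = {scott_pi(J1, J2):.2f}")  # prints 0.32
```

Pooling the two judges' evaluations before squaring is what distinguishes this chance correction from the one used by Cohen’s coefficient.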

2.4. Cohen’s $\kappa$

Cohen’s $\kappa$ coefficient measures the concordance between two judges’ classifications of elements into mutually exclusive categories. Cohen defined the coefficient as “the proportion of chance-expected disagreements which do not occur, or alternatively, it is the proportion of agreement after chance agreement is removed from consideration” [cohen1960coefficient]. The coefficient is defined as

\[ \kappa = \frac{P_o - P_e}{1 - P_e}, \]

where $P_o$ is the proportion of units for which the judges agreed (relative observed agreement among raters) and $P_e$ is the proportion of units for which agreement is expected by chance (chance-expected agreement). In order to compute these proportions, we will use the so-called contingency matrix, as shown in Table 3. This is a square matrix whose order is the number of categories $k$. The $(i,j)$-entry, denoted $c_{i,j}$, is the number of times that an item was assigned to the $i$-th category by judge $J_1$ and to the $j$-th category by judge $J_2$. In this way, the elements of the form $c_{i,i}$ are precisely the agreements in the evaluations.

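               Category 1   Category 2   ...   Category $k$
Category 1     $c_{1,1}$    $c_{1,2}$    ...   $c_{1,k}$
Category 2     $c_{2,1}$    $c_{2,2}$    ...   $c_{2,k}$
...            ...          ...          ...   ...
Category $k$   $c_{k,1}$    $c_{k,2}$    ...   $c_{k,k}$
Table 3. Contingency matrix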

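From this contingency matrix, the observed agreement, $P_o$, is defined as

\[ P_o = \frac{1}{m} \sum_{i=1}^{k} c_{i,i}, \]

where $m$ is the total number of items and $c_{i,i}$ are the diagonal entries of the contingency matrix. On the other hand, the agreement by chance, $P_e$, is given by

\[ P_e = \sum_{i=1}^{k} p_i, \]

where the probability of the $i$-th category, $p_i$, is given by

\[ p_i = \left( \frac{1}{m} \sum_{j=1}^{k} c_{i,j} \right) \left( \frac{1}{m} \sum_{j=1}^{k} c_{j,i} \right). \]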
The coefficient is $\kappa = 0$ when the observed agreement is entirely due to chance agreement. Greater-than-chance agreement corresponds to a positive value of $\kappa$ and less-than-chance agreement corresponds to a negative value of $\kappa$. The maximum value of $\kappa$ is $1$, which occurs when (and only when) there is perfect agreement between the judges [cohen1960coefficient]. Landis and Koch, in [Landis-Koch], proposed the following table for evaluating intermediate values (Table 4).

Cohen’s $\kappa$   Strength of Agreement
< 0.00             Poor
0.00 - 0.20        Slight
0.21 - 0.40        Fair
0.41 - 0.60        Moderate
0.61 - 0.80        Substantial
0.81 - 1.00        Almost perfect
Table 4. Landis & Koch: interpretation of the value of $\kappa$

As an example of application of this coefficient, let us come back to our example of Table 1. From the data, we compute the contingency matrix as shown in Table 5.

         Y     N     Total
Y        4     5     9
N        0     6     6
Total    4     11    15
Table 5. Contingency matrix for the data of Table 1

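In this way, $P_o$ is given by

\[ P_o = \frac{4 + 6}{15} = 0.667. \]

On the other hand, for $P_e$ we have

\[ p_Y = \frac{9}{15} \cdot \frac{4}{15} = 0.160, \qquad p_N = \frac{6}{15} \cdot \frac{11}{15} = 0.293. \]

Hence $P_e = 0.160 + 0.293 = 0.453$. Therefore, we have

\[ \kappa = \frac{0.667 - 0.453}{1 - 0.453} = 0.39. \]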
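The same computation can be sketched in Python starting directly from the contingency matrix of Table 5. The function name `cohen_kappa` is hypothetical and the code is only meant as an illustration of the formulas of this section.

```python
def cohen_kappa(c):
    """Cohen's kappa from a square contingency matrix c (list of lists of counts)."""
    k = len(c)
    m = sum(sum(row) for row in c)                        # total number of items
    p_o = sum(c[i][i] for i in range(k)) / m              # diagonal = agreements
    row_totals = [sum(c[i]) for i in range(k)]            # marginals of one judge
    col_totals = [sum(c[i][j] for i in range(k)) for j in range(k)]  # the other judge
    # Chance agreement: product of both judges' marginal proportions per category.
    p_e = sum(row_totals[i] * col_totals[i] for i in range(k)) / m ** 2
    return (p_o - p_e) / (1 - p_e)

# Contingency matrix of Table 5 (categories ordered as Y, N).
table_5 = [[4, 5],
           [0, 6]]
print(f"Cohen's kappa = {cohen_kappa(table_5):.2f}")  # prints 0.39
```

Transposing the matrix, that is, swapping the roles of the two judges, leaves the result unchanged, since both $P_o$ and $P_e$ are symmetric in the judges.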

It is worth mentioning that, despite its simplicity, this coefficient has some intrinsic problems. On the one hand, it is limited to nominal data and two coders. On the other hand, the result is difficult to interpret. Under various conditions, the $\kappa$ statistic is affected by two paradoxes that return biased estimates of the statistic itself: (1) high levels of observer agreement with low $\kappa$ values; (2) lack of predictability of changes in $\kappa$ with changing marginals [lantz1996behavior]. Some proposals for overcoming these paradoxes are described in [feinstein1990high] and in [cicchetti1990high]. According to [hayes2007answering]: “$\kappa$ is simply incommensurate with situations in which the reliability of data is the issue”.

2.5. Fleiss’ $\kappa$

Fleiss’ $\kappa$ is a generalization of Scott’s $\pi$ statistic to an arbitrary number, say $n \geq 2$, of raters $J_1, \ldots, J_n$. As above, we set $m$ to be the number of items to be coded and $k$ the number of categories (possible categorical ratings) under consideration. It is important to note that whereas Cohen’s $\kappa$ assumes the same two raters have rated a set of items, Fleiss’ $\kappa$ specifically allows that, although there are a fixed number of raters, different items may be rated by different individuals [fleiss1971measuring]. Analogously to Scott’s $\pi$, the coefficient is calculated via the formula

\[ \kappa = \frac{P_o - P_e}{1 - P_e}. \]

For this coefficient, we will no longer focus on the contingency matrix, but on the number of ratings. Hence, given a category $i$ and an item $b$, we will denote by $n_{b,i}$ the number of raters that assigned the $i$-th category to the $b$-th item. In this case, for each item $b$ and for each category $i$, we can compute the corresponding proportion of observations as

\[ p_i = \frac{1}{nm} \sum_{b=1}^{m} n_{b,i}, \qquad P_b = \frac{1}{n(n-1)} \sum_{i=1}^{k} n_{b,i}\,(n_{b,i} - 1). \]

Note that $p_i$ is very similar to the one considered in Cohen’s $\kappa$ but, now, $P_b$ counts the rate of judge–judge pairs that are in agreement (relative to the number of all possible judge–judge pairs). In this way, the observed agreement, $P_o$, and the expected agreement, $P_e$, are the average of these quantities relative to the total number of possibilities, that is

\[ P_o = \frac{1}{m} \sum_{b=1}^{m} P_b, \qquad P_e = \sum_{i=1}^{k} p_i^2. \]

As an example of application, consider again Table 1. Recall that, in our notation, the parameters of this example are as follows: $m = 15$ is the number of items (primary studies in our case), $n = 2$ is the number of coders and $k = 2$ is the number of nominal categories (Y and N in our example, which are categories $1$ and $2$, respectively). From these data, we form Table 6 with the computation of the counting values $n_{b,i}$. In the second and the third row of this table we indicate, for each item, the number of ratings it received for each of the possible categories (Y and N). For instance, for item #01, $J_1$ voted for the category N, while $J_2$ assigned it to the category Y and, thus, we have $n_{1,1} = n_{1,2} = 1$. On the other hand, for item #03 both coders assigned the category N, so we have $n_{3,1} = 0$ and $n_{3,2} = 2$. In the last column of the table we compute the observed percentages of each category, which give the results

\[ p_1 = \frac{13}{30} = 0.433, \qquad p_2 = \frac{17}{30} = 0.567. \]

Observe that, as expected, $p_1 + p_2 = 1$. Therefore, $P_e = 0.433^2 + 0.567^2 = 0.509$.

Item    #01 #02 #03 #04 #05 #06 #07 #08 #09 #10 #11 #12 #13 #14 #15   Total   $p_i$
Y       1   1   0   0   0   1   2   1   0   2   2   0   1   0   2     13      13/30 = 0.433
N       1   1   2   2   2   1   0   1   2   0   0   2   1   2   0     17      17/30 = 0.567
$P_b$   0   0   1   1   1   0   1   0   1   1   1   1   0   1   1     10
Table 6. Count of the proportion of observations for Fleiss’ $\kappa$ for the example of Table 1

On the other hand, for the observed percentages per item, we have two types of results. First, if for the $b$-th item the two coders disagreed in their ratings, we have that $P_b = 0$. However, if both coders agreed in their ratings (regardless of whether it was a Y or an N), we have that $P_b = 1$. Their average is the observed agreement $P_o = 10/15 = 0.667$. Therefore, the value of Fleiss’ $\kappa$ coefficient is

\[ \kappa = \frac{0.667 - 0.509}{1 - 0.509} = 0.32. \]
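A minimal Python sketch of this calculation, fed with the Y and N rows of Table 6, could look as follows. The function and variable names are hypothetical, and the sketch assumes that every item receives exactly $n$ ratings.

```python
def fleiss_kappa(counts, n):
    """Fleiss' kappa; counts[b][i] = number of raters assigning category i to item b.

    Assumes that every item was rated by exactly n raters.
    """
    m = len(counts)       # number of items
    k = len(counts[0])    # number of categories
    # Proportion of all n*m ratings that fall in each category.
    p = [sum(row[i] for row in counts) / (n * m) for i in range(k)]
    # Per-item rate of agreeing rater pairs among all ordered rater pairs.
    per_item = [sum(c * (c - 1) for c in row) / (n * (n - 1)) for row in counts]
    p_o = sum(per_item) / m
    p_e = sum(q * q for q in p)
    return (p_o - p_e) / (1 - p_e)

# Row "Y" of Table 6; with n = 2 raters the "N" row is 2 minus the "Y" row.
y_counts = [1, 1, 0, 0, 0, 1, 2, 1, 0, 2, 2, 0, 1, 0, 2]
counts = [(y, 2 - y) for y in y_counts]
print(f"Fleiss' kappa = {fleiss_kappa(counts, n=2):.2f}")  # prints 0.32
```

For two raters the value coincides with Scott’s coefficient of Section 2.3, as expected from the fact that Fleiss’ coefficient generalizes it.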

2.6. Krippendorff’s $\alpha$

The last coefficient that we will consider for measuring ICA is Krippendorff’s $\alpha$ coefficient. Sections 3 and 4 are entirely devoted to the mathematical formulation of Krippendorff’s $\alpha$ and its variants for content analysis. However, we believe that, for the convenience of the reader, it is worth introducing this coefficient here through a working example. The version that we will discuss here corresponds to the universal $\alpha$ coefficient introduced in [krippendorff1970estimating] (see also Section 3), which only deals with simple codifications such as the examples above. This is sometimes called binary $\alpha$ in the literature, but we reserve this name for a more involved version (see Section 4.2).

Again, we consider the data of Table 1, which corresponds to the simplest reliability data generated by two observers who assign one of two available values to each of a common set of units of analysis (two observers, binary data). In this context, this table is called the reliability data matrix. From this table, we construct the so-called matrix of observed coincidences, as shown in Table 7. This is a square matrix whose order is the number of possible categories (hence, a $2 \times 2$ matrix in our case, since we only deal with the categories Y and N).

The table is built as follows. First, count in Table 1 the number of pairs (Y, Y). In this case, $4$ items received two Y from the coders ($J_1$ and $J_2$). However, the observed coincidences matrix counts ordered pairs of judgements, and in the previous count $J_1$ always shows up first and $J_2$ appears second. Hence, we need to multiply this result by $2$, obtaining a total count of $8$ that is written down in the (Y, Y) entry of the observed coincidences matrix, denoted $o_{1,1}$. In the same spirit, the entry $o_{2,2} = 12$ of the matrix corresponds to the ordered pairs of responses (N, N). In addition, the anti-diagonal entries of the matrix, $o_{1,2}$ and $o_{2,1}$, correspond to responses of the form (Y, N) and (N, Y). There are $5$ items on which the coders disagreed, so, as ordered pairs of responses, there are $5$ pairs of responses (Y, N) and $5$ pairs of responses (N, Y), which are written down in the observed coincidences matrix. Finally, the marginal data, $o_1 = 13$ and $o_2 = 17$, are the sums of the values of the rows and the columns, and $o = o_1 + o_2 = 30$ is twice the number of items. Observe that, by construction, the observed coincidences matrix is symmetric.

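         Y     N     Total
Y        8     5     13
N        5     12    17
Total    13    17    30
Table 7. Observed coincidences matrix for the data of Table 1

In this way, the observed agreement is given by

\[ P_o = \frac{o_{1,1} + o_{2,2}}{o} = \frac{8 + 12}{30} = 0.667. \]

On the other hand, as in the previous methods, we need to compare these observed coincidences with the coincidences expected by chance. This information is collected in the so-called expected coincidences matrix, as shown in Table 8. The entries of this matrix, $e_{i,j}$, measure how often an ordered response $(i,j)$ is expected entirely by chance. In our case, the expected coincidences are given by

\[ e_{1,1} = \frac{13 \cdot 12}{29} = 5.38, \qquad e_{2,2} = \frac{17 \cdot 16}{29} = 9.38, \qquad e_{1,2} = e_{2,1} = \frac{13 \cdot 17}{29} = 7.62. \]

         Y       N       Total
Y        5.38    7.62    13
N        7.62    9.38    17
Total    13      17      30
Table 8. Expected coincidences matrix for the data of Table 1

Therefore, the expected agreement is

\[ P_e = \frac{e_{1,1} + e_{2,2}}{o} = \frac{5.38 + 9.38}{30} = 0.492. \]

Thus, using the same formula for the ICA coefficient as in Section 2.3, we get that Krippendorff’s $\alpha$ is given by

\[ \alpha = \frac{P_o - P_e}{1 - P_e} = \frac{0.667 - 0.492}{1 - 0.492} = 0.34. \]

As a final remark, in the context of Krippendorff’s $\alpha$, it is customary to use the equivalent formulation

\[ \alpha = 1 - \frac{D_o}{D_e}, \]

where $D_o = 1 - P_o$ is the observed disagreement and $D_e = 1 - P_e$ is the expected disagreement.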
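The construction of the observed and expected coincidences for two coders can be sketched in Python as follows. This is an illustration under stated assumptions: the function names are hypothetical, the decision rows are reconstructed from Tables 5 and 6 (the assignment of rows to judges is an assumption that does not affect the symmetric coincidence counts), and each coder assigns exactly one value to every item.

```python
from collections import Counter

# Assumption: decision rows reconstructed from the counts of Tables 5 and 6.
J1 = list("NNNNNNYNNYYNNNY")
J2 = list("YYNNNYYYNYYNYNY")

def observed_coincidences(a, b):
    """Count ordered pairs of values given by the two coders to the same item."""
    o = Counter()
    for x, y in zip(a, b):
        o[(x, y)] += 1   # first coder listed first
        o[(y, x)] += 1   # second coder listed first
    return o

def alpha_two_coders(a, b):
    """Krippendorff's alpha for two coders, nominal data, one value per item."""
    o = observed_coincidences(a, b)
    labels = sorted({label for pair in o for label in pair})
    marginal = {i: sum(o[(i, j)] for j in labels) for i in labels}
    total = sum(marginal.values())
    # Expected coincidences: o_i (o_i - 1) / (o - 1) on the diagonal, o_i o_j / (o - 1) elsewhere.
    expected = {(i, j): marginal[i] * (marginal[j] - (i == j)) / (total - 1)
                for i in labels for j in labels}
    p_o = sum(o[(i, i)] for i in labels) / total
    p_e = sum(expected[(i, i)] for i in labels) / total
    return (p_o - p_e) / (1 - p_e)

o = observed_coincidences(J1, J2)
print(o[("Y", "Y")], o[("Y", "N")], o[("N", "Y")], o[("N", "N")])
print(f"alpha = {alpha_two_coders(J1, J2):.2f}")  # prints 0.34
```

Counting each pair in both orders is what makes the observed coincidences matrix symmetric by construction.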

3. The universal Krippendorff’s $\alpha$ coefficient

Krippendorff’s $\alpha$ coefficient is one of the most widely used coefficients for measuring Inter-Coder Agreement in content analysis. As we mentioned in Section 2.6, one of the reasons is that this coefficient solves many of the flaws that Cohen’s $\kappa$ and Fleiss’ $\kappa$ suffer from. For a more detailed exposition comparing these coefficients see [hayes2007answering], and for a historical description of this coefficient see [Krippendorff:2018].

In this section, we explain the probabilistic framework that underlies Krippendorff’s $\alpha$ coefficient. For this purpose, we introduce a novel interpretation that unifies the different variants of the $\alpha$ coefficient presented in the literature (see for instance [krippendorff1970estimating, Krippendorff:1995, hayes2007answering, yang2011coefficient, gwet2011krippendorff, de2012calculating, Krippendorff:2016, Krippendorff:2018]). These coefficients are usually presented as unrelated and through a kind of ad hoc formulation for each problem. This makes the use of Krippendorff’s $\alpha$ confusing and unmotivated for the unfamiliar researcher. For this reason, we consider it worthwhile to provide a common framework in which precise interpretations and comparisons can be conducted. Subsequently, in Section 4 we will provide descriptions of these variants in terms of this universal $\alpha$ coefficient. The present formulation is an extension of the work of the authors in [devops] towards a uniform formulation.

Suppose that we are dealing with $n \geq 2$ different judges (also referred to as coders), denoted by $J_1, \ldots, J_n$, as well as with a collection of $m \geq 1$ items to be judged (also referred to as quotations in this context), denoted $I_1, \ldots, I_m$. We fix a set of admissible ‘meta-codes’, called labels, say $\Lambda = \{l_1, \ldots, l_t\}$. The task of each of the judges $J_a$ is to assign, to each item $I_b$, a collection (maybe empty) of labels from $\Lambda$. Hence, as a byproduct of the evaluation process, we get a set $\Omega = \{\omega_{a,b}\}$, for $1 \leq a \leq n$ and $1 \leq b \leq m$, where $\omega_{a,b} \subseteq \Lambda$ is the set of labels that the judge $J_a$ assigned to the item $I_b$.

Note that $\omega_{a,b}$ is not a multiset, so every label appears in $\omega_{a,b}$ at most once. Moreover, notice that multi-evaluations are now allowed, that is, a judge $J_a$ may associate more than one label to an item. This translates to the fact that $\omega_{a,b}$ may be empty (meaning that $J_a$ did not assign any label to $I_b$), it may have a single element (meaning that $J_a$ assigned only one label) or it may have more than one element (meaning that $J_a$ chose several labels for $I_b$).

From the collection of responses $\Omega$, we can count the number of observed pairs of responses. For that, fix $1 \leq i, j \leq t$ and set

\[ o_{i,j} = \left| \left\{ (a, a', b) \;:\; a \neq a',\ l_i \in \omega_{a,b} \text{ and } l_j \in \omega_{a',b} \right\} \right|. \]

In other words, $o_{i,j}$ counts the number of (ordered) pairs of responses of the form $(\omega_{a,b}, \omega_{a',b})$ that two different judges $J_a$ and $J_{a'}$ gave to the same item $I_b$ and such that $J_a$ included $l_i$ in their response and $J_{a'}$ included $l_j$ in theirs. In the notation of Section 2.4, in the case that $n = 2$ (two judges) we have that $o_{i,j} = c_{i,j} + c_{j,i}$.

Remark 3.1.

Suppose that there exists an item $I_b$ that was judged by a single judge, say $J_a$. The other judges, $J_{a'}$ for $a' \neq a$, did not vote on it, so $\omega_{a',b} = \emptyset$. Then, this item makes no contribution to the calculation of $o_{i,j}$, since there is no other judgement to which $\omega_{a,b}$ can be paired. Hence, from the point of view of Krippendorff’s $\alpha$, $I_b$ is not taken into account. This causes some strange behaviours in the coefficients of Section 4 that may seem counterintuitive.

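From these counts, we construct the matrix of observed coincidences as $M_o = (o_{i,j})_{i,j=1}^{t}$. By its very construction, $M_o$ is a symmetric matrix. From this matrix, we set $o_k = \sum_{j=1}^{t} o_{k,j}$, which counts the ordered pairs whose first response contains the label $l_k$; when two judges assign a single label to every item, it equals the total number of times that the label $l_k$ was assigned by any judge. Observe that $o = \sum_{k=1}^{t} o_k$ is the total number of paired judgments. In the case of two judges who evaluate each item with a single non-empty label, we have $o = 2m$, the total number of judgments, as in Section 2.6. On the other hand, we can construct the matrix of expected coincidences, $M_e = (e_{i,j})_{i,j=1}^{t}$, where

\[ e_{i,j} = \begin{cases} \dfrac{o_i\,(o_i - 1)}{o - 1} & \text{if } i = j, \\[2ex] \dfrac{o_i\, o_j}{o - 1} & \text{if } i \neq j. \end{cases} \]

The value of $e_{i,j}$ might be thought of as the average number of times that we expect to find a pair $(l_i, l_j)$, when the frequency of the label $l_i$ is estimated from the sample as $o_i / o$. It is analogous to the value of the proportion $p_i$ in Section 2.5. Again, $M_e$ is a symmetric matrix.

Finally, let us fix a pseudo-metric $\delta: \Lambda \times \Lambda \to [0, \infty)$, i.e. a symmetric function satisfying the triangle inequality and with $\delta(l_i, l_i) = 0$ for any $l_i \in \Lambda$ (recall that this is only a pseudo-metric since different labels at distance zero are allowed). This metric is given by the semantics of the analyzed problem and, thus, it is part of the data used for quantifying the agreement. The value $\delta(l_i, l_j)$ should be seen as a measure of how similar the labels $l_i$ and $l_j$ are. A common choice for this metric is the so-called discrete metric, given by $\delta(l_i, l_j) = 0$ if $i = j$ and $\delta(l_i, l_j) = 1$ otherwise. The discrete metric means that all the labels are equally separated and is the one that will be used throughout this paper. For subtler metrics that may be used for extracting more semantic information from the data, see [Krippendorff:2016].

From these computations, we define the observed disagreement, $D_o$, and the expected disagreement, $D_e$, as

\[ D_o = \frac{1}{o} \sum_{i=1}^{t} \sum_{j=1}^{t} o_{i,j}\, \delta(l_i, l_j), \qquad D_e = \frac{1}{o} \sum_{i=1}^{t} \sum_{j=1}^{t} e_{i,j}\, \delta(l_i, l_j). \tag{1} \]

These quantities measure the degree of disagreement that is observed from $\Omega$ and the degree of disagreement that might be expected by judging randomly (i.e. by chance), respectively.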

Remark 3.2.

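In the case of taking $\delta$ as the discrete metric, we have another interpretation of the disagreement. Observe that, in this case, since $\delta(l_i, l_i) = 0$, we can write the disagreements as

\[ D_o = \frac{1}{o} \sum_{i \neq j} o_{i,j} = 1 - \frac{1}{o} \sum_{i=1}^{t} o_{i,i}, \qquad D_e = \frac{1}{o} \sum_{i \neq j} e_{i,j} = 1 - \frac{1}{o} \sum_{i=1}^{t} e_{i,i}. \]

The quantity $P_o = \frac{1}{o} \sum_{i=1}^{t} o_{i,i}$ (resp. $P_e = \frac{1}{o} \sum_{i=1}^{t} e_{i,i}$) can be understood as the observed (resp. expected) agreement between the judges. In the same vein, $\frac{1}{o} \sum_{i,j=1}^{t} o_{i,j} = 1$ may be seen as the maximum achievable agreement. Hence, in this context, the disagreement $D_o$ (resp. $D_e$) is actually the difference between the maximum possible agreement and the observed (resp. expected) agreement.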

From these data, Krippendorff’s $\alpha$ coefficient is defined as

\[ \alpha = 1 - \frac{D_o}{D_e}. \]
From this formula, observe the following limiting values:

  • $\alpha = 1$ is equivalent to $D_o = 0$ or, in other words, it means that there exists perfect agreement in the judgements among the judges.

  • $\alpha = 0$ is equivalent to $D_o = D_e$, which means that the agreement observed between the judgements is entirely due to chance.

In this way, Krippendorff’s $\alpha$ can be interpreted as a measure of the degree of agreement that is achieved beyond chance. The bigger $\alpha$ is, the better the agreement observed. A common rule of thumb in the literature [Krippendorff:2018] is that $\alpha \geq 0.667$ is the minimal threshold required for drawing conclusions from the data. For $\alpha \geq 0.80$, we can consider that there exists statistical evidence of reliability in the evaluations. Apart from these considerations, there are doubts in the community that more partitioned interpretations, like the one of Landis & Koch of Table 4, are valid in this context (see [Krippendorff:2018]).

Remark 3.3.

Observe that $\alpha < 0$ may only be achieved if $D_o > D_e$, which means that there is even more disagreement than could be expected by chance. This implies that the judges are, consistently, issuing different judgements for the same items. Thus, it is evidence that there exists an agreement between the judges not to agree, that is, to fake the evaluations. On the other hand, as long as the metric is non-negative, $D_o \geq 0$ and, thus, $\alpha \leq 1$.
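The definitions of this section can be put together in a short Python sketch. It is only an illustration: the function name, the data layout (`responses[b][a]` is the set of labels that judge $a$ assigned to item $b$) and the default discrete metric are our assumptions, and the coincidences are plain counts of ordered pairs of responses from different judges. The example data are the decisions of the running example, reconstructed from Tables 5 and 6.

```python
from itertools import permutations

def universal_alpha(responses, labels, delta=lambda x, y: 0.0 if x == y else 1.0):
    """Sketch of the universal alpha; responses[b][a] = set of labels of judge a on item b.

    Raises ZeroDivisionError when the expected disagreement is zero.
    """
    # Observed coincidences: ordered pairs of responses of two different judges.
    o = {(i, j): 0 for i in labels for j in labels}
    for item in responses:
        for first, second in permutations(item, 2):
            for i in first:
                for j in second:
                    o[(i, j)] += 1
    marginal = {i: sum(o[(i, j)] for j in labels) for i in labels}
    total = sum(marginal.values())
    # Expected coincidences estimated from the marginals.
    e = {(i, j): marginal[i] * (marginal[j] - (i == j)) / (total - 1)
         for i in labels for j in labels}
    d_o = sum(o[pair] * delta(*pair) for pair in o) / total
    d_e = sum(e[pair] * delta(*pair) for pair in e) / total
    return 1 - d_o / d_e

# Running example: two judges, one label per item (reconstructed data, see above).
J1 = "NNNNNNYNNYYNNNY"
J2 = "YYNNNYYYNYYNYNY"
responses = [[{x}, {y}] for x, y in zip(J1, J2)]
alpha = universal_alpha(responses, labels=["Y", "N"])
print(f"alpha = {alpha:.2f}")  # prints 0.34

# An item judged by a single judge cannot be paired and does not change the value.
alpha_extra = universal_alpha(responses + [[{"Y"}, set()]], labels=["Y", "N"])
```

The last two lines illustrate Remark 3.1: the extra item, voted on by only one judge, contributes no pair of responses, so the coefficient is unchanged.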

4. Theoretical framework: Semantic domains and variants of the $\alpha$ coefficient

The ideal setting, as described in Sections 2 and 3, might be too restrictive for the purposes of content analysis (particularly, as applied by the Atlas.ti software [Atlas:2019]). In this section, we describe a more general framework that enables a more complex and detailed analysis, but that also leads to more reliability issues to be measured. For that purpose, several variants of Krippendorff’s α have been proposed in the literature (up to 10 are mentioned in [Krippendorff:2018, Section 12.2.3]). In this vein, we explain some of these variants and how they can be reduced to the universal α coefficient of Section 3 after an algorithmic translation by re-labeling codes. As with Section 3, this framework is an extension of the authors’ work [devops].

To be precise, in content analysis we usually need to consider a two-layer setting as follows. First, we have a collection of semantic domains. A semantic domain defines a space of distinct concepts that share a common meaning (say, a domain might be colors, brands, feelings…). Subsequently, each semantic domain embraces mutually exclusive concepts, each indicated by a code. Hence, each domain decomposes into a number of codes. For design consistency, these semantic domains must be logically and conceptually independent. This principle translates into the fact that no code is shared between different semantic domains, and that two codes within the same semantic domain cannot be applied at the same time by a judge.

Now, the data under analysis (e.g. scientific literature, newspapers, videos, interviews) is chopped into items, which in this context are known as quotations, that represent meaningful parts of the data on their own. The decomposition may be decided by each of the judges (so different judges may have different quotations) or it may be pre-established (for instance, by the codebook creator or the designer of the ICA study). In the latter case, all the judges share the same quotations, so they cannot modify their limits and they should evaluate each quotation as a block. In order to lighten the notation, we will suppose that we are dealing with this case of pre-established quotations. Indeed, from a mathematical point of view, the former case can be reduced to this version by refining the data division of each judge to get a common decomposition into the same pieces. Therefore, we will suppose that the data is previously decomposed into items or quotations. Observe that the union of all the quotations must be the whole matter so, in particular, irrelevant matter is also included as quotations.

Now, each of the judges evaluates each of the quotations, assigning to it any number of semantic domains and, for each chosen semantic domain, one and only one code. No semantic domain may be assigned in the case that the judge considers that the quotation is irrelevant matter, and several domains can be applied to the same quotation by the same judge. Hence, as a byproduct of the evaluation process, we obtain a collection of sets, one for each judge and each quotation, namely the set of codes that the judge assigned to the quotation. The exclusion principle of the codes within a semantic domain means that, in each of these sets, the collection of chosen semantic domains contains no repetitions.
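The evaluation sets and the exclusion principle described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the dictionary layout, the judge names (`J1`, `J2`), the domain and code identifiers, and the function name `exclusion_violations` are all hypothetical and do not come from the paper or from Atlas.ti.

```python
from collections import Counter

# Hypothetical representation (an assumption for illustration):
# evaluations[(judge, quotation)] is the set of (domain, code) pairs
# that the judge assigned to the quotation.
evaluations = {
    ("J1", 1): {("S1", "C11")},
    ("J1", 2): {("S1", "C12"), ("S2", "C21")},  # several domains are allowed
    ("J2", 1): {("S1", "C11")},
    ("J2", 2): set(),                           # irrelevant matter: nothing assigned
}

def exclusion_violations(evaluations):
    """Return (judge, quotation, domain) triples where a judge applied
    more than one code of the same semantic domain to one quotation."""
    violations = []
    for (judge, quotation), codes in evaluations.items():
        per_domain = Counter(domain for domain, _code in codes)
        violations += [(judge, quotation, d) for d, k in per_domain.items() if k > 1]
    return sorted(violations)

print(exclusion_violations(evaluations))  # []
```

An empty result means that, in every evaluation set, the chosen semantic domains contain no repetitions.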

Remark 4.1.

To be precise, as proposed in [Krippendorff:1995], when dealing with a continuum of matter each of the quotations must be weighted by its length in the observed and expected coincidence matrices. This length is defined as the number of atomic units the quotation has (say characters in a text or seconds in a video). In this way, (dis)agreements in long quotations are more significant than (dis)agreements in short quotations. This can be easily incorporated into our setting just by refining the data decomposition to the level of units. In this way, we create new quotations having the length of an atomic unit. Each new atomic quotation is judged with the same evaluations as the old bigger quotation. In the coefficients introduced below, this idea has the mathematical effect that, in the sums of Equation (1), each old quotation appears as many times as the number of atomic units it contains, which is the length of such quotation. Therefore, in this manner, the version explained here computes the same coefficient as in [Krippendorff:1995].
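The refinement to atomic units can be illustrated with a minimal Python sketch. The lengths and labels are invented, and the helper names `expand_to_units` and `disagreement_rate` are hypothetical. The sketch only shows how the raw fraction of disagreeing units changes once every quotation is repeated once per atomic unit. It does not compute the α coefficient itself.

```python
def expand_to_units(labels, lengths):
    """Repeat each quotation's label once per atomic unit it contains."""
    return [label for label, n in zip(labels, lengths) for _ in range(n)]

def disagreement_rate(a, b):
    """Fraction of positions where two judges issued different labels."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

lengths = [10, 200]   # hypothetical quotation lengths, e.g. in characters
judge_1 = [1, 1]
judge_2 = [1, 0]      # the judges disagree on the long quotation only

item_level = disagreement_rate(judge_1, judge_2)
unit_level = disagreement_rate(expand_to_units(judge_1, lengths),
                               expand_to_units(judge_2, lengths))
print(item_level)            # 0.5
print(round(unit_level, 3))  # 0.952
```

At the level of whole quotations the judges disagree on one item out of two, while at the level of units they disagree on 200 units out of 210, so the disagreement on the long quotation dominates.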

In order to quantify the degree of agreement achieved by the judges in their evaluations, several variants of Krippendorff’s α are proposed in the literature [Krippendorff:2016, Krippendorff:2018]. Some of the most useful for case studies, and the ones implemented in Atlas.ti, are the following variants.

  • The global binary α coefficient: This is a global measure. It quantifies the agreement of the judges when identifying relevant matter (quotations that deserve to be coded) and irrelevant matter (part of the corpus that is not coded).

  • The binary α coefficient per semantic domain: This coefficient is computed on a specific semantic domain. It is a measure of the degree of agreement that the judges achieve when choosing to apply a semantic domain or not.

  • The cu-α coefficient: This coefficient is computed on a semantic domain. It indicates the degree of agreement to which coders identify codes within that domain.

  • The Cu-α coefficient: This is a global measure of the goodness of the partition into semantic domains. It measures the degree of reliability in the decision of applying the different semantic domains, independently of the chosen code.

Before diving into the detailed formulation, let us work out an illustrative example. Figure 1 shows an example of the use of these coefficients. Let us consider three semantic domains, each with its respective codes.

The two judges assign codes to four quotations as shown in Figure 1(a). We created a graphical metaphor so that each coder/judge, each semantic domain, and each code are represented as shown in Figure 1(b). Each coder is represented by a shape, so that one judge is represented by triangles and the other by circles. Each domain is represented by a colour: red, blue or green. Each code within the same semantic domain is represented by a fill, either solid or dashed.

Figure 1. Illustrative example for coefficients

The binary α coefficient is calculated per domain (i.e. red, blue, green) and analyzes whether or not the coders assigned a domain, independently of the code, to the quotations (see Figure 1(c)). Notice that we only focus on the presence or absence of a semantic domain by quotation, so Figure 1(c) only takes into account the color. Now, this coefficient measures the agreement that the judges achieved in assigning the same color to the same quotation. The bigger the coefficient, the better the agreement. In this way, we get total agreement (α = 1) for the blue domain, as both coders assigned this domain to the second quotation and agreed on its absence in the rest of quotations. On the other hand, the agreement is not total for the red domain, as one judge assigned this domain to quotations 1 and 3 while the other assigned it to quotations 1, 2 and 3, leading to a disagreement in quotation 2.

The cu-α coefficient is also calculated per domain (i.e. red, blue, green), but it measures the agreement attained when applying the codes of that domain. In other words, given a domain, this coefficient analyzes whether the coders assigned the same codes of that domain (i.e. the same fills) to the quotations or not. In this way, as shown in Figure 1(d), it only focuses on the fills applied to each quotation. In particular, observe that cu-α = 1 for the blue domain, since both coders assigned the same code to the second quotation and no code from this domain to the rest of quotations, i.e. total agreement. Also notice that the agreement is only partial for a second domain, as the coders assigned the same code of this domain to the third quotation but they did not assign the same codes of this domain to the rest of quotations. Finally, observe that cu-α for the remaining domain is very small (near to zero) since the judges achieve no agreement on the chosen codes.

With respect to the global coefficients, the Cu-α coefficient analyzes all the domains as a whole, but it does not take into account the codes within each domain. In this way, in Figure 1(e), we colour each segment with the colors corresponding to the applied semantic domains (regardless of the particular code used). From this chromatic representation, Cu-α measures the agreement in applying these colours globally between the coders. In particular, notice that the agreement is not total, as both coders assigned the same domain to the first quotation and the same two domains to the third quotation, but they did not assign the same domains in the second and fourth quotations. Finally, the global binary α coefficient measures the agreement in the selection of relevant matter, as shown in Figure 1(f). In this case, both judges recognized the first three segments as relevant (they were coded), as highlighted in gray in the figure. However, one judge considered that the fourth quotation was irrelevant (it was not coded), as marked in white, and the other marked it as relevant, in gray. In this way, the agreement measured by the global binary α is not total either.
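To make the presence/absence reasoning of this example concrete, the following Python sketch computes the standard nominal Krippendorff’s α for two judges from its coincidence matrix. It rests on assumptions that are not stated in the text: all four quotations are taken to have the same length, and the function names (`nominal_alpha`, `presence`) are hypothetical. The values it prints follow from those assumptions and need not coincide with the ones reported in Figure 1.

```python
from collections import Counter
from itertools import permutations

def nominal_alpha(units):
    """Krippendorff's alpha for nominal labels.

    `units` holds one tuple of labels per quotation (one label per judge).
    Quotations with fewer than two labels are skipped as unpairable.
    """
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        for a, b in permutations(labels, 2):  # ordered pairs of distinct judges
            coincidences[(a, b)] += 1 / (m - 1)
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        return float("nan")                   # nothing to compare
    expected = sum(totals[a] * totals[b]
                   for a in totals for b in totals if a != b) / (n * (n - 1))
    if expected == 0:
        return float("nan")                   # no variation: alpha is undefined
    observed = sum(w for (a, b), w in coincidences.items() if a != b) / n
    return 1 - observed / expected

def presence(assigned, n_quotations):
    """Binary labels: 1 if the quotation number is in `assigned`, else 0."""
    return [1 if q in assigned else 0 for q in range(1, n_quotations + 1)]

# Red domain: one judge applied it to quotations 1 and 3, the other to 1, 2 and 3.
red = list(zip(presence({1, 3}, 4), presence({1, 2, 3}, 4)))
# Blue domain: both judges applied it to quotation 2 only.
blue = list(zip(presence({2}, 4), presence({2}, 4)))
# Relevant matter: one judge coded quotations 1-3, the other coded all four.
relevant = list(zip(presence({1, 2, 3}, 4), presence({1, 2, 3, 4}, 4)))

print(round(nominal_alpha(red), 3))  # 0.533
print(nominal_alpha(blue))           # 1.0
print(nominal_alpha(relevant))       # 0.0
```

Under these assumptions, a single disagreement out of four quotations yields very different values depending on how frequent each label is, which is precisely the correction for chance agreement that α introduces.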

4.1. The global binary α coefficient

The first variation of Krippendorff’s α coefficient that we consider is the global binary α coefficient. It is a global measure that summarizes the agreement of the judges for recognizing relevant parts of the matter. For computing it, we consider a set of labels having only two labels, that semantically represent ‘recognized as relevant’ (1) and ‘not recognized as relevant’ (0). Hence, we take the set of labels {0, 1}.

Now, using the whole set of evaluations, we create a new labelling as follows. For each judge and each quotation, we set the label to 1 if the judge assigned some code to the quotation (i.e. if the corresponding set of codes is non-empty) and to 0 otherwise (i.e. if the judge did not code the quotation, so that the set of codes is empty). From this set of evaluations, the global binary α coefficient is given as the universal α coefficient of Section 3 computed on this new labelling.

Therefore, the global binary α measures the degree of agreement that the coders achieved when recognizing relevant parts, that is, coded parts, and irrelevant matter. A high value of this coefficient may be interpreted as meaning that the matter is well structured and that it is relatively easy to detect and isolate the relevant parts of information.
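As a minimal sketch of this re-labelling step, assume that the evaluations are stored as a dictionary from (judge, quotation) pairs to sets of (domain, code) pairs. That layout, the identifiers and the function name `relabel_global_binary` are hypothetical choices made for illustration.

```python
def relabel_global_binary(evaluations):
    """Label 1 if the judge applied any code to the quotation, 0 otherwise."""
    return {key: int(bool(codes)) for key, codes in evaluations.items()}

# Hypothetical evaluations: (judge, quotation) -> set of (domain, code) pairs.
evaluations = {
    ("J1", 1): {("S1", "C11")},
    ("J1", 2): set(),                            # judged as irrelevant matter
    ("J2", 1): {("S1", "C12"), ("S2", "C21")},
    ("J2", 2): {("S2", "C21")},
}

labels = relabel_global_binary(evaluations)
print(labels[("J1", 2)], labels[("J2", 2)])  # 0 1
```

The resulting 0/1 labels are the input on which the α coefficient is then computed.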

Remark 4.2.

In many studies (for instance in case studies in software engineering), it is customary that a researcher pre-processes the raw data to be analyzed, say by transcribing it or by loading it into an ICA software like Atlas.ti. In that case, usually this pre-processor selects the parts that must be analyzed and chops the matter into quotations before the judgement process starts. In this way, the coders are required to code these pre-selected parts, so that they no longer chop the matter by themselves and they code all the quotations. Hence, we always get total agreement for this coefficient, since the evaluation protocol forces the coders to consider as relevant matter the parts selected by the pre-processor. Therefore, in these scenarios, the global binary α coefficient is not useful for providing reliability on the evaluations and other coefficients of the α family are required.

4.2. The binary α coefficient per semantic domain

The second variation of Krippendorff’s α coefficient is the so-called binary α coefficient per semantic domain. This is a coefficient that must be computed on a specific semantic domain. Hence, let us fix a semantic domain of the codebook. As above, the set of labels will have only two labels, that semantically represent ‘voted for the domain’ (1) and ‘did not vote for the domain’ (0). Hence, we take the set of labels {0, 1}.

For the assignment of labels to items, the rule is as follows. For each judge and each quotation, we set the label to 1 if the judge assigned some code of the fixed domain to the quotation and to 0 otherwise. Observe that, in particular, the label is 0 if the judge considered that the quotation was irrelevant matter. From this set of evaluations, the binary α coefficient of the domain is given as the universal α coefficient of Section 3 computed on this new labelling.

In this way, this coefficient can be seen as a measure of the degree of agreement that the judges achieved when choosing to apply the semantic domain or not. A high value is interpreted as evidence that the domain is clearly stated, that its boundaries are well-defined and, thus, that the decision of applying it or not is close to deterministic. However, observe that it does not measure the degree of agreement in the application of the different codes within the domain. Hence, it may occur that the boundaries of the domain are clearly defined but the inner codes are not well chosen. This is not a task of the binary α coefficient, but of the cu-α coefficient explained below.
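The per-domain re-labelling can be sketched analogously. As before, the dictionary layout, the identifiers and the function name `relabel_domain_binary` are hypothetical and only serve as an illustration.

```python
def relabel_domain_binary(evaluations, domain):
    """Label 1 if the judge applied some code of `domain` to the quotation, else 0."""
    return {key: int(any(d == domain for d, _code in codes))
            for key, codes in evaluations.items()}

# Hypothetical evaluations: (judge, quotation) -> set of (domain, code) pairs.
evaluations = {
    ("J1", 1): {("S1", "C11")},
    ("J1", 2): set(),                            # irrelevant matter gets label 0
    ("J2", 1): {("S1", "C12"), ("S2", "C21")},   # a different code, but same domain
    ("J2", 2): {("S2", "C21")},
}

labels_s1 = relabel_domain_binary(evaluations, "S1")
print(labels_s1[("J1", 1)], labels_s1[("J2", 1)])  # 1 1
```

Note that the two judges receive the same label for quotation 1 even though they chose different codes of the domain, which is why this labelling says nothing about the agreement on the inner codes.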

Remark 4.3.

By the definition of the binary α coefficient, in line with the implementation in Atlas.ti [Atlas:2019], the irrelevant matter plays a role in the computation. As we mentioned above, all the matter that was evaluated as irrelevant (i.e. was not coded) is labelled with 0. In particular, a large corpus with only a few sparse short quotations may distort the value of the coefficient.

4.3. The cu-α coefficient

Another variation of Krippendorff’s α coefficient is the so-called cu-α coefficient. Like the previous variation, this coefficient is computed per semantic domain, so let us fix one of the semantic domains of the codebook. The collection of labels is now the set of codes contained in this semantic domain.

Semantically, they are labels that represent the codes of the chosen domain. For the assignment of labels to items, the rule is as follows. For each judge and each quotation, the label is the code of the chosen domain that the judge assigned to the item (quotation). Recall that, from the exclusion principle for codes within a semantic domain, the judge applied at most one code from the domain to the quotation. If the judge did not apply any code of the domain to the quotation, no label is assigned. From this set of judgements, the cu-α coefficient of the domain is given as the universal α coefficient of Section 3 computed on this new labelling.

Remark 4.4.

As explained in Remark 3.1, for the computation of the observed and expected coincidence matrices, only items that received at least two evaluations with codes of the chosen domain from two different judges count. In particular, a quotation plays no role in cu-α if it is not evaluated by any judge (irrelevant matter), if it received evaluations for other domains but not for the chosen one (matter that does not correspond to the chosen domain), or if only one judge assigned to it a code from the chosen domain (single-voted). This limitation might seem a bit cumbersome, but it can be explained by arguing that the presence/absence of the domain is already measured by the binary α coefficient, so it would be redundant to take it into account for cu-α too.
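The labelling rule for cu-α and the filtering described in this remark can be sketched as follows. The dictionary layout, the identifiers and the function names `relabel_cu` and `pairable_quotations` are hypothetical, and the use of `None` for a missing label is an implementation choice of this sketch.

```python
def relabel_cu(evaluations, domain):
    """Label = the code of `domain` applied by the judge, or None if there is none."""
    labels = {}
    for key, codes in evaluations.items():
        in_domain = [code for d, code in codes if d == domain]
        if len(in_domain) > 1:
            raise ValueError(f"exclusion principle violated at {key}")
        labels[key] = in_domain[0] if in_domain else None
    return labels

def pairable_quotations(labels):
    """Quotations that received a code of the domain from at least two judges."""
    counts = {}
    for (_judge, quotation), label in labels.items():
        if label is not None:
            counts[quotation] = counts.get(quotation, 0) + 1
    return sorted(q for q, k in counts.items() if k >= 2)

# Hypothetical evaluations: (judge, quotation) -> set of (domain, code) pairs.
evaluations = {
    ("J1", 1): {("S1", "C11")},
    ("J2", 1): {("S1", "C12")},
    ("J1", 2): {("S1", "C11")},
    ("J2", 2): {("S2", "C21")},   # no code of S1: quotation 2 is single-voted for S1
    ("J1", 3): set(),             # irrelevant matter for both judges
    ("J2", 3): set(),
}

labels = relabel_cu(evaluations, "S1")
print(pairable_quotations(labels))  # [1]
```

Only quotation 1 would contribute to the coincidence matrices of the domain in this toy data set, since the other two quotations are single-voted or uncoded.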

4.4. The Cu-α coefficient

The last variation of Krippendorff’s α coefficient that we consider in this study is the so-called Cu-α coefficient. In contrast with the previous coefficients, this is a global measure of the goodness of the partition into semantic domains. Suppose that our codebook determines a collection of semantic domains. In this case, the collection of labels is the set of these semantic domains.

Semantically, they are labels representing the semantic domains of our codebook. We assign labels to items as follows. For each judge and each quotation, we take the codes that the judge assigned to the quotation and we label the quotation with the semantic domains to which these codes belong. In other words, we label the quotation with the labels corresponding to the semantic domains chosen by the judge for this item, independently of the particular code. Observe that this is the first case in which the final evaluation might be multivalued. From this set of judgements, the Cu-α coefficient is given as the universal α coefficient of Section 3 computed on this new labelling.

In this way, Cu-α measures the degree of reliability in the decision of applying the different semantic domains, independently of the particular chosen code. Therefore, it is a global measure that quantifies the logical independence of the semantic domains and the ability of the judges to look at the big picture of the matter, only from the point of view of semantic domains.
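The multivalued re-labelling used for this coefficient can be sketched as below. Only the labelling step is shown, since the coefficient itself requires the universal α of Section 3. The dictionary layout, the identifiers and the function name `relabel_domains` are hypothetical.

```python
def relabel_domains(evaluations):
    """Label each evaluation with the set of semantic domains that were applied."""
    return {key: frozenset(domain for domain, _code in codes)
            for key, codes in evaluations.items()}

# Hypothetical evaluations: (judge, quotation) -> set of (domain, code) pairs.
evaluations = {
    ("J1", 1): {("S1", "C11"), ("S2", "C22")},
    ("J2", 1): {("S1", "C12"), ("S2", "C21")},   # same domains, different codes
    ("J1", 2): {("S3", "C31")},
    ("J2", 2): set(),
}

labels = relabel_domains(evaluations)
print(labels[("J1", 1)] == labels[("J2", 1)])  # True
print(sorted(labels[("J1", 1)]))               # ['S1', 'S2']
```

Both judges obtain the same multivalued label for quotation 1 although they disagree on the codes, because only the chosen domains are retained.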

5. Atlas.ti for Inter-Coder Agreement (ICA): a tutorial

In this section, we describe how to use the ICA utilities provided by Atlas.ti v8.4 [Atlas:2019] (from now on, shortened as Atlas), together with a guide for interpreting the obtained results. We will assume that the reader is familiar with the general operation of Atlas (otherwise, a detailed user guide can be found in [friese:2019]) and focus on the computation and evaluation of the different ICA coefficients calculated by Atlas. To some extent, the aim of this section is to extend, and sometimes clarify, the official manual offered by Atlas [friese:2019]. In particular, the explanations below are related to the following sections of the manual:

  • “Measuring Inter-coder Agreement” (pages 7-8)

  • “Methods For Testing ICA” (pages 8-10), with special attention to Krippendorff’s α coefficients.

  • “Calculating An ICA Coefficient” (pages 20-22)

This section is structured as follows. First, in Section 5.1 we describe the different operation methods provided by Atlas for the computation of ICA. In Section 5.2, we briefly describe the protocol for analyzing case studies in software engineering and we introduce a running example on the topic that will serve as a guide throughout the tutorial. Finally, in Section 5.3, we discuss the calculation, interpretation and validity conclusions that can be drawn from the ICA coefficients provided by Atlas.

5.1. Coefficients in Atlas for the computation of ICA

Atlas provides three different methods for computing the agreement between coders, namely simple percent agreement, Holsti Index (both can be checked in Section 2.2), and Krippendorff’s α coefficients (see Sections 2.6 and 4). We can access them by clicking on Analyze > Intercoder Agreement > Agreement Measure. Following this path, we get the menu depicted in Figure 2.

Figure 2. Available methods in Atlas for computing ICA

5.1.1. Simple percent agreement and Holsti Index

As we pointed out in Section 2.2, it is well reported in the literature that simple percent agreement is not a valid ICA measure, since it does not take into account the agreement that the judges can attain by chance. On the other hand, the Holsti Index [holsti1969content], as referred to in Section 2.2, is a variation of the percent agreement that can be applied when there are no pre-defined quotations and each coder selects the matter that they consider relevant. Nevertheless, as in the case of the percent agreement, it ignores the agreement by chance, so it is not suitable for a rigorous analysis. In any case, Atlas provides these measures, which allow us to glance at the results and to get an idea of the distribution of the codes. However, they should not be used for drawing any conclusions about the validity of the coding. For this reason, in this tutorial we will focus on the application and interpretation of Krippendorff’s α coefficients.
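For orientation, both measures can be sketched in a few lines of Python. The sketch assumes the usual textbook definitions: percent agreement as the fraction of items with identical labels, and the Holsti Index as 2M/(N1 + N2), where M is the number of coding decisions shared by both coders and N1, N2 are the numbers of decisions made by each coder. The data and the function names are hypothetical, and the exact computation performed by Atlas may differ.

```python
def percent_agreement(labels_a, labels_b):
    """Fraction of items on which both coders issued the same label."""
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)

def holsti_index(decisions_a, decisions_b):
    """Holsti Index 2M / (N1 + N2) over two sets of coding decisions."""
    shared = len(decisions_a & decisions_b)
    return 2 * shared / (len(decisions_a) + len(decisions_b))

# Hypothetical data: coder A coded segments 1-3, coder B coded segments 1-4.
relevant_a = [1, 1, 1, 0]
relevant_b = [1, 1, 1, 1]
print(percent_agreement(relevant_a, relevant_b))         # 0.75
print(round(holsti_index({1, 2, 3}, {1, 2, 3, 4}), 3))   # 0.857
```

Both values look high even though the only informative decision, whether segment 4 is relevant, was a disagreement. Neither measure discounts the agreement that would be expected by chance.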

5.1.2. Krippendorff’s α in Atlas

Atlas also provides an integrated method for computing Krippendorff’s α coefficients. However, it may be difficult at first sight to interpret the displayed results, since the notation is not fully consistent between Atlas and some reports in the literature. The development of the different versions of the α coefficient has taken around fifty years and, during this time, the notation for its several variants has changed. Now, a variety of notations co-exist in the literature, which may confuse the reader unfamiliar with this ICA measure. In order to clarify these relations, in this paper we always use the notation introduced in Section 4. This notation is based on the one provided by Atlas, but some slight differences can be appreciated. For the convenience of the reader, in Table 9 we include a comparison between the original Krippendorff’s notation, the Atlas notation for the α coefficient and the notation used in this paper.

Name | Krippendorff’s [Krippendorff:2018] | Atlas [Atlas:2019, friese:2019] | This paper
Global binary α | | alpha-binary (global) |
Binary α per semantic domain | | alpha-binary (semantic domain) |
cu-α | | cu-alpha |
Cu-α | | Cu-alpha |

Table 9. Equivalence of notations between the variants of the α coefficient

Remark 5.1.

Empirically, we have discovered that the semantics that Atlas applies for computing these coefficients are the ones explained in this paper, as provided in Section 4. However, to our understanding, this behaviour is not fully consistent with the description provided in the Atlas user’s guide.

5.2. Case study: instilling DevOps culture in software companies

This tutorial uses as a guiding example an excerpt of research conducted by the authors in the domain of DevOps [devops]. The considered example is an exploratory study to characterize the reasons why companies move to DevOps and what results they expect to obtain when adopting the DevOps culture [leite2019survey]. This exploratory case study is based on interviews with software practitioners from 30 multinational software-intensive companies. The study has been conducted according to the guidelines for performing qualitative research in software engineering proposed by Wohlin et al. [wohlin2012experimentation]. In Figure 3 we show, through a UML activity diagram [rumbaugh1999unified], the different stages that comprise the above-mentioned study. For the sake of completeness, in the following exposition each step of the analysis is accompanied by a brief explanation of the underlying qualitative research methodology that the authors carried out. For a complete description of the methodology in qualitative research and thematic analysis, please check the aforementioned references.

Figure 3. Phases for conducting case study research involving qualitative data analysis in software engineering

5.2.1. Set research objectives

The first step needed for conducting an exploratory study is to define the aim of the prospective work, the so-called research questions (RQ). These objectives must be clearly stated and the boundaries of each research question should be unambiguously demarcated. On the other hand, a research study is a flexible process, in which many variables and the center of attention of the research should be open to changes. In this way, it is important to state research questions that are broad enough, so that they do not tightly restrict the focus of the research and provide enough room for the researcher to analyze different aspects. In the case study of the running example presented in this paper, we propose two research questions related to the implications of instilling a DevOps culture in a company, which is the main concern of the analysis. These are the following:

  • RQ1: What problems do companies try to solve by implementing DevOps?

  • RQ2: What results do companies try to achieve by implementing DevOps?

5.2.2. Collect data

The next step in the research is to collect the empirical evidence needed for understanding the phenomenon under study. Data is the only window that the researchers have to the object of research, so getting high quality data typically leads to good research. Moreover, in this phase, it is important to gather the data in a structured and methodical way, so that the collected information can be organized for easy random access in the future. This allows researchers to come back to the evidence frequently in order to assess their evaluations. As a rule of thumb, the better the data, the more precise the conclusions that can be drawn.

There are two main methods for collecting information in qualitative analysis, and both are particularly useful in software engineering: questionnaires and interviews [wohlin2012experimentation]. Usually, questionnaires are easier to issue, since they can be provided by email or web pages that can be accessed whenever is preferable for the person in charge of answering them. On the other hand, interviews tend to gather a more complete picture of the phenomenon under study, since there exists an active interaction between interviewer and interviewee, typically face to face. In this way, interviews are usually better suited for case studies, since they allow the researcher to modify the questions to be asked on the fly, in order to emphasize the key points under analysis. As a drawback, typically the number of answers that can be obtained through a questionnaire is much larger than the number of interviews that can be conducted, but the latter usually lead to higher quality data.

In the study considered in this paper, the data collection method was semi-structured interviews with software practitioners of 30 companies. The interviews were conducted face-to-face, using the Spanish language, and the audio was recorded with the permission of the participants, transcribed for the purpose of data analysis, and reviewed by respondents. In the transcripts, the companies were anonymized by assigning them an individual identification number from ID01 to ID30. The full script of the interview is available at the project’s web

5.2.3. Analyze data

This is the most important phase in the study. In this step, the researchers turn the raw data into structured and logically interconnected conclusions. On the other hand, due to its creative component, it is the least straightforward phase in the cycle. To help researchers to analyze the data and to draw the conclusions, there exist several methods for qualitative data analysis that can be followed. In the DevOps exploratory study considered here, the authors conducted a thematic analysis approach [cruzes2011recommended, thomas2008methods]. Thematic analysis is a method for identifying, analyzing, and reporting patterns within the data. For that purpose, the data is chopped into small pieces of information, the quotations or segments, that are minimal units of data. Then, some individuals (typically some of the researchers) act as judges, codifying the segments to highlight the relevant information and to assign it a condensed description, the code. In the literature, codes are defined as “descriptive labels that are applied to segments of text from each study” [cruzes2011recommended]. In order to ease the task of the coders, the codes can be grouped into bigger categories that share some higher level characteristics, forming the semantic domains (also known as themes in this context). This introduces a multi-level codification that usually leads to richer analysis.

A very important point is that the splitting of the matter under study into quotations can be provided by a non-coder individual (typically, the thematic analysis designer), or it can be a task delegated to the coders. In the former case, all the coders work with the same segments, so it is easier to achieve a high level of consensus that leads to high reliability in the results of the analysis. In the latter case, the coders can decide by themselves how to cut the stream of data, so hidden phenomena can be uncovered. However, the cuts may vary from one coder to another, so there exists a high risk of getting codifications that are too diverse to be analyzed under a common framework.

Thematic analysis can be instrumented through Atlas [Atlas:2019, friese:2019], which provides an integrated framework for defining the quotations, codes and semantic domains, as well as for gathering the codifications and computing the attained ICA. In the study considered in this section, the method followed for data analysis consists of the four phases described below (see also Figure 3).

  1. Define quotations & codebook. In the study under consideration, the coders used pre-defined quotations. In this way, once the interviews were transcribed, researcher R1 chopped the data into its unit segments, which remain unalterable during the subsequent phases. In parallel, R1 elaborated a codebook by collecting all the available codes and their aggregation into semantic domains. After completing the codebook, R1 also created a guide with detailed instructions about how to use the codebook and how to apply the codes. The design of the codebook is accomplished through two different approaches: a deductive approach [miles1994qualitative] for creating semantic domains and an inductive approach (grounded theory) [corbin2008techniques] for creating codes. In the first phase, the deductive approach, R1 created a list of semantic domains in which codes will be grouped inductively during the second phase. These initial domains integrate concepts known in the literature. For domains related to RQ1 (problems), each domain is named P01, P02, P03, etc. For domains related to RQ2 (results), each domain is named R01, R02, R03, etc. Domains were written with uppercase letters (see Figure 4).

    Figure 4. Atlas code manager

    In the second phase, the inductive approach, R1 approached the data (i.e. the interviews’ transcriptions) with the research questions RQ1 and RQ2 in mind. R1 reviewed the data line by line and created the quotations. R1 also assigned them a code (new or previously defined) in order to get a comprehensive list of all the needed codes. As more interviews were analyzed, the resulting codebook was refined by using a constant comparison method that forced R1 to go back and forth. Additionally, the codes were complemented with a brief explanation of the concept they describe. This allows R1 to guarantee that the collection of created codes satisfies the requirements imposed by thematic analysis, namely exhaustiveness and mutual exclusiveness. The exhaustiveness requirement means that the codes of the codebook must cover all the relevant aspects for the research. Mutual exclusiveness means that there must exist no overlapping in the semantics of each code within a semantic domain. In this way, the codes of a particular semantic domain must capture disjoint and complementary aspects, which implies that the codes should have explicit boundaries so that they are not interchangeable or redundant. This mutual exclusiveness translates into the fact that, during the codification phase, a coder cannot apply several codes of the same semantic domain to the same quotation. In other words, each coder can apply at most one code of each semantic domain to each quotation. In order to easily detect violations of mutual exclusiveness, Atlas allows the researchers to color the codes of each semantic domain with the same color, so that checking exclusiveness reduces to checking that no more than one code of the same color is assigned to a quotation, see Figure 5.

    Figure 5. Codification in Atlas and mutual exclusiveness
  2. Code. In this phase, the chosen coders (usually researchers different from the codebook designer) analyze the prescribed quotations created during phase (1). For that purpose, they use the codebook as a statement of the available semantic domains and codes, as well as of the definition, scope of application and boundaries of each one. It is crucial for the process that the coders apply the codes exactly as described in the codebook. No modifications on the fly or alternative interpretations are acceptable. Nevertheless, the coders are encouraged to annotate any problem, fuzzy limit or misdefinition they find during the coding process. After the coding process ends, if the coders consider that the codebook was not clear enough or the ICA measured in phase (3) does not reach an acceptable level, the coders and the codebook designer can meet to discuss the problems found. With this information, the codebook designer creates a new codebook and instructions for coding that can be used for a second round of codifications. This iterative process can be conducted as many times as needed until the coders consider that the codebook is precise enough and the ICA measures certify an acceptable amount of reliability. In the case study of ICA considered in this paper, the coding process involved two researchers different from R1, who acted as coders C1 and C2. They coded the matter according to the codebook created by R1.

  3. Calculate ICA. It is quite a common misconception in qualitative research that no numerical calculations can be performed for the study. Qualitative research aims to understand very complex and unstructured phenomena, for which a semantic analysis of the different facets and their variations is required. However, by no means does this imply that no mathematical measures can be obtained for controlling the process. Due to its broad and flexible nature, qualitative research is highly prone to biases in the judgements of the researchers, so it is mandatory to supervise the research through some reliability measures, which are usually numerical [Krippendorff:2018]. In this way, the quantitative approach takes place at a higher level, as a meta-analysis of the conducted process, in order to guarantee mathematical reliability in the drawn conclusions. Only when this formal quality assurance process is satisfactory can researchers trust the conclusions and consider the method sound and complete. Therefore, to avoid biases and be confident that the codes mean the same to anyone who uses them, it is necessary to build that confidence. According to Krippendorff [Krippendorff:2018], reliability grounds this confidence empirically and offers the certainty that research findings can be reproduced. In the presented example about a DevOps case study, we used Inter-Coder Agreement (ICA) analysis techniques for testing the reliability of the obtained codebook. In this way, after coding, another researcher, R4, calculated and interpreted the ICA between C1 and C2. If the coders did not reach an acceptable level of reliability, R1 analyzed the disagreements pointed out by R4 to find out why C1 and C2 had not understood a code in the same way. Using this acquired knowledge, R1 delivered a refined new version of the codebook and the accompanying use instructions. R1 also reviewed the codification of those quotations that led to disagreement between C1 and C2, modifying it according to the new codebook when necessary. Notice that, if a code disappears in the new version of the codebook, it must also disappear from all the quotations to which it was assigned. At this point, C1 and C2 can continue coding on a new subset of interviews’ transcriptions. This process is repeated until the ICA reaches an acceptable level of reliability. In Section 5.3 we provide a detailed explanation of how to compute and interpret ICA coefficients in Atlas.

  4. Synthesize. Once the loop (1)-(2)-(3) has been completed because the ICA measures reached an acceptable threshold, we can rely on the output of the codification process and start drawing conclusions. At this point, there exists a consensus about the meaning, applicability and limits of the codes and semantic domains of the codebook. Using this processed information, this phase aims to provide a description of higher-order themes, a taxonomy, a model, or a theory. The first action is to determine how many times each domain appears in the data in order to estimate its relevance (groundedness) and to support the analysis with evidence through quotations from the interviews. After that, the co-occurrence table between semantic domains should be computed, that is, the table that collects the number of times a semantic domain appears jointly with the other domains. With these data, semantic networks can be created in order to portray the relationships between domains (association, causality, etc.) as well as the relationship strength based on co-occurrence. These relations determine the density of the domains, i.e. the number of domains related to each domain. If further information is needed, it is possible to repeat these actions for each code within a domain. In the case study considered in this paper, the co-occurrence tables and semantic networks were computed for each semantic domain related to RQ1 (problems) and RQ2 (results). In addition, for the most grounded codes of each research question, this analysis was also repeated to strengthen the conclusions. Finally, problems and results were analyzed by case (organization) and the correlation relationships between problems and results were discussed, i.e. interconnecting categories. All these synthesis actions are not the main focus of this paper, so we will not describe them further. For more information and techniques, please refer to [wohlin2012experimentation].
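The counting actions of the synthesis phase can be sketched in Python as follows. The coded quotations are invented, the domain names only mimic the P01/R01 naming scheme of the codebook, and the function names are hypothetical. In particular, deriving the density from co-occurrence alone is a simplifying assumption of this sketch, since in the study the relations are established in the semantic networks.

```python
from collections import Counter
from itertools import combinations

# Hypothetical consensus coding: quotation -> set of semantic domains applied.
coded = {
    "q1": {"P01", "R02"},
    "q2": {"P01"},
    "q3": {"P01", "P03", "R02"},
    "q4": set(),
}

def groundedness(coded):
    """Number of quotations in which each domain appears."""
    return Counter(d for domains in coded.values() for d in domains)

def cooccurrence(coded):
    """Number of quotations in which each pair of domains appears jointly."""
    table = Counter()
    for domains in coded.values():
        for pair in combinations(sorted(domains), 2):
            table[pair] += 1
    return table

def density(table):
    """Number of distinct domains that each domain co-occurs with."""
    neighbours = {}
    for a, b in table:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    return {d: len(others) for d, others in neighbours.items()}

print(groundedness(coded)["P01"])           # 3
print(cooccurrence(coded)[("P01", "R02")])  # 2
print(density(cooccurrence(coded))["P03"])  # 2
```

The same functions can be applied to codes instead of domains if a finer analysis within a domain is needed.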

5.2.4. Perform validation analysis

As a final step, it is necessary to discuss to what extent the obtained analysis and the drawn conclusions are valid, as well as the threats to validity that may jeopardize the study. In the words of Wohlin [wohlin2012experimentation], “The validity of a study denotes the trustworthiness of the results, and to what extent the results are true and not biased by the researchers’ subjective point of view”. There are several strategies for approaching the validity analysis of the procedure. In the aforementioned case study, we followed the methodology suggested by Creswell & Creswell [creswell2017research] to improve the validity of exploratory case studies, namely data triangulation, member checking, rich description, clarifying bias, and reporting discrepant information. Most of these methods are out of the scope of this paper and are not described further (for more information, check [creswell2017research, wohlin2012experimentation]). We mainly focus on reducing author bias by evaluating, through ICA analysis, the reliability and consistency of the codebook on which the study findings are based.

5.3. ICA calculation

This section describes how to perform, using Atlas, the ICA analysis required to assess the validity of the exploratory study described in Section 5.2. For this purpose, we use the theoretical framework developed in Section 4 regarding the different variants of Krippendorff’s α coefficient. In this way, we will monitor the evolution of the coefficients along the codification process in order to ensure that it reaches an acceptable threshold of reliability, as mentioned in Section 5.2.3. Nevertheless, before starting the codification/evaluation protocol, it is worth considering two important methodological aspects, as described below.

  1. The number of coders. Undoubtedly, the higher the number of coders involved, the richer the codification process. Krippendorff’s α coefficients can be applied to an arbitrary number of coders, so there is no intrinsic limitation to this number. On the other hand, a high number of coders may introduce too many different interpretations, which may make it difficult to reach an agreement. Hence, it is important to find a fair balance between the number of coders and the time needed to reach agreement. For that purpose, it may be useful to take into account the number of interviews to be analyzed, their length and the resulting total amount of quotations. In the case study analyzed in this section, two coders, C1 and C2, were considered for coding 30 interviews.

  2. The extent of the codification/evaluation loop. A first approach to the data analysis process would be to let the coders codify the whole corpus of interviews, and to obtain a resulting ICA measure when the codification is completed. However, if the obtained ICA is below the acceptable threshold (say ), the only solution is to refine the codebook and to re-codify the whole corpus. This is a slow and repetitive protocol that can lead to intrinsic deviations in the subsequent codifications due to cognitive biases in the coders. Hence, it is more convenient to follow an iterative approach that avoids these problems and speeds up the process. In this approach, the ICA coefficient is monitored on several partially completed codifications. To be precise, the designer of the case study splits the interviews into several subsets. The coders process the first subset and, after that, the ICA coefficients are computed. If this value is below the threshold of acceptance (), there exists a disagreement between the judges when applying the codebook. At this point, the designer can use the partial codification to detect the problematic codes and to offer a refined version of the codebook and the accompanying instructions. Of course, after this revision, the previously coded matter should be updated with the new codes. With this new codebook, the coders can face the next subset of interviews, in the expectation that the newer version of the codebook will decrease the disagreement. This drastically reduces the number of complete codifications needed to achieve an acceptable agreement. In the case study considered as example, the first batch comprised the first 19 interviews (ID01 to ID19). The attained ICA was unsatisfactory, so the codebook designer R1 reviewed the codebook and released a new version. With the updated codebook, the coders codified the remaining 11 interviews (ID20 to ID30) and, this time, the obtained ICA passed the acceptance threshold, which evidences a high level of reliability in the evaluations. As a final remark, it is not advisable to replace the judges during this iterative process. Although Krippendorff’s α allows judges to be exchanged, the new judges may not share the same vision of the codebook and expertise with it, which may require rolling back to previous versions.
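The iterative protocol described above can be sketched as a simple loop over batches of interviews. This is only an outline under stated assumptions: the function `iterative_coding`, the callable `code_and_measure` and the ICA values are hypothetical, and the default threshold of 0.8 is a commonly cited reference value for Krippendorff’s α rather than a figure taken from the case study.

```python
from typing import Callable, List, Sequence, Tuple

def iterative_coding(
    batches: Sequence[Sequence[str]],
    code_and_measure: Callable[[Sequence[str], int], float],
    threshold: float = 0.8,  # assumption: not taken from the case study
) -> Tuple[int, List[float]]:
    """Process batches of interviews until one of them reaches the threshold.

    Returns the final codebook version and the ICA value of each round.
    """
    version = 1
    history: List[float] = []
    for batch in batches:
        # The coders codify the batch with the current codebook version
        # and the ICA coefficient is computed on that batch.
        ica = code_and_measure(batch, version)
        history.append(ica)
        if ica >= threshold:
            break
        # Disagreement: the designer refines the codebook (and updates the
        # material already coded) before the next batch is processed.
        version += 1
    return version, history

# Toy run mirroring the two rounds of the case study (ID01-ID19, ID20-ID30).
batches = [
    [f"ID{i:02d}" for i in range(1, 20)],
    [f"ID{i:02d}" for i in range(20, 31)],
]
# Made-up ICA values per codebook version, for illustration only.
fake_ica = {1: 0.55, 2: 0.85}
version, history = iterative_coding(batches, lambda batch, v: fake_ica[v])
```

In a real study, `code_and_measure` would stand for the manual coding by the judges followed by the ICA computation in Atlas.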

Now, we present the calculation and interpretation of each of the four coefficients mentioned in Section 4. In order to emphasize the methodological aspects of the process, we focus only on research question RQ1 (problems) and we choose some illustrative instances for each of the two rounds of the codification/evaluation protocol. The first step of the ICA analysis is to create an Atlas project and to introduce the semantic domains and codes compiled in the codebook. In addition, all the interviews (and their respective quotations, previously defined by R1) should be loaded as separate documents and all the codifications performed by the coders should be integrated in the Atlas project. This is a straightforward process that is detailed in the Atlas user manual [friese:2019]. To illustrate our case study, we have a project containing the codifications of the two coders for the first 19 interviews. The codebook has 10 semantic domains and 35 codes (Figure 6). Observe that Atlas reports a total of 45 codes, since it treats semantic domains as codes (although, for ICA purposes, each one works as an aggregation of codes).

Figure 6. Documents and codes for the Atlas project

In order to activate the ICA computation in an Atlas project, push the Intercoder Agreement button in the Analyze tab. To incorporate the coders that evaluated the interviews, click the Add Coder button (Figure 7). Now, select the coders that evaluated the quotations (Figure 8) and click on the Add Coders button. In the running example, the two researchers acting as coders, Daniel and Jorge, are selected.

Figure 7. Click on the Add Coder button
Figure 8. Select two coders

Next, click on the Add Documents button and select the documents to be analyzed (Figure 9). In the example, we select the first 19 documents (transcribed interviews), which will be analyzed in the first round. Observe that, due to the size of the window, some of the 19 selected items are not visible in Figure 9. Although they are loaded from different documents, for the ICA calculation Atlas juxtaposes them and treats them as a continuum.

Figure 9. Documents selection

Now, we should select the semantic domains we wish to analyze, and the codes within them. To do so, we click on the Add Semantic Domain option and then select the codes of the semantic domain and drag them from the Project explorer into the Add Code field. For example, in Figure 10, the three codes associated with semantic domain P07 have been added.

Figure 10. Coders, documents and domains selected

After adding a semantic domain and its codes, Atlas automatically plots a graphical representation. For each code, this graph is made of as many horizontal lines as coders (two, in our running example), each identified with a small icon on the left. Each line is divided into segments that represent the documents added for analysis. As can be checked in Figure 11, there are two coders (Daniel and Jorge, represented with blue and brown icons respectively) and the semantic domain P07 has three associated codes, so three groups of pairs of horizontal lines are depicted. In addition, since we selected 19 documents for this codification round, the lines are divided into 19 segments (notice that the last one is very short and can barely be seen). Observe that the length of each segment is proportional to the total length of the file.

Figure 11. Codification summary by code, semantic domain and coder

Moreover, on the right of the horizontal lines we find a sequence of numbers organized into two groups separated by a slash. For example, in Figure 12 we can see those numbers for the first code of P07 (problems/lack of collaboration/sync). The left-most group shows the number of quotations to which the corresponding code has been applied by the coder along all the documents, as well as the total length (i.e. the number of characters) of the chosen quotations. In the example of Figure 12, the first coder (Daniel) used the first code twice, and the total length of the chosen quotations is 388, while the second coder (Jorge) used the first code only once, on a quotation of length 81. On the other hand, the right-most group indicates the total length of the analyzed documents (in particular, it is a constant independent of the chosen code, semantic domain or coder). This total length is accompanied by the rate of the coded quotations among the whole corpus. In this example, the total length of the documents to be analyzed is and the coded quotations (with a total length of 388 and 81 respectively) represent the and the of the corpus (rounded to and in the Atlas representation). Recall that these lengths of the coded quotations and of the total corpus play an important role in the computation of the α coefficient, as mentioned in Remark 4.1.

Figure 12. Length information for code P07

Each time a coder uses a code, a small coloured mark is placed at the position of the quotation within the document. The colour of the mark matches the colour assigned to the coder, and its length corresponds to the length of the coded quotation. Due to the short length of the chosen quotations, they can barely be seen in Figure 11, but we can zoom in by choosing the Show Documents Details option in the Atlas interface (Figure 13).

Figure 13. Option for showing the ICA per document

In Figure 14, we can check that, in document ID01, both coders agreed to codify two quotations (of different length) with the second and third codes of P07.

Figure 14. Codified quotations in document ID01

5.3.1. The coefficient

In order to compute this coefficient, click on the Agreement Measure button and select the Krippendorff’s c-Alpha-binary option. As shown in Figure 15, the system returns two values. The first one is the coefficient per semantic domain (P07 in this case, with ) and the second one is a global coefficient of the domains as a whole, which corresponds to what we called as described in Sections 4.1 and 4.2. Since we selected a single semantic domain (P07), both values of and agree. It is worth mentioning that, according to the Atlas user manual [friese:2019], the interpretation of c-Alpha-binary ( in our notation) is “The c-Alpha Binary coefficient is a measure for the reliability of distinguishing relevant from irrelevant matter. It is applicable if the coders have created quotations themselves”. To our understanding, this is a somewhat confusing and not very accurate interpretation of this coefficient, which would be better replaced by the one provided in Section 4.2: “It is a measure of the degree of agreement that the judges achieve when choosing to apply a particular semantic domain or not”. In the case shown in Figure 15, the value of the coefficient is high (), which can be interpreted as evidence that the domain P07 is clearly stated, its boundaries are well-defined and, thus, the decision of applying it or not is nearly deterministic. However, observe that this does not measure the degree of agreement in the application of the different codes within the domain P07. It might occur that the boundaries of the domain P07 are clearly defined but the inner codes are not well chosen. This is not a task of the , but of the coefficient.

Figure 15. Computation of the coefficient

In order to illustrate how Atlas performed the previous computation, let us calculate by hand. For this purpose, we export the information provided by Atlas about the coding process. In order to do so, we click on the Excel Export button, as shown in Figure 16.

Figure 16. Export data for analysing the semantic domain P07

In Figure 17 we show the part of the exported information that is relevant for our analysis. As we can see, there are two coders (Jorge and Daniel) and three codes. The meaning of each column is as follows:

  • Applied*: Number of times the code has been applied.

  • Units*: Number of units to which the code has been applied.

  • Total Units*: Total number of units across all selected documents, voted or not.

  • Total Coverage*: Percentage of coverage in the selected documents

Figure 17. Computation of the coefficient

The length of a quotation (what is called units in Atlas) is expressed in number of characters. From this information, we see that coder Daniel voted 388 units (characters) with the first code of the domain (problems/lack of collaboration/sync) while coder Jorge only voted 81 units with that code. For the other two codes, both judges applied them to 1143 and 403 units, respectively. Indeed, as we will check later, the quotations to which Jorge applied P07 are actually a subset of the ones chosen by Daniel. Hence, Daniel and Jorge achieved perfect agreement when applying the second and third codes of P07, while for the first code Jorge only considered eligible 81 of the 388 units chosen by Daniel. From these data, we can construct the observed coincidence matrix, shown in Table 10, as explained in Section 2.6 (see also Section 3). Recall from Section 4.2 that a label means that the coder voted the quotation with a code of the semantic domain (P07 in this case) and the label means that no code of the domain was applied.

Table 10. Observed coincidences matrix for .

This matrix is computed as follows. The number of units to which the coders assigned any code from domain P07 is 81 + 1143 + 403 = 1627 in the case of Jorge and 388 + 1143 + 403 = 1934 for Daniel. Since the choices of Jorge are a subset of the ones of Daniel, they agreed on 1627 units. Recall that the coincidence matrix counts ordered pairs of votes, so we need to double this contribution to get 2 · 1627 = 3254. On the other hand, Jorge did not apply any code of P07 to units, while Daniel did not apply them to , which means that they agreed not to choose P07 on units. Doubling the contribution, we get . Finally, for the disagreements we find that Daniel applied a code from P07 to 307 units that Jorge did not select, so the corresponding entry of the matrix is 307. Observe that we do not have to double this value, since there is already an implicit order in these votes (Daniel voted and Jorge voted ). From these data, it is straightforward to compute the aggregated quantities , and . In the same vein, we can construct the matrix of expected coincidences, as explained in Section 2.6 (see also Section 3). The values of the expected disagreements are

Analogously, we can compute and . However, they are not actually needed for computing the coefficient, so we will skip them. With these calculations, we finally get that

and, therefore, the coefficient is given by

We note again that the previous calculation is correct because the quotations that Jorge voted with a code of P07 are a subset of those that Daniel selected for domain P07. We can check this claim using Atlas. For that purpose, click on Show Documents Details, as shown in Figure 13, and review how the codes were assigned per document. In the case considered here, Table 11 shows an excerpt of the displayed information. To shorten the notation, the first code of P07 is denoted by 7a, the second one by 7b, and the third one by 7c. In this table, we see that all the voted elements coincide except the last 307 units, corresponding to document ID17, which Daniel codified and Jorge did not.

Document ID | Daniel             | Jorge
ID01        | 7b 1x112, 7c 1x306 | 7b 1x112, 7c 1x306
ID03        | 7b 1x185           | 7b 1x185
ID05        | 7b 1x159           | 7b 1x159
ID10        | 7a 1x81            | 7a 1x81
ID11        | 7b 1x314           | 7b 1x314
            | 7b 1x373, 7c 1x97  | 7b 1x373, 7c 1x97
ID17        | 7a 1x307           |
Table 11. Codified units, per document, for the semantic domain P07.
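The figures of Table 11 can be cross-checked with a short script. The sketch below recomputes the units voted by each coder and then evaluates Krippendorff’s α for two coders and two labels (domain applied or not applied), following the doubling convention for agreements described above. The function names are ours, and `TOTAL_UNITS` is a placeholder: the actual corpus length must be read from the Atlas export, so the resulting value of `alpha` is illustrative only.

```python
# Quotation lengths (in characters), transcribed from Table 11.
# Keys are the short code names 7a, 7b and 7c used in the text.
daniel = {"7a": [81, 307], "7b": [112, 185, 159, 314, 373], "7c": [306, 97]}
jorge = {"7a": [81], "7b": [112, 185, 159, 314, 373], "7c": [306, 97]}

def units(coder):
    """Total number of units to which a coder applied any code of the domain."""
    return sum(sum(lengths) for lengths in coder.values())

def binary_alpha(both, only_one, neither):
    """Krippendorff's alpha for two coders and two labels (applied / not applied).

    both, only_one, neither: number of units on which both coders, exactly
    one coder, or no coder applied the domain.
    """
    o_diff = only_one                   # each off-diagonal observed coincidence
    n_applied = 2 * both + only_one     # marginal total of the 'applied' label
    n_skipped = 2 * neither + only_one  # marginal total of the 'not applied' label
    n = n_applied + n_skipped           # twice the number of units
    e_diff = n_applied * n_skipped / (n - 1)  # expected off-diagonal coincidence
    return 1 - o_diff / e_diff

both = units(jorge)                      # Jorge's units are a subset of Daniel's
only_one = units(daniel) - units(jorge)  # units voted by Daniel only

# Placeholder, NOT the real corpus length of the case study.
TOTAL_UNITS = 100_000
neither = TOTAL_UNITS - units(daniel)
alpha = binary_alpha(both, only_one, neither)
```

Note that the subset relation between the two coders is what allows `both` and `only_one` to be obtained by simple subtraction; for overlapping but non-nested quotations the intersection would have to be computed position by position.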

Another example of calculation of this coefficient is shown in Figure 18. It refers to the computation of the same coefficient, , but in the second round of the coding process (see Section 5.2.3), where 11 documents were analyzed. We focus on this case because it shows a troublesome effect of the coefficient. As shown in the figure, we get a value of . This is an extremely small value, which might even point to a deliberate disagreement between the judges. However, this is not what is happening here; rather, it is an undesirable statistical effect that distorts the result. The point is that, as shown in Figure 18, in round 2 there is only one evaluation from one of the judges that assigns this semantic domain, in contrast with the 17 evaluations obtained in round 1. For this reason, there are not enough data for evaluating this domain in round 2 and, thus, this result can be attributed to statistical outliers. The researcher interested in using Atlas for qualitative research should stay alert to this annoying phenomenon. When Atlas considers that there is not enough statistical evidence (p-value ) or the number of coded quotations is very small, these anomalous values, or even the text (Not Available), can be obtained. In this case, given the few codifications received, the reliability may be assessed by hand.
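To see how such an anomalous value can arise, consider the following minimal sketch, in which a single coder applies the domain to one quotation and the other coder never applies it. The lengths used (a 300-character quotation in a corpus of 50,000 characters) are made up for illustration and are not those of the case study. The function implements Krippendorff’s α for two coders and two labels.

```python
def binary_alpha(both, only_one, neither):
    """Krippendorff's alpha for two coders and two labels (applied / not applied)."""
    n_applied = 2 * both + only_one     # marginal total of the 'applied' label
    n_skipped = 2 * neither + only_one  # marginal total of the 'not applied' label
    n = n_applied + n_skipped           # twice the number of units
    expected = n_applied * n_skipped / (n - 1)
    return 1 - only_one / expected

# Hypothetical round: one quotation coded by one judge only.
quotation, total = 300, 50_000
alpha = binary_alpha(both=0, only_one=quotation, neither=total - quotation)
```

With these inputs, `alpha` is slightly below zero although the judges coincide on almost the whole corpus: a single unmatched quotation accounts for all the observed disagreement, and the expected disagreement is nearly as small.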

Figure 18. Computation of the coefficient for the domain P07 in the second round

5.3.2. The coefficient

As we mentioned in Section 4.1, the coefficient allows researchers to measure the degree of agreement that the judges reached when distinguishing relevant from irrelevant matter. In this way, is only useful if each coder segments the matter on their own to select the relevant information to code. On the other hand, if the codebook designer pre-defines the quotations to be evaluated, this coefficient is no longer useful since it always attains the value