1 Introduction
We develop a model-based alternative to Krippendorff’s α (Hayes and Krippendorff, 2007), a well-known nonparametric measure of agreement. In keeping with the naming convention that is evident in the literature on agreement (e.g., Spearman’s ρ, Cohen’s κ, Scott’s π), we call our approach Sklar’s ω. Although Krippendorff’s α is intuitive, flexible, and subsumes a number of other coefficients of agreement, we will argue that Sklar’s ω improves upon α in (at least) the following ways. Sklar’s ω

permits practitioners to simultaneously assess intra-coder agreement, inter-coder agreement, agreement with a gold standard, and, in the context of multiple scoring methods, inter-method agreement;

identifies the above-mentioned types of agreement with intuitive, well-defined population parameters;

can accommodate any number of coders, any number of methods, any number of replications (per coder and/or per method), and missing values;

allows practitioners to use regression analysis to reveal important predictors of agreement (e.g., coder experience level, or time effects such as learning and fatigue);

provides complete inference, i.e., point estimation, interval estimation, diagnostics, model selection; and

performs more robustly in the presence of unusual coders, units, or scores.
The rest of this article is organized as follows. In Section 2 we present an overview of the agreement problem and state our assumptions. In Section 3 we present three example applications of both Sklar’s ω and Krippendorff’s α. These case studies showcase various advantages of our methodology. In Section 4 we specify the flexible, fully parametric statistical model upon which Sklar’s ω is based. In Section 5 we describe four approaches to frequentist inference for ω, namely, maximum likelihood, distributional-transform approximation, and composite marginal likelihood; we also consider a two-stage semiparametric method that first estimates the marginal distribution nonparametrically and then estimates the copula parameter(s) by conditional maximum likelihood. In Section 6 we use an extensive simulation study to assess the performance of Sklar’s ω relative to Krippendorff’s α. In Section 7 we briefly describe our open-source R (Ihaka and Gentleman, 1996) package, sklarsomega, which is available for download from the Comprehensive R Archive Network (R Core Team, 2018). Finally, in Section 8 we point out potential limitations of our methodology and posit directions for future research on the statistical measurement of agreement.
2 Measuring agreement
We feel it necessary to define the problem we aim to solve, for the literature on agreement contains two broad classes of methods. Methods in the first class seek to measure agreement while also explaining disagreement by, for example, assuming differences among coders (as in Aravind and Nawarathna (2017)). Although our approach permits one to use regression to explain systematic variation away from a gold standard, we are not, in general, interested in explaining disagreement. Our methodology is for measuring agreement, and so we do not typically accommodate (i.e., model) disagreement. For example, we assume that coders are exchangeable (unless multiple scoring methods are being considered, in which case we assume coder exchangeability within each method). This modeling orientation allows disagreement to count fully against agreement, as desired.
Although our understanding of the agreement problem aligns with that of Krippendorff’s α and other related measures, we adopt a subtler interpretation of the results. According to Krippendorff (2012), social scientists often feel justified in relying on data for which agreement is at or above 0.8, drawing tentative conclusions from data for which agreement is at or above 2/3 but less than 0.8, and discarding data for which agreement is less than 2/3. We use the interpretations shown in Table 1 instead, and suggest—as do Krippendorff and others (Artstein and Poesio, 2008; Landis and Koch, 1977)—that an appropriate reliability threshold may be context dependent.
Range of Agreement     Interpretation
0.0 ≤ ω ≤ 0.2          Slight Agreement
0.2 < ω ≤ 0.4          Fair Agreement
0.4 < ω ≤ 0.6          Moderate Agreement
0.6 < ω ≤ 0.8          Substantial Agreement
0.8 < ω ≤ 1.0          Near-Perfect Agreement
3 Case studies
In this section we present three case studies that highlight some of the various ways in which Sklar’s ω can improve upon Krippendorff’s α. The first example involves nominal data, the second example interval data, and the third example ordinal data.
3.1 Nominal data analyzed previously by Krippendorff
Consider the following data, which appear in (Krippendorff, 2013). These are nominal values (in {1, …, 5}) for twelve units and four coders. The dots represent missing values.
1  2  3  3  2  1  4  1  2  .  .  .
1  2  3  3  2  2  4  1  2  5  .  3
.  3  3  3  2  3  4  2  2  5  1  .
1  2  3  3  2  4  4  1  2  5  1  .
Note that all columns save the sixth are constant or nearly so. This suggests near-perfect agreement, yet a Krippendorff’s α analysis of these data leads to a weaker conclusion. Specifically, using the discrete metric yields α̂ = 0.74 and bootstrap 95% confidence interval (0.39, 1.00). (We used a bootstrap sample size of b = 1,000, which yielded Monte Carlo standard errors (MCSE) (Flegal et al., 2008) smaller than 0.001.) This point estimate indicates merely substantial agreement, and the interval implies that these data are consistent with agreement ranging from moderate to nearly perfect. Our method produces ω̂ = 0.89 and a bootstrap interval (b = 1,000; MCSEs smaller than 0.004) that indicate near-perfect agreement and at least substantial agreement, respectively. And our approach, being model based, furnishes us with estimated probabilities for the marginal categorical distribution of the response: p̂ = (0.25, 0.24, 0.23, 0.19, 0.09)′. Because we estimated p and ω simultaneously, our estimate of p differs substantially from the empirical probabilities, which are 0.22, 0.32, 0.27, 0.12, and 0.07, respectively.
The marked difference in these results can be attributed largely to the codes for the sixth unit. The relevant influence statistics are the changes α̂ − α̂[ , −6] and ω̂ − ω̂[ , −6], where the notation “[ , −6]” indicates that all rows are retained and column 6 is left out. And so we see that column 6 exerts 2/3 more influence on α̂ than it does on ω̂. Since the influence on α̂ is greater, inclusion of column 6 draws us further away from what seems to be the correct conclusion for these data.
3.2 Interval data from an imaging study of hip cartilage
The data for this example, some of which appear in Figure 2, are 323 pairs of T2* relaxation times (a magnetic resonance quantity) for femoral cartilage (Nissi et al., 2015) in patients with femoroacetabular impingement (Figure 3), a hip condition that can lead to osteoarthritis. One measurement was taken when a contrast agent was present in the tissue, and the other measurement was taken in the absence of the agent. The aim of the study was to determine whether raw and contrastenhanced T2* measurements agree closely enough to be interchangeable for the purpose of quantitatively assessing cartilage health. The Bland–Altman plot (Altman and Bland, 1983) in Figure 4 suggests good agreement: small bias, no trend, consistent variability.
27.3  28.5  29.1  31.2  33.0  19.7  21.9  17.7  22.0  19.5  
27.8  25.9  19.5  27.8  26.6  18.3  23.1  18.0  25.7  21.7 
We applied our procedure for each of three choices of parametric marginal distribution: Gaussian, Laplace, and Student’s t with noncentrality. The results are shown in Table 2, where the fourth and fifth columns give the estimated location and scale parameters for the three marginal distributions, the sixth column provides values of Akaike’s information criterion (AIC) (Akaike, 1974), and the final column shows model probabilities (Burnham et al., 2011) relative to the t model (since that model yielded the smallest value of AIC).
Marginal Model  ω̂      95% Interval     Location  Scale  AIC    Model Probability
Gaussian        0.837  (0.803, 0.870)   24.9      5.30   3,605  0.0002
Laplace         0.858  (0.829, 0.886)   24.0      4.34   3,643  0
t               0.862  (0.833, 0.890)   23.3      11.21  3,588  1
Empirical       0.846  (0.808, 0.869)
We see that the estimates of ω are comparable for the three choices of marginal distribution, yet the t distribution is far superior to the others in terms of model probabilities. Figure 5 provides visual corroboration: it is clear that the t assumption proves more appropriate because it is able to capture the mild asymmetry of the marginal distribution. The t assumption also yielded the largest estimate of ω and the narrowest confidence interval (arrived at using the method of maximum likelihood). In any case, we must conclude that there is near-perfect agreement between raw T2* and contrast-enhanced T2*.
Note that, for a sufficiently large sample, it may be advantageous to employ a nonparametric estimate of the marginal distribution (see Section 5 for details). Results for this approach are shown in the final row of Table 2. We used a bootstrap sample size of 1,000 in computing the confidence interval. This yielded MCSEs smaller than 0.002.
A Krippendorff’s α analysis gave a similar point estimate and interval (b = 1,000; MCSEs smaller than 0.001). Since α implicitly assumes Gaussianity—α is the intraclass correlation for the ordinary one-way mixed-effects ANOVA model, and so is a ratio of sums of squares (Artstein and Poesio, 2008)—it is not surprising that the α results are quite similar to the results obtained by Sklar’s ω with Gaussian marginals.
3.3 Ordinal data from an imaging study of liver herniation
The data for this example, some of which are shown in Figure 6, are liver-herniation scores (in {1, …, 5}) assigned by two coders (radiologists) to magnetic resonance images (MRI) of the liver in a study pertaining to congenital diaphragmatic hernia (CDH) (Dannull et al., 2017). The five grades are described in Table 3.
2  4  4  4  4  2  1  2  1  1  
2  4  5  4  4  2  1  2  1  1  
3  5  5  5  4  2  2  2  1  1  
3  5  5  4  4  2  2  2  1  1 
Grade  Description 

1  No herniation of liver into the fetal chest 
2  Less than half of the ipsilateral thorax is occupied by the fetal liver 
3  Greater than half of the thorax is occupied by the fetal liver 
4  The liver dome reaches the thoracic apex 
5  The liver dome not only reaches the thoracic apex but also extends across the thoracic midline
Each coder scored each of the 47 images twice, and so we are interested in assessing both intra-coder and inter-coder agreement. We can accomplish both goals with a single analysis by choosing an appropriate form for our copula correlation matrix Ω. Specifically, we let Ω be block diagonal: Ω = diag(Ω_1, …, Ω_47), where the block for unit i is given by

Ω_i =
( 1      ω_1    ω_12   ω_12 )
( ω_1    1      ω_12   ω_12 )
( ω_12   ω_12   1      ω_2  )
( ω_12   ω_12   ω_2    1    ),

ω_1 being the intra-coder agreement for the first coder, ω_2 being the intra-coder agreement for the second coder, and ω_12 being the inter-coder agreement. See Section 4 for more information regarding useful correlation structures for agreement.
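To make this structure concrete, the following sketch (in Python rather than the package’s R; the function name and the parameter values are illustrative, not part of sklarsomega) assembles one unit’s block and the full block-diagonal matrix for 47 units:

```python
import numpy as np

def unit_block(omega1, omega2, omega12):
    """Copula correlation block for one unit scored twice by each of two
    coders, scores ordered (coder 1 rep 1, coder 1 rep 2, coder 2 rep 1,
    coder 2 rep 2): intra-coder entries omega1/omega2, inter-coder omega12."""
    B = np.full((4, 4), omega12)   # inter-coder agreement by default
    B[0, 1] = B[1, 0] = omega1     # within coder 1
    B[2, 3] = B[3, 2] = omega2     # within coder 2
    np.fill_diagonal(B, 1.0)
    return B

# Block-diagonal copula correlation matrix for 47 units
Omega = np.kron(np.eye(47), unit_block(0.98, 0.98, 0.95))
```

Scores from different units are a priori uncorrelated, so the off-diagonal blocks are zero.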
Our results (from a single analysis), and Krippendorff’s α results (three separate analyses), are shown in Table 4. We see that the point estimates for the two approaches are comparable (since the outcomes are approximately Gaussian). Our method produced considerably wider intervals, however. This is not surprising given that Sklar’s ω estimates all three correlation parameters simultaneously and accounts for our uncertainty regarding the marginal probabilities. By contrast, Krippendorff’s α can yield optimistic inference, as we show by simulation in Section 6.
Method  Agreement                  95% Interval (MCSE)
α       Intra-agreement, coder 1   (0.957, 1.000) (0.002)
α       Intra-agreement, coder 2   (0.958, 1.000) (0.001)
α       Inter-agreement            (0.917, 0.980) (0.002)
ω       Intra-agreement, coder 1   (0.897, 0.995) (0.002)
ω       Intra-agreement, coder 2   (0.835, 0.994) (0.005)
ω       Inter-agreement            (0.838, 0.996) (0.006)
4 Our model
The statistical model underpinning Sklar’s ω is a Gaussian copula model (Xue-Kun Song, 2000). We begin by specifying the model in full generality. Then we consider special cases of the model that speak to the tasks listed in Section 1 and the assumptions and levels of measurement presented in Section 2.
The stochastic form of the Gaussian copula model is given by

Z = (Z_1, …, Z_n)′ ∼ N(0, Ω)
U = (U_1, …, U_n)′ = (Φ(Z_1), …, Φ(Z_n))′        (1)
Y = (Y_1, …, Y_n)′ = (F_1^{-1}(U_1), …, F_n^{-1}(U_n))′,

where Ω is a correlation matrix, Φ is the standard Gaussian cdf, and F_i is the cdf for the ith outcome Y_i. Note that U is a realization of the Gaussian copula, which is to say that the U_i are marginally standard uniform and exhibit the Gaussian correlation structure defined by Ω. Since U_i is standard uniform, applying the inverse probability integral transform F_i^{-1} to U_i produces outcome Y_i having the desired marginal distribution F_i.
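As a sketch of this data-generating process (Python rather than the package’s R; the marginal probabilities, the value of ω, and all names are illustrative), one unit’s scores can be simulated by drawing Z, applying Φ, and then applying the inverse cdf of a categorical margin:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(1)
Phi = NormalDist().cdf

def simulate_unit(omega, p, n_coders):
    """Z ~ N(0, Omega_i) with compound symmetry, U = Phi(Z),
    Y = F^{-1}(U) for a categorical margin with probabilities p."""
    Omega_i = np.full((n_coders, n_coders), omega)
    np.fill_diagonal(Omega_i, 1.0)
    z = rng.multivariate_normal(np.zeros(n_coders), Omega_i)
    u = np.array([Phi(zi) for zi in z])
    return np.searchsorted(np.cumsum(p), u) + 1   # categories 1..K

p = [0.25, 0.24, 0.23, 0.19, 0.09]
scores = np.vstack([simulate_unit(0.9, p, n_coders=4) for _ in range(12)])
```

With ω near one, the scores within a unit are usually identical, mimicking strong agreement.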
In the form of Sklar’s ω that most closely resembles Krippendorff’s α, we assume that all of the outcomes share the same marginal distribution F. The choice of F is then determined by the level of measurement. While Krippendorff’s α typically employs two different metrics for nominal and ordinal outcomes, we assume the categorical distribution

P(Y = k) = p_k   (k = 1, …, K; p_k > 0; Σ_k p_k = 1)        (2)

for both levels of measurement, where K is the number of categories. For K = 2, (2) is of course the Bernoulli distribution.
Note that when the marginal distributions are discrete (in our case, categorical), the joint distribution corresponding to (2) is uniquely defined only on the support of the marginals, and the dependence between a pair of random variables depends on the marginal distributions as well as on the copula. Genest and Nešlehová (2007) described the implications of this and warned that, for discrete data, “modeling and interpreting dependence through copulas is subject to caution.” But Genest and Nešlehová go on to say that copula parameters may still be interpreted as dependence parameters, and estimation of copula parameters is often possible using fully parametric methods. It is precisely such methods that we recommend in Section 5 and evaluate through simulation in Section 6.

For interval outcomes, F can be practically any continuous distribution. Our R package supports the Gaussian, Laplace, Student’s t, and gamma distributions. The Laplace and t distributions are useful for accommodating heavier-than-Gaussian tails, and the t and gamma distributions can accommodate asymmetry. Perhaps the reader can envision more “exotic” possibilities, such as mixture distributions (to handle multimodality or excess zeros, for example). Another possibility for continuous outcomes is to first estimate F nonparametrically, and then estimate the copula parameters in a second stage. In Section 5 we provide details regarding this approach.
Finally, the natural choice for ratio outcomes is the beta distribution, the two-parameter version of which is supported by package sklarsomega.

Now we turn to the copula correlation matrix Ω, the form of which is determined by the question(s) we seek to answer. If we wish to measure only inter-coder agreement, as is the case for Krippendorff’s α, our copula correlation matrix has a very simple structure: block diagonal, where the ith block corresponds to the ith unit and has a compound-symmetry structure. That is,

Ω = diag(Ω_1, …, Ω_s),

where s denotes the number of units and

Ω_i =
( 1  ω  ⋯  ω )
( ω  1  ⋯  ω )
( ⋮  ⋮  ⋱  ⋮ )
( ω  ω  ⋯  1 ).
On the scale of the outcomes, ω’s interpretation depends on the marginal distribution. If the outcomes are Gaussian, ω is the Pearson correlation between any two scores within a unit, and so the outcomes carry exactly the correlation structure codified in Ω. If the outcomes are non-Gaussian, the interpretation of ω (still on the scale of the outcomes) is more complicated. For example, if the outcomes are Bernoulli, ω is often called the tetrachoric correlation between those outcomes. Tetrachoric correlation is constrained by the marginal distributions. Specifically, the maximum correlation for two binary random variables is

min{ √(p_1 q_2 / (p_2 q_1)), √(p_2 q_1 / (p_1 q_2)) },

where p_1 and p_2 are the expectations and q_i = 1 − p_i (Prentice, 1988). More generally, the marginal distributions impose bounds, called the Fréchet–Hoeffding bounds, on the achievable correlation (Nelsen, 2006). For most scenarios, the Fréchet–Hoeffding bounds do not pose a problem for Sklar’s ω because we typically assume that our outcomes are identically distributed, in which case the bounds are −1 and 1. (We do, however, impose our own lower bound of 0 on ω since we aim to measure agreement.)
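The binary case can be verified directly: the Fréchet–Hoeffding upper bound on the covariance of two Bernoulli variables is min(p_1, p_2) − p_1 p_2, and standardizing yields the maximum correlation (Prentice, 1988). A small sketch (ours, not the package’s):

```python
import math

def max_corr_bernoulli(p1, p2):
    """Maximum Pearson correlation for Bernoulli(p1) and Bernoulli(p2):
    the Frechet-Hoeffding covariance bound min(p1, p2) - p1*p2, divided
    by the product of the standard deviations."""
    cov_max = min(p1, p2) - p1 * p2
    return cov_max / math.sqrt(p1 * (1 - p1) * p2 * (1 - p2))

# Identical margins attain correlation 1; unequal margins cannot.
assert abs(max_corr_bernoulli(0.3, 0.3) - 1.0) < 1e-12
```

For p_1 = 0.2 and p_2 = 0.6, for example, the bound is √(p_1 q_2 / (p_2 q_1)) ≈ 0.408, well below 1.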
In any case, ω has a uniform and intuitive interpretation for suitably transformed outcomes, irrespective of the marginal distribution. Specifically,

ω = ρ( Φ^{-1}{F(Y_ij)}, Φ^{-1}{F(Y_ij′)} ),

where ρ denotes Pearson’s correlation and the second subscripts index the scores within the ith unit (j ≠ j′).
By changing the structure of the blocks we can use Sklar’s ω to measure not only inter-coder agreement but also a number of other types of agreement. For example, should we wish to measure agreement with a gold standard, we might employ blocks of the form (shown here for a gold standard plus two coders, the first row and column corresponding to the gold standard scores)

Ω_i =
( 1     ω_g   ω_g )
( ω_g   1     ω_c )
( ω_g   ω_c   1   ).

In this scheme ω_g captures agreement with the gold standard, and ω_c captures inter-coder agreement.
In a more elaborate form of this scenario, we could include a regression component in an attempt to identify important predictors of agreement with the gold standard. This could be accomplished by using a cdf to link coder-specific covariates with ω. The gold-standard entries of the blocks would then be replaced by H(x_c′β), H being a cdf, x_c being a vector of covariates for coder c, and β being a vector of regression coefficients.

For our final example we consider a complex study involving a gold standard, multiple scoring methods, multiple coders, and multiple scores per coder. In the interest of concision, suppose we have two methods, two coders per method, two scores per coder for each method, and gold-standard measurements for the first method. Then each unit contributes nine scores—the gold standard score and the scores y_jkl, where the subscript jkl denotes score l for coder k of method j—and each block of Ω contains a distinct correlation parameter for each type of agreement: one represents agreement with the gold standard for the first method, one represents intra-coder agreement for the first coder of the first method, one represents intra-coder agreement for the second coder of the first method, one represents inter-coder agreement for the first method, and so on, with a final parameter representing inter-method agreement.
Note that, for a study involving multiple methods, it may be reasonable to assume a different marginal distribution for each method. In this case, the Fréchet–Hoeffding bounds may be relevant, and, if some marginal distributions are continuous and some are discrete, maximum likelihood inference is infeasible (see the next section for details).
5 Approaches to inference for Sklar’s ω
When the response is continuous, i.e., when the level of measurement is interval or ratio, we recommend maximum likelihood (ML) inference for Sklar’s ω. When the marginal distribution is a categorical distribution (for nominal or ordinal level of measurement), maximum likelihood inference is infeasible because the log-likelihood, having 2^n terms, is intractable for most datasets. In this case we recommend the distributional transform (DT) approximation or composite marginal likelihood (CML), depending on the number of categories and perhaps the sample size. If the response is binary, composite marginal likelihood is indicated even for large samples since the DT approach tends to perform poorly for binary data. If there are three or four categories, the DT approach may perform at least adequately for larger samples, but we still favor CML for such data. For five or more categories, the DT approach performs well and yields a more accurate estimator than does the CML approach. The DT approach is also more computationally efficient than the CML approach.
5.1 The method of maximum likelihood for Sklar’s ω
For correlation matrix Ω, marginal cdf F, and marginal pdf f, the log-likelihood of the parameters θ (comprising ω and any marginal parameters) given observations y = (y_1, …, y_n)′ is

ℓ(θ | y) = −(1/2) log |Ω| − (1/2) z′(Ω^{-1} − I)z + Σ_i log f(y_i),        (3)

where z = (z_1, …, z_n)′ with z_i = Φ^{-1}{F(y_i)}, and I denotes the n × n identity matrix. We obtain θ̂ by minimizing −ℓ. For all three approaches to inference—ML, DT, CML—we use the optimization algorithm proposed by Byrd et al. (1995) so that ω, and perhaps some elements of the marginal parameters, can be appropriately constrained. To estimate an asymptotic confidence ellipsoid we of course use the observed Fisher information matrix, i.e., the estimate of the Hessian matrix at θ̂:

Î(θ̂) = −∇²ℓ(θ̂ | y),

where Î(θ̂) denotes the observed information and ∇² denotes the Hessian operator.
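For continuous margins this log-likelihood is simple to evaluate. The following Python sketch (our illustrative names, not the package’s API; Gaussian margins and compound-symmetry blocks assumed) computes it directly:

```python
import numpy as np
from statistics import NormalDist

def copula_loglik(omega, y, mu, sigma, n_coders):
    """Gaussian-copula log-likelihood with Gaussian margins and a
    block-diagonal, compound-symmetry copula correlation matrix:
    -0.5 log|Omega| - 0.5 z'(Omega^{-1} - I)z + sum log f(y_i)."""
    margin, std = NormalDist(mu, sigma), NormalDist()
    y = np.asarray(y, dtype=float)
    z = np.array([std.inv_cdf(margin.cdf(v)) for v in y])
    block = np.full((n_coders, n_coders), omega)
    np.fill_diagonal(block, 1.0)
    Omega = np.kron(np.eye(len(y) // n_coders), block)
    _, logdet = np.linalg.slogdet(Omega)
    quad = z @ (np.linalg.inv(Omega) - np.eye(len(y))) @ z
    logf = sum(np.log(margin.pdf(v)) for v in y)
    return -0.5 * logdet - 0.5 * quad + logf
```

At ω = 0 the copula terms vanish and the log-likelihood reduces to the sum of the marginal log densities; in practice one maximizes over ω and the marginal parameters jointly, with ω constrained to [0, 1).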
Optimization of ℓ is insensitive to the starting value for ω, but it can be important to choose initial values for the marginal parameters carefully. For example, if the assumed marginal family is Student’s t with noncentrality parameter μ and degrees of freedom ν, we recommend initial values based on the sample median and the sample median absolute deviation from the median (Serfling and Mazumder, 2009). For the Gaussian and Laplace distributions we use the sample mean and standard deviation. For the gamma distribution we recommend method-of-moments initial values, namely shape ȳ²/s² and rate ȳ/s² for sample mean ȳ and sample variance s². Similarly, we provide method-of-moments initial values when the marginal distribution is beta. Finally, for a categorical distribution we use the empirical probabilities.
5.2 The distributional transform method
When the marginal distribution is discrete (in our case, categorical), the log-likelihood does not have the simple form given above because Φ^{-1}{F(Y)} is not standard Gaussian (since F(Y) is not standard uniform when F has jumps). In this case the true log-likelihood is intractable unless the sample is rather small. An appealing alternative to the true log-likelihood is an approximation based on the distributional transform.
It is well known that if Y is continuous with cdf F, F(Y) has a standard uniform distribution. But if Y is discrete, F(Y) tends to be stochastically larger, and F(Y−) tends to be stochastically smaller, than a standard uniform random variable. This can be remedied by stochastically “smoothing” F’s discontinuities. This technique goes at least as far back as Ferguson (1967), who used it in connection with hypothesis tests. More recently, the distributional transform has been applied in a number of other settings—see, e.g., Rüschendorf (1981), Burgert and Rüschendorf (2006), and Rüschendorf (2009).

Let F(y−) denote the left limit of F at y, and suppose that W ∼ U(0, 1) is independent of Y. Then the distributional transform

U = F(Y−) + W{F(Y) − F(Y−)}

follows a standard uniform distribution, and F^{-1}(U) follows the same distribution as Y.
Kazianka and Pilz (2010) suggested approximating the distributional transform by replacing it with its expectation with respect to W:

F̃(y) = E_W[ F(y−) + W{F(y) − F(y−)} ] = {F(y−) + F(y)} / 2.

To construct the approximate log-likelihood for Sklar’s ω, we replace F(y_i) in (3) with

F̃(y_i) = {F(y_i−) + F(y_i)} / 2.

If the distribution has integer support, this becomes

F̃(y_i) = {F(y_i − 1) + F(y_i)} / 2.

This approximation is crude, but it performs well as long as the discrete distribution in question has a sufficiently large variance (Kazianka, 2013). For Sklar’s ω, we recommend using the DT approach when the scores fall into five or more categories.
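In code the approximation is one line per category. A brief Python sketch (ours, not the package’s) for a categorical margin:

```python
from statistics import NormalDist

def dt_cdf(p, k):
    """Distributional-transform approximation for a categorical margin:
    F~(k) = (F(k-) + F(k)) / 2 for categories k = 1..K with probabilities p."""
    F_left = sum(p[:k - 1])        # F(k-) = P(Y < k)
    return F_left + p[k - 1] / 2   # midpoint of the jump at k

p = [0.25, 0.24, 0.23, 0.19, 0.09]
# These transformed scores stand in for Phi^{-1}{F(y)} in the log-likelihood:
z = [NormalDist().inv_cdf(dt_cdf(p, k)) for k in (1, 2, 3, 4, 5)]
```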
5.3 Composite marginal likelihood
For nominal or ordinal outcomes falling into a small number of categories, we recommend a composite marginal likelihood (Lindsay, 1988; Varin, 2008) approach to inference. Our objective function comprises pairwise likelihoods (which implies the working assumption that any two pairs of outcomes are independent). Specifically, we work with the log composite likelihood

ℓ_CML(θ | y) = Σ log{ Φ_2(z_i, z_j) − Φ_2(z_i^−, z_j) − Φ_2(z_i, z_j^−) + Φ_2(z_i^−, z_j^−) },

where z_i = Φ^{-1}{F(y_i)}, z_i^− = Φ^{-1}{F(y_i−)}, Φ_2 denotes the cdf for the bivariate Gaussian distribution with mean zero and correlation matrix

( 1  ω )
( ω  1 ),

and the sum is over pairs of scores within the same unit. Since this objective function, too, is misspecified, bootstrapping or sandwich estimation is necessary.
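A pairwise term requires only the bivariate Gaussian cdf, which can be obtained by one-dimensional quadrature. The sketch below (ours; a production implementation would use, e.g., Genz’s algorithms) computes one rectangle probability:

```python
import numpy as np
from statistics import NormalDist

Phi, phi, inv = NormalDist().cdf, NormalDist().pdf, NormalDist().inv_cdf

def Phi2(a, b, rho, grid=4001):
    """Bivariate standard-normal cdf P(Z1 <= a, Z2 <= b) via the identity
    integral over t < a of phi(t) * Phi((b - rho t) / sqrt(1 - rho^2))."""
    t = np.linspace(-8.0, a, grid)
    f = np.array([phi(ti) * Phi((b - rho * ti) / np.sqrt(1 - rho ** 2)) for ti in t])
    return float(np.sum((f[:-1] + f[1:]) / 2) * (t[1] - t[0]))

def pair_prob(k1, k2, p, omega):
    """P(Y_i = k1, Y_j = k2): the rectangle probability appearing in each
    term of the pairwise log-likelihood (categories 1..K, probabilities p)."""
    F = np.concatenate(([0.0], np.cumsum(p)))
    z = lambda u: inv(min(max(u, 1e-12), 1 - 1e-12))
    hi1, lo1 = z(F[k1]), z(F[k1 - 1])
    hi2, lo2 = z(F[k2]), z(F[k2 - 1])
    return (Phi2(hi1, hi2, omega) - Phi2(lo1, hi2, omega)
            - Phi2(hi1, lo2, omega) + Phi2(lo1, lo2, omega))
```

Summing the logarithms of these probabilities over all within-unit pairs gives the CML objective.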
5.4 Sandwich estimation for the DT and CML procedures
As we mentioned above, the DT and CML objective functions are misspecified, and so the asymptotic covariance matrices of the DT and CML estimators have sandwich forms (Godambe, 1960; Geyer, 2013). Specifically, we have

θ̂ ≈ N( θ, Î^{-1} Ĵ Î^{-1} ),

where I is the appropriate Fisher information matrix and J is the variance of the score:

I(θ) = −E ∇²ℓ(θ | y),    J(θ) = Var ∇ℓ(θ | y).

We recommend that J be estimated using a parametric bootstrap, i.e., our estimator of J is

Ĵ = (1/b) Σ_{j=1}^{b} ∇ℓ(θ̂ | y^{(j)}) ∇ℓ(θ̂ | y^{(j)})′,

where b is the bootstrap sample size and the y^{(j)} are datasets simulated from our model at θ = θ̂. This approach performs well, as our simulation results show, and is considerably more efficient computationally than a “full” bootstrap (it is much faster to approximate the score than to optimize the objective function). What is more, Ĵ is accurate for small bootstrap sample sizes (b = 100 in our simulations). The procedure can be made even more efficient through parallelization.
5.5 A two-stage semiparametric approach for continuous measurements
If the sample size is large enough, a two-stage semiparametric method (SMP) may be used. In the first stage one estimates F nonparametrically. The empirical distribution function F_n is a natural choice for our estimator of F, but other sensible choices exist. For example, one might employ the Winsorized estimator

F̂(y) = max{ δ_n, min( F_n(y), 1 − δ_n ) },

where δ_n ∈ (0, 1) is a truncation parameter (Klaassen et al., 1997; Liu et al., 2009). A third possibility is a smoothed empirical distribution function

F̃(y) = (1/n) Σ_{i=1}^{n} K( (y − y_i) / h_n ),

where K is an integrated kernel and h_n a bandwidth (Fernholz, 1991).
Armed with an estimate of F—F̂, say—we compute ẑ = (ẑ_1, …, ẑ_n)′, where ẑ_i = Φ^{-1}{F̂(y_i)}, and optimize

−(1/2) log |Ω| − (1/2) ẑ′(Ω^{-1} − I)ẑ

to obtain ω̂. This approach is advantageous when the marginal distribution is complicated, but has the drawback that uncertainty regarding the marginal distribution is not reflected in the (ML) estimate of ω̂’s variance. This deficiency can be avoided by using a bootstrap sample, the jth element of which can be generated by (1) simulating u^{(j)} from the copula at ω = ω̂; (2) computing a new response y^{(j)} as y_i^{(j)} = F_n^{-1}(u_i^{(j)}), where F_n^{-1} is the empirical quantile function; and (3) applying the estimation procedure to y^{(j)}. We compute sample quantiles using the median-unbiased approach recommended by Hyndman and Fan (1996). It is best to compute the bootstrap interval using the Gaussian method, since that interval tends to have the desired coverage rate while the quantile method tends to produce an interval that is too narrow. This is because the upper-quantile estimator is inaccurate while the bootstrap estimator of ω̂’s standard error is rather more accurate. To get adequate performance using the quantile method, a much larger bootstrap sample size is required.

Although this approach may be necessary when the marginal distribution does not appear to take a familiar form, two-stage estimation does have a significant drawback, even for larger samples. If agreement is at least fair, dependence may be sufficient to pull the empirical marginal distribution away from the true marginal distribution. In such cases, simultaneous estimation of the marginal distribution and the copula should perform better. Development of such a method is beyond the scope of this article.
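The two-stage idea is straightforward to prototype. Below is a Python sketch (ours, not the package’s implementation) that uses the rank-based estimate n/(n + 1) · F_n in the first stage and a grid search over ω in the second:

```python
import numpy as np
from statistics import NormalDist

def semiparametric_omega(scores):
    """Two-stage sketch: (1) transform via the rescaled empirical cdf,
    (2) maximize the copula log-likelihood (marginal terms dropped) over a
    grid of compound-symmetry values omega."""
    y = np.asarray(scores, dtype=float)            # units x coders
    n_units, n_coders = y.shape
    flat = y.ravel()
    ranks = flat.argsort().argsort()               # 0..n-1, ties broken arbitrarily
    u = (ranks + 1) / (flat.size + 1)              # keeps u strictly inside (0, 1)
    z = np.array([NormalDist().inv_cdf(v) for v in u]).reshape(y.shape)
    best_omega, best_ll = 0.0, -np.inf
    for omega in np.linspace(0.0, 0.99, 100):
        block = np.full((n_coders, n_coders), omega)
        np.fill_diagonal(block, 1.0)
        _, logdet = np.linalg.slogdet(block)
        M = np.linalg.inv(block) - np.eye(n_coders)
        ll = sum(-0.5 * logdet - 0.5 * (zi @ M @ zi) for zi in z)
        if ll > best_ll:
            best_omega, best_ll = omega, ll
    return best_omega
```

The n/(n + 1) rescaling avoids transforming the largest observation to Φ^{-1}(1) = ∞.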
6 Application to simulated data
To investigate the performance of Sklar’s ω relative to Krippendorff’s α, we applied both methods to simulated outcomes. We carried out a study for each level of measurement, for various sample sizes, and for a few values of ω. The study plan is shown in Table 5; the marginal distributions include beta, Student’s t, Laplace, Gaussian-mixture, categorical, and Bernoulli distributions.
Margins           ω     Sample Geometry  Method  Interval
Beta              0.70  30, 3            ML      ML
Student’s t       0.95  10, 5            ML      ML
Laplace           0.65  40, 2            ML      ML
Gaussian mixture  0.80  100, 4           SMP     Bootstrap
Categorical       0.90  20, 10           DT      Sandwich
Bernoulli         0.40  300, 6           CML     Sandwich
We applied Krippendorff’s α and our procedure to each of 500–1,000 simulated datasets for each scenario. The results are shown in Table 6. The coverage rates are for 95% intervals. All of the intervals for α are bootstrap intervals.
For every simulation scenario, our estimator ω̂ exhibited smaller bias, smaller variance (excepting the final scenario), smaller mean squared error, and a significantly higher coverage rate. Our method proved especially advantageous for the first, fifth, and sixth scenarios. This is not surprising in light of the fact that Krippendorff’s α implicitly assumes Gaussianity; the marginal distributions for the first, fifth, and sixth scenarios are far from Gaussian, and so Krippendorff’s α performs relatively poorly for those scenarios. Krippendorff’s α performed best for the second and third scenarios because the t and Laplace distributions do not depart too markedly from Gaussianity. Note that our estimator had a much larger variance than did α̂ for the final scenario. This is because we employed composite likelihood, which is a must for binary data since the DT estimator is badly biased for binary outcomes.
Margins           ω     Estimator  Bias  Variance  MSE     Coverage
Beta              0.70  ω̂          2%    0.0065    0.0067  94%
                        α̂          19%   0.0073    0.0242  51%
Student’s t       0.95  ω̂          2%    0.0017    0.0021  95%
                        α̂          3%    0.0021    0.0031  66%
Laplace           0.65  ω̂          2%    0.0098    0.0099  93%
                        α̂          4%    0.0111    0.0120  89%
Gaussian mixture  0.80  ω̂          2%    0.0008    0.0010  95%
                        α̂          6%    0.0018    0.0040  73%
Categorical       0.90  ω̂          1%    0.0010    0.0010  98%
                        α̂          44%   0.0052    0.1626  0%
Bernoulli         0.40  ω̂          6%    0.0173    0.0180  93%
                        α̂          38%   0.0007    0.0240  0%
7 R package sklarsomega
Here we briefly introduce our R package, sklarsomega, by way of a usage example. The package is available for download from the Comprehensive R Archive Network.
The following example applies Sklar’s ω to the nominal data from the first case study of Section 3. We provide the value "asymptotic" for argument confint. This results in sandwich estimation for the DT approach (see Section 5.4). We estimate Ĵ using a parallel bootstrap, where b = 1,000 and six CPU cores are employed (control parameter type takes the default value of "SOCK"). Since argument verbose was set to TRUE, a progress bar appears (Solymos and Zawadzki, 2017). We see that Ĵ was computed in one minute and 49 seconds.
> data = matrix(c(1,2,3,3,2,1,4,1,2,NA,NA,NA,
+                 1,2,3,3,2,2,4,1,2,5,NA,3,
+                 NA,3,3,3,2,3,4,2,2,5,1,NA,
+                 1,2,3,3,2,4,4,1,2,5,1,NA), 12, 4)
> colnames(data) = c("c.1.1", "c.2.1", "c.3.1", "c.4.1")
> fit = sklars.omega(data, level = "nominal", confint = "asymptotic", verbose = TRUE,
+                    control = list(bootit = 1000, parallel = TRUE, nodes = 6))
Control parameter ’type’ must be "SOCK", "PVM", "MPI", or "NWS". Setting it to "SOCK".
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 01m 49s
> summary(fit)

Call:
sklars.omega(data = data, level = "nominal", confint = "asymptotic",
    verbose = TRUE, control = list(bootit = 1000, parallel = TRUE,
    nodes = 6))

Convergence:
Optimization converged at 40.42 after 31 iterations.

Control parameters:
bootit   1000
parallel TRUE
nodes    6
dist     categorical
type     SOCK

Coefficients:
      Estimate    Lower   Upper
inter  0.89420  0.76570  1.0230
p1     0.25170  0.01407  0.4893
p2     0.24070  0.01842  0.4631
p3     0.22740  0.04639  0.4084
p4     0.18880  0.06007  0.4377
p5     0.09136 -0.15580  0.3385
Next we compute DFBETAs (Young, 2017) for units 6 and 11, and for coders 2 and 3. We see that omitting unit 6 results in a much larger value for ω̂, whereas unit 11 is not influential. Likewise, coder 2 is influential while coder 3 is not.
> (inf = influence(fit, units = c(6, 11), coders = c(2, 3)))
$dfbeta.units
         inter          p1           p2          p3          p4           p5
6  -0.07914843  0.03438538  0.052599491 -0.05540904 -0.05820757  0.026631732
11  0.01096758  0.04546670 -0.007630807 -0.01626192 -0.01514173 -0.006432246

$dfbeta.coders
         inter          p1          p2          p3         p4         p5
2 0.0579843781 0.002743713 0.002974195 0.02730064 0.01105672 0.01601343
3 0.0008664934 0.006572821 0.048168128 0.05659853 0.02149364 0.02335122

> fit$coef - t(inf$dfbeta.units)
               6         11
inter 0.97335265 0.88323664
p1    0.21731494 0.20623362
p2    0.18814896 0.24837926
p3    0.28280614 0.24365903
p4    0.24700331 0.20393747
p5    0.06472664 0.09779062
Much additional functionality is supported by sklarsomega, e.g., plotting and simulation. And we note that computational efficiency is supported by our use of sparse-matrix routines (Furrer and Sain, 2010) and a clever bit of Fortran code (Genz, 1992). Future versions of the package will employ C++ (Eddelbuettel and Francois, 2011).
8 Conclusion
Sklar’s ω offers a flexible, principled, complete framework for doing statistical inference regarding agreement. In this article we developed various frequentist approaches for Sklar’s ω, namely, maximum likelihood, distributional-transform approximation, composite marginal likelihood, and a two-stage semiparametric method. This was necessary because a single, unified approach does not exist for the form of Sklar’s ω presented in Section 4, wherein the copula is applied directly to the outcomes. An appealing alternative would be a hierarchical formulation of Sklar’s ω such that the copula is applied through the responses’ mean structure. This would permit, for example, a well-motivated Bayesian scheme and/or expectation–maximization algorithm to be devised.
Another potentially appealing extension/refinement would focus on composite likelihood inference for the current version of the model. Perhaps one should use a different composite likelihood, for example, and/or employ well-chosen weights (Xu et al., 2016).
References
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716–723.
Altman, D. G. and Bland, J. M. (1983). Measurement in medicine: The analysis of method comparison studies. The Statistician, pages 307–317.
Aravind, K. R. and Nawarathna, L. S. (2017). A statistical method for assessing agreement between two methods of heteroscedastic clinical measurements using the copula method. Journal of Medical Statistics and Informatics, 5(1):3.
Artstein, R. and Poesio, M. (2008). Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4):555–596.
Burgert, C. and Rüschendorf, L. (2006). On the optimal risk allocation problem. Statistics & Decisions, 24(1/2006):153–171.
Burnham, K. P., Anderson, D. R., and Huyvaert, K. P. (2011). AIC model selection and multimodel inference in behavioral ecology: Some background, observations, and comparisons. Behavioral Ecology and Sociobiology, 65(1):23–35.
Byrd, R., Lu, P., Nocedal, J., and Zhu, C. (1995). A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5):1190–1208.
Dannull, K., Stein, J., Hughes, J., and Kline-Fath, B. (2017). Grading of liver herniation in cases of congenital diaphragmatic hernia: Further refining neonatal mortality (without the use of liver volume). Under review.
Davison, A. C. and Hinkley, D. V. (1997). Bootstrap Methods and their Application, volume 1. Cambridge University Press.
Eddelbuettel, D. and Francois, R. (2011). Rcpp: Seamless R and C++ integration. Journal of Statistical Software, 40(8):1–18.
Ferguson, T. S. (1967). Mathematical Statistics: A Decision Theoretic Approach. Academic Press, New York.
Fernholz, L. T. (1991). Almost sure convergence of smoothed empirical distribution functions. Scandinavian Journal of Statistics, 18(3):255–262.
Flegal, J., Haran, M., and Jones, G. (2008). Markov chain Monte Carlo: Can we trust the third significant figure? Statistical Science, 23(2):250–260.
Furrer, R. and Sain, S. R. (2010). spam: A sparse matrix R package with emphasis on MCMC methods for Gaussian Markov random fields. Journal of Statistical Software, 36(10):1–25.
Genest, C. and Neslehova, J. (2007). A primer on copulas for count data. Astin Bulletin, 37(2):475.
Genz, A. (1992). Numerical computation of multivariate normal probabilities. Journal of Computational and Graphical Statistics, pages 141–149.
Geyer, C. J. (2013). Le Cam made simple: Asymptotics of maximum likelihood without the LLN or CLT or sample size going to infinity. In Jones, G. L. and Shen, X., editors, Advances in Modern Statistical Theory and Applications: A Festschrift in honor of Morris L. Eaton. Institute of Mathematical Statistics, Beachwood, Ohio, USA.
Godambe, V. (1960). An optimum property of regular maximum likelihood estimation. The Annals of Mathematical Statistics, pages 1208–1211.
Hayes, A. F. and Krippendorff, K. (2007). Answering the call for a standard reliability measure for coding data. Communication Methods and Measures, 1(1):77–89.
Hyndman, R. J. and Fan, Y. (1996). Sample quantiles in statistical packages. American Statistician, 50:361–365.
Ihaka, R. and Gentleman, R. (1996). R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5:299–314.
Kazianka, H. (2013). Approximate copula-based estimation and prediction of discrete spatial data. Stochastic Environmental Research and Risk Assessment, 27(8):2015–2026.
Kazianka, H. and Pilz, J. (2010). Copula-based geostatistical modeling of continuous and discrete data including covariates. Stochastic Environmental Research and Risk Assessment, 24(5):661–673.
Klaassen, C. A., Wellner, J. A., et al. (1997). Efficient estimation in the bivariate normal copula model: Normal margins are least favourable. Bernoulli, 3(1):55–77.
Krippendorff, K. (2012). Content Analysis: An Introduction to Its Methodology. Sage.
Krippendorff, K. (2013). Computing Krippendorff’s alpha-reliability. Technical report, University of Pennsylvania.
Landis, J. R. and Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, pages 159–174.
Lindsay, B. (1988). Composite likelihood methods. Contemporary Mathematics, 80(1):221–239.
Liu, H., Lafferty, J., and Wasserman, L. (2009). The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. Journal of Machine Learning Research, 10(Oct):2295–2328.
Nelsen, R. B. (2006). An Introduction to Copulas. Springer, New York.
Nissi, M. J., Mortazavi, S., Hughes, J., Morgan, P., and Ellermann, J. (2015). T2* relaxation time of acetabular and femoral cartilage with and without intra-articular Gd-DTPA2 in patients with femoroacetabular impingement. American Journal of Roentgenology, 204(6):W695.
Prentice, R. L. (1988). Correlated binary regression with covariates specific to each binary observation. Biometrics, pages 1033–1048.
R Core Team (2018). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
Rüschendorf, L. (1981). Stochastically ordered distributions and monotonicity of the OC-function of sequential probability ratio tests. Statistics, 12(3):327–338.
Rüschendorf, L. (2009). On the distributional transform, Sklar’s theorem, and the empirical copula process. Journal of Statistical Planning and Inference, 139(11):3921–3927.
Serfling, R. and Mazumder, S. (2009). Exponential probability inequality and convergence results for the median absolute deviation and its modifications. Statistics & Probability Letters, 79(16):1767–1773.
Solymos, P. and Zawadzki, Z. (2017). pbapply: Adding Progress Bar to ’*apply’ Functions. R package version 1.3-2.
Varin, C. (2008). On composite marginal likelihoods. AStA Advances in Statistical Analysis, 92(1):1–28.
Xu, X., Reid, N., and Xu, L. (2016). Note on information bias and efficiency of composite likelihood. arXiv preprint arXiv:1612.06967.
Xue-Kun Song, P. (2000). Multivariate dispersion models generated from Gaussian copula. Scandinavian Journal of Statistics, 27(2):305–320.
Young, D. S. (2017). Handbook of Regression Methods. CRC Press.