 # Testing for Stochastic Order in Interval-Valued Data

We construct a procedure to test the stochastic order of two samples of interval-valued data. We propose a test statistic that belongs to the class of U-statistics and derive its asymptotic distribution under the null hypothesis. We compare the performance of the newly proposed method with the existing one-sided bivariate Kolmogorov-Smirnov test using real and simulated data.


## 1 Introduction

We discuss two-sample tests for the stochastic order of two interval-valued samples. In interval-valued data, the variable of interest is not observed as a single point but is recorded as an interval with lower and upper bounds. For example, interval-valued data arise when a stock price is reported monthly by its lower and upper limit prices. In addition, blood pressure data, which motivate our research, have diastolic blood pressure (DBP) and systolic blood pressure (SBP) as lower and upper bounds. Testing the stochastic order of two populations, as well as verifying the equality of two distributions, is a fundamental problem in statistics. However, little research has been done for interval-valued data; even the definition of stochastic order for interval-valued data is not clearly established. Thus, this paper introduces a definition and proposes a method to test the stochastic order of two samples of interval-valued data.

The remainder of the paper is organized as follows. In Section 2, we define the stochastic order of interval-valued data. In Section 3, we propose a test statistic for testing the order of interval-valued data and derive its asymptotic null distribution using the general theory of U-statistics. In Section 4, we compare the performance of the modified two-dimensional Kolmogorov-Smirnov (K-S) statistic and the proposed statistic through a numerical study. In Section 5, we apply the methods to blood pressure data from female students in the US. In Section 6, we conclude the paper with a summary.

## 2 Simple stochastic order

Before we introduce the notion of the stochastic order for interval-valued data, we look at the stochastic order for the usual univariate case. Let $X$ and $Y$ be two univariate random variables such that

$$\Pr(X > z) \le \Pr(Y > z), \quad \text{for all } z \in \mathbb{R}.$$

Then, $Y$ is said to be stochastically greater than $X$ (denoted by $X \le_{\mathrm{st}} Y$). If additionally $\Pr(X > z) < \Pr(Y > z)$ for some $z$, then $Y$ is said to be stochastically strictly greater than $X$ (Shaked and Shanthikumar, 2006).

The stochastic order for interval-valued data can be defined similarly. Let $I_1 = (\ell_1, u_1]$ and $I_2 = (\ell_2, u_2]$ be two intervals. Then we denote $I_1 < I_2$ and say $I_2$ is greater than $I_1$ if $\ell_1 < \ell_2$ and $u_1 < u_2$. Now, let $X$ and $Y$ be two random intervals such that

$$\Pr(X > z) \le \Pr(Y > z), \quad \text{for all intervals } z. \tag{1}$$

Then, $Y$ is said to be stochastically greater than $X$, denoted by $X \le_{\mathrm{st}} Y$. Let $\bar F$ and $\bar G$ be the survival functions of the random intervals $X$ and $Y$, respectively. Then, (1) is equivalent to

$$\bar F(\ell, u) \le \bar G(\ell, u) \quad \text{for all } (\ell, u) : \ell < u.$$

Figure 1: A graphical illustration of the order of interval-valued data. Regions A, B, C, and D are respectively defined by the intersection of the half-plane $u > \ell$ with the first, second, third, and fourth quadrants when $I_1 = (\ell_1, u_1]$ is set as the origin.

We can illustrate the order of the intervals as follows (see Figure 1). Let the interval $I_1 = (\ell_1, u_1]$ be denoted by the point $(\ell_1, u_1)$ in the $(\ell, u)$ plane. Note that in this plane, interval-valued data lie above the line $u = \ell$ due to the constraint $\ell < u$. Any interval in this half-plane falls into one of three cases according to its order relation with the interval $I_1$.

1. region A: intervals are greater than $I_1$.

2. region C: intervals are less than $I_1$.

3. region B or D: intervals do not have an order relation with $I_1$.

For the last case, an interval $(\ell, u]$ in region B satisfies $\ell < \ell_1$ and $u > u_1$ (it contains $I_1$), while an interval in region D satisfies $\ell > \ell_1$ and $u < u_1$ (it is contained in $I_1$).
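The three cases above can be sketched as a small classifier. This is only an illustration: the function name `region` and the `"boundary"` label for ties in a bound are assumptions introduced here, not part of the paper.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (lower, upper) with lower < upper


def region(interval: Interval, reference: Interval) -> str:
    """Return the region (A, B, C, or D) of `interval` relative to `reference`."""
    l, u = interval
    l1, u1 = reference
    if l > l1 and u > u1:
        return "A"  # both bounds larger: interval is greater than the reference
    if l < l1 and u < u1:
        return "C"  # both bounds smaller: interval is less than the reference
    if l < l1 and u > u1:
        return "B"  # interval contains the reference: no order relation
    if l > l1 and u < u1:
        return "D"  # interval lies inside the reference: no order relation
    return "boundary"  # hypothetical label for a tie in at least one bound


print(region((2.0, 5.0), (1.0, 3.0)))  # A
print(region((0.0, 4.0), (1.0, 3.0)))  # B
```

Only regions A and C contribute to an order-based statistic; intervals in B or D are incomparable with the reference.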

## 3 Test statistic

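Let us consider two independent samples of random intervals. Suppose that the first sample $X_i = (\ell_{1i}, u_{1i}]$, $i = 1, \dots, m$, has a survival function $\bar F$ and the second sample $Y_j = (\ell_{2j}, u_{2j}]$, $j = 1, \dots, n$, has a survival function $\bar G$. We want to test the null hypothesis that both samples come from an identical distribution, “$H_0$: $\bar F(\ell, u) = \bar G(\ell, u)$ for all $(\ell, u)$”, against the alternative hypothesis that $Y$ is stochastically strictly greater than $X$, i.e., “$H_1$: $\bar F(\ell, u) \le \bar G(\ell, u)$ for all $(\ell, u)$ and $\bar F(\ell, u) < \bar G(\ell, u)$ for some interval $(\ell, u]$”.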

The statistic we propose to test the stochastic order is

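$$T = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} S_{ij}, \tag{2}$$

where

$$S_{ij} = \begin{cases} 1 & \text{if } \ell_{1i} < \ell_{2j} \text{ and } u_{1i} < u_{2j}, \\ -1 & \text{if } \ell_{1i} > \ell_{2j} \text{ and } u_{1i} > u_{2j}, \\ 0 & \text{otherwise.} \end{cases}$$

Note that under the null hypothesis $H_0$, $\mathrm{E}(S_{ij}) = 0$, and thus $\mathrm{E}(T) = 0$.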
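As a minimal sketch of how the statistic in (2) could be computed, the snippet below evaluates every pair with NumPy broadcasting. The function name `u_statistic` and the array layout (one row per interval, columns for the lower and upper bounds) are assumptions made for illustration.

```python
import numpy as np


def u_statistic(x, y) -> float:
    """Average of S_ij over all pairs; x is (m, 2), y is (n, 2), columns (lower, upper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    lx, ux = x[:, 0][:, None], x[:, 1][:, None]  # shape (m, 1)
    ly, uy = y[:, 0][None, :], y[:, 1][None, :]  # shape (1, n)
    less = (lx < ly) & (ux < uy)      # X_i < Y_j, so S_ij = +1
    greater = (lx > ly) & (ux > uy)   # X_i > Y_j, so S_ij = -1
    s = less.astype(int) - greater.astype(int)  # all other pairs give 0
    return float(s.mean())


x = [[0.0, 1.0], [1.0, 3.0]]
y = [[2.0, 4.0], [0.5, 2.0]]
print(u_statistic(x, y))  # 0.5: three pairs score +1 and one pair scores -1
```

Positive values of the statistic point toward the second sample being the greater one, and swapping the two samples flips the sign.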

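The statistic $T$ belongs to the class of U-statistics, which allows one to derive its asymptotic null distribution from the asymptotic theory of U-statistics. We introduce below the general asymptotic theory of U-statistics reported in Chapter 6 of Lehmann (1999). Let $\phi(x_1, \dots, x_a; y_1, \dots, y_b)$ be a symmetric kernel of $a + b$ arguments. Here, a symmetric kernel is a function whose value does not change when the order of the arguments $x_1, \dots, x_a$ or of $y_1, \dots, y_b$ is changed. Let $\theta$, defined below, be the parameter of interest:

$$\theta = \theta(\bar F, \bar G) = \mathrm{E}[\phi(X_1, \dots, X_a; Y_1, \dots, Y_b)],$$

and define its U-statistic by

$$U_{m,n} = \frac{1}{\binom{m}{a}\binom{n}{b}} \sum_{\{i_1, \dots, i_a\} \in \mathcal{C}_{m,a}} \; \sum_{\{j_1, \dots, j_b\} \in \mathcal{C}_{n,b}} \phi(X_{i_1}, \dots, X_{i_a}; Y_{j_1}, \dots, Y_{j_b}), \tag{3}$$

where $\mathcal{C}_{k,c}$ is the collection of all subsets of $\{1, \dots, k\}$ with size $c$, and the dummy indices of the two summations are $\{i_1, \dots, i_a\}$ and $\{j_1, \dots, j_b\}$, respectively.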

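$U_{m,n}$ is an unbiased estimator of $\theta$, and its variance is

$$\mathrm{Var}(U_{m,n}) = \sum_{i=0}^{a} \sum_{j=0}^{b} \frac{\binom{a}{i}\binom{m-a}{a-i}}{\binom{m}{a}} \, \frac{\binom{b}{j}\binom{n-b}{b-j}}{\binom{n}{b}} \, \sigma_{ij}^2,$$

where $\sigma_{ij}^2$ is given by

$$\sigma_{ij}^2 = \mathrm{Cov}[\phi(X_1, \dots, X_i, X_{i+1}, \dots, X_a; Y_1, \dots, Y_j, Y_{j+1}, \dots, Y_b), \; \phi(X_1, \dots, X_i, X'_{i+1}, \dots, X'_a; Y_1, \dots, Y_j, Y'_{j+1}, \dots, Y'_b)],$$

so that $\sigma_{00}^2 = 0$ by independence, and $X'_k$ and $Y'_k$ are independent copies of $X_k$ and $Y_k$. The theorem below, from Chapter 6 of Lehmann (1999), gives the asymptotic distribution of the U-statistic (3) above.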

###### Theorem 1 (Lehmann(1999), Theorem 6.1.3 (ii)).

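As $N = m + n \to \infty$ and $m/N \to \rho \in (0, 1)$,

$$\sqrt{N}\,(U_{m,n} - \theta)$$

converges in distribution to the normal distribution with mean $0$ and variance $a^2 \sigma_{10}^2 / \rho + b^2 \sigma_{01}^2 / (1 - \rho)$. Here, $\sigma_{10}^2$ and $\sigma_{01}^2$ are computed by

$$\sigma_{10}^2 = \mathrm{Cov}[\phi(X_1, X_2, \dots, X_a; Y_1, \dots, Y_b), \; \phi(X_1, X'_2, \dots, X'_a; Y'_1, \dots, Y'_b)] \in (0, \infty),$$

$$\sigma_{01}^2 = \mathrm{Cov}[\phi(X_1, \dots, X_a; Y_1, Y_2, \dots, Y_b), \; \phi(X'_1, \dots, X'_a; Y_1, Y'_2, \dots, Y'_b)] \in (0, \infty).$$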

Applying the general theory above for U-statistics to our case, we can derive the asymptotic null distribution of our statistic.

###### Theorem 2.

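Under the null hypothesis that $\bar F = \bar G$, if $m/N \to \rho \in (0, 1)$ as $N = m + n \to \infty$, then

$$\sqrt{N}\, T \xrightarrow{d} N\!\left(0, \; \frac{\theta_1 + \theta_2 - 2\theta_3}{\rho(1 - \rho)}\right),$$

where $\theta_1 = \Pr(X < Y, X < Y')$, $\theta_2 = \Pr(X > Y, X > Y')$, and $\theta_3 = \Pr(X < Y, X > Y')$, with $Y'$ an independent copy of $Y$.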

The parameters used to compute the asymptotic variance can be approximated by permuting observations within each sample. To see this, we observe that, under the null hypothesis,

$$\theta_1 = \Pr(X < Y, X < Y') = \Pr(X_1 < X_2, X_1 < X_3), \tag{4}$$

where $X_1, X_2, X_3$ are independent random intervals from the first population. Consequently, $\theta_1$ can be approximated by

$$\hat\theta_1 = \frac{1}{m(m-1)(m-2)} \sum_{i, j, k:\ \text{distinct}} I(X_i < X_j, X_i < X_k).$$

Equation (4) implies that the approximation above remains a valid estimate of $\theta_1$ even under the alternative hypothesis.
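The approximation above, together with the limit in Theorem 2, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the estimates for the second and third parameters are formed by analogy with the first, all three are computed from a single sample, the triple sum is normalised by the number of ordered distinct triples, and the names `theta_estimates` and `asymptotic_p_value` are hypothetical.

```python
import math
import numpy as np


def theta_estimates(x):
    """Estimate (theta1, theta2, theta3) from one sample of intervals, shape (m, 2)."""
    x = np.asarray(x, dtype=float)
    m = len(x)
    low, up = x[:, 0], x[:, 1]
    # less[i, j] is True when X_i < X_j in the interval order (diagonal is False).
    less = (low[:, None] < low[None, :]) & (up[:, None] < up[None, :])
    n_less = less.sum(axis=1)      # number of j with X_i < X_j
    n_greater = less.sum(axis=0)   # number of j with X_i > X_j
    triples = m * (m - 1) * (m - 2)  # ordered triples of distinct indices (assumed normaliser)
    theta1 = float((n_less * (n_less - 1)).sum()) / triples
    theta2 = float((n_greater * (n_greater - 1)).sum()) / triples
    # X_i < X_j and X_i > X_k already force j != k, so no correction is needed.
    theta3 = float((n_less * n_greater).sum()) / triples
    return theta1, theta2, theta3


def asymptotic_p_value(t, m, n, thetas):
    """Upper-tail p-value of sqrt(N) * T under the normal limit of Theorem 2."""
    theta1, theta2, theta3 = thetas
    N = m + n
    rho = m / N
    variance = (theta1 + theta2 - 2.0 * theta3) / (rho * (1.0 - rho))
    z = math.sqrt(N) * t / math.sqrt(variance)
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # P(Z > z) for standard normal Z
```

Counting, for each index, how many intervals lie above and below it avoids an explicit triple loop, so the cost is quadratic rather than cubic in the sample size.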

###### Proof.

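For $X = (\ell_1, u_1]$ and $Y = (\ell_2, u_2]$, let us define $\phi(X; Y) = I(X < Y) - I(X > Y)$. Then, $T$ can be written as a two-sample U-statistic with $a = b = 1$:

$$U_{m,n} = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \phi(X_i; Y_j).$$

Therefore, by applying Theorem 1, we have

$$\sqrt{N}\,(U_{m,n} - \theta) \xrightarrow{d} N\!\left(0, \; \frac{\sigma_{10}^2}{\rho} + \frac{\sigma_{01}^2}{1 - \rho}\right),$$

where $N = m + n$, $\rho = \lim m/N$, $\sigma_{10}^2 = \mathrm{Cov}[\phi(X; Y), \phi(X; Y')]$, and $\sigma_{01}^2 = \mathrm{Cov}[\phi(X; Y), \phi(X'; Y)]$.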

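Now, let us denote interval random variables by $X$, $Y$, and $Y'$. Under the null hypothesis $H_0$, we have $\theta = \mathrm{E}[\phi(X; Y)] = 0$. The variance component $\sigma_{10}^2$ ($= \sigma_{01}^2$) is evaluated as

$$\begin{aligned} \sigma_{10}^2 &= \mathrm{Cov}[\phi(X; Y), \phi(X; Y')] \\ &= \mathrm{E}[\phi(X; Y)\phi(X; Y')] \quad (\because \theta = 0) \\ &= \mathrm{E}[I(X < Y) I(X < Y')] - \mathrm{E}[I(X < Y) I(X > Y')] \\ &\quad - \mathrm{E}[I(X > Y) I(X < Y')] + \mathrm{E}[I(X > Y) I(X > Y')] \\ &= \Pr(X < Y, X < Y') + \Pr(X > Y, X > Y') - 2\Pr(X < Y, X > Y'). \end{aligned}$$

Now, we write $\theta_1 = \Pr(X < Y, X < Y')$, $\theta_2 = \Pr(X > Y, X > Y')$, and $\theta_3 = \Pr(X < Y, X > Y')$. Thus, under $H_0$, we get

$$\sigma_{10}^2 = \sigma_{01}^2 = \theta_1 + \theta_2 - 2\theta_3.$$

Hence, the asymptotic variance of $\sqrt{N}\, T$ is

$$\frac{\sigma_{10}^2}{\rho} + \frac{\sigma_{01}^2}{1 - \rho} = \frac{\theta_1 + \theta_2 - 2\theta_3}{\rho(1 - \rho)}.$$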

## 4 Numerical study

In this section, we compare the power of our proposed test (denoted as “U-test”) with that of the one-sided bivariate K-S test (denoted as “K-S test”). The U-test has two versions according to how its null distribution is approximated: “U-perm” approximates the null distribution by a permutation method, while “U-asym” relies on the approximation given in Theorem 2. The K-S test for the alternative hypothesis $H_1$ is given by (Feller, 1948)

$$D^{+}_{m,n} = \left(\frac{mn}{m+n}\right)^{1/2} \sup_{s, t \in \mathbb{R},\, s < t} \left\{ \bar G_n(s, t) - \bar F_m(s, t) \right\},$$

where $\bar F_m(s, t) = \frac{1}{m} \sum_{i=1}^{m} I(\ell_{1i} > s, u_{1i} > t)$ and $\bar G_n(s, t) = \frac{1}{n} \sum_{j=1}^{n} I(\ell_{2j} > s, u_{2j} > t)$. The null distribution of $D^{+}_{m,n}$ is approximated using a permutation method (Gail and Green, 1976).
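A permutation approximation of this kind can be sketched as follows. The sketch assumes that the one-sided K-S statistic compares empirical survival functions (second sample minus first) and evaluates the supremum on the grid of pooled bounds, without enforcing the restriction on the thresholds; the function names, the default number of permutations, and the add-one p-value convention are illustrative choices, not taken from the paper.

```python
import numpy as np


def ks_plus(x, y) -> float:
    """One-sided bivariate K-S statistic from empirical survival functions (assumed form)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    m, n = len(x), len(y)
    pooled = np.vstack([x, y])
    # Survival functions are step functions, so -inf plus the observed bounds suffice.
    s = np.concatenate(([-np.inf], pooled[:, 0]))
    t = np.concatenate(([-np.inf], pooled[:, 1]))

    def survival(sample):
        # Entry [a, b] is the fraction of intervals with lower > s[a] and upper > t[b].
        above_s = (sample[:, 0][None, :] > s[:, None]).astype(float)
        above_t = (sample[:, 1][None, :] > t[:, None]).astype(float)
        return above_s @ above_t.T / len(sample)

    diff = survival(y) - survival(x)
    return float(np.sqrt(m * n / (m + n)) * diff.max())


def permutation_p_value(x, y, statistic, n_perm=999, seed=0) -> float:
    """Share of label permutations whose statistic is at least the observed one."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    m = len(x)
    pooled = np.vstack([x, y])
    observed = statistic(x, y)
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))  # reshuffle the group labels
        if statistic(pooled[idx[:m]], pooled[idx[m:]]) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)  # add-one convention keeps the p-value positive
```

The same `permutation_p_value` helper accepts any two-sample statistic, so a permutation version of the U-test could reuse it with a function that computes $T$.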

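In the study, to generate interval-valued data $(L, U]$, we consider a transformation to obtain the center $C$ and the half-range $R$. We consider two underlying distributions for $(C, R)$: the bivariate normal distribution and the bivariate $t$ distribution with $5$ degrees of freedom,

$$N\!\left( \begin{pmatrix} \mu_C \\ \mu_R \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right) \quad \text{or} \quad t_5\!\left( \begin{pmatrix} \mu_C \\ \mu_R \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right).$$

For the two populations, we consider $\Pi_1$ and $\Pi_2$ parameterized as follows:

$$\Pi_1: \ \mu_1 = (0, 0), \quad \Sigma_1 = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}; \qquad \Pi_2: \ \mu_2 = (\delta, 0), \quad \Sigma_2 = \Sigma_1.$$

Figure 2: A graphical illustration of two populations in the simulation study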

For $\delta$, four values are used, where $\delta > 0$ indicates the alternative hypothesis. Figure 2 shows a graphical illustration of the simulation setting. To examine the effect of correlation between the center and the range, we use three values for $\rho$. The size and power are evaluated as the rejection rate among the simulated replicates at a fixed significance level $\alpha$, and the permutation-based null distributions are generated from a fixed number of permutations. For the sample size $(m, n)$, we consider the following four cases: (30, 30), (30, 120), (50, 50), (50, 200).
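A data generator for this kind of setting might look like the sketch below. Several pieces are assumptions made here for illustration: the mapping from the simulated pair to interval bounds (the second coordinate is exponentiated so that the half-range is positive), the function name `simulate_intervals`, and the values of the shift, the correlation, and the sample sizes used in the usage lines.

```python
import numpy as np


def simulate_intervals(size, delta, rho, rng, dist="normal", df=5):
    """Draw `size` intervals whose (center, range) pair is bivariate normal or t."""
    mean = np.array([delta, 0.0])                  # shift applies to the center only
    cov = np.array([[1.0, rho], [rho, 1.0]])
    z = rng.multivariate_normal(np.zeros(2), cov, size=size)
    if dist == "t":
        # Multivariate t: divide each normal draw by sqrt(chi-square / df).
        w = rng.chisquare(df, size=size) / df
        z = z / np.sqrt(w)[:, None]
    cr = mean + z
    center = cr[:, 0]
    half_range = np.exp(cr[:, 1])  # ASSUMPTION: exponentiate to force a positive half-range
    return np.column_stack([center - half_range, center + half_range])


rng = np.random.default_rng(2024)
x = simulate_intervals(30, delta=0.0, rho=0.4, rng=rng)   # hypothetical null population
y = simulate_intervals(120, delta=0.5, rho=0.4, rng=rng)  # hypothetical shifted population
```

Each returned row holds the lower and upper bound of one interval, which is the layout a pairwise comparison of bounds would consume.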

Table 1 shows some interesting findings with regard to the proposed U-test. First, the power of our U-test is higher than that of the one-sided K-S test in every setting under consideration. Second, for the U-test, the powers based on the permutation method and on the asymptotic result are almost the same in all cases, which supports the asymptotic result and its accuracy. Third, the greater the correlation between center and range, the higher the power of each test. This phenomenon can be explained using the Mahalanobis distance between the two mean vectors under the null and the alternative. The distance is $\delta / \sqrt{1 - \rho^2}$, which is increasing in $|\rho|$, so a stronger correlation places the two populations further apart.
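The dependence of the distance on the correlation can be checked with a short derivation. It assumes the mean vectors $\mu_1 = (0, 0)$ and $\mu_2 = (\delta, 0)$ with $\delta \ge 0$ and the common covariance matrix with unit variances and correlation $\rho$ from the simulation setting; the value $\rho = 0.6$ in the last line is purely illustrative.

```latex
\begin{aligned}
\Sigma^{-1} &= \frac{1}{1-\rho^{2}}
  \begin{pmatrix} 1 & -\rho \\ -\rho & 1 \end{pmatrix}, \\
d^{2}(\mu_1, \mu_2) &= (\mu_2 - \mu_1)^{\top} \Sigma^{-1} (\mu_2 - \mu_1)
  = \frac{\delta^{2}}{1-\rho^{2}}, \\
d(\mu_1, \mu_2) &= \frac{\delta}{\sqrt{1-\rho^{2}}}, \qquad
\text{e.g. } \rho = 0 \Rightarrow d = \delta, \quad
\rho = 0.6 \Rightarrow d = 1.25\,\delta .
\end{aligned}
```

Because only the first coordinate of the mean shifts, the distance depends on the correlation through the top-left entry of the inverse covariance matrix alone.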

## 5 Data example

In this section, we apply the stochastic order tests to a real dataset. The data are obtained from the National Heart, Lung, and Blood Institute Growth and Health Study (NGHS), a multi-year cohort study designed to evaluate the temporal trends of cardiovascular risk factors, such as systolic and diastolic blood pressures (SBP, DBP), based on annual visits of 2,379 African-American and Caucasian girls. The blood pressure (BP) data, which are measured at two levels, are an example of MM-type interval-valued data. In this analysis, we only use BP measurements at the first visit and remove subjects with missing values, which leaves two groups of subjects, Caucasian and African-American girls. The goal of this application is to test the hypothesis “the BP of African-American girls is stochastically greater than that of Caucasian girls”.

Table 2 shows that SBP, DBP, and their center are significantly higher in African-American girls than in Caucasian girls. Meanwhile, there is no significant difference in the range between the two groups. These results are very similar to the setting of the numerical study, where the centers of the two groups differ but the ranges are similar.

Now, we test whether the BP of African-American girls is stochastically greater than that of Caucasian girls based on the interval-valued data, instead of the marginal distributions. Table 3 presents the results of the tests compared in the previous section. In all tests, the p-values are smaller than 0.001, which supports the conclusion that the BP of African-American girls is stochastically greater than that of Caucasian girls.

## 6 Conclusion

In this paper, we introduce the notion of stochastic order between two samples of interval-valued data and propose a test statistic based on a U-statistic. We derive the asymptotic null distribution of the proposed statistic. The numerical study shows that the asymptotic distribution approximates the null distribution accurately, even with small sample sizes. Also, the proposed test has higher power than the one-sided bivariate K-S test in all cases we consider. Therefore, the procedure proposed in this paper is useful for testing the order of interval-valued data.

## Notes

This manuscript is an English version of an article written in Korean and accepted at The Korean Journal of Applied Statistics.

## Acknowledgements

We would like to show our gratitude to two anonymous reviewers and the editor of The Korean Journal of Applied Statistics for their detailed and instructive comments.

## References

• Blanco-Fernández, A. and Winker, P. (2016). Data generation processes and statistical management of interval data. AStA Advances in Statistical Analysis, 100(4), 475-494.
• Feller, W. (1948). On the Kolmogorov-Smirnov limit theorems for empirical distributions. The Annals of Mathematical Statistics, 19(2), 177-189.
• Gail, M. and Green, S. (1976). Critical values for the one-sided two-sample Kolmogorov-Smirnov statistic. Journal of the American Statistical Association, 71(355), 757-760.
• Lehmann, E.L. (1999). Elements of Large Sample Theory. Springer.
• Shaked, M. and Shanthikumar, J.G. (2006). Stochastic Orders. Springer.