Multivariate confidence intervals for R
Confidence intervals are a popular way to visualize and analyze data distributions. Unlike p-values, they can convey information about both statistical significance and effect size. However, very little work exists on applying confidence intervals to multivariate data. In this paper we define confidence intervals for multivariate data that extend the one-dimensional definition in a natural way. In our definition every variable is associated with its own confidence interval as usual, but a data vector can be outside a few of these and still be considered to be within the confidence area. We analyze the problem and show that the resulting confidence areas retain the good qualities of their one-dimensional counterparts: they are informative and easy to interpret. Furthermore, we show that the problem of finding multivariate confidence intervals is computationally hard, but provide efficient approximate algorithms for solving it.
Confidence intervals are a natural and commonly used way to summarize a distribution over real numbers. In informal terms, a confidence interval is a concise way to express what values a given sample mostly contains. They are used widely, e.g., to denote ranges of data, to specify accuracies of parameter estimates, or in Bayesian settings to describe the posterior distribution. A confidence interval is given by two numbers, the lower and upper bound, and parametrized by the percentage of probability mass that lies within the bounds. Confidence intervals are easy to interpret, because they can be represented visually together with the data, and they convey information about both the location and the variance of a sample.
There is a plethora of work on how to estimate the confidence interval of a distribution based on a finite-sized sample from that distribution, see [7] for a summary. However, most of these approaches focus on describing a single univariate distribution over real numbers. Also, the precise definition of a confidence interval varies slightly across disciplines and application domains.
In this paper we focus on confidence areas: a generalization of univariate confidence intervals to multivariate data. All our intervals and areas are such that they describe ranges of data and are of minimal width. In other words, they contain a given fraction of the data within their bounds and are as narrow as possible. By choosing a confidence area with minimal size we essentially locate the densest part of the distribution. Such confidence areas are particularly effective for visually showing trends, patterns, and outliers.
Considering the usefulness of confidence intervals, it is surprising how little work exists on applying confidence intervals to multivariate data [11]. In multivariate statistics confidence regions are a commonly used approach, see e.g., [6], but these methods usually require making assumptions about the underlying distribution. Moreover, unlike confidence areas, most confidence regions cannot be described simply with an upper and a lower bound, e.g., confidence regions for multivariate Gaussian data are ellipsoids. Thus, these approaches do not fully capture two useful characteristics of one-dimensional confidence intervals: a) easy interpretability and b) lack of assumptions about the data.
The simplest approach to construct multivariate confidence intervals is to compute one-dimensional intervals separately for every variable. While this naive approach satisfies conditions a) and b) above, it is easy to see how it fails in general. Assume, e.g., that we have 10 independently distributed variables, and have computed for each variable a 90% confidence interval. This means that when considering every variable individually, only 10% of the distribution lies outside of the respective interval, as desired. But taken together, the probability that an observation is outside at least one of the confidence intervals is as much as 1 - 0.9^10, i.e., about 65%. This probability, however, depends strongly on the correlations between the variables. If the variables were perfectly correlated, with a correlation coefficient of 1, the probability of an observation being outside all confidence intervals would again be only 10%. In general the correlation structure of the data affects, in a strong and hard-to-predict way, how the naive confidence intervals should be interpreted in a multivariate setting.
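The arithmetic behind this observation can be checked with a few lines of code; the function below is an illustrative sketch (the name and the 90% default are ours, not from the paper).

```python
# Naive per-variable intervals: probability that an observation with d
# independent coordinates falls outside at least one of the d marginal
# confidence intervals.
def prob_outside_any(d, coverage=0.9):
    """Each coordinate lies inside its own interval with probability `coverage`."""
    return 1.0 - coverage ** d

# For d = 10 and 90% intervals this is 1 - 0.9**10, roughly 0.65.
```

With perfectly correlated variables the events coincide, and the probability stays at 1 - coverage, which is the dependence on correlation structure described above.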
Ideally a multivariate confidence area should retain the simplicity of a one-dimensional confidence interval. It should be representable by upper and lower bounds for each variable and the semantics should be the same: each observation is either inside or outside the confidence area, and most observations of a sample should be inside. Particularly important is specifying when an observation in fact is within the confidence area, as we will discuss next.
Confidence areas for time series data have been defined [9, 11] in terms of the minimum width envelope (MWE) problem: a time series is within a confidence area if it is within the confidence interval of every variable. While this definition has desirable properties, it can result in very conservative confidence areas if there are local deviations from what constitutes “normal” behavior. The definition in [11] is in fact too strict in requiring an observation to be contained in all variable-specific intervals.
Thus, here we propose an alternative way to define the confidence area: a data vector is within the confidence area if it is outside the variable-specific confidence intervals at most k times, where k is a given integer. This formulation preserves easy interpretability: the user knows that any observation within the confidence area is guaranteed to violate at most k variable-specific confidence intervals. In the special case k = 0, this new definition coincides with the MWE problem [11]. The following examples illustrate further properties and uses of the new definition.
The left panel of Fig. 1 presents a number of time series over a number of time points, shown in light gray, together with three different types of 90% confidence intervals, shown in green, purple and orange, respectively. In this example, “normal” behavior of the time series is exemplified by the black dash-dotted curve that exhibits no noteworthy fluctuation over time. Two other types of time series are shown also in black: a clear outlier (dashed black line), and a time series that exhibits normal behavior most of the time, but strongly deviates from this for a brief moment (solid black line).
In the situation shown, we would like the confidence interval to only show what constitutes “normal” behavior, i.e., not be affected by strong local fluctuations or outliers. Such artifacts can be caused, e.g., by measurement problems, or other types of data corruption. Alternatively, such behavior can also reflect some interesting anomaly in the data generating process, and this should not be characterized as “normal” by the confidence interval. In Fig. 1 (left) the confidence area based on MWE [11] is shown by the green lines; recall that the MWE interval corresponds to setting k = 0. While it is unaffected by the outlier, it strongly reacts to local fluctuations. The naive confidence area, i.e., one where we have computed confidence intervals for every time point individually, is shown in purple. It is also affected by local peaks, albeit less than the k = 0 confidence area. Finally, the proposed method is shown in orange. The area is computed using k = 25, i.e., a time series is still within the confidence area as long as it lies outside in at most 25 time points. This variant focuses on what we would expect to be normal behavior in this case. Our new definition of a confidence area is thus nicely suited for finding local anomalies in time series data.
On the other hand, the right panel of Fig. 1 shows an example where we focus only on two of the time points indicated in the left panel. The first of these resides in the region with a strong local fluctuation, while at the second there are no anomalies. According to our definition, an observation, in this case a two-dimensional point, is within the confidence area if it is outside the variable-specific confidence intervals at most k times. We have computed two confidence areas using our method, one with k = 0 (green), and another with k = 1 (orange), as well as the naive confidence intervals (purple).
For k = 0, we obtain the green rectangle in Fig. 1 (right panel). The variable-specific confidence intervals have been chosen so that the green rectangle contains 90% of the data and the sum of the widths of the confidence intervals (sides of the rectangle) is minimized. For k = 1, we obtain the orange “cross”-shaped area. The cross shape follows from allowing an observation to exceed the variable-specific confidence interval in one of the two dimensions. Again, the orange cross contains 90% of all observations, and has been chosen by greedily minimizing the sum of the lengths of the respective variable-specific confidence intervals. It is easy to see that with k = 0, i.e., when using the MWE method [11], the observations do not occur evenly in the resulting confidence area (green rectangle). Indeed, the top right and bottom right parts of the rectangle are mostly empty. In contrast, with k = 1, the orange cross-shaped confidence area is a much better description of the underlying data distribution, as there are no obvious “empty” parts. Our novel confidence area is thus a better representation of the underlying data distribution than the existing approaches.
The basic problem definition we study in this paper is straightforward: for d-dimensional data and the parameters α and integer k, find a confidence interval for each of the d variables so that a fraction 1 - α of the observations lies within the confidence area, the sum of the lengths of the intervals is minimized, and an observation may break at most k of the variable-specific confidence intervals. We make the following contributions in this paper:
We formally define the problem of finding a multivariate confidence area, where observations have to satisfy most but not all of the variable-specific confidence intervals.
We analyze the computational complexity of the problem, and show that it is NP-hard, but admits an approximation algorithm based on a linear programming relaxation.
We propose two algorithms, which produce good confidence areas in practice.
We conduct experiments demonstrating various aspects of multivariate confidence intervals.
The rest of this paper is organized as follows: Related work is discussed in Sec. 2. We define the ProbCI and CombCI problems formally in Sec. 3. In Sec. 4 we study theoretical properties of the proposed confidence areas, as well as study problem complexity. Sec. 5 describes algorithms used in experiments in Sec. 6. Finally, Sec. 7 concludes this work.
Confidence intervals have recently gained more popularity, as they convey information about both the statistical significance of a result and the effect size. In contrast, p-values give information only about statistical significance: it is possible to have statistically significant results that are meaningless in practice due to a small effect size. The problem has long been acutely recognized, e.g., in medical research [5]. Some psychology journals have recently banned the use of p-values [18, 16]. The proposed solution is not to report p-values at all, but to use confidence intervals instead [14].
Simultaneous confidence intervals for time series data have been proposed [9, 11]. These correspond to the confidence areas in this paper when k = 0. The novelty here is the generalization of the confidence areas to allow k outlying dimensions (k > 0), similarly to [17], and the related theoretical and algorithmic contributions, allowing for narrower confidence intervals and, in some cases, more readily interpretable results. Simultaneous confidence intervals with k = 0 were considered in [2, p. 154], and studied in [13] by using the most extreme value within a data vector as a ranking criterion. Other examples of procedures of this type include [12, 15]. In the field of information visualization and the visualization of time series, confidence areas have been used extensively; see [1] for a review. An interesting approach is to construct simultaneous confidence regions by inverting statistical multiple hypothesis testing methods, see e.g., [6].
The approach presented in this paper allows some dimensions of an observation to be partially outside the confidence area. This is similar in spirit, but not equivalent, to the false discovery rate (FDR) in multiple hypothesis testing, which also allows a controlled fraction of positives to be “false positives”. In comparison, the approach in [11] is more akin to the family-wise error rate (FWER), which controls the probability of at least one false discovery.
A data vector x is a vector in R^d, and x_i denotes the value of x in the ith position. Let the matrix X be a dataset of n data vectors x^(1), ..., x^(n), i.e., the rows of X. We start by defining the confidence area, its size, and the envelope for a dataset X.

Given X, a confidence area for X is a pair of vectors (l, u), composed of lower and upper bounds satisfying l_i <= u_i for all i = 1, ..., d. The size of the confidence area is A(l, u) = sum_i (u_i - l_i), where u_i - l_i is the width of the confidence interval w.r.t. the ith position. The envelope of X is the confidence area denoted by (l*, u*), where l*_i = min_j x^(j)_i and u*_i = max_j x^(j)_i.
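These definitions translate directly into code; the following sketch (function names are ours) computes the envelope and the size of a confidence area for a dataset given as a matrix with one data vector per row.

```python
import numpy as np

def envelope(X):
    """Envelope of the data: per-variable minimum and maximum."""
    return X.min(axis=0), X.max(axis=0)

def area_size(l, u):
    """Size of a confidence area: sum of the per-variable interval widths."""
    return float(np.sum(u - l))

# Example: three data vectors in R^2.
X = np.array([[0.0, 1.0],
              [2.0, 3.0],
              [1.0, -1.0]])
l, u = envelope(X)   # l = [0, -1], u = [2, 3]
```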
We define the error of a data vector with respect to a confidence area, which in turn determines the number of outlying data vectors, as follows.

Let x be a data vector in R^d and (l, u) a confidence area. The error of x given the confidence area is defined as

err_k(x; l, u) = I[ sum_{i=1}^d I[x_i < l_i or x_i > u_i] > k ].

The indicator function I[.] is unity if the condition is satisfied and zero otherwise.
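A sketch of the error computation, with k made an explicit argument (naming is ours):

```python
import numpy as np

def n_violations(x, l, u):
    """Number of positions of x outside the variable-specific intervals."""
    x, l, u = np.asarray(x), np.asarray(l), np.asarray(u)
    return int(np.sum((x < l) | (x > u)))

def err(x, l, u, k):
    """Error of x: 1 if x violates more than k of the intervals, else 0."""
    return 1 if n_violations(x, l, u) > k else 0
```

A vector violating exactly two intervals is thus inside a k = 2 confidence area but outside a k = 1 one.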
The main problem in this paper is as follows.
Given α in [0, 1], an integer k, and a distribution D over R^d, find a confidence area (l, u) that minimizes the size A(l, u) subject to the constraint

(3.1)    Pr_{x ~ D}[ err_k(x; l, u) = 1 ] <= α.
For this problem definition to make sense, the variables, or at least their scales, must be comparable. Otherwise variables with high variance will dominate the confidence areas. Therefore, suitable preprocessing, such as normalization of the variables, should be applied before solving for the confidence area and interpreting it.
The combinatorial problem is defined as follows.
Given integers k and m, and n data vectors x^(1), ..., x^(n) in R^d, find a confidence area (l, u) by minimizing the size A(l, u) subject to the constraint

(3.2)    sum_{j=1}^n err_k(x^(j); l, u) <= m.
Any confidence area satisfying (3.2) is called a (k, m)-confidence area. In the special case with k = 0, Problems 3.1 and 3.2 coincide with the minimum width envelope problem from [11]. The problem definition with non-vanishing k is, to the best of our knowledge, novel. The relation of Problems 3.1 and 3.2 is discussed below.
It is instructive to consider the behavior of Problem 3.1 with the uniform distribution over [0, 1]^d. We show that a solution may contain very narrow confidence intervals, and we discuss the number of data vectors required to estimate a confidence area with a desired level of α.

Consider Problem 3.1 in the simple case of two-dimensional data with d = 2, k = 1, and α = 0.1. In this case an optimal solution to Problem 3.1 is given by confidence intervals with widths u_1 - l_1 = 0 and u_2 - l_2 = 0.9, resulting in a confidence area of size 0.9: every vector violates the degenerate first interval, but violates more than k = 1 intervals only when it is also outside the second interval, which happens with probability 0.1. A solution with confidence intervals of equal width w, i.e., with (1 - w)^2 = 0.1 and hence w = 1 - sqrt(0.1), would lead to a substantially larger area of about 1.37. As this simple example demonstrates, if the data is unstructured or if some of the variables have unusually high variance, we may obtain solutions where some of the confidence intervals are very narrow. In the case of the uniform distribution the choice of the variable with a zero-width confidence interval is arbitrary: in the example above we could as well have chosen u_1 - l_1 = 0.9 and u_2 - l_2 = 0. Such narrow intervals are not misleading, because they are easy to spot: for a narrow interval, such as the one discussed in this paragraph, only a small fraction of the values of the variable lie within it. In real data sets with non-trivial marginal distributions and correlation structure the confidence intervals are often of roughly similar width.
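The two-dimensional uniform example can be verified numerically; the sketch below assumes the concrete values α = 0.1 and k = 1 and estimates by Monte Carlo the fraction of observations violating more than k intervals.

```python
import random, math

random.seed(0)

# Unequal solution: a degenerate interval on x1 and a width-0.9 interval on x2.
l, u, k = (0.5, 0.05), (0.5, 0.95), 1
size_unequal = (u[0] - l[0]) + (u[1] - l[1])   # 0.0 + 0.9

def violates_more_than_k(pt):
    v = sum(1 for xi, li, ui in zip(pt, l, u) if xi < li or xi > ui)
    return v > k

n = 100_000
alpha_hat = sum(
    violates_more_than_k((random.random(), random.random())) for _ in range(n)
) / n   # close to 0.1: x1 is (almost) always outside, x2 with probability 0.1

# Equal-width alternative: (1 - w)^2 = 0.1, so w = 1 - sqrt(0.1).
size_equal = 2 * (1.0 - math.sqrt(0.1))   # about 1.37
```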
Next we consider the behavior of the problem at the limit of high-dimensional data.
The solution with confidence intervals of equal width, each leaving a fraction p of the values of its variable outside, corresponds to

(4.3)    α = 1 - F(k; d, p),

where F(k; d, p) is the cumulative binomial distribution, i.e., the probability of at most k successes in d independent trials with success probability p.
If n vectors are sampled from the uniform distribution over [0, 1]^d, then the expected width of the envelope for each variable is (n - 1)/(n + 1). The probability of more than k variables of a new vector from the same distribution being outside the envelope is given by Eq. (4.3) with p = 2/(n + 1).
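Both quantities are easy to evaluate; a sketch with the cumulative binomial written out explicitly (function names are ours):

```python
from math import comb

def binom_cdf(k, d, p):
    """F(k; d, p): probability of at most k successes in d independent trials."""
    return sum(comb(d, i) * p**i * (1 - p) ** (d - i) for i in range(k + 1))

def alpha_equal_width(k, d, p):
    """Eq. (4.3): probability that more than k of d variables fall outside
    their intervals, when each is outside independently with probability p."""
    return 1.0 - binom_cdf(k, d, p)

# Envelope of n uniform samples: each variable is outside with p = 2/(n + 1).
n, d = 99, 50
alpha_k0 = alpha_equal_width(0, d, 2 / (n + 1))   # equals 1 - (1 - p)^d
```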
One implication of Lemma 4.2 is that there is a limit to the practical accuracy that can be reached with a finite number of samples. The smallest α we can hope to realistically reach is the one implied by the envelope of the data, unless we make some distributional assumptions about the shape of the distribution outside the envelope. Conversely, the above lemmas define a minimum number of samples needed, for uniform data, to find the confidence area for a desired level of α.
In the case of k = 0, to reach an accuracy of α we need 1 - (1 - 2/(n + 1))^d <= α, or approximately n >= 2d/α, where we have ignored higher order terms in 1/n and α. For typical choices of α and d this implies that a substantial number of data vectors is needed to estimate the confidence area; the number of data vectors needed increases linearly with the dimensionality d.
On the other hand, for a given α, if we let k grow proportionally to d, say k = cd for a constant c > 0, a solution where the per-variable probability p of being outside the envelope stays below c is asymptotically sufficient when the dimensionality d is large, in which case the number of data vectors required is about 2/c, independently of d. For a value of c = 0.1 only about 20 data vectors would therefore be needed even for large d. As hinted by the toy example in Fig. 1, a non-vanishing parameter k leads, at least in this example, to substantially narrower confidence intervals and, hence, makes it possible to compute the confidence intervals with smaller data sets.
We continue by briefly addressing the computational properties of the CombCI problem. The proofs are provided in the appendix.
CombCI is NP-hard for all k >= 0.
For k = 0 the result directly follows from [11, Theorem 3], and for k >= 1 a reduction from Vertex-Cover can be used.
Now, there exists a polynomial time approximation algorithm for solving a variant of CombCI in the special case of k = 1. In particular, we consider here a one-sided confidence area, defined only in terms of the upper bound u; the lower bound l is assumed to be fixed, e.g., at zero, or any other suitable value. This complements the earlier result that the complement of the objective function of CombCI is hard to approximate for k = 0 [11, Theorem 3].
There is a polynomial time approximation algorithm for the one-sided CombCI problem with k = 1.
The proof uses a linear programming relaxation of an integer linear program corresponding to the variant of the one-sided CombCI; the approximation ratio is obtained from the solution given by the relaxation.
Finding a bound that does not depend on the dimensionality is an interesting open question, as is extending the proof to the two-sided CombCI. It is unlikely that the problem admits an approximation bound better than 2, since this would immediately result in a new approximation algorithm for the Vertex-Cover problem. This is because in the proof of Theorem 4.2 we describe a simple reduction from Vertex-Cover to CombCI with k = 1 and m = 0. This reduction preserves approximability, as it maps the Vertex-Cover instance to an instance of our problem in a straightforward manner. For Vertex-Cover it is known that approximation ratios below 2 are hard to obtain in the general case; indeed, the best known upper bound is 2 - Θ(1/sqrt(log n)) [8].
We present two algorithms for computing (k, m)-confidence areas.
Our first method is based on minimum intervals. A standard approach to define a confidence interval for univariate data is to consider the minimum length interval that contains a fraction 1 - α' of the observations. This can be generalized for multivariate data by treating each variable independently, i.e., for a given α', we set l_i and u_i equal to the lower and upper limits of such a minimum length interval for every i. The mi algorithm solves the CombCI problem for given k and m by adjusting the parameter α' such that the resulting (l, u) satisfies the constraint in Eq. (3.2) in the training data set. The resulting confidence area may have a larger size than the optimal solution, since all variables use the same α'. Note that the mi solution differs from the naive solution mentioned in the introduction, because the naive solution does not adjust α' but simply sets it to α. The time complexity of mi with binary search for the correct α' is modest: each variable is sorted once, and each search step then requires only a linear pass over the data.
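The following is a minimal sketch of the mi idea under our reading of the text: per-variable minimum-length intervals plus a binary search on a shared coverage parameter. It is not the authors' implementation; the parameterization and stopping rule are assumptions.

```python
import numpy as np

def min_length_interval(vals, coverage):
    """Shortest interval containing ceil(coverage * n) of the values."""
    s = np.sort(vals)
    m = max(1, int(np.ceil(coverage * len(s))))
    widths = s[m - 1:] - s[: len(s) - m + 1]
    i = int(np.argmin(widths))
    return s[i], s[i + m - 1]

def mi(X, k, m_out, iters=30):
    """Binary-search the shared per-variable coverage so that at most
    m_out training vectors violate more than k variable-specific intervals."""
    lo, hi, best = 0.0, 1.0, None
    for _ in range(iters):
        cov = (lo + hi) / 2
        bounds = [min_length_interval(X[:, j], cov) for j in range(X.shape[1])]
        l = np.array([b[0] for b in bounds])
        u = np.array([b[1] for b in bounds])
        n_bad = int(np.sum(((X < l) | (X > u)).sum(axis=1) > k))
        if n_bad <= m_out:       # feasible: try narrower intervals
            best, hi = (l, u), cov
        else:                    # infeasible: widen the intervals
            lo = cov
    return best
```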
Our second method is a greedy algorithm. The greedy search could be done either bottom-up (starting from an empty set of included data vectors and adding vectors one by one) or top-down (starting from all n data vectors and excluding vectors one by one). Since typically m is smaller than n, we consider here a top-down greedy algorithm.
The idea of the greedy algorithm is to start from the envelope of the whole data and sequentially find vectors to exclude, by selecting at each iteration the vector whose removal reduces the envelope by the largest amount. In order to find the envelope w.r.t. the relaxed condition allowing k positions from each vector to be outside, at each iteration the set of included data points needs to be (re)computed. This is done with a greedy algorithm solving the CombCI problem for m = 0: one removes the individual points that result in the maximal decrease in the envelope size, so that at most k points from each data vector are removed, thus obtaining the envelope w.r.t. the relaxed criterion. After this envelope has been computed, the data vector whose exclusion yields the maximal decrease in the size of the confidence area is excluded. For this, a slightly modified variant of the greedy MWE algorithm from [11] is used. After m data vectors have been excluded, the final set of points included in the envelope is returned. The time complexity of gr is considerably higher than that of mi, since the relaxed envelope must be recomputed at every iteration.
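A simplified sketch of the top-down greedy idea, restricted to k = 0 for brevity (the inner k-relaxation described above is omitted); at each step it excludes the vector whose removal shrinks the plain envelope the most.

```python
import numpy as np

def greedy_envelope(X, m):
    """Top-down greedy for k = 0: start from the envelope of all data and
    exclude, one at a time, the m vectors whose removal shrinks it most."""
    keep = list(range(len(X)))
    for _ in range(m):
        sizes = []
        for i in keep:
            rest = X[[j for j in keep if j != i]]
            sizes.append(float((rest.max(axis=0) - rest.min(axis=0)).sum()))
        keep.pop(int(np.argmin(sizes)))   # drop the most envelope-shrinking vector
    Y = X[keep]
    return Y.min(axis=0), Y.max(axis=0)
```

On data with a single gross outlier, the first excluded vector is the outlier itself.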
We present here empirical evaluation of the algorithms. In the following mi and gr refer to the Minimum intervals and Greedy algorithm, respectively.
We run experiments using various datasets that reflect different properties of interest from the point of view of fitting confidence areas. In particular, we want to study the effects of correlation (autocorrelation in the case of time series or regression curves), the number of variables, and distributional qualities.
Artificial data. We use artificial multivariate (i.i.d. variables) datasets sampled from the uniform distribution (in the interval [0, 1]), the standard normal distribution, and the Cauchy distribution (location parameter 0, scale parameter 1), with varying n and d, to study some theoretical properties of multivariate confidence areas.

Kernel regression data. We use the Ozone and South African heart disease (SA heart) datasets (see, e.g., [4]) to compute confidence areas for bootstrapped kernel regression estimates. We use a simple Nadaraya-Watson kernel regression estimate to produce the data vectors, and fit confidence intervals to these using our algorithms. By changing the number of bootstrap samples we can vary n, and by altering the number of points in which the estimate is computed we can vary d.
Stock market data. We obtained daily closing prices of stocks for the years 2011–2015. The time series are normalized to reflect the absolute change in stock price with respect to the average price of the first five trading days in the data. The data is shown in Fig. 5.
Both of our algorithms solve the CombCI problem (Problem 3.2), in which we must specify the number m of vectors that are allowed to violate the confidence area. To obtain a matching confidence level α in ProbCI (Definition 3.3) we must choose the parameter m appropriately. We study here how this can be achieved.
Fig. 2 shows the confidence level α observed in test data as a function of the parameter α' used in training, for data with independent normally distributed variables and four combinations of n and k. The dashed green line shows the ideal situation α = α'. (This is a desirable property, as it means that fine-tuning α' is not necessary to reach some particular value of α.) We observe from Fig. 2 that when the training data is relatively small, for a given α' both gr and mi tend to find confidence areas that are somewhat too narrow for the test data, leading to α > α'. To obtain some particular α, we must thus set α' to a lower value. As the number of training examples increases, both algorithms move closer to the ideal situation of α = α'. Interestingly, this happens also when k increases. The relaxed notion of confidence area we introduce in this paper is thus somewhat less prone to overfitting when compared against the basic confidence areas with k = 0 of [11]. On the other hand, for larger k we also observe how gr starts to “underfit” as n increases, meaning that we have α < α'. Errors in this direction, however, simply mean that the resulting confidence area is conservative and will satisfy the given α by a margin.
The dependency between α and α', and the other observations made above, are qualitatively identical for uniform and Cauchy distributed data (not shown).
Note that the value of α that we compute for a given confidence area also depends on the value of k used in the evaluation. We continue by showing how α depends on the value of k used when evaluating the confidence area on test data. In this experiment we train confidence areas using the Ozone data and adjust α' so that α = α' for the value of k given to our algorithms (mi, gr) when solving Prob. 3.2. Then we estimate α for a range of evaluation-time values k' using Eq. (3.1) and a previously unseen test data set.
Results are shown in Fig. 3. We find that in every case the line for a given value of k intersects the thin dashed (red) line (indicating α = α') at the correct value k' = k. More importantly, however, we also find that k' has a very strong effect on the observed α. This means that if we relax the condition under which an example still belongs to the confidence area (i.e., increase k'), α drops at a very fast rate, so that a confidence area trained for a particular value of k will be rather conservative when evaluated using k' > k. Obviously the converse holds as well, i.e., when k' < k, the confidence area will become much too narrow very fast. This implies that k should be set conservatively (to a low value) when it is important to control the false positive rate, e.g., when the “true” number of noisy dimensions is unknown.
Next we briefly compare the mi and gr algorithms in terms of the confidence area size (given as A(l, u)/d, where d is the number of variables) and running time (in seconds). Results for the artificial data sets as well as the two regression model data sets are shown in Table 1. The confidence level (in test data) was set to the same target α in every case, and all training data sets were of equal size. All numbers are averages over 25 trials. We can observe that mi tends to produce confidence areas that are consistently somewhat smaller than those found by gr. Also, mi is substantially faster, albeit our implementation of gr is by no means optimized. Finally, the column α' shows the confidence level that was used when fitting the confidence area to obtain α. Almost without exception we have α' < α for both algorithms, with mi usually requiring a slightly larger α' than gr.
Also worth noting is that for extremely skewed distributions, e.g., Cauchy, the confidence area shrinks rapidly as k is increased from zero.

dataset | k | d | mi: size/d | time (s) | α' | gr: size/d | time (s) | α' |
---|---|---|---|---|---|---|---|---|
unif | 0 | 10 | 1.0 | 0.1 | 0.06 | 1.0 | 0.7 | 0.07 |
unif | 2 | 10 | 0.9 | 0.2 | 0.08 | 0.9 | 14.0 | 0.11 |
unif | 10 | 100 | 0.9 | 0.7 | 0.05 | 0.9 | 32.1 | 0.03 |
unif | 50 | 500 | 0.9 | 2.1 | 0.03 | 0.9 | 220.2 | 0.03 |
norm | 0 | 10 | 5.1 | 0.1 | 0.07 | 5.3 | 0.8 | 0.07 |
norm | 2 | 10 | 3.2 | 0.2 | 0.09 | 3.2 | 12.0 | 0.09 |
norm | 10 | 100 | 3.6 | 0.8 | 0.06 | 3.6 | 33.1 | 0.03 |
norm | 50 | 500 | 3.5 | 2.1 | 0.03 | 3.5 | 209.7 | 0.03 |
Cauchy | 0 | 10 | 262.3 | 0.1 | 0.08 | 325.1 | 0.7 | 0.07 |
Cauchy | 2 | 10 | 21.8 | 0.1 | 0.09 | 25.6 | 10.7 | 0.08 |
Cauchy | 10 | 100 | 36.8 | 0.7 | 0.07 | 38.5 | 31.4 | 0.03 |
Cauchy | 50 | 500 | 30.5 | 2.2 | 0.05 | 31.5 | 203.3 | 0.03 |
ozone | 0 | 10 | 5.1 | 0.1 | 0.08 | 5.1 | 0.8 | 0.07 |
ozone | 2 | 10 | 3.4 | 0.1 | 0.09 | 3.5 | 9.0 | 0.09 |
ozone | 5 | 50 | 4.2 | 0.3 | 0.09 | 4.3 | 11.4 | 0.05 |
ozone | 15 | 50 | 3.0 | 0.3 | 0.09 | 3.1 | 48.6 | 0.06 |
SA heart | 0 | 10 | 5.3 | 0.1 | 0.06 | 5.4 | 0.7 | 0.06 |
SA heart | 2 | 10 | 3.3 | 0.2 | 0.08 | 3.7 | 15.5 | 0.14 |
SA heart | 5 | 50 | 4.3 | 0.2 | 0.07 | 4.5 | 8.6 | 0.04 |
SA heart | 15 | 50 | 2.9 | 0.3 | 0.08 | 3.3 | 66.7 | 0.06 |
We can model the accuracy of an estimate given by a regression model by resampling the data points, e.g., by the bootstrap method, and then refitting the model for each of the resampled data sets [3]. The estimation accuracy or spread of values for a given independent variable can be readily visualized using confidence intervals.
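A sketch of this pipeline, assuming a Gaussian kernel and a fixed bandwidth (neither is specified here, so both are our assumptions); the rows of the returned matrix are the data vectors to which a confidence area can then be fitted.

```python
import numpy as np

def nw_estimate(x, y, grid, bandwidth=0.3):
    """Nadaraya-Watson estimate with a Gaussian kernel, evaluated on `grid`."""
    w = np.exp(-0.5 * ((grid[:, None] - x[None, :]) / bandwidth) ** 2)
    return (w @ y) / w.sum(axis=1)

def bootstrap_curves(x, y, grid, n_boot=200, seed=0):
    """Refit the estimate on n_boot resampled datasets; each row is one
    bootstrapped regression curve (a data vector of length len(grid))."""
    rng = np.random.default_rng(seed)
    curves = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(x), len(x))
        curves.append(nw_estimate(x[idx], y[idx], grid))
    return np.vstack(curves)
```

Varying n_boot varies n, and the grid resolution varies d, as described above.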
Fig. 4 shows examples of the different confidence intervals fitted to bootstrapped kernel regression estimates on two different datasets, using both mi and gr with k = 0 and k > 0. In both cases α' was adjusted so that the target confidence level α was reached in a separate test data set. We find that qualitatively there is very little difference between mi and gr when k = 0. For k > 0, gr tends to produce a somewhat narrower area. In general, this example illustrates the effect of k on the resulting confidence area in visual terms. By allowing the examples to lie outside the confidence bounds for a few variables (k > 0) we obtain substantially narrower confidence areas that still reflect the overall data distribution very well.
The visualization for the stock market data in Fig. 5 has been constructed using the gr algorithm, with the parameters chosen so that a stock may deviate from the variable-specific confidence intervals in at most 10% of the trading days. The figure shows in yellow the stocks that are clear outliers, i.e., among the anomalously behaving observations ignored in the construction of the confidence area. The remaining stocks (shown in blue) remain within the variable-specific confidence intervals at least 90% of the time, but are allowed to deviate from them the remaining 10% of the time. Fig. 5 shows several such stocks, one of them highlighted in red. By allowing these excursions, the confidence area is smaller and these potentially interesting deviations are easy to spot. E.g., the red line represents Mellanox Technologies; market analyses from fall 2012 report the stock being overpriced at that time. The black line in Fig. 5 represents Morningstar, an example staying inside the confidence area. If we do not allow any deviations outside the confidence intervals, i.e., we set k = 0, then the confidence area will be larger and such deviations might be missed.
The versatility of confidence intervals stems from their simplicity. They are easy to understand and interpret, and are therefore often used in the presentation and interpretation of multivariate data. The application of confidence intervals to multivariate data is, however, often done in a naive way, disregarding the effects of multiple variables. This may lead to false interpretations of results if the user is not careful.
In this paper we have presented a generalization of confidence intervals to multivariate data vectors. The generalization is simple and understandable and does not sacrifice the interpretability of one-dimensional confidence intervals. The confidence areas defined this way behave intuitively and offer insight into the data. The problem of finding confidence areas is computationally hard. We present two efficient algorithms to solve the problem and show that even a rather simple approach (mi) can produce very good results.
Confidence intervals are an intuitive and useful tool for visually presenting and analyzing data sets and for spotting trends, patterns, and outliers. Their advantage is that they simultaneously give information about both the statistical significance of a result and the size of the effect. In this paper, we have shown several examples demonstrating the usefulness of multivariate confidence intervals, i.e., confidence areas.
In addition to visual tasks, confidence intervals can also be used in automated algorithms as simple and robust distributional estimators. As the toy example of Fig. 1 shows, confidence areas that allow a few coordinate-wise deviations can be a surprisingly good distributional estimator if the data is sparse, i.e., if a majority of the variables are close to the mean and in each data vector only a small number of variables deviate significantly from the average.
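As an illustration, the following Python sketch (a toy construction of our own, not the paper's algorithms; all names and parameter values are ours) builds sparse data of this kind and counts how many vectors a tight set of per-coordinate intervals covers once a single out-of-interval coordinate is allowed per vector:

```python
import random

def covered(x, lo, hi, k):
    """A vector lies in the confidence area if it violates the
    per-coordinate intervals [lo[j], hi[j]] in at most k positions."""
    violations = sum(1 for j in range(len(x)) if not (lo[j] <= x[j] <= hi[j]))
    return violations <= k

# Sparse toy data: every coordinate is near zero except one random spike.
random.seed(0)
m = 20
data = []
for _ in range(200):
    x = [random.gauss(0, 0.1) for _ in range(m)]
    x[random.randrange(m)] += 5.0  # a single large deviation per vector
    data.append(x)

# Tight per-coordinate intervals that ignore the spikes ...
lo, hi = [-0.5] * m, [0.5] * m
# ... cover essentially all vectors once one violation per vector is allowed,
# and none of them if no violations are allowed.
inside_k1 = sum(covered(x, lo, hi, 1) for x in data)
inside_k0 = sum(covered(x, lo, hi, 0) for x in data)
print(inside_k1, inside_k0)
```

With no allowed violations the tight intervals cover no vector at all, so a conventional confidence area would have to expand to swallow every spike.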
With p-values there are established procedures to deal with multiple hypothesis testing. Indeed, a proper treatment of the multiple comparisons problem is required, e.g., in scientific reporting. Our contribution to the discussion about reporting scientific results is to point out that it is indeed possible to treat multidimensional data with confidence intervals in a principled way.
Sample quantiles in statistical packages. Am. Stat., 50(4):361–365, 1996.
Some new methods for the comparison of two linear regression models. J. Stat. Plan. Inference, 137(1):57–67, 2007.

The case k = 0 follows directly from [11, Theorem 3]. In the case k ≥ 1 we use a reduction from Vertex-Cover. In the Vertex-Cover problem we are given a graph G = (V, E) and an integer t, and the task is to cover every edge of G by selecting a subset of the vertices of G of size at most t. The reduction from Vertex-Cover to the CombCI problem maps every edge {v, w} ∈ E of the input graph to a |V|-dimensional vector that has the value 1 at positions v and w, and the value 0 at all other positions. Furthermore, we add two auxiliary vectors that force the lower bound of the confidence area to zero at every position. Moreover, we set k = 1, that is, every vector may be outside the confidence area in at most one position.
For the optimal confidence area, we must allow the first element of one auxiliary vector and the second element of the other to be outside the confidence area, resulting in a lower bound of zero at every position. Thus, it suffices to consider a variant of the problem where the input vectors are constrained to reside in {0, 1}^{|V|} and to seek an upper bound vector u ∈ {0, 1}^{|V|}. To minimize the size of the confidence area we need to minimize the number of 1s in u. We consider a decision variant of the CombCI problem where we must decide whether there exists a feasible u with at most t ones, for a given integer t. An optimal vertex cover is then obtained simply by selecting the vertices v with u_v = 1 in the optimal upper bound u.
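To make the correspondence concrete, the following Python sketch (our own brute-force code on a toy graph, with hypothetical helper names; k = 1 as in the reduction) checks that minimal 0/1 upper bounds coincide with minimum vertex covers:

```python
from itertools import product, combinations

def min_ones_upper_bound(edges, n, k=1):
    """Brute force over 0/1 upper-bound vectors u: u is feasible if every
    edge vector (ones at its two endpoints) exceeds u in at most k positions."""
    best = None
    for u in product([0, 1], repeat=n):
        feasible = all(sum(1 for w in e if u[w] == 0) <= k for e in edges)
        if feasible and (best is None or sum(u) < sum(best)):
            best = u
    return best

def min_vertex_cover_size(edges, n):
    """Brute-force minimum vertex cover, for comparison."""
    for size in range(n + 1):
        for cover in combinations(range(n), size):
            s = set(cover)
            if all(v in s or w in s for v, w in edges):
                return size
    return n

# A triangle plus a pendant edge: the minimum vertex cover has size 2.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
u = min_ones_upper_bound(edges, n=4, k=1)
print(sum(u), min_vertex_cover_size(edges, n=4))  # the two sizes coincide
```

An edge vector may exceed the upper bound in at most one of its two 1-positions, so at least one endpoint of every edge must carry u_v = 1 — exactly the vertex cover condition.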
To prove Theorem 4.2 we use a linear programming relaxation of an integer linear program (ILP) corresponding to the variant of the one-sided CombCI, which we define as follows. Let X be an n × m matrix representing the data vectors x_1, …, x_n. For the moment, consider the j-th position of every vector in X, and let the vectors be sorted in decreasing order of their value at position j. Let succ(i, j) denote the vector that follows vector x_i in this sorted order. Moreover, at every position j we only consider vectors that are strictly above the lower bound ℓ_j.
The ILP we define uses binary variables z_{ij} ∈ {0, 1} to express whether a given vector is inside the one-sided confidence area: we have z_{ij} = 1 when vector x_i is contained in the confidence area at position j, and z_{ij} = 0 otherwise. To compute the size of the confidence area, we introduce a set of coefficients, denoted by c_{ij}. We define c_{ij} as the difference between the values of vectors x_i and succ(i, j) at position j. For the vector x_i that appears last in the order, i.e., for which there is no successor succ(i, j), we let c_{ij} = x_i[j] − ℓ_j, where ℓ_j is the value of the lower side of the confidence interval at position j. Using this, the difference between the largest and smallest value at position j can be written as the telescoping sum Σ_i c_{ij}, and the size of the envelope that contains all of X is equal to Σ_j Σ_i c_{ij}.
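The telescoping property of these coefficients can be sketched in Python as follows (a toy check with our own helper names, assuming row vectors and a fixed lower bound):

```python
def coefficients_at_position(X, j, lower):
    """Compute c_{ij} for column j: sort vectors by their value at j in
    decreasing order; c_{ij} is the gap to the next vector in this order,
    and for the last vector it is the gap down to the lower bound."""
    order = sorted(range(len(X)), key=lambda i: X[i][j], reverse=True)
    c = {}
    for pos, i in enumerate(order):
        if pos + 1 < len(order):
            c[i] = X[i][j] - X[order[pos + 1]][j]
        else:
            c[i] = X[i][j] - lower[j]
    return c

X = [[3.0, 1.0], [1.0, 4.0], [2.0, 2.0]]
lower = [0.0, 0.5]
c0 = coefficients_at_position(X, 0, lower)
# Telescoping: the coefficients at position j sum to max_i X[i][j] - lower[j].
print(sum(c0.values()), max(x[0] for x in X) - lower[0])
```

The gaps between consecutive vectors in the sorted order add up to the full span above the lower bound, which is why summing c_{ij} over any top segment of the order recovers the height of the envelope at that position.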
Now, given a feasible assignment of values to the z_{ij}, i.e., an assignment that satisfies the constraints that we will define shortly, we can compute the size of the corresponding one-sided confidence area as the sum Σ_{i,j} c_{ij} z_{ij}. This is the objective function of the ILP.
We continue by discussing the constraints. Recall that succ(i, j) denotes the vector that follows vector x_i when the vectors are sorted in decreasing order of their value at position j. First, since every vector may violate the confidence area in at most k positions, we must make sure that Σ_j z_{ij} ≥ m − k holds for every i. Moreover, observe that if the vector x_i is within the upper bound at position j, i.e., we have z_{ij} = 1, then the vector succ(i, j) must be inside the upper bound at position j as well, since its value at position j is no larger than that of x_i. However, if the vector x_i is below the upper bound, this does not imply anything for the vector that precedes it. Therefore, we must have z_{ij} ≤ z_{succ(i,j), j} for all i and j.
The resulting ILP is as follows:
min   Σ_{i=1..n} Σ_{j=1..m} c_{ij} z_{ij}                      (A.4)
s.t.  Σ_{j=1..m} z_{ij} ≥ m − k         for all i,             (A.5)
      z_{ij} ≤ z_{succ(i,j), j}         for all i, j,          (A.6)
      z_{ij} ∈ {0, 1}                   for all i, j.          (A.7)
There are at most nm binary variables and O(nm) constraints in total in addition to the integrality constraints. A relaxed version of this ILP is otherwise equivalent, but allows all variables to take fractional values in the interval [0, 1]. We use standard techniques to design an approximation algorithm for the one-sided CombCI problem (with k ≥ 1) as follows. The algorithm first computes (in polynomial time using a suitable LP solver) the optimal fractional solution, denoted z*, and then rounds this to an integer solution z' using a simple threshold value. We must select the threshold so that the resulting integer solution is guaranteed to be feasible for the original ILP. We define the rounding procedure as follows:
z'_{ij} = 1 if z*_{ij} ≥ 1/(k + 1), and z'_{ij} = 0 otherwise.   (A.8)
The threshold value we use is thus 1/(k + 1). To prove that the algorithm is correct, we must show (i) that the rounding procedure results in a solution where every vector is outside the confidence area at most k times, and (ii) that there are no “holes” in the solution, meaning that if z'_{ij} = 1, then z'_{succ(i,j), j} must be equal to 1 as well. We start with this latter property.
Consider a position j, and let all vectors be sorted in decreasing order of their value at position j. This order results in a monotonically increasing sequence of the fractional values z*_{ij}.

The proof is a simple observation: since Equation (A.6) holds for any feasible solution, we must have z*_{ij} ≤ z*_{succ(i,j), j} for every i and j. ∎

This means that if the rounding procedure of Equation (A.8) sets z'_{ij} = 1, we must have z'_{succ(i,j), j} = 1 as well. The result is also guaranteed to include every vector within the confidence area at least m − k times.
We must find the worst-case distribution of values in a row z*_i that requires the smallest threshold value for the rounded variables to satisfy constraint (A.5). Suppose that there are p ones in z*_i. Since the total sum of all elements in z*_i is at least m − k, the remaining m − p items must sum to at least m − k − p. The worst case occurs when all of the remaining items are equal, each having the value (m − k − p)/(m − p); this is smallest when p = m − k − 1, in which case the remaining k + 1 items are each equal to 1/(k + 1). By selecting this as the threshold value, we are guaranteed to satisfy the constraint. ∎
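The worst case can be checked numerically; the short sketch below (our own code, not from the paper) builds such a fractional row and applies the threshold rounding:

```python
m, k = 10, 3
threshold = 1.0 / (k + 1)

# Worst-case fractional row: p = m - k - 1 ones, plus k + 1 entries equal
# to 1/(k + 1); the row then sums to exactly m - k, the constraint bound.
z_row = [1.0] * (m - k - 1) + [threshold] * (k + 1)
assert abs(sum(z_row) - (m - k)) < 1e-9

# Rounding with threshold 1/(k + 1) keeps every entry, so the rounded row
# still satisfies the constraint; any strictly larger threshold drops the
# k + 1 small entries and leaves only m - k - 1 ones, which is infeasible.
rounded = [1 if z >= threshold else 0 for z in z_row]
too_strict = [1 if z >= threshold + 1e-9 else 0 for z in z_row]
print(sum(rounded), sum(too_strict))
```

This is exactly why the threshold cannot be pushed above 1/(k + 1) without losing feasibility of constraint (A.5).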
Next, we consider the approximation ratio of the rounded solution z'. The proof follows standard approaches.

The cost of the rounded solution z' is at most (k + 1) times the cost of an optimal solution to the original ILP.

By the definition of z' in Equation (A.8), we must have z'_{ij} ≤ (k + 1) z*_{ij} for all i and j. This implies that Σ_{i,j} c_{ij} z'_{ij} ≤ (k + 1) Σ_{i,j} c_{ij} z*_{ij}. The cost of z' is thus at most (k + 1) times the cost of the optimal fractional LP solution, which in turn lower bounds the optimal cost of the ILP. ∎
The idea of the greedy algorithm, presented in Algorithm A.1, is to start from the envelope of the whole data and sequentially find vectors to exclude, selecting at each iteration the vector whose removal reduces the envelope by the largest amount. In order to find the envelope w.r.t. the relaxed condition that allows k positions from each vector to be outside, the set of included data points needs to be (re)computed at each iteration (line 6). This is realized in the function FindEnvelope (details presented in Algorithm A.2), which essentially implements a greedy algorithm solving the CombCI problem when no whole vectors are excluded. After the envelope has been computed, the data vector whose exclusion yields the maximal decrease in the size of the confidence area is excluded. The function ExcludeObservation (details presented in Algorithm A.3) is a variant of the greedy MWE algorithm from [11], with the ordering structure provided. After the desired number of data vectors has been excluded, FindEnvelope is called for the final set of data vectors in line 9, and the matrix indicating the included points is returned.
The function FindEnvelope, presented in Algorithm A.2, identifies which points from each data vector to leave outside of the confidence area in order to obtain a confidence area of minimal size. To efficiently maintain information about the ordering of the values in the columns, an ordering structure is used (line 5). Each element of the structure is a doubly-linked list storing the ordering information for one column, with the first element corresponding to the row index of the smallest element in that column. Several functions are associated with the structure. The function 1st returns the row index of the smallest or the largest value in a column, depending on its argument, and, similarly, the function 2nd returns the row index of the second smallest or the second largest value. Furthermore, the function remove1st removes either the first or the last element of the list. Lines 6–12 initialize a priority queue maintaining a list of candidate points for exclusion along with their respective gains, i.e., reductions of the confidence area.
The main part of the search for the points to exclude from the relaxed confidence area is the while-loop in lines 13–22. At each iteration the point leading to the maximal decrease in the size of the confidence area is excluded, provided that fewer than k elements have already been excluded from the data vector in question. The loop terminates when there are no positions left in which a point with an extreme value can be excluded without breaking the vector-wise constraints.
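A much-simplified sketch of this idea in Python (our own code, not Algorithms A.1–A.3: no doubly-linked lists or priority queue, no whole-vector exclusion, and a budget of k excluded points per vector) could look as follows:

```python
from collections import Counter

def greedy_envelope(X, k):
    """Greedy sketch of the FindEnvelope idea: repeatedly drop the single
    extreme point (a column minimum or maximum) whose removal shrinks the
    envelope the most, never dropping more than k points from any vector."""
    n, m = len(X), len(X[0])
    # For each column j, vector indices sorted by value (ascending).
    cols = [sorted(range(n), key=lambda i: X[i][j]) for j in range(m)]
    dropped = Counter()  # number of points dropped per vector
    while True:
        best = None  # (gain, column, which_end)
        for j in range(m):
            col = cols[j]
            if len(col) < 2:
                continue
            lo_i, hi_i = col[0], col[-1]
            if dropped[lo_i] < k:  # gain from removing the column minimum
                gain = X[col[1]][j] - X[lo_i][j]
                if best is None or gain > best[0]:
                    best = (gain, j, 0)
            if dropped[hi_i] < k:  # gain from removing the column maximum
                gain = X[hi_i][j] - X[col[-2]][j]
                if best is None or gain > best[0]:
                    best = (gain, j, -1)
        if best is None or best[0] <= 0:
            break
        _, j, end = best
        i = cols[j].pop(0 if end == 0 else len(cols[j]) - 1)
        dropped[i] += 1
    # Size of the envelope spanned by the points still included.
    return sum(X[col[-1]][j] - X[col[0]][j] for j, col in enumerate(cols))

# One outlying coordinate per vector: allowing k = 1 exclusions per vector
# shrinks the envelope drastically compared to k = 0.
X = [[0.0, 0.0], [1.0, 1.0], [0.0, 10.0]]
print(greedy_envelope(X, 0), greedy_envelope(X, 1))
```

The paper's version achieves the same effect more efficiently by keeping the candidate exclusions in a priority queue keyed by gain instead of rescanning every column on each iteration.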
Our approach in the function ExcludeObservation, presented in Algorithm A.3, is similar to that of [11, Algorithm 1]: the observation removed is the one whose exclusion results in the maximal decrease in the envelope size with respect to the current envelope.
Notice that for k = 0 it is computationally more efficient to use the greedy algorithm from [11], as in that case there is no need to update the set of data vectors included in the envelope at each iteration, as is done in Algorithm A.1.
The greedy algorithm performs well in practice, although it does not provide an approximation guarantee. There are small instances, built from a suitable data matrix with an arbitrarily small perturbation, on which the optimal solution has a strictly lower cost than the solution found by the greedy algorithm.