Dynamic processes in populations are often described using functions (e.g., Bebbington et al., 2007, 2011; and references therein). They are observed in the form of data points, usually contaminated by measurement errors. We may think of these points as randomly perturbed true values of underlying functions, whose measurements are taken at certain time instances. The functions, their rates of change, and de/acceleration can be and frequently are non-monotonic. Nevertheless, it is of interest to assess and even compare the extent of their monotonicity, or lack of it. We refer to Qoyyimi (2015) for a discussion and literature review of various applications.
Several methods for assessing monotonicity have been suggested in the literature (e.g., Davydov and Zitikis, 2005, 2017; Qoyyimi, and Zitikis, 2014, 2015). In particular, Davydov and Zitikis (2017) show the importance of such assessments in insurance and finance, especially when dealing with weighted insurance calculation principles (Furman and Zitikis, 2008), among which we find such prominent examples as the Esscher (Bühlmann, 1980, 1984), Kamps (1998), and Wang (1995, 1998) premiums. Furthermore, Egozcue et al. (2011) provide problems in economics where the sign of the covariance
needs to be determined for various classes of function
. One of such examples concerns the slope of indifference curves in two-moment expected utility theory (e.g., Eichner and Wagener, 2009; Sinn, 1990; Wong, 2006; and references therein). Another problem concerns decision making (e.g., speculation, normal backwardation, contango, etc.) of competitive companies under price uncertainty (e.g., Holthausen, 1979; Feder et al., 1980; Hey, 1981; Meyer and Robison, 1988; and references therein).
Lehmann (1966) has shown that if the function is monotonic, then covariance (1.1) is either positive (when is increasing) or negative (when is decreasing). This monotonicity assumption on , though satisfied in a number of cases of practical interest, excludes a myriad of important cases with more complex risk profiles. For example, when dealing with the aforementioned economics-based problems, the role of is played by the derivative of the underlying utility function, which may not be convex or concave everywhere, as argued and illustrated by, e.g., Friedman and Savage (1948), Markowitz (1952), Kahneman and Tversky (1979), Tversky and Kahneman (1992), among others. Hence, since might be non-monotonic, how far can this function be from being monotonic, or increasing? Furthermore, since the population risk- or utility-profile cannot be really known, the non-monotonicity of needs to be assessed from data, and this leads us to the statistical problem of this paper.
In addition, supported by the examples of Anscombe (1973) on potential pitfalls when using the classical correlation coefficient, Chen and Zitikis (2017) argue in favour of using the index of increase, as defined by Davydov and Zitikis (2017), for assessing non-monotonicity of scatterplots. Chen and Zitikis (2017) apply this approach to analyze and compare student performance in subjects such as mathematics, reading and spelling, and illustrate their reasoning on data provided by Thorndike and Thorndike-Christ (2010). One of the methods discussed by Chen and Zitikis (2017) deals with scatterplots representing finite populations, in which case large-sample estimation is not possible. The other method involves large-sample regression techniques (Figure 1.1), in which case Chen and Zitikis (2017) calculate the corresponding indices of increase using a numerical approach, that gives rise to the values denoted by and reported in the bottom-right corners of the panels of Figure 1.1.
Though important, these methods do not allow direct large-sample non-monotonicity quantifications and thus inferences about larger populations. In this paper, therefore, we offer a statistically attractive and computationally efficient procedure for assessing data patterns that arise from non-monotonic patterns contaminated by random measurement errors.
We have organized the rest of the paper as follows. In Section 2, we introduce the index and provide basic arguments leading to it. In Section 3, we explain why and how the index needs to be adjusted in order to become useful in situations when random measurement errors are present. In Section 4, we rigorously establish consistency of the estimator and introduce relevant data-exploratory and cross-validatory techniques. Since the limiting distribution of the estimator is complex, in Section 56 concludes the paper with a brief summary of our main contributions.
2 The index of increase
Davydov and Zitikis (2017) have introduced the index of increase
for any absolutely continuous (e.g., differentiable) function on interval , where , and “” denotes equality by definition. Throughout the paper, we use to denote the Lebesgue measure, which helps us to write integrals compactly, as seen from the ratios above. We shall explain how the index arises later in the current section. Of course, this framework reduces to the unit interval by considering the function instead of . Namely, we have
To illustrate, in Figure 2.1
we have visualized the following quartet of functions
and we have also calculated their indices of increase. Since and are monotonic functions on the interval , calculating their indices of increase using formula (2.2) is trivial, but the same task in the case of non-monotonic functions and requires some effort. To facilitate such calculations in a speedy fashion, and irrespective of the complexity of functions, we suggest using the numerical approximation
with for . Intuitively, is the proportion of the upward movements of the function with respect to all the movements, upward and downward.
Knowing the convergence rate of to when is important as it allows us to set a frequency at which the measurements of could be taken during the observation period (e.g., unit interval ) so that any pre-specified estimation precision of would be achieved. For example, we have used to calculate the index values with the four-digit precision reported in Figure 2.1. We refer to Chen and Zitikis (2017) for details on computational precision.
The following proposition, which is a special case of Lemma 4.1 below, establishes the convergence rate based on the level of smoothness of the function .
Let be a differentiable function defined on the unit interval , and let its derivative be -Hölder continuous for some . Then, when , we have
for any positively homogeneous and Lipschitz function (e.g., and ). Consequently,
To explain the basic meaning of the index , we start with an un-normalized version of it, which we denote by . Namely, let denote the set of all absolutely continuous functions on the interval such that . Denote the total variation of on the interval by , that is, . Furthermore, by definition, we have and , and we also have the equations and . Finally, we use to denote the set of all the functions that are non-increasing. All of these are of course well-known fundamental notions of Real Analysis (e.g., Kolmogorov and Fomin, 1970; Dunford and Schwartz,1988; and Natanson, 2016).
For any function , we define its (un-normalized) index of increase as the distance between and the set , that is,
Obviously, if is non-increasing, then , and the larger the value of , the farther the function is from being non-increasing on the interval . Determining the index using its definition (2.7) is not, however, a straightforward task, and to facilitate it, we next establish a very convenient integral representation of .
Theorem 2.1 (Davydov and Zitikis, 2017).
The infimum in definition (2.7) is attained at any function such that , and thus
A direct proof of this theorem was not provided by Davydov and Zitikis (2017), who refer to a more general and abstract result. Nevertheless, a short and enlightening proof exists, and we present it next.
Proof of Theorem 2.1.
We start with the note that the bound holds for every function , and in particular for the function specified in the formulation of the theorem. Hence,
It now remains to show the opposite bound. Let be the set of all such that , and let be the complement of the set , which consists of all those for which . Then
The index never exceeds , and so the normalized version of is
which is exactly the index of increase given by equation (2.2). In summary, the index of increase is the normalized distance of the function from the set of all non-increasing functions on the interval : we have when the function is non-increasing, and when the function is non-decreasing. The closer the index is to , the more (we say) the function is increasing, and the closer it is to , the less (we say) the function is increasing or, equivalently, the more it is decreasing.
3 Practical issues and their resolution
Measurements are usually taken with errors, whose natural model is some distribution (e.g., normal) with mean
and finite variance. In other words, the numerical index turns into the random index of increase
where, for ,
Right at the outset, however, serious issues arise. To illustrate them in a speedy and transparent manner, we put aside mathematics such as in Davydov and Zitikis (2004, 2007) and, instead, simulate standard normal errors , thus obtaining four sequences corresponding to the functions of quartet (2.1). Then we calculate the corresponding indices of increase using formula (3.1). All of the obtained values of are virtually equal to (see Figure 3.1).
Clearly, there is something amiss.
It is not, however, hard to understand the situation: when all ’s are zero, the definition of the integral as the limit of the Riemann sums works as intended, but when the ’s are not zero, they accumulate so much when gets larger that the deterministic part (i.e., the Riemann sum) gets hardly, if at all, visible (compare Figures 2.1 and 3.1). In summary, we are facing two extremes:
If the model is purely deterministic in the sense that there are no measurement errors, which we can understandably argue to be outside the realm of practice, then the more frequently we observe the function , the more precisely we can estimate its index of increase.
If, however, there are measurement errors, as they usually are in practice, then the more frequently we observe the function, the less precisely we can estimate its index of increase, because the accumulated measurement errors obscure the deterministic part.
Neither of the two extremes can be of much interest, or use, for reasons either practical or computational. The purpose of this paper is to offer a way out of this difficulty by showing how to strike a good balance between determinism and randomness inherent in the problem.
We next present an intuitive consideration that will guide our subsequent mathematical considerations, and it will also hint at potential applications of this research. Namely, suppose that the unit interval represents an one-day observation period, and let an observation be taken (e.g., by a measuring equipment) every second. Hence, in total, we have observations of the (unknown) function , and they are prone to measurement errors as in expression (3.2). For the sake of argument, let ’s be i.i.d. standard normal. If we calculate the index based on these data, we already know the problem: tends to when . To diminish the influence of these errors, we average the observed values:
where means ‘approximately in distribution,’ and
follows the standard normal distribution. However, in the process of averaging out the errors, we have inevitably also averaged the deterministic part and arrived at the mean valueof the function . This value has very little to do with the index , which fundamentally relies on the derivative . In short, we have clearly over-averaged the observations : having maximally reduced the influence of measurement errors, we have obscured the function so much that the estimation of has become impossible. Clearly, we need to adopt a more tempered approach.
Hence, we group the observations into only groups , , whose cardinalities we assume to be the same for all . It is convenient to re-parametrize these choices using parameter , which turns and into
This re-parametrization is not artificial. It is, in a way, connected to smoothing histograms and estimating regression functions, and in particular to bandwidth selection in these research areas. We shall elaborate on this topic more in the next section. At the moment, we only note that the aforementioned connection plays a pivotal role in obtaining practically useful and sound estimates of the parameter .
To gain additional intuition on the grouping parameter , we come back for a moment to our numerical example with the one-day observation period, which is comprised of observations, one per second. Suppose that we decide to average the sixty observations within each minute. Thus, we have and in this way produce new data points, which we denote by . Since , we have and thus . If, however, instead of averaging minute-worth data we decide to average, for example, hour-worth data, then we have (=group cardinality), (=number of groups), and thus .
Continuing our general discussion, we average the original observations , , falling into each group and in this way obtain group-averages
Based on these averages, we modify the earlier introduced index as follows:
The problem that we now face is to find, if exist, those values of that make the index converge to when . This is the topic of the next section.
The following theorem establishes consistency of the estimator and, in particular, specifies the range of possible values.
Let be a differentiable function defined on the unit interval , and let its derivative be -Hölder continuous for some . If , then is a consistent estimator of , that is, when , we have
The rate of convergence is of the order
with arising from the deterministic part of the problem, that is, associated with the function , and arising from the random part, that is, associated with the measurement errors ’s.
We next discuss the choice of from the theoretical and practical perspectives, which do not coincide due to a number of reasons, such as the fact that theory is concerned with asymptotics when , while practice deals with finite values of , though possibly very large. Under the (practical) non-asymptotic framework, any value of is, in principle, acceptable because the quantities and in the specification of convergence rate (4.2) interact, as both of them depend on and .
Under the (theoretical) asymptotic framework, the values and have to be discarded immediately, as we have already noted. The remaining ’s should, as Theorem 4.1 tells us, be further restricted to only those below . Since we wish to chose that results in the fastest rate of convergence, we maximize the function and get
For example, if the second derivative is uniformly bounded on the interval , which is the case in all our illustrative examples, then and thus .
The grouping and averaging technique that we employ is closely related to smoothing in non-parametric density and regression estimation (e.g., Silverman, 1986; Härdle, 1991; Scott, 2015; and references therein). To elaborate on this connection, we recall that the number of groups is , whose reciprocal
would play the role of ‘bandwidth.’ In non-parametric density and regression estimation, the optimal bandwidth is of the order when , which in our case corresponds to . Hence, means only one bin/group and thus over-smoothing, whereas means as many bins/groups as there are observations, and thus under-smoothing. Of course, as we have already noted above, the values and are excluded, unless all the measurement errors vanish, in which case smoothing is not necessary and thus can be used, as we indeed did earlier when dealing with the numerical index .
Thinking of the role of -Hölder continuity of on the problem, it is useful to look at two extreme cases: First, when , we have from formula (4.3), which corresponds (under weak conditions) to the optimal bandwidth in non-parametric density and regression estimation. Second, when no smoothing is applied, like in the case of the histogram density-estimator, then (under weak conditions) the optimal bandwidth is of the order , which corresponds to when , which essentially means boundedness but no continuity of .
Hence, choosing an appropriate value of the grouping parameter is a delicate task. We next discuss two approaches: The first one is data-exploratory (visual) when we assume that we know the population and want to gain insights into what might happen in practice. The second, practice-oriented approach relies on the idea of cross-validation (e.g., Arlot and Celisse, 2010, Celisse, 2008; and references therein) and is designed to produce estimates of based purely on data.
4.1 Data exploratory (visual) choice of
To gain intuition on how to estimate the grouping parameter from data, we start out with the functions in quartet (2.3), which we view as populations, and then we contaminate their observations with i.i.d. errors according to formula (3.2).
We have visualized the values of the estimator with respect to and in Figure 4.1,
where the hyperplane in each panel is at the height of the corresponding actual index of increase . For each panel, we visually choose a value of which is in the intersection of the curved surface with the hyperplane, because in this case the index is close to the actual index .
Even though the chosen parameter value, which we denote by , may not be optimal due to roughness of the surface, it nevertheless offers a sound choice, as we see from Figure 4.2
where we depict the convergence of to when grows. In each panel, the horizontal red ‘reference’ line is at the height of the actual index value.
Note that in panel (a) of Figure 4.2, the visually obtained is slightly larger than , but we have to say that we had decided on this value (as a good estimate) before we knew the result of Theorem 4.1, and thus before we knew the (theoretical) restriction . Nevertheless, we have decided to leave the value as it is, without tempering with our initial guess in any way. As we shall see in next Section 4.2, however, the purely data-driven and based on cross-validation value is , which is within the range of theoretically acceptable values.
4.2 Choosing based on cross validation
As we have already elucidated, equation (4.4) connects our present problem with nonparametric regression-function estimation. In the latter area, researchers usually choose the optimal bandwidth as the point at which cross-validation scores become minimal (e.g., Arlot and Celisse, 2010, Celisse, 2008; and references therein). We adopt this viewpoint as well. Namely, given a scatterplot, say , we cross validate it (computational details and R packages will be described in a moment). Then we find the minimizing value and finally, according to equation (4.4), arrive at the ‘optimal’ via the equation
In Figure 4.3,
we see some differences between the values of and . Nevertheless, we should not prejudge the situation in any way because in practice, when no hyperplanes can be produced due to unknown values of , only the values of can be extracted from data. Note, however, that the four values of reported in the panels of Figure 4.3 are in compliance with the condition of Theorem 4.1 stipulating that ’s must be in the range in order to have (asymptotic) consistency.
To explore how the grouped estimator based on ’s actually performs, we have produced Figure 4.4.
Naturally, since the respective visual ’s and cross-validatory ’s do not coincide, the corresponding values of are also different. Which of them are better from the statistical point of view will become clearer only in Section 5, where bootstrap-based standard errors and confidence intervals are derived.
We next present a detailed implementation procedure for finding cross-validatory estimates of the grouping parameter . Naturally, the help of the R computing language (R Core Team, 2013) becomes indispensable, and we have used a number of R packages to accomplish the task. We also wish to acknowledge the packages
ggplot2 (Wickham, 2009) and
plotly (Sievert et al., 2017) that we have used extensively in this paper to draw two-dimentional plots and interactive surface plots; the latter plots have been pivotal in extracting the values visually.
Hence, from the purely practical computational perspective, we now utilize bandwidth selection techniques of kernel-based regression-function estimation in order to get estimates of the grouping parameter. First, for the sake of programming efficiency, we restrict ’s to the interval , and we evenly split the latter interval into bins of width , all of which can of course be refined in order to achieve, if desired, smaller computational errors. Hence, from now on, we have thirty equidistant ’s, which are for . Next we use the common cross-validation method called repeated -fold cross validation, and we set for our purpose. The following main steps are:
For each function under consideration, we generate data points based on equation (3.2).
We randomly split the given points into folds, denoted by , of roughly equal sizes.
For each value , we use as the validation set and let other ’s be training sets, which we use to fit a kernel regression model. Specifically, we use the function
ksmoothfrom the R package
stats, with the parameter
normal, which means that we use the normal kernel. Then we use the validation set to get the predicted values and calculate one prediction error, defined as the mean-square error and denoted by . We repeat this step until we use up all the folds as our validation sets. Hence, we obtain prediction errors . Finally, we average these prediction errors and denote this average by .
We repeat Step 3 for all ’s, thus arriving at one estimated prediction error for each . Hence, in total, we have .
We repeat Steps 1–4 fifty times, for every , and then take the averages of the corresponding fifty estimated prediction errors. This gives us fifty final estimates, which we denote by . For example, for , the final estimate is the average of . In summary, after this step, we have of the final estimates of the prediction error.
We draw the plot of the ’s versus the corresponding estimated prediction errors. The that gives the minimal prediction error is denoted by . Finally, we use equation (4.5) to get .
4.3 Proof of Theorem 4.1
Let be differentiable, and let its derivative be -Hölder continuous for some . Furthermore, let be any positively homogeneous and Lipschitz function. Then there is a constant such that, for any set of points ,
where is such that .
Since is Lipschitz and is -Hölder continuous, we have
where the values of and might have changed from line to line. Next, we explore the right-most sum of equation (4.7), to which we add and subtract the right-hand side of equation (4.6). Then we use the mean-value theorem with some and arrive at the equations
Proof of Theorem 4.1.
We start with the equations
We next tackle the deterministic sum on the right-hand side of equation (4.9), and start with the equation
because for all . Consequently,
where we used the fact that is Lipschitz. By the mean-value theorem, there is between and such that the right-hand side of equation (4.10) is equal to . Consequently, with the notation
we have and thus the increments are equal to . This gives us the equations
because for all real and
. The random variables, , are independent and identically distributed with the means and variances . Hence,
Furthermore, by Lemma 4.1 we have
The rate of convergence is of the order . Theorem 4.1 is proved. ∎
5 Bootstrap-based confidence intervals
To construct confidence intervals for based on the estimator , we need to determine standard errors, which turns out to be a very complex task from the viewpoint of asymptotic theory. Hence, we employ bootstrap (e.g., Hall, 1992; Efron and Tibshirani, 1993; Shao and Tu, 1995; Davison and Hinkley, 1997; and references therein). The re-sampling size is quite often chosen to be equal to the actual sample size , but in our case, we find it better to re-sample fewer than observations (i.e., ) and thus follow specialized to this topic literature by Bickel et al. (1997), Bickel and Sakov (2008), Gribkova and Helmers (2007, 2011); see also references therein. Specifically, the steps that we take are:
For a given function , we generate values according to the model , where are i.i.d. standard normal.
We re-sample times and in this way obtain 1000 sub-samples of size , which we choose to be according to a rule of thumb (DasGupta, 2008, p. 478).
We use formula (3.3) to calculate the grouped index of increase, thus obtaining 1000 values of it; one value for each sub-sample. We denote the empirical distribution of the obtained values by .
With denoting the (generalized) inverse of
, the 95% quantile-based confidence interval is, where and .
To illustrate, we introduce a second quartet of functions of this paper, namely:
We have visualized the functions in Figure 5.1.
As expected, our preliminary analysis has shown that the un-groped estimators converge to in all the four cases, but the grouped estimator does converge under appropriate choices of the grouping (or smoothing) parameter values. Next are summaries of our findings using two approaches: the first one is data exploratory (visual) and the second one is based on cross validation.
5.1 Data exploratory (visual) choice of
Based on the crossings of surfaces and hyperplanes depicted in Figure 5.2,
|Confidence intervals||(0.0372, 0.3368)||(0.6527, 1.0000)||(0.6090, 1.0000)||(0.3675, 0.5713)|
Finally, we use bootstrap to get standard errors and confidence intervals, all of which are also reported in Table 5.1.
Reflecting upon the findings in Table 5.1, we see that the values of corresponding to the functions and are outside the range specified by the consistency result of Theorem 4.1, but this of course does not invalidate anything – we are simply working with finite sample sizes . Naturally, we are now eager to compare all the findings reported in Table 5.1 with the corresponding ones obtained by cross validation, which is our next topic.
5.2 Choosing based on cross validation
we visualize the cross-validation scores, specify their minima , and also report the grouping parameters derived via the equation . Based on these values, we explore the performance of using the convergence graphs depicted in Figure 5.5.
The values of point estimates, standard errors, and confidence intervals are reported in Table 5.2.
|Confidence intervals||(0.0000, 0.2813)||(0.5987, 1.0000)||(0.6256, 1.0000)||(0.1803, 0.6193)|
Note that the first three values of reported in Table 5.2 are inside the range specified by the consistency result of Theorem 4.1, whereas corresponding to is just slightly outside the range. Note also that the values of corresponding to the functions and are considerably smaller than the corresponding ’s reported in Table 5.1.
The confidence intervals reported in Tables 5.1 and 5.2 comfortably cover the actual values of , and the widths of these confidence intervals, denoted by and respectively, are comparable for the functions , and . The of the cv-based confidence interval for the function is, however, considerably wider than the corresponding