Ranked set sampling (RSS) was first introduced by McIntyre (1952) and has been widely used as a sampling design in many applications. The idea behind RSS is particularly appealing to agricultural and environmental scientists, for whom identifying sampling units in the field is straightforward but measuring the units exactly is time consuming. Many sampling units can be identified, but only subsets of them are actually measured. In RSS these subsets are identified by ranking the units and selecting them according to their relative ranks.
In brief, the RSS technique involves taking random samples (sets) of equal size from the population. The units in each set are ranked by some quick and easy measure. Then one unit from each set is chosen and precisely measured for the character of interest: the unit with the lowest rank is chosen from the first set, the unit with the second lowest rank from the second set, and so on, until the highest-ranked unit is taken from the last set. This process is repeated for several cycles, so the final sample size is the set size multiplied by the number of cycles. Sampling can be balanced, or unbalanced when the number of sample units selected per rank is not constant. With highly skewed population distributions, more units from low (or high) ranks can be selected. Unbalanced designs are similar in concept to optimal allocation in stratified sampling, where strata with larger variances receive larger sampling fractions. RSS is reported to be more efficient than simple random sampling (Ridout, 2003; Samawi, 1996). For full reviews of RSS see Patil et al. (1999) and the related book by Chen et al. (2004).
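The balanced procedure described above can be sketched as follows. This is only an illustrative sketch that assumes perfect ranking on the measured value itself; the function name `ranked_set_sample` and its parameters (`set_size`, `cycles`, `rank_key`) are hypothetical and are not taken from the cited papers.

```python
import random

def ranked_set_sample(population, set_size, cycles, rank_key=None, rng=None):
    """Balanced RSS sketch: in each cycle draw `set_size` sets of `set_size`
    units, rank each set, and keep the i-th ranked unit of the i-th set."""
    rng = rng or random.Random()
    rank_key = rank_key or (lambda unit: unit)   # assumption: rank on the value itself
    sample = []
    for _ in range(cycles):
        for i in range(set_size):
            one_set = rng.sample(population, set_size)  # random set of units
            ranked = sorted(one_set, key=rank_key)      # quick and easy ranking step
            sample.append(ranked[i])                    # only this unit is measured
    return sample

# Hypothetical population of 1000 units.
population = [float(x) for x in range(1, 1001)]
rss = ranked_set_sample(population, set_size=4, cycles=5, rng=random.Random(1))
print(len(rss))  # set_size * cycles = 20 measured units
```

In practice `rank_key` would be a cheap auxiliary measurement or a visual judgement rather than the character of interest itself.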
In some populations there is more than one character of interest. Patil et al. (1994) discussed RSS for multiple variables when one of the variables can be defined as the primary variable. Ranking is based on this primary variable only, and if the other variables are correlated with it, the method performs reasonably well. Norris et al. (1995) developed two approaches: one uses an unbalanced allocation based on the Neyman allocation for the variable of primary interest, treating it as a concomitant for the other variables of interest; the other uses a design based on randomly choosing sample units from the rank list derived from an individual variable. Al-Saleh and Zheng (2002) and Chen and Shen (2003) proposed a two-layer ranked set sampling for the situation in which there are two main variables or two concomitant variables by which to rank the data. In their methods, at the first layer the data are ranked on the first variable and an RSS sample is selected. At the second layer, the first-layer RSS data are ranked on the second variable, and the resulting RSS data form the final sample. One disadvantage of these methods is that they consider the two variables separately rather than simultaneously. Another is that they require many initial samples to achieve the needed sample size, and the number of initial samples required grows rapidly as the number of variables increases.
In this paper, applying the framework developed by Panahbehagh et al. (2017), multivariate RSS based on partially ordered sets (posets) is introduced.
We demonstrate our suggested sampling technique with two environmental examples:
The first example deals with estimating the mean values of “flower dry weight” and “essence” of Matricaria chamomilla, a very important commercial and medicinal plant in Iran and many other countries. The main part of chamomile used for medicinal purposes is the flower essence, and it is economically important to maximize the oil yield. It is hardly possible to measure the oil yield under all scenarios and in all suitable geographical units within Iran, so a sampling technique is necessary.
The second example deals with chemical pollution in the environment, a problem that has been in the focus of administrations since the early eighties. Chemicals pose a hazard to humans, animals, plants, etc. due to their toxicity. Quantifying the hazard is, however, extremely difficult, because uptake mechanisms, the mode of toxic action, chemical speciation and the state of the environmental geographical unit all play a role. Therefore monitoring programs were installed in almost all nations to observe chemical pollution spatially and temporally. The data are mostly given as mass of chemical (as total concentration) per mass of the target, for example soil.
These data are thought of as surrogates expressing the hazard potential of the chemicals considered. It is difficult to obtain, for example, mean concentrations that take all geographical units into account, especially when a temporal trend is to be monitored. Here 59 geographical units were selected by the environmental protection agency, taking care to define the regions as homogeneously as possible with respect to the chemical pollution processes. The sampling technique can be validated because, in this specific case, the mean values can also be obtained directly from all 59 units for a specified year of observation. If the proposed method is successful, the monitoring process can be simplified: the precondition of almost homogeneous geographical units can be relaxed and a more elaborate, locally specific monitoring can be applied.
To develop the new method, in section 2 we extend the method of Panahbehagh et al. (2017) to multiple variables. In section 3 we introduce stratified sampling using RSS derived from linear extensions (LEs) of partially ordered sets (posets). Section 4 contains examples, simulations and two real case studies to compare the methods and evaluate the results, and section 5 concludes the paper.
2 Multivariate Virtual Stratified Ranked Set Sampling (MVSR)
In multivariate RSS, we have an R-dimensional random variable. We start with the basic idea of multivariate RSS (Patil et al., 1994), in which ranking is based on just one of the variables. We then combine this idea with the design of Panahbehagh et al. (2017).
Suppose that with where and also , and for all . The main aim is to estimate . Our strategy to get a sample of size from the population is to generate an iid sample of s of size from , sort them according to (using itself or an auxiliary variable) in columns, and repeat this times. We then have a stratified population formed of strata, each of size (see table 1); we simply have a vector of instead of a , where , and is the order statistic in the set with and as the mean and variance respectively, and for are concomitant variables with respect to in set with and . Now we take a simple random sample without replacement (SRSWOR) from the stratum of size (an integer smaller than ), say , and we can estimate by
Theorem 1. In MVSR, is an unbiased estimator for and
and if we assume that and are linked by the linear regression model below
where is a random variable independent of , then
are unbiased estimators for the variance of the variables.
For the proof of Theorem 1 see Appendix A.
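The MVSR construction can be sketched in code as follows. This is a hedged illustration, not the authors' implementation: it assumes that the estimator for each variable is the equally weighted mean of the stratum sample means (natural when all strata have the same size), and the names `draw_bivariate_normal`, `mvsr_estimate`, `n_strata`, `n_sets` and `n_per_stratum` are hypothetical.

```python
import math
import random

def draw_bivariate_normal(rng, rho=0.7):
    """One unit (y1, y2) from a standard bivariate normal with correlation rho."""
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    return (z1, rho * z1 + math.sqrt(1 - rho ** 2) * z2)

def mvsr_estimate(draw_unit, n_strata, n_sets, n_per_stratum, rng):
    """Rank each set on variable 0 only, stack equal ranks into virtual strata,
    then take an SRSWOR of n_per_stratum units from every stratum."""
    strata = [[] for _ in range(n_strata)]
    for _ in range(n_sets):
        one_set = sorted((draw_unit(rng) for _ in range(n_strata)),
                         key=lambda unit: unit[0])
        for rank, unit in enumerate(one_set):
            strata[rank].append(unit)              # rank r goes to stratum r
    n_vars = len(strata[0][0])
    estimate = [0.0] * n_vars
    for stratum in strata:
        chosen = rng.sample(stratum, n_per_stratum)  # SRSWOR within the stratum
        for r in range(n_vars):
            stratum_mean = sum(unit[r] for unit in chosen) / n_per_stratum
            estimate[r] += stratum_mean / n_strata   # assumed equal stratum weights
    return estimate

rng = random.Random(7)
estimate = mvsr_estimate(draw_bivariate_normal, n_strata=4, n_sets=10,
                         n_per_stratum=3, rng=rng)
print(estimate)  # estimated means of both variables (true means are 0)
```

Only the first variable drives the ranking here, so the second variable benefits only through its correlation with the first.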
As we saw, in MVSR one variable is selected as the leading variable on which the ranking is performed; the others are merely carried along as concomitants, which introduces errors for them. We therefore introduce a method that ranks on all variables simultaneously.
3 Ranking based on Posets
In this section, we first describe poset theory and then introduce two new versions of multivariate RSS based on it.
3.1 Posets and Linear Extensions
The application of the theory of partial orders to ranking has been described by Bruggemann and Patil (2011). In this theory, we have a set containing elements, each of them with variables, with a binary relation between the elements. To compare two elements of the set: if all variables of the first element are greater than or equal to (less than or equal to) those of the second, then the first element is better (worse) than the second one; otherwise the two elements are not comparable.
Linear extensions (LEs) are the different projections of the partial order into a complete order that respect all the relations of the partially ordered set. That is, linear extensions are the result of order-preserving mappings, so a relation in a poset is preserved in all of its linear extensions.
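The comparison rule and the definition of a linear extension can be made concrete with a small sketch. The four labelled elements and the function names (`dominates`, `comparable`, `is_linear_extension`) are hypothetical and chosen only for illustration; they are not the data of table 2.

```python
def dominates(a, b):
    """True if every variable of a is >= the matching variable of b."""
    return all(x >= y for x, y in zip(a, b))

def comparable(a, b):
    """Two elements are comparable if one dominates the other."""
    return dominates(a, b) or dominates(b, a)

def is_linear_extension(order, elements):
    """`order` lists labels from lowest to highest; it is a linear extension
    if no element is placed below an element that it strictly dominates."""
    position = {label: i for i, label in enumerate(order)}
    for p in elements:
        for q in elements:
            strictly_above = (dominates(elements[p], elements[q])
                              and elements[p] != elements[q])
            if strictly_above and position[p] < position[q]:
                return False
    return True

# Hypothetical set of four elements with two variables each.
elements = {"a": (1, 1), "b": (2, 3), "c": (3, 2), "d": (4, 4)}
print(comparable(elements["b"], elements["c"]))            # False: b and c are incomparable
print(is_linear_extension(["a", "b", "c", "d"], elements))  # True
print(is_linear_extension(["b", "a", "c", "d"], elements))  # False: a must stay below b
```

Because b and c are incomparable, both a, b, c, d and a, c, b, d are linear extensions of this small poset.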
We use this theory to introduce two designs: ranking based on posets using the complete set (or at least a random sample) of LEs (CPOR), and ranking based on posets using just one randomly selected LE (RPOR):
CPOR: First rank the elements according to their mean height over all possible LEs, where the height of an element is its rank in the respective LE, and then construct a stratified population with unequal stratum sizes using these mean heights.
RPOR: Select one of the LEs at random and use it to construct a stratified population with equal stratum sizes.
We illustrate the topic with an example where we assume a set with and (see table 2). The set of all LEs obtained from the data in table 2 is shown in table 3. Here, because the number of linear extensions is small, the average height of each element can easily be determined directly from table 3.
In general, determining all linear extensions is a computationally hard problem, so the determination of average heights itself requires sampling techniques, as shown by Bubley and Dyer (1999). However, it is not necessary to determine the set of LEs explicitly, because only the average height is of interest. For this purpose there are also quite good approximations available; see for instance Bruggemann et al. (2004, 2013) or De Loof et al. (2013). The heights of each element in the LEs are given in table 4.
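For a very small set, mean heights can be obtained by brute force, as in the following sketch. It enumerates every permutation and keeps those that respect the partial order, so it is feasible only for a handful of elements; the data and the names `strictly_below` and `mean_heights` are hypothetical, and the approximations cited above would be used for larger sets.

```python
from itertools import permutations

def strictly_below(a, b):
    """True if a is dominated by b and the two elements differ."""
    return a != b and all(x <= y for x, y in zip(a, b))

def mean_heights(elements):
    """Average rank (height, 1 = lowest) of each element over all linear extensions."""
    labels = list(elements)
    relations = [(p, q) for p in labels for q in labels
                 if strictly_below(elements[p], elements[q])]
    totals = {label: 0 for label in labels}
    n_extensions = 0
    for order in permutations(labels):
        height = {label: i + 1 for i, label in enumerate(order)}
        if all(height[p] < height[q] for p, q in relations):  # order is preserved
            n_extensions += 1
            for label in labels:
                totals[label] += height[label]
    return {label: totals[label] / n_extensions for label in labels}, n_extensions

# Hypothetical set of four elements with two variables each.
elements = {"a": (1, 1), "b": (2, 3), "c": (3, 2), "d": (4, 4)}
heights, n_extensions = mean_heights(elements)
print(n_extensions)  # 2
print(heights)       # {'a': 1.0, 'b': 2.5, 'c': 2.5, 'd': 4.0}
```

The cost grows factorially with the set size, which is why the sampling and approximation methods mentioned in the text matter in practice.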
We will use the above theory to stratify each set in the next subsection.
We put each element of a set into the stratum given by the nearest integer to its mean height. Following the previous example, according to table 3 we put the elements of the set into 5 virtual strata (see table 5).
Then the design proceeds as follows: an iid sample of size (a set) from is generated, and according to its variables () all possible linear extensions are constructed. We then calculate the mean heights (either explicitly, by determining the set of all LEs, or directly, by applying approximations). Finally, using these heights, we put the elements of the set into the strata, and we repeat this approach times. Clearly this method leads to a stratified population with unequal stratum sizes.
Then instead of an R-dimensional variable we have an (R+1)-dimensional variable where stands for the mean heights of the objects.
We now have a stratified population with unequal stratum sizes. For the stratum we take an SRSWOR with size , proportional to the stratum size, , where such that . The stratified population is presented in table 6.
In table 6, is the character of an element that has fallen into the stratum after elements, according to its mean height (MH) in the respective LEs. Now we propose an estimator for (the expectation of the character in ) as
Theorem 2. In CPOR, is an unbiased estimator for .
For the proof of Theorem 2 see Appendix B.
Here, instead of Neyman allocation, proportional-to-size allocation is used, which is easy to implement and does not need extra information (Sarndal et al. 1992).
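The stratification and allocation steps of CPOR can be sketched as follows. Several details are assumptions made for illustration: ties at .5 are rounded up, every non-empty stratum receives at least one unit (as stated later for the simulations), and the estimator is taken to be the usual size-weighted stratified mean. All function names and the toy data are hypothetical.

```python
import math
import random

def nearest_stratum(mean_height):
    """Nearest integer to the mean height (assumption: .5 is rounded up)."""
    return math.floor(mean_height + 0.5)

def build_strata(units_with_heights):
    """Group units by the stratum given by their rounded mean height."""
    strata = {}
    for unit, mean_height in units_with_heights:
        strata.setdefault(nearest_stratum(mean_height), []).append(unit)
    return strata

def proportional_allocation(strata, total_n):
    """Sample size per stratum proportional to stratum size, at least one unit.
    Because of rounding and the one-unit floor, the sizes may not sum to total_n."""
    population_size = sum(len(units) for units in strata.values())
    return {h: min(len(units), max(1, round(total_n * len(units) / population_size)))
            for h, units in strata.items()}

def stratified_mean(strata, allocation, var_index, rng):
    """Size-weighted mean of SRSWOR sample means for one variable."""
    population_size = sum(len(units) for units in strata.values())
    estimate = 0.0
    for h, units in strata.items():
        chosen = rng.sample(units, allocation[h])
        sample_mean = sum(unit[var_index] for unit in chosen) / len(chosen)
        estimate += len(units) / population_size * sample_mean
    return estimate

# Hypothetical units (two variables) paired with their mean heights.
data = [((1.0, 2.0), 1.0), ((2.0, 1.0), 2.5), ((2.5, 3.0), 2.5), ((4.0, 4.0), 4.0)]
strata = build_strata(data)
allocation = proportional_allocation(strata, total_n=2)
print(allocation)  # {1: 1, 3: 1, 4: 1}
print(stratified_mean(strata, allocation, var_index=0, rng=random.Random(3)))
```

In this toy case the two tied elements both land in stratum 3, which illustrates why the strata have unequal sizes.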
RPOR is easier to perform than CPOR. Here it is enough to select (or construct) one of the LEs in table 3 at random and put its elements in 5 strata; we then have a stratified population formed of strata, each of size , as in MVSR (see table 1). Here we show the vector of variables in stratum with . Now we take an SRSWOR from the stratum of size (an integer smaller than ), say . We propose an estimator for as
Theorem 3. In RPOR, is an unbiased estimator for with variance
where are all the possible combinations of LEs, with the unbiased estimator of variance below
where and are the variance of stratum under combination of LEs and the sample variance of stratum for variable , respectively.
For the proof of Theorem 3 see Appendix C.
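The ranking step of RPOR for a single set can be sketched as below. For simplicity the sketch enumerates all linear extensions and picks one uniformly at random, which is practical only for small sets; the names `linear_extensions` and `rpor_strata` and the toy data are hypothetical.

```python
import random
from itertools import permutations

def strictly_below(a, b):
    """True if a is dominated by b and the two elements differ."""
    return a != b and all(x <= y for x, y in zip(a, b))

def linear_extensions(elements):
    """All orderings (lowest to highest) that respect the partial order."""
    labels = list(elements)
    relations = [(p, q) for p in labels for q in labels
                 if strictly_below(elements[p], elements[q])]
    result = []
    for order in permutations(labels):
        position = {label: i for i, label in enumerate(order)}
        if all(position[p] < position[q] for p, q in relations):
            result.append(order)
    return result

def rpor_strata(elements, rng):
    """Pick one linear extension at random; the rank in it is the stratum."""
    chosen = rng.choice(linear_extensions(elements))
    return {label: rank for rank, label in enumerate(chosen, start=1)}

# Hypothetical set of four elements with two variables each.
elements = {"a": (1, 1), "b": (2, 3), "c": (3, 2), "d": (4, 4)}
assignment = rpor_strata(elements, random.Random(11))
print(assignment)  # a is always in stratum 1 and d in stratum 4; b and c share 2 and 3
```

Every element of a set receives a distinct rank, so repeating this over many sets gives strata of equal size, as in MVSR.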
3.4 Negative Correlation
When the correlations between variables are strongly negative, poset theory implies that most of the elements in a set are likely to be incomparable. This can make stratifying the sets meaningless (note that in this case most of the elements fall in the middle stratum).
An extreme case is when the correlation between two variables is -1. All the generated elements are then incomparable, the mean height of all of them over the LEs is the same, and all fall in the same stratum. The weight of that stratum (equation (2)) is 1 (and that of the other strata zero). Finally we take a simple random sample without replacement of size from the stratum, and the design essentially becomes simple random sampling with replacement.
To overcome this problem, we suggest that if the bivariate correlations between some variables are negative, some of them be multiplied by -1 to make the correlations positive. With more than two variables, however, it is sometimes not possible to make all the correlations positive. In such cases, it is better to select the more important variables whose correlations can be made positive. We then rank the elements using poset theory with these new correlations.
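The effect of the sign change can be checked on a toy set with perfectly negatively related variables. The sketch below counts incomparable pairs before and after multiplying the second variable by -1; the data and the function name `incomparable_pairs` are hypothetical.

```python
def incomparable_pairs(points):
    """Number of pairs in which neither element dominates the other."""
    count = 0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            a, b = points[i], points[j]
            a_above = all(x >= y for x, y in zip(a, b))
            b_above = all(x <= y for x, y in zip(a, b))
            if not a_above and not b_above:
                count += 1
    return count

# Five hypothetical elements whose two variables have correlation -1.
points = [(x, -x) for x in range(1, 6)]
flipped = [(x, -y) for x, y in points]   # multiply the second variable by -1

print(incomparable_pairs(points))   # 10: every pair is incomparable
print(incomparable_pairs(flipped))  # 0: the set has become a chain
```

With all ten pairs incomparable, every element has the same mean height; after the sign change the set is totally ordered and the strata are fully informative.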
Bruggemann and Patil (2011) explain a procedure by which such subsets of variables can be found systematically. The crucial concept is the number of incomparabilities of a poset. First, a sensitivity measure is defined for each variable; it measures the impact of the variable on the structure of the poset (roughly: the system of comparabilities within the poset). Second, the variables are ordered by their impact on the poset. Third, considering first the poset induced by the most sensitive variable, then the poset induced by the two most sensitive variables, and so on, the number of incomparabilities is calculated as a function of the merged variables. The resulting curve helps to identify the subset of variables that mainly constitutes the poset. The remaining variables are considered fine tuning and are ignored.
4 Simulation Study
To evaluate and compare the efficiency of the designs, we calculate
where is the sample mean of a simple random sample, stands for (MVSR design), (CPOR design) or (RPOR design), and MSE denotes the mean squared error.
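A Monte Carlo version of such an efficiency measure can be sketched as follows. The sketch assumes that efficiency is the ratio of the MSE of the simple random sample mean to the MSE of the design estimator, so that values above 1 favour the design; this reading, the univariate RSS used as a stand-in design, and all function names are assumptions made for illustration.

```python
import random

def mse(estimates, true_value):
    """Monte Carlo mean squared error of a list of estimates."""
    return sum((e - true_value) ** 2 for e in estimates) / len(estimates)

def relative_efficiency(srs_estimates, design_estimates, true_value):
    """Assumed definition: MSE of the SRS mean divided by MSE of the design estimator."""
    return mse(srs_estimates, true_value) / mse(design_estimates, true_value)

def srs_mean(rng, n):
    """Mean of a simple random sample of size n from N(0, 1)."""
    return sum(rng.gauss(0, 1) for _ in range(n)) / n

def rss_mean(rng, set_size):
    """Mean of one balanced RSS cycle from N(0, 1) with perfect ranking."""
    picks = []
    for i in range(set_size):
        one_set = sorted(rng.gauss(0, 1) for _ in range(set_size))
        picks.append(one_set[i])
    return sum(picks) / set_size

rng = random.Random(2024)
iterations = 2000
srs = [srs_mean(rng, 3) for _ in range(iterations)]
rss = [rss_mean(rng, 3) for _ in range(iterations)]
eff = relative_efficiency(srs, rss, true_value=0.0)
print(eff)  # values above 1 mean the design beats SRS for the same number of measurements
```

The same two helper functions apply unchanged to the MVSR, CPOR and RPOR estimators once their Monte Carlo replicates are available.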
This section contains 3 parts:
Comparing CPOR and RPOR with MVSR using some simulations
Comparing CPOR and RPOR with MVSR using a real case study on medicinal flowers
Comparing CPOR and RPOR with MVSR using a real case study on environmental pollution.
Also, in the simulations, at least one sample unit is allocated to every stratum, no matter how small the size or variance of that stratum. All the simulations were done in R 3.1.2. For the Monte Carlo simulation we used 20000 iterations. Expectations, variances and MSEs of the estimators are computed using the Monte Carlo method.
4.1 Comparing CPOR and RPOR with MVSR using some simulations
In this part we investigate the efficiency of the designs introduced in sections 2 and 3, using the bivariate normal distribution (including how to resolve the negative correlation problem).
4.1.1 Bivariate Normal distribution with negative correlation
Here we performed the simulation assuming a normal distribution with negative correlation, with and . As we can see in table 7, and as we asserted in Section 3.4, when the correlation is strongly negative, CPOR and RPOR reduce to simple random sampling in terms of efficiency. When we convert the correlation to a positive value by changing the sign of one variable, the efficiency problem is solved (compare the results in the last two columns with the results in the first two columns).
4.1.2 Bivariate Normal
More complete simulations for the bivariate normal distribution are shown in table 8. For all the cases we simulated a bivariate normal with , , , and .
First note that changing does not affect the efficiency for the first variable, which the simulations confirm to within an error of 0.02. As a general point, the CPOR and RPOR designs increase the efficiency of the estimators for both variables simultaneously, whereas traditional multivariate ranked set sampling enhances the estimation of only one of the variables. As the correlations increase, efficiency increases. Unlike MVSR, CPOR and RPOR had reasonable efficiency for all the correlations. Also, CPOR, which uses the information of all LEs, was more efficient than RPOR.
4.2 Comparing CPOR and RPOR with MVSR using a real case study on medicinal flowers
To evaluate the designs in this section we used real case study data on the chamomile flower (Panahbehagh et al. 2017) as an example of the medicinal use of flowers. We consider the population means of “Flower dry weight” (Fdw) and “Essence” (Esn) as the two main parameters. Because we have no information about them before sampling, and they are expensive to measure, we used two auxiliary variables that are easy to measure and reasonably correlated with the two main variables. For sorting Fdw we used “Flower height” (Fht), with a correlation of 0.78, and for Esn we used “Number of petals” (Npt), with a correlation of 0.71. The correlation between Fht and Npt was 0.77. Simulation results are in table 9. As we can see in table 9, CPOR and RPOR enhance the efficiency of both estimators simultaneously. The most important factor in efficiency is the portion of , and efficiency increased as this factor increased. For example, compare two cases, one and two : although is larger in the second case, because the portion of is larger for the first one, the efficiency of the first case is larger than that of the second. Also, if the other parameters are equal, is the other important parameter that affects efficiency, and efficiency increased with increasing . Again, CPOR was more efficient than RPOR in almost all the cases.
4.3 Chemical Pollution
The Environmental Protection Agency (EPA) of the German state Baden-Wuerttemberg performed a series of measurements in different targets, for example in the herb layer, in the epiphytic mosses of trees, in fish, etc. For this purpose the state of Baden-Wuerttemberg was divided into 60 more or less homogeneous regions with respect to their natural environment. The regions were not selected according to administrative classification but so as to be as homogeneous as possible with respect to environmental pollution processes.
The task was, and is, to record the pollution due to industry, traffic and agricultural management with respect to the total concentrations of lead, cadmium, zinc and sulfur (measured in mg/kg dry mass).
Depending on the emission type there are different chemical species, for example SO or, dissolved in atmospheric droplets, HSO; similarly the metals, for example Pb, can be bound in organic chemicals or occur as oxides.
The different targets selected by the EPA should help to differentiate among the different transport processes and to trace back the emission source. Thus the herb layer is mainly a short-range transport indicator, whereas the epiphytic mosses (simply: moss layer) are considered to indicate middle-range transport. The herb layer should especially indicate the loading due to public traffic, whereas the moss layer may mainly indicate industrial sources.
An interesting question in geochemical research is how far the presence of, e.g., Pb implies the presence of Cd. A first attempt in this direction can be found in a paper by Bruggemann and Kerber (2018, submitted to a special issue of Comm. in Math. and in Comp. Chemistry). A classification approach concerning the pollution of Baden-Wuerttemberg was published by Bruggemann et al. (2013).
4.3.1 Comparing CPOR and RPOR with MVSR using a real case study on environmental pollution
In this study, regions in Baden-Wuerttemberg, South-West of Germany were selected and monitored with respect to total concentrations of the chemical elements Pb, Cd, Zn and S in the herb layer (Environmental Protection Agency Baden-Wurttemberg (Germany) 1994, Signale aus der Natur). The herb layer is one of the targets, selected by the Environmental Protection Agency of Baden-Wuerttemberg. This multi-indicator system with regions as objects and concentrations of the four chemical elements as indicators (Bruggemann and Patil 2011) raises the questions:
How can we get information about the pollution status?
What can be said about geochemical relations?
For example, does an increase in pollution with respect to one pollutant, for example Pb, always imply an increase of another pollutant, for instance Cd? For an answer from the point of view of applied partial order theory, see Bruggemann and Voigt (2012). (For more details see Bruggemann et al., 1996; Bruggemann et al., 1998; Bruggemann et al., 1999; Bruggemann et al., 2003 and Bruggemann et al., 2013.)
Here, to make all the correlations positive, we multiply Cd and Zn by -1. In this part we run two different scenarios:
Scenario I: Pb and Zn are selected as the two main variables with high correlation (0.6), and Cd and S as the two main variables with low correlation (0.06). In this scenario we used perfect ranking and did not use auxiliary variables.
Scenario II: From a chemical point of view, Cd and Pb are selected as the two main variables, and two auxiliary variables are used for sorting them: Zn, with a correlation of 0.48 with Cd, and S, with a correlation of 0.27 with Pb. This is a heuristic approach. In principle, economic or sociological information, or the density of highways, could also serve as auxiliary variables.
Results are shown in table 10 (Scenario I) and table 11 (Scenario II). Table 10 presents the efficiency of the estimators of the means of Pb and Zn, with correlation 0.6, and of the means of Cd and S, with correlation 0.06. For two variables with reasonable correlation (Zn and Pb) MVSR performs acceptably, because ranking based only on the first variable also supports the second variable.
For Cd and S, the situation is worse for MVSR because of the weak correlation (around 0.06) between them: the first variable is not able to support the second one, and the efficiency for S in MVSR is around 1. For CPOR and RPOR the results for the second variable are better. As the efficiency for the first variable (Cd) decreases from MVSR to CPOR and RPOR, the efficiency for the second variable (S) rises appreciably: the average efficiency for S is around 1.01 in MVSR but around 1.09 in CPOR and RPOR. Again, is the most important parameter in efficiency and after that .
In table 11, we have used two auxiliary variables to rank the main variables: for Cd we used Zn, with a correlation of 0.48, and for Pb we used S, with a correlation of 0.27. As we can see, MVSR improves only the efficiency for the first variable (Cd), whereas CPOR and RPOR improve the estimation of both variables; however, the improvement is not large because of the rather weak correlations between the auxiliary variables and the main variables ( and ).
Table 12 also presents the Monte Carlo expectations of the estimators, which show their unbiasedness.
With our sampling technique, mean values referring to the complete set of 59 geographical units are obtained. Clearly the regional relation is not taken into account (this has already been done in the papers mentioned above), but there is now a single number available that characterizes the overall status of Baden-Württemberg; for example, a time series could be constructed to see the general changes in pollution.
5 Conclusion
CPOR and RPOR can be used to implement RSS in population surveys where there are multiple variables of interest. CPOR and RPOR enhance the estimation of all the parameters simultaneously with a reasonable sample size, which most RSS methods cannot do in the multiple-variable case. As we saw in the real case studies, CPOR and RPOR do not need perfect ranking on the main variables; ranking can be done using variables that are easy to measure and reasonably correlated with the main variables.
The simulation section and real case study confirmed the assertions in the paper.
For further work, it would be beneficial to find an unbiased estimator for the variance of CPOR. Because of the randomness of , it is not easy to calculate the variance of CPOR or an unbiased estimator of it. However, since CPOR uses the information of all LEs and, in the simulations, was more efficient than RPOR in almost all cases, it may be reasonable to use the variance estimator of RPOR as a conservative estimate of the variance of CPOR.
Appendix A. Proof of Theorem 1
The proof of the theorem is the same as in Panahbehagh et al. (2017); just note that here .
Appendix B. Proof of Theorem 2
Here, according to the sampling strategy, (i) taking an iid sample from (a model) and (ii) taking a stratified finite-population sample from the selected sample (a design), we have model-design-based sampling. Let the indexes and mean “according to the model” and “according to the design” respectively. Then with
where indicates the whole sample of size Km.
Appendix C. Proof of Theorem 3
Here the design is affected by two sources of variation: variation from selecting one of the LEs, and variation from selecting the sample from the fixed form of the stratified population conditional on the selected LEs, which we denote by and respectively. Therefore, based on the LEs, assume we have and may happen with probability . Note that because all combinations of LEs happen with equal probability. Then we have
For the variance we have
It is easy to see that
and then as (because is not variable with respect to ) we have
For the unbiased estimator of the variance, first note that since we take an iid sample for each set and rank its units, the rank of each unit is distributed uniformly in the vector , and therefore we have
where indicates the rank of in its selected set and is an indicator function that takes the value 1 if .
where the last equation is based on (5), we have
References
Al-Saleh, M. and Zheng, G. (2002) Estimation of bivariate characteristics using ranked set sampling. Australian & New Zealand Journal of Statistics, 44, 221–232.
Bruggemann, R. and Carlsen, L. (2011) An Improved Estimation of Averaged Ranks of Partial Orders. MATCH Commun. Math. Comput. Chem., 65, 383–414.
Bruggemann, R., Kaune, A. and Voigt, K. (1996) Vergleichende ökologische Bewertung von Regionen in Baden-Württemberg. Pages 455–467 in Landesanstalt für Umweltschutz Baden-Württemberg, ed. 4. Statuskolloquium, Projekt “Angewandte Ökologie” Nr. 16. Präzis-Druck Karlsruhe, Karlsruhe.
Bruggemann, R., Mucha, H.-J. and Bartel, H.-G. (2013) Ranking of Polluted Regions in South West Germany Based on a Multi-indicator System. MATCH Commun. Math. Comput. Chem., 69, 433–462.
Bruggemann, R., Voigt, K., Kaune, A., Pudenz, S., Komossa, D. and Friedrich, J. (1998) Vergleichende ökologische Bewertung von Regionen in Baden-Württemberg. GSF-Bericht 20/98. GSF, Neuherberg.
Bruggemann, R., Welzl, G. and Voigt, K. (2003) Order Theoretical Tools for the Evaluation of Complex Regional Pollution Patterns. J. Chem. Inf. Comp. Sc., 43, 1771–1779.
Environmental Protection Agency Baden-Wurttemberg (1994) Signale aus der Natur: 10 Jahre Ökologisches Wirkungskataster Baden-Württemberg. Kraft Druck GmbH, Ettlingen.
Patil, G. P., Sinha, A. K. and Taillie, C. (1999) Ranked set sampling: A bibliography. Environmental and Ecological Statistics, 6, 91–98.
Ridout, M. S. (2003) On ranked set sampling for multiple characteristics. Environmental and Ecological Statistics, 10, 225–262.
Samawi, H. M. (1996) Stratified ranked set sample. Pakistan Journal of Statistics, 12, 9–16.