Given a sequence of numbers and an integer (selection) parameter , the selection problem asks to find the -th smallest element in . If the elements are distinct, the -th smallest is larger than elements of and smaller than the other elements of . By symmetry, the problems of determining the -th smallest and the -th largest are equivalent; throughout this paper, we will be mainly concerned with the latter dual problem.
Together with sorting, the selection problem is one of the most fundamental problems in computer science. Sorting trivially solves the selection problem; however, a higher level of sophistication is required in order to obtain a linear time algorithm. This was accomplished in the early 1970s, when Blum et al.  gave an -time algorithm for the problem. Their algorithm performs at most comparisons and its running time is linear irrespective of the selection parameter . Their approach was to use an element in as a pivot to partition into two smaller subsequences and recurse on one of them with a (possibly different) selection parameter . The pivot was set as the (recursively computed) median of medians of small disjoint groups of the input array (of constant size at least ). More recently, several variants of Select with groups of and , also running in time, have been obtained by Chen and Dumitrescu and independently by Zwick; see .
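The pivot-based approach described above can be sketched as follows. This is a minimal illustration, not the algorithm of Blum et al. as published: the group size of 5 is an assumption (the text only says the groups have constant size), the function name `mom_select` is hypothetical, and small groups are handled with `sorted` for brevity rather than with an optimal comparison network.

```python
def mom_select(items, k):
    """Return the k-th smallest element (1-indexed) using a median-of-medians pivot.

    Sketch only: groups of 5 are an assumed choice, and duplicates are
    handled by a three-way partition.
    """
    items = list(items)
    if not 1 <= k <= len(items):
        raise ValueError("k must satisfy 1 <= k <= len(items)")
    if len(items) <= 5:
        return sorted(items)[k - 1]

    # Split into disjoint groups of (at most) 5 and take each group's median.
    groups = [items[p:p + 5] for p in range(0, len(items), 5)]
    medians = [sorted(g)[(len(g) - 1) // 2] for g in groups]

    # Recursively compute the median of the medians and use it as the pivot.
    pivot = mom_select(medians, (len(medians) + 1) // 2)

    # Partition around the pivot and recurse on one side only,
    # adjusting the selection parameter when moving to the larger side.
    smaller = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    larger = [x for x in items if x > pivot]
    if k <= len(smaller):
        return mom_select(smaller, k)
    if k <= len(smaller) + len(equal):
        return pivot
    return mom_select(larger, k - len(smaller) - len(equal))
```

For example, `mom_select(data, (len(data) + 1) // 2)` returns a (lower) median of `data`.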
The selection problem, and computing the median in particular, is closely related to the problem of finding the quantiles of a set. The $k$-th quantiles of an $n$-element set are the $k-1$ order statistics that divide the sorted set into $k$ equal-sized groups (to within $1$); see, e.g., [10, p. 223]. The $k$-th quantiles of a set can be computed by a recursive algorithm running in $O(n \log k)$ time.
In an attempt to drastically reduce the number of comparisons done for selection (down from ), Schönhage et al.  designed a non-recursive algorithm based on different principles, most notably the technique of mass-production. Their algorithm finds the median (the -th largest element) using at most comparisons; as noted by Dor and Zwick , it can be adjusted to find the -th largest, for any , within the same comparison count. In subsequent work, Dor and Zwick  managed to reduce the comparison bound to about ; this, however, required new ideas and took a great deal of effort.
Mediocre elements (players).
Following Yao, an element is said to be -mediocre if it is neither among the top (i.e., largest) nor among the bottom (i.e., smallest) of a totally ordered set of elements. Yao remarked that, historically, finding a mediocre element is closely related to finding the median, with a common motivation being selecting an element that is not too close to either extreme. Observe also that -mediocre elements where , (and symmetrically exchanged), are medians of .
In her PhD thesis , Yao suggested a stunningly simple algorithm for finding an -mediocre element: Pick elements arbitrarily and select the -th largest among them. It is easy to check that this element satisfies the required condition. Yao asked whether this algorithm is optimal. No improvements over this algorithm were known. An interesting feature of this algorithm is that its complexity does not depend on (unless or do). Yao also proved that this algorithm is optimal for . For , let denote the minimum number of comparisons needed in the worst case to find an -mediocre element. Yao [35, Sec. 4.3] proved that , and so is independent of . Here denotes the minimum number of comparisons needed in the worst case to find the second largest out of elements.
The question of whether this algorithm is optimal for all values of and has remained open ever since; alternatively, the question is whether is independent of for all other values of and . Here we provide two alternative algorithms for finding a mediocre element (one deterministic and one randomized), and thereby confront Yao’s algorithm with concrete challenges.
Background and related problems.
Determining the comparison complexity for computing various order statistics including the median has led to many exciting questions, some of which are still unanswered today. In this respect, Yao’s hypothesis on selection [35, Sec. 4] has stimulated the development of such algorithms [14, 30, 33]. That includes the seminal algorithm of Schönhage et al. , which introduced principles of mass-production for deriving an efficient comparison-based algorithm.
Due to its primary importance, the selection problem has been studied extensively; see for instance [3, 6, 7, 11, 13, 14, 15, 18, 19, 20, 21, 22, 23, 25, 24, 30, 36]. A comprehensive review of early developments in selection is provided by Knuth . The reader is also referred to dedicated book chapters on selection, such as those in [1, 5, 10, 12, 26] and the more recent articles [9, 16], including experimental work .
In many applications (e.g., sorting), it is not important to find an exact median, or any other precise order statistic, for that matter, and an approximate median suffices . For instance, quick-sort type algorithms aim at finding a balanced partition without much effort; see e.g., . As a concrete example, Battiato et al.  gave an algorithm for finding a weak approximate median by using few comparisons. While the number of comparisons is at most in the worst case, their algorithm can only guarantee finding an -mediocre element with ; however, , and so the selection made could be shallow.
Our main results are summarized in the following. It is worth noting, however, that the list of sample data the theorems provide is not exhaustive.
Given a sequence of elements, an -mediocre element, where , , and , can be found by a deterministic algorithm A1 using comparisons in the worst case, where the constants for the quantiles through are given in Fig. 2 (column A1 of the second table). In particular, if the number of comparisons done by Yao’s algorithm is , we have , for each of these quantiles.
Given a sequence of elements, an -mediocre element, where , can be found by a randomized algorithm using comparisons on average. If is a fixed constant, an -mediocre element, where , can be found using comparisons on average. If are fixed constants with , an -mediocre element, where , can be found using comparisons on average.
In particular, finding an element near the median requires about comparisons for any previous algorithm (including Yao’s), and finding the precise median requires comparisons on average, while the main term in this expression cannot be improved . In contrast, our randomized algorithm finds an element near the median using about comparisons on average, thereby achieving a substantial saving of roughly comparisons.
Preliminaries and notation.
Without affecting the results, the following two standard simplifying assumptions are convenient: (i) the input sequence contains distinct numbers; and (ii) the floor and ceiling functions are omitted in the descriptions of the algorithms and their analyses. For example, for simplicity we write the -th element instead of the more precise -th element. In the same spirit, for convenience we treat and as integers. Unless specified otherwise, all logarithms are in base .
2 Instances and algorithms for deterministic approximate selection
We first make a couple of observations on the problem of finding an -mediocre element. Without loss of generality (by considering the complementary order), it can be assumed that ; and consequently, , if convenient. Our algorithm is designed to work for a specific range of values of : ; outside this range our algorithm simply proceeds as in Yao’s algorithm. With anticipation, we note that our test values for the purpose of comparison will belong to the specified range. Note that the conditions and imply that .
Yao’s algorithm is very simple: pick elements arbitrarily and select the -th largest among them. As mentioned earlier, it is easy to check that this element satisfies the required condition.
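A minimal sketch of this procedure follows. Since the subset size and the target rank are not reproduced in the text above, the sketch assumes the natural choice: take any i + j + 1 elements and return the (i + 1)-th largest among them, which then has i chosen elements above it and j below it. The function name `yao_mediocre` is hypothetical, the input is assumed to consist of distinct elements, and `sorted` is only a stand-in for an exact selection routine (it does not reflect the comparison counts discussed in the paper).

```python
def yao_mediocre(items, i, j):
    """Return an element with at least i larger and at least j smaller elements.

    Assumed parameters (not taken verbatim from the text): a subset of
    size i + j + 1 and target rank i + 1 from the top.
    """
    items = list(items)
    size = i + j + 1
    if i < 0 or j < 0 or size > len(items):
        raise ValueError("need i, j >= 0 and i + j + 1 <= len(items)")
    # Any subset of the right size works; a prefix is the simplest choice.
    subset = items[:size]
    # Exact selection of the (i + 1)-th largest; sorted() is a stand-in.
    return sorted(subset, reverse=True)[i]
```

Note that the remaining elements are never inspected, which is why the cost of this approach depends only on i and j.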
Our algorithm (for the specified range) is also simple: Group the elements into pairs and perform comparisons; then select the -th largest from the upper elements in the pairs. Let us first briefly argue about its correctness; denoting the selected element by , on one hand, observe that is smaller than (upper) elements in disjoint pairs; on the other hand, observe that is larger than (lower) elements in disjoint pairs, by the range assumption. It follows that the algorithm returns an -mediocre element, as required.
It should be noted that both algorithms (ours as well as Yao’s) make calls to exact selection, however with different input parameters. As such, we use state of the art algorithms and corresponding worst-case bounds for (exact) selection available. In particular, selecting the median can be accomplished with at most comparisons, by using the algorithm of Dor and Zwick ; and if is any fixed integer, selecting the -th largest element can be accomplished with at most comparisons, where
Here we present an algorithm that outperforms Yao’s algorithm for finding an -mediocre element for large and for a broad range of values of and suitable , using current best comparison bounds for exact selection as described above. A key difference between our algorithm and Yao’s lies in the amount of effort put into processing the input. Whereas Yao’s algorithm chooses an arbitrary subset of elements of a certain size and ignores the remaining elements, our algorithm looks at all the input elements and gathers initial information based on grouping the elements into disjoint pairs and performing the respective comparisons.
Consider the instance of the problem of selecting a mediocre element, where is a constant .
We next specify our algorithm and Yao’s algorithm for our problem instances. We start with our algorithm; and refer to Fig. 1 for an illustration.
Step 1: Group the elements into pairs by performing comparisons.
Step 2: Select and return the -th largest from the upper elements in the pairs. Refer to Fig. 1.
Let denote the selected element. The general argument given earlier shows that is -mediocre: On one hand, there are elements larger than ; on the other hand, there are elements smaller than , as required.
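The two steps above can be sketched as follows. The names are hypothetical and the exact parameters are assumptions, since they are not reproduced in the text: the sketch forms floor(n/2) disjoint pairs and selects the (i + 1)-th largest of the upper elements. The range check is a sufficient condition derived for this sketch (the selected element beats its own partner plus both members of every pair whose upper element lies below it) and may differ from the range stated in the paper. As before, `sorted` stands in for an exact selection algorithm and distinct elements are assumed.

```python
def pairing_mediocre(items, i, j):
    """Sketch of the pairing approach: compare disjoint pairs, then select
    among the pair winners. Returns an element with at least i larger and
    at least j smaller elements (input elements assumed distinct)."""
    items = list(items)
    m = len(items) // 2  # number of disjoint pairs; a leftover element is ignored
    # Guard derived for this sketch: i winners lie above the selected element,
    # and 2 * (m - i) - 1 elements are guaranteed to lie below it.
    if i < 0 or j < 0 or i >= m or j > 2 * (m - i) - 1:
        raise ValueError("(i, j) is outside the range handled by this sketch")

    # Step 1: one comparison per pair; keep the upper element of each pair.
    uppers = [max(items[2 * p], items[2 * p + 1]) for p in range(m)]

    # Step 2: select the (i + 1)-th largest upper element (sorted() is a stand-in).
    return sorted(uppers, reverse=True)[i]
```

In contrast to the subset-based approach, every input element (up to one leftover) takes part in a comparison before the exact selection step.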
We next specify Yao’s algorithm for our problem instances.
Step 1: Choose an arbitrary subset of elements from the given .
Step 2: Select and return the -th largest element from the chosen.
Let denote the selected element. As noted earlier, is -mediocre. Observe that the element returned by Yao’s algorithm corresponds to a selection problem with a fraction from the available.
Analysis of the number of comparisons.
For , let denote the multiplicative constant in the current best upper bound on the number of comparisons in the algorithm of Dor and Zwick for selection of the -th largest element out of elements, according to (1), with one improvement. Instead of considering only one value for , namely , we also consider the value , and let the algorithm choose the best (i.e., the smallest of the two resulting values in (1) for the number of comparisons in terms of ). This simple change improves the advantage of Algorithm A1 over Yao’s algorithm.
Recall that the algorithm of Dor and Zwick , which is a refinement of the algorithm of Schönhage et al. , is non-recursive, thus the selection target remains the same during its execution, and so choosing the best value for can be done at the beginning of the algorithm. (Recall that the seminal algorithm of Schönhage et al.  is non-recursive as well.)
To be precise, let
It follows by inspection that the comparison counts for Algorithm A1 and Algorithm Yao are bounded from above by and , respectively, where
It is worth noting that Equation (5) yields values larger than for certain values of ; e.g., for , we have , and , , and so . Moreover, a problem instance with would entail Algorithm A1 making a call to an exact selection with parameter (see (6) above). However, taking into consideration the possible adaptation of their algorithm pointed out by the authors , the expression of in (5) can be replaced by
or even by
We next show that (the new) Algorithm A1 outperforms Algorithm Yao with respect to the (worst-case) number of comparisons in selecting a mediocre element for large enough and for all instances , where , and ; that is, for all quantiles and suitable values of the 2nd parameter. This is proven by the data in the two tables in Fig. 2; the entries are computed using Equations (6) and (7), respectively. Moreover, the results remain the same, regardless of whether one uses the expression of in (8) or (9); to avoid the clutter, we only included the results obtained by using the expression of in (8).
We compute lower bounds by leveraging the work of Schönhage on a related problem, namely partial order production. In the partial order production problem, we are given a poset partially ordered by , and another set of elements with an underlying, unknown, total order ; with . The goal is to find a monotone injection from to by querying the total order and minimizing the number of such queries. Alternatively, the partial order production problem can be (equivalently) formulated with , by padding with singleton elements.
This problem was first studied by Schönhage , who showed by an information-theoretic argument that , where is the minimax comparison complexity of and is the number of linear extensions (i.e., total orders) of . Further results on poset production were obtained by Aigner . Yao  proved that Schönhage’s lower bound can be achieved asymptotically in the sense that , confirming a conjecture of Saks .
Finding an -mediocre element amounts to a special case of the partial order production problem, where consists of a center element, elements above it, and elements below it. For applying Schönhage’s lower bound we have
Interestingly enough, the resulting lower bound does not depend on ; observe here the connection with Yao’s hypothesis, namely the earlier question on the independence of on . Moreover, since
the above lower bound is rather weak, namely at most . Note that a lower bound of for selecting an -mediocre element is immediate by a connectivity argument applied to ; see also [34, Lemma 2]. On the other hand, observe that the coefficients (of the linear terms) in the upper bounds in the right table in Fig. 2 are all strictly greater than .
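As a worked illustration of the information-theoretic bound, the sketch below evaluates it for the mediocre-element poset. It assumes the bound has the form log2(n!/e(P)) and that P consists of a center element with i elements above it and j below it, padded with singletons to n elements. Under these assumptions, a total order is a linear extension exactly when the center sits in position j + 1 among the i + j + 1 constrained elements, so e(P) = n! i! j! / (i + j + 1)! and the ratio n!/e(P) equals (i + j + 1)!/(i! j!), which does not involve n. The function names are hypothetical, and the brute-force counter is only meant for very small n.

```python
import math
from itertools import permutations


def mediocre_lower_bound(i, j):
    """log2 of (i + j + 1)! / (i! j!), the assumed form of the bound."""
    ratio = math.factorial(i + j + 1) // (math.factorial(i) * math.factorial(j))
    return math.log2(ratio)


def count_linear_extensions(n, i, j):
    """Brute-force count of total orders on n labelled elements in which
    element 0 (the center) is below elements 1..i and above elements
    i+1..i+j; the remaining elements are unconstrained singletons."""
    if i + j + 1 > n:
        raise ValueError("need i + j + 1 <= n")
    count = 0
    for perm in permutations(range(n)):
        # perm lists the elements in increasing order; rank[e] is e's position.
        rank = {e: r for r, e in enumerate(perm)}
        above_ok = all(rank[a] > rank[0] for a in range(1, i + 1))
        below_ok = all(rank[b] < rank[0] for b in range(i + 1, i + j + 1))
        if above_ok and below_ok:
            count += 1
    return count
```

For instance, with n = 5, i = 1, and j = 2 the brute-force count is 10, so n!/e(P) = 120/10 = 12 = 4!/(1! 2!), and the bound evaluates to log2(12), roughly 3.58 comparisons.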
3 Instances and algorithms for randomized approximate selection
Consider the problem of selecting an -mediocre element, for the important symmetric case . To start with, let (the first scenario described in Theorem 2); an extended range of values will be given in the end.
We next specify our algorithm and compare it with Yao’s algorithm instantiated with these values (). (Footnote 1: We could formulate a general algorithm for finding an -mediocre element, acting differently in a specified range, as we did for the deterministic algorithm in Section 2. However, for clarity, we preferred to specify it in this way.)
Input: A set of elements over a totally ordered universe.
Output: An -mediocre element, where .
Step 1: Pick a (multi)-set of elements in , chosen uniformly and independently at random with replacement.
Step 2: Let be median of (computed by a linear-time deterministic algorithm).
Step 3: Compare each of the remaining elements of to .
Step 4: If there are at least elements of on either side of , return , otherwise FAIL.
Observe that (i) Algorithm A2 performs at most comparisons; and (ii) it either correctly outputs an -mediocre element, where , or FAIL.
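A single execution of the four steps above can be sketched as follows. The sample size and the threshold i are left as parameters because their exact values are not reproduced in the text; the function name `a2_trial` and the use of `sorted` for the sample median (instead of a linear-time deterministic selection) are assumptions of this sketch. For simplicity, the sketch compares every input element to the sample median rather than only the elements outside the sample, and it assumes distinct input elements.

```python
import random


def a2_trial(items, i, sample_size, rng=random):
    """One Monte Carlo trial: return an element with at least i elements on
    either side, or None to signal FAIL."""
    items = list(items)
    if sample_size < 1 or not items:
        raise ValueError("need a non-empty input and sample_size >= 1")

    # Step 1: sample uniformly and independently, with replacement.
    sample = [rng.choice(items) for _ in range(sample_size)]

    # Step 2: median of the sample (sorted() is a stand-in for exact selection).
    m = sorted(sample)[(sample_size - 1) // 2]

    # Step 3: compare the input elements to m and count each side.
    below = sum(1 for x in items if x < m)
    above = sum(1 for x in items if x > m)

    # Step 4: accept m only if it has at least i elements on either side.
    return m if below >= i and above >= i else None
```

A non-`None` result is always correct by construction; randomness only affects whether the trial succeeds.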
Analysis of the number of comparisons.
Our analysis is quite similar to that of the classic randomized algorithm for finding the median; see , but also [29, Sec. 3.3] and [28, Sec. 3.4]. In particular, the randomized median finding algorithm and Algorithm A2 both fail for similar reasons.
Recall that an execution of Algorithm A2 performs at most comparisons. Define a random variable by
The variables are independent, since the sampling is done with replacement. It is easily seen that
Let be the random variable counting the number of samples in whose rank is less than . By the linearity of expectation, we have
Observe that the randomized algorithm A2 fails if and only if the rank (in ) of the median of is outside the rank interval , i.e., the rank of is smaller than or larger than . Note that if algorithm A2 fails then at least elements of have rank or at least elements of have rank ; denote these two events by and , respectively. We next bound from above their probability.
Since is a Bernoulli trial, is a binomial random variable with parameters and . Observing that , it follows (see for instance [28, Sec. 3.2.1]) that
Applying Chebyshev’s inequality yields
as claimed. ∎
Similarly, we deduce that . Consequently, by the union bound it follows that the probability that one execution of Algorithm A2 fails is bounded from above by
As in [28, Sec 3.4], Algorithm A2 can be converted (from a Monte Carlo algorithm) to a Las Vegas algorithm by running it repeatedly until it succeeds. By Lemma 1, the FAIL probability is significantly small, and so the expected number of comparisons of the resulting algorithm is still . Indeed, the expected number of repetitions until the algorithm succeeds is at most
Since the number of comparisons in each execution of the algorithm is , the expected number of comparisons until success is at most
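The Monte Carlo to Las Vegas conversion described above amounts to a retry loop. The sketch below is self-contained and hedged in the same way as before: the name `a2_las_vegas` is hypothetical, the sample size and threshold i are parameters rather than the paper's specific values, `sorted` stands in for exact selection on the sample, and distinct input elements are assumed (otherwise a valid element need not exist and the loop might not terminate).

```python
import random


def a2_las_vegas(items, i, sample_size, rng=random):
    """Repeat the randomized trial until it succeeds.

    Returns (element, attempts), where element has at least i elements on
    either side and attempts is the number of trials that were run.
    """
    items = list(items)
    if i < 0 or 2 * i + 1 > len(items) or sample_size < 1:
        raise ValueError("need 0 <= i, 2 * i + 1 <= len(items), sample_size >= 1")

    attempts = 0
    while True:
        attempts += 1
        # One trial: sample with replacement and take the sample median.
        sample = [rng.choice(items) for _ in range(sample_size)]
        m = sorted(sample)[(sample_size - 1) // 2]
        below = sum(1 for x in items if x < m)
        above = sum(1 for x in items if x > m)
        if below >= i and above >= i:
            return m, attempts
        # Otherwise the trial FAILed; draw a fresh sample and try again.
```

If each trial fails with probability at most p, the number of attempts is dominated by a geometric random variable, so its expectation is at most 1/(1 - p).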
We now analyze the average number of comparisons done by Yao’s algorithm. On one hand, by a classic result of Floyd and Rivest , the -th largest element out of given, can be found using at most comparisons on average. On the other hand, by a classic result of Cunto and Munro , this task requires comparisons on average. In particular, the median of elements can be found using at most comparisons on average; and the main term in this expression cannot be improved.
Consequently, since , the average number of comparisons done by Algorithm A2 is significantly smaller than the average number of comparisons done by Yao’s algorithm for the task of finding an -mediocre element, when is large and .
A broad range of symmetric instances for comparison purposes can be obtained as follows. Let be any fixed constant. Consider the problem of selecting an -mediocre element in the symmetric case, where . Our algorithm first chooses an arbitrary subset of elements of to which it applies Algorithm A2; as such, it uses at most comparisons on average. It is implicitly assumed here that , i.e., that , which holds for large enough. In contrast, Yao’s algorithm chooses an arbitrary subset of elements and uses comparisons on average. Since for every , the average number of comparisons in Algorithm A2 is significantly smaller than the average number of comparisons in Yao’s algorithm for the task of finding an -mediocre element, when is large and .
Finally, we remark that a broad range of asymmetric instances with a gap, as described in Theorem 2, can be constructed using similar principles; in particular, in Step 2 of Algorithm A2, a different order statistic of (i.e., a biased partitioning element) is computed rather than the median of . It is easy to see that the resulting algorithm performs at most comparisons on average. The correctness argument is similar to the one used above in the symmetric case and so we omit further details.
Schönhage’s lower bound on the minimax comparison complexity of in the problem of partial order production was extended to minimean comparison complexity by Yao . Denoting this complexity by , he showed that . As such, the (same) lower bound for finding an -mediocre element derived at the end of Section 2 holds for randomized algorithms. Recall that this is
For the 2nd case in Theorem 2, namely , this is , which matches the upper bound in the theorem in the main term. The same comment applies for the first case in the theorem, namely , and for the third case. Consequently, the upper bounds in Theorem 2 are optimal modulo lower order terms.
We presented two alternative algorithms (one deterministic and one randomized) for finding a mediocre element, i.e., for approximate selection. The deterministic algorithm outperforms Yao’s algorithm for large with respect to the worst-case number of comparisons for about one third of the quantiles (as the first parameter), and suitable values of the second parameter, using state-of-the-art algorithms for exact selection due to Dor and Zwick . Moreover, we suspect that the comparison outcome remains the same for large and the entire range of and suitable in the problem of selecting an -mediocre element. Whether Yao’s algorithm can be beaten by a deterministic algorithm in the symmetric case remains an interesting question.
The randomized algorithm outperforms Yao’s algorithm for large with respect to the expected number of comparisons for the entire range of in the problem of finding an -mediocre element, where . These ideas can also be used to generate asymmetric instances with a gap for suitable variants of the randomized algorithm.
The author thanks Jean Cardinal for stimulating discussions on the topic. In particular, the idea of examining the existent lower bounds for the partial order production problem is due to him. The author is also grateful to an anonymous reviewer for pertinent comments.
-  A. V. Aho, J. E. Hopcroft, and J. D. Ullman, Data Structures and Algorithms, Addison–Wesley, Reading, Massachusetts, 1983.
-  M. Aigner, Producing posets, Discrete Mathematics 35 (1981), 1–15.
-  M. Ajtai, J. Komlós, W. L. Steiger, and E. Szemerédi, Optimal parallel selection has complexity $O(\log\log n)$, Journal of Computer and System Sciences 38(1) (1989), 125–133.
-  A. Alexandrescu, Fast deterministic selection, Proceedings of the 16th International Symposium on Experimental Algorithms (SEA 2017), June 2017, London, pp. 24:1–24:19.
-  S. Baase, Computer Algorithms: Introduction to Design and Analysis, 2nd edition, Addison-Wesley, Reading, Massachusetts, 1988.
-  S. Battiato, D. Cantone, D. Catalano, G. Cincotti, and M. Hofri, An efficient algorithm for the approximate median selection problem, Proceedings of the 4th Italian Conference on Algorithms and Complexity (CIAC 2000), LNCS vol. 1767, Springer, 2000, pp. 226–238.
-  S. W. Bent and J. W. John, Finding the median requires $2n$ comparisons, Proceedings of the 17th Annual ACM Symposium on Theory of Computing (STOC 1985), ACM, 1985, pp. 213–216.
-  M. Blum, R. W. Floyd, V. Pratt, R. L. Rivest, and R. E. Tarjan, Time bounds for selection, Journal of Computer and System Sciences 7(4) (1973), 448–461.
-  K. Chen and A. Dumitrescu, Select with groups of $3$ or $4$, Proceedings of the 29th International Symposium on Algorithms and Data Structures (WADS 2015), Victoria, Canada, August 2015, Vol. 9214 of LNCS, Springer, pp. 189–199. Also available at arXiv.org/abs/1409.3600.
-  T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, 3rd edition, MIT Press, Cambridge, 2009.
-  W. Cunto and J. I. Munro, Average case selection, Journal of ACM 36(2) (1989), 270–279.
-  S. Dasgupta, C. Papadimitriou, and U. Vazirani, Algorithms, McGraw-Hill, New York, 2008.
-  D. Dor, J. Håstad, S. Ulfberg, and U. Zwick, On lower bounds for selecting the median, SIAM Journal on Discrete Mathematics 14(3) (2001), 299–311.
-  D. Dor and U. Zwick, Finding the $\alpha n$-th largest element, Combinatorica 16(1) (1996), 41–58.
-  D. Dor and U. Zwick, Selecting the median, SIAM Journal on Computing 28(5) (1999), 1722–1758.
-  A. Dumitrescu, A selectable sloppy heap; preprint available at arXiv.org/abs/1607.07673.
-  S. Edelkamp and A. Weiß, QuickMergesort: Practically efficient constant-factor optimal sorting, preprint available at arXiv.org/abs/1804.10062.
-  R. W. Floyd and R. L. Rivest, Expected time bounds for selection, Communications of ACM 18(3) (1975), 165–172.
-  F. Fussenegger and H. N. Gabow, A counting approach to lower bounds for selection problems, Journal of ACM 26(2) (1979), 227–238.
-  A. Hadian and M. Sobel, Selecting the $t$-th largest using binary errorless comparisons, Combinatorial Theory and Its Applications 4 (1969), 585–599.
-  C. A. R. Hoare, Algorithm 63 (PARTITION) and algorithm 65 (FIND), Communications of the ACM 4(7) (1961), 321–322.
-  L. Hyafil, Bounds for selection, SIAM Journal on Computing 5(1) (1976), 109–114.
-  J. W. John, A new lower bound for the set-partitioning problem, SIAM Journal on Computing 17(4) (1988), 640–647.
-  H. Kaplan, L. Kozma, O. Zamir, and U. Zwick, Selection from heaps, row-sorted matrices and X+Y using soft heaps, preprint 2018, available at http://arxiv.org/abs/1802.07041.
-  D. G. Kirkpatrick, A unified lower bound for selection and set partitioning problems, Journal of ACM 28(1) (1981), 150–165.
-  J. Kleinberg and É. Tardos, Algorithm Design, Pearson & Addison–Wesley, Boston, Massachusetts, 2006.
-  D. E. Knuth, The Art of Computer Programming, Vol. 3: Sorting and Searching, 2nd edition, Addison–Wesley, Reading, Massachusetts, 1998.
-  M. Mitzenmacher and E. Upfal, Probability and Computing: Randomized Algorithms and Probabilistic Analysis, Cambridge University Press, 2005.
-  R. Motwani and P. Raghavan, Randomized Algorithms, Cambridge University Press, 1995.
-  M. Paterson, Progress in selection, Proceedings of the 5th Scandinavian Workshop on Algorithm Theory (SWAT 1996), LNCS vol. 1097, Springer, 1996, pp. 368–379.
-  M. Saks, The information theoretic bound for problems on ordered sets and graphs, in Graphs and Order (I. Rival, editor), D. Reidel, Boston, MA, 1985, pp. 137–168.
-  A. Schönhage, The production of partial orders, Astérisque 38-39 (1976), 229–246.
-  A. Schönhage, M. Paterson, and N. Pippenger, Finding the median, Journal of Computer and System Sciences 13(2) (1976), 184–199.
-  A. C. Yao, On the complexity of partial order production, SIAM Journal on Computing 18(4) (1989), 679–689.
-  F. Yao, On lower bounds for selection problems, Technical report MAC TR-121, Massachusetts Institute of Technology, Cambridge, 1974.
-  C. K. Yap, New upper bounds for selection, Communications of the ACM 19(9) (1976), 501–508.