1 Introduction
Given a sequence $A$ of $n$ numbers and an integer (selection) parameter $1 \le i \le n$, the selection problem asks to find the $i$-th smallest element in $A$. If the $n$ elements are distinct, the $i$-th smallest is larger than $i-1$ elements of $A$ and smaller than the other $n-i$ elements of $A$. By symmetry, the problems of determining the $i$-th smallest and the $i$-th largest are equivalent. Together with sorting, the selection problem is one of the most fundamental problems in computer science. Whereas sorting trivially solves the selection problem in $O(n \log n)$ time, Blum et al. [7] gave an $O(n)$-time algorithm for this problem.
The selection problem, and computing the median in particular, is closely related to the problem of finding the quantiles of a set. The $k$-th quantiles of an $n$-element set are the $k-1$ order statistics that divide the sorted set into $k$ equal-sized groups (to within $1$); see, e.g., [10, p. 223]. The $k$-th quantiles of a set can be computed by a recursive algorithm running in $O(n \log k)$ time.
The selection problem, and determining the median in particular, has also been considered from the perspective of communication complexity in the two-party model introduced by Andrew Yao [36]. Suppose that Alice and Bob hold subsets $A$ and $B$ of $[n] = \{1, \ldots, n\}$, respectively, and wish to determine the median of the multiset $A \cup B$ while keeping their communication close to a minimum. Several classic protocols going back to the 1980s achieve this task by exchanging $O(\log^2 n)$ bits [29, 34]. The communication complexity for this task has been subsequently reduced to $O(\log n)$ bits [29, 33].
Mediocre elements.
Following Frances Yao [37], an element is said to be $(i,j)$-mediocre if it is neither among the top $i$ (i.e., largest) nor among the bottom $j$ (i.e., smallest) of a totally ordered set $S$ of $n$ elements. As remarked by Yao, finding a mediocre element is closely related to finding the median, in the sense that the common goal is selecting an element that is not too close to either extreme. In particular, $(i,j)$-mediocre elements where $i = \lfloor (n-1)/2 \rfloor$, $j = \lfloor n/2 \rfloor$ (and symmetrically exchanged), are medians of $S$. Previous work on approximate selection (in this sense) includes [5, 17].
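The definition of a mediocre element can be made concrete with a small sketch. It assumes distinct elements and writes `i` for the number of excluded largest elements and `j` for the number of excluded smallest ones; the function names are chosen here for illustration only, and sorting is used for clarity rather than efficiency.

```python
def mediocre_elements(s, i, j):
    """Return all elements of s that are neither among the i largest
    nor among the j smallest (elements of s are assumed distinct)."""
    ranked = sorted(s)
    return ranked[j:len(ranked) - i]


def is_mediocre(x, s, i, j):
    """Check whether x is mediocre in s for the pair (i, j) as defined above."""
    return x in mediocre_elements(s, i, j)


values = [5, 1, 9, 3, 7]
# Excluding the largest element and the two smallest leaves 5 and 7.
print(mediocre_elements(values, 1, 2))
# With i = (n - 1) // 2 and j = n // 2 only a median survives.
n = len(values)
print(mediocre_elements(values, (n - 1) // 2, n // 2))
```

Taking both `i` and `j` close to half of the set size narrows the admissible range down to the median, which is the sense in which the two problems are related.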
In Section 3 we study the communication complexity of finding a mediocre element in the two-party model introduced by Andrew Yao [36]. The communication complexity of finding the median in this model has been studied in [34, 9, 31]; see also [29, 33]. In particular we outline a scenario in which computing a mediocre element near the median can be accomplished with communication complexity —which is very attractive.
Background and related problems.
Due to its primary importance, the selection problem has been studied extensively; see for instance [2, 5, 6, 11, 13, 14, 15, 19, 20, 21, 22, 23, 24, 25, 26, 32, 35, 38]. A comprehensive review of early developments in selection is provided by Knuth [28]. The reader is also referred to dedicated book chapters on selection, such as those in [1, 4, 10, 12, 27] and the more recent articles [8, 16], including experimental work [3].
In many applications (e.g., sorting), it is not important to find an exact median, or any other precise order statistic, for that matter, and an approximate median suffices [18]. For instance, quick-sort type algorithms aim at finding a (not necessarily perfect) balanced partition rather quickly; see e.g., [5, 22].
Our results.
Our main results are summarized in the two theorems stated below. We first study the communication complexity of finding the median in the multiparty setting. In this model we assume that every message by one of the players is seen by all the players (i.e., it is a broadcast); as in [29, p. 83].
Theorem 1. For , let player hold a sequence (i.e., a multiset) whose support is a subset of and . There is a deterministic protocol for finding the median of with communication complexity.
In the second part of our paper, somewhat surprisingly, we show that (under slightly stronger assumptions and a somewhat relaxed requirement) the communication complexity of finding a mediocre element in the vicinity of the median is bounded from above by a constant and is therefore independent of .
Theorem 2. Let , where , is fixed and be a positive constant. Let Alice and Bob hold pairwise disjoint sets and , of distinct elements from ; assume that and are known to both of them. Let denote the total number of elements in , where . Put and .
Then an -mediocre element can be found (by at least one player) with communication complexity. If both players return, each element returned is -mediocre; the elements found by the players need not be the same.
Preliminaries.
A simple but effective procedure reduces the selection problem for finding the -th smallest element out of to one for finding the median in a slightly larger sequence. The target is the -th smallest element in an input sequence of size . Assume first that ; in this case pad the input with elements that are less than or equal to the minimum in the input sequence; call the resulting sequence. Note that . It suffices to observe that the median of is the -th smallest element in : indeed, , as required. The case is symmetric; in this case pad the input with elements that are larger than or equal to the maximum in the input sequence; call the resulting sequence. Note that . Observe that the median of is the -th smallest element in , as required. We therefore restrict our attention to the median selection problem.
Notation.
Without affecting the results, the floor and ceiling functions are omitted in some instances where they are not essential. For example, we frequently write the -th element instead of the more precise -th element. Unless specified otherwise, all logarithms are in base .
For an -bit number and a positive integer , where , denotes the -bit binary prefix of , i.e., the number formed by the first (i.e., most significant) bits of .
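As a small illustration of the binary prefix notation, the sketch below extracts the leading bits of a fixed-width number with a right shift. The names `binary_prefix`, `width`, and `j` are hypothetical and chosen only for this example.

```python
def binary_prefix(x, width, j):
    """Return the number formed by the j most significant bits of x,
    where x is viewed as a binary string of exactly `width` bits."""
    if not 1 <= j <= width:
        raise ValueError("j must satisfy 1 <= j <= width")
    if not 0 <= x < (1 << width):
        raise ValueError("x must fit in `width` bits")
    # Dropping the (width - j) least significant bits keeps the prefix.
    return x >> (width - j)


# 0b101101 viewed as a 6-bit number has 3-bit prefix 0b101.
print(binary_prefix(0b101101, 6, 3))
# Leading zeros count: 5 is 000101 in 6 bits, so its 3-bit prefix is 0.
print(binary_prefix(5, 6, 3))
```

Note that the prefix depends on the fixed width, since leading zeros are part of the representation.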
If belongs to a sorted list and is not the minimum, denotes its predecessor. If belongs to a sorted list and is not the maximum, denotes its successor.
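The padding reduction from the Preliminaries (selection of an arbitrary rank via median finding) can be sketched as follows. This is a minimal illustration under stated assumptions: the median is taken to be the lower median (rank $\lceil m/2 \rceil$ in a sequence of size $m$), the padding amounts are derived for that convention and may differ from the exact values intended in the text, and `lower_median` sorts the input purely for clarity.

```python
def lower_median(seq):
    """Lower median: the element of rank ceil(m / 2) in a sequence of size m."""
    ranked = sorted(seq)
    return ranked[(len(ranked) + 1) // 2 - 1]


def select_via_median(seq, i):
    """Return the i-th smallest element (1-based) of seq using one median query."""
    n = len(seq)
    if not 1 <= i <= n:
        raise ValueError("rank out of range")
    if 2 * i <= n:
        # Pad with n - 2i copies of the minimum: the padded size is 2n - 2i,
        # its lower median has rank n - i = (n - 2i) + i.
        padded = [min(seq)] * (n - 2 * i) + list(seq)
    else:
        # Pad with 2i - n copies of the maximum: the padded size is 2i,
        # its lower median has rank i.
        padded = list(seq) + [max(seq)] * (2 * i - n)
    return lower_median(padded)


data = [8, 3, 5, 1, 9, 2]
print([select_via_median(data, i) for i in range(1, len(data) + 1)])
```

In both cases the padded sequence has at most twice the original size, so the reduction costs only a constant factor.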
2 Exact selection
In this section we prove Theorem 1. First, we set up the problem in the context of two-party communication complexity; we start with some background. In this section, each player’s input is allowed to contain duplicates. Following the literature, we refer to these (potential) multisets as sets, and the union operation should be understood as multiset union [29, Example 1.6, p. 6]. (An equivalent formulation is merging of sequences.)
2.1 Two players
Alice and Bob hold multisets and whose supports are subsets of , respectively. It is assumed that . (In a standard setup [29, Example 1.6, p. 6], and are subsets of ; here we extend this setup for potentially larger multisets.) The median of the multiset is denoted by ; as usual, the median of is the -th smallest element of .
There is a simple binary-search type protocol due to M. Karchmer that takes bits of communication; see [29, Example 1.6, p. 6]. At each round Alice and Bob have an interval , , that contains the median. They halve the interval (repeatedly) by deciding whether the median is less than, equal to, or larger than . This is done by Alice sending to Bob the number of elements in that are less than , equal to , and larger than , using bits. Bob can now determine whether the median is less than, equal to, or larger than , and sends this information to Alice using bits. The protocol has rounds, each requiring bits of communication, so the overall communication complexity is .
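The counting protocol just described can be simulated in a few lines. The sketch below is illustrative only: it assumes values in the range 1..n, takes the median to be the lower median of the merged multiset, and uses a simple bit-accounting convention (each count is charged the bit length of Alice's set size, and Bob's three-way answer is charged 2 bits). The function name and the accounting are assumptions made for this example.

```python
def median_by_counting(alice, bob, n):
    """Simulate the binary search over the value range 1..n.
    Returns (median, bits), where bits is the simulated communication."""
    total = len(alice) + len(bob)
    target = (total + 1) // 2          # rank of the lower median (assumed)
    count_bits = max(1, len(alice).bit_length())
    lo, hi = 1, n
    bits = 0
    while lo <= hi:
        k = (lo + hi) // 2
        # Alice sends how many of her elements are below, equal to, above k.
        a_less = sum(1 for x in alice if x < k)
        a_equal = sum(1 for x in alice if x == k)
        bits += 3 * count_bits
        # Bob adds his own counts and decides on which side the median lies.
        less = a_less + sum(1 for x in bob if x < k)
        equal = a_equal + sum(1 for x in bob if x == k)
        bits += 2                       # Bob's answer: less / equal / larger
        if target <= less:
            hi = k - 1
        elif target <= less + equal:
            return k, bits
        else:
            lo = k + 1
    raise ValueError("inputs must be nonempty with values in 1..n")


median, bits = median_by_counting([1, 3, 3, 8], [2, 5, 7], 8)
print(median, bits)
```

The loop runs for a logarithmic number of rounds in the size of the value range, and each round exchanges a logarithmic number of bits, which matches the structure of the bound stated above.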
An alternative binary-search type protocol that takes bits of communication, also due to Karchmer [29, p. 168], works as follows. Assume, without loss of generality that and that the common size is a power of : this can be achieved by exchanging the sizes of their inputs ( bits) and padding them with the appropriate number of the minimal element () and the maximal element (). The protocol works in rounds. During the protocol, Alice maintains a set of elements that may still be the median (initially ) and Bob maintains a set of elements that may still be the median (initially ). At each round, Alice sends Bob the value , which is the median of , and Bob sends Alice the value , which is the median of . At this point we have . If , then Alice discards the lower half of (note that is part of it) and Bob discards the upper half of . If , then Bob discards the lower half of (note that is part of it) and Alice discards the upper half of . In either case, this operation maintains the median of as the desired median of . It should be noted that the size of is reduced (exactly) by a factor of . If , this value is the median, and if , then the smaller number is the median. The protocol has rounds, each requiring bits of communication, and so the communication complexity is .
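The halving protocol above can likewise be simulated. The following sketch assumes that both inputs already have the same size and that this size is a power of two (the padding step is omitted), and it uses the lower median of each current set, so that on termination with singletons the smaller number is returned. The function name is hypothetical.

```python
def median_by_halving(alice, bob):
    """Simulate the median-exchange protocol on two equal-size multisets
    whose common size is a power of two. Returns the lower median of the union."""
    a, b = sorted(alice), sorted(bob)
    size = len(a)
    if size != len(b) or size == 0 or size & (size - 1):
        raise ValueError("inputs must have equal size that is a power of two")
    while size > 1:
        half = size // 2
        med_a, med_b = a[half - 1], b[half - 1]   # the exchanged lower medians
        if med_a == med_b:
            return med_a                          # this value is the median
        if med_a < med_b:
            # Alice drops her lower half (it contains med_a), Bob his upper half.
            a, b = a[half:], b[:half]
        else:
            # Symmetric case: Bob drops his lower half, Alice her upper half.
            a, b = a[:half], b[half:]
        size = half
    return min(a[0], b[0])


print(median_by_halving([1, 4, 6, 9], [2, 3, 7, 8]))
```

Each round removes the same number of elements from below and from above the median, which is why the median of the remaining elements stays equal to the median of the original union.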
The communication complexity of finding the median can be further reduced. A subtle refinement of the above protocol, due to Karchmer [29, Example 1.7, p. 6 and p. 168], and revised by Gasarch [30], works with communication complexity: its key idea is to make comparisons in a bit-by-bit manner, but this requires careful bookkeeping of the progress and here we omit the technical details.
We next describe a different (folklore) protocol, running with communication complexity, that we find simpler and subsequently refine for computing a mediocre element. The protocol implements a binary-search strategy and works in rounds. Alice maintains a set of elements that may still be the median (initially ) and Bob maintains a set of elements that may still be the median (initially ). Alice and Bob compute the medians of their current inputs ( and , respectively). At this point we have . Alice and Bob aim to determine the order relation between and in order to halve their input in an appropriate manner.
The protocol avoids sending these -bit numbers at each round by not making a direct comparison between and . The players have an interval , , that contains the median (initially, ). The medians and are compared to the middle element . If , this element is the median of and the protocol terminates. Otherwise, if and are split by , i.e., or , then (by transitivity of ), the relation between and is determined, and Alice and Bob halve their inputs accordingly (as in the earlier protocol). Otherwise, and are on the same side of , i.e., or . For example, in the first case, the elements in the lower half of are and the same holds for the median of . As such, both players shrink their common interval by (roughly) half: the resulting interval is or , respectively. Alice and Bob communicate each of the outcomes of the above tests in bits. Each halving operation for and maintains the property that .
Let . Note that after tests, either Alice and Bob hold singleton sets (i.e., ), or the common interval consists of a single integer . If , the smaller number is the median (or either, for equality), whereas if , this number is the median. The number of bits exchanged before the last round of the protocol is and is in the last round. The resulting communication complexity is .
2.2 players
In this subsection we prove Theorem 1. It is worth noting that the number of players, is independent of . The protocol maintains the invariant: the median of in one round is the same for the updated sets in the next round. It is possible that the number of sets drops from to a lower number; the protocol remains unchanged until the value is reached, when the respective players apply the protocol in Subsection 2.1; recall that padding with extra elements may be needed. If the value is reached, the remaining player computes the median in his/her own set and the game ends.
Initially, each player sorts his/her input set locally. The sorted order is used by each player in the pruning process, and if such action occurs, the sorted order is locally maintained. Each set pruning discards elements at one of the two ends of the chain (either low elements below some threshold, or high elements above some threshold).
The protocol roughly halves the size of at least one of the current participating sets; more precisely, for some , we have by the end of each round. Since the size of each set is initially , the size of each of the sets drops to in at most iterations and consequently, the number of rounds is at most . (Padding with extra elements when is reached conforms with this bound.)
Each round of the protocol works as follows. Each player (locally) finds the median of his/her current set: , . The following scheme regarding medians is used: assume that there are sets of even size and sets of odd size in the current round, where ; for the sets of even size the first use the lower median and the remaining use the upper median (in some fixed, e.g., alphabetical, order). The idea of intermixing upper and lower medians is also present in [8]. (A scheme that uses only lower medians or only upper medians fails to guarantee that the median of the union is maintained after pruning, for instance if and all three sets have even size; the smallest example of this kind is .)
In the first round, each player posts his/her median on the communication board; this involves bits of communication. In the remaining rounds, two players whose sets got pruned (as further explained below) need to update their median on the communication board. Depending on the parities of the sets of these two players before and after the pruning, at most one more player may need to update his/her median to maintain the balanced scheme adopted earlier, which prescribes how many of the even-size sets use the lower median and how many use the upper median. Therefore, in each round, the communication complexity is .
All players are now able to determine the sorted order of the medians. For simplicity, assume that after relabeling, this order is
(1) |
It is convenient to refer to the players holding the minimum and maximum of these medians as Alice and Bob and to their corresponding sets as and : and (this relabeling is only done for the purpose of analysis).
Let denote the poset made by the chains , together with the relations in (1). Write , , and . The player holding the smaller set between Alice and Bob is in charge of the pruning operation in the current round: the same number of elements is discarded by Alice and Bob as specified below. Refer to Fig 1.
If , Alice discards elements in (all when is odd or is the lower median, or all when is the upper median), and Bob discards the highest elements in . Such an operation is charged to Alice. Otherwise, if , Bob discards elements in (all when is odd or is the upper median, or all when is the lower median). Such an operation is charged to Bob. It is worth noting that this scheme is feasible: i.e., if the indicated player discards the specified number of elements, the other player can also discard the same number of elements. Then the protocol continues with the next round. Each player keeps track of the players that are still in the game and their set cardinalities, as these can be deduced from the actions of the algorithm.
It remains to show that the same number of elements is discarded from each side of the median in each round. Let be the number of elements in that are above the highest discarded element of , and be the number of elements in that are below the lowest discarded element of . By slightly abusing notation, let denote the number of players in the current round of the protocol (which may differ from the initial number). Specifically we prove the following.
Consider a round of the protocol and assume that and . The following inequalities for and hold: and .
Proof.
For , we start by including corresponding to the upper half elements in the set , for ; this contributes to the sum. In addition we add for each set of odd size, thus over all odd sets. Then we add for each set of even size that uses the lower median, thus over all even sets. This procedure overcounts by if the median is the highest discarded element of . Therefore, we have
Similarly, for , we start by including corresponding to the lower half elements in the set , for ; this contributes to the sum. In addition we add for each set of odd size, thus over all odd sets. Then we add for each set of even size that uses the upper median, thus over all even sets. This procedure overcounts by if the median is the lowest discarded element of . Therefore, we have
Since both and are integers, we have thereby proved that and , as required. ∎
Proof of Theorem 1.
By Lemma 2.2, all the elements discarded from are below the median (of the union), and all elements discarded from are above the median. Thus in each round, the protocol preserves the median and discards the same number of elements from each side of it. This proves the invariant of the protocol. Since the protocol takes rounds and the communication complexity of each round is , the overall communication complexity is , as claimed. ∎
3 Approximate selection
Let denote the total number of elements in . Here we consider the problem of finding an -mediocre element, where is a fixed constant. The protocol described in Subsection 2.1 immediately yields the following.
The deterministic communication complexity of finding an -mediocre element in , where and is a fixed constant, is .
Interestingly enough, this communication complexity can be brought down to a constant under slightly stronger assumptions: (i) and have no duplicates or common elements, and (ii) , for some constant ; and a somewhat relaxed requirement: at least one of the players returns an element to the process that has invoked his/her service; each element returned is -mediocre.
A natural protocol to consider would be to choose one of the median-finding protocols and execute a constant number of rounds from it. However, this seemingly promising idea does not appear to work. It is possible that one of the two sets, say , does not contain any desired elements, namely -mediocre for the given and so at the end of the modified protocol only contains desired elements (and not ). More importantly, the players apparently have no indication of which player is the lucky one. We therefore resort to a different idea of using quantiles (more precisely, a sampling technique with a similar effect).
Proof of Theorem 2.
We may assume, without loss of generality that and are powers of (in particular, is also a power of ). For Alice and Bob use the earlier -protocol for finding the median; we therefore subsequently assume that . In particular, since , we have . We further assume, without loss of generality that : this can be achieved by padding the smaller size set with the appropriate numbers of small elements and large elements as described below. In particular, the padding elements need also be distinct. (It is not assumed that the common size is a power of : since our protocol does not exactly halve the current set of each player at each round, such an assumption would be of no use.)
To illustrate the padding process for arbitrary set sizes, we may assume without loss of generality that the given input satisfies: . We need to pad Alice’s input with small elements and large elements. Alice and Bob replace their inputs by and , respectively; as a result, the elements they hold are now in the range . Then Alice pads her input with and . (Note that .) The resulting sets have the same size and consist of distinct elements in the range . By subtracting , the element(s) returned by the protocol are shifted back to the original range in the end (without explicitly mentioning it there).
and below denote the (new) padded sets (of size ). Set (recall that ) and . By the assumption we have
Let be the set consisting of the -th elements of , for (this set resembles the -th quantiles of ). Similarly, let be the set consisting of the -th elements of , for (this set resembles the -th quantiles of ). Note that . Since and consist of pairwise distinct elements, between any two elements in (or ), there are at least
elements. Represent each element in (and ) with bits; it follows that the elements in are pairwise distinct; similarly the elements in are pairwise distinct.
The protocol implements a binary-search strategy aimed at finding the median of . Note that . Alice maintains a set of elements that may still be the median quantile (initially ) and Bob maintains a set of elements that may still be the median quantile (initially ). The invariant will be maintained. At each round, Alice and Bob compute the medians of their current sets ( and , respectively). If or the protocol continues with Alice and Bob halving their input as in the median-finding protocol. Specifically, if the protocol discards the lower elements of and the upper elements of . The equality case is addressed below. Observe that the above comparison can be resolved by exchanging bits in each round.
If , and , we have , and the protocol discards the lower elements of and the upper elements of . Note that this is a slight but important deviation from the standard median-finding protocol; it is aimed at handling prefix equality by discarding possibly one fewer element by each player. With this choice, the median of remains the median of ; and the invariant is maintained. Since the sets the players hold are almost halved at each round, the protocol terminates in rounds, as specified below.
If , and , the protocol continues with each player halving his/her own current set accordingly. If , and , the protocol terminates with each player outputting his/her number ( and , respectively). Observe that in this case, the median of is or and it will be shown below, see (5), that both elements are -mediocre.
If and , the protocol terminates with the player that holds the smaller of and outputting that number. If and , the protocol terminates with each player outputting his/her number ( and , respectively). It will be shown below, see (5), that both elements are -mediocre.
Recall that . If and then
(2) |
Recall that the median of is in in the last round of the protocol. Since all elements are distinct, for and above, if , Inequality (2) implies
(3) |
Assume that the median of is ; then Alice returns . In addition, if , Bob also returns . Since is the median of , it is the -th smallest element of . As such (by construction): (i) is than at least
elements of ; and similarly, (ii) is than at least elements of . Note that the median of has rank and is the same as the median of the original union of the two sets. See Fig. 2.
Observe that which yields (recall that ). This implies
(4) |
Recall that if , Bob also returns and Inequality (3) applies.
As such, each output element is an -mediocre element of the original union of the two sets. The elements returned are or (or both). Alice may return and Bob may return to the processes that have invoked their service; the elements returned by the players could be different. Since , we have . The number of bits exchanged is in each of the rounds of the protocol. The overall communication complexity is , as claimed. ∎
4 Conclusion
An obvious question is whether the three-party communication complexity of median computation can be reduced to . A more general question is whether the -party communication complexity of median computation, , can be reduced to . We believe that the answers to both questions are in the negative. Another interesting question regarding the two-party communication complexity of approximate selection is whether the conditions in Theorem 2 can be relaxed.
References
- [1] A. V. Aho, J. E. Hopcroft, and J. D. Ullman, Data Structures and Algorithms, Addison–Wesley, Reading, Massachusetts, 1983.
- [2] M. Ajtai, J. Komlós, W. L. Steiger, and E. Szemerédi, Optimal parallel selection has complexity $O(\log \log N)$, Journal of Computer and System Sciences 38(1) (1989), 125–133.
- [3] A. Alexandrescu, Fast deterministic selection, Proceedings of the 16th International Symposium on Experimental Algorithms (SEA 2017), June 2017, London, pp. 24:1–24:19.
- [4] S. Baase, Computer Algorithms: Introduction to Design and Analysis, 2nd edition, Addison-Wesley, Reading, Massachusetts, 1988.
- [5] S. Battiato, D. Cantone, D. Catalano, G. Cincotti, and M. Hofri, An efficient algorithm for the approximate median selection problem, Proceedings of the 4th Italian Conference on Algorithms and Complexity (CIAC 2000), LNCS vol. 1767, Springer, 2000, pp. 226–238.
- [6] S. W. Bent and J. W. John, Finding the median requires $2n$ comparisons, Proceedings of the 17th Annual ACM Symposium on Theory of Computing (STOC 1985), ACM, 1985, pp. 213–216.
- [7] M. Blum, R. W. Floyd, V. Pratt, R. L. Rivest, and R. E. Tarjan, Time bounds for selection, Journal of Computer and System Sciences 7(4) (1973), 448–461.
- [8] K. Chen and A. Dumitrescu, Selection algorithms with small groups, International Journal of Foundations of Computer Science, to appear. A preliminary version in Proc. 29th International Symposium on Algorithms and Data Structures (WADS 2015), Victoria, Canada, August 2015, Vol. 9214 of LNCS, Springer, pp. 189–199. Preprint available at arXiv.org/abs/1409.3600.
- [9] F. Chin and H. F. Ting, An improved algorithm for finding the median distributively, Algorithmica 2 (1987), 235–249.
- [10] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, 3rd edition, MIT Press, Cambridge, 2009.
- [11] W. Cunto and J. I. Munro, Average case selection, Journal of ACM 36(2) (1989), 270–279.
- [12] S. Dasgupta, C. Papadimitriou, and U. Vazirani, Algorithms, McGraw-Hill, New York, 2008.
- [13] D. Dor, J. Håstad, S. Ulfberg, and U. Zwick, On lower bounds for selecting the median, SIAM Journal on Discrete Mathematics 14(3) (2001), 299–311.
- [14] D. Dor and U. Zwick, Finding the $\alpha n$-th largest element, Combinatorica 16(1) (1996), 41–58.
- [15] D. Dor and U. Zwick, Selecting the median, SIAM Journal on Computing 28(5) (1999), 1722–1758.
- [16] A. Dumitrescu, A selectable sloppy heap, Algorithms 12(3) (2019), 58; special issue on efficient data structures; doi:10.3390/a12030058.
- [17] A. Dumitrescu, Finding a mediocre player, Proc. 11th International Conference on Algorithms and Complexity (CIAC 2019), LNCS 11485, Springer, 2019, pp. 212–223.
- [18] S. Edelkamp and A. Weiß, Worst-case efficient sorting with QuickMergesort, Proc. 21st Workshop on Algorithm Engineering and Experiments (ALENEX 2019), pp. 1–14.
- [19] R. W. Floyd and R. L. Rivest, Expected time bounds for selection, Communications of ACM 18(3) (1975), 165–172.
- [20] F. Fussenegger and H. N. Gabow, A counting approach to lower bounds for selection problems, Journal of ACM 26(2) (1979), 227–238.
- [21] A. Hadian and M. Sobel, Selecting the $t$-th largest using binary errorless comparisons, Combinatorial Theory and Its Applications 4 (1969), 585–599.
- [22] C. A. R. Hoare, Algorithm 63 (PARTITION) and algorithm 65 (FIND), Communications of the ACM 4(7) (1961), 321–322.
- [23] L. Hyafil, Bounds for selection, SIAM Journal on Computing 5(1) (1976), 109–114.
- [24] J. W. John, A new lower bound for the set-partitioning problem, SIAM Journal on Computing 17(4) (1988), pp. 640–647.
- [25] H. Kaplan, L. Kozma, O. Zamir, and U. Zwick, Selection from heaps, row-sorted matrices and X+Y using soft heaps, Proc. 2nd Symposium on Simplicity in Algorithms (SOSA 2019), Open Access Series in Informatics, 2018, vol. 69, pp. 5:1–5:21.
- [26] D. G. Kirkpatrick, A unified lower bound for selection and set partitioning problems, Journal of ACM 28(1) (1981), 150–165.
- [27] J. Kleinberg and É. Tardos, Algorithm Design, Pearson & Addison–Wesley, Boston, Massachusetts, 2006.
- [28] D. E. Knuth, The Art of Computer Programming, Vol. 3: Sorting and Searching, 2nd edition, Addison–Wesley, Reading, Massachusetts, 1998.
- [29] E. Kushilevitz and N. Nisan, Communication Complexity, Cambridge University Press, New York, 1997.
- [30] http://www.cs.technion.ac.il/~eyalk/book.html
- [31] S. L. Mantzaris, On “an improved algorithm for finding the median distributively”[1], Algorithmica 10(6) (1993), 501–504.
- [32] M. Paterson, Progress in selection, Proceedings of the 5th Scandinavian Workshop on Algorithm Theory (SWAT 1996), LNCS vol. 1097, Springer, 1996, pp. 368–379.
- [33] A. Rao and A. Yehudayoff, Communication Complexity, https://homes.cs.washington.edu/~anuprao/pubs/book.pdf.
- [34] M. Rodeh, Finding the median distributively, Journal of Computer and System Sciences 24(2) (1982), 162–166.
- [35] A. Schönhage, M. Paterson, and N. Pippenger, Finding the median, Journal of Computer and System Sciences 13(2) (1976), 184–199.
- [36] A. C. Yao, Some complexity questions related to distributive computing (preliminary report), Proc. 11th Annual ACM Symposium on Theory of Computing (STOC 1979), ACM, 1979, pp. 209–213.
- [37] F. Yao, On lower bounds for selection problems, Technical report MAC TR-121, Massachusetts Institute of Technology, Cambridge, 1974.
- [38] C. K. Yap, New upper bounds for selection, Communications of the ACM 19(9) (1976), 501–508.