Minmax-optimal list searching with O(log_2log_2 n) average cost

05/25/2021, by I. F. D. Oliveira, et al.

We find a searching method on ordered lists that surprisingly outperforms binary searching with respect to average query complexity while retaining minmax optimality. The method is shown to require O(log_2 log_2 n) queries on average while never exceeding ⌈log_2 n⌉ queries in the worst case, i.e. the minmax bound of binary searching. Our average-case results assume a uniform distribution hypothesis similar to that of previous authors, under which the O(log_2 log_2 n) expected query complexity of interpolation search is known to be optimal. Hence our method turns out to be optimal with respect to both minmax and average performance. We further provide robustness guarantees and perform several numerical experiments with both artificial and real data. Our results suggest that time savings range roughly from a constant factor of 10% to 50% to a logarithmic factor spanning orders of magnitude, depending on the metric considered.


1 Introduction

Given a sorted list x = (x_1, x_2, …, x_n) and a target value y, the problem of searching sorted lists is typically stated as:

(1)

where x is of size n with entries x_1 ≤ x_2 ≤ … ≤ x_n, and y is a scalar comparable with the entries of x. This problem is ubiquitous in computer science, with applications spanning several different fields of computer programming, engineering and mathematics. Variations of (1) include searching unbounded lists Bentley and Yao (1976), tables Knuth (1998), searching continuous functions for a zero Oliveira and Takahashi (2020), as well as the construction of insertion and deletion procedures in canonical data structures Bentley and Sedgewick (1997).

The standard approach to solve (1), commonly known as binary search Knuth (1998), begins with the full range of indices and updates lower and upper bounds a and b for the location of the desired index. At each step this is done by recursively probing the index defined by rounding

(2)   j_{1/2} ≡ (a + b)/2

arbitrarily to the nearest integer and, by comparing x_{j_{1/2}} with y, updating a and b accordingly: if x_{j_{1/2}} is smaller than y then the lower bound a is updated, if it is greater than y then the upper bound b is updated, and if they are equal then both a and b are updated. The algorithm terminates when the tolerance is equal to one. For convenience we display below the general structure of the binary searching algorithm as a while-loop; however, binary searching also admits for-loop formulations and other formulations that exploit computer architecture Cannizzo (2018); Schlegel et al. (2009) to improve computational speed. Here, a and b are initiated at the two ends of the list, and, in line (1), the probe is taken to be equal to the rounded midpoint (2).
 

Algorithm 0: The Bracketing Algorithm

Input: a sorted list x and a target value y

while (the tolerance is greater than one), do:

(1). Choose a probe index j between a and b and evaluate x_j;

(2). Update a and b according to the values of x_j and y;
Output: the located index
 

The key feature of binary search is its minmax optimality. That is, it requires at most

(3)   ⌈log_2 n⌉

queries to locate the desired index, while no other method can provide the same guarantee in fewer than ⌈log_2 n⌉ queries. This property is specifically of interest when the computational cost of one query is known to be much higher than the computation of the search procedure itself. This assumption is often made implicitly in the literature, and, for the sake of clarity, it is assumed henceforth.
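
To make this structure concrete, below is a minimal Python sketch of Algorithm 0 instantiated with the binary-search midpoint rule of (2); the 0-indexed convention and variable names are ours, not the paper's.

```python
def binary_search(x, target):
    """Bracketing search (Algorithm 0) with the midpoint probe of binary search.

    Returns an index j with x[j] == target, or None if the target is absent.
    Assumes x is sorted in nondecreasing order (0-indexed).
    """
    a, b = 0, len(x) - 1                 # lower and upper bounds of the bracket
    while b - a > 0:
        j = (a + b) // 2                 # midpoint, rounded down
        if x[j] < target:
            a = j + 1                    # target lies strictly to the right
        elif x[j] > target:
            b = j - 1                    # target lies strictly to the left
        else:
            return j                     # exact hit
    return a if 0 <= a < len(x) and x[a] == target else None
```

Each evaluation of an entry of x counts as one query; computations on the indices themselves are assumed to be negligible in comparison.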

While binary search is optimal with respect to the worst-case metric, interpolation search turns out to be a more efficient alternative with respect to expected query complexity if a uniform distribution is assumed, see Perl and Itai (1978); Perl and Reingold (1977); Yao and Yao (1976). Interpolation search is a bracketing algorithm with the probe in line (1) of Algorithm 0 defined as the linear interpolation between the points (a, x_a) and (b, x_b). More precisely, the probe is taken to be the integer closest to

(4)   j_f ≡ a + (b − a)(y − x_a)/(x_b − x_a)

that lies in between a and b. The key feature of interpolation search is that if the entries of the list and the target are sorted samples of a uniform distribution over a common interval, then interpolation search is optimal with respect to expected query complexity Yao and Yao (1976) and the expected number of queries to solve (1) is

(5)   log_2 log_2 n + O(1),

which considerably improves on the expected query complexity of binary search.
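
For comparison, a minimal Python sketch of interpolation search, i.e. Algorithm 0 with the probe of equation (4), handling ties and degenerate brackets in the simplest way (again our own illustrative code):

```python
def interpolation_search(x, target):
    """Bracketing search (Algorithm 0) with the linear-interpolation probe (4).

    Returns an index j with x[j] == target, or None if the target is absent.
    Assumes x is sorted in nondecreasing order (0-indexed).
    """
    a, b = 0, len(x) - 1
    while a <= b and x[a] <= target <= x[b]:
        if x[a] == x[b]:
            j = a                        # degenerate bracket: all entries equal
        else:
            # closest integer to the interpolation point, kept inside [a, b]
            j = a + round((b - a) * (target - x[a]) / (x[b] - x[a]))
            j = min(max(j, a), b)
        if x[j] < target:
            a = j + 1
        elif x[j] > target:
            b = j - 1
        else:
            return j
    return None
```

On sorted uniform samples the probe lands very close to the target's position, which is the intuition behind the log_2 log_2 n behaviour; on unfavourable inputs, however, the probe can move the bounds by as little as one position per query.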

Although interpolation search enjoys an improved average performance, the improvement comes at the cost of requiring up to n queries to terminate in the worst case. Furthermore, the guarantees on the expected query complexity of interpolation search do not hold if the distributional hypothesis is misspecified. Thus, choosing interpolation search over binary search may come at a high cost, since interpolation search may also require a full series of queries to terminate on average under different distributions. Is it possible to simultaneously enjoy the benefits of interpolation search at no cost to the minmax performance of binary search? And furthermore, is it possible to enjoy such benefits without trading off performance under misspecified conditions?

In this paper we answer the above questions affirmatively. To answer the first question we begin by identifying the necessary and sufficient conditions for a searching method to be minmax optimal. Then, we pinpoint one specific minmax method, which we name the Interpolation, Truncation and Projection Method, or simply the ITP Method. We show that this method attains an expected query complexity of O(log_2 log_2 n) queries under similar assumptions as those required by interpolation search; and, it requires no more iterations than the ⌈log_2 n⌉ upper bound of binary searching. Hence, it is optimal with respect to both minmax and average performance at no cost other than the computation of the method itself. To answer the second question, we find lower bounds on the average performance of binary searching under very broad distributional hypotheses and show that binary searching can never outperform the ITP method in average performance by any significant margin. Hence, opting for the ITP method instead of binary searching comes at (almost) no cost even if the distribution is misspecified.

It is worth pointing out that our findings bear close resemblance to those of Oliveira and Takahashi (2020) for the continuous version of Problem (1), i.e. searching for the zero of a continuous function. However, despite the resemblance, our findings here are new and do not stem from previously known results. In fact, the methods for analysing the discrete searching problem in this paper are much more closely related to those developed in Perl and Reingold (1977) than to those developed in the literature of numerical analysis. Perhaps more importantly, we believe that our findings here might be of greater significance and repercussion than previous results due to the fundamental role that Problem (1) and Algorithm 0 play in the field of computer science, serving as a basis for much of algorithmic theory and practice.

Paper Outline

The following section, entitled Main Results, is divided into three parts. The section begins by characterizing necessary and sufficient conditions for Algorithm 0 to be minmax optimal, putting forward results analogous to Theorem 2.1 of Oliveira and Takahashi (2020) which were previously unknown in the discrete case. Then, in Subsection 2.1 we describe our main contribution: the ITP method for searching sorted lists, with minmax and expected query complexity results in Theorem 2. These results show that under mild conditions the ITP method can attain an expected query complexity of the same order as interpolation search while retaining the minmax optimality of binary search. In Subsection 2.2, we calculate lower bounds on the expected query complexity of binary searching under very broad distributional assumptions, and as a consequence we find that our method cannot be outperformed on average by binary search by more than one or two iterations. Thus, we provide new robustness guarantees that cannot be provided by interpolation search. In Section 3 we perform extensive experiments on both artificial and real data, from which we find that the expected query complexity of the ITP method can be orders of magnitude lower than that of interpolation search and binary search alike. Finally, in Section 4 we summarize and discuss the relevance of our findings and point out applications and future directions of research.

2 Main Results

Given a sorted list and a target value , at each iteration of Algorithm define as and . Then,

Theorem 1

Algorithm 0 requires at most ⌈log_2 n⌉ iterations to terminate if and only if at each iteration we have

(6)

Given any instance of (1), we may calculate the maximum number of iterations required by any minmax strategy using equation (3). After the first iteration one is left with queries and so, from (3) we have that must be at most , thus can be at most . Combining this with the fact that is less than or equal to and it follows that as long as is chosen in such a way that both and are less than or equal to , then, from that step onward, Algorithm can still guarantee termination in iterations. Requiring that both and be less than or equal to is equivalent to enforcing . This proves Theorem 1 for iteration .

For higher values of the reasoning is very similar. After steps , Algorithm is left with iterations. Thus, on step , as long as is less than or equal to , Algorithm can guarantee termination in at most iterations. Thus, we find similarly that is guaranteed when , and, this completes the proof.

Theorem 1 identifies the class of strategies that, similar to binary searching, enjoy minmax optimality. In most situations the set of strategies that satisfy the conditions of Theorem 1 can be quite large. However, when is equal to for some then we will find that must be null for and thus the class naturally reduces to binary searching. In every other situation may be chosen by means of interpolation, randomization or any other technique as long as the distance of to remains within the ranges established by Theorem 1. Figure 1 depicts two search trees with that have depth . Both of these trees have the same minimal depth of binary searching, however they do not subdivide the nodes in half in each query as binary searching would. Given that is not a power of two, several such trees with depth exist.

Figure 1: Two searching trees with nodes and minimal depth of , neither of which corresponds to binary searching.

Before we proceed to display our main algorithm, we point out two variations of Theorem 1 that may be of interest in different circumstances, one less conservative and one more conservative. Both variations are motivated by the fact that minmax optimality alone does not avoid certain types of inefficiencies. The first type arises from the fact that (6) may, at times, be too restrictive and degenerate to bisection steps too early in a run. This may happen if (6) is initiated too small, or if Algorithm 0 unluckily makes too many “bad guesses”, producing for several iterations. To avoid this and produce a variation of (6) that is more “forgiving” of bad iterations, one may upper bound the maximum number of iterations by instead of , with . This is attained if and only if in each iteration we have

(7)

The second type of inefficiency that may be present is of the opposite nature: minmax optimality may allow for too much freedom. For example, it is possible that Algorithm 0, after a few iterations, reduces the bracket to a sufficiently small size that it could be tackled with a few binary steps. However, minmax optimality allows Algorithm 0 to waste the “spare iterations” produced at the beginning of the run. One way to avoid this is to require that, after each iteration, the new subproblem with would take no more iterations than binary search would, i.e. that at most queries would be used from step onward. This is obtained by enforcing

(8)

in every iteration instead of equation (6).

All three versions of (6) may be of interest in software development. If problem (1) is generated by a known distribution that allows for the construction of reliable estimators for the location of the solution, as exemplified in the next section, then perhaps the original form (6) might be chosen. If (6) is too restrictive, then the relaxation in (7) might be an appropriate alternative. In fact, allowing for as little as one additional iteration, equation (7) will encompass the entirety of the list in the first iterations. Finally, equation (8) might be preferred if the underlying distribution does not allow for the construction of reliable estimators for the location of the solution, or if the underlying distribution is unknown. In any case, the classes of methods identified here by (6) to (8) offer a rich collection of alternatives to traditional binary searching that simultaneously retain minmax optimality and allow for enough freedom to incorporate interpolation and/or randomized strategies. In the following subsection we will see that even the “unaltered” minmax optimality condition in (6) allows for an improved average performance under the standard uniform distribution hypothesis.

2.1 The ITP Method

Let and be two user-provided constants. (Footnote 1: notice that is defined here to be between and , whereas in Oliveira and Takahashi (2020) it is defined to be between and ; this difference is key and arises from the fact that in continuous settings one is typically interested in vanishing residuals that are less than or equal to , whereas in discrete scenarios is always greater than.) Now define and as

(9)

where and are as in (2) and (4) respectively. Also, define as

(10)

if and otherwise. Now define the minmax radius and interval as

(11)

Now, in each step define as the projection of onto , i.e.

(12)

The ITP method then takes to be equal to defined as the closest integer to that lies between and .
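
To illustrate the definitions above, the following Python sketch implements the interpolate–truncate–project structure of one full search. Since it is only an illustration, the truncation amount, the minmax radius and the default parameter values below are assumptions modeled on the continuous ITP method, not necessarily the exact forms of (9)–(12).

```python
import math

def itp_search(x, target, kappa1=0.1, kappa2=1.0, n0=1):
    """Illustrative interpolate-truncate-project search on a sorted list x.

    kappa1, kappa2 and the extra-iteration slack n0 are placeholder values,
    and the truncation amount `delta` and the minmax radius `r` are assumed
    forms in the spirit of (9)-(12), not the paper's exact formulas.
    Returns an index j with x[j] == target, or None if the target is absent.
    """
    a, b = 0, len(x) - 1
    n_max = math.ceil(math.log2(max(b - a, 1))) + n0    # iteration budget
    k = 0                                               # iteration counter
    while b - a > 0:
        j_half = (a + b) / 2                            # bisection point
        # Interpolation, as in equation (4).
        j_f = j_half if x[a] == x[b] else a + (b - a) * (target - x[a]) / (x[b] - x[a])
        # Truncation: pull the interpolation point towards the bisection point.
        sigma = 1 if j_half >= j_f else -1
        delta = kappa1 * (b - a) ** kappa2
        j_t = j_f + sigma * delta if delta <= abs(j_half - j_f) else j_half
        # Projection onto the minmax interval around the bisection point.
        r = max(2.0 ** (n_max - k - 1) - (b - a) / 2, 0.0)
        j_itp = j_t if abs(j_t - j_half) <= r else j_half - sigma * r
        # Round to the nearest admissible integer index and query it.
        j = min(max(round(j_itp), a), b)
        if x[j] < target:
            a = j + 1
        elif x[j] > target:
            b = j - 1
        else:
            return j
        k += 1
    return a if 0 <= a < len(x) and x[a] == target else None
```

The projection step is what preserves the worst-case behaviour: whenever the truncated interpolation point strays too far from the bisection point, the probe is pulled back to the boundary of the admissible interval, so the bracket keeps shrinking fast enough to respect the iteration budget.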

In the following theorem we will assume that the list is constructed by sorting independent samples of a uniformly distributed variable, and that the minmax interval around the bisection point in the first iteration is “not too small”, i.e. that its relative size is not much smaller than one. This avoids the collapsing of the minmax interval to a single point (in which case the ITP method behaves identically to binary searching), and also, as shown in the proof of Theorem 2, in combination with the other conditions it guarantees that with high probability a steady-state condition with super-linear convergence is reached within a few iterations.

Theorem 2

If is sufficiently large and is not too small, the number of iterations for Algorithm to terminate satisfies

(13)

for some constant that depends on and but not on .

The structure of the proof is as follows: We begin by analysing Algorithm for where is the closest integer to that lies between and , i.e. without the projection onto . We will see that for sufficiently large we have that produces an expected query complexity of the order of . Then, we include the projection step and verify that, if is not too small, then with high probability the minmax range will at least double in each iteration and in a few iterations the full interval will be encompassed by . After that point, Algorithm will behave as if there were no projection step, and thus, the same expected query complexity of is attained.

Before we proceed with the proof, we point out that no attempt is made here to obtain the tightest bounds nor to optimize our choice of or any other meta-parameter. Instead, whenever possible we opt for the simplest and shortest path to obtain our results, and aim overall for a proof that is accessible to a university-level advanced algorithms course.

In order to calculate the expected query complexity of Algorithm implemented with we first calculate the expected number of iterations that Algorithm requires to reduce an interval of length to a new interval with length less than or equal to . For this we will use a few facts. First notice that the distance between and can be upper-bounded by:

(14)

We refer to the first term as the estimation error, the second term as the truncation error, and the third term “+1” is the round-off error.

We say that an iteration is successful when , and unsuccessful when . Notice that if is built by sorting independent samples from a uniform distribution over , then the probability of an iteration with being successful is equal to the probability that is between and . Without loss of generality we may assume that , and thus the probability of a successful iteration is equal to the probability that . Now the index is equal to the number of entries of that satisfy , and since each entry is sampled from a uniform distribution over , then, in problem (1) the variable follows a binomial distribution with expected value of and with variance . During a run of Algorithm 0, given all the data collected up to iteration j, by using scaling arguments we find that the conditional distribution of will also follow a translated binomial between and , with mean and with variance . Thus, from (14) and Chebyshev’s inequality we find that:

(15)

And thus, since , the probability of an unsuccessful iteration vanishes with larger values of . We will denote by and the probabilities of successful and unsuccessful iterations respectively.

Now, from (15) we have that for large values of the estimation error is smaller than the truncation error with high probability. The same is true of the round-off error. Thus we deduce that

(16)

with high probability for large values of .

Now let us analyse two different scenarios: (i) when is near extremity or ; and (ii) when it is somewhere in the middle. Or, formally: (i) when or ; (ii) every other case. It is easy to see that in case (i), with high probability, one successful iteration will suffice to reduce to less than or equal to . This is a direct consequence of equations (16) and (15). Similarly, notice that case (ii) after one iteration will produce or with high probability. Thus, after two successful iterations case (ii) will reduce to less than or equal to . Hence, with high probability, it suffices to obtain two successful iterations in order to reduce to less than or equal to , and the expected number of iterations required to obtain two successes is given by the mean of a negative binomial variable, which, using the relation between the probabilities of successful and unsuccessful iterations, simplifies to two divided by the success probability. Thus we find that this expectation approaches 2 as the length of the interval goes to infinity, and hence for greater than or equal to some constant (that depends on and alone) we have less than or equal to . This implies that for large we have:

(17)

where is the expected number of iterations given that there are elements in . Thus, applying (17) recursively we find that , and repeating this process times we find . Thus, the value of such that is less than a constant will give us the expected query complexity of Algorithm 0 implemented with . This, of course, reduces to for some and that depend on and on but not on . This completes the deduction of the expected query complexity of Algorithm 0 implemented with .
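
The step above that counts iterations until two successes uses only a standard fact about independent Bernoulli trials; for reference, in our own notation, if each iteration is successful independently with probability p, the number of iterations T needed to collect two successes is a negative binomial variable with

```latex
\mathbb{E}[T] \;=\; \sum_{t \ge 2} t \binom{t-1}{1} p^{2} (1-p)^{t-2} \;=\; \frac{2}{p},
\qquad \text{so } \mathbb{E}[T] \to 2 \text{ as } p \to 1 .
```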

What is left now is to verify the effect of the projection step on the behaviour of Algorithm 0. We start by pointing out that for high values of , due to (15), Algorithm 0 implemented with generates successful iterations with high probability. The same is true for the projection of onto , since, if lies between and then so will the projection of onto . Thus, with high probability we are left with the smallest amongst the intervals and after each iteration. This implies that and, with a little algebra, we can show that the fraction of the interval covered by , which we will denote by , increases from iteration to iteration . Furthermore, if is not equal to , then it is the projection onto the minmax disk. Thus, ignoring rounding effects, we find that . In that case, since , we must have that , from which we derive that the fraction of the interval covered by is given by , and thus , since is greater than given that is not a power of two. Thus, with high probability the fraction of the interval covered by must at least double in each iteration if . Hence, if the fraction of the interval covered by is not too small, it will take only a few iterations until can assume any value within , and will thus coincide with from that iteration onward.

Theorem 2 shows that the ITP method is both minmax optimal and can attain an expected query complexity as low as , given that is not too small. This last condition, as mentioned earlier, can be dropped if minmax optimality is relaxed; in fact it suffices to allow for just one iteration more than and the “not too small” condition is satisfied. Also, it is worth mentioning that one may calculate the expected number of “gained” iterations per query and find that it is greater than or equal to one for sufficiently large . Thus, even though can collapse into binary searching after a few rounds of unsuccessful iterations, this will only happen with low probability, since in an average run the ITP method will typically accumulate “spare iterations” and can afford a few misses quite early in the run. Also, from (15) we may deduce that the first iterations have the highest probability of being successful, since these have the biggest intervals , and hence the iterations in which Algorithm 0 has fewer “spare iterations” are the ones less likely to blunder and produce unsuccessful/wasteful iterations. By the time it has narrowed down the search to smaller intervals, several “spare” iterations will be available, and thus it will take many more unsuccessful iterations for the method to degenerate to binary searching.

2.2 Robustness and Limits

It is well known that the expected query complexity of interpolation search is of the order of and that of binary search is of the order of under the uniform distribution assumption, i.e. interpolation search considerably outperforms binary search under the standard hypothesis. However, it is also well known that if the distribution is misspecified then the expected query complexity of interpolation search can reach up to queries, while binary search remains upper bounded by , i.e. interpolation search is considerably outperformed by binary search under misspecified conditions. In this section we verify whether the ITP method suffers from the same drawback or whether it is robust to such changes, i.e. can the ITP method be outperformed by binary searching with respect to average performance? (Footnote 2: in the continuous setting this question was answered in Corollary 2.2 of Oliveira and Takahashi (2020). There, since the bisection method has a fixed expected query complexity for any continuous distribution, the worst-case guarantees of the ITP method already ensure that its expected query complexity cannot be outperformed by the bisection method. However, unlike the continuous setting, the expected query complexity of binary searching over lists does depend on the underlying distribution.)

We will answer this question by analysing two large classes of distributions over instances of (1). The first class, which we will denote by , encompasses all distributions over instances of (1) that produce , with no restriction on how and are generated. The second class, denoted , encompasses a large collection of distributions over instances of (1) that do produce . In particular, the second class includes any distribution that neither limits nor favours any particular solution index, i.e. it assumes only that the solution can assume any value from to with uniform probability.

If Problem (1) is generated by a distribution from class , then, since the distribution does not produce , binary search must require at least iterations to terminate. To see this, first notice that , for any value of (whether odd or even). By recursion, we find that , which in turn is greater than . Hence, in order for to be less than or equal to , the number of iterations must satisfy . Thus

Corollary 3

If the distribution over instances of (1) is such that then the expected query complexity of binary searching satisfies:

(18)

The second class of distributions does allow for Problem (1) to admit a solution with . The class assumes nothing about how or is constructed other than the fact that the solution can assume any value within the range from 1 to n with uniform probability. (Footnote 3: this second constraint is added since otherwise it is easy to construct distributions that arbitrarily lower the expected query complexity of virtually any method. Taking binary search as an example, if the distribution trivially produces then the expected query complexity can be as low as one iteration. Thus, to exclude trivial cases and arbitrary biases, we assume that the solution is equally likely to assume any value between 1 and n.) In this second case it is useful to consider the graph constructed by placing the first index visited at the root and successively branching left with the indices probed in case of and branching right when . Figure 2 illustrates one such construction.

Figure 2: The binary search tree associated with Algorithm 0. Each node of the tree represents an index of the vector visited by Algorithm 0; the height of the tree represents the worst-case complexity of the searching strategy and the average depth of the tree represents the expected query complexity of the searching strategy.

The depth of the resulting tree measures the maximum number of iterations required for Algorithm 0 to terminate, and the average depth of the tree measures the average number of iterations. We will denote the average depth by and decompose it into two factors as for some . This way we find that

Corollary 4

If the distribution over instances of (1) is such that and is equally likely to assume any value between 1 and n, then the expected query complexity of binary searching is equal to , where , and satisfies

(19)

Corollary 4 is well known and its proof is thus omitted for simplicity. (Footnote 4: for completeness' sake we point to Steven Pigeon's proof and analysis of Corollary 4 in “Average node depth in a full tree”, published in 2013, which can be found at https://hbfs.wordpress.com/2013/05/14/average-node-depth-in-a-full-tree/.) Now, combining the above corollaries with the fact that the ITP method requires no more than iterations to terminate, we find that for the classes of distributions and described above:

Theorem 5

The expected query complexity of binary searching can outperform that of the ITP method by at most two iterations.

Hence, unlike interpolation search, even under very broadly misspecified conditions the ITP method cannot be outperformed by binary searching by any significant margin. Thus, Theorems 2 and 5 combined show that by choosing the ITP method over binary searching, not only will Algorithm 0 enjoy the benefits associated with interpolation search (the O(log_2 log_2 n) expected complexity under the uniform distribution assumption), but it will also not suffer the drawbacks associated with interpolation search (being outperformed by binary searching under misspecified conditions).
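
As a quick numerical illustration of Corollary 4 and Theorem 5, the following Python snippet (our own sketch, with an illustrative list size) estimates the expected number of binary-search queries when the solution index is uniform over the list; the estimate falls within roughly two queries of the worst-case bound ⌈log_2 n⌉, so even under these favourable conditions binary search cannot beat a method bounded by ⌈log_2 n⌉ by much more than that.

```python
import math
import random

def average_binary_depth(n, trials=20000, seed=0):
    """Estimate the expected number of binary-search queries when the
    solution index is uniform over the n positions of a sorted list."""
    rng = random.Random(seed)
    x = list(range(n))                    # any sorted list without repeats works
    total = 0
    for _ in range(trials):
        target = rng.randrange(n)         # uniform solution index
        a, b = 0, n - 1
        while True:
            j = (a + b) // 2
            total += 1                    # one query
            if x[j] < target:
                a = j + 1
            elif x[j] > target:
                b = j - 1
            else:
                break                     # target is always found: it is an entry of x
    return total / trials

n = 10**6
print(average_binary_depth(n), math.ceil(math.log2(n)))   # roughly 18.9 vs 20
```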

3 Experimental Results

In this section we empirically test the ITP strategy in three experiments and compare it with traditional binary searching and interpolation search. In the first experiment we test the minmax ITP method with varying values of and in order to find the values of and that minimize the expected number of queries under a uniform distribution assumption. The second and third experiments use the values of and found in the first experiment and compare the minmax ITP method, the relaxed version with , and interpolation search over both artificial and real data.

Artificial Data 1

In our first experiment, we search for the values of and that minimize the expected number of iterations required by the minmax ITP method over lists of size . As seen in the proof of Theorem 2, the behaviour of the ITP method quickly mimics the behaviour of which depends solely on the values of and . We performed Monte Carlo simulations by generating the list by sorting independent samples from a uniform distribution over . The target value was also sampled from a uniform distribution over . Table 1 shows the empirical average obtained by varying between and , and, varying between and .

7.44 7.35 7.68 8.02 8.40 8.76 9.09 9.38
7.39 7.37 7.79 8.30 8.85 9.35 9.78 10.16
7.31 7.43 8.08 8.87 9.57 10.17 10.66 11.07
7.20 7.63 8.66 9.63 10.46 11.13 11.68 12.13
7.08 8.05 9.38 10.54 11.51 12.25 12.81 13.32
6.95 8.64 10.26 11.66 12.73 13.52 14.09 14.51
6.87 9.35 11.40 13.00 14.09 14.73 15.19 15.57
6.93 10.27 12.78 14.42 15.23 15.60 15.93 16.30
7.21 11.45 14.39 14.95 15.94 16.77 17.24 17.51
7.51 13.03 14.55 16.53 17.49 17.69 17.69 17.69
Table 1: Average number of iterations of Monte Carlo simulations of searches in lists of size and sampled from a uniform distribution between and . Each column shows the performance of the ITP method with a given value of and each row a fixed value of .

As can be seen in Table 1, the empirical average was minimized at and . We highlighted the cell located on the first column and on the seventh row to show the number of iterations attained with these values of and which are significantly lower than . It should also be noted that the average number of iterations remains below for any value of and as predicted by Theorem 1.
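
A minimal sketch of the Monte Carlo procedure behind Table 1 is given below (our own harness; the list size, number of runs and the plain bisection probe shown are placeholders). A probe implementing the ITP rule of Subsection 2.1, evaluated over a grid of its two truncation parameters, plugs into the same harness.

```python
import random
import statistics

def bracketing_queries(x, target, probe):
    """Run Algorithm 0 with a user-supplied probe rule and count the queries.

    probe(a, b, x, target) must return an integer index inside [a, b]."""
    a, b, queries = 0, len(x) - 1, 0
    while b - a > 0:
        j = probe(a, b, x, target)
        queries += 1
        if x[j] < target:
            a = j + 1
        elif x[j] > target:
            b = j - 1
        else:
            break
    return queries

def bisection_probe(a, b, x, target):
    return (a + b) // 2

def monte_carlo_average(probe, n=10**4, runs=500, seed=0):
    """Average query count over lists of n sorted Uniform(0,1) samples,
    with the target also drawn uniformly from (0,1)."""
    rng = random.Random(seed)
    counts = []
    for _ in range(runs):
        x = sorted(rng.random() for _ in range(n))
        counts.append(bracketing_queries(x, rng.random(), probe))
    return statistics.mean(counts)

print(monte_carlo_average(bisection_probe))    # close to log_2(n) for bisection
```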

Artificial Data 2

In our second experiment we compare the empirical performance of two versions of the ITP method against interpolation search on lists of various sizes. The first version of the ITP method used is the minmax version analysed in Theorem 2 and the second version is the one that makes use of the relaxation . The average number of iterations required by each method was calculated by averaging the results of 500 Monte Carlo simulations on lists of sizes ranging from to , generated by sorting independent samples from predetermined distributions. The maximum number of iterations required by interpolation search is also reported for comparison with the worst case performance of the ITP method. In Figure 3 we plot the number of iterations as a function of the size of the list for lists generated from the uniform distribution, and Figure 4 shows the results under different distributions, namely when the elements of the list are samples of (i) a Gaussian distribution, (ii) an exponential distribution, (iii) a triangular distribution and (iv) a step function distribution (two overlapped uniform distributions over different intervals). The Gaussian in (i) was generated in each run with a random mean sampled from a uniform distribution and a fixed variance; the exponential in (ii) was constructed with a fixed rate parameter; the triangular distribution was obtained by taking the square root of a uniformly distributed variable; and the step function distribution in (iv) was obtained by sampling from a distribution that is uniform over two intervals, with one interval concentrating half of the cumulative probability and the other interval the remaining half.

Figure 3: The average of 500 Monte Carlo simulations comparing two versions of the ITP method and interpolation search for increasing values of on data with uniform distribution. In the background, the bar plot in gray displays the average number of iterations required by the minmax version of the ITP method. The lower curve in black shows the average number of iterations required by interpolation search; the dark blue curve above it the average number of iterations required by the ITP method with ; and, the light blue curve shows the maximum number of iterations used by interpolation search over all 500 runs.

In Figure 3, the background bar plot shows the behaviour of the minmax version of the ITP method. As predicted by Theorems 1 and 2, for values of in which (6) is not too small, i.e. most of the range, the growth of is linear with respect to , similar to interpolation search. The bar plot shows eighteen peaks, which correspond to the values of that are equal to for some ; thus, for those values of , the number of iterations is linear with and not . The relaxation of the minmax ITP method with , displayed in dark blue, reduces the peaks and obtains a curve that grows linearly with in its entirety, just as interpolation search, displayed right below it in black. The average performance of the ITP method with , when compared with interpolation search, attains an almost identical linear growth with respect to , exceeding the number of iterations required by interpolation search by approximately one iteration throughout the range investigated, i.e. the ITP method with has a nearly identical expected query complexity to interpolation search under the uniform distribution hypothesis. However, the worst case behaviour of interpolation search is upper-bounded only by the size of the list, i.e. both versions of the ITP method depicted have much better worst case guarantees than interpolation search. The light blue curve overarching the graph depicts the maximum number of iterations required by interpolation search in the 500 runs; as can be noticed, it exceeded for values of less than or equal to , which is approximately . Of course, with more runs, interpolation search would demand many more iterations in the worst case.

Figure 4: The average of 500 Monte Carlo simulations comparing two versions of the ITP method and interpolation search for increasing values of . The light blue dashed line provides for reference the value of . The lower curve in black shows the average number of iterations required by the ITP method with ; the dark blue curve the average number of iterations required by interpolation search; and, the green curve shows the maximum number of iterations used by interpolation search over all 500 runs.

When different distributions are considered, the robustness of the ITP method becomes an interesting feature. As can be seen in Figure 4, the average number of iterations of the ITP method with remained below under all distributions considered. Interpolation search performed much worse than for both the Gaussian distribution and the exponential distribution, and displayed an average performance that seems to be close to under the step function and the triangular distributions considered. The worst case behaviour of interpolation search proved to be much worse than under the four distinct distributions. As depicted in Figure 4, under these distributions and others still, interpolation search may require, in both the average and the worst case, several orders of magnitude more iterations than the ITP method. Thus, these experiments suggest that the ITP method is a much better alternative than both binary searching and interpolation search when worst case and average performance are both taken into account.

Real Data

In our third experiment we collect a wide variety of real data from publicly available lists of varying sizes and different origins, which are specified in the appendix. To name a few, we have included a list of full names of all public employees of the Brazilian government, a dataset of genome sizes of fungal species, atomic weights, zip codes and others. For each list we calculate the empirical average of the number of iterations required by both the ITP method with and interpolation search. In each run we sample a target between and with uniform probability and perform the search with both methods. Four of the twelve lists considered were composed of names rather than numbers, specifically the NASDAQ Acronyms, the English Dictionary, the Family Names and the Full Names. These were converted into numerical lists by taking a base-27 read of each character and sorting them accordingly. Other natural approaches that could be used are the ASCII standard conversion or even a Morse code mapping onto binary numbers. Clearly, the average performance of the ITP method is sensitive to this mapping and hence there is room for improvement; however, this goes beyond the scope of the paper and thus we opted to display only the results for the first approach considered, i.e. the base-27 conversion. Table 2 reports the empirical average of runs of the described procedures.
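
As an illustration, one possible realization of such a base-27 conversion is sketched below (our own choice of alphabet handling and padding; the paper does not specify these details).

```python
def base27_key(word, width=20):
    """Read a word as a base-27 numeral: 'a'..'z' map to 1..26 and anything
    else to 0, with the word padded (or truncated) to `width` characters so
    that shorter words sort before their extensions."""
    word = word.lower()[:width].ljust(width)
    value = 0
    for ch in word:
        digit = ord(ch) - ord('a') + 1 if 'a' <= ch <= 'z' else 0
        value = value * 27 + digit
    return value

names = ["oliveira", "takahashi", "smith", "jones"]
numeric_list = sorted(base27_key(name) for name in names)
```

Because the mapping preserves the alphabetical ordering of the words, searching the numeric list is equivalent to searching the original list of names.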

                          ITP Search         Interp Search
List                      mean     max       mean     max        ⌈log_2 n⌉    n
Thermodynamics Table      3.6      7         3.1      5          6            49
Atomic Weights            3.3      7         2.8      7          6            54
Fluid Dynamics Chart      5.8      11        5.8      11         10           600
Fibonacci Sequence        8.2      11        19.8     553        10           700
Genome Sizes              9.6      13        8.9      27         12           2352
NASDAQ Acronyms           10.6     15        28.8     1000       14           8203
Zip Codes                 10.5     18        9.8      69         17           81831
Family Names              16.5     18        90.0     1000       17           88799
English Dictionary        19.0     20        247.4    1000       19           370103
Full Names                20.6     21        751.3    1000       20           660276
Prime Numbers             7.2      21        6.0      10         20           664579
Harmonic Series           22.3     25        79.7     189        24
Central Tendency Metrics:
Mean
Median
Table 2: Average number of iterations required by the ITP method and interpolation search. The averages are taken over searches for a target sampled from a uniform distribution between the smallest and the largest entry of each list. We also report the empirical maximum number of iterations required by interpolation search over this sample. The simulation capped the count when more than 1000 iterations were required, and when 1000 iterations are reached we indicate with a sub-index the number of runs where this cap occurred. Also, since the ITP method was implemented with the relaxation of one extra iteration, in the column titled “max” under “ITP Search” we provide the worst-case bound of ⌈log_2 n⌉ + 1; the two rightmost columns report ⌈log_2 n⌉ and the list size n. On the bottom lines, the estimates of the mean and the median are displayed in units of ⌈log_2 n⌉.

Table 2 displays the average number of iterations required by the ITP method side by side with the number of iterations required by interpolation search. The ITP method seems to have a better performance when compared to interpolation search under both the average query complexity criterion and the worst case query complexity criterion. In all instances where interpolation search outperformed the ITP method on the average, it did so by less than iterations, and when the ITP method outperformed interpolation search it did so by up to iterations, which is more than 36 times the number of iterations required by the ITP method. On the average, the ITP method required fewer iterations than binary searching, whereas interpolation search required on average more than five times the number of iterations of binary searching across all twelve lists. We point out that even if outliers were excluded from the list (the two most difficult cases for interpolation search, for example), interpolation search still attains an empirical average worse than binary search, i.e. interpolation search does not seem to perform well on real data. One possible explanation for this might be the fact that real world data is not generated from uniformly distributed variables, and hence the robustness guarantees provided by the ITP method seem to be vital for outperforming binary search in real world applications. By analysing the median metric a similar conclusion is reached, i.e. interpolation search performs poorly and the ITP method outperforms binary search.

When considering the worst case performances, since the ITP method displayed here made use of the relaxation , it never required more than one iteration above ; yet, due to its lower expected query complexity, under favorable conditions it performed fewer than half the number of iterations of binary searching. Interpolation search, on the other hand, not only averaged higher iteration counts but also hit the iteration cap on several searches, and hence it seems to be the least interesting alternative amongst the three when both metrics are taken into consideration.

General Recommendations

Throughout our experiments (including an extensive number of experiments not reported here) the ITP method with the relaxation seems to give the best results overall. With the relaxation, the ITP method is less sensitive to the value of and also less sensitive to the choice of and . As a rule of thumb we recommend the ITP method with and and with the relaxation of ; however, if prior knowledge of the distribution over instances of (1) is available, or if a training set is available, then the values of and can be tested and chosen accordingly. (Footnote 5: by adopting a non-integer value for , the maximum number of iterations of Algorithm 0 is . Furthermore, the projection step of the ITP method then projects to the interior of the minmax interval instead of its border, avoiding numerical errors associated with edge cases. Our experiments were performed with ; however, for practitioners we recommend a non-integer value such as instead.)

In Experiment 2, both interpolation search and the ITP method were implemented under four misspecified conditions when the non-uniform distributions were used to generate . If prior knowledge of the underlying distribution is available, then the behaviour of Algorithm depicted in Figure 3 can be obtained for different distributions by implementing Algorithm on the transformation of by the cumulative distribution.
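
A small Python sketch of this remark, with a Gaussian cumulative distribution standing in for the assumed known generating distribution (function names and parameters are illustrative, and `interpolation_search` is any bracketing search with an interpolation-based probe):

```python
import math

def gaussian_cdf(v, mu=0.0, sigma=1.0):
    """Cumulative distribution function of a Gaussian, used to re-map the data
    so that the transformed entries are approximately uniform on (0, 1)."""
    return 0.5 * (1.0 + math.erf((v - mu) / (sigma * math.sqrt(2.0))))

def search_with_cdf(x, target, search, cdf):
    """Apply a search routine to the CDF-transformed list and target.

    A monotone transform preserves the ordering, so the returned index is the
    same, but interpolation-based probes now see roughly uniform data."""
    return search([cdf(v) for v in x], cdf(target))

# e.g. search_with_cdf(x, t, interpolation_search, gaussian_cdf)
```

In practice the transform would be applied lazily, only to the entries actually probed, so that the O(n) preprocessing shown here for simplicity is avoided.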

4 Discussion

In this paper we have identified a novel and yet simple searching method, which we refer to as the ITP method, that attains an expected query complexity of iterations and a worst case query complexity of ; i.e. it is optimal with respect to both average and worst case metrics. Furthermore, we also prove robustness guarantees which show that binary search cannot outperform the ITP method by more than a small additive margin even if the distributional hypothesis is misspecified. Hence, the ITP method enjoys the benefits of interpolation search (the improved expected query complexity of ) without the drawbacks associated with it (an expected query complexity worse than that of binary search when the distribution is misspecified). We performed extensive testing on artificial and real data and found that the ITP method can considerably outperform both the classical binary search method and interpolation search. We reach time savings that range roughly from 25% to 75%, depending on the experiment, when compared to binary searching, and an overall much better performance than interpolation search across experiments.

Binary searching is a fundamental tool in the field of computer science and has continually been the method of choice in applications, specifically due to its minmax optimality. Our results show that this preference for binary search, or alternatively for interpolation search, has often been an inefficient one. The improvements highlighted here have both practical and theoretical implications that directly translate to significant time savings, specifically when the cost of a query is much greater than the time to compute the procedure itself. In short, the ITP method is our recommended improvement to the traditional approach. However, the minmax class of methods identified here, which is largely unexplored, is potentially a more significant contribution that may lead to further improvements and the identification of even more efficient methods.

Future work

The problems of searching sorted tables and other multidimensional variants are natural instances that may benefit from results equivalent to the ones developed here. Another relatively unexplored variation, studied in Bentley and Yao (1976), is searching through infinite lists. Also, assuming multiple instances of (1) to be solved sequentially and generated under one common distribution, one may ask how to adapt and improve the solver in between resolutions to obtain an adaptive/self-improving method. Finally, the cost of one query is typically assumed to be significantly greater than that of the computation of the searching procedure itself; several interesting questions arise when this assumption is modified.

Appendix A Online material

Table 3 contains the sources of the twelve lists used in the second experiment. The texts were converted into numerals as explained in the end of Section 3 and any additional symbols such as “*.!;” and others were ignored.

NASDAQ Acronyms
ftp://ftp.nasdaqtrader.com/symboldirectory
Prime Numbers
(self generated)
Atomic Weights
https://www.qmul.ac.uk/sbcs/iupac/AtWt/
Zip Codes
http://federalgovernmentzipcodes.us
Fluid Dynamics Chart
https://engineering.purdue.edu/~wassgren/notes/CompressibleFlowTables.xls
Genome Sizes
http://www.zbi.ee/fungal-genomesize/index.php?q
Fibonacci Sequence
(self generated)
Thermodynamics Table
https://www.ohio.edu/mechanical/thermo/property_tables/H2O/H2O_TempSat.xls
English Dictionary
https://github.com/dwyl/english-words
Family Names
https://www.census.gov/topics/population/genealogy/data/2010_surnames.html
Harmonic Series
(self generated)
Full Names
http://www.portaltransparencia.gov.br/servidores
Table 3: The source of the data collected on the fifth of June of 2019.

Several of the files found in the above links contain multiple columns, specifically the fluid dynamics chart, the genome sizes, the atomic weights and the thermodynamics table. When this is the case we selected one column arbitrarily and performed all simulations on the selected column.

References

  • J. L. Bentley and R. Sedgewick (1997) Fast algorithms for sorting and searching strings. In SODA '97: Proceedings of the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 360–369.
  • J. L. Bentley and A. C. Yao (1976) An almost optimal algorithm for unbounded searching. Information Processing Letters 5 (3), pp. 82–87.
  • F. Cannizzo (2018) Fast and vectorizable alternative to binary search in O(1) applicable to a wide domain of sorted arrays of floating point numbers. Journal of Parallel and Distributed Computing 113.
  • D. E. Knuth (1998) The Art of Computer Programming, Vol. 3: Sorting and Searching.
  • I. F. D. Oliveira and R. H. C. Takahashi (2020) An enhancement of the bisection method average performance preserving minmax optimality. ACM Transactions on Mathematical Software.
  • Y. Perl and A. Itai (1978) Interpolation search—a log log N search. Communications of the ACM.
  • Y. Perl and E. M. Reingold (1977) Understanding the complexity of interpolation search. Information Processing Letters 6 (6), pp. 219–222.
  • B. Schlegel, R. Gemulla, and W. Lehner (2009) K-ary search on modern processors. In Proceedings of the Fifth International Workshop on Data Management on New Hardware, pp. 52–60.
  • A. C. Yao and F. F. Yao (1976) The complexity of searching an ordered random table. In Proceedings of the Seventeenth Annual Symposium on Foundations of Computer Science, pp. 173–177.