This paper studies stable merge sort algorithms, especially natural merge sorts. We will propose new strategies for the order in which merges are performed, and prove upper and lower bounds on the cost of several merge strategies. The first merge sort algorithm was proposed by von Neumann [15, p.159]: it works by splitting the input list into sorted sublists, initially possibly lists of length one, and then iteratively merging pairs of sorted lists, until the entire input is sorted. A sorting algorithm is stable if it preserves the relative order of elements which are not distinguished by the sort order. There are several methods of splitting the input into sorted sublists before starting the merging; a merge sort is called natural if it finds the sorted sublists by detecting consecutive runs of entries in the input which are already in sorted order. Natural merge sorts were first proposed by Knuth [15, p.160].
Like most sorting algorithms, the merge sort is comparison-based in that it works by comparing the relative order of pairs of entries in the input list. Information-theoretic considerations imply that any comparison-based sorting algorithm must make at least $\log(n!) \approx n \log n$ comparisons in the worst case. However, in many practical applications, the input is frequently already partially sorted. There are many adaptive sort algorithms which will detect this and run faster on inputs which are already partially sorted. Natural merge sorts are adaptive in this sense: they detect sorted sublists (called “runs”) in the input, and thereby reduce the cost of merging sublists. One very popular stable natural merge sort is the eponymous Timsort of Tim Peters. Timsort is extensively used, as it is included in Python, in the Java standard library, in GNU Octave, and in the Android operating system. Timsort has worst-case runtime $O(n \log n)$, but is designed to run substantially faster on inputs which are partially pre-sorted by using intelligent strategies to determine the order in which merges are performed.
There is extensive literature on adaptive sorts: e.g., for theoretical foundations see [20, 9, 25, 22] and for more applied investigations see [6, 12, 5, 30, 26]. The present paper will consider only stable, natural merge sorts. As exemplified by the wide deployment of Timsort, these are certainly an important class of adaptive sorts. We will consider the Timsort algorithm [24, 7], and related sorts due to Shivers and Auger-Nicaud-Pivoteau. We will also introduce new algorithms, the “2-merge sort” and the “$\alpha$-merge sort” for $\varphi < \alpha < 2$, where $\varphi$ is the golden ratio.
The main contribution of the present paper is the definition of new stable merge sort algorithms, called 2-merge and $\alpha$-merge sort. These have better worst-case run times than Timsort, are slightly easier to implement than Timsort, and perform better in our experiments.
We focus on natural merge sorts, since they are so widely used. However, our central contribution is analyzing merge strategies, and our results are applicable to any stable sorting algorithm that generates and merges runs, including patience sorting, melsort [28, 19], and split sorting.
All the merge sorts we consider will use the following framework. (See Algorithm 1.) The input $S$ is a list of $n$ elements, which without loss of generality may be assumed to be integers. The first logical stage of the algorithm identifies maximal length subsequences of consecutive entries which are in sorted order, either ascending or descending. The descending subsequences are reversed, and this partitions the input into “runs” $R_1, \ldots, R_m$ of entries sorted in non-decreasing order. The number of runs is $m$; the number of elements in a run $R_i$ is $|R_i|$. Thus $\sum_i |R_i| = n$. It is easy to see that these runs may be formed in linear time and with a linear number of comparisons.
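The first stage can be sketched as follows. This is a minimal illustration, not the implementation of any particular library: the name `natural_runs` is hypothetical, and treating only strictly descending sequences as descending runs (so that reversing them cannot reorder equal elements) is an assumption made here to keep the procedure stable.

```python
def natural_runs(a):
    """Reverse strictly descending runs in place and return run boundaries.

    Returns a list of half-open (start, end) index pairs. After the call,
    every slice a[start:end] is sorted in non-decreasing order.
    """
    runs = []
    n = len(a)
    i = 0
    while i < n:
        j = i + 1
        if j < n and a[j] < a[i]:
            # Strictly descending run; strictness keeps the reversal stable.
            while j < n and a[j] < a[j - 1]:
                j += 1
            a[i:j] = a[i:j][::-1]
        else:
            # Non-decreasing run.
            while j < n and a[j] >= a[j - 1]:
                j += 1
        runs.append((i, j))
        i = j
    return runs
```

Each element is inspected a constant number of times, so the sketch uses a linear number of comparisons, in line with the claim above.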
Merge sort algorithms process the runs in left-to-right order starting with $R_1$. This permits runs to be identified on-the-fly, only when needed. This means there is no need to allocate additional memory to store the runs. This also may help reduce cache misses. On the other hand, it means that the value $m$ is not known until the final run is formed; thus, the natural sort algorithms do not use $m$ except as a stopping condition.
The runs $R_i$ are called original runs. The second logical stage of the natural merge sort algorithms repeatedly merges runs in pairs to give longer and longer runs. (As already alluded to, the first and second logical stages are interleaved in practice.) Two runs $A$ and $B$ can be merged in linear time; indeed with only $|A|+|B|-1$ many comparisons and $|A|+|B|$ many movements of elements. The merge sort stops when all original runs have been identified and merged into a single run.
Our mathematical model for the run time of a merge sort algorithm is the sum, over all merges of pairs of runs $A$ and $B$, of $|A|+|B|$. We call this quantity the merge cost. In most situations, the run time of a natural sort algorithm can be linearly bounded in terms of its merge cost. Our main theorems are lower and upper bounds on the merge cost of several stable natural merge sorts. Note that if the $m$ original runs of an input of length $n$ are merged in a balanced fashion, using a binary tree of height $\lceil \log m \rceil$, then the total merge cost is at most $n \lceil \log m \rceil$. (We use $\log$ to denote logarithms base 2.) Using a balanced binary tree of merges gives a good worst-case merge cost, but it does not take into account savings that are available when runs have different lengths. (A different method of achieving merge cost $O(n \log m)$ is also known; like the binary tree method, it is not adaptive.) The goal is to find adaptive stable natural merge sorts which can effectively take advantage of different run lengths to reduce the merge cost, but which are guaranteed to never be much worse than the binary tree. Therefore, our preferred upper bounds on merge costs are stated in the form $c \cdot n \log m$ for some constant $c$, rather than in the form $c \cdot n \log n$.
The merge cost ignores the cost of forming the original runs $R_i$: this does not affect the asymptotic limit of the constant $c$.
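As a rough illustration of the balanced strategy mentioned above, the following sketch computes the merge cost when adjacent runs are paired off level by level. The function name `balanced_merge_cost` is hypothetical, and level-by-level pairing is just one simple way to realize a balanced binary tree of merges. With $n$ the total length and $m$ the number of runs, each level costs at most $n$ and there are $\lceil \log m \rceil$ levels.

```python
def balanced_merge_cost(lengths):
    """Merge cost when adjacent runs are paired off level by level."""
    cost = 0
    level = list(lengths)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            merged = level[i] + level[i + 1]
            cost += merged              # one merge costs |A| + |B|
            nxt.append(merged)
        if len(level) % 2:
            nxt.append(level[-1])       # odd run out is carried up unmerged
        level = nxt
    return cost
```

For four runs of length 1 the cost is 8, which equals $n \log m$; for run lengths 5, 3, 2 it is 18, below the bound $n \lceil \log m \rceil = 20$. The strategy never looks at the run lengths, which is why it is not adaptive.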
Algorithm 1 shows the framework for all the merge sort algorithms we discuss. This is similar to what Auger et al. call the “generic” algorithm. The input $S$ is a sequence of integers which is partitioned into monotone runs of consecutive members. The decreasing runs are inverted, so $S$ is expressed as a list of increasing runs $R_1, \ldots, R_m$ called “original runs”. The algorithm maintains a stack $\mathcal{Q}$ of runs $Q_1, \ldots, Q_\ell$, which have been formed from $R_1, \ldots, R_k$. Each time through the loop, it either pushes the next original run, $R_{k+1}$, onto the stack $\mathcal{Q}$, or it chooses a pair of adjacent runs $Q_i$ and $Q_{i+1}$ on the stack and merges them. The resulting run replaces $Q_i$ and $Q_{i+1}$ and becomes the new $Q_i$, and the length $\ell$ of the stack decreases by one. The entries of the runs $Q_i$ are stored in-place, overwriting the elements of the array which held the input $S$. Therefore, the stack needs to hold only the positions $p_i$ (for $i = 1, \ldots, \ell+1$) in the input array where the runs $Q_i$ start, and thereby implicitly the lengths $|Q_i|$ of the runs. We have $p_1 = 0$, pointing to the beginning of the input array, and for each $i$, we have $|Q_i| = p_{i+1} - p_i$. The unprocessed part of $S$ in the input array starts at position $p_{\ell+1} = p_\ell + |Q_\ell|$. If $p_{\ell+1} < n$, then it will be the starting position of the next original run pushed onto $\mathcal{Q}$.
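Since the merge cost depends only on run lengths, the loop just described can be simulated on lengths alone. The sketch below is an illustration under stated assumptions, not a transcription of Algorithm 1: `stack_merge_cost` and the `choose_merge` hook are hypothetical names, and merging the top two runs once the input is exhausted is an assumed default.

```python
def stack_merge_cost(lengths, choose_merge):
    """Simulate the stack framework on run lengths; return total merge cost.

    choose_merge(stack) returns an index i meaning "merge stack[i] with
    stack[i + 1]", or None meaning "push the next original run".
    """
    stack, cost = [], 0
    pending = list(lengths)[::-1]       # next original run is pending[-1]
    while pending or len(stack) > 1:
        i = choose_merge(stack) if len(stack) > 1 else None
        if i is None and pending:
            stack.append(pending.pop())
            continue
        if i is None:
            i = len(stack) - 2          # input exhausted: merge the top two
        merged = stack[i] + stack[i + 1]
        cost += merged                  # a merge costs the combined length
        stack[i:i + 2] = [merged]       # merged run replaces the pair
    return cost
```

For example, `stack_merge_cost([5, 3, 2], lambda s: None)` pushes every run first and then collapses the stack from the top, while `lambda s: len(s) - 2` merges eagerly after every push. Only adjacent stack entries can be merged, which is what keeps the sort stable.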
Algorithm 1 is called $(k_1, k_2)$-aware since its choice of what to do is based on just the lengths of the runs in the top $k_1$ members of the stack $\mathcal{Q}$, and since merges are only applied to runs in the top $k_2$ members of $\mathcal{Q}$. (The terminology “degree” has also been used instead of “aware”.) In all our applications, $k_1$ and $k_2$ are small numbers, so it is appropriate to store the runs in a stack. Usually $k_1 = k_2$, and we write “$k$-aware” instead of “$(k,k)$-aware”. Table 1 shows the awareness values for the algorithms considered in this paper. To improve readability, we use the letters $W, X, Y, Z$ to denote the top four runs on the stack, $Q_{\ell-3}, Q_{\ell-2}, Q_{\ell-1}, Q_\ell$ respectively (if they exist).
|Algorithm||Awareness|
|Timsort (original)||3-aware|
|Timsort (corrected)||(4,3)-aware|
|$\alpha$-stack sort||2-aware|
|Shivers sort||2-aware|
|$\alpha$-merge sort ($\varphi < \alpha < 2$)||3-aware|
In all the sorting algorithms we consider, the height of the stack will be small, namely $O(\log n)$. Since the stack needs only store the starting positions of the runs, the memory requirements for the stack are minimal. Another advantage of Algorithm 1 is that runs may be identified on the fly and that merges occur only near the top of the stack: this may help reduce cache misses.
For runs $A$ and $B$, their merge is the run obtained by stably merging $A$ and $B$. Since it takes only linear time to extract the runs $R_i$ from $S$, the computation time of Algorithm 1 is dominated by the time needed for merging runs. We shall use $|A|+|B|$ as the mathematical model for the runtime of a (stable) merge. The usual algorithm for merging $A$ and $B$ uses an auxiliary buffer to hold the smaller of $A$ and $B$, and then merges directly into the combined buffer. Timsort uses a variety of techniques to speed up merges, in particular “galloping”; this also still takes time proportional to $|A|+|B|$ in general. It is possible to perform (stable) merges in-place with no additional memory [16, 21, 13, 29, 10]; these algorithms also require time $\Theta(|A|+|B|)$.
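A stable merge of two adjacent runs with an auxiliary buffer can be sketched as below. This is a simplified illustration: the name `merge_adjacent` and the `key` parameter are hypothetical, and the sketch always buffers the left run rather than the smaller of the two, which is the refinement described above.

```python
def merge_adjacent(a, lo, mid, hi, key=lambda x: x):
    """Stably merge the sorted slices a[lo:mid] and a[mid:hi] in place."""
    left = a[lo:mid]                    # auxiliary copy of the left run
    i, j, k = 0, mid, lo
    while i < len(left) and j < hi:
        # Take from the right run only when strictly smaller, so that
        # equal elements keep their original left-before-right order.
        if key(a[j]) < key(left[i]):
            a[k] = a[j]
            j += 1
        else:
            a[k] = left[i]
            i += 1
        k += 1
    # Copy back any remaining left elements; remaining right elements
    # are already in their final positions.
    a[k:k + len(left) - i] = left[i:]
```

The write position `k` never overtakes the read position `j` in the right run, so no unread element is overwritten. The loop performs at most $|A|+|B|-1$ comparisons and moves each element a constant number of times.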
More complicated data structures can perform merges in sublinear time in some situations; see for instance [4, 11]. These methods do not seem to be useful in practical applications, and of course still have worst-case run time $\Theta(|A|+|B|)$.
If $A$ and $B$ have very unequal lengths, there are (stable) merge algorithms which use fewer than $\Theta(|A|+|B|)$ comparisons [14, 29, 10]. Namely, if $|A| \le |B|$, then it is possible to merge $A$ and $B$ in time $O(|A|+|B|)$ but using only $O(|A|(1+\log(|B|/|A|)))$ comparisons. Nonetheless, we feel that the cost $|A|+|B|$ is the best way to model the runtime of merge sort algorithms. Indeed, the main strategies for speeding up merge sort algorithms try to merge runs of approximately equal length as much as possible; thus $|A|$ and $|B|$ are very unequal only in special cases. Of course, all our upper bounds on merge cost are also upper bounds on the number of comparisons.
The merge cost of a merge sort algorithm on an input $S$ is the sum of $|A|+|B|$ taken over all merge operations of runs $A$ and $B$ performed. For a run $Q_i$ on the stack during the computation, the merge cost of $Q_i$ is the sum of $|A|+|B|$ taken over all merges used to combine runs that form $Q_i$.
Our definition of merge cost is motivated primarily by analyzing the run time of sequential algorithms. Nonetheless, the constructions may be applicable to distributed sorting algorithms, such as in MapReduce . A distributed sorting algorithm typically performs sorts on subsets of the input on multiple processors: each of these can take advantage of a faster sequential merge sort algorithm. Applications to distributed sorting are beyond the scope of the present paper however.
In later sections, the notation $w_{Q_i}$ is used to denote the merge cost of the $i$-th entry $Q_i$ on the stack at a given time. Here “$w$” stands for “weight”. The notation $|Q_i|$ denotes the length of the run $Q_i$. We use $m_{Q_i}$ to denote the number of original runs which were merged to form $Q_i$.
All of the optimizations used by Timsort mentioned above can be used equally well with any of the merge sort algorithms discussed in the present paper. In addition, they can be used with other sorting algorithms that generate and merge runs. These other algorithms include patience sort , melsort  (which is an extension of patience sort), the hybrid quicksort-melsort algorithm of , and split sort . The merge cost as defined above applies equally well to all these algorithms. Thus, it gives a runtime measurement which applies to a broad range of sort algorithms that incorporate merges and which is largely independent of which optimizations are used.
Algorithm 1, like all the algorithms we discuss, only merges adjacent elements, $Q_i$ and $Q_{i+1}$, on the stack. This is necessary for the sort to be stable: if $i < j < k$ and two non-adjacent runs $Q_i$ and $Q_k$ were merged, then we would not know how to order members occurring in both $Q_j$ and the merged run. The patience sort, melsort, and split sort can all readily be modified to be stable, and our results on merge costs can be applied to them.
Our merge strategies do not apply to non-stable merging, but Barbay and Navarro have given an optimal method, based on Huffman codes, of merging for non-stable merge sorts in which merged runs do not need to be adjacent.
The known worst-case upper and lower bounds on stable natural merge sorts are listed in Table 2. The table expresses bounds in the strongest forms known. Since $m \le n$, it is generally preferable to have upper bounds in terms of $n \log m$, and lower bounds in terms of $n \log n$.
|Algorithm||Upper bound||Lower bound|
|Timsort||[Theorem 3]|
|Shivers sort||[Theorem 10]|
|2-merge sort||[Theorem 15]||[Theorem 22]|
|-merge sort||[Theorem 14]||[Theorem 21]|
The main results of the paper are those listed in the final two lines of Table 2. Theorem 22 proves that the merge cost of -merge sort is at most , where and : these are very tight bounds, and the value for is optimal by Theorem 15. It is also substantially better than the worst-case merge cost for Timsort proved in Theorem 3. Similarly for , Theorem 22 proves an upper bound of . The values for are optimal by Theorem 14; however, our values for have unbounded limit and we conjecture this is not optimal.
We only analyze -merge sorts with . For , the -merge sorts improve on Timsort, by virtue of having better run time bounds and by being slightly easier to implement. In addition, they perform better in the experiments reported in Section 6. It is an open problem to extend our algorithms to the case of ; we expect this will require -aware algorithms with .
The outline of the paper is as follows. Section 2 describes Timsort, and proves the lower bound on its merge cost. Section 3 discusses the -stack sort algorithms, and gives lower bounds on their merge cost. Section 4 describes the Shivers sort, and gives a simplified proof of the upper bound of . Section 5 is the core of the paper and describes the new 2-merge sort and -merge sort. We first prove the lower bounds on their merge cost, and finally prove the corresponding upper bounds. Section 6 gives some experimental results on various kinds of randomly generated data. All these sections can be read independently of each other. The paper concludes with discussion of open problems.
2 Timsort lower bound
Algorithm 2 is the Timsort algorithm as defined by  improving on . Recall that are the top four elements on the stack . A command “Merge and ” creates a single run which replaces both and in the stack; at the same time, the current third member on the stack, , becomes the new second member on the stack and is now designated . Similarly, the current becomes the new , etc. Likewise, the command “Merge and ” merges the second and third elements at the top of ; those two elements are removed from and replaced by the result of the merge.
The proof in Auger et al.  did not compute the constant implicit in their proof of the upper bound of Theorem 2; but it is approximately equal to . The proofs in  also do not quantify the constants in the big-O notation, but they are comparable or slightly larger. We prove a corresponding lower bound.
The worst-case merge cost of the Timsort algorithm on inputs of length $n$ which decompose into $m$ original runs is $\ge (1.5 - o(1)) \cdot n \log n$. Hence it is also $\ge (1.5 - o(1)) \cdot n \log m$.
In other words, for any $\epsilon > 0$, there are inputs to Timsort with arbitrarily large values for $n$ (and $m$) so that Timsort has merge cost $> (1.5 - \epsilon) \cdot n \log n$. We conjecture that Theorem 2 is nearly optimal:
The merge cost of Timsort is bounded by .
Proof of Theorem 3.
We must define inputs that cause Timsort to take time close to $1.5\, n \log n$. As always, $n \ge 1$ is the length of the input $S$ to be sorted. We define $\mathcal{R}_{tim}(n)$ to be a sequence of run lengths so that $\mathcal{R}_{tim}(n)$ equals $\langle n_1, n_2, \ldots, n_m \rangle$ where each $n_i > 0$ and $\sum_{i=1}^m n_i = n$. Furthermore, we will have $m \le n \le 3m$, so that $\log n = \log m + O(1)$. The notation $\mathcal{R}_{tim}$ is reminiscent of $\mathcal{R}$, but $\mathcal{R}$ is a sequence of runs whereas $\mathcal{R}_{tim}$ is a sequence of run lengths. Since the merge cost of Timsort depends only on the lengths of the runs, it is more convenient to work directly with the sequence of run lengths.
The sequence $\mathcal{R}_{tim}(n)$, for $1 \le n$, is defined as follows. First, for $n \le 3$, $\mathcal{R}_{tim}(n)$ is the sequence $\langle n \rangle$, i.e., representing a single run of length $n$. Let $n' = \lfloor n/2 \rfloor$. For even $n \ge 4$, we have $n = 2n'$ and define $\mathcal{R}_{tim}(n)$ to be the concatenation of $\mathcal{R}_{tim}(n')$, $\mathcal{R}_{tim}(n'-1)$ and $\langle 1 \rangle$. For $n \ge 4$ odd, we have $n = 2n'+1$ and define $\mathcal{R}_{tim}(n)$ to be the concatenation of $\mathcal{R}_{tim}(n')$, $\mathcal{R}_{tim}(n'-1)$ and $\langle 2 \rangle$. (For purposes of this proof, we allow run lengths to equal 1. Strictly speaking, this cannot occur since all original runs will have length at least 2. This is unimportant for the proof however, as the run lengths could be doubled and the asymptotic analysis needed for the proof would be essentially unchanged.)
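The recursive construction can be generated directly. The sketch below is a reading of the definition with its missing parameters filled in as assumptions: a single run when $n \le 3$, and otherwise the sequences for $n' = \lfloor n/2 \rfloor$ and $n'-1$ followed by a final run of length 1 (even $n$) or 2 (odd $n$). The function name `r_tim` is hypothetical.

```python
def r_tim(n):
    """Run lengths of the Timsort lower-bound input of total length n.

    Assumed reading: a single run for n <= 3; otherwise the sequences
    for n' = n // 2 and n' - 1, followed by a run of length 1 or 2.
    """
    if n <= 3:
        return [n]
    half = n // 2
    last = 1 if n % 2 == 0 else 2
    return r_tim(half) + r_tim(half - 1) + [last]
```

Under this reading the run lengths always sum to $n$, since $n' + (n'-1) + 1 = 2n'$ and $n' + (n'-1) + 2 = 2n'+1$.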
We claim that for $n > 3$, Timsort operates with run lengths $\mathcal{R}_{tim}(n)$ as follows: The first phase processes the runs from $\mathcal{R}_{tim}(n')$ and merges them into a single run of length $n'$ which is the only element of the stack $\mathcal{Q}$. The second phase processes the runs from $\mathcal{R}_{tim}(n'-1)$ and merges them also into a single run of length $n'-1$; at this point the stack contains two runs, of lengths $n'$ and $n'-1$. Since $n'-1 < n'$, no further merge occurs immediately. Instead, the final run is loaded onto the stack: it has length $n''$ equal to either 1 or 2. Now $n' \le n'-1+n''$ and the test $|X| \le |Y|+|Z|$ on line 9 of Algorithm 2 is triggered, so Timsort merges the top two elements of the stack, and then the test $|Y| \le |Z|$ causes the merge of the final two elements of the stack.
Suppose that is the initial subsequence of a sequence of run lengths, and that Timsort is initially started with run lengths either (a) with the stack empty or (b) with the top element of a run of length and the second element of (if it exists) a run of length . Then Timsort will start by processing exactly the runs whose lengths are those of , merging them into a single run which becomes the new top element of . Timsort will do this without performing any merge of runs that were initially in and without (yet) processing any of the remaining runs in .
Claim 5 is proved by induction on . The base case, where , is trivial since with stable, Timsort immediately reads in the first run from . The case of uses the induction hypothesis twice, since starts off with followed by . The induction hypothesis applied to implies that the runs of are first processed and merged to become the top element of . The stack elements have lengths (if they exist), so the stack is now stable. Now the induction hypothesis for applies, so Timsort next loads and merges the runs of . Now the top stack elements have lengths and is again stable. Finally, the single run of length is loaded onto the stack. This triggers the test , so the top two elements are merged. Then the test is triggered, so the top two elements are again merged. Now the top elements of the stack (those which exist) are runs of length , and Claim 5 is proved.
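Let $c(n)$ be the merge cost of the Timsort algorithm on the sequence $\mathcal{R}_{tim}(n)$ of run lengths, and write $n' = \lfloor n/2 \rfloor$ and $n'' \in \{1, 2\}$ for the length of the final run. The two merges described at the end of the proof of Claim 5 have merge cost $(n'-1)+n''$ plus $n'+(n'-1)+n'' = n$. Therefore, for $n > 3$, $c(n)$ satisfies

$$c(n) = \begin{cases} c(n') + c(n'-1) + \frac{3}{2} n & \text{if } n \text{ is even,} \\ c(n') + c(n'-1) + \frac{3}{2} n + \frac{1}{2} & \text{if } n \text{ is odd.} \end{cases} \tag{1}$$

Also, $c(1) = c(2) = c(3) = 0$ since no merges are needed. Equation (1) can be summarized as

$$c(n) = c(\lfloor n/2 \rfloor) + c(\lfloor n/2 \rfloor - 1) + \tfrac{3}{2} n + \tfrac{1}{2} (n \bmod 2).$$

The function $n \mapsto \frac{3}{2} n + \frac{1}{2}(n \bmod 2)$ is strictly increasing. So, by induction, $c(n)$ is strictly increasing for $n \ge 3$. Hence $c(\lfloor n/2 \rfloor) \ge c(\lfloor n/2 \rfloor - 1)$, and thus $c(n) \ge 2 c(\lfloor n/2 \rfloor - 1) + \frac{3}{2} n$ for all $n > 3$.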
For , define . Since is nondecreasing, so is . Then
For all , .
We prove the claim by induction, namely by induction on that it holds for all . The base case is when and is trivial since the lower bound is negative and . For the induction step, the claim is known to hold for . Then, since ,
proving the claim.
3 The -stack sort
Auger-Nicaud-Pivoteau introduced the $\alpha$-stack sort as a 2-aware stable merge sort; it was inspired by Timsort and designed to be simpler to implement and to have a simpler analysis. (Their algorithm (e2) is the same as the $\alpha$-stack sort with $\alpha = 2$.) Let $\alpha > 1$ be a constant. The $\alpha$-stack sort is shown in Algorithm 3. It makes less effort than Timsort to optimize the order of merges: up until the run decomposition is exhausted, its only merge rule is that $Y$ and $Z$ are merged whenever $|Y| \le \alpha |Z|$. An upper bound on its runtime is given by the next theorem.
Theorem 7 ().
Fix $\alpha > 1$. The merge cost for the $\alpha$-stack sort is $O(n \log n)$.
The original analysis did not explicitly mention the constant implicit in this upper bound, but its proof establishes a constant equal to approximately $(1+\alpha)/\log \alpha$. For instance, for $\alpha = 2$, the merge cost is bounded by $(3+o(1))\, n \log n$. The constant is minimized at $\alpha \approx 3.591$, where it is approximately 2.489.
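The merge rule can be simulated on run lengths. The sketch below is an illustration under assumptions, not a transcription of Algorithm 3: the name `alpha_stack_merge_cost` is hypothetical, the rule is taken to be "merge the top two runs $Y$, $Z$ while $|Y| \le \alpha |Z|$", and the remaining stack is assumed to be collapsed from the top once the runs are exhausted.

```python
def alpha_stack_merge_cost(lengths, alpha=2.0):
    """Merge cost of an alpha-stack-style strategy on the given run lengths."""
    stack, cost = [], 0
    for r in lengths:
        stack.append(r)
        # Y = stack[-2], Z = stack[-1]; merge while |Y| <= alpha * |Z|.
        while len(stack) >= 2 and stack[-2] <= alpha * stack[-1]:
            z = stack.pop()
            y = stack.pop()
            cost += y + z
            stack.append(y + z)
    # Runs exhausted: collapse what is left, always merging the top two.
    while len(stack) >= 2:
        z = stack.pop()
        y = stack.pop()
        cost += y + z
        stack.append(y + z)
    return cost
```

On run lengths 100, 30, 8, 2, 1000 with $\alpha = 2$, no merge is triggered until the long final run arrives, and that run then takes part in every merge. This is the kind of input on which a 2-aware rule is far from optimal in terms of the number of runs.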
Let $\alpha > 1$. The worst-case merge cost of the $\alpha$-stack sort on inputs of length $n$ is $\ge (c_\alpha - o(1)) \cdot n \log n$, where $c_\alpha$ equals $\frac{\alpha+1}{(\alpha+1)\log(\alpha+1) - \alpha \log \alpha}$.
The proof of Theorem 8 is postponed until Theorem 14 proves a stronger lower bound for $\alpha$-merge sorts; the same construction works to prove both theorems. The value $c_\alpha$ is quite small, e.g., $c_2 \approx 1.089$; this is discussed more in Section 5.
The lower bound of Theorem 8 is not very strong since the constant is close to 1. In fact, since a binary tree of merges gives a merge cost of $n \lceil \log m \rceil$, it is more relevant to give upper bounds in terms of $n \log m$ instead of $n \log n$. The next theorem shows that the $\alpha$-stack sort can be very far from optimal in this respect.
Let . The worst-case merge cost of the -stack sort on inputs of length which decompose into original runs is .
In other words, for any , there are inputs with arbitrarily large values for and so that -stack sort has merge cost .
Let be the least integer such that . Let be the sequence of run lengths
describes runs whose lengths sum to , so . Since , the test on line 6 of Algorithm 3 is triggered only when the run of length is loaded onto the stack ; once this happens the runs are all merged in order from right-to-left. The total cost of the merges is which is certainly greater than . Indeed, that comes from the fact that the final run in is involved in merges. Since , the total merge cost is greater than , which is . ∎
4 The Shivers merge sort
The 2-aware Shivers sort , shown in Algorithm 4, is similar to the 2-stack merge sort, but with a modification that makes a surprising improvement in the bounds on its merge cost. Although never published, this algorithm was presented in 1999.
The only difference between the Shivers sort and the 2-stack sort is the test used to decide when to merge. Namely, line 6 tests $2^{\lfloor \log |Y| \rfloor} \le |Z|$ instead of $|Y| \le 2 \cdot |Z|$. Since $|Y|$ is rounded down to the nearest power of two, this is somewhat like an $\alpha$-sort with $\alpha$ varying dynamically in the range $[1, 2)$.
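The rounding test is easy to state with integer bit operations. The sketch below assumes the test compares $|Y|$, rounded down to a power of two, against $|Z|$, where $Y$ and $Z$ are the second and top runs on the stack; the function names are hypothetical, and the 2-stack test is included only for comparison.

```python
def shivers_should_merge(y_len, z_len):
    """Shivers-style test: |Y| rounded down to a power of two is <= |Z|."""
    rounded = 1 << (y_len.bit_length() - 1)   # largest power of two <= y_len
    return rounded <= z_len


def two_stack_should_merge(y_len, z_len):
    """2-stack-style test for comparison: |Y| <= 2 * |Z|."""
    return y_len <= 2 * z_len
```

For $|Y| = 8$ and $|Z| = 7$ the two tests disagree: the 2-stack test merges, while the rounded test does not, because 8 is already a power of two and exceeds 7. Both functions assume positive run lengths.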
The Shivers sort has the same undesirable lower bound as 2-sort in terms of :
The worst-case merge cost of the Shivers sort on inputs of length which decompose into original runs is .
This is identical to the proof of Theorem 9. We now let be the sequence of run lengths
and argue as before. ∎
Theorem 11 ().
The merge cost of Shivers sort is .
We present a proof which is simpler than that of . The proof of Theorem 11 assumes that at a given point in time, the stack has elements , and uses to denote the merge cost of . We continue to use the convention that denote if they exist.
Define to equal . Obviously, . The test on line 6 works to maintain the invariant that each or equivalently . Thus, for , we always have and . This condition can be momentarily violated for , i.e. if and , but then the Shivers sort immediately merges and .
As a side remark, since each for , since , and since , the stack height is . (In fact, a better analysis shows it is .)
When , a. says . Since can be less than or greater than , this neither implies, nor is implied by, b.
The lemma is proved by induction on the number of updates to the stack during the loop. Initially is empty, and a. and b. hold trivially. There are two induction cases to consider. The first case is when an original run is pushed onto . Since this run, namely , has never been merged, its weight is . So b. certainly holds. For the same reason and using the induction hypothesis, a. holds. The second case is when , so , and and are merged; here will decrease by 1. The merge cost of the combination of and equals , so we must establish two things:
, where .
, if .
By induction hypotheses and . Thus the lefthand sides of a. and b. are . As already discussed, , therefore condition b. implies that b. holds. And since , , so condition a. implies that a. also holds. This completes the proof of Claim 12.
Claim 12 implies that the total merge cost incurred at the end of the main loop is . Since and each , the total merge cost is .
We now upper bound the total merge cost incurred during the final loop on lines 10-12. When first reaching line 10, we have for all hence and for all . The final loop then performs merges from right to left. Each for participates in merge operations and participates in merges. The total merge cost of this is less than . Note that
where the last inequality follows by . Thus, the final loop incurs a merge cost , which is .
Therefore the total merge cost for the Shivers sort is bounded by . ∎
5 The 2-merge and -merge sorts
This section introduces our new merge sorting algorithms, called the “2-merge sort” and the “-merge sort”, where is a fixed parameter. These sorts are 3-aware, and this enables us to get algorithms with merge costs . The idea of the -merge sort is to combine the construction of the 2-stack sort, with the idea from Timsort of merging and instead of and whenever . But unlike the Timsort algorithm shown in Algorithm 2, we are able to use a 3-aware algorithm instead of a 4-aware algorithm. In addition, our merging rules are simpler, and our provable upper bounds are tighter. Indeed, our upper bounds for are of the form with . Although we conjecture Timsort has an upper bound of , we have not been able to prove it, and the multiplicative constant for any such bound must be at least 1.5 by Theorem 3.
Algorithms 5 and 6 show the 2-merge sort and $\alpha$-merge sort algorithms. Note that the 2-merge sort is almost, but not quite, the specialization of the $\alpha$-merge sort to the case $\alpha = 2$. The difference is that line 6 of the 2-merge sort has a simpler while test than the corresponding line in the $\alpha$-merge sort. As will be shown by the proof of Theorem 22, the fact that Algorithm 5 uses this simpler while test makes no difference to which merge operations are performed; in other words, it would be redundant to test the condition $|X| < 2|Y|$.
The 2-merge sort can also be compared to the $\alpha$-stack sort shown in Algorithm 3. The main difference is that the merge of $Y$ and $Z$ on line 7 of the $\alpha$-stack sort has been replaced by lines 7-11 of the 2-merge sort, which conditionally merge $Y$ with either $X$ or $Z$. For the 2-merge sort (and the $\alpha$-merge sort), the run $Y$ is never merged with $Z$ if it could instead be merged with a shorter $X$. The other, perhaps less crucial, difference is that the weak inequality test on line 6 in the $\alpha$-stack sort has been replaced with a strict inequality test on line 6 in the $\alpha$-merge sort. We have made this change since it seems to make the 2-merge sort more efficient, for instance when all original runs have the same length.
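The differences just described can be made concrete with a simulation on run lengths. The sketch below is an assumption-laden reading rather than a transcription of Algorithm 5: the name `two_merge_cost` is hypothetical, the while test is assumed to require at least three runs on the stack together with the strict inequality $|Y| < 2|Z|$, and the leftover stack is assumed to be collapsed from the top at the end.

```python
def two_merge_cost(lengths):
    """Merge cost of a 2-merge-style strategy on the given run lengths."""
    stack, cost = [], 0
    for r in lengths:
        stack.append(r)
        # X, Y, Z are the third-from-top, second-from-top and top runs.
        while len(stack) >= 3 and stack[-2] < 2 * stack[-1]:
            if stack[-3] < stack[-1]:
                i = len(stack) - 3      # X is shorter than Z: merge X and Y
            else:
                i = len(stack) - 2      # otherwise merge Y and Z
            merged = stack[i] + stack[i + 1]
            cost += merged
            stack[i:i + 2] = [merged]
    while len(stack) >= 2:              # input exhausted: collapse the stack
        z = stack.pop()
        y = stack.pop()
        cost += y + z
        stack.append(y + z)
    return cost
```

On run lengths 100, 30, 8, 2, 1000, this rule merges the short runs with each other before the long final run is touched, so the long run takes part in only one merge. Looking one run deeper into the stack is what makes this possible.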
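We will concentrate mainly on the cases for $\varphi < \alpha \le 2$ where $\varphi \approx 1.618$ is the golden ratio. Values for $\alpha > 2$ do not seem to give useful merge sorts; our upper bound proof does not work for $\alpha \le \varphi$.
Let $\alpha \ge 1$; the constant $c_\alpha$ is defined by

$$c_\alpha = \frac{\alpha+1}{(\alpha+1)\log(\alpha+1) - \alpha \log \alpha}.$$

For $\alpha = 2$, $c_2 = 3/\log(27/4) \approx 1.08897$. For $\alpha = \varphi$, $c_\varphi \approx 1.042298$. For $\alpha > 1$, $c_\alpha$ is strictly increasing as a function of $\alpha$. Thus, $1.042 < c_\alpha < 1.089$ when $\varphi < \alpha \le 2$.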
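The constant can be evaluated numerically. The sketch below assumes the closed form $c_\alpha = (\alpha+1)/((\alpha+1)\log(\alpha+1) - \alpha\log\alpha)$ with logarithms base 2; the function name `c_alpha` is hypothetical.

```python
from math import log2, sqrt


def c_alpha(alpha):
    """Evaluate (a+1) / ((a+1) log2(a+1) - a log2(a)) for a = alpha >= 1."""
    a1 = alpha + 1.0
    return a1 / (a1 * log2(a1) - alpha * log2(alpha))


phi = (1 + sqrt(5)) / 2                 # golden ratio
for a in (phi, 1.7, 1.8, 1.9, 2.0):
    print(f"alpha = {a:.3f}   c_alpha = {c_alpha(a):.5f}")
```

At $\alpha = 2$ the expression simplifies to $3/\log(27/4)$, since $3 \log 3 - 2 \log 2 = \log(27/4)$. The printed values increase with $\alpha$ and stay between roughly 1.04 and 1.09 on this range.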
The next four subsections give nearly matching upper and lower bounds for the worst-case running time of the 2-merge sort and the -merge sort for .
5.1 Lower bound for 2-merge sort and -merge sort
Fix . The worst-case merge cost of the -merge sort algorithm is .
The corresponding theorem for is:
The worst-case merge cost of the 2-merge sort algorithm is , where .
The proof of Theorem 14 also establishes Theorem 8, as the same lower bound construction works for both -stack sort and -merge sort. The only difference is that part d. of Claim 17 is used instead of part c. In addition, the proof of Theorem 14 also establishes Theorem 15; indeed, exactly the same proof applies verbatim, just uniformly replacing “” with “2”.
Proof of Theorem 14.
Fix . For , we define a sequence of run lengths that will establish the lower bound. Define to equal . For , set to be the sequence , containing a single run of length . For , define and . Thus is the least integer greater than . Similarly define and . These four values can be equivalently uniquely characterized as satisfying
for some . The sequence of run lengths is inductively defined to be the concatenation of , and .
Part a. of the claim is immediate from the definitions. Part b. is immediate from the equalities (3) since and . Part c. is similarly immediate from (4) since also . Part d. follows from (3) and . Part e. follows by (4), , and .
Let be as defined above.
The sums of the run lengths in is .
If , then the final run length in is .
Suppose that is the initial subsequence of a sequence of run lengths and that the -merge sort is initially started with run lengths and (a) with the stack empty or (b) with the top element of a run of length . Then the -merge sort will start by processing exactly the runs whose lengths are those of , merging them into a single run which becomes the new top element of . This will be done without merging any runs that were initially in and without (yet) processing any of the remaining runs in .
The property c. also holds for -stack sort.
Part a. is immediate from the definitions using induction on . Part b. is a consequence of Claim 16(d.) and the fact that the final entry of is a value for some . Part c. is proved by induction on , similarly to the proof of Claim 5. It is trivial for the base case . For , is the concatenation of . Applying the induction hypothesis to yields that these runs are initially merged into a single new run of length at the top of the stack. Then applying the induction hypothesis to shows that those runs are merged to become the top run on the stack. Since the last member of is a run of length , every intermediate member placed on the stack while merging the runs of has length . And, by Claim 16(c.), these cannot cause a merge with the run of length already in . Next, again by Claim 16(c.), the top two members of the stack are merged to form a run of length . Applying the induction hypothesis a third time, and arguing similarly with Claim 16(b.), gives that the runs of