Mergesort trilemma

In the Quicksort-dilemma there were trade-offs between efficiency and adaptivity. In merging there are trade-offs between efficiency, adaptivity and buffer memory: one cannot have all three of full \(\mathcal{O}(N\log{N})\) efficiency, full \(\mathcal{O}(N)\) adaptivity and low memory.

Forgotten art

Let’s rehearse how a generic, efficient, potentially parallel Mergesort is organized. Naive implementations allocate a 100% buffer in the merge, copy the data to the buffer (100% moves), merge the two input streams back (100% comparisons and 100% moves) and deallocate the buffer. Obviously it is better to allocate the buffer once before recursing and deallocate it once done, but what about the 200% moves? Timsort (T. Peters (2002)) allocates not more than 50% buffer, copies not more than 50% out and merges 100% back, that is 150% moves. Sedgewick (1990) taught a Nocopy-Mergesort that only merges (100% moves) without copying out, by alternating between two memory regions of size 100% each.

Sedgewick teaches Nocopy-Mergesort with three loop-checks: two on the two input streams and one on the output loop; obviously two loop-checks on the two input streams are enough. Sedgewick explains how to get down to one loop-check on the output stream by using sentinels in a mutually-recursive Bitonic-Mergesort, which sorts the left half ascending and the right half descending (from left); this, however, is not stable. Using the symmetric sorting definition we can fix this by sorting the right half not descending from left but ascending from right, which makes the algorithm nicely symmetric (Bimesort). However, there are still disadvantages: this sentinel approach must move data in each merge, hence it cannot skip all merging for presorted data. Interestingly, the sentinel approach is not even necessary to get down to one loop-check: D. Knuth (1973) had taught 20 years earlier that one loop-check is sufficient if only that input sequence is checked for exhaustion from which the last element was merged. Combining this with Nocopy-Mergesort gives an efficient algorithm we name Knuthsort in honor of Donald Knuth. Note that Katajainen and Träff (1997) showed that the number of loop-checks can be further reduced to ½ by determining which of the two input sequences exhausts first and only checking that one. Be warned that Katajainen’s code reads beyond the last element; we fixed this and named the result Katasort. Note further that Katajainen’s optimization costs an extra comparison, hence we consider it tuned and prefer Knuthsort as the reference for efficient prior-art binary Mergesort in contiguous space.
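
To illustrate Knuth’s single exhaustion check, here is a minimal C sketch of such a merge; the function name and the choice of `double` keys are ours for illustration, not the greeNsort® code, and it assumes both runs are non-empty and ascending.

```c
#include <stddef.h>

/* Minimal sketch of a stable binary merge with one loop-check (Knuth):
   only the run that supplied the last merged element is checked for
   exhaustion. Assumes nl > 0, nr > 0 and ascending runs. */
static void knuth_merge(double *out, const double *l, size_t nl,
                        const double *r, size_t nr)
{
  size_t i = 0, j = 0, k = 0;
  for (;;) {
    if (r[j] < l[i]) {               /* strict '<' preserves stability */
      out[k++] = r[j++];
      if (j == nr) {                 /* right run exhausted            */
        while (i < nl) out[k++] = l[i++];
        return;
      }
    } else {
      out[k++] = l[i++];
      if (i == nl) {                 /* left run exhausted             */
        while (j < nr) out[k++] = r[j++];
        return;
      }
    }
  }
}
```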

For random data, Knuthsort (and Katasort) are much more efficient than Timsort. Conventional wisdom assumes that Timsort’s inefficient merge is needed to reduce the buffer to 50% and to achieve \(\mathcal{O}(N)\) adaptivity to presorted data: after the merge, Timsort has the data in the same memory region as before the merge, hence it is easy to skip copying-out and merging-back in case of presorted keys. However, conventional wisdom is wrong: we can have one of the two without compromising on efficiency, either buffer reduction or full adaptivity. Let’s start with the latter.

Full adaptivity

The alternating merge of Nocopy-Mergesort finally completes in the data memory, not in the buffer. However, for odd recursion depths this starts merging from the buffer memory, hence Sedgewick copies the data to the buffer memory before recursion. Our Omitsort no longer uses a pre-determined alternating merge; instead we let the merge function decide whether it merges (or whether it can omit the merge in case of presorted data): the merge function simply returns the location of the data, and the sort function checks whether after its two recursive sorts the data are in the same memory region; if not, it copies one half to the data region. For fully presorted data no moves are needed anymore, not even Sedgewick’s initial moves. Omitsort uses \(\mathcal{O}(N)\) comparisons for diagnosing presortedness as non-overlap of the two input sequences.
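
The following is a minimal sketch of this omit logic under our own simplifying assumptions (plain `double` keys, a buffer `b` of the same length as the data `a`, and the `knuth_merge` from the sketch above); the actual Omitsort code differs in detail.

```c
#include <stddef.h>
#include <string.h>

/* Sorts a[o..o+n) and returns a pointer to wherever the sorted run ends up
   (either in the data region 'a' or in the buffer region 'b' of equal size).
   Assumes the knuth_merge() sketched earlier. */
static double *omitsort_rec(double *a, double *b, size_t o, size_t n)
{
  if (n < 2) return a + o;                  /* single elements stay in 'a'   */
  size_t nl = n / 2, nr = n - nl;
  double *L = omitsort_rec(a, b, o, nl);
  double *R = omitsort_rec(a, b, o + nl, nr);
  if ((L == a + o) != (R == a + o + nl)) {  /* halves in different regions:  */
    if (L != a + o) { memcpy(a + o, L, nl * sizeof *L); L = a + o; }
    else { memcpy(a + o + nl, R, nr * sizeof *R); R = a + o + nl; }
  }                                         /* copy one half back to 'a'     */
  if (L[nl - 1] <= R[0])                    /* non-overlap: omit the merge   */
    return L;
  double *t = (L == a + o) ? b + o : a + o; /* merge into the other region   */
  knuth_merge(t, L, nl, R, nr);
  return t;
}
```

In this sketch the caller copies the result back once at the end if the top-level call returns a pointer into `b`; for fully presorted input every merge is omitted and no element is moved at all.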

For descending data, an ascending Omitsort cannot omit: each merge must do its work, hence the total cost is \(\mathcal{O}(N\log{N})\). Cheaper would be to do nothing and only reverse the total sequence after checking presortedness throughout the recursion. Our Octosort achieves this by no longer requiring a specific order during the merging: the resulting order is simply data-driven and returned by the recursive sort function. Only if the left and right recursive sorts return different directions is the desired (ascending) order enforced. For perfectly ascending and descending data there are no moves during the recursion; for descending data only a final reversal is needed. Hence Octosort is, like Timsort, \(\mathcal{O}(N)\) bi-adaptive to pre-sorted data, but unlike Timsort it is much more efficient for non-sorted data. Octosort can also be fully parallelized, while Timsort has inherently serial tasks.

Distance minimization

“For decades, the machine balance of compute speed to memory bandwidth has increased 15%–30% per year […] Projections for future machines only exacerbate the current data movement crisis” (Dongarra 2022).

We all have learned that sorting cost is \(\mathcal{O}(N\log{N})\) … for constant access costs in the RAM-model. We all know that the RAM-model is wrong and access costs are not constant: today’s memory hierarchies are deep (L1, L2, L3, RAM on the local socket, RAM on a foreign socket), and today’s virtualization and cloud techniques might even give us memory on different machines, in different LANs and across WAN networks.

The traditional answers to the data movement crisis are k-ary algorithms (which reduce the number of memory passes), block-access (which is only suitable for one of the cache layers) and cache-oblivious algorithms (which are complicated). greeNsort® takes a different perspective, the perspective of an Ordinal Machine Model: there is a cost for moving data over distance, but the exact cost of a move (or access) over distance \(d\) is unknown. It could be anything between \(\mathcal{O}(1)\), \(\mathcal{O}(\log{d})\) and \(\mathcal{O}(d)\). An example of the latter, linear move costs, would be sorting \(N\) cars parked in a row next to a road. Let’s assume another \(N\) free positions of buffer parking space. A buffer next to the \(N\) cars is the best we can hope for, hence merging \(N\) cars to the buffer space costs moving each car on average \(N\) positions on each recursion level, which gives us total move costs of \(\mathcal{O}(N^2\log_2{N})\) for all cars on all recursion levels. That’s expensive.

Note that even k-ary Mergesort for our cars would still be quadratic: \(\mathcal{O}(N^2\log_k{N})\). Contrast this to Quicksort, where the move distance is halved on each partitioning in the recursion, therefore the total move cost is \(\mathcal{O}(N^2)\). This is a little-appreciated reason why Quicksort performs robustly on many different machines. In other words, Quicksort zooms into local memory, while Mergesort keeps merging over global distances.
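
A short back-of-the-envelope calculation, assuming the strictly linear move costs of the car example, makes the contrast explicit: Mergesort moves each of the \(N\) cars a distance of order \(N\) on each of the \(\log_2{N}\) levels, whereas Quicksort halves the move distance on every level.

\[
\text{Mergesort: } \sum_{l=1}^{\log_2{N}} N \cdot \Theta(N) = \Theta\!\left(N^2\log_2{N}\right),
\qquad
\text{Quicksort: } \sum_{l=0}^{\log_2{N}} N \cdot \Theta\!\left(\frac{N}{2^{l}}\right) = \Theta\!\left(N^2\right).
\]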

Gapped-merging

A simple trick brings locality to Mergesort: for \(N\) elements of data take \(2N\) elements of memory and alternate merging between odd and even positions; this reduces the move distances as the algorithm divides deeper. Unfortunately gapped-merging comes with a couple of disadvantages: reading \(N\) elements actually scans \(2N\) elements of memory, hence the gapped GKnuthsort is slower than Knuthsort on current CPUs. Also, gapped-merging requires all elements to have the same size.
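
A hedged sketch of what gapped access looks like in code: the two input runs and the output each occupy every second slot (stride 2), so streaming over \(N\) logical elements touches about \(2N\) slots of memory. The layout and names are illustrative, not GKnuthsort’s actual code.

```c
#include <stddef.h>

/* Stable merge over gapped (stride-2) runs into a gapped output region. */
static void gapped_merge2(double *out,            /* gapped output, stride 2 */
                          const double *l, size_t nl,
                          const double *r, size_t nr)
{
  size_t i = 0, j = 0, k = 0;
  while (i < nl && j < nr) {
    if (r[2*j] < l[2*i]) out[2*k++] = r[2*j++];   /* '<' keeps it stable */
    else                 out[2*k++] = l[2*i++];
  }
  while (i < nl) out[2*k++] = l[2*i++];
  while (j < nr) out[2*k++] = r[2*j++];
}
```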

Buffer-merging

Locality requires starting the merging with local buffer gaps between the data. If we want a contiguous result of the sort with only two regions, data and buffer, this requires that we merge not only the data but also the buffer. Rehearse this: buffer becomes a first-class object of merging. Divide&Conquer with buffer and good locality requires Divide&Conquer of buffer. Once this is understood, it seems odd that for decades Mergesort only merged data but not buffer. Now let’s look at some ways to merge buffer. Let uppercase letters represent data and lowercase letters represent buffer.

If the result of buffer merging is one contiguous data region and one contiguous buffer region, we have to make a fundamental asymmetric choice: data left and buffer right (Ab) or data right and buffer left (aB). If we assume Ab before splitting and after merging, standard self-recursion implies replicating that pattern across all recursion levels, i.e. we merge from AbCd to ACbd. In order to merge A and C, a naive approach would first transport all input streams to the right, bdAC, and then merge back to ACbd; 100% extra moves are expensive (TKnuthsort). A more efficient approach would only relocate the streams in the left half to the gaps in the right half, dbCA, before merging to the left, ACbd. Note that the order of the streams has changed, hence care is needed to retain stability when merging (Crocosort with Knuth’s merge). Unfortunately, with 50% extra moves, this is still as inefficient as Timsort.

Symmetric merging

It is possible to get rid of any extra moves if we leverage symmetric mutual recursion: in the left branch we sort the data to the left (buffer to the right) and in the right branch we sort the data to the right (buffer to the left). What we obtain is the data in the outer regions and the buffer in the inner regions: the two inner buffer regions are contiguous without any extra moves. That’s nice, but even nicer is that not more than 50% buffer is needed: merging is done from the inner to the outer input streams such that the result is aligned at the outer border. This semi-inplace merging has the following adaptivity properties: for perfectly presorted data the number of comparisons and moves is automatically reduced to 50%; with a non-overlap test the number of comparisons can be reduced to 0%, but at least 50% of the data must be moved. Symmetric merging offers two variants: a symmetric variant that sorts one branch left-ascending and one branch right-ascending (Geckosort) and an asymmetric variant which uses the same order in both branches (Frogsort). Geckosort is 25% adaptive to ascending and 25% adaptive to descending keys. Frogsort is 50% adaptive to presorting in the implemented order. Note that FROG is an acronym of Flip Recursive Organized Gaps.7
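
The following is a minimal sketch (our own layout and names, not the greeNsort® source) of such a semi-inplace merge for a node whose result is aligned at the left (outer) border; it assumes the buffer gap is at least as large as the inner (right) run.

```c
#include <stddef.h>

/* Semi-inplace symmetric merge sketch for a left-aligned result.
   Assumed layout: x[0..nl) left run, x[nl..nl+nb) buffer gap (nb >= nr),
   x[nl+nb..nl+nb+nr) right run. The merge proceeds from the largest
   elements downwards into x[0..nl+nr); once the right run is exhausted,
   the remaining left elements are already in place, which is why
   presorted input needs only ~50% of the comparisons and moves. */
static void symmerge_left(double *x, size_t nl, size_t nb, size_t nr)
{
  double *r = x + nl + nb;            /* start of the (inner) right run   */
  size_t i = nl, j = nr, k = nl + nr; /* invariant: k == i + j            */
  while (j > 0) {
    if (i > 0 && x[i - 1] > r[j - 1])
      x[--k] = x[--i];
    else
      x[--k] = r[--j];                /* ties go to the right run: stable */
  }
}
```

Because the write position `k` always stays above the highest unread left index, no unread element is ever overwritten as long as the buffer gap is at least the size of the inner run.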

Frogsort0

The simple (and first) variant of Frogsort was Frogsort0 (2010): it operates on triplets of memory elements, one buffer element symmetrically in the middle between two data elements, i.e. AbC. Frogsort0 splits and merges these triplets. Rule: if there is an odd number of triplets, the surplus triplet goes to the outer side. The setup of the gapped elements can be done before the Split&Merge recursion.

Table 1: Sketch of Frogsort0
position  1  2  3  4  5  6  7  8  9
setup     A  b  C  D  e  F  G  h  I
merge     A  C  b  e  D  F
merge     A  C  D  F  b  e  h  G  I
merge     A  C  D  F  G  I  b  e  h

An odd number of elements can be handled by using one extreme dummy value (see the implementation of Frogsort0) or by a specific third recursion function that handles the outermost incomplete triplet (see the implementation of PFrogsort0).
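
A minimal sketch of the setup shown in Table 1 (assuming an even number of elements and `double` keys; odd counts would need the dummy value or the special-case recursion just mentioned):

```c
#include <stddef.h>

/* Frogsort0-style setup sketch: copy n input elements (n even) into a
   working array of 3*n/2 slots so that each triplet is data-buffer-data,
   i.e. A b C as in Table 1. The middle slot of each triplet stays buffer. */
static void frogsort0_setup(const double *in, double *work, size_t n)
{
  for (size_t t = 0; t < n / 2; ++t) {
    work[3*t]     = in[2*t];         /* A                        */
    /* work[3*t + 1] is left untouched as the buffer element b   */
    work[3*t + 2] = in[2*t + 1];     /* C                        */
  }
}
```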

Frogsort1

An alternative is splitting down to single elements, as in Frogsort1. Rule 1: if there is an odd number of elements, the surplus element goes to the outer side. Rule 2: the size of the buffer is \(N/2\) (integer division). The setup is done in an extra recursion before the Split&Merge recursion.

Table 2: Sketch of Frogsort1
position  1  2  3  4
top-2     A  b  C
top-1     A  C  b  D
top       A  C  D  b

Frogsort2

Textbook knowledge tells us that Divide&Conquer should be balanced in order to minimize operations. Symmetric merging allows reducing the buffer to much less than 50%: let the inner branch be \(p\%\), then symmetric merging needs only \(p\%\) buffer. Yes, this increases the maximum recursion depth and the total number of operations, but it reduces the total %RAM. Hence for the sustainable Footprint measure, we expect a U-shaped function of \(p\). greeNsort® implemented this as Frogsort2. For sorting doubles, surprisingly, there is not only an optimal \(p\) below 10% regarding Footprint, there is also an optimal \(p\) below 20% regarding speed. This is surprising given the usual space-speed trade-off expectation that less RAM implies longer runTime. One reason for this is that imbalanced merging makes for better branch prediction. With Frogsort2 we have a nice algorithm that can be tuned to specific hardware features using its parameter \(p\). Note that Frogsort1 is a special case of Frogsort2 at \(p=0.5\).
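
A hedged sketch of the resulting split rule (our own rounding choices, not the exact greeNsort® code): for a node of \(n\) elements, the inner branch and the buffer both get roughly a fraction \(p\).

```c
#include <stddef.h>

/* Frogsort2-style split sketch: the inner branch gets a fraction p of the
   n elements, and only that fraction of buffer is needed. Assumes n >= 2;
   the rounding and the clamping to at least one element are illustrative. */
static void frogsort2_split(size_t n, double p,
                            size_t *n_outer, size_t *n_inner, size_t *n_buf)
{
  size_t inner = (size_t)((double)n * p);
  if (inner < 1)     inner = 1;      /* keep at least one element inside  */
  if (inner > n - 1) inner = n - 1;  /* ...and at least one outside       */
  *n_inner = inner;
  *n_outer = n - inner;
  *n_buf   = inner;                  /* symmetric merging needs only ~p%  */
}
```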

Frogsort3 and 6

So far we have split the buffer between the two branches. An alternative is to share the buffer, i.e. first we send all the buffer down the left branch, and then we send it down the right branch. This allows using even less buffer. Frogsort3 does this until there is enough buffer at a branch to switch to Frogsort1; Frogsort6 does this until there is enough buffer at a branch to switch to Frogsort2. Frogsort1 is a special case of Frogsort3 with parameter \(p=0.5\). Frogsort6 has two parameters \(p3\) and \(p2\), and Frogsorts 1, 2 and 3 are special cases of Frogsort6.

Squidsort

Frogsort saves 50% of compares and moves for presorted data, but not for reverse-sorted data. Geckosort saves 25% of compares and moves for presorted and for reverse-sorted data. Tuning Frogsort with a non-overlap comparison for presorted data reduces all other compares to 0%. Combining Frogsort with the data-driven, lazily enforced order of Octosort gives Squidsort, which needs 0% comparisons and 50% moves for presorted and reverse-sorted data (see Squidsort1 and Squidsort2). Squidsort beats Timsort and related algorithms such as Peeksort and Powersort by Munro and Wild (2018), unless the data is extremely presorted.


  1. Gaps are crucial for stable sorting. As late as 2006, Bender, Farach-Colton, and Mosteiro (2006) published Librarysort, a Gapped Insertion Sort. Just as people leave gaps between books on bookshelves, gaps reduce insertion costs from \(\mathcal{O}(N^2)\) to amortized \(\mathcal{O}(N\log{N})\) in the RAM-model. While the usage of gaps in Librarysort is rather probabilistic, Frogsort uses gaps in a deterministic, ‘Organized’ way.↩︎