Further algorithms

Low-memory merging

In-place merging algorithms have received enormous academic attention. Several in-place Mergesorts have been known for decades but were too slow in practice. In 2008 (or earlier) Sedgewick coined the term “holy sorting grail” for a practically usable stable \(\mathcal{O}(N\log{N})\) in-place Mergesort. Astrelin (2013) published Grailsort, an implementation of Huang and Langston (1992), which was not dramatically slower than other Mergesorts and hence deserved its name. Since 2014 Sedgewick's redefinition of the “holy sorting grail” additionally requires an \(\mathcal{O}(N)\) best case. Anyhow, in-place merging is overrated: Astrelin also published Sqrtsort, which is faster than Grailsort and needs only a practically negligible \(\mathcal{O}(\sqrt{N})\) buffer. The greeNsort® algorithm Walksort also requires only an \(\mathcal{O}(\sqrt{N})\) buffer and is faster still, and its equally fast variant Jumpsort uses relocation-moves for distance minimization (like Crocosort).
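To illustrate the role of the merge buffer, here is a minimal sketch (not greeNsort® or Astrelin code) of the classic technique of merging two adjacent runs with extra buffer only for the left run; Sqrtsort, Walksort and Jumpsort push this idea further and get along with roughly \(\mathcal{O}(\sqrt{N})\) buffer elements by organizing the merge block-wise.

```c
#include <stddef.h>
#include <string.h>

/* Merge the adjacent runs a[0..nl) and a[nl..nl+nr) in place,
 * using 'buf' (capacity nl) as temporary storage for the left run only.
 * The '<=' comparison keeps the merge stable. */
static void merge_left_buffer(double *a, size_t nl, size_t nr, double *buf)
{
  memcpy(buf, a, nl * sizeof(double));   /* evacuate left run to the buffer */
  size_t i = 0, j = nl, k = 0;           /* i: buffer, j: right run, k: output */
  while (i < nl && j < nl + nr)
    a[k++] = (buf[i] <= a[j]) ? buf[i++] : a[j++];
  while (i < nl)                         /* flush leftover left elements;   */
    a[k++] = buf[i++];                   /* leftover right elements are already in place */
}
```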

Regarding speed and energy for sorting random data, these \(\mathcal{O}(\sqrt{N})\)-buffer algorithms are inferior to Frogsort and Squidsort. Regarding Footprint, the ranking for sorting random doubles is Frogsort2 < Squidsort2 < Frogsort3 < Walksort < Jumpsort < Frogsort1 < Squidsort1 < Sqrtsort < Grailsort < Knuthsort. For pre-sorted data the ranking of Walksort and Jumpsort improves: they are quite adaptive.

Partition&Pool

Some of the greeNsort® learnings can be transferred from Split&Merge to Partition&Pool algorithms. It is possible to turn Zacksort into the stable Kiwisort algorithm using 100% distant buffer. By partitioning not only the data but also 100% local buffer, we obtain the distance-reducing Swansort. For a very special use case it is even possible to reduce the buffer to 50%, see Storksort.
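As a generic illustration of the building block behind such stable buffered partitioning (a sketch only, not the actual Kiwisort or Swansort code), a partitioning pass can keep elements at most as large as the pivot in order in the data region and pool the larger elements, also in order, into the buffer:

```c
#include <stddef.h>
#include <string.h>

/* Stable partitioning with a 100% buffer: elements <= pivot stay in order
 * at the front of 'a', elements > pivot are written in order to 'buf'
 * (capacity n) and then pooled back behind the left partition.
 * Returns the split position; both partitions can be recursed on. */
static size_t stable_partition_pool(double *a, size_t n, double pivot, double *buf)
{
  size_t nl = 0, nr = 0;
  for (size_t i = 0; i < n; i++) {
    if (a[i] <= pivot)
      a[nl++] = a[i];      /* left partition, relative order preserved */
    else
      buf[nr++] = a[i];    /* right partition, relative order preserved */
  }
  memcpy(a + nl, buf, nr * sizeof(double));
  return nl;
}
```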

Size-varying

A popular method for sorting elements of varying size - such as strings, bigints or arbitrary objects with a key field - is indirect sorting of pointers to the elements using Quicksort, or better, one of Zacksort, Zucksort or Ducksort. However, this is inefficient because it incurs heavy random access (see UZacksort); it is even more inefficient if the sort is stabilized by breaking ties on pointer address values (see WQuicksort2). An alternative is sorting the size-varying elements directly using a buffer. Splitting can be done either via helper pointers to the elements or by splitting directly at null-terminators (for the latter see VKnuthsort and VFrogsort1). These direct sorts are more efficient than the indirect Quicksorts. Only if there are many string duplicates does the unstable UZacksort benefit from its \(\mathcal{O}(N\log{D})\) early termination. Hence it seems promising to implement direct stable Partition&Pool algorithms for size-varying data (VKiwisort and VSwansort, not yet implemented).
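A minimal sketch of the direct approach (for illustration only, not the actual VKnuthsort or VFrogsort1 code): two adjacent runs of null-terminated strings stored back-to-back in a character buffer are merged into a second buffer, with no pointer array and with mostly sequential access.

```c
#include <stddef.h>
#include <string.h>

/* Stably merge run1 = src[0..n1) and run2 = src[n1..n1+n2), both sequences
 * of '\0'-terminated strings, into dst (which must hold n1+n2 chars).
 * Run boundaries are found by scanning over the terminators. */
static void vmerge(const char *src, size_t n1, size_t n2, char *dst)
{
  const char *p = src,      *q  = src + n1;
  const char *pe = src + n1, *qe = src + n1 + n2;
  while (p < pe && q < qe) {
    const char *s = (strcmp(p, q) <= 0) ? p : q;  /* '<=' keeps stability */
    size_t len = strlen(s) + 1;                   /* include terminator   */
    memcpy(dst, s, len);
    dst += len;
    if (s == p) p += len; else q += len;
  }
  if (p < pe) memcpy(dst, p, (size_t)(pe - p));   /* copy leftovers */
  if (q < qe) memcpy(dst, q, (size_t)(qe - q));
}
```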

Parallel

Doing a task in parallel can save energy; Intel calls this “race to idle”. All Divide&Conquer algorithms are easily implemented with parallel branches (see PQuicksort2, PQuicksort2B, PDucksort, PDucksortB). However, the partitioning of the Quicksort family is difficult to parallelize and involves trade-offs. By contrast, parallelizing binary merges is relatively straightforward. greeNsort® uses a method that allows parallelization over an arbitrary number of processes, not only powers of two, see PKnuthsort, PFrogsort0, PFrogsort1, PFrogsort2, PFrogsort3, PVKnuthsort, PVFrogsort1. As expected, parallel execution reduces not only runTime but also Energy (although to a lesser extent).
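One common technique for parallelizing a binary merge over an arbitrary number of workers is co-ranking (merge-path splitting): each worker binary-searches the split points of its output slice and then merges that slice independently. The sketch below uses OpenMP and is only an illustration of this idea, not necessarily the exact greeNsort® implementation.

```c
#include <stddef.h>

/* How many of the first k outputs of the stable merge of a[0..m) and
 * b[0..n) come from a (ties go to a)?  Found by binary search. */
static size_t corank(size_t k, const double *a, size_t m,
                     const double *b, size_t n)
{
  size_t lo = (k > n) ? k - n : 0;
  size_t hi = (k < m) ? k : m;
  while (lo < hi) {
    size_t i = lo + (hi - lo) / 2;   /* candidate split: i from a, k-i from b */
    if (a[i] <= b[k - i - 1])
      lo = i + 1;                    /* a[i] must also be among the first k   */
    else
      hi = i;
  }
  return lo;
}

/* Merge sorted a[0..m) and b[0..n) into out[0..m+n) in p independent
 * slices, one per thread; works for any p, not only powers of two. */
static void parallel_merge(const double *a, size_t m,
                           const double *b, size_t n,
                           double *out, int p)
{
  #pragma omp parallel for num_threads(p) schedule(static, 1)
  for (int t = 0; t < p; t++) {
    size_t k0 = (m + n) * (size_t)t / (size_t)p;        /* output slice [k0,k1) */
    size_t k1 = (m + n) * (size_t)(t + 1) / (size_t)p;
    size_t i = corank(k0, a, m, b, n), ie = corank(k1, a, m, b, n);
    size_t j = k0 - i, je = k1 - ie;
    size_t k = k0;
    while (i < ie && j < je)
      out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];      /* stable sequential merge */
    while (i < ie) out[k++] = a[i++];
    while (j < je) out[k++] = b[j++];
  }
}
```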

Note that the parallel speed-up of the Frogsorts scales almost as linearly with the number of cores as the speed-up of Knuthsort; hence the Footprints of the PFrogsorts are clearly lower than that of PKnuthsort.

Note that the benefit of Frogsort2 over Frogsort1 somewhat diminishes when more parallel cores are used.

Note that the linear setup phase of Frogsort0 parallelizes somewhat better than the recursive setup of Frogsorts 1, 2, 3 and 6. Hence it is promising to implement variants of Frogsorts 2, 3 and 6 that leverage a linear setup with chunks of a predefined mixture of data and buffer elements.