A Portfolio of efficient algorithms for different sorting disciplines is the basis for sustainable sorting.

Generic sorting

The most generic (and expensive) sorting algorithms have the following features

generic sorting sorts by comparison
generic sorting is stable
generic sorting can sort elements of varying size
generic sorting has worst case complexity \(\mathcal{O}(N\log{N})\)
generic sorting can be implemented with concurrent divide and concurrent conquer

The greeNsort^® Portfolio provides algorithms for different sorting disciplines, ranked here from most generic to most specialized.

stable size-varying

A popular prior-art method is an indirect Quicksort which swaps pointers to size-varying elements.¹ This is relatively memory-parsimonious, particularly if elements are big, however, it comes with high random-access CPU costs. The greeNsort^® research shows that a direct Mergesort (100% buffer) which moves the size-varying elements can be more sustainable because this avoids random access.

50% buffer

The size-varying greeNsort^® algorithms VFrogsort1 and XFrogsort1 needs only 50% buffer and still fulfill all above criteria for a generic sorting algorithm.

stable equally-sized

Most prior-art algorithms restrict themselves to equally-sized elements, which is faster.

100% buffer

The standard methods for stable sorting which allow for concurrent execution use 100% buffer², for example Mergesort is the preferred method for stable sorting in C++ if enough memory is available. Mergesort can easily be tuned for adaptivity: an initial comparison checks whether the two sequences overlap, if not the cost for comparing can be skipped which reduces the merging to just copying. The greeNsort^® algorithm Omitsort also skips the cost of movements, Octosort further improves adaptivity to also reverse-sorted data. Both greeNsort^® algorithms beat Mergesort with their \(\mathcal{O}(N)\) best case and beat Timsort with less copying.

50% buffer

Timsort as used in Python and Java is a highly tuned adaptive Natural-Mergesort which towards the best case of presorted data costs only \(\mathcal{O}(N)\) and which only needs 50% buffer. However, beyond implementation complexity, the disadvantage of Timsort is much higher cost for really unsorted data and strong serial logic which hampers concurrency. The greeNsort^® algorithms Frogsort1 and Geckosort1 achieve the generality of Mergesort but save 50% buffer. Frogsort1 is automatically (without tuning) adaptive to presorted data and Geckosort1 is automatically adaptive to presorted and reverse-sorted data. Squidsor1t is a variant of Frogsort1 tuned with techniques from Octosort to be bi-directionally most adaptive. Like Mergesort and unlike Timsort, the greeNsort^® algorithms are efficient no-copy algorithms (not more than one move per merge).

low buffer

The greeNsort^® research is not aware of prior-art algorithms that can match the speed of a good Mergesort with less than 50% buffer. The greeNsort^® algorithms Frogsort2 and Squidsort2 can be parametrized to low (14% buffer) and still sort faster and with lower footprint than Mergesort and Timsort.

sqrt buffer

Academic research was obsessed for decades with finding a true stable inplace (i.e. 0-buffer) Mergesort that can practically compete with standard Mergesort. In 2008, when Robert Sedgewick termed this goal as the holy sorting grail, inplace-mergesort (IMergesort) was more than factor 5 slower than Mergesort. In 2014³ Andrey Astrelin achieved to implement Grailsort, based on an algorithm published only theoretically by Huang&Langston (1989-1992), such that it was not more than 1.5 times slower than Mergesort. Andrey Astrelin also showed that using a true inplace sort was somewhat academic: he provided variants Sqrtsort) that use negligible \(\mathcal{O}(\sqrt{N})\) buffer and are only 1.3 times slower than standard Mergesort. The greeNsort^® algorithms Walksort and Jumpsort also work with so little \(\mathcal{O}(\sqrt{N})\) buffer and achieve approximately the speed of standard Mergesort, further tuning for adaptivity is possible.

unstable equally-sized

It is well known, that sacrificing stability enables effective inplace-partitioning of equally-sized elements. Further sacrificing worst-case \(\mathcal{O}(N\log{N})\) for only average-case \(\mathcal{O}(N\log{N})\) (by using random pivots instead of deterministic median-pivots) enables inplace-partitioning with Quicksort at about the speed of Mergesort in other words: when the sacrifices are acceptable, Quicksort tends to be more sustainable than Mergesort according to the greeNsort^® footprint definition⁴.

unstable sqrt buffer

Recently Super Scalar Sample Sort (SSSS or short S4) has been developed with the goal of replacing standard sorting libraries. While it is faster than current standard sorting libraries, it is not stable⁵, it is more complex, it requires more buffer⁶ and its serial⁷ implementation ( IS4o) is slower than the best algorithms in the following section.

unstable no buffer

The greeNsort^® algorithms Zacksort and Zucksort provide simple and elegant solutions to the question open since Hoare’s 1961 inception of Quicksort: how to obtain early-termination on ties without the cost of extra operations slowing down the algorithm on non-tied data.

A bit of history: The original Quicksort as published by Hoare (1961) seemed simple and almost perfect, but was vulnerable for quadratic runtime under certain data input. Sedgewick (1977) compared different alternatives and recommended a (modified) Quicksort2 of Singleton (1969) which had average-case \(\mathcal{O}(N\log{N})\) for all data inputs, but no longer early-terminated to \(\mathcal{O}(N \log{D})\) in the tied case (where \(D\) is the number of distinct values). 3-way Quicksort by Wegner (1985) improved and popularized by Bentley&McIlroy (1993) re-introduced early terminating on ties by isolating a third partition with pivot-ties, but due to extra-operations was slower for untied data ( Quicksort3). Yaroslavskiy’s (2009) faster dual-pivot Quicksort ( Dupisort) used such extra-operations to create a real third partition⁸, but the complex⁹ algorithm is strongly serial and probably impossible to implement branchless. Edelkamp&Weiß (2016) published the much faster simple branchless Block-Quicksort (Quicksort2B)¹⁰, but only with a rudimentary and expensive early-termination mechanism. Orson Peters (2015) combined a branchless algorithm with a proper tuning for ties and a proper tuning for presorted data, the tuning overhead of extra operations in his Pattern-Defeating Quicksort (Pdqsort) is very little, so it is close to optimal, see the C++ implementation on github. However, once combined with an expensive comparison function such as localized string comparison strcoll, it becomes slower than Quicksort2. The greeNsort^® algorithms Zacksort and Zucksort achieve early termination on ties at only one extra comparison per partitioning task, like Hoare’s original they use random pivots but guarantee convergence for any input pattern at expected cost of \(\mathcal{O}(N \log{D})\), Ducksort additionally is \(\mathcal{O}(N)\) adaptive to presorted data. Quicksort serves as a blueprint for many other algorithms, for example ‘selection’ Quickselect and ‘partial sorting’ Quickpart; similarly Zacksort and Zucksort serve as blueprints for Zackselect, Zuckselect, Zackpart and Zuckpart.

Specific sorting

Sorting algorithms exploiting specific situations play an important role in reducing the footprint of sorting. greeNsort^® does focus generic algorithms, however, it is important to know the following classes of specific algorithms, which do sort not by comparisons.

analytic algorithms

Analytic sorting algorithms sort elements by incrementally analyzing the keys for recursive partitioning, in other words, they require the specific situation where the sorting can be determined by analyzing the keys alone, without mutually comparing them. Malte Skarupke (2015) has stressed the importance of offering a convenient API for as many binary data types as possible, his C++ code can be found here.

char-by-char

Incremental analysis of string keys is known to everyone who has seen a phone book. Strings are recursively partitioned character-by-character, sorting ends as soon as the number of remaining elements is one (or small enough to delegate the remaining work to another algorithm). This method can handle strings of varying lengths, but unlike generic sorting it cannot handle locale specific collation orders. The straightforward way of doing this is partitioning into \(d_i\) buckets where \(d_i\) is the number of distinct characters at the i-th character of the string. This works well in small alphabets, if the number of distinct characters can become prohibitively high, an alternative related to 3-way Quicksort with a fan-out between 2 and 3 is Multi-key Quicksort.

byte-by-byte

Incremental byte-by-byte analysis of binary types such as integers works the same way and is called MSB-Radixsort. While MSB-Radixsort supports early termination, LSD-Radixsort dos not: it always processes all bytes. With a fixed number of k bytes in a specific binary type the cost of LSD-radix sort is \(\mathcal{O}(N \cdot k)\) which makes some people believe that Radixsort has cost linear with N. They claim that the \(\mathcal{O}(N\log{N})\) rule applies only to generic comparison sorts, but that is a misinterpretation which ignores that a fixed number of bytes limits the size D of the domain - for example how many different integers the binary type can represent. For more information see Wikipedia.

synthetic algorithms

Synthetic sorting algorithms sort only keys that stem from a limited domain. This requires, that elements do contain nothing but keys, that there is a limited number of distinct keys, and that those keys can be projected onto consecutive positions. Typically a linear projection is used, i.e. synthetic sorting handles interval-scale data, but unlike generic sorting algorithms not ordinal-scale data (which requires mutual comparisons).

counting

Counting sort starts with a vector of counters initialized to zero, then it takes one pass over the data to project each key to a position and increase a counter at this position. Once all elements have been counted, one pass over the counters synthesizes the keys in the order of the counter positions. This obviously has only linear \(\mathcal{O}(N)\) cost. For more information see Wikipedia.

marking

If it is known that each key can only occur once, instead of a vector of counters a vector of bits is sufficient. After initializing all bits to 0, Bit sort takes one pass over the data to project each key to a position and set that bit to 1. Once all keys have been set, one pass over the bits synthesizes the unique keys in the order of the bit positions. This obviously has only linear \(\mathcal{O}(N)\) cost and is very memory parsimonious. Sorting by marking bits has been described in the chapter Cracking the Oyster of the famous book Programming Pearls by Jon Bentley (1999).

an example with specialized algorithms

The following examples demonstrate the power and limitations of specialized algorithms. Say we want to sort 1 million integers. We use R and our package bit.

# package 'bit' version 4.0.0 is not yet on CRAN
# hence installation from github
# devtools::install_github("truecluster/bit")
require(bit)
require(microbenchmark)

If the integers are sampled from a small domain (2 million values) with few or no duplicates, specialized radixsort is faster than quicksort and even more specialized bitsort is even faster, particularly if we tell bitsort that there are no duplicates in this example:

if (!exists("sx")) {
  x <- sample(2e6, 1e6)
  sx <- rbind(
    quicksort = summary(microbenchmark(sort(x, method = "quick"), times = 10), unit = "ms")
  , radixsort = summary(microbenchmark(sort(x, method = "radix"), times = 10), unit = "ms")
  , multip_bitsort = summary(microbenchmark(bit_sort(x, has.dup = TRUE ), times = 10), unit = "ms")
  , unique_bitsort = summary(microbenchmark(bit_sort(x, has.dup = FALSE), times = 10), unit = "ms")
  )
}
round(sx[,-1], 1)

##                 min   lq mean median   uq  max neval
## quicksort      65.1 66.3 67.3   67.0 68.0 70.3    10
## radixsort      50.9 51.1 52.8   52.0 54.5 56.6    10
## multip_bitsort 28.5 30.1 33.1   33.1 35.9 38.6    10
## unique_bitsort 15.6 16.0 16.8   17.0 17.1 19.0    10

However, if we sample from a huge domain (2 billion values), the more specialized bitsort algorithms take significantly longer:

if (!exists("sy")) {
  y <- sample(1e9, 1e6)
  sy <- rbind(
    quicksort = summary(microbenchmark(sort(y, method = "quick"), times = 10), unit = "ms")
  , radixsort = summary(microbenchmark(sort(y, method = "radix"), times = 10), unit = "ms")
  , multip_bitsort = summary(microbenchmark(bit_sort(y, has.dup = TRUE ), times = 10), unit = "ms")
  , unique_bitsort = summary(microbenchmark(bit_sort(y, has.dup = FALSE), times = 10), unit = "ms")
  )
}
round(sy[,-1], 1)

##                   min     lq   mean median     uq    max neval
## quicksort        65.9   66.5   67.8   67.4   68.2   72.5    10
## radixsort        36.7   38.1   39.2   38.8   39.6   45.5    10
## multip_bitsort  139.5  140.0  143.0  141.1  146.8  149.9    10
## unique_bitsort 1459.3 1465.5 1491.8 1483.9 1527.2 1535.1    10

although Quicksort has only average case complexity \(\mathcal{O}(N\log{N})\) and although concurrent conquer is difficult↩︎
A standard optimization for reducing the buffer need of such algorithms based on merging is allocating 50% buffer, use this buffer first for sorting the left half, then sorting the right half and finally merging both halves, a limitation of this approach is that both halves cannot be sorted concurrently↩︎
In the same year Robert Sedgewick changed his definition of the holy sorting grail from a \(\mathcal{O}(N\log{N})\) to a \(\mathcal{O}(N)\) best case, which denies Andrey Astrelin the recognition for having achieved the holy sorting grail. Here are Segewick lectures of the years 2008, 2009, 2010, 2011, 2012, 2013, 2014. Sadly, Andrey Astrelin passed away on January 13th 2017 at the age of 48.↩︎
This implies that if in a given system sufficient memory is available that is currently not needed for other tasks, then investing negligible buffer into Walksort or little buffer into Frogsort2 can eliminate the remaining risk of quadratic runtime.↩︎
The original publications do not specify stability, but the implementors clarify: “So far there is no stable implementation planned. A stable implementation might be difficult due to the algorithm’s design.”↩︎
The inplace version at least \(\mathcal{O}(\sqrt{N})\))↩︎
Sorting doubles, the parallel implementation IPS4o is faster, we have not compared to a parallel implementation of Pattern-Defeating Quicksort↩︎
This improves the average \(\mathcal{O}(N\log_2{N})\) algorithm to \(\mathcal{O}(N\log_3{N})\) regarding moves↩︎
greeNsort^® research discovered unneeded moves in certain situations↩︎
Fun fact: the optimization published by Edelkamp&Weiß (2016) was already suggested in a little known paper of Hoare (1962) in which he gives context and explanations for his 1961 algorithm.↩︎

The greeNsort® portfolio of algorithms

Generic sorting

stable size-varying

50% buffer

stable equally-sized

100% buffer

50% buffer

low buffer

sqrt buffer

unstable equally-sized

unstable sqrt buffer

unstable no buffer

Specific sorting

analytic algorithms

char-by-char

byte-by-byte

synthetic algorithms

counting

marking

an example with specialized algorithms

The greeNsort^® portfolio of algorithms