**A Portfolio of efficient algorithms for different sorting disciplines is the basis for sustainable sorting.**

The most generic (and expensive) sorting algorithms have the following features:

- generic sorting sorts by comparison
- generic sorting is stable
- generic sorting can sort elements of varying size
- generic sorting has worst case complexity \(\mathcal{O}(N\log{N})\)
- generic sorting can be implemented with concurrent divide and concurrent conquer

**The greeNsort^{®} Portfolio provides algorithms for different sorting disciplines, ranked here from most generic to most specialized.**

A popular prior-art method is an indirect *Quicksort* which swaps pointers to size-varying elements.^{1} This is relatively memory-parsimonious, particularly if elements are big; however, it comes with *high random-access CPU costs*. The *greeNsort^{®}* research shows that a direct approach, which moves the size-varying elements themselves, can avoid these random-access costs.

**The size-varying greeNsort^{®} algorithms VFrogsort1 and XFrogsort1 need only 50% buffer and still fulfill all of the above criteria for a generic sorting algorithm.**

Most prior-art algorithms restrict themselves to equally-sized elements, which is faster.

The standard methods for stable sorting which allow for concurrent execution use 100% buffer^{2}; for example, *Mergesort* is the preferred method for stable sorting in *C++* if enough memory is available. *Mergesort* can easily be tuned for adaptivity: an initial comparison checks whether the two sequences overlap; if they do not, the cost of comparing can be skipped, which reduces the merging to mere copying. **The greeNsort^{®} algorithm Omitsort also skips the cost of movements, and Octosort extends the adaptivity to reverse-sorted data. Both greeNsort^{®} algorithms beat Mergesort with their \(\mathcal{O}(N)\) best case and beat Timsort with less copying.**
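The adaptivity tuning described above can be sketched in a few lines. This is an illustrative Python sketch of a plain top-down Mergesort with the overlap check, not the greeNsort or C++ implementation:

```python
def merge_adaptive(left, right):
    """Merge two sorted lists; skip comparisons when the runs don't overlap."""
    # Adaptivity check: if the last of `left` <= first of `right`,
    # merging reduces to mere copying.
    if left and right and left[-1] <= right[0]:
        return left + right
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if right[j] < left[i]:      # strict '<' keeps the sort stable
            out.append(right[j]); j += 1
        else:
            out.append(left[i]); i += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out

def mergesort(a):
    """Stable top-down Mergesort; O(N) best case on presorted data."""
    if len(a) <= 1:
        return a
    mid = len(a) // 2
    return merge_adaptive(mergesort(a[:mid]), mergesort(a[mid:]))
```

On presorted input every overlap check succeeds, so each merge is a single concatenation and the comparison cost collapses toward \(\mathcal{O}(N)\).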

*Timsort*, as used in Python and Java, is a highly tuned adaptive *Natural-Mergesort* which, towards the best case of presorted data, costs only \(\mathcal{O}(N)\) and which needs only 50% buffer. However, beyond implementation complexity, the disadvantages of *Timsort* are much higher cost for truly unsorted data and strongly serial logic which hampers concurrency. **The greeNsort^{®} algorithms Frogsort1 and Geckosort1 achieve the generality of Mergesort but save 50% buffer. Frogsort1 is automatically (without tuning) adaptive to presorted data and Geckosort1 is automatically adaptive to presorted and reverse-sorted data. Squidsort1 is a variant of Frogsort1 tuned with techniques from Octosort to be bi-directionally most adaptive. Like Mergesort and unlike Timsort, the greeNsort^{®} algorithms are efficient no-copy algorithms (not more than one move per merge).**

The *greeNsort^{®}* research is not aware of prior-art algorithms that can match the speed of a good *Mergesort* with only 50% buffer.

Academic research was obsessed for decades with finding a truly stable inplace (i.e. 0-buffer) Mergesort that can practically compete with standard *Mergesort*. In 2008, when Robert Sedgewick termed this goal the *holy sorting grail*, inplace-mergesort (*IMergesort*) was more than a factor of 5 slower than *Mergesort*. In 2014^{3} Andrey Astrelin managed to implement *Grailsort*, based on an algorithm published only theoretically by Huang&Langston (1989-1992), such that it was no more than 1.5 times slower than *Mergesort*. Andrey Astrelin also showed that using a truly inplace sort was somewhat academic: he provided a variant (*Sqrtsort*) that uses negligible \(\mathcal{O}(\sqrt{N})\) buffer and is only 1.3 times slower than standard *Mergesort*. **The greeNsort^{®} algorithms Walksort and Jumpsort also work with as little as \(\mathcal{O}(\sqrt{N})\) buffer and achieve approximately the speed of standard Mergesort; further tuning for adaptivity is possible**.

It is well known that sacrificing stability enables effective inplace-partitioning of equally-sized elements. Further sacrificing worst-case \(\mathcal{O}(N\log{N})\) for only average-case \(\mathcal{O}(N\log{N})\) (by using random pivots instead of deterministic median-pivots) enables inplace-partitioning with *Quicksort* at about the speed of *Mergesort*. In other words: when these sacrifices are acceptable, *Quicksort* tends to be more sustainable than *Mergesort* according to the *greeNsort^{®}* footprint definition.
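The trade described above (unstable, random pivots, inplace partitioning) can be sketched as follows; an illustrative Python sketch of a textbook Hoare-partition Quicksort, not greeNsort code:

```python
import random

def quicksort(a, lo=0, hi=None):
    """In-place Quicksort: average-case O(N log N), not stable."""
    if hi is None:
        hi = len(a) - 1
    while lo < hi:
        p = hoare_partition(a, lo, hi)
        # Recurse into the smaller side first, loop on the larger one,
        # so the stack depth stays O(log N).
        if p - lo < hi - p:
            quicksort(a, lo, p)
            lo = p + 1
        else:
            quicksort(a, p + 1, hi)
            hi = p

def hoare_partition(a, lo, hi):
    # Swap a random element to the front: random pivots trade the
    # deterministic worst case for average-case O(N log N).
    r = random.randint(lo, hi)
    a[lo], a[r] = a[r], a[lo]
    pivot = a[lo]
    i, j = lo - 1, hi + 1
    while True:
        i += 1
        while a[i] < pivot:
            i += 1
        j -= 1
        while a[j] > pivot:
            j -= 1
        if i >= j:
            return j                # a[lo..j] <= pivot <= a[j+1..hi]
        a[i], a[j] = a[j], a[i]
```

The partition works inplace with \(\mathcal{O}(1)\) extra memory, which is exactly what stability would forbid for a merging-based algorithm.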

Recently *Super Scalar Sample Sort* (SSSS, or S4 for short) has been developed with the goal of replacing standard sorting libraries. While it is faster than current standard sorting libraries, it is not stable^{5}, it is more complex, it requires more buffer^{6}, and its serial^{7} implementation (*IS4o*) is slower than the best algorithms in the following section.

**The greeNsort^{®} algorithms Zacksort and Zucksort provide simple and elegant solutions to the question open since Hoare’s 1961 inception of Quicksort: how to obtain early-termination on ties without the cost of extra operations slowing down the algorithm on non-tied data.**

**A bit of history**: The original *Quicksort* as published by Hoare (1961) seemed simple and almost perfect, but was vulnerable to quadratic runtime for certain data inputs. Sedgewick (1977) compared different alternatives and recommended a (modified) *Quicksort2* of Singleton (1969) which had average-case \(\mathcal{O}(N\log{N})\) for all data inputs, but no longer early-terminated to \(\mathcal{O}(N \log{D})\) in the tied case (where \(D\) is the number of *distinct* values). Threeway Quicksort by Wegner (1985), improved and popularized by Bentley&McIlroy (1993), re-introduced early termination on ties by isolating a third partition with pivot-ties, but due to these extra operations was slower for untied data (*Quicksort3*). Yaroslavskiy's (2009) faster dual-pivot Quicksort (*Dupisort*) used such extra operations to create a real third partition^{8}, but the complex^{9} algorithm is strongly serial and probably impossible to implement branchless. Edelkamp&Weiß (2016) published the much faster, simple, branchless Block-Quicksort (*Quicksort2B*)^{10}, but only with a rudimentary and expensive early-termination mechanism. Orson Peters (2015) combined a branchless algorithm with proper tuning for ties and proper tuning for presorted data; the tuning overhead of extra operations in his *Pattern-Defeating Quicksort* (*Pdqsort*) is very small, so it is close to optimal (see the C++ implementation on GitHub). However, once combined with an expensive comparison function such as localized string comparison `strcoll`, it becomes slower than *Quicksort2*. **The greeNsort^{®} algorithms Zacksort and Zucksort achieve early termination on ties at only one extra comparison per partitioning task; like Hoare's original they use random pivots, but they guarantee convergence for any input pattern at an expected cost of \(\mathcal{O}(N \log{D})\); Ducksort additionally is \(\mathcal{O}(N)\) adaptive to presorted data**.
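The early-termination idea itself can be illustrated with the classic three-way (Dutch national flag) partitioning scheme of the Wegner/Bentley&McIlroy lineage described above. This Python sketch shows the effect (ties form a finished middle partition); it pays the extra operations per element that the text mentions and is *not* the greeNsort Zacksort/Zucksort mechanism:

```python
import random

def quicksort3(a, lo=0, hi=None):
    """Three-way Quicksort: elements equal to the pivot end up in a middle
    partition that needs no further work, giving O(N log D) on tied data."""
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:
        return
    pivot = a[random.randint(lo, hi)]
    lt, i, gt = lo, lo, hi
    while i <= gt:                       # Dutch-national-flag partitioning
        if a[i] < pivot:
            a[lt], a[i] = a[i], a[lt]; lt += 1; i += 1
        elif a[i] > pivot:
            a[i], a[gt] = a[gt], a[i]; gt -= 1
        else:
            i += 1                       # pivot-tie: stays put, never recursed into
    quicksort3(a, lo, lt - 1)
    quicksort3(a, gt + 1, hi)
```

With \(D\) distinct values the recursion depth is bounded by \(\log{D}\) rather than \(\log{N}\), at the price of the extra comparisons and swaps on untied data.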

Sorting algorithms exploiting specific situations play an important role in reducing the footprint of sorting. *greeNsort^{®}* focuses on generic algorithms; however, it is important to know the following classes of specialized algorithms, which do not sort by comparisons.

Analytic sorting algorithms sort elements by incrementally analyzing the keys for recursive partitioning; in other words, they require the specific situation where the sort order can be determined by analyzing the keys alone, without mutually comparing them. Malte Skarupke (2015) has stressed the importance of offering a convenient API for as many binary data types as possible; his C++ code can be found on GitHub.

Incremental analysis of string keys is known to everyone who has seen a phone book. Strings are recursively partitioned character-by-character; sorting ends as soon as the number of remaining elements is one (or small enough to delegate the remaining work to another algorithm). This method can handle strings of varying lengths, but unlike generic sorting it cannot handle locale-specific collation orders. The straightforward way of doing this is partitioning into \(d_i\) buckets, where \(d_i\) is the number of distinct characters at the i-th position of the string. This works well for small alphabets; if the number of distinct characters becomes prohibitively high, an alternative related to *threeway Quicksort*, with a fan-out between 2 and 3, is Multi-key Quicksort.
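The phone-book partitioning can be sketched as follows; an illustrative Python sketch of MSD-style string sorting under byte-wise (not locale-aware) ordering:

```python
from collections import defaultdict

def msd_string_sort(strings, depth=0):
    """MSD (phone-book style) string sort: partition by the character at
    position `depth`, recurse only into buckets holding more than one string."""
    if len(strings) <= 1:
        return strings
    # Strings exhausted at this depth sort before all longer ones.
    done = [s for s in strings if len(s) == depth]
    buckets = defaultdict(list)
    for s in strings:
        if len(s) > depth:
            buckets[s[depth]].append(s)   # stable: append preserves input order
    out = done
    for ch in sorted(buckets):            # small alphabet: cheap to order buckets
        out.extend(msd_string_sort(buckets[ch], depth + 1))
    return out
```

Early termination is visible here: a bucket that shrinks to one string is returned immediately, so the remaining characters of its key are never inspected.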

Incremental byte-by-byte analysis of binary types such as integers works the same way and is called *MSB-Radixsort*. While *MSB-Radixsort* supports early termination, *LSD-Radixsort* does not: it always processes all bytes. With a fixed number of *k* bytes in a specific binary type the cost of *LSD-Radixsort* is \(\mathcal{O}(N \cdot k)\), which makes some people believe that Radixsort has cost linear in *N*. They claim that the \(\mathcal{O}(N\log{N})\) rule applies only to generic comparison sorts, but that is a misinterpretation which ignores that a fixed number of bytes limits the size *D* of the domain - for example, how many different integers the binary type can represent. For more information see Wikipedia.
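The \(\mathcal{O}(N \cdot k)\) structure is easy to see in code; an illustrative Python sketch of LSD-Radixsort for non-negative integers, with one stable bucket pass per byte:

```python
def lsd_radix_sort(xs, key_bytes=4):
    """LSD-Radixsort for non-negative ints fitting in `key_bytes` bytes:
    exactly k stable passes, one per byte, least significant byte first."""
    for shift in range(0, 8 * key_bytes, 8):
        buckets = [[] for _ in range(256)]
        for x in xs:
            buckets[(x >> shift) & 0xFF].append(x)   # stable append per byte value
        xs = [x for bucket in buckets for x in bucket]
    return xs
```

Note that all `key_bytes` passes always run; unlike MSD variants there is no early termination, which is exactly the trade-off discussed above.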

Synthetic sorting algorithms sort only keys that stem from a limited domain. This requires that elements contain nothing but keys, that there is a limited number of distinct keys, and that those keys can be projected onto consecutive positions. Typically a linear projection is used, i.e. synthetic sorting handles interval-scale data, but unlike generic sorting algorithms not ordinal-scale data (which requires mutual comparisons).

*Counting sort* starts with a vector of counters initialized to zero, then it takes one pass over the data to project each key to a position and increase a counter at this position. Once all elements have been counted, one pass over the counters *synthesizes* the keys in the order of the counter positions. This obviously has only linear \(\mathcal{O}(N)\) cost. For more information see Wikipedia.
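The two passes described above can be sketched directly; an illustrative Python sketch assuming integer keys already projected onto `range(domain)`:

```python
def counting_sort(keys, domain):
    """Counting sort for integer keys in range(domain): O(N + domain)."""
    counts = [0] * domain
    for k in keys:                       # pass 1: count each key at its position
        counts[k] += 1
    out = []
    for k, c in enumerate(counts):       # pass 2: synthesize keys in position order
        out.extend([k] * c)
    return out
```

The output keys are *synthesized* from the counter positions, not moved from the input, which is why no comparisons are ever needed.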

If it is known that each key can only occur once, instead of a vector of counters a vector of bits is sufficient. After initializing all bits to 0, *Bit sort* takes one pass over the data to project each key to a position and set that bit to 1. Once all keys have been set, one pass over the bits *synthesizes* the unique keys in the order of the bit positions. This obviously has only linear \(\mathcal{O}(N)\) cost and is very memory parsimonious. Sorting by marking bits has been described in the chapter *Cracking the Oyster* of the famous book *Programming Pearls* by Jon Bentley (1999).
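A minimal sketch of the bit-marking idea (illustrative Python, not the implementation in the `bit` package), assuming distinct integer keys already projected onto `range(domain)`:

```python
def bit_sort(keys, domain):
    """Bit sort for *distinct* integer keys in range(domain): mark one bit
    per key, then read the bits back in order. Needs only domain/8 bytes."""
    bits = bytearray(domain // 8 + 1)
    for k in keys:
        bits[k >> 3] |= 1 << (k & 7)     # set the bit at position k
    # Synthesize the unique keys in the order of the bit positions.
    return [k for k in range(domain) if bits[k >> 3] & (1 << (k & 7))]
```

If a key occurred twice, the second occurrence would be silently lost (the bit is already set), which is why the distinctness precondition matters.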

The following examples demonstrate the power and limitations of specialized algorithms. Say we want to sort 1 million integers. We use R and our package bit.

```
# package 'bit' version 4.0.0 is not yet on CRAN
# hence installation from github
# devtools::install_github("truecluster/bit")
require(bit)
require(microbenchmark)
```

If the integers are sampled from a small domain (2 million values) with few or no duplicates, the specialized radixsort is faster than quicksort, and the even more specialized bitsort is faster still, particularly if we tell `bit_sort` that there are no duplicates, as in this example:

```
if (!exists("sx")) {
  x <- sample(2e6, 1e6)
  sx <- rbind(
    quicksort      = summary(microbenchmark(sort(x, method = "quick"), times = 10), unit = "ms"),
    radixsort      = summary(microbenchmark(sort(x, method = "radix"), times = 10), unit = "ms"),
    multip_bitsort = summary(microbenchmark(bit_sort(x, has.dup = TRUE ), times = 10), unit = "ms"),
    unique_bitsort = summary(microbenchmark(bit_sort(x, has.dup = FALSE), times = 10), unit = "ms")
  )
}
round(sx[, -1], 1)
```

```
## min lq mean median uq max neval
## quicksort 65.1 66.3 67.3 67.0 68.0 70.3 10
## radixsort 50.9 51.1 52.8 52.0 54.5 56.6 10
## multip_bitsort 28.5 30.1 33.1 33.1 35.9 38.6 10
## unique_bitsort 15.6 16.0 16.8 17.0 17.1 19.0 10
```

However, if we sample from a huge domain (1 billion values), the more specialized bitsort algorithms take significantly longer:

```
if (!exists("sy")) {
  y <- sample(1e9, 1e6)
  sy <- rbind(
    quicksort      = summary(microbenchmark(sort(y, method = "quick"), times = 10), unit = "ms"),
    radixsort      = summary(microbenchmark(sort(y, method = "radix"), times = 10), unit = "ms"),
    multip_bitsort = summary(microbenchmark(bit_sort(y, has.dup = TRUE ), times = 10), unit = "ms"),
    unique_bitsort = summary(microbenchmark(bit_sort(y, has.dup = FALSE), times = 10), unit = "ms")
  )
}
round(sy[, -1], 1)
```

```
## min lq mean median uq max neval
## quicksort 65.9 66.5 67.8 67.4 68.2 72.5 10
## radixsort 36.7 38.1 39.2 38.8 39.6 45.5 10
## multip_bitsort 139.5 140.0 143.0 141.1 146.8 149.9 10
## unique_bitsort 1459.3 1465.5 1491.8 1483.9 1527.2 1535.1 10
```

1. although *Quicksort* has only average-case complexity \(\mathcal{O}(N\log{N})\) and although concurrent conquer is difficult.
2. A standard optimization for reducing the buffer need of such merging-based algorithms is to allocate 50% buffer, use this buffer first for sorting the left half, then for sorting the right half, and finally for merging both halves; a limitation of this approach is that the two halves cannot be sorted concurrently.
3. In the same year Robert Sedgewick changed his definition of the *holy sorting grail* from an \(\mathcal{O}(N\log{N})\) to an \(\mathcal{O}(N)\) best case, which denies Andrey Astrelin the recognition for having achieved the *holy sorting grail*; see Sedgewick's lectures of the years 2008 through 2014. Sadly, *Andrey Astrelin passed away* on January 13th 2017 at the age of 48.
4. This implies that if in a given system sufficient memory is available that is currently not needed for other tasks, then investing negligible buffer into *Walksort* or little buffer into *Frogsort2* can eliminate the remaining risk of quadratic runtime.
5. The original publications do not specify stability, but the implementors clarify: *"So far there is no stable implementation planned. A stable implementation might be difficult due to the algorithm's design."*
6. The inplace version needs at least \(\mathcal{O}(\sqrt{N})\).
7. Sorting doubles, the parallel implementation IPS4o is faster; we have not compared against a parallel implementation of *Pattern-Defeating Quicksort*.
8. This improves the average \(\mathcal{O}(N\log_2{N})\) algorithm to \(\mathcal{O}(N\log_3{N})\) regarding moves.
9. *greeNsort^{®}* research discovered unneeded moves in certain situations.
10. Fun fact: the optimization published by Edelkamp&Weiß (2016) was already suggested in a little-known paper of Hoare (1962) in which he gives context and explanations for his 1961 algorithm.

Copyright © 2020 Dr. Jens Oehlschlägel - All rights reserved