There is a vast literature on sorting, so this page cannot hope to be comprehensive. Reasons to include something here:
-
This is, or could be, something to credit for ideas used in djbsort.
-
The results are, or have been claimed to be, competitive with djbsort in speed. Speed results are not excluded on the basis of security or verification. Results are also not excluded on the basis of array-size limitations (only multiples of 64, only powers of 2, etc.) or array-alignment requirements.
Available software
2017 Gueron–Krasnov "avx_qsort"
2017 Bernstein–Chuengsatiansup–Lange–van Vredendaal "NTRU Prime: reducing attack surface at low cost" (paper): constant time; direct predecessor of djbsort
2017 Bramas "A novel hybrid quicksort algorithm vectorized using AVX-512 on Intel Skylake" (paper): only for Skylake using AVX-512
2018 Hou–Wang–Feng "A framework for the automatic vectorization of parallel sort on x86-based processors"
2020 Blacher–Giesen–Kühne "Fast and robust vectorized in-place sorting of primitive types" (paper)
2022 Blacher–Giesen–Sanders–Wassenberg "Vectorized and performance-portable Quicksort" (paper)
Harder-to-verify performance claims
1987 Rönsch–Strauss "Timing results of some internal sorting algorithms on vector computers"
2007 Furtak–Amaral–Niewiadomski "Using SIMD registers and instructions to enable instruction-level parallelism in sorting algorithms"
2008 Chhugani–Macy–Baransi–Nguyen–Hagog–Kumar–Lee–Chen–Dubey "Efficient implementation of sorting on multi-core SIMD CPU architecture"
2013 Xiaochen–Rocki–Suda "Register level sort algorithm on multi-core SIMD processors"
2015 Hayes–Palomar–Unsal–Cristal–Valero "VSR sort: A novel vectorised sorting algorithm & architecture extensions for future microprocessors"
2015 Hou–Wang–Feng "ASPaS: a framework for automatic SIMDization of parallel sorting on x86-based many-core processors": reports, e.g., about 220 million cycles (55 cycles/byte) on 1.05GHz Xeon Phi 5110P (Knights Corner) for n=1000000
2016 Xu–Xu–Yin–Zheng "Optimized merge sort on modern commodity multi-core CPUs", Telkomnika 14 (2016), 209–318: reports, e.g., about 110 million cycles (27 cycles/byte) on 2.4GHz Core i7 4700MQ (Haswell) for n=1048576
2016 Codish–Cruz-Filipe–Nebel–Schneider-Kamp "Optimizing sorting algorithms by using sorting networks", Formal Aspects of Computing 29 (2017), 559–579: reports, e.g., about 150 cycles (2.3 cycles/byte) on unspecified "Core i7" for n=16; about 120000 cycles (3 cycles/byte) for n=10000
2016 Gueron–Krasnov
"Fast quicksort implementation using AVX instructions",
The Computer Journal 59 (2016), 83–90:
reports, e.g.,
about 160000 cycles (40 cycles/byte) for n=1000 int32
on Haswell (Figure 4),
and
about 130000 cycles (32 cycles/byte) for n=1000 int32
on Haswell
using radix sort in Intel's Integrated Performance Primitives library,
"the fastest in-memory sort we are aware of"
Sorting networks
1968 Batcher "Sorting networks and their applications"
1983 Kumar–Hirschberg "An efficient implementation of Batcher's odd-even merge algorithm and its application in parallel sorting schemes"
1986 Marberg–Gafni "Sorting in constant number of row and column phases on a mesh"
1987 Chung–Ravikumar "Bounds on the size of test sets for sorting and related networks", Discrete Mathematics (1990)
1994 Lee–Batcher "A multiway merging network"
2006 Chaudhry–Cormen "Slabpose columnsort: a new oblivious algorithm for out-of-core sorting on distributed-memory clusters"
2007 Even–Levi–Litman "Optimal conclusive sets for comparator networks", Theoretical Computer Science (2009)
2015 Ebbesen "On the practicality of data-oblivious sorting"
Symbolic execution
2007 Nethercote–Seward "Valgrind: a framework for heavyweight dynamic binary instrumentation", ACM SIGPLAN Conference on Programming Language Design and Implementation
2016 Shoshitaishvili–Wang–Salls–Stephens–Polino–Dutcher–Grosen–Feng–Hauser–Kruegel–Vigna "SoK: (State of) The Art of War: Offensive Techniques in Binary Analysis", IEEE Symposium on Security and Privacy
Version: This is version 2024.01.17 of the "References" web page.