There is a vast literature on sorting, so this page cannot hope to be comprehensive. Reasons to include something here:

The benchmark used here is sorting an aligned array of n=1024 random int32 items in L1 cache on a 3.5GHz Intel Xeon E3-1275 v3 (Haswell) core. This is a typical size of interest in post-quantum cryptography. For this size, djbsort runs in under 8000 cycles.

Available software

2001 Herf: about 33000 cycles for n=1024 int32 on Haswell (original code is for float32 but easily adapted to int32)

2017 Gueron–Krasnov "avx_qsort": about 50000 cycles for n=1024 int32 on Haswell

2017 Bernstein–Chuengsatiansup–Lange–van Vredendaal "NTRU Prime: reducing attack surface at low cost" (paper): about 25000 cycles for n=1024 int32 on Haswell; constant time; direct predecessor of djbsort

2017 Bramas "A novel hybrid quicksort algorithm vectorized using AVX-512 on Intel Skylake" (paper): only for Skylake using AVX-512

2018 Hou–Wang–Feng "A framework for the automatic vectorization of parallel sort on x86-based processors": about 37000 cycles for n=1024 int32 on Haswell

Harder-to-verify performance claims

1987 Rönsch–Strauss "Timing results of some internal sorting algorithms on vector computers"

2007 Furtak–Amaral–Niewiadomski "Using SIMD registers and instructions to enable instruction-level parallelism in sorting algorithms"

2008 Chhugani–Macy–Baransi–Nguyen–Hagog–Kumar–Lee–Chen–Dubey "Efficient implementation of sorting on multi-core SIMD CPU architecture"

2013 Xiaochen–Rocki–Suda "Register level sort algorithm on multi-core SIMD processors"

2015 Hayes–Palomar–Unsal–Cristal–Valero "VSR sort: A novel vectorised sorting algorithm & architecture extensions for future microprocessors"

2015 Hou–Wang–Feng "ASPaS: a framework for automatic SIMDization of parallel sorting on x86-based many-core processors": reports, e.g., about 220 million cycles (55 cycles/byte) on 1.05GHz Xeon Phi 5110P (Knights Corner) for n=1000000

2016 Xu–Xu–Yin–Zheng "Optimized merge sort on modern commodity multi-core CPUs", Telkomnika 14 (2016), 209–318: reports, e.g., about 110 million cycles (27 cycles/byte) on 2.4GHz Core i7 4700MQ (Haswell) for n=1048576

2016 Codish–Cruz-Filipe–Nebel–Schneider-Kamp "Optimizing sorting algorithms by using sorting networks", Formal Aspects of Computing 29 (2017), 559–579: reports, e.g., about 150 cycles (2.3 cycles/byte) on unspecified "Core i7" for n=16; about 120000 cycles (3 cycles/byte) for n=10000

2016 Gueron–Krasnov "Fast quicksort implementation using AVX instructions", The Computer Journal 59 (2016), 83–90: reports, e.g., about 160000 cycles (40 cycles/byte) for n=1000 int32 on Haswell (Figure 4), and about 130000 cycles (32 cycles/byte) for n=1000 int32 on Haswell using radix sort in Intel's Integrated Performance Primitives library, "the fastest in-memory sort we are aware of"

Sorting networks

1968 Batcher "Sorting networks and their applications"

1983 Kumar–Hirschberg "An efficient implementation of Batcher's odd-even merge algorithm and its application in parallel sorting schemes"

1986 Marberg–Gafni "Sorting in constant number of row and column phases on a mesh"

1987 Chung–Ravikumar "Bounds on the size of test sets for sorting and related networks", Discrete Mathematics (1990)

1994 Lee–Batcher "A multiway merging network"

2006 Chaudhry–Cormen "Slabpose columnsort: a new oblivious algorithm for out-of-core sorting on distributed-memory clusters"

2007 Even–Levi–Litman "Optimal conclusive sets for comparator networks", Theoretical Computer Science (2009)

2015 Ebbesen "On the practicality of data-oblivious sorting"

Symbolic execution

2007 Nethercote–Seward "Valgrind: a framework for heavyweight dynamic binary instrumentation", ACM SIGPLAN Conference on Programming Language Design and Implementation

2016 Shoshitaishvili–Wang–Salls–Stephens–Polino–Dutcher–Grosen–Feng–Hauser–Kruegel–Vigna "SoK: (State of) The Art of War: Offensive Techniques in Binary Analysis", IEEE Symposium on Security and Privacy

Version: This is version 2018.07.30 of the "References" web page.