There is a vast literature on sorting, so this page cannot hope to be comprehensive. Reasons to include something here:

Available software

2001 Herf

2016 Santurkar

2017 Gueron–Krasnov "avx_qsort"

2017 Bernstein–Chuengsatiansup–Lange–van Vredendaal "NTRU Prime: reducing attack surface at low cost" (paper): constant time; direct predecessor of djbsort

2017 Bramas "A novel hybrid quicksort algorithm vectorized using AVX-512 on Intel Skylake" (paper): only for Skylake using AVX-512

2018 Hou–Wang–Feng "A framework for the automatic vectorization of parallel sort on x86-based processors"

2020 Blacher–Giesen–Kühne "Fast and robust vectorized in-place sorting of primitive types" (paper)

2020 Schechter (blog post)

2022 Blacher–Giesen–Sanders–Wassenberg "Vectorized and performance-portable Quicksort" (paper)

2023 Intel

Harder-to-verify performance claims

1987 Rönsch–Strauss "Timing results of some internal sorting algorithms on vector computers"

2007 Furtak–Amaral–Niewiadomski "Using SIMD registers and instructions to enable instruction-level parallelism in sorting algorithms"

2008 Chhugani–Macy–Baransi–Nguyen–Hagog–Kumar–Lee–Chen–Dubey "Efficient implementation of sorting on multi-core SIMD CPU architecture"

2013 Xiaochen–Rocki–Suda "Register level sort algorithm on multi-core SIMD processors"

2015 Hayes–Palomar–Unsal–Cristal–Valero "VSR sort: A novel vectorised sorting algorithm & architecture extensions for future microprocessors"

2015 Hou–Wang–Feng "ASPaS: a framework for automatic SIMDization of parallel sorting on x86-based many-core processors": reports, e.g., about 220 million cycles (55 cycles/byte) on 1.05GHz Xeon Phi 5110P (Knights Corner) for n=1000000

2016 Xu–Xu–Yin–Zheng "Optimized merge sort on modern commodity multi-core CPUs", Telkomnika 14 (2016), 209–318: reports, e.g., about 110 million cycles (27 cycles/byte) on 2.4GHz Core i7 4700MQ (Haswell) for n=1048576

2016 Codish–Cruz-Filipe–Nebel–Schneider-Kamp "Optimizing sorting algorithms by using sorting networks", Formal Aspects of Computing 29 (2017), 559–579: reports, e.g., about 150 cycles (2.3 cycles/byte) on unspecified "Core i7" for n=16; about 120000 cycles (3 cycles/byte) for n=10000

2016 Gueron–Krasnov "Fast quicksort implementation using AVX instructions", The Computer Journal 59 (2016), 83–90: reports, e.g., about 160000 cycles (40 cycles/byte) for n=1000 int32 on Haswell (Figure 4), and about 130000 cycles (32 cycles/byte) for n=1000 int32 on Haswell using radix sort in Intel's Integrated Performance Primitives library, "the fastest in-memory sort we are aware of"

Sorting networks

1968 Batcher "Sorting networks and their applications"

1983 Kumar–Hirschberg "An efficient implementation of Batcher's odd-even merge algorithm and its application in parallel sorting schemes"

1986 Marberg–Gafni "Sorting in constant number of row and column phases on a mesh"

1987 Chung–Ravikumar "Bounds on the size of test sets for sorting and related networks", Discrete Mathematics (1990)

1994 Lee–Batcher "A multiway merging network"

2006 Chaudhry–Cormen "Slabpose columnsort: a new oblivious algorithm for out-of-core sorting on distributed-memory clusters"

2007 Even–Levi–Litman "Optimal conclusive sets for comparator networks", Theoretical Computer Science (2009)

2015 Ebbesen "On the practicality of data-oblivious sorting"

Symbolic execution

2007 Nethercote–Seward "Valgrind: a framework for heavyweight dynamic binary instrumentation", ACM SIGPLAN Conference on Programming Language Design and Implementation

2016 Shoshitaishvili–Wang–Salls–Stephens–Polino–Dutcher–Grosen–Feng–Hauser–Kruegel–Vigna "SoK: (State of) The Art of War: Offensive Techniques in Binary Analysis", IEEE Symposium on Security and Privacy

Version: This is version 2024.01.17 of the "References" web page.