djbsort: Comparison

This page compares djbsort to some other numeric-array sorting libraries. The libraries selected here for comparison are the same as djbsort's main competitors on the speed page: "the vxsort library used in Microsoft's .NET starting in 2020; the vqsort/highway library introduced by Google; the far library, which is the (sometimes faster) parent of vqsort; and the x86simdsort library introduced by Intel".

One can do better on some of these comparison axes when speed is not an issue: for example, one can simultaneously (1) use less code and (2) sort a wider range of data types.

Interface

Data types:

djbsort: int32, uint32, float32, int64, uint64, float64
far: int32
vqsort: int16, uint16, float16, int32, uint32, float32, int64, uint64, float64, uint128
vxsort: int16, uint16, int32, uint32, float32, int64, uint64, float64
x86simdsort: int16, uint16, float16, int32, uint32, float32, int64, uint64, float64

Array lengths allowed by API on 64-bit CPUs:

djbsort: up to 2⁶³ (void djbsort_int32(int32_t *,long long))
far: up to 2³¹ (void quicksort(int *arr, int n))
vqsort: up to 2⁶⁴ (void VQSort(int32_t* HWY_RESTRICT keys, size_t n, SortAscending))
vxsort: up to 2⁶⁴ (void sort(T* left, T* right, T left_hint = std::numeric_limits<T>::min(), T right_hint = std::numeric_limits<T>::max()))
x86simdsort: up to 2⁶⁴ (void qsort(T *arr, size_t arrsize, bool hasnan = false, bool descending = false))

API is CPU-independent, with implementation automatically selected based on run-time CPU:

djbsort: yes (chooses between AVX2, SSE4.2, ARM NEON, and portable fallback)
far: no (provides AVX2 code; caller handles fallback)
vqsort: yes (chooses between AVX-512, AVX2, SSE4.2, ARM NEON, more options, and portable fallback)
vxsort: no (provides AVX-512 code and AVX2 code; caller handles selection and fallback)
x86simdsort: yes (chooses between AVX-512, AVX2, and portable fallback)

Correctness and security

Memory safety and protection against other bugs:

djbsort: conventional tests; tests under valgrind; verification
far: conventional tests
vqsort: conventional tests
vxsort: conventional tests
x86simdsort: conventional tests

Protection against timing leaks:

djbsort: yes (constant cycle counts when array size is constant)
far: no
vqsort: no
vxsort: no
x86simdsort: no

Protection against denial of service:

djbsort: clear (changing array contents does not increase cycle counts)
far: unclear (chooses pivot from 72 pseudorandom elements but deterministically; recognizes constant subarrays; falls back to radix exchange)
vqsort: unclear (chooses pivot from 576 bytes using per-thread call to OS RNG; falls back to heapsort)
vxsort: unclear (chooses deterministic median-of-three pivot; falls back to heapsort)
x86simdsort: unclear (similar to the far library; falls back to std::sort)

More research is required to quantify the cost of the far, vqsort, vxsort, and x86simdsort libraries when array contents are chosen adversarially. Note that a library that allows a larger relative slowdown may still provide better protection against denial of service if it is faster to begin with.

Cycle counts for random data

CPU cycles/byte to sort int32[16] on AMD Zen 3 (the three numbers are stabilized quartiles of many measurements excluding 31 cycles of cycle-counting overhead):

djbsort: 0.33, 0.34, 0.35
far: 1.14, 1.14, 1.14
vqsort: 0.92, 0.92, 0.92
vxsort: 1.85, 1.93, 2.09
x86simdsort: 0.91, 1.00, 1.01

CPU cycles/byte to sort int32[1024] on AMD Zen 3 (the three numbers are stabilized quartiles of many measurements):

djbsort: 1.14, 1.15, 1.16
far: 1.85, 1.90, 1.95
vqsort: 2.34, 2.46, 2.56
vxsort: 2.61, 2.63, 2.64
x86simdsort: 2.61, 2.78, 2.96

CPU cycles/byte to sort int32[65536] on AMD Zen 3 (the three numbers are stabilized quartiles of many measurements):

djbsort: 3.27, 3.28, 3.31
far: 3.28, 3.30, 3.31
vqsort: 3.51, 3.55, 3.60
vxsort: 3.67, 3.71, 3.76
x86simdsort: 4.15, 4.18, 4.21

See the separate speed page for a more comprehensive comparison of cycle counts.

Code size

Compiled AVX2 code size for a simple int32.cc that defines void sortint32(int32_t *y,int n) by calling the library (text+data+bss for int32.o, plus text+data+bss for the relevant *.o files in the library):

djbsort: 56+0+0 for int32.o plus 25050+0+0 for int32_avx2_C0-sort.o plus 1772+0+0 for portable fallback and dispatch (dispatch_auto-sort_int32.o, int32_portable4_C3-sort.o, compilers*.o)
far: 24387+0+0 for int32.o plus 0+0+0 (far is header-only)
vqsort: 277+8+0 for int32.o plus 266420+1224+24 for vqsort_i32a.cc.o plus some small support routines for dispatch etc.
vxsort: 23483+0+0 for int32.o plus 4096+0+0 for avx2_masks.cpp.o
x86simdsort: 60+0+0 for int32.o plus 3951712+8+0 for x86simdsort-avx2.cpp.o (all data types in the library are in a single .o file) plus some small support routines

An application with many calls to a big-header library such as far or vxsort will incur many copies of the first number, but can avoid this by instead calling an intermediate function such as sortint32, so the overall code size includes only one copy of that number.

Compiled AVX2 code size for a simple all32.cc that defines sortint32, sortuint32, and sortfloat32 (text+data+bss for all32.o, plus text+data+bss for the relevant *.o files in the library):

djbsort: 128+0+0 for all32.o plus 25050+0+0 for int32_avx2_C0-sort.o plus 445+0+0 for uint32_avx2useint32_C0-sort.o plus 493+0+0 for float32_avx2useint32_C0-sort.o plus 3960+0+0 for portable fallback and dispatch (dispatch_auto-sort_int32.o, dispatch_auto-sort_uint32.o, dispatch_auto-sort_float32.o, int32_portable4_C3-sort.o, uint32_useint32_C3-sort.o, float32_useint32_C3-sort.o, compilers*.o)
far: not applicable
vqsort: 677+8+0 for all32.o plus 266420+1224+24 for vqsort_i32a.cc.o plus 279990+1224+24 for vqsort_u32a.cc.o plus 244433+1224+24 for vqsort_f32a.cc.o plus some small support routines
vxsort: 53557+0+0 for all32.o plus 4096+0+0 for avx2_masks.cpp.o
x86simdsort: 132+0+0 for all32.o plus 3951712+8+0 for x86simdsort-avx2.cpp.o plus some small support routines

The uint64 and float64 functions in djbsort are similarly small wrappers around int64, which is separate code from int32.

Version: This is version 2026.02.09 of the "Comparison" web page.