djbsort

Version 20180729: browse djbsort-20180729.tar.gz

Algorithm

Rewrite of the core int32/avx2 implementation for (1) higher speed and (2) reduced memory consumption. Stack allocation is now at most a few kilobytes, even for gigantic arrays.

Internally, the sorting algorithm is now mostly bitonic to simplify indexing, although odd-even speedups are still applied when convenient. Lanes are complemented to take the down-up decision out of the inner loops.

As in previous djbsort versions, data is sorted first in vector lanes and then transposed for final merges, reducing the overall number of vector permutations. Unlike previous versions, transposition is done in-place. The transposition in this version is bit-reversal on the outer 6 bits (bottom 3 bits and the top 3 bits), but leaves intermediate bits alone. Non-power-of-2 array sizes are handled by an extra, more traditional, merge step.

Sizes 2, 3, 4, 5, 6, 7, 8, 16, 32 are now special-cased. Non-power-of-2 sizes below 256 are padded to the next power of 2.

Portable implementations: The out-of-place int32/portable1 and int32/portable2 implementations are now gone; the in-place int32/portable3 and int32/portable4 implementations remain.

C API

float32_sort is now supported. The arithmetic in the reduction from float32 to int32 is int32 31-bit right shift, uint32 1-bit right shift, xor; this is slightly more efficient than the reduction from float32 to uint32 from 2001 Herf.

Compiling

Tests now have more variation (without much slowdown): the uint32 test cases now deviate from int32 in more than the sign; float32 uses floating-point numbers that aren't integers; int32 does more loops for small cases, and some larger cases.

Internals

API for 2-input sorting is now MINMAX macro operating on two inputs in place.

Better inline assembly from Jason Donenfeld for 2-input sorting: more flexibility in compiler's register allocation.

The package version number is now automatically copied to version.c as the implementation version number for implementations that don't provide version.c.

Verification

minmax now supports more peephole optimizations for complemented bitonic sorting and for padding: xor(s,xor(s,t))t; xor(-1,s)invert(s); Reverse(Reverse(s))s; signedmin(invert(s),invert(t))invert(signedmax(s,t)); signedmax(invert(s),invert(t))invert(signedmin(s,t)); invert(s)[high:low]invert(s[high:low]); s[bits-1:0]s; s[high:low][high2:low2]s[high2+low:low2+low]; Concat(...)[high:low]...[high-pos:low-pos] when possible; Reverse(s)[high:low]Reverse(s[...]) when possible; eliminate signedmin/signedmax when one input is the minimum or maximum constant.

verifymany now includes the implementation version number on verified lines.

Version 20180717: browse djbsort-20180717.tar.gz

C API

uint32_sort is now supported, joining int32_sort. (Internally, uint32_sort simply flips top bits and calls int32_sort. Inlining the int32_sort code and integrating the flips into the initial and final passes would be noticeably faster. Adapting the int32 code to handle uint32 directly, without flips, would be noticeably faster on platforms that have all relevant uint32 instructions. However, the separate flip has the virtue of minimizing the code size for the library.)

The .h files should work from C++ now. (Not tested yet.)

Compiling and benchmarking

./do now finishes by running int32-speed.

int32-speed now prints cycle-count overhead; cycle counts for some intermediate sizes; and cycles per byte.

The default compiler list is now revamped, and shorter.

Internals

There is now a unified internal API for 2-input sorting. This API has the following interchangeable implementations: int32_minmax.c (portable); int32_minmax_x86.c (using cmovg in assembly); presumably more later. These implementations are now shared by the higher-level sorting code.

Verification

verifymany now prints a "verified" line for each successful verification.

Verification is now flexible enough to handle the portable implementations, at least compiled on amd64 with typical compiler options.

Internally, Z3 is no longer asked to simplify symbolic expressions. All necessary simplifications are handled by peephole optimizations (in djbsort's minmax, or in patches from djbsort to angr's claripy). Added peephole optimizations in minmax: If(c,constbit1,constbit0)c; xor(c,constbit1)invert(c); equal(c,bit0)invert(c); invert(invert(c))c.

More operations supported in input DSL for minmax and tryinput: xor; or; and; add; sub; mul; equal; signedlt; signedrshift; any number of inputs to concatenation. Reduced redundancy in minmax input grammar. Some work on cleaning up DSL syntax.

Version 20180710: browse djbsort-20180710.tar.gz

Original release.


Version: This is version 2018.07.29 of the "Changes" web page.