Version 20180729: browse
Rewrite of the core
for (1) higher speed and (2) reduced memory consumption.
Stack allocation is now at most a few kilobytes,
even for gigantic arrays.
Internally, the sorting algorithm is now mostly bitonic to simplify indexing, although odd-even speedups are still applied when convenient. Lanes are complemented to take the down-up decision out of the inner loops.
As in previous djbsort versions, data is sorted first in vector lanes and then transposed for final merges, reducing the overall number of vector permutations. Unlike previous versions, transposition is done in-place. The transposition in this version is bit-reversal on the outer 6 bits (bottom 3 bits and the top 3 bits), but leaves intermediate bits alone. Non-power-of-2 array sizes are handled by an extra, more traditional, merge step.
Sizes 2, 3, 4, 5, 6, 7, 8, 16, 32 are now special-cased. Non-power-of-2 sizes below 256 are padded to the next power of 2.
are now gone;
float32_sort is now supported.
The arithmetic in the reduction from
int32 31-bit right shift,
uint32 1-bit right shift, xor;
this is slightly more efficient
than the reduction from
from 2001 Herf.
Tests now have more variation (without much slowdown):
uint32 test cases
now deviate from
int32 in more than the sign;
float32 uses floating-point numbers that aren't integers;
int32 does more loops for small cases,
and some larger cases.
API for 2-input sorting is now
operating on two inputs in place.
Better inline assembly from Jason Donenfeld for 2-input sorting: more flexibility in compiler's register allocation.
The package version number is now automatically copied to
as the implementation version number
for implementations that don't provide
minmax now supports more peephole optimizations
for complemented bitonic sorting and for padding:
...[high-pos:low-pos] when possible;
Reverse(s[...]) when possible;
when one input is the minimum or maximum constant.
verifymany now includes the implementation version number
Version 20180717: browse
uint32_sort is now supported, joining
uint32_sort simply flips top bits and calls
int32_sort code and integrating the flips into the initial and final passes
would be noticeably faster.
int32 code to handle
uint32 directly, without flips,
would be noticeably faster on platforms that have all relevant
However, the separate flip has the virtue of minimizing the code size for the library.)
.h files should work from C++ now. (Not tested yet.)
Compiling and benchmarking
./do now finishes
int32-speed now prints
cycle counts for some intermediate sizes;
and cycles per byte.
The default compiler list is now revamped, and shorter.
There is now a unified internal API for 2-input sorting.
This API has the following interchangeable implementations:
cmovg in assembly);
presumably more later.
These implementations are now shared by the higher-level sorting code.
verifymany now prints a "
verified" line for each successful verification.
Verification is now flexible enough
to handle the
at least compiled on amd64
with typical compiler options.
Z3 is no longer asked
to simplify symbolic expressions.
All necessary simplifications
are handled by peephole optimizations
or in patches from djbsort to angr's claripy).
Added peephole optimizations in
More operations supported in input DSL
any number of inputs to concatenation.
Reduced redundancy in
minmax input grammar.
Some work on cleaning up DSL syntax.
Version 20180710: browse
Version: This is version 2018.07.29 of the "Changes" web page.