“Vectorized and Performance-Portable Quicksort”, Mark Blacher, Joachim Giesen, Peter Sanders, Jan Wassenberg2022-05-12 ()⁠:

[blog] Recent works showed that implementations of Quicksort using vector CPU instructions can outperform the non-vectorized algorithms in widespread use. However, these implementations are typically single-threaded, implemented for a particular instruction set, and restricted to a small set of key types.

We lift these 3 restrictions: our proposed vqsort algorithm integrates into the state-of-the-art parallel sorter ips4o, with a geometric mean speedup of 1.59. The same implementation works on 7 instruction sets (including SVE and RISC-V V) across 4 platforms. It also supports floating-point and 16–128 bit integer keys.

To the best of our knowledge, this is the fastest sort for non-tuple keys on CPUs, up to 20× as fast as the sorting algorithms implemented in standard libraries.

This paper focuses on the practical engineering aspects enabling the speed and portability, which we have not yet seen demonstrated for a Quicksort implementation. Furthermore, we introduce compact and transpose-free sorting networks for in-register sorting of small arrays, and a vector-friendly pivot sampling strategy that is robust against adversarial input.