Lomuto's Comeback (dlang.org)
288 points by VHRanger on May 17, 2020 | hide | past | favorite | 98 comments


There's an even better branch-free (super scalar) sorting algorithm: "In-place Parallel Super Scalar Samplesort (IPS4o)" which we started using:

https://github.com/SaschaWitt/ips4o

https://arxiv.org/abs/1705.02257

As an example, to sort 10 million random longs on my computer it takes std::sort 766 ms (roughly in line with Andrei's numbers) and ips4o::sort takes 274 ms.

[edit:formatting]


2x-3x improvement doesn't seem like much for something that runs in parallel.


The 274 ms is for the sequential version of the algorithm (on my i7-9750 laptop).

On my office Xeon E5-2690 machine when using multiple threads the runtime decreases like this

840 ms for std::sort

372 ms for IPS4o sequentially

201 ms for 2 threads

104 ms for 4 threads

53 ms for 8 threads

33 ms for 16 threads


So starting off better on a single thread, and then scaling well with more threads. Nice!


That's really interesting!

How does it behave with almost-sorted data?


There's some logic to detect if the data is already sorted or reverse sorted, but I think that's only triggered at the very beginning and not at every recursion level. If the data is almost sorted it seems to become a little bit faster. Take a look at the appendix of the paper where there are timings for various distributions.

There's also some logic to handle the case where there are a lot of equal elements which also results in faster performance than the completely random case.


You might want to mention if you are affiliated with the algorithm you are promoting. ipsofacto ~ ips4o


Or maybe the throwaway was named after the comment it was intended to be used to post.


No affiliation, just picked the name because it fit... The company I work at was exploring various sort algorithms to use and this turned out to be the fastest.

Another benefit of this algorithm is that it parallelizes extremely well.


Hoare partitioning can also be implemented in a branchless manner, my algorithm pdqsort does this: https://github.com/orlp/pdqsort


Thanks! I tried that too, but it's slower than Lomuto. Forgot to mention in the article.


> I tried that too, but it's slower than Lomuto.

That is very strange. I cannot reproduce your results in C++, using your code.

On my machine (Threadripper 2950x, 64 bit Windows 10, GCC 10.1.0 from MSYS2 MinGW64) my algorithm performs best and your branchless version ends up slower than your branchy one:

    $ g++ -std=c++17 -O3 -DNDEBUG -DLOMUTO lomuto.cpp && a 10000000
    min_milliseconds=828.1250
    median_milliseconds=890.6250

    $ g++ -std=c++17 -O3 -DNDEBUG -DLOMUTO_BRANCHY lomuto.cpp && a 10000000
    min_milliseconds=671.8750
    median_milliseconds=671.8750

    $ g++ -std=c++17 -O3 -DNDEBUG -DPDQSORT lomuto.cpp && a 10000000
    min_milliseconds=343.7500
    median_milliseconds=343.7500
On an Intel university machine (Xeon E5-2667 v2 @ 3.3GHz, 64 bit Ubuntu 16.04 LTS, GCC 5.4.0) I get:

    $ g++ -std=c++17 -O3 -DNDEBUG -DLOMUTO lomuto.cpp && ./a.out 10000000
    min_milliseconds=514.8051
    median_milliseconds=536.1984

    $ g++ -std=c++17 -O3 -DNDEBUG -DLOMUTO_BRANCHY lomuto.cpp && ./a.out 10000000
    min_milliseconds=688.2471
    median_milliseconds=698.9603

    $ g++ -std=c++17 -O3 -DNDEBUG -DPDQSORT lomuto.cpp && ./a.out 10000000
    min_milliseconds=340.1935
    median_milliseconds=344.7432
Maybe AMD or Windows doesn't like your branchless code or perhaps GCC is generating inferior code for AMD/Windows, as at least on Intel/Linux your Lomuto branchless code can beat the branchy code. But pdqsort (which uses branchless Hoare partitioning) consistently beats both.

I didn't change the benchmark, I only added pdqsort to yours, my full changes are the simple inclusion of 1 header and 3 lines of code: https://github.com/orlp/lomuto/commit/29381176f49e3588f6882c...

In order for any interested readers to reproduce, simply clone https://github.com/orlp/lomuto and run the above commands.


Notice that this is on random longs. Of course branch misprediction and memory bandwidth are going to crush you for a branchy sort... you'll be wrong about half the time, and the comparisons and swapping are trivial! Real world data isn't random, so I'd expect branch predictors to do much better at recognizing patterns. And your sorting isn't going to be as simple as sorting integers according to their values all that often. It's a neat exercise, but absent more realistic benchmarking to the contrary, I wouldn't assume I'll get a faster program by just substituting even a typical quicksort with this...


Sorting is the case study but not the insight. I think the cool/surprising part of the article is that doing more work can be faster... Even more memory work (which per conventional wisdom is not trivial!)


>> Sorting is the case study but not the insight. I think the cool/surprising part of the article is that doing more work can be faster

Yeah, I figured heap sort should be the standard because it has a worst case O(n*log(n)) complexity. But it does tend to do more swaps (a smarter implementation can avoid many writes), and the memory access is very cache-unfriendly. In practice it's just not used.


My comment wasn't inherently about sorting either. Branch-free in general is faster in the worst cases, i.e. if the branch is difficult to predict. Except most real-world branches are predictable... that's why we have branch predictors. So in general you should expect branch-free to be slower, unless you have reason to assume your branches are actually close to random.


> unless you have reason to assume your branches are actually close to random.

I mean, to the degree that you need to sort the data at all, it contains informational entropy.

Many "presents as sorted" data structures trade space for time by keeping track of the internal sortedness of chunks of the data, between sortation passes. Heck, any insert-optimized data structure (B+ trees; LevelDB's sorted-string-tables; N-ary tries generally) will get most of its performance gains by being able to guarantee, at node split time, that the children of any particular node are sorted relative to one-another.

So, in many cases, by the very fact that you don't already have metadata attached to your data asserting that it's sorted, it's highly likely to have no branch-predictability.

(Couple that to the fact that load-balancing in OLTP distributed systems favors writes to evenly-keyspace-distributed partitioning keys, ala Dynamo, and you'll frequently find that your inputs to an index-like data structure can be guaranteed random.)


I'm struggling to follow your comment. I said real-world data have patterns and aren't completely random, so you can in general expect branch predictors to help (unless, obviously, your data is actually known to be random like in the example). You rebutted with "because load balancers evenly spread their key space, you will frequently find that your inputs can be guaranteed random", as if that somehow contradicts the point I was making...? It's also clearly not true that it's "highly likely" your data has "no branch-predictability" just because "you don't already have metadata attached to it asserting it's sorted"... right? That's not only both obviously not the case on its own, but also, why would you even sort data that's already guaranteed to be sorted? I can't make sense of this, so I feel like we must be speaking past each other somehow? But I'm confused how.


The insight I might have not made clear, is that if you have “this node and its children are already sorted” metadata, then it’s cheaper to check whether a given (fine-grained) input chunk is already sorted—and if so, to directly construct a node from it—than it is to feed it blindly to the sorting algorithm.

If such a pre-check is in play, then the sorting algorithm itself will basically never have the sort of “runs” of sorted values that make some algorithms cheaper. Those “runs” were already plucked out and turned into nodes!


This. Thanks!


Branches also can't be vectorized, so going branch free can lead to a substantial speedup if you can use vector instructions (you might be doing twice the work, but your vector instruction may let you do it 4-16 times faster).


Indeed, AVX512 instructions can sort 16 int32 or well-behaved float values optimally, with no branches. It is a good final pass to any other algorithm.


You mean sorting within a vector register? Can you provide a link or something? I can't find info on this


The keyword to look for is "sorting network". See, for example, http://cs.brandeis.edu/~hugues/sorting_networks.html.

There is a 28-step 16-input network there. It admits a direct translation to AVX-512 instructions.

What is usually considered minimal differs from what is best for AVX512. AVX512 is happy to do 16 comparisons at each stage; its efficiency is limited only by the number of stages. Thus, a network that did more comparisons in fewer stages would be faster.



Thanks. It's kind of sad that there's no single instruction for horizontally sorting in vectors of integers. Horizontal merge/reduce would also be cool


True! Though I have yet to see vectorized sorting in production...


(Author here.) That is incorrect. The difficult case, and the real benchmark, is with unpredictable data. The low-entropy cases will be about as fast with both partitioning schemes.

This is only a small part of the benchmarks I've run because I wanted to drive one point home within a limited space. I've run many tests on various data types and shapes, and Lomuto does better than Hoare on most. (E.g. its improvement on double is even larger than on integrals.)


I'm confused, what exactly is incorrect? I said if we don't have more realistic benchmarks then I wouldn't just assume it's faster. Now you're saying you did in fact do more realistic benchmarks, and you found lower entropy ones to exhibit the same performance (which is not faster). They all seem consistent with each other?

I also said that sorting is often more complicated than just comparing integers by their values, which you didn't really address except to mention doubles, which missed the point I was making (I was trying to say sorting often needs e.g. satellite information).

In any case though, if you've done other benchmarks, it'd be nice if you could post them on GitHub or something. Maybe I'm missing something.


Maybe?

Lots of people replaced their sorting algorithms with Tim-Sort because runs of already-sorted data are empirically common (at least in some workloads). I've seen it myself, in use-cases where sort keys change slightly from iteration to iteration in some larger algorithm, and sort items need to be dynamically kept in order. It no doubt also happens when grabbing data from multiple places, each of which is theoretically ordered arbitrarily but is in practice ordered predictably.

So yeah, the author picked the use-cases where this algorithm is going to show in its best light. But that's not bad either, right? Data is often not well-ordered -- if you're counting occurrences of words, and want to get an ordered list (and never mind that you wouldn't use a comparison-based sort for that) you wouldn't expect much order. Likewise sorting any keys/values from a decent hash table.

One worry that I would have applies specifically to C++ and D: sometimes values are big, and copies are expensive. But I guess users can always choose to sort pointers like other languages do, if the standard library uses a copy-heavy, branch-light algorithm.


I think one example of extremely unsorted data is random data in Monte Carlo simulations, in particular if you need to compute statistical summaries that need the data to be sorted - e.g. percentiles. So you may end up repeatedly sorting million-long random arrays many times a second.

Then again, maybe there are algorithms letting you compute percentiles without sorting?


If you need k percentile values then I think you could do it in O(n log(k)) with introselect. You most definitely don't need O(n log(n)).


> Then again, maybe there are algorithms letting you compute percentiles without sorting?

I think you may be able to do so probabilistically, which is often good enough? Additionally, one observation about percentiles is that you don't need to sort all of the data; you can just hold on to the top N results in something like a Heap.


Extracting percentile information doesn't require sorting. Even if you're too lazy to implement anything custom, a few invocations of std::nth_element will likely outperform sorting: https://en.cppreference.com/w/cpp/algorithm/nth_element


Yeah, you can histogram the values if you just need an approximate answer (or if you have few enough discrete values, you can just have a bin per value). Once you have an approximate answer, if you need a more precise answer, you can go through the data again and only consider values within the right bin (which hopefully will be a lot fewer, if you've chosen your bins wisely) and then use something like std::nth_element.


I used Lomuto's algorithm in my presentation of Quicksort in http://canonical.org/~kragen/sw/dev3/paperalgo#addtoc_21 because my emphasis there is on using the simplest possible algorithms. It didn't occur to me that it might actually be faster, or be capable of being made faster, and indeed in that page I said you could improve efficiency with Hoare partitioning.

Alexandrescu's branch-free code in this article is a thing of beauty, in the same sort of rugged way that Stepanov's code is.


I love this sort of "Historical Fiction" where interactions and conversations between The Great People of Science are imagined and portrayed. A great example of this is Louisa Gilder's wonderful book _The_Age_Of_Entanglement:_When_Quantum_Physics_was_Reborn

https://www.amazon.com/Age-Entanglement-Quantum-Physics-Rebo...


Thanks for writing this, because I hated it — I read that part thinking it would shed some light on the topic of the paper at large, but it was just noise to me.

Which is not to imply that I am right and you wrong or vice versa! Simply to observe that people are different.

As I liked the article but was annoyed by that beginning, it’s great for me that the first comment I saw was yours. So thanks for making it.


If you enjoy books like that, I recommend The Cambridge Quintet[0]; it's a wonderful book.

0. https://www.amazon.com/Cambridge-Quintet-Scientific-Speculat...


Oh, I assumed that dialogue between Lomuto and Hoare was real -- at least the gist of it, if not the exact words.


The interaction was probably real, since there is a date and location, but the dialog (and certainly the narration) seems to be taking some artistic liberties perhaps? This is also what Louisa did in her book on the history of quantum mechanics; real interactions with imagined dialogs. My apologies if I gave the wrong impression.


Anthony Zee's Quantum Field Theory textbook takes a similar tone.


To anyone interested in a deeper treatment of quicksort and variants thereof, I can recommend Sebastian Wild's PhD thesis on that topic: https://kluedo.ub.uni-kl.de/frontdoor/deliver/index/docId/44...


Off topic: Is it common practice for a dissertation at a German university to be written in English?


Depends on the field, but it is very common in computer science and generally in natural sciences. Studying computer science, I don’t think I ever even wrote as much as a paper in German; everything’s English.


Yes - to ensure the widest possible audience. English is the lingua franca of academia and has been for almost a century now.


At least in MINT (the German equivalent of STEM) fields, probably less so in the humanities. Correct me if I'm wrong.


You're right. German economics is famously insular, for example. Part of that is cultural, part of it is that they don't write as much in English.


It is also common in many fields in Switzerland, Italy, Spain, ... Newton's Principia, for example, was written in Latin and not in 17th-century English.


In physics it certainly is. As far as I know, pretty much all scientific conferences and journals are in English these days. Even journals like JETP (https://en.wikipedia.org/wiki/Journal_of_Experimental_and_Th...) became joint English and Russian in the 50's!


Hardly anyone is going to read, let alone cite your work if you were to publish in German. I'm not even aware of any German-speaking conferences.


I’ve read that some vector instruction sets use masking instructions to avoid branching. You execute both arms of the conditional and discard the unwanted computation, which can be cheaper than an X% chance of a branch misprediction.

Should CPUs include more masking instructions for regular, non-vectorized code as well?


32-bit ARM assembly language provided conditional execution for most instructions. This was dropped in the 64-bit instruction set. (https://en.wikipedia.org/wiki/Predication_(computer_architec..., https://en.wikipedia.org/wiki/ARM_architecture#64-bit) I imagine the architects had a clear picture of the advantages and disadvantages, and made a very well-informed decision. I guess that part of the reason might have been that it's not clear when compilers should emit predicated code instead of branches.

Also, you can get something very similar even for scalar code as long as your machine has a conditional move operation, which they all have:

    /* true path */
    a = ...
    b = ...
    true_result = ...
    /* false path */
    x = ...
    y = ...
    false_result = ...
    /* final result */
    result = (condition ? true_result : false_result)
This works if the two "branches" have no side effects (memory writes, function calls) and cannot raise exceptions. Again, it's very difficult to estimate for compilers (and humans) if it will pay off to run both computations rather than branch.

The trade-offs for vectorization are different from scalar code since vectorizing a loop by a factor N gives a huge win, and even if you waste some time doing some redundant computations, you still have a reasonable chance of being faster than scalar.


You can get close to the performance of a cmov instruction by generating a pair of all-1s and all-0s values: a=c, b=c-1, and a result z=a&x|b&y.

Gcc will not under any circumstances produce two cmov instructions in a basic block, so that is your only alternative without dropping to asm. Clang is happy to produce two adjacent cmov instructions. Usually your ALUs are not otherwise so engaged as to make the number of operations involved costly.


I don't get this. If c can be 0 or 1, then a can be 0 or 1, so it can be all-0s but not all-1s. b can be -1 or 0, so that part is fine. Can you show actual C or C++ code for a function like this:

    int select(int x, int y, bool condition) {
        ...
    }


Sorry, brain freeze. See

https://news.ycombinator.com/item?id=23237663

The correct bitwise expression is -c&x|(c-1)&y.


Oh, yes, I should have realized that -c is enough to transform from 0/1 to 0/-1. Thanks!


I should add that Clang is happy to turn c&x|(c-1)&y into cmov.


I’m unconvinced because the instruction set sizes are already becoming fairly bloated (increasing microcode translation cost and i-cache pressure) but also because CPUs already do speculative execution which has a nearly equivalent effect to the idiom you’re referring to.

*edit: wording


To be more precise, some CPUs do the "both sides of the conditional" execution (EDIT This link is incorrect! Refer to "Eager execution" in the last link below. /EDIT https://en.wikipedia.org/wiki/Eager_evaluation) - years ago I heard that mobile processors do so for power reasons. No idea if that's still the case.

The problem in question is due to predictive execution in which the CPU has to flush the pipeline on misprediction. This is dominant in desktop PC CPUs. I have no knowledge on what is prevalent in e.g. server or mobile CPUs.

Both are (according to https://en.wikipedia.org/wiki/Speculative_execution) forms of speculative execution.


I'm speaking broadly about desktop/console CPUs which I have low-level familiarity with. I don't know of any CPUs in this class that don't perform some form of speculative execution.

I'm not sure if eager evaluation is related to the topic at hand and was speaking specifically about speculative execution. Eager evaluation is just what pretty much every language except Haskell (or Haskell-like) does.


Sorry, that first link was nonsense. I just copied the wiki link under "Eager execution" in the Speculative Execution page, but that is of course far from the same as "eager evaluation" (which is what the link goes to). I apologize for the confusion.

The point is that there are at least two ways CPUs can speculatively execute a conditional: Either predict which branch is taken and flush the pipeline if the prediction was wrong, or execute both arms of the conditional and discard the one that was not actually taken. The top-level SIMD comment is about the latter, but most optimization around speculative execution is for the former.

Yes, CPUs do speculative execution. The "flush pipeline on mispredict" kind ("predictive execution"). That is not the same as the "execute both and discard the untaken one" kind ("eager execution"). Your first reply suggests you consider them equivalent when they are not. I just wanted to clear that up.


Ah OK I understand now thanks for clearing that up. I meant that the end effect was similar but not so much that the mechanism was identical.


If you define

  inline bool swap_if(
    bool c, long& a, long& b)
  {
    long ta = a, tb = b;
    a = c ? tb : ta;
    b = c ? ta : tb;
    return c;
  }
and then use it in a partition like

   right += swap_if(
     *left < pivot,
     *left, *right);
compiled with Clang (which generates `cmov` instructions), the Hoare partition is still faster, and more than twice as fast as `std::sort`.

Using `swap_if` in Lomuto is also faster, but not as much faster. I interpret the difference (vs. array ops) as resulting from reduced L1 bus traffic.

A fully general swap_if,

  template <typename T>
  bool swap_if(
    bool c, T& a, T& b)
  {
    T v[2] = { a, b };
    b = v[c], a = v[1-c];
    return c;
  }
could and IMHO should be peephole-optimized to use a pair of `cmov` instructions, but is not in Clang, Gcc, Icc, or MSVC. But even without such an optimization, it makes Quicksort much faster.

(Gcc, incidentally, is very, very sensitive to details of the second swap_if. Change the order of assigning a and b, or use `!c` in place of `1-c`, and it gets much slower, on Intel, for very non-(to me-)obvious reasons. Gcc also will never, ever produce two `cmov` instructions in a basic block. I have filed a bug.)

If `swap_if` were in the Standard Library, it would probably be implemented optimally on all compilers, and almost half of the Standard algorithms could use it to get, often, ~2x performance.


Writing branch-free code (which I find myself sometimes doing to take advantage of simd vectorization) is always really painful. I wonder if someone has made an attempt at better tooling for this.


One thing people have made is array languages. The "primitive" operations apply to arrays of data, so branches around individual elements can't be expressed.

I think it's also generally less expressive, though. Writing Lomuto's partition as efficiently would likely be impossible, but writing it branch-free might not be -- you would compose branch-free primitives like

    left = filter(array, array < array[0])
    right = filter(array, array > array[0])
where `filter` is branch-free and takes an array of Boolean values as its second argument.

Shrugs, aside from some standard set of functions/primitives and idioms I'm not sure it would help much, though perhaps a mindset-shift is a more likely solution than something more technical.


This sort of vectorised pseudo branching is often used in SIMD algorithms


Yes... and the transformations are often non-trivial (especially when considering the possibility of out-of-bounds elements, which can't just safely be multiplied by zero--they might be a NaN, or on the edge of a page). It would be nice if there was a tool (or compiler option!) that could convert my branchy code into an "identical" (assuming floating-point associativity, etc.) branchless version.


The Eigen Array [1] object is fairly close to this. Combining the 'select' with comparisons, chained operators etc. Eigen vectorized stuff pretty effectively too.

1. https://eigen.tuxfamily.org/dox/group__TutorialArrayClass.ht...


Interesting... I've used Eigen for matrix operations before, I did not know they had such a smart general-purpose array implementation!


I tend to use the same method as the author has presented.

1. Write your branchy code

2. Massage the branches until their code is identical (testing frequently)

3. Trim a useless branch, and eat a piece of chocolate


I don’t mind boasting that I’m a world class talent at part 2 of step 3


Reminds me of this great article on writing a fast mergesort:

http://pvk.ca/Blog/2012/08/13/engineering-a-list-merge-sort/


If you have a branch free algorithm, you can vectorize it. If he discussed implementing this with SIMD instructions (whether by human hand or compiler auto vectorization) I missed that part. But that seems an interesting angle to me.


Not necessarily. A branch-free algorithm can totally have a data dependency from the previous iteration of the loop to the next, making it impossible to vectorize.


That's true, although there are sometimes workarounds for that by modifying the algorithm to iterate over a small groups where there is still a data-dependency between iterations, but not within each group, allowing one to use SIMD. e.g. instead of delta compression over an array of integers where you subtract each integer from the prior one, you can subtract each group from the prior one with only a small loss of compression ratio.

The question is, is there such a dependency in this algorithm? I didn't check.


Yes, sorting typically (and this one specifically) has long dependency chains. The natural reflex is to split the chains across multiple workers... and that's how you get merge-sort. But even the last pass has a linear dependence chain. Not long ago, somebody posted a clever algorithm based on 4-way swaps, which is a clever way of addressing the issue.

Edit: I can't seem to locate that 4-way swapping algorithm, I swear it was posted to HN in the last year... somebody halp!


You might be thinking of this: https://news.ycombinator.com/item?id=22322967


Bingo, thanks!


Reminds me of this paper on binary search, which also shows a speed up using branch-free comparisons.

"Array Layouts for Comparison-Based Searching" https://arxiv.org/pdf/1509.05053.pdf


So what's the impact of the data dependency chain on "first -= smaller" to the next iteration?

(I guess at the very least it prevents vectorization, but other than that ... shouldn't be too much, but probably still the new limiting factor?)


https://dlang.org/blog/2020/05/14/lomutos-comeback/ gives a 404 error. And archive.org ( http://web.archive.org/cdx/search/cdx?url=https://dlang.org/... ) is also getting nothing but 404s, so it's not a problem on my end. I'm not really sure what's wrong with this site, since other commenters seem not to have noticed a problem.


It's ok now, as far as I can see.


Actually, it's still giving a 404, but (after remembering and looking up the --content-on-error option to wget) it appears TFA is served just fine as if it were a 404 error page.


It is true that more swaps may be cheaper than fewer swaps.

The operations to count, nowadays, are not swaps or comparisons, which are very cheap, but instead pipeline stalls, which are very, very expensive.

It is easy to better-than-double the speed of vanilla quicksort with a three-line change in the partition step.



I'm not sure I've ever sorted an array or list. Rather I put things into sets or maps or a built-in index---I always wanted to access things later, or have an on-line data structure (which a sorted flat thing is not).

This whole field is a waste of time, an anachronism from when the master copy was paper/pre-computer and computers were just doing the analytics.


Your dismissal of “sorted flat thing” as being necessarily not online is incorrect or unnecessarily strict to your detriment.

I wrote this comparing the high-level performance of a “set” (AVL or red-black tree) against just that:

Faster lookups, insertions, and in-order traversal than a red-black or AVL tree [0]

[0]: https://neosmart.net/blog/2019/sorted-list-vs-binary-search-...

(Spoiler: it depends on the nature of your data, but sorted vector can be dramatically faster.)


To be clear, I am not against "small sorted flat thing" as an optimization for smaller data structures.

My point is I care about the map or set interface. I don't use temporary sets or quicksort to sort....because I don't sort---I don't want a list/array of things in order, basically ever.


Yes, and what I showed you is a set with a set interface, internally implemented as an always-sorted list.


OK, sorry. Yes, creating a map/set using the small-size optimization is internally doing a sort.


Putting things in a map or set is almost invariably slower, often much slower, than a smart sort. If performance doesn't matter, then go ahead. Most often sort performance doesn't matter, or anyway most of the time is spent elsewhere. People do use Python or Bash in production, without shame.

But some of us work on problems where performance of the sort does matter.


Sure, it is slower, but my point is not that I sort by doing `Map.toList . Map.fromList` but that I don't immediately turn the map back into a list. The sorting literature is optimizing a thing I don't need to do.


Just because you’re not using it directly doesn’t mean that you’re not benefiting from those optimization in those layers you are depending on.


I'm not using it directly or indirectly. The whole point of this sorting literature is a space-time tradeoff that's completely different from when you want an ordered map/index as the end result.


Not really true, putting things in a hash map is pretty fast, and generally faster than a comparison based sort.


A hash map is unsorted so it's not a useful way to sort data


Very true. I like rich keys with non-trivial comparison cost. I like nested maps even better; not only are they often useful in their own right, they also serve to cache the cost of comparison w.r.t. the original map---a domain-specific trie.

Maps are just far more the bread and butter of programming than sorting, and not just if you are using one of the languages satirized by https://elbenshira.com/blog/the-universal-data-structure/



