Lomuto's Comeback (dlang.org)
288 points by VHRanger on May 17, 2020 | hide | past | favorite | 98 comments


There's an even better branch-free (super scalar) sorting algorithm: "In-place Parallel Super Scalar Samplesort (IPS4o)" which we started using:

https://github.com/SaschaWitt/ips4o

https://arxiv.org/abs/1705.02257

As an example, to sort 10 million random longs on my computer it takes std::sort 766 ms (roughly in line with Andrei's numbers) and ips4o::sort takes 274 ms.

[edit:formatting]


2x-3x improvement doesn't seem like much for something that runs in parallel.


The 274 ms is for the sequential version of the algorithm (on my i7-9750 laptop).

On my office Xeon E5-2690 machine when using multiple threads the runtime decreases like this

840 ms for std::sort

372 ms for IPS4o sequentially

201 ms for 2 threads

104 ms for 4 threads

53 ms for 8 threads

33 ms for 16 threads


So starting off better on a single thread, and then scaling well with more threads. Nice!


That's really interesting!

How does it behave with almost-sorted data?


There's some logic to detect if the data is already sorted or reverse sorted, but I think that's only triggered at the very beginning and not at every recursion level. If the data is almost sorted it seems to become a little bit faster. Take a look at the appendix of the paper where there are timings for various distributions.

There's also some logic to handle the case where there are a lot of equal elements which also results in faster performance than the completely random case.


You might want to mention if you are affiliated with the algorithm you are promoting. ipsofacto ~ ips4o


Or maybe the throwaway was named after the comment it was intended to be used to post.


No affiliation, just picked the name because it fit... The company I work at was exploring various sort algorithms to use and this turned out to be the fastest.

Another benefit of this algorithm is that it parallelizes extremely well.


Hoare partitioning can also be implemented in a branchless manner, my algorithm pdqsort does this: https://github.com/orlp/pdqsort


Thanks! I tried that too, but it's slower than Lomuto. Forgot to mention in the article.


> I tried that too, but it's slower than Lomuto.

That is very strange. I cannot reproduce your results in C++, using your code.

On my machine (Threadripper 2950x, 64 bit Windows 10, GCC 10.1.0 from MSYS2 MinGW64) my algorithm performs best and your branchless version ends up slower than your branchy one:

    $ g++ -std=c++17 -O3 -DNDEBUG -DLOMUTO lomuto.cpp && a 10000000
    min_milliseconds=828.1250
    median_milliseconds=890.6250

    $ g++ -std=c++17 -O3 -DNDEBUG -DLOMUTO_BRANCHY lomuto.cpp && a 10000000
    min_milliseconds=671.8750
    median_milliseconds=671.8750

    $ g++ -std=c++17 -O3 -DNDEBUG -DPDQSORT lomuto.cpp && a 10000000
    min_milliseconds=343.7500
    median_milliseconds=343.7500
On an Intel university machine (Xeon E5-2667 v2 @ 3.3GHz, 64 bit Ubuntu 16.04 LTS, GCC 5.4.0) I get:

    $ g++ -std=c++17 -O3 -DNDEBUG -DLOMUTO lomuto.cpp && ./a.out 10000000
    min_milliseconds=514.8051
    median_milliseconds=536.1984

    $ g++ -std=c++17 -O3 -DNDEBUG -DLOMUTO_BRANCHY lomuto.cpp && ./a.out 10000000
    min_milliseconds=688.2471
    median_milliseconds=698.9603

    $ g++ -std=c++17 -O3 -DNDEBUG -DPDQSORT lomuto.cpp && ./a.out 10000000
    min_milliseconds=340.1935
    median_milliseconds=344.7432
Maybe AMD or Windows doesn't like your branchless code or perhaps GCC is generating inferior code for AMD/Windows, as at least on Intel/Linux your Lomuto branchless code can beat the branchy code. But pdqsort (which uses branchless Hoare partitioning) consistently beats both.

I didn't change the benchmark, I only added pdqsort to yours, my full changes are the simple inclusion of 1 header and 3 lines of code: https://github.com/orlp/lomuto/commit/29381176f49e3588f6882c...

In order for any interested readers to reproduce, simply clone https://github.com/orlp/lomuto and run the above commands.


Notice that this is on random longs. Of course branch misprediction and memory bandwidth are going to crush you for a branchy sort... you'll be wrong about half the time, and the comparisons and swapping are trivial! Real world data isn't random, so I'd expect branch predictors to do much better at recognizing patterns. And your sorting isn't going to be as simple as sorting integers according to their values all that often. It's a neat exercise, but absent more realistic benchmarking to the contrary, I wouldn't assume I'll get a faster program by just substituting even a typical quicksort with this...


Sorting is the case study but not the insight. I think the cool/surprising part of the article is that doing more work can be faster... Even more memory work (which per conventional wisdom is not trivial!)


>> Sorting is the case study but not the insight. I think the cool/surprising part of the article is that doing more work can be faster

Yeah, I figured heap sort should be the standard because it has a worst case O(n*log(n)) complexity. But it does tend to do more swaps (a smarter implementation can avoid many writes), and the memory access is very cache-unfriendly. In practice it's just not used.


My comment wasn't inherently about sorting either. Branch-free in general is faster in the worst cases, i.e. if the branch is difficult to predict. Except most real-world branches are predictable... that's why we have branch predictors. So in general you should expect branch-free to be slower, unless you have reason to assume your branches are actually close to random.


> unless you have reason to assume your branches are actually close to random.

I mean, to the degree that you need to sort the data at all, it contains informational entropy.

Many "presents as sorted" data structures trade space for time by keeping track of the internal sortedness of chunks of the data, between sortation passes. Heck, any insert-optimized data structure (B+ trees; LevelDB's sorted-string-tables; N-ary tries generally) will get most of its performance gains by being able to guarantee, at node split time, that the children of any particular node are sorted relative to one-another.

So, in many cases, by the very fact that you don't already have metadata attached to your data asserting that it's sorted, it's highly likely to have no branch-predictability.

(Couple that to the fact that load-balancing in OLTP distributed systems favors writes to evenly-keyspace-distributed partitioning keys, ala Dynamo, and you'll frequently find that your inputs to an index-like data structure can be guaranteed random.)


I'm struggling to follow your comment. I said real-world data have patterns and aren't completely random, so you can in general expect branch predictors to help (unless, obviously, your data is actually known to be random like in the example). You rebutted with "because load balancers evenly spread their key space, you will frequently find that your inputs can be guaranteed random", as if that somehow contradicts the point I was making...? It's also clearly not true that it's "highly likely" your data has "no branch-predictability" just because "you don't already have metadata attached to it asserting it's sorted"... right? That's not only both obviously not the case on its own, but also, why would you even sort data that's already guaranteed to be sorted? I can't make sense of this, so I feel like we must be speaking past each other somehow? But I'm confused how.


The insight I might have not made clear, is that if you have “this node and its children are already sorted” metadata, then it’s cheaper to check whether a given (fine-grained) input chunk is already sorted—and if so, to directly construct a node from it—than it is to feed it blindly to the sorting algorithm.

If such a pre-check is in play, then the sorting algorithm itself will basically never have the sort of “runs” of sorted values that make some algorithms cheaper. Those “runs” were already plucked out and turned into nodes!


This. Thanks!


Branches also can't be vectorized, so going branch free can lead to a substantial speedup if you can use vector instructions (you might be doing twice the work, but your vector instruction may let you do it 4-16 times faster).


Indeed, AVX512 instructions can sort 16 int32 or well-behaved float values optimally, with no branches. It is a good final pass to any other algorithm.


You mean sorting within a vector register? Can you provide a link or something? I can't find info on this


The keyword to look for is "sorting network". See, for example, http://cs.brandeis.edu/~hugues/sorting_networks.html.

There is a 28-step 16-input network there. It admits a direct translation to AVX-512 instructions.

What is usually considered minimal differs from what is best for AVX512. AVX512 is happy to do 16 comparisons at each stage; its efficiency is limited only by the number of stages. Thus, a network that did more comparisons in fewer stages would be faster.



Thanks. It's kind of sad that there's no single instruction for horizontally sorting in vectors of integers. Horizontal merge/reduce would also be cool


True! Though I have yet to see vectorized sorting in production...


(Author here.) That is incorrect. The difficult case, and the real benchmark, is with unpredictable data. The low-entropy cases will be about as fast with both partitioning schemes.

This is only a small part of the benchmarks I've run because I wanted to drive one point home within a limited space. I've run many tests on various data types and shapes, and Lomuto does better than Hoare on most. (E.g. its improvement on double is even larger than on integrals.)


I'm confused, what exactly is incorrect? I said if we don't have more realistic benchmarks then I wouldn't just assume it's faster. Now you're saying you did in fact do more realistic benchmarks, and you found lower entropy ones to exhibit the same performance (which is not faster). They all seem consistent with each other?

I also said that sorting is often more complicated than just comparing integers by their values, which you didn't really address except to mention doubles, which missed the point I was making (I was trying to say sorting often needs e.g. satellite information).

In any case though, if you've done other benchmarks, it'd be nice if you could post them on GitHub or something. Maybe I'm missing something.


Maybe?

Lots of people replaced their sorting algorithms with Tim-Sort because runs of already-sorted data are empirically common (at least in some workloads). I've seen it myself, in use-cases where sort keys change slightly from iteration to iteration in some larger algorithm, and sort items need to be dynamically kept in order. It no doubt also happens when grabbing data from multiple places, each of which is theoretically ordered arbitrarily but is in practice ordered predictably.

So yeah, the author picked the use-cases where this algorithm is going to show in its best light. But that's not bad either, right? Data is often not well-ordered -- if you're counting occurrences of words, and want to get an ordered list (and never mind that you wouldn't use a comparison-based sort for that) you wouldn't expect much order. Likewise sorting any keys/values from a decent hash table.

One worry that I would have applies specifically to C++ and D: sometimes values are big, and copies are expensive. But I guess users can always choose to sort pointers like other languages do, if the standard library uses a copy-heavy, branch-light algorithm.


I think one example of extremely unsorted data is random data in Monte Carlo simulations, in particular if you need to compute statistical summaries that need the data to be sorted - e.g. percentiles. So you may end up repeatedly sorting million-long random arrays many times a second.

Then again, maybe there are algorithms letting you compute percentiles without sorting?


If you need k percentile values then I think you could do it in O(n log(k)) with introselect. You most definitely don't need O(n log(n)).


> Then again, maybe there are algorithms letting you compute percentiles without sorting?

I think you may be able to do so probabilistically, which is often good enough? Additionally, one observation about percentiles is that you don't need to sort all of the data; you can just hold on to the top N results in something like a Heap.


Extracting percentile information doesn't require sorting. Even if you're too lazy to implement anything custom, a few invocations of std::nth_element will likely outperform sorting: https://en.cppreference.com/w/cpp/algorithm/nth_element


Yeah, you can histogram the values if you just need an approximate answer (or if you have few enough discrete values, you can just have a bin per value). Once you have an approximate answer, if you need a more precise answer, you can go through the data again and only consider values within the right bin (which hopefully will be a lot fewer, if you've chosen your bins wisely) and then use something like std::nth_element.


I used Lomuto's algorithm in my presentation of Quicksort in http://canonical.org/~kragen/sw/dev3/paperalgo#addtoc_21 because my emphasis there is on using the simplest possible algorithms. It didn't occur to me that it might actually be faster, or be capable of being made faster, and indeed in that page I said you could improve efficiency with Hoare partitioning.

Alexandrescu's branch-free code in this article is a thing of beauty, in the same sort of rugged way that Stepanov's code is.


I love this sort of "Historical Fiction" where interactions and conversations between The Great People of Science are imagined and portrayed. A great example of this is Louisa Gilder's wonderful book _The_Age_Of_Entanglement:_When_Quantum_Physics_was_Reborn

https://www.amazon.com/Age-Entanglement-Quantum-Physics-Rebo...


Thanks for writing this, because I hated it — I read that part thinking it would shed some light on the topic of the paper at large, but it was just noise to me.

Which is not to imply that I am right and you wrong or vice versa! Simply to observe that people are different.

As I liked the article but was annoyed by that beginning, it’s great for me that the first comment I saw was yours. So thanks for making it.


If you enjoy books like that, I recommend The Cambridge Quintet[0]; it's a wonderful book.

0. https://www.amazon.com/Cambridge-Quintet-Scientific-Speculat...


Oh, I assumed that dialogue between Lomuto and Hoare was real -- at least the gist of it, if not the exact words.


The interaction was probably real, since there is a date and location, but the dialog (and certainly the narration) seems to be taking some artistic liberties perhaps? This is also what Louisa did in her book on the history of quantum mechanics; real interactions with imagined dialogs. My apologies if I gave the wrong impression.


Anthony Zee's Quantum Field Theory textbook takes a similar tone.


To anyone interested in a deeper treatment of quicksort and variants thereof, I can recommend Sebastian Wild's PhD thesis on that topic: https://kluedo.ub.uni-kl.de/frontdoor/deliver/index/docId/44...


Off topic: Is it common practice for a dissertation at a German university to be written in English?


Depends on the field, but it is very common in computer science and generally in natural sciences. Studying computer science, I don’t think I ever even wrote as much as a paper in German; everything’s English.


Yes - to ensure the widest possible audience. English is the lingua franca of academia and has been for almost a century now.


At least in MINT (the German equivalent of STEM) fields, probably less so in the humanities. Correct me if I'm wrong.


You're right. German economics is famously insular, for example. Part of that is cultural, part of it is that they don't write as much in English.


It is also common in many fields in Switzerland, Italy, Spain, ... Newton's Principia, for example, was written in Latin and not in 17th-century English.


In physics it certainly is. As far as I know, pretty much all scientific conferences and journals are in English these days. Even journals like JETP (https://en.wikipedia.org/wiki/Journal_of_Experimental_and_Th...) became joint English and Russian in the 50's!


Hardly anyone is going to read, let alone cite your work if you were to publish in German. I'm not even aware of any German-speaking conferences.


I’ve read that some vector instruction sets use masking instructions to avoid branching. You execute both arms of the conditional and discard the unwanted computation, which can be cheaper than an X% chance of a branch misprediction.

Should CPUs include more masking instructions for regular, non-vectorized code as well?


32-bit ARM assembly language provided conditional execution for most instructions. This was dropped in the 64-bit instruction set. (https://en.wikipedia.org/wiki/Predication_(computer_architec..., https://en.wikipedia.org/wiki/ARM_architecture#64-bit) I imagine the architects had a clear picture of the advantages and disadvantages, and made a very well-informed decision. I guess that part of the reason might have been that it's not clear when compilers should emit predicated code instead of branches.

Also, you can get something very similar even for scalar code as long as your machine has a conditional move operation, which they all have:

    /* true path */
    a = ...
    b = ...
    true_result = ...
    /* false path */
    x = ...
    y = ...
    false_result = ...
    /* final result */
    result = (condition ? true_result : false_result)
This works if the two "branches" have no side effects (memory writes, function calls) and cannot raise exceptions. Again, it's very difficult to estimate for compilers (and humans) if it will pay off to run both computations rather than branch.

The trade-offs for vectorization are different from scalar code since vectorizing a loop by a factor N gives a huge win, and even if you waste some time doing some redundant computations, you still have a reasonable chance of being faster than scalar.


You can get close to the performance of a cmov instruction by generating a pair of all-1s and all-0s values: a=c, b=c-1, and a result z=a&x|b&y.

Gcc will not under any circumstances produce two cmov instructions in a basic block, so that is your only alternative without dropping to asm. Clang is happy to produce two adjacent cmov instructions. Usually your ALUs are not otherwise so engaged as to make the number of operations involved costly.


I don't get this. If c can be 0 or 1, then a can be 0 or 1, so it can be all-0s but not all-1s. b can be -1 or 0, so that part is fine. Can you show actual C or C++ code for a function like this:

    int select(int x, int y, bool condition) {
        ...
    }


Sorry, brain freeze. See

https://news.ycombinator.com/item?id=23237663

The correct bitwise expression is -c&x|(c-1)&y.


Oh, yes, I should have realized that -c is enough to transform from 0/1 to 0/-1. Thanks!


I should add that Clang is happy to turn c&x|(c-1)&y into cmov.


I’m unconvinced because the instruction set sizes are already becoming fairly bloated (increasing microcode translation cost and i-cache pressure) but also because CPUs already do speculative execution which has a nearly equivalent effect to the idiom you’re referring to.

*edit: wording


To be more precise, some CPUs do the "both sides of the conditional" execution (EDIT This link is incorrect! Refer to "Eager execution" in the last link below. /EDIT https://en.wikipedia.org/wiki/Eager_evaluation) - years ago I heard that mobile processors do so for power reasons. No idea if that's still the case.

The problem in question is due to predictive execution in which the CPU has to flush the pipeline on misprediction. This is dominant in desktop PC CPUs. I have no knowledge on what is prevalent in e.g. server or mobile CPUs.

Both are (according to https://en.wikipedia.org/wiki/Speculative_execution) forms of speculative execution.


I'm speaking broadly about desktop/console CPUs which I have low-level familiarity with. I don't know of any CPUs in this class that don't perform some form of speculative execution.

I'm not sure if eager evaluation is related to the topic at hand and was speaking specifically about speculative execution. Eager evaluation is just what pretty much every language except Haskell (or Haskell-like) does.


Sorry, that first link was nonsense. I just copied the wiki link under "Eager execution" in the Speculative Execution page, but that is of course far from the same as "eager evaluation" (which is what the link goes to). I apologize for the confusion.

The point is that there are at least two ways CPUs can speculatively execute a conditional: Either predict which branch is taken and flush the pipeline if the prediction was wrong, or execute both arms of the conditional and discard the one that was not actually taken. The top-level SIMD comment is about the latter, but most optimization around speculative execution is for the former.

Yes, CPUs do speculative execution. The "flush pipeline on mispredict" kind ("predictive execution"). That is not the same as the "execute both and discard the untaken one" kind ("eager execution"). Your first reply suggests you consider them equivalent when they are not. I just wanted to clear that up.


Ah OK I understand now thanks for clearing that up. I meant that the end effect was similar but not so much that the mechanism was identical.


If you define

  inline bool swap_if(
    bool c, long& a, long& b)
  {
    long ta = a, tb = b;
    a = c ? tb : ta;
    b = c ? ta : tb;
    return c;
  }
and then use it in a partition like

   right += swap_if(
     *left < pivot,
     *left, *right);
compiled with Clang (which generates `cmov` instructions), the Hoare partition is still faster, and more than twice as fast as `std::sort`.

Using `swap_if` in Lomuto is also faster, but not as much faster. I interpret the difference (vs. array ops) as resulting from reduced L1 bus traffic.

A fully general swap_if,

  template <typename T>
  bool swap_if(
    bool c, T& a, T& b)
  {
    T v[2] = { a, b };
    b = v[c], a = v[1-c];
    return c;
  }
could and IMHO should be peephole-optimized to use a pair of `cmov` instructions, but is not in Clang, Gcc, Icc, or MSVC. But even without such an optimization, it makes Quicksort much faster.

(Gcc, incidentally, is very, very sensitive to details of the second swap_if. Change the order of assigning a and b, or use `!c` in place of `1-c`, and it gets much slower, on Intel, for very non-(to me-)obvious reasons. Gcc also will never, ever produce two `cmov` instructions in a basic block. I have filed a bug.)

If `swap_if` were in the Standard Library, it would probably be implemented optimally on all compilers, and almost half of the Standard algorithms could use it to get, often, ~2x performance.


Writing branch-free code (which I find myself sometimes doing to take advantage of simd vectorization) is always really painful. I wonder if someone has made an attempt at better tooling for this.


One thing people have made is array languages. The "primitive" operations apply to arrays of data, so branches around individual elements can't be expressed.

I think it's also generally less expressive, though. Writing Lomuto's partition as efficiently would likely be impossible, but writing it branch-free might not be -- you would compose branch-free primitives like

    left = filter(array, array < array[0])
    right = filter(array, array > array[0])
where `filter` is branch-free and takes an array of Boolean values as its second argument.

Shrugs, aside from some standard set of functions/primitives and idioms I'm not sure it would help much, though perhaps a mindset-shift is a more likely solution than something more technical.


This sort of vectorised pseudo branching is often used in SIMD algorithms


Yes... and the transformations are often non-trivial (especially when considering the possibility of out-of-bounds elements, which can't just safely be multiplied by zero--they might be a NaN, or on the edge of a page). It would be nice if there was a tool (or compiler option!) that could convert my branchy code into an "identical" (assuming floating-point associativity, etc.) branchless version.


The Eigen Array [1] object is fairly close to this. Combining the 'select' with comparisons, chained operators etc. Eigen vectorized stuff pretty effectively too.

1. https://eigen.tuxfamily.org/dox/group__TutorialArrayClass.ht...


Interesting... I've used Eigen for matrix operations before, I did not know they had such a smart general-purpose array implementation!


I tend to use the same method as the author has presented.

1. Write your branchy code

2. Massage the branches until their code is identical (testing frequently)

3. Trim a useless branch, and eat a piece of chocolate


I don’t mind boasting that I’m a world class talent at part 2 of step 3


Reminds me of this great article on writing a fast mergesort:

http://pvk.ca/Blog/2012/08/13/engineering-a-list-merge-sort/


If you have a branch free algorithm, you can vectorize it. If he discussed implementing this with SIMD instructions (whether by human hand or compiler auto vectorization) I missed that part. But that seems an interesting angle to me.


Not necessarily. A branch-free algorithm can totally have a data dependency from the previous iteration of the loop to the next, making it impossible to vectorize.


That's true, although there are sometimes workarounds for that by modifying the algorithm to iterate over a small groups where there is still a data-dependency between iterations, but not within each group, allowing one to use SIMD. e.g. instead of delta compression over an array of integers where you subtract each integer from the prior one, you can subtract each group from the prior one with only a small loss of compression ratio.

The question is, is there such a dependency in this algorithm? I didn't check.


Yes, sorting typically (and this one specifically) has long dependency chains. The natural reflex is to split the chains across multiple workers... and that's how you get merge-sort. But even the last pass has a linear dependence chain. Not long ago, somebody posted a clever algorithm based on 4-way swaps, which is a clever way of addressing the issue.

Edit: I can't seem to locate that 4-way swapping algorithm, I swear it was posted to HN in the last year... somebody halp!


You might be thinking of this: https://news.ycombinator.com/item?id=22322967


Bingo, thanks!


Reminds me of this paper on binary search, which also shows a speed up using branch-free comparisons.

"Array Layouts for Comparison-Based Searching" https://arxiv.org/pdf/1509.05053.pdf


So what's the impact of the data dependency chain on "first -= smaller" to the next iteration?

(I guess at the very least it prevents vectorization, but other than that ... shouldn't be too much, but probably still the new limiting factor?)


https://dlang.org/blog/2020/05/14/lomutos-comeback/ gives a 404 error. And archive.org ( http://web.archive.org/cdx/search/cdx?url=https://dlang.org/... ) is also getting nothing but 404s, so it's not a problem on my end. I'm not really sure what's wrong with this site, since other commenters seem not to have noticed a problem.


It's ok now, as far as I can see.


Actually, it's still giving a 404, but (after remembering and looking up the --content-on-error option to wget) it appears TFA is served just fine as if it were a 404 error page.


It is true that more swaps may be cheaper than fewer swaps.

The operations to count, nowadays, are not swaps or comparisons, which are very cheap, but instead pipeline stalls, which are very, very expensive.

It is easy to better-than-double the speed of vanilla quicksort with a three-line change in the partition step.



I'm not sure I've ever sorted an array or list. Rather I put things into sets or maps or a built-in index---I always wanted to access things later, or have an on-line data structure (which a sorted flat thing is not).

This whole field is a waste of time, an anachronism from when the master copy was paper/pre-computer and computers were just doing the analytics.


Your dismissal of “sorted flat thing” as being necessarily not online is incorrect or unnecessarily strict to your detriment.

I wrote this comparing the high-level performance of a “set” (AVL or red-black tree) against just that:

Faster lookups, insertions, and in-order traversal than a red-black or AVL tree [0]

[0]: https://neosmart.net/blog/2019/sorted-list-vs-binary-search-...

(Spoiler: it depends on the nature of your data, but sorted vector can be dramatically faster.)


To be clear, I am not against "small sorted flat thing" as an optimization for smaller data structures.

My point is I care about the map or set interface. I don't use temporary sets or quicksort to sort....because I don't sort---I don't want a list/array of things in order, basically ever.


Yes, and what I showed you is a set with a set interface, internally implemented as an always-sorted list.


OK, sorry. Yes, creating a map/set using the small-size optimization is internally doing a sort.


Putting things in a map or set is almost invariably slower, often much slower, than a smart sort. If performance doesn't matter, then go ahead. Most often sort performance doesn't matter, or anyway most of the time is spent elsewhere. People do use Python or Bash in production, without shame.

But some of us work on problems where performance of the sort does matter.


Sure, it is slower, but my point is not that I sort by doing `Map.toList . Map.fromList` but that I don't immediately turn the map back into a list. The sorting literature is optimizing a thing I don't need to do.


Just because you’re not using it directly doesn’t mean that you’re not benefiting from those optimization in those layers you are depending on.


I'm not using it directly or indirectly. The whole point of this sorting literature is a space-time tradeoff that's completely different from when you want an ordered map/index as the end result.


Not really true, putting things in a hash map is pretty fast, and generally faster than a comparison based sort.


A hash map is unsorted so it's not a useful way to sort data


Very true. I like rich keys with non-trivial comparison cost. I like nested maps even better; not only are they often useful in their own right, they also serve to cache the cost of comparison w.r.t. the original map---a domain-specific trie.

Maps are just far more the bread and butter of programming than sorting, and not just if you are using one of the languages satirized by https://elbenshira.com/blog/the-universal-data-structure/



