> What do you mean by "the worst case for LSM trees"?
For write amplification.
> If you use a log-structured B-tree approach so MVCC is convenient...
You just hamstrung your B-tree for conscience to implement a single model, but regardless: I never claimed that B-trees were faster at writes than LSM-trees. I instead claimed that I don't understand where all of these write-dominated use cases are coming from as the world seems to be full of read-dominated services.
> And that's why, on one machine I tested on, LevelDB could insert 300'000 key-value pairs per second while Postgres manages about 3000 and SQLite about 17000.
I mean, this is clearly an unfair comparison by your own math: you need to keep running this for a long enough time that you can amortize into your calculations the cost of compactions (for LevelDB) and vacuums (PostgreSQL).
Aha, thanks. I suppose that in theory you could avoid actual merging if you had disjoint key ranges, in which case random writes could conceivably have lower write amplification than mostly sequential writes, but does anyone do that in practice? Does it make a difference in practice?
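To make the disjoint-key-range idea concrete, here's a toy model of a leveled store where a sorted run whose key range doesn't overlap anything at the target level can be moved down without being rewritten (LevelDB calls this a trivial move), while an overlapping run must be merged and rewritten. The `compact` and `overlaps` helpers are made up for illustration; this is not LevelDB's actual compaction code.

```python
def overlaps(a, b):
    """True if key ranges a=(lo, hi) and b=(lo, hi) intersect."""
    return a[0] <= b[1] and b[0] <= a[1]

def compact(run, level_runs):
    """Push `run` (a sorted list of keys) into a level of sorted runs.

    Returns (new_level_runs, keys_rewritten), where keys_rewritten
    counts keys that had to be read and written out again."""
    clashing = [r for r in level_runs
                if overlaps((r[0], r[-1]), (run[0], run[-1]))]
    if not clashing:
        # Disjoint key range: a trivial move, rewriting nothing.
        return sorted(level_runs + [run]), 0
    disjoint = [r for r in level_runs if r not in clashing]
    merged = sorted(run + [k for r in clashing for k in r])
    return sorted(disjoint + [merged]), len(merged)

level = [[10, 20, 30], [50, 60, 70]]
level, rewritten = compact([35, 40, 45], level)  # disjoint range
assert rewritten == 0                            # free: no merge needed
level, rewritten = compact([15, 55], level)      # overlaps both runs
assert rewritten > 0                             # must rewrite merged keys
```

So in this model, purely disjoint writes really do escape merge write amplification; the practical question is how often real workloads produce runs with non-overlapping ranges.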
What do you mean by "You just hamstrung your B-tree for conscience to implement a single model"? I'm not sure what you mean by either "for conscience" or "to implement a single model".
It's probably not an extremely fair comparison, but I'm not sure which databases it's slanted in favor of, and it wouldn't be very surprising if I'd overlooked some easy speedup. Like, maybe by deferring index creation until after ETL you could get a significant boost in Postgres speed, or by changing the Postgres tuning numbers. I do think it was running long enough to incur a significant number of compactions and vacuums, though. Feel free to critique in more detail. The Postgres and SQLite numbers come from http://canonical.org/~kragen/sw/dev3/exampledb.py, but for LevelDB I inferred that the Python bindings were the bottleneck, so the 300'000 figure is from http://canonical.org/~kragen/sw/dev3/leveldb-62m.cc.
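For anyone who wants to poke at the measurement without reading the linked scripts, here's a stripped-down sketch of the shape of such a benchmark: time N random-key inserts into SQLite (stdlib `sqlite3`), committing in batches. This is my guess at the general shape, not the actual exampledb.py code, and the batch size, record size, and `:memory:` target are arbitrary assumptions.

```python
import random
import sqlite3
import time

def sqlite_insert_rate(n=20_000, batch=1_000):
    """Return random-key inserts/second into a fresh SQLite table."""
    db = sqlite3.connect(":memory:")  # use a file path to hit disk
    db.execute("CREATE TABLE kv (k INTEGER PRIMARY KEY, v BLOB)")
    start = time.perf_counter()
    for _ in range(0, n, batch):
        db.executemany(
            "INSERT OR REPLACE INTO kv VALUES (?, ?)",
            ((random.getrandbits(48), b"x" * 100) for _ in range(batch)),
        )
        db.commit()  # one transaction per batch
    return n / (time.perf_counter() - start)

print(f"{sqlite_insert_rate():.0f} inserts/second")
```

Note that batching matters a lot here: one transaction per insert forces an fsync per row on a disk-backed database, which by itself can account for an order of magnitude.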
I don't think that's an unfair comparison; it is in line with similar tests I have performed for random write workloads on LSM trees and B-trees that don't fit in RAM, including the cost of compaction. https://youtu.be/jwq_0mPNnN8?t=3730 links to a section of a talk I did which explains my benchmarks. It also includes some pretty neat visualizations of the write patterns of the two data structures that may help people understand why there is such a big difference in write throughput. My benchmarks were done on spinning disk; on SSDs I'd expect LSM trees to only have around 10x higher write throughput for the same benchmark.
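A rough back-of-envelope model (my numbers and assumptions, not figures from the talk) of why the gap shrinks on SSDs: both structures can have double-digit write amplification, but the LSM's writes are sequential, so on a spinning disk the B-tree is seek-bound while the LSM is bandwidth-bound. All the device and sizing parameters below are illustrative guesses.

```python
import math

RECORD = 100  # bytes per key-value pair (assumed)
PAGE = 4096   # B-tree page size (assumed)

def btree_rate(random_write_iops):
    # Random inserts into a B-tree that doesn't fit in RAM cost at
    # least one random page write each, so throughput ~= device IOPS.
    return random_write_iops

def lsm_rate(seq_bandwidth, data_gb=100, memtable_mb=64, growth=10):
    # Leveled LSM: a record is rewritten roughly `growth` times per
    # level on its way down, but every write is sequential.
    levels = math.ceil(math.log(data_gb * 1024 / memtable_mb, growth))
    write_amp = growth * levels
    return seq_bandwidth / (write_amp * RECORD)

# Spinning disk: ~100 random write IOPS, ~100 MB/s sequential.
print(btree_rate(100), lsm_rate(100e6))      # seek-bound vs ~25k/s
# SSD: ~10,000 random write IOPS, ~500 MB/s sequential.
print(btree_rate(10_000), lsm_rate(500e6))   # gap shrinks to ~10x
```

Under these made-up parameters the spinning-disk gap is a couple of orders of magnitude while the SSD gap is closer to 10x, which at least matches the direction of the claim above.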