
For me, it was "Erasure Coding in Windows Azure Storage" from Microsoft Research (2016) [0]

The idea that you can achieve the same practical effect as a 3x replication factor in a distributed system, while only increasing the cost of data storage by 1.6x, by leveraging some clever information theory tricks is mind-bending to me.

If you're operating a large Ceph cluster, or you're Google/Amazon/Microsoft running GCS/S3/ABS, then where you needed 50PB of HDDs before, you only need about 27PB now (if implementing this).
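
(If it helps to see where numbers like 1.6x and ~27PB come from, here's my own back-of-the-envelope sketch; the k=10, m=6 split is just an illustrative parameter choice, not the paper's actual scheme.)

    # Back-of-the-envelope: 3x replication vs. a (k data + m parity) erasure code.
    # The k=10, m=6 split below is illustrative (1.6x overhead), not the paper's layout.

    def overhead(k: int, m: int) -> float:
        """Bytes stored per logical byte for k data + m parity fragments."""
        return (k + m) / k

    logical_pb = 50 / 3                          # ~16.7 PB of logical data behind 50 PB at 3x
    print(f"3x replication:        {logical_pb * 3:.1f} PB")
    print(f"(10+6) erasure coding: {logical_pb * overhead(10, 6):.1f} PB")   # ~26.7 PB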

The cost savings and environmental impact reduction this allows for are truly enormous; I'm surprised how little attention this paper has gotten in the wild.

[0] https://www.microsoft.com/en-us/research/wp-content/uploads/...



The primary reason to use 3x or higher replication is read throughput (which makes it really only relevant for magnetic storage). If the data is replicated 1.6x, then there are only 1.6 magnetic disk heads per file byte. If you replicate it 6x, then there are 6 magnetic disk heads per byte. At ~15x it becomes cheaper to store on SSD with ~1.5x Reed-Solomon/erasure-code overhead, since SSD has ~10x the per-byte cost of HDD.

(there are also effects on the tail latency of both read and write, because in a replicated encoding you are less likely to be affected by a single slow drive).

(also, for insane performance, which is sometimes needed, you can mlock() things into RAM; the per-byte cost of RAM is ~100x the cost of HDD and ~10x the cost of SSD).
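
To make the crossover arithmetic above concrete, here's a quick sketch using the parent's rough ratios (HDD per-byte cost normalized to 1, SSD at ~10x, ~1.5x erasure overhead); the exact numbers are of course illustrative.

    # Cost per logical byte, using the rough ratios above:
    # HDD per-byte cost = 1 (arbitrary unit), SSD ~ 10x HDD, erasure-code overhead ~ 1.5x.
    HDD_COST, SSD_COST, EC_OVERHEAD = 1.0, 10.0, 1.5

    ssd_ec = EC_OVERHEAD * SSD_COST   # 15.0, independent of replica count
    for copies in (3, 6, 15, 20):
        hdd = copies * HDD_COST
        verdict = "SSD+EC wins" if ssd_ec < hdd else ("break-even" if ssd_ec == hdd else "HDD replicas win")
        print(f"{copies:>2}x HDD replicas: {hdd:>4.1f}  vs  SSD + 1.5x EC: {ssd_ec:.1f}  -> {verdict}")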


Everything you just said is on point, but I think that's orthogonal to what the paper is going for. Hot data should absolutely have a fully materialized copy at the node where operations are made, and an arbitrary number of readable copies can be materialized for added performance in systems that don't rely on strong consistency as much.

However, for cold data there really hasn't been (or at least I am unaware of) any system that can achieve the combined durability of 1.5x Reed-Solomon codes + 3x replication with such a small penalty to storage costs.

Like you said though, it's definitely not the thing you'd be doing for things that prioritize performance as aggressively as the use-cases you've suggested.


~1.5x Reed-Solomon is the default these days, again, unless you need read-throughput performance. It is awesome :)

Also, these days the storage of the data doesn't have to be on the same machine that processes the data. A lot of datacenter setups have basically zero transfer cost (or, alternatively, all the within-DC transfer cost is in the CAPEX required to build the DC in the first place), ultra-low latency, and essentially unlimited bandwidth for any within-datacenter communication. This doesn't hold for dc1->dc2 communication; in particular, it is very far from true for long-distance lines.

One way to think about the above is that datacenters have become the new supercomputers of the IBM era - it's free and really fast to exchange data within a single DC.

Also2, this is completely independent of consistency guarantees. At best it relates to durability guarantees, but I want that from all storage solutions. And yes, properly done Reed-Solomon has the same durability guarantees as a plain old replicated setup.

Also, to the above also2: single-DC solutions are never really durable, as the DC can simply burn down or meet some other tragic end. You need geographic replication if your data cannot be accidentally lost without serious consequences (a lot of data actually can be lost, in particular if it is some kind of intermediate data that can be regenerated from the "source" with some engineering effort). This is not just a theoretical concern; I've seen "acts of God" destroy single-DC setups' data, at least partially. It is pretty rare, though.
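
(To illustrate the durability point with a toy model of my own: assume each disk fails independently with probability p before repair completes, and ignore correlated failures, repair times, and failure domains, which real systems absolutely must model. A (10, 4) code then already beats 3x replication while storing far less.)

    from math import comb

    # Toy durability model: independent per-disk failure probability p within a
    # repair window. Ignores correlated failures, repair times, failure domains.

    def loss_replication(copies: int, p: float) -> float:
        """Data is lost only if every replica fails."""
        return p ** copies

    def loss_erasure(k: int, m: int, p: float) -> float:
        """A (k, m) code survives up to m failures; lost if more than m of k+m fail."""
        n = k + m
        return sum(comb(n, f) * p**f * (1 - p) ** (n - f) for f in range(m + 1, n + 1))

    p = 0.01  # illustrative failure probability per disk in the window
    print(f"3x replication  (3.0x storage): {loss_replication(3, p):.1e}")   # ~1e-06
    print(f"(10, 4) erasure (1.4x storage): {loss_erasure(10, 4, p):.1e}")   # ~2e-07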


I'm confused, as you don't seem to be replying to any point I've made...

> ~1.5x reed solomon is the default these days, again, unless you need read throughput performance

I'm not surprised that Reed-Solomon is the "default these days", given that it has existed since the 1960s and that the most widely available and deployed open-source distributed filesystem is HDFS (which uses Reed-Solomon). However, I don't see how that is to be taken as a blind endorsement of it, especially given that the referenced paper explicitly compares itself to Reed-Solomon-based systems, including concerns regarding reconstruction costs, performance, and reliability.

> Also, these days the storage of the data doesn't have to be at the same machine that processes the data

Even though what you said here is correct, I don't see how it's relevant to the referenced paper, nor do I think anything I said implied I believe otherwise.

> Also2, this is completely independent of consistency guarantees

My comment about consistency referred only to the fact that you cannot "simply" spin up more replicas to increase read throughput: consistent reads often have to acquire a lock on systems that enforce stronger consistency. So your comments regarding throughput are not universally true; there are many systems where reads cannot be made faster this way, as they are bottlenecked by design.

> Properly done Reed-Solomon has the same durability guarantees as plain old replicated setup

This is not true unless the fragments themselves are replicated across failure domains, which you seem to address in your next comment with "you need geographic replication if your data cannot be accidentally lost without serious consequences". All of this, however, is directly addressed in the paper as well:

> The advantage of erasure coding over simple replication is that it can achieve much higher reliability with the same storage, or it requires much lower storage for the same reliability. The existing systems, however, do not explore alternative erasure coding designs other than Reed-Solomon codes. In this work, we show that, under the same reliability requirement, LRC allows a much more efficient cost and performance tradeoff than Reed-Solomon.


It's not even the reduction in storage costs in this paper that is groundbreaking. They describe a way to not only reduce storage costs but also optimize for repairs. Repairs are costly at scale, and reducing the resources involved (network, CPU, disk reads, etc.) wherever possible is ideal.
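
As a sketch of the locality idea behind cheaper repairs (an XOR-based toy in the spirit of LRC's local groups, not the actual construction from the paper): with one local parity per group of 6 data fragments, a single lost fragment is rebuilt from the 6 survivors in its group, instead of the full-stripe read a classic Reed-Solomon repair would need.

    from functools import reduce

    # Toy "local group" repair: 12 data fragments in two groups of 6, each with one
    # XOR parity. A single lost fragment is rebuilt from its group (6 reads) instead
    # of a full 12-fragment Reed-Solomon decode. Illustrates locality only; it is
    # NOT the LRC construction from the paper (which also has global parities).

    def xor(fragments):
        return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*fragments))

    data = [bytes([i] * 4) for i in range(12)]      # 12 dummy 4-byte data fragments
    groups = [data[:6], data[6:]]
    local_parities = [xor(g) for g in groups]

    lost = 3                                        # pretend fragment 3 is lost
    survivors = [f for i, f in enumerate(groups[0]) if i != lost] + [local_parities[0]]
    repaired = xor(survivors)
    assert repaired == data[lost]
    print(f"repaired fragment {lost} from {len(survivors)} reads instead of 12")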


Indeed: erasure coding is easy (they have been doing it since the 60s). Your real problem is the repair problem.


On this same note, I would also suggest some papers which show you can do so much better than simple erasure coding:

[1] Clay Codes - https://www.usenix.org/conference/fast18/presentation/vajha . The authors also implemented it in Ceph, and the results are shown in the paper.

and [2] HeART: improving storage efficiency by exploiting disk-reliability heterogeneity - https://www.usenix.org/conference/fast19/presentation/kadeko... . This paper argues that just one erasure code is not enough, and that by employing code conversions based on disk reliability you can get up to 30% savings!


The Google File System (GFS) paper from 2003 mentions erasure codes. That isn't to say they did it then, but rather that the technique of using erasure coding was known back then. (And surely before GFS too; I just picked it as an example of a large data storage system that used replication and a direct predecessor to the systems you mentioned.)

https://static.googleusercontent.com/media/research.google.c...


CDs (remember those? lol) also implemented Reed-Solomon erasure codes for the stored data; erasure codes in storage systems aren't new at all, and that's not what this paper is about.

I actually found out about this paper because it was referenced in a slide presentation from Google about Colossus (which is the successor to GFS). GFS indeed uses erasure coding with a 1.5x factor, but erasure coding alone does not guarantee durability and thus needs to be combined with replication to satisfy that requirement; erasure coding is not the same thing as replication.

The innovation here is explicitly the combination of a new erasure coding algorithm (LRC) AND replication, with a combined storage amplification that is much lower than the previous SOTA.

The paper explicitly compares the new algorithm (LRC) with GFS and other alternatives and explains why it's better. So this is really not comparable to the 2003 GFS paper in any way (or to any other prior art, really); it is not just a trivial application of erasure coding in a storage system.

There's also this paper [0] from 2001, which digs a bit deeper into the erasure codes vs. replication idea, that I can recommend if you're interested.

[0] http://fireless.cs.cornell.edu/publications/erasure_iptps.pd...


The paper is from 2012, not 2016 (see https://dl.acm.org/doi/10.5555/2342821.2342823)


I think for the major players you mentioned, the 2016 paper was retrospective. Everyone was already doing it. Even mid-tier players like Dropbox's Magic Pocket were using erasure coding by 2016, and their scheme was mostly written by ex-Google engineers influenced by Colossus.


Oh, I am absolutely aware that erasure codes are an old thing; Reed-Solomon codes have existed since the 1960s. But this is not simply a trivial application of erasure coding to a storage system: erasure codes alone don't provide the same durability guarantees that replication does. [0]

This is a combination of erasure coding AND replication, whose combined storage amplification is dramatically lower than previous SOTA.

I gave a longer explanation in a sibling comment to yours [1]

[0] http://fireless.cs.cornell.edu/publications/erasure_iptps.pd...

[1] https://news.ycombinator.com/item?id=25351678


Thanks for the clarification. I still think these techniques were somewhat widespread already... see, for example, this figure from US Patent 9292389 describing nested, aka layered, coding that to my thinking is isomorphic to the "LRC" or "pyramid code" described by the Microsoft researchers.

By the way, not at all trying to say this paper isn't interesting. I keep it in my filing cabinet to show my colleagues when I need to describe this technique, since Google hasn't ever bothered to describe Colossus in a way I can reference.

https://imgur.com/a/gi2Xdl0


Erasure coding was already used in storage tapes in 1985.

> some clever information theory tricks is mind bending to me

It's a pretty trivial first-degree linear function, y = ax + b
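For what it's worth, here's a toy version of that point (mine, using plain integers and a degree-1 polynomial; real Reed-Solomon uses higher-degree polynomials over a finite field such as GF(2^8)): treat two data words as the a and b of y = ax + b, store the line's value at several x positions, and any two surviving points recover the data.

    from fractions import Fraction

    # Toy "erasure code as a line": the two data words (a, b) define y = a*x + b.
    # Store the line's value at 4 points; any 2 surviving points recover (a, b).
    # Real Reed-Solomon does this with higher-degree polynomials over GF(2^8).

    a, b = 42, 7                                      # the "data"
    shares = [(x, a * x + b) for x in (1, 2, 3, 4)]   # 4 stored fragments

    def recover(p1, p2):
        (x1, y1), (x2, y2) = p1, p2
        slope = Fraction(y1 - y2, x1 - x2)            # recovered a
        return slope, Fraction(y1) - slope * x1       # (a, b)

    # Pretend fragments 0 and 2 were lost; recover from the remaining two.
    print(recover(shares[1], shares[3]))              # (Fraction(42, 1), Fraction(7, 1))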


And why all the downvotes?



