These things exist under trade names like Chipkill or lockstep memory, though they don't need to sacrifice half of the memory chips to get good error-recovery properties.
Note that this is still not end-to-end protection of data integrity. Bit flips happen in networking, storage, buses between everything, caches, CPUs, etc. See e.g. [1].
According to the Intel developer's manual, L1 has parity and all caches up from that have ECC. This would seem to imply that the ring/mesh also has at least parity (to retry on error). Parity instead of ECC on the L1D makes sense, since the L1D has to handle small writes well, while the other caches deal in whole lines.
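As a toy illustration of why parity is cheap (not how the cache actually wires it up): a single even-parity bit per word detects any odd number of flipped bits, but can't locate or correct them.

    # Toy even-parity check in Python; real hardware computes this per byte/word
    # with an XOR tree, but the idea is the same.
    def parity(word: int) -> int:
        """Even-parity bit: 1 if the word has an odd number of set bits."""
        return bin(word).count("1") & 1

    stored = 0b10110100
    stored_parity = parity(stored)

    # A single-bit flip changes the popcount by one, so the parity bit mismatches.
    corrupted = stored ^ 0b00001000
    assert parity(corrupted) != stored_parity  # detected, but not correctable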
Checksumming filesystems and file-transfer protocols cover many cases: scp, rsync, Btrfs, and ZFS all address this problem.
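A minimal sketch of that kind of check, assuming you hash the file on both ends yourself instead of trusting the transport (the paths here are hypothetical):

    import hashlib

    def sha256_of(path: str) -> str:
        """Hash a file in chunks so large files don't have to fit in RAM."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    # Compute before sending, recompute after receiving; any corruption in
    # transit or at rest shows up as a mismatch.
    assert sha256_of("sent/data.bin") == sha256_of("received/data.bin")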
As for guaranteeing the computed data is correct: I know space systems often have two redundant computers that calculate everything and compare results. It's crazy expensive and power-hungry, but it all but solves the problem.
Usually they have an odd number. The Space Shuttle had five, of which the fifth was running completely different software. In case of a 2/2 split the crew could shut down a pair (in case the failure was clear) or switch to the backup computer.
These are all very fair statements, but there's no guarantee that ECC memory was even used. Computers typically fail open here: if ECC is supported but not actually present, the system silently runs without it.
People also cite early-stage Google and intentionally skip buying ECC components, running consumer hardware for production workloads.
Public CAs are not that type of people; I would be disappointed if they were not running two separate systems checking each other for consistency. Having top-of-the-range ECC running well inside its specification must be table stakes.
Not hundreds. There are currently 52 root CA operators trusted by Mozilla (and thus Firefox, but also most Linux systems and lots of other stuff); a few more are trusted only by Microsoft or Apple, but not hundreds.
But also, in this context we aren't talking about the CAs anyway, but the log operators, and for them reliability is about staying qualified, as otherwise their service is pointless. There are far fewer of those, about half a dozen total: Cloudflare, Google, DigiCert, Sectigo, ISRG (Let's Encrypt), and TrustAsia.
[Edited, I counted a column header, 53 rows minus 1 header = 52]
It’s always humorous to me when people use the term theology in situations such as this; it makes me wonder, as human mental bandwidth becomes more strained and we increase our specializations to the n-th degree, what will constitute theology in the future?
The future is already here, and we call it consensus. Trusting your peers, believing they are honest and proficient in their respective fields, is a natural human response to unknown phenomena.
The benevolent and the malevolent rogue AI, eluding capture or control by claiming critical infrastructure. A few generations of humans will pass, and the deities in the outlets will come to be.
IIRC, ECC memory can correct single-bit flips. It can detect (and warn or fail) on double-bit flips. And it cannot reliably detect triple-bit flips. This might be a simplified understanding, but if this has happened only once, that seems to match up with my intuitive understanding of the probability of a triple-bit flip occurring in a particular system.
You multiply the probabilities of two independent random events to get the probability they will happen at the same time. If the probability of a bit flip is 10^-18, then two would be 10^-36 and three would be 10^-54.
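The arithmetic, under the (strong) assumption that each flip is an independent event:

    # Probability of k simultaneous flips with per-bit flip probability p,
    # assuming independence; illustrative numbers only.
    p = 1e-18
    for k in (1, 2, 3):
        print(k, p ** k)  # 1e-18, 1e-36, 1e-54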
At some point it becomes a philosophical question of how much of the tails of the distribution can be tolerated. We've never seen a quantum fluctuation make a whale appear in the sky.
DRAM failures are not independent events, so it’s not appropriate to multiply the probabilities like that. Faults are often clustered in a row, column, bank, page or whatever structure your DRAM has, raising the probability of multi-bit errors.
I don't see why a high-energy particle strike would confine itself to a single bit. The paper I posted elsewhere in this thread says that "the most likely cause of the higher single-bit, single-column, and single-bank transient fault rates in Cielo is particle strikes from high-energy neutrons". In the paper, both single-bit and multi-bit errors are sensitive to altitude.
A single particle strike would only affect a single transistor. If that transistor controls a whole column of memory, then sure it could corrupt lots of bits. With ECC, though, it would probably result in a bunch of ECC blocks with a single bit flip, rather than a single ECC block with several bit flips.
Process enough data and even ECC can - and will - fail undetected. Any kind of mechanism you come up with is going to have some rate of undetected errors.
Given the rate required for this, it's not a reasonable assumption. It's like saying Amazon sees SHA-256 collisions between S3 buckets. It just doesn't happen in practice.
Undetected ECC errors are common enough to see from time to time in the wild. This paper estimates that a supercomputer sees one undetected error per day.
Instead of ECC it would also be possible to run the log machine redundantly and have each replica cross-check the others before making an update public. I assume the log calculation is deterministic.
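A minimal sketch of that cross-check, assuming the log head is a deterministic Merkle root over the entries (the function names are made up, and the split rule is simplified relative to RFC 6962):

    import hashlib

    def merkle_root(leaves: list[bytes]) -> bytes:
        """Deterministic Merkle root with RFC 6962-style leaf/node prefixes."""
        if len(leaves) == 1:
            return hashlib.sha256(b"\x00" + leaves[0]).digest()
        mid = len(leaves) // 2  # RFC 6962 splits at the largest power of two; simplified here
        return hashlib.sha256(
            b"\x01" + merkle_root(leaves[:mid]) + merkle_root(leaves[mid:])
        ).digest()

    def publish_if_consistent(replicas: list[list[bytes]]) -> bytes:
        """Each replica computed the tree independently; publish only on unanimity."""
        roots = {merkle_root(entries) for entries in replicas}
        if len(roots) != 1:
            raise RuntimeError("replica disagreement: one copy was corrupted")
        return roots.pop()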
"Forward error correction" is really just the application of ECC while transmitting data over a lossy link in order to tolerate errors without two-way communication.
The ECC used in memory is likely relatively space-inefficient in exchange for being computationally simple, so it can be done quickly in hardware. More redundancy could be added to tolerate more bit flips, but it would either add a lot of memory overhead or a lot of computational complexity. In particular, something really good like Reed-Solomon would likely be very difficult to encode on every single memory write, at least without taking a several-orders-of-magnitude performance hit. It would likely be easier just to have 2x ECC memory, or 3x non-ECC memory with majority voting.
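The voting option is simple at the bit level; a sketch of a 2-of-3 majority over three copies of a word:

    def majority_vote(a: int, b: int, c: int) -> int:
        """Per-bit 2-of-3 vote: each output bit agrees with at least two inputs."""
        return (a & b) | (b & c) | (a & c)

    word = 0xDEADBEEF
    corrupted = word ^ (1 << 7)  # one copy takes a single-bit flip
    assert majority_vote(word, corrupted, word) == word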
As this is a single bit-flip, why wasn't it corrected? Did ECC memory fail? Or was this bit-flip induced in the CPU pipeline, registers, or cache?
Do we need "RAID for ECC memory", where we halve user-accessible RAM and store each memory segment twice and check for parity?
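Worth noting that plain mirroring only detects a disagreement; to know which copy is right, each copy needs its own integrity check. A rough sketch of that idea (entirely hypothetical, not a real memory-controller scheme):

    import hashlib

    def tag(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()[:4]  # short per-copy integrity tag

    class MirroredSegment:
        """Store a segment twice with checksums; reads fall back to the good copy."""
        def write(self, data: bytes) -> None:
            self.copies = [(bytes(data), tag(data)) for _ in range(2)]

        def read(self) -> bytes:
            for data, t in self.copies:
                if tag(data) == t:  # first copy that still checks out wins
                    return data
            raise RuntimeError("both copies corrupted")

    seg = MirroredSegment()
    seg.write(b"log entry")
    data, t = seg.copies[0]
    seg.copies[0] = (b"X" + data[1:], t)  # simulate corruption of copy 0
    assert seg.read() == b"log entry"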