These things exist under trade names like Chipkill or lockstep memory, though they don't need to sacrifice half of the memory chips to get good error-recovery properties.
Note that this is still not end-to-end protection of data integrity. Bit flips happen in networking, storage, buses between everything, caches, CPUs, etc. See e.g. [1].
According to the Intel developer's manual, L1 has parity and all caches up from that have ECC. This would seem to imply that the ring/mesh also has at least parity (to retry on error). Parity instead of ECC on the L1D makes sense, since the L1D has to handle small writes well, while the other caches deal in whole lines.
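As a toy illustration of why parity is cheap (not how the cache actually wires it up): a single even-parity bit per word detects any odd number of flipped bits, but can't locate or correct them.

    # Toy even-parity check in Python; real hardware computes this per byte/word
    # with an XOR tree, but the idea is the same.
    def parity(word: int) -> int:
        """Even-parity bit: 1 if the word has an odd number of set bits."""
        return bin(word).count("1") & 1

    stored = 0b10110100
    stored_parity = parity(stored)

    # A single-bit flip changes the popcount by one, so the parity bit mismatches.
    corrupted = stored ^ 0b00001000
    assert parity(corrupted) != stored_parity  # detected, but not correctable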
Checksumming filesystems and file-transfer protocols cover many cases: scp, rsync, Btrfs, and ZFS all address this problem.
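A minimal sketch of that kind of check, assuming you hash the file on both ends yourself instead of trusting the transport (the paths here are hypothetical):

    import hashlib

    def sha256_of(path: str) -> str:
        """Hash a file in chunks so large files don't have to fit in RAM."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    # Compute before sending, recompute after receiving; any corruption in
    # transit or at rest shows up as a mismatch.
    assert sha256_of("sent/data.bin") == sha256_of("received/data.bin")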
As for guaranteeing the computed data is correct: I know space systems often have two redundant computers that calculate everything and compare results. It's crazy expensive and power-hungry, but it all but solves the problem.
Usually they have an odd number. The Space Shuttle had five, of which the fifth was running completely different software. In case of a 2/2 split the crew could shut down a pair (in case the failure was clear) or switch to the backup computer.
These are all very fair statements, but there's no guarantee that ECC memory was even used. Computers typically fail open here: if ECC is supported but not actually present, the system silently runs without it.
People also cite early-stage Google and intentionally skip buying ECC components, running consumer hardware for production workloads.
Public CAs are not that type of people; I would be disappointed if they were not running two separate systems checking each other for consistency. Having top-of-the-range ECC running well inside its specification must be table stakes.
Not hundreds. There are currently 52 root CA operators trusted by Mozilla (and thus Firefox, but also most Linux systems and lots of other stuff); a few more are trusted only by Microsoft or Apple, but not hundreds.
But also, in this context we aren't talking about the CAs anyway, but the log operators, and for them reliability is about staying qualified, as otherwise their service is pointless. There are far fewer of those, about half a dozen total: Cloudflare, Google, DigiCert, Sectigo, ISRG (Let's Encrypt), and TrustAsia.
[Edited, I counted a column header, 53 rows minus 1 header = 52]
It’s always humorous to me when people use the term theology in situations such as this; it makes me wonder, as human mental bandwidth becomes more strained and we increase our specializations to the n-th degree, what will constitute theology in the future?
The future is already here, and we call it consensus. Trusting your peers, believing they are honest and proficient in their respective fields, is a natural human response to unknown phenomena.
The benevolent and the malevolent rogue AI, eluding capture or control by claiming critical infrastructure. A few generations of humans will pass, and the deities in the outlets will come to be.
IIRC, ECC memory can correct single-bit flips. It can detect (and warn or fail) on double-bit flips. And it cannot reliably detect triple-bit flips. This might be a simplified understanding, but if this has happened only once, that seems to match up with my intuitive understanding of the probability of a triple-bit flip occurring in a particular system.
You multiply the probabilities of two independent random events to get the probability they will happen at the same time. If the probability of a bit flip is 10^-18, then two would be 10^-36 and three would be 10^-54.
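The arithmetic, under the (strong) assumption that each flip is an independent event:

    # Probability of k simultaneous flips with per-bit flip probability p,
    # assuming independence; illustrative numbers only.
    p = 1e-18
    for k in (1, 2, 3):
        print(k, p ** k)  # 1e-18, 1e-36, 1e-54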
At some point it becomes a philosophical question of how much of the tails of the distribution can be tolerated. We've never seen a quantum fluctuation make a whale appear in the sky.
DRAM failures are not independent events, so it’s not appropriate to multiply the probabilities like that. Faults are often clustered in a row, column, bank, page or whatever structure your DRAM has, raising the probability of multi-bit errors.
I don't see why a high-energy particle strike would confine itself to a single bit. The paper I posted elsewhere in this thread says that "the most likely cause of the higher single-bit, single-column, and single-bank transient fault rates in Cielo is particle strikes from high-energy neutrons". In the paper, both single-bit and multi-bit errors are sensitive to altitude.
A single particle strike would only affect a single transistor. If that transistor controls a whole column of memory, then sure it could corrupt lots of bits. With ECC, though, it would probably result in a bunch of ECC blocks with a single bit flip, rather than a single ECC block with several bit flips.
Process enough data and even ECC can - and will - fail undetected. Any kind of mechanism you come up with is going to have some rate of undetected errors.
Given the rate required for this, it's not a reasonable assumption. It's like saying Amazon sees SHA-256 collisions between S3 buckets. It just doesn't happen in practice.
Undetected ECC errors are common enough to see from time to time in the wild. This paper estimates that a supercomputer sees one undetected error per day.
Instead of ECC it would also be possible to run the log machine redundantly and have each replica cross-check the others before making an update public. I assume the log calculation is deterministic.
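A minimal sketch of that cross-check, assuming the log head is a deterministic Merkle root over the entries (the function names are made up, and the split rule is simplified relative to RFC 6962):

    import hashlib

    def merkle_root(leaves: list[bytes]) -> bytes:
        """Deterministic Merkle root with RFC 6962-style leaf/node prefixes."""
        if len(leaves) == 1:
            return hashlib.sha256(b"\x00" + leaves[0]).digest()
        mid = len(leaves) // 2  # RFC 6962 splits at the largest power of two; simplified here
        return hashlib.sha256(
            b"\x01" + merkle_root(leaves[:mid]) + merkle_root(leaves[mid:])
        ).digest()

    def publish_if_consistent(replicas: list[list[bytes]]) -> bytes:
        """Each replica computed the tree independently; publish only on unanimity."""
        roots = {merkle_root(entries) for entries in replicas}
        if len(roots) != 1:
            raise RuntimeError("replica disagreement: one copy was corrupted")
        return roots.pop()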
"Forward error correction" is really just the application of ECC while transmitting data over a lossy link in order to tolerate errors without two-way communication.
The ECC used in memory is likely relatively space-inefficient in exchange for being computationally simple, so it can be done quickly in hardware. More redundancy could be added to tolerate more bit flips, but it would either add a lot of memory overhead or a lot of computational complexity. In particular, something really good like Reed-Solomon would likely be very difficult to encode on every single memory write, at least without taking a several-orders-of-magnitude performance hit. It would likely be easier just to have 2x ECC memory, or 3x non-ECC memory with majority voting.
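The voting option is simple at the bit level; a sketch of a 2-of-3 majority over three copies of a word:

    def majority_vote(a: int, b: int, c: int) -> int:
        """Per-bit 2-of-3 vote: each output bit agrees with at least two inputs."""
        return (a & b) | (b & c) | (a & c)

    word = 0xDEADBEEF
    corrupted = word ^ (1 << 7)  # one copy takes a single-bit flip
    assert majority_vote(word, corrupted, word) == word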
As this is a single bit-flip, why wasn't it corrected? Did ECC memory fail? Or was this bit-flip induced in the CPU pipeline, registers, or cache?
Do we need "RAID for ECC memory", where we halve user-accessible RAM and store each memory segment twice and check for parity?
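Worth noting that plain mirroring only detects a disagreement; to know which copy is right, each copy needs its own integrity check. A rough sketch of that idea (entirely hypothetical, not a real memory-controller scheme):

    import hashlib

    def tag(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()[:4]  # short per-copy integrity tag

    class MirroredSegment:
        """Store a segment twice with checksums; reads fall back to the good copy."""
        def write(self, data: bytes) -> None:
            self.copies = [(bytes(data), tag(data)) for _ in range(2)]

        def read(self) -> bytes:
            for data, t in self.copies:
                if tag(data) == t:  # first copy that still checks out wins
                    return data
            raise RuntimeError("both copies corrupted")

    seg = MirroredSegment()
    seg.write(b"log entry")
    data, t = seg.copies[0]
    seg.copies[0] = (b"X" + data[1:], t)  # simulate corruption of copy 0
    assert seg.read() == b"log entry"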