Cosmic-ray bit flipping is real and it has real security concerns. This also makes Intel's efforts at market segmentation by not having ECC support in any consumer CPUs [1] even more unforgivable and dangerous.
AMD is also doing market segmentation on their Ryzen APU series: PRO vs. non-PRO.
It should also be mentioned that Ryzen is a consumer CPU and you're stuck (mostly) with consumer motherboards, none of which tell you the level of ECC support they provide. Some motherboards do nothing with ECC! Yes, they "work" with it, but that means nothing. Motherboards need to state that they correct single-bit errors and detect double-bit errors.[1] None of the Ryzen motherboards say this. Not a single one that I could find.
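To make "correct single-bit errors and detect double-bit errors" (SECDED) concrete, here is a toy sketch using an extended Hamming(8,4) code. It is only an illustration: real ECC DIMMs protect 64-bit words with 8 check bits, and nothing in this thread says which code any particular memory controller uses. The function names `encode` and `decode` are made up for the example.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming (SECDED) codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0 holds the overall parity bit
    code[3], code[5], code[6], code[7] = d
    # Hamming parity bits sit at positions 1, 2 and 4.
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2         # makes the total number of 1s even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of the positions of all set bits
    parity_bad = sum(code) % 2 == 1

    if syndrome == 0 and not parity_bad:
        status = "ok"
    elif parity_bad:
        # Odd number of flips: assume one. The syndrome is its position
        # (0 means the overall parity bit itself was hit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome but even overall parity: two bits flipped.
        return "uncorrectable", None

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble
```

Flipping any one bit of `encode(n)` decodes back to `n` with status `corrected`; flipping any two yields `uncorrectable`. That second case is the one a board has to surface to the OS for the feature to be worth anything.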
Maybe ASRock Rack, but that's a workstation/server motherboard, and it goes for $400-600. You think that $50 Gigabyte motherboard is doing the right thing regarding ECC? That's a ton of faith right there.
Consumer Ryzen CPUs may support ECC, but that's meaningless without motherboards testing it and documenting their support of it. So no, Ryzen really does not support ECC if you ask me.
DDR5 has a modicum of ECC so things might slowly improve. Maybe DDR6 will be full ECC and we will no longer have this market segmentation in the 2030s. Wow, that’s a long time though.
PS: Why didn't Apple do the right thing with the M1? My guess is the availability of the memory, which again points to changing it at the memory spec level.
Also, ECC RAM is technically supported on AMD's recent consumer platform, although it's not advertised as such since they don't do validation testing for it.
I've read that reporting of ECC events is not supported on consumer Ryzen. It's not a complete solution, and since unregistered ECC is being used, how can you even be sure the memory controller is doing any error correction at all?
Someone would need to induce memory errors and publish their results. I'd love to read it.
This is 4 years old now, but it does produce some interesting results.
The author of that article doesn't have hands-on experience with ECC DRAM, and mistakenly concludes that ECC on Ryzen is unreliable because of a misunderstanding of how Linux behaves when it encounters an uncorrected error. However, the author at least includes screenshots which show ECC functionality on Ryzen working properly.
> ...since unregistered ECC is being used, how can you even be sure the memory controller is doing any error correction at all?
ECC is performed by the memory controller, and requires an extra memory device per rank to supply 8 extra bits alongside every 64 data bits (a 72-bit bus), which unbuffered ECC DIMMs provide.
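As a back-of-the-envelope check on why 8 extra bits go with a 64-bit word: a Hamming code over m data bits needs r check bits with 2^r >= m + r + 1, plus one overall parity bit for double-error detection. The helper below is a sketch of that arithmetic only (the name `secded_check_bits` is invented here), and real controllers may use different or stronger codes.

```python
def secded_check_bits(data_bits):
    """Check bits needed for a Hamming-based SECDED code over `data_bits` bits."""
    r = 1
    # Smallest r whose syndrome can name every bit position plus "no error".
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1  # one more overall parity bit to detect double-bit errors


# secded_check_bits(64) -> 8, i.e. a 72-bit-wide ECC DIMM
# secded_check_bits(32) -> 7
```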
Registered memory has nothing to do with ECC (although in practice, registered DIMMs almost always have ECC support). It's simply a mechanism to reduce electrical load on the memory controller to allow for the usage of higher-capacity DIMMs than what unbuffered DIMMs would allow.
With respect to Ryzen, Zen's memory controller architecture is unified, and owners of Ryzen CPUs use the same memory controller found in similar-generation Threadripper and EPYC processors (just fewer of them). Although full ECC support is not required on the AM4 platform specifically (it's an optional feature that can be implemented by the motherboard maker), it's functional and supported if present. Indeed, there are several Ryzen motherboards aimed at professional audiences where ECC is an explicitly advertised feature of the board.
ECC reporting is part of the memory controller (which is unified across all Zen architecture parts), and is fully supported and functional. You can see the reporting working as expected within the Hardware Canucks article linked in the grandparent.
The article you linked mentions that ECC reporting is not working with the on-board IPMI controller (which presumably means that ECC events aren't being logged in the SEL). While that might be a limitation of this board (and other IPMI-equipped AM4 boards), reporting from within the operating system will still work.
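For anyone who wants to check reporting from within the OS on their own box: on Linux, the EDAC subsystem exposes per-memory-controller counters under sysfs once an EDAC driver is loaded. A minimal sketch for reading them follows. It assumes the usual `/sys/devices/system/edac/mc/mcN/ce_count` and `ue_count` layout, and `read_edac_counts` is just a name chosen for this example. An empty result means no EDAC driver is loaded, which by itself says nothing either way about whether the hardware is correcting errors.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {'corrected': n, 'uncorrected': n}} from EDAC sysfs."""
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            corrected = int((mc / "ce_count").read_text())
            uncorrected = int((mc / "ue_count").read_text())
        except (OSError, ValueError):
            continue  # counter missing or unreadable; skip this controller
        counts[mc.name] = {"corrected": corrected, "uncorrected": uncorrected}
    return counts


if __name__ == "__main__":
    result = read_edac_counts()
    if not result:
        print("No EDAC memory controllers found (is an EDAC driver loaded?)")
    for name, c in result.items():
        print(f"{name}: corrected={c['corrected']} uncorrected={c['uncorrected']}")
```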
I don't understand the issue with market segmentation here. I can absolutely see the reason why all of my servers should have ECC, but I don't see why my gaming PC should (or even my work development machine). What's the worst case impact of the (extremely rare) bit-flip on one of those machines?
Example: bitsquatting on domains [2].
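To illustrate what bitsquatting means in practice, here is a rough sketch that lists the hostnames reachable from a given name by a single bit flip in one character. It assumes a lowercase ASCII hostname, and `bitsquat_variants` is a name made up for this example. It makes no attempt to check whether the resulting TLD exists.

```python
import string

HOSTNAME_CHARS = set(string.ascii_lowercase + string.digits + "-")


def bitsquat_variants(domain):
    """Return hostnames that differ from `domain` by one flipped bit."""
    variants = set()
    for i, ch in enumerate(domain):
        if ch == ".":
            continue  # leave label separators alone
        for bit in range(8):
            flipped = chr(ord(ch) ^ (1 << bit))
            if flipped not in HOSTNAME_CHARS:
                continue  # result is not a legal hostname character
            candidate = domain[:i] + flipped + domain[i + 1:]
            labels = candidate.split(".")
            # Labels may not begin or end with a hyphen.
            if any(l.startswith("-") or l.endswith("-") for l in labels):
                continue
            variants.add(candidate)
    return sorted(variants)
```

For `example.com` this includes names like `dxample.com` ('e' is 0x65, and flipping the lowest bit gives 0x64, 'd'). A machine without ECC that corrupts a cached hostname in RAM can end up talking to whoever registered such a name.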
[1]: https://arstechnica.com/gadgets/2021/01/linus-torvalds-blame...
[2]: https://nakedsecurity.sophos.com/2011/08/10/bh-2011-bit-squa...