I have an NVME device that very rarely literally (but figuratively) falls off the PCI port and disappears [0]. It is one of several Physical Volumes (PV) in a Logical Volume Management (LVM) Volume Group (VG) that backs several RAID-1 mirror Logical Volumes (LV).
When it drops off file-systems writes to the LVs are blocked and reads can also fail but the system survives sufficiently to do a controlled power off/on that recovers it.
In some cases the LVs pair up a spinning disk with the NVME but due to how I've configured the LV the spinner is read-mostly and the NVME is write-mostly (RAID member syncing is delayed and in background). There isn't too much noticeable latency except for things like `git log -Sneedle` - and worth it for the resilience.
[0] first time it happened it was spiders that had taken up residence around the M2 header and CPU (nice and warm!) and causing dust trails allowing current leakage between contacts (yes, I did do microscopic examination because I could not identify any other cause) that a simple blast with the air-compressor resolved. Later incidents turn out to be physical stress due to extreme thermal expansion and contraction as best as I can tell - ambient air temperature can fluctuate from 14C to 40C and back over 18 hours. Re-seating the M2 adapter fixes it for a a few months before it starts again! All NVME SMART self-tests pass; the failure is of the link not the storage - effectively being removed from the PCIe port. Firmware was at one stage suspected, although it had been fine for a couple of years on the same version, but updates haven't changed it in any way. ASPM is disabled.
When it drops off file-systems writes to the LVs are blocked and reads can also fail but the system survives sufficiently to do a controlled power off/on that recovers it.
In some cases the LVs pair up a spinning disk with the NVME but due to how I've configured the LV the spinner is read-mostly and the NVME is write-mostly (RAID member syncing is delayed and in background). There isn't too much noticeable latency except for things like `git log -Sneedle` - and worth it for the resilience.
[0] first time it happened it was spiders that had taken up residence around the M2 header and CPU (nice and warm!) and causing dust trails allowing current leakage between contacts (yes, I did do microscopic examination because I could not identify any other cause) that a simple blast with the air-compressor resolved. Later incidents turn out to be physical stress due to extreme thermal expansion and contraction as best as I can tell - ambient air temperature can fluctuate from 14C to 40C and back over 18 hours. Re-seating the M2 adapter fixes it for a a few months before it starts again! All NVME SMART self-tests pass; the failure is of the link not the storage - effectively being removed from the PCIe port. Firmware was at one stage suspected, although it had been fine for a couple of years on the same version, but updates haven't changed it in any way. ASPM is disabled.