
Sure, but to write 1 GB you stream 1 GB from RAM -> CPU in either case. With software RAID you do the parity calcs (~60 GB/s per core) and then write 1.3 GB/s to the storage controller. It just doesn't seem like much of a difference: the CPU overhead is near zero (actual I/O divided by 64 cores x 60 GB/s), and writing an extra third for the redundancy data seems in the noise for normal server loads.
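As a back-of-the-envelope check (the ~60 GB/s per-core parity rate and the 64-core count are the figures assumed above, not measurements):

```python
# Rough estimate of software-RAID CPU overhead, using the assumed figures:
# parity math at ~60 GB/s per core, ~1.3 GB/s written to the storage controller.
parity_rate_per_core = 60e9   # bytes/s of parity throughput per core (assumed)
write_rate = 1.3e9            # bytes/s streamed to the storage controller

core_fraction = write_rate / parity_rate_per_core     # fraction of one core
machine_fraction = core_fraction / 64                 # spread over 64 cores

print(f"{core_fraction:.1%} of one core")        # ~2.2% of one core
print(f"{machine_fraction:.3%} of the machine")  # ~0.034% of a 64-core box
```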

Not to mention I'd expect the parity calculations to be MUCH slower on the NVMe controllers.



Your assumption that, from a memory perspective, the stream changes from "1 GB RAM read -> write to disk" to "1 GB RAM read -> calculation -> write to disk" does not hold. There are intermediate forms of the data that end up being written back to RAM before going to disk. This is what the article is talking about here:

> upwards of 90% reduction in system DRAM utilization


My understanding is that it's something like:

      stripe = read_from_ram(ptr)        # usually between 128 KB and 256 KB
      blobs = do_raid_calc(stripe)       # blobs total 25% to 33% larger than stripe
      for i, drive in enumerate(drives):
          write(drive=drive, data=blobs[i])
The above should be relatively cache friendly; my Zen 4 desktop (one generation old) has 128 MB of L3 cache, enough for ~1000 stripes.
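A concrete (if deliberately slow, pure-Python) sketch of what do_raid_calc could look like for XOR parity. Real md/raid kernels use SIMD-accelerated XOR, so this only shows the shape of the computation:

```python
def do_raid_calc(stripe: bytes, n_data: int = 3) -> list[bytes]:
    """Split a stripe into n_data chunks and append one XOR parity chunk,
    RAID-5 style. With 3 data chunks the output is 33% larger than the input."""
    chunk_len = len(stripe) // n_data
    chunks = [stripe[i * chunk_len:(i + 1) * chunk_len] for i in range(n_data)]
    parity = bytearray(chunk_len)
    for chunk in chunks:
        for i, b in enumerate(chunk):
            parity[i] ^= b
    return chunks + [bytes(parity)]

blobs = do_raid_calc(b"\x01\x02\x03\x04\x05\x06")
# Any single lost data chunk can be rebuilt by XORing the survivors with parity.
```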

> upwards of 90% reduction in system DRAM utilization

That seems unbelievable: most RAM isn't spent on anything I/O related, let alone RAID related. Now, if it's a 90% reduction in the DRAM used by RAID, sure. But that seems like a very small fraction of all RAM.

Even if 10,000 stripes are in flight simultaneously to 100s of drives, that's only 2.5 GB, or 1% of a server's RAM (256 GB or more seems common). Especially since 2/3rds of that would be in RAM even with hardware RAID. It's not like the buffer/page cache, which might reach 50% of RAM, holds the extra RAID data.
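Running those numbers (10,000 in-flight stripes and a 256 KB stripe size, both assumptions from earlier in this thread):

```python
stripes_in_flight = 10_000
stripe_size = 256 * 1024        # bytes, top of the 128-256 KB range
server_ram = 256 * 1024**3      # bytes; 256 GB "seems common" per the above

in_flight = stripes_in_flight * stripe_size
print(f"{in_flight / 1024**3:.2f} GiB")        # ~2.44 GiB in flight
print(f"{in_flight / server_ram:.1%} of RAM")  # ~1.0% of server RAM
```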


> 128MB of L3 cache

Sure, if you use an X3D chip with the largest L3 cache accessible to a single core of any currently available option, you can dedicate all 128 MB of it to the write buffer for your disks instead of letting the work be offloaded. Valid option, just as cool. I have a non-X3D 7950X, so jealous though ;).

You've also got the case of needing to read data back up from the disks for modifications to sectors not cached by the system, so the CPU can perform the parity calc over the whole stripe and issue the appropriate writes. That's particularly bad for non-sequential writes.

> if it's 90% reduction in system DRAM utilization by RAID

Yes, this - not the other. It's achieved by not writing things back to RAM again before they hit the flash pool.


> Sure, if you use X3D chip

Ah, sorry, lscpu shows: L3: 64 MiB (2 instances)

I originally thought that meant 64 MB x 2, but it means 64 MB total (32 MB x 2). Still, 64 MB is 500 times larger than a 128 KB stripe, I/O normally spreads across many cores, and cache should only be needed for stripes that are in flight. Servers, normally with 5x or more cores than my 12-core desktop and way more bandwidth (24 memory channels instead of my 2), will have much more cache and much more bandwidth.

> Yes, this - not the other. It's achieved by not writing things back to RAM again before they hit (comparatively slow to RAM) flash pool.

Why should the stripes be written back to RAM? The write enters kernel space (write is a system call), then the software RAID driver does the calculation and then writes to the device's memory space. The PCIe-connected NVMe controller is not cache coherent and can't safely read main memory, which might be cached.

I took a closer look at the original post; they seem to be considering tiny writes, which require a read/modify/write. That operation is pretty inefficient, and Linux tries to avoid it with caching, but it's certainly needed sometimes. I've not seen any analysis of what fraction of I/O on production RAID systems is R/M/W rather than a plain read or write.

Even in the R/M/W case, a stripe is read by the software RAID driver, the write is masked onto the stripe, and a new checksum is calculated. Then the stripe is sent back to the I/O space of each involved NVMe controller. So a 4 KB write (a common minimum size) requires reading 128-256 KB, doing the checksum, and writing it back to the devices.
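That flow can be sketched with in-memory stand-ins for the drives. This is a toy two-data-chunk stripe; real md code moves the same data through a stripe cache and bios, and the sizes and names here are illustrative only:

```python
def rmw_write(stripe: bytearray, parity: bytearray,
              offset: int, new_data: bytes) -> None:
    """Read/modify/write on a parity stripe: mask the small write onto the
    (re-read) stripe, then recompute XOR parity over the whole stripe.
    Here the stripe holds two data chunks laid out back to back."""
    stripe[offset:offset + len(new_data)] = new_data   # mask the small write on
    half = len(stripe) // 2
    for i in range(half):                              # parity = chunk0 ^ chunk1
        parity[i] = stripe[i] ^ stripe[half + i]

stripe = bytearray(8)       # stands in for the 128-256 KB read back from disks
parity = bytearray(4)
rmw_write(stripe, parity, 0, b"\xff\x0f")
# stripe and parity would now both be written back out to the NVMe devices.
```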

It does tip the scales more toward hardware RAID, but that's always been true of the R/M/W case, and hardware RAID very often ends up slower than software RAID for the previously discussed reasons.


Say it were a 6 disk pool and you add an object to a database (with the goal of doing many of these as fast as possible with fsync to the disks):

- Receive the new data

- Read the multiple disks to get the current stripe(s) associated with it.

- Calculate the new parity

- Issue the multiple writes

- Wait for completion, clear that from RAM

Looking at a single write it doesn't seem so bad. You take something like ~128 KB in from the disks per stripe (the pieces arrive at ever so slightly different times and are held while that thread stalls before the calc), issue a bunch of writes, and wait for those to clear while the result remains in memory (cache or RAM); then you can clear it out and that thread/coroutine can process the next one. "Just" 3 GB/s is ~23,000 of those per second: multiple reads into RAM, parity writes into RAM (well, unless you can keep it all in a massive L3 by keeping queue depths low), and caching until it's spat out onto the drives. On a normal non-parity setup your data just sits and goes to disk, with no intermediate reads or writes.
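Sanity-checking the ~23,000/s figure (3 GB/s target and 128 KB stripes, as assumed above):

```python
target_rate = 3e9            # bytes/s of small fsync'd writes to sustain
stripe_size = 128 * 1024     # bytes of stripe traffic per R/M/W cycle

ops_per_sec = target_rate / stripe_size
print(round(ops_per_sec))    # ~22,888 R/M/W cycles per second
```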

This may not make sense on a home box, but consider the approach more an alternative to solutions like https://www.graidtech.com/product/sr-1000/ which are single cards that can deliver a million RAIDed write IOPS at near 100 GB/s in a single PCIe slot with no additional load on the CPU. Just writing 100 GB/s takes a CPU core and most of the RAM bandwidth from a raw data creation/parsing perspective, before even talking about writing it to disk; it's a different problem from, e.g., what the bandwidth looks like on a home NAS pool. This type of approach tries to do something similar without the extra device in between the cards and the server.

Sometimes you also want to take the above approach and scale it out over many 100G/400G Ethernet ports, so your flash storage pools are reachable over the network, separate from the compute nodes. Here the goal is to make that storage solution as dense, fast, and efficient as possible: you might want to load as much storage as you can onto a single node until it saturates the bandwidth to the CPU. If you can do that without doubling data back through the CPU, you can scale that much better.


I guess the interesting use case would be to combine this with other hardware accelerators and do DMA between devices, e.g., stream network data directly to the RAID without ever touching the main CPU, after some initial setup work.





