
On Btrfs, you can mark a folder/file/subvolume as nocow, which has the effect of only doing a COW operation when you are creating snapshots.


And that may work for btrfs, but again at some cost:

"When you enable nocow on your files, Btrfs cannot compute checksums, meaning the integrity against bitrot and other corruptions cannot be guaranteed (i.e. in nocow mode, Btrfs drops to similar data consistency guarantees as other popular filesystems, like ext4, XFS, ...). In RAID modes, Btrfs cannot determine which mirror has the good copy if there is corruption on one of them."[0]

[0]: https://wiki.tnonline.net/w/Blog/SQLite_Performance_on_Btrfs...


Yup. It’s a pretty fundamental thing. COW and data checksums (and usually automatic/inline compression) co-exist that way because doing it any other way is too expensive performance-wise, and potentially dangerous corruption-wise.

For instance, if you modify a single byte in a large file, you need to update the data on disk as well as the checksum in the block header and other related metadata. Chances are these live in different sectors, and you also have to re-read all the other data in the block to recompute the checksum. Anywhere in that process is a chance to corrupt both the original data and the update.

If the byte changes the final compressed size, it may not fit in the current block at all, causing an expensive (or impossible) re-allocation.

You could end up with the original data and update both invalid.

Writing out a new COW block is done all at once, and if it fails, the write failed atomically, with the original data still intact.
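To make the contrast concrete, here is a toy sketch in Python (purely illustrative, with made-up names; no real filesystem lays data out like this). The in-place path has to re-read the whole block to recompute the checksum and has a window where data and checksum disagree; the COW path checksums and writes the new copy elsewhere, and the only "commit" is a single pointer update:

```python
import zlib

BLOCK = 4096  # illustrative block size

class ToyStore:
    """Toy block store: blocks[phys] holds (data, crc32)."""
    def __init__(self):
        self.blocks = {}          # physical block number -> (bytes, crc)
        self.extent_map = {0: 0}  # logical block -> physical block

    def write_in_place(self, phys, offset, byte):
        # Must re-read the whole block to recompute the checksum,
        # then overwrite data and checksum in place. A crash between
        # the two leaves a block that fails verification.
        data, _ = self.blocks[phys]
        new = data[:offset] + byte + data[offset + 1:]
        self.blocks[phys] = (new, zlib.crc32(new))

    def write_cow(self, logical, offset, byte):
        # Checksum and write the modified copy to a fresh location;
        # the old block stays intact until the single pointer flip.
        data, _ = self.blocks[self.extent_map[logical]]
        new = data[:offset] + byte + data[offset + 1:]
        new_phys = max(self.blocks) + 1
        self.blocks[new_phys] = (new, zlib.crc32(new))
        self.extent_map[logical] = new_phys  # the one "atomic" step

store = ToyStore()
store.blocks[0] = (bytes(BLOCK), zlib.crc32(bytes(BLOCK)))
store.write_cow(0, 10, b"X")
assert store.blocks[0][0] == bytes(BLOCK)              # original copy untouched
assert store.blocks[store.extent_map[0]][0][10:11] == b"X"
```

If the process dies anywhere before the `extent_map` assignment, readers still see the old, self-consistent block; that's the atomicity the parent comment is describing.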


> Chances are, these are in different sectors, and also require re-reading in all the other data in the block to compute the checksum. Anywhere in that process is a chance for corruption of the original data and the update.

Not much different from any interrupted write, though. And a COW write needs to re-read just as much.

> If the byte changes the final compressed size, it may not fit in the current block at all, causing an expensive (or impossible) re-allocation.

That's a cost you must always pay in a COW filesystem anyway, and it's handled by other non-COW filesystems anyway.

Just because a filesystem isn't COW doesn't mean every change needs to be in place either. Of course, a filesystem that is primarily COW might not want to maintain compression for non-COW edge-cases and that is quite reasonable.


Literally none of what you are saying is true.

A COW write only needs to checksum the newly written bytes. A non-COW filesystem needs to checksum all data contained in the block (the unchanged prior data together with the new values).

Additionally, a non-COW filesystem needs to update all metadata checksums/values for existing blocks. It's a much more pathological case (interrupted-write-wise) than a COW filesystem: if it writes the data but hasn't written the checksum yet, the block is now corrupt. If it writes the checksum but not the data, the block is now corrupt. And there is no way to know which one is correct after the fact without storing both the old data and the new data somewhere, which has the same overhead as COW, or worse.

And the data in a FS block is usually a large multiple of the sector size, which makes writes pretty hard to do in any sort of atomic way. Journaling helps, but not with performance here, since you'd need to store the prior values plus the new values, or you're still guaranteed to lose data.

Compression-wise, this isn't (as much of) an issue for COW filesystems, because they only need to compress the newly written data, which can be allocated without concern for the previous allocation size, which is still there, allocated. It can mean less efficient compression if these are small, fragmented writes of course, which is why most of them have some sort of batching mechanism in place. Alternatively, it can copy a chunk of the block, though that can cause write amplification, and is usually minimized.

But you don't run across potential pathological fragmentation issues, like where you compress prior blocks which now take significantly less space, or new blocks take more space and require reshuffling everywhere.


>A COW write only needs to checksum the newly written bytes. A non-cow filesystem needs to checksum all data contained in the block (unchanged prior with now new values).

No, a COW filesystem will re-read the entire block, make the change, update the checksum, and then write it back (to a new location, obviously). That's way more than the newly written bytes, but way less than the entire file of course, just as with a non-COW fs.

>If it writes the checksum, but not the data, the block is now corrupt. And there is no way to know which one is correct post-facto, without storing the old data and the new data somewhere.

Which is exactly what the journal is for - and you already have a journal. Or you don't update it in-place. Just because it isn't COW doesn't mean you always have to update in place.

>Compression wise, this isn't (as much) of an issue for COW filesystems, because it only needs to compress the newly written data, which can be allocated without concern to the previous allocation size, which is still there, allocated. It can mean less efficient compression if these are small, fragmented writes of course, which is why most of them have some sort of batching mechanism in place. Alternatively, it can copy a chunk of the block, though that can cause write amplification, and is usually minimized.

You can do exactly the same for a non-COW filesystem.

>But you don't run across potential pathological fragmentation issues, like where you compress prior blocks which now take significantly less space, or new blocks take more space and require reshuffling everywhere.

COW has the same issue. COW always needs to "reshuffle"(?) data somewhere.


I think you don't understand COW or non-COW filesystems?

They don't work the way you are asserting.


I could say the same. Be more specific please. If everything I've said is wrong it should be easy to point out something demonstrably false.


For one, you get no performance benefit over non-COW unless you update in place. It’s what every ‘fast and easy’ filesystem has to do: FAT (including exFAT), ext3, ext4, etc.

The failure modes are well documented, and in many cases got worse when trying to work around the performance hit of journaling; the journal doesn’t resolve the issue fully because it can’t store all the data it needs without making the performance problems worse. See https://en.m.wikipedia.org/wiki/Ext4 under ‘Delayed allocation and data loss’ for one example.

This isn’t a solved (or likely solvable in a reasonable way) problem with non-COW filesystems, which is one of the reasons why all newer filesystems are COW. The other being latency hits from tracking down COW delta blocks aren’t a big issue now due to SSDs and having enough RAM to have decent caches and pre-allocation buffers.

Also, COW doesn’t need to allocate (or re-read/re-checksum) the entire prior block when someone changes something, unlike modify in place. Due to alignment issues, doing SOME usually makes sense, but it’s highly configurable.

It only needs to add new metadata with updated mapping information for the updated range in the file, and then checksum/write out the newly updated data (only), plus or minus alignment issues or whatever. It acts like a patch. That’s literally the whole point of COW filesystems.

Update-in-place has an already allocated block it has to deal with in real time: either the data now consumes less space in its already-allocated area (leaving tiny fragments), or the filesystem has to allocate a new block and toss the old one. That has worse real-time performance than a COW system, because it’s doing the new block allocation (which covers more space than a COW write, unless the COW write replaces the entire block’s contents!) plus going back and removing the old block.

ZFS record size for instance is just the maximum size of one of the patch ‘blocks’. The actual records are only the size of the actual write data + Filesystem overhead.

ZFS only then goes back and removes old records when they aren’t referenced by anyone, which is typically async/background, and doesn’t need to happen as part of the write itself.

This allows freeing up entire regions of pool space easier, and fragmentation becomes much less of an issue.
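That "acts like a patch" model can be sketched in toy Python (names are made up for illustration; real ZFS records, space maps, and checksum trees are far more involved): each write appends a new extent covering just the written range, reads resolve the newest extent covering each byte, and fully shadowed old extents are reclaimed later, off the write path.

```python
class PatchFile:
    """Toy extent-patch model of COW writes."""
    def __init__(self, size):
        self.size = size
        self.extents = []   # (offset, data) pairs, newest last
        self.garbage = []   # shadowed extents, reclaimed asynchronously

    def write(self, offset, data):
        # Only the new bytes are checksummed/written; nothing is read back,
        # and no existing extent is touched.
        self.extents.append((offset, data))

    def read(self, offset, length):
        out = bytearray(length)
        for off, data in self.extents:   # later (newer) extents win
            lo = max(offset, off)
            hi = min(offset + length, off + len(data))
            if lo < hi:
                out[lo - offset:hi - offset] = data[lo - off:hi - off]
        return bytes(out)

    def gc(self):
        # Async cleanup: drop extents fully shadowed by newer ones.
        live, covered = [], []
        for off, data in reversed(self.extents):
            span = (off, off + len(data))
            if any(c[0] <= span[0] and span[1] <= c[1] for c in covered):
                self.garbage.append((off, data))
            else:
                live.append((off, data))
                covered.append(span)
        self.extents = list(reversed(live))

f = PatchFile(1 << 20)
f.write(0, b"a" * 100)
f.write(50, b"b" * 10)               # only 10 new bytes written
assert f.read(45, 10) == b"aaaaabbbbb"
```

The write path never reads or moves old data; reclamation (`gc` here) happens out of band, which is the point being made about freeing records in the background.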


>For one, you get no performance benefit over non-cow unless you update in place. It’s what every ‘fast and easy’ filesystem has to do - fat (including exfat), ext3, ext4, etc.

That is just a matter of priorities then. And just because you might opt to not update in place in some situations doesn't mean that you can never do it.

I'm not sure what you mean by "Delayed allocation and data loss"; I don't find it relevant to this discussion at all, since that isn't about filesystem corruption but application data corruption. And COW also suffers from this, unless you have NILFS-style automatic continuous snapshots. Now with COW you probably have a much greater chance of recovering the data with forensic tools (also discussed in this thread regarding ZFS), but that comes with huge downsides and is hardly a relevant argument for COW in the vast majority of use cases anyway.

ZFS's minimum block size corresponds to the disk sector size, so for most practical purposes it is the same as your typical non-COW filesystem there: writing 1 byte requires you to read 4 kB, update it in memory, recalculate the checksum, and then write it back out.
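That read-modify-write cost can be tallied in a toy sketch (illustrative numbers only, assuming a 4 KiB block with a 32-bit checksum stored alongside it):

```python
import zlib

BLOCK = 4096

def update_byte(block, offset, byte):
    """Return (new_block, new_crc, bytes_read, bytes_written)."""
    bytes_read = len(block)                 # whole block re-read
    new = block[:offset] + byte + block[offset + 1:]
    crc = zlib.crc32(new)                   # recomputed over the full block
    bytes_written = len(new) + 4            # block + 32-bit checksum
    return new, crc, bytes_read, bytes_written

blk = bytes(BLOCK)
_, _, r, w = update_byte(blk, 1, b"Z")
print(r, w)   # 4096 4100: a 1-byte logical change costs a full block round trip
```

Whether the rewritten block then lands in place or at a new location (COW) doesn't change this read-modify-write amplification, which is the commenter's point.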

How you remove old records shouldn't depend on COW should it?

My only statement was that checksumming isn't in any way dependent on COW.

The discussion about compression is invalid as it is a common feature of non-COW filesystems anyway.

Haven't seen a proper argument for the corruption claims. And getting corrupted data if you interrupt a write is not a huge deal. Mind you, that's a corrupted write, not a corrupted filesystem. The data was toast anyway. A typical COW filesystem would at best save you one "block" of data, which is hardly worth celebrating: your application won't care whether you wrote 557 out of 1000 blocks or 556 out of 1000, your document is trashed either way. You need to restore from backup (or from a previous snapshot, which of course is a typical killer feature of COW).

There are also several ways to solve the corruption issue. ReFS for instance has data checksums and metadata checksums but only does copy-on-write for the metadata. (edit: was wrong about this, it uses COW for data too if data checksumming is enabled)

dm-integrity can be used at a layer below the filesystem and solves it with the journal https://www.kernel.org/doc/html/latest/admin-guide/device-ma...

Yes, COW is popular and for good reasons. As is checksumming. It isn't surprising that modern filesystems employ both. Especially since the costs of both have been becoming less and less relevant at the same time.


While filesystem-integrated RAID makes sense since the filesystem can do filesystem-specific RAID placements (e.g. ZFS), for now the safest RAID experience seems to be filesystem on mdadm on dm-integrity on disk partition, so that the RAID and RAID errors are invisible to the filesystem.


> the safest RAID experience seems to be filesystem on mdadm on dm-integrity on disk partition, so that the RAID and RAID errors are invisible to the filesystem.

I suppose I don't understand this. Why would this be the case?


dm-integrity solves the problem of identifying which replica is good and which is bad. mdadm solves the problem of reading from the replica identified as good and fixing / reporting the replica identified as bad. The filesystem doesn't notice or care.
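In miniature (toy Python, not the real device-mapper stack): because each replica carries its own per-sector checksum, a mismatch identifies the bad copy, so the mirror layer can serve the good one and rewrite the bad one without the filesystem ever noticing.

```python
import zlib

def read_mirrored(sector, replicas):
    """replicas: list of dicts, sector -> (data, crc32). Returns good
    data, repairing any replica whose stored crc no longer matches."""
    good = None
    for rep in replicas:
        data, crc = rep[sector]
        if zlib.crc32(data) == crc:   # the dm-integrity role: verify
            good = data
            break
    if good is None:
        raise IOError("no replica passes its integrity check")
    for rep in replicas:              # the mdadm role: self-heal bad copies
        data, crc = rep[sector]
        if zlib.crc32(data) != crc:
            rep[sector] = (good, zlib.crc32(good))
    return good

a = {0: (b"hello", zlib.crc32(b"hello"))}
b = {0: (b"hellp", zlib.crc32(b"hello"))}   # bitrot: data no longer matches crc
assert read_mirrored(0, [a, b]) == b"hello"
assert b[0][0] == b"hello"                  # bad mirror repaired
```

Without the per-sector checksum, plain mirroring would only know the two copies differ, not which one to trust.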


Ahh, so you intend, "If you can't use ZFS/btrfs, use dm-integrity"?


No. I don't use ZFS since it's not licensed correctly, so I have no opinion on it. And BTRFS raid is not safe enough for use. So I'm saying "Use filesystem on mdadm on dm-integrity".


>I don't use ZFS since it's not licensed correctly,

Oh look, a hobby lawyer!! Please Linus, license your code "correctly", it's called ISC not GPL.


What makes dm-integrity "safer" than zfs or btrfs raid?



