
For appends, the normal way is to apply the append operations in an arbitrary order if there are multiple concurrent writers. That way you can have 10 jobs all appending data to the same 'file', and you know every record will end up in that file when you later scan through it.

Obviously, you need to make sure no write operation breaks a record midway while doing that (unlike the POSIX write() API, which can be interrupted partway through).
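A minimal sketch of one way to frame records so a torn append is detectable rather than silently corrupting the next record. The length/CRC layout here is an illustration I made up, not any particular format:

```python
import struct
import zlib

def frame(payload: bytes) -> bytes:
    """Frame a record as [4-byte length][4-byte crc32][payload]."""
    return struct.pack("<II", len(payload), zlib.crc32(payload)) + payload

def unframe(buf: bytes, offset: int = 0):
    """Return (payload, next_offset), or raise ValueError on a torn/corrupt record."""
    header = buf[offset:offset + 8]
    if len(header) < 8:
        raise ValueError("truncated header")
    length, crc = struct.unpack("<II", header)
    payload = buf[offset + 8:offset + 8 + length]
    if len(payload) < length or zlib.crc32(payload) != crc:
        raise ValueError("torn or corrupt record")
    return payload, offset + 8 + length

log = frame(b"record-1") + frame(b"record-2")
rec, off = unframe(log)       # → b"record-1"
rec2, _ = unframe(log, off)   # → b"record-2"
```

If an append is cut off mid-record, the length prefix runs past the end of the data or the CRC fails, so a reader can tell the tail is incomplete instead of misparsing it.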



It's also sensible to have high-quality record markers that make identifying and skipping broken records easy. For example, recordio, a record-based container format used at Google in the early MapReduce days (and probably still used today), was explicitly designed to skip corruption efficiently (you don't want a huge index build to fail halfway through because a single document got corrupted).
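A rough sketch of the skip-on-corruption idea: put a sync marker before each record so a scanner can hunt for the next marker after hitting a bad record and keep going. The marker and layout here are invented for illustration; the real recordio format differs:

```python
import struct
import zlib

SYNC = b"\xde\xad\xbe\xef"  # illustrative sync marker, not recordio's

def write_record(payload: bytes) -> bytes:
    """[sync][4-byte length][4-byte crc32][payload]"""
    return SYNC + struct.pack("<II", len(payload), zlib.crc32(payload)) + payload

def scan(buf: bytes):
    """Yield valid records; on corruption, resync at the next marker."""
    pos = 0
    while True:
        pos = buf.find(SYNC, pos)
        if pos < 0:
            return
        start = pos + len(SYNC)
        header = buf[start:start + 8]
        if len(header) < 8:
            return
        length, crc = struct.unpack("<II", header)
        payload = buf[start + 8:start + 8 + length]
        if len(payload) == length and zlib.crc32(payload) == crc:
            yield payload
            pos = start + 8 + length
        else:
            pos += 1  # corrupt record: advance and hunt for the next marker

stream = write_record(b"a") + b"\x00garbage\x00" + write_record(b"b")
recovered = list(scan(stream))  # → [b"a", b"b"]
```

One caveat with this toy version: a payload that happens to contain the marker bytes can cause a spurious resync attempt (the CRC check rejects it, at the cost of a little wasted scanning).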

Even so, I avoid parallel appenders to a single file: they can be hard to reason about and debug in ways that having each process append to its own file isn't.


Objects have three fields for this: Version, Generation, and Metageneration. There's also a checksum. You can confirm that you were the winning writer by checking these after the write.


You can also send an x-goog-if-generation-match[0] header that instructs GCS to reject writes that would replace the wrong generation (sort of like a version number) of an object. Some utilities use this for locking.

0: https://cloud.google.com/storage/docs/xml-api/reference-head...
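An in-memory sketch of those precondition semantics, using a hypothetical FakeBucket rather than the real GCS client. The useful detail it mimics: a generation match of 0 means "only succeed if the object does not exist yet", which is what makes the header usable as a lock-acquisition primitive:

```python
class FakeBucket:
    """Toy stand-in for a GCS bucket, illustrating generation preconditions."""

    def __init__(self):
        self.objects = {}  # name -> (generation, data); generation 0 = absent

    def write(self, name, data, if_generation_match=None):
        gen, _ = self.objects.get(name, (0, None))
        if if_generation_match is not None and if_generation_match != gen:
            raise RuntimeError("412 Precondition Failed")
        self.objects[name] = (gen + 1, data)
        return gen + 1

bucket = FakeBucket()
# First writer creates the lock object (generation 0 = "must not exist"):
g1 = bucket.write("lock", b"owner-a", if_generation_match=0)
# A second writer racing with a stale generation is rejected:
try:
    bucket.write("lock", b"owner-b", if_generation_match=0)
except RuntimeError as e:
    print(e)  # 412 Precondition Failed
```

The losing writer gets a 412-style failure instead of silently clobbering the object, which is exactly the "check that you were the winner" guarantee mentioned above.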


That makes sense if you keep the data in something like NDJSON and don't require any ordering.

If you need ordering, then writing to separate files and running periodic compaction jobs is probably still better.
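A toy sketch of such a compaction pass, assuming each writer's shard is already in append order and that records carry a hypothetical "seq" field to merge on:

```python
import heapq
import json

def compact(shards):
    """Merge per-writer shards (lists of NDJSON lines, each already sorted
    by "seq") into one globally ordered list of NDJSON lines."""
    parsed = [[json.loads(line) for line in shard] for shard in shards]
    merged = heapq.merge(*parsed, key=lambda rec: rec["seq"])
    return [json.dumps(rec, sort_keys=True) for rec in merged]

# Each concurrent writer appended to its own file; compaction merges them:
shard_a = ['{"seq": 1, "v": "a"}', '{"seq": 3, "v": "c"}']
shard_b = ['{"seq": 2, "v": "b"}']
out = compact([shard_a, shard_b])
print("\n".join(out))
```

Because each writer appends in order, every shard is internally sorted, so an n-way streaming merge (heapq.merge) recovers the global order without loading everything into memory at once.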



