
For appends, the normal way is to apply the append operations in an arbitrary order if there are multiple concurrent writers. That way you can have 10 jobs all appending data to the same 'file', and you know every record will end up in that file when you later scan through it.

Obviously, you need to make sure no write operation breaks a record midway while doing that (unlike the POSIX write() API, which can be interrupted partway through).
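A minimal sketch of one way to frame records so a torn append is detectable rather than silently corrupting the next record. The length/CRC layout here is an illustration I made up, not any particular format:

```python
import struct
import zlib

def frame(payload: bytes) -> bytes:
    """Frame a record as [4-byte length][4-byte crc32][payload]."""
    return struct.pack("<II", len(payload), zlib.crc32(payload)) + payload

def unframe(buf: bytes, offset: int = 0):
    """Return (payload, next_offset), or raise ValueError on a torn/corrupt record."""
    header = buf[offset:offset + 8]
    if len(header) < 8:
        raise ValueError("truncated header")
    length, crc = struct.unpack("<II", header)
    payload = buf[offset + 8:offset + 8 + length]
    if len(payload) < length or zlib.crc32(payload) != crc:
        raise ValueError("torn or corrupt record")
    return payload, offset + 8 + length

log = frame(b"record-1") + frame(b"record-2")
rec, off = unframe(log)       # → b"record-1"
rec2, _ = unframe(log, off)   # → b"record-2"
```

If an append is cut off mid-record, the length prefix runs past the end of the data or the CRC fails, so a reader can tell the tail is incomplete instead of misparsing it.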



It's also sensible to have high-quality record markers that make identifying and skipping broken records easy. For example, recordio, a record-based container format used at Google in the early MapReduce days (and probably still used today), was explicitly designed to skip corruption efficiently (you don't want a huge index build to fail halfway through because a single document got corrupted).
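A rough sketch of the skip-on-corruption idea: put a sync marker before each record so a scanner can hunt for the next marker after hitting a bad record and keep going. The marker and layout here are invented for illustration; the real recordio format differs:

```python
import struct
import zlib

SYNC = b"\xde\xad\xbe\xef"  # illustrative sync marker, not recordio's

def write_record(payload: bytes) -> bytes:
    """[sync][4-byte length][4-byte crc32][payload]"""
    return SYNC + struct.pack("<II", len(payload), zlib.crc32(payload)) + payload

def scan(buf: bytes):
    """Yield valid records; on corruption, resync at the next marker."""
    pos = 0
    while True:
        pos = buf.find(SYNC, pos)
        if pos < 0:
            return
        start = pos + len(SYNC)
        header = buf[start:start + 8]
        if len(header) < 8:
            return
        length, crc = struct.unpack("<II", header)
        payload = buf[start + 8:start + 8 + length]
        if len(payload) == length and zlib.crc32(payload) == crc:
            yield payload
            pos = start + 8 + length
        else:
            pos += 1  # corrupt record: advance and hunt for the next marker

stream = write_record(b"a") + b"\x00garbage\x00" + write_record(b"b")
recovered = list(scan(stream))  # → [b"a", b"b"]
```

One caveat with this toy version: a payload that happens to contain the marker bytes can cause a spurious resync attempt (the CRC check rejects it, at the cost of a little wasted scanning).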

Even so, I avoid parallel appenders to a single file: they can be hard to reason about and debug in ways that having each process append to its own file isn't.


Objects have three fields for this: Version, Generation, and Metageneration. There's also a checksum. You can confirm that you were the winning writer by checking these after the write.


You can also send an x-goog-if-generation-match[0] header that instructs GCS to reject writes that would replace the wrong generation (sort of like a version number) of an object. Some utilities use this for locking.

0: https://cloud.google.com/storage/docs/xml-api/reference-head...
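An in-memory sketch of those precondition semantics, using a hypothetical FakeBucket rather than the real GCS client. The useful detail it mimics: a generation match of 0 means "only succeed if the object does not exist yet", which is what makes the header usable as a lock-acquisition primitive:

```python
class FakeBucket:
    """Toy stand-in for a GCS bucket, illustrating generation preconditions."""

    def __init__(self):
        self.objects = {}  # name -> (generation, data); generation 0 = absent

    def write(self, name, data, if_generation_match=None):
        gen, _ = self.objects.get(name, (0, None))
        if if_generation_match is not None and if_generation_match != gen:
            raise RuntimeError("412 Precondition Failed")
        self.objects[name] = (gen + 1, data)
        return gen + 1

bucket = FakeBucket()
# First writer creates the lock object (generation 0 = "must not exist"):
g1 = bucket.write("lock", b"owner-a", if_generation_match=0)
# A second writer racing with a stale generation is rejected:
try:
    bucket.write("lock", b"owner-b", if_generation_match=0)
except RuntimeError as e:
    print(e)  # 412 Precondition Failed
```

The losing writer gets a 412-style failure instead of silently clobbering the object, which is exactly the "check that you were the winner" guarantee mentioned above.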


That makes sense if you keep the data in something like NDJSON and don't require any ordering.

If you need ordering, then writing to separate files and running periodic compaction jobs is probably still better.
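A toy sketch of such a compaction pass, assuming each writer's shard is already in append order and that records carry a hypothetical "seq" field to merge on:

```python
import heapq
import json

def compact(shards):
    """Merge per-writer shards (lists of NDJSON lines, each already sorted
    by "seq") into one globally ordered list of NDJSON lines."""
    parsed = [[json.loads(line) for line in shard] for shard in shards]
    merged = heapq.merge(*parsed, key=lambda rec: rec["seq"])
    return [json.dumps(rec, sort_keys=True) for rec in merged]

# Each concurrent writer appended to its own file; compaction merges them:
shard_a = ['{"seq": 1, "v": "a"}', '{"seq": 3, "v": "c"}']
shard_b = ['{"seq": 2, "v": "b"}']
out = compact([shard_a, shard_b])
print("\n".join(out))
```

Because each writer appends in order, every shard is internally sorted, so an n-way streaming merge (heapq.merge) recovers the global order without loading everything into memory at once.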



