I was part of an experimental neuroimaging group that tested Pachyderm OSS years ago, and at the time we were really impressed with its versioning capabilities. It made it easy for each researcher to grab and modify data for their own development without needing support from eng.
How well does that work when your datasets are a sizeable percentage of available storage capacity, though? Is there some sort of deduplication at work?
Pachyderm does a ton of data deduplication, both for input data added to Pachyderm repos and for output files.
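To give a rough intuition for how versioned storage can dedupe across commits, here's a toy content-addressed chunk store. This is just an illustration of the general technique, not Pachyderm's actual implementation: files are split into fixed-size chunks, each unique chunk is stored once under its content hash, and a file version is just a list of chunk references, so versions that share content share storage.

```python
import hashlib

CHUNK_SIZE = 4  # tiny for illustration; real systems use KB/MB chunks

class ChunkStore:
    def __init__(self):
        self.chunks = {}  # content hash -> chunk bytes, stored once

    def put_file(self, data: bytes) -> list[str]:
        """Store a file, returning the list of chunk hashes that describe it."""
        refs = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            h = hashlib.sha256(chunk).hexdigest()
            self.chunks[h] = chunk  # overwrite is a no-op for identical content
            refs.append(h)
        return refs

    def get_file(self, refs: list[str]) -> bytes:
        """Reassemble a file from its chunk references."""
        return b"".join(self.chunks[h] for h in refs)

store = ChunkStore()
v1 = store.put_file(b"AAAABBBBCCCC")
v2 = store.put_file(b"AAAABBBBDDDD")  # only the last chunk differs
# Two 12-byte versions, but only 4 unique chunks are actually stored.
print(len(store.chunks))  # 4
```

The point is that a second commit that touches one chunk of a big file only costs you one new chunk, which is why versioning large datasets doesn't double your storage bill.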
Pachyderm's pipelines are also smart enough to know what data has changed and what hasn't, and only process the incremental "diffs" as needed. If your pipeline is one giant reduce or training job that can't be broken up at all, this isn't valuable, but most workloads include lots of map steps where only processing diffs can be incredibly powerful.
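For a concrete sense of what that looks like, here's a minimal pipeline spec sketch (repo name, image, and command are placeholder assumptions, not from a real project). The `glob` pattern tells Pachyderm the unit of work: with `"/*"`, each top-level file in the input repo is an independent datum, so a new commit only triggers processing for the datums that changed.

```json
{
  "pipeline": { "name": "word-count" },
  "input": {
    "pfs": {
      "repo": "raw-text",
      "glob": "/*"
    }
  },
  "transform": {
    "image": "alpine:3",
    "cmd": ["sh", "-c",
      "for f in /pfs/raw-text/*; do wc -w \"$f\" > /pfs/out/$(basename \"$f\"); done"]
  }
}
```

Commit a new file to `raw-text` and only that file gets counted; the outputs for untouched files are reused from the previous run.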
Pachyderm does exactly that. It's about half of what Pachyderm does: manage the versioned data, and schedule workers to run your containerized processes against it.
FYI, it's ridiculously easy to start playing with Pachyderm if you just want to check it out. You can run it on Minikube.
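Roughly, the local setup looks like this. Treat it as a sketch: it assumes `minikube` and `pachctl` are already installed, and `pachctl deploy local` is the 1.x-era command, so check the current docs for your version (newer releases deploy via Helm instead).

```shell
# Assumes minikube and pachctl are installed; exact steps vary by version.
minikube start

# 1.x-era local deployment; Pachyderm 2.x uses a Helm chart instead.
pachctl deploy local

# Wait for the pods to come up, then confirm pachd is reachable.
kubectl get pods
pachctl version
```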