
Frustratingly little information. For example, I'm exceedingly curious how they deal with scheduling jobs on such a huge array of machines. The article:

> Efficient scheduling helps ensure that our resources are used optimally. This involves sophisticated algorithms that can allocate resources based on the needs of different jobs and dynamic scheduling to adapt to changing workloads.

Wow, thanks for that, Captain Obvious. So how do you actually do it?



I usually assume these companies are using one of the popular schedulers (e.g., Slurm, Moab, SGE) that have existed in the HPC community for many years.

I've also anecdotally heard that some are using k8s, but I haven't seen that myself. Slurm [1] is basically built for this stuff; it's definitely what I would use!

[1] https://slurm.schedmd.com/documentation.html
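
To make that concrete, here's a rough sketch of what submitting a multi-node training job through Slurm can look like, driven from Python via subprocess. The partition name, node/GPU counts, and train.py are placeholders I made up; sbatch and the #SBATCH directives are real Slurm, but treat this as an illustration, not anyone's actual setup:

    import subprocess
    import textwrap

    # Hypothetical batch script: 16 nodes with 8 GPUs each (numbers are
    # illustrative). #SBATCH directives tell the scheduler what to allocate.
    batch_script = textwrap.dedent("""\
        #!/bin/bash
        #SBATCH --job-name=train-llm
        #SBATCH --partition=gpu
        #SBATCH --nodes=16
        #SBATCH --gpus-per-node=8
        #SBATCH --time=72:00:00

        # srun launches the training process on every allocated node.
        srun python train.py
    """)

    # sbatch accepts the script on stdin and prints the assigned job id;
    # from there Slurm handles queueing, priorities, and node placement.
    result = subprocess.run(
        ["sbatch"], input=batch_script, capture_output=True, text=True, check=True
    )
    print(result.stdout)  # e.g. "Submitted batch job 123456"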


Slurm is definitely still dominant, but OpenAI has been using k8s for training for many years now¹, and there are various ways to run Slurm on top of Kubernetes, including the recent SUNK from CoreWeave².

At my company we use Slurm "directly" for static compute we rent or own (i.e. not in a public cloud), but we're considering Kubernetes because that's how we run the rest of the company, and we'd rather invest more effort into getting better at k8s than into becoming good Slurm admins.

¹: https://openai.com/index/scaling-kubernetes-to-2500-nodes/

²: https://www.coreweave.com/blog/sunk-slurm-on-kubernetes-impl...
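
For comparison, a rough sketch of the same kind of job expressed through the Kubernetes API, using the official Python client (the image name, namespace, and GPU/worker counts are placeholders, not anything from OpenAI's or CoreWeave's setups):

    from kubernetes import client, config

    # Assumes a reachable cluster and a local kubeconfig.
    config.load_kube_config()

    container = client.V1Container(
        name="trainer",
        image="registry.example.com/train:latest",  # hypothetical image
        command=["python", "train.py"],
        resources=client.V1ResourceRequirements(
            # nvidia.com/gpu is the standard device-plugin resource name;
            # the scheduler only places this pod on a node with 8 free GPUs.
            limits={"nvidia.com/gpu": "8"},
        ),
    )

    job = client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(name="train-job"),
        spec=client.V1JobSpec(
            parallelism=16,  # illustrative: 16 worker pods at once
            completions=16,
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    containers=[container],
                    restart_policy="Never",
                ),
            ),
        ),
    )

    client.BatchV1Api().create_namespaced_job(namespace="default", body=job)

Worth noting that the vanilla k8s scheduler places pods one at a time with no gang scheduling, so a 16-pod job can deadlock half-scheduled on a busy cluster; that gap is a big part of why layers like SUNK exist.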


Very cool! Thanks for this, claytonjy!



