
I usually assume these companies are using some of the popular schedulers (e.g., Slurm, MOAB, SGE) that have existed in the HPC community for many years.

I've also heard anecdotally that some are using k8s, but I've not seen that myself. Slurm [1] is basically built for this stuff; that's definitely what I would use!

[1] https://slurm.schedmd.com/documentation.html



Slurm is definitely still dominant, but OpenAI has been using k8s for training for many years now¹, and there are various ways to run Slurm on top of Kubernetes, including the recent SUNK from CoreWeave².

At my company we use Slurm "directly" for static compute we rent or own (i.e., not in a public cloud), but we are considering using Kubernetes because that's how we run the rest of the company, and we'd rather invest more effort into getting better at k8s than into becoming good Slurm admins.

¹: https://openai.com/index/scaling-kubernetes-to-2500-nodes/

²: https://www.coreweave.com/blog/sunk-slurm-on-kubernetes-impl...


Very cool! Thanks for this, claytonjy!!



