
> Right now Nvidia is just crushing it and there's zero chance anyone is going to catch up by introducing new Infiniband ASICs.

I mean, Cornelis Networks is trying with the resurrected OmniPath. I hope they pull it off, but I'm not holding my breath.

See: https://www.cornelisnetworks.com/products/



It's not going to work because they are too late.

HPC cluster builds are complex enough due to the presence of multiple networks (2x moderate-scale InfiniBand chassis and 2x Ethernet chassis as a minimum) without introducing unknown vendors. At that point, if you're doing IB, why not just go Mellanox, since you will almost certainly buy 200GE ConnectX NICs rather than the 100G NICs from these guys.

UEC (Ultra Ethernet) will - like most standards - _eventually_ work. In the meantime, Nvidia pods are the obvious choice for anyone who really cares about performance, and other vendors (Cisco, Arista) if they don't.


Idk, I personally did it for an HPC cluster two years ago. 2x100GbE + OmniPath was a sensible way to reduce cost, especially as the cluster was very light on GPU power and mostly focused on CPU-bound jobs. Last I heard everyone there is still very happy with what we built.


Was that before Intel spun it out? I can see people being willing to do that build while Intel still seemed to be on board. Today things are pretty different.

Modern HPC mostly means GPU compute and tons of data shuffling, but point taken. CPU-bound jobs aren't going to stress the I/O, so you probably could have done 100GE alone for less. I'm curious what you did for storage, but I'm guessing that with CPU-bound jobs it is again much less of an issue.
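
To put a rough number on "CPU-bound jobs aren't going to stress the I/O", here's a quick back-of-envelope in Python. The per-node checkpoint size and interval are made-up illustrative figures, not anything from the build described above:

    # Back-of-envelope: does a CPU-bound job's periodic I/O come anywhere
    # near saturating a 100 GbE link? All inputs below are hypothetical.

    checkpoint_gb_per_node = 50        # assumed checkpoint written per node
    checkpoint_interval_s = 30 * 60    # assumed: one checkpoint every 30 minutes
    link_gbit_per_s = 100              # 100 GbE line rate

    # Average bandwidth the job actually needs, in Gbit/s
    avg_need_gbit_per_s = checkpoint_gb_per_node * 8 / checkpoint_interval_s

    print(f"average demand: {avg_need_gbit_per_s:.2f} Gbit/s "
          f"vs {link_gbit_per_s} Gbit/s available "
          f"({100 * avg_need_gbit_per_s / link_gbit_per_s:.1f}% of the link)")
    # -> roughly 0.22 Gbit/s, i.e. well under 1% of a 100 GbE link

Even if you size for the burst while the checkpoint is actually being written rather than the long-run average, a node would have to be dumping implausibly large state before a single 100GbE link becomes the bottleneck for a CPU-bound workload.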



