In the absence of hardware unified memory, CUDA will automatically copy data bet...

fenced_load · 2025-07-15T00:08:30 1752538110

There is also NVLink-C2C support between Nvidia's CPUs and GPUs that doesn't require any copy: CPUs and GPUs directly access each other's memory over a coherent bus. IIRC, they have 4 CPU + 4 GPU servers already available.

benreesman · 2025-07-15T00:35:06 1752539706

Yeah, NCCL is a whole world and it's not even the only thing involved, but IIRC that's the difference between 8xH100 PCIe and 8xH100 SXM5.

saagarjha · 2025-07-15T02:02:57 1752544977

This seems like it would be slow…

freeone3000 · 2025-07-15T02:28:22 1752546502

Matches my experience. It’s memory stalls all over the place, aggravated by the fact that (on 12.3 at least) there wasn’t even a prefetcher.
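
For anyone following along, here is a rough sketch of the usual workaround: issuing explicit prefetch hints on managed memory so pages migrate before the kernel touches them, instead of faulting over one at a time. This is a manual hint, not the automatic prefetcher mentioned above. Assumptions: a CUDA 12.x toolkit (the `cudaMemPrefetchAsync` signature that takes an `int` device ID; newer toolkits changed it), a GPU that supports concurrent managed access, and the names `scale` and `scaled_sum` are made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Scale every element in place; one thread per element.
__global__ void scale(float *data, size_t n, float factor) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: fills n floats on the CPU, doubles them on the GPU,
// and returns their sum. Returns -1.0 if allocation or execution fails.
double scaled_sum(size_t n) {
    float *data = nullptr;
    size_t bytes = n * sizeof(float);
    if (cudaMallocManaged(&data, bytes) != cudaSuccess) return -1.0;

    // First touch happens on the CPU, so the pages start out host-resident.
    for (size_t i = 0; i < n; ++i) data[i] = 1.0f;

    int device = 0;
    cudaGetDevice(&device);

    // Best-effort hint (CUDA 12.x signature assumed): move the pages to the
    // GPU up front. The return value is ignored because the kernel still
    // works without the hint, just with demand paging.
    cudaMemPrefetchAsync(data, bytes, device, 0);

    unsigned blocks = (unsigned)((n + 255) / 256);
    scale<<<blocks, 256>>>(data, n, 2.0f);

    // Hint the pages back to host memory before the CPU reads them.
    cudaMemPrefetchAsync(data, bytes, cudaCpuDeviceId, 0);

    if (cudaDeviceSynchronize() != cudaSuccess) {
        cudaFree(data);
        return -1.0;
    }

    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) sum += data[i];
    cudaFree(data);
    return sum;
}

int main() {
    printf("sum = %f\n", scaled_sum(1 << 20));
    return 0;
}
```

Without the two prefetch calls the program should still produce the same result; the difference would only show up as page-fault stalls in a profiler.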

nickysielicki · 2025-07-15T01:21:02 1752542462

See also: https://www.kernel.org/doc/html/v5.0/vm/hmm.html