If you want to double the memory and double the total memory bandwidth, sure. That'd need twice as many data lines, or the same lines at twice the speed.
But if you just want to double the memory without increasing the total memory bandwidth, isn't that a good deal simpler? What's one more address bit next to a 256-bit data bus?
The GPU already has DMA to system RAM. If you're going to make the VRAM as slow as system RAM, then a UMA makes more sense than throwing more memory chips on the GPU.
Good point. I misunderstood the situation. I figured doubling the VRAM size at the same bus width would halve the bandwidth.
Instead, it appears entirely possible to double VRAM size (starting from current amounts) while keeping the bus width and bandwidth the same (cf. 4060 Ti 8GB vs. 4060 Ti 16GB). And, since that bandwidth is already much higher than system RAM (e.g. 128-bit GDDR6 at 288 GB/s vs DDR5 at 32-64 GB/s), it seems very useful to do so, though I'd imagine games wouldn't benefit as much as compute would.
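That bandwidth gap follows directly from bus width times per-pin data rate. A quick sketch of the arithmetic, with assumed speed grades (18 Gbps/pin is a common GDDR6 bin, and DDR5-4800 runs 4.8 Gbps/pin on a 64-bit channel):

```python
# Peak memory bandwidth = bus width (bits) * per-pin data rate (Gbit/s) / 8.
# Speed grades are assumptions, not taken from any specific card's spec sheet.

def peak_bandwidth_gbs(bus_bits: int, gbps_per_pin: float) -> float:
    """Peak bandwidth in GB/s for a given bus width and per-pin rate."""
    return bus_bits * gbps_per_pin / 8

gddr6 = peak_bandwidth_gbs(128, 18.0)  # 4060 Ti-class: 128-bit GDDR6
ddr5 = peak_bandwidth_gbs(64, 4.8)     # one DDR5-4800 channel

print(f"128-bit GDDR6 @ 18 Gbps/pin: {gddr6:.0f} GB/s")  # 288 GB/s
print(f"one DDR5-4800 channel:       {ddr5:.1f} GB/s")   # 38.4 GB/s
```

Doubling capacity by using denser (or clamshell) chips leaves both inputs to that formula untouched, which is why the 16GB 4060 Ti has the same 288 GB/s as the 8GB one.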
But having the VRAM is what lets you run the model on the GPU at all, isn't it? A card with 48GB can run a model twice as large as a card with 24GB can, even if it takes twice as long per token. Nobody expects to run twice as much model in the same time just by adding VRAM.
Without the extra VRAM, it's hundreds of times slower (divided by your batch size) because the weights have to be swapped in over PCIe for every batch, or tens of times slower consistently if you run the overflow layers on the CPU instead.
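The swapping penalty can be sketched as a bandwidth ratio: token generation is memory-bandwidth-bound, so reading the weights from VRAM costs weights/VRAM_bw per sample, while swapping costs weights/PCIe_bw once per batch. A rough model, with assumed bandwidth figures (not measurements from any particular card):

```python
# Slowdown from streaming weights over PCIe each batch, relative to
# keeping them resident in VRAM. Token generation is assumed to be
# memory-bandwidth-bound, so VRAM read time stands in for compute time.

def slowdown(vram_gbs: float, pcie_gbs: float, batch: int) -> float:
    """Per-sample slowdown: one PCIe transfer of the weights amortized
    over the batch, on top of the per-sample VRAM read."""
    return 1 + vram_gbs / (pcie_gbs * batch)

# High-bandwidth card (~936 GB/s VRAM) with ~8 GB/s effective PCIe:
print(f"{slowdown(936, 8, 1):.0f}x")   # 118x at batch size 1
# Mid-range card (~288 GB/s VRAM) with ~32 GB/s PCIe 4.0 x16:
print(f"{slowdown(288, 32, 1):.0f}x")  # 10x at batch size 1
```

So the "hundreds of times, divided by batch size" figure depends heavily on the VRAM-to-PCIe bandwidth ratio of the card in question, and batching amortizes the one-time transfer across samples.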