
Loosely related question:

What prevents manufacturers from taking some existing mid/toprange consumer GPU design, and just slapping like 256GB VRAM onto it? (enabling consumers to run big-LLM inference locally).

Would that be useless for some reason? What am I missing?



The amount of memory you can put on a GPU is mainly constrained by the GPU's memory bus width (which is both expensive and power hungry to expand) and the available GDDR chips (each generally requires 32 bits of the bus). We've been using 16Gbit (2GB) chips for a while, and 24Gbit (3GB) GDDR7 modules are just starting to roll out, but they're expensive and in limited supply. You also have to account for VRAM being somewhat power hungry (~1.5-2.5W per module under load).

Once you've filled all the slots, your only real option is a clamshell setup that doubles the VRAM capacity by putting chips on the back of the PCB in the same spots as the ones on the front (for timing reasons the traces all have to be the same length). Clamshell designs then need to figure out how to cool those chips on the back (~1.5-2.5W per module depending on speed and whether it's GDDR6/6X/7, meaning you could have up to 40W on the back).

Some basic math puts us at 16 modules for a 512-bit bus (only the 5090; you have to go back a decade+ to find the previous 512-bit GPU), 12 with 384-bit (4090, 7900xtx), or 8 with 256-bit (5080, 4080, 7800xt).

A clamshell 5090 with 2GB modules has a max limit of 64GB, or 96GB with (currently expensive and limited) 3GB modules (you'll be able to buy this at some point as the RTX 6000 Blackwell at stupid prices).
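
To make the module-count arithmetic explicit, here's a rough sketch of it (assuming 32 bits of bus per module and clamshell doubling the chip count, as described above; the card names are just the examples from this thread):

    # Back-of-envelope VRAM ceiling from bus width and module density.
    # Assumptions: 32 bits of bus per GDDR module, clamshell doubles the chip count.
    def max_vram_gb(bus_width_bits, module_gb, clamshell=False):
        modules = bus_width_bits // 32
        if clamshell:
            modules *= 2
        return modules * module_gb

    for bus, cards in [(512, "5090"), (384, "4090/7900xtx"), (256, "5080/4080/7800xt")]:
        print(cards, max_vram_gb(bus, 2), "GB standard,",
              max_vram_gb(bus, 2, clamshell=True), "GB clamshell (2GB modules),",
              max_vram_gb(bus, 3, clamshell=True), "GB clamshell (3GB modules)")
    # -> 5090: 32 GB standard, 64 GB clamshell, 96 GB clamshell with 3GB modules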

HBM can get you higher amounts, but it's extremely expensive to buy (you're competing against H100s, MI300Xs, etc), supply limited (AI hardware companies are buying all of it and want even more), requires a different memory controller (meaning you'll still have to partially redesign the GPU), and requires expensive packaging to assemble it.


What of previous generations of HBM? Older consumer AMD GPUs (Vega) and the Titan V had HBM2. According to https://en.wikipedia.org/wiki/Radeon_RX_Vega_series#Radeon_V... you could get 16GB with 1TB/s for $700 at release. It is no longer used in data centers. I'd gladly pay $2800 for 48GB with 4TB/s.


Previous generations of HBM are not any cheaper than the current ones, and they are no longer in production, the lines having shifted to the newer stuff.


HBM2 is still in volume production. New products are coming out with it on the ASIC side. Gaudi 3 uses HBM2e.


Interesting. So a 32-chip GDDR6 clamshell design could pack 64GB VRAM with about 2TB/s on a 1024-bit bus, consuming around 100W for the memory subsystem? With current chip prices [1], this would cost just about $200 (!) for the memory chips, apparently. So theoretically, it should be possible to build fairly powerful AI accelerators in the 300W and <$1000 range. If one wanted to, that is :)

1. https://dramexchange.com/
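
Roughly how those figures fall out (my own back-of-envelope, assuming 16Gbps-per-pin GDDR6 and 2GB chips at about $7 and 2.5W each; not vendor numbers):

    # Sanity check of the 32-chip GDDR6 idea above (all inputs are assumptions).
    chips = 32
    capacity_gb = chips * 2              # 16Gbit (2GB) chips
    bus_bits = chips * 32                # 1024-bit if every chip gets its own 32 bits
    bandwidth_gbs = bus_bits * 16 / 8    # 16 Gbps per pin, 8 bits per byte
    power_w = chips * 2.5
    cost_usd = chips * 7
    print(capacity_gb, bandwidth_gbs, power_w, cost_usd)  # 64 GB, 2048 GB/s, 80 W, $224
    # Caveat: a clamshell pair shares its 32 bits of bus, so a 512-bit clamshell
    # card stays around 1 TB/s; hitting 2 TB/s needs a genuine 1024-bit bus.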


I wonder if a multiplexer would be feasible?

Hardware-wise, instead of putting the chips on the PCB surface, one would mount a 16-gonal arrangement of perpendicular daughterboards, each containing 2-16 GDDR chips where there would normally be one, with external liquid cooling, power delivery and a PCIe control connection.

Each daughterboard would then feature a multiplexer with a dual-ported SRAM holding a table that stores, for each memory page, the chip number it maps to; the multiplexer would use that table to route requests from the GPU, while the second port would let the extra PCIe interface change the mapping.

API-wise, for each resource you would have N overlays and a new operation to switch the resource overlay (which would require a custom driver that properly invalidates caches).

This would depend on the GPU supporting the much higher latency of this setup and providing good enough support for cache flushing and invalidation, as well as deterministic mapping from physical addresses to chip addresses, and the ability to manufacture all this in a reasonably affordable fashion.
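
As a purely illustrative toy model of that per-page remap table (all names and sizes are made up, nothing here is a real driver API):

    PAGE_SIZE = 4096

    class RemapMux:
        """Multiplexer on one daughterboard: routes each page to one of N GDDR chips."""
        def __init__(self, num_pages, num_chips):
            self.table = [0] * num_pages                    # dual-ported SRAM: page -> chip
            self.chips = [bytearray(num_pages * PAGE_SIZE)  # stand-ins for the GDDR chips
                          for _ in range(num_chips)]

        def set_overlay(self, page, chip):
            # Second SRAM port, driven over the extra PCIe link; the custom driver
            # would have to flush/invalidate GPU caches before flipping a mapping.
            self.table[page] = chip

        def access(self, addr):
            # Request arriving from the GPU side: look up the page, route to the chip.
            page, offset = divmod(addr, PAGE_SIZE)
            return self.chips[self.table[page]], page * PAGE_SIZE + offset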


Not at GDDR speeds.

GPUs use special DRAM that has much higher bandwidth than the DRAM used with CPUs. The main reason they can achieve this higher bandwidth at low cost is that the connection between the GPU and the DRAM chip is point-to-point, very short, and very clean. Today, even clamshell memory configuration is not supported by plugging two memory chips into the same bus; it's supported by having the interface in the GDDR chips internally split into two halves, and each chip can either serve requests using both halves at the same time, or using only one half over twice the time.

You are definitely not passing that link through some kind of daughterboard connector, or a flex cable.


>A clamshell 5090 with 2GB modules has a max limit of 64GB

How does "clamshelling" get around the 32-bits per module requirement? Do the two 2GB modules act as one 4GB module when clamshelled?


So I guess we just wait for HBM to get cheaper and better, which should not take too long, given how much money is being pumped into it?


You'd need memory chips with double the capacity to slap the extra VRAM in, at least without altering the memory bus width. And indeed, some third-party modded cards like that seem to have shown up: https://www.tomshardware.com/pc-components/gpus/nvidia-gamin...

As far as official products go, I think the real reason, which another commenter mentioned, is that they don't want to cannibalize their more powerful card sales. I know I'd be interested in a lower-powered card with a lot of VRAM just to get my foot in the door; that's why I bought an RTX 3060 12GB, which is unimpressive for gaming but actually had the second-most VRAM available in that generation. Nvidia seems to have noticed this mistake and later released a crappier 8GB version to replace it.

I think if the market reacted to create a product like this to compete with Nvidia, they'd pretty quickly release something to fit the need, but as it is they don't have to.


The 3060 with 12GB was an outlier for its time of release because the crypto(currency) hype was raging at that moment and scalpers, miners and everyone in between were buying graphics cards left and right! Hard times those were! D:


There are companies in China doing that, recycling older NVidia GPUs.[1]

[1] https://www.reddit.com/r/hardware/comments/182nmmy/special_c...


You can actually get GPUs from Chinese markets (e.g., AliExpress) that have had their VRAM upgraded. Someone out there is doing aftermarket VRAM upgrades on cards to make them more usable for GPGPU tasks.

Which also answers your question: The manufacturers aren't doing it because they're assholes.


These are a bit mythical; finding one for sale is no small feat.

I guess adding memory to some cards is a matter of completely reworking the PCB, not just swapping DRAM chips. From what I can find it has been done, both chip swaps and PCB reworks; it's just not easy to buy.

Software support is of course another consideration.


Bandwidth. GDDR and HBM, both used by GPUs depending on the use case, are high-bandwidth, low-capacity, comparatively speaking. Modern GPUs try to fit more VRAM by adding memory channels, up to a 512-bit bus, but that requires more die space and hence is expensive.

We will need a new memory design for both GDDR and HBM, and I won't be surprised if they are working on it already. But hardware takes time, so it will be a few more years down the road.


Because then they couldn't sell you the $10k enterprise GPU


True, but it is mostly profit - GDDR6 sells for $2.30 a gigabyte [1]

[1] https://www.dramexchange.com


That's for 8Gbit chips, which are more or less unusable in modern products. 16Gbit chips are at ~$8, or $4 per GB.


10? Try 30+ ...


> enabling consumers to run big-LLM inference locally

A non-technical reason is that the market of people wanting to run their personal LLMs at home is very small.


Not sure where I read this and am paraphrasing a lot, but: there's a point where `RAM bandwidth < processor speed` becomes `true`, and the processor becomes architecturally data-starved.

As in, a 32-bit CPU that runs at 1 giga-instruction/second, with a 16 Gbps memory bus, could get at most 0.5 instructions per clock, and that's not very useful. For this reason there can't be an absolute potato with gigantic RAM.
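
Spelling that arithmetic out (toy numbers only):

    # Instruction fetch alone saturates the bus in this example.
    instr_per_sec = 1e9      # 1 giga-instruction/second
    bits_per_instr = 32      # 32-bit CPU, ignoring data accesses entirely
    bus_bps = 16e9           # 16 Gbps memory bus
    ipc_limit = bus_bps / (instr_per_sec * bits_per_instr)
    print(ipc_limit)         # 0.5 instructions per clock, and that's before any data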

How gigantic is not useful, idk.


Seems we're some years away from getting that into the consumer price range.


NVidia sells memory and GPUs as bundles to board partners.

If you harm their profit, good luck continuing to have access to GPU chips. It’s a cartel.



