
Memory bandwidth puts an upper limit on LLM tokens per second.

At 200GB/s, that upper limit is not very high at all. So it doesn't really matter if the compute is there or not.
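To make that concrete, here's a rough back-of-the-envelope sketch (my own example numbers, assuming batch-1 decoding where every weight has to be streamed from memory once per generated token, and ignoring KV-cache traffic):

    # Rough ceiling on tokens/sec for batch-1 decoding:
    #   tokens/sec <= memory_bandwidth / model_size_in_bytes
    # (assumes perfect bandwidth utilization, so real numbers will be lower)

    def max_tokens_per_sec(bandwidth_gb_s: float, params_b: float, bytes_per_param: float) -> float:
        model_bytes = params_b * 1e9 * bytes_per_param
        return bandwidth_gb_s * 1e9 / model_bytes

    # Hypothetical 70B-parameter model at 4-bit (~0.5 bytes/param):
    print(max_tokens_per_sec(200, 70, 0.5))   # ~5.7 tok/s ceiling at 200GB/s
    print(max_tokens_per_sec(90, 70, 0.5))    # ~2.6 tok/s ceiling at 90GB/s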



The M1 Max's GPU can only make use of about 90GB/s out of the 400GB/s they advertise/support. If the AMD chip can make better use of its 200GB/s then, as you say, it will manage to have better LLM tokens per second. You can't just look at what has the wider/faster memory bus.

https://www.anandtech.com/show/17024/apple-m1-max-performanc...


This mainly shows that you need to watch out when it comes to unified architectures. The sticker bandwidth might not be what you can get for GPU-only workloads. Fair point. Duly noted.
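The practical takeaway is to measure achievable bandwidth rather than trust the spec sheet. As a toy illustration of the idea (not what AnandTech did; a single-threaded CPU copy like this won't saturate a wide unified-memory bus, and GPU-side benchmarks are needed for the GPU figure):

    import time
    import numpy as np

    # Time a large array copy and report effective bandwidth.
    N = 1 << 28                       # ~268M float32 values, ~1 GiB per buffer
    src = np.ones(N, dtype=np.float32)
    dst = np.empty_like(src)

    t0 = time.perf_counter()
    np.copyto(dst, src)               # reads src and writes dst: ~2 GiB of traffic
    elapsed = time.perf_counter() - t0
    print(f"~{2 * src.nbytes / elapsed / 1e9:.1f} GB/s effective copy bandwidth")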

But my overarching point still stands: LLM inference needs memory bandwidth, and 200GB/s is not very much (especially for the higher-RAM variants, where you'd want to run larger models).

If the M1 Max actually only gets 90GB/s, that just means it's a poor choice for LLM inference.



