I don't understand why you'd even put the HBM on top of the core.
From what I understand, in a typical GPU die the logic and connections are on one side and inert silicon is on the other. So unless you drill through the silicon, you don't get shorter routing.
Why not put the GPU on one side and the HBM on the other side of the PCB? Wouldn't that fix the cooling problem?
Routing signals through PCB vias requires higher voltages and offers far less bandwidth than silicon-to-silicon bridges. AMD's first generation of cache dies bonded to the top of the CPU; the second generation bonded to the CPU's bottom, which improved cooling for the fast logic on top. Similarly, HBM under the logic would be ideal.
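A rough way to see the gap is energy per bit moved. Here's a quick sketch; the pJ/bit figures are only assumptions for illustration, since real numbers vary a lot by process, trace length, and signaling standard:

```python
# Back-of-envelope comparison of off-package vs. stacked-die signaling.
# The pJ/bit figures below are assumptions for illustration only.

ENERGY_PJ_PER_BIT = {
    "PCB trace / via (GDDR-class)": 7.0,   # assumed
    "2.5D interposer (HBM-class)":  3.5,   # assumed
    "3D hybrid bond (stacked)":     1.0,   # assumed
}

BANDWIDTH_TB_S = 8.0  # assumed aggregate memory bandwidth

for link, pj_per_bit in ENERGY_PJ_PER_BIT.items():
    # watts = (bits/s) * (joules/bit)
    watts = BANDWIDTH_TB_S * 1e12 * 8 * pj_per_bit * 1e-12
    print(f"{link}: ~{watts:.0f} W just to move data")
```

At multi-TB/s bandwidths, the I/O power alone makes the longer, higher-voltage PCB path a non-starter, before you even get to signal integrity.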
This is overly dismissive of the idea that you'd cut clock frequency to boost LLM inference performance by 46%. Yes, it's workload-specific, but the industry is spending tens of billions of dollars running that workload, so it's actually quite smart to focus on it. People will certainly take that trade-off if it's offered.
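A minimal roofline-style sketch of why the trade-off can work, assuming LLM decode is memory-bandwidth bound. All of the numbers below are made up to show the mechanism; they are not taken from the paper:

```python
# Roofline-style sketch: underclocking can still win if decode is
# bandwidth-bound. All figures are illustrative assumptions.

def decode_tokens_per_s(compute_tflops, bandwidth_tb_s,
                        flops_per_token=2 * 70e9,  # assumed ~70B-param model
                        bytes_per_token=70e9):     # assumed 8-bit weights
    """Token rate is limited by whichever resource runs out first."""
    compute_bound = compute_tflops * 1e12 / flops_per_token
    memory_bound = bandwidth_tb_s * 1e12 / bytes_per_token
    return min(compute_bound, memory_bound)

baseline = decode_tokens_per_s(compute_tflops=1000, bandwidth_tb_s=4)
# Assume ~30% lower clocks, but stacked memory raises bandwidth ~1.5x.
underclocked = decode_tokens_per_s(compute_tflops=700, bandwidth_tb_s=6)

print(f"baseline:     {baseline:.0f} tok/s")      # bandwidth-limited
print(f"underclocked: {underclocked:.0f} tok/s")  # still bandwidth-limited, but faster
```

If the workload never touches the compute roof, the clocks you give up were mostly idle anyway.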
Still, it's a good article, and it's nice to see the old AnandTech crew back together. The random grammatical errors are still there, but these days they're a reassuring sign that the article was written by hand.
Interesting read. The paper calls this a “roadmap” and says 3D HBM is still figuring out what it can be and what it will look like, which seems right.
Hyperscalers are dealing with a pretty complex Pareto envelope that includes total power, power density, the physical volume available, token throughput, and token latency.
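For a concrete picture of what that envelope means, here's a minimal Pareto-filter sketch over hypothetical deployment configs; the names and numbers are invented:

```python
# Minimal Pareto-frontier filter over hypothetical deployment configs.
# Metrics: total power (W, lower better), rack volume (U, lower better),
# token throughput (tok/s, higher better). All values are made up.

from dataclasses import dataclass

@dataclass
class Config:
    name: str
    power_w: float
    volume_u: float
    tokens_per_s: float

def dominates(a: Config, b: Config) -> bool:
    """a dominates b if it is no worse on every axis and better on at least one."""
    no_worse = (a.power_w <= b.power_w and a.volume_u <= b.volume_u
                and a.tokens_per_s >= b.tokens_per_s)
    strictly_better = (a.power_w < b.power_w or a.volume_u < b.volume_u
                       or a.tokens_per_s > b.tokens_per_s)
    return no_worse and strictly_better

configs = [
    Config("stock GPU + HBM",          1000, 4, 100),
    Config("underclocked 3D-stacked",   700, 4, 146),
    Config("dense, liquid-cooled",     1200, 2, 150),
    Config("older gen, cheap",          600, 4,  60),
]

frontier = [c for c in configs if not any(dominates(o, c) for o in configs)]
for c in frontier:
    print(c.name)
```

Latency would just be a fourth axis in `dominates`; the point is that several very different designs can sit on the frontier at once, which is why I expect a mix.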
My guess is that some heterogeneous compute will be deployed possibly forever, but likely for at least the next six to ten years, and the exotic, fragile, underclocked, highly dense compute imagined in the paper is likely to be part of that. But probably not all of it.
Either way as a society we’ll get the benefits of at least a trillion dollars of R&D and production on silicon, which is great.