
NVIDIA’s Next-Gen Blackwell GB100 GPUs Utilize Chiplet Design

Bernoulli

M2 slut
NVIDIA's upcoming GPU architecture, codenamed Blackwell, is poised to be the successor to Ada Lovelace. In contrast to the split Hopper/Ada generation, Blackwell is set to extend its reach across both data-center and consumer GPUs. NVIDIA is gearing up to introduce several GPU processors with no major alterations to core counts, but there are hints of a significant restructuring of the GPU architecture.

According to the latest series of tweets from Kopite, Blackwell is not expected to feature a substantial increase in core counts. While it remains unclear whether this pertains to both the data-center and gaming series, the core count for Blackwell is anticipated to remain relatively unchanged, while the underlying GPU clusters will undergo significant structural modifications. Kopite has not disclosed further details at this point, but it is said that the GB100 GPU might feature twice as many cores as GB102; both are data-center GPUs.


Additionally, there has been mention of GB100, the data-center GPU for Blackwell, adopting a Multi-Chip Module (MCM) design. This suggests that NVIDIA will employ advanced packaging techniques, dividing GPU components into separate dies. The specific number and configuration of these dies are yet to be determined, but this approach will grant NVIDIA greater flexibility in customizing chips for customers, mirroring AMD's intentions with the Instinct MI300 series.


 
I'm not really sure what to make of this. If core counts aren't changing much, does this indicate that, while there will be performance gains, performance won't be significantly better than the 4000 series?

I imagine if this is the case, they are likely moving away from traditional core performance and leaning further into AI to improve performance through technology deployed via the chipset.

It's all very confusing; just sell me a decent 5000 series card that will give me better performance than a 4090 but not cost as much. Thanks.
 

Loxus

Member
This reminds me of this paper from Nvidia back in 2017.

MCM-GPU: Multi-Chip-Module GPUs for Continued Performance Scalability


Historically, improvements in GPU-based high performance computing have been tightly coupled to transistor scaling. As Moore's law slows down, and the number of transistors per die no longer grows at historical rates, the performance curve of single monolithic GPUs will ultimately plateau.

However, the need for higher performing GPUs continues to exist in many domains. To address this need, in this paper we demonstrate that package-level integration of multiple GPU modules to build larger logical GPUs can enable continuous performance scaling beyond Moore's law. Specifically, we propose partitioning GPUs into easily manufacturable basic GPU Modules (GPMs), and integrating them on package using high bandwidth and power efficient signaling technologies. We lay out the details and evaluate the feasibility of a basic Multi-Chip-Module GPU (MCM-GPU) design. We then propose three architectural optimizations that significantly improve GPM data locality and minimize the sensitivity on inter-GPM bandwidth.

Our evaluation shows that the optimized MCM-GPU achieves 22.8% speedup and 5x inter-GPM bandwidth reduction when compared to the basic MCM-GPU architecture. Most importantly, the optimized MCM-GPU design is 45.5% faster than the largest implementable monolithic GPU, and performs within 10% of a hypothetical (and unbuildable) monolithic GPU. Lastly we show that our optimized MCM-GPU is 26.8% faster than an equally equipped Multi-GPU system with the same total number of SMs and DRAM bandwidth.
 

Kenpachii

Member
Interesting, I thought chiplets weren't really beneficial enough for Nvidia to go with them. But I guess they're doing it anyway.

Wonder what gains we will be seeing and what new tech their 5000 cards bring along the way.
 
Don't you mean RDNA 4? Blackwell is the next release from Nvidia, is it not? Likely due at the end of next year.

Not entirely sure. I know a few of the PC leakers mentioned that RDNA 5 is being designed to compete against the 50 series, not RDNA 4, which apparently isn't going to compete in the high end.
 

Zathalus

Member
If I remember correctly, high-end RDNA4 was cancelled because of pricing, not because they couldn't get it working.


I mean, it's not like AMD didn't already create the most complex chip on the planet.
I know they have it. It was a joke aimed at how mediocre the uplift of RDNA 3 is over RDNA 2.
 

Black_Stride

do not tempt fate do not contrain Wonder Woman's thighs do not do not
I'm not really sure what to make of this. If core counts aren't changing much, does this indicate that, while there will be performance gains, performance won't be significantly better than the 4000 series?

I imagine if this is the case, they are likely moving away from traditional core performance and leaning further into AI to improve performance through technology deployed via the chipset.

It's all very confusing; just sell me a decent 5000 series card that will give me better performance than a 4090 but not cost as much. Thanks.

It's a new architecture, so simply looking at CUDA counts is pointless; that's always been the case.

Case in point:
RTX 3070 - 5888 CUDA cores.
RTX 4070 - 5888 CUDA cores.



The 4070 performs like a 3080 while having far fewer CUDA cores.
 

Black_Stride

do not tempt fate do not contrain Wonder Woman's thighs do not do not
it's because RDNA 3 was their first chiplet GPU

the jump should be bigger when they master it

AMD's MCM design simply separates GCDs and MCDs; the GCDs are effectively the same compute dies they've been making for eons.

They've already "mastered" the GCD (which does most of the work) in the MCM design; reducing latency and improving the MCDs is pretty much the only place left to gain anything.


[Navi 31 XTX die shot]


The big block in the center is the GCD, basically the same as a monolithic die, surrounded by the MCDs, which are on a different node.
This isn't a perf thing, it's a cost-cutting measure: shrink the GCD while keeping the MCDs on another node to keep costs down.


They aren't going to be making major jumps in power, and that's likely the reason they aren't bothering with high-end GPUs anymore; there's no point aiming that high and ending up in no man's land.

Better to fight the xx70s and Arc x70s.
 

winjer

Gold Member
I know they have it. It was a joke aimed at how mediocre the uplift of RDNA 3 is over RDNA 2.

RDNA3 is lacking, but not because of the chiplet design.
If anything, chiplets have been AMD's biggest success in the last decade.

But what most people still don't understand about chiplets or MCM is that it's not meant to improve performance. It's meant to improve yields and cost. Especially cost.
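To illustrate that point with a toy example (a basic zero-defect yield model with made-up numbers, not anything published by AMD or TSMC):

```python
import math

def die_yield(area_mm2: float, defects_per_mm2: float) -> float:
    """Poisson yield model: probability a die of the given area has zero defects."""
    return math.exp(-area_mm2 * defects_per_mm2)

D0 = 0.001  # illustrative defect density: 0.1 defects per cm^2

mono_yield = die_yield(600.0, D0)     # one big 600 mm^2 monolithic die
chiplet_yield = die_yield(100.0, D0)  # one 100 mm^2 chiplet (six of them cover similar silicon)

print(f"600 mm^2 monolithic die yield: {mono_yield:.1%}")    # ~54.9%
print(f"100 mm^2 chiplet yield:        {chiplet_yield:.1%}")  # ~90.5%
print(f"Good silicon per wafer, chiplets vs. monolithic: {chiplet_yield / mono_yield:.2f}x")  # ~1.65x
```

The smaller dies waste far less silicon on defects, and the MCDs can also sit on an older, cheaper node; that's the cost story, not performance.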
 

Bernoulli

M2 slut
AMD's MCM design simply separates GCDs and MCDs; the GCDs are effectively the same compute dies they've been making for eons.

They've already "mastered" the GCD (which does most of the work) in the MCM design; reducing latency and improving the MCDs is pretty much the only place left to gain anything.


[Navi 31 XTX die shot]


The big block in the center is the GCD, basically the same as a monolithic die, surrounded by the MCDs, which are on a different node.
This isn't a perf thing, it's a cost-cutting measure: shrink the GCD while keeping the MCDs on another node to keep costs down.


They aren't going to be making major jumps in power, and that's likely the reason they aren't bothering with high-end GPUs anymore; there's no point aiming that high and ending up in no man's land.

Better to fight the xx70s and Arc x70s.
But they say the latency is the biggest problem, and improving that would already give them a jump.
 

Black_Stride

do not tempt fate do not contrain Wonder Woman's thighs do not do not
The 4070 clocks a lot higher though, which offsets the shader deficit.

4070:
[4070 clock vs. voltage chart]

3080:
[3080 clock vs. voltage chart]


3080 - 68*128*2*1.93 = ~33.6 TFLOPS
4070 - 46*128*2*2.762 = ~32.5 TFLOPS

Perf/TFLOPS is effectively the same from Ampere -> Ada.
Ada's efficiency and clock speed advantage IS its architectural advantage.
Which is why I said simply looking at the CUDA count doesn't really mean much.
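If you want to sanity-check those numbers yourself, it's just SMs x 128 FP32 lanes x 2 ops per FMA x clock; a quick sketch using the boost clocks quoted above:

```python
def fp32_tflops(sms: int, boost_ghz: float, fp32_per_sm: int = 128) -> float:
    """Peak FP32 throughput: SMs * FP32 lanes * 2 ops per FMA * clock (GHz), in TFLOPS."""
    return sms * fp32_per_sm * 2 * boost_ghz / 1000.0

print(f"RTX 3080 (68 SMs @ 1.93 GHz):  {fp32_tflops(68, 1.93):.1f} TFLOPS")   # ~33.6
print(f"RTX 4070 (46 SMs @ 2.762 GHz): {fp32_tflops(46, 2.762):.1f} TFLOPS")  # ~32.5
```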

But they say the latency is the biggest problem, and improving that would already give them a jump.

Without major improvements to the GCD you aren't going to be getting a "big" jump in performance.
A big jump is of course relative; I'm assuming they are going back to the RDNA1 model and will only have an 8700XT/8800XT as their range-topping GPU, which will likely compete with the RTX 5070.
Now that might still be considered a big jump if the 8700XT/8800XT beats and/or matches the 7900XTX.
Who knows, only time will tell.
 

shamoomoo

Member
It's a new architecture, so simply looking at CUDA counts is pointless; that's always been the case.

Case in point:
RTX 3070 - 5888 CUDA cores.
RTX 4070 - 5888 CUDA cores.



The 4070 performs like a 3080 while having far fewer CUDA cores.
I'm not sure that's a good enough example, since Ampere was on an inefficient node and with Ada Lovelace the frequency was dramatically increased. You'd need like-for-like clocks on both GPUs to see if the architectural improvements actually changed anything.


The RX 6700 XT performed the same as the 5700 XT at the same clock speed.
 

Buggy Loop

Member
As expected, ever since the Blackwell name first appeared.

Even Ada had an MCM alternative ready to go according to kopite, but Nvidia was so impressed with the TSMC output that they scrapped the idea. As everyone saw, the 4090 was a beast, probably the last monolithic king.

I'm not really sure what to make of this. If core counts aren't changing much, does this indicate that, while there will be performance gains, performance won't be significantly better than the 4000 series?

I imagine if this is the case, they are likely moving away from traditional core performance and leaning further into AI to improve performance through technology deployed via the chipset.

It's all very confusing; just sell me a decent 5000 series card that will give me better performance than a 4090 but not cost as much. Thanks.

There are so many things that can change in an SM that core count is a meaningless metric. Their "not substantial" can also be compared with previous gens.

From Pascal GP104 (2560) → Turing TU104 (3072) → Ampere GA104 (3072 → ~2x3072) → Ada AD104 (3840 → ~2x3840)

Nvidia "doubled" the INT32/FP32 cores for Ampere and Ada, but that's a bit of a stretch since not all of them can be used at once. The point is that from a top-level block-diagram perspective, until Nvidia announced it, you would barely have seen a "substantial" core count increase from Pascal to Ada either.

My guess is that, fundamentally, there's a lot of work to be done around datapaths and memory paths, especially if they go chiplet. This will be the foundation of their MCM, and while almost every company has done MCM at this point for the likes of data servers, where data iterations are known quantities (non-real-time productivity tasks), those solutions scale badly for gaming. Even with Apple's 2.5 TB/s interposer, GPU MCMs do not scale as well as CPUs; CPUs get close to 200% while GPUs get maybe +50%. So everything will depend on the chiplet datapaths before you even decide to slap 2 GCDs together, and there's a need for a major overhaul. Just doing what the servers and Apple did is not a good plan for gaming.
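To put a rough number on why the inter-die link is the whole problem, here's a toy Amdahl-style model (every number in it is invented purely for illustration, nothing from Nvidia or AMD):

```python
def two_gcd_speedup(cross_die_fraction: float, link_penalty: float) -> float:
    """
    Toy model: two GCDs double raw throughput, but any work whose data sits on
    the other die runs 'link_penalty' times slower over the inter-die fabric.
    Returns effective speedup vs. a single die. Purely illustrative.
    """
    return 2.0 / ((1.0 - cross_die_fraction) + cross_die_fraction * link_penalty)

# Productivity / server-style work: little cross-die traffic -> scales almost like CPU CCDs
print(f"{two_gcd_speedup(0.05, 3.0):.2f}x")  # ~1.82x
# Real-time rendering: frame state constantly bouncing between dies -> roughly +50%
print(f"{two_gcd_speedup(0.15, 3.0):.2f}x")  # ~1.54x
```

Cutting either the cross-die traffic (data locality) or the link penalty (the datapath overhaul above) is the only way a second GCD pays off, which is exactly the inter-GPM bandwidth sensitivity the 2017 MCM-GPU paper quoted earlier was optimizing for.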

Probably also a big focus on their ReSTIR / NRC path tracing; I would imagine the gen-5 ray tracing cores and memory caching will be majorly overhauled for path tracing. And of course even more ML leverage, as NRC is AI based, and as we see now they are about to release ray reconstruction; Nvidia's future is ML, and rightfully so. NRC is also highly dependent on low-level caches and memory traffic, so a revamp at the warp level so that the NRC remains in registers and doesn't bounce around memory lanes is in order if they're going down this path. All of that is detailed in their research papers as mitigations to yield additional performance.

Scaling to MCM for only +50% performance in gaming doesn't cut it; you stay monolithic if that's the result. So the question is: will Nvidia be the first to unlock gaming MCM where it scales similarly to CPUs?

It might even need an OS revamp to achieve that. AMD's engineers kind of covered that with the press: it's trickier to have multiple GCDs on a GPU than CPU CCDs. As of now the OS has native multi-CPU support and it's well understood how the system schedules tasks across multiple CPUs. There's no such thing for GPUs; it has to be handled on the driver side, which is a big yikes... but time will tell.
 

Toots

Gold Member
Next stop on the Toots Hot Take Tour 2023
Hot take:

I was wondering if the name "Blackwell" was virtue signaling like Lovelace was?
[photo of David Blackwell]

I guess it is :messenger_grinning_sweat:
Dude seems like a pretty cool guy tho, and I don't really care if you celebrate black or white nerds as long as you are celebrating real nerds.
Anyway, here's my hot take:
I wonder when the all-white Nvidia marketing execs will find enough balls between all of them to do the right thing and call the next gen of GPUs "Elijah Muhammad" and be done with their pandering...
 

Black_Stride

do not tempt fate do not contrain Wonder Woman's thighs do not do not
Next stop on the Toots Hot Take Tour 2023
Hot take:

I was wondering if the name "Blackwell" was virtue signaling like Lovelace was?
[photo of David Blackwell]

I guess it is :messenger_grinning_sweat:
Dude seems like a pretty cool guy tho, and I don't really care if you celebrate black or white nerds as long as you are celebrating real nerds.
Anyway, here's my hot take:
I wonder when the all-white Nvidia marketing execs will find enough balls between all of them to do the right thing and call the next gen of GPUs "Elijah Muhammad" and be done with their pandering...

GeForce 6 "Curie" was codenamed after Marie Curie, and Hopper was named after Grace Hopper; I take it they were virtue signaling then as well.
The anti-woke crowd are absolutely insane, mane... it's reached the point where I don't know if this is parody or whether these people are actually this deluded and desperate for something to be outraged at.
 

Buggy Loop

Member
Next stop on the Toots Hot Take Tour 2023
Hot take:

I was wondering if the name "Blackwell" was virtue signaling like Lovelace was?
[photo of David Blackwell]

I guess it is :messenger_grinning_sweat:
Dude seems like a pretty cool guy tho, and I don't really care if you celebrate black or white nerds as long as you are celebrating real nerds.
Anyway, here's my hot take:
I wonder when the all-white Nvidia marketing execs will find enough balls between all of them to do the right thing and call the next gen of GPUs "Elijah Muhammad" and be done with their pandering...

Sylvester Stallone Facepalm GIF


is everything about wokism now

Fucking hell
 

hlm666

Member
This is data centre, right? I can't see Nvidia wasting the limited advanced-packaging capacity they need for these on gaming GPUs when they can sell every AI GPU they can make at crazy prices, and that packaging (because of HBM, apparently) is currently the limiting factor. The gaming GPUs will most probably stay monolithic because of this, since MCM would require them to use that packaging too.
 

Black_Stride

do not tempt fate do not contrain Wonder Woman's thighs do not do not
We are about to get charged out the ass.
Nope fuck that, imma be skipping Blackwell for sure.




This is data centre, right? I can't see Nvidia wasting the limited advanced-packaging capacity they need for these on gaming GPUs when they can sell every AI GPU they can make at crazy prices, and that packaging (because of HBM, apparently) is currently the limiting factor. The gaming GPUs will most probably stay monolithic because of this, since MCM would require them to use that packaging too.
Nvidia is consolidating their architectures.
HPC will be GB10x, consumer will be GB20x.


It doesn't make sense to have monolithic be consumer and MCM be HPC when, in terms of CUDA counts, they aren't far apart.
If anything, the opposite would be more likely.
The margins on gaming GPUs are lower, so going MCM to get as much savings as possible makes sense; in the HPC market you can charge them whatever the fuck you want.
 

KungFucius

King Snowflake
I'm not really sure what to make of this. If core counts aren't changing much, does this indicate that, while there will be performance gains, performance won't be significantly better than the 4000 series?

I imagine if this is the case, they are likely moving away from traditional core performance and leaning further into AI to improve performance through technology deployed via the chipset.

It's all very confusing; just sell me a decent 5000 series card that will give me better performance than a 4090 but not cost as much. Thanks.
Oh c'mon. We all know whatever they come out with will push you towards the 5090 that costs $1800.

We are about to get charged out the ass.
Nope fuck that, imma be skipping Blackwell for sure.





Nvidia is consolidating their architectures.
HPC will be GB10x, consumer will be GB20x.


It doesn't make sense to have monolithic be consumer and MCM be HPC when, in terms of CUDA counts, they aren't far apart.
If anything, the opposite would be more likely.
The margins on gaming GPUs are lower, so going MCM to get as much savings as possible makes sense; in the HPC market you can charge them whatever the fuck you want.
Are you sure? I said I am fine with 3090 for 4-6 years. 2 years later I was hammering F5 to get a 4090.

I would love to see them go to 2025 without launching the next gen, because that will make me look less like a weak bitch when I cave and buy one at launch. 32GB alone will have them bumping the MSRP to 2k. They will price these things like they are losing HPC revenue just by selling them.
 

dave_d

Member
GeForce 6 "Curie" was codenamed after Marie Curie, and Hopper was named after Grace Hopper; I take it they were virtue signaling then as well.
The anti-woke crowd are absolutely insane, mane... it's reached the point where I don't know if this is parody or whether these people are actually this deluded and desperate for something to be outraged at.
My main annoyance with Hopper is idiots who try to play up her importance by harping on the fact that she coined the term "bug" and then miss her work on COBOL (one of the first higher-level languages; I am so glad I don't have to program in machine code). To give an analogy, that would be like talking about Thomas Jefferson as "the guy who invented french fries" and basically missing the other stuff he did.
 

Celcius

°Temp. member
This will be when I finally upgrade from my RTX 3090.
Honestly, I wouldn't mind if they took a generation to keep performance the same as last gen but focused on cutting heat and power draw in half. Stuff is getting out of hand.
 

Reallink

Member
Ada's efficiency and clock speed advantage IS its architectural advantage.
Which is why I said simply looking at the CUDA count doesn't really mean much.
An important distinction to make is that the same core count on the same node is going to perform very similarly. Actual architectural gains are very small at this point, especially one gen apart. For people completely unversed in this stuff, 90% of performance gains come from node shrinks allowing more cores on the same size die, higher clocks due to improved power efficiency, or both. The 4070's 55% clock advantage is the achievement of TSMC (and the failure of Samsung), not Nvidia.
 