
AMD RDNA 3 GPUs To Be More Power Efficient Than NVIDIA Ada Lovelace GPUs, Navi 31 & Navi 33 Tape Out Later This Year

Armorian

Banned
Rumors that both are going to require over 400W of power for the top end? Crazy. Should be a huge performance jump based just on that.

I think the top GPUs from AMD and Nvidia will happily consume ~500W, given that the current top is around 300W and we're talking at least 2x performance with only minimal power-efficiency gains.
 

ToTTenTranz

Banned
I don't understand all these recent leaks from Nvidia and AMD; we are more than a year away from anything real 🤷‍♀️

Navi 31 is Q2 2022, less than a year away.


Expect AMD to include Tensor DSP with future AMD GPUs and APUs.
The CVML block is present in Van Gogh, Dragon Crest and Rembrandt (Cezanne's Zen4+DDR5 successor). There are no Tensor / MAC units planned for consumer dGPUs. AMD sees no advantage in spending die area on those, especially since RDNA3 is bringing a sizeable upgrade in compute power and their shader processors already do quad-rate INT8 and 8-rate INT4.
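For a sense of what those packed-math rates mean, here's a back-of-envelope throughput sketch. The ALU count and clock are illustrative round numbers in the Navi 21 class, not official figures, and the 2-ops-per-ALU-per-clock FMA baseline is the usual convention:

```python
# Peak throughput = ALUs * 2 (one FMA = 2 ops) * clock * packed-math rate.
# RDNA 2-style rates: INT8 at 4x the FP32 rate, INT4 at 8x.
# ALU count and clock below are illustrative, roughly Navi 21-class.

def peak_tops(alus, clock_ghz, rate):
    """Peak tera-ops per second for a given packed-math rate multiplier."""
    return alus * 2 * clock_ghz * rate / 1000.0

ALUS, CLOCK_GHZ = 5120, 2.25
fp32_tflops = peak_tops(ALUS, CLOCK_GHZ, 1)  # baseline FP32
int8_tops = peak_tops(ALUS, CLOCK_GHZ, 4)    # quad-rate INT8
int4_tops = peak_tops(ALUS, CLOCK_GHZ, 8)    # 8-rate INT4

print(f"FP32 {fp32_tflops:.1f} TFLOPS | INT8 {int8_tops:.1f} TOPS | INT4 {int4_tops:.1f} TOPS")
```

That's how a ~23 TFLOPS FP32 part advertises ~92 INT8 TOPS without any dedicated matrix units.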
 
Last edited:
1. Raytracing denoise pass is done via compute, read my post.
No, it's performed using the Tensor cores, not the shaders.

2. PC DirectStorage decompression is done via GPGPU compute. Try again.

Try what? What does decompression have to do with GPU performance? CPU decompression is still way faster than GPU decompression. So who cares.

3. My post answered the Ampere RT vs RDNA 2 RT question. Follow the thread.

Fine. Not sure where the RT question came into play, because you and I were talking about Ampere's FP compute being largely useless in gaming workloads.

4. Ampere Compute TFLOPS is real via a certain path.

Of course it's real. If you can fill all the ALUs you can get a lot of compute performance. That 'certain path' just isn't games. Most people here don't care about compute workloads. And even if you did, you'd probably prefer an RTX A6000 over a 3090 because it has pro drivers that fully extract Ampere's compute capabilities.

5. PC DirectStorage decompression is done via the GPGPU path. DirectML is another GPGPU path.

Cool. So what?
Nvidia will use their tensor cores for ML, because they're faster at matrix and bfloat operations than cuda cores or regular vector ALUs.
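For context on the bfloat point: bf16 keeps float32's full 8-bit exponent range but only 7 mantissa bits, which is why ML hardware favors it. A minimal sketch of bf16 rounding by truncation (real hardware typically rounds to nearest even, so treat this as an approximation):

```python
import struct

def to_bf16(x):
    """Approximate bfloat16 by truncating a float32 bit pattern to its
    top 16 bits (1 sign, 8 exponent, 7 mantissa bits)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

print(to_bf16(3.14159265))   # 3.140625 -- only ~2-3 decimal digits survive
print(to_bf16(1.0e38))       # still finite: bf16 keeps float32's range
```

The low precision is also why tensor-core paths accumulate matmul results in FP32 rather than in the narrow input format.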


6. Wrong. PC and XSX RDNA 2 were modified to follow DirectX12U and its Vulkan counterparts. I answered the question.

Modified how? Do you have the white papers for RDNA2 and the Series X ISA to confirm that the DX12U features enabled on XSX don't exist in PC RDNA2? I want to see the receipts. You can't make a spurious claim here with no evidence.


7. Unlike the RDNA 2 competition, RTX 2080 Ti has a separate TIOPS resource that hides extra performance from the typical TFLOPS debate.

This sentence is unintelligible. What are you talking about? TIOPS 'hide' extra performance? What is being 'hidden' and how? This sentence makes absolutely no sense whatsoever.




The RTX 2080 Ti FE's real-life average clock speed is higher than the paper spec, i.e. 1824 MHz (~15.9 TFLOPS, not including the partly used TIOPS)

It's still 40% slower than Navi 21, so who cares how many teraflops it has. Not sure what relevance Turing and its compute performance have to a conversation about Ampere and Navi 21.
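For reference, the ~15.9 TFLOPS figure quoted above falls straight out of the standard peak-FLOPS formula (shader cores × 2 FLOPs per FMA × clock); 4352 is the RTX 2080 Ti's CUDA core count:

```python
# Peak FP32 = cores * 2 (one fused multiply-add = 2 FLOPs) * clock.
# 4352 is the RTX 2080 Ti's shader core count; 1635 MHz is the FE
# boost spec, 1824 MHz the real-world average clock claimed above.

def peak_tflops(cores, clock_mhz):
    """Theoretical FP32 TFLOPS from core count and clock in MHz."""
    return cores * 2 * clock_mhz / 1e6

print(peak_tflops(4352, 1635))  # paper boost spec: ~14.2 TFLOPS
print(peak_tflops(4352, 1824))  # observed average clock: ~15.9 TFLOPS
```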
 

twilo99

Member
I think the top GPUs from AMD and Nvidia will happily consume ~500W, given that the current top is around 300W and we're talking at least 2x performance with only minimal power-efficiency gains.

Right..

Hopefully we get better efficiency with the 5xxx series and they keep the wattage the same, because otherwise everyone will have to buy new PSUs and the state of California's lawmakers will lose their minds completely.
 

Buggy Loop

Member
No, it's performed using the Tensor cores, not the shaders.

No, it’s not done on tensor cores for game rendering, only for non real-time rendering via OptiX.

They are working on it but either the results are unsatisfactory, not as fast or they hit a wall with the training for real-time applications like games.
 

rnlval

Member
1. No, it's performed using the Tensor cores, not the shaders.



2. Try what? What does decompression have to do with GPU performance? CPU decompression is still way faster than GPU decompression. So who cares.



3. Fine. Not sure where the RT question came into play, because you and I were talking about Ampere's FP compute being largely useless in gaming workloads.



4. Of course it's real. If you can fill all the ALUs you can get a lot of compute performance. That 'certain path' just isn't games. Most people here don't care about compute workloads. And even if you did, you'd probably prefer an RTX A6000 over a 3090 because it has pro drivers that fully extract Ampere's compute capabilities.



5. Cool. So what?
Nvidia will use their tensor cores for ML, because they're faster at matrix and bfloat operations than cuda cores or regular vector ALUs.




6. Modified how? Do you have the white papers for RDNA2 and Series X's ISA to be able to confirm that DX12U features enabled on XSX doesn't exist in PC RDNA2? I want to see the receipts. You can't make a spurious claim here and have no evidence.




7. This sentence is unintelligible. What are you talking about? TIOPS 'hide' extra performance? What is being 'hidden' and how? This sentence makes absolutely no sense whatsoever.



It's still 40% slower than Navi 21, so who cares how many teraflops it has. Not sure what relevance Turing and its compute performance have to a conversation about Ampere and Navi 21.
1 and 3. From https://images.nvidia.com/aem-dam/e...pere-GA102-GPU-Architecture-Whitepaper-V1.pdf
Quoting from NVIDIA's white paper

Compared to Turing, the GA10x SM’s combined L1 data cache and shared memory capacity is 33% larger. For graphics workloads, the cache partition capacity is doubled compared to Turing, from 32KB to 64KB.

Ray tracing denoising shaders are a good example of a workload that can benefit greatly from doubling FP32 throughput.
----

You are wrong. Try Again. :messenger_tears_of_joy:


You didn't read NVIDIA's white paper. Don't mix up Ray tracing denoising shaders with DLSS 2.x.

2. DirectStorage GPGPU decompression targets future gaming workloads. Prove that the PC's NAVI 21 has XSX's DirectStorage hardware decompression. Hint: budgeting for future gaming workloads.

4. Ampere's excess TFLOPS/TIOPS is useful for "Fine Wine". LOL

5. It depends on the individual DirectML function and input data formats, e.g. Tensor cores have "rapid packed math" 4-bit/8-bit/16-bit inputs with 32-bit output. You are wrong on points 1 and 3. XSX and NAVI 21 DirectML are shared with the normal shader workload. Not every DirectML workload will be processed via the Tensor hardware.

6. RDNA 1 doesn't have DX12U. RDNA 2 has DX12U, hence RDNA 2 PC/XSX was modified for DX12U with NVIDIA's NGGP methods.

7. Your TFLOPS argument didn't factor in Turing's separate TIOPS hardware, while RDNA 1/2's INT and FP share the same ALUs. NAVI 21 has raster-ops superiority over Turing TU102, but 3DMark's mesh shader benchmark has the RTX 2080 Ti very close to the RX 6800 XT/6900 XT.

While NAVI 21 has raster-ops superiority over TU102, its mesh shader and RT performance are near TU102's, hence the RX 6900 XT isn't a sufficient upgrade from an RTX 2080 Ti AIB OC (TU102) when the RTX 3080 Ti's price is similar to the RX 6900 XT's.

The RX 6800 XT/6900 XT upgrade paths are suitable from the RX 5700/RX 5700 XT and RTX 2070/2070 Super/RTX 2080 level.
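The "separate TIOPS" point in 7 can be made concrete with a toy issue-slot model. The 36-integer-ops-per-100-FP-ops mix is NVIDIA's published Turing-era estimate for game shaders; the single-issue-slot behavior here is deliberately simplified:

```python
# Toy model: with shared FP/INT ALUs (RDNA-style) every integer op
# consumes an FP issue slot; with a dedicated INT pipe (Turing-style)
# the integer work runs concurrently and is effectively "hidden" from
# a TFLOPS-only comparison.

def issue_cycles(fp_ops, int_ops, dedicated_int_pipe):
    if dedicated_int_pipe:
        return max(fp_ops, int_ops)  # INT co-issues alongside FP
    return fp_ops + int_ops          # INT steals FP issue slots

FP_OPS, INT_OPS = 100, 36  # NVIDIA's Turing-era average mix for games
print(issue_cycles(FP_OPS, INT_OPS, dedicated_int_pipe=False))  # 136
print(issue_cycles(FP_OPS, INT_OPS, dedicated_int_pipe=True))   # 100
```

In this simplified model the dedicated INT pipe buys roughly a third more effective FP throughput at the same paper TFLOPS.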

I have purchased an MSI RTX 3080 Ti Gaming X Trio AIB OC.
 
Last edited:

rnlval

Member
Navi 31 is Q2 2022, less than a year away.



The CVML block is present in Van Gogh, Dragon Crest and Rembrandt (Cezanne's Zen4+DDR5 successor). There are no Tensor / MAC units planned for consumer dGPUs. AMD sees no advantage in spending die area on those, especially since RDNA3 is bringing a sizeable upgrade in compute power and their shader processors already do quad-rate INT8 and 8-rate INT4.
Tensor workloads have 32-bit output with quad-rate INT8 and 8-rate INT4, which is different from the normal Rapid Packed Math baseline. NAVI 21 and XSX were confirmed to have the 32-bit-output RPM feature.
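A sketch of what one "quad-rate INT8 with 32-bit output" operation looks like, modeled on the DP4A-style dot-product instructions both vendors expose (illustrative semantics, not any vendor's exact ISA definition):

```python
# Four int8 multiplies accumulated into one 32-bit result per ALU per
# clock -- this is why INT8 throughput is quoted at 4x the FP32 rate,
# and why the output is 32-bit even though the inputs are 8-bit.

def dp4a(a, b, acc=0):
    """Dot product of two 4-element int8 vectors with int32 accumulate."""
    assert len(a) == len(b) == 4
    assert all(-128 <= v <= 127 for v in list(a) + list(b))
    return acc + sum(x * y for x, y in zip(a, b))

print(dp4a([1, 2, 3, 4], [5, 6, 7, 8]))          # 70
print(dp4a([1, 2, 3, 4], [5, 6, 7, 8], acc=30))  # 100
```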
 