
DF: Tales of Arise PS5 vs Series X/S Tech Breakdown.

Mister Wolf

Member
My favorite game in a couple years

Easily my favorite combat I've played in an action RPG. It shits all over the Kingdom Hearts franchise in that department. All the characters play differently. I love the skill trees and the steps you need to take to unlock parts of them. Best-looking cel shading I've seen in a game. Beautiful art style. I love all the optional side-quest boss battles, the meaningful rewards you get from them, and the exploration. The game even has a training arena that gives great rewards. Hell, I even like the fishing. The game makes me want to play it even when I'm tired from work. Everything about this game is great.
 
Last edited:

MonarchJT

Banned
Isn't Doom 1800p with raytracing

It would be if it didn't run into other bottlenecks
And what would that bottleneck be: the stronger GPU, the faster CPU, the higher bandwidth, or the slower SSD? Can we admit there's a 99% likelihood that it's simply a port made with little attention to the characteristics of the machine? (I'm talking about Ghostrunner.)
 
Last edited:

Cornbread78

Member
The game looks and plays great on my PS5. Thinking about it, I'll probably play through it again on my SeX when it hits GP eventually
 

DaGwaphics

Member


- Two modes: a 1620p mode which targets 60 fps on PS5 and Series X, and a 4K mode with better shadows and an unlocked frame rate. Both are a visual match. On Series S it's 1080p at 60 fps or 1440p at an unlocked frame rate.

- Performance: on PS5 the performance mode is almost a locked 60 fps. The 4K mode hovers around the 40s and 50s most of the time, but it can dip as low as 30 in heavy combat scenes. On Series X the performance mode is a locked 60 fps, and the graphics mode is 60 fps "a surprising amount of the time" at the beginning of the game (it struggles a bit more later on). On Series S the 1080p mode is a locked 60 fps and the 1440p mode is usually around 50 to 60 fps, with dips into the low 40s.


Good to see that the unlocked mode hovers in VRR range. I'm seriously going to have to upgrade my monitor.
 

Zathalus

Member
OK, show it to us with a like-for-like scene like I did. It takes time to do that kind of comparison. The problem with combat scenes is that they are more dynamic than exploration, notably when there's a large amount of alpha on screen. What we already know is that both can drop into the 30s during combat. But it's harder to produce a rough framerate average like I did.
You don't need a perfect like for like to get a comparison between two systems. Just a large enough sample size. I think it's pretty clear from both DF and ElAnalistaDeBits that the XSX is performing notably better in combat scenes. Only one system drops to 30FPS in all the combat scenes. The lowest XSX drops to is 38FPS, and drops that low much less often.
 

rnlval

Member
It also has higher rasterization throughput:

8QXpMZC.jpg
Pixel fillrate is limited by memory bandwidth. PC NAVI 21 has a large 128 MB L3 cache + delta color compression. Remember, XBO's 32 MB eSRAM can fit 1600x900 framebuffers without delta color compression. 128 MB is 4X over 32 MB.

Pixel fillrate can be supplemented by compute shaders using the texture mapping units' (TMU) read/write functions. That was the entire marketing push behind AMD's async compute: supplementing ROPs with TMUs.

Game console GPUs don't have PC RDNA 2's 96 MB to 128 MB L3 cache.
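To put the "pixel fillrate is limited by memory bandwidth" point in rough numbers, here is a minimal back-of-envelope sketch in Python. The ROP count, clock, and bytes-per-pixel below are hypothetical example values, not the specs of any console discussed here:

```python
# Back-of-envelope sketch: bandwidth needed to sustain a given pixel fillrate.
# All numbers below are hypothetical examples, not official console specs.

def fill_bandwidth_gbs(gpixels_per_s: float, bytes_per_pixel: int) -> float:
    """GB/s of write bandwidth needed to sustain gpixels_per_s at bytes_per_pixel each."""
    return gpixels_per_s * bytes_per_pixel

peak_fill = 64 * 2.0  # e.g. a hypothetical 64 ROPs at 2.0 GHz -> 128 Gpixels/s peak
print(fill_bandwidth_gbs(peak_fill, 4))  # RGBA8, no compression: 512.0 GB/s of color writes alone
```

Uncompressed color writes alone can outrun a typical GDDR6 bus, which is why the post points to delta color compression and large on-die caches (eSRAM, Infinity Cache).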
 
Last edited:

rnlval

Member
Pixel fillrate isn't the only thing that gets bigger. Higher clock speed also means higher cache bandwidths.

Games can be limited by L1/L2 throughput before they are CU/SM bound. Here are two games I profiled on my PC. DOOM Eternal is predominantly CU/SM bound, then memory bandwidth bound, and then L1 and L2 bound, which means the more SMs/CUs and memory bandwidth you throw at it, the better it will perform. Death Stranding, on the other hand, is almost always L1 throughput bound across multiple frames, even at higher resolutions like 4K, and less so by the SMs/CUs or other parts of the GPU. In scenarios like Death Stranding, where the game thrashes cache bandwidth, PS5, due to its higher clock speed and therefore higher cache bandwidth per FLOP, will benefit more and have similar or better performance than XSX.

GPU Trace is NVIDIA's trace tool that doesn't show RDNA's L0 cache and local data storage (LDS).

XSX GPU has 5MB L2 cache. Higher CU count yields higher wavefront queue, register, local data storage (LDS) and L0 cache on-chip storage.

PS5 GPU has 4MB L2 cache.
---------------

On the subject of GPU Trace, NVIDIA Turing/Ampere SM has L1 cache storage, hence L1 cache can scale with SM count.
 
Last edited:

Md Ray

Member
Pixel fillrate is limited by memory bandwidth. PC NAVI 21 has a large 128 MB L3 cache + delta color compression. Remember, XBO's 32 MB eSRAM can fit 1600x900 framebuffers without delta color compression. 128 MB is 4X over 32 MB.

Pixel fillrate can be supplemented by compute shaders using the texture mapping units' (TMU) read/write functions. That was the entire marketing push behind AMD's async compute: supplementing ROPs with TMUs.

Game console GPUs don't have PC RDNA 2's 96 MB to 128 MB L3 cache.
Hey, the other day I was trying out this new outfit. I didn't like it at first, but then it kinda grew on me. Something happened and I didn't buy it. And now I wish I had bought the outfit, it was awesome. You should try it when you go to this place called the Moon Fare in New York, 44th Street, 5th Avenue.
 

Loxus

Member
GPU Trace is NVIDIA's trace tool that doesn't show RDNA's L0 cache and local data storage (LDS).

XSX GPU has 5MB L2 cache. Higher CU count yields higher wavefront queue, register, local data storage (LDS) and L0 cache on-chip storage.

PS5 GPU has 4MB L2 cache.
---------------

On the subject of GPU Trace, NVIDIA Turing/Ampere SM has L1 cache storage, hence L1 cache can scale with SM count.
What does any of this have to do with cache throughput and the PS5 having 4MB of L2 cache?

This is the reason XBSX has 5MB of L2 Cache.
Infinity Cache, Discover Its Usefulness, Operation and Secrets
In RDNA, the caches are connected to each other in the following way:
RDNA cache diagram


The L2 cache connects to the outside world via 16 channels of 32 bytes/cycle each; if you look at the Navi 10 diagram, you will see that this GPU has 16 L2 cache partitions and a 256-bit GDDR6 bus to which they are connected.

Keep in mind that GDDR6 uses 2 channels per chip that operate in parallel, each of 16 bits.

GDDR6 channels diagram


In other words, the number of L2 cache partitions in RDNA architectures is equal to the number of 16-bit GDDR6 channels connected to the graphics processor. In RDNA and RDNA 2 each partition is 256 KB, which is why the Xbox Series X, with its 320-bit bus and therefore 20 GDDR6 channels, has 5 MB of L2 cache.


Xbox Series X architecture diagram


The 6900 XT has 4MB of L2 cache with a 256-bit bus.
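A minimal sketch of the relationship the quoted article describes (one 256 KB L2 partition per 16-bit GDDR6 channel); the function name is just illustrative:

```python
# Sketch of the RDNA L2 partitioning rule described above:
# one 256 KB L2 partition per 16-bit GDDR6 channel.

def l2_size_mb(bus_width_bits: int, partition_kb: int = 256) -> float:
    channels = bus_width_bits // 16        # number of 16-bit GDDR6 channels
    return channels * partition_kb / 1024  # total L2 in MB

print(l2_size_mb(256))  # 256-bit bus (Navi 10, PS5, 6900 XT): 4.0 MB
print(l2_size_mb(320))  # 320-bit bus (Series X): 5.0 MB
```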

Cerny's whole talk is about balance and efficiency. How to squeeze a bit more performance out of the GPU.
"If you just calculate teraflops you get the same number, but actually the performance is noticeably different because teraflops is defined as the computational capability of the vector ALU.

That's just one part of the GPU, there are a lot of other units and those other units all run faster when the GPU frequency is higher. At 33% higher frequency, rasterization goes 33% faster, processing the command buffer goes that much faster, the L2 and other caches have that much higher bandwidth and so on.

About the only downside is that system memory is 33% further away in terms of cycles. But the large number of benefits more than counterbalanced that.

As a friend of mine says a rising tide lifts all boats.

Also it's easier to fully use 36 CUs in parallel than it is to fully use 48 CUs. When triangles are small, it's much harder to fill all those CUs with useful work."



I fully understand the advantage the XBSX has in terms of resolution, RT, etc. The only con I see with XBSX is the split memory setup and the 10GB limit.
But you have to admit PS5 hardware is pretty good too.
 

rnlval

Member
What does any of this have to do with cache throughput and the PS5 having 4MB of L2 cache?

This is the reason XBSX has 5MB of L2 Cache.
Infinity Cache, Discover Its Usefulness, Operation and Secrets
In RDNA, the caches are connected to each other in the following way:
RDNA cache diagram


The L2 cache connects to the outside world via 16 channels of 32 bytes/cycle each; if you look at the Navi 10 diagram, you will see that this GPU has 16 L2 cache partitions and a 256-bit GDDR6 bus to which they are connected.

Keep in mind that GDDR6 uses 2 channels per chip that operate in parallel, each of 16 bits.

GDDR6 channels diagram


In other words, the number of L2 cache partitions in RDNA architectures is equal to the number of 16-bit GDDR6 channels connected to the graphics processor. In RDNA and RDNA 2 each partition is 256 KB, which is why the Xbox Series X, with its 320-bit bus and therefore 20 GDDR6 channels, has 5 MB of L2 cache.


Xbox Series X architecture diagram


The 6900 XT has 4MB of L2 cache with a 256-bit bus.

Cerny's whole talk is about balance and efficiency. How to squeeze a bit more performance out of the GPU.
"If you just calculate teraflops you get the same number, but actually the performance is noticeably different because teraflops is defined as the computational capability of the vector ALU.

That's just one part of the GPU, there are a lot of other units and those other units all run faster when the GPU frequency is higher. At 33% higher frequency, rasterization goes 33% faster, processing the command buffer goes that much faster, the L2 and other caches have that much higher bandwidth and so on.

About the only downside is that system memory is 33% further away in terms of cycles. But the large number of benefits more than counterbalanced that.

As a friend of mine says a rising tide lifts all boats.

Also it's easier to fully use 36 CUs in parallel than it is to fully use 48 CUs. When triangles are small, it's much harder to fill all those CUs with useful work."



I fully understand the advantage the XBSX has in terms of resolution, RT, etc. The only con I see with XBSX is the split memory setup and the 10GB limit.
But you have to admit PS5 hardware is pretty good too.
NVIDIA's SM L0/L1 cache is effectively similar to an AMD CU's local data store/L0 cache/instruction cache/K cache. My post addressed NVIDIA's GPU Trace tool.

Both NVIDIA's SM L0/L1 cache and AMD CU's local data store/L0 cache/instruction cache/K cache scale with SM/CU count.

NVIDIA's Ampere SM design.

5vfdIvD.png

VS

bN3TVPz.png


NVIDIA Ampere SM has 256 KB register storage with 128 vector stream processors.

AMD RDNA WGP (two joined CU) has 128 KB register storage with 128 vector stream processors.

Programmers need to be more conservative with register allocation on AMD RDNA hardware, since it runs out of register storage before NVIDIA's Ampere does.
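Taking the post's own figures at face value (256 KB per Ampere SM, 128 KB per RDNA WGP, 128 FP32 lanes each; not independently verified here), the register budget per lane works out as follows:

```python
# Register bytes per FP32 lane, using the figures stated in the post above
# (treated as given, not verified against vendor documentation).

parts = {
    "Ampere SM": {"regs_kb": 256, "fp32_lanes": 128},
    "RDNA WGP":  {"regs_kb": 128, "fp32_lanes": 128},
}

for name, p in parts.items():
    per_lane = p["regs_kb"] * 1024 // p["fp32_lanes"]
    print(f"{name}: {per_lane} bytes of register file per FP32 lane")
# Ampere SM: 2048 bytes per lane; RDNA WGP: 1024 bytes per lane
```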

Both AMD and NVIDIA placed their RT cores next to the texture units, while NVIDIA's RT cores also accelerate ray traversal.

It's a no-brainer why an Ampere SM beats an RDNA 2 WGP on complex compute and raytracing.

---------------
Larger on-chip cache storage reduces how often the GPU has to go out to external memory; higher clock speed doesn't mitigate that.
 
Last edited:

rnlval

Member
Hey, the other day I was trying out this new outfit. I didn't like it at first, but then it kinda grew on me. Something happened and I didn't buy it. And now I wish I had bought the outfit, it was awesome. You should try it when you go to this place called the Moon Fare in New York, 44th Street, 5th Avenue.
You failed GDC 2014 lecture.
 

Md Ray

Member
You failed GDC 2014 lecture.
??
You tend to respond with a post that has nothing to do with what the other person is talking about. I wasn't even talking about pixel fillrate specifically in the post you responded to; higher pixel fillrate might not have a huge impact these days, and I'm aware of that. What about the higher rasterization rate? You fail to understand, or purposefully ignore, that PS5 has higher L1 and L2 cache bandwidth per FLOP than XSX.
 

rnlval

Member
??
You tend to respond with a post that has nothing to do with what the other person is talking about. I wasn't even talking about pixel fillrate specifically in the post you responded to; higher pixel fillrate might not have a huge impact these days, and I'm aware of that. What about the higher rasterization rate? You fail to understand, or purposefully ignore, that PS5 has higher L1 and L2 cache bandwidth per FLOP than XSX.
Wrong, XSX GPU has a higher L2 cache bandwidth since XSX's 5 MB L2 cache relates to 20 16-bit memory controller units.

PS5 GPU's 4 MB L2 cache relates to 16 16-bit memory controller units.
 
Last edited:

rnlval

Member
What does any of this have to do with cache throughput and the PS5 having 4MB of L2 cache?

This is the reason XBSX has 5MB of L2 Cache.
Infinity Cache, Discover Its Usefulness, Operation and Secrets
In RDNA, the caches are connected to each other in the following way:
This is the reason why XSX GPU has 5MB of L2 Cache.

HhHtlXZ.jpg


There are 20 I/O ports from the GPU's 5MB L2 cache at GPU clock speed.

L1 cache can saturate L2 cache which in turn can saturate external memory.

----
PS5 GPU has 16 I/O ports from GPU's 4MB L2 cache at GPU clock speed.

Each I/O port is 16 bits wide, hence
PS5 GPU has 16 x 16 bits = 256 bits coupled with GDDR6-14000.

XSX GPU has 20 x 16 bits = 320 bits coupled with GDDR6-14000. 4K frame buffers with DCC need something like 128 MB of very fast memory storage a.k.a. NAVI 21. XSX's 10 GB 320-bit fast memory is more than enough for frame buffers.
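A quick sketch of the arithmetic above: bus width from the number of 16-bit ports, and the resulting peak bandwidth for GDDR6-14000 (14 Gbps per pin). The helper function is just illustrative:

```python
# Bus width and peak GDDR6-14000 bandwidth from the number of 16-bit channels,
# following the arithmetic in the post above.

def peak_bandwidth_gbs(channels: int, gbps_per_pin: float = 14.0) -> float:
    bus_bits = channels * 16            # each channel/port is 16 bits wide
    return bus_bits * gbps_per_pin / 8  # bits/s -> bytes/s

print(peak_bandwidth_gbs(16))  # 256-bit bus: 448.0 GB/s
print(peak_bandwidth_gbs(20))  # 320-bit bus: 560.0 GB/s
```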
 
Last edited:

Loxus

Member
This is the reason why XSX GPU has 5MB of L2 Cache.

HhHtlXZ.jpg


There are 20 I/O ports from the GPU's 5MB L2 cache at GPU clock speed.

L1 cache can saturate L2 cache which in turn can saturate external memory.

----
PS5 GPU has 16 I/O ports from GPU's 4MB L2 cache at GPU clock speed.

Each I/O port is 16 bits wide, hence
PS5 GPU has 16 x 16 bits = 256 bits coupled with GDDR6-14000.

XSX GPU has 20 x 16 bits = 320 bits coupled with GDDR6-14000. 4K frame buffers with DCC need something like 128 MB of very fast memory storage a.k.a. NAVI 21. XSX's 10 GB 320-bit fast memory is more than enough for frame buffers.
Bruh, I literally posted the same thing.
Don't you read?
This is the reason XBSX has 5MB of L2 Cache.
Infinity Cache, Discover Its Usefulness, Operation and Secrets
In RDNA, the caches are connected to each other in the following way:
RDNA cache diagram


The L2 cache connects to the outside world via 16 channels of 32 bytes/cycle each; if you look at the Navi 10 diagram, you will see that this GPU has 16 L2 cache partitions and a 256-bit GDDR6 bus to which they are connected.

Keep in mind that GDDR6 uses 2 channels per chip that operate in parallel, each of 16 bits.

GDDR6 channels diagram


In other words, the number of L2 cache partitions in RDNA architectures is equal to the number of 16-bit GDDR6 channels connected to the graphics processor. In RDNA and RDNA 2 each partition is 256 KB, which is why the Xbox Series X, with its 320-bit bus and therefore 20 GDDR6 channels, has 5 MB of L2 cache.


Xbox Series X architecture diagram
The 6900 XT has 4MB of L2 cache with a 256-bit bus.
Do you see the 256-bit bus being a problem for the 6900 XT?
Or I should say, do you see it being a problem on PS5? Especially seeing as the PS5 does 8K very comfortably.
 

Lysandros

Member
Wrong. PS5's total cache BW/FLOP amounts to 1.67 TB/s and XSX's is 1.58 TB/s.
Is this the total GPU cache bandwidth to compute ratio? How is this calculated, or is it from a source? By the way, isn't PS5's L2 GPU cache bandwidth 22% higher due to the higher frequency? The amount itself shouldn't have an impact.

Edit: I would expect around 40% more total cache bandwidth per FLOP in favor of PS5 because of its higher frequency to begin with, plus XSX having 18% more teraflops. Unless my logic is wrong.

Edit 2: I found this from the RDNA white paper showing the RX 5700 XT's total cache bandwidth per FLOP figure as a reference. Is it only 0.2 TB/s higher than the XSX figure despite being 9.75 TF max and running at a higher frequency?
vthdGbW.jpg
 
Last edited:

Concern

Member
I don't even play these games. Just wondering why tf was this brought back up to fight over specs some more? 🤣🤣🤣

Do some of you feel like you accomplished something when you go to sleep at night after a gaf battle? Lol
 

Loxus

Member
I don't even play these games. Just wondering why tf was this brought back up to fight over specs some more? 🤣🤣🤣

Do some of you feel like you accomplished something when you go to sleep at night after a gaf battle? Lol
Funny thing is, we moved on from talking about this game and are now talking about cache bandwidth.
 

Roxkis_ii

Member
I play Tales of Arise on a PS4 Pro, and for some reason, knowing it has a slightly higher frame rate on another console hasn't spoiled the fun of the game for me. It's still a really enjoyable game! I hope other players are able to enjoy it too.
 
Last edited:

Md Ray

Member
Is this the total GPU cache bandwidth to compute ratio? How is this calculated, or is it from a source? By the way, isn't PS5's L2 GPU cache bandwidth 22% higher due to the higher frequency? The amount itself shouldn't have an impact.

Edit: I would expect around 40% more total cache bandwidth per FLOP in favor of PS5 because of its higher frequency to begin with, plus XSX having 18% more teraflops. Unless my logic is wrong.

Edit 2: I found this from the RDNA white paper showing the RX 5700 XT's total cache bandwidth per FLOP figure as a reference. Is it only 0.2 TB/s higher than the XSX figure despite being 9.75 TF max and running at a higher frequency?
vthdGbW.jpg
Ok, so on the 5700 XT, you have 16 L2$ tiles. Each tile can move data at 64 bytes per clock.
So 16 x 64B x 1.905 GHz = 1.95 TB/s for L2$

L0$ (L1$) read speed is 128 bytes per clock per CU.
So 16 x 128B x 1.905 GHz = 3.90 TB/s for L1$
40 (CUs) x 128B x 1.905 GHz = 9.75 TB/s for L0$

Add all three cache numbers and you get 15.6 TB/s total cache BW. Divide this number by 9.75 TF and you get exactly 1.6 TB/s BW per FLOP as listed in the whitepaper for the 5700 XT. Now you can do the same calc for PS5 and XSX. Just keep in mind that PS5 has the same 16 L2$ tiles as the 5700 XT, but XSX has 20 L2$ tiles. So 20 x 64B x 1.825 GHz = 2.34 TB/s for XSX's L2$.
                              Radeon 5700 XT   PS5     XSX
Frequency (GHz)               1.905            2.230   1.825
FP32 Performance (TFLOP/s)    9.75             10.28   12.15
L0 Bandwidth (TB/s)           9.76             10.28   12.15
L1 Bandwidth (TB/s)           3.90             4.57    3.74
L2 Bandwidth (TB/s)           1.95             2.28    2.34
Total Cache Bandwidth (TB/s)  15.61            17.13   18.23
Total Cache Bandwidth/FLOP    1.6              1.67    1.50
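The whole table above can be reproduced with a few lines of Python, using the same assumptions as the calculation (64 B/clk per L2 tile, 128 B/clk per L1 slice with 16 slices, 128 B/clk per CU for L0). This is just a sketch of that arithmetic, not a statement about the real microarchitectures:

```python
# Reproduce the cache-bandwidth arithmetic above (same per-clock byte widths,
# 16 L1 slices assumed for all three parts; illustrative only).

def cache_bw_tbs(clk_ghz: float, cus: int, l2_tiles: int, l1_slices: int = 16):
    l0 = cus * 128 * clk_ghz / 1000        # TB/s, 128 B/clk per CU
    l1 = l1_slices * 128 * clk_ghz / 1000  # TB/s, 128 B/clk per L1 slice
    l2 = l2_tiles * 64 * clk_ghz / 1000    # TB/s, 64 B/clk per L2 tile
    return l0, l1, l2

gpus = [("5700 XT", 1.905, 40, 16, 9.75),
        ("PS5",     2.230, 36, 16, 10.28),
        ("XSX",     1.825, 52, 20, 12.15)]

for name, clk, cus, tiles, tflops in gpus:
    l0, l1, l2 = cache_bw_tbs(clk, cus, tiles)
    total = l0 + l1 + l2
    print(f"{name}: L0={l0:.2f} L1={l1:.2f} L2={l2:.2f} total={total:.2f} TB/s, "
          f"{total / tflops:.2f} TB/s per TFLOP")
```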
 
Last edited:

Lysandros

Member
Ok, so on the 5700 XT, you have 16 L2$ tiles. Each tile can move data at 64 bytes per clock.
So 16 x 64B x 1.905 GHz = 1.95 TB/s for L2$

L0$ (L1$) read speed is 128 bytes per clock per CU.
So 16 x 128B x 1.905 GHz = 3.90 TB/s for L1$
40 (CUs) x 128B x 1.905 GHz = 9.75 TB/s for L0$

Add all three cache numbers and you get 15.6 TB/s total cache BW. Divide this number by 9.75 TF and you get exactly 1.6 TB/s BW per FLOP as listed in the whitepaper for the 5700 XT. Now you can do the same calc for PS5 and XSX. Just keep in mind that PS5 has the same 16 L2$ tiles as the 5700 XT, but XSX has 20 L2$ tiles. So 20 x 64B x 1.825 GHz = 2.34 TB/s for XSX's L2$.
                              Radeon 5700 XT   PS5     XSX
Frequency (GHz)               1.905            2.230   1.825
FP32 Performance (TFLOP/s)    9.75             10.28   12.15
L0 Bandwidth (TB/s)           9.76             10.28   12.15
L1 Bandwidth (TB/s)           3.90             4.57    4.67
L2 Bandwidth (TB/s)           1.95             2.28    2.34
Total Cache Bandwidth (TB/s)  15.61            17.13   19.16
Total Cache Bandwidth/FLOP    1.6              1.67    1.58
Thanks for the reply. But both consoles' GPU L1 cache amount per SA is the same, 128 KB, so 4 x 128 = 512 KB in total, and those run at a 22% higher frequency on PS5, so how can L1 cache bandwidth be higher on XSX as shown in your table? Are you saying that XSX's L1 cache is also 20-way/tile instead of 16? Wouldn't this lead to the cache being larger than 128 KB per SA? Sorry for the repeated questions, but the prospect of XSX having higher bandwidth for all its caches compared to PS5 seems brand new and quite baffling to me; I believed the opposite to be true since the reveals, so maybe I missed a few relevant posts earlier.
 
Last edited:

Riky

$MSFT
Ok, so on the 5700 XT, you have 16 L2$ tiles. Each tile can move data at 64 bytes per clock.
So 16 x 64B x 1.905 GHz = 1.95 TB/s for L2$

L0$ (L1$) read speed is 128 bytes per clock per CU.
So 16 x 128B x 1.905 GHz = 3.90 TB/s for L1$
40 (CUs) x 128B x 1.905 GHz = 9.75 TB/s for L0$

Add all three cache numbers and you get 15.6 TB/s total cache BW. Divide this number by 9.75 TF and you get exactly 1.6 TB/s BW per FLOP as listed in the whitepaper for the 5700 XT. Now you can do the same calc for PS5 and XSX. Just keep in mind that PS5 has the same 16 L2$ tiles as the 5700 XT, but XSX has 20 L2$ tiles. So 20 x 64B x 1.825 GHz = 2.34 TB/s for XSX's L2$.
                              Radeon 5700 XT   PS5     XSX
Frequency (GHz)               1.905            2.230   1.825
FP32 Performance (TFLOP/s)    9.75             10.28   12.15
L0 Bandwidth (TB/s)           9.76             10.28   12.15
L1 Bandwidth (TB/s)           3.90             4.57    4.67
L2 Bandwidth (TB/s)           1.95             2.28    2.34
Total Cache Bandwidth (TB/s)  15.61            17.13   19.16
Total Cache Bandwidth/FLOP    1.6              1.67    1.58

So total cache bandwidth is higher on Series X. Does anyone really think this would make up for the disparity in Compute Units and Memory Bandwidth?
I don't know why the likes of Digital Foundry don't get some developer thoughts on all this and the RDNA 2 features, even if the developers don't want to be named; it would make for much better insight.
 

Md Ray

Member
Thanks for the reply. But both consoles' GPU L1 cache amount per SA is the same, 128 KB, so 4 x 128 = 512 KB in total, and those run at a 22% higher frequency on PS5, so how can L1 cache bandwidth be higher on XSX as shown in your table?
Ah, right, my mistake. I applied the same L2$ math towards L1$ as well. That's wrong. Both XSX and PS5 have the same amount of L1$ cache so it is indeed 18% slower on XSX. I've corrected them now. Thanks man.
A five percent difference.

Made an error there. Eleven percent, actually.
That's...
 
Last edited:

Lysandros

Member
Ah, right, my mistake. I applied the same L2$ math towards L1$ as well. That's wrong. Both XSX and PS5 have the same amount of L1$ cache so it is indeed 18% slower on XSX. I've corrected them now. Thanks man.

Made an error there. Eleven percent, actually.
That's...
👍 Independently of bandwidth, PS5 would still benefit from lower latency due to its caches being 22% 'faster', right?
 
So total cache bandwidth is higher on Series X. Does anyone really think this would make up for the disparity in Compute Units and Memory Bandwidth?
It doesn't only make up for it, it allows the PS5 to run Touryst at almost twice the resolution, and it allows Ghostrunner to run something like 30% better. At least that's what I've been told 🤭
 
Wow, big difference, wondering what's going on. Definitely get it on the Xbox side right now if you can, unless they can patch things up to fix the PlayStation side.
 

Loxus

Member
It doesn't only make up for it, it allows the PS5 to run Touryst at almost twice the resolution, and it allows Ghostrunner to run something like 30% better. At least that's what I've been told 🤭
I think you should avoid posting when threads start to get technical.

It's not only cache performance that increases when the GPU is clocked higher. There are many other units in a GPU that all run faster when the GPU frequency is higher: rasterization, command buffer processing, culling, pixel fillrate, etc.

All these units working together is what is giving us these results we see today.
 

rnlval

Member
1. Bruh, I literally posted the same thing.
Don't you read?


2. The 6900 XT has 4MB of L2 cache with a 256-bit bus.
Do you see the 256-bit bus being a problem for the 6900 XT?
Or I should say, do you see it being a problem on PS5? Especially seeing as the PS5 does 8K very comfortably.
1. The difference is that I posted Microsoft source documents showing the 20 I/O ports from the 5MB L2 cache.

2. The 6900 XT (NAVI 21) has a ~1 TB/s, 128 MB L3 cache coupled with the delta color compression feature. Recall that XBO's 32 MB eSRAM (memory write bandwidth similar to 256-bit GDDR5-5500) can support 1600x900 framebuffers without delta color compression.

FYI, the 6900 XT (NAVI 21)'s 128 MB L3 cache's ~1 TB/s bandwidth is similar to the R9-290X/R9-390X's 1 MB L2 cache bandwidth at 1 GHz.

NAVI 21's 128 MB L3 cache coupled with delta color compression can easily handle 4K frame buffers. NAVI 21's 128 MB design was deliberately a 4X scale-up from XBO's 32 MB eSRAM.

The main difference between NAVI 21's 128 MB L3 cache and XBO's 32 MB eSRAM is that NAVI 21's hardware cache behavior works transparently.
 
Last edited:

rnlval

Member
Absolutely, yea. Higher clocks also mean cache latency should be lower, so that's another advantage.
Within the chip, higher clock speed yields lower access-time latency (measured in nanoseconds), while increasing the clock-cycle latency to external memory.

A larger cache reduces how often the GPU has to go out to external memory.

PS5 GPU has a higher clock speed with a narrower on-chip I/O.
XSX GPU has a medium clock speed with a wider on-chip I/O and higher cache storage.

Within the chip, clock cycle latency would be similar regardless of core clock speed.
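To make the clock-speed/latency trade-off concrete, here is an illustrative sketch. The 100 ns external-memory latency and 20-cycle cache-hit latency are assumed round numbers, not measured figures for either console:

```python
# Illustrative only: assumed latencies, not measured console figures.
DRAM_LATENCY_NS = 100       # assumed round-trip latency to external memory
CACHE_HIT_CYCLES = 20       # assumed fixed on-chip cache hit latency in cycles

for name, clk_ghz in [("2.23 GHz (PS5-like clock)", 2.23),
                      ("1.825 GHz (XSX-like clock)", 1.825)]:
    dram_cycles = DRAM_LATENCY_NS * clk_ghz   # cycles burned waiting on DRAM
    cache_ns = CACHE_HIT_CYCLES / clk_ghz     # wall-clock time of an on-chip hit
    print(f"{name}: DRAM access ~{dram_cycles:.0f} cycles, cache hit ~{cache_ns:.1f} ns")
```

The higher clock makes on-chip hits faster in wall-clock terms, but every trip to DRAM costs more cycles, which is the point being made above.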
 

rnlval

Member
Ok, so on the 5700 XT, you have 16 L2$ tiles. Each tile can move data at 64 bytes per clock.
So 16 x 64B x 1.905 GHz = 1.95 TB/s for L2$

L0$ (L1$) read speed is 128 bytes per clock per CU.
So 16 x 128B x 1.905 GHz = 3.90 TB/s for L1$
40 (CUs) x 128B x 1.905 GHz = 9.75 TB/s for L0$

Add all three cache numbers and you get 15.6 TB/s total cache BW. Divide this number by 9.75 TF and you get exactly 1.6 TB/s BW per FLOP as listed in the whitepaper for the 5700 XT. Now you can do the same calc for PS5 and XSX. Just keep in mind that PS5 has the same 16 L2$ tiles as the 5700 XT, but XSX has 20 L2$ tiles. So 20 x 64B x 1.825 GHz = 2.34 TB/s for XSX's L2$.
                              Radeon 5700 XT   PS5     XSX
Frequency (GHz)               1.905            2.230   1.825
FP32 Performance (TFLOP/s)    9.75             10.28   12.15
L0 Bandwidth (TB/s)           9.76             10.28   12.15
L1 Bandwidth (TB/s)           3.90             4.57    4.67 → 3.74
L2 Bandwidth (TB/s)           1.95             2.28    2.34
Total Cache Bandwidth (TB/s)  15.61            17.13   19.16 → 18.23
Total Cache Bandwidth/FLOP    1.6              1.67    1.58 → 1.50
You have forgotten the Local Data Store (LDS). Hint: CELL SPU style local memory.

FYI, a GPU's primary data storage is its register file, which is why register files are larger than the L0/L1 caches.

128 KB register file per WGP

36 CUs (18 WGPs) have 2,304 KB of register file running at clock speed.

52 CUs (26 WGPs) have 3,328 KB of register file running at clock speed.

The register file is the fastest known storage tech.
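Checking that arithmetic with the post's own 128 KB-per-WGP figure (a WGP being two CUs); the helper below is just for illustration:

```python
# Total register file from CU count, using the 128 KB-per-WGP figure in the post above.

def total_register_file_kb(cus: int, kb_per_wgp: int = 128) -> int:
    wgps = cus // 2           # one WGP = two CUs
    return wgps * kb_per_wgp

print(total_register_file_kb(36))  # 18 WGPs -> 2304 KB
print(total_register_file_kb(52))  # 26 WGPs -> 3328 KB
```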
 
Last edited: