
Xbox Velocity Architecture - 100 GB is instantly accessible by the developer through a custom hardware decompression block

oldergamer

Member
Don't tell me someone who had no idea what they were talking about suggested in this thread that "Perhaps Unreal 5 won't work with the hardware optimizations on XSX" and now a few people are entertaining the idea?

Fact is, it's not true, and there is nothing to suggest that Nanite won't benefit from the same hardware as other 3D engines.
 
Last edited:

J_Gamer.exe

Member
They're probably referring to the raw speed, so 2.4 GB/s x 2.5 = 6 GB/s, which is the same figure they stated earlier for the compressed data transfer rate.



I think the area where XvA beats PS5's SSD I/O is latency. There are a lot of things pointing to it, and I've been talking about it the past few days, plus getting insight into it from people on other forums like B3D (I'm just a lurker tho x3).

In terms of raw bandwidth Sony's solution is still faster, I don't think that can be denied. But if MS has been prioritizing extremely low latency the entire time (as it seems they have), that actually brings massive advantages in frame-to-frame asset streaming, prefetching etc. Basically taking what the DiRT 5 developer was speaking of and validating it.



Here's the problem. You're still looking at it apples-to-apples. Their approaches actually ARE quite different and while Sony's prioritized raw bandwidth, MS has prioritized extremely low latency. So a lot of the paper specs, in practice, can end up being cancelled out.

Which is why comparing things through the paper specs alone was never a good idea. But hey, people were doing the same thing with the GPUs and the TFs all up until March so it is what it is :LOL:

Just where do you get this from? What's pointing to it?

Did you watch Road to PS5 and see the part about the 100-times-faster SSD translating into 100-times-faster usable speed?

PS5 has looked to eliminate all bottlenecks, and latency is a bottleneck.

It has everything right there in the I/O complex, with things as close together as possible for as little latency as possible.

I'd love to hear how the Xbox, despite processing its I/O on the CPU, will have less latency... it seems to go against physics...
 

stitch1

Member
I have no idea how any of this really works. Having said that, Microsoft has done some pretty impressive things in the past, like bringing 360 games to XBO while upscaling them and upgrading their textures.

If they can do all that, then I trust them when they say they have come up with new ways of compressing and transferring data.
 
Last edited:

GODbody

Member
Just where do you get this from? What's pointing to it?

Did you watch Road to PS5 and see the part about the 100-times-faster SSD translating into 100-times-faster usable speed?

PS5 has looked to eliminate all bottlenecks, and latency is a bottleneck.

It has everything right there in the I/O complex, with things as close together as possible for as little latency as possible.

I'd love to hear how the Xbox, despite processing its I/O on the CPU, will have less latency... it seems to go against physics...

That 100x is in reference to the I/O performance of the base PS4.

SSD bandwidth and latency are not the same thing. Bandwidth is how much data can be sent (measured in GB/s), while latency is how long it takes for the SSD to respond to a request and begin the data transfer (measured in milliseconds or microseconds, depending on how optimized this is on the next-gen consoles). Sony has made no real mention of latency; they have stated that they designed their system to eliminate bottlenecks for the SSD and focused instead on increasing bandwidth.
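
To make the distinction concrete, here's a quick back-of-the-envelope sketch (the latency and bandwidth figures are made up purely for illustration, not either console's actual numbers):

```python
# Illustrative only: latency and bandwidth are separate contributors to how
# long a single read takes. None of these numbers are real console specs.

def request_time_us(size_bytes, latency_us, bandwidth_gb_s):
    """Total time for one read: fixed latency + size / bandwidth."""
    transfer_us = size_bytes / (bandwidth_gb_s * 1e9) * 1e6
    return latency_us + transfer_us

# A hypothetical 64 KB texture-tile read on two made-up drives:
print(request_time_us(64 * 1024, latency_us=100, bandwidth_gb_s=5.5))  # ~111.9 us
print(request_time_us(64 * 1024, latency_us=20,  bandwidth_gb_s=2.4))  # ~47.3 us
```

For small, frequent reads the fixed latency term dominates; for long sequential reads the bandwidth term dominates, which is why the two can be optimized somewhat independently.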

We have not been given data on the latency of either the PS5's or the Series X's SSD.

What we could see (I'm speculating) is that the I/O complex actually incurs some latency, since adding more hardware to a system tends to increase latency: more parts need to communicate, and data needs to make more "stops" before its final destination. IIRC there's on-chip RAM, two I/O co-processors, a memory controller, and a coherency processor to clean things up (since two processors are writing to addresses), all part of the PS5's I/O complex, and the data needs to be decompressed as well.

Both companies have stated that their I/O would have required two Zen 2 cores to process. The difference lies in their approach to tackling this issue. While the PS5 went the hardware route and created a space on their APU dedicated to I/O, the Series X has gone with a software solution to reduce this down to 1/10th of a core.

On the Series X it's likely that data will be requested by the CPU and loaded straight into the decompression block, then either sent to memory or the GPU. Due to having only one processor for their I/O, they don't need a coherency engine, on-chip RAM, or the extra controller. On the PS5 it seems that data will be requested by the controller of the I/O complex and sent to the on-chip RAM while it waits for one of the two I/O co-processors to assign an address for it to occupy in memory; it is then decompressed and sent to that address in system memory, where it is used by the GPU or CPU.

We'll have to wait and see what the final official results are but it's likely Sony has sacrificed some latency for bandwidth and throughput.
 
Last edited:
Yeah, the consoles are much more differentiated from each other in the coming generation. It's really hard not to just compare paper specs, because that's what the current generation was built on, with the Xbox One and PS4 being much less customized. In the coming gen, with the addition of the SSDs, asset streaming has become paramount, and while Sony has been boasting about their SSD bandwidth, it seems that Microsoft has not only bridged the gap through software but put further distance between the Series X and the PS5.

The biggest game changer of the new consoles seems to me to be Sampler Feedback Streaming, as it delivers a 2-3x multiplier on I/O bandwidth and memory. I've seen it reiterated many times by Microsoft and they seem pretty confident in that statement. That means the Series X can effectively transfer and store game assets that would have taken 20-30 GB of space in 10 GB of RAM, with a transfer rate of 4.8-7.2 GB/s raw-equivalent and 9.6-14.4 GB/s compressed-equivalent. (Please note that it's not going to literally run at these speeds and sizes; this is how much equivalent data a system without Sampler Feedback Streaming would have to move.)
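
For what it's worth, the arithmetic behind those ranges is just the stated 2.4/4.8 GB/s figures multiplied by the claimed 2-3x factor ("effective" meaning equivalent work done, not bytes actually read off the drive):

```python
# Microsoft's stated SSD figures and their claimed SFS multiplier range.
RAW_GB_S, COMPRESSED_GB_S, MEMORY_GB = 2.4, 4.8, 10

for m in (2.0, 2.5, 3.0):
    print(f"SFS x{m}: {RAW_GB_S * m:.1f} GB/s raw-equivalent, "
          f"{COMPRESSED_GB_S * m:.1f} GB/s compressed-equivalent, "
          f"{MEMORY_GB * m:.0f} GB effective memory")
# x2.0:  4.8 / 9.6  GB/s, 20 GB
# x2.5:  6.0 / 12.0 GB/s, 25 GB
# x3.0:  7.2 / 14.4 GB/s, 30 GB
```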

Combining that with the ultra-low latency and the speculated memory-paging, SSG-style storage controller, the ceiling for what to expect from the Series X becomes much higher.

Exactly. At first even I was thinking the two approaches to I/O would be apples-to-apples but seeing the divergence highlights where their priorities are at. And, they're both equally capable for what they seek to achieve in cutting out bottlenecks through the data pipeline.

The frustrating part is that people are still generally comparing the two as if they're trying to do the exact same thing, and just going with the paper specs. It's even led to some people thinking MS just slapped a SSD in the system and called it a day, which is a pretty ludicrous idea to have.

The way I see the solutions so far: Sony's still favors raw bandwidth peaks, obviously. Their solution to the I/O bottleneck problem seems focused on being indiscriminate about data types, just moving any type of data from storage to RAM, and replacing it, as quickly as today's technological limits in the NAND market allow. I still think their absolute peak (22 GB/s) is for very specific types of files, video for example (which you can compress at very high ratios with no discernible quality loss), but it's still a really commendable peak. Their solution is also a lot more focused on a single platform, which means it'll be a lot less scalable going into the future.

MS's solution doesn't have Sony's bandwidth figures, but they've seemingly done a lot of very specific research on actual texture usage and have built XvA around that, including SFS. Their solution seems focused more on cutting down latency, which helps with latency-critical tasks. Rather than trying to move as much data in and out of the system as quickly as possible, MS wants something focused on determining what specific data is actually needed, and having the low latency to pull that data in within as small a prefetch window as possible. They've even gone as far as to implement custom hardware in the GPU for this. And their approach is focused much more on scalability across a range of SSD configurations that meet some minimum specification, given that they'll be implementing XvA (if not all of it then at least most parts of it) in the PC, server, mobile etc. spaces.

So both approaches are resolving virtually all of the current I/O bottlenecks, but with different main focuses and different approaches that respect/play to the strengths of the lineage of their respective companies. You can compare them on the paper specs of course if you'd like, but that's missing the big picture and ignoring the reality there are always multiple methods to solving the same problem.

Came across this from someone on Era (don't kill me xD) who linked what they feel might be the type of implementation MS has with XvA: Storage Performance Development Kit

This is just something that poster figured could be an implementation MS is taking, but it'd make sense that they are using at least part of it as inspiration. It's actually really interesting to consider that both MS and Sony are taking inspiration from segments of the server/data center/business markets for addressing I/O throughput; you can see echoes of Data Processing Units (DPUs) in both Sony's and MS's I/O blocks (though Sony's seems to be contained wholesale in a single silicon block, which more closely matches a DPU visually; granted, I'm sure there are many DPU setups that aren't single chips in their implementation either).

Just where do you get this from? What's pointing to it?

Did you watch Road to PS5 and see the part about the 100-times-faster SSD translating into 100-times-faster usable speed?

PS5 has looked to eliminate all bottlenecks, and latency is a bottleneck.

It has everything right there in the I/O complex, with things as close together as possible for as little latency as possible.

I'd love to hear how the Xbox, despite processing its I/O on the CPU, will have less latency... it seems to go against physics...

I never said Sony hasn't addressed latency. But it's very feasible MS has focused very specifically on latency and therefore could have the edge there. Latency and bandwidth are not one and the same.

One thing that might hurt Sony on latency is that they're using slower NAND modules in the first place, while MS is using faster ones (measured in MT/s). Usually the faster NAND modules also have better read-access latency on the first 4KB (or so) of data. Other factors influence latency too, of course.

So PS5 can certainly improve dramatically on I/O bottlenecks in terms of bandwidth and latency, yet still have higher latency than MS's solution. Also, FWIW, Series X has custom hardware for a lot of the I/O stack; it's 1/10th of a core (very likely the OS core) that handles management of the I/O stack.

If you'd like a bit more insight from my POV just read the other part of this post right above yours.

I think there's a huge misconception about how the Xbox Velocity Architecture works, especially that "multiplier" part, so to make things simple, let's use an image:

[Image: a solved 3x3 Rubik's cube]


In this picture we can only see three sides of the Rubik's cube, BUT the system still loads and uses textures for all six of them, wasting memory space and bandwidth. What MS is doing with XvA, specifically the SFS component, is making the system load and use only the textures for those three visible sides, and by doing so they effectively cut the data size in half, which also means half the required bandwidth. Talking about a whole scene: instead of 6GB, the same scene might now use just 2-3GB, so instead of 2.4GB/s they can achieve the exact same on-screen result using only around 0.8GB/s of the SSD. That creates headroom to add 3-4GB worth of additional objects/textures within that previous 6GB footprint at the same 2.4GB/s bandwidth, which otherwise would have needed 18GB of RAM and 7.2GB/s of bandwidth. Long story short, they can achieve the same result with a half to a third of the resources.
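
Putting rough numbers on the cube example (the sizes are purely illustrative):

```python
# Hypothetical scene: only a fraction of the texture data is actually visible,
# so only that fraction has to sit in RAM or come over the SSD bus.
scene_texture_data_gb = 6.0      # what a naive loader would pull in
visible_fraction      = 1 / 3    # roughly the "1/3 of texture data" MS cites
raw_ssd_gb_s          = 2.4

needed_ram_gb = scene_texture_data_gb * visible_fraction   # ~2.0 GB resident
needed_bw     = raw_ssd_gb_s * visible_fraction            # ~0.8 GB/s used
freed_gb      = scene_texture_data_gb - needed_ram_gb      # ~4.0 GB headroom

print(f"RAM needed: {needed_ram_gb:.1f} GB, bandwidth needed: {needed_bw:.1f} GB/s, "
      f"headroom freed: {freed_gb:.1f} GB")
```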

We can probably even say that XvA might be a storage/I/O equivalent of Tile-Based Deferred Rendering. Here's the relevant portion from Wikipedia:

Tiled rendering is the process of subdividing a computer graphics image by a regular grid in optical space and rendering each section of the grid, or tile, separately. The advantage to this design is that the amount of memory and bandwidth is reduced compared to immediate mode rendering systems that draw the entire frame at once. This has made tile rendering systems particularly common for low-power handheld device use. Tiled rendering is sometimes known as a "sort middle" architecture, because it performs the sorting of the geometry in the middle of the graphics pipeline instead of near the end.[1]

I'm not saying XvA is literally implementing TBDR in the SSD I/O, just that it shares some of the same inspirations. The main one being spending processing power only on what is actually seen, cutting out unnecessary work by handling geometry earlier in the pipeline.

So what does that sound a lot like? Well, it sounds a lot like XvA and things like SFS. Working with only very specific textures, or just specific portions of textures. Focusing on streaming in the immediate (or near-immediate) texture data or portion of the texture data, etc. Knowing what to pull and when, just in time, saving on required bandwidth along the way and benefiting even lower-end SSD drives (in terms of pure specs like GB/s we can say MS's SSD is lower-end than Sony's), and so on.

A really interesting approach, all said. Looking forward to seeing it in practice.
 
Last edited:

Ascend

Member
It's not unknown. We have 2.4 GB/s raw and 4.8 GB/s compressed. On top of that, you have this multiplier. So in practice, you would be getting the equivalent of 12GB/s of raw throughput. I made a post about this quite a while back.

Edit: After reading through the link on MS's website, it seems that the multiplier applies above the raw throughput, not the compressed throughput. So the 12GB/s is incorrect, and it is indeed 2.4 GB/s * 2.5, which gives you 6GB/s. Still fine. I guess I over-speculated about a few things back then ^_^

Edit2: Hm... I'm doubtful again.

This innovation results in approximately 2.5x the effective I/O throughput and memory usage above and beyond the raw hardware capabilities on average.

Must we interpret that as 2.5x above the 2.4GB/s, or 2.5x above the 4.8GB/s? Because the compression is also hardware, so that would mean it should be 2.5x above the 4.8GB/s. The word 'raw' can be interpreted to mean all hardware, or specifically the raw throughput of the I/O.
I feel stupid for doubting myself in this post. Of course they stack. I had to refresh all the details in my head on how all this works. I have been out of the loop for a few weeks... So the 12GB/s effective is most likely the more accurate number.

Now let me clarify. The XSX is never going to send more than 2.4GB/s of data through its I/O. But as I explained in the other thread using simple numbers...
Compression allows you to have two textures for the size of one, so to speak (for simplicity's sake), or 20 for the size of 10. That's where the XSX's 4.8GB/s comes from after compression. On top of that, with SFS the XSX discards the loading of texture data that will not be used. It means that out of those 20 you were going to load, you're using and thus loading only 8, which leaves room for another 12. So you're effectively getting 50 textures for the price of 10, so to speak.

I will now take the freedom to quote a post I made a LONG while back... Trust me, it's worth reading.
Regarding sampler feedback streaming... I'm not sure people get what it actually does... So I'm going to try and explain things step by step...

First, the transfer value given for the I/O slash SSD is basically a bandwidth value. The 2.4 GB/s raw value means that at most, 2.4 GB of data can be transferred per second.
The compressed value does not magically increase the 2.4 GB/s. What it does is, compress the files to make them smaller. The max amount transferred is still going to be 2.4GB in a second. But when you decompress it again on the 'other side', the equivalent size of the data would have been 4.8GB if you could have transferred it as raw data. So effectively, it's 4.8GB/s, but in practice, 2.4GB/s is being transferred.

Then we get to SFS. First, take a look at what MS themselves say on it;

Sampler Feedback Streaming (SFS) – A component of the Xbox Velocity Architecture, SFS is a feature of the Xbox Series X hardware that allows games to load into memory, with fine granularity, only the portions of textures that the GPU needs for a scene, as it needs it. This enables far better memory utilization for textures, which is important given that every 4K texture consumes 8MB of memory. Because it avoids the wastage of loading into memory the portions of textures that are never needed, it is an effective 2x or 3x (or higher) multiplier on both amount of physical memory and SSD performance.

That last sentence is important: it is an effective 2x or 3x (or higher) multiplier on both the amount of physical memory and SSD performance. Now what does that mean? If you want to stream parts of textures, you inevitably need tiling. What is tiling? You divide the whole texture into equally sized tiles. Instead of having to load the entire texture, which is large, you load only the tiles that you need from that texture. You then don't have to spend time discarding the parts of the texture you don't need after spending resources loading them. It basically increases transfer efficiency. Tiled resources is a hardware feature present since the first GCN, but there are different tiers to it, the latest being Tier 4, which no current GPU on the market supports. It is possible that the XSX is the first one to have this, but don't quote me on that. It might simply still be Tier 3.

In any case, when tiling, the size of the tiles determines how efficient you can be. The smaller the tiles, the more precisely you can load, and the less bandwidth you need. Theoretically you could be bit-precise, so to speak, but that's unrealistic and would require an unrealistic amount of processing power. There is an optimum, but we don't have enough information to determine where that point is on the XSX. Microsoft is claiming that with SFS the effective multiplier can be more than 3x. This means that, after compression (everything on the SSD will inevitably be compressed), you can achieve higher than 3x 4.8GB/s in effective streaming. To put it another way, effectively the XSX is capable of transferring 14.4 GB/s of data from the SSD. This does not mean that 14.4GB/s is actually being transferred; just like with compression, the amount of transferred data is still at most 2.4GB/s. What it does mean is that if you compare the compressed, tiled transfer to loading the full raw uncompressed texture, you would need more than 14.4GB/s of bandwidth to achieve the same result. This also helps RAM use, obviously, because everything loaded from the SSD goes into RAM, and you would otherwise be occupying RAM space that you don't have. Basically, it decreases the load on everything: the already-mentioned RAM, plus the I/O, CPU and GPU.
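
A toy sketch of that tile-size trade-off (texture size, sample count and tile sizes are all arbitrary here): smaller tiles waste fewer texels, but they create many more tiles to track and request, which is roughly where the optimum mentioned above comes from.

```python
# Illustrative only: scatter some "sampled" texels over an 8192x8192 texture
# and see how much data each tile size would force you to stream.
import random

random.seed(42)
TEX_EDGE, BYTES_PER_TEXEL = 8192, 4
samples = [(random.randrange(TEX_EDGE), random.randrange(TEX_EDGE))
           for _ in range(2000)]            # 2,000 scattered sample locations

for tile_edge in (4096, 1024, 256, 64):
    touched = {(x // tile_edge, y // tile_edge) for x, y in samples}
    loaded_mib = len(touched) * tile_edge * tile_edge * BYTES_PER_TEXEL / 2**20
    print(f"{tile_edge:>4}px tiles: {len(touched):>5} tiles to track, "
          f"{loaded_mib:7.1f} MiB streamed (full texture: 256 MiB)")
```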


Lastly, we have confirmation from James Stanard that they stack, so it works as I described above a long while back... The main change is that rather than 2x -3x, they are saying 2.5x.

 
I think a lot of people are conflating texture streaming with SFS. The thing is that the technique of selectively loading from disk only the textures (or even portions of textures in the case of something like mega textures) that you need already exists and is fairly widespread in various forms.

What Sampler Feedback, the most significant part of SFS, does is provide a very precise accounting of which tile of a mip level (a subset of a subset of a texture) is needed exactly when it is needed. Previously you could do some work to get that information, which takes time and resources, or you just used a wider brush and loaded much more texture data than you actually needed.

Guru3D described Sampler Feedback like this:

Sampler feedback solves this by allowing a shader to efficiently query what part of a texture would have been needed to satisfy a sampling request, without actually carrying out the sample operation. This information can then be fed back into the game’s asset streaming system, allowing it to make more intelligent, precise decisions about what data to stream in next. In conjunction with the D3D12 tiled resources feature, this allows games to render larger, more detailed textures while using less video memory.

So Sampler Feedback lets you be very granular and precise in loading textures. You can think of this as a much more risky and aggressive approach because you’re waiting until the instant you need a tile before you load it.

Because this will inevitably lead to pop-in as you likely can’t get this from disk in a single frame all the time, SFS also includes a custom filter to blend between tiles and obscure pop-in.

That’s what we know about it in a nutshell. There might be more going on to enable this aggressive approach to texture streaming but they haven’t revealed that yet unless I’ve missed it.
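
For a rough mental model of that loop, here's a toy software emulation; this is not actual D3D12/SFS code, just an illustration of "record what was wanted, render a coarser fallback, stream the missing tile for later frames":

```python
# Toy emulation of the feedback/fallback loop described above. Tile/mip
# numbering and the residency bookkeeping are invented for illustration.
resident = {(5, (0, 0))}     # only a coarse mip tile is resident at first
feedback = set()             # the "feedback map": what the shader wanted

def parent_tile(tile, levels_up):
    """Tile of a coarser mip that covers the given finer-mip tile."""
    return (tile[0] >> levels_up, tile[1] >> levels_up)

def sample(mip, tile):
    """Record the request, then fall back to the best tile already resident."""
    feedback.add((mip, tile))
    for coarser in range(mip, 6):                       # walk toward mip 5
        if (coarser, parent_tile(tile, coarser - mip)) in resident:
            return coarser                              # render with this mip
    return None                                         # nothing resident at all

print("frame N:  rendered with mip", sample(1, (3, 2)))   # 5: blurry fallback, no stall
resident |= feedback                                       # streamer services requests
feedback.clear()
print("frame N+1: rendered with mip", sample(1, (3, 2)))  # 1: sharp tile now resident
```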
 
OptimistPrime I think I see where you're coming from. It is risky and aggressive, that's for sure. It will involve making sure that however small they want to shrink the prefetch window, it still needs to span enough frames to account for the limitations of the underlying technology, such as the NAND.

I'm starting to think some aspect of ML might be leveraged to enforce this at the hardware level. Preferably not general ML, but an equivalent consolidated into some dedicated part of the hardware. We already know MS has talked about using ML to upscale textures; perhaps they're also employing some form of this for the mipmaps?

That way the devs may not even need to stream the higher-level mip off the drive itself, but could just scale up the lower-quality mip already in the chain. This being a part of the implementation, but maybe not all of it?
 
I love how most people are referring to SFS as if it's boosting the XSX's overall I/O performance by 2.5x.
It is only doing that for textures.
Now, does anyone here have a reliable source on how much of a game's overall bandwidth is used for textures in any given timeframe?
Are games using 10% of bandwidth for textures? 50%? 90%?
Because SFS is only decreasing THAT part compared to the last-gen Xbox at best. (MS wasn't really clear what they compared SFS to.)
 
Last edited:

Deleted member 775630

Unconfirmed Member
Because that is precisely the claim made by Microsoft. Yes, SFS is about textures, but they specifically refer to the “2.5X” improvement with respect to overall bandwidth and memory.
Indeed, because if it were only textures it would've been 3x, since they said that in the old system the GPU only uses 1/3 of the streamed textures.
 

Allandor

Member
I love how most people are referring to SFS as if it's boosting the XSX's overall I/O performance by 2.5x.
It is only doing that for textures.
Now, does anyone here have a reliable source on how much of a game's overall bandwidth is used for textures in any given timeframe?
Are games using 10% of bandwidth for textures? 50%? 90%?
Because SFS is only decreasing THAT part compared to the last-gen Xbox at best. (MS wasn't really clear what they compared SFS to.)
Most of the things loaded from storage are textures. Geometry, "code" and sound are really, really small compared to textures.
You just forget this "multiplier" comes from comparing old tech vs new tech. With the old tech you had to load everything in packets, and with those you loaded many things multiple times into memory, wasting bandwidth and resources. But it was the only way to get everything into memory, because the bandwidth and latency were not great at all.
With much faster SSD storage, you can load data much more selectively, which multiplies the usable bandwidth and memory compared with the old paradigm.
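
A toy comparison of the two paradigms (asset names and sizes are invented purely to illustrate the duplication point):

```python
# Old way: each level chunk packs its own copy of everything it touches so a
# slow HDD can read it sequentially -> the same assets get duplicated.
ASSET_MB = {"rock_tex": 32, "tree_tex": 48, "ground_tex": 64, "prop_tex": 16}
chunks = [
    ["rock_tex", "tree_tex", "ground_tex"],
    ["rock_tex", "ground_tex", "prop_tex"],
    ["tree_tex", "ground_tex", "prop_tex"],
]
packed_mb = sum(ASSET_MB[a] for chunk in chunks for a in chunk)

# New way: random access is cheap, so each unique asset is read once, on demand.
streamed_mb = sum(ASSET_MB[a] for a in {a for chunk in chunks for a in chunk})

print(f"packed chunks: {packed_mb} MB read, on-demand streaming: {streamed_mb} MB read")
# 384 MB vs 160 MB for this made-up level
```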
 

J_Gamer.exe

Member
I never said Sony hasn't addressed latency. But it's very feasible MS has focused very specifically on latency and therefore could have the edge there. Latency and bandwidth are not one and the same.

One thing that might hurt Sony on latency is that they're using slower NAND modules in the first place, while MS is using faster ones (measured in MT/s). Usually the faster NAND modules also have better read-access latency on the first 4KB (or so) of data. Other factors influence latency too, of course.

So PS5 can certainly improve dramatically on I/O bottlenecks in terms of bandwidth and latency, yet still have higher latency than MS's solution. Also, FWIW, Series X has custom hardware for a lot of the I/O stack; it's 1/10th of a core (very likely the OS core) that handles management of the I/O stack.

If you'd like a bit more insight from my POV just read the other part of this post right above yours.

I know latency and bandwidth aren't the same; however, latency will slow down the effective bandwidth, so for Sony to claim their 100-times-faster SSD can deliver 100-times-faster I/O in the actual game means they have eliminated the bottlenecks and reduced latency as much as possible.

See this quote from Sweeney…

'I think, first of all, Sony has a massive, massive increase in graphics performance compared to previous generations. But you know, I guess we get that every generation?” Sweeney joked. “But Sony’s made another breakthrough that in many ways is more fundamental, which is a multi-order magnitude increase in storage bandwidth and reduction in latency.'

We have also had leaks and info from insiders like Matt on Era saying it does everything faster and is on another level, so I very much doubt Sony hasn't got latency down to ridiculously low levels too.

The whole console for Sony has seemingly been designed around speed.

Also, it's about total latency in the whole pipeline; the reason for PS5's I/O unit is max bandwidth and minimum latency, i.e. no bottlenecks. Doing I/O on a CPU core is far more likely to introduce latency and be far less efficient than Sony's method. Even things like SFS having custom hardware in the GPU, that's great, but how much latency does that introduce?

The Xbox doesn't seem to have put things as close together as the PS5 seems to have.
 

Bernkastel

Ask me about my fanboy energy!
Xbox Series X also enables you to spend less time waiting and more time playing, as it virtually eliminates load times with the 40x boost in I/O throughput from last generation. With our custom next-generation SSD and Xbox Velocity Architecture, nearly every aspect of playing games is improved. Game worlds are larger, more dynamic and load in a flash, and fast travel is just that – fast. The Xbox Velocity Architecture also powers new platform capabilities like Quick Resume, which enables you to seamlessly switch between multiple titles and resume instantly from where you last left off without waiting through long loading screens. Right now, Xbox Series X is in the hands of our 15 Xbox Game Studios teams and thousands of third-party developers, empowering them to create a new generation of blockbuster games for you to enjoy.
 
That was some detailed chain of posts xD, but it basically reaffirms that MS focused on highly optimized and efficient (if proprietary) approaches to handling the I/O stack that are scalable and deployable across a range of devices. The way they describe SFS in particular there is extremely neat, and not a description I've seen in that much detail elsewhere, FWIW.

Also, re-reading some of those parts again, it just reaffirms the apples-to-oranges approach MS and Sony have taken to resolving I/O. Sony, IMHO, seems to have focused again on maximizing bandwidth, pushing a clean run of traditional methods with a configured hardware approach. MS has prioritized latency (to do what Gavin's stating you need very low latency) and gone with a more mixed hardware/software approach that's doing something quite different from what's been attempted before in this space, so it's quite innovative in that aspect (not saying Sony's approach isn't innovative either, especially if you consider how their approach mimics DPUs (Data Processing Units)).

They mentioned meshes very briefly; I don't know a ton about meshes, but it's neat to see SFS seems to be designed with those in mind too. IIRC Series X supports groups of 256 meshes, while Nvidia GPUs support groups of only 128. I'm curious what the doubling of mesh group size enables.
 
Last edited:
What Gavin Stevens is saying is very misleading about the PS5, in that it implies the PS5 will use an 8192-size texture for an object that's only 1cm big on your screen, when any reasonable developer would use some kind of lower-resolution mip-map.

He's a developer himself, so he should know better.

He's basically cherry-picking a particular scenario, applying it to the PS5, and saying it's deficient.

The PS5 SSD is in no way preventing the use of mip-maps, and having them stream in if needed.

Epic's Unreal Engine 5.0 has a feature to not use LOD's (level of detail based on distance), but then again, they are doing a ton of culling before things are rendered out.

I don't think they're being misleading, but it does remain to be seen how efficient Sony's work on the Geometry Engine (which is where the culling occurs) is.

Keep in mind that is a customization, and the Geometry Engine was already present on RDNA1 GPUs. Unless everything in it has been completely superseded by something else, I'd expect it carries over to RDNA2, so the Series platforms will have their own Geometry Engine.

Whether it's been customized or not, and if so by how much/in what ways, is unknown at this time however.

Also, yes, you're right to the extent that it'd make no sense to use a large detailed texture as a mipmap for something far away from the player, but the emphasis that dev is placing is on the seamless blending and transition of mip levels and the ability of SFS to load only those very specific portions into memory, rather than loading the larger texture and generating the lower-quality mipmaps from that. Or, maybe a better way to put it: having the ability to sample the native texture and generate the appropriate mipmap from it when needed, then placing that mipmap into memory. This is probably being done through the GPU, and would support some other ideas around that (such as the mention of ARM cores in the APU design from an AMD engineering team member in India probably being for extending ExecuteIndirect capabilities in the GPU for tasks such as this).

There's no telling if there's an equivalent on PS5, and they're not speculating on whether there is (though from their words it seems like it isn't doing such a thing). They're just speaking about SFS and what they know of it.
 
Last edited:

Kerlurk

Banned
 
Last edited:
When you are traditionally filtering a texture, you grab the full texture, create a sampling kernel of some size and algorithm, and run the kernel over the image to create a filtered image.
A high quality / filtered texture spread over a flat plane would look very weird at distance, so you may move to anisotropic filtering, and include sampling the various mipmaps to improve rendering at a distance.

With PRT, your textures are split across pages, and of course, mips. Just sampling them will mean you get weird seams when you join two pages with two parts of the same texture together, so you want those other pages too when you sample, and also their mips!

This is a lot of data to track in software, and is a pain in the ass speed-wise. Hardware that stores these locations and improves data locality, as well as doing the filtering, makes this a better experience for the dev.

The issue with this feature is that it has existed in hardware for the last 11 years.
Also, re-reading some of those parts again, it just reaffirms the apples-to-oranges approach MS and Sony have taken to resolving I/O. Sony, IMHO, seems to have focused again on maximizing bandwidth, pushing a clean run of traditional methods with a configured hardware approach. MS has prioritized latency (to do what Gavin's stating you need very low latency)

Sony has a very low latency pipeline.

The feature described above is what is called a feedback buffer. It has existed for a long time and ALL platforms can implement it (as it is usually implemented at engine level).

SFS seems to implement something at hardware level to manage this. It has no bearing on latency really, as feedback buffers track the current state, not the future state (that's the call to get the "missing" mips).
 
Sony has a very low latency pipeline.

The feature described above is what is called a feedback buffer. It has existed for a long time and ALL platforms can implement it (as it is usually implemented at engine level).

SFS seems to implement something at hardware level to manage this. It has no bearing on latency really, as feedback buffers track the current state, not the future state (that's the call to get the "missing" mips).

I know that, and I haven't said Sony doesn't have a focus on low latency. However, one can still have a low-latency target and end up with higher latency than the other, for whatever reason. We'll have to see how it bears out. There's one factor of latency I don't think is being considered in Sony's case; I think the following post actually brings it up quite well and it's at least worth considering:

That 100x is in reference to the I/O performance of the base PS4.

SSD bandwidth and latency are not the same thing. Bandwidth is how much data can be sent (measured in GB/s), while latency is how long it takes for the SSD to respond to a request and begin the data transfer (measured in milliseconds or microseconds, depending on how optimized this is on the next-gen consoles). Sony has made no real mention of latency; they have stated that they designed their system to eliminate bottlenecks for the SSD and focused instead on increasing bandwidth.

We have not been given data on the latency of either the PS5's or the Series X's SSD.

What we could see (I'm speculating) is that the I/O complex actually incurs some latency, since adding more hardware to a system tends to increase latency: more parts need to communicate, and data needs to make more "stops" before its final destination. IIRC there's on-chip RAM, two I/O co-processors, a memory controller, and a coherency processor to clean things up (since two processors are writing to addresses), all part of the PS5's I/O complex, and the data needs to be decompressed as well.

Both companies have stated that their I/O would have required two Zen 2 cores to process. The difference lies in their approach to tackling this issue. While the PS5 went the hardware route and created a space on their APU dedicated to I/O, the Series X has gone with a software solution to reduce this down to 1/10th of a core.

On the Series X it's likely that data will be requested by the CPU and loaded straight into the decompression block, then either sent to memory or the GPU. Due to having only one processor for their I/O, they don't need a coherency engine, on-chip RAM, or the extra controller. On the PS5 it seems that data will be requested by the controller of the I/O complex and sent to the on-chip RAM while it waits for one of the two I/O co-processors to assign an address for it to occupy in memory; it is then decompressed and sent to that address in system memory, where it is used by the GPU or CPU.

We'll have to wait and see what the final official results are, but it's likely Sony has sacrificed some latency for bandwidth and throughput.

Bolded the part of emphasis. It will be a question of what effect that has on their setup; FWIW we've already seen what effect a co-processor-heavy setup can have on latency in a much older system: the SEGA Saturn. I'm just using that as a visual comparison, not suggesting Sony's approach has a similar potential latency issue (even if Sony's approach adds latency, it won't be of much significance).

I also agree SFS as a technique can be done in other engines, but doing it on something that doesn't have the hardware customization to explicitly support it would require more resources. How much more, I don't know; I guess it'd come down to the implementation in the engine.

Yeah, I knew bringing up a comparison to speculative-execution branch prediction was not the best idea, considering SFS is doing what it does for the current state, as you've said. It was the closest, quickest thing that came to mind though. AFAIK the dev still has to have an idea of which textures will be needed next and specify that as a condition for SFS, and it's high-risk, so if something misses there's the lower-level mip to fall back on, which can be blended into the higher-level one once it's available. So there is more direct work for the developer in that case, but I'd imagine mastery of this will yield fantastic results.
 
Last edited:

GODbody

Member
What Gavin Stevens is saying is very misleading about the PS5, in that it implies the PS5 will use an 8192-size texture for an object that's only 1 cm big on your screen, when any reasonable developer would use some kind of lower-resolution mip-map.

He's a developer himself, so he should know better.

He's basically cherry-picking a particular scenario, applying it to the PS5, and saying it's deficient.

The PS5 SSD is in no way preventing the use of mip-maps, and having them stream in if needed.

Also, what developer on the PS5 is prevented from implementing a similar technique to see what part of a texture is in view and streaming only that part in? This technology can be applied on an engine-by-engine basis, with all kinds of variations on how it should be done.

Just because both systems have fast SSDs does not mean LODs (including mip-maps) are going away. Developers will continue to optimize their game engines using LODs.

He's stating that even if you have a 256x256 object using 1 cm of screen space, the full 8192 mipmap will still be in memory taking up space.

Example: You have two objects on screen displaying different textures. One of those objects is close to the camera and displays at full texture quality, mip0, the 8192x8192 image of the mipmap. The second object is much further away and does not need the highest quality (mip0) of the texture, so it only uses mip6, the 128x128 image of the mipmap.

What he is stating is that the way the PS5 would handle this situation, due to not having sampler feedback streaming, would be to load the full mipmap for each of the textures, which would take up quite a bit of memory (about 1152MB total: 576MB for the 8192x8192 and 576MB for the 128x128), due to having to load the full mip chain for each.

The Series X, due to having sampler feedback streaming, would only load the portion of the mipmap that is on display into memory (about 288MB total: 288MB for the 8192x8192 and 267KB for the 128x128).

This lines up with the "games typically access about 1/3rd of texture data" statement Microsoft has made, and it's where you see the major savings from the sampler feedback streaming part of the Velocity Architecture.
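
For a sense of scale, here's the raw arithmetic with plain uncompressed RGBA8 (real games use block-compressed formats, so the absolute figures differ from the ones above, but the ratio between "whole mip chain" and "only the mip you actually need" is the point):

```python
BYTES_PER_TEXEL = 4   # plain RGBA8; block compression would shrink everything

def mip_chain_mb(base_edge):
    """Total size of the full mip chain, base level down to 1x1."""
    total, edge = 0, base_edge
    while edge >= 1:
        total += edge * edge * BYTES_PER_TEXEL
        edge //= 2
    return total / 2**20

def single_mip_mb(base_edge, mip):
    edge = base_edge >> mip
    return edge * edge * BYTES_PER_TEXEL / 2**20

print(f"full 8192 mip chain:  {mip_chain_mb(8192):7.1f} MB")      # ~341 MB
print(f"mip 0 alone (8192^2): {single_mip_mb(8192, 0):7.1f} MB")  # 256 MB
print(f"mip 6 alone (128^2):  {single_mip_mb(8192, 6):7.3f} MB")  # ~0.063 MB
```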

I've edited my post and indicated there is nothing stopping developers from adding similar tech to their game engines. This is software, and so others can create it. It could be superior and more flexible than what MS is offering, or of course worse/less flexible.

This feature of the Velocity Architecture he's talking about is streaming in the partial textures that are in view, instead of the whole texture. It has nothing to do with the vertex side of things.

I'm not indicating that MS was being misleading. It's Gavin Stevens who's being misleading, by indicating that developers are forced to design their PS5 game engine in an inefficient way.

It's good tech that MS has developed here, and I'm in no way saying it's bad. It's just not as exclusive as some people might believe.
Microsoft has specific new hardware as part of their GPU to enable this. It's not quite as simple as writing code to enable sampler feedback streaming.
 
Last edited:

Kerlurk

Banned
 
Last edited:

Tripolygon

Banned
He's stating that even if you have a 256x256 object using 1 cm of screen space, the full 8192 mipmap will still be in memory taking up space.

Example: You have two objects on screen displaying different textures. One of those objects is close to the camera and displays at full texture quality, mip0, the 8192x8192 image of the mipmap. The second object is much further away and does not need the highest quality (mip0) of the texture, so it only uses mip6, the 128x128 image of the mipmap.

What he is stating is that the way the PS5 would handle this situation, due to not having sampler feedback streaming, would be to load the full mipmap for each of the textures, which would take up quite a bit of memory (about 1152MB total: 576MB for the 8192x8192 and 576MB for the 128x128), due to having to load the full mip chain for each.

The Series X, due to having sampler feedback streaming, would only load the portion of the mipmap that is on display into memory (about 288MB total: 288MB for the 8192x8192 and 267KB for the 128x128).

This lines up with the "games typically access about 1/3rd of texture data" statement Microsoft has made, and it's where you see the major savings from the sampler feedback streaming part of the Velocity Architecture.


Microsoft has specific new hardware as part of their GPU to enable this. It's not quite as simple as writing code to enable sampler feedback streaming.
Let me introduce you to PRT.
[Image: PRT diagram]
 
Kerlurk The part you're not understanding is the additional resource cost of a fully software-based implementation. Not only that, but similar techniques can themselves have very different implementation approaches even in software: which algorithms they use, how the code is structured and how efficient it is, what features sit alongside it and how those are implemented, etc.

The ability to recreate it in software is always good and shows flexibility, but there are always some advantages to having hardware support for that type of feature as well. It was enough of a priority for MS to have done so, at the very least from what we can see. And we've yet to see what exact hardware components comprise their hardware implementation of SFS, though there's at least a patent around that gives a hint.

Tripolygon PRT and SFS still have divergences in how they are implemented and the feedback loop they create. Also, the hardware implementation of PRT support on PS4 varies greatly from the hardware implementation of SFS on the Series systems, and those will differ from PRT 2.0 hardware support on PS5.

One of the standouts of SFS versus PRT is that SFS offers a fallback mode for lower-level mips to blend seamlessly into the higher-level one once it is available. The way the texture is accessed also seems to differ between the two approaches.

So nothing there in what you quoted actually disproves what GODbody was speaking on behalf of.
 
Last edited:
The PS5 has TWO programmable I/O processors backed up by on-chip SRAM (memory), along with a DMA engine.

We have no idea what the overhead is for this feature on the XSX, but the PS5 has some impressive hardware for implementing different streaming tech that needs little to no help from the CPU.

Yes, two I/O processors (Cache Coherency Engines)...that may or may not add a notable bit of latency, as was pointed out earlier near the top of this page.

Regardless of that, PS5 does have a very impressive setup, but it's not without limitations. Just as one example, if the I/O block is DMA'ing to the memory bus for read/write operations, the CPU/GPU/Tempest etc. still have to wait for bus access until those processors in the I/O block are done with their work.

Everything being contained on an APU helps a lot, but let's not pretend there aren't potential limitations in Sony's setup as well. They exist.
 

Kerlurk

Banned
 
Last edited:
Every system has limitations, but we don't know if those limitations prevent this hardware function that MS has from being done in software on the PS5.

Anyway, the Unreal Engine 5.0 demo showed there are many ways to do things and still have amazing graphics.

This is my last post in this thread. I just see this as MS marketing talk, and at the end of the day you will need a magnifying glass to see the difference between both systems. It's going to make no difference in terms of the sales of each console.

Okay, before you leave, ask this: why are you bringing up sales at the end? Did you think that was the motivating factor in the discussion? Because it wasn't.

No one here is trying to push this as suddenly meaning night-and-day differences between the two systems in the I/O department. It's meant to show that MS's solution does "punch above its weight", even if Sony's solution is still the better one in terms of outright numbers/bandwidth. The discussion has also been meant to show that the approaches are not really apples-to-apples; they took divergent paths from the very beginning, each prioritizing different factors at the core of its solution.

Because of that, people who have been using Sony's setup to try to denigrate or criticize MS's I/O approach, for example, have been in the wrong, because they've only been comparing paper specs without a full picture of the actual setup and technology of both solutions, or of the factors driving those implementations (such as MS's desire for platform-agnostic scalability that stacks on top of compliant hardware).

Hopefully you understand.
 
Last edited:

Tripolygon

Banned
Kerlurk The part you're not understanding is the additional resource cost of a fully software-based implementation. Not only that, but similar techniques can themselves have very different implementation approaches even in software: which algorithms they use, how the code is structured and how efficient it is, what features sit alongside it and how those are implemented, etc.
True, similar techniques can have varying implementation approaches in software and hardware, and there can be additional cost to fully software-based techniques, but not necessarily enough to have a negative impact. For example, let's look at compression used in the current gen. Both PS4 and Xbox One come with zlib decompression hardware, but Kraken can decompress faster in software on Jaguar than the built-in zlib hardware, at little to no cost to overall game performance. Hence a lot of games chose to use Kraken or roll their own software-based decompression.

The ability to recreate it in software is always good and shows flexibility, but there are always some advantages to having hardware support for that type of feature as well. It was enough of a priority for MS to have done so, at the very least from what we can see. And we've yet to see what exact hardware components comprise their hardware implementation of SFS, though there's at least a patent around that gives a hint.
Now apply the same reasoning to PS5. Sony went out of their way to implement a lot of hardware in their system to remove bottlenecks and latency, yet many people would readily say that SFS, DirectStorage and BCPack, i.e. various pieces of software, would close the gap of a quantifiable hardware difference in raw throughput. After all, SFS is a further optimization of PRT.

We know the hardware components comprising their hardware implementation. It is in the patent, and one of the principal software engineers has said it himself.

From the patent
Software-only residency map solutions typically perform two fetches of two different buffers in the shader, namely the residency map and the actual texture map. The primary PRT texture sample is dependent on the results of a residency map sample. These solutions are effective, but require considerable implementation changes to shader and application code, especially to perform filtering the residency map in order to mask unsightly transitions between levels of detail, and may have undesirable performance characteristics. The improvements herein can streamline the concept of a residency map and move the residency map into a hardware implementation
From Microsoft dev


PRT and SFS still have divergences in how they are implemented and the feedback loop they create. Also the hardware implementation of PRT support in PS4 varies greatly from the hardware implementation of SFS on the Series systems and those will differ from PRT 2.0 hardware support on PS5.

PRT is SFS, but with a divergence in what is bolded. This feedback loop is enabled by sampler feedback, which is natively supported by Turing GPUs and RDNA 2, hence PS5 also. PRT 2.0 allows tile residency control.


One of the standouts of SFS versus PRT is that SFS offers a fallback mode for lower-level mips to blend seamlessly into the higher-level one once it is available. The way the texture is accessed also seems to differ between the two approaches.
That is not unique to SFS. All PRT/SFS implementations offer a fallback; that's what various forms of texture filtering can be used for: blending between levels. Xbox Series X has a hardware implementation of said texture filter.

So nothing there in what you quoted actually disproves what GODbody was speaking on behalf of.

Actually, what I quoted proves that nobody is loading entire full-resolution texture pages and mipmaps into memory for every object in view. It disproves the BS in that tweet chain and the subsequent replies in this thread.


What is new is using SF to trigger the page load.
 
Last edited:

GODbody

Member
No developer is forced to do it that way. They could very easily set up a system that streams in different textures OR partial textures for objects depending on distance from the viewer and/or size/orientation on screen.

The whole point of that incredibly fast SSD on the PS5 is to create game engines that take advantage of it. That high-rate, low-latency system they have allows for lots of flexibility.

Which the Unreal Engine 5 demo proved.

Yeah, MS put it in hardware, but there is nothing stopping developers from doing this in software. It's like saying you could only have lighting in games if you have ray-tracing hardware, and yet the Unreal Engine 5.0 demo has amazing bounced lighting with no ray tracing involved.
Let me introduce you to PRT.
[Image: PRT diagram]
I was just giving an example based on what the dev tweeted.
Virtual texturing is great, but it is not ideal for every kind of texture (such as transparent objects), which is why games that include virtual textures do not use it for all of their textures. Games also often look like they're using the same texture all over the place with virtual textures. It also does not play well with filtering techniques and can cause CPU/GPU stalls and incur latency while waiting.

SFS on the other hand appears to be a method that works with every texture and is a much improved hardware based version of this.

From the patent filing

A first enhancement includes a hardware residency map feature comprising a low-resolution residency map that is paired with a much larger PRT, and both are provided to hardware at the same time. The residency map stores the mipmap level of detail resident for each rectangular region of the texture. PRT textures are currently difficult to sample given sparse residency. Software-only residency map solutions typically perform two fetches of two different buffers in the shader, namely the residency map and the actual texture map. The primary PRT texture sample is dependent on the results of a residency map sample. These solutions are effective, but require considerable implementation changes to shader and application code, especially to perform filtering the residency map in order to mask unsightly transitions between levels of detail, and may have undesirable performance characteristics. The improvements herein can streamline the concept of a residency map and move the residency map into a hardware implementation.

A second enhancement includes an enhanced type of texture sample operation called a “residency sample.” The residency sample operates similarly to a traditional texture sampling, except the part of the texture sample that requests texture data from cache/memory and filters the texture data to provide an output value is removed from the residency sample operation. The purpose of the residency sample is to generate memory addresses that reach the page table hardware in the graphics processor but do not continue on to become full memory requests. Instead, the residency of the PRT at those addresses is checked and missing pages are non-redundantly logged and requested to be filled by the OS or a delegate.
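
As a toy illustration of the "two fetches" the patent contrasts against, here's roughly what a software-only residency map flow looks like (heavily simplified; the block size, map layout and names are made up):

```python
# Fetch 1 reads a low-res residency map to learn which mip is actually resident
# for that region of the texture; fetch 2 is the real texture sample, clamped
# to that mip. Hardware SFS folds this indirection into the sampler itself.

def sample_with_residency(u, v, wanted_mip, residency_map, blocks_per_edge):
    bx = min(int(u * blocks_per_edge), blocks_per_edge - 1)
    by = min(int(v * blocks_per_edge), blocks_per_edge - 1)
    resident_mip = residency_map[(bx, by)]        # fetch 1: residency map
    return max(wanted_mip, resident_mip)          # fetch 2 would sample this mip

# 2x2 residency map: top-left region has mip 0 streamed in, the rest only coarse mips.
residency = {(0, 0): 0, (1, 0): 4, (0, 1): 4, (1, 1): 6}
print(sample_with_residency(0.1, 0.1, 0, residency, 2))  # 0: full detail available
print(sample_with_residency(0.9, 0.9, 0, residency, 2))  # 6: clamp to coarse fallback
```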
 
Last edited:

Tripolygon

Banned
I was just giving an example based on what the dev tweeted.
Virtual texturing is great, but it is not ideal for every kind of texture (such as transparent objects), which is why games that include virtual textures do not use it for all of their textures. Games also often look like they're using the same texture all over the place with virtual textures. It also does not play well with filtering techniques and can cause CPU/GPU stalls and incur latency while waiting.

SFS on the other hand appears to be a method that works with every texture and is a much improved hardware based version of this.

From the patent filing
What makes SFS more suitable for handling transparency where previous PRT was not? Though you can use both virtualized and non-virtualized textures in the same game; it is not an either/or thing.
 
Last edited:
True, similar techniques can have varying implementation approaches in software and hardware, and there can be additional cost to fully software-based techniques, but not necessarily enough to have a negative impact. For example, let's look at compression used in the current gen. Both PS4 and Xbox One come with zlib decompression hardware, but Kraken can decompress faster in software on Jaguar than the built-in zlib hardware, at little to no cost to overall game performance. Hence a lot of games chose to use Kraken or roll their own software-based decompression.


Now apply the same reasoning to PS5. Sony went out of their way to implement a lot of hardware in their system to remove bottlenecks and latency, yet many people would readily say that SFS, DirectStorage and BCPack, i.e. various pieces of software, would close the gap of a quantifiable hardware difference in raw throughput. After all, SFS is a further optimization of PRT.

We know the hardware components comprising their hardware implementation. It is in the patent, and one of the principal software engineers has said it himself.

From the patent

From Microsoft dev




PRT is SFS, but with a divergence in what is bolded. This feedback loop is enabled by sampler feedback, which is natively supported by Turing GPUs and RDNA 2, hence PS5 also. PRT 2.0 allows tile residency control.



That is not unique to SFS. All PRT/SFS implementations offer a fallback; that's what various forms of texture filtering can be used for: blending between levels. Xbox Series X has a hardware implementation of said texture filter.



Actually, what I quoted proves that nobody is loading entire full-resolution texture pages and mipmaps into memory for every object in view. It disproves the BS in that tweet chain and the subsequent replies in this thread.


What is new is using SF to trigger the page load.


Appreciate the response; I'd like to clear up the thing about PRT 2.0. I just made that name up for whatever Sony might call their implementation. It's more or less assumed PS5 will carry on with PRT in hardware, but whether they call it 2.0 or not is down to semantics; I just threw 2.0 in there to show a continuation of its support.

You're right that software implementations don't necessarily imply a large hit to performance, especially well-optimized ones. But workloads also grow over time, and general increases in non-specialized hardware can't always keep up with them well enough to keep the cost of a software-based implementation negligible. That said, we can't really predict ahead of time how much that impact would grow.

About SFS and its presence on Nvidia cards: that doesn't necessarily mean it's a standard RDNA2 feature. Nvidia GPUs (at least a couple) also had ExecuteIndirect hardware support, but aside from those GPUs and the XBO, no other system had that. AFAIK, ExecuteIndirect is a MS development, though the technique could probably be achieved through other means (like many other things; there are always multiple solutions to similar problems, after all).

At least from what I can tell, SF/SFS are a similar thing; MS likely gave Nvidia license to support it in their hardware, since they often collaborate on deciding DX features, so that Nvidia (and AMD) can incorporate them into their future development timelines. It's very similar to Nvidia's support of DirectStorage, which is already on their GPUs in the form of GPUDirect Storage, even though it's a MS development for DX12/DX12U. They just license out the rights for Nvidia to build hardware/software implementations of it into their GPU technology.

RDNA2 itself, in a more generic fashion, should be able to support things analogous to SFS; the question is whether that comes down to generic RDNA2 hardware on AMD's end (which Sony would then have to rely on to simulate a similar approach), or whether, more likely, Sony's team and their AMD engineers came up with dedicated hardware support, basically that "PRT 2.0" I mentioned earlier. The one thing that makes me pessimistic on that particular note (at least as to what extent they'd have done the latter) is the lack of focus on it in Road to PS5; Cerny seemed willing to touch on specific customizations/optimizations with the Geometry Engine and Primitive Shaders there, even though the GPU wasn't their big focus, and a hypothetical PRT 2.0 would be GPU-based, so... why not mention it then?

At least in MS's case the excuse is they generally always reserve architecture breakdowns for Hot Chips, which will be in August. But anyways, yeah.
 

Tripolygon

Banned
Appreciate the response; I'd like to clear up the thing about PRT 2.0. I just made that up as a name of the implementation Sony could call it. It's more or less assumed PS5 will carry on with PRT in hardware but whether they call it 2.0 or not is down to semantics. I just threw 2.0 in there to show a continuation of its support.
Agreed with your post, but I also have to clarify that I took your PRT 2.0 to mean the next version of PRT. It will more than likely be termed PRT+; that is what Microsoft calls it in their documentation.
 

GODbody

Member
What makes SFS more suitable for handling transparency where previous PRT was not? Though you can use both virtualized and non-virtualized textures in the same game; it is not an either/or thing.

Likely through the use of improved texture filters and an improved texture sampling process. They've also made some improvements to the shading process to help with filtering techniques.





The king has spoken.

So this differs from PRT in that where PRT would load an entire mip level from a mipmap, SFS takes it a step further and only loads into memory the parts of that mip level which are actually visible.
 

martino

Member


I mean, outside the top 20 posters of the speculation thread, who says SFS is software when all the documentation, interviews and tweets say otherwise?
But so far nothing says this is an MS-limited feature.
So PS5 could load 27GB/s of textures.
(A lie for both with random/mixed reads in a streaming scenario, IMO.)
 
Last edited:

Bernkastel

Ask me about my fanboy energy!
Thanks, no more speculation. According to Stanard, this clever streaming strategy would be really demanding without HW built for it. On top of that, he also confirmed XSX can load 12 GB/s of texture data thanks to SFS savings. Amazing stuff👍
We had already gone through most of these at some point in this thread, and we're still at "software can't beat hardware", "SFS is just fancy marketing for PRT", "XVA is a gimmick" posts.
 

sinnergy

Member
Thanks, no more speculation. According to Stanard, this clever streaming strategy would be really demanding without HW built for it. On top of that, he also confirmed XSX can load 12 GB/s of texture data, because thanks to SFS savings the I/O only needs to move 4.8GB/s. Amazing stuff👍
So it's 12 GB/s; things just got interesting. It does not actually run at that, but it compares to 12 GB/s, as you can't go faster than 4.8 GB/s.
 
Last edited: