
Apple's M1 Max GPU is as powerful as an Nvidia RTX 2080 desktop GPU and the Sony PS5 gaming console

Panajev2001a

GAF's Pleasant Genius
DirectX 12_1 has ROV (Rasterizer Ordered Views). The Xbox 360 has ROV-like features, since the Xenia emulator maps certain Xbox 360 ROPS behaviour onto DirectX 12_1's ROV.
ROVs allow you to modify the pipeline’s behaviour to handle OIT in software (shader code: https://docs.microsoft.com/en-us/windows/win32/direct3d11/rasterizer-order-views), which is different from the pixel sorting the DC did for opaque + punch-through geometry and transparent geometry ( https://docs.microsoft.com/en-us/previous-versions/ms834190(v=msdn.10) ). This was something they stopped doing on PC designs and future designs, but according to people who worked on its design it was not a big cost on the HW side at all.


… and it was quite a lot of years ago too.
Reciprocal treatment is an easy concept to understand, hence I returned the same serve back to you.
Not really, you just choose to pick fights for the fun of it / to stand taller, maybe got the feeling of some console warring or someone saying something you disagree with and you start hounding people’s posts with slides and links barely addressing what they are saying until they stop posting.

You think TBDR’s are obsolete and Apple’s choice to stick with them is odd if not wrong, I do think they still for now have advantages (especially to reach similar levels of performance as consoles at a much lower power consumption, die size they are not doing too bad trade-offs wise as well considering it is a lot of GPU in there but tons of other HW and embedded caches making up the space). You can do a depth pre-pass and use early-z and Hierarchical Z buffers to reject geometry later to reduce overdraw but that will still burn extra power compared to processing geometry and rendering sorted triangles with invisible ones culled by the HW and the scene already binned in independent tiles (on chip tile bandwidth is easy to keep very high and Apple has been building extensions to do more steps without rendering temporary buffers out to main memory too, and the ability to do the MSAA resolve step before writing out the tile is another bandwidth win… you are free to look at a PC CPU+GPU SoC combo with that memory bandwidth and up to 64 GB of RAM on the package and feel smug, others appreciate the engineering challenge behind something like the M1 Max).
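(To make the binning step above concrete: a minimal CPU-side sketch of sorting triangles into screen tiles by bounding box. Everything here, the types, the 32-pixel tile size, the data layout, is made up for illustration; real TBDR hardware does this with dedicated units and an on-chip parameter buffer.)

```cpp
#include <algorithm>
#include <vector>

// Toy illustration of the "bin the scene into screen tiles" step a TBDR
// performs before shading. Types and sizes are invented for the example;
// real hardware also culls hidden surfaces per tile before shading.
struct Vec2 { float x, y; };
struct Triangle { Vec2 v[3]; };

constexpr int kTileSize = 32;  // pixels per tile edge (illustrative)

struct TileBins {
    int tilesX = 0, tilesY = 0;
    std::vector<std::vector<int>> bins;  // per-tile list of triangle indices
};

TileBins binTriangles(const std::vector<Triangle>& tris, int width, int height) {
    TileBins out;
    out.tilesX = (width + kTileSize - 1) / kTileSize;
    out.tilesY = (height + kTileSize - 1) / kTileSize;
    out.bins.resize(static_cast<size_t>(out.tilesX) * out.tilesY);

    for (int i = 0; i < static_cast<int>(tris.size()); ++i) {
        // Conservative screen-space bounding box of the triangle.
        const Triangle& t = tris[i];
        float minX = std::min({t.v[0].x, t.v[1].x, t.v[2].x});
        float maxX = std::max({t.v[0].x, t.v[1].x, t.v[2].x});
        float minY = std::min({t.v[0].y, t.v[1].y, t.v[2].y});
        float maxY = std::max({t.v[0].y, t.v[1].y, t.v[2].y});

        int tx0 = std::max(0, static_cast<int>(minX) / kTileSize);
        int tx1 = std::min(out.tilesX - 1, static_cast<int>(maxX) / kTileSize);
        int ty0 = std::max(0, static_cast<int>(minY) / kTileSize);
        int ty1 = std::min(out.tilesY - 1, static_cast<int>(maxY) / kTileSize);

        // Every tile touched by the triangle gets a reference to it; each tile
        // can then be rasterized, depth-tested and shaded entirely on-chip
        // before the finished tile is written out to main memory once.
        for (int ty = ty0; ty <= ty1; ++ty)
            for (int tx = tx0; tx <= tx1; ++tx)
                out.bins[static_cast<size_t>(ty) * out.tilesX + tx].push_back(i);
    }
    return out;
}
```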
 

UltimaKilo

Gold Member
Still more interested in what Apple does with its next-gen chips. Brute force is nice, but doesn't translate to real-world performance.
 
So this chip, the M1 Max, supposedly has a ~435 mm² die and is fabbed on TSMC's 5 nm node.

Considering it's only able to match the PS5 GPU's theoretical peak, but with a much bigger die on a much smaller process node, it seems a little less impressive.

I guess that 512-bit memory interface is gonna cost you in die area, plus the M1 packs in dedicated ML silicon, but still... the size of the die is surprising.
Apple is prioritising power efficiency over cost. As you increase the freq on a chip, power consumption doesn't scale up linearly. By running with a lower clock speed, you can save a lot of power. Hence they're going for a wider approach to save on power without compromising on performance, at a higher cost due to using more silicon. That's why you get a huge die, but not higher performance than those smaller dies. Compare the power consumption of those SoCs and you'll see the difference there. Cost/Performance/Power Consumption - you can prioritise two, not all three.
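(A back-of-the-envelope sketch of that point, using the usual dynamic power approximation P ≈ C·V²·f and the rough assumption that voltage scales with frequency; the numbers are illustrative, not measurements of any real chip.)

```cpp
#include <cstdio>

// Why "wide and slow" saves power: dynamic power is roughly C * V^2 * f,
// and holding a higher frequency usually needs a higher voltage, so power
// climbs much faster than the clock does.
int main() {
    const double scales[] = {0.7, 1.0, 1.3, 1.6};  // relative clock speed
    for (double s : scales) {
        double f = s;              // frequency, normalized
        double v = s;              // crude assumption: voltage tracks frequency
        double p = v * v * f;      // C*V^2*f with C normalized to 1
        double perfPerWatt = f / p;
        std::printf("clock x%.1f -> power x%.2f, perf/W x%.2f\n", s, p, perfPerWatt);
    }
    // Doubling the execution units at the lower clock roughly doubles area
    // and power, but recovers the throughput without the ~cubic penalty.
    return 0;
}
```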
 
It's Apple we are talking about..... overcharging for their products and imprisoning their customers into their ecosystem is their business model.
Unless you are 100% Linux there really is no ground for you to stand on when you say that. Apple is and was foremost a computer hardware company that makes its own operating system (like IBM used to do, until they bought DOS from Microsoft).

They all want you to be locked in, in more or less egregious ways. Try to use Windows 10/11 with only a local account... you'll be constantly harassed to get with the times; even Google doesn't bother me when I don't use a Google account on Android.
 

rnlval

Member
ROVs allow you to modify the pipeline’s behaviour to handle OIT in software (shader code: https://docs.microsoft.com/en-us/windows/win32/direct3d11/rasterizer-order-views), which is different from the pixel sorting the DC did for opaque + punch-through geometry and transparent geometry ( https://docs.microsoft.com/en-us/previous-versions/ms834190(v=msdn.10) ). This was something they stopped doing on PC designs and future designs, but according to people who worked on its design it was not a big cost on the HW side at all.

… and it was quite a lot of years ago too.

Not really, you just choose to pick fights for the fun of it / to stand taller, maybe got the feeling of some console warring or someone saying something you disagree with and you start hounding people’s posts with slides and links barely addressing what they are saying until they stop posting.

You think TBDR’s are obsolete and Apple’s choice to stick with them is odd if not wrong, I do think they still for now have advantages (especially to reach similar levels of performance as consoles at a much lower power consumption, die size they are not doing too bad trade-offs wise as well considering it is a lot of GPU in there but tons of other HW and embedded caches making up the space). You can do a depth pre-pass and use early-z and Hierarchical Z buffers to reject geometry later to reduce overdraw but that will still burn extra power compared to processing geometry and rendering sorted triangles with invisible ones culled by the HW and the scene already binned in independent tiles (on chip tile bandwidth is easy to keep very high and Apple has been building extensions to do more steps without rendering temporary buffers out to main memory too, and the ability to do the MSAA resolve step before writing out the tile is another bandwidth win… you are free to look at a PC CPU+GPU SoC combo with that memory bandwidth and up to 64 GB of RAM on the package and feel smug, others appreciate the engineering challenge behind something like the M1 Max).
DirectX12 Hardware Feature Level 12_1 ROV is a hardware feature (1) since DirectX12 Hardware Feature Level 12_0 is missing the ROV feature.

1. From https://docs.microsoft.com/en-us/windows/win32/direct3d12/hardware-feature-levels

DirectX12 Hardware Feature Level 12_1 also has Conservative Rasterization.

From https://docs.microsoft.com/en-us/windows/win32/direct3d12/conservative-rasterization

Conservative Rasterization is useful in a number of situations, including for certainty in collision detection, occlusion culling, and tiled rendering.
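(For reference, a minimal sketch of how a D3D12 application can query whether the hardware exposes these two Feature Level 12_1 capabilities; the function name is made up and error handling is trimmed.)

```cpp
#include <windows.h>
#include <d3d12.h>
#include <cstdio>

// Sketch: querying the two Feature Level 12_1 capabilities discussed above
// (Rasterizer Ordered Views and Conservative Rasterization) on an already
// created D3D12 device.
void ReportFL121Features(ID3D12Device* device) {
    D3D12_FEATURE_DATA_D3D12_OPTIONS opts = {};
    if (SUCCEEDED(device->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS,
                                              &opts, sizeof(opts)))) {
        std::printf("Rasterizer Ordered Views: %s\n",
                    opts.ROVsSupported ? "yes" : "no");
        std::printf("Conservative Rasterization tier: %d\n",
                    static_cast<int>(opts.ConservativeRasterizationTier));
    }
}
```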
From https://www.anandtech.com/show/10536/nvidia-maxwell-tile-rasterization-analysis

NVIDIA's Maxwell and Pascal also have tile-based rasterization via the L2 cache. Ampere RTX is built on Samsung's older 8 nm process node.



NVIDIA's 2nd-gen Maxwell and Pascal support DirectX12 Hardware Feature Level 12_1's Conservative Rasterization.

It took AMD RDNA 2 to rival NVIDIA's raster efficiency.

-----------------

Apple's Mac Mini M1 consumes 39 watts on TSMC's 5 nm process node, and that power consumption is far from the handset version (~4 to 5 watts).

On Samsung's 4 nm process node, RDNA 2 with 6 CUs also reaches the handset form factor and still includes raytracing hardware.

The AMD 4700S (a recycled PS5 APU) has 448 GB/s of memory bandwidth; it's being offered to the PC market and it demonstrates that unmodified Windows 10 can run on it. Note that the PS5's Zen 2 CPUs have reduced AVX ports, a Sony choice to cut AVX resources.

Game consoles like Xbox Series X and Series S don't have the laptop's aggressive clock speed throttling and under-volting to conserve energy.

Note that RTX 3080 Mobile includes RT cores.
 

ethomaz

Banned
From Ethomaz. https://www.techpowerup.com/gpu-specs/geforce-rtx-3080.c3621 lists a 1710 MHz boost clock speed.





From https://www.techpowerup.com/review/nvidia-geforce-rtx-3080-founders-edition/32.html

The RTX 3080 FE has a 1931 MHz average clock speed, hence the potential FP32 throughput lands at about 33.6 TFLOPS.
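(For reference, the arithmetic behind those figures, assuming the RTX 3080's 8704 FP32 ALUs and 2 FLOPs per fused multiply-add per clock.)

```cpp
#include <cstdio>

// The arithmetic behind the quoted figures:
// peak FP32 TFLOPS = shader ALUs * 2 FLOPs per FMA * clock.
int main() {
    const double fp32Lanes = 8704.0;   // RTX 3080 (non-Ti) CUDA cores
    const double flopsPerClock = 2.0;  // one fused multiply-add = 2 FLOPs
    const double clocksMHz[] = {1710.0, 1931.0};  // advertised boost vs. measured average

    for (double mhz : clocksMHz) {
        double tflops = fp32Lanes * flopsPerClock * mhz * 1e6 / 1e12;
        std::printf("%.0f MHz -> %.1f TFLOPS FP32 peak\n", mhz, tflops);
    }
    return 0;
}
```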
You mean nVidia.
The full specs come from nVidia themselves.

 

Spukc

always chasing the next thrill
What they mean is that it costs a lot.

In laptops Apple are pretty competitive if you compare with similar quality and specs, they just don't go in the very low end.
Low end meaning $1000? I would disagree.
Then again, if you want the low low end, get the cheapest iPad. Enough for most people.
 

Panajev2001a

GAF's Pleasant Genius
DirectX12 Hardware Feature Level 12_1 ROV is a hardware feature (1) since DirectX12 Hardware Feature Level 12_0 is missing the ROV feature.

1. From https://docs.microsoft.com/en-us/windows/win32/direct3d12/hardware-feature-levels
ROV is a hardware feature but it does not magically code your OIT algorithm for you. You are still coding it yourself (aka it runs in software). Again: https://docs.microsoft.com/en-us/windows/win32/direct3d11/rasterizer-order-views

“This enables Order Independent Transparency (OIT) algorithms to work”

“ROVs are an HLSL-only construct that applies different behavior semantics to UAVs”

“Rasterizer ordered views (ROVs) enable the underlying OIT algorithms to use features of the hardware to try to resolve the transparency order correctly. Transparency is handled by the pixel shader.”
I expect more hair splitting, slides, and papers quoted to make 5 different points and try to win arguments if others throw in the towel. Not sure why we keep replying to each other. You are uninterested in anything but trying to win “something” and that gets old after a while.
It took AMD RDNA 2 to rival NVIDIA's raster efficiency.

Relevant if someone called nVIDIA’s efficiency into question as if it were overstated (they are certainly pulling a lot of magic tricks to get immediate mode renderers in TBDR efficiency regions, it is impressive), so you are fighting windmills here. With the downside of having to use more memory to store the entire scene and processing geometry before being able to bin triangles and start the HSR process this is still different than putting a cache to gather and reuse submitted geometry and caching the render output afterwards (you can call into question the cost to achieve highest efficiency, that is fair).

Apple's Mac Mini M1 consumes 39 watts on TSMC's 5 nm process node, and that power consumption is far from the handset version (~4 to 5 watts).
And in the 57-billion-transistor M1 Max, at roughly double that target power consumption, they have not invested in RT HW. Almost as if they designed the GPU for workloads other than RT-based games. On the other side they have a considerably fast design at a very low power consumption target and with tons of low-latency on-chip memory.
Nobody is offending the PCMR with a green tint as if nVIDIA had to feel inferior as a company, but you seem to be throwing factoids at the wall as if Apple somehow tried but fell short, or... well, not sure what the point was.

On Samsung's 4 nm process node, RDNA 2 with 6 CUs also reaches the handset form factor and still includes raytracing hardware.
I am not sure how much HQ RT visual quality you will get from that 6 CU design in a handset form factor on a battery. Still, not sure what the point was.

AMD 4700S (recycled PS5 APU) has 448 GB/s memory bandwidth,
For all intents and purposes no, it does not: https://www.tomshardware.com/uk/reviews/amd-4700s-desktop-kit-review-ps5-cpu/4

“The AMD 4700S with GDDR6 memory achieves 92,892 MB/s of copy bandwidth”

it's being offered to the PC market and it demonstrates that unmodified Windows 10 can run on it. Note that the PS5's Zen 2 CPUs have reduced AVX ports, a Sony choice to cut AVX resources.
Either factoid has little to no relevance to what is being discussed so far.

Game consoles like Xbox Series X and Series S don't have the laptop's aggressive clock speed throttling and under-volting to conserve energy.
Run an M1 with fans (not the Air model) or the M1 Pro/Max in High Power mode (or even normal) and you will not see that kind of throttling; you might be thinking of the x86-based jet engine the old MBP was.

Note that RTX 3080 Mobile includes RT cores.
Good for it. I am not sure RT on laptops, especially on macOS and the pro apps this MBP targets, is at the must have level. Better to have it if you can afford it, but even better to have good sustainable performance at low wattage (which is what the M1/Pro/Max series is designed for). RT HW acceleration is coming and Apple has the ability to design based also on the IP they got from IMG Tech and their

Where are the raytraced game benchmarks?
Without RT HW it is not going to be as fast as a GPU with it on board, 😳 shocking. Also, not everything is 100% RT dependent yet, and when it is you can bet that they will have RT acceleration HW in those GPU cores.
 

Mobilemofo

Member
Apple's marketing is as vague as ever.....
"industry benchmarks"... ofc we're not telling you which ones exactly :messenger_tears_of_joy: :messenger_tears_of_joy: :messenger_tears_of_joy:

Apple's only cooking with water like everyone else. And as always they will have industry-leading performance/efficiency in everything that's 100% vertically integrated/sponsored and suck at everything else.


It's Apple we are talking about..... overcharging for their products and imprisoning their customers into their ecosystem is their business model.
Very much Apple's business model. Also, it's all very well having that power, but then, as we have seen over the years, there's nothing to take advantage of that power. Pretty pointless, but I'm sure some Apple folk will see otherwise.
 

rnlval

Member
ROV is a hardware feature but it does not magically code your OIT algorithm for you. You are still coding it yourself (aka it runs in software). Again: https://docs.microsoft.com/en-us/windows/win32/direct3d11/rasterizer-order-views

“This enables Order Independent Transparency (OIT) algorithms to work”

“ROVs are an HLSL-only construct that applies different behavior semantics to UAVs”

“Rasterizer ordered views (ROVs) enable the underlying OIT algorithms to use features of the hardware to try to resolve the transparency order correctly. Transparency is handled by the pixel shader.”
I expect more hair splitting, slides, and papers quoted to make 5 different points and try to win arguments if others throw in the towel. Not sure why we keep replying to each other. You are uninterested in anything but trying to win “something” and that gets old after a while.


Relevant if someone called nVIDIA’s efficiency into question as if it were overstated (they are certainly pulling a lot of magic tricks to get immediate mode renderers in TBDR efficiency regions, it is impressive), so you are fighting windmills here. With the downside of having to use more memory to store the entire scene and processing geometry before being able to bin triangles and start the HSR process this is still different than putting a cache to gather and reuse submitted geometry and caching the render output afterwards (you can call into question the cost to achieve highest efficiency, that is fair).


And in the 57-billion-transistor M1 Max, at roughly double that target power consumption, they have not invested in RT HW. Almost as if they designed the GPU for workloads other than RT-based games. On the other side they have a considerably fast design at a very low power consumption target and with tons of low-latency on-chip memory.
Nobody is offending the PCMR with a green tint as if nVIDIA had to feel inferior as a company, but you seem to be throwing factoids at the wall as if Apple somehow tried but fell short, or... well, not sure what the point was.


I am not sure how much HQ RT visual quality you will get from that 6 CU design in a handset form factor on a battery. Still, not sure what the point was.


For all intents and purposes no, it does not: https://www.tomshardware.com/uk/reviews/amd-4700s-desktop-kit-review-ps5-cpu/4

“The AMD 4700S with GDDR6 memory achieves 92,892 MB/s of copy bandwidth”



Either factoid has little to no relevance to what is being discussed so far.


Run an M1 with fans (not the Air model) or the M1 Pro/Max in High Power mode (or even normal) and you will not see that kind of throttling; you might be thinking of the x86-based jet engine the old MBP was.


Good for it. I am not sure RT on laptops, especially on macOS and the pro apps this MBP targets, is at the must have level. Better to have it if you can afford it, but even better to have good sustainable performance at low wattage (which is what the M1/Pro/Max series is designed for). RT HW acceleration is coming and Apple has the ability to design based also on the IP they got from IMG Tech and their


Without RT HW it is not going to be as fast as a GPU with it on board, 😳 shocking. Also, not everything is 100% RT dependent yet, and when it is you can bet that they will have RT acceleration HW in those GPU cores.
1.For "ROV is a hardware feature but it does not magically code your OIT algorithm for you"

Without software, hardware by itself is nothing. This is also applicable to PowerVR's Tile-Based Deferred Rendering IP.

Your "software" argument is a red herring to actual hardware features delivered by DirectX12 Hardware Feature Level 12_1 and Xbox 360's ROPS EDRAM chip (remapped to FL12_1's ROV with PC's X360 emulator).

Shader programs are explicitly parallel software for the GPGPU, and specific hardware features are needed to accelerate targeted workloads.

AMD has a compute shader approach for OIT: http://developer.amd.com/wordpress/media/2013/06/2041_final.pdf

From 2011, DirectCompute with a per-pixel linked list could be used for OIT, before DirectX12 Feature Level 12_1's ROV.


That example uses compute shader resources instead of the ROPS-path hardware.
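(For illustration only: a CPU-side toy of that per-pixel linked-list OIT approach. Pass 1 appends each transparent fragment to a per-pixel list, pass 2 sorts back-to-front and blends. The real versions run in compute/pixel shaders with an atomic counter and UAVs; the types and function names here are invented for the sketch.)

```cpp
#include <algorithm>
#include <vector>

// CPU-side toy of per-pixel linked-list OIT: append, then sort and blend.
struct Fragment {
    float depth;
    float rgba[4];
    int   next;        // index of the next node in this pixel's list, -1 = end
};

struct OITBuffers {
    std::vector<Fragment> nodes;  // shared node pool (the "UAV")
    std::vector<int>      heads;  // per-pixel list head index, -1 = empty
};

void appendFragment(OITBuffers& b, int pixel, float depth, const float rgba[4]) {
    Fragment f{depth, {rgba[0], rgba[1], rgba[2], rgba[3]}, b.heads[pixel]};
    b.nodes.push_back(f);                          // atomic increment on the GPU
    b.heads[pixel] = static_cast<int>(b.nodes.size()) - 1;
}

void resolvePixel(const OITBuffers& b, int pixel, float dst[4]) {
    std::vector<Fragment> frags;
    for (int i = b.heads[pixel]; i != -1; i = b.nodes[i].next)
        frags.push_back(b.nodes[i]);

    // Back-to-front: the farthest fragment is blended first.
    std::sort(frags.begin(), frags.end(),
              [](const Fragment& a, const Fragment& c) { return a.depth > c.depth; });

    for (const Fragment& f : frags) {
        float a = f.rgba[3];
        for (int c = 0; c < 3; ++c)
            dst[c] = f.rgba[c] * a + dst[c] * (1.0f - a);
    }
}
```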


With ROV, https://www.intel.com/content/www/u...ical/rasterizer-order-views-101-a-primer.html


ROV use cases

The ability to have deterministic order access to read/write buffers within a shader.


For OIT, the programmer has to define deterministic order, hence software will be required! ROPS are the primary raster read-write units.

You failed to factor in that the programmer defines the deterministic order, which requires software!

PS: Texture Units can be used as raster read/write units via compute shaders.

Again, shader programs are explicitly parallel software for the GPGPU, and specific hardware features are needed to accelerate targeted workloads. The deterministic order is defined by the programmer, hence software is needed.



2. For “The AMD 4700S with GDDR6 memory achieves 92,892 MB/s of copy bandwidth”

Note why I stated the PS5 CPU has reduced AVX resources, i.e. half the AVX ports of a normal Zen 2. To reduce the number of individual load/store instructions needed when exploiting large memory bandwidth, scatter and gather instructions are needed.

AVX2 has gather instructions.
AVX-512 has scatter instructions. AVX-512's scatter and gather instructions trace back to Larrabee's 512-bit SIMD extensions.
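(A small illustration of the gather point, assuming an AVX2-capable CPU: one gather pulls eight floats from non-contiguous addresses instead of eight scalar loads.)

```cpp
#include <immintrin.h>
#include <cstdio>

// One AVX2 gather replaces eight scalar loads plus inserts.
// Build with -mavx2 (GCC/Clang) or /arch:AVX2 (MSVC).
int main() {
    float table[16] = {0, 10, 20, 30, 40, 50, 60, 70,
                       80, 90, 100, 110, 120, 130, 140, 150};

    // Eight non-contiguous indices into the table.
    __m256i idx = _mm256_setr_epi32(0, 5, 2, 9, 14, 7, 11, 3);

    // One instruction's worth of work instead of eight separate loads.
    __m256 gathered = _mm256_i32gather_ps(table, idx, 4 /* byte scale */);

    alignas(32) float out[8];
    _mm256_store_ps(out, gathered);
    for (float v : out) std::printf("%.0f ", v);
    std::printf("\n");
    return 0;
}
```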

Zen 3 has more load/store units than Zen 2:
Zen 3 has 3 load units and 2 store units.
Zen 2 has 2 load units and 1 store unit.

Modern PC GPGPUs have dedicated DMA move engines (remember the hardware blitter?). Moving large blocks of data should be done via the GPGPU.


3. Against "Also, not everything is 100% RT dependent yet and when it will be you can bet that they will have RT acceleration HW in those GPU cores."

The open-source and free Blender has RT hardware acceleration.

RTX hardware accelerates Autodesk Maya and Arnold.


Chaos Group’s RTX Support Accelerates V-Ray Next for 3DS Max and Maya.

For AMD camp, https://gpuopen.com/learn/radeon-prorender-2-0/
For RDNA 2, Radeon ProRender 2.0 was updated for Autodesk Maya, Blender, and SideFX Houdini.

For Cinema 4D, https://www.maxon.net/en/cinema-4d/features/rendering-with-octane
Cinema 4D's raytracing is hardware accelerated via Octane Render.
 
A lot of the die size is SRAM (trying to keep the CPU very well fed without increased power consumption): they halved the number of efficiency cores (yet kept the same 4 MB L2) and doubled the number of the wide Performance cores (with 24 MB of L2 cache and plenty of L1 cache per core too), as well as 32-64 MB of System Level Cache (64 MB in the M1 Max) shared by all processors in the SoC.

As you noted they do pack a neat punch in terms of ML acceleration on chip as well as new extra Silicon for the camera / ISP HW block which consoles do not have. I wonder how they are addressing the bandwidth they targeted for the SSD (7.8 GB/s raw).

Thanks for the update.

I don't really know much about the architecture of Apple's GPU. I presume they branched off the IMG PowerVR series design and engineered their own TBDR-based GPU pushing their own bespoke advancements.

The SRAM on-die explains the die size difference nicely. Especially with that 64MB system cache, I can imagine this thing being pretty beastly.

Apple is prioritising power efficiency over cost. As you increase the freq on a chip, power consumption doesn't scale up linearly. By running with a lower clock speed, you can save a lot of power. Hence they're going for a wider approach to save on power without compromising on performance, at a higher cost due to using more silicon. That's why you get a huge die, but not higher performance than those smaller dies. Compare the power consumption of those SoCs and you'll see the difference there. Cost/Performance/Power Consumption - you can prioritise two, not all three.

Nah, this doesn't really explain the die size discrepancy. Panajev2001a's explanation of the on-die SRAM caches makes more sense in explaining why the M1 Max is so big compared to the PS5 APU despite being on the smaller, denser node.
 

THE DUCK

voted poster of the decade by bots
So what's the bottom line? Half the real-world performance of a 3080? If so that's still very good.
 

Panajev2001a

GAF's Pleasant Genius
1.For "ROV is a hardware feature but it does not magically code your OIT algorithm for you"

Without software, hardware by itself is nothing. This is also applicable to PowerVR's Tile-Based Deferred Rendering IP.

Your "software" argument is a red herring to actual hardware features delivered by DirectX12 Hardware Feature Level 12_1 and Xbox 360's ROPS EDRAM chip (remapped to FL12_1's ROV with PC's X360 emulator).

Shader programs are explicitly parallel software for the GPGPU, and specific hardware features are needed to accelerate targeted workloads.

AMD has a compute shader approach for OIT: http://developer.amd.com/wordpress/media/2013/06/2041_final.pdf

From 2011, DirectCompute with a per-pixel linked list could be used for OIT, before DirectX12 Feature Level 12_1's ROV.


That example uses compute shader resources instead of the ROPS-path hardware.


With ROV, https://www.intel.com/content/www/u...ical/rasterizer-order-views-101-a-primer.html


ROV use cases

The ability to have deterministic order access to read/write buffers within a shader.


For OIT, the programmer has to define deterministic order, hence software will be required! ROPS are the primary raster read-write units.

You failed to factor in that the programmer defines the deterministic order, which requires software!

PS: Texture Units can be used as raster read/write units via compute shaders.

Again, shader programs are explicitly parallel software for the GPGPU, and specific hardware features are needed to accelerate targeted workloads. The deterministic order is defined by the programmer, hence software is needed.
Setting aside the paternalistic definitions of what shader programs are or how hardware is nothing without software, this was a long detour you are taking on one claim about PVR2DC (that MS’s own DX docs on the DC covered): the amount of software/the complexity that is required to achieve the task. Batch your geometry into opaque+punch through and transparent and on the DC you were essentially done (the GPU per pixel transparency sorting and blending was as normal as any pass).

2. For “The AMD 4700S with GDDR6 memory achieves 92,892 MB/s of copy bandwidth”

Note why I stated the PS5 CPU has reduced AVX resources, i.e. half the AVX ports of a normal Zen 2. To reduce the number of individual load/store instructions needed when exploiting large memory bandwidth, scatter and gather instructions are needed.

AVX2 has gather instructions.
AVX-512 has scatter instructions. AVX-512's scatter and gather instructions trace back to Larrabee's 512-bit SIMD extensions.

Zen 3 has more load/store units than Zen 2:
Zen 3 has 3 load units and 2 store units.
Zen 2 has 2 load units and 1 store unit.

Modern PC GPGPUs have dedicated DMA move engines (remember the hardware blitter?). Moving large blocks of data should be done via the GPGPU.
Leaving aside your showoffy detour on AVX (Zen 2 supporting both ISAs, just at half the throughput for 512-bit operations), the concurrent number of load and store instructions, dealing with transferring non-contiguous blocks of data (the scatter and gather part gives a clue on where the data is), the use of GDDR6 instead of LPDDR5, the much slower SSD I/O that APU has, etc… this is a desktop reference APU that has no active/working integrated GPU, but keep using it to try to knock what Apple achieved down a peg or two (not that I see silicon experts unimpressed or calling Apple's M1 Max BS, but then 🤷‍♂️).


3. Against "Also, not everything is 100% RT dependent yet and when it will be you can bet that they will have RT acceleration HW in those GPU cores."

The open-source and free Blender has RT hardware acceleration.

RTX hardware accelerates Autodesk Maya and Arnold.


Chaos Group’s RTX Support Accelerates V-Ray Next for 3DS Max and Maya.

For AMD camp, https://gpuopen.com/learn/radeon-prorender-2-0/
For RDNA 2, Radeon ProRender 2.0 was updated for Autodesk Maya, Blender, and SideFX Houdini.

For Cinema 4D, https://www.maxon.net/en/cinema-4d/features/rendering-with-octane
Cinema 4D's raytracing is hardware accelerated via Octane Render.


I am not saying there are no professional tools where RT HW acceleration is useful (many of those are meant to run on Windows boxes or Linux farms), nor do we know the M1 Max's performance on those; just that it was a matter of prioritisation based on the target market for these chips (laptops) and the software they are meant to push to macOS customers. The large 57-billion-transistor die had different priorities than RT acceleration. It is not a must-have yet.

Will games with RT mode not have that mode shine on it? Sure, same as above… but the professional target market for these laptops and thus these chips was not looking at that either (and the performance of the apps that run will show, I would not be surprised that games that optimise around the Metal API version it supports would be quite performant too).
 

ParaSeoul

Member
I see this a lot, especially with the battery life and performance claims Apple makes in these presentations lately. People joke and then fail to show how Apple was untruthful… like with the iPhone 13 benchmarks, where it turned out the device performed even better than Apple said it would, but 🤷‍♂️.
When it comes to ARM, that's the only time I'd believe anything they say. They're legitimately 2 generations ahead of everyone else.
 

jose4gg

Member
I just saw Dave2D playing Overwatch using Parallels (for non-Mac users, this means virtualized). He was getting over 100 fps with bad configs and the gameplay was smooth. Take a look, it is really impressive:

 

DaGwaphics

Member
if they want to get into console gaming for the next gen they may easily win, as long as they don't overcharge for their products and have stupid practices.

What, you don't think their new console system will have a chance? I heard it will be the only true $4k system on the block, with a super fancy case and tiny size (planned obsolescence via software included at no additional fee).
 

DaGwaphics

Member
Anandtech review is up: https://www.anandtech.com/show/17024/apple-m1-max-performance-review/6

To me that is very impressive performance for a low-clocked (1296 MHz), low-power (60-70 W) GPU.
How would AMD and Nvidia GPUs perform at that power limit?

EDIT: interesting detail. Andrei found the CPU can use up to 243 GB/s of memory bandwidth (higher than the total M1 Pro BW available) and the GPU up to 90 GB/s.

Hopefully this will make Nvidia and AMD take power efficiency seriously again. It's been a long time since anything viable was available in the board-powered 75 W range.
 

ethomaz

Banned
I see this a lot, especially with the battery life and performance claims Apple makes in these presentations lately. People joke and then fail to show how Apple was untruthful… like with the iPhone 13 benchmarks, where it turned out the device performed even better than Apple said it would, but 🤷‍♂️.
Weird, because the M1 Max's performance is nowhere close to Apple's claim.

Ohhhh, Apple has been caught cheating in benchmarks in the past already.
 
Weird, because the M1 Max's performance is nowhere close to Apple's claim.

Ohhhh, Apple has been caught cheating in benchmarks in the past already.

Not really, it's just that they compared benchmark tools.
But still, it's good enough for anyone interested in playing on a MacBook.
 

twilo99

Member
Just remember, Apple are just getting started with this. If the CPU upgrades in the iPhone are any indication, by the time they get to the "A3 max" or whatever it will be called in a few years, they will be blowing the competition even further away...
 

Leo9

Member
Nowhere near the performance that they claimed lol

More like RTX 3060 level instead of RTX 3080.
Yeah, let's use non-native, less optimized games to draw conclusions.
Completely pointless.
Games run much better using Boot Camp/Windows instead of macOS.
 

jose4gg

Member
Yeah, let's use non-native, less optimized games to draw conclusions.
Completely pointless.
Games run much better using Boot Camp/Windows instead of macOS.


THIS.

Especially because on those previous Macs the processor had the same architecture; now it's all emulated and it is still performing CRAZY well.
 

rnlval

Member
1. Setting aside the paternalistic definitions of what shader programs are or how hardware is nothing without software, this was a long detour you are taking on one claim about PVR2DC (that MS’s own DX docs on the DC covered): the amount of software/the complexity that is required to achieve the task. Batch your geometry into opaque+punch through and transparent and on the DC you were essentially done (the GPU per pixel transparency sorting and blending was as normal as any pass).


2. Leaving aside your showoffy detour on AVX (Zen 2 supporting both ISAs, just at half the throughput for 512-bit operations), the concurrent number of load and store instructions, dealing with transferring non-contiguous blocks of data (the scatter and gather part gives a clue on where the data is), the use of GDDR6 instead of LPDDR5, the much slower SSD I/O that APU has, etc… this is a desktop reference APU that has no active/working integrated GPU, but keep using it to try to knock what Apple achieved down a peg or two (not that I see silicon experts unimpressed or calling Apple's M1 Max BS, but then 🤷‍♂️).




I am not saying there are no professional tools where RT HW acceleration is useful (many of those are meant to run on Windows boxes or Linux farms), nor do we know the M1 Max's performance on those; just that it was a matter of prioritisation based on the target market for these chips (laptops) and the software they are meant to push to macOS customers. The large 57-billion-transistor die had different priorities than RT acceleration. It is not a must-have yet.

Will games with RT mode not have that mode shine on it? Sure, same as above… but the professional target market for these laptops and thus these chips was not looking at that either (and the performance of the apps that run will show, I would not be surprised that games that optimise around the Metal API version it supports would be quite performant too).
1. Your argument is a red herring relative to the hardware acceleration features of DirectX12 Hardware Feature Level 12_1.

“This enables Order Independent Transparency (OIT) algorithms to work”
“ROVs are an HLSL-only construct that applies different behavior semantics to UAVs”

You haven't proven anything.


“Rasterizer ordered views (ROVs) enable the underlying OIT algorithms to use features of the hardware to try to resolve the transparency order correctly. Transparency is handled by the pixel shader.”

Panajev2001a: I expect more hair splitting, slides, and papers quoted to make 5 different points and try to win arguments if others throw in the towel. Not sure why we keep replying to each other. You are uninterested in anything but trying to win “something” and that gets old after a while.


The pixel shader is the ROPS read/write path. LOL. The pixel shader is the programmable version of the old fixed-function pixel path. A pixel shader, also known as a fragment shader, is a program that dictates the color, brightness, contrast, and other characteristics of a single pixel (fragment).

Dreamcast's PowerVR GPU lacks programmable pixel shaders.

2. An ISA is meaningless without implementation details. Zen 1, Zen 2, and Zen 3 can all execute AVX2 with different performance results.

The AMD 4700S's Zen 2 AVX was benchmarked at about half of a normal Zen 2's AVX results. A gather instruction reduces multiple-load instruction usage, i.e. it improves code density.
 

rnlval

Member
Yeah, let's use non-native, less optimized games to draw conclusions.
Completely pointless.
Games run much better using Boot Camp/Windows instead of macOS.

Note that PC games' API calls are abstracted, and shaders are recompiled by an LLVM-based JIT for the GPU's ISA.

Without using a virtual CPU instruction set as a DRM method (i.e. encrypted code running on a virtual CPU with a custom instruction set, e.g. Denuvo VMProtect) and/or GPU vendor-specific graphics API extensions, the only "native" code in PC games is the CPU side.
 

rnlval

Member
Hopefully this will make Nvidia and AMD take power efficiency seriously again. Been a long time since anything viable was available in the board powered 75w range.
With NVIDIA, any perf/watt gain is spent on "more TFLOPS", more PolyMorph (geometry) engines, more RT cores, more Tensor cores, etc.

PCIe 5.0 allows up to a 600-watt power allocation.

Raytracing is a black hole for any increased TFLOPS.
 

rnlval

Member
I see this a lot, especially with the battery life and performance claims Apple makes in these presentations lately. People joke and then fail to show how Apple was untruthful… like with the iPhone 13 benchmarks, where it turned out the device performed even better than Apple said it would, but 🤷‍♂️.
From https://wccftech.com/m1-max-gpu-beats-amd-radeon-pro-w6900x-in-affinity/
Notice "Apple’s M1 Max GPU With 32 Cores Beats a $6000 AMD Radeon Pro W6900X in the Affinity Benchmark" headline.

The 12-core x86-based Mac Pro is an Intel Xeon W with PCIe 3.0, which gimps the Radeon Pro W6900X.

From https://www.apple.com/au/mac-pro/specs/
"Up to 3.4GB/s sequential read and 3.4GB/s sequential write performance". LOL

All of Apple's current Intel x86-based products are stuck on the obsolete PCIe 3.0 bus technology.

The PS5's raw SSD I/O beats that obsolete hardware.
 

rnlval

Member
For I/O, that AMD NAVI 21 card is constrained by Apple Mac Pro's PCIe 3.0 chipset from an old Intel Xeon W CPU.

From https://www.apple.com/au/mac-pro/specs/
"Up to 3.4GB/s sequential read and 3.4GB/s sequential write performance". LOL

That's old hardware.

The PS5's raw SSD I/O beats the Intel Xeon W's obsolete PCIe 3.0 I/O, and I'm not even factoring in the hardware decompression.

Intel Rocket Lake has PCIe 4.0.
Intel Alder Lake has PCIe 5.0.
 

rnlval

Member
ROVs allow you to modify the pipeline’s behaviour to handle OIT in software (shader code: https://docs.microsoft.com/en-us/windows/win32/direct3d11/rasterizer-order-views), which is different from the pixel sorting the DC did for opaque + punch-through geometry and transparent geometry ( https://docs.microsoft.com/en-us/previous-versions/ms834190(v=msdn.10) ). This was something they stopped doing on PC designs and future designs, but according to people who worked on its design it was not a big cost on the HW side at all.


… and it was quite a lot of years ago too.

Not really, you just choose to pick fights for the fun of it / to stand taller, maybe got the feeling of some console warring or someone saying something you disagree with and you start hounding people’s posts with slides and links barely addressing what they are saying until they stop posting.

You think TBDR’s are obsolete and Apple’s choice to stick with them is odd if not wrong, I do think they still for now have advantages (especially to reach similar levels of performance as consoles at a much lower power consumption, die size they are not doing too bad trade-offs wise as well considering it is a lot of GPU in there but tons of other HW and embedded caches making up the space). You can do a depth pre-pass and use early-z and Hierarchical Z buffers to reject geometry later to reduce overdraw but that will still burn extra power compared to processing geometry and rendering sorted triangles with invisible ones culled by the HW and the scene already binned in independent tiles (on chip tile bandwidth is easy to keep very high and Apple has been building extensions to do more steps without rendering temporary buffers out to main memory too, and the ability to do the MSAA resolve step before writing out the tile is another bandwidth win… you are free to look at a PC CPU+GPU SoC combo with that memory bandwidth and up to 64 GB of RAM on the package and feel smug, others appreciate the engineering challenge behind something like the M1 Max).
From https://wccftech.com/m1-max-gpu-beats-amd-radeon-pro-w6900x-in-affinity/

Instances where the M1 Max GPU came up short, was in previous gaming benchmark results, where it barely beat a 70W laptop RTX 3060 and was thrashed by a 100W RTX 3080. It also underperformed in the 8K Adobe Premiere Pro test, where it just managed to take the lead against the Surface Laptop Studio, which does not sport powerful specifications at all
 

Panajev2001a

GAF's Pleasant Genius
From https://wccftech.com/m1-max-gpu-beats-amd-radeon-pro-w6900x-in-affinity/

Instances where the M1 Max GPU came up short, was in previous gaming benchmark results, where it barely beat a 70W laptop RTX 3060 and was thrashed by a 100W RTX 3080. It also underperformed in the 8K Adobe Premiere Pro test, where it just managed to take the lead against the Surface Laptop Studio, which does not sport powerful specifications at all

Keep cherry-picking and quoting partial numbers; multiple reviewers are testing, and the overall power consumption of the Mac unit (I wonder if all are enabling the new macOS High Power mode setting, not the default btw) vs the laptops they test against is the strong suit of the MBPs…

Even from your link: “Apple never really intended these chips to be used for gaming, and to be honest, the energy efficiency of these chips is something that is unmatched by any notebook touting an RTX 3060 or RTX 3080.”

I would also give all vendors some time to update software (especially Adobe) to take advantage of the new ARM code. Interesting that you keep comparing an integrated GPU (M1) vs systems with dedicated GPUs. We will see the best performance : battery life ratio at the end of the day too, I guess ;). Anyways, this is not the attitude that Intel needs to win these kinds of contracts back, and do not expect Apple to lose money or sleep over this.
The big challenge will be the Mac Pro, where their own designs allow for multiple dedicated GPUs… I wonder if they will still allow them or how they plan to scale this design (they did not start from mobile to tablet to 13’’ MBP to larger MBPs for no reason).
 

DaGwaphics

Member
With NVIDIA, any perf/watt gain is spent on "more TFLOPS", more PolyMorph (geometry) engines, more RT cores, more Tensor cores, etc.

PCIe 5.0 allows up to a 600-watt power allocation.

Raytracing is a black hole for any increased TFLOPS.

Had absolutely no idea. It used to always be 75 W. In the future when we experience brownouts we'll know it's just the kids playing Fortnite. LOL

I can see HP and Dell shipping those systems out with 180 W PSUs.
 