How do you think it will play out in the PC space?
Will it just continue being included in GPUs?
Well, if I had the means to engineer a PC myself, I'd like to see something like OMI-XSR and CXL take over: the former for processor-to-processor (and compute-device-to-compute-device) interconnects, the latter for storage and memory devices. OMI-XSR gives you very low latency and high bandwidth (each link/lane is 64 GB/s), while CXL, 3.0 in particular, could allow for reversed memory buffers as well as dynamic partitioning of a peripheral device's memory into segments accessed by multiple hosts simultaneously (CXL 2.0 already allows the latter, with up to 16 hosts per single peripheral device, "hosts" being processors or cores of what some refer to as "master" processors).
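To make that pooling idea concrete, here's a toy model of a device whose capacity gets dynamically carved up among hosts, capped at 16 as in CXL 2.0. The class, method names, and sizes are all my own illustrative assumptions, not anything from the CXL spec:

```python
# Toy model of a pooled memory device whose capacity is dynamically
# partitioned among multiple hosts (up to 16, as in CXL 2.0 pooling).
# All names and sizes are illustrative, not from the CXL spec.

class PooledMemoryDevice:
    MAX_HOSTS = 16

    def __init__(self, capacity_gb):
        self.capacity_gb = capacity_gb
        self.partitions = {}  # host_id -> GB currently assigned

    def free_gb(self):
        return self.capacity_gb - sum(self.partitions.values())

    def assign(self, host_id, size_gb):
        """Carve out (or grow) a partition for a host."""
        if host_id not in self.partitions and len(self.partitions) >= self.MAX_HOSTS:
            raise RuntimeError("device already serves 16 hosts")
        if size_gb > self.free_gb():
            raise RuntimeError("not enough unassigned capacity")
        self.partitions[host_id] = self.partitions.get(host_id, 0) + size_gb

    def release(self, host_id):
        """Return a host's partition to the free pool."""
        return self.partitions.pop(host_id, 0)

device = PooledMemoryDevice(capacity_gb=256)
device.assign("host-A", 96)
device.assign("host-B", 64)
device.release("host-A")      # host-A's 96 GB go back to the pool...
device.assign("host-C", 128)  # ...and can be handed to another host
print(device.free_gb())       # 64
```

The point of the toy is just the lifecycle: capacity isn't statically soldered to one host, it gets assigned, released, and re-assigned at runtime.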
The former is best suited to dynamic random-access memories (OMI has lower latency than HBM and DDR, let alone GDDR), and its logic can be built into the logic layers of active interposers; the latter is potentially the best solution for storage-class devices. Both are used in enterprise and cluster computing environments, but nothing on the consumer electronics side of things so far. I do think memories like GDDR are approaching EOL, and the necessity of NUMA designs for PCs, due to PCIe's limitations as an interconnect in terms of cache coherency (which also means devices on the bus can't share the same TLBs or virtual address tables, so a lot of that has to be duplicated), holds back potential performance (things like AMD's SAM, leveraging PCIe's resizable BAR feature, are a good help, though).
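A rough back-of-envelope for why that duplication hurts: without coherency, the device can't walk the host's page tables, so the working set gets staged wholesale into a second device-local copy; with a coherent fabric, one mapping can be pulled at cacheline granularity on demand. The cacheline size and the 10%-touched figure below are assumptions I picked for illustration:

```python
# Bytes crossing the bus just to stage a working set for a device:
# non-coherent (full copy) vs coherent (demand-fetched cachelines).
# Purely illustrative model; real traffic depends on access patterns.

def staged_bytes_noncoherent(working_set_bytes):
    # No coherency over the bus: the data is copied into a second,
    # device-local allocation, so the whole set crosses the bus once.
    return working_set_bytes

def staged_bytes_coherent(working_set_bytes, cacheline=64, touched_fraction=0.1):
    # Coherent fabric: the device maps the same pages and pulls only
    # the cachelines it actually touches (assumed 10% here).
    touched = int(working_set_bytes * touched_fraction)
    return (touched // cacheline) * cacheline

ws = 8 * 2**30  # an 8 GiB working set, picked arbitrarily
ratio = staged_bytes_noncoherent(ws) / staged_bytes_coherent(ws)
print(f"{ratio:.1f}x less staging traffic when coherent")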
I'd like to think a centralized, upgradable memory pool that can serve both CPU and GPU is the future, but you'd have to move on to HBM memories for that. And from there, design a socketed interposer of some kind, sort of like how CPU sockets function, but for memory modules. I don't know enough on that from an engineering POV though, clearly. As you can see most of this was focused on memory because, well, things like AI accelerated dedicated hardware units can be implemented any number of ways into the future, and probably integrated into the CPUs and GPUs, but arithmetic performance is "free", relatively speaking, compared to the sheer magnitudes of more energy required to access data in memory over the bus.
Personally I'm a lot more interested in future Processing-In-Memory or Processing-Near-Memory accelerated logic built into memory controllers as close to the memory itself, and what that can bring for performance gains. Since node shrinks might not come as quickly as planned (and probably won't realistically get under 2nm) and perf gains per shrink are getting smaller, perf gains will mainly have to come from how future designs handle data locality and memory accessing schemes, how they store and organize data and how that data is distributed among the system, etc. And that's combined with a larger shift into chiplet-based designs, new features of interconnect technologies enabling new things, etc.