Anything this game does can be done by having two similar environments in the same 3D space and forcing synchronization on a lot of pairs of objects. If they did that, there is no reason for it to run this badly.
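To make the "two environments, synced pairs" idea concrete, here's a toy sketch in Python. Everything in it (`Obj`, `DualWorld`, the fixed `offset`, the `_a`/`_b` suffixes) is made up for illustration; I have no idea how the actual game structures this, and a real engine would sync far more than positions.

```python
from dataclasses import dataclass, field


@dataclass
class Obj:
    """A hypothetical scene object; only its position is tracked here."""
    name: str
    pos: tuple = (0.0, 0.0, 0.0)


@dataclass
class DualWorld:
    """Two copies of a level in one 3D space, the second shifted by `offset`.

    The offset is an assumption made for this sketch, not something known
    about the game.
    """
    offset: tuple = (1000.0, 0.0, 0.0)
    pairs: list = field(default_factory=list)

    def _shift(self, pos):
        # Map a position in world A to the matching spot in world B.
        return tuple(p + o for p, o in zip(pos, self.offset))

    def add_pair(self, name, pos):
        # Create one object per world and remember that they belong together.
        a = Obj(name + "_a", pos)
        b = Obj(name + "_b", self._shift(pos))
        self.pairs.append((a, b))
        return a, b

    def sync(self):
        # Force every world-B object to follow its world-A partner.
        for a, b in self.pairs:
            b.pos = self._shift(a.pos)


world = DualWorld()
player_a, player_b = world.add_pair("player", (1.0, 2.0, 3.0))
player_a.pos = (2.0, 2.0, 3.0)  # the player moves in world A
world.sync()                    # world B catches up on the same frame
```

The sync step itself is cheap. Whatever makes the game heavy, it's presumably not this part.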
Split-screen is one thing. (Usually with split-screen you have roughly the same assets and geometry, since it's typically for multiplayer and everyone is on the same track or level; this is unusual in that the geometry is mostly the same but it's textured and lit completely differently for each version of the scene.) However, there are parts of the game that literally "blink" in and out of the two worlds, at full frame, in fractions of a second. I'm not a game developer, but I've seen game designers turn graphical elements on and off, and it's not that fast.
Turn a grass shader on or off, or pop a big tree prop in and out, and you'll see the tool take a beat to load the thing. Then it shows up on screen, but maybe without exactly the right mipmaps for that viewpoint, and with the shadows not quite right even on the baked elements already present. Then there's another beat as the game catches up to the differences in the scene, and everything's smooth once those first few rough passes tell the system it's good to go. Once it's on screen, you can move the sliders up and down and play with the object/effect and the engine keeps running smoothly, but until the system knows what it's going to be doing with what's in the scene, it has to take a few passes to get everything in sync and in the groove.
(Maybe that's just an unoptimized instance of the engine, and it'd be different if swapping those was done in finalized, compiled game code? I feel like it's always chunky in games too though, whenever you make a major modification to a scene.)
So maybe, because Bloober needed absolutely instantaneous snaps between scenes, they really are running the game scenes twice at full detail (optimized so the game looks as good as it can while always keeping two versions of a scene loaded at a time) instead of turning detail/effects on and off as needed? (They also have sequences where one world "takes over" the other, which could be done with a variety of tricks, but one approach would be the brute force of doing the scene twice.) And if that's how they're doing their full-frame sequences, maybe for their split-screen and other views of the two worlds they just figured, why change the approach? Why not always run the game world twice, and then we can split it, we can blink between the two, we can fade/morph between them, we can do whatever is necessary depending on the mood we want for that gameplay sequence?
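Here's roughly what I mean by "always run it twice, then decide how to show it," as a toy Python sketch. This is my guess at the general shape, not the game's actual code: `compose` and the mode names are invented, and a "frame" here is just a list of brightness values standing in for a fully rendered image of each world.

```python
def compose(frame_a, frame_b, mode, t=0.0):
    """Combine two already-rendered frames into the one shown on screen.

    frame_a, frame_b: equal-length lists of floats (stand-ins for images).
    mode: "split", "blink", or "fade" (hypothetical names for this sketch).
    t: 0.0..1.0, how far toward world B we are (used by blink and fade).
    """
    if len(frame_a) != len(frame_b):
        raise ValueError("both worlds must render at the same resolution")
    if mode == "split":
        # First half of the screen from world A, second half from world B.
        half = len(frame_a) // 2
        return frame_a[:half] + frame_b[half:]
    if mode == "blink":
        # Hard cut: nothing to load, because both frames already exist.
        return list(frame_b if t >= 0.5 else frame_a)
    if mode == "fade":
        # Per-pixel linear blend between the two worlds.
        return [(1.0 - t) * a + t * b for a, b in zip(frame_a, frame_b)]
    raise ValueError(f"unknown mode: {mode}")


# Pretend both worlds were rendered this frame.
world_a = [0.0, 0.0, 0.0, 0.0]
world_b = [1.0, 1.0, 1.0, 1.0]

split_view = compose(world_a, world_b, "split")
blink_view = compose(world_a, world_b, "blink", t=0.7)
fade_view = compose(world_a, world_b, "fade", t=0.25)
```

The point of the sketch is that once both frames exist every frame, switching presentation is just a different way of picking pixels, so a blink costs nothing extra. The price is paid up front by rendering everything twice all the time.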