Skimmed the paper, but I don’t see the part where the game engine was actually being played. They trained an “agent” to play Doom using ViZDoom, and trained the diffusion model on the agent’s “trajectories”. But I didn’t see anything about feeding the diffusion model’s output back to the agent for its gameplay, or about the diffusion model reacting to live input.
It seems like it was able to generate Doom video from a given trajectory, and they assume that trajectory could just as well be real-time human input? That’s the best I can come up with. And the experiment was just some people watching video clips, which doesn’t track with the claims at all.
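To make the distinction I'm drawing concrete, here's a toy sketch. None of it is from the paper: `toy_next_frame` is a made-up stand-in for the diffusion model, and frames and actions are plain integers. The open-loop version replays a recorded trajectory; the closed-loop version picks each action by looking at the frame the model just generated, which is what I'd call "playing".

```python
from typing import Callable, List

Frame = int   # hypothetical stand-in for an image
Action = int  # hypothetical stand-in for a key press


def toy_next_frame(history: List[Frame], action: Action) -> Frame:
    """Made-up stand-in for the model: next frame from past frames plus an action."""
    return (history[-1] * 31 + action + 7) % 256


def open_loop(first: Frame, recorded: List[Action]) -> List[Frame]:
    """Generate frames from a pre-recorded trajectory; nothing reacts to the output."""
    frames = [first]
    for action in recorded:
        frames.append(toy_next_frame(frames, action))
    return frames


def closed_loop(first: Frame, policy: Callable[[Frame], Action], steps: int) -> List[Frame]:
    """Generate frames where each action depends on the frame the model just produced."""
    frames = [first]
    for _ in range(steps):
        action = policy(frames[-1])  # the player (or agent) sees the generated frame
        frames.append(toy_next_frame(frames, action))
    return frames


if __name__ == "__main__":
    print(open_loop(0, [1, 2, 3]))
    print(closed_loop(0, lambda frame: frame % 3, 3))
```

The only difference is where the actions come from, but that's the whole question: in the second loop, errors in a generated frame change the next action, and I didn't see that case evaluated.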