[ICLR 2025] GameGen-X: Open-World Game Generation with Diffusion Transformer and Interactive Control

A paper from a research collaboration between the Hong Kong University of Science and Technology (HKUST) Smart Lab and the University of Science and Technology of China (USTC) has been accepted at ICLR 2025, a top-tier venue in artificial intelligence. The paper introduces GameGen-X, the first open-world game video generation model to support interactive control. Through a two-stage training strategy, GameGen-X achieves high-quality game content generation together with dynamic control, making significant advances in character diversity, environmental interaction, and event simulation. The code and dataset have been open-sourced.


Research Background

Traditional game generation models struggle to balance open-world content diversity with interactive user control. Existing methods, such as GameNGen and DIAMOND, primarily simulate predefined game mechanics and cannot generate semantically coherent new content.


Figure 1: Illustration of Etched’s Oasis, a generative game concept.

GameGen-X is the first model to integrate a Diffusion Transformer with multimodal control signals, enabling dynamic generation of characters, environments, and events while supporting interactive control via keyboard commands, text prompts, and video guidance.

It also supports video continuation: given the current video segment, it predicts or modifies the content that follows in response to user input, effectively simulating gameplay.


Methodology

OGameData: The First Open-World Video Game Dataset


Figure 2: Visualization of OGameData dataset construction process.

Dataset Overview

To advance interactive control in game generation, the research team constructed OGameData, the first large-scale dataset specifically designed for video game generation and interactive control.

  • Scale & Diversity:
    • Contains 1 million high-resolution video clips, extracted from gameplay recordings that range from a few minutes to several hours in length.
    • Collected from 150+ next-generation games, covering game metadata, player perspectives, and character details.
  • High-Quality Annotations:
    • Curated by 10+ human experts, who scored, filtered, ranked, and structured the data.
    • GPT-4o-assisted annotations ensure clarity and coherence while eliminating UI artifacts and visual noise (a minimal sketch of this step follows Table 1).
  • Comparison with Other Datasets:
    • As shown in Table 1, OGameData surpasses MiraData, the latest open-domain generation dataset, offering twice the annotation density per unit of video.

Table 1: Comparison of OGameData vs. recent game video datasets.

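To make the GPT-4o-assisted annotation step concrete, below is a minimal sketch of how a single gameplay frame might be captioned through the OpenAI API. The prompt wording and the caption_frame helper are hypothetical illustrations; the actual OGameData pipeline (expert scoring, filtering, and structured captions) is more elaborate than this.

```python
# Hypothetical sketch of GPT-4o-assisted frame captioning; the prompt and
# helper name are illustrative, not the paper's actual annotation code.
import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def caption_frame(image_path: str, game_name: str) -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Describe this {game_name} gameplay frame: scene, "
                         "character, action, and camera movement. "
                         "Ignore any UI overlays."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```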


GameGen-X Model Architecture


Figure 3: GameGen-X model structure, highlighting MSDiT and InstructNet.

3D-VAE: Spatiotemporal Variational Autoencoder

To address spatiotemporal redundancy, GameGen-X incorporates a 3D Variational Autoencoder (3D-VAE) to compress video segments into latent representations:

  • Spatial Downsampling: Extracts frame-level latent features to reduce computational cost.
  • Temporal Aggregation: Captures temporal dependencies, minimizing frame redundancy.
  • Latent Tensor Representation: Supports long videos and high-resolution training, meeting the demands of game content generation.
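As a rough illustration of the three mechanisms above, the following PyTorch sketch encodes a video tensor into a spatiotemporally downsampled latent. The channel counts, downsampling factors, and layer choices are assumptions for illustration, not the paper's exact 3D-VAE configuration.

```python
# Illustrative 3D-VAE-style encoder: spatial downsampling, temporal
# aggregation, and a sampled latent tensor. All sizes are assumptions.
import torch
import torch.nn as nn

class Video3DVAEEncoder(nn.Module):
    def __init__(self, in_ch=3, latent_ch=4, hidden=64):
        super().__init__()
        # Spatial downsampling: stride 2 in H and W only (4x total here).
        self.spatial = nn.Sequential(
            nn.Conv3d(in_ch, hidden, (1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
            nn.SiLU(),
            nn.Conv3d(hidden, hidden, (1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
            nn.SiLU(),
        )
        # Temporal aggregation: stride 2 in T to reduce frame redundancy.
        self.temporal = nn.Conv3d(hidden, hidden, (3, 1, 1), stride=(2, 1, 1), padding=(1, 0, 0))
        # Project to the mean and log-variance of the latent distribution.
        self.to_moments = nn.Conv3d(hidden, 2 * latent_ch, 1)

    def forward(self, video):  # video: (B, C, T, H, W)
        h = self.temporal(self.spatial(video))
        mean, logvar = self.to_moments(h).chunk(2, dim=1)
        # Reparameterization trick: z ~ N(mean, exp(logvar)).
        z = mean + torch.randn_like(mean) * torch.exp(0.5 * logvar)
        return z  # compressed latent, roughly (B, 4, T/2, H/4, W/4)
```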

MSDiT: Masked Spatiotemporal Diffusion Transformer

GameGen-X introduces Masked Spatiotemporal Diffusion Transformer (MSDiT), integrating three attention mechanisms for high-quality game video generation:

  1. Spatial Attention: Enhances intra-frame relationships for detailed spatial consistency.
  2. Temporal Attention: Captures inter-frame dependencies, ensuring smooth frame transitions.
  3. Cross Attention: Incorporates T5-based text features, aligning video generation with semantic prompts.

Additionally, a masking mechanism exempts designated context frames from noising, so that denoising focuses only on the frames to be generated; this is what enables video continuation.
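To show how the three attention types can be arranged, here is a single MSDiT-style block in PyTorch. The pre-norm ordering and dimensions are illustrative assumptions; the frame-masking logic, which would keep clean context frames out of the noising step, is omitted for brevity.

```python
# Illustrative MSDiT-style block: spatial, temporal, and cross attention.
import torch
import torch.nn as nn

class MSDiTBlock(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.norm3, self.norm4 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, text):
        # x: (B, T, S, D) video latent tokens; text: (B, L, D) T5 text features
        B, T, S, D = x.shape
        # 1. Spatial attention: tokens within the same frame attend to each other.
        h = x.reshape(B * T, S, D)
        h = h + self.spatial_attn(self.norm1(h), self.norm1(h), self.norm1(h))[0]
        # 2. Temporal attention: each spatial location attends across frames.
        h = h.reshape(B, T, S, D).transpose(1, 2).reshape(B * S, T, D)
        h = h + self.temporal_attn(self.norm2(h), self.norm2(h), self.norm2(h))[0]
        # 3. Cross attention: video tokens attend to the text prompt features.
        h = h.reshape(B, S, T, D).transpose(1, 2).reshape(B, T * S, D)
        h = h + self.cross_attn(self.norm3(h), text, text)[0]
        h = h + self.mlp(self.norm4(h))
        return h.reshape(B, T, S, D)
```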

InstructNet: Interactive Control Module

GameGen-X leverages InstructNet for instruction fine-tuning, enabling user-interactive control:

  • Composed of N InstructNet modules, each integrating:
    1. Operation-Integrated Expert Layers
    2. Instruction-Integrated Expert Layers
  • Functionality:
    • Injects instruction features to modulate latent representations, ensuring alignment with user intent.
    • Supports character actions and environmental dynamics control.
    • Is trained on continuous video sequences to simulate game mechanics and feedback loops.
    • Introduces Gaussian noise in initial frames to mitigate error accumulation.
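A minimal sketch of the injection idea described above: operation and instruction features modulate the latent tokens via a scale-and-shift, and a residual path leaves the frozen base features intact. The fusion scheme and dimensions are assumptions, not the paper's exact expert-layer design.

```python
# Illustrative InstructNet-style block; the scale-and-shift fusion is an
# assumption standing in for the paper's operation/instruction expert layers.
import torch
import torch.nn as nn

class InstructNetBlock(nn.Module):
    def __init__(self, dim=512, op_dim=32):
        super().__init__()
        self.op_proj = nn.Linear(op_dim, 2 * dim)   # keyboard-style operations
        self.instr_proj = nn.Linear(dim, 2 * dim)   # text instructions
        self.norm = nn.LayerNorm(dim)

    def forward(self, z, op, instr):
        # z: (B, N, D) latent tokens from the frozen base model
        # op: (B, op_dim) embedded operation signal (e.g., a keystroke)
        # instr: (B, L, D) instruction text features (mean-pooled below)
        scale_o, shift_o = self.op_proj(op).chunk(2, dim=-1)
        scale_i, shift_i = self.instr_proj(instr.mean(dim=1)).chunk(2, dim=-1)
        h = self.norm(z)
        h = h * (1 + scale_o.unsqueeze(1)) + shift_o.unsqueeze(1)
        h = h * (1 + scale_i.unsqueeze(1)) + shift_i.unsqueeze(1)
        return z + h  # residual injection keeps base-model features intact
```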

GameGen-X Model Training Strategy


Figure 4: GameGen-X two-stage training strategy.

GameGen-X training is divided into two stages:

  1. Base Model Pretraining
    • Trained on OGameData-GEN for text-to-video generation, enabling long-sequence, high-quality open-world game video synthesis.
  2. Instruction Fine-Tuning
    • InstructNet is introduced, incorporating game-related multimodal control experts.
    • The base model is frozen, and only OGameData-INS is used to update InstructNet, mapping user inputs (text commands, keyboard controls, etc.) to game content (see the sketch below).
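In code, stage 2 amounts to freezing every base-model parameter and handing the optimizer only InstructNet's parameters. The modules, loss, and data below are placeholders, not the actual training pipeline.

```python
# Placeholder sketch of the stage-2 setup: frozen base model, trainable InstructNet.
import torch
import torch.nn as nn

base_model = nn.Linear(512, 512)    # stand-in for the pretrained MSDiT backbone
instruct_net = nn.Linear(512, 512)  # stand-in for InstructNet

# Freeze the base model; only InstructNet receives gradients.
for p in base_model.parameters():
    p.requires_grad = False

optimizer = torch.optim.AdamW(instruct_net.parameters(), lr=1e-4)

# One illustrative step on a dummy OGameData-INS batch.
latents = torch.randn(8, 512)
with torch.no_grad():
    features = base_model(latents)           # frozen backbone features
pred = features + instruct_net(features)     # InstructNet modulates the features
loss = nn.functional.mse_loss(pred, torch.randn_like(pred))  # stand-in objective
loss.backward()
optimizer.step()
```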

Experimental Results

Evaluation Metrics

GameGen-X is evaluated using a comprehensive set of metrics, including:

  • Visual Quality:
    • FID (Fréchet Inception Distance)
    • FVD (Fréchet Video Distance)
    • IQ (Image Quality)
  • Text-Video Alignment:
    • TVA (Text-Video Alignment)
  • Interactive Control Performance:
    • SR (Success Rate)
      • SR-C (Character Action Success Rate)
      • SR-E (Environmental Event Success Rate)
  • Motion and Consistency:
    • MS (Motion Smoothness)
    • DD (Dynamism Degree)
    • SC (Subject Consistency)

Results Analysis

Quantitative Evaluation

Table 2: Comparison of GameGen-X vs. four leading open-source models (Mira, OpenSora-Plan 1.2, OpenSora 1.2, CogVideoX-5B).


  • GameGen-X leads on FID, FVD, TVA, MS, and SC, demonstrating superior quality and coherence in game video generation.

Table 3: Comparison of control accuracy across models.


  • GameGen-X achieves the highest SR-C and SR-E, excelling in interactive control and context-aware game content generation.

Qualitative Evaluation


Figure 5: GameGen-X diverse video generation examples (characters, environments, actions, and events).

  • GameGen-X surpasses other models in interactive control, effectively generating contextually appropriate and interactive game content.


Figure 6: GameGen-X interactive control examples (text commands & keyboard inputs).

  • GameGen-X demonstrates high control accuracy, effectively translating user inputs into game content.


Figure 7: Comparison of GameGen-X vs. open-source models in character detail, visual environments, and camera logic.

  • CogVideoX-5B (with its 8 fps output) and OpenSora 1.2 (with its frequent scene transitions) score higher on dynamism (DD), but GameGen-X strikes a better balance between visual quality and control accuracy.


Figure 8: Comparison of GameGen-X vs. commercial models (Kling, Pika, Runway, Luma, Tongyi).

  • GameGen-X outperforms commercial models in visual quality, text-video alignment, and interactive control, showcasing its superiority in open-world game content generation.

Conclusion

GameGen-X, powered by OGameData and a two-stage training strategy, successfully achieves high-quality open-world game content generation and interactive control, paving the way for automated game design.

This study suggests that future game development may shift toward data-driven, automated content creation, reducing manual effort and accelerating the evolution of intelligent open-world games. Despite remaining challenges, GameGen-X marks a significant step forward and lays a foundation for future research.


References

GameGen-X Official Website: https://gamegen-x.github.io/
ICLR 2025 Paper: https://openreview.net/forum?id=8VG8tpPZhe