NVIDIA's research team has announced NitroGen, a new artificial intelligence model that learns to play video games directly from screen pixels. Trained on 40,000 hours of gameplay data, the system has been released as open source.
NitroGen is a vision-to-action foundation model: it takes the game screen as input and outputs gamepad commands. The model does not use reinforcement learning; it learns purely through behavior cloning from human gameplay videos.
The model contains 493 million parameters and processes 256×256-pixel RGB images. Each inference step produces a 16-timestep action sequence.
Training Data
The NVIDIA team collected 71,000 hours of raw gameplay video from the internet. These videos include on-screen controller overlays from streamers, which show the inputs being pressed. Quality filtering left 40,000 hours of usable data.
The dataset consists of 38,739 videos from 818 content creators and covers more than 1,000 different games: 91 games have more than 100 hours of data, and 15 have more than 1,000 hours of footage.
The action RPG genre forms the largest portion at 34.9 percent, followed by platformers at 18.4 percent and action-adventure at 9.2 percent.
Technical Architecture
NitroGen consists of two main components. The first is a SigLIP 2 vision transformer, which converts RGB frames into 256 image tokens.
The second is a diffusion transformer trained with conditional flow matching. The system applies a 16-step denoising process.
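Conditional flow matching generates an action sequence by starting from noise and integrating a learned velocity field. A minimal sketch of the 16-step denoising loop as Euler integration is shown below; `velocity_model` and its signature stand in for the diffusion transformer and are assumptions, not NitroGen's API.

```python
import numpy as np

def denoise_actions(velocity_model, image_tokens, n_steps=16, shape=(16, 21)):
    """Euler integration of a learned velocity field (conditional flow
    matching): start from Gaussian noise at t=0 and step toward the
    action distribution at t=1.

    `velocity_model(x, t, image_tokens)` is a hypothetical stand-in for
    the diffusion transformer conditioned on the 256 image tokens.
    """
    rng = np.random.default_rng(0)
    x = rng.standard_normal(shape)   # pure noise at t=0
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + dt * velocity_model(x, t, image_tokens)  # one Euler step
    return x
```

With a well-trained velocity field, the returned tensor is a denoised 16-timestep action sequence.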
The output is a 21×16 tensor: 17 dimensions are allocated to binary button states, and 4 dimensions encode the coordinates of the two joysticks.
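Splitting that tensor back into gamepad signals can be sketched as below. The (timesteps, dims) layout, the 0.5 threshold, and the stick ordering are illustrative assumptions; only the 17-button/4-axis split comes from the article.

```python
import numpy as np

def decode_actions(tensor):
    """Split a (16, 21) action tensor into gamepad signals.

    Per the article, each 21-dim action vector holds 17 binary button
    states plus 4 joystick coordinates (two sticks, x and y). The layout
    and threshold here are assumptions for illustration.
    """
    buttons = tensor[:, :17] > 0.5     # thresholded binary button presses
    left_stick = tensor[:, 17:19]      # assumed (x, y) of the left stick
    right_stick = tensor[:, 19:21]     # assumed (x, y) of the right stick
    return buttons, left_stick, right_stick
```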
Action Extraction System
A three-stage pipeline was developed to extract action labels from the videos. The first stage detects the controller overlay using template matching; the second uses a SegFormer-based model to extract button and joystick states; the final stage applies position correction and filtering.
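Two of those stages are simple enough to sketch. Below is a naive template matcher (stage 1) and a temporal majority filter (stage 3); stage 2 needs a trained segmentation model and is omitted. Both functions are illustrative assumptions, not NVIDIA's implementation — a real pipeline would use something like `cv2.matchTemplate` for speed.

```python
import numpy as np

def match_overlay(frame, template):
    """Stage 1 (sketch): slide the template over a grayscale frame and
    return the top-left corner with the smallest sum of absolute
    differences, i.e. the likely controller-overlay position."""
    fh, fw = frame.shape
    th, tw = template.shape
    best, best_pos = np.inf, (0, 0)
    for y in range(fh - th + 1):
        for x in range(fw - tw + 1):
            score = np.abs(frame[y:y + th, x:x + tw] - template).sum()
            if score < best:
                best, best_pos = score, (y, x)
    return best_pos

def majority_filter(presses, window=3):
    """Stage 3 (sketch): smooth per-frame button predictions with a
    temporal majority vote to suppress single-frame flicker."""
    presses = np.asarray(presses)
    half = window // 2
    out = presses.copy()
    for i in range(len(presses)):
        lo, hi = max(0, i - half), min(len(presses), i + half + 1)
        out[i] = presses[lo:hi].sum() > (hi - lo) / 2
    return out
```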
Joystick predictions reach an average R² of 0.84, and button accuracy stands at 0.96. Xbox and PlayStation controller families are supported.
Performance Results
NVIDIA also developed a universal simulator that wraps commercial Windows games in a Gymnasium interface, enabling frame-by-frame interaction without modifying game code.
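The interface such a wrapper exposes can be sketched as a standard `reset`/`step` loop. The real simulator would subclass `gymnasium.Env`; in this minimal sketch, screen capture and gamepad injection are stubbed out, and all names and shapes beyond the 256×256 RGB observation and 21-dim action are assumptions.

```python
import numpy as np

class GameEnv:
    """Gymnasium-style wrapper sketch around a running game.

    A real implementation would subclass gymnasium.Env and back
    _capture() with a screen grabber and _send_gamepad() with a virtual
    gamepad driver; both are stubs here.
    """
    OBS_SHAPE = (256, 256, 3)   # RGB screen capture fed to the model
    ACT_DIM = 21                # 17 buttons + 4 joystick axes

    def reset(self, seed=None):
        return self._capture(), {}

    def step(self, action):
        assert action.shape == (self.ACT_DIM,)
        self._send_gamepad(action)   # stub: inject controller input
        obs = self._capture()        # stub: grab the next frame
        reward, terminated, truncated = 0.0, False, False
        return obs, reward, terminated, truncated, {}

    def _capture(self):
        return np.zeros(self.OBS_SHAPE, dtype=np.uint8)

    def _send_gamepad(self, action):
        pass
```

This mirrors the Gymnasium 5-tuple `step` contract, so an agent loop written against `gymnasium.Env` would drive it unchanged.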
Evaluation was conducted on 10 commercial games across 30 tasks: 11 combat tasks, 10 navigation tasks, and 9 game-specific tasks.
Zero-shot evaluation showed average completion rates of 45–60 percent, a rate that remained consistent across 2D and 3D games.
Transfer learning showed significant gains: an isometric roguelike achieved a 10 percent relative improvement, a 3D action RPG showed a 25 percent average gain, and in the low-data regime some tasks improved by up to 52 percent.
Use Cases
NVIDIA envisions three main application areas: next-generation game AI, automated QA testing for video games, and general-purpose embodied AI research.
The model performs best on gamepad-controlled games, succeeding in action, platformer, and racing genres. It is less effective on keyboard-and-mouse games such as strategy titles and MOBAs.
Access and Requirements
The NitroGen model and dataset are available on Hugging Face. The model was tested on an NVIDIA H100 GPU and is compatible with the Blackwell and Hopper architectures.
The project was released for research and development purposes; commercial use requires additional validation. It is distributed under an NVIDIA license.

