Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining

Weimin Xiong1,2,†, Shuhao Gu2, Bowen Ye1,2, Zihao Yue2,4, Lei Li2,3, Feifan Song1, Sujian Li1,‡, Hao Tian2,‡

1State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
2LLM-Core, Xiaomi    3The University of Hong Kong    4Renmin University of China

†Contribution during internship at Xiaomi LLM-Core Team.
‡Co-corresponding authors.
Contact: wmxiong@pku.edu.cn, lisujian@pku.edu.cn

Accepted at ICML 2026.
Video2GUI Pipeline Overview

Abstract

Recent advances in multimodal large language models have driven growing interest in graphical user interface (GUI) agents, yet their generalization remains constrained by the scarcity of large-scale training data spanning diverse real-world applications. Existing datasets rely heavily on costly manual annotations and are typically confined to narrow domains. To address this challenge, we propose Video2GUI, a fully automated framework that extracts grounded GUI interaction trajectories directly from unlabeled Internet videos. Video2GUI employs a coarse-to-fine filtering strategy to identify high-quality GUI tutorial videos and converts them into structured agent trajectories. Applying this pipeline to 500 million video metadata entries, we construct WildGUI, a large-scale dataset containing 12 million interaction trajectories spanning over 1,500 applications and websites. Pre-training Qwen2.5-VL and Mimo-VL on WildGUI yields consistent improvements of 5–20% across multiple GUI grounding and action benchmarks, matching or surpassing state-of-the-art performance. We will release both the WildGUI dataset and the Video2GUI pipeline to support future research on GUI agents.

Method Overview

Video2GUI is a fully automated, three-stage pipeline that turns unlabeled Internet videos into grounded GUI interaction trajectories suitable for pretraining generalized GUI agents.

Video2GUI three-stage pipeline

  • (A) Coarse-to-Fine Video Filtering. Starting from 500M+ raw Internet videos, we first apply metadata-level filtering with a fine-tuned Qwen2.5-7B classifier (trained on DeepSeek-V3 annotations) to select ∼20M candidate videos. We then run a fine-grained, content-based scorer that rates instruction clarity, topic relevance, and screen-recording quality, yielding 4.2M high-quality tutorial videos (∼300k hours).
  • (B) Trajectory Extraction. Each video is split into ≤4-minute segments and processed by Gemini-3-Pro under a sliding-window strategy that carries historical context forward, producing instruction–trajectory pairs (u^(k), e^(k)) that record the task instruction, action timestamps, action details, and visually grounded low-level instructions; a minimal segmentation sketch follows this list.
  • (C) Action Spatial Grounding. For each interaction at timestamp t, we feed Gemini-3-Pro a triplet of high-resolution screenshots {o_{t-0.5s}, o_t, o_{t+0.5s}} together with the low-level instruction, and it predicts the precise grounding target b_t = (x1, y1, x2, y2); a corresponding grounding sketch also follows this list. Manual verification on 200 sampled actions shows that over 95% are correctly parameterized.
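
As a rough illustration of the segmentation step in (B), the Python sketch below splits a video into ≤4-minute windows and threads previously extracted actions into each subsequent call. The segment length constant and the extract_trajectory() placeholder are assumptions for illustration; the actual Gemini-3-Pro prompt format is not specified here.

# Sketch of the stage-(B) sliding window. SEGMENT_SECONDS and the
# extract_trajectory() placeholder are assumptions, not the paper's exact setup.

SEGMENT_SECONDS = 4 * 60  # each video is processed in <= 4-minute segments


def iter_segments(duration_s: float, segment_s: int = SEGMENT_SECONDS):
    """Yield (start, end) boundaries in seconds that cover the whole video."""
    start = 0.0
    while start < duration_s:
        end = min(start + segment_s, duration_s)
        yield start, end
        start = end


def extract_trajectory(video_path, start_s, end_s, context):
    """Placeholder for the VLM prompt described in stage (B): given one clip
    and the actions already extracted, return (timestamp, action, low-level
    instruction) tuples for that clip."""
    raise NotImplementedError("call a video-capable VLM here")


def process_video(video_path: str, duration_s: float):
    """Run the sliding window, carrying historical context between segments."""
    trajectory = []
    for start, end in iter_segments(duration_s):
        actions = extract_trajectory(video_path, start, end, context=trajectory)
        trajectory.extend(actions)
    return trajectory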

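The grounding step in (C) can likewise be sketched in a few lines, assuming OpenCV for frame extraction; query_vlm() below is a hypothetical stand-in for the Gemini-3-Pro call, and the regex only illustrates one way to parse a box from a text reply.

# Sketch of stage (C): sample frames around the action timestamp and ask a
# VLM for the target box. query_vlm() is a hypothetical placeholder.
import re

import cv2  # opencv-python, used here only to grab frames


def frames_around(video_path: str, t: float, offset: float = 0.5):
    """Return screenshots at t-offset, t, and t+offset (seconds)."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    for ts in (t - offset, t, t + offset):
        cap.set(cv2.CAP_PROP_POS_MSEC, max(ts, 0.0) * 1000.0)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames


def query_vlm(images, prompt):
    """Placeholder for the multimodal model call; expected to answer with a
    box such as '(x1, y1, x2, y2)' in pixel coordinates."""
    raise NotImplementedError("call a grounding-capable VLM here")


def ground_action(video_path: str, t: float, low_level_instruction: str):
    """Predict b_t = (x1, y1, x2, y2) for the interaction at timestamp t."""
    triplet = frames_around(video_path, t)
    answer = query_vlm(images=triplet, prompt=low_level_instruction)
    match = re.search(r"\((\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\)", answer)
    return tuple(map(int, match.groups())) if match else None
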
WildGUI Dataset

Applying Video2GUI to 500M video metadata entries yields WildGUI, the largest open-source GUI pre-training dataset to date. WildGUI offers comprehensive coverage across website, mobile, and desktop platforms, with over 12M trajectories and 124M images spanning more than 1,500 applications. The table below compares WildGUI with prior datasets in terms of platform coverage, scale, and instruction granularity; an illustrative trajectory record follows the table.

Comparison with existing GUI datasets
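
This page does not publish the exact WildGUI record format, but a single trajectory produced by the pipeline above plausibly bundles the fields it names. The record below is purely illustrative; all identifiers and values are made up.

# Illustrative only: field names and values are assumptions mirroring the
# outputs of stages (B) and (C), not the released WildGUI format.
example_trajectory = {
    "video_id": "tutorial_000123",      # hypothetical identifier
    "platform": "website",              # website / mobile / desktop
    "application": "example-spreadsheet",
    "instruction": "Export the current sheet as a PDF file",
    "actions": [
        {
            "timestamp_s": 12.4,
            "action": "click",
            "low_level_instruction": "Click the 'File' menu in the top-left corner",
            "bbox": [18, 42, 96, 74],   # (x1, y1, x2, y2) in pixels
        },
        # ... one entry per interaction in the video
    ],
}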

Experimental Results

GUI Grounding: ScreenSpot-Pro & OSWorld-G

We evaluate WildGUI by continually pre-training Qwen2.5-VL-7B and Mimo-VL-7B on our dataset and measuring their performance on ScreenSpot-Pro and OSWorld-G. After WildGUI pre-training, both base models match or surpass the best open-source GUI grounding models, with absolute gains of +15.7 / +26.4 on ScreenSpot-Pro / OSWorld-G for Qwen2.5-VL-7B, and +15.7 / +12.9 for Mimo-VL-7B.

Main results on ScreenSpot-Pro and OSWorld-G


Mobile GUI Action: AndroidControl & CAGUI

We further evaluate on the mobile-GUI action benchmarks AndroidControl and CAGUI. Continual pre-training on WildGUI delivers consistent improvements on both benchmarks, demonstrating that the trajectories synthesized from tutorial videos transfer beyond grounding to full action prediction across diverse mobile applications.

Results on AndroidControl and CAGUI benchmarks


Online Agent Evaluation: OSWorld & AndroidWorld

Beyond static grounding and action benchmarks, we evaluate the WildGUI-pretrained models in fully interactive online environments — OSWorld for desktop tasks and AndroidWorld for mobile tasks. WildGUI pre-training substantially boosts task success rate in both environments, validating the practical utility of our dataset for downstream agent deployment.

Online evaluation on OSWorld and AndroidWorld


Scaling Behavior

Performance scales smoothly with the amount of WildGUI pre-training data. As the number of trajectories grows, downstream accuracy continues to improve without saturation, underscoring the value of building large-scale GUI corpora and suggesting further gains are achievable as more videos are incorporated.

Impact of scaling pre-training data on performance

BibTeX

@inproceedings{
  xiong2026videogui,
  title={Video2{GUI}: Synthesizing Large-Scale Interaction Trajectories for Generalized {GUI} Agent Pretraining},
  author={Weimin Xiong and Shuhao Gu and Bowen Ye and Zihao Yue and Lei Li and Feifan Song and Sujian Li and Hao Tian},
  booktitle={Forty-third International Conference on Machine Learning},
  year={2026},
  url={https://openreview.net/forum?id=kYVjfc56RT}
}