Mirage-1: Augmenting and Updating GUI Agent with Hierarchical Multimodal Skills

Yuquan Xie1, Zaijing Li1,2, Rui Shao1✉, Gongwei Chen1,
Dongmei Jiang2, Kaiwen Zhou3, Yinchuan Li3, Liqiang Nie1✉
1Harbin Institute of Technology, Shenzhen    2Peng Cheng Laboratory    3Huawei Noah's Ark Lab
✉ Corresponding author  

Abstract

Recent efforts to leverage the Multi-modal Large Language Model (MLLM) as GUI agents have yielded promising outcomes. However, these agents still struggle with long-horizon tasks in online environments, primarily due to insufficient knowledge and the inherent gap between offline and online domains. In this paper, inspired by how humans generalize knowledge in open-ended environments, we propose a Hierarchical Multimodal Skills (HMS) module to tackle the issue of insufficient knowledge. It progressively abstracts trajectories into execution skills, core skills, and ultimately meta-skills, providing a hierarchical knowledge structure for long-horizon task planning. To bridge the domain gap, we propose the Skill-Augmented Monte Carlo Tree Search (SA-MCTS) algorithm, which efficiently leverages skills acquired in offline environments to reduce the action search space during online tree exploration. Building on HMS, we propose Mirage-1, a multimodal, cross-platform, plug-and-play GUI agent. To validate the performance of Mirage-1 in real-world long-horizon scenarios, we constructed a new benchmark, AndroidLH. Experimental results show that Mirage-1 outperforms previous agents by 32%, 19%, 15%, and 79% on AndroidWorld, MobileMiniWob++, Mind2Web-Live, and AndroidLH, respectively.
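The progressive abstraction described above (trajectories → execution skills → core skills → meta-skills) can be pictured as a small bottom-up data structure. The sketch below is purely illustrative: the class and function names (`ExecutionSkill`, `CoreSkill`, `MetaSkill`, `abstract_trajectory`) are our own assumptions, not the paper's implementation.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the three-level HMS hierarchy; names are ours.

@dataclass
class ExecutionSkill:
    """Concrete, replayable steps distilled from a successful trajectory."""
    name: str
    steps: list

@dataclass
class CoreSkill:
    """A reusable sub-goal grouping related execution skills."""
    name: str
    execution_skills: list = field(default_factory=list)

@dataclass
class MetaSkill:
    """The most abstract level: a family of core skills for a task domain."""
    name: str
    core_skills: list = field(default_factory=list)

def abstract_trajectory(trajectory_steps, skill_name):
    """Bottom-up abstraction: a raw trajectory becomes an execution skill."""
    return ExecutionSkill(name=skill_name, steps=list(trajectory_steps))

# Build a tiny hierarchy bottom-up, mirroring the progressive abstraction.
set_alarm = abstract_trajectory(
    ["open 'Clock'", "tap 'Alarm'", "tap '+'", "set 07:00", "tap 'Save'"],
    "set_alarm_7am",
)
alarm_core = CoreSkill("manage_alarms", [set_alarm])
clock_meta = MetaSkill("clock_app_tasks", [alarm_core])

print(clock_meta.name, "->", alarm_core.name, "->", set_alarm.name)
# -> clock_app_tasks -> manage_alarms -> set_alarm_7am
```

In this reading, each level widens the scope of reuse: execution skills replay concrete actions, core skills index them by sub-goal, and meta-skills organize whole task domains for planner retrieval.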

Overview Framework


The Mirage-1 framework comprises a Hierarchical Planner, an Operator, a Decision Reflector, and the Hierarchical Multimodal Skills (HMS) module. To bridge the offline-online domain gap, Skill-Augmented Monte Carlo Tree Search (SA-MCTS) is employed to explore unseen tasks, and successful trajectories are abstracted into new skills that expand HMS. The Hierarchical Planner retrieves Core Skills from HMS and decomposes task goals into sub-goals for the Operator to execute, while the Decision Reflector leverages Execution Skills to assess whether the ongoing execution remains feasible.
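The planner–operator–reflector loop above can be sketched as a minimal control loop. This is a toy stand-in under our own assumptions: `retrieve_core_skills`, `decompose`, and `reflector_ok` are hypothetical placeholders for the Hierarchical Planner, HMS retrieval, and Decision Reflector, and no real GUI actions are issued.

```python
# Minimal, self-contained sketch of the control loop described above.
# All function names and data shapes here are our own illustration.

def retrieve_core_skills(hms, task_goal):
    """Planner side: fetch core skills relevant to the task goal."""
    return [s for s in hms if s["goal_keyword"] in task_goal]

def decompose(task_goal, core_skills):
    """Split the goal into sub-goals, guided by retrieved core skills."""
    return [sub for s in core_skills for sub in s["sub_goals"]]

def reflector_ok(sub_goal, execution_skills):
    """Reflector side: check feasibility against known execution skills."""
    return sub_goal in execution_skills

def run_task(task_goal, hms, execution_skills):
    plan = decompose(task_goal, retrieve_core_skills(hms, task_goal))
    completed = []
    for sub_goal in plan:
        if not reflector_ok(sub_goal, execution_skills):
            break  # infeasible: the full agent would replan here
        completed.append(sub_goal)  # the Operator would act on the GUI here
    return completed

hms = [{"goal_keyword": "alarm", "sub_goals": ["open clock", "add alarm"]}]
print(run_task("set an alarm", hms, {"open clock", "add alarm"}))
# -> ['open clock', 'add alarm']
```

The key design point this loop illustrates is the separation of concerns: planning consults the core-skill level of HMS, while feasibility checking consults the execution-skill level, so each component reads the hierarchy at its own granularity.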

Experiment

Table 1: Performance comparison on AndroidWorld, MobileMiniWob++, and AndroidLH.


Conclusion

In this paper, we propose the Hierarchical Multimodal Skills (HMS) module, which addresses the challenge of insufficient prior knowledge in long-horizon task planning. To address the domain gap between offline and online environments, we further propose the Skill-Augmented Monte Carlo Tree Search (SA-MCTS) algorithm, which effectively utilizes offline-acquired skills to reduce the action search space during online tree exploration. On top of HMS, we build the multimodal agent Mirage-1. Experimental results demonstrate that Mirage-1 achieves superior performance compared to SOTA GUI agents, particularly on long-horizon tasks.