Optimus-3: Towards Generalist Multimodal Minecraft Agents with Scalable Task Experts

Zaijing Li1,2, Yuquan Xie1, Rui Shao1✉, Gongwei Chen1,
Weili Guan1, Dongmei Jiang2, Liqiang Nie1✉
1Harbin Institute of Technology, Shenzhen    2Peng Cheng Laboratory, Shenzhen
✉ Corresponding author  

Optimus-3


Demonstration of Optimus-3’s capabilities as a generalist agent in Minecraft. It can perform long-horizon task planning, captioning, embodied QA, grounding, low-level action generation, and reflection in an interactive manner. All of these capabilities are seamlessly integrated into a unified end-to-end architecture, enabling robust and coherent performance across diverse task scenarios.

Abstract

Recently, agents based on multimodal large language models (MLLMs) have achieved remarkable progress across various domains. However, building a generalist agent with capabilities such as perception, planning, action, grounding, and reflection in open-world environments like Minecraft remains challenging due to three key obstacles: insufficient domain-specific data, interference among heterogeneous tasks, and the visual diversity of open-world settings. In this paper, we address these challenges through three key contributions. (1) We propose a knowledge-enhanced data generation pipeline that provides scalable, high-quality training data for agent development. (2) To mitigate interference among heterogeneous tasks, we introduce a Mixture-of-Experts (MoE) architecture with task-level routing. (3) We develop a Multimodal Reasoning-Augmented Reinforcement Learning approach to strengthen the agent's reasoning under the visual diversity of Minecraft. Built upon these innovations, we present Optimus-3, a general-purpose agent for Minecraft. Extensive experimental results demonstrate that Optimus-3 surpasses both generalist multimodal large language models and existing state-of-the-art agents across a wide range of tasks in the Minecraft environment.

Data Generation Pipeline


Given a task pool, we utilize a knowledge graph to generate task plans, forming the planning dataset. These plans are then used as instructions for STEVE-1, which interacts with the environment to produce the action dataset. During this process, we randomly sample images and employ expert models with environmental feedback to generate the captioning, embodied QA, and grounding datasets.
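The sketch below illustrates how such a pipeline could be organized: a knowledge graph is expanded into sub-goal plans, STEVE-1 rollouts supply action trajectories, and expert models label sampled frames. All names here (TASK_GRAPH, steve1_rollout, expert_annotate, and the frame-sampling rate) are illustrative placeholders rather than the authors' released implementation.

```python
# Minimal sketch of the knowledge-enhanced data generation pipeline, assuming
# a simple prerequisite-style knowledge graph and stubbed STEVE-1 / expert models.
import random
from dataclasses import dataclass, field

@dataclass
class Sample:
    task: str
    plan: list                                        # sub-goal sequence from the knowledge graph
    trajectory: list = field(default_factory=list)    # (obs, action) pairs produced by STEVE-1
    annotations: dict = field(default_factory=dict)   # captioning / embodied QA / grounding labels

# Hypothetical crafting knowledge graph: item -> prerequisite items.
TASK_GRAPH = {
    "wooden_pickaxe": ["planks", "stick", "crafting_table"],
    "stone_pickaxe": ["wooden_pickaxe", "cobblestone"],
    "iron_pickaxe": ["stone_pickaxe", "iron_ingot"],
}

def generate_plan(task, graph):
    """Expand prerequisites depth-first into an ordered sub-goal list (planning data)."""
    plan = []
    def expand(node):
        for dep in graph.get(node, []):
            expand(dep)
        if node not in plan:
            plan.append(node)
    expand(task)
    return plan

def steve1_rollout(sub_goal):
    """Placeholder for STEVE-1 executing a sub-goal in the environment (action data)."""
    return [({"frame": f"{sub_goal}_{t}"}, {"action": "noop"}) for t in range(3)]

def expert_annotate(obs):
    """Placeholder for expert models plus environmental feedback producing labels."""
    return {"caption": f"A view of {obs['frame']}", "qa": [], "grounding": []}

def build_dataset(task_pool, graph, frame_sample_rate=0.3):
    dataset = []
    for task in task_pool:
        sample = Sample(task=task, plan=generate_plan(task, graph))
        for sub_goal in sample.plan:
            traj = steve1_rollout(sub_goal)
            sample.trajectory.extend(traj)
            for obs, _ in traj:                        # randomly sample frames for perception labels
                if random.random() < frame_sample_rate:
                    sample.annotations[obs["frame"]] = expert_annotate(obs)
        dataset.append(sample)
    return dataset

if __name__ == "__main__":
    data = build_dataset(["iron_pickaxe"], TASK_GRAPH)
    print(data[0].plan)   # ['planks', 'stick', 'crafting_table', 'wooden_pickaxe', ...]
```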

Overview framework


A: The architecture of Optimus-3, which includes a task router that selects a specific task expert for each query, a ViT for visual encoding, and a MoE LLM for generating responses and low-level actions. Given a long-horizon task, it can generate a feasible plan and then execute the sub-goals sequentially. B: The proposed Multimodal Reasoning-Augmented Reinforcement Learning effectively enhances the agent's performance. C: Performance comparison of Optimus-3 against current task-specific SOTA agents, GPT-4o, and the original backbone Qwen2.5-VL.
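To make the task-level routing concrete, the following PyTorch sketch shows one way such a design could look: a lightweight router classifies the instruction into a task type once per query, and every token is then processed by that task's dedicated FFN expert while attention layers remain shared. The module names, task list, and dimensions are assumptions for illustration, not the released Optimus-3 code.

```python
# Minimal sketch of task-level expert routing in a MoE layer (illustrative only).
import torch
import torch.nn as nn

TASKS = ["planning", "perception", "action", "grounding", "reflection"]

class TaskRoutedMoELayer(nn.Module):
    """One FFN expert per task; the routing decision is made once per query."""
    def __init__(self, hidden_dim=512, ffn_dim=2048, num_tasks=len(TASKS)):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden_dim, ffn_dim), nn.GELU(),
                          nn.Linear(ffn_dim, hidden_dim))
            for _ in range(num_tasks)
        ])

    def forward(self, hidden_states, task_id):
        # Unlike token-level MoE, all tokens of a query share the same expert,
        # so heterogeneous tasks do not interfere inside one FFN.
        return self.experts[task_id](hidden_states)

class TaskRouter(nn.Module):
    """Classifies the instruction into a task type before generation starts."""
    def __init__(self, hidden_dim=512, num_tasks=len(TASKS)):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, num_tasks)

    def forward(self, instruction_embedding):
        logits = self.classifier(instruction_embedding)
        return logits.argmax(dim=-1)   # one routing decision per query

if __name__ == "__main__":
    router, layer = TaskRouter(), TaskRoutedMoELayer()
    instr = torch.randn(1, 512)        # pooled embedding of the user instruction
    tokens = torch.randn(1, 32, 512)   # hidden states of the query tokens
    task_id = router(instr).item()
    out = layer(tokens, task_id)
    print(TASKS[task_id], out.shape)   # e.g. "planning" torch.Size([1, 32, 512])
```

Routing at the task level rather than the token level keeps the per-query compute path stable, which is one plausible way to reduce interference among heterogeneous tasks while still sharing the visual encoder and attention layers.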

Experiment

Main results of Optimus-3 on long-horizon tasks, planning, captioning, embodied QA, grounding, and reflection.

Table 1: Main results of Optimus-3 on long-horizon tasks.


Table 2: Main results of Optimus-3 on planning, captioning, embodied QA, grounding, and reflection.


Conclusion

We introduce Optimus-3, a generalist agent endowed with comprehensive capabilities in perception, planning, action, and reflection in Minecraft. We propose a knowledge-enhanced data generation pipeline to support agent training, a task-level routing MoE to address interference among heterogeneous tasks, and a multimodal reasoning-augmented reinforcement learning method to improve performance on vision-related tasks. Extensive experimental results demonstrate that Optimus-3 marks a significant step toward building a generalist agent in Minecraft.