Optimus-3: Dual-Router Aligned Mixture-of-Experts Agent with Dual-Granularity Reasoning-Aware Policy Optimization

Zaijing Li1,2, Yuquan Xie1, Rui Shao1✉, Gongwei Chen1,
Weili Guan1, Dongmei Jiang2, Yaowei Wang2, Liqiang Nie1✉
1Harbin Institute of Technology, Shenzhen    2Peng Cheng Laboratory, Shenzhen
✉ Corresponding author  

Optimus-3


Given the task "Craft a diamond sword based on the current inventory", Optimus-3 employs Captioning to perceive and interpret the inventory information, Grounding to select appropriate tools, Planning to generate sub-goals based on available materials, Action to execute these sub-goals sequentially, Reflection to assess the current task state, and Embodied QA to verify whether the task has been successfully completed.

Abstract

Developing generalist agents capable of solving open-ended tasks in visually rich, dynamic environments remains a core pursuit of embodied AI. While Minecraft has emerged as a compelling benchmark, existing agents often suffer from fragmented cognitive abilities, lacking synergy between reflexive execution (System 1) and deliberative reasoning (System 2). In this paper, we introduce Optimus-3, a generalist agent that integrates these dual capabilities within a unified framework. To achieve this, we address three fundamental challenges. First, to overcome the scarcity of reasoning data, we propose a Knowledge-Enhanced Automated Data Generation Pipeline. It synthesizes high-quality System 2 reasoning traces from raw System 1 interaction trajectories and mitigates hallucinations by injecting domain knowledge. We release the resulting dataset, OptimusM$^{4}$, to the community. Second, to reconcile the conflicting computational requirements of the two systems, we design a Dual-Router Aligned MoE Architecture. It employs a Task Router to prevent task interference via parameter decoupling, and a Layer Router to dynamically modulate reasoning depth, creating a computational "Fast Path" for System 1 and a "Deep Path" for System 2. Third, to activate the reasoning capabilities of System 2, we propose the Dual-Granularity Reasoning-Aware Policy Optimization (DGRPO) algorithm. It enforces Process-Outcome Co-Supervision via dual-granularity dense rewards, ensuring consistency between the thought process and the answer. Extensive evaluations demonstrate that Optimus-3 surpasses existing state-of-the-art methods on both System 2 tasks (by 21% on Planning, 66% on Captioning, 76% on Embodied QA, 3.4× on Grounding, and 18% on Reflection) and System 1 tasks (by 3% on Long-Horizon Action), with a notable 60% success rate on open-ended tasks.

Data Generation Pipeline


Given a task pool, we utilize a knowledge graph to generate task plans, forming the planning dataset. These plans are then used as instructions for STEVE-1, which interacts with the environment to produce the action dataset. During this process, we randomly sample images and employ expert models with environmental feedback to generate the captioning, embodied QA, and grounding datasets.
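The first stage of the pipeline, turning a knowledge graph into a task plan, can be illustrated with a small sketch. The dependency graph below is a hypothetical, heavily simplified stand-in (item names and edges are assumptions chosen for illustration, not the authors' knowledge graph), and `generate_plan` is an illustrative helper that orders sub-goals with a depth-first, post-order traversal so that every prerequisite appears before the item that needs it.

```python
from typing import Dict, FrozenSet, List

# Hypothetical, simplified dependency graph: item -> prerequisite items.
# Illustrative only; quantities, fuel, and tool tiers are deliberately omitted.
DEPENDENCIES: Dict[str, List[str]] = {
    "diamond_sword": ["diamond", "stick", "crafting_table"],
    "diamond": ["iron_pickaxe"],
    "iron_pickaxe": ["iron_ingot", "stick", "crafting_table"],
    "iron_ingot": ["iron_ore", "furnace"],
    "iron_ore": ["stone_pickaxe"],
    "furnace": ["cobblestone", "crafting_table"],
    "stone_pickaxe": ["cobblestone", "stick", "crafting_table"],
    "cobblestone": ["wooden_pickaxe"],
    "wooden_pickaxe": ["planks", "stick", "crafting_table"],
    "crafting_table": ["planks"],
    "stick": ["planks"],
    "planks": ["log"],
    "log": [],
}


def generate_plan(
    goal: str,
    graph: Dict[str, List[str]],
    inventory: FrozenSet[str] = frozenset(),
) -> List[str]:
    """Return sub-goals in an order that satisfies every dependency.

    Items already in `inventory` are treated as satisfied and are skipped
    together with their prerequisites.
    """
    plan: List[str] = []
    visited = set(inventory)

    def visit(item: str) -> None:
        if item in visited:
            return
        # Mark before recursing so a malformed (cyclic) graph cannot loop forever.
        visited.add(item)
        for dep in graph.get(item, []):
            visit(dep)
        # Post-order: the item is appended only after all of its prerequisites.
        plan.append(item)

    visit(goal)
    return plan


if __name__ == "__main__":
    print(generate_plan("diamond_sword", DEPENDENCIES))
    print(generate_plan("diamond_sword", DEPENDENCIES,
                        frozenset({"diamond", "stick", "crafting_table"})))
```

In a pipeline of this shape, each returned sub-goal would then be passed as an instruction to the action policy (STEVE-1 in the description above); how the authors format those instructions is not specified here.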

Dual-Router Aligned MoE Architecture


A: Overview of Optimus-3. Given observations and instructions, Optimus-3 couples System 1 fast reaction (Action) and System 2 deliberate reasoning (Embodied QA, Planning, Grounding, Reflection) within the Dual-Router Aligned MoE architecture. B: Details of the Dual-Router Aligned MoE architecture. Horizontally, the Task Router assigns each input to its corresponding task expert together with a shared knowledge expert. Vertically, the Layer Router reduces the latency of time-sensitive action inference by selectively skipping intermediate layers. Both routing decisions are made once before the forward pass. C: Performance comparison of Optimus-3 against current task-specific SOTA agents, GPT-4o, and Qwen2.5-VL.
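The two routing decisions described in panel B can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: experts are stand-in scaling functions, `task_router` is a plain lookup by task name, and `layer_router` uses a made-up rule (for the action task, keep the first and last layers and every even-indexed layer). The one property taken from the description is that both decisions are computed once, before the forward pass.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Expert = Callable[[Vector], Vector]

TASKS = ["action", "planning", "captioning", "embodied_qa", "grounding", "reflection"]


def make_expert(scale: float) -> Expert:
    """Toy stand-in for an expert feed-forward block (assumption)."""
    return lambda x: [scale * v for v in x]


@dataclass
class MoELayer:
    task_experts: Dict[str, Expert]  # one expert per task (parameter decoupling)
    shared_expert: Expert            # shared knowledge expert used by every task

    def forward(self, x: Vector, expert_key: str) -> Vector:
        routed = self.task_experts[expert_key](x)
        shared = self.shared_expert(x)
        # Residual connection plus the sum of task-specific and shared outputs.
        return [xi + r + s for xi, r, s in zip(x, routed, shared)]


def task_router(task: str) -> str:
    """Horizontal routing: map the input's task to its expert (hypothetical rule)."""
    if task not in TASKS:
        raise ValueError(f"unknown task: {task}")
    return task


def layer_router(task: str, num_layers: int) -> List[bool]:
    """Vertical routing: which layers to execute (hypothetical skipping rule)."""
    if task != "action":
        return [True] * num_layers  # "Deep Path": run every layer
    # "Fast Path": always keep first and last layers, skip odd intermediate ones.
    return [i == 0 or i == num_layers - 1 or i % 2 == 0 for i in range(num_layers)]


def forward(layers: List[MoELayer], x: Vector, task: str) -> Tuple[Vector, int]:
    # Both routing decisions are made once, before any layer runs.
    expert_key = task_router(task)
    keep = layer_router(task, len(layers))
    executed = 0
    for layer, run in zip(layers, keep):
        if not run:
            continue  # skipped layer: the hidden state passes through unchanged
        x = layer.forward(x, expert_key)
        executed += 1
    return x, executed


def build_model(num_layers: int = 6) -> List[MoELayer]:
    return [
        MoELayer({t: make_expert(0.1 * (i + 1)) for i, t in enumerate(TASKS)},
                 make_expert(0.05))
        for _ in range(num_layers)
    ]


if __name__ == "__main__":
    model = build_model()
    for task in ("action", "planning"):
        _, n = forward(model, [1.0, 2.0], task)
        print(task, "executed layers:", n)
```

Deciding the layer mask up front means the skipped layers never run for an action query, which is where the latency saving in this sketch comes from.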

Dual-Granularity Reasoning-Aware Policy Optimization


Visualization examples of the task-specific fine-grained reward functions in DGRPO. For the Planning task, we design a Dependency-Aware Synthesis Reward, which uses the item's crafting dependency path as the thinking reward and assigns fine-grained step-wise supervision as the answer reward. For vision-related tasks, we introduce a Hallucination-Aware Consistency Reward that penalizes hallucinated items in both the reasoning process and the final answer.
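To make the two reward ideas concrete, the sketch below scores the thinking and the answer separately for each task type. The exact formulas are assumptions made for illustration (the caption does not give them): the planning thinking reward is the longest common subsequence between the stated and reference dependency paths, normalized by the reference length; the planning answer reward is the fraction of reference steps matched position by position; and the vision reward subtracts a hypothetical `penalty` for each mentioned item that is not actually present.

```python
from typing import Dict, Iterable, List, Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two sequences."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def dependency_aware_reward(
    think_path: List[str],
    answer_steps: List[str],
    ref_path: List[str],
    ref_steps: List[str],
) -> Dict[str, float]:
    """Planning reward sketch (assumed formulas, not the authors' definition)."""
    # Thinking: how much of the reference dependency path is recovered, in order.
    thinking = lcs_length(think_path, ref_path) / len(ref_path) if ref_path else 0.0
    # Answer: step-wise supervision, one unit of credit per correctly placed step.
    matched = sum(p == r for p, r in zip(answer_steps, ref_steps))
    answer = matched / len(ref_steps) if ref_steps else 0.0
    return {"thinking": thinking, "answer": answer}


def hallucination_aware_reward(
    think_items: Iterable[str],
    answer_items: Iterable[str],
    visible_items: Iterable[str],
    penalty: float = 0.5,  # hypothetical weight for each hallucinated item
) -> Dict[str, float]:
    """Vision-task reward sketch: penalize items that are not actually present."""
    visible = set(visible_items)

    def score(items: Iterable[str]) -> float:
        mentioned = set(items)
        if not mentioned:
            return 0.0
        correct = len(mentioned & visible)
        hallucinated = len(mentioned - visible)
        return max(0.0, (correct - penalty * hallucinated) / len(mentioned))

    return {"thinking": score(think_items), "answer": score(answer_items)}


if __name__ == "__main__":
    ref = ["log", "planks", "stick", "diamond_sword"]
    print(dependency_aware_reward(["log", "stick", "diamond_sword"], ref[:3] + ["sword"], ref, ref))
    print(hallucination_aware_reward(["diamond", "stick", "emerald"], ["diamond"], ["diamond", "stick"]))
```

Returning the thinking and answer scores as separate values keeps the two granularities visible; how they are weighted and combined into the final policy-optimization signal is not covered by this sketch.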

Experiment

Main results of Optimus-3 on System 1 and System 2 tasks.

Table 1: Main results of Optimus-3 on the MineSys2 Benchmark.


Table 2: Main results of Optimus-3 on the Long-Horizon Benchmark.


Table 3: Main results of Optimus-3 on open-ended tasks.


Conclusion

In this paper, we presented Optimus-3, a unified generalist agent that integrates System 1 action loops with System 2 reasoning capabilities within an end-to-end framework. To overcome the challenges of data scarcity, architectural conflict, and open-world generalization, we contributed advances along three dimensions. First, we introduced a Knowledge-Enhanced Data Generation Pipeline that samples high-fidelity System 2 reasoning traces from raw interaction trajectories. By leveraging domain constraints to filter hallucinations, we constructed and released the OptimusM$^{4}$ dataset. Second, we proposed the Dual-Router Aligned MoE architecture to address the computational conflict between the two systems. Through horizontal parameter decoupling and vertical depth adaptation, it efficiently maintains a "Fast Path" for reflexive control and a "Deep Path" for deliberative reasoning. Third, we developed the Dual-Granularity Reasoning-Aware Policy Optimization (DGRPO) algorithm. It establishes a Process-Outcome Co-Supervision mechanism, using dual-granularity rewards to align reasoning chains with visual evidence. Extensive experiments demonstrate that Optimus-3 achieves superior performance across diverse tasks, marking a significant step toward general-purpose embodied intelligence in complex, open-ended worlds.