Optimus-2: Multimodal Minecraft Agent with Goal-Observation-Action Conditioned Policy

CVPR 2025

Zaijing Li1,2, Yuquan Xie1, Rui Shao1✉, Gongwei Chen1,
Dongmei Jiang2, Liqiang Nie1✉
1Harbin Institute of Technology, Shenzhen    2Peng Cheng Laboratory, Shenzhen
✉ Corresponding author  

Abstract

Building an agent that can mimic human behavior patterns to accomplish various open-world tasks is a long-term goal. To enable agents to effectively learn behavioral patterns across diverse tasks, a key challenge lies in modeling the intricate relationships among observations, actions, and language. To this end, we propose Optimus-2, a novel Minecraft agent that incorporates a Multimodal Large Language Model (MLLM) for high-level planning, alongside a Goal-Observation-Action Conditioned Policy (GOAP) for low-level control. GOAP contains (1) an Action-guided Behavior Encoder that models causal relationships between observations and actions at each timestep, then dynamically interacts with the historical observation-action sequence, consolidating it into fixed-length behavior tokens, and (2) an MLLM that aligns behavior tokens with open-ended language instructions to predict actions auto-regressively. Moreover, we introduce a high-quality Minecraft Goal-Observation-Action (MGOA) dataset, which contains 25,000 videos across 8 atomic tasks, providing about 30M goal-observation-action pairs. The automated construction method, along with the MGOA dataset, can contribute to the community's efforts to train Minecraft agents. Extensive experimental results demonstrate that Optimus-2 exhibits superior performance across atomic tasks, long-horizon tasks, and open-ended instruction tasks in Minecraft.

Atomic Tasks

Dig down to mine dirt

Chop trees

Dig down to mine stone

Long-horizon Tasks

Craft an iron sword

  1. mine 7 logs
  2. craft 23 planks
  3. craft 6 sticks
  4. craft 1 crafting table
  5. craft 1 wooden pickaxe
  6. equip 1 wooden pickaxe
  7. dig down and break down 11 cobblestone
  8. craft 1 stone pickaxe
  9. equip 1 stone pickaxe
  10. craft 1 furnace
  11. dig down and break down 2 iron ore
  12. smelt 2 iron ore
  13. craft 1 iron sword
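
A plan like the one above is easier to execute step by step once each line is turned into a structured sub-goal. The following is a minimal sketch, assuming every step has the shape "verb phrase, count, item"; the `SubGoal` class, the `parse_step` helper, and the regular expression are hypothetical illustrations and are not part of Optimus-2.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SubGoal:
    """One step of a long-horizon plan (hypothetical structure)."""
    verb: str   # e.g. "mine", "craft", "dig down and break down"
    count: int  # how many items the step should yield or use
    item: str   # e.g. "logs", "cobblestone"


# Assumed step format: optional "N." prefix, a verb phrase, a count, an item.
_STEP = re.compile(
    r"^\s*(?:\d+\.\s*)?(?P<verb>[a-z ]+?)\s+(?P<count>\d+)\s+(?P<item>[a-z ]+?)\s*$"
)


def parse_step(line: str) -> SubGoal:
    """Parse one plan line into a SubGoal, or raise ValueError."""
    match = _STEP.match(line.lower())
    if match is None:
        raise ValueError(f"unrecognized step: {line!r}")
    return SubGoal(match["verb"], int(match["count"]), match["item"])


# A few steps copied from the "Craft an iron sword" plan.
plan_text = """
1. mine 7 logs
2. craft 23 planks
7. dig down and break down 11 cobblestone
13. craft 1 iron sword
"""

plan = [parse_step(line) for line in plan_text.strip().splitlines()]
for goal in plan:
    print(goal)
```

Keeping the count and item as separate fields would let a controller check the inventory against each sub-goal before moving on to the next one.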

Craft a golden pickaxe

  1. mine 10 logs
  2. craft 38 planks
  3. craft 8 sticks
  4. craft 1 crafting table
  5. craft 1 wooden pickaxe
  6. equip 1 wooden pickaxe
  7. dig down and break down 12 cobblestone
  8. craft 1 stone pickaxe
  9. equip 1 stone pickaxe
  10. craft 1 furnace
  11. dig down and break down 3 iron ore
  12. smelt 3 iron ore
  13. craft 1 iron pickaxe
  14. dig down and mine 3 gold
  15. smelt 3 gold
  16. craft 1 golden pickaxe

Craft a diamond hoe

  1. mine 10 logs
  2. craft 38 planks
  3. craft 8 sticks
  4. craft 1 crafting table
  5. craft 1 wooden pickaxe
  6. equip 1 wooden pickaxe
  7. dig down and break down 12 cobblestone
  8. craft 1 stone pickaxe
  9. equip 1 stone pickaxe
  10. craft 1 furnace
  11. dig down and break down 3 iron ore
  12. smelt 3 iron ore
  13. craft 1 iron pickaxe
  14. dig down and mine 2 diamonds
  15. craft 1 diamond hoe

Open-Ended Instruction Tasks

I want some dirt to build a house, can you help me?

I was born in the desert. First, I need to find a tree to gather some logs.

I want to craft a wooden pickaxe, so I need to chop some trees to get logs.

Overview of the Optimus-2 framework


We propose Optimus-2, a novel Minecraft agent that incorporates a Multimodal Large Language Model (MLLM) for high-level planning, alongside a Goal-Observation-Action Conditioned Policy (GOAP) for low-level control. GOAP contains (1) an Action-guided Behavior Encoder that models causal relationships between observations and actions at each timestep, then dynamically interacts with the historical observation-action sequence, consolidating it into fixed-length behavior tokens, and (2) an MLLM that aligns behavior tokens with open-ended language instructions to predict actions auto-regressively.
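
To make the idea of "fixed-length behavior tokens" concrete, the sketch below compresses a history of any length into a fixed number of vectors. It is an illustration only: this page does not describe the internals of the Action-guided Behavior Encoder, so query-based attention pooling is used here as a stand-in technique, and the `consolidate` function, the query vectors, and the toy feature vectors are all assumptions.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def consolidate(history: List[Vector], queries: List[Vector]) -> List[Vector]:
    """Attention-pool a variable-length history into len(queries) tokens.

    Each query attends over every history step, so the output length
    depends only on the number of queries, not on the history length.
    """
    if not history:
        raise ValueError("history must contain at least one step")
    dim = len(history[0])
    scale = math.sqrt(dim)
    tokens = []
    for query in queries:
        weights = softmax([dot(query, step) / scale for step in history])
        token = [sum(w * step[i] for w, step in zip(weights, history))
                 for i in range(dim)]
        tokens.append(token)
    return tokens


# Toy per-timestep features (assumed to already fuse observation and action).
short_history = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
long_history = short_history * 20
queries = [[1.0, 0.0], [0.0, 1.0]]  # in a real model these would be learned

print(len(consolidate(short_history, queries)))
print(len(consolidate(long_history, queries)))
```

Both calls return the same number of tokens, which is the property the downstream model relies on: its input size stays constant however long the episode has been running.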

Experiment

We report the average reward (AR) for Atomic Tasks and the average success rate (SR) for Long-Horizon Tasks and Open-Ended Instruction Tasks.
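
The two metrics can be computed as below. This is a minimal sketch that assumes AR is the mean per-episode reward and SR is the fraction of episodes that end in success; the page does not spell out either definition, and the `Episode` record and the sample values are hypothetical, not reported results.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    """Outcome of one evaluation rollout (hypothetical record)."""
    reward: float   # e.g. number of target items collected
    success: bool   # whether the task goal was reached


def average_reward(episodes: Sequence[Episode]) -> float:
    """Mean reward over all episodes (assumed definition of AR)."""
    if not episodes:
        raise ValueError("no episodes to evaluate")
    return sum(e.reward for e in episodes) / len(episodes)


def success_rate(episodes: Sequence[Episode]) -> float:
    """Fraction of successful episodes (assumed definition of SR)."""
    if not episodes:
        raise ValueError("no episodes to evaluate")
    return sum(1 for e in episodes if e.success) / len(episodes)


# Made-up rollouts for illustration only.
episodes = [
    Episode(reward=10.0, success=True),
    Episode(reward=0.0, success=False),
    Episode(reward=20.0, success=True),
    Episode(reward=10.0, success=False),
]

print(average_reward(episodes))
print(success_rate(episodes))
```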

Table 1: Main results of GOAP on Atomic Tasks.


Table 2: Main results of Optimus-2 on Long-Horizon Tasks.


Table 3: Main results of Optimus-2 on Open-Ended Instruction Tasks.


Conclusion

In this paper, we propose Optimus-2, a novel agent that excels at a variety of tasks in the open-world environment of Minecraft. Optimus-2 integrates an MLLM for high-level planning and a Goal-Observation-Action Conditioned Policy (GOAP) for low-level control. As the core contribution of this paper, GOAP includes an Action-guided Behavior Encoder to model the observation-action sequence and an MLLM to align the goal with the observation-action sequence for predicting subsequent actions. Extensive experimental results demonstrate that GOAP has mastered various atomic tasks and can comprehend open-ended language instructions. This enables Optimus-2 to achieve superior performance on long-horizon tasks, surpassing existing state-of-the-art methods. Moreover, we introduce the Minecraft Goal-Observation-Action (MGOA) dataset to provide the community with large-scale, high-quality data for training Minecraft agents.

BibTeX

@inproceedings{li2025optimus2,
    title={Optimus-2: Multimodal Minecraft Agent with Goal-Observation-Action Conditioned Policy},
    author={Li, Zaijing and Xie, Yuquan and Shao, Rui and Chen, Gongwei and Jiang, Dongmei and Nie, Liqiang},
    booktitle={2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    year={2025},
    organization={IEEE}
}