Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks

Zaijing Li1,2, Yuquan Xie1, Rui Shao1✉, Gongwei Chen1,
Dongmei Jiang2, Liqiang Nie1✉
1Harbin Institute of Technology, Shenzhen    2Peng Cheng Laboratory, Shenzhen
✉ Corresponding author  

Abstract

Building a general-purpose agent is a long-standing vision in the field of artificial intelligence. Existing agents have made remarkable progress in many domains, yet they still struggle to complete long-horizon tasks in an open world. We attribute this to the lack of the world knowledge and experience needed to guide agents through a variety of long-horizon tasks. In this paper, we propose a Hybrid Multimodal Memory module to address these challenges. It 1) transforms knowledge into a Hierarchical Directed Knowledge Graph (HDKG) that allows agents to explicitly represent and learn world knowledge, and 2) summarises historical information into an Abstracted Multimodal Experience Pool (AMEP) that provides agents with rich references for in-context learning. On top of the Hybrid Multimodal Memory module, we construct a multimodal, multimodular agent, Optimus-1, in Minecraft with a dedicated Knowledge-Guided Planner and Experience-Driven Reflector, which contribute to better planning and reflection on long-horizon tasks. Extensive experimental results show that Optimus-1 significantly outperforms all existing agents on a challenging long-horizon task benchmark and exhibits near human-level performance on many tasks. In addition, we use various Multimodal Large Language Models (MLLMs) as the backbone of Optimus-1, and the experimental results show that Optimus-1 exhibits strong generalisation with the help of the Hybrid Multimodal Memory module, outperforming the GPT-4V baseline on many tasks. Together, these results indicate that Optimus-1 is a major step towards a general agent with a human-like level of performance.

Demos

Wooden Group

Craft a crafting table

  1. mine 1 log
  2. craft 4 planks
  3. craft a crafting table

Craft a wooden pickaxe

  1. mine 3 logs
  2. craft 9 planks
  3. craft 2 sticks
  4. craft 1 crafting table
  5. craft 1 wooden pickaxe

Craft a wooden sword

  1. mine 3 logs
  2. craft 8 planks
  3. craft 1 stick
  4. craft 1 crafting table
  5. craft 1 wooden sword

Stone Group

Craft a torch

  1. mine 3 logs
  2. craft 10 planks
  3. craft 3 sticks
  4. craft 1 crafting table
  5. craft 1 wooden pickaxe
  6. equip 1 wooden pickaxe
  7. dig down and mine 1 coal
  8. craft 1 torch

Craft a stone pickaxe

  1. mine 2 logs
  2. craft 6 planks
  3. craft 2 sticks
  4. craft 1 crafting table
  5. mine 1 log
  6. craft 1 planks
  7. craft 1 wooden pickaxe
  8. equip 1 wooden pickaxe
  9. dig down and mine 3 stone
  10. craft 1 stone pickaxe

Craft a stone sword

  1. mine 4 logs
  2. craft 15 planks
  3. craft 3 sticks
  4. craft 1 crafting table
  5. craft 1 wooden pickaxe
  6. equip 1 wooden pickaxe
  7. dig down and mine 2 cobblestone
  8. craft 1 stone sword

Iron Group

Craft an iron pickaxe

  1. mine 10 logs
  2. craft 38 planks
  3. craft 8 sticks
  4. craft 1 crafting table
  5. craft 1 wooden pickaxe
  6. equip 1 wooden pickaxe
  7. dig down and break down 12 cobblestone
  8. craft 1 stone pickaxe
  9. equip 1 stone pickaxe
  10. craft 1 furnace
  11. dig down and break down 3 iron ore
  12. smelt 3 iron ore
  13. craft 1 iron pickaxe

Craft an iron sword

  1. mine 7 logs
  2. craft 23 planks
  3. craft 6 sticks
  4. craft 1 crafting table
  5. craft 1 wooden pickaxe
  6. equip 1 wooden pickaxe
  7. dig down and break down 11 cobblestone
  8. craft 1 stone pickaxe
  9. equip 1 stone pickaxe
  10. craft 1 furnace
  11. dig down and break down 2 iron ore
  12. smelt 2 iron ore
  13. craft 1 iron sword

Craft rails

  1. mine 7 logs
  2. craft 21 planks
  3. craft 5 sticks
  4. craft 1 crafting table
  5. craft 1 wooden pickaxe
  6. equip 1 wooden pickaxe
  7. dig down and mine 11 stone
  8. craft 1 stone pickaxe
  9. dig down and mine 6 iron ore
  10. craft 1 furnace
  11. smelt 6 iron ore into iron ingots
  12. craft 1 rail

Golden Group

Craft golden axe

  1. mine 10 logs
  2. craft 38 planks
  3. craft 8 sticks
  4. craft 1 crafting table
  5. craft 1 wooden pickaxe
  6. equip 1 wooden pickaxe
  7. dig down and break down 12 cobblestone
  8. craft 1 stone pickaxe
  9. equip 1 stone pickaxe
  10. craft 1 furnace
  11. dig down and break down 3 iron ore
  12. smelt 3 iron ore
  13. craft 1 iron pickaxe
  14. dig down and mine 3 gold
  15. smelt 3 gold
  16. craft 1 golden axe

Craft golden sword

  1. mine 10 logs
  2. craft 38 planks
  3. craft 8 sticks
  4. craft 1 crafting table
  5. craft 1 wooden pickaxe
  6. equip 1 wooden pickaxe
  7. dig down and break down 12 cobblestone
  8. craft 1 stone pickaxe
  9. equip 1 stone pickaxe
  10. craft 1 furnace
  11. dig down and break down 3 iron ore
  12. smelt 3 iron ore
  13. craft 1 iron pickaxe
  14. dig down and mine 2 gold
  15. smelt 2 gold
  16. craft 1 golden sword

Craft golden shovel

  1. mine 10 logs
  2. craft 38 planks
  3. craft 8 sticks
  4. craft 1 crafting table
  5. craft 1 wooden pickaxe
  6. equip 1 wooden pickaxe
  7. dig down and break down 12 cobblestone
  8. craft 1 stone pickaxe
  9. equip 1 stone pickaxe
  10. craft 1 furnace
  11. dig down and break down 3 iron ore
  12. smelt 3 iron ore
  13. craft 1 iron pickaxe
  14. dig down and mine 1 gold
  15. smelt 1 gold
  16. craft 1 golden shovel

Diamond Group

Craft a diamond axe

  1. mine 10 logs
  2. craft 38 planks
  3. craft 8 sticks
  4. craft 1 crafting table
  5. craft 1 wooden pickaxe
  6. equip 1 wooden pickaxe
  7. explore 1 to find a good spot to dig down
  8. dig down and break down 12 cobblestone
  9. craft 1 stone pickaxe
  10. equip 1 stone pickaxe
  11. craft 1 furnace
  12. dig down and break down 3 iron ore
  13. smelt 3 iron ore
  14. craft 1 iron pickaxe
  15. dig down and mine 3 diamond
  16. craft 1 diamond axe

Craft a diamond pickaxe

  1. mine 10 logs
  2. craft 38 planks
  3. craft 8 sticks
  4. craft 1 crafting table
  5. craft 1 wooden pickaxe
  6. equip 1 wooden pickaxe
  7. dig down and break down 12 cobblestone
  8. craft 1 stone pickaxe
  9. equip 1 stone pickaxe
  10. craft 1 furnace
  11. dig down and break down 3 iron ore
  12. smelt 3 iron ore
  13. craft 1 iron pickaxe
  14. dig down and mine 3 diamond
  15. craft 1 diamond pickaxe

Craft a diamond hoe

  1. mine 10 logs
  2. craft 38 planks
  3. craft 8 sticks
  4. craft 1 crafting table
  5. craft 1 wooden pickaxe
  6. equip 1 wooden pickaxe
  7. dig down and break down 12 cobblestone
  8. craft 1 stone pickaxe
  9. equip 1 stone pickaxe
  10. craft 1 furnace
  11. dig down and break down 3 iron ore
  12. smelt 3 iron ore
  13. craft 1 iron pickaxe
  14. dig down and mine 2 diamond
  15. craft 1 diamond hoe

Armor Group

Craft golden boots

  1. mine 10 logs
  2. craft 38 planks
  3. craft 8 sticks
  4. craft 1 crafting table
  5. craft 1 wooden pickaxe
  6. equip 1 wooden pickaxe
  7. dig down and break down 12 cobblestone
  8. craft 1 stone pickaxe
  9. equip 1 stone pickaxe
  10. craft 1 furnace
  11. dig down and break down 3 iron ore
  12. smelt 3 iron ore
  13. craft 1 iron pickaxe
  14. equip 1 iron pickaxe
  15. dig down and mine 4 gold
  16. smelt 4 gold
  17. craft 1 golden boots

Craft iron leggings

  1. mine 9 logs
  2. craft 36 planks
  3. craft 8 sticks
  4. craft 1 crafting table
  5. craft 1 chest
  6. craft 1 wooden pickaxe
  7. equip 1 wooden pickaxe
  8. dig down and break down 12 cobblestone
  9. craft 1 stone pickaxe
  10. craft 1 furnace
  11. equip 1 stone pickaxe
  12. dig down and break down 5 iron ore
  13. smelt 5 iron ore
  14. craft 1 iron helmet

Craft a diamond helmet

  1. mine 10 logs
  2. craft 38 planks
  3. craft 8 sticks
  4. craft 1 crafting table
  5. craft 1 wooden pickaxe
  6. equip 1 wooden pickaxe
  7. explore 1 to find a good spot to dig down
  8. dig down and break down 12 cobblestone
  9. craft 1 stone pickaxe
  10. equip 1 stone pickaxe
  11. craft 1 furnace
  12. dig down and break down 3 iron ore
  13. smelt 3 iron ore
  14. craft 1 iron pickaxe
  15. equip 1 iron pickaxe
  16. dig down and mine 5 diamond
  17. craft 1 diamond helmet

Redstone Group

Craft piston

  1. mine 10 logs
  2. craft 38 planks
  3. craft 8 sticks
  4. craft 1 crafting table
  5. craft 1 wooden pickaxe
  6. equip 1 wooden pickaxe
  7. dig down and break down 12 cobblestone
  8. craft 1 stone pickaxe
  9. craft 1 furnace
  10. equip 1 stone pickaxe
  11. dig down and break down 4 iron ore
  12. smelt 4 iron ore
  13. craft 1 iron pickaxe
  14. equip 1 iron pickaxe
  15. dig down and break down 1 redstone
  16. craft 1 piston

Craft redstone torch

  1. mine 10 logs
  2. craft 38 planks
  3. craft 8 sticks
  4. craft 1 crafting table
  5. craft 1 wooden pickaxe
  6. equip 1 wooden pickaxe
  7. dig down and break down 12 cobblestone
  8. craft 1 stone pickaxe
  9. craft 1 furnace
  10. equip 1 stone pickaxe
  11. dig down and break down 3 iron ore
  12. smelt 3 iron ore
  13. craft 1 iron pickaxe
  14. equip 1 iron pickaxe
  15. dig down and break down 1 redstone
  16. craft 1 redstone torch

Craft activator rail

  1. mine 10 logs
  2. craft 38 planks
  3. craft 10 sticks
  4. craft 1 crafting table
  5. craft 1 wooden pickaxe
  6. equip 1 wooden pickaxe
  7. dig down and break down 11 cobblestone
  8. craft 1 stone pickaxe
  9. craft 1 furnace
  10. equip 1 stone pickaxe
  11. dig down and break down 9 iron ore
  12. smelt 9 iron ore
  13. craft 1 iron pickaxe
  14. equip 1 iron pickaxe
  15. dig down and break down 1 redstone
  16. craft 1 redstone torch
  17. craft 1 activator rail

Overview of the Optimus-1 framework


Optimus-1 consists of three components: a Knowledge-Guided Planner, an Experience-Driven Reflector, and an Action Controller. Given a long-horizon task in a game environment, the Knowledge-Guided Planner senses the environment, retrieves knowledge from HDKG, and decomposes the task into executable sub-goals. The Action Controller then executes these sub-goals sequentially. During execution, the Experience-Driven Reflector is activated periodically and uses historical experience from AMEP to assess whether Optimus-1 can complete the current sub-goal. If not, it instructs the Knowledge-Guided Planner to revise its plan. Through iterative interaction with the environment, Optimus-1 ultimately completes the task.
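The plan, act, reflect cycle described above can be sketched as a small control loop. This is an illustrative sketch only, not the authors' implementation: the names `run_agent`, `RunResult`, `plan`, `act`, and `reflect`, as well as the `reflect_every` and `max_steps` parameters, are hypothetical stand-ins for the Knowledge-Guided Planner, Action Controller, and Experience-Driven Reflector.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RunResult:
    completed: List[str] = field(default_factory=list)  # sub-goals finished, in order
    replans: int = 0                                     # times the planner revised the plan
    success: bool = False


def run_agent(
    task: str,
    plan: Callable[[str, List[str]], List[str]],  # (task, completed) -> remaining sub-goals
    act: Callable[[str], bool],                   # one environment step; True when sub-goal is done
    reflect: Callable[[str, int], bool],          # (sub-goal, steps spent) -> still achievable?
    reflect_every: int = 3,                       # assumed reflection period, in steps
    max_steps: int = 50,                          # assumed overall step budget
) -> RunResult:
    result = RunResult()
    sub_goals = list(plan(task, result.completed))
    steps = 0
    while sub_goals and steps < max_steps:
        goal, done, steps_on_goal = sub_goals[0], False, 0
        while not done and steps < max_steps:
            done = act(goal)
            steps += 1
            steps_on_goal += 1
            # Periodic reflection: stop early if the sub-goal looks unreachable.
            if not done and steps_on_goal % reflect_every == 0:
                if not reflect(goal, steps_on_goal):
                    break
        if done:
            result.completed.append(goal)
            sub_goals.pop(0)
        elif steps < max_steps:
            # The reflector rejected the sub-goal: ask the planner for a revised plan.
            result.replans += 1
            sub_goals = list(plan(task, result.completed))
    result.success = not sub_goals
    return result
```

In this sketch the planner is re-invoked only when the reflector judges the current sub-goal unreachable, and the step budget bounds the total number of environment interactions, including those spent on abandoned sub-goals.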

Hybrid Multimodal Memory


(a) Extraction process of multimodal experience. Frames are first filtered through a video buffer and an image buffer; MineCLIP is then used to compute the similarity between the visual frames and the sub-goal, and the selected frames are stored in the Abstracted Multimodal Experience Pool (AMEP). (b) Overview of the Hierarchical Directed Knowledge Graph (HDKG). Knowledge is stored as a directed graph whose nodes represent objects and whose directed edges point from an object to the items that can be crafted from it.
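To make the graph structure in (b) concrete, the following sketch stores a handful of crafting relations as a directed graph and derives a crafting order for a target item by topological sorting. It is a simplified illustration under stated assumptions: the `EDGES` table is a hand-written toy subset with quantities omitted, the crafting table is treated as a prerequisite of the pickaxe, and `prerequisites` and `plan_for` are hypothetical helper names rather than part of Optimus-1.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Toy subset: each edge points from an object to the items crafted from it.
EDGES = {
    "log": ["planks"],
    "planks": ["stick", "crafting_table", "wooden_pickaxe"],
    "stick": ["wooden_pickaxe"],
    "crafting_table": ["wooden_pickaxe"],
}


def prerequisites(edges):
    """Invert the graph: map each item to the set of objects it is crafted from."""
    needs = {}
    for material, products in edges.items():
        needs.setdefault(material, set())
        for product in products:
            needs.setdefault(product, set()).add(material)
    return needs


def plan_for(target, edges):
    """Return the target and all of its ancestors, ordered so materials come first."""
    needs = prerequisites(edges)
    if target not in needs:
        raise KeyError(f"unknown item: {target}")
    relevant, stack = set(), [target]
    while stack:  # walk backwards from the target to collect its ancestors
        item = stack.pop()
        if item not in relevant:
            relevant.add(item)
            stack.extend(needs[item])
    subgraph = {item: needs[item] for item in relevant}
    return list(TopologicalSorter(subgraph).static_order())
```

Calling `plan_for("wooden_pickaxe", EDGES)` yields an order that starts with `log` and ends with `wooden_pickaxe`, which mirrors the shape of the step lists in the demos above. The relative order of independent items such as `stick` and `crafting_table` is left to the sorter.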

Experiment

Main results of Optimus-1 on the long-horizon task benchmark.


We report the average success rate (SR), average number of steps (AS), and average time (AT) for each task group; the results for each task can be found in the experiment section of the Appendix. Lower AS and AT values mean that the agent completes the task more efficiently, while ∞ indicates that the agent is unable to complete the task. Overall is the average result over the five groups Iron, Gold, Diamond, Redstone, and Armor.
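As a rough illustration of how the three metrics relate, the sketch below aggregates a list of episode records into SR, AS, and AT. The page does not specify how AS and AT are averaged, so this sketch assumes they are computed over successful episodes only and reported as ∞ when no episode succeeds. The `Episode` record and the `summarise` function are hypothetical names introduced for this example.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Episode:
    success: bool
    steps: int      # environment steps taken in the episode
    seconds: float  # wall-clock time of the episode


def summarise(episodes: List[Episode]) -> Tuple[float, float, float]:
    """Return (SR, AS, AT) for one task group.

    Assumption: AS and AT average over successful episodes only;
    both are infinite when the agent never completes the task.
    """
    if not episodes:
        raise ValueError("need at least one episode")
    wins = [e for e in episodes if e.success]
    sr = len(wins) / len(episodes)
    if not wins:
        return sr, math.inf, math.inf
    avg_steps = sum(e.steps for e in wins) / len(wins)
    avg_seconds = sum(e.seconds for e in wins) / len(wins)
    return sr, avg_steps, avg_seconds
```

Under the same assumption, an Overall figure would be the mean of the per-group values.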

Generalisation and Self-Evolution


(a) With the help of Hybrid Multimodal Memory, Optimus-1 variants built on various MLLMs achieve a 2- to 6-fold performance improvement. (b) Change in the success rate of Optimus-1 on the unseen task over 4 epochs.

Conclusion

In this paper, we propose the Hybrid Multimodal Memory module, which is inspired by the major influence of the human long-term memory system on the completion of long-horizon tasks. The module consists of two parts: HDKG and AMEP. HDKG provides the world knowledge the agent needs in its planning phase, and AMEP provides refined historical experience for its reflection phase. On top of the Hybrid Multimodal Memory, we construct the multimodal and multimodular agent Optimus-1 in Minecraft. Extensive experimental results show that Optimus-1 outperforms all existing agents on long-horizon tasks. Furthermore, we validate that general-purpose MLLMs equipped with our proposed Hybrid Multimodal Memory can exceed the powerful GPT-4V baseline without additional parameter updates. This self-evolution approach provides novel insights and directions for the study of general-purpose agents.