Building a general-purpose agent is a long-standing vision in the field of artificial intelligence. Existing agents have made remarkable progress in many domains, yet they still struggle to complete long-horizon tasks in an open world. We attribute this to the lack of the necessary world knowledge and experience to guide agents through a variety of long-horizon tasks. In this paper, we propose the Hybrid Multimodal Memory module to address these challenges. It 1) transforms knowledge into a Hierarchical Directed Knowledge Graph (HDKG) that allows agents to explicitly represent and learn world knowledge, and 2) summarises historical information into an Abstracted Multimodal Experience Pool (AMEP) that provides agents with rich references for in-context learning. On top of the Hybrid Multimodal Memory module, we construct a multimodal agent, Optimus-1, with a dedicated Knowledge-Guided Planner and an Experience-Driven Reflector in Minecraft, enabling better planning and reflection on long-horizon tasks. Extensive experimental results show that Optimus-1 significantly outperforms all existing agents on a challenging long-horizon task benchmark and exhibits near human-level performance on many tasks. In addition, we evaluate various Multimodal Large Language Models (MLLMs) as the backbone of Optimus-1; the results show that, with the help of the Hybrid Multimodal Memory module, Optimus-1 exhibits strong generalisation and outperforms the GPT-4V baseline on many tasks. Together, these results show that Optimus-1 marks a major step towards a general-purpose agent with human-level performance.
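To make the HDKG idea concrete, the sketch below models world knowledge as a directed graph of crafting dependencies and expands a goal into an ordered plan by traversing it. The item names, recipes, and planning routine are hypothetical illustrations under that assumption, not the paper's actual implementation.

```python
# A minimal sketch of a hierarchical directed knowledge graph for
# Minecraft-style crafting knowledge. RECIPES stores, for each product,
# the ingredients it requires (i.e. edges ingredient -> product in the
# graph). All names and quantities are hypothetical, not the paper's HDKG.

RECIPES = {
    # product: {ingredient: quantity}
    "crafting_table": {"planks": 4},
    "planks": {"log": 1},
    "stick": {"planks": 2},
    "wooden_pickaxe": {"planks": 3, "stick": 2, "crafting_table": 1},
}

def plan(goal, visited=None):
    """Expand a goal into an ordered list of sub-goals by walking the
    directed edges backwards from the goal to raw resources."""
    if visited is None:
        visited = set()
    if goal in visited or goal not in RECIPES:
        return []  # raw resource (e.g. "log") or already planned
    visited.add(goal)
    steps = []
    for ingredient in RECIPES[goal]:
        steps += plan(ingredient, visited)
    steps.append(goal)  # craft the product only after its ingredients
    return steps

print(plan("wooden_pickaxe"))
# ['planks', 'stick', 'crafting_table', 'wooden_pickaxe']
```

Because every edge points from ingredient to product, any goal can be unrolled into a topologically ordered list of sub-goals, which suggests how a knowledge-guided planner could consume such a graph.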
In this paper, we propose the Hybrid Multimodal Memory module, inspired by the major role the human long-term memory system plays in completing long-horizon tasks. The module consists of two parts: HDKG and AMEP. HDKG provides the necessary world knowledge for the agent's planning phase, and AMEP provides refined historical experience for the agent's reflection phase. On top of the Hybrid Multimodal Memory, we construct the multimodal agent Optimus-1 in Minecraft. Extensive experimental results show that Optimus-1 outperforms all existing agents on long-horizon tasks. Furthermore, we validate that a general-purpose MLLM, equipped with our proposed Hybrid Multimodal Memory and without any additional parameter updates, can exceed the powerful GPT-4V baseline. This self-evolving approach provides novel insights and directions for the study of general-purpose agents.
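As a rough sketch of how an experience pool could support reflection via in-context learning, the example below stores abstracted task summaries and retrieves the most similar past experiences for a new task. The `embed` encoder, the similarity search, and all entries are placeholder assumptions, not the paper's AMEP implementation.

```python
# A minimal sketch of an abstracted experience pool with similarity-based
# retrieval. Everything here is a hypothetical illustration: a real pool
# would embed multimodal summaries with an MLLM encoder, not this toy one.
import math

def embed(text):
    # Toy bag-of-letters embedding standing in for a learned encoder.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

class ExperiencePool:
    def __init__(self):
        self.entries = []  # (embedding, summary, success flag)

    def add(self, task, summary, success):
        self.entries.append((embed(task), summary, success))

    def retrieve(self, task, k=2):
        """Return the k most similar past experiences, successes and
        failures alike, to serve as in-context examples for reflection."""
        query = embed(task)
        ranked = sorted(self.entries,
                        key=lambda e: cosine(query, e[0]), reverse=True)
        return [(summary, ok) for _, summary, ok in ranked[:k]]

pool = ExperiencePool()
pool.add("craft stone pickaxe", "mined cobblestone, then crafted at table", True)
pool.add("craft iron pickaxe", "failed: no furnace to smelt iron ore", False)
print(pool.retrieve("craft iron sword", k=1))
```

Keeping failed trajectories alongside successful ones is what lets a reflector reason about what went wrong, rather than only imitating past successes.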