InteractAI: Discovering Advanced Strategic Behavior Through Multi-Agent Evolution

Technical Report IAI-2024-01

Executive Summary

We've observed agents discovering progressively more complex tool use while playing a simple game of hide-and-seek. Through training in our new simulated hide-and-seek environment, agents develop a series of five distinct strategies and counterstrategies, some of which we did not know our environment supported. The self-supervised emergent complexity in this simple environment further suggests that multi-agent co-adaptation may one day produce extremely complex and intelligent behavior.


Environment Overview

Environment Diagram

In our environment, agents play a team-based hide-and-seek game. Hiders (blue) are tasked with avoiding line-of-sight from the seekers (red), and seekers are tasked with keeping vision of the hiders. There are objects scattered throughout the environment that hiders and seekers can grab and lock in place, as well as randomly generated immovable rooms and walls that agents must learn to navigate.


Agent Capabilities

Movement

The agents move by setting a force on themselves in the x and y directions, and can rotate about the z-axis.
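The force-and-torque action described above can be sketched as a simple 2D physics step. The class name, field names, and constants below are illustrative assumptions, not the report's actual implementation.

```python
import math

class Agent2D:
    """Toy agent that moves by applying forces to itself (a sketch)."""

    def __init__(self):
        self.x = self.y = 0.0       # position
        self.vx = self.vy = 0.0     # velocity
        self.theta = 0.0            # heading (rotation about the z-axis)

    def step(self, fx, fy, torque, dt=0.1, mass=1.0, inertia=1.0):
        # The action sets a force in x and y plus a torque about z;
        # velocity and heading are integrated with simple Euler steps.
        self.vx += fx / mass * dt
        self.vy += fy / mass * dt
        self.x += self.vx * dt
        self.y += self.vy * dt
        self.theta = (self.theta + torque / inertia * dt) % (2 * math.pi)

agent = Agent2D()
agent.step(fx=1.0, fy=0.0, torque=0.0)   # accelerates the agent along +x
```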

Vision

The agents can see objects in their line of sight and within a frontal cone.
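A frontal-cone visibility test of this kind can be sketched as an angular check. Occlusion by walls is omitted, and the function name and 120° field of view are assumptions for illustration.

```python
import math

def in_view_cone(agent_pos, agent_heading, obj_pos, fov=math.radians(120)):
    """Return True if obj_pos lies within the agent's frontal vision cone."""
    dx = obj_pos[0] - agent_pos[0]
    dy = obj_pos[1] - agent_pos[1]
    angle_to_obj = math.atan2(dy, dx)
    # Smallest signed angular difference between heading and object bearing.
    diff = (angle_to_obj - agent_heading + math.pi) % (2 * math.pi) - math.pi
    return abs(diff) <= fov / 2

visible = in_view_cone((0, 0), 0.0, (1, 0))   # object directly ahead
```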

Sensing

The agents can sense distance to objects, walls, and other agents around them using a lidar-like sensor.
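A lidar-like sensor of this kind can be sketched by casting evenly spaced rays and marching each outward until it hits an obstacle. The grid-based obstacle check is a stand-in assumption; the report does not specify the real sensor beyond distances to nearby objects, walls, and agents.

```python
import math

def lidar_scan(pos, obstacles, n_rays=8, max_range=10.0, step=0.05):
    """Return one distance reading per ray, capped at max_range."""
    readings = []
    for i in range(n_rays):
        angle = 2 * math.pi * i / n_rays
        dx, dy = math.cos(angle), math.sin(angle)
        dist = max_range
        r = 0.0
        while r < max_range:
            r += step
            # Quantize the sample point to a grid cell and test occupancy.
            cell = (round(pos[0] + dx * r), round(pos[1] + dy * r))
            if cell in obstacles:
                dist = r
                break
        readings.append(dist)
    return readings

# A short wall of obstacle cells to the east of the origin:
scan = lidar_scan((0.0, 0.0), {(2, y) for y in range(-2, 3)})
```

The east-facing ray returns a short distance where it meets the wall, while the west-facing ray reads the maximum range.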

Manipulation

The agents can grab and move objects in front of them.

Locking

The agents can lock objects in place. Only the team that locked an object can unlock it.
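The team-based lock rule above reduces to a small piece of state on each object. The class and method names here are assumptions for illustration.

```python
class LockableObject:
    """Object that any team can lock, but only the locking team can unlock."""

    def __init__(self):
        self.locked_by = None   # team name, or None if unlocked

    def lock(self, team):
        if self.locked_by is None:
            self.locked_by = team
            return True
        return False            # already locked by some team

    def unlock(self, team):
        # Only the team that locked the object can unlock it.
        if self.locked_by == team:
            self.locked_by = None
            return True
        return False

box = LockableObject()
box.lock("hiders")   # hiders lock the box; seekers cannot unlock it
```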

Emergent Strategies

Random (Episode 0)

The agents move randomly.

Chasing (Episodes 0–2.69 million)

Seekers learn to chase hiders.

Door Blocking (Episodes 2.69–8.62 million)

Hiders learn to grab and move boxes to block the doors.

Ramp Use (Episodes 8.62–14.5 million)

Seekers learn to use the ramp to jump over obstacles.

Ramp Defense (Episodes 14.5–43.4 million)

Hiders learn to move the ramp inside the room to prevent seekers from using it.

Training Infrastructure

We use the same training infrastructure and algorithms used to train OpenAI Five and Dactyl. However, in our environment each agent acts independently, using its own observations and hidden memory state. Agents use an entity-centric state-based representation of the world, which is permutation invariant with respect to objects and other agents.
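The permutation-invariant, entity-centric encoding can be sketched as a shared per-entity embedding followed by an order-independent pooling step. The linear weights and the choice of mean pooling are illustrative assumptions; the report does not specify this exact form.

```python
def embed(entity, weights):
    # One shared linear map applied to every entity; weight sharing makes
    # the encoding independent of the number and order of entities.
    return [sum(w * x for w, x in zip(row, entity)) for row in weights]

def encode_entities(entities, weights):
    """Permutation-invariant summary of a set of entity feature vectors."""
    embedded = [embed(e, weights) for e in entities]
    n = len(embedded)
    # Elementwise mean over entities: permuting `entities` cannot change it.
    return [sum(col) / n for col in zip(*embedded)]

W = [[1.0, 0.0], [0.0, 1.0]]                        # identity map for demo
a = encode_entities([[1.0, 2.0], [3.0, 4.0]], W)
b = encode_entities([[3.0, 4.0], [1.0, 2.0]], W)    # same set, swapped order
```

Because the reduction is symmetric, `a` and `b` are identical even though the entity order differs, which is the invariance property described above.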


Surprising Behaviors

Box Surfing

Since agents move by applying forces to themselves, they can grab a box while on top of it and "surf" it to the hider's location.

Endless Running

Because we did not add explicit negative rewards for agents leaving the play area, in rare cases hiders learn to take a box and run with it indefinitely.

Ramp Exploitation

Reinforcement learning is remarkably effective at finding small mechanics to exploit; in these cases, agents learned to abuse contact physics in unexpected ways.

Transfer Learning Evaluation

We propose using a suite of domain-specific intelligence tests that target capabilities we believe agents may eventually acquire. Transfer performance in these settings can act as a quantitative measure of representation quality or skill.