Technical Report IAI-2024-01
We've observed agents discovering progressively more complex tool use while playing a simple game of hide-and-seek. Through training in our new simulated hide-and-seek environment, agents build a series of six distinct strategies and counterstrategies, some of which we did not know our environment supported. The self-supervised emergent complexity in this simple environment further suggests that multi-agent co-adaptation may one day produce extremely complex and intelligent behavior.
In our environment, agents play a team-based hide-and-seek game. Hiders (blue) are tasked with avoiding line-of-sight from the seekers (red), and seekers are tasked with keeping vision of the hiders. There are objects scattered throughout the environment that hiders and seekers can grab and lock in place, as well as randomly generated immovable rooms and walls that agents must learn to navigate.
The agents can move by setting a force on themselves in the x and y directions as well as rotate along the z-axis.
The agents can see objects in their line of sight and within a frontal cone.
The agents can sense distance to objects, walls, and other agents around them using a lidar-like sensor.
The agents can grab and move objects in front of them.
The agents can lock objects in place. Only the team that locked an object can unlock it.
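To make the capability list above concrete, the following is a minimal sketch of what a per-agent action and observation interface could look like. The names, field layout, and dimensions (such as the number of lidar rays) are illustrative assumptions for this report, not the actual environment code.

import numpy as np
from dataclasses import dataclass

N_LIDAR_RAYS = 30  # assumed number of lidar-like distance rays

@dataclass
class AgentAction:
    force_xy: np.ndarray  # (2,) force the agent applies to itself in x and y
    torque_z: float       # rotation about the z-axis
    grab: bool            # grab or release the object in front of the agent
    lock: bool            # lock or unlock the object in front of the agent

@dataclass
class AgentObservation:
    lidar: np.ndarray             # (N_LIDAR_RAYS,) distances to nearby surfaces
    visible_entities: np.ndarray  # (K, D) states of entities inside the view cone
    visible_mask: np.ndarray      # (K,) 1 if the entity is in line of sight, else 0
    self_state: np.ndarray        # own position, velocity, and heading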
Initially, the agents move randomly.
Seekers learn to chase hiders.
Hiders learn to grab and move boxes to block the doors.
Seekers learn to use the ramp to jump over obstacles.
Hiders learn to move the ramp inside the room to prevent seekers from using it.
We use the same training infrastructure and algorithms used to train OpenAI Five and Dactyl. However, in our environment each agent acts independently, using its own observations and hidden memory state. Agents use an entity-centric state-based representation of the world, which is permutation invariant with respect to objects and other agents.
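One common way to obtain a permutation-invariant, entity-centric representation is to embed every entity with a shared network and then pool over the entity dimension with a visibility mask. The sketch below assumes masked mean pooling purely for illustration; the architecture actually used for training may differ (for example, attention-based pooling).

import torch
import torch.nn as nn

class EntityEncoder(nn.Module):
    def __init__(self, entity_dim: int, hidden_dim: int = 128):
        super().__init__()
        # The same MLP is applied to every entity, so the pooled output
        # does not depend on the order of objects or other agents.
        self.embed = nn.Sequential(
            nn.Linear(entity_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )

    def forward(self, entities: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # entities: (batch, num_entities, entity_dim)
        # mask:     (batch, num_entities), 1 for visible entities, 0 otherwise
        emb = self.embed(entities)              # (B, K, H)
        mask = mask.unsqueeze(-1)               # (B, K, 1)
        summed = (emb * mask).sum(dim=1)        # drop masked-out entities
        count = mask.sum(dim=1).clamp(min=1.0)  # avoid division by zero
        return summed / count                   # order-invariant summary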
Since agents move by applying forces to themselves, they can grab a box while on top of it and "surf" it to the hider's location.
Because we do not add explicit negative rewards for agents leaving the play area, in rare cases hiders learn to grab a box and endlessly run with it.
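The passage above alludes to the kind of reward-shaping term that would suppress this behavior: an explicit penalty whenever an agent's position falls outside the play area. The sketch below is a hypothetical example; the boundary size and penalty magnitude are assumptions, not values from our environment.

import numpy as np

PLAY_AREA_HALF_EXTENT = 10.0   # assumed half-width of a square play area
OUT_OF_BOUNDS_PENALTY = -10.0  # assumed per-step penalty

def shaped_reward(base_reward: float, agent_xy: np.ndarray) -> float:
    """Add a penalty when the agent's x/y position is outside the play area."""
    if np.any(np.abs(agent_xy) > PLAY_AREA_HALF_EXTENT):
        return base_reward + OUT_OF_BOUNDS_PENALTY
    return base_reward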
Reinforcement learning is amazing at finding small mechanics to exploit. In these cases, agents learned to abuse contact physics in unexpected ways.
We propose using a suite of domain-specific intelligence tests that target capabilities we believe agents may eventually acquire. Transfer performance in these settings can act as a quantitative measure of representation quality or skill.
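As a rough illustration of how such a suite could be scored, the sketch below runs a frozen pretrained policy on a set of targeted transfer tasks and records the mean episode return per task. The Policy and task-constructor interfaces are hypothetical placeholders, not part of our released tooling.

from typing import Callable, Dict

def evaluate_transfer(policy, tasks: Dict[str, Callable], episodes: int = 100) -> Dict[str, float]:
    """Return the mean episode return of `policy` on each transfer task."""
    scores = {}
    for name, make_env in tasks.items():
        env = make_env()
        total = 0.0
        for _ in range(episodes):
            obs, done, ep_return = env.reset(), False, 0.0
            while not done:
                action = policy.act(obs)  # frozen, pretrained policy
                obs, reward, done, _ = env.step(action)
                ep_return += reward
            total += ep_return
        scores[name] = total / episodes
    return scores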