Bowen Baker∗ OpenAI
Ingmar Kanitscheider∗ OpenAI
Todor Markov∗ OpenAI
Yi Wu∗ OpenAI
Glenn Powell∗ OpenAI
Bob McGrew∗ OpenAI
Igor Mordatch∗† Google Brain
ABSTRACT

Through multi-agent competition, the simple objective of hide-and-seek, and standard reinforcement learning algorithms at scale, we find that agents create a self-supervised autocurriculum inducing multiple distinct rounds of emergent strategy, many of which require sophisticated tool use and coordination. We find clear evidence of six emergent phases in agent strategy in our environment, each of which creates a new pressure for the opposing team to adapt; for instance, agents learn to build multi-object shelters using moveable boxes, which in turn leads to agents discovering that they can overcome obstacles using ramps. We further provide evidence that multi-agent competition may scale better with increasing environment complexity and leads to behavior that centers around far more human-relevant skills than other self-supervised reinforcement learning methods such as intrinsic motivation. Finally, we propose transfer and fine-tuning as a way to quantitatively evaluate targeted capabilities, and we compare hide-and-seek agents to both intrinsic motivation and random initialization baselines in a suite of domain-specific intelligence tests.
1 INTRODUCTION

Creating intelligent artificial agents that can solve a wide variety of complex human-relevant tasks has been a long-standing challenge in the artificial intelligence community. Of particular relevance to humans will be agents that can sense and interact with objects in a physical world. One approach to creating these agents is to explicitly specify desired tasks and train a reinforcement learning (RL) agent to solve them. On this front, there has been much recent progress in solving physically grounded tasks, e.g. dexterous in-hand manipulation (Rajeswaran et al., 2017; Andrychowicz et al., 2018) or locomotion of complex bodies (Schulman et al., 2015; Heess et al., 2017). However, specifying reward functions or collecting demonstrations in order to supervise these tasks can be time-consuming and costly. Furthermore, the learned skills in these single-agent RL settings are inherently bounded by the task description; once the agent has learned to solve the task, there is little room to improve.
Due to the high likelihood that direct supervision will not scale to unboundedly complex tasks, many have worked on unsupervised exploration and skill acquisition methods such as intrinsic motivation. However, current undirected exploration methods scale poorly with environment complexity and are drastically different from the way organisms evolve on Earth. The vast amount of complexity and diversity on Earth evolved due to co-evolution and competition between organisms, directed by natural selection (Dawkins & Krebs, 1979). When a new successful strategy or mutation emerges, it changes the implicit task distribution neighboring agents need to solve and creates a new pressure
for adaptation. These evolutionary arms races create implicit autocurricula (Leibo et al., 2019a) whereby competing agents continually create new tasks for each other. There has been much success in leveraging multi-agent autocurricula to solve multi-player games, both in classic discrete games such as Backgammon (Tesauro, 1995) and Go (Silver et al., 2017), as well as in continuous real-time domains such as Dota (OpenAI, 2018) and Starcraft (Vinyals et al., 2019). Despite the impressive emergent complexity in these environments, the learned behavior is quite abstract and disembodied from the physical world. Our work sees itself in the tradition of previous studies that showcase emergent complexity in simple physically grounded environments (Sims, 1994a; Bansal et al., 2018; Jaderberg et al., 2019; Liu et al., 2019); the success in these settings inspires confidence that inducing autocurricula in physically grounded and open-ended environments could eventually enable agents to acquire an unbounded number of human-relevant skills.
We introduce a new mixed competitive and cooperative physics-based environment in which agents compete in a simple game of hide-and-seek. Through only a visibility-based reward function and competition, agents learn many emergent skills and strategies, including collaborative tool use, where agents intentionally change their environment to suit their needs. For example, hiders learn to create shelter from the seekers by barricading doors or constructing multi-object forts, and as a counter-strategy seekers learn to use ramps to jump into hiders’ shelter. Moreover, we observe signs of dynamic and growing complexity resulting from multi-agent competition and standard reinforcement learning algorithms; we find that agents go through as many as six distinct adaptations of strategy and counter-strategy, which are depicted in Figure 1. We further present evidence that multi-agent co-adaptation may scale better with environment complexity and qualitatively centers around more human-interpretable behavior than intrinsically motivated agents.
However, as environments increase in scale and multi-agent autocurricula become more open-ended, evaluating progress by qualitative observation will become intractable. We therefore propose a suite of targeted intelligence tests to measure capabilities in our environment that we believe our agents may eventually learn, e.g. object permanence (Baillargeon & Carey, 2012), navigation, and construction. We find that for a number of the tests, agents pretrained in hide-and-seek learn faster or achieve higher final performance than agents trained from scratch or pretrained with intrinsic motivation; however, we find that the performance differences are not drastic, indicating that much of the skill and feature representations learned in hide-and-seek are entangled and hard to fine-tune.
The main contributions of this work are: 1) clear evidence that multi-agent self-play can lead to emergent autocurricula with many distinct and compounding phase shifts in agent strategy, 2) evidence that when induced in a physically grounded environment, multi-agent autocurricula can lead to human-relevant skills such as tool use, 3) a proposal to use transfer as a framework for evaluating agents in open-ended environments as well as a suite of targeted intelligence tests for our domain, and 4) open-sourced environments and code for environment construction to encourage further research in physically grounded multi-agent autocurricula.
2 RELATED WORK
There is a long history of using self-play in multi-agent settings. Early work explored self-play using genetic algorithms (Paredis, 1995; Pollack et al., 1997; Rosin & Belew, 1995; Stanley & Miikkulainen, 2004). Sims (1994a) and Sims (1994b) studied the emergent complexity in morphology and behavior of creatures that coevolved in a simulated 3D world. Open-ended evolution was further explored in the environments Polyworld (Yaeger, 1994) and Geb (Channon et al., 1998), where agents compete and mate in a 2D world, and in Tierra (Ray, 1992) and Avida (Ofria & Wilke, 2004), where computer programs compete for computational resources. More recent work attempted to formulate necessary preconditions for open-ended evolution (Taylor, 2015; Soros & Stanley, 2014). Co-adaptation between agents and environments can also give rise to emergent complexity (Florensa et al., 2017; Sukhbaatar et al., 2018; Wang et al., 2019). In the context of multi-agent RL, Tesauro (1995), Silver et al. (2016), OpenAI (2018), Jaderberg et al. (2019) and Vinyals et al. (2019) used self-play with deep RL techniques to achieve super-human performance in Backgammon, Go, Dota, Capture-the-Flag and Starcraft, respectively. Bansal et al. (2018) trained agents in a simulated 3D physics environment to compete in various games such as sumo wrestling and soccer goal shooting. In Liu et al. (2019), agents learn to manipulate a soccer ball in a 3D soccer environment and discover emergent behaviors such as ball passing and interception. In addition, communication has also been shown to emerge from multi-agent RL (Sukhbaatar et al., 2016; Foerster et al., 2016; Lowe et al., 2017; Mordatch & Abbeel, 2018).
Intrinsic motivation methods have been widely studied in the literature (Chentanez et al., 2005; Singh et al., 2010). One example is count-based exploration, where agents are incentivized to reach infrequently visited states by maintaining state visitation counts (Strehl & Littman, 2008; Bellemare et al., 2016; Tang et al., 2017) or density estimators (Ostrovski et al., 2017; Burda et al., 2019b). Another paradigm is transition-based methods, in which agents are rewarded for high prediction error in a learned forward or inverse dynamics model (Schmidhuber, 1991; Stadie et al., 2015; Mohamed & Rezende, 2015; Houthooft et al., 2016; Achiam & Sastry, 2017; Pathak et al., 2017; Burda et al., 2019a; Haber et al., 2018). Jaques et al. (2019) consider multi-agent scenarios and adopt causal influence as a motivation for coordination. In our work, we utilize intrinsic motivation methods as an alternative exploration baseline to multi-agent autocurricula. Similar comparisons have also been made in Haber et al. (2018) and Leibo et al. (2019b).
Tool use is a hallmark of human and animal intelligence (Hunt, 1996; Shumaker et al., 2011); however, learning tool use in RL settings can be a hard exploration problem when rewards are unaligned. For example, in Forestier et al. (2017); Xie et al. (2019) a real-world robot learns to solve various tasks requiring tools. In Bapst et al. (2019), an agent solves construction tasks in a 2-D environment using both model-based and model-free methods. Allen et al. (2019) uses a combination of human-designed priors and model-based policy optimization to solve a collection of physics-based puzzles requiring tool use. However, in each of these works, agents were explicitly incentivized to interact with and use tools, whereas in our environment agents implicitly create this incentive through multi-agent competition.
3 HIDE AND SEEK
Agents are tasked with competing in a two-team hide-and-seek game in a physics-based environment. The hiders are tasked with avoiding line of sight from the seekers, and the seekers are tasked with keeping vision of the hiders. There are objects scattered throughout the environment that the agents can grab and also lock in place. There are also randomly generated immovable rooms and walls that the agents must learn to navigate. Before the game of hide-and-seek begins, the hiders are given a preparation phase where the seekers are immobilized, giving the hiders a chance to run away or change their environment.
There are no explicit incentives for agents to interact with objects in the environment; the only supervision given is through the hide-and-seek objective. Agents are given a team-based reward: hiders are given a reward of +1 if all hiders are hidden and -1 if any hider is seen by a seeker. Seekers are given the opposite reward, -1 if all hiders are hidden and +1 otherwise. To confine agent behavior to a reasonable space, agents are penalized with a reward of -10 if they go too far outside of the play area (outside an 18 meter square). An episode lasts 240 timesteps, the first 40% of which are the preparation phase, during which all agents are given zero reward.
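The reward scheme above is simple enough to state precisely; the following is an illustrative sketch (a hypothetical helper, not the released environment code) of the per-team reward logic:

```python
def hide_and_seek_rewards(any_hider_seen, n_hiders, n_seekers, in_prep_phase):
    """Team-based rewards as described above: hiders get +1 if all are
    hidden and -1 if any hider is seen; seekers receive the opposite.
    All agents receive zero reward during the preparation phase.
    (A further -10 per-agent penalty for leaving the 18 m play area
    would be added on top of these team rewards.)"""
    if in_prep_phase:
        return [0.0] * n_hiders, [0.0] * n_seekers
    hider_r = -1.0 if any_hider_seen else 1.0
    return [hider_r] * n_hiders, [-hider_r] * n_seekers
```

Because the reward is shared across a team, every hider is penalized when any single hider is spotted, which is what creates pressure for coordinated defenses such as fort building.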
We simulate the environment in the MuJoCo physics engine (Todorov et al., 2012). The world is populated with 1 to 3 hiders, 1 to 3 seekers, 3 to 9 movable boxes of which at least 3 are elongated, 2 movable ramps, and randomly generated static walls and rooms. We also experiment with a simpler, less randomized environment described in Appendix A.3. Agents observe the position, velocity, and size (in the case of the randomly shaped boxes) of objects and other agents. If entities are not in line-of-sight of the agent or not in a 135 degree cone in front of the agent, then they are masked out in the policy. Agents also have 30 range sensors arrayed evenly around them, similar to a lidar. In addition, each agent observes its own team and other agents’ teams as well as how much time is left in the preparation phase.
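As an illustration of the visibility masking described above, the following hypothetical helper checks whether an entity falls inside the 135-degree vision cone; the separate line-of-sight occlusion check performed by the physics engine is omitted:

```python
import math

def in_vision_cone(agent_pos, agent_yaw, entity_pos, half_angle_deg=67.5):
    """Return True if the entity lies within the agent's 135-degree
    forward vision cone (i.e., within +/- 67.5 degrees of its facing
    direction). Entities outside the cone would be masked out of the
    policy's observations. Illustrative only, not the authors' code."""
    dx = entity_pos[0] - agent_pos[0]
    dy = entity_pos[1] - agent_pos[1]
    angle_to_entity = math.atan2(dy, dx)
    # Wrap the angular difference into [-pi, pi] before comparing.
    diff = (angle_to_entity - agent_yaw + math.pi) % (2 * math.pi) - math.pi
    return abs(diff) <= math.radians(half_angle_deg)
```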
Agents are simulated as spherical objects and have 3 action types that can be chosen simultaneously at each time step. They may move by setting a discretized force along their x- and y-axes and a torque around their z-axis. They have a single binary action to grab objects, which binds the agent to the closest object while the action is enabled. Agents may also lock objects in place with a single binary action. Objects may be unlocked only by agents on the team of the agent who originally locked the object. Agents may only grab or lock objects that are in front of them and within a small radius.
4 POLICY OPTIMIZATION
Agents are trained using self-play, which acts as a natural curriculum as agents always play opponents of an appropriate level.
Agent policies are composed of two separate networks with different parameters – a policy network which produces an action distribution and a critic network which predicts the discounted future returns. Policies are optimized using Proximal Policy Optimization (PPO) (Schulman et al., 2017) and Generalized Advantage Estimation (GAE) (Schulman et al., 2015), and training is performed using rapid (OpenAI, 2018), a large-scale distributed RL framework. We utilize decentralized execution and centralized training. At execution time, each agent acts given only its own observations and memory state. At optimization time, we use a centralized omniscient value function for each agent, which has access to the full environment state without any information masked due to visibility, similar to Pinto et al. (2017); Lowe et al. (2017); Foerster et al. (2018).
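For background, the advantage targets consumed by PPO can be computed with GAE as in the sketch below; the discount and lambda defaults shown are illustrative, not the paper's training settings:

```python
def gae_advantages(rewards, values, last_value, gamma=0.998, lam=0.95):
    """Generalized Advantage Estimation (Schulman et al., 2015), sketch.
    values[t] is the critic's estimate at step t; last_value bootstraps
    the value after the final transition. Advantages are accumulated
    backwards: A_t = delta_t + gamma * lam * A_{t+1}."""
    advantages = [0.0] * len(rewards)
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages
```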
In all reported experiments, agents share the same policy parameters but act and observe indepen- dently; however, we found using separate policy parameters per agent also achieved all six stages of emergence but at reduced sample efficiency.
We utilize entity-centric observations (Džeroski et al., 2001; Diuk et al., 2008) and use attention mechanisms to capture object-level information (Duan et al., 2017; Zambaldi et al., 2018). As shown in Figure 2, we use a self-attention (Vaswani et al., 2017) based policy architecture over entities, which is permutation invariant and generalizes to varying numbers of entities. More details can be found in Appendix B.
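A minimal sketch of masked self-attention over entity embeddings conveys the core idea; projection matrices, multiple heads, and pooling are omitted, so this is not the paper's architecture, only an illustration of how unseen entities can be masked out while remaining permutation invariant:

```python
import numpy as np

def masked_entity_attention(entities, mask):
    """Single-head self-attention over an (n, d) matrix of entity
    embeddings. mask[i] is False for entities outside the agent's
    vision cone or line of sight; their attention weight is driven
    to zero by a large negative score before the softmax."""
    scores = entities @ entities.T / np.sqrt(entities.shape[1])
    scores = np.where(mask[None, :], scores, -1e9)  # hide unseen entities
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ entities
```

Because the softmax is taken over entities rather than a fixed-length vector, the same parameters handle any number of boxes, ramps, and agents, which is why this style of architecture generalizes across the randomized environments.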
Figure 3: Environment specific statistics used to track stages of emergence in hide-and-seek. We plot the mean across 3 seeds with each individual seed shown in a dotted line, and we overlay the 6 emergent phases of strategy: (1) Running and Chasing, (2) Fort Building, (3) Ramp Use, (4) Ramp Defense, (5) Box Surfing, (6) Surf Defense. We track the maximum movement of any box or ramp during the game as well as during the preparation phase (denoted with “Prep”). We similarly track how many objects of each type were locked at the end of the episode and preparation phase. As agents train, their interaction with the tools in their environment changes. For instance, as the agents learn to build forts they move boxes and lock boxes much more during the preparation phase.
5 AUTOCURRICULA AND EMERGENT BEHAVIOR
As agents train against each other in hide-and-seek, as many as six distinct strategies emerge, each of which creates a previously non-existing pressure for agents to progress to the next stage. Note that there are no direct incentives for agents to interact with objects or to explore, but rather the emergent strategies are solely a result of the autocurriculum induced by multi-agent competition. Figure 1 shows the progression of emergent strategies agents learn in our environment (see Appendix A.1 for trajectory traces of each strategy).
Initially, hiders and seekers learn to crudely run away and chase. After approximately 25 million episodes of hide-and-seek, the hiders learn to use the tools at their disposal and intentionally modify their environment. They begin to construct secure shelters in which to hide by moving many boxes together or against walls and locking them in place. After another 75 million episodes, the seekers also learn rudimentary tool use; they learn to move and use ramps to jump over obstacles, allowing them to enter the hiders’ shelter. 10 million episodes later, the hiders learn to defend against this strategy; the hiders learn to bring the ramps to the edge of the play area and lock them in place, seemingly removing the only tool the seekers have at their disposal.
We originally believed defending against ramp use would be the last stage of emergence in this environment; however, we were surprised to find that two more qualitatively new strategies emerged. After 380 million total episodes of training, the seekers learn to bring a box to the edge of the play area where the hiders have locked the ramps. The seekers then use the ramp to move on top of the box and surf it to the hiders’ shelter. This is possible because the agents’ movement action allows them to apply a force on themselves regardless of whether they are on the ground or not; if they do this while grabbing the box under them, the box will move with them while they are on top of it. In response, the hiders learn to lock all of the boxes in place before building their shelter.
In all stages of strategy agents must learn to coordinate within their team. Similar to Liu et al. (2019), we use team-based rewards such that agents are required to collaborate in order to succeed; however, in our work we require neither population-based training (Jaderberg et al., 2017) nor evolved dense rewards (Jaderberg et al., 2019). Notably, hiders learn efficient division of labor; for instance, when constructing shelter they often separately bring their own box to the construction area. Furthermore, as hiders attempt to defend against box surfing by locking boxes in place during the preparation phase, we find that 2 and 3 hider teams are able to lock 25% and 36% more boxes, respectively, than a single hider team, indicating that larger teams are able to divide labor and accomplish an objective a smaller team could not.
Many stages of emergent strategy can be mapped to behavioral shifts in the way agents interact with the tools in their environment, similar to Leibo et al. (2017); Perolat et al. (2017). We therefore track basic statistics about the agents’ interaction with objects during training, shown in Figure 3. For instance, as the hiders learn to build forts, they move and lock boxes much more during the preparation phase. Similarly, as the seekers learn to move and use ramps, the ramp movement in the main phase of the game increases, and as they learn to “box surf” there is a slight increase in the box movement during the main phase of the game. Finally, as the hiders learn to defend against this strategy by locking all boxes in place, the number of locked boxes in the preparation phase increases.
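The object-movement statistic plotted in Figure 3 can be computed as in this hypothetical helper, which tracks the maximum displacement of a single object from its initial position over an episode (or over just the preparation phase, by slicing the trajectory):

```python
def max_object_movement(trajectory):
    """Given per-timestep (x, y) positions of one box or ramp, return
    the maximum Euclidean displacement from its initial position.
    Illustrative reconstruction of the tracked statistic, not the
    authors' instrumentation code."""
    x0, y0 = trajectory[0]
    return max(((x - x0) ** 2 + (y - y0) ** 2) ** 0.5 for x, y in trajectory)
```

Taking the maximum over all objects of this quantity, per phase of the game, yields the kind of behavioral signal that spikes when a new strategy (e.g. fort building or ramp use) emerges.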
We found that scale plays a critical role in enabling progression through the emergent autocurricula in hide-and-seek. The default model, which uses a batch size of 64,000 and 1.6 million parameters, requires 132.3 million episodes (31.7 billion frames) over 34 hours of training to reach stage 4 of the skill progression, i.e. ramp defense. In Figure 4 we show the effect of varying the batch size on our agents’ ability to reach stage 4. We find that larger batch sizes lead to much quicker training time by virtue of reducing the number of required optimization steps, while only marginally affecting sample efficiency down to a batch size of 32,000; however, we found that experiments with batch sizes of 16,000 and 8,000 never converged.
We find the emergent autocurriculum to be fairly robust as long as we randomize the environment during training. If randomization is reduced, we find that fewer stages of the skill progression emerge, and at times less sophisticated strategies emerge instead (e.g. hiders can learn to run away and use boxes as moveable shields); see Appendix A.2 for more details. In addition, we find that design choices such as the minimum number of elongated boxes, or giving each agent its own locking mechanism instead of a team-based locking mechanism, can drastically increase the sample complexity. We also experimented with adding additional objects and objectives to our hide-and-seek environment as well as with several game variants instead of hide-and-seek (see Appendix A.6). We find that these alternative environments also lead to emergent tool use, providing further evidence that multi-agent interaction is a promising path towards self-supervised skill acquisition.
6 EVALUATION

In the previous section we presented evidence that hide-and-seek induces a multi-agent autocurriculum such that agents continuously learn new skills and strategies. As is the case with many unsupervised reinforcement learning methods, the objective being optimized does not directly incentivize the learned behavior, making evaluation of those behaviors nontrivial. Tracking reward is an insufficient evaluation metric in multi-agent settings, as it can be ambiguous in indicating whether agents are improving evenly or have stagnated. Metrics like Elo (Elo, 1978) or TrueSkill (Herbrich et al., 2007) can more reliably measure whether performance is improving relative to previous policy versions or other policies in a population; however, these metrics still do not give insight into whether improved performance stems from new adaptations or from improving previously learned skills. Finally, using environment-specific statistics such as object movement (see Figure 3) can also be ambiguous, e.g. the choice to track absolute movement does not illuminate which direction agents moved, and designing sufficient metrics will become difficult and costly as environments scale.
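For reference, the standard Elo update takes the following form; k=32 is a conventional choice from chess rating, not a value used in this work:

```python
def elo_update(rating_a, rating_b, score_a, k=32):
    """One Elo rating update for player A after a game against B.
    score_a is 1 for a win, 0.5 for a draw, 0 for a loss. The expected
    score is a logistic function of the rating difference, so an upset
    win moves ratings more than an expected one."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    return rating_a + k * (score_a - expected_a)
```

As noted above, such ratings only certify relative strength against other policies; they cannot distinguish a genuinely new adaptation from a refinement of an existing skill.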
In Section 6.1, we first qualitatively compare the behaviors learned in hide-and-seek to those learned from intrinsic motivation, a common paradigm for unsupervised exploration and skill acquisition. In Section 6.2, we then propose a suite of domain-specific intelligence tests to quantitatively measure and compare agent capabilities.
6.1 COMPARISON TO INTRINSIC MOTIVATION
Intrinsic motivation has become a popular paradigm for incentivizing unsupervised exploration and skill discovery, and there has been recent success in using intrinsic motivation to make progress in sparsely rewarded settings (Bellemare et al., 2016; Burda et al., 2019b). Because intrinsically motivated agents are incentivized to explore uniformly, it is conceivable that they may not have meaningful interactions with the environment (as with the “noisy-TV” problem (Burda et al., 2019a)). As a proxy for comparing meaningful interaction in the environment, we measure agent and object movement over the course of an episode.
We first compare behaviors learned in hide-and-seek to a count-based exploration baseline (Strehl & Littman, 2008) with an object invariant state representation, which is computed in a similar way as in the policy architecture in Figure 2. Count-based objectives are the simplest form of state density based incentives, where one explicitly keeps track of state visitation counts and rewards agents for reaching infrequently visited states (details can be found in Appendix D). In contrast to the original hide-and-seek environment where the initial locations of agents and objects are randomized, we restrict the initial locations to a quarter of the game area to ensure that the intrinsically motivated agents receive additional rewards for exploring.
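In its simplest form, the count-based bonus described above can be sketched as follows; the bonus coefficient and the choice of state discretization are assumptions of this sketch, not the paper's settings (see Appendix D of the original for those):

```python
from collections import defaultdict

def count_bonus(counts, state, coef=1.0):
    """Count-based exploration bonus: increment the visitation count of
    the (discretized, hashable) state and reward the agent proportionally
    to 1/sqrt(N(s)), so rarely visited states yield larger intrinsic
    rewards. counts is a defaultdict(int) shared across episodes."""
    counts[state] += 1
    return coef / counts[state] ** 0.5
```

The discussion above hinges on what goes into `state`: hand-picking only box positions makes the bonus drive object interaction, while including rotations, velocities, and multiple agents dilutes it, which is exactly the supervision burden the comparison highlights.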
We find that count-based exploration leads to the largest agent and box movement if the state representation only contains the 2-D location of boxes: the agent consistently interacts with objects and learns to navigate. Yet, when using progressively higher-dimensional state representations, such as box location, rotation and velocity or 1-3 agents with full observation space, agent movement and, in particular, box movement decrease substantially. This is a severe limitation because it indicates that, when faced with highly complex environments, count-based exploration techniques require identifying by hand the “interesting” dimensions in state space that are relevant for the behaviors one would like the agents to discover. Conversely, multi-agent self-play does not need this degree of supervision. We also train agents with random network distillation (RND) (Burda et al., 2019b), an intrinsic motivation method designed for high dimensional observation spaces, and find it to perform slightly better than count-based exploration in the full state setting.
6.2 TRANSFER AND FINE-TUNING AS EVALUATION
We propose to use transfer to a suite of domain-specific tasks in order to assess agent capabilities. To this end, we have created 5 benchmark intelligence tests that include both supervised and reinforcement learning tasks. The tests use the same action space, observation space, and types of objects as in the hide-and-seek environment. We examine whether pretraining agents in our multi-agent environment and then fine-tuning them on the evaluation suite leads to faster convergence or improved overall performance compared to training from scratch or pretraining with count-based intrinsic motivation. We find that on 3 out of 5 tasks, agents pretrained in the hide-and-seek environment learn faster and achieve a higher final reward than both baselines.
We categorize the 5 intelligence tests into 2 domains: cognition and memory tasks, and manipulation tasks. We briefly describe the tasks here; for the full task descriptions, see Appendix C. For all tasks, we reinitialize the parameters of the final dense layer and layernorm for both the policy and value networks.
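The head-reinitialization step can be sketched as follows; the parameter names and dictionary layout here are hypothetical, and this is not drawn from the original codebase:

```python
import numpy as np

def reinit_head(params, head_keys, rng, scale=0.01):
    """Keep pretrained weights but re-draw the final dense layer and
    layernorm parameters before fine-tuning on a transfer task.
    params maps (hypothetical) parameter names to arrays; head_keys
    names the parameters to reset."""
    fresh = dict(params)
    for key in head_keys:
        w = params[key]
        if key.endswith("layernorm_scale"):
            fresh[key] = np.ones_like(w)   # layernorm gain resets to 1
        elif key.endswith("bias") or key.endswith("layernorm_shift"):
            fresh[key] = np.zeros_like(w)  # biases and shifts reset to 0
        else:
            fresh[key] = rng.normal(0.0, scale, w.shape)  # dense weights
    return fresh
```

Resetting only the final layer and layernorm means transfer performance depends almost entirely on the quality of the frozen-then-fine-tuned body representations, which is the quantity the evaluation is designed to probe.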
Cognition and memory tasks:
In the Object Counting supervised task, we aim to measure whether the agents have a sense of object permanence; the agent is pinned to a location and watches as 6 boxes each randomly move to the right or left where they eventually become obscured by a wall. It is then asked to predict how many boxes have gone to each side for many timesteps after all boxes have disappeared. The agent’s policy parameters are frozen and we initialize a classification head off of the LSTM hidden state. In the baseline, the policy network has frozen random parameters and only the classification head off of the LSTM hidden state is trained.
In Lock and Return we aim to measure whether the agent can remember its original position while performing a new task. The agent must navigate an environment with 6 random rooms and 1 box, lock the box, and return to its starting position.
In Sequential Lock there are 4 boxes randomly placed in 3 random rooms without doors but with a ramp in each room. The agent needs to lock all the boxes in a particular order that is unobserved by the agent — a box becomes lockable only once the boxes before it in the order have been locked. The agent must discover the order, remember the position and status of visited boxes, and use ramps to navigate between rooms in order to finish the task efficiently.
Manipulation tasks: With these tasks we aim to measure whether the agents have any latent skill or representation useful for manipulating objects.
In the Construction From Blueprint task, there are 8 cubic boxes in an open room and between 1 and 4 target sites. The agent is tasked with placing a box on each target site.
In the Shelter Construction task there are 3 elongated boxes, 5 cubic boxes, and one static cylinder. The agent is tasked with building a shelter around the cylinder.
Results: In Figure 6 we show the performance on the suite of tasks for the hide-and-seek, count-based, and trained-from-scratch policies across 3 seeds. The hide-and-seek pretrained policy performs slightly better than both the count-based and the randomly initialized baselines in Lock and Return, Sequential Lock and Construction From Blueprint; however, it performs slightly worse than the count-based baseline on Object Counting, and it achieves the same final reward but learns slightly slower than the randomly initialized baseline on Shelter Construction.
We believe the cause for the mixed transfer results is rooted in agents learning skill representations that are entangled and difficult to fine-tune. We conjecture that tasks where hide-and-seek pretraining outperforms the baseline are due to reuse of learned feature representations, whereas better-than-baseline transfer on the remaining tasks would require reuse of learned skills, which is much more difficult. This evaluation metric highlights the need for developing techniques to reuse skills effectively from a policy trained in one environment to another. In addition, as future environments become more diverse and agents must use skills in more contexts, we may see more generalizable skill representations and more significant signal in this evaluation approach.
In Appendix A.5 we further evaluate policies sampled during each phase of emergent strategy on the suite of targeted intelligence tasks, by which we can gain intuition as to whether the capabilities we measure improve with training, are transient and accentuated during specific phases, or are generally uncorrelated to progressing through the autocurriculum. Notably, we find the agent’s memory improves through training, as indicated by performance in the navigation tasks; however, performance in the manipulation tasks is uncorrelated, and performance in object counting seems transient with respect to source hide-and-seek performance.
7 DISCUSSION AND FUTURE WORK
We have demonstrated that simple game rules, multi-agent competition, and standard reinforcement learning algorithms at scale can induce agents to learn complex strategies and skills. We observed emergence of as many as six distinct rounds of strategy and counter-strategy, suggesting that multi-agent self-play with simple game rules in sufficiently complex environments could lead to open-ended growth in complexity. We then proposed to use transfer as a method to evaluate learning progress in open-ended environments and introduced a suite of targeted intelligence tests with which to compare agents in our domain.
Our results with hide-and-seek should be viewed as a proof of concept showing that multi-agent autocurricula can lead to physically grounded and human-relevant behavior. We acknowledge that the strategy space in this environment is inherently bounded and likely will not surpass the six modes presented as is; however, because it is built in a high-fidelity physics simulator it is physically grounded and very extensible. In order to support further research in multi-agent autocurricula, we are open-sourcing our environment code.
Hide-and-seek agents require an enormous amount of experience to progress through the six stages of emergence, likely because the reward functions are not directly aligned with the resulting behavior. While we have found that standard reinforcement learning algorithms are sufficient, reducing sample complexity in these systems will be an important line of future research. Better policy learning algorithms or policy architectures are orthogonal to our work and could be used to improve sample efficiency and performance on transfer evaluation metrics.
We also found that agents were very skilled at exploiting small inaccuracies in the design of the environment, such as seekers surfing on boxes without touching the ground, hiders running away from the environment while shielding themselves with boxes, or agents exploiting inaccuracies of the physics simulation to their advantage. Investigating methods to generate environments without these unwanted behaviors is another important direction of future research (Amodei et al., 2016; Lehman et al., 2018).
We thank Pieter Abbeel, Rewon Child, Jeff Clune, Harri Edwards, Jessica Hamrick, Joel Leibo, John Schulman and Peter Welinder for their insightful comments on this manuscript. We also thank Alex Ray for writing parts of our open-sourced code.
Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177, 2018.
Victor Bapst, Alvaro Sanchez-Gonzalez, Carl Doersch, Kimberly Stachenfeld, Pushmeet Kohli, Peter Battaglia, and Jessica Hamrick. Structured agents for physical construction. In International Conference on Machine Learning, pp. 464–474, 2019.
Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pp. 1471–1479, 2016.
Yuri Burda, Harri Edwards, Deepak Pathak, Amos Storkey, Trevor Darrell, and Alexei A. Efros. Large-scale study of curiosity-driven learning. In International Conference on Learning Representations, 2019a.
Yan Duan, Marcin Andrychowicz, Bradly Stadie, Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. One-shot imitation learning. In Advances in Neural Information Processing Systems, pp. 1087–1098, 2017.
A.E. Elo. The rating of chessplayers, past and present. Arco Pub., 1978. ISBN 9780668047210. URL https://books.google.com/books?id=8pMnAQAAMAAJ.
Jakob Foerster, Ioannis Alexandros Assael, Nando de Freitas, and Shimon Whiteson. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2137–2145, 2016.
Jakob N Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
Nick Haber, Damian Mrowca, Stephanie Wang, Li F Fei-Fei, and Daniel L Yamins. Learning to play with intrinsically-motivated, self-aware agents. In Advances in Neural Information Processing Systems, pp. 8388–8399, 2018.
Nicolas Heess, Srinivasan Sriram, Jay Lemmon, Josh Merel, Greg Wayne, Yuval Tassa, Tom Erez, Ziyu Wang, SM Eslami, Martin Riedmiller, et al. Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286, 2017.
Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. Vime: Variational information maximizing exploration. In Advances in Neural Information Processing Systems, pp. 1109–1117, 2016.
Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M Czarnecki, Jeff Donahue, Ali Razavi, Oriol Vinyals, Tim Green, Iain Dunning, Karen Simonyan, et al. Population based training of neural networks. arXiv preprint arXiv:1711.09846, 2017.
Max Jaderberg, Wojciech M. Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia Castañeda, Charles Beattie, Neil C. Rabinowitz, Ari S. Morcos, Avraham Ruderman, Nicolas Sonnerat, Tim Green, Louise Deason, Joel Z. Leibo, David Silver, Demis Hassabis, Koray Kavukcuoglu, and Thore Graepel. Human-level performance in 3d multiplayer games with population-based reinforcement learning. Science, 364(6443):859–865, 2019. ISSN 0036-8075. doi: 10.1126/science.aau6249. URL https://science.sciencemag.org/content/364/6443/859.
Natasha Jaques, Angeliki Lazaridou, Edward Hughes, Caglar Gulcehre, Pedro Ortega, Dj Strouse, Joel Z Leibo, and Nando De Freitas. Social influence as intrinsic motivation for multi-agent deep reinforcement learning. In International Conference on Machine Learning, pp. 3040–3049, 2019.
Joel Lehman, Jeff Clune, Dusan Misevic, Christoph Adami, Lee Altenberg, Julie Beaulieu, Peter J Bentley, Samuel Bernard, Guillaume Beslon, David M Bryson, et al. The surprising creativity of digital evolution: A collection of anecdotes from the evolutionary computation and artificial life research communities. arXiv preprint arXiv:1803.03453, 2018.
Joel Z Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel. Multi-agent reinforcement learning in sequential social dilemmas. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pp. 464–473. International Foundation for Au- tonomous Agents and Multiagent Systems, 2017.
Joel Z Leibo, Edward Hughes, Marc Lanctot, and Thore Graepel. Autocurricula and the emergence of innovation from social interaction: A manifesto for multi-agent intelligence research. arXiv preprint arXiv:1903.00742, 2019a.
Joel Z Leibo, Julien Perolat, Edward Hughes, Steven Wheelwright, Adam H Marblestone, Edgar Duéñez-Guzmán, Peter Sunehag, Iain Dunning, and Thore Graepel. Malthusian reinforcement learning. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, pp. 1099–1107. International Foundation for Autonomous Agents and Multiagent Systems, 2019b.
Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pp. 6379–6390, 2017.
OpenAI. OpenAI Five. https://blog.openai.com/openai-five/, 2018.
Georg Ostrovski, Marc G Bellemare, Aäron van den Oord, and Rémi Munos. Count-based exploration with neural density models. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2721–2730. JMLR.org, 2017.
Julien Perolat, Joel Z Leibo, Vinicius Zambaldi, Charles Beattie, Karl Tuyls, and Thore Graepel. A multi-agent reinforcement learning model of common-pool resource appropriation. In Advances in Neural Information Processing Systems, pp. 3643–3652, 2017.
Jordan B Pollack, Alan D Blair, and Mark Land. Coevolution of a backgammon player. In Artificial Life V: Proc. of the Fifth Int. Workshop on the Synthesis and Simulation of Living Systems, pp. 92–98. Cambridge, MA: The MIT Press, 1997.
Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv preprint arXiv:1709.10087, 2017.
Jürgen Schmidhuber. A possibility for implementing curiosity and boredom in model-building neural controllers. In Proc. of the international conference on simulation of adaptive behavior: From animals to animats, pp. 222–227, 1991.
David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017.
Satinder Singh, Richard L Lewis, Andrew G Barto, and Jonathan Sorg. Intrinsically motivated reinforcement learning: An evolutionary perspective. IEEE Transactions on Autonomous Mental Development, 2(2):70–82, 2010.
L Soros and Kenneth Stanley. Identifying necessary conditions for open-ended evolution through the artificial life world of chromaria. In Artificial Life Conference Proceedings 14, pp. 793–800. MIT Press, 2014.
Sainbayar Sukhbaatar, Zeming Lin, Ilya Kostrikov, Gabriel Synnaeve, Arthur Szlam, and Rob Fergus. Intrinsic motivation and automatic curricula via asymmetric self-play. In International Conference on Learning Representations, 2018.
Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. #Exploration: A study of count-based exploration for deep reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2753–2762, 2017.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008, 2017.
Oriol Vinyals, Igor Babuschkin, Junyoung Chung, Michael Mathieu, Max Jaderberg, et al. AlphaStar: Mastering the real-time strategy game StarCraft II. https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/, 2019.
Rui Wang, Joel Lehman, Jeff Clune, and Kenneth O Stanley. Paired open-ended trailblazer (poet): Endlessly generating increasingly complex and diverse learning environments and their solutions. arXiv preprint arXiv:1901.01753, 2019.
Larry Yaeger. Computational genetics, physiology, metabolism, neural systems, learning, vision, and behavior or PolyWorld: Life in a new context. In Santa Fe Institute Studies in the Sciences of Complexity, volume 17, pp. 263–263. Addison-Wesley, 1994.
Vinicius Zambaldi, David Raposo, Adam Santoro, Victor Bapst, Yujia Li, Igor Babuschkin, Karl Tuyls, David Reichert, Timothy Lillicrap, Edward Lockhart, et al. Relational deep reinforcement learning. arXiv preprint arXiv:1806.01830, 2018.
Table of Contents
- A Further Emergence Results
  - A.1 Trajectory Traces From Each Stage of Emergent Strategy
  - A.2 Dependence of Skill Emergence on Randomness in the Training Distribution of Environments
  - A.3 Quadrant Environment
  - A.4 Further Ablations
  - A.5 Evaluating Agents at Different Phases of Emergence
  - A.6 Alternative Games to Hide-and-Seek with Secondary Objectives
  - A.7 Zero-shot Generalization
- B Optimization Details
  - B.1 Notation
  - B.2 Proximal Policy Optimization (PPO)
  - B.3 Generalized Advantage Estimation
  - B.4 Normalization of Observations, Advantage Targets and Value Function Targets
  - B.5 Optimization Setup
  - B.6 Optimization Hyperparameters
  - B.7 Policy Architecture Details
- C Intelligence Test Suite Details
  - C.1 Cognition and Memory Task
  - C.2 Manipulation Task
- D Intrinsic Motivation Methods
  - D.1 Count-Based Exploration
  - D.2 Random Network Distillation
  - D.3 Fine-tuning From Intrinsic Motivation Variants
A FURTHER EMERGENCE RESULTS
A.1 TRAJECTORY TRACES FROM EACH STAGE OF EMERGENT STRATEGY
A.2 DEPENDENCE OF SKILL EMERGENCE ON RANDOMNESS IN THE TRAINING DISTRIBUTION OF ENVIRONMENTS
Discovering tool use is a difficult exploration problem, as only a very specific set of actions will lead to an increase in reward. During training, we find that randomization over many components of our environment, such as the number of agents in each team, the number of boxes, the shape of boxes (square or elongated), the initial locations of agents and objects, and the presence and location of interior and exterior walls, helps emergence despite leading to a broader training distribution. As we successively reduce the amount of randomization, we find that fewer stages of skill progression emerge, at times with less sophisticated strategies (e.g. hiders learn only to run away and use boxes as moveable shields); see Table A.1.
Table A.1: Number of stages of emergence for different levels of environmental randomization (batch size is held constant).
A.3 QUADRANT ENVIRONMENT
In addition to the default random rooms environment described above, we also train and evaluate agents in the much simpler quadrant environment (Figure A.2) for illustration purposes. This environment consists of a room in the lower right corner that is connected by 1-2 doors to the remainder of the environment. Unlike in the main environment, the interior walls are fixed and only the locations of the doors vary from episode to episode. Hiders are spawned randomly in the environment, but seekers are only spawned outside the room. Two cubic boxes are spawned inside the room and one ramp is spawned inside or outside the room. Both hiders and seekers receive a negative reward if they leave the environment or if they bring an object outside the environment. Also, in contrast to the main environment, only boxes, not ramps, can be locked in place.
The quadrant environment gives rise to a similar skill progression as in the main environment (Figure A.2): basic running and chasing, then fort building by hiders, ramp use by seekers, and ramp defense by hiders. Hiders create shelter by pushing boxes into the doors of the room in the lower right corner, preventing seekers from entering. Seekers drag ramps to the exterior walls of the room and then use them to jump inside. As ramps are not lockable in this environment, hiders instead defend against ramp use by pulling the ramp inside the room before securing the wall opening with boxes.
A.4 FURTHER ABLATIONS
In Figure A.3 we compare the performance between a masked and an omniscient value function, as well as a pure pooling architecture versus self-attention. We find that using an omniscient value function, meaning that the value function has access to the full state of the unobscured environment, is critical to progressing through the emergent autocurricula at the given scale. We found that with the same compute budget, training with a masked value function never progressed past stage 3 (ramp usage). We further found that our self-attention architecture increases sample efficiency compared to an architecture that only embeds and then pools entities with a similar number of parameters. However, because self-attention requires more compute despite having the same number of parameters, wall-clock time to convergence is slightly slower.
A.5 EVALUATING AGENTS AT DIFFERENT PHASES OF EMERGENCE
In Figure A.4 we evaluate policies sampled during each phase of emergent strategy on the suite of targeted intelligence tasks, through which we can gain intuition as to whether the capabilities we measure improve with training, are transient and accentuated during specific phases, or are generally uncorrelated with progress through the autocurriculum. We find that the hide-and-seek agent improves on the navigation and memory tasks as it progresses; notably, on Lock and Return, performance monotonically increases with emergence phase, and the policy from phase 6 performs 20% better than the policy from phase 1. However, performance on Object Counting is transient; during phase 1 the hide-and-seek agent performs extremely well, much better than all baselines and other phases, but loses this ability in later stages. Finally, we find that performance on the manipulation tasks is relatively uncorrelated with the phases of emergence, and surprisingly the policy transferred from phase 1, the phase before any tool use emerges, performs comparably to other phases.
A.6 ALTERNATIVE GAMES TO HIDE-AND-SEEK WITH SECONDARY OBJECTIVES
A.6.1 HIDE-AND-SEEK WITH FOOD REWARD
In the main hide-and-seek environment, hiders build forts at locations that would give them the best chance of staying hidden from seekers. To test whether hiders could be incentivized by a secondary objective to adapt the location of forts, we add additional food rewards and test whether hiders would bias the location of their forts towards the location of the food. In this environment, hiders can eat food and receive food rewards only under the following conditions: after the preparation phase, when all hiders are hidden from seekers and when the food is sufficiently close and visible to them. Therefore, just chasing after food would not be an effective strategy for hiders, because hiders receive neither hide-and-seek nor food rewards if one or more hiders are seen by a seeker. Instead, hiders are incentivized to build forts around the location of food and then eat the food while being unobserved by seekers (Figure A.5).
Food is distributed in the form of 5 food pellets that are spawned close to each other in a rectangular area in the center of the environment whose side length is 1/4 of the room size. Each food pellet can provide a positive food reward of +1 at each time step. Food rewards are shared between hiders, irrespective of which hider eats a food item. Hiders (as well as seekers) can observe the locations of food pellets as separate entities; if food pellets are obstructed or outside the field of view, they are masked out like other types of objects.
As shown in Figure A.6, this environment gives rise to four levels of skill progression, similar to that of the main environment: basic running and chasing, then fort building by hiders, ramp use by seekers, and ramp defense by hiders. Moreover, hiders consume food, and food consumption is highly correlated with their ability to construct stable forts: food consumption decreases during the initial phase as seekers get better at chasing and thereby prevent hiders from eating. Food consumption then increases again as hiders learn to construct forts and shield themselves from the view of seekers, but plateaus once seekers learn to use ramps. Finally, food consumption rises again as hiders get better at defending against ramp use.
A.6.2 HIDE-AND-SEEK WITH DYNAMIC FOOD
In the food variant introduced in Sec. A.6.1, the location of food pellets is fixed throughout the episode. Here we consider a dynamic food variant in which a food pellet is eaten up, i.e., disappears, when a hider is close to it, after which a new food pellet appears in a different location, still within the center region. More precisely, the game area is simply an empty room without outside walls containing 2 seekers, 3 hiders, 1 dynamic food pellet and 8 elongated boxes. The food is always located within a square in the center of the environment with side length 1/5 of the game area size. We inherit the same reward and policy structures from the previous game.
In this game, merely building a center fort is not sufficient for obtaining the highest reward, since the food might disappear and respawn outside the fort. The agent must ensure that the fort is large enough that all possible spawning positions of the food are inside it. Such behavior does emerge after training for around 4.5 × 10¹⁰ samples.
We also experimented with variants where the food spawning region has a different side length. When the side length is reduced to 1/6 of the game area, the same behavior emerges faster, taking 1.5 × 10¹⁰ samples. However, when the side length of the dynamic food region is increased to 1/4 of the game area, hiders converge to a policy that ignores the food and only builds a small fort to protect themselves.
A.6.3 FOOD PROTECTION GAME
In the previous two variants, we introduced an extra food reward for hiders in addition to the original hide-and-seek reward to promote more goal-oriented behavior. Now we consider a different game rule in which the competition between hiders and seekers depends only on the food collecting reward, and show that this rule can also lead to tool use and complex behavior.
We consider an empty game area surrounded by walls that contains 50 food pellets randomly distributed in a center square of size 2/3 of the game environment. There are 3 hiders, 2 seekers and 7 elongated boxes in the game. The only goal for seekers is to collect food. Once a food pellet is collected by any of the seekers, a +3 reward is given to all seekers and the food disappears permanently from the game. The goal for hiders is to protect the food from seekers; their reward is simply the negative of the seekers' reward. Each episode consists of 200 time steps. The preparation period, in which only hiders can move, extends to the first 100 time steps. The last 60 steps of the game correspond to a food collecting period, during which the hiders cannot move. Additionally, after the preparation phase we also add a -1 "boundary penalty" for hiders when they are too close to a wall, to ensure they stay within the food region.
Our initial motivation for this task was to encourage hiders to construct complex and large fort structures under a more direct competitive pressure. For example, we expected that the agents would learn to build a large fort covering as much food as possible. Unfortunately, such behavior did not emerge. However, after around 7 × 10¹⁰ samples of training, the hiders surprisingly discovered an extremely efficient and effective strategy: they learned to physically 'push' the immobile seekers to the corner of the environment during the preparation phase and then collaboratively use elongated boxes to trap the seekers in the corner. Figure A.7 illustrates a trajectory corresponding to this strategy. Interestingly, the hiders demonstrate strong collaboration: since there are 2 seekers and 3 hiders, while 2 hiders push the seekers to the corner, the 3rd hider follows the 'pushing' hiders at a distance, and once the pushing hiders finish the job, this 3rd hider quickly grabs a box to build the trap without blocking the escape route of the 2 pushing hiders. We emphasize that this behavior was completely surprising to us and further indicates that complex behavior can emerge from multi-agent self-play under a wide range of competition rules.
Lastly, we also experimented with different versions of the close-to-wall "boundary penalty". In all cases, we observe similar trapping behaviors, namely blocking the seekers into the corner, with small variations. When we remove the penalty, the behavior emerges twice as fast (3.5 × 10¹⁰ samples) and 1 hider stays in the trap with the seekers instead of getting outside. When we add the penalty throughout the episode, the hiders even learn to stay in the center region while grabbing an elongated box as a tool to push the seekers towards the corner. However, the behavior emerges twice as slowly in this setting, requiring 1.4 × 10¹¹ samples.
A.7 ZERO-SHOT GENERALIZATION
We find that the trained policies can zero-shot generalize to environments with a slightly larger number of hiders, ramps or boxes (Figure A.8). The hider reward provides a reasonably good measure of generalization performance, since the modifications in question make constructing safe shelter more difficult. First, we look at increasing the number of hiders from 1-3, as in the training distribution, to 4-6; increasing the number of hiders decreases the hider reward because all hiders need to stay hidden from seekers. However, the decline in hider reward is very gradual, indicating that the policy generalizes well to 4 or more hiders. A similar effect occurs when increasing the number of ramps, because hiders need to secure more ramps from seekers: if we increase the number of ramps from 2 to 3 or 4, the hider reward drops only gradually. Finally, we find hider performance is remarkably stable, though still slowly declining, when increasing the number of boxes.
B OPTIMIZATION DETAILS
B.1 NOTATION
We consider the standard multi-agent reinforcement learning formalism of N agents interacting with each other in an environment. This interaction is defined by a set of states S describing the state of the world and configurations of all agents, sets of observations O_1, . . . , O_N of all agents, sets of actions A_1, . . . , A_N of all agents, a transition function T : S × A_1 × . . . × A_N → S determining the distribution over next states, and a reward for each agent i which is a function of the state and the agent's action. Agents choose their actions according to a stochastic policy π_θi : O_i × A_i → [0, 1], where θ_i are the parameters of the policy. In our formulation, policies are shared between agents, π_θi = π_θ, and the set of observations contains information about which role (e.g. hider or seeker) the agent will be rewarded for. Each agent i aims to maximize its total expected discounted return R_i = E[Σ_{t=0}^{H} γ^t r_{i,t}], where H is the horizon length and γ is a time discounting factor that biases agents towards preferring short-term rewards over long-term rewards. The action-value function is defined as Q^π_i(s_t, a_i) = E[R_i | s_t, a_i], while the state-value function is defined as V^π_i(s_t) = E[R_i | s_t]. The advantage function A^π_i(s_t, a_i) := Q^π_i(s_t, a_i) − V^π_i(s_t) describes whether taking action a_i is better or worse for agent i in state s_t than the average action of policy π_i.
B.2 PROXIMAL POLICY OPTIMIZATION (PPO)
Policy gradient methods aim to estimate the gradient of the policy parameters with respect to the discounted sum of rewards, which is often non-differentiable. A typical estimator of the policy gradient is ĝ := E[Â_t ∇_θ log π_θ(a_t | s_t)], where Â_t is an estimate of the advantage function. PPO (Schulman et al., 2017), a policy gradient variant, penalizes large changes to the policy to prevent training instabilities. PPO optimizes the objective L = E[min(l_t(θ)Â_t, clip(l_t(θ), 1 − ε, 1 + ε)Â_t)], where l_t(θ) = π_θ(a_t|s_t)/π_old(a_t|s_t) denotes the likelihood ratio between the new and old policies and clip(l_t(θ), 1 − ε, 1 + ε) clips l_t(θ) to the interval [1 − ε, 1 + ε].
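As a concrete illustration, here is a minimal NumPy sketch of the clipped surrogate objective above (the function name and toy inputs are ours; this is not the paper's actual implementation):

```python
import numpy as np

def ppo_clipped_objective(logp_new, logp_old, advantages, eps=0.2):
    """Per-batch PPO clipped surrogate objective (to be maximized).

    logp_new / logp_old: log pi(a_t|s_t) under the new and old policies;
    advantages: estimates A_hat_t; eps: the clipping parameter epsilon
    (0.2 in the hyperparameter table below).
    """
    ratio = np.exp(logp_new - logp_old)            # l_t(theta)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # Taking the elementwise min makes the objective pessimistic: large
    # policy changes cannot increase it beyond the clipped value.
    return np.minimum(ratio * advantages, clipped * advantages).mean()
```

With `logp_new == logp_old` the ratio is 1 and the objective reduces to the mean advantage; large ratios with positive advantage are capped at `(1 + eps) * advantage`.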
B.3 GENERALIZED ADVANTAGE ESTIMATION
The generalized advantage estimate with horizon H is Â^H_t = Σ_{l=0}^{H−1} (γλ)^l δ_{t+l}, where δ_t = r_t + γV(s_{t+1}) − V(s_t) is the TD residual, γ, λ ∈ [0, 1] are discount factors that control the bias-variance tradeoff of the estimator, V(s_t) is the value predicted by the value function network, and we set V(s_t) = 0 if s_t is the last step of an episode. This estimator obeys the reverse recurrence relation Â^H_t = δ_t + γλ Â^{H−1}_{t+1}.
We calculate advantage targets by concatenating episodes from policy rollouts and truncating them into windows of T = 160 time steps (episodes contain 240 time steps). If a window (s_0, . . . , s_{T−1}) was generated within a single episode we use the advantage targets (Â^{H=T}_0, Â^{H=T−1}_1, . . . , Â^{H=1}_{T−1}). If a new episode starts at time step j we use the advantage targets (Â^{H=j}_0, Â^{H=j−1}_1, . . . , Â^{H=1}_{j−1}, Â^{H=T−j}_j, . . . , Â^{H=1}_{T−1}).
Similarly, we use as targets for the value function (Ĝ^{H=T}_0, Ĝ^{H=T−1}_1, . . . , Ĝ^{H=1}_{T−1}) for a window generated within a single episode and (Ĝ^{H=j}_0, Ĝ^{H=j−1}_1, . . . , Ĝ^{H=1}_{j−1}, Ĝ^{H=T−j}_j, . . . , Ĝ^{H=1}_{T−1}) if a new episode starts at time step j, where the return estimator is given by Ĝ^H_t := Â^H_t + V(s_t). This value function estimator corresponds to the TD(λ) estimator (Sutton & Barto, 2018).
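The reverse recurrence and the return estimator Ĝ = Â + V can be sketched as follows (a minimal single-window version; the function name and default γ, λ values are ours, not the paper's settings):

```python
import numpy as np

def gae_and_value_targets(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Compute GAE advantage targets via A_t = delta_t + gamma*lam*A_{t+1}
    and TD(lambda) value targets G_t = A_t + V(s_t) for one window.

    rewards: r_0..r_{T-1}; values: V(s_0)..V(s_{T-1}); last_value: V(s_T),
    set to 0 when the episode ends at the window boundary.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    T = len(rewards)
    adv = np.zeros(T)
    next_adv, next_value = 0.0, last_value
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        adv[t] = delta + gamma * lam * next_adv
        next_adv, next_value = adv[t], values[t]
    return adv, adv + values  # (advantage targets, value targets)
```

With γ = λ = 1 and zero value predictions, the advantage at each step is simply the sum of the remaining rewards in the window.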
B.4 NORMALIZATION OF OBSERVATIONS, ADVANTAGE TARGETS AND VALUE FUNCTION TARGETS
We normalize observations, advantage targets and value function targets. Advantage targets are z-scored over each buffer before each optimization step. Observations and value function targets are z-scored using mean and variance estimates obtained from a running estimator with decay parameter 1 − 10⁻⁵ per optimization substep.
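One way to implement the running z-scoring described above is an exponential-moving-average mean/variance estimator (the class below is a sketch; the decay value comes from the text, but the exact update rule used in the paper is not specified here):

```python
class RunningNormalizer:
    """EMA estimator of mean and variance for online z-scoring."""

    def __init__(self, decay=1.0 - 1e-5, eps=1e-8):
        self.decay, self.eps = decay, eps
        self.mean, self.sq_mean = 0.0, 0.0
        self.initialized = False

    def update(self, x):
        if not self.initialized:
            self.mean, self.sq_mean = x, x * x
            self.initialized = True
        else:
            self.mean = self.decay * self.mean + (1 - self.decay) * x
            self.sq_mean = self.decay * self.sq_mean + (1 - self.decay) * x * x

    def normalize(self, x):
        # Variance via E[x^2] - E[x]^2, clamped at zero for stability.
        var = max(self.sq_mean - self.mean ** 2, 0.0)
        return (x - self.mean) / ((var + self.eps) ** 0.5)
```

A decay of 1 − 10⁻⁵ makes the statistics change very slowly, so normalization stays nearly stationary across optimization substeps.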
B.5 OPTIMIZATION SETUP
Training is performed using the distributed rapid framework (OpenAI, 2018). Using current policy and value function parameters, CPU machines roll out the policy in the environment, collect rewards, and compute advantage and value function targets. Rollouts are cut into windows of 160 timesteps and reformatted into 16 chunks of 10 timesteps (the BPTT truncation length). The rollouts are then collected in a training buffer of 320,000 chunks. Each optimization step consists of 60 SGD substeps using Adam with mini-batch size 64,000. One rollout chunk is used for at most 4 optimization steps. This ensures that the training buffer stays sufficiently on-policy.
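The reformatting of rollout windows into BPTT chunks can be sketched as follows (array shapes and the function name are illustrative; the actual data pipeline is not described in more detail):

```python
import numpy as np

def to_bptt_chunks(window, chunk_len=10):
    """Cut one rollout window into chunks of `chunk_len` timesteps,
    matching the BPTT truncation length. A 160-step window yields
    16 chunks of 10 timesteps.

    window: (T, feature_dim) array with T divisible by chunk_len.
    """
    T, d = window.shape
    assert T % chunk_len == 0, "window length must be a multiple of chunk_len"
    return window.reshape(T // chunk_len, chunk_len, d)
```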
B.6 OPTIMIZATION HYPERPARAMETERS
Our optimization hyperparameter settings are as follows:
| Hyperparameter | Value |
| --- | --- |
| Mini-batch size | 64,000 chunks of 10 timesteps |
| Learning rate | 3 × 10⁻⁴ |
| PPO clipping parameter ε | 0.2 |
| Max GAE horizon length T | 160 |
| BPTT truncation length | 10 |
B.7 POLICY ARCHITECTURE DETAILS
Lidar observations are first passed through a circular 1D-convolution and concatenated onto the agent's representation of self, x_self. Each object is concatenated with x_self and then embedded with a dense layer whose parameters are shared between objects of the same type, e.g. all boxes share the same embedding weights. All embedded entities are then passed through a residual self-attention block, similar to Vaswani et al. (2017) but without position embeddings, of the form y = dense(self_attention(x)) + x. We then average-pool entity embeddings and concatenate this pooled representation to x_self. Note that in the policy network, entities not observed by each agent are masked away in the self-attention and pooling steps. Finally, this pooled representation is passed through another dense layer and an LSTM (Hochreiter & Schmidhuber, 1997) before splitting off separate action heads for each of the 3 action types described in Section 3. We also add layer normalization (Ba et al., 2016) to every hidden layer of the policy network except the 1D-convolution layer. We empirically observe that layer normalization leads to faster training and better transfer performance.
| Hyperparameter | Value |
| --- | --- |
| Size of embedding layer | 128 |
| Size of MLP layer | 256 |
| Size of LSTM layer | 256 |
| Residual attention layer | 4 attention heads of size 32 |
| Weight decay coefficient | 10⁻⁶ |
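The masked residual self-attention with entity pooling described above can be sketched as a single-head NumPy toy (a simplification: the actual architecture uses 4 heads of size 32, dense layers and layer normalization; the function name and weight layout are ours):

```python
import numpy as np

def masked_entity_pool(entities, mask, w_qkv, w_out):
    """Single-head residual self-attention over entities, followed by
    masked average pooling over observed entities only.

    entities: (n, d) embedded entity features;
    mask: (n,) 1.0 if the entity is observed by the agent, else 0.0;
    w_qkv: tuple of three (d, d) projection matrices; w_out: (d, d).
    """
    n, d = entities.shape
    q, k, v = (entities @ w for w in w_qkv)
    scores = q @ k.T / np.sqrt(d)                      # (n, n)
    scores = np.where(mask[None, :] > 0, scores, -1e9)  # hide unseen keys
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    y = (attn @ v) @ w_out + entities                  # residual connection
    # Average-pool only over entities the agent actually observes.
    return (y * mask[:, None]).sum(axis=0) / max(mask.sum(), 1.0)
```

Masking both the attention keys and the pooling ensures that unobserved entities cannot influence the policy's pooled representation.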
C INTELLIGENCE TEST SUITE DETAILS
All evaluation tasks in the intelligence test suite are single-agent tasks. In order to align the observations between the hide-and-seek environment and the evaluation tasks, we add fake hider and preparation-phase observations and set them all to 0.
C.1 COGNITION AND MEMORY TASK
All the variations of cognition and memory tasks have a horizon of 120 timesteps, and all boxes are cuboids.
Object Counting:
The agent is pinned in place and watches as 6 boxes move either left or right at random. The boxes disappear behind walls such that the agent can no longer see them, and the agent is asked to predict how many boxes went left or right long after all the boxes have disappeared. In this test we evaluate the quality of the existing representation by holding the agent's policy fixed and only training a new classification head from the agent's LSTM state. The classification head first passes the LSTM state through a layernorm and a single dense layer with 64 units. We then perform 7-class classification, predicting whether 0 through 6 boxes have gone to the left.
Lock and Return:
In this game, the agent needs to navigate towards a hidden box, lock it, and then return to its starting position.
The game area has 6 randomly generated connected rooms with static walls and 1 box. When the box is locked, the agent is given a reward of +5. If the agent unlocks the box during the episode, a -5 penalty is given. Additionally, if the box remains unlocked at the end of the episode, the agent is given another -5 penalty. Success is achieved when the agent returns to within a 0.1 radius of its starting location with the box locked. To promote fast task completion, we give the agent a +1 reward for each timestep of success. We also introduce a shaped reward with coefficient 0.5 for easier learning: at each time step, the shaped reward is the decrease in distance between the agent and the target (either the unlocked box or the starting location).
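Putting the bonuses and the shaping term together, the per-step reward can be sketched as (the `event` encoding and function name are hypothetical; the paper only specifies the bonus magnitudes and the 0.5 coefficient):

```python
def lock_and_return_reward(event, prev_dist, cur_dist, coef=0.5):
    """Illustrative per-step reward for Lock and Return.

    event: None, 'lock' (+5), 'unlock' (-5) or 'success' (+1 per step);
    prev_dist / cur_dist: distance to the current target (the unlocked
    box, then the starting location) at the previous and current step.
    The shaping term pays coef times the decrease in that distance.
    """
    bonus = {'lock': 5.0, 'unlock': -5.0, 'success': 1.0}.get(event, 0.0)
    return bonus + coef * (prev_dist - cur_dist)
```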
Sequential Lock:
There are 4 boxes, and the agent needs to lock all of them sequentially in an unobserved order: a box can only be locked if it is locked in the correct order.
The game area is randomly partitioned into three rooms by 2 walls, and the 4 boxes are randomly placed in the game area. Each room has a ramp, and the agent has to use the ramps to navigate between rooms. When a box is successfully locked (according to the order), a +5 bonus is given. If a box is unlocked, a -5 penalty is added. When all boxes are locked, the agent receives a +1 per-timestep success bonus. We also use the same shaped distance reward as in the Lock and Return task.
C.2 MANIPULATION TASKS
All variations of the manipulation task have 8 boxes, but no ramps.
Construction from Blueprint:
The horizon is at most 240 timesteps, but an episode can end early if the agent successfully finishes the construction. The game area is an empty room. The locations of the construction sites are sampled uniformly at random (we use rejection sampling to ensure that construction sites do not overlap).
For each construction site, agents observe its position and that of its 4 corners. Since there are no construction sites in the hide-and-seek game or the count-based baseline environments, we need to change our policy architecture to integrate the new observations. Each construction site observation is concatenated with x_self and then embedded through a new dense layer shared across all sites. This dense layer is randomly initialized and added to the multi-agent and count-based policies before the start of training. The embedded construction site representations are then concatenated with all other embedded object representations before the residual self-attention block, and the rest of the architecture is the same as the one used in the hide-and-seek game.
The reward at each timestep is equal to a reward scale constant times the mean of the smooth minimum of the distances between each construction site corner and every box corner. Let there be k construction sites and n boxes, let d_ij be the distance between construction site corner i and box corner j, and let d_i be the smooth minimum of the distances from construction site corner i to all box corners. The reward at each timestep is then

$$d_i = \frac{\sum_{j=1}^{4n} d_{ij} e^{\alpha d_{ij}}}{\sum_{j=1}^{4n} e^{\alpha d_{ij}}}, \qquad r_t = s_d \cdot \frac{1}{4k} \sum_{i=1}^{4k} d_i$$
Here, s_d is the reward scale parameter and α is the smoothness hyperparameter (α must be non-positive; α = 0 gives us the mean, and α → −∞ gives us the regular min function). In addition, when all construction sites have a box placed within a certain distance d_min of them, and all construction site corners have a box corner located within d_min of them, the episode ends and all agents receive a reward equal to s_c · k, where s_c is a separate reward scale parameter. For our experiments, n = 8 and k is randomly sampled between 1 and 4 (inclusive) every episode. The hyperparameter values we use for the reward are the following:
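A numpy sketch of this reward (the exponentially weighted smooth minimum is one standard realization consistent with the α = 0 and α → −∞ limits described above; the reward scale and the random distance matrix are illustrative):

```python
import numpy as np

def smooth_min(d, alpha):
    """Exponentially weighted smooth minimum over the last axis.
    alpha = 0 recovers the mean; alpha -> -inf recovers the true min."""
    w = np.exp(alpha * (d - d.min(axis=-1, keepdims=True)))  # shift for stability
    return (d * w).sum(axis=-1) / w.sum(axis=-1)

def blueprint_reward(dists, scale, alpha):
    """dists[i, j]: distance from construction-site corner i to box corner j."""
    return scale * smooth_min(dists, alpha).mean()

rng = np.random.default_rng(0)
dists = rng.uniform(0.5, 5.0, size=(4 * 2, 4 * 8))  # k=2 sites, n=8 boxes
r_mean = blueprint_reward(dists, scale=1.0, alpha=0.0)   # equals the plain mean
r_min = blueprint_reward(dists, scale=1.0, alpha=-1e3)   # near the mean of row minima
```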
Shelter Construction: The goal of the task is to build a shelter around a cylinder that is randomly placed in the play area. The horizon is 150 timesteps, and the game area is an empty room. The location of the cylinder is sampled uniformly at random a minimum distance away from the edges of the room (if the cylinder is too close to the external walls of the room, the agents are physically unable to complete the whole shelter). The diameter of the cylinder is sampled uniformly at random between d_min and d_max. There are 3 movable elongated boxes and 5 movable square boxes. There are 100 rays that originate from evenly spaced locations on the bounding walls of the room and target the cylinder placed within the room. The reward at each timestep is −n · s, where n is the number of raycasts that collide with the cylinder at that timestep and s is the reward scale hyperparameter.
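A minimal 2-D sketch of this raycast reward, assuming the per-timestep reward is −n · s for the n rays that reach the cylinder unblocked (the cylinder is approximated by its center point, boxes by axis-aligned rectangles; the geometry and names are illustrative):

```python
import numpy as np

def segment_hits_aabb(p0, p1, lo, hi):
    """Liang-Barsky clipping: does the segment p0->p1 cross the box [lo, hi]?"""
    d = np.asarray(p1, float) - np.asarray(p0, float)
    t0, t1 = 0.0, 1.0
    for i in range(2):
        if abs(d[i]) < 1e-12:
            if p0[i] < lo[i] or p0[i] > hi[i]:
                return False
        else:
            a = (lo[i] - p0[i]) / d[i]
            b = (hi[i] - p0[i]) / d[i]
            t0 = max(t0, min(a, b))
            t1 = min(t1, max(a, b))
    return t0 <= t1

def shelter_reward(ray_origins, cylinder, boxes, s):
    """-n*s, where n = number of rays that reach the cylinder unblocked."""
    n = sum(not any(segment_hits_aabb(o, cylinder, lo, hi) for lo, hi in boxes)
            for o in ray_origins)
    return -n * s

origins = [(-5.0, 0.0), (5.0, 0.0)]    # rays aimed at the cylinder center
boxes = [((-3.0, -1.0), (-2.0, 1.0))]  # one box shelters the left side
r = shelter_reward(origins, (0.0, 0.0), boxes, s=0.5)  # only the right ray hits
```

Building shelter walls that occlude rays therefore reduces the per-timestep penalty.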
We use the following hyperparameters:
D INTRINSIC MOTIVATION METHODS
We inherit the same policy architecture and optimization hyperparameters used in the hide-and-seek game.
Note that only the Sequential Lock task in the transfer suite contains ramps, so for the other 4 tasks we remove ramps from the environment when training intrinsic motivation agents.
D.1 COUNT-BASED EXPLORATION
We discretize each real value from the continuous state of interest into 30 bins. We then randomly project each of these discretized integers into a discrete embedding of dimension 16 with integer values ranging from 0 to 9; we use discrete embeddings for the purpose of accurate hashing. For each input entity, we concatenate all of its discrete embeddings to form that entity's feature embedding. A max-pooling is performed over the feature embeddings of all entities belonging to each object type (i.e., agent, lidar, box, and ramp) to obtain an entity-invariant object representation. Finally, concatenating all the derived object representations yields the final state representation to count.
We run a decentralized version of count-based exploration in which each parallel rollout worker shares the same random projection for computing embeddings but maintains its own counts. Let N(S) denote the count for state S in a particular rollout worker. The intrinsic reward is then calculated as 0.1/√N(S).
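The pipeline above can be sketched as follows (feature values are assumed normalized to [0, 1] before binning, and the table sizes and example state are illustrative; the bin count, embedding size, digit range, and 0.1/√N reward come from the text):

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
N_BINS, EMB_DIM = 30, 16

def make_table(n_features):
    # Random projection table: 16 digits (0-9) per (feature, bin) pair,
    # shared across rollout workers.
    return rng.integers(0, 10, size=(n_features, N_BINS, EMB_DIM))

def state_key(entities_by_type, tables):
    """Discretize each entity's features, embed them, max-pool within each
    object type, and concatenate into a hashable state representation."""
    parts = []
    for obj_type, entities in sorted(entities_by_type.items()):
        table = tables[obj_type]
        embs = []
        for ent in entities:  # ent: 1-D array of real-valued features in [0, 1]
            bins = np.clip((np.asarray(ent) * N_BINS).astype(int), 0, N_BINS - 1)
            embs.append(np.concatenate([table[i, b] for i, b in enumerate(bins)]))
        parts.append(np.max(embs, axis=0))  # entity-invariant max-pool
    return tuple(np.concatenate(parts))

counts = defaultdict(int)  # each rollout worker keeps its own counts
def intrinsic_reward(key):
    counts[key] += 1
    return 0.1 / np.sqrt(counts[key])

tables = {"agent": make_table(2), "box": make_table(2)}
s = {"agent": [[0.1, 0.2]], "box": [[0.5, 0.5], [0.9, 0.3]]}
r1 = intrinsic_reward(state_key(s, tables))  # first visit: 0.1/sqrt(1)
r2 = intrinsic_reward(state_key(s, tables))  # second visit: 0.1/sqrt(2)
```

The max-pool makes the key invariant to the ordering of entities within a type, so permuting the two boxes yields the same count.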
D.2 RANDOM NETWORK DISTILLATION
Random Network Distillation (RND) (Burda et al., 2019b) uses a fixed random network, i.e., a target network, to produce a random projection of each state, while learning another network, i.e., a predictor network, to fit the output of the target network on visited states. The prediction error between the two networks is used as the intrinsic reward.
For the random target network, we use the same architecture as the value network, except that we remove the LSTM layer and project the final layer to a 64-dimensional vector instead of a single value. The predictor network uses the same architecture. We use the squared difference between the predictor and target network outputs, with a coefficient of 1.0, as the intrinsic reward.
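A minimal numpy sketch of RND (the network sizes, learning rate, and training only the predictor's output layer are simplifications for brevity; only the 64-dimensional output and the squared-error intrinsic reward come from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_OUT = 8, 64  # observation size is illustrative; 64-dim output from the text

# Fixed random target network: one hidden layer, never trained.
Wt1 = rng.normal(size=(D_IN, 32)); Wt2 = rng.normal(size=(32, D_OUT))
def target(x):
    return np.tanh(x @ Wt1) @ Wt2

# Predictor network: same shape, trained to match the target on visited states.
Wp1 = rng.normal(size=(D_IN, 32)); Wp2 = rng.normal(size=(32, D_OUT))

def intrinsic_reward(x, coeff=1.0):
    """Squared prediction error between predictor and target outputs."""
    pred = np.tanh(x @ Wp1) @ Wp2
    return coeff * np.mean((pred - target(x)) ** 2)

def train_step(x, lr=1e-2):
    """One gradient step on the predictor's output layer (the hidden
    layer is kept fixed here for brevity)."""
    global Wp2
    h = np.tanh(x @ Wp1)
    err = h @ Wp2 - target(x)
    Wp2 -= lr * h.T @ err / len(x)

states = rng.normal(size=(64, D_IN))
before = intrinsic_reward(states)
for _ in range(500):
    train_step(states)
after = intrinsic_reward(states)  # familiar states now yield less reward
```

As the predictor fits the target on visited states, the intrinsic reward decays, which is what pushes the agent toward novel states.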
D.3 FINE-TUNING FROM INTRINSIC MOTIVATION VARIANTS