**Aditya Grover**^{1}, **Maruan Al-Shedivat**^{2}, **Jayesh K. Gupta**^{1}, **Yura Burda**^{3}, **Harrison Edwards**^{3}

# Abstract

Modeling agent behavior is central to understanding the emergence of complex phenomena in multiagent systems. Prior work in agent modeling has largely been task-specific and driven by hand-engineered, domain-specific prior knowledge. We propose a general learning framework for modeling agent behavior in any multiagent system using only a small amount of interaction data. Our framework casts agent modeling as a representation learning problem. Consequently, we construct a novel objective inspired by imitation learning and agent identification and design an algorithm for unsupervised learning of representations of agent policies. We demonstrate empirically the utility of the proposed framework in (i) a challenging high-dimensional competitive environment for continuous control and (ii) a cooperative environment for communication, on supervised predictive tasks, unsupervised clustering, and policy optimization using deep reinforcement learning.

# 1. Introduction

Intelligent agents rarely act in isolation in the real world and often seek to achieve their goals through interaction with other agents. Such interactions give rise to rich, complex behaviors formalized as per-agent policies in a multiagent system (Ferber, 1999; Wooldridge, 2009). Depending on the underlying motivations of the agents, interactions could be directed towards achieving a shared goal in a collaborative setting, opposing another agent in a competitive setting, or be a mixture of these in a setting where agents collaborate in teams to compete against other teams. Learning useful representations of the policies of agents based on their interactions is an important step towards characterizing agent behavior and, more generally, towards inference and reasoning in multiagent systems.

In this work, we propose an unsupervised encoder-decoder framework for learning continuous representations of agent policies given access to only a few episodes of interaction. For any given agent, the representation function is an encoder that learns a mapping from an interaction (*i.e.*, one or more episodes of observation and action pairs involving the agent) to a continuous embedding vector. Using such embeddings, we condition a policy network (decoder) and train it simultaneously with the encoder to *imitate* other interactions involving the same (or a coupled) agent. Additionally, we can explicitly *discriminate* between the embeddings corresponding to different agents using triplet losses.

For the embeddings to be useful, the representation function should *generalize* to both unseen interactions and unseen agents for novel downstream tasks. Generalization is well understood in the context of supervised learning, where a good model is expected to attain similar train and test performance. For multiagent systems, we consider a notion of generalization based on *agent-interaction graphs*. An agent-interaction graph provides an abstraction for distinguishing the agents (nodes) and interactions (edges) observed during training, validation, and testing.

Our framework is agnostic to the nature of interactions in multiagent systems, and hence broadly applicable to competitive and cooperative environments. In particular, we consider two multiagent environments: (i) a competitive continuous control environment, RoboSumo (Al-Shedivat et al., 2018), and (ii) a ParticleWorld environment of cooperative communication where agents collaborate to achieve a common goal (Mordatch & Abbeel, 2018). For evaluation, we show how representations learned by our framework are effective for downstream tasks that include clustering of agent policies (unsupervised), classification such as win or loss outcomes in competitive systems (supervised), and policy optimization (reinforcement). In the case of policy optimization, we show how these representations can serve as privileged information for better training of agent policies. In RoboSumo, we train agent policies that condition on the opponent's representation and achieve superior win rates much more quickly than an equally expressive baseline policy with the same number of parameters. In ParticleWorld, we train speakers that can communicate more effectively with a much wider range of listeners given knowledge of their representations.

# 2. Preliminaries

In this section, we present the necessary background and notation relevant to the problem setting of this work.

**Markov games.** We use the classical framework of Markov games (Littman, 1994) to represent multiagent systems. A *Markov game* extends the general formulation of partially observable Markov decision processes (POMDP) to the multiagent setting. In a Markov game, we are given a set of $n$ agents on a state space $\mathcal{S}$ with action spaces $\mathcal{A}_1, \mathcal{A}_2, \cdots, \mathcal{A}_n$ and observation spaces $\mathcal{O}_1, \mathcal{O}_2, \cdots, \mathcal{O}_n$ respectively. At every time step $t$, an agent $i$ receives an observation $o^{(t)} \in \mathcal{O}_i$ and executes an action $a^{(t)} \in \mathcal{A}_i$.

**Extended Markov games.** In this work, we are interested in interactions that involve not all but only a subset of agents. For this purpose, we generalize Markov games as follows. First, we augment the action space of each agent with a NO-OP (*i.e.*, no action). Then, we introduce a problem parameter $k$: in every rollout of the Markov game, all but $k$ agents deterministically execute the NO-OP operator while the $k$ agents execute actions as per the policies defined on the original observation and action spaces. Accordingly, we assume that each agent receives rewards only in the interaction episodes it participates in. Informally, the extension allows for multiagent systems in which not all agents have to participate simultaneously in an interaction. For instance, this lets us consider one-vs-one multiagent tournaments where only two players participate in any given match.
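The NO-OP extension can be illustrated with a short sketch. This is only an illustration under stated assumptions, not the implementation used in the experiments: the sentinel `NO_OP`, the helper `joint_action`, and the toy policies are hypothetical names introduced here.

```python
NO_OP = None  # hypothetical sentinel standing in for the added "no action"

def joint_action(policies, observations, participants):
    """Build the joint action for one time step of an extended Markov game.

    policies:      list of callables, one per agent, mapping observation -> action
    observations:  list of per-agent observations
    participants:  set with the indices of the k agents taking part in this rollout
    """
    actions = []
    for i, (policy, obs) in enumerate(zip(policies, observations)):
        if i in participants:
            actions.append(policy(obs))  # participating agents follow their policy
        else:
            actions.append(NO_OP)        # every other agent deterministically idles
    return actions

# One-vs-one match (k = 2) among n = 4 agents: only agents 0 and 2 act.
policies = [lambda obs, i=i: f"a{i}({obs})" for i in range(4)]
actions = joint_action(policies, ["o0", "o1", "o2", "o3"], {0, 2})
```

With `k = 2`, exactly two entries of `actions` are real actions and the rest are `NO_OP`, which mirrors the one-vs-one tournament case.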

To further introduce the notation, consider a multiagent system as a generalized Markov game. We denote the set of agent policies with $P = \{\pi^{(i)}\}_{i=1}^{n}$ and the interaction episodes with $E = \{E_{M_j}\}_{j=1}^{m}$, where $M_j \subseteq \{1, 2, \cdots, n\}$, $|M_j| = k$, is the set of $k$ agents participating in episode $E_{M_j}$. To simplify presentation for the rest of the paper, we assume $k = 2$ and, consequently, denote the set of interaction episodes between agents $i$ and $j$ as $E_{ij}$. A single episode, $e_{ij} \in E_{ij}$, consists of a sequence of observations and actions for the specified time horizon, $H$.

**Imitation learning.** Our approach to learning policy representations relies on behavioral cloning (Pomerleau, 1991), a type of imitation learning where we train a mapping from observations to actions in a supervised manner. Although there exist other imitation learning algorithms (*e.g.*, inverse reinforcement learning, Abbeel & Ng, 2004), our framework is largely agnostic to the choice of the algorithm, and we restrict our presentation to behavioral cloning, leaving other imitation learning paradigms to future work.

# 3. Learning framework

The dominant paradigm for unsupervised representation learning is to optimize the parameters of a representation function that can best explain or generate the observed data. For instance, the skip-gram objective used for language and graph data learns representations of words and nodes predictive of representations of surrounding context (Mikolov et al., 2013; Grover & Leskovec, 2016). Similarly, autoencoding objectives, often used for image data, learn representations that can reconstruct the input (Bengio et al., 2009).

In this work, we wish to learn a representation function that maps episode(s) from an agent policy, $\pi^{(i)} \in \Pi$, to a real-valued vector embedding, where $\Pi$ is a class of representable policies. That is, we optimize for the parameters $\theta$ of a function $f_\theta : \mathcal{E} \to \mathbb{R}^d$, where $\mathcal{E}$ denotes the space of episodes corresponding to a policy and $d$ is the dimension of the embedding. Here, we have assumed the agent policies are black boxes, *i.e.*, we can only access them based on interaction episodes with other agents in a Markov game. Hence, for every agent $i$, we wish to learn policies using $E_i = \cup_j E_{ij}^{(i)}$. Here, $E_{ij}^{(i)}$ refers to the episode data for interactions between agents $i$ and $j$, but consisting of only the observation and action pairs of agent $i$. For a multiagent system, we propose the following auxiliary tasks for learning a *good representation* of an agent's policy:

1. *Generative representations.* The representation should be useful for simulating the agent's policy.
2. *Discriminative representations.* The representation should be able to distinguish the agent's policy from the policies of other agents.

Accordingly, we now propose generative and discriminative objectives for representation learning in multiagent systems.
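To make the per-agent episode data concrete, the sketch below collects, for every agent, only its own observation and action pairs from all pairwise episodes it takes part in. The dictionary layout and the helper name `per_agent_episodes` are assumptions made for illustration.

```python
from collections import defaultdict

def per_agent_episodes(pairwise_episodes):
    """Collect, for each agent i, its own (observation, action) pairs
    from every episode it participates in.

    pairwise_episodes: dict mapping (i, j) -> list of episodes, where an
    episode is a dict {agent_id: [(obs, action), ...]} with both trajectories.
    """
    collected = defaultdict(list)
    for (i, j), episodes in pairwise_episodes.items():
        for episode in episodes:
            collected[i].append(episode[i])  # keep only agent i's own pairs
            collected[j].append(episode[j])  # and, separately, agent j's pairs
    return dict(collected)

# Toy data: agent 0 plays one episode against agent 1 and one against agent 2.
data = {
    (0, 1): [{0: [("o", "a")], 1: [("p", "b")]}],
    (0, 2): [{0: [("q", "c")], 2: [("r", "d")]}],
}
E = per_agent_episodes(data)
```

Here `E[0]` gathers agent 0's side of both matches, while the opponents' trajectories are stored under their own keys.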

## 3.1. Generative representations via imitation learning

Imitation learning does not require direct access to the reward signal, making it an attractive task for unsupervised representation learning. Formally, we are interested in learning a policy $\pi_\phi^{(i)} : \mathcal{S} \times \mathcal{A} \to [0, 1]$ for an agent $i$ given access to observation and action pairs from interaction episode(s) involving the agent. For behavioral cloning, we maximize the following (negative) cross-entropy objective:

$$\mathbb{E}_{e \sim E_i} \left[ \sum_{\langle o, a \rangle \sim e} \log \pi_\phi^{(i)}(a \mid o) \right]$$

where the expectation is over interaction episodes of agent $i$ and the optimization is over the parameters $\phi$.
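As a minimal sketch of the behavioral cloning objective, the code below averages, over episodes, the summed log-probabilities that a policy assigns to the demonstrated actions. It assumes discrete actions and a softmax policy, which the text does not require; `toy_policy` and the helper names are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def bc_objective(policy_logits, episodes):
    """Average over episodes of the summed log-probabilities of demonstrated actions.

    policy_logits: callable mapping an observation to a list of action logits
    episodes:      list of episodes, each a list of (observation, action_index)
    """
    total = 0.0
    for episode in episodes:
        for obs, action in episode:
            total += log_softmax(policy_logits(obs))[action]
    return total / len(episodes)

# Toy policy over two discrete actions: prefers action 0 when obs > 0.
def toy_policy(obs):
    return [obs, -obs]

episodes = [[(1.0, 0), (2.0, 0)], [(-1.0, 1)]]
value = bc_objective(toy_policy, episodes)
```

The value is always non-positive and moves toward zero as the policy puts more probability on the demonstrated actions, so maximizing it corresponds to minimizing cross-entropy.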

Learning individual policies for every agent can be computationally and statistically prohibitive for large-scale multiagent systems, especially when the number of interaction episodes per agent is small. Moreover, it precludes generalization across the behaviors of such agents. On the other hand, learning a single policy for all agents increases sample efficiency but comes at the cost of reduced modeling flexibility in simulating diverse agent behaviors. We resolve this tension by learning a single *conditional* policy network. To do so, we first specify a representation function, $f_\theta : \mathcal{E} \to \mathbb{R}^d$, with parameters $\theta$, where $\mathcal{E}$ represents the space of episodes. We use this embedding to condition the policy network, denoted $\pi_{\theta,\phi}$, and train it jointly with the representation function to imitate other episodes involving the same agent:

$$\frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{e_1 \sim E_i,\; e_2 \sim E_i \setminus e_1} \left[ \sum_{\langle o, a \rangle \sim e_1} \log \pi_{\theta,\phi}(a \mid o, e_2) \right] \tag{1}$$

Here, $e_1$ and $e_2$ are two distinct episodes of agent $i$: the policy network imitates the observation and action pairs in $e_1$ while being conditioned on the embedding $f_\theta(e_2)$.
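The idea of conditioning a shared policy on an embedding computed from a different episode of the same agent can be sketched as follows. This is a one-sample illustration under assumptions: `encode` and `log_prob` are hypothetical stand-ins for the representation function and the conditional policy, and the toy versions below use binary actions only.

```python
import math
import random

def conditional_imitation_term(encode, log_prob, episodes, rng):
    """One-sample estimate of the conditional imitation term for a single agent.

    encode(episode) -> embedding
    log_prob(action, observation, embedding) -> log-probability of the action
    episodes: list of episodes of this agent, each a list of (obs, action)
    """
    e_target, e_embed = rng.sample(episodes, 2)  # two distinct episodes
    z = encode(e_embed)                          # embed one episode ...
    return sum(log_prob(a, o, z) for o, a in e_target)  # ... imitate the other

# Toy stand-ins: the embedding is the clipped rate of action 1 in the episode,
# and the policy plays action 1 with probability equal to that embedding.
def encode(episode):
    rate = sum(a for _, a in episode) / len(episode)
    return min(max(rate, 0.1), 0.9)

def log_prob(action, obs, z):
    return math.log(z if action == 1 else 1.0 - z)

episodes = [[(0, 1), (0, 1)], [(0, 1)], [(0, 1), (0, 1), (0, 1)]]
value = conditional_imitation_term(encode, log_prob, episodes, random.Random(0))
```

Because the embedding comes from a different episode than the imitation targets, the encoder is pushed to capture properties of the agent that persist across episodes rather than memorizing a single trajectory.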

## 3.2. Discriminative representations via identification

An intuitive requirement for any representation function learned for a multiagent system is that the embeddings should reflect characteristics of an agent's behavior that distinguish it from other agents. To do so in an unsupervised manner, we propose an objective for agent identification based on the triplet loss directly in the space of embeddings. To learn a representation for agent $i$ based on interaction episodes, we use the representation function $f_\theta$ to compute three sets of embeddings: (i) a positive embedding for an episode $e_+ \sim E_i$ involving agent $i$, (ii) a negative embedding for an episode $e_- \sim E_j$ involving a random agent $j \neq i$, and (iii) a reference embedding for an episode $e_* \sim E_i$ again involving agent $i$, but different from $e_+$.

Given these embeddings, we define the triplet loss:

$$d_\theta(e_+, e_-, e_*) = \left(1 + \exp\left\{\|r_e - n_e\|_2 - \|r_e - p_e\|_2\right\}\right)^{-2} \tag{2}$$

where $p_e = f_\theta(e_+)$, $n_e = f_\theta(e_-)$, and $r_e = f_\theta(e_*)$. Intuitively, the loss encourages the positive embedding to be closer to the reference embedding than the negative embedding is, which makes the embeddings of the same agent cluster together and lie further away from the embeddings of other agents. We note that various other notions of distance can also be used; the one presented above corresponds to a squared softmax objective (Hoffer & Ailon, 2015).
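A small sketch of the triplet loss follows. It assumes the squared softmax form $(1 + \exp\{\|r_e - n_e\|_2 - \|r_e - p_e\|_2\})^{-2}$ with Euclidean distances; the function names are hypothetical, and the logistic function is used only to avoid overflow for large distance gaps.

```python
import math

def l2(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def triplet_loss(p_e, n_e, r_e):
    """Squared softmax triplet loss on positive, negative, and reference embeddings."""
    gap = l2(r_e, n_e) - l2(r_e, p_e)  # large when the negative is far, positive near
    return sigmoid(-gap) ** 2          # equals (1 + exp(gap)) ** -2

good = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 0.0])  # positive on the reference
bad = triplet_loss([3.0, 4.0], [0.0, 0.0], [0.0, 0.0])   # negative on the reference
tie = triplet_loss([1.0, 0.0], [0.0, 1.0], [0.0, 0.0])   # equally distant
```

The loss lies in (0, 1): it is 0.25 when both embeddings are equally far from the reference, approaches 0 when the positive is much closer, and approaches 1 when the negative is much closer.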

## 3.3. Hybrid generative-discriminative representations

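Conditional imitation learning encourages $f_\theta$ to learn representations that can capture and simulate the entire policy of an agent, while agent identification incentivizes representations that can distinguish between agent policies. The two objectives are complementary, and we combine Eq. (1) and Eq. (2) to get the final objective used for representation learning:

$$\frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{e_+ \sim E_i,\; e_* \sim E_i \setminus e_+} \Bigg[ \underbrace{\sum_{\langle o, a \rangle \sim e_+} \log \pi_{\theta,\phi}(a \mid o, e_*)}_{\text{imitation}} - \lambda \underbrace{\sum_{j \neq i} \mathbb{E}_{e_- \sim E_j} \big[ d_\theta(e_+, e_-, e_*) \big]}_{\text{agent identification}} \Bigg] \tag{3}$$

where $\lambda > 0$ is a tunable hyperparameter that controls the relative weights of the discriminative and generative terms. The pseudocode for the proposed algorithm is given in Algorithm 1. In experiments, we parameterize the conditional policy $\pi_{\theta,\phi}$ using neural networks and use stochastic gradient-based methods for optimization.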
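How the two terms might be weighed against each other can be sketched in a few lines. This assumes the combined objective is the per-agent imitation log-likelihood minus $\lambda$ times the summed triplet losses, averaged over agents; `hybrid_objective` and its inputs are hypothetical and take precomputed per-agent values.

```python
def hybrid_objective(imitation_ll, triplet_losses, lam):
    """Average over agents of: imitation term - lam * sum of triplet losses.

    imitation_ll:   list with one imitation log-likelihood per agent (maximized)
    triplet_losses: list with, per agent, the triplet losses against other agents
    lam:            positive weight on the discriminative term
    """
    n = len(imitation_ll)
    per_agent = [ll - lam * sum(losses)
                 for ll, losses in zip(imitation_ll, triplet_losses)]
    return sum(per_agent) / n

value = hybrid_objective([-1.0, -2.0], [[0.25], [0.5, 0.25]], lam=0.1)
```

Larger values of `lam` make the triplet penalty dominate, while `lam` close to zero recovers a purely imitation-based objective.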

# 4. Generalization in MAS

Generalization is well understood for supervised learning: models that show similar train and test performance exhibit good generalization. To measure the quality of the learned representations for a multiagent system (MAS), we introduce a graphical formalism for reasoning about agents and their interactions.

## 4.1. Generalization across agents & interactions

In many scenarios, we are interested in generalization of the policy representation function $f_\theta$ across novel agents and interactions in a multiagent system. For instance, we would like $f_\theta$ to output useful embeddings for a downstream task, even when evaluated with respect to unseen agents and interactions. This notion of generalization is best understood using agent-interaction graphs (Grover et al., 2018).

The *agent-interaction graph* describes interactions between a set of agent policies $P$ and a set of interaction episodes $I$ through a graph $G = (P, I)$.^{1} An example graph is shown in Figure 1a. The graph represents a multiagent system consisting of interactions between pairs of agents, and we will especially focus on the interactions involving **A**lice, **B**ob, **C**harlie, and **D**avis. The interactions could be competitive (*e.g.*, a match between two agents) or cooperative (*e.g.*, two agents communicating for a navigation task).

We learn the representation function $f_\theta$ on a subset of the interactions, denoted by the solid black edges in Figure 1a. At test time, $f_\theta$ is evaluated on some downstream task of interest. The agents and interactions observed at test time can be different from those used for training. In particular, we consider the following cases:

**Weak generalization.**^{2} Here, we are interested in the generalization performance of the representation function on an unseen interaction between existing agents, all of which are observed during training. This corresponds to the red edge representing the interaction between **A**lice and **B**ob in Figure 1a. In terms of the agent-interaction graph, the test graph adds only edges to the train graph.

**Strong generalization.** Generalization can also be evaluated with respect to unseen agents (and their interactions). This corresponds to the addition of agents **C**harlie and **D**avis in Figure 1a. Akin to a few-shot learning setting, we observe a few of their interactions with existing agents **A**lice and **B**ob (green edges), and generalization is evaluated on unseen interactions involving **C**harlie and **D**avis (blue edges). The test graph adds both nodes and edges to the train graph.

For brevity, we skip discussion of weaker forms of generalization that involve evaluating test performance on unseen episodes of an existing training edge (black edge).
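The distinction between these cases can be stated as a small check on the agent-interaction graph. The sketch below is illustrative only: `generalization_type` is a hypothetical helper, and the agents Erin and Frank are made-up training agents added to give Alice and Bob some training edges.

```python
def generalization_type(train_edges, test_edge):
    """Classify a test interaction relative to the training graph.

    train_edges: iterable of (agent, agent) pairs observed during training
    test_edge:   (agent, agent) pair evaluated at test time
    """
    train_nodes = {agent for edge in train_edges for agent in edge}
    seen_edges = {frozenset(edge) for edge in train_edges}
    if frozenset(test_edge) in seen_edges:
        return "seen edge"  # only new episodes on an existing training edge
    if set(test_edge) <= train_nodes:
        return "weak"       # new edge between agents seen in training
    return "strong"         # at least one agent was never seen in training

train_edges = [("Alice", "Erin"), ("Erin", "Bob"), ("Erin", "Frank")]
kinds = [
    generalization_type(train_edges, ("Alice", "Bob")),
    generalization_type(train_edges, ("Charlie", "Davis")),
    generalization_type(train_edges, ("Bob", "Erin")),
]
```

In this toy graph the Alice and Bob interaction adds only an edge, whereas any interaction involving Charlie or Davis adds nodes as well.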

## 4.2. Generalization across tasks

Since the representation function is learned using an unsupervised auxiliary objective, we test its generalization performance by evaluating the usefulness of these embeddings for the various kinds of downstream tasks described below.

**Unsupervised.** These embeddings can be used for clustering, visualization, and interpretability of agent policies in a low-dimensional space. Such semantic associations between the learned embeddings can be defined for a single agent, wherein we expect representations for the same agent based on distinct episodes to be embedded close to each other, or across agents, wherein agents with similar policies will have similar embeddings on average.

**Supervised. **Deep neural network representations are especially effective for predictive modeling. In a multiagent setting, the embeddings serve as useful features for learning agent properties and interactions, including assignment of *role *categories to agents with different skills in a collaborative setting, or prediction of win or loss outcomes of interaction matches between agents in a competitive setting.

**Reinforcement. **Finally, we can use the learned representation functions to improve generalization of the policies learned from a reinforcement signal in competitive and cooperative settings. We design policy networks that, in addition to observations, take embedding vectors of the opposing agents as inputs. The embeddings are computed from the past interactions of the opposing agent either with the agent being trained or with other agents using the representation function (Figure 2). Such embeddings play the role of privileged information and allow us to train a policy network that uses this information to learn faster and generalize better to opponents or cooperators unseen at training time.

# 5. Evaluation methodology & results

We evaluate the proposed framework for both competitive and collaborative environments on various downstream machine learning tasks. In particular, we use the RoboSumo and ParticleWorld environments for the competitive and collaborative scenarios, respectively. We consider the embedding objectives in Eq. (1), Eq. (2), and Eq. (3) independently and refer to them as Emb-Im, Emb-Id, and Emb-Hyb respectively. The hyperparameter $\lambda$ for Emb-Hyb is chosen by grid search over $\lambda \in \{0.01, 0.05, 0.1, 0.5\}$ on a held-out set of interactions.

In all our experiments, the representation function $f_\theta$ is specified through a multi-layer perceptron (MLP) that takes as input an episode and outputs an embedding of that episode. In particular, the MLP takes as input a single (observation, action) pair to output an intermediate embedding. We average the intermediate embeddings for all (observation, action) pairs in an episode to output an episode embedding. To condition a policy network on the embedding, we simply concatenate the observation fed as input to the network with the embedding. Experimental setup and other details beyond what we state below are deferred to the Appendix.
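The averaging and concatenation steps can be sketched as below. The per-pair encoder is passed in as a callable; the toy encoder used here simply concatenates the observation and action and is a hypothetical stand-in for the MLP.

```python
def embed_episode(pair_encoder, episode):
    """Average the intermediate embeddings of all (observation, action) pairs."""
    vectors = [pair_encoder(obs, act) for obs, act in episode]
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]

def policy_input(observation, embedding):
    """Condition a policy by concatenating its observation with the embedding."""
    return list(observation) + list(embedding)

# Hypothetical stand-in for the MLP: identity features of the pair.
def toy_pair_encoder(obs, act):
    return list(obs) + list(act)

episode = [([1.0, 2.0], [0.0]), ([3.0, 4.0], [1.0])]
embedding = embed_episode(toy_pair_encoder, episode)
conditioned = policy_input([5.0, 6.0], embedding)
```

Mean pooling makes the episode embedding independent of the episode length and of the order of the pairs, which is one reason such a design is convenient when episodes terminate at different times.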

## 5.1. The RoboSumo environment

For the competitive environment, we use RoboSumo (Al-Shedivat et al., 2018), a 3D environment with simulated physics (based on MuJoCo; Todorov et al., 2012) that allows agents to control multi-legged 3D robots and compete against each other in continuous-time wrestling games (Figure 1b). For our analysis, we train a diverse collection of 25 agents, some of which are trained via self-play while others are trained in pairs concurrently using the Proximal Policy Optimization (PPO) algorithm (Schulman et al., 2017).

We start with a fully connected agent-interaction graph (clique) of 25 agents. Every edge in this graph corresponds to 10 rollout episodes involving the corresponding agents.

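Table 1: Intra-inter clustering ratios (IICR) and outcome prediction accuracies (Acc) for weak (W) and strong (S) generalization on RoboSumo.

| Method | IICR (W) | IICR (S) | Acc (W) | Acc (S) |
| --- | --- | --- | --- | --- |
| Emb-Im | 0.24 | 0.23 | 0.71 | 0.60 |
| Emb-Id | 0.25 | 0.27 | 0.67 | 0.56 |
| Emb-Hyb | 0.22 | 0.21 | 0.73 | 0.56 |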

The maximum length (or horizon) of any episode is 500 time steps, after which the episode is declared a draw. To evaluate weak generalization, we sample a connected subgraph for training with approximately 60% of the edges preserved for training, and the remaining edges split equally between validation and testing. For strong generalization, we preserve 15 agents and their interactions with each other for training and, similarly, 5 agents and their within-group interactions each for validation and testing.
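One possible way to produce such splits is sketched below. The sampling procedure is an assumption, not the one used in the experiments: connectivity of the training subgraph is enforced by first drawing a random spanning tree, and the helper names are hypothetical.

```python
import itertools
import random

def weak_split(agents, train_fraction, rng):
    """Split the edges of a clique into a connected training set plus held-out edges."""
    edges = [frozenset(e) for e in itertools.combinations(agents, 2)]
    order = agents[:]
    rng.shuffle(order)
    # A random spanning tree guarantees that the training subgraph is connected.
    tree = {frozenset((order[i], rng.choice(order[:i]))) for i in range(1, len(order))}
    rest = [e for e in edges if e not in tree]
    rng.shuffle(rest)
    extra = round(train_fraction * len(edges)) - len(tree)
    train = tree | set(rest[:extra])
    held_out = rest[extra:]
    half = len(held_out) // 2
    return train, held_out[:half], held_out[half:]

def strong_split(agents, group_sizes, rng):
    """Partition agents into groups and keep only within-group edges."""
    order = agents[:]
    rng.shuffle(order)
    groups, start = [], 0
    for size in group_sizes:
        members = order[start:start + size]
        groups.append(list(itertools.combinations(sorted(members), 2)))
        start += size
    return groups

agents = list(range(25))
train, val, test = weak_split(agents, 0.6, random.Random(0))
strong_groups = strong_split(agents, [15, 5, 5], random.Random(0))
```

For a clique of 25 agents there are 300 edges, so the weak split yields 180 training edges and 60 edges each for validation and testing, while the strong split yields 105, 10, and 10 within-group edges.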

**5.1.1. EMBEDDING ANALYSIS**

To evaluate the robustness of the embeddings, we compute multiple embeddings for each policy based on different episodes of interaction at test time. Our evaluation metric is based on the intra- and inter-cluster Euclidean distances between embeddings. The intra-cluster distance for an agent is the average pairwise distance between its embeddings computed on the set of test interaction episodes involving the agent. Similarly, the inter-cluster distance is the average pairwise distance between the embeddings of an agent and those of other agents. Let $T_i = \{t_c^{(i)}\}_{c=1}^{n_i}$ denote the set of test interactions involving agent $i$. We define the intra-inter cluster ratio (IICR) as:

$$\mathrm{IICR} = \frac{\frac{1}{n} \sum_{i=1}^{n} \frac{1}{n_i^2} \sum_{a=1}^{n_i} \sum_{b=1}^{n_i} \left\| f_\theta(t_a^{(i)}) - f_\theta(t_b^{(i)}) \right\|_2}{\frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j \neq i} \frac{1}{n_i n_j} \sum_{a=1}^{n_i} \sum_{b=1}^{n_j} \left\| f_\theta(t_a^{(i)}) - f_\theta(t_b^{(j)}) \right\|_2}$$

The intra-inter clustering ratios are reported in Table 1. A ratio less than 1 suggests that there is signal that identifies the agent, and the signal is stronger for lower ratios. Even though this task might seem especially suited to the agent identification objective, we find that Emb-Im attains lower clustering ratios than Emb-Id for both weak and strong generalization. Emb-Hyb outperforms both these methods. We qualitatively visualize the embeddings learned using Emb-Hyb by projecting them on the leading principal components, as shown in Figures 3a and 3b for 10 test interaction episodes of 5 randomly selected agents in the weak and strong generalization settings respectively.
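The ratio can be computed from per-agent embeddings as in the sketch below. It assumes that average pairwise distances are taken over all ordered pairs, including self-pairs within an agent, and that agents are weighted equally; the helper names are hypothetical.

```python
import math
from itertools import product

def l2(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def mean_pairwise_distance(A, B):
    """Average distance over all pairs (a, b) with a in A and b in B."""
    return sum(l2(a, b) for a, b in product(A, B)) / (len(A) * len(B))

def iicr(embeddings):
    """Intra-inter cluster ratio for a dict: agent -> list of embedding vectors."""
    agents = list(embeddings)
    intra = sum(mean_pairwise_distance(embeddings[i], embeddings[i])
                for i in agents) / len(agents)
    inter_terms = [mean_pairwise_distance(embeddings[i], embeddings[j])
                   for i in agents for j in agents if i != j]
    inter = sum(inter_terms) / len(inter_terms)
    return intra / inter

# Two well-separated toy agents with two test embeddings each.
embeddings = {"A": [[0.0, 0.0], [0.0, 2.0]], "B": [[10.0, 0.0], [10.0, 2.0]]}
ratio = iicr(embeddings)
```

For these toy clusters the ratio is roughly 0.1, far below 1, reflecting embeddings that are tight within an agent and far apart across agents.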

**5.1.2. OUTCOME PREDICTION**

We can use these embeddings directly for training a classifier to predict the outcome of an episode (win/loss/draw). For classification, we use an MLP with 3 hidden layers of 100 units each, and the learning objective minimizes the cross-entropy error. The inputs to the classifier are the embeddings of the two agents involved in the episode. The results are reported in Table 1. Again, imitation-based methods seem more suited for this task, with Emb-Hyb and Emb-Im outperforming other methods for weak and strong generalization respectively.
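A forward pass of such a classifier can be sketched in plain Python. Several details are assumptions not specified in the text: ReLU activations, Gaussian weight initialization, concatenation of the two embeddings as the input, and a toy embedding dimension of 8; the weights are untrained.

```python
import math
import random

def init_mlp(sizes, rng):
    """Randomly initialize (weights, biases) for consecutive layer sizes."""
    params = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
        params.append((weights, [0.0] * n_out))
    return params

def predict_outcome(params, emb_a, emb_b):
    """Return probabilities over (win, loss, draw) from two agent embeddings."""
    x = list(emb_a) + list(emb_b)  # concatenated embeddings (assumed input format)
    for layer, (weights, biases) in enumerate(params):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if layer < len(params) - 1:
            x = [max(0.0, v) for v in x]  # ReLU on hidden layers (assumed)
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Cross-entropy error for a single labeled example."""
    return -math.log(probs[label])

rng = random.Random(0)
dim = 8  # toy embedding dimension
params = init_mlp([2 * dim, 100, 100, 100, 3], rng)
probs = predict_outcome(params, [0.1] * dim, [-0.2] * dim)
loss = cross_entropy(probs, 0)
```

Training would adjust `params` to reduce `loss` over labeled episodes; that step is omitted here.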

**5.1.3. POLICY OPTIMIZATION**

Here we ask whether embeddings can be used to improve learned policies in a reinforcement learning setting, both in terms of end performance and generalization. To this end, we select 5 training, 5 validation, and 5 testing opponents from the pool of 25 pre-trained agents. Next, we train a new agent with reinforcement learning to compete against the selected 5 training opponents; the agent is trained concurrently against all 5 opponents using a distributed version of the PPO algorithm, as described in Al-Shedivat et al. (2018). Throughout training, we evaluate new agents on the 5 testing opponents and record the average win and draw rates.

Using this setup, we compare a baseline agent with an MLP-based policy against an agent whose policy takes 100-dimensional embeddings of the opponents as additional inputs at each time step and uses that information to condition its behavior on the opponent's representation. The embeddings for each opponent are either computed *online*, *i.e.*, based on an interaction episode rolled out during training at a previous time step (Figure 2), or *offline*, *i.e.*, pre-computed before training the new agent using only interactions between the pre-trained opponents.

Figure 4 shows the average win rates against the set of training and testing opponents for the baseline and for our agents that use different types of embeddings. While every new agent is able to achieve an almost 100% win rate against the training opponents, we see that the agents that condition their policies on the opponent's embeddings perform better on the held-out set of opponents, *i.e.*, generalize better, with the best performance achieved with Emb-Hyb. We also note that embeddings computed offline lead to better performance than those computed online^{3}. As an ablation test, we also evaluate our agents when they are provided an incorrect embedding (either all zeros, Emb-zero, or an embedding selected for a different random opponent, Emb-rand) and observe that such embeddings lead to a degradation in performance^{4}.

Table 2: Intra-inter clustering ratios (IICR) for weak (W) and strong (S) generalization on ParticleWorld. Lower is better.

| Method | IICR (W) | IICR (S) |
| --- | --- | --- |
| Emb-Im | 0.58 | 0.86 |
| Emb-Id | 0.50 | 0.82 |
| Emb-Hyb | 0.54 | 0.85 |

Table 3: Average train and test rewards for speaker policies on ParticleWorld.

Finally, to evaluate strong generalization in the RL setting, we pit the newly trained baseline and agents with embedding-conditional policies against each other. Since the embedding network has never seen the new agents, it must exhibit strong generalization to be useful in such a setting. The results are given in Figures 5 and 6. Even though the margin is not very large, the agents that use Emb-Hyb perform the best on average.

## 5.2. The ParticleWorld environment

For the collaborative setting, we evaluate the framework on the ParticleWorld environment for cooperative communication (Mordatch & Abbeel, 2018; Lowe et al., 2017). The environment consists of a continuous 2D grid with 3 landmarks and two kinds of agents collaborating to navigate to a common landmark goal (Figure 1c). At the beginning of every episode, the *speaker* agent is shown the RGB color of a single target landmark on the grid. The speaker then communicates a fixed-length binary message to the *listener* agent. Based on the received messages, the listener agent then moves in a particular direction. The final reward, shared across the speaker and listener agents, is the distance of the listener to the target landmark after a fixed time horizon.

The agent-interaction graph for this environment is bipartite with only cross edges between speaker and listener agents. Every interaction edge in this graph corresponds to 1000 rollout episodes where the maximum length of any episode is 25 steps. We pretrain 28 MLP parameterized speaker and listener agent policies. Every speaker learns through communication with only two different listeners neighboring speaker agents in the agent-interaction graph, the listener agents also show diversity in the learned policies. The policies are learned using multiagent deep deterministic policy gradients (MADDPG, Lowe et al., 2017).

In this environment, the speakers and listeners are tightly coupled. Hence we vary the setup used previously in the competitive scenario. We wish to learn embeddings of listeners based on their interactions with speakers. Since the agent-interaction graph is bipartite, we use the embeddings of listener agents to condition a shared policy network for the respective speaker agents.

**5.2.1. EMBEDDING ANALYSIS**

For the weak generalization setting, we remove an outgoing edge from every listener agent in the original graph to obtain the training graph. In the case of strong generalization, we set aside 7 listener agents (and their outgoing edges) each for validation and testing, while the representation function is learned on the remaining 14 listener agents and their interactions. The intra-inter clustering ratios are shown in Table 2, and the projections of the embeddings learned using Emb-Hyb are visualized in Figure 3c and Figure 3d for weak and strong generalization respectively. In spite of the high degree of sparsity in the training graph, the intra-inter clustering ratio for the test interaction embeddings is less than unity, suggesting an agent-specific signal. Emb-Id works particularly well in this environment, achieving the best results for both weak and strong generalization.
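The two splits on the bipartite graph can be sketched as follows. The graph used here is an assumption made for illustration: a ring in which each of 28 listeners is connected to two speakers. The helper names are hypothetical as well.

```python
import random

def weak_split_bipartite(listener_edges, rng):
    """Hold out one speaker edge per listener; the rest form the training graph."""
    train, test = {}, {}
    for listener, speakers in listener_edges.items():
        held_out = rng.choice(speakers)
        test[listener] = [held_out]
        train[listener] = [s for s in speakers if s != held_out]
    return train, test

def strong_split_listeners(listeners, n_val, n_test, rng):
    """Set aside whole listeners (with all their edges) for validation and testing."""
    order = listeners[:]
    rng.shuffle(order)
    return order[n_val + n_test:], order[:n_val], order[n_val:n_val + n_test]

# Hypothetical ring-shaped bipartite graph: listener i talks to speakers i and i+1.
n = 28
graph = {f"L{i}": [f"S{i}", f"S{(i + 1) % n}"] for i in range(n)}
train, test = weak_split_bipartite(graph, random.Random(0))
train_l, val_l, test_l = strong_split_listeners(list(graph), 7, 7, random.Random(0))
```

With only two edges per listener, removing one leaves a single training interaction per listener, which illustrates how sparse the training graph becomes in the weak setting.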

**5.2.2. POLICY OPTIMIZATION**

Here, we are interested in learning speaker agents that can communicate more effectively with a much wider range of listeners given knowledge of their embeddings. Referring back to Figure 2, we learn a policy $\pi_\psi$ for a speaker agent that conditions on the representation function $f_\theta$ for the listener agents. For cooperative communication, we consider interactions with 14 pre-trained listener agents split as 6 training, 4 validation, and 4 test agents.^{5} Similar to the competitive setting, we compare performance against a baseline speaker agent that does not have access to any privileged information about the listeners. We summarize the results for the best validated models during training and 100 interaction episodes per test listener agent across 5 initializations in Table 3. From the results, we observe that online embedding-based methods can generalize better than the baseline methods. The baseline MADDPG achieves the lowest training error, but fails to generalize well enough and incurs a low average reward for the test listener agents.

# 6. Discussion & Related Work

Agent modeling is a well-studied topic within multiagent systems. See Albrecht & Stone (2017) for an excellent recent survey on this subject. The vast majority of the literature is concerned with learning models for a specific predictive task. Predictive tasks are typically defined over actions, goals, and beliefs of other agents (Stone & Veloso, 2000). In competitive domains such as Poker and Go, such tasks are often integrated with domain-specific heuristics to model opponents and learn superior policies (Rubin & Watson, 2011; Mnih et al., 2015). Similarly, intelligent tutoring systems take into account pedagogical features of students and teachers to accelerate learning of desired behaviors in a collaborative environment (McCalla et al., 2000).

In this work, we proposed an approach for modeling agent behavior in multiagent systems through unsupervised representation learning of agent policies. Since we sidestep any domain-specific assumptions and learn in an unsupervised manner, our framework learns representations that are useful for several downstream tasks. This extends the use of deep neural networks in multiagent systems to applications beyond traditional reinforcement learning and predictive modeling (Mnih et al., 2015; Hoshen, 2017).

Both the generative and discriminative components of our framework have been explored independently in prior work. Imitation learning has been extensively studied in the single-agent setting, and recent work by Le et al. (2017) proposes an algorithm for imitation in a coordinated multiagent system. Wang et al. (2017) proposed an imitation learning algorithm for learning robust controllers with few expert demonstrations in a single-agent setting that conditions the policy network on an inference network, similar to the encoder in our framework. In another recent work, Li et al. (2017) propose an algorithm for learning interpretable representations using generative adversarial imitation learning. Agent identification, which represents the discriminative term in the learning objective, is inspired by triplet losses and Siamese networks that are used for learning representations of data using distance comparisons (Hoffer & Ailon, 2015).

A key contribution of this work is a principled methodology for evaluating generalization of representations in multiagent systems based on the graphs of the agent interactions. Graphs are a fundamental abstraction for modeling relational data, such as the interactions arising in multiagent systems (Zhou et al., 2016a;b; Chen et al., 2017; Battaglia et al., 2016; Hoshen, 2017) and concurrent work proposes to learn such graphs directly from data (Kipf et al., 2018).

# 7. Conclusion & Future Work

In this work, we presented a framework for learning representations of agent policies in multiagent systems. The agent policies are accessed using a few interaction episodes with other agents. Our learning objective is based on a novel combination of a generative component based on imitation learning and a discriminative component for distinguishing the embeddings of different agent policies. Our overall framework is unsupervised, sample-efficient, and domain-agnostic, and hence can be readily extended to many environments and downstream tasks. Most importantly, we showed the role of these embeddings as privileged information for learning more adaptive agent policies in both collaborative and competitive settings.

In the future, we would like to explore multiagent systems with more than two agents participating in the interactions. Semantic interpolation of policies directly in the embedding space, in order to quickly obtain a policy with desired behaviors, is another promising direction. Finally, it would be interesting to extend the proposed framework to learn representations for history-dependent policies, such as those parameterized by long short-term memory networks, and to evaluate it in that setting.

# Acknowledgements

We are thankful to Lisa Lee, Daniel Levy, Jiaming Song, and everyone at OpenAI for helpful comments and discussion. AG is supported by a Microsoft Research PhD Fellowship. MA is partially supported by NIH R01GM114311. JKG is partially supported by the Army Research Laboratory through the Army High Performance Computing Research Center under Cooperative Agreement W911NF-07-2-0027.

# References

Abbeel, P. and Ng, A. Y. Apprenticeship learning via inverse reinforcement learning. In *International Conference on Machine Learning*, 2004.

Al-Shedivat, M., Bansal, T., Burda, Y., Sutskever, I., Mordatch, I., and Abbeel, P. Continuous adaptation via meta-learning in nonstationary and competitive environments. In *International Conference on Learning Representations*, 2018.

Albrecht, S. V. and Stone, P. Autonomous agents modeling other agents: A comprehensive survey and open problems. *arXiv preprint arXiv:1709.08071*, 2017.

Battaglia, P., Pascanu, R., Lai, M., Rezende, D. J., et al. Interaction networks for learning about objects, relations and physics. In *Advances in Neural Information Processing Systems*, 2016.

Bengio, Y. et al. Learning deep architectures for AI. *Foundations and Trends® in Machine Learning*, 2(1):1–127, 2009.

Chen, M., Zhou, Z., and Tomlin, C. J. Multiplayer reach-avoid games via pairwise outcomes. *IEEE Transactions on Automatic Control*, 62(3):1451–1457, 2017.

Ferber, J. *Multi-agent systems: An introduction to distributed artificial intelligence*, volume 1. Addison-Wesley Reading, 1999.

Grover, A. and Leskovec, J. node2vec: Scalable feature learning for networks. In *SIGKDD Conference on Knowledge Discovery and Data Mining*, 2016.

Grover, A., Al-Shedivat, M., Gupta, J. K., Burda, Y., and Edwards, H. Evaluating generalization in multiagent systems using agent-interaction graphs. In *International Conference on Autonomous Agents and Multiagent Systems*, 2018.

Hoffer, E. and Ailon, N. Deep metric learning using triplet network. In *International Workshop on Similarity-Based Pattern Recognition*, pp. 84–92. Springer, 2015.

Hoshen, Y. VAIN: Attentional multi-agent predictive modeling. In *Advances in Neural Information Processing Systems*, 2017.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In *International Conference on Learning Representations*, 2015.

Kipf, T., Fetaya, E., Wang, K.-C., Welling, M., and Zemel, R. Neural relational inference for interacting systems. In *International Conference on Machine Learning*, 2018.

Le, H. M., Yue, Y., and Carr, P. Coordinated multi-agent imitation learning. In *International Conference on Machine Learning*, 2017.

Li, Y., Song, J., and Ermon, S. Inferring the latent structure of human decision-making from raw visual inputs. In *Advances in Neural Information Processing Systems*, 2017.

Littman, M. L. Markov games as a framework for multi-agent reinforcement learning. In *International Conference on Machine Learning*, 1994.

Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P., and Mordatch, I. Multi-agent actor-critic for mixed cooperative-competitive environments. In *Advances in Neural Information Processing Systems*, 2017.

McCalla, G., Vassileva, J., Greer, J., and Bull, S. Active learner modelling. In *Intelligent tutoring systems*, pp. 53–62. Springer, 2000.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed representations of words and phrases and their compositionality. In *Advances in Neural Information Processing Systems*, 2013.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. *Nature*, 518(7540):529–533, 2015.

Mordatch, I. and Abbeel, P. Emergence of grounded compositional language in multi-agent populations. In *AAAI Conference on Artificial Intelligence*, 2018.

Pomerleau, D. A. Efficient training of artificial neural networks for autonomous navigation. *Neural Computation*, 3(1):88–97, 1991.

Rubin, J. and Watson, I. Computer poker: A review. *Artificial Intelligence*, 175(5-6):958–987, 2011.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. *arXiv preprint arXiv:1707.06347*, 2017.

Stone, P. and Veloso, M. Multiagent systems: A survey from a machine learning perspective. *Autonomous Robots*, 8(3):345–383, 2000.

Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In *International Conference on Intelligent Robots and Systems*, 2012.

Wang, Z., Merel, J., Reed, S., Wayne, G., de Freitas, N., and Heess, N. Robust imitation of diverse behaviors. In *Advances in Neural Information Processing Systems*, 2017.

Wooldridge, M. *An introduction to multiagent systems*. John Wiley & Sons, 2009.

Zhou, Z., Bambos, N., and Glynn, P. Dynamics on linear influence network games under stochastic environments. In *International Conference on Decision and Game Theory for Security*, 2016a.

Zhou, Z., Yolken, B., Miura-Ko, R. A., and Bambos, N. A game-theoretical formulation of influence networks. In *American Control Conference*, 2016b.

# A. Experimental Setup

## RoboSumo Environment

To limit the scope of our study, we restrict agent morphologies to only 4-leg robots. During the game, the observations of each agent are represented by a 120-dimensional vector comprising the positions and velocities of its own body and the positions of the opponent’s body; each agent’s actions are 8-dimensional vectors that represent the torques applied to the corresponding joints.

**Network Architecture**

Agent policies are parameterized as multi-layer perceptrons (MLPs) with 2 hidden layers of 90 units each. For the embedding network, we use another MLP with 2 hidden layers of 100 units each, which outputs an embedding of size 100. For the conditioned policy network, we reduce the hidden layer size to 64 units each.
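As a rough illustration of the network sizes above, the sketch below lists the layer shapes and counts the weights and biases of each MLP. Several details are assumptions made only for this example: the embedding network is taken to consume one concatenated observation-action pair (120 + 8 inputs), the conditioned policy is taken to consume the observation concatenated with the embedding and to have 2 hidden layers, and each policy is taken to output a single 8-dimensional action vector. The helper names are hypothetical.

```python
def mlp_layer_shapes(in_dim, hidden_sizes, out_dim):
    """Return (fan_in, fan_out) for every dense layer of an MLP."""
    dims = [in_dim] + list(hidden_sizes) + [out_dim]
    return list(zip(dims[:-1], dims[1:]))


def count_parameters(shapes):
    """Total number of weights plus biases across all dense layers."""
    return sum(fan_in * fan_out + fan_out for fan_in, fan_out in shapes)


OBS_DIM, ACT_DIM, EMB_DIM = 120, 8, 100  # RoboSumo sizes from the text

# Agent policy: observation -> action, 2 hidden layers of 90 units.
policy_shapes = mlp_layer_shapes(OBS_DIM, [90, 90], ACT_DIM)

# Embedding network: (observation, action) pair -> embedding (input layout assumed).
embedding_shapes = mlp_layer_shapes(OBS_DIM + ACT_DIM, [100, 100], EMB_DIM)

# Conditioned policy: (observation, embedding) -> action (input layout assumed).
conditioned_shapes = mlp_layer_shapes(OBS_DIM + EMB_DIM, [64, 64], ACT_DIM)

policy_params = count_parameters(policy_shapes)
embedding_params = count_parameters(embedding_shapes)
conditioned_params = count_parameters(conditioned_shapes)
```

Under these assumptions all three networks stay in the range of tens of thousands of parameters.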

**Policy Optimization**

For learning the population of agents, we use the distributed version of the PPO algorithm as described in Al-Shedivat et al. (2018), with a learning rate of 2 × 10^{−3}, *ϵ* = 0.2, and 16,000 time steps per update with 6 epochs and 4,000 time steps per batch.
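The sketch below shows how these settings could enter a PPO update. It reads the value 0.2 as the clipping range of the standard PPO clipped surrogate objective (Schulman et al., 2017) and assumes that the 16,000 time steps of each update are split into batches of 4,000 time steps; both readings, and the function name `clipped_surrogate`, are assumptions for illustration rather than a description of the distributed implementation.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-sample PPO clipped surrogate objective.

    `ratio` is the probability ratio pi_new(a|s) / pi_old(a|s) and
    `advantage` is the estimated advantage of the sampled action.
    """
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


# Hyperparameters quoted in the text.
LEARNING_RATE = 2e-3
STEPS_PER_UPDATE = 16_000
STEPS_PER_BATCH = 4_000
EPOCHS = 6

# Assumed batching scheme: each epoch sweeps the collected steps once.
batches_per_epoch = STEPS_PER_UPDATE // STEPS_PER_BATCH
gradient_steps_per_update = EPOCHS * batches_per_epoch

# A large ratio with a positive advantage is capped at (1 + epsilon) * advantage.
capped_gain = clipped_surrogate(ratio=1.5, advantage=1.0)
# A small ratio with a negative advantage keeps the more pessimistic value.
pessimistic_loss = clipped_surrogate(ratio=0.5, advantage=-1.0)
```

Clipping removes the incentive to move the probability ratio outside the interval [1 − ϵ, 1 + ϵ], which keeps each update close to the policy that collected the data.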

**Training**

For our analysis, we train a diverse collection of 25 agents, some of which are trained via self-play and others are trained in pairs concurrently, forming a clique agent-interaction graph.
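A clique agent-interaction graph over the 25 agents can be enumerated as in the following sketch. Agents are represented by integer indices and self-play is left out of the edge list; both choices are assumptions made for this example.

```python
from itertools import combinations

NUM_AGENTS = 25
agents = list(range(NUM_AGENTS))

# In a clique, every unordered pair of distinct agents shares an edge.
clique_edges = list(combinations(agents, 2))

# Count how many opponents each agent interacts with.
degree = {agent: 0 for agent in agents}
for u, v in clique_edges:
    degree[u] += 1
    degree[v] += 1

num_edges = len(clique_edges)
```

Each agent has 24 neighbors, giving 25 × 24 / 2 = 300 distinct pairings.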

## ParticleWorld Environment

The overall continuous observation and discrete action space for the speaker agents are 3 and 7 dimensions respectively. For the listener agents, the observation and action spaces are 15 and 5 dimensions respectively.

**Network Architecture**

Agent policies and shared critic (*i.e.*, a value function) are parameterized as multi-layer perceptrons (MLPs) with 2 hidden layers of 64 units each. The observation space for the speaker is small (3 dimensions), and a small embedding of size 5 for the listener policy gives good performance. For the embedding network, we again used an MLP with 2 hidden layers of 100 units each.

**Policy Optimization**

For learning the initial population of listener and agent policies, we use multiagent deep deterministic policy gradients (MADDPG) as the base algorithm (Lowe et al., 2017). The Adam optimizer (Kingma & Ba, 2015) with a learning rate of 4 × 10^{−3} was used for optimization. The replay buffer size was set to 10^{6} timesteps.
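For readers unfamiliar with off-policy training, a fixed-capacity replay buffer of the kind used by MADDPG-style algorithms can be sketched as follows. The class name, the first-in-first-out eviction rule, and uniform sampling are assumptions for illustration; the actual buffer implementation may differ.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity FIFO store of transitions with uniform sampling."""

    def __init__(self, capacity=10**6):
        # A deque with maxlen silently drops the oldest item when full.
        self._storage = deque(maxlen=capacity)

    def add(self, transition):
        self._storage.append(transition)

    def sample(self, batch_size, rng=random):
        # Draw distinct indices uniformly at random, then look them up.
        indices = rng.sample(range(len(self._storage)), batch_size)
        return [self._storage[i] for i in indices]

    def __len__(self):
        return len(self._storage)


# Tiny demonstration with capacity 3: the two oldest entries are evicted.
demo_buffer = ReplayBuffer(capacity=3)
for timestep in range(5):
    demo_buffer.add(timestep)
demo_contents = sorted(demo_buffer.sample(3))
```

With the capacity of 10^{6} timesteps quoted above, old transitions would only start to be evicted after one million environment steps.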

**Training**

We first train 28 speaker-listener pairs using the MADDPG algorithm. From this collection of 28 speakers, we train another set of 28 listeners, each trained to work with a speaker pair, forming a bipartite agent-interaction graph. We choose the best 14 listeners for later experiments.

## Footnotes

1 Stanford University, 2 Carnegie Mellon University, 3 OpenAI. Correspondence to: Aditya Grover. Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

1. If we have more than two participating agents per interaction episode, we could represent the interactions using a hypergraph.

2. Also referred to as intermediate generalization by Grover et al. (2018).

3. Perhaps this is due to differences in the interactions of the opponents between themselves and with the new agent that the embedding network was not able to capture entirely.

4. Performance decrease is most significant for Emb-zero, which is an out-of-distribution all-zeros vector.

5. None of the methods considered were able to learn a nontrivial speaker agent when trained simultaneously with all 28 listener agents. Hence, we simplified the problem by considering the 14 listener agents that attained the best rewards during pretraining.