### Joshua Achiam

UC Berkeley & OpenAI

### Dario Amodei

OpenAI

### Harrison Edwards

OpenAI

### Pieter Abbeel

UC Berkeley

# Abstract

We explore methods for option discovery based on variational inference and make two algorithmic contributions. First: we highlight a tight connection between variational option discovery methods and variational autoencoders, and introduce Variational Autoencoding Learning of Options by Reinforcement (VALOR), a new method derived from the connection. In VALOR, the policy encodes contexts from a noise distribution into trajectories, and the decoder recovers the contexts from the complete trajectories. Second: we propose a curriculum learning approach where the number of contexts seen by the agent increases whenever the agent’s performance is strong enough (as measured by the decoder) on the current set of contexts. We show that this simple trick stabilizes training for VALOR and prior variational option discovery methods, allowing a single agent to learn many more modes of behavior than it could with a fixed context distribution. Finally, we investigate other topics related to variational option discovery, including fundamental limitations of the general approach and the applicability of learned options to downstream tasks.

# 1 Introduction

Humans are innately driven to experiment with new ways of interacting with their environments. This can accelerate the process of discovering skills for downstream tasks and can also be viewed as a primary objective in its own right. This drive serves as an inspiration for reward-free option discovery in reinforcement learning (based on the options framework of Sutton et al. [1999], Precup [2000]), where an agent tries to learn skills by interacting with its environment without trying to maximize cumulative reward for a particular task.

In this work, we explore variational option discovery, the space of methods for option discovery based on variational inference. We highlight a tight connection between prior work on variational option discovery and variational autoencoders (Kingma and Welling [2013]), and derive a new method based on the connection. In our analogy, a policy acts as an encoder, translating contexts from a noise distribution into trajectories; a decoder attempts to recover the contexts from the trajectories, and rewards the policies for making contexts easy to distinguish. Contexts are random vectors which have no intrinsic meaning prior to training, but they become associated with trajectories as a result of training; each context vector thus corresponds to a distinct option. Therefore this approach learns a set of options which are as diverse as possible, in the sense of being as easy to distinguish from each other as possible. We show that Variational Intrinsic Control (VIC) (Gregor et al. [2016]) and the recently-proposed Diversity is All You Need (DIAYN) (Eysenbach et al. [2018]) are specific instances of this template which decode from states instead of complete trajectories.

We make two main algorithmic contributions:

- We introduce Variational Autoencoding Learning of Options by Reinforcement (VALOR), a new method which decodes from trajectories. The idea is to encourage learning dynamical modes instead of goal-attaining modes, e.g. ‘move in a circle’ instead of ‘go to X’.
- We propose a curriculum learning approach where the number of contexts seen by the agent increases whenever the agent’s performance is strong enough (as measured by the decoder) on the current set of contexts.

We perform a comparative analysis of VALOR, VIC, and DIAYN with and without the curriculum trick, evaluating them in various robotics environments (point mass, cheetah, swimmer, ant).^{1} We show that, to the extent that our metrics can measure, all three of them perform similarly, except that VALOR can attain qualitatively different behavior because of its trajectory-centric approach, and DIAYN learns more quickly because of its denser reward signal. We show that our curriculum trick stabilizes and speeds up learning for all three methods, and can allow a single agent to learn up to hundreds of modes. Beyond our core comparison, we also explore applications of variational option discovery in two interesting spotlight environments: a simulated robot hand and a simulated humanoid. Variational option discovery finds naturalistic finger-flexing behaviors in the hand environment, but performs poorly on the humanoid, in the sense that it does not discover natural crawling or walking gaits. We consider this evidence that pure information-theoretic objectives can do a poor job of capturing human priors on useful behavior in complex environments. Lastly, we try a proof-of-concept for applicability to downstream tasks in a variant of ant-maze by using a (particularly good) pretrained VALOR policy as the lower level of a hierarchy. In this experiment, we find that the VALOR policy is more useful than a random network as a lower level, and as useful as learning a lower level from scratch in the environment.

# 2 Related Work

**Option Discovery**: Substantial prior work exists on option discovery (Sutton et al. [1999], Precup [2000]); here we will restrict our attention to relevant recent work in the deep RL setting. Bacon et al. [2017] and Fox et al. [2017] derive policy gradient methods for learning options: Bacon et al. [2017] learn options concurrently with solving a particular task, while Fox et al. [2017] learn options from demonstrations to accelerate specific-task learning. Vezhnevets et al. [2017] propose an architecture and training algorithm which can be interpreted as implicitly learning options. Thomas et al. [2017] find options as controllable factors in the environment. Machado et al. [2017a], Machado et al. [2017b], and Liu et al. [2017] learn *eigenoptions*, options derived from the graph Laplacian associated with the MDP. Several approaches for option discovery are primarily information-theoretic: Gregor et al. [2016], Eysenbach et al. [2018], and Florensa et al. [2017] train policies to maximize mutual information between options and states or quantities derived from states; by contrast, we maximize information between options and whole trajectories. Hausman et al. [2018] learn skill embeddings by optimizing a variational bound on the entropy of the policy; the final objective function is closely connected with that of Florensa et al. [2017].

**Universal Policies**: Variational option discovery algorithms learn universal policies (goal- or instruction-conditioned policies), like universal value function approximators (Schaul et al. [2015]) and hindsight experience replay (Andrychowicz et al. [2017]). However, these other approaches require extrinsic reward signals and a hand-crafted instruction space. By contrast, variational option discovery is unsupervised and finds its own instruction space.

**Intrinsic Motivation**: Many recent works have incorporated intrinsic motivation (especially curiosity) into deep RL agents (Stadie et al. [2015], Houthooft et al. [2016], Bellemare et al. [2016], Achiam and Sastry [2017], Fu et al. [2017], Pathak et al. [2017], Ostrovski et al. [2017], Edwards et al. [2018]). However, none of these approaches were combined with learning universal policies, and so suffer from a problem of knowledge fade: when states cease to be interesting to the intrinsic reward signal (usually when they are no longer novel), unless they coincide with extrinsic rewards or are on a direct path to the next-most novel state, the agent will forget how to visit them.

**Variational Autoencoders**: Variational autoencoders (VAEs) (Kingma and Welling [2013]) learn a probabilistic encoder $q_\phi(z|x)$ and decoder $p_\theta(x|z)$ which map between data $x$ and latent variables $z$ by optimizing the evidence lower bound (ELBO) on the marginal distribution $p_\theta(x)$, assuming a prior $p(z)$ over latent variables. Higgins et al. [2017] extended the VAE approach by including a parameter $\beta$ to control the capacity of $z$ and improve the ability of VAEs to learn disentangled representations of high-dimensional data. The $\beta$-VAE optimization problem is

$$\max_{\phi, \theta} \; \mathbb{E}_{x \sim \mathcal{D}} \left[ \mathbb{E}_{z \sim q_\phi(\cdot|x)} \left[ \log p_\theta(x|z) \right] - \beta D_{KL}\left( q_\phi(z|x) \,\|\, p(z) \right) \right], \tag{1}$$

and when $\beta = 1$, it reduces to the standard VAE of Kingma and Welling [2013].
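To make the two terms of the β-VAE objective concrete, the following sketch evaluates a single-sample estimate for one data point. It assumes a diagonal Gaussian encoder, a standard normal prior (for which the KL term has a closed form), and a unit-variance Gaussian decoder likelihood; the function names and the example numbers are hypothetical and are not taken from Kingma and Welling [2013] or Higgins et al. [2017].

```python
import math

def gaussian_kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))

def gaussian_log_likelihood(x, x_mean):
    """log N(x; x_mean, I), an assumed unit-variance Gaussian decoder."""
    squared_error = sum((a - b) ** 2 for a, b in zip(x, x_mean))
    return -0.5 * squared_error - 0.5 * len(x) * math.log(2.0 * math.pi)

def beta_vae_objective(x, x_mean, mu, log_var, beta):
    """Single-sample estimate of E[log p(x|z)] - beta * KL(q(z|x) || p(z))."""
    reconstruction = gaussian_log_likelihood(x, x_mean)
    kl = gaussian_kl_to_standard_normal(mu, log_var)
    return reconstruction - beta * kl

# Hypothetical encoder and decoder outputs for one data point.
x, x_mean = [0.5, -1.0], [0.4, -0.8]
mu, log_var = [0.3, -0.2], [-0.1, 0.2]

standard_vae = beta_vae_objective(x, x_mean, mu, log_var, beta=1.0)
heavier_penalty = beta_vae_objective(x, x_mean, mu, log_var, beta=4.0)
```

Raising `beta` only scales the KL penalty, so for the same encoder and decoder outputs the objective value can only go down.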

**Novelty Search**: Option discovery algorithms based on the diversity of learned behaviors can be viewed as similar in spirit to novelty search (Lehman [2012]), an evolutionary algorithm which finds behaviors which are diverse with respect to a characterization function which is usually pre-designed but sometimes learned (as in Meyerson et al. [2016]).

# 3 Variational Option Discovery Algorithms

Our aim is to learn a policy $\pi$ where action distributions are conditioned on both the current state $s_t$ and a *context* $c$ which is sampled at the start of an episode and kept fixed throughout. The context should uniquely specify a particular mode of behavior (also called a skill). But instead of using reward functions to ground contexts to trajectories, we want the meaning of a context to be arbitrarily assigned (‘discovered’) during training.

We formulate a learning approach as follows. A context $c$ is sampled from a noise distribution $G$, and then encoded into a trajectory $\tau = (s_0, a_0, \ldots, s_T)$ by a policy $\pi(\cdot|s_t, c)$; afterwards $c$ is decoded from $\tau$ with a probabilistic decoder $D$. If the trajectory $\tau$ is unique to $c$, the decoder will place a high probability on $c$, and the policy should be correspondingly reinforced. Supervised learning can be applied to the decoder (because for each $\tau$, we know the ground truth $c$). To encourage exploration, we include an entropy regularization term with coefficient $\beta$. The full optimization problem is thus

$$\max_{\pi, D} \; \mathbb{E}_{c \sim G} \left[ \mathbb{E}_{\tau \sim \pi, c} \left[ \log P_D(c|\tau) \right] + \beta \mathcal{H}(\pi|c) \right], \tag{2}$$

where $P_D$ is the distribution over contexts from the decoder, and the entropy term is $\mathcal{H}(\pi|c) \doteq \mathbb{E}_{\tau \sim \pi, c} \left[ \sum_t \mathcal{H}(\pi(\cdot|s_t, c)) \right]$. We give a generic template for option discovery based on Eq. 2 as Algorithm 1.

Observe that the objective in Eq. 2 has a one-to-one correspondence with the $\beta$-VAE objective in Eq. 1: the context $c$ maps to the data $x$, the trajectory $\tau$ maps to the latent representation $z$, the policy $\pi$ and the MDP together form the encoder $q_\phi$, the decoder $D$ maps to the decoder $p_\theta$, and the entropy regularization $\mathcal{H}(\pi|c)$ maps to the KL-divergence of the encoder distribution from a prior where trajectories are generated by a uniform random policy (proof in Appendix A). Based on this connection, we call algorithms for solving Eq. 2 variational option discovery methods.

**Algorithm 1** Template for Variational Option Discovery with Autoencoding Objective

- Generate initial policy $\pi_{\theta_0}$, decoder $D_{\phi_0}$
- **for** $k = 0, 1, 2, \ldots$ **do**
  - Sample context-trajectory pairs $\mathcal{D} = \{(c^i, \tau^i)\}_{i=1,\ldots,N}$, by first sampling a context $c \sim G$ and then rolling out a trajectory in the environment, $\tau \sim \pi_{\theta_k}(\cdot|\cdot, c)$.
  - Update policy with any reinforcement learning algorithm to maximize Eq. 2, using batch $\mathcal{D}$
  - Update decoder by supervised learning to maximize $\mathbb{E}[\log P_D(c|\tau)]$, using batch $\mathcal{D}$
- **end for**
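The template can be sketched end to end on a toy problem. The example below is an illustration only, under strong simplifying assumptions that are not from the paper: the "environment" has a single step, so a trajectory is just one discrete outcome; the policy and decoder are softmax tables; the policy update is REINFORCE with a mean-reward baseline; and the entropy coefficient is set to $\beta = 0$. All names (`softmax`, `train_toy`, and the hyperparameter values) are hypothetical.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_toy(num_contexts=4, num_outcomes=4, iters=100, batch=64, lr=0.5, seed=0):
    rng = random.Random(seed)
    # Tabular policy logits theta[c][outcome] and decoder logits phi[outcome][c].
    theta = [[rng.gauss(0.0, 0.1) for _ in range(num_outcomes)] for _ in range(num_contexts)]
    phi = [[0.0] * num_contexts for _ in range(num_outcomes)]
    for _ in range(iters):
        # Sample context-trajectory pairs: c ~ G (uniform), tau ~ pi(.|c).
        pairs = []
        for _ in range(batch):
            c = rng.randrange(num_contexts)
            tau = rng.choices(range(num_outcomes), weights=softmax(theta[c]))[0]
            pairs.append((c, tau))
        # The policy's reward is the decoder log-probability log P_D(c | tau).
        rewards = [math.log(softmax(phi[tau])[c]) for c, tau in pairs]
        baseline = sum(rewards) / batch
        d_theta = [[0.0] * num_outcomes for _ in range(num_contexts)]
        d_phi = [[0.0] * num_contexts for _ in range(num_outcomes)]
        for (c, tau), r in zip(pairs, rewards):
            pi_c = softmax(theta[c])
            dec = softmax(phi[tau])
            for a in range(num_outcomes):
                # REINFORCE: (reward - baseline) * grad of log pi(tau | c).
                d_theta[c][a] += (r - baseline) * ((a == tau) - pi_c[a]) / batch
            for j in range(num_contexts):
                # Supervised learning: grad of log P_D(c | tau).
                d_phi[tau][j] += ((j == c) - dec[j]) / batch
        for c in range(num_contexts):
            for a in range(num_outcomes):
                theta[c][a] += lr * d_theta[c][a]
        for tau in range(num_outcomes):
            for j in range(num_contexts):
                phi[tau][j] += lr * d_phi[tau][j]
    return theta, phi

theta, phi = train_toy(iters=20, batch=32)
```

In a realistic setting the tables would be replaced by neural networks and the one-step outcome by a full rollout; the alternating structure of the loop is the part this sketch is meant to show.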


## 3.1 Connections to Prior Work

**Variational Intrinsic Control**: Variational Intrinsic Control^{2} (VIC) (Gregor et al. [2016]) is an option discovery technique based on optimizing a variational lower bound on the mutual information between the context and the final state in a trajectory, conditioned on the initial state. Gregor et al. [2016] give the optimization problem as

$$\max_{G, \pi, D} \; \mathbb{E}_{s_0 \sim \mu} \left[ \mathbb{E}_{c \sim G(\cdot|s_0),\, \tau \sim \pi, c} \left[ \log P_D(c|s_0, s_T) \right] + \mathcal{H}(G(\cdot|s_0)) \right], \tag{3}$$

where $\mu$ is the starting state distribution for the MDP. This differs from Eq. 2 in several ways: the context distribution $G$ can be optimized, $G$ depends on the initial state $s_0$, $G$ is entropy-regularized, entropy regularization for the policy $\pi$ is omitted, and the decoder only looks at the first and last state of the trajectory instead of the entire thing. However, they also propose to keep $G$ fixed and state-independent, and do this in their experiments; additionally, their experiments use decoders which are conditioned on the final state only. This reduces Eq. 3 to Eq. 2 with $\beta = 0$ and $\log P_D(c|\tau) = \log P_D(c|s_T)$. We treat this as the canonical form of VIC and implement it this way for our comparison study.

**Diversity is All You Need**: Diversity is All You Need (DIAYN) (Eysenbach et al. [2018]) performs option discovery by optimizing a variational lower bound for an objective function designed to maximize mutual information between context and *every* state in a trajectory, while minimizing mutual information between actions and contexts conditioned on states, and maximizing entropy of the mixture policy over contexts. The exact optimization problem is

$$\max_{\pi, D} \; \mathbb{E}_{c \sim G} \left[ \mathbb{E}_{\tau \sim \pi, c} \left[ \sum_{t=0}^{T} \left( \log P_D(c|s_t) - \log G(c) \right) \right] + \beta \mathcal{H}(\pi|c) \right]. \tag{4}$$

In DIAYN, $G$ is kept fixed (as in canonical VIC), so the term $\log G(c)$ is constant and may be removed from the optimization problem. Thus Eq. 4 is a special case of Eq. 2 with $\log P_D(c|\tau) = \sum_{t=0}^{T} \log P_D(c|s_t)$.
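The difference between the canonical VIC and DIAYN decoder terms can be seen in a few lines. The sketch below assumes we already have the per-state decoder log-probabilities $\log P_D(c|s_t)$ for one trajectory; the function names and the example probabilities are hypothetical. A decoder that reads the whole trajectory at once, rather than state by state, does not fit this per-state interface and would supply a single trajectory-level log-probability directly.

```python
import math

def vic_decoder_term(state_log_probs):
    """Canonical VIC: decode from the final state only, log P_D(c | s_T)."""
    return state_log_probs[-1]

def diayn_decoder_term(state_log_probs):
    """DIAYN: sum of per-state terms, sum_t log P_D(c | s_t)."""
    return sum(state_log_probs)

def objective_estimate(decoder_terms, policy_entropies, beta):
    """Batch average of log P_D(c | tau) + beta * H(pi | c), one entry per trajectory."""
    n = len(decoder_terms)
    return sum(d + beta * h for d, h in zip(decoder_terms, policy_entropies)) / n

# Hypothetical decoder probabilities of the true context at s_0, s_1, s_2.
state_probs = [0.25, 0.5, 0.8]
state_log_probs = [math.log(p) for p in state_probs]

vic = vic_decoder_term(state_log_probs)      # log 0.8
diayn = diayn_decoder_term(state_log_probs)  # log(0.25 * 0.5 * 0.8) = log 0.1
```

Because DIAYN adds a term at every state, its reward signal is denser than VIC's, which only depends on where the trajectory ends.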

## 3.2 VALOR

In this section, we propose Variational Autoencoding Learning of Options by Reinforcement (VALOR), a variational option discovery method which directly optimizes Eq. 2 with two key decisions about the decoder:

- The decoder never sees actions. Our conception of ‘interesting’ behaviors requires that the agent attempt to interact with the environment to achieve some change in state. If the decoder was permitted to see raw actions, the agent could signal the context directly through its actions and ignore the environment. Limiting the decoder in this way forces the agent to manipulate the environment to communicate with the decoder.

- Unlike in DIAYN, the decoder does *not* decompose as a sum of per-timestep computations. That is, $\log P_D(c|\tau) \neq \sum_{t=0}^{T} f(s_t, c)$. We choose against this decomposition because it could limit the ability of the decoder to correctly distinguish between behaviors which share some states, or behaviors which share all states but reach them in different orders.

We implement VALOR with a recurrent architecture for the decoder (Fig. 1), using a bidirectional LSTM to make sure that both the beginning and end of a trajectory are equally important. We only use *N *= 11 equally spaced observations from the trajectory as inputs, for two reasons: 1) computational efficiency, and 2) to encode a heuristic that we are only interested in low-frequency behaviors (as opposed to information-dense high-frequency jitters). Lastly, taking inspiration from Vezhnevets et al. [2017], we only decode from the *k*-step *transitions *(deltas) in state space between the *N *observations. Intuitively, this corresponds to a prior that agents should move, as any two modes where the agent stands still in different poses will be indistinguishable to the decoder (because the deltas will be identically zero). We do not decode from transitions in VIC or DIAYN, although we note it would be possible and might be interesting future work.
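The preprocessing described above (subsample equally spaced observations, then take differences) can be sketched as follows. The exact index selection is an assumption: here the first and last states are always included and the indices in between are rounded to the nearest timestep. The function name `decoder_inputs` is hypothetical, and the recurrent decoder itself is not shown.

```python
def decoder_inputs(states, num_obs=11):
    """Pick num_obs equally spaced states and return the deltas between neighbours.

    states: list of state vectors s_0, ..., s_T (each a list of floats).
    Returns num_obs - 1 delta vectors, which would be fed to the decoder.
    """
    last = len(states) - 1
    indices = [round(i * last / (num_obs - 1)) for i in range(num_obs)]
    picked = [states[i] for i in indices]
    return [[b - a for a, b in zip(s0, s1)] for s0, s1 in zip(picked, picked[1:])]

# A hypothetical 2-D trajectory with 101 states moving at constant velocity.
moving = [[float(t), 2.0 * t] for t in range(101)]
moving_deltas = decoder_inputs(moving)

# Two different poses held still produce identical (all-zero) decoder inputs.
still_a = decoder_inputs([[1.0, 1.0]] * 101)
still_b = decoder_inputs([[5.0, -3.0]] * 101)
```

The last two lines illustrate the stated prior that agents should move: stationary modes collapse to the same input regardless of pose.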

## 3.3 Curriculum Approach

The standard approach for context distributions, used in VIC and DIAYN, is to have $K$ discrete contexts with a uniform distribution: $c \sim \text{Uniform}(K)$. In our experiments, we found that this worked poorly for large $K$ across all three algorithms we compared. Even with very large batches (to ensure that each context was sampled often enough to get a low-variance contribution to the gradient), training was challenging. We found a simple trick to resolve this issue: start training with small $K$ (where learning is easy), and gradually increase it over time as the decoder gets stronger. Whenever $\mathbb{E}[\log P_D(c|\tau)]$ is high enough (we pick a fairly arbitrary threshold of $P_D(c|\tau) \approx 0.86$), we increase $K$ according to

$$K \leftarrow \min\left( \operatorname{int}(1.5 \times K + 1),\; K_{max} \right), \tag{5}$$

where $K_{max}$ is a hyperparameter. As our experiments show, this curriculum leads to faster and more stable convergence.
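A minimal sketch of the curriculum step is given below. It assumes the growth rule $K \leftarrow \min(\operatorname{int}(1.5 \times K + 1), K_{max})$ and compares the batch-average decoder log-probability against the logarithm of the 0.86 threshold; the function name, the starting value $K = 2$, and the choice of $K_{max} = 64$ in the usage loop are illustrative assumptions.

```python
import math

def update_num_contexts(k, k_max, mean_log_prob, threshold=0.86):
    """Grow the number of contexts once the decoder is accurate enough."""
    if mean_log_prob >= math.log(threshold):
        return min(int(1.5 * k + 1), k_max)
    return k

# Schedule produced if the decoder clears the threshold at every check.
k, schedule = 2, [2]
while k < 64:
    k = update_num_contexts(k, 64, mean_log_prob=math.log(0.9))
    schedule.append(k)
# schedule == [2, 4, 7, 11, 17, 26, 40, 61, 64]
```

Under this rule the context count grows roughly geometrically, so reaching $K_{max} = 64$ from the assumed start takes eight successful checks.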

# 4 Experimental Setup

In our experiments, we try to answer the following questions:

- What are best practices for training agents with variational option discovery algorithms (VALOR, VIC, DIAYN)? Does the curriculum learning approach help?
- What are the qualitative results from running variational option discovery algorithms? Are the learned behaviors recognizably distinct to a human? Are there substantial differences between algorithms?
- Are the learned behaviors useful for downstream control tasks?

**Test environments**: Our core comparison experiments are on a slate of locomotion environments: a custom 2D point agent, the HalfCheetah and Swimmer robots from the OpenAI Gym [Brockman et al., 2016], and a customized version of Ant from Gym where contact forces are omitted from the observations. We also tried running variational option discovery on two other interesting simulated robots: a dextrous hand (with S ∈ R^{48} and A ∈ R^{20}, based on Plappert et al. [2018]), and a new complex humanoid environment we call ‘toddler’ (with S ∈ R^{335} and A ∈ R^{35}). Lastly, we investigated applicability to downstream tasks in a modified version of Ant-Maze (Frans et al. [2018]).

**Implementation**: We implement VALOR, VIC, and DIAYN with vanilla policy gradient as the RL algorithm (described in Appendix B.1). We note that VIC and DIAYN were originally implemented with different RL algorithms: Gregor et al. [2016] implemented VIC with tabular Q learning (Watkins and Dayan [1992]), and Eysenbach et al. [2018] implemented DIAYN with soft actor-critic (Haarnoja et al.). Also unlike prior work, we use recurrent neural network policy architectures. Because there is not a final objective function to measure whether an algorithm has achieved qualitative diversity of behaviors, our hyperparameters are based on what resulted in stable training, and kept constant across algorithms. Because the design space for these algorithms is very large and evaluation is to some degree subjective, we caution that our results should not necessarily be viewed as definitive.

**Training techniques**: We investigated two specific techniques for training: curriculum generation via Eq. 5, and context embeddings. On context embeddings: a natural approach for providing the integer context as input to a neural network policy is to convert the context to a one-hot vector and concatenate it with the state, as in Eysenbach et al. [2018]. Instead, we consider whether training is improved by allowing the agent to learn its own embedding vector for each context.
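The two ways of feeding the context to the policy can be contrasted with a short sketch. The helper names, the embedding dimension, and the Gaussian initialization are assumptions for illustration; in an actual agent the embedding table would be a trainable parameter updated together with the policy, which this sketch does not do.

```python
import random

def one_hot_input(state, context, num_contexts):
    """Concatenate the state with a one-hot encoding of the integer context."""
    one_hot = [1.0 if i == context else 0.0 for i in range(num_contexts)]
    return list(state) + one_hot

class ContextEmbedding:
    """Lookup table holding one vector per context (trainable in a real agent)."""

    def __init__(self, num_contexts, dim, seed=0):
        rng = random.Random(seed)
        self.table = [[rng.gauss(0.0, 0.1) for _ in range(dim)]
                      for _ in range(num_contexts)]

    def policy_input(self, state, context):
        """Concatenate the state with the context's embedding vector."""
        return list(state) + self.table[context]

state = [0.1, 0.2]
x_one_hot = one_hot_input(state, context=2, num_contexts=4)
embedding = ContextEmbedding(num_contexts=64, dim=8)
x_embedded = embedding.policy_input(state, context=5)
```

With one-hot inputs the policy input grows with the number of contexts, whereas with embeddings its size is fixed by the embedding dimension.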

# 5 Results

**Exploring Optimization Techniques**: We present partial findings for our investigation of training techniques in Fig. 2 (showing results for just VALOR), with complete findings in Appendix C. In Fig. 2a, we compare performance with and without embeddings, using a uniform context distribution, for several choices of $K$ (the number of contexts). We find that using embeddings consistently improves the speed and stability of training. Fig. 2a also illustrates that training with a uniform distribution becomes more challenging as $K$ increases. In Figs. 2b and 2c, we show that agents with the curriculum trick and embeddings achieve mastery on $K_{max} = 64$ contexts substantially faster than the agents trained with uniform context distributions in Fig. 2a. As shown in Appendix C, these results are consistent across algorithms.

**Comparison Study of Qualitative Results**: In our comparison, we tried to assess whether variational option discovery algorithms learn an interesting set of behaviors. This is subjective and hard to measure, so we restricted ourselves to testing for behaviors which are easy to quantify or observe; we note that there is substantial room in this space for developing performance metrics, and consider this an important avenue for future research.

We trained agents by VALOR, VIC, and DIAYN, with embeddings and $K = 64$ contexts, with and without the curriculum trick. We evaluated the learned behaviors by measuring the following quantities: final $x$-coordinate for Cheetah, final distance from origin for Swimmer, final distance from origin for Ant, and number of $z$-axis rotations for Ant^{3}. We present partial findings in Fig. 3 and complete results in Appendix D. Our results confirm findings from prior work, including Eysenbach et al. [2018] and Florensa et al. [2017]: variational option discovery methods, in some MuJoCo environments, are able to find locomotion gaits that travel in a variety of speeds and directions. Results in Cheetah and Ant are particularly good by this measure; in Swimmer, fairly few behaviors actually travel any meaningful distance from the origin (> 3 units), but it happens non-negligibly often. All three algorithms produce similar results in the locomotion domains, although we do find slight differences: particularly, DIAYN is more prone than VALOR and VIC to learn behaviors like ‘attain target state,’ where the target state is fixed and unmoving. Our DIAYN behaviors are overall less mobile than the results reported by Eysenbach et al. [2018]; we believe that this is due to qualitative differences in how entropy is maximized by the underlying RL algorithms (soft actor-critic vs. entropy-regularized policy gradients).

We find that the curriculum approach does not appear to change the diversity of behaviors discovered in any large or consistent way. It appears to slightly increase the ranges for Cheetah $x$-coordinate, while slightly decreasing the ranges for Ant final distance. Scrutinizing the X-Y traces for all learned modes, it seems (subjectively) that the curriculum approach causes agents to move more erratically (see Appendices D.11–D.14). We do observe a particularly interesting effect for robustness: the curriculum approach makes the distribution of scores more consistent between random seeds (for performances of all seeds separately, see Appendices D.3–D.10).

We also attempted to perform a baseline comparison of all three variational option discovery methods against an approach where we used random reward functions in place of a learned decoder; however, we encountered substantial difficulties in optimizing with random rewards. The details of these experiments are given in Appendix E.

**Hand and Toddler Environments**: Optimizing in the Hand environment (Fig. 4f) was fairly easy and usually produced some naturalistic behaviors (e.g., pointing, bringing thumb and forefinger together, and one common rude gesture) as well as various unnatural behaviors (hand splayed out in what would be painful poses). Optimizing in the Toddler environment (Fig. 4g) was highly challenging; the agent frequently struggled to learn more than a handful of behaviors. The behaviors which the agent did learn were extremely unnatural. We believe that this is because of a fundamental limitation of purely information-theoretic RL objectives: humans have strong priors on what constitutes natural behavior, but for sufficiently complex systems, those behaviors form a set of measure zero in the space of all possible behaviors; when a purely information-theoretic objective function is used, it will give no preference to the behaviors humans consider natural.

**Learning Hundreds of Behaviors**: Via the curriculum approach, we are able to train agents in the Point environment to learn hundreds of behaviors which are distinct according to the decoder (Fig. 4e). We caution that this does not necessarily expand the space of behaviors which are learnable—it may merely allow for increasingly fine-grained binning of already-learned behaviors into contexts. From various experiments prior to our final results, we developed an intuition that it was important to carefully consider the capacity of the decoder here: the greater the decoder’s capacity, the more easily it would overfit to undetectably-small differences in trajectories.

**Mode Interpolation**: We experimented with interpolating between context embeddings for point and ant policies to see if we could obtain interpolated behaviors. As shown in Fig. 5, we found that some reasonably smooth interpolations were possible. This suggests that even though only a discrete number of behaviors are trained, the training procedure learns general-purpose universal policies.
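One simple way to realize such an interpolation, assuming it is done linearly between two learned embedding vectors (the interpolation scheme is our assumption for this sketch), is shown below. The function name and the example vectors are hypothetical; each interpolated vector would be fed to the policy in place of a trained context embedding.

```python
def interpolate_embeddings(e_a, e_b, alpha):
    """Linear blend of two context embeddings; alpha=0 gives e_a, alpha=1 gives e_b."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(e_a, e_b)]

# Hypothetical embeddings for two learned modes.
e_a, e_b = [0.0, 2.0], [1.0, 0.0]
sweep = [interpolate_embeddings(e_a, e_b, i / 4) for i in range(5)]
```

Rolling out the policy once per entry of `sweep` would then show whether behavior changes smoothly between the two modes.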

**Downstream Tasks**: We investigated whether behaviors learned by variational option discovery could be used for a downstream task by taking a policy trained with VALOR on the Ant robot (Uniform distribution, seed 10; see Appendix D.7), and using it as the lower level of a two-level hierarchical policy in Ant-Maze. We held the VALOR policy fixed throughout downstream training, and only trained the upper level policy, using A2C as the RL algorithm (with reinforcement occurring only at the lower level; the upper level actions were trained by signals backpropagated through the lower level). Results are shown in Fig. 4h. We compared the performance of the VALOR-based agent to three baselines: a hierarchical agent with the same architecture trained from scratch on Ant-Maze (‘Trained’ in Fig. 4h), a hierarchical agent with a fixed random network as the lower level (‘Random’ in Fig. 4h), and a non-hierarchical agent with the same architecture as the upper level in the hierarchical agents (an MLP with one hidden layer, ‘None’ in Fig. 4h). We found that the VALOR agent worked as well as the hierarchy trained from scratch and the non-hierarchical policy, with qualitatively similar learning curves for all three; the fixed random network performed quite poorly by comparison. This indicates that the space of options learned by (the particular run of) VALOR was at least as expressive as primitive actions, for the purposes of the task, and that VALOR options were more expressive than random networks here.

# 6 Conclusions

We performed a thorough empirical examination of variational option discovery techniques, and found they produce interesting behaviors in a variety of environments (such as Cheetah, Ant, and Hand), but can struggle in very high-dimensional control, as shown in the Toddler environment. From our mode interpolation and hierarchy experiments, we found evidence that the learned policies are universal in meaningful ways; however, we did not find clear evidence that hierarchies built on variational option discovery would outperform task-specific policies learned from scratch.

We found that with purely information-theoretic objectives, agents in complex environments will discover behaviors that encode the context in trivial ways, e.g., through tiling a narrow volume of the state space with contexts. Thus a key challenge for future variational option discovery algorithms is to make the decoder distinguish between trajectories in a way which corresponds with human intuition about meaningful differences.

## Acknowledgments

Joshua Achiam is supported by TRUST (Team for Research in Ubiquitous Secure Technology) which receives support from NSF (award number CCF-0424422).

# References

Joshua Achiam and Shankar Sastry. Surprise-Based Intrinsic Motivation for Deep Reinforcement Learning. mar 2017. URL http://arxiv.org/abs/1703.01732.

Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight Experience Replay. *NIPS*, 2017. URL http://arxiv.org/abs/1707.01495.

Pierre-Luc Bacon, Jean Harb, and Doina Precup. The Option-Critic Architecture. *AAAI*, 2017.

Marc G. Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying Count-Based Exploration and Intrinsic Motivation. *NIPS*, jun 2016. URL http://arxiv.org/abs/1606.01868.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. 2016. URL http://arxiv.org/abs/1606.01540.

Yan Duan, Xi Chen, John Schulman, and Pieter Abbeel. Benchmarking Deep Reinforcement Learning for Continuous Control. *The 33rd International Conference on Machine Learning (ICML 2016) (2016)*, 48:14, 2016. URL http://arxiv.org/abs/1604.06778.

Harri Edwards, Yuri Burda, and Amos Storkey. Curiosity-driven Exploration by Bootstrapping Features, feb 2018. URL https://openreview.net/forum?id=S1gWUifW0b.

Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is All You Need: Learning Skills without a Reward Function. 2018. URL http://arxiv.org/abs/1802.06070.

Carlos Florensa, Yan Duan, and Pieter Abbeel. Stochastic Neural Networks for Hierarchical Reinforcement Learning. *ICLR*, pages 1–17, 2017.

Roy Fox, Sanjay Krishnan, Ion Stoica, and Ken Goldberg. Multi-Level Discovery of Deep Options. 2017. URL http://arxiv.org/abs/1703.08294.

Kevin Frans, Jonathan Ho, Xi Chen, Pieter Abbeel, and John Schulman. Meta Learning Shared Hierarchies. In *ICLR*, 2018. URL https://openreview.net/pdf?id=SyX0IeWAW.

Justin Fu, John Co-Reyes, and Sergey Levine. EX2: Exploration with Exemplar Models for Deep Reinforcement Learning. In *NIPS*, pages 2577–2587, 2017. URL https://papers.nips.cc/paper/6851-ex2-exploration-with-exemplar-models-for-deep-reinforcement-learning.

Karol Gregor, Danilo Rezende, and Daan Wierstra. Variational Intrinsic Control. pages 1–15, 2016.

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning With A Stochastic Actor. URL https://arxiv.org/pdf/1801.01290.pdf.

Karol Hausman, Jost Tobias Springenberg, Ziyu Wang, Nicolas Heess, and Martin Riedmiller. Learning an Embedding Space for Transferable Robot Skills. *ICLR*, 2018.

Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. *ICLR*, pages 1–13, 2017. URL https://openreview.net/forum?id=Sy2fzU9gl.

Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. VIME: Variational Information Maximizing Exploration. *NIPS*, may 2016. URL http://arxiv.org/abs/1605.09674.

Diederik P. Kingma and Jimmy Lei Ba. Adam: a Method for Stochastic Optimization. *International Conference on Learning Representations 2015*, pages 1–15, 2015. ISSN 09252312. doi: http://doi.acm.org.ezproxy.lib.ucf.edu/10.1145/1830483.1830503.

Diederik P Kingma and Max Welling. Auto-Encoding Variational Bayes. (Ml):1–14, 2013. ISSN 1312.6114v10. doi: 10.1051/0004-6361/201527329. URL http://arxiv.org/abs/1312.6114.

Joel Lehman. *Evolution through the Search for Novelty*. PhD thesis, 2012. URL http://joellehman.com/lehman-dissertation.pdf.

Miao Liu, Marlos C. Machado, Gerald Tesauro, and Murray Campbell. The Eigenoption-Critic Framework. *NIPS Hierarchical RL Workshop*, 2017. URL http://arxiv.org/abs/1712.04065.

Marlos C Machado, Marc G Bellemare, and Michael Bowling. A Laplacian Framework for Option Discovery in Reinforcement Learning. 2017a.

Marlos C Machado, Clemens Rosenbaum, Xiaoxiao Guo, Miao Liu, Gerald Tesauro, and Murray Campbell. Eigenoption Discovery Through the Deep Successor Representation. pages 1–20, 2017b.

Elliot Meyerson, Joel Lehman, and Risto Miikkulainen. Learning Behavior Characterizations for Novelty Search. In *GECCO*, 2016. doi: 10.1145/2908812.2908929. URL ftp://www.cs.utexas.edu/pub/neural-nets/papers/meyerson.gecco16.pdf.

Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous Methods for Deep Reinforcement Learning. pages 1–28, 2016. URL http://arxiv.org/abs/1602.01783.

Georg Ostrovski, Marc G. Bellemare, Aaron van den Oord, and Remi Munos. Count-Based Exploration with Neural Density Models. *ICML*, mar 2017. URL http://arxiv.org/abs/1703.01310.

Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven Exploration by Self-supervised Prediction. In *ICML*, may 2017. URL http://arxiv.org/abs/1705.05363.

Matthias Plappert, Marcin Andrychowicz, Alex Ray, Bob Mcgrew, Bowen Baker, Glenn Powell, Jonas Schneider, Josh Tobin, Maciek Chociej, Peter Welinder, Vikash Kumar, and Wojciech Zaremba. Multi-Goal Reinforcement Learning: Challenging Robotics Environments and Request for Research. 2018. URL https://arxiv.org/pdf/1802.09464.pdf.

Doina Precup. Temporal Abstraction in Reinforcement Learning. *PhD Thesis, University of Massachusetts*, 2000. ISSN 1308-0911. doi: 10.16953/deusbed.74839.

Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal Value Function Approximators. *Proceedings of The 32nd International Conference on Machine Learning*, pages 1312–1320, 2015. ISSN 1938-7228. URL http://jmlr.org/proceedings/papers/v37/schaul15.html.

Bradly C. Stadie, Sergey Levine, and Pieter Abbeel. Incentivizing Exploration In Reinforcement Learning With Deep Predictive Models. jul 2015. URL http://arxiv.org/abs/1507.00814.

Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning. *Artificial Intelligence*, 112, 1999.

Valentin Thomas, Jules Pondard, Emmanuel Bengio, Marc Sarfati, Philippe Beaudoin, Marie-Jean Meurs, Joelle Pineau, Doina Precup, and Yoshua Bengio. Independently Controllable Factors. pages 1–13, 2017.

Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. FeUdal Networks for Hierarchical Reinforcement Learning. (1), 2017. ISSN 1938-7228. URL http://arxiv.org/abs/1703.01161.

Christopher J. C. H. Watkins and Peter Dayan. Q-learning. *Machine Learning*, 8(3-4):279–292, 1992. ISSN 0885-6125. doi: 10.1007/BF00992698. URL http://link.springer.com/10.1007/BF00992698.

# A VAE-Equivalence Proof

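The KL-divergence of $P(\tau \mid \pi, c)$ from $P(\tau \mid \pi_0)$ is

$$
D_{KL}\big(P(\tau \mid \pi, c) \,\|\, P(\tau \mid \pi_0)\big) = \mathbb{E}_{\tau \sim \pi, c}\left[\sum_{t} \log \pi(a_t \mid s_t, c)\right] - \mathbb{E}_{\tau \sim \pi, c}\left[\sum_{t} \log \pi_0(a_t \mid s_t)\right],
$$

where the initial-state and transition probabilities cancel inside the logarithm because both trajectory distributions share the same environment dynamics.

The first term is our entropy regularization term (the negative entropy of the policy under context $c$). The second term, for a uniform random policy $\pi_0$, is a constant independent of $\pi$ (as long as $T$ is the same for all episodes) and can thus be removed from the objective function without changing the optimization problem.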
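As a worked illustration of why the $\pi_0$ term is constant, assume (purely for the sake of the example) a finite action set $\mathcal{A}$, so that the uniform random policy is $\pi_0(a \mid s) = 1/|\mathcal{A}|$ for every state and action. The symbol $\mathcal{A}$ is introduced here and is not part of the original notation. Under that assumption,

$$
-\mathbb{E}_{\tau \sim \pi, c}\left[\sum_{t} \log \pi_0(a_t \mid s_t)\right] = -\sum_{t} \log \frac{1}{|\mathcal{A}|} = T \log |\mathcal{A}|,
$$

which depends only on the episode length $T$ and the size of the action set, not on $\pi$ or $c$. For a bounded continuous action space, the same argument goes through with $|\mathcal{A}|$ replaced by the volume of the action space.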

# B Implementation Details

## B.1 Policy Optimization Algorithm

In this section, we will describe how we performed policy optimization for our experiments. We used vanilla policy gradient to optimize the reinforcement objective for all three variational option discovery algorithms,

although details varied slightly between algorithms and environments. The variation between environments was due to the presence or absence of extrinsic rewards. In all environments except for Ant, there were no extrinsic rewards; however, in Ant, a small penalty was applied for falling over (as opposed to terminating the episode when the agent falls over, as in Eysenbach et al. [2018]).

- For VALOR and VIC, the advantage function was:

where the normalize function subtracts out the batch mean and divides by the batch standard deviation, and $V_\psi$ was a learned value function baseline. $V_\psi(s_t, c)$ was learned by taking one gradient descent step on

per iteration.

- For DIAYN, the advantage function was:

where $V_\psi(s_t, c)$ was learned by descending on

When computing the gradient of the entropy term, we made an approximation that ignored the role of $\pi$ in the distribution over trajectories:

resulting in the same entropy regularization as in Mnih et al. [2016]. Following practices for vanilla policy gradient established in Duan et al. [2016], we used the Adam optimizer (Kingma and Ba [2015]).
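The batch normalization of advantages described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the exact advantage expressions used in the experiments: it assumes a generic "discounted reward-to-go minus baseline" form, and the function names, the `eps` stabilizer, and the example numbers are all hypothetical.

```python
import numpy as np


def discounted_reward_to_go(rewards, gamma=0.97):
    """Discounted sum of future rewards at each step of one trajectory."""
    out = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


def normalize(x, eps=1e-8):
    """Subtract the batch mean and divide by the batch standard deviation.

    eps is an assumed numerical safeguard against a zero standard deviation.
    """
    x = np.asarray(x, dtype=np.float64)
    return (x - x.mean()) / (x.std() + eps)


# Hypothetical single-trajectory batch: rewards and baseline predictions.
rewards = np.array([0.0, 0.0, 1.0])
baseline = np.array([0.5, 0.6, 0.7])  # stand-in for V_psi(s_t, c)
advantages = normalize(discounted_reward_to_go(rewards) - baseline)
```

After normalization the advantages in a batch have zero mean and (up to `eps`) unit standard deviation, which keeps the scale of the policy gradient comparable across epochs and environments.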

## B.2 Hyperparameters

For all variational option discovery algorithms, we used:

- 1000 paths per epoch for the policy gradient batch
- $\gamma = 0.97$ as the discount factor
- $\beta = 1 \times 10^{-3}$ as the entropy regularization coefficient, where applicable (omitted for VIC)
- $1 \times 10^{-3}$ as the Adam learning rate
- LSTM(64) followed by MLP(32) with tanh activations as the policy architecture
- 32 as the context embedding dimension (when using context embeddings)

For VALOR, the decoder was a bidirectional LSTM where the cell for each direction was of size 64. For VIC and DIAYN, the decoder was an MLP of size (180, 180).
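For readers who want to reproduce these settings, the hyperparameters listed above could be collected into a single configuration object along the following lines. This is only a sketch: the class name, field names, and the `config_for` helper are hypothetical, and representing "omitted for VIC" as `None` is an assumption about how one might encode it.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class OptionDiscoveryConfig:
    paths_per_epoch: int = 1000
    discount: float = 0.97                 # gamma
    entropy_coef: Optional[float] = 1e-3   # beta; None where not applicable
    adam_learning_rate: float = 1e-3
    policy_lstm_units: int = 64
    policy_mlp_units: int = 32             # tanh activations
    context_embedding_dim: int = 32        # only when embeddings are used
    decoder: str = "bidirectional_lstm"
    decoder_sizes: Tuple[int, ...] = (64,)  # cell size per direction


def config_for(algorithm: str) -> OptionDiscoveryConfig:
    """Return the settings described in this section for one algorithm."""
    base = OptionDiscoveryConfig()
    if algorithm == "VALOR":
        return base
    if algorithm == "DIAYN":
        return replace(base, decoder="mlp", decoder_sizes=(180, 180))
    if algorithm == "VIC":
        # The entropy regularization coefficient is omitted for VIC.
        return replace(base, decoder="mlp", decoder_sizes=(180, 180),
                       entropy_coef=None)
    raise ValueError(f"unknown algorithm: {algorithm}")
```

Calling `config_for("VIC")` yields the shared policy settings with an MLP decoder and no entropy coefficient, while `config_for("VALOR")` keeps the bidirectional LSTM decoder.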

# C Additional Analysis for Best Practices

# D Complete Experimental Results for Comparison Study

## D.1 Guide to Reading This Section

In this section we present the results from our core comparison of {VALOR, VIC, DIAYN} × {Uniform, Curriculum}. Because these algorithms perform unsupervised behavior discovery, analyzing our results is highly challenging: there is no single, quantitative measure by which to compare the algorithms. We choose to examine our results in a variety of ways:

- Learning curves for the optimization objective.
- Bar charts and histograms to show scores for the learned behaviors. Particularly, we evaluate final $x$-coordinate in the Cheetah environment, final distance traveled in the Swimmer environment, final distance traveled in the Ant environment, and number of $z$-axis rotations in the Ant environment. Scores are evaluated on trajectories of length $T = 1000$ steps, even though agents are trained on trajectories with $T = 250$; we find that using longer horizons at test time clarifies the differences between behaviors.
- X-Y traces for agent trajectories in the Point and Ant environments. (X-Y traces for the center-of-mass in Swimmer are not very insightful: Swimmer behavior is highly oscillatory and so it is difficult to discern what is happening.)

Regarding the bar charts and histograms in subsections D.3 to D.10:

- The bar charts are arranged in nearly the same way as the charts in 3: the $x$-axis is behavior ID, and the $y$-axis shows score in log scale for that behavior. The black bars show standard deviations for behavior scores.
- The histograms show score on the $x$-axis, and number of behaviors that fall into a given bin on the $y$-axis in log scale.
- The charts for ‘all’ show the composite bars for all behaviors from seeds 0, 10, and 20. The ‘s0’, ‘s10’, and ‘s20’ charts show behaviors from particular random seeds. Each single seed corresponds to a single policy with $K = 64$ behaviors.
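The quantities plotted in these charts can be computed roughly as follows. This is a hedged sketch rather than the plotting code used for the figures: it assumes scores are stored as an array of shape (behaviors, rollouts), that the black bars are standard deviations over repeated rollouts of the same behavior, and that histogram bins are evenly spaced; the function names and the bin count are hypothetical.

```python
import numpy as np


def summarize_behavior_scores(scores):
    """Per-behavior mean and standard deviation.

    scores: array of shape (num_behaviors, num_rollouts), an assumed layout.
    Returns (means, stds), each of shape (num_behaviors,).
    """
    scores = np.asarray(scores, dtype=np.float64)
    return scores.mean(axis=1), scores.std(axis=1)


def behavior_histogram(mean_scores, num_bins=10):
    """Count how many behaviors fall into each score bin."""
    counts, edges = np.histogram(mean_scores, bins=num_bins)
    return counts, edges


# Hypothetical scores for K = 4 behaviors with 3 rollouts each.
example_scores = np.array([
    [1.0, 1.0, 1.0],
    [2.0, 4.0, 6.0],
    [0.0, 0.0, 3.0],
    [5.0, 5.0, 5.0],
])
means, stds = summarize_behavior_scores(example_scores)
counts, edges = behavior_histogram(means, num_bins=4)
```

A composite ‘all’ chart would then correspond to concatenating the per-seed score arrays along the behavior axis before summarizing.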

Regarding the X-Y traces in subsections D.11 to D.14:

- In the Point traces, the ranges for $x$ and $y$ are $x \in [-1.3, 1.3]$ and $y \in [-1.3, 1.3]$.
- In the Ant traces, the ranges for $x$ and $y$ are $x \in [-15, 15]$ and $y \in [-15, 15]$.
- For the Point environment, traces are taken from trajectories with the same time horizon as training ($T = 65$); for the Ant environment, we use the $T = 1000$ trajectories.

## D.2 Learning Curves

## D.3 Evaluating Learned Behaviors: Cheetah, Uniform Context Distribution

## D.4 Evaluating Learned Behaviors: Cheetah, Curriculum Context Distribution

## D.5 Evaluating Learned Behaviors: Swimmer, Uniform Context Distribution

## D.6 Evaluating Learned Behaviors: Swimmer, Curriculum Context Distribution

## D.7 Evaluating Learned Behaviors: Ant (Distance), Uniform Context Distribution

## D.8 Evaluating Learned Behaviors: Ant (Distance), Curriculum Context Distribution

## D.9 Evaluating Learned Behaviors: Ant (Rotations), Uniform Context Distribution

## D.10 Evaluating Learned Behaviors: Ant (Rotations), Curriculum Context Distribution

## D.11 Point Environment, Uniform Context Distribution, XY-Traces

## D.12 Point Environment, Curriculum Context Distribution, XY-Traces

## D.13 Ant Environment, Uniform Context Distribution, XY-Traces

## D.14 Ant Environment, Curriculum Context Distribution, XY-Traces

# E Learning Multimodal Policies with Random Rewards

We considered a random reward baseline, where an agent acting under context $c$ would receive a reward

where $v_c$ was a random context-specific unit vector, obtained by sampling from $\mathcal{N}(0, I)$ and then normalizing. It seemed plausible that rewards of this form would do a good job of encoding human priors for robot behavior for the simple locomotion tasks in our core comparison. In practice, it turned out to be extremely challenging to train multimodal agents with these rewards; while it was somewhat easier to train unimodal agents with them, the behaviors that we observed were less interesting than expected. We present results from two sets of experiments:

RR1. a ceteris paribus analogue to our core comparison between variational option discovery algorithms, using all of the same hyperparameters (number of epochs, paths per epoch, number of contexts, the use of embeddings, learning rates, etc.), except with rewards from Eq. 6 instead of a learned decoder,

RR2. and a set of experiments where all else is equal except that the number of contexts is $K = 1$ instead of $K = 64$.

RR1 is a direct and fair comparison, while RR2 allows us to gain intuition for the behavior obtained by optimizing these random rewards separately from the challenges of multitask learning.
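The context-specific unit vectors $v_c$ used in both sets of experiments can be drawn as in the sketch below, which samples from a standard normal distribution and normalizes each sample to unit length. The function name, the seed handling, and the choice of dimension in the example are hypothetical; the dimension in practice depends on the quantity that $v_c$ multiplies in the reward, which this sketch does not attempt to reproduce.

```python
import numpy as np


def sample_context_vectors(num_contexts, dim, seed=0):
    """Draw one random unit vector per context.

    Each row is sampled from N(0, I) and then divided by its Euclidean norm,
    which gives directions distributed uniformly on the unit sphere.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal((num_contexts, dim))
    return v / np.linalg.norm(v, axis=1, keepdims=True)


# K = 64 contexts as in RR1; dim = 2 is an arbitrary illustrative choice.
vectors = sample_context_vectors(num_contexts=64, dim=2)
```

For RR2, the same function with `num_contexts=1` yields the single reward direction used to train a unimodal agent.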

## E.1 Results from RR1

The results in Cheetah (Fig. 20) look reasonable in composite, but are weak for individual random seeds: in each seed, the results are nearly bimodal, with one mode learning to run forward at some speed, and the other mode learning to run backwards at another speed. In Swimmer (Fig. 21), this form of random rewards inspires almost no motion. Results in the Ant environment (Figs. 22, 23) show extreme variability: no individual behavior was consistent with respect to the score functions we used (the black bars, representing standard deviation, are very large for every behavior).

## E.2 Results from RR2

We found no significant difference in quality of learned behaviors between the multimodal policies in RR1 and the unimodal policies in RR2, as shown in Fig. 24. That is, training with a single random reward function, instead of several at once, did not result in useful or consistent behavior as measured by our score functions.

## E.3 Discussion

Our conclusion is that random rewards based on Eq. 6 do not result in interesting behavior in the environments we considered. However, there may exist a functional form for random rewards which performs better.