Joshua Achiam

UC Berkeley & OpenAI

Dario Amodei

OpenAI

Harrison Edwards

OpenAI

Pieter Abbeel

UC Berkeley

Abstract

We explore methods for option discovery based on variational inference and make two algorithmic contributions. First: we highlight a tight connection between variational option discovery methods and variational autoencoders, and introduce Variational Autoencoding Learning of Options by Reinforcement (VALOR), a new method derived from the connection. In VALOR, the policy encodes contexts from a noise distribution into trajectories, and the decoder recovers the contexts from the complete trajectories. Second: we propose a curriculum learning approach where the number of contexts seen by the agent increases whenever the agent’s performance is strong enough (as measured by the decoder) on the current set of contexts. We show that this simple trick stabilizes training for VALOR and prior variational option discovery methods, allowing a single agent to learn many more modes of behavior than it could with a fixed context distribution. Finally, we investigate other topics related to variational option discovery, including fundamental limitations of the general approach and the applicability of learned options to downstream tasks.

1 Introduction

Humans are innately driven to experiment with new ways of interacting with their environments. This can accelerate the process of discovering skills for downstream tasks and can also be viewed as a primary objective in its own right. This drive serves as an inspiration for reward-free option discovery in reinforcement learning (based on the options framework of Sutton et al. [1999], Precup [2000]), where an agent tries to learn skills by interacting with its environment without trying to maximize cumulative reward for a particular task.

In this work, we explore variational option discovery, the space of methods for option discovery based on variational inference. We highlight a tight connection between prior work on variational option discovery and variational autoencoders (Kingma and Welling [2013]), and derive a new method based on the connection. In our analogy, a policy acts as an encoder, translating contexts from a noise distribution into trajectories; a decoder attempts to recover the contexts from the trajectories, and rewards the policies for making contexts easy to distinguish. Contexts are random vectors which have no intrinsic meaning prior to training, but they become associated with trajectories as a result of training; each context vector thus corresponds to a distinct option. Therefore this approach learns a set of options which are as diverse as possible, in the sense of being as easy to distinguish from each other as possible. We show that Variational Intrinsic Control (VIC) (Gregor et al. [2016]) and the recently-proposed Diversity is All You Need (DIAYN) (Eysenbach et al. [2018]) are specific instances of this template which decode from states instead of complete trajectories.

We make two main algorithmic contributions:

We introduce Variational Autoencoding Learning of Options by Reinforcement (VALOR), a new method which decodes from trajectories. The idea is to encourage learning dynamical modes instead of goal-attaining modes, e.g. ‘move in a circle’ instead of ‘go to X’.
We propose a curriculum learning approach where the number of contexts seen by the agent increases whenever the agent’s performance is strong enough (as measured by the decoder) on the current set of contexts.

We perform a comparison analysis of VALOR, VIC, and DIAYN with and without the curriculum trick, evaluating them in various robotics environments (point mass, cheetah, swimmer, ant).¹ We show that, to the extent that our metrics can measure, all three of them perform similarly, except that VALOR can attain qualitatively different behavior because of its trajectory-centric approach, and DIAYN learns more quickly because of its denser reward signal. We show that our curriculum trick stabilizes and speeds up learning for all three methods, and can allow a single agent to learn up to hundreds of modes. Beyond our core comparison, we also explore applications of variational option discovery in two interesting spotlight environments: a simulated robot hand and a simulated humanoid. Variational option discovery finds naturalistic finger-flexing behaviors in the hand environment, but performs poorly on the humanoid, in the sense that it does not discover natural crawling or walking gaits. We consider this evidence that pure information-theoretic objectives can do a poor job of capturing human priors on useful behavior in complex environments. Lastly, we try a proof-of-concept for applicability to downstream tasks in a variant of ant-maze by using a (particularly good) pretrained VALOR policy as the lower level of a hierarchy. In this experiment, we find that the VALOR policy is more useful than a random network as a lower level, and equivalently as useful as learning a lower level from scratch in the environment.

2 Related Work

Option Discovery: Substantial prior work exists on option discovery (Sutton et al. [1999], Precup [2000]); here we will restrict our attention to relevant recent work in the deep RL setting. Bacon et al. [2017] and Fox et al. [2017] derive policy gradient methods for learning options: Bacon et al. [2017] learn options concurrently with solving a particular task, while Fox et al. [2017] learn options from demonstrations to accelerate specific-task learning. Vezhnevets et al. [2017] propose an architecture and training algorithm which can be interpreted as implicitly learning options. Thomas et al. [2017] find options as controllable factors in the environment. Machado et al. [2017a], Machado et al. [2017b], and Liu et al. [2017] learn eigenoptions, options derived from the graph Laplacian associated with the MDP. Several approaches for option discovery are primarily information-theoretic: Gregor et al. [2016], Eysenbach et al. [2018], and Florensa et al. [2017] train policies to maximize mutual information between options and states or quantities derived from states; by contrast, we maximize information between options and whole trajectories. Hausman et al. [2018] learn skill embeddings by optimizing a variational bound on the entropy of the policy; the final objective function is closely connected with that of Florensa et al. [2017].

Universal Policies: Variational option discovery algorithms learn universal policies (goal- or instruction-conditioned policies), like universal value function approximators (Schaul et al. [2015]) and hindsight experience replay (Andrychowicz et al. [2017]). However, these other approaches require extrinsic reward signals and a hand-crafted instruction space. By contrast, variational option discovery is unsupervised and finds its own instruction space.

Intrinsic Motivation: Many recent works have incorporated intrinsic motivation (especially curiosity) into deep RL agents (Stadie et al. [2015], Houthooft et al. [2016], Bellemare et al. [2016], Achiam and Sastry [2017], Fu et al. [2017], Pathak et al. [2017], Ostrovski et al. [2017], Edwards et al. [2018]). However, none of these approaches were combined with learning universal policies, and so suffer from a problem of knowledge fade: when states cease to be interesting to the intrinsic reward signal (usually when they are no longer novel), unless they coincide with extrinsic rewards or are on a direct path to the next-most novel state, the agent will forget how to visit them.

Variational Autoencoders: Variational autoencoders (VAEs) (Kingma and Welling [2013]) learn a probabilistic encoder q_φ(z|x) and decoder p_θ(x|z) which map between data x and latent variables z by optimizing the evidence lower bound (ELBO) on the marginal distribution p_θ(x), assuming a prior p(z) over latent variables. Higgins et al. [2017] extended the VAE approach by including a parameter β to control the capacity of z and improve the ability of VAEs to learn disentangled representations of high-dimensional data. The β-VAE optimization problem is

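$$\max_{\phi, \theta} \; \mathbb{E}_{x \sim \mathcal{D}} \Big[ \mathbb{E}_{z \sim q_\phi(\cdot|x)} \big[ \log p_\theta(x|z) \big] - \beta \, D_{KL}\big( q_\phi(z|x) \,\|\, p(z) \big) \Big], \tag{1}$$

and when β = 1, it reduces to the standard VAE of Kingma and Welling [2013].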

Novelty Search: Option discovery algorithms based on the diversity of learned behaviors can be viewed as similar in spirit to novelty search (Lehman [2012]), an evolutionary algorithm which finds behaviors which are diverse with respect to a characterization function which is usually pre-designed but sometimes learned (as in Meyerson et al. [2016]).

3 Variational Option Discovery Algorithms

Our aim is to learn a policy π where action distributions are conditioned on both the current state s_t and a context c which is sampled at the start of an episode and kept fixed throughout. The context should uniquely specify a particular mode of behavior (also called a skill). But instead of using reward functions to ground contexts to trajectories, we want the meaning of a context to be arbitrarily assigned (‘discovered’) during training.

We formulate a learning approach as follows. A context c is sampled from a noise distribution G, and then encoded into a trajectory τ = (s₀, a₀, …, s_T ) by a policy π(·|s_t, c); afterwards c is decoded from τ with a probabilistic decoder D. If the trajectory τ is unique to c, the decoder will place a high probability on c, and the policy should be correspondingly reinforced. Supervised learning can be applied to the decoder (because for each τ , we know the ground truth c). To encourage exploration, we include an entropy regularization term with coefficient β. The full optimization problem is thus

$$\max_{\pi, D} \; \mathbb{E}_{c \sim G} \Big[ \mathbb{E}_{\tau \sim \pi, c} \big[ \log P_D(c|\tau) \big] + \beta H(\pi|c) \Big], \tag{2}$$

where P_D is the distribution over contexts from the decoder, and the entropy term is H(π|c) = E_{τ∼π,c}[Σ_t H(π(·|s_t, c))]. We give a generic template for option discovery based on Eq. 2 as Algorithm 1. Observe that the objective in Eq. 2 has a one-to-one correspondence with the β-VAE objective in Eq. 1: the context c maps to the data x, the trajectory τ maps to the latent representation z, the policy π and the MDP together form the encoder q_φ, the decoder D maps to the decoder p_θ, and the entropy regularization H(π|c) maps to the KL-divergence of the encoder distribution from a prior where trajectories are generated by a uniform random policy (proof in Appendix A). Based on this connection, we call algorithms for solving Eq. 2 variational option discovery methods.

Algorithm 1 Template for Variational Option Discovery with Autoencoding Objective

Generate initial policy π_θ0, decoder D_φ0
for k = 0, 1, 2, … do
    Sample context-trajectory pairs D = {(c^i, τ^i)}_{i=1,…,N}, by first sampling a context c ∼ G and then rolling out a trajectory in the environment, τ ∼ π_θk(·|·, c).
    Update policy with any reinforcement learning algorithm to maximize Eq. 2, using batch D
    Update decoder by supervised learning to maximize E[log P_D(c|τ)], using batch D
end for
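The loop in Algorithm 1 can be illustrated with a deliberately tiny sketch. Everything below is an assumption made for illustration and not the experimental setup of this paper: the "environment" is a single step in which the policy directly picks one of S discrete final states, the policy and decoder are tabular softmax models, the RL algorithm is plain REINFORCE with a batch-mean baseline, and the names `train_toy` and `expected_decoder_logp` are hypothetical. Because the decoder only sees a final state, this toy is closest in spirit to a state-based decoder rather than a trajectory decoder.

```python
import numpy as np

def log_softmax(z):
    # Numerically stable log-softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def expected_decoder_logp(theta, phi):
    """Exact E_{c~G} E_{s~pi(.|c)} [log P_D(c|s)] for a uniform G."""
    pi = np.exp(log_softmax(theta))      # pi[c, s]
    log_pd = log_softmax(phi)            # log_pd[s, c]
    return float((pi * log_pd.T).sum(axis=1).mean())

def train_toy(K=4, S=4, iters=1000, batch=256, beta=0.01, lr=1.0, seed=0):
    rng = np.random.default_rng(seed)
    theta = 0.1 * rng.standard_normal((K, S))  # policy logits (context -> state)
    phi = np.zeros((S, K))                     # decoder logits (state -> context)
    for _ in range(iters):
        # Sample contexts c ~ G (uniform) and roll out one-step "trajectories".
        c = rng.integers(K, size=batch)
        log_pi = log_softmax(theta)
        pi = np.exp(log_pi)
        u = rng.random(batch)
        s = np.minimum((pi[c].cumsum(axis=1) < u[:, None]).sum(axis=1), S - 1)

        # Policy update: REINFORCE on reward log P_D(c|s), plus entropy bonus.
        log_pd = log_softmax(phi)
        reward = log_pd[s, c]
        adv = reward - reward.mean()
        grad_theta = np.zeros_like(theta)
        np.add.at(grad_theta, c, adv[:, None] * (np.eye(S)[s] - pi[c]))
        entropy = -(pi * log_pi).sum(axis=1, keepdims=True)
        grad_entropy = -pi * (log_pi + entropy)  # dH/dlogits for a softmax
        theta += lr * (grad_theta / batch + beta * grad_entropy / K)

        # Decoder update: supervised ascent on log P_D(c|s) for the same batch.
        grad_phi = np.zeros_like(phi)
        np.add.at(grad_phi, s, np.eye(K)[c] - np.exp(log_pd)[s])
        phi += lr * grad_phi / batch
    return theta, phi

theta, phi = train_toy()
print("E[log P_D(c|s)] after training:", expected_decoder_logp(theta, phi))
```

With an untrained decoder the expected log-probability equals log(1/K); training should raise it as contexts become associated with distinguishable final states.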

3.1 Connections to Prior Work

Variational Intrinsic Control: Variational Intrinsic Control² (VIC) (Gregor et al. [2016]) is an option discovery technique based on optimizing a variational lower bound on the mutual information between the context and the final state in a trajectory, conditioned on the initial state. Gregor et al. [2016] give the optimization problem as

$$\max_{G, \pi, D} \; \mathbb{E}_{s_0 \sim \mu} \Big[ \mathbb{E}_{c \sim G(\cdot|s_0),\, \tau \sim \pi, c} \big[ \log P_D(c|s_0, s_T) \big] + H\big(G(\cdot|s_0)\big) \Big], \tag{3}$$

where µ is the starting state distribution for the MDP. This differs from Eq. 2 in several ways: the context distribution G can be optimized, G depends on the initial state s₀, G is entropy-regularized, entropy regularization for the policy π is omitted, and the decoder only looks at the first and last state of the trajectory instead of the entire thing. However, they also propose to keep G fixed and state-independent, and do this in their experiments; additionally, their experiments use decoders which are conditioned on the final state only. This reduces Eq. 3 to Eq. 2 with β = 0 and log P_D(c|τ) = log P_D(c|s_T). We treat this as the canonical form of VIC and implement it this way for our comparison study.

Diversity is All You Need: Diversity is All You Need (DIAYN) (Eysenbach et al. [2018]) performs option discovery by optimizing a variational lower bound for an objective function designed to maximize mutual information between context and every state in a trajectory, while minimizing mutual information between actions and contexts conditioned on states, and maximizing entropy of the mixture policy over contexts. The exact optimization problem is

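$$\max_{\pi, D} \; \mathbb{E}_{c \sim G} \Big[ \mathbb{E}_{\tau \sim \pi, c} \Big[ \sum_{t=0}^{T} \big( \log P_D(c|s_t) - \log G(c) \big) \Big] + \beta H(\pi|c) \Big]. \tag{4}$$

In DIAYN, G is kept fixed (as in canonical VIC), so the term log G(c) is constant and may be removed from the optimization problem. Thus Eq. 4 is a special case of Eq. 2 with log P_D(c|τ) = Σ_{t=0}^{T} log P_D(c|s_t).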
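The two reductions described in this subsection differ only in how per-state decoder outputs are combined into log P_D(c|τ). The sketch below makes that concrete, assuming we already have an array of per-state log-probabilities log P_D(c|s_t) for the true context; the function names and the example numbers are hypothetical.

```python
import numpy as np

def vic_decoder_term(per_state_logp):
    """Canonical VIC: log P_D(c|tau) = log P_D(c|s_T), the final state only."""
    return float(np.asarray(per_state_logp)[-1])

def diayn_decoder_term(per_state_logp):
    """DIAYN: log P_D(c|tau) = sum over t of log P_D(c|s_t)."""
    return float(np.sum(per_state_logp))

# Hypothetical per-state decoder probabilities for the true context.
per_state_logp = np.log([0.5, 0.25, 0.8])
vic_value = vic_decoder_term(per_state_logp)      # log(0.8)
diayn_value = diayn_decoder_term(per_state_logp)  # log(0.5 * 0.25 * 0.8)
```

The DIAYN form provides a reward contribution at every timestep, whereas the canonical VIC form only depends on the last state.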

3.2 VALOR

In this section, we propose Variational Autoencoding Learning of Options by Reinforcement (VALOR), a variational option discovery method which directly optimizes Eq. 2 with two key decisions about the decoder:

The decoder never sees actions. Our conception of ‘interesting’ behaviors requires that the agent attempt to interact with the environment to achieve some change in state. If the decoder was permitted to see raw actions, the agent could signal the context directly through its actions and ignore the environment. Limiting the decoder in this way forces the agent to manipulate the environment to communicate with the decoder.

Unlike in DIAYN, the decoder does not decompose as a sum of per-timestep computations. That is, log P_D(c|τ) ≠ Σ_{t=0}^{T} f(s_t, c). We choose against this decomposition because it could limit the ability of the decoder to correctly distinguish between behaviors which share some states, or behaviors which share all states but reach them in different orders.
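The second point can be checked directly: any score of the form Σ_t f(s_t, c) is invariant to the order in which states are visited. The snippet below is a small illustration with a hypothetical per-state score `f` and scalar states; it is not the decoder used in this paper.

```python
import math

def additive_decoder_score(states, f, c):
    """Score of the form sum_t f(s_t, c), as in a per-timestep decomposition."""
    return sum(f(s, c) for s in states)

def f(s, c):
    # Arbitrary hypothetical per-state score; any choice gives the same conclusion.
    return -0.5 * (s - c) ** 2

forward = [0.0, 1.0, 2.0, 3.0]   # visits the states in increasing order
backward = forward[::-1]         # same states, reversed order

score_forward = additive_decoder_score(forward, f, c=1.0)
score_backward = additive_decoder_score(backward, f, c=1.0)
```

Both trajectories receive the same score, so an additive decoder cannot assign them to different contexts, while a sequence model over the trajectory can.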

Figure 1: Bidirectional LSTM architecture for VALOR decoder. Blue blocks are LSTM cells.

We implement VALOR with a recurrent architecture for the decoder (Fig. 1), using a bidirectional LSTM to make sure that both the beginning and end of a trajectory are equally important. We only use N = 11 equally spaced observations from the trajectory as inputs, for two reasons: 1) computational efficiency, and 2) to encode a heuristic that we are only interested in low-frequency behaviors (as opposed to information-dense high-frequency jitters). Lastly, taking inspiration from Vezhnevets et al. [2017], we only decode from the k-step transitions (deltas) in state space between the N observations. Intuitively, this corresponds to a prior that agents should move, as any two modes where the agent stands still in different poses will be indistinguishable to the decoder (because the deltas will be identically zero). We do not decode from transitions in VIC or DIAYN, although we note it would be possible and might be interesting future work.
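As a rough sketch of the input preprocessing described above, the function below picks N = 11 equally spaced observations from a trajectory and returns the deltas between consecutive picks. The indexing scheme (including both endpoints and rounding to the nearest timestep) and the name `decoder_inputs` are assumptions made for illustration; the recurrent decoder itself is not shown.

```python
import numpy as np

def decoder_inputs(states, n_obs=11):
    """Return the (n_obs - 1) state deltas between equally spaced observations.

    states: array of shape (T + 1, state_dim) holding s_0, ..., s_T.
    """
    states = np.asarray(states, dtype=float)
    # Equally spaced timestep indices from the first to the last state.
    idx = np.linspace(0, len(states) - 1, n_obs).round().astype(int)
    observations = states[idx]
    # k-step transitions (deltas) between consecutive selected observations.
    return np.diff(observations, axis=0)

# A trajectory moving at constant velocity yields constant deltas.
t = np.arange(101)[:, None]
moving = np.hstack([t, 2 * t])
deltas = decoder_inputs(moving)  # shape (10, 2), every row is [10, 20]
```

Two trajectories that hold different fixed poses both produce all-zero deltas, which is the indistinguishability noted in the text.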

3.3 Curriculum Approach

The standard approach for context distributions, used in VIC and DIAYN, is to have K discrete contexts with a uniform distribution: c ∼ Uniform(K). In our experiments, we found that this worked poorly for large K across all three algorithms we compared. Even with very large batches (to ensure that each context was sampled often enough to get a low-variance contribution to the gradient), training was challenging. We found a simple trick to resolve this issue: start training with small K (where learning is easy), and gradually increase it over time as the decoder gets stronger. Whenever E [log P_D(c|τ )] is high enough (we pick a fairly arbitrary threshold of P_D(c|τ ) ≈ 0.86), we increase K according to

$$K \leftarrow \min\big( \mathrm{int}(1.5 \times K + 1),\; K_{max} \big), \tag{5}$$

where K_max is a hyperparameter. As our experiments show, this curriculum leads to faster and more stable convergence.
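A minimal sketch of the curriculum step follows. The threshold test compares the batch-mean decoder log-probability against log(0.86), following the description above. The growth rule K ← min(int(1.5K + 1), K_max), the starting value K = 8, and the function and parameter names are assumptions of this sketch and should be checked against Eq. 5.

```python
import math

def update_num_contexts(K, mean_decoder_logp, K_max, threshold_prob=0.86, growth=1.5):
    """Grow the number of contexts when the decoder is accurate enough.

    mean_decoder_logp: batch estimate of E[log P_D(c|tau)].
    """
    if mean_decoder_logp >= math.log(threshold_prob):
        return min(int(growth * K + 1), K_max)
    return K

# Schedule produced if the threshold is met at every check (assumed start K = 8).
K, schedule = 8, [8]
while K < 64:
    K = update_num_contexts(K, math.log(0.9), K_max=64)
    schedule.append(K)
```

Contexts would then be sampled uniformly from the current K values until the next check.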

4 Experimental Setup

In our experiments, we try to answer the following questions:

What are best practices for training agents with variational option discovery algorithms (VALOR, VIC, DIAYN)? Does the curriculum learning approach help?
What are the qualitative results from running variational option discovery algorithms? Are the learned behaviors recognizably distinct to a human? Are there substantial differences between algorithms?
Are the learned behaviors useful for downstream control tasks?

Test environments: Our core comparison experiments are on a slate of locomotion environments: a custom 2D point agent, the HalfCheetah and Swimmer robots from the OpenAI Gym [Brockman et al., 2016], and a customized version of Ant from Gym where contact forces are omitted from the observations. We also tried running variational option discovery on two other interesting simulated robots: a dextrous hand (with S ∈ R⁴⁸ and A ∈ R²⁰, based on Plappert et al. [2018]), and a new complex humanoid environment we call ‘toddler’ (with S ∈ R³³⁵ and A ∈ R³⁵). Lastly, we investigated applicability to downstream tasks in a modified version of Ant-Maze (Frans et al. [2018]).

Implementation: We implement VALOR, VIC, and DIAYN with vanilla policy gradient as the RL algorithm (described in Appendix B.1). We note that VIC and DIAYN were originally implemented with different RL algorithms: Gregor et al. [2016] implemented VIC with tabular Q learning (Watkins and Dayan [1992]), and Eysenbach et al. [2018] implemented DIAYN with soft actor-critic (Haarnoja et al.). Also unlike prior work, we use recurrent neural network policy architectures. Because there is not a final objective function to measure whether an algorithm has achieved qualitative diversity of behaviors, our hyperparameters are based on what resulted in stable training, and kept constant across algorithms. Because the design space for these algorithms is very large and evaluation is to some degree subjective, we caution that our results should not necessarily be viewed as definitive.

Training techniques: We investigated two specific techniques for training: curriculum generation via Eq. 5, and context embeddings. On context embeddings: a natural approach for providing the integer context as input to a neural network policy is to convert the context to a one-hot vector and concatenate it with the state, as in Eysenbach et al. [2018]. Instead, we consider whether training is improved by allowing the agent to learn its own embedding vector for each context.
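The two ways of feeding an integer context to the policy can be contrasted as follows. This is only a sketch: the embedding dimension, the state dimension, combining the embedding with the state by concatenation, and the function names are assumptions, and in practice the embedding table would be a trainable parameter updated together with the policy.

```python
import numpy as np

def one_hot_policy_input(state, context, K):
    """Concatenate the state with a one-hot encoding of the integer context."""
    return np.concatenate([state, np.eye(K)[context]])

def embedded_policy_input(state, context, embedding_table):
    """Concatenate the state with a learned embedding row for the context."""
    return np.concatenate([state, embedding_table[context]])

rng = np.random.default_rng(0)
K, embed_dim, state_dim = 64, 8, 17           # hypothetical sizes
embedding_table = 0.1 * rng.standard_normal((K, embed_dim))  # trained in practice
state = np.zeros(state_dim)

x_one_hot = one_hot_policy_input(state, context=3, K=K)             # length 17 + 64
x_embedded = embedded_policy_input(state, 3, embedding_table)       # length 17 + 8
```

With an embedding table, the policy's input size does not grow with K, which is convenient when K changes during a curriculum.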

5 Results

Exploring Optimization Techniques: We present partial findings for our investigation of training techniques in Fig. 2 (showing results for just VALOR), with complete findings in Appendix C. In Fig. 2a, we compare performance with and without embeddings, using a uniform context distribution, for several choices of K (the number of contexts). We find that using embeddings consistently improves the speed and stability of training. Fig. 2a also illustrates that training with a uniform distribution becomes more challenging as K increases. In Figs. 2b and 2c, we show that agents with the curriculum trick and embeddings achieve mastery on K_max = 64 contexts substantially faster than the agents trained with uniform context distributions in Fig. 2a. As shown in Appendix C, these results are consistent across algorithms.

Figure 2: Studying optimization techniques with VALOR in HalfCheetah, showing performance (in (a) and (b), E[log P_D(c|τ)]; in (c), the value of K throughout the curriculum) vs. training iteration. (a) compares learning curves with and without context embeddings (solid vs. dotted, resp.), for K ∈ {8, 16, 32, 64}, with uniform context distributions. (b) compares curriculum (with K_max = 64) to uniform (with K = 64) context distributions, using embeddings for both. The dips for the curriculum curve indicate when K changes via Eq. 5; values of K are shown in (c). The dashed red line shows when K = K_max for the curriculum; after it, the curves for Uniform and Curriculum can be fairly compared. All curves are averaged over three random seeds.

Comparison Study of Qualitative Results: In our comparison, we tried to assess whether variational option discovery algorithms learn an interesting set of behaviors. This is subjective and hard to measure, so we restricted ourselves to testing for behaviors which are easy to quantify or observe; we note that there is substantial room in this space for developing performance metrics, and consider this an important avenue for future research.

We trained agents by VALOR, VIC, and DIAYN, with embeddings and K = 64 contexts, with and without the curriculum trick. We evaluated the learned behaviors by measuring the following quantities: final x-coordinate for Cheetah, final distance from origin for Swimmer, final distance from origin for Ant, and number of z-axis rotations for Ant³. We present partial findings in Fig. 3 and complete results in Appendix D. Our results confirm findings from prior work, including Eysenbach et al. [2018] and Florensa et al. [2017]: variational option discovery methods, in some MuJoCo environments, are able to find locomotion gaits that travel in a variety of speeds and directions. Results in Cheetah and Ant are particularly good by this measure; in Swimmer, fairly few behaviors actually travel any meaningful distance from the origin (> 3 units), but it happens non-negligibly often. All three algorithms produce similar results in the locomotion domains, although we do find slight differences: particularly, DIAYN is more prone than VALOR and VIC to learn behaviors like ‘attain target state,’ where the target state is fixed and unmoving. Our DIAYN behaviors are overall less mobile than the results reported by Eysenbach et al. [2018]; we believe that this is due to qualitative differences in how entropy is maximized by the underlying RL algorithms (soft actor-critic vs. entropy-regularized policy gradients).

We find that the curriculum approach does not appear to change the diversity of behaviors discovered in any large or consistent way. It appears to slightly increase the ranges for Cheetah x-coordinate, while slightly decreasing the ranges for Ant final distance. Scrutinizing the X-Y traces for all learned modes, it seems (subjectively) that the curriculum approach causes agents to move more erratically (see Appendices D.11–D.14). We do observe a particularly interesting effect for robustness: the curriculum approach makes the distribution of scores more consistent between random seeds (for performances of all seeds separately, see Appendices D.3–D.10).

We also attempted to perform a baseline comparison of all three variational option discovery methods against an approach where we used random reward functions in place of a learned decoder; however, we encountered substantial difficulties in optimizing with random rewards. The details of these experiments are given in Appendix E.

Hand and Toddler Environments: Optimizing in the Hand environment (Fig. 4f) was fairly easy and usually produced some naturalistic behaviors (e.g., pointing, bringing thumb and forefinger together, and one common rude gesture) as well as various unnatural behaviors (hand splayed out in what would be painful poses). Optimizing in the Toddler environment (Fig. 4g) was highly challenging; the agent frequently struggled to learn more than a handful of behaviors. The behaviors which the agent did learn were extremely unnatural. We believe that this is because of a fundamental limitation of purely information-theoretic RL objectives: humans have strong priors on what constitutes natural behavior, but for sufficiently complex systems, those behaviors form a set of measure zero in the space of all possible behaviors; when a purely information-theoretic objective function is used, it will give no preference to the behaviors humans consider natural.

Figure 3: Bar charts illustrating scores for behaviors in Cheetah, Swimmer, and Ant, with x-axis showing behavior ID and y-axis showing the score in log scale. Each red bar (width 1 on the x-axis) gives the average score for 5 trajectories conditioned on a single context; each chart is a composite from three random seeds, each of which was run with K = 64 contexts, for a total of 192 behaviors represented per chart. Behaviors were sorted in descending order by average score. Black bars show the standard deviation in score for a given behavior (context), and the upper-right corner of each chart shows the average decoder probability E[P_D(c|τ)].

Figure 4: Various figures for spotlight experiments. Figs. 4a and 4e show results from learning hundreds of behaviors in the Point env, with K_max = 1024. Fig. 4f shows that optimizing Eq. 2 in the Hand environment is quite easy with the curriculum approach; all agents master the K_max = 64 contexts in < 2000 iterations. Fig. 4g illustrates the challenge for variational option discovery in Toddler: after 15000 iterations, only K = 40 behaviors have been learned. Fig. 4d shows the Ant-Maze environment, where red obstacles prevent the ant from reaching the green goal. Fig. 4h shows performance in Ant-Maze for different choices of a low-level policy in a hierarchy; in the Random and VALOR experiments, the low-level policy receives no gradient updates.

Figure 5: Plots on the far left and far right show X-Y traces for behaviors learned by VALOR; in-between plots show the X-Y traces conditioned on interpolated contexts.

Learning Hundreds of Behaviors: Via the curriculum approach, we are able to train agents in the Point environment to learn hundreds of behaviors which are distinct according to the decoder (Fig. 4e). We caution that this does not necessarily expand the space of behaviors which are learnable—it may merely allow for increasingly fine-grained binning of already-learned behaviors into contexts. From various experiments prior to our final results, we developed an intuition that it was important to carefully consider the capacity of the decoder here: the greater the decoder’s capacity, the more easily it would overfit to undetectably-small differences in trajectories.

Mode Interpolation: We experimented with interpolating between context embeddings for point and ant policies to see if we could obtain interpolated behaviors. As shown in Fig. 5, we found that some reasonably smooth interpolations were possible. This suggests that even though only a discrete number of behaviors are trained, the training procedure learns general-purpose universal policies.
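One simple way to form interpolated contexts is a convex combination of two learned embedding vectors, which is then fed to the policy in place of a trained context embedding. The sketch below assumes linear interpolation and uses made-up embedding values; the function name is hypothetical and the exact interpolation scheme behind Fig. 5 is not specified here.

```python
import numpy as np

def interpolate_contexts(embed_a, embed_b, n_points=5):
    """Return n_points embeddings evenly spaced on the segment from embed_a to embed_b."""
    alphas = np.linspace(0.0, 1.0, n_points)[:, None]
    return (1.0 - alphas) * np.asarray(embed_a) + alphas * np.asarray(embed_b)

# Hypothetical learned embeddings for two contexts.
embed_a = np.array([1.0, 0.0, -2.0])
embed_b = np.array([0.0, 4.0, 2.0])
path = interpolate_contexts(embed_a, embed_b)  # rows 0 and 4 are the trained contexts
```

Rolling out the policy on each row of `path` would give a sequence of behaviors between the two trained modes.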

Downstream Tasks: We investigated whether behaviors learned by variational option discovery could be used for a downstream task by taking a policy trained with VALOR on the Ant robot (Uniform distribution, seed 10; see Appendix D.7), and using it as the lower level of a two-level hierarchical policy in Ant-Maze. We held the VALOR policy fixed throughout downstream training, and only trained the upper level policy, using A2C as the RL algorithm (with reinforcement occurring only at the lower level; the upper level actions were trained by signals backpropagated through the lower level). Results are shown in Fig. 4h. We compared the performance of the VALOR-based agent to three baselines: a hierarchical agent with the same architecture trained from scratch on Ant-Maze (‘Trained’ in Fig. 4h), a hierarchical agent with a fixed random network as the lower level (‘Random’ in Fig. 4h), and a non-hierarchical agent with the same architecture as the upper level in the hierarchical agents (an MLP with one hidden layer, ‘None’ in Fig. 4h). We found that the VALOR agent worked as well as the hierarchy trained from scratch and the non-hierarchical policy, with qualitatively similar learning curves for all three; the fixed random network performed quite poorly by comparison. This indicates that the space of options learned by (the particular run of) VALOR was at least as expressive as primitive actions, for the purposes of the task, and that VALOR options were more expressive than random networks here.

6 Conclusions

We performed a thorough empirical examination of variational option discovery techniques, and found they produce interesting behaviors in a variety of environments (such as Cheetah, Ant, and Hand), but can struggle in very high-dimensional control, as shown in the Toddler environment. From our mode interpolation and hierarchy experiments, we found evidence that the learned policies are universal in meaningful ways; however, we did not find clear evidence that hierarchies built on variational option discovery would outperform task-specific policies learned from scratch.

We found that with purely information-theoretic objectives, agents in complex environments will discover behaviors that encode the context in trivial ways, e.g., through tiling a narrow volume of the state space with contexts. Thus a key challenge for future variational option discovery algorithms is to make the decoder distinguish between trajectories in a way which corresponds with human intuition about meaningful differences.

Acknowledgments

Joshua Achiam is supported by TRUST (Team for Research in Ubiquitous Secure Technology) which receives support from NSF (award number CCF-0424422).

References

Joshua Achiam and Shankar Sastry. Surprise-Based Intrinsic Motivation for Deep Reinforcement Learning. mar 2017. URL http://arxiv.org/abs/1703.01732.

Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight Experience Replay. NIPS, 2017. URL http://arxiv.org/abs/1707.01495.

Pierre-Luc Bacon, Jean Harb, and Doina Precup. The Option-Critic Architecture. AAAI, 2017.

Marc G. Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying Count-Based Exploration and Intrinsic Motivation. NIPS, jun 2016. URL http://arxiv.org/abs/1606.01868.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. 2016. URL http://arxiv.org/abs/1606.01540.

Yan Duan, Xi Chen, John Schulman, and Pieter Abbeel. Benchmarking Deep Reinforcement Learning for Continuous Control. ICML, 2016. URL http://arxiv.org/abs/1604.06778.

Harri Edwards, Yuri Burda, and Amos Storkey. Curiosity-driven Exploration by Bootstrapping Features, feb 2018. URL https://openreview.net/forum?id=S1gWUifW0b.

Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is All You Need: Learning Skills without a Reward Function. 2018. URL http://arxiv.org/abs/1802.06070.

Carlos Florensa, Yan Duan, and Pieter Abbeel. Stochastic Neural Networks for Hierarchical Reinforcement Learning. ICLR, pages 1–17, 2017.

Roy Fox, Sanjay Krishnan, Ion Stoica, and Ken Goldberg. Multi-Level Discovery of Deep Options. 2017. URL http://arxiv.org/abs/1703.08294.

Kevin Frans, Jonathan Ho, Xi Chen, Pieter Abbeel, and John Schulman. Meta Learning Shared Hierarchies. In ICLR, 2018. URL https://openreview.net/pdf?id=SyX0IeWAW.

Justin Fu, John Co-Reyes, and Sergey Levine. EX2: Exploration with Exemplar Models for Deep Reinforcement Learning. In NIPS, pages 2577–2587, 2017. URL https://papers.nips.cc/paper/6851-ex2-exploration-with-exemplar-models-for-deep-reinforcement-learning.

Karol Gregor, Danilo Rezende, and Daan Wierstra. Variational Intrinsic Control. pages 1–15, 2016.

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning With A Stochastic Actor. URL https://arxiv.org/pdf/1801.01290.pdf.

Karol Hausman, Jost Tobias Springenberg, Ziyu Wang, Nicolas Heess, and Martin Riedmiller. Learning an Embedding Space for Transferable Robot Skills. ICLR, 2018.

Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. ICLR, 2017. URL https://openreview.net/forum?id=Sy2fzU9gl.

Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. VIME: Variational Information Maximizing Exploration. NIPS, may 2016. URL http://arxiv.org/abs/1605.09674.

Diederik P. Kingma and Jimmy Lei Ba. Adam: A Method for Stochastic Optimization. International Conference on Learning Representations 2015, pages 1–15, 2015.

Diederik P Kingma and Max Welling. Auto-Encoding Variational Bayes. 2013. URL http://arxiv.org/abs/1312.6114.

Joel Lehman. Evolution through the Search for Novelty. PhD thesis, 2012. URL http://joellehman.com/lehman-dissertation.pdf.

Miao Liu, Marlos C. Machado, Gerald Tesauro, and Murray Campbell. The Eigenoption-Critic Framework. NIPS Hierarchical RL Workshop, 2017. URL http://arxiv.org/abs/1712.04065.

Marlos C Machado, Marc G Bellemare, and Michael Bowling. A Laplacian Framework for Option Discovery in Reinforcement Learning. 2017a.

Marlos C Machado, Clemens Rosenbaum, Xiaoxiao Guo, Miao Liu, Gerald Tesauro, and Murray Campbell. Eigenoption Discovery Through the Deep Successor Representation. pages 1–20, 2017b.

Elliot Meyerson, Joel Lehman, and Risto Miikkulainen. Learning Behavior Characterizations for Novelty Search. In GECCO, 2016. doi: 10.1145/2908812.2908929. URL ftp://www.cs.utexas.edu/pub/neural-nets/papers/meyerson.gecco16.pdf.

Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous Methods for Deep Reinforcement Learning. pages 1–28, 2016. URL http://arxiv.org/abs/1602.01783.

Georg Ostrovski, Marc G. Bellemare, Aaron van den Oord, and Remi Munos. Count-Based Exploration with Neural Density Models. ICML, mar 2017. URL http://arxiv.org/abs/1703.01310.

Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven Exploration by Self-supervised Prediction. In ICML, may 2017. URL http://arxiv.org/abs/1705.05363.

Matthias Plappert, Marcin Andrychowicz, Alex Ray, Bob Mcgrew, Bowen Baker, Glenn Powell, Jonas Schneider, Josh Tobin, Maciek Chociej, Peter Welinder, Vikash Kumar, and Wojciech Zaremba. Multi-Goal Reinforcement Learning: Challenging Robotics Environments and Request for Research. 2018. URL https://arxiv.org/pdf/1802.09464.pdf.

Doina Precup. Temporal Abstraction in Reinforcement Learning. PhD Thesis, University of Massachusetts, 2000.

Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal Value Function Approximators. Proceedings of The 32nd International Conference on Machine Learning, pages 1312–1320, 2015. ISSN 1938-7228. URL http://jmlr.org/proceedings/papers/v37/schaul15.html.

Bradly C. Stadie, Sergey Levine, and Pieter Abbeel. Incentivizing Exploration In Reinforcement Learning With Deep Predictive Models. jul 2015. URL http://arxiv.org/abs/1507.00814.

Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning. Artificial Intelligence, 112, 1999.

Valentin Thomas, Jules Pondard, Emmanuel Bengio, Marc Sarfati, Philippe Beaudoin, Marie-Jean Meurs, Joelle Pineau, Doina Precup, and Yoshua Bengio. Independently Controllable Factors. pages 1–13, 2017.

Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. FeUdal Networks for Hierarchical Reinforcement Learning. (1), 2017. ISSN 1938-7228. URL http://arxiv.org/abs/1703.01161.

Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992. ISSN 0885-6125. doi: 10.1007/BF00992698. URL http://link.springer.com/10.1007/BF00992698.

A VAE-Equivalence Proof

The KL-divergence of P(τ | π, c) from P(τ | π₀) is

The first term is our entropy regularization term. The second term, for a uniform random policy π₀, is a constant independent of π (as long as T is the same for all episodes) and can thus be removed from the objective function without changing the optimization problem.
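The displayed decomposition is not reproduced in this copy. A plausible reconstruction is sketched below, under two assumptions that are not stated explicitly here: trajectories have a fixed length T with time steps indexed t = 0, ..., T−1, and both trajectory distributions share the same initial-state distribution and transition dynamics, so that those factors cancel in the ratio.

```latex
\begin{align*}
D_{KL}\big(P(\tau \mid \pi, c) \,\|\, P(\tau \mid \pi_0)\big)
&= \mathbb{E}_{\tau \sim \pi, c}\left[ \log \frac{P(\tau \mid \pi, c)}{P(\tau \mid \pi_0)} \right] \\
&= \mathbb{E}_{\tau \sim \pi, c}\left[ \sum_{t=0}^{T-1} \log \frac{\pi(a_t \mid s_t, c)}{\pi_0(a_t \mid s_t)} \right] \\
&= -\,\mathbb{E}_{\tau \sim \pi, c}\left[ \sum_{t=0}^{T-1} H\big(\pi(\cdot \mid s_t, c)\big) \right]
   \;-\; \mathbb{E}_{\tau \sim \pi, c}\left[ \sum_{t=0}^{T-1} \log \pi_0(a_t \mid s_t) \right].
\end{align*}
```

Under this reading, the first term is the negative expected policy entropy along the trajectory. For a uniform π₀ over a bounded action set, log π₀(a_t | s_t) takes the same value for every action, so the second term equals T times a constant.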

B Implementation Details

B.1 Policy Optimization Algorithm

In this section, we will describe how we performed policy optimization for our experiments. We used vanilla policy gradient to optimize the reinforcement objective for all three variational option discovery algorithms,

although details varied slightly between algorithms and environments. The variation between environments was due to the presence or absence of extrinsic rewards. In all environments except for Ant, there were no extrinsic rewards; however, in Ant, a small penalty was applied for falling over (as opposed to terminating the episode when the agent falls over, as in Eysenbach et al. [2018]).

For VALOR and VIC, the advantage function was:

where the normalize function subtracts out the batch mean and divides by the batch standard deviation, and V_ψ was a learned value function baseline. V_ψ(s_t, c) was learned by taking one gradient descent step on

per iteration.
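The normalize operation described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' code: the function name `normalize`, the `eps` guard against a zero standard deviation, and the example values are assumptions.

```python
import numpy as np


def normalize(x, eps=1e-8):
    """Subtract the batch mean and divide by the batch standard deviation.

    `eps` is an assumed safeguard against division by zero when every
    entry in the batch is identical.
    """
    x = np.asarray(x, dtype=np.float64)
    return (x - x.mean()) / (x.std() + eps)


# Hypothetical batch of per-step learning signals.
batch = np.array([1.0, 2.0, 3.0, 4.0])
normalized = normalize(batch)
```

After normalization the batch has approximately zero mean and unit standard deviation, which keeps the scale of the policy gradient comparable across iterations.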

For DIAYN, the advantage function was:

where V_ψ(s_t, c) was learned by descending on

When computing the gradient of the entropy term, we made an approximation that ignored the role of π in the distribution over trajectories:

resulting in the same entropy regularization as in Mnih et al. [2016]. Following practices for vanilla policy gradient established in Duan et al. [2016], we used the Adam optimizer (Kingma and Ba [2015]).
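The approximation itself is not reproduced in this copy. One form consistent with the description, given here as an assumption, holds the distribution over visited states fixed when differentiating and keeps only the gradients of the per-state policy entropies. The symbol θ for the policy parameters is introduced here for illustration.

```latex
\nabla_\theta \, \mathbb{E}_{\tau \sim \pi, c}\left[ \sum_{t=0}^{T-1} H\big(\pi_\theta(\cdot \mid s_t, c)\big) \right]
\;\approx\;
\mathbb{E}_{\tau \sim \pi, c}\left[ \sum_{t=0}^{T-1} \nabla_\theta H\big(\pi_\theta(\cdot \mid s_t, c)\big) \right]
```

The left-hand side also depends on θ through the states the policy visits. Dropping that dependence leaves a per-state entropy bonus of the kind used in Mnih et al. [2016].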

B.2 Hyperparameters

For all variational option discovery algorithms, we used:

1000 paths per epoch for the policy gradient batch
γ = 0.97 as the discount factor
β = 1e-3 as the entropy regularization coefficient, where applicable (omitted for VIC)
1e-3 as the Adam learning rate
LSTM(64) followed by MLP(32) with tanh activations as the policy architecture
32 as the context embedding dimension (when using context embeddings)

For VALOR, the decoder was a bidirectional LSTM where the cell for each direction was of size 64. For VIC and DIAYN, the decoder was an MLP of size (180, 180).
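For readers reimplementing the experiments, the settings above can be gathered into one configuration object. The sketch below is one hypothetical way to do that. The class name, the field names, and the choice to represent "omitted for VIC" as an entropy coefficient of zero are assumptions made for illustration; only the numeric values come from the text.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class OptionDiscoveryConfig:
    """Hypothetical container for the hyperparameters listed in B.2."""
    paths_per_epoch: int = 1000          # policy gradient batch size, in paths
    discount: float = 0.97               # gamma
    entropy_coef: float = 1e-3           # beta (omitted for VIC)
    adam_lr: float = 1e-3                # Adam learning rate
    policy_lstm_size: int = 64           # LSTM(64)
    policy_mlp_size: int = 32            # followed by MLP(32)
    policy_activation: str = "tanh"
    context_embedding_dim: int = 32      # used only with context embeddings
    valor_decoder_lstm_size: int = 64    # per direction, bidirectional LSTM
    vic_diayn_decoder_hidden: Tuple[int, int] = (180, 180)


def config_for(algorithm: str) -> OptionDiscoveryConfig:
    """Return the shared settings, dropping the entropy bonus for VIC."""
    cfg = OptionDiscoveryConfig()
    if algorithm.upper() == "VIC":
        # Assumption: an omitted entropy term is modeled as a zero coefficient.
        return replace(cfg, entropy_coef=0.0)
    return cfg


valor_cfg = config_for("VALOR")
vic_cfg = config_for("VIC")
```

A frozen dataclass prevents a run from silently mutating shared defaults, which makes it easier to compare algorithms under otherwise identical settings.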

C Additional Analysis for Best Practices

Figure 6: Analysis for understanding best training practices for various algorithms with HalfCheetah as the environment. The x-axis is number of training iterations, and in (a) and (b), the y-axis is E[log *P_D*(c|τ )]; in (c), the y-axis gives the current value of K in the curriculum. (a) shows a direct comparison between learning curves with (dark) and without (dotted) context embeddings, for K ∈ {8, 16, 32, 64}. (b) shows learning performance for the curriculum approach with *K_max* = 64, compared against the uniform distribution approach with K = 64: the spikes and dips for the curriculum curve are characteristic of points when K changes according to Eq. 5. The dashed red line shows when K = *K_max* for the curriculum approach; prior to it, the curves for Uniform and Curriculum are not directly comparable, but after it, they are. (c) shows K for the curriculum approach throughout the runs from (b). All curves are averaged over three random seeds.

D Complete Experimental Results for Comparison Study

D.1 Guide to Reading This Section

In this section we present the results from our core comparison of {VALOR, VIC, DIAYN} × {Uniform, Curriculum}. Because these algorithms perform unsupervised behavior discovery, analyzing our results is highly challenging: there is no single, quantitative measure by which to compare the algorithms. We choose to examine our results in a variety of ways:

Learning curves for the optimization objective.
Bar charts and histograms to show scores for the learned behaviors. Specifically, we evaluate final x-coordinate in the Cheetah environment, final distance traveled in the Swimmer environment, final distance traveled in the Ant environment, and number of z-axis rotations in the Ant environment. Scores are evaluated on trajectories of length T = 1000 steps, even though agents are trained on trajectories with T = 250; we find that using longer horizons at test time clarifies the differences between behaviors.
X-Y traces for agent trajectories in the Point and Ant environments. (X-Y traces for the center-of-mass in Swimmer are not very insightful: Swimmer behavior is highly oscillatory and so it is difficult to discern what is happening.)
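The behavior scores listed above can be computed from logged trajectories along the following lines. This is a hedged sketch, not the evaluation code used for the figures. It assumes that positions are logged as an array of shape (T, 2) holding x-y coordinates, that a yaw angle in radians is available for the Ant, and that rotations are counted as the net signed number of full turns. The function names are hypothetical.

```python
import numpy as np


def final_x_coordinate(xy):
    """Final x-coordinate of a trajectory of x-y positions, shape (T, 2)."""
    xy = np.asarray(xy, dtype=np.float64)
    return float(xy[-1, 0])


def final_distance_from_start(xy):
    """Euclidean distance between the last and first logged positions."""
    xy = np.asarray(xy, dtype=np.float64)
    return float(np.linalg.norm(xy[-1] - xy[0]))


def net_z_axis_rotations(yaw):
    """Net signed number of full turns, from yaw angles in radians.

    np.unwrap removes the 2*pi jumps that appear when angles are logged
    in a wrapped range such as [-pi, pi).
    """
    yaw = np.unwrap(np.asarray(yaw, dtype=np.float64))
    return float((yaw[-1] - yaw[0]) / (2.0 * np.pi))


# Hypothetical logged data.
path = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]])
true_yaw = np.linspace(0.0, 4.0 * np.pi, 9)               # two full turns
wrapped_yaw = (true_yaw + np.pi) % (2.0 * np.pi) - np.pi  # as a sensor might log it

x_final = final_x_coordinate(path)
dist_final = final_distance_from_start(path)
turns = net_z_axis_rotations(wrapped_yaw)
```

Averaging such scores over several rollouts per context, and recording their standard deviation, gives the per-behavior quantities of the kind shown in the bar charts.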

Regarding the bar charts and histograms in subsections D.3–D.10:

The bar charts are arranged in nearly the same way as the charts in Fig. 3: the x-axis is behavior ID, and the y-axis shows score in log scale for that behavior. The black bars show standard deviations for behavior scores.
The histograms show score on the x-axis, and number of behaviors that fall into a given bin on the y-axis in log scale.
The charts for ‘all’ show the composite bars for all behaviors from seeds 0, 10, and 20. The ‘s0’, ‘s10’, and ‘s20’ charts show behaviors from particular random seeds. Each single seed corresponds to a single policy with K = 64 behaviors.

Regarding the X-Y traces in subsections D.11–D.14:

In the Point traces, the ranges for x and y are x ∈ [−1.3, 1.3] and y ∈ [−1.3, 1.3].
In the Ant traces, the ranges for x and y are x ∈ [−15, 15] and y ∈ [−15, 15].
For the Point environment, traces are taken from trajectories with the same time horizon as training (T = 65); for the Ant environment, we use the T = 1000 trajectories.

D.2 Learning Curves

Figure 7: Learning curves for all algorithms and environments in our core comparison, for number of contexts K = 64. The curriculum trick tends to speed up and stabilize performance, except for DIAYN and VIC in the Point environment.

D.3 Evaluating Learned Behaviors: Cheetah, Uniform Context Distribution

Figure 8: Final x-coordinate in the Cheetah environment.

D.4 Evaluating Learned Behaviors: Cheetah, Curriculum Context Distribution

Figure 9: Final x-coordinate in the Cheetah environment.

D.5 Evaluating Learned Behaviors: Swimmer, Uniform Context Distribution

Figure 10: Final distance from origin in the Swimmer environment.

D.6 Evaluating Learned Behaviors: Swimmer, Curriculum Context Distribution

Figure 11: Final distance from origin in the Swimmer environment.

D.7 Evaluating Learned Behaviors: Ant (Distance), Uniform Context Distribution

Figure 12: Final distance from origin in the Ant environment.

D.8 Evaluating Learned Behaviors: Ant (Distance), Curriculum Context Distribution

Figure 13: Final distance from origin in the Ant environment.

D.9 Evaluating Learned Behaviors: Ant (Rotations), Uniform Context Distribution

Figure 14: Number of z-axis rotations in the Ant environment.

D.10 Evaluating Learned Behaviors: Ant (Rotations), Curriculum Context Distribution

Figure 15: Number of z-axis rotations in the Ant environment.

D.11 Point Environment, Uniform Context Distribution, XY-Traces

Figure 16: Learned behaviors in the Point environment with uniform context distributions. Each sub-plot shows X-Y traces for five trajectories conditioned on the same context (because the learned behaviors are highly repeatable, most traces almost entirely overlap). All traces for an algorithm come from a single policy which was trained with K = 64 contexts.

D.12 Point Environment, Curriculum Context Distribution, XY-Traces

Figure 17: Learned behaviors in the Point environment with the curriculum trick. Each sub-plot shows X-Y traces for five trajectories conditioned on the same context (because the learned behaviors are highly repeatable, most traces almost entirely overlap). All traces for an algorithm come from a single policy which was trained with *K_max* = 64 contexts. Where a blank sub-plot appears, the agent was never trained on that context (K was less than *K_max* at the end of 5000 iterations of training).

D.13 Ant Environment, Uniform Context Distribution, XY-Traces

Figure 18: Learned behaviors in the Ant environment with uniform context distributions. Each sub-plot shows X-Y traces for five trajectories conditioned on the same context (because the learned behaviors are highly repeatable, most traces almost entirely overlap). All traces for an algorithm come from a single policy which was trained with K = 64 contexts.

D.14 Ant Environment, Curriculum Context Distribution, XY-Traces

Figure 19: Learned behaviors in the Ant environment with the curriculum trick. Each sub-plot shows X-Y traces for five trajectories conditioned on the same context (because the learned behaviors are highly repeatable, most traces almost entirely overlap). All traces for an algorithm come from a single policy which was trained with *K_max* = 64 contexts. Where a blank sub-plot appears, the agent was never trained on that context (K was less than *K_max* at the end of 5000 iterations of training).

E Learning Multimodal Policies with Random Rewards

We considered a random reward baseline, where an agent acting under context c would receive a reward

where v_c was a random context-specific unit vector, obtained by sampling from N(0, I) and then normalizing. It seemed plausible that rewards of this form would do a good job of encoding human priors for robot behavior for the simple locomotion tasks in our core comparison. In practice, it turned out to be extremely challenging to train multimodal agents with these rewards; while it was somewhat easier to train unimodal agents with them, the behaviors that we observed were less interesting than expected. We present results from two sets of experiments:

RR1. a ceteris paribus analogue to our core comparison between variational option discovery algorithms, using all of the same hyperparameters (number of epochs, paths per epoch, number of contexts, the use of embeddings, learning rates, etc.), except with rewards from Eq. 6 instead of a learned decoder,

RR2. and a set of experiments where all else is equal except that the number of contexts is K = 1 instead of K = 64.

RR1 is a direct and fair comparison, while RR2 allows us to gain intuition for the behavior obtained by optimizing these random rewards separately from the challenges of multitask learning.
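The context-specific unit vectors v_c described at the start of this section (sample from N(0, I), then normalize) can be generated as in the sketch below. The function name, the seed handling, and the choice of a 2-dimensional vector are assumptions for illustration, since the dimensionality of v_c and the reward it enters are not reproduced here.

```python
import numpy as np


def sample_context_directions(num_contexts, dim, seed=0):
    """Draw one random unit vector per context.

    Each row is sampled from a standard normal N(0, I) and rescaled to
    unit Euclidean norm, which yields directions distributed uniformly
    on the unit sphere.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal((num_contexts, dim))
    return v / np.linalg.norm(v, axis=1, keepdims=True)


# K = 64 contexts as in RR1; dim = 2 is a hypothetical choice.
directions = sample_context_directions(num_contexts=64, dim=2)
```

Setting `num_contexts=1` gives the single-reward setting of RR2.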

E.1 Results from RR1

The results in Cheetah (Fig. 20) look reasonable in composite, but are weak for individual random seeds: in each seed, the results are nearly bimodal, with one mode learning to run forward at some speed, and the other mode learning to run backwards at another speed. In Swimmer (Fig. 21), this form of random rewards inspires almost no motion. Results in the Ant environment (Figs. 22, 23) show extreme variability: no individual behavior was consistent with respect to the score functions we used (the black bars, representing standard deviation, are very large for every behavior).

Figure 20: Final x-coordinate in the Cheetah environment for random rewards.

E.2 Results from RR2

We found no significant difference in quality of learned behaviors between the multimodal policies in RR1 and the unimodal policies in RR2, as shown in Fig. 24. That is, training with a single random reward function, instead of several at once, did not result in useful or consistent behavior as measured by our score functions.

E.3 Discussion

Our conclusion is that random rewards based on Eq. 6 do not result in interesting behavior in the environments we considered. However, there may exist a functional form for random rewards which performs better.

Variational Option Discovery Algorithms