**Bradly C. Stadie∗**

UC Berkeley

**Ge Yang**∗

University of Chicago

**Rein Houthooft**

OpenAI

**Xi Chen**

Covariant.ai

**Yan Duan**

Covariant.ai

**Yuhuai Wu**

University of Toronto

**Pieter Abbeel**

UC Berkeley

**Ilya Sutskever**

OpenAI

# Abstract

We interpret meta-reinforcement learning as the problem of learning how to quickly find a good sampling distribution in a new environment. This interpretation leads to the development of two new meta-reinforcement learning algorithms: E-MAML and E-RL^{2}. Results are presented on a new environment we call ‘Krazy World’: a difficult high-dimensional gridworld which is designed to highlight the importance of correctly differentiating through sampling distributions in meta-reinforcement learning. Further results are presented on a set of maze environments. We show E-MAML and E-RL^{2} deliver better performance than baseline algorithms on both tasks.

# 1 Introduction

Reinforcement learning can be thought of as a procedure wherein an agent bias its sampling process towards areas with higher rewards. This sampling process is embodied as the policy *π*, which is responsible for outputting an action *a *conditioned on past environmental states {*s*}. Such action affects changes in the distribution of the next state *s*^{f} ∼ *T *(*s, a*). As a result, it is natural to identify the policy *π *with a sampling distribution over the state space.

This perspective highlights a key difference between reinforcement learning and supervised learning: In supervised learning, the data is sampled from a fixed set. The i.i.d. assumption implies that the model does not affect the underlying distribution. In reinforcement learning, however, the very goal is to learn a policy *π*(*a*|*s*) that can manipulate the sampled states *P _{π}*(

*s*) to the agent’s advantage.

This property of RL algorithms to affect their own data distribution during the learning process is particularly salient in the field of meta-reinforcement learning. Meta RL goes by many different names: learning to learn, multi-task learning, lifelong learning, transfer learning, etc [25, 26, 22, 21, 23, 12, 24, 39, 37]. The goal, however, is usually the same–we wish to train the agents to learn transferable knowledge that helps it generalize to new situations. The most straightforward way to tackle this problem is to explicitly optimize the agent to deliver good performance after some adaptation step. Typically, this adaptation step will take the agent’s prior and use it to update its current policy to fit its current environment.

This problem definition induces an interesting consequence: during meta-learning, we are no longer under the obligation to optimize for maximal reward during training. Instead, we can optimize for a sampling process that maximally informs the meta-learner how it should adapt to new environments. In the context of gradient based algorithms, this means that one principled approach for learning an optimal sampling strategy is to differentiate the meta RL agent’s per-task sampling process with respect to the goal of maximizing the reward attained by the agent post-adaptation. To the best of our knowledge, such a scheme is hitherto unexplored.

In this paper, we derive an algorithm for gradient-based meta-learning that explicitly optimizes the per- task sampling distributions during adaptation with respect to the expected future returns produced by it post-adaptation self. This algorithm is closely related to the recently proposed MAML algorithm [7]. For reasons that will become clear later, we call this algorithm E-MAML. Inspired by this algorithm, we develop a less principled extension of RL^{2} that we call E-RL^{2}. To demonstrate that this method learns more transferable structures for meta adaptation, we propose a new, high-dimensional, and dynamically-changing set of tasks over a domain we call “Krazy-World”. “Krazy-World” requires the agent to learn high-level structures, and is much more challenging for state-of-the-art RL algorithms than simple locomotion tasks.^{2}.

We show that our algorithms outperform baselines on “Krazy-World”. To verify we are not over-fitting to this environment, we also present results on a set of maze environments.

# 2 Preliminaries

**Reinforcement Learning Notation: **Let *M *= (S*, *A*, *P*, R, ρ*_{0}*, γ, T *) represent a discrete-time finite-horizon discounted Markov decision process (MDP). The elements of *M *have the following definitions: S is a state set, A an action set, P : S ×A×S → R_{+} a transition probability distribution, *R *: S × A → R a reward function, *ρ*_{0} : S → R_{+} an initial state distribution, *γ *∈ [0*, *1] a discount factor, and *T *is the horizon. Occasionally we use a loss function L(*s*) = −*R*(*s*) rather than the reward *R*. In a classical reinforcement learning setting, we optimize to obtain a set of parameters *θ *which maximize the expected discounted return under the policy *π _{θ} *: S × A → R

_{+}. That is, we optimize to obtain

*θ*that maximizes

and

**The Meta Reinforcement Learning Objective: **In meta reinforcement learning, we consider a family of MDPs M = {*M _{i}*}

*which comprise a distribution of tasks. The goal of meta RL is to find a policy*

^{N}*i*=1*π*and paired update method

_{θ}*U*such that, when

*M*∼ M is sampled,

_{i}*π*

_{U}_{(}

_{θ}_{)}solves

*M*quickly. The word

_{i}*quickly*is key: By quickly, we mean orders of magnitude more sample efficient than simply solving

*M*with policy gradient or value iteration methods from scratch. For example, in an environment where policy gradients require over 100,000 samples to produce good returns, an ideal meta RL algorithm should solve these tasks by collecting less than 10 trajectories. The assumption is that, if an algorithm can solve a problem with so few samples, then it might be ‘learning to learn.’ That is, the agent is not learning how to master a particular task but rather how to quickly adapt to new ones. The objective can be written cleanly as

_{i}This objective is similar to the one that appears in MAML [7], which we will discuss further below. In MAML, *U *is chosen to be the stochastic gradient descent operator parameterized by the task.

# 3 Problem Statement and Algorithms

## 3.1 Fixing the Sampling Problem with E-MAML

We can expand the expectation from (1) into the integral form

It is true that the objective (1) can be optimized by taking a derivative of this integral with respect to *θ *and carrying out a standard REINFORCE style analysis to obtain a tractable expression for the gradient [40]. However, this decision might be sub-optimal.

Our key insight is to recall the sampling process interpretation of RL. In this interpretation, the policy *π _{θ} *implicitly defines a sampling process over the state space. Under this interpretation, meta RL tries to learn a strategy for quickly generating good per-task sampling distributions. For this learning process to work, it needs to receive a signal from each per-task sampling distribution which measures its propensity to positively impact the meta-learning process. Such a term does not make an appearance when directly optimizing (1). Put more succinctly,

**directly optimizing (1) will not account for the impact of the original sampling distribution**

*π*

_{θ}**on the future rewards**

*R*(

*τ*)

*, τ*∼

*π*

_{U}_{(}

_{θ,τ}_{¯)}. Concretely, we would like to account for the fact that the samples

*τ*¯ drawn under

*π*will impact the final returns

_{θ}*R*(

*τ*) by influencing the initial update

*U*(

*θ, τ*¯). Making this change will allow initial samples

*τ*¯ ∼

*π*to be reinforced by the expected future returns after the sampling update

_{θ}*R*(

*τ*). Under this scheme, the initial samples

*τ*¯ are encouraged to cover the state space enough to ensure that the update

*U*(

*θ, τ*¯) is maximally effective.

Including this dependency can be done by writing the modified expectation as

This provides an expression for computing the gradient which correctly accounts for the dependence on the initial sampling distribution.

We now find ourselves wishing to find a tractable expression for the gradient of (3). This can be done quite smoothly by applying the product rule under the integral sign and going through the REINFORCE style derivation twice to arrive at a two term expression

The term on the left is precisely the original MAML algorithm [7]. This term encourages the agent to take update steps *U *that achieve good final rewards. The second term encourages the agent to take actions such that the eventual meta-update yields good rewards (crucially, it does not try and exploit the reward signal under its own trajectory *τ*¯). In our original derivation of this algorithm, we felt this term would afford the the policy the opportunity to be more exploratory, as it will attempt to deliver the maximal amount of information useful for the future rewards *R*(*τ *) without worrying about its own rewards *R*(*τ*¯). Since we felt this algorithm augments MAML by adding in an exploratory term, we called it E-MAML. At present, the validity of this interpretation remains an open question.

For the experiments presented in this paper, we will assume that the operator *U *that is utilized in MAML and E-MAML is stochastic gradient descent. However, many other interesting choices exist.

_{3.2 }Choices for the Update Operator *U *, and the Exploration Policy *π*_{θ}

_{θ}

When only a single iteration of inner policy gradient occurrs, the initial policy *π _{θ}*0 is entirely responsible for the exploratory sampling. The best exploration would be a policy

*π*0 that can generate the most informative samples for identifying the task. However if more than 1 iteration of inner policy gradient occurs, then some of the exploratory behavior can also be attributed to the update operation U. There is a non-trivial interplay between the initial policy and the subsequent policy updates, especially when exploring the environment would take a few different policies.

_{θ}## 3.3 Understanding the E-MAML Update

A standard reinforcement learning objective can be represented by the stochastic computation graph

[29] as in Figure.2a, where the loss is computed w.r.t. the policy parameter*θ*using estimator Eq.1. For clarity, we can use a short-hand notation

*U*as in

*θ*

^{f}=

*U*(

*θ*) to represent this graph (Fig.2b).

Figure 1: E-MAML

As described in [1], we can then write down the surrogate objective of E-MAML as the graph in Fig.1c whereas model-agnostic meta-learning (MAML) treats the graph as if the first policy gradient update is deterministic. It is important to note that one can in fact smoothly tune between the stochastic treatment of *U *(Fig.1c) and the non-stochastic treatment of *U *(Fig.1d) by choosing the hyperparameter *λ*. In this regard, E-MAML considers the sampling distribution *π _{θ}I *as an extra, exploration friendly term that one could add to the standard meta-RL objective used in MAML.

Figure 2: Computation graphs for **(a) **REINFORCE/Policy Gradient, where *θ *is the policy parameter. *π *is the policy function. During learning, the policy *π _{θ} *interacts with the environment, parameterized as the transition function

*T*and the reward function

*R*. L is the proximal policy optimization objective, which allows making multiple gradient steps with the same trajectory sample. Only 1-meta gradient updated is used in these experiments.

**(b)**shows the short-hand we use to represent (a) in (c) and (d). Note that circle ◦ represents stochastic nodes, and □ represents deterministic nodes. The policy gradient update sub-graph

*U*is stochastic, which is where here in (b) we have a circle.

**(c)**the inner policy gradient sub-graph and the policy gradient meta-loss. The original parameter

*θ*gets updated to

*θ*

^{f}, which is then evaluated by the outer proximal policy optimization objective. E-MAML treats the inner policy gradient update as an stochastic node. Whereas

**(d)**MAML treats this as a deterministic node, thus neglecting the causal entropy term of the inner update operator

*U*.

^{3.4 }E-RL^{2}

RL^{2} optimizes (1) by feeding multiple rollouts from multiple different MDPs into an RNN. The hope is that the RNN hidden state update *h _{t} *=

*C*(

*x*

_{t}, h_{t}_{−1}), will learn to act as the update function

*U*. Then, performing a policy gradient update on the RNN will correspond to optimizing the meta objective (1).

We can write the RL^{2} update rule more explicitly in the following way. Suppose *L *represents an RNN. Let Env* _{k}*(

*a*) be a function that takes an action, uses it to interact with the MDP representing task

*k*, and returns the next observation

*o*, reward

^{3}*r*, and a termination flag

*d*. Then we have

To train this RNN, we sample *N *MDPs from M and obtain *k *rollouts for each MDP by running the MDP through the RNN as above. We then compute a policy gradient update to move the RNN parameters in a direction which maximizes the returns over the *k *trials performed for each MDP.

Inspired by our derivation of E-MAML, we attempt to make RL^{2} better account for the impact of its initial sampling distribution on its final returns. However, we will take a slightly different approach. Call the rollouts that help account for the impact of this initial sampling distribution Explore-rollouts. Call rollouts that do not account for this dependence Exploit-rollouts. For each MDP *M _{i}*, we will sample

*p*Explore-rollouts and

*k*−

*p*Exploit-rollouts. During an Explore-rollout, the forward pass through the RNN will receive all information. However, during the backwards pass, the rewards contributed during Explore-rollouts will be set to zero. The graident computation will only get rewards provided by Exploit-Rollouts. For example, if there is one Explore-rollout and one Exploit-rollout, then we would proceed as follows. During the forward pass, the RNN will receive all information regarding rewards for both episodes. During the backwards pass, the returns will be computed as

Where *χ _{E} *is an indicator that returns 0 if the episode is an Explore-rollout and 1 if it is an Exploit- rollout. This return, and not the standard sum of discounted returns, is what is used to compute the policy gradient. The hope is that zeroing the return contributions from Explore-rollouts will encourage the RNN to account for the impact of casting a wider sampling distribution on the final meta-reward. That is, during Explore-rollouts the policy will take actions which may not lead to immediate rewards but rather to the RNN hidden weights that perform better system identification. This system identification will in turn lead to higher rewards in later episodes.

# 4 Experiments

## 4.1 Krazy World Environment

To test the importance of correctly differentiating through the sampling process in meta reinforcement learn- ing, we engineer a new environment known as Krazy World. To succeed at Krazy World, a successful meta learning agent will first need to identify and adapt to many different tile types, color palettes, and dynamics. This environment is challening even for state-of-the-art RL algorithms. In

this environment, it is essential that meta-updates account for the impact of the original sampling distribution on the final meta-updated reward. Without accounting for this impact, the agent will not receive the gradient of the per-task episodes with respect to the meta-update. But this is precisely the gradient that encourages the agent to quickly learn how to correctly identify parts of the environment. See the Appendix for a full description of the environment.

## 4.2 Mazes

A collection of maze environments. The agent is placed at a random square within the maze and must learn to navigate the twists and turns to reach the goal square. A good exploratory agent will spend some time learning the maze’s layout in a way that minimizes repetition of future mistakes. The mazes are not rendered, and consequently this task is done with state space only. The mazes are 20 × 20 squares large.

## 4.3 Results

In this section, we present the following experimental results.

- Learning curves on Krazy World and mazes.
- The gap between the agent’s initial performance on new environments and its performance after updating. A good meta learning algorithm will have a large gap after updating. A standard RL algorithm will have virtually no gap after only one update.
- Three experiments that examine the exploratory properties of our algorithms.

When plotting learning curves in Figure 4 and Figure 5, the Y axis is the reward obtained after training at test time on a set of 64 held-out test environments. The X axis is the total number of environmental time-steps the algorithm has used for training. Every time the environment advances forward by one step, this count increments by one. This is done to keep the time- scale consistent across meta-learning curves.

For Krazy World, learning curves are presented in Figure 4. E-MAML and E-RL^{2} have the best final performance. E-MAML has the steepest initial gains for the first 10 million time-steps. Since meta-learning algorithms are often very expensive, having a steep initial ascent is quite valuable. Around 14 million training steps, E- RL^{2} passes E-MAML for the best performance. By 25 million time-steps, E-RL^{2} has converged. MAML delivers comparable final performance to E-MAML. However, it takes it much longer to obtain this level of performance. Finally, RL^{2} has comparatively poor performance on this task and very high variance. When we manually examined the RL^{2} trajectories to figure out why, we saw the agent consistently finding a single goal square and then refusing to explore any further. The additional experiments presented below seem consistent with this finding.

Learning curves for mazes are presented in Figure 5. Here, the story is different than Krazy World. RL^{2} and E-RL^{2} both perform better than MAML and E-MAML. We suspect the reason for this is

that RNNs are able to leverage memory, which is more important in mazes than in Krazy World. This environment carries a penalty for hitting the wall, which MAML and E-MAML discover quickly. However, it takes E-RL^{2} and RL^{2} much longer to discover this penalty, resulting in worse performance at the beginning of training. MAML delivers worse final performance and typically only learns how to avoid hitting the wall. RL^{2} and E-MAML sporadically solve mazes. E-RL^{2} manages to solve many of the mazes.

When examining meta learning algorithms, one important metric is the update size after one learning episode. Our meta learning algorithms should have a large gap between their initial policy, which is largely exploratory, and their updated policy, which should often solve the problem entirely. For MAML, we look at the gap between the initial policy and the policy after one policy gradient step For RL^{2}, we look at the results after three exploratory episodes, which give the RNN hidden state *h *sufficient time to update. Note that three is the number of exploratory episodes we used during training as well. This metric shares similarities with the Jump Start metric considered in prior literature [35]. These gaps are presented in figure 6.

Finally, in Figure 7 we see three heuristic metrics desigend to measure a meta-learners capacity for system identification. First, we consider the fraction of tile types visited by the agent at test time. A good agent should visit and identify many different tile types. Second, we consider the number of times an agent visits a death tile at test time. Agents that are efficient at identification should visit this tile type exactly once and then avoid it. More naive agents will run into these tiles repeatedly, dying repetedly and instilling a sense of pity in onlookers. Finally, we look at how many goals the agent reaches. RL^{2} tends to visit fewer goals. Usually, it finds one goal and exploits it. Overall, our suggested algorithms achieve better performance under these metrics.

# 5 Related Work

This work builds depends upon recent advances in deep reinforcement learning. [15, 16, 13] allow for discrete control in complex environments directly from raw images. [28, 16, 27, 14], have allowed for high-dimensional continuous control in complex environments from raw state information.

It has been suggested that our algorithm is related to the exploration vs. exploitation dilemma. There exists a large body of RL work addressing this problem [10, 5, 11]. In practice, these methods are often not used due to difficulties with high-dimensional observations, difficulty in implementation on arbitrary domains, and lack of promising results. This resulted in most deep RL work utilizing epsilon greedy exploration [15], or perhaps a simple scheme like Boltzmann exploration [6]. As a result of these shortcomings, a variety of new approaches to exploration in deep RL have recently been suggested [34, 9, 31, 18, 4, 19, 17, 8, 23, 17, 8, 26, 25, 32, 33, 12, 24]. In spite of these numerous efforts, the problem of exploration in RL remains difficult.

Many of the problems in meta RL can alternatively be addressed with the field of *hierarchical reinforcement learning*. In hierarchical RL, a major focus is on learning primitives that can be reused and strung together. Frequently, these primitives will relate to better coverage over state visitation frequencies. Recent work in this direction includes [38, 2, 36, 20, 3, 39]. The primary reason we consider meta-learning over hierarchical RL is that we find hierarchical RL tends to focus more on defining specific architectures that should lead to hierarchical behavior, whereas meta learning instead attempts to directly optimize for these behaviors.

As for meta RL itself, the literature is spread out and goes by many different names. There exist relevant literature on life-long learning, learning to learn, continual learning, and multi-task learning [22, 21]. We encourage the reviewer to look at the review articles [30, 35, 37] and their citations. The work most similar to ours has focused on adding curiosity or on a free learning phrase during training. However, these approaches are still quite different because they focus on defining an intrinsic motivation signals. We only consider better utilization of the existing reward signal. Our algorithm makes an explicit connection between free learning phases and the its affect on meta-updates.

# 6 Closing Remarks

In this paper, we considered the importance of sampling in meta reinforcement learning. Two new algorithms were derived and their properties were analyzed. We showed that these algorithms tend to learn more quickly and cover more of their environment’s states during learning than existing algorithms. It is likely that future work in this area will focus on meta-learning a curiosity signal which is robust and transfers across tasks, or learning an explicit exploration policy. Another interesting avenue for future work is learning intrinsic rewards that communicate long-horizon goals, thus better justifying exploratory behavior. Perhaps this will enable meta agents which truly *want *to explore rather than being forced to explore by mathematical trickery in their objectives.

# 7 Acknowledgement

This work was supported in part by ONR PECASE N000141612723 and by AWS and GCE compute credits.

# Appendix A: Krazy-World

We find this environment challenging for even state-of-the- art RL algorithms. For each environment in the test set, we optimize for 5 million steps using the Q-Learning algorithm from Open-AI baselines. This exercise delivered a mean score of 1*.*2 per environment, well below the human baselines score of 2.7. The environment has the following challenging features:

**8** **different tile types**: ** Goal squares **provide +1 reward when retrieved. The agent reaching the goal does not cause the episode to terminate, and there can be multiple goals.

**will be skipped over in the direction the agent is transversing.**

*Ice squares***will kill the agent and end the episode.**

*Death squares***act as a wall, impeding the agent’s**

*Wall squares*movement. ** Lock squares **can only be passed once the agent has collected the corresponding key from a key square.

**transport the agent to a different teleporter square on the map.**

*Teleporter squares***provide the agent with additional energy. If the agent runs out of energy, it can no longer move. The agent proceeds normally across**

*Energy squares***.**

*normal squares***Ability to randomly swap color palette**: The color palette for the grid can be permuted randomly, changing the color that corresponds to each of the tile types. The agent will thus have to identify the new system to achieve a high score. Note that in representations of the gird wherein basis vectors are used rather than images to describe the state space, each basis vector corresponds to a tile type–permuting the colors corresponds to permuting the types of tiles these basis vectors represent. We prefer to use the basis vector representation in our experiments, as it is more sample efficient.

**Ability to randomly swap dynamics**: The game’s dynamics can be altered. The most naive alteration simply permutes the player’s inputs and corresponding actions (issuing the command for down moves the player up etc). More complex dynamics alterations allow the agent to move multiple steps at a time in arbitrary directions, making the movement more akin to that of chess pieces.

**Local or Global Observations**: The agent’s observation space can be set to some fixed number of squares around the agent, the squares in front of the agent, or the entire grid. Observations can be given as images or as a grid of basis vectors. For the case of basis vectors, each element of the grid is embedded as a basis vector that corresponds to that tile type. These embeddings are then concatenated together to form the observation proper. We will use local observations.

# Appendix B: Table of Hyperparameters

Table 1: Table of Hyperparameters: E-MAML

Hyperparameter | Value | Comments |

Parallel Samplers | 40 ∼ 128 | |

Batch Timesteps | 500 | |

Inner Algo | PPO / VPG / CPI | Much simpler than TRPO |

Meta Algo | PPO / VPG / CPI | |

Max Grad Norm | 0.9 ∼ 1.0 | improves stability |

PPO Clip Range | 0.2 | |

Gamma | 0 ∼ 1 | tunnable hyper parameter |

GAE Lambda | 0.997 | |

Alpha | 0.01 | Could be learned |

Beta | 1e-3 | 60 |

Vf Coeff | 0 | value baseline is disabled |

Ent Coeff | 1e-3 | improves training stability |

Inner Optimizer | SGD | |

Meta Optimizer | Adam | |

Inner Gradient Steps | 1 ∼ 20 | |

Simple Sampling | True/False | With PPO being the inner algorithm, one can reuse the same path sample for multiple gradient steps. |

Meta Gradient Steps | 1 ∼ 20 | Requires PPO when > 1 |

# Appendix C: Experiment Details

For both Krazy World and mazes, training proceeds in the following way. First, we initialize 32 training environments and 64 test environments. Every initialized environment has a different seed. Next, we initialize our chosen meta-RL algorithm by uniformly sampling hyper-parameters from predefined ranges. Data is then collected from all 32 training environments. The meta-RL algorithm then uses this data to make a meta-update, as detailed in the algorithm section of this paper. The meta-RL algorithm is then allowed to do one training step on the 64 test environments to see how fast it can train at test time. These test environment results are recorded, 32 new tasks are sampled from the training environments, and data collection begins again. For MAML and E-MAML, training at test time means performing one VPG update at test time. For RL^{2} and E-RL^{2}, this means running three exploratory episodes to allow the RNN memory weights time to update and then reporting the loss on the fourth and fifth episodes. For both algorithms, meta-updates are calculated with PPO [27]. The entire process from the above paragraph is repeated from the beginning 64 times and the results are averaged to provide the final learning curves featured in this paper.

# References

- [1] Anonymous.
**Promp: Proximal meta-policy search**.*ICLR Submission*, 2019. - [2] Bacon, P. and Precup, D.
**The option-critic architecture**.*In NIPS Deep RL Workshop*, 2015. - [3] Barto and Mahadevan.
**Recent advances in hierarchical reinforcement learning**.*Discrete Event Dynamic Systems*, 2003. - [4] Bellemare, M., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., and Munos, R.
**Unifying count based exploration and intrinsic motivation**.*NIPS*, 2016. - [5] Brafman, R. I. and Tennenholtz, M.
**R-max – a general polynomial time algorithm for near-optimal reinforcement learning**.*Journal of Machine Learning Research*, 2002. - [6] Carmel, D. and Markovitch, S.
**Exploration strategies for model-based learning in multi- agent systems**.*Autonomous Agents and Multi-Agent Systems*, 1999. - [7] Finn, C., Abbeel, P., and Levine, S.
**Model-agnostic metalearning for fast adaptation of deep networks**.*ICML*, 2017. - [8] Graziano, Vincent, Glasmachers, Tobias, Schaul, Tom, Pape, Leo, Cuccu, Giuseppe, Leitner, Jurgen, and Schmidhuber, Jürgen.
**Artificial Curiosity for Autonomous Space Exploration**.*Acta Futura*, 1(1):41–51, 2011. doi: 10.2420/AF04.2011.41. URL http://dx.doi.org/ 10.2420/AF04.2011.41. - [9] Houthooft, R., Chen, X., Duan, Y., Schulman, J., De Turck, F., and Abbeel, P.
**Vime: Varia- tional information maximizing exploration**.*NIPS*, 2016. - [10] Kearns, M. J. and Singh, S. P.
**Near-optimal reinforcement learning in polynomial time**.*Machine Learning*, 2002. - [11] Kolter, J. Z. and Ng, A. Y.
**Near-bayesian exploration in polynomial time**.*ICML*, 2009. - [12] Kompella, Stollenga, Luciw, and Schmidhuber.
**Exploring the predictable**.*Advances in Evolutionary Computing*, 2002. - [13] Koutník, Jan, Cuccu, Giuseppe, Schmidhuber, Jürgen, and Gomez, Faustino.
**Evolving large- scale neural networks for vision-based reinforcement learning**.*GECCO 2013*, 15:1061– 1068, 2013. - [14] Lillicrap, T., Hunt, J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D.
**Continuous control with deep reinforcement learning**.*arXiv:1509.02971*, 2015. - [15] Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A, Veness, Joel, Bellemare, Marc G, Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K, Ostrovski, Georg, et al.
**Human-level control through deep reinforcement learning**.*Nature*, 518(7540):529–533, 2015. - [16] Mnih, Volodymyr, Badia, Adria Puigdomenech, Mirza, Mehdi, Graves, Alex, Lillicrap, Timo- thy P, Harley, Tim, Silver, David, and Kavukcuoglu, Koray.
**Asynchronous methods for deep reinforcement learning**.*arXiv:1602.01783*, 2016. - [17] Ngo, Hung, Luciw, Matthew, Forster, Alexander, and Schmidhuber, Juergen.
**Learning skills from play: Artificial curiosity on a Katana robot arm**.*Proceedings of the International Joint Conference on Neural Networks*, pp. 10–15, 2012. ISSN 2161-4393. doi: 10.1109/IJCNN. 2012.6252824. - [18] Osband, I., Blundell, C., Pritzel, A., and Van Roy, B.
**Deep exploration via bootstrapped dqn**.*NIPS*, 2016. - [19] Ostrovski, G., Bellemare, M. G., Oord, A. v. d., and Munos, R.
**Count-based exploration with neural density models**.*arXiv:1703.01310*, 2017. - [20] Rusu, A. A., Rabinowitz, N. C., Desjardins, G., Soyer, H., Kirkpatrick, J., Kavukcuoglu, K., Pascanu, R., and Hadsell, R.
**Progressive neural networks**.*CoRR, vol. abs/1606.04671*, 2016. - [21] Schmidhuber.
**Evolutionary principles in self-referential learning, or on learning how to learn: The meta-meta-… hook**.*Diploma thesis, TUM*, 1987. - [22] Schmidhuber. Gödel machines: Fully self-referential optimal universal self-improvers.
*Artificial General Intelligence*, 2006. - [23] Schmidhuber, J.
**Continual curiosity-driven skill acquisition from high-dimensional video inputs for humanoid robots**.*Artificial Intelligence*, 2015. - [24] Schmidhuber, J., Zhao, J., and Schraudolph, N.
**Reinforcement learning with self-modifying policies**.*Learning to learn, Kluwer*, 1997. - [25] Schmidhuber, Juergen.
**On Learning to Think: Algorithmic Information Theory for Novel Combinations of Reinforcement Learning Controllers and Recurrent Neural World Mod- els**. pp. 1–36, 2015. ISSN 10450823. doi: 1511.09249v1. URL http://arxiv.org/abs/1511.09249.

- [26] Schmidhuber, Jürgen.
**A Possibility for Implementing Curiosity and Boredom in Model- Building Neural Controllers**.*Meyer, J.A. and Wilson, S.W. (eds) : From Animals to animats*, pp. 222–227, 1991. - [27] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O.
**Proximal****policy****optimiza-****tion algorithms**.*arXiv:1707.06347*, 2017. - [28] Schulman, John, Levine, Sergey, Moritz, Philipp, Jordan, Michael I., and Abbeel, Pieter.
**Trust region policy optimization**.*Arxiv preprint 1502.05477*, 2015. - [29] Schulman, John a/snd Heess, Nicolas, Weber, Theophane, and Abbeel, Pieter.
**Gradient esti- mation using stochastic computation graphs**. In*Advances in Neural Information Processing Systems*, pp. 3528–3536, 2015. - [30] Silver, Yand, and Li.
**Lifelong machine learning systems: Beyond learning algorithms**.*DAAAI Spring Symposium-Technical Report, 2013.*, 2013. - [31] Stadie, Bradly C., Levine, S., and Abbeel, P.
**Incentivizing exploration in reinforcement learning with deep predictive models**.*arXiv:1507.00814*, 2015. - [32] Storck, Jan, Hochreiter, Sepp, and Schmidhuber, Jürgen.
**Reinforcement driven information acquisition in non-deterministic environments**.*Proceedings of the International . . .*, 2: 159–164, 1995. - [33] Sun, Yi, Gomez, Faustino, and Schmidhuber, Jürgen.
**Planning to be surprised: Optimal Bayesian exploration in dynamic environments**.*Lecture Notes in Computer Science (includ- ing subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)*, 6830 LNAI:41–51, 2011. ISSN 03029743. doi: 10.1007/978-3-642-22887-2_5. - [34] Tang, H., Houthooft, R., Foote, D., Stooke, A., Chen, X., Duan, Y., Schulman, J., De Turck, F., and Abbeel, P.
**exploration: A study of count-based exploration for deep reinforcement learning.***arXiv:1611.04717*, 2016. - [35] Taylor and Stone.
**Transfer****learning****for****reinforcement****learning****domains:****A survey**.*DAAAI**Spring Symposium-Technical Report, 2013.*, 2009. - [36] Tessler, C. Givony, S., Zahavy, T., Mankowitz, D.J., and Mannor, S.
**A deep hierarchical approach to lifelong learning in minecraft**.*arXiv:1604.07255*, 2016. - [37] Thrun.
**Is learning the n-th thing any easier than learning the first?***NIPS*, 1996. - [38] Vezhnevets, A. S., Osindero, S., Schaul, T., Heess, N., Jaderberg, M., Silver, D., and Kavukcuoglu, K.
**Feudal networks for hierarchical reinforcement learning**.*arXiv preprint arXiv:1703.01161*, 2017. - [39] Wiering and Schmidhuber.
**Hq-learning**.*Adaptive Behavior*, 1997. - [40] Williams, Ronald J.
**Simple statistical gradient-following algorithms for connectionist re- inforcement learning**.*Machine Learning 8, 3-4 (1992), 229–256*, 1992.