
Gotta Learn Fast: A New Benchmark for Generalization in RL

Alex Nichol, Vicki Pfau, Christopher Hesse, Oleg Klimov, John Schulman


{alex, vickipfau, csh, oleg, joschu}


In this report, we present a new reinforcement learning (RL) benchmark based on the Sonic the Hedgehog™ video game franchise. This benchmark is intended to measure the performance of transfer learning and few-shot learning algorithms in the RL domain. We also present and evaluate some baseline algorithms on the new benchmark.

1 Motivation

In the past few years, it has become clear that deep reinforcement learning can solve difficult, high-dimensional problems when given a good reward function and unlimited time to interact with the environment. However, while this kind of learning is a key aspect of intelligence, it is not the only one. Ideally, intelligent agents would also be able to generalize between tasks, using prior experience to pick up new skills more quickly. In this report, we introduce a new benchmark that we designed to make it easier for researchers to develop and test RL algorithms with this kind of capability.

Most popular RL benchmarks such as the ALE [1] are not ideal for testing generalization between similar tasks. As a result, RL research tends to “train on the test set”, boasting an algorithm’s final performance on the same environment(s) it was trained on. For the field to advance towards algorithms with better generalization properties, we need RL benchmarks with proper splits between “train” and “test” environments, similar to supervised learning datasets. Our benchmark has such a split, making it ideal for measuring cross-task generalization.

One interesting application of cross-task generalization is few-shot learning. Recently, supervised few-shot learning algorithms have improved by leaps and bounds [2]–[4]. This progress has hinged on the availability of good meta-learning datasets such as Omniglot [5] and Mini-ImageNet [6]. Thus, if we want better few-shot RL algorithms, it makes sense to construct a similar kind of dataset for RL. Our benchmark is designed to be a meta-learning dataset, consisting of many similar tasks sampled from a single task distribution. Thus, it is a suitable test bed for few-shot RL algorithms.

Beyond few-shot learning, there are many other applications of cross-task generalization that require the right kind of benchmark. For example, you might want an RL algorithm to learn how to explore in new environments. Our benchmark poses a fairly challenging exploration problem, and the train/test split presents a unique opportunity to learn how to explore on some levels and transfer this ability to other levels.

2 Related Work

Our Gym Retro project, as detailed in Section 3.1, is related to both the Retro Learning Environment (RLE) [7] and the Arcade Learning Environment (ALE) [1]. Unlike these projects, however, Gym Retro aims to be flexible and easy to extend, making it straightforward to create a huge number of RL environments.

Our benchmark is related to other meta-learning datasets like Omniglot [5] and Mini-ImageNet [6]. In particular, our benchmark is intended to serve the same purpose for RL as datasets like Omniglot serve for supervised learning.

Our baselines in Section 4 explore the ability of RL algorithms to transfer between video game environments. Several prior works have reported positive transfer results in the video game setting:

  • Parisotto et al. [8] observed that pre-training on certain Atari games could increase a network’s learning speed on other Atari games.
  • Rusu et al. [9] proposed a new architecture for transfer learning called progressive networks, and showed that it could boost learning speed across a variety of previously unseen Atari games.
  • Pathak et al. [10] found that an exploratory agent trained on one level of Super Mario Bros. could be used to boost performance on two other levels.
  • Fernando et al. [11] found that their PathNet algorithm increased learning speed on average when transferring from one Atari game to another.
  • Higgins et al. [12] used an unsupervised vision objective to produce robust features for a policy, and found that this policy was able to transfer to previously unseen vision tasks in DeepMind Lab [13] and MuJoCo [14].

In previous literature on transfer learning in RL, there are two common evaluation techniques: evaluation on synthetic tasks, and evaluation on the ALE. The former evaluation technique is rather ad hoc and makes it hard to compare different algorithms, while the latter typically reveals fairly small gains in sample complexity. One problem with the ALE in particular is that all the games are quite different, meaning that it may not be possible to get large improvements from transfer learning.

Ideally, further research in transfer learning would be able to leverage a standardized benchmark that is difficult like the ALE but rich with similar environments like well-crafted synthetic tasks. We designed our proposed benchmark to satisfy both criteria.

3 The Sonic Benchmark

This section describes the Sonic benchmark in detail. Each subsection focuses on a different aspect of the benchmark, ranging from technical details to high-level design features.

3.1 Gym Retro

Underlying the Sonic benchmark is Gym Retro, a project aimed at creating RL environments from various emulated video games. At the core of Gym Retro is the gym-retro Python package, which exposes emulated games as Gym [15] environments. Like RLE [7], gym-retro uses the libretro API to interface with game emulators, making it very easy to add new emulators to gym-retro.

The gym-retro package includes a dataset of games. Each game in the dataset consists of a ROM, one or more save states, one or more scenarios, and a data file. Here are high-level descriptions of each of these components:

  • ROM – the data and code that make up a game; loaded by an emulator to play that game.
  • Save state – a snapshot of the console’s state at some point in the game. For example, a save state could be created for the beginning of each level.
  • Data file – a file describing where various pieces of information are stored in console memory. For example, a data file might indicate where the score is located.
  • Scenario – a description of done conditions and reward functions. A scenario file can reference fields from the data file.

3.2 The Sonic Video Game

Figure 1: Screenshots from Sonic 3 & Knuckles. Left: a situation where the player can be shot into the air by utilizing an object with lever-like dynamics (Mushroom Hill Zone, Act 2). Middle: a door that opens when the player jumps on a button (Hydrocity Zone, Act 1). Right: a swing that the player must jump from at exactly the right time to reach a high platform (Mushroom Hill Zone, Act 2).

In this benchmark, we use three similar games: Sonic The Hedgehog™, Sonic The Hedgehog™ 2, and Sonic 3 & Knuckles. All of these games have very similar rules and controls, although there are subtle differences between them (e.g. Sonic 3 & Knuckles includes some extra controls and characters). We use multiple games to get as many environments for our dataset as possible.

Each Sonic game is divided up into zones, and each zone is further divided up into acts. While the rules and overarching objective remain the same throughout the entire game, each zone has a unique set of textures and objects. Different acts within a zone tend to share these textures and objects, but differ in spatial layout. We will refer to a (ROM, zone, act) tuple as a “level”.

The Sonic games provide a rich set of challenges for the player. For example, some zones include platforms that the player must jump on in order to open doors. Other zones require the player to first jump on a lever to send a projectile into the air, then wait for the projectile to fall back on the lever to send the player over some sort of obstacle. One zone even has a swing that the player must jump off of at a precise time in order to launch Sonic up to a higher platform. Examples of these challenges are presented in Figure 1.

3.3 Games and Levels

Our benchmark consists of a total of 58 save states taken from three different games, where each of these save states has the player at the beginning of a different level. A number of acts from the original games were not used because they contained only boss fights or because they were not compatible with our reward function.

We split the test set by randomly choosing zones with more than one act and then randomly choosing an act from each selected zone. In this setup, the test set contains mostly objects and textures present in the training set, but with different layouts.
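The split procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used to produce the benchmark: the function name `split_levels`, the `num_test_zones` parameter, and the seed are all assumptions, and levels are represented as (game, zone, act) tuples.

```python
import random
from collections import defaultdict


def split_levels(levels, num_test_zones, seed=0):
    """Split (game, zone, act) levels into train and test lists.

    Hypothetical helper: picks zones that have more than one act, then
    moves one randomly chosen act from each picked zone to the test set.
    """
    rng = random.Random(seed)
    acts_by_zone = defaultdict(list)
    for game, zone, act in levels:
        acts_by_zone[(game, zone)].append(act)

    # Only multi-act zones are eligible, so every test zone keeps at
    # least one act (same textures and objects) in the training set.
    eligible = sorted(z for z, acts in acts_by_zone.items() if len(acts) > 1)
    test_zones = rng.sample(eligible, num_test_zones)

    test = [(g, z, rng.choice(sorted(acts_by_zone[(g, z)]))) for g, z in test_zones]
    held_out = set(test)
    train = [level for level in levels if level not in held_out]
    return train, test
```

Because only multi-act zones are eligible, each test level shares its zone with at least one training level, which is what keeps the test set's objects and textures mostly familiar.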

The test levels are listed in the following table:

Game | Zone | Act
Sonic The Hedgehog | SpringYardZone | 1
Sonic The Hedgehog | GreenHillZone | 2
Sonic The Hedgehog | StarLightZone | 3
Sonic The Hedgehog | ScrapBrainZone | 1
Sonic The Hedgehog 2 | MetropolisZone | 3
Sonic The Hedgehog 2 | HillTopZone | 2
Sonic The Hedgehog 2 | CasinoNightZone | 2
Sonic 3 & Knuckles | LavaReefZone | 1
Sonic 3 & Knuckles | FlyingBatteryZone | 2
Sonic 3 & Knuckles | HydrocityZone | 1
Sonic 3 & Knuckles | AngelIslandZone | 2

3.4 Frame Skip

The step() method on raw gym-retro environments progresses the game by roughly 1/60th of a second. However, following common practice for ALE environments, we require the use of a frame skip [16] of 4. Thus, from here on out, we will use timesteps as the main unit of measuring in-game time. With a frame skip of 4, a timestep represents roughly 1/15th of a second. We believe that this is more than enough temporal resolution to play Sonic well.

Moreover, since deterministic environments are often susceptible to trivial scripted solutions [17], we require the use of a stochastic “sticky frame skip”. Sticky frame skip adds a small amount of randomness to the actions taken by the agent; it does not directly alter observations or rewards.

Like standard frame skip, sticky frame skip applies n actions over 4n frames. However, for each action, we delay it by one frame with probability 0.25, applying the previous action for that frame instead. The following diagram shows an example of an action sequence with sticky frame skip:
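The same behaviour can also be sketched in code. The function below expands a sequence of agent actions into the per-frame actions the emulator would receive; it is a minimal sketch of the rule stated above, not the benchmark's implementation. The names `sticky_frame_actions`, `FRAME_SKIP`, and `STICK_PROB` are hypothetical, and the choice to never delay the very first action (which has no predecessor) is an assumption.

```python
import random

FRAME_SKIP = 4     # frames per timestep
STICK_PROB = 0.25  # probability that a new action is delayed by one frame


def sticky_frame_actions(actions, rng=None):
    """Expand n agent actions into the 4n actions applied frame by frame."""
    rng = rng or random.Random()
    frames = []
    previous = None
    for action in actions:
        # Assumption: the first action has no predecessor, so it is
        # never delayed. Later actions are delayed with probability 0.25.
        delayed = previous is not None and rng.random() < STICK_PROB
        for frame in range(FRAME_SKIP):
            if delayed and frame == 0:
                frames.append(previous)  # previous action "sticks" for one frame
            else:
                frames.append(action)
        previous = action
    return frames
```

For example, if the agent chooses LEFT and then RIGHT and the second action happens to be delayed, the emulator sees five frames of LEFT followed by three frames of RIGHT.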

3.5 Episode Boundaries

Experience in the game is divided up into episodes, which roughly correspond to lives. At the end of each episode, the environment is reset to its original save state. Episodes can end on three conditions:

  • The player completes a level successfully. In this benchmark, completing a level corresponds to passing a certain horizontal offset within the level.
  • The player loses a life.
  • 4500 timesteps have elapsed in the current episode. This amounts to roughly 5 minutes of in-game time.

The environment should only be reset if one of the aforementioned done conditions is met. Agents should not use special APIs to tell the environment to start a new episode early.

Note that our benchmark omits the boss fights that often take place at the end of a level. For levels with boss fights, our done condition is defined as a horizontal offset that the agent must reach before the boss fight. Although boss fights could be an interesting problem to solve, they are fairly different from the rest of the game. Thus, we chose not to include them so that we could focus more on exploration, navigation, and speed.

3.6 Observations

A gym-retro environment produces an observation at the beginning of every timestep. This observation is always a 24-bit RGB image, but the dimensions vary by game. For Sonic, the screen images are 320 pixels wide and 224 pixels tall.

3.7 Actions

At every timestep, an agent produces an action representing a combination of buttons on the game console. Actions are encoded as binary vectors, where 1 means “pressed” and 0 means “not pressed”. For Sega Genesis games, the action space contains the following buttons: B, A, MODE, START, UP, DOWN, LEFT, RIGHT, C, Y, X, Z.

A small subset of all possible button combinations makes sense in Sonic. In fact, there are only eight essential button combinations:

{{}, {LEFT}, {RIGHT}, {LEFT, DOWN}, {RIGHT, DOWN}, {DOWN}, {DOWN, B}, {B}}

The UP button is also useful on occasion, but for the most part it can be ignored.
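To make the encoding concrete, the sketch below converts the eight essential button combinations into binary vectors over the twelve Sega Genesis buttons, using the button order listed above. The names `BUTTONS`, `ESSENTIAL_COMBOS`, and `encode` are illustrative and not part of gym-retro.

```python
# Button order as listed in the text for Sega Genesis games.
BUTTONS = ["B", "A", "MODE", "START", "UP", "DOWN",
           "LEFT", "RIGHT", "C", "Y", "X", "Z"]

# The eight essential button combinations for Sonic.
ESSENTIAL_COMBOS = [
    [], ["LEFT"], ["RIGHT"], ["LEFT", "DOWN"],
    ["RIGHT", "DOWN"], ["DOWN"], ["DOWN", "B"], ["B"],
]


def encode(combo):
    """Return a 12-element binary vector: 1 = pressed, 0 = not pressed."""
    pressed = set(combo)
    unknown = pressed - set(BUTTONS)
    if unknown:
        raise ValueError(f"unknown buttons: {sorted(unknown)}")
    return [1 if button in pressed else 0 for button in BUTTONS]


ACTION_VECTORS = [encode(combo) for combo in ESSENTIAL_COMBOS]
```

A discrete-action agent can then choose an index into `ACTION_VECTORS` instead of working with the full space of 2^12 button combinations.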

3.8 Rewards

During an episode, agents are rewarded such that the cumulative reward at any point in time is proportional to the horizontal offset from the player’s initial position. Thus, going right always yields a positive reward, while going left always yields a negative reward. This reward function is consistent with our done condition, which is based on the horizontal offset in the level.

The reward consists of two components: a horizontal offset, and a completion bonus. The horizontal offset reward is normalized per level so that an agent’s total reward will be 9000 if it reaches the predefined horizontal offset that marks the end of the level. This way, it is easy to compare scores across levels of varying length. The completion bonus is 1000 for reaching the end of the level instantly, and drops linearly to zero at 4500 timesteps. This way, agents are encouraged to finish levels as fast as possible.
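As a worked illustration of these two components, the following sketch computes an episode's total reward from its final horizontal offset and its length. It assumes offsets are measured from the player's initial position and that the bonus is awarded only when the level is completed; the function and parameter names are hypothetical.

```python
MAX_TIMESTEPS = 4500    # episode time limit
OFFSET_REWARD = 9000.0  # cumulative offset reward at the end of the level
MAX_BONUS = 1000.0      # completion bonus for finishing instantly


def offset_reward(x_offset, end_offset):
    """Cumulative reward proportional to the offset from the start position."""
    return OFFSET_REWARD * x_offset / end_offset


def completion_bonus(timesteps):
    """Bonus that decays linearly from 1000 at t=0 to 0 at t=4500."""
    elapsed = min(max(timesteps, 0), MAX_TIMESTEPS) / MAX_TIMESTEPS
    return MAX_BONUS * (1.0 - elapsed)


def episode_score(x_offset, end_offset, timesteps, completed):
    """Total episode reward: offset component plus bonus if completed."""
    score = offset_reward(x_offset, end_offset)
    if completed:
        score += completion_bonus(timesteps)
    return score
```

Under these assumptions, an agent that finishes a level halfway through the time limit scores 9000 + 500 = 9500, while an agent that covers half the level before losing a life scores 4500.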

Since the reward function is dense, RL algorithms like PPO [18] and DQN [16] can easily make progress on new levels. However, the immediate rewards can be deceptive; it is often necessary to go backwards for prolonged amounts of time (Figure 2). In our RL baselines, we use reward preprocessing so that our agents are not punished for going backwards. Note, however, that the preprocessed reward still gives no information about when or how an agent should go backwards.

3.9 Evaluation

In general, all benchmarks must provide some kind of performance metric. For Sonic, this metric takes the form of a “mean score” as measured across all the levels in the test set. Here are the general steps for evaluating an algorithm on Sonic:

  1. At training time, use the training set as much or as little as you like.
  2. At test time, play each test level for 1 million timesteps. Play each test level separately; do not allow information to flow between test levels. Multiple copies of each environment may be used (as is done in algorithms like A3C [19]).
  3. For each 1 million timestep evaluation, average the total reward per episode across all episodes. This gives a per-level mean score.
  4. Average the mean scores for all the test levels, giving an aggregate metric of performance.
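The last two steps amount to a mean of means, which can be sketched as follows. The input format (a dictionary from level name to the list of total episode rewards collected during that level's 1 million timesteps) and the function names are assumptions made for illustration.

```python
def level_mean_score(episode_rewards):
    """Average total reward per episode for one test level."""
    return sum(episode_rewards) / len(episode_rewards)


def aggregate_score(rewards_by_level):
    """Return (aggregate score, per-level mean scores).

    rewards_by_level maps a level name to the list of total rewards of
    every episode played during that level's evaluation.
    """
    level_means = {
        level: level_mean_score(rewards)
        for level, rewards in rewards_by_level.items()
    }
    aggregate = sum(level_means.values()) / len(level_means)
    return aggregate, level_means
```

Note that every level contributes equally to the aggregate, regardless of how many episodes fit into its 1 million timesteps.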

Figure 2: A trace of a successful path through the first part of Labyrinth Zone, Act 2 in Sonic The Hedgehog™. In the initial green segment, the agent is moving rightwards, getting positive reward. In the red segment, the agent must move to the left, getting negative reward. During the orange segment, the agent is once again moving right, but its cumulative reward is still not as high as it was after the initial green segment. In the final green segment, the agent is finally improving its cumulative reward past the initial green segment. For an average player, it takes 20 to 30 seconds to get through the red and orange segments.

The most important aspect of this procedure is the timestep limit for each test level. In the infinite-timestep regime, there is no strong reason to believe that meta-learning or transfer learning is necessary. However, in the limited-timestep regime, transfer learning may be necessary to achieve good performance quickly.

We aim for this version of the Sonic benchmark to be easier than zero-shot learning but harder than ∞-shot learning. 1 million timesteps was chosen as the timestep limit because modern RL algorithms can make some progress in this amount of time.

4 Baselines

In this section, we present several baseline learning algorithms and discuss their performance on the benchmark. Our baselines include human players, several methods that do not make use of the training set, and a simple transfer learning approach consisting of joint training followed by fine tuning. Table 1 gives the aggregate scores for each of the baselines, and Figure 3 compares the baselines’ aggregate learning curves.

4.1 Humans

For the human baseline, we had four test subjects play each test level for one hour. Before seeing the test levels, each subject had two hours to practice on the training levels. Table 7 in Appendix C shows average human scores over the course of an hour.

Table 1: Aggregate test scores for each of the baseline algorithms.

Algorithm | Score
Rainbow | 2748.6 ± 102.2
JERK | 1904.0 ± 21.9
PPO | 1488.8 ± 42.8
PPO (joint) | 3127.9 ± 116.9
Rainbow (joint) | 2969.2 ± 170.2
Human | 7438.2 ± 624.2

Figure 3: The mean learning curves for all the baselines across all the test levels. Every curve is an average over three runs. The y-axis represents instantaneous score, not average over training.

4.2 Rainbow

Deep Q-learning (DQN) [16] is a popular class of algorithms for reinforcement learning in high-dimensional environments like video games. We use a specific variant of DQN, namely Rainbow [20], which performs particularly well on the ALE.

We retain the architecture and most of the hyper-parameters from [20], with a few small changes. First, we set Vmax = 200 to account for Sonic’s reward scale. Second, we use a replay buffer size of 0.5M instead of 1M to lower the algorithm’s memory consumption. Third, we do not use hyper-parameter schedules; rather, we simply use the initial values of the schedules from [20].

Since DQN tends to work best with a small, discrete action space, we use an action space containing seven actions:

{{LEFT}, {RIGHT}, {LEFT, DOWN}, {RIGHT, DOWN}, {DOWN}, {DOWN, B}, {B}}

We use an environment wrapper that rewards the agent based on deltas in the maximum x-position. This way, the agent is rewarded for getting further than it has been before (in the current episode), but it is not punished for backtracking in the level. This reward preprocessing gives a sizable performance boost.
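A minimal sketch of this reward preprocessing is shown below. It is written as a standalone class rather than an actual Gym wrapper, and it relies on the fact that the raw cumulative reward is proportional to the horizontal offset, so tracking cumulative reward is equivalent to tracking x-position. The class name `MaxProgressReward` is hypothetical.

```python
class MaxProgressReward:
    """Reward only progress beyond the furthest point reached this episode."""

    def __init__(self):
        self.reset()

    def reset(self):
        """Call at the start of every episode."""
        self.cur_x = 0.0  # current position, in raw reward units
        self.max_x = 0.0  # furthest position reached so far

    def step(self, raw_reward):
        """Convert one raw reward into a non-negative shaped reward."""
        self.cur_x += raw_reward
        shaped = max(0.0, self.cur_x - self.max_x)
        self.max_x = max(self.max_x, self.cur_x)
        return shaped
```

For raw rewards of 10, -4, 3, and 5, the shaped rewards are 10, 0, 0, and 4: backtracking costs nothing, and reward resumes only once the agent passes its previous best position.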

Table 2 in Appendix C shows Rainbow’s scores for each test level.

4.3 JERK: A Scripted Approach

In this section, we present a simple algorithm that achieves high rewards on the benchmark without using any deep learning. This algorithm completely ignores observations and instead looks solely at rewards. We call this algorithm Just Enough Retained Knowledge (JERK). We note that JERK is loosely related to The Brute [21], a simple algorithm that finds good trajectories in deterministic Atari environments without leveraging any deep learning.

Algorithm 1 in Appendix A describes JERK in detail. The main idea is to explore using a simple algorithm, then to replay the best action sequences more and more frequently as training progresses. Since the environment is stochastic, it is never clear which action sequence is the best to replay. Thus, each action sequence has a running mean of its rewards. Table 3 in Appendix C shows JERK’s scores for each test level. We note that JERK actually performs better than regular PPO, which is likely due to JERK’s perfect memory and its tailored exploration strategy.
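The running-mean bookkeeping can be illustrated with a short sketch. This covers only the idea of tracking a mean reward per stored action sequence and picking the best one to replay; it does not reproduce Algorithm 1, and in particular the exploration policy and the replay schedule are omitted. The names `ReplayCandidate` and `best_candidate` are hypothetical.

```python
class ReplayCandidate:
    """A stored action sequence with a running mean of its total rewards."""

    def __init__(self, actions):
        self.actions = list(actions)
        self.num_plays = 0
        self.mean_reward = 0.0

    def record(self, total_reward):
        """Update the running mean after playing (or replaying) the sequence."""
        self.num_plays += 1
        self.mean_reward += (total_reward - self.mean_reward) / self.num_plays


def best_candidate(candidates):
    """Pick the sequence with the highest mean reward observed so far."""
    return max(candidates, key=lambda c: c.mean_reward)
```

Because sticky frame skip makes outcomes noisy, a sequence that scored well once may score poorly on replay; averaging over plays lets such a sequence fall behind more reliable ones.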

4.4 PPO

Proximal Policy Optimization (PPO) [18] is a policy gradient algorithm which performs well on the ALE. For this baseline, we run PPO individually on each of the test levels.

For PPO we use the same action and observation spaces as for Rainbow, as well as the same reward preprocessing. For our experiments, we scaled the rewards by a small constant factor in order to bring the advantages to a suitable range for neural networks. This is similar to how we set Vmax for Rainbow. The CNN architecture is the same as the one used in [18] for Atari.

We use the following hyper-parameters for PPO:

Table 4 in Appendix C shows PPO’s scores for each test level.

4.5 Joint PPO

While Section 4.4 evaluates PPO with no meta-learning, this section explores the ability of PPO to transfer from the training levels to the test levels. To do this, we use a simple joint training algorithm, wherein we train a policy on all the training levels and then use it as an initialization on the test levels.

During meta-training, we train a single policy to play every level in the training set. Specifically, we run 188 parallel workers, each of which is assigned a level from the training set. At every gradient step, all the workers average their gradients together, ensuring that the policy is trained evenly across the entire training set. This training process requires hundreds of millions of timesteps to converge (see Figure 4), since the policy is being forced to learn a lot more than a single level. Besides the different training setup, we use the same hyper-parameters as for regular PPO.
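The gradient-averaging step can be sketched with plain Python lists standing in for gradient tensors. The round-robin level assignment shown here is an assumption: the text only states that each worker is assigned a training level. With the 47 training levels implied by Section 3.3 (58 levels minus the 11 test levels), round-robin assignment would give each level exactly four of the 188 workers.

```python
def assign_levels(num_workers, levels):
    """Assumed round-robin assignment of training levels to workers."""
    return [levels[i % len(levels)] for i in range(num_workers)]


def average_gradients(worker_grads):
    """Element-wise mean of the gradient vectors computed by each worker.

    worker_grads is a list with one flat list of floats per worker.
    """
    num_workers = len(worker_grads)
    return [sum(values) / num_workers for values in zip(*worker_grads)]
```

Since every worker contributes equally to the averaged gradient, each training level has the same weight in every update as long as the workers are spread evenly over the levels.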

Once the joint policy has been trained on all the training levels, we fine-tune it on each test level under the standard evaluation rules. In essence, the training set provides an initialization that is plugged in when evaluating on the test set. Aside from the initialization, nothing is changed from the evaluation procedure used for Section 4.4.

Figure 4 shows that, after roughly 50 million timesteps of joint training, further improvement on the training set stops leading to better performance on the test set. This can be thought of as the point where the model starts to overfit. The figure also shows that zero-shot performance does not increase much after the first few million timesteps of joint training.

Figure 4: Intermediate performance during the process of joint training a PPO model. The x-axis corresponds to timesteps into the joint training process. The zero-shot curves were densely sampled during training, while the fine-tuning curves were sampled periodically.

Table 5 in Appendix C shows Joint PPO’s scores for each test level. Table 9 in Appendix D shows Joint PPO’s final scores for each training level. The resulting test performance is superior to that of Rainbow, and is roughly 100% better than that of regular PPO. Thus, it is clear that some kind of useful information is being transferred from the training levels to the test levels.

4.6 Joint Rainbow

Since Rainbow outperforms PPO with no joint training, it is natural to ask if Joint Rainbow analogously outperforms Joint PPO. Surprisingly, our experiments indicate that this is not the case.

To train a single Rainbow model on the entire training set, we use a multi-machine training setup with 32 GPUs. Each GPU corresponds to a single worker, where each worker has its own replay buffer and eight environments. The environments are all “joint environments”, meaning that they sample a new training level at the beginning of every episode. Each worker runs the algorithm described in Algorithm 2 in Appendix A.
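A "joint environment" of this kind can be sketched as a thin wrapper that picks a new level on every reset. This is an illustrative sketch only: `make_env` is a hypothetical factory that builds an environment for a given level, and the wrapped environment is assumed to expose `reset()` and `step(action)`.

```python
import random


class JointEnv:
    """Environment that samples a new training level for every episode."""

    def __init__(self, make_env, levels, seed=None):
        self._make_env = make_env  # hypothetical factory: level -> environment
        self._levels = list(levels)
        self._rng = random.Random(seed)
        self._env = None
        self.current_level = None

    def reset(self):
        """Sample a level uniformly at random and start an episode on it."""
        self.current_level = self._rng.choice(self._levels)
        self._env = self._make_env(self.current_level)
        return self._env.reset()

    def step(self, action):
        """Forward the action to the environment for the current level."""
        return self._env.step(action)
```

With eight such environments per worker, each replay buffer fills with transitions from a mix of training levels rather than from a single one.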

Besides the unusual batch size and distributed worker setup, all the hyper-parameters are kept the same as for the regular Rainbow experiment.

Table 6 in Appendix C shows the performance of fine-tuning on every test level. Table 8 in Appendix D shows the performance of the jointly trained model on every training level.

5 Discussion

We have presented a new reinforcement learning benchmark and used it to evaluate several baseline algorithms. Our results leave a lot of room for improvement, especially since our best transfer learning results are not much better than our best results learning from scratch. Also, our results are nowhere close to the maximum achievable score (which, by design, is somewhere between 9000 and 10000).

Now that the benchmark and baseline results have been laid out, there are many directions to take further research. Here are some questions that future research might seek to answer:

  • How much can exploration objectives help training performance on the benchmark?
  • Can transfer learning be improved using data augmentation?
  • Is it possible to improve performance on the test set using a good feature representation learned on the training set (like in Higgins et al. [12])?
  • Can different architectures (e.g. Transformers [23] and ResNets [24]) be used to improve training and/or test performance?

While we believe the Sonic benchmark is a step in the right direction, it may not be sufficient for exploring meta-learning, transfer learning, and generalization in RL. Here are some possible problems with this benchmark, which will only be proven or disproven once more work has been done:

  • It may be possible to solve a Sonic level in many fewer than 1M timesteps without any transfer learning.
  • Sonic-specific hacks may outperform general meta-learning approaches.
  • Exploration strategies that work well in Sonic may not generalize beyond Sonic.
  • Mastering a Sonic level involves some degree of memorization. Algorithms which are good at few-shot memorization may not be good at other tasks.


References

[1] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, “The Arcade Learning Environment: An evaluation platform for general agents,” Journal of Artificial Intelligence Research, vol. 47, pp. 253–279, Jun. 2013.
[2] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap, “Meta-learning with memory-augmented neural networks,” in International conference on machine learning, 2016, pp. 1842–1850.
[3] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” 2017. eprint: arXiv:1703.03400.
[4] N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel, “Meta-learning with temporal convolutions,” 2017. eprint: arXiv:1707.03141.
[5] B. M. Lake, R. Salakhutdinov, J. Gross, and J. B. Tenenbaum, “One shot learning of simple visual concepts,” in Conference of the Cognitive Science Society (CogSci), 2011.
[6] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra, “Matching networks for one shot learning,” in Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, Eds., Curran Associates, Inc., 2016, pp. 3630–3638.
[7] N. Bhonker, S. Rozenberg, and I. Hubara, “Playing SNES in the Retro Learning Environment,” 2016. eprint: arXiv:1611.02205.
[8] E. Parisotto, J. L. Ba, and R. Salakhutdinov, “Actor-Mimic: Deep multitask and transfer reinforcement learning,” 2015. eprint: arXiv:1511.06342.
[9] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell, “Progressive neural networks,” 2016. eprint: arXiv:1606.04671.
[10] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell, “Curiosity-driven exploration by self-supervised prediction,” 2017. eprint: arXiv:1705.05363.
[11] C. Fernando, D. Banarse, C. Blundell, Y. Zwols, D. Ha, A. A. Rusu, A. Pritzel, and D. Wierstra, “PathNet: Evolution channels gradient descent in super neural networks,” 2017. eprint: arXiv:1701.08734.
[12] I. Higgins, A. Pal, A. A. Rusu, L. Matthey, C. P. Burgess, A. Pritzel, M. Botvinick, C. Blundell, and A. Lerchner, “DARLA: Improving zero-shot transfer in reinforcement learning,” 2017. eprint: arXiv:1707.08475.
[13] C. Beattie, J. Z. Leibo, D. Teplyashin, T. Ward, M. Wainwright, H. Küttler, A. Lefrancq, S. Green, V. Valdés, A. Sadik, J. Schrittwieser, K. Anderson, S. York, M. Cant, A. Cain, A. Bolton, S. Gaffney, H. King, D. Hassabis, S. Legg, and S. Petersen, “DeepMind Lab,” 2016. eprint: arXiv:1612.03801.
[14] E. Todorov, T. Erez, and Y. Tassa, “MuJoCo: A physics engine for model-based control,” in IROS, IEEE, 2012, pp. 5026–5033.
[15] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “OpenAI Gym,” 2016. eprint: arXiv:1606.01540.
[16] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[17] M. C. Machado, M. G. Bellemare, E. Talvitie, J. Veness, M. J. Hausknecht, and M. Bowling, “Revisiting the Arcade Learning Environment: Evaluation protocols and open problems for general agents,” CoRR, vol. abs/1709.06009, 2017.
[18] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” 2017. eprint: arXiv:1707.06347.
[19] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” 2016. eprint: arXiv:1602.01783.
[20] M. Hessel, J. Modayil, H. van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver, “Rainbow: Combining improvements in deep reinforcement learning,” 2017. eprint: arXiv:1710.02298.
[21] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, “The arcade learning environment: An evaluation platform for general agents,” in Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015, pp. 4148–4152.
[22] A. Nichol and J. Schulman, “On first-order meta-learning algorithms,” 2018. eprint: arXiv:1803.02999.
[23] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017, pp. 6000–6010.
[24] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.

A Detailed Algorithm Descriptions

B Plots for Multiple Seeds

In this section, we present per-algorithm learning curves on the test set. For each algorithm, we run three different random seeds.

Figure 5: Test learning curves for JERK.
Figure 6: Test learning curves for PPO.
Figure 7: Test learning curves for Rainbow.
Figure 8: Test learning curves for Joint Rainbow.
Figure 9: Test learning curves for Joint PPO.

C Scores on Test Set

Table 2: Detailed evaluation results for Rainbow.

State | Score | Final Score
AngelIslandZone Act2 | 3576.0 ± 89.2 | 5070.1 ± 433.1
CasinoNightZone Act2 | 6045.2 ± 845.4 | 8607.9 ± 1022.5
FlyingBatteryZone Act2 | 1657.5 ± 10.1 | 2195.4 ± 190.8
GreenHillZone Act2 | 6332.0 ± 263.5 | 6817.2 ± 392.8
HillTopZone Act2 | 2847.8 ± 161.9 | 3432.7 ± 252.9
HydrocityZone Act1 | 886.4 ± 31.4 | 867.2 ± 0.0
LavaReefZone Act1 | 2623.6 ± 78.0 | 2908.5 ± 106.1
MetropolisZone Act3 | 1178.1 ± 229.3 | 2278.8 ± 280.6
ScrapBrainZone Act1 | 879.1 ± 141.0 | 2050.0 ± 1089.9
SpringYardZone Act1 | 1787.6 ± 136.5 | 3861.0 ± 782.2
StarLightZone Act3 | 2421.9 ± 110.8 | 2680.3 ± 366.2
Aggregate | 2748.6 ± 102.2 | 3706.3 ± 192.7

Table 3: Detailed evaluation results for JERK.

State | Score | Final Score
AngelIslandZone Act2 | 1305.2 ± 13.3 | 1605.1 ± 158.7
CasinoNightZone Act2 | 2231.0 ± 556.8 | 2639.7 ± 799.5
FlyingBatteryZone Act2 | 1384.9 ± 13.0 | 1421.8 ± 25.0
GreenHillZone Act2 | 3702.1 ± 199.1 | 4862.2 ± 178.7
HillTopZone Act2 | 1901.6 ± 56.0 | 1840.4 ± 326.8
HydrocityZone Act1 | 2613.0 ± 149.6 | 3895.5 ± 50.0
LavaReefZone Act1 | 267.1 ± 71.6 | 200.3 ± 71.9
MetropolisZone Act3 | 2623.7 ± 209.2 | 3291.4 ± 398.2
ScrapBrainZone Act1 | 1442.6 ± 108.8 | 1756.3 ± 314.2
SpringYardZone Act1 | 838.9 ± 186.1 | 829.2 ± 158.2
StarLightZone Act3 | 2633.5 ± 23.4 | 3033.3 ± 53.8
Aggregate | 1904.0 ± 21.9 | 2306.8 ± 74.0

Table 4: Detailed evaluation results for PPO.

State | Score | Final Score
AngelIslandZone Act2 | 1491.3 ± 537.8 | 2298.3 ± 1355.8
CasinoNightZone Act2 | 2517.8 ± 1033.0 | 2343.6 ± 1044.5
FlyingBatteryZone Act2 | 1105.8 ± 177.3 | 1305.7 ± 221.9
GreenHillZone Act2 | 2477.6 ± 435.3 | 2655.7 ± 373.4
HillTopZone Act2 | 2408.0 ± 140.4 | 3173.1 ± 549.7
HydrocityZone Act1 | 622.8 ± 288.6 | 433.5 ± 348.4
LavaReefZone Act1 | 885.8 ± 125.6 | 683.9 ± 206.3
MetropolisZone Act3 | 1007.6 ± 145.1 | 1058.6 ± 400.4
ScrapBrainZone Act1 | 1162.0 ± 202.8 | 2190.8 ± 667.5
SpringYardZone Act1 | 564.2 ± 195.6 | 644.2 ± 337.4
StarLightZone Act3 | 2134.4 ± 313.4 | 2519.0 ± 98.8
Aggregate | 1488.8 ± 42.8 | 1755.1 ± 65.2

Table 5: Detailed evaluation results for Joint PPO.

State | Score | Final Score
AngelIslandZone Act2 | 3283.0 ± 681.0 | 4375.3 ± 1132.8
CasinoNightZone Act2 | 5410.2 ± 635.6 | 6142.4 ± 1098.7
FlyingBatteryZone Act2 | 1513.3 ± 48.3 | 1748.0 ± 15.1
GreenHillZone Act2 | 8769.3 ± 308.8 | 8921.2 ± 59.5
HillTopZone Act2 | 4289.9 ± 334.2 | 4688.6 ± 109.4
HydrocityZone Act1 | 1249.8 ± 206.3 | 2821.7 ± 154.1
LavaReefZone Act1 | 2409.0 ± 253.5 | 3076.0 ± 13.7
MetropolisZone Act3 | 1409.5 ± 72.9 | 2004.3 ± 110.4
ScrapBrainZone Act1 | 1634.6 ± 287.0 | 2112.0 ± 713.9
SpringYardZone Act1 | 2992.9 ± 350.0 | 4663.4 ± 799.5
StarLightZone Act3 | 1445.3 ± 110.5 | 2636.7 ± 103.3
Aggregate | 3127.9 ± 116.9 | 3926.3 ± 78.1

Table 6: Detailed evaluation results for Joint Rainbow.

State                   Score             Final Score
AngelIslandZone Act2    3770.5 ± 231.8    4615.1 ± 1082.5
CasinoNightZone Act2    7877.7 ± 556.0    8851.2 ± 305.4
FlyingBatteryZone Act2  2110.2 ± 114.4    2585.7 ± 131.1
GreenHillZone Act2      6106.8 ± 667.1    6793.5 ± 643.6
HillTopZone Act2        2378.4 ± 92.5     3531.3 ± 4.9
HydrocityZone Act1      865.0 ± 1.3       867.2 ± 0.0
LavaReefZone Act1       2753.6 ± 192.8    2959.7 ± 134.1
MetropolisZone Act3     1340.6 ± 224.0    1843.2 ± 253.0
ScrapBrainZone Act1     983.5 ± 34.3      2075.0 ± 568.3
SpringYardZone Act1     2661.0 ± 293.6    4090.1 ± 700.2
StarLightZone Act3      1813.7 ± 94.5     2533.8 ± 239.0
Aggregate               2969.2 ± 170.2    3704.2 ± 151.1

Table 7: Detailed evaluation results for humans.

State                   Score
AngelIslandZone Act2    8758.3 ± 477.9
CasinoNightZone Act2    8662.3 ± 1402.6
FlyingBatteryZone Act2  6021.6 ± 1006.7
GreenHillZone Act2      8166.1 ± 614.0
HillTopZone Act2        8600.9 ± 772.1
HydrocityZone Act1      7146.0 ± 1555.1
LavaReefZone Act1       6705.6 ± 742.4
MetropolisZone Act3     6004.8 ± 440.4
ScrapBrainZone Act1     6413.8 ± 922.2
SpringYardZone Act1     6744.0 ± 1172.0
StarLightZone Act3      8597.2 ± 729.5
Aggregate               7438.2 ± 624.2
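
The Aggregate rows in these tables appear to be the unweighted mean of the per-level scores. As a quick sanity check (a sketch, not the authors' evaluation code), the human scores from Table 7 can be averaged in Python. The variable names are illustrative, and the check covers only the mean, not the ± term, whose derivation is not stated here.

```python
from statistics import mean

# Per-level human scores (means only), copied from Table 7.
human_scores = {
    "AngelIslandZone Act2": 8758.3,
    "CasinoNightZone Act2": 8662.3,
    "FlyingBatteryZone Act2": 6021.6,
    "GreenHillZone Act2": 8166.1,
    "HillTopZone Act2": 8600.9,
    "HydrocityZone Act1": 7146.0,
    "LavaReefZone Act1": 6705.6,
    "MetropolisZone Act3": 6004.8,
    "ScrapBrainZone Act1": 6413.8,
    "SpringYardZone Act1": 6744.0,
    "StarLightZone Act3": 8597.2,
}

# Assumption: the aggregate is a plain, unweighted mean across the test levels.
aggregate = mean(list(human_scores.values()))
print(f"{aggregate:.1f}")  # prints 7438.2, matching the Aggregate row
```

The same check on the Score column of Table 3 (JERK) gives 1904.0, which also matches its Aggregate row.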

D                        Scores on Training Set

Table 8: Final performance for the joint Rainbow model over the last 10 episodes for each environment. Error margins are computed using the standard deviation over three runs.

State                   Score             State                   Score
AngelIslandZone Act1    4765.6 ± 1326.2   LaunchBaseZone Act2     1850.1 ± 124.3
AquaticRuinZone Act1    5382.3 ± 1553.1   LavaReefZone Act2       820.3 ± 80.9
AquaticRuinZone Act2    4752.7 ± 1815.0   MarbleGardenZone Act1   2733.2 ± 232.1
CarnivalNightZone Act1  3554.8 ± 379.6    MarbleGardenZone Act2   180.7 ± 150.2
CarnivalNightZone Act2  2613.7 ± 46.4     MarbleZone Act1         4127.0 ± 375.9
CasinoNightZone Act1    2165.7 ± 75.9     MarbleZone Act2         1615.7 ± 47.6
ChemicalPlantZone Act1  4483.5 ± 954.6    MarbleZone Act3         1595.1 ± 77.6
ChemicalPlantZone Act2  2840.4 ± 216.4    MetropolisZone Act1     388.9 ± 184.2
DeathEggZone Act1       2334.3 ± 61.0     MetropolisZone Act2     3048.6 ± 1599.9
DeathEggZone Act2       3197.8 ± 32.0     MushroomHillZone Act1   2076.0 ± 1107.8
EmeraldHillZone Act1    9273.4 ± 385.8    MushroomHillZone Act2   2869.1 ± 1150.4
EmeraldHillZone Act2    9410.1 ± 421.1    MysticCaveZone Act1     1606.8 ± 776.9
FlyingBatteryZone Act1  711.8 ± 99.1      MysticCaveZone Act2     4359.4 ± 547.5
GreenHillZone Act1      4164.7 ± 311.2    OilOceanZone Act1       1998.8 ± 10.0
GreenHillZone Act3      5481.3 ± 1095.1   OilOceanZone Act2       3613.7 ± 1244.9
HiddenPalaceZone        9308.9 ± 119.1    SandopolisZone Act1     1475.3 ± 205.1
HillTopZone Act1        778.0 ± 8.1       SandopolisZone Act2     539.9 ± 0.7
HydrocityZone Act2      825.7 ± 2.2       ScrapBrainZone Act2     692.6 ± 67.6
IcecapZone Act1         5507.0 ± 167.5    SpringYardZone Act2     3162.3 ± 38.7
IcecapZone Act2         3198.2 ± 774.7    SpringYardZone Act3     2029.6 ± 211.3
LabyrinthZone Act1      3005.3 ± 197.8    StarLightZone Act1      4558.9 ± 1094.1
LabyrinthZone Act2      1420.8 ± 533.0    StarLightZone Act2      7105.5 ± 404.2
LabyrinthZone Act3      1458.7 ± 255.4    WingFortressZone        3004.6 ± 7.1
LaunchBaseZone Act1     2044.5 ± 601.7    Aggregate               3151.7 ± 218.2

Table 9: Final performance for the joint PPO model over the last 10 episodes for each environment. Error margins are computed using the standard deviation over two runs.

State                   Score             State                   Score
AngelIslandZone Act1    9668.2 ± 117.0    LaunchBaseZone Act2     1836.0 ± 545.0
AquaticRuinZone Act1    9879.8 ± 4.0      LavaReefZone Act2       2155.1 ± 1595.2
AquaticRuinZone Act2    8676.0 ± 1183.2   MarbleGardenZone Act1   3760.0 ± 108.5
CarnivalNightZone Act1  4429.5 ± 452.0    MarbleGardenZone Act2   1366.4 ± 23.5
CarnivalNightZone Act2  2688.2 ± 110.4    MarbleZone Act1         5007.8 ± 172.5
CasinoNightZone Act1    9378.8 ± 409.3    MarbleZone Act2         1620.6 ± 30.9
ChemicalPlantZone Act1  9825.0 ± 6.0      MarbleZone Act3         2054.4 ± 60.8
ChemicalPlantZone Act2  2586.8 ± 516.9    MetropolisZone Act1     1102.8 ± 281.5
DeathEggZone Act1       3332.5 ± 39.1     MetropolisZone Act2     6666.7 ± 53.0
DeathEggZone Act2       3141.5 ± 282.5    MushroomHillZone Act1   3210.2 ± 2.7
EmeraldHillZone Act1    9870.7 ± 0.3      MushroomHillZone Act2   6549.6 ± 1802.9
EmeraldHillZone Act2    9901.6 ± 18.9     MysticCaveZone Act1     6755.9 ± 47.8
FlyingBatteryZone Act1  1642.4 ± 512.9    MysticCaveZone Act2     6189.6 ± 16.6
GreenHillZone Act1      7116.0 ± 2783.5   OilOceanZone Act1       4938.8 ± 13.3
GreenHillZone Act3      9878.5 ± 5.1      OilOceanZone Act2       6964.9 ± 1929.3
HiddenPalaceZone        9918.3 ± 1.4      SandopolisZone Act1     2548.1 ± 80.8
HillTopZone Act1        4074.2 ± 370.1    SandopolisZone Act2     1087.5 ± 21.5
HydrocityZone Act2      4756.8 ± 3382.3   ScrapBrainZone Act2     1403.7 ± 3.3
IcecapZone Act1         5389.9 ± 35.6     SpringYardZone Act2     9306.8 ± 489.1
IcecapZone Act2         6819.4 ± 67.9     SpringYardZone Act3     2608.1 ± 113.2
LabyrinthZone Act1      5041.4 ± 194.6    StarLightZone Act1      6363.6 ± 198.7
LabyrinthZone Act2      1337.9 ± 61.9     StarLightZone Act2      8336.1 ± 998.3
LabyrinthZone Act3      1918.7 ± 33.5     WingFortressZone        3109.2 ± 50.9
LaunchBaseZone Act1     2714.0 ± 17.7     Aggregate               5083.6 ± 91.8
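
The captions of Tables 8 and 9 describe each entry as performance over the last 10 episodes, with an error margin taken as the standard deviation over runs. One plausible reading, sketched below in Python, averages the last 10 episode scores within each run and then reports the mean and standard deviation of those per-run averages. The function name `final_performance`, the use of the sample standard deviation, and the episode scores are all assumptions made for illustration; none of them come from the report.

```python
from statistics import mean, stdev


def final_performance(runs, last_n=10):
    """Return (mean, std) of per-run averages over each run's last `last_n` episodes."""
    per_run = [mean(episode_scores[-last_n:]) for episode_scores in runs]
    # Assumption: sample standard deviation (n - 1 denominator) across runs.
    return mean(per_run), stdev(per_run)


# Hypothetical per-episode scores for three runs on one level (not from the paper).
runs = [
    [100.0] * 12,
    [200.0] * 12,
    [300.0] * 12,
]

avg, err = final_performance(runs)
print(f"{avg:.1f} ± {err:.1f}")  # prints 200.0 ± 100.0
```

With only two or three runs, as in these tables, the choice between sample and population standard deviation changes the reported margin noticeably, so that choice should be stated when reproducing the numbers.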