Announcing the Obstacle Tower Challenge winners and open source release

August 7, 2019

After six months of competition (and a few last-minute submissions), we are happy to announce the conclusion and winners of the Obstacle Tower Challenge. We want to thank all of the participants in both rounds and congratulate Alex Nichol, the Compscience.org team, and Songbin Choi for placing in the challenge. We are also excited to share that we have open-sourced Obstacle Tower for the research community to extend for their own needs.

Challenge winners

We started this challenge in February as a way to help foster research in the AI community, by providing a challenging new benchmark of agent performance built in Unity, which we called Obstacle Tower. The Obstacle Tower was developed to be difficult for current machine learning algorithms to solve, and push the boundaries of what was possible in the field by focusing on procedural generation. Key to that was only allowing participants access to one hundred instances of the Obstacle Tower, and evaluating their trained agents on a set of unique procedurally generated towers they had never seen before. In this way, agents had to be able not only to solve the versions of the environment they had seen before, but also do well on unexpected variations, a key property of intelligence referred to as generalization.

Once we created Obstacle Tower, we performed preliminary benchmarking ourselves using two of the state-of-the-art algorithms at the time. Our trained agents were able to solve an average of just over 3 floors on the unseen instances of the tower used for evaluation. In contrast, humans without experience playing video games are able to solve an average of 15 floors, often getting as high as 20 floors into a tower.

Since the start of the contest, we have received close to 3,000 submitted agents and been delighted to watch as participants continued to submit ever more compelling agents for evaluation. The top six final agents submitted by participants were able to solve over 10 floors of unseen versions of the tower, with the top entry solving an average of nearly 20 floors, similar to the performance of experienced human players. We want to highlight all participants who solved at least ten floors during evaluation, as well as our top three winners.

Challenge Winners

Place | Name            | Username    | Average floors | Average reward
1st   | Alex Nichol     | unixpickle  | 19.4           | 35.86
2nd   | Compscience.org | giadefa     | 16             | 28.7
3rd   | Songbin Choi    | sungbinchoi | 13.2           | 23.2

Honorable Mentions

Place | Name      | Username  | Average floors | Average reward
4th   | Joe Booth | joe_booth | 10.8           | 18.06
5th   | Doug Meng | dougm     | 10             | 16.5
6th   | UEFDL     | Miffyli   | 10             | 16.42

Open source release

We are happy to announce that all of the source code for Obstacle Tower is now available under the Apache 2 license. We waited to open source it until the contest was completed to prevent anyone from reverse-engineering the task or evaluation process. Now that it is over, we hope researchers and users are able to take things apart to learn how to solve the task better, as well as modify the Obstacle Tower for their own needs. The Obstacle Tower was built to be highly modular, and relies heavily on procedural generation of multiple aspects of the environment, from the floor layout to the item and module placement in each room. We expect that this modularity will make it easy for researchers to define their own custom tasks using the pieces and tools we’ve built.

The focus of the Obstacle Tower Challenge is what we refer to in our paper as weak generalization (sometimes called within-distribution generalization). For the challenge, agents had access to one hundred towers and were tested on an additional five towers. Importantly, all of these towers were generated using the same set of rules. As such, there were no big surprises for the agents.
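
The weak-generalization protocol can be sketched in a few lines. This is a hypothetical illustration (the actual seed values and generator range used in the challenge are not public): train and test towers come from the same generator, distinguished only by disjoint seeds.

```python
import random

def split_seeds(n_train=100, n_test=5, seed=0):
    """Weak-generalization split: train and test towers are generated by
    the same rules, only the procedural seeds differ (values illustrative)."""
    rng = random.Random(seed)
    # sample() draws without replacement, so train and test never overlap
    seeds = rng.sample(range(10_000), n_train + n_test)
    return seeds[:n_train], seeds[n_train:]
```

Since both sets are drawn from the same distribution, an agent that merely memorizes its 100 training towers still fails on the held-out five, which is exactly what the evaluation measures.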

Also of interest is a different kind of generalization, what we refer to as strong generalization (sometimes called out-of-distribution generalization). In this scenario, the agent is tested on a version of Obstacle Tower generated using a different set of rules from the training set. In our paper, we held out a separate visual theme for the evaluation phase, which used different textures, geometry, and lighting. Because our baseline agents performed catastrophically in these cases, we opted to test only for weak generalization in the challenge. That being said, we think that strong generalization benchmarks can be an even better measure of progress in artificial intelligence, as humans are easily able to strongly generalize, while agents typically fail at such tasks. We look forward to the community extending our work and proposing their own unique benchmarks using this open source release.

Lastly, we want to give a shout-out to our collaborators on the project, Julian Togelius and Ahmed Khalifa, and thank them for their contributions in the design process, and for Ahmed’s open source procedural generation tool, which we utilized to create the floor layouts in Obstacle Tower.

To learn more about the project, as well as how to extend it for your own uses, head over to the GitHub page for the project.

Meet the Winners

1st Place – Alex Nichol

About Alex

Alex has been programming since he was 11 years old. As a senior in high school, Alex became very interested in AI. He is completely self-taught in AI, using online courses, blogs, and papers as necessary. He studied at Cornell for three semesters before leaving to pursue AI full-time and ultimately joining OpenAI (he has since left but still maintains a strong interest in AI). Recently he has taken up cooking!

Details

Alex trained his agent in several steps. First, he trained a classifier to identify objects (boxes, doors, etc.). This classifier was used throughout the process to tell the agent what objects it had seen in the past 50 timesteps. Then, Alex used behavioral cloning to train an agent to imitate human demonstrations. Lastly, Alex used a variant of Proximal Policy Optimization (PPO), which he calls “prierarchy,” to fine-tune his behavior-cloned agent on the game’s reward function. This variant of PPO replaces the entropy term with a KL-divergence term that keeps the agent close to the original behavior-cloned policy. Alex tried a few other approaches that didn’t quite pan out: Generative Adversarial Imitation Learning (GAIL) for more sample-efficient imitation learning, CMA-ES to learn a policy from scratch, and stacking last-layer features from the classifier and feeding them into the agent (instead of using the classifier’s outputs for the state).
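
The prierarchy idea can be sketched as a per-sample loss. This is a minimal, hypothetical illustration (not Alex’s actual code; see his blog post for the real implementation): the standard PPO clipped surrogate, with the usual entropy bonus replaced by a KL penalty toward the behavior-cloned policy.

```python
import math

def categorical_kl(p, q):
    """KL(p || q) between two discrete action distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def prierarchy_loss(ratio, advantage, current_probs, bc_probs,
                    kl_coef=0.01, clip_eps=0.2):
    """PPO clipped surrogate where the entropy bonus is swapped for a
    KL penalty keeping the policy near the behavior-cloned one
    (coefficients here are illustrative)."""
    clipped_ratio = max(min(ratio, 1 + clip_eps), 1 - clip_eps)
    # negate because we minimize the loss but maximize the surrogate
    surrogate = -min(ratio * advantage, clipped_ratio * advantage)
    kl_penalty = kl_coef * categorical_kl(current_probs, bc_probs)
    return surrogate + kl_penalty
```

With an entropy bonus, the optimizer is pulled toward the uniform policy; with this KL term it is instead pulled toward the demonstrated behavior, so exploration stays near strategies humans already showed to work.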

If you would like to learn more, Alex wrote up a detailed blog post and has shared the code he used for the challenge. You can also find Alex on Twitter, Github, and his personal website.

2nd Place – Compscience.org

About Compscience.org

At the Computational Science Laboratory (www.compscience.org) at Universitat Pompeu Fabra, Gianni and Miha work at the interface between computing and different application areas, developing computational models with intelligent behavior. Gianni is the head of the Computational Science Laboratory at Universitat Pompeu Fabra, an ICREA research professor, and a founder of Acellera. Miha is a PhD student in Gianni’s biology group. The team felt that the Obstacle Tower Challenge was a good way to quickly learn and iterate on new ideas in a relevant 3D environment.

Details

The team’s final model was PPO with a reduced action set and a reshaped reward function. For the first floors, the team also used a KL-divergence term to induce behaviors in the agent, similar to what Alex Nichol did, but this term was dropped on higher floors. The team also used a sampling algorithm at key floors to focus the actors on floors and seeds where the agent was neither reliably good nor reliably bad, switching to more standard sampling at higher floors. The team did not have enough time to assess the exact benefit of each method, which they plan to do in the future; they plan to release the source code once they understand and generalize these aspects better. Lastly, the team tried world models (creating a very compressed representation of the observation with an autoencoder and building a policy over this space using evolutionary algorithms). It did not work, but the team learned a lot.
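
The team has not published details of their sampling algorithm, but one plausible reading of “neither good nor bad” is to weight each training seed by how borderline the agent’s success rate is there. The sketch below is entirely hypothetical; the function names and the 1e-3 floor are our own.

```python
import random

def seed_weights(success_rates):
    """Weight each seed by p * (1 - p), which peaks at a 50% success
    rate; the small floor keeps every seed sampleable (hypothetical)."""
    return {seed: rate * (1.0 - rate) + 1e-3
            for seed, rate in success_rates.items()}

def sample_seed(success_rates, rng=random):
    """Draw the next training seed, favoring borderline seeds."""
    weights = seed_weights(success_rates)
    seeds = list(weights)
    return rng.choices(seeds, weights=[weights[s] for s in seeds], k=1)[0]
```

The intuition: seeds the agent always solves teach it nothing new, and seeds it always fails give no reward signal, so training time is best spent in between.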

The team enjoyed the Obstacle Tower and believes that environments with more realistic physics will be important so that agents can, given enough samples, do amazing things. The team used 10 billion steps to train their agent. You can find out more about the team on Github and the lab’s website.

3rd Place – Songbin Choi

About Songbin

Based in Seoul, Songbin has a PhD in biomedical engineering. Like many others who are fascinated with deep learning, Songbin is self-taught. He leverages the many papers, lectures, libraries, and code that are freely available online. He has tackled several computer vision tasks and challenges in the past. Songbin was excited about the Obstacle Tower and the chance to wrestle with a reinforcement learning problem.

Details

Songbin used the PPO algorithm implemented as part of the ML-Agents Toolkit. During the challenge, his agent took actions in a sequentially coordinated fashion to achieve certain subtasks (for example, moving a box to a certain position). He used a gated recurrent unit (GRU) so the agent could make memory-backed decisions. To reduce overfitting, he added dropout layers and used left-right flipping, a common data augmentation method in imaging tasks. He also recorded human play and repeatedly added those experiences to the replay buffer while training. As a side effect of playing Obstacle Tower during the challenge, Songbin has become an expert player of the game. Although human play is brutally expensive to collect, it is of high quality and reduced the amount of simulation time needed. Songbin also tried longer sequence lengths but, contrary to his expectation, failed to achieve better performance; he is still trying to figure out why. He used all 100 tower seeds during training, with no separate validation set for evaluation, and suspected overfitting in his model even though he tried to reduce it as much as possible.
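
In a 3D control task, left-right flipping must mirror actions as well as pixels, or the augmented transitions become inconsistent. The sketch below is a hypothetical illustration, not Songbin’s code; the `mirror_map` action names are invented.

```python
def flip_transition(obs, action, mirror_map):
    """Mirror an observation (given as rows of pixel values) left-right
    and swap the corresponding left/right actions (names hypothetical)."""
    flipped_obs = [list(reversed(row)) for row in obs]
    # actions with no left/right meaning (e.g. "forward") pass through
    return flipped_obs, mirror_map.get(action, action)
```

Doubling the dataset this way is cheap, and is especially attractive here given how expensive the human demonstrations were to collect.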

Lastly, although deep learning methods in computer vision, especially for image classification, have matured in recent years, deep reinforcement learning tasks remain relatively tricky. His top-scoring agent failed to match the performance of human players (around floor 30). Having watched AlphaGo and AlphaStar beat professional players, Songbin believes there is still a lot of room for improvement on the Obstacle Tower.

Honorable mentions

Joe Booth

About Joe

Joe Booth has over 25 years in the video game industry and worked on many familiar titles and franchises such as FIFA, Need For Speed, Ghost Recon, and Rollercoaster Tycoon. He is currently the VP of development for an incubator called Orions Wave. The main focus is Orions Systems, which is a video analytics platform that uses humans and AI/CV compute in an interchangeable, distributed way to get around the limits of today’s AI.

Details

Joe used an optimized version of PPO plus demonstrations for the Obstacle Tower Challenge. He focused on compressing the input/output of the network, adding a recurrent memory, and basing the hyperparameters on the Unity environment from the Large-Scale Study of Curiosity-Driven Learning. For Round 2, he added demonstrations that passed floor 10, though his agent never did so consistently. He also worked towards using semantics; although he realized it would not pay off in time, it is the direction he wants to pursue in the long term.

Joe wrote up a separate blog post on the Obstacle Tower and released a round 1 paper on Arxiv. His round 1 code can be found here. You can find Joe on Twitter, LinkedIn, Github, and his personal website.

Doug Meng

About Doug

Doug Meng is a solution architect at NVIDIA, focusing on applied machine learning, enabling GPGPU in the cloud. Previously, he had a few years of experience in machine learning, statistics, and distributed systems, with some research experience in signal processing.

Details

Doug trained his agent using a modified version of DeepMind’s IMPALA with batched inference and a customized replay buffer. He used Obstacle Tower’s retro mode with frame stacking of 4 and a few other tricks from the OpenAI Baselines. The agent took about 12 days to train, and most of his time was spent trying to reduce training time in order to try out more algorithms. He also tried PPO and Rainbow, but his hypothesis was that fully off-policy learning hurts model performance quite a bit, whereas IMPALA is only slightly off-policy. Those agents could not consistently get past floor 7.
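
Frame stacking feeds the network the last few observations at once so a feedforward policy can perceive motion. A minimal sketch of such a wrapper (not Doug’s actual code; class and method names are our own):

```python
from collections import deque

class FrameStack:
    """Keep the last k observations as the agent's input; a minimal
    sketch of the standard technique, names hypothetical."""
    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)  # old frames fall off automatically

    def reset(self, first_obs):
        # Duplicate the first frame so the stacked shape is fixed from step 0.
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(first_obs)
        return list(self.frames)

    def step(self, obs):
        self.frames.append(obs)
        return list(self.frames)
```

With k=4, the policy sees the current frame plus the three before it, which is how velocity and direction of movement become visible to a network without recurrence.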

UEFDL

About UEFDL

UEFDL is a three-member team: Anssi Kanervisto, Janne Karttunen, and Ville Hautamaki, from the School of Computing at the University of Eastern Finland. Anssi is a second-year Ph.D. student working on using video games in reinforcement learning research. Janne is a recent MSc graduate whose thesis covered deep reinforcement learning and transfer learning from games to robotics. Ville is a senior researcher focusing on machine learning, Bayesian inference, and speech technology.

Details

The team used Advantage Actor-Critic (A2C) with a Long Short-Term Memory (LSTM) unit from the stable-baselines package: one model for floors 0-4 and another for floors 5-9. The models for floors 10+ did not learn to solve the puzzles, so they were not included. The team first trained the model for floors 0-4, then used it as a starting point for the floors 5-9 model. This way, the agent could focus on finding the key in later levels, while avoiding any risk of the floors 5-9 model forgetting how to complete earlier floors (unlikely, but just in case). They also tried a few other experiments, such as A2C with curiosity, PPO with different entropy coefficients, and replacing some of the Obstacle Tower environments with “replay environments” of human gameplay. Overall, the team was excited about this competition and other machine learning video game competitions.
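
At evaluation time, a setup like this needs a small router that hands control to the model trained for the current floor band. A hypothetical sketch (the band boundaries match the team’s description; the function and policy names are invented):

```python
def pick_model(floor, models):
    """Return the policy trained for the floor band containing `floor`.
    `models` maps inclusive (low, high) floor ranges to policies."""
    for (low, high), policy in models.items():
        if low <= floor <= high:
            return policy
    # per the write-up, no model learned floors 10+
    raise ValueError(f"no model covers floor {floor}")
```

Splitting one task across specialist policies like this sidesteps catastrophic forgetting at the cost of a hard hand-off boundary between models.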

Thank you!

Thanks so much to everyone who participated, and to our partners at Google Cloud for providing GCP credits and AICrowd for hosting the challenge. When we started the competition, we weren’t sure participants would be able to pass the ten-floor threshold, but the community has impressed us by getting as far as 19 floors into unseen versions of the tower. That being said, each instance of Obstacle Tower contains 100 floors, which means 80% of the tower is still unsolved! Furthermore, there is a greater need for control and planning in the upper floors, as enemies, more dangerous rooms, and more complicated floor layouts are introduced. We think this means there is a lot of room for new methods to be developed in the field to make additional progress. We look forward to seeing what progress is made over the next months and years as researchers continue to tackle Obstacle Tower.

If you have any questions about the challenge please email us at OTC@unity3d.com. If you’d like to work on this exciting intersection of Machine Learning and Games, we are hiring for several positions, please apply!

9 Comments

Subscribe to comments

Comments are closed.

  1. Everyone tried very hard on this contest. Why not publish at least the IDs of top 10 final contestants? It would be a good gesture. As one of the contestants who placed top 10, I would appreciate if you do so, so I can save the article as a good memory of the time I spent. Also as a student, I want to put it on my resume that I placed top 10 in an RL contest.


  4. nice

  5. I was interested in this blog post but feel it is too hard to follow for anyone who isn’t already an expert in the field because of all the acronyms/initialisms that are never explained.

    Eventually all the acronyms and jargon became too dense that I gave up on reading this entry, which is a shame because I find the topic fascinating.

    A good rule of thumb when using acronyms of jargon is to fully write it out the first time, followed by the acronym in parentheses. Then if the reader ever finds themselves unsure of the meaning of an acronym they’ve just seen 5 times in the same paragraph, they can just scroll back up to the first mention of it to get a quick reminder/explanation.

    Skimming the text as I scrolled down to the comments, I only saw one instance (GRU) where this was done.

    I may not know what a gated recurrent unit is, but it’s a lot easier to find out through my own research if I have more to go on than just GRU, searching for which would likely give me more results related to Despicable Me, Minions, or a monster that will eat me than anything about AI.

    1. Hi Deozaan – thank you for your feedback. You are absolutely correct, we should include the full spelled out acronyms for readers who may not be as deep in the space.

      1. Can you update the post to explain what PPO etc means? Would be lovely to not have to do a ton of research just to understand the blog post

        1. hi Isaac, unfortunately, I wouldn’t be able to update the entire post. however, here are a few links to some of the techniques described above:

          https://openai.com/blog/openai-baselines-ppo/
          https://cs.stanford.edu/~ermon/papers/imitation_nips2016_main.pdf
          https://arxiv.org/pdf/1802.01561