
We just released the new version of the ML-Agents toolkit (v0.4), and one of the new features we are excited to share with everyone is the ability to train agents with an additional curiosity-based intrinsic reward.

Since there is a lot to unpack in this feature, I wanted to write an additional blog post on it. In essence, there is now an easy way to encourage agents to explore the environment more effectively when the rewards are infrequent and sparsely distributed. These agents can do this using a reward they give themselves based on how surprised they are about the outcome of their actions. In this post, I will explain how this new system works, and then show how we can use it to help our agent solve a task that would otherwise be much more difficult for a vanilla Reinforcement Learning (RL) algorithm to solve.

Curiosity-driven exploration

When it comes to Reinforcement Learning, the primary learning signal comes in the form of the reward: a scalar value provided to the agent after every decision it makes. This reward is typically provided by the environment itself and specified by the creator of the environment. These rewards often correspond to things like +1.0 for reaching the goal, -1.0 for dying, etc. We can think of this kind of reward as being extrinsic because it comes from outside the agent. If there are extrinsic rewards, then that means there must be intrinsic ones too. Rather than being provided by the environment, intrinsic rewards are generated by the agent itself based on some criteria. Of course, not just any intrinsic reward would do. We want intrinsic rewards which ultimately serve some purpose, such as changing the agent’s behavior so that it will get even greater extrinsic rewards in the future, or ensuring that the agent explores the world more than it might have otherwise. In humans and other mammals, the pursuit of these intrinsic rewards is often referred to as intrinsic motivation and is tied closely to our feelings of agency.
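To make the distinction concrete, here is a minimal sketch of how an extrinsic and an intrinsic signal might be combined into the single reward the agent optimizes. The function name and the weighting coefficient are illustrative assumptions of this sketch, not parameters of the toolkit itself.

```python
# A minimal sketch (not the toolkit's implementation): the learning signal is
# the environment's extrinsic reward plus a scaled, self-generated intrinsic
# reward. The 0.01 weighting is an arbitrary placeholder.
def combined_reward(extrinsic_reward: float, intrinsic_reward: float,
                    intrinsic_strength: float = 0.01) -> float:
    # The strength term keeps the self-generated signal from drowning out the
    # environment's own reward once the agent starts finding it.
    return extrinsic_reward + intrinsic_strength * intrinsic_reward
```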

Researchers in the field of Reinforcement Learning have put a lot of thought into developing good systems for providing intrinsic rewards to agents which endow them with motivation similar to what we find in nature’s agents. One popular approach is to endow the agent with a sense of curiosity and to reward it based on how surprised it is by the world around it. If you think about how a young baby learns about the world, it isn’t pursuing any specific goal, but rather playing and exploring for the novelty of the experience. You can say that the child is curious. The idea behind curiosity-driven exploration is to instill this kind of motivation into our agents. If the agent is rewarded for reaching states which are surprising to it, then it will learn strategies to explore the environment to find more and more surprising states. Along the way, the agent will hopefully discover the extrinsic reward as well, such as a distant goal position in a maze, or sparse resources on a landscape.

We chose to implement one specific such approach from a recent paper released last year by Deepak Pathak and his colleagues at Berkeley. It is called Curiosity-driven Exploration by Self-supervised Prediction, and you can read the paper here if you are interested in the full details. In the paper, the authors formulate the idea of curiosity in a clever and generalizable way. They propose to train two separate neural networks: a forward and an inverse model. The inverse model is trained to take the current and next observation received by the agent, encode them both using a single encoder, and use the result to predict the action that was taken between the occurrence of the two observations. The forward model is then trained to take the encoded current observation and action and predict the encoded next observation. The difference between the predicted and real encodings is then used as the intrinsic reward, and fed to the agent. A bigger difference means a bigger surprise, which in turn means a bigger intrinsic reward.

By using these two models together, the reward not only captures surprising things, but specifically captures surprising things that the agent has control over, based on its actions. Their approach allows an agent trained without any extrinsic rewards to make progress in Super Mario Bros simply based on its intrinsic reward. See below for a diagram from the paper outlining the process.

Diagram showing Intrinsic Curiosity Module. White boxes correspond to input. Blue boxes correspond to neural network layers and outputs. Filled blue lines correspond to flow of activation in the network. Green dotted lines correspond to comparisons used for loss calculation. Green box corresponds to intrinsic reward calculation.
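To make the computation concrete, here is a minimal sketch of the intrinsic reward calculation written in PyTorch. It is an illustration of the idea rather than the toolkit’s implementation (which lives in its TensorFlow training code), and the layer sizes and single-layer models are placeholder simplifications.

```python
# Illustrative sketch of an Intrinsic Curiosity Module: a shared encoder, an
# inverse model that predicts the action, and a forward model whose prediction
# error becomes the intrinsic reward. All sizes are made-up placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_size, enc_size, num_actions = 64, 32, 4  # placeholder dimensions

encoder = nn.Sequential(nn.Linear(obs_size, enc_size), nn.ReLU())  # shared encoder
inverse_model = nn.Linear(2 * enc_size, num_actions)               # predicts the action taken
forward_model = nn.Linear(enc_size + num_actions, enc_size)        # predicts the next encoding

def curiosity(obs, next_obs, action_onehot):
    """Return the intrinsic reward along with the two model losses."""
    phi, next_phi = encoder(obs), encoder(next_obs)
    # Inverse model: which action led from phi to next_phi?
    action_logits = inverse_model(torch.cat([phi, next_phi], dim=-1))
    inverse_loss = F.cross_entropy(action_logits, action_onehot.argmax(dim=-1))
    # Forward model: given phi and the action, what will next_phi be?
    pred_next_phi = forward_model(torch.cat([phi, action_onehot], dim=-1))
    forward_loss = F.mse_loss(pred_next_phi, next_phi.detach())
    # The prediction error is the "surprise", and becomes the intrinsic reward.
    intrinsic_reward = 0.5 * (pred_next_phi - next_phi).pow(2).sum(dim=-1)
    return intrinsic_reward.detach(), forward_loss, inverse_loss
```

Training both losses alongside the policy is what keeps the encoding focused on the parts of the observation the agent can actually influence, which is why the reward captures controllable surprise rather than random noise.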

Pyramids environment

In order to test out curiosity, no ordinary environment will do. Most of the example environments we’ve released through v0.3 of ML-Agents toolkit contain rewards which are relatively dense and would not benefit much from curiosity or other exploration enhancement methods. So to put our agent’s newfound curiosity to the test, we created a new sparse rewarding environment called Pyramids. In it, there is only a single reward, and random exploration will rarely allow the agent to encounter it. In this environment, our agent takes the form of the familiar blue cube from some of our previous environments. The agent can move forward or backward and turn left or right, and it has access to a view of the surrounding world via a series of ray-casts from the front of the cube.

An agent observing the surroundings using ray-casts (Visualized here in black for illustrative purposes)

This agent is dropped into an enclosed space containing nine rooms. One of these rooms contains a randomly positioned switch, while the others contain randomly placed immovable stone pyramids. When the agent interacts with the switch by colliding with it, the switch turns from red to green. Along with this change of color, a pyramid of movable sand bricks is spawned randomly in one of the many rooms of the environment. On top of this pyramid is a single golden brick. When the agent collides with this brick, it receives a +2 extrinsic reward. The trick is that there are no intermediate rewards for moving to new rooms, flipping the switch, or knocking over the tower. The agent has to learn to perform this sequence without any intermediate help.

Agent trained with PPO+Curiosity moving to pyramid after interacting with the switch.

Agents trained using vanilla Proximal Policy Optimization (PPO, our default RL algorithm in ML-Agents) on this task do poorly, often failing to do better than chance (average -1 reward), even after 200,000 steps.

In contrast, agents trained with PPO and the curiosity-driven intrinsic reward consistently solve it within 200,000 steps, and often even in half that time.

 

Cumulative extrinsic reward over time for PPO+Curiosity agent (blue) and PPO agent (red). Averaged over five runs each.

We also looked at agents trained with the intrinsic reward signal only. While they don’t learn to solve the task, they learn a qualitatively more interesting policy that enables them to move between multiple rooms, compared to the extrinsic-only policy, which has the agent moving in small circles within a single room.

Using Curiosity with PPO

If you’d like to use curiosity to help train agents in your environments, enabling it is easy. First, grab the latest ML-Agents toolkit release, then add the following line to the hyperparameter file of the brain you are interested in training: use_curiosity: true. From there, you can start the training process as usual. If you use TensorBoard, you will notice that there are now a few new metrics being tracked. These include the forward and inverse model loss, along with the cumulative intrinsic reward per episode.
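For reference, here is a small sketch that flips that switch programmatically, assuming the hyperparameters live in the trainer_config.yaml file that ships with the toolkit and using a placeholder brain name; editing the file by hand works just as well.

```python
# A minimal sketch (assumed file name and brain name): enable curiosity for one
# brain by adding use_curiosity: true to its section of the hyperparameter file.
import yaml  # pip install pyyaml

CONFIG_PATH = "trainer_config.yaml"   # assumed location of the hyperparameter file
BRAIN_NAME = "PyramidsBrain"          # placeholder: use the name of your own brain

with open(CONFIG_PATH) as f:
    config = yaml.safe_load(f) or {}

# Create the brain's section if it doesn't exist yet, then turn curiosity on.
config.setdefault(BRAIN_NAME, {})["use_curiosity"] = True

with open(CONFIG_PATH, "w") as f:
    yaml.safe_dump(config, f, default_flow_style=False)

print(f"Enabled use_curiosity for {BRAIN_NAME} in {CONFIG_PATH}")
```

Note that re-serializing the file this way will drop any comments it contains, so a one-line manual edit is usually the simpler route.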

Giving your agent curiosity won’t help in all situations. In particular, if your environment already contains a dense reward function, as in our Crawler and Walker environments, where a non-zero reward is received after most actions, you may not see much improvement. If your environment contains only sparse rewards, however, then adding intrinsic rewards has the potential to turn these tasks from unsolvable to easily solvable using Reinforcement Learning. This applies particularly to tasks where simple rewards such as win/lose or completed/failed make the most sense.

If you do use the Curiosity feature, I’d love to hear about your experience. Feel free to reach out to us on our GitHub issues page, or email us directly at ml-agents@unity3d.com.  Happy training!

17 Comments


  1. You left out the most interesting part: how do you calculate the “surprise” level? Is it done automatically by the toolkit? If not, what parameters are taken into account?

    1. Hi Manu,

      Here is the paragraph in the post where this is mentioned:

      We chose to implement one specific such approach from a recent paper released last year by Deepak Pathak and his colleagues at Berkeley. It is called Curiosity-driven Exploration by Self-supervised Prediction, and you can read the paper here if you are interested in the full details. In the paper, the authors formulate the idea of curiosity in a clever and generalizable way. They propose to train two separate neural networks: a forward and an inverse model. The inverse model is trained to take the current and next observation received by the agent, encode them both using a single encoder, and use the result to predict the action that was taken between the occurrence of the two observations. The forward model is then trained to take the encoded current observation and action and predict the encoded next observation. The difference between the predicted and real encodings is then used as the intrinsic reward, and fed to the agent. A bigger difference means a bigger surprise, which in turn means a bigger intrinsic reward.

      1. Yes, I had read that. My question is how the current observation is calculated so that we can compute a difference between two of them.
        My guess is that we should hand-code the parameters needed to calculate it depending on our environment (e.g. the agent’s x, y, z location or the room number the agent is in), but my guesses in AI aren’t very good… :-p

  2. Great stuff! I would like to port the Udacity self-driving car simulation (imitation learning with the NVIDIA CNN) to your framework. Concerning imitation learning, what sort of neural network are you using? Regarding the PPO Python script, it seems you use only fully-connected networks, am I right?

    1. Hi Marc,

      The training code we provide adapts itself to the observation space of the brains you set up in your scene. If you have a brain with visual (camera) observations, then we will use a CNN. If it has only vector observations, then fully-connected layers. If it has both, then we will process each in its own “stream” and then combine them.

  3. How should we treat the self-motivated intrinsic reward? In my opinion, can it be treated as multi-task learning? Because a different reward defines a different task. Here, the extrinsic reward is for getting blocks and the intrinsic reward is for navigating to different rooms. From this point of view, it has some flavor of meta-learning or option discovery in hierarchical RL. Meanwhile, from the perspective of reward shaping, it’s kind of related to inverse RL. It’s true the method is good and working, but I think we lack a systematic way of thinking about how to solve the task. We don’t even know what to attribute these problems to, because of the tremendous number of coined terms. I do hate this.

    1. Hi Alvin,

      Thanks for your discussion. There are a number of ways to think about it. For me at least, I think of it as driving the policy of the agent through “policy space” along a vector that is the combination of the extrinsic and intrinsic reward signals. The extrinsic reward pushes the policy in a direction that maximizes cumulative extrinsic reward, while the intrinsic signal pushes the policy toward a space where the results of the agent’s actions are more difficult to predict.

  4. Can’t wait to see a model that modulates between exploitation (extrinsic reward) and exploration based on the evolving environment (when the extrinsic reward decreases or the agent is in a kind of satiated state (boredom)). Also, I would like to see experiments that minimize the effort taken toward the reward (basically rewarding laziness/efficiency, i.e. the same outcome with fewer actions); a lot of NN AIs have noisy movements or actions.

    1. Hi Kharil,

      Thanks for the comments. These are interesting ideas! One thing our RL-trained agents actually attempt to optimize is “entropy-regularized reward,” which doesn’t quite equate to a laziness signal, but is something similar. It encourages the agent to learn a policy that maximizes reward while also being least committed to any single action. It turns out this formulation helps a lot for learning, since the agents can more quickly adapt to new situations and give up bad behaviors for newer, better ones.

  5. Cool to see Pathak et al.’s intrinsic curiosity work being applied to new tasks. I was curious to see how it could be used in other environments that weren’t in the paper. Nice work!

  6. Well, this one looks very promising. I will definitely try this feature next weekend (when I have enough time for hobby projects).
    Also, I am wondering how we can set up an environment where multiple agents (maybe even with different brains) communicate about their current states and next actions.
    For example:
    Agent A : “Hey I am going to jump over that wall.”
    Agent B : “Cool, Agent A go ahead and jump.”
    Agent C : “Nah, I have been there, there is no reward behind that wall, Don’t waste your time.”
    Agent A : “Oh, okay then, I will go somewhere else.”

    1. Something similar seems to have been used by OpenAI recently: https://blog.openai.com/openai-five/
      In the video they talk about a “team_spirit” hyperparameter that was used so that the agents do not act in a selfish manner.

      1. Thank you Mad, I am aware of OpenAI Five, but somehow I missed the team_spirit hyperparameter part. I will pay close attention to that. Thank you again.

    2. Hi Emre,

      Multi-agent communication is definitely something we are looking at. In fact, there have already been researchers who have done things similar to your ideas in their work. I think there are a lot of potentially cool applications these kinds of agents can have in games.

      1. I can’t wait to see what you are going to create for multi-agent communication. I tried to create agents that are aware of what other agents are observing, but that wasn’t enough.
        Do you have any guess as to when we can get this type of integration?

  7. I need to understand the programming, can you help me?

  8. I need help to start programming.