
The world of robotics is full of unknowns! From sensor noise to the exact positioning of important objects, robots have a critical need to understand the world around them to perform accurately and robustly. We previously demonstrated a pick-and-place task in Unity using the Niryo One robot to pick up a cube with a known position and orientation. This solution would not be very robust in the real world, as precise object locations are rarely known a priori. In our new Object Pose Estimation Demo, we show you how to use the Unity Computer Vision Perception Package to collect data and train a deep learning model to predict the pose of a given object. We then show you how to integrate the trained model with a virtual UR3 robotic arm in Unity to simulate the complete pick-and-place system on objects with unknown and arbitrary poses.

Robots in the real world often operate in dynamic environments and must adapt to them. Such applications often require robots to perceive relevant objects and interact with them. An important aspect of perceiving and interacting with objects is understanding their position and orientation relative to some coordinate system, also referred to as their “pose.” Early pose-estimation approaches often relied on classical computer vision techniques and custom fiducial markers. These solutions are designed to operate in specific environments, but often fail when the environment changes or diverges from what was expected. The gaps introduced by the limitations of traditional computer vision are being addressed by promising new deep learning techniques. These new methods create models that can predict the correct output for a given input by learning from many examples.

This project uses images and ground-truth pose labels to train a model to predict the object’s pose. At run time, the trained model can predict an object’s pose from an image it has never seen before. Usually, tens of thousands of images or more need to be collected and labeled for a deep learning model to perform well. Real-world collection of this data is tedious, expensive, and, in some cases like 3D object localization, inherently difficult. Even when this data can be collected and labeled, the result can turn out to be biased and error-prone. So how do you apply powerful machine learning approaches to your problem when the data you want is out of reach or doesn’t actually exist for your application yet?

Unity Computer Vision allows you to generate synthetic data as an efficient and effective solution for your machine learning data requirements. This example shows how we generated automatically labeled data in Unity to train a machine learning model. This model is then deployed in Unity on a simulated UR3 robotic arm using the Robot Operating System (ROS) to enable pick-and-place with a cube that has an unknown pose.

Generating Synthetic Data

Close Up of Randomly Generated Poses and Environment Lighting

Simulators, like Unity, are a powerful tool to address challenges in data collection by generating synthetic data. Using Unity Computer Vision, large amounts of perfectly labeled and varied data can be collected with minimal effort, as previously shown. For this project, we collect many example images of the cube in various poses and lighting conditions. This method of randomizing aspects of the scene is called domain randomization [1]. More varied data usually leads to a more robust deep learning model.
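As a rough illustration of what randomizing a scene means in practice, here is a minimal Python sketch. It is not the Unity Perception randomizer API: the function names, the position and light ranges, and the dictionary keys are all assumptions made up for this example. For each frame it draws a uniformly random orientation (using the standard construction of a unit quaternion from three uniform numbers), a random position on the table, and a random light intensity.

```python
import math
import random


def random_unit_quaternion(rng):
    """Sample a uniformly distributed rotation as (qx, qy, qz, qw)."""
    u1, u2, u3 = rng.random(), rng.random(), rng.random()
    a = math.sqrt(1.0 - u1)
    b = math.sqrt(u1)
    return (
        a * math.sin(2.0 * math.pi * u2),
        a * math.cos(2.0 * math.pi * u2),
        b * math.sin(2.0 * math.pi * u3),
        b * math.cos(2.0 * math.pi * u3),
    )


def sample_scene(rng, x_range=(-0.2, 0.2), z_range=(-0.2, 0.2),
                 table_height=0.0, light_range=(0.5, 2.0)):
    """Draw one randomized frame description.

    All ranges are hypothetical placeholders, not values from the demo.
    """
    position = (rng.uniform(*x_range), table_height, rng.uniform(*z_range))
    return {
        "position": position,
        "rotation": random_unit_quaternion(rng),
        "light_intensity": rng.uniform(*light_range),
    }


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so a run is repeatable
    for _ in range(3):
        print(sample_scene(rng))
```

Passing the random generator in explicitly keeps a dataset reproducible from a single seed, which is handy when you want to regenerate the same frames later.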

To collect data with the cube in various poses in the real world, we would have to manually move the cube and take a picture each time. Our model used over 30,000 images to train, so even at just 5 seconds per image (30,000 × 5 s = 150,000 s, or about 41.7 hours), it would take us over 40 hours to collect this data! And that time doesn’t include the labeling that needs to happen. Using Unity Computer Vision, we can generate 30,000 training images and another 3,000 validation images with corresponding labels in just minutes!

The camera, table, and robot position are fixed in this example, while the lighting and the cube’s pose vary randomly in each captured frame. The labels are saved to a corresponding JSON file where the pose is described by a 3D position (x, y, z) and a quaternion orientation (qx, qy, qz, qw). While this example only varies the cube pose and environment lighting, Unity Computer Vision allows you to easily add randomization to various aspects of the scene. To perform pose estimation, we use a supervised machine learning technique to analyze the data and generate a trained model.
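To make the label format more concrete, the sketch below writes and reads back one pose label as JSON, with a 3D position and a quaternion orientation as described above. The field names (`image`, `position`, `rotation`) and the file name are assumptions chosen for illustration; the files produced by Unity Computer Vision use their own schema, which may look quite different.

```python
import json


def make_label(image_name, position, quaternion):
    """Build one label record. The key names are hypothetical."""
    x, y, z = position
    qx, qy, qz, qw = quaternion
    return {
        "image": image_name,
        "position": {"x": x, "y": y, "z": z},
        "rotation": {"qx": qx, "qy": qy, "qz": qz, "qw": qw},
    }


def parse_label(text):
    """Return ((x, y, z), (qx, qy, qz, qw)) from a JSON label string."""
    record = json.loads(text)
    p, r = record["position"], record["rotation"]
    return (p["x"], p["y"], p["z"]), (r["qx"], r["qy"], r["qz"], r["qw"])


if __name__ == "__main__":
    label = make_label("rgb_0001.png", (0.1, 0.0, -0.2), (0.0, 0.0, 0.0, 1.0))
    text = json.dumps(label, indent=2)
    print(text)
    print(parse_label(text))
```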

Using Deep Learning to Predict Pose

Deep Learning Model Architecture for Pose Estimation

In supervised learning, a model learns to predict a specific outcome by training on a set of inputs and corresponding outputs: images and pose labels, in our case. A few years ago, a team of researchers presented [2] a convolutional neural network (CNN) that could predict the position of an object. Since we are interested in the full 3D pose of our cube, we extended this work to include the cube’s orientation in the network’s output. To train the model, we minimize the squared error (the L2 distance) between the predicted pose and the ground-truth pose. After training, the model predicted the cube’s location to within 1 cm and its orientation to within 2.8 degrees (about 0.05 radians). Now let’s see if this is accurate enough for our robot to successfully perform the pick-and-place task!
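The sketch below shows one plausible way to compute the quantities mentioned above in plain Python: a squared-error loss over the pose components, a position error in meters, and an orientation error as the angle between two quaternions. This is an illustration under our own assumptions (a 7-component pose ordered x, y, z, qx, qy, qz, qw, and hypothetical function names), not the training code used in the demo.

```python
import math


def l2_pose_loss(pred, target):
    """Sum of squared differences over all pose components."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def translation_error(pred_pos, true_pos):
    """Euclidean distance between predicted and true positions."""
    return math.dist(pred_pos, true_pos)


def orientation_error(pred_q, true_q):
    """Angle in radians between two rotations given as quaternions.

    q and -q describe the same rotation, hence the absolute value.
    Inputs are normalized here so slightly non-unit predictions are fine.
    """
    norm_p = math.sqrt(sum(c * c for c in pred_q))
    norm_t = math.sqrt(sum(c * c for c in true_q))
    dot = abs(sum(p * t for p, t in zip(pred_q, true_q))) / (norm_p * norm_t)
    return 2.0 * math.acos(min(1.0, dot))


if __name__ == "__main__":
    identity = (0.0, 0.0, 0.0, 1.0)
    # A rotation of 0.05 rad about the y axis.
    small = (0.0, math.sin(0.025), 0.0, math.cos(0.025))
    print(translation_error((0.0, 0.0, 0.0), (0.01, 0.0, 0.0)))
    print(math.degrees(orientation_error(small, identity)))
```

Reporting position and orientation errors separately, as the results above do, is easier to interpret than a single loss value, because meters and radians are not directly comparable.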

Motion Planning in ROS

Pose Estimation Workflow

The robot we are using in this project is a UR3 robotic arm with a Robotiq 2F-140 gripper, which was brought into our Unity scene using the Unity Robotics URDF Importer package. To handle communication, the Unity Robotics ROS-TCP Connector package is used while the ROS MoveIt package handles motion planning and control.

Now that we can accurately predict the pose of the cube with our deep learning model, we can use this predicted pose as the target pose in our pick-and-place task. Recall that in our previous Pick-and-Place Demo, we relied on the ground-truth pose of the target object. The difference here is that the robot performs the pick-and-place task with no prior knowledge of the cube’s pose and only gets a predicted pose from the deep learning model. The process has 4 steps:

  1. An image with the target cube is captured by Unity
  2. The image is passed to a trained deep learning model, which outputs a predicted pose
  3. The predicted pose is sent to the MoveIt motion planner
  4. ROS returns a trajectory to Unity for the robot to execute in an attempt to pick up the cube
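The four steps above can be sketched as a single function. Everything here is a hypothetical stand-in: `capture_image`, `predict_pose`, `plan_trajectory`, and `execute` are placeholder callables that you would replace with your own camera capture, model inference, motion planning request, and trajectory execution. The assumption that a failed plan is signaled by `None` is ours as well, not the behavior of any particular planner.

```python
def run_pick_and_place(capture_image, predict_pose, plan_trajectory, execute):
    """Run one pick-and-place attempt; return True if a plan was executed.

    All four arguments are caller-supplied callables (hypothetical names).
    """
    image = capture_image()             # step 1: capture an image of the cube
    pose = predict_pose(image)          # step 2: model outputs a predicted pose
    trajectory = plan_trajectory(pose)  # step 3: send the pose to the planner
    if trajectory is None:              # assumed convention: None means no plan
        return False
    execute(trajectory)                 # step 4: the robot follows the trajectory
    return True


if __name__ == "__main__":
    executed = []
    ok = run_pick_and_place(
        capture_image=lambda: "fake-image",
        predict_pose=lambda image: (0.1, 0.0, -0.2, 0.0, 0.0, 0.0, 1.0),
        plan_trajectory=lambda pose: ["waypoint-1", "waypoint-2"],
        execute=executed.append,
    )
    print(ok, executed)
```

Passing the stages in as functions keeps the control flow testable without a simulator or a ROS installation: each stage can be replaced with a stub, as in the usage example.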

Each iteration of the task sees the cube moved to a random location. Although we know the cube’s pose in simulation, we will not have the benefit of this information in the real world. Thus, to lay the groundwork for transferring this project to a real robot, we need to determine the cube’s pose from sensory data alone. Our pose estimation model makes this possible and, in our simulation testing, we can reliably pick up the cube 89% of the time in Unity!


Pick-and-Place using Pose Estimation in Unity

Our Object Pose Estimation Demo shows how Unity gives you the capability to generate synthetic data, train a deep learning model, and use ROS to control a simulated robot to solve a problem. We used the Unity Computer Vision tools to create synthetic, labeled training data and trained a simple deep learning model to predict a cube’s pose. The demo provides a tutorial walking you through how to recreate this project, which you can expand by applying more randomizers to create more complex scenes. We used the Unity Robotics tools to communicate with a ROS inference node that uses the trained model to predict a cube’s pose. These tools and others open the door for you to explore, test, develop, and deploy solutions locally. When you are ready to scale your solution, Unity Simulation saves both time and money compared to local systems.

And did you know that both Unity Computer Vision and Unity Robotics tools are free to use!? Head over to the Object Pose Estimation Demo to get started using them today!

Keep Creating

Now that we can pick up objects with an unknown pose, imagine how else you could expand this! What if there are obstacles in the way? Or multiple objects in the scene? Think about how you might handle this, and keep an eye out for our next post!

Can’t wait until our next post!? Sign up to get email updates about our work in robotics or computer vision.

You can also find more robotics projects on our Unity Robotics GitHub.

For more computer vision projects, visit our Unity Computer Vision page.

Our team would love to hear from you if you have any questions, feedback, or suggestions! Please reach out to


  1. J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, “Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World,” arXiv:1703.06907, 2017.
  2. J. Tobin, W. Zaremba, and P. Abbeel, “Domain Randomization and Generative Models for Robotic Grasping,” arXiv:1710.06425, 2017.

4 replies on “Teaching robots to see with Unity”

This is a really good innovation. The world is moving toward artificial intelligence, and Unity is too.

Thanks unity for this addition.

This is great information. I’m curious to learn more.

I have a few questions:

1. Will this work with objects of multiple sizes and shapes, as long as the robot can physically grab it (so it can’t be too large, too small, or too smooth)?

2. What happens if the cube, or the end location where the cube is to be placed, is out of reach for the robot? (Does an error appear, or does the robot try to reach it and then stop moving?)

3. Would I be able to control a real robot:

3.1 If the real robot is placed in the same position as the virtual robot, can the real robot perform the same pick-and-place movements that the virtual one is doing (but without any real cube for the real robot to grab)? I just want to see the real robot perform its movement and grasping motions while the virtual one performs its actions. Maybe even generate a rosbag from the virtual object’s movements, which can then be played on the real robot?

3.2 If the camera input of the environment (to track the markers on the cube and the end placement position) is coming from a real-world camera (i.e. a webcam on the computer running Unity), which would then feed its information into Unity, which would then send the pose information of the robot joints to the real robot?

4. Out of curiosity, could I use Unity Computer Vision for AR? I would use it to track specific markers (or even specific 3D objects) in the real environment and, having that object’s pose, place a virtual object on top of it.

Thank you.

Hi Ivan, to answer your questions:

1) This tutorial is about training a computer vision neural network to “see” where a fixed-size cube is. Training it to analyse arbitrarily shaped objects, in order to figure out where to grab them, would be a very different task.

2) Most likely, the computer vision system would see where the cube is, but when that information is fed to the trajectory generator, it would be unable to find a valid way to reach that position, and would report an error.

3.1) Yes, if you have a real robot of the same design as the virtual one, you can replay the same commands on both robots and they should behave the same. We have a tutorial specifically about doing this:

3.2) Training a neural network using simulated data and then applying what it’s learned to the real world is an area of active research… I don’t believe we’ve tested this specific setup this way yet, but it’s certainly on the roadmap.

4) This question is hard to answer, as it’s way outside the scope of the tutorial! To be clear, in this tutorial you’ll use the Unity Perception package to generate annotated training images, and then use those to train a PyTorch computer vision model. In principle, yes, you could probably make an AR system with a workflow similar to this.

Thank you for all of that information Laurie.

Great to see that you guys have already looked into replicating the simulated robot’s movement on a real robot. This is becoming even more interesting.

To hear that testing the neural network in a real-world environment is on the roadmap is great news. All of this work seems really promising and I can’t wait to see how it further develops.
