A Four-Phase Strategy for Robotic Vision Processing, Part Two

Thursday 6/13/19 09:00am
Posted By Dev Singh

Snapdragon and Qualcomm branded products are products of
Qualcomm Technologies, Inc. and/or its subsidiaries.

In Part 1 of this two-part series on machine vision for robotics, we discussed how advances in technology are allowing today's robots to see, analyze, and make decisions more like humans. This involves developing logic that can determine the orientation of objects, deal with moving objects, and perform navigation.

To accomplish this, we introduced the four-phase strategy listed below and discussed the first two phases, preprocessing and feature detection.

  1. Preprocessing: Data is collected from the real world (e.g., from sensors and cameras) and converted into a more usable state.
  2. Feature Detection: Features such as corners, edges, etc. are extracted from the preprocessed data.
  3. Object Detection and Classification: Objects are detected from the features and may be classified according to known feature maps.
  4. Object Tracking and Navigation: Objects that have been identified are tracked across time. This can include both objects and changing viewpoints of the environment as the robot navigates.

In this post, we'll look at how the third and fourth phases (object detection and classification, and object tracking and navigation) provide the foundation for higher-level robotic vision functionality.

Detecting Objects and Orientations

With features detected, the next step is to detect and classify objects from them. This has traditionally been a difficult task because of challenges such as variations in viewpoint, differently sized images, and varying illumination conditions. Fortunately, a neural network that has been trained to detect and classify objects under such variations can help. These networks are often trained on datasets containing large quantities of labeled images, such as those illustrated here:

A dataset of images of 16 various types of dogs for machine learning

One popular approach is to employ a convolutional neural network (CNN), where small regions of the image are fed into the network one at a time in a process known as “sliding windows.” Developers can implement this using the Qualcomm® Neural Processing Engine SDK, which supports a wide variety of neural network layer types, including convolution layers. The Qualcomm Neural Processing Engine SDK also supports models exported from Caffe2 and TensorFlow, as well as the ONNX format, so developers can take advantage of those frameworks in their object detection code.
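To make the sliding-window idea concrete, here is a minimal sketch in plain Python. It is not the Qualcomm Neural Processing Engine SDK API: the `mean_brightness` function is a hypothetical stand-in for a trained CNN, and the window size and stride are arbitrary values chosen for illustration.

```python
from typing import Iterator, List, Tuple

Image = List[List[float]]  # grayscale image stored as rows of pixel values


def sliding_windows(
    image: Image, window: int, stride: int
) -> Iterator[Tuple[int, int, Image]]:
    """Yield (top, left, patch) for every window x window patch in the image."""
    height, width = len(image), len(image[0])
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            patch = [row[left:left + window] for row in image[top:top + window]]
            yield top, left, patch


def mean_brightness(patch: Image) -> float:
    """Hypothetical stand-in for a CNN: scores a patch by its average value."""
    total = sum(sum(row) for row in patch)
    return total / (len(patch) * len(patch[0]))


def best_window(image: Image, window: int, stride: int) -> Tuple[float, int, int]:
    """Return (score, top, left) for the highest-scoring window."""
    return max(
        (mean_brightness(patch), top, left)
        for top, left, patch in sliding_windows(image, window, stride)
    )


# A 6x6 dark image with a bright 2x2 "object" at rows 3-4, columns 2-3.
image = [[0.0] * 6 for _ in range(6)]
for r in (3, 4):
    for c in (2, 3):
        image[r][c] = 1.0

score, top, left = best_window(image, window=2, stride=1)
print(score, top, left)
```

In a real pipeline the scoring function would be a network inference call, and the stride and window sizes would be tuned to trade accuracy against the number of inferences per frame.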

Another task is to determine the orientation of objects, which is important for both object interaction and navigation. The main challenge here is determining the orientation of an object and/or the robot itself in 3D world-space. A popular approach is to estimate a homography that maps corresponding points between frames of 2D imagery, using algorithms such as a linear least squares solver, random sample consensus (RANSAC), or least median of squares. The Qualcomm® Computer Vision library provides developers with hardware-accelerated homography and pose evaluation APIs for this purpose.
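As a rough illustration of how RANSAC can estimate a homography from point matches that include some bad correspondences, here is a small sketch in plain Python. It does not use the Qualcomm Computer Vision library; every function name, the synthetic points, the iteration count, and the one-pixel inlier threshold are assumptions made for this example. Each candidate model comes from a four-point linear solve with the last homography entry fixed to 1.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]


def solve_linear(a: List[List[float]], b: List[float]) -> Optional[List[float]]:
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-9:
            return None  # degenerate sample, e.g. three collinear points
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_homography(src: Sequence[Point], dst: Sequence[Point]) -> Optional[List[float]]:
    """Fit h0..h7 (h8 fixed to 1) from exactly four point correspondences."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        a.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        b.append(u)
        a.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        b.append(v)
    return solve_linear(a, b)


def project(h: Sequence[float], point: Point) -> Point:
    """Map a point through the homography, including the perspective divide."""
    x, y = point
    w = h[6] * x + h[7] * y + 1.0
    if abs(w) < 1e-12:
        return (float("inf"), float("inf"))
    return ((h[0] * x + h[1] * y + h[2]) / w, (h[3] * x + h[4] * y + h[5]) / w)


def ransac_homography(src, dst, iterations=200, threshold=1.0, seed=0):
    """Return (homography, inlier indices) for the best-supported model."""
    rng = random.Random(seed)
    best_h, best_inliers = None, []
    for _ in range(iterations):
        sample = rng.sample(range(len(src)), 4)
        h = fit_homography([src[i] for i in sample], [dst[i] for i in sample])
        if h is None:
            continue
        inliers = []
        for i, (p, q) in enumerate(zip(src, dst)):
            u, v = project(h, p)
            if math.hypot(u - q[0], v - q[1]) < threshold:
                inliers.append(i)
        if len(inliers) > len(best_inliers):
            best_h, best_inliers = h, inliers
    return best_h, best_inliers


# Synthetic matches generated from a known homography (illustrative values).
true_h = [1.0, 0.2, 5.0, -0.1, 1.1, -3.0, 0.001, 0.002]
src = [(0, 0), (10, 1), (21, 3), (30, 0), (2, 11), (12, 9),
       (19, 12), (31, 10), (1, 20), (9, 22), (20, 19), (29, 21)]
dst = [project(true_h, p) for p in src]
for i in (2, 7, 11):  # corrupt three matches so they act as outliers
    dst[i] = (dst[i][0] + 40.0, dst[i][1] - 35.0)

h, inliers = ransac_homography(src, dst)
print(inliers)
```

A production implementation would typically normalize the point coordinates and refit the final model on all inliers with a least squares solver; both steps are left out here to keep the sketch short.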

Once objects have been detected, they can be assigned metadata such as an ID and a bounding box, which can be used during object tracking and navigation.

Camera detecting two objects as a woman and small child entering a home

Object Tracking and Navigation

With objects and aspects of the surrounding environment identified, a robot then needs to track them. Since objects can move around, and the robot's viewpoint will change as it navigates, developers will need a mechanism to track these elements over time and across frames captured by the camera(s) and other sensors. Because this mechanism must be fast enough to run on every frame, numerous algorithms have been devised over the years that approach the problem in different ways.

For example, centroid tracking computes the center point of the bounding box around each identified object in every frame, then measures the distance between those center points from one frame to the next, under the assumption that an object will only move a certain distance between frames. Another approach is to use a Kalman filter, which uses statistics gathered over time to predict the location of an object.
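The centroid tracking idea described above can be sketched in a few lines of plain Python. The `CentroidTracker` class, its `max_distance` parameter, and the box coordinates are all hypothetical names and values chosen for this example. The sketch matches each new bounding box to the nearest existing track greedily, and it drops a track as soon as it goes unmatched, which is simpler than what a production tracker would do.

```python
import math
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Point = Tuple[float, float]


def centroid(box: Box) -> Point:
    """Center point of a bounding box."""
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)


def distance(a: Point, b: Point) -> float:
    return math.hypot(a[0] - b[0], a[1] - b[1])


class CentroidTracker:
    """Keeps object IDs stable across frames by matching nearby centroids."""

    def __init__(self, max_distance: float) -> None:
        self.max_distance = max_distance  # farthest an object may move per frame
        self.next_id = 0
        self.tracks: Dict[int, Point] = {}

    def update(self, boxes: List[Box]) -> Dict[int, Box]:
        """Assign an ID to each box in the current frame."""
        unmatched = dict(self.tracks)
        new_tracks: Dict[int, Point] = {}
        assigned: Dict[int, Box] = {}
        for box in boxes:
            center = centroid(box)
            nearest = min(
                unmatched, key=lambda i: distance(unmatched[i], center), default=None
            )
            if nearest is not None and distance(unmatched[nearest], center) <= self.max_distance:
                track_id = nearest
                del unmatched[nearest]  # each track is matched at most once
            else:
                track_id = self.next_id  # too far from every track: new object
                self.next_id += 1
            new_tracks[track_id] = center
            assigned[track_id] = box
        self.tracks = new_tracks  # unmatched tracks are dropped in this sketch
        return assigned


tracker = CentroidTracker(max_distance=20.0)
frame1 = tracker.update([(0, 0, 10, 10), (100, 100, 110, 110)])
# The same two objects move slightly and arrive in a different order.
frame2 = tracker.update([(102, 101, 112, 111), (3, 2, 13, 12)])
# A third object appears far from the existing tracks.
frame3 = tracker.update([(4, 3, 14, 13), (300, 300, 310, 310), (103, 103, 113, 113)])
print(sorted(frame1), sorted(frame2), sorted(frame3))
```

The returned dictionary pairs each ID with its bounding box, which is the kind of per-object metadata that later stages can consume.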

The Qualcomm Computer Vision library includes a hardware-accelerated implementation of the mean shift algorithm. This approach finds the mean of some aspect of an image (e.g., its color histogram) within a subregion of a frame. It then looks for the same description within the next frame by seeking to maximize the similarity of features. This allows it to account for changes such as scale and orientation, and ultimately to track where the object is. This is illustrated in the image below, where the algorithm is able to track the player's hand and pinpoint its new location:

Tracking the motion of a woman tennis player serving via seven stop-motion images
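For readers who want to see the core loop of mean shift, here is a minimal sketch in plain Python. It is not the Qualcomm Computer Vision library implementation. It assumes each pixel has already been given a weight that says how well it matches the tracked object (for example, from comparing against a color histogram), and the `mean_shift` function, grid size, and window values are hypothetical choices for this example.

```python
from typing import List, Tuple

Window = Tuple[int, int, int, int]  # (left, top, width, height)


def mean_shift(weights: List[List[float]], window: Window,
               max_iterations: int = 20) -> Window:
    """Move the window until it is centered on the local mass of the weights."""
    rows, cols = len(weights), len(weights[0])
    left, top, width, height = window
    for _ in range(max_iterations):
        total = sum_x = sum_y = 0.0
        # Accumulate the weighted mean position inside the current window.
        for y in range(max(top, 0), min(top + height, rows)):
            for x in range(max(left, 0), min(left + width, cols)):
                w = weights[y][x]
                total += w
                sum_x += w * x
                sum_y += w * y
        if total == 0.0:
            break  # nothing resembling the object inside the window
        # Re-center the window on the weighted mean.
        new_left = round(sum_x / total - (width - 1) / 2.0)
        new_top = round(sum_y / total - (height - 1) / 2.0)
        if (new_left, new_top) == (left, top):
            break  # converged
        left, top = new_left, new_top
    return left, top, width, height


# A 20x20 weight map with a 4x4 region (rows 10-13, columns 12-15) that
# matches the object; the window starts where the object was last seen.
weights = [[0.0] * 20 for _ in range(20)]
for y in range(10, 14):
    for x in range(12, 16):
        weights[y][x] = 1.0

tracked = mean_shift(weights, window=(8, 7, 6, 6))
print(tracked)
```

Note that the search only works if the starting window overlaps the object's new position, which is why the assumption of limited motion between frames matters here too.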

Since these techniques only need to track a subset of the original features, they can generally deal with changes such as orientation and occlusion efficiently and with good success, which makes them effective for robotic vision processing.

But objects aren't the only things that need to be tracked. The robot itself should be able to successfully navigate its environment, and this is where Simultaneous Localization and Mapping (SLAM) comes in. SLAM seeks to estimate a robot's location while building a map of the environment. It can be implemented using a number of algorithms, such as Kalman filters. SLAM is often implemented by fusing data from multiple sensors, and when it combines visual data with inertial measurements, the process is often referred to as Visual-Inertial Simultaneous Localization and Mapping (VISLAM).
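Kalman filters come up both for object tracking and for SLAM, so it may help to see the predict/update cycle in its simplest form. The sketch below is a one-dimensional constant-velocity filter in plain Python, not a SLAM implementation and not code from any Qualcomm SDK. The class name, the noise values, and the measurement sequence are all assumptions made for this example.

```python
class ConstantVelocityKalman:
    """Tracks position and velocity along one axis from position measurements."""

    def __init__(self, position: float, velocity: float = 0.0,
                 variance: float = 1.0, process_noise: float = 1e-3,
                 measurement_noise: float = 0.25) -> None:
        self.p = position
        self.v = velocity
        # 2x2 covariance of the (position, velocity) estimate.
        self.P = [[variance, 0.0], [0.0, variance]]
        self.q = process_noise       # added to the diagonal at each prediction
        self.r = measurement_noise   # variance of each position measurement

    def predict(self, dt: float = 1.0) -> float:
        """Advance the state by dt: P <- F P F^T + Q with F = [[1, dt], [0, 1]]."""
        self.p += self.v * dt
        (a, b), (c, d) = self.P
        self.P = [[a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
                  [c + dt * d, d + self.q]]
        return self.p

    def update(self, measured_position: float) -> float:
        """Blend in a measurement, weighting it by the Kalman gain."""
        (a, b), (c, d) = self.P
        s = a + self.r                 # innovation variance
        k0, k1 = a / s, c / s          # gain for position and velocity
        residual = measured_position - self.p
        self.p += k0 * residual
        self.v += k1 * residual
        # P <- (I - K H) P with H = [1, 0]
        self.P = [[(1.0 - k0) * a, (1.0 - k0) * b],
                  [c - k1 * a, d - k1 * b]]
        return self.p


# An object moving 2 units per frame, observed for 30 frames.
kf = ConstantVelocityKalman(position=0.0)
for frame in range(1, 31):
    kf.predict()
    kf.update(2.0 * frame)

next_position = kf.predict()  # where the filter expects the object in frame 31
print(round(kf.v, 2), round(next_position, 1))
```

The filter starts with no knowledge of the velocity and learns it from the measurements, which is what lets it predict where an object (or the robot) will be in the next frame.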

Applying filters to an image of a car driving along a road with color and tracking images

Developers can use the Qualcomm® Machine Vision SDK, which supports a number of algorithms to determine position and orientation. These include VISLAM, which uses an extended Kalman filter to fuse camera and IMU data into a 6-DoF pose estimate in real-world coordinates, as well as depth from stereo and voxel mapping.

Of course, SLAM is only as good as what the robot can sense, so developers should be sure to choose high-quality cameras and sensors, and find ways to ensure that they're not blocked from capturing data. From a safety perspective, developers should also devise fail-safes in case data cannot be acquired (e.g., if the cameras become covered).

Applying Computer Vision in Robotics

This four-phase strategy should provide a fairly rich set of data that developers can use as the basis of high-level robotic functionality. Developers who are interested in moving forward with next-generation robotics can purchase the Qualcomm® Robotics RB3 development kit from Thundercomm and start taking advantage of the various SDKs available on Qualcomm Developer Network. And for those just getting started with robotics, be sure to check out our involvement with FIRST® (For Inspiration and Recognition of Science and Technology).

So now that you've seen a strategy for applying computer vision in robotics, we'd love to hear how you would approach your robotic development.

Snapdragon, Qualcomm Neural Processing SDK, Qualcomm Robotics RB3, Qualcomm Computer Vision SDK and Qualcomm Machine Vision SDK are products of Qualcomm Technologies, Inc. and/or its subsidiaries.