Computer vision applications often need to process multi-frame images, such as those acquired through video capture. As a result, one of the most important aspects of this processing is the ability to identify and track objects as they move, and as the viewport changes. Use cases for these capabilities are virtually endless, ranging from robotics vision processing (e.g., SLAM) and autonomous vehicles, to analysis of security footage and augmented reality.
While such functionality may seem trivial given today’s powerful mobile processors (after all, we humans do it day in and day out), it is actually quite involved to reproduce in the digital world. In this blog, we’ll examine some of the challenges developers face in dealing with moving objects in video capture, and then look at a few approaches for addressing those challenges.
Challenges with Moving Frames
Analyzing frames of video presents a number of challenges that exist in large part because of the sheer number of variables involved with a given scene. Below are just a few challenges you should consider when developing object detection and tracking applications:
- Object transformations: objects can translate, rotate, and scale over time.
- Object occlusion: objects may become blocked by other objects, either partially or fully.
- Motion blur: the incoming images may become blurred across frames based on factors such as the speed of moving objects, recording frame rate, etc.
- Fast motion: the speed of object transformations can be very fast between frames and can vary based on both the speed of the object itself and the object’s speed relative to the rate of frame capture.
- General scene clutter: scenes may contain complex features and many objects.
- Similarities between objects: objects such as people’s faces can contain many similar features, making it difficult to distinguish between them.
- Environmental effects: effects such as variations in illumination, rain, and haze can all affect the incoming image quality.
- Tracking failure: objects may disappear and then later reappear either in full or in part. Resolving this requires object re-detection.
- Camera transformations: the position, orientation, and viewport settings of the camera can change over time, affecting how objects appear.
Thankfully, a number of approaches have been developed to address these challenges, many of which are now practical thanks to advances in processing power at the edge.
Before we can track an object as it moves, we need to know what the object looks like while keeping in mind that the object’s appearance may change across frames. The first step in tackling these challenges is to develop a visual appearance model (aka target representation) as shown in the images below.
This consists of the algorithms that will be used to identify an object and associate it with a unique identifier (e.g., an object ID). The component responsible for this is often referred to as a classifier, whose job is to take a patch of image data as input, and produce a probability as output that the image contains the identified object.
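To make the "patch in, score out" idea concrete, here is a minimal sketch of an appearance model. It is not a trained classifier: as a stand-in, it compares the intensity histogram of a grayscale patch against a stored template histogram using the Bhattacharyya coefficient, which yields a similarity score between 0 and 1 rather than a calibrated probability. The function names, the 8-bin histogram, and the list-of-rows patch format are all assumptions made for illustration.

```python
import math


def histogram(patch, bins=8):
    """Normalized intensity histogram of a grayscale patch.

    `patch` is assumed to be a list of rows of integers in 0..255.
    """
    counts = [0] * bins
    total = 0
    for row in patch:
        for value in row:
            counts[min(value * bins // 256, bins - 1)] += 1
            total += 1
    return [c / total for c in counts]


def appearance_score(patch, template_hist, bins=8):
    """Similarity in [0, 1] between a patch and the template histogram.

    Uses the Bhattacharyya coefficient: 1.0 means identical histograms,
    0.0 means the histograms do not overlap at all.
    """
    patch_hist = histogram(patch, bins)
    return sum(math.sqrt(p * q) for p, q in zip(patch_hist, template_hist))
```

In use, the template histogram would be built once from the patch where the object was first identified, and each candidate patch in later frames would be scored against it. A learned classifier would replace `appearance_score` in a real system, but the interface stays the same.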
Once we have a mechanism to identify an object, the next step is to determine the motion model (aka localization). This consists of the algorithms which determine the location(s) of object(s) across frames, and may include functionality to predict future locations. Be sure to check out our previous blog, A Four-Phase Strategy for Robotic Vision Processing, Part Two for more information on this topic.
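As a rough illustration of the prediction side of a motion model, the sketch below assumes the simplest possible case: the object keeps the velocity it showed between the last two frames. The function name and the use of (x, y) tuples are assumptions for this example, and real motion models are usually more elaborate.

```python
def predict_next(positions):
    """Predict the next (x, y) location from a list of past locations.

    Assumes constant velocity: the displacement between the last two
    frames is repeated for the next frame. With only one observation
    there is no velocity estimate, so the last position is returned.
    """
    if not positions:
        raise ValueError("at least one observed position is required")
    if len(positions) < 2:
        return positions[-1]
    (x0, y0), (x1, y1) = positions[-2], positions[-1]
    return (2 * x1 - x0, 2 * y1 - y0)
```

A tracker can use the predicted location to narrow the region of the next frame that it searches, which helps with fast motion.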
The visual appearance and motion models serve as the foundation for a more general object tracking process which typically involves determining the object’s initial state and its appearance, estimating its motion, and calculating its position. Collectively such algorithms are referred to as tracking algorithms, which include computations for both the appearance and motion models. And in some cases, the computations of both models feed into each other to derive their results.
Classification of Tracking Algorithms
Before looking at specific algorithms, it’s important to be aware of the general classifications of tracking algorithms that exist.
Detection-based tracking algorithms work across video frames to detect objects and determine tracking trajectories, and in general, can handle cases where objects may appear and disappear across frames. On the other hand, detection-free tracking algorithms must initialize objects on the first frame of video. Detection-free tracking is often used for cases where the number of objects will remain static across frames.
Single-object algorithms, as the name suggests, track a single object that has been identified in the very first frame, even if there are multiple objects in the scene. Multi-object tracking algorithms are capable of tracking multiple objects, even if they only enter a scene after the first frame.
Offline tracking algorithms can be employed when the video footage has already been captured and can be processed offline (e.g., analyzing security footage). Here, calculations can analyze footage in both directions (i.e., from previous to next frames and vice versa) to enhance tracking predictions. In addition, training also takes place offline such that the system is trained to identify what objects look like (e.g., specific faces). Online (i.e., real-time) tracking algorithms, on the other hand, can only analyze frames captured up to the current moment, and use predictive calculations that rely on past frames to help determine where an object has moved in a subsequent frame.
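The practical difference between the two comes down to which frames an estimate is allowed to use. The sketch below is an illustration only, not a tracking algorithm: it smooths a one-dimensional track of positions, once using only past and current frames (online) and once using frames on both sides (offline). The function names and window sizes are assumptions.

```python
def online_smooth(track, window=3):
    """Causal smoothing: frame i uses only frames up to and including i."""
    smoothed = []
    for i in range(len(track)):
        chunk = track[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def offline_smooth(track, half_window=1):
    """Non-causal smoothing: frame i also uses frames that come after it."""
    smoothed = []
    for i in range(len(track)):
        chunk = track[max(0, i - half_window): i + half_window + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed
```

With a track such as `[0.0, 10.0, 2.0, 3.0, 4.0]`, where the second frame is an outlier, the offline version can use the following frame to pull the estimate back toward the true path, while the online version has to commit before that frame arrives.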
Target representation and localization methods such as kernel-based tracking and contour tracking are approaches which have low computational complexity because they track object properties based on features such as contours. On the other hand, filtering and data association methods such as the Kalman filter and particle filter use known information about scenes and objects, evaluate different hypotheses about objects and their positions, and can handle objects that change over time.
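To give a feel for the filtering family, here is a minimal sketch of a Kalman filter that tracks position and velocity along a single axis under a constant-velocity assumption. The class name, the default noise variances, and the identity initial covariance are all assumptions chosen for illustration; a real tracker would typically work in two or more dimensions and tune these values.

```python
class KalmanFilter1D:
    """Constant-velocity Kalman filter for one axis (state: position, velocity)."""

    def __init__(self, pos=0.0, vel=0.0, process_var=1e-2, meas_var=1.0):
        self.pos, self.vel = pos, vel
        self.p = [[1.0, 0.0], [0.0, 1.0]]  # state covariance (assumed initial value)
        self.q = process_var               # process noise added to each diagonal term
        self.r = meas_var                  # variance of the position measurement

    def predict(self, dt=1.0):
        """Advance the state by dt: x = F x and P = F P F^T + Q, F = [[1, dt], [0, 1]]."""
        self.pos += self.vel * dt
        (p00, p01), (p10, p11) = self.p
        self.p = [
            [p00 + dt * (p01 + p10) + dt * dt * p11 + self.q, p01 + dt * p11],
            [p10 + dt * p11, p11 + self.q],
        ]
        return self.pos

    def update(self, measured_pos):
        """Correct the state with a position measurement (H = [1, 0])."""
        (p00, p01), (p10, p11) = self.p
        s = p00 + self.r                   # innovation variance
        k0, k1 = p00 / s, p10 / s          # Kalman gain for position and velocity
        residual = measured_pos - self.pos
        self.pos += k0 * residual
        self.vel += k1 * residual
        self.p = [
            [(1 - k0) * p00, (1 - k0) * p01],
            [p10 - k1 * p00, p11 - k1 * p01],
        ]
        return self.pos
```

Calling `predict()` and then `update(z)` once per frame gives a smoothed position and a velocity estimate. When a detection is missing, for example during occlusion, `predict()` can be called on its own to coast on the motion model.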
Approaches and Algorithms
With object tracking being such a hot topic in vision processing these days, all sorts of approaches and algorithms are continually being developed. In this section we list a few algorithms to give you an idea as to the breadth and depth of this exciting area of computer science.
A convolutional neural network (CNN)-based offline tracker such as GOTURN is trained offline against thousands of videos and is designed to handle single-object tracking. It then takes the bounding box of an object on the first frame of video and tracks that object through subsequent frames. While it doesn’t handle object occlusions, it can handle variations in viewpoints, lighting, and object deformations. Check out our blog: Running Inference Using a Pre-trained Neural Network if you need an introduction to neural networks.
Centroid tracking works by taking the bounding box of an object in each frame. The bounding box can be calculated using any number of algorithms such as the CNN-based approach mentioned above. Centroid tracking then computes the center of the bounding box and assigns it an ID. On each subsequent frame the algorithm seeks to determine if the newly calculated bounding box can be associated with that identified in the previous frame. If the association can be made, then the new location is computed and thus the object has been tracked.
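The steps above can be sketched in a few lines. This is a simplified illustration under stated assumptions rather than a reference implementation: boxes are `(x1, y1, x2, y2)` tuples, association is a greedy nearest-centroid match, an object that finds no match within `max_distance` is dropped immediately, and the class and parameter names are made up for this example.

```python
import math


class CentroidTracker:
    """Associates bounding boxes across frames by nearest centroid."""

    def __init__(self, max_distance=50.0):
        self.next_id = 0
        self.objects = {}                  # object ID -> (cx, cy) from the last frame
        self.max_distance = max_distance   # farthest a centroid may move between frames

    @staticmethod
    def centroid(box):
        x1, y1, x2, y2 = box
        return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

    def update(self, boxes):
        """Process one frame of boxes and return {object ID: centroid}."""
        centroids = [self.centroid(b) for b in boxes]
        # Every (tracked object, new centroid) pair, closest pairs first.
        pairs = sorted(
            (math.dist(old, new), obj_id, j)
            for obj_id, old in self.objects.items()
            for j, new in enumerate(centroids)
        )
        matched_ids, matched_new = set(), set()
        for distance, obj_id, j in pairs:
            if distance > self.max_distance:
                break                      # remaining pairs are even farther apart
            if obj_id in matched_ids or j in matched_new:
                continue
            self.objects[obj_id] = centroids[j]
            matched_ids.add(obj_id)
            matched_new.add(j)
        # Simplification: forget objects that were not seen in this frame.
        for obj_id in list(self.objects):
            if obj_id not in matched_ids:
                del self.objects[obj_id]
        # Any unmatched centroid is registered as a new object.
        for j, c in enumerate(centroids):
            if j not in matched_new:
                self.objects[self.next_id] = c
                self.next_id += 1
        return dict(self.objects)
```

Dropping unmatched objects right away keeps the sketch short, but it means a briefly occluded object comes back with a new ID. A common refinement is to keep an object for a few frames before discarding it.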
You Only Look Once (YOLO) is a deep learning approach to detection that divides a frame into regions and applies a neural network to predict bounding boxes and probabilities for each region. The bounding boxes are then weighted by the predicted probabilities to identify objects. If the same objects are identified with sufficient probability from frame to frame, then those objects have been tracked for that set of frames. This is illustrated in the images below.
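The sketch below illustrates just two pieces of that description: mapping a point to its grid region, and weighting each predicted box by its probabilities. No neural network is involved; the predictions are assumed to already exist as dictionaries with hypothetical `box`, `objectness`, and `class_probs` fields, and the 7x7 grid and 0.5 threshold are illustrative defaults. Real YOLO pipelines also commonly apply non-maximum suppression to remove overlapping boxes, which is omitted here.

```python
def grid_cell(cx, cy, frame_w, frame_h, grid=7):
    """Return the (row, col) of the grid region that contains the point (cx, cy)."""
    col = min(int(cx / frame_w * grid), grid - 1)
    row = min(int(cy / frame_h * grid), grid - 1)
    return row, col


def score_predictions(predictions, threshold=0.5):
    """Weight each box by its probabilities and keep the confident ones.

    Each prediction is a dict with hypothetical keys:
      'box'         -> (x1, y1, x2, y2)
      'objectness'  -> probability that the box contains any object
      'class_probs' -> {label: probability of that label given an object}
    Returns (label, score, box) tuples sorted by descending score.
    """
    detections = []
    for pred in predictions:
        label, class_prob = max(pred["class_probs"].items(), key=lambda kv: kv[1])
        score = pred["objectness"] * class_prob
        if score >= threshold:
            detections.append((label, score, pred["box"]))
    return sorted(detections, key=lambda d: d[1], reverse=True)
```

The resulting boxes can then be handed to an association step such as centroid tracking to carry object IDs from one frame to the next.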
Of course, there are many other algorithms out there, but the algorithms listed above provide a hint at just how many different ways the problem can be approached.
Object Tracking on Mobile
Qualcomm Technologies, Inc. (QTI) is no stranger to object tracking. Our Qualcomm® Computer Vision SDK includes a rich API for detecting and tracking objects and features including faces and text, as well as motion. The Qualcomm® Neural Processing SDK can be used to run AI workloads such as on-device neural network inference, while our Machine Vision SDK is suitable for robotic and autonomous vehicle applications. And of course our Qualcomm® Snapdragon™ mobile platforms are ready to execute object detection and tracking algorithms with features like their Qualcomm® Hexagon™ DSP processor, Qualcomm Spectra™ image signal processor, and Qualcomm® Adreno™ GPU.
Object detection and tracking are key components in computer vision as they facilitate everything from video footage analysis to autonomous robots that can navigate environments. And just as the applications are virtually endless, so too are the clever approaches and algorithms being developed to tackle the challenges. With this in mind, we’d love to know how you are approaching object detection and tracking in your projects.
Qualcomm Snapdragon, Qualcomm Neural Processing SDK, Qualcomm Hexagon, Qualcomm Spectra, Qualcomm Adreno, and Qualcomm Computer Vision are products of Qualcomm Technologies, Inc. and/or its subsidiaries.