NI Vision implements two object tracking algorithms:

  • Mean shift—A simple algorithm that tracks a user-defined object by iteratively updating its location.
  • EM-based mean shift (shape adapted mean shift)—An extended version of the mean shift algorithm in which not only the location but also the shape (including scale) of the object is adapted frame after frame.

    To track an object, the target object must first be characterized over a feature space. The color histogram is a robust representation of object appearance and is chosen as the feature space; moving objects are characterized by their histograms. The feature-histogram-based target representations are regularized by spatial masking with an isotropic kernel.
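
    As a rough sketch of this idea (pure NumPy, not the NI Vision implementation), the following computes a grayscale histogram of an image patch weighted by an isotropic Epanechnikov kernel, so pixels near the object center contribute more than pixels near the patch boundary:

```python
import numpy as np

def kernel_weighted_histogram(patch, n_bins=16):
    """Grayscale histogram of a patch, weighted by an isotropic
    Epanechnikov kernel so pixels near the center count more."""
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Normalized squared distance from the patch center.
    r2 = (((ys - (h - 1) / 2) / (h / 2)) ** 2 +
          ((xs - (w - 1) / 2) / (w / 2)) ** 2)
    weights = np.maximum(1.0 - r2, 0.0)          # Epanechnikov profile
    bins = (patch.astype(int) * n_bins) // 256   # map 0..255 -> bin index
    hist = np.bincount(bins.ravel(), weights=weights.ravel(),
                       minlength=n_bins)
    return hist / hist.sum()                     # normalize to a pmf
```

    A color (RGB) version would bin each channel in the same way; the kernel weighting is what makes the representation robust to background clutter near the patch boundary.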

    Understanding Mean Shift

    The mean shift algorithm is a simple method for finding the position of a local mode (local maximum) of a kernel-based estimate of a probability density function. Object tracking for an image frame is performed by a combination of histogram extraction, weight computation, and derivation of the new location.

    There are three stages to the mean shift algorithm:

  • Target model—Choose the target object in the given frame. Represent the target model in the given feature space (color histogram) with a kernel.
  • Mean shift convergence—In the next frame, search with the current histogram and spatial data for the best target match candidate by maximizing the similarity function. In the mean shift algorithm, the object center moves from its current location to a new location, as shown in the figure below. The kernel is moved until the similarity function converges; then the location of the object is updated.
  • Update location and model—Update the target model, and the location of the target, based on the blending parameter.
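
    The convergence stage above can be sketched as the following loop (an illustration in pure NumPy, not the NI Vision API), assuming a per-pixel likelihood map such as a back-projected histogram has already been computed:

```python
import numpy as np

def mean_shift(weights, window, max_iter=15, eps=1.0):
    """Illustrative mean shift loop: `weights` is a per-pixel
    likelihood map (e.g. a back-projected histogram) and `window`
    is (x, y, w, h). The window center is moved to the weighted
    centroid of the pixels it covers until the shift is tiny."""
    x, y, w, h = window
    for _ in range(max_iter):
        roi = weights[y:y + h, x:x + w]
        total = roi.sum()
        if total == 0:
            break
        ys, xs = np.mgrid[0:roi.shape[0], 0:roi.shape[1]]
        # Weighted center of mass inside the window; the offset from
        # the window center is the mean shift vector.
        cx = (xs * roi).sum() / total
        cy = (ys * roi).sum() / total
        dx = int(round(cx - (w - 1) / 2))
        dy = int(round(cy - (h - 1) / 2))
        x, y = x + dx, y + dy
        if dx * dx + dy * dy <= eps:
            break
    return (x, y, w, h)
```

    For example, a window started a few pixels away from a bright blob in the likelihood map will step toward the blob's center of mass and stop once the shift falls below the threshold.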

    Understanding EM-Based Mean Shift

    The mean shift algorithm is not invariant to changes in the scale or shape of the object. To track an object that may appear to change in size or shape, the EM-based mean shift algorithm is required.

    The EM-based mean shift, or shape adapted mean shift, algorithm is an extension of the standard algorithm already described. The EM-based mean shift algorithm simultaneously estimates the position of the local mode and the covariance matrix that describes the approximate shape of the local mode. The covariance matrix, which defines the shape and scale of the region enclosing the object, is updated every frame to adapt to the shape and scale of the object in that frame.

    There are three stages to the EM-based mean shift algorithm:

  • Target model—Choose the target object in the given frame. Represent the target model in the given feature space (color histogram) with a kernel.
  • Mean shift convergence—In the next frame, search with the current histogram and spatial data for the best target match candidate by maximizing the similarity function. In the mean shift algorithm, the object center moves from its current location to a new location, essentially the center of mass, as shown in the figure below. The magnitude and direction of the move are represented by the mean shift vector. The kernel is moved until the similarity function converges; then the location of the object is updated, along with the covariance of the kernel.
  • Update location and model—Update the target model (including the scale and shape), and the location of the target, based on the blending parameter and maximum acceptable scale and shape changes.
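
    The covariance (shape and scale) estimate at the heart of the EM-based variant can be sketched as follows, again treating a per-pixel likelihood map as the input (an illustration only, not the NI Vision implementation): the weighted mean gives the location, and the weighted covariance gives the scale and orientation of the region.

```python
import numpy as np

def estimate_shape(weights):
    """EM-style shape step: treat the likelihood map as sample
    weights and estimate the mean (location) and covariance
    (scale/orientation) of the region it describes."""
    ys, xs = np.nonzero(weights > 0)
    w = weights[ys, xs]
    w = w / w.sum()
    mean = np.array([(xs * w).sum(), (ys * w).sum()])  # (x, y) location
    d = np.stack([xs - mean[0], ys - mean[1]])         # 2 x N deviations
    cov = (d * w) @ d.T                                # weighted covariance
    return mean, cov
```

    An elongated blob produces a covariance whose larger eigenvalue lies along the blob's long axis, which is how the kernel can stretch and rotate with the object from frame to frame.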

    Kalman Prediction

    EM-based mean shift also features a Kalman filter implementation. A Kalman filter uses the history of measurements of the target to build a model of the state of the system, and this model is used to predict the location of the target in the next frame.
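
    A minimal constant-velocity Kalman filter in NumPy illustrates the predict/update cycle; the state layout and noise values here are assumptions made for the example, not NI Vision defaults:

```python
import numpy as np

def make_kalman(dt=1.0, q=1e-2, r=1.0):
    # State [x, y, vx, vy] under a constant-velocity motion model.
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], float)   # state transition
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], float)   # we measure position only
    return F, H, q * np.eye(4), r * np.eye(2)

def kalman_step(x, P, z, F, H, Q, R):
    """One predict/update cycle: project the state ahead, then blend
    in the measurement z = (x, y) according to the Kalman gain."""
    x = F @ x                       # predict state
    P = F @ P @ F.T + Q             # predict covariance
    S = H @ P @ H.T + R             # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)  # Kalman gain
    x = x + K @ (z - H @ x)         # correct with measurement
    P = (np.eye(4) - K @ H) @ P
    return x, P
```

    Fed a sequence of target locations, the filter's predicted state can be used to seed the mean shift search in the next frame, which helps when the target is briefly occluded.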

    Histogram Back Projection

    Back projection is one method used to improve the convergence of the target candidate's size and location with the actual size and location of an object. Back projection records how well the pixels of a target candidate fit the pixel distribution of the target model, allowing the user to gauge how well the model of the object matches its appearance.

    A histogram of an image known to contain the object of interest is created, and is then back projected over the image. Proper thresholding of the resulting image should isolate the object from the background.

    Each pixel value in the resulting image represents the likelihood that the pixel is part of the object. The minimum pixel value of 0 indicates the pixel does not belong to the object, while the maximum value of 255 indicates the highest likelihood that the pixel belongs to the object. This back projected image is a good indication of how well the tracking algorithm has been able to identify the pixels that belong to the object being tracked.
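
    A minimal grayscale back projection sketch (the bin count and the 0–255 scaling here are illustrative, not the NI Vision implementation):

```python
import numpy as np

def back_project(image, model_hist, n_bins=16):
    """Replace each pixel with the model histogram value of its bin,
    scaled to 0..255, so bright pixels are likely object pixels."""
    bins = (image.astype(int) * n_bins) // 256   # per-pixel bin index
    likelihood = model_hist[bins]                # per-pixel probability
    peak = likelihood.max()
    scaled = likelihood / peak if peak > 0 else likelihood
    return (255 * scaled).astype(np.uint8)
```

    Thresholding the result isolates pixels whose colors are common in the model, which is exactly the likelihood map the mean shift loop needs.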

    Background Subtraction

    A second method used to improve the convergence of the target model is background subtraction, a process that extracts foreground objects from a scene. This helps reduce false positives and creates a better match between the target model and the target candidates.
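
    One simple form of background subtraction keeps a running-average background and thresholds the per-pixel difference; the learning rate and threshold below are illustrative choices, not NI Vision parameters:

```python
import numpy as np

def subtract_background(frames, alpha=0.05, thresh=25):
    """Maintain a running-average background and flag pixels that
    deviate from it as foreground. Returns one mask per frame
    after the first (which seeds the background model)."""
    bg = frames[0].astype(float)
    masks = []
    for f in frames[1:]:
        diff = np.abs(f.astype(float) - bg)
        masks.append(diff > thresh)          # foreground mask
        bg = (1 - alpha) * bg + alpha * f    # slowly absorb scene changes
    return masks
```

    Static pixels fade into the background model, so only moving objects survive in the masks that restrict where target candidates are searched.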

    Choosing the Right Parameters

    The following parameters can be set by the user to create an object tracking application suited to their needs:

  • Histogram bins—Defines the number of bins used to represent the histogram that characterizes the object. As the number of bins decreases, the range of colors that fall into a given bin expands, so subtle color differentiation is not possible. Increasing the number of bins allows greater differentiation between very similar colors, at the cost of comparing more histogram entries during matching. By default, 16 bins are used for grayscale images, while RGB images use 8 bins.
  • Blending parameter—Defines the degree to which the target model is based on the previous frame. This parameter falls between 1 and 100. For very high values, the model relies heavily on the current frame; as a result, if the target object is occluded or out of frame, the algorithm may be unable to locate the object in the next frame. For very low values, the model relies heavily on the previous frame; as a result, the model will not adapt to new changes in the appearance of the object, which may be desired in surveillance applications where the target is frequently occluded. The default value is 10%.
  • Max iterations—Specifies the maximum number of iterations until a match is found. Matching iterates until the similarity of the target object and target model converges, or the maximum number of iterations is reached. The default value is 15.

    The following additional parameters can be used to configure the EM-based mean shift algorithm:

  • Max scale change—The maximum percentage that the size of the region defining the object can change between frames.
  • Max rotation change—The maximum number of degrees that the region defining the object can rotate between frames.
  • Max shape change—The maximum percentage that the shape of the region defining the object can change between frames.
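
    The parameters above might be gathered into a settings object like the following; the field names and the EM-specific default values are placeholders for illustration, not the NI Vision API:

```python
from dataclasses import dataclass

@dataclass
class TrackerParams:
    # Defaults documented in the text above.
    histogram_bins: int = 16   # 16 for grayscale; RGB images use 8
    blending: int = 10         # percent, between 1 and 100
    max_iterations: int = 15   # cap on mean shift iterations
    # EM-based mean shift limits; these defaults are placeholders only.
    max_scale_change: float = 10.0     # percent per frame
    max_rotation_change: float = 10.0  # degrees per frame
    max_shape_change: float = 10.0     # percent per frame
```

    Grouping the parameters this way makes it easy to trade robustness against adaptability, for example by lowering the blending value for surveillance scenes with frequent occlusion.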