Multi object tracking research trends

2 minute read

Published:

Multiple object tracking approaches…

Tracking objects from video become one of the extensively studied research with the advent of artificial intelligence. The problem is divided into two parts: detection of object in each video frame and association of the same object over time to make the tracking happen.

Tracking = Detection + Association

An association algorithm (most commonly hungraian algorithm) associate each object in consecutive frames. To conduct association over frames for an individual object different cues are considered to match, like appearance , motion, pose etc.

There are two types of tracking approaches: 1. Top-down approach, 2. Bottom-up approach. Top-down approaches detect the bounding box first and then associate bounding boxes in consecutive frames. Bottom-up approach use more granular cue to detect and associate objects over frames. In particular, it uses pose or pixel cue for association.

I have enlisted few papers from both approaches.

Top-down tracking approaches:

  • Paper 1:- Tracking The Untrackable: Learning to Track Multiple Cues with Long-Term Dependencies link.
  • Paper 2:- Tracking without bells and whistles link.
Sample Image
Figure 1 (Paper 1): An individual object’s detected boxes till timeframe t is associated with a detection in t+1 timeframe. This paper use LSTM to use temporal information of an object in matching.
Sample Image
Figure 2 (Paper 2):This paper converts a detector into tracktor. Objects moves very slightly in consecutive frames. So individual detected objects in frame t-1 and t will be highly overlapped with each other. They use detection proposal location of individual object from t-1 frame and perform detection in frame t. If the detection score is high continue the tracking, otherwise initiate new tracks.

Bottom-up tracking approaches:

  • Paper 3:- PoseTrack: Joint Multi-Person Pose Estimation and Tracking link.
  • Paper 4:- Simple Baselines for Human Pose Estimation and Tracking link.
Sample Image
Figure 3 (Paper 3): They detected all body parts in consecutive frames and form graph with detection nodes, edges between nodes in current frames body parts (spatial edges) and edges between same type body parts in temporal frame (temporal edges). Then, cut the graph so that minimum number of nodes and edges remains.
Sample Image
Figure 4 (Paper 4): They used optical flow and detection intersection-over-union for selecting current frames pose need to be tracked. Then perform greedy matching between object pose keypoints.

Conclusion: Bottom-up approach matching is computationally expensive as matching need to be done with many keypoints of multiple objects over time. Therefore, its difficult to get feasible solution from bottom approach. On the other hand, top down approaches associate objects using detected bounding boxes which is faster to optimize even using simple hungarian algorithm.