MOTS: Multi-object tracking and segmentation


Posted by pshow on June 8, 2020

(CVPR 2019) MOTS: Multi-object tracking and segmentation


  • Pixel-level masks are created semi-automatically for two tracking datasets, KITTI [13] and MOTChallenge [38], for training and evaluating methods that tackle the MOTS task.

    Our new annotations comprise 65,213 pixel masks for 977 distinct objects (cars and pedestrians) in 10,870 video frames.

  • A new evaluation measure for multi-object tracking and segmentation is proposed.

    We propose the new soft Multi-Object Tracking and Segmentation Accuracy (sMOTSA) measure that can be used to simultaneously evaluate all aspects of the new task.

  • A single CNN, TrackR-CNN, performs object detection, tracking, and segmentation and serves as a baseline.
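
As best understood from the paper, sMOTSA is the summed mask IoU of all true positives, minus the number of false positives and ID switches, divided by the number of ground-truth masks. Below is a minimal sketch of that formula. The function name, argument layout, and toy numbers are assumptions made for illustration and are not taken from the official evaluation code.

```python
def smotsa(tp_ious, num_false_positives, num_id_switches, num_gt_masks):
    """Soft MOTSA: (sum of TP mask IoUs - |FP| - |IDS|) / |M|.

    tp_ious: mask IoU of each true-positive hypothesis with its matched
             ground-truth mask (a match is assumed to require IoU > 0.5).
    """
    if num_gt_masks <= 0:
        raise ValueError("need at least one ground-truth mask")
    soft_tp = sum(tp_ious)  # "soft" count: each TP contributes its IoU, not 1
    return (soft_tp - num_false_positives - num_id_switches) / num_gt_masks


# Toy numbers (not from the paper): 3 matches, 1 false positive,
# 1 ID switch, 4 ground-truth masks.
score = smotsa([0.9, 0.8, 0.7], num_false_positives=1,
               num_id_switches=1, num_gt_masks=4)
```

Because false positives and ID switches are subtracted, the score can become negative when a tracker makes more errors than it has good matches.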


Bounding-box-level tracking performance is saturating.


Existing instance segmentation datasets do not provide annotations on video data or even information on object identities across different images.

Existing MOT datasets provide only bounding box annotations of objects.



The authors propose TrackR-CNN as a baseline method that addresses all aspects of the MOTS task.

TrackR-CNN extends Mask R-CNN [15] with 3D convolutions to incorporate temporal information and with an association head that links object identities over time.
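
To make the idea of linking identities over time concrete, here is a small sketch that matches the detections of the current frame to existing tracks by the Euclidean distance between their association vectors. It uses greedy nearest-neighbour matching for brevity, which is a simplification and not necessarily the matching used in TrackR-CNN. `associate`, `max_dist`, and the toy vectors are all hypothetical.

```python
import math


def associate(track_vecs, det_vecs, max_dist=1.0):
    """Greedily match detections to tracks by embedding distance.

    track_vecs: dict mapping track id -> association vector (list of floats)
    det_vecs:   list of association vectors for the current frame
    Returns one entry per detection: a track id, or None when no unused
    track is within max_dist (i.e. a new track should be started).
    """
    # All (distance, track id, detection index) candidates, closest first.
    pairs = sorted(
        (math.dist(tv, dv), tid, di)
        for tid, tv in track_vecs.items()
        for di, dv in enumerate(det_vecs)
    )
    assigned = [None] * len(det_vecs)
    used_tracks = set()
    for dist, tid, di in pairs:
        if dist > max_dist:
            break  # every remaining pair is even farther apart
        if tid in used_tracks or assigned[di] is not None:
            continue  # track or detection already matched
        assigned[di] = tid
        used_tracks.add(tid)
    return assigned


# Toy example: two existing tracks, three detections in the new frame.
tracks = {7: [0.0, 0.0], 9: [5.0, 5.0]}
detections = [[4.8, 5.1], [0.1, -0.1], [20.0, 20.0]]
matches = associate(tracks, detections)
```

In this toy case the first detection continues track 9, the second continues track 7, and the third is too far from both tracks, so it would start a new identity.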



In particular, targets may enter and leave the scene at any time and must be recovered after long-time occlusion and under appearance changes. Many MOT datasets focus on street scenarios.


Existing VOS datasets contain only a few objects, which are also present in most frames. In addition, the common evaluation metrics for this task (region Jaccard index and boundary F-measure) do not account for errors such as ID switches, which can occur when tracking multiple objects.
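
The region Jaccard index mentioned above is the intersection over union of a predicted mask and a ground-truth mask. A minimal sketch follows, with masks represented as sets of (row, col) pixel coordinates. That representation and the function name are choices made here for illustration.

```python
def jaccard(mask_a, mask_b):
    """Region Jaccard index (IoU) of two masks given as sets of (row, col) pixels."""
    union = mask_a | mask_b
    if not union:
        return 1.0  # convention chosen here: two empty masks agree perfectly
    return len(mask_a & mask_b) / len(union)


# Two 2x2 squares that overlap in one column.
gt = {(0, 0), (0, 1), (1, 0), (1, 1)}
pred = {(0, 1), (1, 1), (0, 2), (1, 2)}
iou = jaccard(gt, pred)  # 2 shared pixels out of 6 in the union
```

The score compares one pair of masks at a time, so it carries no information about whether the predicted identity stays the same from frame to frame. This is why an ID switch goes unnoticed by this metric.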

MOTS focuses on a set of pre-defined classes and considers crowded scenes with many interacting objects. MOTS also adds the difficulty of discovering and tracking a varying number of new objects as they appear and disappear in a scene.


Cityscapes [12], BDD [62], and ApolloScape [19]


Other approaches generate segmentation masks from scribbles [50] or clicks [59]. These methods require user input for every object to be segmented, while our annotation procedure can segment many objects fully automatically, letting annotators focus on improving results for difficult cases.


There are some datasets with MOT annotations, i.e., tracks annotated at the bounding box level. For the MOTS task, these datasets lack segmentation masks. Our annotation procedure therefore adds segmentation masks for the bounding boxes in two MOT datasets.


We use a convolutional network to automatically produce segmentation masks from bounding boxes, followed by a correction step using manual polygon annotations. Per track, we fine-tune the initial network using the manual annotations as additional training data, similarly to [6]. We iterate the process of generating and correcting masks until pixel-level accuracy for all annotation masks has been reached.
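
The generate, correct, and fine-tune cycle described above can be sketched as a simple loop. This only illustrates the control flow: every callable passed in (`predict_mask`, `is_accurate`, `correct_manually`, `fine_tune`) is a hypothetical stand-in for the network or the human annotator, and correcting one frame per round is an assumption rather than the authors' protocol.

```python
def annotate_track(boxes, predict_mask, is_accurate, correct_manually,
                   fine_tune, per_round=1, max_rounds=10):
    """Iterate mask generation and manual correction for one track."""
    manual = {}  # frame index -> manually corrected mask
    masks = {}
    for _ in range(max_rounds):
        # Manual masks are kept; all other frames are re-predicted.
        masks = {i: manual[i] if i in manual else predict_mask(box)
                 for i, box in enumerate(boxes)}
        bad = [i for i, m in masks.items() if not is_accurate(m)]
        if not bad:
            return masks  # every mask is accepted
        for i in bad[:per_round]:
            manual[i] = correct_manually(masks[i])
        fine_tune(manual)  # manual masks become extra training data
    masks.update(manual)  # round budget exhausted: keep latest corrections
    return masks


# Toy stand-ins: the "network" improves by one quality step per fine-tuning.
state = {"quality": 0, "fine_tunes": 0}

def toy_fine_tune(manual_masks):
    state["quality"] += 1
    state["fine_tunes"] += 1

result = annotate_track(
    boxes=["box0", "box1", "box2"],
    predict_mask=lambda box: state["quality"],
    is_accurate=lambda mask: mask >= 2,
    correct_manually=lambda mask: 99,
    fine_tune=toy_fine_tune,
)
```

In the toy run, two manual corrections are enough: after the second fine-tuning step the remaining automatic mask passes the check and the loop stops.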

The network is DeepLabv3+, which takes as input a crop of the input image specified by the bounding box with a small context region added, together with an additional input channel that encodes the bounding box as a mask.
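
As a rough illustration of this input, the sketch below computes the enlarged crop window and the extra binary channel that marks the original bounding box inside it. The 10% context margin, the function name, and the nested-list representation are assumptions made here; the notes above do not state the values the authors used.

```python
def crop_with_box_channel(image_h, image_w, box, context=0.1):
    """Return the crop window and a binary box-mask channel for that crop.

    box:     (x0, y0, x1, y1) in pixels, with x1 and y1 exclusive.
    context: fraction of the box width/height added on each side (assumed).
    """
    x0, y0, x1, y1 = box
    pad_x = int(round((x1 - x0) * context))
    pad_y = int(round((y1 - y0) * context))
    # Enlarge the box by the context margin, clamped to the image bounds.
    cx0, cy0 = max(0, x0 - pad_x), max(0, y0 - pad_y)
    cx1, cy1 = min(image_w, x1 + pad_x), min(image_h, y1 + pad_y)
    # Extra channel: 1 inside the original box, 0 in the context region.
    box_channel = [
        [1 if (x0 <= x < x1 and y0 <= y < y1) else 0 for x in range(cx0, cx1)]
        for y in range(cy0, cy1)
    ]
    return (cx0, cy0, cx1, cy1), box_channel


# A 20x40 box inside a 200x100 (width x height) image.
window, channel = crop_with_box_channel(100, 200, box=(50, 20, 70, 60))
```

The RGB crop taken from `window` and `channel` would then be stacked into a four-channel input for the segmentation network.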

Using two manually annotated segmentation masks per object for fine-tuning the refinement network already produces relatively good masks for the object’s appearances in the other frames, but often small errors remain.




We further annotated 4 of 7 sequences of the MOTChallenge 2017 [38] training dataset and obtained the MOTSChallenge dataset. MOTSChallenge focuses on pedestrians in crowded scenes and is very challenging due to many occlusion cases, for which a pixel-wise description is especially beneficial. A sample of the annotations is shown in Fig. 2, statistics are given in Table 1.