(CVPR 2019) MOTS: Multi-Object Tracking and Segmentation

Main contributions:

Using a semi-automatic procedure, the authors build pixel-level masks on two tracking datasets, KITTI [13] and MOTChallenge [38], for training and evaluating methods that tackle the MOTS task.

Our new annotations comprise 65,213 pixel masks for 977 distinct objects (cars and pedestrians) in 10,870 video frames.
A new evaluation measure for multi-object tracking and segmentation is proposed.

We propose the new soft Multi-Object Tracking and Segmentation Accuracy (sMOTSA) measure that can be used to simultaneously evaluate all aspects of the new task.
A single CNN jointly performs object detection, tracking, and segmentation: TrackR-CNN, which serves as a baseline.

Introduction

Bounding-box-level tracking performance is saturating.

MOTS ties together object detection, segmentation, and object tracking.

Conventional instance segmentation datasets do not provide annotations on video data or even information on object identities across different images.

Conventional MOT datasets provide only bounding box annotations of objects.

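When an object is occluded, using a bounding box as ground truth brings in extraneous information; a segmentation mask is the appropriate representation.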
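To make the occlusion argument concrete, here is a minimal sketch that measures how much of a bounding box is actually covered by the object's mask. The function name `box_fill_ratio` and the toy mask are hypothetical and only illustrate the idea; they are not taken from the paper.

```python
def box_fill_ratio(mask, box):
    """Fraction of the pixels inside `box` that belong to the object.

    mask : list of rows, each a list of 0/1 values (1 = object pixel)
    box  : (x0, y0, x1, y1) in pixel coordinates, x1 and y1 exclusive
    """
    x0, y0, x1, y1 = box
    area = (x1 - x0) * (y1 - y0)
    if area <= 0:
        raise ValueError("box must have positive area")
    # Count object pixels that fall inside the box.
    inside = sum(mask[y][x] for y in range(y0, y1) for x in range(x0, x1))
    return inside / area


# Toy example: a thin, partly hidden object inside a 2x3 box.
toy_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
]
ratio = box_fill_ratio(toy_mask, (1, 1, 3, 4))
```

A low ratio means the box is mostly background or other objects, which is the situation where a box-level ground truth is least informative.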

The paper proposes TrackR-CNN as a baseline method which addresses all aspects of the MOTS task.

TrackR-CNN extends Mask R-CNN [15] with 3D convolutions to incorporate temporal information and with an association head which is used to link object identities over time.

Datasets

There are some datasets with MOT annotations, i.e., tracks annotated at the bounding box level. For the MOTS task, these datasets lack segmentation masks. Our annotation procedure therefore adds segmentation masks for the bounding boxes in two MOT datasets.

Semi-automatic Annotation Procedure

We use a convolutional network to automatically produce segmentation masks from bounding boxes, followed by a correction step using manual polygon annotations. Per track, we fine-tune the initial network using the manual annotations as additional training data, similarly to [6]. We iterate the process of generating and correcting masks until pixel-level accuracy for all annotation masks has been reached.

The refinement network is DeepLabv3+, which takes as input a crop of the input image specified by the bounding box with a small context region added, together with an additional input channel that encodes the bounding box as a mask.
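The input construction described above can be sketched as follows. This is only an illustration under assumptions: the function name `crop_with_context`, the 10% context margin, and the plain-list mask are hypothetical choices, not values or code from the paper. The returned mask would be stacked as a fourth channel next to the RGB crop.

```python
def crop_with_context(box, image_size, context=0.1):
    """Return (crop_box, bbox_mask) for a refinement-network input.

    box        : (x0, y0, x1, y1) in pixels, x1 and y1 exclusive
    image_size : (width, height) of the full image
    context    : margin added on each side, as a fraction of the box
                 size (assumed value, not from the paper)
    """
    x0, y0, x1, y1 = box
    width, height = image_size
    pad_x = int(round((x1 - x0) * context))
    pad_y = int(round((y1 - y0) * context))
    # Enlarge the box by the context margin, clipped to the image bounds.
    cx0, cy0 = max(0, x0 - pad_x), max(0, y0 - pad_y)
    cx1, cy1 = min(width, x1 + pad_x), min(height, y1 + pad_y)
    # Extra channel: 1 inside the original box, 0 in the context region.
    bbox_mask = [
        [1 if (x0 <= x < x1 and y0 <= y < y1) else 0 for x in range(cx0, cx1)]
        for y in range(cy0, cy1)
    ]
    return (cx0, cy0, cx1, cy1), bbox_mask


crop_box, bbox_mask = crop_with_context((10, 10, 30, 20), (100, 50))
```

Encoding the box as a mask channel lets the network tell the annotated object apart from neighbouring objects that also appear in the context region.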

Using two manually annotated segmentation masks per object for fine-tuning the refinement network already produces relatively good masks for the object’s appearances in the other frames, but often small errors remain.

MOTSChallenge

We further annotated 4 of 7 sequences of the MOTChallenge 2017 [38] training dataset and obtained the MOTSChallenge dataset. MOTSChallenge focuses on pedestrians in crowded scenes and is very challenging due to many occlusion cases, for which a pixel-wise description is especially beneficial. A sample of the annotations is shown in Fig. 2, statistics are given in Table 1.

Evaluation Measures
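The notes stop at this heading, so the following is only a minimal sketch of the sMOTSA measure named in the contributions. It assumes the definition sMOTSA = (soft TP - |FP| - |IDS|) / |M|, where soft TP is the sum of mask IoUs over all true-positive matches and |M| is the number of ground-truth masks, alongside MOTSA = (|TP| - |FP| - |IDS|) / |M| and MOTSP = soft TP / |TP|. The function name and argument names are hypothetical, and the matching of predictions to ground truth (mask IoU above 0.5) is assumed to have been done beforehand.

```python
from typing import Sequence


def mots_scores(tp_ious: Sequence[float], num_fp: int, num_ids: int, num_gt: int) -> dict:
    """Compute MOTSA, sMOTSA and MOTSP from already-matched results.

    tp_ious : mask IoU of every true-positive match
    num_fp  : predicted masks without a matching ground-truth mask
    num_ids : number of identity switches
    num_gt  : total number of ground-truth masks
    """
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    num_tp = len(tp_ious)
    # "Soft" true positives: accumulate IoU instead of counting matches.
    soft_tp = sum(tp_ious)
    return {
        "MOTSA": (num_tp - num_fp - num_ids) / num_gt,
        "sMOTSA": (soft_tp - num_fp - num_ids) / num_gt,
        "MOTSP": soft_tp / num_tp if num_tp else 0.0,
    }


scores = mots_scores([0.9, 0.8, 0.7], num_fp=1, num_ids=0, num_gt=4)
```

Because soft TP weights each match by its IoU, sMOTSA drops when masks are matched but imprecise, which is how one number can reflect detection, tracking, and segmentation quality together.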
