Simple Unsupervised Multi-Object Tracking
Annotating MOT datasets is time-consuming and labor-intensive. By using an unsupervised re-identification (ReID) network, no tracking annotations are needed.
- A spatio-temporal association model which associates boxes in nearby frames to create clusters of tracklets
- A re-identification model which associates tracklets over larger windows to deal with complexities in tracking such as occlusions and target interactions.
- Performs better than a naive ReID baseline and integrates more easily with existing trackers.
Simple unsupervised ReID is sufficient even in crowded scenarios with occlusions and person interactions.
Various approaches have been proposed here, including network flows, graph cuts, MCMC, and minimum cliques, when the entire video is provided beforehand (batch processing).
In scenarios where we get frame-by-frame input, Hungarian matching [53,3], greedy matching, and Recurrent Neural Networks [15,43] are popularly used models for sequential prediction (online processing).
Association metrics / cost functions can be derived from spatio-temporal information or from ReID features.
Spatiotemporal relations: IoU, Kalman filter, Recurrent Neural Networks.
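As a sketch of the simplest spatio-temporal cue above, the IoU between a track's last box and a new detection can serve directly as an association score. A minimal implementation, assuming boxes in (x1, y1, x2, y2) format:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```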
Recent approaches leverage appearance-reliant pre-trained bounding box regressors from object detection or single object tracking [56,11] pipelines to regress the bounding box in the next frame. Since most of the above models are unsupervised (requiring no tracking annotations), they complement our work and can be incorporated with our proposed approach for creating efficacious unsupervised trackers.
ReID across multiple cameras: large annotated multi-camera datasets have enabled strong CNN-based ReID models. For tracking, the re-matching problem is much smaller in scale, so such complex ReID models are unnecessary.
ReID for monocular 2D tracking: comparing CNN-extracted features is the common way ReID is used in tracking.
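A minimal sketch of how such CNN features are compared: cosine similarity between two embedding vectors is the usual affinity measure (the feature vectors here stand in for real CNN outputs):

```python
import numpy as np

def cosine_similarity(feat_a, feat_b):
    """Appearance affinity between two L2-normalized CNN feature vectors."""
    a = feat_a / np.linalg.norm(feat_a)
    b = feat_b / np.linalg.norm(feat_b)
    return float(np.dot(a, b))
```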
Framework: Learning by generating tracking labels
First, bounding boxes are obtained on unlabeled videos with an off-the-shelf detector; an unsupervised spatio-temporal association algorithm then produces short-term tracklets (sets of associated detections of the same person over time).
Training ReID models:
In step (i), we only utilize the bounding boxes and use Kalman filtering combined with Hungarian matching to simulate labels. Since we use no appearance information, our tracking labels are noisy.
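A minimal sketch of the assignment in step (i), assuming SciPy is available: Kalman-predicted track boxes are matched to current detections by minimizing a 1 − IoU cost with the Hungarian algorithm (the cost values and gating threshold below are illustrative, not from the paper):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Illustrative cost matrix: rows = Kalman-predicted tracks,
# cols = current-frame detections, entries = 1 - IoU.
cost = np.array([
    [0.1, 0.9, 0.8],
    [0.9, 0.2, 0.7],
])

rows, cols = linear_sum_assignment(cost)  # Hungarian matching (minimizes total cost)
# Gate: reject assignments whose box overlap is too low (threshold is illustrative).
matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] < 0.7]
```

Unmatched detections start new tracklets; unmatched tracks are terminated after a patience window, as in SORT-style pipelines.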
In step (ii), we proceed by making both the aforementioned assumptions that no two videos or tracklets share common labels. We assign a unique label to each tracklet and train a network with cross-entropy loss to predict this label given any image from that tracklet.
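The labeling scheme in step (ii) can be sketched as follows (`assign_tracklet_labels` is a hypothetical helper, not from the paper): every crop in a tracklet inherits that tracklet's globally unique pseudo-identity, which the ReID network is then trained to predict with cross-entropy.

```python
def assign_tracklet_labels(videos):
    """Assign a unique pseudo-label per tracklet, assuming no two tracklets
    (within or across videos) share an identity.

    videos: list of videos, each a list of tracklets (lists of image crops).
    Returns (samples, labels): each crop paired with its tracklet's pseudo-label.
    """
    samples, labels = [], []
    next_label = 0
    for tracklets in videos:
        for tracklet in tracklets:
            for crop in tracklet:
                samples.append(crop)
                labels.append(next_label)
            next_label += 1  # one identity class per tracklet
    return samples, labels
```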
In CenterTrack, we extract tracks using its unsupervised model and refine them with our ReID network using a DeepSORT framework.
Simpler choices alone are sufficient to match the performance of supervised networks.
We obtain our SimpleReID model by training a ResNet50 backbone popularly used by trackers for a fair comparison. We train the model with tracklets generated by SORT on the PathTrack dataset to test generalization to unseen MOT16/17 data.
PathTrack is used as the training set; analysis is performed on the full MOT17 train and test sets.
The submitted tracker consists of our proposed SimpleReID model incorporated with CenterTrack for bounding box regression using public detections.
While CenterTrack is a good detector, it cannot maintain long tracks; this is compensated by using our appearance features for re-identification.
Past literature [49,34] indicates that unsupervised ReID is unlikely to excel in crowded scenarios due to the complexities of tracking in such scenes.
(i) We show that the test performance of SimpleReID (on unseen videos) is equivalent to that of a supervised ReID model, on its training set itself.
(ii) We show that SimpleReID achieves the above desiderata even with simple trackers which are highly reliant on the ReID component.
Limits of unsupervised ReID:
We perform experiments across various weaker scenarios, such as having no ReID or using an ImageNet-pretrained network as-is, and show that these perform significantly worse than SimpleReID, demonstrating that SimpleReID is important to match supervised performance.
We first train another recent supervised tracker, Tracktor++v2, which uses bounding box regression along with a supervised ReID model to predict the position of an object in the next frame.
We train the supervised ReID model on the training data for MOT16/MOT17 and then benchmark the performance on the same training set. In contrast, this data is new to our SimpleReID models, i.e., they have not seen these videos previously.
Results are shown in Table 3. We observe that using ImageNet-pretrained ReID somewhat improves IDF1 scores compared to using no ReID network at all, but fails to achieve the upper bound by a considerable margin. Our SimpleReID approach successfully recovers the remaining performance gap. This is achieved consistently across different variations.
ReID-reliant unsupervised tracking:
Due to the low dependence of Tracktor on its ReID model, one may argue that it might not be the best framework for evaluation of ReID models in tracking.
Hence, we also perform the same experiments on a popular tracker, DeepSORT, that is highly reliant on the ReID model used, since the only visual features it receives are from the ReID network. We replace the supervised ReID model used in DeepSORT with different ReID methods and tabulate results in Table 4.
First, we observe that replacing supervised ReID with random features causes a severe drop in performance over the supervised counterpart, with MOTA decreasing by 9.4% and IDF1 decreasing by 31.3%, demonstrating the degree of reliance on ReID in the DeepSORT framework. When substituted with features from an ImageNet-pretrained ResNet, we get a similar result: a significant improvement over SORT, yet much lower than supervised ReID performance. We further benchmark with a supervised ReID model trained on the Market1501 dataset and observe lower performance compared to the ImageNet-pretrained model, indicating that features learned on cross-camera person-ReID datasets without trajectory annotations do not transfer to multi-object tracking. Lastly, we observe that our unsupervised SimpleReID covers the remaining performance gap, as seen above.
Scope for improvement in ReID:
We observe that with modern detectors, the gap between SimpleReID and the corresponding oracles is small enough to limit the scope for further improvement.
Overall, we conclude that unsupervised SimpleReID counterintuitively matches the limiting performance of supervised counterparts in difficult MOT scenarios, by leveraging only unlabeled videos.
(CVPR2020)TubeTK: Adopting Tubes to Track Multi-Object in a One-Step Training Model
Tracking Objects as Points