(CVPR2020)TubeTK: Adopting Tubes to Track Multi-Object in a One-Step Training Model
A concise end-to-end model, TubeTK, which needs only one-step training; it introduces the "bounding-tube" (Btube) to indicate the spatial-temporal locations of objects in a short video clip.
TubeTK provides a novel direction of multi-object tracking, and we demonstrate its potential to solve the above challenges without bells and whistles.
TubeTK has the ability to overcome occlusions to some extent without any ancillary technologies like ReID.
Video multi-object tracking (MOT) usually adopts the tracking-by-detection (TBD) framework. As a two-step method, it splits tracking into two stages: detecting the spatial positions of objects frame by frame, then linking the detections over time into individual trajectories.
The tracking strategy needs the video's spatial-temporal information (STI), but TBD breaks this semantic continuity.
Can we solve the multi-object tracking in a neat one-step framework?
We demonstrate that this much simpler one-step tracker achieves even better performance than its TBD-based counterparts.
As shown in Fig. 1, a Btube is defined by 15 values in space-time, compared with the 4 values of a traditional 2D box: 3 Bboxes plus 3 timestamps.
To predict the Btube that captures spatial-temporal information, we employ a 3D CNN framework. By treating a video as 3D data instead of a group of 2D image frames, it can extract spatial-temporal features simultaneously.
A simple IoU-based post-processing step is applied to link Btubes and form the final tracks.
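The linking step can be sketched as follows. This is a toy stand-in for the paper's post-processing, not the TubeTK code: it represents each Btube as a dict mapping frame index to Bbox and greedily merges a tube into an existing track when the mean per-frame IoU over their shared frames clears a threshold.

```python
def iou(a, b):
    """IoU of two Bboxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def tube_iou(tube_a, tube_b):
    """Mean per-frame IoU over the frames two tubes share (0 if none).
    Tubes are dicts: frame index -> Bbox."""
    shared = set(tube_a) & set(tube_b)
    if not shared:
        return 0.0
    return sum(iou(tube_a[t], tube_b[t]) for t in shared) / len(shared)

def link_tubes(tubes, thresh=0.5):
    """Greedy linking: each tube is merged into the first track whose
    tube-IoU with it exceeds thresh; otherwise it starts a new track."""
    tracks = []
    for tube in tubes:
        for track in tracks:
            if tube_iou(track, tube) >= thresh:
                track.update(tube)
                break
        else:
            tracks.append(dict(tube))
    return tracks
```

Because consecutive Btubes overlap in time, two tubes of the same target share frames where their boxes nearly coincide, which is what makes this simple IoU criterion sufficient for linking.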
Tracking-by-detection-based models: two steps, often combined with Re-ID techniques.
Bridging the gap between detection and tracking: using motion or temporal information to assist detection.
Tracking framework based on trajectories or tubes
Btube definition: Adopting a 3D Bbox to locate an object across frames is the simplest extension, but this 3D Bbox is obviously too sparse to precisely represent the target's moving trajectory.
A Btube can be uniquely identified in space and time by 15 coordinate values and it is generated by a method similar to the linear spline interpolation which splits a whole track into several overlapping Btubes.
As shown in Fig. 2 a, a Btube T is a decahedron composed of 3 Bboxes in different video frames, namely $B_s$, $B_m$, and $B_e$, which need 12 coordinate values to define; 3 further values specify their temporal positions.
This setting allows the target to change its moving direction once in a short time.
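The 15-value definition above can be sketched as a small data structure. This is illustrative, not the TubeTK code: names like `Btube` and `box_at` are assumptions, and the interpolation is the piecewise-linear form the definition implies (one segment from $B_s$ to $B_m$, one from $B_m$ to $B_e$), which is exactly what lets the target change direction once inside a tube.

```python
from dataclasses import dataclass
from typing import Tuple

Bbox = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

@dataclass
class Btube:
    """A Btube: 3 Bboxes (12 values) plus their 3 frame indices."""
    b_s: Bbox
    t_s: int
    b_m: Bbox
    t_m: int
    b_e: Bbox
    t_e: int

    def box_at(self, t: int) -> Bbox:
        """Linearly interpolate the Bbox at frame t (t_s <= t <= t_e).

        Two linear segments meet at B_m, so the tube can represent one
        change of moving direction."""
        if t <= self.t_m:
            a, b, ta, tb = self.b_s, self.b_m, self.t_s, self.t_m
        else:
            a, b, ta, tb = self.b_m, self.b_e, self.t_m, self.t_e
        w = 0.0 if tb == ta else (t - ta) / (tb - ta)
        return tuple(ai + w * (bi - ai) for ai, bi in zip(a, b))
```

For example, a tube whose target moves right from frame 0 to 2 and then down from frame 2 to 4 yields interpolated boxes along each leg, something a single axis-aligned 3D Bbox could not express.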
Generating Btubes from tracks: Btubes can only capture simple linear trajectories, so a complex target track must be disassembled into short clips in which the motion is approximately linear and can be captured by a Btube. The disassembly process is shown in Fig. 2 b: we split a whole track into multiple overlapping Btubes by extending EVERY Bbox in it to a Btube, so every 3 Bboxes form one clip and neighbouring clips overlap.
We can extend Bboxes to longer Btubes for capturing more temporal information, but long Btubes generated by linear interpolation cannot well represent the complex moving trail (see Fig. 2).
Overcoming occlusion: Btubes predict the target's motion trend, so an occluded object can still be predicted for a short time, and ID switches are reduced when two trajectories cross.
We adopt a 3D convolutional structure to capture spatial-temporal features, as widely used in video action recognition [6, 24, 19]. The whole pipeline is shown in Fig. 3.
Tube GIoU: generalizes GIoU to 3D.
(CVPR2020)Learning a Neural Solver for Multiple Object Tracking
Simple Unsupervised Multi-Object Tracking