Visual Tracking With Deep Learning And The Context
I. Overview of Visual Tracking
1. What is visual tracking?
These three pictures are frames 1, 40, and 80 of the same video. Given the bounding box of the running woman in the first frame, the tracker keeps the bounding box on the same woman in the later frames.
Given the initialized state (e.g., position and size) of a target object in one frame of a video, the goal of tracking is to estimate the state of the target in the subsequent frames.
Although object tracking has been studied for several decades and much progress has been made in recent years, it remains a very challenging problem.
Numerous factors affect the performance of a tracking algorithm, such as illumination variation, occlusion, and background clutter, and no single tracking approach successfully handles all scenarios.
2. Difficulties of visual tracking
Many factors limit object tracking in video, and research on target tracking faces great challenges in both theory and method.
The diversity of the target
Multiple moving targets are difficult to describe with a single unified model.
The motion patterns of the targets are very complex.
The movement of a target can change its appearance.
Mutual occlusion may occur between multiple moving objects.
The complexity of the scene
Changes in lighting and atmospheric conditions in the scene can cause serious interference.
Some regions have an appearance similar to the target.
The target may be occluded by objects in the scene.
In a dilemma
Fast but Fallible
Robust but Slow
The trade-off between real-time performance and accuracy
3. Recent algorithms for visual tracking
Based on model matching
----- Global model matching
- Create a target appearance model online or offline.
- Search the image for the region most similar to the model.
- Advantage: Works well for rigid targets.
- Disadvantage: Fails when the target's appearance changes.
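The two steps above (build a model, then search the image for the most similar region) can be sketched as exhaustive template matching. This is only a minimal illustration under stated assumptions: the appearance model is a raw grayscale patch, similarity is the sum of squared differences, and the names `ssd` and `match_template` are hypothetical helpers rather than part of any tracker discussed here.

```python
def ssd(patch, template):
    """Sum of squared differences between two equally sized 2-D lists."""
    return sum(
        (p - t) ** 2
        for prow, trow in zip(patch, template)
        for p, t in zip(prow, trow)
    )


def match_template(image, template):
    """Slide the template over the image; return ((row, col), cost) of the best match."""
    th, tw = len(template), len(template[0])
    ih, iw = len(image), len(image[0])
    best_pos, best_cost = None, float("inf")
    for r in range(ih - th + 1):
        for c in range(iw - tw + 1):
            patch = [row[c:c + tw] for row in image[r:r + th]]
            cost = ssd(patch, template)
            if cost < best_cost:
                best_pos, best_cost = (r, c), cost
    return best_pos, best_cost


# Toy 5x5 grayscale frame with a bright 2x2 target at rows 2-3, columns 1-2.
frame = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 9, 8, 0, 0],
    [0, 7, 9, 0, 0],
    [0, 0, 0, 0, 0],
]
model = [[9, 8], [7, 9]]  # appearance model taken from an earlier frame
print(match_template(frame, model))
```

Because the model is a fixed patch, any change in the target's appearance raises the matching cost everywhere, which is the weakness named in the disadvantage above.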
----- Local model matching
- The target is divided into components, and a separate model is built for each component.
- For example, a human body is divided into head, limbs, and torso.
- Advantage: Stable tracking, especially under occlusion.
- Disadvantage: Matching between components is difficult and time-consuming.
----- Feature matching
- Extract features that are invariant to translation, rotation, and scaling.
- Match these features in the current frame.
- Advantage: Insensitive to changes in the target's shape, scale, and the like.
- Disadvantage: Most image features are sensitive to ambient conditions such as changes in lighting.
Based on classification
- Treat tracking as online classification.
- One class is the target, the other is the background.
- Train a target-background classifier.
- Update the classifier with the current image frame.
- Advantage: Adapts to some extent to changes in the target.
- Disadvantage: Classification accuracy often depends on how the target features are represented.
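The classification view above can be sketched with a small online logistic-regression classifier that takes one gradient step per labeled sample in each frame. Everything here is an illustrative assumption: the class name `OnlineLogisticClassifier`, the two-dimensional feature vectors, and the learning rate are made up for the example, and real trackers use much richer features.

```python
import math


class OnlineLogisticClassifier:
    """Target-vs-background classifier updated by one SGD step per sample."""

    def __init__(self, n_features, lr=0.5):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def score(self, x):
        """Estimated probability that the patch with features x is the target."""
        z = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, label):
        """One gradient step on the log loss; label is 1 (target) or 0 (background)."""
        err = self.score(x) - label
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err


clf = OnlineLogisticClassifier(n_features=2)
# Hypothetical per-frame samples: (features, label) pairs for target and background.
frames = [
    [([0.9, 0.1], 1), ([0.1, 0.8], 0)],
    [([0.8, 0.2], 1), ([0.2, 0.9], 0)],
] * 20
for samples in frames:          # the classifier is updated with every new frame
    for x, y in samples:
        clf.update(x, y)

print(clf.score([0.85, 0.15]) > 0.5, clf.score([0.15, 0.85]) < 0.5)
```

The per-frame update is what gives this family its adaptability, and the dependence on the feature vectors `x` is the disadvantage listed above.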
Based on Bayes filtering
- Combine prior information with current information.
- The state of the target in the current frame is estimated optimally using the prior information from earlier frames.
- Typical algorithms include the **Kalman filter** and the particle filter.
- Advantage: Wide range of applications and few constraints.
- Disadvantage: Particle filters often need a large number of particles to reach the required filtering precision, and the more particles are required, the higher the complexity of the algorithm.
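As a concrete instance of combining prior and current information, here is a minimal sketch of a Kalman filter for a single image coordinate. It assumes a constant-velocity motion model, a position-only measurement, and hand-picked noise values `q` and `r`; the function name `kalman_step` and all parameters are illustrative assumptions, not taken from a specific tracker.

```python
def kalman_step(state, cov, z, dt=1.0, q=1e-3, r=0.25):
    """One predict/update cycle for state (position, velocity) and measurement z.

    cov is the 2x2 state covariance as nested tuples; q and r are the assumed
    process and measurement noise variances.
    """
    p, v = state
    (p00, p01), (p10, p11) = cov

    # Predict: x' = F x and P' = F P F^T + Q, with F = [[1, dt], [0, 1]].
    p_pred = p + dt * v
    v_pred = v
    c00 = p00 + dt * (p10 + p01) + dt * dt * p11 + q
    c01 = p01 + dt * p11
    c10 = p10 + dt * p11
    c11 = p11 + q

    # Update: only the position is measured, so H = [1, 0].
    innovation = z - p_pred
    s = c00 + r                      # innovation variance
    k0, k1 = c00 / s, c10 / s        # Kalman gain
    new_state = (p_pred + k0 * innovation, v_pred + k1 * innovation)
    new_cov = (
        ((1 - k0) * c00, (1 - k0) * c01),
        (c10 - k1 * c00, c11 - k1 * c01),
    )
    return new_state, new_cov


# Toy target moving 2 pixels per frame, observed without noise.
state, cov = (0.0, 0.0), ((10.0, 0.0), (0.0, 10.0))
for t in range(1, 51):
    state, cov = kalman_step(state, cov, z=2.0 * t)
print(state)  # position close to 100 and velocity close to 2
```

The prediction is the prior carried over from earlier frames, and the gain decides how strongly the current measurement corrects it.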
Based on deep learning (after 2015)
Deep learning has not had smooth sailing in target tracking. The main problem is the lack of training data: much of the power of deep models comes from effective training on large amounts of labeled data, whereas target tracking provides only the bounding box in the first frame as training data. It is therefore difficult to train a deep model from scratch for the current target at the start of tracking.
Several ideas:
- Pre-train the deep model on auxiliary image data and fine-tune it during online tracking. (DLT, SO-DLT; NIPS15)
- Use a CNN classification network pre-trained on an existing large-scale classification dataset to extract features. (FCNT, HCFT; ICCV15)
- Pre-train on tracking sequences. (MDNet; CVPR16)
- Use an RNN. (RTT; CVPR16)
4. Deep Learning for visual tracking
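DLT: Learning a Deep Compact Image Representation for Visual Tracking (NIPS 2013)
Pre-training: SDAE + Tiny Image dataset + unsupervised training, which yields a generic object representation;
Online tracking network: the SDAE encoder (generic feature representation) + a sigmoid classification layer (tracking as binary classification), which separates the target from the background;
Fine-tuning: positive and negative samples are drawn from the first frame, giving a classification network that is more specific to the current target and background;
Tracking in subsequent frames: a particle filter extracts patches from the current frame, and each patch is fed through the classification network in turn to obtain a confidence score;
Model update: threshold-based;
Advantage: pre-training plus fine-tuning addresses the shortage of training data.
Disadvantages: it is questionable whether a 32×32 autoencoder suits a classification-based tracking task, and a 4-layer network has limited feature representation power.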
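The per-frame loop described above (sample patches, score each with the classifier, apply a threshold for the model update) can be sketched roughly as follows. This is a heavily simplified stand-in, not the DLT implementation: Gaussian sampling around the previous position replaces a full particle filter with weights and resampling, `make_toy_confidence` is a hypothetical substitute for the fine-tuned network's sigmoid output, and the rule that an update fires when the best confidence falls below the threshold is an assumption of this sketch.

```python
import random


def track_frame(prev_pos, confidence, n_particles=200, sigma=4.0,
                update_threshold=0.8, rng=None):
    """Return (best_pos, best_conf, needs_update) for one frame."""
    rng = rng or random.Random(0)
    best_pos, best_conf = prev_pos, confidence(prev_pos)
    for _ in range(n_particles):
        # Draw a candidate patch center around the previous target position.
        cand = (prev_pos[0] + rng.gauss(0, sigma),
                prev_pos[1] + rng.gauss(0, sigma))
        conf = confidence(cand)       # score the candidate with the classifier
        if conf > best_conf:
            best_pos, best_conf = cand, conf
    # Assumed update rule: fine-tune again when even the best patch scores low.
    return best_pos, best_conf, best_conf < update_threshold


def make_toy_confidence(true_pos):
    """Hypothetical confidence function that peaks at the true target position."""
    def confidence(pos):
        d2 = (pos[0] - true_pos[0]) ** 2 + (pos[1] - true_pos[1]) ** 2
        return 1.0 / (1.0 + d2 / 10.0)
    return confidence


confidence = make_toy_confidence(true_pos=(53.0, 48.0))
pos, conf, needs_update = track_frame((50.0, 50.0), confidence)
print(pos, conf, needs_update)
```

Each candidate costs one forward pass through the network, so the number of particles directly sets the per-frame cost.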
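SO-DLT: Transferring Rich Feature Hierarchies for Robust Visual Tracking (ICCV 2015)
Online tracking: when processing frame t, the search is centered on the position predicted for frame t-1; regions of different scales are sampled from small to large and fed into the network in turn; once the probability map output by the CNN exceeds a threshold, sampling stops and the current region is taken as the best search region; the size and position of the bounding box are then determined within this final region.
Model update: CNN_S (short-term) ----> responds promptly to changes in the target; CNN_L (long-term) ----> robust to noise;
Takeaway: an ensemble reduces the sensitivity of the model update, and it is a very effective way for tracking algorithms to raise their benchmark scores.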
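The small-to-large region search in the online tracking step can be sketched as an early-exit loop. The callable `score_region` is a hypothetical placeholder that returns a scalar summary of the CNN probability map for one region; the scale values, the threshold, and the choice to report a lost target when no scale passes are assumptions of this sketch.

```python
def search_scales(center, score_region, scales=(1.0, 1.5, 2.0, 3.0), threshold=0.5):
    """Try search regions from small to large; stop at the first confident one.

    score_region(center, scale) is a hypothetical callable returning a scalar
    summary (for example the peak) of the probability map for that region.
    """
    for scale in sorted(scales):
        score = score_region(center, scale)
        if score > threshold:
            return scale, score       # stop sampling: this region is good enough
    return None, 0.0                  # assumed fallback: treat the target as lost


def toy_score(center, scale, target=(60.0, 50.0), base_radius=8.0):
    """Toy stand-in: confident only when the region is large enough to cover the target."""
    dist = ((target[0] - center[0]) ** 2 + (target[1] - center[1]) ** 2) ** 0.5
    return 0.9 if dist <= base_radius * scale else 0.1


print(search_scales((50.0, 50.0), toy_score))
```

Starting with the smallest region keeps the common case cheap, since larger regions are evaluated only when the target has moved far from the previous position.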
FCNT: Visual Tracking with Fully Convolutional Networks (ICCV 2015)
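Pre-training: VGGNet + the labeled ImageNet classification dataset;
Core idea: feature maps can be used directly to localize the tracked target;
High-level features: good at distinguishing between different categories (highly abstract)
Low-level features: good at distinguishing objects of the same category (focus on local detail)
Two convolutional branches: conv4-3 separates the target from similar-looking distractors (SNet); conv5-3 captures category information (GNet)
Online tracking: a region is sampled around the previous frame's center and fed to both SNet and GNet, producing two complementary heatmaps;
SNet: suppresses distractors
GNet: makes the target more prominent
Summary: effectively suppresses drift but is not robust to occlusion; suggests a new direction for tracking (how many layers to use, and which ones).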
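MDNet: Learning Multi-Domain Convolutional Neural Networks for Visual Tracking (CVPR 2016)
There is a large gap between image classification and actual tracking;
Image classification: targets and backgrounds combine arbitrarily, and the target must be detected against any background;
Actual tracking: once the foreground and background are given in the first frame, those in subsequent frames closely resemble the first frame;
The CNN is pre-trained directly on video sequences; the difficulty is that targets differ across sequences: a type of object that is the target in one sequence may be background in another;
Shared layers: a CNN learns a generic feature representation of targets;
Domain-specific layers: each training sequence ---> its own domain ---> its own binary classification layer ---> separates foreground from background in that sequence (this resolves the inconsistency of targets across sequences)
Determining the bounding box: in the style of R-CNN region proposals, 256 proposals are sampled around the previous frame's position, followed by bounding-box regression.
Summary: precision reaches 94.8%. Real-time performance is an open question: are detection-style region proposals suitable for online tracking (256 proposals, 89 domains)?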
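The split into shared layers and per-sequence binary heads can be illustrated with a toy model. This sketch assumes a single fixed ReLU layer in place of the shared CNN and trains only the domain heads; in the real network the shared layers are learned as well. The class name `MultiDomainClassifier` and all values are hypothetical.

```python
import math


class MultiDomainClassifier:
    """Shared feature layer plus one binary classification head per domain."""

    def __init__(self, shared_weights, n_domains):
        self.shared = shared_weights        # one row per hidden unit, used by all domains
        n_hidden = len(shared_weights)
        self.heads = [[0.0] * n_hidden for _ in range(n_domains)]

    def features(self, x):
        """Shared layers: a single ReLU layer stands in for the shared CNN."""
        return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in self.shared]

    def score(self, x, domain):
        """Target probability of sample x under the head of the given domain."""
        h = self.features(x)
        z = sum(w * hi for w, hi in zip(self.heads[domain], h))
        return 1.0 / (1.0 + math.exp(-z))

    def train_step(self, x, label, domain, lr=0.5):
        """Update only the head of `domain` (shared layer kept fixed for brevity)."""
        h = self.features(x)
        err = self.score(x, domain) - label
        self.heads[domain] = [w - lr * err * hi
                              for w, hi in zip(self.heads[domain], h)]


net = MultiDomainClassifier(shared_weights=[[1.0, 0.0], [0.0, 1.0]], n_domains=2)
obj_a, obj_b = [1.0, 0.0], [0.0, 1.0]
for _ in range(50):
    net.train_step(obj_a, 1, domain=0)  # object A is the target in sequence 0
    net.train_step(obj_b, 0, domain=0)
    net.train_step(obj_a, 0, domain=1)  # object A is background in sequence 1
    net.train_step(obj_b, 1, domain=1)
print(net.score(obj_a, 0) > 0.5, net.score(obj_a, 1) < 0.5)
```

The same object receives opposite labels in the two sequences without conflict, because each sequence trains its own head on top of the shared features.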
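Use an RNN?
These are frames 1, 10, and 20 of a video. While the car moves forward at constant speed, the video sequence shows a clear temporal correlation.
What makes the tracking task special: it is a time series, and consecutive frames are correlated.
Can a multi-directional recurrent neural network (RNN) learn the dependencies between frames in a tracking video sequence?
What is an RNN?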
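As a brief answer in code, the core of a simple (Elman-style) RNN is the recurrence h_t = tanh(W_x x_t + W_h h_{t-1} + b): the hidden state carries information from earlier inputs forward in time. The sketch below is a generic textbook recurrence with made-up weights, not the architecture of any RNN tracker mentioned in these notes.

```python
import math


def rnn_step(x, h_prev, w_x, w_h, b):
    """One recurrence step: h_t = tanh(W_x x_t + W_h h_{t-1} + b)."""
    return [
        math.tanh(
            sum(wx * xi for wx, xi in zip(w_x[j], x))
            + sum(wh * hi for wh, hi in zip(w_h[j], h_prev))
            + b[j]
        )
        for j in range(len(b))
    ]


def run_rnn(xs, w_x, w_h, b):
    """Feed a sequence through the RNN and return every hidden state."""
    h = [0.0] * len(b)
    states = []
    for x in xs:
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states


# One hidden unit; the second input is zero, yet the state stays non-zero
# because it still remembers the first input.
states = run_rnn([[1.0], [0.0]], w_x=[[1.0]], w_h=[[0.5]], b=[0.0])
print(states)
```

For tracking, `x` would be per-frame features, so the hidden state can summarize how the target looked and moved in earlier frames.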
RNN Tracker
CVPR2016
AAAI2016
5. Visual Tracking With The Context
Context information is also very important for tracking.
Recently, some approaches have been proposed that mine auxiliary objects or local visual information surrounding the target to assist tracking.
The context information is especially helpful when the target is fully occluded or leaves the image region.
To improve tracking performance, some tracker fusion methods have also been proposed recently.
Context-Aware Visual Tracking
The environment can also be advantageous to the tracker if it contains objects that are correlated with the target.
Question: Is the object being followed by the tracker really the target?
Answer: Use the dynamic environment!
How to track a face in a crowd?
- It is almost impossible to learn a discriminative model that distinguishes the face of interest from the rest of the crowd.
Why do we have to focus our attention only on the target?
- If the person (with that face) is wearing a distinctive shirt (or hat), then including the shirt (or hat) in the matching makes tracking much easier and more robust.
- If another face always accompanies the target face, the two can be treated as a geometric structure and tracked as a group.
It seems that:
- A target is seldom isolated from and independent of the rest of the scene.
- There may exist objects that have short-term or long-term motion correlations with the target.
So why not track the target and auxiliary objects as a group?
What are auxiliary objects?
- They co-occur frequently with the target.
- Their motion is consistently correlated with that of the target.
- They are suitable for tracking.
This definition may cover a large variety of image regions or features:
- Simple, generic, and low-level is better.
- Choose color regions rather than features.
- Color regions can be tracked reliably and efficiently.
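The first two criteria (frequent co-occurrence and consistent motion correlation) can be made concrete with two simple scores over short trajectories. This is an illustrative stand-in, not the mining procedure of the original method: tracks are lists of per-frame (x, y) positions with `None` for frames where the region is missing, and the names `co_occurrence` and `motion_consistency` are hypothetical.

```python
def co_occurrence(target_track, cand_track):
    """Fraction of frames containing the target in which the candidate also appears."""
    both = sum(1 for t, c in zip(target_track, cand_track)
               if t is not None and c is not None)
    seen = sum(1 for t in target_track if t is not None)
    return both / seen if seen else 0.0


def motion_consistency(target_track, cand_track):
    """Mean squared difference of frame-to-frame displacements (lower is better)."""
    diffs = []
    for i in range(1, len(target_track)):
        pts = (target_track[i - 1], target_track[i], cand_track[i - 1], cand_track[i])
        if any(p is None for p in pts):
            continue                  # skip frames where either region is missing
        t0, t1, c0, c1 = pts
        dx = (t1[0] - t0[0]) - (c1[0] - c0[0])
        dy = (t1[1] - t0[1]) - (c1[1] - c0[1])
        diffs.append(dx * dx + dy * dy)
    return sum(diffs) / len(diffs) if diffs else float("inf")


target = [(0, 0), (2, 1), (4, 2), (6, 3)]        # e.g. a face
shirt = [(0, 10), (2, 11), (4, 12), (6, 13)]     # color region moving with the face
wall = [(20, 20), (20, 20), (20, 20), (20, 20)]  # static background region

print(co_occurrence(target, shirt), motion_consistency(target, shirt))
print(co_occurrence(target, wall), motion_consistency(target, wall))
```

In this toy data the shirt region would qualify as an auxiliary object, while the wall co-occurs with the target but does not share its motion.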
Experiments
(The yellow bounding box is the target; the red boxes are the color regions.)
Tracking the Invisible: Learning Where the Object Might be
It is well known that context helps in object detection.
One of the strongest predictors of a vehicle's presence and location in an image is the shadow it casts on the road.
In tracking, many temporary but potentially very strong links exist between the tracked object and the rest of the image.
Local image features vote for the object position.
- An Implicit Shape Model is used to choose the local image features.
- Object points lie on the object surface and thus always have a strong correlation with the object motion (green points).
- Points on other independently moving objects or in the static background are considered to carry no information about the object position (blue points).
- Supporters are features that are useful for predicting the target object's position. They move, at least temporarily, in a way that is statistically related to the motion of the target (red points).
The position of an object can be estimated even when it is not seen directly (e.g., when it is fully occluded or outside the image region).
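The voting idea can be sketched as a weighted average: each supporter remembers its offset to the target from frames where both were visible, and later votes for "my current position plus that offset". The tuple layout, the weights, and the function name `vote_for_target` are assumptions for illustration; the original method learns and weights its supporters in a more elaborate way.

```python
def vote_for_target(supporters):
    """Weighted mean of the target positions predicted by the supporters.

    Each supporter is (position, offset_to_target, weight), where the offset was
    recorded while the target was still visible. Returns None without usable votes.
    """
    total = sum(w for _, _, w in supporters)
    if total == 0:
        return None
    x = sum(w * (p[0] + o[0]) for p, o, w in supporters) / total
    y = sum(w * (p[1] + o[1]) for p, o, w in supporters) / total
    return (x, y)


# The target itself is occluded; three supporters are still visible.
supporters = [
    ((10.0, 10.0), (5.0, 0.0), 1.0),    # votes for (15, 10)
    ((20.0, 12.0), (-5.0, -2.0), 1.0),  # votes for (15, 10)
    ((0.0, 0.0), (16.0, 11.0), 0.5),    # weaker supporter, votes for (16, 11)
]
print(vote_for_target(supporters))
```

Because the estimate depends only on the supporters, it remains available while the target is occluded or outside the image.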
How to choose the supporter?
Experiments
We can see what we can not see
Context Tracker: Exploring Supporters and Distracters
Visual tracking is very challenging when the target leaves the field of view: the tracker may start following another, similar object and fail to reacquire the right target when it reappears.
There is additional information which can be exploited instead of using only the object region.
What are supporters and distracters?
Distracters
- Regions with an appearance similar to the target
- Consistently co-occur with the target
- The tracker must keep tracking these distracters to avoid drifting
- Dangerous
Supporters
- Local key-points around the target
- Consistently co-occur with the target
- Motion correlated with the target
- Useful
Experiments
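6. Directions for Visual Tracking
Improve the feature description of the target
- Sufficiently strong features can cope with the vast majority of adverse environmental effects
Improve the real-time performance of the system
- Search strategies have to traverse many redundant regions, which severely hurts the real-time performance of tracking algorithms
- How can the target search range be narrowed?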