Human Detection in Video over Large Viewpoint Changes


[Fig. 7: Robustness of our approach to video frame rate changes, shown as Recall vs. FPPI curves on Test Sequence 1 of ours, Test Sequence 2 of ours, S05 of PETS2007, and Seq. #1, #2 and #3 of ETHZ. "1" denotes the original frame rate; "1/2" denotes the testing dataset downsampled to 1/2 of the original frame rate, and likewise for "1/3", "1/4" and "1/5".]

Compared to [20][22], our approach performs better. Furthermore, we use no additional cues such as depth maps, ground-plane estimation, or occlusion reasoning, all of which are used in [20]. "HOG, IMHwd and HIKSVM", proposed in [4], combines the HOG feature [18] and the Internal Motion Histogram wavelet difference (IMHwd) descriptor [3] using a histogram intersection kernel SVM (HIKSVM), and achieves the best results among the approaches evaluated in [4]. Our approach is not as good as "HOG, IMHwd and HIKSVM" but is comparable to it, and we argue that ours is much simpler and faster. Currently, our approach takes on average about 0.29s per frame on ETHZ, 0.61s on PETS and 0.55s on our dataset.

To evaluate the robustness of our approach to video frame rate changes, we downsample the test videos to 1/2, 1/3, 1/4 and 1/5 of their original frame rate and compare against the original rate. Our approach is evaluated on the testing datasets with these frame rate changes, and the resulting ROC curves are shown in Fig. 7. The results on the three ETHZ sequences and on S05 of PETS2007 are similar across frame rates, but the results on our two collected sequences differ considerably. There are two main reasons: 1) as the frame rate drops, human motion changes more abruptly between frames; 2) human motion in near-VV videos changes more drastically than in near-HV or near-SV ones. Since our two collected sequences are near VV, human motion there changes even more drastically than in the other testing datasets. Generally speaking, our approach is robust to video frame rate changes to a certain extent.
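The downsampling itself amounts to keeping every k-th frame, which simulates a 1/k frame rate. Below is a minimal sketch of this evaluation protocol; `detect` and `evaluate` are hypothetical placeholders standing in for the detector and the Recall-vs-FPPI scoring, not the implementation used in the paper.

```python
# Minimal sketch of the frame-rate robustness protocol.
# `detect` and `evaluate` are hypothetical placeholders: `detect` maps a
# frame sequence to per-frame detections, and `evaluate` returns a
# Recall-vs-FPPI curve given detections and ground-truth annotations.

def frame_rate_study(frames, annotations, detect, evaluate):
    curves = {}
    for k in (1, 2, 3, 4, 5):             # 1, 1/2, ..., 1/5 of the original rate
        sub_frames = frames[::k]          # keep every k-th frame
        sub_truth = annotations[::k]      # subsample ground truth consistently
        detections = detect(sub_frames)   # run the detector on the slower video
        curves["1" if k == 1 else f"1/{k}"] = evaluate(detections, sub_truth)
    return curves
```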

Discussions. We now discuss the selection of EMC-Boost over MC-Boost and over a general Boost algorithm.

MC-Boost runs several classifiers together at all times, which may be good for classification problems but not for detection problems. Considering the feature sharing at the beginning of clustering and the good clusters obtained after clustering, we propose learning algorithms better suited to detection, MC and SC. Furthermore, the risk map is an essential part of MC-Boost: the risk of a sample is computed from predefined neighbors in the same class and in the opposite class, but finding a proper neighborhood definition is itself a hard problem. To preserve the merits of MC-Boost while avoiding its shortcomings, we argue that EMC-Boost is more suitable for detection problems.
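To make the neighborhood issue concrete, the sketch below shows one plausible k-nearest-neighbor risk estimate. It is our illustration rather than MC-Boost's exact definition; the choice of k and of the Euclidean metric is arbitrary, which is exactly the difficulty noted above.

```python
import numpy as np

# Illustrative neighborhood-based risk in the spirit of a risk map:
# a sample is risky when opposite-class points appear among its nearest
# neighbors. Not MC-Boost's exact formula; k and the Euclidean metric
# are arbitrary choices, which is the neighborhood-definition problem.

def knn_risk(x, samples, labels, label_of_x, k=5):
    # Assumes x itself is not among `samples`.
    dists = np.linalg.norm(samples - x, axis=1)  # distance to every sample
    nearest = np.argsort(dists)[:k]              # indices of the k nearest
    # Fraction of the k nearest neighbors from the opposite class, in [0, 1].
    return float(np.mean(labels[nearest] != label_of_x))
```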

A general Boost algorithm works well when the sample space has few variations, while EMC-Boost is designed to cluster the space into several subspaces, which makes the learning process faster. After clustering, however, a sample is accepted if it falls in any of the subspaces, so the detector may also admit more negatives. This may be why Intra-frame CF+EMC-Boost is inferior to Intra-frame CF+Boost in Fig. 6. We choose EMC-Boost over a general Boost algorithm mainly for its clustering ability; in fact, learning an I²CF+Boost detector without clustering is impractical because of the large number of weak features and positives.
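The acceptance rule after clustering can be written down directly: a window passes if any per-cluster classifier accepts it, so every extra cluster contributes its own false positives to the union. The sketch below is our illustration; the per-cluster scorers and thresholds are assumptions, not taken from the paper.

```python
# Illustrative OR-style decision over per-cluster boosted classifiers.
# `cluster_scorers` and `thresholds` are assumed for illustration.

def accept(window, cluster_scorers, thresholds):
    """Accept a window if ANY cluster's classifier accepts it.

    The union of the per-cluster acceptance regions is larger than any
    single region, which is why clustering can admit more negatives.
    """
    return any(score(window) >= thr
               for score, thr in zip(cluster_scorers, thresholds))
```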

8 Conclusion

In this paper, we propose Intra-frame and Inter-frame Comparison Features (I²CFs), the Enhanced Multiple Clusters Boost algorithm (EMC-Boost), the Multiple Video Sampling (MVS) strategy, and a two-stage tree-structured detector for detecting humans in video over large viewpoint changes. I²CFs combine appearance and motion information automatically. EMC-Boost clusters a sample space quickly and efficiently. The MVS strategy makes our approach robust to frame rate changes and to human motion. The final detector is organized as a two-stage tree structure to fully mine the discriminative features of the appearance and motion information. Evaluations on challenging datasets show the efficiency of our approach.

Several directions remain for future work. The large feature pool causes many difficulties during training, so one direction is to design a more efficient learning algorithm. The MVS strategy makes the approach more robust to frame rate, but it cannot handle arbitrary frame rates once a detector is learned; to achieve better results, another direction is to integrate object detection in video with object detection in static images and with object tracking. An interesting question about EMC-Boost is what kinds of clusters it can obtain: taking humans as an example, different poses, views, viewpoints or illumination make samples look different, and perceptual co-clusters of these samples differ under different criteria. Since the relation between discriminative features and samples is critical to the results, a further direction is to study the relations among features, objects and EMC-Boost. Our approach can also be applied to other object detection tasks, to multiple-object detection, and to object categorization.

Acknowledgement. This work is supported by the National Science Foundation of China under grant No. 61075026, and by a grant from Omron Corporation.
