Fast Human Detection Using Node-Combined Part Detector


2011 18th IEEE International Conference on Image Processing

FAST HUMAN DETECTION USING NODE-COMBINED PART DETECTOR

Song CAO
Department of Electronic Engineering, Tsinghua University, Beijing 100084, China

Genquan DUAN, Haizhou AI
Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China

ABSTRACT

Detecting people under occlusion and in articulated poses remains a major challenge in computer vision. To obtain a fast and accurate human detection algorithm, this paper proposes the Node-Combined Part Detector (NCPD) Model. We make two major contributions: (1) we propose a novel method, torso-nodes combination, to integrate part detectors; (2) we adopt stable part detectors described by Associated Pairing Comparison Features (APCF) and trained with the Real AdaBoost algorithm. The new algorithm is much faster than previous work while maintaining accuracy competitive with state-of-the-art human detection systems, and it performs better at low false alarm rates. In average time per image, our algorithm achieves a speedup of about 10x over the Deformable Part based Model (DPM) and of over 125x over the Poselet Model.

Index Terms: Object Detection, Node-Combined Part Detector, Occlusion, High Articulation

1. INTRODUCTION

Object detection, i.e. locating objects in images, is well studied in computer vision, e.g. face detection [1] and pedestrian detection [2]. However, detecting people under occlusion and high articulation remains a major challenge. There are two main difficulties in human detection: 1) Humans are non-rigid objects, which causes variations in contour, shape and color, so it is hard for one holistic classifier to describe all situations and variations. 2) Occlusions arise from a multitude of accessories such as backpacks, clothes and bags, or from other persons and objects. To handle this challenge, part based models have become popular [3] [4] [5]; they can be regarded as providing more variables to describe a highly varied object. But how should these part detectors be selected and trained? And how can they be integrated into an efficient, robust human detector?

Various algorithms have been proposed for human detection under occlusion or articulated pose. The Deformable Part based Model (DPM) [5], based on Histograms of Oriented Gradients (HOG) features [2] combined with a Latent Support Vector Machine (LSVM) training strategy, was proposed for object detection; in it, several part detectors are learned within the model root (a bounding box of the object). The authors established a star model in which each part detector has a deformable position relative to the model root. The inner part detectors give a better description of the inner details of an object, which provides more information for object detection.

Poselet is an innovative work first proposed in [6], which achieves state-of-the-art results in the detection and segmentation of humans on PASCAL Visual Object Classes (PASCAL VOC) [7]. In Poselet, the authors randomly select patches from the training images as seed poselets (a poselet can be folded hands, occluded legs, raised hands and so on). Each poselet is described by HOG features and trained with a linear SVM. The randomly selected poselet detectors are then clustered, and each makes its own prediction of the potential human location. Many weak, randomly selected poselets together indicate the human position, and this approach has achieved state-of-the-art results in PASCAL VOC human detection in recent years. However, the Poselet based detection algorithm has two issues. First, it is relatively time-consuming, because much of the time is spent detecting poselets and exploiting context among them. Second, most of the randomly selected poselet detectors have relatively low accuracy, and most poselets indicate the same body parts, such as face and head shoulder.

Looking back at progress on detection problems, Boosting-trained detectors, e.g. for face detection [1] and pedestrian detection [8], have proven to be efficient and accurate. To achieve a highly efficient detection algorithm, we propose the Node-Combined Part Detector (NCPD) Model, which involves four stable part detectors described by Associated Pairing Comparison Features (APCF) and trained with the Real AdaBoost algorithm. Our approach is an experimental study of AdaBoost based part detectors for human detection.

We consider precise, well-trained part detectors to be the key to real-time human detection under occlusion and high articulation. We select several stable part detectors and integrate them through the torso-nodes, as shown in Fig. 1. Our stable part detectors should not only have high detection accuracy but also cover most of the poselets used in [4]. Therefore, in our implementation, four stable part detectors (i.e. face, head shoulder, upper body, whole body) are adopted.

978-1-4577-1302-6/11/$26.00 ©2011 IEEE 3650



We integrate the stable part detectors through torso-nodes to establish our NCPD Model. This new human detection algorithm speeds up the detection procedure significantly while maintaining accuracy competitive with existing state-of-the-art methods.

Fig. 1. NCPD Model. The left image is the structure of our NCPD model. The right image explicitly demonstrates our stable part detectors.

Our contributions are summarized as follows: (1) The Node-Combined Part Detector (NCPD) Model is proposed to integrate stable part detectors through torso-nodes. (2) Stable part detectors are learned by AdaBoost using APCF features, which yields high efficiency in human detection.

The rest of this paper is organized as follows. Section 2 gives an overview of our approach. Section 3 presents the proposed NCPD Model. Section 4 describes how our stable part detectors are trained. Quantitative experiments and evaluations on PASCAL VOC test datasets are reported in Section 5. Finally, Section 6 offers conclusions and future work.

2. OVERVIEW OF OUR APPROACH

Our approach consists of three main steps. The first step is to train our part detectors. To improve human detection accuracy, we require the part detectors to be robust and to cover parts that exhibit few variations. Based on this idea, we train detectors for four parts: face, head shoulder, upper body and whole body, as explained in Sec. 4. The second step is to integrate the stable part detectors into an efficient, robust human detector. We propose the Node-Combined Part Detector (NCPD) Model in Sec. 3.2, where each stable part detector predicts the position of the torso-nodes. Finally, post-processing is performed by non-maximum suppression. Following this procedure, we obtain an efficient human detector that achieves competitive results on several challenging datasets.

3. NODE-COMBINED PART DETECTOR (NCPD) MODEL

3.1. Stable Part Detectors

We consider that a human under high articulation and occlusion can be described by many variables. Suppose there are N poselets in the human detection system and each poselet represents a variable; then each person can be described by an N-length vector in the poselet representation. However, an N-dimensional space (usually N > 150) is large and makes the detection task considerably more complex. Observing that some variables are redundant and carry the same semantic meaning (e.g. many poselets are similar to a face), we reduce the dimension of the space further by using a limited set of principal variables. In practice, we suggest using stable part detectors as the principal variables, since these parts vary less in a highly articulated or occluded human. Motivated by [3], we define our part detectors to be face, head shoulder, upper body and whole body. These four detectors are stable and suitable for human detection. Even in the Poselet framework, most of the effective poselets are similar to these four body parts; conversely, these four stable parts cover most of the useful poselets when poselets are applied to the detection task. We also considered adding more detectors, such as legs, left body and right body, to our algorithm. However, these parts show large variations and are less discriminative against the background. To achieve high accuracy and efficiency, we do not adopt them in our current algorithm.

3.2. Integration of Stable Part Detectors

In other tree-structured models [5] [9], all the parts are integrated by one model root. Based on the empirical observation that the torso is always under the head and has few spatial variations, our Node-Combined Part Detector (NCPD) Model, similar to Pictorial Structures [9] [10], takes the torso as its root. However, unlike [10], we adopt a new method, named torso-nodes combination, to integrate our stable part detectors into an efficient, robust human detector. Our method applies the idea of Hough voting and uses the distribution of the root configuration, instead of the spatial center of the root, to integrate our stable part detectors. After running all four stable part detectors, suppose we obtain n part detection responses, which we call part recalls, and rank them in descending order of detection score as $P_1, P_2, \ldots, P_n$, so that $P_1$ is the part recall with the highest probability. Let $L_i(N_i^1, N_i^2, N_i^3, N_i^4)$ represent the root configuration of each part $P_i$. We can regard $L_i$ as the torso-nodes distribution, where each $N_i^k$ is a Gaussian distribution learned from the training dataset. (In our implementation, the four torso-nodes are the left/right shoulders and the left/right hips.) We integrate two part recalls i and j using the Kullback-Leibler divergence as follows:

$$S_{ij} = \sum_{k=1}^{4} \left[ D_{KL}(N_i^k, N_j^k) + D_{KL}(N_j^k, N_i^k) \right] \qquad (1)$$

where $S_{ij}$ is an integration distance. If $S_{ij}$ is no larger than a threshold, then parts $P_i$ and $P_j$ belong to the same person. We integrate part recalls starting from the one with the highest score. We adopt this greedy search procedure because it uses the most reliable information first, which gives a computational advantage. We sum the scores of all part recalls that belong to one potential human location to obtain the final human detection score. In this way, we integrate our stable part detectors within a framework of spatial consistency, using the information from the less varied torso-nodes. An example of the integration strategy is shown in Fig. 2.
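The integration step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions: each torso-node distribution is modeled as a 2-D Gaussian with diagonal covariance, the second term of Eq. (1) is taken to be the reverse divergence (a symmetrized KL distance), and every candidate is compared only against the highest-scoring seed of its group. The function names and the threshold value are hypothetical.

```python
import math


def kl_diag_gaussian(p, q):
    """KL(p || q) for 2-D Gaussians with diagonal covariance.

    Each Gaussian is ((mean_x, mean_y), (var_x, var_y)).
    """
    (mp, vp), (mq, vq) = p, q
    return 0.5 * sum(
        vp[d] / vq[d] + (mq[d] - mp[d]) ** 2 / vq[d] - 1.0 + math.log(vq[d] / vp[d])
        for d in range(2)
    )


def integration_distance(nodes_i, nodes_j):
    """Symmetrized KL distance summed over the four torso-nodes (assumed form of Eq. 1)."""
    return sum(
        kl_diag_gaussian(a, b) + kl_diag_gaussian(b, a)
        for a, b in zip(nodes_i, nodes_j)
    )


def greedy_integration(parts, threshold):
    """Group part recalls into human hypotheses, highest score first.

    parts: list of (score, torso_nodes), where torso_nodes holds four Gaussians.
    Returns a list of (summed_score, member_indices).
    """
    order = sorted(range(len(parts)), key=lambda i: parts[i][0], reverse=True)
    assigned = set()
    humans = []
    for i in order:
        if i in assigned:
            continue
        assigned.add(i)
        members = [i]
        for j in order:
            if j in assigned:
                continue
            # Compare each remaining recall against the seed of this group only.
            if integration_distance(parts[i][1], parts[j][1]) <= threshold:
                assigned.add(j)
                members.append(j)
        humans.append((sum(parts[m][0] for m in members), members))
    return humans
```

Two part recalls whose predicted shoulder and hip positions nearly coincide fall into one group and their scores are summed, while a recall predicting a distant torso starts a new hypothesis.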

Fig. 2. Integration of Stable Part Detectors. Red, yellow and blue bounding boxes show the recalls of the face, head shoulder and upper body detectors, respectively. Because the torso-nodes distributions of the face and upper body are close, they are integrated into the same potential human location.

4. TRAINING STABLE PART DETECTORS

4.1. Weak Features

HOG features combined with a linear SVM are a classic method in pedestrian detection; they have the advantage of capturing gradient information but suffer from high computational cost in both memory and time. We consider both gradient and appearance features important in a detection procedure, and therefore adopt Associated Pairing Comparison Features (APCF) [8], which have proved very efficient and accurate in pedestrian detection. APCF describes, to some extent, the invariance of the color and gradient of an object, and contains two essential elements: Pairing Comparison of Color (PCC) and Pairing Comparison of Gradient (PCG). A PCC is a Boolean color comparison of two granules, and a PCG is a Boolean gradient comparison of two granules, where a granule is a square window patch. For more details, please refer to [8].
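As a rough illustration of a single pairing comparison, the sketch below compares the mean values of two square granules and returns a Boolean. It only mirrors the description given here; the precise definition of PCC, PCG and their association into APCF is in [8] and may differ. The function names, the `(x, y, size)` granule encoding and the `threshold` margin are assumptions of this sketch.

```python
def granule_mean(image, x, y, size):
    """Mean value of the square patch of side `size` with top-left corner (x, y).

    `image` is a 2-D list of numbers, e.g. one color channel or a gradient map.
    """
    total = 0.0
    for row in image[y:y + size]:
        total += sum(row[x:x + size])
    return total / (size * size)


def pairing_comparison(image, granule_a, granule_b, threshold=0.0):
    """Boolean comparison of two granules, each given as (x, y, size).

    `threshold` is a hypothetical margin, not a value taken from the paper.
    """
    diff = granule_mean(image, *granule_a) - granule_mean(image, *granule_b)
    return diff >= threshold
```

Applied to a color channel, this behaves like a color comparison; applied to a gradient magnitude map, it behaves like a gradient comparison.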

4.2. The Training Algorithm

Real AdaBoost [11] is used to learn a Nested Cascade Detector [12] for each part. Interested readers may refer to [11] [12] for more details.
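For readers unfamiliar with Real AdaBoost, the following sketch shows one boosting round for a single feature that partitions samples into bins, following the confidence-rated formulation of [11]. It does not cover the nested cascade structure of [12], and it is not the authors' training code. The function names, the binning of feature values and the smoothing constant `eps` are assumptions.

```python
import math


def fit_weak_hypothesis(bins, labels, weights, n_bins, eps=1e-6):
    """Confidence-rated outputs for one feature that maps samples to bins.

    bins[i] is in range(n_bins); labels[i] is +1 or -1; weights sum to 1.
    Returns (per-bin confidences, normalization factor Z).
    """
    w_pos = [0.0] * n_bins
    w_neg = [0.0] * n_bins
    for b, y, w in zip(bins, labels, weights):
        if y > 0:
            w_pos[b] += w
        else:
            w_neg[b] += w
    # Each bin outputs half the log-ratio of positive to negative weight.
    conf = [0.5 * math.log((p + eps) / (n + eps)) for p, n in zip(w_pos, w_neg)]
    z = 2.0 * sum(math.sqrt(p * n) for p, n in zip(w_pos, w_neg))
    return conf, z


def reweight(bins, labels, weights, conf):
    """Boosting weight update: emphasize samples the weak hypothesis gets wrong."""
    new = [w * math.exp(-y * conf[b]) for b, y, w in zip(bins, labels, weights)]
    total = sum(new)
    return [w / total for w in new]
```

In each round the feature with the smallest Z would be selected, since Z bounds the training error contributed by that round.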

5. EXPERIMENTS

We use the PASCAL VOC 2009 training dataset for training, on which we annotated the positions of the four stable parts and the torso-nodes. To demonstrate the effectiveness and efficiency of our NCPD Model, we run experiments on the PASCAL VOC test datasets using the same criterion as the PASCAL VOC detection competition: a detection is regarded as a true positive only if the ratio of its overlap area with the ground truth to their union area exceeds 50%. However, unlike previous work on the Deformable Part based Model (DPM) and Poselet, we do not use a bounding box adjustment strategy as a post-processing step, although this adjustment is reported to improve detection average precision by about 1% to 3%. All experiments are run on a computer with an Intel Core 2 CPU at 2.63 GHz and 4 GB of RAM.
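The overlap criterion can be sketched as follows. This is a simplified illustration that treats boxes as continuous `(x1, y1, x2, y2)` rectangles; the official PASCAL VOC evaluation code works on pixel coordinates and also handles multiple detections of the same object, which this sketch omits. The function names are hypothetical.

```python
def overlap_ratio(box_a, box_b):
    """Ratio of intersection area to union area for boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(detection, ground_truth, min_overlap=0.5):
    """A detection counts as correct only if its overlap ratio exceeds 50%."""
    return overlap_ratio(detection, ground_truth) > min_overlap
```

For example, a detection shifted by half its width relative to the ground truth has an overlap ratio of 1/3 and is counted as a false positive.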

Performance comparison. We compare detection accuracy with two of the best human detection methods, the Deformable Part based Model (DPM) and Poselet. The comparison with our NCPD Model is shown in Fig. 3. These ROC curves are based on the part of the PASCAL test datasets that was released with annotations. Our NCPD Model gives a detection rate that is in some cases about 5% higher than existing methods. We achieve better detection accuracy than Poselet on PASCAL VOC 2008 and 2010, and similar performance on PASCAL VOC 2009. However, we do not outperform DPM on PASCAL VOC 2010.

Speed comparison. We also measure the speedup of our model. The average time per image for each model and the speedup of the NCPD model are summarized in Table 1 and Table 2, using the PASCAL VOC 2008, 2009 and 2010 test datasets. The results show that Poselet is a time-consuming method. Our NCPD Model achieves a speedup of about 10x over DPM and of over 125x over Poselet. We acknowledge that cascade DPM [13] has already improved the speed of DPM; however, our method is still about 2x faster. Moreover, as reported in [13], cascade DPM may lose some accuracy relative to the original DPM in order to achieve high efficiency.

Fig. 4 shows some results comparing Poselet with our method. Our method deals with occlusion and articulated pose better than Poselet (e.g. (a)(b) in Fig. 4). Our NCPD model also shows its effectiveness in integrating part detectors (e.g. (c)(d) in Fig. 4). The torso-nodes combination idea helps us achieve higher performance at low false alarm rates by effectively integrating our boosted stable part detectors. We therefore obtain a fast and accurate human detection algorithm with our NCPD model.

Table 1. Average time per image for different models.

| PASCAL test dataset | 2008 | 2009 | 2010 |
|---|---|---|---|
| Poselet | 112 s | 118 s | 121 s |
| DPM | 8.95 s | 9.03 s | 9.01 s |
| NCPD model | 0.89 s | 0.87 s | 0.93 s |


Fig. 3. ROC curves comparison for three different models. (a) PASCAL 2008 dataset (197 pictures, 412 annotations). (b) PASCAL 2009 dataset (72 pictures, 162 annotations). (c) PASCAL 2010 dataset (505 pictures, 737 annotations).

Table 2. NCPD model speedup rate (based on average time per image).

| PASCAL test dataset | 2008 | 2009 | 2010 |
|---|---|---|---|
| cf. Poselet | 125.8x | 135.6x | 130.1x |
| cf. DPM | 10.1x | 10.4x | 9.7x |
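The speedup rates follow directly from the average times: each entry is the baseline time divided by the NCPD time for the same test set. The short sketch below reproduces that arithmetic; the dictionary layout and the function name are illustrative only.

```python
# Average seconds per image, copied from Table 1.
AVERAGE_TIME = {
    "Poselet": {2008: 112.0, 2009: 118.0, 2010: 121.0},
    "DPM": {2008: 8.95, 2009: 9.03, 2010: 9.01},
    "NCPD": {2008: 0.89, 2009: 0.87, 2010: 0.93},
}


def speedup_over(baseline, year, ours="NCPD"):
    """Speedup of `ours` over `baseline`, rounded to one decimal place."""
    return round(AVERAGE_TIME[baseline][year] / AVERAGE_TIME[ours][year], 1)


for model in ("Poselet", "DPM"):
    print(model, [speedup_over(model, year) for year in (2008, 2009, 2010)])
```

For instance, on the 2008 test set the Poselet speedup is 112 / 0.89, which rounds to 125.8.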

Fig. 4. Detection Results. The first row is the detection results of Poselet. The second row is the detection results of our NCPD model.

6. CONCLUSION

In this paper, we focus on human detection under occlusion and high articulation, which remains a challenging problem in computer vision. We propose the Node-Combined Part Detector (NCPD) Model, which integrates stable part detectors through the less varied torso-nodes into an efficient and robust human detector. Unlike most previous part based work, we use AdaBoost with APCF features to train our part detectors. Our approach performs well under occlusion and high articulation, and it demonstrates competitive detection accuracy and fast speed for human detection. We believe that the model described in this paper for detecting people is equally applicable to other object categories; this is the subject of ongoing research.

7. ACKNOWLEDGEMENT

This work is supported by the National Science Foundation of China under Grant No. 61075026.

8. REFERENCES

[1] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in Proc. CVPR, 2001.

[2] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proc. CVPR, 2005.

[3] G. Duan, H. Ai, and S. Lao, “A structural filter approach to human detection,” in Proc. ECCV, 2010.

[4] L. Bourdev, S. Maji, T. Brox, and J. Malik, “Detecting people using mutually consistent poselet activations,” in Proc. ECCV, 2010.

[5] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part based models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, 2010.

[6] L. Bourdev and J. Malik, “Poselets: Body part detectors trained using 3D human pose annotations,” in Proc. ICCV, 2009.

[7] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL visual object classes (VOC) challenge,” International Journal of Computer Vision, vol. 88, no. 2, 2010.

[8] G. Duan, C. Huang, H. Ai, and S. Lao, “Boosting associated pairing comparison features for pedestrian detection,” in Proc. ICCV Workshop, 2009.

[9] P. Felzenszwalb and D. Huttenlocher, “Pictorial structures for object recognition,” International Journal of Computer Vision, vol. 61, no. 1, pp. 55–79, 2005.

[10] M. Andriluka, S. Roth, and B. Schiele, “Pictorial structures revisited: People detection and articulated pose estimation,” in Proc. CVPR, 2009.

[11] R. E. Schapire and Y. Singer, “Improved boosting algorithms using confidence-rated predictions,” Machine Learning, pp. 297–336, 1999.

[12] C. Huang, H. Ai, B. Wu, and S. Lao, “Boosting nested cascade detector for multi-view face detection,” in Proc. ICPR, 2004.

[13] P. Felzenszwalb, R. Girshick, and D. McAllester, “Cascade object detection with deformable part models,” in Proc. CVPR, 2010.
