
Human Detection in Video over Large Viewpoint Changes

Genquan Duan¹, Haizhou Ai¹, and Shihong Lao²

¹ Computer Science & Technology Department, Tsinghua University, Beijing, China
ahz@mail.tsinghua.edu.cn
² Core Technology Center, Omron Corporation, Kyoto, Japan
lao@ari.ncl.omron.co.jp

Abstract. In this paper, we aim to detect humans in video over large viewpoint changes, which is very challenging due to the diversity of human appearance and motion across a wide spread of viewpoints compared with a common frontal viewpoint. We propose 1) a new feature called the Intra-frame and Inter-frame Comparison Feature to combine both appearance and motion information, 2) an Enhanced Multiple Clusters Boost algorithm to co-cluster the samples of various viewpoints and discriminative features automatically, and 3) a Multiple Video Sampling strategy to make the approach robust to human motion and frame rate changes. Due to the large amount of samples and features, we propose a two-stage tree structure detector, using only appearance in the 1st stage and both appearance and motion in the 2nd stage. Our approach is evaluated on several challenging real-world scenes, the PETS2007 dataset, the ETHZ dataset and our own collected videos, which demonstrates the effectiveness and efficiency of our approach.

1 Introduction

Human detection and tracking have been intensively researched in computer vision in recent years due to their wide spread of potential applications in real-world tasks such as driver assistance systems and visual surveillance systems, in which real-time and high-accuracy performance is required.

In this paper, our problem is to detect humans in video over large viewpoint changes as shown in Fig. 1. It is very challenging due to the following issues: 1) A large variation in human appearance over wide viewpoint changes is caused by different poses, positions and views; 2) Lighting changes and background clutter as well as occlusions make human detection much harder; 3) Human motion is diverse, since any direction of movement is possible and the shape and size of a human in video may change as he moves; 4) Video frame rates are often different; 5) The camera is sometimes moving.

Fig. 1: Human detection in video over large viewpoint changes. Samples of three typical viewpoints and corresponding scenes are given.

The basic idea for human detection in video is to combine both appearance and motion information. There are mainly three difficult problems to be explored in this detection task: 1) How to combine both appearance and motion information to generate discriminative features? 2) How to train a detector to cover such a large variation in human appearance and motion over wide viewpoint changes? 3) How to deal with changes in video frame rate or abrupt motion if using motion features on several consecutive frames? Viola and Jones [1] first made use of appearance and motion information in object detection, where they trained AdaBoosted classifiers with Haar features on two consecutive frames. Later Jones and Snow [2] extended this work by proposing an appearance filter, a difference filter and a shifted difference filter on 10 consecutive frames and using several predefined categories of samples. The approaches in [1] [2] can solve the 1st problem, but still face the challenge of the 3rd problem. The approach in [2] can handle the 2nd problem to some extent, but since even a human sometimes cannot tell which predefined category a moving object belongs to, its application is limited, while the approach in [1] trains detectors by mixing all positives together. Dalal et al. [3] combined HOG descriptors and some motion-based descriptors together to detect humans with possibly moving cameras and backgrounds. Wojek et al. [4] proposed to combine multiple and complementary feature types and incorporate motion information for human detection, which coped with moving cameras and cluttered backgrounds well and achieved promising results on humans with a common frontal viewpoint. In this paper, our aim is to design a novel feature that takes advantage of both appearance and motion information, and to propose an efficient learning algorithm that learns a practical detector of rational structure even when the samples are tremendously diverse, handling the difficulties mentioned above in one framework.

The rest of this paper is organized as follows. Related work is introduced in Sec. 2. The proposed feature (I²CF), the co-clustering algorithm (EMC-Boost) and the sampling strategy (MVS) are given in Sec. 3, Sec. 4 and Sec. 5 respectively, and they are integrated to handle human detection in video in Sec. 6. Experiments and conclusions are given in Sec. 7 and the last section respectively.

2 Related Work

In the literature, human detection in video can be roughly divided into four categories. 1) Detection in static images as in [5] [6] [7]. APCF [5], HOG [6] and Edgelet [7] are defined on appearance only. APCF compares colors or gradient orientations of two squares in images, which can describe the invariance of color and gradient of an object to some extent. HOG computes an oriented gradient distribution in a rectangular image window. An edgelet is a short segment of line or curve, which is predefined based on prior knowledge. 2) Detection over videos as in [1] [2]. Both of them were already mentioned in the previous section. 3) Object tracking as in [8] [9]. Some methods need manual initialization as in [8], and some work with the aid of detection as in [9]. 4) Detecting events or human behaviors. 3D volumetric features [10] are designed for event detection, which can be 3D Haar-like features. ST-patches [11] are used for detecting behaviors. Inspired by those works, we propose Intra-frame and Inter-frame Comparison Features (I²CFs) to combine appearance and motion information.

Due to the large variation in human appearance over wide viewpoint changes, it is impossible to train a usable detector by taking the sample space as a whole. The solution is divide and conquer: cluster the sample space into subspaces during training. A subspace can be dealt with as one class, and the difficulty lies mainly in clustering the sample space. An efficient way is to cluster the sample space automatically as in [12] [13] [14]. Clustered Boosting Tree (CBT) [14] splits the sample space automatically using the already learned discriminative features during the training process for pedestrian detection. Mixture of Experts (MoE) [12] jointly learns multiple classifiers and data partitions. It emphasizes local experts and is suitable when input data can be naturally divided into homogeneous subsets, which is not the case even for a fixed viewpoint of a human as shown in Fig. 1. MC-Boost [13] co-clusters images and visual features by simultaneously learning image clusters and boosting classifiers. A risk map, defined on pixel-level distances between samples, is also used to reduce the search space of the weak classifiers in [13]. To solve our problem, we propose an Enhanced Multiple Clusters Boost (EMC-Boost) algorithm to co-cluster the sample space and discriminative features automatically, which combines the benefits of Cascade [15], CBT [14] and MC-Boost [13]. The selection of EMC-Boost instead of MC-Boost is discussed in Sec. 7.

Our contributions are summarized in four folds: 1) Intra-frame and Inter-frame Comparison Features (I²CFs) are proposed to combine appearance and motion information for human detection in video over large viewpoint changes; 2) an Enhanced Multiple Clusters Boost (EMC-Boost) algorithm is proposed to co-cluster the sample space and discriminative features automatically; 3) a Multiple Video Sampling (MVS) strategy is used to make our approach robust to human motion and video frame rate changes; 4) a two-stage tree structure detector is presented to fully mine the discriminative features of the appearance and motion information. The experiments in challenging real-world scenes show that our approach is robust to human motion and frame rate changes.

3 Intra-frame and Inter-frame Comparison Features

3.1 Granular space

Our proposed discriminative feature is defined in granular space [16]. A granule is a square window patch in a grey image, represented as a triplet g(x, y, s), where (x, y) is the position and s is the scale. For instance, g(x, y, s) indicates that the size of this granule is 2^s × 2^s and its left-top corner is at position (x, y) of an image. In an image I, it is calculated as

g(x, y, s) = \frac{1}{2^s \times 2^s} \sum_{j=0}^{2^s - 1} \sum_{k=0}^{2^s - 1} I(x + k, y + j).   (1)

s is set to 0, 1, 2 or 3 in this paper, and the four typical granules are shown in Fig. 2 (a).

In order to calculate the distance between two granules, the granular space G is mapped into a 3D space I, where for each element g ∈ G and γ ∈ I, g(x, y, s) → γ(x + 2^s, y + 2^s, 2^s). The distance between two granules in G is defined to be the Euclidean distance between the two corresponding points in I, d(g_1, g_2) = d(γ_1, γ_2), where g_1, g_2 ∈ G, γ_1, γ_2 ∈ I and γ_1, γ_2 correspond to g_1, g_2 respectively:

d(γ_1(x_1, y_1, z_1), γ_2(x_2, y_2, z_2)) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2}.   (2)
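To make Eqs. 1 and 2 concrete, the following sketch (our own illustration, not code from the paper) computes granule values with an integral image so that each granule costs O(1), and evaluates the granule distance after the 3D mapping; NumPy and all function names are our choices.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero first row/column."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.float64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def granule(ii, x, y, s):
    """Mean intensity of the 2^s x 2^s patch with top-left corner (x, y), as in Eq. 1."""
    w = 2 ** s
    total = ii[y + w, x + w] - ii[y, x + w] - ii[y + w, x] + ii[y, x]
    return total / (w * w)

def granule_distance(g1, g2):
    """Euclidean distance of Eq. 2 after mapping g(x, y, s) -> (x + 2^s, y + 2^s, 2^s)."""
    (x1, y1, s1), (x2, y2, s2) = g1, g2
    p1 = np.array([x1 + 2 ** s1, y1 + 2 ** s1, 2 ** s1], dtype=np.float64)
    p2 = np.array([x2 + 2 ** s2, y2 + 2 ** s2, 2 ** s2], dtype=np.float64)
    return float(np.linalg.norm(p1 - p2))

# Example on a random 58 x 58 grey sample (the training sample size used in Sec. 3.3).
img = np.random.randint(0, 256, (58, 58)).astype(np.float64)
ii = integral_image(img)
print(granule(ii, 2, 4, 1), granule_distance((2, 4, 1), (3, 11, 2)))
```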

3.2 Intra-frame and Inter-frame Comparison Features (I²CFs)

Similar to the approach in [1], we consider two frames each time, the previous one and the latter one, from which two pairs of granules are extracted to fully capture the appearance and motion features of an object. An I²CF can be represented as a five-tuple c = (mode, g_1^i, g_1^j, g_2^i, g_2^j), which is also called a cell according to [5]. The mode is Appearance mode, Difference mode or Consistent mode. g_1^i, g_1^j, g_2^i and g_2^j are four granules. The first pair of granules, g_1^i and g_1^j, are from the previous frame to describe the appearance of an object. The second pair of granules, g_2^i and g_2^j, come from the previous or the latter frame to describe either appearance or motion information. When the second pair is from the previous frame, which means that both pairs are from the previous frame, this kind of feature is an Intra-frame Comparison Feature (Intra-frame CF); when the second pair comes from the latter frame, the feature becomes an Inter-frame Comparison Feature (Inter-frame CF). These two kinds of comparison features are combined into the Intra-frame and Inter-frame Comparison Feature (I²CF).

Appearance mode (A-mode). The Pairing Comparison of Color feature (PCC) is proved to be simple, fast and efficient in [5]. As PCC can describe the invariance of color to some extent, we extend this idea to 3D space. A-mode compares two pairs of granules simultaneously:

f_A(g_1^i, g_1^j, g_2^i, g_2^j) = (g_1^i ≥ g_1^j) && (g_2^i ≥ g_2^j).   (3)

The PCC feature is a special case of A-mode: f_A(g_1^i, g_1^j, g_2^i, g_2^j) = g_1^i ≥ g_1^j when g_1^i == g_2^i and g_1^j == g_2^j.

Difference mode (D-mode). D-mode computes the absolute subtractions of two pairs of granules, defined as:

f_D(g_1^i, g_1^j, g_2^i, g_2^j) = |g_1^i − g_2^i| ≥ |g_1^j − g_2^j|.   (4)


Fig. 2: Our proposed I²CF. (a) Granular space with four scales (s = 0, 1, 2, 3) of granules, which comes from [5]. (b) Two granules g_1 and g_2 connected by a solid line form one pair of granules as applied in APCF [5]. (c) Two pairs of granules are used in each cell of I²CF. The solid line between g_1 and g_2 (or g_3 and g_4) means that g_1 and g_2 (or g_3 and g_4) come from the same frame. The dashed line connecting g_1 and g_3 (or g_2 and g_4) means that the locations of g_1 and g_3 (or g_2 and g_4) are related. This relation of locations is shown in (d); for example, g_3 is in the neighborhood of g_1. This constraint reduces the feature pool a lot but still preserves the discriminative weak features.

The motion filters in [1] [2] calculate the difference between one region and a shifted one, obtained by moving it up, down, left or right by 1 or 2 pixels in the second frame. There are three main differences between D-mode and those methods: 1) the restriction on the locations of these regions is defined spatially and is much looser; 2) D-mode considers two pairs of regions each time; 3) the only operation of D-mode is a comparison operator after subtractions.

Consistent mode (C-mode). C-mode compares the sums of two pairs of granules to take advantage of consistent information in the appearance of one frame or successive frames, defined as:

f_C(g_1^i, g_1^j, g_2^i, g_2^j) = (g_1^i + g_2^i) ≥ (g_1^j + g_2^j).   (5)

C-mode is much simpler and can be calculated quickly compared with 3D volumetric features [10] and spatial-temporal patches [11].

An I²CF of length n is represented as {c_0, c_1, ..., c_{n−1}} and its feature value is defined as a binary concatenation of the corresponding functions of the cells in reverse order, f_{I²CF} = [b_{n−1} b_{n−2} ... b_1 b_0], where b_k = f(mode, g_1^i, g_1^j, g_2^i, g_2^j) for 0 ≤ k < n and

f(mode, g_1^i, g_1^j, g_2^i, g_2^j) =
  \begin{cases}
    f_A(g_1^i, g_1^j, g_2^i, g_2^j), & mode = A, \\
    f_D(g_1^i, g_1^j, g_2^i, g_2^j), & mode = D, \\
    f_C(g_1^i, g_1^j, g_2^i, g_2^j), & mode = C.
  \end{cases}   (6)
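As an illustration of Eqs. 3-6, the sketch below (ours; the cell representation and helper names are assumptions) evaluates one cell under each mode and concatenates the resulting bits into an I²CF value, with granule values assumed to be precomputed, e.g. by the integral-image helper above.

```python
def eval_cell(mode, g1i, g1j, g2i, g2j):
    """Binary response of one cell; g1i..g2j are granule values (mean intensities)."""
    if mode == 'A':    # Appearance mode, Eq. 3
        return int(g1i >= g1j and g2i >= g2j)
    if mode == 'D':    # Difference mode, Eq. 4
        return int(abs(g1i - g2i) >= abs(g1j - g2j))
    if mode == 'C':    # Consistent mode, Eq. 5
        return int((g1i + g2i) >= (g1j + g2j))
    raise ValueError(mode)

def i2cf_value(cells, granule_values):
    """Concatenate the cell bits b_{n-1} ... b_0 into one integer feature value (Eq. 6).

    cells: list of (mode, key1i, key1j, key2i, key2j); granule_values: dict key -> value.
    """
    value = 0
    for k, (mode, k1i, k1j, k2i, k2j) in enumerate(cells):
        b = eval_cell(mode, granule_values[k1i], granule_values[k1j],
                      granule_values[k2i], granule_values[k2j])
        value |= b << k   # bit k corresponds to cell c_k
    return value

# A toy I2CF with two cells over four hypothetical granules a, b, c, d.
vals = {'a': 120.0, 'b': 90.0, 'c': 118.0, 'd': 95.0}
cells = [('A', 'a', 'b', 'c', 'd'), ('D', 'a', 'b', 'c', 'd')]
print(i2cf_value(cells, vals))  # 0b01 = 1 here
```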

3.3 Heuristic learning of I²CFs

Feature reduction. For 58 × 58 samples, there are \sum_{s=0}^{3} (58 − 2^s + 1) × (58 − 2^s + 1) = 12239 granules in total, and the feature pool contains about 3 × (12239^2)^2 ≃ 6.7 × 10^{16} weak features without any restrictions, which makes the training time and memory requirements impractical. With the distance of two granules defined in Sec. 3.1, two effective constraints are introduced into I²CF: 1) motivated by [5], the first pair of granules in an I²CF is constrained by d(g_1^i, g_1^j) ≤ T_1; 2) considering the consistency within one frame or between two nearby video frames, we constrain the second pair of granules in an I²CF to lie in the neighborhood of the first pair, as shown in Fig. 2 (d):

d(g_1^i, g_2^i) ≤ T_2,  d(g_1^j, g_2^j) ≤ T_2.   (7)

We set T_1 = 8, T_2 = 4 in our experiments.
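A quick sanity check of these counts, using our own arithmetic; the pool-size expression assumes each cell freely chooses its four granules and one of the three modes before the constraints are applied:

```python
# Granule count for a 58 x 58 sample (Sec. 3.3).
count = sum((58 - 2 ** s + 1) ** 2 for s in range(4))
print(count)                    # 12239 granules

# Unconstrained weak-feature pool: 3 modes, each cell picks two pairs of granules.
print(f"{3 * count ** 4:.1e}")  # roughly 6.7e+16, hence the constraints T1 and T2
```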

Table 1: Learning algorithm of I²CF.

Input: Sample set S = {(x_i, y_i) | 1 ≤ i ≤ m} where y_i = ±1.
Initialize: Cell space (CS) with all possible cells and an empty I²CF.
Output: The learned I²CF.
Loop:
– Learn the first pair of granules as in [5]. Denote the best f pairs as a set F.
– Construct a new set CS': in each cell of CS', the first pair of granules is from F, the second pair of granules is generated by Eq. 7, and its mode is A-mode, D-mode or C-mode. Calculate the Z value of the I²CF after adding each cell in CS'.
– Select the cell with the lowest Z value, denoted as c*. Add c* to the I²CF.
– Refine the I²CF by replacing one or two granules in it without changing the mode.

Heuristic learning of an I²CF starts with an empty I²CF. Each time, the most discriminative cell is selected and added to the I²CF. The discriminability of a weak feature is measured by its Z value, which reflects the classification power of the weak classifier as in [17]:

Z = 2 \sum_j \sqrt{W_+^j W_-^j},   (8)

where W_+^j is the weight of positive samples that fall into the j-th bin and W_-^j is that of negatives. The smaller the Z value, the more discriminative the weak feature. The learning algorithm of I²CF is summarized in Table 1. (See more details in [5] [16].)
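A minimal sketch of the Z-value criterion of Eq. 8, assuming the weak feature's responses have already been quantized into bins; the binning and the sample weights come from the boosting loop and are only mocked here:

```python
import numpy as np

def z_value(bin_ids, labels, weights, n_bins):
    """Z = 2 * sum_j sqrt(W+_j * W-_j); smaller means a more discriminative feature."""
    w_pos = np.zeros(n_bins)
    w_neg = np.zeros(n_bins)
    for b, y, w in zip(bin_ids, labels, weights):
        if y > 0:
            w_pos[b] += w
        else:
            w_neg[b] += w
    return 2.0 * np.sum(np.sqrt(w_pos * w_neg))

# Mock data: feature responses of 6 samples quantized into 4 bins.
bins    = np.array([0, 0, 1, 2, 3, 3])
labels  = np.array([+1, +1, +1, -1, -1, -1])
weights = np.full(6, 1.0 / 6)
print(z_value(bins, labels, weights, 4))  # 0 here: positives and negatives never share a bin
```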

4 EMC-Boost

We propose EMC-Boost to co-cluster the sample space and discriminative features automatically. A perceptual clustering problem is shown in Fig. 3 (a)-(c). EMC-Boost consists of three components: the Cascade Component (CC), the Mixed Component (MC) and the Separated Component (SC). The three components are combined to form EMC-Boost. In fact, SC is similar to MC-Boost [13], which is the reason our boosting algorithm is named EMC-Boost. In the following, we first formulate the three components explicitly, then describe their learning algorithms, and summarize EMC-Boost at the end of this section.


Fig. 3: A perceptual clustering problem in (a)-(c) and a general EMC-Boost in (d), where CC, MC and SC are the three components of EMC-Boost.

4.1 Three components CC/MC/SC

CC deals with a standard 2-class classification problem that can be solved by any boosting algorithm. MC and SC deal with K clusters. We formulate the detectors of MC and SC as K strong classifiers, each of which is a linear combination of weak learners, H_k(x) = \sum_t \alpha_{kt} h_{kt}(x), k = 1, ..., K, with a threshold θ_k (default 0). Note that the K classifiers H_k(x), k = 1, ..., K, are the same in MC with K different thresholds θ_k, which means H_1(x) = H_2(x) = ... = H_K(x), but they are totally different in SC. We present MC and SC uniformly below.

The score y_{ik} of the i-th sample belonging to the k-th cluster is computed as y_{ik} = H_k(x_i) − θ_k. Therefore, the probability of x_i belonging to the k-th cluster is P_{ik}(x_i) = \frac{1}{1 + e^{-y_{ik}}}. To aggregate all scores of one sample over the K classifiers, we use a Noisy-OR formulation as in [18] [13]:

P_i(x) = 1 − \prod_{k=1}^{K} (1 − P_{ik}(x_i)).   (9)

The cost function is defined as J = \prod_i P_i^{t_i} (1 − P_i)^{1 − t_i}, where t_i ∈ {0, 1} is the label of the i-th sample; maximizing J is equivalent to maximizing the log-likelihood

\log J = \sum_i t_i \log P_i + (1 − t_i) \log(1 − P_i).   (10)
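A small sketch of Eqs. 9 and 10, assuming the K strong-classifier scores y_{ik} = H_k(x_i) − θ_k are already available (random placeholders below):

```python
import numpy as np

def noisy_or(scores):
    """scores: (m, K) array of y_ik; returns P_i of Eq. 9 for each of the m samples."""
    p_ik = 1.0 / (1.0 + np.exp(-scores))          # per-cluster membership probability
    return 1.0 - np.prod(1.0 - p_ik, axis=1)      # Noisy-OR over the K clusters

def log_likelihood(p_i, t, eps=1e-12):
    """log J of Eq. 10; t is the 0/1 label vector."""
    p = np.clip(p_i, eps, 1.0 - eps)
    return float(np.sum(t * np.log(p) + (1 - t) * np.log(1 - p)))

rng = np.random.default_rng(0)
scores = rng.normal(size=(5, 3))        # 5 samples, K = 3 clusters
t = np.array([1, 1, 0, 1, 0])
p = noisy_or(scores)
print(p, log_likelihood(p, t))
```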

4.2 Learning algorithms of CC/MC/SC

The learning algorithm of CC is directly Real AdaBoost [17]. The learning algorithms of MC and SC differ from that of CC. At the t-th round of boosting, MC and SC learn weak classifiers that maximize \sum_{k=1}^{K} \sum_i w_{ki} h_{kt}(x_i) and \sum_i w_{ki} h_{kt}(x_i) respectively. Initially, the sample weights are: 1) for positives, w_{ki} = 1 if x_i ∈ k and w_{ki} = 0 otherwise, where i denotes the i-th sample and k denotes the k-th cluster or classifier; 2) for all negatives we set w_{ki} = 1/K. Following the AnyBoost method [19], we set the sample weights to the derivative of the cost function w.r.t. the classifier score. The weight of the k-th classifier over the i-th sample is updated by

w_{ki} = \frac{\partial \log J}{\partial y_{ki}} = \frac{t_i − P_i}{P_i} P_{ki}(x_i).   (11)

We sum up the training algorithms of MC and SC in Table 2.

Table 2: Learning algorithms of MC and SC.

Input: Sample set S = {(x_i, y_i) | 1 ≤ i ≤ m} where y_i = ±1; detection rate r in each layer.
Output: H_k(x) = \sum_t \alpha_{kt} h_{kt}(x), k = 1, ..., K.
Loop: For t = 1, ..., T
(MC)
– Find weak classifiers h_t (h_{kt} = h_t, k = 1, ..., K) that maximize \sum_{k=1}^{K} \sum_i w_{ki} h_{kt}(x_i).
– Find the weak-learner weights α_{kt} (k = 1, ..., K) that maximize Γ(H + α_{kt} h_{kt}).
– Update weights by Eq. 11.
(SC) For k = 1, ..., K
– Find weak classifiers h_{kt} that maximize \sum_i w_{ki} h_{kt}(x_i).
– Find the weak-learner weights α_{kt} that maximize Γ(H + α_{kt} h_{kt}).
– Update weights by Eq. 11.
Update thresholds θ_k (k = 1, ..., K) to satisfy detection rate r.
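The sketch below illustrates one MC-style round on toy data: pick the weak learner shared by all K classifiers that maximizes \sum_k \sum_i w_{ki} h(x_i), then refresh the weights by Eq. 11. The stump pool, the fixed step in place of the line search for α_{kt}, and all names are our simplifications, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
m, K, d = 200, 2, 10                       # samples, clusters, feature dimension
X = rng.normal(size=(m, d))
t = (rng.random(m) < 0.5).astype(int)      # 1 = positive, 0 = negative
cluster = rng.integers(0, K, size=m)       # initial cluster assignment

# Initial weights (Sec. 4.2): positives belong to one cluster, negatives to all equally.
W = np.zeros((K, m))
W[cluster[np.newaxis, :] == np.arange(K)[:, np.newaxis]] = 1.0
W[:, t == 0] = 1.0 / K

H = np.zeros((K, m))                       # strong-classifier scores H_k(x_i), thresholds = 0
stumps = [(j, thr) for j in range(d) for thr in (-0.5, 0.0, 0.5)]   # toy weak-learner pool

def stump_out(j, thr):
    return np.where(X[:, j] >= thr, 1.0, -1.0)

for _ in range(5):                         # a few MC rounds
    # Weak learner shared by all K classifiers: maximize sum_k sum_i w_ki h(x_i).
    best = max(stumps, key=lambda s: np.sum(W * stump_out(*s)))
    H += 0.1 * stump_out(*best)            # fixed alpha instead of the paper's line search
    # Weight update, Eq. 11: w_ki = (t_i - P_i) / P_i * P_ki.
    P_ki = 1.0 / (1.0 + np.exp(-H))
    P_i = 1.0 - np.prod(1.0 - P_ki, axis=0)
    W = (t - P_i) / np.clip(P_i, 1e-12, None) * P_ki
print(W.shape)
```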

4.3 A general EMC-Boost

The three components of EMC-Boost have different properties. CC takes all samples as one cluster, while MC and SC consider the whole sample space to consist of multiple clusters. CC tends to distinguish positives from negatives, while SC tends to cluster the sample space; MC can do both at the same time, but not as accurately as CC in separating positives from negatives or as SC in clustering. Compared with SC, one particular advantage of MC is sharing weak features among all clusters. We combine these three components into a general EMC-Boost as shown in Fig. 3 (d), which contains five steps (note that Step 2 is similar to CBT [14]):

Step 1. CC learns a classifier for all samples considered as one category.
Step 2. A K-means algorithm clusters the sample space with the learned weak features.
Step 3. MC clusters the sample space coarsely.
Step 4. SC clusters the sample space further.
Step 5. CC learns a classifier for each cluster center.

5 Multiple Video Sampling

Fig. 4: The MVS strategy and some positive samples.

In order to deal with changes in video frame rate or abrupt motion, we introduce a Multiple Video Sampling (MVS) strategy as illustrated in Fig. 4 (a) and (b). Considering five consecutive frames in (a), a positive sample is made up of two frames, where one is the first frame and the other is from the next four frames, as shown in (b). In other words, one annotation corresponds to five consecutive frames and generates 4 positives. Some more positives are shown in Fig. 4 (c). Suppose that the original frame rate is R and the used positives consist of the 1st and the r-th frames (r > 1); then the possible frame rate covered by the MVS strategy is R/(r − 1). If these positives are extracted from 30 fps videos, the trained detector is able to deal with 30 fps (30/1), 15 fps (30/2), 10 fps (30/3) and 7.5 fps (30/4) videos, where r is 2, 3, 4 and 5 respectively.

Fig. 5: Two-stage tree structure in (a) and an example in (b). The number in the box gives the percentage of samples belonging to that branch.
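To make the sampling concrete, this small helper (ours, not from the paper) lists the frame pairs generated from one annotation and the frame rate each pair effectively covers:

```python
def mvs_pairs(first_frame, n_frames=5, fps=30.0):
    """One annotation on `first_frame` yields pairs (first, first + r - 1) for r = 2..n_frames,
    each covering an effective frame rate of fps / (r - 1)."""
    return [((first_frame, first_frame + r - 1), fps / (r - 1)) for r in range(2, n_frames + 1)]

for (a, b), covered in mvs_pairs(100):
    print(f"frames ({a}, {b}) -> covers {covered:g} fps")
# frames (100, 101) -> covers 30 fps ... frames (100, 104) -> covers 7.5 fps
```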

6 Overview of Our Approach

We adopt EMC-Boost, selecting I²CFs as weak features, to learn a strong classifier for multiple-viewpoint human detection, in which positive samples are obtained through the MVS strategy. Due to the large amount of samples and features, it is difficult to learn a detector directly with a general EMC-Boost. We modify the detector structure slightly and propose a new structure containing two stages, as shown in Fig. 5 (a) with an example in (b), which is called the two-stage tree structure: in the 1st stage, only appearance information is used for learning and clustering; in the 2nd stage, both appearance and motion information are used, first for clustering and then for learning classifiers for all clusters.

7 Experiments

We carry out experiments to evaluate our approach by False Positives Per Image (FPPI) on several challenging real-world datasets: ETHZ, PETS2007 and our own collected dataset. When the intersection between a detection response and a ground-truth box is larger than 50% of their union, we consider it to be a successful detection. Only one detection per annotation is counted as correct. For simplicity, the three typical viewpoints mentioned in Fig. 1 are denoted Horizontal Viewpoint (HV), Slant Viewpoint (SV) and Vertical Viewpoint (VV), in order from left to right.
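The matching rule above is the usual intersection-over-union test; a minimal version (our own helper, with boxes given as (x1, y1, x2, y2)) is:

```python
def is_match(det, gt, thresh=0.5):
    """True if intersection(det, gt) / union(det, gt) > thresh."""
    ix1, iy1 = max(det[0], gt[0]), max(det[1], gt[1])
    ix2, iy2 = min(det[2], gt[2]), min(det[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(det) + area(gt) - inter
    return inter / union > thresh

print(is_match((0, 0, 10, 10), (5, 0, 15, 10)))  # IoU = 1/3 -> False
```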

Datasets. The datasets used in the experiments are the ETHZ dataset [20], the PETS2007 dataset [21] and our own collected dataset. The ETHZ dataset provides four video sequences, Seq. #0∼Seq. #3 (640×480 pixels at 15 fps). This dataset, whose viewpoint is near HV, is recorded using a pair of cameras, and we only use the images provided by the left camera. The PETS2007 dataset contains 9 sequences, S00∼S08 (720×576 pixels at 30 fps); each sequence has 4 fixed cameras and we choose the 3rd camera, whose viewpoint is near SV. There are 3 scenarios in PETS2007 with increasing scene complexity: loitering (S01 and S02), attended luggage removal (S03, S04, S05 and S06) and unattended luggage (S07 and S08). In the experiments, we use S01, S03, S05 and S06. In addition, we have collected several sequences by hand-held DV cameras: 2 sequences near HV (853×480 pixels at 30 fps), 2 sequences near SV (1280×720 pixels at 30 fps) and 8 sequences near VV (2 sequences are 1280×720 pixels at 30 fps and the others are 704×576 pixels at 30 fps).

Training and testing datasets. S01, S03, S05 and S06 of PETS2007 and our own dataset are labeled manually every five frames for training and testing, while the ETHZ dataset already provides the ground truth. The training datasets contain Seq. #0 of ETHZ, S01, S03 and S06 of PETS2007, and 2 sequences (near HV), 2 sequences (near SV) and 6 sequences (near VV) of ours. The testing datasets contain Seq. #1, Seq. #2 and Seq. #3 of ETHZ, S05 of PETS2007, and 2 sequences of ours (near VV). Note that the ground truths of the internal unlabeled frames in the testing datasets are obtained through interpolation. The properties of the testing datasets may have an impact on all detectors, such as camera motion state (fixed or moving), illumination conditions (slight or significant light changes), etc. Details about the testing datasets are summarized in Table 3.

Table 3: Some details about the testing datasets.

Description   | Seq. 1  | Seq. 2  | S05      | Seq. #1 | Seq. #2 | Seq. #3
Source        | ours    | ours    | PETS2007 | ETHZ    | ETHZ    | ETHZ
Camera        | Fixed   | Fixed   | Fixed    | Moving  | Moving  | Moving
Light changes | Slightly| Slightly| Slightly | Slightly| Slightly| Significantly
Frame rate    | 30 fps  | 30 fps  | 30 fps   | 15 fps  | 15 fps  | 15 fps
Size          | 704×576 | 704×576 | 720×576  | 640×480 | 640×480 | 640×480
Frames        | 2420    | 1781    | 4500     | 999     | 450     | 354
Annotations   | 591     | 1927    | 17067    | 5193    | 2359    | 1828

Training detectors. We have labeled 11768 different humans in total and obtained 47072 positives after MVS, where the numbers of positives near HV, SV
and VV are 18976, 19248 and 8848 respectively. The size of the positives is normalized to 58 × 58. Some positives are shown in Fig. 4. We train a detector based on EMC-Boost selecting I²CFs as features.

Implementation details. We cluster the sample space into 2 clusters in the 1st stage and cluster the two subspaces into 2 and 3 clusters separately in the 2nd stage, as illustrated in Fig. 5 (b). When do we start and stop MC/SC? When the false positive rate is less than 10^{-2} (10^{-4}) in the 1st (2nd) stage, we start MC and then start SC after learning by MC. Before describing when to stop MC or SC, we first define transferred samples: a sample is called transferred if it belongs to another cluster after the current round of boosting. We stop MC (SC) when the number of transferred samples is less than 10% (2%) of the total number of samples.

Evaluation. To compare with our approach (denoted I²CF + EMC-Boost), two other detectors are trained: one adopts Intra-frame CFs learned by a general Boost algorithm as in [5] [15] (denoted Intra-frame CF + Boost), and the other adopts Intra-frame CFs learned by EMC-Boost (denoted Intra-frame CF + EMC-Boost). Note that due to the large number of Inter-frame CFs, the large number of positives and limited memory, it is impractical to learn a detector of Inter-frame CFs by Boost or EMC-Boost.

We compare our approach with the Intra-frame CF + Boost and Intra-frame CF + EMC-Boost approaches on the PETS2007 dataset and our own collected videos, and also with [4] [20] [22] on the ETHZ dataset. We give the ROC curves and some results in Fig. 6. In general, our proposed approach, which integrates appearance and motion information, is superior to the Intra-frame CF + Boost and Intra-frame CF + EMC-Boost approaches, which only use appearance information. From another viewpoint, this experiment also indicates that incorporating motion information improves detection significantly, as in [4].

Fig. 6: Evaluation of our approach and some results. (Recall vs. FPPI curves on Test Sequences 1 and 2 of ours, S05 of PETS2007 and Seq. #1∼#3 of ETHZ, comparing Intra-frame CF + Boost, Intra-frame CF + EMC-Boost and I²CF + EMC-Boost, and, on ETHZ, also Ess et al., Schwartz et al. and Wojek et al. with HOG, IMHwd and HIKSVM.)


Fig. 7: The robustness evaluation of our approach to video frame rate changes (Recall vs. FPPI on Test Sequences 1 and 2 of ours, S05 of PETS2007 and Seq. #1∼#3 of ETHZ). 1 represents the original frame rate, 1/2 represents that the testing dataset is downsampled to 1/2 of the original frame rate, and so on for 1/3, 1/4 and 1/5.

Compared to [20] [22], our approach is better. Furthermore, we have not used any additional cues like depth maps, ground-plane estimation and occlusion reasoning, which are used in [20]. "HOG, IMHwd and HIKSVM", proposed in [4], combines the HOG feature [18] and the Internal Motion Histogram wavelet difference (IMHwd) descriptor [3] using a histogram intersection kernel SVM (HIKSVM), and achieves better results than the other approaches evaluated in [4]. Our approach is not as good as, but comparable to, "HOG, IMHwd and HIKSVM", and we argue that our approach is much simpler and faster. Currently, our approach takes about 0.29 s on ETHZ, 0.61 s on PETS2007 and 0.55 s on our dataset to process one frame on average.

In order to evaluate the robustness of our approach to video frame rate changes, we downsample the videos to 1/2, 1/3, 1/4 and 1/5 of their original frame rate and compare with the original frame rate. Our approach is evaluated on the testing datasets with these frame rate changes, and the ROC curves are shown in Fig. 7. The results on the three sequences of the ETHZ dataset and S05 of PETS2007 are similar, but the results on our two collected sequences differ a lot. The main reasons are that: 1) as frame rates get lower, human motion changes more abruptly; 2) human motion in near-VV videos changes more strongly than that in near-HV or near-SV ones. Our two collected sequences are near VV, so human motion there changes relatively more drastically than in the other testing datasets. Generally speaking, our approach is robust to video frame rate changes to a certain extent.

Discussions. We now discuss the selection of EMC-Boost instead of MC-Boost or a general Boost algorithm.

MC-Boost runs several classifiers together at all times, which may be good for classification problems but not for detection problems. Considering the sharing of features at the beginning of clustering and the good clusters after clustering, we propose more suitable learning algorithms, MC and SC, for detection problems. Furthermore, the risk map is an essential part of MC-Boost, in which the risk of one sample is related to predefined neighbors in the same class and in the opposite class, but a proper neighborhood definition itself might be a tough question. To preserve the merits of MC-Boost and avoid its shortcomings, we argue that EMC-Boost is more suitable for a detection problem.

A general Boost algorithm can work well when the sample space has little variation, while EMC-Boost is designed to cluster the space into several subspaces, which can make the learning process faster. However, after clustering, a sample is considered correct if it belongs to any subspace, and thus clustering may also admit more negatives. This may be the reason that Intra-frame CF + EMC-Boost is inferior to Intra-frame CF + Boost in Fig. 6. Mainly considering its clustering ability, we choose EMC-Boost rather than a general Boost algorithm. In fact, it is impractical to learn an I²CF + Boost detector without clustering because of the large number of weak features and positives.

8 Conclusion

In this paper, we propose Intra-frame and Inter-frame Comparison Features (I²CFs), an Enhanced Multiple Clusters Boost algorithm (EMC-Boost), a Multiple Video Sampling (MVS) strategy and a two-stage tree structure detector to detect humans in video over large viewpoint changes. I²CFs combine appearance and motion information automatically. EMC-Boost can cluster a sample space quickly and efficiently. The MVS strategy makes our approach robust to frame rate and human motion. The final detector is organized as a two-stage tree structure to fully mine the discriminative features of the appearance and motion information. The evaluations on challenging datasets show the efficiency of our approach.

There is some future work to be done. The large feature pool causes many difficulties during training, so one direction is to design a more efficient learning algorithm. The MVS strategy makes the approach more robust to frame rate, but cannot handle arbitrary frame rates once a detector is learned; to achieve better results, another direction is to integrate object detection in video with object detection in static images and object tracking. An interesting question for EMC-Boost is what kind of clusters it can obtain. Take humans for example: different poses, views, viewpoints or illumination make humans look different, and perceptual co-clusters of these samples differ under different criteria. The relation between the discriminative features and the samples is critical to the results, so another direction is to study the relations among features, objects and EMC-Boost. Our approach can also be applied to the detection of other objects, multiple-object detection or object categorization as well.

Acknowledgement. This work is supported by the National Science Foundation of China under grant No. 61075026, and it is also supported by a grant from Omron Corporation.


References

1. Viola, P., Jones, M., Snow, D.: Detecting pedestrians using patterns of motion and appearance. In: IEEE International Conference on Computer Vision (ICCV). (2003)
2. Jones, M., Snow, D.: Pedestrian detection using boosted features over many frames. In: International Conference on Pattern Recognition (ICPR). (2008)
3. Dalal, N., Triggs, B., Schmid, C.: Human detection using oriented histograms of flow and appearance. In: ECCV. (2006)
4. Wojek, C., Walk, S., Schiele, B.: Multi-cue onboard pedestrian detection. In: CVPR. (2009)
5. Duan, G., Huang, C., Ai, H., Lao, S.: Boosting associated pairing comparison features for pedestrian detection. In: 9th Workshop on Visual Surveillance. (2009)
6. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR. (2005)
7. Wu, B., Nevatia, R.: Detection of multiple, partially occluded humans in a single image by Bayesian combination of edgelet part detectors. In: ICCV. (2005)
8. Yang, M., Yuan, J., Wu, Y.: Spatial selection for attentional visual tracking. In: CVPR. (2007)
9. Andriluka, M., Roth, S., Schiele, B.: People-tracking-by-detection and people-detection-by-tracking. In: CVPR. (2008)
10. Ke, Y., Sukthankar, R., Hebert, M.: Efficient visual event detection using volumetric features. In: ICCV. (2005)
11. Shechtman, E., Irani, M.: Space-time behavior based correlation. In: CVPR. (2005)
12. Jordan, M., Jacobs, R.: Hierarchical mixtures of experts and the EM algorithm. Neural Computation 6 (1994) 181–214
13. Kim, T.K., Cipolla, R.: MCBoost: Multiple classifier boosting for perceptual co-clustering of images and visual features. In: Advances in Neural Information Processing Systems (NIPS). (2008)
14. Wu, B., Nevatia, R.: Cluster boosted tree classifier for multi-view, multi-pose object detection. In: ICCV. (2007)
15. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: CVPR. (2001)
16. Huang, C., Ai, H., Li, Y., Lao, S.: Learning sparse features in granular space for multi-view face detection. In: IEEE International Conference on Automatic Face and Gesture Recognition. (2006)
17. Schapire, R.E., Singer, Y.: Improved boosting algorithms using confidence-rated predictions. Machine Learning 37 (1999) 297–336
18. Viola, P., Platt, J., Zhang, C.: Multiple instance boosting for object detection. In: NIPS. (2005)
19. Mason, L., Baxter, J., Bartlett, P., Frean, M.: Boosting algorithms as gradient descent. In: Advances in Neural Information Processing Systems (NIPS). (2000)
20. Ess, A., Leibe, B., Van Gool, L.: Depth and appearance for mobile scene analysis. In: ICCV. (2007)
21. PETS2007. (http://www.cvg.rdg.ac.uk/PETS2007/)
22. Schwartz, W.R., Kembhavi, A., Harwood, D., Davis, L.S.: Human detection using partial least squares analysis. In: ICCV. (2009)
