Fachhochschule Bonn-Rhein-Sieg
University of Applied Sciences
Fachbereich Informatik
Department of Computer Science

Master Thesis
in the program
Master of Science in Computer Science

Reliable Recognition and Tracking of Multiple Persons in Work Safety Relevant Environments

by
Moritz Vieth

Supervisor: Prof. Dr. Rainer Herpers
Co-supervisor: Prof. Dr. Dietmar Reinert

Handed in: 11.06.2007


Page iii

Abstract

In the course of the research project "Berührungslos wirkende Schutzeinrichtung zur Fingererkennung an Kreissägen" (Contact-less Operating Protection Device for the Detection of Fingers at Table Saws) of the Department of Computer Science of the Bonn-Rhein-Sieg University of Applied Sciences and the BG-Institute for Occupational Safety and Health (Berufsgenossenschaftliches Institut für Arbeitsschutz, BGIA) of the Hauptverband der gewerblichen Berufsgenossenschaften (HVBG), several concepts for the improvement of work safety in workshop environments have been developed.

Several works have been concerned with the detection of persons in hazard areas, but could only detect a single person at a time. Thus, the aim of this thesis is to detect and track multiple persons simultaneously in real time. For this, methods of Computer Vision are applied by surveilling the surroundings of a hazard area using a digital video camera. The data acquired by the camera is processed to detect persons in its visual field.

In order to be used in the area of work safety (and thus in real environments), several requirements have to be met. In particular, the rate of false negative and false positive classifications has to be kept as low as possible, the regarded scenes have to be arbitrary, the system has to react in real time and changes in illumination have to be dealt with.

At first, the works which have already been done in this project are discussed, in particular those concerned with the detection of persons, following which works marking the state of the art are presented. Taking the drawbacks of these works and the requirements of work safety into account, several objectives are set.

Based on these objectives, the problem is analyzed in depth. The required sub-tasks are identified, and methods which can be used to solve these sub-tasks are presented. These methods are evaluated, and the subset most suited to solve the problem of this thesis is selected.

Using this selection, a system was developed which uses a multi-modal foreground/background segmentation to identify distinct objects, which are then classified using a face recognition algorithm based on Gabor wavelets. Persons are tracked using a specially developed algorithm. An in-depth evaluation of the system shows that up to 5 persons can be detected and tracked reliably. However, the evaluation also shows that the system is not fit for use in work safety relevant environments, because the rate of false classifications (negative and positive) may be too high in some situations.


Page v

Acknowledgments

First and foremost, I would like to thank my supervisors, Prof. Dr. Rainer Herpers and Prof. Dr. Dietmar Reinert, for making this thesis possible and for their support and advice during this thesis.

I would like to thank my family for their support and my friends and colleagues for their advice, for which I am very grateful. I would like to especially thank Oliver Zilken, without whom the work on this thesis would not have been the same.

And finally, I want to thank Astrid Perkow, whose love, support and patience were most important during this process.

This work has been funded by the Hauptverband der gewerblichen Berufsgenossenschaften (HVBG), under the grant ID FP239.


Page vii

Erklärung

Hiermit gelobe ich, dass ich die vorliegende Master Thesis selbständig verfasst und keine anderen als die angegebenen Quellen und Hilfsmittel verwendet habe.

Declaration

I hereby declare that I have written this thesis without any help from others and without the use of documents and aids other than those stated above.

Ort, Datum                                                  Moritz Vieth


Page ix

Contents

1 Introduction                                                        1
  1.1 Motivation                                                      1
  1.2 Problem Definition                                              1
      1.2.1 Model Test Setup                                          2
      1.2.2 Work Safety                                               4
  1.3 Overview                                                        6

2 Related Work and State of the Art                                   7
  2.1 Related Work                                                    7
      2.1.1 Work concerning the immediate hazard area                 7
      2.1.2 Work concerning the detection of persons                  7
  2.2 State of the Art                                                11
  2.3 Objectives                                                      12
  2.4 Assumptions                                                     14

3 Problem Analysis                                                    17
  3.1 Sub-tasks                                                       18
      3.1.1 Preprocessing                                             19
      3.1.2 Detection of Persons                                      20
      3.1.3 Tracking of Persons                                       21
  3.2 Possible Methods                                                22
      3.2.1 Overview                                                  22
      3.2.2 Selection                                                 38

4 Realization                                                         43
  4.1 General Structure                                               43
  4.2 Application Steps                                               43
      4.2.1 Initialization                                            43
      4.2.2 Preprocessing                                             44
      4.2.3 Classification                                            47
      4.2.4 Tracking                                                  49
  4.3 Framework                                                       51
      4.3.1 Integration of the project into the framework             52

5 Results and Evaluation                                              53
  5.1 Introduction                                                    53
      5.1.1 Variation Parameters                                      53
      5.1.2 Evaluation Criteria                                       54
  5.2 Tests                                                           56
      5.2.1 Data base                                                 56
      5.2.2 Setup                                                     56
      5.2.3 Test Cases                                                57

6 Conclusions and Further Work                                        69
  6.1 Conclusions                                                     69
  6.2 Further Work                                                    70


Page x

Bibliography                                                          71


Page xi

List of Tables

5.1 Evaluation parameters, separated by static and dynamic parameters.        53
5.2 Overview over the test persons considered in the evaluation.              57


Page xiii

List of Figures

1.1 Schematic image of the setup of the project environment. The shaded area marks the observation area, in which the worker is at a distance of about 2.5 m from the camera, which is positioned so that the worker faces the camera. Other persons may enter the observation area.        3
1.2 Marlin F-033C camera used in this project. [Gmb07]        3
2.1 Head-shoulder region as detected by [Bar03]. The ellipse marks the head, while the plus signs mark the detected shoulder points. [Bar03]        8
2.2 Erroneous situations using the methods of [Bar03]. (a) The work piece carried by the person is higher than the head; no person is detected. (b) Two persons are too close to each other; only one person is detected. [Bar03]        9
2.3 Improved head candidate search by Barth and Herpers. (a) Input image, (b) segmented foreground mask, (c) detected edges, (d) skin color map. [BH05]        10
2.4 Successfully detected arm-hand regions using steerable filters. [Hah05]        11
2.5 Output of the Reading People Tracker. Each person is marked with a bounding box and a contour fitted to them. [Sie03]        12
3.1 Motion History Image (MHI). Brighter areas correspond to more recent movement. By taking the last 5 images into account, a "trail" is introduced to the person, and the central part of the person is not regarded as moving.        24
3.2 Edge search for fitting a contour to a shape. A contour representing a rough estimation of a moving person is centered on the blob. Based on several key points, the contour is matched against close edges. [Sie03]        29
3.3 Haar-like features suitable for face detection. The features are divided into three classes: edge features, line features and center-surround features. For each class a set of basic orientations is given. [LLK03]        31
3.4 Typical set of Gabor filter kernels. Parameters for these were σ = π, k_max = π/2, f = √2, ν = 0..4 and µ = 0..7. [OG07]        32
3.5 Combination of different color spaces resulting in a robust skin image. The image is scanned for skin colored pixels using the RGB, HSV and YCbCr color spaces, which yield slightly different results. Combining these results into one image leads to a robust skin color classification. [Ze99]        34
3.6 Sequence of shapes having the same set of Hu moments due to invariance in translation, rotation and scale.        36
3.7 Different reference points for shapes. (a) Center of mass, (b) mean value, (c) bounding box center. While the center of mass and the mean value yield approximately the same point, the center of the bounding box deviates from those.        38
4.1 General structure of the algorithm. The red parts indicate preprocessing steps, blue parts indicate the classification, and green parts denote the tracking steps. Dashed lines mark high-level feedback to the different stages.        44
4.2 Screen shot of the application. Each image depicts a result of a processing step.        45
4.3 Work flow for background segmentation. The dashed lines mark the influence of high-level feedback. First, search rectangles are determined based on motion segmentation, afterwards a difference mask is created based on the search rectangles. This mask is dilated and eroded in order to improve its quality, following which blobs are created using the mask.        45


Page xiv

4.4 Hole filling algorithm. (a) Original mask, (b) mask after dilation, (c) mask after erosion.        46
4.5 Work flow for blob classification. The green box marks the steps for skin recognition, the blue box marks the head classification part. The skin recognition is only carried out if the blob has not already been classified as a person. If enough skin pixels are present in the area of the blob, a head classification algorithm is carried out. If successful for a defined number of frames, the blob is classified as a person.        48
4.6 Assignment of newly found blobs to previously detected blobs. The similarity to all previously detected blobs is calculated for each blob. Afterwards, blobs are assigned in order of the highest similarity.        50
4.7 Work flow for head redetection. This is done for all blobs previously classified as a person. The position of the head is estimated using the motion vector. At the estimated position, a skin blob is created and compared to the current head. If the similarity is high enough, the head is considered redetected.        51
5.1 Test configurations used for the evaluation. (a) Uniform background, (b) non-uniform background. Illumination was changed to achieve further configurations.        58
5.2 False negative classification rate for test configurations. In the worst case (non-uniform background, non-uniform illumination) a false negative classification rate of 0.43% is achieved.        59
5.3 Pure false negative classifications after elapsed number of frames. The classification accuracy improves significantly over time.        60
5.4 False positive classifications for test configurations. In the worst case (non-uniform background, non-uniform illumination) a false positive classification rate of 3.86% is achieved.        60
5.5 Overall loss rate for different test configurations. In the worst case (non-uniform background, non-uniform illumination) a tracking loss rate of 0.12% is achieved.        62
5.6 Effect of fast motion on tracking and classification. While classification is heavily impacted by fast motion, tracking rates stay reliable.        64
5.7 Effect of sudden and gradual change in illumination on (a) tracking loss and (b) false positive classification (constant illumination given for reference). While a gradual change has no effect on classification and tracking, a sudden change has a severe impact on the application.        66
5.8 Performance of the system. While having an average frame rate of about 20 frames per second (and thus a reaction time of about 50 ms), the frame rate for the worst case (5 persons in the image) is 7 frames per second (about 143 ms reaction time).        67


Page 1

1 Introduction

This chapter provides an introduction to the topic covered by this thesis. At first, the project which this thesis is a part of and the environment used are presented. Afterwards, a short introduction to the topic of work safety is given, following which an overview over the rest of the thesis is provided.

1.1 Motivation

This thesis is part of a research project by the Department of Computer Science of the Bonn-Rhein-Sieg University of Applied Sciences and the BG-Institute for Occupational Safety and Health (Berufsgenossenschaftliches Institut für Arbeitsschutz, BGIA) of the Hauptverband der gewerblichen Berufsgenossenschaften (HVBG), both situated in St. Augustin. The title of the project is "Berührungslos wirkende Schutzeinrichtung zur Fingererkennung an Kreissägen" (Contact-less Operating Protection Device for the Detection of Fingers at Table Saws) [fAB07]. The project is funded by the HVBG (FP239).

As studies have shown [dgBH03], several hundred accidents per year occur involving the use of hand-fed machines such as circular table saws. While several approaches to detect body parts in the immediate vicinity of the hazardous mechanism (such as a saw blade) have already been made (see section 2.1), these approaches may benefit from the knowledge whether a person is in the working area¹ of the machine. Furthermore, general information about the presence of persons in certain areas may be used to improve work safety even in areas without hand-fed machines (and thus, in workshop environments in general).

1.2 Problem Definition

The abstract problem, as can be deduced from the above section, is to detect multiple persons in a predefined area. Many approaches to this have been made, but most of those only consider single persons or rely heavily on restrictions to the input. In order to apply to situations regarding work safety, these limitations have to be overcome.

¹ Working area: The area which a worker has to enter in order to operate a machine.


Page 2

The task of this thesis is to develop a system which is able to detect and track multiple persons simultaneously in an area where those persons need protection from hazards, while working in real time and having as few restrictions regarding the environment as possible. These hazards may be indirect, like the work on a machine (which would need further processing to eliminate the actual hazard), or direct hazards, like the operating area of a robot. By that, it shall be possible to enhance work safety in workshops or robot halls, for example.

The approach to solve this task is to apply Computer Vision, in particular Machine Vision, to the problem. Computer Vision denotes the theoretical and technological means to obtain information from image data automatically. Machine Vision denotes a certain area of Computer Vision which is concerned with digital image data obtained through video sequences or video cameras, especially when working fully automatically, i.e. without any interaction by humans.

To do this, a camera system is set up which observes a defined area (at about 2 up to 3 meters distance from the camera) and tries to detect any person in this area. Because of the size and distance of the regarded area (the observation area²), up to 5 or 6 persons are to be detected and tracked simultaneously.

1.2.1 Model Test Setup

The test setup of this thesis will use a table saw, type WA-80 by the company Altendorf, which serves as a model for hand-fed machines. At a height of about 210 cm, an overview camera is situated which observes the working area at the saw as well as its surrounding area (the observation area). Persons entering this area are supposed to be detected and tracked subsequently. The information about the worker could be further processed by a subsequent system, such as those presented in section 2.1.1.

The camera used is a Marlin F-033C by Allied Vision. This is an industrial camera, using a high-quality CCD sensor which provides images in VGA resolution (640 x 480) at frame rates of up to 74 Hz. The lens used is a wide angle lens by Pentax, type C481DC (TH). It has a focal length of 4.8 mm, which with respect to the chip size of the camera results in a field of view of 98°22′. This is necessary so that the observation area will be as large as possible while preserving a sufficient accuracy and size of the persons to be detected. This ensures that persons can be detected very early (before entering the hazard area³).

² Observation area: The area in the field of view of the camera, in which persons should be detected; ranging from about 2 m up to 3 m distance from the camera.


Page 3

Figure 1.1: Schematic image of the setup of the project environment. The shaded area marks the observation area, in which the worker is at a distance of about 2.5 m from the camera, which is positioned so that the worker faces the camera. Other persons may enter the observation area.

Due to the wide angle of the lens, a heavy distortion is applied to the acquired image, which may eventually have an impact on the classification. However, it can be assumed that all persons are at a defined distance to the camera (comp. figure 1.1 and section 2.4), so that the distortion only has a minimal effect. Furthermore, a great variety of persons and hence of their head shapes has to be assumed. Therefore, a calibration of the camera is not necessary.

Figure 1.2: Marlin F-033C camera used in this project. [Gmb07]

The camera communicates with the PC used via an IEEE-1394 (FireWire) port. The exact configuration of the PC used is described in chapter 5.

³ Hazard area: The area in which a person is subject to hazards, e.g. a 12 cm radius around a saw blade, as defined by [fAB07].
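As a rough plausibility check (not taken from the thesis itself), the quoted field of view can be related to the focal length via the usual pinhole relation; the sensor diagonal used below is an assumed value, since the chip dimensions are not stated here:

    \mathrm{FOV}_{\mathrm{diag}} = 2 \arctan\left(\frac{d}{2f}\right),
    \qquad
    2 \arctan\left(\frac{11\,\mathrm{mm}}{2 \cdot 4.8\,\mathrm{mm}}\right) \approx 98^{\circ}

With the stated focal length f = 4.8 mm, an assumed sensor diagonal of roughly 11 mm would reproduce the diagonal field of view of about 98° mentioned above.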


Page 4

1.2.2 Work Safety

This section introduces the term "work safety", which is fundamental for this thesis. Since "work safety" is a wide-ranging term, only the aspects relevant to this thesis will be discussed.

Work safety in general denominates measures aiming at the protection of persons working in hazardous environments (like workshops) from accidents, or even at avoiding these accidents beforehand. These can be regulations for the work with hazardous machines (for instance, a safety area of 12 cm surrounding saw blades has been defined, which must not be penetrated by body parts), but also concrete technical measures to apply these regulations. The aim of this thesis is to develop such a measure, specifically the reliable detection and tracking of persons in an area requiring close surveillance.

There are several requirements to those technical measures regarding work safety:

• Real-Time Reaction
  The application has to be able to process and evaluate the data in as short a time as possible, in order to recognize and avert a potentially hazardous situation. For example, the hazard area around the saw blade of a circular saw has been defined as an area of 12 cm radius around the saw blade. Based on the estimated maximum speed of a hand when working (2 m/s) [dgBH03], it can traverse the hazard area in about 50 ms, so the reaction time is required to be lower than that.

• No false negatives in the hazard area
  A false negative result is present if a target object (a person in this case) is not detected in spite of being in the field of view of the camera. This has to be avoided in 100% of all cases. If this cannot be achieved, the potential hazard is much higher than without a safety measure, since people tend to heavily rely on safety equipment. This is especially critical if the subject is in the hazard area.

• Minimize false positives
  A false positive result is computed if an area in the image is classified as a target object although there is none at that position. These are less critical than false negative classifications but still have negative impacts. For instance, if these classifications lead to the shutdown of a machine, this false analysis will lead to increased cost. Furthermore, frequent false positive classifications will lead to a low acceptance of the system. This contradicts the underlying intention of the development of a work safety means.


Page 5

• Real environments
  A system which is used to improve work safety has to be able to act in real environments and thus cannot be restricted to arbitrarily modifiable (and usually idealized) conditions of laboratory environments. This means specifically:

  – Arbitrary background
    The background must not be restricted in any way. In a real workshop environment all imaginable variations of the background are possible. A cluttered background will be the normal case, and moving objects in the background are possible as well.

  – Only minimal restrictions regarding lighting
    The lighting of the environment cannot be defined beforehand, since the area to be observed has already been constructed without regard to vision applications. Although in workshop environments sufficient illumination can be assumed, it will most likely not be uniformly distributed. Also, daylight will usually influence the scene. This introduces a gradual change in lighting over time, which has to be taken into account.

  – No or only little restriction of the clothing of persons
    It is not possible to restrict the clothing of persons in work safety relevant areas. To assure the acceptance of the system, no regulations regarding the clothing of workers can be made (with the exception of the general regulations for work clothing). Thus, on the one hand a great variance of target objects results; on the other hand, the analysis using shapes and edges is hindered significantly. The shape of a person can deviate significantly from predefined models; for example, if a person is wearing hearing protection, the shape of the head does not resemble an ellipse anymore, as many models assume. Furthermore, patches of different color in the clothing each introduce a set of edges to the image. Thus, it is significantly harder to determine the outer contour of a person.

  – No or only little restriction of the posture of persons
    In real-world applications, persons may take up any posture while working, including bending down or raising the hands above the head (so that the head is not the topmost point of the shape anymore). Additionally, it is usual that those persons are conveying work pieces. These cases have to be taken into account as well. Nevertheless, it is possible to derive some assumptions. For instance, it can be assumed that a worker has to stand at least once in "nominal posture" facing the camera, because he has to turn on the machine.


Page 6

• Continuous Use
  Since a work safety relevant application will be put to use over a long time (up to constant operation), it is crucial that the application meets the requirements arising from this. This means it has to use resources carefully and provide high stability and reliability. If a safety application fails because of bugs or defects in the hardware, this may lead to hazardous situations. Thus, it is necessary to use hardware which is able to guarantee long-time stability, as well as very stable algorithms.

1.3 Overview

This section provides an overview of the structure of this thesis.

Chapter 2 addresses the state of the art and related work in the course of this project, and presents further related works on the topic of person detection. Based on that, concrete objectives of this thesis are formulated with respect to the problem definition in section 1.2 and the requirements presented in section 1.2.2.

Chapter 3 presents the approach to solving the given task. Three main problems are identified, which are divided into several sub-tasks. These sub-tasks are presented and analyzed in detail. Following that, different methods to solve one or more sub-tasks are presented in section 3.2, and synergies between these methods are identified. Based on the requirements and objectives of the project, section 3.2.2 selects the methods suited best to solve the tasks.

After the methods used for a solution of the problem have been presented and selected in chapter 3, the actual implementation is addressed in chapter 4. Here, the precise approach to solve the problem is explained, and specific parameters are described.

Following that, chapter 5 provides an in-depth evaluation of the application regarding the chosen objectives and the given requirements. Chapter 6 gives a conclusion and an outlook on future work.


Page 7

2 Related Work and State of the Art

This chapter provides an overview of previous work that has been done in the scope of this project, as well as other projects concerning person detection. Afterwards, the objectives for this thesis are derived, and assumptions needed to achieve these objectives are defined.

2.1 Related Work

Within the scope of this project several approaches have been developed and tested, which allow statements about the use of different concepts of protection in workshop environments. These can be sorted into two groups: The works in the first group deal with the immediate hazard area (which encloses the saw blade, in this example) and try to detect the presence of hands or body parts directly there. The works in the second group try to detect persons employing a coarse-to-fine strategy and use this information to detect the hands in the image. This thesis is to be allocated to the second group, and thus this group will be discussed in greater detail.

2.1.1 Work concerning the immediate hazard area

Several works have been done in this area. They approached the problem of detecting hands in the immediate hazard area by various methods, namely passive-infrared sensors [Gra03], electric field sensors [Klu04], visual analysis [Zil07] and the theoretical analysis of multi-spectral analysis and laser triangulation [Due04], based on which approaches using laser triangulation [Sch05] and multi-spectral active infrared sensors [Sch06] were made. Since these works are not directly relevant to the problem approached in this thesis, they will not be discussed in detail.

2.1.2 Work concerning the detection of persons

This section presents the works in the scope of the project which approach the problem of detecting a person in the surroundings of the hazard area. Using a coarse-to-fine strategy, a statement on the presence of the person should be made in order to support other systems (such as those above).


Page 8

The coarse-to-fine strategy first tries to detect the presence of a person, then finds its head, and uses this knowledge to estimate the position of the hands of the person. These works can be understood as directly related work of this thesis.

Alexander Barth: Entwicklung eines Verfahrens zur Detektion der Kopf-Schulter-Region in Bildsequenzen. [Bar03]
(Development of an Approach to Detect the Head-Shoulder Region in Image Sequences.)

In his bachelor's thesis, Alexander Barth developed an approach to classify a person by detecting its head-shoulder region. Based on a binary image which was acquired by segmentation, the image is subdivided into blobs using a connected component analysis, and these blobs are analyzed further. The topmost point of a blob is considered to be the top of the head of a person. This is verified by fitting an ellipse around the assumed head position using characteristic points (skull cap, ears, neck) of a head. If the verification is successful, candidates for the shoulder points can be established on the contour. In order to verify the placement of the shoulder points, a measurement function was developed which is based on the Euclidean distance of the characteristic points and shoulder points to each other. If the score computed by the function meets a given threshold, the region is classified as a head-shoulder region.

Figure 2.1: Head-shoulder region as detected by [Bar03]. The ellipse marks the head, while the plus signs mark the detected shoulder points. [Bar03]

The results of this work are very reliable if certain conditions are met. To be specifically pointed out is that the head of a person has to be the highest point of a blob and that only single persons are in the field of view of the camera.


Page 9

Further problems occur when analyzing long-haired persons, because the contour of the head and shoulder region (especially the neck) is not prominent enough. Nevertheless, this approach allows detecting single persons, even if they turn their back to the camera.

Figure 2.2: Erroneous situations using the methods of [Bar03]. (a) The work piece carried by the person is higher than the head; no person is detected. (b) Two persons are too close to each other; only one person is detected. [Bar03]

In spite of the high reliability under these conditions, this thesis can rather be seen as a proof of concept, since during development application performance was not of concern, resulting in frame rates of about 8 - 12 frames per second, which is not fast enough. The assumption made in this thesis, that only one person is in the visual field of the camera, does not match real conditions.

Alexander Barth, Rainer Herpers: Robust Head Detection and Tracking in Cluttered Workshop Environments Using GMM. [BH05]

Based on the thesis of Mr. Barth, he and Rainer Herpers further developed the approach of detecting the head-shoulder region in order to achieve a higher robustness with respect to changes in lighting conditions. This is mainly achieved by improving the search for suitable candidates. For that, the segmentation of foreground and background is improved by extending the GMMs as presented in [Bar03] with a local component and refining it using high-level feedback. In addition, the search for candidates is reduced significantly by employing a combination of color and edge features.


Page 10

Figure 2.3: Improved head candidate search by Barth and Herpers. (a) Input image, (b) segmented foreground mask, (c) detected edges, (d) skin color map. [BH05]

By these measures a head can be detected significantly more robustly than in the previous work, since the head does not have to be the topmost point of the blob anymore. However, certain restrictions have to be met here as well. Only single persons can be detected robustly, and the algorithm can be confused by skin colored areas inside of a blob (e.g. by a wooden board held above head level).

In this thesis a basic tracking mechanism was implemented, which reduced the search space for the head region by placing an initial guess at the last position of a known head. This mechanism is based on the assumption that a person does not move significantly between two frames. Thus, the head of the person is initially assumed to be at the same position as before. By that, many "irritating" situations can be resolved, although the shoulders were not tracked. This system is not fit for practical use either, because it still assumes only single persons to be in the image.

Stefan Hahne: Modell- und bildbasierte Detektion und Verfolgung der Arme in Videosequenzen. [Hah05]
(Model and Image Based Detection and Tracking of Arms in Video Sequences.)

Based on the results of the works of Alexander Barth, Stefan Hahne continued the coarse-to-fine strategy in his diploma thesis. It is assumed that a head-shoulder region is correctly detected (by the algorithm of Mr. Barth), based on which the hand-arm region is traced by the use of steerable filters. The fundamental assumption was that a continuous edge (the outer edge of an arm) exists between shoulder and hand, which ends in a skin colored area.

Steerable filters provide a bank of Gaussian edge detection filters with varying scale and orientation. So, depending on the current edge (its scale and orientation), the best matching filter can be selected.


Page 11

Figure 2.4: Successfully detected arm-hand regions using steerable filters. [Hah05]

By that, non-straight continuous edges can be traced. A more in-depth explanation of steerable filters can be found in [Hah05]. This advantage was exploited by Mr. Hahne by detecting hands based on the assumption that between a hand and its corresponding shoulder a continuous edge exists. If this assumption is met, for instance because the test person has smooth clothing, this can be accomplished satisfactorily. While this method allows satisfactory results in real environments (see figure 2.4), these can only be achieved under certain conditions. Most importantly, the person has to be in "nominal" posture, and the clothing must be uniform at the arms. Otherwise, a small change in the edge, such as by the rolled-up arms of a pullover or the creases of a shirt, can influence the filter method so that the "arm edge" cannot be traced correctly.

2.2 State of the Art

Aside from the works which have been done in the scope of this project, there exist many approaches to the problem of detecting persons in images and tracking them over time. Most of these approaches are based on a fundamental model of a person, which is matched on blobs¹ in the image [Bau03, Sie03, Joh98, HHD00, SBF00]. Often, it is assumed that the persons are separate in the image (i.e. the areas occupied by different persons do not overlap in the image).

Singled out and discussed exemplarily shall be the Reading People Tracker [Sie03], which emerged from a project of the University College of Reading by further developing the Leeds People Tracker [Bau03], which was developed at the University of Leeds. This tracker is used to detect and track persons in video sequences of surveillance cameras, which can be found in many subway stations and parking lots.

¹ Blob: A connected area inside of an image, distinct from other areas.


Page 12

Figure 2.5: Output of the Reading People Tracker. Each person is marked with a bounding box and a contour fitted to them. [Sie03]

This tracking system first detects areas in an image which are moving, tries to approximate heads based on a very naive model (in order to segregate several people who are moving in one blob), and detects persons using a shape-fitting algorithm. Here, a contour is laid over the area to check and is deformed to approximate the edges of the blob. In subsequent images, the approximated contour is adapted.

The system is able to robustly detect persons and to track them over a long time span. However, general model assumptions need to be made. A certain perspective common for surveillance cameras is assumed, which leads to the situation that all persons in the image are almost completely visible, but only fill up a small portion of the image. Because of that, the base contour used for initial fitting is a roughly approximated shape of a walking person, as can be seen in figure 2.5. Also, strong problems regarding occlusion occur, and the program is not robust against illumination changes.

The application reaches a performance of less than 5 frames per second, which makes it not applicable for work safety relevant applications. Furthermore, the perspective of the camera cannot be met by general work safety applications, because most require a closer view of the persons. In particular, the work in this thesis regards persons in frontal view at a distance of about two up to three meters.

2.3 Objectives

Based on the results of the works presented in sections 2.1 and 2.2, several objectives can be specified, taking the base conditions of work safety as described in section 1.2.2 into account. The following list is sorted by importance, with the most important objective on top.


Page 13

• Robust detection of multiple persons
  While single persons can already be detected robustly, this is to be expanded to detect multiple persons at once. Since the observation area is at two up to three meters distance from the camera, this is restricted to 5 persons in this thesis.
  In workshop environments it is to be assumed that multiple persons are in the visual field of the camera. These have to be detected robustly regardless of their position relative to each other. A robust detection means that there are limits on classification errors:

  – No false negatives in the working area
    If there is a person in the working area of a machine (and thus close to the hazard area), it must be detected in 100% of all cases. A false negative classification can result in accidents, since people tend to rely on safety measures.

  – Very few false positives in the working area
    While a false positive classification of an object in the working area is not as critical as a false negative one, problems, like a wrongful shutdown of the machine, may arise nevertheless. Because of that, the number of false positive results has to be kept as low as possible.

• Continuous tracking of detected persons
  If an object has been classified as an object of interest, it has to be tracked continuously. This ensures, on the one hand, that the position of the object is known even when an initial detection would not classify it as such (e.g. because a person is kneeling or has the back towards the camera). On the other hand, a significant increase in performance can be achieved, since complex classification algorithms eventually do not need to be carried out, because the knowledge about the classification of an object can be transferred.

• Arbitrary background
  The algorithm has to be able to work with arbitrary, especially cluttered, backgrounds. Also, isolated changes may be applied to the background, for example by placing a work piece in the visual field of the camera. A restriction of the background is not possible because of the need to operate in real environments.

• Real-time performance
  The program has to be able to react in a time span acceptable for work safety. Since persons are able to move very fast (so that a motion blur may be introduced to the images), a high frame rate is very important to ensure robust tracking. In order to assure a sufficiently short reaction time to potentially hazardous situations, at least 20 images per second should be processed, which corresponds to a reaction time of about 50 ms.


Page 14

• Robustness against background changes
  If the background of the image changes, be it by objects being added to the scene (like a box which is put into the visual field of the camera) or by a change in lighting, this has to be detected and compensated for. This is necessary in order to achieve a sufficiently accurate distinction into fore- and background.

2.4 Assumptions

In order to reach the objectives set in section 2.3, it is necessary to make certain assumptions. Because the system should be able to work in real environments, these may only be a few. They are:

• Sufficient illumination
  In order to be able to detect persons in the field of view of a camera at all, there has to be sufficiently bright illumination.

• Face must be visible
  In order to detect persons initially, it is necessary that the features needed for classification are visible. Regarding a person, the main feature by which it can be classified is the face. Thus, it is crucial that it is visible to the camera at least for the initial detection. This means that the person must not wear a welding mask, for example. After the initial detection, this assumption may be overcome or weakened by tracking mechanisms.

• Person must face the camera initially
  Analogous to the above point, it is mandatory that for the initial detection the person faces the camera at least roughly. This can be achieved by positioning the camera in a way that persons must face the camera when turning on the machine.

• Persons are at a defined distance to the camera
  Because of the use of a wide-angle lens, the size of a person in the image may change dramatically. Therefore, it is necessary that persons to be detected are at a defined distance to the camera. According to the setup presented in section 1.2.1, they have to be at a distance of between 2 and 3 meters from the camera.

• People are separated from each other most of the time
  In order to allow a robust classification of persons in the observation area, it is assumed that, if multiple persons are visible, they are separate from each other most of the time.


Page 15

  This also means that they do not enter the observation area simultaneously.


Page 17

3 Problem Analysis

This chapter analyzes the problem of this thesis in detail. In section 3.1 several sub-tasks are defined and described in detail. Subsequently, section 3.2 presents several methods to solve these sub-tasks. These methods are evaluated with respect to their applicability to each sub-task, afterwards possible synergies between different methods are pointed out, and finally the most suitable methods are selected.

The problem to be solved by this thesis, as stated in section 1.2, is to detect multiple persons (up to about 6) in the field of view of a digital camera, restricted to a certain observation area at a distance of about 2 up to 3 meters from the camera. The background of the scene is expected to be cluttered, and isolated changes may be applied to it. The system has to be able to work in real environments.

Regarding this, several traits of the captured image sequences can be identified:

• Multiple persons may be in the image
  Since the problem is to detect multiple persons, they have to be distinguished from each other. Occlusions and adjacency have to be accounted for.

• Persons may or may not move
  The movement of persons (especially when entering the observation area) is an important cue for detection. Nevertheless, the persons may and will eventually stop moving, so this has to be accounted for as well.

• The position of persons will not change significantly between subsequent frames
  Because of the need to act in real time, a significantly high frame rate (of about 20 frames per second) can be assumed. Because of that, the position a person takes up in an image will not differ significantly from the position in the previous image. This can be exploited using tracking mechanisms.

• The persons may have a diverse appearance
  Because of the applicability to real environments, the appearance of the persons to be detected may not be restricted. That means that the examined persons may have any skin color, hairstyle, clothing or physiognomy, for example.


Page 18

• Each person will take up a large part of the image
  Because of the distance of the observation area to the camera, a person will take up a large part (about one sixth) of the image. Because of that, mostly the torso and parts of the legs will be visible. Furthermore, this restricts the number of persons who can be detected in the image to about 5 or 6.

• A person has to face the camera in order to operate a machine
  Because of the camera setup (comp. figure 1.1), a person operating (or turning on) the machine has to face the camera. That way, the possibility of detecting a face is guaranteed.

• The background appearance is arbitrary
  Since the system is to be used in real environments, there can be no restrictions on the appearance of the background. In fact, a cluttered background is to be expected. This may impose difficulties on the processing of the images.

• The background will be mostly static, but isolated changes may occur
  A mostly static background enables the use of a reference image for background distinction. Because of isolated changes in the background, it has to be adapted over time.

• The illumination may change
  Because the system should be able to operate over a long time, and the environment is not restricted, it must be assumed that the illumination will change, either by the change of daylight (gradual change) or by turning lights on or off (sudden change). This introduces a further need to adapt the background.

• The images captured will be at a resolution of 640 x 480 pixels
  Because of the need to act in real time, a resolution of 640 x 480 pixels per image is too much data to process. Thus, the images need to be scaled down to allow faster processing, even if this means a loss in information and accuracy.

• The images are captured using a CCD sensor
  The use of a CCD sensor introduces noise to the images. This noise has to be reduced by preprocessing.

3.1 Sub-tasks

The above traits identify three main parts of the problem:


Page 19

i. Preprocessing
ii. Detection of persons
iii. Tracking of detected persons

These parts can be divided into several sub-tasks which have to be solved individually (possibly using feedback from other tasks).

3.1.1 Preprocessing

In order to gather usable data from the acquired images, these have to be prepared. Noise has to be removed from the images, the images have to be scaled down in order to reduce the amount of data to be processed, and they have to be prepared in order to ease the detection of persons. This includes background adaptation (in order to react to change), background identification and region of interest identification.

Preprocessing tasks are:

• Background Identification
  Since an acquired image will contain everything in the field of view of the camera, it is crucial to distinguish between areas which are "still" and thus make up the background, and those which contain an object of interest, e.g. a person (the foreground). The process of doing so is called foreground/background segmentation. Depending on the methods used to accomplish this, it leaves only the regions which may contain the desired information in the image, while eliminating anything that may hinder or complicate the detection process.

• Background Adaptation
  If a program is running over a long period of time, especially in "real world" environments like workshops, the background will change. This has to be accounted for, otherwise the foreground/background segmentation will yield false results. There are three ways a change may be introduced to the background:

  i. Lighting
     When working in non-laboratory (and thus only partially controlled) environments, the observed area will be subject to changes in lighting, both by the movement of the sun as well as by the turning on or off of artificial lighting. The former is a gradual change, which can likewise be adapted to gradually; the latter is usually a sudden change, which the system has to react to more spontaneously.


Page 20

  ii. Movement
      Especially in workshop environments, objects will be moved into and removed from the observed area. Both these actions cause a sudden but local change in the background.

  iii. Movement of the camera
      Although mostly avoidable, it is still possible that the camera (and thus the field of view) itself is moved. This may happen, for instance, by something hitting the camera mount. A movement of the camera has a sudden impact on the whole image. Although this should not happen under "normal" circumstances, it has to be accounted for.

• ROI Identification
  While not immediately crucial to subsequent processing, the identification of Regions of Interest (ROIs) may provide a significant speedup to it. This is especially helpful regarding the requirements on the reaction time imposed by the context of work safety.

3.1.2 Detection of Persons

After the preparation of the images, they have to be analyzed for the presence of (previously not detected) persons. For this, several steps are necessary:

• High-level knowledge acquisition
  An important step when analyzing an image is the transformation from regarding each pixel individually to a state where pixels are grouped and set in relation to each other. By forming such groups (which will be called blobs in the following) it is possible to treat these as entities¹. Once these blobs are created, they can be analyzed (and compared) with respect to shape, size, color, features, position and movement. This is necessary for further analysis and tracking.

• Entity distinction
  A single object (which is to be detected) or person (an entity) in an image can be divided into several blobs, while different entities may be next to each other. To correctly analyze the entities, it is necessary to know which blobs form an entity and to distinguish those entities which are next to each other. Also, correct tracking is only possible in the context of entities.

• Object/Person classification
  Not all entities found in a preprocessed image are persons or directly influenced by them (i.e. by being carried or pushed), like autonomously moving machine parts or a driving fork lift in the back of the field of view.

¹ Entity: A distinct object in the image sequence which consists of one or more blobs.


Page 21

  Since only persons are to be analyzed in detail, it is necessary to distinguish between persons and non-persons (objects). The knowledge acquired by this step is especially useful for tracking. The following tasks are important to achieve this:

  – Skin color classification
    A necessary prerequisite for a person to be detected is the presence of skin. If there is no skin colored area in an entity, it can be assumed not to be a person. Thus, the classification of the color of a pixel as skin or not is an important cue for the classification of an object.

  – Head/Face detection
    The most reliable way to (initially) classify a shape as a person is to detect its head or face. Once this is done, the position of the head can give further cues for the subsequent analysis of the person.

3.1.3 Tracking of Persons

After an entity has been classified, it can be tracked in the subsequent frames. This is necessary on the one hand to reduce computational cost, on the other hand to not falsely classify an entity which has been recognized as a person as an object (e.g. when the person turns around or the head region is temporarily occluded). This is also divided into several tasks:

• Area matching
  It is necessary to match areas to each other in order to track entities. In particular, this means certain areas in one frame have to be assigned to areas in a subsequent frame.

• Motion estimation
  It is necessary to estimate the motion of an entity in order to ease the matching of areas, since the motion of entities will usually only change slightly between subsequent frames. Thus, it is very likely that the area to match will be located at or close to the position estimated by the algorithm. In addition to this, the head of a person has to be tracked even if no face is visible anymore (e.g. because the person turned around).

• Occlusion recognition
  In a scene with several moving objects there is a high probability that those objects will occlude one another at some point. This severely hinders matching, since objects to match will be only partially visible or not visible at all.


Page 22

  In addition, a previously detected person may not be redetected (because the head is occluded, for instance). Nevertheless, it is possible to keep track of the expected position of the object and redetect it when it is not occluded anymore.

3.2 Possible Methods

There is a multitude of possible methods to solve the tasks identified in the previous section. Some of these methods "overlap", so that not only one task may profit from using them, but rather several tasks can use their results. Furthermore, some methods can only be used in conjunction. In this section possibly suitable methods are identified (with regard to the ones presented in section 2.1), possible synergies are pointed out and, concluding, a selection of the most suitable methods is made.

3.2.1 Overview

Foreground/Background Segmentation

There are mainly two methods used for segmentation: the calculation of a difference image to a given reference frame, and segmentation based on motion history.

Difference Image

This is a classic method of image processing. By calculating the difference of the current image and a reference frame which has been previously calculated (see below), only those areas which have changed with respect to the reference frame are determined. The difference image I_{diff} can be calculated as follows:

    I_{diff}(x, y) = \begin{cases} I_{curr}(x, y) & \text{if } |I_{curr}(x, y) - I_{ref}(x, y)| \geq \delta \\ 0 & \text{else} \end{cases}    (3.1)

Here, I_{curr}(x, y) and I_{ref}(x, y) are the values of the pixel at the position (x, y) in the current image and the reference image, respectively. δ represents the minimum deviation of the pixel values (the threshold) needed for two pixel values to be considered different from each other. This threshold is needed to compensate for small deviations an image can have during subsequent frames, as well as slight noise. Equation 3.1 is valid for grayscale images as well as for the single channels of an RGB image.

The advantage of this method is that it is very simple and relatively fast. However, this method is subject to noise, since pixels are considered in isolation. In addition, the casting of shadows or a change in illumination leads to the impacted areas being considered as foreground.


Page 23

If this method is used on color images, there is the possibility to perform the differentiation in a different color space, in order to increase the robustness with regard to illumination changes. For instance, a difference image can be calculated using the HSV color space. Here a color is not separated into its red, green and blue components, but rather into its hue, saturation and value. Since the hue is separate from the value and saturation, mainly the difference in that channel needs to be considered.

A further problem arises if the color of an area of the foreground matches that of the background (e.g. a white shirt in front of a white wall). In that case, the foreground cannot be separated correctly.

Motion Segmentation

A further possibility is to not just compute a difference image, but to determine which areas in an image are moving, independent of a reference image. For that, a Motion History Image (MHI) is calculated out of several subsequent difference images:

    MHI(x, y) = \begin{cases} \tau & \text{if } I^{M}_{diff}(x, y) = 1 \\ 0 & \text{if } I^{M}_{diff}(x, y) = 0 \,\wedge\, MHI(x, y) \leq (\tau - \delta_t) \\ MHI(x, y) & \text{else} \end{cases}    (3.2)

Equation 3.2 indicates the calculation of the MHI. I^{M}_{diff} denotes a difference mask which is calculated analogously to equation 3.1, whereas the preceding image I_{curr-1} is taken as the reference image, and instead of setting the current pixel value, the value 1 is copied into the mask. If the value at (x, y) is 1, the current time stamp τ is copied into the MHI. Otherwise, if the time stamp at that position in the MHI (if any) is older than δ_t (the time delta), the value is erased. The value of δ_t thus denotes (in combination with the frame rate) the number of images which influence the MHI. The fact that the MHI consists of time stamps which denote the last change of the image at a certain coordinate allows calculating a motion gradient for each pixel, based on these values.

Motion segmentation is a very fast procedure which also allows a first partitioning of the foreground, based on the motion gradients of the pixels. However, depending on the number of images used to calculate the MHI (and thus, the time delta), only the "frame" of larger unicolored areas is detected here, since pixels in the inner region of such an area do not seem to change in subsequent frames. Additionally, this method causes a "trail" behind each moving area, which is called the motion history (hence the name). These are areas which usually belong to the background and should thus not be added to the foreground. A larger time delta leads to a more complete reconstruction of a moving object, while simultaneously causing a larger "trail". Objects, especially persons, which do not move anymore will not be detected anymore either.
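To make the two segmentation variants above concrete, the following sketch (not from the thesis; OpenCV and NumPy are assumed, and the threshold delta and time window dt are illustrative values) shows how a thresholded difference mask (equation 3.1) and a motion history image (equation 3.2) could be updated per frame:

import cv2
import numpy as np

DELTA = 25    # minimum pixel difference (threshold delta of eq. 3.1), illustrative value
DT    = 0.25  # time window delta_t of the MHI in seconds (eq. 3.2), illustrative value

def difference_mask(curr_gray, ref_gray, delta=DELTA):
    """Binary foreground mask: 1 where |I_curr - I_ref| >= delta (eq. 3.1)."""
    diff = cv2.absdiff(curr_gray, ref_gray)
    return (diff >= delta).astype(np.uint8)

def update_mhi(mhi, motion_mask, timestamp, dt=DT):
    """Update the motion history image according to eq. 3.2."""
    mhi[motion_mask == 1] = timestamp                    # recent motion: store current time stamp
    expired = (motion_mask == 0) & (mhi <= timestamp - dt)
    mhi[expired] = 0                                     # erase entries older than delta_t
    return mhi                                           # all other entries keep their old time stamp

# minimal usage sketch on a camera stream
cap = cv2.VideoCapture(0)
ok, frame = cap.read()
prev = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
mhi = np.zeros(prev.shape, dtype=np.float32)

while ok:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # the adjacent-frame difference serves as the motion mask I^M_diff
    motion_mask = difference_mask(gray, prev)
    mhi = update_mhi(mhi, motion_mask, timestamp=cv2.getTickCount() / cv2.getTickFrequency())
    prev = gray

Using the previous frame as the reference here mirrors the adjacent-frame difference described below; a static or adaptively updated reference image could be substituted without changing the mask computation.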


Page 24 Chapter 3. Problem AnalysisFigure 3.1: Motion History Image (MHI). Brighter areas correspond to more recent movement. By takingthe last 5 images into account, a ”trail” is introduced to the person, and the central part of the person is notregarded as moving.Background AdaptationAdaptation of the background is needed in order to keep a reference image used to calculate adifference image (see above) up to date. This leads to a higher robustness against changes inillumination. Several techniques are possible here.Periodic UpdateA very simple practice is to periodically take the current image as background. This has theadvantage that at least the areas where there is no actual foreground area effectively represent thecurrent background. However, objects currently in the area of the image are taken as background aswell and thus cannot be redetected. This may be avoided by merely updating the reference imagewhen no object has been detected. It cannot be guaranteed that this condition is ever met, though.Adjacent Frame DifferenceIn this case the predecessing image is take as a reference image. This bears the advantage that it isvery easy to implement, and that no initial reference image has to be found. Still, similar problemsto periodic updates arise, since non-moving objects will not be detected anymore.


3.2. Possible Methods Page 25Averaging/BlendingA contrary approach to those presented above is to compute the reference image over time. This canbe done by taking the mean (or median) of all values for a pixel over a certain time span (averaging)(see equation 3.3) or by blending the current pixel values into the image based on a certain updatefactor α (alpha blending) (see equation 3.4).I ref (x, y) =∑ curri=curr−τ I i(x, y), τ ≤ curr (3.3)τI ref (x, y) = (1 − α) · I ref (x, y) + α · I curr (x, y), 0 ≤ α ≤ 1 (3.4)Equation 3.3 describes the calculation of a mean image from the last τ frames. A higher value forτ leads to a more accurate reference image, but also a slower adaptation rate. Usually, τ is a fixednumber. When set to curr, the reference image is calculated over all images processed so far. Thiscan be useful when the presence of foreground objects in the image is sparse, but there is almostalways at least one moving object (e.g. a highway)Following that, equation 3.4 describes the calculation of a reference image using alpha blending. Ahigher value for α leads to a significantly higher adaptation rate. Typical values range from 0.01 upto 0.3. Anything above 0.05 may be considered a high update rate, whereas a value of 0.3 meansalmost instant adaptation. In the extreme case (α = 1.0) this is equivalent to the replacement of thereference image and thus the adjacent frame difference.As described above, these techniques lead to a relatively fast or slow adaptation rate, dependingon the selection of τ and α, respectively. A slow adaptation rate cannot react very well tochanges in illumination, while a fast adaptation rate will lead to foreground regions being integratedinto the reference image very fast.Gaussian Mixture Models (GMMs)The adaptive techniques described so far are well suited for static backgrounds, however situationsmay arise where parts of the background are moving more or less constantly, but restricted to smallareas (e.q. bushes swaying in the mind or even robots or machines continuously repeating thesame action). This areas will be constantly falsely considered as belonging to the foreground (andeventually hinder object classification).A technique to compensate for that is the use of a Gaussian Mixture Models (GMM). Here, nota single static image is defined as a reference image, but rather a set models defining a possiblebackground is established for each pixel. These models are determined by a Gaussian distributionusing a mean and covariance distribution around an expected value. An in-depth explanation canbe found in [BH05].


Page 26 Chapter 3. Problem AnalysisA background adaptation using GMMs still has the problem that object belonging to the foregroundwill be included in the background model if they are not moving over a certain amount of time.A possibility to eliminate or at least diminish the problem of adaptive techniques (the inclusionof foreground pixels into the background model) is to not consider areas in the image whichhave been marked as ”target objects“ when adapting the background. In this case the backgroundmodel may differ significantly from the actual background at areas where a ”target object“ has beenover a longer time, but this can be accommodated for with a suitably high update rate.In fact, this tactic allows to use really high update rates but requires that target objects are detectedrobustly.ROI IdentificationIn general there are two possible ways to identify ROIs. On the one hand the areas can bedetermined previously (static ROIs), on the other hand they can be calculated based on the presentdata (dynamic ROIs).Static ROIs are very easy to implement, albeit almost completely unflexible. Nevertheless, carefullyplaced static ROIs can lead to a significant increase in performance. Such ROIs would be placed atareas where an initial appearance of the target objects is to be expected (for this work this wouldbe the sides of the image and eventually present doors, for example) or where a special surveillanceis needed (for instance, the working area at a machine).Dynamic ROIs can be particularly determined by a motion segmentation, since image areas in whichnothing is moving do usually not need to be considered for further processing. In addition, the areaaround a detected target object should always be considered as ROI, since in can be expected thatthe target object has not moved significantly (if at all) between subsequent frames (provided theframe rate is sufficiently high).High-Level Knowledge AcquisitionThere is a multitude of possibilities to get from plain pixel-based knowledge to continuous areas, i.e.assign single pixels to certain groups. In this section, only a selection of those will be presented.Flood Filling/Region GrowthUsing the technique of region growth, an area is formed starting at a certain point by adding all


3.2. Possible Methods Page 27neighbors of that pixel to the area if it matches certain condition. This is done recursively until allpixels have been tested or are not suitable for adding. Based on the condition, two types can beidentified:• Binary region growthWith binary region growth, every pixel that has a color (i.e. has been identified as foreground)or corresponds to a mask, respectively, will be added to the area (the blob). This results inlarge, continuous blobs, however the problem that different entities may be put together, if theyare adjacent to each other, arises. Nevertheless, singular entities will be recognized as such.• Color based region growthUsing color-based region growth, a pixel is added to the blob, if the color difference to one ormore reference pixels exceeds a given threshold:⎧⎪⎨ p neigh , if |val(p neigh ) − val(p ref )| > δB = B ∪(3.5)⎪⎩ ∅, elseEquation 3.5 is given for grayscale images for simplicity reasons. To apply it to color images, thepixel difference has to be evaluated for each channel separately. Here, B denotes the currentblob and p neigh and p ref the currently examined and the reference pixel, respectively. Thefunction val(p) returns the current value of the pixel, while δ denotes the threshold.In this case, p ref can either be the current pixel (of which the neighboring pixels are examined)or the pixel at which the region growth was started (the seed pixel). By calculating the differenceto the seed pixel it is assured that the established blob color does not differ significantly fromthat of the seed color. However, this leads to problems regarding colored areas which have agradient color (e.g. a diffuse shadow on a table). This can also be a problem if the seed pixelitself differs in color from the actual blob (e.g. if it was chosen in the dark border area of theblob).If the difference to the current pixel is taken, color gradients can be detected and dealt with.This may be problematic, however, since by the adaptation of light color gradients may arisewhich do not directly correspond to the blob, but come close to an adjacent area which shouldactually be distinct from the blob. This way, several blobs may be combined albeit they shouldbe treated separately. A solution for that is to include both differences in the equation, weightedby a factor ω seed and ω curr , respectively:⎧if (ω seed · |val(p neigh ) − val(p seed )| > δ seed )∧⎪⎨ p neigh ,B = B ∪(ω curr · |val(p neigh ) − val(p curr )| > δ curr ) (3.6)⎪⎩ ∅, else


Page 28 Chapter 3. Problem AnalysisThe choice of the weights (which sum up to 1) is dependent on the respective applicationand usually has to be determined empirically, but could also be calculated, e.g. based on thedistance to the seed pixel. Apart from that, it is not uncommon to choose the same weight forboth distances.Depending on the image data and the chosen threshold, usually several blobs per entity areconstructed, which not seldomly are very small. In this case the problem arises that they haveto be agglomerated into larger entities.Edge-Based contour searchA method quite similar to region growth, but much faster is the search for contours based on edgedetection. At first, the current image is transformed into a binary image by applying an edgedetection algorithm. Afterwards, the detected edges are divided into contours, either singular orhierarchically, based on the implementation. Those contours enclose the areas delimited by theedges. If done hierarchically, the outermost contours describe the shapes. A problem with thisapproach is that, based on the appearance (like clothing) of the objects, many edges may be found,which greatly hinders the distinction into contours.Motion Segmentation-Based AssignmentSince motion segmentation already divides the image area into several segments (which have asimilar motion gradient), this division can be used as a measure for the assignment to blobs. Becauseit can be assumed that parts of an object will move in at least similar directions this assignmentis feasible. Still, problems may arise if several adjacent objects are moving into the same direction.Here a subsequent distinction is needed, which can on the one hand be achieved by tracking, but onthe other hand has to be supported by further methods of analysis (see ”Entity Distinction“).Template MatchingOne possibility to find possible target objects is to underlay a certain model which describes thetarget. Based on that, areas in the image which correspond to this model can be searched in theimage. The model may be a contour, but also a division into different-colored areas or certainfeatures.An example for that can be found in [Sie03]. Here, the coarse shape of a walking person is beingmatched to moving areas. If this shape is satisfyingly approximated by the moving area, the area isclassified as a person.


3.2. Possible Methods Page 29Figure 3.2: Edge search for fitting a contour to a shape. A contour representing a rough estimation of a movingperson is centered on the blob. Based on several key points, the contour is tried to be matched against closeedges. [Sie03]Entity DistinctionIt is exceedingly difficult to distinct separate entities based on the information in a single imagealone. Using tracking methods (see below) and motion segmentation this may be achieved, providedthe entities were not adjacent previously or are moving into different directions. If this is not thecase, several possibilities arise. What is important is that the target objects are detected. Becauseof that, newly detected blobs have to be examined if they consist of multiple target objects. Thiscan be done, for instance, by searching for sufficient features (e.g. faces) in this blob. If more thanone of these features is found, the blob has to be separated. How this is achieved is dependent onthe implementation, but can be done arbitrarily. In the case only one or no such feature is found,the whole blob can be treated as one entity, since it is not important if a person carries a box, forinstance. Merely a correct classification is needed.Person ClassificationThere are multiple possibilities to classify a person.contour-based and feature-based classification.These can be subdivided into two areas:Contour Based ClassificationContour-based methods merely examine the shape or contour of an object to classify it. They underliestrong model knowledge. An example for this method is the work of Alexander Barth [Bar03],


Page 30 Chapter 3. Problem Analysisas discussed in section 2.1.2. Here, the head/shoulder region of a person is detected, based on theassumption that the skull cap marks the topmost position of a person. This method is very reliable ifthe given requirements (skull cap is the topmost point, persons are not too close to one another, neckis distinguishable from the head) are met. This, however, can not be guaranteed in real environments.Feature Based ClassificationThere are many different ways to classify a person based on its features. Since the face is the mostprominent feature of a person, it stands to reason to use this feature as classification. In this projectthe assumption is made that the face of a person has to be (initially) visible. If a face has been found,the object can be classified as a person.The faces themselves can be classified by multiple, feature-based methods. A feature in an image is aregion that meets certain criteria, like a sudden change in contrast or a prominent edge. Exemplifying,Haar-like features and Gabor wavelets are presented here.• Haar-like featuresThe use of Haar-like features was first described in [VJ]. Haar-like features describe certainchanges in contrast between regions in an image. Those features are defined by their basicform, their orientation and their size (comp. figure 3.2.1). Theses features are tried to be foundin a so-called integral image.An integral image, as described in [VJ] is calculated by combining the sum of all pixels to thetop and the left of the current pixel for each pixel in the original image, as described in equation3.7. ii denotes the integral image, whereas i denotes the original image, and ii(x, y) and i(x, y)denote the respective value of the image at position (x, y).ii(x, y) =∑i(x ′ , y ′ ) (3.7)x ′ ≤x,y ′ ≤yUsing equations 3.8 and 3.9, assuming s(x, −1) = 0 and s(−1, y) = 0, the integral image canbe calculated in one run over the image.s(x, y) = s(x, y − 1) + i(x, y) (3.8)ii(x, y) = ii(x − 1, y) + s(x, y) (3.9)Equations 3.7, 3.8 and 3.9 are taken from [VJ]. The integral image allows to determine thesum of all pixels inside of any rectangle by four array references, namely the corners of therectangle. Thus, an access to the sums can be performed in O(1).Using the integral image, Haar-like features can be determined. Several features suited best


3.2. Possible Methods Page 31Figure 3.3: Haar-like features suitable for face detection. The features are divided into three classes: edgefeatures, line features and center-surround features. For each class a set of basic orientations is given. [LLK03]for face detection have been identified in [LLK03], which are depicted in figure 3.2.1. Using apreviously trained classifier, areas in the image which contain a combination of features typicalto faces are determined. If such an area is found, it is classified as a face.A correct classification is highly dependent on the training set that is used. That training setis generated using a method called AdaBoost, as presented in [FS96]. It actually consists of twoparts: a set of positive and a set of negative examples. The positive examples only contain thetarget objects (faces, in this case), while the negative set (which has to be significantly larger)contains images in which the target object is not present. Using AdaBoost, a set of featuresunambiguously describing the target object can be generated. Also, the number of featuresneeded to describe an object is vastly reduced, by a factor of about 10 3 .In order to speed up the classification procedure, a cascade of classifiers is used, each workingwith a greater and more precise set of features than the previous classifier. That way, it ispossible to reject areas not containing the target object very early, and in addition the numberof false positive classifications can be kept low.The training of such a cascade is very time-consuming. The calculation of a cascade having avery high accuracy while maintaining real-time performance takes, depending on the size of thetraining set (which should be several hundred positive and more than 1000 negative examples),from several hours up to over a week.• Gabor waveletsGabor wavelets can be used to represent a set of local features in an image. Such a wavelet consistsof the filter response of an image to a set of Gabor filters, which are combined into a vector.In order to calculate the similarity between to image areas, the cosine similarity (the nor-


Page 32 Chapter 3. Problem Analysismalized dot product) of those vectors can be taken. Gabor wavelets were presented in [ZVM04].A Gabor filter has the form of a complex plain wave restricted by a Gaussian function:Ψ i (⃗x) = ||⃗ k i ||σ 2 · exp(− ||⃗ k i || 2 · ||⃗x|| 22 · σ 2 ) · [exp(j ⃗ k i ⃗x) − exp( σ22 ] (3.10)In equation 3.10, exp(− || ⃗ k i || 2·||⃗x|| 22·σ 2 ) denotes the Gauss function to restrict the wave and exp(j ⃗ k i ⃗x)denotes the complex plain wave which determines the oscillatory part of the kernel. This isthe general equation describing a Gabor filter kernel, whereas the specific form of the the filterkernel is defined by the standard deviation σ and the vector ⃗ k i . Usually, σ is constant, while⃗k i is defined by( ) ( )kix kv cosΘ µ⃗k i = =k iy k v sinΘ µ(3.11)where the size of the filter is specified by k v = kmaxf v and the orientation is specified by Θ mu =µ · π8. This results in a set of filter kernels, whose filter responses are combined into a featurevector which describes the Gabor wavelet. A set of such filter kernels is depicted in figure 3.2.1.The cosine similarity between the feature vectors of two Gabor wavelets denotes the similaritybetween the respective image areas. This method can be used for face detection, if a trainingFigure 3.4: Typical set of Gabor filter kernels. Parameters for these were σ = π, k max = π 2 , f = √ 2, v = 0..4and µ = 0..7. [OG07]set is established prior to the detection. The preparation of such a training set is quite timeconsuming,since multiple images of faces (typically more than 100) in different orientationshave to be selected and prepared. The actual training itself, however, can be carried out in


3.2. Possible Methods Page 33very short time (less than a minute for 150 faces), since the calculation of the Gabor wavelets,depending on the number and size of the filter kernels can be done very fast. This has theadvantage that the training set can be expanded on-the-fly, as additional faces are classified.Of course, these methods can be applied to any kind of object; this is completely dependent on thetraining.Usually, these methods iterate over the whole image (in certain steps) and determine which positionshave a sufficiently high probability to be considered a face. This is very time-consuming, but can begreatly improved by predetermining which areas are to be checked. Since faces can only be presentin skin colored areas, this can be used to restrict the areas to check for faces.The detection of skin in an image has been (and still is) a major topic in computer vision for thelast years. There have been many approaches to classify skin regions in an image, using differentmethods and color spaces. [Ze99] presents an approach to skin detection using three different colorspaces (namely, RGB, HSV and YCbCr), thus exploiting their individual strenghts.The first part detects skin color pixels as proposed in [KPS03]. Here, two different cases are distinguished:i. Constant, uniform illumination or daylightI r (x, y) > 95 ∧ I b (x, y) > 20 ∧ (3.12)max{I r (x, y), I g (x, y), I blue (x, y)} − min{I r (x, y), I g (x, y), I b (x, y)} > 15 ∧ (3.13)|I r (x, y) − I g (x, y) > 15 ∧ (3.14)I r (x, y) > I gr (x, y) ∧ I r (x, y) > I b (x, y) (3.15)ii. Non-uniform and bright ambient illuminationI r (x, y) > 220 ∧ I g (x, y) > 210 ∧ I b (x, y) > 170 ∧ (3.16)|I red (x, y) − I b (x, y)| ≤ 15 ∧ (3.17)I r (x, y) > I b (x, y) ∧ I g (x, y) > I b (x, y) (3.18)I n (x, y), n ∈ {r, g, b} are the red, green or blue values, respectively, of the image at position (x,y).Equations 3.12, 3.13 and 3.16 make sure the respective case is selected. In case 1, equation 3.14eliminates gray areas and 3.15 states that the red channel must always be the dominating component.In the second case, equation 3.17 allows white areas due to spotlights and equation 3.18 states thatin this case, the blue channel has to be the minimal component.Since the environment aimed at in this project is a workshop, a constant and bright illuminationcan be assumed, so only the first case has to be considered.


Page 34 Chapter 3. Problem AnalysisThe second part works on the YCbCr color space, as presented in [CN99]. Since this colorspace separates color information from luminance, very close threshold can be used to determine thecolor area:I ycbcr (Cb) > T Cbmin ∧ I ycbcr (Cb) < T Cbmax ∧ I ycbcr (Cr) > T Crmin ∧ I ycbcr (Cr) < T Crmax (3.19)The threshold T [Cb,Cr][min,max]may be chosen according to [CN99].The third part works very similar, using the HSV color space (which also separates hue informationfrom saturation and brightness), according to [KK96]. Here, skin color is determined by:I hsv (H) > T H min ∧ I hsv (H) < T H max ∧ I hsv (S) > T S min ∧ I hsv (S) < T S max (3.20)The result of each of these steps is a binary image, containing the possible skin regions as classifiedby the respective methods. These images will differ from each other, depending on the type of skinin the image. Combining theses images will result in an image which describes best the possible skinareas. This has also been used in [Zil07] to distinguish between wood and skin color.Figure 3.5: Combination of different color spaces resulting in robust skin image. The image is scanned for skincolored pixels using the RGB, HSV and YCbCr color spaces, which yield slightly different results. Combiningthese results into one image leads to a robust skin color classification.[Ze99]


3.2. Possible Methods Page 35Area MatchingIt is indispensable for the tracking of entities to redetect them in subsequent frames. By that it canbe assured that previously determined properties and classifications can be transferred. For that,either a top-down or a bottom-up strategy can be applied.Top-down strategiesA top-down strategy takes a shape of the previous image and tries to redetect it in the currentimage. In other words, a template is generated based on the last known shape, which is used for atemplate matching algorithm.A special variant of this are snake images or snake contours. Here, a contour (usually defined bya set of certain points) is laid over an area. Afterwards, a fitting algorithm is used to change thecontour by moving its points so that their energy is maximized. The energy of a point is highest atedges. Thus, the snake contour is (if possible) fitted to an object of roughly the same shape as theoriginal shape of the contour. If this is successful, the distorted contour describes the shape of thesearched object (see figure 3.2).Basically, snake contours are a technique to detect objects of a (roughly) known shape (as appliedin[Sie03]). If the last known contour of the object to search is taken as input contour, this may beused for object tracking.Bottom-up strategiesWith a bottom-up approach, at first the blobs in the current frame are determined (see HighKnowledge Acquisition) and afterwards they are mapped to the ones found in the previous image.Here it is necessary to find a measurement for the similarity between two compared blobs. This canbe achieved by multiple means. A widely used procedure is to compare a certain set of momentsinvariant to translation, rotation and scale (Hu Moments)[Hu62] of the contours. This leads to ahigh similarity score, even if those blobs are different in size, for instance.Those values are calculated as follows:The two-dimensional moments of a grayscale image of dimensions M × M are defined asm pq =x=M−1∑x=0y=M−1∑y=0(x) p · (y) q f(x, y) p, q ∈ N (3.21)where f(x, y) denotes the gray function at coordinates (x, y). The moments translated by an amount(a, b) are defined asµ pq =x=M−1∑x=0y=M−1∑y=0(x + a) p · (y + b) q f(x, y) (3.22)


Page 36 Chapter 3. Problem Analysis(a) (b) (c) (d)Figure 3.6: Sequence of shapes having the same set of Hu moments due to invariance in translation, rotationand scaleAfter scaling and normalization, the central moments are given asη pq = µ pq (p + q)µ γ , γ = + 1 (3.23)00 2Based on that, Hu defined 6 moments that are invariant to translation, rotation and scale (equations3.24 through 3.29) as well as a seventh moment which is invariant to skew (equation 3.30):M 1 = (η 20 + η 02 ) (3.24)M 2 = (η 20 − η 02 ) 2 + 4η 2 11 (3.25)M 3 = (η 30 − 3η 12 ) 2 + (3η 21 − η 03 ) 2 (3.26)M 4 = (η 30 + η 12 ) 2 + (η 21 + η 03 ) 2 (3.27)M 5 = (η 30 − 3η 12 )(η 30 + η 12 )[(η 30 + η 12 ) 2 − 3(η 21 + η 03 ) 2 ]+(3η 21 − η 03 )(η 21 + η 03 )[3(η 30 + η 12 ) 2 − (η 21 + η 03 ) 2 ]M 6 = (η 20 − η 02 )[(η 30 + η 12 ) 2 − (η 21 + η 03 ) 2 ]+4η 11 (η 30 + η 12 )(η 21 + η 03 )M 7 = (3η 21 − η 03 )(η 30 + η 12 )[(η 30 + η 12 ) 2 − 3(η 21 + η 03 ) 2 ]−(η 30 + 3η 12 )(η 21 + η 03 )[3(η 30 + η 12 ) 2 − (η 21 + η 30 ) 2 ](3.28)(3.29)(3.30)A further approach is to not merely compare the contours of two blobs, but to take color distribution,size and position into account, since these are important parameters for the description of a blob.Furthermore, the invariance of Hu moments is reasonable for general template matching, but causesproblems if several blobs of the same of similar shape are present in the image (as is expected whentracking several humans). A similarity measure based on these values has been developed during thisthesis:σ = 1 − (ω C · ∆C m∆A m+ ω A ·C max max(A 1 , A 2 ) + ω M · ∆M m+ ω R · ∆R m ) (3.31)M maxIn equation 3.31 each aspect of a blob is taken into account by its respective weight ω C , ω A and ω M ,which sum up to 1. ∆C m marks the difference in mean color, ∆A m the difference in size, ∆M m the


3.2. Possible Methods Page 37difference in position and ∆R m the difference in aspect ratio. Here, the information about size andposition is fundamental for tracking, which is disregarded by most other approaches.Motion EstimationTo improve the tracking and to allow a reasonable matching at all, the movement of a blob has tobe estimated. This will on the one hand vastly reduce the search space for a matching blob, andallows a harder restriction on the position of a potentially match.Kalman FilterA Kalman filter is a recursive filter which can estimate the state of a system based on a set ofincomplete and noisy data. It is used very often in computer vision applications to track motionsand to predict the velocity and motion gradient of an object. These applications, however, mostlyhave a very low sample rate and thus suffer from noise.Simple DifferenceBasically, it can be assumed that, a sufficiently high sample rate provided (as necessary in real timeapplications), the velocity and motion gradient of an object will not change significantly betweensubsequent frames.Thus, it is possible to use the difference in motion direction and velocity of the last two frames as avery simple (and easily implementable) motion estimation. In this case, the velocity would be theeuclidean distance between reference points in the blobs. Three possible reference points suggestthemselves for that: the center of mass of a blob, the mean of the points contained in the blobs orthe simple center of its bounding box. The latter can only be applied reasonably on rigid objects,since the bounding box of a person may change significantly, e.g. when the person raises an arm,while the person itself did not actually move to another position.The change in angle of the vectors from one reference point to another (the movement vectors) inthree subsequent frames may be taken as the motion gradient.Motion SegmentationIf a motion segmentation is done, a motion gradient is calculated for each pixel (albeit withoutmagnitude). This direction is significantly more accurate than the comparison of the movementvectors, since it is not based on a common reference point, but is calculated individually for eachpixel.


Page 38 Chapter 3. Problem Analysis(a) (b) (c)Figure 3.7: Different reference points for shapes. (a) Center of mass, (b) Mean value, (c) Bounding box center.While the center of mass and the mean value yield approximately the same point, the center of the boundingbox deviates from those.Occlusion DetectionOcclusion detection can be carried out by simple math in conjunction with the information gatheredby the motion estimation. In that case, the bounding boxes and the motion vectors of the objects areregarded. If an overlap (or a close proximity) of the boxes is detected by moving the bounding boxesby the calculated motion vectors, it has to be assumed that those areas will overlap in the followingframe. This has to be taken into account while tracking and matching.3.2.2 SelectionThis section deals with the selection of the methods presented in section 3.2 for the use in theapplication developed by this thesis. Advantages and disadvantages of the methods in regard to theapplication are pointed out. The objectives of this thesis (see section 2.3) (mainly reliability, stabilityand real-time performance) are of central importance for this selection.Foreground-/Background Segmentation and ROI IdentificationAs described in section 3.2.1, certain disadvantages for difference images as well as motion segmentationexist. A difference image is subject to noise and has to be performed over the whole image,while a motion segmentation develops a “corona” around the moving object, their “inner area” is notregarded, and non-moving objects are completely left out.A significantly more precise and efficient segmentation can be accomplished by combining theaforementioned methods. Thus, the areas in the image which are moving are determined usingmotion segmentation. These areas form ROIs in which a difference image is calculated. That way,image areas in which no movement has occurred will be left out (since these are not needed for


3.2. Possible Methods Page 39further processing). This is a lot faster than computing a difference image from the whole imageand is much less subject to noise. In addition, the ”holes“ and ”coronas“ of a motion segmentationare avoided.Using this procedure would however result in that target objects which are not moving will not berecognized. Because of this, it is necessary that a high-level feedback of the person classificationis taken into account. Using that, further ROIs are to be computed by using the feedback of theperson classification. Whenever an object is classified as a person, the area around it (eventuallymoved be the motion vector) has to be marked for segmentation in any case.Considering this,the foreground/background segmentation will be done by the combinationof motion segmentation and a difference image, taking high-level feedback into account.Background AdaptationIn order to meet the requirements of the problem, it is imperative to employ an adaptive methodfor background acquisition. Since a change in background has to be reacted to in minimal time, it ismost suited to use either alpha blending or GMMs for that.Since alpha blending is very easy to implement, the decision was made to use it in this thesis.Furthermore, this method is sufficient (for the time being), since few ”small-sized“, periodic changesare to be expected in the target environment (workshops). In addition, it can be derived from otherwork (comp. [Hah05]) that the processing time needed to calculate GMMs is significantly higherthan for alpha blending; moreover GMMs are not able to react to changes as quickly. Nevertheless,the implementation GMMs could be by all means reasonable as an extension.Common to both methods is the fact that non-moving objects are included into the background,analogous to motion segmentation. Thus, it is important to consult high-level feedback into theadaptation as to provide that (target) objects are included in the background calculation (independenton their movement, if there is any at all). This has also been done in [BH05].High-Level Knowledge AcquisitionPrevious implementations have shown severe problems when using the presented methods. Binaryregion growth will assign too large areas to each other, all differentiation information is lost. Colorbasedregion growth leads to much better results, depending on the thresholds and seed points, butalso leads to too small areas in many cases. In addition, this method is quite costly, so that ananalysis in real time is not possible.An edge-based contour detection fails due to the environment constraints. Due to the fact that the


Page 40 Chapter 3. Problem Analysisclothing of persons and the appearance of objects is not restricted, many edges occur, which leadsto contours that are too small. A hierarchical contour search would in theory lead to large ”outer“contours which enclose the smaller ones, but practical experience shows that the outer contours areclosed seldomly and are therefore unable to form ”hulls“.A template matching presumes a certain shape for the objects to be detected. However, this can onlybe true for rigid objects, while for a human a nearly infinite amount of postures is possible. Thus itis barely possible to fit a static template to this.Since the motion segmentation which is used for foreground/background segmentation already identifiesdistinct areas in the image which have a similar motion gradient each, this can be used forblob detection. Based on the assumption that connected image parts moving in the same directionbelong to the same object, the possibility arises to combine these as a blob. Because of the factthat the motion segmentation is performed in any case, this will have positive effects on applicationperformance as well.However, since it is possible that multiple persons move in the same direction or enter the image areatogether, a further distinction into multiple possible objects is necessary. In this case, the high-levelfeedback of the application can be used as well.Object/Person ClassificationThe results of [Bar03] and [BH05] let the approach of detecting the head-shoulder region seem promising.However, it proved to be not suitable for this application because it it based on assumptionswhich are too strong. While it is not necessary for a person to be (initially) facing the camera, itis required that the head marks the highest (skin colored) point of the person and that the personis positioned sufficiently singular in the image. Both assumptions cannot be met in real workshopenvironments. Furthermore, a certain posture of the person is presumed.The usage of AdaBoost cascades and the employment of Gabor wavelets proved similarly feasible intest implementations, also a significant difference in speed could not be discerned. A major drawbackof AdaBoost cascades is that they have to be trained offline, however there are many ”prefabricated“cascades that can be used. Basically, the detection of skin areas is not necessary here, since theclassifiers work on grayscale images. The search for matching areas in an image takes too long,however, so that real-time performance cannot be guaranteed. A skin color based filtering providesa significant gain in performance that comes close to that of Gabor wavelets.The advantage of Gabor wavelets is that they can be calculated on-the-fly and thus, newly classifiedfaces can be included into the training set. This can be done by the usage of a database. Also, thiscan be used to aid tracking. Nevertheless, the initial training set has to be determined offline as wellin order to ensure a robust classification. Unfortunately, depending on the quality of the case base a


3.2. Possible Methods Page 41certain amount of false positives arises, which have to be filtered by model knowledge.The usage of Gabor wavelets in combination with a case based reasoning system (as proposed in[ZLJ06]) proved to be suited best. A application using this has been implemented by Marius Osterand Joachim Günther as a semester project at the <strong>Bonn</strong>-<strong>Rhein</strong>-<strong>Sieg</strong> University of Applied Sciences[OG07]. Their work was integrated into the application developed in this thesis.Area MatchingThe usage of a top-down approach (especially the use of snake contours) seems promising, since thisallows the skipping of several analysis steps. Namely, the high-level knowledge does not need to beacquired again. Unfortunately, test implementations have shown that the results of snake contoursare too inaccurate, especially when using cluttered backgrounds. Most importantly, the head of aperson failed to be included in the resulting shape.A bottom-up strategy however allows far more accurate matching, without losing important imageinformation. Matching blobs using Hu moments proved impracticable, since the outer contours ofa person may change significantly in-between subsequent frames. Furthermore, position and size(which are invariant for Hu moments) are a major factor for assignment, since multiple objects ofthe same shape may be in the image.A score combining size, aspect ratio, color distribution, position and movement proved very robustand accurate.Motion EstimationIn spite of the vast spread of Kalman filters in computer vision (and especially tracking), they provedtoo inaccurate for this application. This can be attributed to the fact that Kalman filters expectincomplete and noisy sets, which is not provided if the sample rate (frame rate) is high enough (inother words: the data set is accurate and very complete).Since the motion gradients for the separate entities are already calculated when performing a motionsegmentation, these can be employed to predict the movement of a blob, based on the assumptionthat the moment does not change significantly in between two frames. Because the magnitude ofthe movement is not calculated here, it is estimated from the euclidean distance of the center ofmasses of the blob in the two predecessing frames.The following chapter deals with the implementation of the methods selected here.


Page 434 RealizationIn this chapter, the practical implementation of the methods which have been chosen in the previouschapter will be explored in detail.At first, the general structure and work flow of the program will be portrayed, where it will be dividedinto the three main tasks presented in section 3.1. The following sections are ordered by these tasksand describe the implementation.At last, section 4.3 shows the part of this application in the surrounding VisionLab2 framework it isdeveloped in.4.1 General StructureThe general structure of the application is as depicted in figure 4.1.The separate tasks of the problem are highlighted by color in this figure. Red parts belong to thepreprocessing, blue parts to the classification and green parts to the tracking. The respective partswill be illustrated in the following.4.2 Application StepsIn this section the single sub-tasks of the processing are to be discussed in-depth. The order is roughlybased on the diagram presented in figure 4.1, but will be changed occasionally where reasonable.4.2.1 InitializationThis is not an own sub-task, but the foundations for later processing of the data is laid out here. So,an initial background image is calculated (see section 4.2.2) and a sufficient buffer for the creation ofa motion history image is fabricated.


Page 44 Chapter 4. RealizationFigure 4.1: General structure of the algorithm. The red parts indicate preprocessing steps, blue parts indicatethe classification, and green parts denote the tracking steps. Dashed lines mark high-level feedback to thedifferent stages4.2.2 PreprocessingForeground/Background SegmentationThe background segmentation is divided into several steps, according to the following principle:


4.2. Application Steps Page 45Figure 4.2: Screen shot of the application. Each image depicts a result of a processing step.Figure 4.3: Work flow for background segmentation. The dashed lines mark the influence of high-level feedback.First, search rectangles are determined based on motion segmentation, afterwards a difference mask is createdbased on the search rectangles. This mask is dilated and eroded in order to improve its quality, following whichblobs are created using the mask.At first, a motion segmentation as described in section 3.2.1 is carried out. Here, different areasof the image which have moved are identified, and the bounding rectangles for these are formed(subsequently called motion rectangles). Because these areas may overlap partially, for example if a


Page 46 Chapter 4. Realizationperson lifts an arm, such areas are merged (provided that no occlusion of detected objects has beenpredicted for that area). This allows for a more exact calculation of a difference image.In the following step those areas which are to be segmented distinctly are identified. This is doneusing high-level feedback. Areas in which a person was detected last are moved by the motion vectorof the person and expanded by a certain border. This is necessary since a segmentation must becarried out in these areas, even if no motion was detected. Afterwards, those areas are expandedby those motion rectangles which overlap with them (up to a maximum allowed size). The resultingareas will subsequently be called search rectangles.The remaining motion rectangles (which could not be assigned to a search rectangle) apply to thesegmentation as new search rectangles.After the search rectangles have been identified, a difference image is calculated, This is done by firstcalculating a difference mask based on the search rectangles:⎧⎪⎨ 1, if |I curr (x, y) − I ref (x, y)| ≥ δ ∧ x, y ∈ (SearchRects)M diff (x, y) =⎪⎩ 0, else(4.1)The difference calculation by equation 4.1 is done in the HSV color space. Since it separates hue fromsaturation and brightness, it is possible to recognize similar colors, despite a change in brightness.Furthermore, colors which are almost black or white are left out, because an accurate result canonly be achieved with high difficulty. The actual thresholds used in this application vary dependingon the illumination and need to be adjusted for each setup individually.This mask is still prone to noise, and actual shapes may have “holes” (i. e. inner areas which arenot recognized as different from the background) in them. This may happen if the color of thebackground corresponds to that of the foreground (e. g. if a person is wearing a white shirt, standingin front of a white wall). These problems can be solved or at least heavily reduced by applyingdilation and erosion.A dilation works by “expanding” shapes by taking the respective maximum value out of a(a) (b) (c)Figure 4.4: Hole filling algorithm. (a) Original mask (b) Mask after dilation, (c) Mask after erosionneighborhood to be defined (usually the 3x3 or 5x5 neighborhood). By subsequently iterating this


4.2. Application Steps Page 47the result can be enhanced significantly. Ideally, the “holes” in the mask are filled by now. However,the border of the shape will be expanded as well, and by that including areas not belonging to theshape. This can be reversed by applying erosion, which causes the exact opposite of dilation. Here,the minimum value of the defined neighborhood (which should be the same size as that used fordilation) is taken for each pixel. This leads to a degeneration of the border areas while inner areas(if they have sufficiently been closed) are left unchanged. This technique is also known as “holefilling” (comp. [Bar03]).After the mask has been adjusted, distinct blobs are formed using the search rectangles and themask, where search rectangles of blobs already classified as persons are preferred. Since a searchrectangle portrays a distinct movement or person area, all pixels in this area can be assigned to theblob, as described in section 3.2.2. A special treatment is applied if a possible occlusion has beenpredicted or detected for that area. In the case of a first occlusion, both blobs are assigned the wholearea. Using the similarity measure (see section 3.2.2) it can be determined which blob is foremost.This will be handled with preference in the subsequent frames, while only those areas which havenot been assigned to that blob will be considered for the occluded blob. This is because it must beassumed that the same blob will be occluding the other one while occupying the same image spacein subsequent frames.Background AdaptionIn order to adapt the background image, a mask excluding areas not to be adapted is created at first,using high level feedback of the classification of the blobs and the motion segmentation. Foremost,these areas are classified persons (plus a certain border), as well as all areas of the images in whicha motion has been detected. Thus, it is guaranteed that solely non-moving and irrelevant areas aretaken into account for the background calculation.The actual adaption is done by alpha blending (see section 3.2.1). A relatively high α is chosen, inorder to allow a very flexible reaction to changes of the background (especially regarding illumination).4.2.3 ClassificationClassification is divided into two main parts, skin color recognition and face classification, plus overhead(preselection and evaluation of the gathered data), as can be seen in figure 4.5:At first, it is checked if a further classification or a repeated search for the head (because it couldnot be redetected, for instance) is necessary at all. If this is the case, the search area is cut down tothe upper third of the area occupied by the blob. After that a mask representing the shape of the


Page 48 Chapter 4. RealizationFigure 4.5: Work flow for blob classification. The green box marks the steps for skin recognition, the blue boxmarks the head classification part. The skin recognition is only carried out if the person has not already beenclassified as a person. If enough skin pixels are present in the area of the blob, a head classification algorithmis carried out. If successful for a defined number of frames, the blob is classified as a person.


4.2. Application Steps Page 49blob is created in order to exclude areas not belonging to the blob, such as partially occluded objects.Using this mask, skin areas are identified using the method presented in section 3.2.1. If enough skinarea is present, blobs are created using a binary region growth algorithm. Here, blobs which are toolarge are divided into smaller parts using model knowledge in order to avoid false classifications.After the skin blobs have been created, they are classified using Gabor wavelets according to themethod selected in section 3.2.2.For this purpose a database was created in which the Gabor wavelets of a large training set of headsare stored. The mean cosine similarity of these Gabor wavelets and the one of the shape to check isthe basic probability of the shape to be a head or a face, respectively. This probability is verified byapplying model knowledge, since the head has to be in a certain position in regard to the rest of thebody. As an example, it has to be close to the vertical axis of the main mass of the body (arms mustnot be considered in this case).Afterwards, the maximum probability of all detected skin blobs is checked against a minimum probabilityfor accepting a head. If successful, the corresponding skin blob is classified as a head, andan according counter is increased. If a blob has been recognized as containing a head a sufficientnumber of times, it is classified as a person. A final classification is not done instantly in order tokeep the number of false positive classifications as low as possible. Given a appropriately high framerate (as is requirement of this thesis) the classification can still be carried out in a sufficiently shorttime span while simultaneously increasing stability.If the detected skin area is not large enough or if no skin blob can meet the probability requirements,although the respective blob has been previously classified as a person, the head is marked as “lost”,and the estimated position of it is stored. The classification as a person will not be lost for a blob.4.2.4 TrackingBlob MatchingMatching is done in order to assign the blobs detected in predecessing frames to the ones currentlydetected. By that, the classification status of a blob, its movement vector and other properties canbe transferred. This is done according to figure 4.6.Here, for each blob the similarity to each of the previously detected blobs is calculated by themeasure as depicted in section 3.2.1. These similarities are stored in one array for each blob, ofwhich each is then sorted. Afterwards, the pair of blobs with the highest similarity score is pickedand assigned to each other. The blobs are then removed from the similarity arrays of the other blobs.This procedure is iterated until all new blobs have been assigned to an old blob, or no old blobs areleft.


Page 50 Chapter 4. RealizationFigure 4.6: Assignment of newly found blobs to previously detected blobs. The similarity to all previouslydetected blob is calculated for each blob. Afterwards blobs are assigned by order of the highest similarity.If any old blobs have not been assigned, these are marked as “lost” (taking an calculated eventualocclusion into account) and are tried to be assigned again in the subsequent frames. If this cannotbe achieved over a certain time, the blob is regarded as permanently lost and will be deleted.Head RedetectionIn the head redetection part it is tried to redetect the head of a previously classified blob in thecurrent frame (in order to aid further processing). This is done as shown in figure 4.7.At first, an area in which the head is estimated to be (the search area) is determined. It is composedof the old position of the head moved by the movement vector of the blob and a margin around theestimated position. This area is then narrowed down by merely regarding those areas which containsskin colored pixels. Here, a hole filling algorithm is applied to enhance the search area.A Gabor wavelet is then computed, based on the blob acquired by that, and is compared to the oneof the current head. If the cosine similarity is sufficiently high, the acquired blob is taken as the


4.3. Framework Page 51Figure 4.7: Work flow for head redetection. This is done for all blobs previously classified as person. Theposition of the head is estimated using the motion vector. At the estimated position, a akin blob is created andcompared to the current head. If the similarity is high enough, the head is considered redetected.current head, and the corresponding Gabor wavelet is stored. Otherwise the head is marked as notdetected, which results in a new (and more exhaustive) search during object classification.This does not have any influence on the already established classification of the blob as being aperson. This allows the person to be tracked even if the head is not in the visible field of the cameraanymore (for instance, because the person turned around), if the person has been correctly classifiedonce.Occlusion PredictionOcclusion prediction is purely mathematical, taking into account the motion estimation. The boundingboxes of each blob are moved by its respective movement vector and are then tested for overlap.If an overlap is detected, the blobs are marked as possibly overlapping, in order to allow a specialtreatment in subsequent steps.4.3 FrameworkThe application developed in the scope of this project is a module of the VisionLab2 framework whichwas developed by Stefan Hahne in order to provide an environment for projects in the ComputerVision Lab of the Department of Computer Science at the <strong>Bonn</strong>-<strong>Rhein</strong>-<strong>Sieg</strong> University of AppliedSciences, and to ease the integration and cooperation of these projects. It has been developedexplicitly for the use in multi-processor environments. A module (the implementation of a project)in this framework is modeled as a pipeline, which expects images as input and produces one or more


Page 52 Chapter 4. Realizationresults.Several of these modules can be run in parallel by creating several pipelines, without cutbacks to theperformance of the individual pipelines, if run on a multi-processor system. A pipeline consists ofone or more pipe stages which may model the distinct processing steps and can be run in parallel.This also leads to an increase in performance if run on a multi-processor system.The framework provides fundamental functions and utilities needed for image processing. These are,among others, the in- and output of images, the connection to the camera system, textual outputand logging. Furthermore, performance measurements are constantly carried out.4.3.1 Integration of the project into the frameworkThe module is realized as a pipeline consisting of only one pipe stage. While the module consists ofseveral processing steps which could be carried out in parallel, most of these are highly dependenton high level feedback by other steps, making an execution in parallel hardly reasonable.Only the background adaption could be realized as a distinct thread, especially if computationintensive methods like GMMs are being used. However, many race conditions can occur in that case.


Page 535 Results and EvaluationThis chapter provides an in-depth evaluation of the application developed in this project, as presentedin chapter 4. For that, section 5.1 defines parameters of the environment and criteria by which tojudge the performance. Based on these parameters, section 5.2 defines several test cases based on theparameters and evaluates these against the identified criteria.5.1 IntroductionIn order to judge the developed system and to be able to make statements about whether it is ableto meet the objectives and requirements defined in section 2.3, an in-depth evaluation is necessary.Furthermore, advantages and disadvantages of the approach can be identified that way.5.1.1 Variation ParametersThere are several external parameters that influence the execution and the effect of the application.Those parameters mainly describe the environment of the system, but also the surveilled scene itself.Those can be divided in several classes, which are listed in table 5.1. Each of those classes describesa set of parameters which is disjoint from the other classes.Parameter Type VariationsBackground static uniform (white), clutteredIlluminationgeneral static (uniform, bright), (non-uniform, bright), (dark)change static constant, gradual, suddenNumber of persons dynamic single, multipleSkin color of persons dynamic light, dark, black, asianClothing of persons dynamic uniform, diverseMovementdynamic none, slow, fastTable 5.1: Evaluation parameters, separated by static and dynamic parameters.It is necessary to test as much combinations of those parameters as possible in order to be able toperform a conclusive evaluation. As can be seen in table 5.1, those classes are divided into static and


Page 54 Chapter 5. Results and Evaluationdynamic parameters. Static parameters describe the environment and can not or only partially bechanged, but generally are consistent over a longer period of time. Dynamic parameters, however,are changing throughout the execution of the program, as they are usually dependent on the numberand appearance of persons in the scene.To evaluate the specific behavior of the system regarding one of the parameters, configurations ofthose have to be changed singularly. This basically only applies to the static parameters, as thedynamic parameters may be tested simultaneously.5.1.2 Evaluation CriteriaIn order to determine the properties of the system, the test scenarios which are executed based onthe parameters described in section 5.1.1 have to be evaluated. For that, several criteria have to bedefined, based on the objectives given in section 2.3. That way, it is possible to determine whetherthe system meets those objectives.The evaluation criteria are:• Person detection rateThe person detection rate is a measure for the quality of the classification algorithm, independentof its speed. To determine the detection rate, the errors have to be quantified. There aretwo kinds of possible errors:– False negativesThe set of false negatives are the cases where a person is in the field of view of the camera(and thus in the image) but is not classified as such (see equation 5.1).Ω fn = 1 − |P classified||P total |(5.1)Here, P classified denotes the number of persons in the images that have been correctlyclassified, and P total denotes the number of total persons in the images. Thus, Ω fn denotesthe number of persons which were not classified. This is especially critical in the hazardarea, since a falsely negative classification may lead to hazardous situations and, in theworst case, to injuries. Thus, it is necessary that the amount of those is equal or close tozero.– False positivesContrary to false negatives, false positives define the set of cases, in which a region of theimage was classified as a person, although there is no person in that area (see equation


    This may happen, for instance, if structures are present in the image that correspond to, or come close to, the features of a face. Such errors may occur especially when there is a change in the background.
    A falsely positive classification is not as critical as a falsely negative one, since it cannot impose a hazard. Nevertheless, the number of these cases should be kept as low as possible, since such a classification could cause an alert or the shutdown of a machine, which greatly influences the acceptance and thus the applicability to practical use.

• Reaction time
  It is of central significance for the application to evaluate the given situation in a sufficiently short time span, in order to be able to avert a hazardous situation. Since the time needed to process an image defines the reaction time of the system, a precise statement about the reaction can be made based on the frame rate.
  In order to make a differentiated statement, the best, worst and mean performance for all test cases have to be determined, where the worst-case performance is the critical and thus defining value.

• Tracking
  In addition to the classification it is necessary that persons, once correctly classified, are tracked subsequently, so that a statement about their position can be made at any given time, even if they would otherwise not be classified as a person (e.g. if their back is facing the camera).
  The quality of the tracking algorithm is characterized mainly by the loss rate \Omega_{loss}, which relates the number of frames a person could be tracked (P_{tracked}) to the total number of frames the person was present (P_{total}), as stated in equation 5.3.

      \Omega_{loss} = 1 - \frac{|P_{tracked}|}{|P_{total}|}  \qquad (5.3)

  This measure should be as low as possible, because only then can it be guaranteed that the rate of falsely negative classifications is kept low.

  Special case: occlusion
  The tracking performance regarding occlusion has to be examined separately. In this case it can be assumed that objects can eventually not be tracked further, because they are fully occluded. Nevertheless, the system should detect this and estimate the tracked object to be within a certain area of the image. Once the object is no longer occluded, it should be assigned again. Whether and at which quality this happens is defined by the measure \Omega_{occl}, as given in equation 5.4.

      \Omega_{occl} = \frac{|P_{tracked}|}{|P_{occl}|}  \qquad (5.4)

  This designates the number of cases in which an object could be correctly assigned after being occluded, in relation to the total number of cases in which an object should have been assigned. This rate should be as high as possible in order to avoid falsely negative classifications.
  Since this is a special case, an incorrect assignment after occlusion is not included in the loss rate \Omega_{loss}.
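The following is a minimal sketch of how these measures could be computed from manually annotated ground truth. The Frame record, its field names and the per-frame counting scheme are assumptions made for illustration only and do not reproduce the thesis implementation.

# Hedged sketch: computing the evaluation measures (5.1) - (5.4) from manually
# annotated ground truth. The Frame record and its fields are illustrative
# assumptions for this sketch, not part of the thesis implementation.
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Frame:
    persons: Set[int] = field(default_factory=set)      # ground-truth person IDs visible
    classified: Set[int] = field(default_factory=set)   # IDs the system classified as a person
    false_detections: int = 0                            # regions classified as person without one
    tracked: Set[int] = field(default_factory=set)       # IDs the tracker kept assigned
    reassigned_after_occlusion: int = 0                  # correct re-assignments after occlusion
    occlusions_ended: int = 0                            # cases in which a re-assignment was due


def evaluate(frames: List[Frame]) -> dict:
    total = sum(len(f.persons) for f in frames) or 1                  # |P_total|
    classified = sum(len(f.persons & f.classified) for f in frames)   # |P_classified|
    tracked = sum(len(f.persons & f.tracked) for f in frames)         # |P_tracked|
    false_pos = sum(f.false_detections for f in frames)
    occl_due = sum(f.occlusions_ended for f in frames) or 1           # |P_occl|
    occl_ok = sum(f.reassigned_after_occlusion for f in frames)

    return {
        "omega_fn": 1.0 - classified / total,   # equation (5.1)
        "omega_fp": false_pos / total,          # equation (5.2)
        "omega_loss": 1.0 - tracked / total,    # equation (5.3)
        "omega_occl": occl_ok / occl_due,       # equation (5.4)
    }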


With the exception of the performance measurement, all these criteria require ground truth to be evaluated. This means that all cases have to be examined manually and compared with the data computed by the application. Most of the measures range between 0 and 1, while the interpretation of these values depends on the respective criterion.

5.2 Tests

5.2.1 Data Base

Many of the parameters given in section 5.1.1 (namely, the dynamic ones) relate to persons, because of which they possess a high variance. Thus, a group of persons as diverse as possible was selected for the test cases in order to cover a wide range of these variances. Table 5.2 lists the test persons examined for the evaluation, stating the parameters possibly affecting the processing of the data. The number of 11 test persons is considered sufficient because it covers most of the possible combinations. The remaining (static) parameters are varied between test cases as described in the next subsection.

No.  Sex     Skin color    Hair          Clothing        Miscellaneous
 1   male    light         dark, short   uniform, long   glasses
 2   male    light         dark, short   uniform, short  glasses, hat
 3   male    very light    blond, long   uniform, short
 4   male    light         brown, short  uniform, short  beard
 5   male    light         black, long   uniform, long   beard
 6   male    dark          black, short  diverse, long   beard
 7   male    dark          shaved        diverse, short
 8   male    light, asian  black, short  uniform, long   glasses
 9   female  tanned        blond, long   diverse, long
10   female  tanned        black, long   diverse, short
11   male    light         blond, short  uniform, long   glasses

Table 5.2: Overview of the test persons considered in the evaluation.

5.2.2 Setup

In order to make a statement as precise as possible about the behavior of the system, several configurations for the test cases have been chosen, based on the static parameters presented in section 5.1.1. These configurations have been tested with different persons as described in section 5.2.1 and with a varying number of visible persons, whereas not all configurations have been tested with all persons. Nevertheless, a sufficiently precise analysis can be made because several traits are shared by multiple persons.

There are four basic test configurations:


  Configuration 1: uniform background, uniform illumination
  Configuration 2: uniform background, non-uniform illumination
  Configuration 3: non-uniform background, uniform illumination
  Configuration 4: non-uniform background, non-uniform illumination

Setups with insufficient illumination are not considered, since this is excluded by the assumptions defined in section 2.4. Where applicable, the setup of the test environment resembles the one presented in figure 1.2.1 in section 1.2.1; otherwise it follows its constraints (especially the distance). Figure 5.1 shows configurations 2 and 3. Configurations 1 and 4 were constructed by changing the illumination.

5.2.3 Test Cases

The following subsections describe the test cases for the application and their evaluation regarding the respective criteria. The evaluation is done individually for all basic configurations (and thus the static parameters), whereas a uniform distribution of the dynamic parameters (due to the variety of persons) can be assumed.


[Figure 5.1: Test configurations used for the evaluation. (a) Uniform background. (b) Non-uniform background. Illumination was changed to achieve the further configurations.]

Test Case 1: General Classification Accuracy

This test case examines the algorithm with respect to its detection rate. For this, test sequences with one or more persons in the field of view of the camera are carried out for all configurations, and the false positive and false negative classification rates are determined.

Only frames containing at least one person are considered to determine the false negative classification rate. A classification is rated false negative if a person is not classified as such although it is in the field of view of the camera. In addition, a false negative classification is only counted if the person meets the requirements of section 2.4, which means that it is at a certain distance to the camera and its face is visible and at an angle of at most 30 degrees with respect to the camera.

The false negative classification rate is computed based on the number of persons in the images. If two persons are present in one image and both are not correctly classified, this leads to two false negative classifications. Accordingly, the total number of possible classifications emerges from the total number of persons in all frames (the ground truth). Two separate evaluations are needed here, because a sophisticated tracking may render a further classification of a person unnecessary. Therefore, the "pure" classification rate (only considering non-tracked persons) and the "total" classification rate (including tracked persons) are determined. The first is a measure for the sole classification step, while the second rates the reliability of the system in general.

As can be seen in figure 5.2, the false negative classification rate is quite high (1.45 %) if classification is done in each frame individually and anew. A comparison of the two result sets, however, reveals that the rate improves significantly (0.43 %), so that the detection can be considered reliable, albeit it does not entirely fulfill the requirements imposed by work safety.


[Figure 5.2: False negative classification rate for the test configurations ("pure" vs. "total" classification). In the worst case (non-uniform background, non-uniform illumination) a false negative classification rate of 0.43 % is achieved.]

The fact that the detection rate including tracking is significantly higher can be attributed to positive classifications being kept throughout subsequent processing. This also means that a false negative classification typically occurs only once for a person. Therefore, an evaluation of the "pure" classification rate after a certain amount of time may provide a more meaningful result. The test results are given in figure 5.3.

[Figure 5.3: Pure false negative classifications after an elapsed number of frames, per configuration. The classification accuracy improves significantly over time.]

The results presented in the figure show that the detection of persons is significantly more stable and reliable after 5 frames, and that after 20 frames the false negative rate is close to zero. Provided a sufficiently high frame rate, this means that a person can be reliably detected within a time span shorter than a second. It can be assumed that this is sufficient to detect persons before they enter a hazard area.
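A minimal sketch of how such a time-resolved false negative rate could be derived is given below; it builds on the hypothetical Frame records of the earlier sketch, and the function and parameter names are assumptions made for illustration.

# Hedged sketch: pure false negative rate restricted to persons that have already
# been visible for at least `horizon` frames (cf. figure 5.3). Builds on the
# hypothetical Frame records of the earlier sketch in section 5.1.2.
from typing import List


def fn_rate_after(frames: List["Frame"], horizon: int) -> float:
    first_seen = {}   # person ID -> index of the frame of first appearance
    missed = 0
    considered = 0
    for i, f in enumerate(frames):
        for pid in f.persons:
            first_seen.setdefault(pid, i)
            if i - first_seen[pid] < horizon:
                continue              # person not yet visible long enough
            considered += 1
            if pid not in f.classified:
                missed += 1
    return missed / considered if considered else 0.0


# Example: rates after 0, 5, 10, 15 and 20 frames, as evaluated in figure 5.3.
# rates = {h: fn_rate_after(frames, h) for h in (0, 5, 10, 15, 20)}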


The false positive detection rate gives the number of frames in which an area in the field of view of the camera was falsely classified as a person, in relation to the total number of frames. If multiple false positive classifications occur in one frame, all of them are taken into account, while the respective frame is counted only once. Hence, it is theoretically possible to achieve false positive classification rates greater than 1 (100 %).

[Figure 5.4: False positive classifications for the test configurations. In the worst case (non-uniform background, non-uniform illumination) a false positive classification rate of 3.86 % is achieved.]

Figure 5.4 shows clearly that the rate of false positive classifications is quite high.


This can be attributed to the fact that the classification is carried out on relatively small image areas (about 15 to 40 pixels square, at most). As a consequence, some features may not be identified correctly. An analysis of the video data shows that false positive classifications mostly occur in areas with prominent edges that include skin-colored regions. Such areas may be segmented from the background because of a change in illumination (or shadow casting) and then be classified as a person.

The results show that further processing is needed to eliminate or reduce false positive classifications.


Test Case 2: Tracking Accuracy

The aim of this test case is to determine the quality of the tracking algorithm. Here, the tracking loss is calculated, which is computed as the number of falsely unassigned blobs over the total number of frames. A blob is considered falsely unassigned if an object is present in two subsequent frames, but the tracking algorithm is not able to establish a correspondence between the blobs of the object in these frames. Only blobs previously classified as persons were considered, since these are decisive for fulfilling the objectives of the system. The evaluation starts with the first frame in which the object was classified as a person and covers a span of at most 1000 frames, or until the person left the field of view of the camera. The special case of occlusion was not considered here, because occlusion significantly complicates the tracking of occluded blobs; it is treated separately below.

The overall loss rate results from the mean value of the loss rates of all considered blobs.

[Figure 5.5: Overall loss rate for the different test configurations. In the worst case (non-uniform background, non-uniform illumination) a tracking loss rate of 0.12 % is achieved.]

Figure 5.5 shows that the program is able to track classified blobs very reliably. Problems occurred, however, when considering similar blobs which were situated close to each other and moving in the same direction.
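A minimal sketch of how the per-blob loss rate and the overall mean described above could be computed is given below; the data layout (one list of booleans per person, True where the tracker kept the correspondence) is an assumption made for illustration.

# Hedged sketch: per-blob tracking loss and the overall mean loss rate. The data
# layout is an illustrative assumption, not the annotation format of the thesis.
from typing import Dict, List

MAX_SPAN = 1000  # evaluation span per person, as used in this test case


def mean_tracking_loss(assigned: Dict[int, List[bool]]) -> float:
    per_blob = []
    for flags in assigned.values():
        flags = flags[:MAX_SPAN]                   # at most 1000 frames per person
        if not flags:
            continue
        lost = sum(1 for ok in flags if not ok)    # falsely unassigned frames
        per_blob.append(lost / len(flags))
    return sum(per_blob) / len(per_blob) if per_blob else 0.0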


Special case: Occlusion

While the overall tracking rate is very high and reliable, occlusion introduces severe problems to tracking. While a possible occlusion can be predicted very precisely, it is very hard to determine which of the overlapping blobs is occluded by the other. This is especially difficult if the overlapping blobs are very similar (as with persons wearing dark, uniform clothing). Tests show that the detection rate for the actually foremost blob does not change, but the program performs very unreliably if the occluded person changes its movement while occluded.

Since the results are heavily dependent on the behavior of the persons (and hence on the specific situation), it is not possible to present a meaningful result set.


Test Case 3: Reaction to Movement

In this test it is to be determined whether very fast movement (which cannot be depicted correctly due to frame rate limitations) may lead to problems regarding tracking and classification. These tests were carried out with only a single person, because any problems that occur are independent of the number of persons in the image.

Figure 5.6 shows the results of the test case. The tests were evaluated separately by false negative and false positive classification rate and by tracking loss (after correct classification). The tests were done at a frame rate of 20 frames per second, and the motion of the person was fast enough to introduce motion blur into the image.

[Figure 5.6: Effect of fast motion on tracking and classification. While classification is heavily impacted by fast motion, tracking rates stay reliable.]

The results show that very fast movement has a significant influence on the classification rate (especially considering false negative classifications), but only little effect on tracking loss. Therefore, fast movement should be reacted to separately, because a reliable detection of persons cannot be guaranteed.
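One conceivable way to react to fast movement separately, as suggested above, would be to flag blurred frames before classification. The following sketch uses the common variance-of-Laplacian sharpness measure; it is not part of the thesis system, and the threshold value is an arbitrary assumption that would need tuning.

# Hedged sketch: flagging motion-blurred frames via the variance of the Laplacian.
# A low variance indicates few sharp edges, i.e. a likely blurred frame.
import cv2

BLUR_THRESHOLD = 100.0  # assumed value, not taken from the thesis


def is_blurred(frame_bgr) -> bool:
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()
    return sharpness < BLUR_THRESHOLD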


Test Case 4: Reaction to Illumination Changes

In this test case the behavior of the application in reaction to changes in illumination is examined, especially with respect to the tracking behavior as well as false positive classifications. False negative classifications are factored out, since the classification depends only on the current frame and does not include feedback from adjacent frames.

Two cases have been tested: a sudden and a gradual change in illumination. The results are given in figure 5.7. Furthermore, only configurations 1 and 3 have been tested, since they differ from configurations 2 and 4, respectively, only by the lighting parameters, and this parameter is the one changed in this test.

[Figure 5.7: Effect of a sudden and a gradual change in illumination on (a) tracking loss and (b) false positive classifications (constant illumination given for reference). While a gradual change has no effect on classification and tracking, a sudden change has a severe impact on the application.]

As can be deduced from the figure, the gradual change in illumination has no effect whatsoever on the calculation, while a sudden change can cause a rise in false positive classifications. Furthermore, a very strong lighting change can lead to errors in tracking. Nevertheless, the tracking algorithm proved quite robust against changes in illumination.
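Since a sudden change in illumination was the main source of additional false positives, a conceivable safeguard would be to detect such changes globally and, for instance, trigger a refresh of the background model. The sketch below illustrates this idea under that assumption; the threshold is hypothetical and not a value from the thesis.

# Hedged sketch: detecting a sudden global illumination change between consecutive
# frames by comparing mean gray levels; the threshold is an illustrative assumption.
import cv2

SUDDEN_CHANGE_THRESHOLD = 25.0  # mean gray-level difference, assumed value


def sudden_illumination_change(prev_bgr, curr_bgr) -> bool:
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_bgr, cv2.COLOR_BGR2GRAY)
    return abs(float(curr_gray.mean()) - float(prev_gray.mean())) > SUDDEN_CHANGE_THRESHOLD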




Test Case 5: Reaction Time

The aim of this test case is the analysis of the processing speed and hence the reaction time of the system. Accordingly, no statement about the quality of the algorithms is made, but rather about the performance and hence the applicability of the system to work safety relevant environments with respect to time constraints.

The frame rates given in figure 5.8 were taken using a test sequence in which a varying number of persons entered or left the field of view of the camera; from this, the minimum, maximum and mean frame rates were determined. While the mean frame rate allows a statement about the overall performance of the system, the minimum frame rate is the critical factor, since it restricts the reaction time.

The measurements were made on a commodity hardware PC, namely a 2 GHz dual Opteron system with 4 GB of RAM running Gentoo Linux. The tests were restricted to 5 persons, because this is the maximum number of persons which can be present and distinguished in the field of view of the camera at the given distance.

[Figure 5.8: Performance of the system (per-frame processing time, overall and worst case). While having an average frame rate of about 20 frames per second (and thus a reaction time of about 50 ms), the frame rate for the worst case (5 persons in the image) is 7 frames per second (about 143 ms reaction time).]

Figure 5.8 shows that a processing speed of at least 7 frames per second can be achieved (in the worst case, with 5 persons in the image), which corresponds to a reaction time of 143 milliseconds. This will usually not be sufficient for work safety requirements, but depends on the respective case.


The mean processing speed is 20 frames per second, which corresponds to a reaction time of about 50 milliseconds and meets the requirements in any case. If only one or two persons are in the image, a speed of 23 to 25 frames per second can be reached, which corresponds to a reaction time of 40 to 43.5 ms.
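A minimal sketch of how such per-frame timings and the derived frame rates could be measured is given below; process_frame stands in for the detection and tracking pipeline and is a hypothetical name, not the thesis implementation.

# Hedged sketch: measuring per-frame processing times and deriving frame rates
# and reaction times. `process_frame` is a placeholder for the per-frame pipeline.
import time
import cv2


def measure(video_path: str, process_frame) -> None:
    cap = cv2.VideoCapture(video_path)
    times = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        t0 = time.perf_counter()
        process_frame(frame)                       # full per-frame pipeline
        times.append(time.perf_counter() - t0)     # seconds spent on this frame
    cap.release()
    if not times:
        return
    worst, mean = max(times), sum(times) / len(times)
    print(f"worst case: {1.0 / worst:.1f} fps ({worst * 1000.0:.0f} ms reaction time)")
    print(f"mean:       {1.0 / mean:.1f} fps ({mean * 1000.0:.0f} ms reaction time)")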


6 Conclusions and Further Work

In this chapter a summary of the results of this thesis is given, and a statement is made as to whether the objectives defined for this work were met. Furthermore, it is analyzed whether the developed system is suitable for practical use. This is done in section 6.1.

Building on that, section 6.2 defines which further steps are needed in order to meet the objectives and requirements that could not be met by this thesis; in addition, suggestions for further use of the results of this thesis are given.

6.1 Conclusions

When comparing the results determined in chapter 5 with the objectives and constraints given in chapters 1 and 2, it can be said that most of the objectives have been achieved, albeit with restrictions. Provided that a correct initial classification of a person is allowed to have a delay of one second, multiple persons in the field of view of the camera can be detected correctly and reliably. Yet, a detection rate of 100 % cannot be achieved. The tracking of correctly classified persons proved very robust, and changes in illumination, if not too strong, can be compensated for. Under heavy occlusion, however, serious problems arise; this objective could not be met. An occluded person may eventually no longer be tracked and may lose its classification status.

The methods selected in chapter 3 thus proved suitable for the chosen approach to the problem; only the face detection yields too many false positive results. Here, further filtering has to be done, or more precise image data has to be examined.

The application is able to work on commodity PC hardware with a very short reaction time. Furthermore, an arbitrary background situation (especially a cluttered background) does not have any influence on the performance of the program. However, false positive classifications may arise on heavily cluttered backgrounds.

Concluding, it can be said that the system is NOT suitable for practical use in work safety relevant environments. A measure in work safety requires lower false negative classification rates (which are almost met, however) and, most importantly, significantly lower false positive classification rates. The high number of false positive classifications heavily constrains acceptance and thus the practical applicability of the system.


Nevertheless, it has been shown that the usage in real environments is possible in principle, and that persons can be detected regardless of their posture, surroundings, clothing and overall appearance. This is the main benefit of this work.

6.2 Further Work

In order to achieve applicability of this system to the area of work safety, further work is needed. In particular, the rate of false positive classifications has to be lowered significantly. This could be achieved by further filtering using model knowledge, or the algorithm for face classification itself would have to be improved.

An enhancement of the classification by the use of model knowledge could also benefit the system, because persons can only be correctly classified if their face is visible at least once. However, the dependency on a certain posture and shape would have to be considered.

The problem of occlusion is very hard to solve because of the missing depth information. Acquiring this information, for instance by the use of an ultrasound sensor or a second camera, could contribute significantly to resolving these inaccuracies. Furthermore, false positive classifications could be reduced considerably by the use of depth information. Nevertheless, it has to be considered that an additional sensor system imposes further cost; moreover, heavy losses in performance may result. This would have to be examined.

The background adaptation proved suitable and sufficiently robust against changes in the tests. However, a static background is assumed; changes (e.g. a box placed in the field of view) are adapted into the background model, but periodic changes such as the leaves of a tree or moving machine parts are not considered. This could be solved by the use of GMMs, although an impact on the processing speed is to be expected. A minimal sketch of such a background model is given at the end of this section.

While the system is able to detect persons, this only contributes to work safety if areas which no persons may enter are surveilled. However, the system could also be used as a preprocessing step for further analysis applications, such as a hand detection at a table saw (cf. chapter 2). For this, a posture analysis and reconstruction of the detected persons (or of those in the hazard area) would be needed. This requires a precise detection of the contour of a detected person as well as a reconstruction of the posture from 2D data.
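As referenced above, a Gaussian-mixture background model could cope with periodic background changes. The following is a minimal sketch using OpenCV's MOG2 implementation; the parameter values and the video file name are illustrative assumptions, not settings evaluated in this thesis.

# Hedged sketch: GMM-based background subtraction (OpenCV MOG2) as a possible
# replacement for the static background model; parameter values are assumptions.
import cv2

subtractor = cv2.createBackgroundSubtractorMOG2(
    history=500,         # frames used to build the mixture model
    varThreshold=16,     # squared Mahalanobis distance to accept a pixel as background
    detectShadows=True,  # mark shadows separately so they are not segmented as objects
)

cap = cv2.VideoCapture("workshop_sequence.avi")  # hypothetical test sequence
while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg_mask = subtractor.apply(frame)            # 255 = foreground, 127 = shadow, 0 = background
    # fg_mask would then be passed on to the blob extraction / classification stages
cap.release()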


