
Spatial and temporal localization of objects and actions in videos using text and video analysis
JHU CLSP Summer Workshop 2010
6/21/2010


The Team
• Senior Members
– C. Fermueller (UMD), J. Kosecka (GMU), J. Neumann (Comcast), E. Tzoukermann (Comcast), R. Vidal (JHU)
– Affiliated members: Y. Aloimonos (UMD) and G. Hager (JHU)
• Graduate Students
– R. Chaudhry (JHU), Y. Li (UMD), B. Sapp (UPenn), G. Singh (GMU), X. Yu (UMD)
– Affiliated: D. Summer-Stay (UMD), C. L. Teo (UMD)
• Undergraduates
– F. Ferraro (U of Rochester), I. Perera (UPenn), R. He (Hong Kong Polytech Univ)


Human action analysis: Motivation
• Huge amount of video is available and growing
• Human actions are major events in movies, TV news, personal video …
Action recognition useful for:
• Content-based browsing
  e.g. fast-forward to the next goal-scoring scene
• Video indexing and search
  e.g. find “Bush shaking hands with Putin”
• Robotics
  e.g. help a robot to recognize an action when observing it


What are human actions?
Definition 1:
• Physical body motion
  [Niebles et al. ’06, Shechtman & Irani ’05, Dollar et al. ’05, Schuldt et al. ’04, Efros et al. ’03, Zelnik-Manor & Irani ’01, Yacoob & Black ’98, Polana & Nelson ’97, Bobick & Wilson ’95, …]
  [Figure: KTH action dataset]
Definition 2:
• Interaction with the environment for a specific purpose
  Same physical motion, different actions depending on the context


Challenges in action recognition
• Similar problems to static object recognition: variations in views, lighting, background, appearance, …
• Additional problems: variations in individual motion; camera motion
Example: Drinking vs. Smoking
[Figure labels: Difference in shape; Difference in motion]
Both actions are similar in overall shape (human posture) and motion (hand motion)
Data variation for actions might be higher than for objects
But: Contextual constraints between objects and actions provide additional discriminative cues


Example
• Cooking show – cutting and frying cabbage
• Running commentary describing the objects and actions
• Example screenshots
• Output: detect objects (cabbage) and actions (cutting, frying) mentioned and objects predicted (knife, pan)


Example Crafts Domain


Overview of Approach
• Task: Given a list of concrete nouns and action verbs, tell me when and where they occur in the video
• Problem: Video analysis allows us to localize objects and actions both spatially and temporally, but general object/action detection is very hard
• Idea: Language contains many cues regarding the semantic relationship between objects and actions
• Goal: Use semantic and relational information suggested by text analysis to improve detection of objects/actions of interest within a video


High-level tasks
• Use natural language information to identify semantically meaningful objects and actions in a video and their spatio-temporal relationships
• Use computer vision to detect the suggested objects and actions in the video
• Use machine learning to combine the classifier outputs with contextual constraints between objects and actions implied by the text
– focus is on actions involving interactions between objects
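The third task above, combining classifier outputs with text-implied constraints, can be illustrated with a minimal sketch. It is not the workshop's method: the multiplicative scoring rule, the function name `rank_pairs`, the `floor` weight for pairs the text never mentions, and all numeric scores are assumptions chosen for illustration only.

```python
from itertools import product


def rank_pairs(object_scores, action_scores, text_prior, floor=1e-3):
    """Rank (action, object) pairs by detector scores weighted by a text prior.

    All inputs are hypothetical: detector confidences in [0, 1] and a
    dictionary of compatibility weights derived from text analysis.
    Pairs missing from the prior get the small `floor` weight.
    """
    ranked = []
    for action, obj in product(action_scores, object_scores):
        prior = text_prior.get((action, obj), floor)
        score = action_scores[action] * object_scores[obj] * prior
        ranked.append(((action, obj), score))
    # Highest combined score first.
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Made-up detector confidences for one video segment.
object_scores = {"cabbage": 0.6, "pan": 0.5}
action_scores = {"cutting": 0.55, "frying": 0.45}
# Made-up compatibility weights, e.g. from verb-object co-occurrence in text.
text_prior = {
    ("cutting", "cabbage"): 0.7,
    ("frying", "pan"): 0.8,
    ("frying", "cabbage"): 0.4,
}

ranked = rank_pairs(object_scores, action_scores, text_prior)
best_pair, best_score = ranked[0]
```

With these made-up numbers the pair the text never mentions ("cutting", "pan") falls to the bottom of the ranking even though both detectors fire, which is the intended effect of the contextual constraint.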


NLP Tools
• Part-of-speech tagger or phrase chunker
• Dependency parser for Verb-Object relations
– We have tuples of Verb, Object, Instrument, Location
– Ex: Stir (v) chili (o) with a wooden spoon (instr) in a pot (loc)
• Collocations for Instrument and Location
– Co-occurrence from Google
– Ex: “place a wooden spoon across the pot to keep it from boiling”
• And more
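The Verb, Object, Instrument, Location tuple above can be sketched as a small data structure. The slides use a dependency parser; the regular expression below is only a stand-in that handles simple imperatives of the form "verb object [with instrument] [in location]". The names `ActionTuple` and `parse_instruction` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActionTuple:
    verb: str
    obj: str
    instrument: Optional[str] = None
    location: Optional[str] = None


# Naive pattern, NOT a dependency parser:
# "<verb> <object> [with <instrument>] [in <location>]"
PATTERN = re.compile(
    r"^(?P<verb>\w+)\s+(?P<obj>.+?)"
    r"(?:\s+with\s+(?P<instrument>.+?))?"
    r"(?:\s+in\s+(?P<location>.+?))?$",
    re.IGNORECASE,
)
ARTICLES = re.compile(r"^(?:a|an|the)\s+", re.IGNORECASE)


def strip_article(phrase):
    """Drop a leading article; pass None through for missing slots."""
    return ARTICLES.sub("", phrase.strip()) if phrase else None


def parse_instruction(sentence):
    """Return an ActionTuple for a simple imperative sentence, else None."""
    match = PATTERN.match(sentence.strip().rstrip("."))
    if match is None:
        return None
    return ActionTuple(
        verb=match["verb"].lower(),
        obj=strip_article(match["obj"]),
        instrument=strip_article(match["instrument"]),
        location=strip_article(match["location"]),
    )


example = parse_instruction("Stir chili with a wooden spoon in a pot")
```

A real pipeline would fill the same four slots from parser output; the pattern here breaks on anything beyond this one sentence shape.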


Bag of words in object recognition


Bag of words in action recognition


Motion descriptors


Evolution of bins
Chaudhry & Vidal, CVPR 2009


Human Pose Estimation
• Goal: Determine 2D locations of anatomical parts from single images
[Figure labels: input; output; tractable inference]


Approach: Pictorial Structures
[Figure: part model with nodes H, URA, LRA, T, ULA, LLA]
Note: A similar model is possible for hands if we have a close-up view
Sapp et al., CVPR 2010


List of Technical Tasks
• NLP
– Extract the relevant sentence parts from the text
– Relate the different entities in the text
– Find related nouns or verbs via ontologies
• Computer Vision
– Pre-process the videos using shot boundary estimation and camera view clustering
– Design context descriptors that help us to narrow down when we should look for what object
– Detect and track the position of the persons and their body parts (e.g. faces, hands and arms) in the video
– Detect and track the position of the objects of interest in the video
– Detect and localize the action of interest given the objects and the human body part trajectories
• Machine Learning
– Learn object and action representations by modeling the relationships between body parts, objects and motion patterns
– Infer spatial and temporal location of objects and actions in the video


Data sets
• Cooking domain
1. * CMU Kitchen dataset
   • multi-camera + mocap
   • 5 recipes, 10 individuals each
2. URADL: U. of Rochester Activities of Daily Living
   • 12 activities, 5 individuals, 3 recordings each
3. PBS Kids: Sprout – 5 shows
4. DVDs: Cook like a chef, Martha’s Favorite Family Dinners, Joanne Weir’s cooking class
• Craft domain
1. * PBS Kids: Sprout – 31 shows


Sprout - Alphabet book
Action Verb | Freq | Direct Object | Instrument | Human Interaction | Location
To Thread   | 1    | Thread        | Hand       | Both Hands        | Construction Paper
To Tie      | 1    | Thread        | Hand       | Both Hands        | Construction Paper
To Write    | 1    | Ink           | Pen        | Both Hands        | Paper
To Decorate | 2    | Ink           | Pen        | Both Hands        | Paper
To Color    | 2    | Ink           | Pen        | Both Hands        | Paper
To Draw     | 1    | Ink           | Pen        | Both Hands        | Paper


Nr | Action    | Objects                          | Human Interaction                  | Begin Time | End Time | Duration
1  | Washing   | Sink, Soap                       | Washing Hands                      | 00:38.2    | 00:40.6  | 00:02.4
2  | Drying    | Hand Towel                       | Drying Hands                       | 00:40.6    | 00:44.4  | 00:03.7
3  | Filling   | Sink, Pot                        | Hands fill pot with water          | 00:45.3    | 00:47.2  | 00:01.9
4  | Pouring   | Bowl, Broth, Pot                 | Child pours broth from bowl to pot | 00:48.2    | 00:51.4  | 00:03.2
5  | Firing    | Stove, Pot                       | Hand turns on the burner           | 00:54.1    | 00:57.1  | 00:03.0
6  | Cutting   | Red Pepper, Knife, Cutting Board | Adult Male cuts red pepper         | 00:58.1    | 01:00.0  | 00:01.9
7  | Deseeding | Red Pepper, scoop                | Adult and child deseed red pepper  | 01:03.0    | 01:03.9  | 00:00.8
8  | Placing   | Pot, Spoon, Red Pepper           | Adult places red pepper in pot     | 01:09.7    | 01:12.2  | 00:02.5
9  | Adding    | Bowl of Rice, Pot                | Adult adds rice to pot             | 01:14.2    | 01:17.7  | 00:03.4
10 | Opening   | Can Opener, Can                  | Hands open a can                   | 01:20.2    | 01:23.3  | 00:03.0
11 | Tearing   | Parsley, Measuring cup           | Child tears off parsley leaves     | 01:24.2    | 01:27.4  | 00:03.2
12 | Adding    | Can, Pot                         | Hand adds can of veggies to pot    | 01:32.0    | 01:35.0  | 00:03.0
13 | Adding    | Measuring cup, Pot               | Child adds parsley to pot          | 01:35.6    | 01:38.2  | 00:03.0
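For working with annotations like these, a small helper can convert the "MM:SS.s" timestamps to seconds and derive segment lengths. This is a sketch under the assumption that every timestamp follows that exact format; the function names are hypothetical. Durations recomputed from the rounded begin and end times may differ from the listed Duration column, which was presumably produced from unrounded values.

```python
def to_seconds(stamp):
    """Convert an 'MM:SS.s' timestamp (assumed format) to seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + float(seconds)


def duration(begin, end):
    """Segment length in seconds, rounded to one decimal place."""
    return round(to_seconds(end) - to_seconds(begin), 1)


# Row 1 of the annotation table: Washing, 00:38.2 to 00:40.6.
washing_seconds = duration("00:38.2", "00:40.6")
```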


CMU Kitchen Set - Verbs


Martha Stewart – 191 action verbs
Verb                              | Count | Verb       | Count
to pour                           | 33    | to spoon   | 4
to add                            | 20    | to measure | 4
to stir                           | 17    | to glaze   | 3
to slice                          | 17    | to garnish | 2
to cut                            | 11    | to spread  | 2
to place                          | 11    | to cover   | 2
to mix                            | 6     | to tie     | 2
to remove                         | 6     | to scrape  | 2
to rub                            | 6     | to dry     | 1
to turn                           | 6     | to beat    | 1
to deglaze                        | 6     | to broil   | 1
to serve                          | 5     | to sear    | 1
to whisk                          | 5     | to wrap    | 1
to top                            | 4     | to grate   | 1
to process (in a food processor)  | 4     | to bake    | 1


Action Recognition and Complexity
Input
1. transcripts and closed captions
2. text transcripts alone
3. list of ingredients and utensils with(out) instructions
Evaluation can follow these levels


Evaluation
• Baseline
– Working system for object and action recognition and human detection in isolation
– Annotated ground-truth data sets
• Evaluation Criteria
– identification and localization accuracy of object and action labels given as input
• Research questions:
– How does knowing the relation between object locations + motion help compared to only motion-based or appearance-based information alone?
– How well can we extract the relevant relations from text?
– Effect of language as a prior on which objects to look for?
– Can changes in object appearance over time indicate the presence of an action?
– (If time allows) Effect of learning object classifiers on demand by utilizing labeled internet databases?
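The slides do not say how localization accuracy is measured. One common choice for temporal localization is intersection-over-union between a predicted and an annotated time interval; the sketch below assumes that metric, and the function name `temporal_iou` and the example intervals are hypothetical.

```python
def temporal_iou(pred, truth):
    """Intersection-over-union of two (begin, end) intervals in seconds.

    Assumed metric for illustration; the source does not specify one.
    """
    overlap = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - overlap
    return overlap / union if union > 0 else 0.0


# Hypothetical prediction overlapping half of a 4-second annotation.
score = temporal_iou((0.0, 4.0), (2.0, 6.0))
```

A detection could then be counted as correct when its label matches and the overlap exceeds a chosen threshold, which is also an assumption rather than something the slides state.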


Expected outcomes and benefits for the research community
• Novel insights into
– how to leverage NLP to improve visual scene understanding
– action recognition for human actions defined by interactions with the environment
• Software pipeline to annotate a video with semantic information extracted from a text
• A publicly available data set of annotated videos with realistic and rich action-object interactions
– PBS Sprout: 31 craft shows with 8 to 11 individual actions each


QUESTIONS?
