Machine Learning for Computer Vision
Lecture 1: Introduction to Classification

1 October 2012
MVA – ENS Cachan

Iasonas Kokkinos
Iasonas.kokkinos@ecp.fr
Center for Computational Vision / Galen Group
Ecole Centrale Paris / INRIA-Saclay
Lecture outline

Introduction to the class
Introduction to the problem of classification
Linear classifiers
Image-based features
Class objectives

• Treatment of a broad range of learning techniques.
• Hands-on experience through computer vision applications.
• By the end: you should be able to understand and implement a paper lying at the interface of vision and learning.
Who will need this class?

[Chart: submission/acceptance statistics from CVPR 2010, by topic: faces, recognition, segmentation, learning.]
Boundary detection problem

Object/surface boundaries
How can we detect boundaries?

Filtering approaches
Canny (1986), Morrone and Owens (1987), Perona and Malik (1991), ...

Scale-space approaches
T. Lindeberg, "Edge Detection and Ridge Detection with Automatic Scale Selection", IJCV 30(2): 117-156 (1998)

Variational approaches
V. Caselles, R. Kimmel, G. Sapiro, "Geodesic Active Contours", IJCV 22(1): 61-79 (1997)
K. Siddiqi, Y. Lauzière, A. Tannenbaum, S. Zucker, "Area and Length Minimizing Flows for Shape Segmentation", IEEE TIP 7(3): 433-443 (1998)

Statistical approaches
A. Desolneux, L. Moisan, J.-M. Morel, "Meaningful Alignments", IJCV 40(1): 7-23 (2000)
Learning-based approaches

Boundary or non-boundary?
Use human-annotated segmentations.
Use any visual cue as input to the decision function.
Use decision trees / logistic regression / boosting / ... and learn to combine the individual inputs.

S. Konishi, A. Yuille, J. Coughlan, S. C. Zhu, "Statistical Edge Detection: Learning and Evaluating Edge Cues", IEEE PAMI, 2003
D. Martin, C. Fowlkes, J. Malik, "Learning to Detect Natural Image Boundaries Using Local Brightness, Color and Texture Cues", IEEE PAMI, 2004
Progress during the last 4 decades

[Plot: precision-recall curves for boundary detection on the Berkeley benchmark's 100 test images, comparing human performance, learning-based detectors ('04, '08), the best methods up to ~1990, and a 1965 detector.]

Reference: Maire, Arbelaez, et al., IEEE PAMI 2011
Learning and vision problem II: face detection

• How do digital cameras detect faces?
• Input to a digital camera: intensity at pixel locations
`Faceness function': classifier

[Diagram: face and background samples in feature space, separated by a decision boundary.]
Sliding window approaches

• Scan a window over the image
  – Multiple scales
  – Multiple orientations
• Classify each window as either:
  – Face
  – Non-face

[Diagram: window → classifier → face / non-face.]

Slide credit: B. Leibe
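To make the scanning loop concrete, here is a minimal Python/NumPy sketch of single-scale sliding-window detection. The `score_window` function and the threshold are hypothetical placeholders for a trained face/non-face classifier; a real detector would also scan over scales and orientations and suppress overlapping detections.

```python
import numpy as np

def sliding_window_detect(image, score_window, win=(24, 24), stride=4, thresh=0.0):
    """Scan a window over a grayscale image; keep windows scoring above thresh.

    score_window: any function mapping a win-sized patch to a real-valued
    'faceness' score (placeholder for a trained classifier).
    """
    H, W = image.shape
    h, w = win
    detections = []
    for y in range(0, H - h + 1, stride):
        for x in range(0, W - w + 1, stride):
            patch = image[y:y + h, x:x + w]
            s = score_window(patch)
            if s >= thresh:                      # classify: face vs non-face
                detections.append((x, y, s))
    return detections

# Usage with a dummy scorer (mean intensity), just to show the interface:
img = np.random.rand(100, 100)
dets = sliding_window_detect(img, score_window=lambda p: p.mean() - 0.5)
print(len(dets), "windows above threshold")
```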
Two main approaches

• Discriminative: map the input directly to a class decision.
• Generative (model-based): fit class models that explain the input.
Discriminative techniques

• Lectures 1-4:
  – Linear and logistic regression
  – Adaboost, decision trees, random forests
  – Support vector machines
• Unified treatment as loss-based learning, in terms of the margin z = y f(x):

  Ideal misclassification cost: H(-z) (counts training errors; H is the Heaviside step)
  Exponential loss: exp(-z) (Adaboost)
  Cross-entropy loss: ln(1 + exp(-z)) (logistic regression)
  Hinge loss: max(0, 1 - z) (SVMs)
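As a quick illustration (my own, not from the slides), the following sketch evaluates each of these losses as a function of the margin z = y f(x); plotting them shows how the convex surrogates penalize margin violations, while the 0/1 loss merely counts errors.

```python
import numpy as np

z = np.linspace(-2.0, 2.0, 401)          # margin values z = y * f(x)

losses = {
    "misclassification": (z < 0).astype(float),   # H(-z): 1 if wrong, 0 if right
    "exponential (Adaboost)": np.exp(-z),
    "cross-entropy (logistic)": np.log1p(np.exp(-z)),
    "hinge (SVM)": np.maximum(0.0, 1.0 - z),
}

for name, values in losses.items():
    print(f"{name:26s} loss at z = -1: {values[z.searchsorted(-1.0)]:.3f}")
```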
Generative techniques, structured models

• Lectures 5-7:
  – Hidden variables, EM, component analysis
  – Structured models (HMMs, deformable part models)
  – Latent SVM / multiple instance learning
• Efficient object detection algorithms (branch & bound)
From part 1 to part 2
Administrative details

• 2 lab exercises (10 points)
  – Start with small preparatory exercises (synthetic data)
  – Evaluation: real image data
• 1 project (10 points)
  – Finish the object detection system of Labs 1 & 2,
  – or implement a recent ICCV/CVPR/ECCV/NIPS/ICML paper,
  – or work on a small-scale research project (20/20)
• Tutorials, slides & complementary handouts:
  http://www.mas.ecp.fr/vision/Personnel/iasonas/teaching.html
• Today: I need a list with everyone's email!
Lecture outline

Introduction to the class
Introduction to the problem of classification
Linear classifiers
Image-based features
Classification problem

• Based on our experience, should we give a loan to this customer?
  – Binary decision: yes/no

[Diagram: customers as points in feature space, with a decision boundary separating the two outcomes.]
Learning problem formulation

Given: a training set of feature-label pairs {(x_i, y_i)}, i = 1, ..., N.
Wanted: a `simple' function f(x; w) that `works well' on the training set.

Why `simple'? Good generalization outside the training set.
`Works well': quantified by a loss criterion.
Classifier function

• Input-output mapping: y = f(x; w)
  – Output: y
  – Input: x
  – Method: f
  – Parameters: w
• Aspects of the learning problem
  – Identify methods that fit the problem setting
  – Determine parameters that properly classify the training set
  – Measure and control the `complexity' of these functions

Slide credit: B. Leibe / B. Schiele
Loss criterion

Compare the desired outputs y_i against the classifier's responses f(x_i; w).

• Observations
  – Euclidean distance is not so good for classification
  – Maybe we should weigh positives more?
• The loss should quantify the probability of error, while keeping the learning problem tractable (e.g. leading to convex objectives).

Slide credit: B. Leibe / B. Schiele
Lecture outline

Introduction to the class
Introduction to the problem of classification
Linear classifiers
  Linear regression and least squares
  Regularization: ridge regression
  Bias-variance decomposition
  Logistic regression
Linear regression

Classifier: a mapping from features x to labels y.
Linear regression: a linear mapping, f(x; w) = w^T x.
A binary decision can be obtained by thresholding: y = sign(f(x; w)).
Linear classifiers

• Find a linear expression (hyperplane) to separate positive and negative examples:

  positive: x_i · w + b ≥ 0
  negative: x_i · w + b < 0

Each data point x_i has a class label y_i ∈ {+1, -1}.

[Plot: labeled points in the plane spanned by feature coordinates i and j, separated by a hyperplane.]
Loss function for linear regression

Training: given {(x_i, y_i)}, estimate the optimal w.
Loss function: quantify the appropriateness of w as a sum of individual errors (`additive'), each quadratic:

  L(w) = Σ_i (y_i - w^T x_i)²

Why this loss function? Easy to optimize!
Least squares solution for linear regression

Loss function: L(w) = Σ_i (y_i - w^T x_i)².
Introduce vectors and matrices to rewrite it as a quadratic expression:

  L(w) = ||y - Xw||², with residual r = y - Xw,

where the rows of X are the training inputs x_i^T. Setting the gradient to zero gives the closed-form minimizer (when X^T X is invertible):

  w* = (X^T X)^{-1} X^T y
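A minimal NumPy sketch of this closed-form solution on synthetic data; the two Gaussian blobs, the appended bias feature, and the sign thresholding are illustrative assumptions, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-class data: Gaussian blobs with labels +1 / -1.
X_pos = rng.normal(loc=+1.0, scale=0.7, size=(50, 2))
X_neg = rng.normal(loc=-1.0, scale=0.7, size=(50, 2))
X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(50), -np.ones(50)])

# Append a constant feature so the bias b is absorbed into w.
Xb = np.hstack([X, np.ones((X.shape[0], 1))])

# Closed-form least-squares solution w* = (X^T X)^{-1} X^T y.
# (lstsq is numerically preferable to forming the inverse explicitly.)
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)

# Binary decision by thresholding the linear response at zero.
y_hat = np.sign(Xb @ w)
print("training accuracy:", (y_hat == y).mean())
```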
Questions

Is the loss function appropriate?
  Quadratic loss: convex cost, closed-form solution.
  But does the optimized quantity indicate the classifier's performance?

Is the classifier appropriate?
  Linear classifier: fast computation.
  But could e.g. a non-linear classifier perform better?

Are the estimated parameters good?
  The parameters recover the input-output mapping on the training data.
  How can we know they do not simply memorize the training set?
Inappropriateness of the quadratic penalty

We chose the quadratic cost function for convenience:
single, global minimum & closed-form expression.

But does it indicate classification performance?

[Plot: the linear fit's computed decision boundary versus the desired decision boundary; correctly classified points far from the boundary pull the fit away.]

The quadratic norm penalizes outputs that are `too good'.
Logistic regression, SVMs, Adaboost: more appropriate losses.
Classes may not be linearly separable

A linear classifier cannot properly separate such data.

[Plot: points x_t with labels y_t ∈ {+1, -1} in feature coordinates i and j, arranged so that no single hyperplane separates the two classes.]
Beyond linear boundaries

Non-linear features give non-linear classifiers & decision boundaries: replace x by a feature map φ(x) and train a linear classifier on φ(x), as sketched below.

How do we pick the right features?
  This class: domain knowledge.
  Next classes: kernel trick (SVMs), greedy selection (boosting).
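As an illustration of this idea (my own example, not from the slides): a quadratic feature map φ lets the least-squares classifier from above separate radially arranged data that no hyperplane can.

```python
import numpy as np

rng = np.random.default_rng(1)

# Radial two-class data: inner blob (+1) surrounded by a ring (-1).
inner = rng.normal(0.0, 0.4, size=(100, 2))
theta = rng.uniform(0, 2 * np.pi, 100)
ring = np.c_[2.0 * np.cos(theta), 2.0 * np.sin(theta)] + rng.normal(0, 0.2, (100, 2))
X = np.vstack([inner, ring])
y = np.concatenate([np.ones(100), -np.ones(100)])

def phi(X):
    """Quadratic feature map: (x1, x2) -> (x1, x2, x1^2, x2^2, x1*x2, 1)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.c_[x1, x2, x1**2, x2**2, x1 * x2, np.ones_like(x1)]

# Linear least squares in the lifted space = non-linear boundary in x.
w, *_ = np.linalg.lstsq(phi(X), y, rcond=None)
print("training accuracy:", (np.sign(phi(X) @ w) == y).mean())
```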
Lecture outline

Introduction to the class
Introduction to the problem of classification
Linear classifiers
  Linear regression and least squares
  Regularization: ridge regression
  Bias-variance decomposition
Image-based features
Overfitting problem

Learning problem: 100 faces, 1000 background images.
Image resolution: 100 x 100 pixels (10,000 intensity values per image).

Linear regression with more unknowns than equations: an ill-posed problem.
  Perfect performance on the training set,
  unpredictable performance on new data.
  X^T X is a rank-deficient matrix.

`Curse of dimensionality': in high-dimensional spaces, data become sparse.
L2 regularization: ridge regression

Penalize the classifier's L2 norm ||w||²:

  L(w) = ||y - Xw||² + λ ||w||²
         (data term)    (complexity term)

Solution: w* = (X^T X + λI)^{-1} X^T y, where X^T X + λI is always a full-rank matrix.

So how do we set λ?
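A minimal sketch of the ridge solution, on random data chosen (purely for illustration) to be deliberately underdetermined: even with more feature dimensions than examples, X^T X + λI is invertible.

```python
import numpy as np

rng = np.random.default_rng(2)

# Deliberately underdetermined: 60 examples, 200 feature dimensions.
X = rng.normal(size=(60, 200))
y = rng.choice([-1.0, 1.0], size=60)

lam = 10.0
d = X.shape[1]

# Ridge solution: w = (X^T X + lam * I)^{-1} X^T y.
# X^T X alone is rank-deficient (rank <= 60 < 200); adding lam*I fixes that.
w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

print("rank of X^T X:", np.linalg.matrix_rank(X.T @ X))   # 60
print("||w||:", np.linalg.norm(w))
```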
Tuning the model's complexity

A flexible model approximates the target function well on the training set,
but can overtrain and perform poorly on the test set.

A rigid model's performance is more predictable on the test set,
but the model may not be good even on the training set.
Selecting λ with cross-validation

• Cross-validation technique
  – Exclude part of the training data from parameter estimation
  – Use the held-out part only to predict the test error
• 10-fold cross-validation: split the training data into 10 folds; in turn, train on 9 folds and validate on the remaining one.
• Run cross-validation for different values of λ
  – pick the value that minimizes the cross-validation error, as in the sketch below.
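A minimal sketch of selecting λ by k-fold cross-validation for ridge regression; the grid of λ values and the synthetic data are my own illustrative choices.

```python
import numpy as np

def ridge_fit(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def cv_error(X, y, lam, k=10, seed=0):
    """Mean validation misclassification rate of ridge regression over k folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        val = folds[i]
        trn = np.concatenate([folds[j] for j in range(k) if j != i])
        w = ridge_fit(X[trn], y[trn], lam)
        errs.append((np.sign(X[val] @ w) != y[val]).mean())
    return np.mean(errs)

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 50))
y = np.sign(X @ rng.normal(size=50) + rng.normal(scale=2.0, size=200))

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
best = min(lambdas, key=lambda lam: cv_error(X, y, lam))
print("selected lambda:", best)
```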
Lecture outline

Introduction to the class
Introduction to the problem of classification
Linear classifiers
Image-based features
Domain knowledge

We may know that the data undergo transformations that are irrelevant to their class:

E-mail addresses: capital letters (Iasonas@gmail.com = iasonas@gmail.com)
Speech recognition: voice amplitude is irrelevant to the uttered words
Computer vision: illumination variations

Invariant features: not affected by irrelevant signal transformations.
Photometry-invariant patch features

Photometric transformation: I → a I + b

• Make each patch have zero mean: subtract the patch mean (removes b).
• Then make it have unit variance: divide by the patch standard deviation (removes a).
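A sketch of this normalization in NumPy; the small epsilon guarding against flat patches is my own addition.

```python
import numpy as np

def normalize_patch(patch, eps=1e-8):
    """Map I -> (I - mean) / std, cancelling affine photometric changes a*I + b."""
    p = patch.astype(float)
    p = p - p.mean()                 # zero mean: removes the additive offset b
    return p / (p.std() + eps)       # unit variance: removes the gain a

patch = np.random.rand(16, 16)
a, b = 3.0, 20.0
# Same descriptor before and after the photometric transformation:
print(np.allclose(normalize_patch(patch), normalize_patch(a * patch + b)))
```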
Dealing with texture

What kind of features can appropriately describe texture patterns?
`Appropriately': in terms of well-behaved functions.

Gabor wavelets: sinusoidal carriers modulated by Gaussian envelopes.

[Figure: Gabor wavelets at increasing frequencies/scales.]
Envelope estimation (demodulation)

Convolve the signal with a quadrature pair of Gabor filters (a complex Gabor); the magnitude of the complex response estimates the local amplitude envelope.
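A 1-D sketch of this demodulation idea (my own illustration, under the quadrature-pair reading above): build a complex Gabor filter, convolve, and read the envelope off the magnitude of the response.

```python
import numpy as np

sigma, freq = 8.0, 0.1
tg = np.arange(-4 * sigma, 4 * sigma + 1)
g = np.exp(-tg**2 / (2 * sigma**2)) * np.exp(2j * np.pi * freq * tg)

# AM signal: slowly varying envelope a(t) times a cosine carrier at `freq`.
t = np.arange(1000)
a = 1.0 + 0.5 * np.sin(2 * np.pi * t / 500.0)
x = a * np.cos(2 * np.pi * freq * t)

# For a cosine input, the complex Gabor keeps (approximately) only the
# positive-frequency component, so |response| ~ a(t) * gain / 2.
resp = np.convolve(x, g, mode="same")
gain = np.exp(-tg**2 / (2 * sigma**2)).sum()   # filter gain at the carrier frequency
a_est = 2.0 * np.abs(resp) / gain

err = np.abs(a_est[100:-100] - a[100:-100]).mean()   # ignore border effects
print(f"mean envelope error: {err:.3f}")
```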
Multiband demodulation with a Gabor filterbank

Reference: Havlicek & Bovik, IEEE TIP '00
Dealing with changes in scale and orientation

Scale-invariant blob detector
Scale-Invariant Feature Transform (SIFT) descriptor

Use the location and characteristic scale given by the blob detector.
Estimate the dominant orientation from an orientation histogram over [0, 2π).

Break the patch into 4x4 location blocks.
Compute an 8-bin orientation histogram per block.
8 x 4 x 4 = 128-D descriptor.
Normalize to unit norm.

Invariance to: scale, orientation, multiplicative & additive intensity changes.
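A simplified sketch of the descriptor computation; it assumes a 16x16 patch already extracted at the detected location, scale, and orientation, and it omits full SIFT details such as Gaussian weighting, trilinear interpolation, and clipped renormalization.

```python
import numpy as np

def sift_like_descriptor(patch):
    """128-D descriptor from a 16x16 patch: 4x4 blocks x 8 orientation bins."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ori = np.arctan2(gy, gx) % (2 * np.pi)          # orientations in [0, 2*pi)
    bins = (ori / (2 * np.pi) * 8).astype(int) % 8  # 8 orientation bins

    desc = np.zeros((4, 4, 8))
    for by in range(4):
        for bx in range(4):
            sl = np.s_[4 * by:4 * by + 4, 4 * bx:4 * bx + 4]
            # Magnitude-weighted orientation histogram for this 4x4 block.
            desc[by, bx] = np.bincount(bins[sl].ravel(),
                                       weights=mag[sl].ravel(), minlength=8)
    desc = desc.ravel()
    return desc / (np.linalg.norm(desc) + 1e-8)     # normalize to unit norm

patch = np.random.rand(16, 16)
print(sift_like_descriptor(patch).shape)            # (128,)
```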
Histogram of Oriented Gradients (HOG) descriptor

• Dalal and Triggs, CVPR 2005
  – Like the SIFT descriptor, but for arbitrary box aspect ratios, and computed over all image locations and scales
  – Highly accurate detection using a linear classifier
Haar features for face detection

Haar features: value = Σ (pixels in white area) - Σ (pixels in black area)

Haar features are chosen by boosting.
Main advantage: rapid computation using integral images (4 operations per box), as sketched below.
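A sketch of the integral-image trick (my own illustration): after one cumulative-sum pass, any box sum costs 4 lookups, so a Haar feature costs a handful of operations regardless of box size.

```python
import numpy as np

def integral_image(img):
    """Zero-padded cumulative sum: ii[y, x] = sum of img[:y, :x]."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(0).cumsum(1)
    return ii

def box_sum(ii, y0, x0, y1, x1):
    """Sum of img[y0:y1, x0:x1] in 4 lookups / 3 additions."""
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]

img = np.random.rand(24, 24)
ii = integral_image(img)

# Two-rectangle Haar feature: white top half minus black bottom half.
white = box_sum(ii, 0, 0, 12, 24)
black = box_sum(ii, 12, 0, 24, 24)
print(np.isclose(white - black, img[:12].sum() - img[12:].sum()))  # True
```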
Lecture summary

Introduction to the class
Introduction to the problem of classification
Linear classifiers
Image-based features
Linear regression (recap)

Least squares: w* = (X^T X)^{-1} X^T y
Ridge regression: w* = (X^T X + λI)^{-1} X^T y
Tuning λ: cross-validation
Tuning the model's complexity (recap)

What is a good tradeoff between accuracy and complexity?

A flexible model approximates the target function well on the training set,
but can be fooled by noise and overtrain.
A rigid model is more robust,
but will not always provide a good fit.
Gabor, SIFT, HOG, Haar, ...

These features encapsulate domain knowledge about the desired invariances. They differ in:
  computational efficiency
  degree of invariance
  task-specific performance
  analytical tractability
  ...
Appendix I: What is the right amount of flexibility?

[Figure: model performance as a function of flexibility.]

Slide credit: Hastie & Tibshirani, The Elements of Statistical Learning, Springer 2001
Bias-Variance I

Assume an underlying function: y = f(x) + ε.
Our model approximates it by f̂(x).

Approximation quality is affected by the model's flexibility and by the training set:
different training set realizations yield different models,
so the model's value at a point x₀ is a random variable.

Express the expected generalization error of the model at x₀:
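The slide's equations were not preserved in this copy; the standard decomposition (following Hastie & Tibshirani's notation), for squared error at a fixed point x₀ with noise variance Var(ε) = σ², is:

```latex
\begin{aligned}
\mathrm{Err}(x_0) &= \mathbb{E}\!\left[(y - \hat{f}(x_0))^2 \mid x = x_0\right] \\
&= \sigma^2
 + \underbrace{\left(\mathbb{E}[\hat{f}(x_0)] - f(x_0)\right)^2}_{\text{bias}^2}
 + \underbrace{\mathbb{E}\!\left[\left(\hat{f}(x_0) - \mathbb{E}[\hat{f}(x_0)]\right)^2\right]}_{\text{variance}}
\end{aligned}
```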
Appendix II: Ridge regression = parameter shrinkage

Reference: Hastie & Tibshirani, The Elements of Statistical Learning, Springer 2001

Least squares parameter estimation: minimization of ||y - Xw||².
SVD-based interpretation of least squares

Singular value decomposition (SVD) of X: X = U D V^T, with orthonormal U, V and diagonal D (singular values d_i).

The least-squares fit reconstructs y on the subspace spanned by X's columns:

  X w* = U U^T y = Σ_i u_i (u_i^T y)
SVD-based interpretation of ridge regression

• Minimization of ||y - Xw||² + λ||w||²
• Regularization: penalty on large values of w
• Solution: w* = (X^T X + λI)^{-1} X^T y
• SVD interpretation: X w* = Σ_i u_i [d_i² / (d_i² + λ)] (u_i^T y)
• `Shrinkage': each coordinate u_i^T y is shrunk by the factor d_i² / (d_i² + λ); directions with small singular values d_i are shrunk the most.
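A quick numerical check of this shrinkage identity (my own sketch):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 5))
y = rng.normal(size=30)
lam = 2.0

# Ridge fit and its SVD-based expression.
w = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
U, d, Vt = np.linalg.svd(X, full_matrices=False)

shrink = d**2 / (d**2 + lam)                 # per-direction shrinkage factors
fit_svd = U @ (shrink * (U.T @ y))           # sum_i u_i * shrink_i * (u_i^T y)
print(np.allclose(X @ w, fit_svd))           # True
```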
Feature space interpretation of ridge regression

• Covariance matrix (centered data): S = X^T X / N
• v_i: eigenvectors of the covariance matrix (the principal axes of the data)
• d_i²: the corresponding eigenvalues (up to the 1/N factor)
• Shrinkage: downplay coefficients corresponding to the smaller principal axes

[Figure: projections of the data on the principal axes, the corresponding eigenvalues, and the resulting shrinkage factors.]
Lasso

• Minimization of ||y - Xw||² + λ Σ_j |w_j|
• Regularization: penalty on the sum of absolute values of w
• Comparison with ridge regression:
  – The gradient of the L1 penalty has constant magnitude, independent of the value of w_j (unlike the L2 penalty, whose gradient shrinks with w_j), so small coefficients are pushed all the way to zero.
  – Sparsity & subset selection.
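For intuition (my addition, standard textbook material): with an orthonormal design, X^T X = I, the lasso solution soft-thresholds the least-squares coefficients ŵ_j, driving small ones exactly to zero, whereas ridge merely rescales them:

```latex
\hat{w}^{\text{lasso}}_j \;=\; \operatorname{sign}(\hat{w}_j)\,\left(|\hat{w}_j| - \tfrac{\lambda}{2}\right)_{+},
\qquad
\hat{w}^{\text{ridge}}_j \;=\; \frac{\hat{w}_j}{1+\lambda}.
```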