
Inductive Bias 1 - Neural Networks and Machine Learning Lab


CS 478 - Inductive Bias


Noise vs. Exceptions


The hypothesis space H is the set of all possible models h which can be learned by the current learning algorithm
– e.g. the set of possible weight settings for a perceptron (see the sketch below)

Restricted hypothesis space
– Can be easier to search
– May avoid overfit since the models are usually simpler (e.g. linear or low-order decision surface)
– Often will underfit

Unrestricted hypothesis space
– Can represent any possible function and thus can fit the training set well
– Mechanisms must be used to avoid overfit
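To make "set of possible weight settings" concrete, here is a minimal Python sketch (my own illustration, not from the slides): each setting of (w, b) is one hypothesis h in the perceptron's restricted, linear hypothesis space H.

```python
import numpy as np

# Each weight setting (w, b) defines one hypothesis h from the linear space H.
def perceptron_hypothesis(w, b):
    """Return the model h defined by one particular weight setting."""
    return lambda x: int(np.dot(w, x) + b > 0)

# Two different hypotheses from the same space H; they can disagree on an input.
h1 = perceptron_hypothesis(np.array([1.0, -1.0]), 0.0)
h2 = perceptron_hypothesis(np.array([-1.0, 1.0]), 0.0)

x = np.array([1.0, 0.5])
print(h1(x), h2(x))  # e.g. 1 0
```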


Simplest accurate model: an accuracy vs. complexity tradeoff. Find the h ∈ H which minimizes an objective function of the form:

F(h) = Error(h) + λ·Complexity(h)

– Complexity could be the number of nodes, size of tree, magnitude of weights, order of the decision surface, etc. (see the sketch below)

More training data (vs. overtraining on the same data)

Stopping criteria with any constructive model (accuracy increase vs. statistical significance)
– Noise vs. Exceptions

Validation set (next slide; requires a separate test set)

Will discuss other approaches later
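As a hedged illustration of this objective (my own sketch; the slide does not prescribe a particular Error or Complexity measure), one could compare candidate hypotheses like this, using sum-squared error and weight magnitude:

```python
import numpy as np

# Sketch of F(h) = Error(h) + lambda * Complexity(h), assuming sum-squared error
# for Error(h) and the magnitude of the weights as one possible Complexity(h).
def objective(w, X, y, lam):
    error = np.sum((y - X @ w) ** 2)     # Error(h) on the training data
    complexity = np.sum(w ** 2)          # Complexity(h): magnitude of weights
    return error + lam * complexity      # F(h)

# Among candidate hypotheses (weight vectors), prefer the one minimizing F(h).
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([1.0, 0.0, 1.0])
candidates = [np.array([0.0, 1.0]), np.array([2.0, 3.0])]
best = min(candidates, key=lambda w: objective(w, X, y, lam=0.1))
print(best)  # the simpler, accurate hypothesis wins here
```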


[Figure: SSE vs. epochs (a new h at each epoch), with curves for the training set and the validation set]

There is a different model h after each epoch
Select a model in the area where the validation set accuracy flattens (see the sketch below)
The validation set comes out of the training set data
Still need a separate test set, used after selecting model h, to predict future accuracy
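A minimal sketch of this selection procedure (the training and evaluation helpers named below are hypothetical placeholders, not functions from the course):

```python
import copy

# Keep the hypothesis h from the epoch with the lowest validation SSE and stop
# once it has not improved for a while ("the validation curve flattens").
# train_one_epoch() and validation_sse() are hypothetical placeholders.
def select_by_validation(model, train_data, val_data, max_epochs=200, patience=10):
    best_model, best_sse, stale = copy.deepcopy(model), float("inf"), 0
    for _ in range(max_epochs):
        model = train_one_epoch(model, train_data)  # a new hypothesis h each epoch
        sse = validation_sse(model, val_data)       # SSE on the held-out validation set
        if sse < best_sse:
            best_model, best_sse, stale = copy.deepcopy(model), sse, 0
        else:
            stale += 1
            if stale >= patience:                   # validation error has flattened
                break
    return best_model  # still report final accuracy on a separate test set
```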


The inductive bias is the approach used to decide how to generalize to novel cases
One common approach is Occam's Razor – the simplest hypothesis which explains/fits the data is usually the best
Many other rational biases

ABC ⇒ Z
AB C ⇒ Z
ABC ⇒ Z
AB C ⇒ Z
A B C ⇒ Z
A BC ⇒ ?

When you get the new input Ā B C, what is your output?


Inductive Bias: any basis for choosing one generalization over another, other than strict consistency with the observed training instances

Sometimes just called the bias of the algorithm (don't confuse with the bias weight in a neural network).


Restricted hypothesis space – can just try to minimize error since the hypotheses are already simple
– Linear or low-order threshold function
– k-DNF, k-CNF, etc.
– Low-order polynomial

Preference bias – prefer one hypothesis over another even though they have similar training accuracy
– Occam's Razor
– "Smallest" DNF representation which matches well
– Shallow decision tree with high information gain
– Neural network with low validation error and small magnitude weights


There are 2^(2^n) Boolean functions of n inputs

x1 x2 x3  Class  Possible Consistent Function Hypotheses
0  0  0     1
0  0  1     1
0  1  0     1
0  1  1     1
1  0  0
1  0  1
1  1  0
1  1  1     ?


There are 2^(2^n) Boolean functions of n inputs

x1 x2 x3  Class  Possible Consistent Function Hypotheses
0  0  0     1    1
0  0  1     1    1
0  1  0     1    1
0  1  1     1    1
1  0  0          0
1  0  1          0
1  1  0          0
1  1  1     ?    0


There are 2^(2^n) Boolean functions of n inputs

x1 x2 x3  Class  Possible Consistent Function Hypotheses
0  0  0     1    1 1
0  0  1     1    1 1
0  1  0     1    1 1
0  1  1     1    1 1
1  0  0          0 0
1  0  1          0 0
1  1  0          0 0
1  1  1     ?    0 1


There are 2^(2^n) Boolean functions of n inputs

x1 x2 x3  Class  Possible Consistent Function Hypotheses
0  0  0     1    1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
0  0  1     1    1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
0  1  0     1    1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
0  1  1     1    1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1  0  0          0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
1  0  1          0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
1  1  0          0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
1  1  1     ?    0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1


There are 2^(2^n) Boolean functions of n inputs

x1 x2 x3  Class  Possible Consistent Function Hypotheses
0  0  0     1    1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
0  0  1     1    1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
0  1  0     1    1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
0  1  1     1    1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1  0  0          0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
1  0  1          0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
1  1  0          0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
1  1  1     ?    0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1

What happens in this case if we use simplicity (Occam's Razor) as our inductive bias?
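The counting behind these tables can be reproduced with a short sketch (my own illustration, assuming only the first four rows are training data, as in the table above):

```python
from itertools import product

# With n = 3 inputs there are 2**(2**3) = 256 Boolean functions; only 16 agree
# with the four training rows, and those 16 split evenly on the unseen query (1, 1, 1).
inputs = list(product([0, 1], repeat=3))                    # the 8 possible rows
training = {(0, 0, 0): 1, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 1}

all_functions = [dict(zip(inputs, outputs))
                 for outputs in product([0, 1], repeat=len(inputs))]
consistent = [f for f in all_functions
              if all(f[x] == y for x, y in training.items())]

print(len(all_functions))                     # 256
print(len(consistent))                        # 16
print(sum(f[(1, 1, 1)] for f in consistent))  # 8 of the 16 predict class 1
```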


The "Raster Screen" Problem

Pattern Theory
– Regularity in a task
– Compressibility (see the sketch below)

Don't care features and impossible states

Interesting/Learnable Problems
– What we actually deal with
– Can we formally characterize them?

Learning a training set vs. generalizing
– A function where each output is set randomly (coin-flip)
– Output class is independent of all other instances in the data set

Computability vs. Learnability (Optional)
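The compressibility point can be illustrated with a small sketch (an analogy of mine, not from the slides): data with regularity compresses well, while a random coin-flip labeling has no pattern to exploit.

```python
import os
import zlib

# A regular pattern compresses to a fraction of its size; random bytes do not.
regular = b"01" * 500            # 1000 bytes with an obvious pattern
random_bytes = os.urandom(1000)  # 1000 bytes with no structure

print(len(zlib.compress(regular)))       # far fewer than 1000 bytes
print(len(zlib.compress(random_bytes)))  # close to (or slightly above) 1000 bytes
```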


Finite problems assume a finite number of mappings (a finite table)
– Fixed input size arithmetic
– Random memory in a RAM

Learnable: can do better than random on novel examples


Finite problems assume a finite number of mappings (a finite table)
– Fixed input size arithmetic
– Random memory in a RAM

Learnable: can do better than random on novel examples

[Figure: Venn diagram – Finite Problems (all are computable), with Learnable Problems (those with regularity) as a subset]


Infinite number of mappings (an infinite table)
– Arbitrary input size arithmetic
– Halting Problem (no limit on input size)
– Do two arbitrary strings match?


Infinite number of mappings (an infinite table)
– Arbitrary input size arithmetic
– Halting Problem (no limit on input size)
– Do two arbitrary strings match?

[Figure: Venn diagram of infinite problems – Learnable Problems: a reasonably queried infinite subset has regularity; Computable Problems: only those where all but a finite set of mappings have regularity]


Any inductive bias chosen will have equal accuracy compared to any other bias over all possible functions/tasks, assuming all functions are equally likely. If a bias is correct on some cases, it must be incorrect on equally many cases (a small sketch follows below).

Is this a problem?
– Random vs. Regular
– Anti-Bias? (even though regular)
– The "Interesting" Problems – a subset of the learnable ones?

Are all functions equally likely in the real world?
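A small sketch of why this holds under the equally-likely assumption (my own illustration for n = 3, not from the slides):

```python
from itertools import product

# Averaged over all 2**(2**3) = 256 Boolean functions of 3 inputs, any fixed
# prediction rule -- whatever bias produced it -- is right exactly half the time.
inputs = list(product([0, 1], repeat=3))
all_functions = [dict(zip(inputs, outs)) for outs in product([0, 1], repeat=8)]

def biased_rule(x):
    return 1 if sum(x) <= 1 else 0   # an arbitrary "simplicity" bias

hits = sum(f[x] == biased_rule(x) for f in all_functions for x in inputs)
print(hits / (len(all_functions) * len(inputs)))  # 0.5 exactly
```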


Interesting Problems and Biases

[Figure: nested sets – All Problems ⊃ Structured Problems ⊃ Interesting Problems, with several different Inductive Biases each covering part of the interesting problems]


Inductive bias requires some set of prior assumptions about the tasks being considered and the learning approaches available

Mitchell's definition: the inductive bias of a learner is the set of additional assumptions sufficient to justify its inductive inferences as deductive inferences

We consider standard ML algorithms/hypothesis spaces to be different inductive biases: C4.5 (greedy best attributes), Backpropagation (simple to complex), etc.


There is not one bias that is best on all problems

Our experiments
– Over 50 real-world problems
– Over 400 inductive biases – mostly variations on critical-variable biases vs. similarity biases

Different biases were a better fit for different problems

Given a data set, which learning model (inductive bias) should be chosen?


Defining and characterizing the set of interesting/learnable problems
To what extent do current biases cover the set of interesting problems?
Automatic feature selection
Automatic selection of bias (before and/or during learning), including all learning parameters
Dynamic inductive biases (in time and space)
Combinations of biases – Ensembles, Oracle Learning


Can be discovered as you learn
May want to learn general rules first, followed by true exceptions
Can be based on ease of learning the problem
Example: SoftProp – from Lazy Learning to Backprop




[Figure: just a data set, or just an explanation of the problem, plus its input features go into an Automated Learner, which outputs a hypothesis]


Proposing new learning algorithms (inductive biases)

Theoretical issues
– Defining the set of interesting/learnable problems
– Analytical/empirical studies of differences between biases

Ensembles – Wagging, Mimicking, Oracle Learning, etc.

Meta-Learning – a priori decision regarding which learning model to use
– Features of the data set/application
– Learning from model experience

Automatic selection of parameters
– Constructive algorithms – ASOCS, DMPx, etc.
– Learning parameters – windowed momentum, automatic improved distance functions (IVDM)

Automatic bias in time – SoftProp

Automatic bias in space – overfitting, sensitivity to complex portions of the space: DMP, higher-order features
