CS 478 - Perceptrons 1


CS 478 - Perceptrons 2


CS 478 - Perceptrons 3


! First neural network learning model in the 1960's
! Simple and limited (single layer models)
! Basic concepts are similar for multi-layer models, so this is a good learning tool
! Still used in many current applications (modems, etc.)

CS 478 - Perceptrons 4


[Figure: a perceptron node with inputs x_1 ... x_n, weights w_1 ... w_n, threshold θ, and output z]

z = 1 if Σ_{i=1}^{n} x_i w_i ≥ θ
z = 0 if Σ_{i=1}^{n} x_i w_i < θ

CS 478 - Perceptrons 5
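
As a minimal illustration (not from the slides), the thresholded output above can be written directly in Python; the names perceptron_output, inputs, and threshold are just placeholders:

def perceptron_output(inputs, weights, threshold):
    """Return 1 if the weighted sum of the inputs reaches the threshold, else 0."""
    net = sum(x * w for x, w in zip(inputs, weights))
    return 1 if net >= threshold else 0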


[Figure: a two-input perceptron with weights w_1 = .4 and w_2 = -.2, threshold θ = .1, and output z]

Training set:
x_1   x_2   t
.8    .3    1
.4    .1    0

z = 1 if Σ_{i=1}^{n} x_i w_i ≥ θ
z = 0 if Σ_{i=1}^{n} x_i w_i < θ

CS 478 - Perceptrons 7


[Figure: first pattern applied – inputs .8 and .3, weights .4 and -.2, threshold θ = .1, output z = 1]

net = .8*.4 + .3*(-.2) = .26

Training set:
x_1   x_2   t
.8    .3    1
.4    .1    0

z = 1 if Σ_{i=1}^{n} x_i w_i ≥ θ
z = 0 if Σ_{i=1}^{n} x_i w_i < θ

CS 478 - Perceptrons 8


[Figure: second pattern applied – inputs .4 and .1, weights .4 and -.2, threshold θ = .1, output z = 1]

net = .4*.4 + .1*(-.2) = .14

Training set:
x_1   x_2   t
.8    .3    1
.4    .1    0

z = 1 if Σ_{i=1}^{n} x_i w_i ≥ θ
z = 0 if Σ_{i=1}^{n} x_i w_i < θ

Δw_i = (t - z) * c * x_i

CS 478 - Perceptrons 9


Δw_ij = c (t_j – z_j) x_i

! Where w_ij is the weight from node i to node j, c is the learning rate, t_j is the target of node j for the current instance, z_j is the current output of node j, and x_i is the output of node i.
! Least perturbation principle
– Only change weights if there is an error
– Use a small c rather than changing the weights enough to make the current pattern correct
– Scale the update by x_i
! Create a perceptron node with n inputs
! Iteratively apply a pattern from the training set and apply the perceptron rule
! Each iteration through the training set is an epoch
! Continue training until the total training set error ceases to improve
! Perceptron Convergence Theorem: guaranteed to find a solution in finite time if a solution exists

CS 478 - Perceptrons 10
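
A rough sketch of this training procedure in Python, assuming a single output node, patterns that already include a bias input of 1, and a fixed number of epochs as a simple stand-in for the "stop when error ceases to improve" test (the function name is an assumption, not from the slides):

def train_perceptron(patterns, targets, c=1.0, epochs=10, weights=None):
    """Perceptron rule: w_i += c * (t - z) * x_i after each pattern."""
    n = len(patterns[0])
    w = list(weights) if weights is not None else [0.0] * n
    for _ in range(epochs):
        for x, t in zip(patterns, targets):
            net = sum(xi * wi for xi, wi in zip(x, w))
            z = 1 if net > 0 else 0          # threshold folded into the bias weight
            for i in range(n):
                w[i] += c * (t - z) * x[i]   # no change when t == z
    return w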


CS 478 - Perceptrons 11


1 0 1 -> 0
1 0 0 -> 1

Augmented version:
1 0 1 1 -> 0
1 0 0 1 -> 1

! Treat the threshold like any other weight. No special case. Call it a bias since it biases the output up or down.
! Since we start with random weights anyway, we can ignore the -θ notion and just think of the bias as an extra available weight.
! Always use a bias weight.

CS 478 - Perceptrons 12
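
For instance (an assumed helper, not part of the slides), augmenting each pattern with a constant 1 lets the weight learned on that input play the role of -θ:

def augment_with_bias(patterns):
    """Append a constant 1 to every pattern so the bias is learned like any other weight."""
    return [list(x) + [1] for x in patterns]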


! Assume a 3 input perceptron plus bias (it outputs 1 if net > 0, else 0)
! Assume a learning rate c of 1 and initial weights all 0: Δw_ij = c(t_j – z_j) x_i
! Training set:
0 0 1 -> 0
1 1 1 -> 1
1 0 1 -> 1
0 1 1 -> 0

Each pattern below is augmented with the bias input of 1 as its fourth value.

Pattern     Target   Weight Vector   Net   Output   ΔW
0 0 1 1     0        0 0 0 0         0     0        0 0 0 0
1 1 1 1     1        0 0 0 0         0     0        1 1 1 1
1 0 1 1     1        1 1 1 1         3     1        0 0 0 0
0 1 1 1     0        1 1 1 1         3     1        0 -1 -1 -1
0 0 1 1     0        1 0 0 0         0     0        0 0 0 0
1 1 1 1     1        1 0 0 0         1     1        0 0 0 0
1 0 1 1     1        1 0 0 0         1     1        0 0 0 0
0 1 1 1     0        1 0 0 0         0     0        0 0 0 0

CS 478 - Perceptrons 13-18
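
Running the train_perceptron sketch from earlier on this training set (with the bias input included in each pattern) reproduces the trace: after the second epoch the weights settle at 1 0 0 0, which classifies all four patterns correctly. This is only an illustrative check, not code from the slides:

patterns = [[0, 0, 1, 1], [1, 1, 1, 1], [1, 0, 1, 1], [0, 1, 1, 1]]
targets = [0, 1, 1, 0]
w = train_perceptron(patterns, targets, c=1.0, epochs=2)
print(w)   # expected: [1.0, 0.0, 0.0, 0.0]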


! Assume a probability of error at each bit
! 0 0 1 0 1 1 0 0 1 1 0 -> 0 1 1 0
! i.e. P(error) = .05
! Or a probability that the algorithm is applied wrong (opposite) occasionally
! Averages out over learning

CS 478 - Perceptrons 19


Linear Separability

[Figure: 2-d case (two inputs) – a plane with axes X_1 and X_2, one class of points (•) separated from the other class (0) by a line]

W1X1 + W2X2 > θ   (Z = 1)
W1X1 + W2X2 < θ   (Z = 0)

So, what is the decision boundary?

W1X1 + W2X2 = θ
X2 + W1X1/W2 = θ/W2
X2 = (-W1/W2)X1 + θ/W2
Y = MX + B

If there is no bias then the hyperplane must go through the origin.

CS 478 - Perceptrons 20
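
As a small illustration of the algebra above (not from the slides; the function name is an assumption), the slope and intercept of the 2-d decision boundary follow directly from the weights and threshold:

def decision_boundary(w1, w2, theta):
    """Return (slope, intercept) of X2 = (-W1/W2)*X1 + theta/W2; assumes w2 != 0."""
    return -w1 / w2, theta / w2

slope, intercept = decision_boundary(w1=0.4, w2=-0.2, theta=0.1)
print(slope, intercept)   # 2.0 -0.5 for the example weights used earlier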


CS 478 - Perceptrons 21


When is data noise vs. a legitimate exception?

CS 478 - Perceptrons 22


CS 478 - Perceptrons 23


! This is an issue with any learning model which only supports binary classification (perceptron, SVM, etc.)
! One-vs-rest: create 1 perceptron for each output class, where the training set considers all other classes to be negative examples
– Run all perceptrons on novel data and set the output to the class of the perceptron which outputs high
– If there is a tie, choose the perceptron with the highest net value
! One-vs-one: create 1 perceptron for each pair of output classes, where the training set only contains examples from the 2 classes
– Run all perceptrons on novel data and set the output to be the class with the most wins (votes) from the perceptrons
– In case of a tie, use the net values to decide
– Number of models grows with the square of the number of output classes

CS 478 - Perceptrons 24
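
A minimal sketch of the one-vs-rest decision rule described above, assuming per-class weight vectors have already been trained (the helper name and data layout are hypothetical):

def one_vs_rest_predict(x, class_weights):
    """class_weights: dict mapping class label -> weight vector (bias folded in).
    Pick a class whose perceptron outputs high; break ties by the largest net value."""
    nets = {label: sum(xi * wi for xi, wi in zip(x, w)) for label, w in class_weights.items()}
    firing = [label for label, net in nets.items() if net > 0]
    candidates = firing if firing else nets.keys()
    return max(candidates, key=lambda label: nets[label])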


! How do we judge the quality of a particular model (e.g. a perceptron with a particular setting of weights)?
! Consider how accurate the model is on the data set
– Classification accuracy = # correct / total instances
– Classification error = # misclassified / total instances (= 1 – accuracy)
! For real valued outputs and/or targets
– Pattern error = target – output
! Errors could cancel each other
! Common approach is squared error = Σ(t_i – z_i)^2
– Total sum squared error = Σ pattern errors = Σ Σ (t_i – z_i)^2
! For nominal data, pattern error is typically 1 for a mismatch and 0 for a match
– For nominal (including binary) outputs and targets, SSE and classification error are equivalent

CS 478 - Perceptrons 25


! Mean Squared Error (MSE) – SSE/n, where n is the number of instances in the data set
– This can be nice because it normalizes the error for data sets of different sizes
– MSE is the average squared error per pattern
! Root Mean Squared Error (RMSE) – the square root of the MSE
– This puts the error value back into the same units as the features and can thus be more intuitive
– RMSE is the average distance (error) of the targets from the outputs, in the same scale as the features

CS 478 - Perceptrons 26
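
A short sketch of these error measures in Python (the function name is illustrative, not from the slides):

import math

def error_metrics(targets, outputs):
    """Return (SSE, MSE, RMSE) for paired target/output values."""
    sse = sum((t - z) ** 2 for t, z in zip(targets, outputs))
    mse = sse / len(targets)
    return sse, mse, math.sqrt(mse)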


[Figure: Error Landscape – SSE (sum squared error, Σ(t_i – z_i)^2) plotted against the weight values]

CS 478 - Perceptrons 27


! Goal is to decrease the overall error (or other objective function) each time a weight is changed
! Total sum squared error is one possible objective function E: Σ(t_i – z_i)^2
! Seek a weight-changing algorithm such that ∂E/∂w_ij is negative
! If such a formula can be found then we have a gradient descent learning algorithm
! The delta rule is a variant of the perceptron rule which gives a gradient descent learning algorithm

CS 478 - Perceptrons 28


! The delta rule uses (target - net), before the net value goes through the threshold, in the learning rule to decide the weight update

Δw_i = η Σ_{d ∈ D} (t_d − net_d) x_id

! Weights are updated even when the output would be correct
! Because this model is single layer, and because of the SSE objective function, the error surface is guaranteed to be parabolic with only one minimum
! Learning rate
– If the learning rate is too large, it can jump around the global minimum
– If too small, it will work, but will take a longer time
– Can decrease the learning rate over time to give higher speed and still attain the global minimum (although the exact minimum is still just for the training set and thus…)

CS 478 - Perceptrons 29
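
A rough sketch of the update as written on the slide, i.e. the batch form that sums (t − net) x over the whole training set before changing the weights (function and variable names are assumptions, not from the slides):

def delta_rule_epoch(patterns, targets, weights, eta=0.1):
    """One batch delta-rule epoch: w_i += eta * sum_d (t_d - net_d) * x_id."""
    deltas = [0.0] * len(weights)
    for x, t in zip(patterns, targets):
        net = sum(xi * wi for xi, wi in zip(x, weights))   # unthresholded output
        for i, xi in enumerate(x):
            deltas[i] += eta * (t - net) * xi
    return [w + dw for w, dw in zip(weights, deltas)]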


! To get the true gradient with the delta rule, we need to sum the errors over the entire training set and only update the weights at the end of each epoch
! Batch (gradient) vs stochastic (on-line, incremental)
– With the stochastic delta rule algorithm, you update after every pattern, just like with the perceptron algorithm (even though that means each change may not be exactly along the true gradient)
– Stochastic is more efficient and best to use in almost all cases, though not all have figured it out yet
! Why is stochastic better? (Save for later)
– Top of the hill syndrome
– Speed – still not understood by many – talk about later
– Other parameters usually make it not the true gradient anyway
– True gradient descent only in the limit of a 0 learning rate
– Only a minimum for the training set; is it the exact minimum for the real task?

CS 478 - Perceptrons 30
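
For contrast with the batch sketch above, a stochastic (on-line) variant simply moves the weight change inside the pattern loop; again this is only an assumed illustration:

def stochastic_delta_rule_epoch(patterns, targets, weights, eta=0.1):
    """Stochastic delta rule: update the weights after every single pattern."""
    w = list(weights)
    for x, t in zip(patterns, targets):
        net = sum(xi * wi for xi, wi in zip(x, w))
        for i, xi in enumerate(x):
            w[i] += eta * (t - net) * xi
    return w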


! Perceptron rule (target - thresholded output): guaranteed to converge to a separating hyperplane if the problem is linearly separable. Otherwise it may not converge – it could get stuck in a cycle.
! Single layer delta rule: guaranteed to have only one global minimum. Thus it will converge to the best SSE solution whether the problem is linearly separable or not.
– Could have a higher misclassification rate than with the perceptron rule, and a less intuitive decision surface – we will discuss this with regression
! Stopping criteria – for these models, stop when no longer making progress
– i.e. when you have gone a few epochs with no significant improvement/change between epochs

CS 478 - Perceptrons 31
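
One way to express that stopping rule in code – purely an illustrative sketch, with assumed step() and sse() callables and arbitrary tolerance/patience values:

def train_until_no_progress(step, sse, patience=3, tol=1e-4, max_epochs=1000):
    """Run step() (one training epoch) until SSE fails to improve for `patience` epochs."""
    best, stale = float("inf"), 0
    for _ in range(max_epochs):
        step()
        err = sse()
        if err < best - tol:
            best, stale = err, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best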


Exclusive Or

[Figure: the XOR patterns plotted on the X_1–X_2 plane, with output 1 at (0,1) and (1,0) and output 0 at (0,0) and (1,1)]

Is there a dividing hyperplane?

CS 478 - Perceptrons 32


! d = # of dimensions
! P = 2^d = # of patterns
! 2^P = 2^(2^d) = # of functions

d   Total Functions   Linearly Separable Functions
0   2                 2
1   4                 4
2   16                14
3   256               104
4   65536             1882
5   4.3 × 10^9        94572
6   1.8 × 10^19       1.5 × 10^7
7   3.4 × 10^38       8.4 × 10^9

CS 478 - Perceptrons 33-36


Linearly Separable Functions

LS(P, d) = 2 Σ_{i=0}^{d} (P-1)! / ((P-1-i)! i!)   for P > d
         = 2^P                                    for P ≤ d

(All patterns for d = P, i.e. all 8 ways of dividing 3 vertices of a cube for d = P = 3)

Where P is the # of patterns for training and d is the # of inputs

lim_{d → ∞} (# of LS functions) = ∞

CS 478 - Perceptrons 37
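
A small sketch that evaluates the LS(P, d) formula above (this counts linearly separable dichotomies of P points in general position; the hypercube-vertex counts in the earlier table, such as 104 for d = 3, can be smaller because the 2^d corner patterns are not in general position). The function name is an assumption:

from math import comb

def ls_functions(P, d):
    """LS(P, d) = 2 * sum_{i=0}^{d} C(P-1, i) for P > d, else 2^P."""
    if P <= d:
        return 2 ** P
    return 2 * sum(comb(P - 1, i) for i in range(d + 1))

print(ls_functions(4, 2))   # 14, matching the d = 2 entry in the table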


! So far we have used f(x, w) = sign(Σ_{i=1}^{n} w_i x_i)
! We could preprocess the inputs in a non-linear way and use f(x, w) = sign(Σ_{i=1}^{m} w_i φ_i(x))
! To the perceptron algorithm it looks just the same and it can use the same learning algorithm; it just has different inputs
! For example, for a problem with two inputs x and y (plus the bias), we could also add the inputs x^2, y^2, and x·y
! The perceptron would just think it is a 5 dimensional task, and it is linear in those 5 dimensions
– But what kind of decision surfaces would it allow for the 2-d input space?

CS 478 - Perceptrons 38
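
A hedged sketch of that preprocessing step: expand each 2-d input into the five features named above (plus the bias) and hand the result to the same perceptron training code; the function name is an assumption:

def quadric_features(x, y):
    """Map a 2-d input to [x, y, x^2, y^2, x*y, 1]; the trailing 1 is the bias input."""
    return [x, y, x * x, y * y, x * y, 1]

# The expanded patterns can be fed to the same perceptron rule as before, e.g.:
# weights = train_perceptron([quadric_features(x, y) for x, y in raw_points], targets)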


! All quadratic surfaces (2nd order)
– ellipsoid
– parabola
– etc.
! That significantly increases the number of problems that can be solved, but there are still many problems which are not quadrically separable
! Could go to 3rd and higher order features, but the number of possible features grows exponentially
! Multi-layer neural networks will allow us to discover high-order features automatically from the input space

CS 478 - Perceptrons 39


[Figure: data points from two classes plotted along a single feature f_1 over the range -3 … 3]

! A perceptron with just feature f_1 cannot separate the data
! Could we add a transformed feature to our perceptron?
! f_2 = f_1^2

[Figure: the same data plotted in the f_1–f_2 plane, with f_2 = f_1^2 on the vertical axis]

! Note: we could also think of this as just using feature f_1 but now allowing a quadric surface to separate the data

CS 478 - Perceptrons 40-42
