CS 478 - Perceptrons 1


CS 478 - Perceptrons 2


CS 478 - Perceptrons 3


! First neural network learning model in the 1960's
! Simple and limited (single layer models)
! Basic concepts are similar for multi-layer models, so this is a good learning tool
! Still used in many current applications (modems, etc.)

CS 478 - Perceptrons 4


[Figure: a perceptron node with inputs x_1 ... x_n, weights w_1 ... w_n, threshold θ, and output z]

z = 1 if Σ_{i=1}^{n} x_i w_i ≥ θ
z = 0 if Σ_{i=1}^{n} x_i w_i < θ

CS 478 - Perceptrons 5
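
As a minimal illustration (not from the slides), the thresholded output above can be written directly in Python; the names perceptron_output, inputs, and threshold are just placeholders:

def perceptron_output(inputs, weights, threshold):
    """Return 1 if the weighted sum of the inputs reaches the threshold, else 0."""
    net = sum(x * w for x, w in zip(inputs, weights))
    return 1 if net >= threshold else 0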


[Figure: a two-input perceptron with weights w_1 = .4 and w_2 = -.2, threshold θ = .1, and output z]

Training set:
x_1   x_2   t
.8    .3    1
.4    .1    0

z = 1 if Σ_{i=1}^{n} x_i w_i ≥ θ
z = 0 if Σ_{i=1}^{n} x_i w_i < θ

CS 478 - Perceptrons 7


[Figure: first pattern applied – inputs .8 and .3, weights .4 and -.2, threshold θ = .1, output z = 1]

net = .8*.4 + .3*(-.2) = .26

Training set:
x_1   x_2   t
.8    .3    1
.4    .1    0

z = 1 if Σ_{i=1}^{n} x_i w_i ≥ θ
z = 0 if Σ_{i=1}^{n} x_i w_i < θ

CS 478 - Perceptrons 8


[Figure: second pattern applied – inputs .4 and .1, weights .4 and -.2, threshold θ = .1, output z = 1]

net = .4*.4 + .1*(-.2) = .14

Training set:
x_1   x_2   t
.8    .3    1
.4    .1    0

z = 1 if Σ_{i=1}^{n} x_i w_i ≥ θ
z = 0 if Σ_{i=1}^{n} x_i w_i < θ

Δw_i = (t - z) * c * x_i

CS 478 - Perceptrons 9


Δw_ij = c (t_j – z_j) x_i

! Where w_ij is the weight from node i to node j, c is the learning rate, t_j is the target of node j for the current instance, z_j is the current output of node j, and x_i is the output of node i.
! Least perturbation principle
– Only change weights if there is an error
– Use a small c rather than changing the weights enough to make the current pattern correct
– Scale the update by x_i
! Create a perceptron node with n inputs
! Iteratively apply a pattern from the training set and apply the perceptron rule
! Each iteration through the training set is an epoch
! Continue training until the total training set error ceases to improve
! Perceptron Convergence Theorem: guaranteed to find a solution in finite time if a solution exists

CS 478 - Perceptrons 10
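
A rough sketch of this training procedure in Python, assuming a single output node, patterns that already include a bias input of 1, and a fixed number of epochs as a simple stand-in for the "stop when error ceases to improve" test (the function name is an assumption, not from the slides):

def train_perceptron(patterns, targets, c=1.0, epochs=10, weights=None):
    """Perceptron rule: w_i += c * (t - z) * x_i after each pattern."""
    n = len(patterns[0])
    w = list(weights) if weights is not None else [0.0] * n
    for _ in range(epochs):
        for x, t in zip(patterns, targets):
            net = sum(xi * wi for xi, wi in zip(x, w))
            z = 1 if net > 0 else 0          # threshold folded into the bias weight
            for i in range(n):
                w[i] += c * (t - z) * x[i]   # no change when t == z
    return w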


CS 478 - Perceptrons 11


1 0 1 -> 0
1 0 0 -> 1

Augmented version:
1 0 1 1 -> 0
1 0 0 1 -> 1

! Treat the threshold like any other weight. No special case. Call it a bias since it biases the output up or down.
! Since we start with random weights anyway, we can ignore the -θ notion and just think of the bias as an extra available weight.
! Always use a bias weight.

CS 478 - Perceptrons 12
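
For instance (an assumed helper, not part of the slides), augmenting each pattern with a constant 1 lets the weight learned on that input play the role of -θ:

def augment_with_bias(patterns):
    """Append a constant 1 to every pattern so the bias is learned like any other weight."""
    return [list(x) + [1] for x in patterns]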


! Assume a 3 input perceptron plus bias (it outputs 1 if net > 0, else 0)
! Assume a learning rate c of 1 and initial weights all 0: Δw_ij = c(t_j – z_j) x_i
! Training set:
0 0 1 -> 0
1 1 1 -> 1
1 0 1 -> 1
0 1 1 -> 0

Each pattern below is augmented with the bias input of 1 as its fourth value.

Pattern     Target   Weight Vector   Net   Output   ΔW
0 0 1 1     0        0 0 0 0         0     0        0 0 0 0
1 1 1 1     1        0 0 0 0         0     0        1 1 1 1
1 0 1 1     1        1 1 1 1         3     1        0 0 0 0
0 1 1 1     0        1 1 1 1         3     1        0 -1 -1 -1
0 0 1 1     0        1 0 0 0         0     0        0 0 0 0
1 1 1 1     1        1 0 0 0         1     1        0 0 0 0
1 0 1 1     1        1 0 0 0         1     1        0 0 0 0
0 1 1 1     0        1 0 0 0         0     0        0 0 0 0

CS 478 - Perceptrons 13-18
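
Running the train_perceptron sketch from earlier on this training set (with the bias input included in each pattern) reproduces the trace: after the second epoch the weights settle at 1 0 0 0, which classifies all four patterns correctly. This is only an illustrative check, not code from the slides:

patterns = [[0, 0, 1, 1], [1, 1, 1, 1], [1, 0, 1, 1], [0, 1, 1, 1]]
targets = [0, 1, 1, 0]
w = train_perceptron(patterns, targets, c=1.0, epochs=2)
print(w)   # expected: [1.0, 0.0, 0.0, 0.0]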


! Assume a probability of error at each bit
! 0 0 1 0 1 1 0 0 1 1 0 -> 0 1 1 0
! i.e. P(error) = .05
! Or a probability that the algorithm is applied wrong (opposite) occasionally
! Averages out over learning

CS 478 - Perceptrons 19


Linear Separability

[Figure: 2-d case (two inputs) – a plane with axes X_1 and X_2, one class of points (•) separated from the other class (0) by a line]

W1X1 + W2X2 > θ   (Z = 1)
W1X1 + W2X2 < θ   (Z = 0)

So, what is the decision boundary?

W1X1 + W2X2 = θ
X2 + W1X1/W2 = θ/W2
X2 = (-W1/W2)X1 + θ/W2
Y = MX + B

If there is no bias then the hyperplane must go through the origin.

CS 478 - Perceptrons 20
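
As a small illustration of the algebra above (not from the slides; the function name is an assumption), the slope and intercept of the 2-d decision boundary follow directly from the weights and threshold:

def decision_boundary(w1, w2, theta):
    """Return (slope, intercept) of X2 = (-W1/W2)*X1 + theta/W2; assumes w2 != 0."""
    return -w1 / w2, theta / w2

slope, intercept = decision_boundary(w1=0.4, w2=-0.2, theta=0.1)
print(slope, intercept)   # 2.0 -0.5 for the example weights used earlier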


CS 478 - Perceptrons 21


When is data noise vs. a legitimate exception?

CS 478 - Perceptrons 22


CS 478 - Perceptrons 23


! This is an issue with any learning model which only supports binary classification (perceptron, SVM, etc.)
! One-vs-rest: create 1 perceptron for each output class, where the training set considers all other classes to be negative examples
– Run all perceptrons on novel data and set the output to the class of the perceptron which outputs high
– If there is a tie, choose the perceptron with the highest net value
! One-vs-one: create 1 perceptron for each pair of output classes, where the training set only contains examples from the 2 classes
– Run all perceptrons on novel data and set the output to be the class with the most wins (votes) from the perceptrons
– In case of a tie, use the net values to decide
– Number of models grows with the square of the number of output classes

CS 478 - Perceptrons 24
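
A minimal sketch of the one-vs-rest decision rule described above, assuming per-class weight vectors have already been trained (the helper name and data layout are hypothetical):

def one_vs_rest_predict(x, class_weights):
    """class_weights: dict mapping class label -> weight vector (bias folded in).
    Pick a class whose perceptron outputs high; break ties by the largest net value."""
    nets = {label: sum(xi * wi for xi, wi in zip(x, w)) for label, w in class_weights.items()}
    firing = [label for label, net in nets.items() if net > 0]
    candidates = firing if firing else nets.keys()
    return max(candidates, key=lambda label: nets[label])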


! How do we judge the quality of a particular model (e.g. a perceptron with a particular setting of weights)?
! Consider how accurate the model is on the data set
– Classification accuracy = # correct / total instances
– Classification error = # misclassified / total instances (= 1 – accuracy)
! For real valued outputs and/or targets
– Pattern error = target – output
! Errors could cancel each other
! Common approach is squared error = Σ(t_i – z_i)^2
– Total sum squared error = Σ pattern errors = Σ Σ (t_i – z_i)^2
! For nominal data, pattern error is typically 1 for a mismatch and 0 for a match
– For nominal (including binary) outputs and targets, SSE and classification error are equivalent

CS 478 - Perceptrons 25


! Mean Squared Error (MSE) – SSE/n, where n is the number of instances in the data set
– This can be nice because it normalizes the error for data sets of different sizes
– MSE is the average squared error per pattern
! Root Mean Squared Error (RMSE) – the square root of the MSE
– This puts the error value back into the same units as the features and can thus be more intuitive
– RMSE is the average distance (error) of the targets from the outputs, in the same scale as the features

CS 478 - Perceptrons 26
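
A short sketch of these error measures in Python (the function name is illustrative, not from the slides):

import math

def error_metrics(targets, outputs):
    """Return (SSE, MSE, RMSE) for paired target/output values."""
    sse = sum((t - z) ** 2 for t, z in zip(targets, outputs))
    mse = sse / len(targets)
    return sse, mse, math.sqrt(mse)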


[Figure: Error Landscape – SSE (sum squared error, Σ(t_i – z_i)^2) plotted against the weight values]

CS 478 - Perceptrons 27


! Goal is to decrease the overall error (or other objective function) each time a weight is changed
! Total sum squared error is one possible objective function E: Σ(t_i – z_i)^2
! Seek a weight-changing algorithm such that ∂E/∂w_ij is negative
! If such a formula can be found then we have a gradient descent learning algorithm
! The delta rule is a variant of the perceptron rule which gives a gradient descent learning algorithm

CS 478 - Perceptrons 28


! The delta rule uses (target - net), before the net value goes through the threshold, in the learning rule to decide the weight update

Δw_i = η Σ_{d ∈ D} (t_d − net_d) x_id

! Weights are updated even when the output would be correct
! Because this model is single layer, and because of the SSE objective function, the error surface is guaranteed to be parabolic with only one minimum
! Learning rate
– If the learning rate is too large, it can jump around the global minimum
– If too small, it will work, but will take a longer time
– Can decrease the learning rate over time to give higher speed and still attain the global minimum (although the exact minimum is still just for the training set and thus…)

CS 478 - Perceptrons 29
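
A rough sketch of the update as written on the slide, i.e. the batch form that sums (t − net) x over the whole training set before changing the weights (function and variable names are assumptions, not from the slides):

def delta_rule_epoch(patterns, targets, weights, eta=0.1):
    """One batch delta-rule epoch: w_i += eta * sum_d (t_d - net_d) * x_id."""
    deltas = [0.0] * len(weights)
    for x, t in zip(patterns, targets):
        net = sum(xi * wi for xi, wi in zip(x, weights))   # unthresholded output
        for i, xi in enumerate(x):
            deltas[i] += eta * (t - net) * xi
    return [w + dw for w, dw in zip(weights, deltas)]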


! To get the true gradient with the delta rule, we need to sum the errors over the entire training set and only update the weights at the end of each epoch
! Batch (gradient) vs stochastic (on-line, incremental)
– With the stochastic delta rule algorithm, you update after every pattern, just like with the perceptron algorithm (even though that means each change may not be exactly along the true gradient)
– Stochastic is more efficient and best to use in almost all cases, though not all have figured it out yet
! Why is stochastic better? (Save for later)
– Top of the hill syndrome
– Speed – still not understood by many – talk about later
– Other parameters usually make it not the true gradient anyway
– True gradient descent only in the limit of a 0 learning rate
– Only a minimum for the training set; is it the exact minimum for the real task?

CS 478 - Perceptrons 30
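
For contrast with the batch sketch above, a stochastic (on-line) variant simply moves the weight change inside the pattern loop; again this is only an assumed illustration:

def stochastic_delta_rule_epoch(patterns, targets, weights, eta=0.1):
    """Stochastic delta rule: update the weights after every single pattern."""
    w = list(weights)
    for x, t in zip(patterns, targets):
        net = sum(xi * wi for xi, wi in zip(x, w))
        for i, xi in enumerate(x):
            w[i] += eta * (t - net) * xi
    return w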


! Perceptron rule (target - thresholded output): guaranteed to converge to a separating hyperplane if the problem is linearly separable. Otherwise it may not converge – it could get stuck in a cycle.
! Single layer delta rule: guaranteed to have only one global minimum. Thus it will converge to the best SSE solution whether the problem is linearly separable or not.
– Could have a higher misclassification rate than with the perceptron rule, and a less intuitive decision surface – we will discuss this with regression
! Stopping criteria – for these models, stop when no longer making progress
– i.e. when you have gone a few epochs with no significant improvement/change between epochs

CS 478 - Perceptrons 31
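
One way to express that stopping rule in code – purely an illustrative sketch, with assumed step() and sse() callables and arbitrary tolerance/patience values:

def train_until_no_progress(step, sse, patience=3, tol=1e-4, max_epochs=1000):
    """Run step() (one training epoch) until SSE fails to improve for `patience` epochs."""
    best, stale = float("inf"), 0
    for _ in range(max_epochs):
        step()
        err = sse()
        if err < best - tol:
            best, stale = err, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best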


Exclusive Or

[Figure: the XOR patterns plotted on the X_1–X_2 plane, with output 1 at (0,1) and (1,0) and output 0 at (0,0) and (1,1)]

Is there a dividing hyperplane?

CS 478 - Perceptrons 32


! d = # of dimensions
! P = 2^d = # of patterns
! 2^P = 2^(2^d) = # of functions

d   Total Functions   Linearly Separable Functions
0   2                 2
1   4                 4
2   16                14
3   256               104
4   65536             1882
5   4.3 × 10^9        94572
6   1.8 × 10^19       1.5 × 10^7
7   3.4 × 10^38       8.4 × 10^9

CS 478 - Perceptrons 33-36


Linearly Separable Functions

LS(P, d) = 2 Σ_{i=0}^{d} (P-1)! / ((P-1-i)! i!)   for P > d
         = 2^P                                    for P ≤ d

(All patterns for d = P, i.e. all 8 ways of dividing 3 vertices of a cube for d = P = 3)

Where P is the # of patterns for training and d is the # of inputs

lim_{d → ∞} (# of LS functions) = ∞

CS 478 - Perceptrons 37
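
A small sketch that evaluates the LS(P, d) formula above (this counts linearly separable dichotomies of P points in general position; the hypercube-vertex counts in the earlier table, such as 104 for d = 3, can be smaller because the 2^d corner patterns are not in general position). The function name is an assumption:

from math import comb

def ls_functions(P, d):
    """LS(P, d) = 2 * sum_{i=0}^{d} C(P-1, i) for P > d, else 2^P."""
    if P <= d:
        return 2 ** P
    return 2 * sum(comb(P - 1, i) for i in range(d + 1))

print(ls_functions(4, 2))   # 14, matching the d = 2 entry in the table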


! So far we have used f(x, w) = sign(Σ_{i=1}^{n} w_i x_i)
! We could preprocess the inputs in a non-linear way and use f(x, w) = sign(Σ_{i=1}^{m} w_i φ_i(x))
! To the perceptron algorithm it looks just the same and it can use the same learning algorithm; it just has different inputs
! For example, for a problem with two inputs x and y (plus the bias), we could also add the inputs x^2, y^2, and x·y
! The perceptron would just think it is a 5 dimensional task, and it is linear in those 5 dimensions
– But what kind of decision surfaces would it allow for the 2-d input space?

CS 478 - Perceptrons 38
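
A hedged sketch of that preprocessing step: expand each 2-d input into the five features named above (plus the bias) and hand the result to the same perceptron training code; the function name is an assumption:

def quadric_features(x, y):
    """Map a 2-d input to [x, y, x^2, y^2, x*y, 1]; the trailing 1 is the bias input."""
    return [x, y, x * x, y * y, x * y, 1]

# The expanded patterns can be fed to the same perceptron rule as before, e.g.:
# weights = train_perceptron([quadric_features(x, y) for x, y in raw_points], targets)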


! All quadratic surfaces (2nd order)
– ellipsoid
– parabola
– etc.
! That significantly increases the number of problems that can be solved, but there are still many problems which are not quadrically separable
! Could go to 3rd and higher order features, but the number of possible features grows exponentially
! Multi-layer neural networks will allow us to discover high-order features automatically from the input space

CS 478 - Perceptrons 39


[Figure: data points from two classes plotted along a single feature f_1 over the range -3 … 3]

! A perceptron with just feature f_1 cannot separate the data
! Could we add a transformed feature to our perceptron?
! f_2 = f_1^2

[Figure: the same data plotted in the f_1–f_2 plane, with f_2 = f_1^2 on the vertical axis]

! Note: we could also think of this as just using feature f_1 but now allowing a quadric surface to separate the data

CS 478 - Perceptrons 40-42
