A tutorial on support vector regression

Substituting (7), (8), and (9) into (5) yields the dual optimization problem:

maximize
$$ -\frac{1}{2}\sum_{i,j=1}^{l}(\alpha_i-\alpha_i^*)(\alpha_j-\alpha_j^*)\langle x_i, x_j\rangle \;-\; \varepsilon\sum_{i=1}^{l}(\alpha_i+\alpha_i^*) \;+\; \sum_{i=1}^{l} y_i(\alpha_i-\alpha_i^*) \qquad (10) $$
subject to
$$ \sum_{i=1}^{l}(\alpha_i-\alpha_i^*) = 0 \quad\text{and}\quad \alpha_i,\alpha_i^* \in [0,C]. $$

In deriving (10) we already eliminated the dual variables η_i, η_i^* through condition (9), which can be reformulated as η_i^(*) = C − α_i^(*). Equation (8) can be rewritten as follows:

$$ w = \sum_{i=1}^{l}(\alpha_i-\alpha_i^*)\,x_i, \quad\text{thus}\quad f(x) = \sum_{i=1}^{l}(\alpha_i-\alpha_i^*)\langle x_i, x\rangle + b. \qquad (11) $$

This is the so-called Support Vector expansion, i.e. w can be completely described as a linear combination of the training patterns x_i. In a sense, the complexity of a function's representation by SVs is independent of the dimensionality of the input space X, and depends only on the number of SVs.

Moreover, note that the complete algorithm can be described in terms of dot products between the data. Even when evaluating f(x) we need not compute w explicitly. These observations will come in handy for the formulation of a nonlinear extension.

1.4. Computing b

So far we neglected the issue of computing b. The latter can be done by exploiting the so-called Karush-Kuhn-Tucker (KKT) conditions (Karush 1939, Kuhn and Tucker 1951). These state that at the point of the solution the product between dual variables and constraints has to vanish:

$$ \alpha_i(\varepsilon + \xi_i - y_i + \langle w, x_i\rangle + b) = 0 $$
$$ \alpha_i^*(\varepsilon + \xi_i^* + y_i - \langle w, x_i\rangle - b) = 0 \qquad (12) $$

and

$$ (C-\alpha_i)\,\xi_i = 0 $$
$$ (C-\alpha_i^*)\,\xi_i^* = 0. \qquad (13) $$

This allows us to make several useful conclusions. Firstly, only samples (x_i, y_i) with corresponding α_i^(*) = C lie outside the ε-insensitive tube. Secondly, α_i α_i^* = 0, i.e. there can never be a set of dual variables α_i, α_i^* which are both simultaneously nonzero. This allows us to conclude that

$$ \varepsilon - y_i + \langle w, x_i\rangle + b \ge 0 \quad\text{and}\quad \xi_i = 0 \quad\text{if } \alpha_i < C \qquad (14) $$
$$ \varepsilon - y_i + \langle w, x_i\rangle + b \le 0 \quad\text{if } \alpha_i > 0 \qquad (15) $$

In conjunction with an analogous analysis on α_i^* we have

$$ \max\{-\varepsilon + y_i - \langle w, x_i\rangle \mid \alpha_i < C \text{ or } \alpha_i^* > 0\} \;\le\; b \;\le\; \min\{-\varepsilon + y_i - \langle w, x_i\rangle \mid \alpha_i > 0 \text{ or } \alpha_i^* < C\} \qquad (16) $$

If some α_i^(*) ∈ (0, C), the inequalities become equalities. See also Keerthi et al. (2001) for further means of choosing b.

Another way of computing b will be discussed in the context of interior point optimization (cf. Section 5). There b turns out to be a by-product of the optimization process. Further considerations shall be deferred to the corresponding section. See also Keerthi et al. (1999) for further methods to compute the constant offset.
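The SV expansion (11) and the KKT conditions (12) translate directly into code. The following minimal NumPy sketch (the variable names are hypothetical, and the dual coefficients are assumed to come from some QP solver) evaluates f(x) in the linear case and picks b from an unbounded sample with α_i^(*) ∈ (0, C), where (16) holds with equality.

```python
import numpy as np

def svr_predict(X_train, alpha, alpha_star, b, X_test):
    """Evaluate the SV expansion f(x) = sum_i (alpha_i - alpha_i*) <x_i, x> + b (eq. 11)."""
    coef = alpha - alpha_star                       # shape (l,)
    return X_test @ X_train.T @ coef + b            # linear kernel: <x_i, x>

def compute_b(X, y, alpha, alpha_star, C, eps, tol=1e-8):
    """Pick b from a sample with a dual variable strictly inside (0, C)."""
    coef = alpha - alpha_star
    wx = X @ X.T @ coef                             # <w, x_i> for all training points
    free = (alpha > tol) & (alpha < C - tol)
    free_star = (alpha_star > tol) & (alpha_star < C - tol)
    if free.any():
        i = np.flatnonzero(free)[0]
        return y[i] - wx[i] - eps                   # from alpha_i in (0, C), eq. (12)
    i = np.flatnonzero(free_star)[0]
    return y[i] - wx[i] + eps                       # from alpha_i* in (0, C)
```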
A final note has to be made regarding the sparsity of the SV expansion. From (12) it follows that only for |f(x_i) − y_i| ≥ ε the Lagrange multipliers may be nonzero, or in other words, for all samples inside the ε-tube (i.e. the shaded region in Fig. 1) the α_i, α_i^* vanish: for |f(x_i) − y_i| < ε the second factor in (12) is nonzero, hence α_i, α_i^* have to be zero such that the KKT conditions are satisfied. Therefore we have a sparse expansion of w in terms of x_i (i.e. we do not need all x_i to describe w). The examples that come with nonvanishing coefficients are called Support Vectors.

2. Kernels

2.1. Nonlinearity by preprocessing

The next step is to make the SV algorithm nonlinear. This, for instance, could be achieved by simply preprocessing the training patterns x_i by a map Φ: X → F into some feature space F, as described in Aizerman, Braverman and Rozonoér (1964) and Nilsson (1965), and then applying the standard SV regression algorithm. Let us have a brief look at an example given in Vapnik (1995).

Example 1 (Quadratic features in R²). Consider the map Φ: R² → R³ with Φ(x₁, x₂) = (x₁², √2 x₁x₂, x₂²). It is understood that the subscripts in this case refer to the components of x ∈ R². Training a linear SV machine on the preprocessed features would yield a quadratic function.

While this approach seems reasonable in the particular example above, it can easily become computationally infeasible for both polynomial features of higher order and higher dimensionality, as the number of different monomial features of degree p is $\binom{d+p-1}{p}$, where d = dim(X). Typical values for OCR tasks (with good performance) (Schölkopf, Burges and Vapnik 1995, Schölkopf et al. 1997, Vapnik 1995) are p = 7, d = 28 · 28 = 784, corresponding to approximately 3.7 · 10¹⁶ features.

2.2. Implicit mapping via kernels

Clearly this approach is not feasible and we have to find a computationally cheaper way. The key observation (Boser, Guyon


Note that the conditions in Theorem 7 are only necessary but not sufficient. The rules stated above can be useful tools for practitioners both for checking whether a kernel is an admissible SV kernel and for actually constructing new kernels. The general case is given by the following theorem.

Theorem 8 (Schoenberg 1942). A kernel of dot-product type k(x, x′) = k(⟨x, x′⟩) defined on an infinite dimensional Hilbert space, with a power series expansion

$$ k(t) = \sum_{n=0}^{\infty} a_n t^n \qquad (27) $$

is admissible if and only if all a_n ≥ 0.

A slightly weaker condition applies for finite dimensional spaces. For further details see Berg, Christensen and Ressel (1984) and Smola, Óvári and Williamson (2001).

2.4. Examples

In Schölkopf, Smola and Müller (1998b) it has been shown, by explicitly computing the mapping, that homogeneous polynomial kernels k with p ∈ N and

$$ k(x, x') = \langle x, x'\rangle^p \qquad (28) $$

are suitable SV kernels (cf. Poggio 1975). From this observation one can conclude immediately (Boser, Guyon and Vapnik 1992, Vapnik 1995) that kernels of the type

$$ k(x, x') = (\langle x, x'\rangle + c)^p \qquad (29) $$

i.e. inhomogeneous polynomial kernels with p ∈ N, c ≥ 0, are admissible, too: rewrite k as a sum of homogeneous kernels and apply Corollary 3. Another kernel, that might seem appealing due to its resemblance to Neural Networks, is the hyperbolic tangent kernel

$$ k(x, x') = \tanh(\vartheta + \kappa\langle x, x'\rangle). \qquad (30) $$

By applying Theorem 8 one can check that this kernel does not actually satisfy Mercer's condition (Ovari 2000). Curiously, the kernel has been successfully used in practice; cf. Schölkopf (1997) for a discussion of the reasons.

Translation invariant kernels k(x, x′) = k(x − x′) are quite widespread. It was shown in Aizerman, Braverman and Rozonoér (1964), Micchelli (1986) and Boser, Guyon and Vapnik (1992) that

$$ k(x, x') = e^{-\frac{\|x-x'\|^2}{2\sigma^2}} \qquad (31) $$

is an admissible SV kernel. Moreover one can show (Smola 1996, Vapnik, Golowich and Smola 1997) that (1_X denotes the indicator function on the set X and ⊗ the convolution operation)

$$ k(x, x') = B_{2n+1}(\|x - x'\|) \quad\text{with}\quad B_k := \bigotimes_{i=1}^{k} 1_{[-\frac{1}{2},\frac{1}{2}]} \qquad (32) $$

B-splines of order 2n + 1, defined by the (2n+1)-fold convolution of the unit interval, are also admissible. We shall postpone further considerations to Section 7 where the connection to regularization operators will be pointed out in more detail.
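The kernels (28)-(31) are one-liners to implement. The sketch below (plain NumPy; the function names are chosen here, not taken from the paper) collects them as Gram-matrix builders for reference. The B-spline kernel (32) is omitted to keep the sketch short.

```python
import numpy as np

# Each function takes pattern matrices X (m x d) and Z (n x d) and returns
# the m x n Gram matrix K with K[i, j] = k(x_i, z_j).

def homogeneous_poly(X, Z, p=2):                 # eq. (28)
    return (X @ Z.T) ** p

def inhomogeneous_poly(X, Z, p=2, c=1.0):        # eq. (29), c >= 0
    return (X @ Z.T + c) ** p

def tanh_kernel(X, Z, theta=0.0, kappa=1.0):     # eq. (30); not a Mercer kernel
    return np.tanh(theta + kappa * (X @ Z.T))

def gaussian_rbf(X, Z, sigma=1.0):               # eq. (31)
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))
```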
3. Cost functions

So far the SV algorithm for regression may seem rather strange and hardly related to other existing methods of function estimation (e.g. Huber 1981, Stone 1985, Härdle 1990, Hastie and Tibshirani 1990, Wahba 1990). However, once cast into a more standard mathematical notation, we will observe the connections to previous work. For the sake of simplicity we will, again, only consider the linear case, as extensions to the nonlinear one are straightforward by using the kernel method described in the previous chapter.

3.1. The risk functional

Let us for a moment go back to the case of Section 1.2. There, we had some training data X := {(x₁, y₁), ..., (x_l, y_l)} ⊂ X × R. We will assume now that this training set has been drawn iid (independent and identically distributed) from some probability distribution P(x, y). Our goal will be to find a function f minimizing the expected risk (cf. Vapnik 1982)

$$ R[f] = \int c(x, y, f(x))\, dP(x, y) \qquad (33) $$

(c(x, y, f(x)) denotes a cost function determining how we will penalize estimation errors) based on the empirical data X. Given that we do not know the distribution P(x, y) we can only use X for estimating a function f that minimizes R[f]. A possible approximation consists in replacing the integration by the empirical estimate, to get the so-called empirical risk functional

$$ R_{\mathrm{emp}}[f] := \frac{1}{l}\sum_{i=1}^{l} c(x_i, y_i, f(x_i)). \qquad (34) $$

A first attempt would be to find the empirical risk minimizer f₀ := argmin_{f∈H} R_emp[f] for some function class H. However, if H is very rich, i.e. its "capacity" is very high, as for instance when dealing with few data in very high-dimensional spaces, this may not be a good idea, as it will lead to overfitting and thus bad generalization properties. Hence one should add a capacity control term, which in the SV case is ‖w‖², leading to the regularized risk functional (Tikhonov and Arsenin 1977, Morozov 1984, Vapnik 1982)

$$ R_{\mathrm{reg}}[f] := R_{\mathrm{emp}}[f] + \frac{\lambda}{2}\|w\|^2 \qquad (35) $$

where λ > 0 is a so-called regularization constant. Many algorithms like regularization networks (Girosi, Jones and Poggio 1993) or neural networks with weight decay (e.g. Bishop 1995) minimize an expression similar to (35).
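As a concrete illustration of (34) and (35), the short NumPy sketch below evaluates the empirical and regularized risk of a linear model f(x) = ⟨w, x⟩ + b. The ε-insensitive loss of the next subsection is anticipated here as the cost function; the function names are hypothetical.

```python
import numpy as np

def eps_insensitive(residual, eps):
    """|y - f(x)|_eps = max(0, |y - f(x)| - eps)."""
    return np.maximum(0.0, np.abs(residual) - eps)

def empirical_risk(w, b, X, y, eps):
    """R_emp[f] of eq. (34) with the eps-insensitive cost."""
    residual = y - (X @ w + b)
    return eps_insensitive(residual, eps).mean()

def regularized_risk(w, b, X, y, eps, lam):
    """R_reg[f] = R_emp[f] + (lambda / 2) ||w||^2, eq. (35)."""
    return empirical_risk(w, b, X, y, eps) + 0.5 * lam * np.dot(w, w)
```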


3.2. Maximum likelihood and density models

The standard setting in the SV case is, as already mentioned in Section 1.2, the ε-insensitive loss

$$ c(x, y, f(x)) = |y - f(x)|_\varepsilon. \qquad (36) $$

It is straightforward to show that minimizing (35) with the particular loss function of (36) is equivalent to minimizing (3), the only difference being that C = 1/(λl).

Loss functions such as |y − f(x)|_ε^p with p > 1 may not be desirable, as the superlinear increase leads to a loss of the robustness properties of the estimator (Huber 1981): in those cases the derivative of the cost function grows without bound. For p < 1, on the other hand, c becomes nonconvex.

For the case of c(x, y, f(x)) = (y − f(x))² we recover the least mean squares fit approach, which, unlike the standard SV loss function, leads to a matrix inversion instead of a quadratic programming problem.

The question is which cost function should be used in (35). On the one hand we will want to avoid a very complicated function c, as this may lead to difficult optimization problems. On the other hand one should use that particular cost function that suits the problem best. Moreover, under the assumption that the samples were generated by an underlying functional dependency plus additive noise, i.e. y_i = f_true(x_i) + ξ_i with density p(ξ), the optimal cost function in a maximum likelihood sense is

$$ c(x, y, f(x)) = -\log p(y - f(x)). \qquad (37) $$

This can be seen as follows. The likelihood of an estimate

$$ X_f := \{(x_1, f(x_1)), \ldots, (x_l, f(x_l))\} \qquad (38) $$

for additive noise and iid data is

$$ p(X_f \mid X) = \prod_{i=1}^{l} p(f(x_i) \mid (x_i, y_i)) = \prod_{i=1}^{l} p(y_i - f(x_i)). \qquad (39) $$

Maximizing P(X_f | X) is equivalent to minimizing −log P(X_f | X). By using (37) we get

$$ -\log P(X_f \mid X) = \sum_{i=1}^{l} c(x_i, y_i, f(x_i)). \qquad (40) $$

However, the cost function resulting from this reasoning might be nonconvex. In this case one would have to find a convex proxy in order to deal with the situation efficiently (i.e. to find an efficient implementation of the corresponding optimization problem).

If, on the other hand, we are given a specific cost function from a real world problem, one should try to find as close a proxy to this cost function as possible, as it is the performance w.r.t. this particular cost function that matters ultimately.

Table 1 contains an overview over some common density models and the corresponding loss functions as defined by (37).

The only requirement we will impose on c(x, y, f(x)) in the following is that for fixed x and y we have convexity in f(x). This requirement is made, as we want to ensure the existence and uniqueness (for strict convexity) of a minimum of optimization problems (Fletcher 1989).
3.3. Solving the equations

For the sake of simplicity we will additionally assume c to be symmetric and to have (at most) two (for symmetry) discontinuities at ±ε, ε ≥ 0, in the first derivative, and to be zero in the interval [−ε, ε]. All loss functions from Table 1 belong to this class. Hence c will take on the following form:

$$ c(x, y, f(x)) = \tilde{c}(|y - f(x)|_\varepsilon) \qquad (41) $$

Note the similarity to Vapnik's ε-insensitive loss. It is rather straightforward to extend this special choice to more general convex cost functions. For nonzero cost functions in the interval [−ε, ε] use an additional pair of slack variables. Moreover we might choose different cost functions c̃_i, c̃_i^* and different values of ε_i, ε_i^* for each sample. At the expense of additional Lagrange multipliers in the dual formulation, additional discontinuities also can be taken care of. Analogously to (3) we arrive at a convex minimization problem (Smola and Schölkopf 1998a). To simplify notation we will stick to the one of (3) and use C

Table 1. Common loss functions and corresponding density models

- ε-insensitive:  c(ξ) = |ξ|_ε;  p(ξ) = 1/(2(1+ε)) exp(−|ξ|_ε)
- Laplacian:  c(ξ) = |ξ|;  p(ξ) = (1/2) exp(−|ξ|)
- Gaussian:  c(ξ) = ξ²/2;  p(ξ) = (1/√(2π)) exp(−ξ²/2)
- Huber's robust loss:  c(ξ) = ξ²/(2σ) if |ξ| ≤ σ, |ξ| − σ/2 otherwise;  p(ξ) ∝ exp(−ξ²/(2σ)) if |ξ| ≤ σ, exp(σ/2 − |ξ|) otherwise
- Polynomial:  c(ξ) = |ξ|^p/p;  p(ξ) = p/(2Γ(1/p)) exp(−|ξ|^p)
- Piecewise polynomial:  c(ξ) = ξ^p/(pσ^{p−1}) if |ξ| ≤ σ, |ξ| − σ(p−1)/p otherwise;  p(ξ) ∝ exp(−ξ^p/(pσ^{p−1})) if |ξ| ≤ σ, exp(σ(p−1)/p − |ξ|) otherwise
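The losses of Table 1 are one-liners; the NumPy sketch below (hypothetical function names) implements a few of them, which is convenient when experimenting with the general convex-cost formulation of this section.

```python
import numpy as np

def eps_insensitive(xi, eps):
    return np.maximum(0.0, np.abs(xi) - eps)       # |xi|_eps

def laplacian(xi):
    return np.abs(xi)

def gaussian(xi):
    return 0.5 * xi ** 2

def huber(xi, sigma):
    a = np.abs(xi)
    return np.where(a <= sigma, xi ** 2 / (2.0 * sigma), a - sigma / 2.0)

def piecewise_polynomial(xi, sigma, p):
    a = np.abs(xi)
    return np.where(a <= sigma,
                    a ** p / (p * sigma ** (p - 1)),
                    a - sigma * (p - 1) / p)
```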


Fig. 2. Architecture of a regression machine constructed by the SV algorithm

the map Φ. This corresponds to evaluating kernel functions k(x_i, x). Finally the dot products are added up using the weights ν_i = α_i − α_i^*. This, plus the constant term b, yields the final prediction output. The process described here is very similar to regression in a neural network, with the difference that in the SV case the weights in the input layer are a subset of the training patterns.

Figure 3 demonstrates how the SV algorithm chooses the flattest function among those approximating the original data with a given precision. Although requiring flatness only in feature space, one can observe that the functions also are very flat in input space. This is due to the fact that kernels can be associated with flatness properties via regularization operators. This will be explained in more detail in Section 7.

Finally, Fig. 4 shows the relation between approximation quality and sparsity of representation in the SV case. The lower the precision required for approximating the original data, the fewer SVs are needed to encode that. The non-SVs are redundant, i.e. even without these patterns in the training set, the SV machine would have constructed exactly the same function f. One might think that this could be an efficient way of data compression, namely by storing only the support patterns, from which the estimate can be reconstructed completely. However, this simple analogy turns out to fail in the case of high-dimensional data, and even more drastically in the presence of noise. In Vapnik, Golowich and Smola (1997) one can see that even for moderate approximation quality, the number of SVs can be considerably high, yielding rates worse than the Nyquist rate (Nyquist 1928, Shannon 1948).

5. Optimization algorithms

While there has been a large number of implementations of SV algorithms in the past years, we focus on a few algorithms which will be presented in greater detail. This selection is somewhat biased, as it contains those algorithms the authors are most familiar with. However, we think that this overview contains some of the most effective ones and will be useful for practitioners who would like to actually code a SV machine by themselves. But before doing so we will briefly cover major optimization packages and strategies.

Fig. 3. Left to right: approximation of the function sinc x with precisions ε = 0.1, 0.2, and 0.5. The solid top and bottom lines indicate the size of the ε-tube, the dotted line in between is the regression.

Fig. 4. Left to right: regression (solid line), data points (small dots) and SVs (big dots) for an approximation with ε = 0.1, 0.2, and 0.5. Note the decrease in the number of SVs.
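The effect shown in Figs. 3 and 4 is easy to reproduce with an off-the-shelf solver. The sketch below uses scikit-learn's SVR (an assumption made here; the paper itself predates that library) to fit a sinc-like target at several values of ε and to report the number of support vectors, which drops as the tube widens.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 200).reshape(-1, 1)
y = np.sinc(X).ravel() + 0.05 * rng.standard_normal(200)   # numpy's normalized sinc

# Larger eps -> wider tube -> fewer support vectors (cf. Fig. 4).
for eps in (0.1, 0.2, 0.5):
    model = SVR(kernel="rbf", C=10.0, gamma=1.0, epsilon=eps).fit(X, y)
    print(f"epsilon={eps}: {len(model.support_)} SVs out of {len(X)} samples")
```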


5.1. Implementations

Most commercially available packages for quadratic programming can also be used to train SV machines. These are usually numerically very stable general purpose codes, with special enhancements for large sparse systems. While the latter is a feature that is not needed at all in SV problems (there the dot product matrix is dense and huge) they still can be used with good success.⁶

OSL: This package was written by IBM Corporation (1992). It uses a two phase algorithm. The first step consists of solving a linear approximation of the QP problem by the simplex algorithm (Dantzig 1962). Next a related very simple QP problem is dealt with. When successive approximations are close enough together, the second subalgorithm, which permits a quadratic objective and converges very rapidly from a good starting value, is used. Recently an interior point algorithm was added to the software suite.

CPLEX by CPLEX Optimization Inc. (1994) uses a primal-dual logarithmic barrier algorithm (Megiddo 1989) instead, with a predictor-corrector step (see e.g. Lustig, Marsten and Shanno 1992, Mehrotra and Sun 1992).

MINOS by the Stanford Optimization Laboratory (Murtagh and Saunders 1983) uses a reduced gradient algorithm in conjunction with a quasi-Newton algorithm. The constraints are handled by an active set strategy. Feasibility is maintained throughout the process. On the active constraint manifold, a quasi-Newton approximation is used.

MATLAB: Until recently the MATLAB QP optimizer delivered only agreeable, although below average, performance on classification tasks and was not all too useful for regression tasks (for problems much larger than 100 samples) due to the fact that one is effectively dealing with an optimization problem of size 2l where at least half of the eigenvalues of the Hessian vanish. These problems seem to have been addressed in version 5.3 / R11. MATLAB now uses interior point codes.

LOQO by Vanderbei (1994) is another example of an interior point code. Section 5.3 discusses the underlying strategies in detail and shows how they can be adapted to SV algorithms.

Maximum margin perceptron by Kowalczyk (2000) is an algorithm specifically tailored to SVs. Unlike most other techniques it works directly in primal space and thus does not have to take the equality constraint on the Lagrange multipliers into account explicitly.

Iterative free set methods: The algorithm by Kaufman (Bunch, Kaufman and Parlett 1976, Bunch and Kaufman 1977, 1980, Drucker et al. 1997, Kaufman 1999) uses such a technique, starting with all variables on the boundary and adding them as the Karush-Kuhn-Tucker conditions become more violated. This approach has the advantage of not having to compute the full dot product matrix from the beginning.
Instead it is evaluated on the fly, yielding a performance improvement in comparison to tackling the whole optimization problem at once. However, also other algorithms can be modified by subset selection techniques (see Section 5.5) to address this problem.

5.2. Basic notions

Most algorithms rely on results from duality theory in convex optimization. Although we already happened to mention some basic ideas in Section 1.2 we will, for the sake of convenience, briefly review without proof the core results. These are needed in particular to derive an interior point algorithm. For details and proofs see e.g. Fletcher (1989).

Uniqueness: Every convex constrained optimization problem has a unique minimum. If the problem is strictly convex then the solution is unique. This means that SVs are not plagued with the problem of local minima as Neural Networks are.⁷

Lagrange function: The Lagrange function is given by the primal objective function minus the sum of all products between constraints and corresponding Lagrange multipliers (cf. e.g. Fletcher 1989, Bertsekas 1995). Optimization can be seen as minimization of the Lagrangian wrt. the primal variables and simultaneous maximization wrt. the Lagrange multipliers, i.e. dual variables. It has a saddle point at the solution. Usually the Lagrange function is only a theoretical device to derive the dual objective function (cf. Section 1.2).

Dual objective function: It is derived by minimizing the Lagrange function with respect to the primal variables and subsequent elimination of the latter. Hence it can be written solely in terms of the dual variables.

Duality gap: For both feasible primal and dual variables the primal objective function (of a convex minimization problem) is always greater than or equal to the dual objective function. Since SVMs have only linear constraints the constraint qualifications of the strong duality theorem (Bazaraa, Sherali and Shetty 1993, Theorem 6.2.4) are satisfied and it follows that the gap vanishes at optimality. Thus the duality gap is a measure of how close (in terms of the objective function) the current set of variables is to the solution.

Karush-Kuhn-Tucker (KKT) conditions: A set of primal and dual variables that is both feasible and satisfies the KKT conditions is the solution (i.e. constraint · dual variable = 0). The sum of the violated KKT terms determines exactly the size of the duality gap (that is, we simply compute the constraint · Lagrange multiplier part as done in (55)).
This allows us to compute the latter quite easily. A simple intuition is that for violated constraints the dual variable could be increased arbitrarily, thus rendering the Lagrange function arbitrarily large. This, however, is in contradiction to the saddle point property.


5.3. Interior point algorithms

In a nutshell the idea of an interior point algorithm is to compute the dual of the optimization problem (in our case the dual dual of R_reg[f]) and solve both primal and dual simultaneously. This is done by only gradually enforcing the KKT conditions to iteratively find a feasible solution, and to use the duality gap between primal and dual objective function to determine the quality of the current set of variables. The special flavour of algorithm we will describe is primal-dual path-following (Vanderbei 1994).

In order to avoid tedious notation we will consider the slightly more general problem and specialize the result to the SVM later. It is understood that unless stated otherwise, variables like α denote vectors and α_i denotes its i-th component.

minimize  $ \tfrac{1}{2} q(\alpha) + \langle c, \alpha\rangle $
subject to  $ A\alpha = b \ \text{and}\ l \le \alpha \le u \qquad (50) $

with c, α, l, u ∈ R^n, A ∈ R^{n·m}, b ∈ R^m, the inequalities between vectors holding componentwise, and q(α) being a convex function of α. Now we will add slack variables to get rid of all inequalities but the positivity constraints. This yields:

minimize  $ \tfrac{1}{2} q(\alpha) + \langle c, \alpha\rangle $
subject to  $ A\alpha = b,\ \alpha - g = l,\ \alpha + t = u,\ g, t \ge 0,\ \alpha\ \text{free} \qquad (51) $

The dual of (51) is

maximize  $ \tfrac{1}{2}\bigl(q(\alpha) - \langle \partial q(\alpha), \alpha\rangle\bigr) + \langle b, y\rangle + \langle l, z\rangle - \langle u, s\rangle $
subject to  $ \tfrac{1}{2}\partial q(\alpha) + c - (A y)^\top + s = z,\quad s, z \ge 0,\ y\ \text{free} \qquad (52) $

Moreover we get the KKT conditions, namely

$$ g_i z_i = 0 \quad\text{and}\quad s_i t_i = 0 \quad\text{for all } i \in [1 \ldots n]. \qquad (53) $$

A necessary and sufficient condition for the optimal solution is that the primal/dual variables satisfy both the feasibility conditions of (51) and (52) and the KKT conditions (53). We proceed to solve (51)-(53) iteratively. The details can be found in Appendix A.

5.4. Useful tricks

Before proceeding to further algorithms for quadratic optimization let us briefly mention some useful tricks that can be applied to all algorithms described subsequently and may have significant impact despite their simplicity. They are in part derived from ideas of the interior-point approach.

Training with different regularization parameters: For several reasons (model selection, controlling the number of support vectors, etc.) it may happen that one has to train a SV machine with different regularization parameters C, but otherwise rather identical settings. If the new parameter C_new = τ C_old is not too different, it is advantageous to use the rescaled values of the Lagrange multipliers (i.e. α_i, α_i^*) as a starting point for the new optimization problem. Rescaling is necessary to satisfy the modified constraints. One gets

$$ \alpha_{\mathrm{new}} = \tau \alpha_{\mathrm{old}} \quad\text{and likewise}\quad b_{\mathrm{new}} = \tau b_{\mathrm{old}}. \qquad (54) $$

Assuming that the (dominant) convex part q(α) of the primal objective is quadratic, q scales with τ², whereas the linear part scales with τ. However, since the linear term dominates the objective function, the rescaled values are still a better starting point than α = 0. In practice a speedup of approximately 95% of the overall training time can be observed when using the sequential minimization algorithm, cf. (Smola 1998). A similar reasoning can be applied when retraining with the same regularization parameter but different (yet similar) width parameters of the kernel function.
See Cristianini, Campbell and Shawe-Taylor (1998) for details thereon in a different context.

Monitoring convergence via the feasibility gap: In the case of both primal and dual feasible variables the following connection between primal and dual objective function holds:

$$ \text{Dual Obj.} = \text{Primal Obj.} - \sum_i (g_i z_i + s_i t_i) \qquad (55) $$

This can be seen immediately by the construction of the Lagrange function. In Regression Estimation (with the ε-insensitive loss function) one obtains for $\sum_i g_i z_i + s_i t_i$

$$ \sum_i \Bigl[ \max(0, f(x_i) - (y_i + \varepsilon_i))(C - \alpha_i^*) - \min(0, f(x_i) - (y_i + \varepsilon_i))\alpha_i^* + \max(0, (y_i - \varepsilon_i^*) - f(x_i))(C - \alpha_i) - \min(0, (y_i - \varepsilon_i^*) - f(x_i))\alpha_i \Bigr]. \qquad (56) $$

Thus convergence with respect to the point of the solution can be expressed in terms of the duality gap. An effective stopping rule is to require

$$ \frac{\sum_i g_i z_i + s_i t_i}{|\text{Primal Objective}| + 1} \le \varepsilon_{\mathrm{tol}} \qquad (57) $$

for some precision ε_tol. This condition is much in the spirit of primal dual interior point path following algorithms, where convergence is measured in terms of the number of significant figures (which would be the decimal logarithm of (57)), a convention that will also be adopted in the subsequent parts of this exposition.
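For the ε-insensitive case, the feasibility-gap term (56) and the stopping rule (57) need only the current predictions and dual variables. A minimal NumPy sketch follows (hypothetical names; a single ε is used for all samples, and the primal objective is assumed to be computed elsewhere).

```python
import numpy as np

def feasibility_gap(f_x, y, eps, alpha, alpha_star, C):
    """Sum_i g_i z_i + s_i t_i for eps-insensitive SV regression, eq. (56)."""
    upper = f_x - (y + eps)      # positive when f(x_i) lies above the tube
    lower = (y - eps) - f_x      # positive when f(x_i) lies below the tube
    return np.sum(np.maximum(0.0, upper) * (C - alpha_star)
                  - np.minimum(0.0, upper) * alpha_star
                  + np.maximum(0.0, lower) * (C - alpha)
                  - np.minimum(0.0, lower) * alpha)

def converged(f_x, y, eps, alpha, alpha_star, C, primal_obj, tol=1e-3):
    """Stopping rule (57): relative feasibility gap below tol."""
    gap = feasibility_gap(f_x, y, eps, alpha, alpha_star, C)
    return gap / (abs(primal_obj) + 1.0) <= tol
```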


5.5. Subset selection algorithms

The convex programming algorithms described so far can be used directly on moderately sized datasets (up to 3000 samples) without any further modifications. On large datasets, however, it is difficult, due to memory and cpu limitations, to compute the dot product matrix k(x_i, x_j) and keep it in memory. A simple calculation shows that, for instance, storing the dot product matrix of the NIST OCR database (60,000 samples) at single precision would consume 0.7 GBytes. A Cholesky decomposition thereof, which would additionally require roughly the same amount of memory and 64 Teraflops (counting multiplies and adds separately), seems unrealistic, at least at current processor speeds.

A first solution, which was introduced in Vapnik (1982), relies on the observation that the solution can be reconstructed from the SVs alone. Hence, if we knew the SV set beforehand, and it fitted into memory, then we could directly solve the reduced problem. The catch is that we do not know the SV set before solving the problem. The solution is to start with an arbitrary subset, a first chunk that fits into memory, train the SV algorithm on it, keep the SVs and fill the chunk up with data the current estimator would make errors on (i.e. data lying outside the ε-tube of the current regression). Then retrain the system and keep on iterating until after training all KKT conditions are satisfied.

The basic chunking algorithm just postponed the underlying problem of dealing with large datasets whose dot-product matrix cannot be kept in memory: it will occur for larger training set sizes than originally, but it is not completely avoided. Hence the solution of Osuna, Freund and Girosi (1997) is to use only a subset of the variables as a working set and optimize the problem with respect to them while freezing the other variables. This method is described in detail in Osuna, Freund and Girosi (1997), Joachims (1999) and Saunders et al. (1998) for the case of pattern recognition.⁸

An adaptation of these techniques to the case of regression with convex cost functions can be found in Appendix B. The basic structure of the method is described by Algorithm 1.

Algorithm 1: Basic structure of a working set algorithm

  Initialize α_i, α_i^* = 0
  Choose arbitrary working set S_w
  repeat
    Compute coupling terms (linear and constant) for S_w (see Appendix A.3)
    Solve reduced optimization problem
    Choose new S_w from variables α_i, α_i^* not satisfying the KKT conditions
  until working set S_w = ∅
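A schematic rendering of Algorithm 1 in Python is given below. The functions compute_coupling_terms, solve_reduced_qp and violates_kkt are placeholders for the problem-specific steps detailed in Appendices A.3 and B; only the control flow of the working set loop is shown.

```python
import numpy as np

def working_set_loop(X, y, C, eps, kernel, solve_reduced_qp,
                     compute_coupling_terms, violates_kkt, q=50):
    """Basic structure of a working set algorithm (Algorithm 1).

    The three callables are placeholders for the problem-specific steps;
    q is the working set size."""
    l = len(y)
    alpha = np.zeros(l)
    alpha_star = np.zeros(l)
    work = list(range(min(q, l)))                  # arbitrary initial working set
    while work:
        lin, const = compute_coupling_terms(alpha, alpha_star, work, X, y, kernel)
        alpha[work], alpha_star[work] = solve_reduced_qp(lin, const, C, eps, work)
        # New working set: variables that still violate the KKT conditions.
        work = [i for i in range(l)
                if violates_kkt(i, alpha, alpha_star, X, y, C, eps, kernel)][:q]
    return alpha, alpha_star
```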
5.6. Sequential minimal optimization

Recently an algorithm, Sequential Minimal Optimization (SMO), was proposed (Platt 1999) that puts chunking to the extreme by iteratively selecting subsets only of size 2 and optimizing the target function with respect to them. It has been reported to have good convergence properties and it is easily implemented. The key point is that for a working set of 2 the optimization subproblem can be solved analytically without explicitly invoking a quadratic optimizer.

While readily derived for pattern recognition by Platt (1999), one simply has to mimic the original reasoning to obtain an extension to Regression Estimation. This is what will be done in Appendix C (the pseudocode can be found in Smola and Schölkopf (1998b)). The modifications consist of a pattern dependent regularization, convergence control via the number of significant figures, and a modified system of equations to solve the optimization problem in two variables for regression analytically.

Note that the reasoning only applies to SV regression with the ε-insensitive loss function; for most other convex cost functions an explicit solution of the restricted quadratic programming problem is impossible. Yet, one could derive an analogous nonquadratic convex optimization problem for general cost functions, but at the expense of having to solve it numerically.

The exposition proceeds as follows: first one has to derive the (modified) boundary conditions for the constrained two-index (i, j) subproblem in regression, next one can proceed to solve the optimization problem analytically, and finally one has to check which parts of the selection rules have to be modified to make the approach work for regression. Since most of the content is fairly technical it has been relegated to Appendix C.

The main difference in implementations of SMO for regression can be found in the way the constant offset b is determined (Keerthi et al. 1999) and in which criterion is used to select a new set of variables. We present one such strategy in Appendix C.3. However, since selection strategies are the focus of current research we recommend that readers interested in implementing the algorithm make sure they are aware of the most recent developments in this area.

Finally, we note that just as we presently describe a generalization of SMO to regression estimation, other learning problems can also benefit from the underlying ideas. Recently, a SMO algorithm for training novelty detection systems (i.e. one-class classification) has been proposed (Schölkopf et al. 2001).

6. Variations on a theme

There exists a large number of algorithmic modifications of the SV algorithm, to make it suitable for specific settings (inverse problems, semiparametric settings), different ways of measuring capacity, reductions to linear programming (convex combinations), and different ways of controlling capacity. We will mention some of the more popular ones.

6.1. Convex combinations and l1-norms

All the algorithms presented so far involved convex, and at best quadratic, programming. Yet one might think of reducing the problem to a case where linear programming techniques can be applied. This can be done in a straightforward fashion (Mangasarian 1965, 1968, Weston et al.
1999, Smola, Schölkopf and Rätsch 1999) for both SV pattern recognition and regression. The key is to replace (35) by

$$ R_{\mathrm{reg}}[f] := R_{\mathrm{emp}}[f] + \lambda \|\alpha\|_1 \qquad (58) $$

where ‖α‖₁ denotes the l₁ norm in coefficient space. Hence one uses the SV kernel expansion (11)

$$ f(x) = \sum_{i=1}^{l} \alpha_i k(x_i, x) + b $$

with a different way of controlling capacity by minimizing

$$ R_{\mathrm{reg}}[f] = \frac{1}{l}\sum_{i=1}^{l} c(x_i, y_i, f(x_i)) + \lambda \sum_{i=1}^{l} |\alpha_i|. \qquad (59) $$


For the ε-insensitive loss function this leads to a linear programming problem. In the other cases, however, the problem still stays a quadratic or general convex one, and therefore may not yield the desired computational advantage. Therefore we will limit ourselves to the derivation of the linear programming problem in the case of the |·|_ε cost function. Reformulating (59) yields

minimize
$$ \sum_{i=1}^{l}(\alpha_i + \alpha_i^*) + C \sum_{i=1}^{l} (\xi_i + \xi_i^*) $$
subject to
$$ y_i - \sum_{j=1}^{l}(\alpha_j - \alpha_j^*) k(x_j, x_i) - b \le \varepsilon + \xi_i $$
$$ \sum_{j=1}^{l}(\alpha_j - \alpha_j^*) k(x_j, x_i) + b - y_i \le \varepsilon + \xi_i^* $$
$$ \alpha_i, \alpha_i^*, \xi_i, \xi_i^* \ge 0 $$

Unlike in the classical SV case, the transformation into its dual does not give any improvement in the structure of the optimization problem. Hence it is best to minimize R_reg[f] directly, which can be achieved by a linear optimizer (e.g. Dantzig 1962, Lustig, Marsten and Shanno 1990, Vanderbei 1997).

In Weston et al. (1999) a similar variant of the linear SV approach is used to estimate densities on a line. One can show (Smola et al. 2000) that one may obtain bounds on the generalization error which exhibit even better rates (in terms of the entropy numbers) than the classical SV case (Williamson, Smola and Schölkopf 1998).
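A minimal sketch of the LP above, assuming SciPy's linprog (HiGHS backend) is available; the variable stacking and default parameter values are choices made here, not taken from the paper, and the sketch is not tuned for large problems.

```python
import numpy as np
from scipy.optimize import linprog

def lp_svr(K, y, C=1.0, eps=0.1):
    """Solve the linear-programming SV regression above.

    K is the l x l kernel matrix; decision variables are stacked as
    z = [alpha, alpha_star, xi, xi_star, b]."""
    l = len(y)
    I = np.eye(l)
    c = np.concatenate([np.ones(2 * l), C * np.ones(2 * l), [0.0]])
    # y_i - f(x_i) <= eps + xi_i  and  f(x_i) - y_i <= eps + xi_i*
    A_up = np.hstack([-K,  K, -I, np.zeros((l, l)), -np.ones((l, 1))])
    A_lo = np.hstack([ K, -K, np.zeros((l, l)), -I,  np.ones((l, 1))])
    A_ub = np.vstack([A_up, A_lo])
    b_ub = np.concatenate([eps - y, eps + y])
    bounds = [(0, None)] * (4 * l) + [(None, None)]      # b is free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    z = res.x
    return z[:l], z[l:2 * l], z[-1]                      # alpha, alpha_star, b
```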
6.2. Automatic tuning of the insensitivity tube

Besides standard model selection issues, i.e. how to specify the trade-off between empirical error and model capacity, there also exists the problem of an optimal choice of a cost function. In particular, for the ε-insensitive cost function we still have the problem of choosing an adequate parameter ε in order to achieve good performance with the SV machine.

Smola et al. (1998a) show the existence of a linear dependency between the noise level and the optimal ε-parameter for SV regression. However, this would require that we know something about the noise model. This knowledge is not available in general. Therefore, albeit providing theoretical insight, this finding by itself is not particularly useful in practice. Moreover, if we really knew the noise model, we most likely would not choose the ε-insensitive cost function but the corresponding maximum likelihood loss function instead.

There exists, however, a method to construct SV machines that automatically adjust ε and moreover also, at least asymptotically, have a predetermined fraction of sampling points as SVs (Schölkopf et al. 2000). We modify (35) such that ε becomes a variable of the optimization problem, including an extra term in the primal objective function which attempts to minimize ε. In other words,

$$ \text{minimize}\quad R_\nu[f] := R_{\mathrm{emp}}[f] + \frac{\lambda}{2}\|w\|^2 + \nu\varepsilon \qquad (60) $$

for some ν > 0. Hence (42) becomes (again carrying out the usual transformation between λ, l and C)

minimize
$$ \frac{1}{2}\|w\|^2 + C\Bigl(\sum_{i=1}^{l}(\tilde{c}(\xi_i) + \tilde{c}(\xi_i^*)) + l\nu\varepsilon\Bigr) \qquad (61) $$
subject to
$$ y_i - \langle w, x_i\rangle - b \le \varepsilon + \xi_i $$
$$ \langle w, x_i\rangle + b - y_i \le \varepsilon + \xi_i^* $$
$$ \xi_i, \xi_i^* \ge 0 $$

We consider the standard |·|_ε loss function. Computing the dual of (61) yields

maximize
$$ -\frac{1}{2}\sum_{i,j=1}^{l}(\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) k(x_i, x_j) + \sum_{i=1}^{l} y_i(\alpha_i - \alpha_i^*) \qquad (62) $$
subject to
$$ \sum_{i=1}^{l}(\alpha_i - \alpha_i^*) = 0, \quad \sum_{i=1}^{l}(\alpha_i + \alpha_i^*) \le C\nu l, \quad \alpha_i, \alpha_i^* \in [0, C] $$

Note that the optimization problem is thus very similar to the ε-SV one: the target function is even simpler (it is homogeneous), but there is an additional constraint. For information on how this affects the implementation cf. Chang and Lin (2001).

Besides having the advantage of being able to automatically determine ε, (62) also has another advantage. It can be used to pre-specify the number of SVs:

Theorem 9 (Schölkopf et al. 2000).
1. ν is an upper bound on the fraction of errors.
2. ν is a lower bound on the fraction of SVs.
3. Suppose the data has been generated iid from a distribution p(x, y) = p(x)p(y | x) with a continuous conditional distribution p(y | x). With probability 1, asymptotically, ν equals the fraction of SVs and the fraction of errors.

Essentially, ν-SV regression improves upon ε-SV regression by allowing the tube width to adapt automatically to the data. What is kept fixed up to this point, however, is the shape of the tube. One can, however, go one step further and use parametric tube models with non-constant width, leading to almost identical optimization problems (Schölkopf et al. 2000).

Combining ν-SV regression with results on the asymptotically optimal choice of ε for a given noise model (Smola et al. 1998a) leads to a guideline on how to adjust ν, provided the class of noise models (e.g. Gaussian or Laplacian) is known.
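Theorem 9 is easy to observe empirically. The sketch below uses scikit-learn's NuSVR (using that library is an assumption made here; any ν-SVR solver would do) and prints the fraction of support vectors, which should be approximately lower-bounded by ν.

```python
import numpy as np
from sklearn.svm import NuSVR

rng = np.random.default_rng(1)
X = np.linspace(-3, 3, 300).reshape(-1, 1)
y = np.sinc(X).ravel() + 0.1 * rng.standard_normal(300)

for nu in (0.1, 0.3, 0.6):
    model = NuSVR(nu=nu, C=10.0, kernel="rbf", gamma=1.0).fit(X, y)
    frac_sv = len(model.support_) / len(X)
    print(f"nu={nu}: fraction of SVs = {frac_sv:.2f}")
```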


Remark 10 (Optimal choice of ν). Denote by p a probability density with unit variance, and by P a family of noise models generated from p by P := {p | p = (1/σ) p(y/σ)}. Moreover assume that the data were drawn iid from p(x, y) = p(x) p(y − f(x)) with p(y − f(x)) continuous. Then under the assumption of uniform convergence, the asymptotically optimal value of ν is

$$ \nu = 1 - \int_{-\varepsilon}^{\varepsilon} p(t)\, dt, \quad\text{where}\quad \varepsilon := \operatorname*{argmin}_{\tau}\,(p(-\tau) + p(\tau))^{-2}\Bigl(1 - \int_{-\tau}^{\tau} p(t)\, dt\Bigr) \qquad (63) $$

Fig. 5. Optimal ν and ε for various degrees of polynomial additive noise

For polynomial noise models, i.e. densities of type exp(−|ξ|^p), one may compute the corresponding (asymptotically) optimal values of ν. They are given in Fig. 5. For further details see (Schölkopf et al. 2000, Smola 1998); an experimental validation has been given by Chalimourda, Schölkopf and Smola (2000).

We conclude this section by noting that ν-SV regression is related to the idea of trimmed estimators. One can show that the regression is not influenced if we perturb points lying outside the tube. Thus, the regression is essentially computed by discarding a certain fraction of outliers, specified by ν, and computing the regression estimate from the remaining points (Schölkopf et al. 2000).

7. Regularization

So far we were not concerned about the specific properties of the map into feature space and used it only as a convenient trick to construct nonlinear regression functions. In some cases the map was just given implicitly by the kernel, hence the map itself and many of its properties have been neglected. A deeper understanding of the kernel map would also be useful to choose appropriate kernels for a specific task (e.g. by incorporating prior knowledge (Schölkopf et al. 1998a)). Finally the feature map seems to defy the curse of dimensionality (Bellman 1961) by making problems seemingly easier yet reliable via a map into some even higher dimensional space.

In this section we focus on the connections between SV methods and previous techniques like Regularization Networks (Girosi, Jones and Poggio 1993).⁹ In particular we will show that SV machines are essentially Regularization Networks (RN) with a clever choice of cost functions and that the kernels are Green's functions of the corresponding regularization operators. For a full exposition of the subject the reader is referred to Smola, Schölkopf and Müller (1998c).

7.1. Regularization networks

Let us briefly review the basic concepts of RNs. As in (35) we minimize a regularized risk functional. However, rather than enforcing flatness in feature space we try to optimize some smoothness criterion for the function in input space. Thus we get

$$ R_{\mathrm{reg}}[f] := R_{\mathrm{emp}}[f] + \frac{\lambda}{2}\|Pf\|^2. \qquad (64) $$

Here P denotes a regularization operator in the sense of Tikhonov and Arsenin (1977), i.e.
P is a positive semidefinite operator mapping from the Hilbert space H of functions f under consideration to a dot product space D such that the expression ⟨Pf · Pg⟩ is well defined for f, g ∈ H. For instance, by choosing a suitable operator that penalizes large variations of f one can reduce the well-known overfitting effect. Another possible setting also might be an operator P mapping from L₂(R^n) into some Reproducing Kernel Hilbert Space (RKHS) (Aronszajn 1950, Kimeldorf and Wahba 1971, Saitoh 1988, Schölkopf 1997, Girosi 1998).

Using an expansion of f in terms of some symmetric function k(x_i, x_j) (note here that k need not fulfill Mercer's condition and can be chosen arbitrarily since it is not used to define a regularization term),

$$ f(x) = \sum_{i=1}^{l} \alpha_i k(x_i, x) + b, \qquad (65) $$

and the ε-insensitive cost function, this leads to a quadratic programming problem similar to the one for SVs. Using

$$ D_{ij} := \langle (Pk)(x_i, .) \cdot (Pk)(x_j, .)\rangle \qquad (66) $$

we get α = D⁻¹K(β − β*), with β, β* being the solution of

minimize
$$ \frac{1}{2}(\beta^* - \beta)^\top K D^{-1} K (\beta^* - \beta) - (\beta^* - \beta)^\top y - \varepsilon \sum_{i=1}^{l} (\beta_i + \beta_i^*) \qquad (67) $$
subject to
$$ \sum_{i=1}^{l}(\beta_i - \beta_i^*) = 0 \quad\text{and}\quad \beta_i, \beta_i^* \in [0, C]. $$
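As a side illustration (not the ε-insensitive setting of (67)): in the special case of squared loss and D = K, so that ‖Pf‖² = α^⊤Kα, the regularized risk (64) has a closed-form minimizer. The sketch below computes it; the function names are hypothetical, the offset b is omitted, and the precise scaling of λ is left generic.

```python
import numpy as np

def rn_squared_loss_fit(K, y, lam):
    """Minimize sum_i (y_i - f(x_i))^2 + lam * alpha^T K alpha with
    f(x) = sum_i alpha_i k(x_i, x); the minimizer is alpha = (K + lam*I)^{-1} y."""
    l = len(y)
    return np.linalg.solve(K + lam * np.eye(l), y)

def rn_predict(K_test_train, alpha):
    """Evaluate f at new points via the expansion (65), with b = 0."""
    return K_test_train @ alpha
```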


Unfortunately, this setting of the problem does not preserve sparsity in terms of the coefficients, as a potentially sparse decomposition in terms of β_i and β_i^* is spoiled by D⁻¹K, which is not in general diagonal.

7.2. Green's functions

Comparing (10) with (67) leads to the question whether and under which condition the two methods might be equivalent, and therefore also under which conditions regularization networks might lead to sparse decompositions, i.e. only a few of the expansion coefficients α_i in f would differ from zero. A sufficient condition is D = K and thus KD⁻¹K = K (if K does not have full rank we only need that KD⁻¹K = K holds on the image of K):

$$ k(x_i, x_j) = \langle (Pk)(x_i, .) \cdot (Pk)(x_j, .)\rangle \qquad (68) $$

Our goal now is to solve the following two problems:

1. Given a regularization operator P, find a kernel k such that a SV machine using k will not only enforce flatness in feature space, but also correspond to minimizing a regularized risk functional with P as regularizer.
2. Given an SV kernel k, find a regularization operator P such that a SV machine using this kernel can be viewed as a Regularization Network using P.

These two problems can be solved by employing the concept of Green's functions as described in Girosi, Jones and Poggio (1993). These functions were introduced for the purpose of solving differential equations. In our context it is sufficient to know that the Green's functions G_{x_i}(x) of P*P satisfy

$$ (P^* P\, G_{x_i})(x) = \delta_{x_i}(x). \qquad (69) $$

Here, δ_{x_i}(x) is the δ-distribution (not to be confused with the Kronecker symbol δ_ij) which has the property that ⟨f · δ_{x_i}⟩ = f(x_i). The relationship between kernels and regularization operators is formalized in the following proposition:

Proposition 1 (Smola, Schölkopf and Müller 1998b). Let P be a regularization operator, and G be the Green's function of P*P. Then G is a Mercer kernel such that D = K. SV machines using G minimize the risk functional (64) with P as regularization operator.

In the following we will exploit this relationship in both ways: to compute Green's functions for a given regularization operator P and to infer the regularizer, given a kernel k.

7.3. Translation invariant kernels

Let us now more specifically consider regularization operators P̂ that may be written as multiplications in Fourier space,

$$ \langle Pf \cdot Pg\rangle = \frac{1}{(2\pi)^{n/2}} \int_{\Omega} \frac{\tilde{f}(\omega)\,\overline{\tilde{g}(\omega)}}{P(\omega)}\, d\omega \qquad (70) $$

with f̃(ω) denoting the Fourier transform of f(x), and P(ω) = P(−ω) real valued, nonnegative and converging to 0 for |ω| → ∞, and Ω := supp[P(ω)]. Small values of P(ω) correspond to a strong attenuation of the corresponding frequencies.
Hence small values of P(ω) for large ω are desirable, since high frequency components of f̃ correspond to rapid changes in f. P(ω) describes the filter properties of P*P. Note that no attenuation takes place for P(ω) = 0, as these frequencies have been excluded from the integration domain.

For regularization operators defined in Fourier space by (70) one can show by exploiting P(ω) = P(−ω) = P(ω)* that

$$ G(x_i, x) = \frac{1}{(2\pi)^{n/2}} \int_{\mathbb{R}^n} e^{i\omega(x_i - x)} P(\omega)\, d\omega \qquad (71) $$

is a corresponding Green's function satisfying translational invariance, i.e.

$$ G(x_i, x_j) = G(x_i - x_j) \quad\text{and}\quad \tilde{G}(\omega) = P(\omega). \qquad (72) $$

This provides us with an efficient tool for analyzing SV kernels and the types of capacity control they exhibit. In fact the above is a special case of Bochner's theorem (Bochner 1959) stating that the Fourier transform of a positive measure constitutes a positive Hilbert Schmidt kernel.

Example 2 (Gaussian kernels). Following the exposition of Yuille and Grzywacz (1988) as described in Girosi, Jones and Poggio (1993), one can see that for

$$ \|Pf\|^2 = \int dx \sum_{m} \frac{\sigma^{2m}}{m!\, 2^m} \bigl(\hat{O}^m f(x)\bigr)^2 \qquad (73) $$

with Ô^{2m} = Δ^m and Ô^{2m+1} = ∇Δ^m, Δ being the Laplacian and ∇ the gradient operator, we get Gaussian kernels (31). Moreover, we can provide an equivalent representation of P in terms of its Fourier properties, i.e. P(ω) = e^{−σ²‖ω‖²/2}, up to a multiplicative constant.

Training an SV machine with Gaussian RBF kernels (Schölkopf et al. 1997) corresponds to minimizing the specific cost function with a regularization operator of type (73). Recall that (73) means that all derivatives of f are penalized (we have a pseudodifferential operator) to obtain a very smooth estimate. This also explains the good performance of SV machines in this case, as it is by no means obvious that choosing a flat function in some high dimensional space will correspond to a simple function in low dimensional space, as shown in Smola, Schölkopf and Müller (1998c) for Dirichlet kernels.
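Equation (72) says the kernel's Fourier transform is the filter P(ω). A quick numerical check for the one-dimensional Gaussian kernel (31), using plain NumPy with an arbitrarily chosen grid, confirms that its spectrum is again a Gaussian proportional to exp(−σ²ω²/2), so high-frequency components are strongly damped.

```python
import numpy as np

sigma = 1.0
x = np.linspace(-20, 20, 4001)              # spatial grid
dx = x[1] - x[0]
g = np.exp(-x**2 / (2 * sigma**2))          # 1-D Gaussian kernel, eq. (31)

# Discrete approximation of the continuous Fourier transform of g.
omega = 2 * np.pi * np.fft.fftshift(np.fft.fftfreq(x.size, d=dx))
G = np.abs(np.fft.fftshift(np.fft.fft(g))) * dx

P = np.sqrt(2 * np.pi) * sigma * np.exp(-sigma**2 * omega**2 / 2)   # expected spectrum
print("max deviation:", np.max(np.abs(G - P)))                       # should be close to 0
```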


A <str<strong>on</strong>g>tutorial</str<strong>on</strong>g> <strong>on</strong> <strong>support</strong> <strong>vector</strong> regressi<strong>on</strong> 213we might want to choose a Gaussian kernel. If computingtime is important <strong>on</strong>e might moreover c<strong>on</strong>sider kernels withcompact <strong>support</strong>, e.g. using the B q –spline kernels (cf. (32)).This choice will cause many matrix elements k ij = k(x i −x j )to vanish.The usual scenario will be in between the two extreme cases andwe will have some limited prior knowledge available. For moreinformati<strong>on</strong> <strong>on</strong> using prior knowledge for choosing kernels (seeSchölkopf et al. 1998a).7.4. Capacity c<strong>on</strong>trolAll the reas<strong>on</strong>ing so far was based <strong>on</strong> the assumpti<strong>on</strong> that thereexist ways to determine model parameters like the regularizati<strong>on</strong>c<strong>on</strong>stant λ or length scales σ of rbf–kernels. The model selecti<strong>on</strong>issue itself would easily double the length of this reviewand moreover it is an area of active and rapidly moving research.Therefore we limit ourselves to a presentati<strong>on</strong> of the basic c<strong>on</strong>ceptsand refer the interested reader to the original publicati<strong>on</strong>s.It is important to keep in mind that there exist several fundamentallydifferent approaches such as Minimum Descripti<strong>on</strong>Length (cf. e.g. Rissanen 1978, Li and Vitányi 1993) which isbased <strong>on</strong> the idea that the simplicity of an estimate, and thereforealso its plausibility is based <strong>on</strong> the informati<strong>on</strong> (number of bits)needed to encode it such that it can be rec<strong>on</strong>structed.Bayesian estimati<strong>on</strong>, <strong>on</strong> the other hand, c<strong>on</strong>siders the posteriorprobability of an estimate, given the observati<strong>on</strong>s X ={(x 1 , y 1 ),...(x l , y l )}, anobservati<strong>on</strong> noise model, and a priorprobability distributi<strong>on</strong> p( f ) over the space of estimates(parameters). It is given by Bayes Rule p( f | X)p(X) =p(X | f )p( f ). Since p(X) does not depend <strong>on</strong> f , <strong>on</strong>e can maximizep(X | f )p( f )toobtain the so-called MAP estimate. 10 Asarule of thumb, to translate regularized risk functi<strong>on</strong>als intoBayesian MAP estimati<strong>on</strong> schemes, all <strong>on</strong>e has to do is to c<strong>on</strong>siderexp(−R reg [ f ]) = p( f | X). For a more detailed discussi<strong>on</strong>(see e.g. Kimeldorf and Wahba 1970, MacKay 1991, Neal 1996,Rasmussen 1996, Williams 1998).A simple yet powerful way of model selecti<strong>on</strong> is cross validati<strong>on</strong>.This is based <strong>on</strong> the idea that the expectati<strong>on</strong> of the error<strong>on</strong> a subset of the training sample not used during training isidentical to the expected error itself. There exist several strategiessuch as 10-fold crossvalidati<strong>on</strong>, leave-<strong>on</strong>e out error (l-foldcrossvalidati<strong>on</strong>), bootstrap and derived algorithms to estimatethe crossvalidati<strong>on</strong> error itself (see e.g. 
St<strong>on</strong>e 1974, Wahba 1980,Efr<strong>on</strong> 1982, Efr<strong>on</strong> and Tibshirani 1994, Wahba 1999, Jaakkolaand Haussler 1999) for further details.Finally, <strong>on</strong>e may also use uniform c<strong>on</strong>vergence bounds suchas the <strong>on</strong>es introduced by Vapnik and Cherv<strong>on</strong>enkis (1971). Thebasic idea is that <strong>on</strong>e may bound with probability 1 − η (withη>0) the expected risk R[ f ]byR emp [ f ] + (F,η), where is a c<strong>on</strong>fidence term depending <strong>on</strong> the class of functi<strong>on</strong>s F.Several criteria for measuring the capacity of F exist, such as theVC-Dimensi<strong>on</strong> which, in pattern recogniti<strong>on</strong> problems, is givenby the maximum number of points that can be separated by thefuncti<strong>on</strong> class in all possible ways, the Covering Number whichis the number of elements from F that are needed to cover F withaccuracy of at least ε, Entropy Numbers which are the functi<strong>on</strong>alinverse of Covering Numbers, and many more variants thereof(see e.g. Vapnik 1982, 1998, Devroye, Györfi and Lugosi 1996,Williams<strong>on</strong>, Smola and Schölkopf 1998, Shawe-Taylor et al.1998).8. C<strong>on</strong>clusi<strong>on</strong>Due to the already quite large body of work d<strong>on</strong>e in the field ofSV research it is impossible to write a <str<strong>on</strong>g>tutorial</str<strong>on</strong>g> <strong>on</strong> SV regressi<strong>on</strong>which includes all c<strong>on</strong>tributi<strong>on</strong>s to this field. This also wouldbe quite out of the scope of a <str<strong>on</strong>g>tutorial</str<strong>on</strong>g> and rather be relegated totextbooks <strong>on</strong> the matter (see Schölkopf and Smola (2002) for acomprehensive overview, Schölkopf, Burges and Smola (1999a)for a snapshot of the current state of the art, Vapnik (1998) for anoverview <strong>on</strong> statistical learning theory, or Cristianini and Shawe-Taylor (2000) for an introductory textbook). Still the authorshope that this work provides a not overly biased view of the stateof the art in SV regressi<strong>on</strong> research. We deliberately omitted(am<strong>on</strong>g others) the following topics.8.1. Missing topicsMathematical programming: Starting from a completely differentperspective algorithms have been developed that are similarin their ideas to SV machines. A good primer mightbe (Bradley, Fayyad and Mangasarian 1998). (Also seeMangasarian 1965, 1969, Street and Mangasarian 1995). Acomprehensive discussi<strong>on</strong> of c<strong>on</strong>necti<strong>on</strong>s between mathematicalprogramming and SV machines has been given by(Bennett 1999).Density estimati<strong>on</strong>: with SV machines (West<strong>on</strong> et al. 1999,Vapnik 1999). There <strong>on</strong>e makes use of the fact that the cumulativedistributi<strong>on</strong> functi<strong>on</strong> is m<strong>on</strong>ot<strong>on</strong>ically increasing,and that its values can be predicted with variable c<strong>on</strong>fidencewhich is adjusted by selecting different values of ε in the lossfuncti<strong>on</strong>.Dicti<strong>on</strong>aries: were originally introduced in the c<strong>on</strong>text ofwavelets by (Chen, D<strong>on</strong>oho and Saunders 1999) to allowfor a large class of basis functi<strong>on</strong>s to be c<strong>on</strong>sidered simultaneously,e.g. kernels with different widths. 
In the standard SVcase this is hardly possible except by defining new kernels aslinear combinati<strong>on</strong>s of differently scaled <strong>on</strong>es: choosing theregularizati<strong>on</strong> operator already determines the kernel completely(Kimeldorf and Wahba 1971, Cox and O’Sullivan1990, Schölkopf et al. 2000). Hence <strong>on</strong>e has to resort to linearprogramming (West<strong>on</strong> et al. 1999).Applicati<strong>on</strong>s: The focus of this review was <strong>on</strong> methods andtheory rather than <strong>on</strong> applicati<strong>on</strong>s. This was d<strong>on</strong>e to limitthe size of the expositi<strong>on</strong>. State of the art, or even recordperformance was reported in Müller et al. (1997), Druckeret al. (1997), Stits<strong>on</strong> et al. (1999) and Mattera and Haykin(1999).


214 Smola and SchölkopfIn many cases, it may be possible to achieve similar performancewith neural network methods, however, <strong>on</strong>ly ifmany parameters are optimally tuned by hand, thus dependinglargely <strong>on</strong> the skill of the experimenter. Certainly, SVmachines are not a “silver bullet.” However, as they have<strong>on</strong>ly few critical parameters (e.g. regularizati<strong>on</strong> and kernelwidth), state-of-the-art results can be achieved with relativelylittle effort.8.2. Open issuesBeing a very active field there exist still a number of open issuesthat have to be addressed by future research. After thatthe algorithmic development seems to have found a more stablestage, <strong>on</strong>e of the most important <strong>on</strong>es seems to be to findtight error bounds derived from the specific properties of kernelfuncti<strong>on</strong>s. It will be of interest in this c<strong>on</strong>text, whetherSV machines, or similar approaches stemming from a linearprogramming regularizer, will lead to more satisfactoryresults.Moreover some sort of “luckiness framework” (Shawe-Tayloret al. 1998) for multiple model selecti<strong>on</strong> parameters, similar tomultiple hyperparameters and automatic relevance detecti<strong>on</strong> inBayesian statistics (MacKay 1991, Bishop 1995), will have tobe devised to make SV machines less dependent <strong>on</strong> the skill ofthe experimenter.It is also worth while to exploit the bridge between regularizati<strong>on</strong>operators, Gaussian processes and priors (see e.g. (Williams1998)) to state Bayesian risk bounds for SV machines in orderto compare the predicti<strong>on</strong>s with the <strong>on</strong>es from VC theory. Optimizati<strong>on</strong>techniques developed in the c<strong>on</strong>text of SV machinesalso could be used to deal with large datasets in the Gaussianprocess settings.Prior knowledge appears to be another important questi<strong>on</strong> inSV regressi<strong>on</strong>. Whilst invariances could be included in patternrecogniti<strong>on</strong> in a principled way via the virtual SV mechanismand restricti<strong>on</strong> of the feature space (Burges and Schölkopf 1997,Schölkopf et al. 1998a), it is still not clear how (probably) moresubtle properties, as required for regressi<strong>on</strong>, could be dealt withefficiently.Reduced set methods also should be c<strong>on</strong>sidered for speedingup predicti<strong>on</strong> (and possibly also training) phase for large datasets(Burges and Schölkopf 1997, Osuna and Girosi 1999, Schölkopfet al. 1999b, Smola and Schölkopf 2000). 
This topic is of greatimportance as data mining applicati<strong>on</strong>s require algorithms thatare able to deal with databases that are often at least <strong>on</strong>e order ofmagnitude larger (1 milli<strong>on</strong> samples) than the current practicalsize for SV regressi<strong>on</strong>.Many more aspects such as more data dependent generalizati<strong>on</strong>bounds, efficient training algorithms, automatic kernel selecti<strong>on</strong>procedures, and many techniques that already have madetheir way into the standard neural networks toolkit, will have tobe c<strong>on</strong>sidered in the future.Readers who are tempted to embark up<strong>on</strong> a more detailedexplorati<strong>on</strong> of these topics, and to c<strong>on</strong>tribute their own ideas tothis exciting field, may find it useful to c<strong>on</strong>sult the web pagewww.kernel-machines.org.Appendix A: Solving the interior-pointequati<strong>on</strong>sA.1. Path followingRather than trying to satisfy (53) directly we will solve a modifiedversi<strong>on</strong> thereof for some µ>0 substituted <strong>on</strong> the rhs in the firstplace and decrease µ while iterating.g i z i = µ, s i t i = µ for all i ∈ [1 ...n]. (74)Still it is rather difficult to solve the n<strong>on</strong>linear system of equati<strong>on</strong>s(51), (52), and (74) exactly. However we are not interestedin obtaining the exact soluti<strong>on</strong> to the approximati<strong>on</strong> (74). Instead,we seek a somewhat more feasible soluti<strong>on</strong> for a given µ,then decrease µ and repeat. This can be d<strong>on</strong>e by linearizing theabove system and solving the resulting equati<strong>on</strong>s by a predictor–corrector approach until the duality gap is small enough. Theadvantage is that we will get approximately equal performanceas by trying to solve the quadratic system directly, provided thatthe terms in 2 are small enough.A(α + α) = bα + α − g − g = lα + α + t + t = uc + 1 2 ∂ αq(α) + 1 2 ∂2 αq(α)α − (A(y + y))⊤+ s + s = z + z(g i + g i )(z i + z i ) = µ(s i + s i )(t i + t i ) = µSolving for the variables in we getAα = b − Aα =: ρα − g = l − α + g =: να + t = u − α − t =: τ(Ay) ⊤ + z − s − 1 2 ∂2 α q(α)α= c − (Ay) ⊤ + s − z + 1 2 ∂ αq(α) =: σg −1 zg + z = µg −1 − z − g −1 gz =: γ zt −1 st + s = µt −1 − s − t −1 ts =: γ swhere g −1 denotes the <strong>vector</strong> (1/g 1 ,...,1/g n ), and t analogously.Moreover denote g −1 z and t −1 s the <strong>vector</strong> generatedby the comp<strong>on</strong>entwise product of the two <strong>vector</strong>s. Solving for


A <str<strong>on</strong>g>tutorial</str<strong>on</strong>g> <strong>on</strong> <strong>support</strong> <strong>vector</strong> regressi<strong>on</strong> 215g,t,z,s we getg = z −1 g(γ z − z) z = g −1 z(ˆν − α)t = s −1 t(γ s − s) s = t −1 s(α − ˆτ)(75)where ˆν := ν − z −1 gγ zˆτ := τ − s −1 tγ sNow we can formulate the reduced KKT–system (see (Vanderbei1994) for the quadratic case):[ −H A⊤][ ] [ α σ − g −1 z ˆν − t −1 ]s ˆτ=(76)A 0 yρwhere H := ( 1 2 ∂2 α q(α) + g−1 z + t −1 s).A.2. Iterati<strong>on</strong> strategiesFor the predictor-corrector method we proceed as follows. Inthe predictor step solve the system of (75) and (76) with µ = 0and all -terms <strong>on</strong> the rhs set to 0, i.e. γ z = z,γ s = s. Thevalues in are substituted back into the definiti<strong>on</strong>s for γ z andγ s and (75) and (76) are solved again in the corrector step. As thequadratic part in (76) is not affected by the predictor–correctorsteps, we <strong>on</strong>ly need to invert the quadratic matrix <strong>on</strong>ce. This isd<strong>on</strong>e best by manually pivoting for the H part, as it is positivedefinite.Next the values in obtained by such an iterati<strong>on</strong> step are usedto update the corresp<strong>on</strong>ding values in α, s, t, z,....Toensurethat the variables meet the positivity c<strong>on</strong>straints, the steplengthξ is chosen such that the variables move at most 1 − ε of theirinitial distance to the boundaries of the positive orthant. Usually(Vanderbei 1994) <strong>on</strong>e sets ε = 0.05.Another heuristic is used for computing µ, the parameter determininghow much the KKT-c<strong>on</strong>diti<strong>on</strong>s should be enforced.Obviously it is our aim to reduce µ as fast as possible, howeverif we happen to choose it too small, the c<strong>on</strong>diti<strong>on</strong> of the equati<strong>on</strong>swill worsen drastically. A setting that has proven to workrobustly isµ =〈g, z〉+〈s, t〉2n( ) ξ − 1 2. (77)ξ + 10The rati<strong>on</strong>ale behind (77) is to use the average of the satisfacti<strong>on</strong>of the KKT c<strong>on</strong>diti<strong>on</strong>s (74) as point of reference and thendecrease µ rapidly if we are far enough away from the boundariesof the positive orthant, to which all variables (except y) arec<strong>on</strong>strained to.Finally <strong>on</strong>e has to come up with good initial values. Analogouslyto Vanderbei (1994) we choose a regularized versi<strong>on</strong> of(76) in order to determine the initial c<strong>on</strong>diti<strong>on</strong>s. One solves[ (−12 ∂2 α q(α) + 1) ] [A ⊤ ] [ ] α c=(78)A 1 y band subsequently restricts the soluti<strong>on</strong> to a feasible set( )ux = max x,100g = min(α − l, u)t = min(u − α, u) (79)( ( )1z = min 2 ∂ αq(α) + c − (Ay) ⊤ + u )100 , u(s = min (− 1 )2 ∂ αq(α) − c + (Ay) ⊤ + u )100 , u(·) denotes the Heavyside functi<strong>on</strong>, i.e. (x) = 1 for x > 0and (x) = 0 otherwise.A.3. Special c<strong>on</strong>siderati<strong>on</strong>s for SV regressi<strong>on</strong>The algorithm described so far can be applied to both SV patternrecogniti<strong>on</strong> and regressi<strong>on</strong> estimati<strong>on</strong>. 
For the standard settingin pattern recogniti<strong>on</strong> we havel∑q(α) = α i α j y i y j k(x i , x j ) (80)i, j=0and c<strong>on</strong>sequently ∂ αi q(α) = 0, ∂α 2 i α jq(α) = y i y j k(x i , x j ), i.e.the Hessian is dense and the <strong>on</strong>ly thing we can do is computeits Cholesky factorizati<strong>on</strong> to compute (76). In the case of SV regressi<strong>on</strong>,however we have (with α := (α 1 ,...,α l ,α1 ∗,...,α∗ l ))l∑q(α) = (α i − αi ∗ )(α j − α ∗ j )k(x i, x j )and thereforei, j=1+ 2C∂ αi q(α) =l∑T (α i ) + T (αi ∗ ) (81)i=1ddα iT (α i )∂ 2 α i α jq(α) = k(x i , x j ) + δ ijd 2∂ 2 α i α ∗ j q(α) =−k(x i, x j )dα 2 iT (α i ) (82)and ∂α 2 q(α), ∂ 2 i ∗α∗ j αi ∗α q(α) analogously. Hence we are dealing withja matrix of type M := [ K +D −K−K K+D ′ ]where D, D′ are diag<strong>on</strong>almatrices. By applying an orthog<strong>on</strong>al transformati<strong>on</strong> M can beinverted essentially by inverting an l × l matrix instead of a2l × 2l system. This is exactly the additi<strong>on</strong>al advantage <strong>on</strong>ecan gain from implementing the optimizati<strong>on</strong> algorithm directlyinstead of using a general purpose optimizer. One can show thatfor practical implementati<strong>on</strong>s (Smola, Schölkopf and Müller1998b) <strong>on</strong>e can solve optimizati<strong>on</strong> problems using nearly arbitraryc<strong>on</strong>vex cost functi<strong>on</strong>s as efficiently as the special case ofε-insensitive loss functi<strong>on</strong>s.Finally note that due to the fact that we are solving the primaland dual optimizati<strong>on</strong> problem simultaneously we are also


216 Smola and Schölkopfcomputing parameters corresp<strong>on</strong>ding to the initial SV optimizati<strong>on</strong>problem. This observati<strong>on</strong> is useful as it allows us to obtainthe c<strong>on</strong>stant term b directly, namely by setting b = y. (see Smola(1998) for details).at sample x i , i.e.[ m∑ϕ i := y i − f (x i ) = y i − k(x i , x j )(α i − αi ]. ∗ ) + b (84)j=1Appendix B: Solving the subset selecti<strong>on</strong>problemB.1. Subset optimizati<strong>on</strong> problemWe will adapt the expositi<strong>on</strong> of Joachims (1999) to the case ofregressi<strong>on</strong> with c<strong>on</strong>vex cost functi<strong>on</strong>s. Without loss of generalitywe will assume ε ≠ 0 and α ∈ [0, C] (the other situati<strong>on</strong>scan be treated as a special case). First we will extract a reducedoptimizati<strong>on</strong> problem for the working set when all other variablesare kept fixed. Denote S w ⊂{1,...,l} the working setand S f :={1,...,l}\S w the fixed set. Writing (43) as an optimizati<strong>on</strong>problem <strong>on</strong>ly in terms of S w yieldsmaximizesubject to⎧− 1 ∑(α i − αi ∗ 2)(α j − α ∗ j )〈x i, x j 〉i, j∈S w⎪⎨+ ∑ (α i − αi (y ∗ ) i − ∑ )(α j − α ∗ j )〈x i, x j 〉i∈S w j∈S f⎪⎩+ ∑ i∈S w(−ε(α i + α ∗ i ) + C(T (α i) + T (α ∗ i )))⎧ ∑⎨ (α i − αi ∗ ) =−∑ (α i − αi ∗ )i∈S w i∈S f⎩α i ∈ [0, C](83)Hence we <strong>on</strong>ly have to update the linear term by the couplingwith the fixed set − ∑ i∈S w(α i −αi ∗) ∑ j∈S f(α j −α ∗ j )〈x i, x j 〉 andthe equality c<strong>on</strong>straint by − ∑ i∈S f(α i − αi ∗ ). It is easy to seethat maximizing (83) also decreases (43) by exactly the sameamount. If we choose variables for which the KKT–c<strong>on</strong>diti<strong>on</strong>sare not satisfied the overall objective functi<strong>on</strong> tends to decreasewhilst still keeping all variables feasible. Finally it is boundedfrom below.Even though this does not prove c<strong>on</strong>vergence (c<strong>on</strong>trary tostatement in Osuna, Freund and Girosi (1997)) this algorithmproves very useful in practice. It is <strong>on</strong>e of the few methods (besides(Kaufman 1999, Platt 1999)) that can deal with problemswhose quadratic part does not completely fit into memory. Stillin practice <strong>on</strong>e has to take special precauti<strong>on</strong>s to avoid stallingof c<strong>on</strong>vergence (recent results of Chang, Hsu and Lin (1999)indicate that under certain c<strong>on</strong>diti<strong>on</strong>s a proof of c<strong>on</strong>vergence ispossible). The crucial part is the <strong>on</strong>e of S w .B.2. A note <strong>on</strong> optimalityFor c<strong>on</strong>venience the KKT c<strong>on</strong>diti<strong>on</strong>s are repeated in a slightlymodified form. Denote ϕ i the error made by the current estimateRewriting the feasibility c<strong>on</strong>diti<strong>on</strong>s (52) in terms of α yields2∂ αi T (α i ) + ε − ϕ i + s i − z i = 02∂ α ∗iT (α ∗ i ) + ε + ϕ i + s ∗ i − z ∗ i = 0for all i ∈ {1,...,m} with z i , zi ∗, s i, si∗feasible variables z, s is given by(85)≥ 0. 
A set of dualz i = max ( 2∂ αi T (α i ) + ε − ϕ i , 0 )s i =−min ( 2∂ αi T (α i ) + ε − ϕ i , 0 ) (86)zi ∗ = max ( 2∂ α ∗iT (αi ∗ ) + ε + ϕ i, 0 )si ∗ =−min ( 2∂ α ∗iT (αi ∗ ) + ε + ϕ i, 0 )C<strong>on</strong>sequently the KKT c<strong>on</strong>diti<strong>on</strong>s (53) can be translated intoα i z i = 0 and (C − α i )s i = 0α ∗ i z∗ i = 0 and (C − α ∗ i )s∗ i = 0(87)All variables α i ,α ∗ iviolating some of the c<strong>on</strong>diti<strong>on</strong>s of (87) maybe selected for further optimizati<strong>on</strong>. In most cases, especially inthe initial stage of the optimizati<strong>on</strong> algorithm, this set of patternsis much larger than any practical size of S w . UnfortunatelyOsuna, Freund and Girosi (1997) c<strong>on</strong>tains little informati<strong>on</strong> <strong>on</strong>how to select S w . The heuristics presented here are an adaptati<strong>on</strong>of Joachims (1999) to regressi<strong>on</strong>. See also Lin (2001) for details<strong>on</strong> optimizati<strong>on</strong> for SVR.B.3. Selecti<strong>on</strong> rulesSimilarly to a merit functi<strong>on</strong> approach (El-Bakry et al. 1996) theidea is to select those variables that violate (85) and (87) most,thus c<strong>on</strong>tribute most to the feasibility gap. Hence <strong>on</strong>e defines ascore variable ζ i byζ i := g i z i + s i t i= α i z i + α ∗ i z∗ i + (C − α i )s i + (C − α ∗ i )s∗ i (88)By c<strong>on</strong>structi<strong>on</strong>, ∑ i ζ i is the size of the feasibility gap (cf. (56)for the case of ε-insensitive loss). By decreasing this gap, <strong>on</strong>eapproaches the the soluti<strong>on</strong> (upper bounded by the primal objectiveand lower bounded by the dual objective functi<strong>on</strong>). Hence,the selecti<strong>on</strong> rule is to choose those patterns for which ζ i is


A <str<strong>on</strong>g>tutorial</str<strong>on</strong>g> <strong>on</strong> <strong>support</strong> <strong>vector</strong> regressi<strong>on</strong> 217C.2. Analytic soluti<strong>on</strong> for regressi<strong>on</strong>i j)ilargest. Some algorithms useHζ i ′ := α i (z i ) + αi ∗ (z∗ i )Next <strong>on</strong>e has to solve the optimizati<strong>on</strong> problem analytically. Wemake use of (84) and substitute the values of φ i into the reducedor+ (C − α i )(s i ) + (C − αi ∗ )(s i)optimizati<strong>on</strong> problem (83). In particular we use(89)ζ i′′ := (α i )z i + (αi ∗ )z∗ iy i − ∑ (α i − α+ (C − α i )s i + (C − αi ∗ )s i ∗ )K ij = ϕ i + b + ∑ (αoldi − α ∗ old )i K ij .i.j∉S w j∈S wOne can see that ζ i = 0, ζi ′ = 0, and ζi′′ = 0 mutually imply each(91)other. However, <strong>on</strong>ly ζ i gives a measure for the c<strong>on</strong>tributi<strong>on</strong> ofthe variable i to the size of the feasibility gap.Finally, note that heuristics like assigning sticky–flags (cf.Burges 1998) to variables at the boundaries, thus effectivelysolving smaller subproblems, or completely removingthe corresp<strong>on</strong>ding patterns from the training set while accountingfor their couplings (Joachims 1999) can significantlydecrease the size of the problem <strong>on</strong>e has to solve andmaximize − 1 2 (α i − αi ∗ )2 η − ε(α i + αi ∗ )(1 − s)thus result in a noticeable speedup. Also caching (Joachims1999, Kowalczyk 2000) of already computed entries of the+ (α i − αi ∗ )( φ i − φ j + η ( αiold − α ∗ old ))i (92)dot product matrix may have a significant impact <strong>on</strong> the subject to α (∗)i∈ [L (∗) , H (∗) ].performance.The unc<strong>on</strong>strained maximum of (92) with respect to α i or αi∗can be found below.Appendix C: Solving the SMO equati<strong>on</strong>sC.1. Pattern dependent regularizati<strong>on</strong>(I) α i ,α j α oldi+ η −1 (ϕ i − ϕ j )(II) α i ,α ∗ jα oldi+ η −1 (ϕ i − ϕ j − 2ε)C<strong>on</strong>sider the c<strong>on</strong>strained optimizati<strong>on</strong> problem (83) for two indices,say (i, j). Pattern dependent regularizati<strong>on</strong> means that C j α ∗old i(III) αi ∗ − η −1 (ϕ i − ϕ j + 2ε)i (IV) αi ∗may be different for every pattern (possibly even different forjα ∗old i− η −1 (ϕ i − ϕ j )α i and αi ∗ ). Since at most two variables may become n<strong>on</strong>zeroat the same time and moreover we are dealing with a c<strong>on</strong>strainedoptimizati<strong>on</strong> problem we may express everything in four quadrants (I)–(IV) c<strong>on</strong>tains the soluti<strong>on</strong>. However, by c<strong>on</strong>-The problem is that we do not know beforehand which of theterms of just <strong>on</strong>e variable. From the summati<strong>on</strong> c<strong>on</strong>straint we sidering the sign of γ we can distinguish two cases: for γ>0obtain<strong>on</strong>ly (I)–(III) are possible, for γ 0itisbest to start with quadrant (I), test whether theunc<strong>on</strong>strained soluti<strong>on</strong> hits <strong>on</strong>e of the boundaries L, H and if so,probe the corresp<strong>on</strong>ding adjacent quadrant (II) or (III). γ < 0can be dealt with analogously.Due to numerical instabilities, it may happen that η


218 Smola and SchölkopfC.3. Selecti<strong>on</strong> rule for regressi<strong>on</strong>Finally, <strong>on</strong>e has to pick indices (i, j) such that the objectivefuncti<strong>on</strong> is maximized. Again, the reas<strong>on</strong>ing of SMO (Platt 1999,Secti<strong>on</strong> 12.2.2) for classificati<strong>on</strong> will be mimicked. This meansthat a two loop approach is chosen to maximize the objectivefuncti<strong>on</strong>. The outer loop iterates over all patterns violating theKKT c<strong>on</strong>diti<strong>on</strong>s, first <strong>on</strong>ly over those with Lagrange multipliersneither <strong>on</strong> the upper nor lower boundary, and <strong>on</strong>ce all of themare satisfied, over all patterns violating the KKT c<strong>on</strong>diti<strong>on</strong>s, toensure self c<strong>on</strong>sistency <strong>on</strong> the complete dataset. 12 This solvesthe problem of choosing i.Now for j: Tomake a large step towards the minimum, <strong>on</strong>elooks for large steps in α i .Asitiscomputati<strong>on</strong>ally expensive tocompute η for all possible pairs (i, j) <strong>on</strong>e chooses the heuristic tomaximize the absolute value of the numerator in the expressi<strong>on</strong>sfor α i and αi ∗, i.e. |ϕ i − ϕ j | and |ϕ i − ϕ j ± 2ε|. The index jcorresp<strong>on</strong>ding to the maximum absolute value is chosen for thispurpose.If this heuristic happens to fail, in other words if little progressis made by this choice, all other indices j are looked at (this iswhat is called “sec<strong>on</strong>d choice hierarcy” in Platt (1999) in thefollowing way:1. All indices j corresp<strong>on</strong>ding to n<strong>on</strong>–bound examples arelooked at, searching for an example to make progress <strong>on</strong>.2. In the case that the first heuristic was unsuccessful, allother samples are analyzed until an example is found whereprogress can be made.3. If both previous steps fail proceed to the next i.For amore detailed discussi<strong>on</strong> (see Platt 1999). Unlike interiorpoint algorithms SMO does not automatically provide a valuefor b.However this can be chosen like in Secti<strong>on</strong> 1.4 by havinga close look at the Lagrange multipliers α (∗)iobtained.C.4. Stopping criteriaBy essentially minimizing a c<strong>on</strong>strained primal optimizati<strong>on</strong>problem <strong>on</strong>e cannot ensure that the dual objective functi<strong>on</strong> increaseswith every iterati<strong>on</strong> step. 13 Nevertheless <strong>on</strong>e knows thatthe minimum value of the objective functi<strong>on</strong> lies in the interval[dual objective i , primal objective i ] for all steps i, hence also inthe interval [(max j≤i dual objective j ), primal objective i ]. Oneuses the latter to determine the quality of the current soluti<strong>on</strong>.The calculati<strong>on</strong> of the primal objective functi<strong>on</strong> from the predicti<strong>on</strong>errors is straightforward. One uses∑(α i − αi ∗ )(α j − α ∗ j )k ij =− ∑ (α i − αi ∗ )(ϕ i + y i − b),i, ji(93)i.e. the definiti<strong>on</strong> of ϕ i to avoid the matrix–<strong>vector</strong> multiplicati<strong>on</strong>with the dot product matrix.AcknowledgmentsThis work has been <strong>support</strong>ed in part by a grant of the DFG(Ja 379/71, Sm 62/1). 
The authors thank Peter Bartlett, ChrisBurges, Stefan Harmeling, Olvi Mangasarian, Klaus-RobertMüller, Vladimir Vapnik, Jas<strong>on</strong> West<strong>on</strong>, Robert Williams<strong>on</strong>, andAndreas Ziehe for helpful discussi<strong>on</strong>s and comments.Notes1. Our use of the term ‘regressi<strong>on</strong>’ is somewhat lose in that it also includescases of functi<strong>on</strong> estimati<strong>on</strong> where <strong>on</strong>e minimizes errors other than the meansquare loss. This is d<strong>on</strong>e mainly for historical reas<strong>on</strong>s (Vapnik, Golowichand Smola 1997).2. A similar approach, however using linear instead of quadratic programming,was taken at the same time in the USA, mainly by Mangasarian (1965, 1968,1969).3. See Smola (1998) for an overview over other ways of specifying flatness ofsuch functi<strong>on</strong>s.4. This is true as l<strong>on</strong>g as the dimensi<strong>on</strong>ality of w is much higher than thenumber of observati<strong>on</strong>s. If this is not the case, specialized methods canoffer c<strong>on</strong>siderable computati<strong>on</strong>al savings (Lee and Mangasarian 2001).5. The table displays CT(α) instead of T (α) since the former can be pluggeddirectly into the corresp<strong>on</strong>ding optimizati<strong>on</strong> equati<strong>on</strong>s.6. The high price tag usually is the major deterrent for not using them. Moreover<strong>on</strong>e has to bear in mind that in SV regressi<strong>on</strong>, <strong>on</strong>e may speed up the soluti<strong>on</strong>c<strong>on</strong>siderably by exploiting the fact that the quadratic form has a specialstructure or that there may exist rank degeneracies in the kernel matrixitself.7. For large and noisy problems (e.g. 100.000 patterns and more with a substantialfracti<strong>on</strong> of n<strong>on</strong>bound Lagrange multipliers) it is impossible to solve theproblem exactly: due to the size <strong>on</strong>e has to use subset selecti<strong>on</strong> algorithms,hence joint optimizati<strong>on</strong> over the training set is impossible. However, unlikein Neural Networks, we can determine the closeness to the optimum. Notethat this reas<strong>on</strong>ing <strong>on</strong>ly holds for c<strong>on</strong>vex cost functi<strong>on</strong>s.8. A similar technique was employed by Bradley and Mangasarian (1998) inthe c<strong>on</strong>text of linear programming in order to deal with large datasets.9. Due to length c<strong>on</strong>straints we will not deal with the c<strong>on</strong>necti<strong>on</strong> betweenGaussian Processes and SVMs. See Williams (1998) for an excellentoverview.10. Strictly speaking, in Bayesian estimati<strong>on</strong> <strong>on</strong>e is not so much c<strong>on</strong>cerned aboutthe maximizer fˆof p( f | X) but rather about the posterior distributi<strong>on</strong> off .11. Negative values of η are theoretically impossible since k satisfies Mercer’sc<strong>on</strong>diti<strong>on</strong>: 0 ≤‖(x i ) − (x j )‖ 2 = K ii + K jj − 2K ij = η.12. It is sometimes useful, especially when dealing with noisy data, to iterateover the complete KKT violating dataset already before complete self c<strong>on</strong>sistency<strong>on</strong> the subset has been achieved. Otherwise much computati<strong>on</strong>alresources are spent <strong>on</strong> making subsets self c<strong>on</strong>sistent that are not globallyself c<strong>on</strong>sistent. 
This is the reas<strong>on</strong> why in the pseudo code a global loopis initiated already when <strong>on</strong>ly less than 10% of the n<strong>on</strong> bound variableschanged.13. It is still an open questi<strong>on</strong> how a subset selecti<strong>on</strong> optimizati<strong>on</strong> algorithmcould be devised that decreases both primal and dual objective functi<strong>on</strong>at the same time. The problem is that this usually involves a number ofdual variables of the order of the sample size, which makes this attemptunpractical.ReferencesAizerman M.A., Braverman É.M., and Roz<strong>on</strong>oér L.I. 1964. Theoreticalfoundati<strong>on</strong>s of the potential functi<strong>on</strong> method in pattern recogniti<strong>on</strong>learning. Automati<strong>on</strong> and Remote C<strong>on</strong>trol 25: 821–837.Ar<strong>on</strong>szajn N. 1950. Theory of reproducing kernels. Transacti<strong>on</strong>s of theAmerican Mathematical Society 68: 337–404.


A <str<strong>on</strong>g>tutorial</str<strong>on</strong>g> <strong>on</strong> <strong>support</strong> <strong>vector</strong> regressi<strong>on</strong> 219Bazaraa M.S., Sherali H.D., and Shetty C.M. 1993. N<strong>on</strong>linear Programming:Theory and Algorithms, 2nd editi<strong>on</strong>, Wiley.Bellman R.E. 1961. Adaptive C<strong>on</strong>trol Processes. Princet<strong>on</strong> UniversityPress, Princet<strong>on</strong>, NJ.Bennett K. 1999. Combining <strong>support</strong> <strong>vector</strong> and mathematical programmingmethods for inducti<strong>on</strong>. In: Schölkopf B., Burges C.J.C., andSmola A.J., (Eds.), Advances in Kernel Methods—SV Learning,MIT Press, Cambridge, MA, pp. 307–326.Bennett K.P. and Mangasarian O.L. 1992. Robust linear programmingdiscriminati<strong>on</strong> of two linearly inseparable sets. Optimizati<strong>on</strong>Methods and Software 1: 23–34.Berg C., Christensen J.P.R., and Ressel P. 1984. Harm<strong>on</strong>ic Analysis <strong>on</strong>Semigroups. Springer, New York.Bertsekas D.P. 1995. N<strong>on</strong>linear Programming. Athena Scientific,Belm<strong>on</strong>t, MA.Bishop C.M. 1995. Neural Networks for Pattern Recogniti<strong>on</strong>.Clarend<strong>on</strong> Press, Oxford.Blanz V., Schölkopf B., Bülthoff H., Burges C., Vapnik V., and VetterT. 1996. Comparis<strong>on</strong> of view-based object recogniti<strong>on</strong> algorithmsusing realistic 3D models. In: v<strong>on</strong> der Malsburg C., v<strong>on</strong> SeelenW.,Vorbrüggen J.C., and Sendhoff B. (Eds.), Artificial NeuralNetworks ICANN’96, Berlin. Springer Lecture Notes in ComputerScience, Vol. 1112, pp. 251–256.Bochner S. 1959. Lectures <strong>on</strong> Fourier integral. Princet<strong>on</strong> Univ. Press,Princet<strong>on</strong>, New Jersey.Boser B.E., Guy<strong>on</strong> I.M., and Vapnik V.N. 1992. A training algorithm foroptimal margin classifiers. In: Haussler D. (Ed.), Proceedings ofthe Annual C<strong>on</strong>ference <strong>on</strong> Computati<strong>on</strong>al Learning Theory. ACMPress, Pittsburgh, PA, pp. 144–152.Bradley P.S., Fayyad U.M., and Mangasarian O.L. 1998. Data mining:Overview and optimizati<strong>on</strong> opportunities. Technical Report98–01, University of Wisc<strong>on</strong>sin, Computer Sciences Department,Madis<strong>on</strong>, January. INFORMS Journal <strong>on</strong> Computing, toappear.Bradley P.S. and Mangasarian O.L. 1998. Feature selecti<strong>on</strong> via c<strong>on</strong>caveminimizati<strong>on</strong> and <strong>support</strong> <strong>vector</strong> machines. In: Shavlik J.(Ed.), Proceedings of the Internati<strong>on</strong>al C<strong>on</strong>ference <strong>on</strong> MachineLearning, Morgan Kaufmann Publishers, San Francisco, California,pp. 82–90. ftp://ftp.cs.wisc.edu/math-prog/tech-reports/98-03.ps.Z.Bunch J.R. and Kaufman L. 1977. Some stable methods for calculatinginertia and solving symmetric linear systems. Mathematics ofComputati<strong>on</strong> 31: 163–179.Bunch J.R. and Kaufman L. 1980. A computati<strong>on</strong>al method for theindefinite quadratic programming problem. Linear Algebra andIts Applicati<strong>on</strong>s, pp. 341–370, December.Bunch J.R., Kaufman L., and Parlett B. 1976. Decompositi<strong>on</strong> of a symmetricmatrix. Numerische Mathematik 27: 95–109.Burges C.J.C. 1996. Simplified <strong>support</strong> <strong>vector</strong> decisi<strong>on</strong> rules. InL. 
Saitta (Ed.), Proceedings of the Internati<strong>on</strong>al C<strong>on</strong>ference <strong>on</strong>Machine Learning, Morgan Kaufmann Publishers, San Mateo,CA, pp. 71–77.Burges C.J.C. 1998. A <str<strong>on</strong>g>tutorial</str<strong>on</strong>g> <strong>on</strong> <strong>support</strong> <strong>vector</strong> machines for patternrecogniti<strong>on</strong>. Data Mining and Knowledge Discovery 2(2): 121–167.Burges C.J.C. 1999. Geometry and invariance in kernel based methods.In Schölkopf B., Burges C.J.C., and Smola A.J., (Eds.), Advancesin Kernel Methods—Support Vector Learning, MIT Press, Cambridge,MA, pp. 89–116.Burges C.J.C. and Schölkopf B. 1997. Improving the accuracy and speedof <strong>support</strong> <strong>vector</strong> learning machines. In Mozer M.C., Jordan M.I.,and Petsche T., (Eds.), Advances in Neural Informati<strong>on</strong> ProcessingSystems 9, MIT Press, Cambridge, MA, pp. 375–381.Chalimourda A., Schölkopf B., and Smola A.J. 2004. Experimentallyoptimal ν in <strong>support</strong> <strong>vector</strong> regressi<strong>on</strong> for different noise modelsand parameter settings. Neural Networks 17(1): 127–141.Chang C.-C., Hsu C.-W., and Lin C.-J. 1999. The analysis of decompositi<strong>on</strong>methods for <strong>support</strong> <strong>vector</strong> machines. In Proceeding ofIJCAI99, SVM Workshop.Chang C.C. and Lin C.J. 2001. Training ν-<strong>support</strong> <strong>vector</strong> classifiers:Theory and algorithms. Neural Computati<strong>on</strong> 13(9): 2119–2147.Chen S., D<strong>on</strong>oho D., and Saunders M. 1999. Atomic decompositi<strong>on</strong> bybasis pursuit. Siam Journal of Scientific Computing 20(1): 33–61.Cherkassky V. and Mulier F. 1998. Learning from Data. John Wiley andS<strong>on</strong>s, New York.Cortes C. and Vapnik V. 1995. Support <strong>vector</strong> networks. Machine Learning20: 273–297.Cox D. and O’Sullivan F. 1990. Asymptotic analysis of penalized likelihoodand related estimators. Annals of Statistics 18: 1676–1695.CPLEX Optimizati<strong>on</strong> Inc. Using the CPLEX callable library. Manual,1994.Cristianini N. and Shawe-Taylor J. 2000. An Introducti<strong>on</strong> to SupportVector Machines. Cambridge University Press, Cambridge, UK.Cristianini N., Campbell C., and Shawe-Taylor J. 1998. Multiplicativeupdatings for <strong>support</strong> <strong>vector</strong> learning. NeuroCOLT Technical ReportNC-TR-98-016, Royal Holloway College.Dantzig G.B. 1962. Linear Programming and Extensi<strong>on</strong>s. Princet<strong>on</strong>Univ. Press, Princet<strong>on</strong>, NJ.Devroye L., Györfi L., and Lugosi G. 1996. A Probabilistic Theory ofPattern Recogniti<strong>on</strong>. Number 31 in Applicati<strong>on</strong>s of mathematics.Springer, New York.Drucker H., Burges C.J.C., Kaufman L., Smola A., and Vapnik V. 1997.Support <strong>vector</strong> regressi<strong>on</strong> machines. In: Mozer M.C., Jordan M.I.,and Petsche T. (Eds.), Advances in Neural Informati<strong>on</strong> ProcessingSystems 9, MIT Press, Cambridge, MA, pp. 155–161.Efr<strong>on</strong> B. 1982. The jacknife, the bootstrap, and other resampling plans.SIAM, Philadelphia.Efr<strong>on</strong> B. and Tibshirani R.J. 1994. An Introducti<strong>on</strong> to the Bootstrap.Chapman and Hall, New York.El-Bakry A., Tapia R., Tsuchiya R., and Zhang Y. 1996. On the formulati<strong>on</strong>and theory of the Newt<strong>on</strong> interior-point method for n<strong>on</strong>linearprogramming. J. 
Optimizati<strong>on</strong> Theory and Applicati<strong>on</strong>s 89: 507–541.Fletcher R. 1989. Practical Methods of Optimizati<strong>on</strong>. John Wiley andS<strong>on</strong>s, New York.Girosi F. 1998. An equivalence between sparse approximati<strong>on</strong> and <strong>support</strong><strong>vector</strong> machines. Neural Computati<strong>on</strong> 10(6): 1455–1480.Girosi F., J<strong>on</strong>es M., and Poggio T. 1993. Priors, stabilizers and basisfuncti<strong>on</strong>s: From regularizati<strong>on</strong> to radial, tensor and additivesplines. A.I. Memo No. 1430, Artificial Intelligence Laboratory,Massachusetts Institute of Technology.Guy<strong>on</strong> I., Boser B., and Vapnik V. 1993. Automatic capacity tuning ofvery large VC-dimensi<strong>on</strong> classifiers. In: Hans<strong>on</strong> S.J., Cowan J.D.,and Giles C.L. (Eds.), Advances in Neural Informati<strong>on</strong> ProcessingSystems 5. Morgan Kaufmann Publishers, pp. 147–155.Härdle W. 1990. Applied n<strong>on</strong>parametric regressi<strong>on</strong>, volume 19 ofEc<strong>on</strong>ometric Society M<strong>on</strong>ographs. Cambridge University Press.


220 Smola and SchölkopfHastie T.J. and Tibshirani R.J. 1990. Generalized Additive Models,volume 43 of M<strong>on</strong>ographs <strong>on</strong> Statistics and Applied Probability.Chapman and Hall, L<strong>on</strong>d<strong>on</strong>.Haykin S. 1998. Neural Networks: A Comprehensive Foundati<strong>on</strong>. 2ndediti<strong>on</strong>. Macmillan, New York.Hearst M.A., Schölkopf B., Dumais S., Osuna E., and Platt J. 1998.Trends and c<strong>on</strong>troversies—<strong>support</strong> <strong>vector</strong> machines. IEEE IntelligentSystems 13: 18–28.Herbrich R. 2002. Learning Kernel Classifiers: Theory and Algorithms.MIT Press.Huber P.J. 1972. Robust statistics: A review. Annals of Statistics43: 1041.Huber P.J. 1981. Robust Statistics. John Wiley and S<strong>on</strong>s, New York.IBM Corporati<strong>on</strong>. 1992. IBM optimizati<strong>on</strong> subroutine library guideand reference. IBM Systems Journal, 31, SC23-0519.Jaakkola T.S. and Haussler D. 1999. Probabilistic kernel regressi<strong>on</strong>models. In: Proceedings of the 1999 C<strong>on</strong>ference <strong>on</strong> AI and Statistics.Joachims T. 1999. Making large-scale SVM learning practical.In: Schölkopf B., Burges C.J.C., and Smola A.J. (Eds.), Advancesin Kernel Methods—Support Vector Learning, MIT Press,Cambridge, MA, pp. 169–184.Karush W. 1939. Minima of functi<strong>on</strong>s of several variables with inequalitiesas side c<strong>on</strong>straints. Master’s thesis, Dept. of Mathematics,Univ. of Chicago.Kaufman L. 1999. Solving the quadratic programming problem arisingin <strong>support</strong> <strong>vector</strong> classificati<strong>on</strong>. In: Schölkopf B., Burges C.J.C.,and Smola A.J. (Eds.), Advances in Kernel Methods—SupportVector Learning, MIT Press, Cambridge, MA, pp. 147–168Keerthi S.S., Shevade S.K., Bhattacharyya C., and Murthy K.R.K. 1999.Improvements to Platt’s SMO algorithm for SVM classifier design.Technical Report CD-99-14, Dept. of Mechanical and Producti<strong>on</strong>Engineering, Natl. Univ. Singapore, Singapore.Keerthi S.S., Shevade S.K., Bhattacharyya C., and Murty K.R.K. 2001.Improvements to platt’s SMO algorithm for SVM classifier design.Neural Computati<strong>on</strong> 13: 637–649.Kimeldorf G.S. and Wahba G. 1970. A corresp<strong>on</strong>dence betweenBayesian estimati<strong>on</strong> <strong>on</strong> stochastic processes and smoothing bysplines. Annals of Mathematical Statistics 41: 495–502.Kimeldorf G.S. and Wahba G. 1971. Some results <strong>on</strong> Tchebycheffianspline functi<strong>on</strong>s. J. Math. Anal. Applic. 33: 82–95.Kowalczyk A. 2000. Maximal margin perceptr<strong>on</strong>. In: Smola A.J.,Bartlett P.L., Schölkopf B., and Schuurmans D. (Eds.), Advancesin Large Margin Classifiers, MIT Press, Cambridge, MA, pp. 75–113.Kuhn H.W. and Tucker A.W. 1951. N<strong>on</strong>linear programming. In: Proc.2nd Berkeley Symposium <strong>on</strong> Mathematical Statistics and Probabilistics,Berkeley. University of California Press, pp. 481–492.Lee Y.J. and Mangasarian O.L. 2001. SSVM: A smooth <strong>support</strong> <strong>vector</strong>machine for classificati<strong>on</strong>. Computati<strong>on</strong>al optimizati<strong>on</strong> and Applicati<strong>on</strong>s20(1): 5–22.Li M. and Vitányi P. 1993. An introducti<strong>on</strong> to Kolmogorov Complexityand its applicati<strong>on</strong>s. Texts and M<strong>on</strong>ographs in Computer Science.Springer, New York.Lin C.J. 2001. 
On the c<strong>on</strong>vergence of the decompositi<strong>on</strong> method for<strong>support</strong> <strong>vector</strong> machines. IEEE Transacti<strong>on</strong>s <strong>on</strong> Neural Networks12(6): 1288–1298.Lustig I.J., Marsten R.E., and Shanno D.F. 1990. On implementingMehrotra’s predictor-corrector interior point method for linear programming.Princet<strong>on</strong> Technical Report SOR 90–03., Dept. of CivilEngineering and Operati<strong>on</strong>s Research, Princet<strong>on</strong> University.Lustig I.J., Marsten R.E., and Shanno D.F. 1992. On implementingMehrotra’s predictor-corrector interior point method for linearprogramming. SIAM Journal <strong>on</strong> Optimizati<strong>on</strong> 2(3): 435–449.MacKay D.J.C. 1991. Bayesian Methods for Adaptive Models. PhDthesis, Computati<strong>on</strong> and Neural Systems, California Institute ofTechnology, Pasadena, CA.Mangasarian O.L. 1965. Linear and n<strong>on</strong>linear separati<strong>on</strong> of patterns bylinear programming. Operati<strong>on</strong>s Research 13: 444–452.Mangasarian O.L. 1968. Multi-surface method of pattern separati<strong>on</strong>.IEEE Transacti<strong>on</strong>s <strong>on</strong> Informati<strong>on</strong> Theory IT-14: 801–807.Mangasarian O.L. 1969. N<strong>on</strong>linear Programming. McGraw-Hill, NewYork.Mattera D. and Haykin S. 1999. Support <strong>vector</strong> machines for dynamicrec<strong>on</strong>structi<strong>on</strong> of a chaotic system. In: Schölkopf B., BurgesC.J.C., and Smola A.J. (Eds.), Advances in Kernel Methods—Support Vector Learning, MIT Press, Cambridge, MA, pp. 211–242.McCormick G.P. 1983. N<strong>on</strong>linear Programming: Theory, Algorithms,and Applicati<strong>on</strong>s. John Wiley and S<strong>on</strong>s, New York.Megiddo N. 1989. Progressin Mathematical Programming, chapterPathways to the optimal set in linear programming, Springer, NewYork, NY, pp. 131–158.Mehrotra S. and Sun J. 1992. On the implementati<strong>on</strong> of a (primal-dual)interior point method. SIAM Journal <strong>on</strong> Optimizati<strong>on</strong> 2(4): 575–601.Mercer J. 1909. Functi<strong>on</strong>s of positive and negative type and their c<strong>on</strong>necti<strong>on</strong>with the theory of integral equati<strong>on</strong>s. Philosophical Transacti<strong>on</strong>sof the Royal Society, L<strong>on</strong>d<strong>on</strong> A 209: 415–446.Micchelli C.A. 1986. Algebraic aspects of interpolati<strong>on</strong>. Proceedingsof Symposia in Applied Mathematics 36: 81–102.Morozov V.A. 1984. Methods for Solving Incorrectly Posed Problems.Springer.Müller K.-R., Smola A., Rätsch G., Schölkopf B., Kohlmorgen J., andVapnik V. 1997. Predicting time series with <strong>support</strong> <strong>vector</strong> machines.In: Gerstner W., Germ<strong>on</strong>d A., Hasler M., and Nicoud J.-D.(Eds.), Artificial Neural Networks ICANN’97, Berlin. SpringerLecture Notes in Computer Science Vol. 1327 pp. 999–1004.Murtagh B.A. and Saunders M.A. 1983. MINOS 5.1 user’s guide. TechnicalReport SOL 83-20R, Stanford University, CA, USA, Revised1987.Neal R. 1996. Bayesian Learning in Neural Networks. Springer.Nilss<strong>on</strong> N.J. 1965. Learning machines: Foundati<strong>on</strong>s of Trainable PatternClassifying Systems. McGraw-Hill.Nyquist. H. 1928. Certain topics in telegraph transmissi<strong>on</strong> theory.Trans. A.I.E.E., pp. 617–644.Osuna E., Freund R., and Girosi F. 1997. 
An improved training algorithmfor <strong>support</strong> <strong>vector</strong> machines. In Principe J., Gile L., MorganN., and Wils<strong>on</strong> E. (Eds.), Neural Networks for Signal ProcessingVII—Proceedings of the 1997 IEEE Workshop, pp. 276–285, NewYork, IEEE.Osuna E. and Girosi F. 1999. Reducing the run-time complexity in<strong>support</strong> <strong>vector</strong> regressi<strong>on</strong>. In: Schölkopf B., Burges C.J.C., andSmola A. J. (Eds.), Advances in Kernel Methods—Support VectorLearning, pp. 271–284, Cambridge, MA, MIT Press.Ovari Z. 2000. Kernels, eigenvalues and <strong>support</strong> <strong>vector</strong> machines. H<strong>on</strong>oursthesis, Australian Nati<strong>on</strong>al University, Canberra.




A Tutorial on ν-Support Vector Machines

Pai-Hsuen Chen¹, Chih-Jen Lin¹, and Bernhard Schölkopf²⋆

¹ Department of Computer Science and Information Engineering, National Taiwan University, Taipei 106, Taiwan
² Max Planck Institute for Biological Cybernetics, Tübingen, Germany
bernhard.schoelkopf@tuebingen.mpg.de

Abstract. We briefly describe the main ideas of statistical learning theory, support vector machines (SVMs), and kernel feature spaces. We place particular emphasis on a description of the so-called ν-SVM, including details of the algorithm and its implementation, theoretical results, and practical applications.

1 An Introductory Example

Suppose we are given empirical data
\[
(x_1, y_1), \ldots, (x_m, y_m) \in \mathcal{X} \times \{\pm 1\}. \tag{1}
\]
Here, the domain \(\mathcal{X}\) is some nonempty set that the patterns x_i are taken from; the y_i are called labels or targets.

Unless stated otherwise, indices i and j will always be understood to run over the training set, i.e., i, j = 1, ..., m.

Note that we have not made any assumptions on the domain \(\mathcal{X}\) other than it being a set. In order to study the problem of learning, we need additional structure. In learning, we want to be able to generalize to unseen data points. In the case of pattern recognition, this means that, given some new pattern x ∈ \(\mathcal{X}\), we want to predict the corresponding y ∈ {±1}. By this we mean, loosely speaking, that we choose y such that (x, y) is in some sense similar to the training examples. To this end, we need similarity measures in \(\mathcal{X}\) and in {±1}. The latter is easy, as two target values can only be identical or different. For the former, we require a similarity measure
\[
k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}, \qquad (x, x') \mapsto k(x, x'), \tag{2}
\]
i.e., a function that, given two examples x and x', returns a real number characterizing their similarity. For reasons that will become clear later, the function k is called a kernel ([24], [1], [8]).

A type of similarity measure that is of particular mathematical appeal are dot products. For instance, given two vectors x, x' ∈ R^N, the canonical dot product is defined

⋆ Parts of the present article are based on [31].


as
\[
(x \cdot x') := \sum_{i=1}^{N} (x)_i (x')_i. \tag{3}
\]
Here, (x)_i denotes the i-th entry of x.

The geometrical interpretation of this dot product is that it computes the cosine of the angle between the vectors x and x', provided they are normalized to length 1. Moreover, it allows computation of the length of a vector x as √(x · x), and of the distance between two vectors as the length of the difference vector. Therefore, being able to compute dot products amounts to being able to carry out all geometrical constructions that can be formulated in terms of angles, lengths and distances.

Note, however, that we have not made the assumption that the patterns live in a dot product space. In order to be able to use a dot product as a similarity measure, we therefore first need to transform them into some dot product space H, which need not be identical to R^N. To this end, we use a map
\[
\Phi : \mathcal{X} \to H, \qquad x \mapsto \mathbf{x}. \tag{4}
\]
The space H is called a feature space. To summarize, there are three benefits of transforming the data into H:

1. It lets us define a similarity measure from the dot product in H,
\[
k(x, x') := (\mathbf{x} \cdot \mathbf{x}') = (\Phi(x) \cdot \Phi(x')). \tag{5}
\]
2. It allows us to deal with the patterns geometrically, and thus lets us study learning algorithms using linear algebra and analytic geometry.
3. The freedom to choose the mapping Φ will enable us to design a large variety of learning algorithms. For instance, consider a situation where the inputs already live in a dot product space. In that case, we could directly define a similarity measure as the dot product. However, we might still choose to first apply a nonlinear map Φ to change the representation into one that is more suitable for a given problem and learning algorithm.

We are now in the position to describe a pattern recognition learning algorithm that is arguably one of the simplest possible. The basic idea is to compute the means of the two classes in feature space,
\[
\mathbf{c}_+ = \frac{1}{m_+} \sum_{\{i: y_i = +1\}} \mathbf{x}_i, \tag{6}
\]
\[
\mathbf{c}_- = \frac{1}{m_-} \sum_{\{i: y_i = -1\}} \mathbf{x}_i, \tag{7}
\]
where m_+ and m_- are the number of examples with positive and negative labels, respectively (see Figure 1). We then assign a new point x to the class whose mean is closer to it.


Fig. 1. A simple geometric classification algorithm: given two classes of points (depicted by 'o' and '+'), compute their means c_+, c_- and assign a test pattern x to the one whose mean is closer. This can be done by looking at the dot product between x − c (where c = (c_+ + c_-)/2) and w := c_+ − c_-, which changes sign as the enclosed angle passes through π/2. Note that the corresponding decision boundary is a hyperplane (the dotted line) orthogonal to w (from [31]).

This geometrical construction can be formulated in terms of dot products. Half-way in between c_+ and c_- lies the point c := (c_+ + c_-)/2. We compute the class of x by checking whether the vector connecting c and x encloses an angle smaller than π/2 with the vector w := c_+ − c_- connecting the class means, in other words
\[
\begin{aligned}
y &= \operatorname{sgn}\bigl((\mathbf{x} - \mathbf{c}) \cdot \mathbf{w}\bigr) \\
  &= \operatorname{sgn}\bigl((\mathbf{x} - (\mathbf{c}_+ + \mathbf{c}_-)/2) \cdot (\mathbf{c}_+ - \mathbf{c}_-)\bigr) \\
  &= \operatorname{sgn}\bigl((\mathbf{x} \cdot \mathbf{c}_+) - (\mathbf{x} \cdot \mathbf{c}_-) + b\bigr). 
\end{aligned} \tag{8}
\]
Here, we have defined the offset
\[
b := \tfrac{1}{2}\bigl(\|\mathbf{c}_-\|^2 - \|\mathbf{c}_+\|^2\bigr). \tag{9}
\]
It will prove instructive to rewrite this expression in terms of the patterns x_i in the input domain \(\mathcal{X}\). To this end, note that we do not have a dot product in \(\mathcal{X}\); all we have is the similarity measure k (cf. (5)). Therefore, we need to rewrite everything in terms of the kernel k evaluated on input patterns. To this end, substitute (6) and (7) into (8) to get the decision function
\[
\begin{aligned}
y &= \operatorname{sgn}\Bigl(\frac{1}{m_+}\sum_{\{i: y_i = +1\}} (\mathbf{x} \cdot \mathbf{x}_i) - \frac{1}{m_-}\sum_{\{i: y_i = -1\}} (\mathbf{x} \cdot \mathbf{x}_i) + b\Bigr) \\
  &= \operatorname{sgn}\Bigl(\frac{1}{m_+}\sum_{\{i: y_i = +1\}} k(x, x_i) - \frac{1}{m_-}\sum_{\{i: y_i = -1\}} k(x, x_i) + b\Bigr).
\end{aligned} \tag{10}
\]


Similarly, the offset becomes
\[
b := \frac{1}{2}\Bigl(\frac{1}{m_-^2}\sum_{\{(i,j): y_i = y_j = -1\}} k(x_i, x_j) - \frac{1}{m_+^2}\sum_{\{(i,j): y_i = y_j = +1\}} k(x_i, x_j)\Bigr). \tag{11}
\]
Let us consider one well-known special case of this type of classifier. Assume that the class means have the same distance to the origin (hence b = 0), and that k can be viewed as a density, i.e., it is positive and has integral 1,
\[
\int_{\mathcal{X}} k(x, x')\,dx = 1 \quad \text{for all } x' \in \mathcal{X}. \tag{12}
\]
In order to state this assumption, we have to require that we can define an integral on \(\mathcal{X}\).

If the above holds true, then (10) corresponds to the so-called Bayes decision boundary separating the two classes, subject to the assumption that the two classes were generated from two probability distributions that are correctly estimated by the Parzen windows estimators of the two classes,
\[
p_1(x) := \frac{1}{m_+}\sum_{\{i: y_i = +1\}} k(x, x_i), \tag{13}
\]
\[
p_2(x) := \frac{1}{m_-}\sum_{\{i: y_i = -1\}} k(x, x_i). \tag{14}
\]
Given some point x, the label is then simply computed by checking which of the two, p_1(x) or p_2(x), is larger, which directly leads to (10). Note that this decision is the best we can do if we have no prior information about the probabilities of the two classes. For further details, see [31].

The classifier (10) is quite close to the types of learning machines that we will be interested in. It is linear in the feature space, while in the input domain it is represented by a kernel expansion in terms of the training points. It is example-based in the sense that the kernels are centered on the training examples, i.e., one of the two arguments of the kernels is always a training example. The main points in which the more sophisticated techniques to be discussed later deviate from (10) are the selection of the examples that the kernels are centered on, and the weights that are put on the individual data in the decision function. Namely, it will no longer be the case that all training examples appear in the kernel expansion, and the weights of the kernels in the expansion will no longer be uniform. In the feature space representation, this statement corresponds to saying that we will study all normal vectors w of decision hyperplanes that can be represented as linear combinations of the training examples. For instance, we might want to remove the influence of patterns that are very far away from the decision boundary, either since we expect that they will not improve the generalization error of the decision function, or since we would like to reduce the computational cost of evaluating the decision function (cf. (10)). The hyperplane will then only depend on a subset of training examples, called support vectors.
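As a concrete illustration of the classifier (10) with offset (11), the following is a minimal sketch, assuming a Gaussian kernel and made-up toy data; the function names and the gamma parameter are ours, not part of the original text.

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    # k(x, x') = exp(-gamma * ||x - x'||^2)
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def simple_kernel_classifier(X_train, y_train, X_test, gamma=1.0):
    """Decision function (10) with offset (11): compare kernel means of the two classes."""
    pos, neg = X_train[y_train == +1], X_train[y_train == -1]
    K_pos = rbf_kernel(X_test, pos, gamma).mean(axis=1)   # (1/m_+) sum_i k(x, x_i), y_i = +1
    K_neg = rbf_kernel(X_test, neg, gamma).mean(axis=1)   # (1/m_-) sum_i k(x, x_i), y_i = -1
    b = 0.5 * (rbf_kernel(neg, neg, gamma).mean() - rbf_kernel(pos, pos, gamma).mean())
    return np.sign(K_pos - K_neg + b)

# toy usage with two Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 0.5, (20, 2)), rng.normal(+1, 0.5, (20, 2))])
y = np.array([-1] * 20 + [+1] * 20)
print(simple_kernel_classifier(X, y, X)[:5])
```

Every training example appears in this kernel expansion with uniform weight, which is exactly the property that the support vector machines discussed below relax.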


2 Learning Pattern Recognition from Examples

With the above example in mind, let us now consider the problem of pattern recognition in a more formal setting ([37], [38]), following the introduction of [30]. In two-class pattern recognition, we seek to estimate a function
\[
f : \mathcal{X} \to \{\pm 1\} \tag{15}
\]
based on input-output training data (1). We assume that the data were generated independently from some unknown (but fixed) probability distribution P(x, y). Our goal is to learn a function that will correctly classify unseen examples (x, y), i.e., we want f(x) = y for examples (x, y) that were also generated from P(x, y).

If we put no restriction on the class of functions that we choose our estimate f from, however, even a function which does well on the training data, e.g. by satisfying f(x_i) = y_i for all i = 1, ..., m, need not generalize well to unseen examples. To see this, note that for each function f and any test set (x̄_1, ȳ_1), ..., (x̄_m̄, ȳ_m̄) ∈ R^N × {±1} satisfying {x̄_1, ..., x̄_m̄} ∩ {x_1, ..., x_m} = {}, there exists another function f* such that f*(x_i) = f(x_i) for all i = 1, ..., m, yet f*(x̄_i) ≠ f(x̄_i) for all i = 1, ..., m̄. As we are only given the training data, we have no means of selecting which of the two functions (and hence which of the completely different sets of test label predictions) is preferable. Hence, only minimizing the training error (or empirical risk),
\[
R_{\mathrm{emp}}[f] = \frac{1}{m}\sum_{i=1}^{m} \tfrac{1}{2}\,|f(x_i) - y_i|, \tag{16}
\]
does not imply a small test error (called risk), averaged over test examples drawn from the underlying distribution P(x, y),
\[
R[f] = \int \tfrac{1}{2}\,|f(x) - y|\,dP(x, y). \tag{17}
\]
Statistical learning theory ([41], [37], [38], [39]), or VC (Vapnik-Chervonenkis) theory, shows that it is imperative to restrict the class of functions that f is chosen from to one which has a capacity that is suitable for the amount of available training data. VC theory provides bounds on the test error. The minimization of these bounds, which depend on both the empirical risk and the capacity of the function class, leads to the principle of structural risk minimization ([37]). The best-known capacity concept of VC theory is the VC dimension, defined as the largest number h of points that can be separated in all possible ways using functions of the given class. An example of a VC bound is the following: if h < m is the VC dimension of the class of functions that the learning machine can implement, then for all functions of that class, with a probability of at least 1 − η, the bound
\[
R(f) \le R_{\mathrm{emp}}(f) + \phi\!\left(\frac{h}{m}, \frac{\log(\eta)}{m}\right) \tag{18}
\]
holds, where the confidence term φ is defined as
\[
\phi\!\left(\frac{h}{m}, \frac{\log(\eta)}{m}\right) = \sqrt{\frac{h\left(\log\frac{2m}{h} + 1\right) - \log(\eta/4)}{m}}. \tag{19}
\]
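To get a feel for how the confidence term (19) behaves, the short sketch below evaluates it numerically; the particular values of h, m, and η are made up for illustration only.

```python
import math

def vc_confidence(h, m, eta):
    """Confidence term phi(h/m, log(eta)/m) from (19)."""
    return math.sqrt((h * (math.log(2 * m / h) + 1) - math.log(eta / 4)) / m)

# e.g. h = 10, eta = 0.05: the bound only becomes nontrivial once m is large relative to h
for m in (100, 1000, 10000):
    print(m, round(vc_confidence(10, m, 0.05), 3))
```

The term shrinks roughly like sqrt(h log(m)/m), which is why the bound (18) is only informative when the capacity h is small compared with the sample size m.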


Tighter bounds can be formulated in terms of other concepts, such as the annealed VC entropy or the growth function. These are usually considered to be harder to evaluate, but they play a fundamental role in the conceptual part of VC theory ([38]). Alternative capacity concepts that can be used to formulate bounds include the fat shattering dimension ([2]).

The bound (18) deserves some further explanatory remarks. Suppose we wanted to learn a "dependency" where P(x, y) = P(x) · P(y), i.e., where the pattern x contains no information about the label y, with uniform P(y). Given a training sample of fixed size, we can then surely come up with a learning machine which achieves zero training error (provided we have no examples contradicting each other). However, in order to reproduce the random labelling, this machine will necessarily require a large VC dimension h. Thus, the confidence term (19), increasing monotonically with h, will be large, and the bound (18) will not support possible hopes that due to the small training error, we should expect a small test error. This makes it understandable how (18) can hold independent of assumptions about the underlying distribution P(x, y): it always holds (provided that h < m), but it does not always make a nontrivial prediction; a bound on an error rate becomes void if it is larger than the maximum error rate. In order to get nontrivial predictions from (18), the function space must be restricted such that the capacity (e.g. VC dimension) is small enough (in relation to the available amount of data).

3 Hyperplane Classifiers

In the present section, we shall describe a hyperplane learning algorithm that can be performed in a dot product space (such as the feature space that we introduced previously). As described in the previous section, to design learning algorithms, one needs to come up with a class of functions whose capacity can be computed.

[42] considered the class of hyperplanes
\[
(w \cdot x) + b = 0, \qquad w \in \mathbb{R}^N,\ b \in \mathbb{R}, \tag{20}
\]
corresponding to decision functions
\[
f(x) = \operatorname{sgn}\bigl((w \cdot x) + b\bigr), \tag{21}
\]
and proposed a learning algorithm for separable problems, termed the Generalized Portrait, for constructing f from empirical data. It is based on two facts. First, among all hyperplanes separating the data, there exists a unique one yielding the maximum margin of separation between the classes,
\[
\max_{w,b}\ \min\{\|x - x_i\| : x \in \mathbb{R}^N,\ (w \cdot x) + b = 0,\ i = 1, \ldots, m\}. \tag{22}
\]
Second, the capacity decreases with increasing margin.

To construct this Optimal Hyperplane (cf. Figure 2), one solves the following optimization problem:
\[
\min_{w,b}\ \tfrac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i \cdot \bigl((w \cdot x_i) + b\bigr) \ge 1,\ i = 1, \ldots, m. \tag{23}
\]
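A minimal sketch of solving the primal problem (23) directly with a general-purpose solver is given below, assuming a small, made-up separable data set and SciPy's SLSQP routine; this is for illustration only and is not the optimization strategy developed later in the article.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical separable toy data in R^2
X = np.array([[2.0, 2.0], [2.5, 1.5], [3.0, 3.0], [-2.0, -1.5], [-2.5, -2.5], [-1.5, -3.0]])
y = np.array([+1, +1, +1, -1, -1, -1])

def objective(wb):
    w = wb[:-1]
    return 0.5 * w @ w                                     # (1/2)||w||^2, cf. (23)

constraints = [{"type": "ineq",
                "fun": lambda wb, i=i: y[i] * (X[i] @ wb[:-1] + wb[-1]) - 1.0}
               for i in range(len(y))]                     # y_i ((w . x_i) + b) - 1 >= 0

res = minimize(objective, x0=np.array([1.0, 1.0, 0.0]),
               constraints=constraints, method="SLSQP")
w, b = res.x[:-1], res.x[-1]
print("w =", w, "b =", b, "margin =", 2 / np.linalg.norm(w))
```

The printed margin 2/||w|| is exactly the quantity explained in the caption of Figure 2 below.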


Fig. 2. A binary classification toy problem: separate balls from diamonds. The optimal hyperplane is orthogonal to the shortest line connecting the convex hulls of the two classes (dotted), and intersects it half-way between the two classes. The problem is separable, so there exists a weight vector w and a threshold b such that y_i · ((w · x_i) + b) > 0 (i = 1, ..., m). Rescaling w and b such that the point(s) closest to the hyperplane satisfy |(w · x_i) + b| = 1, we obtain a canonical form (w, b) of the hyperplane, satisfying y_i · ((w · x_i) + b) ≥ 1. Note that in this case, the margin, measured perpendicularly to the hyperplane, equals 2/‖w‖. This can be seen by considering two points x_1, x_2 on opposite sides of the margin, i.e., (w · x_1) + b = 1, (w · x_2) + b = −1, and projecting them onto the hyperplane normal vector w/‖w‖ (from [29]).

A way to solve (23) is through its Lagrangian dual:
\[
\max_{\alpha \ge 0}\ \bigl(\min_{w,b} L(w, b, \alpha)\bigr), \tag{24}
\]
where
\[
L(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{m} \alpha_i \bigl(y_i \cdot ((x_i \cdot w) + b) - 1\bigr). \tag{25}
\]
The Lagrangian L has to be minimized with respect to the primal variables w and b and maximized with respect to the dual variables α_i. For a nonlinear problem like (23), called the primal problem, there are several closely related problems, of which the Lagrangian dual is an important one. Under certain conditions, the primal and dual problems have the same optimal objective values. Therefore, we can instead solve the dual, which may be an easier problem than the primal. In particular, we will see in Section 4 that when working in feature spaces, solving the dual may be the only way to train SVMs.

Let us try to get some intuition for this primal-dual relation. Assume (w̄, b̄) is an optimal solution of the primal with the optimal objective value γ = ½‖w̄‖². Thus, no (w, b) satisfies
\[
\tfrac{1}{2}\|w\|^2 < \gamma \quad \text{and} \quad y_i \cdot \bigl((w \cdot x_i) + b\bigr) \ge 1,\ i = 1, \ldots, m. \tag{26}
\]


With (26), there is ᾱ ≥ 0 such that for all w, b
\[
\tfrac{1}{2}\|w\|^2 - \gamma - \sum_{i=1}^{m} \bar{\alpha}_i \bigl(y_i \cdot ((x_i \cdot w) + b) - 1\bigr) \ge 0. \tag{27}
\]
We do not provide a rigorous proof here, but details can be found in, for example, [5]. Note that for general convex programming this result requires some additional conditions on the constraints, which are satisfied here by our simple linear inequalities.

Therefore, (27) implies
\[
\max_{\alpha \ge 0}\ \min_{w,b}\ L(w, b, \alpha) \ge \gamma. \tag{28}
\]
On the other hand, for any α,
\[
\min_{w,b}\ L(w, b, \alpha) \le L(\bar{w}, \bar{b}, \alpha),
\]
so
\[
\max_{\alpha \ge 0}\ \min_{w,b}\ L(w, b, \alpha) \le \max_{\alpha \ge 0} L(\bar{w}, \bar{b}, \alpha) = \tfrac{1}{2}\|\bar{w}\|^2 = \gamma. \tag{29}
\]
Therefore, with (28), the inequality in (29) becomes an equality. This property is the strong duality, where the primal and dual have the same optimal objective value. In addition, putting (w̄, b̄) into (27), with ᾱ_i ≥ 0 and y_i · ((x_i · w̄) + b̄) − 1 ≥ 0,
\[
\bar{\alpha}_i \cdot \bigl[y_i((x_i \cdot \bar{w}) + \bar{b}) - 1\bigr] = 0, \quad i = 1, \ldots, m, \tag{30}
\]
which is usually called the complementarity condition.

To simplify the dual, note that as L(w, b, α) is convex when α is fixed, for any given α,
\[
\frac{\partial}{\partial b} L(w, b, \alpha) = 0, \qquad \frac{\partial}{\partial w} L(w, b, \alpha) = 0 \tag{31}
\]
leads to
\[
\sum_{i=1}^{m} \alpha_i y_i = 0 \tag{32}
\]
and
\[
w = \sum_{i=1}^{m} \alpha_i y_i x_i. \tag{33}
\]
As α is now given, we may wonder what (32) means. From the definition of the Lagrangian, if Σ_{i=1}^m α_i y_i ≠ 0, we can decrease −b Σ_{i=1}^m α_i y_i in L(w, b, α) as much as we want. Therefore, by substituting (33) into (24), the dual problem can be written as
\[
\max_{\alpha \ge 0}
\begin{cases}
\sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) & \text{if } \sum_{i=1}^{m} \alpha_i y_i = 0, \\
-\infty & \text{if } \sum_{i=1}^{m} \alpha_i y_i \ne 0.
\end{cases} \tag{34}
\]


As −∞ is definitely not the maximal objective value of the dual, the dual optimum is not attained when Σ_{i=1}^m α_i y_i ≠ 0. Therefore, the dual problem is simplified to finding multipliers α_i which
\[
\max_{\alpha \in \mathbb{R}^m}\ \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \tag{35}
\]
\[
\text{subject to} \quad \alpha_i \ge 0,\ i = 1, \ldots, m, \quad \text{and} \quad \sum_{i=1}^{m} \alpha_i y_i = 0. \tag{36}
\]
This is the dual SVM problem that we usually refer to. Note that (30), (32), α_i ≥ 0 for all i, and (33) are called the Karush-Kuhn-Tucker (KKT) optimality conditions of the primal problem. Except in an abnormal situation where all optimal α_i are zero, b can be computed using (30).

The discussion from (31) to (33) implies that we can consider a different form of dual problem:
\[
\max_{w,b,\alpha \ge 0}\ L(w, b, \alpha) \quad \text{subject to} \quad \frac{\partial}{\partial b} L(w, b, \alpha) = 0,\ \frac{\partial}{\partial w} L(w, b, \alpha) = 0. \tag{37}
\]
This is the so-called Wolfe dual for convex optimization, which is a very early work in duality [45]. For convex and differentiable problems, it is equivalent to the Lagrangian dual, though the derivation of the Lagrangian dual more easily shows the strong duality results. Some notes about the two duals are in, for example, [3, Section 5.4].

Following the above discussion, the hyperplane decision function can be written as
\[
f(x) = \operatorname{sgn}\Bigl(\sum_{i=1}^{m} y_i \alpha_i \cdot (x \cdot x_i) + b\Bigr). \tag{38}
\]
The solution vector w thus has an expansion in terms of a subset of the training patterns, namely those patterns whose α_i is non-zero, called Support Vectors. By (30), the Support Vectors lie on the margin (cf. Figure 2). All remaining examples of the training set are irrelevant: their constraint (23) does not play a role in the optimization, and they do not appear in the expansion (33). This nicely captures our intuition of the problem: as the hyperplane (cf. Figure 2) is completely determined by the patterns closest to it, the solution should not depend on the other examples.

The structure of the optimization problem closely resembles those that typically arise in Lagrange's formulation of mechanics. Also there, often only a subset of the constraints become active. For instance, if we keep a ball in a box, then it will typically roll into one of the corners. The constraints corresponding to the walls which are not touched by the ball are irrelevant; the walls could just as well be removed.

Seen in this light, it is not too surprising that it is possible to give a mechanical interpretation of optimal margin hyperplanes ([9]): if we assume that each support vector x_i exerts a perpendicular force of size α_i and sign y_i on a solid plane sheet lying along the hyperplane, then the solution satisfies the requirements of mechanical stability.


The constraint (32) states that the forces on the sheet sum to zero, and (33) implies that the torques also sum to zero, via Σ_i x_i × y_i α_i · w/‖w‖ = w × w/‖w‖ = 0.

There are theoretical arguments supporting the good generalization performance of the optimal hyperplane ([41], [37], [4], [33], [44]). In addition, it is computationally attractive, since it can be constructed by solving a quadratic programming problem.

4 Optimal Margin Support Vector Classifiers

We now have all the tools to describe support vector machines ([38], [31]). Everything in the last section was formulated in a dot product space. We think of this space as the feature space H described in Section 1. To express the formulas in terms of the input patterns living in \(\mathcal{X}\), we thus need to employ (5), which expresses the dot product of bold-face feature vectors x, x' in terms of the kernel k evaluated on input patterns x, x',
\[
k(x, x') = (\mathbf{x} \cdot \mathbf{x}'). \tag{39}
\]
This can be done since all feature vectors only occurred in dot products. The weight vector (cf. (33)) then becomes an expansion in feature space,¹ and will thus typically no longer correspond to the image of a single vector from input space. We thus obtain decision functions of the more general form (cf. (38))
\[
f(x) = \operatorname{sgn}\Bigl(\sum_{i=1}^{m} y_i \alpha_i \cdot (\Phi(x) \cdot \Phi(x_i)) + b\Bigr) = \operatorname{sgn}\Bigl(\sum_{i=1}^{m} y_i \alpha_i \cdot k(x, x_i) + b\Bigr), \tag{40}
\]
and the following quadratic program (cf. (35)):
\[
\max_{\alpha \in \mathbb{R}^m}\ W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j k(x_i, x_j) \tag{41}
\]
\[
\text{subject to} \quad \alpha_i \ge 0,\ i = 1, \ldots, m, \quad \text{and} \quad \sum_{i=1}^{m} \alpha_i y_i = 0. \tag{42}
\]
Working in the feature space somewhat forces us to solve the dual problem instead of the primal. The dual problem has the same number of variables as the number of training data. However, the primal problem may have many more (even infinitely many) variables, depending on the dimensionality of the feature space (i.e., the length of Φ(x)). Though our derivation of the dual problem in Section 3 considers problems in finite-dimensional spaces, it can be directly extended to problems in Hilbert spaces [20].

¹ This constitutes a special case of the so-called representer theorem, which states that under fairly general conditions, the minimizers of objective functions which contain a penalizer in terms of a norm in feature space will have kernel expansions ([43], [31]).
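The following is a minimal sketch of solving the dual (41)-(42) for a linear kernel with a generic solver (SciPy's SLSQP) on made-up toy data; variable names are ours, and this brute-force approach is meant only to make the dual concrete, not to replace the decomposition methods discussed in Section 7.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical separable toy data; linear kernel k(x, x') = x . x'
X = np.array([[2.0, 2.0], [2.5, 1.5], [3.0, 3.0], [-2.0, -1.5], [-2.5, -2.5], [-1.5, -3.0]])
y = np.array([+1.0, +1.0, +1.0, -1.0, -1.0, -1.0])
m = len(y)
K = X @ X.T                                   # kernel matrix k(x_i, x_j)
Q = (y[:, None] * y[None, :]) * K             # y_i y_j k(x_i, x_j)

def neg_W(alpha):                             # minimize -W(alpha), cf. (41)
    return -(alpha.sum() - 0.5 * alpha @ Q @ alpha)

cons = [{"type": "eq", "fun": lambda a: a @ y}]       # sum_i alpha_i y_i = 0, cf. (42)
bnds = [(0.0, None)] * m                              # alpha_i >= 0
res = minimize(neg_W, np.zeros(m), bounds=bnds, constraints=cons, method="SLSQP")
alpha = res.x

w = (alpha * y) @ X                                   # expansion (33)
sv = np.argmax(alpha)                                 # a support vector with alpha_i > 0
b = y[sv] - X[sv] @ w                                 # from the complementarity condition (30)
print("alpha =", np.round(alpha, 3), "w =", w, "b =", round(b, 3))
```

Only the kernel matrix K enters the computation, which is why the same code structure carries over unchanged to nonlinear kernels such as the Gaussian illustrated in Figure 3.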


Fig. 3. Example of a Support Vector classifier found by using a radial basis function kernel k(x, x') = exp(−‖x − x'‖²). Both coordinate axes range from −1 to +1. Circles and disks are two classes of training examples; the middle line is the decision surface; the outer lines precisely meet the constraint (23). Note that the Support Vectors found by the algorithm (marked by extra circles) are not centers of clusters, but examples which are critical for the given classification task. Grey values code the modulus of the argument Σ_{i=1}^m y_i α_i · k(x, x_i) + b of the decision function (40) (from [29]).

5 Kernels

We now take a closer look at the issue of the similarity measure, or kernel, k. In this section, we think of \(\mathcal{X}\) as a subset of the vector space R^N (N ∈ N), endowed with the canonical dot product (3).

5.1 Product Features

Suppose we are given patterns x ∈ R^N where most information is contained in the d-th order products (monomials) of entries [x]_j of x,
\[
[x]_{j_1} \cdot \ldots \cdot [x]_{j_d}, \tag{43}
\]
where j_1, ..., j_d ∈ {1, ..., N}. In that case, we might prefer to extract these product features and work in the feature space H of all products of d entries. In visual recognition problems, where images are often represented as vectors, this would amount to extracting features which are products of individual pixels.


For instance, in R², we can collect all monomial feature extractors of degree 2 in the nonlinear map
\[
\Phi : \mathbb{R}^2 \to H = \mathbb{R}^3 \tag{44}
\]
\[
([x]_1, [x]_2) \mapsto ([x]_1^2, [x]_2^2, [x]_1 [x]_2). \tag{45}
\]
This approach works fine for small toy examples, but it fails for realistically sized problems: for N-dimensional input patterns, there exist
\[
N_H = \frac{(N + d - 1)!}{d!\,(N - 1)!} \tag{46}
\]
different monomials (43), comprising a feature space H of dimensionality N_H. For instance, already 16 × 16 pixel input images and a monomial degree d = 5 yield a dimensionality of 10^10.

In certain cases described below, there exists, however, a way of computing dot products in these high-dimensional feature spaces without explicitly mapping into them: by means of kernels nonlinear in the input space R^N. Thus, if the subsequent processing can be carried out using dot products exclusively, we are able to deal with the high dimensionality.

5.2 Polynomial Feature Spaces Induced by Kernels

In order to compute dot products of the form (Φ(x) · Φ(x')), we employ kernel representations of the form
\[
k(x, x') = (\Phi(x) \cdot \Phi(x')), \tag{47}
\]
which allow us to compute the value of the dot product in H without having to carry out the map Φ. This method was used by Boser et al. to extend the Generalized Portrait hyperplane classifier [41] to nonlinear Support Vector machines [8]. Aizerman et al. called H the linearization space, and used it in the context of the potential function classification method to express the dot product between elements of H in terms of elements of the input space [1].

What does k look like for the case of polynomial features? We start by giving an example ([38]) for N = d = 2. For the map
\[
\Phi_2 : ([x]_1, [x]_2) \mapsto ([x]_1^2, [x]_2^2, [x]_1 [x]_2, [x]_2 [x]_1), \tag{48}
\]
dot products in H take the form
\[
(\Phi_2(x) \cdot \Phi_2(x')) = [x]_1^2 [x']_1^2 + [x]_2^2 [x']_2^2 + 2 [x]_1 [x]_2 [x']_1 [x']_2 = (x \cdot x')^2, \tag{49}
\]
i.e., the desired kernel k is simply the square of the dot product in input space. Note that it is possible to modify (x · x')^d such that it maps into the space of all monomials up to degree d, defining ([38])
\[
k(x, x') = \bigl((x \cdot x') + 1\bigr)^d. \tag{50}
\]
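The short sketch below checks the two quantitative claims of this section: the monomial count (46) for 16 × 16 images with d = 5, and the identity (49) for the explicit degree-2 map (48). The function names and the sample vectors are ours.

```python
import math
import numpy as np

def n_monomials(N, d):
    """Number of d-th order monomials in N variables, cf. (46)."""
    return math.factorial(N + d - 1) // (math.factorial(d) * math.factorial(N - 1))

print(n_monomials(256, 5))                  # 16x16 pixel images, d = 5: roughly 10^10

# Check (49): the explicit degree-2 feature map (48) reproduces (x . x')^2
def phi2(x):
    return np.array([x[0]**2, x[1]**2, x[0]*x[1], x[1]*x[0]])

x, xp = np.array([0.3, -1.2]), np.array([0.7, 0.5])
print(phi2(x) @ phi2(xp), (x @ xp) ** 2)    # both give the same value
```

The point of the kernel trick is precisely that the second computation, a single dot product in input space, never touches the N_H-dimensional feature space.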


5.3 Examples of Kernels

When considering feature maps, it is also possible to look at things the other way around, and start with the kernel. Given a kernel function satisfying a mathematical condition termed positive definiteness, it is possible to construct a feature space such that the kernel computes the dot product in that feature space. This has been brought to the attention of the machine learning community by [1], [8], and [38]. In functional analysis, the issue has been studied under the heading of Reproducing kernel Hilbert space (RKHS).

Besides (50), a popular choice of kernel is the Gaussian radial basis function ([1])
\[
k(x, x') = \exp\bigl(-\gamma \|x - x'\|^2\bigr). \tag{51}
\]
An illustration is in Figure 3. For an overview of other kernels, see [31].

6 ν-Soft Margin Support Vector Classifiers

In practice, a separating hyperplane may not exist, e.g. if a high noise level causes a large overlap of the classes. To allow for the possibility of examples violating (23), one introduces slack variables ([15], [38], [32])
\[
\xi_i \ge 0, \quad i = 1, \ldots, m \tag{52}
\]
in order to relax the constraints to
\[
y_i \cdot \bigl((w \cdot x_i) + b\bigr) \ge 1 - \xi_i, \quad i = 1, \ldots, m. \tag{53}
\]
A classifier which generalizes well is then found by controlling both the classifier capacity (via ‖w‖) and the sum of the slacks Σ_i ξ_i. The latter is done as it can be shown to provide an upper bound on the number of training errors, which leads to a convex optimization problem.

One possible realization of a soft margin classifier, called C-SVC, is obtained by minimizing the objective function
\[
\tau(w, \xi) = \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i \tag{54}
\]
subject to the constraints (52) and (53), for some value of the constant C > 0 determining the trade-off. Here and below, we use boldface Greek letters as a shorthand for the corresponding vectors, ξ = (ξ_1, ..., ξ_m). Incorporating kernels, and rewriting the problem in terms of Lagrange multipliers, this again leads to the problem of maximizing (41), subject to the constraints
\[
0 \le \alpha_i \le C,\ i = 1, \ldots, m, \quad \text{and} \quad \sum_{i=1}^{m} \alpha_i y_i = 0. \tag{55}
\]
The only difference from the separable case is the upper bound C on the Lagrange multipliers α_i. This way, the influence of the individual patterns (which could be outliers) gets limited. As above, the solution takes the form (40).
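As a quick, hedged illustration of the C-SVC formulation (54)-(55) with the Gaussian kernel (51), the following sketch assumes scikit-learn is available and uses made-up overlapping data; the library call SVC is scikit-learn's C-SVC implementation, not something defined in this article.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical two-class data with class overlap
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(+1.0, 1.0, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

# C-SVC with the Gaussian kernel (51); C upper-bounds the multipliers as in (55)
clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)
print("number of support vectors:", clf.n_support_.sum())
print("training accuracy:", clf.score(X, y))
```

Larger values of C allow individual (possibly outlying) patterns to exert more influence, which is the trade-off that the ν-parameterization below recasts in a more interpretable form.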


Another possible realization of a soft margin variant of the optimal hyperplane, called ν-SVC, uses the ν-parameterization ([32]). In it, the parameter C is replaced by a parameter ν ∈ [0, 1] which is a lower bound on the fraction of examples that are support vectors and an upper bound on the fraction of examples that lie on the wrong side of the hyperplane.

As a primal problem for this approach, termed the ν-SV classifier, we consider
\[
\min_{w \in H,\ \xi \in \mathbb{R}^m,\ \rho, b \in \mathbb{R}}\ \tau(w, \xi, \rho) = \tfrac{1}{2}\|w\|^2 - \nu\rho + \frac{1}{m}\sum_{i=1}^{m} \xi_i \tag{56}
\]
\[
\text{subject to} \quad y_i(\langle x_i, w\rangle + b) \ge \rho - \xi_i, \quad i = 1, \ldots, m, \tag{57}
\]
\[
\text{and} \quad \xi_i \ge 0, \quad \rho \ge 0. \tag{58}
\]
Note that no constant C appears in this formulation; instead, there is a parameter ν, and also an additional variable ρ to be optimized. To understand the role of ρ, note that for ξ = 0, the constraint (57) simply states that the two classes are separated by the margin 2ρ/‖w‖.

To explain the significance of ν, let us first introduce the term margin error: by this, we denote training points with ξ_i > 0. These are points which either are errors, or lie within the margin. Formally, the fraction of margin errors is
\[
R^{\rho}_{\mathrm{emp}}[g] := \frac{1}{m}\,\bigl|\{i \mid y_i g(x_i) < \rho\}\bigr|. \tag{59}
\]
Here, g is used to denote the argument of the sign in the decision function (40): f = sgn ∘ g. We are now in a position to state a result that explains the significance of ν.

Proposition 1 ([32]). Suppose we run ν-SVC with kernel function k on some data with the result that ρ > 0. Then
(i) ν is an upper bound on the fraction of margin errors (and hence also on the fraction of training errors).
(ii) ν is a lower bound on the fraction of SVs.
(iii) Suppose the data (x_1, y_1), ..., (x_m, y_m) were generated iid from a distribution Pr(x, y) = Pr(x) Pr(y|x), such that neither Pr(x, y = 1) nor Pr(x, y = −1) contains any discrete component. Suppose, moreover, that the kernel used is analytic and non-constant. With probability 1, asymptotically, ν equals both the fraction of SVs and the fraction of margin errors.
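A quick empirical check of Proposition 1 can be sketched as follows, assuming scikit-learn's NuSVC (whose nu parameter plays the role of ν) and made-up Gaussian-blob data. Since training errors are a subset of margin errors, the training error rate should stay below ν while the fraction of SVs stays above it, whenever ρ > 0.

```python
import numpy as np
from sklearn.svm import NuSVC

# Synthetic overlapping two-class data (made up for illustration)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.0, 1.0, (100, 2)), rng.normal(+1.0, 1.0, (100, 2))])
y = np.array([-1] * 100 + [+1] * 100)
m = len(y)

for nu in (0.1, 0.3, 0.5):
    clf = NuSVC(nu=nu, kernel="rbf", gamma=1.0).fit(X, y)
    frac_sv = clf.n_support_.sum() / m               # should be >= nu   (Proposition 1, ii)
    frac_err = np.mean(clf.predict(X) != y)          # training errors <= margin errors <= nu (i)
    print(f"nu={nu:.1f}  fraction of SVs={frac_sv:.2f}  training error={frac_err:.2f}")
```

The toy example in Figure 4 and Table 1 below reports the same two fractions, together with the resulting margin, for a range of ν values.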


Before we get into the technical details of the dual derivation, let us take a look at a toy example illustrating the influence of ν (Figure 4). The corresponding fractions of SVs and margin errors are listed in Table 1.

Fig. 4. Toy problem (task: to separate circles from disks) solved using ν-SV classification, with parameter values ranging from ν = 0.1 (top left) to ν = 0.8 (bottom right). The larger we make ν, the more points are allowed to lie inside the margin (depicted by dotted lines). Results are shown for a Gaussian kernel, k(x, x') = exp(−‖x − x'‖²) (from [31]).

Table 1. Fractions of errors and SVs, along with the margins of class separation, for the toy example in Figure 4. Note that ν upper bounds the fraction of errors and lower bounds the fraction of SVs, and that increasing ν, i.e., allowing more errors, increases the margin.

ν                    0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8
fraction of errors   0.00   0.07   0.25   0.32   0.39   0.50   0.61   0.71
fraction of SVs      0.29   0.36   0.43   0.46   0.57   0.68   0.79   0.86
margin ρ/‖w‖         0.005  0.018  0.115  0.156  0.364  0.419  0.461  0.546

Let us next derive the dual of the ν-SV classification algorithm. We consider the Lagrangian
\[
L(w, \xi, b, \rho, \alpha, \beta, \delta) = \tfrac{1}{2}\|w\|^2 - \nu\rho + \frac{1}{m}\sum_{i=1}^{m} \xi_i - \sum_{i=1}^{m} \bigl(\alpha_i (y_i(\langle x_i, w\rangle + b) - \rho + \xi_i) + \beta_i \xi_i\bigr) - \delta\rho, \tag{60}
\]
using multipliers α_i, β_i, δ ≥ 0. This function has to be minimized with respect to the primal variables w, ξ, b, ρ, and maximized with respect to the dual variables α, β, δ. Following the same derivation as in (31)-(33), we compute the corresponding partial derivatives and set them to 0, obtaining the following conditions:
\[
w = \sum_{i=1}^{m} \alpha_i y_i x_i, \tag{61}
\]
\[
\alpha_i + \beta_i = 1/m, \tag{62}
\]
\[
\sum_{i=1}^{m} \alpha_i y_i = 0, \tag{63}
\]
\[
\sum_{i=1}^{m} \alpha_i - \delta = \nu. \tag{64}
\]


Again, in the SV expansion (61), the α_i that are non-zero correspond to a constraint (57) which is precisely met.

Substituting (61) and (62) into L, using α_i, β_i, δ ≥ 0, and incorporating kernels for dot products, leaves us with the following quadratic optimization problem for ν-SV classification:
\[
\max_{\alpha \in \mathbb{R}^m}\ W(\alpha) = -\frac{1}{2}\sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j k(x_i, x_j) \tag{65}
\]
\[
\text{subject to} \quad 0 \le \alpha_i \le \frac{1}{m}, \tag{66}
\]
\[
\sum_{i=1}^{m} \alpha_i y_i = 0, \tag{67}
\]
\[
\sum_{i=1}^{m} \alpha_i \ge \nu. \tag{68}
\]
As above, the resulting decision function can be shown to take the form
\[
f(x) = \operatorname{sgn}\Bigl(\sum_{i=1}^{m} \alpha_i y_i k(x, x_i) + b\Bigr). \tag{69}
\]
Compared with the C-SVC dual ((41), (55)), there are two differences. First, there is an additional constraint (68). Second, the linear term Σ_{i=1}^m α_i no longer appears in the objective function (65). This has an interesting consequence: (65) is now quadratically homogeneous in α. It is straightforward to verify that the same decision function is obtained if we start with the primal function
\[
\tau(w, \xi, \rho) = \tfrac{1}{2}\|w\|^2 + C\Bigl(-\nu\rho + \frac{1}{m}\sum_{i=1}^{m} \xi_i\Bigr), \tag{70}
\]
i.e., if one does use C [31].

The computation of the threshold b and the margin parameter ρ will be discussed in Section 7.4.

A connection to standard SV classification, and a somewhat surprising interpretation of the regularization parameter C, is described by the following result:

Proposition 2 (Connection between ν-SVC and C-SVC [32]). If ν-SV classification leads to ρ > 0, then C-SV classification, with C set a priori to 1/(mρ), leads to the same decision function.

For further details on the connection between ν-SVMs and C-SVMs, see [16, 6]. By considering the optimal α as a function of the parameters, a complete account is as follows:

Proposition 3 (Detailed connection between ν-SVC and C-SVC [11]). Σ_{i=1}^m α_i/(Cm) computed by the C-SVM is a well defined decreasing function of C. We can define
\[
\lim_{C \to \infty} \frac{\sum_{i=1}^{m} \alpha_i}{Cm} = \nu_{\min} \ge 0 \quad \text{and} \quad \lim_{C \to 0} \frac{\sum_{i=1}^{m} \alpha_i}{Cm} = \nu_{\max} \le 1. \tag{71}
\]
Then:


1. ν_max = 2 min(m_+, m_−)/m.
2. For any ν > ν_max, the dual ν-SVM is infeasible; that is, the set of feasible points is empty. For any ν ∈ (ν_min, ν_max], the optimal solution set of the dual ν-SVM is the same as that of either one or several C-SVMs, where these C form an interval. In addition, the optimal objective value of the ν-SVM is strictly positive. For any 0 ≤ ν ≤ ν_min, the dual ν-SVM is feasible with zero optimal objective value.
3. If the kernel matrix is positive definite, then ν_min = 0.

Therefore, for a given problem and kernel, there is an interval [ν_min, ν_max] of admissible values for ν, with 0 ≤ ν_min ≤ ν_max ≤ 1. An illustration of the relation between ν and C is in Figure 5.

Fig. 5. The relation between ν and C (using the RBF kernel on the problem australian from the Statlog collection [25]).

It has been noted that ν-SVMs have an interesting interpretation in terms of reduced convex hulls [16, 6]. One can show that for separable problems, one can obtain the optimal margin separating hyperplane by forming the convex hulls of both classes, finding the shortest connection between the two convex hulls (since the problem is separable, they are disjoint), and putting the hyperplane halfway along that connection, orthogonal to it. If a problem is non-separable, however, the convex hulls of the two classes will no longer be disjoint. Therefore, it no longer makes sense to search for the shortest line connecting them. In this situation, it seems natural to reduce the convex hulls in size, by limiting the size of the coefficients c_i in the convex sets
\[
C_{\pm} := \Bigl\{\sum_{y_i = \pm 1} c_i x_i \ \Big|\ \sum_{y_i = \pm 1} c_i = 1,\ c_i \ge 0\Bigr\} \tag{72}
\]


to some value ν ∈ (0, 1). Intuitively, this amounts to limiting the influence of individual points. It is possible to show that the ν-SVM formulation solves the problem of finding the hyperplane orthogonal to the closest line connecting the reduced convex hulls [16].

We now move on to another aspect of soft margin classification. When we introduced the slack variables, we did not attempt to justify the fact that in the objective function, we used a penalizer Σ_{i=1}^m ξ_i. Why not use another penalizer, such as Σ_{i=1}^m ξ_i^p for some p ≥ 0 [15]? For instance, p = 0 would yield a penalizer that exactly counts the number of margin errors. Unfortunately, however, it is also a penalizer that leads to a combinatorial optimization problem. Penalizers yielding optimization problems that are particularly convenient, on the other hand, are obtained for p = 1 and p = 2. By default, we use the former, as it possesses an additional property which is statistically attractive. As the following proposition shows, linearity of the target function in the slack variables ξ_i leads to a certain "outlier" resistance of the estimator. As above, we use the shorthand x_i for Φ(x_i).

Proposition 4 (Resistance of SV classification [32]). Suppose w can be expressed in terms of the SVs which are not at bound,
\[
w = \sum_{i=1}^{m} \gamma_i x_i \tag{73}
\]
with γ_i ≠ 0 only if α_i ∈ (0, 1/m) (where the α_i are the coefficients of the dual solution). Then local movements of any margin error x_j parallel to w do not change the hyperplane.²

This result is about the stability of classifiers. Results have also shown that in general p = 1 leads to fewer support vectors. Further results in support of the p = 1 case can be seen in [34, 36].

Although Proposition 1 shows that ν possesses an intuitive meaning, it is still unclear how to choose ν for a learning task. [35] proves that given R̄, a close upper bound on the expected optimal Bayes risk, an asymptotically good estimate of the optimal value of ν is 2R̄:

Proposition 5. If R[f] is the expected risk defined in (17),
\[
R_p := \inf_{f} R[f], \tag{74}
\]
and the kernel used by the ν-SVM is universal, then for all ν > 2R_p and all ε > 0, there exists a constant c > 0 such that
\[
P\bigl(T = \{(x_1, y_1), \ldots, (x_m, y_m)\} \ \big|\ R[f^{\nu}_T] \le \nu - R_p + \varepsilon\bigr) \ge 1 - e^{-cm}. \tag{75}
\]
Quite a few popular kernels, such as the Gaussian, are universal. The definition of a universal kernel can be seen in [35]. Here, f^ν_T is the decision function obtained by training the ν-SVM on the data set T.

² Note that the perturbation of the point is carried out in feature space. What it precisely corresponds to in input space therefore depends on the specific kernel chosen.


Therefore, given an upper bound R̄ on R_p, the decision function with respect to ν = 2R̄ almost surely achieves a risk not larger than R_p + 2(R̄ − R_p).

The selection of ν and kernel parameters can be done by estimating the performance of support vector binary classifiers on data not yet observed. One such performance estimate is the leave-one-out error, which is an almost unbiased estimate of the generalization performance [22]. To compute this performance metric, a single point is excluded from the training set, and the classifier is trained using the remaining points. It is then determined whether this new classifier correctly labels the point that was excluded. The process is repeated over the entire training set. Although theoretically attractive, this estimate obviously entails a large computational cost.

Three estimates of the leave-one-out error for the ν-SV learning algorithm are presented in [17]. Of these three estimates, the general ν-SV bound is an upper bound on the leave-one-out error, the restricted ν-SV estimate is an approximation that assumes the sets of margin errors and support vectors on the margin to be constant, and the maximized target estimate is an approximation that assumes the sets of margin errors and non-support vectors not to decrease. The derivation of the general ν-SV bound takes a form similar to an upper bound described in [40] for the C-SV classifier, while the restricted ν-SV estimate is based on a similar C-SV estimate proposed in [40, 26]: both these estimates are based on the geometrical concept of the span, which is (roughly speaking) a measure of how easily a particular point in the training sample can be replaced by the other points used to define the classification function. No analogous method exists in the C-SV case for the maximized target estimate.
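The plain (non-approximated) leave-one-out procedure described above can be sketched as follows, assuming scikit-learn's LeaveOneOut splitter and NuSVC on made-up data; this is the exact but expensive estimate, not one of the three approximations of [17].

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import NuSVC

# Made-up data; in practice X, y would be the training set under consideration
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1.0, 1.0, (30, 2)), rng.normal(+1.0, 1.0, (30, 2))])
y = np.array([-1] * 30 + [+1] * 30)

errors = 0
for train_idx, test_idx in LeaveOneOut().split(X):
    clf = NuSVC(nu=0.3, kernel="rbf", gamma=1.0).fit(X[train_idx], y[train_idx])
    errors += int(clf.predict(X[test_idx])[0] != y[test_idx][0])
print("leave-one-out error estimate:", errors / len(y))
```

The m retrainings are what makes this estimate costly, which is the motivation for the cheaper bounds and approximations mentioned above.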


7 Implementation of ν-SV Classifiers

We rewrite the dual form of ν-SV classifiers as a minimization problem:
\[
\min_{\alpha \in \mathbb{R}^m}\ W(\alpha) = \frac{1}{2}\sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j k(x_i, x_j)
\]
\[
\text{subject to} \quad 0 \le \alpha_i \le \frac{1}{m}, \tag{76}
\]
\[
\sum_{i=1}^{m} \alpha_i y_i = 0, \qquad \sum_{i=1}^{m} \alpha_i = \nu. \tag{77}
\]
[11] proves that for any given ν, there is at least one optimal solution which satisfies e^T α = ν. Therefore, it is sufficient to solve the simpler problem with the equality constraint (77).

Similar to C-SVC, the difficulty of solving (76) is that the y_i y_j k(x_i, x_j) are in general not zero. Thus, for large data sets, the Hessian (second derivative) matrix of the objective function cannot be stored in the computer memory, so traditional optimization methods such as Newton or quasi-Newton cannot be directly used. Currently, the decomposition method is the most widely used approach to overcome this difficulty. Here, we present the implementation in [11], which modifies the procedure for C-SVC.

7.1 The Decomposition Method

The decomposition method is an iterative process. In each step, the index set of variables is partitioned into two sets B and N, where B is the working set. Then, in that iteration, variables corresponding to N are fixed while a sub-problem in the variables corresponding to B is minimized. The procedure is as follows:

Algorithm 1 (Decomposition method)
1. Given a number q ≤ l as the size of the working set, find α¹ as an initial feasible solution of (76). Set k = 1.
2. If α^k is an optimal solution of (76), stop. Otherwise, find a working set B ⊂ {1, ..., l} whose size is q. Define N ≡ {1, ..., l}\B, and let α^k_B and α^k_N be the sub-vectors of α^k corresponding to B and N, respectively.
3. Solve the following sub-problem with the variable α_B:
\[
\min_{\alpha_B \in \mathbb{R}^q}\ \frac{1}{2}\sum_{i \in B, j \in B} \alpha_i \alpha_j y_i y_j k(x_i, x_j) + \sum_{i \in B, j \in N} \alpha_i \alpha^k_j y_i y_j k(x_i, x_j)
\]
\[
\text{subject to} \quad 0 \le \alpha_i \le \frac{1}{m},\ i \in B, \tag{78}
\]
\[
\sum_{i \in B} \alpha_i y_i = -\sum_{i \in N} \alpha^k_i y_i, \tag{79}
\]
\[
\sum_{i \in B} \alpha_i = \nu - \sum_{i \in N} \alpha^k_i. \tag{80}
\]
4. Set α^{k+1}_B to be the optimal solution of (78) and α^{k+1}_N ≡ α^k_N. Set k ← k + 1 and go to Step 2.

Note that B is updated in each iteration. To simplify the notation, we simply use B instead of B^k.

7.2 Working Set Selection

An important issue of the decomposition method is the selection of the working set B. Here, we consider an approach based on the violation of the KKT condition. Similar to (30), by putting (61) into (57), one of the KKT conditions is
\[
\alpha_i \cdot \Bigl[y_i\Bigl(\sum_{j=1}^{m} \alpha_j y_j K(x_i, x_j) + b\Bigr) - \rho + \xi_i\Bigr] = 0, \quad i = 1, \ldots, m. \tag{81}
\]
Using 0 ≤ α_i ≤ 1/m, (81) can be rewritten as
\[
\sum_{j=1}^{m} \alpha_j y_i y_j k(x_i, x_j) + b y_i - \rho \ge 0, \quad \text{if } \alpha_i < \frac{1}{m},
\]
\[
\sum_{j=1}^{m} \alpha_j y_i y_j k(x_i, x_j) + b y_i - \rho \le 0, \quad \text{if } \alpha_i > 0. \tag{82}
\]


That is, an α is optimal for the dual problem (76) if and only if α is feasible and satisfies (81). Using the property that y_i = ±1 and writing ∇W(α)_i = Σ_{j=1}^m α_j y_i y_j k(x_i, x_j), (82) can be further written as

$$
\max_{i \in I^{1}_{\mathrm{up}}(\alpha)} \nabla W(\alpha)_i \;\le\; \rho - b \;\le\; \min_{i \in I^{1}_{\mathrm{low}}(\alpha)} \nabla W(\alpha)_i
\quad\text{and}\quad
\max_{i \in I^{-1}_{\mathrm{low}}(\alpha)} \nabla W(\alpha)_i \;\le\; \rho + b \;\le\; \min_{i \in I^{-1}_{\mathrm{up}}(\alpha)} \nabla W(\alpha)_i, \qquad (83)
$$

where

$$
I^{1}_{\mathrm{up}}(\alpha) := \{ i \mid \alpha_i > 0,\ y_i = 1 \}, \qquad
I^{1}_{\mathrm{low}}(\alpha) := \{ i \mid \alpha_i < \tfrac{1}{m},\ y_i = 1 \}, \qquad (84)
$$

and

$$
I^{-1}_{\mathrm{up}}(\alpha) := \{ i \mid \alpha_i < \tfrac{1}{m},\ y_i = -1 \}, \qquad
I^{-1}_{\mathrm{low}}(\alpha) := \{ i \mid \alpha_i > 0,\ y_i = -1 \}. \qquad (85)
$$

We call any (i, j) ∈ I^1_up(α) × I^1_low(α) or I^{-1}_up(α) × I^{-1}_low(α) satisfying

$$
y_i \nabla W(\alpha)_i > y_j \nabla W(\alpha)_j \qquad (86)
$$

a violating pair, as (83) is then not satisfied. When α is not yet optimal, if any such violating pair is included in B, the optimal objective value of (78) is smaller than that at α^k. Therefore, the decomposition procedure has its objective value strictly decreasing from one iteration to the next.

A natural choice of B is therefore to select the pairs which violate (83) the most. To be more precise, we can set q to be an even integer and sequentially select q/2 pairs {(i_1, j_1), . . . , (i_{q/2}, j_{q/2})} from I^1_up(α) × I^1_low(α) or I^{-1}_up(α) × I^{-1}_low(α) such that

$$
y_{i_1} \nabla W(\alpha)_{i_1} - y_{j_1} \nabla W(\alpha)_{j_1} \ \ge\ \cdots\ \ge\ y_{i_{q/2}} \nabla W(\alpha)_{i_{q/2}} - y_{j_{q/2}} \nabla W(\alpha)_{j_{q/2}}. \qquad (87)
$$

This working set selection is merely an extension of that for C-SVC. The main difference is that for C-SVM, (83) becomes only one inequality involving b. Due to this similarity, we believe that the convergence analysis of C-SVC [21] can be adapted here, though detailed proofs have not been written and published.

[11] considers the same working set selection. However, following the derivation for C-SVC in [19], it is obtained there using the concept of feasible directions in constrained optimization. We feel that a derivation from the violation of the KKT condition is more intuitive.

7.3 SMO-type Implementation

The Sequential Minimal Optimization (SMO) algorithm [28] is an extreme of the decomposition method where, for C-SVC, the working set is restricted to only two elements. The main advantage is that each two-variable sub-problem can be solved analytically, so numerical optimization software is not needed.
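To make the selection rule (83)–(87) concrete, here is a small NumPy sketch (not from the original text) that computes the index sets (84)–(85) and returns one maximal violating pair; the function name, tolerance, and data layout are illustrative assumptions, not LIBSVM's actual code.

```python
import numpy as np

def maximal_violating_pair(alpha, y, K, tol=1e-12):
    """Return the most violating pair (i, j) in the sense of (86)-(87),
    or None if the optimality condition (83) holds.

    alpha : dual variables, shape (m,);  y : labels in {+1,-1};  K : kernel matrix.
    """
    m = len(alpha)
    grad = (np.outer(y, y) * K) @ alpha          # grad[i] = sum_j alpha_j y_i y_j k(x_i, x_j)
    best = None
    for cls in (+1, -1):
        same = (y == cls)
        if cls == +1:
            up = same & (alpha > tol)             # I_up^{+1}
            low = same & (alpha < 1.0 / m - tol)  # I_low^{+1}
        else:
            up = same & (alpha < 1.0 / m - tol)   # I_up^{-1}
            low = same & (alpha > tol)            # I_low^{-1}
        if not up.any() or not low.any():
            continue
        i = np.flatnonzero(up)[np.argmax((y * grad)[up])]
        j = np.flatnonzero(low)[np.argmin((y * grad)[low])]
        violation = y[i] * grad[i] - y[j] * grad[j]
        if violation > tol and (best is None or violation > best[0]):
            best = (violation, i, j)
    return None if best is None else (best[1], best[2])
```

Pairs are drawn only within a class, which is exactly why two elements suffice for the ν-SVC sub-problem, as explained next.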


For this method, at least two elements are required for the working set. Otherwise, the equality constraint Σ_{i∈B} α_i y_i = −Σ_{j∈N} α^k_j y_j leads to a fixed optimal objective value of the sub-problem, and the decomposition procedure stays at the same point.

Now the dual of ν-SVC possesses two equality constraints, so we might think that more elements are needed for the working set. Indeed, two elements are still enough for the case of ν-SVC. Note that (79) and (80) can be rewritten as

$$
\sum_{i\in B,\, y_i=1} \alpha_i = \frac{\nu}{2} - \sum_{i\in N,\, y_i=1} \alpha_i^k
\qquad\text{and}\qquad
\sum_{i\in B,\, y_i=-1} \alpha_i = \frac{\nu}{2} - \sum_{i\in N,\, y_i=-1} \alpha_i^k. \qquad (88)
$$

Thus, if (i_1, j_1) is selected as the working set using (87), then y_{i_1} = y_{j_1}, so (88) reduces to only one equality with two variables. The optimal objective value of the sub-problem is then still guaranteed to be smaller than that at α^k.

The comparison in [11] shows that, using C and ν related through the connection in proposition 3 and equivalent stopping conditions, the performance of the SMO-type implementations described here for C-SVM and ν-SVM is comparable.

7.4 The Calculation of b and ρ and Stopping Criteria

If at an optimal solution 0 < α_i < 1/m and y_i = 1, then i ∈ I^1_up(α) and i ∈ I^1_low(α). Thus, ρ − b = ∇W(α)_i. Similarly, if there is another 0 < α_j < 1/m with y_j = −1, then ρ + b = ∇W(α)_j. Solving these two equalities then gives b and ρ. In practice, we average ∇W(α)_i over the free variables to avoid numerical errors:

$$
\rho - b = \frac{\sum_{\,0<\alpha_i<1/m,\ y_i=1} \nabla W(\alpha)_i}{\big|\{\, i : 0<\alpha_i<1/m,\ y_i=1 \,\}\big|},
$$

and analogously for ρ + b using the free variables with y_i = −1.
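The following NumPy fragment (illustrative only, not LIBSVM's actual routine) recovers ρ and b from an approximately optimal α by averaging the gradient over the free variables of each class, as just described; it assumes each class has at least one free variable.

```python
import numpy as np

def recover_rho_b(alpha, y, K, tol=1e-12):
    """Compute (rho, b) from an (approximately) optimal alpha of the nu-SVC dual,
    averaging grad_i over free variables 0 < alpha_i < 1/m of each class."""
    m = len(alpha)
    grad = (np.outer(y, y) * K) @ alpha
    free_pos = (y == 1) & (alpha > tol) & (alpha < 1.0 / m - tol)
    free_neg = (y == -1) & (alpha > tol) & (alpha < 1.0 / m - tol)
    r1 = grad[free_pos].mean()   # estimate of rho - b
    r2 = grad[free_neg].mean()   # estimate of rho + b
    return (r1 + r2) / 2.0, (r2 - r1) / 2.0   # (rho, b)
```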


two classes of data and obtains a decision function. Thus, for a k-class problem, there are k(k−1)/2 decision functions. In the prediction stage, a voting strategy is used where the test point is assigned to the class with the maximum number of votes. In [18], it was experimentally shown that for general problems, using the C-SV classifier, various multi-class approaches give similar accuracy. However, the "one-against-one" method is more efficient for training. Here, we focus on extending it to ν-SVM.

Multi-class methods must be considered together with parameter-selection strategies. That is, we search for appropriate C and kernel parameters for constructing a better model. In the following, we restrict the discussion to the Gaussian (radial basis function) kernel k(x_i, x_j) = e^{−γ‖x_i − x_j‖²}, so the kernel parameter is γ. With parameter selection taken into account, there are two ways to implement the "one-against-one" method. First, for any two classes of data, parameter selection is conducted to find the best (C, γ). Thus, for the best model selected, each decision function has its own (C, γ). For the experiments here, the parameter selection of each binary SVM is done by five-fold cross-validation. The second way is that, for each (C, γ), an evaluation criterion (e.g. cross-validation) combined with the "one-against-one" method is used to estimate the performance of the model. A sequence of pre-selected (C, γ) is tried to select the best model. Therefore, for each model, all k(k−1)/2 decision functions share the same C and γ.

It is not very clear which of the two implementations is better. On the one hand, a single parameter set may not be uniformly good for all k(k−1)/2 decision functions. On the other hand, as the overall accuracy is the final consideration, one parameter set per decision function may lead to over-fitting. [14] is the first to compare the two approaches using C-SVM, where the preliminary results show that both give similar accuracy.

For ν-SVM, each binary SVM using data from the ith and the jth classes has an admissible interval [ν^{ij}_min, ν^{ij}_max], where ν^{ij}_max = 2 min(m_i, m_j)/(m_i + m_j) according to proposition 3. Here m_i and m_j are the numbers of data points in the ith and jth classes, respectively. Thus, if all k(k−1)/2 decision functions share the same ν, the admissible interval is

$$
\Big[ \max_{i\ne j} \nu^{ij}_{\min},\ \min_{i\ne j} \nu^{ij}_{\max} \Big]. \qquad (91)
$$

This set is non-empty if the kernel matrix is positive definite. The reason is that proposition 3 then implies ν^{ij}_min = 0 for all i ≠ j, so max_{i≠j} ν^{ij}_min = 0. Therefore, unlike C of C-SVM, which has a large valid range [0, ∞), for ν-SVM we worry that the admissible interval may be too small. For example, if the data set is highly unbalanced, min_{i≠j} ν^{ij}_max is very small.
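The "one-against-one" strategy itself is simple to sketch. The Python fragment below (an illustration added here, not from the original text) trains one ν-SV classifier per class pair, placing each pairwise ν inside its admissible interval, and predicts by majority vote; scikit-learn's NuSVC and the choice ν = 0.5·ν_max are assumptions for the sake of the example.

```python
import numpy as np
from itertools import combinations
from sklearn.svm import NuSVC

def train_one_vs_one(X, y, gamma=1.0):
    models = {}
    for ci, cj in combinations(np.unique(y), 2):
        mask = (y == ci) | (y == cj)
        mi, mj = np.sum(y == ci), np.sum(y == cj)
        nu_max = 2.0 * min(mi, mj) / (mi + mj)        # proposition 3
        clf = NuSVC(nu=0.5 * nu_max, kernel="rbf", gamma=gamma)
        models[(ci, cj)] = clf.fit(X[mask], y[mask])
    return models

def predict_one_vs_one(models, X):
    classes = sorted({c for pair in models for c in pair})
    votes = np.zeros((len(X), len(classes)), dtype=int)
    for (ci, cj), clf in models.items():
        pred = clf.predict(X)
        votes[:, classes.index(ci)] += (pred == ci)
        votes[:, classes.index(cj)] += (pred == cj)
    return np.array(classes)[votes.argmax(axis=1)]
```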
We redo the same comparison as that in [14] for ν-SVM. Results are given in Table 2. We consider the multi-class problems tested in [18], most of which are from the statlog collection [25]. Except for the data sets dna, shuttle, letter, satimage, and usps, where test sets are available, we split each problem into 80% training and 20% testing. Cross-validation is then conducted only on the training data. All other settings, such as data scaling, are the same as those in [18]. Experiments are conducted using LIBSVM [10], which solves both C-SVM and ν-SVM.

Results in Table 2 show no significant difference among the four implementations. Note that some problems (e.g. shuttle) are highly unbalanced, so the admissible interval (91) is very small.
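For orientation, the parameter search used in such experiments can be sketched as follows (an illustration added here, not the exact LIBSVM scripts used in the text): a grid over γ and a 10-point discretization of ν inside the admissible interval (91), scored by 5-fold cross-validation; scikit-learn and the grid values are assumptions.

```python
import numpy as np
from sklearn.svm import NuSVC
from sklearn.model_selection import cross_val_score

def select_nu_gamma(X, y, nu_interval):
    """Pick (nu, gamma) by 5-fold cross-validation over a small grid."""
    nu_lo, nu_hi = nu_interval
    best = (None, None, -np.inf)
    for gamma in 2.0 ** np.arange(-15, 4, 2):
        for nu in np.linspace(nu_lo + 1e-3, nu_hi - 1e-3, 10):
            acc = cross_val_score(NuSVC(nu=nu, kernel="rbf", gamma=gamma),
                                  X, y, cv=5).mean()
            if acc > best[2]:
                best = (nu, gamma, acc)
    return best   # (nu, gamma, cv accuracy)
```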


Surprisingly, even from such small intervals we can still find a suitable ν which leads to a good model. This preliminary experiment indicates that in general the use of the "one-against-one" approach for multi-class ν-SVM is viable.

Table 2. Test accuracy (in percentage) of multi-class data sets by C-SVM and ν-SVM. The columns "Common C", "Different C", "Common ν", "Different ν" give the test accuracy of using the same or different (C, γ) (or (ν, γ)) for all k(k−1)/2 decision functions. The validation is conducted on the following grid of (C, γ): [2^{−5}, 2^{−3}, . . . , 2^{15}] × [2^{−15}, 2^{−13}, . . . , 2^{3}]. For ν-SVM, the range of γ is the same, but we validate a 10-point discretization of ν in the interval (91) or [ν^{ij}_min, ν^{ij}_max], depending on whether the k(k−1)/2 decision functions share the same parameters or not. For small problems (number of training data ≤ 1000), we do cross-validation five times and then average the test accuracy.

Data set   Classes  # training  # testing  Common C  Different C  Common ν  Different ν
vehicle    4        677         169        86.5      87.1         85.9      87.8
glass      6        171         43         72.2      70.7         73.0      69.3
iris       3        120         30         96.0      93.3         94.0      94.6
dna        3        2000        1186       95.6      95.1         95.0      94.8
segment    7        1848        462        98.3      97.2         96.7      97.6
shuttle    7        43500       14500      99.9      99.9         99.7      99.8
letter     26       15000       5000       97.9      97.7         97.9      96.8
vowel      11       423         105        98.1      97.7         98.3      96.0
satimage   6        4435        2000       91.9      92.2         92.1      91.9
wine       3        143         35         97.1      97.1         97.1      96.6
usps       10       7291        2007       95.3      95.2         95.3      94.8

We also present the contours of C-SVM and ν-SVM in Figure 6, using the approach in which all decision functions share the same (C, γ). In the contour for C-SVM, the x-axis and y-axis are log₂C and log₂γ, respectively. For ν-SVM, the x-axis is ν in the interval (91). Clearly, the good region when using ν-SVM is smaller. This confirms our earlier concern, which motivated the experiments in this section. Fortunately, points in this smaller good region still lead to models that are competitive with those obtained by C-SVM.

There are some ways to enlarge the admissible interval of ν. [27] extends the algorithm to very small values of ν by allowing negative margins. For the upper bound, according to proposition 3 above, if the classes are balanced, then the upper bound is 1. This suggests modifying the algorithm by adjusting the cost function such that the classes are balanced in terms of cost, even if they are not balanced in terms of the mere numbers of training examples. An earlier discussion of such formulations is in [12]. For example, we can consider the following formulation:

$$
\begin{aligned}
\min_{w\in\mathcal{H},\ \xi\in\mathbb{R}^m,\ \rho, b\in\mathbb{R}} \quad & \tau(w,\xi,\rho) = \frac{1}{2}\|w\|^2 - \nu\rho + \frac{1}{2m_+}\sum_{i:\,y_i=1}\xi_i + \frac{1}{2m_-}\sum_{i:\,y_i=-1}\xi_i \\
\text{subject to} \quad & y_i(\langle x_i, w\rangle + b) \ge \rho - \xi_i, \\
& \xi_i \ge 0, \quad \rho \ge 0.
\end{aligned}
$$


[Two contour plots omitted: 5-fold cross-validation accuracy contours (roughly 88.5–91.5%) for satimage, plotted over log₂(γ) versus log₂(C) for C-SVM and over log₂(γ) versus ν for ν-SVM.]

Fig. 6. 5-fold cross-validation accuracy of the data set satimage. Left: C-SVM, Right: ν-SVM

The dual is

$$
\begin{aligned}
\max_{\alpha\in\mathbb{R}^m} \quad & W(\alpha) = -\frac{1}{2}\sum_{i,j=1}^m \alpha_i\alpha_j y_i y_j k(x_i, x_j) \\
\text{subject to} \quad & 0 \le \alpha_i \le \frac{1}{2m_+} \ \text{ if } y_i = 1, \\
& 0 \le \alpha_i \le \frac{1}{2m_-} \ \text{ if } y_i = -1, \\
& \sum_{i=1}^m \alpha_i y_i = 0, \qquad \sum_{i=1}^m \alpha_i \ge \nu.
\end{aligned}
$$

Clearly, when every α_i equals its corresponding upper bound, α is a feasible solution with Σ_{i=1}^m α_i = 1.

Another possibility is

$$
\begin{aligned}
\min_{w\in\mathcal{H},\ \xi\in\mathbb{R}^m,\ \rho, b\in\mathbb{R}} \quad & \tau(w,\xi,\rho) = \frac{1}{2}\|w\|^2 - \nu\rho + \frac{1}{2\min(m_+, m_-)}\sum_{i=1}^m \xi_i \\
\text{subject to} \quad & y_i(\langle x_i, w\rangle + b) \ge \rho - \xi_i, \\
& \xi_i \ge 0, \quad \rho \ge 0.
\end{aligned}
$$


The dual is

$$
\begin{aligned}
\max_{\alpha\in\mathbb{R}^m} \quad & W(\alpha) = -\frac{1}{2}\sum_{i,j=1}^m \alpha_i\alpha_j y_i y_j k(x_i, x_j) \\
\text{subject to} \quad & 0 \le \alpha_i \le \frac{1}{2\min(m_+, m_-)}, \\
& \sum_{i=1}^m \alpha_i y_i = 0, \qquad \sum_{i=1}^m \alpha_i \ge \nu.
\end{aligned}
$$

Then the largest admissible ν is 1. A slight modification of the implementation in Section 7 for the above formulations is given in [13].

9 Applications of ν-SV Classifiers

Researchers have applied ν-SVM to different applications. Some of them find it easier and more intuitive to deal with ν ∈ [0, 1] than with C ∈ [0, ∞). Here, we briefly summarize some work which uses LIBSVM to solve ν-SVM.

In [7], researchers from HP Labs discuss the topic of a personal email agent. Data classification is an important component, for which the authors use ν-SVM because they think "the ν parameter is more intuitive than the C parameter."

[23] applies machine learning methods to detect and localize boundaries of natural images. Several classifiers are tested; for SVM, the authors considered ν-SVM.

10 Conclusion

One of the most appealing features of kernel algorithms is the solid foundation provided by both statistical learning theory and functional analysis. Kernel methods let us interpret (and design) learning algorithms geometrically in feature spaces nonlinearly related to the input space, and combine statistics and geometry in a promising way. Kernels provide an elegant framework for studying three fundamental issues of machine learning:

– Similarity measures — the kernel can be viewed as a (nonlinear) similarity measure, and should ideally incorporate prior knowledge about the problem at hand.
– Data representation — as described above, kernels induce representations of the data in a linear space.
– Function class — due to the representer theorem, the kernel implicitly also determines the function class which is used for learning.

Support vector machines have been one of the major kernel methods for data classification. The original form requires a parameter C ∈ [0, ∞), which controls the trade-off between classifier capacity and training errors. Using the ν-parameterization, the parameter C is replaced by a parameter ν ∈ [0, 1]. In this tutorial, we have given its derivation and presented possible advantages of using the ν-support vector classifier.


Acknowledgments

The authors thank Ingo Steinwart and Arthur Gretton for some helpful comments.

References

1. M. A. Aizerman, É. M. Braverman, and L. I. Rozonoér. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821–837, 1964.
2. N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM, 44(4):615–631, 1997.
3. M. Avriel. Nonlinear Programming. Prentice-Hall, New Jersey, 1976.
4. P. L. Bartlett and J. Shawe-Taylor. Generalization performance of support vector machines and other pattern classifiers. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods — Support Vector Learning, pages 43–54, Cambridge, MA, 1999. MIT Press.
5. M. S. Bazaraa, H. D. Sherali, and C. M. Shetty. Nonlinear Programming: Theory and Algorithms. Wiley, second edition, 1993.
6. K. P. Bennett and E. J. Bredensteiner. Duality and geometry in SVM classifiers. In P. Langley, editor, Proceedings of the 17th International Conference on Machine Learning, pages 57–64, San Francisco, California, 2000. Morgan Kaufmann.
7. R. Bergman, M. Griss, and C. Staelin. A personal email assistant. Technical Report HPL-2002-236, HP Laboratories, Palo Alto, CA, 2002.
8. B. E. Boser, I. M. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144–152, Pittsburgh, PA, July 1992. ACM Press.
9. C. J. C. Burges and B. Schölkopf. Improving the accuracy and speed of support vector learning machines. In M. Mozer, M. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9, pages 375–381, Cambridge, MA, 1997. MIT Press.
10. C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
11. C.-C. Chang and C.-J. Lin. Training ν-support vector classifiers: Theory and algorithms. Neural Computation, 13(9):2119–2147, 2001.
12. H.-G. Chew, R. E. Bogner, and C.-C. Lim. Dual ν-support vector machine with error rate and training size biasing. In Proceedings of ICASSP, pages 1269–72, 2001.
13. H. G. Chew, C. C. Lim, and R. E. Bogner. An implementation of training dual-nu support vector machines. In Qi, Teo, and Yang, editors, Optimization and Control with Applications. Kluwer, 2003.
14. K.-M. Chung, W.-C. Kao, C.-L. Sun, and C.-J. Lin. Decomposition methods for linear support vector machines. Technical report, Department of Computer Science and Information Engineering, National Taiwan University, 2002.
15. C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273–297, 1995.
16. D. J. Crisp and C. J. C. Burges. A geometric interpretation of ν-SVM classifiers. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12. MIT Press, 2000.
17. A. Gretton, R. Herbrich, O. Chapelle, B. Schölkopf, and P. J. W. Rayner. Estimating the leave-one-out error for classification learning with SVMs. Technical Report CUED/F-INFENG/TR.424, Cambridge University Engineering Department, 2001.


18. C.-W. Hsu and C.-J. Lin. A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks, 13(2):415–425, 2002.
19. T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods — Support Vector Learning, pages 169–184, Cambridge, MA, 1999. MIT Press.
20. C.-J. Lin. Formulations of support vector machines: a note from an optimization point of view. Neural Computation, 13(2):307–317, 2001.
21. C.-J. Lin. On the convergence of the decomposition method for support vector machines. IEEE Transactions on Neural Networks, 12(6):1288–1298, 2001.
22. A. Luntz and V. Brailovsky. On estimation of characters obtained in statistical procedure of recognition. Technicheskaya Kibernetica, 3, 1969.
23. D. R. Martin, C. C. Fowlkes, and J. Malik. Learning to detect natural image boundaries using brightness and texture. In Advances in Neural Information Processing Systems, volume 14, 2002.
24. J. Mercer. Functions of positive and negative type and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society, London, A 209:415–446, 1909.
25. D. Michie, D. J. Spiegelhalter, and C. C. Taylor. Machine Learning, Neural and Statistical Classification. Prentice Hall, Englewood Cliffs, N.J., 1994. Data available at http://www.ncc.up.pt/liacc/ML/statlog/datasets.html.
26. M. Opper and O. Winther. Gaussian processes and SVM: Mean field and leave-one-out estimator. In A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, Cambridge, MA, 2000. MIT Press.
27. F. Perez-Cruz, J. Weston, D. J. L. Herrmann, and B. Schölkopf. Extension of the ν-SVM range for classification. In J. Suykens, G. Horvath, S. Basu, C. Micchelli, and J. Vandewalle, editors, Advances in Learning Theory: Methods, Models and Applications, 190, pages 179–196, Amsterdam, 2003. IOS Press.
28. J. C. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods — Support Vector Learning, Cambridge, MA, 1998. MIT Press.
29. B. Schölkopf. Support Vector Learning. R. Oldenbourg Verlag, München, 1997. Doktorarbeit, Technische Universität Berlin. Available from http://www.kyb.tuebingen.mpg.de/~bs.
30. B. Schölkopf, C. J. C. Burges, and A. J. Smola. Advances in Kernel Methods — Support Vector Learning. MIT Press, Cambridge, MA, 1999.
31. B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
32. B. Schölkopf, A. J. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. Neural Computation, 12:1207–1245, 2000.
33. A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans. Advances in Large Margin Classifiers. MIT Press, Cambridge, MA, 2000.
34. I. Steinwart. Support vector machines are universally consistent. Journal of Complexity, 18:768–791, 2002.
35. I. Steinwart. On the optimal parameter choice for ν-support vector machines. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2003. To appear.
36. I. Steinwart. Sparseness of support vector machines. Technical report, 2003.
37. V. Vapnik. Estimation of Dependences Based on Empirical Data (in Russian). Nauka, Moscow, 1979. (English translation: Springer Verlag, New York, 1982).
38. V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, New York, 1995.
39. V. Vapnik. Statistical Learning Theory. Wiley, NY, 1998.
40. V. Vapnik and O. Chapelle. Bounds on error expectation for support vector machines. Neural Computation, 12(9):2013–2036, 2000.


41. V. Vapnik and A. Chervonenkis. Theory of Pattern Recognition (in Russian). Nauka, Moscow, 1974. (German translation: W. Wapnik & A. Tscherwonenkis, Theorie der Zeichenerkennung, Akademie-Verlag, Berlin, 1979).
42. V. Vapnik and A. Lerner. Pattern recognition using generalized portrait method. Automation and Remote Control, 24:774–780, 1963.
43. G. Wahba. Spline Models for Observational Data, volume 59 of CBMS-NSF Regional Conference Series in Applied Mathematics. Society for Industrial and Applied Mathematics, Philadelphia, Pennsylvania, 1990.
44. R. C. Williamson, A. J. Smola, and B. Schölkopf. Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators. IEEE Transactions on Information Theory, 47(6):2516–2532, 2001.
45. P. Wolfe. A duality theorem for non-linear programming. Quarterly of Applied Mathematics, 19:239–244, 1961.


Statistics and Computing 14: 199–222, 2004
© 2004 Kluwer Academic Publishers. Manufactured in The Netherlands.

A tutorial on support vector regression*

ALEX J. SMOLA and BERNHARD SCHÖLKOPF
RSISE, Australian National University, Canberra 0200, Australia. Alex.Smola@anu.edu.au
Max-Planck-Institut für biologische Kybernetik, 72076 Tübingen, Germany. Bernhard.Schoelkopf@tuebingen.mpg.de

Received July 2002 and accepted November 2003

In this tutorial we give an overview of the basic ideas underlying Support Vector (SV) machines for function estimation. Furthermore, we include a summary of currently used algorithms for training SV machines, covering both the quadratic (or convex) programming part and advanced methods for dealing with large datasets. Finally, we mention some modifications and extensions that have been applied to the standard SV algorithm, and discuss the aspect of regularization from a SV perspective.

Keywords: machine learning, support vector machines, regression estimation

* An extended version of this paper is available as NeuroCOLT Technical Report TR-98-030.

1. Introduction

The purpose of this paper is twofold. It should serve as a self-contained introduction to Support Vector regression for readers new to this rapidly developing field of research.¹ On the other hand, it attempts to give an overview of recent developments in the field.

To this end, we decided to organize the essay as follows. We start by giving a brief overview of the basic techniques in Sections 1, 2 and 3, plus a short summary with a number of figures and diagrams in Section 4. Section 5 reviews current algorithmic techniques used for actually implementing SV machines. This may be of most interest for practitioners. The following section covers more advanced topics such as extensions of the basic SV algorithm, connections between SV machines and regularization, and briefly mentions methods for carrying out model selection. We conclude with a discussion of open questions and problems and current directions of SV research. Most of the results presented in this review paper have already been published elsewhere, but the comprehensive presentation and some details are new.

1.1. Historic background

The SV algorithm is a nonlinear generalization of the Generalized Portrait algorithm developed in Russia in the sixties² (Vapnik and Lerner 1963, Vapnik and Chervonenkis 1964). As such, it is firmly grounded in the framework of statistical learning theory, or VC theory, which has been developed over the last three decades by Vapnik and Chervonenkis (1974) and Vapnik (1982, 1995).
In a nutshell, VC theory characterizes properties of learning machines which enable them to generalize well to unseen data.

In its present form, the SV machine was largely developed at AT&T Bell Laboratories by Vapnik and co-workers (Boser, Guyon and Vapnik 1992, Guyon, Boser and Vapnik 1993, Cortes and Vapnik 1995, Schölkopf, Burges and Vapnik 1995, 1996, Vapnik, Golowich and Smola 1997). Due to this industrial context, SV research has up to date had a sound orientation towards real-world applications. Initial work focused on OCR (optical character recognition). Within a short period of time, SV classifiers became competitive with the best available systems for both OCR and object recognition tasks (Schölkopf, Burges and Vapnik 1996, 1998a, Blanz et al. 1996, Schölkopf 1997). A comprehensive tutorial on SV classifiers has been published by Burges (1998). But also in regression and time series prediction applications, excellent performances were soon obtained (Müller et al. 1997, Drucker et al. 1997, Stitson et al. 1999, Mattera and Haykin 1999). A snapshot of the state of the art in SV learning was recently taken at the annual Neural Information Processing Systems conference (Schölkopf, Burges, and Smola 1999a). SV learning has now evolved into an active area of research. Moreover, it is in the process of entering the standard methods toolbox of machine learning (Haykin 1998, Cherkassky and Mulier 1998, Hearst et al. 1998).


Schölkopf and Smola (2002) contains a more in-depth overview of SVM regression. Additionally, Cristianini and Shawe-Taylor (2000) and Herbrich (2002) provide further details on kernels in the context of classification.

1.2. The basic idea

Suppose we are given training data {(x_1, y_1), . . . , (x_l, y_l)} ⊂ X × ℝ, where X denotes the space of the input patterns (e.g. X = ℝ^d). These might be, for instance, exchange rates for some currency measured at subsequent days together with corresponding econometric indicators. In ε-SV regression (Vapnik 1995), our goal is to find a function f(x) that has at most ε deviation from the actually obtained targets y_i for all the training data, and at the same time is as flat as possible. In other words, we do not care about errors as long as they are less than ε, but will not accept any deviation larger than this. This may be important if you want to be sure not to lose more than ε money when dealing with exchange rates, for instance.

For pedagogical reasons, we begin by describing the case of linear functions f, taking the form

$$
f(x) = \langle w, x\rangle + b \quad\text{with } w \in \mathcal{X},\ b \in \mathbb{R} \qquad (1)
$$

where ⟨·, ·⟩ denotes the dot product in X. Flatness in the case of (1) means that one seeks a small w. One way to ensure this is to minimize the norm,³ i.e. ‖w‖² = ⟨w, w⟩. We can write this problem as a convex optimization problem:

$$
\begin{aligned}
\min \quad & \frac{1}{2}\|w\|^2 \\
\text{subject to} \quad & y_i - \langle w, x_i\rangle - b \le \varepsilon \\
& \langle w, x_i\rangle + b - y_i \le \varepsilon
\end{aligned}
\qquad (2)
$$

The tacit assumption in (2) was that such a function f actually exists that approximates all pairs (x_i, y_i) with ε precision, or in other words, that the convex optimization problem is feasible. Sometimes, however, this may not be the case, or we may also want to allow for some errors. Analogously to the "soft margin" loss function (Bennett and Mangasarian 1992), which was used in SV machines by Cortes and Vapnik (1995), one can introduce slack variables ξ_i, ξ_i^* to cope with otherwise infeasible constraints of the optimization problem (2). Hence we arrive at the formulation stated in Vapnik (1995):

$$
\begin{aligned}
\min \quad & \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{l}(\xi_i + \xi_i^*) \\
\text{subject to} \quad & y_i - \langle w, x_i\rangle - b \le \varepsilon + \xi_i \\
& \langle w, x_i\rangle + b - y_i \le \varepsilon + \xi_i^* \\
& \xi_i, \xi_i^* \ge 0
\end{aligned}
\qquad (3)
$$

The constant C > 0 determines the trade-off between the flatness of f and the amount up to which deviations larger than ε are tolerated. This corresponds to dealing with a so-called ε-insensitive loss function |ξ|_ε described by

$$
|\xi|_\varepsilon :=
\begin{cases}
0 & \text{if } |\xi| \le \varepsilon \\
|\xi| - \varepsilon & \text{otherwise.}
\end{cases}
\qquad (4)
$$

Fig. 1. The soft margin loss setting for a linear SVM (from Schölkopf and Smola, 2002)

Figure 1 depicts the situation graphically. Only the points outside the shaded region contribute to the cost insofar as the deviations are penalized in a linear fashion.
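As a small aside (not part of the original text), the ε-insensitive loss (4) is a one-liner to implement; the function below is a NumPy sketch with made-up example values.

```python
import numpy as np

def eps_insensitive_loss(y_true, y_pred, eps):
    """epsilon-insensitive loss |xi|_eps of (4): zero inside the eps-tube,
    linear outside it. Vectorized over arrays of targets and predictions."""
    return np.maximum(np.abs(y_true - y_pred) - eps, 0.0)

print(eps_insensitive_loss(np.array([1.0, 2.0, 3.0]),
                           np.array([1.05, 2.5, 4.0]), eps=0.1))
# -> [0.  0.4 0.9]
```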
It turns out that in most cases the optimization problem (3) can be solved more easily in its dual formulation.⁴ Moreover, as we will see in Section 2, the dual formulation provides the key for extending the SV machine to nonlinear functions. Hence we will use a standard dualization method utilizing Lagrange multipliers, as described in e.g. Fletcher (1989).

1.3. Dual problem and quadratic programs

The key idea is to construct a Lagrange function from the objective function (it will be called the primal objective function in the rest of this article) and the corresponding constraints, by introducing a dual set of variables. It can be shown that this function has a saddle point with respect to the primal and dual variables at the solution. For details see e.g. Mangasarian (1969), McCormick (1983), and Vanderbei (1997) and the explanations in Section 5.2. We proceed as follows:

$$
\begin{aligned}
L := \; & \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{l}(\xi_i + \xi_i^*) - \sum_{i=1}^{l}(\eta_i\xi_i + \eta_i^*\xi_i^*) \\
& - \sum_{i=1}^{l}\alpha_i(\varepsilon + \xi_i - y_i + \langle w, x_i\rangle + b) \\
& - \sum_{i=1}^{l}\alpha_i^*(\varepsilon + \xi_i^* + y_i - \langle w, x_i\rangle - b)
\end{aligned}
\qquad (5)
$$

Here L is the Lagrangian and η_i, η_i^*, α_i, α_i^* are Lagrange multipliers. Hence the dual variables in (5) have to satisfy positivity constraints, i.e.

$$
\alpha_i^{(*)}, \eta_i^{(*)} \ge 0. \qquad (6)
$$

Note that by α_i^{(*)} we refer to α_i and α_i^*.

It follows from the saddle point condition that the partial derivatives of L with respect to the primal variables (w, b, ξ_i, ξ_i^*) have to vanish for optimality:

$$
\partial_b L = \sum_{i=1}^{l}(\alpha_i^* - \alpha_i) = 0 \qquad (7)
$$

$$
\partial_w L = w - \sum_{i=1}^{l}(\alpha_i - \alpha_i^*)x_i = 0 \qquad (8)
$$

$$
\partial_{\xi_i^{(*)}} L = C - \alpha_i^{(*)} - \eta_i^{(*)} = 0 \qquad (9)
$$


Substituting (7), (8), and (9) into (5) yields the dual optimization problem:

$$
\begin{aligned}
\max \quad & -\frac{1}{2}\sum_{i,j=1}^{l}(\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\langle x_i, x_j\rangle \\
& -\varepsilon\sum_{i=1}^{l}(\alpha_i + \alpha_i^*) + \sum_{i=1}^{l} y_i(\alpha_i - \alpha_i^*) \\
\text{subject to} \quad & \sum_{i=1}^{l}(\alpha_i - \alpha_i^*) = 0 \ \text{ and } \ \alpha_i, \alpha_i^* \in [0, C]
\end{aligned}
\qquad (10)
$$

In deriving (10) we already eliminated the dual variables η_i, η_i^* through condition (9), which can be reformulated as η_i^{(*)} = C − α_i^{(*)}. Equation (8) can be rewritten as follows:

$$
w = \sum_{i=1}^{l}(\alpha_i - \alpha_i^*)x_i, \quad\text{thus}\quad
f(x) = \sum_{i=1}^{l}(\alpha_i - \alpha_i^*)\langle x_i, x\rangle + b. \qquad (11)
$$

This is the so-called Support Vector expansion, i.e. w can be completely described as a linear combination of the training patterns x_i. In a sense, the complexity of a function's representation by SVs is independent of the dimensionality of the input space X, and depends only on the number of SVs.

Moreover, note that the complete algorithm can be described in terms of dot products between the data. Even when evaluating f(x) we need not compute w explicitly. These observations will come in handy for the formulation of a nonlinear extension.

1.4. Computing b

So far we neglected the issue of computing b. The latter can be done by exploiting the so-called Karush–Kuhn–Tucker (KKT) conditions (Karush 1939, Kuhn and Tucker 1951). These state that at the point of the solution the product between dual variables and constraints has to vanish:

$$
\begin{aligned}
\alpha_i(\varepsilon + \xi_i - y_i + \langle w, x_i\rangle + b) &= 0 \\
\alpha_i^*(\varepsilon + \xi_i^* + y_i - \langle w, x_i\rangle - b) &= 0
\end{aligned}
\qquad (12)
$$

and

$$
\begin{aligned}
(C - \alpha_i)\xi_i &= 0 \\
(C - \alpha_i^*)\xi_i^* &= 0.
\end{aligned}
\qquad (13)
$$

This allows us to make several useful conclusions. Firstly, only samples (x_i, y_i) with corresponding α_i^{(*)} = C lie outside the ε-insensitive tube. Secondly, α_i α_i^* = 0, i.e. there can never be a set of dual variables α_i, α_i^* which are both simultaneously nonzero. This allows us to conclude that

$$
\varepsilon - y_i + \langle w, x_i\rangle + b \ge 0 \ \text{ and } \ \xi_i = 0 \quad \text{if } \alpha_i < C \qquad (14)
$$

$$
\varepsilon - y_i + \langle w, x_i\rangle + b \le 0 \quad \text{if } \alpha_i > 0 \qquad (15)
$$

In conjunction with an analogous analysis on α_i^* we have

$$
\begin{aligned}
\max\{-\varepsilon + y_i - \langle w, x_i\rangle \mid \alpha_i < C \ \text{or}\ \alpha_i^* > 0\} \ \le\ b\ \le\ \\
\min\{-\varepsilon + y_i - \langle w, x_i\rangle \mid \alpha_i > 0 \ \text{or}\ \alpha_i^* < C\}
\end{aligned}
\qquad (16)
$$

If some α_i^{(*)} ∈ (0, C), the inequalities become equalities. See also Keerthi et al. (2001) for further means of choosing b.

Another way of computing b will be discussed in the context of interior point optimization (cf. Section 5). There b turns out to be a by-product of the optimization process. Further considerations shall be deferred to the corresponding section. See also Keerthi et al. (1999) for further methods to compute the constant offset.

A final note has to be made regarding the sparsity of the SV expansion. From (12) it follows that only for |f(x_i) − y_i| ≥ ε the Lagrange multipliers may be nonzero, or in other words,
for all samples inside the ε-tube (i.e. the shaded region in Fig. 1) the α_i, α_i^* vanish: for |f(x_i) − y_i| < ε the second factor in (12) is nonzero, hence α_i, α_i^* has to be zero such that the KKT conditions are satisfied. Therefore we have a sparse expansion of w in terms of x_i (i.e. we do not need all x_i to describe w). The examples that come with nonvanishing coefficients are called Support Vectors.

2. Kernels

2.1. Nonlinearity by preprocessing

The next step is to make the SV algorithm nonlinear. This, for instance, could be achieved by simply preprocessing the training patterns x_i by a map Φ : X → F into some feature space F, as described in Aizerman, Braverman and Rozonoér (1964) and Nilsson (1965), and then applying the standard SV regression algorithm. Let us have a brief look at an example given in Vapnik (1995).

Example 1 (Quadratic features in ℝ²). Consider the map Φ : ℝ² → ℝ³ with Φ(x_1, x_2) = (x_1², √2 x_1 x_2, x_2²). It is understood that the subscripts in this case refer to the components of x ∈ ℝ². Training a linear SV machine on the preprocessed features would yield a quadratic function.

While this approach seems reasonable in the particular example above, it can easily become computationally infeasible for both polynomial features of higher order and higher dimensionality, as the number of different monomial features of degree p is (d+p−1 choose p), where d = dim(X). Typical values for OCR tasks (with good performance) (Schölkopf, Burges and Vapnik 1995, Schölkopf et al. 1997, Vapnik 1995) are p = 7, d = 28 · 28 = 784, corresponding to approximately 3.7 · 10^16 features.

2.2. Implicit mapping via kernels

Clearly this approach is not feasible and we have to find a computationally cheaper way.
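To make Example 1 concrete (an illustration added here, not from the original text), the explicit map Φ can be written in a few lines of NumPy, and its feature-space dot product can be compared numerically with the squared input dot product that the kernel trick exploits; the numbers are arbitrary.

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map of Example 1: R^2 -> R^3."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2.0) * x1 * x2, x2**2])

x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
# The dot product in feature space equals the squared input dot product,
# which is exactly the identity used by the kernel in the next section.
print(np.dot(phi(x), phi(xp)), np.dot(x, xp) ** 2)   # both print 2.25
```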


The key observation (Boser, Guyon and Vapnik 1992) is that for the feature map of Example 1 we have

$$
\big\langle (x_1^2, \sqrt{2}x_1x_2, x_2^2), (x_1'^2, \sqrt{2}x_1'x_2', x_2'^2) \big\rangle = \langle x, x'\rangle^2. \qquad (17)
$$

As noted in the previous section, the SV algorithm only depends on dot products between patterns x_i. Hence it suffices to know k(x, x') := ⟨Φ(x), Φ(x')⟩ rather than Φ explicitly, which allows us to restate the SV optimization problem:

$$
\begin{aligned}
\max \quad & -\frac{1}{2}\sum_{i,j=1}^{l}(\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)k(x_i, x_j) \\
& -\varepsilon\sum_{i=1}^{l}(\alpha_i + \alpha_i^*) + \sum_{i=1}^{l} y_i(\alpha_i - \alpha_i^*) \\
\text{subject to} \quad & \sum_{i=1}^{l}(\alpha_i - \alpha_i^*) = 0 \ \text{ and } \ \alpha_i, \alpha_i^* \in [0, C]
\end{aligned}
\qquad (18)
$$

Likewise the expansion of f (11) may be written as

$$
w = \sum_{i=1}^{l}(\alpha_i - \alpha_i^*)\Phi(x_i) \quad\text{and}\quad
f(x) = \sum_{i=1}^{l}(\alpha_i - \alpha_i^*)k(x_i, x) + b. \qquad (19)
$$

The difference to the linear case is that w is no longer given explicitly. Also note that in the nonlinear setting, the optimization problem corresponds to finding the flattest function in feature space, not in input space.

2.3. Conditions for kernels

The question that arises now is which functions k(x, x') correspond to a dot product in some feature space F. The following theorem characterizes these functions (defined on X).

Theorem 2 (Mercer 1909). Suppose k ∈ L_∞(X²) such that the integral operator T_k : L_2(X) → L_2(X),

$$
T_k f(\cdot) := \int_{\mathcal{X}} k(\cdot, x) f(x)\, d\mu(x) \qquad (20)
$$

is positive (here µ denotes a measure on X with µ(X) finite and supp(µ) = X). Let ψ_j ∈ L_2(X) be the eigenfunction of T_k associated with the eigenvalue λ_j ≠ 0, normalized such that ‖ψ_j‖_{L_2} = 1, and let ψ̄_j denote its complex conjugate. Then

1. (λ_j(T))_j ∈ ℓ_1.
2. k(x, x') = Σ_{j∈ℕ} λ_j ψ̄_j(x) ψ_j(x') holds for almost all (x, x'), where the series converges absolutely and uniformly for almost all (x, x').

Less formally speaking, this theorem means that if

$$
\int_{\mathcal{X}\times\mathcal{X}} k(x, x') f(x) f(x')\, dx\, dx' \ge 0 \quad\text{for all } f \in L_2(\mathcal{X}) \qquad (21)
$$

holds, we can write k(x, x') as a dot product in some feature space. From this condition we can conclude some simple rules for compositions of kernels, which then also satisfy Mercer's condition (Schölkopf, Burges and Smola 1999a). In the following we will call such functions k admissible SV kernels.

Corollary 3 (Positive linear combinations of kernels). Denote by k_1, k_2 admissible SV kernels and c_1, c_2 ≥ 0. Then

$$
k(x, x') := c_1 k_1(x, x') + c_2 k_2(x, x') \qquad (22)
$$

is an admissible kernel. This follows directly from (21) by virtue of the linearity of integrals.

More generally, one can show that the set of admissible kernels forms a convex cone, closed in the topology of pointwise convergence (Berg, Christensen and Ressel 1984).

Corollary 4 (Integrals of kernels). Let s(x, x') be a function on X × X such that

$$
k(x, x') := \int_{\mathcal{X}} s(x, z)s(x', z)\, dz \qquad (23)
$$

exists. Then k is an admissible SV kernel.

This can be shown directly from (21) and (23) by rearranging the order of integration. We now state a necessary and sufficient condition for translation invariant kernels, i.e.
k(x, x') := k(x − x'), as derived in Smola, Schölkopf and Müller (1998c).

Theorem 5 (Products of kernels). Denote by k_1 and k_2 admissible SV kernels. Then

$$
k(x, x') := k_1(x, x')\, k_2(x, x') \qquad (24)
$$

is an admissible kernel.

This can be seen by an application of the "expansion part" of Mercer's theorem to the kernels k_1 and k_2 and observing that each term in the double sum Σ_{i,j} λ_i^1 λ_j^2 ψ_i^1(x) ψ_i^1(x') ψ_j^2(x) ψ_j^2(x') gives rise to a positive coefficient when checking (21).

Theorem 6 (Smola, Schölkopf and Müller 1998c). A translation invariant kernel k(x, x') = k(x − x') is an admissible SV kernel if and only if the Fourier transform

$$
F[k](\omega) = (2\pi)^{-\frac{d}{2}} \int_{\mathcal{X}} e^{-i\langle\omega, x\rangle} k(x)\, dx \qquad (25)
$$

is nonnegative.

We will give a proof and some additional explanations to this theorem in Section 7. It follows from interpolation theory (Micchelli 1986) and the theory of regularization networks (Girosi, Jones and Poggio 1993). For kernels of the dot-product type, i.e. k(x, x') = k(⟨x, x'⟩), there exist sufficient conditions for being admissible.

Theorem 7 (Burges 1999). Any kernel of dot-product type k(x, x') = k(⟨x, x'⟩) has to satisfy

$$
k(\xi) \ge 0, \quad \partial_\xi k(\xi) \ge 0 \quad\text{and}\quad \partial_\xi k(\xi) + \xi\,\partial_\xi^2 k(\xi) \ge 0 \qquad (26)
$$

for any ξ ≥ 0 in order to be an admissible SV kernel.
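The composition rules of Corollary 3 and Theorem 5 are easy to exercise numerically. The sketch below (added here as an illustration; the kernels, data, and tolerance are assumptions) builds a positive combination and a product of two standard kernels and checks that the resulting Gram matrix remains positive semidefinite.

```python
import numpy as np

def rbf(X, Y, sigma=1.0):
    """Gaussian kernel matrix k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def poly(X, Y, p=2, c=1.0):
    """Inhomogeneous polynomial kernel (<x, x'> + c)^p."""
    return (X @ Y.T + c) ** p

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
K = 0.7 * rbf(X, X) + 0.3 * poly(X, X)   # Corollary 3: positive combination
K = K * rbf(X, X, sigma=2.0)             # Theorem 5: pointwise product
# Sanity check: the combined Gram matrix stays positive semidefinite.
print(np.linalg.eigvalsh(K).min() >= -1e-9)
```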


Note that the conditions in Theorem 7 are only necessary but not sufficient. The rules stated above can be useful tools for practitioners, both for checking whether a kernel is an admissible SV kernel and for actually constructing new kernels. The general case is given by the following theorem.

Theorem 8 (Schoenberg 1942). A kernel of dot-product type k(x, x') = k(⟨x, x'⟩) defined on an infinite dimensional Hilbert space, with a power series expansion

$$
k(t) = \sum_{n=0}^{\infty} a_n t^n \qquad (27)
$$

is admissible if and only if all a_n ≥ 0.

A slightly weaker condition applies for finite dimensional spaces. For further details see Berg, Christensen and Ressel (1984) and Smola, Óvári and Williamson (2001).

2.4. Examples

In Schölkopf, Smola and Müller (1998b) it has been shown, by explicitly computing the mapping, that homogeneous polynomial kernels k with p ∈ ℕ and

$$
k(x, x') = \langle x, x'\rangle^p \qquad (28)
$$

are suitable SV kernels (cf. Poggio 1975). From this observation one can conclude immediately (Boser, Guyon and Vapnik 1992, Vapnik 1995) that kernels of the type

$$
k(x, x') = (\langle x, x'\rangle + c)^p \qquad (29)
$$

i.e. inhomogeneous polynomial kernels with p ∈ ℕ, c ≥ 0, are admissible, too: rewrite k as a sum of homogeneous kernels and apply Corollary 3. Another kernel, which might seem appealing due to its resemblance to Neural Networks, is the hyperbolic tangent kernel

$$
k(x, x') = \tanh(\vartheta + \kappa\langle x, x'\rangle). \qquad (30)
$$

By applying Theorem 8 one can check that this kernel does not actually satisfy Mercer's condition (Ovari 2000). Curiously, the kernel has been successfully used in practice; cf. Schölkopf (1997) for a discussion of the reasons.

Translation invariant kernels k(x, x') = k(x − x') are quite widespread. It was shown in Aizerman, Braverman and Rozonoér (1964), Micchelli (1986) and Boser, Guyon and Vapnik (1992) that

$$
k(x, x') = e^{-\frac{\|x - x'\|^2}{2\sigma^2}} \qquad (31)
$$

is an admissible SV kernel. Moreover one can show (Smola 1996, Vapnik, Golowich and Smola 1997) that (1_X denotes the indicator function on the set X and ⊗ the convolution operation)

$$
k(x, x') = B_{2n+1}(\|x - x'\|) \quad\text{with}\quad B_k := \bigotimes_{i=1}^{k} 1_{[-\frac{1}{2}, \frac{1}{2}]} \qquad (32)
$$

B-splines of order 2n + 1, defined by the (2n + 1)-fold convolution of the unit interval, are also admissible. We shall postpone further considerations to Section 7, where the connection to regularization operators will be pointed out in more detail.

3. Cost functions

So far the SV algorithm for regression may seem rather strange and hardly related to other existing methods of function estimation (e.g. Huber 1981, Stone 1985, Härdle 1990, Hastie and Tibshirani 1990, Wahba 1990).
However, once cast into a more standard mathematical notation, we will observe the connections to previous work. For the sake of simplicity we will, again, only consider the linear case, as extensions to the nonlinear one are straightforward by using the kernel method described in the previous chapter.

3.1. The risk functional

Let us for a moment go back to the case of Section 1.2. There, we had some training data X := {(x_1, y_1), . . . , (x_l, y_l)} ⊂ X × ℝ. We will now assume that this training set has been drawn iid (independent and identically distributed) from some probability distribution P(x, y). Our goal will be to find a function f minimizing the expected risk (cf. Vapnik 1982)

$$
R[f] = \int c(x, y, f(x))\, dP(x, y) \qquad (33)
$$

(c(x, y, f(x)) denotes a cost function determining how we will penalize estimation errors) based on the empirical data X. Given that we do not know the distribution P(x, y), we can only use X for estimating a function f that minimizes R[f]. A possible approximation consists in replacing the integration by the empirical estimate, to get the so-called empirical risk functional

$$
R_{\mathrm{emp}}[f] := \frac{1}{l}\sum_{i=1}^{l} c(x_i, y_i, f(x_i)). \qquad (34)
$$

A first attempt would be to find the empirical risk minimizer f_0 := argmin_{f∈H} R_emp[f] for some function class H. However, if H is very rich, i.e. its "capacity" is very high, as for instance when dealing with few data in very high-dimensional spaces, this may not be a good idea, as it will lead to overfitting and thus bad generalization properties. Hence one should add a capacity control term, which in the SV case is ‖w‖², leading to the regularized risk functional (Tikhonov and Arsenin 1977, Morozov 1984, Vapnik 1982)

$$
R_{\mathrm{reg}}[f] := R_{\mathrm{emp}}[f] + \frac{\lambda}{2}\|w\|^2 \qquad (35)
$$

where λ > 0 is a so-called regularization constant. Many algorithms like regularization networks (Girosi, Jones and Poggio 1993) or neural networks with weight decay (e.g. Bishop 1995) minimize an expression similar to (35).
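For a linear model and the ε-insensitive cost, the regularized risk (35) is straightforward to evaluate; the following short NumPy function (an illustration added here, not from the original text) computes it for given parameters.

```python
import numpy as np

def regularized_risk(w, b, X, y, eps, lam):
    """Empirical risk (34) with the eps-insensitive cost of (4), plus the
    capacity term (lambda/2)*||w||^2 of (35), for the linear model <w, x> + b."""
    residuals = y - (X @ w + b)
    emp = np.maximum(np.abs(residuals) - eps, 0.0).mean()
    return emp + 0.5 * lam * np.dot(w, w)
```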


3.2. Maximum likelihood and density models

The standard setting in the SV case is, as already mentioned in Section 1.2, the ε-insensitive loss

$$
c(x, y, f(x)) = |y - f(x)|_\varepsilon. \qquad (36)
$$

It is straightforward to show that minimizing (35) with the particular loss function of (36) is equivalent to minimizing (3), the only difference being that C = 1/(λl).

Loss functions such as |y − f(x)|_ε^p with p > 1 may not be desirable, as the superlinear increase leads to a loss of the robustness properties of the estimator (Huber 1981): in those cases the derivative of the cost function grows without bound. For p < 1, on the other hand, c becomes nonconvex.

For the case of c(x, y, f(x)) = (y − f(x))² we recover the least mean squares fit approach, which, unlike the standard SV loss function, leads to a matrix inversion instead of a quadratic programming problem.

The question is which cost function should be used in (35). On the one hand we will want to avoid a very complicated function c, as this may lead to difficult optimization problems. On the other hand one should use that particular cost function that suits the problem best. Moreover, under the assumption that the samples were generated by an underlying functional dependency plus additive noise, i.e. y_i = f_true(x_i) + ξ_i with density p(ξ), the optimal cost function in a maximum likelihood sense is

$$
c(x, y, f(x)) = -\log p(y - f(x)). \qquad (37)
$$

This can be seen as follows. The likelihood of an estimate

$$
X_f := \{(x_1, f(x_1)), \dots, (x_l, f(x_l))\} \qquad (38)
$$

for additive noise and iid data is

$$
p(X_f \mid X) = \prod_{i=1}^{l} p(f(x_i) \mid (x_i, y_i)) = \prod_{i=1}^{l} p(y_i - f(x_i)). \qquad (39)
$$

Maximizing P(X_f | X) is equivalent to minimizing −log P(X_f | X). By using (37) we get

$$
-\log P(X_f \mid X) = \sum_{i=1}^{l} c(x_i, y_i, f(x_i)). \qquad (40)
$$

However, the cost function resulting from this reasoning might be nonconvex. In this case one would have to find a convex proxy in order to deal with the situation efficiently (i.e. to find an efficient implementation of the corresponding optimization problem).

If, on the other hand, we are given a specific cost function from a real-world problem, one should try to find as close a proxy to this cost function as possible, as it is the performance with respect to this particular cost function that matters ultimately.

Table 1 contains an overview of some common density models and the corresponding loss functions as defined by (37).

The only requirement we will impose on c(x, y, f(x)) in the following is that for fixed x and y we have convexity in f(x). This requirement is made, as we want to ensure the existence and uniqueness (for strict convexity) of a minimum of optimization problems (Fletcher 1989).
3.3. Solving the equations

For the sake of simplicity we will additionally assume c to be symmetric and to have (at most) two (for symmetry) discontinuities at ±ε, ε ≥ 0, in the first derivative, and to be zero in the interval [−ε, ε]. All loss functions from Table 1 belong to this class. Hence c will take on the following form:

$$
c(x, y, f(x)) = \tilde{c}(|y - f(x)|_\varepsilon) \qquad (41)
$$

Note the similarity to Vapnik's ε-insensitive loss. It is rather straightforward to extend this special choice to more general convex cost functions. For nonzero cost functions in the interval [−ε, ε], use an additional pair of slack variables. Moreover we might choose different cost functions c̃_i, c̃_i^* and different values of ε_i, ε_i^* for each sample. At the expense of additional Lagrange multipliers in the dual formulation, additional discontinuities also can be taken care of. Analogously to (3) we arrive at a convex minimization problem (Smola and Schölkopf 1998a).

Table 1. Common loss functions and corresponding density models

Loss function          c(ξ)                                                              Density model p(ξ)
ε-insensitive          |ξ|_ε                                                             (1/(2(1+ε))) exp(−|ξ|_ε)
Laplacian              |ξ|                                                               (1/2) exp(−|ξ|)
Gaussian               (1/2) ξ²                                                          (1/√(2π)) exp(−ξ²/2)
Huber's robust loss    (1/(2σ)) ξ²  if |ξ| ≤ σ;   |ξ| − σ/2  otherwise                   ∝ exp(−ξ²/(2σ)) if |ξ| ≤ σ;  ∝ exp(σ/2 − |ξ|) otherwise
Polynomial             (1/p) |ξ|^p                                                       (p/(2Γ(1/p))) exp(−|ξ|^p)
Piecewise polynomial   (1/(pσ^{p−1})) ξ^p  if |ξ| ≤ σ;   |ξ| − σ(p−1)/p  otherwise       ∝ exp(−ξ^p/(pσ^{p−1})) if |ξ| ≤ σ;  ∝ exp(σ(p−1)/p − |ξ|) otherwise
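A few of the losses in Table 1 are implemented below as plain NumPy functions (an illustration added here; argument names and the vectorized form are our own choices).

```python
import numpy as np

def loss_eps_insensitive(xi, eps):
    return np.maximum(np.abs(xi) - eps, 0.0)

def loss_laplacian(xi):
    return np.abs(xi)

def loss_gaussian(xi):
    return 0.5 * xi ** 2

def loss_huber(xi, sigma):
    a = np.abs(xi)
    # quadratic near zero, linear in the tails -> bounded derivative
    return np.where(a <= sigma, xi ** 2 / (2.0 * sigma), a - sigma / 2.0)
```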


To simplify notation we will stick to the one of (3) and use C instead of normalizing by λ and l.

minimize   (1/2)‖w‖² + C Σ_{i=1}^l (c̃(ξ_i) + c̃(ξ_i*))

subject to y_i − ⟨w, x_i⟩ − b ≤ ε + ξ_i
           ⟨w, x_i⟩ + b − y_i ≤ ε + ξ_i*                          (42)
           ξ_i, ξ_i* ≥ 0

Again, by standard Lagrange multiplier techniques, exactly in the same manner as in the |·|_ε case, one can compute the dual optimization problem (the main difference is that the slack variable terms c̃(ξ_i^(*)) now have nonvanishing derivatives). We will omit the indices i and *, where applicable, to avoid tedious notation. This yields

maximize   −(1/2) Σ_{i,j=1}^l (α_i − α_i*)(α_j − α_j*)⟨x_i, x_j⟩
           + Σ_{i=1}^l ( y_i(α_i − α_i*) − ε(α_i + α_i*) )
           + C Σ_{i=1}^l (T(ξ_i) + T(ξ_i*))

where      w = Σ_{i=1}^l (α_i − α_i*) x_i                         (43)
           T(ξ) := c̃(ξ) − ξ ∂_ξ c̃(ξ)

subject to Σ_{i=1}^l (α_i − α_i*) = 0
           α ≤ C ∂_ξ c̃(ξ)
           ξ = inf{ξ | C ∂_ξ c̃ ≥ α}
           α, ξ ≥ 0

3.4. Examples

Let us consider the examples of Table 1. We will show explicitly for two examples how (43) can be further simplified to bring it into a form that is practically useful. In the ε-insensitive case, i.e. c̃(ξ) = |ξ|, we get

T(ξ) = ξ − ξ · 1 = 0.  (44)

Moreover one can conclude from ∂_ξ c̃(ξ) = 1 that

ξ = inf{ξ | C ≥ α} = 0 and α ∈ [0, C].  (45)

For the case of piecewise polynomial loss we have to distinguish two different cases: ξ ≤ σ and ξ > σ. In the first case we get

T(ξ) = (1/(p σ^{p−1})) ξ^p − (1/σ^{p−1}) ξ^p = −((p−1)/p) σ^{1−p} ξ^p  (46)

and ξ = inf{ξ | C σ^{1−p} ξ^{p−1} ≥ α} = σ C^{−1/(p−1)} α^{1/(p−1)}, and thus

T(ξ) = −((p−1)/p) σ C^{−p/(p−1)} α^{p/(p−1)}.  (47)

In the second case (ξ ≥ σ) we have

T(ξ) = ξ − σ(p−1)/p − ξ = −σ(p−1)/p  (48)

and ξ = inf{ξ | C ≥ α} = σ, which, in turn, yields α ∈ [0, C]. Combining both cases we have

α ∈ [0, C] and T(α) = −((p−1)/p) σ C^{−p/(p−1)} α^{p/(p−1)}.  (49)

Table 2. Terms of the convex optimization problem depending on the choice of the loss function

                         ε        α             CT(α)
  ε-insensitive          ε ≠ 0    α ∈ [0, C]    0
  Laplacian              ε = 0    α ∈ [0, C]    0
  Gaussian               ε = 0    α ∈ [0, ∞)    −(1/2) C^{−1} α²
  Huber's robust loss    ε = 0    α ∈ [0, C]    −(1/2) σ C^{−1} α²
  Polynomial             ε = 0    α ∈ [0, ∞)    −((p−1)/p) C^{−1/(p−1)} α^{p/(p−1)}
  Piecewise polynomial   ε = 0    α ∈ [0, C]    −((p−1)/p) σ C^{−1/(p−1)} α^{p/(p−1)}

Table 2 contains a summary of the various conditions on α and formulas for T(α) (strictly speaking T(ξ(α))) for different cost functions. Note that the maximum slope of c̃ determines the region of feasibility of α, i.e. s := sup_{ξ∈R+} ∂_ξ c̃(ξ) < ∞ leads to compact intervals [0, Cs] for α. This means that the influence of a single pattern is bounded, leading to robust estimators (Huber 1972). One can also observe experimentally that the performance of a SV machine depends significantly on the cost function used (Müller et al. 1997, Smola, Schölkopf and Müller 1998b).

A cautionary remark is necessary regarding the use of cost functions other than the ε-insensitive one. Unless ε ≠ 0 we will lose the advantage of a sparse decomposition. This may be acceptable in the case of few data, but will render the prediction step extremely slow otherwise. Hence one will have to trade off a potential loss in prediction accuracy with faster predictions. Note, however, that also a reduced set algorithm like in Burges (1996), Burges and Schölkopf (1997) and Schölkopf et al. (1999b) or sparse decomposition techniques (Smola and Schölkopf 2000) could be applied to address this issue. In a Bayesian setting, Tipping (2000) has recently shown how an L2 cost function can be used without sacrificing sparsity.
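As a quick sanity check of the reduction above (not part of the original text; the numbers are arbitrary), the following sketch evaluates T(ξ) = c̃(ξ) − ξ ∂_ξ c̃(ξ) for the piecewise polynomial loss at ξ(α) = σ C^{−1/(p−1)} α^{1/(p−1)} and compares it with the closed form (49).

```python
import numpy as np

p, sigma, C = 3.0, 0.5, 2.0
alpha = 0.7  # any value in [0, C]

# xi(alpha) from the first case of the piecewise polynomial loss, cf. (46)-(47)
xi = sigma * C ** (-1.0 / (p - 1.0)) * alpha ** (1.0 / (p - 1.0))

# T(xi) = c_tilde(xi) - xi * d/dxi c_tilde(xi) for xi <= sigma
c_tilde = xi**p / (p * sigma ** (p - 1.0))
dc_tilde = xi ** (p - 1.0) / sigma ** (p - 1.0)
T_direct = c_tilde - xi * dc_tilde

# Closed form (49)
T_closed = -(p - 1.0) / p * sigma * C ** (-p / (p - 1.0)) * alpha ** (p / (p - 1.0))

assert xi <= sigma  # these numbers indeed fall into the first branch
assert np.isclose(T_direct, T_closed)
print(T_direct, T_closed)
```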
4. The bigger picture

Before delving into algorithmic details of the implementation let us briefly review the basic properties of the SV algorithm for regression as described so far. Figure 2 contains a graphical overview over the different steps in the regression stage.

The input pattern (for which a prediction is to be made) is mapped into feature space by a map Φ.


Fig. 2. Architecture of a regression machine constructed by the SV algorithm

Then dot products are computed with the images of the training patterns under the map Φ. This corresponds to evaluating kernel functions k(x_i, x). Finally the dot products are added up using the weights ν_i = α_i − α_i*. This, plus the constant term b, yields the final prediction output. The process described here is very similar to regression in a neural network, with the difference that in the SV case the weights in the input layer are a subset of the training patterns.

Figure 3 demonstrates how the SV algorithm chooses the flattest function among those approximating the original data with a given precision. Although requiring flatness only in feature space, one can observe that the functions are also very flat in input space. This is due to the fact that kernels can be associated with flatness properties via regularization operators. This will be explained in more detail in Section 7.

Fig. 3. Left to right: approximation of the function sinc x with precisions ε = 0.1, 0.2, and 0.5. The solid top and bottom lines indicate the size of the ε-tube, the dotted line in between is the regression

Finally Fig. 4 shows the relation between approximation quality and sparsity of representation in the SV case. The lower the precision required for approximating the original data, the fewer SVs are needed to encode it. The non-SVs are redundant, i.e. even without these patterns in the training set, the SV machine would have constructed exactly the same function f. One might think that this could be an efficient way of data compression, namely by storing only the support patterns, from which the estimate can be reconstructed completely. However, this simple analogy turns out to fail in the case of high-dimensional data, and even more drastically in the presence of noise. In Vapnik, Golowich and Smola (1997) one can see that even for moderate approximation quality, the number of SVs can be considerably high, yielding rates worse than the Nyquist rate (Nyquist 1928, Shannon 1948).

Fig. 4. Left to right: regression (solid line), datapoints (small dots) and SVs (big dots) for an approximation with ε = 0.1, 0.2, and 0.5. Note the decrease in the number of SVs
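The prediction step sketched in Fig. 2 is easy to state in code. The following is a minimal sketch (not from the original text; the Gaussian RBF kernel and the toy coefficients are assumptions for illustration) of evaluating f(x) = Σ_i (α_i − α_i*) k(x_i, x) + b.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # k(a, b) = exp(-gamma * ||a - b||^2), one common SV kernel
    return np.exp(-gamma * np.sum((a - b) ** 2, axis=-1))

def svr_predict(x, sv, nu, b, gamma=1.0):
    """Evaluate f(x) = sum_i nu_i k(sv_i, x) + b with nu_i = alpha_i - alpha_i*."""
    k = np.array([rbf_kernel(s, x, gamma) for s in sv])
    return k @ nu + b

# Toy numbers only: in practice sv, nu and b come from the trained SV machine.
sv = np.array([[0.0], [1.0], [2.0]])   # support vectors
nu = np.array([0.5, -0.2, 0.1])        # alpha_i - alpha_i*
b = 0.05
print(svr_predict(np.array([1.5]), sv, nu, b))
```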


5. Optimization algorithms

While there has been a large number of implementations of SV algorithms in the past years, we focus on a few algorithms which will be presented in greater detail. This selection is somewhat biased, as it contains those algorithms the authors are most familiar with. However, we think that this overview contains some of the most effective ones and will be useful for practitioners who would like to actually code a SV machine by themselves. But before doing so we will briefly cover major optimization packages and strategies.

5.1. Implementations

Most commercially available packages for quadratic programming can also be used to train SV machines. These are usually numerically very stable general purpose codes, with special enhancements for large sparse systems. While the latter is a feature that is not needed at all in SV problems (there the dot product matrix is dense and huge) they still can be used with good success.

OSL: This package was written by IBM-Corporation (1992). It uses a two phase algorithm. The first step consists of solving a linear approximation of the QP problem by the simplex algorithm (Dantzig 1962). Next a related very simple QP problem is dealt with. When successive approximations are close enough together, the second subalgorithm, which permits a quadratic objective and converges very rapidly from a good starting value, is used. Recently an interior point algorithm was added to the software suite.

CPLEX by CPLEX-Optimization-Inc. (1994) uses a primal-dual logarithmic barrier algorithm (Megiddo 1989) instead, with a predictor-corrector step (see e.g. Lustig, Marsten and Shanno 1992, Mehrotra and Sun 1992).

MINOS by the Stanford Optimization Laboratory (Murtagh and Saunders 1983) uses a reduced gradient algorithm in conjunction with a quasi-Newton algorithm. The constraints are handled by an active set strategy. Feasibility is maintained throughout the process. On the active constraint manifold, a quasi-Newton approximation is used.

MATLAB: Until recently the MATLAB QP optimizer delivered only agreeable, although below average, performance on classification tasks and was not all too useful for regression tasks (for problems much larger than 100 samples) due to the fact that one is effectively dealing with an optimization problem of size 2l where at least half of the eigenvalues of the Hessian vanish. These problems seem to have been addressed in version 5.3 / R11. MATLAB now uses interior point codes.

LOQO by Vanderbei (1994) is another example of an interior point code. Section 5.3 discusses the underlying strategies in detail and shows how they can be adapted to SV algorithms.

Maximum margin perceptron by Kowalczyk (2000) is an algorithm specifically tailored to SVs. Unlike most other techniques it works directly in primal space and thus does not have to take the equality constraint on the Lagrange multipliers into account explicitly.

Iterative free set methods: The algorithm by Kaufman (Bunch, Kaufman and Parlett 1976, Bunch and Kaufman 1977, 1980, Drucker et al. 1997, Kaufman 1999) uses such a technique, starting with all variables on the boundary and adding them as the Karush-Kuhn-Tucker conditions become more violated. This approach has the advantage of not having to compute the full dot product matrix from the beginning.
Instead it is evaluated on the fly, yielding a performance improvement in comparison to tackling the whole optimization problem at once. However, also other algorithms can be modified by subset selection techniques (see Section 5.5) to address this problem.

5.2. Basic notions

Most algorithms rely on results from the duality theory in convex optimization. Although we already happened to mention some basic ideas in Section 1.2 we will, for the sake of convenience, briefly review without proof the core results. These are needed in particular to derive an interior point algorithm. For details and proofs see e.g. Fletcher (1989).

Uniqueness: Every local minimum of a convex constrained optimization problem is a global minimum. If the problem is strictly convex then the solution is unique. This means that SVs are not plagued with the problem of local minima as Neural Networks are.

Lagrange function: The Lagrange function is given by the primal objective function minus the sum of all products between constraints and corresponding Lagrange multipliers (cf. e.g. Fletcher 1989, Bertsekas 1995). Optimization can be seen as minimization of the Lagrangian wrt. the primal variables and simultaneous maximization wrt. the Lagrange multipliers, i.e. dual variables. It has a saddle point at the solution. Usually the Lagrange function is only a theoretical device to derive the dual objective function (cf. Section 1.2).

Dual objective function: It is derived by minimizing the Lagrange function with respect to the primal variables and subsequent elimination of the latter. Hence it can be written solely in terms of the dual variables.

Duality gap: For both feasible primal and dual variables the primal objective function (of a convex minimization problem) is always greater or equal than the dual objective function. Since SVMs have only linear constraints the constraint qualifications of the strong duality theorem (Bazaraa, Sherali and Shetty 1993, Theorem 6.2.4) are satisfied and it follows that the gap vanishes at optimality. Thus the duality gap is a measure of how close (in terms of the objective function) the current set of variables is to the solution.

Karush-Kuhn-Tucker (KKT) conditions: A set of primal and dual variables that is both feasible and satisfies the KKT conditions is the solution (i.e. constraint · dual variable = 0). The sum of the violated KKT terms determines exactly the size of the duality gap (that is, we simply compute the constraint · Lagrange multiplier part as done in (55)).
This allows us to compute the latter quite easily.

A simple intuition is that for violated constraints the dual variable could be increased arbitrarily, thus rendering the Lagrange function arbitrarily large. This, however, is in contradiction to the saddle point property.

5.3. Interior point algorithms

In a nutshell the idea of an interior point algorithm is to compute the dual of the optimization problem (in our case the dual dual of R_reg[f]) and solve both primal and dual simultaneously. This is done by only gradually enforcing the KKT conditions


to iteratively find a feasible solution, and to use the duality gap between primal and dual objective function to determine the quality of the current set of variables. The special flavour of algorithm we will describe is primal-dual path-following (Vanderbei 1994).

In order to avoid tedious notation we will consider the slightly more general problem and specialize the result to the SVM later. It is understood that unless stated otherwise, variables like α denote vectors and α_i denotes its i-th component.

minimize   (1/2) q(α) + ⟨c, α⟩                                    (50)
subject to Aα = b and l ≤ α ≤ u

with c, α, l, u ∈ R^n, A ∈ R^{n·m}, b ∈ R^m, the inequalities between vectors holding componentwise and q(α) being a convex function of α. Now we will add slack variables to get rid of all inequalities but the positivity constraints. This yields:

minimize   (1/2) q(α) + ⟨c, α⟩
subject to Aα = b, α − g = l, α + t = u,                          (51)
           g, t ≥ 0, α free

The dual of (51) is

maximize   (1/2)(q(α) − ⟨∂q(α), α⟩) + ⟨b, y⟩ + ⟨l, z⟩ − ⟨u, s⟩
subject to (1/2) ∂q(α) + c − (Ay)^⊤ + s = z,  s, z ≥ 0, y free    (52)

Moreover we get the KKT conditions, namely

g_i z_i = 0 and s_i t_i = 0 for all i ∈ [1 ... n].                (53)

A necessary and sufficient condition for the optimal solution is that the primal/dual variables satisfy both the feasibility conditions of (51) and (52) and the KKT conditions (53). We proceed to solve (51)-(53) iteratively. The details can be found in Appendix A.

5.4. Useful tricks

Before proceeding to further algorithms for quadratic optimization let us briefly mention some useful tricks that can be applied to all algorithms described subsequently and may have significant impact despite their simplicity. They are in part derived from ideas of the interior-point approach.

Training with different regularization parameters: For several reasons (model selection, controlling the number of support vectors, etc.) it may happen that one has to train a SV machine with different regularization parameters C, but otherwise rather identical settings. If the new parameter C_new = τ C_old is not too different it is advantageous to use the rescaled values of the Lagrange multipliers (i.e. α_i, α_i*) as a starting point for the new optimization problem. Rescaling is necessary to satisfy the modified constraints. One gets

α_new = τ α_old and likewise b_new = τ b_old.                     (54)

Assuming that the (dominant) convex part q(α) of the primal objective is quadratic, q scales with τ² whereas the linear part scales with τ. However, since the linear term dominates the objective function, the rescaled values are still a better starting point than α = 0. In practice a speedup of approximately 95% of the overall training time can be observed when using the sequential minimization algorithm, cf. (Smola 1998). A similar reasoning can be applied when retraining with the same regularization parameter but different (yet similar) width parameters of the kernel function.
See Cristianini, Campbell and Shawe-Taylor (1998) for details thereon in a different context.

Monitoring convergence via the feasibility gap: In the case of both primal and dual feasible variables the following connection between primal and dual objective function holds:

Dual Obj. = Primal Obj. − Σ_i (g_i z_i + s_i t_i)                 (55)

This can be seen immediately by the construction of the Lagrange function. In Regression Estimation (with the ε-insensitive loss function) one obtains for Σ_i g_i z_i + s_i t_i

Σ_i [  max(0, f(x_i) − (y_i + ε_i)) (C − α_i*)
     − min(0, f(x_i) − (y_i + ε_i)) α_i*
     + max(0, (y_i − ε_i*) − f(x_i)) (C − α_i)
     − min(0, (y_i − ε_i*) − f(x_i)) α_i ].                       (56)

Thus convergence with respect to the point of the solution can be expressed in terms of the duality gap. An effective stopping rule is to require

(Σ_i g_i z_i + s_i t_i) / (|Primal Objective| + 1) ≤ ε_tol        (57)

for some precision ε_tol. This condition is much in the spirit of primal dual interior point path following algorithms, where convergence is measured in terms of the number of significant figures (which would be the decimal logarithm of (57)), a convention that will also be adopted in the subsequent parts of this exposition.
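A direct transcription of the stopping rule (57) might look as follows (a sketch, not from the original text; the array names and toy values are assumptions, and ε_i = ε_i* = ε is used for all samples).

```python
import numpy as np

def feasibility_gap(f_x, y, alpha, alpha_star, C, eps):
    """Sum of g_i z_i + s_i t_i for the eps-insensitive loss, cf. (56)."""
    upper = f_x - (y + eps)   # violation above the tube
    lower = (y - eps) - f_x   # violation below the tube
    return np.sum(np.maximum(0.0, upper) * (C - alpha_star)
                  - np.minimum(0.0, upper) * alpha_star
                  + np.maximum(0.0, lower) * (C - alpha)
                  - np.minimum(0.0, lower) * alpha)

def converged(f_x, y, alpha, alpha_star, C, eps, primal_obj, eps_tol=1e-3):
    # Stopping rule (57): normalized gap below the required precision.
    gap = feasibility_gap(f_x, y, alpha, alpha_star, C, eps)
    return gap / (abs(primal_obj) + 1.0) <= eps_tol

# Toy values only, to show the call signature.
f_x = np.array([0.1, 0.9, 2.1])
y = np.array([0.0, 1.0, 2.0])
alpha = np.array([0.0, 0.0, 0.3])
alpha_star = np.array([0.2, 0.0, 0.0])
print(converged(f_x, y, alpha, alpha_star, C=1.0, eps=0.2, primal_obj=0.5))
```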


5.5. Subset selection algorithms

The convex programming algorithms described so far can be used directly on moderately sized (up to 3000 samples) datasets without any further modifications. On large datasets, however, it is difficult, due to memory and cpu limitations, to compute the dot product matrix k(x_i, x_j) and keep it in memory. A simple calculation shows that for instance storing the dot product matrix of the NIST OCR database (60,000 samples) at single precision would consume 0.7 GBytes. A Cholesky decomposition thereof, which would additionally require roughly the same amount of memory and 64 Teraflops (counting multiplies and adds separately), seems unrealistic, at least at current processor speeds.

A first solution, which was introduced in Vapnik (1982), relies on the observation that the solution can be reconstructed from the SVs alone. Hence, if we knew the SV set beforehand, and it fitted into memory, then we could directly solve the reduced problem. The catch is that we do not know the SV set before solving the problem. The solution is to start with an arbitrary subset, a first chunk that fits into memory, train the SV algorithm on it, keep the SVs and fill the chunk up with data the current estimator would make errors on (i.e. data lying outside the ε-tube of the current regression). Then retrain the system and keep on iterating until after training all KKT-conditions are satisfied.

The basic chunking algorithm just postponed the underlying problem of dealing with large datasets whose dot-product matrix cannot be kept in memory: it will occur for larger training set sizes than originally, but it is not completely avoided. Hence the solution proposed by Osuna, Freund and Girosi (1997) is to use only a subset of the variables as a working set and optimize the problem with respect to them while freezing the other variables. This method is described in detail in Osuna, Freund and Girosi (1997), Joachims (1999) and Saunders et al. (1998) for the case of pattern recognition.

An adaptation of these techniques to the case of regression with convex cost functions can be found in Appendix B. The basic structure of the method is described by Algorithm 1.

Algorithm 1: Basic structure of a working set algorithm

  Initialize α_i, α_i* = 0
  Choose arbitrary working set S_w
  repeat
    Compute coupling terms (linear and constant) for S_w (see Appendix A.3)
    Solve reduced optimization problem
    Choose new S_w from variables α_i, α_i* not satisfying the KKT conditions
  until working set S_w = ∅
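A schematic rendering of Algorithm 1 in code could look like the following (a sketch under assumptions: `solve_reduced` and `kkt_violation` stand for the reduced QP solver and the KKT check of Appendix A.3/B, which are not spelled out here; `q` is the working set size).

```python
import numpy as np

def working_set_loop(K, y, C, eps, solve_reduced, kkt_violation, q=50, tol=1e-3):
    """Skeleton of Algorithm 1: optimize over a small working set at a time."""
    l = len(y)
    alpha = np.zeros(l)          # alpha_i
    alpha_star = np.zeros(l)     # alpha_i*
    work = np.arange(min(q, l))  # arbitrary initial working set
    while work.size > 0:
        # Solve the reduced problem in the working-set variables only,
        # with the remaining variables frozen (their coupling enters via K).
        alpha[work], alpha_star[work] = solve_reduced(K, y, C, eps,
                                                      alpha, alpha_star, work)
        # Re-select: the variables whose KKT conditions are violated most.
        viol = kkt_violation(K, y, C, eps, alpha, alpha_star)
        candidates = np.argsort(-viol)
        work = candidates[viol[candidates] > tol][:q]
    return alpha, alpha_star
```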
5.6. Sequential minimal optimization

Recently an algorithm—Sequential Minimal Optimization (SMO)—was proposed (Platt 1999) that puts chunking to the extreme by iteratively selecting subsets only of size 2 and optimizing the target function with respect to them. It has been reported to have good convergence properties and it is easily implemented. The key point is that for a working set of 2 the optimization subproblem can be solved analytically without explicitly invoking a quadratic optimizer.

While readily derived for pattern recognition by Platt (1999), one simply has to mimic the original reasoning to obtain an extension to Regression Estimation. This is what will be done in Appendix C (the pseudocode can be found in Smola and Schölkopf (1998b)). The modifications consist of a pattern dependent regularization, convergence control via the number of significant figures, and a modified system of equations to solve the optimization problem in two variables for regression analytically.

Note that the reasoning only applies to SV regression with the ε-insensitive loss function—for most other convex cost functions an explicit solution of the restricted quadratic programming problem is impossible. Yet, one could derive an analogous nonquadratic convex optimization problem for general cost functions, but at the expense of having to solve it numerically.

The exposition proceeds as follows: first one has to derive the (modified) boundary conditions for the constrained 2 indices (i, j) subproblem in regression, next one can proceed to solve the optimization problem analytically, and finally one has to check which parts of the selection rules have to be modified to make the approach work for regression. Since most of the content is fairly technical it has been relegated to Appendix C.

The main difference in implementations of SMO for regression can be found in the way the constant offset b is determined (Keerthi et al. 1999) and which criterion is used to select a new set of variables. We present one such strategy in Appendix C.3. However, since selection strategies are the focus of current research we recommend that readers interested in implementing the algorithm make sure they are aware of the most recent developments in this area.

Finally, we note that just as we presently describe a generalization of SMO to regression estimation, other learning problems can also benefit from the underlying ideas. Recently, a SMO algorithm for training novelty detection systems (i.e. one-class classification) has been proposed (Schölkopf et al. 2001).

6. Variations on a theme

There exists a large number of algorithmic modifications of the SV algorithm, to make it suitable for specific settings (inverse problems, semiparametric settings), different ways of measuring capacity and reductions to linear programming (convex combinations) and different ways of controlling capacity. We will mention some of the more popular ones.

6.1. Convex combinations and l1-norms

All the algorithms presented so far involved convex, and at best, quadratic programming. Yet one might think of reducing the problem to a case where linear programming techniques can be applied. This can be done in a straightforward fashion (Mangasarian 1965, 1968, Weston et al.
1999, Smola, Schölkopf and Rätsch 1999) for both SV pattern recognition and regression. The key is to replace (35) by

R_reg[f] := R_emp[f] + λ‖α‖_1                                     (58)

where ‖α‖_1 denotes the l_1 norm in coefficient space. Hence one uses the SV kernel expansion (11)

f(x) = Σ_{i=1}^l α_i k(x_i, x) + b

with a different way of controlling capacity by minimizing

R_reg[f] = (1/l) Σ_{i=1}^l c(x_i, y_i, f(x_i)) + λ Σ_{i=1}^l |α_i|.   (59)

For the ε-insensitive loss function this leads to a linear programming problem. In the other cases, however, the problem still stays a quadratic or general convex one, and therefore may not yield the desired computational advantage. Therefore we will limit ourselves to the derivation of the linear programming problem in the case of the |·|_ε cost function. Reformulating (59) yields

minimize   Σ_{i=1}^l (α_i + α_i*) + C Σ_{i=1}^l (ξ_i + ξ_i*)

subject to y_i − Σ_{j=1}^l (α_j − α_j*) k(x_j, x_i) − b ≤ ε + ξ_i
           Σ_{j=1}^l (α_j − α_j*) k(x_j, x_i) + b − y_i ≤ ε + ξ_i*
           α_i, α_i*, ξ_i, ξ_i* ≥ 0

Unlike in the classical SV case, the transformation into its dual does not give any improvement in the structure of the optimization problem. Hence it is best to minimize R_reg[f] directly, which can be achieved by a linear optimizer (e.g. Dantzig 1962, Lustig, Marsten and Shanno 1990, Vanderbei 1997).

In Weston et al. (1999) a similar variant of the linear SV approach is used to estimate densities on a line. One can show (Smola et al. 2000) that one may obtain bounds on the generalization error which exhibit even better rates (in terms of the entropy numbers) than the classical SV case (Williamson, Smola and Schölkopf 1998).
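The linear program above is small enough to be handed to an off-the-shelf LP solver. The sketch below (not part of the original text; it assumes SciPy's `linprog` with the HiGHS backend and a Gaussian kernel, and the data are synthetic) stacks the variables as (α, α*, ξ, ξ*, b) and encodes the two tube constraints per sample.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 30)[:, None]
y = np.sinc(X).ravel() + 0.05 * rng.standard_normal(30)

l, C, eps, gamma = len(y), 10.0, 0.1, 1.0
K = np.exp(-gamma * (X - X.T) ** 2)          # Gaussian kernel matrix k(x_j, x_i)

# Variable vector: [alpha (l), alpha* (l), xi (l), xi* (l), b (1)]
c = np.concatenate([np.ones(2 * l), C * np.ones(2 * l), [0.0]])

I = np.eye(l)
# y_i - sum_j (a_j - a*_j) k(x_j, x_i) - b <= eps + xi_i
A_up = np.hstack([-K, K, -I, np.zeros((l, l)), -np.ones((l, 1))])
b_up = eps - y
# sum_j (a_j - a*_j) k(x_j, x_i) + b - y_i <= eps + xi*_i
A_lo = np.hstack([K, -K, np.zeros((l, l)), -I, np.ones((l, 1))])
b_lo = eps + y

bounds = [(0, None)] * (4 * l) + [(None, None)]  # b is free
res = linprog(c, A_ub=np.vstack([A_up, A_lo]), b_ub=np.concatenate([b_up, b_lo]),
              bounds=bounds, method="highs")

coef = res.x[:l] - res.x[l:2 * l]               # alpha - alpha*
b_off = res.x[-1]
f = K @ coef + b_off                            # fitted values on the training data
print(res.status, np.abs(coef).max(), np.mean(np.abs(f - y)))
```

As the text notes, minimizing this program directly is preferable to dualizing it; the l_1 penalty on α typically drives many coefficients to exactly zero.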


6.2. Automatic tuning of the insensitivity tube

Besides standard model selection issues, i.e. how to specify the trade-off between empirical error and model capacity, there also exists the problem of an optimal choice of a cost function. In particular, for the ε-insensitive cost function we still have the problem of choosing an adequate parameter ε in order to achieve good performance with the SV machine.

Smola et al. (1998a) show the existence of a linear dependency between the noise level and the optimal ε-parameter for SV regression. However, this would require that we know something about the noise model. This knowledge is not available in general. Therefore, albeit providing theoretical insight, this finding by itself is not particularly useful in practice. Moreover, if we really knew the noise model, we most likely would not choose the ε-insensitive cost function but the corresponding maximum likelihood loss function instead.

There exists, however, a method to construct SV machines that automatically adjust ε and moreover also, at least asymptotically, have a predetermined fraction of sampling points as SVs (Schölkopf et al. 2000). We modify (35) such that ε becomes a variable of the optimization problem, including an extra term in the primal objective function which attempts to minimize ε. In other words

minimize   R_ν[f] := R_emp[f] + (λ/2)‖w‖² + νε                    (60)

for some ν > 0. Hence (42) becomes (again carrying out the usual transformation between λ, l and C)

minimize   (1/2)‖w‖² + C ( Σ_{i=1}^l (c̃(ξ_i) + c̃(ξ_i*)) + lνε )

subject to y_i − ⟨w, x_i⟩ − b ≤ ε + ξ_i                           (61)
           ⟨w, x_i⟩ + b − y_i ≤ ε + ξ_i*
           ξ_i, ξ_i* ≥ 0

We consider the standard |·|_ε loss function.
Computing the dual of (61) yields

maximize   −(1/2) Σ_{i,j=1}^l (α_i − α_i*)(α_j − α_j*) k(x_i, x_j)
           + Σ_{i=1}^l y_i (α_i − α_i*)

subject to Σ_{i=1}^l (α_i − α_i*) = 0                             (62)
           Σ_{i=1}^l (α_i + α_i*) ≤ Cνl
           α_i, α_i* ∈ [0, C]

Note that the optimization problem is thus very similar to the ε-SV one: the target function is even simpler (it is homogeneous), but there is an additional constraint. For information on how this affects the implementation see Chang and Lin (2001).

Besides having the advantage of being able to automatically determine ε, (61) also has another advantage. It can be used to pre-specify the number of SVs:

Theorem 9 (Schölkopf et al. 2000).
1. ν is an upper bound on the fraction of errors.
2. ν is a lower bound on the fraction of SVs.
3. Suppose the data has been generated iid from a distribution p(x, y) = p(x)p(y | x) with a continuous conditional distribution p(y | x). With probability 1, asymptotically, ν equals the fraction of SVs and the fraction of errors.

Essentially, ν-SV regression improves upon ε-SV regression by allowing the tube width to adapt automatically to the data. What is kept fixed up to this point, however, is the shape of the tube. One can, however, go one step further and use parametric tube models with non-constant width, leading to almost identical optimization problems (Schölkopf et al. 2000).

Combining ν-SV regression with results on the asymptotically optimal choice of ε for a given noise model (Smola et al. 1998a) leads to a guideline on how to adjust ν provided the class of noise models (e.g. Gaussian or Laplacian) is known.
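Theorem 9 is easy to observe empirically. The sketch below (not from the original text) uses scikit-learn's NuSVR, which implements this ν-parametrization, and reports the fraction of support vectors for a few choices of ν on a toy data set.

```python
import numpy as np
from sklearn.svm import NuSVR

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 200)[:, None]
y = np.sinc(X).ravel() + 0.1 * rng.standard_normal(200)

for nu in (0.1, 0.3, 0.6):
    model = NuSVR(nu=nu, C=1.0, kernel="rbf", gamma=1.0).fit(X, y)
    frac_sv = len(model.support_) / len(y)
    # Theorem 9: nu lower-bounds the fraction of SVs (and upper-bounds the errors).
    print(f"nu={nu:.1f}  fraction of SVs={frac_sv:.2f}")
```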


Remark 10 (Optimal choice of ν). Denote by p a probability density with unit variance, and by P a family of noise models generated from p by P := {p̄ | p̄ = (1/σ) p(y/σ)}. Moreover assume that the data were drawn iid from p(x, y) = p(x) p(y − f(x)) with p(y − f(x)) continuous. Then under the assumption of uniform convergence, the asymptotically optimal value of ν is

ν = 1 − ∫_{−ε}^{ε} p(t) dt
where ε := argmin_τ (p(−τ) + p(τ))^{−2} (1 − ∫_{−τ}^{τ} p(t) dt)   (63)

Fig. 5. Optimal ν and ε for various degrees of polynomial additive noise

For polynomial noise models, i.e. densities of type exp(−|ξ|^p), one may compute the corresponding (asymptotically) optimal values of ν. They are given in Fig. 5. For further details see (Schölkopf et al. 2000, Smola 1998); an experimental validation has been given by Chalimourda, Schölkopf and Smola (2000).

We conclude this section by noting that ν-SV regression is related to the idea of trimmed estimators. One can show that the regression is not influenced if we perturb points lying outside the tube. Thus, the regression is essentially computed by discarding a certain fraction of outliers, specified by ν, and computing the regression estimate from the remaining points (Schölkopf et al. 2000).

7. Regularization

So far we were not concerned about the specific properties of the map into feature space and used it only as a convenient trick to construct nonlinear regression functions. In some cases the map was just given implicitly by the kernel, hence the map itself and many of its properties have been neglected. A deeper understanding of the kernel map would also be useful to choose appropriate kernels for a specific task (e.g. by incorporating prior knowledge (Schölkopf et al. 1998a)). Finally the feature map seems to defy the curse of dimensionality (Bellman 1961) by making problems seemingly easier yet reliable via a map into some even higher dimensional space.

In this section we focus on the connections between SV methods and previous techniques like Regularization Networks (Girosi, Jones and Poggio 1993). In particular we will show that SV machines are essentially Regularization Networks (RN) with a clever choice of cost functions and that the kernels are Green's functions of the corresponding regularization operators. For a full exposition of the subject the reader is referred to Smola, Schölkopf and Müller (1998c).

7.1. Regularization networks

Let us briefly review the basic concepts of RNs. As in (35) we minimize a regularized risk functional. However, rather than enforcing flatness in feature space we try to optimize some smoothness criterion for the function in input space. Thus we get

R_reg[f] := R_emp[f] + (λ/2)‖Pf‖².                                (64)

Here P denotes a regularization operator in the sense of Tikhonov and Arsenin (1977), i.e.
P is a positive semidefinite operator mapping from the Hilbert space H of functions f under consideration to a dot product space D such that the expression ⟨Pf · Pg⟩ is well defined for f, g ∈ H. For instance, by choosing a suitable operator that penalizes large variations of f one can reduce the well-known overfitting effect. Another possible setting also might be an operator P mapping from L_2(R^n) into some Reproducing Kernel Hilbert Space (RKHS) (Aronszajn 1950, Kimeldorf and Wahba 1971, Saitoh 1988, Schölkopf 1997, Girosi 1998).

Using an expansion of f in terms of some symmetric function k(x_i, x_j) (note here that k need not fulfill Mercer's condition and can be chosen arbitrarily since it is not used to define a regularization term),

f(x) = Σ_{i=1}^l α_i k(x_i, x) + b,                               (65)

and the ε-insensitive cost function, this leads to a quadratic programming problem similar to the one for SVs. Using

D_ij := ⟨(Pk)(x_i, .) · (Pk)(x_j, .)⟩                             (66)

we get α = D^{−1} K (β − β*), with β, β* being the solution of

minimize   (1/2)(β* − β)^⊤ K D^{−1} K (β* − β)
           − (β* − β)^⊤ y − ε Σ_{i=1}^l (β_i + β_i*)              (67)

subject to Σ_{i=1}^l (β_i − β_i*) = 0 and β_i, β_i* ∈ [0, C].


Unfortunately, this setting of the problem does not preserve sparsity in terms of the coefficients, as a potentially sparse decomposition in terms of β_i and β_i* is spoiled by D^{−1} K, which is not in general diagonal.

7.2. Green's functions

Comparing (10) with (67) leads to the question whether and under which condition the two methods might be equivalent and therefore also under which conditions regularization networks might lead to sparse decompositions, i.e. only a few of the expansion coefficients α_i in f would differ from zero. A sufficient condition is D = K and thus K D^{−1} K = K (if K does not have full rank we only need that K D^{−1} K = K holds on the image of K):

k(x_i, x_j) = ⟨(Pk)(x_i, .) · (Pk)(x_j, .)⟩                       (68)

Our goal now is to solve the following two problems:

1. Given a regularization operator P, find a kernel k such that a SV machine using k will not only enforce flatness in feature space, but also correspond to minimizing a regularized risk functional with P as regularizer.
2. Given an SV kernel k, find a regularization operator P such that a SV machine using this kernel can be viewed as a Regularization Network using P.

These two problems can be solved by employing the concept of Green's functions as described in Girosi, Jones and Poggio (1993). These functions were introduced for the purpose of solving differential equations. In our context it is sufficient to know that the Green's functions G_{x_i}(x) of P*P satisfy

(P*P G_{x_i})(x) = δ_{x_i}(x).                                    (69)

Here, δ_{x_i}(x) is the δ-distribution (not to be confused with the Kronecker symbol δ_ij) which has the property that ⟨f · δ_{x_i}⟩ = f(x_i). The relationship between kernels and regularization operators is formalized in the following proposition:

Proposition 1 (Smola, Schölkopf and Müller 1998b). Let P be a regularization operator, and G be the Green's function of P*P. Then G is a Mercer kernel such that D = K. SV machines using G minimize the risk functional (64) with P as regularization operator.

In the following we will exploit this relationship in both ways: to compute Green's functions for a given regularization operator P and to infer the regularizer, given a kernel k.

7.3. Translation invariant kernels

Let us now more specifically consider regularization operators P̂ that may be written as multiplications in Fourier space

⟨Pf · Pg⟩ = (1/(2π)^{n/2}) ∫_Ω f̃(ω) g̃(ω) / P(ω) dω               (70)

with f̃(ω) denoting the Fourier transform of f(x), and P(ω) = P(−ω) real valued, nonnegative and converging to 0 for |ω| → ∞, and Ω := supp[P(ω)]. Small values of P(ω) correspond to a strong attenuation of the corresponding frequencies.
Hence small values of P(ω) for large ω are desirable, since high frequency components of f̃ correspond to rapid changes in f. P(ω) describes the filter properties of P*P. Note that no attenuation takes place for P(ω) = 0 as these frequencies have been excluded from the integration domain.

For regularization operators defined in Fourier space by (70) one can show by exploiting P(ω) = P(−ω) that

G(x_i, x) = (1/(2π)^{n/2}) ∫_{R^n} e^{iω(x_i − x)} P(ω) dω        (71)

is a corresponding Green's function satisfying translational invariance, i.e.

G(x_i, x_j) = G(x_i − x_j) and G̃(ω) = P(ω).                      (72)

This provides us with an efficient tool for analyzing SV kernels and the types of capacity control they exhibit. In fact the above is a special case of Bochner's theorem (Bochner 1959) stating that the Fourier transform of a positive measure constitutes a positive Hilbert Schmidt kernel.

Example 2 (Gaussian kernels). Following the exposition of Yuille and Grzywacz (1988) as described in Girosi, Jones and Poggio (1993), one can see that for

‖Pf‖² = ∫ dx Σ_m (σ^{2m} / (m! 2^m)) (Ô^m f(x))²                 (73)

with Ô^{2m} = Δ^m and Ô^{2m+1} = ∇Δ^m, Δ being the Laplacian and ∇ the gradient operator, we get Gaussian kernels (31). Moreover, we can provide an equivalent representation of P in terms of its Fourier properties, i.e. P(ω) = e^{−σ²‖ω‖²/2} up to a multiplicative constant.

Training an SV machine with Gaussian RBF kernels (Schölkopf et al. 1997) corresponds to minimizing the specific cost function with a regularization operator of type (73). Recall that (73) means that all derivatives of f are penalized (we have a pseudodifferential operator) to obtain a very smooth estimate. This also explains the good performance of SV machines in this case, as it is by no means obvious that choosing a flat function in some high dimensional space will correspond to a simple function in low dimensional space, as shown in Smola, Schölkopf and Müller (1998c) for Dirichlet kernels.
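Relation (72) identifies the regularizer with the Fourier transform of the kernel. As a quick numerical illustration (not part of the original text; grid sizes and σ are arbitrary), the sketch below approximates the Fourier transform of a one-dimensional Gaussian kernel by direct integration and compares it with σ√(2π) exp(−σ²ω²/2), i.e. a Gaussian again, in line with Example 2.

```python
import numpy as np

sigma = 0.7
x = np.linspace(-40, 40, 200001)          # wide grid so the tails are negligible
dx = x[1] - x[0]
k = np.exp(-x**2 / (2 * sigma**2))        # Gaussian kernel G(x)

for w in (0.0, 0.5, 1.0, 2.0):
    # G is even, so its Fourier transform reduces to a cosine integral.
    ft_numeric = np.sum(k * np.cos(w * x)) * dx
    ft_closed = sigma * np.sqrt(2 * np.pi) * np.exp(-sigma**2 * w**2 / 2)
    assert np.isclose(ft_numeric, ft_closed, rtol=1e-6)
    print(w, ft_numeric, ft_closed)
```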


The question that arises now is which kernel to choose. Let us think about two extreme situations.

1. Suppose we already knew the shape of the power spectrum Pow(ω) of the function we would like to estimate. In this case we choose k such that k̃ matches the power spectrum (Smola 1998).
2. If we happen to know very little about the given data a general smoothness assumption is a reasonable choice. Hence we might want to choose a Gaussian kernel. If computing time is important one might moreover consider kernels with compact support, e.g. using the B_q-spline kernels (cf. (32)). This choice will cause many matrix elements k_ij = k(x_i − x_j) to vanish.

The usual scenario will be in between the two extreme cases and we will have some limited prior knowledge available. For more information on using prior knowledge for choosing kernels see Schölkopf et al. (1998a).

7.4. Capacity control

All the reasoning so far was based on the assumption that there exist ways to determine model parameters like the regularization constant λ or length scales σ of rbf-kernels. The model selection issue itself would easily double the length of this review and moreover it is an area of active and rapidly moving research. Therefore we limit ourselves to a presentation of the basic concepts and refer the interested reader to the original publications.

It is important to keep in mind that there exist several fundamentally different approaches such as Minimum Description Length (cf. e.g. Rissanen 1978, Li and Vitányi 1993) which is based on the idea that the simplicity of an estimate, and therefore also its plausibility, is based on the information (number of bits) needed to encode it such that it can be reconstructed.

Bayesian estimation, on the other hand, considers the posterior probability of an estimate, given the observations X = {(x_1, y_1), ..., (x_l, y_l)}, an observation noise model, and a prior probability distribution p(f) over the space of estimates (parameters). It is given by Bayes Rule p(f | X) p(X) = p(X | f) p(f). Since p(X) does not depend on f, one can maximize p(X | f) p(f) to obtain the so-called MAP estimate. As a rule of thumb, to translate regularized risk functionals into Bayesian MAP estimation schemes, all one has to do is to consider exp(−R_reg[f]) = p(f | X). For a more detailed discussion see e.g. Kimeldorf and Wahba (1970), MacKay (1991), Neal (1996), Rasmussen (1996) and Williams (1998).

A simple yet powerful way of model selection is cross validation. This is based on the idea that the expectation of the error on a subset of the training sample not used during training is identical to the expected error itself. There exist several strategies such as 10-fold crossvalidation, leave-one-out error (l-fold crossvalidation), bootstrap and derived algorithms to estimate the crossvalidation error itself (see e.g.
Stone 1974, Wahba 1980, Efron 1982, Efron and Tibshirani 1994, Wahba 1999, Jaakkola and Haussler 1999) for further details.

Finally, one may also use uniform convergence bounds such as the ones introduced by Vapnik and Chervonenkis (1971). The basic idea is that one may bound with probability 1 − η (with η > 0) the expected risk R[f] by R_emp[f] + Φ(F, η), where Φ is a confidence term depending on the class of functions F. Several criteria for measuring the capacity of F exist, such as the VC-Dimension which, in pattern recognition problems, is given by the maximum number of points that can be separated by the function class in all possible ways, the Covering Number which is the number of elements from F that are needed to cover F with accuracy of at least ε, Entropy Numbers which are the functional inverse of Covering Numbers, and many more variants thereof (see e.g. Vapnik 1982, 1998, Devroye, Györfi and Lugosi 1996, Williamson, Smola and Schölkopf 1998, Shawe-Taylor et al. 1998).
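In practice, the cross-validation strategy described above is often the first thing to try for choosing C, ε and the kernel width. A minimal sketch (not from the original text; it assumes scikit-learn's SVR and GridSearchCV, and the parameter grid is arbitrary):

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sinc(X).ravel() + 0.1 * rng.standard_normal(200)

# 5-fold cross validation over a small grid of regularization and kernel widths.
grid = {"C": [0.1, 1.0, 10.0], "gamma": [0.1, 1.0, 10.0], "epsilon": [0.05, 0.1]}
search = GridSearchCV(SVR(kernel="rbf"), grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```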
8. Conclusion

Due to the already quite large body of work done in the field of SV research it is impossible to write a tutorial on SV regression which includes all contributions to this field. This also would be quite out of the scope of a tutorial and rather be relegated to textbooks on the matter (see Schölkopf and Smola (2002) for a comprehensive overview, Schölkopf, Burges and Smola (1999a) for a snapshot of the current state of the art, Vapnik (1998) for an overview on statistical learning theory, or Cristianini and Shawe-Taylor (2000) for an introductory textbook). Still the authors hope that this work provides a not overly biased view of the state of the art in SV regression research. We deliberately omitted (among others) the following topics.

8.1. Missing topics

Mathematical programming: Starting from a completely different perspective, algorithms have been developed that are similar in their ideas to SV machines. A good primer might be Bradley, Fayyad and Mangasarian (1998) (also see Mangasarian 1965, 1969, Street and Mangasarian 1995). A comprehensive discussion of connections between mathematical programming and SV machines has been given by Bennett (1999).

Density estimation with SV machines (Weston et al. 1999, Vapnik 1999): There one makes use of the fact that the cumulative distribution function is monotonically increasing, and that its values can be predicted with variable confidence which is adjusted by selecting different values of ε in the loss function.

Dictionaries: These were originally introduced in the context of wavelets by Chen, Donoho and Saunders (1999) to allow for a large class of basis functions to be considered simultaneously, e.g. kernels with different widths. In the standard SV case this is hardly possible except by defining new kernels as linear combinations of differently scaled ones: choosing the regularization operator already determines the kernel completely (Kimeldorf and Wahba 1971, Cox and O'Sullivan 1990, Schölkopf et al. 2000). Hence one has to resort to linear programming (Weston et al. 1999).

Applications: The focus of this review was on methods and theory rather than on applications. This was done to limit the size of the exposition. State of the art, or even record, performance was reported in Müller et al. (1997), Drucker et al. (1997), Stitson et al. (1999) and Mattera and Haykin (1999).


In many cases, it may be possible to achieve similar performance with neural network methods, however only if many parameters are optimally tuned by hand, thus depending largely on the skill of the experimenter. Certainly, SV machines are not a "silver bullet." However, as they have only few critical parameters (e.g. regularization and kernel width), state-of-the-art results can be achieved with relatively little effort.

8.2. Open issues

Being a very active field, there still exist a number of open issues that have to be addressed by future research. Now that the algorithmic development seems to have reached a more stable stage, one of the most important ones seems to be to find tight error bounds derived from the specific properties of kernel functions. It will be of interest in this context whether SV machines, or similar approaches stemming from a linear programming regularizer, will lead to more satisfactory results.

Moreover some sort of "luckiness framework" (Shawe-Taylor et al. 1998) for multiple model selection parameters, similar to multiple hyperparameters and automatic relevance detection in Bayesian statistics (MacKay 1991, Bishop 1995), will have to be devised to make SV machines less dependent on the skill of the experimenter.

It is also worthwhile to exploit the bridge between regularization operators, Gaussian processes and priors (see e.g. Williams 1998) to state Bayesian risk bounds for SV machines in order to compare the predictions with the ones from VC theory. Optimization techniques developed in the context of SV machines also could be used to deal with large datasets in the Gaussian process settings.

Prior knowledge appears to be another important question in SV regression. Whilst invariances could be included in pattern recognition in a principled way via the virtual SV mechanism and restriction of the feature space (Burges and Schölkopf 1997, Schölkopf et al. 1998a), it is still not clear how (probably) more subtle properties, as required for regression, could be dealt with efficiently.

Reduced set methods also should be considered for speeding up the prediction (and possibly also training) phase for large datasets (Burges and Schölkopf 1997, Osuna and Girosi 1999, Schölkopf et al. 1999b, Smola and Schölkopf 2000).
This topic is of great importance as data mining applications require algorithms that are able to deal with databases that are often at least one order of magnitude larger (1 million samples) than the current practical size for SV regression.

Many more aspects such as more data dependent generalization bounds, efficient training algorithms, automatic kernel selection procedures, and many techniques that already have made their way into the standard neural networks toolkit, will have to be considered in the future.

Readers who are tempted to embark upon a more detailed exploration of these topics, and to contribute their own ideas to this exciting field, may find it useful to consult the web page www.kernel-machines.org.

Appendix A: Solving the interior-point equations

A.1. Path following

Rather than trying to satisfy (53) directly we will solve a modified version thereof for some µ > 0 substituted on the rhs in the first place and decrease µ while iterating:

g_i z_i = µ, s_i t_i = µ for all i ∈ [1 ... n].                   (74)

Still it is rather difficult to solve the nonlinear system of equations (51), (52), and (74) exactly. However we are not interested in obtaining the exact solution to the approximation (74). Instead, we seek a somewhat more feasible solution for a given µ, then decrease µ and repeat. This can be done by linearizing the above system and solving the resulting equations by a predictor-corrector approach until the duality gap is small enough. The advantage is that we will get approximately equal performance as by trying to solve the quadratic system directly, provided that the terms in Δ² are small enough.

A(α + Δα) = b
α + Δα − g − Δg = l
α + Δα + t + Δt = u
c + (1/2) ∂_α q(α) + (1/2) ∂²_α q(α) Δα − (A(y + Δy))^⊤ + s + Δs = z + Δz
(g_i + Δg_i)(z_i + Δz_i) = µ
(s_i + Δs_i)(t_i + Δt_i) = µ

Solving for the variables in Δ we get

A Δα = b − Aα =: ρ
Δα − Δg = l − α + g =: ν
Δα + Δt = u − α − t =: τ
(A Δy)^⊤ + Δz − Δs − (1/2) ∂²_α q(α) Δα = c − (Ay)^⊤ + s − z + (1/2) ∂_α q(α) =: σ
g^{−1} z Δg + Δz = µ g^{−1} − z − g^{−1} Δg Δz =: γ_z
t^{−1} s Δt + Δs = µ t^{−1} − s − t^{−1} Δt Δs =: γ_s

where g^{−1} denotes the vector (1/g_1, ..., 1/g_n), and t^{−1} analogously. Moreover g^{−1}z and t^{−1}s denote the vectors generated by the componentwise product of the two vectors. Solving for


A <str<strong>on</strong>g>tutorial</str<strong>on</strong>g> <strong>on</strong> <strong>support</strong> <strong>vector</strong> regressi<strong>on</strong> 215g,t,z,s we getg = z −1 g(γ z − z) z = g −1 z(ˆν − α)t = s −1 t(γ s − s) s = t −1 s(α − ˆτ)(75)where ˆν := ν − z −1 gγ zˆτ := τ − s −1 tγ sNow we can formulate the reduced KKT–system (see (Vanderbei1994) for the quadratic case):[ −H A⊤][ ] [ α σ − g −1 z ˆν − t −1 ]s ˆτ=(76)A 0 yρwhere H := ( 1 2 ∂2 α q(α) + g−1 z + t −1 s).A.2. Iterati<strong>on</strong> strategiesFor the predictor-corrector method we proceed as follows. Inthe predictor step solve the system of (75) and (76) with µ = 0and all -terms <strong>on</strong> the rhs set to 0, i.e. γ z = z,γ s = s. Thevalues in are substituted back into the definiti<strong>on</strong>s for γ z andγ s and (75) and (76) are solved again in the corrector step. As thequadratic part in (76) is not affected by the predictor–correctorsteps, we <strong>on</strong>ly need to invert the quadratic matrix <strong>on</strong>ce. This isd<strong>on</strong>e best by manually pivoting for the H part, as it is positivedefinite.Next the values in obtained by such an iterati<strong>on</strong> step are usedto update the corresp<strong>on</strong>ding values in α, s, t, z,....Toensurethat the variables meet the positivity c<strong>on</strong>straints, the steplengthξ is chosen such that the variables move at most 1 − ε of theirinitial distance to the boundaries of the positive orthant. Usually(Vanderbei 1994) <strong>on</strong>e sets ε = 0.05.Another heuristic is used for computing µ, the parameter determininghow much the KKT-c<strong>on</strong>diti<strong>on</strong>s should be enforced.Obviously it is our aim to reduce µ as fast as possible, howeverif we happen to choose it too small, the c<strong>on</strong>diti<strong>on</strong> of the equati<strong>on</strong>swill worsen drastically. A setting that has proven to workrobustly isµ =〈g, z〉+〈s, t〉2n( ) ξ − 1 2. (77)ξ + 10The rati<strong>on</strong>ale behind (77) is to use the average of the satisfacti<strong>on</strong>of the KKT c<strong>on</strong>diti<strong>on</strong>s (74) as point of reference and thendecrease µ rapidly if we are far enough away from the boundariesof the positive orthant, to which all variables (except y) arec<strong>on</strong>strained to.Finally <strong>on</strong>e has to come up with good initial values. Analogouslyto Vanderbei (1994) we choose a regularized versi<strong>on</strong> of(76) in order to determine the initial c<strong>on</strong>diti<strong>on</strong>s. One solves[ (−12 ∂2 α q(α) + 1) ] [A ⊤ ] [ ] α c=(78)A 1 y band subsequently restricts the soluti<strong>on</strong> to a feasible set( )ux = max x,100g = min(α − l, u)t = min(u − α, u) (79)( ( )1z = min 2 ∂ αq(α) + c − (Ay) ⊤ + u )100 , u(s = min (− 1 )2 ∂ αq(α) − c + (Ay) ⊤ + u )100 , u(·) denotes the Heavyside functi<strong>on</strong>, i.e. (x) = 1 for x > 0and (x) = 0 otherwise.A.3. Special c<strong>on</strong>siderati<strong>on</strong>s for SV regressi<strong>on</strong>The algorithm described so far can be applied to both SV patternrecogniti<strong>on</strong> and regressi<strong>on</strong> estimati<strong>on</strong>. 
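To make the two heuristics of this subsection concrete, here is a minimal Python/NumPy sketch of the damped step-length rule and of the barrier-parameter update (77). It is not the authors' implementation: the function names (`step_length`, `barrier_parameter`) and the NumPy formulation are illustrative assumptions, and the surrounding predictor-corrector machinery is omitted.

```python
import numpy as np

def step_length(v, dv, eps=0.05):
    """Largest multiple of the direction dv that keeps every component of v
    positive, moving at most (1 - eps) of the distance to the boundary of the
    positive orthant (Vanderbei 1994 suggests eps = 0.05)."""
    neg = dv < 0
    if not np.any(neg):
        return 1.0
    return min(1.0, (1.0 - eps) * np.min(-v[neg] / dv[neg]))

def barrier_parameter(g, z, s, t, xi):
    """Heuristic (77): average complementarity <g,z> + <s,t> over 2n, damped by
    how much of a full step (xi) could be taken in the previous iteration."""
    n = len(g)
    return (g @ z + s @ t) / (2 * n) * ((xi - 1) / (xi + 10)) ** 2

# illustrative usage with random positive iterates and a search direction
rng = np.random.default_rng(0)
g, z, s, t = (rng.uniform(0.5, 2.0, 4) for _ in range(4))
dg = rng.standard_normal(4)
xi = step_length(g, dg)
mu = barrier_parameter(g, z, s, t, xi)
```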
A.3. Special considerations for SV regression

The algorithm described so far can be applied to both SV pattern recognition and regression estimation. For the standard setting in pattern recognition we have

$$q(\alpha) = \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j k(x_i, x_j) \tag{80}$$

and consequently ∂_{α_i} q(α) = 0 and ∂²_{α_i α_j} q(α) = y_i y_j k(x_i, x_j), i.e. the Hessian is dense and the only thing we can do is compute its Cholesky factorization to compute (76). In the case of SV regression, however, we have (with α := (α_1, ..., α_l, α_1*, ..., α_l*))

$$q(\alpha) = \sum_{i,j=1}^{l} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\, k(x_i, x_j) + 2C \sum_{i=1}^{l} \bigl(T(\alpha_i) + T(\alpha_i^*)\bigr) \tag{81}$$

and therefore

$$
\begin{aligned}
\partial_{\alpha_i} q(\alpha) &= \frac{d}{d\alpha_i} T(\alpha_i)\\
\partial^2_{\alpha_i \alpha_j} q(\alpha) &= k(x_i, x_j) + \delta_{ij}\,\frac{d^2}{d\alpha_i^2} T(\alpha_i)\\
\partial^2_{\alpha_i \alpha_j^*} q(\alpha) &= -k(x_i, x_j)
\end{aligned}\tag{82}
$$

and ∂²_{α_i* α_j*} q(α), ∂²_{α_i* α_j} q(α) analogously. Hence we are dealing with a matrix of type

$$M := \begin{bmatrix} K + D & -K \\ -K & K + D' \end{bmatrix}$$

where D, D′ are diagonal matrices. By applying an orthogonal transformation, M can be inverted essentially by inverting an l × l matrix instead of a 2l × 2l system. This is exactly the additional advantage one can gain from implementing the optimization algorithm directly instead of using a general purpose optimizer. One can show that for practical implementations (Smola, Schölkopf and Müller 1998b) one can solve optimization problems using nearly arbitrary convex cost functions as efficiently as the special case of ε-insensitive loss functions.
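The block structure of M can be exploited in several ways. The sketch below uses plain block elimination (adding the two block rows eliminates the first unknown), rather than the orthogonal transformation mentioned above, but it illustrates the same point: one l × l solve suffices for the 2l × 2l system. The function name and the NumPy formulation are assumptions for illustration, and D is assumed to be nonsingular (as it is when it collects the strictly positive interior-point terms).

```python
import numpy as np

def solve_block_system(K, d, dprime, a, b):
    """Solve [[K+D, -K], [-K, K+D']] [u; v] = [a; b] with D = diag(d),
    D' = diag(dprime), using a single l x l solve. Adding the two block rows
    gives D u + D' v = a + b, which eliminates u from the second row."""
    rhs_sum = a + b
    # reduced l x l system for v: [K (I + D^{-1} D') + D'] v = b + K D^{-1}(a + b)
    A = K * (1.0 + dprime / d) + np.diag(dprime)   # scales column j of K by (1 + d'_j/d_j)
    v = np.linalg.solve(A, b + K @ (rhs_sum / d))
    u = (rhs_sum - dprime * v) / d
    return u, v

# quick check against the explicit 2l x 2l system
rng = np.random.default_rng(0)
l = 5
G = rng.standard_normal((l, l)); K = G @ G.T        # positive definite kernel-like matrix
d, dprime = rng.uniform(0.5, 2.0, l), rng.uniform(0.5, 2.0, l)
a, b = rng.standard_normal(l), rng.standard_normal(l)
M = np.block([[K + np.diag(d), -K], [-K, K + np.diag(dprime)]])
u, v = solve_block_system(K, d, dprime, a, b)
assert np.allclose(M @ np.concatenate([u, v]), np.concatenate([a, b]))
```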


Finally note that, since we are solving the primal and dual optimization problems simultaneously, we are also computing parameters corresponding to the initial SV optimization problem. This observation is useful as it allows us to obtain the constant term b directly, namely by setting b = y (see Smola (1998) for details).

Appendix B: Solving the subset selection problem

B.1. Subset optimization problem

We will adapt the exposition of Joachims (1999) to the case of regression with convex cost functions. Without loss of generality we will assume ε ≠ 0 and α ∈ [0, C] (the other situations can be treated as a special case). First we will extract a reduced optimization problem for the working set when all other variables are kept fixed. Denote by S_w ⊂ {1, ..., l} the working set and by S_f := {1, ..., l}\S_w the fixed set. Writing (43) as an optimization problem only in terms of S_w yields

$$
\begin{aligned}
\text{maximize}\quad & -\frac{1}{2}\sum_{i,j\in S_w} (\alpha_i-\alpha_i^*)(\alpha_j-\alpha_j^*)\langle x_i, x_j\rangle
+ \sum_{i\in S_w} (\alpha_i-\alpha_i^*)\Bigl(y_i - \sum_{j\in S_f} (\alpha_j-\alpha_j^*)\langle x_i, x_j\rangle\Bigr)\\
&\quad + \sum_{i\in S_w} \bigl(-\varepsilon(\alpha_i+\alpha_i^*) + C\,(T(\alpha_i)+T(\alpha_i^*))\bigr)\\
\text{subject to}\quad & \sum_{i\in S_w} (\alpha_i-\alpha_i^*) = -\sum_{i\in S_f} (\alpha_i-\alpha_i^*)
\quad\text{and}\quad \alpha_i \in [0, C]
\end{aligned}\tag{83}
$$

Hence we only have to update the linear term by the coupling with the fixed set, −Σ_{i∈S_w}(α_i − α_i*) Σ_{j∈S_f}(α_j − α_j*)⟨x_i, x_j⟩, and the equality constraint by −Σ_{i∈S_f}(α_i − α_i*). It is easy to see that maximizing (83) also decreases (43) by exactly the same amount. If we choose variables for which the KKT conditions are not satisfied, the overall objective function tends to decrease whilst still keeping all variables feasible. Finally, it is bounded from below.

Even though this does not prove convergence (contrary to the statement in Osuna, Freund and Girosi (1997)), this algorithm proves very useful in practice. It is one of the few methods (besides Kaufman (1999) and Platt (1999)) that can deal with problems whose quadratic part does not completely fit into memory. Still, in practice one has to take special precautions to avoid stalling of convergence (recent results of Chang, Hsu and Lin (1999) indicate that under certain conditions a proof of convergence is possible). The crucial part is the selection of S_w.
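As an illustration of how the coupling terms of (83) are assembled, here is a short Python/NumPy sketch. The function name and the boolean-mask representation of S_w are assumptions for illustration; K is taken to be the precomputed dot product (kernel) matrix.

```python
import numpy as np

def reduced_subproblem_terms(K, y, alpha, alpha_star, work):
    """Coupling terms of (83) for a working set `work` (boolean mask of size l):
    the modified linear coefficients and the right-hand side of the equality
    constraint, with all variables outside the working set kept fixed."""
    fixed = ~work
    beta = alpha - alpha_star                     # net coefficients alpha_i - alpha_i^*
    coupling = K[np.ix_(work, fixed)] @ beta[fixed]
    linear = y[work] - coupling                   # y_i - sum_{j in S_f} beta_j <x_i, x_j>
    eq_rhs = -np.sum(beta[fixed])                 # sum_{i in S_w} beta_i must equal this
    return linear, eq_rhs
```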
B.2. A note on optimality

For convenience the KKT conditions are repeated in a slightly modified form. Denote by ϕ_i the error made by the current estimate at sample x_i, i.e.

$$\phi_i := y_i - f(x_i) = y_i - \Bigl[\sum_{j=1}^{m} k(x_i, x_j)(\alpha_j - \alpha_j^*) + b\Bigr]. \tag{84}$$

Rewriting the feasibility conditions (52) in terms of α yields

$$
\begin{aligned}
2\partial_{\alpha_i} T(\alpha_i) + \varepsilon - \phi_i + s_i - z_i &= 0\\
2\partial_{\alpha_i^*} T(\alpha_i^*) + \varepsilon + \phi_i + s_i^* - z_i^* &= 0
\end{aligned}\tag{85}
$$

for all i ∈ {1, ..., m} with z_i, z_i*, s_i, s_i* ≥ 0. A set of dual feasible variables z, s is given by

$$
\begin{aligned}
z_i &= \max\bigl(2\partial_{\alpha_i} T(\alpha_i) + \varepsilon - \phi_i,\; 0\bigr), &
s_i &= -\min\bigl(2\partial_{\alpha_i} T(\alpha_i) + \varepsilon - \phi_i,\; 0\bigr),\\
z_i^* &= \max\bigl(2\partial_{\alpha_i^*} T(\alpha_i^*) + \varepsilon + \phi_i,\; 0\bigr), &
s_i^* &= -\min\bigl(2\partial_{\alpha_i^*} T(\alpha_i^*) + \varepsilon + \phi_i,\; 0\bigr).
\end{aligned}\tag{86}
$$

Consequently the KKT conditions (53) can be translated into

$$
\begin{aligned}
\alpha_i z_i = 0 \quad&\text{and}\quad (C - \alpha_i)\, s_i = 0\\
\alpha_i^* z_i^* = 0 \quad&\text{and}\quad (C - \alpha_i^*)\, s_i^* = 0.
\end{aligned}\tag{87}
$$

All variables α_i, α_i* violating some of the conditions of (87) may be selected for further optimization. In most cases, especially in the initial stage of the optimization algorithm, this set of patterns is much larger than any practical size of S_w. Unfortunately, Osuna, Freund and Girosi (1997) contains little information on how to select S_w. The heuristics presented here are an adaptation of Joachims (1999) to regression. See also Lin (2001) for details on optimization for SVR.

B.3. Selection rules

Similarly to a merit function approach (El-Bakry et al. 1996), the idea is to select those variables that violate (85) and (87) most and thus contribute most to the feasibility gap. Hence one defines a score variable ζ_i by

$$\zeta_i := g_i z_i + s_i t_i = \alpha_i z_i + \alpha_i^* z_i^* + (C - \alpha_i)\, s_i + (C - \alpha_i^*)\, s_i^*. \tag{88}$$

By construction, Σ_i ζ_i is the size of the feasibility gap (cf. (56) for the case of ε-insensitive loss). By decreasing this gap, one approaches the solution (upper bounded by the primal objective and lower bounded by the dual objective function). Hence, the selection rule is to choose those patterns for which ζ_i is largest.
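A minimal sketch of this selection rule follows, written for the ε-insensitive loss (where T ≡ 0, so the 2∂T terms in (86) vanish). The function name and the choice of returning the q largest scores are illustrative assumptions rather than part of the original algorithm description.

```python
import numpy as np

def select_working_set(phi, alpha, alpha_star, C, eps, q):
    """Score each pattern by zeta_i from (86)-(88), specialized to the
    epsilon-insensitive loss, and return the indices of the q largest violators."""
    z      = np.maximum(eps - phi, 0.0)
    s      = -np.minimum(eps - phi, 0.0)
    z_star = np.maximum(eps + phi, 0.0)
    s_star = -np.minimum(eps + phi, 0.0)
    zeta = alpha * z + alpha_star * z_star + (C - alpha) * s + (C - alpha_star) * s_star
    return np.argsort(zeta)[::-1][:q]
```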


Some algorithms use

$$
\begin{aligned}
\zeta_i' &:= \alpha_i\,\Theta(z_i) + \alpha_i^*\,\Theta(z_i^*) + (C-\alpha_i)\,\Theta(s_i) + (C-\alpha_i^*)\,\Theta(s_i^*)\\
\text{or}\qquad
\zeta_i'' &:= \Theta(\alpha_i)\,z_i + \Theta(\alpha_i^*)\,z_i^* + \Theta(C-\alpha_i)\,s_i + \Theta(C-\alpha_i^*)\,s_i^*.
\end{aligned}\tag{89}
$$

One can see that ζ_i = 0, ζ_i' = 0, and ζ_i'' = 0 mutually imply each other. However, only ζ_i gives a measure for the contribution of the variable i to the size of the feasibility gap.

Finally, note that heuristics like assigning sticky-flags (cf. Burges 1998) to variables at the boundaries, thus effectively solving smaller subproblems, or completely removing the corresponding patterns from the training set while accounting for their couplings (Joachims 1999), can significantly decrease the size of the problem one has to solve and thus result in a noticeable speedup. Also, caching (Joachims 1999, Kowalczyk 2000) of already computed entries of the dot product matrix may have a significant impact on the performance.

Appendix C: Solving the SMO equations

C.1. Pattern dependent regularization

Consider the constrained optimization problem (83) for two indices, say (i, j). Pattern dependent regularization means that C_i may be different for every pattern (possibly even different for α_i and α_i*). Since at most two variables may become nonzero at the same time, and moreover we are dealing with a constrained optimization problem, we may express everything in terms of just one variable. From the summation constraint we obtain

$$(\alpha_i - \alpha_i^*) + (\alpha_j - \alpha_j^*) = (\alpha_i^{\mathrm{old}} - \alpha_i^{*\,\mathrm{old}}) + (\alpha_j^{\mathrm{old}} - \alpha_j^{*\,\mathrm{old}}) =: \gamma. \tag{90}$$

C.2. Analytic solution for regression

Next one has to solve the optimization problem analytically. We make use of (84) and substitute the values of ϕ_i into the reduced optimization problem (83). In particular we use

$$y_i - \sum_{j\notin S_w} (\alpha_j - \alpha_j^*) K_{ij} = \phi_i + b + \sum_{j\in S_w} (\alpha_j^{\mathrm{old}} - \alpha_j^{*\,\mathrm{old}}) K_{ij}. \tag{91}$$

This leads to the following constrained optimization problem in α_i (with the abbreviation η := K_ii + K_jj − 2K_ij, cf. note 11):

$$
\begin{aligned}
\text{maximize}\quad & -\tfrac{1}{2}(\alpha_i - \alpha_i^*)^2\eta - \varepsilon(\alpha_i + \alpha_i^*)(1-s) + (\alpha_i - \alpha_i^*)\bigl(\phi_i - \phi_j + \eta\,(\alpha_i^{\mathrm{old}} - \alpha_i^{*\,\mathrm{old}})\bigr)\\
\text{subject to}\quad & \alpha_i^{(*)} \in [L^{(*)}, H^{(*)}].
\end{aligned}\tag{92}
$$

The unconstrained maximum of (92) with respect to α_i or α_i* can be found below:

(I)   α_i, α_j:    α_i^old + η⁻¹(ϕ_i − ϕ_j)
(II)  α_i, α_j*:   α_i^old + η⁻¹(ϕ_i − ϕ_j − 2ε)
(III) α_i*, α_j:   α_i*^old − η⁻¹(ϕ_i − ϕ_j + 2ε)
(IV)  α_i*, α_j*:  α_i*^old − η⁻¹(ϕ_i − ϕ_j)

The problem is that we do not know beforehand which of the four quadrants (I)–(IV) contains the solution. However, by considering the sign of γ we can distinguish two cases: for γ > 0 only (I)–(III) are possible, for γ < 0 only (II)–(IV). For γ > 0 it is best to start with quadrant (I), test whether the unconstrained solution hits one of the boundaries L, H and, if so, probe the corresponding adjacent quadrant (II) or (III); γ < 0 can be dealt with analogously. Due to numerical instabilities it may happen that η < 0; in that case η should be set to 0 and (92) has to be solved in a linear fashion directly.
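The table above translates almost directly into code. The sketch below only enumerates the four unconstrained candidates and clips a chosen candidate to its interval [L, H]; the quadrant-probing logic driven by the sign of γ and the computation of the interval boundaries are left to the caller. The function names are assumptions for illustration.

```python
def unconstrained_updates(phi_i, phi_j, eta, eps, ai_old, ai_star_old):
    """Unconstrained maximizers of (92) in the four quadrants (I)-(IV).
    Each entry is the new value of alpha_i (quadrants I, II) or alpha_i^*
    (quadrants III, IV); the other variable of the pair follows from gamma."""
    return {
        "I":   ai_old      + (phi_i - phi_j) / eta,
        "II":  ai_old      + (phi_i - phi_j - 2 * eps) / eta,
        "III": ai_star_old - (phi_i - phi_j + 2 * eps) / eta,
        "IV":  ai_star_old - (phi_i - phi_j) / eta,
    }

def clip(value, lo, hi):
    """Project an unconstrained candidate onto its feasible interval [L, H]."""
    return min(max(value, lo), hi)
```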


C.3. Selection rule for regression

Finally, one has to pick indices (i, j) such that the objective function is maximized. Again, the reasoning of SMO (Platt 1999, Section 12.2.2) for classification will be mimicked. This means that a two loop approach is chosen to maximize the objective function. The outer loop iterates over all patterns violating the KKT conditions, first only over those with Lagrange multipliers neither on the upper nor lower boundary, and, once all of them are satisfied, over all patterns violating the KKT conditions, to ensure self consistency on the complete dataset (see note 12). This solves the problem of choosing i.

Now for j: to make a large step towards the minimum, one looks for large steps in α_i. As it is computationally expensive to compute η for all possible pairs (i, j), one chooses the heuristic to maximize the absolute value of the numerator in the expressions for α_i and α_i*, i.e. |ϕ_i − ϕ_j| and |ϕ_i − ϕ_j ± 2ε|. The index j corresponding to the maximum absolute value is chosen for this purpose.

If this heuristic happens to fail, in other words if little progress is made by this choice, all other indices j are looked at (this is what is called the "second choice hierarchy" in Platt (1999)) in the following way:

1. All indices j corresponding to non-bound examples are looked at, searching for an example to make progress on.
2. In the case that the first heuristic was unsuccessful, all other samples are analyzed until an example is found where progress can be made.
3. If both previous steps fail, proceed to the next i.

For a more detailed discussion see Platt (1999). Unlike interior point algorithms, SMO does not automatically provide a value for b. However, this can be chosen as in Section 1.4 by having a close look at the Lagrange multipliers α_i^(*) obtained.

C.4. Stopping criteria

By essentially minimizing a constrained primal optimization problem one cannot ensure that the dual objective function increases with every iteration step (see note 13). Nevertheless one knows that the minimum value of the objective function lies in the interval [dual objective_i, primal objective_i] for all steps i, hence also in the interval [(max_{j≤i} dual objective_j), primal objective_i]. One uses the latter to determine the quality of the current solution.

The calculation of the primal objective function from the prediction errors is straightforward. One uses

$$\sum_{i,j} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\, k_{ij} = -\sum_i (\alpha_i - \alpha_i^*)(\phi_i - y_i + b), \tag{93}$$

i.e. the definition of ϕ_i, to avoid the matrix-vector multiplication with the dot product matrix.
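A sketch of how (93) can be used in such a stopping test follows, assuming the standard ε-insensitive formulation with primal objective ½‖w‖² + C Σ_i(ξ_i + ξ_i*). The function name and the particular relative-gap test shown in the comment are assumptions, not the authors' prescription.

```python
import numpy as np

def duality_gap(phi, y, alpha, alpha_star, b, C, eps):
    """Primal and dual objectives for the eps-insensitive loss, computed from
    the prediction errors phi_i = y_i - f(x_i) via (93), i.e. without touching
    the kernel matrix again. Returns (primal, dual, gap)."""
    beta = alpha - alpha_star
    quad = beta @ (y - phi - b)        # = sum_ij beta_i beta_j k(x_i, x_j), cf. (93)
    primal = 0.5 * quad + C * np.sum(np.maximum(np.abs(phi) - eps, 0.0))
    dual = -0.5 * quad - eps * np.sum(alpha + alpha_star) + y @ beta
    return primal, dual, primal - dual

# one possible stopping test, tracking the best dual value seen so far:
# converged = (primal - best_dual) < tol * (abs(primal) + 1)
```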
Acknowledgments

This work has been supported in part by a grant of the DFG (Ja 379/71, Sm 62/1). The authors thank Peter Bartlett, Chris Burges, Stefan Harmeling, Olvi Mangasarian, Klaus-Robert Müller, Vladimir Vapnik, Jason Weston, Robert Williamson, and Andreas Ziehe for helpful discussions and comments.

Notes

1. Our use of the term 'regression' is somewhat loose in that it also includes cases of function estimation where one minimizes errors other than the mean square loss. This is done mainly for historical reasons (Vapnik, Golowich and Smola 1997).
2. A similar approach, however using linear instead of quadratic programming, was taken at the same time in the USA, mainly by Mangasarian (1965, 1968, 1969).
3. See Smola (1998) for an overview over other ways of specifying flatness of such functions.
4. This is true as long as the dimensionality of w is much higher than the number of observations. If this is not the case, specialized methods can offer considerable computational savings (Lee and Mangasarian 2001).
5. The table displays CT(α) instead of T(α) since the former can be plugged directly into the corresponding optimization equations.
6. The high price tag usually is the major deterrent to using them. Moreover, one has to bear in mind that in SV regression one may speed up the solution considerably by exploiting the fact that the quadratic form has a special structure or that there may exist rank degeneracies in the kernel matrix itself.
7. For large and noisy problems (e.g. 100,000 patterns and more with a substantial fraction of nonbound Lagrange multipliers) it is impossible to solve the problem exactly: due to the size one has to use subset selection algorithms, hence joint optimization over the training set is impossible. However, unlike in Neural Networks, we can determine the closeness to the optimum. Note that this reasoning only holds for convex cost functions.
8. A similar technique was employed by Bradley and Mangasarian (1998) in the context of linear programming in order to deal with large datasets.
9. Due to length constraints we will not deal with the connection between Gaussian Processes and SVMs. See Williams (1998) for an excellent overview.
10. Strictly speaking, in Bayesian estimation one is not so much concerned about the maximizer f̂ of p(f | X) but rather about the posterior distribution of f.
11. Negative values of η are theoretically impossible since k satisfies Mercer's condition: 0 ≤ ‖Φ(x_i) − Φ(x_j)‖² = K_ii + K_jj − 2K_ij = η.
12. It is sometimes useful, especially when dealing with noisy data, to iterate over the complete KKT violating dataset already before complete self consistency on the subset has been achieved. Otherwise much computational effort is spent on making subsets self consistent that are not globally self consistent. This is the reason why, in the pseudo code, a global loop is initiated as soon as fewer than 10% of the non-bound variables have changed.
13. It is still an open question how a subset selection optimization algorithm could be devised that decreases both primal and dual objective function at the same time. The problem is that this usually involves a number of dual variables of the order of the sample size, which makes this attempt impractical.

References

Aizerman M.A., Braverman É.M., and Rozonoér L.I. 1964. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control 25: 821–837.
Aronszajn N. 1950. Theory of reproducing kernels. Transactions of the American Mathematical Society 68: 337–404.


Bazaraa M.S., Sherali H.D., and Shetty C.M. 1993. Nonlinear Programming: Theory and Algorithms, 2nd edition. Wiley.
Bellman R.E. 1961. Adaptive Control Processes. Princeton University Press, Princeton, NJ.
Bennett K. 1999. Combining support vector and mathematical programming methods for induction. In: Schölkopf B., Burges C.J.C., and Smola A.J. (Eds.), Advances in Kernel Methods—SV Learning, MIT Press, Cambridge, MA, pp. 307–326.
Bennett K.P. and Mangasarian O.L. 1992. Robust linear programming discrimination of two linearly inseparable sets. Optimization Methods and Software 1: 23–34.
Berg C., Christensen J.P.R., and Ressel P. 1984. Harmonic Analysis on Semigroups. Springer, New York.
Bertsekas D.P. 1995. Nonlinear Programming. Athena Scientific, Belmont, MA.
Bishop C.M. 1995. Neural Networks for Pattern Recognition. Clarendon Press, Oxford.
Blanz V., Schölkopf B., Bülthoff H., Burges C., Vapnik V., and Vetter T. 1996. Comparison of view-based object recognition algorithms using realistic 3D models. In: von der Malsburg C., von Seelen W., Vorbrüggen J.C., and Sendhoff B. (Eds.), Artificial Neural Networks ICANN'96, Berlin. Springer Lecture Notes in Computer Science, Vol. 1112, pp. 251–256.
Bochner S. 1959. Lectures on Fourier Integral. Princeton Univ. Press, Princeton, New Jersey.
Boser B.E., Guyon I.M., and Vapnik V.N. 1992. A training algorithm for optimal margin classifiers. In: Haussler D. (Ed.), Proceedings of the Annual Conference on Computational Learning Theory. ACM Press, Pittsburgh, PA, pp. 144–152.
Bradley P.S., Fayyad U.M., and Mangasarian O.L. 1998. Data mining: Overview and optimization opportunities. Technical Report 98-01, University of Wisconsin, Computer Sciences Department, Madison, January. INFORMS Journal on Computing, to appear.
Bradley P.S. and Mangasarian O.L. 1998. Feature selection via concave minimization and support vector machines. In: Shavlik J. (Ed.), Proceedings of the International Conference on Machine Learning, Morgan Kaufmann Publishers, San Francisco, California, pp. 82–90. ftp://ftp.cs.wisc.edu/math-prog/tech-reports/98-03.ps.Z.
Bunch J.R. and Kaufman L. 1977. Some stable methods for calculating inertia and solving symmetric linear systems. Mathematics of Computation 31: 163–179.
Bunch J.R. and Kaufman L. 1980. A computational method for the indefinite quadratic programming problem. Linear Algebra and Its Applications, pp. 341–370, December.
Bunch J.R., Kaufman L., and Parlett B. 1976. Decomposition of a symmetric matrix. Numerische Mathematik 27: 95–109.
Burges C.J.C. 1996. Simplified support vector decision rules. In: Saitta L. (Ed.), Proceedings of the International Conference on Machine Learning, Morgan Kaufmann Publishers, San Mateo, CA, pp. 71–77.
Burges C.J.C. 1998. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2(2): 121–167.
Burges C.J.C. 1999. Geometry and invariance in kernel based methods. In: Schölkopf B., Burges C.J.C., and Smola A.J. (Eds.), Advances in Kernel Methods—Support Vector Learning, MIT Press, Cambridge, MA, pp. 89–116.
Burges C.J.C. and Schölkopf B. 1997. Improving the accuracy and speed of support vector learning machines. In: Mozer M.C., Jordan M.I., and Petsche T. (Eds.), Advances in Neural Information Processing Systems 9, MIT Press, Cambridge, MA, pp. 375–381.
Chalimourda A., Schölkopf B., and Smola A.J. 2004. Experimentally optimal ν in support vector regression for different noise models and parameter settings. Neural Networks 17(1): 127–141.
Chang C.-C., Hsu C.-W., and Lin C.-J. 1999. The analysis of decomposition methods for support vector machines. In: Proceedings of IJCAI99, SVM Workshop.
Chang C.-C. and Lin C.-J. 2001. Training ν-support vector classifiers: Theory and algorithms. Neural Computation 13(9): 2119–2147.
Chen S., Donoho D., and Saunders M. 1999. Atomic decomposition by basis pursuit. SIAM Journal of Scientific Computing 20(1): 33–61.
Cherkassky V. and Mulier F. 1998. Learning from Data. John Wiley and Sons, New York.
Cortes C. and Vapnik V. 1995. Support vector networks. Machine Learning 20: 273–297.
Cox D. and O'Sullivan F. 1990. Asymptotic analysis of penalized likelihood and related estimators. Annals of Statistics 18: 1676–1695.
CPLEX Optimization Inc. 1994. Using the CPLEX callable library. Manual.
Cristianini N. and Shawe-Taylor J. 2000. An Introduction to Support Vector Machines. Cambridge University Press, Cambridge, UK.
Cristianini N., Campbell C., and Shawe-Taylor J. 1998. Multiplicative updatings for support vector learning. NeuroCOLT Technical Report NC-TR-98-016, Royal Holloway College.
Dantzig G.B. 1962. Linear Programming and Extensions. Princeton Univ. Press, Princeton, NJ.
Devroye L., Györfi L., and Lugosi G. 1996. A Probabilistic Theory of Pattern Recognition. Number 31 in Applications of Mathematics. Springer, New York.
Drucker H., Burges C.J.C., Kaufman L., Smola A., and Vapnik V. 1997. Support vector regression machines. In: Mozer M.C., Jordan M.I., and Petsche T. (Eds.), Advances in Neural Information Processing Systems 9, MIT Press, Cambridge, MA, pp. 155–161.
Efron B. 1982. The jackknife, the bootstrap, and other resampling plans. SIAM, Philadelphia.
Efron B. and Tibshirani R.J. 1994. An Introduction to the Bootstrap. Chapman and Hall, New York.
El-Bakry A., Tapia R., Tsuchiya R., and Zhang Y. 1996. On the formulation and theory of the Newton interior-point method for nonlinear programming. Journal of Optimization Theory and Applications 89: 507–541.
Fletcher R. 1989. Practical Methods of Optimization. John Wiley and Sons, New York.
Girosi F. 1998. An equivalence between sparse approximation and support vector machines. Neural Computation 10(6): 1455–1480.
Girosi F., Jones M., and Poggio T. 1993. Priors, stabilizers and basis functions: From regularization to radial, tensor and additive splines. A.I. Memo No. 1430, Artificial Intelligence Laboratory, Massachusetts Institute of Technology.
Guyon I., Boser B., and Vapnik V. 1993. Automatic capacity tuning of very large VC-dimension classifiers. In: Hanson S.J., Cowan J.D., and Giles C.L. (Eds.), Advances in Neural Information Processing Systems 5. Morgan Kaufmann Publishers, pp. 147–155.
Härdle W. 1990. Applied Nonparametric Regression, volume 19 of Econometric Society Monographs. Cambridge University Press.


Hastie T.J. and Tibshirani R.J. 1990. Generalized Additive Models, volume 43 of Monographs on Statistics and Applied Probability. Chapman and Hall, London.
Haykin S. 1998. Neural Networks: A Comprehensive Foundation, 2nd edition. Macmillan, New York.
Hearst M.A., Schölkopf B., Dumais S., Osuna E., and Platt J. 1998. Trends and controversies—support vector machines. IEEE Intelligent Systems 13: 18–28.
Herbrich R. 2002. Learning Kernel Classifiers: Theory and Algorithms. MIT Press.
Huber P.J. 1972. Robust statistics: A review. Annals of Statistics 43: 1041.
Huber P.J. 1981. Robust Statistics. John Wiley and Sons, New York.
IBM Corporation. 1992. IBM optimization subroutine library guide and reference. IBM Systems Journal, 31, SC23-0519.
Jaakkola T.S. and Haussler D. 1999. Probabilistic kernel regression models. In: Proceedings of the 1999 Conference on AI and Statistics.
Joachims T. 1999. Making large-scale SVM learning practical. In: Schölkopf B., Burges C.J.C., and Smola A.J. (Eds.), Advances in Kernel Methods—Support Vector Learning, MIT Press, Cambridge, MA, pp. 169–184.
Karush W. 1939. Minima of functions of several variables with inequalities as side constraints. Master's thesis, Dept. of Mathematics, Univ. of Chicago.
Kaufman L. 1999. Solving the quadratic programming problem arising in support vector classification. In: Schölkopf B., Burges C.J.C., and Smola A.J. (Eds.), Advances in Kernel Methods—Support Vector Learning, MIT Press, Cambridge, MA, pp. 147–168.
Keerthi S.S., Shevade S.K., Bhattacharyya C., and Murthy K.R.K. 1999. Improvements to Platt's SMO algorithm for SVM classifier design. Technical Report CD-99-14, Dept. of Mechanical and Production Engineering, Natl. Univ. Singapore, Singapore.
Keerthi S.S., Shevade S.K., Bhattacharyya C., and Murthy K.R.K. 2001. Improvements to Platt's SMO algorithm for SVM classifier design. Neural Computation 13: 637–649.
Kimeldorf G.S. and Wahba G. 1970. A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Annals of Mathematical Statistics 41: 495–502.
Kimeldorf G.S. and Wahba G. 1971. Some results on Tchebycheffian spline functions. J. Math. Anal. Applic. 33: 82–95.
Kowalczyk A. 2000. Maximal margin perceptron. In: Smola A.J., Bartlett P.L., Schölkopf B., and Schuurmans D. (Eds.), Advances in Large Margin Classifiers, MIT Press, Cambridge, MA, pp. 75–113.
Kuhn H.W. and Tucker A.W. 1951. Nonlinear programming. In: Proc. 2nd Berkeley Symposium on Mathematical Statistics and Probabilistics, Berkeley. University of California Press, pp. 481–492.
Lee Y.J. and Mangasarian O.L. 2001. SSVM: A smooth support vector machine for classification. Computational Optimization and Applications 20(1): 5–22.
Li M. and Vitányi P. 1993. An Introduction to Kolmogorov Complexity and its Applications. Texts and Monographs in Computer Science. Springer, New York.
Lin C.-J. 2001. On the convergence of the decomposition method for support vector machines. IEEE Transactions on Neural Networks 12(6): 1288–1298.
Lustig I.J., Marsten R.E., and Shanno D.F. 1990. On implementing Mehrotra's predictor-corrector interior point method for linear programming. Princeton Technical Report SOR 90-03, Dept. of Civil Engineering and Operations Research, Princeton University.
Lustig I.J., Marsten R.E., and Shanno D.F. 1992. On implementing Mehrotra's predictor-corrector interior point method for linear programming. SIAM Journal on Optimization 2(3): 435–449.
MacKay D.J.C. 1991. Bayesian Methods for Adaptive Models. PhD thesis, Computation and Neural Systems, California Institute of Technology, Pasadena, CA.
Mangasarian O.L. 1965. Linear and nonlinear separation of patterns by linear programming. Operations Research 13: 444–452.
Mangasarian O.L. 1968. Multi-surface method of pattern separation. IEEE Transactions on Information Theory IT-14: 801–807.
Mangasarian O.L. 1969. Nonlinear Programming. McGraw-Hill, New York.
Mattera D. and Haykin S. 1999. Support vector machines for dynamic reconstruction of a chaotic system. In: Schölkopf B., Burges C.J.C., and Smola A.J. (Eds.), Advances in Kernel Methods—Support Vector Learning, MIT Press, Cambridge, MA, pp. 211–242.
McCormick G.P. 1983. Nonlinear Programming: Theory, Algorithms, and Applications. John Wiley and Sons, New York.
Megiddo N. 1989. Pathways to the optimal set in linear programming. In: Progress in Mathematical Programming, Springer, New York, NY, pp. 131–158.
Mehrotra S. and Sun J. 1992. On the implementation of a (primal-dual) interior point method. SIAM Journal on Optimization 2(4): 575–601.
Mercer J. 1909. Functions of positive and negative type and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society, London A 209: 415–446.
Micchelli C.A. 1986. Algebraic aspects of interpolation. Proceedings of Symposia in Applied Mathematics 36: 81–102.
Morozov V.A. 1984. Methods for Solving Incorrectly Posed Problems. Springer.
Müller K.-R., Smola A., Rätsch G., Schölkopf B., Kohlmorgen J., and Vapnik V. 1997. Predicting time series with support vector machines. In: Gerstner W., Germond A., Hasler M., and Nicoud J.-D. (Eds.), Artificial Neural Networks ICANN'97, Berlin. Springer Lecture Notes in Computer Science, Vol. 1327, pp. 999–1004.
Murtagh B.A. and Saunders M.A. 1983. MINOS 5.1 user's guide. Technical Report SOL 83-20R, Stanford University, CA, USA, Revised 1987.
Neal R. 1996. Bayesian Learning in Neural Networks. Springer.
Nilsson N.J. 1965. Learning Machines: Foundations of Trainable Pattern Classifying Systems. McGraw-Hill.
Nyquist H. 1928. Certain topics in telegraph transmission theory. Trans. A.I.E.E., pp. 617–644.
Osuna E., Freund R., and Girosi F. 1997. An improved training algorithm for support vector machines. In: Principe J., Gile L., Morgan N., and Wilson E. (Eds.), Neural Networks for Signal Processing VII—Proceedings of the 1997 IEEE Workshop, IEEE, New York, pp. 276–285.
Osuna E. and Girosi F. 1999. Reducing the run-time complexity in support vector regression. In: Schölkopf B., Burges C.J.C., and Smola A.J. (Eds.), Advances in Kernel Methods—Support Vector Learning, MIT Press, Cambridge, MA, pp. 271–284.
Ovari Z. 2000. Kernels, eigenvalues and support vector machines. Honours thesis, Australian National University, Canberra.


Platt J. 1999. Fast training of support vector machines using sequential minimal optimization. In: Schölkopf B., Burges C.J.C., and Smola A.J. (Eds.), Advances in Kernel Methods—Support Vector Learning, MIT Press, Cambridge, MA, pp. 185–208.
Poggio T. 1975. On optimal nonlinear associative recall. Biological Cybernetics 19: 201–209.
Rasmussen C. 1996. Evaluation of Gaussian Processes and Other Methods for Non-Linear Regression. PhD thesis, Department of Computer Science, University of Toronto. ftp://ftp.cs.toronto.edu/pub/carl/thesis.ps.gz.
Rissanen J. 1978. Modeling by shortest data description. Automatica 14: 465–471.
Saitoh S. 1988. Theory of Reproducing Kernels and its Applications. Longman Scientific & Technical, Harlow, England.
Saunders C., Stitson M.O., Weston J., Bottou L., Schölkopf B., and Smola A. 1998. Support vector machine—reference manual. Technical Report CSD-TR-98-03, Department of Computer Science, Royal Holloway, University of London, Egham, UK. SVM available at http://svm.dcs.rhbnc.ac.uk/.
Schoenberg I. 1942. Positive definite functions on spheres. Duke Math. J. 9: 96–108.
Schölkopf B. 1997. Support Vector Learning. R. Oldenbourg Verlag, München. Doktorarbeit, TU Berlin. Download: http://www.kernel-machines.org.
Schölkopf B., Burges C., and Vapnik V. 1995. Extracting support data for a given task. In: Fayyad U.M. and Uthurusamy R. (Eds.), Proceedings, First International Conference on Knowledge Discovery & Data Mining, AAAI Press, Menlo Park.
Schölkopf B., Burges C., and Vapnik V. 1996. Incorporating invariances in support vector learning machines. In: von der Malsburg C., von Seelen W., Vorbrüggen J.C., and Sendhoff B. (Eds.), Artificial Neural Networks ICANN'96, Berlin. Springer Lecture Notes in Computer Science, Vol. 1112, pp. 47–52.
Schölkopf B., Burges C.J.C., and Smola A.J. (Eds.) 1999a. Advances in Kernel Methods—Support Vector Learning. MIT Press, Cambridge, MA.
Schölkopf B., Herbrich R., Smola A.J., and Williamson R.C. 2001. A generalized representer theorem. Technical Report 2000-81, NeuroCOLT, 2000. To appear in Proceedings of the Annual Conference on Learning Theory, Springer (2001).
Schölkopf B., Mika S., Burges C., Knirsch P., Müller K.-R., Rätsch G., and Smola A. 1999b. Input space vs. feature space in kernel-based methods. IEEE Transactions on Neural Networks 10(5): 1000–1017.
Schölkopf B., Platt J., Shawe-Taylor J., Smola A.J., and Williamson R.C. 2001. Estimating the support of a high-dimensional distribution. Neural Computation 13(7): 1443–1471.
Schölkopf B., Simard P., Smola A., and Vapnik V. 1998a. Prior knowledge in support vector kernels. In: Jordan M.I., Kearns M.J., and Solla S.A. (Eds.), Advances in Neural Information Processing Systems 10, MIT Press, Cambridge, MA, pp. 640–646.
Schölkopf B., Smola A., and Müller K.-R. 1998b. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation 10: 1299–1319.
Schölkopf B., Smola A., Williamson R.C., and Bartlett P.L. 2000. New support vector algorithms. Neural Computation 12: 1207–1245.
Schölkopf B. and Smola A.J. 2002. Learning with Kernels. MIT Press.
Schölkopf B., Sung K., Burges C., Girosi F., Niyogi P., Poggio T., and Vapnik V. 1997. Comparing support vector machines with Gaussian kernels to radial basis function classifiers. IEEE Transactions on Signal Processing 45: 2758–2765.
Shannon C.E. 1948. A mathematical theory of communication. Bell System Technical Journal 27: 379–423, 623–656.
Shawe-Taylor J., Bartlett P.L., Williamson R.C., and Anthony M. 1998. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory 44(5): 1926–1940.
Smola A., Murata N., Schölkopf B., and Müller K.-R. 1998a. Asymptotically optimal choice of ε-loss for support vector machines. In: Niklasson L., Bodén M., and Ziemke T. (Eds.), Proceedings of the International Conference on Artificial Neural Networks, Perspectives in Neural Computing, Springer, Berlin, pp. 105–110.
Smola A., Schölkopf B., and Müller K.-R. 1998b. The connection between regularization operators and support vector kernels. Neural Networks 11: 637–649.
Smola A., Schölkopf B., and Müller K.-R. 1998c. General cost functions for support vector regression. In: Downs T., Frean M., and Gallagher M. (Eds.), Proc. of the Ninth Australian Conf. on Neural Networks, University of Queensland, Brisbane, Australia, pp. 79–83.
Smola A., Schölkopf B., and Rätsch G. 1999. Linear programs for automatic accuracy control in regression. In: Ninth International Conference on Artificial Neural Networks, Conference Publications No. 470, IEE, London, pp. 575–580.
Smola A.J. 1996. Regression estimation with support vector learning machines. Diplomarbeit, Technische Universität München.
Smola A.J. 1998. Learning with Kernels. PhD thesis, Technische Universität Berlin. GMD Research Series No. 25.
Smola A.J., Elisseeff A., Schölkopf B., and Williamson R.C. 2000. Entropy numbers for convex combinations and MLPs. In: Smola A.J., Bartlett P.L., Schölkopf B., and Schuurmans D. (Eds.), Advances in Large Margin Classifiers, MIT Press, Cambridge, MA, pp. 369–387.
Smola A.J., Óvári Z.L., and Williamson R.C. 2001. Regularization with dot-product kernels. In: Leen T.K., Dietterich T.G., and Tresp V. (Eds.), Advances in Neural Information Processing Systems 13, MIT Press, pp. 308–314.
Smola A.J. and Schölkopf B. 1998a. On a kernel-based method for pattern recognition, regression, approximation and operator inversion. Algorithmica 22: 211–231.
Smola A.J. and Schölkopf B. 1998b. A tutorial on support vector regression. NeuroCOLT Technical Report NC-TR-98-030, Royal Holloway College, University of London, UK.
Smola A.J. and Schölkopf B. 2000. Sparse greedy matrix approximation for machine learning. In: Langley P. (Ed.), Proceedings of the International Conference on Machine Learning, Morgan Kaufmann Publishers, San Francisco, pp. 911–918.
Stitson M., Gammerman A., Vapnik V., Vovk V., Watkins C., and Weston J. 1999. Support vector regression with ANOVA decomposition kernels. In: Schölkopf B., Burges C.J.C., and Smola A.J. (Eds.), Advances in Kernel Methods—Support Vector Learning, MIT Press, Cambridge, MA, pp. 285–292.
Stone C.J. 1985. Additive regression and other nonparametric models. Annals of Statistics 13: 689–705.
Stone M. 1974. Cross-validatory choice and assessment of statistical predictors (with discussion). Journal of the Royal Statistical Society B36: 111–147.


Street W.N. and Mangasarian O.L. 1995. Improved generalization via tolerant training. Technical Report MP-TR-95-11, University of Wisconsin, Madison.
Tikhonov A.N. and Arsenin V.Y. 1977. Solution of Ill-Posed Problems. V.H. Winston and Sons.
Tipping M.E. 2000. The relevance vector machine. In: Solla S.A., Leen T.K., and Müller K.-R. (Eds.), Advances in Neural Information Processing Systems 12, MIT Press, Cambridge, MA, pp. 652–658.
Vanderbei R.J. 1994. LOQO: An interior point code for quadratic programming. TR SOR-94-15, Statistics and Operations Research, Princeton Univ., NJ.
Vanderbei R.J. 1997. LOQO user's manual—version 3.10. Technical Report SOR-97-08, Princeton University, Statistics and Operations Research. Code available at http://www.princeton.edu/~rvdb/.
Vapnik V. 1995. The Nature of Statistical Learning Theory. Springer, New York.
Vapnik V. 1998. Statistical Learning Theory. John Wiley and Sons, New York.
Vapnik V. 1999. Three remarks on the support vector method of function estimation. In: Schölkopf B., Burges C.J.C., and Smola A.J. (Eds.), Advances in Kernel Methods—Support Vector Learning, MIT Press, Cambridge, MA, pp. 25–42.
Vapnik V. and Chervonenkis A. 1964. A note on one class of perceptrons. Automation and Remote Control 25.
Vapnik V. and Chervonenkis A. 1974. Theory of Pattern Recognition [in Russian]. Nauka, Moscow. (German translation: Wapnik W. & Tscherwonenkis A., Theorie der Zeichenerkennung, Akademie-Verlag, Berlin, 1979).
Vapnik V., Golowich S., and Smola A. 1997. Support vector method for function approximation, regression estimation, and signal processing. In: Mozer M.C., Jordan M.I., and Petsche T. (Eds.), Advances in Neural Information Processing Systems 9, MIT Press, Cambridge, MA, pp. 281–287.
Vapnik V. and Lerner A. 1963. Pattern recognition using generalized portrait method. Automation and Remote Control 24: 774–780.
Vapnik V.N. 1982. Estimation of Dependences Based on Empirical Data. Springer, Berlin.
Vapnik V.N. and Chervonenkis A.Y. 1971. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications 16(2): 264–281.
Wahba G. 1980. Spline bases, regularization, and generalized cross-validation for solving approximation problems with large quantities of noisy data. In: Ward J. and Cheney E. (Eds.), Proceedings of the International Conference on Approximation Theory in honour of George Lorenz, Academic Press, Austin, TX, pp. 8–10.
Wahba G. 1990. Spline Models for Observational Data, volume 59 of CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, Philadelphia.
Wahba G. 1999. Support vector machines, reproducing kernel Hilbert spaces and the randomized GACV. In: Schölkopf B., Burges C.J.C., and Smola A.J. (Eds.), Advances in Kernel Methods—Support Vector Learning, MIT Press, Cambridge, MA, pp. 69–88.
Weston J., Gammerman A., Stitson M., Vapnik V., Vovk V., and Watkins C. 1999. Support vector density estimation. In: Schölkopf B., Burges C.J.C., and Smola A.J. (Eds.), Advances in Kernel Methods—Support Vector Learning, MIT Press, Cambridge, MA, pp. 293–306.
Williams C.K.I. 1998. Prediction with Gaussian processes: From linear regression to linear prediction and beyond. In: Jordan M.I. (Ed.), Learning and Inference in Graphical Models, Kluwer Academic, pp. 599–621.
Williamson R.C., Smola A.J., and Schölkopf B. 1998. Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators. Technical Report 19, NeuroCOLT, http://www.neurocolt.com. Published in IEEE Transactions on Information Theory 47(6): 2516–2532 (2001).
Yuille A. and Grzywacz N. 1988. The motion coherence theory. In: Proceedings of the International Conference on Computer Vision, IEEE Computer Society Press, Washington, DC, pp. 344–354.
