
machine with Gaussian noise on a triangular fuzzy number space to forecast fuzzy nonlinear systems [11].

Rough set theory [12] is a powerful preprocessing tool for extracting knowledge from uncertain and incomplete data. It has been applied to support vector machines to reduce the features of the data to be processed and to eliminate redundancy; at the same time, it also improves the performance of the classical support vector machine. To deal with the overfitting problem of the traditional support vector machine, Zhang and Wang proposed a rough margin based support vector machine [13]. In this paper, we propose a double margin based fuzzy support vector machine that combines rough set theory with the fuzzy support vector machine, namely a double margin (rough margin) based fuzzy support vector machine (RFSVM). The proposed method not only inherits the characteristics of the FSVM method, but also accounts for the effect of training samples on the decision hyperplane according to their position within the rough margin. The presented method therefore further reduces overfitting due to noise or outliers.

This paper is organized as follows. In Section 2, a brief review of the support vector machine is given. In Section 3, we describe the proposed RFSVM in detail, covering both the binary classification and the multi-class classification versions of RFSVM. In the following section, we evaluate our method on benchmark data sets and compare it with existing support vector machines. Some conclusions are given in the final section.

II. SUPPORT VECTOR MACHINES ALGORITHM

In this section, we briefly describe support vector machines for binary classification problems.

Given a dataset of labeled training points $(x_1, y_1), (x_2, y_2), \ldots, (x_l, y_l)$, where $(x_i, y_i) \in R^N \times \{+1, -1\}$, $i = 1, 2, \ldots, l$, suppose the training data are linearly separable. That is to say, there is some hyperplane which correctly separates the positive examples from the negative examples. A point $x$ lying on the hyperplane satisfies $\langle w, x \rangle + b = 0$, where $w$ is normal to the hyperplane. In this case, the support vector machine algorithm finds the optimal separating hyperplane with the maximal margin. When the training data are linearly non-separable or only approximately separable, a trade-off parameter needs to be introduced. When the training data are not linearly separable, the support vector machine learning algorithm introduces the kernel strategy, which maps the input data into a higher-dimensional feature space $z$ by a nonlinear mapping function $\varphi(x)$ so that the data in the feature space $z$ are linearly or approximately linearly separable. All training data satisfy the following decision function:

$$f(x_i) = \operatorname{sign}(\langle w, x_i \rangle + b) = \begin{cases} +1, & \text{if } y_i = +1 \\ -1, & \text{if } y_i = -1 \end{cases} \quad (1)$$
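As a minimal illustration of the decision rule (1) (not part of the original paper), the following sketch evaluates $\operatorname{sign}(\langle w, x \rangle + b)$ for an arbitrary, hand-picked hyperplane; the values of `w` and `b` are placeholders.

```python
import numpy as np

def decision(w, b, x):
    """Linear SVM decision rule f(x) = sign(<w, x> + b), as in Eq. (1)."""
    return np.sign(np.dot(w, x) + b)

# Arbitrary example hyperplane and test points (illustration only).
w = np.array([1.0, -1.0])
b = -0.5
print(decision(w, b, np.array([2.0, 0.0])))   # lies on the +1 side
print(decision(w, b, np.array([0.0, 2.0])))   # lies on the -1 side
```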

All training points satisfy the following inequalities:

$$\begin{cases} \langle w, x_i \rangle + b \ge +1, & \text{if } y_i = +1 \\ \langle w, x_i \rangle + b \le -1, & \text{if } y_i = -1 \end{cases} \quad (2)$$

In fact, the above inequalities can be written as $y_i(\langle w, x_i \rangle + b) \ge 1$, $i = 1, 2, \ldots, l$. It is seen that finding the optimal hyperplane is equivalent to maximizing the margin by minimizing $\|w\|^2$ subject to constraints (2). So the primal optimization problem is given as

$$\min_{w, b} \; \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(\langle w, x_i \rangle + b) \ge 1, \; i = 1, 2, \ldots, l. \quad (3)$$
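For concreteness, the primal problem (3) can be handed to a general-purpose constrained optimizer. The sketch below uses SciPy's SLSQP solver on a hand-made linearly separable toy set; it is only an illustration of (3), not the solver used in this paper.

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (illustration only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(v):
    w = v[:2]                      # v = (w1, w2, b)
    return 0.5 * np.dot(w, w)      # (1/2)||w||^2

# One inequality constraint y_i(<w, x_i> + b) - 1 >= 0 per training point.
cons = [{'type': 'ineq',
         'fun': lambda v, i=i: y[i] * (np.dot(v[:2], X[i]) + v[2]) - 1.0}
        for i in range(len(y))]

res = minimize(objective, x0=np.zeros(3), constraints=cons, method='SLSQP')
w, b = res.x[:2], res.x[2]
print('w =', w, 'b =', b, 'margin =', 2.0 / np.linalg.norm(w))
```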

To solve this optimization problem, we introduce Lagrange multipliers to transform the primal problem (3) into its dual, which becomes the following quadratic programming (QP) problem:

$$\min_{\alpha} \; \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) - \sum_{i=1}^{l} \alpha_i \quad \text{s.t.} \quad \sum_{i=1}^{l} \alpha_i y_i = 0, \; 0 \le \alpha_i, \; i = 1, 2, \ldots, l. \quad (4)$$
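The dual problem (4) can likewise be solved numerically. The following sketch (again an illustration, not the paper's implementation) solves (4) with SciPy and then recovers $w = \sum_i \alpha_i y_i x_i$ and $b$ from a support vector.

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
l = len(y)
G = (y[:, None] * X) @ (y[:, None] * X).T   # G_ij = y_i y_j (x_i . x_j)

def dual_objective(a):
    # (1/2) sum_ij a_i a_j y_i y_j (x_i . x_j) - sum_i a_i, as in Eq. (4)
    return 0.5 * a @ G @ a - a.sum()

cons = {'type': 'eq', 'fun': lambda a: np.dot(a, y)}    # sum_i a_i y_i = 0
bounds = [(0.0, None)] * l                              # 0 <= a_i

res = minimize(dual_objective, x0=np.zeros(l), bounds=bounds,
               constraints=[cons], method='SLSQP')
alpha = res.x
w = ((alpha * y)[:, None] * X).sum(axis=0)              # w = sum_i a_i y_i x_i
sv = np.argmax(alpha)                                   # index of a support vector
b = y[sv] - np.dot(w, X[sv])                            # from y_sv(<w, x_sv> + b) = 1
print('alpha =', alpha.round(3), 'w =', w, 'b =', b)
```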

In the classifier, the solution in feature space obtained with the nonlinear mapping function $\varphi(x)$ simply replaces the dot product $x \cdot x_j$ by the inner product $\varphi(x) \cdot \varphi(x_j)$. The mapping functions $\varphi(x)$ and $\varphi(x_i)$ satisfy $\langle \varphi(x), \varphi(x_i) \rangle = K(x, x_i)$, where $K(x, x_i)$ is called the kernel function. In real-world applications, we never need to know $\varphi$ explicitly. The SVM decision function is obtained by computing dot products with a given test point $x$, or more specifically by computing the following sign:
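As an illustration of this kernel trick, a common choice of $K$ is the Gaussian (RBF) kernel $K(x, z) = \exp(-\gamma \|x - z\|^2)$; the sketch below builds the kernel (Gram) matrix for a toy data set without ever forming $\varphi$ explicitly. The value of $\gamma$ is arbitrary.

```python
import numpy as np

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian kernel K(x, z) = exp(-gamma * ||x - z||^2),
    which equals <phi(x), phi(z)> for an implicit feature map phi."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

# Kernel (Gram) matrix over a toy data set: only K is ever evaluated,
# the mapping phi itself is never computed.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])
print(K.round(3))
```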

$$f(x) = \sum_{i=1}^{N_s} \alpha_i^* y_i (s_i \cdot x) + b = \sum_{i=1}^{N_s} \alpha_i^* y_i \,\varphi(s_i) \cdot \varphi(x) + b = \sum_{i=1}^{N_s} \alpha_i^* y_i K(s_i, x) + b, \quad (5)$$

where the coefficients $\alpha_i^*$ are positive, the $s_i$ are the support vectors, and $N_s$ is the number of support vectors.
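Once the multipliers $\alpha_i^*$, the support vectors $s_i$, their labels $y_i$, and $b$ are known, (5) is evaluated as a simple kernel sum. The sketch below uses placeholder support vectors and multipliers purely for illustration; they are not taken from any trained model.

```python
import numpy as np

def rbf_kernel(x, z, gamma=0.5):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def svm_decision(x, support_vectors, sv_labels, sv_alphas, b, kernel=rbf_kernel):
    """Evaluate Eq. (5): sign(sum_i alpha_i* y_i K(s_i, x) + b)."""
    s = sum(a * yi * kernel(si, x)
            for a, yi, si in zip(sv_alphas, sv_labels, support_vectors))
    return np.sign(s + b)

# Placeholder support set (illustration only).
S = np.array([[1.0, 1.0], [-1.0, -1.0]])
y_s = np.array([1.0, -1.0])
alpha_s = np.array([0.7, 0.7])
b = 0.0
print(svm_decision(np.array([0.8, 1.2]), S, y_s, alpha_s, b))
```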

In most cases, requiring a perfectly separating hyperplane is too restrictive to be of practical use, since the classes may overlap considerably and no separating hyperplane exists. To deal with linearly non-separable data, some points are allowed to be misclassified by introducing nonnegative slack variables $\xi_i \ge 0$, which measure the degree of misclassification, and a penalty parameter $C$, which controls the trade-off between maximizing the margin and minimizing the classification error on the training data. The sum of the slacks $\sum_i \xi_i$ is an upper bound on the number of training errors. The original constraints (2) are relaxed to

$$y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i, \quad i = 1, 2, \ldots, l. \quad (6)$$
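For a fixed hyperplane $(w, b)$, the smallest slack satisfying (6) is $\xi_i = \max(0, 1 - y_i(\langle w, x_i \rangle + b))$, i.e. the hinge loss. A minimal sketch with an arbitrary hyperplane chosen only for illustration:

```python
import numpy as np

def slacks(w, b, X, y):
    """Smallest slack values satisfying Eq. (6) for a fixed (w, b)."""
    margins = y * (X @ w + b)
    return np.maximum(0.0, 1.0 - margins)   # xi_i = max(0, 1 - y_i(<w, x_i> + b))

X = np.array([[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0])
w, b = np.array([1.0, 0.0]), 0.0            # arbitrary hyperplane
xi = slacks(w, b, X, y)
print('slacks:', xi, 'slack sum:', xi.sum())
```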

Thus, constructing the optimal hyperplane is equivalent to solving the following optimization problem:

$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{l} \xi_i \quad \text{s.t.} \quad y_i(\langle w, \varphi(x_i) \rangle + b) \ge 1 - \xi_i, \; \xi_i \ge 0, \; i = 1, 2, \ldots, l. \quad (7)$$
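Problem (7) is the form optimized by off-the-shelf soft-margin SVM solvers. As a hedged sketch (not the experimental setup of this paper), scikit-learn's SVC fits a model of this kind; the choices of C, kernel, and gamma below are arbitrary.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, slightly overlapping two-class data (illustration only).
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + [2, 2], rng.randn(20, 2) - [2, 2]])
y = np.hstack([np.ones(20), -np.ones(20)])

# C is the trade-off between a large margin and the total slack sum in (7).
clf = SVC(C=1.0, kernel='rbf', gamma=0.5).fit(X, y)
print('number of support vectors:', clf.support_vectors_.shape[0])
print('prediction for [0, 0]:', clf.predict([[0.0, 0.0]]))
```

Larger values of C penalize the slacks more heavily and yield a narrower margin; smaller values tolerate more misclassified training points in exchange for a wider margin.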
