Journal of Computers - Academy Publisher
JOURNAL OF COMPUTERS, VOL. 6, NO. 9, SEPTEMBER 2011
machines with Gaussian noise on triangular fuzzy number space to forecast fuzzy nonlinear systems [11].
Rough set theory [12] is a powerful preprocessing tool for extracting knowledge from large amounts of uncertain and incomplete data. Applied to support vector machines, it reduces the number of features to be processed and eliminates redundancy, and at the same time it improves the performance of the classical support vector machine. To deal with the overfitting problem of the traditional support vector machine, Zhang and Wang proposed a rough margin based support vector machine [13]. In this paper, we combine rough set theory with the fuzzy support vector machine and propose a rough margin based fuzzy support vector machine (RFSVM). The proposed method not only inherits the characteristics of the FSVM method, but also weights the effect of each training sample on the decision hyperplane according to its position in the rough margin. The presented method therefore further reduces overfitting caused by noises or outliers.
This paper is organized as follows. Section II gives a brief review of support vector machines. Section III describes the proposed RFSVM in detail, covering both binary and multi-class classification. In the following section, we evaluate our method on benchmark data sets and compare it with existing support vector machines. Conclusions are given in the final section.
II. SUPPORT VECTOR MACHINES ALGORITHM

In this section, we briefly describe support vector machines for binary classification problems.
Given a dataset of labeled training points $(x_1, y_1), (x_2, y_2), \dots, (x_l, y_l)$, where $(x_i, y_i) \in R^N \times \{+1, -1\}$, $i = 1, 2, \dots, l$. Suppose the training data are linearly separable; that is, there is some hyperplane which correctly separates the positive examples from the negative examples. A point $x$ lying on the hyperplane satisfies $\langle w, x \rangle + b = 0$, where $w$
is normal to the hyperplane. In this case, the support vector machine algorithm finds the optimal separating hyperplane with the maximal margin. When the training data are only approximately linearly separable, a trade-off parameter needs to be introduced. When the training data are not linearly separable at all, the support vector machine learning algorithm introduces the kernel strategy, which maps the input data into a higher-dimensional feature space $z$ by a nonlinear mapping function $\phi(x)$, so that the data in the feature space $z$ are linearly or approximately linearly separable. All training data satisfy the following decision function:
$$f(x_i) = \mathrm{sign}(\langle w, x_i \rangle + b) = \begin{cases} +1, & \text{if } y_i = +1 \\ -1, & \text{if } y_i = -1 \end{cases} \qquad (1)$$
All training points satisfy the following inequalities:
$$\begin{cases} \langle w, x_i \rangle + b \ge +1, & \text{if } y_i = +1 \\ \langle w, x_i \rangle + b \le -1, & \text{if } y_i = -1 \end{cases} \qquad (2)$$
In fact, these inequalities can be written compactly as $y_i(\langle w, x_i \rangle + b) \ge 1$, $i = 1, 2, \dots, l$. It follows that finding the optimal hyperplane is equivalent to maximizing the margin by minimizing $\|w\|^2$ subject to constraints (2). So the primal optimization problem is given as
$$\min_{w, b} \ \frac{1}{2} \|w\|^2 \qquad \text{s.t. } y_i(\langle w, x_i \rangle + b) \ge 1, \quad i = 1, 2, \dots, l. \qquad (3)$$
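To make the primal problem (3) concrete, the following is a minimal illustrative sketch (not part of the original formulation): it checks the hard-margin constraints for a hypothetical toy dataset and computes the geometric margin $2/\|w\|$ of a feasible hyperplane.

```python
import math

def margin_constraints_hold(w, b, points):
    """Check the hard-margin constraints y_i * (<w, x_i> + b) >= 1 from (3)."""
    return all(y * (sum(wk * xk for wk, xk in zip(w, x)) + b) >= 1
               for x, y in points)

def geometric_margin(w):
    """The margin of the separating hyperplane is 2 / ||w||."""
    return 2 / math.sqrt(sum(wk * wk for wk in w))

# Hypothetical toy data: two classes separated by the line x1 + x2 = 0.
points = [((1.0, 1.0), +1), ((2.0, 0.5), +1),
          ((-1.0, -1.0), -1), ((-0.5, -2.0), -1)]
w, b = (1.0, 1.0), 0.0   # a feasible (not necessarily optimal) hyperplane
print(margin_constraints_hold(w, b, points))  # True
print(geometric_margin(w))                    # 2 / sqrt(2)
```

Minimizing $\|w\|^2$ over all feasible $(w, b)$ would select the hyperplane with the largest such margin.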
To solve this optimization problem, we introduce Lagrange multipliers to transform the primal problem (3) into its dual, which becomes the following quadratic programming (QP) problem:
$$\min_{\alpha} \ \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) - \sum_{i=1}^{l} \alpha_i \qquad \text{s.t. } \sum_{i=1}^{l} \alpha_i y_i = 0, \quad \alpha_i \ge 0, \ i = 1, 2, \dots, l. \qquad (4)$$
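The dual (4) can be worked through by hand on a tiny hypothetical example (not from the paper): for two 1-D points, the equality constraint $\sum_i \alpha_i y_i = 0$ forces $\alpha_1 = \alpha_2$, so the dual reduces to a one-variable maximization that a coarse grid search can solve.

```python
# Toy dual problem for two 1-D points: x1 = 0 (y1 = -1), x2 = 2 (y2 = +1).
# The constraint sum_i alpha_i * y_i = 0 forces alpha_1 = alpha_2 = a, so the
# dual objective (the negation of (4)) reduces to W(a) = 2a - 2a^2.
xs, ys = [0.0, 2.0], [-1.0, +1.0]

def dual_objective(a):
    alphas = [a, a]                       # satisfies the equality constraint
    quad = sum(alphas[i] * alphas[j] * ys[i] * ys[j] * xs[i] * xs[j]
               for i in range(2) for j in range(2))
    return sum(alphas) - 0.5 * quad

# Coarse grid search over a >= 0 for the maximizer of W(a).
best_a = max((k / 1000 for k in range(2001)), key=dual_objective)
w = sum(a_i * y * x for a_i, y, x in zip([best_a, best_a], ys, xs))
b = ys[1] - w * xs[1]                     # from y_2 * (w * x2 + b) = 1
print(best_a, w, b)  # 0.5 1.0 -1.0
```

The recovered hyperplane $x = 1$ sits exactly halfway between the two points, as the maximal-margin solution should.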
To obtain a nonlinear classifier, the solution in the feature space induced by the nonlinear mapping function $\phi(x)$ simply replaces the dot product $x_i \cdot x_j$ with the inner product $\phi(x_i) \cdot \phi(x_j)$. The mapped vectors $\phi(x_i)$ and $\phi(x_j)$ satisfy $\langle \phi(x_i), \phi(x_j) \rangle = K(x_i, x_j)$, where $K(x_i, x_j)$ is called the kernel function. In real-world applications, we never need to know $\phi$ explicitly. The decision function of the SVM is obtained by computing dot products with a given test point $x$, or more specifically by computing the sign of
$$f(x) = \sum_{i=1}^{N_s} \alpha_i^* y_i (s_i \cdot x) + b = \sum_{i=1}^{N_s} \alpha_i^* y_i (\phi(s_i) \cdot \phi(x)) + b = \sum_{i=1}^{N_s} \alpha_i^* y_i K(s_i, x) + b, \qquad (5)$$
where the coefficients $\alpha_i^*$ are positive, the $s_i$ are the support vectors, and $N_s$ is the number of support vectors.
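A minimal sketch of the kernelized decision function (5), using a Gaussian RBF kernel as an illustrative choice; the support vectors and multipliers below are hypothetical placeholders, not the output of a trained model. The point is that only kernel evaluations $K(s_i, x)$ are ever computed, never $\phi$ itself.

```python
import math

def rbf_kernel(u, v, gamma=1.0):
    """Gaussian RBF kernel K(u, v) = exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((ui - vi) ** 2 for ui, vi in zip(u, v)))

def decision(x, support_vectors, alphas, labels, b, kernel=rbf_kernel):
    """Evaluate sign(sum_i alpha_i^* y_i K(s_i, x) + b) as in (5)."""
    value = sum(a * y * kernel(s, x)
                for a, y, s in zip(alphas, labels, support_vectors)) + b
    return 1 if value >= 0 else -1

# Hypothetical support vectors and multipliers (for illustration only).
svs    = [(0.0, 0.0), (2.0, 2.0)]
alphas = [0.7, 0.7]
labels = [-1, +1]
print(decision((1.9, 1.9), svs, alphas, labels, b=0.0))  # 1 (near the +1 vector)
```

Swapping `rbf_kernel` for a plain dot product recovers the linear decision function in the first form of (5).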
In most cases, learning a perfectly separating hyperplane is too restrictive to be of practical use: when there is a large overlap between the classes, no separating hyperplane exists. To deal with linearly non-separable data, some points are allowed to be misclassified; nonnegative slack variables $\xi_i \ge 0$ are introduced to measure the degree of misclassification, together with a penalty parameter $C$ that controls the trade-off between maximizing the margin and minimizing the classification error on the training data. The sum of the slacks $\sum_i \xi_i$ is an upper bound on the number of training errors. The original constraints (2) are then relaxed to
$$y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i, \quad i = 1, 2, \dots, l. \qquad (6)$$
Thus, constructing the optimal hyperplane is equivalent to solving the following optimization problem:
$$\min_{w, b, \xi} \ \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{l} \xi_i \qquad \text{s.t. } y_i(\langle w, \phi(x_i) \rangle + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \ i = 1, 2, \dots, l. \qquad (7)$$
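The role of the slack variables in (7) can be sketched on a hypothetical non-separable dataset (illustrative only): for a fixed hyperplane $(w, b)$, the smallest feasible slack for each point is $\xi_i = \max(0, 1 - y_i(\langle w, x_i \rangle + b))$, and the objective adds $C \sum_i \xi_i$ to the margin term.

```python
def slacks(w, b, points):
    """Smallest feasible slack per point under constraints (6):
    xi_i = max(0, 1 - y_i * (<w, x_i> + b))."""
    return [max(0.0, 1 - y * (sum(wk * xk for wk, xk in zip(w, x)) + b))
            for x, y in points]

def soft_margin_objective(w, b, points, C):
    """Objective of (7): (1/2) * ||w||^2 + C * sum_i xi_i."""
    return 0.5 * sum(wk * wk for wk in w) + C * sum(slacks(w, b, points))

# Hypothetical data: the point ((0.5, 0.5), -1) lies on the wrong side.
points = [((2.0, 2.0), +1), ((-2.0, -2.0), -1), ((0.5, 0.5), -1)]
w, b = (0.5, 0.5), 0.0
print(slacks(w, b, points))                        # [0.0, 0.0, 1.5]
print(soft_margin_objective(w, b, points, C=1.0))  # 0.25 + 1.5 = 1.75
```

A larger $C$ penalizes the violating point more heavily, pushing the optimizer toward classifying it correctly at the cost of a narrower margin.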