
7th International Conference on Advanced Data Mining and Applications

An Algorithm for Sample and Data Dimensionality Reduction Using Fast Simulated Annealing

Szymon Łukasik, Piotr Kulczycki

Department of Automatic Control and IT, Cracow University of Technology
Systems Research Institute, Polish Academy of Sciences


Motivation

• It is estimated ("How Much Information?" project, University of California, Berkeley) that 1 million terabytes of data are generated annually worldwide, with 99.997% of it available only in digital form.

• It is commonly agreed that our ability to analyze new data grows at a much slower pace than our capacity to collect and store it.

• When examining huge data samples one faces both technical difficulties and the methodological obstacles of high-dimensional data analysis, commonly referred to as the "curse of dimensionality".



Curse of dimensionality - example

[Figure omitted. Source: K. Beyer et al., "When Is 'Nearest Neighbor' Meaningful?", Proc. ICDT, 1999.]



Scope of our research

• We have developed a universal unsupervised data dimensionality reduction technique, in some respects similar to Principal Component Analysis (it is linear) and Multidimensional Scaling (it is distance-preserving). In addition, we try to reduce the data sample length at the same time.

• Establishing the exact form of the transformation matrix is treated as a continuous optimization problem and solved by Parallel Fast Simulated Annealing.

• The algorithm is intended to be used in conjunction with various data mining procedures, e.g. outlier detection, cluster analysis, and classification.



General description of the algorithm

• Data dimensionality reduction is realized via a linear transformation:

W = A U

where U denotes the initial data set (n × m), A the transformation matrix (N × n), and W the transformed data matrix (N × m).

• The transformation matrix is obtained using Parallel FSA. The cost function g(A) to be minimized is the raw Stress:

g(A) = \sum_{i=1}^{m} \sum_{j=i+1}^{m} \left( \| w_i(A) - w_j(A) \|_{\mathbb{R}^N} - \| u_i - u_j \|_{\mathbb{R}^n} \right)^2

with A being a solution of the optimization problem, and u_i, u_j, w_i(A), w_j(A) representing data instances in the initial and reduced feature spaces.
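A minimal sketch of how this cost function can be evaluated, assuming the data are stored column-wise as above (one sample per column of U); this is illustrative only, not the authors' implementation:

```python
import numpy as np

def raw_stress(A, U):
    """Raw Stress g(A) for the linear reduction W = A U.

    U : (n, m) array, m samples stored as columns in the original n-D space
    A : (N, n) transformation matrix, N < n
    """
    W = A @ U                                                    # (N, m) reduced data
    # pairwise Euclidean distances between columns in both spaces
    d_low = np.linalg.norm(W[:, :, None] - W[:, None, :], axis=0)
    d_high = np.linalg.norm(U[:, :, None] - U[:, None, :], axis=0)
    diff = d_low - d_high
    # sum squared distance deformations over pairs i < j only
    return np.sum(np.triu(diff, k=1) ** 2)
```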



FSA neighbor generation strategy

[Figure: two scatter plots of candidate neighbors generated around the current solution, shown in the (a_1, a_2) plane with both axes ranging from -20 to 20.]
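In fast simulated annealing (Szu & Hartley, ref. 1) the Gaussian visiting distribution of classical SA is replaced by a Cauchy one, whose heavy tails permit occasional long jumps. The sketch below assumes a simple per-coordinate Cauchy step scaled by the current temperature; the n-dimensional generation schemes of Nam et al. (ref. 3) differ in detail.

```python
import numpy as np

def fsa_neighbor(A, T, rng=None):
    """Perturb the transformation matrix A with Cauchy-distributed steps
    scaled by the current temperature T (fast-SA style neighbor)."""
    if rng is None:
        rng = np.random.default_rng()
    return A + T * rng.standard_cauchy(size=A.shape)
```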



FSA temperature and termination criterion

• The initial temperature T(0) is determined through a set of pilot runs consisting of k_P positive transitions from the starting solution. It is chosen so as to guarantee a predetermined initial acceptance probability P(0) for worse solutions under the Metropolis rule (a calibration sketch follows this list).

• The initial solution is obtained using the feature selection algorithm introduced by Pal & Mitra (2004). It is based on feature space clustering, with similar features forming distinct clusters; the maximal information compression index serves as the similarity measure. The partition itself is performed using the k-nearest neighbor rule (here k = n − N).

• The termination criterion is either executing an assumed number of iterations or fulfilling a customized condition based on an estimator of the global minimum employing order statistics, proposed recently for a class of stochastic random search algorithms by Bartkute and Sakalauskas (2009).
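One common way to calibrate T(0) so that the Metropolis rule exp(−Δg/T) initially accepts worse moves with probability roughly P(0) is to average the cost increases Δg observed during the k_P pilot transitions and solve for T. The slides do not state the exact calibration formula, so the sketch below is an assumption:

```python
import numpy as np

def initial_temperature(pilot_increases, p0):
    """Calibrate T(0) from pilot-run cost increases so that the Metropolis
    rule exp(-delta/T) accepts worse moves with probability about p0.

    pilot_increases : positive cost differences (Delta g) from k_P pilot transitions
    p0              : target initial acceptance probability P(0), 0 < p0 < 1
    """
    mean_increase = np.mean(pilot_increases)
    return -mean_increase / np.log(p0)
```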



FSA parallelization

1. Start from the current global solution.
2. Each of the n_cores workers generates its own neighbor (Neighbor 1, Neighbor 2, ..., Neighbor n_cores) and performs an independent FSA step, yielding Current 1, Current 2, ..., Current n_cores.
3. The new global current solution is either the best improving candidate or, if none improves, a randomly chosen non-improving one.
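A sketch of one such parallel iteration, reusing the raw_stress and fsa_neighbor helpers sketched earlier; process-based parallelism via multiprocessing is an assumption here, as the slides do not specify the mechanism:

```python
import numpy as np
from multiprocessing import Pool

def fsa_step(args):
    """One independent FSA move: perturb, evaluate, apply the Metropolis rule."""
    A, U, T, seed = args
    rng = np.random.default_rng(seed)
    candidate = fsa_neighbor(A, T, rng)
    g_old, g_new = raw_stress(A, U), raw_stress(candidate, U)
    if g_new < g_old or rng.random() < np.exp(-(g_new - g_old) / T):
        return candidate, g_new
    return A, g_old

def parallel_fsa_iteration(A_global, U, T, n_cores=4):
    """Run n_cores FSA moves from the same global solution, then keep the best
    improving result or, if none improves, a random non-improving one."""
    tasks = [(A_global, U, T, seed) for seed in range(n_cores)]
    with Pool(n_cores) as pool:
        results = pool.map(fsa_step, tasks)
    g_global = raw_stress(A_global, U)
    improving = [r for r in results if r[1] < g_global]
    if improving:
        return min(improving, key=lambda r: r[1])[0]
    return results[np.random.randint(n_cores)][0]
```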



Sample size reduction

• Each sample element u_i is assigned a positive weight p_i. It captures the relative deformation of the element's distances to the other sample points. Data elements with higher weights can then be treated as more adequate. The weights are normalized so that Σ_i p_i = 1.

• Consequently, the weights can be used to improve the performance of data mining procedures, e.g. by introducing them into the definition of classic data mining algorithms (such as k-means or k-nearest neighbors).

• Alternatively, the weights can be used to eliminate some data elements from the sample: elements whose weights satisfy p_i < P, where P ∈ [0, 1], are removed, and the remaining weights are renormalized (a pruning sketch follows this list). In this way one achieves simultaneous dimensionality and sample length reduction, with P serving as a data compression ratio.
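The slides specify the pruning rule (drop elements with p_i < P, then renormalize) but not the exact weight formula, so the deformation-based weight used below is only an illustrative assumption:

```python
import numpy as np

def prune_by_weights(U, W, P):
    """Drop sample elements whose normalized weights fall below P and
    renormalize the remaining weights.

    U : (n, m) original data, W : (N, m) reduced data, P : threshold in [0, 1].
    The weight formula (inverse of accumulated distance deformation) is an
    illustrative assumption; only the pruning rule follows the slides.
    """
    d_high = np.linalg.norm(U[:, :, None] - U[:, None, :], axis=0)
    d_low = np.linalg.norm(W[:, :, None] - W[:, None, :], axis=0)
    deformation = np.abs(d_low - d_high).sum(axis=0)    # per-element distortion
    weights = 1.0 / (1.0 + deformation)
    weights /= weights.sum()                            # normalize: sum(p_i) = 1
    keep = weights >= P
    kept_weights = weights[keep] / weights[keep].sum()  # renormalize survivors
    return U[:, keep], W[:, keep], kept_weights
```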



Experimental evaluation

• We examined the performance of the algorithm by measuring the accuracy of outlier detection I_o (for artificially generated datasets), clustering I_c, and classification I_k (for selected benchmark instances from the UCI ML repository).

• Outlier detection was performed using nonparametric statistical kernel density estimation. Using randomly generated datasets allowed us to designate the actual outliers.

• Clustering accuracy was measured by the Rand index (with respect to class labels); clustering itself was implemented with the classic k-means algorithm.

• Classification accuracy (for the nearest-neighbor classifier) was measured as the average classification correctness obtained during a 5-fold cross-validation procedure (an illustrative sketch of both indices follows this list).

• Each test consisted of 30 runs; we report the mean and standard deviation of the above indices. We compared our approach to PCA and to Evolutionary Algorithm-based Feature Selection (Saxena et al.).
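A minimal scikit-learn sketch of how the clustering and classification indices can be computed; the original experiments used the authors' own implementations, so this is illustrative only:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import rand_score
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def clustering_index(X, labels, n_clusters):
    """I_c: Rand index between k-means clusters and the reference class labels."""
    pred = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    return rand_score(labels, pred)

def classification_index(X, labels):
    """I_k: mean 1-NN accuracy over 5-fold cross-validation."""
    clf = KNeighborsClassifier(n_neighbors=1)
    return cross_val_score(clf, X, labels, cv=5).mean()
```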



Example: seeds dataset (7D→2D)

[Figure: 2D projections of the seeds dataset obtained with our approach and with PCA.]



More details – classification

Classification accuracy I_k (mean ± σ), nearest-neighbor classifier, 5-fold cross-validation:

Dataset (reduction)    I_kINIT         Our approach (P=0.1)   PCA             EA-based Feature Selection
glass   (9D→4D)        71.90 ± 8.10    70.48 ± 7.02           58.33 ± 6.37    64.80 ± 4.43
wine    (13D→5D)       74.57 ± 5.29    78.00 ± 4.86           72.00 ± 7.22    72.82 ± 1.02
WBC     (9D→4D)        95.88 ± 1.35    95.95 ± 1.43           95.29 ± 2.06    95.10 ± 0.80
vehicle (18D→10D)      63.37 ± 3.34    63.96 ± 2.66           62.24 ± 3.84    60.86 ± 1.51
seeds   (7D→2D)        90.23 ± 2.85    89.76 ± 3.18           83.09 ± 7.31    not tested

(I_kINIT: accuracy in the initial feature space; the remaining columns report I_kRED ± σ(I_kRED) after reduction.)



More details – cluster analysis

Clustering accuracy I_c (Rand index):

Dataset (reduction)    I_cINIT   Our approach (P=0.2)   PCA
glass   (9D→4D)        68.23     68.43 ± 0.62           67.71
wine    (13D→5D)       93.48     92.81 ± 0.76           92.64
WBC     (9D→4D)        66.23     66.29 ± 0.62           66.16
vehicle (18D→10D)      64.18     64.62 ± 0.24           64.16
seeds   (7D→2D)        91.06     89.59 ± 1.57           88.95

(I_cINIT: accuracy in the initial feature space; the remaining columns report I_cRED after reduction, with σ(I_cRED) given for our approach.)



Conclusion

• The algorithm was tested on numerous instances of outlier detection, cluster analysis, and classification problems and was found to offer promising performance. It achieves accurate distance preservation while at the same time allowing out-of-sample extension.

• Drawbacks
It is not designed for huge datasets (due to the significant computational cost of evaluating the cost function) and should not be used when only a single data analysis task needs to be performed.

• What can be done in the future
We observed that taking the topological deformation of the dataset in the reduced feature space into account (via the proposed weighting scheme) brings positive results in various data mining procedures. This idea can easily be extended to other DR techniques!
The proposed approach could make algorithms that are particularly prone to the 'curse of dimensionality' practically usable (we have examined this in the case of KDE).



Thank you for your attention!


Short bibliography

1. H. Szu, R. Hartley, "Fast simulated annealing", Physics Letters A, vol. 122/3-4, 1987.
2. L. Ingber, "Adaptive simulated annealing (ASA): Lessons learned", Control and Cybernetics, vol. 25/1, 1996.
3. D. Nam, J.-S. Lee, C. H. Park, "N-dimensional Cauchy neighbor generation for the fast simulated annealing", IEICE Transactions on Information and Systems, vol. E87-D/11, 2004.
4. S.K. Pal, P. Mitra, "Pattern Recognition Algorithms for Data Mining", Chapman and Hall, 2004.
5. V. Bartkute, L. Sakalauskas, "Statistical Inferences for Termination of Markov Type Random Search Algorithms", Journal of Optimization Theory and Applications, vol. 141/3, 2009.
6. P. Kulczycki, "Kernel Estimators in Industrial Applications", in: Soft Computing Applications in Industry, B. Prasad (ed.), Springer-Verlag, 2008.
7. A. Saxena, N.R. Pal, M. Vora, "Evolutionary methods for unsupervised feature selection using Sammon's stress function", Fuzzy Information and Engineering, vol. 2, 2010.

