
7th International Conference on Advanced Data Mining and Applications

An Algorithm for Sample and Data Dimensionality Reduction Using Fast Simulated Annealing

Szymon Łukasik, Piotr Kulczycki

Department of Automatic Control and IT, Cracow University of Technology
Systems Research Institute, Polish Academy of Sciences


Motivation

• It is estimated ("How Much Information?" project, University of California, Berkeley) that 1 million terabytes of data are generated annually worldwide, with 99.997% of it available only in digital form.

• It is commonly agreed that our ability to analyze new data grows at a much slower pace than our capacity to collect and store it.

• When examining huge data samples one faces both technical difficulties and the methodological obstacles of high-dimensional data analysis, commonly referred to as the "curse of dimensionality".



Curse of dimensionality - example

[Figure omitted. Source: K. Beyer et al., "When Is 'Nearest Neighbor' Meaningful?", Proc. ICDT, 1999.]



Scope of our research

• We have developed a universal unsupervised data dimensionality reduction technique, in some respects similar to Principal Component Analysis (it is linear) and Multidimensional Scaling (it is distance-preserving). In addition, we try to reduce the data sample length at the same time.

• Establishing the exact form of the transformation matrix is treated as a continuous optimization problem and solved by Parallel Fast Simulated Annealing.

• The algorithm is intended to be used in conjunction with various data mining procedures, e.g. outlier detection, cluster analysis, and classification.



General description of the algorithm

• Data dimensionality reduction is realized via a linear transformation:

W = A U

where U denotes the initial data set (n × m), A the transformation matrix (N × n), and W the transformed data matrix (N × m).

• The transformation matrix is obtained using Parallel FSA. The cost function g(A) to be minimized is the raw Stress:

g(A) = \sum_{i=1}^{m} \sum_{j=i+1}^{m} \left( \| w_i(A) - w_j(A) \|_{\mathbb{R}^N} - \| u_i - u_j \|_{\mathbb{R}^n} \right)^2

with A being a solution of the optimization problem, and u_i, u_j, w_i(A), w_j(A) representing data instances in the initial and reduced feature spaces.
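A minimal sketch of how this cost function can be evaluated, assuming the data are stored column-wise as above (one sample per column of U); this is illustrative only, not the authors' implementation:

```python
import numpy as np

def raw_stress(A, U):
    """Raw Stress g(A) for the linear reduction W = A U.

    U : (n, m) array, m samples stored as columns in the original n-D space
    A : (N, n) transformation matrix, N < n
    """
    W = A @ U                                                    # (N, m) reduced data
    # pairwise Euclidean distances between columns in both spaces
    d_low = np.linalg.norm(W[:, :, None] - W[:, None, :], axis=0)
    d_high = np.linalg.norm(U[:, :, None] - U[:, None, :], axis=0)
    diff = d_low - d_high
    # sum squared distance deformations over pairs i < j only
    return np.sum(np.triu(diff, k=1) ** 2)
```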



FSA neighbor generation strategy

[Figure: two scatter plots of candidate neighbors generated around the current solution, shown in the (a_1, a_2) plane with both axes ranging from -20 to 20.]
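In fast simulated annealing (Szu & Hartley, ref. 1) the Gaussian visiting distribution of classical SA is replaced by a Cauchy one, whose heavy tails permit occasional long jumps. The sketch below assumes a simple per-coordinate Cauchy step scaled by the current temperature; the n-dimensional generation schemes of Nam et al. (ref. 3) differ in detail.

```python
import numpy as np

def fsa_neighbor(A, T, rng=None):
    """Perturb the transformation matrix A with Cauchy-distributed steps
    scaled by the current temperature T (fast-SA style neighbor)."""
    if rng is None:
        rng = np.random.default_rng()
    return A + T * rng.standard_cauchy(size=A.shape)
```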



FSA temperature and termination criterion

• The initial temperature T(0) is determined through a set of pilot runs consisting of k_P positive transitions from the starting solution. It is chosen so as to guarantee a predetermined initial acceptance probability P(0) for worse solutions under the Metropolis rule (a calibration sketch follows this list).

• The initial solution is obtained using the feature selection algorithm introduced by Pal & Mitra (2004). It is based on feature space clustering, with similar features forming distinct clusters; the maximal information compression index serves as the similarity measure. The partition itself is performed using the k-nearest neighbor rule (here k = n − N).

• The termination criterion is either executing an assumed number of iterations or fulfilling a customized condition based on an estimator of the global minimum employing order statistics, proposed recently for a class of stochastic random search algorithms by Bartkute and Sakalauskas (2009).
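One common way to calibrate T(0) so that the Metropolis rule exp(−Δg/T) initially accepts worse moves with probability roughly P(0) is to average the cost increases Δg observed during the k_P pilot transitions and solve for T. The slides do not state the exact calibration formula, so the sketch below is an assumption:

```python
import numpy as np

def initial_temperature(pilot_increases, p0):
    """Calibrate T(0) from pilot-run cost increases so that the Metropolis
    rule exp(-delta/T) accepts worse moves with probability about p0.

    pilot_increases : positive cost differences (Delta g) from k_P pilot transitions
    p0              : target initial acceptance probability P(0), 0 < p0 < 1
    """
    mean_increase = np.mean(pilot_increases)
    return -mean_increase / np.log(p0)
```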



FSA parallelization

1. Start from the current global solution.
2. Each of the n_cores workers generates its own neighbor (Neighbor 1, Neighbor 2, ..., Neighbor n_cores) and performs an independent FSA step, yielding Current 1, Current 2, ..., Current n_cores.
3. The new global current solution is either the best improving candidate or, if none improves, a randomly chosen non-improving one.
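A sketch of one such parallel iteration, reusing the raw_stress and fsa_neighbor helpers sketched earlier; process-based parallelism via multiprocessing is an assumption here, as the slides do not specify the mechanism:

```python
import numpy as np
from multiprocessing import Pool

def fsa_step(args):
    """One independent FSA move: perturb, evaluate, apply the Metropolis rule."""
    A, U, T, seed = args
    rng = np.random.default_rng(seed)
    candidate = fsa_neighbor(A, T, rng)
    g_old, g_new = raw_stress(A, U), raw_stress(candidate, U)
    if g_new < g_old or rng.random() < np.exp(-(g_new - g_old) / T):
        return candidate, g_new
    return A, g_old

def parallel_fsa_iteration(A_global, U, T, n_cores=4):
    """Run n_cores FSA moves from the same global solution, then keep the best
    improving result or, if none improves, a random non-improving one."""
    tasks = [(A_global, U, T, seed) for seed in range(n_cores)]
    with Pool(n_cores) as pool:
        results = pool.map(fsa_step, tasks)
    g_global = raw_stress(A_global, U)
    improving = [r for r in results if r[1] < g_global]
    if improving:
        return min(improving, key=lambda r: r[1])[0]
    return results[np.random.randint(n_cores)][0]
```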



Sample size reduction

• Each sample element u_i is assigned a positive weight p_i. It captures the relative deformation of the element's distances to the other sample points. Data elements with higher weights can then be treated as more adequate. The weights are normalized so that Σ_i p_i = 1.

• Consequently, the weights can be used to improve the performance of data mining procedures, e.g. by introducing them into the definition of classic data mining algorithms (such as k-means or k-nearest neighbors).

• Alternatively, the weights can be used to eliminate some data elements from the sample: elements whose weights satisfy p_i < P, where P ∈ [0, 1], are removed, and the remaining weights are renormalized (a pruning sketch follows this list). In this way one achieves simultaneous dimensionality and sample length reduction, with P serving as a data compression ratio.
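The slides specify the pruning rule (drop elements with p_i < P, then renormalize) but not the exact weight formula, so the deformation-based weight used below is only an illustrative assumption:

```python
import numpy as np

def prune_by_weights(U, W, P):
    """Drop sample elements whose normalized weights fall below P and
    renormalize the remaining weights.

    U : (n, m) original data, W : (N, m) reduced data, P : threshold in [0, 1].
    The weight formula (inverse of accumulated distance deformation) is an
    illustrative assumption; only the pruning rule follows the slides.
    """
    d_high = np.linalg.norm(U[:, :, None] - U[:, None, :], axis=0)
    d_low = np.linalg.norm(W[:, :, None] - W[:, None, :], axis=0)
    deformation = np.abs(d_low - d_high).sum(axis=0)    # per-element distortion
    weights = 1.0 / (1.0 + deformation)
    weights /= weights.sum()                            # normalize: sum(p_i) = 1
    keep = weights >= P
    kept_weights = weights[keep] / weights[keep].sum()  # renormalize survivors
    return U[:, keep], W[:, keep], kept_weights
```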



Experimental evaluation

• We examined the performance of the algorithm by measuring the accuracy of outlier detection I_o (for artificially generated datasets), clustering I_c, and classification I_k (for selected benchmark instances from the UCI ML repository).

• Outlier detection was performed using nonparametric statistical kernel density estimation. Using randomly generated datasets allowed us to designate the actual outliers.

• Clustering accuracy was measured by the Rand index (with respect to class labels); clustering itself was implemented with the classic k-means algorithm.

• Classification accuracy (for the nearest-neighbor classifier) was measured as the average classification correctness obtained during a 5-fold cross-validation procedure (an illustrative sketch of both indices follows this list).

• Each test consisted of 30 runs; we report the mean and standard deviation of the above indices. We compared our approach to PCA and to Evolutionary Algorithm-based Feature Selection (Saxena et al.).
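A minimal scikit-learn sketch of how the clustering and classification indices can be computed; the original experiments used the authors' own implementations, so this is illustrative only:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import rand_score
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def clustering_index(X, labels, n_clusters):
    """I_c: Rand index between k-means clusters and the reference class labels."""
    pred = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    return rand_score(labels, pred)

def classification_index(X, labels):
    """I_k: mean 1-NN accuracy over 5-fold cross-validation."""
    clf = KNeighborsClassifier(n_neighbors=1)
    return cross_val_score(clf, X, labels, cv=5).mean()
```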



Example: seeds dataset (7D→2D)

[Figure: 2D projections of the seeds dataset obtained with our approach and with PCA.]



More details – classification

Classification accuracy I_k (mean ± σ), nearest-neighbor classifier, 5-fold cross-validation:

Dataset (reduction)    I_kINIT         Our approach (P=0.1)   PCA             EA-based Feature Selection
glass   (9D→4D)        71.90 ± 8.10    70.48 ± 7.02           58.33 ± 6.37    64.80 ± 4.43
wine    (13D→5D)       74.57 ± 5.29    78.00 ± 4.86           72.00 ± 7.22    72.82 ± 1.02
WBC     (9D→4D)        95.88 ± 1.35    95.95 ± 1.43           95.29 ± 2.06    95.10 ± 0.80
vehicle (18D→10D)      63.37 ± 3.34    63.96 ± 2.66           62.24 ± 3.84    60.86 ± 1.51
seeds   (7D→2D)        90.23 ± 2.85    89.76 ± 3.18           83.09 ± 7.31    not tested

(I_kINIT: accuracy in the initial feature space; the remaining columns report I_kRED ± σ(I_kRED) after reduction.)



More details – cluster analysis

Clustering accuracy I_c (Rand index):

Dataset (reduction)    I_cINIT   Our approach (P=0.2)   PCA
glass   (9D→4D)        68.23     68.43 ± 0.62           67.71
wine    (13D→5D)       93.48     92.81 ± 0.76           92.64
WBC     (9D→4D)        66.23     66.29 ± 0.62           66.16
vehicle (18D→10D)      64.18     64.62 ± 0.24           64.16
seeds   (7D→2D)        91.06     89.59 ± 1.57           88.95

(I_cINIT: accuracy in the initial feature space; the remaining columns report I_cRED after reduction, with σ(I_cRED) given for our approach.)



Conclusion

• The algorithm was tested on numerous instances of outlier detection, cluster analysis, and classification problems and was found to offer promising performance. It achieves accurate distance preservation while at the same time allowing out-of-sample extension.

• Drawbacks
It is not designed for huge datasets (due to the significant computational cost of evaluating the cost function) and should not be used when only a single data analysis task needs to be performed.

• What can be done in the future
We observed that taking the topological deformation of the dataset in the reduced feature space into account (via the proposed weighting scheme) brings positive results in various data mining procedures. This idea can easily be extended to other DR techniques!
The proposed approach could make algorithms that are particularly prone to the 'curse of dimensionality' practically usable (we have examined this in the case of KDE).



Thank you for your attention!


Short bibliography

1. H. Szu, R. Hartley, "Fast simulated annealing", Physics Letters A, vol. 122/3-4, 1987.
2. L. Ingber, "Adaptive simulated annealing (ASA): Lessons learned", Control and Cybernetics, vol. 25/1, 1996.
3. D. Nam, J.-S. Lee, C. H. Park, "N-dimensional Cauchy neighbor generation for the fast simulated annealing", IEICE Transactions on Information and Systems, vol. E87-D/11, 2004.
4. S.K. Pal, P. Mitra, "Pattern Recognition Algorithms for Data Mining", Chapman and Hall, 2004.
5. V. Bartkute, L. Sakalauskas, "Statistical Inferences for Termination of Markov Type Random Search Algorithms", Journal of Optimization Theory and Applications, vol. 141/3, 2009.
6. P. Kulczycki, "Kernel Estimators in Industrial Applications", in: Soft Computing Applications in Industry, B. Prasad (ed.), Springer-Verlag, 2008.
7. A. Saxena, N.R. Pal, M. Vora, "Evolutionary methods for unsupervised feature selection using Sammon's stress function", Fuzzy Information and Engineering, vol. 2, 2010.

