
The Pennsylvania State University
The Graduate School
Capital College

A Queen Lead Ant-Based Algorithm for Database Clustering

A Master's Paper in Computer Science

by Jayanthi Pathapati

© 2004 Jayanthi Pathapati

Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science

December 2004


Abstract

In this paper we introduce a novel ant-based clustering algorithm to solve the unsupervised data clustering problem. The main inspiration behind ant-based algorithms is the chemical recognition system of ants. In our algorithm, we introduce a new method of ant clustering in which the queen of the ant colony controls the colony formation. We compared our results to the k-means method and to the AntClust algorithm on artificial and real data sets.


Table of Contents

Abstract
Acknowledgements
List of Figures
List of Tables
1 Introduction
2 Database Clustering
3 Algorithm
    3.1 Initialization
    3.2 Training
    3.3 Clustering
    3.4 Final Clustering
4 Performance Analysis
5 Conclusion
References


Acknowledgements

I am very grateful to Dr. T. Bui for his constant support and guidance throughout the period of this project. It has been a great pleasure and inspiration working with him. I would also like to thank the committee members: Dr. S. Chang, Dr. Q. Ding, Dr. P. Naumov, and Dr. L. Null for reviewing this paper. Finally, I would like to thank my husband Dr. S.K. Daya for his support throughout this course.


List of Figures

1 QLA algorithm for database clustering
2 Initialization Algorithm
3 Algorithm for Training Stage
4 Cluster Formation Algorithm
5 Rules followed when the ants meet
6 Local Optimization Algorithm
7 Algorithm for Assigning Free Ants


List of Tables

1 QLA Results
2 Results of all the algorithms


1 Introduction

Database clustering is one of the fundamental operations of data mining. The main idea behind it is to identify homogeneous groups of data based on their attributes. Obtaining a partition as close as possible to the natural partition is a complex task because there is no a priori information regarding the structure and the original classification of the data.

Artificial ants have been applied to problems of database clustering. The initial pioneering work in this field was performed by Deneubourg [1] and by Lumer and Faieta [8]. Since this early work, many authors have used artificial ants and proposed several data clustering techniques [9, 10, 11]. Ant-based clustering algorithms seek to model the collective behavior of real ants by utilizing the colonial odor to form clusters.

This paper proposes a new ant-based clustering algorithm that incorporates several ant clustering features as well as local optimization techniques. Unlike other ant clustering algorithms, where the ants form nests on their own, we introduce a queen for each colony to take care of the colony creation process. The algorithm does not require the number of expected clusters as part of the input to find the final partition.

The rest of the paper is organized as follows. Section 2 outlines the database clustering problem and its applications. Section 3 describes the algorithm in detail. Section 4 contains the performance analysis and the experimental results comparing our algorithm against other known algorithms. The conclusion is given in Section 5.

2 Database Clustering

The aim of database clustering is to partition a heterogeneous data set into groups with more homogeneous characteristics. The formation of clusters is based on the principle of maximizing the similarity between patterns in the same cluster while simultaneously minimizing the similarity between patterns belonging to distinct clusters. The database clustering problem can be


formally defined as follows: given a database of n objects, find the best partition of the data objects into groups such that each group contains objects that are similar to each other.

Several algorithms have been proposed for the clustering problem in the published literature, among them BIRCH [14], ISODATA [2, 7], CLARANS [12], P-CLUSTER [6], and DBSCAN [3].

Most clustering algorithms belong to two main categories: hierarchical and partitional. Hierarchical clustering algorithms fall into two groups: agglomerative and divisive. In the agglomerative approach, each object is first assigned to an individual cluster, and similar clusters are then merged to form larger clusters; repeating this process leads to the desired partition. Divisive algorithms, on the other hand, begin with all the data objects placed in one cluster and then split them into smaller clusters until the final partition is obtained.

In partitional clustering algorithms, given a set of data objects and a partition criterion, the objects are partitioned into clusters based on their similarities. Clustering is performed so as to meet the specified criterion, such as minimizing the sum of squared distances from the means within each cluster.

Fuzzy clustering and k-means clustering are two well-known clustering techniques. In fuzzy clustering, an object can belong to many clusters; each object is associated with a cluster membership that represents the likelihood of the object being in that cluster.

The k-means algorithm starts by choosing k cluster centers at random. It then assigns each object to the closest cluster and recomputes each cluster's center. This step is repeated until the cluster centers no longer change, thus obtaining the desired partition.

Associations between data attributes in a given database can be found by utilizing data clustering techniques, which play a major role in data analysis procedures. Database clustering has been applied in various fields including biology, psychology, commerce, and information technology. The main areas where it has been prominently used are image segmentation [13], statistical data analysis [4], and data compression [14], among others.
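Since k-means is the main baseline used later in the paper, a minimal sketch may help fix the idea. This is our own illustration, not code from the paper; the toy data and the squared Euclidean distance are assumptions.

```python
import random

def kmeans(points, k, max_iters=100):
    """Minimal k-means: random initial centers, then an assign/recompute loop."""
    centers = random.sample(points, k)
    for _ in range(max_iters):
        # Assign each object to the closest cluster center (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[nearest].append(p)
        # Recompute each cluster's center as the mean of its members.
        new_centers = [tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centers[i]
                       for i, cl in enumerate(clusters)]
        if new_centers == centers:   # centers unchanged: the desired partition is reached
            return clusters, centers
        centers = new_centers
    return clusters, centers

# Example: two well-separated groups in the plane.
data = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9)]
clusters, centers = kmeans(data, k=2)
```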


3 Algorithm

As mentioned earlier, the main aim of the Queen Lead Ant-based (QLA) algorithm is to partition the given data set into clusters of similar objects. The idea of the algorithm is to simulate the way real ants form colonies, which are queen dominant. The ants that possess a cuticle odor similar to that of the queen of a colony will join that colony. Each object of the database is treated as an ant, and each ant follows a common set of rules (Figure 5) that leads to the required partition. A local optimization technique is applied to assist the ants in achieving a good partition.

The input for the algorithm is just the database to be clustered. It does not need any additional parameters such as the number of desired clusters or any information about the structure of the database. Each object in the database has a certain number of attributes, and attributes can be of either numeric or symbolic type. The QLA algorithm, given in Figure 1, consists of four main stages. The first stage is the initialization stage, in which all the ants are initialized with the attributes of the objects they represent. The second stage is the training stage, in which the ants learn about their possible colony mates by meeting and learning about other ants.

The third stage consists of two phases: cluster formation and local optimization. In the cluster formation phase, we stimulate meetings between ants so that the ants can decide which colony they belong to, thus giving us the clusters of similar ants. In the second phase, the local optimization step is designed to eliminate any ants that have been prematurely added to a colony even though they should not belong to it. It also checks whether any other ant in the colony represents the colony better than the queen.

The last stage is the final clustering stage, in which the free ants, i.e., the ants that do not belong to any colony, are added to the closest colony possible. We describe all the stages of the algorithm in greater detail in the following sections.


QLA Algorithm
Input: Database to be clustered
Stage 1: Initialization
    Assign one ant for each object in the database
Stage 2: Training
    Train the ants so that they learn about other ants
Stage 3: Clustering Process
    Colony formation process and local optimization
Stage 4: Final Clustering
    Assign all ants which do not belong to any colony to the closest colony available
return the clusters formed

Figure 1: QLA algorithm for database clustering


Initialization
for each ant do
    genome = the index of the object it represents
    colony = -1
    Assign all of its attribute values with the values of the object it represents
end-for

Figure 2: Initialization Algorithm

3.1 Initialization

The algorithm starts by assigning one artificial ant to each object in the given input database, and then initializes each ant with the attribute values of the object it represents. Each ant contains the following parameters.

The genome parameter of the ant corresponds to an object of the data set. It is initialized to the index of the object that the ant represents. Each ant represents the same object during the whole process of clustering; thus the genome of an ant remains unchanged to the end of the algorithm.

The colony parameter indicates the group to which the ant belongs and is simply coded by a number. This attribute is assigned the value −1 because the ant initially does not belong to any colony, i.e., all the ants at the beginning of the algorithm are free ants.

The initialization stage of the algorithm is accomplished as shown in Figure 2.
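The paper does not prescribe an implementation, but the initialization stage maps naturally onto a small record type. A sketch in Python; the field names (including the running-statistics fields used by the later stages) are our own:

```python
from dataclasses import dataclass

@dataclass
class Ant:
    genome: int             # index of the object this ant represents; never changes
    attributes: list        # the attribute values of that object (numeric or symbolic)
    colony: int = -1        # -1 means the ant is free, i.e., belongs to no colony
    age: int = 0            # number of meetings this ant has had
    mean_sim: float = 0.0   # running mean similarity MeanSim, learned during training
    max_sim: float = 0.0    # running maximum similarity MaxSim
    n_met: int = 0          # |N_i|: how many ants this ant has met so far

def initialize(database):
    """Stage 1 (Figure 2): one ant per database object, all ants initially free."""
    return [Ant(genome=i, attributes=list(obj)) for i, obj in enumerate(database)]
```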


3.2 Training

In this stage, meetings between randomly selected ants are simulated, and each ant learns a threshold value that lies between 0 and 1. This threshold value is used to determine how well the ants accept each other. We have defined the threshold value of an ant to fall between the mean and the maximum similarity values of the ant. Every time an ant meets another ant in this period, we find the similarity between the two ants and update the respective mean and maximum similarities of both ants. So every time an ant has a meeting, its threshold value changes. The following equation shows how this threshold is learned:

$$T_{a_i} \leftarrow \frac{\mathrm{MeanSim}(a_i) + \mathrm{MaxSim}(a_i)}{2}$$

The MeanSim(a_i) in the above equation is the mean similarity of the ant a_i and is calculated as follows:

$$\mathrm{MeanSim}(a_i) = \frac{1}{|N_i|} \sum_{a_j \in N_i} \mathrm{Sim}(a_i, a_j)$$

where N_i is the set of ants that a_i has met so far. The MaxSim(a_i) is the maximum similarity of the ant a_i and is calculated as follows:

$$\mathrm{MaxSim}(a_i) = \max_{a_j \in N_i} \{\mathrm{Sim}(a_i, a_j)\}$$

where a_j is any ant with which ant a_i has had a meeting. The training stage is accomplished as shown in Figure 3.

Training Stage
for i = 0 to δ do
    Select 2 ants randomly
    Find the similarity between them and update their threshold and similarity values
    Increment their ages by 1
end-for

Figure 3: Algorithm for Training Stage
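The training updates translate directly into code. A minimal sketch using the Ant fields introduced above; sim(a, b) stands for the paper's similarity function on [0, 1], which we take as given here:

```python
import random

def update_similarity_stats(ant, s):
    """Fold a new similarity value s into the ant's running mean and maximum."""
    ant.n_met += 1
    ant.mean_sim += (s - ant.mean_sim) / ant.n_met   # incremental mean over N_i
    ant.max_sim = max(ant.max_sim, s)

def threshold(ant):
    """T_a = (MeanSim(a) + MaxSim(a)) / 2, relearned after every meeting."""
    return (ant.mean_sim + ant.max_sim) / 2.0

def training_stage(ants, sim, delta):
    """Figure 3: delta random meetings; each updates both ants' statistics and ages."""
    for _ in range(delta):
        a_i, a_j = random.sample(ants, 2)
        s = sim(a_i, a_j)
        update_similarity_stats(a_i, s)
        update_similarity_stats(a_j, s)
        a_i.age += 1
        a_j.age += 1
```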


3.3 Clustering

There are two phases in this stage. The first phase is the cluster formation phase, in which clusters are created. During this phase, meetings between random ants are stimulated, and the behavior of the ants is described in Figure 5. After every α (= 6 × n_ants) iterations in this step, we run a local optimization algorithm on the clusters found so far. This stage is performed until at least β of the ants join their colonies, or for γ iterations, where β is 50% and γ is 40 × n_ants. This is done as shown in Figure 4. Tests were performed with different values of α, β, and γ; for many data sets, the best partitions were achieved with the values given above.

When two ants meet in this stage, the threshold value is used to decide whether the two ants accept each other. The acceptance A(a_i, a_j) for ants a_i and a_j is defined as follows:

$$A(a_i, a_j) \equiv (\mathrm{Sim}(a_i, a_j) > 0.96 \cdot T_{a_i}) \wedge (\mathrm{Sim}(a_i, a_j) > 0.96 \cdot T_{a_j})$$

The acceptance A(a_i, queen_j) for ant a_i and queen queen_j is defined as follows:

$$A(a_i, \mathit{queen}_j) \equiv (\mathrm{Sim}(a_i, \mathit{queen}_j) > 0.98 \cdot T_{a_i}) \wedge (\mathrm{Sim}(a_i, \mathit{queen}_j) > 0.98 \cdot T_{\mathit{queen}_j})$$

The acceptance A(queen_i, queen_j) for queens queen_i and queen_j is defined as follows:

$$A(\mathit{queen}_i, \mathit{queen}_j) \equiv (\mathrm{Sim}(\mathit{queen}_i, \mathit{queen}_j) > 0.94 \cdot T_{\mathit{queen}_i}) \wedge (\mathrm{Sim}(\mathit{queen}_i, \mathit{queen}_j) > 0.94 \cdot T_{\mathit{queen}_j})$$
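The three acceptance tests differ only in their slack factors. A sketch, reusing the threshold() helper from the training sketch; queens are ordinary Ant values here:

```python
def accept_ants(a_i, a_j, sim):
    """Ant-ant: similarity may fall slightly below both thresholds (factor 0.96)."""
    s = sim(a_i, a_j)
    return s > 0.96 * threshold(a_i) and s > 0.96 * threshold(a_j)

def accept_ant_queen(a_i, queen, sim):
    """Ant-queen is the strictest test (factor 0.98): joining a colony is a crucial step."""
    s = sim(a_i, queen)
    return s > 0.98 * threshold(a_i) and s > 0.98 * threshold(queen)

def accept_queens(q_i, q_j, sim):
    """Queen-queen is the most permissive test (factor 0.94), enabling colony merges."""
    s = sim(q_i, q_j)
    return s > 0.94 * threshold(q_i) and s > 0.94 * threshold(q_j)
```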


The acceptance A(a_i, a_j) is defined differently between ants and queens. When two ants or two queens meet, they mutually accept each other if the similarity between them is a little less than or greater than their threshold values. However, when an ant meets the queen of a colony, both similarity values must be very close to or greater than their threshold values.

The threshold and similarity values are calculated using the same formulas as in the training stage, but the MeanSim(a_i) and MaxSim(a_i) of the ant a_i are calculated using different formulas: as the ants get older, they update their threshold values less frequently, and the chance that they move from one colony to another becomes quite small. The following equations are used to update a_i's mean and maximum similarities when it meets ant a_j:

$$\mathrm{MeanSim}(a_i) = \frac{Age_{a_i}}{Age_{a_i} + 1} \mathrm{MeanSim}(a_i) + \frac{1}{Age_{a_i} + 1} \mathrm{Sim}(a_i, a_j)$$

$$\mathrm{MaxSim}(a_i) = 0.8 \cdot \mathrm{MaxSim}(a_i) + 0.2 \cdot \mathrm{Sim}(a_i, a_j), \quad \text{only if } \mathrm{Sim}(a_i, a_j) > \mathrm{MaxSim}(a_i)$$

The next phase in this stage is the local optimization phase. The local optimization algorithm is described in Figure 6. The aim of local optimization is to make sure that the queen of a colony is the best representative of the colony and to remove ants that should not belong to that colony. The reasoning behind the need for the queen to be the best representative of the colony is that crucial decisions, such as the addition and deletion of ants, are made by the queen. It also allows us to obtain a better partition without having to have more meetings between an ant and its potential colony mates.

At the end of this stage, most of the ants are in their respective colonies. We check if any two queens are similar to each other and merge their colonies if they accept each other. In the next stage, we address the clustering of the remaining ants.
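The age-weighted updates above can be sketched as follows; whether the age counter keeps incrementing during this stage is our assumption, since the paper only increments ages explicitly in training:

```python
def update_stats_clustering(ant, s):
    """Clustering-stage update: older ants adjust their statistics more slowly."""
    # MeanSim <- Age/(Age+1) * MeanSim + 1/(Age+1) * Sim
    ant.mean_sim = (ant.age * ant.mean_sim + s) / (ant.age + 1)
    # MaxSim is nudged upward only when the new similarity exceeds the current maximum.
    if s > ant.max_sim:
        ant.max_sim = 0.8 * ant.max_sim + 0.2 * s
    ant.age += 1   # assumed: ants keep aging as they meet during clustering
```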


Cluster Formation
until 50% of the total ants find their colonies, or for i = 0 to γ do
    Randomly select 2 ants
    Stimulate a meeting between them
    Follow the rules in the meeting algorithm and take appropriate action
    if i mod α = 0 then
        Run the local optimization algorithm
end-for / end-until
for each colony C do
    Check if the queen of colony C is similar to any other colony's queen and merge them if they accept
end-for

Figure 4: Cluster Formation Algorithm
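A sketch of the loop in Figure 4, with the parameter values reported in Section 3.3 (α = 6·n_ants, β = 50%, γ = 40·n_ants); meet() and local_optimization() are the helpers sketched after Figures 5 and 6 below:

```python
import random

def cluster_formation(ants, colonies, sim):
    """Stage 3, phase 1 (Figure 4): random meetings with periodic local optimization."""
    n = len(ants)
    alpha, gamma = 6 * n, 40 * n
    for i in range(1, gamma + 1):
        x, y = random.sample(ants, 2)        # stimulate a meeting between two random ants
        meet(x, y, colonies, sim)            # apply the Figure 5 rules
        if i % alpha == 0:                   # every alpha meetings, clean up the colonies
            local_optimization(colonies, sim)
        if sum(a.colony != -1 for a in ants) >= 0.5 * n:   # beta: 50% of ants placed
            break
```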


Rules when two ants X and Y meet

Case 1: If both ants do not belong to any colony and they accept each other:
    Create a new colony, add these two ants to that colony, and make one of them the queen of that colony.

Case 2: If one of the ants belongs to a colony A and the other is free:
    If they accept each other and the queen of colony A accepts the free ant:
        Add the free ant to colony A and update the queen of that colony with the information that the new ant has been added.

Case 3: If both ants are present in the same colony A:
    If they accept each other:
        Leave them as they are.
    Else (they do not accept each other):
        Check if the queen of colony A accepts the ants. Delete the ants that the queen has not accepted and inform the queen about the removal of the ants.

Case 4: If the ants are present in different colonies A and B:
    If the ants accept each other and the queens of colonies A and B accept each other:
        Merge both colonies into one colony and make the queen of the bigger colony the queen of the newly formed colony.

Figure 5: Rules followed when the ants meet
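A compact sketch of the four cases follows. The colony bookkeeping (a dict from colony id to queen and member list) is our own representation, and edge cases such as the queen itself being one of the meeting ants are glossed over:

```python
def meet(x, y, colonies, sim):
    """Apply the Figure 5 rules to a meeting between ants x and y."""
    if x.colony == -1 and y.colony == -1:                    # Case 1: both ants are free
        if accept_ants(x, y, sim):
            cid = max(colonies, default=-1) + 1
            x.colony = y.colony = cid
            colonies[cid] = {"queen": x, "members": [x, y]}  # one of them becomes queen
    elif (x.colony == -1) != (y.colony == -1):               # Case 2: exactly one is free
        free, member = (x, y) if x.colony == -1 else (y, x)
        col = colonies[member.colony]
        if accept_ants(free, member, sim) and accept_ant_queen(free, col["queen"], sim):
            free.colony = member.colony
            col["members"].append(free)                      # the queen is informed
    elif x.colony == y.colony:                               # Case 3: same colony
        if not accept_ants(x, y, sim):
            col = colonies[x.colony]
            for ant in (x, y):                               # queen evicts rejected ants
                if ant is not col["queen"] and not accept_ant_queen(ant, col["queen"], sim):
                    col["members"].remove(ant)
                    ant.colony = -1
    else:                                                    # Case 4: different colonies
        qa, qb = colonies[x.colony]["queen"], colonies[y.colony]["queen"]
        if accept_ants(x, y, sim) and accept_queens(qa, qb, sim):
            big, small = sorted((x.colony, y.colony),
                                key=lambda c: len(colonies[c]["members"]), reverse=True)
            for ant in colonies[small]["members"]:           # bigger colony keeps its queen
                ant.colony = big
                colonies[big]["members"].append(ant)
            del colonies[small]
```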


Local Optimization
for each colony do
    Find the ant a that is most similar to all of the rest of the ants in that colony. This is done in a two-step process as follows:
    Step 1:
        for each ant a in the colony do
            Find the similarities between the ant a and all other ants in the colony. Calculate the average of all those similarities to obtain the average similarity of ant a
        end-for
    Step 2:
        Pick the ant a with the highest average similarity in the colony
        If ant a is not the present queen of that colony, then make a the queen
    Find the average similarity of the colony and remove the ants that have an average similarity less than 0.97 · CS, where CS is the colony's average similarity
end-for

Figure 6: Local Optimization Algorithm
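A sketch under the same colony representation as the meet() helper above:

```python
def local_optimization(colonies, sim):
    """Figure 6: make the most central ant the queen, then evict dissimilar members."""
    for col in colonies.values():
        members = col["members"]
        if len(members) < 2:
            continue
        # Step 1: average similarity of each ant to the rest of its colony.
        avg = {a.genome: sum(sim(a, b) for b in members if b is not a) / (len(members) - 1)
               for a in members}
        # Step 2: the ant with the highest average similarity becomes the queen.
        col["queen"] = max(members, key=lambda a: avg[a.genome])
        # Evict ants whose average similarity is below 0.97 * colony average CS.
        cs = sum(avg.values()) / len(members)
        keep = [a for a in members if avg[a.genome] >= 0.97 * cs]
        for a in members:
            if a not in keep:
                a.colony = -1      # evicted ants become free again
        col["members"] = keep
```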


Final Clustering
for each free ant do
    Find the similarities between the free ant and all the queens of the available colonies. Then find the most similar queen and assign the free ant to the colony to which that queen belongs
end-for
for each colony C do
    Check if the queen of colony C is similar to any other colony's queen and merge them if they accept
end-for

Figure 7: Algorithm for Assigning Free Ants

3.4 Final Clustering

There might be some free ants that have not yet joined any colony. For each of the free ants, we find the closest colony possible and add the ant to that colony. The algorithm for assigning free ants is described in Figure 7.
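A sketch of this stage under the same representation; the queen-merging pass at the end of Figure 7 is omitted for brevity:

```python
def assign_free_ants(ants, colonies, sim):
    """Stage 4 (Figure 7): attach each free ant to the colony of its most similar queen."""
    for ant in ants:
        if ant.colony != -1 or not colonies:
            continue
        best = max(colonies, key=lambda c: sim(ant, colonies[c]["queen"]))
        ant.colony = best
        colonies[best]["members"].append(ant)
```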


4 Performance Analysis

The algorithm was tested on artificially generated databases and on real databases from the Machine Learning Repository [15]. These databases are used as a benchmark, as they have been used to test a number of different ant-based clustering algorithms; thus the results can be compared to other algorithms. The databases that we used have up to 1473 objects and 35 attributes. The artificial data sets Art1, Art2, Art3, Art4, Art5, and Art6 were generated according to Gaussian or uniform distributions in such a way that each poses a unique situation for the algorithm to handle in the process of clustering.

Generally, real data sets are more difficult to cluster because they may be noisier than artificial data sets. The real data sets that we used to test our algorithm are Soybean, Iris, Wine, Thyroid, Glass, Haberman, Ecoli, Pima, and CMC.

The Soybean data set contains data on four kinds of soybean diseases. The data set contains 47 instances, each with 35 attributes, and so provides the ability to test the algorithm on a data set with many attributes.

The Iris data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. The Wine data set is the result of an analysis of wines grown at a particular place; it contains 178 instances in 3 classes. The Thyroid data set contains 215 instances of thyroid diagnoses. It has three classes, each representing a kind of thyroid disorder (such as hyperthyroid and hypothyroid).

The Glass data set is obtained from a study of the classification of types of glass. This data set contains 7 classes; hence it is used to test the algorithm on a data set that has records from many classes. The Haberman data set is obtained from a study on the survival of patients who had undergone surgery for breast cancer. It has 306 instances in 2 classes.

The Ecoli data set contains information about protein localization sites. It has 8 classes and 336 instances. The Balance-Scale data set was generated to model psychological experimental results. It has 625 instances in 3 classes. The Pima data set has 798 instances of diabetes diagnoses for women of Pima Indian heritage; the data set has 2 classes. The CMC data set was obtained from the 1987 National Indonesia Contraceptive Prevalence Survey. It has 1473 instances in 3 classes.

To test the performance of our algorithm, we used the classification success measure C_s developed by Fowlkes and Mallows [5] and modified by


Labroche et al. [9]. This measure allows us to compare the clusters obtained by our algorithm to the real clusters in the database. We denote the number of objects in the database by N. The following equation is used to find the clustering success C_s:

$$C_s = 1 - \frac{2}{N(N-1)} \sum_{\substack{(i,j) \in \{1,\ldots,N\}^2 \\ i < j}} \epsilon_{ij}$$

where ε_ij = 0 if ants a_i and a_j are placed in the same cluster by both the expected partition and the partition found by the algorithm, or in different clusters by both, and ε_ij = 1 otherwise.
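A sketch of the measure in Python, with ε_ij as defined above; partitions are given as lists of cluster labels indexed by object:

```python
def clustering_success(expected, found):
    """C_s = 1 - 2/(N(N-1)) * sum of epsilon_ij over all pairs i < j."""
    n = len(expected)
    disagreements = 0
    for i in range(n):
        for j in range(i + 1, n):
            same_real = expected[i] == expected[j]
            same_found = found[i] == found[j]
            if same_real != same_found:    # epsilon_ij = 1 when the partitions disagree
                disagreements += 1
    return 1.0 - 2.0 * disagreements / (n * (n - 1))

# A perfect partition scores 1.0 regardless of the label names used.
assert clustering_success([0, 0, 1, 1], [5, 5, 9, 9]) == 1.0
```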


In the tables below, O is the number of objects, A the number of attributes, and N the number of natural clusters in each database; MeanCS is the mean clustering success, MeanNC the mean number of clusters found, and AvgRT the average running time, with standard deviations in brackets.

Database       O     A   N   MeanCS [Std]   MeanNC [Std]    AvgRT
Soybean        47    35  4   0.92 [0.03]    5.84 [0.70]     0.02
Iris           150   4   3   0.84 [0.02]    4.98 [1.00]     0.22
Wine           178   13  3   0.81 [0.12]    4.56 [1.2]      0.26
Thyroid        215   5   3   0.80 [0.07]    4.92 [1.26]     4.87
Glass          214   9   7   0.65 [0.03]    6.62 [1.61]     5.39
Haberman       306   3   2   0.94 [0.01]    3.24 [0.90]     0.63
Ecoli          336   8   8   0.79 [0.06]    7.74 [2.12]     4.80
Balance-Scale  625   4   3   0.60 [0.01]    11.09 [1.23]    0.30
Pima           798   8   2   0.54 [0.01]    3.34 [1.30]     0.90
CMC            1473  9   3   0.61 [0.00]    15.16 [2.04]    0.77
Art1           400   2   4   0.84 [0.01]    7.46 [1.15]     0.71
Art2           1000  2   2   0.78 [0.04]    5.9 [1.66]      34.28
Art3           1100  2   4   0.80 [0.03]    6.9 [1.72]      40.20
Art4           200   2   2   0.74 [0.02]    6.04 [1.32]     0.24
Art5           900   2   9   0.85 [0.02]    8.54 [1.25]     2.77
Art6           400   8   8   0.95 [0.02]    6.52 [1.40]     6.41

Table 1: QLA Results

Database  O     A   N   QLA            AntClust       K-means
                        MeanCS [Std]   MeanCS [Std]   MeanCS [Std]
Soybean   47    35  4   0.92 [0.03]    0.93 [0.04]    0.91 [0.08]
Iris      150   4   3   0.84 [0.02]    0.78 [0.01]    0.86 [0.03]
Thyroid   215   5   3   0.80 [0.07]    0.84 [0.03]    0.82 [0.00]
Glass     214   9   7   0.65 [0.03]    0.64 [0.02]    0.68 [0.01]
Pima      798   8   2   0.54 [0.01]    0.54 [0.01]    0.56 [0.00]
Art1      400   2   4   0.84 [0.01]    0.78 [0.03]    0.89 [0.00]
Art2      1000  2   2   0.78 [0.04]    0.93 [0.02]    0.96 [0.00]
Art3      1100  2   4   0.80 [0.03]    0.85 [0.02]    0.78 [0.02]
Art4      200   2   2   0.74 [0.02]    0.77 [0.05]    1.00 [0.00]
Art5      900   2   9   0.85 [0.02]    0.74 [0.02]    0.91 [0.02]
Art6      400   8   8   0.95 [0.02]    0.95 [0.01]    0.99 [0.04]

Table 2: Results of all the algorithms


5 Conclusion

This paper proposed a new ant-based clustering algorithm. During the process of clustering, the queen of a colony plays an important role in its construction because it represents the colony better than its colony mates. This method, coupled with local optimization, results in a better partition of the objects. The results obtained by this algorithm were close to the best known results for the set of benchmark databases.

Future work could be done on clustering the given data based on some given criteria. Another idea is to make the problem distributed by dividing the data set into subsets and finding the solutions for all the subsets in parallel; from these sub-solutions, we could then try to find the solution for the whole data set. This could make the algorithm even more competitive.

References

[1] J.L. Deneubourg, S. Goss, N. Franks, A. Sendova-Franks, C. Detrain, and L. Chretien, "The Dynamics of Collective Sorting: Robot-like Ants and Ant-like Robots," Proceedings of the First Conference on Simulation of Adaptive Behavior, J.A. Meyer and S.W. Wilson, Eds., MIT Press/Bradford Books, 1990, pp. 356–363.

[2] R.C. Dubes and A.K. Jain, Algorithms for Clustering Data, Prentice-Hall, Englewood Cliffs, NJ, 1988.

[3] M. Ester, H.P. Kriegel, J. Sander, and X. Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, Oregon, 1996, pp. 226–231.

[4] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, "The KDD Process for Extracting Useful Knowledge from Volumes of Data," Communications of the ACM 39(11), Nov. 1996, pp. 27–34.


[5] J. Heer and E. Chi, "Mining the Structure of User Activity Using Cluster Stability," Proceedings of the Workshop on Web Analytics, SIAM Conference on Data Mining, Arlington, VA, April 2002.

[6] D. Judd, P. McKinley, and A. Jain, "Large-Scale Parallel Data Clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence 20(8), August 1998, pp. 871–876.

[7] L. Kaufman and P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley and Sons, Chichester, 1990.

[8] E.D. Lumer and B. Faieta, "Diversity and Adaptation in Populations of Clustering Ants," From Animals to Animats 3: Simulation of Adaptive Behavior, D. Cliff, P. Husbands, J.A. Meyer, and S.W. Wilson, Eds., MIT Press, 1994, pp. 501–508.

[9] N. Labroche, N. Monmarché, and G. Venturini, "AntClust: Ant Clustering and Web Usage Mining," Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), 2003, pp. 25–36.

[10] N. Labroche, N. Monmarché, and G. Venturini, "A New Clustering Algorithm Based on the Chemical Recognition System of Ants," Proceedings of the European Conference on Artificial Intelligence, Lyon, France, 2002, pp. 345–349.

[11] N. Monmarché, "On Data Clustering with Artificial Ants," in A. Freitas, Ed., AAAI-99 and GECCO-99 Workshop on Data Mining with Evolutionary Algorithms: Research Directions, Florida, 1999, pp. 23–26.

[12] R. Ng and J. Han, "Efficient and Effective Clustering Methods for Spatial Data Mining," Proceedings of the International Conference on Very Large Data Bases, Santiago, Chile, Sept. 1994, pp. 144–155.

[13] B.J. Schachter, L.S. Davis, and A. Rosenfeld, "Some Experiments in Image Segmentation by Clustering of Local Feature Values," Pattern Recognition 11(1), 1979, pp. 19–28.


[14] T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: A New Data Clustering Algorithm and its Applications," Data Mining and Knowledge Discovery 1(2), 1997, pp. 141–182.

[15] UCI Machine Learning Repository, ftp://ftp.ics.uci.edu/pub/machine-learning-databases/
