APPROVAL SHEET

Title of Thesis: Efficient Local Algorithms for Distributed Data Mining in Large Scale Peer to Peer Environments: A Deterministic Approach

Name of Candidate: Kanishka Bhaduri
                   Doctor of Philosophy, 2008

Thesis and Abstract Approved: Dr. Hillol Kargupta
                              Associate Professor
                              Department of Computer Science and Electrical Engineering

Date Approved:


CURRICULUM VITAE

Name: Kanishka Bhaduri
Permanent Address: 211-F Atholgate Lane, Baltimore MD-21229
Degree and date to be conferred: Doctor of Philosophy, 2008
Date of Birth: April 24, 1980
Place of Birth: Kolkata, India
Secondary Education: South Point High School, Kolkata, India, 1997

Collegiate institutions attended:

• University of Maryland Baltimore County, Maryland, USA, Doctor of Philosophy, 2008.
• University of Miami, USA, 2003–2004.
• Jadavpur University, Kolkata, India, Bachelor of Engineering, Computer Science and Engineering, 2003.

Major: Computer Science.

Professional publications:

Refereed Journals

1. K. Bhaduri, H. Kargupta. Distributed Multivariate Regression in Peer-to-Peer Networks. Statistical Analysis and Data Mining (SAM), Special Issue on Best of SDM'08. (submitted) 2008.

2. K. Bhaduri, R. Wolff, C. Giannella, H. Kargupta. Decision Tree Induction in Peer-to-Peer Systems. Statistical Analysis and Data Mining (SAM), accepted for publication. 2008.

3. K. Das, K. Bhaduri, K. Liu, H. Kargupta. Distributed Identification of Top-l Inner Product Elements and its Application in a Peer-to-Peer Network. IEEE Transactions on Knowledge and Data Engineering. Volume 20, Issue 4, pp. 475-488. April 2008.

4. K. Liu, K. Bhaduri, K. Das, P. Nguyen, H. Kargupta. Client-side Web Mining for Community Formation in Peer-to-Peer Environments. SIGKDD Explorations. Volume 8, Issue 2, pp. 11-20. December 2006.

5. R. Wolff, K. Bhaduri, H. Kargupta. A Generic Local Algorithm with Applications for Data Mining in Large Distributed Systems. IEEE Transactions on Knowledge and Data Engineering (TKDE). (submitted) 2007.

Book Chapter

1. K. Bhaduri, K. Das, K. SivaKumar, H. Kargupta, R. Wolff, R. Chen. Algorithms for Distributed Data Stream Mining. A chapter in Data Streams: Models and Algorithms, Charu Aggarwal (Editor), Springer. pp. 309-332. 2006.

Refereed Conference Proceedings

1. K. Bhaduri, H. Kargupta. An Efficient Local Algorithm for Distributed Multivariate Regression in Peer-to-Peer Networks. Accepted for publication at the 2008 SIAM International Data Mining Conference. (Best of SDM'08)

2. R. Wolff, K. Bhaduri, H. Kargupta. Local L2 Thresholding Based Data Mining in Peer-to-Peer Systems. SIAM International Conference on Data Mining, Bethesda, Maryland, USA. pp. 430-441. 2006.

Refereed Workshop Proceedings

1. K. Liu, K. Bhaduri, K. Das, P. Nguyen, H. Kargupta. Client-side Web Mining for Community Formation in Peer-to-Peer Environments. SIGKDD Workshop on Web Usage and Analysis (WebKDD). Philadelphia, Pennsylvania, USA. 2006. (Selected as the most interesting paper from the WebKDD workshop)

2. K. Bhaduri, K. Das, H. Kargupta. Peer-to-Peer Data Mining. Autonomous Intelligent Systems: Agents and Data Mining. V. Gorodetsky, C. Zhang, V. Skormin, L. Cao (Editors), LNAI 4476, Springer. pp. 1-10. 2007.

Invited

1. S. Datta, K. Bhaduri, C. Giannella, R. Wolff, H. Kargupta. Distributed Data Mining in Peer-to-Peer Networks. IEEE Internet Computing, Special Issue on Distributed Data Mining. Volume 10, Number 4, pp. 18-26. 2006.

Professional positions held:

• Research Assistant (05/2004 – 03/2008). Distributed Adaptive Discovery and Computation Lab, Department of Computer Science and Electrical Engineering, University of Maryland Baltimore County (UMBC).

• Software Internship (05/2007 – 08/2007). Symantec Corporation, Columbia, Maryland.

• Teaching Assistant (08/2003 – 05/2004). Department of Computer Science, University of Miami, Florida.


ABSTRACT

Title of Dissertation: Efficient Local Algorithms for Distributed Data Mining in Large Scale Peer to Peer Environments: A Deterministic Approach

Kanishka Bhaduri, Doctor of Philosophy, 2008

Thesis directed by: Dr. Hillol Kargupta, Associate Professor, Department of Computer Science and Electrical Engineering

Peer-to-peer (P2P) systems such as Gnutella, Napster, e-Mule, Kazaa, and Freenet are increasingly becoming popular for many applications that go beyond downloading music files without paying for them. Examples include P2P systems for network storage, web caching, searching and indexing of relevant documents, and distributed network-threat analysis. These environments are rich in data, and this data, if mined, can provide a valuable source of information. Mining the web cache of users, for example, may reveal their browsing patterns, leading to more efficient searching, resource utilization, query routing and more. However, most off-the-shelf data analysis techniques are designed for centralized applications where the entire data is stored in a single location. These techniques do not work in a highly decentralized, distributed environment such as a P2P network. Solving this problem requires distributed data mining algorithms that are fundamentally local, scalable, decentralized, asynchronous and anytime.

This research proposes DeFraLC: a Deterministic Framework for Local Computation of functions defined on data distributed in large scale (peer-to-peer) systems. Computing global data models in such environments can be very expensive. Moving all or some of the data to a central location does not work because of the high cost involved in centralization. The cost increases even more under a dynamic scenario where the peers' data and the network topology change arbitrarily. In this dissertation we have focused on developing algorithms for deterministic function computation in large scale P2P environments. Our algorithmic framework is local, which means that a peer can compute a function based on the information of only a handful of nearby neighbors, and the communication overhead of the algorithm is upper bounded by some constant, independent of the size of the system. As a consequence, many messages can be pruned, leading to excellent scalability of our algorithms.

The first algorithm that we have developed — PeGMA, the Peer-to-Peer Generic Monitoring Algorithm — is capable of computing complex functions defined on the average of the horizontally distributed data. This generic algorithm is extremely accurate, highly scalable, and can seamlessly adapt to changes in the data or the network. Following PeGMA, several interesting algorithms can be developed, such as L2 norm monitoring of distributed data, which is a very powerful primitive. Using a two-step feedback loop, a number of data mining algorithms have been proposed. The first step uses the local algorithm to raise a flag whenever the current data does not fit the function. The second step uses a feedback loop to sample data from the network and build a new function. The correctness of the local algorithm guarantees that once the computation terminates, each peer has the same result as would be obtained in a centralized scenario. We propose solutions for P2P k-means monitoring, eigen monitoring and multivariate regression in P2P environments. Furthermore, we have shown how a complex data mining algorithm such as decision tree induction can be developed for P2P environments. Finally, we have implemented all of the algorithms in a Distributed Data Mining Toolkit (DDMT) [44] developed at the DIADIC research lab at UMBC. Our extensive experimental results show that the proposed algorithms are accurate, efficient and highly scalable.


EFFICIENT LOCAL ALGORITHMS FOR DISTRIBUTED DATA MINING IN LARGE SCALE PEER TO PEER ENVIRONMENTS: A DETERMINISTIC APPROACH

by
Kanishka Bhaduri

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland in partial fulfillment of the requirements for the degree of Doctor of Philosophy

2008


© Copyright Kanishka Bhaduri 2008


Dedicated to Dadon... who was always proud of my achievements


hope she finishes her PhD soon so that we can be together again.

I also want to mention my labmates Kun Liu, Haimonti Dutta, Souptik Datta and Sourav Mukherjee — working with all of them has been a pleasure. My graduate life at UMBC paved the path for some very important friendships in my life. I will always remember Kishalay, Nicolle, Aarti and Aseem for all their help and for all the fun things we did together.

Finally, I would like to thank all my committee members for providing me with their valuable feedback on my thesis and for working with me under severe time constraints. Overall, it has been a rewarding experience that I will cherish for a long time to come.


TABLE OF CONTENTS

DEDICATION
ACKNOWLEDGMENTS
LIST OF FIGURES
LIST OF TABLES

Chapter 1  INTRODUCTION
1.1 Introduction
1.2 Motivation
1.3 Problem Statement
1.4 Contributions
1.5 Dissertation Organization

Chapter 2  BACKGROUND AND RELATED WORK
2.1 Introduction
2.2 Client-Server Networks and Peer-to-Peer Networks
2.3 Types of Peer-to-Peer Networks
    2.3.1 Centralized
    2.3.2 Decentralized but structured
    2.3.3 Decentralized and unstructured
2.4 Characteristics of Peer-to-Peer Data Mining Algorithms
2.5 Related work: Distributed Data Mining
    2.5.1 Distributed Classification
    2.5.2 Distributed Clustering
2.6 Related work: Data Mining in Peer-to-Peer Environments
    2.6.1 Approximate Algorithms
    2.6.2 Exact Algorithms
2.7 Related work: Distributed Data Stream Mining
2.8 Related work: Data Mining in Sensor Networks
2.9 Related work: Information Retrieval in P2P Networks
    2.9.1 Search in P2P Network
    2.9.2 Ranking in P2P Network
2.10 Related work: Data Mining in GRID
2.11 Related work: Distributed AI and Multi-Agent Systems
    2.11.1 Homogenous and non-communicating MAS
    2.11.2 Heterogenous and non-communicating MAS
    2.11.3 Homogenous and communicating MAS
    2.11.4 Heterogenous and communicating MAS
2.12 Applications of Peer-to-Peer Data Mining
    2.12.1 File storage
    2.12.2 Education
    2.12.3 Bioinformatics
    2.12.4 Consumer Applications
    2.12.5 Sensor Network Applications
    2.12.6 Mobile Ad-hoc Networks (MANETs)
2.13 Summary

Chapter 3  DEFRALC: A DETERMINISTIC FRAMEWORK FOR LOCAL COMPUTATION IN LARGE DISTRIBUTED SYSTEMS
3.1 Introduction
3.2 What is locality?
3.3 Local Algorithms: Definitions and Properties
3.4 Approach
3.5 Notations, Assumptions, and Problem Definition
    3.5.1 Notations
    3.5.2 Assumptions
    3.5.3 Sufficient Statistics
    3.5.4 Problem Definition
    3.5.5 Illustration
3.6 Main Theorems
3.7 Peer-to-Peer Generic Algorithm (PeGMA) and its Instantiations
    3.7.1 PeGMA: Algorithmic Details
    3.7.2 PeGMA: Correctness and Locality
    3.7.3 Local L2 Norm Thresholding
    3.7.4 Computing Pacman
3.8 Reactive Algorithms
    3.8.1 Mean Monitoring
    3.8.2 k-Means Monitoring
    3.8.3 Eigenvector Monitoring
    3.8.4 Distributed Multivariate Regression
3.9 Experimental Validation
    3.9.1 Experimental Setup
    3.9.2 Experiments with Local L2 Thresholding Algorithm
    3.9.3 Experiments with Means-Monitoring
    3.9.4 Experiments with k-Means and Eigen-Monitoring
    3.9.5 Experiments with Multivariate Regression Algorithm
    3.9.6 Results: Regression Models
3.10 Summary

Chapter 4  DISTRIBUTED DECISION TREE INDUCTION IN PEER-TO-PEER SYSTEMS
4.1 Introduction
4.2 Related Work: Distributed Classification Algorithms
4.3 Background
    4.3.1 Assumptions
    4.3.2 Distributed Majority Voting
4.4 P2P Decision Tree Induction Algorithm
    4.4.1 Splitting Attribute Choosing using the Misclassification Gain Function
    4.4.2 Speculative Decision Tree Induction
    4.4.3 Stopping Rule Computation
    4.4.4 Accuracy on Centralized Dataset
4.5 Experiments
    4.5.1 Experimental Setup
    4.5.2 Data Generation
    4.5.3 Measurement Metric
    4.5.4 Scalability
    4.5.5 Data Tuples per Peer
    4.5.6 Depth of Tree
    4.5.7 Size of Leaky Bucket
    4.5.8 Noise in Data
    4.5.9 Number of Attributes
    4.5.10 Discussion
4.6 Summary

Chapter 5  CONCLUSIONS AND FUTURE WORK

Appendix A  MAJORITY VOTING PROTOCOL AND ITS RELATION TO L2 THRESHOLDING ALGORITHM

Appendix B  INFORMATION GAIN COMPARISON
B.1 Notation
B.2 Thresholding Misclassification Gain
B.3 Thresholding Gini Information Gain

REFERENCES


LIST OF FIGURES

2.1 A client server architecture showing three clients and three servers — a web server, application server and database server. Image source http://www.sqapartners.com/images/webmatrix2 590.jpg.

2.2 A P2P network of 100 nodes showing the ad-hoc structure.

3.1 Figure showing connections to immediate neighbors.

3.2 An example showing the cover and the vectors $\vec{G}$, $\vec{K}_1$ and $\vec{K}_2$. Each shaded region is a convex region, along with the unshaded region inside the square.

3.3 Vectors of four peers before and after Theorem 3.6.1 is applied. The knowledge, the agreement and the withheld knowledge are shown in the figure. The top row represents the initial states of the four peers. The bottom row depicts what happens if $P_2$ sends all its data to $P_3$.

3.4 An arbitrary F is subdivided using a quadtree data structure.

3.5 Flowchart of PeGMA.

3.6 (A) The area inside an $\epsilon$ circle. (B) A random vector. (C) A tangent defining a half-space. (D) The areas between the circle and the union of half-spaces are the tie areas.

3.7 A circle of radius $\epsilon$ circumscribing a Pacman shape. (A) Area inside the shape. (B) Area outside the shape. (C) Unit vectors defining the tangent lines. (D) Tangent lines defining the bounding polygon.

3.8 Convergecast and broadcast through the different steps. In subfigure 3.8(a), the peers do not raise a flag. In subfigure 3.8(b), the two leaves raise their flags and send their data up (to the parent) as shown using arrows. Figure 3.8(c) shows an intermediate step. Finally, the roots (two of them) become activated in subfigure 3.8(d) by exchanging data with each other.

3.9 Typical 2-D data used in the experiment. It shows the two Gaussians along with random noise.

3.10 A typical experiment is run for 10 equal length epochs. The epochs have very similar means. Quality and overall cost are measured across the entire experiment — including transitional phases. The monitoring cost is measured on the last 80% of every epoch, in order to ignore transitional effects.

3.11 Scalability of the Local L2 algorithm w.r.t. number of peers and the dimension. Both quality and cost w.r.t. number of peers converge to a constant, showing high scalability. For dimension variability, the cost increases almost linearly w.r.t. dimension, even though the space increases exponentially.

3.12 Dependency of cost and quality of L2 Thresholding on $\|\sigma\|_F$. Quality is defined as the percentage of peers correctly computing an alert (separated for epochs with $\|\vec{G}\|$ less and more than $\epsilon$). Cost is defined as the portion of the leaky bucket intervals that are used. Both overall cost and cost of just the stationary periods are reported. Quality degrades and cost increases as $\|\sigma\|_F$ is increased.

3.13 Dependency of cost and quality of L2 Thresholding on $|S_i|$. Quality is defined as the percentage of peers correctly computing an alert (separated for epochs with $\|\vec{G}\|$ less and more than $\epsilon$). Cost is defined as the portion of the leaky bucket intervals that are used. Both overall cost and cost of just the stationary periods are reported. The quality improves and the cost decreases as $|S_i|$ is increased.

3.14 Dependency of cost and quality of L2 Thresholding on $\epsilon$. Quality is defined as the percentage of peers correctly computing an alert (separated for epochs with $\|\vec{G}\|$ less and more than $\epsilon$). Cost is defined as the portion of the leaky bucket intervals that are used. Both overall cost and cost of just the stationary periods are reported. There is a drastic improvement in quality and decrease in cost as $\epsilon$ is increased.

3.15 Dependency of cost and quality of L2 Thresholding on the size of the leaky bucket $L$. Quality is defined as the percentage of peers correctly computing an alert (separated for epochs with $\|\vec{G}\|$ less and more than $\epsilon$). Cost is defined as the portion of the leaky bucket intervals that are used. Both overall cost and cost of just the stationary periods are reported. Quality is almost unaffected by $L$, while cost decreases with increasing $L$.

3.16 Dependency of cost and quality of Mean-Monitoring on the alert mitigation period $\tau$. The number of data collection rounds decreases as $\tau$ is increased.

3.17 Dependency of cost and quality of Mean-Monitoring on epoch length ($T$). As $T$ increases, the average $\|G - \mu\|$ and L2 monitoring cost decrease since the average is taken over a larger stationary period in which $\mu$ is correct.

3.18 Dependence of cost and quality of Mean-Monitoring on $\epsilon$. Average $\|G - \mu\|$ increases and the cost decreases as $\epsilon$ is increased.

3.19 The quality and cost of k-means clustering.

3.20 The quality and cost of eigen monitoring.

3.21 A typical experiment for the regression monitoring algorithm.

3.22 Dependence of the quality and cost of the monitoring algorithm on $|S_i|$. As before, quality improves and cost decreases as $|S_i|$ increases.

3.23 Dependence of the monitoring algorithm on $\epsilon$. As $\epsilon$ is increased, quality improves and cost decreases since the problem becomes significantly simpler.

3.24 Behavior of the monitoring algorithm w.r.t. the size of the leaky bucket $L$. The quality and cost are almost invariant of $L$.

3.25 Dependence of the quality and cost of the monitoring algorithm on noise in the data. The quality degrades and cost increases as the noise percentage is increased.

3.26 Scalability with respect to both the number of peers and the dimension of the multivariate problem. Both quality and cost are independent of the number of peers and the dimension of the multivariate problem.

3.27 Quality and cost of computing regression coefficients for the linear model.

3.28 Quality and cost of computing regression coefficients for the multiplicative model. The result becomes more accurate and the cost decreases as the sample size is increased.

3.29 Quality and cost of computing regression coefficients for the sinusoidal model. The result becomes more accurate and the cost decreases as the sample size is increased.

4.1 Figure showing the pivot selection process. In the first row, $A_1$ is selected as the pivot. $A_1$ is then compared to all other attributes and all the ones better than $A_1$ are selected in the third snapshot. $A_2$ is selected as the pivot. Finally $A_2$ is output as the best attribute.

4.2 Comparison of two attributes $A_i$ and $A_j$ for two peers $P_k$ and $P_l$. The figure also shows some example values of the indicator variables and the suspended/not suspended votes in each case.

4.3 The pivot selection process and how the best attribute is selected. The pivot is shown at each round.

4.4 Figure showing how the speculative decision tree is built by a peer. Filled rectangles represent the newly created nodes. In the first snapshot the root is just created with $A_1$ as the current best attribute. The root is split into two children in the second snapshot. The third snapshot shows further development of the tree by splitting the left child. In the fourth snapshot, the peer gets convinced that $A_2$ is the best attribute corresponding to the root. The earlier tree is made inactive and a new tree is developed with the split at $A_2$. The fifth snapshot shows the leaf label assignments.

4.5 Comparison of accuracy (using 10-fold cross validation) of the J48 Weka tree and a tree induced using misclassification gain with a fixed depth stopping rule on a centralized dataset. The three graphs correspond to depths of 3, 5 and 7 of the misclassification gain decision tree.

4.6 Dependence of the quality and cost of the decision tree algorithm on the number of peers. The accuracy and the cost are almost invariant of the number of peers.

4.7 Dependence of the quality and cost of the decision tree algorithm on $|S_i|$. The accuracy improves and the cost decreases as $|S_i|$ increases.

4.8 Dependence of the quality and cost of the decision tree algorithm on the depth of the induced tree. The accuracy first improves and then degrades as the depth of the tree increases. The cost increases as the depth increases.

4.9 Dependence of the quality and cost of the decision tree algorithm on the size of the leaky bucket.

4.10 Dependence of the quality and cost of the decision tree algorithm on noise in the data. The accuracy decreases and cost increases as the percentage of noise increases.

4.11 Dependence of the quality and cost of the decision tree algorithm on the number of attributes when the number of tuples is held constant. As the number of attributes increases, the accuracy drops and the cost increases.

4.12 Dependence of the quality and cost of the decision tree algorithm on the number of attributes when the number of tuples increases linearly with the number of attributes.

4.13 Dependence of the quality and cost of the decision tree algorithm on the number of attributes when the number of tuples increases linearly with |domain|.


LIST OF TABLES

2.1 Categorization of the different P2P data mining algorithms.

3.1 Table showing the messages exchanged by two peers $P_i$ and $P_j$ for the L2 thresholding algorithm.

B.1 Number of entries of attribute $A_i$ and the class $C$.

B.2 Number of entries of attribute $A_j$ and the class $C$.


LIST OF ALGORITHMS

1  P2P Generic Local Algorithm (PeGMA)
2  Generic cover for sphere in $\mathbb{R}^d$
3  Local L2 Thresholding
4  Mean Monitoring
5  Convergecast function for Mean Monitoring
6  Function handling the receipt of new means $\mu$ for Mean Monitoring
7  P2P k-Means Clustering
8  Convergecast function for P2P k-Means Monitoring
9  Function handling the receipt of new centroids $C'$ for P2P k-means Monitoring
10 Local Majority Vote
11 P2P Misclassification Minimization ($P^2MM$)
12 P2P Decision Tree Induction ($P^2DTI$)
13 Local Majority Vote
14 Dynamic Local Majority Vote


Chapter 1

INTRODUCTION

1.1 Introduction

The proliferation of communication technologies and the reduction in storage costs over the past decade have led to the emergence of several distributed systems. These environments are rich in data; however, unlike traditional systems where the data and computational resources are at one central location, these systems are distributed both in terms of data and resources. As a result, mining in such environments naturally calls for proper utilization of these distributed resources. Moreover, in many privacy-sensitive applications, different, possibly multi-party, data sets collected at different sites must be processed in a distributed fashion without collecting everything at a single central site. However, most off-the-shelf data mining systems are designed to work as monolithic centralized applications. They normally download the relevant data to a centralized location and then perform the data mining operations. This centralized approach does not work well in many of the emerging distributed, ubiquitous, possibly privacy-sensitive data mining applications due to computation, communication, and storage constraints.

Distributed Data Mining (DDM) offers an alternate approach to this problem of mining data using distributed resources. DDM pays careful attention to the distributed resources of data, computing, communication, and human factors in order to use them in a near optimal fashion. Distributed data mining is gaining increasing attention in today's world for advanced data driven applications.

Peer-to-peer (P2P) systems are emerging as a solution of choice for a new breed of applications such as file sharing, collaborative movie and song scoring, electronic commerce, Internet chat, and webcast surveillance using sensor networks, to name a few. P2P systems can be viewed as an extreme case of DDM applications: they are huge (of the order of tens of thousands or millions of nodes in many cases), there is no coordinator node, the connections among the nodes are ad hoc, node and link failures are unpredictable, and the data changes with time (a streaming scenario). Traditional DDM algorithms often cannot be deployed in such environments, mainly because of scalability issues, the asynchronous nature of such environments, and streaming data.

This dissertation proposes a framework — DeFraLC (Deterministic Framework for Local Computing) — for distributed data mining in large P2P systems. The algorithmic framework proposed is deterministic and provably correct in the sense that the result produced by our framework is the same as would be produced if all the data were accumulated at a central site and an off-the-shelf centralized data mining algorithm were run. Since these systems typically consist of millions of nodes or peers and the system is constantly changing, it is very important to compute the results efficiently. Local computation, as we define it in this dissertation, guarantees an upper bound on the communication complexity at each node (independent of the network size), thereby providing high scalability of these algorithms. Each node contacts only a few of the other nodes in the network to generate a global model. The DeFraLC framework handles node and data changes with ease; the model adapts automatically whenever the system changes. We show the generic nature of DeFraLC by proposing a number of algorithms which are direct applications of DeFraLC.
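To make this flag-based locality idea concrete, consider the following toy simulation, a minimal sketch in Python (the names Peer, estimate, and simulate are invented for illustration; this is not the dissertation's PeGMA protocol, which appears in Chapter 3). Each peer keeps sufficient statistics of its own data, hears only from its immediate neighbors, and transmits only when its own alert decision changes; once every peer's decision is stable, no further messages flow, so per-peer communication stays bounded regardless of network size. The real algorithms add the exchange conditions that make each peer's final decision provably identical to the centralized computation, a guarantee this sketch does not provide.

import random

EPS = 5.0  # alert threshold on the global average (illustrative value)

class Peer:
    def __init__(self, pid, data):
        self.pid = pid
        self.own = (sum(data), len(data))   # local sufficient statistics
        self.heard = {}                     # neighbor pid -> (sum, count)
        self.sent_flag = None               # alert state last announced

    def estimate(self):
        """Knowledge: own data combined with whatever neighbors reported."""
        s = self.own[0] + sum(v[0] for v in self.heard.values())
        c = self.own[1] + sum(v[1] for v in self.heard.values())
        return s / c

    def flag(self):
        """The monitored predicate: does the average appear to exceed EPS?"""
        return self.estimate() >= EPS

def simulate(peers, edges, rounds=50):
    nbrs = {p.pid: [] for p in peers}
    for a, b in edges:
        nbrs[a].append(b)
        nbrs[b].append(a)
    byid = {p.pid: p for p in peers}
    for _ in range(rounds):
        quiet = True
        for p in peers:
            if p.flag() != p.sent_flag:      # decision changed: must speak
                p.sent_flag = p.flag()
                for n in nbrs[p.pid]:        # push own stats to neighbors only
                    byid[n].heard[p.pid] = p.own
                quiet = False
        if quiet:                            # nobody's decision changed
            break
    return [p.flag() for p in peers]

random.seed(0)
peers = [Peer(i, [random.gauss(6.0, 1.0) for _ in range(20)]) for i in range(6)]
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
print(simulate(peers, edges))  # every peer should raise the alert

On the six-peer line topology above, every peer settles on the same alert after a few rounds of neighbor-only messages, which is the locality behavior the framework formalizes.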


1.2 Motivation

Data mining has been defined as "the nontrivial extraction of implicit, previously unknown, and potentially useful information from data" [60] and "the science of extracting useful information from large data sets or databases" [73]. It involves searching through large amounts of data and picking out relevant information.

P2P networks are quickly emerging as huge information systems. Through networks such as Kazaa, e-Mule, BitTorrents and more, consumers can share vast amounts of data. While initial consumer interest in P2P networks was focused on the data itself, more recent research such as P2P web community formation argues that consumers will greatly benefit from the knowledge locked in the data [115] [37].

For instance, music recommendation and sharing systems are a thriving industry today [3] [152] — a sure sign of the value consumers have put on this application. However, all existing systems require that users submit their listening habits, either explicitly or implicitly, to centralized processing. Such centralized processing can be problematic because it can result in a severe performance bottleneck. Several distributed association rule mining algorithms (e.g., the one by Wolff et al. [196]) have shown that centralized processing may not be a necessity, by describing algorithms which compute association rules (and hence, recommendations) in-network; processing the data in-network makes these algorithms very efficient and scalable. Moreover, Mierswa et al. [123] have demonstrated the collaborative use of features for organizing music collections in a P2P setting.

Another application which offers high value to consumers is failure determination [145] [206]. In failure determination, computer-log data which may relate to software failures is collected, and this data is later analyzed in an effort to determine the reason for the failure. Data collection systems are today integral to both the Windows and Linux operating systems. Analysis is performed off-line at a central site and often uses knowledge discovery methods. Still, home users often choose not to cooperate with current data collection systems because they fear for their privacy, and currently there is no immediate benefit to the user for participating in the system. Collaborative data mining for failure determination can be very useful in such scenarios. Such a privacy preserving P2P misconfiguration diagnosis technique has been proposed by Huang et al. [82].

Consider a P2P system such as Napster [134] or Gnutella [68]. In most of these large scale systems, users download music files or other files based on certain preferences. There are several interesting questions that can be asked in such environments. For example, does the type of music downloaded change over time? Or, is there a particular type of file or genre of music preferred by users at a certain time of the year (e.g., during Christmas)? Can users be clustered based on their preferences? Answers to many such questions can be crucial from the system's point of view — the resources necessary can change over time and it makes sense to reinvest the unused resources in a better way.

However, it is not always possible to centralize the data and build models of it. The bandwidth required to centralize the data may be too expensive. The storage required at the central repository may also be unacceptable. Above all, if the data is non-stationary, it may so happen that the rate at which the data changes is faster than the rate at which data can be propagated through the network, making distributed monitoring algorithms the only alternative. Moreover, traditional client-server based DDM algorithms do not scale to millions of peers and hence also fail to address the issues in P2P applications. This motivates us to develop the DeFraLC framework.

1.3 Problem Statement

This dissertation addresses the following problem. Consider a large P2P network consisting of millions of nodes or peers connected via an underlying communication infrastructure. Each node has some data which is known only to itself. We call this the local data of the node. Each node can exchange messages with its neighbors. The system is constantly changing, both with respect to the nodes and the data contained at each node. This research aims at answering the question: "How can data mining techniques or algorithms be developed or applied that can extract useful knowledge from the union of all data of all peers in these networks under the following constraints: (1) low communication overhead, (2) no synchronization, (3) failure resistance in case of nodes departing or joining, and (4) quality of results comparable to centralized mining techniques?"

1.4 Contributions

In this dissertation we have systematically studied the area of deterministic local algorithms for data mining in P2P environments. A generic framework — DeFraLC (Deterministic Framework for Local Computing) — has been proposed for developing many deterministic local algorithms. The strength yet simplicity of the framework is demonstrated by developing a number of algorithms based on it: L2 thresholding, mean monitoring, k-means monitoring, eigenvector monitoring, multivariate regression and decision tree induction. The specific contributions of this dissertation are highlighted next.

1. We have identified a common principle which can be used to develop highly complex, efficient, exact (deterministic) local algorithms, proving a main theorem from which a large number of deterministic local algorithms can be developed.

2. Based on the above theorem, we have developed the DeFraLC framework. The Peer-to-Peer Generic Monitoring Algorithm (PeGMA) proposed in DeFraLC can be used to compute arbitrarily complex functions of the average of the data in a distributed system. It can also be extended to compute functions on other linear combinations of data, including weighted averages of selections from the data.

3. Third, DeFraLC can be used for monitoring, and consequent reactive updating, of any model of horizontally distributed data. Consequently, we have shown how the DeFraLC framework can be applied to track the average of distributed data, the first eigenvector of that data, the k-means clustering of that data, and the multivariate regression of the data.

4. Finally, we have shown how a complex data mining algorithm such as decision tree induction can be developed for P2P systems which is robust to data and network changes, suffers low communication overhead, offers high scalability and is asynchronous. In the process we have demonstrated how complex functions such as gini or entropy information gain need to be replaced by simpler ones such as misclassification gain to aid in the tree building process.

5. Lastly, we have implemented all of the proposed algorithms in a Distributed Data Mining Toolkit (DDMT) [44], developed by the DIADIC Research Lab at UMBC.

1.5 Dissertation Organization

This dissertation is organized as follows.

Chapter 1: This chapter introduces the domain of research, motivates the problem, presents the problem statement and contributions, and states the organization of the dissertation.

Chapter 2: This chapter presents a background study of the different types of P2P networks and the characteristics of P2P data mining algorithms. It then discusses the existing algorithms and techniques for distributed data mining, P2P data mining, distributed data stream mining, sensor networks, GRID systems, P2P information retrieval and multi-agent systems. Finally, this chapter discusses some applications of P2P networks.

Chapter 3: This chapter presents the DeFraLC framework (Deterministic Framework for Local Computing). The Peer-to-Peer Generic Monitoring Algorithm (PeGMA), the main component of DeFraLC, is capable of computing complex functions defined on the average of the global data. Further, this chapter shows how the generic algorithm can be used for L2 norm thresholding, mean monitoring, k-means monitoring, eigenvector monitoring and multivariate regression in large P2P settings.

Chapter 4: This chapter presents an efficient P2P decision tree induction algorithm which is asynchronous and robust to data and network changes. It shows (1) how the algorithm is efficient compared to brute force centralization of data statistics and (2) that the quality is comparable to such a centralized algorithm.

Chapter 5: This chapter concludes the dissertation and presents some prospective areas of future research.


Chapter 2

BACKGROUND AND RELATED WORK

2.1 Introduction

P2P systems such as Gnutella, BitTorrent, eMule, Kazaa, and Freenet are increasingly becoming popular for many applications that go beyond downloading music without paying for it. Examples include P2P systems for network storage, web caching, searching and indexing of relevant documents, and distributed network-threat analysis. The next generation of advanced P2P applications for bioinformatics (http://smweb.bcgsc.bc.ca/chinook/index.html), client-side web mining [115] and large-scale web search (http://www.yacy.net/) are likely to need support for advanced data analysis and mining. This has resulted in the development of several distributed data mining algorithms specifically focused on knowledge extraction from such large asynchronous networks.

Our goal in this chapter is to first familiarize the reader with the different types of P2P networks. We then follow the discussion with the characteristics of the algorithms suitable for P2P networks. Finally, we present some related work in the area of this research. For convenience, we have categorized the related work into the following areas. We start with the related work in DDM in Section 2.5, followed by the related work in P2P computing in Section 2.6. Since data stream mining in distributed environments has produced many algorithms similar to P2P data mining, we discuss these in Section 2.7. Sensors are often deployed in both military and civilian applications, and they form an ad hoc network. Although currently their communication and network infrastructure is very structured and hierarchical, the next generation of sensor network applications may have more decentralized control, as in typical P2P networks. Because of its proximity to this area of research, we summarize the related work on data mining in sensor networks in Section 2.8. We discuss the related work in P2P information retrieval in Section 2.9, followed by the related work on data mining in GRIDs in Section 2.10. Finally, we present some work in the area of multi-agent systems and distributed AI in Section 2.11. Before we end this chapter, we also describe some exciting application areas of the P2P paradigm of data mining and computing.

2.2 Client-Server Networks and Peer-to-Peer Networks

Advances in computing and communication over wired and wireless networks have resulted in many pervasive distributed computing environments, for example the Internet, intranets, local area networks, ad hoc wireless networks, sensor networks and peer-to-peer networks. Most of the initial effort was directed towards developing systems based on the client-server architecture, such as most modern web servers, file servers, mail servers, print servers, e-commerce applications and more. Figure 2.1 shows a system based on the client-server architecture. In the figure, the clients try to connect to a file server, application server, web server or database server. This architecture became popular mainly because of the simple synchronous nature of the communication between the server and the client. Since the data is stored mainly at the server, it is relatively easy to ensure security and privacy. However, as more and more of these systems evolved, several shortcomings were realized. With increased clients and more requests from each client, traffic congestion between the server and the clients became heavy, leading to reduced throughput. Moreover, in many cases several clients sat idle, leading to unbalanced load distribution. Another major drawback was the lack of robustness to node failures. Since in a typical client-server based distributed system the data is not replicated but rather kept at a single site, a failure of that node is devastating for the entire system.

FIG. 2.1. A client-server architecture showing three clients and three servers — a web server, an application server and a database server. Image source: http://www.sqapartners.com/images/webmatrix2 590.jpg.

To alleviate some of these issues, Peer-to-Peer networks were developed. The term "Peer-to-Peer" (P2P) refers to a class of systems and applications that employ distributed resources to perform critical functions in a decentralized manner [124]. P2P networks are typically used for connecting nodes via largely ad hoc connections, and hence all the nodes are equal in functionality. Each node acts as both server and client, and hence peers are often referred to as servents. With the pervasive deployment of computers, P2P technology is increasingly receiving attention in research, product development, and investment circles. Sharing static content files such as audio, video, data or anything in digital format is very common using P2P technology. Real-time data such as telephony traffic and streaming audio and TV channels (http://www.sopcast.org/) is also handled in a P2P manner.

FIG. 2.2. A P2P network of 100 nodes showing the ad-hoc structure.

There are several advantages of a P2P network over the client-server model. Due to the absence of any centralized node, there is no single point of failure. Moreover, all clients provide resources, including bandwidth, storage space, and computing power. Thus, as nodes arrive and demand on the system increases, the total capacity of the system also increases (because all nodes act as both server and client). This is not true of a client-server architecture with a fixed set of servers, in which adding more clients could mean slower data transfer for all users. Since this research aims at developing algorithms for data mining in P2P networks, the rest of this document will focus on this model of computing.

Depending on how the individual peers are connected, there are different types of P2P networks — we present an overview of the different types in the next section.


2.3 Types of Peer-to-Peer Networks

Several classifications exist for P2P networks. In this section we follow the one discussed by Lv et al. [117]. The authors classify the different architectures of P2P networks into three classes — (1) centralized, (2) decentralized but structured, and (3) decentralized and unstructured.

2.3.1 Centralized

In this type of P2P network, although there are several nodes or peers, the point of control or communication is a central server. In the early days of its advent, Napster [134] followed this type of P2P topology. All queries were handled through the central node, and hence this node needed to have an index of all the files in the network in order to answer a query. Naturally, the most serious disadvantage of this structure is the single point of failure. These structures are also not scalable, due to the dependence on the central node.

2.3.2 Decentralized but structured

In order to circumvent some of the problems associated with the earlier architecture, designers came up with a decentralized network structure. In this model of architecture, there is no concept of a central node. However, in many cases significant structure is imposed on the network. For example, in many applications there is an implicit assumption of an overlay tree structure in such networks [195]. More examples of such networks are provided by CHORD [175], CAN [158], PASTRY [160] and others. In these networks, the major idea is to use a Distributed Hash Table (DHT) to aid in the searching of content. The files to be searched are not placed at random nodes; rather, each file is hashed to a number and placed at the node corresponding to that number (node id).
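As an illustration of this hash-based placement rule, the following minimal sketch (our own, in the spirit of Chord-style consistent hashing; the class and names are hypothetical, not taken from any particular DHT) maps node ids and file keys into one circular id space and stores each file at the successor of its hash:

    # A minimal sketch of hash-based content placement in a structured P2P
    # network. Node ids and file keys are hashed into the same circular id
    # space; a file is stored at the first node whose id is >= the file's
    # hash (wrapping around). Real DHTs additionally maintain routing
    # tables (e.g. finger tables) to locate that node in O(log n) hops.
    import hashlib
    from bisect import bisect_left

    ID_BITS = 16  # small id space for readability

    def hash_id(key: str) -> int:
        return int(hashlib.sha1(key.encode()).hexdigest(), 16) % (2 ** ID_BITS)

    class SimpleDHT:
        def __init__(self, node_names):
            # each node's position on the id circle
            self.node_ids = sorted(hash_id(n) for n in node_names)

        def successor(self, key_id: int) -> int:
            # first node id >= key_id, wrapping around the circle
            i = bisect_left(self.node_ids, key_id)
            return self.node_ids[i % len(self.node_ids)]

        def locate(self, file_name: str) -> int:
            # the node responsible for storing/serving this file
            return self.successor(hash_id(file_name))

    dht = SimpleDHT(["peer-%d" % i for i in range(8)])
    print(dht.locate("song.mp3"))  # id of the node that should hold the file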


2.3.3 Decentralized and unstructured

These are systems in which there is neither a centralized directory nor any precise control over the network topology or file placement. Gnutella [68] is an example of such a design. The resultant topology has certain properties, but the placement of files is not based on any knowledge of the topology. To find a file, a node queries its neighbors. The most typical query method is flooding, where the query is propagated to all neighbors within a certain radius, as sketched below. These unstructured designs are extremely resilient to nodes entering and leaving the system. These networks are easy to maintain as well.
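A toy sketch of such TTL-bounded flooding (our illustration; the topology and names are hypothetical):

    # A toy sketch of TTL-bounded query flooding in an unstructured P2P
    # network. 'topology' maps each node to its neighbor list; a query is
    # propagated to all neighbors within 'ttl' hops of the originator.
    from collections import deque

    def flood_query(topology, origin, ttl):
        visited = {origin}
        frontier = deque([(origin, ttl)])
        reached = []
        while frontier:
            node, budget = frontier.popleft()
            reached.append(node)
            if budget == 0:
                continue  # hop budget exhausted; do not forward further
            for neighbor in topology[node]:
                if neighbor not in visited:  # avoid re-sending the query
                    visited.add(neighbor)
                    frontier.append((neighbor, budget - 1))
        return reached  # every node that saw the query

    topology = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
    print(flood_query(topology, origin=0, ttl=2))  # -> [0, 1, 2, 3]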


With such varied types of P2P networks, it is easy to see that not all algorithms are suitable for all P2P architectures. In the next section we present some desired characteristics of P2P data mining algorithms.

2.4 Characteristics of Peer-to-Peer Data Mining Algorithms

As pointed out in the last section, the computational environment of P2P systems is quite different from the ones for which traditional data mining algorithms are intended. As such, these environments demand a new breed of algorithms for efficient operation. Below is a list of operational characteristics desired of data mining algorithms developed for such networks.

1. Communication Efficiency — P2P data mining algorithms must be able to work in a communication efficient manner. Although many P2P systems are designed for sharing large data files (e.g. music, movies), a distributed data mining system that involves analyzing such data may not have the luxury of frequently exchanging large volumes of data among the nodes of the P2P network simply for data analysis. Ideally, it should not move any data at all (or at most share minimal data), and only exchange 'knowledge' about the data for mining purposes. Data mining in a P2P environment should be "light-weight": it should be able to perform distributed data analysis with minimal communication overhead.

2. Scalability — One of the most desired characteristics of any P2P algorithm is scalability. Since modern P2P systems can grow to millions of peers (Skype presently has around 50 million peers), the computational and communication (bandwidth) resource requirements of such algorithms should be independent of the size of the system. Communication efficient algorithms are, in general, expected to be scalable. Local algorithms, which we define in the next chapter, offer a new way to develop highly scalable P2P algorithms.

3. Anytimeness — In many real-world P2P systems, e.g. sensor networks, the data at the nodes often changes with time (a streaming scenario). Any algorithm that needs to begin computation from scratch whenever the data changes is thus inappropriate for our goals, since it would (1) consume too many resources for every data change, and (2) take a long time to produce the result. Typically, in a streaming scenario we want answers to the problems as and when required — incremental algorithms can provide the solutions quickly. Furthermore, since the rate of data change may be higher than the rate of computation in some applications, the algorithm should be able to report a partial, ad hoc solution at any time.

4. Asynchronism — Due to the massive scale of P2P systems and their volatile nature, P2P algorithms are desired to be asynchronous. Synchronized algorithms are likely to fail due to connection latency, limited bandwidth or node failures.

5. Decentralization — Although some P2P systems still use central servers for various purposes, ideally any algorithm designed for peer-to-peer systems should be able to run in the absence of any coordinator (server or router) and calculate the result in-network, rather than collect the data at a single peer. This immediately points out the advantages — (1) there is no traffic bottleneck at the central node, and (2) there is no single point of failure.

6. Fault-tolerance — In P2P systems, it is not unusual for several peers to leave and join the system at any given moment. Thus, the system should be able to recover from the failure of peers and the subsequent loss of data. In other words, the algorithm must be 'robust' to network and data changes.

7. Privacy and Security — Since P2P networks consist of heterogeneous entities working in a collaborative fashion, privacy is an important issue. As in any other large distributed system, security is also an important issue in a P2P data mining system.

In many cases, developing algorithms that satisfy all of the above requirements is quite difficult. In this dissertation, therefore, we will focus on certain characteristics of the P2P algorithms.

In the next few sections we present some literature related to our area of research. Since our research relates to many different areas of study, we have divided this section into the following subsections. The first section (Section 2.5) discusses the related work in distributed data mining. Section 2.6 discusses the related work in P2P data mining. The next section (Section 2.7) presents the related work in distributed data stream mining, followed by the related work on data mining in sensor networks in Section 2.8. Section 2.10 offers some details in the area of data mining in Grids. Finally, we present some work in the area of multi-agent systems and distributed AI in Section 2.11.

2.5 Related work: Distributed Data Mining

In this section we present a brief overview of the existing work on Distributed Data Mining (DDM) in general. DDM, as the name suggests, deals with the problem of data analysis in environments with distributed data, computing nodes, and users. This area has seen a considerable amount of research during the last decade.


For an introduction to the area, interested readers are referred to the books by Kargupta et al. [93][94]. A related area of research is system and algorithm development for distributed systems. The book by Ghosh [65] presents a detailed study of the different types of algorithms developed for distributed systems.

Depending on how the data is distributed across the various sites, a natural way to categorize DDM algorithms is as follows:

• Horizontally partitioned — in which all the features are present at all the sites, but the samples/data tuples are distributed across the sites.

• Vertically partitioned — in which the features are distributed across the sites (possibly overlapping or disjoint).

DDM algorithms have been proposed for many popular data analysis tasks. In the next two subsections, we present a brief sampling from each major category. We focus primarily on distributed clustering and classification, since this thesis proposes algorithms for doing the same in a P2P domain. Interested readers are referred to the books by Kargupta et al. [93][94], the distributed data mining bibliography maintained at UMBC [43] and several surveys [201][202] for more in-depth discussions.

2.5.1 Distributed Classification

Ensemble-based classifier learning is well suited to the distributed data mining framework. In ensemble techniques such as bagging [21], boosting [62] and random forests [22], several weak classifiers are induced from different partitions of the data and then combined, using voting techniques or otherwise, to produce the output. These techniques can be adopted for distributed learning by inducing a weak classifier at each distributed data site and then combining them at a central location, as sketched below. In the literature this is popularly known as the meta-learning framework [31][30].
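In its simplest form the combination step is majority voting over the locally induced models; a minimal sketch (ours, with toy threshold "models" standing in for the site classifiers):

    # A minimal sketch of combining locally trained weak classifiers by
    # majority voting: each distributed site induces its own model, and a
    # new sample is labeled with the class most of the local models predict.
    from collections import Counter

    def vote(local_classifiers, sample):
        # each site's weak classifier casts one vote for a class label
        ballots = Counter(clf(sample) for clf in local_classifiers)
        return ballots.most_common(1)[0][0]  # majority class wins

    # toy usage: three sites whose weak "models" are simple threshold rules
    site_models = [lambda x: int(x > 0.3), lambda x: int(x > 0.5), lambda x: int(x > 0.8)]
    print(vote(site_models, 0.6))  # sites vote 1, 1, 0 -> combined prediction 1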


Several strategies for combining the classifiers have been proposed, such as voting, arbitration and combiner. The meta-learning framework for homogeneous datasets is implemented as part of the JAM system [176].

The problem of learning from heterogeneously partitioned data has been addressed by several researchers. Park and his colleagues [148] have developed algorithms for learning from heterogeneous datasets using an evolutionary technique. Their work first builds local classifiers and then identifies a selection of tuples that none of the local classifiers can correctly classify. These tuples are centralized, and a new classifier is built on them. The classification of a new tuple is then based on either the collection of local classifiers or the centralized one.

Caragea et al. [29] presented a decision tree induction algorithm for both distributed homogeneous and heterogeneous environments. Noting that the crux of any decision tree algorithm is the use of an effective splitting criterion, the authors propose a method by which this criterion can be evaluated in a distributed fashion. More specifically, the paper shows that by centralizing only summary statistics from each site (e.g., counts of instances that satisfy specific constraints on the values of the attributes) to one location, there can be huge savings in communication compared to brute force centralization. Moreover, the distributed decision tree induced is the same as in a centralized scenario. Their system is available as part of the INDUS system.

A different approach was taken by Giannella et al. [66] and Olsen [141]. They used Gini information gain as the impurity measure and showed that the Gini index between two attributes can be formulated as a dot product between two binary vectors. To cut down the communication complexity, the authors evaluated the dot product after projecting the vectors into a random subspace (see the sketch below). Instead of sending either the raw data or the large binary vectors, the distributed sites communicate only these projected low-dimensional vectors. The paper shows that, using only 20% of the communication cost necessary to centralize the data, they can build trees which are at least 80% as accurate as the trees produced by centralization.
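The communication saving rests on a standard random-projection estimate of the dot product; the sketch below illustrates the primitive (our construction with assumed dimensions, not the exact scheme of [66][141]):

    # A minimal sketch of estimating a dot product after a random projection.
    # Each site projects its long binary vector into k << d dimensions with
    # a shared random +/-1 matrix and ships only the k-dimensional vector;
    # the dot product of the projections is an unbiased estimate of the
    # dot product of the originals.
    import numpy as np

    def project(vec, k, seed=42):
        d = len(vec)
        rng = np.random.default_rng(seed)         # shared seed -> same matrix at every site
        R = rng.choice([-1.0, 1.0], size=(k, d))  # random +/-1 projection matrix
        return (R @ vec) / np.sqrt(k)

    d, k = 10000, 200
    rng = np.random.default_rng(0)
    a = rng.integers(0, 2, d).astype(float)  # site 1's binary vector
    b = rng.integers(0, 2, d).astype(float)  # site 2's binary vector
    est = project(a, k) @ project(b, k)      # computed from k numbers per site
    print(est, a @ b)                        # estimate vs. exact dot product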


An order statistics based approach to combining classifier ensembles generated from heterogeneous data has been proposed by Tumer and Ghosh [189].

The collective data mining framework (CDM) by Kargupta et al. [92] proposes algorithms for data mining from heterogeneous data sites using orthogonal basis functions. The basic idea is to represent the model to be built using an orthonormal basis set such as Fourier coefficients. These coefficients are evaluated at each local site and transferred to a central location. Coefficients involving cross terms between the sites are evaluated at the central site after centralizing a sample of the data. Several data mining algorithms have been proposed using this technique: Bayes nets from distributed heterogeneous data [33], decision trees [96], distributed multivariate regression [78], and distributed principal component analysis (PCA) and data clustering [95] are some of the algorithms developed within the CDM framework.

2.5.2 Distributed Clustering

Several techniques have been proposed in the literature for distributed clustering of data. Kargupta et al. [95] have developed a principal component analysis (PCA) based clustering technique on the CDM framework for heterogeneously distributed data. Each local site performs PCA, projects the local data along the principal components, and applies a known clustering algorithm. The communication complexity of such a technique is much smaller than centralizing all the data.

Klusch et al. [102] considered the problem of distributed clustering over homogeneously distributed data using kernel density estimation. They use a local definition of a density-based cluster, and points which have the same density (according to some kernel function) are put in the same cluster. They build local clusters of the data and transmit these to a central site. The central site then combines these clusters.


k-means clustering of data is a popular data mining technique, and distributed versions of the k-means clustering algorithm have been proposed by various researchers to date. Eisenhardt et al. [53] proposed a distributed method for document clustering using k-means. Their technique is enhanced with a "probe and echo" mechanism for updating the cluster centroids. The algorithm is synchronized, and each round corresponds to one k-means iteration. There is one designated initiator node. It launches a probe message to all its neighbors and sends its local centroids and weights. Whenever a node P_j receives a probe message from its neighbor P_i, it first updates its current centroids with the ones received from P_i and then forwards the probe to all its neighbors (except P_i). When a node has received a probe or echo from everyone, it forwards the echo to the node from which it first received the probe. This process continues until a termination criterion stops the iterations.
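The centroid-update step of such a round can be viewed as a weighted merge of centroid sets; the sketch below is our simplification (the actual protocol also handles the probe/echo bookkeeping described above):

    # A minimal sketch of the centroid update a peer performs when a probe
    # from a neighbor arrives: each of the k centroids is replaced by the
    # weighted average of the local centroid and the received one, and the
    # weights (cluster point counts) are accumulated.
    def merge_centroids(my_centroids, my_weights, rcv_centroids, rcv_weights):
        merged_c, merged_w = [], []
        for c1, w1, c2, w2 in zip(my_centroids, my_weights, rcv_centroids, rcv_weights):
            w = w1 + w2
            c = [(w1 * a + w2 * b) / w for a, b in zip(c1, c2)]  # weighted mean
            merged_c.append(c)
            merged_w.append(w)
        return merged_c, merged_w

    # toy usage: k = 2 clusters with 2-D centroids
    mine = ([[0.0, 0.0], [10.0, 10.0]], [4, 2])
    theirs = ([[2.0, 2.0], [8.0, 8.0]], [4, 6])
    print(merge_centroids(*mine, *theirs))  # -> ([[1.0, 1.0], [8.5, 8.5]], [8, 8])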


Based on this technique, several extensions have been proposed. Bandyopadhyay et al. [7] have proposed an algorithm for k-means clustering based on a sampling technique. To relax the synchronization assumptions, Datta et al. [40] later proposed an asynchronous version of the k-means algorithm suitable for P2P networks. Dhillon and Modha [46] have developed a parallel implementation of the k-means algorithm for homogeneously partitioned datasets. A similar technique for k-harmonic means has been proposed by Forman and Zhang [58].

A different approach to distributed clustering is to generate local models and then combine them at a central site. These techniques (1) are asynchronous, (2) offer good communication complexity compared to centralization of the data, and (3) offer privacy-preserving solutions in many domains.

Johnson and Kargupta [87] have proposed a hierarchical clustering algorithm for heterogeneously distributed data. The idea is to generate local dendrograms and then combine them at a central site.

Lazarevic et al. considered the problem of combining spatial clusterings to produce a global regression-based classifier on homogeneously distributed data.

Several other techniques for distributed heterogeneous data clustering have been proposed using an ensemble approach, such as [179] and [61].

2.6 Related work: Data Mining in Peer-to-Peer Environments

P2P data mining has very recently emerged as a subfield of DDM, specifically focusing on algorithms which satisfy the properties mentioned in Section 2.4. Due to its nascency, this area of research has little prior work. Datta et al. [39] present an overview of this topic.

In the next few sections we present an overview of the different types of P2P data mining algorithms. Table 2.1 shows a taxonomy of the different P2P data mining algorithms.

P2P Algorithms
    Approximate Algorithms
        Probabilistic Algorithms
            Sampling Algorithms
            Gossip-based Algorithms
        Deterministic Algorithms
    Exact Algorithms
        Flooding Algorithms
        Convergecast Algorithms
        Local Algorithms
        Cluster-based Algorithms

Table 2.1. Categorization of the different P2P data mining algorithms.

2.6.1 Approximate Algorithms

Approximate algorithms, as the name suggests, compute approximate data mining results. The approximation can be probabilistic (Section 2.6.1) or deterministic (Section 2.6.1).


Probabilistic Approximate Algorithms: Probabilistic algorithms rely on results from statistical sampling theory to derive bounds on the quality of the results.

1. Sampling-based algorithms: In sampling-based algorithms, peers sample data (using some variation of a graph random walk, as proposed by Datta et al. [41]) from their own partition and those of several neighbors, and then build a model assuming that this data is representative of that of the entire set of peers. Thus, the algorithms claim nothing about the accuracy of the resulting model in the general case. Examples of these algorithms include the P2P k-means algorithm by Bandyopadhyay et al. [7], the naive Bayes algorithm by Kowalczyk et al. [104], and more. Bandyopadhyay et al. [7] describe a distributed k-means algorithm designed for a horizontally partitioned scenario, such that only the centroids of the local sites need to be communicated. The paper presents a theoretical proof of quality and convergence (in some restricted cases, e.g. when the data is sampled uniformly from the network). This work was one of the first efforts at engineering a well-known data mining technique to work in a P2P domain. Datta et al. [40] later enhanced this algorithm to work in an asynchronous environment; however, it still lacks a theoretical proof of quality and convergence.

More recently, Das et al. have developed an algorithm [37] for identifying significant inner product entries in a P2P network in a horizontally partitioned data distribution scenario. The inner product is an important primitive in data mining, useful for many techniques such as distance computation, clustering, classification and more. The proposed algorithm uses a variant of the Metropolis-Hastings random walk [121] to draw random samples from the network and, using results from ordinal decision theory, bounds the quality of the result and the communication complexity. Extending this further, Liu et al. [115] proposed an algorithm for community formation in P2P environments using client-side browser history and inner product computation. The latter paper also discusses preserving the users' privacy using SMC protocols.
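To give the flavor of such a walk, the sketch below (our illustration, not the exact algorithm of [37]) uses the standard Metropolis-Hastings correction so that a random walk on an irregular overlay graph converges to the uniform distribution over peers:

    # A minimal sketch of a Metropolis-Hastings random walk on a P2P
    # overlay graph. The acceptance rule min(1, deg(i)/deg(j)) corrects
    # for irregular node degrees so that, in the limit, every peer is
    # visited with equal probability -- which is what uniform sampling
    # from the network needs.
    import random

    def mh_random_walk(topology, start, steps):
        node = start
        for _ in range(steps):
            candidate = random.choice(topology[node])
            accept_prob = min(1.0, len(topology[node]) / len(topology[candidate]))
            if random.random() < accept_prob:
                node = candidate  # move; otherwise stay (self-loop)
        return node  # approximately a uniform sample after enough steps

    topology = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
    sample = mh_random_walk(topology, start=0, steps=100)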


2. Gossip-based algorithms: The second category, gossip-based algorithms, relies on properties of random walks on graphs to provide estimates of various statistics of the data stored in the graph. Gossip-based computation was first introduced by Kempe et al. [99], who showed that each peer, by contacting a small number of nodes chosen at random, can obtain the result of the computation exponentially fast. Boyd et al. [18] enhanced the protocol for general graphs. The most important quality of gossip-based algorithms is that they provide probabilistic guarantees on the accuracy of the result. However, the first gossip-based algorithms required that the algorithm be executed from scratch whenever the data changed, in order to maintain those guarantees. This problem was later addressed by Mehyar et al. [119], and by Jelasity et al. [85]. Mehyar et al. [119] propose a graph Laplacian-based approach to compute the average of a set of points in a P2P network. Each peer has a number, say x_i, and an estimate of the global average, z_i. The paper shows that the rate at which z_i converges to (1/n) ∑_{i=1}^n x_i is exponential, thereby achieving fast convergence.
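To give a concrete feel for gossip averaging, the sketch below implements generic randomized pairwise averaging (in the spirit of [18], not the Laplacian scheme of [119]); every estimate z_i converges to the global average:

    # A minimal sketch of randomized pairwise gossip averaging. At every
    # step a random edge (i, j) is chosen and both endpoints replace their
    # estimates with the pairwise mean. The sum of all estimates is
    # invariant, so every estimate converges to the global average.
    import random

    def gossip_average(topology, values, rounds):
        z = dict(values)  # each peer's current estimate, initialized to x_i
        edges = [(i, j) for i in topology for j in topology[i] if i < j]
        for _ in range(rounds):
            i, j = random.choice(edges)
            z[i] = z[j] = (z[i] + z[j]) / 2.0  # local, pairwise averaging step
        return z

    topology = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
    x = {0: 4.0, 1: 8.0, 2: 0.0, 3: 4.0}          # true average is 4.0
    print(gossip_average(topology, x, rounds=200))  # all estimates near 4.0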


Deterministic Approximate Algorithms: The probabilistic techniques discussed in the previous sections use randomized algorithms, and hence their output depends on the input and on some additional random variables. As such, these algorithms can only guarantee the quality of the results on average, with a bounded variance; a single trial can yield a very different answer. To counter these problems, researchers have proposed a new genre of approximation, which we call here the deterministic approximation technique. In such techniques, the output depends only on the input. The results, although approximate, hold true for every single trial. This makes them very attractive for many real-life problems, including data mining in distributed environments.

A popular method of performing deterministic approximation is the method of variational approximation, as proposed by Jordan et al. [89] and by Jaakkola [83]. The variational approximation technique consists of three steps:

1. Pose the original problem as an optimization problem. Define an objective function for it such that the optimal solution of this objective function gives the exact solution of the original problem.

2. Develop a search strategy to search in this transformed space.

3. Since for many problems this search can be intractable, apply approximations that make the search computationally feasible. Note that it is this step that makes the variational technique (1) feasible and (2) an approximate rather than an exact technique.

Several approaches exist in the literature based on the variational approximation technique. Jordan and Jaakkola have proposed solutions for problems such as inferencing, parameter estimation using the EM algorithm, Gaussian mixture models and more. Bayesian parameter learning using the variational technique has been proposed by Murphy [132]. While most of these techniques work in centralized scenarios, Mukherjee [130] has proposed distributed algorithms for variational approximation. His distributed inferencing (VIDE) framework [129] aims to solve the problem of target tracking in wireless sensor networks. Further, he proposed solutions for distributed linear regression using the same variational framework [131].

2.6.2 Exact Algorithms

Exact algorithms form an exciting paradigm of computation whereby the result generated by the distributed algorithm is exactly the same as if all the peers had been given all the data. Thus, contrary to approximate techniques, these algorithms produce the correct result every time they are executed. Depending on how they compute the correct result, exact algorithms can be classified as convergecast-based algorithms, flooding algorithms, local algorithms, and cluster-based algorithms.


Flooding Algorithms    Flooding algorithms flood whatever data is available at each node. This is very expensive, since it communicates all of the data. It is trivial to develop algorithms for certain functions which flood all the data and guarantee correctness of the results; in most interesting cases, we strive to develop algorithms which can do better than this naive approach.

Convergecast Algorithms    The second category, convergecast algorithms, is also extremely simple. In such algorithms the computation takes place over a spanning tree. For any peer, computing the network sum, for example, amounts to receiving the data of all its children, adding its own value, and propagating the result to its parent. Algorithms such as [156] provide generic solutions — suitable for the computation of multiple functions. They are also extremely communication efficient: computing the average, for instance, requires only one message from each peer. The problem with these algorithms is that they are extremely synchronized, with every round of computation taking a lot of time. This becomes very problematic when the data is dynamic and the computation has to be iterated frequently due to data changes.
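A minimal sketch of convergecast aggregation (ours; in a real network each step is a single message from a child to its parent rather than a recursive call):

    # A minimal sketch of convergecast aggregation over a spanning tree.
    # Each peer waits for the partial sums of its children, adds its own
    # value, and sends one message up to its parent -- so a global sum
    # (or average, via (sum, count) pairs) costs one message per peer.
    def convergecast_sum(tree, values, node):
        # 'tree' maps each node to its list of children; 'node' is the root
        total, count = values[node], 1
        for child in tree.get(node, []):
            child_sum, child_count = convergecast_sum(tree, values, child)
            total += child_sum
            count += child_count
        return total, count  # partial aggregate forwarded to the parent

    tree = {0: [1, 2], 1: [3, 4]}  # spanning tree rooted at peer 0
    values = {0: 2.0, 1: 4.0, 2: 6.0, 3: 8.0, 4: 0.0}
    s, n = convergecast_sum(tree, values, 0)
    print(s / n)  # global average computed in-network -> 4.0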


Local Algorithms    Local algorithms are a class of highly efficient algorithms developed for P2P networks. They are data-dependent distributed algorithms. In a distributed setup, data dependency means that under certain conditions peers can cease to communicate with one another, and the algorithm terminates with the same result as in a centralized scenario. These conditions can occur after a peer has collected the statistics of just a few other peers. In such cases, the overhead of every peer becomes independent of the size of the network — either a small constant or a slowly growing polynomial in the network size. Furthermore, these data-dependent conditions can be rechecked every time the data or the system changes. If the change is stationary (i.e., the result of the computation remains the same), then, very often, no communication is needed. This feature makes local algorithms exceptionally suitable for P2P networks as well as for wireless sensor networks.

The idea of using local rules in algorithms dates back decades. John Holland described such rules for non-linear adaptive systems and genetic algorithms in his seminal work on biological systems [80]. Local evolutionary rules for grid-based cellular automata were first introduced in the 1950s by John von Neumann [139] and later adopted in many fields such as artificial agents, VLSI testing and physical simulations, to mention a few. In the context of graph theory, local algorithms were used in the early nineties. Linial [112] and later Afek et al. [1] discussed local algorithms in the context of graph coloring problems. Naor and Stockmeyer [133] asked what properties of a graph can be computed in constant time, independent of the graph size. Kutten and Peleg [108] introduced local algorithms for fault detection in which the cost depends only on the unknown number of faults and not on the entire graph size; they developed solutions for some key problems such as the maximal independent set (MIS) and graph coloring. Kuhn et al. [106] have shown that some properties of graphs cannot be computed locally.

Local algorithms for P2P data mining include the majority voting and association rule mining protocol developed by Wolff and Schuster [196]. This algorithm, and the ones discussed in this section, guarantee eventual correctness — when the computation terminates, each node has computed the correct result compared to a centralized setting. In its most simple form, the majority rule protocol deals with the following computation: suppose each node has two real numbers x_i and y_i, and the goal is to find out whether ∑_{i=1}^n x_i ≥ ∑_{i=1}^n y_i, where there are n peers in the network. This protocol is eventually correct, fault-tolerant and robust to data and network changes. It is efficient as well, since in the typical dynamic setup it requires far fewer resources than broadcast. Based on its variants, researchers have further proposed more complicated algorithms such as facility location [105], outlier detection [19], and meta-classification [116].


More recently, Bhaduri and Kargupta [13] have proposed a local algorithm for deterministic multivariate regression in P2P networks. The idea is to use a two-step approach: a local algorithm is used to track the fitness of the current regression model to the data, and if the data has changed such that the model no longer fits it, a feedback loop is triggered and the model is rebuilt. Experimental results show the accuracy and low monitoring cost of the proposed technique.

All the algorithms described above require the existence of a communication tree to avoid duplicate accounting of data. Some work has shown that they can be transformed to work on general networks as well [14]. Note that, as shown in [14], such a tree can be efficiently constructed and maintained using variations of Bellman-Ford algorithms [57][84]. Generic local algorithms have also been proposed for a wide variety of data mining problems such as L2-norm thresholding, k-means monitoring and more [195].

Although several local algorithms have been developed, a theoretical characterization of local algorithm complexity is still an open issue. Birk et al. [14] have proposed a new metric, the Veracity Radius (VR), to capture the locality of local algorithms. This radius counts the number of hops necessary in order to get the correct result from a local algorithm. However, the framework works only for simple aggregate problems such as majority voting.

While most of the work on local algorithms focuses on developing algorithms which compute models based on all the peers' data, Ping et al. [116] have proposed a different method. In that work, a meta-classification algorithm is described in which every peer computes a weak classifier on its own data. The weak classifiers are then merged into a meta-classifier by computing — per new sample — the majority of the outcomes of the weak classifiers. The computation of the weak classifiers requires no communication overhead at all, and the majority is computed using the majority voting protocol [196].


Another work which should be mentioned in the context of local algorithms is the WildFire algorithm by Bawa et al. [11]. WildFire too makes use of local techniques to prune messages. It further suggests an algorithm for the problem of Means Monitoring (computation of a Sum or Average); however, the algorithm presented there is very costly for dynamic data.

Cluster-based Algorithms    There is an intrinsic relation between local algorithms and communication-efficient algorithms for large clusters. While the latter algorithms rely on broadcasting [164][168] or on hierarchical communication patterns [8], they still embody the idea that not all of the processors need to send their entire data for the global result to be computed accurately. For instance, the association rule mining algorithms for clusters [164] and for P2P networks [196] both rely on efficient majority votes. The algorithm presented by Sharfman et al. [168] generalizes [164], but still relies on broadcast as the communication model. Additionally, the notions of data and network changes throughout the algorithm's execution are missing from the work on clusters, possibly because the smaller scale of those systems permits ignoring, or recovering from, rare failures.

2.7 Related work: Distributed Data Stream Mining

There exists a plethora of work in the area of distributed data stream mining. Not only have the distributed data mining and databases communities contributed to the literature; a bulk of the work also comes from the wireless and sensor networks community. In this section we discuss some of the related papers, with pointers for further reading.

The computation of complex functions over the union of multiple streams has been studied widely in the stream mining literature. Gibbons et al. [67] present the idea of doing coordinated sampling in order to compute simple functions such as the total number of ones in the union of two binary streams. They have developed a new strategy to sample from the two streams and have shown that it can reduce the space requirement for such a computation from Ω(n) to log(n), where n is the size of the stream. Their technique can easily be extended to the scenario where there are more than two streams. The authors also point out that the method works even if the streams are non-binary (with no change in space complexity).


Much work has been done in the area of query processing over distributed data streams. Chen et al. [32] have developed a system, NiagaraCQ, which allows answering continuous queries in large scale systems such as the Internet. In such systems many of the queries are similar, so considerable computation, communication and I/O resources can be saved by properly grouping similar queries; NiagaraCQ achieves this goal. Their grouping scheme is incremental, and they use an adaptive regrouping scheme in order to find the optimal match between a new query and the group in which the query should be placed. If no group matches, a new query group is formed with this query. The paper does not address the reassignment of existing queries into newly formed groups, leaving this as future work.

A different approach has been described by Olston et al. [142]. The distributed model described there has nodes sending streaming data to a central node which is responsible for answering the queries. The network links near the central node become a bottleneck as soon as the arrival rate of data becomes too high. In order to avoid that, the authors propose installing filters which restrict the data transfer rate from the individual nodes. A node O installs a filter of width W_O and range [L_O, H_O], where W_O is centered around the most recent value of the object V (L_O = V − W_O/2 and H_O = V + W_O/2). The node does not send updates as long as V stays inside the range L_O ≤ V ≤ H_O; otherwise it sends an update to the central node and recenters the bounds L_O and H_O. This technique answers queries approximately and works in circumstances where we do not require exact answers. Since in many cases the user can specify the query precision that is necessary, the filters can be made to work after setting the bounds based on this user input.
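A minimal sketch of such a bound filter (our rendering of the idea; the class and variable names are ours):

    # A minimal sketch of an Olston-style bound filter at a source node.
    # Updates are suppressed while the observed value stays inside
    # [L, H] = [V - W/2, V + W/2]; when the value escapes the range, an
    # update is sent to the central node and the bounds are recentered.
    class BoundFilter:
        def __init__(self, width, initial_value):
            self.width = width
            self.recenter(initial_value)

        def recenter(self, value):
            self.low = value - self.width / 2.0
            self.high = value + self.width / 2.0

        def observe(self, value):
            if self.low <= value <= self.high:
                return None          # suppressed: central copy is close enough
            self.recenter(value)     # escaped the range: recenter around new value
            return value             # ...and ship the update to the central node

    f = BoundFilter(width=4.0, initial_value=10.0)
    for v in [11.0, 9.5, 12.5, 13.0, 20.0]:
        print(v, "->", f.observe(v))  # only 12.5 and 20.0 trigger updates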


The sensor network community provides a rich literature on data stream mining algorithms. Since, in many applications, the sensors are deployed in hostile terrains, one of the most fundamental tasks is to develop a general framework for monitoring the networks themselves. Such an idea has been presented in [205]. This paper presents a general framework and shows how decomposable functions like min, max, average, count and sum can be computed over such an architecture. The architecture is highlighted by three tools that the authors call digests, scans and dumps. Digests are network parameters (e.g. a count of the number of nodes) that are computed either continuously, periodically or in the event of a trigger. Scans are invoked when the digests report a problem (e.g. a sudden drop in the number of nodes) to find out the energy level throughout the network. These two steps can guide a network administrator towards the location of the fault, which can then be debugged using the dumps (dumping all the data of one or a few sensors). Furthermore, this paper discusses the distributed computation of some aggregate functions (mean, max, count etc.). Since all these functions are decomposable, the advantage lies in the in-network aggregation of partial results up a tree: a leaf does not need to send all its data to the root, and in this way vital savings can be made in terms of communication. The major concern, though, is maintaining this tree structure in such a dynamic environment. This technique would also fail for numerous non-decomposable functions, e.g. median, quantile etc.


The above algorithm describes a way of monitoring the status of the sensor network itself; however, there are many data mining problems that need to be addressed in the sensor network scenario. One such algorithm, for multi-target classification in sensor networks, has been developed by Kotecha et al. [103]. Each node makes local decisions and these decisions are forwarded to a single node which acts as the manager node. The maximum number of targets is known a priori, although the exact number of targets is not known in advance. Nodes that are sufficiently far apart are expected to provide independent feature vectors for the same target, which can strengthen the global decision making. Moreover, for an optimal classifier, the number of decisions increases exponentially with the number of targets. Hence the authors propose the use of sub-optimal linear classifiers. Through real life experiments they show that their suboptimal classifiers perform as well as the optimal classifier under mild assumptions. This makes such a scheme attractive for low power, low bandwidth environments.

Frequent items mining in distributed streams is an active area of research, and many variants of the problem have been proposed in the literature; interested readers are referred to [118] for a description. Generally speaking, there are m streams S_1, S_2, ..., S_m. Each stream consists of items with time stamps: <d_i1, t_i1>, <d_i2, t_i2>, etc. Let S be the sequence-preserving union of all the streams, and let an item i ∈ S have a count count(i) (the count may be evaluated by an exponential decay weighting scheme). The task is to output an estimate ĉount(i) of count(i) for each item whose frequency exceeds a certain threshold. Each node maintains a precision threshold and outputs only those items exceeding the precision threshold. As two extreme cases, the threshold can be set very low (≈ 0) or very high (≈ 1). In the first case, all the intermediate nodes send everything without pruning, resulting in a message explosion at the root. In the second case, the intermediate nodes send very few items and hence no further pruning is possible at the intermediate nodes. So the precision selection problem is crucial for such an algorithm to produce meaningful results with low communication overhead. The paper presents a number of ways to select the precision values (which they call precision gradients) for different scenarios of load minimization.
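As an illustration, here is a small Python sketch of the two local building blocks described above: exponentially decayed counting and precision-threshold pruning. The half_life parameter and the function names are assumptions made for the example, not taken from [118].

```python
import math

def decayed_counts(stream, now, half_life):
    """Exponentially decayed counts over a local stream of (item, time) pairs:
    an occurrence of age a contributes 2 ** (-a / half_life) to count(i)."""
    lam = math.log(2.0) / half_life
    counts = {}
    for item, t in stream:
        counts[item] = counts.get(item, 0.0) + math.exp(-lam * (now - t))
    return counts

def items_to_forward(counts, precision):
    """A node forwards only items whose decayed count clears its precision
    threshold; everything below the threshold is pruned locally."""
    return {i: c for i, c in counts.items() if c >= precision}
```

Raising the precision at the intermediate nodes trades recall at the root for communication, which is the tension the precision-gradient schemes try to balance.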


2.8 Related work: Data Mining in Sensor Networks

In this section we present some related work in the area of data mining in sensor networks.

Wireless sensor networks (WSNs) are finding an increasing number of applications in many domains, including battlefields, smart buildings, and even the human body. Most sensor networks consist of a collection of light-weight (possibly mobile) sensors connected via wireless links to each other or to a more powerful gateway node that is in turn connected with an external network through either wired or wireless connections. Currently most sensor nodes are connected in a hierarchical fashion, with partial or complete centralized control. The simplicity and low maintenance of the P2P architecture suggest that the next generation of sensor nodes will communicate in a P2P fashion using ad-hoc links.

Hence, data mining in wireless sensor networks is related to data processing over a P2P network. However, there are some significant differences. In many cases, while designing algorithms for a P2P environment, we are mainly concerned with reducing the communication and making the algorithm scalable. In a typical WSN, however, we not only need to consider the communication overhead, but also the energy requirement of the algorithm, the number of idle cycles that each sensor may waste due to computation/communication, the size of the message payload, and possibly more. So a highly communication-efficient algorithm in a P2P setting that requires each node to constantly probe for incoming messages is not a very suitable candidate in a WSN environment. More details about information processing in sensor networks can be found in the book by Zhao and Guibas [204].

Many researchers have focused on making the data collection task in a sensor network energy efficient. LEACH, LEACH-C, LEACH-F [76][77], and PEGASIS [111] are some of the attempts in this direction. Intrusion detection in sensor networks has received considerable attention in the last few years, mainly because of the nature of the tasks that typical sensor networks are intended for. Radivojac et al. [157] have developed an algorithm for intrusion detection in a supervised framework, where there are far more negative instances than positive ones (intrusions). An unsupervised approach to the outlier detection problem in sensor networks is presented by Palpanas et al. [146], where kernel density estimators are used to estimate the distribution of the data generated by the sensors, and the outliers are then detected using a distance based criterion. More recently, Branch et al. [19] proposed another unsupervised approach to detect outliers in wireless sensor networks using local algorithms. This algorithm is robust to data and network changes and, unlike many other local algorithms, works for any general network topology.
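To illustrate the kernel-density idea, the following is a minimal sketch of density-based outlier scoring over a node's local sample; it uses a Gaussian kernel and a simple density floor, which is only in the spirit of Palpanas et al. [146], not their exact estimator.

```python
import math

def kde(x, sample, h):
    """Gaussian kernel density estimate of x from a (non-empty) local
    sample of scalar readings, with bandwidth h."""
    norm = len(sample) * h * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in sample) / norm

def is_outlier(x, sample, h, density_floor):
    """Flag x as an outlier when its estimated density falls below a chosen
    floor, i.e. when too few nearby readings support it."""
    return kde(x, sample, h) < density_floor
```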


In many cases, computing decomposable aggregates such as sum, average and count is facilitated by in-network aggregation, since these aggregates can be computed locally and combined to produce the final correct solution. However, there are many functions, such as the median and quantiles, whose exact computation requires knowledge of the entire data. Greenwald et al. [70] present a general framework for computing ε-approximate quantiles and the median of the sensor data. They make the aggregates "quasi"-decomposable and thereby achieve an excellent reduction in communication per node. The experimental results show the effectiveness of their approach.

An important area of research is clustering the sensors themselves, since nodes that are clustered together can easily communicate with each other or solve a spatial problem. In many cases this is posed as an optimization problem. Ghiasi et al. [64] present the theoretical aspects of this problem with special emphasis on energy optimization. Each cluster is represented by a master node (similar to the centroid in a k-means clustering technique). Their clustering algorithm ensures that each cluster is balanced and that the total distance between the sensor nodes and the master nodes is minimized. Several other techniques are also presented in the literature, such as [200][34].

A recent workshop on "Data Mining in Sensor Networks" contains several relevant works: an architecture for data mining [17], artificial neural-network algorithms [107], pre-processing using EM [42] and more.


2.9 Related work: Information Retrieval in P2P Networks

Information retrieval in P2P networks (P2PIR) is closely related to data mining in P2P networks. Searching and ranking in P2P file sharing networks form the basic foundation of P2PIR, which is considered a special sub-branch of distributed information retrieval. Callan [25] presents an overview of distributed information retrieval, where the searchable databases are distributed across the network and there is no central coordinator or index to aid in the searching process. A detailed study of the different issues and prospects of P2PIR is presented by Sia [171]. Search and ordering in P2P networks are active areas of research. In the remainder of this section we discuss the work related to search and ranking in P2PIR.

2.9.1 Search in P2P Networks

Several techniques have been proposed in the literature for efficient searching in P2P networks. Lv et al. [117] argue that the flooding-based search technique employed by Gnutella [68] is not scalable due to its high bandwidth consumption in a large P2P network. The alternative ideas proposed in the paper are (1) a TTL-token based approach and (2) a random walk based approach. Each node in the network forwards a query message, called a walker, randomly to one of its peers. To reduce the query response time, the idea of the 1-walker is extended to a k-walker, where k independent walkers are propagated simultaneously. The paper shows through simulations that this technique achieves results comparable to Gnutella's while reducing the network traffic by two orders of magnitude.
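The k-walker idea can be sketched in a few lines of Python; the node interface (a neighbors list and a matches predicate) is a hypothetical simplification, and a real deployment would run the walkers in parallel across the network rather than in a loop.

```python
import random

def k_walker_search(start, matches, k=16, ttl=64):
    """Launch k independent random walkers from `start`: each walker hops
    to a uniformly chosen neighbor until it finds a match or its TTL runs
    out.  Assumes a connected overlay (every node has neighbors)."""
    hits = []
    for _ in range(k):
        node = start
        for _ in range(ttl):
            node = random.choice(node.neighbors)  # forward to one random peer
            if matches(node):
                hits.append(node)
                break
    return hits
```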


A different approach was proposed by Kalogeraki et al. [91]. The proposed technique, known as Random Breadth First Search (RBFS), works by forwarding a search message to only a fraction of a node's peers, selected at random. Since a node makes its decisions based on only a subset of the nodes in the network, the advantage of RBFS is that it does not require global knowledge and is therefore scalable; however, being probabilistic in nature, its guarantees hold only in a statistical sense. Yang et al. [198] present a collection of heuristics for efficient searching in P2P networks. Iterative deepening, directed BFS and local indices are the three techniques proposed. Experimental results show that each of these techniques has its own merits and demerits, and that they are applicable to the next generation of P2P networks.

Given a collection of text documents, the problem of retrieving the subset that is relevant to a particular query has been studied extensively [161][159]. To this day, one of the most successful techniques for addressing this problem is the vector space ranking model originally proposed by Salton and Yang [161]. Cuenca-Acuna and Nguyen [36] proposed PlanetP, a P2P searching and information retrieval system based on the vector space ranking model [161]. The main problem addressed in [36] is how to search for and retrieve documents relevant to a query posed by some member of a PlanetP community. The basic idea in PlanetP is that each community member creates an inverted (word-to-document) index of the documents that it wishes to share, summarizes this index in a compact form, and diffuses the summary throughout the community. More specifically, the paper shows how to approximate TFxIDF using compact summaries of individual peers' inverted indexes rather than the inverted index of the entire communal store, which is important for the scalability of the algorithm. The goal is to construct a content-addressable publish/subscribe service that uses gossiping of local state across unstructured communities for information retrieval. To achieve this, PlanetP uses a heuristic for adaptively determining the set of peers that should be contacted for a query.
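For reference, a bare-bones TFxIDF scorer over a local document store might look as follows; in PlanetP the document frequencies would be approximated from the gossiped peer summaries rather than known exactly, so treat doc_freq and n_docs here as given inputs of this hypothetical sketch.

```python
import math
from collections import Counter

def rank(query_terms, docs, doc_freq, n_docs):
    """Score documents by TFxIDF: docs maps doc_id -> text, doc_freq maps
    term -> number of documents containing it across the collection."""
    ranked = []
    for doc_id, text in docs.items():
        tf = Counter(text.lower().split())
        score = sum(tf[t] * math.log(n_docs / doc_freq[t])
                    for t in query_terms if doc_freq.get(t))
        ranked.append((score, doc_id))
    return sorted(ranked, reverse=True)
```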


A popular technique for disseminating information among the nodes of a large P2P network is the Distributed Hash Table (DHT). Distributed hash tables are a class of decentralized distributed systems that provide a lookup service similar to a hash table: (name, value) pairs are stored in the DHT, and any participating node can efficiently retrieve the value associated with a given name. The hash table is not kept at a central site; rather, this lookup information is disseminated among the nodes in such a way that a change in the set of participants causes a minimal amount of disruption. This allows DHTs to scale to extremely large numbers of nodes and to handle continual node arrivals, departures, and failures. Because DHTs support efficient querying and replication, they can be used for distributed file systems, P2P file systems and content distribution systems.

Several DHTs have been proposed in the literature; CHORD [175], CAN [158], PASTRY [160] and TAPESTRY [203] are some of the most popular ones. DHTs are characterized by two quantities — the keyspace partitioning and the overlay network. Most DHTs use some variant of consistent hashing to map keys to nodes. A function δ(k_1, k_2) is used to define the distance between two keys k_1 and k_2; this distance is unrelated to geographical distance or network latency. Each node is assigned a single key called its identifier (ID). A node with ID i owns all the keys for which i is the closest ID, measured according to δ. For example, the Chord DHT views the keys as if they are arranged on a circle, and δ(k_1, k_2) is the distance traveling clockwise around the circle from k_1 to k_2. The main advantage of consistent hashing is that the removal or addition of one node changes only the set of keys owned by the nodes with adjacent IDs, and leaves all other nodes unaffected. Along with this, each node maintains a set of links to some other nodes in the overlay network. The overlay network in a DHT topology is such that, for any key k, a node either owns k or knows a node which is closer to k. Given this, routing a message/query is easy: forward the message to the neighbor whose ID is closest to k; when no such neighbor exists, we have reached the owner of the data. DHTs are scalable, and the cost of inserting and deleting a node grows very slowly.
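A toy version of Chord-style consistent hashing makes the clockwise distance and key ownership concrete; the 16-bit identifier space and SHA-1 truncation below are illustrative choices for this sketch, not Chord's actual parameters.

```python
import hashlib

BITS = 16                      # toy identifier space of size 2^16

def key_id(name):
    """Hash a name onto the identifier circle [0, 2^BITS)."""
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % (1 << BITS)

def delta(k1, k2):
    """Chord-style distance: how far one travels clockwise from k1 to k2."""
    return (k2 - k1) % (1 << BITS)

def owner(key, node_ids):
    """The node owning a key is the nearest one clockwise (its successor),
    so adding or removing a node only shifts keys between adjacent IDs."""
    return min(node_ids, key=lambda n: delta(key, n))
```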


Several other P2P search techniques and systems have been developed. Bessonov et al. [12] proposed an open architecture for P2P searching as part of the OASIS project. ODISSEA (Open DIStributed Search Engine Architecture) was proposed in [180]. Tang et al. [186] proposed another P2P search algorithm, called pSearch, that distributes document indices through the P2P network based on document semantics generated by Latent Semantic Indexing (LSI). Klampanos and Jose [101] proposed an architecture for search in P2P networks with semi-collaborating peers, in which peers collaborate to achieve some global goal while their own proprietary data remains protected. Yee and Frieder [199] have looked at several existing P2P search techniques and proposed a combination which is generally very effective in practice.

2.9.2 Ranking in P2P Networks

P2P search forms the basis of information extraction from P2P networks. In many cases, the search results are huge and unordered. In this section we discuss several algorithms that have been proposed in the literature for ranking the results in order of relevance before presenting them to the user. Ordering the results is a challenging task in the absence of a centralized document index: the centralized PageRank algorithm used by Google (http://en.wikipedia.org/wiki/PageRank) and its variants are not applicable in a distributed scenario like a P2P network. As a result, top-k item identification in distributed settings and its variants are actively researched even today.

Fagin et al. [54] proposed a threshold algorithm for ordering top-k queries. Cao et al. [28] later showed that the algorithm presented in [54] consumes a lot of bandwidth when the number of nodes is large, and presented a new algorithm called "Three-Phase Uniform Threshold" (TPUT). TPUT reduces network bandwidth consumption by pruning away ineligible objects, and terminates in three round-trips regardless of the data input.

The KLEE framework [122] proposed by Michel et al. presents a novel family of algorithms for top-k query processing in wide-area distributed environments. It shows how efficiency can be enjoyed at low result-quality penalties.
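To give a flavor of the threshold idea, here is a simplified sketch of TPUT's first two phases (the third round-trip, which fetches exact sums for the surviving candidates, is omitted); node data is modeled as in-memory dictionaries with at least k distinct items overall, so this only illustrates the pruning logic, not the actual protocol.

```python
def tput_candidates(nodes, k):
    """nodes: list of dicts mapping item -> local score.
    Phase 1: collect each node's local top-k and lower-bound the k-th best
    global sum (tau).  Phase 2: fetch from every node all items scoring at
    least tau / m, which retains every true top-k candidate."""
    m = len(nodes)
    partial = {}
    for scores in nodes:
        top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
        for item, s in top:
            partial[item] = partial.get(item, 0.0) + s
    tau = sorted(partial.values(), reverse=True)[k - 1]   # phase-1 bound
    candidates = {}
    for scores in nodes:
        for item, s in scores.items():
            if s >= tau / m:                # any true top-k item clears this
                candidates[item] = candidates.get(item, 0.0) + s
    return candidates
```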


Shi et al. [170] presented a distributed version of the page-ranking algorithm used by Google to rank results obtained in a P2P network. The ranking found by their algorithm converges to that of the centralized PageRank algorithm; however, their technique is only applicable to structured P2P overlay networks. Nejdl et al. [136] considered the problem of ranking and finding the top-k queries in a schema-based P2P network such as Edutella [137]. Edutella is a P2P infrastructure for storing and retrieving RDF (Resource Description Framework) metadata in a distributed environment. The proposed distributed algorithm makes use of local rankings, rank merging and optimized routing based on peer ranks, and minimizes both the answer set size and the network traffic among peers. It relies on the super-peer approach to route the queries, which may be absent in many emerging P2P infrastructures. In 2005 [6], the authors improved upon their earlier work and proposed a nearest-match, or top-k, query scheme for P2P networks based on the same super-peer backbone. The proposed technique uses progressive refinement of the query results, thereby reducing the network traffic considerably. Ranking the best peers based on a set of multiple queries has been addressed by Tempich et al. [187].

2.10 Related work: Data Mining in GRID

A grid can be defined as "the ability, using a set of open standards and protocols, to gain access to applications and data, processing power, storage capacity and a vast array of other computing resources over the Internet" [71]. Grid computing has gained popularity as a distributed computational resource for a plethora of massive computational tasks which are difficult to handle with a single computational resource. Grids find applications in diverse domains such as:

1. earth science applications (funded by NSF and NASA) — climate/weather modeling, earthquake simulation and more,

2. financial applications such as financial modeling,

3. genetic algorithms using the Condor service,

4. biological domains, to study the effect of protein folding,

5. the E-sciencE project, which is based in the European Union and includes sites in Asia and the United States, is a follow-up to the European DataGrid (EDG) project and is arguably the largest computing grid on the planet,

6. and many more.


Grid computing was popularized by Ian Foster, Carl Kesselman and Steve Tuecke in their seminal work [59]; they are widely recognized as the "fathers of the modern grid" [193]. A grid is a type of parallel and distributed system that enables the sharing, selection, and aggregation of resources distributed across multiple administrative domains, based on the resources' availability, capacity, performance and cost, and on the users' quality-of-service requirements. Since many disciplines of science involve scavenging through massive volumes of data, computational grids offer a perfect platform for various data mining tasks.

Grid systems differ from P2P systems in the sense that grid systems have a built-in centralized control structure and network infrastructure which assigns jobs and ensures optimal system performance; P2P systems, on the other hand, do not rely on such a tightly coupled central authority. Nevertheless, since both grid and P2P systems deal with heterogeneous distributed resources coupled together to solve a common goal, they both relate to the idea of distributed data mining and share some commonalities in terms of the data mining constraints. Talia and Trunfio [183] discuss the similarities between grid and P2P computing. In another work, Talia and Skillicorn [182] argue that the grid offers unique prospects for the mining of large data sets due to its collaborative storage, bandwidth and computational resources. Cannataro et al. [26] address general issues in distributed data mining over the grid. Since the grid uses resources from various sources, proper resource allocation to the different jobs is a very important and challenging task. Hoschek et al. [81] discuss the data management issues for grid data mining. Several interesting ongoing grid projects involve data mining over the grid: the NASA Information Power Grid (http://www.gloriad.org/gloriad/projects/project000053.html), Papyrus [4], the Data Grid [38] and the Knowledge Grid [27], to name a few. The Globus Consortium has put forward the Globus Toolkit, which is open source and helps researchers with grid computing.


2.11 Related work: Distributed AI and Multi-Agent Systems

In this section we give a very brief overview of the fields of distributed artificial intelligence (DAI) and multi-agent systems (MAS). For a detailed overview, interested readers are referred to the PhD thesis of Stone [177][178] and the review paper by Sen [166].

The evolution of MAS can be traced to early work in AI such as the blackboard system [140], Minsky's Society of Mind concept [125] and more. These fostered the concept of independent entities working in a parallel manner to achieve some common goal.

In the early years of MAS, Victor Lesser, with the help of his colleagues and students, developed the distributed vehicle monitoring testbed [50][109]. At almost the same time, Agha [2] and Hewitt [79] proposed the actors framework for parallel communication among agents. The contract net protocol developed by Smith [173] specifies agent communication protocols.

Several researchers have looked into the following problems in MAS/DAI research. Distributed search in scheduling and constraint satisfaction was dealt with by Durfee [52]. Multi-agent learning is an active area of research; Haynes and Sen [75] have written a series of papers on learning in multi-agent systems.

There are several ways of dividing up MAS and the field of DAI [45][51]. Here we follow the taxonomy presented by Stone [177]: the MAS literature can be subdivided into the four categories described below.


2.11.1 Homogeneous and non-communicating MAS

In homogeneous MAS, all the agents are the same in functionality, domain knowledge and possible actions. They differ mainly in terms of their inputs (and hence outputs), because they are located at different physical locations in the world. The agents cannot communicate directly.

Using homogeneous non-communicating agents, Balch and Arkin [5] have investigated formation control for robot teams. Such agents have also been used elsewhere: local knowledge [167], stigmergy [69] and recursive modeling [49].

2.11.2 Heterogeneous and non-communicating MAS

In this scenario the agents differ in their goals, actions and domain knowledge, while the assumption that the agents do not communicate directly still holds. The goals of the agents can be cooperative or disruptive. This issue of benevolence vs. competitiveness has been addressed by Goldman and Rosenschein [69]. Game theoretic formulations for cooperative games such as the prisoner's dilemma have been proposed by Mor and Rosenschein [127]. Littman [114] considered a zero-sum non-cooperative game and proposed Minimax-Q, a variation of Q-learning.

The use of heterogeneous non-communicating agents extends to competitive co-evolution [75], adaptive load balancing [162] and more.

2.11.3 Homogeneous and communicating MAS

In the homogeneous communicating setting, the agents are the same as in the homogeneous non-communicating case (again except for their inputs and hence outputs), but can now talk to each other. The communication can be broadcast, point-to-point or via shared memory. This area is closely related to the sensor network domain. Uses of MAS in this domain include the cooperative distributed vision project [135] and the real time traffic information system [128].


2.11.4 Heterogeneous and communicating MAS

This is perhaps the most difficult scenario to model, because of both the heterogeneity and the communication. Several techniques have been proposed for heterogeneous agent communication, such as KQML [56]. A plethora of work exists on this type of MAS, such as cooperative co-evolution [24], multi-agent Q-learning [184], query response in information networks [181], coalitions [169] and more. Interested readers are referred to Stone [177] for a detailed discussion of this topic.

2.12 Applications of Peer-to-Peer Data Mining

The proliferation of P2P networks over the last decade has generated a large number of interesting applications. In this section we barely scratch the surface; for a more detailed discussion, interested readers are referred to the P2P article in Wikipedia [144].

2.12.1 File storage

The biggest and probably most widely used application of P2P networks is file storage. The past decade has witnessed a large number of P2P storage systems, both academic and commercial. On the academic side, CHORD [175], PASTRY [160] and CAN [158] are some of the well known P2P file storage applications. Commercial P2P storage systems such as Napster [134], Gnutella [68], BitTorrent [16] and Kazaa [97] have been extremely popular for data storage and for music, video and photo sharing. More recent P2P storage applications include the UbiStorage project [190]. Hasan et al. [74] present a detailed overview of the different P2P distributed file storage systems.

These file storage applications offer an important testbed for P2P data mining algorithms. Data mining algorithms are necessary for recommender systems, search engines, and the classification and clustering of the documents/files in such settings.


2.12.2 Education

Several interesting P2P projects are currently being undertaken at educational institutions or in consortia of such institutions.

One of them is the ScienceNet P2P search engine, currently under deployment by the Karlsruhe Institute of Technology - Liebel-Lab [165]. The idea is to perform distributed indexing of university and research websites in order to keep up-to-date information about collaborators and researchers. It is based on "YaCy", a P2P search engine software. Universities, research institutes or individuals can download the free Java software and install it on their own machines; these machines then join the ScienceNet network and help in the indexing of researchers and universities.

With the rapid explosion of data, many universities are collaborating on data storage using P2P technology. One such endeavor is by the Pennsylvania State University, MIT and Simon Fraser University, who are carrying out a project called LionShare [113], designed to facilitate file sharing among educational institutions globally.

Another project worth mentioning is the PlanetLab project, originally started by researchers at UC Berkeley, Intel Labs and Princeton [151]. The project aims at maintaining a large number of nodes as a testbed for running applications for distributed systems. As of October 2007, there were 825 nodes spread across the US, Europe and Asia. Most of the participants are educational institutions, research labs and corporate agencies interested in distributed systems and networking research. Many research papers published today contain simulation results based on PlanetLab — a sure sign that this testbed is becoming popular.


2.12.3 Bioinformatics

Bioinformatics is the interaction of biology with information science and data mining. Being a data-rich subject, bioinformatics is a natural candidate for the application of the P2P mode of computing. P2P networks are specifically used to run large programs which can identify drug candidates, such as the effort conducted at the Centre for Computational Drug Discovery at Oxford University in cooperation with the National Foundation for Cancer Research, started in 2001. Chinook [35][126] is a P2P bioinformatics service focusing on the exchange of analyses between bioinformatics researchers worldwide. It consists of self-administered programs for computational biologists to run and compare various bioinformatics software. Tranche [188] is a bioinformatics network developed to ease the data sharing problems among researchers.

2.12.4 Consumer Applications

P2P technologies have found a number of uses in our everyday lives. Real-time audio and video streaming using the Peer Distributed Transfer Protocol (PDTP) is more common today than simple offline audio and video downloading; one such implementation is the "distribustream" peercasting service [47]. Many TV channels are now broadcast using P2P technology, commonly referred to as P2PTV. Sopcast [174], developed in China, is an extremely popular P2PTV service used worldwide for viewing multi-language TV channels in TV resolution. Joost [88] is another hugely popular service.

P2P technology is also actively used in telecommunications. Today, much Voice-over-IP (VoIP) traffic is handled in a P2P fashion — Skype [172] being one of the most popular services; currently it has over fifty million users.

Sun Microsystems has developed the JXTA™ [90] protocol specifically to allow developers to build P2P applications. Numerous P2P chat applets and instant messaging services based on JXTA™ exist today.


2.12.5 Sensor Network Applications

A completely different application of P2P networks is the sensor network, an application area originally driven by the US Defense Advanced Research Projects Agency (DARPA) for battlefield surveillance. Many sensor network-based civilian applications have since emerged, such as habitat monitoring, healthcare applications, home automation, and traffic control.

Sensors are deployed in hostile locations to provide detailed environmental information. These sensors form ad-hoc P2P connections. For sensors, the preservation of battery power is critical for two reasons: (1) the sensors are small in size and have limited power, and (2) due to their physical placement, it is not possible to change the batteries often. In such sensors, wireless communication consumes significantly more energy than any other operation. Therefore, algorithms designed for such networks need to be extremely communication-efficient and should put the sensors into a sleep mode to conserve battery power. Moreover, sensors need to operate in a router-less environment with no global IPs. Finally, due to the limited power and hostile environments, light-weight wireless sensor networks are highly dynamic, with sensor dropout and edge failure a common occurrence.

2.12.6 Mobile Ad-hoc Networks (MANETs)

MANETs are getting increasing attention in many wireless application domains. Several data-rich environments (e.g. vehicular ad-hoc networks and P2P e-commerce environments) are emerging, and they are likely to need data analytic support for efficient and personalized services. Consider a simple profiling application, launched via cellular phones, that tries to automatically connect to a MANET-like network formed by the different cellular phones in the vicinity and to identify peers with similar interests in order to form a social network. A lightweight P2P classification algorithm may be very handy in such an application.

Many upcoming vehicular networks are built using wireless communication technology enabling cars to talk to each other to get traffic information, suggest alternate routes, alert drivers in case of approaching bad weather, and prevent life-threatening accidents by taking evasive action. These cars form an ad hoc P2P connection, and node failures and reconnections are very frequent. All these techniques use variants of distributed data mining algorithms to generate models on the fly and in real time. For example, predicting a future action requires one to model the past behavior and current state using classification or regression algorithms. The Campus Vehicular Testbed at the University of California Los Angeles (CVet@UCLA) [191] is one such initiative.


2.13 Summary

In this chapter we first presented a detailed background study of the different types of P2P networks. We classified P2P networks into three main categories: (1) centralized, (2) decentralized but structured, and (3) decentralized and unstructured. This classification was proposed by Lv et al. [117]. Since P2P data mining algorithms operate in many different environments, we then presented a few important and desired properties of P2P data mining algorithms. A comprehensive literature review of the previous work related to this area of research was presented next. For convenience, we divided the related work into the following areas:

1. distributed data mining

2. data mining in P2P networks

3. data stream monitoring

4. data mining in sensor networks

5. information retrieval in P2P networks

6. data mining in Grids

7. machine learning in multi-agent systems and distributed artificial intelligence

Finally, we presented some application areas of P2P data mining which, over the last decade, have received considerable attention.




Chapter 3

DEFRALC: A DETERMINISTIC FRAMEWORK FOR LOCAL COMPUTATION IN LARGE DISTRIBUTED SYSTEMS

3.1 Introduction

In sensor networks, P2P systems, grid systems, and other large distributed systems there is often the need to model the data that is distributed over the entire system. In most cases, centralizing all or some of the data is a costly approach. The cost becomes even higher if the data distribution changes and the computation needs to be updated to follow that change.

Statistical models can be built and updated from distributed data in several different ways. The global computation paradigm uses the information from all the nodes in the network to construct a data mining model by aggregating all the data at a central location. Such a computing model has several advantages: (1) a large variety of models can be computed, (2) the output can be provably correct for many problems, and (3) the algorithms can be very simple (e.g. broadcast, convergecast). However, the major drawback of these algorithms is scalability: computing models using the information from all the nodes in the network means the algorithms scale proportionally to the size of the network. This becomes severely problematic especially for large networks consisting of millions of nodes.


Moreover, these algorithms are very synchronous, and thus unlikely to be useful in today's asynchronous P2P networks. On the other hand, it should be noted that global computation is not always essential. Consider, for example, the problem of finding the mean of an array of numbers where each node has one number. Instead of transmitting the entire data, each node can simply communicate with its neighbors, and the result can then be propagated. Algorithms which compute their results in a small neighborhood and then propagate them are called local algorithms.

Local algorithms, introduced in Chapter 2, are a highly efficient family of algorithms developed for distributed systems. First, local algorithms are in-network algorithms in which the computing nodes exchange only statistics — not raw data — with their neighbors to compute the result. Second, at the heart of a local algorithm there is a data dependent criterion dictating when nodes can avoid sending updates to their neighbors. An algorithm is generally called local if the total communication per node is upper bounded by some constant, independent of the number of nodes in the network. Primarily for this reason, local algorithms exhibit high scalability. Moreover, local algorithms exhibit high fault tolerance due to their ability to localize and isolate faults and to avoid a single point of failure.

The superior scalability of local algorithms motivates us to develop a local framework for data mining in large distributed environments. Specifically, we focus on local algorithms which are provably correct compared to a centralized execution. Our goal in this chapter is to first introduce the reader to some definitions and properties of local algorithms. Next we present our DeFraLC framework (Deterministic Framework for Local Computing). Section 3.6 presents the main theorem in DeFraLC. Section 3.7 discusses the Peer-to-Peer Generic Monitoring Algorithm (PeGMA) in DeFraLC, which is capable of computing complex functions of distributed data. Finally, we discuss several instantiations of the PeGMA and demonstrate how DeFraLC can be used for developing several data mining algorithms.


3.2 What is locality?

In this section we present a brief intuition for local algorithms as used in this dissertation. Consider a large network in which every node has some data, and suppose our task is to compute some function defined on the union of the data of all the peers. One way of doing this computation is to allow each peer to communicate with all the other nodes in the network to get a global model. This process, reminiscent of gossip-, flooding- or broadcast-style communication, is too expensive: typically the communication cost per node is O(size of the network). As a result, increasing the network size increases the communication cost by the same factor. Therefore these algorithms are not suitable for modern-day large-scale P2P networks spanning possibly millions of peers.

An alternative approach is to develop distributed algorithms which satisfy the following property: each node should base its output on communication with only a fixed number of neighbors, and the total size of the queries exchanged over the entire execution of the algorithm should also be bounded. Ideally, these "fixed bounds" should be small constants, independent of the size of the system. As a consequence, such algorithms exhibit locality in their execution and are therefore highly scalable: increasing the system size has virtually no effect on their performance, making them ideal candidates for P2P networks. Moreover, locality has an intrinsic relation to fault tolerance and robustness — since the execution of any peer is local, a failure affects only the peers in its fixed neighborhood.

An example of local communication is shown in Figure 3.1. In the figure, peer 47 is connected to peers 25, 30, 31, 45 and 49. If the algorithm run on this network is such that each peer can do this local computation and still guarantee some global result, the algorithm will be highly scalable.
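For intuition, the mean-computation example can be sketched as iterated neighborhood averaging; this toy simulation, with an assumed adjacency-list representation, converges to a common value close to the global mean (and exactly to it on regular topologies).

```python
def local_averaging(values, neighbors, rounds=50):
    """Each round, every node replaces its value by the mean of its own
    value and its immediate neighbors' values; only neighbor-to-neighbor
    messages are ever exchanged."""
    x = list(values)
    for _ in range(rounds):
        x = [(x[i] + sum(x[j] for j in neighbors[i])) / (1 + len(neighbors[i]))
             for i in range(len(x))]
    return x
```

Each node's per-round communication is bounded by its degree, independent of the network size, which is the essence of locality.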


FIG. 3.1. Figure showing connections to immediate neighbors.

3.3 Local Algorithms: Definitions and Properties

The idea of using local rules in algorithms dates back to the seventies: John Holland described such rules for non-linear adaptive systems and genetic algorithms in his seminal work on biological systems [80]. Local evolutionary rules for grid-based cellular automata were first introduced in the 1950s by John von Neumann [139] and later adopted in many fields such as artificial agents, VLSI testing and physical simulations, to mention a few. In the context of graph theory, local algorithms were used in the early nineties. Linial [112], and later Afek et al. [1], discussed local algorithms in the context of graph coloring problems. Kutten and Peleg [108] introduced local algorithms for fault detection in which the cost depends only on the unknown number of faults and not on the entire graph size; they proposed solutions for some key problems such as the maximal independent set (MIS) and graph coloring. Naor and Stockmeyer [133] defined locality as follows.

Definition 3.3.1 (Local Algorithm: Naor and Stockmeyer [133]). An algorithm is called local if it runs in constant time, independent of the size of the network.

This definition implies that if an algorithm runs in constant time t, then it must compute its results by contacting nodes only in a neighborhood of radius t.


It further suggests that local algorithms offer good fault tolerance: a failure at a node n can only affect the nodes in the neighborhood of n. However, strict independence is not always possible. For example, if the computation of a node increases very slowly with the size of the network, then the algorithm is not local according to this definition; yet we feel it should still be called local, since its growth rate is bounded, even though not independent of the size of the network.

Graph theoretic formulations of locality and network computations are discussed extensively in the book by Peleg [149]. The locality-sensitive approach to network computations tries to restrict computations to a particular region (cluster). Intuitively, the idea is to cover the network with overlapping clusters, with each node participating in the computation in its own cluster. As a result, the cost of a task depends only on the locality properties of the cluster.

The application of local algorithms to data mining in P2P networks is relatively new. Data mining in P2P networks requires algorithms that work in dynamic environments. The following is a definition of a local algorithm proposed by Wolff et al. [195].

Definition 3.3.2 (Local Algorithm: Wolff et al. [195]). An algorithm is local if there exists a constant c such that for any number of peers n, there are instances (i.e., neighbors Γ_i, input and intermediate states for each peer P_i) such that from the last change of the data and system to the termination state of the algorithm no peer uses more than c additional messages, memory bits, CPU cycles, and time units.

Although the above definition is stated strictly from an efficiency standpoint, it is fairly easy to argue its effectiveness from the standpoint of the graph theoretic notions of locality. According to Definition 3.3.2, for any graph with unit delay along its edges, a peer can contact at most c peers in its entire neighborhood for the algorithm to be local. This means that the result of a query can propagate at most c hops. In other words, the data propagation and the running time are bounded by a constant c for any graph G.


However, Definition 3.3.2 is not without its limitations. First, for a distributed algorithm to be termed local, the communication per node should be small. Requiring it to be a constant is not sufficient, since the constant can itself be large with respect to the size of the network. Moreover, the size of a message is not bounded. To deal with this, Datta et al. [39] proposed a definition of local algorithms which was later refined by Das et al. [37]. First we define the α-neighborhood of a vertex.

Definition 3.3.3 (α-neighborhood of a vertex). Let G = (V,E) be the graph representing the network, where V denotes the set of nodes and E represents the edges between the nodes. The α-neighborhood of a vertex v ∈ V is the collection of vertices at distance α or less from it in G: Γ_α(v) = {u | dist(u,v) ≤ α}, where dist(u,v) denotes the length of the shortest path between u and v, and the length of a path is the number of edges in it.

Next we define an α-local query based on the α-neighborhood of a vertex.

Definition 3.3.4 (α-local query). Let G = (V,E) be a graph as defined above. Let each node v ∈ V store a data set X_v. An α-local query by some vertex v is a query whose response can be computed using some function F(X_α(v)), where X_α(v) = {X_u | u ∈ Γ_α(v)}.

Finally we present the definition of an (α,γ)-local algorithm.

Definition 3.3.5 ((α,γ)-local algorithm). An algorithm is called (α,γ)-local if it never requires computation of a β-local query such that β > α, and the total size (measured in bits) of the responses to all such α-local queries sent out by a peer is bounded by γ. Here α can be a constant or a function parameterized by the size of the network, while γ can be parameterized by both the size of a peer's data and the size of the network.
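To make the α-neighborhood concrete, the following minimal Python sketch computes Γ_α(v) with a breadth-first search over an adjacency-list graph; the representation and names are ours and purely illustrative, not part of the definitions above.

from collections import deque

def alpha_neighborhood(adj, v, alpha):
    """Return the set of vertices within alpha hops of v.

    adj: dict mapping each vertex to an iterable of its neighbors.
    """
    seen = {v}
    frontier = deque([(v, 0)])
    while frontier:
        u, d = frontier.popleft()
        if d == alpha:          # vertices at distance alpha are included,
            continue            # but the search does not expand past them
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                frontier.append((w, d + 1))
    return seen

# Example: on the path graph 1-2-3-4, the 1-neighborhood of 2 is {1, 2, 3}.
adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
assert alpha_neighborhood(adj, 2, 1) == {1, 2, 3}

An α-local query issued by v could then be answered from the data sets {X_u : u ∈ Γ_α(v)} alone.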


We call such an (α,γ)-local algorithm efficient if both α and γ are either small constants or slowly growing (sub-linear) functions of their parameters.

Consider an algorithm that uses broadcast as its communication mode. For such an algorithm, γ is O(size of the network). Hence broadcast-based algorithms are not local according to this definition.

Next consider an algorithm which performs a convergecast over a communication tree in order to compute the sum of the numbers held by the peers. The protocol works as follows. A leaf peer sends its data to its parent up the tree. Whenever an intermediate peer has received data from all of its neighbors except one, it adds the received data to its own and sends the result to the neighbor from which it has not yet received anything. If it has received data from every neighbor, it computes the sum of all of it, including its own data, and broadcasts the result. Otherwise, a peer keeps waiting. The following lemma (Lemma 3.3.1) shows that such an algorithm is not local for bounded γ.

Lemma 3.3.1. The convergecast algorithm for computing the sum (as stated above) is not (O(1),γ)-local for a constant γ, or for a γ that grows slowly with the size of the network n.

Proof. Clearly α = 1, since every peer communicates only with its immediate neighbors.

For the bound on γ, let the data at each peer be d_i and let m bytes be needed to represent d_i. For any peer P_i, let Child_i denote the number of peers in the subtrees rooted at P_i's children. When P_i has received data from all of its children, it computes d_i + Σ_l d_l, where l ranges over the peers in those subtrees. The number of bytes necessary to represent this sum is m + m × Child_i = m × (1 + Child_i).


For the root node, Child_i = n − 1; therefore the message size at the root is m × n. Thus there exists one peer in the network, the root, for which γ = O(n). Hence the convergecast algorithm for computing the sum is not (O(1),γ)-local for a constant or slowly growing γ.

However, using the same convergecast technique to compute the maximum of the numbers held by the peers is (O(1),γ)-local with γ a constant, independent of the size of the network. Lemma 3.3.2 proves this claim.

Lemma 3.3.2. The convergecast algorithm for computing the maximum is (O(1),γ)-local, where γ is a constant, independent of the size of the network.

Proof. Clearly α = 1, since every peer communicates only with its immediate neighbors.

For the bound on γ, let the data at each peer be d_i and let m bytes be needed to represent d_i. Peer P_i computes max{d_i, max_l d_l}, where l ranges over the neighbors from which P_i has received data. At any node, this maximum can be represented using m bytes. Thus for every node in the network the total size of the query response is a constant m. Hence the convergecast algorithm for computing the maximum is (O(1),γ)-local for a constant γ.
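The contrast between Lemmas 3.3.1 and 3.3.2 is easy to see in code. The following Python sketch is our own illustration (a static rooted tree stands in for the message-driven protocol); it runs the same bottom-up convergecast with two different combiners. For max, every message carries one fixed-width value, while the root's exact sum aggregates the whole subtree, which is where the O(n) bound of Lemma 3.3.1 comes from.

import operator

def convergecast(tree, data, node, combine):
    """Bottom-up convergecast over a rooted tree.

    tree: dict mapping a node to the list of its children.
    data: dict mapping a node to its local value.
    combine: two-argument combiner, e.g. operator.add or max.
    """
    value = data[node]
    for child in tree.get(node, []):
        # In the protocol this is the message the child sends its parent.
        value = combine(value, convergecast(tree, data, child, combine))
    return value

tree = {"root": ["a", "b"], "a": ["c", "d"]}
data = {"root": 5, "a": 1, "b": 7, "c": 2, "d": 9}

print(convergecast(tree, data, "root", operator.add))  # 24: the exact sum
print(convergecast(tree, data, "root", max))           # 9: a single m-byte value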


Although the preceding definition addresses the efficiency of local algorithms, it does not specify the quality of the result. The yardstick for measuring the result of any distributed data mining algorithm computing a function F is comparison with a centralized algorithm computing the same function F with access to all of the data and resources. Traditionally, based on accuracy, local algorithms can be broadly classified as exact or approximate. In an exact local algorithm, once the computation terminates, the result computed by each peer is the same as that of a centralized execution, e.g., [196][195][105]. On the other hand, researchers have also proposed approximate local algorithms using probabilistic techniques, e.g., K-means [40] and top-l element identification [37].

Before we formally define the accuracy of local distributed data mining algorithms, we state what we mean by the accuracy of a centralized algorithm under the same conditions.

Definition 3.3.6 (Accuracy of centralized algorithm). Given a distributed algorithm for computing a function F : R^d → O (where O has an arbitrary range), the accuracy of a centralized algorithm for computing F is the accuracy achieved by an algorithm having access to all of the data and other distributed resources at a central location.

We are now in a position to discuss the accuracy of (α,γ)-local distributed data mining algorithms.

Definition 3.3.7 ((α,γ)-probably approximately correct local algorithm). Let E(A,F) denote the error induced by an algorithm A when computing a function F, as measured against the true F. Let X be a random variable which measures the difference between the error of an (α,γ)-local distributed algorithm A_d computing F and that of a centralized algorithm A_c computing the same function F. A_d is (ε,δ)-correct if Pr(X > ε) < 1 − δ.

In this work, the local algorithms that we have developed as part of the DeFraLC framework are exact. We focus on developing local algorithms which satisfy the following desired properties:

• It should guarantee eventual correctness.

• It should have a local stopping rule — each peer should be able to determine independently when to stop communicating its data.


• Any peer should ideally communicate only with its immediate neighbors — at most it may communicate with peers a small (bounded) number of hops away.

• The total communication overhead per peer should be small and independent of the size of the network.

Low communication overhead makes local algorithms highly scalable. Computation in local algorithms is, in most cases, independent of the size of the system, so we can potentially grow the system with very little increase in communication and little deterioration in quality. Henceforth we will use (α,γ)-locality as the general definition of local algorithms.

3.4 Approach

Let G denote a collection of data tuples (input vectors) that is horizontally distributed over a large (undirected) network of machines (peers) wherein each peer communicates only with its immediate (one-hop) neighbors in the network. The communication network can be thought of as a graph with vertices (peers) V.

The algorithms we describe in this dissertation (PeGMA and its instantiations) deal with deterministic, efficient, local computation of functions defined on the global dataset G. Note that the vectors in G can be combined in many different ways to produce a statistic: weighted linear combinations and their variants (sum, difference, average), Cartesian product, inner product, cross product, and more. Linear combinations of the vectors in G provide a good measure of the variability of the vectors in G. Thus, the algorithms described here focus on computing functions defined on linear combinations of the vectors in G. Furthermore, the algorithms are correct when compared to a centralized scenario, and this correctness can be guaranteed even in the case of data and network changes. The efficiency of our technique is due to the use of local, data-dependent rules which allow a peer to reach the correct result by communicating within a small neighborhood, thereby cutting down the number of messages drastically.


Linear combinations such as the average find use in many data mining and signal processing problems, such as source detection and localization [150], recommender systems on the Internet, leader election, and more. As we show in this chapter and the next, function computation on the global average vector is a very powerful tool — a wide variety of complex data mining algorithms, such as k-means clustering, eigen-monitoring, multivariate regression, and decision tree induction, can be built from such a simple yet powerful primitive.

The next section gives a formal description of the notation and the problem definition.

3.5 Notations, Assumptions, and Problem Definition

In this section we discuss the notation and assumptions used throughout the rest of this chapter.

3.5.1 Notations

Let V = {P_1, ..., P_n} be a set of peers (we use the term peers for the peers of a peer-to-peer system, the motes of a wireless sensor network, etc.) connected to one another via an underlying communication infrastructure such that the set of P_i's neighbors, Γ_i, is known to P_i. Additionally, P_i is given a time-varying set of input vectors in R^d, S_i = {x_i^1, x_i^2, ...}. In the distributed data mining literature this is often referred to as the horizontal data distribution scenario.

In this work peers communicate with one another by sending sets of input vectors, as defined in Definition 3.5.1.


Definition 3.5.1 (Message for sets). The message that peer P_i sends to P_j consists of a set of vectors and is denoted by X_{i,j}. Each vector is in R^d, and the size of the set depends on the data that P_i needs to send to P_j.

Below we show that, for our purposes, statistics such as the average and the number of vectors are sufficient. Assuming reliable messaging, once a message is delivered both P_i and P_j know both X_{i,j} and X_{j,i}. Our goal is to compute functions defined on the union of the data of all the peers, and to achieve this without requiring all peers to send all of their data. To this end, we define four sets of vectors which are local to a peer. As shown later (Section 3.6), conditions on these local vectors imply conditions on the global dataset, and hence on functions defined on the union of the data of all peers.

Definition 3.5.2 (Knowledge). The knowledge of P_i, denoted K_i, is the union of S_i with X_{j,i} for all P_j ∈ Γ_i: K_i = S_i ∪ ⋃_{P_j ∈ Γ_i} X_{j,i}.

Definition 3.5.3 (Agreement). The agreement of P_i and any of its neighbors P_j is A_{i,j} = X_{i,j} ∪ X_{j,i}.

Definition 3.5.4 (Withheld knowledge). The withheld knowledge of P_i with respect to a neighbor P_j is the knowledge minus the agreement: W_{i,j} = K_i \ A_{i,j}.

Definition 3.5.5 (Global input). The global input at any time is the set of all inputs of all the peers: G = ⋃_{i=1,...,n} S_i.

Many data mining primitives in the literature can be posed as geometric problems. For example, consider checking whether the L2 norm of a vector in R^2 exceeds a user-defined threshold ε — a question that arises in many complex data mining problems, as we show in this dissertation. Geometrically, checking whether the L2 norm is at most ε amounts to checking whether the vector lies inside a circle of radius ε.
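Definitions 3.5.2–3.5.5 translate directly into set operations. The following Python sketch (all names are ours; vectors are stored as tuples so they can live in Python sets) computes one peer's local sets from its own data and the exchanged sets.

def local_sets(S_i, X_in, X_out):
    """Knowledge, agreements, and withheld knowledge of one peer.

    S_i:   set of this peer's own input vectors, stored as tuples.
    X_in:  dict neighbor -> set of vectors received from that neighbor (X_{j,i}).
    X_out: dict neighbor -> set of vectors sent to that neighbor (X_{i,j}).
    """
    K = set(S_i)
    for X_ji in X_in.values():
        K |= X_ji                                  # K_i = S_i ∪ ⋃_j X_{j,i}
    A = {j: X_out[j] | X_in[j] for j in X_in}      # A_{i,j} = X_{i,j} ∪ X_{j,i}
    W = {j: K - A[j] for j in A}                   # W_{i,j} = K_i \ A_{i,j}
    return K, A, W

# A peer with three vectors that has not yet exchanged anything with neighbor 2:
S = {(2, 6), (3, 8), (1, 4)}
K, A, W = local_sets(S, X_in={2: set()}, X_out={2: set()})
assert K == S and A[2] == set() and W[2] == S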


The theorems of the DeFraLC framework deal with convex regions and their collections in order to guarantee results on the global data (and hence on functions defined on it) based only on the local sets of vectors, viz. K_i, A_{i,j}, and W_{i,j}. Hence we next define convex regions, functions invariant on convex regions, and the ω-cover C_ω.

Definition 3.5.6 (Convex region). A region R ⊆ R^d is convex if for every two points x, y ∈ R and every α ∈ [0,1], the weighted average αx + (1−α)y ∈ R.

Definition 3.5.7 (Invariant function in a convex region). Let F be a function from R^d to O. We say F is invariant in R (in short, R-invariant) if the value of F on R is constant, i.e., for all x, y ∈ R: F(x) = F(y).

Definition 3.5.8 (ω-cover of a domain for a function F). A collection of non-overlapping convex regions C_ω = {R_1, R_2, ..., R_n} is an ω-cover for F if:

1. every R_i ∈ C_ω is convex and F is R_i-invariant;

2. the area of R^d not covered by ⋃_{i=1}^n R_i is less than ω.

Our next definition deals with tie regions.

Definition 3.5.9 (Tie region). Let C_ω = {R_1, R_2, ..., R_n} be a convex cover for a function F as defined above, and let D = R^d be the set of all vectors. The tie region T is the region left uncovered by C = ⋃_{i=1}^n R_i, i.e., T = D \ C.

Next we define the global termination state of a distributed asynchronous algorithm.

Definition 3.5.10 (Global termination state of a distributed asynchronous algorithm [10]). A distributed asynchronous algorithm is said to have terminated globally, or to have reached a termination state, at a certain global state if

1. all nodes are idle,

2. all edges are empty (i.e., all messages have been delivered), and

3. the system does not change for a time period Δ_t, which is parameterized by the diameter of the network and the maximal edge delay.
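One lightweight way to realize Definitions 3.5.6–3.5.9 in code is to represent each convex region of C_ω as a membership predicate, with the tie region being "none of them". The sketch below is our illustration, not the thesis's implementation; it anticipates the circle-plus-half-spaces cover of Section 3.7.3 and assumes the regions are pairwise disjoint, as Definition 3.5.8 requires.

import math

EPS = 10.0  # radius of the circle on which F changes value (assumed)

def inside_circle(x):
    return math.hypot(*x) < EPS

def make_halfspace(u):
    # {x : x . u >= EPS}; u is assumed to be a unit vector
    return lambda x: x[0] * u[0] + x[1] * u[1] >= EPS

# A toy cover: the inside of the circle plus two disjoint half-spaces.
cover = [inside_circle, make_halfspace((1.0, 0.0)), make_halfspace((-1.0, 0.0))]

def region_of(x, cover):
    """Index of the convex region containing x, or None if x is in the tie region."""
    for idx, contains in enumerate(cover):
        if contains(x):
            return idx
    return None

print(region_of((2.0, 6.0), cover))   # 0: inside the circle
print(region_of((0.0, 11.0), cover))  # None: tie region of this toy cover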


3.5.2 Assumptions

In this section we state the assumptions made in developing algorithms under DeFraLC.

Assumption 3.5.1. In this dissertation we are interested in computing functions defined on linear combinations of the vectors in G. Hence the underlying space is the linear vector space of G.

Linear combination, such as the average, is a natural technique for combining the vectors in G; the average vector also measures the variability of the vectors in the set G. Simplicity, combined with the ability to represent a number of well-known data mining problems (as we show later), motivates us to concentrate on such a linear space.

Assumption 3.5.2. Communication among neighboring peers is reliable and ordered.

This assumption can easily be enforced using standard numbering and retransmission (in which messages are numbered, ordered, and retransmitted if an acknowledgement does not arrive in time) and heartbeat mechanisms. Moreover, this assumption is not uncommon and has been made elsewhere in the distributed algorithms literature [63]. Khilar and Mahapatra [100] discuss the use of heartbeat mechanisms for failure diagnosis in mobile ad-hoc networks.

Assumption 3.5.3. Communication takes place over a communication tree.

It is assumed that a set of vectors sent from P_i to P_j is never sent back to P_i. Since we are dealing with statistics of data (e.g., counts), one way of ensuring this is to impose a tree overlay structure on the original network so that there are no cycles. As described in Section 3.5.3, this assumption allows us to send statistics of the sets rather than the sets themselves. We could relax this assumption in one of two ways. (1) Liss et al. [14] have shown how the distributed majority voting algorithm (which requires a communication tree) can be developed for general networks; we could use a similar technique here, although the generalization to other networks may be non-trivial.


(2) The underlying tree communication topology could be maintained independently of our framework using standard techniques such as [63] (for wired networks) or [110] (for wireless networks).

Our goal is to develop a local framework for correctly computing a function F defined on G (the same function output at each node). However, the network is dynamic: the topology can change (peers may join or leave at any time), and the data held by each peer can change. Hence G, the union of all peers' data, can be thought of as time-varying, as can the set of neighbors Γ_i of each peer P_i. This leads to our next assumption.

Assumption 3.5.4. Peers are notified of changes (addition or deletion of data tuples, or modification of attribute values) in their own data S_i and in their set of neighbors Γ_i.

Assumption 3.5.5. The input data tuples (vectors) in S_i are unique.

Assumption 3.5.6. An ω-cover C_ω can be precomputed for every F and every desired ω.

Note that Assumption 3.5.6 is the weakest of all the assumptions. In this work (see Section 3.7.3) we show how such an ω-cover can be devised for a specific function — the hyper-sphere in R^d — which solves several data mining problems. Finding a cover C_ω for an arbitrary F depends on F and on the space R^d, and the communication complexity of our algorithm may suffer from an inefficient choice of C_ω. Moreover, for the hyper-sphere we show in extensive experiments that the communication complexity increases only linearly with d, while the volume of the space increases exponentially. Formulating an algorithm which outputs an ω-cover for every function F defined on R^d is, in our view, an open research topic.


3.5.3 Sufficient Statistics

The algorithms we describe in this dissertation compute functions of linear combinations of the vectors in G; i.e., we restrict ourselves to a linear vector space. For clarity, we focus on one such combination — the average. When computing the average, the tree overlay assumed in this work lets us completely represent the previously defined sets G, S_i, K_i, A_{i,j}, X_{i,j}, and X_{j,i} by some of their statistics: the average vector of the set and its size. We can do this because, as long as the messages sent from P_i to P_j are independent of X_{j,i}, both X_{i,j} ∩ X_{j,i} = ∅ and W_{i,j} ∩ K_j = ∅. Next we define the average and the size of an arbitrary set.

Definition 3.5.11 (Average vector of a set). Given a set X, the average vector of the set, denoted X̄, is the average of all the vectors in the set.

Definition 3.5.12 (Size of a set). Given a set X, the size of the set, denoted |X|, is its cardinality, i.e., the number of vectors in it.

Definition 3.5.13 (Average vector of a set union). Given two disjoint sets X and Y, the average vector of X ∪ Y is the weighted average (|X| X̄ + |Y| Ȳ) / (|X| + |Y|).

Definition 3.5.14 (Size of a set union). Given two disjoint sets X and Y, the size of X ∪ Y is |X ∪ Y| = |X| + |Y|.

Similar statistics can be written for each of the sets S_i, K_i, A_{i,j}, and W_{i,j}. Given the definitions of average vectors and set sizes, the idea is to express the statistics of K_i, A_{i,j}, and W_{i,j} in terms of the statistics of S_i and the X_{i,j}'s, rather than in terms of the statistics of unions. Below we define the average vector and the size of each of these sets.

Definition 3.5.15 (Average and size of local dataset). The size and average of the local dataset S_i are defined as:


• Size: |S_i| = the number of data tuples in S_i

• Average: S̄_i = (1/|S_i|) Σ_{x ∈ S_i} x

Definition 3.5.16 (Average and size of knowledge). The size and average of the knowledge K_i are defined as:

• Size: |K_i| = |S_i| + Σ_{P_j ∈ Γ_i} |X_{j,i}|

• Average: K̄_i = (|S_i| / |K_i|) S̄_i + Σ_{P_j ∈ Γ_i} (|X_{j,i}| / |K_i|) X̄_{j,i}

Definition 3.5.17 (Average and size of agreement). The size and average of the agreement A_{i,j} are defined as:

• Size: |A_{i,j}| = |X_{i,j}| + |X_{j,i}|

• Average: Ā_{i,j} = (|X_{i,j}| / |A_{i,j}|) X̄_{i,j} + (|X_{j,i}| / |A_{i,j}|) X̄_{j,i}

Definition 3.5.18 (Average and size of withheld knowledge). The size and average of the withheld knowledge W_{i,j} are defined as:

• Size: |W_{i,j}| = |K_i| − |A_{i,j}|

• Average: W̄_{i,j} = (|K_i| / |W_{i,j}|) K̄_i − (|A_{i,j}| / |W_{i,j}|) Ā_{i,j}, or W̄_{i,j} = 0 in case |W_{i,j}| = 0

Definition 3.5.19 (Average and size of global input). The size and average of the global input G are defined as:

• Size: |G| = Σ_{i=1}^n |S_i|

• Average: Ḡ = (1/|G|) Σ_{i=1}^n |S_i| S̄_i
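Because every set is carried only as an (average, size) pair, unions and differences become weighted averages. A minimal NumPy sketch of Definitions 3.5.13–3.5.18 follows; the function names are ours.

import numpy as np

def merge(avg_x, n_x, avg_y, n_y):
    """Statistics of the union of two disjoint sets, each given as (average, size)."""
    n = n_x + n_y
    if n == 0:
        return np.zeros_like(avg_x), 0
    return (n_x * avg_x + n_y * avg_y) / n, n

def knowledge(avg_S, n_S, received):
    """K_i's statistics from S_i and the X_{j,i} received from each neighbor."""
    avg_K, n_K = avg_S, n_S
    for avg_X, n_X in received.values():
        avg_K, n_K = merge(avg_K, n_K, avg_X, n_X)
    return avg_K, n_K

def withheld(avg_K, n_K, avg_A, n_A):
    """W_{i,j} = K_i \\ A_{i,j}, computed purely from the statistics."""
    n_W = n_K - n_A
    if n_W == 0:
        return np.zeros_like(avg_K), 0
    return (n_K * avg_K - n_A * avg_A) / n_W, n_W

# S_1 = {(2,6), (3,8), (1,4)} has statistics ((2,6), 3); with nothing received,
# the knowledge of P_1 is just its local data:
print(knowledge(np.array([2.0, 6.0]), 3, received={}))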


3.5.4 Problem Definition

Problem 1: Given a function F : R^d → O and a time-varying data set S_i at each peer P_i, we are interested in computing the value of F(Ḡ) at every peer P_i.

Considering the dynamic and distributed nature of the data, accurate computation of F(Ḡ) can be very difficult. This may be because, for example, Ḡ changes faster than the rate at which data can propagate through the network. We therefore pose a quality standard and a correctness standard suitable for this scenario.

Definition 3.5.20 (Accuracy). Given an algorithm for computing F(Ḡ), the accuracy of the algorithm is the fraction of peers which compute the correct outcome F(Ḡ) at any time t.

Definition 3.5.21 (Eventually correct). An algorithm which converges to a termination state is eventually correct if it guarantees that in every termination state every peer computes the correct outcome F(Ḡ), where the correct outcome is the one computed on all the data of all the peers.

3.5.5 Illustration

Having stated the notation and the problem definition, we now present an illustrative example.

Let there be two peers P_1 and P_2 connected to each other such that Γ_1 = {P_2} and Γ_2 = {P_1}. The local data sets S_1 and S_2 of P_1 and P_2 consist of vectors in R^2. Let

• S_1 = {(2,6), (3,8), (1,4)}

• S_2 = {(7,9), (10,12)}

Furthermore, for simplicity, we consider an initialization state of the peers in which X_{1,2} = X_{2,1} = ∅.

We first calculate the knowledge, agreement, and withheld knowledge of P_1:


• Sets for P_1:

  – K_1 = S_1 ∪ X_{2,1} = {(2,6), (3,8), (1,4)}

  – A_{1,2} = X_{1,2} ∪ X_{2,1} = ∅

  – W_{1,2} = K_1 \ A_{1,2} = {(2,6), (3,8), (1,4)}

• Statistics for P_1:

  – |S_1| = 3

  – S̄_1 = (1/|S_1|) Σ_{x ∈ S_1} x = (2,6)

  – |K_1| = |S_1| + |X_{2,1}| = 3

  – K̄_1 = (|S_1| / |K_1|) S̄_1 = (2,6)

  – |A_{1,2}| = |X_{1,2}| + |X_{2,1}| = 0

  – Ā_{1,2} = ∅

  – |W_{1,2}| = |K_1| − |A_{1,2}| = 3

  – W̄_{1,2} = (|K_1| / |W_{1,2}|) K̄_1 − (|A_{1,2}| / |W_{1,2}|) Ā_{1,2} = (2,6)

Similarly, for P_2 we can write:

• Sets for P_2:

  – K_2 = {(7,9), (10,12)}

  – A_{2,1} = ∅

  – W_{2,1} = {(7,9), (10,12)}

• Statistics for P_2:

  – |S_2| = 2


  – S̄_2 = (8.5, 10.5)

  – |K_2| = 2

  – K̄_2 = (8.5, 10.5)

  – |A_{2,1}| = 0

  – Ā_{2,1} = ∅

  – |W_{2,1}| = 2

  – W̄_{2,1} = (8.5, 10.5)

The global input is G = {(2,6), (3,8), (1,4), (7,9), (10,12)}, and its statistics are:

• |G| = 5

• Ḡ = (4.6, 7.8)

We may be interested in finding out whether Ḡ lies inside a square R_s centered at (0,0) with side length 16, as shown in Figure 3.2. The output is 0 if Ḡ ∈ R_s and 1 otherwise. Also shown in the figure are C_ω = {R_s, R_1, R_2, R_3, R_4}, K̄_1, K̄_2, and Ḡ. It is evident from the figure that Ḡ ∈ R_s and K̄_1 ∈ R_s, but K̄_2 ∉ R_s. Therefore the correct answer is 0, yet peer P_1 will output 0 while P_2 will output 1. The algorithms we describe in this dissertation enable every peer to produce the correct outcome for Ḡ by exchanging only the statistics, and not the raw sets S_1 and S_2.

For a problem in R^d, the knowledge, agreement, and withheld knowledge can be evaluated in a similar fashion. However, it becomes significantly more difficult to find the cover C_ω for an arbitrary F in R^d, and the communication complexity of our algorithm may suffer from an inefficient choice of C_ω.

The next section presents a theorem (Theorem 3.6.1) which allows a peer to decide whether Ḡ resides inside a convex region of C_ω based only on local information, i.e., its knowledge, agreement, and withheld knowledge.


FIG. 3.2. An example showing the cover and the vectors Ḡ, K̄_1, and K̄_2. Each shaded region is a convex region, as is the unshaded region inside the square.

3.6 Main Theorems

The main theorems of the DeFraLC framework concern the eventual correctness of local algorithms. Since eventual correctness is concerned with termination states, the state of the algorithm can be expressed in terms of the variables of each peer alone. Local algorithms rely on conditions which permit extending rules defined on the states of the different peers to rules defined on the global data G. Specifically, the theorems described here state conditions on K̄_i, Ā_{i,j}, and W̄_{i,j} under which it can be determined whether Ḡ resides inside a specific convex region R ∈ C_ω.

Definition 3.6.1 (Unification of peers). Let P_i and P_j be two neighboring peers. The process of deleting P_j by combining its knowledge K̄_j, agreement Ā_{j,i}, and withheld knowledge W̄_{j,i} with those of P_i is known as unification.


Theorem 3.6.1. [Convex Stopping Rule] Let P_1, ..., P_n be a set of peers connected to each other via a communication tree G(V,E). Let S_i, X_{i,j}, G, K_i, A_{i,j}, and W_{i,j} be as defined in the previous section. Let R be an arbitrary convex region in R^d. If at time t no messages traverse the network, and for each P_i, K̄_i ∈ R, and for every P_j ∈ Γ_i, Ā_{i,j} ∈ R and either W̄_{i,j} ∈ R or W_{i,j} = ∅, then Ḡ ∈ R.

Proof. Assume an arbitrary P_i sends all of the vectors in W_{i,j} to its neighbor P_j and receives W_{j,i} in return. The new knowledge of P_i is K'_i = K_i ∪ W_{j,i}, with average vector K̄'_i. If W_{j,i} = ∅ then of course K'_i = K_i. Otherwise, since in a tree W_{j,i} ∩ K_i = ∅, K̄'_i can be rewritten as a weighted average of the two vectors, K̄'_i = α K̄_i + (1−α) W̄_{j,i} for some α ∈ [0,1]. Since both K̄_i and W̄_{j,i} are in R, and since R is convex, any weighted average of them is also in R. It follows that K̄'_i is in R too.

The new withheld knowledge of P_i with respect to any other neighbor P_k, W'_{i,k}, is equal to the withheld knowledge prior to the unification plus the added vectors: W'_{i,k} = W_{i,k} ∪ W_{j,i}. Again, if W_{j,i} = ∅ then W'_{i,k} = W_{i,k}. Otherwise, since both W̄_{i,k} and W̄_{j,i} are in R, and since W_{j,i} ∩ W_{i,k} = ∅, it follows that the average of W_{i,k} ∪ W_{j,i}, which equals α W̄_{i,k} + (1−α) W̄_{j,i} for some α ∈ [0,1], is in R too.

Note that A_{i,k} does not change, because no message is sent between P_i and P_k. It follows that P_j can be eliminated by connecting all of its other neighbors to P_i and setting the agreement of P_i with each new neighbor to the previous agreement that neighbor had with P_j.

Similarly, consider P_z, any other neighbor of P_i. By the same argument, P_z can be unified with P_i and hence deleted. Continuing this process, we are eventually left with a single peer P_i with K̄_i = Ḡ. Since at every step of the induction K̄_i remained inside R, Ḡ must be in R.

Theorem 3.6.1 can be explained using Figure 3.3. The top row of the figure shows the states of the peers P_1, P_2, P_3, and P_4 at some time instance. Note that all the vectors are inside some R (in this case, a circle).


FIG. 3.3. Vectors of four peers before and after Theorem 3.6.1 is applied. The knowledge, the agreement, and the withheld knowledge are shown. The top row represents the initial states of the four peers; the bottom row depicts what happens when P_2 sends all its data to P_3.


Let us suppose that P_2 and P_3 exchange all of their data. The new knowledge of P_3, K̄'_3, is a convex combination of K̄_3 and W̄_{2,3}. The new withheld knowledge of P_3 with respect to P_4, W̄'_{3,4}, is a convex combination of W̄_{3,4} and W̄_{2,3}. Similarly, W̄'_{3,1} is a convex combination of W̄_{3,2} and W̄_{2,1}. The agreement Ā_{3,1} is equal to the agreement P_2 had with P_1, and hence equals Ā_{2,1}. The other agreements of P_3 do not change, since no data is exchanged with those neighbors. Since K̄'_3, W̄'_{3,4}, and W̄'_{3,1} all lie inside the same R, P_2 can be eliminated by joining the neighbors of P_2 to P_3. Repeating this elimination for P_1 and P_4 eventually leaves a single peer whose knowledge equals Ḡ ∈ R.

The significance of Theorem 3.6.1 is that under the conditions described above, P_i can stop sending messages to its neighbors and output F(K̄_i). If the current state of the system is indeed a termination state, then Theorem 3.6.1 guarantees that this is the correct solution. Otherwise, either a message is traversing the network, or there is some peer P_k for which the condition does not hold. Eventually, P_k will send a message which, once received, may change the knowledge of P_i, and thus eventual correctness is guaranteed.

Theorem 3.6.1 discusses only a single region. However, it is fairly easy to extend it to a collection of non-overlapping regions.

Lemma 3.6.2. Let C_ω = {R_1, R_2, ..., R_n} be a set of convex non-overlapping regions in R^d. If, for every peer P_i, the region R_z ∈ C_ω containing K̄_i also contains W̄_{i,j} and Ā_{i,j} for every neighbor P_j ∈ Γ_i, then for any peer P_k the knowledge K̄_k ∈ R_z.

Proof. Let R_z(x) denote the region containing a vector x. Since the state of the system is a termination state, A_{i,j} = A_{j,i}. Thus, for every two neighbors P_i and P_j, R_z(K̄_i) = R_z(Ā_{i,j}) = R_z(Ā_{j,i}) = R_z(K̄_j). Assuming the communication graph is connected, this equivalence carries over to every peer in the network.

Lemma 3.6.2 states a condition under which peers can agree on the region which includes Ḡ. Each peer chooses R according to its knowledge K̄_i and applies the condition of Theorem 3.6.1 to R. Agreement on the region also implies agreement on the value of F(Ḡ).


Illustration. Recall from Section 3.5.5 that

• K̄_1 = (2,6)

• Ā_{1,2} = ∅

• W̄_{1,2} = (2,6)

• K̄_2 = (8.5, 10.5)

• Ā_{2,1} = ∅

• W̄_{2,1} = (8.5, 10.5)

• Ḡ = (4.6, 7.8)

We may be interested in whether Ḡ lies inside a circle of radius 10 centered at (0,0), i.e., whether ‖Ḡ‖ = 9.055 < 10. In this case the output is true. Also, ‖K̄_1‖ = 6.33 < 10, but ‖K̄_2‖ = 13.51 > 10. Therefore, if the peers decide the output based solely on their own knowledge K̄, P_2's result will be wrong.

Now let us apply the conditions of Theorem 3.6.1 to the vectors of P_1 and P_2, with the region R being the circle of radius 10. For peer P_1:

• ‖K̄_1‖ = 6.33: inside the circle

• ‖Ā_{1,2}‖ = 0: inside the circle

• ‖W̄_{1,2}‖ = 6.33: inside the circle

Therefore the conditions of the theorem dictate that this is a termination state for peer P_1.
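The check a peer performs here is mechanical. A minimal Python sketch for this particular circular region follows; the function and variable names are ours, and an empty agreement or withheld set is passed as None and imposes no constraint, which matches the ‖·‖ = 0 entries above.

import math

def inside(v, radius=10.0):
    """Membership of an average vector in the circular region R."""
    return math.hypot(*v) < radius

def can_stop(K, neighbors, radius=10.0):
    """Conditions of Theorem 3.6.1 at one peer, for a circle of the given radius.

    neighbors: list of (A_ij, W_ij) average vectors, one pair per neighbor;
    an empty agreement or withheld set is passed as None.
    """
    if not inside(K, radius):
        return False
    for A, W in neighbors:
        if A is not None and not inside(A, radius):
            return False
        if W is not None and not inside(W, radius):
            return False
    return True

print(can_stop((2.0, 6.0), [(None, (2.0, 6.0))]))    # True:  P_1 may stop
print(can_stop((8.5, 10.5), [(None, (8.5, 10.5))]))  # False: P_2 must send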


Similarly, for peer P_2:

• ‖K̄_2‖ = 13.51: outside the circle

• ‖Ā_{2,1}‖ = 0: inside the circle

• ‖W̄_{2,1}‖ = 13.51: outside the circle

In this case the conditions of the theorem state that P_2 has not reached a termination state. Therefore P_2 will send messages which will change its state. In Section 3.7.3 we show what message is sent and how each peer outputs the correct result.

We are now in a position to present our local P2P generic algorithm (PeGMA) for computing complex functions in large distributed systems.

3.7 Peer-to-Peer Generic Algorithm (PeGMA) and its Instantiations

This section describes PeGMA, a generic local P2P algorithm for computation in large distributed systems. It relies on the results presented in the previous section to compute the value of a given function of the average of the input vectors. As proved below, PeGMA is both local and eventually correct. The section then exemplifies how PeGMA can be used to solve interesting data mining problems by instantiating it for two different functions: (1) a function that outputs 0 if the norm of the average vector in R^d is less than ε, and 1 otherwise; and (2) a function that outputs 0 inside a Pacman shape in R^2 and 1 outside of it.

Note that for all these algorithms, one of the inputs is the ω-cover. Finding an algorithm which outputs an ω-cover for any function F is an open research issue. The finite element method (FEM) [55], using a quadtree (in R^2) or an octree (in R^3), can be used to split an arbitrary F into convex regions; an example is shown in Figure 3.4. However, the cost of building the tree increases exponentially with the dimension of the problem. Randomized algorithms for generating an ω-cover offer better complexity, as we show in Section 3.7.3.


FIG. 3.4. An arbitrary F subdivided using a quadtree data structure.

3.7.1 PeGMA: Algorithmic Details

PeGMA, depicted in Algorithm 1, receives as input the function F, the ω-cover C_ω (a set of non-overlapping convex regions), and a constant L whose role is explained below. Each peer outputs, at every given time, the value of F based on its knowledge K̄_i.

The algorithm is event driven. An event can be one of the following: a message from a neighboring peer, a change in the set of neighbors (e.g., due to failure or recovery), a change in the local data, or the expiry of a timer which is always set to at most L. On any such event, P_i calls the OnChange method. If the event is a message (X̄, |X|) received from a neighbor P_j, P_i updates X̄_{j,i} to X̄ and |X_{j,i}| to |X| before calling OnChange.

The objective of the OnChange method is to make certain that the conditions of Theorem 3.6.1 are maintained at the peer that runs it. These conditions require K̄_i, Ā_{i,j}, and W̄_{i,j} (when the latter is not empty) to all be in the same convex region. Since K̄_i cannot be manipulated by the peer, it dictates the active region R. K̄_i can also lie outside all of the regions of C_ω (when it is in a tie area). Since Theorem 3.6.1 provides no guarantee unless K̄_i is in a convex region, eventual correctness can only be guaranteed in this case if the peer floods any withheld knowledge it has to its neighbors.


Otherwise, the conditions under which data must be sent are two: (1) either the agreement Ā_{i,j} or the withheld knowledge W̄_{i,j} is outside R; or (2) the knowledge K̄_i differs from the agreement Ā_{i,j} but the size of the withheld knowledge is zero. The latter can only occur if there is no withheld knowledge at all and the data then changes.

It should be stressed that in any case other than those described above, the peer need not do anything, even if its data changes. The peer can rely on the results of the previous section, which ensure that if F(K̄_i) is not the correct answer, then eventually one of its neighbors will send it new data and change K̄_i. If, on the other hand, one of the aforementioned cases does occur, then P_i sends a message. This is performed by the SendMessage method. If K̄_i is in a tie area, then P_i simply sends all of its withheld data. Otherwise, a message is computed which ensures that Ā_{i,j} and W̄_{i,j} are in R. PeGMA does not specify exactly how this is done. Nevertheless, one way is evident: sending all of the withheld knowledge would set Ā_{i,j} to K̄_i (which, by the way R is selected, is in R) and W_{i,j} to the empty set.

One last mechanism employed in the algorithm is a "leaky bucket" mechanism, which makes certain that no two messages are sent within a period shorter than a constant L. Leaky buckets are often used in asynchronous, event-based systems to prevent message explosion. Every time a message needs to be sent, the algorithm checks how much time has passed since the last one was sent. If that time is less than L, the algorithm sets a timer for the remainder of the period and calls OnChange again when the timer expires. Note that this mechanism does not enforce any kind of synchronization on the system, and it has nothing to do with correctness; at most it may delay convergence. Figure 3.5 shows the flowchart of PeGMA.
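The leaky bucket amounts to a few lines in practice. The sketch below is a self-contained Python illustration with names of our choosing; in PeGMA the retry callback would re-invoke OnChange when the timer expires.

import time

class LeakyBucket:
    """Allow at most one message per period of L time units."""

    def __init__(self, L):
        self.L = L
        self.last_message = -float("inf")

    def try_send(self, send, retry):
        elapsed = time.time() - self.last_message
        if elapsed >= self.L:
            self.last_message = time.time()
            send()
        else:
            # Too soon: ask the caller to fire OnChange after the remainder.
            retry(self.L - elapsed)

bucket = LeakyBucket(L=2.0)
bucket.try_send(lambda: print("sent"), lambda d: print(f"retry in {d:.1f}s"))
bucket.try_send(lambda: print("sent"), lambda d: print(f"retry in {d:.1f}s"))
# The first call sends; the second is deferred because less than L has elapsed.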


Input: F, C_ω, L, S_i, and Γ_i
Output: F(K̄_i)
Data structure: for each P_j ∈ Γ_i: X̄_{i,j}, |X_{i,j}|, X̄_{j,i}, |X_{j,i}|, last_message

if MessageRecvd(P_j, X̄, |X|) then X̄_{j,i} ← X̄ and |X_{j,i}| ← |X|
if S_i, Γ_i, K̄_i or |K_i| changes then call OnChange()

Function OnChange()
    set R ← R_j if there exists R_j ∈ C_ω with K̄_i ∈ R_j, and R ← ∅ otherwise
    for each P_j ∈ Γ_i do
        if [R = ∅ ∧ (Ā_{i,j} ≠ K̄_i ∨ |A_{i,j}| ≠ |K_i|)]
           ∨ [|W_{i,j}| = 0 ∧ Ā_{i,j} ≠ K̄_i]
           ∨ [Ā_{i,j} ∉ R ∨ W̄_{i,j} ∉ R] then
            call SendMessage(P_j)

Function SendMessage(P_j)
    if time() − last_message ≥ L then
        compute new X̄_{i,j} and |X_{i,j}| such that K̄_i, Ā_{i,j}, and W̄_{i,j} are all in R
        if no such X̄_{i,j} and |X_{i,j}| exist then
            X̄_{i,j} ← (|K_i| K̄_i − |X_{j,i}| X̄_{j,i}) / (|K_i| − |X_{j,i}|); |X_{i,j}| ← |K_i| − |X_{j,i}|
        last_message ← time()
        send (X̄_{i,j}, |X_{i,j}|) to P_j
    else wait L − (time() − last_message) time units and then call OnChange()

Algorithm 1: P2P Generic Local Algorithm (PeGMA)


FIG. 3.5. Flowchart of PeGMA.

3.7.2 PeGMA: Correctness and Locality

This section proves that PeGMA is both local and eventually correct. In order to show eventual correctness, we need to prove that if, at some point in time τ, both the underlying communication graph and the data at every peer cease to change, then after some finite time every peer will output the correct result F(Ḡ); and that this holds for any static communication tree G(V,E), any static data S_i at the peers, and any possible state of the peers at time τ.

Theorem 3.7.1. [Correctness] The Peer-to-Peer Generic Algorithm (PeGMA) is eventually correct.

Proof. Regardless of the state of K_i, A_{i,j}, and W_{i,j}, the algorithm continues to send messages, accumulating more and more of G in each K_i, until one of two things happens: either K_i = G for every peer, or for every P_i all of K̄_i, Ā_{i,j}, and W̄_{i,j} are in the same R_l ∈ C_ω.


In the former case, K̄_i = Ḡ, so every peer obviously computes F(K̄_i) = F(Ḡ). In the latter case, Theorem 3.6.1 dictates that Ḡ ∈ R_l, so F(K̄_i) = F(Ḡ) as well.

Lastly, we claim that PeGMA is (α,γ)-local. The art of measuring the (α,γ)-locality of algorithms is in its infancy. An attempt has been made to define locality with respect to the veracity radius of an aggregation problem [15]; however, this method does not extend well to algorithms that contain randomness (e.g., in message scheduling) or to dynamic data and topology. Within the (α,γ) framework defined earlier, there always exist problem instances for which any eventually correct algorithm (e.g., [116][196][195][119][163], including all of the algorithms described in this dissertation) has worst-case γ = O(n) (as shown in Theorem 3.7.3), where n is the size of the network. While O(n) is an upper bound on the communication complexity, more accurate bounds on γ can be developed by identifying specific problems and input instances. We feel there is an intrinsic relation between γ and ε — for example, increasing ε decreases γ — though this needs further investigation and constitutes part of the future work of this dissertation. Moreover, another factor which actively controls γ is the area of the tie region; as we describe here, this area can be made arbitrarily small by introducing more convex regions. An appropriate measure of accuracy, therefore, can be the ratio of the area spanned by the cover to that of the tie region.

Lemma 3.7.2. In a two-node network consisting of P_i and P_j, the maximum number of messages exchanged in order to reach consensus on the correct output is 2.

Proof. Using the notation defined earlier, let K̄_i ∈ R_k, K̄_j ∈ R_l, and Ḡ ∈ R_m, where R_m, R_k, R_l ∈ C_ω and m ≠ k ≠ l. Consider an initialization state in which X̄_{i,j} = X̄_{j,i} = 0 (nothing exchanged), so that Ā_{i,j} = Ā_{j,i} = 0. In this case the condition of Theorem 3.6.1 holds for neither P_i nor P_j. P_i will send all of its data (which is K̄_i) to P_j, which enables P_j to compute Ḡ correctly (since Ḡ is a convex combination of K̄_i and K̄_j). On receiving K̄_i from P_i, P_j applies the conditions of Theorem 3.6.1. Since now K̄_j = Ḡ ∈ R_m but Ā_{j,i} = K̄_i ∈ R_k, the condition of the theorem dictates that P_j send a message to P_i, and it sends all the data it has not received from P_i, i.e., its original K̄_j.


At this point both P_i and P_j have both K̄_i and K̄_j; hence both can compute Ḡ correctly, and the number of messages exchanged is 2.

It can also happen that K̄_i or K̄_j, or both, lie inside a tie region. In this case the peers exchange all of their data, and by an argument similar to the one above, the total number of messages exchanged is again 2.

Our next theorem bounds the total number of messages sent by PeGMA. Because of the dependence on the data, counting the number of messages in a data-independent manner for such an asynchronous algorithm seems extremely difficult. Therefore, in the following theorem (Theorem 3.7.3) we find an upper bound on the number of messages exchanged by any peer when the data of all the peers changes.

Theorem 3.7.3. [Communication Complexity] Let S_t be a state of the network at time t in which, for every P_i, K̄_i ∈ R_l for some R_l ∈ C_ω; hence Ḡ ∈ R_l as well, and the peers have converged to the correct result. Let the data of every peer change at time t' > t. Without loss of generality, assume that at time t', K̄_i ∈ R_i with each R_i ∈ C_ω, and that Ḡ ∈ R_g with g ∉ {1, ..., n}. Then the maximum number of messages sent by any peer P_i in order to ensure K̄_i ∈ R_g is (n−1) × (|Γ_i| − 1).

Proof. The output of each peer is guaranteed to be correct only when each K̄_i = Ḡ. This happens only when each P_i has communicated, directly or indirectly, with every peer in the network, i.e., when K_i accumulates the data of all n peers. Since PeGMA communicates only with immediate neighbors, in the worst case a peer P_i is updated with the contribution of each other peer one at a time. Every time P_i receives such an update, it communicates with all of its neighbors except the one from which the update arrived. This process can be repeated, in the worst case, (n−1) times in order to receive all the contributions, and at each such update P_i communicates with |Γ_i| − 1 neighbors. Therefore the total number of messages sent by P_i is (n−1) × (|Γ_i| − 1).


Our next theorem shows that PeGMA is (O(1), O(n))-local.

Theorem 3.7.4. [Locality] PeGMA is (O(1), O(n))-local.

Proof. By design, PeGMA communicates only with the immediate neighbors of a peer; hence α = 1. From Theorem 3.7.3 we know that γ = O(n). Hence PeGMA is (O(1), O(n))-local.

While the worst-case γ is O(n), there exist many interesting problem instances for which γ is small and independent of the size of the network, as corroborated by our extensive experimental results. Furthermore, the cases in which communication becomes global can be controlled to any desired degree by reducing the area of the tie region.

3.7.3 Local L2 Norm Thresholding

Following the description of PeGMA, specific algorithms can be implemented for various functions F. One of the most interesting functions is the thresholding of the L2 norm of the average vector, i.e., deciding whether ‖Ḡ‖ ≤ ε, where ε is a user-chosen threshold. Geometrically, this means deciding whether Ḡ lies inside a circle of radius ε centered at (0,0).

To produce a specific algorithm from PeGMA, the following two steps need to be taken:

1. An ω-cover by non-overlapping convex regions, on each of which F is invariant, needs to be found.

2. A method needs to be formulated for finding X̄_{i,j} and |X_{i,j}| which ensures that both Ā_{i,j} and W̄_{i,j} are in R.

The case of the L2 thresholding problem in R^2 is illustrated in Figure 3.6. Obviously, the area for which F outputs true — the inside of an ε-circle — is convex. This area is denoted by R_in.


FIG. 3.6. (A) The area inside an $\epsilon$-circle. (B) A random vector. (C) A tangent defining a half-space. (D) The areas between the circle and the union of half-spaces are the tie areas.

This convex area is denoted by $R_{in}$. The area outside the $\epsilon$-circle can be divided by randomly selecting unit vectors $\hat{u}_1, \dots, \hat{u}_l$ and then drawing the half-spaces $H_j = \{\vec{x} : \vec{x}\cdot\hat{u}_j \ge \epsilon\}$. Each half-space is a convex region and lies entirely outside the circle. However, these half-spaces overlap. Therefore, non-overlapping regions are defined by removing from each half-space the points that belong to the next half-space in order: $R_j = \{\vec{x} : \vec{x}\cdot\hat{u}_j \ge \epsilon \wedge \vec{x}\cdot\hat{u}_{(j+1) \bmod l} < \epsilon\}$. By increasing $l$, the area between the half-spaces and the circle — the tie area — can be made as small as desired. Thus, $R_{in}, R_1, \dots, R_l$ constitute an $\omega$-cover. Since the domain can be infinite, we make sure the tie area is less than an $\omega$ fraction of the area of the $\epsilon$-circle, and thus clearly less than an $\omega$ fraction of the entire domain. The pseudocode of the cover for a sphere in $\mathbb{R}^d$ is given in Algorithm 2.

The running time of the algorithm is $O(l \times d)$. This is a huge improvement over the exponential time complexity induced by the quadtree/octree data structure discussed earlier.


Input: dimension of the problem $d$, number of unit vectors required $l$
Output: $C_\omega$
1. Generate $l$ random vectors: $\vec{r}_1 = \{r_{11}, r_{12}, \dots, r_{1d}\}$, $\vec{r}_2 = \{r_{21}, r_{22}, \dots, r_{2d}\}$, ..., $\vec{r}_l = \{r_{l1}, r_{l2}, \dots, r_{ld}\}$.
2. Normalize these vectors: $\hat{u}_i = \vec{r}_i / \sqrt{r_{i1}^2 + r_{i2}^2 + \dots + r_{id}^2}$ for $i = 1, \dots, l$.
3. $R_{in} = \{\vec{x} : \|\vec{x}\| < \epsilon\}$
4. $R_i = \{\vec{x} : \vec{x}\cdot\hat{u}_i > \epsilon\}$
5. $C_\omega = \{R_{in}, R_1, \dots, R_l\}$

Algorithm 2: Generic cover for a sphere in $\mathbb{R}^d$.

We now describe how the SendMessage method computes a message that forces $\vec{A}_{i,j}$ and $\vec{W}_{i,j}$ into the region which contains $\vec{K}_i$ when they are not already in it. A related algorithm, Majority-Rule [196], suggests sending all of the withheld knowledge in this case. However, experiments with dynamic data hint that this method may be unfavorable. If all or most of the knowledge is sent and the data later changes, the withheld knowledge becomes the difference between the old and the new data. This difference tends to be far noisier than the original data. Thus, while the algorithm makes certain that $\vec{A}_{i,j}$ and $\vec{W}_{i,j}$ are brought into the same region as $\vec{K}_i$, it still makes an effort to maintain some withheld knowledge. Though it may be possible to optimize the size of $|W_{i,j}|$, we take the simple and effective approach of testing an exponentially decreasing sequence of $|W_{i,j}|$ values and choosing the first value satisfying the requirements for $\vec{A}_{i,j}$ and $\vec{W}_{i,j}$.

When a peer $P_i$ needs to send a message, it computes $\vec{X}$, the new value for $\vec{X}_{i,j}$, based on the contributions of all sources of input vectors other than $P_j$: $\vec{X} \leftarrow \frac{|K_i|\vec{K}_i - |X_{j,i}|\vec{X}_{j,i}}{|K_i| - |X_{j,i}|}$. Then it tests a sequence of values for $|X|$, the new weight $|X_{i,j}|$, first setting it to $(|K_i| - |X_{j,i}|)/2$ and then gradually increasing it until either $\vec{A}_{i,j}$ and $\vec{W}_{i,j} \in R$ or $|X| = |K_i| - |X_{j,i}|$.
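Algorithm 2 is easy to render concretely. The following is a minimal Python sketch of the cover construction together with a region-lookup helper; the names (make_cover, classify) and the NumPy representation are our own illustration, not part of the thesis pseudocode.

import numpy as np

def make_cover(d, l, seed=0):
    # Algorithm 2 sketch: l random unit vectors u_1..u_l covering the outside of the eps-sphere in R^d
    r = np.random.default_rng(seed).standard_normal((l, d))
    return r / np.linalg.norm(r, axis=1, keepdims=True)

def classify(x, units, eps):
    # Map x to its cover region: 'R_in', one of the non-overlapping slices, or the tie area
    x = np.asarray(x, float)
    if np.linalg.norm(x) < eps:
        return "R_in"
    dots = units @ x
    for j in range(len(units)):
        # R_j = points of half-space H_j not claimed by the next half-space in order
        if dots[j] >= eps and dots[(j + 1) % len(units)] < eps:
            return f"R_{j + 1}"
    return "tie"

Increasing $l$ shrinks the share of points that fall into the tie branch, matching the claim above that the tie area can be made arbitrarily small.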


Input: $\epsilon$, $L$, $S_i$, $\Gamma_i$, $C_\omega$
Output: 0 if $\|\vec{K}_i\| \le \epsilon$, 1 otherwise

Function ComputeNew($|X_{i,j}|$ and $\vec{X}_{i,j}$)
begin
  Let $R$ be the region selected by PeGMA;
  $\vec{X} \leftarrow \frac{|K_i|\vec{K}_i - |X_{j,i}|\vec{X}_{j,i}}{|K_i| - |X_{j,i}|}$;
  $w \leftarrow |X| \leftarrow |K_i| - |X_{j,i}|$;
  repeat
    $w \leftarrow \lfloor w/2 \rfloor$;
    $|X| \leftarrow |K_i| - |X_{j,i}| - w$;
    $\vec{A} = \frac{|X|\vec{X} + |X_{j,i}|\vec{X}_{j,i}}{|X| + |X_{j,i}|}$;  $|A| = |X| + |X_{j,i}|$;
    $\vec{B} = \frac{|K_i|\vec{K}_i - |X|\vec{X} - |X_{j,i}|\vec{X}_{j,i}}{|K_i| - |X| - |X_{j,i}|}$;  $|B| = |K_i| - |X| - |X_{j,i}|$;
  until $\vec{A}, \vec{B} \in R$ or $w = 0$;
  Return $|X|$ and $\vec{X}$ as the new $|X_{i,j}|$ and $\vec{X}_{i,j}$;
end
Other than these specifications, follow the rules specified in PeGMA.

Algorithm 3: Local L2 Thresholding.
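A compact Python sketch of the ComputeNew search follows; the function and parameter names are ours, `in_region` stands in for the membership test of the region $R$ chosen by PeGMA, and we assume $|K_i| > |X_{j,i}|$.

import numpy as np

def compute_new(K, k_w, Xji, xji_w, in_region):
    # Sketch of ComputeNew (Algorithm 3): halve the withheld weight w until both
    # the agreement A and the remainder B fall inside the chosen region R.
    X = (k_w * K - xji_w * Xji) / (k_w - xji_w)   # knowledge not received from P_j
    w = k_w - xji_w
    while True:
        w //= 2
        x_w = k_w - xji_w - w                      # candidate new weight |X_{i,j}|
        A = (x_w * X + xji_w * Xji) / (x_w + xji_w)
        if w == 0:                                  # everything is sent; nothing withheld
            return x_w, X
        B = (k_w * K - x_w * X - xji_w * Xji) / (k_w - x_w - xji_w)
        if in_region(A) and in_region(B):
            return x_w, X

The first iteration tests $|X| = \lceil(|K_i| - |X_{j,i}|)/2\rceil$ and subsequent iterations grow $|X|$ geometrically, exactly the exponentially decreasing sequence of withheld weights described above.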


Illustration. In this section we continue with our example. We want to test whether $\vec{G}$ lies inside a circle of radius 10. The messages exchanged by each peer are shown in Table 3.1. Note how, at convergence, the output of each peer is the correct global output.

3.7.4 Computing Pacman

We now show how PeGMA can be suited to another specific function — that of Pacman. The Pacman shape is a circle with one quadrant cut out (see Figure 3.7). Computing $F$ inside Pacman is significant since it shows how $F$ can be computed inside any arbitrary non-convex figure by decomposing the non-convex shape into convex regions.

First we define the $\omega$-cover for this problem. Let $C_\omega = \{R_{in}, R_1, \dots, R_l\}$ be the cover defined in the previous section for the purpose of L2 thresholding. The cover for the Pacman function is computed by removing $R_{in}$ from the list of covers and replacing it with three new regions:

1. $R_a = \{\vec{x} = (r\cos\beta, r\sin\beta) : 0 \le r \le \epsilon \wedge 0 \le \beta < \pi\}$;
2. $R_b = \{\vec{x} = (r\cos\beta, r\sin\beta) : 0 \le r \le \epsilon \wedge \pi \le \beta < \pi + \frac{\pi}{2}\}$;
3. $R_c = \{\vec{x} = (r\cos\beta, r\sin\beta) : 0 \le r \le \epsilon \wedge \pi + \frac{\pi}{2} \le \beta < 2\pi\}$.

The angle spanned by each of these regions is at most $\pi$. Hence each of these regions is convex. The value of $F$ in each region is constant (1 for the first two, and 0 for the last one). Thus the new cover is an $\omega$-cover, which can be written as $C_\omega = \{R_a, R_b, R_c, R_1, \dots, R_l\}$. In order to monitor $F$ inside the Pacman, the conditions of Theorem 3.6.1 are applied to each of the regions in $C_\omega$. The output of the Pacman monitoring algorithm is:

$$\text{Output} = \begin{cases} 0, & \text{if } \vec{K}_i \in \{R_c, R_1, \dots, R_l\};\\ 1, & \text{if } \vec{K}_i \in \{R_a, R_b\}. \end{cases}$$
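The decomposition of the Pacman disc into $R_a$, $R_b$, $R_c$ is purely angular, so the region lookup is a two-line polar-coordinate test. A small sketch (our own code, not thesis pseudocode):

import math

def pacman_region(x, y, eps):
    # Map a 2-D point to the convex pieces R_a, R_b, R_c of the Pacman cover.
    # Points outside the eps-circle are handled by the half-space slices R_1..R_l.
    r = math.hypot(x, y)
    if r > eps:
        return None
    beta = math.atan2(y, x) % (2 * math.pi)   # angle in [0, 2*pi)
    if beta < math.pi:
        return "R_a"                           # upper half-disc, F = 1
    if beta < 1.5 * math.pi:
        return "R_b"                           # third-quadrant wedge, F = 1
    return "R_c"                               # the cut-out quadrant, F = 0

For example, pacman_region(0.5, -0.5, 1.0) returns "R_c": the point lies in the fourth quadrant, the quadrant cut out of the circle.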


Peer $P_1$, time $t_0$: $\|\vec{K}_1\| = 6.33$, $\|\vec{A}_{1,2}\| = 0$, $\|\vec{W}_{1,2}\| = 6.33$. Output: true. All vectors are inside the circle, so no message is sent.

Peer $P_2$, time $t_0$: $\|\vec{K}_2\| = 13.51$, $\|\vec{A}_{2,1}\| = 0$, $\|\vec{W}_{2,1}\| = 13.51$. Output: false. Not all vectors are inside the circle, so a message is sent: $|X_{2,1}| = |K_2| - |X_{1,2}| = 2$, $\vec{X}_{2,1} = \vec{K}_2 - \vec{X}_{1,2} = (8.5, 10.5)$. Send $\vec{X}_{2,1}, |X_{2,1}|$ to $P_1$. After the message is sent: $\|\vec{K}_2\| = 13.51$, $\|\vec{A}_{2,1}\| = 13.51$, $\|\vec{W}_{2,1}\| = 0$.

Peer $P_1$, time $t_1$, when the message is received from $P_2$: $|K_1| = 3 + 2 = 5$, $\vec{K}_1 = \frac{1}{5}\{(6,18) + (17,21)\} = (4.6, 7.8)$, $\|\vec{K}_1\| = 9.055$ (inside). $|A_{1,2}| = 2$, $\vec{A}_{1,2} = (8.5, 10.5)$, $\|\vec{A}_{1,2}\| = 13.51$ (outside). $|W_{1,2}| = 5 - 2 = 3$, $\vec{W}_{1,2} = \frac{1}{3}\{(23,39) - (17,21)\} = (2, 6)$, $\|\vec{W}_{1,2}\| = 6.33$ (inside). Output: true. Not all vectors are inside the circle, so a message is sent: $|X_{1,2}| = |K_1| - |X_{2,1}| = 3$, $\vec{X}_{1,2} = \frac{|K_1|\vec{K}_1 - |X_{2,1}|\vec{X}_{2,1}}{|K_1| - |X_{2,1}|} = (2, 6)$. Send $\vec{X}_{1,2}, |X_{1,2}|$ to $P_2$. After the message is sent: $\|\vec{K}_1\| = 9.055$, $\|\vec{A}_{1,2}\| = 9.055$, $\|\vec{W}_{1,2}\| = 0$.

Peer $P_2$, time $t_2$, when the message is received from $P_1$: $|K_2| = 2 + 3 = 5$, $\vec{K}_2 = \frac{1}{5}\{(17,21) + (6,18)\} = (4.6, 7.8)$, $\|\vec{K}_2\| = 9.055$ (inside). $|A_{2,1}| = 5$, $\vec{A}_{2,1} = (4.6, 7.8)$, $\|\vec{A}_{2,1}\| = 9.055$ (inside). $|W_{2,1}| = 0$, $\vec{W}_{2,1} = \vec{0}$, $\|\vec{W}_{2,1}\| = 0$ (inside). Output: true. All vectors are inside the circle, so no message is sent.

Table 3.1. The messages exchanged by two peers $P_1$ and $P_2$ for the L2 thresholding algorithm.
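The arithmetic in Table 3.1 is easy to check mechanically. The following throwaway NumPy snippet (our own check, not part of the thesis) reproduces the norms reported at time $t_1$:

import numpy as np

local_sum = np.array([6., 18.])    # P_1's three local points, summed; mean norm = 6.33
recv_sum  = np.array([17., 21.])   # two points' worth of data received from P_2
K1 = (local_sum + recv_sum) / 5                 # -> (4.6, 7.8)
print(np.linalg.norm(local_sum / 3))            # 6.324..., P_1's knowledge at t_0 (inside)
print(np.linalg.norm(K1))                       # 9.055..., P_1's knowledge at t_1 (inside)
print(np.linalg.norm(recv_sum / 2))             # 13.509..., the agreement (outside)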


FIG. 3.7. A circle of radius $\epsilon$ circumscribing a Pacman shape. (A) Area inside the shape. (B) Area outside the shape. (C) Unit vectors defining the tangent lines. (D) Tangent lines defining the bounding polygon.

The same algorithm employed for L2 thresholding can be used for the Pacman problem without any additional change.

3.8 Reactive Algorithms

The previous section described PeGMA — an efficient local algorithm capable of computing complex functions defined on the average of the data even when the data and the system are constantly changing. In this section, we leverage this powerful tool to create a framework for producing and maintaining various data mining models. This framework is simpler than the current methodology of inventing a specific distributed algorithm for each problem.

The basic idea of the framework is to employ a simple, costly, and possibly inaccurate convergecast algorithm to sample data from the network to a central post and to compute, based on this "best-effort" sample, a data mining model.


The second part uses a broadcast technique to distribute this model in the network. The local algorithm is then used to monitor the quality of the model. If the model is not sufficiently accurate, or the data has changed to the degree that the model no longer fits it, the monitoring algorithm raises an alert and triggers another cycle of data collection. It is also possible to tune the algorithm by increasing the sample size if alerts are frequent and decreasing it when they are infrequent. Since the monitoring algorithm is eventually correct, eventual convergence to a sufficiently accurate model is very likely. Furthermore, when the data only goes through stationary changes, the monitoring algorithm triggers false alerts infrequently and hence can be extremely efficient. Thus, the overall cost of the framework is low.

Note that the communication complexity of the convergecast process is not local; rather, it is proportional to the size of the network. However, this process is only invoked if the data undergoes a change in the underlying distribution. If the number of such changes is low, the total cost is amortized over the stationary periods in which our algorithm is very efficient.

We describe four different instantiations of this basic framework, each highlighting a different aspect. First, we discuss the problem of computing the average input vector to a desired degree of accuracy. Second, we present an algorithm for computing a variant of the k-means clusters suitable for dynamic data. Third, we focus on the problem of computing the first eigenvector of the data. Finally, we discuss the problem of computing multivariate regression in P2P networks.

3.8.1 Mean Monitoring

The problem of monitoring the mean of the input vectors has direct applications to many data analysis tasks. The objective in this problem is to compute a vector $\vec{\mu}$ which is a good approximation of $\vec{G}$. Formally, we require that $\|\vec{G} - \vec{\mu}\| \le \epsilon$ for a desired value of $\epsilon$.


For any given estimate $\vec{\mu}$, monitoring whether $\|\vec{G} - \vec{\mu}\| \le \epsilon$ is possible via direct application of the L2 thresholding algorithm of Section 3.7.3. Every peer $P_i$ subtracts $\vec{\mu}$ from every input vector in $S_i$. Then the peers jointly execute L2 norm thresholding over the modified data. If the resulting average is inside the $\epsilon$-circle, then $\vec{\mu}$ is a sufficiently accurate approximation of $\vec{G}$; otherwise, it is not.

The basic idea of the Mean Monitoring algorithm is to employ a convergecast-broadcast process in which the convergecast part computes the average of the input vectors and the broadcast part delivers the new average to all the peers. The trick is that, before a peer sends the data it has collected up the convergecast tree, it waits for an indication that the current $\vec{\mu}$ is not a good approximation of the current data. Thus, when the current $\vec{\mu}$ is a good approximation, convergecast is slow and only progresses as a result of false alerts. During this time, the cost of the convergecast process is negligible compared to that of the L2 thresholding algorithm. When, on the other hand, the data does change, all peers alert almost immediately. Convergecast then progresses very fast, reaches the root, and initiates the broadcast phase. Hence a new $\vec{\mu}$, a more up-to-date estimate of $\vec{G}$, is delivered to every peer.

Algorithm Detail. The exact algorithm, described in Algorithm 4, adds very little to what is described above. The first addition is an alert mitigation constant, $\tau$, selected by the user. The idea is that an alert should persist for a given period of time before the convergecast advances. Experimental evidence suggests that setting $\tau$ to even a fraction of the average edge delay greatly reduces the number of convergecasts without incurring a significant delay in updating $\vec{\mu}$.

A second detail is the separation of the data used for alerting — the input of the L2 thresholding algorithm — from the data used for computing the new average. If the two are the same, then the new average tends to be biased. This is because an alert, and consequently an advancement in the convergecast, is bound to be more frequent when the local data is extreme.
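The reduction to L2 thresholding is mechanical: shift every local point by the candidate mean and hand the shifted set to the monitor. A minimal sketch under our own naming (shifted_input, locally_inside are illustrative, not the thesis's interfaces):

import numpy as np

def shifted_input(S_i, mu):
    # Mean monitoring reduction: peer P_i feeds (x - mu), for each x in S_i,
    # to its local L2 thresholding instance.
    return [x - mu for x in S_i]

def locally_inside(S_i, mu, eps):
    # Local proxy only: the real test is on the average of the shifted vectors
    # over *all* peers, which the L2 thresholding algorithm tracks jointly.
    K_i = np.mean(shifted_input(np.asarray(S_i, float), mu), axis=0)
    return np.linalg.norm(K_i) <= eps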


Thus, the initial data, and later on every new data point, is randomly associated with one of two datasets: $R_i$, which is used by the L2 thresholding algorithm, and $T_i$, on which the average is computed when the convergecast advances.

A third detail is the implementation of the convergecast process. First, every peer tracks changes in the knowledge of the underlying L2 thresholding algorithm. When that knowledge moves from inside the $\epsilon$-circle to outside it, the peer takes note of the time and sets a timer to $\tau$ time units. Whenever a timer expires, or a data message is received from one of its neighbors, $P_i$ first checks whether there is currently an alert and whether it was recorded $\tau$ or more time units ago. If so, it counts from how many of its neighbors it has received data messages. If it has received data messages from all of its neighbors, the peer moves to the broadcast phase, computes the average of its own data and the received data, and sends it to itself. If it has received data messages from all but one of its neighbors, then this one neighbor becomes the peer's parent in the convergecast tree; the peer computes the average of its own and its other neighbors' data, and sends this average, with its cumulative weight, to the parent. Then it moves to the broadcast phase. If it has received no data message from two or more of its neighbors, the peer keeps waiting.

Lastly, the broadcast phase is fairly straightforward. Every peer which receives the new $\vec{\mu}$ vector updates its data by subtracting it from every vector in $R_i$ and transfers those vectors to the underlying L2 thresholding algorithm. Then, if it is in the broadcast phase, it sends the new vector to its other neighbors and changes its status to convergecast. There is one situation in which a peer can receive a new $\vec{\mu}$ vector while it is still in the convergecast phase: when two neighboring peers concurrently become roots of the convergecast tree, i.e., when each of them concurrently sends the last convergecast message to the other. To break the tie, a root peer $P_i$ which receives $\vec{\mu}$ from a neighbor $P_j$ while in the convergecast phase compares $i$ to $j$. If $i > j$ it ignores the message; otherwise, $P_i$ treats the message just as it would in the broadcast phase.
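The per-peer convergecast rule boils down to counting silent neighbors. This sketch (hypothetical names, messaging omitted) captures only the decision logic just described:

def convergecast_step(received, neighbors, alert_age, tau):
    # Decide P_i's convergecast action. `received` maps each neighbor to its
    # (weight, vector) message or None; `alert_age` is time since the alert was raised.
    if alert_age is None or alert_age < tau:
        return "wait"                          # no sufficiently old alert yet
    missing = [p for p in neighbors if received.get(p) is None]
    if len(missing) == 0:
        return "become_root"                   # aggregate everything, start the broadcast
    if len(missing) == 1:
        return ("send_to_parent", missing[0])  # the one silent neighbor becomes the parent
    return "wait"                              # two or more neighbors still silent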


Figure 3.8 shows a snapshot of the convergecast-broadcast steps as they progress up the communication tree. In subfigure 3.8(a), the peers do not raise a flag and only run the local L2 algorithm. In subfigure 3.8(b), two leaves raise their flags and send their data up (to the parent), as shown by the arrows. Subfigure 3.8(c) shows an intermediate step in which the nodes keep waiting. Finally, in subfigure 3.8(d), the roots (two of them) become activated by exchanging data with each other. The new statistic is computed at both roots and broadcast; the root with the higher id gets to propagate its result.

3.8.2 k-Means Monitoring

We now turn to a more complex problem: computing the k-means of distributed data. The classic formulation of the k-means algorithm is a two-step recursive process in which every data point is first associated with the nearest of k centroids, and then every centroid is moved to the average of the points associated with it — until the average is the same as the centroid. To make the algorithm suitable for a dynamic setup, we relax the stopping criterion: in our formulation, a solution is considered admissible when the average of the points is within an $\epsilon$-distance of the centroid with which they are associated.

Similar to Mean Monitoring, the k-Means Monitoring algorithm (Algorithm 7) is performed in a cycle of convergecast and broadcast. The algorithm, however, differs in some important respects. First, instead of taking part in just one execution of L2 thresholding, each peer takes part in $k$ such executions — one per centroid. The input of the $\ell$-th execution is those points in the local data set $S_i$ for which the $\ell$-th centroid, $\vec{c}_\ell$, is the closest. Thus, each execution monitors whether one of the centroids needs to be updated. If even one execution discovers that the norm of the corresponding knowledge $\|\vec{K}_i^\ell\|$ is greater than $\epsilon$, the peer raises an alert. If the alert persists for $\tau$ time units, the peer advances to the convergecast process.


Input: $\epsilon$, $L$, $S_i$, $\Gamma_i$, $\vec{\mu}_0$, $\tau$
Output: $\vec{\mu}$ such that $\|\vec{G} - \vec{\mu}\| < \epsilon$
Data structure: $R_i$ and $T_i$; last_change; flags alert, root, and phase; for each $P_j \in \Gamma_i$, a vector $\vec{v}_j$ and a counter $c_j$.

Initialization:
begin
  Set $\vec{\mu} \leftarrow \vec{\mu}_0$, alert $\leftarrow$ false, phase $\leftarrow$ convergecast;
  Split $S_i$ evenly between $R_i$ and $T_i$;
  Initialize L2 with the input $\{\vec{x}_{i,1} - \vec{\mu}, \vec{x}_{i,2} - \vec{\mu}, \dots, \vec{x}_{i,B} - \vec{\mu} : \vec{x}_{i,j} \in R_i\}$;
  Set $\vec{v}_j, c_j$ to $\vec{0}, 0$ for all $P_j \in \Gamma_i$;
end

On addition of a new vector $\vec{x}$ to $S_i$:
begin
  Randomly add $\vec{x}$ to either $R_i$ or $T_i$ and remove the oldest point;
  if $\vec{x}$ was added to $R_i$ then pass $\vec{x} - \vec{\mu}$ to the L2 thresholding algorithm;
end

On change in $\vec{K}_i$ of the L2 thresholding algorithm:
begin
  if $\|\vec{K}_i\| \ge \epsilon$ and alert = false then
    last_change $\leftarrow$ time(); alert $\leftarrow$ true; timer $\leftarrow$ $\tau$;
  end
  if $\|\vec{K}_i\| < \epsilon$ then alert $\leftarrow$ false;
end

if DataReceived($P_j$, $\vec{v}$, $c$) then $\vec{v}_j \leftarrow \vec{v}$, $c_j \leftarrow c$; call Convergecast (see Algorithm 5);
if the timer expires then call Convergecast (see Algorithm 5);
if a new $\vec{\mu}'$ is received then call ProcessNewMeans($\vec{\mu}'$) (see Algorithm 6);

Algorithm 4: Mean Monitoring.


Function Convergecast
begin
  if alert = false then return;
  if time() $-$ last_change $< \tau$ then
    timer $\leftarrow$ time() $+ \tau -$ last_change; return;
  end
  if $c_k \neq 0$ for all $P_k \in \Gamma_i$ except one, $P_\ell$, then
    $s \leftarrow |T_i| + \sum_{P_j \in \Gamma_i} c_j$;
    $\vec{s} \leftarrow \frac{|T_i|}{s}\vec{T}_i + \sum_{P_j \in \Gamma_i} \frac{c_j}{s}\vec{v}_j$;
    Send $s, \vec{s}$ to $P_\ell$;
    Set root $\leftarrow$ false, phase $\leftarrow$ broadcast;
  end
  if $c_k \neq 0$ for all $P_k \in \Gamma_i$ then
    $s \leftarrow |T_i| + \sum_{P_j \in \Gamma_i} c_j$;
    $\vec{s} \leftarrow \frac{|T_i|}{s}\vec{T}_i + \sum_{P_j \in \Gamma_i} \frac{c_j}{s}\vec{v}_j$;
    root $\leftarrow$ true; phase $\leftarrow$ convergecast;
    Send $\vec{\mu}$ to all $P_k \in \Gamma_i$;
  end
end

Algorithm 5: Convergecast function for Mean Monitoring.

Function ProcessNewMeans($\vec{\mu}'$), on receipt from $P_j$
begin
  if (phase = broadcast $\vee$ phase = convergecast) $\wedge$ root = true $\wedge$ $j > i$ then
    $\vec{\mu} \leftarrow \vec{\mu}'$;
    Send $\vec{\mu}$ to all $P_k \neq P_j \in \Gamma_i$;
    Replace the input of the L2 thresholding algorithm with $\{\vec{x} - \vec{\mu} : \vec{x} \in R_i\}$;
    phase $\leftarrow$ convergecast;
  end
end

Algorithm 6: Function handling the receipt of a new mean $\vec{\mu}'$ for Mean Monitoring.


FIG. 3.8. Convergecast and broadcast through the different steps: (a) initial state, (b) activated leaves, (c) activated intermediate nodes, (d) activated roots. In subfigure 3.8(a), the peers do not raise a flag. In subfigure 3.8(b), the two leaves raise their flags and send their data up (to the parent), as shown by the arrows. Subfigure 3.8(c) shows an intermediate step. Finally, in subfigure 3.8(d), the roots (two of them) become activated by exchanging data with each other.


Another difference between k-Means Monitoring and the Mean Monitoring algorithm is the statistic collected during convergecast. In k-Means Monitoring, the statistic is a sample of size $b$ (dictated by the user) from the data. Each peer samples (with replacement) from the samples it received from its neighbors and from its own data, such that the probability of sampling a point is proportional to its weight. The result of this procedure is that every input point stands an equal chance of being included in the sample that arrives at the root. The root then computes a centralized k-means on the sample and sends the new centroids in a broadcast message.

3.8.3 Eigenvector Monitoring

The next algorithm described in this chapter monitors the first eigenvector of the data. In this problem we assume that the data at each peer, $S_i$, is a matrix. The objective of the algorithm is to compute the first eigenvector and the corresponding eigenvalue of the global data — the point-wise addition of all of the local matrices. For this purpose, the algorithm takes advantage of the well-established Power Method, a numeric method for approximating the first eigenvector-eigenvalue pair of a matrix $M$. It begins with a random guess $\vec{V}$, $\theta$ of the eigenvector-eigenvalue pair and keeps updating them until $\|M\vec{V} - \theta\vec{V}\| \le \epsilon$.

Given the formulation above, the computational task of eigenvector monitoring reverts exactly to L2 thresholding. For a given guess $\vec{V}$, $\theta$, the peers need to compute whether

$$\left\|\left[\textstyle\sum_i S_i\right]\vec{V} - \theta\vec{V}\right\| \le \epsilon \;\Longrightarrow\; \left\|\textstyle\sum_i \left[S_i\vec{V}\right] - \theta\vec{V}\right\| \le \epsilon \;\Longrightarrow\; \left\|\frac{1}{n}\textstyle\sum_i\left[S_i\vec{V}\right] - \frac{\theta}{n}\vec{V}\right\| \le \frac{\epsilon}{n}.$$

For every $P_i$, $S_i\vec{V}$ can be computed locally, and $\frac{1}{n}\sum_i[S_i\vec{V}]$ is simply the average of those vectors.
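A sketch of the local step under this reduction (our own illustration; S_i is the peer's local matrix as a NumPy array):

import numpy as np

def local_power_residual(S_i, V, theta, n):
    # Eigenvector monitoring reduction: each peer contributes S_i @ V, and the
    # network monitors whether the average of these residuals has norm <= eps/n.
    return S_i @ V - (theta / n) * V   # local input vector for the L2 instance

# Averaging the per-peer residuals over all i gives exactly
# (1/n) * sum_i S_i V - (theta/n) V, i.e., the quantity thresholded above.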


Input: $\epsilon$, $L$, $S_i$, $\Gamma_i$, an initial guess for the centroids $C_0$, $\tau$, $b$
Output: $k$ centroids such that the average of the points assigned to every centroid is within $\epsilon$ of that centroid.
Data structure of peer $P_i$: a partitioning of $S_i$ into $k$ sets $S_i^1, \dots, S_i^k$; $C = \{\vec{c}_1, \dots, \vec{c}_k\}$; for each centroid $j = 1,\dots,k$, a flag alert$_j$ and a timestamp last_change$_j$; for each neighbor, a dataset $B_j$ and a counter $b_j$; a flag root and a flag phase.

Initialization:
begin
  $C \leftarrow C_0$;
  for $j = 1,\dots,k$ do $S_i^j = \{\vec{x} \in S_i : \vec{c}_j = \arg\min_{\ell=1,\dots,k}\|\vec{x} - \vec{c}_\ell\|\}$;
  Initialize $k$ instances of the L2 thresholding algorithm, such that the $j$-th instance has input data $\{\vec{x} - \vec{c}_j : \vec{x} \in S_i^j\}$;
  for all $P_j \in \Gamma_i$ do $b_j \leftarrow 0$;
  for $j = 1,\dots,k$ do alert$_j$ $\leftarrow$ false; last_change$_j$ $\leftarrow -\infty$;
  root $\leftarrow$ false; phase $\leftarrow$ convergecast;
end

On addition of a new vector $\vec{x}$ to $S_i$:
begin
  Find the $\vec{c}_j$ closest to $\vec{x}$ and add $\vec{x} - \vec{c}_j$ to the $j$-th L2 instance;
end

On removal of a vector $\vec{x}$ from $S_i$:
begin
  Find the $\vec{c}_j$ closest to $\vec{x}$ and remove $\vec{x} - \vec{c}_j$ from the $j$-th L2 instance;
end

On change in $\vec{K}_i$ of the $j$-th instance of the L2 thresholding algorithm:
begin
  if $\|\vec{K}_i\| \ge \epsilon$ and alert$_j$ = false then
    last_change$_j$ $\leftarrow$ time(); alert$_j$ $\leftarrow$ true; timer $\leftarrow$ $\tau$ time units;
  end
  if $\|\vec{K}_i\| < \epsilon$ then alert$_j$ $\leftarrow$ false;
end

if DataReceived($P_j$, $B$, $b$) then $B_j \leftarrow B$, $b_j \leftarrow b$; call Convergecast (see Algorithm 8);
if the timer expires then call Convergecast (see Algorithm 8);
if a new $C'$ is received then call ProcessNewCentroids($C'$) (see Algorithm 9);

Algorithm 7: P2P k-Means Clustering.


Function Convergecast
begin
  if alert$_\ell$ = false for all $\ell \in [1,\dots,k]$ then return;
  $t \leftarrow \min_{\ell=1,\dots,k}\{\text{last\_change}_\ell : \text{alert}_\ell = \text{true}\}$;
  Let $A$ be a set of $b$ samples returned by Sample;
  if time() $< t + \tau$ then
    timer $\leftarrow$ $t + \tau -$ time(); return;
  end
  if $b_k \neq 0$ for all $P_k \in \Gamma_i$ except one, $P_\ell$, then
    root $\leftarrow$ false; phase $\leftarrow$ broadcast;
    Send $A$, $1 + \sum_m b_m$ to $P_\ell$; return;
  end
  if $b_k \neq 0$ for all $P_k \in \Gamma_i$ then
    $C' \leftarrow$ the centroids obtained by computing a k-means clustering of $A$;
    root $\leftarrow$ true;
    Send $C'$ to self; return;
  end
end

Function Sample
begin
  Return a random sample from $S_i$ with probability $1/\big(1 + \sum_{m=1,\dots,|\Gamma_i|} b_m\big)$, or from a dataset $B_j$ with probability $b_j/\big(1 + \sum_{m=1,\dots,|\Gamma_i|} b_m\big)$;
end

Algorithm 8: Convergecast function for P2P k-Means Monitoring.
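The Sample routine is just weight-proportional resampling. A compact sketch (our own code; each received batch B_j carries the cumulative weight b_j, and the local data counts with weight 1, following the $1 + \sum_m b_m$ convention above):

import random

def sample_one(S_i, batches):
    # Pick one point so every input point in the subtree has equal probability.
    # `batches` is a list of (B_j, b_j) pairs received from the neighbors.
    weights = [1] + [b for _, b in batches]
    idx = random.choices(range(len(weights)), weights=weights)[0]
    return random.choice(S_i) if idx == 0 else random.choice(batches[idx - 1][0])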


Function ProcessNewCentroids($C'$), on receipt from $P_j$
begin
  if (phase = broadcast $\vee$ phase = convergecast) $\wedge$ root = true $\wedge$ $j > i$ then
    $C \leftarrow C'$;
    for $j = 1,\dots,k$ do $S_i^j \leftarrow \{\vec{x} \in S_i : \vec{c}_j = \arg\min_{\ell=1,\dots,k}\|\vec{x} - \vec{c}_\ell\|\}$;
    for $j = 1,\dots,|\Gamma_i|$ do $b_j \leftarrow 0$;
    Restart all the L2 thresholding instances with the new centroids;
    Send $C$ to all $P_k \neq P_j \in \Gamma_i$;
    phase $\leftarrow$ convergecast;
  end
end

Algorithm 9: Function handling the receipt of new centroids $C'$ for P2P k-Means Monitoring.

On the other hand, $\vec{V}$ and $\frac{\theta}{n}\vec{V}$ are constants computed and distributed using broadcast. The problem is therefore equivalent to Mean Monitoring, with $S_i\vec{V}$ taking the place of $\vec{S}_i$, $\frac{\theta}{n}\vec{V}$ the place of $\vec{\mu}$, and $\frac{\epsilon}{n}$ the place of $\epsilon$.

The rest is very similar to k-Means Monitoring. The peers monitor the average vector for the current guess of $\vec{V}$ and $\theta$. If the norm of the resulting vector is greater than $\frac{\epsilon}{n}$, they advance the convergecast process, which samples the data to a root. The root then uses the sampled data to compute a new $\vec{V}$ and $\theta$, which are broadcast to all of the peers. Because of the similarity of this algorithm to k-Means Monitoring, we omit its pseudocode here.

3.8.4 Distributed Multivariate Regression

The last algorithm discussed in this chapter is a distributed multivariate regression (MR) algorithm. Since this algorithm differs in many respects from the others presented in this chapter, we start with some background material before describing the algorithm.

P2P multivariate regression considers building and updating regression models from data distributed over a P2P network where each peer contains a subset of the data tuples.


In the distributed data mining literature, this is usually called the horizontally partitioned or homogeneously distributed data scenario. Building a global regression model (defined on the union of all the data of all the peers) in large-scale networks, and maintaining it, is a vital task. Consider a network with many nodes in which each node receives a stream of tuples (sensor readings, music files, etc.) every few seconds, thereby generating a huge volume of data. We may wish to build a regression model on the global data to (1) compactly represent the data and (2) predict the value of a target variable. This is difficult since the data is distributed, and more so because it is dynamic. Centralization obviously does not work: the data may change faster than the rate at which it can be centralized, and centralizing it can be too costly. Local algorithms are an excellent choice in such scenarios, since in a local algorithm each peer computes the result based on information from only a handful of nearby neighbors. Hence local algorithms are highly scalable. Furthermore, they guarantee that once the computation terminates, each node will have the correct regression model. Therefore, such an algorithm enables the user to monitor regression models with low resources and high accuracy.

Related Work. The problem of distributed multivariate regression has been addressed by many researchers to date. Hershberger et al. [78] considered the problem of performing global MR in a vertically partitioned data distribution scenario. The authors propose a wavelet transform of the data such that, after the transformation, the effect of the cross terms can be dealt with easily. The local MR models are then transported to the central site and combined to form the global MR model. Such synchronized techniques will not scale in large, asynchronous systems such as modern P2P networks.

Many researchers have looked into the problem of doing distributed MR using distributed kernel regression techniques, such as Guestrin et al. [72] and Predd et al. [153]. The algorithm presented by Guestrin et al. [72] performs linear regression in a network of sensors using in-network processing of messages.


Instead of transmitting the raw data, the proposed technique transmits only constraints, thereby reducing the communication complexity drastically. Similar to the work proposed here, their work also uses local rules to prune messages. The major drawback, however, is that their algorithm is not suitable for dynamic data: as the authors point out, two passes over the entire network are required to make sure that the effect of the measurements of each node is propagated to every other node, which is very costly if the data changes. Moreover, contrary to the broad class of problems that we can solve using our technique, their technique is applicable only to the linear regression problem.

Linear regression from heterogeneously distributed data using a variational approximation technique has been proposed by Mukherjee and Kargupta [131]. The authors pose the original linear regression problem as a convex optimization problem and then use the variational approximation technique to solve it. Distributed computation of the regression coefficients reduces to distributed inner product computation, and the authors use the technique proposed by Egecioglu et al. [143] to compute the inner products efficiently. This distributed regression technique is scalable, as corroborated by theoretical analysis and experimental results.

Meta-learning is an interesting class of algorithms typically used for supervised learning. In meta-learning, such as bagging [21] or boosting [62], many models are induced from different partitions of the data, and these "weak" models are combined using a second-level algorithm which can be as simple as taking the average output of the models for any new sample. Such a technique is suitable for inducing models from distributed data, as proposed by Stolfo et al. [176]. The basic idea is to learn a model at each site locally (with no communication at all) and then, when a new sample arrives, predict its output by simply averaging the local outputs. Xing et al. [197] present such a framework for doing regression on heterogeneous datasets. However, these techniques require synchronization, since the locally learned models are shipped to a central site for the final combination.


Notation and Problem Definition. The local data of peer $P_i$ at time $t$ is a set of vectors in $\mathbb{R}^d$ denoted by $S_i = \left[\left(\vec{x}_1^i, f(\vec{x}_1^i)\right), \left(\vec{x}_2^i, f(\vec{x}_2^i)\right), \dots, \left(\vec{x}_s^i, f(\vec{x}_s^i)\right)\right]$. Each $d$-dimensional data vector consists of two components: the input attributes $\vec{x}_j^i = \left[x_{j1}^i\; x_{j2}^i\; \dots\; x_{j(d-1)}^i\right]$ of dimensionality $d-1$, and the real-valued output $f(\vec{x}_j^i)$, where $f : \mathbb{R}^{d-1} \to \mathbb{R}$. Therefore each data vector in $S_i$ can be viewed as an input-output pair. The rest of the notation is as defined in Section 3.5.1.

In MR, the task is to learn a function $\hat{f}(\vec{x})$ which "best" approximates $f(\vec{x})$ according to some measure, such as least squares. Depending on the representation chosen for $\hat{f}(\vec{x})$, various types of regression models (linear or nonlinear) can be developed. We leave this type specification as part of the problem statement for our algorithm, rather than an assumption.

In MR, for each data point $\vec{x}$, the error between $\hat{f}(\vec{x})$ and $f(\vec{x})$ can be computed as $\left[f(\vec{x}) - \hat{f}(\vec{x})\right]^2$. In our scenario, since this error value is distributed across the peers, a good estimate of the global error is the average error across all the peers, denoted $\mathrm{Avg}\left[f(\vec{x}) - \hat{f}(\vec{x})\right]^2$. There exist several methods for measuring how suitable a model is for the data under consideration; we use the L2-norm distance between the current network data and the model as the quality metric for the model built. Given a dynamic environment, our goal is to maintain at each peer, at any time, an $\hat{f}(\vec{x})$ which best approximates $f(\vec{x})$.

Problem 1. [MR Problem] Given a time-varying dataset $S_i$, a user-defined threshold $\epsilon$, and $\hat{f}(\vec{x})$ at all the peers, the MR problem is to maintain an $\hat{f}(\vec{x})$ at each peer such that, at any time $t$, $\left\|\mathrm{Avg}\left[f(\vec{x}) - \hat{f}(\vec{x})\right]^2\right\| < \epsilon$.

For ease of explanation, we decompose this task into two subtasks. First, given $\hat{f}(\vec{x})$ at all the peers, we want to raise an alarm whenever $\left\|\mathrm{Avg}\left[f(\vec{x}) - \hat{f}(\vec{x})\right]^2\right\| > \epsilon$, where $\epsilon$ is a user-defined threshold; this is the model monitoring problem.


Secondly, if $\hat{f}(\vec{x})$ no longer represents $f(\vec{x})$, we sample from the network (or, even better, do an in-network aggregation) to find an updated $\hat{f}(\vec{x})$. This is the model computation problem. Mathematically, the subproblems can be formalized as follows.

Problem 2. [Monitoring Problem] Given $S_i$ and $\hat{f}(\vec{x})$ at all the peers, the monitoring problem is to output 0 if $\left\|\mathrm{Avg}\left[f(\vec{x}) - \hat{f}(\vec{x})\right]^2\right\| < \epsilon$, and 1 otherwise, at any time $t$.

Problem 3. [Computation Problem] The model computation problem is to find a new $\hat{f}(\vec{x})$ based on a sample of the data collected from the network.

Approach. With the two problems defined above, it is easy to see that the monitoring problem reverts exactly to the L2 monitoring algorithm. The global average error between the true model and the computed one is a point in $\mathbb{R}$. In order to use L2 norm thresholding as the metric for tracking this average error, we transform this 1-D problem into a 2-D problem by defining a vector in $\mathbb{R}^2$ whose first component is the peer's average error and whose second component is 0. Therefore, determining whether the average error is less than $\epsilon$ is equivalent to determining whether the average error vector lies inside a circle of radius $\epsilon$, i.e., tracking the L2 norm of the global average error.

The second problem can be solved by an application of the convergecast-broadcast technique. Once the L2 thresholding algorithm raises a flag, samples are collected and propagated up the tree using a convergecast process; the new model $\hat{f}(\vec{x})$ is computed at the root and then propagated down the tree (the broadcast process). One interesting point to note is that if $f(\vec{x})$ is linear in terms of the weights of the regression model, the convergecast process is exact rather than approximate.


Example. In this section we illustrate the P2P MR algorithm. Let there be two peers $P_i$ and $P_j$, and let the regression model be linear in the regression coefficients: $a_0 + a_1x_1 + a_2x_2$, where $a_0$, $a_1$ and $a_2$ are the regression coefficients with values 1, 2 and $-2$ respectively, and $x_1$ and $x_2$ are the two attributes of the data. The coefficients are given to all the peers. The data of peer $P_i$ is $S_i = \{(3,1,3.9), (0,-1,3.6)\}$, where the third entry of each data point is the output generated according to the regression model, to which we add 30% noise. Similarly, for peer $P_j$, $S_j = \{(1,4,-6.5), (-3,2,-9.1)\}$. Now, for peer $P_i$, the squared errors of the two points are $\{(0.9)^2, (0.6)^2\}$; similarly, for $P_j$, the errors are $\{(1.5)^2, (2.1)^2\}$. The average error is $\frac{(0.9)^2 + (0.6)^2 + (1.5)^2 + (2.1)^2}{4} = 1.9575$. In $\mathbb{R}^2$, the task is to determine whether $\|(1.9575, 0)\| > \epsilon$, i.e. $1.9575 > \epsilon$, for a user-defined $\epsilon$.

Special case: Linear Regression. In many cases, sampling from the network is communication intensive. We can instead find the coefficients using an in-network aggregation if we choose to monitor a widely used regression model, viz. linear regression (linear with respect to the parameters, i.e. the unknown weights).

Let the global dataset over all the peers be denoted by:

$$G = \begin{pmatrix} x_{11} & x_{12} & \dots & x_{1(d-1)} & f(\vec{x}_1)\\ x_{21} & x_{22} & \dots & x_{2(d-1)} & f(\vec{x}_2)\\ \vdots & \vdots & & \vdots & \vdots\\ x_{j1} & x_{j2} & \dots & x_{j(d-1)} & f(\vec{x}_j)\\ \vdots & \vdots & & \vdots & \vdots\\ x_{|G|1} & x_{|G|2} & \dots & x_{|G|(d-1)} & f(\vec{x}_{|G|}) \end{pmatrix}$$

where $\vec{x}_j = \{x_{j1}\; x_{j2}\; \dots\; x_{j(d-1)}\}$.

In MR, the idea is to learn a function $\hat{f}(\vec{x}_j)$ which approximates $f(\vec{x}_j)$ for all the data points in $G$.
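The example's arithmetic can be reproduced in a couple of lines (our own check, using the per-point error values as stated in the text):

import numpy as np

errors = np.array([0.9, 0.6, 1.5, 2.1])       # errors quoted for P_i and P_j
avg_sq_error = np.mean(errors ** 2)            # -> 1.9575
err_vec = np.array([avg_sq_error, 0.0])        # the 1-D error lifted into R^2
print(avg_sq_error, np.linalg.norm(err_vec))   # both print 1.9575; compare against eps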


For linear regression, the function $\hat{f}(\vec{x}_j)$ is chosen to be a linear function of the $d-1$ input attributes $\{x_{j1}\; x_{j2}\; \dots\; x_{j(d-1)}\}$, for all $j = 1,\dots,|G|$. More specifically, the linear model we want to fit is $\hat{f}(\vec{x}_j) = a_0 + a_1x_{j1} + a_2x_{j2} + \dots + a_{d-1}x_{j(d-1)}$, where the $a_i$'s are the coefficients that need to be estimated from the global dataset $G$. For simplicity, we drop the cross terms involving $x_{jk}$ and $x_{jl}$, $\forall k,l \in [1..(d-1)]$.

For every data point in the set $G$, the squared error is:

$$E_1 = \left[f(\vec{x}_1) - a_0 - a_1x_{11} - a_2x_{12} - \dots - a_{d-1}x_{1(d-1)}\right]^2$$
$$E_2 = \left[f(\vec{x}_2) - a_0 - a_1x_{21} - a_2x_{22} - \dots - a_{d-1}x_{2(d-1)}\right]^2$$
$$\vdots$$
$$E_{|G|} = \left[f(\vec{x}_{|G|}) - a_0 - a_1x_{|G|1} - a_2x_{|G|2} - \dots - a_{d-1}x_{|G|(d-1)}\right]^2$$

Thus the total squared error over all the data points is

$$SSE = \sum_{j=1}^{|G|} E_j = \sum_{j=1}^{|G|}\left[f(\vec{x}_j) - a_0 - a_1x_{j1} - a_2x_{j2} - \dots - a_{d-1}x_{j(d-1)}\right]^2$$

For linear regression, closed-form expressions exist for finding the coefficients $a_i$: take the partial derivatives of SSE with respect to the $a_i$'s and set them to zero:


$$\frac{\partial}{\partial a_0}SSE = 2\sum_{j=1}^{|G|}\left[f(\vec{x}_j) - a_0 - a_1x_{j1} - \dots - a_{d-1}x_{j(d-1)}\right](-1) = 0$$
$$\frac{\partial}{\partial a_1}SSE = 2\sum_{j=1}^{|G|}\left[f(\vec{x}_j) - a_0 - a_1x_{j1} - \dots - a_{d-1}x_{j(d-1)}\right](-x_{j1}) = 0$$
$$\frac{\partial}{\partial a_2}SSE = 2\sum_{j=1}^{|G|}\left[f(\vec{x}_j) - a_0 - a_1x_{j1} - \dots - a_{d-1}x_{j(d-1)}\right](-x_{j2}) = 0$$
$$\vdots$$
$$\frac{\partial}{\partial a_{d-1}}SSE = 2\sum_{j=1}^{|G|}\left[f(\vec{x}_j) - a_0 - a_1x_{j1} - \dots - a_{d-1}x_{j(d-1)}\right](-x_{j(d-1)}) = 0$$

In matrix form this can be written as:

$$\begin{pmatrix} |G| & \sum_{j=1}^{|G|} x_{j1} & \dots & \sum_{j=1}^{|G|} x_{j(d-1)}\\ \sum_{j=1}^{|G|} x_{j1} & \sum_{j=1}^{|G|} (x_{j1})^2 & \dots & \sum_{j=1}^{|G|} x_{j1}x_{j(d-1)}\\ \vdots & \vdots & & \vdots\\ \sum_{j=1}^{|G|} x_{j(d-1)} & \sum_{j=1}^{|G|} x_{j(d-1)}x_{j1} & \dots & \sum_{j=1}^{|G|} (x_{j(d-1)})^2 \end{pmatrix} \times \begin{pmatrix} a_0\\ a_1\\ \vdots\\ a_{d-1} \end{pmatrix} = \begin{pmatrix} \sum_{j=1}^{|G|} f(\vec{x}_j)\\ \sum_{j=1}^{|G|} f(\vec{x}_j)x_{j1}\\ \vdots\\ \sum_{j=1}^{|G|} f(\vec{x}_j)x_{j(d-1)} \end{pmatrix}$$

$$\Rightarrow \mathbf{X}\mathbf{a} = \mathbf{Y}$$

Therefore, to compute the matrix (or, more appropriately, vector) $\mathbf{a}$, we need to evaluate the matrices $\mathbf{X}$ and $\mathbf{Y}$. This can be done in a communication-efficient manner by noticing that the entries of these matrices are simply sums.


Consider the distributed scenario in which $G$ is distributed among $n$ peers holding $S_1, S_2, \dots, S_n$. Any entry of $\mathbf{X}$, say $\sum_{j=1}^{|G|}(x_{j1})^2$, can be decomposed as

$$\sum_{j=1}^{|G|}(x_{j1})^2 = \underbrace{\sum_{x_{j1}\in S_1}(x_{j1})^2}_{\text{for } S_1} + \underbrace{\sum_{x_{j1}\in S_2}(x_{j1})^2}_{\text{for } S_2} + \dots + \underbrace{\sum_{x_{j1}\in S_n}(x_{j1})^2}_{\text{for } S_n}$$

Therefore each entry of $\mathbf{X}$ and $\mathbf{Y}$ can be computed as a simple sum over all the peers. Thus, instead of sending the raw data in the convergecast round, peer $P_i$ can forward locally computed matrices $\mathbf{X}_i$ and $\mathbf{Y}_i$. Peer $P_j$, on receiving these, can forward new matrices $\mathbf{X}_j$ and $\mathbf{Y}_j$ obtained by aggregating, in a component-wise fashion, its local matrices and the received ones. Note that avoiding the sampling technique ensures that the result is exactly the same as in a centralized setting.

Communication Complexity. Next we state the communication complexity of computing the linear regression model.

Theorem 3.8.1. The communication complexity of computing a linear regression model depends only on the degree of the polynomial, $d$, and is independent of the number of data points, i.e. $|G|$.

Proof. As shown in Section 3.8.4, the task of computing the regression coefficients $\{a_0, a_1, \dots, a_{d-1}\}$ reduces to computing the matrices $\mathbf{X}_i$ and $\mathbf{Y}_i$. The dimensionality of $\mathbf{X}_i$ is $d \cdot d = d^2$; similarly, the dimensionality of $\mathbf{Y}_i$ is $d \cdot 1 = d$. Therefore the total communication complexity is $O(d^2)$, which is independent of the size of the dataset $|G|$.

The efficiency of the convergecast process stems from the fact that $d \ll |G|$. Hence there can be significant savings in communication by not transmitting the raw data.
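A sketch of the component-wise aggregation (our own illustration; D_i is peer i's local array whose rows are [x_1, ..., x_{d-1}, f(x)]):

import numpy as np

def local_normal_equations(D_i):
    # Build P_i's local contribution (X_i, Y_i) to the global normal equations X a = Y.
    inputs, f = D_i[:, :-1], D_i[:, -1]
    Z = np.hstack([np.ones((len(D_i), 1)), inputs])   # prepend the intercept column
    return Z.T @ Z, Z.T @ f                            # d x d matrix and length-d vector

def aggregate(parts):
    # Convergecast step: component-wise sums of own and children's (X_i, Y_i) pairs.
    X = sum(p[0] for p in parts)
    Y = sum(p[1] for p in parts)
    return X, Y

# At the root, the exact global coefficients follow from solving X a = Y,
# e.g. a = np.linalg.solve(X, Y).

Because only the $d^2 + d$ aggregated sums travel up the tree, this reproduces the centralized least-squares solution exactly, in line with Theorem 3.8.1.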


3.9 Experimental Validation

To validate the performance of our algorithms, we conducted experiments on a simulated network of thousands of peers. In this section we discuss the experimental setup and analyze the performance of the algorithms.

3.9.1 Experimental Setup

Our implementation makes use of the Distributed Data Mining Toolkit (DDMT) [44] — a distributed data mining development environment from the DIADIC research lab at UMBC. DDMT uses topological information which can be generated by BRITE [23], a universal topology generator from Boston University. In our simulations we used topologies generated according to the Barabasi-Albert (BA) model, which is often considered a reasonable model for the Internet. BA also defines delays for the network edges, which are the basis for our time measurement (wall time is meaningless when simulating thousands of computers on a single PC). On top of the network generated by BRITE, we overlaid a communication tree.

Data Generation. The data used in the simulations of the L2 thresholding, Mean Monitoring, k-Means, and Eigen Monitoring algorithms was generated using a mixture of Gaussians in $\mathbb{R}^d$. Every time a simulated peer needed an additional data point, it sampled $d$ Gaussians and multiplied the resulting vector by a $d \times d$ covariance matrix whose diagonal elements were all 1.0 and whose off-diagonal elements were chosen uniformly between 1.0 and 2.0. Alternatively, 10% of the points were chosen uniformly at random in the range $\mu \pm 3\sigma$. At controlled intervals, the means of the Gaussians were changed, thereby creating an epoch change. Typical data in two dimensions can be seen in Figure 3.9. We preferred synthetic data because of the large number of factors (twelve, in our analysis) which influence the behavior of an algorithm, and the desire to perform a tightly controlled experiment in order to understand the behavior of a complex algorithm operating in an equally complex environment.
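A condensed sketch of this generator (our own code; the matrix-multiply reading of "multiplied the resulting vector with a covariance matrix" and the uniform stand-in for the $\mu \pm 3\sigma$ outlier range are assumptions):

import numpy as np

def gen_point(mean, A, rng, outlier_frac=0.10, spread=3.0):
    # Sample d Gaussians, multiply by the d x d matrix A (diagonal 1.0,
    # off-diagonals in [1, 2]); ~10% of points are uniform noise instead.
    d = len(mean)
    if rng.random() < outlier_frac:
        return rng.uniform(mean - spread, mean + spread)  # stand-in for mu +/- 3*sigma
    return mean + A @ rng.standard_normal(d)

rng = np.random.default_rng(7)
d = 2
A = rng.uniform(1.0, 2.0, (d, d)); np.fill_diagonal(A, 1.0)
point = gen_point(np.full(d, 2.0), A, rng)   # one sample from the current epoch

Changing the mean vector handed to gen_point at controlled intervals reproduces the epoch changes described above.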


For the multivariate regression problem, the input data of a peer is a vector $(x_1, x_2, \dots, x_d) \in \mathbb{R}^d$, where the first $d-1$ dimensions correspond to the input variables and the last dimension corresponds to the output. We have conducted experiments on both linear and non-linear regression models. For the linear model, the output is generated according to $x_d = a_0 + a_1x_1 + a_2x_2 + \dots + a_{d-1}x_{d-1}$. We used two functions for the non-linear model: (1) $x_3 = a_0 + a_1a_2x_1 + a_0a_1x_2$ (multiplicative) and (2) $x_3 = a_0\sin(a_1 + a_2x_1) + a_1\sin(a_2 + a_0x_2)$ (sinusoidal). Every time a simulated peer needs an additional data point, it chooses the values of $x_1, x_2, \dots, x_{d-1}$, each independently in the range $-100$ to $+100$. Then it generates the value of the target variable $x_d$ using one of the above functions and adds uniform noise in the range of 5 to 20% of the value of the target output. The regression weights $a_0, a_1, \dots, a_{d-1}$ are changed randomly at controlled intervals to create an epoch change.

Measurement Metric. The two most important quantities measured in our experiments are the quality of the result and the cost of the algorithm. Quality is defined differently for the L2 thresholding, Mean Monitoring, k-Means, Eigen Monitoring, and Multivariate Regression algorithms.

For the L2 thresholding algorithm, quality is measured in terms of the number of peers correctly computing an alert, i.e. the percentage of peers for which $\|\vec{K}_i\| < \epsilon$ when $\|\vec{G}\| < \epsilon$, and the percentage of peers for which $\|\vec{K}_i\| \ge \epsilon$ when $\|\vec{G}\| \ge \epsilon$. We measure the maximal, average, and minimal quality over all the peers (averaged over a number of different experiments). Quality is reported in three different scenarios: overall quality, averaged over the entire experiment; and quality on stationary data, measured separately for periods in which the mean of the data is inside the $\epsilon$-circle ($\|\vec{G}\| < \epsilon$) and for periods in which the mean of the data is outside the circle ($\|\vec{G}\| \ge \epsilon$).


For the Mean Monitoring algorithm, quality is the average distance between $\vec{G}$ and the computed mean vector $\vec{\mu}$. We plot, separately, the overall quality (during the entire experiment) and the quality after the sufficient statistics have been collected.

For the k-means and eigenvector monitoring algorithms, quality is defined as the distance between the solution of our algorithm and the solution computed by a centralized algorithm given all the data of all of the peers.

Lastly, for the regression monitoring algorithm, quality is defined as the L2-norm distance between the solution of our algorithm and the actual regression weights.

We measured the cost of the algorithm in terms of the frequency with which messages are sent by each peer. Because of the leaky bucket mechanism which is part of the algorithm (see Section 3.7.1 for an explanation), the rate of messages of an average peer is bounded by two for every $L$ time units (one to each neighbor, for an average of two neighbors per peer). The trivial algorithm that floods every change in the data would send messages at exactly this rate. The communication cost of our algorithms is thus defined in terms of normalized messages — the portion of this maximal rate which the algorithm uses. Thus, 0.1 normalized messages means that nine times out of ten the algorithm manages to avoid sending a message. We report both the overall cost, which includes the stationary and transitional phases of the experiment (and is thus necessarily higher), and the monitoring cost, which refers only to stationary periods. The monitoring cost is the cost the algorithm pays even if the data remains stationary; hence, it measures the "wasted effort" of the algorithm. We also separate, where appropriate, messages pertaining to the computation of the L2 thresholding algorithm from those used for the convergecast and broadcast of statistics.

Typical Experiment. Many factors may influence the performance of the algorithms. First are those pertaining to the data: the number of dimensions $d$, the covariance $\sigma$, the distance between the means of the Gaussians of the different epochs (the algorithm is oblivious to the actual values of the means), and the length of the epochs $T$.


Second, there are factors pertaining to the system: the topology, the number of peers, and the size of the local data. Last, there are the control arguments of the algorithm: most importantly ε, the desired alert threshold, and also L, the maximal frequency of messages. In all the experiments that we report in this section, one parameter of the system was changed while the others were kept at their default values. The default values were: number of peers = 1000, |S_i| = 800, ε = 2, d = 5, L = 500 (where the average edge delay is about 1100 time units), and the Frobenius norm of the covariance of the data, ‖σ‖_F = √(Σ_{i=1..m} Σ_{j=1..n} |σ_{i,j}|²), at 5.0. We selected the distance between the means so that the rates of false negatives and false positives are about equal. More specifically, the mean for one of the epochs was +2 along each dimension and for the other it was −2 along each dimension. For each selection of the parameters, we ran the experiment for a long period of simulated time, allowing 10 epochs to occur.

A typical experiment is described in Figures 3.10(a) and 3.10(b). In the experiment, after every 2×10^5 simulator ticks, the data distribution is changed, thereby creating an epoch. To start with, every peer is given the same mean as the mean of the Gaussian; thus a very high percentage (∼100%) of the peers state that ‖G‖ < ε. After the aforesaid number (2×10^5) of simulator ticks, we change the Gaussian without changing the mean given to each peer. Thus, for the next epoch, we see that a very low percentage of the peers (∼0%) output that ‖G‖ < ε. As for the cost of the algorithm in Figure 3.10(b), few messages are exchanged during the stationary phase. Many messages are, however, exchanged as soon as the epoch changes. This is expected, since all the peers need to communicate in order to become convinced that the distribution has indeed changed. The number of messages decreases once the distribution becomes stable again.
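For concreteness, the following small Python sketch computes the normalized-message cost metric defined above; the function and variable names are our own.

def normalized_messages(total_messages, num_peers, duration, L, max_rate=2):
    """Fraction of the maximal message rate that the algorithm actually used.

    The leaky bucket bounds each peer to at most max_rate messages
    (one per neighbor, two neighbors on average) every L time units, so a
    flooding algorithm would send num_peers * max_rate * (duration / L)
    messages over the whole experiment.
    """
    flooding_baseline = num_peers * max_rate * (duration / L)
    return total_messages / flooding_baseline

# With the default parameters (1000 peers, L = 500), 40,000 messages over
# 100,000 simulator ticks give 0.1: nine permissible messages out of ten were avoided.
print(normalized_messages(40_000, num_peers=1000, duration=100_000, L=500))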


FIG. 3.9. Typical 2-D data used in the experiment, showing the two Gaussians along with random noise.

[FIG. 3.10. A typical experiment: (a) the percentage of peers reporting ‖G‖ < ε and (b) the message cost, plotted over time across epoch changes.]


FIG. 3.11. Scalability of the Local L2 algorithm w.r.t. the number of peers and the dimension. Panels: (a) quality vs. number of peers; (b) quality vs. dimension; (c) cost vs. number of peers; (d) cost vs. dimension. Both quality and cost converge to a constant as the number of peers grows, showing high scalability. The cost increases almost linearly with the dimension, even though the space grows exponentially.


3.9.2 Experiments with the Local L2 Thresholding Algorithm

The L2 Thresholding algorithm is the simplest one we present here. In our experiments, we use L2 Thresholding to establish the scalability of the algorithms with respect to both the number of peers and the dimensionality of the data, and the dependency of the algorithm on the main parameters: the norm of the covariance σ, the size of the local data set, the tolerance ε, and the bucket size L.

In Figure 3.11, we analyze the scalability of the local L2 algorithm. As Figures 3.11(a) and 3.11(c) show, the average quality and cost of the algorithm converge to a constant as the number of peers increases. This typifies local algorithms: because the computation is local, the total number of peers does not affect performance, and hence there is no deterioration in quality or cost. The number of messages per peer also becomes a constant, again typical of local algorithms. Similarly, Figures 3.11(b) and 3.11(d) show the scalability with respect to the dimension of the problem. As shown in the figures, quality does not deteriorate when the dimension of the problem is increased, while the cost increases approximately linearly with the dimension. The independence of the quality can be explained if one thinks of what the algorithm does in terms of domain linearization. We hypothesize that when the mean of the data is outside the circle, most peers tend to select the same half-space. If this is true, then the problem is projected along the vector defining that half-space, i.e., it becomes uni-dimensional. Inside the circle, the problem is again uni-dimensional: thought of in terms of the polar coordinate system rooted at the center of the circle, the only dimension on which the algorithm depends is the radius. The dependency of the cost on the dimension stems from the linear dependence of the variance of the data on the number of Gaussians, each of whose variance is constant.

In Figures 3.12–3.15, we explore the dependency of the L2 algorithm on the different parameters, viz. the Frobenius norm of the covariance of the data, ‖σ‖_F = √(Σ_{i=1..m} Σ_{j=1..n} |σ_{i,j}|²), the size of the local dataset |S_i|, the alert threshold ε, and the size of the leaky bucket L.


As noted earlier, in each experiment one parameter was varied and the rest were kept at their default values.

The first pair of figures, Figures 3.12(a) and 3.12(b), outlines the dependency of the quality and the cost on the covariance of the data, σ = AE, where A is the covariance matrix as defined in Section 3.9.1 and E is the column vector representing the variance of the Gaussians, which takes the values 5, 10, 15, or 25. For epochs with ‖G‖ < ε, the maximal, the average, and the minimal quality in every experiment decrease linearly with the variance (from around 99% on average to around 96%). Epochs with ‖G‖ > ε, on the other hand, retained very high quality regardless of the level of variance. The overall quality also decreases linearly, from around 97% to 84%, apparently as a result of slower convergence on every epoch change. The cost of the algorithm increases as the square root of ‖σ‖_F (i.e., linearly with the variance), both for the stationary and the overall period. Nevertheless, even with the highest variance, the cost stayed far from the theoretical maximum of two messages per peer per leaky bucket period.

The second pair of figures, Figures 3.13(a) and 3.13(b), shows that the variance can be controlled by increasing the local data. As |S_i| increases, the quality increases and the cost decreases, proportionally to √|S_i|. The cause is clearly the relation between the variance of an i.i.d. sample and the sample size, which is the inverse of the square root.

The third pair of figures, Figures 3.14(a) and 3.14(b), presents the effect of changing ε on both the cost and the quality of the algorithm. As can be seen, below a certain point the number of false positives grows drastically. The number of false negatives, on the other hand, remains constant regardless of ε. When ε is about two, the distances of the two means of the data (for the two epochs) from the boundary of the circle are approximately the same, and hence the rates of false positives and false negatives are approximately the same too. As ε decreases, it becomes increasingly difficult to judge whether the mean of the data is inside the smaller circle and increasingly easier to judge that the mean is outside the circle.


FIG. 3.12. Dependency of the cost and quality of L2 Thresholding on ‖σ‖_F. Panels: (a) quality vs. ‖σ‖_F; (b) cost vs. ‖σ‖_F. Quality is defined as the percentage of peers correctly computing an alert (separated for epochs with ‖G‖ less than and greater than ε). Cost is defined as the portion of the leaky bucket intervals that are used; both the overall cost and the cost of just the stationary periods are reported. Quality degrades and cost increases as ‖σ‖_F is increased.

Thus, the number of false positives increases. The cost of the algorithm decreases linearly as ε grows from 0.5 to 2.0, and reaches nearly zero for ε = 3. Note that even for a fairly low ε = 0.5, the number of messages per peer per leaky bucket period is around 0.75, which is far less than the theoretical maximum of 2.

Figures 3.15(a) and 3.15(b) explore the dependency of the quality and the cost on the size of the leaky bucket L. Interestingly, the reduction in cost here is far faster than the reduction in quality, with the optimal point (assuming a 1:1 relation between cost and quality) somewhere between 100 and 500 time units. It should be noted that the average delay BRITE assigned to an edge is around 1100 time units. This shows that even a very permissive leaky bucket mechanism is sufficient to greatly limit the number of messages.

We conclude that L2 Thresholding provides a moderate rate of false positives even for noisy data and an excellent rate of false negatives regardless of the noise. It requires little communication overhead during stationary periods. Furthermore, the algorithm is highly scalable, with respect to both the number of peers and the dimensionality, because performance is independent of the number of peers and of the dimension of the problem.


FIG. 3.13. Dependency of the cost and quality of L2 Thresholding on |S_i|. Panels: (a) quality vs. |S_i|; (b) cost vs. |S_i|. Quality and cost are defined as in Figure 3.12. The quality improves and the cost decreases as |S_i| is increased.

FIG. 3.14. Dependency of the cost and quality of L2 Thresholding on ε. Panels: (a) quality vs. ε; (b) cost vs. ε. Quality and cost are defined as in Figure 3.12. There is a drastic improvement in quality and decrease in cost as ε is increased.


FIG. 3.15. Dependency of the cost and quality of L2 Thresholding on the size of the leaky bucket L. Panels: (a) quality vs. L; (b) cost vs. L. Quality is almost unaffected by L, while the cost decreases with increasing L.

3.9.3 Experiments with Means-Monitoring

Having explored the effects of the different parameters of the L2 Thresholding algorithm, we now shift our focus to the experiments with the Mean Monitoring algorithm. We have explored the three most important parameters that affect its behavior: τ, the alert mitigation period; T, the length of an epoch; and ε, the alert threshold.

Figures 3.16–3.18 summarize the results of these experiments. As can be seen, the quality, measured by the distance of the actual mean vector G from the computed one μ, is excellent in all three graphs.


FIG. 3.16. Dependency of the cost and quality of Mean Monitoring on the alert mitigation period τ. Panels: (a) quality vs. τ; (b) cost vs. τ. The number of data collection rounds decreases as τ is increased.

FIG. 3.17. Dependency of the cost and quality of Mean Monitoring on the epoch length T. Panels: (a) quality vs. epoch length; (b) cost vs. epoch length. As T increases, the average ‖G − μ‖ and the L2 monitoring cost decrease, since the average is taken over a larger stationary period in which μ is correct.


FIG. 3.18. Dependence of the cost and quality of Mean Monitoring on ε. Panels: (a) quality vs. ε; (b) cost vs. ε. The average ‖G − μ‖ increases and the cost decreases as ε is increased.

Also shown are the cost graphs, with separate plots for the L2 messages (on the right axis) and the number of convergecast rounds per epoch (on the left axis), each round costing two messages per peer on average. In Figure 3.16(a), the average distance between G and μ decreases as the alert mitigation period τ is decreased, over the entire length of the experiment. This is as expected since, with a smaller τ, the peers can rebuild the model more frequently, resulting in more accurate models. On the other hand, the quality after the data collection is extremely good and is independent of τ. With increasing τ, the number of convergecast rounds per epoch decreases (from three to two on average), as shown in Figure 3.16(b). In our analysis, this results from a decrease in the number of false alerts.

Figure 3.17(a) depicts the relation of the quality (both overall and during stationary periods) to T. The average distance between the estimated mean vector and the actual one decreases as the epoch length T increases. The reason is the following: at each epoch, several convergecast rounds usually occur. The later the round, the less the data is polluted by remnants of the previous epoch, and thus the more accurate is μ. Thus, when the epoch length increases, the proportion of these later, highly accurate μ's increases in the overall quality, leading to a more accurate average.


Figure 3.17(b) shows a similar trend for the cost incurred. One can see that the number of L2 messages decreases as T increases. Clearly, the more accurate μ is, the fewer monitoring messages are sent. Therefore, with increasing T, the quality increases and the cost decreases in the later rounds, and these effects are reflected in the figures.

Finally, the average distance between G and μ decreases as ε decreases, as demonstrated in Figure 3.18(a). This is as expected, since with decreasing ε the L2 algorithm ensures that these two quantities are brought closer to each other, and thus the average distance between them decreases. The cost of the algorithm, however, shows the reverse trend. This result is intuitive: with increasing ε, the algorithm has a bigger area in which to bound the global average; the problem thus becomes easier, and hence less costly, to solve.

On the whole, the quality of the Mean Monitoring algorithm behaves well with respect to all three parameters influencing it. The monitoring cost, i.e., the L2 messages, is also low. Furthermore, on average, the number of convergecast rounds per epoch is around three, which can easily be reduced further by using a bigger τ as the default value.

3.9.4 Experiments with k-Means and Eigen-Monitoring

In this set of experiments our goal is to investigate the effect of the sample size on both the P2P k-means algorithm and the Eigen monitoring algorithm. To do so, we compare the results of our algorithm to those of a centralized algorithm that processes the entire data. We compute the distance between each centroid computed by the P2P algorithm and the closest centroid computed by the centralized one. Since our algorithm is not only distributed but also sample-based, we include for comparison the results of a centralized algorithm (k-means or eigenvector computation, respectively) which takes a sample of the entire data as its input. The most outstanding result, seen in Figure 3.19(a), is that most of the error of the distributed algorithm is due to sampling and not due to decentralization.


The error (average, best case, and worst case) is very similar to that of the centralized sample-based algorithm. This is significant in two ways. First, the decentralized algorithm is obviously an alternative to centralization, especially considering its far lower communication cost. Second, the error of the decentralized algorithm can easily be controlled by increasing the sample size.

The costs of k-means have to be separated into those related to monitoring the current centroids and those related to the collection of the sample. Figure 3.19(b) presents the cost of monitoring a single centroid and the number of times data was collected per epoch. These can be multiplied by k to bound the total costs (note that messages relating to different centroids can be piggybacked on one another). The cost of monitoring decreases drastically with increasing sample size, as a result of the better accuracy provided by the larger sample. There is also a decrease in the number of convergecast rounds as the sample size increases. The default value of the alert mitigation factor τ in this experimental setup was 500. For any sample size greater than 2000, the number of convergecast rounds is about two per epoch: in the first round, it seems, the data is so polluted by data from the previous epoch that a new round is immediately triggered. As noted earlier, this can be decreased further by using a larger value of τ.

Results for the Eigen monitoring algorithm (Figure 3.20) are very similar to those of k-means. Even more evidently, the source of error in the eigenvector is that it is based on a sample of the data. Since the eigenvector is a more robust statistic than the k-means results, the error is significantly smaller. The monitoring cost and the number of data collection rounds also decrease with increasing sample size. Note that in this case we chose a smaller ε = 0.02.
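As an illustration of the quality measure used in this subsection, the following minimal Python sketch (our own, not the thesis code) scores a set of P2P centroids by matching each one to its closest centralized counterpart:

import numpy as np

def centroid_quality(p2p_centroids, central_centroids):
    """Average distance from each P2P centroid to the closest centralized centroid."""
    distances = [min(np.linalg.norm(c - cc) for cc in central_centroids)
                 for c in p2p_centroids]
    return float(np.mean(distances))

# Hypothetical example: two centroid sets in R^2 that almost agree.
p2p = np.array([[0.9, 1.1], [-2.1, 0.2]])
central = np.array([[1.0, 1.0], [-2.0, 0.0]])
print(centroid_quality(p2p, central))  # a small value indicates close agreement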


FIG. 3.19. The quality and cost of k-means clustering. Panels: (a) average quality vs. sample size, for the centralized and distributed experiments; (b) average monitoring cost vs. sample size.

FIG. 3.20. The quality and cost of Eigen monitoring. Panels: (a) average quality vs. sample size, for the centralized and distributed experiments; (b) average monitoring cost vs. sample size.


[FIG. 3.21. A typical experiment with the monitoring algorithm: the percentage of peers with ‖G_t‖ < ε over time.]

We study the quality (separated for epochs with ‖G_t‖ < ε, epochs with ‖G_t‖ > ε, and overall) and the cost of the algorithm w.r.t. different parameters. Note that, unless otherwise stated, we have used the following default values for the different parameters: number of peers = 1000, |S_i| = 50, ε = 1.5, d = 10, and L = 500 (where the average edge delay is about 1100 time units).


FIG. 3.22. Dependence of the quality and cost of the monitoring algorithm on |S_i|. Panels: (a) quality vs. |S_i|; (b) cost vs. |S_i|. As before, quality improves and cost decreases as |S_i| increases.

As we have already stated, independent of the regression function chosen, the underlying monitoring problem is always in R². The results reported in this section are with respect to the linear model, since it is the most widely used regression model. Results of monitoring more complex models are reported in the next section.

Figures 3.22(a) and 3.22(b) show the quality and cost of the algorithm as the size of the local dataset is changed. As expected, the inside quality increases and the cost decreases as the size of the dataset increases. The outside quality is very high throughout. This stems from the fact that, with noise in the data, it is easy for a peer to get flipped over when it is checking whether the average is inside the circle; noise cannot, on the other hand, change the belief of a peer when the average is outside. In the second set of experiments, we varied ε from 1.0 to 2.5 (Figures 3.23(a) and 3.23(b)). Here also, the quality increases as ε is increased. This is because, with increasing ε, there is a bigger region in which to bound the global average. This is also reflected in the decreasing number of messages. Note that, even for ε = 1.0, the normalized messages are around 1.6, which is far less than the theoretical maximum of 2 (assuming two neighbors per peer). The third set of experiments analyzes the effect of the size of the leaky bucket L.


FIG. 3.23. Dependence of the monitoring algorithm on ε. Panels: (a) quality vs. ε; (b) cost vs. ε. As ε is increased, quality improves and cost decreases, since the problem becomes significantly simpler.

FIG. 3.24. Behavior of the monitoring algorithm w.r.t. the size of the leaky bucket L. Panels: (a) quality vs. L; (b) cost vs. L. The quality and cost are almost invariant of L.


FIG. 3.25. Dependence of the quality and cost of the monitoring algorithm on noise in the data. Panels: (a) quality vs. noise; (b) cost vs. noise. The quality degrades and the cost increases as the noise percentage is increased.

As shown in Figure 3.24(a), quality does not depend on L, while Figure 3.24(b) shows that the cost decreases slowly with increasing L. Finally, Figures 3.25(a) and 3.25(b) depict the dependence of the monitoring algorithm on the noise in the data. Quality degrades and cost increases with increasing noise. This is expected, since with increasing noise a peer is more prone to random effects. This effect can, however, be nullified by using a larger dataset or a bigger ε.

Our next experiment analyzes the scalability of the monitoring algorithm w.r.t. the number of peers and the dimension of the multivariate problem. As Figures 3.26(a) and 3.26(c) show, both the quality and the cost of the algorithm converge to a constant as the number of peers increases. This is typical behavior for local algorithms: for any peer, since the computation depends on the results of only a handful of its neighbors, the overall size of the network does not degrade the quality or cost. Similarly, Figures 3.26(b) and 3.26(d) show that neither the quality nor the cost depends on the dimension of the multivariate problem. This independence can be explained by noting that the underlying monitoring problem is in R²; therefore, for a given problem, the system size and the dimensionality of the problem have no effect on the quality or the cost.


FIG. 3.26. Scalability with respect to both the number of peers and the dimension of the multivariate problem. Panels: (a) quality vs. number of peers; (b) quality vs. dimension; (c) cost vs. number of peers; (d) cost vs. dimension. Both quality and cost are independent of the number of peers and of the dimension of the multivariate problem.

Overall, the results show that the monitoring algorithm offers extremely good quality, incurs low monitoring cost, and has high scalability.

3.9.6 Results: Regression Models

Our next set of experiments measures the quality of the regression model computed by our algorithm against a centralized algorithm having access to the entire data. There are two important parameters to be considered here: (1) the alert mitigation constant τ, and (2) the sample size (for non-linear regression).


FIG. 3.27. Quality and cost of computing the regression coefficients for the linear model. Panels: (a) quality vs. alert mitigation period τ; (b) cost vs. alert mitigation period τ.

For computing the non-linear regression coefficients, we have implemented the Nelder-Mead simplex method [138]. We have conducted experiments on three datasets. Each quality graph in Figures 3.27–3.29 presents two sets of error bars: the square markers show the L2-norm distance between the distributed coefficients and the actual ones, while the diamond markers show the L2-norm distance between the coefficients found by a centralized algorithm and the actual ones. The first pair of figures, Figures 3.27(a) and 3.27(b), shows the results of computing a linear regression model. Our aim is to measure the effect of varying the alert mitigation period τ on quality and cost. As shown in Figure 3.27(a), the quality of our algorithm deteriorates as τ increases. This is because, on increasing τ, a peer builds a model later and is therefore inaccurate for a longer intermediate period. Figure 3.27(b) shows that the number of data collection rounds (dot markers) decreases from four to two per epoch; this results from a decrease in the number of false alerts. Also shown are the monitoring messages (green squares).

Figures 3.28(a) and 3.28(b) analyze the quality of our algorithm while computing a non-linear multiplicative regression model, viz. x_3 = a_0 + a_1a_2x_1 + a_0a_1x_2.
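As an aside, the following is a minimal sketch of how such non-linear coefficients might be fit with a Nelder-Mead simplex search. We use SciPy's implementation and synthetic, noise-free data purely for illustration; the thesis' own implementation follows [138], and the "true" coefficients below are our own assumption.

import numpy as np
from scipy.optimize import minimize

# Hypothetical data from the multiplicative model x3 = a0 + a1*a2*x1 + a0*a1*x2
# with assumed true coefficients (2.0, -1.0, 0.5).
rng = np.random.default_rng(1)
X = rng.uniform(-100, 100, size=(1000, 2))
a_true = np.array([2.0, -1.0, 0.5])
y = a_true[0] + a_true[1] * a_true[2] * X[:, 0] + a_true[0] * a_true[1] * X[:, 1]

def sse(a):
    """Sum of squared residuals of the multiplicative model."""
    pred = a[0] + a[1] * a[2] * X[:, 0] + a[0] * a[1] * X[:, 1]
    return float(np.sum((y - pred) ** 2))

# Nelder-Mead is derivative-free, which suits models that are non-linear in the coefficients.
result = minimize(sse, x0=np.ones(3), method="Nelder-Mead",
                  options={"maxiter": 5000, "xatol": 1e-10, "fatol": 1e-10})
print(result.x)  # estimated (a0, a1, a2)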


FIG. 3.28. Quality and cost of computing the regression coefficients for the multiplicative model. Panels: (a) quality vs. sample size; (b) cost vs. sample size. The result becomes more accurate and the cost decreases as the sample size is increased.

FIG. 3.29. Quality and cost of computing the regression coefficients for the sinusoidal model. Panels: (a) quality vs. sample size; (b) cost vs. sample size. The result becomes more accurate and the cost decreases as the sample size is increased.


Figure 3.28(a) presents the quality as the other parameter, viz. the sample size, is varied. As expected, the results of the distributed and centralized computations converge with increasing sample size. The number of data collection rounds, depicted in Figure 3.28(b), also decreases as the sample size is increased.

The third pair of figures, Figures 3.29(a) and 3.29(b), shows the same results for a sinusoidal model: x_3 = a_0·sin(a_1 + a_2x_1) + a_1·sin(a_2 + a_0x_2). Here also, the quality improves and the cost decreases as the sample size is increased.

To sum everything up, the regression computation algorithm offers excellent accuracy and low monitoring cost, and the number of convergecast-broadcast rounds is around two per epoch on average. We have tested our algorithm on several regression functions and the results are highly satisfactory.

3.10 Summary

In this chapter we have presented DeFraLC, a framework for deterministic data mining in P2P environments. The generic algorithm, PeGMA, is capable of computing complex functions of the average data in large distributed systems. We present a number of interesting applications of PeGMA: a local L2 thresholding algorithm, a reactive means-monitoring algorithm, a k-means algorithm, an eigenvector-monitoring algorithm, and a multivariate regression algorithm. Besides the direct contributions of this work with respect to the calculation of the L2 norm, k-means, and eigen monitoring in P2P networks, it also suggests an entirely new way in which local algorithms can be used in data mining. Previously, developing a local algorithm for the direct computation of a data mining model was tedious. However, a small number of such local algorithms are arguably sufficient to provide tools for deciding whether a data mining model correctly represents the data. Models can therefore be computed by any method (exact, approximate, or even at random) and then efficiently judged by the local algorithm.


Furthermore, whatever the costs of computing the model are, they can be amortized over periods in which the data distribution does not change and only the efficient local algorithm operates.


Chapter 4

DISTRIBUTED DECISION TREE INDUCTION IN PEER-TO-PEER SYSTEMS

4.1 Introduction

Decision tree induction [154][20] is a powerful statistical and machine learning tool widely used for data classification, predictive modeling, and more. Given a set of learning examples (attribute values and corresponding class labels) at a single location, there exist several well-known methods to build a decision tree, such as ID3 [154] and C4.5 [155]. However, there are several situations in which the data is distributed over a large, dynamic network containing no special server or client peers; a typical example is Peer-to-Peer (P2P) networks. Performing data mining tasks such as building decision trees is very challenging in a P2P network because of the large number of data sources, the asynchronous nature of P2P networks, and the dynamic nature of the data. A scheme which centralizes the network data is unscalable, because any change must be reported to the central site, since it might very well alter the result.

To deal with this, in this chapter we propose a P2P decision tree induction algorithm in which every peer learns and maintains a decision tree that is correct with respect to a centralized scenario. Our algorithm is highly scalable, completely decentralized and asynchronous, and adapts to changes in the data and the network. The algorithm is efficient in the sense that it guarantees that, as long as the decision tree represents the data, the communication overhead is low compared to a broadcast-based algorithm.


As a result, the algorithm is highly scalable. When the data distribution changes, the decision tree is updated automatically. Our work is the first of its kind in the sense that it induces decision trees in large P2P systems in a communication-efficient manner, without the need for global synchronization, and the induced tree is the same as would have been induced given all the data at all the peers.

In the next section we present the related work on distributed classification algorithms.

4.2 Related Work: Distributed Classification Algorithms

Classification is one of the classic problems of the data mining and machine learning fields. Researchers have proposed several solutions to this problem, Bayesian models [86], ID3 and C4.5 decision trees [154][155], and SVMs [192] being just a tiny selection. In this section we present several algorithms which have been developed for distributed classification. The solutions proposed by these algorithms differ in three major aspects: (1) how the search domain is represented using an objective function, (2) which algorithm is chosen to optimize the objective function, and (3) how the work is distributed for efficient searching through the entire space. The latter item has two typical modes: in some algorithms the learning examples are only used during the search for a function (e.g., in decision trees and SVMs), while in others they are also used during the classification of new samples (notably, in Bayesian classifiers).

Meta classifiers are another interesting group of classification algorithms. In a meta classification algorithm such as bagging [21] or boosting [62], many classifiers (of any of the previously mentioned kinds) are first built on either samples or partitions of the training data. Those "weak" classifiers are then combined using a second-level algorithm, which can be as simple as taking the majority of their outcomes for any new sample.

Some classification algorithms are better suited to a distributed setup. For instance, Stolfo et al. [176] learn a weak classifier on every partition of the data, and then centralize the classifiers and a sample of the data.


This can be a lot cheaper than transferring the entire raw data; the meta-classifier is then deduced centrally from these data. Another suggestion, by Bar-Or et al. [9], was to execute ID3 in a hierarchical network by centralizing, for every node of the tree and at each level, only statistics regarding the most promising attributes. These statistics can, as the authors show, provide a proof that the selected attribute is indeed the one having the highest gain, or otherwise trigger the algorithm to request further statistics.

Caragea et al. [29] presented a decision tree induction algorithm for both homogeneously and heterogeneously distributed environments. Noting that the crux of any decision tree algorithm is the use of an effective splitting criterion, the authors propose a method by which this criterion can be evaluated in a distributed fashion. More specifically, the paper shows that by centralizing only the summary statistics from each site at one location, there can be huge savings in communication when compared to brute-force centralization. Moreover, the distributed decision tree induced is the same as in a centralized scenario. Their system is available as part of the INDUS system.

A different approach was taken by Giannella et al. [66] and Olsen [141] for inducing decision trees over vertically partitioned data. They used the Gini information gain as the impurity measure and showed that the Gini index between two attributes can be formulated as a dot product between two binary vectors. To reduce the communication cost, the authors evaluated the dot product after projecting the vectors into a random smaller subspace. Instead of sending either the raw data or the large binary vectors, the distributed sites communicate only these projected low-dimensional vectors. The paper shows that, using only 20% of the communication cost necessary to centralize the data, they can build trees which are at least 80% as accurate as the trees produced by centralization.

Distributed probabilistic classification on heterogeneous data sites has also been discussed by Merugu and Ghosh [120].


Similarly, Park et al. have proposed a Fourier spectrum-based approach for decision tree induction over vertically partitioned datasets [147].

When the scale of the system grows to millions of partitions, as in most modern P2P systems, the quality of such algorithms degrades. This is because, for such large-scale systems, no centralization of data, statistics, or models is practical any longer. By the time such statistics can be gathered, it is reasonable to expect that both the data and the system have changed to the point that the model needs to be calculated again. Thus, classification in P2P networks requires a different breed of algorithms: ones that are fully decentralized, asynchronous, and able to deal well with dynamically changing data.

The work most closely related to the one described in this chapter is the Distributed Plurality Voting (DPV) algorithm by Ping et al. [116]. In that work, a meta classification algorithm is described in which every peer computes a weak classifier on its own data. The weak classifiers are then merged into a meta classifier by computing, per new sample, the majority of the outcomes of the weak classifiers. The computation of the weak classifiers requires no communication overhead at all, and the majority is computed using an efficient local algorithm.

Our work differs from DPV in several ways. Firstly, we compute an ID3-like decision tree from the entire data (rather than many weak classifiers). Because the entire data is used, smaller sub-populations of the data stand a chance of gathering statistical significance and contributing to the model; we therefore argue that our algorithm can, in general, be more accurate. Secondly, as proposed in DPV, every peer needs to be aware of each new sample and provide its classification of it. This mode of operation, which is somewhat reminiscent of Bayesian classification, requires broadcasting the new samples to all peers, or limits the algorithm to specific cases in which all peers cooperate in the classification of new samples (given to all) based on their private past experience. In contrast, in our work all peers jointly learn the same decision tree. Then, when a peer is given a new sample, it can be classified with no communication overhead at all.


When learning samples are few and new samples are in abundance, our algorithm can be far more efficient.

4.3 Background

4.3.1 Assumptions

Let S denote a collection of data tuples with class labels that is horizontally distributed over a large (undirected) network of machines (peers), wherein each peer communicates only with its immediate (one-hop) neighbors in the network. The communication network can be thought of as a graph with a set of vertices (peers) V. For any given peer k ∈ V, let Γ_k denote the immediate neighbors of k. Peer k will communicate directly only with the peers in Γ_k.

Our goal is to develop a distributed algorithm under which each peer computes the decision tree over S (the same tree at each peer).¹ However, the network is dynamic in the sense that the network topology can change, i.e., peers may enter or leave at any time, and the data held by each peer can change; hence S, the union of all peers' data, can be thought of as time-varying. Our distributed algorithm is designed to seamlessly adapt to network and data changes in a communication-efficient manner.

We assume that communication among neighboring peers is reliable and ordered, and that when a peer is disconnected or reconnected its neighbors k are informed, i.e., Γ_k is known to k and is updated automatically. These assumptions can easily be enforced using standard numbering, ordering, and retransmission mechanisms, in which messages are numbered, ordered, and retransmitted if an acknowledgement does not arrive in time, together with heartbeat mechanisms. Moreover, these assumptions are not uncommon and have been made elsewhere in the distributed algorithms literature [63]. Khilar and Mahapatra [100] discuss the use of heartbeat mechanisms for failure diagnosis in mobile ad-hoc networks.

¹ Throughout this chapter, peers refer to the machines in the P2P network, while nodes refer to the entries of the tree (at which an attribute is split).


Furthermore, for simplicity, we assume that the network topology forms a tree. This allows us to use a relatively simple distributed algorithm as our basic building block: distributed majority voting (more details later). We could get around this assumption in one of two ways. (1) Liss et al. [14] have developed an extended version of the distributed majority voting algorithm which does not require that the network topology form a tree; we could replace our use of simple majority voting as a basic building block with their extended version. (2) The underlying tree communication topology could be maintained independently of our algorithm using standard techniques such as [63] (for wired networks) or [110] (for wireless networks).

4.3.2 Distributed Majority Voting

Our algorithm utilizes, as a building block, a variation of the distributed algorithm for majority voting developed by Wolff and Schuster [196]. Each peer k ∈ V contains a real number δ^k, and the objective is to determine whether Δ = Σ_{k∈V} δ^k ≥ 0.

The following algorithm meets this objective. For peers k, l ∈ V, let δ^{kl} denote the most recent message (a real number) that peer k sent to l; similarly, δ^{lk} denotes the last message received by k from l. Peer k computes Δ^k = δ^k + Σ_{l∈Γ_k} δ^{lk}, which can be thought of as k's estimate of Δ based on all the information available to it. Peer k also computes Δ^{kl} = δ^{kl} + δ^{lk} for each neighbor l ∈ Γ_k, which denotes the information exchanged between k and l. When an event occurs at peer k, k decides, for each neighbor l, whether a message needs to be sent to l. An event at k consists of one of the following three situations: (i) k is initialized (enters the network or otherwise begins computation of the algorithm); (ii) k experiences a change of its data δ^k or of its neighborhood Γ_k; (iii) k receives a message from a neighbor l.

The crux of the algorithm is in determining when k must send a message to a neighbor l in response to k detecting an event.


More precisely, the question is when a message can be avoided, despite the fact that the local knowledge has changed. Upon detecting an event, peer k sends a message to neighbor l when either of the following two situations occurs: (i) k is initialized; (ii) (Δ^{kl} ≥ 0 ∧ Δ^{kl} > Δ^k) ∨ (Δ^{kl} < 0 ∧ Δ^{kl} < Δ^k) evaluates to true. Observe that since all events are local, the algorithm requires no form of global synchronization.

When k detects an event and the conditions above indicate that a message must be sent to neighbor l, k sets δ^{kl} to αΔ^k − δ^{lk} (thereby making Δ^{kl} = αΔ^k) and sends it to l, where α is a fixed parameter between 0 and 1. If α were set close to one, then small subsequent variations in Δ^k would trigger more messages from k, increasing the communication overhead. On the other hand, if α were set close to zero, the convergence rate of the algorithm could become unacceptably slow. In all our experiments, we set α to 0.5. This mechanism replicates the one used by Wolff et al. in [195]; a detailed discussion of the value of α is presented in Appendix A. The pseudo-code is presented in Algorithm 10.

To avoid a message explosion, we implement a leaky bucket mechanism such that the interval between messages sent by a peer does not become arbitrarily small. This mechanism was also used by Wolff et al. in [195]. Each peer logs the time at which its last message was sent. When a peer decides that a message needs to be sent (to any of its neighbors), it does the following: if L time units have passed since the last message was sent, it sends the new message right away; otherwise, it buffers the message and sets a timer to expire L units after the recorded time of the last message. Once the timer expires, all buffered messages are sent. For the remainder of the chapter, we leave the leaky bucket mechanism implicit in our distributed algorithm descriptions.


Input: δ^k, N^k, L, α
Output: 1 if Δ^k ≥ 0, else 0
Local variables: for all l ∈ N^k: δ^{lk}, δ^{kl}
Definitions: Δ^k = δ^k + Σ_{l∈N^k} δ^{lk};  Δ^{kl} = δ^{kl} + δ^{lk}

Initialization:
    forall l ∈ N^k do
        δ^{kl} ← 0; δ^{lk} ← 0
        SendMessage(l)

On MessageRecvd(P_l, δ): δ^{lk} ← δ
On PeerFailure(l ∈ N^k): N^k ← N^k \ {l}
On AddNeighbor(l): N^k ← N^k ∪ {l}
If N^k or δ^k changes, or a message is received: call OnChange()

Function OnChange():
    forall l ∈ N^k do
        if (Δ^{kl} ≥ 0 ∧ Δ^{kl} > Δ^k) ∨ (Δ^{kl} < 0 ∧ Δ^{kl} < Δ^k) then
            call SendMessage(l)

Function SendMessage(l):
    if time() − last_message ≥ L then
        δ^{kl} ← αΔ^k − δ^{lk}
        last_message ← time()
        send ⟨δ^{kl}⟩ to l
    else
        wait L − (time() − last_message) time units and then call OnChange()

Algorithm 10: Local Majority Vote
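To make the protocol concrete, here is a minimal single-peer sketch of Algorithm 10 in Python. Message transport, the leaky bucket timer, and initialization are stubbed out or elided, and the class and method names are our own rather than part of the thesis code.

class MajorityVotePeer:
    """One peer's state in the local majority-voting protocol."""

    def __init__(self, delta_local, alpha=0.5):
        self.delta_local = delta_local  # this peer's own input, delta^k
        self.alpha = alpha
        self.sent = {}                  # delta^{kl}: last value sent to each neighbor l
        self.recv = {}                  # delta^{lk}: last value received from each neighbor l

    def knowledge(self):
        """Delta^k: the peer's current estimate of the global sum."""
        return self.delta_local + sum(self.recv.values())

    def agreement(self, l):
        """Delta^{kl}: what peers k and l have exchanged so far."""
        return self.sent.get(l, 0.0) + self.recv.get(l, 0.0)

    def on_change(self, send):
        """Apply the send condition to every neighbor; send(l, value) is a transport stub."""
        dk = self.knowledge()
        for l in set(self.sent) | set(self.recv):
            dkl = self.agreement(l)
            if (dkl >= 0 and dkl > dk) or (dkl < 0 and dkl < dk):
                # Set delta^{kl} so that the pairwise agreement becomes alpha * Delta^k.
                self.sent[l] = self.alpha * dk - self.recv.get(l, 0.0)
                send(l, self.sent[l])

    def on_message(self, l, value, send):
        """Event (iii): a message arrived from neighbor l."""
        self.recv[l] = value
        self.on_change(send)

The output bit of the vote at this peer is simply knowledge() >= 0, and a leaky bucket would be layered around send to keep the message rate below two per L time units.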


4.4 P2P Decision Tree Induction Algorithm

This section presents a distributed and asynchronous algorithm, P²DT, which induces a decision tree over a P2P network in which every peer has a set of learning examples. P²DT, which is inspired by ID3 and C4.5, aims to select at every node the attribute which maximizes a gain function. P²DT then splits the node, and the learning examples associated with it, into two new leaf nodes, and this process continues recursively. A stopping rule directs P²DT to stop the recursion; in this section a simple depth limitation is used, while other, more complex predicates are described in Section 4.4.3.

The main computational task of P²DT is choosing the attribute having the highest gain among all attributes. Like other distributed data mining algorithms, P²DT needs to coordinate this decision among the multiple peers. What sets P²DT apart is that it stresses the efficiency of decision making and the lack of synchronization. These features make it exceptionally scalable and therefore suitable for networks spanning millions of peers.

P²DT deviates from the standard decision tree induction algorithms in the choice of a simpler gain function, the misclassification error, rather than the more popular (and, arguably, better) information-gain and gini-index functions. Misclassification error offers less distinction between attributes: a split can have the same misclassification error in two seemingly different cases, (1) when the erroneous examples are divided equally between the two leaves it creates, and (2) when one of these leaves is 100% accurate. Comparatively, both information-gain and gini-index would prefer the latter case to the former. Still, the misclassification error can yield accurate decision trees (see, e.g., [185]), and its relative simplicity makes it far easier to compute in a distributed setup.

In the interest of clarity, we divide the discussion of the algorithm into two subsections: Section 4.4.1 below describes an algorithm for the selection of the attribute offering the lowest misclassification error from among a large set of possibilities; Section 4.4.2 then describes how a collection of such decisions can be efficiently used to induce a decision tree.
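To see the distinction between the two gain functions concretely, the following small Python example (with counts of our own choosing) constructs the two cases described above: the splits tie under misclassification error, while the gini index prefers the one with a pure leaf.

def misclass_error(split):
    """Weighted misclassification error of a binary split; each leaf is (class-0 count, class-1 count)."""
    total = sum(a + b for a, b in split)
    return sum(min(a, b) for a, b in split) / total

def gini(split):
    """Weighted gini impurity of a binary split."""
    total = sum(a + b for a, b in split)
    impurity = 0.0
    for a, b in split:
        n = a + b
        impurity += (n / total) * (1 - (a / n) ** 2 - (b / n) ** 2)
    return impurity

even_errors = [(40, 10), (40, 10)]  # the 20 erroneous examples split evenly between the leaves
pure_leaf = [(50, 20), (30, 0)]     # the same 20 errors, but one leaf is 100% accurate

print(misclass_error(even_errors), misclass_error(pure_leaf))  # 0.2 0.2 -> a tie
print(gini(even_errors), gini(pure_leaf))                      # 0.32 vs ~0.286 -> gini prefers the pure leaf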


4.4.1 Splitting Attribute Choosing using the Misclassification Gain Function

Notations. Let S be a set of learning examples — each a vector in {0,1}^d × {0,1} — where the first d entries of each example denote the attributes, A_1, ..., A_d, and the additional one denotes the class Class. The cross table of attribute A_i and the class is

    X^i = [ x^i_00  x^i_01 ]
          [ x^i_10  x^i_11 ]

where, e.g., x^i_01 is the number of examples in the set S for which A_i = 0 and Class = 1. We also define the indicator variables s^i_0 = sign(x^i_00 − x^i_01) and s^i_1 = sign(x^i_10 − x^i_11), with sign(x) = 1 if x ≥ 0 and sign(x) = −1 if x < 0.

We have used the gain measure based on misclassification gain (as described in [185], page 158, as an alternative to gini and entropy). For a binary attribute A_i, the misclassification impurity measure for A_i = 0 is MI_0 = 1 − max(x^i_00, x^i_01)/x^i_0· (where x^i_0· is the number of tuples with A_i = 0). Similarly, for A_i = 1 the misclassification impurity is MI_1 = 1 − max(x^i_10, x^i_11)/x^i_1·. Therefore, the misclassification gain (MG) for A_i is

    MG = I(parent) − (x^i_0· / |S|) × MI_0 − (x^i_1· / |S|) × MI_1
       = I(parent) − (x^i_0· / |S|) + max(x^i_00, x^i_01)/|S| − (x^i_1· / |S|) + max(x^i_10, x^i_11)/|S|
       = I(parent) − 1 + max(x^i_00, x^i_01)/|S| + max(x^i_10, x^i_11)/|S|

Since the maximum of two numbers is their average plus half their absolute difference, this is equal to


    MG = I(parent) − 1 + (x^i_00 + x^i_01 + x^i_10 + x^i_11)/(2|S|) + |x^i_00 − x^i_01|/(2|S|) + |x^i_10 − x^i_11|/(2|S|)
       = I(parent) − 1/2 + [ |x^i_00 − x^i_01| + |x^i_10 − x^i_11| ] / (2|S|)

Choosing the attribute with the highest gain is equivalent to maximizing the misclassification gain. As seen above, maximizing this quantity is the same as maximizing |x^i_00 − x^i_01| + |x^i_10 − x^i_11|. Thus, according to the misclassification gain function, the best attribute to split on is

    A_best = argmax_{i∈[1,d]} |x^i_00 − x^i_01| + |x^i_10 − x^i_11|.

Note that if A_i = A_best then for any A_j ≠ A_i,

    C_{i,j} = |x^i_00 − x^i_01| + |x^i_10 − x^i_11| − |x^j_00 − x^j_01| − |x^j_10 − x^j_11|

is either zero or more. A different derivation of this quantity is given in Appendix B for any two arbitrary categorical attributes.

In a distributed setup, S is partitioned into n sets S_1 through S_n.

    X^i_k = [ x^i_k,00  x^i_k,01 ]
            [ x^i_k,10  x^i_k,11 ]

would therefore denote the cross table of attribute A_i and the class in the example set S_k. Note that x^i_00 = Σ_{k=1...n} x^i_k,00 and

    C_{i,j} = | Σ_{k=1...n} [x^i_k,00 − x^i_k,01] | + | Σ_{k=1...n} [x^i_k,10 − x^i_k,11] |
              − | Σ_{k=1...n} [x^j_k,00 − x^j_k,01] | − | Σ_{k=1...n} [x^j_k,10 − x^j_k,11] |.

Also, notice that C_{i,j} is not, in general, equal to

    Σ_{k=1...n} |x^i_k,00 − x^i_k,01| + Σ_{k=1...n} |x^i_k,10 − x^i_k,11| − Σ_{k=1...n} |x^j_k,00 − x^j_k,01| − Σ_{k=1...n} |x^j_k,10 − x^j_k,11|.

Still, using the indicators s^i_0 and s^i_1 defined above we can write

    C_{i,j} = Σ_{k=1...n} ( s^i_0 [x^i_k,00 − x^i_k,01] + s^i_1 [x^i_k,10 − x^i_k,11] − s^j_0 [x^j_k,00 − x^j_k,01] − s^j_1 [x^j_k,10 − x^j_k,11] ).

This last expression, in turn, is simply a sum — across all peers — of a number

    δ^{i,j}_k = s^i_0 [x^i_k,00 − x^i_k,01] + s^i_1 [x^i_k,10 − x^i_k,11] − s^j_0 [x^j_k,00 − x^j_k,01] − s^j_1 [x^j_k,10 − x^j_k,11]

which can be computed independently by each peer, assuming it knows the values of the indicators.
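As a plausibility check of the algebra above, the following sketch (the helper names and the cross tables are ours) verifies that ranking attributes by the score |x^i_00 − x^i_01| + |x^i_10 − x^i_11| agrees with ranking them directly by the misclassification gain MG.

    def mg_score(x):
        # x is the 2x2 cross table [[x00, x01], [x10, x11]] of one attribute
        return abs(x[0][0] - x[0][1]) + abs(x[1][0] - x[1][1])

    def misclassification_gain(x, i_parent):
        s = sum(x[0]) + sum(x[1])                  # |S|
        mi0 = 1 - max(x[0]) / sum(x[0])            # impurity of the A_i = 0 leaf
        mi1 = 1 - max(x[1]) / sum(x[1])            # impurity of the A_i = 1 leaf
        return i_parent - (sum(x[0]) / s) * mi0 - (sum(x[1]) / s) * mi1

    tables = {"A1": [[40, 10], [5, 45]], "A2": [[30, 20], [25, 25]]}
    best_by_score = max(tables, key=lambda a: mg_score(tables[a]))
    best_by_gain = max(tables, key=lambda a: misclassification_gain(tables[a], 0.5))
    assert best_by_score == best_by_gain == "A1"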


Finally, denote by δ^{i,j}_k|abcd the value of δ^{i,j}_k assuming s^i_0 = a, s^i_1 = b, s^j_0 = c, and s^j_1 = d. Notice that, unlike δ^{i,j}_k, δ^{i,j}_k|abcd can be computed independently by every peer, regardless of the actual values of s^i_0, s^i_1, s^j_0, and s^j_1.

It is therefore possible to compute A_best by concurrently running the following set of majority votes: two per attribute A_i, with inputs x^i_00 − x^i_01 and x^i_10 − x^i_11, to compute the values of s^i_0 and s^i_1; and one for every pair of attributes and every possible combination of s^i_0, s^i_1, s^j_0, and s^j_1. Given the results for s^i_0 and s^i_1, one could select the right combination and ignore the others. Then, given all of the selected votes, one could find the attribute whose gain is higher than that of any other attribute. Below, we describe an algorithm which performs the same computation far more efficiently.

P2P Misclassification Minimization. The P2P Misclassification Minimization (P²MM) algorithm, Algorithm 11, aims to solve two problems concurrently: it decides which of the attributes A_1 through A_d is A_best and, at the same time, computes the true values of s^i_0 and s^i_1. The general framework of the P²MM algorithm is that of pivoting: a certain A_i is selected as the potential A_best. Then the algorithm tries to validate this selection. If A_i turns out to be indeed the best attribute, then this is reported. On the other hand, the revocation of the validation proves that there must exist another attribute (or attributes) A_j which is (are) better than A_best. At this point, the algorithm renames one of these better attributes A_j to be A_best, and tries to validate this by comparing it to the other attributes that turned out to be better than the previous selection, i.e., the previously suspected best. To provide the input to those comparisons, each peer computes δ^{i,j}_k|abcd relying on the current ad hoc values of s^i_0, s^i_1, s^j_0, and s^j_1. The ad hoc values peer k computes for s^i_0 and s^i_1 will be denoted by s^i_k,0 and s^i_k,1, respectively. To make sure those ad hoc results converge to the correct values, two additional majority votes are carried out per attribute, concurrent with those of the pivoting; in these, the inputs of peer k are x^i_k,00 − x^i_k,01 and x^i_k,10 − x^i_k,11, respectively.
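In code, the peer-local input δ^{i,j}_k|abcd is a one-line combination of local counts. A sketch, with an illustrative function name of our own:

    # Peer-local input to the vote for the pair (A_i, A_j) under an assumed
    # sign combination. xi, xj: 2x2 local cross tables of A_i and A_j at
    # peer k, indexed [attribute value][class]; a, b, c, d in {-1, +1} are
    # the assumed values of s^i_0, s^i_1, s^j_0, s^j_1.
    def delta_abcd(xi, xj, a, b, c, d):
        return (a * (xi[0][0] - xi[0][1]) + b * (xi[1][0] - xi[1][1])
                - c * (xj[0][0] - xj[0][1]) - d * (xj[1][0] - xj[1][1]))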


Figure 4.1 shows the pivoting step as it advances. In the figure, there are five attributes, A_1, A_2, A_3, A_4 and A_5. The heights of the bars denote their gains. In the first snapshot (first row), A_1 is selected as the pivot. A_1 is then compared to all the other attributes, A_2, A_3, A_4 and A_5. All the attributes better than A_1 are reported in the third step — A_2, A_4 and A_5. In the fourth step, A_2 is selected as the pivot and compared to A_4 and A_5. Finally, A_2 is output as the best attribute.

The P²MM algorithm works in streaming mode: every peer k takes two inputs — the set of its neighbors Γ_k and a set S_k of learning examples. Those inputs may (and often do) change over time, and the algorithm responds to every such change by adjusting its output and by possibly sending messages. Similarly, messages stream in to the peer and can influence both the output and the outgoing messages. The output of peer k is the attribute with the highest misclassification gain. This output is, by nature, ad hoc and may change in response to any of the events described above.

P²MM is based on a large number of instances of the distributed majority voting algorithm described earlier in Section 4.3.2. On initialization, P²MM invokes two instances of majority voting per attribute to determine the values of s^i_0 and s^i_1 — let M^i_0 and M^i_1 denote these majority votes. For each peer k, its inputs to these votes (instances) are M^i_0.δ_k = x^i_k,00 − x^i_k,01 and M^i_1.δ_k = x^i_k,10 − x^i_k,11. Additionally, for every pair i < j ∈ [1...d], P²MM initializes sixteen instances of majority voting — one for each possible combination of values for s^i_0, s^i_1, s^j_0, and s^j_1. Those instances are denoted by M^{i,j}_abcd, with abcd referring to the combination of values for s^i_0, s^i_1, s^j_0, and s^j_1. For each peer k, its inputs to these instances are M^{i,j}_abcd.δ_k = a[x^i_k,00 − x^i_k,01] + b[x^i_k,10 − x^i_k,11] − c[x^j_k,00 − x^j_k,01] − d[x^j_k,10 − x^j_k,11]. A related algorithm, distributed plurality voting (DPV) [116], could be used to describe our algorithm for selecting one of the sixteen comparisons M^{i,j}_abcd when abcd are static. But since in our case the values of s^i_0, s^i_1, s^j_0, and s^j_1 are always changing, it seems difficult to follow the same description as given in [116].
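A quick count shows how many majority vote instances this initialization creates; the helper below is ours, for counting only, and follows directly from the description above (two instances per attribute plus sixteen per unordered pair).

    from math import comb

    def p2mm_instances(d):
        # two votes per attribute (s^i_0, s^i_1) + sixteen per pair i < j
        return 2 * d + 16 * comb(d, 2)

    assert p2mm_instances(10) == 740   # 20 attribute votes + 16 * 45 pair votes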


FIG. 4.1. Figure showing the pivot selection process. In the first row, A_1 is selected as the pivot. A_1 is then compared to all other attributes, and all the ones better than A_1 are selected in the third snapshot. A_2 is selected as the pivot. Finally, A_2 is output as the best attribute.


Following initialization, the algorithm is event based. Each peer k is idle unless there is a change in Γ_k or S_k, or an incoming message changes the Δ_k of one of the instances of majority voting. On such an event, the simple solution would be to check the conditions for sending messages in any of the majority votes. However, as shown by [105][116], pivoting can be used to reduce the number of conditions checked from O(d²) to an expected O(d). In DPV, the following heuristic is suggested for pivot selection: the pivot would be the vote which has the largest agreement value with any of the neighbors. It is proved that this method is deadlock-free, and it is shown to be effective in reducing the number of conditions checked. We implement the same heuristic, choosing the pivot as the attribute A_i which has the largest M^{j,i}.Δ_k,l for j < i or the smallest M^{i,m}.Δ_k,l for m > i, over any neighbor l. If the test for any M^{i,j} fails, M^{i,j}.Δ_k,l needs to be modified by sending a message to l. P²MM does this by sending a message which will set M^{i,j}.Δ_k,l to αM^{i,j}.Δ_k (α is set to 1/2 by default), which is in line with the findings of [195].

Notice that M^{i,j}_abcd.δ_k = −M^{i,j}_{−a,−b,−c,−d}.δ_k, and thus half of the comparisons actually replicate the other half and can be avoided. This optimization is omitted from the pseudocode in order to maintain conciseness. Also notice that while the peer updates in the context of any of the majority votes M^{i,j}_abcd, it will only respond with a message for the majority vote M^{i,j} — the instance with a, b, c, and d equal to s^i_k,0, s^i_k,1, s^j_k,0, and s^j_k,1, respectively. The rest of the instances are, in effect, 'suspended' and cause no communication overhead.
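The pivot selection heuristic just described can be sketched as follows. The data layout is an assumption made for illustration and does not appear in Algorithm 11: agreement[(i, j)][l] is taken to hold M^{i,j}.Δ_k,l for the currently selected sign combination, with attributes numbered 1 through d.

    # Pivot = the attribute with the largest M^{j,i}.Delta_{k,l} for j < i,
    # or the smallest M^{i,m}.Delta_{k,l} for m > i, over all neighbors l.
    def choose_pivot(d, neighbors, agreement):
        def favor(i):
            vals = [agreement[(j, i)][l] for l in neighbors for j in range(1, i)]
            vals += [-agreement[(i, m)][l] for l in neighbors for m in range(i + 1, d + 1)]
            return max(vals) if vals else float("-inf")
        return max(range(1, d + 1), key=favor)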


Input: Γ_k, S_k
Output: the attribute A_pivot

Initialization:
begin
    forall A_i ∈ {A_1 ... A_d} do
        Initialize two instances of LSD-Majority (M^i_0, M^i_1) with inputs x^i_k,00 − x^i_k,01 and x^i_k,10 − x^i_k,11, knowledge M^i_0.Δ_k and M^i_1.Δ_k, and agreements M^i_0.Δ_k,l and M^i_1.Δ_k,l respectively (∀l ∈ Γ_k);
    end
    forall [a, b, c, d ∈ {−1, 1}] ∧ [A_i, A_j ∈ {A_1 ... A_d}] do
        Initialize an instance of LSD-Majority (M^{i,j}_abcd) with input δ^{i,j}_k|abcd, knowledge M^{i,j}_abcd.Δ_k and agreement M^{i,j}_abcd.Δ_k,l (∀l ∈ Γ_k), where a = s^i_k,0, b = s^i_k,1, c = s^j_k,0, and d = s^j_k,1;
    end
end

On any event:
begin
    forall A_i ∈ {A_1 ... A_d} and every l ∈ Γ_k do
        if not M^i_0.Δ_k ≤ M^i_0.Δ_k,l < 0 and not M^i_0.Δ_k ≥ M^i_0.Δ_k,l ≥ 0 then call Send(M^i_0, l);
        if not M^i_1.Δ_k ≤ M^i_1.Δ_k,l < 0 and not M^i_1.Δ_k ≥ M^i_1.Δ_k,l ≥ 0 then call Send(M^i_1, l);
    end
    repeat
        Let pivot = argmax_{i∈[1...d]} max_{l∈N_k} { max_{j<i} M^{j,i}.Δ_k,l , max_{m>i} −M^{i,m}.Δ_k,l };
        forall A_i ∈ {A_1 ... A_{pivot−1}} and every l ∈ Γ_k do
            if not M^{i,pivot}.Δ_k ≤ M^{i,pivot}.Δ_k,l < 0 and not M^{i,pivot}.Δ_k ≥ M^{i,pivot}.Δ_k,l ≥ 0 then call Send(M^{i,pivot}, l);
        end
        forall A_i ∈ {A_{pivot+1} ... A_d} and every l ∈ Γ_k do
            if not M^{pivot,i}.Δ_k ≤ M^{pivot,i}.Δ_k,l < 0 and not M^{pivot,i}.Δ_k ≥ M^{pivot,i}.Δ_k,l ≥ 0 then call Send(M^{pivot,i}, l);
        end
    until pivot no longer changes;
end

if MessageRecvd(l, (id, δ)) then
    Let M be the majority voting instance with M.id = id;
    M.δ_l,k ← δ;
end

Function Send(M, l)
begin
    M.δ_k,l ← αM.Δ_k − M.δ_l,k;
    Send (M.id, M.δ_k,l) to l;
end

Algorithm 11: P2P Misclassification Minimization (P²MM)


Next we show that the P²MM algorithm is eventually correct.

Theorem 4.4.1. Each peer running the P²MM algorithm eventually converges to the correct pivot.

Proof. To see why P²MM convergence is guaranteed, first notice that eventual correctness of each of the majority votes M^i_0 and M^i_1 is guaranteed, because the condition for sending messages is checked for every one of them each time the data changes or a message is received. Next, consider two neighboring peers who choose different pivots: peer k, which selects i, and peer l, which selects j. Since both k and l will check the condition of M^{i,j}, and since M^{i,j}.Δ_k,l = M^{i,j}.Δ_l,k, at least one of them would have to change its pivot. Thus, peers will continue to change their pivots until they all agree on the same pivot. To see that peers will converge to the decision that the pivot is the attribute with the highest gain (denote it A_m), assume they converge on another pivot. Since the condition of the votes comparing A_pivot to any other attribute is checked whenever the data changes, it is guaranteed that if m < pivot then M^{m,pivot}.Δ_k,l will eventually be larger than zero for all k and l ∈ Γ_k, and if pivot < m then M^{pivot,m}.Δ_k,l will eventually be smaller than zero for all k and l ∈ Γ_k. Thus, the algorithm will replace the pivot with either m or another attribute, but would not be able to converge on the current pivot. This completes the proof of the correct convergence of the P²MM algorithm.

Complexity. The P²MM algorithm compares the attributes in an asynchronous fashion and outputs the best attribute. Consider the case of comparing only two attributes. The worst-case communication complexity of the P²MM algorithm is O(size of the network). This can happen when the misclassification gains of the two attributes are very close. Since our algorithm is eventually correct, the data will need to be propagated through the entire network, i.e., all the peers will need to communicate to find the correct answer. Thus the overall communication complexity of the P²MM algorithm, in the worst case, is O(size of the network). If the misclassification gains of the two attributes are not very close (which is often the case for most datasets), the algorithm is not global; rather, the P²MM algorithm can prune many messages, as shown by our extensive experimental results. Finally, formulating the complexity of such data-dependent algorithms in terms of the complexity of the data is a big open research issue even for simple primitives such as the majority voting protocol [196], let alone the complex P²MM algorithm presented here.

Figure 4.2 demonstrates how two attributes A_i and A_j are compared by two peers, P_k and P_l. At time t_0, the peers initialize all the sixteen votes M^{i,j}_abcd along with the votes for the indicator variables. At time t_1, the values of the indicator variables are {1, 1, 1, −1} for P_k.


Hence P_k only sends messages corresponding to the vote M^{i,j}_{1,1,1,−1}. The remaining fifteen votes are in a suspended state. Similarly, the figure shows the suspended votes for P_l at time t_2. Finally, it depicts that at time t_4, the peers converge to the same vote, which is not suspended, viz. M^{i,j}_{1,−1,1,−1}.

Figure 4.3 shows the pivot selection process for three attributes, A_h, A_i and A_j, for a peer P_k having two neighbors P_l and P_m. Snapshot 1 shows the knowledge (Δ_k) and the agreements of P_k (Δ_k,l and Δ_k,m) for the three attributes. Since the pivot is the attribute with the highest agreement, A_h is selected as the pivot. Now there is a disagreement between Δ_k and Δ_k,l for A_h. This results in a message and a subsequent reevaluation of Δ_k,l (to be set equal to Δ_k). In the next snapshot, A_i is selected as the pivot and, since Δ_k,l < Δ_k, no message needs to be sent.

Comment: The P²MM algorithm above relies on the assumption that the data at all peers is boolean (attributes and class labels). The boolean attributes assumption can be relaxed to arbitrary categorical data at the expense of more majority voting instances per attribute pair (the number of instances increases exponentially with the number of distinct attribute values). Another approach to relaxing the boolean attributes assumption could be to treat each distinct attribute value as its own boolean attribute. As a result, each categorical attribute with v distinct values is treated as v boolean attributes. Here, the number of majority voting instances per pair of attributes increases only linearly with the number of distinct attribute values. However, the issue of deciding which attribute has the highest misclassification gain on the basis of the associated boolean attributes is not entirely clear and is the subject of further research. In Appendix B we show how our framework can be easily extended to arbitrary categorical attributes.

4.4.2 Speculative Decision Tree Induction

P²MM can be used to decide which attribute would best divide a given set of learning examples.


FIG. 4.2. Comparison of two attributes A_i and A_j for two peers P_k and P_l. The figure also shows some example values of the indicator variables and the suspended/not suspended votes in each case.


FIG. 4.3. The pivot selection process and how the best attribute is selected. The pivot is shown at each round.

It is well known that decision trees can be induced by recursively and greedily dividing a set of learning examples — starting from the root and proceeding onwards with every node of the tree (e.g., the ID3 and C4.5 algorithms [154][155]). In a P2P set-up the progression of the algorithm needs to be coordinated among all peers, or they might end up developing different trees. In smaller-scale distributed systems, occasional synchronization usually addresses coordination. However, since a P2P system is too large to synchronize, we prefer speculation [98].

The starting point of tree development — the root — is known to all peers. Thus, they can all initialize P²MM to find out which attribute best splits the example set of the root. However, the peers are only guaranteed to converge to the same attribute selection eventually, and may well choose different attributes in the interim. Several questions thus arise: How and when should the algorithm commit resources to a specific split of a node? What should be done if such a split appears to be wrong after resources were committed? And what should be done about incoherence between neighboring peers?

The P2P Decision Tree Induction (P²DTI, see Algorithm 12) algorithm has two main functionalities.


Firstly, it manages the ad hoc solution, which is a decision tree composed of active nodes. The root is always active, and so is any node whose parent is active, provided that the node corresponds to one of the values of the attribute which best splits its parent's examples — i.e., the ad hoc solution of P²MM as computed by the parent. The rest of the nodes are inactive. A node (or a whole subtree) can become inactive because its parent (or an ancestor) has changed its preference for the splitting attribute. Inactive nodes are not discarded; a peer may well accept messages intended for inactive nodes — either because a neighbor considers them active or because the message was delayed by the network. Such a message would still update the majority vote for which it is intended. However, peers never send messages resulting from an inactive node. Instead, they check, whenever a node becomes active, whether there are pending messages (i.e., majority votes whose test requires sending messages) and, if so, they send those messages.

Another activity which occurs in active nodes is the further development of the tree. Each time a leaf is generated, it is inserted into a queue. Once every τ time units, the peer takes the first active leaf in the queue and develops it according to the ad hoc result of P²MM for that leaf. Inactive leaves which precede this leaf in the queue are re-inserted at the end of the queue.

Last, it may happen that a peer receives a message in the context of a node it has not yet developed. Such messages are stored in the out-of-context queue. Whenever a new leaf is created, the out-of-context queue is searched and messages pertaining to the new leaf are processed.

A high-level overview of the speculative decision tree induction process is shown in Figure 4.4. Filled rectangles represent newly created nodes. The first snapshot shows the creation of the root with A_1 as the best attribute. The root is split in the next snapshot, followed by further development of the left path. The fourth snapshot shows how the root is changed to a new attribute A_2 and the entire tree rooted at A_1 is made inactive (the yellow part). Finally, as time progresses, the tree rooted at A_2 is further developed. If it so happens that A_1 now becomes better than A_2, the old inactive tree will become active again and the tree rooted at A_2 will become inactive.


Input: S – a set of learning examples, τ – mitigation delay
Output: ad hoc decision tree

Initialization:
begin
    Create a root and let root.S ← S;
    nodes ← {root};
    Push root to queue; Send BRANCH message to self with delay τ;
end

On BRANCH message:
begin
    Send BRANCH message to self with delay τ;
    for (i ← 0, l ← null; i < queue.length and not Active(l); i++) do
        Pop head of queue into l;
        if not Active(l) then enqueue l;
    end
    if Active(l) then A_j ← ad hoc solution of P²MM for l; call Branch(l, j);
end

if DataMsgReceived(n, data) then
    if n ∉ nodes then store ⟨n, data⟩ in out-of-context;
    else Transfer the data to the P²MM instance of n;
    if Active(n) then Process(n);
end

Function Active(n)
begin
    if n = null or n = root then return true;
    A_j ← ad hoc solution of P²MM for n.parent;
    if n ∉ n.parent.sons[j] then return false;
    return Active(n.parent);
end

Function Process(n)
begin
    Perform the tests required by P²MM for n and send any resulting messages;
    A_j ← ad hoc solution of P²MM for n;
    if n.sons[j] is not empty then for each m ∈ n.sons[j] do call Process(m);
    else push n to the tail of the queue;
end

Function Branch(l, j)
begin
    Create two new leaves l_0 and l_1;
    l_0.parent ← l and l_1.parent ← l;
    Set l_0.S ← {s ∈ l.S : s[j] = 0} and l_1.S ← {s ∈ l.S : s[j] = 1};
    Deliver the messages from out-of-context to l_0 and l_1;
    l.sons[j] ← {l_0, l_1}; add l_0, l_1 to nodes; push l_0 and l_1 to the tail of the queue;
end

Algorithm 12: P2P Decision Tree Induction (P²DTI)


4.4.3 Stopping Rule Computation

A decision tree induction algorithm cannot be considered complete without a pruning technique stating which branches over-fit the data. Pruning techniques can be divided into post-pruning — removing nodes or whole subtrees after the tree is induced — and pre-pruning — instructing the algorithm which nodes not to develop in the first place. Since the P²DTI algorithm works on streaming data, the idea of first inducing the tree and afterwards pruning it seems less suitable (because there is no specific point in time at which pruning should begin). We therefore focus on pre-pruning heuristics. Several common pre-pruning techniques can easily be adopted by P²DTI, because they use simple statistics. We describe three such techniques.

The simplest pre-pruning technique is to terminate new leaf development once the tree reaches a pre-specified depth. The biggest benefit of this technique is that it limits the resources used by the algorithm. This technique is trivial to implement as part of P²DTI and performs quite favorably in experiments. Its greatest disadvantage is that the depth limit has to be found by trial and error, and is the same for all leaves.

A second popular pre-pruning technique is to require a minimal degree of gain from every split in the tree, or abort the split if the gain is below the required number. Splitting any given node according to an attribute A_i will have non-negative gain if

    |x^i_00 + x^i_10 − x^i_01 − x^i_11| − |x^i_00 − x^i_01| − |x^i_10 − x^i_11| ≥ 0.

Letting s^i = sign(x^i_00 + x^i_10 − x^i_01 − x^i_11), the previous formula can be rewritten as

    s^i (x^i_00 + x^i_10 − x^i_01 − x^i_11) − s^i_0 (x^i_00 − x^i_01) − s^i_1 (x^i_10 − x^i_11) ≥ 0.

Just as in Section 4.4.1, this formula can be evaluated by holding eight different majority votes per attribute, one for every possible value of s^i, s^i_0, and s^i_1, and then selecting one of them according to the ad hoc result of separate votes on the values of s^i, s^i_0, and s^i_1.
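Evaluated centrally, this pre-pruning test reduces to a few sign manipulations. The sketch below (our own naming) shows the local form of the computation; in P²DTI each bracketed difference is instead decided by a majority vote, with s^i, s^i_0 and s^i_1 supplied by separate votes.

    def sign(x):
        return 1 if x >= 0 else -1

    def split_has_nonnegative_gain(x):
        # x = [[x00, x01], [x10, x11]] is the cross table of attribute A_i
        total = x[0][0] + x[1][0] - x[0][1] - x[1][1]
        s, s0, s1 = sign(total), sign(x[0][0] - x[0][1]), sign(x[1][0] - x[1][1])
        return (s * total - s0 * (x[0][0] - x[0][1])
                - s1 * (x[1][0] - x[1][1])) >= 0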


FIG. 4.4. Figure showing how the speculative decision tree is built by a peer. Filled rectangles represent the newly created nodes. In the first snapshot the root has just been created, with A_1 as the current best attribute. The root is split into two children in the second snapshot. The third snapshot shows further development of the tree by splitting the left child. In the fourth snapshot, the peer becomes convinced that A_2 is the best attribute corresponding to the root. The earlier tree is made inactive and a new tree is developed with a split at A_2. The fifth snapshot shows the leaf label assignments.


Notice that the votes for s^i_0 and s^i_1 are held anyhow. Also, the input for the vote for s^i is independent of i, so just one extra vote is needed for all A_i. Furthermore, for all nodes except the root, a vote on the value of s^i is actually held in the context of the parent of the node.

A third pruning technique is to require non-negative gain from every split when this gain is measured on a test set rather than on the learning set. This method introduces little complexity beyond that described above: the difference is that each node, starting with the root, should be associated with an additional set of examples, and that the input to the votes regarding pruning be taken from that set. Notice that the input to votes of the s^i type should still be taken from the learning set.

In the next section we present a comparative study of the accuracy of a decision tree algorithm using misclassification gain as the impurity measure with fixed-depth stopping against a standard entropy-based pruned decision tree, J48, implemented in Weka [194].

4.4.4 Accuracy on Centralized Dataset

The proposed distributed decision tree algorithm P²DTI deviates from the standard decision tree induction algorithm in two ways: (1) instead of using entropy or gini-index as the splitting criterion, we have used misclassification gain, and (2) as a stopping rule, we have limited the depth of our trees (which affects the communication complexity of our distributed algorithm, as we show later).

In this section we report the results of comparing a decision tree induced using misclassification gain as the impurity measure and a fixed-depth stopping rule to an off-the-shelf entropy-based pruned decision tree, J48, implemented in Weka [194], on a centralized dataset. There are 500 tuples and 10 attributes in the centralized dataset, generated using the scheme discussed in Section 3.9.1. We have varied the noise in the data, ranging from 0% to 20%.


FIG. 4.5. Comparison of accuracy (using 10-fold cross validation) of the J48 Weka tree and a tree induced using misclassification gain with a fixed-depth stopping rule on a centralized dataset. The three graphs correspond to depths of 3, 5 and 7 of the misclassification gain decision tree. (a) Depth of tree = 3. (b) Depth of tree = 5. (c) Depth of tree = 7.


For both trees, the accuracy is measured using 10-fold cross validation. Figure 4.5 presents the comparative study. The circles correspond to the accuracy of J48 and the squares correspond to the accuracy of the misclassification gain decision tree. This set of results points out some important facts:

1. In most cases the misclassification gain decision tree results in a loss of accuracy over J48. This is not unexpected, due to the restrictive nature of the decision tree induction algorithm employed. But the accuracy loss is modest. Moreover, given the heavy communication and synchronization cost of centralizing the data and applying J48, this modest loss of accuracy seems quite reasonable.

2. The decrease in accuracy of the misclassification gain decision tree when going to depth 7 at high noise levels is likely due to over-fitting, since for a depth of 3 the average number of tuples per leaf is 56, compared to only 3 tuples per leaf for a depth of 7.

3. The P²DTI algorithm is guaranteed to converge to the misclassification gain decision tree with fixed stopping depth. Therefore, once convergence is reached, P²DTI might lose some accuracy with respect to J48, and this is quantified by the accuracy of the misclassification gain decision tree with fixed stopping depth as shown here.

Since limiting the depth affects the communication complexity of our P²DTI algorithm, we will use a depth of 3, as it produces quite accurate trees for all noise levels.

4.5 Experiments

To validate the performance of our decision tree algorithm, we conducted experiments on a simulated network of peers. In this section we discuss the experimental setup, the measurement metrics and the performance of the algorithm.


4.5.1 Experimental Setup

Our implementation makes use of the Distributed Data Mining Toolkit (DDMT) [44] — a JADE-LEAP based event simulator developed by the DIADIC research lab at UMBC. The topology is generated using BRITE [23] — an open software for generating network topologies. We have used the Barabasi-Albert (BA) model in BRITE, since it is often considered a reasonable model for the Internet. We use the edge delays defined in BA as the basis for our time measurement, since wall time is meaningless when simulating thousands of computers on a single PC. On top of each network generated by BRITE, we overlayed a communication tree.

4.5.2 Data Generation

The input data of a peer is generated using the scheme proposed by Domingos and Hulten [48]. Each input data point is a vector in {0,1}^d × {0,1}. The data generator is a random tree built as follows. At each level of the tree, an attribute is selected randomly and made an internal node of the tree, with the only restriction that attributes are not repeated along the same path. After the tree is built up to a depth of 3, a node is randomly made a leaf with probability p, along with a randomly generated label. We limit the depth of the tree to a maximum of 6, and make all remaining non-leaf nodes leaves (with random labels) once the tree exceeds that depth. Whenever a peer needs an additional point, it generates a random vector in the d-dimensional space and then passes it through the tree. The label it is assigned forms the class label for that input vector. This forms noise-free input vectors. In order to add noise, the bits of the vectors (including the class label) are flipped with a certain probability. Therefore, n% noise means that each bit of the input vector is flipped with n% chance, and the new value of that bit is chosen uniformly from all the possibilities (including the original value). The data generator is changed every 5 × 10^5 simulator ticks, thereby creating an epoch. A typical experiment consists of 10 equal-length epochs. In addition, throughout the experiment we change 10% of the data of each peer after every 1000 clock ticks. Therefore, in all our experiments there are two levels of data change — (1) stationary change, when we sample from the same data distribution every 1000 simulator ticks, and (2) dynamic change, when the data distribution changes after every 5 × 10^5 simulator ticks.
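The data generation scheme just described can be sketched as follows. The parameter names (p_leaf, max_depth, noise) are ours, and the sketch assumes d is large enough that attributes are never exhausted along a path (true for the d ≥ 10 used in our experiments).

    import random

    def make_label_tree(d, depth=0, p_leaf=0.25, max_depth=6, used=()):
        # below depth 3 every node is internal; past max_depth every node is a leaf
        if depth >= 3 and (depth == max_depth or random.random() < p_leaf):
            return ("leaf", random.randint(0, 1))
        attr = random.choice([a for a in range(d) if a not in used])
        child = lambda: make_label_tree(d, depth + 1, p_leaf, max_depth, used + (attr,))
        return ("node", attr, child(), child())

    def sample(tree, d, noise=0.10):
        x = [random.randint(0, 1) for _ in range(d)]
        node = tree
        while node[0] == "node":
            node = node[2] if x[node[1]] == 0 else node[3]
        # flip every bit, including the class label, with probability `noise`,
        # redrawing uniformly (so the original value may be kept)
        bits = [b if random.random() >= noise else random.randint(0, 1)
                for b in x + [node[1]]]
        return bits[:-1], bits[-1]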


4.5.3 Measurement Metrics

The two measurements of our algorithm are the quality of the result and the cost incurred.

Given a test dataset at each peer, generated from the same distribution as the local dataset, quality is measured in terms of the percentage of correctly classified tuples of this test set. We report both the stationary accuracy, which refers to the accuracy measured during the last 80% of the epochs (and hence corresponds to stationary changes), and the overall accuracy. Each quality graph in Figures 4.6, 4.7, 4.8, 4.9, 4.10 and 4.11–4.13 reports two quantities — (1) the average quality over all peers, all epochs and 10 independent trials (the center markers), and (2) the standard deviation over 10 independent trials (error bars).

For measuring the cost of the algorithm we report two quantities — normalized messages sent and normalized bytes transferred. Our metric for normalized messages is the number of messages sent by each peer per unit of the leaky bucket L. For an algorithm whose communication model is broadcast, the normalized message count is 2, considering 2 neighbors per peer on average. We report both the overall messages and the monitoring messages; the latter refers to the "wasted effort" of the algorithm. For a given time instance, if a peer needs to send k separate messages, corresponding to different majority votes, to one particular peer, this is counted as one message to that neighbor.

Similarly, to understand the actual communication overhead of our algorithm in terms of the number of bytes sent, we report both the overall and the monitoring bytes transferred, per unit of L. In every raw message the distributed algorithm sends 5 numbers — the data of the vote, the id of the vote, the id of the attributes to which this vote corresponds, the path in the tree, and the maximum pivot.


Considering the above example, the number of bytes sent is 5 × k per neighbor. As before, for a broadcast-based algorithm having an attribute cross table with four entries (2 values and 2 classes), the bytes sent would be (number of attributes) × 4 × 2, where the factor of two again assumes 2 neighbors per peer. For example, with 10 attributes, the number of bytes sent per L is 80. As we did for quality, we have plotted both the average cost and the standard deviation of the result over 10 independent trials.

There are three parameters of the P²DTI algorithm that we have explored — (1) the number of local tuples, i.e., the size of the local dataset |S_i|, (2) the depth of the induced tree, and (3) the size of the leaky bucket L. The measurement points for the local data points per peer are 250, 500, 1000, 2000 and 4000. For the depth of the tree we used values of 2, 3, 4, 5 and 7, while we varied L among 1000, 2000, 3000 and 4000. The values of L are in simulator ticks, where the average edge delay is about 1100 time units.

The data generator had two parameters — (1) the noise in the data, varied between 0%, 5%, 10% and 20%, and (2) the number of attributes (10, 15, and 20).

Finally, as a system parameter, we varied the number of peers from 50 to 1500.

Unless otherwise stated, we have used the following default values: |S_i| = 500, depth of the tree = 3, noise = 10%, number of attributes = 10, number of peers = 1000, and L = 1000 (where the average edge delay is about 1100 time units). Under these values, for a broadcast algorithm, the number of normalized messages is 2 while the number of normalized bytes is 10 × 4 × 2 = 80.

In all our experiments we have observed the following phenomenon. As soon as the epoch changes, the accuracy of the algorithm goes down and the communication increases, since the algorithm adapts to the new distribution. As soon as the distribution becomes stable, the message overhead reduces and the accuracy improves. To take note of this, we have plotted both the overall and the stationary behavior of the algorithm with respect to the different parameters.
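For reference, the broadcast baselines quoted throughout the experiments follow directly from these defaults; the short computation below (ours) reproduces the numbers 2 and 80 used in the text.

    # Broadcast baselines implied by the default parameters
    num_attributes, avg_neighbors, cross_table_entries = 10, 2, 4
    normalized_messages = avg_neighbors                                      # 2
    normalized_bytes = num_attributes * cross_table_entries * avg_neighbors  # 80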


In the next few sections we present the performance of the P²DTI algorithm for the different parameters.

4.5.4 Scalability

Our first set of results demonstrates the scalability of the P²DTI algorithm as the number of peers is varied from 50 to 1500. The number of peers has no effect on the performance, as we see in Figure 4.6. In Figure 4.6(a), both the overall and the stationary accuracy converge to a constant as the number of peers is increased. The normalized messages and the normalized bytes transferred, shown in Figures 4.6(b) and 4.6(c), change very slowly and remain almost constant as the number of peers is increased. Since our algorithm relies on data-dependent rules to prune messages, the total number of peers has little effect on the quality or the cost. Hence the algorithm is highly scalable. Note that for an algorithm which broadcasts sufficient statistics to maintain the trees, the normalized messages and normalized bytes transferred would be 2 and 80 respectively. Our results show a significant improvement.

4.5.5 Data Tuples per Peer

In this section we experiment with the first algorithm parameter — the number of tuples in the local dataset |S_i|. Figure 4.7 summarizes the results. Stationary accuracy increases from 72% to 85%, stationary messages decrease from 1.21 to 0.32 and stationary bytes reduce from 20.9 to 4.24 as the size of the local dataset is increased from 250 tuples per peer to 4000 tuples per peer. This is expected, since with increasing |S_i| the global tree is induced on a larger dataset, leading to better accuracy. Moreover, with increasing |S_i| the algorithm can capture more variability in the distribution (since the majority votes are run on a larger dataset), leading to lower communication. Even for the smallest dataset size of 250, the normalized messages are 1.21 and the stationary bytes are 20.9, both far less than the respective broadcast figures of 2 and 80.


FIG. 4.6. Dependence of the quality and cost of the decision tree algorithm on the number of peers. The accuracy and the cost are almost invariant of the number of peers. (a) Quality vs. number of peers. (b) Normalized messages vs. number of peers. (c) Normalized bytes transferred vs. number of peers.

4.5.6 Depth of Tree

As pointed out in Section 4.4.4, the depth of the decision tree induced by the P²DTI algorithm affects the cost of the algorithm. In this section we validate this result.


FIG. 4.7. Dependence of the quality and cost of the decision tree algorithm on |S_i|. The accuracy improves and the cost decreases as |S_i| increases. (a) Quality vs. |S_i|. (b) Normalized messages vs. |S_i|. (c) Normalized bytes transferred vs. |S_i|.


The experimental results are shown in Figure 4.8. As shown, the effect of the depth is more pronounced on the communication than on the quality of the result. Accuracy increases from 72% to 81% as the depth is increased from 2 to 5. However, for a depth of 7, the accuracy of P²DTI decreases by 2% compared to a depth of 5. The reason for this is over-fitting of the domain. As the depth is increased, there are potentially more ties for every majority vote, leading to a message explosion. As the depth is increased from 2 to 7, the stationary messages increase from 0.71 to 1.56. The stationary bytes go up to 58.71 for a depth of 7.

Although the induced tree of depth 5 is around 3% more accurate than the tree of depth 3, we have used trees of depth 3 throughout as a baseline. This is because a tree of depth 3 has a far lower communication overhead than a tree of depth 5 (0.89 normalized messages for a depth of 3 compared to 1.19 normalized messages for a depth of 5).

4.5.7 Size of Leaky Bucket

The last algorithm parameter that we experimented with is the leaky bucket mechanism. In this section we present the effect of the size of the leaky bucket |L| on the accuracy and the cost of the P²DTI algorithm. Figure 4.9 summarizes the effect. As shown in Figure 4.9(a), the stationary accuracy remains constant even as the size of the leaky bucket is made two or three times the edge delay (which is roughly 1100 time units). The overall quality degrades. This is exactly what we expect. As |L| is increased, at every epoch change the algorithm takes more time to converge, thereby carrying inaccurate results for a longer time. For this reason the overall accuracy degrades. However, once the algorithm adapts to the new distribution, a small number of messages is sufficient to maintain correctness. This is why the leaky bucket has no effect on the stationary accuracy. Contrary to the quality, the cost reduces drastically, from 0.98 stationary messages per peer for |L| = 500 to 0.71 for |L| = 4000. The trend for the bytes transferred is similar.


FIG. 4.8. Dependence of the quality and cost of the decision tree algorithm on the depth of the induced tree. The accuracy first improves and then degrades as the depth of the tree increases. The cost increases as the depth increases. (a) Quality vs. depth of the tree. (b) Normalized messages vs. depth of the tree. (c) Normalized bytes transferred vs. depth of the tree.


FIG. 4.9. Dependence of the quality and cost of the decision tree algorithm on the size of the leaky bucket. (a) Quality vs. size of |L|. (b) Normalized messages vs. size of |L|. (c) Normalized bytes transferred vs. size of |L|.


4.5.8 Noise in Data

In this section we vary one of the data parameters — noise. As the noise in the data is increased from 0% to 20%, quality degrades and cost increases. This is demonstrated in Figure 4.10. In Figure 4.10(a), the stationary accuracy decreases from 83% to approximately 75%. The stationary messages increase from 0.52 to 1.24 and the stationary bytes increase from 9.43 to 17.89, as demonstrated in Figures 4.10(b) and 4.10(c) respectively. This happens because, with increasing noise, every comparison consumes more resources to decide the better attribute, and this decision can get flipped every time the data changes. As a result, the quality degrades and the cost increases. The important observation here is that even for the highest noise, the number of bytes transferred is 17.89, far less than the maximal allowable rate of 80.

4.5.9 Number of Attributes

The last parameter we varied is the number of attributes. We have measured the effect in three different ways — (1) increasing the number of attributes while keeping the number of tuples constant (Figures 4.11(a), 4.11(b) and 4.11(c)), (2) increasing the number of attributes while increasing the number of tuples linearly with the number of attributes (Figures 4.12(a), 4.12(b) and 4.12(c)), and (3) increasing the number of attributes while increasing the number of tuples linearly with the size of the domain (Figures 4.13(a), 4.13(b) and 4.13(c)).

As the number of attributes is increased keeping the number of tuples constant (at 500 tuples per peer), the stationary accuracy decreases from 73% to 63% (Figure 4.11(a)). Similarly, Figures 4.11(b) and 4.11(c) demonstrate that the normalized messages increase from 0.9 to 1.25 and the normalized bytes increase from approximately 17 to 92. Note that for 10, 15 and 20 attributes, the maximal bytes transferred for a broadcast-based algorithm are 80, 120 and 160 respectively.


FIG. 4.10. Dependence of the quality and cost of the decision tree algorithm on noise in the data. The accuracy decreases and the cost increases as the percentage of noise increases. (a) Quality vs. noise. (b) Normalized messages vs. noise. (c) Normalized bytes transferred vs. noise.


The second set of figures, Figures 4.12(a), 4.12(b) and 4.12(c), shows the effect of the number of attributes when the number of tuples is increased linearly with the number of attributes (such that number of tuples = 50 × number of attributes). For 10 attributes we have used 500 data tuples per peer. We increased this to 750 tuples for 15 attributes and further to 1000 tuples for 20 attributes. The accuracy degrades from 73% to 63%. More interesting is the effect on the communication. The stationary messages increase very slowly (from 0.89 to 0.92), demonstrating that the algorithm is scalable.

One last variation is the relationship to the number of attributes when the number of tuples is increased in proportion to the size of the domain (number of tuples = 1% × 2^(number of attributes)). For 10, 15 and 20 attributes, the numbers of tuples per peer we used are 10, 330 and 10000 respectively. The accuracy improves and the normalized messages decrease as the number of attributes is increased. The number of bytes transferred increases, though for 20 attributes it is still well below what would have been used by a broadcast-based algorithm.

4.5.10 Discussion

In the previous sections we presented the quality and the cost of the P²DTI algorithm under the different algorithm parameters. Our findings can be summarized as follows.

• In most cases P²DTI results in a loss of accuracy over J48. This is because of the simpler gain function that we have chosen. However, given the heavy communication and synchronization cost of centralizing the data and applying J48, the observed modest loss of accuracy seems quite reasonable.

• The P²DTI algorithm is highly scalable with respect to the number of peers. As shown by our scalability experiments, increasing the number of peers has little effect on the quality or cost.


FIG. 4.11. Dependence of the quality and cost of the decision tree algorithm on the number of attributes when the number of tuples is constant: (a) quality, (b) normalized messages, (c) normalized bytes, each plotted for 10, 15 and 20 attributes (stationary and overall). As the number of attributes increases, the accuracy drops and the cost increases.


FIG. 4.12. Dependence of the quality and cost of the decision tree algorithm on the number of attributes when the number of tuples increases linearly with the number of attributes: (a) quality, (b) normalized messages, (c) normalized bytes, each plotted for 10, 15 and 20 attributes (stationary and overall).


FIG. 4.13. Dependence of the quality and cost of the decision tree algorithm on the number of attributes when the number of tuples increases linearly with |domain|: (a) quality, (b) normalized messages, (c) normalized bytes, each plotted for 10, 15 and 20 attributes (stationary and overall).


• Increasing the number of tuples increases the accuracy of the tree and decreases the cost. Even for tuples per peer equal to a quarter of the size of the domain (250 tuples per peer for 10 attributes), the monitoring cost is 1.25 — less than the maximal allowable cost of 2.0.

• Note that every increment in the number of attributes doubles the search space. The quality and cost of our algorithm remain moderate even when the number of attributes is doubled.

• Noise in the data degrades both the quality and the cost — it can be compensated for either by increasing the number of tuples or by increasing the depth of the tree. A depth of three seems to be a moderate choice, since the accuracy is good and the monitoring cost is low as well. Increasing the depth improves the accuracy with a heavy penalty on the cost.

4.6 Summary

In this chapter we presented an asynchronous, scalable algorithm for inducing a decision tree in large P2P networks. With sufficient time, the algorithm converges to the same tree that would be induced given all the data of all the peers. To the best of the authors' knowledge, this is one of the first attempts at developing such an algorithm. The algorithm is suitable for scenarios in which the data is distributed across a large P2P network, as it seamlessly handles data changes and peer failures. We have conducted extensive experiments with synthetic datasets to analyze the different parameters of the algorithm. The results point out that the algorithm is accurate and incurs moderate communication overhead compared to a broadcast-based algorithm. The algorithm is also highly scalable, both with respect to the number of peers and with respect to the number of attributes.

This chapter relies on the majority voting algorithm as a building block.


The majority voting protocol is highly efficient and scalable, mainly due to the existence of local pruning rules. In the literature such algorithms are commonly referred to as local algorithms. Previous work on exact local algorithms mainly focused on developing efficient building blocks such as majority voting [196], L2-thresholding [195] and more. In this chapter, similar to [105], we leverage these powerful building blocks to show how more complex data mining algorithms can be developed for large-scale distributed systems. In the process we have also shown how a complex function such as entropy needs to be simplified to misclassification gain in order to aid the algorithm development process.


Chapter 5

CONCLUSIONS AND FUTURE WORK

Proliferation of communication technologies and reduction in storage costs over the past decade have led to the emergence of several distributed systems. Most of the initial effort was directed towards developing systems based on the client-server architecture, such as most modern web servers, file servers, mail servers, print servers, e-commerce applications and more. These became popular mainly because of the simple synchronous nature of the communication between the server and the client. Since the data is stored mainly at the server, it is relatively easy to ensure security and privacy. However, as more and more of these systems evolved, several shortcomings were realized. With more clients and more requests from each client, traffic congestion between the server and the clients became heavy, leading to reduced throughput. Moreover, in many cases several clients sat idle, leading to unbalanced load distribution. Another major drawback was the lack of robustness to failed clients. Since in a typical client-server-based distributed system the data is not replicated but rather kept at a single site, a node failure is devastating for the entire system.

Peer-to-Peer (P2P) systems were developed to solve some of these problems. A P2P system, as already defined in Chapter 2, Section 2.2, is an architecture in which no special machine or machines provide a service or manage the network resources. Instead, all responsibilities are uniformly divided among all machines, known as peers. Peers can serve both as clients and as servers; hence they are often termed servents.


While P2P systems originally emerged as forums for sharing and exchanging data — examples include Gnutella, BitTorrent, e-Mule, Kazaa, and Freenet — more recent research, such as work on P2P web community formation, argues that consumers will greatly benefit from the knowledge locked in the data. Algorithms designed for knowledge extraction from typical client-server systems fail to perform well in such massive P2P systems, mainly because of (1) the asynchronous communication paradigm, (2) decentralized control, (3) the large number of nodes, and (4) the need for robustness to system failures.

In this dissertation we have proposed several algorithms specifically geared towards P2P systems. More specifically, in Chapter 3 our contributions are as follows:

• The DeFraLC framework provides a theorem which can be used to build several deterministic local algorithms.

• Based on the above-mentioned theorem, we have developed a generic local algorithm (PeGMA) capable of computing complex functions defined on the average of the data horizontally distributed over a large P2P network.

• We have also discussed a general framework for monitoring, and consequently reactively updating, any model of horizontally distributed data.

• Finally, we described the application of this framework to the problems of tracking the average of distributed data, the first eigenvector of that data, the k-means clustering of that data, and multivariate regression of the data. Our experimental results have shown that the algorithms are accurate and incur modest communication cost compared to centralizing the entire data.

Previously, a local algorithm needed to be developed for each data mining problem. The results of Chapter 3 present an easier solution — we can develop several data mining algorithms by following the guidelines presented therein.


In the next chapter, Chapter 4, we showed how a complex yet well understood data mining algorithm such as decision tree induction can be developed for P2P systems. In the process, we chose a relatively simple attribute gain metric, misclassification gain, over more complex ones such as entropy and Gini, since (1) it is easier to compute in a P2P domain and (2) it achieves comparable accuracy. We also showed how the tree can be built in a top-down manner without the need for all the peers to synchronize and exchange data.

This thesis has shown how traditional computation problems such as k-means, multivariate regression and decision tree induction can be converted to decision problems and solved using local algorithms in large distributed environments. We have gained valuable insights which show that it is possible to develop distributed local algorithms for an important class of data mining problems, viz. decision problems. Therefore, several efficient local algorithms can be developed for monitoring the data mining models generated using a centralized computation. It is still unclear to us whether local algorithms can be developed for computing the models themselves. In our case we have used an inefficient technique — convergecast-broadcast — to perform the computation, and then relied on the efficiency and correctness of the local algorithm. This opens up several interesting research areas, listed below.

Complex data mining algorithm development for P2P systems: Previous work on developing deterministic yet efficient algorithms for P2P systems mainly focused on developing basic primitives such as sum computation [119], majority voting [196], etc. In order to harness the power of the next generation of P2P systems, complex algorithms need to be developed for extracting the knowledge locked in the data of these systems. These primitives might aid in the algorithm development process (as we have shown for our decision tree algorithm in Chapter 4); however, in many cases we may need to develop algorithms from scratch. Clustering, classification and association rule mining in P2P systems need efficient and scalable solutions.
This work and some other work such as [105] and [116] have barely scratched the surface compared to the huge number of algorithms that still need to be developed. Furthermore, these algorithms need to be deployed in real life to solve practical problems.

Theoretical analysis of local algorithms for distributed systems: Local algorithms for P2P systems have been an active area of research for the last five years. While a few algorithms have been developed, the field lacks theoretical work on the "locality" of these algorithms. The locality and efficiency claims are, in many cases, intuitive, and hence experimental results are the only resource to demonstrate efficiency, locality and scalability. Although in this work we have proposed two definitions of locality, they are far from perfect. Analysis of the communication complexity, locality and convergence rate of the local algorithms proposed in the literature needs to be carried out in order to provide better insights into the working of these algorithms.

Ordinal statistics based algorithm development: In many cases, for deterministic local algorithms, tied votes consume the most resources. Consider the example of comparing two real numbers. If the numbers are nearly equal, a lot of resources are consumed by the majority voting algorithm. Ordinal statistics can provide a cheaper solution since, in ordinal decision theory, only the comparison matters, not the actual values. Applying order statistics (as done in [37]) to resolve near-tied majority votes can be an active area of research.

Local algorithms for computation problems: As already discussed, we have used local algorithms to monitor a model generated using centralized/sampling techniques (except for decision tree induction). We still do not have any concrete ideas on how to develop local algorithms for many of the computational problems that we have discussed in this thesis.
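To make the monitoring-based pattern referred to above concrete, the following schematic Python loop is entirely our illustration: build_model_by_convergecast and local_monitor_flags_divergence are hypothetical placeholders, not functions defined in this thesis, and the monitor is written as a global scan purely for readability, whereas a real deployment would use a provably local thresholding algorithm.

import random

def build_model_by_convergecast(peers):
    # Placeholder for the expensive step: aggregate the data up a
    # spanning tree and broadcast the resulting model to all peers.
    data = [x for p in peers for x in p["data"]]
    return sum(data) / len(data)          # e.g., the global average

def local_monitor_flags_divergence(peers, model, epsilon):
    # Placeholder for the efficient local algorithm: raise an alert
    # when the current data has drifted away from the model.
    current = (sum(x for p in peers for x in p["data"])
               / sum(len(p["data"]) for p in peers))
    return abs(current - model) > epsilon

peers = [{"data": [random.gauss(0, 1) for _ in range(100)]}
         for _ in range(50)]
model = build_model_by_convergecast(peers)
for _ in range(10):                       # the data changes over time
    for p in peers:
        p["data"] = [x + random.gauss(0, 0.01) for x in p["data"]]
    if local_monitor_flags_divergence(peers, model, epsilon=0.1):
        model = build_model_by_convergecast(peers)  # rebuild only on alert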


Appendix A

MAJORITY VOTING PROTOCOL AND ITS RELATION TO THE L2 THRESHOLDING ALGORITHM

The majority voting protocol was developed by Wolff and Schuster [196]. In this appendix we show how the said protocol can be derived from our L2 thresholding algorithm. The original protocol suffers from one major drawback: small variations in the local data of a peer, i.e. in $\Delta^k$, trigger more messages. This is detrimental in a dynamic setting where the data changes frequently. Our mapping of the majority protocol to the L2 algorithm solves this problem by changing the rules under which a message is sent, thereby making the protocol more communication efficient under such circumstances. The original majority protocol is presented in Algorithm 13.

Recalling the notation we have used in Chapter 3, $\vec{K}^k$, $\vec{A}^{k,l}$ and $\vec{W}^{k,l}$ refer to the average knowledge, agreement and withheld knowledge of peer $P_k$. Now, for any majority threshold $\lambda$:

1. $\Delta^k = \left(\vec{K}^k - \lambda\right)\left|K^k\right| \;\Leftrightarrow\; \vec{K}^k = \lambda + \frac{\Delta^k}{\left|K^k\right|}$

2. $\Delta^{kl} = \left(\vec{A}^{k,l} - \lambda\right)\left|A^{k,l}\right| \;\Leftrightarrow\; \vec{A}^{k,l} = \lambda + \frac{\Delta^{kl}}{\left|A^{k,l}\right|}$

3. $\vec{W}^{k,l} = \dfrac{\vec{K}^k\left|K^k\right| - \vec{A}^{k,l}\left|A^{k,l}\right|}{\left|K^k\right| - \left|A^{k,l}\right|} = \dfrac{\Delta^k + \lambda\left|K^k\right| - \Delta^{kl} - \lambda\left|A^{k,l}\right|}{\left|K^k\right| - \left|A^{k,l}\right|} = \dfrac{\Delta^k - \Delta^{kl}}{\left|K^k\right| - \left|A^{k,l}\right|} + \lambda$
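These identities are easy to verify numerically. The following small Python check is our illustration, not part of the original derivation; the values are arbitrary. It confirms that the withheld knowledge computed from the raw averages agrees with the closed form of identity (3).

# Numeric sanity check of identities (1)-(3); values are arbitrary.
lam = 0.5                      # majority threshold lambda
K_size, A_size = 7, 3          # |K^k| and |A^{k,l}|
delta_k, delta_kl = 2.0, -1.0  # Delta^k and Delta^{kl}

K_avg = lam + delta_k / K_size    # identity (1): average knowledge
A_avg = lam + delta_kl / A_size   # identity (2): average agreement

# Withheld knowledge from its definition ...
W_def = (K_avg * K_size - A_avg * A_size) / (K_size - A_size)
# ... and from the closed form of identity (3).
W_closed = (delta_k - delta_kl) / (K_size - A_size) + lam

assert abs(W_def - W_closed) < 1e-12
print(W_def, W_closed)   # both are (approximately) 1.25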


Input: δ^k, N^k, L
Output: if Δ^k ≥ 0 then 1 else 0
Local variables: ∀l ∈ N^k: δ^{lk}, δ^{kl}
Definitions: Δ^k = δ^k + Σ_{l∈N^k} δ^{lk},  Δ^{kl} = δ^{kl} + δ^{lk}

Initialization:
    forall l ∈ N^k do
        δ^{kl} ← 0; δ^{lk} ← 0
        SendMessage(l)

On MessageRecvd(P_l, δ): δ^{lk} ← δ
On NodeFailure(l ∈ N^k): N^k ← N^k \ {l}
On AddNeighbor(l): N^k ← N^k ∪ {l}
If N^k or δ^k changes, or a message is received: call OnChange()

Function OnChange():
    forall l ∈ N^k do
        if (Δ^{kl} ≥ 0 ∧ Δ^{kl} > Δ^k) ∨ (Δ^{kl} < 0 ∧ Δ^{kl} < Δ^k) then
            call SendMessage(l)

Function SendMessage(l):
    if time() − last_message ≥ L then
        δ^{kl} ← Δ^k − δ^{lk}
        last_message ← time()
        send ⟨δ^{kl}⟩ to l
    else
        wait L − (time() − last_message) time units and then call OnChange()

Algorithm 13: Local Majority Vote
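For concreteness, the following Python sketch is ours, not the original pseudo-code: the leaky-bucket timer L is omitted and message passing is simulated by direct method calls. It captures the heart of Algorithm 13 — the per-neighbor agreement bookkeeping and the pruning rule of OnChange.

class MajorityPeer:
    """Minimal sketch of the local majority vote of Algorithm 13 [196]."""

    def __init__(self, pid, delta):
        self.pid = pid            # peer identifier
        self.delta = delta        # local vote count, delta^k
        self.received = {}        # delta^{lk}: last value received from l
        self.sent = {}            # delta^{kl}: last value sent to l
        self.neighbors = {}       # pid -> MajorityPeer

    def connect(self, other):
        self.neighbors[other.pid] = other
        other.neighbors[self.pid] = self

    def knowledge(self):
        # Delta^k = delta^k + sum of delta^{lk} over all neighbors
        return self.delta + sum(self.received.values())

    def agreement(self, l):
        # Delta^{kl} = delta^{kl} + delta^{lk}
        return self.sent.get(l, 0) + self.received.get(l, 0)

    def on_change(self):
        # Send only when the agreement overshoots the knowledge -- the
        # local pruning rule; otherwise the peer stays silent.
        for l, peer in list(self.neighbors.items()):
            a, k = self.agreement(l), self.knowledge()
            if (a >= 0 and a > k) or (a < 0 and a < k):
                self.sent[l] = k - self.received.get(l, 0)
                peer.receive(self.pid, self.sent[l])

    def receive(self, sender, value):
        self.received[sender] = value
        self.on_change()

    def output(self):
        return 1 if self.knowledge() >= 0 else 0

# Two peers voting -2 and +5: the global sum is positive, and after the
# exchange both peers report 1 without any central coordinator.
a, b = MajorityPeer("a", -2), MajorityPeer("b", 5)
a.connect(b)
a.on_change(); b.on_change()
print(a.output(), b.output())   # 1 1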


The condition for sending messages in majority voting can be written as:

$$\left[\lambda \geq \Delta^{kl} \geq \Delta^k\right] \vee \left[\lambda < \Delta^{kl} \leq \Delta^k\right]$$

Similarly, the condition for sending messages in L2 thresholding can be written as:

$$\left[\lambda \geq \vec{A}^{k,l} \wedge \lambda \geq \vec{W}^{k,l}\right] \vee \left[\lambda < \vec{A}^{k,l} \wedge \lambda < \vec{W}^{k,l}\right]$$

Consider the first condition:

$$\left[\lambda \geq \vec{A}^{k,l} \wedge \lambda \geq \vec{W}^{k,l}\right]$$
$$\Rightarrow \left[\lambda \geq \frac{\Delta^{kl}}{\left|A^{k,l}\right|} + \lambda \;\wedge\; \lambda \geq \frac{\Delta^k - \Delta^{kl}}{\left|K^k\right| - \left|A^{k,l}\right|} + \lambda\right]$$
$$\Rightarrow \left[0 \geq \frac{\Delta^{kl}}{\left|A^{k,l}\right|} \;\wedge\; 0 \geq \frac{\Delta^k - \Delta^{kl}}{\left|K^k\right| - \left|A^{k,l}\right|}\right]$$
$$\Rightarrow \left[0 \geq \Delta^{kl} \wedge 0 \geq \Delta^k - \Delta^{kl}\right]$$
$$\Rightarrow \left[0 \geq \Delta^{kl} \geq \Delta^k\right]$$

The last equation shows that the condition for sending messages in the majority protocol is a special case of the same condition in the L2 algorithm. Consequently, we can use the same rule that we used in L2 for determining how much data to send. In L2, instead of setting $\vec{A}^{k,l}$ to $\vec{K}^k$, we set it to a point between the current direction of $\vec{A}^{k,l}$ and $\vec{K}^k$. Similarly, in majority voting we should set $\Delta^{kl} = \alpha\Delta^k$. This way, subsequent small variations in $\Delta^k$ will not trigger any more messages. The modified pseudo-code is presented in Algorithm 14.


Input: δ^k, N^k, L, α
Output: if Δ^k ≥ 0 then 1 else 0
Local variables: ∀l ∈ N^k: δ^{lk}, δ^{kl}
Definitions: Δ^k = δ^k + Σ_{l∈N^k} δ^{lk},  Δ^{kl} = δ^{kl} + δ^{lk}

Initialization:
    forall l ∈ N^k do
        δ^{kl} ← 0; δ^{lk} ← 0
        SendMessage(l)

On MessageRecvd(P_l, δ): δ^{lk} ← δ
On NodeFailure(l ∈ N^k): N^k ← N^k \ {l}
On AddNeighbor(l): N^k ← N^k ∪ {l}
If N^k or δ^k changes, or a message is received: call OnChange()

Function OnChange():
    forall l ∈ N^k do
        if (Δ^{kl} ≥ 0 ∧ Δ^{kl} > Δ^k) ∨ (Δ^{kl} < 0 ∧ Δ^{kl} < Δ^k) then
            call SendMessage(l)

Function SendMessage(l):
    if time() − last_message ≥ L then
        δ^{kl} ← αΔ^k − δ^{lk}
        last_message ← time()
        send ⟨δ^{kl}⟩ to l
    else
        wait L − (time() − last_message) time units and then call OnChange()

Algorithm 14: Dynamic Local Majority Vote
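To illustrate the effect of α, the sketch below (again ours, reusing the MajorityPeer class from the previous listing) applies the damped update δ^{kl} ← αΔ^k − δ^{lk}. After convergence a small downward drift of the local count stays inside the slack and triggers no new message, which is exactly the behavior Algorithm 14 is designed for.

class DynamicMajorityPeer(MajorityPeer):
    """Algorithm 14 sketch: move the agreement only part of the way
    (alpha < 1) towards the knowledge, leaving slack that absorbs
    small subsequent variations of the local data."""

    def __init__(self, pid, delta, alpha=0.5):
        super().__init__(pid, delta)
        self.alpha = alpha

    def on_change(self):
        for l, peer in list(self.neighbors.items()):
            a, k = self.agreement(l), self.knowledge()
            if (a >= 0 and a > k) or (a < 0 and a < k):
                # The only change from Algorithm 13: alpha * Delta^k.
                # Setting alpha = 1 recovers the original update rule.
                self.sent[l] = self.alpha * k - self.received.get(l, 0)
                peer.receive(self.pid, self.sent[l])

a = DynamicMajorityPeer("a", -2, alpha=0.5)
b = DynamicMajorityPeer("b", 5, alpha=0.5)
a.connect(b)
a.on_change(); b.on_change()    # converge; both report 1 since -2 + 5 >= 0
a.delta = -2.2                  # a small variation in the local data ...
a.on_change()                   # ... no longer triggers a message here
print(a.output(), b.output())   # 1 1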


Appendix B

INFORMATION GAIN COMPARISON

In this appendix we first show how the misclassification gain difference between any two arbitrary attributes $A_i$ and $A_j$, with binary class labels $C = 0$ and $C = 1$, can be posed as the difference of sums of the entries of their cross tables. Then we derive the formula for zero-thresholding the Gini information gain for two binary attributes.

B.1 Notation

We follow the same notation as in Section 4.4.1. Let $S$ be a set of learning examples — each a vector in $\{0,1\}^d \times \{0,1\}$ — where the first $d$ entries of each example denote the attributes $A_1, \ldots, A_d$, and the additional one denotes the class $C$. Here we are interested in comparing two arbitrary attributes $A_i$ and $A_j$. Let $x^i_{k0}$ denote the number of examples in the set $S$ for which $A_i = k$ and $C = 0$, $\forall k \in [0 \ldots (m_i - 1)]$. Table B.1 shows all the possible values of the attribute $A_i$ and the class; Table B.2 shows the same for $A_j$.


Attribute value (A_i) | C = 0          | C = 1          | (C = 0) + (C = 1)
----------------------|----------------|----------------|------------------
0                     | x^i_{00}       | x^i_{01}       | x^i_{0·}
1                     | x^i_{10}       | x^i_{11}       | x^i_{1·}
...                   | ...            | ...            | ...
m_i − 1               | x^i_{(m_i−1)0} | x^i_{(m_i−1)1} | x^i_{(m_i−1)·}

Table B.1. Number of entries of attribute A_i and the class C.

Attribute value (A_j) | C = 0          | C = 1          | (C = 0) + (C = 1)
----------------------|----------------|----------------|------------------
0                     | x^j_{00}       | x^j_{01}       | x^j_{0·}
1                     | x^j_{10}       | x^j_{11}       | x^j_{1·}
...                   | ...            | ...            | ...
m_j − 1               | x^j_{(m_j−1)0} | x^j_{(m_j−1)1} | x^j_{(m_j−1)·}

Table B.2. Number of entries of attribute A_j and the class C.

B.2 Thresholding Misclassification Gain

In this section we show that the misclassification gain difference between two attributes $A_i$ and $A_j$ can be written as the difference of sums of the entries in their cross tables. The following theorem states this more formally.

Theorem B.2.1. Given two attributes $A_i$ and $A_j$, thresholding the misclassification gain difference against zero is given by
$$\sum_{k=0}^{m_i-1}\left|x^i_{k0} - x^i_{k1}\right| - \sum_{l=0}^{m_j-1}\left|x^j_{l0} - x^j_{l1}\right| > 0.$$

Proof. As given in the tables, let $x^i_{k\cdot}$ denote the number of examples in which $A_i = k$, counting both $C = 0$ and $C = 1$, i.e. $x^i_{k\cdot} = x^i_{k0} + x^i_{k1}$. Similarly, let $x^j_{l\cdot}$ denote the same for $A_j$. Note that $\sum_{k=0}^{m_i-1} x^i_{k\cdot} = \sum_{l=0}^{m_j-1} x^j_{l\cdot} = |S|$. The attribute gain difference between $A_i$ and $A_j$ is given as:


$$\begin{aligned}
&\sum_{k=0}^{m_i-1} \frac{x^i_{k\cdot}}{|S|}\left[\frac{1}{2} - \left|\frac{x^i_{k0}}{x^i_{k\cdot}} - \frac{1}{2}\right|\right] - \sum_{l=0}^{m_j-1} \frac{x^j_{l\cdot}}{|S|}\left[\frac{1}{2} - \left|\frac{x^j_{l0}}{x^j_{l\cdot}} - \frac{1}{2}\right|\right]\\
&= \sum_{k=0}^{m_i-1} \frac{x^i_{k\cdot}}{|S|}\left[\frac{1}{2} - \left|\frac{2x^i_{k0} - x^i_{k\cdot}}{2x^i_{k\cdot}}\right|\right] - \sum_{l=0}^{m_j-1} \frac{x^j_{l\cdot}}{|S|}\left[\frac{1}{2} - \left|\frac{2x^j_{l0} - x^j_{l\cdot}}{2x^j_{l\cdot}}\right|\right]\\
&= \left[\sum_{k=0}^{m_i-1} \frac{x^i_{k\cdot}}{2|S|} - \sum_{l=0}^{m_j-1} \frac{x^j_{l\cdot}}{2|S|}\right] + \frac{1}{2|S|}\left[\sum_{l=0}^{m_j-1}\left|2x^j_{l0} - x^j_{l\cdot}\right| - \sum_{k=0}^{m_i-1}\left|2x^i_{k0} - x^i_{k\cdot}\right|\right]\\
&= \left[\frac{|S|}{2|S|} - \frac{|S|}{2|S|}\right] + \frac{1}{2|S|}\left[\sum_{l=0}^{m_j-1}\left|x^j_{l0} - x^j_{l1}\right| - \sum_{k=0}^{m_i-1}\left|x^i_{k0} - x^i_{k1}\right|\right]\\
&= \frac{1}{2|S|}\left[\sum_{l=0}^{m_j-1}\left|x^j_{l0} - x^j_{l1}\right| - \sum_{k=0}^{m_i-1}\left|x^i_{k0} - x^i_{k1}\right|\right]
\end{aligned}$$

The last two equalities follow from the fact that $x^j_{l\cdot} = x^j_{l0} + x^j_{l1}$ and $x^i_{k\cdot} = x^i_{k0} + x^i_{k1}$, so that $2x^i_{k0} - x^i_{k\cdot} = x^i_{k0} - x^i_{k1}$. The quantity above is the weighted misclassification error left by splitting on $A_i$ minus that left by splitting on $A_j$; the gain difference of $A_i$ over $A_j$ is its negation. Therefore, zero-thresholding the attribute gain difference is the same as zero-thresholding

$$\sum_{k=0}^{m_i-1}\left|x^i_{k0} - x^i_{k1}\right| - \sum_{l=0}^{m_j-1}\left|x^j_{l0} - x^j_{l1}\right| > 0.$$
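As a concrete illustration (ours, not part of the original appendix), the theorem reduces the comparison of two candidate split attributes to integer arithmetic over their cross tables, which is the form in which the quantity can be fed to the distributed thresholding primitives:

def misclassification_preference(cross_i, cross_j):
    """Return True iff A_i is preferred over A_j by Theorem B.2.1.

    cross_i[k] = (x^i_k0, x^i_k1): class counts among the examples
    with attribute value k; likewise for cross_j.
    """
    score_i = sum(abs(c0 - c1) for c0, c1 in cross_i)
    score_j = sum(abs(c0 - c1) for c0, c1 in cross_j)
    return score_i - score_j > 0

# Toy example over |S| = 100 examples: A_i separates the class
# perfectly while A_j is uninformative, so A_i wins.
cross_i = [(40, 0), (0, 60)]     # rows: A_i = 0, A_i = 1
cross_j = [(20, 30), (20, 30)]   # rows: A_j = 0, A_j = 1
print(misclassification_preference(cross_i, cross_j))   # True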


B.3 Thresholding Gini Information Gain

In this section we show that thresholding the Gini information gain is more complicated than thresholding the misclassification error.

Theorem B.3.1. Let $A_i$ and $A_j$ be two binary attributes. Thresholding the Gini information gain difference of $A_i$ and $A_j$ against zero is the same as checking whether
$$\left[x^j_{01}\right]^2 x^j_{1\cdot} x^i_{0\cdot} x^i_{1\cdot} + \left[x^j_{11}\right]^2 x^j_{0\cdot} x^i_{0\cdot} x^i_{1\cdot} - \left[x^i_{01}\right]^2 x^i_{1\cdot} x^j_{0\cdot} x^j_{1\cdot} - \left[x^i_{11}\right]^2 x^i_{0\cdot} x^j_{0\cdot} x^j_{1\cdot} > 0.$$

Proof. The Gini information gain difference of $A_i$ and $A_j$ is:

$$\begin{aligned}
&x^i_{0\cdot}\left[1 - \left(\frac{x^i_{00}}{x^i_{0\cdot}}\right)^2 - \left(\frac{x^i_{01}}{x^i_{0\cdot}}\right)^2\right] + x^i_{1\cdot}\left[1 - \left(\frac{x^i_{10}}{x^i_{1\cdot}}\right)^2 - \left(\frac{x^i_{11}}{x^i_{1\cdot}}\right)^2\right]\\
&\quad - x^j_{0\cdot}\left[1 - \left(\frac{x^j_{00}}{x^j_{0\cdot}}\right)^2 - \left(\frac{x^j_{01}}{x^j_{0\cdot}}\right)^2\right] - x^j_{1\cdot}\left[1 - \left(\frac{x^j_{10}}{x^j_{1\cdot}}\right)^2 - \left(\frac{x^j_{11}}{x^j_{1\cdot}}\right)^2\right]\\
&= \frac{2x^i_{00}x^i_{01}}{x^i_{0\cdot}} + \frac{2x^i_{10}x^i_{11}}{x^i_{1\cdot}} - \frac{2x^j_{00}x^j_{01}}{x^j_{0\cdot}} - \frac{2x^j_{10}x^j_{11}}{x^j_{1\cdot}}\\
&= 2\left[x^i_{01} - \frac{\left(x^i_{01}\right)^2}{x^i_{0\cdot}} + x^i_{11} - \frac{\left(x^i_{11}\right)^2}{x^i_{1\cdot}} - x^j_{01} + \frac{\left(x^j_{01}\right)^2}{x^j_{0\cdot}} - x^j_{11} + \frac{\left(x^j_{11}\right)^2}{x^j_{1\cdot}}\right]\\
&= 2\left[\frac{\left(x^j_{01}\right)^2}{x^j_{0\cdot}} + \frac{\left(x^j_{11}\right)^2}{x^j_{1\cdot}} - \frac{\left(x^i_{01}\right)^2}{x^i_{0\cdot}} - \frac{\left(x^i_{11}\right)^2}{x^i_{1\cdot}}\right]
\end{aligned}$$

Here the second equality uses $x^i_{k\cdot} - \frac{(x^i_{k0})^2 + (x^i_{k1})^2}{x^i_{k\cdot}} = \frac{2x^i_{k0}x^i_{k1}}{x^i_{k\cdot}}$ (since $x^i_{k\cdot} = x^i_{k0} + x^i_{k1}$), the third substitutes $x^i_{k0} = x^i_{k\cdot} - x^i_{k1}$, and the last follows because $x^i_{01} + x^i_{11} = x^j_{01} + x^j_{11}$, both being the total number of examples with class $C = 1$.


Therefore, thresholding the Gini information gain difference about zero is the same as determining whether the following expression is greater than zero:
$$\frac{\left[x^j_{01}\right]^2}{x^j_{0\cdot}} + \frac{\left[x^j_{11}\right]^2}{x^j_{1\cdot}} - \frac{\left[x^i_{01}\right]^2}{x^i_{0\cdot}} - \frac{\left[x^i_{11}\right]^2}{x^i_{1\cdot}} > 0,$$
which, after multiplying through by $x^i_{0\cdot}x^i_{1\cdot}x^j_{0\cdot}x^j_{1\cdot} > 0$, is equivalent to determining whether the following polynomial is greater than zero:
$$\left[x^j_{01}\right]^2 x^j_{1\cdot} x^i_{0\cdot} x^i_{1\cdot} + \left[x^j_{11}\right]^2 x^j_{0\cdot} x^i_{0\cdot} x^i_{1\cdot} - \left[x^i_{01}\right]^2 x^i_{1\cdot} x^j_{0\cdot} x^j_{1\cdot} - \left[x^i_{11}\right]^2 x^i_{0\cdot} x^j_{0\cdot} x^j_{1\cdot} > 0.$$
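The polynomial form is attractive precisely because it is division-free. The following sketch (ours; the impurity cross-check uses exact rational arithmetic) implements the test of Theorem B.3.1 and verifies it against the weighted Gini impurities directly; by the derivation above, the polynomial is positive exactly when splitting on A_j leaves the lower impurity.

from fractions import Fraction

def gini_polynomial_positive(cross_i, cross_j):
    """Sign test of Theorem B.3.1 for two binary attributes.

    cross[k] = (x_k0, x_k1). By the derivation above, a positive
    polynomial means A_j yields the lower weighted Gini impurity.
    """
    (i00, i01), (i10, i11) = cross_i
    (j00, j01), (j10, j11) = cross_j
    i0, i1 = i00 + i01, i10 + i11          # x^i_{0.}, x^i_{1.}
    j0, j1 = j00 + j01, j10 + j11          # x^j_{0.}, x^j_{1.}
    return (j01**2 * j1 * i0 * i1 + j11**2 * j0 * i0 * i1
            - i01**2 * i1 * j0 * j1 - i11**2 * i0 * j0 * j1) > 0

def weighted_gini(cross):
    # |S| times the weighted Gini impurity of the split.
    return sum((c0 + c1) * (1 - Fraction(c0, c0 + c1)**2
                              - Fraction(c1, c0 + c1)**2)
               for c0, c1 in cross)

cross_i = [(40, 10), (10, 40)]
cross_j = [(30, 25), (20, 25)]
assert gini_polynomial_positive(cross_i, cross_j) == (
    weighted_gini(cross_j) < weighted_gini(cross_i))
print(gini_polynomial_positive(cross_i, cross_j))   # False: A_i is better here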


REFERENCES

[1] Y. Afek, S. Kutten, and M. Yung. The Local Detection Paradigm and Its Application to Self-Stabilization. Theoretical Computer Science, 186(1–2):199–229, October 1997.
[2] Gul Agha. Actors: A Model of Concurrent Computation in Distributed Systems. MIT Press, Cambridge, MA, USA, 1986.
[3] Amazon Music Recommendation. http://www.amazon.com/.
[4] S. Bailey, R. Grossman, H. Sivakumar, and A. Turinsky. Papyrus: a System for Data Mining Over Local and Wide Area Clusters and Super-Clusters. In Proceedings of CDROM'99, page 63, New York, NY, USA, 1999.
[5] T. Balch and R. C. Arkin. Motor Schema-based Formation Control for Multiagent Robot Teams. In Proceedings of ICMAS'95, pages 10–16, San Francisco, CA, USA, 1995.
[6] Wolf-Tilo Balke, Wolfgang Nejdl, Wolf Siberski, and Uwe Thaden. Progressive Distributed Top-k Retrieval in Peer-to-Peer Networks. In Proceedings of ICDE'05, pages 174–185, Tokyo, Japan, 2005.
[7] Sanghamitra Bandyopadhyay, Chris Giannella, Ujjwal Maulik, Hillol Kargupta, Kun Liu, and Souptik Datta. Clustering Distributed Data Streams in Peer-to-Peer Environments. Information Science, 176(14):1952–1985, 2006.
[8] A. Bar-Or, D. Keren, A. Schuster, and R. Wolff. Hierarchical Decision Tree Induction in Distributed Genomic Databases. IEEE Transactions on Knowledge and Data Engineering – Special Issue on Mining Biological Data, 17(8):1138–1151, August 2005.
[9] A. Bar-Or, D. Keren, A. Schuster, and R. Wolff. Hierarchical Decision Tree Induction in Distributed Genomic Databases. IEEE Transactions on Knowledge and Data Engineering – Special Issue on Mining Biological Data, 17(8):1138–1151, August 2005.
[10] Valmir C. Barbosa. An Introduction to Distributed Algorithms. The MIT Press, 1996.
[11] Mayank Bawa, Aristides Gionis, Hector Garcia-Molina, and Rajeev Motwani. The Price of Validity in Dynamic Networks. In Proceedings of SIGMOD'04, pages 515–526, Paris, France, June 2004.
[12] Mikhail Bessonov, Udo Heuser, Igor Nekrestyanov, and Ahmed Patel. Open Architecture for Distributed Search Systems. Lecture Notes in Computer Science, 1597:55–69, 1999.
[13] K. Bhaduri and H. Kargupta. An Efficient Local Algorithm for Distributed Multivariate Regression in Peer-to-Peer Networks. Accepted for publication at SDM'08, 2008.
[14] Y. Birk, L. Liss, A. Schuster, and R. Wolff. A Local Algorithm for Ad Hoc Majority Voting via Charge Fusion. In Proceedings of ICDCS'04, 2004.
[15] Yitzhak Birk, Idit Keidar, Liran Liss, Assaf Schuster, and Ran Wolff. Veracity Radius – Capturing the Locality of Distributed Computations. In Proceedings of PODC'06, pages 102–111, 2006.
[16] BitTorrent homepage. http://www.bittorrent.com.


[17] G. Bontempi and Y.-A. Borgne. An Adaptive Modular Approach to the Mining of Sensor Network Data. In Proceedings of the 1st International Workshop on Data Mining in Sensor Networks, pages 3–9, 2005.
[18] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah. Gossip Algorithms: Design, Analysis and Applications. In Proceedings of Infocom'05, pages 1653–1664, Miami, March 2005.
[19] J. Branch, B. Szymanski, C. Gionnella, R. Wolff, and H. Kargupta. In-Network Outlier Detection in Wireless Sensor Networks. In Proceedings of ICDCS'06, Lisbon, Portugal, July 2006.
[20] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, Belmont, 1984.
[21] Leo Breiman. Bagging Predictors. Machine Learning, 24(2):123–140, 1996.
[22] Leo Breiman. Random Forests. Machine Learning, 45(1):5–32, 2001.
[23] BRITE. http://www.cs.bu.edu/brite/.
[24] Lawrence Bull, Terence C. Fogarty, and M. Snaith. Evolution in Multi-agent Systems: Evolving Communicating Classifier Systems for Gait in a Quadrupedal Robot. In Proceedings of ICGA'95, pages 382–388, San Francisco, CA, USA, 1995.
[25] J. Callan. Distributed Information Retrieval. In W.B. Croft, editor, Advances in Information Retrieval, chapter 5, pages 127–150. Kluwer Academic Publishers, 2000.
[26] Mario Cannataro, Antonio Congiusta, Andrea Pugliese, Domenico Talia, and Paolo Trunfio. Distributed Data Mining on Grids: Services, Tools, and Applications. IEEE Transactions on Systems, Man, and Cybernetics – Part B: Cybernetics, 34(6):2451–2465, 2004.


[27] Mario Cannataro and Domenico Talia. The Knowledge Grid. Communications of the ACM, 46(1):89–93, 2003.
[28] Pei Cao and Zhe Wang. Efficient top-K Query Calculation in Distributed Networks. In Proceedings of PODC'04, pages 206–215, St. John's, Newfoundland, Canada, 2004.
[29] D. Caragea, A. Silvescu, and V. Honavar. A Framework for Learning from Distributed Data Using Sufficient Statistics and Its Application to Learning Decision Trees. International Journal of Hybrid Intelligent Systems, 1(1-2):80–89, 2004.
[30] Philip K. Chan and Salvatore J. Stolfo. Experiments on Multistrategy Learning by Meta-Learning. In Proceedings of CIKM'93, pages 314–323, 1993.
[31] Phillip K. Chan and Salvatore J. Stolfo. Toward Parallel and Distributed Learning by Meta-Learning. In Working Notes of the AAAI Workshop on Knowledge Discovery in Databases, pages 227–240, 1993.
[32] J. Chen, D. DeWitt, F. Tian, and Y. Wang. NiagaraCQ: a Scalable Continuous Query System for Internet Databases. In Proceedings of SIGMOD'00, pages 379–390, Dallas, Texas, 2000.
[33] R. Chen, K. Sivakumar, and H. Kargupta. Collective Mining of Bayesian Networks from Distributed Heterogeneous Data. Knowledge and Information Systems, 6(2):164–187, 2004.
[34] W. Chen, J. C. Hou, and L. Sha. Dynamic Clustering for Acoustic Target Tracking in Wireless Sensor Networks. IEEE Transactions on Mobile Computing, 3(3):258–271, 2004.
[35] Chinook Bioinformatic homepage. http://smweb.bcgsc.bc.ca/chinook/.


[36] Francisco Matias Cuenca-Acuna and Thu D. Nguyen. Text-Based Content Search and Retrieval in Ad-hoc P2P Communities. In Revised Papers from the NETWORKING 2002 Workshops on Web Engineering and Peer-to-Peer Computing, pages 220–234, London, UK, 2002.
[37] Kamalika Das, Kanishka Bhaduri, Kun Liu, and Hillol Kargupta. Distributed Identification of Top-l Inner Product Elements and its Application in a Peer-to-Peer Network. IEEE Transactions on Knowledge and Data Engineering (TKDE), 20(4):475–488, 2008.
[38] The DataGrid Project. http://eu-datagrid.web.cern.ch/eu-datagrid/default.htm.
[39] S. Datta, K. Bhaduri, C. Giannella, R. Wolff, and H. Kargupta. Distributed Data Mining in Peer-to-Peer Networks. IEEE Internet Computing Special Issue on Distributed Data Mining, 10(4):18–26, 2006.
[40] S. Datta, C. Giannella, and H. Kargupta. K-Means Clustering over Large, Dynamic Networks. In Proceedings of SDM'06, pages 153–164, Maryland, 2006.
[41] Souptik Datta and Hillol Kargupta. Uniform Data Sampling from a Peer-to-Peer Network. In Proceedings of ICDCS'07, page 50, Toronto, Ontario, 2007.
[42] I. Davidson and A. Ravi. Distributed Pre-Processing of Data on Networks of Berkeley Motes Using Non-Parametric EM. In Proceedings of the 1st International Workshop on Data Mining in Sensor Networks, pages 17–27, 2005.
[43] Distributed Data Mining Bibliography homepage at UMBC. http://www.cs.umbc.edu/~hillol/DDMBIB/.
[44] DDMT. http://www.umbc.edu/ddm/wiki/software/DDMT/.


[45] K. S. Decker. Distributed Problem Solving Techniques: A Survey. IEEE Transactions on Systems, Man and Cybernetics, 17(5):729–740, 1987.
[46] Inderjit S. Dhillon and Dharmendra S. Modha. A Data-Clustering Algorithm on Distributed Memory Multiprocessors. In Revised Papers from Large-Scale Parallel Data Mining, Workshop on Large-Scale Parallel KDD Systems, SIGKDD, pages 245–260, London, UK, 2000.
[47] Distribustream Peercasting Service homepage. http://distribustream.org/.
[48] Pedro Domingos and Geoff Hulten. Mining High-Speed Data Streams. In Proceedings of KDD'00, pages 71–80, Boston, MA, 2000.
[49] E. H. Durfee. Blissful Ignorance: Knowing Just Enough to Coordinate Well. In Proceedings of ICMAS'95, pages 406–413, 1996.
[50] E. H. Durfee, V. R. Lesser, and D. D. Corkill. Coherent Cooperation Among Communicating Problem Solvers. IEEE Transactions on Computers, 36(11):1275–1291, 1987.
[51] E. H. Durfee, V. R. Lesser, and D. D. Corkill. Trends in Cooperative Distributed Problem Solving. IEEE Transactions on Knowledge and Data Engineering, 1(1):63–83, 1989.
[52] E. H. Durfee and T. A. Montgomery. Coordination as Distributed Search in a Hierarchical Behavior Space. IEEE Transactions on Systems, Man and Cybernetics, 21(6):1363–1378, 1991.
[53] Martin Eisenhardt, Wolfgang Müller, and Andreas Henrich. Classifying Documents by Distributed P2P Clustering. In GI Jahrestagung, pages 286–291, 2003.


[54] Ronald Fagin, Amnon Lotem, and Moni Naor. Optimal Aggregation Algorithms for Middleware. Journal of Computer and System Sciences, 66:614–656, 2003.
[55] Finite element method: Wiki. http://en.wikipedia.org/wiki/Finite_element_method.
[56] T. Finin, R. Fritzson, D. McKay, and R. McEntire. KQML as an Agent Communication Language. In Proceedings of CIKM'94, pages 456–463, Gaithersburg, MD, USA, 1994. ACM Press.
[57] L.R. Ford and D.R. Fulkerson. Flows in Networks. Princeton University Press, 1962.
[58] George Forman and Bin Zhang. Distributed Data Clustering Can Be Efficient and Exact. SIGKDD Explorations, 2(2):34–38, 2000.
[59] Ian Foster and Carl Kesselman. The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, 2004.
[60] W. Frawley, G. Piatetsky-Shapiro, and C. Matheus. Knowledge Discovery in Databases: An Overview. AI Magazine, 1992.
[61] A. Fred and A. Jain. Data Clustering Using Evidence Accumulation. In Proceedings of ICPR'02, page 40276, Washington, DC, USA, 2002.
[62] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive Logistic Regression: A Statistical View of Boosting. The Annals of Statistics, 38(2):337–374, 2000.
[63] J. Garcia-Luna-Aceves and S. Murthy. A Path-Finding Algorithm for Loop-Free Routing. IEEE Transactions on Networking, 5(1):148–160, 1997.
[64] S. Ghiasi, A. Srivastava, X. Yang, and M. Sarrafzadeh. Optimal Energy Aware Clustering in Sensor Networks. Sensors, 2:258–269, 2002.


[65] Sukumar Ghosh. Distributed Systems: An Algorithmic Approach. CRC Press, 2006.
[66] C. Giannella, K. Liu, T. Olsen, and H. Kargupta. Communication Efficient Construction of Decision Trees Over Heterogeneously Distributed Data. In Proceedings of ICDM'04, pages 67–74, Brighton, UK, 2004.
[67] P. Gibbons and S. Tirthapura. Estimating Simple Functions on the Union of Data Streams. In Proceedings of SPAA'01, pages 281–291, Crete, Greece, 2001.
[68] Gnutella homepage. http://www.gnutella.com.
[69] Claudia V. Goldman and Jeffrey S. Rosenschein. Emergent Coordination through the Use of Cooperative State-Changing Rules. In Proceedings of AAAI'94, pages 408–413, 1994.
[70] Michael B. Greenwald and Sanjeev Khanna. Power-Conserving Computation of Order-Statistics over Sensor Networks. In Proceedings of PODS'04, pages 275–285, Paris, France, June 2004.
[71] What is Grid? http://www.eu-degree.eu/DEGREE/General%20questions/copy_of_what-is-grid.
[72] C. Guestrin, P. Bodik, R. Thibaux, M. A. Paskin, and S. Madden. Distributed Regression: an Efficient Framework for Modeling Sensor Network Data. In Proceedings of IPSN'04, 2004.
[73] David J. Hand, Heikki Mannila, and Padhraic Smyth. Principles of Data Mining. MIT Press, Cambridge, MA, 2001.
[74] Ragib Hasan, Zahid Anwar, William Yurcik, Larry Brumbaugh, and Roy Campbell. A Survey of Peer-to-Peer Storage Techniques for Distributed File Systems. In Proceedings of ITCC'05, pages 205–213, Washington, DC, USA, 2005. IEEE Computer Society.
[75] Thomas Haynes and Sandip Sen. Evolving Behavioral Strategies in Predators and Prey. In Workshop on Adaptation and Learning in Multiagent Systems, part of IJCAI'95, pages 32–37, Montreal, Quebec, Canada, 20–25 1995.
[76] Wendi Heinzelman, Anantha Chandrakasan, and Hari Balakrishnan. Energy-Efficient Communication Protocol for Wireless Microsensor Networks. In Proceedings of HICSS'00, page 10, Hawaii, 2000.
[77] Wendi Heinzelman, Anantha Chandrakasan, and Hari Balakrishnan. An Application-Specific Protocol Architecture for Wireless Microsensor Networks. IEEE Transactions on Wireless Communications, 1(4):660–670, 2002.
[78] D. E. Hershberger and H. Kargupta. Distributed Multivariate Regression Using Wavelet-based Collective Data Mining. Journal of Parallel and Distributed Computing, 61(3):372–400, 2001.
[79] Carl Hewitt. Viewing Control Structures as Patterns of Passing Messages. Artificial Intelligence, 8(3):323–364, 1977.
[80] John H. Holland. Adaptation in Natural and Artificial Systems. University of Michigan Press, 1975.
[81] Wolfgang Hoschek, Francisco Javier Jaén-Martínez, Asad Samar, Heinz Stockinger, and Kurt Stockinger. Data Management in an International Data Grid Project. In Proceedings of the First International Workshop on Grid Computing, pages 77–90, London, UK, 2000.
[82] Qiang Huang, Helen J. Wang, and Nikita Borisov. Privacy-Preserving Friends Troubleshooting Network. In Proceedings of NDSS'05, 2005.


[83] T. Jaakkola. Tutorial on Variational Approximation Methods. In Advanced Mean Field Methods: Theory and Practice, 2000.
[84] J.M. Jaffe and F.H. Moss. A Responsive Routing Algorithm for Computer Networks. IEEE Transactions on Communications, pages 1758–1762, July 1982.
[85] Márk Jelasity, Alberto Montresor, and Ozalp Babaoglu. Gossip-based Aggregation in Large Dynamic Networks. ACM Transactions on Computer Systems (TOCS), 23(3):219–252, August 2005.
[86] Finn V. Jensen. Introduction to Bayesian Networks. Springer, 1997.
[87] Erik L. Johnson and Hillol Kargupta. Collective, Hierarchical Clustering from Distributed, Heterogeneous Data. In Revised Papers from Large-Scale Parallel Data Mining, Workshop on Large-Scale Parallel KDD Systems, SIGKDD, pages 221–244, London, UK, 2000.
[88] Joost homepage. http://www.joost.com/.
[89] Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. An Introduction to Variational Methods for Graphical Models. Machine Learning, 37(2):183–233, November 1999.
[90] JXTA homepage. https://jxta.dev.java.net/.
[91] Vana Kalogeraki, Dimitrios Gunopulos, and D. Zeinalipour-Yazti. A Local Search Mechanism for Peer-to-Peer Networks. In Proceedings of CIKM'02, pages 300–307, McLean, Virginia, USA, 2002.
[92] H. Kargupta, B. Park, D. Hershberger, and E. Johnson. Collective Data Mining: A New Perspective Towards Distributed Data Mining. In Hillol Kargupta and Philip Chan, editors, Advances in Distributed and Parallel Knowledge Discovery. AAAI/MIT Press, 1999.
[93] H. Kargupta and K. Sivakumar. Existential Pleasures of Distributed Data Mining. In Data Mining: Next Generation Challenges and Future Directions. AAAI/MIT Press, 2004.
[94] Hillol Kargupta and Philip Chan, editors. Advances in Distributed and Parallel Knowledge Discovery. MIT Press, 2000.
[95] Hillol Kargupta, Weiyun Huang, Krishnamoorthy Sivakumar, and Erik L. Johnson. Distributed Clustering Using Collective Principal Component Analysis. Knowledge and Information Systems, 3(4):422–448, 2001.
[96] Hillol Kargupta, Byung-Hoon Park, and Haimonti Dutta. Orthogonal Decision Trees. IEEE Transactions on Knowledge and Data Engineering, 18(8):1028–1042, 2006.
[97] Kazaa homepage. http://www.kazaa.com.
[98] Idit Keidar and Assaf Schuster. Want Scalable Computing? Speculate! SIGACT News, 37(3):59–66, 2006.
[99] David Kempe, Alin Dobra, and Johannes Gehrke. Computing Aggregate Information using Gossip. In Proceedings of FOCS'03, Cambridge, 2003.
[100] P. M. Khilar and S. Mahapatra. Heartbeat Based Fault Diagnosis for Mobile Ad-Hoc Network. In Proceedings of IASTED'07, pages 194–199, Phuket, Thailand, 2007.
[101] Iraklis A. Klampanos and Joemon M. Jose. An Architecture for Information Retrieval over Semi-Collaborating Peer-to-Peer Networks. In Proceedings of SAC'04, pages 1078–1083, Nicosia, Cyprus, 2004.


[102] M. Klusch, S. Lodi, and G. L. Moro. Distributed Clustering Based on Sampling Local Density Estimates. In Proceedings of IJCAI'03, pages 485–490, Mexico, August 2003.
[103] J. Kotecha, V. Ramachandran, and A. Sayeed. Distributed Multi-target Classification in Wireless Sensor Networks. IEEE Journal of Selected Areas in Communications (Special Issue on Self-Organizing Distributed Collaborative Sensor Networks), 23(4):703–713, 2005.
[104] Wojtek Kowalczyk, Márk Jelasity, and A. E. Eiben. Towards Data Mining in Large and Fully Distributed Peer-to-Peer Overlay Networks. In Proceedings of BNAIC'03, pages 203–210, University of Nijmegen, 2003.
[105] D. Krivitski, A. Schuster, and R. Wolff. A Local Facility Location Algorithm for Large-Scale Distributed Systems. Journal of Grid Computing, 5(4):361–378, 2007.
[106] Fabian Kuhn, Thomas Moscibroda, and Roger Wattenhofer. What Cannot Be Computed Locally! In Proceedings of PODC'04, pages 300–309, St. John's, Newfoundland, Canada, 2004.
[107] A. Kulakov and D. Davcev. Data Mining in Wireless Sensor Networks Based on Artificial Neural-Networks Algorithms. In Proceedings of the 1st International Workshop on Data Mining in Sensor Networks, pages 10–17, 2005.
[108] S. Kutten and D. Peleg. Fault-Local Distributed Mending. In Proceedings of PODC'95, pages 20–27, Ottawa, Canada, August 1995.
[109] V. R. Lesser and D. D. Corkill. Functionally Accurate Cooperative Distributed Systems. IEEE Transactions on Systems, Man, and Cybernetics, SMC-11(1):81–96, 1981.


[110] N. Li, J. Hou, and L. Sha. Design and Analysis of an MST-Based Topology Control Algorithm. IEEE Transactions on Wireless Communications, 4(3):1195–1205, 2005.
[111] Stephanie Lindsey, Cauligi Raghavendra, and Krishna M. Sivalingam. Data Gathering Algorithms in Sensor Networks Using Energy Metrics. IEEE Transactions on Parallel and Distributed Systems, 13(9):924–935, 2002.
[112] N. Linial. Locality in Distributed Graph Algorithms. SIAM Journal on Computing, 21:193–201, 1992.
[113] LionShare: Secure P2P file sharing application for higher education. http://lionshare.psu.edu/.
[114] Michael L. Littman. Markov Games as a Framework for Multi-Agent Reinforcement Learning. In Proceedings of ICML'94, pages 157–163, New Brunswick, NJ, 1994.
[115] K. Liu, K. Bhaduri, K. Das, P. Nguyen, and H. Kargupta. Client-side Web Mining for Community Formation in Peer-to-Peer Environments. SIGKDD Explorations, 8(2):11–20, 2006.
[116] Ping Luo, Hui Xiong, Kevin Lü, and Zhongzhi Shi. Distributed Classification in Peer-to-Peer Networks. In Proceedings of SIGKDD'07, pages 968–976, 2007.
[117] Qin Lv, Pei Cao, Edith Cohen, Kai Li, and Scott Shenker. Search and Replication in Unstructured Peer-to-Peer Networks. In Proceedings of ICS'02, pages 84–95, 2002.
[118] A. Manjhi, V. Shkapenyuk, K. Dhamdhere, and C. Olston. Finding (Recently) Frequent Items in Distributed Data Streams. In Proceedings of ICDE'05, pages 767–778, Washington, DC, USA, 2005.
[119] Mortada Mehyar, Demetri Spanos, John Pongsajapan, Steven H. Low, and Richard Murray. Distributed Averaging on Peer-to-Peer Networks. In Proceedings of CDC'05, Spain, 2005.
[120] Srujana Merugu and Joydeep Ghosh. A Distributed Learning Framework for Heterogeneous Data Sources. In Proceedings of KDD'05, pages 208–217, 2005.
[121] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equations of State Calculations by Fast Computing Machines. Journal of Chemical Physics, 21:1087–1092, 1953.
[122] Sebastian Michel, Peter Triantafillou, and Gerhard Weikum. KLEE: A Framework for Distributed Top-k Query Algorithms. In Proceedings of VLDB'05, pages 637–648, Trondheim, Norway, 2005.
[123] Ingo Mierswa, Katharina Morik, and Michael Wurst. Collaborative Use of Features in a Distributed System for the Organization of Music Collections. In Shen, Shepherd, Cui, and Liu, editors, Intelligent Music Information Systems: Tools and Methodologies, pages 147–175. Information Science Reference, 2007.
[124] Dejan S. Milojicic, Vana Kalogeraki, Rajan Lukose, Kiran Nagaraja, Jim Pruyne, Bruno Richard, Sami Rollins, and Zhichen Xu. Peer-to-peer computing.
[125] Marvin Minsky. The Society of Mind. New York: Simon and Schuster, 1986.
[126] Stephen B. Montgomery, Tony Fu, Jun Guan, Keven Lin, and Steven J. M. Jones. An Application of Peer-to-Peer Technology to the Discovery, Use and Assessment of Bioinformatics Programs. Nature Methods, page 563, 2005.
[127] Yishay Mor and Jeffrey S. Rosenschein. Time and the Prisoner's Dilemma. In Proceedings of the First International Conference on Multiagent Systems, pages 276–282, Menlo Park, California, June 1995.


[128] Alexandros Moukas, Konstantinos Chandrinos, and Pattie Maes. Trafficopter: A Distributed Collection System for Traffic Information. In Proceedings of the Second International Workshop on Cooperative Information Agents II, Learning, Mobility and Electronic Commerce for Information Discovery on the Internet, pages 33–43, London, UK, 1998.
[129] S. Mukherjee and H. Kargupta. Distributed Probabilistic Inferencing in Sensor Networks using Variational Approximation. Journal of Parallel and Distributed Computing (JPDC), in press, 2007.
[130] Sourav Mukherjee. A Variational Approach Towards Distributed Data Mining. Master's thesis, University of Maryland, Baltimore County, May 2007.
[131] Sourav Mukherjee and Hillol Kargupta. Communication-efficient Multivariate Regression from Heterogeneously Distributed Data using Variational Approximation. In Communication, 2007.
[132] Kevin Murphy. A Variational Approximation for Bayesian Networks with Discrete and Continuous Latent Variables. In Proceedings of UAI'99, pages 457–466, San Francisco, CA, 1999. Morgan Kaufmann.
[133] Moni Naor and Larry Stockmeyer. What Can be Computed Locally? In Proceedings of STOC'93, pages 184–193, 1993.
[134] Napster homepage. http://www.napster.com.
[135] Luis Ernesto Navarro-Serment, John Dolan, and Pradeep Khosla. Optimal Sensor Placement for Cooperative Distributed Vision. In Proceedings of ICRA'04, pages 939–944, April 2004.
[136] Wolfgang Nejdl, Wolf Siberski, Uwe Thaden, and Wolf-Tilo Balke. Top-k Query Evaluation for Schema-Based Peer-to-Peer Networks. In Proceedings of ISWC'04, pages 137–151, Hiroshima, Japan, 2004.
[137] Wolfgang Nejdl, Boris Wolf, Changtao Qu, Stefan Decker, Michael Sintek, Ambjörn Naeve, Mikael Nilsson, Matthias Palmér, and Tore Risch. EDUTELLA: A P2P Networking Infrastructure Based on RDF. In Proceedings of WWW'02, pages 604–615, Honolulu, Hawaii, USA, 2002.
[138] J.A. Nelder and R. Mead. A Simplex Method for Function Minimization. The Computer Journal, 7:308–313, 1965.
[139] John von Neumann. Theory of Self-Reproducing Automata. University of Illinois Press, Champaign, IL, USA, 1966.
[140] H. Penny Nii. Blackboard Systems, Part One: The Blackboard Model of Problem Solving and the Evolution of Blackboard Architectures. AI Magazine, 7(2):38–53, 1986.
[141] Todd Olsen. Distributed Decision Tree Learning From Multiple Heterogeneous Data Sources. Master's thesis, University of Maryland, Baltimore County, Baltimore, Maryland, October 2006.
[142] C. Olston, J. Jiang, and J. Widom. Adaptive Filters for Continuous Queries over Distributed Data Streams. In Proceedings of SIGMOD'03, pages 563–574, San Diego, California, 2003.
[143] Ömer Egecioglu and Hakan Ferhatosmanoglu. Dimensionality Reduction and Similarity Computation by Inner Product Approximations. In Proceedings of CIKM'00, pages 219–226, New York, 2000.
[144] P2P Wikipedia homepage. http://en.wikipedia.org/wiki/Peer-to-peer.


203[145] Noam Palat<strong>in</strong>, Arie Leizarowitz, Assaf Schuster, and Ran Wolff. M<strong>in</strong><strong>in</strong>g <strong>for</strong> MisconfiguredMach<strong>in</strong>es <strong>in</strong> Grid Systems. In Proceed<strong>in</strong>gs of SIGKDD’06, pages 687–692,Philadelphia, PA, USA, 2006.[146] Themistoklis Palpanas, Dimitris Papadopoulos, Vana Kalogeraki, and DimitriosGunopulos. Distributed Deviation Detection <strong>in</strong> Sensor Networks. ACM SIGMODRecord, 32(4):77–82, December 2003.[147] B. Park, R. Ayyagari, and H. Kargupta. A Fourier Analysis-Based Approach toLearn Classifier from Distributed Heterogeneous Data. In Proceed<strong>in</strong>gs of SDM’01,Chicago, IL, April 2001.[148] B. Park, H. Kargupta, E. Johnson, E. Sansever<strong>in</strong>o, D. Hershberger, and L. Silvestre.Distributed, Collaborative Data Analysis from Heterogeneous Sites Us<strong>in</strong>g a ScalableEvolutionary Technique. Applied Intelligence, 16(1):19–42, 2002.[149] David Peleg. Distributed Comput<strong>in</strong>g: a Locality-Sensitive Approach. Society <strong>for</strong>Industrial and Applied Mathematics, Philadelphia, PA, USA, 2000.[150] T. Pham, D. S. Scherber, and H. C. Papadopoulos. Distributed Source LocalizationAlgorithms <strong>for</strong> Acoustic Ad-hoc Sensor Networks. In Proceed<strong>in</strong>gs of Sensor Arrayand Multichannel Signal Process<strong>in</strong>g Workshop, pages 613–617, Barcelona, Spa<strong>in</strong>,2004.[151] PlanetLab homepage. http://www.planet-lab.org/.[152] PodPlaylist: Shar<strong>in</strong>g iTune Playlists. http://www.podplaylist.com/.[153] J.B. Predd, S.R. Kulkarni, and H.V. Poor. Distributed Kernel Regression: An Algorithm<strong>for</strong> Tra<strong>in</strong><strong>in</strong>g Collaboratively. arXiv.cs.LG archive, 2006.[154] J. R. Qu<strong>in</strong>lan. Induction of Decision Trees. Mach<strong>in</strong>e Learn<strong>in</strong>g, 1(1):81–106, 1986.


204[155] J. Ross Qu<strong>in</strong>lan. C4.5: Programs <strong>for</strong> Mach<strong>in</strong>e Learn<strong>in</strong>g. Morgan Kauffman, 1993.[156] Michael Rabbat and Robert Nowak. Distributed Optimization <strong>in</strong> Sensor Networks.In Proceed<strong>in</strong>gs of IPSN’04, pages 20–27, Berkeley, Cali<strong>for</strong>nia, USA, 2004.[157] Predrag Radivojac, Uttara Korad, K. M. Sival<strong>in</strong>gam, and Zoran Obradovic. Learn<strong>in</strong>gfrom Class-Imbalanced Data <strong>in</strong> Wireless Sensor Networks. In Proceed<strong>in</strong>gs ofVTC’03 Fall, pages 3030 – 3034, Orlando, Florida, 2003.[158] Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard Karp, and Scott Schenker.A Scalable Content-Addressable Network. In Proceed<strong>in</strong>gs of SIGCOMM’01, pages161–172, 2001.[159] S. E. Robertson and K. S. Jones. Relevance Weight<strong>in</strong>g of Search Terms. Journal ofthe American Society <strong>for</strong> In<strong>for</strong>mation Science, 27:129–146, 1976.[160] Antony Rowstron and Peter Druschel. Pastry: Scalable, Decentralized ObjectLocation, and Rout<strong>in</strong>g <strong>for</strong> Large-Scale Peer-to-Peer Systems.In Proceed<strong>in</strong>gs ofIFIP/ACM Middleware’01, pages 329–350, 2001.[161] G. Salton, A.Wang, and C. Yang. A Vector Space Model <strong>for</strong> In<strong>for</strong>mation Retrieval.Journal of the American Society <strong>for</strong> In<strong>for</strong>mation Science, 18:613–620, 1975.[162] Andrea Schaerf, Yoav Shoham, and Moshe Tennenholtz. Adaptive Load Balanc<strong>in</strong>g:A Study <strong>in</strong> Multi-Agent Learn<strong>in</strong>g. Journal of Artificial Intelligence Research, 2:475–500, 1995.[163] D. S. Scherber and H. S. Papadopoulos. Distributed Computation of Averages Overad hoc Networks. IEEE Journal on Selected Areas <strong>in</strong> Communications, 23(4):776–787, 2005.


205[164] A. Schuster and R. Wolff. Communication-Efficient Distributed M<strong>in</strong><strong>in</strong>g of AssociationRules. Data M<strong>in</strong><strong>in</strong>g and Knowledge Discovery, 8(2):171–196, 2004.[165] ScienceNet Peer-to-Peer Secarch Eng<strong>in</strong>e homepage.http://liebel.fzk.de/collaborations/sciencenet-search-eng<strong>in</strong>e-based-on-yacy-p2p-technology.[166] Sandip Sen. Multiagent Systems: Milestones and New Horizons. Trends <strong>in</strong> CognitiveScience, 1(9):334–339, 1997.[167] Sandip Sen, Shounak Roychowdhury, and Neeraj Arora. Effects of Local In<strong>for</strong>mationon Group Behavior. In BIBICMAS, pages 315–321, Menlo Park, CA, 1996.[168] I. Sharfman, A. Schuster, and D. Keren. A Geometric Approach to Monitor<strong>in</strong>gThreshold Functions over Distributed Data Streams. In Proceed<strong>in</strong>gs of SIGMOD’06,pages 301–312, Chicago, Ill<strong>in</strong>ois, June 2006.[169] Onn Shehory and Sarit Kraus. Methods <strong>for</strong> Task Allocation via Agent CoalitionFormation. Artificial Intelligence, 101(1-2):165–200, 1998.[170] ShuM<strong>in</strong>g Shi, J<strong>in</strong> Yu, GuangWen Yang, and D<strong>in</strong>gX<strong>in</strong>g Wang. Distributed PageRank<strong>in</strong>g <strong>in</strong> Structured P2P Networks. In Proceed<strong>in</strong>gs of ICPP’03, page 179, LosAlamitos, CA, USA, 2003.[171] Ka Cheung Sia. P2P In<strong>for</strong>mation Retrieval: A Self-Organiz<strong>in</strong>g Paradigm. Technicalreport, 2002. http://www.cse.cuhk.edu.hk/˜kcsia/tp3.pdf.[172] Skype homepage. http://www.skype.com/.[173] R. G. Smith. The Contract Net Protocol: High-level Communication and Control<strong>in</strong> a Distributed Problem Solver. Distributed Artificial Intelligence, pages 357–366,1988.


[174] SopCast homepage. http://www.sopcast.org/.206[175] Ion Stoica, Robert Morris, David Karger, Frans Kaashoek, and Hari Balakrishnan.Chord: A Scalable Peer-To-Peer Lookup Service <strong>for</strong> Internet Applications. In Proceed<strong>in</strong>gsof SIGCOMM’01, pages 149–160, 2001.[176] Salvatore J. Stolfo, Andreas L. Prodromidis, Shelley Tselepis, Wenke Lee, Dave W.Fan, and Philip K. Chan. JAM: Java Agents <strong>for</strong> Meta-Learn<strong>in</strong>g over DistributedDatabases. In Proceed<strong>in</strong>gs of KDD’97, pages 74–81, Newport Beach, CA, 1997.[177] Peter Stone. Layered Learn<strong>in</strong>g <strong>in</strong> Multi-Agent Systems. PhD thesis, Carnegie MellonUniversity, December 1998.[178] Peter Stone and Manuela M. Veloso. Multiagent Systems: A Survey from a Mach<strong>in</strong>eLearn<strong>in</strong>g Perspective. Autonomous Robots, 8(3):345–383, 2000.[179] Alexander Strehl and Joydeep Ghosh. Cluster Ensembles — A Knowledge ReuseFramework <strong>for</strong> Comb<strong>in</strong><strong>in</strong>g Multiple Partitions. Journal of Mach<strong>in</strong>e Learn<strong>in</strong>g Research,3:583–617, 2003.[180] Torsten Suel, Chandan Mathur, Jo wen Wu, Jiangong Zhang, Alex Delis, MehdiKharrazi, Xiaohui Long, and Kulesh Shanmugasundaram. ODISSEA: A Peer-to-Peer Architecture <strong>for</strong> Scalable Web Search and In<strong>for</strong>mation Retrieval. In WebDB’03,pages 67–72, 2003.[181] Katia Sycara, Anandee Pannu, Mike Williamson, Dajun Zeng, and Keith Decker.Distributed Intelligent Agents. IEEE Expert: Intelligent Systems and Their Applications,11(6):36–46, 1996.[182] D. Talia and D. Skillicorn. M<strong>in</strong><strong>in</strong>g Large Data Sets on Grids: Issues and Prospects.Comput<strong>in</strong>g and In<strong>for</strong>matics, 21(4):347–362, 2002.


207[183] D. Talia and P. Trunfio. Toward a Synergy Between P2P and Grids. IEEE InternetComput<strong>in</strong>g, 7(4):94–95, August 2003.[184] M<strong>in</strong>g Tan. Multi-agent Re<strong>in</strong><strong>for</strong>cement Learn<strong>in</strong>g: Independent vs. CooperativeAgents. Read<strong>in</strong>gs <strong>in</strong> agents, pages 487–494, 1998.[185] Pang-N<strong>in</strong>g Tan, Michael Ste<strong>in</strong>bach, and Vip<strong>in</strong> Kumar. Introduction to Data M<strong>in</strong><strong>in</strong>g.Addison-Wesley, 2006.[186] Chunqiang Tang, Zhichen Xu, and Sandhya Dwarkadas. Peer-to-Peer In<strong>for</strong>mationRetrieval us<strong>in</strong>g Self-organiz<strong>in</strong>g Semantic Overlay Networks. In Proceed<strong>in</strong>gs of SIG-COMM’03, pages 175–186, Karlsruhe, Germany, 2003.[187] C. Tempich, A. Löser, and J. Heizmann. Community Based Rank<strong>in</strong>g <strong>in</strong> Peer-to-PeerNetworks, volume 3761, chapter Lecture Notes <strong>in</strong> Computer Science, pages 1261–1278. Spr<strong>in</strong>ger-Verlag GmbH, October 2005.[188] Tranche homepage. http://tranche.proteomecommons.org/.[189] Kagan Tumer and Joydeep Ghosh. Robust comb<strong>in</strong><strong>in</strong>g of disparate classifiers throughorder statistics. Pattern Analysis and Applications, 5:189–200, 2001.[190] Ubistorage homepage. http://osa.<strong>in</strong>ria.fr/wiki/Developments/Model<strong>in</strong>gOfPeer-to-peerFileStorageSystems.[191] Vehicular Lab: UCLA Compter Science Department. http://www.vehicularlab.org/.[192] V. Vapnik. Statistical Learn<strong>in</strong>g Theory. Wiley-Interscience, 1998.[193] Grid Page: Wiki. http://en.wikipedia.org/wiki/Grid_comput<strong>in</strong>g.


208[194] Ian H. Witten and Eibe Frank. Data M<strong>in</strong><strong>in</strong>g: Practical Mach<strong>in</strong>e Learn<strong>in</strong>g Tools andTechniques. Morgan Kaufmann, 2005.[195] R. Wolff, K. Bhaduri, and H. Kargupta. Local L2 Threshold<strong>in</strong>g Based Data M<strong>in</strong><strong>in</strong>g<strong>in</strong> Peer-to-Peer Systems. In Proceed<strong>in</strong>gs of SDM’06, pages 428–439, 2006.[196] R. Wolff and A. Schuster. Association Rule M<strong>in</strong><strong>in</strong>g <strong>in</strong> Peer-to-Peer Systems. IEEETransactions on Systems, Man and Cybernetics - Part B, 34(6):2426 – 2438, December2004.[197] Y. X<strong>in</strong>g, M. G. Madden, J. Duggan, and G. J. Lyons. Distributed Regression <strong>for</strong>Heterogeneous Data Sets. Lecture Notes <strong>in</strong> Computer Science, 2810:544–553, 2003.[198] Beverly Yang and Hector Garcia-Mol<strong>in</strong>a. Improv<strong>in</strong>g Search <strong>in</strong> Peer-to-Peer Networks.In Proceed<strong>in</strong>gs of ICDCS’02, pages 5–14, Wash<strong>in</strong>gton, DC, USA, 2002.IEEE Computer Society.[199] Wai Gen Yee and Ophir Frieder. On Search <strong>in</strong> Peer-to-Peer File Shar<strong>in</strong>g Systems.In Proceed<strong>in</strong>gs of SAC’05, pages 1023–1030, Santa Fe, New Mexico, 2005.[200] O. Younis and S. Fahmy. Heed: A hybrid, Energy-Efficient, Distributed Cluster<strong>in</strong>gApproach <strong>for</strong> ad-hoc Sensor Networks. IEEE Transactions on Mobile Comput<strong>in</strong>g,3(4):258–269, 2004.[201] Mohammed Javeed Zaki. Parallel and Distributed Association M<strong>in</strong><strong>in</strong>g: A Survey.IEEE Concurrency, 7(4):14–25, 1999.[202] Mohammed Javeed Zaki. Parallel and Distributed Data M<strong>in</strong><strong>in</strong>g: An Introduction. InRevised Papers from Large-Scale Parallel Data M<strong>in</strong><strong>in</strong>g, Workshop on Large-ScaleParallel KDD Systems, SIGKDD, pages 1–23, London, UK, 2000.


209[203] Ben Y. Zhao, L<strong>in</strong>g Huang, Jeremy Stribl<strong>in</strong>g, Sean C. Rhea, Anthony D. Joseph,and John D. Kubiatowicz. Tapestry: A Resilient Global-Scale Overlay <strong>for</strong> ServiceDeployment. IEEE Journal On Selected Areas In Communications, 22(1):41–53,2004.[204] Feng Zhao and Leonidas Guibas. Wireless Sensor Networks: An In<strong>for</strong>mation Process<strong>in</strong>gApproach. Morgan Kaufmann, 2004.[205] J. Zhao, R. Gov<strong>in</strong>dan, and D. Estr<strong>in</strong>. Comput<strong>in</strong>g Aggregates <strong>for</strong> Monitor<strong>in</strong>g WirelessSensor Networks. In Proceed<strong>in</strong>gs of the First IEEE International Workshop onSensor Network Protocols and Applications, pages 139–148, 2003.[206] Alice X. Zheng, Jim Lloyd, and Eric Brewer. Failure Diagnosis Us<strong>in</strong>g DecisionTrees. In Proceed<strong>in</strong>gs of ICAC ’04, pages 36–43, Wash<strong>in</strong>gton, DC, USA, 2004.
