Web Spam Detection

Fighting Web Spam
C. Castillo, M. Sydow, J. Piskorski, D. Weiss
09/2007


Carlos Castillo, Yahoo! Research Barcelona, Spain
Marcin Sydow, Polish-Japanese Institute of Information Technology, Poland
Jakub Piskorski, Joint Research Centre of the European Commission, Italy
Dawid Weiss, Poznan University of Technology, Poland


Contents
1 Introduction
2 Reference Corpus & Features
3 New Experimental Results


Part I: Introduction
Marcin Sydow


1 Search Engines
2 Web Spam


The Web today: the largest source of information

Size:
• 22,800,000,000 pages (WorldWideWebSize.com, 28 Aug 2007)
• 11,500,000,000 pages (A. Gulli, 2005)

Content:
• over 100 TB of text, plus multimedia

Web population:
• 300,000,000 (Nielsen/NetRatings, 2007)
• 700,000,000 unique users (comScore World Metrix, March 2006)


Searching for information is among the top Web activities (source: Alexa.com, August 2007)

What are the 3 most popular Web sites today?
• Google.com
• Yahoo.com
• MSN.com
All three are search-focused portals.


Why search engines? To make this ocean of information usable for humans.

Search engines are the main gate to the Web today.

Facts:
• 256,000,000 people used a search engine in December 2006 (Nielsen/NetRatings, 2006)


Some available statistics
• 500,000,000 queries per day globally (after Google, 2005)

For a major global search engine this means:
• 250,000,000 queries daily,
• almost 3,000 queries/sec over, say, an 80 TB textual corpus,
• each query must be served in under 1 second...

... and the competitors are one click away.


Search Engine Architecture
[Diagram: crawler(s) feeding the engine; after: Searching the Web, A. Arasu et al.]


Search Engines: a seemingly simple task
Return Web documents containing specified keywords.

Modules:
• Crawler: follow links and collect documents
• Repository: store the documents; enable updates, access, persistence
• Index: record which word occurs in which document
• Ranking System: which documents best fit the users' needs? which documents are inherently valuable?
• Presentation Module: find a good form of result visualisation
• Service: process queries, find documents, present results


Crawler architecture
[Diagram: page-fetching contexts/threads; handles spider traps and robots.txt; after: Mining the Web, S. Chakrabarti, Morgan Kaufmann, 2003]


Ranking
An average query: thousands of returned documents.
Average human capability: a few inspected results.

How to select these few out of thousands for the beginning of the result list? This is the search engines' primary issue.

The Ranking System plays a central role in search quality.

Ranking systems existed in classic IR before, but they needed substantial adaptation to the needs of the WWW (the search-engine revolution of AD 1998).


Ranking System
Influences the search quality (= mission-critical); kept secret.
1 Assign a score to each document.
2 Sort documents in non-increasing order of score.

Factors used for computing the ranking:
• text analysis (document content, URL, meta tags, etc.)
• anchor-text analysis
• link analysis
• query-log analysis
• traffic analysis
• user-history analysis (personalisation)


Text-based Ranking: the classic IR approach
A bag-of-words representation of text (document, query):
• a vector: keywords as dimensions, some statistics as coordinates
• TF-IDF (term frequency, inverse document frequency) or its variants
• text-based ranking: vector similarity between query and document (plus dimensionality reduction (SVD, etc.), context-building, etc.)
Some drawbacks, but this model worked quite well for controlled textual document collections.
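As a toy sketch of this model (the three-document corpus, the query, and the exact weighting are illustrative assumptions, not any engine's actual ranking):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build TF-IDF vectors: term frequency weighted by inverse document frequency."""
    n = len(docs)
    df = Counter()                      # in how many documents each term appears
    for doc in docs:
        df.update(set(doc.split()))
    idf = {t: math.log(n / df[t]) for t in df}
    vectors = []
    for doc in docs:
        tf = Counter(doc.split())
        vectors.append({t: tf[t] * idf[t] for t in tf})
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = ["cheap pills buy cheap pills now",
        "web spam detection survey",
        "link analysis for web search"]
vectors, idf = tfidf_vectors(docs)

# Rank documents for the query "web spam" by cosine similarity
query_vec = {t: idf.get(t, 0.0) for t in "web spam".split()}
ranked = sorted(range(len(docs)),
                key=lambda i: cosine(query_vec, vectors[i]), reverse=True)
```

The top-ranked document is the one sharing the rarest query terms, which is exactly the behaviour TF-IDF is designed to produce.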


WWW-specific issues concerning text analysis
Classic IR techniques are faced with Web-specific issues:
• low quality mixed with high quality
• extreme diversity (versus homogeneity in classic IR)
• the self-description problem
• noise, errors, etc.
• adversarial aspects: easy to spam


A Remedy: Link Analysis
Links represent a social aspect of Web publishing (to some extent).

A link from document p to document q is a positive judgement:
• the author of p considers q valuable, because q was chosen out of billions of other documents to link to (link nepotism excepted).

A simplistic assumption, but it works in mass: Web users implicitly assess Web documents.

Example: PageRank, a famous link-based ranking algorithm.


Example: PageRank, the Basic Idea of Authority Flow
1 each page has some authority
2 each page distributes its authority equally through its links
3 the authority of a page is the authority flowing into this page


PageRank Equations
• simplified PageRank:

    R(p) = \sum_{i \in IN(p)} R(i) / outDeg(i)    (1)

• introducing a damping factor d and a personalization vector v(p):

    R(p) = (1 - d) \sum_{i \in IN(p)} R(i) / outDeg(i) + d \cdot v(p)    (2)

• simple dangling-links correction:

    R(p) = (1 - d) \sum_{i \in IN(p)} R(i) / outDeg(i) + d \cdot v(p) + (1 - d) \, v(p) \sum_{i \in ZEROS} R(i)    (3)
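A minimal power-iteration sketch of equation (2); the three-page graph, the uniform personalization vector v, and the value of d are illustrative assumptions:

```python
def pagerank(in_links, out_deg, d=0.15, iters=50):
    """Power iteration for R(p) = (1-d) * sum_{i in IN(p)} R(i)/outDeg(i) + d*v(p).
    Following the slide's notation, d weighs the personalization (teleport) term."""
    pages = list(in_links)
    n = len(pages)
    r = {p: 1.0 / n for p in pages}      # initial scores
    v = {p: 1.0 / n for p in pages}      # uniform personalization vector
    for _ in range(iters):
        r = {p: (1 - d) * sum(r[i] / out_deg[i] for i in in_links[p]) + d * v[p]
             for p in pages}
    return r

# Tiny 3-page graph: a -> b, a -> c, b -> c, c -> a (no dangling pages)
in_links = {"a": ["c"], "b": ["a"], "c": ["a", "b"]}
out_deg = {"a": 2, "b": 1, "c": 1}
scores = pagerank(in_links, out_deg)
```

Since every page has out-links, the total score mass is preserved at each step; page c, with two in-links, ends up scoring above page b.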


PageRank summary
PageRank, introduced in Google (1998), is now patented in the USA. Most search engines apply similar algorithms nowadays.

Properties:
1 a pioneering, successful link-based ranking algorithm (see also: HITS)
2 quite immune to spamming
3 gave birth to numerous variants:
• personalized PageRank
• Topic-sensitive PageRank (i.e. a dynamic version)
• TrustRank and Anti-TrustRank (combating search-engine spam)
• extensions of the underlying random-surfer model (e.g. RBS)


1 Search Engines
2 Web Spam


A bit of Web Economics...
What makes search engines survive?

Search-based advertising: 97% of Web search revenues (A. Broder, Foundations of Web Advertising, tutorial, Edinburgh, 2006)

Main types:
• sponsored links (alongside search results)
• contextual ads (placed on Web sites)


Advertising Market Shares (USA, 2006)


Internet Advertising (USA, 2006)
Search-based ads take the major share (40%): $6.76B


The Central Role of Search Engines in the WWW
Web pages are accessed through search engines:
1 search engine ranking → Web page visibility
2 Web page visibility → traffic on the page
3 traffic on the page → income

Thus there is a strong incentive today to rank highly in search engines!


What is Spam?

Definition
Web Spam (Search Engine Spam) is any manipulation of Web documents intended to mislead search engines into granting an undeservedly high ranking, without improving the real information quality of the document (for humans).

Or, the extreme version:

Definition
Web Spam (Search Engine Spam) is anything that Web authors do only because search engines exist.

Web Spam is motivated economically: $16.9B × 40% = $6.76B (in 2006).


Spam is destructive
Spam affects the everyday life of the Web community:
• it undermines the mission and the business of search engines
• it seriously deteriorates the quality of information search on the Web

Combating Web spam is a primary issue, and not only for search engines.


Spam vs SEO
Not all actions taken to improve the Web visibility of pages are regarded as spam:
• white-hat techniques for improving Web page visibility exist (SEO)
• search engines publish their guidelines in their Terms of Service
• there is a gray area in between, however...


Spam taxonomy
Two groups of techniques:
• hiding techniques
• boosting techniques

With regard to the factors used in ranking algorithms:
• content-based techniques
• link-based techniques
• other


Spam techniques
• content-based
  • hidden text (size, color)
  • repetition
  • keyword stuffing/dilution
  • language-model-based (phrase stealing, dumping)
• link-based
  • honey pots
  • anchor-text spam
  • blog/wiki spam
  • link exchange
  • link farms
  • expired domains
• other
  • cloaking
  • redirection


Naïve Web Spam


Hidden text


Made for Advertising


Search engine?


Fake search engine


Normal content in link farms


Cloaking


Redirection


Redirects using JavaScript

Simple redirect:
    document.location="http://www.topsearch10.com/";

Hidden redirect:
    var1=24; var2=var1;
    if(var1==var2) {
      document.location="http://www.topsearch10.com/";
    }


Problem: obfuscated code

Obfuscated redirect:
    var a1="win",a2="dow",a3="loca",a4="tion.",a5="replace",
        a6="('http://www.top10search.com/')";
    var i,str="";
    for(i=1;i< ... [code truncated in source]


Problem: really obfuscated code

Encoded JavaScript:
    var s = "%5CBE0D%5C%05GDHJ_BDE%16...%04%0E";
    var e = , i;
    eval(unescape('s%eDunescape%28s%29%3Bfor...%3B'));

More examples: [Chellapilla and Maykov, 2007]


Fighting Spam
On the search engines' side:
• Education (what is regarded as spam and what is not)
• Spam detection
  • text-based (contents, URLs, meta tags)
  • link-based (TrustRank, Anti-TrustRank, etc.)
  • language-model-based (the Language Model Disagreement method, etc.)
  • maintaining up-to-date blacklists
  • recently: ML-based
• Maintaining spam-reporting interfaces
• Punishment (exclusion from the index)

For researchers: very interesting applications of Data Mining and Information Retrieval.


ML can help greatly
The struggle gets harder:
• there are many factors used to compute a search engine's ranking
• there is an arms race:
  1 spammers apply a new deceptive technique
  2 the search engine improves its ranking system
  3 spammers apply a new deceptive technique
  4 the search engine improves its ranking system
  ...

Machine Learning has recently been applied to support search engines in combating Web spam.


Part II: Reference Corpus & State of the Art
Carlos Castillo


Collection / Links / Content / Both / SIGIR'07

Tools for dealing with Web Spam


Motivation
Fetterly et al. [Fetterly et al., 2004] hypothesized that studying the distribution of statistics about pages could be a good way of detecting spam pages:

  "in a number of these distributions, outlier values are associated with web spam"


Challenges: Machine Learning
• instances are not really independent (graph)
• learning with few examples
• scalability


Challenges: Information Retrieval
• feature extraction: which features?
• feature aggregation: page/host/domain
• feature propagation (graph)
• recall/precision tradeoffs
• scalability


3 A Reference Collection
4 Link-based features
5 Content-based features
6 Using Links and Contents
7 SIGIR'07: Exploiting Topology


Data is really important
• it is dangerous for a search engine to provide labelled data for this task
• even if one did, the labels would never reflect a consensus


Assembling process
• crawling of base data
• elaboration of the guidelines and the classification interface
• labeling
• post-processing


Crawling of base data
U.K. collection: 77.9 M pages downloaded from the .UK domain in May 2006 (LAW, University of Milan)
• large seed of about 150,000 .uk hosts
• 11,400 hosts
• 8 levels of depth, with ... [truncated in source]


Classification interface


Labeling process
• we asked 20+ volunteers to classify entire hosts
• they were asked to classify hosts as normal / borderline / spam
• do they agree? Mostly...


Agreement


Results

Labels:
  Label             Frequency   Percentage
  normal            4,046       61.75%
  borderline        709         10.82%
  spam              1,447       22.08%
  cannot classify   350         5.34%

Agreement:
  Category     Kappa   Interpretation
  normal       0.62    Substantial agreement
  spam         0.63    Substantial agreement
  borderline   0.11    Slight agreement
  global       0.56    Moderate agreement


Result: the first public Web Spam collection
• public spam collection
• labels for 6,552 hosts
• 2,725 hosts classified by at least 2 humans
• 3,106 automatically considered normal (.ac.uk, .sch.uk, .gov.uk, .mod.uk, .nhs.uk or .police.uk)
• http://www.yr-bcn.es/webspam/
• Web Spam challenge
  • Track I: Information Retrieval + Machine Learning
  • Track II: Machine Learning
  • http://webspam.lip6.fr/
• AIRWeb 2007 Workshop
• GraphLab 2007 Workshop


3 A Reference Collection
4 Link-based features
5 Content-based features
6 Using Links and Contents
7 SIGIR'07: Exploiting Topology


Link farms
Single-level farms can be detected by searching for groups of nodes sharing their out-links [Gibson et al., 2005]


Semi-streaming model
Handling large graphs:
• memory size: enough to hold some data per node
• disk size: enough to hold some data per edge
• a small number of sequential passes over the data


Link-Based Features
• degree-related measures
• PageRank
• TrustRank [Gyöngyi et al., 2004]
• Truncated PageRank [Becchetti et al., 2006]
• estimation of supporters [Becchetti et al., 2006]


[Figure: histograms of in-degree (δ = 0.35) and of degree/degree of neighbors (δ = 0.31), normal vs. spam hosts]


TrustRank Idea


[Figure: histograms of the TrustRank score / PageRank ratio, normal vs. spam hosts]
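The TrustRank idea can be sketched as a personalized PageRank whose teleportation vector puts all its mass on a hand-picked trusted seed set, so trust flows out from the seeds and decays with distance. In this sketch d is the link-following probability, and the graph and seed set are invented for illustration:

```python
def trustrank(out_links, seeds, d=0.85, iters=50):
    """Personalized PageRank: teleport only to trusted seed pages."""
    pages = list(out_links)
    v = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in pages}
    t = dict(v)
    for _ in range(iters):
        nxt = {p: (1 - d) * v[p] for p in pages}   # teleport back to the seeds
        for p in pages:
            if out_links[p]:
                share = d * t[p] / len(out_links[p])
                for q in out_links[p]:
                    nxt[q] += share                # trust flows along out-links
            else:
                for s in seeds:                    # dangling page: mass to seeds
                    nxt[s] += d * t[p] / len(seeds)
        t = nxt
    return t

# "good" is a trusted seed linking to "shop"; "spam" gets no links from them
out_links = {"good": ["shop"], "shop": ["good"], "spam": ["spam"]}
t = trustrank(out_links, seeds={"good"})
```

A page unreachable from the trusted region ends up with zero trust, which is the intuition behind using a low TrustRank/PageRank ratio as a spam signal.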


Hop-plot and PageRank
[Figure: number of nodes vs. distance, for hosts in the top 0%-10%, 40%-50% and 60%-70% of PageRank]

Areas below the curves are equal if we are in the same strongly-connected component.


Neighbors: spam


Neighbors: normal


Bottleneck number
b_d(x) = min_{j<=d} |N_j(x)| / |N_{j-1}(x)|: the minimum rate of growth of the neighborhood of x up to distance d. We expect spam pages to form clusters that are somewhat isolated from the rest of the Web graph, and therefore to have smaller bottleneck numbers than non-spam pages.
Figure: Histogram of the bottleneck number for normal vs. spam hosts.
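A small sketch of how b_d(x) can be computed by breadth-first search (our own toy code; here the neighborhood N_j(x) is taken over out-links of a small dictionary graph, whereas the paper's supporter-based features are computed over incoming paths):

```python
from collections import deque

def bottleneck(graph, x, d):
    """b_d(x) = min over 1 <= j <= d of |N_j(x)| / |N_{j-1}(x)|,
    where N_j(x) is the set of nodes within distance j of x."""
    dist = {x: 0}
    queue = deque([x])
    while queue:
        u = queue.popleft()
        if dist[u] == d:
            continue  # do not explore past distance d
        for v in graph.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    # |N_j(x)| for j = 0..d
    sizes = [sum(1 for n in dist.values() if n <= j) for j in range(d + 1)]
    return min(sizes[j] / sizes[j - 1] for j in range(1, d + 1))

# A well-connected node grows fast; a node in an isolated cluster does not.
g = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
print(bottleneck(g, 0, 2))  # 7/3, since the neighborhood sizes are 1, 3, 7
```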




Probabilistic counting
Figure: Each page holds a bit vector; bit vectors are propagated along links using the OR operation, and counting the bits set at the target page gives an estimate of its number of supporters.
[Becchetti et al., 2006] shows an improvement of the ANF algorithm [Palmer et al., 2002], which is based on probabilistic counting [Flajolet and Martin, 1985].
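A memory-resident toy version of the idea (our own sketch, not the streaming implementation of [Becchetti et al., 2006]): every node draws a Flajolet–Martin bit vector, vectors are OR-ed along links for d rounds, and the position of the lowest unset bit at the target yields the supporter estimate.

```python
import random

def fm_vector(rng, nbits=32):
    """Bit i is set with probability 2^-(i+1), as in Flajolet-Martin counting."""
    return sum(1 << i for i in range(nbits) if rng.random() < 2.0 ** -(i + 1))

def estimate_supporters(graph, target, d, trials=64, seed=0):
    """Estimate the number of pages with a path of length <= d to `target`
    by OR-propagating random bit vectors along the links for d rounds."""
    rng = random.Random(seed)
    nodes = set(graph) | {v for vs in graph.values() for v in vs}
    avg_r = 0.0
    for _ in range(trials):
        bits = {u: fm_vector(rng) for u in nodes}
        for _ in range(d):
            nxt = dict(bits)
            for u, vs in graph.items():
                for v in vs:
                    nxt[v] |= bits[u]  # supporters' bits flow toward the target
            bits = nxt
        r = 0
        while (bits[target] >> r) & 1:
            r += 1  # position of the lowest unset bit
        avg_r += r / trials
    return 2 ** avg_r / 0.77351  # standard Flajolet-Martin correction

# 15 hosts all linking to host 0: roughly 16 supporters within distance 1
star = {i: [0] for i in range(1, 16)}
```

The estimate is only accurate up to a constant factor per trial; averaging over many trials (or many bit positions, as in ANF) tightens it.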




Content-Based Features
Most of the features reported in [Ntoulas et al., 2006]:
• Number of words in the page and in the title
• Average word length
• Fraction of anchor text
• Fraction of visible text
• Compression rate
• Corpus precision and corpus recall
• Query precision and query recall
• Independent trigram likelihood
• Entropy of trigrams
More about this in the last part of the talk


Content-based features (entropy-related)
Let T = {(w_1, p_1), ..., (w_k, p_k)} be the set of trigrams in a page, where trigram w_i has frequency p_i.
Features:
• Entropy of trigrams: H = − Σ_{w_i ∈ T} p_i log p_i
• Also, compression rate, as measured by bzip
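Both features can be sketched in a few lines (our toy code; the exact tokenization used in the paper differs):

```python
import bz2
import math
from collections import Counter

def trigram_entropy(text):
    """H = -sum_i p_i * log(p_i) over the word trigrams of the page."""
    words = text.lower().split()
    counts = Counter(zip(words, words[1:], words[2:]))
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def compression_rate(text):
    """Compressed size / original size; repetitive (spam-like) text compresses well."""
    raw = text.encode("utf-8")
    return len(bz2.compress(raw)) / len(raw)

keyword_stuffed = "buy cheap pills " * 200
natural = "the quick brown fox jumps over the lazy dog near the river bank today"
assert trigram_entropy(keyword_stuffed) < trigram_entropy(natural)
assert compression_rate(keyword_stuffed) < compression_rate(natural)
```

Low trigram entropy and a low compression rate both signal machine-generated, repetitive content.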


Content-based features (related to popular keywords)
F = set of most frequent terms in the collection
Q = set of most frequent terms in a query log
P = set of terms in a page
Features:
• Corpus precision: |P ∩ F| / |P|
• Corpus recall: |P ∩ F| / |F|
• Query precision: |P ∩ Q| / |P|
• Query recall: |P ∩ Q| / |Q|
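These four ratios are plain set operations; a sketch with invented tiny F and Q sets:

```python
def keyword_features(page_terms, F, Q):
    """Corpus/query precision and recall as set-overlap ratios."""
    P = set(page_terms)
    return {
        "corpus_precision": len(P & F) / len(P),
        "corpus_recall":    len(P & F) / len(F),
        "query_precision":  len(P & Q) / len(P),
        "query_recall":     len(P & Q) / len(Q),
    }

F = {"news", "home", "free", "music", "online"}   # frequent in the collection
Q = {"free", "download", "ringtones", "cheap"}    # frequent in a query log
feats = keyword_features(["free", "cheap", "ringtones", "download"], F, Q)
print(feats["query_precision"])  # 1.0: every term on the page is a popular query term
```

A page that covers many popular query terms (high query precision and recall) while using few ordinary corpus terms is a typical keyword-stuffing signal.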


Average word length
Figure: Histogram of the average word length in non-spam vs. spam pages for k = 500.


Corpus precision
Figure: Histogram of the corpus precision in non-spam vs. spam pages.


Query precision
Figure: Histogram of the query precision in non-spam vs. spam pages for k = 500.




Cost-sensitive decision tree with bagging
Bagging of 10 decision trees, asymmetric costs.

Cost ratio            1      10     20     30     50
True positive rate    65.8%  66.7%  71.1%  78.7%  84.1%
False positive rate   2.8%   3.4%   4.5%   5.7%   8.6%
F-Measure             0.712  0.703  0.704  0.723  0.692


Link- and content-based features

                      Both   Link-only  Content-only
True positive rate    78.7%  79.4%      64.9%
False positive rate   5.7%   9.0%       3.7%
F-Measure             0.723  0.659      0.683






General hypothesis
Pages topologically close to each other are more likely to have the same label (spam/nonspam) than random pairs of pages.
Pages linked together are more likely to be on the same topic than random pairs of pages [Davison, 2000].


Topological dependencies: in-links
Histogram of the fraction of spam hosts in the in-links:
• 0 = no in-link comes from spam hosts
• 1 = all of the in-links come from spam hosts
Figure: Histograms for in-links of non-spam vs. spam hosts.


Topological dependencies: out-links
Histogram of the fraction of spam hosts in the out-links:
• 0 = none of the out-links points to spam hosts
• 1 = all of the out-links point to spam hosts
Figure: Histograms for out-links of non-spam vs. spam hosts.


Idea 1: Clustering
Classify, then cluster hosts, then assign the same label to all hosts in the same cluster by majority voting.


Idea 1: Clustering (cont.)
Initial prediction:


Idea 1: Clustering (cont.)
Clustering:


Idea 1: Clustering (cont.)
Final prediction:


Idea 1: Clustering - Results
✓ Reduces error rate

                      Baseline  Clustering
Without bagging
True positive rate    75.6%     74.5%
False positive rate   8.5%      6.8%
F-Measure             0.646     0.673
With bagging
True positive rate    78.7%     76.9%
False positive rate   5.7%      5.0%
F-Measure             0.723     0.728


Idea 2: Propagate the label
Classify, then interpret spamicity as a probability, then do a random walk with restart from those nodes.
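A minimal version of such a propagation (our own sketch: personalized-PageRank-style forward propagation with the classifier's spamicity as the restart distribution; the function name and damping value are assumptions):

```python
def propagate_spamicity(graph, p, alpha=0.85, iters=50):
    """Random walk with restart: the restart distribution is proportional to
    the classifier's spamicity p(x); the walk follows out-links ('forward')."""
    total = sum(p.values())
    restart = {x: p[x] / total for x in p}
    score = dict(restart)
    for _ in range(iters):
        nxt = {x: (1 - alpha) * restart[x] for x in p}
        for u, vs in graph.items():
            if vs:
                share = alpha * score[u] / len(vs)
                for v in vs:
                    nxt[v] += share  # mass flows along out-links
        score = nxt
    return score

# Host "s" is pointed to by a host predicted spammy ("b"): its score rises.
graph = {"a": ["s"], "b": ["s"], "s": []}
scores = propagate_spamicity(graph, {"a": 0.1, "b": 0.9, "s": 0.5})
```

The backward variant works the same way on the transposed graph; a threshold on the final scores gives the new spam/nonspam labels.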


Idea 2: Propagate the label (cont.)
Initial prediction:


Idea 2: Propagate the label (cont.)
Propagation:


Idea 2: Propagate the label (cont.)
Final prediction, applying a threshold:


Idea 2: Propagate the label - Results

                      Baseline  Fwds.  Backwds.  Both
Classifier without bagging
True positive rate    75.6%     70.9%  69.4%     71.4%
False positive rate   8.5%      6.1%   5.8%      5.8%
F-Measure             0.646     0.665  0.664     0.676
Classifier with bagging
True positive rate    78.7%     76.5%  75.0%     75.2%
False positive rate   5.7%      5.4%   4.3%      4.7%
F-Measure             0.723     0.716  0.733     0.724


Idea 3: Stacked graphical learning
• Meta-learning scheme [Cohen and Kou, 2006]
• Derive initial predictions
• Generate an additional attribute for each object by combining predictions on neighbors in the graph
• Append the additional attribute to the data and retrain


Idea 3: Stacked graphical learning (cont.)
• Let p(x) ∈ [0, 1] be the prediction of a classification algorithm for a host x using k features
• Let N(x) be the set of pages related to x (in some way)
• Compute f(x) = (Σ_{g ∈ N(x)} p(g)) / |N(x)|
• Add f(x) as an extra feature for instance x and learn a new model with k + 1 features
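The new attribute is just an average over graph neighbors; a sketch with hypothetical hosts and first-pass predictions:

```python
def stacked_feature(preds, neighbors):
    """f(x) = sum of p(g) for g in N(x), divided by |N(x)|."""
    return {x: sum(preds[g] for g in ns) / len(ns) if ns else 0.0
            for x, ns in neighbors.items()}

preds = {"a": 0.9, "b": 0.8, "c": 0.1, "d": 0.2}        # first-pass p(x)
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["d"], "d": ["b", "c"]}
extra = stacked_feature(preds, neighbors)                # feature k+1 for retraining
print(extra["b"])  # 0.5: average of p(a) = 0.9 and p(c) = 0.1
```

Each instance is then re-trained with its original k features plus f(x); the procedure can be repeated with the new predictions, as in the second-pass results below.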


Idea 3: Stacked graphical learning (cont.)
Initial prediction:


Idea 3: Stacked graphical learning (cont.)
Computation of new feature:


Idea 3: Stacked graphical learning (cont.)
New prediction with k + 1 features:


Idea 3: Stacked graphical learning - Results

                      Baseline  Avg. of in  Avg. of out  Avg. of both
True positive rate    78.7%     84.4%       78.3%        85.2%
False positive rate   5.7%      6.7%        4.8%         6.1%
F-Measure             0.723     0.733       0.742        0.750
✓ Increases detection rate


Idea 3: Stacked graphical learning x2
And repeat...

                      Baseline  First pass  Second pass
True positive rate    78.7%     85.2%       88.4%
False positive rate   5.7%      6.1%        6.3%
F-Measure             0.723     0.750       0.763
✓ Significant improvement over the baseline


Part III
New Experimental Results
Jakub Piskorski, Marcin Sydow, Dawid Weiss


8 New results
Linguistic features
IDEA 1: simple addition of linguistic features
IDEA 2: pruning incomplete data
IDEA 3: selecting good pure hosts
Summary


Why linguistic features?
• Linguistic and language features such as language diversity, complexity, expressivity, immediacy, uncertainty and emotional consistency turned out to have discriminatory potential for deception detection [Zhou et al., 2004].
• In previous research, linguistic features have not been extensively exploited for web spam detection.
• Goal: explore the prevalence of spam relative to linguistic features in the WEB-SPAM-2006UK corpus.


How to measure?
• Complexity: average number of sentences, clauses, noun phrases.
• Diversity: lexical diversity, content word diversity.
• Expressivity: preference for specific part-of-speech categories over others.
• Non-immediacy: self-reference, passive voice, generalizing terms.


Linguistic features
• Length = total number of tokens (word-like units)
• Lexical diversity = number of different tokens / total number of tokens
• Lexical validity = number of tokens which constitute valid word forms / total number of potential word forms
• Text-like fraction = total number of potential word forms / total number of tokens
• Emotiveness = number of adjectives and adverbs / number of nouns and verbs
• Self-referencing = number of 1st-person pronouns / total number of pronouns
• Passive voice = number of verb phrases in passive voice / total number of verb phrases
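A few of the purely token-based features can be sketched directly (our toy code with a hard-coded pronoun list; POS-dependent features such as emotiveness or passive voice need a tagger like LingPipe and are omitted here):

```python
def linguistic_features(tokens):
    """Token-level subset of the features above (toy word lists)."""
    first_person = {"i", "me", "my", "mine", "we", "us", "our", "ours"}
    pronouns = first_person | {"you", "your", "he", "she", "it", "they", "them"}
    toks = [t.lower() for t in tokens]
    pron = [t for t in toks if t in pronouns]
    return {
        "length": len(toks),
        "lexical_diversity": len(set(toks)) / len(toks),
        "self_referencing": (sum(1 for t in pron if t in first_person) / len(pron)
                             if pron else 0.0),
    }

feats = linguistic_features("We offer cheap pills cheap pills for you".split())
print(feats)  # length 8, diversity 6/8 = 0.75, self-referencing 1/2 = 0.5
```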


Computing linguistic features
• Computed only for the summary of the WEB-SPAM-2006UK corpus (< 400 pages per host), 64 GB.
• Utilized Corleone (Core Linguistic Entity Extraction), developed at JRC, and LingPipe (www.alias-i.com/lingpipe).
• 14.36% of pages had no textual content.


IDEA 1
Just add the linguistic features to the attribute set.


Idea 1: Just linguistic features

                      with ling. features   without
instances             8 411                 8 411
attributes            287                   280
classified correctly  7 666 (91.14%)        7 687 (91.39%)
misclassified         745 (8.85%)           724 (8.60%)

• The results are not much different.


Idea 1: Just linguistic features
Figures in red are better.

With linguistic features:
Class      TP     FP     Precision  Recall  F-Measure
normal     0.970  0.435  0.946      0.970   0.958
undecided  0.091  0.010  0.162      0.091   0.116
spam       0.525  0.033  0.615      0.525   0.566

Without linguistic features:
Class      TP     FP     Precision  Recall  F-Measure
normal     0.970  0.415  0.949      0.970   0.959
undecided  0.108  0.010  0.186      0.108   0.137
spam       0.552  0.033  0.629      0.552   0.588


IDEA 2
Prune the input by removing records with missing values. Rerun the experiments with and without linguistic attributes.


Idea 2: prune records with missing values

                      with ling. features   without
instances             6 644                 6 644
attributes            287                   280
classified correctly  6 016 (90.54%)        6 009 (90.44%)
misclassified         628 (9.45%)           635 (9.55%)

• Not much improvement (the difference is so small it is most likely statistically insignificant).


Idea 2: prune records with missing values

With linguistic features:
Class      TP     FP     Precision  Recall  F-Measure
normal     0.958  0.343  0.954      0.958   0.956
undecided  0.112  0.019  0.119      0.112   0.115
spam       0.608  0.039  0.622      0.608   0.615

Without linguistic features:
Class      TP     FP     Precision  Recall  F-Measure
normal     0.958  0.348  0.954      0.958   0.956
undecided  0.105  0.019  0.113      0.105   0.109
spam       0.601  0.039  0.616      0.601   0.608


IDEA 3
Choose only pure hosts (for which the class decision was univocal). Rerun the experiments with and without linguistic attributes.


Pure hosts: explanation
The notion of a spam host is quite vague; inter-judge classification agreement is not perfect.
Selecting representative spam / not-spam records by filtering univocally-classified examples:
• 1 049 NNN hosts,
• 391 SS hosts,
• 57 BB hosts,
• (no SSS or BBB examples in the original data).
The above gives a total of 1 497 pure hosts used as input.


Idea 3: pure hosts

                      with ling. features   without
instances             1 497                 1 497
attributes            287                   280
classified correctly  1 328 (88.71%)        1 330 (88.84%)
misclassified         169 (11.28%)          167 (11.15%)


Idea 3: pure hosts

With linguistic features:
Class      TP     FP     Precision  Recall  F-Measure
normal     0.949  0.107  0.954      0.949   0.952
undecided  0.193  0.042  0.155      0.193   0.172
spam       0.821  0.055  0.840      0.821   0.831

Without linguistic features:
Class      TP     FP     Precision  Recall  F-Measure
normal     0.950  0.103  0.956      0.950   0.953
undecided  0.175  0.041  0.145      0.175   0.159
spam       0.826  0.056  0.839      0.826   0.832


Idea 3: pure hosts, incomplete records removed

                      with ling. features   without
instances             1 211                 1 211
attributes            287                   280
classified correctly  1 099 (90.75%)        1 095 (90.42%)
misclassified         112 (9.24%)           116 (9.57%)

• Further reduction of noisy examples results in a quality improvement.
• The improvement gained from linguistic features is small, but clear.


Idea 3: pure hosts, incomplete records removed

With linguistic features:
Class      TP     FP     Precision  Recall  F-Measure
normal     0.970  0.089  0.961      0.970   0.966
undecided  0.306  0.031  0.294      0.306   0.300
spam       0.834  0.048  0.861      0.834   0.848

Without linguistic features:
Class      TP     FP     Precision  Recall  F-Measure
normal     0.969  0.098  0.958      0.969   0.963
undecided  0.245  0.032  0.245      0.245   0.245
spam       0.834  0.048  0.861      0.834   0.848


Figure: Number of instances in each setting, with and without linguistic features (ALL = all records, PRUNED = without missing values, PURE = only pure hosts).


Figure: F-measure of the spam class in each setting, with and without linguistic features (KYN = reference set, ALL = all records, PRUNED = without missing values, PURE = only pure hosts).




Distribution of linguistic features in the WEB-SPAM-2006UK corpus
• Explore the distribution of each linguistic feature.
• Explore the fraction of spam within each range.


Figure: Distribution of the number of tokens per page. In this and the following plots, the left axis is the percentage of pages in each bin and the right axis is the percentage of spam in that bin.

Figure: Distribution of lexical diversity.

Figure: Distribution of lexical diversity-[100-].

Figure: Distribution of lexical validity.

Figure: Distribution of lexical validity-[100-].

Figure: Distribution of text-like fraction.

Figure: Distribution of text-like fraction-[100-].

Figure: Distribution of emotiveness.

Figure: Distribution of expressivity-[100-].

Figure: Distribution of self referencing.

Figure: Distribution of passive voice.

Figure: Distribution of passive voice-[100-].


Conclusions
Preliminary experimental results seem to indicate that:
• the linguistic features introduced in [Zhou et al., 2004] slightly improve classification accuracy,
• pruning inconsistently labeled examples improves classification accuracy.
Further research:
• including other types of linguistic features (e.g. sentiment analysis),
• more systematic evaluation methods.


Acknowledgements
Ricardo Baeza-Yates (Y, S), Luca Becchetti (R), Paolo Boldi (M), Debora Donato (Y), Aristides Gionis (Y), Stefano Leonardi (R), Vanessa Murdock (Y), Massimo Santini (M), Fabrizio Silvestri (P), Sebastiano Vigna (M), Leszek Krupiński (J)
Y: Yahoo! Research Barcelona, Catalunya, Spain
R: Università di Roma "La Sapienza", Rome, Italy
S: Yahoo! Research, Santiago, Chile
P: ISTI-CNR, Pisa, Italy
M: Università degli Studi di Milano, Milan, Italy
J: Polish-Japanese Institute of Information Technology, Poland


Thank you for your attention!


Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. (2006). Using rank propagation and probabilistic counting for link-based spam detection. In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD), Pennsylvania, USA. ACM Press.
Chellapilla, K. and Maykov, A. (2007). A taxonomy of JavaScript redirection spam. In AIRWeb '07: Proceedings of the 3rd International Workshop on Adversarial Information Retrieval on the Web, pages 81–88, New York, NY, USA. ACM Press.
Cohen, W. W. and Kou, Z. (2006). Stacked graphical learning: approximating learning in Markov random fields using very short inhomogeneous Markov chains. Technical report.
Davison, B. D. (2000). Topical locality in the Web. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 272–279, Athens, Greece. ACM Press.
Fetterly, D., Manasse, M., and Najork, M. (2004). Spam, damn spam, and statistics: using statistical analysis to locate spam web pages. In Proceedings of the Seventh Workshop on the Web and Databases (WebDB), pages 1–6, Paris, France.
Flajolet, P. and Martin, N. G. (1985). Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182–209.
Gibson, D., Kumar, R., and Tomkins, A. (2005). Discovering large dense subgraphs in massive graphs. In VLDB '05: Proceedings of the 31st International Conference on Very Large Data Bases, pages 721–732. VLDB Endowment.


Gyöngyi, Z., Garcia-Molina, H., and Pedersen, J. (2004). Combating Web spam with TrustRank. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB), pages 576–587, Toronto, Canada. Morgan Kaufmann.
Ntoulas, A., Najork, M., Manasse, M., and Fetterly, D. (2006). Detecting spam web pages through content analysis. In Proceedings of the World Wide Web Conference, pages 83–92, Edinburgh, Scotland.
Palmer, C. R., Gibbons, P. B., and Faloutsos, C. (2002). ANF: a fast and scalable tool for data mining in massive graphs. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 81–90, New York, NY, USA. ACM Press.
Zhou, A., Burgoon, J., Nunamaker, J., and Twitchell, D. (2004). Automating linguistics-based cues for detecting deception in text-based asynchronous computer-mediated communication. Group Decision and Negotiation, 12:81–106.
