Filtering Multi-Lingual Terrorist Content with Graph-Theoretic ...


Filtering of Multi-Lingual Terrorist Content
Mark Last (BGU), October 3, 2007

Outline
• Introduction
  – Internet as a Terrorist Weapon
• Selected Examples of Multi-Lingual Terrorist Content
  – Challenges in Filtering Terrorist Content
• Web Document Representation and Categorization
  – The Vector-Space Approach
  – The Graph-Based Approach
  – The Hybrid Approach
• Case Studies
• Conclusions and Future Work


Important Preliminaries
• The terrorist organizations mentioned in this presentation are included in the list of U.S.-Designated Foreign Terrorist Organizations, which is updated periodically by the U.S. Department of State, Office of Counterterrorism.
  – The latest list can be downloaded from http://www.infoplease.com/ipa/A0908746.html
• Affiliations of specific web sites with terrorist organizations are available from several sources, such as:
  – SITE Institute http://www.siteinstitute.org/
  – Internet Haganah http://www.haganah.org.il/
  – The Intelligence and Terrorism Information Center http://www.terrorisminfo.org.il
• A definition of “terrorism” is beyond the scope of this talk


Internet as a Terrorist Weapon
Law enforcement officials in Europe report that the number of jihadi Web sites went from a dozen on Sept. 10, 2001, to close to 5,000 today (ABC News, March 10, 2006)


Propaganda in Arabic
Organization: Palestinian Islamic Jihad


Propaganda in Russian
Organization: Hamas


Propaganda in English and Hebrew
Organization: Hezbollah


Tactical Orders (?)
• Madrid – March 2004
  – “[The Islamist cell] took its inspiration from a Web site that called on local Islamists to stage attacks in Spain before the 2004 general elections to prompt withdrawal of troops from Iraq”, [the court spokeswoman] said. (The New York Times, April 11, 2006)
• London – July 2005
  – A message posted on May 29 on an Islamist Internet site: "We ask all waiting mujahedeen, wherever they are, to carry out the planned attack" (The New York Times, July 13, 2005)
  – “The July 7 bombings in London were a low-budget operation carried out by four men who had no connection to Al Qaeda and who obtained all the information they needed from the Internet” (The New York Times, April 11, 2006)


Challenges in Filtering Terrorist Content
• Finding relevant content in multiple languages
  – Terrorist web sites frequently switch their URLs
  – There is more online information about terrorists than information created and posted by terrorists
  – What makes terrorist content different from a regular news report or commentary?
• Terrorist group identification
  – The true web site affiliation is often concealed
    • How can we tell that the “Palestinian Information Center” is associated with Hamas?
• Topic identification
  – Propaganda, fundraising, bomb-making, etc.
• Real-time understanding of multi-lingual content
  – On Sept. 10, 2001, the NSA intercepted two Arabic-language messages, "Tomorrow is zero hour" and "The match is about to begin." The sentences weren't translated until Sept. 12, 2001 (Michael Erard, MIT Technology Review, March 2004)


Web Document Representation and Categorization


Text Categorization (TC): Basic Definition
• TC – the task of assigning a Boolean {T, F} value to each pair $\langle d_j, c_i \rangle \in D \times C$, where
  – $D = (d_1, \ldots, d_{|D|})$ is a collection of documents
  – $C = (c_1, \ldots, c_{|C|})$ is a set of pre-defined categories
  – Sample categories: “terrorist”, “non-terrorist”, “bomb-making”, etc.


Text Categorization (TC) Tasks
• Binary TC – two non-overlapping categories only
  – Example: “terrorist” vs. “non-terrorist”
• Multi-Class TC – more than two non-overlapping categories
  – Example: “PIJ” or “Hamas” or “Al-Aqsa Brigades”
  – A multi-class problem can be reduced to multiple binary tasks (one-against-the-rest strategy)
• Multi-Label TC – overlapping categories are allowed
  – Example: a “Hamas” document on “bomb-making”
  – A multi-label task can be split into a set of binary classification tasks
• Ranking categorization
  – Category ranking: which categories match a given document best?
  – Document ranking: which documents match a given category best?


The Vector-Space Model (Salton et al., 1975)
• A text document is considered a “bag of words (terms / features)”
  – Document $d_j = (w_{1j}, \ldots, w_{|T|j})$, where $T = (t_1, \ldots, t_{|T|})$ is the set of terms (features) that occur at least once in at least one document (the vocabulary)
    • Term: n-gram, single word, noun phrase, keyphrase, etc.
    • Term weights: binary, frequency-based, etc.
• Meaningless (“stop”) words are removed
• Stemming operations may be applied
  – Leaders => Leader
  – Expiring => expire
• The ordering and position of words, as well as document logical structure and layout, are completely ignored
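The vector-space representation described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation behind the talk: the stop-word list, the tokenizer, and the function names (`tokenize`, `build_vocabulary`, `to_vector`) are assumptions made for the example, and no stemming is applied.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list used only for this illustration.
STOP_WORDS = {"the", "a", "of", "in", "and", "to", "is"}

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def build_vocabulary(docs):
    """Every term that occurs at least once in at least one document, sorted."""
    return sorted({t for d in docs for t in tokenize(d)})

def to_vector(doc, vocabulary, binary=False):
    """Map a document to term weights: raw frequencies, or 0/1 if binary=True."""
    counts = Counter(tokenize(doc))
    return [int(counts[t] > 0) if binary else counts[t] for t in vocabulary]

docs = ["The leader of the movement", "The movement and the rally in the camp"]
vocab = build_vocabulary(docs)
vectors = [to_vector(d, vocab) for d in docs]
```

Word order and position are lost by construction: any permutation of a document's words yields the same vector.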


The “Bag of Words” Approach: A Practical Example
Text 1
From palestine-info.co.uk, Dec 10, 2005
Earlier, Khaled Mishaal, the Movement's top political leader, said in a rally in the Palestinian refugee camp of Yarmouk in the Syrian capital, Damascus, Friday that there was no more room for further calm in the light of the Israeli daily hostilities against the Palestinian people.
Text 2
By ASSOCIATED PRESS, Dec. 10, 2005
Hamas will not renew its truce with Israel when it expires at the end of the year, the political leader of the Palestinian terrorist group, Khaled Mashaal, told a rally Friday.


The “Bag of Words” Approach: A Practical Example
Text 1
From palestine-info.co.uk, Dec 10, 2005
Earlier, Khaled Mishaal, the Movement's top political leader, said in a rally in the Palestinian refugee camp of Yarmouk in the Syrian capital, Damascus, Friday that there was no more room for further calm in the light of the Israeli daily hostilities against the Palestinian people.
Bag of words for Text 1:
Friday further hostilities Israel Khaled leader light Mishaal Movement Palestinian people political rally refugee room Syrian top Yarmouk
Text 2
By ASSOCIATED PRESS, Dec. 10, 2005
Hamas will not renew its truce with Israel when it expires at the end of the year, the political leader of the Palestinian terrorist group, Khaled Mashaal, told a rally Friday.
Bag of words for Text 2:
Expires Friday group Hamas Israel Khaled leader Mashaal Palestinian political rally renew terrorist truce year


The “Bag of Words” Approach: A Practical Example
Bag of Words 1 (Terrorist)
Friday further hostilities Israel Khaled leader light Mishaal Movement Palestinian people political rally refugee room Syrian top Yarmouk
Bag of Words 2 (Non-Terrorist)
Expires Friday group Hamas Israel Khaled leader Mashaal Palestinian political rally renew terrorist truce year
8 words in common!
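The overlap between the two bags can be checked with a set intersection. The sketch below uses exact, case-sensitive string matching, which is an assumption of this example; under it, seven terms are shared. The slide's count of eight presumably also pairs the spelling variants “Mishaal” and “Mashaal”, but that is a guess and not something the slide states.

```python
bag1 = set(
    "Friday further hostilities Israel Khaled leader light Mishaal Movement "
    "Palestinian people political rally refugee room Syrian top Yarmouk".split()
)
bag2 = set(
    "Expires Friday group Hamas Israel Khaled leader Mashaal Palestinian "
    "political rally renew terrorist truce year".split()
)

# Terms that appear, with identical spelling, in both bags.
common = sorted(bag1 & bag2)
print(len(common), common)
```

Either way, two documents from opposite categories share a large part of their vocabulary, which is the point the example makes.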






Advantages of the Vector-Space Model (based on Joachims, 2002)
• A simple and straightforward representation for English and other languages where words have a clear delimiter
• Most weighting schemes require a single scan of each document
• A fixed-size vector representation makes unstructured text accessible to most classification algorithms (from decision trees to SVMs)
• Consistently good results in the information retrieval domain (mainly on English corpora)


Limitations of the Vector-Space Model
• Text documents
  – Ignoring the word position in the document
  – Ignoring the ordering of words in the document
• Web documents
  – Ignoring the information contained in HTML tags (e.g., document sections)
• Multilingual documents
  – Word separation may be tricky in some languages (e.g., Latin, German, Chinese, etc.)
  – No comprehensive evaluation on large non-English corpora


DIVIDE ET IMPERA (“Divide and Rule”)
Word Separation in Ancient Latin
[Figure: inscriptions from the Arch of Titus, Rome (1st century AD) and a dedication to Julius Caesar (1st century BC). Callout: words are separated by triangles.]


Alternative Representation of Multilingual Web Documents: The Graph-Based Model (introduced in Schenker et al., 2005)


Relevant Definitions (based on Bunke and Kandel, 2000)
• A (labeled) graph G is a 4-tuple $G = (V, E, \alpha, \beta)$, where V is a set of nodes (vertices), $E \subseteq V \times V$ is a set of edges connecting the nodes, $\alpha$ is a function labeling the nodes, and $\beta$ is a function labeling the edges.
• Node and edge IDs are omitted for brevity
• Graph size: $|G| = |V| + |E|$
[Figure: example graph with node labels A, B, C and edge labels x, y.]


The Graph-Based Model of Web Documents
• Basic ideas:
  – One node for each unique term
  – If word B follows word A, there is an edge from A to B
    • In the presence of terminating punctuation marks (periods, question marks, and exclamation points), no edge is created between two words
  – Stop words are removed
  – Graph size is limited by including only the most frequent terms
  – Stemming
    • Alternate forms of the same term (singular/plural, past/present/future tense, etc.) are conflated to the most frequently occurring form
  – Several variations for node and edge labeling (see the next slides)
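The construction rules above can be sketched as follows. This is an illustrative simplification, not the authors' code: the stop-word list and the function name `build_simple_graph` are hypothetical, stemming is omitted, and the order of operations (stop-word removal first, then adjacency, then dropping edges that touch infrequent terms) is an assumption.

```python
import re
from collections import Counter

# Hypothetical stop-word list for the illustration.
STOP_WORDS = {"the", "a", "of", "in", "and", "to", "from"}

def build_simple_graph(text, max_nodes=10):
    """Return (nodes, edges) for unlabeled, directed word-adjacency edges."""
    # Terminating punctuation splits the text, so no edge crosses it.
    sentences = re.split(r"[.?!]", text.lower())
    token_lists = [
        [w for w in re.findall(r"[a-z]+", s) if w not in STOP_WORDS]
        for s in sentences
    ]
    # Limit graph size to the most frequent terms.
    freq = Counter(w for toks in token_lists for w in toks)
    nodes = {w for w, _ in freq.most_common(max_nodes)}
    # Edge A -> B whenever B immediately follows A and both terms are kept.
    edges = {
        (a, b)
        for toks in token_lists
        for a, b in zip(toks, toks[1:])
        if a in nodes and b in nodes
    }
    return nodes, edges

nodes, edges = build_simple_graph("Reuters news service reports. More news from Reuters.")
```

In this toy input, no edge links “reports” to “more” because a period separates them.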


The Standard Representation
• Edges are labeled according to the document section in which one word follows the other
  – Title (TI) contains the text related to the document’s title and any provided keywords (meta-data);
  – Link (L) is the “anchor text” that appears in clickable hyperlinks in the document;
  – Text (TX) comprises any of the visible text in the document (this includes anchor text but not title and keyword text)
[Figure: example graph with nodes YAHOO, NEWS, MORE, SERVICE, REPORTS, REUTERS and edges labeled TI, L, or TX.]


The Simple Representation
• The graph is based only on the visible text on the page (title and meta-data are ignored)
• Edges are not labeled
[Figure: example graph with nodes NEWS, MORE, SERVICE, REPORTS, REUTERS and unlabeled edges.]


The n-distance Representation
• Based on the visible text only
• Instead of considering only terms immediately following a given term in a web document, we look up to n terms ahead and connect the succeeding terms with an edge that is labeled with the distance between them (unless the words are separated by certain punctuation marks)
• n is a user-provided parameter
[Figure: example graph for n = 3 with nodes SERVICE, NEWS, REPORTS, MORE, REUTERS and edges labeled with distances 1, 2, or 3.]
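A minimal sketch of the n-distance edge construction, assuming the text has already been split into token lists at punctuation marks and cleaned of stop words. The function name `n_distance_edges` and the choice to store edges as (source, target, distance) triples are assumptions of this example.

```python
def n_distance_edges(sentences, n):
    """Connect each term to the terms up to n positions ahead of it.

    sentences: list of token lists (one list per punctuation-delimited run).
    Returns a set of (source, target, distance) triples.
    """
    edges = set()
    for toks in sentences:
        for i, source in enumerate(toks):
            for d in range(1, n + 1):
                if i + d < len(toks):
                    edges.add((source, toks[i + d], d))
    return edges

edges3 = n_distance_edges([["reuters", "news", "service", "reports"]], n=3)
# Dropping the distance label gives unlabeled look-ahead edges.
unlabeled = {(a, b) for a, b, _ in edges3}
```

With n = 1 this reduces to plain word adjacency.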


The n-simple Representation
• Based on the visible text only
• We look up to n terms ahead and connect the succeeding terms with an unlabeled edge
• n is a user-provided parameter
[Figure: example graphs for n = 2 and n = 3 with nodes NEWS, MORE, SERVICE, REUTERS, REPORTS.]


The Absolute Frequency Representation
• No section-related information
• Each node and edge is labeled with an absolute frequency measure
[Figure: example graph with nodes SERVICE, NEWS, MORE, REUTERS, REPORTS; nodes and edges carry absolute frequency labels (1 or 2).]


The Relative Frequency Representation
• No section-related information
• Each node and edge is labeled with a relative frequency measure
• A normalized value in [0,1] is assigned by dividing each node frequency value by the maximum node frequency value that occurs in the graph
• A similar procedure is performed for the edges
[Figure: example graph with nodes SERVICE, NEWS, REPORTS, MORE, REUTERS; nodes and edges carry relative frequency labels (0.5 or 1.0).]
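The normalization step can be illustrated with a short sketch. It assumes pre-tokenized input (one token list per punctuation-delimited run) and adjacency edges only; the function name `relative_frequencies` is hypothetical.

```python
from collections import Counter

def relative_frequencies(sentences):
    """Label nodes and adjacency edges with frequencies scaled into [0, 1]."""
    node_freq = Counter(w for toks in sentences for w in toks)
    edge_freq = Counter(p for toks in sentences for p in zip(toks, toks[1:]))
    # Divide by the largest frequency in the graph (default avoids empty input errors).
    max_node = max(node_freq.values(), default=1)
    max_edge = max(edge_freq.values(), default=1)
    nodes = {w: c / max_node for w, c in node_freq.items()}
    edges = {p: c / max_edge for p, c in edge_freq.items()}
    return nodes, edges

rel_nodes, rel_edges = relative_frequencies(
    [["news", "service", "reports"], ["more", "news", "service"]]
)
```

Skipping the division step leaves the absolute counts, which correspond to the absolute frequency labeling.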


Graph Based Document Representation – Detailed Example
Source: www.cnn.com, May 24, 2005


Graph Based Document Representation - Parsing
[Figure: the example web page with its title, link, and text sections marked.]


Graph Based Document Representation - Preprocessing
Title
CNN.com International
Text
A car bomb has exploded outside a popular Baghdad restaurant, killing three Iraqis and wounding more than 110 others, police officials said. Earlier an aide to the office of Iraqis Prime Minister Ibrahim al-Jaafari and his driver were killing in a driver shooting.
Links
Iraqis bomb: Four dead, 110 wounding. FULL STORY.


Standard Graph Based Document Representation
Ten most frequent terms are used

Word            Frequency
Iraqis          3
Killing         2
Bomb            2
Wounding        2
Driver          2
Exploded        1
Baghdad         1
International   1
CNN             1
Car             1

[Figure: standard graph with nodes CAR, DRIVER, BOMB, KILLING, IRAQIS, EXPLODED, INTERNATIONAL, BAGHDAD, WOUNDING, CNN; edges are labeled TX (Text), L (Link), or TI (Title).]


Simple Graph Based Document Representation
Ten most frequent terms are used

Word            Frequency
Iraqis          3
Killing         2
Bomb            2
Wounding        2
Driver          2
Exploded        1
Baghdad         1
International   1
CNN             1
Car             1

[Figure: simple graph with nodes CAR, DRIVER, KILLING, BOMB, IRAQIS, EXPLODED, BAGHDAD, WOUNDING, CNN and unlabeled edges.]


“Lazy” Categorization with Graph-Based Models
• The Basic k-Nearest Neighbors Algorithm
  – Input: a set of labeled training documents, a query document d, and a parameter k defining the number of nearest neighbors to use
  – Output: a label indicating the category of the query document d
  – Step 1. Find the k nearest training documents to d according to a distance measure
  – Step 2. Select the category of d to be the category held by the majority of the k nearest training documents
• k-Nearest Neighbors with Graphs (Schenker et al., 2005)
  – Represent the documents as graphs (done)
  – Use a graph-theoretical distance measure
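The two steps of the basic algorithm can be sketched with a pluggable distance function. The names `knn_classify` and `jaccard_distance` are hypothetical, and the Jaccard distance on term sets is only a stand-in so that the example runs on its own; a graph-theoretical distance would be passed in the same way.

```python
from collections import Counter

def knn_classify(query, training, distance, k=3):
    """training: list of (document, label) pairs; distance: callable(doc, doc) -> float."""
    # Step 1: find the k training documents nearest to the query.
    neighbors = sorted(training, key=lambda item: distance(query, item[0]))[:k]
    # Step 2: return the label held by the majority of those neighbors.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

def jaccard_distance(a, b):
    """Stand-in distance on term sets (an assumption of this sketch)."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

training = [({"a", "b"}, "x"), ({"a", "c"}, "x"), ({"d", "e"}, "y")]
label = knn_classify({"a", "b", "c"}, training, jaccard_distance, k=3)
```

The method is “lazy” because no model is built in advance: every query is compared against the stored training documents at classification time.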


Distance between Two Graphs
• Required properties
  – (1) Boundary condition: $d(G_1, G_2) \ge 0$
  – (2) Identical graphs have zero distance: $d(G_1, G_2) = 0 \Leftrightarrow G_1 \cong G_2$
  – (3) Symmetry: $d(G_1, G_2) = d(G_2, G_1)$
  – (4) Triangle inequality: $d(G_1, G_3) \le d(G_1, G_2) + d(G_2, G_3)$


Relevant Definitions (based on Bunke and Kandel, PRL, 2000)
• A graph $G_1 = (V_1, E_1, \alpha_1, \beta_1)$ is a sub-graph of a graph $G_2 = (V_2, E_2, \alpha_2, \beta_2)$, denoted $G_1 \subseteq G_2$, if $V_1 \subseteq V_2$, $E_1 \subseteq E_2 \cap (V_1 \times V_1)$, $\alpha_1(x) = \alpha_2(x) \ \forall x \in V_1$, and $\beta_1(x, y) = \beta_2(x, y) \ \forall (x, y) \in E_1$
• Conversely, the graph $G_2$ is also called a supergraph of $G_1$
[Figure: example graphs $G_1$ (nodes A, B; edge label x) and $G_2$ (nodes A, B, C; edge labels x, y).]


More Graph-Theoretic Definitions
• A graph $G_1 = (V_1, E_1, \alpha_1, \beta_1)$ and a graph $G_2 = (V_2, E_2, \alpha_2, \beta_2)$ are said to be isomorphic, denoted $G_1 \cong G_2$, if there exists a bijective function $f: V_1 \to V_2$ such that $\alpha_1(x) = \alpha_2(f(x)) \ \forall x \in V_1$ and $\beta_1(x, y) = \beta_2(f(x), f(y)) \ \forall (x, y) \in V_1 \times V_1$.
[Figure: two isomorphic graphs $G_1$ and $G_2$, each with nodes A, B, C, D and edge labels w, x, y, z.]


More Graph-Theoretic Definitions
• Subgraph isomorphism – a graph is isomorphic to a part (subgraph) of another graph
• Graph isomorphism is not known to be NP-complete
• Subgraph isomorphism is NP-complete
[Figure: graphs $G_1$ and $G_2$ illustrating subgraph isomorphism.]


More Graph-Theoretic Definitions
• Let $G$, $G_1$, and $G_2$ be graphs. The graph $G$ is a common subgraph of $G_1$ and $G_2$ if there exist subgraph isomorphisms from $G$ to $G_1$ and from $G$ to $G_2$
[Figure: graphs $G_1$, $G$, and $G_2$, where $G$ is a common subgraph of $G_1$ and $G_2$.]


More Graph-Theoretic Definitions (cont.)
• The graph $G$ is a maximum common subgraph (mcs) if $G$ is a common subgraph of $G_1$ and $G_2$ and there exists no other common subgraph $G'$ of $G_1$ and $G_2$ such that $|G'| > |G|$
[Figure: graphs $G_1$, $G$, and $G_2$, where $G$ is the maximum common subgraph with $|G| = |V| + |E| = 2 + 1 = 3$.]


More Graph-Theoretic Definitions (cont.)
• Let $G$, $G_1$, and $G_2$ be graphs. The graph $G$ is a common supergraph of $G_1$ and $G_2$ if there exist subgraph isomorphisms from $G_1$ to $G$ and from $G_2$ to $G$
[Figure: graphs $G_1$, $G$, and $G_2$, where $G$ is a common supergraph of $G_1$ and $G_2$.]


More Graph-Theoretic Definitions (cont.)
• The graph $G$ is a minimum common supergraph (MCS) if $G$ is a common supergraph of $G_1$ and $G_2$ and there exists no other common supergraph $G'$ of $G_1$ and $G_2$ such that $|G'| < |G|$
[Figure: graphs $G_1$, $G$, and $G_2$, where $G$ is the minimum common supergraph with $|G| = |V| + |E| = 4 + 2 = 6$.]


MMCSN Distance Measure between Two Graphs
• MMCSN measure (Schenker et al., 2005):
  $d_{MMCSN}(G_1, G_2) = 1 - \frac{|mcs(G_1, G_2)|}{|MCS(G_1, G_2)|}$
• $mcs(G_1, G_2)$ - maximum common subgraph
• $MCS(G_1, G_2)$ - minimum common supergraph
• Example: $d_{MMCSN}(G_1, G_2) = 1 - \frac{2 + 1}{4 + 5} = 0.667$
[Figure: graphs $G_1$ and $G_2$ over nodes A, B, C, D, together with $mcs(G_1, G_2)$ and $MCS(G_1, G_2)$.]
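When every node label is unique within a graph, the maximum common subgraph and minimum common supergraph reduce to set intersection and union, so the measure is easy to sketch. The graphs below are hypothetical; they were chosen only so that the sizes match the slide's arithmetic (2 + 1 and 4 + 5) and are not the graphs from the original figure. Edges are treated as unlabeled, which is also an assumption.

```python
def graph_size(nodes, edges):
    """|G| = |V| + |E|"""
    return len(nodes) + len(edges)

def mmcsn_distance(g1, g2):
    """MMCSN distance for graphs given as (node_set, edge_set) with unique node labels."""
    (v1, e1), (v2, e2) = g1, g2
    mcs = graph_size(v1 & v2, e1 & e2)   # maximum common subgraph
    MCS = graph_size(v1 | v2, e1 | e2)   # minimum common supergraph
    return 0.0 if MCS == 0 else 1.0 - mcs / MCS

# Hypothetical graphs: |mcs| = 2 + 1 and |MCS| = 4 + 5.
g1 = ({"A", "B", "C"}, {("A", "B"), ("B", "C"), ("C", "A")})
g2 = ({"A", "B", "D"}, {("A", "B"), ("B", "D"), ("D", "A")})
d = mmcsn_distance(g1, g2)
print(round(d, 3))
```

Identical graphs give a distance of 0, and graphs with no common nodes give a distance of 1.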


k-Nearest Neighbors with Graphs
• Advantages
  – Keeps HTML structure information
  – Retains original order of words
  – More accurate than k-NN with the vector-space model
• Limitation
  – Very low classification speed
    • Up to three times slower than vector classification
• Conclusion
  – Graph models cannot be used for real-time filtering of web documents


The Hybrid Approach to Document Categorization (Markov et al., 2006)
• Basic Idea
  – Represent a document as a vector of sub-graphs
  – Categorize documents with a model-based classifier (e.g., a decision tree), which is much faster than a “lazy” method
• The “Naïve” Approach
  – Select sub-graphs that are most frequent in each category
• The “Smart” Approach
  – Select sub-graphs that are more frequent in a specific category than in other categories
• The Smart Approach with Fixed Threshold
  – Select sub-graphs that are frequent in a specific category and not frequent in other categories
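A rough sketch of the naïve selection and the resulting Boolean vectors follows. To keep it short, the candidate sub-graphs are restricted to single edges, which is a simplification of this example and not the extraction procedure of Markov et al.; the names `frequent_edges` and `to_boolean_vector`, the toy documents, and the threshold value are all hypothetical.

```python
from collections import Counter

def frequent_edges(docs_in_category, t_min):
    """Edges whose share of the category's documents is at least t_min."""
    n = len(docs_in_category)
    if n == 0:
        return set()
    # Each document is a set of edges, so an edge counts once per document.
    counts = Counter(e for doc in docs_in_category for e in doc)
    return {e for e, c in counts.items() if c / n >= t_min}

def to_boolean_vector(doc, features):
    """1 if the document graph contains the sub-graph (here: edge), else 0."""
    return [int(f in doc) for f in features]

terror_docs = [{("a", "b"), ("b", "c")}, {("a", "b")}]
normal_docs = [{("x", "y")}, {("x", "y"), ("a", "b")}]

features = sorted(frequent_edges(terror_docs, 0.6) | frequent_edges(normal_docs, 0.6))
vector = to_boolean_vector(terror_docs[0], features)
```

The resulting fixed-length vectors can be passed to any model-based classifier, such as a decision tree.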


Predictive Model Induction with Hybrid Representation
[Figure: flowchart of the induction process, with the following stages.]
• Input: web or text documents with known categories (the training set)
• Graph construction: documents are represented as graphs
• Subgraph extraction: extraction of sub-graphs relevant for classification
• Text representation: all documents are represented as vectors with Boolean values for every sub-graph in the set
• Feature selection (optional): identification of the best attributes (Boolean features) for classification
• Creation of prediction model: finally, prediction model induction and extraction of document classification rules


Frequent Subgraph Extraction Example
[Figure: subgraphs, a document graph, and subgraph extensions, built from the terms Arab, Bank, West, and Politic.]


Frequent Subgraph Extraction: Complexity
Assumption
• A labeled vertex is unique in each graph
Subgraph isomorphism
• Isomorphism between graph $G_1 = (V_1, E_1, \alpha_1, \beta_1)$ and part of graph $G_2 = (V_2, E_2, \alpha_2, \beta_2)$ can be found by two simple actions:
  1. Determine that $V_1 \subseteq V_2$: $O(|V_1| \cdot |V_2|)$
  2. Determine that $E_1 \subseteq E_2$: $O(|V_1|^2)$
• Total complexity: $O(|V_1| \cdot |V_2| + |V_1|^2) \le O(|V_2|^2)$
Graph isomorphism
• Isomorphism between graphs $G_1 = (V_1, E_1, \alpha_1, \beta_1)$ and $G_2 = (V_2, E_2, \alpha_2, \beta_2)$ can be found by two simple actions:
  1. Determine $G_1 \subseteq G_2$: $O(|V_2|)$
  2. Determine $G_2 \subseteq G_1$: $O(|V_2|)$
• Total complexity: $O(|V_2|)$
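Under the unique-label assumption, both checks become containment tests. The sketch below stores nodes as a set of labels and edges as a dictionary from (source, target) to edge label; this data layout and the function names are assumptions of the example. Python's hash-based sets and dictionaries give expected-linear containment checks, so the running time of this sketch is not a reproduction of the bounds quoted on the slide.

```python
def is_subgraph(g1, g2):
    """True if g1 is a sub-graph of g2, assuming node labels are unique in each graph."""
    (v1, e1), (v2, e2) = g1, g2
    if not v1 <= v2:                      # action 1: V1 is a subset of V2
        return False
    # action 2: every edge of g1 exists in g2 with the same label
    return all(edge in e2 and e2[edge] == label for edge, label in e1.items())

def is_isomorphic(g1, g2):
    """With unique node labels, isomorphism is mutual containment."""
    return is_subgraph(g1, g2) and is_subgraph(g2, g1)

small = ({"A", "B"}, {("A", "B"): "x"})
large = ({"A", "B", "C"}, {("A", "B"): "x", ("B", "C"): "y"})
relabeled = ({"A", "B"}, {("A", "B"): "z"})
```

Without the unique-label assumption, the subgraph test would be the general NP-complete problem.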


Case Study 1
Categorization of Web Documents in Arabic (based on Last et al., 2006)


Document Collection
• 648 Arabic documents
  – 200 documents downloaded from terrorist web sites
  – 448 belong to non-terrorist categories
• Terrorist web sites
  – http://www.qudsway.com (Palestinian Islamic Jihad)
  – http://www.palestine-info.com/ (Hamas)
• Normal (non-terrorist) web sites
  – www.aljazeera.net/News
  – http://arabic.cnn.com
  – http://news.bbc.co.uk/hi/arabic/news
  – http://www.un.org/arabic/news


Preprocessing of Documents in Arabic
• Normalizing orthographic variations
  – E.g., convert the initial Alif Hamza أ to plain Alif ا
• Normalize the feminine ending, the Ta-Marbuta ة, to Ha ه
• Removal of vowel marks
• Removal of certain letters (such as Waw و, Kaf ك, Ba ب, and Fa ف) appearing before the Arabic article THE (Alif + Lam ال)
• Removal of pre-defined stop words in Arabic
• Final vocabulary size: 47,836 words
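The word-level normalization steps can be sketched as below. This is a simplified illustration, not the preprocessing code used in the case study: the function name `normalize_arabic_word`, the order of the steps, and the exact set of marks removed (the range U+064B to U+0652, which also covers shadda and sukun) are assumptions. Stop-word removal is left out. Characters are written as Unicode escapes to avoid right-to-left display problems.

```python
import re

# Vowel marks and related diacritics, U+064B..U+0652 (assumed range).
DIACRITICS = re.compile("[\u064B-\u0652]")
# Waw, Ba, Kaf, or Fa at the start of a word, directly before Alif + Lam.
PREFIX_BEFORE_ARTICLE = re.compile("^[\u0648\u0628\u0643\u0641](?=\u0627\u0644)")

def normalize_arabic_word(word):
    word = DIACRITICS.sub("", word)            # remove vowel marks
    if word.startswith("\u0623"):              # initial Alif Hamza -> plain Alif
        word = "\u0627" + word[1:]
    if word.endswith("\u0629"):                # Ta-Marbuta -> Ha
        word = word[:-1] + "\u0647"
    return PREFIX_BEFORE_ARTICLE.sub("", word)  # drop letter before the article

# Feminine form of "the Zionist": the final Ta-Marbuta becomes Ha.
normalized = normalize_arabic_word("\u0627\u0644\u0635\u0647\u064A\u0648\u0646\u064A\u0629")
```

Normalization of this kind merges spelling variants of the same word into one vocabulary entry before graphs or vectors are built.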


Accuracy Results
[Figure: results for the Naïve approach with the C4.5 classifier. Classification accuracy (axis from 95% to 99%) vs. subgraph frequency threshold $t_{min}$ (0.1 to 0.4), for 30, 40, 50, and 100 node graphs.]
[Figure: results for the Smart approach with the C4.5 classifier. Classification accuracy (axis from 95% to 99%) vs. subgraph classification rate threshold $CR_{min}$ (1 to 2), for 30, 40, 50, and 100 node graphs.]


Resulting Decision Tree
[Figure: decision tree with Yes/No branches and leaves labeled Terror or Non-Terror. The tested terms are: الصهيوني (The Zionist, Adj., Sing. M.), الشهيد (The Martyr), الصهيونيه (The Zionist, Adj., Sing. F. or Pl.), نداء (Call), القدس (Al-Quds, in the Text section), and العدو (The Enemy).]


Does the word الصهيوني (“Zionist”) indicate a terrorist document?
• The word “Zionist” occurred in only six normal documents out of 448
• It never occurred more than once in the same normal document
• In normal documents, the word was used in the following expressions:
  – The Zionist Movement – الحركة الصهيونية
  – The Zionist aggression – العدوان الصهيوني
  – The Zionist plot – المؤامرة الصهيونية
  – The Zionist extremists – غلاة الصهيونية
  – The First Zionist Congress – المؤتمر الصهيوني الأول
  – The extremist Zionist groups – الجماعات الصهيونية المتطرفة


Case Study 2
Categorization of Terrorist Web Documents in English


Document Collection
• 1,004 English documents
  – 913 documents downloaded from a Hezbollah web site (http://www.moqawama.org/english/)
  – 91 documents downloaded from a Hamas web site (www.palestine-info.co.uk/am/publish/)
• Goal
  – Identify the source of web documents (Hamas vs. Hezbollah)
• Document Representation
  – The Hybrid Smart approach
• Classifier
  – C4.5 Decision Tree


Results for the Hybrid Smart Approach
Maximum Graph Size: 100 Nodes
[Figure: classification accuracy (%) and tree size (nodes) vs. subgraph frequency threshold (0.30 to 0.60).]


Resulting Decision Tree
Subgraph Frequency Threshold: 0.55


Conclusions
• Automated filtering of multi-lingual terrorist content is a feasible task
  – Graph representations contribute to categorization accuracy
  – Hybrid (graph and vector) methods improve the processing speed
  – Decision trees provide an interpretable structure that can be tested by a human expert


Future Work
• Some open challenges
  – Developing graph representations of web documents for more languages
  – Finding optimal parameters for subgraph extraction
  – Multi-label categorization of terrorist documents
  – Improving classification accuracy using ontologies of the terrorist domain
  – Identification of groups and topics


References (1)
• H. Bunke and A. Kandel, “Mean and Maximum Common Subgraph of Two Graphs”, Pattern Recognition Letters, Vol. 21, pp. 163–168, 2000.
• M. Kuramochi and G. Karypis, “An Efficient Algorithm for Discovering Frequent Subgraphs”, IEEE Transactions on Knowledge and Data Engineering, Vol. 16, No. 9, Sep. 2004.
• M. Last and A. Kandel (Editors), “Fighting Terror in Cyberspace”, World Scientific, Series in Machine Perception and Artificial Intelligence, Vol. 65, 2005.


References (2)
• M. Last, A. Markov, and A. Kandel, “Multi-Lingual Detection of Terrorist Content on the Web”, Proceedings of the PAKDD'06 International Workshop on Intelligence and Security Informatics (WISI'06), Lecture Notes in Computer Science, Vol. 3917, pp. 16–30, Springer, 2006.
• A. Markov, M. Last, and A. Kandel, “Model-Based Classification of Web Documents Represented by Graphs”, Proceedings of WebKDD 2006 Workshop on Knowledge Discovery on the Web at KDD 2006, pp. 31–38, Philadelphia, PA, USA, Aug. 20, 2006.
• G. Salton, A. Wong, and C. Yang, “A Vector Space Model for Automatic Indexing”, Comm. of the ACM, Vol. 18, No. 11, pp. 613–620, 1975.


References (3)
• G. Salton and M. McGill, “Introduction to Modern Information Retrieval”, McGraw Hill, 1983.
• A. Schenker, H. Bunke, M. Last, and A. Kandel, “Graph-Theoretic Techniques for Web Content Mining”, World Scientific, 2005.
• A. Schenker, M. Last, H. Bunke, and A. Kandel, “Classification of Web Documents Using Graph Matching”, International Journal of Pattern Recognition and Artificial Intelligence, Vol. 18, No. 3, pp. 475–496, 2004.

