
Advances in Data Analysis using PARAFAC2 and Three-way DEDICOM

Brett W. Bader
Sandia National Laboratories

TRICAP 2009
June 15, 2009

Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.


Acknowledgements

• Peter Chew, Sandia
• Tammy Kolda, Sandia
• Daniel Dunlavy, Sandia
• Evrim Acar, Sandia
• Ahmed Abdelali, New Mexico State University
• Alla Rozovskaya, University of Illinois, Urbana-Champaign


Multi-way Analysis

Tensor or N-way array: 3-way DEDICOM and PARAFAC2.

Interested in algorithms for analysis of large data sets.


DEDICOM

• DEcomposition into DIrectional COMponents
• Introduced in 1978 by Harshman
• Past applications
  - Study asymmetries in telephone calls among cities
  - Marketing research
    • car switching: car owners and what they buy next
    • free associations of words for advertising
  - Asymmetric measures of world trade (import/export)
• Variations
  - Two-way DEDICOM
  - Three-way DEDICOM


Two-way DEDICOM

Single domain model: X ≈ A R A^T

    \min_{A,R} \| X - A R A^T \|_F^2 \quad \text{s.t. } A \text{ orthogonal}

• A (N x P) is an orthogonal matrix of loadings or weights
• R (P x P) is a dense matrix that captures asymmetric relationships
• Decomposition is not unique
  - A can be transformed with no loss of fit to the data
  - Nonsingular transformation Q:  A R A^T = (AQ)(Q^{-1} R Q^{-T})(AQ)^T
  - Usually "fix" A with some standard rotation (e.g., VARIMAX)


Three-way DEDICOM

    X_k ≈ A D_k R D_k A^T, \quad k = 1, \ldots, K

    \min_{A,R,D} \sum_k \| X_k - A D_k R D_k A^T \|_F^2

• A (N x P) is a matrix of loadings or weights (not necessarily orthogonal)
• R (P x P) is a dense matrix that captures asymmetric relationships
• D (P x P x K) is a tensor with diagonal frontal slices giving the weights of the columns of A for each slice in the third mode
• *Unique* solution with enough slices of X with sufficient variation
  - i.e., no rotation of A possible
  - greater confidence in interpretation of results


PARAFAC2

    X_k ≈ A D_k H D_k A^T, \quad k = 1, \ldots, K

    \min_{A,H,D} \sum_k \| X_k - A D_k H D_k A^T \|_F^2

• A (N x P) is a matrix of loadings or weights (not necessarily orthogonal)
• H (P x P) is a dense, symmetric matrix (usually positive definite)
• D (P x P x K) is a tensor with diagonal frontal slices giving the weights of the columns of A for each slice in the third mode
• *Unique* solution with enough slices of X with sufficient variation
  - i.e., no rotation of A possible
  - greater confidence in interpretation of results


DEDICOM Models & Algorithms

2-way DEDICOM, X ≈ A R A^T:
• Generalized Takane method (Takane, 1985; Kiers et al., 1990)
• ASALSAN variant (Bader, Harshman, Kolda, 2007)
• All-at-once optimization

3-way DEDICOM, X_k ≈ A D_k R D_k A^T:
• Kiers' method (Kiers, 1993)
• ASALSAN (Bader, Harshman, Kolda, 2007)
• All-at-once optimization


Kiersʼ Algorithm (Kiers, 1993)

    \min_{A,R,D} \sum_{i=1}^m \| X_i - A D_i R D_i A^T \|_F^2

Alternating Least Squares:
1) ALS over columns in A (involves an EVD of a dense n x n matrix, O(p n^3))
2) Least-squares problem for R; minimize

    f(R) = \left\| \begin{pmatrix} \mathrm{vec}(X_1) \\ \vdots \\ \mathrm{vec}(X_m) \end{pmatrix} - \begin{pmatrix} A D_1 \otimes A D_1 \\ \vdots \\ A D_m \otimes A D_m \end{pmatrix} \mathrm{vec}(R) \right\|

   with closed-form solution

    \mathrm{vec}(R) = \left( \sum_{i=1}^m (D_i A^T A D_i) \otimes (D_i A^T A D_i) \right)^{-1} \sum_{i=1}^m \mathrm{vec}(D_i A^T X_i A D_i)

3) ALS over elements in D

Accurate algorithm, but not suitable for large-scale data.
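The closed-form vec(R) solve maps directly to a few lines of MATLAB. A minimal sketch, not the original implementation, assuming the slices sit in a cell array X of n x n matrices and the diagonal of each D_i sits in row i of an m x p matrix D (all names illustrative):

MATLAB:

% Least-squares update for R with A and D held fixed (Kiers, 1993).
m = numel(X);
p = size(A, 2);
AtA = A'*A;
M = zeros(p^2);                       % sum of kron(Di*AtA*Di, Di*AtA*Di)
b = zeros(p^2, 1);                    % sum of vec(Di*A'*Xi*A*Di)
for i = 1:m
    Di = diag(D(i,:));
    T  = Di*AtA*Di;
    M  = M + kron(T, T);
    b  = b + reshape(Di*(A'*X{i}*A)*Di, [], 1);
end
R = reshape(M \ b, p, p);             % solve the p^2 x p^2 normal equations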


ASALSAN (Bader, Harshman, Kolda, 2007)
"Alternating Simultaneous Approximation, Least Squares, and Newton"

    \min_{A,R,D} \sum_{i=1}^m \| X_i - A D_i R D_i A^T \|_F^2

Solving for A: stack the slices and their transposes,

    (X_1 X_1^T \cdots X_m X_m^T) = A (D_1 R D_1 \; D_1 R^T D_1 \cdots D_m R D_m \; D_m R^T D_m)(I_{2m} \otimes A^T)

i.e., Y = A Z^T, so A = Y Z (Z^T Z)^{-1}, which works out to

    A = \left[ \sum_{i=1}^m \left( X_i A D_i R^T D_i + X_i^T A D_i R D_i \right) \right] \left[ \sum_{i=1}^m (B_i + C_i) \right]^{-1}

where B_i ≡ D_i R D_i (A^T A) D_i R^T D_i and C_i ≡ D_i R^T D_i (A^T A) D_i R D_i.
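As a concrete illustration of the A update, a minimal MATLAB sketch under the same storage assumptions as the snippet above (X a cell array of slices, D holding diagonals row-wise); illustrative, not the authors' code:

MATLAB:

% ASALSAN update for A with R and D held fixed.
AtA = A'*A;
Y = zeros(size(A));                   % numerator: sum Xi*A*Di*R'*Di + Xi'*A*Di*R*Di
W = zeros(size(R));                   % denominator: sum Bi + Ci
for i = 1:numel(X)
    Di   = diag(D(i,:));
    DRD  = Di*R*Di;
    DRtD = Di*R'*Di;
    Y = Y + X{i}*A*DRtD + X{i}'*A*DRD;
    W = W + DRD*AtA*DRtD + DRtD*AtA*DRD;   % Bi + Ci
end
A = Y / W;                            % A = Y * inv(W)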


ASALSAN

Solving for R:

    \min_R \sum_{i=1}^m \| X_i - A D_i R D_i A^T \|_F^2

Use the approach in (Kiers, 1993); minimize

    f(R) = \left\| \begin{pmatrix} \mathrm{vec}(X_1) \\ \vdots \\ \mathrm{vec}(X_m) \end{pmatrix} - \begin{pmatrix} A D_1 \otimes A D_1 \\ \vdots \\ A D_m \otimes A D_m \end{pmatrix} \mathrm{vec}(R) \right\|

    \mathrm{vec}(R) = \left( \sum_{i=1}^m (D_i A^T A D_i) \otimes (D_i A^T A D_i) \right)^{-1} \sum_{i=1}^m \mathrm{vec}(D_i A^T X_i A D_i)


ASALSAN

Solving for D: use Newton's method on

    \min_{D_i} \| X_i - A D_i R D_i A^T \|_F^2

for d = diag(D_i), with update d_new = d - H^{-1} g.

Gradient:

    g_k = -\sum_{i,j} \left[ 2 (X - A D R D A^T) \ast (A D r_k a_k^T + a_k r_{k,:} D A^T) \right]_{i,j}

Hessian:

    h_{st} = -2 \sum_{i,j} \left[ (X - A D R D A^T) \ast (a_s r_{st} a_t^T + a_t r_{ts} a_s^T) - (A D r_s a_s^T + a_s r_{s,:} D A^T) \ast (A D r_t a_t^T + a_t r_{t,:} D A^T) \right]_{i,j}

Use compression: with a QR factorization A = Q Ã, solve the smaller (p x p) problem

    \min_{D_i} \| Q^T X_i Q - Ã D_i R D_i Ã^T \|_F^2
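The compression step is a one-time projection per outer iteration. A hedged sketch under the same storage assumptions as before:

MATLAB:

% Compress each n x n slice to p x p before the Newton updates for D.
[Q, Atil] = qr(A, 0);                 % economy QR: A = Q*Atil, Q is n x p
Xsmall = cell(size(X));
for i = 1:numel(X)
    Xsmall{i} = Q'*X{i}*Q;            % fit Atil*Di*R*Di*Atil' to this p x p slice
end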


Algorithm Costs

For small p, updating A is the most expensive part. Dominant costs:
• X_i A R^T and X_i^T A R: linear in nnz of X_i
• A^T A: O(p^2 n)
• QR factorization of A
• Q^T X_i Q

ASALSAN is capable of handling large data sets.


Timings (Bader, Harshman, Kolda, 2007)

Time in seconds per iteration (average number of iterations) on both data sets:

  Algorithm     World trade    Enron
  ASALSAN       0.069 (50)     0.85 (184)
  NN-ASALSAN    0.083 (47)     1.0 (74)
  Kiers [23]    0.022 (67)     22.3 (400+)

Data sets: international trade, 18x18x10 with 3,060 nonzeros; the Enron email graph made public during the federal investigation, 184x184x44 with 9,838 nonzeros.

ASALSAN was written in MATLAB using the Tensor Toolbox; Kiers' algorithm was compiled Pascal code obtained from the author. All tests were performed on a dual 3GHz Pentium Xeon desktop computer with 2GB of RAM. All three algorithms used the same stopping criteria: a tolerance of 10^-5 (World trade) or 10^-7 (Enron) in the change of fit.


New Approach: All-at-once Optimization

Minimization problem: min_x f(x)

• Need a flexible method to handle
  - constraints (e.g., non-negativity)
  - regularization (e.g., sparsity)
  - missing data
  - large-scale data sets
• Want
  - speed comparable to ASALSAN
  - improved accuracy, if possible
• Follow what has been done with CANDECOMP/PARAFAC (see Evrim Acar's presentation)
• Use the Poblano Optimization Toolbox in MATLAB (Dunlavy et al., 2009)
  - Vectorize factor variables and derivatives


Optimization Theory

Minimization problem: min_x f(x)

Gradient:

    \nabla f(x) = \begin{pmatrix} \partial f / \partial x_1 \\ \vdots \\ \partial f / \partial x_n \end{pmatrix}

Hessian:

    \nabla^2 f(x) = \begin{pmatrix} \partial^2 f / \partial x_1^2 & \cdots & \partial^2 f / \partial x_1 \partial x_n \\ \vdots & \ddots & \vdots \\ \partial^2 f / \partial x_n \partial x_1 & \cdots & \partial^2 f / \partial x_n^2 \end{pmatrix}

Taylor series approximation:

    f(x + s) \approx f(x) + \nabla f(x)^T s + \tfrac{1}{2} s^T \nabla^2 f(x) s


Newtonʼs Method

The Taylor series approximation gives the local model

    m(x_k + s) = f(x_k) + \nabla f(x_k)^T s + \tfrac{1}{2} s^T \nabla^2 f(x_k) s

Newton's method: solve for s_k, then step:

    \nabla^2 f(x_k) s_k = -\nabla f(x_k), \qquad x_{k+1} = x_k + s_k

(Figure: contours of f in the (x_1, x_2) plane with the Newton step s_k taken from x_k.)


Quasi-Newton Methods

Approximation to the Hessian: H_k ≈ \nabla^2 f(x_k), giving the new linear system H_k s_k = -\nabla f(x_k).

Secant method: use gradients at past points to approximate the Hessian. With B_k = H_k^{-1}, the step is s_k = -B_k \nabla f(x_k), where B_k solves

    \min_{B_k} \| B_k - B_{k-1} \|_F \quad \text{subject to } B_k = B_k^T, \; B_k y_k = s_{k-1},

with y_k = \nabla f(x_k) - \nabla f(x_{k-1}).

BFGS update:

    B_k = B_{k-1} + \frac{s_{k-1} s_{k-1}^T}{s_{k-1}^T y_k} - \frac{B_{k-1} y_k y_k^T B_{k-1}}{y_k^T B_{k-1} y_k}
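In code, the BFGS update is just two rank-one corrections. A generic textbook sketch in MATLAB (s and y are the most recent difference vectors, B the previous inverse-Hessian approximation, g the current gradient; all names illustrative):

MATLAB:

% One BFGS update of the inverse-Hessian approximation B.
% s = x_k - x_{k-1};  y = grad f(x_k) - grad f(x_{k-1})
By = B*y;
B  = B + (s*s')/(s'*y) - (By*By')/(y'*By);   % uses symmetry of B
step = -B*g;                                 % quasi-Newton step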


Large-scale Methods

Limited-memory BFGS (L-BFGS) keeps only the m most recent update vectors s_k and y_k rather than forming B_k explicitly (see the sketch below).

Nonlinear Conjugate Gradient (NCG): steepest descent first, then move in subsequent conjugate directions.

(Figure: steps -\nabla f(x_k) and -H_k^{-1} \nabla f(x_k) compared in the (x_1, x_2) plane.)

These methods are implemented in the Poblano Optimization Toolbox for MATLAB (Dunlavy et al., 2009).
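For intuition about how L-BFGS avoids storing B_k, here is a generic two-loop recursion sketch (a textbook version, not Poblano's code); S and Ys hold the m most recent s_j and y_j as columns, g is the current gradient:

MATLAB:

% L-BFGS two-loop recursion: direction d ~ -inv(H)*g from m stored pairs.
q = g;
mm = size(S, 2);
alpha = zeros(mm, 1);
rho = 1 ./ sum(S.*Ys, 1)';            % rho_j = 1/(s_j'*y_j)
for j = mm:-1:1
    alpha(j) = rho(j) * (S(:,j)'*q);
    q = q - alpha(j)*Ys(:,j);
end
gamma = (S(:,mm)'*Ys(:,mm)) / (Ys(:,mm)'*Ys(:,mm));
d = gamma*q;                          % scaled initial inverse Hessian
for j = 1:mm
    beta = rho(j) * (Ys(:,j)'*d);
    d = d + (alpha(j) - beta)*S(:,j);
end
d = -d;                               % descent direction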


Derivatives for 2-way DEDICOM

Error: E ≡ Z - A R A^T

Objective function:

    f(A, R) ≡ \tfrac{1}{2} \| E \|_F^2 = \tfrac{1}{2} \| Z - A R A^T \|_F^2

Gradient:

    \partial f / \partial A = -E^T A R - E A R^T
    \partial f / \partial R = -A^T E A

MATLAB:

AR  = A*R;
ARt = A*R';
AtA = A'*A;
% Gradient w.r.t. A: -E'*A*R - E*A*R' with E = Z - A*R*A' expanded
G{1} = -Z'*AR - Z*ARt + ARt*(AtA*R) + AR*(AtA*R');
% Gradient w.r.t. R: -A'*E*A expanded
G{2} = -A'*Z*A + AtA*R*AtA;
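A quick way to validate such hand-derived gradients is a finite-difference check. A small sketch, assuming Z, A, R, and G are as in the snippet above:

MATLAB:

% Finite-difference check of G{1} at one (arbitrary) entry.
f = @(A, R) 0.5*norm(Z - A*R*A', 'fro')^2;
i = 2;  j = 1;  h = 1e-6;
Ah = A;  Ah(i,j) = Ah(i,j) + h;
fd = (f(Ah, R) - f(A, R)) / h;        % numerical partial derivative
fprintf('analytic %g vs. finite difference %g\n', G{1}(i,j), fd);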


Derivatives for 3-way DEDICOM

Error: E_k ≡ Z_k - A D_k R D_k A^T

Objective function:

    f(A, R, D) ≡ \tfrac{1}{2} \sum_{k=1}^K \| E_k \|_F^2 = \tfrac{1}{2} \sum_{k=1}^K \| Z_k - A D_k R D_k A^T \|_F^2

Gradient:

    \partial f / \partial A = -\sum_k \left( E_k^T A D_k R D_k + E_k A D_k R^T D_k \right)
    \partial f / \partial R = -\sum_k D_k A^T E_k A D_k
    \partial f / \partial D_k(j,j) = -\left( (A D_k r_{:,j})^T E_k a_j + a_j^T E_k A D_k r_{j,:}^T \right)


Derivatives for 3-way DEDICOM

MATLAB (D stored as a K x P matrix of diagonals; the sign of E below is flipped relative to the definition above, which absorbs the leading minus signs of the gradient):

G = {zeros(size(A)), zeros(size(R)), zeros(size(D))};
for k = 1:size(D,1)
    DktDk = D(k,:)'*D(k,:);
    Rk  = R.*DktDk;            % Dk*R*Dk via elementwise scaling
    AR  = A*Rk;
    ARt = A*Rk';
    E   = AR*A' - Z(:,:,k);    % model minus data
    G{1} = G{1} + E'*AR + E*ARt;
    G{2} = G{2} + (A'*E*A).*DktDk;
    ADR = A*(diag(D(k,:))*R);
    RDA = (R*diag(D(k,:)))*A';
    for j = 1:size(D,2)
        G{3}(k,j) = ADR(:,j)'*E*A(:,j) + A(:,j)'*E*RDA(j,:)';
    end
end


Results

(Plot: relative residual error, 10^0 down to 10^-12, vs. time in seconds for ASALSAN, NCG, and L-BFGS.)

• Synthetic data, 20x20x9
• Random initialization (5% error)
• No noise in X
• p = 2


Results

(Plot: relative residual error, 10^0 down to 10^-3, vs. time in seconds for ASALSAN, NCG, and L-BFGS.)

• Synthetic data, 20x20x9
• Random initialization (5% error)
• 0.1% noise in X
• p = 2


Results

(Plot: relative residual error, 10^2 down to 10^0, vs. time in seconds for ASALSAN, NCG, and L-BFGS.)

• Synthetic data, 50x50x35
• Random initialization (5% error)
• 0.1% noise in X
• Oblique factors A
• p = 2


Results

(Plot: relative residual error, 0.90 to 0.99, vs. time in seconds for ASALSAN, NCG, and L-BFGS.)

• Enron email data, 184x184x44
• 9,838 nonzeros
• p = 4
• Random initialization


Derivatives for PARAFAC2

Error: E_k ≡ Z_k - A D_k H D_k A^T

Objective function:

    f(A, H, D) ≡ \tfrac{1}{2} \sum_{k=1}^K \| E_k \|_F^2 = \tfrac{1}{2} \sum_{k=1}^K \| Z_k - A D_k H D_k A^T \|_F^2

Gradient (using the symmetry of H and of the slices Z_k):

    \partial f / \partial A = -2 \sum_k E_k^T A D_k H D_k
    \partial f / \partial H = -\sum_k D_k A^T E_k A D_k
    \partial f / \partial D_k(j,j) = -2 (A D_k h_{:,j})^T E_k a_j
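Mirroring the 3-way DEDICOM code earlier, a sketch of these gradients in MATLAB (again with D stored as K x P and the error sign flipped to absorb the minus signs); an illustrative adaptation, not the original code:

MATLAB:

G = {zeros(size(A)), zeros(size(H)), zeros(size(D))};
for k = 1:size(D,1)
    Dk = diag(D(k,:));
    M  = A*(Dk*H*Dk);                 % A*Dk*H*Dk
    E  = M*A' - Z(:,:,k);             % model minus data
    G{1} = G{1} + 2*E'*M;
    G{2} = G{2} + Dk*(A'*E*A)*Dk;
    ADH = A*(Dk*H);                   % column j is A*Dk*h_j
    for j = 1:size(D,2)
        G{3}(k,j) = 2*ADH(:,j)'*E*A(:,j);
    end
end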


Discussion

• All-at-once optimization is competitive but not the best; ASALSAN is fastest on real data.
  - DEDICOM and PARAFAC2 are difficult nonlinear optimization problems with multiple minima
  - ASALSAN includes some second-order information
• The optimization approach is convenient for:
  - Different loss functions (other than least squares)
  - General constraints (e.g., non-negativity)
  - Regularization (e.g., sparsity)
  - Missing data
• Future research will explore these benefits and aim for better performance


Applications

Tensor or N-way array.

DEDICOM:
• Temporal social network analysis (Enron)
• Community finding
• Part-of-speech tagging

PARAFAC2:
• Multilingual document analysis


Using PARAFAC2 for Multilingual Document Clustering

Joint work with Peter Chew


Basics: Graphs and Matrices

Example from (Berry, Drmac, Jessup, 1999).

The d = 5 document titles:
D1: How to Bake Bread Without Recipes
D2: The Classic Art of Viennese Pastry
D3: Numerical Recipes: The Art of Scientific Computing
D4: Breads, Pastries, Pies and Cakes: Quantity Baking Recipes
D5: Pastry: A Book of Best French Recipes

The t = 6 terms:
T1: bak(e,ing)   T2: recipes   T3: bread   T4: cake   T5: pastr(y,ies)   T6: pie

The 6 x 5 term-by-document matrix before normalization, where element â_ij is the number of times term i appears in document title j (this is also the adjacency matrix of the bipartite term-document graph):

    Â = [ 1 0 0 1 0
          1 0 1 1 1
          1 0 0 1 0
          0 0 0 1 0
          0 1 0 1 1
          0 0 0 1 0 ]

The 6 x 5 term-by-document matrix with unit columns:

    A = [ 0.5774 0      0      0.4082 0
          0.5774 0      1.0000 0.4082 0.7071
          0.5774 0      0      0.4082 0
          0      0      0      0.4082 0
          0      1.0000 0      0.4082 0.7071
          0      0      0      0.4082 0 ]

Concepts: bag of words, vector space model, stemming, stoplists, scaling for information content. A sketch of this construction follows.
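As flagged above, a small MATLAB sketch of this construction, with the matrix values taken straight from the slide:

MATLAB:

% Term-by-document counts for the 6 terms x 5 titles above.
Ahat = [1 0 0 1 0;                    % T1: bak(e,ing)
        1 0 1 1 1;                    % T2: recipes
        1 0 0 1 0;                    % T3: bread
        0 0 0 1 0;                    % T4: cake
        0 1 0 1 1;                    % T5: pastr(y,ies)
        0 0 0 1 0];                   % T6: pie
% Normalize columns to unit length, giving the matrix A above.
A = Ahat * diag(1 ./ sqrt(sum(Ahat.^2, 1)));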


Cross-language Information Retrieval (CLIR)

Web documents could be in any language.

(Chart: languages on the web: English, German, Japanese, French, Chinese Simplified, Spanish, Russian, Dutch, Korean, Polish, Portuguese, Chinese Traditional, Swedish, Czech, Norwegian, Italian, Danish, Hungarian, Finnish, Hebrew, Arabic, Turkish, Slovak, Indonesian, Bulgarian, Croatian, Catalan, Slovenian, Greek, Romanian, Serbian, Estonian, Icelandic, Lithuanian, Latvian.)

Goal: Cluster documents by topic regardless of language (here: English, French, Arabic, Spanish).

(P. Chew, B. Bader, A. Abdelali (NMSU), T. Kolda, P. Kegelmeyer)


Bible as a ʻRosetta Stoneʼ

• The Bible has been translated carefully and widely: 451 complete and 2,479 partial translations
• Verse aligned

Sandia's database covers 54 languages (99.76% coverage of the web): Afrikaans, Albanian, Amharic, Arabic, Aramaic, Armenian Eastern, Armenian Western, Basque, Breton, Chamorro, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Finnish, French, German, Greek (New Testament), Greek (Modern), Hebrew (Old Testament), Hebrew (Modern), Hungarian, Indonesian, Italian, Japanese, Korean, Latin, Latvian, Lithuanian, Manx Gaelic, Maori, Norwegian, Persian (Farsi), Polish, Portuguese, Romani, Romanian, Russian, Scots Gaelic, Spanish, Swahili, Swedish, Tagalog, Thai, Turkish, Ukrainian, Vietnamese, Wolof, Xhosa.


Bible as Parallel Corpus

5 languages for training and testing:

  Translation                  Terms    Total Words
  English (King James)         12,335   789,744
  Spanish (Reina Valera 1909)  28,456   704,004
  Russian (Synodal 1876)       47,226   560,524
  Arabic (Smith Van Dyke)      55,300   440,435
  French (Darby)               20,428   812,947

• Languages convey information in a different number of words (isolating language vs. synthetic language)


(Slide: a table illustrating statistical differences between languages, with columns for text, word count, and % of total; the table contents did not survive extraction.)


Language Morphology

  Translation              Terms    Total Words
  English (King James)     12,335   789,744
  Arabic (Smith Van Dyke)  55,300   440,435

Languages convey information in a different number of words.

• Isolating language: one morpheme per word (e.g., Chinese)
  - e.g., "He travelled by hovercraft on the sea." is largely isolating, but travelled and hovercraft each have two morphemes per word (Wikipedia)
• Synthetic language: high morpheme-per-word ratio (e.g., Quechua, Inuit (Eskimo))
  - German: Aufsichtsratsmitgliederversammlung => "on-view-council-with-limbs-gathering", meaning "meeting of members of the supervisory board" (Wikipedia)
  - Chulym: Aalychtypiskem => "I went out moose hunting"
  - Yupʼik Eskimo: tuntusssuatarniksaitengqiggtuq => "He had not yet said again that he was going to hunt reindeer." (Payne, 1997)


Term-Doc Matrix

Term-by-verse matrix for all languages: the English, Spanish, Russian, Arabic, and French vocabularies are stacked in the rows; the columns are Bible verses. Size: 163,745 x 31,230.

Look for co-occurrence of terms in the same verses and across languages to capture latent concepts.

• Approach is not new
  - pairs of languages in LSA suggested by (Berry et al., 1994)
  - the multi-parallel corpus is new


Latent Semantic Indexing

Truncated SVD of the term-by-verse matrix for all languages:

    A_k = U_k \Sigma_k V_k^T = \sum_{i=1}^k \sigma_i u_i v_i^T

Project new documents of interest into the subspace via U \Sigma^{-1}; each document becomes a k-dimensional feature vector (the slide shows an example 20-dimensional vector). Uses:
• cosine similarities for clustering
• machine learning applications
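In MATLAB, the whole pipeline is a truncated sparse SVD plus a projection. A hedged sketch (X is the sparse multilingual term-by-verse matrix; d1 and d2 are term-count column vectors for two new documents; names illustrative):

MATLAB:

k = 240;                              % latent dimensions used on these slides
[U, S, V] = svds(X, k);               % truncated SVD of the sparse matrix
feat1 = (d1'*U) / S;                  % project: d'*U*inv(S), a 1 x k feature vector
feat2 = (d2'*U) / S;
% Cosine similarity between the two projected documents:
cossim = (feat1*feat2') / (norm(feat1)*norm(feat2));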


Quran as Test Set

• The Quran is translated into many languages, just like the Bible
• 114 suras (or chapters)
• More variation across translations => a harder IR task


Performance Metrics

• Precision at 1 document (P1)
  - Equals 100% if the translation of the query is ranked highest, 0% otherwise
  - Calculated as an average over all queries for each language pair, or as a total average
  - Essentially, P1 measures success in retrieving documents when the source and target languages are specified (a schematic computation follows this list)
• Average multilingual precision at 5 (or n) documents (MP5)
  - The average percentage of the top 5 documents that are translations of the query document
  - Calculated as an average over all queries and all languages
  - Essentially, MP5 measures success in multilingual clustering
• MP5 is a stricter measure than P1 because the retrieval task is harder
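For concreteness, a schematic of P1 under the simplifying assumption that document i in the query language translates document i in the target language (sim holds query-by-target cosine similarities; illustrative only, not the authors' evaluation code):

MATLAB:

% P1 for one language pair: percentage of queries whose top-ranked
% target document is the query's own translation.
[~, best] = max(sim, [], 2);          % index of most similar target per query
P1 = 100 * mean(best == (1:size(sim,1))');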


LSA Results (Chew, Bader, Kolda, Abdelali, 2007)

5 languages, 240 latent dimensions:

  Method    Average P1    Average MP5
  SVD/LSA   76.0%         26.1%

Documents tend to cluster more by language than by topic.


LSA Results

5 languages, 240 latent dimensions:

  Method           Average P1    Average MP5
  SVD/LSA (α=1)    76.0%         26.1%
  SVD/LSA (α=1.8)  87.6%         65.5%


New Approach: Multi-matrix Array (Chew, Bader, Kolda, Abdelali, 2007)

One term-by-verse matrix per language: X_1 (English), X_2 (Spanish), X_3 (Russian), X_4 (Arabic), X_5 (French).

Array size: 55,300 x 31,230 x 5 with 2,765,719 nonzeros.


Tucker1

Each language's term-by-verse slice gets its own term factor, with the verse factor shared:

    X_k ≈ U_k S V^T, \quad k = 1, \ldots, K

(Diagram: X_1, X_2, X_3 factored with U_1, U_2, U_3 and a common S V^T.)


Tucker1 Results

5 languages, 240 latent dimensions:

  Method           Average P1    Average MP5
  SVD/LSA (α=1)    76.0%         26.1%
  SVD/LSA          87.6%         65.5%
  Tucker1          89.5%         71.3%

Only minor improvement because each U_k is not orthogonal.


PARAFAC2

    X_k ≈ U_k H S_k V^T   (Harshman, 1972)

where each U_k is orthonormal and S_k is diagonal.


PARAFAC2 Results

5 languages, 240 latent dimensions:

  Method           Average P1    Average MP5
  SVD/LSA (α=1)    76.0%         26.1%
  SVD/LSA          87.6%         65.5%
  Tucker1          89.5%         71.3%
  PARAFAC2         89.8%         78.5%

Modest improvement over LSA.


Bible Clustering

• Books of the Bible color-coded by language
• MP5 about 90%
• Books cluster first with their counterparts in other languages, then in larger clusters by topic
• Visualization software: Tamale 1.2; graph layout: VxOrd (Boyack et al., 2005)

Clusters contain related books; this kind of visualization is possible only when MP5 is satisfactorily high. In summary, these techniques have effectively allowed us to factor language out, focusing only on topic, just as we had hoped.


Graph Layout (close-up)

(Figure 7: partial visualization of multilingual Bible books in vector space.)

• John and Acts have tight clusters
• Some mixing with Matthew, Mark, Luke (the synoptic gospels, which share a similar perspective)


Using DEDICOM for Completely Unsupervised Part-of-Speech Tagging

Joint work with Peter Chew


Approaches to POS tagging (1)

• Supervised
  - Rule-based (e.g., Harris 1962)
    • Dictionary + manually developed rules
    • Brittle: the approach doesn't port to new domains
  - Stochastic (e.g., Stolz et al. 1965, Church 1988)
    • Examples: HMMs, CRFs
    • Relies on estimation of emission and transition probabilities from a tagged training corpus
    • Again, difficulty in porting to new domains


Approaches to POS tagging (2)

• Unsupervised
  - All approaches exploit distributional patterns
  - Singular Value Decomposition (SVD) of the term-adjacency matrix (Schütze 1993, 1995)
  - Graph clustering (Biemann 2006)
  - Our approach: DEDICOM of the term-adjacency matrix
  - Most similar to Schütze (1993, 1995)

Advantages:
• can be reconciled to stochastic approaches
• completely unsupervised, like SVD and graph clustering
• initial results appear promising


DEDICOM: application to POS tagging

(Diagram: term-adjacency matrix ≈ 'A' matrix x 'R' matrix x 'A' transposed.)

• The assumption that terms are a "single set of objects", whether they precede or follow, sets DEDICOM apart from SVD and other unsupervised approaches
• This assumption models the fact that tokens play the same syntactic role whether we view them as the first or second element in a bigram


Comparing DEDICOM output to HMM input

(Diagram: the DEDICOM 'R' matrix corresponds, after normalization of counts, to an HMM transition probability matrix; the 'A' matrix to an emission probability matrix.)

• The output of DEDICOM is essentially a transition and emission probability matrix
• DEDICOM offers the possibility of getting the familiar transition and emission probabilities without tagged training data, as sketched below
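A minimal sketch of one plausible normalization (the validation slides that follow make the row-sum version concrete with counts); the exact conventions are illustrative, not prescribed by the slide:

MATLAB:

% R: k x k tag-adjacency loadings; A: nTerms x k term loadings.
transition = diag(1 ./ sum(R, 2)) * R;      % rows sum to 1: P(next tag | tag)
emission   = A * diag(1 ./ sum(A, 1));      % columns sum to 1: P(term | tag)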


Validation: method 1 (theoretical)

• Hypothetical example: suppose a tagged training corpus exists

Corpus: The man walked the big dog
        DT  NN  VBD    DT  JJ  NN

X: sparse matrix of bigram counts; A*: term-tag counts; R*: tag-adjacency counts.

• By definition (subject to a difference of 1 for the final token):
  - row sums of X = column sums of X = row sums of A*
  - column sums of A* = row sums of R* = column sums of R*


Validation: method 1 (theoretical)

• To turn A* and R* into transition and emission probability matrices, we simply multiply each by a diagonal matrix D whose entries are the inverses of the row-sum vector
• If the DEDICOM model is a good one, we should be able to multiply A*DR*D(A*)^T to approximate the original matrix X
• In our example, A*DR*D(A*)^T (worked through in the sketch below) approximates X, and it also captures some syntactic regularities which aren't instantiated in the corpus (this is one reason HMM-based POS tagging is successful)
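The worked example referenced above, in MATLAB. The counts come straight from the tagged corpus on the previous slide; following the slide's formula, D holds the inverses of the tag totals (the R* row sums):

MATLAB:

% Corpus: The man walked the big dog  (DT NN VBD DT JJ NN)
% Terms: the, man, walked, big, dog.  Tags: DT, NN, VBD, JJ.
Astar = [2 0 0 0;                     % the -> DT (twice)
         0 1 0 0;                     % man -> NN
         0 0 1 0;                     % walked -> VBD
         0 0 0 1;                     % big -> JJ
         0 1 0 0];                    % dog -> NN
Rstar = [0 1 0 1;                     % DT -> NN, DT -> JJ
         0 0 1 0;                     % NN -> VBD
         1 0 0 0;                     % VBD -> DT
         0 1 0 0];                    % JJ -> NN
Dd = diag(1 ./ sum(Rstar, 2));        % inverses of the row sums (tag totals)
Xhat = Astar * Dd * Rstar * Dd * Astar';
% Xhat reproduces every observed bigram (the->man, man->walked, walked->the,
% the->big, big->dog) with weight 1, and adds unseen but grammatical bigrams
% such as the->dog and dog->walked: the "syntactic regularities" noted above.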


Validation: method 2 (empirical)

• Use a tagged corpus (CoNLL 2000)
  - CoNLL 2000 has 19,440 distinct terms
  - There are 44 distinct tags in the tagset (our "gold standard")
• Tabulate the X matrix (solely from bigram frequencies, blind to tags)
• Fit the DEDICOM model to X (using k = 44) to 'learn' emission and transition probability matrices
• Use these as input to an HMM; tag each token with a numerical index (one of the DEDICOM 'dimensions')
• Evaluate by looking at the correlation of the induced tags with the 44 "gold standard" tags in a confusion matrix


Validation: method 2 (empirical)

Assuming the 'gold standard' is the optimal tagging scheme, the ideal confusion matrix would have one DEDICOM class per 'gold standard' tag (either a diagonal matrix or some permutation thereof).

(Figure: confusion matrix of gold-standard tags (rows) vs. DEDICOM tags (columns).)

Confusion matrix: correlation with the 'ideal' diagonal matrix = 0.494.


Validation: method 2 (empirical)

• Examples of DEDICOM dimensions or clusters (labeled "NN", "NN", and "DT" on the slide)

Tag 3: the grouping of 'new' with 'the' and 'a' is explained by distributional similarities (all precede nouns). This is also in accordance with a traditional English grammar, which held that determiners are a type of adjective.

Soft clustering: U.S. appears under tag 2 (nouns) and tag 8 (adjectives).


Future Work in POS Tagging

• Constrained DEDICOM
  - Tags for common terms (e.g., determiners, prepositions)
  - Semi-supervised learning
• Analyze multiple corpora via 3-way DEDICOM
• Constraints on the R matrix in DEDICOM


Discussion and Questions

• Challenges with global optimization
  - Ways to reduce the search space and find a global minimizer
  - A deterministic algorithm that returns a good approximation to the global solution
• Fast algorithms for large-scale problems
• Other data analysis models


Selected Publications

• Bader, Harshman, and Kolda. 2007. Temporal analysis of semantic graphs using ASALSAN. Proceedings of the IEEE International Conference on Data Mining (ICDM), 2007.
• Chew, Bader, Kolda, and Abdelali. 2007. Cross-language information retrieval using PARAFAC2. Proceedings of KDD 2007.
• Chew, Kegelmeyer, Bader, and Abdelali. 2008. The Knowledge of Good and Evil: Multilingual Ideology Classification with PARAFAC2 and Machine Learning. Language Forum (34), 37-52.
• Chew, Bader, and Rozovskaya. 2009. Using DEDICOM for Completely Unsupervised Part-of-Speech Tagging. Proceedings of NAACL-HLT, Workshop on Unsupervised and Minimally Supervised Learning of Lexical Semantics.
• B. W. Bader. 2009. Constrained and Unconstrained Optimization. In Brown, Tauler, Walczak (eds.), Comprehensive Chemometrics, volume 1, pp. 507-545. Oxford: Elsevier.

Brett Bader (bwbader@sandia.gov)
