
Advances in Data Analysis using PARAFAC2 and Three-way DEDICOM

Brett W. Bader
Sandia National Laboratories

TRICAP 2009
June 15, 2009

Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.


Acknowledgements

• Peter Chew, Sandia
• Tammy Kolda, Sandia
• Daniel Dunlavy, Sandia
• Evrim Acar, Sandia
• Ahmed Abdelali, New Mexico State University
• Alla Rozovskaya, University of Illinois, Urbana-Champaign


Multi-way Analysis

Tensor or N-way array: 3-way DEDICOM and PARAFAC2.

Interested in algorithms for analysis of large data sets.


DEDICOM

• DEcomposition into DIrectional COMponents
• Introduced in 1978 by Harshman
• Past applications
  - Study asymmetries in telephone calls among cities
  - Marketing research
    • car switching: car owners and what they buy next
    • free associations of words for advertising
  - Asymmetric measures of world trade (import/export)
• Variations
  - Two-way DEDICOM
  - Three-way DEDICOM


Two-way DEDICOM

Single domain model: X ≈ A R A^T

    \min_{A,R} \| X - A R A^T \|_F^2 \quad \text{s.t. } A \text{ orthogonal}

• A (N x P) is an orthogonal matrix of loadings or weights
• R (P x P) is a dense matrix that captures asymmetric relationships
• Decomposition is not unique
  - A can be transformed with no loss of fit to the data
  - Nonsingular transformation Q:  A R A^T = (AQ)(Q^{-1} R Q^{-T})(AQ)^T
  - Usually "fix" A with some standard rotation (e.g., VARIMAX)


Three-way DEDICOM

    X_k ≈ A D_k R D_k A^T, \quad k = 1, \ldots, K

    \min_{A,R,D} \sum_k \| X_k - A D_k R D_k A^T \|_F^2

• A (N x P) is a matrix of loadings or weights (not necessarily orthogonal)
• R (P x P) is a dense matrix that captures asymmetric relationships
• D (P x P x K) is a tensor with diagonal frontal slices giving the weights of the columns of A for each slice in the third mode
• *Unique* solution with enough slices of X with sufficient variation
  - i.e., no rotation of A possible
  - greater confidence in interpretation of results


PARAFAC2

    X_k ≈ A D_k H D_k A^T, \quad k = 1, \ldots, K

    \min_{A,H,D} \sum_k \| X_k - A D_k H D_k A^T \|_F^2

• A (N x P) is a matrix of loadings or weights (not necessarily orthogonal)
• H (P x P) is a dense, symmetric matrix (usually positive definite)
• D (P x P x K) is a tensor with diagonal frontal slices giving the weights of the columns of A for each slice in the third mode
• *Unique* solution with enough slices of X with sufficient variation
  - i.e., no rotation of A possible
  - greater confidence in interpretation of results


DEDICOM Models & Algorithms

2-way DEDICOM, X ≈ A R A^T:
• Generalized Takane method (Takane, 1985; Kiers et al., 1990)
• ASALSAN variant (Bader, Harshman, Kolda, 2007)
• All-at-once optimization

3-way DEDICOM, X_k ≈ A D_k R D_k A^T:
• Kiers' method (Kiers, 1993)
• ASALSAN (Bader, Harshman, Kolda, 2007)
• All-at-once optimization


Kiersʼ Algorithm (Kiers, 1993)

    \min_{A,R,D} \sum_{i=1}^m \| X_i - A D_i R D_i A^T \|_F^2

Alternating Least Squares:
1) ALS over columns in A (involves an EVD of a dense n x n matrix, O(p n^3))
2) Least-squares problem for R; minimize

    f(R) = \left\| \begin{pmatrix} \mathrm{vec}(X_1) \\ \vdots \\ \mathrm{vec}(X_m) \end{pmatrix} - \begin{pmatrix} A D_1 \otimes A D_1 \\ \vdots \\ A D_m \otimes A D_m \end{pmatrix} \mathrm{vec}(R) \right\|

   with closed-form solution

    \mathrm{vec}(R) = \left( \sum_{i=1}^m (D_i A^T A D_i) \otimes (D_i A^T A D_i) \right)^{-1} \sum_{i=1}^m \mathrm{vec}(D_i A^T X_i A D_i)

3) ALS over elements in D

Accurate algorithm, but not suitable for large-scale data.
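The closed-form vec(R) solve maps directly to a few lines of MATLAB. A minimal sketch, not the original implementation, assuming the slices sit in a cell array X of n x n matrices and the diagonal of each D_i sits in row i of an m x p matrix D (all names illustrative):

MATLAB:

% Least-squares update for R with A and D held fixed (Kiers, 1993).
m = numel(X);
p = size(A, 2);
AtA = A'*A;
M = zeros(p^2);                       % sum of kron(Di*AtA*Di, Di*AtA*Di)
b = zeros(p^2, 1);                    % sum of vec(Di*A'*Xi*A*Di)
for i = 1:m
    Di = diag(D(i,:));
    T  = Di*AtA*Di;
    M  = M + kron(T, T);
    b  = b + reshape(Di*(A'*X{i}*A)*Di, [], 1);
end
R = reshape(M \ b, p, p);             % solve the p^2 x p^2 normal equations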


ASALSAN (Bader, Harshman, Kolda, 2007)
"Alternating Simultaneous Approximation, Least Squares, and Newton"

    \min_{A,R,D} \sum_{i=1}^m \| X_i - A D_i R D_i A^T \|_F^2

Solving for A: stack the slices and their transposes,

    (X_1 X_1^T \cdots X_m X_m^T) = A (D_1 R D_1 \; D_1 R^T D_1 \cdots D_m R D_m \; D_m R^T D_m)(I_{2m} \otimes A^T)

i.e., Y = A Z^T, so A = Y Z (Z^T Z)^{-1}, which works out to

    A = \left[ \sum_{i=1}^m \left( X_i A D_i R^T D_i + X_i^T A D_i R D_i \right) \right] \left[ \sum_{i=1}^m (B_i + C_i) \right]^{-1}

where B_i ≡ D_i R D_i (A^T A) D_i R^T D_i and C_i ≡ D_i R^T D_i (A^T A) D_i R D_i.
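As a concrete illustration of the A update, a minimal MATLAB sketch under the same storage assumptions as the snippet above (X a cell array of slices, D holding diagonals row-wise); illustrative, not the authors' code:

MATLAB:

% ASALSAN update for A with R and D held fixed.
AtA = A'*A;
Y = zeros(size(A));                   % numerator: sum Xi*A*Di*R'*Di + Xi'*A*Di*R*Di
W = zeros(size(R));                   % denominator: sum Bi + Ci
for i = 1:numel(X)
    Di   = diag(D(i,:));
    DRD  = Di*R*Di;
    DRtD = Di*R'*Di;
    Y = Y + X{i}*A*DRtD + X{i}'*A*DRD;
    W = W + DRD*AtA*DRtD + DRtD*AtA*DRD;   % Bi + Ci
end
A = Y / W;                            % A = Y * inv(W)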


ASALSAN

Solving for R:

    \min_R \sum_{i=1}^m \| X_i - A D_i R D_i A^T \|_F^2

Use the approach in (Kiers, 1993); minimize

    f(R) = \left\| \begin{pmatrix} \mathrm{vec}(X_1) \\ \vdots \\ \mathrm{vec}(X_m) \end{pmatrix} - \begin{pmatrix} A D_1 \otimes A D_1 \\ \vdots \\ A D_m \otimes A D_m \end{pmatrix} \mathrm{vec}(R) \right\|

    \mathrm{vec}(R) = \left( \sum_{i=1}^m (D_i A^T A D_i) \otimes (D_i A^T A D_i) \right)^{-1} \sum_{i=1}^m \mathrm{vec}(D_i A^T X_i A D_i)


ASALSAN

Solving for D: use Newton's method on

    \min_{D_i} \| X_i - A D_i R D_i A^T \|_F^2

for d = diag(D_i), with update d_new = d - H^{-1} g.

Gradient:

    g_k = -\sum_{i,j} \left[ 2 (X - A D R D A^T) \ast (A D r_k a_k^T + a_k r_{k,:} D A^T) \right]_{i,j}

Hessian:

    h_{st} = -2 \sum_{i,j} \left[ (X - A D R D A^T) \ast (a_s r_{st} a_t^T + a_t r_{ts} a_s^T) - (A D r_s a_s^T + a_s r_{s,:} D A^T) \ast (A D r_t a_t^T + a_t r_{t,:} D A^T) \right]_{i,j}

Use compression: with a QR factorization A = Q Ã, solve the smaller (p x p) problem

    \min_{D_i} \| Q^T X_i Q - Ã D_i R D_i Ã^T \|_F^2
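The compression step is a one-time projection per outer iteration. A hedged sketch under the same storage assumptions as before:

MATLAB:

% Compress each n x n slice to p x p before the Newton updates for D.
[Q, Atil] = qr(A, 0);                 % economy QR: A = Q*Atil, Q is n x p
Xsmall = cell(size(X));
for i = 1:numel(X)
    Xsmall{i} = Q'*X{i}*Q;            % fit Atil*Di*R*Di*Atil' to this p x p slice
end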


Algorithm Costs

For small p, updating A is the most expensive part. Dominant costs:
• X_i A R^T and X_i^T A R: linear in nnz of X_i
• A^T A: O(p^2 n)
• QR factorization of A
• Q^T X_i Q

ASALSAN is capable of handling large data sets.


Timings (Bader, Harshman, Kolda, 2007)

Time in seconds per iteration (average number of iterations) on both data sets:

  Algorithm     World trade    Enron
  ASALSAN       0.069 (50)     0.85 (184)
  NN-ASALSAN    0.083 (47)     1.0 (74)
  Kiers [23]    0.022 (67)     22.3 (400+)

Data sets: international trade, 18x18x10 with 3,060 nonzeros; the Enron email graph made public during the federal investigation, 184x184x44 with 9,838 nonzeros.

ASALSAN was written in MATLAB using the Tensor Toolbox; Kiers' algorithm was compiled Pascal code obtained from the author. All tests were performed on a dual 3GHz Pentium Xeon desktop computer with 2GB of RAM. All three algorithms used the same stopping criteria: a tolerance of 10^-5 (World trade) or 10^-7 (Enron) in the change of fit.


New Approach: All-at-once Optimization

Minimization problem: min_x f(x)

• Need a flexible method to handle
  - constraints (e.g., non-negativity)
  - regularization (e.g., sparsity)
  - missing data
  - large-scale data sets
• Want
  - speed comparable to ASALSAN
  - improved accuracy, if possible
• Follow what has been done with CANDECOMP/PARAFAC (see Evrim Acar's presentation)
• Use the Poblano Optimization Toolbox in MATLAB (Dunlavy et al., 2009)
  - Vectorize factor variables and derivatives


Optimization Theory

Minimization problem: min_x f(x)

Gradient:

    \nabla f(x) = \begin{pmatrix} \partial f / \partial x_1 \\ \vdots \\ \partial f / \partial x_n \end{pmatrix}

Hessian:

    \nabla^2 f(x) = \begin{pmatrix} \partial^2 f / \partial x_1^2 & \cdots & \partial^2 f / \partial x_1 \partial x_n \\ \vdots & \ddots & \vdots \\ \partial^2 f / \partial x_n \partial x_1 & \cdots & \partial^2 f / \partial x_n^2 \end{pmatrix}

Taylor series approximation:

    f(x + s) \approx f(x) + \nabla f(x)^T s + \tfrac{1}{2} s^T \nabla^2 f(x) s


Newtonʼs Method

The Taylor series approximation gives the local model

    m(x_k + s) = f(x_k) + \nabla f(x_k)^T s + \tfrac{1}{2} s^T \nabla^2 f(x_k) s

Newton's method: solve for s_k, then step:

    \nabla^2 f(x_k) s_k = -\nabla f(x_k), \qquad x_{k+1} = x_k + s_k

(Figure: contours of f in the (x_1, x_2) plane with the Newton step s_k taken from x_k.)


Quasi-Newton Methods

Approximation to the Hessian: H_k ≈ \nabla^2 f(x_k), giving the new linear system H_k s_k = -\nabla f(x_k).

Secant method: use gradients at past points to approximate the Hessian. With B_k = H_k^{-1}, the step is s_k = -B_k \nabla f(x_k), where B_k solves

    \min_{B_k} \| B_k - B_{k-1} \|_F \quad \text{subject to } B_k = B_k^T, \; B_k y_k = s_{k-1},

with y_k = \nabla f(x_k) - \nabla f(x_{k-1}).

BFGS update:

    B_k = B_{k-1} + \frac{s_{k-1} s_{k-1}^T}{s_{k-1}^T y_k} - \frac{B_{k-1} y_k y_k^T B_{k-1}}{y_k^T B_{k-1} y_k}
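In code, the BFGS update is just two rank-one corrections. A generic textbook sketch in MATLAB (s and y are the most recent difference vectors, B the previous inverse-Hessian approximation, g the current gradient; all names illustrative):

MATLAB:

% One BFGS update of the inverse-Hessian approximation B.
% s = x_k - x_{k-1};  y = grad f(x_k) - grad f(x_{k-1})
By = B*y;
B  = B + (s*s')/(s'*y) - (By*By')/(y'*By);   % uses symmetry of B
step = -B*g;                                 % quasi-Newton step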


Large-scale Methods

Limited-memory BFGS (L-BFGS) keeps only the m most recent update vectors s_k and y_k rather than forming B_k explicitly (see the sketch below).

Nonlinear Conjugate Gradient (NCG): steepest descent first, then move in subsequent conjugate directions.

(Figure: steps -\nabla f(x_k) and -H_k^{-1} \nabla f(x_k) compared in the (x_1, x_2) plane.)

These methods are implemented in the Poblano Optimization Toolbox for MATLAB (Dunlavy et al., 2009).
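For intuition about how L-BFGS avoids storing B_k, here is a generic two-loop recursion sketch (a textbook version, not Poblano's code); S and Ys hold the m most recent s_j and y_j as columns, g is the current gradient:

MATLAB:

% L-BFGS two-loop recursion: direction d ~ -inv(H)*g from m stored pairs.
q = g;
mm = size(S, 2);
alpha = zeros(mm, 1);
rho = 1 ./ sum(S.*Ys, 1)';            % rho_j = 1/(s_j'*y_j)
for j = mm:-1:1
    alpha(j) = rho(j) * (S(:,j)'*q);
    q = q - alpha(j)*Ys(:,j);
end
gamma = (S(:,mm)'*Ys(:,mm)) / (Ys(:,mm)'*Ys(:,mm));
d = gamma*q;                          % scaled initial inverse Hessian
for j = 1:mm
    beta = rho(j) * (Ys(:,j)'*d);
    d = d + (alpha(j) - beta)*S(:,j);
end
d = -d;                               % descent direction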


Derivatives for 2-way DEDICOM

Error: E ≡ Z - A R A^T

Objective function:

    f(A, R) ≡ \tfrac{1}{2} \| E \|_F^2 = \tfrac{1}{2} \| Z - A R A^T \|_F^2

Gradient:

    \partial f / \partial A = -E^T A R - E A R^T
    \partial f / \partial R = -A^T E A

MATLAB:

AR  = A*R;
ARt = A*R';
AtA = A'*A;
% Gradient w.r.t. A: -E'*A*R - E*A*R' with E = Z - A*R*A' expanded
G{1} = -Z'*AR - Z*ARt + ARt*(AtA*R) + AR*(AtA*R');
% Gradient w.r.t. R: -A'*E*A expanded
G{2} = -A'*Z*A + AtA*R*AtA;
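A quick way to validate such hand-derived gradients is a finite-difference check. A small sketch, assuming Z, A, R, and G are as in the snippet above:

MATLAB:

% Finite-difference check of G{1} at one (arbitrary) entry.
f = @(A, R) 0.5*norm(Z - A*R*A', 'fro')^2;
i = 2;  j = 1;  h = 1e-6;
Ah = A;  Ah(i,j) = Ah(i,j) + h;
fd = (f(Ah, R) - f(A, R)) / h;        % numerical partial derivative
fprintf('analytic %g vs. finite difference %g\n', G{1}(i,j), fd);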


Derivatives for 3-way DEDICOM

Error: E_k ≡ Z_k - A D_k R D_k A^T

Objective function:

    f(A, R, D) ≡ \tfrac{1}{2} \sum_{k=1}^K \| E_k \|_F^2 = \tfrac{1}{2} \sum_{k=1}^K \| Z_k - A D_k R D_k A^T \|_F^2

Gradient:

    \partial f / \partial A = -\sum_k \left( E_k^T A D_k R D_k + E_k A D_k R^T D_k \right)
    \partial f / \partial R = -\sum_k D_k A^T E_k A D_k
    \partial f / \partial D_k(j,j) = -\left( (A D_k r_{:,j})^T E_k a_j + a_j^T E_k A D_k r_{j,:}^T \right)


Derivatives for 3-way DEDICOM

MATLAB (D stored as a K x P matrix of diagonals; the sign of E below is flipped relative to the definition above, which absorbs the leading minus signs of the gradient):

G = {zeros(size(A)), zeros(size(R)), zeros(size(D))};
for k = 1:size(D,1)
    DktDk = D(k,:)'*D(k,:);
    Rk  = R.*DktDk;            % Dk*R*Dk via elementwise scaling
    AR  = A*Rk;
    ARt = A*Rk';
    E   = AR*A' - Z(:,:,k);    % model minus data
    G{1} = G{1} + E'*AR + E*ARt;
    G{2} = G{2} + (A'*E*A).*DktDk;
    ADR = A*(diag(D(k,:))*R);
    RDA = (R*diag(D(k,:)))*A';
    for j = 1:size(D,2)
        G{3}(k,j) = ADR(:,j)'*E*A(:,j) + A(:,j)'*E*RDA(j,:)';
    end
end


Results

(Plot: relative residual error, 10^0 down to 10^-12, vs. time in seconds for ASALSAN, NCG, and L-BFGS.)

• Synthetic data, 20x20x9
• Random initialization (5% error)
• No noise in X
• p = 2


Results

(Plot: relative residual error, 10^0 down to 10^-3, vs. time in seconds for ASALSAN, NCG, and L-BFGS.)

• Synthetic data, 20x20x9
• Random initialization (5% error)
• 0.1% noise in X
• p = 2


Results

(Plot: relative residual error, 10^2 down to 10^0, vs. time in seconds for ASALSAN, NCG, and L-BFGS.)

• Synthetic data, 50x50x35
• Random initialization (5% error)
• 0.1% noise in X
• Oblique factors A
• p = 2


Results

(Plot: relative residual error, 0.90 to 0.99, vs. time in seconds for ASALSAN, NCG, and L-BFGS.)

• Enron email data, 184x184x44
• 9,838 nonzeros
• p = 4
• Random initialization


Derivatives for PARAFAC2

Error: E_k ≡ Z_k - A D_k H D_k A^T

Objective function:

    f(A, H, D) ≡ \tfrac{1}{2} \sum_{k=1}^K \| E_k \|_F^2 = \tfrac{1}{2} \sum_{k=1}^K \| Z_k - A D_k H D_k A^T \|_F^2

Gradient (using the symmetry of H and of the slices Z_k):

    \partial f / \partial A = -2 \sum_k E_k^T A D_k H D_k
    \partial f / \partial H = -\sum_k D_k A^T E_k A D_k
    \partial f / \partial D_k(j,j) = -2 (A D_k h_{:,j})^T E_k a_j
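Mirroring the 3-way DEDICOM code earlier, a sketch of these gradients in MATLAB (again with D stored as K x P and the error sign flipped to absorb the minus signs); an illustrative adaptation, not the original code:

MATLAB:

G = {zeros(size(A)), zeros(size(H)), zeros(size(D))};
for k = 1:size(D,1)
    Dk = diag(D(k,:));
    M  = A*(Dk*H*Dk);                 % A*Dk*H*Dk
    E  = M*A' - Z(:,:,k);             % model minus data
    G{1} = G{1} + 2*E'*M;
    G{2} = G{2} + Dk*(A'*E*A)*Dk;
    ADH = A*(Dk*H);                   % column j is A*Dk*h_j
    for j = 1:size(D,2)
        G{3}(k,j) = 2*ADH(:,j)'*E*A(:,j);
    end
end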


Discussion

• All-at-once optimization is competitive but not the best; ASALSAN is fastest on real data.
  - DEDICOM and PARAFAC2 are difficult nonlinear optimization problems with multiple minima
  - ASALSAN includes some second-order information
• The optimization approach is convenient for:
  - Different loss functions (other than least squares)
  - General constraints (e.g., non-negativity)
  - Regularization (e.g., sparsity)
  - Missing data
• Future research will explore these benefits and aim for better performance


Applications

Tensor or N-way array.

DEDICOM:
• Temporal social network analysis (Enron)
• Community finding
• Part-of-speech tagging

PARAFAC2:
• Multilingual document analysis


Using PARAFAC2 for Multilingual Document Clustering

Joint work with Peter Chew


Basics: Graphs and Matrices

Example from (Berry, Drmac, Jessup, 1999).

The d = 5 document titles:
D1: How to Bake Bread Without Recipes
D2: The Classic Art of Viennese Pastry
D3: Numerical Recipes: The Art of Scientific Computing
D4: Breads, Pastries, Pies and Cakes: Quantity Baking Recipes
D5: Pastry: A Book of Best French Recipes

The t = 6 terms:
T1: bak(e,ing)   T2: recipes   T3: bread   T4: cake   T5: pastr(y,ies)   T6: pie

The 6 x 5 term-by-document matrix before normalization, where element â_ij is the number of times term i appears in document title j (this is also the adjacency matrix of the bipartite term-document graph):

    Â = [ 1 0 0 1 0
          1 0 1 1 1
          1 0 0 1 0
          0 0 0 1 0
          0 1 0 1 1
          0 0 0 1 0 ]

The 6 x 5 term-by-document matrix with unit columns:

    A = [ 0.5774 0      0      0.4082 0
          0.5774 0      1.0000 0.4082 0.7071
          0.5774 0      0      0.4082 0
          0      0      0      0.4082 0
          0      1.0000 0      0.4082 0.7071
          0      0      0      0.4082 0 ]

Concepts: bag of words, vector space model, stemming, stoplists, scaling for information content. A sketch of this construction follows.
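As flagged above, a small MATLAB sketch of this construction, with the matrix values taken straight from the slide:

MATLAB:

% Term-by-document counts for the 6 terms x 5 titles above.
Ahat = [1 0 0 1 0;                    % T1: bak(e,ing)
        1 0 1 1 1;                    % T2: recipes
        1 0 0 1 0;                    % T3: bread
        0 0 0 1 0;                    % T4: cake
        0 1 0 1 1;                    % T5: pastr(y,ies)
        0 0 0 1 0];                   % T6: pie
% Normalize columns to unit length, giving the matrix A above.
A = Ahat * diag(1 ./ sqrt(sum(Ahat.^2, 1)));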


Cross-language Information Retrieval (CLIR)

Web documents could be in any language.

(Chart: languages on the web: English, German, Japanese, French, Chinese Simplified, Spanish, Russian, Dutch, Korean, Polish, Portuguese, Chinese Traditional, Swedish, Czech, Norwegian, Italian, Danish, Hungarian, Finnish, Hebrew, Arabic, Turkish, Slovak, Indonesian, Bulgarian, Croatian, Catalan, Slovenian, Greek, Romanian, Serbian, Estonian, Icelandic, Lithuanian, Latvian.)

Goal: Cluster documents by topic regardless of language (here: English, French, Arabic, Spanish).

(P. Chew, B. Bader, A. Abdelali (NMSU), T. Kolda, P. Kegelmeyer)


Bible as a ʻRosetta Stoneʼ

• The Bible has been translated carefully and widely: 451 complete and 2,479 partial translations
• Verse aligned

Sandia's database covers 54 languages (99.76% coverage of the web): Afrikaans, Albanian, Amharic, Arabic, Aramaic, Armenian Eastern, Armenian Western, Basque, Breton, Chamorro, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Finnish, French, German, Greek (New Testament), Greek (Modern), Hebrew (Old Testament), Hebrew (Modern), Hungarian, Indonesian, Italian, Japanese, Korean, Latin, Latvian, Lithuanian, Manx Gaelic, Maori, Norwegian, Persian (Farsi), Polish, Portuguese, Romani, Romanian, Russian, Scots Gaelic, Spanish, Swahili, Swedish, Tagalog, Thai, Turkish, Ukrainian, Vietnamese, Wolof, Xhosa.


Bible as Parallel Corpus

5 languages for training and testing:

  Translation                  Terms    Total Words
  English (King James)         12,335   789,744
  Spanish (Reina Valera 1909)  28,456   704,004
  Russian (Synodal 1876)       47,226   560,524
  Arabic (Smith Van Dyke)      55,300   440,435
  French (Darby)               20,428   812,947

• Languages convey information in a different number of words (isolating language vs. synthetic language)


(Slide: a table illustrating statistical differences between languages, with columns for text, word count, and % of total; the table contents did not survive extraction.)


Language Morphology

  Translation              Terms    Total Words
  English (King James)     12,335   789,744
  Arabic (Smith Van Dyke)  55,300   440,435

Languages convey information in a different number of words.

• Isolating language: one morpheme per word (e.g., Chinese)
  - e.g., "He travelled by hovercraft on the sea." is largely isolating, but travelled and hovercraft each have two morphemes per word (Wikipedia)
• Synthetic language: high morpheme-per-word ratio (e.g., Quechua, Inuit (Eskimo))
  - German: Aufsichtsratsmitgliederversammlung => "on-view-council-with-limbs-gathering", meaning "meeting of members of the supervisory board" (Wikipedia)
  - Chulym: Aalychtypiskem => "I went out moose hunting"
  - Yupʼik Eskimo: tuntusssuatarniksaitengqiggtuq => "He had not yet said again that he was going to hunt reindeer." (Payne, 1997)


Term-Doc Matrix

Term-by-verse matrix for all languages: the English, Spanish, Russian, Arabic, and French vocabularies are stacked in the rows; the columns are Bible verses. Size: 163,745 x 31,230.

Look for co-occurrence of terms in the same verses and across languages to capture latent concepts.

• Approach is not new
  - pairs of languages in LSA suggested by (Berry et al., 1994)
  - the multi-parallel corpus is new


Latent Semantic Indexing

Truncated SVD of the term-by-verse matrix for all languages:

    A_k = U_k \Sigma_k V_k^T = \sum_{i=1}^k \sigma_i u_i v_i^T

Project new documents of interest into the subspace via U \Sigma^{-1}; each document becomes a k-dimensional feature vector (the slide shows an example 20-dimensional vector). Uses:
• cosine similarities for clustering
• machine learning applications
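In MATLAB, the whole pipeline is a truncated sparse SVD plus a projection. A hedged sketch (X is the sparse multilingual term-by-verse matrix; d1 and d2 are term-count column vectors for two new documents; names illustrative):

MATLAB:

k = 240;                              % latent dimensions used on these slides
[U, S, V] = svds(X, k);               % truncated SVD of the sparse matrix
feat1 = (d1'*U) / S;                  % project: d'*U*inv(S), a 1 x k feature vector
feat2 = (d2'*U) / S;
% Cosine similarity between the two projected documents:
cossim = (feat1*feat2') / (norm(feat1)*norm(feat2));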


Quran as Test Set

• The Quran is translated into many languages, just like the Bible
• 114 suras (or chapters)
• More variation across translations => a harder IR task


Performance Metrics

• Precision at 1 document (P1)
  - Equals 100% if the translation of the query is ranked highest, 0% otherwise
  - Calculated as an average over all queries for each language pair, or as a total average
  - Essentially, P1 measures success in retrieving documents when the source and target languages are specified (a schematic computation follows this list)
• Average multilingual precision at 5 (or n) documents (MP5)
  - The average percentage of the top 5 documents that are translations of the query document
  - Calculated as an average over all queries and all languages
  - Essentially, MP5 measures success in multilingual clustering
• MP5 is a stricter measure than P1 because the retrieval task is harder
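For concreteness, a schematic of P1 under the simplifying assumption that document i in the query language translates document i in the target language (sim holds query-by-target cosine similarities; illustrative only, not the authors' evaluation code):

MATLAB:

% P1 for one language pair: percentage of queries whose top-ranked
% target document is the query's own translation.
[~, best] = max(sim, [], 2);          % index of most similar target per query
P1 = 100 * mean(best == (1:size(sim,1))');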


LSA Results (Chew, Bader, Kolda, Abdelali, 2007)

5 languages, 240 latent dimensions:

  Method    Average P1    Average MP5
  SVD/LSA   76.0%         26.1%

Documents tend to cluster more by language than by topic.


LSA Results

5 languages, 240 latent dimensions:

  Method           Average P1    Average MP5
  SVD/LSA (α=1)    76.0%         26.1%
  SVD/LSA (α=1.8)  87.6%         65.5%


New Approach: Multi-matrix Array (Chew, Bader, Kolda, Abdelali, 2007)

One term-by-verse matrix per language: X_1 (English), X_2 (Spanish), X_3 (Russian), X_4 (Arabic), X_5 (French).

Array size: 55,300 x 31,230 x 5 with 2,765,719 nonzeros.


Tucker1

Each language's term-by-verse slice gets its own term factor, with the verse factor shared:

    X_k ≈ U_k S V^T, \quad k = 1, \ldots, K

(Diagram: X_1, X_2, X_3 factored with U_1, U_2, U_3 and a common S V^T.)


Tucker1 Results

5 languages, 240 latent dimensions:

  Method           Average P1    Average MP5
  SVD/LSA (α=1)    76.0%         26.1%
  SVD/LSA          87.6%         65.5%
  Tucker1          89.5%         71.3%

Only minor improvement because each U_k is not orthogonal.


PARAFAC2

    X_k ≈ U_k H S_k V^T   (Harshman, 1972)

where each U_k is orthonormal and S_k is diagonal.


PARAFAC2 Results

5 languages, 240 latent dimensions:

  Method           Average P1    Average MP5
  SVD/LSA (α=1)    76.0%         26.1%
  SVD/LSA          87.6%         65.5%
  Tucker1          89.5%         71.3%
  PARAFAC2         89.8%         78.5%

Modest improvement over LSA.


Bible Clustering

• Books of the Bible color-coded by language
• MP5 about 90%
• Books cluster first with their counterparts in other languages, then in larger clusters by topic
• Visualization software: Tamale 1.2; graph layout: VxOrd (Boyack et al., 2005)

Clusters contain related books; this kind of visualization is possible only when MP5 is satisfactorily high. In summary, these techniques have effectively allowed us to factor language out, focusing only on topic, just as we had hoped.


Graph Layout (close-up)

(Figure 7: partial visualization of multilingual Bible books in vector space.)

• John and Acts have tight clusters
• Some mixing with Matthew, Mark, Luke (the synoptic gospels, which share a similar perspective)


Using DEDICOM for Completely Unsupervised Part-of-Speech Tagging

Joint work with Peter Chew


Approaches to POS tagging (1)

• Supervised
  - Rule-based (e.g., Harris 1962)
    • Dictionary + manually developed rules
    • Brittle: the approach doesn't port to new domains
  - Stochastic (e.g., Stolz et al. 1965, Church 1988)
    • Examples: HMMs, CRFs
    • Relies on estimation of emission and transition probabilities from a tagged training corpus
    • Again, difficulty in porting to new domains


Approaches to POS tagging (2)

• Unsupervised
  - All approaches exploit distributional patterns
  - Singular Value Decomposition (SVD) of the term-adjacency matrix (Schütze 1993, 1995)
  - Graph clustering (Biemann 2006)
  - Our approach: DEDICOM of the term-adjacency matrix
  - Most similar to Schütze (1993, 1995)

Advantages:
• can be reconciled to stochastic approaches
• completely unsupervised, like SVD and graph clustering
• initial results appear promising


DEDICOM: application to POS tagging

(Diagram: term-adjacency matrix ≈ 'A' matrix x 'R' matrix x 'A' transposed.)

• The assumption that terms are a "single set of objects", whether they precede or follow, sets DEDICOM apart from SVD and other unsupervised approaches
• This assumption models the fact that tokens play the same syntactic role whether we view them as the first or second element in a bigram


Comparing DEDICOM output to HMM input

(Diagram: the DEDICOM 'R' matrix corresponds, after normalization of counts, to an HMM transition probability matrix; the 'A' matrix to an emission probability matrix.)

• The output of DEDICOM is essentially a transition and emission probability matrix
• DEDICOM offers the possibility of getting the familiar transition and emission probabilities without tagged training data, as sketched below
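A minimal sketch of one plausible normalization (the validation slides that follow make the row-sum version concrete with counts); the exact conventions are illustrative, not prescribed by the slide:

MATLAB:

% R: k x k tag-adjacency loadings; A: nTerms x k term loadings.
transition = diag(1 ./ sum(R, 2)) * R;      % rows sum to 1: P(next tag | tag)
emission   = A * diag(1 ./ sum(A, 1));      % columns sum to 1: P(term | tag)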


Validation: method 1 (theoretical)

• Hypothetical example: suppose a tagged training corpus exists

Corpus: The man walked the big dog
        DT  NN  VBD    DT  JJ  NN

X: sparse matrix of bigram counts; A*: term-tag counts; R*: tag-adjacency counts.

• By definition (subject to a difference of 1 for the final token):
  - row sums of X = column sums of X = row sums of A*
  - column sums of A* = row sums of R* = column sums of R*


Validation: method 1 (theoretical)

• To turn A* and R* into transition and emission probability matrices, we simply multiply each by a diagonal matrix D whose entries are the inverses of the row-sum vector
• If the DEDICOM model is a good one, we should be able to multiply A*DR*D(A*)^T to approximate the original matrix X
• In our example, A*DR*D(A*)^T (worked through in the sketch below) approximates X, and it also captures some syntactic regularities which aren't instantiated in the corpus (this is one reason HMM-based POS tagging is successful)
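The worked example referenced above, in MATLAB. The counts come straight from the tagged corpus on the previous slide; following the slide's formula, D holds the inverses of the tag totals (the R* row sums):

MATLAB:

% Corpus: The man walked the big dog  (DT NN VBD DT JJ NN)
% Terms: the, man, walked, big, dog.  Tags: DT, NN, VBD, JJ.
Astar = [2 0 0 0;                     % the -> DT (twice)
         0 1 0 0;                     % man -> NN
         0 0 1 0;                     % walked -> VBD
         0 0 0 1;                     % big -> JJ
         0 1 0 0];                    % dog -> NN
Rstar = [0 1 0 1;                     % DT -> NN, DT -> JJ
         0 0 1 0;                     % NN -> VBD
         1 0 0 0;                     % VBD -> DT
         0 1 0 0];                    % JJ -> NN
Dd = diag(1 ./ sum(Rstar, 2));        % inverses of the row sums (tag totals)
Xhat = Astar * Dd * Rstar * Dd * Astar';
% Xhat reproduces every observed bigram (the->man, man->walked, walked->the,
% the->big, big->dog) with weight 1, and adds unseen but grammatical bigrams
% such as the->dog and dog->walked: the "syntactic regularities" noted above.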


Validation: method 2 (empirical)

• Use a tagged corpus (CoNLL 2000)
  - CoNLL 2000 has 19,440 distinct terms
  - There are 44 distinct tags in the tagset (our "gold standard")
• Tabulate the X matrix (solely from bigram frequencies, blind to tags)
• Fit the DEDICOM model to X (using k = 44) to 'learn' emission and transition probability matrices
• Use these as input to an HMM; tag each token with a numerical index (one of the DEDICOM 'dimensions')
• Evaluate by looking at the correlation of the induced tags with the 44 "gold standard" tags in a confusion matrix


Validation: method 2 (empirical)

Assuming the 'gold standard' is the optimal tagging scheme, the ideal confusion matrix would have one DEDICOM class per 'gold standard' tag (either a diagonal matrix or some permutation thereof).

(Figure: confusion matrix of gold-standard tags (rows) vs. DEDICOM tags (columns).)

Confusion matrix: correlation with the 'ideal' diagonal matrix = 0.494.


Validation: method 2 (empirical)

• Examples of DEDICOM dimensions or clusters (labeled "NN", "NN", and "DT" on the slide)

Tag 3: the grouping of 'new' with 'the' and 'a' is explained by distributional similarities (all precede nouns). This is also in accordance with a traditional English grammar, which held that determiners are a type of adjective.

Soft clustering: U.S. appears under tag 2 (nouns) and tag 8 (adjectives).


Future Work in POS Tagging

• Constrained DEDICOM
  - Tags for common terms (e.g., determiners, prepositions)
  - Semi-supervised learning
• Analyze multiple corpora via 3-way DEDICOM
• Constraints on the R matrix in DEDICOM


Discussion and Questions

• Challenges with global optimization
  - Ways to reduce the search space and find a global minimizer
  - A deterministic algorithm that returns a good approximation to the global solution
• Fast algorithms for large-scale problems
• Other data analysis models


Selected Publications

• Bader, Harshman, and Kolda. 2007. Temporal analysis of semantic graphs using ASALSAN. Proceedings of the IEEE International Conference on Data Mining (ICDM), 2007.
• Chew, Bader, Kolda, and Abdelali. 2007. Cross-language information retrieval using PARAFAC2. Proceedings of KDD 2007.
• Chew, Kegelmeyer, Bader, and Abdelali. 2008. The Knowledge of Good and Evil: Multilingual Ideology Classification with PARAFAC2 and Machine Learning. Language Forum (34), 37-52.
• Chew, Bader, and Rozovskaya. 2009. Using DEDICOM for Completely Unsupervised Part-of-Speech Tagging. Proceedings of NAACL-HLT, Workshop on Unsupervised and Minimally Supervised Learning of Lexical Semantics.
• B. W. Bader. 2009. Constrained and Unconstrained Optimization. In Brown, Tauler, Walczak (eds.), Comprehensive Chemometrics, volume 1, pp. 507-545. Oxford: Elsevier.

Brett Bader (bwbader@sandia.gov)
